This guide explains how to write a script using Socrata-py, the Data & Insights Python package. It expands on the example code that ships with the package. Installation instructions are in the package's GitHub repository: https://github.com/socrata/socrata-py.
To get the most out of Socrata-py, we recommend that you have some understanding of programming or are confident in your ability to learn. This is a technical part of the platform, and it is easy to get in over your head if you are unprepared.
If you are unfamiliar with Python, there is some wonderful documentation on the basics here and here. If you only want to automate your data import and no other data manipulation is involved, it may be easier to use our Automate This! guide, since the automated system there will generate the Python code for you.
Always Needed
First, import the classes you will need:
# Import the classes used throughout this guide
from socrata.authorization import Authorization
from socrata import Socrata
import os
Next, you need to include your authentication information. One option is to store your username and password as system (environment) variables. Alternatively, if the script is going to be shared, or will run on a machine where you cannot or do not want to set those variables, you can set up API Keys so that the script authenticates as you without exposing your username and password.
With System Variables
# Make an auth object
auth = Authorization(
    "support.demo.socrata.com",
    os.environ['SOCRATA_USERNAME'],
    os.environ['SOCRATA_PASSWORD']
)
With API Key
# Make an auth object
auth = Authorization(
    "support.demo.socrata.com",
    "Your id here",
    "Your secret here"
)
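If a system variable is missing, os.environ raises a bare KeyError, which can be hard to interpret in a scheduled run. The following is a minimal sketch of a helper that reads both values and reports which one is missing. The function name read_credentials and its parameters are hypothetical and are not part of Socrata-py. The same approach could also keep an API Key id and secret out of the script, assuming you store them in variables of your own choosing.

```python
import os


def read_credentials(user_var="SOCRATA_USERNAME", pass_var="SOCRATA_PASSWORD", env=None):
    """Return (username, password) read from environment variables.

    Hypothetical helper, not part of Socrata-py. `env` defaults to
    os.environ and exists only so the function is easy to test.
    """
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names them all
    missing = [name for name in (user_var, pass_var) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env[user_var], env[pass_var]
```

The two returned values can then be passed to Authorization in place of the direct os.environ lookups.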
The last step is not strictly necessary, but it makes your code clearer. We are going to create a client object that represents our authenticated connection to the domain. If you prefer to skip this step, you can write Socrata(auth) anywhere the client object is used.
# Authenticate into the domain
client = Socrata(auth)
Let’s Make a New Dataset
This section will touch on some of the basic functions and variables that will be used and referenced later. It will attempt to explain what these functions and variables are and how to use them.
The first step is to open the file you intend to upload. We will not explain file handling here; for more information on working with files in Python, see https://www.w3schools.com/python/python_file_handling.asp
with open('MOCK_DATA.csv', 'rb') as file:
The next step is to create a new revision and output schema. This call reads from the open file, so it is indented inside the with block:
    (revision, output) = client.create(
        name = "Script Dataset",
        description = "Uploaded With Socrata-py"
    ).csv(file)
In terms already used on the platform, the revision is the draft you would create, and the output is the data as that draft changes it.
The next step is the transform step. Using Socrata-py allows you to utilize the wonderful Transform Library we have to modify the data that is being added to this dataset. If you do not need to make any changes you can skip this step.
output = output\
    .change_column_metadata('first_name', 'display_name').to('First Name')\
    .change_column_metadata('first_name', 'description').to("This column shows the person's first name")\
    .drop_column('gender')\
    .add_column('full_name', 'Full Name', 'first_name || \' \' || last_name', 'this is the concatenated first and last name')\
    .run()
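Transform expressions and descriptions are passed as ordinary Python strings, so quoting mistakes show up as Python syntax errors before anything reaches the platform. The sketch below uses only plain Python (no Socrata-py calls) to show that the backslash-escaped expression and a double-quoted spelling are the same string, and that double quotes are an easy way to hold text containing an apostrophe. The variable names are illustrative only.

```python
# Two spellings of the same transform expression string
escaped = 'first_name || \' \' || last_name'
double_quoted = "first_name || ' ' || last_name"

# Both literals produce exactly the same characters
assert escaped == double_quoted

# Text that itself contains an apostrophe is simplest inside double quotes
description = "This column shows the person's first name"

print(double_quoted)
print(description)
```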
After we run our output, we wait for it to finish so that we can apply our revision. Waiting also lets us confirm that the transform step produced no errors.
Important Note: Even if you skipped the transform step, wait_for_finish may still be necessary, depending on the size of your dataset.
# Validation of the results step
output = output.wait_for_finish()
Error checking
# check for errors
assert output.attributes['error_count'] == 0
# If you want, you can get a csv stream of all the errors
errors = output.schema_errors_csv()
for line in errors.iter_lines():
    print(line)
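A bare assert stops the script before the error rows are printed, and assert statements are skipped entirely when Python runs with the -O flag. The following is a minimal sketch of an alternative that raises an exception carrying the first few error rows. The helper name ensure_no_schema_errors and the max_lines parameter are hypothetical; the only Socrata-py calls used are the ones shown in this section (attributes['error_count'] and schema_errors_csv().iter_lines()).

```python
def ensure_no_schema_errors(output, max_lines=20):
    """Raise RuntimeError if the finished output schema reports errors.

    Hypothetical helper. `output` is assumed to be the object returned
    by output.wait_for_finish() earlier in this guide.
    """
    error_count = output.attributes['error_count']
    if error_count == 0:
        return

    # Collect a bounded number of error rows so the message stays readable
    lines = []
    for index, line in enumerate(output.schema_errors_csv().iter_lines()):
        if index >= max_lines:
            break
        if isinstance(line, bytes):
            line = line.decode('utf-8', errors='replace')
        lines.append(line)

    raise RuntimeError(
        "Output schema has {} error(s):\n{}".format(error_count, "\n".join(lines))
    )
```

Calling ensure_no_schema_errors(output) right after wait_for_finish would stop the script with a readable message instead of an AssertionError.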
Finally, we apply our revision. This step is the same as publishing your draft.
job = revision.apply(output_schema = output)
If you want to check the draft before it is actually published you can have the script open the revision. This step is not recommended if you are going to have your script run automatically but for testing and debugging this can be a good thing to add.
revision.open_in_browser()
We also want to wait for the revision to finish applying, so that any errors that occur are reported:
job.wait_for_finish()
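Putting the steps of this section together, with the indentation the with block requires, might look like the sketch below. The function name create_dataset_from_csv and its parameters are hypothetical. The client argument is assumed to be the Socrata(auth) object built earlier, and every call on it is one already shown in this guide. The optional transform and open_in_browser steps are left out.

```python
def create_dataset_from_csv(client, csv_path, name, description):
    """Create a new dataset from a CSV file and publish it.

    Hypothetical wrapper around the steps in this section.
    `client` is assumed to be Socrata(auth).
    """
    with open(csv_path, 'rb') as file:
        # The upload call reads from the file, so it sits inside the with block
        (revision, output) = client.create(
            name=name,
            description=description
        ).csv(file)

    # Wait for the output schema to finish, then check for errors
    output = output.wait_for_finish()
    if output.attributes['error_count'] != 0:
        raise RuntimeError("Output schema finished with errors")

    # Apply the revision (publish the draft) and wait for the job to finish
    job = revision.apply(output_schema=output)
    return job.wait_for_finish()
```

A call could look like create_dataset_from_csv(client, 'MOCK_DATA.csv', 'Script Dataset', 'Uploaded With Socrata-py').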
Congratulations! You have successfully written a Socrata-py Python script.
Updating an existing Dataset
To update an existing dataset you are going to want to create a new script file.
The script begins with the code from the Always Needed section and proceeds from there.
The first step is to open the file you intend to upload. We will not explain file handling here; for more information on working with files in Python, see https://www.w3schools.com/python/python_file_handling.asp. Any later line that uses file must be indented inside this with block.
with open('MOCK_DATA.csv', 'rb') as file:
The next step is to find the dataset that you want to update. You will need the UID of that dataset, which can be found at the end of the URL used to access it. For example, in the URL https://support.demo.socrata.com/dataset/Combo-Chart-Sample-Data/nwie-4x78, nwie-4x78 is the UID of the dataset.
uid = "nwie-4x78"
view = client.views.lookup(uid)
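If your script receives dataset URLs rather than UIDs, the UID can be pulled from the end of the URL. This is a minimal sketch using only the standard library. The function name extract_uid is hypothetical, and the pattern (four lower-case letters or digits, a hyphen, four more) is an assumption based on the example UID above.

```python
import re

# Assumed UID shape, inferred from the example "nwie-4x78"
UID_PATTERN = re.compile(r'[a-z0-9]{4}-[a-z0-9]{4}')


def extract_uid(url):
    """Return the UID found in the last path segment of a dataset URL."""
    # Drop any query string and trailing slash, then keep the last path segment
    last_segment = url.split('?', 1)[0].rstrip('/').rsplit('/', 1)[-1]
    if not UID_PATTERN.fullmatch(last_segment):
        raise ValueError("No dataset UID found at the end of: " + url)
    return last_segment


uid = extract_uid(
    "https://support.demo.socrata.com/dataset/Combo-Chart-Sample-Data/nwie-4x78"
)
print(uid)
```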
Now we are going to create our revision. The revision type determines how the data is added to the dataset. There are three types, and which one you choose depends on the kind of update you intend to make.
Replace -
revision = view.revisions.create_replace_revision()
This revision type will replace the existing data with the new data in your revision.
Update -
revision = view.revisions.create_update_revision()
This revision type will append data for datasets without a row id or will upsert for datasets with a row id.
Delete -
revision = view.revisions.create_delete_revision()
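This revision type will delete rows of data from the dataset, not the dataset itself. The dataset must have a row id set so that the rows to remove can be matched.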
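The difference between replacing, appending, and upserting can be illustrated with plain Python lists. This is only a model of the behavior described above for the replace and update revision types, not how the platform implements them. The function names and the sample rows are made up for illustration.

```python
def apply_replace(existing, incoming):
    """Replace: the new data takes the place of the existing data."""
    return list(incoming)


def apply_update(existing, incoming, row_id=None):
    """Update: append when there is no row id, upsert when there is one."""
    if row_id is None:
        # No row id: every incoming row is added, even if it looks like a duplicate
        return existing + incoming
    # Row id set: incoming rows overwrite matching rows and add the rest
    merged = {row[row_id]: row for row in existing}
    for row in incoming:
        merged[row[row_id]] = row
    return list(merged.values())


existing = [{'id': 1, 'name': 'Ann'}, {'id': 2, 'name': 'Bob'}]
incoming = [{'id': 2, 'name': 'Robert'}, {'id': 3, 'name': 'Cy'}]

replaced = apply_replace(existing, incoming)
appended = apply_update(existing, incoming)
upserted = apply_update(existing, incoming, row_id='id')

print(len(replaced), len(appended), len(upserted))
```

With these sample rows, appending keeps both versions of row 2, while upserting keeps only the incoming version.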
We are going to use the replace revision for the rest of this guide.
Now that we have a revision, we need to start the upload process, beginning with an upload object. Note that the name passed to create_upload, 'MOCK_DATA.csv' in this example, is arbitrary. It is only the name of the upload object and does not need to match the name of your file, although matching it is helpful.
upload = revision.create_upload('MOCK_DATA.csv')
The next step is to create your source object by sending data to the upload. There are several methods for this, depending on what you wish to upload. You can find a complete list on the main Socrata-py documentation site under the Source Object.
The main ones are:
CSV -
source = upload.csv(file)
DataFrame (this method takes a pandas DataFrame rather than an open file) -
source = upload.df(dataframe)
And XLSX -
source = upload.xlsx(file)
We are going to continue using the CSV method.
After you create your source object you are going to need to get your output object. You are going to do that using the Source Object to get the Input schema, then use the Input Schema to get the Output Schema.
input_schema = source.get_latest_input_schema()
output = input_schema.get_latest_output_schema()
The next step, the transform step, is optional. You can follow the link to access our Transform Library. This step lets you modify the transforms that exist on this dataset. If your dataset already has transforms on its columns, you do not need to repeat this step unless you intend to modify them.
output = output\
    .change_column_metadata('first_name', 'display_name').to('First Name')\
    .change_column_metadata('first_name', 'description').to("This column shows the person's first name")\
    .drop_column('gender')\
    .add_column('full_name', 'Full Name', 'first_name || \' \' || last_name', 'this is the concatenated first and last name')\
    .run()
The next step is to wait for the output to finish validating. This is necessary even if you did not add any transforms to your dataset. There will still be a created output schema that needs to be validated.
output = output.wait_for_finish()
You can use the same method as in the import section to check for schema errors:
# check for errors
assert output.attributes['error_count'] == 0
# If you want, you can get a csv stream of all the errors
errors = output.schema_errors_csv()
for line in errors.iter_lines():
    print(line)
Finally, we are going to apply our revision:
job = revision.apply()
Here you do not need to set the output schema as all of the output schema information was created from our revision.
Optionally, you can wait for the job to finish:
job = job.wait_for_finish()
This step is helpful because any errors that occur while the job runs will surface here, so you can use it to alert yourself when something goes wrong.
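The update steps can be collected into one function so that the file is only used while it is open. This is a sketch under stated assumptions: the function name replace_dataset_from_csv and its parameters are hypothetical, client is assumed to be the Socrata(auth) object from the Always Needed section, and every call on it is one already shown in this guide. It uses the replace revision and the CSV method and leaves out the optional transform step.

```python
import os


def replace_dataset_from_csv(client, uid, csv_path):
    """Replace the data in an existing dataset with the contents of a CSV file.

    Hypothetical wrapper around the steps in this section.
    `client` is assumed to be Socrata(auth).
    """
    # Find the dataset and start a replace revision on it
    view = client.views.lookup(uid)
    revision = view.revisions.create_replace_revision()

    # The upload name is arbitrary; the file name is used here for convenience
    upload = revision.create_upload(os.path.basename(csv_path))

    with open(csv_path, 'rb') as file:
        # The csv call reads from the file, so it sits inside the with block
        source = upload.csv(file)

    # Source -> input schema -> output schema, then wait for validation
    input_schema = source.get_latest_input_schema()
    output = input_schema.get_latest_output_schema()
    output = output.wait_for_finish()
    if output.attributes['error_count'] != 0:
        raise RuntimeError("Output schema finished with errors")

    # Apply the revision and wait so that job errors surface in this call
    job = revision.apply()
    return job.wait_for_finish()
```

A scheduled script could call replace_dataset_from_csv(client, 'nwie-4x78', 'MOCK_DATA.csv') and treat any raised exception as a failed run.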
Congratulations, you have successfully updated a dataset using Socrata-py.