First of all, what is DataSync?
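DataSync is Socrata’s free, simple, and powerful publishing tool that allows users to schedule and automate their data updates and upload large data files. This article covers the prerequisites for running DataSync, the preferences you can set, and the steps for creating and saving a job.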
Do you have Java version 7 or higher?
The first step is to ensure that the machine you plan to use DataSync on is running Java version 7 or higher. Java version 8 is recommended to maintain TLS compliance. Java is a programming language and runtime that many programs, including DataSync, need in order to run. To find and download the latest version of Java, navigate to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Do you have the latest version of DataSync?
Next, you can download the latest version of DataSync by navigating to Socrata’s GitHub page at github.com/socrata/datasync/releases. When your download is complete, you may need to approve this download in your computer’s security settings.
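If you would rather check the Java version from a script than by eye, the sketch below shows one way to do it. It assumes the `java` command is on your PATH and that its version output contains a quoted version string such as "1.8.0_281" or "11.0.2"; the function names are ours for illustration and are not part of DataSync.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_281).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` and return the major version it reports."""
    # Most JVMs print the version banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr or result.stdout)
```

Calling `installed_java_major()` on the machine that will run DataSync and comparing the result against 7 (or 8, if you want the recommended version) gives you a quick pre-flight check.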
When you open the user interface of DataSync 1.8.0 it will look like this:
Do you have the correct permissions?
Make sure you, or the users you’re setting up to use DataSync, have the correct permissions in order to make updates to published datasets and publish datasets. The user needs to be:
- Shared Owner (dataset level)
Firewall or Proxy server?
Does your organization’s network have a Firewall or Proxy Server set up? If so, you will need to adjust DataSync’s preferences to account for this. If you’re not sure, check with your IT department.
Setting up and Saving DataSync Preferences
Once you’ve opened up the user interface for DataSync, you can click on “File” --> “Preferences” to set and save the preferences that make sure your update jobs run successfully and that you stay up to date on your scheduled jobs.
File Chunking Settings
The automatic limit on uploads to the Socrata platform is 250 MB, so DataSync will automatically break larger files into smaller chunks and upload them piece by piece. Here you can set your preferences for how many chunks to use and how large each chunk should be.
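To get a feel for what chunking means in practice, here is a minimal sketch that splits a file size into chunk sizes. It only illustrates the arithmetic; it is not DataSync's own implementation, and the chunk size used in the example is an arbitrary value chosen for illustration rather than a DataSync default.

```python
MB = 1024 * 1024


def plan_chunks(file_size_bytes: int, chunk_size_bytes: int) -> list:
    """Return the size of each chunk needed to cover a file of the given size."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk size must be positive")
    sizes = []
    remaining = file_size_bytes
    while remaining > 0:
        # Every chunk is full-sized except possibly the last one.
        size = min(chunk_size_bytes, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


# A hypothetical 25 MB file split into 10 MB chunks needs three uploads.
chunks = plan_chunks(25 * MB, 10 * MB)
```

A smaller chunk size means more uploads of less data each, which can help on an unreliable connection.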
Logging and Auto-Email Settings
Some DataSync users choose to set up a DataSync log dataset, which tracks all DataSync uploads, any errors that may have occurred, and how many rows were updated. This is particularly useful for users who automatically run many DataSync jobs, because you can check for errors and confirm that your updates completed.
You can also opt to have emails sent when your jobs fail. That way you can keep up to date on all of your regularly scheduled jobs.
To learn how to set these sections up, please check out our developer article.
SMTP Settings and Proxy Settings
If you are uploading your data from behind a firewall, you will use the section for SMTP Settings, and if you are working from behind a proxy server, you will need to use the section for Proxy Settings. If you are unsure about whether your organization’s network includes a Firewall or Proxy Server, check with your IT department.
If you would like to learn more about setting up the Firewall and Proxy Settings section, please see our article.
Authentication Details Section
The first section you will need to take a look at is the authentication details at the bottom of the window. Here you should type in your Socrata domain (including ‘https://’), and the username and password you use to log in to your domain. As mentioned above, please note that you will need to be at least a publisher or admin on your Socrata domain OR an owner of the dataset you wish to update. Lastly, you will need to enter your app token. An app token is a user-specific code that allows users to reliably and securely use DataSync. This article will explain how to generate an app token to use with DataSync.
Job Details Section
The second section you will need to fill out is the Job Details section.
Step 1 - Select file to publish
Here you select the file you want to sync to your Socrata dataset. This file will need to be in CSV or TSV format.
Step 2 - Enter Dataset ID to update
To find your dataset’s ID or ‘4 by 4’, take a look at the URL of your dataset and you’ll see an 8-character code split into two groups of four by a dash (for example: 'a1b2-c3d4'). You can just copy and paste this into the DataSync field.
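If you need to pull the ID out of many URLs, a short script can do the copying for you. The sketch below assumes the ID appears as its own segment of the URL path and is made up of lowercase letters and digits; the domain and path in the example are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Two groups of four characters joined by a dash, for example a1b2-c3d4.
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def find_dataset_id(url: str) -> Optional[str]:
    """Return the last path segment that looks like a '4 by 4', or None."""
    segments = urlparse(url).path.split("/")
    for segment in reversed(segments):
        if FOUR_BY_FOUR.fullmatch(segment):
            return segment
    return None


dataset_id = find_dataset_id(
    "https://data.example.gov/Transportation/Road-Closures/a1b2-c3d4"
)
```

Matching whole path segments, rather than searching anywhere in the URL, keeps dataset names that happen to contain a dash from being mistaken for an ID.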
Step 3 - Select Update Method
There are four options for the update method: Replace, Upsert, Append, and Delete.
- Replace: Replaces the dataset with the data in the CSV/TSV file.
- Upsert: Updates any rows that already exist and inserts rows which do not. Ideal if you have a dataset that requires very frequent updates or in cases where doing a complete replace may take too long because of the size of the dataset.
- Append: Adds the data from your file to the bottom of the dataset. Note: While the append job type is still available, it is now considered a subset of the Upsert function. Append jobs will appear as Upserts in the Data Jobs page.
- Delete: Deletes all rows matching the Row Identifiers given in the CSV/TSV file. The CSV/TSV should only contain a single column listing the Row Identifiers to delete.
In order to use the Upsert and Delete functions, you will need to have Row Identifiers set. To learn more about Row Identifiers, check out our documentation.
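To see how the update methods differ, it can help to model a dataset as a dictionary keyed by Row Identifier. The sketch below is a simplified illustration of the behavior described above, not how the Socrata platform implements it; the function names and the `id` column are hypothetical. Append is left out because a keyed model has no notion of "the bottom of the dataset".

```python
def replace(rows: list, key: str) -> dict:
    """Replace: the dataset becomes exactly the rows in the file."""
    return {row[key]: row for row in rows}


def upsert(dataset: dict, rows: list, key: str) -> dict:
    """Upsert: update rows whose identifier exists, insert the rest."""
    updated = dict(dataset)
    for row in rows:
        updated[row[key]] = row
    return updated


def delete(dataset: dict, row_ids: list) -> dict:
    """Delete: drop every row whose identifier is listed in the file."""
    doomed = set(row_ids)
    return {rid: row for rid, row in dataset.items() if rid not in doomed}


existing = replace([{"id": "1", "name": "a"}, {"id": "2", "name": "b"}], "id")
after_upsert = upsert(existing, [{"id": "2", "name": "B"}, {"id": "3", "name": "c"}], "id")
after_delete = delete(after_upsert, ["1"])
```

Notice that both `upsert` and `delete` look rows up by their identifier, which is why those two methods only work once a Row Identifier is set.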
Step 4 - Tell us how to import your CSV (or TSV!)
You will need to use the “Map Fields” section if:
- the headers of your data file do not already match the API field names of your existing dataset
- you need to ignore any columns in your data file that do not exist in your dataset
- you need to adjust any of the other Advanced Settings for importing your file
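Before opening “Map Fields”, you can check whether your file's headers already line up with your dataset. The sketch below compares the first row of a CSV against a list of API field names that you supply; the sample headers and field names are hypothetical, and the comparison is an exact, case-sensitive match.

```python
import csv
import io


def compare_headers(csv_text: str, dataset_fields: list) -> tuple:
    """Return (headers not in the dataset, dataset fields not in the file)."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])
    unmatched = [h for h in headers if h not in dataset_fields]
    missing = [f for f in dataset_fields if f not in headers]
    return unmatched, missing


sample = "id,Name,extra\n1,a,x\n"
unmatched, missing = compare_headers(sample, ["id", "name"])
```

If either list comes back non-empty, that is a sign you will need to map or ignore columns before running the job.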
Step 5 - Copy command for later (optional)
Using this option you can copy and save a command to run from your command line, or save to schedule future jobs. This comes in handy if you’re going to use the job again in the future.
You’ll need to save the job you just created in order to generate this command.
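Once you have copied the command, one way to reuse it is from a small script that a scheduler can call. The sketch below takes the copied command as a string, splits it into arguments, and runs it. The command shown is a made-up placeholder, so paste in the exact text DataSync gives you; the function names are ours for illustration.

```python
import shlex
import subprocess


def split_copied_command(command: str) -> list:
    """Split a copied command line into arguments, respecting quotes."""
    return shlex.split(command)


def run_copied_command(command: str) -> int:
    """Run the copied command and return its exit code."""
    completed = subprocess.run(split_copied_command(command))
    return completed.returncode


# Placeholder only: replace this with the command copied from DataSync.
copied = 'java -jar "C:/Tools/DataSync.jar" "my job.sij"'
args = split_copied_command(copied)
```

Splitting with `shlex` keeps paths that contain spaces together as a single argument, which matters when your job file lives in a folder such as “My Documents”.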
Save Your Job
You’ll see one option to “Run Job Now” and one option to “Save Job”. The first will run the job without saving and the second will allow you to save a .SIJ (Socrata Integration Job) file for later use.
Check out our existing documentation to learn more about saving jobs for later!
Video: Set up a DataSync Job