Transforming and Validating Data in Socrata Before Publishing

The new Socrata ingress experience aims to help you quickly validate, clean, and transform your data before loading it into a dataset that can be visualized and queried. These transformations might include geocoding, reprojecting from state plane to WGS84, special date parsing, parsing text into another type, and more. The ingress experience allows you to run custom validations and transformations using a familiar SoQL grammar, but with functions geared towards transforming and validating rather than querying and aggregating.

In the ingress experience, all data type changes and geocoding are expressed as SoQL data transformations. For example, when you choose to change a text column to a number using the type selection drop-down, that action generates a SoQL data transformation. Recently, we added functionality in the user interface to allow you to access and modify this underlying data transformation. This is what we’ll be walking through.
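To make the idea concrete, here is a minimal Python sketch (not SoQL, and not Socrata's implementation) of what a text-to-number type change amounts to: a function applied to every cell, where a cell that cannot be converted becomes an error instead of a value. The names `CellError` and `text_to_number` and the sample column are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CellError:
    """Hypothetical stand-in for a cell that failed to transform."""
    reason: str


def text_to_number(cell: str) -> Union[float, CellError]:
    """Convert one text cell to a number, or report why it could not be converted."""
    try:
        return float(cell.strip())
    except ValueError:
        return CellError(f"cannot convert {cell!r} to a number")


# Apply the transform to every cell of a made-up text column.
source_column = ["12", " 3.5 ", "n/a"]
transformed = [text_to_number(cell) for cell in source_column]
```

The point of the sketch is only that a type change is a per-cell expression whose failures are tracked as errors rather than silently dropped.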

Adding and Transforming Data

To get started, add a data file in the new data ingress experience by clicking the “Add Data” button. You can use a local upload of a data file or an external link. Next, find the column you want to transform or validate, and from the column options drop-down, select “Data Transforms”.

This will open the Data Transform editor. This editor is a code editor which accepts a SoQL expression. Initially, the editor contains the current expression that is used to produce column data. By default, it will just be a reference to the column in your original CSV.

As you type, the editor recompiles your expression in the context of the current dataset and reports any errors. Common errors include simple typos such as unmatched parentheses, or references to columns that don’t exist in your dataset. The editor also catches type errors, such as trying to use the multiplication operator on two strings.

The following image shows the compilation failure after I typed a column name incorrectly. The red highlighted line in the editor shows which line the error occurred on and the red outlined box under the editor shows the reason.


The autocomplete function allows you to explore input columns or functions that you might want to use. If you start typing, the autocomplete window will show partial matches. As you use the up and down arrows to select a function, documentation for how to use that function will be shown in the blue outlined window at the bottom. You can also open the autocomplete window manually by hitting ctrl+space.

In the following picture, I used ctrl+space to bring up the autocomplete window and then hit the down arrow to explore functions. The to_fixed_timestamp function docs are shown, with a type signature and examples on how it can be used.

Building an Expression and Validating Data

Now that we have an overview of what the major pieces are, let’s begin building an expression! We can see in the column to the right that we’re working on a column called `dept` with several different values. One thing we might want to do is to validate that our dataset will only have a certain set of values. Let’s pretend (for the sake of this example) that our city has the following departments:

'SPU', 'SPD', 'SDOT', 'DPR', 'HSD'

If there is another value here, such as ‘SPX’, we know that’s an invalid value in our source data. We want to write an expression to ensure that only that set of values appears. Furthermore, we’d really like to pluck all the rows containing invalid values out of our source file and send them back to the data owner so we can figure out what the issue is.

We can use the “case” function to accomplish this. Once we’ve written our expression in the editor, and the “Query compilation successful” message shows up, we can click the “Run” button to run our expression.

The following image shows what an expression to accomplish that might look like.

Hint: Highlight a token (word) in the expression editor to bring up documentation. In the above image, I’ve highlighted the “case” token, and the “case” documentation has been displayed.
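The logic that such an expression needs to encode can be sketched in Python. This is an illustration of the check itself, not the SoQL expression from the editor: keep the value when it is one of the allowed departments, otherwise produce an error for that cell. The function name `validate_dept` and the error message wording are assumptions made for this example.

```python
from typing import Optional, Tuple

# The allowed departments from the example above.
ALLOWED_DEPARTMENTS = {"SPU", "SPD", "SDOT", "DPR", "HSD"}


def validate_dept(dept: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (value, error): the value if it is allowed, otherwise an error message."""
    if dept in ALLOWED_DEPARTMENTS:
        return dept, None
    return None, f"Invalid department: {dept}"


# Run the check over a few made-up cells and count the failures.
sample = ["SPU", "SPX", "HSD"]
results = [validate_dept(d) for d in sample]
error_count = sum(1 for _, err in results if err is not None)
```

In this sample, only `SPX` fails the check, so it is the one cell that would be counted as an error.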

After the expression has run, we see a count of errors. It looks like this column has 183 invalid department values! We can filter this column down to the errors just as we would in the Data Table Preview view, by clicking on the red error count at the top of the column.

Once we click save at the bottom of the screen, we will be redirected to the full data table view. When we publish this dataset, any row with an error will be omitted from our published dataset.

We can use the “Export Errors” button at the bottom of the screen to download a CSV of all the rows which had errors, along with the reason for the error.
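As a rough mental model of what happens to rows with errors, the Python sketch below splits a few made-up rows into those that would be published and those that would be reported back, then writes the rejected rows with their reasons as CSV text. The `id` column, the `error` column name, and the message wording are assumptions for illustration; the layout of the actual exported file may differ.

```python
import csv
import io

ALLOWED = {"SPU", "SPD", "SDOT", "DPR", "HSD"}

# Made-up source rows; "id" is a hypothetical column used only for this example.
rows = [
    {"id": "1", "dept": "SPU"},
    {"id": "2", "dept": "SPX"},
    {"id": "3", "dept": "DPR"},
]

published, errors = [], []
for row in rows:
    if row["dept"] in ALLOWED:
        published.append(row)  # valid rows go into the dataset
    else:
        # invalid rows are held back, together with the reason
        errors.append({**row, "error": f"Invalid department: {row['dept']}"})

# Write the rejected rows as CSV text, one line per row plus a header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "dept", "error"])
writer.writeheader()
writer.writerows(errors)
errors_csv = buffer.getvalue()
```

A file like this is what you would send back to the data owner so the invalid values can be corrected at the source.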

When you add more data to your dataset (either updating or replacing) through the UI or using a script generated by the “Automate This” button at the bottom of the screen, the data transformation you just created will be automatically applied to your incoming data.

It’s important to note that if you use the legacy Publisher API, data transformations will not take place. Because it is an asynchronous, transactional process, you will likely want to use the legacy API if your changes to the data are frequent, well-formed, and don’t need to be transformed.

You can see other articles on specific transformations here:

Happy publishing!
