U-SQL – Extracting Information From a Column Containing JSON

There are plenty of examples on how to parse JSON files using U-SQL, but what if one of the columns of a TSV contains JSON like this example:

ID  Name  Vehicle 
1   Joe   "{""Year"":""2017"",""Make"":""Ford"",""Model"":""Fusion""}"
2   Bob   "{""Year"":""2016"",""Make"":""Kia"",""Model"":""Sorento""}"

We can adapt an example of Using the JSON User Defined Functions from the Microsoft.Analytics.Formatting.Samples.Formats library in the U-SQL Examples repository on GitHub.

Once we’ve built and registered the assemblies, the script is rather straightforward.

First up, we need to reference the assemblies we just registered.

REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; 

Next, we’ll extract the raw data

@rawData =
    EXTRACT Id int,
            Name string,
            JsonString string
    FROM "/file_with_json_column.tsv"
    USING Extractors.Tsv(skipFirstNRows:1);

In this next bit, we’ll use the JsonTuple function from the Microsoft.Analytics.Samples.Formats library convert the JSON string into a dictionary. Once we have the dictionary, we’ll pull values out of it.

@rawDataWithTuple =
    SELECT Id,
           Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(JsonString) AS JsonTuple
    FROM @rawData;

@expandedData =
    SELECT Id,
           JsonTuple["Year"] AS VehicleYear,
           JsonTuple["Make"] AS VehicleMake,
           JsonTuple["Model"] AS VehicleModel
    FROM @rawDataWithTuple;

Finally, we’ll output the expanded data so we can check our work.

OUTPUT @expandedData
TO "/expandedData.tsv"
USING Outputters.Tsv(outputHeader : true);

The full script and a sample file can be found here.

Posted in Azure, Big Data, JSON, U-SQL | 2 Comments

Using Azure Functions to Replace Cloud Workers

When looking at options for replacing Worker Roles in Azure Cloud Services, more and more, Azure Functions look to fit the bill.

I like the idea of just writing code that focuses on what needs to be done and not having to write plumbing code like polling a queue, handling parallel execution or dealing with poison messages.

With Azure Functions, you can trigger your code using blobs, event hubs, queues, timers and web hooks.

Now that Azure Functions can run on Linux and in a Docker container, we can deploy it however we want.

The idea is this:
– Use Azure Functions as the base image of my workers
– Each worker will handle one message type
– Create a Docker image for each worker
– Deploy them to a Kubernetes cluster so they can scale independently

To explore this idea, we’re going to build a small proof of concept.

A text file will get dropped in blob storage container, which will trigger a function that will verify that it has contents.

If it does, another function will read the file and split it by line and save each line into separate files.

High Level Process

Once the split is done, work will queue up for the another function that will uppercase the contents of each new file.

High Level Process Part 2

All of the functions described so far will not enqueue work for other functions. That will be left to yet another function that will handle scheduling of work.

The overall architecture should look something like this:

High Level Architecture

I should mention that while I’ve used some of these technologies before, some of this is still new to me.

This feels like it should all just work together nicely, but I haven’t really dug in to it yet.

Either way, this will be a bit of fun in figuring out if this is viable or it crashes and burns 🙂

In the next post we’ll take a crack at designing/implementing the first function, the scheduler.

Posted in Azure, Azure Functions | Leave a comment

Repost: Filtering columns out of data using Azure Data Factory

At work, every night we ingest a couple of large comma separated files (CSVs) that are 45 to 50 GB each. These files are generated externally, so we don’t control what information is in them. For our purposes, we really only need maybe 10 of the 50 or so columns in each file.

In order to reduce the size of the files coming in, we’re using Azure Data Factory to filter out the columns that we don’t want.

Let’s suppose that we have a CSV that looks like this:

Id Name Website
1 Apple apple.com
2 Google google.com
3 Microsoft microsoft.com

And what we really need are just the Id and Website columns:

Id Website
1 apple.com
2 google.com
3 microsoft.com

Azure Data Factory makes this relatively easy.

The first thing we need to do is to create datasets for our input and output files. When we create the datasets, we need to define the structure of each file.

The structure section for our input file will look like this:

  { "name": "Id", "type": "int" },
  { "name": "Name", "type": "String" },
  { "name": "Website", "type": "String" }

The structure section for our output file will look similar to our input structure, but we’ll omit the “Name” column

  { "name": "Id", "type": "int" },
  { "name": "Website", "type": "String" }

The last part we need to have Data Factory filter out that column is to use a translator in the Copy activity of our pipeline.

  "type": "Copy",
  "typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "BlobSink" },
    "translator": {
      "type": "TabularTranslator",
      "columnMappings": "Id: Id, WebSite: WebSite"
  "inputs": [ { "name": "InputDataSet" } ],
  "outputs": [ { "name": "OutputDataSet" } ]

In the columnMappings property you set which input column maps to which output column. In our case, they are named the same in both the input and output datasets. The format for the columnMappings value is

"inputcolumnX: outputColumnY, ..."

What’s really nice about this is that you can use this on more than just files. In our pipeline, we’re also using this to map results of an HTTP request to a CSV format.

Posted in Data Factory | Leave a comment

Whoops, starting over (or why you should really keep your contact info up to date)

Last year, I started blogging again. Nothing major, just tidbits from things I struggled with at work or found interesting.

Back in September, my hosting at dreamhost expired, they sent out warnings and I never got them.

It’s not their fault. I’m the idiot that stopped checking the email account I had listed as my primary email.

That was back in September, it’s December now. I’ve gone back in and re-enabled my hosting, but all of my previous content is gone.

Oh well.

The moral of this story is to make sure that your contact information is up to date, especially where it counts.

Leave a comment