Repost: Filtering columns out of data using Azure Data Factory

At work, every night we ingest a couple of large comma-separated files (CSVs) that are 45 to 50 GB each. These files are generated externally, so we don’t control what information is in them. For our purposes, we only need about 10 of the 50 or so columns in each file.

In order to reduce the size of the files coming in, we’re using Azure Data Factory to filter out the columns that we don’t want.

Let’s suppose that we have a CSV that looks like this:

Id,Name,Website
1,Apple,apple.com
2,Google,google.com
3,Microsoft,microsoft.com

And what we really need are just the Id and Website columns:

Id,Website
1,apple.com
2,google.com
3,microsoft.com

Azure Data Factory makes this relatively easy.

The first thing we need to do is to create datasets for our input and output files. When we create the datasets, we need to define the structure of each file.

The structure section for our input file will look like this:

"structure": 
[
  { "name": "Id", "type": "int" },
  { "name": "Name", "type": "String" },
  { "name": "Website", "type": "String" }
]

The structure section for our output file will look similar to our input structure, but we’ll omit the “Name” column:

"structure": 
[
  { "name": "Id", "type": "int" },
  { "name": "Website", "type": "String" }
]
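
For context, here’s a minimal sketch of what a full input dataset definition might look like in Data Factory v1. The dataset name, linked service name, folder path, and file name are placeholders for illustration; "external": true marks the dataset as produced outside of Data Factory, which matches our externally generated files:

{
  "name": "InputDataSet",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "MyBlobStorage",
    "external": true,
    "structure": [
      { "name": "Id", "type": "Int32" },
      { "name": "Name", "type": "String" },
      { "name": "Website", "type": "String" }
    ],
    "typeProperties": {
      "folderPath": "incoming/",
      "fileName": "companies.csv",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "firstRowAsHeader": true
      }
    },
    "availability": { "frequency": "Day", "interval": 1 }
  }
}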

The last piece we need for Data Factory to filter out that column is a translator in the Copy activity of our pipeline:

{
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "BlobSink" },
    "translator": {
      "type": "TabularTranslator",
      "columnMappings": "Id: Id, Website: Website"
    }
  },
  "inputs": [ { "name": "InputDataSet" } ],
  "outputs": [ { "name": "OutputDataSet" } ],
  ...
}
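
In case it helps to see where that activity lives, here’s a rough sketch of the surrounding Data Factory v1 pipeline definition. The pipeline name, activity name, and active period are placeholders:

{
  "name": "FilterColumnsPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyAndFilterColumns",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" },
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": "Id: Id, Website: Website"
          }
        },
        "inputs": [ { "name": "InputDataSet" } ],
        "outputs": [ { "name": "OutputDataSet" } ]
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2018-01-01T00:00:00Z"
  }
}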

In the columnMappings property, you specify which input column maps to which output column. In our case, the columns have the same names in both the input and output datasets. The format for the columnMappings value is:

"inputColumnX: outputColumnY, ..."

What’s really nice is that this works on more than just files. In our pipeline, we’re also using it to map the results of an HTTP request to a CSV format.
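
We haven’t shown our actual HTTP setup here, but as a rough sketch under the same v1 schema, the Copy activity just swaps in an HTTP source while keeping the translator (the HTTP dataset name is a placeholder, and assumes a corresponding HTTP-backed input dataset and linked service):

{
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "HttpSource" },
    "sink": { "type": "BlobSink" },
    "translator": {
      "type": "TabularTranslator",
      "columnMappings": "Id: Id, Website: Website"
    }
  },
  "inputs": [ { "name": "HttpInputDataSet" } ],
  "outputs": [ { "name": "OutputDataSet" } ]
}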
