Miller as a preprocessor for Datamash

Menu

One of the features Datamash provides is crosstab, which builds a Pivot table, similar to the ones you can build with Excel. This is more convenient than the pivot features provided by Miller.

However, Datamash is very strict about the format it reads, and it does not manage double quotes to escape comma in fields; since I often want to crosstab fields with commas, when I wanted to use this feature I had to use sed to preprocess the input file.

However, I recently realized that I can use miller as a preprocessor to clean the input data. The idea is the following:

sed might still be necessary, according to your locale, to have datamash perform computations on numbers with decimals. The syntax, however, is simplified.

Example

The input file has fields protected by double quotes, numbers with commas as decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and following table shows its content, verbatim

cat ./csv-cli-tools-input.csv
"name","surname","major","total","d1","d2"
"john","smith","arts","20","-","-16,8"
"carl","doe","engineering","30","-","12"
"adam","white","engineering","30","-","14,3"
"john","brown","arts","24","-","17,8"
"janet","smith","engineering","22","-","16,4"
"jane","doe","arts","20","14,8","26,8"
"anne","carol","engineering","10","12,2","26,3"

Pre-processing the file

We fix the format by:

  • using sed to replace “-” with “0.00”
  • using miller to convert the file to tsv, so that commas will appear only when used as decimal separators
  • using sed to replace “,” with “.”, so that decimal numbers can be correctly interpreted as such
sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | \
    mlr --icsv --otsv cat | \
    sed 's/,/./g' > csv-cli-tools-cleaned-for-datamash.tsv

Pivot Table

Now we are ready to perform the analyses.

For instance, count the number of grades each student got by subject:

cat ./csv-cli-tools-cleaned-for-datamash.tsv | \
    datamash crosstab surname,major count total -H 
GroupBy(surname) GroupBy(major) count(total)
  arts engineering
brown 1 N/A
carol N/A 1
doe 1 1
smith 1 1
white N/A 1