A quick tutorial on Miller
Recently I had to perform some analyses on a CSV file with a fairly
complex format. I started with datamash, but I ended up using Miller,
because it performed considerably better on my input file, which had a
tricky syntax.
This page reports on the tasks I performed and serves as a quick tutorial on Miller.
The input file
The input file has fields protected by double quotes, numbers with commas as the decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and the following listing shows its content verbatim:
cat ./csv-cli-tools-input.csv
"name","surname","major","total","d1","d2" "john","smith","arts","20","-","-16,8" "carl","doe","engineering","30","-","12" "adam","white","engineering","30","-","14,3" "john","brown","arts","24","-","17,8" "janet","smith","engineering","22","-","16,4" "jane","doe","arts","20","14,8","26,8" "anne","carol","engineering","10","12,2","26,3"
Tasks
These are the tasks I had to perform on the file:
- Pre-process the file to simplify analyses: convert the decimal separator from comma to dot and replace “-” with zeroes
- Compute descriptive statistics of “total”
- Compute average total by major (pivot)
- Transpose rows and columns of the pivot table computed in the previous step
- Select surname and total
- Select all records with d1 < 13.0
- Sort records by total
Pre-processing the file
One of the issues with the file is the use of “,” both as the field
delimiter and as the decimal separator, which causes datamash
to parse the file incorrectly.
We fix the format by:
- using sed to replace “-” with “0.00”
- using miller to convert the file to tsv, so that commas will appear only when used as decimal separators
- using sed to replace “,” with “.”, so that decimal numbers can be correctly interpreted as such
sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | \
  mlr --icsv --otsv cat | \
  sed 's/,/./g' | \
  mlr --itsv --ocsv cat > csv-cli-tools-input-cleaned.csv
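The round-trip through TSV is what makes the second sed safe. On a single hypothetical row (field values taken from the sample above), once the fields are tab-separated every remaining comma is necessarily a decimal separator:

```shell
# Hypothetical one-row TSV sample: after conversion to TSV, the only commas
# left are decimal separators, so a global s/,/./ cannot corrupt field
# boundaries.
printf 'john\tsmith\tarts\t20\t0.00\t-16,8\n' | sed 's/,/./g'
# → john	smith	arts	20	0.00	-16.8
```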
Now we are ready to perform the analyses.
Descriptive statistics of total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv \
stats1 -a p0,p25,p50,mean,p75,p100,stddev,var -f total
total_p0 | total_p25 | total_p50 | total_mean | total_p75 | total_p100 | total_stddev | total_var |
10 | 20 | 22 | 22.285714285714285 | 30 | 30 | 6.872997543873798 | 47.23809523809526 |
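As a sanity check, the mean and variance can be recomputed with plain awk over the total column (values copied from the table above). Note that Miller's stddev and var use the sample (n-1) denominator:

```shell
# Recompute mean and sample variance of "total" with awk; the n-1
# denominator matches Miller's stats1 stddev/var output.
printf '20\n30\n30\n24\n22\n20\n10\n' |
  awk '{s+=$1; ss+=$1*$1; n++}
       END {printf "%.2f %.2f\n", s/n, (ss-s*s/n)/(n-1)}'
# → 22.29 47.24
```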
Notice that we have used two “tricks” to simplify command composition and readability:
- the --from option lets you put the input file as the first argument. This makes the command more readable and simplifies composing it on the command line
- the then keyword chains different operations, reducing the number of pipes you need. For instance, mlr cmd1 | mlr cmd2 can usually be replaced by mlr cmd1 then cmd2
Average total by major
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv \
  stats1 -a mean -f total -g major then \
  format-values -f "%.2f"
major | total_mean |
arts | 21.33 |
engineering | 23 |
Note. Org Mode cleans the output a bit, dropping the trailing zeroes from the second number (23.00 is shown as 23).
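The grouped mean can be cross-checked with plain awk, accumulating a sum and a count per major (major/total pairs copied from the cleaned file):

```shell
# Per-group mean, mirroring stats1 -a mean -f total -g major:
# accumulate sum and count keyed by major, then divide.
printf 'arts 20\nengineering 30\nengineering 30\narts 24\nengineering 22\narts 20\nengineering 10\n' |
  awk '{sum[$1]+=$2; n[$1]++}
       END {for (m in sum) printf "%s %.2f\n", m, sum[m]/n[m]}' | sort
# → arts 21.33
# → engineering 23.00
```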
Select surname and total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv cut -f surname,total
surname | total |
smith | 20 |
doe | 30 |
white | 30 |
brown | 24 |
smith | 22 |
doe | 20 |
carol | 10 |
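Miller's cut selects columns by name, which is why it needs no knowledge of column positions. Assuming the cleaned file is written without quotes (as Miller does when quoting is unnecessary) and contains no embedded commas, positional coreutils cut on fields 2 and 4 gives the same result; here a two-line stand-in for the cleaned file:

```shell
# Positional equivalent of mlr cut -f surname,total on comma-free CSV:
# surname is field 2, total is field 4.
printf 'name,surname,major,total,d1,d2\njohn,smith,arts,20,0.00,-16.8\n' |
  cut -d, -f2,4
# → surname,total
# → smith,20
```

The name-based form is more robust: it keeps working if the column order changes.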
Select all records with d1 < 13.0
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv filter '$d1 < 13'
name | surname | major | total | d1 | d2 |
john | smith | arts | 20 | 0.0 | -16.8 |
carl | doe | engineering | 30 | 0.0 | 12 |
adam | white | engineering | 30 | 0.0 | 14.3 |
john | brown | arts | 24 | 0.0 | 17.8 |
janet | smith | engineering | 22 | 0.0 | 16.4 |
anne | carol | engineering | 10 | 12.2 | 26.3 |
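The filter expression compares d1 numerically, which is exactly why the pre-processing step mattered: "14,8" would not compare as a number. An awk sketch of the same predicate, on a reduced hypothetical sample where d1 is the second field:

```shell
# awk analogue of mlr filter '$d1 < 13': pass the header through and keep
# rows whose second field, read numerically, is below 13.
printf 'name,d1\njane,14.8\nanne,12.2\njohn,0.0\n' |
  awk -F, 'NR==1 || $2 < 13'
# → name,d1
# → anne,12.2
# → john,0.0
```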
Sort records by total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv sort -nr total
name | surname | major | total | d1 | d2 |
carl | doe | engineering | 30 | 0.0 | 12 |
adam | white | engineering | 30 | 0.0 | 14.3 |
john | brown | arts | 24 | 0.0 | 17.8 |
janet | smith | engineering | 22 | 0.0 | 16.4 |
john | smith | arts | 20 | 0.0 | -16.8 |
jane | doe | arts | 20 | 14.8 | 26.8 |
anne | carol | engineering | 10 | 12.2 | 26.3 |
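For comparison, coreutils sort can do the same numeric, descending sort with a key specification on the fourth field; unlike Miller, it knows nothing about the header row, so this sketch sorts data rows only (a reduced sample without ties, to keep the ordering unambiguous):

```shell
# coreutils analogue of mlr sort -nr total: -t, sets the field separator,
# -k4,4nr sorts on field 4 only, numerically, in reverse order.
printf 'john,smith,arts,20\nanne,carol,engineering,10\nadam,white,engineering,30\n' |
  sort -t, -k4,4nr
# → adam,white,engineering,30
# → john,smith,arts,20
# → anne,carol,engineering,10
```

Miller's header awareness is the main win here: with plain sort you would have to strip and re-attach the header yourself.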