A quick tutorial on Miller
Recently I had to perform some analyses on a CSV file with a fairly
complex format. I started with datamash, but I ended up using Miller,
because it performed considerably better on my input file, which had a
tricky syntax.
This page reports on the tasks I performed and serves as a quick tutorial on Miller.
The input file
The input file has fields protected by double quotes, numbers with commas as the decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and the following listing shows its content verbatim:
cat ./csv-cli-tools-input.csv
"name","surname","major","total","d1","d2" "john","smith","arts","20","-","-16,8" "carl","doe","engineering","30","-","12" "adam","white","engineering","30","-","14,3" "john","brown","arts","24","-","17,8" "janet","smith","engineering","22","-","16,4" "jane","doe","arts","20","14,8","26,8" "anne","carol","engineering","10","12,2","26,3"
Tasks
These are the tasks I had to perform on the file:
- Pre-process the file to simplify analyses: convert the decimal separator from comma to dot and replace “-” with zeroes
- Compute descriptive statistics of “total”
- Compute average total by major (pivot)
- Transpose rows and columns of the pivot table computed in the previous step
- Select surname and total
- Select all records with d1 < 13.0
- Sort records by total
Pre-processing the file
One of the issues with the file is the use of “,” both as the field
delimiter and as the decimal separator, which causes datamash
to parse the file incorrectly.
We fix the format by:
- using sed to replace “-” with “0.00”
- using miller to convert the file to tsv, so that commas will appear only when used as decimal separators
- using sed to replace “,” with “.”, so that decimal numbers can be correctly interpreted as such
sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | \
  mlr --icsv --otsv cat | \
  sed 's/,/./g' | \
  mlr --itsv --ocsv cat > csv-cli-tools-input-cleaned.csv
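The round-trip through TSV is what makes the second sed safe. On a single hypothetical row (field values taken from the sample above), once the fields are tab-separated every remaining comma is necessarily a decimal separator:

```shell
# Hypothetical one-row TSV sample: after conversion to TSV, the only commas
# left are decimal separators, so a global s/,/./ cannot corrupt field
# boundaries.
printf 'john\tsmith\tarts\t20\t0.00\t-16,8\n' | sed 's/,/./g'
# → john	smith	arts	20	0.00	-16.8
```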
Now we are ready to perform the analyses.
Descriptive statistics of total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv \
stats1 -a p0,p25,p50,mean,p75,p100,stddev,var -f total
total_p0 | total_p25 | total_p50 | total_mean | total_p75 | total_p100 | total_stddev | total_var |
10 | 20 | 22 | 22.285714285714285 | 30 | 30 | 6.872997543873798 | 47.23809523809526 |
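As a sanity check, the mean and variance can be recomputed with plain awk over the total column (values copied from the table above). Note that Miller's stddev and var use the sample (n-1) denominator:

```shell
# Recompute mean and sample variance of "total" with awk; the n-1
# denominator matches Miller's stats1 stddev/var output.
printf '20\n30\n30\n24\n22\n20\n10\n' |
  awk '{s+=$1; ss+=$1*$1; n++}
       END {printf "%.2f %.2f\n", s/n, (ss-s*s/n)/(n-1)}'
# → 22.29 47.24
```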
Notice that we have used two “tricks” to simplify command composition and readability:
- the --from option lets you put the input file as the first argument. This makes the command more readable and simplifies composing it on the command line
- the then keyword chains different operations, reducing the number of pipes you need. For instance, mlr cmd1 | mlr cmd2 can usually be replaced by mlr cmd1 then cmd2
Average total by major
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv \
  stats1 -a mean -f total -g major then \
  format-values -f "%.2f"
major | total_mean |
arts | 21.33 |
engineering | 23 |
Note. Org Mode cleans the output a bit, dropping the trailing zeroes from the second number (23.00 is shown as 23).
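The grouped mean can be cross-checked with plain awk, accumulating a sum and a count per major (major/total pairs copied from the cleaned file):

```shell
# Per-group mean, mirroring stats1 -a mean -f total -g major:
# accumulate sum and count keyed by major, then divide.
printf 'arts 20\nengineering 30\nengineering 30\narts 24\nengineering 22\narts 20\nengineering 10\n' |
  awk '{sum[$1]+=$2; n[$1]++}
       END {for (m in sum) printf "%s %.2f\n", m, sum[m]/n[m]}' | sort
# → arts 21.33
# → engineering 23.00
```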
Select surname and total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv cut -f surname,total
surname | total |
smith | 20 |
doe | 30 |
white | 30 |
brown | 24 |
smith | 22 |
doe | 20 |
carol | 10 |
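Miller's cut selects columns by name, which is why it needs no knowledge of column positions. Assuming the cleaned file is written without quotes (as Miller does when quoting is unnecessary) and contains no embedded commas, positional coreutils cut on fields 2 and 4 gives the same result; here a two-line stand-in for the cleaned file:

```shell
# Positional equivalent of mlr cut -f surname,total on comma-free CSV:
# surname is field 2, total is field 4.
printf 'name,surname,major,total,d1,d2\njohn,smith,arts,20,0.00,-16.8\n' |
  cut -d, -f2,4
# → surname,total
# → smith,20
```

The name-based form is more robust: it keeps working if the column order changes.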
Select all records with d1 < 13.0
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv filter '$d1 < 13'
name | surname | major | total | d1 | d2 |
john | smith | arts | 20 | 0.0 | -16.8 |
carl | doe | engineering | 30 | 0.0 | 12 |
adam | white | engineering | 30 | 0.0 | 14.3 |
john | brown | arts | 24 | 0.0 | 17.8 |
janet | smith | engineering | 22 | 0.0 | 16.4 |
anne | carol | engineering | 10 | 12.2 | 26.3 |
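The filter expression compares d1 numerically, which is exactly why the pre-processing step mattered: "14,8" would not compare as a number. An awk sketch of the same predicate, on a reduced hypothetical sample where d1 is the second field:

```shell
# awk analogue of mlr filter '$d1 < 13': pass the header through and keep
# rows whose second field, read numerically, is below 13.
printf 'name,d1\njane,14.8\nanne,12.2\njohn,0.0\n' |
  awk -F, 'NR==1 || $2 < 13'
# → name,d1
# → anne,12.2
# → john,0.0
```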
Sort records by total
mlr --from csv-cli-tools-input-cleaned.csv --icsv --ocsv sort -nr total
name | surname | major | total | d1 | d2 |
carl | doe | engineering | 30 | 0.0 | 12 |
adam | white | engineering | 30 | 0.0 | 14.3 |
john | brown | arts | 24 | 0.0 | 17.8 |
janet | smith | engineering | 22 | 0.0 | 16.4 |
john | smith | arts | 20 | 0.0 | -16.8 |
jane | doe | arts | 20 | 14.8 | 26.8 |
anne | carol | engineering | 10 | 12.2 | 26.3 |
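For comparison, coreutils sort can do the same numeric, descending sort with a key specification on the fourth field; unlike Miller, it knows nothing about the header row, so this sketch sorts data rows only (a reduced sample without ties, to keep the ordering unambiguous):

```shell
# coreutils analogue of mlr sort -nr total: -t, sets the field separator,
# -k4,4nr sorts on field 4 only, numerically, in reverse order.
printf 'john,smith,arts,20\nanne,carol,engineering,10\nadam,white,engineering,30\n' |
  sort -t, -k4,4nr
# → adam,white,engineering,30
# → john,smith,arts,20
# → anne,carol,engineering,10
```

Miller's header awareness is the main win here: with plain sort you would have to strip and re-attach the header yourself.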