A quick tutorial on Miller

Recently I had to perform some analyses on a CSV file with a fairly complex format. I started with datamash, but I ended up using miller, because it performed considerably better on the input file I had, which had a tricky syntax.

This page reports on the tasks I performed and is a quick tutorial of Miller.

The input file

The input file has fields protected by double quotes, numbers with commas as decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and following table shows its content, formatted by Org Mode.

cat ./csv-cli-tools-input.csv

Tasks

These are the tasks I had to perform on the file:

  1. Pre-process the file to simplify analyses: convert separator in decimal numbers and replace “-” with zeroes
  2. Compute descriptive statistics of “total”
  3. Compute average total by major (pivot)
  4. Transpose row and columns of the pivot table computed at the previous step
  5. Select surname and total
  6. Select all records with d1 < 13.0
  7. Sort record by total

Pre-processing the file

One of the issues with the file is the use of field delimiters and “,” as decimal places, which causes datamash to incorrectly parse the file.

We fix the format by:

  • replacing “-” with “0.00” using sed
  • converting the file to tsv (so that commas will appear only as decimal separators) with miller
  • replacing “,” with “.” using sed
sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | mlr --icsv --otsv cat | sed 's/,/./g' | mlr --itsv --ocsv  cat > csv-cli-tools-input-cleaned.csv

Now we are ready to perform the analyses.

Descriptive statistics of total

mlr --icsv --ocsv stats1 -a p0,p25,p50,mean,p75,p100,stddev,var -f total  csv-cli-tools-input-cleaned.csv

Average total by major

mlr --icsv --ocsv stats1 -a mean -f total -g major  csv-cli-tools-input-cleaned.csv | mlr --icsv --ocsv format-values -f "%.2f"

Note. Org Mode cleans data a bit by removing one zero from the second number.

Select surname and total

mlr --icsv --ocsv cut -f surname,total csv-cli-tools-input-cleaned.csv

Select all records with d1 < 13.0

mlr --icsv --ocsv filter '$d1 < 13' csv-cli-tools-input-cleaned.csv

Sort record by total

mlr --icsv --ocsv sort -nr total csv-cli-tools-input-cleaned.csv

Author: Adolfo Villafiorita

Last modified: 2020-11-07 Sat 18:43 (created on: 2020-11-06 Fri 00:00)

Published: 2020-12-04 Fri 13:21