Miller as a preprocessor for Datamash

One of the features Datamash provides is crosstab, which builds a Pivot table, similar to the ones you can build with Excel. This is more convenient than the pivot features provided by Miller.

However, Datamash is very strict about the format it reads, and it does not manage double quotes to escape comma in fields; since I often want to crosstab fields with commas, when I wanted to use this feature I had to use sed to preprocess the input file.

However, I recently realized that I can use miller as a preprocessor to clean the input data. The idea is the following:

Miller reads the CSV file and outputs a TSV file, removing double quotes
Datamash takes the file generated by miller and builds the pivot table.

sed might still be necessary, according to your locale, to have datamash perform computations on numbers with decimals. The syntax, however, is simplified.

Example

The input file has fields protected by double quotes, numbers with commas as decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and following table shows its content, verbatim

cat ./csv-cli-tools-input.csv

"name","surname","major","total","d1","d2"
"john","smith","arts","20","-","-16,8"
"carl","doe","engineering","30","-","12"
"adam","white","engineering","30","-","14,3"
"john","brown","arts","24","-","17,8"
"janet","smith","engineering","22","-","16,4"
"jane","doe","arts","20","14,8","26,8"
"anne","carol","engineering","10","12,2","26,3"

Pre-processing the file

We fix the format by:

using sed to replace “-” with “0.00”
using miller to convert the file to tsv, so that commas will appear only when used as decimal separators
using sed to replace “,” with “.”, so that decimal numbers can be correctly interpreted as such

sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | \
    mlr --icsv --otsv cat | \
    sed 's/,/./g' > csv-cli-tools-cleaned-for-datamash.tsv

Pivot Table

Now we are ready to perform the analyses.

For instance, count the number of grades each student got by subject:

cat ./csv-cli-tools-cleaned-for-datamash.tsv | \
    datamash crosstab surname,major count total -H

GroupBy(surname)	GroupBy(major)	count(total)
	arts	engineering
brown	1	N/A
carol	N/A	1
doe	1	1
smith	1	1
white	N/A	1