Miller as a preprocessor for Datamash
One of the features Datamash provides is crosstab
, which builds a
Pivot table, similar to the ones you can build with Excel. This is
more convenient than the pivot features provided by Miller.
However, Datamash is very strict about the format it reads, and it
does not manage double quotes to escape comma in fields; since I often
want to crosstab fields with commas, when I wanted to use this feature
I had to use sed
to preprocess the input file.
However, I recently realized that I can use miller
as a
preprocessor to clean the input data. The idea is the following:
- Miller reads the CSV file and outputs a TSV file, removing double quotes
- Datamash takes the file generated by miller and builds the pivot table.
sed
might still be necessary, according to your locale, to have
datamash perform computations on numbers with decimals. The syntax,
however, is simplified.
Example
The input file has fields protected by double quotes, numbers with commas as decimal separator (Italian locale), and “-” in place of zeroes. A sample is available here and following table shows its content, verbatim
cat ./csv-cli-tools-input.csv
"name","surname","major","total","d1","d2" "john","smith","arts","20","-","-16,8" "carl","doe","engineering","30","-","12" "adam","white","engineering","30","-","14,3" "john","brown","arts","24","-","17,8" "janet","smith","engineering","22","-","16,4" "jane","doe","arts","20","14,8","26,8" "anne","carol","engineering","10","12,2","26,3"
Pre-processing the file
We fix the format by:
- using sed to replace “-” with “0.00”
- using miller to convert the file to tsv, so that commas will appear only when used as decimal separators
- using sed to replace “,” with “.”, so that decimal numbers can be correctly interpreted as such
sed 's/"-"/"0.00"/g' csv-cli-tools-input.csv | \ mlr --icsv --otsv cat | \ sed 's/,/./g' > csv-cli-tools-cleaned-for-datamash.tsv
Pivot Table
Now we are ready to perform the analyses.
For instance, count the number of grades each student got by subject:
cat ./csv-cli-tools-cleaned-for-datamash.tsv | \
datamash crosstab surname,major count total -H
GroupBy(surname) | GroupBy(major) | count(total) |
arts | engineering | |
brown | 1 | N/A |
carol | N/A | 1 |
doe | 1 | 1 |
smith | 1 | 1 |
white | N/A | 1 |