Tools to manipulate CSV files from the Command Line
Cleaning data and manually processing CSV files to make them readable to scripts and other applications is a tedious task.
There are some nice tools to manipulate CSV files from the command line in Linux, which can ease some of the burden and even provide lightweight alternatives to R and other analysis tools.
The following is a list of the tools I am aware of, together with their main characteristics: an “x” means the function is covered, “-” means partial coverage, “?” means that I did not check:
|Name||Check||Clean||Stats||Math||Pivot||Transpose||Select||Query||Sort||Join||Convert from Excel|
|miller||-||x||x||x||x||x (xtab)||x||x (expr)||x||x|
- Check: check file structure
- Clean: file “cleaning”, such as changing separators, field delimiters, pretty printing tables, aligning fields
- Stats: computation of descriptive statistics (min, max, average, variance, first and third quartile, …)
- Math: computation on columns (sum, count, uniq, average, …)
- Pivot: pivoting data
- Transpose: transposing rows and columns
- Query: performing queries and selecting data1
- Sort: sorting data2
- Join: joining two CSV file over the value of a column
- Convert Excel: conversion of Excel files
If you want to use these tools extensively, the quirks you need to pay attention to include:
- managing input files formats, especially for specific cases, such as fields surrounded by quotes, field separators appearing in quoted fields
- management of empty fields
- management of “N/A” when piping commands
I usually end up using datamash, mainly because of their simple and clean syntax. Sometimes I use xsv to preprocess the input file. Recently, however, I started appreciating miller, which is very flexible, feature rich, and robust in managing input file formats.
See A quick tutorial on Miller for more details.