Tools to manipulate CSV files from the Command Line

Cleaning data and manually processing CSV files to make them readable to scripts and other applications is a tedious task.

There are some nice tools to manipulate CSV files from the command line in Linux, which can ease some of the burden and even provide lightweight alternatives to R and other analysis tools.

The following is a list of the tools I am aware of, together with their main characteristics: an “x” means the function is covered, “-” means partial coverage, “?” means that I did not check:

Name Check Clean Stats Math Pivot Transpose Select Query Sort Join Convert from Excel
datamash x - x x x x x        
xsv ? x x x     x x x x  
miller - x x x x x (xtab) x x (expr) x x  
csvkit ? x x       x x (sql) x x x
pivot ?       x            
q ?             x (sql)      

where:

Recommendations

If you want to use these tools extensively, the quirks you need to pay attention to include:

  • managing input files formats, especially for specific cases, such as fields surrounded by quotes, field separators appearing in quoted fields
  • management of empty fields
  • management of “N/A” when piping commands

I usually end up using datamash, mainly because of their simple and clean syntax. Sometimes I use xsv to preprocess the input file. Recently, however, I started appreciating miller, which is very flexible, feature rich, and robust in managing input file formats.

See A quick tutorial on Miller for more details.

Footnotes:

1

Different from grep, since queries are performed on values in columns.

2

Similar to sort.

Author: Adolfo Villafiorita

Last modified: 2020-11-08 Sun 09:16 (created on: 2020-11-06 Fri 00:00)

Published: 2020-12-04 Fri 13:21