A literate programming tutorial using my book reading habits


This page started as a collection of statistics about my book reading habits, but it is also an example of the literate programming features Org Mode provides. More specifically, it shows how you can use Org Mode to pass data between source blocks written in different languages. This makes it possible, for instance, to generate data using Ruby, plot it in Gnuplot, analyze it with datamash or R, and export the resulting page to HTML.

There is no need to use Org Mode to achieve the same result: we could write different scripts and then use a build tool such as Make or Rake to glue the code together or, possibly, write a monolithic script in Ruby which uses system calls to invoke other tools. Org Mode, however, lets us glue different pieces of code together in a more natural way.

You can also view lists of books by year computed from the data.

Getting data about my reading habits

I keep data about my reading habits in Calibre, where I defined a couple of metadata fields that let me record the start and end date of each book I read. Other information is read directly from Calibre, which keeps title, authors, and genre, and has a plugin to compute the number of pages based on the book length.

Calibre has a function to export all data as CSV. I have a function which converts the CSV into a YAML file, which is the input file for this page. The following code reads the YAML with the books I read and defines a couple of functions which will be useful later.

The first line declares various execution parameters. The code is written in Ruby and executed in a :session, so that values are persisted and shared among all other blocks written in the same language. We do not care about outputting the results of the code evaluation in the buffer, hence we can declare :results none.

require 'yaml'
require 'date'
books  = YAML::load_file "../_data/bookstream.yml",
                         permitted_classes: [Date]

# group an array of hashes according to the values of a key and add
# some data, such as the number of books belonging to each group and
# the total number of pages
def group_and_array(key, books)
  grouped = books.group_by { |x| x[key] }
  grouped.map { |k, v| 
    [ k, v.size, v.map { |p| p["pages"] }.inject(&:+) ] 
  }
end

# convert an aoh (= array of hashes) into an array of arrays of strings.
# assume all entries have the same keys, although not necessarily in
# the same order (do not expect Hash.values to return the values in
# the same order among all entries of the array)
def aoh_to_a aoh
  keys = aoh[0].keys
  array = []
  aoh.each do |entry|
    array << keys.map { |key| entry[key].to_s } 
  end
  array
end

The books variable contains data which we can use to compute various statistics.
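To make the two helpers concrete, here is a minimal sketch that runs them on three made-up records. The sample_books array and its titles are hypothetical; only the key names (category, pages) come from the code above. The helpers are restated in a compact form so that the snippet runs on its own.

```ruby
# Hypothetical records, shaped like the entries read from the YAML file.
# The third hash lists its keys in a different order on purpose.
sample_books = [
  { "title" => "Book A", "category" => "crime",   "pages" => 100 },
  { "title" => "Book B", "category" => "crime",   "pages" => 250 },
  { "pages" => 300, "title" => "Book C", "category" => "science" }
]

# Compact restatement of group_and_array: one row per distinct value of key,
# holding the value, the number of books, and the total number of pages.
def group_and_array(key, books)
  books.group_by { |x| x[key] }.map do |k, v|
    [k, v.size, v.sum { |b| b["pages"] }]
  end
end

# Compact restatement of aoh_to_a: the key order of the first entry is used
# for every row, so entries with differently ordered keys still line up.
def aoh_to_a(aoh)
  keys = aoh[0].keys
  aoh.map { |entry| keys.map { |key| entry[key].to_s } }
end

by_genre = group_and_array("category", sample_books)
rows = aoh_to_a(sample_books)
```

Here by_genre evaluates to [["crime", 2, 350], ["science", 1, 300]], and the third row of rows comes out as title, category, pages even though that hash was written with pages first.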

What genres do I like more?

Simple answer: science fiction and crime. Is it true, however? To answer this question we use the group_and_array function defined above, which groups book data according to a field, computes the number of books and the total number of pages per category, and returns an array of arrays.

The following piece of code, thus, groups books by genre and returns a table, which can be nicely shown by Org Mode. Notice that we attach a header ([ ["Genre", "Books", "Pages"] ]) and a separation line [nil]. I learned the trick about the separation line here. It allows us to output column headers.

Notice that we export both the code and the results of the execution, and that we ask Org Mode to show, as results, the value returned from executing the code, which in our case is an array of arrays. A nice explanation of the values :results can assume is available in the Org Mode Manual.

header = [["Genre", "Books", "Pages"]] + [nil] 

body = group_and_array("category", books)
# remove books with no genre
body = body.select { |x| x[0] != "" }
# fix Genre string to improve output
body = body.map { |x| [x[0].gsub("_", " ").capitalize, x[1], x[2]] }
# sort it by frequency
body = body.sort { |x, y| x[1] <=> y[1] }.reverse

header + body
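| Genre | Books | Pages |
|-------+-------+-------|
| Science fiction | 65 | 25154 |
| Crime | 28 | 11947 |
| Novel | 15 | 2313 |
| Non fiction | 12 | 2545 |
| History | 12 | 6819 |
| Science | 11 | 2623 |
| Food | 11 | 4036 |
| Management | 9 | 1447 |
| War biography | 8 | 5091 |
| Humour | 8 | 860 |
| Fiction | 8 | 1987 |
| Economics | 4 | 1260 |
| Sea | 4 | 1007 |
| History of science | 4 | 1184 |
| Tragedy | 3 | 558 |
| Comedy | 3 | 637 |
| Medicine | 2 | 891 |
| Biography | 2 | 553 |
| Kids | 1 | 40 |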

The number of pages is computed by a Calibre plugin whose results I cannot check and which could not run on some of my entries, since the electronic version of the book was missing.
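The shape of the value handed to Org Mode is easier to see on a tiny input. The sketch below applies the same cleaning steps to three hypothetical grouped rows; the numbers are made up, and sort_by with a negated count is a variation on the sort and reverse used above (ties may come out in a different order).

```ruby
# Hypothetical grouped rows: [genre, number of books, total pages].
body = [["science_fiction", 3, 900], ["", 1, 120], ["crime", 5, 1500]]

# The nil entry is what Org Mode renders as the rule under the header.
header = [["Genre", "Books", "Pages"]] + [nil]

# Drop books with no genre.
body = body.reject { |genre, _count, _pages| genre == "" }
# Turn "science_fiction" into "Science fiction".
body = body.map { |genre, count, pages| [genre.gsub("_", " ").capitalize, count, pages] }
# Most frequent genre first.
body = body.sort_by { |_genre, count, _pages| -count }

table = header + body
```

The resulting table is an array whose first element is the header row, whose second element is nil, and whose remaining elements are the data rows, here Crime followed by Science fiction.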

How many books do I read in a year?

More precisely: of the books I read, how many books did I start reading in a given year?

Once again, we can use the group_and_array function, grouping by year. This lets us show a table, which we enrich with a textual bar plot built using "-" * N, a Ruby construct that builds a string of N repetitions of the given string.

header = [["Year", "Books", "Pages", "Avg. Pages/Day", "Plot" ]] + [nil]

body = group_and_array("started_year", books)
# remove books with no year
body = body.select { |x| x[0] }
# add some stats
body = body.map { |x| [ x[0], x[1], x[2], x[2] / 365, "-" * x[1] ] }
# sort
body = body.sort { |x, y| x[0] <=> y[0] }

header + body
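| Year | Books | Pages | Avg. Pages/Day | Plot |
|------+-------+-------+----------------+------|
| 2012 | 8 | 2287 | 6 | -------- |
| 2013 | 8 | 3646 | 9 | -------- |
| 2014 | 13 | 3921 | 10 | ------------- |
| 2015 | 19 | 5489 | 15 | ------------------- |
| 2016 | 5 | 2295 | 6 | ----- |
| 2017 | 6 | 1472 | 4 | ------ |
| 2018 | 9 | 3507 | 9 | --------- |
| 2019 | 4 | 2064 | 5 | ---- |
| 2020 | 6 | 3695 | 10 | ------ |
| 2021 | 4 | 3498 | 9 | ---- |
| 2022 | 8 | 4045 | 11 | -------- |
| 2023 | 7 | 3331 | 9 | ------- |
| 2024 | 2 | 460 | 1 | -- |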
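Two details of the table are worth spelling out: the bar is plain string repetition, and the average is a whole number because dividing one Integer by another truncates in Ruby. The sketch below uses the 2012 figures from the table and assumes pages are stored as integers, which the whole-number output suggests; the variable names are mine.

```ruby
books_started = 8     # books started in 2012, from the table
pages = 2287          # pages for 2012, from the table

# String#* repeats the string the given number of times.
bar = "-" * books_started

# Integer / Integer truncates toward zero for positive numbers.
int_avg = pages / 365

# Dividing by a Float keeps the fractional part.
float_avg = (pages / 365.0).round(2)
```

With these inputs bar is "--------", int_avg is 6 (the value shown in the table), and float_avg is 6.27.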

We can plot the data using Gnuplot, passing as input the data of the table built by Ruby. This is achieved by the following source block, which takes as input the table above, through the :var barplot = books-per-year declaration. Notice that we also need to give a name to the table, with the #+NAME: books-per-year declaration.

The reset command in the Gnuplot script is rather useful: it ensures that all settings are reset to their default values. Without it, Gnuplot would reuse any setting defined in previous blocks in this buffer.

reset 

set boxwidth 0.5
set grid ytics linestyle 0
set style fill solid 0.20 border 

set terminal svg size 1200,800 font 'Arial,10'

set title "Books Read"
set xlabel "Year"
set ylabel "Number of Books"

plot barplot using 1:2:xtic(1) with boxes lc rgb "#0045FF" title "Books read", \
     barplot using 1:($2+0.25):2 with labels title ""

histogram-with-labels.svg

How long does it take me to read a book?

The next question is how long it takes me to read a book, in calendar days. Notice that this differs from the time actually spent reading, since calendar days do not measure effort. In some cases I stopped reading a book and came back to finish it when I was in the mood again. In other cases I read two books in parallel, although this is something I did more often when I was younger. The table also shows genre and rating, although I have not been very consistent in rating all the books I read.

Notice that here we use a slightly different notation for naming the output: we assign the name to the source block, rather than to its output. The effect is the same.

# find the books which I started and ended
read = books.select { |x| x["started"] and x["completed"] }

header = [
  ["Title", "Days", "Pages", "Avg. Pages / Day", "Genre", "Rating"]
] + [nil] 
body2 = read.map { |x|
  days = (x["completed"] - x["started"] ).to_i;
  [ x["title"],
    days, x["pages"],
    days != 0 ? ("%.2f" % (x["pages"] / days.to_f)) : "N/A",
    x["category"].gsub("_", " "),
    x["my_rating"] ] 
}
body2 = body2.sort { |x, y| x[1] <=> y[1] }.reverse

header + body2
| Title | Days | Pages | Avg. Pages / Day | Genre | Rating |
|-------+------+-------+------------------+-------+--------|
| Invisible Planets | 681 | 388 | 0.57 | science fiction | 4 |
| Even Dogs in the Wild | 386 | 461 | 1.19 | crime | 4 |
| Watchmen | 371 | 0 | 0.00 | | 0 |
| Inferno | 204 | 1969 | 9.65 | war biography | 5 |
| Wild Swans: Three Daughters of China | 181 | 714 | 3.94 | history | 5 |
| The Birth of Plenty: How the Prosperity of the Modern World was Created | 144 | 533 | 3.70 | history | 5 |
| A History of the World | 122 | 841 | 6.89 | history | 5 |
| The Gulag Archipelago | 122 | 553 | 4.53 | biography | 4 |
| The Stand | 121 | 1595 | 13.18 | science fiction | 4 |
| The elegant universe: superstrings, hidden dimensions, and the quest for the ultimate theory | 118 | 854 | 7.24 | | 5 |
| Apollo | 108 | 608 | 5.63 | history of science | 5 |
| How Not to Be Wrong : The Power of Mathematical Thinking (9780698163843) | 98 | 558 | 5.69 | non fiction | 3 |
| Code Warriors: NSA’s Codebreakers and the Secret Intelligence War Against the Soviet Union | 89 | 497 | 5.58 | | 4 |
| Dune: The Machine Crusade | 85 | 841 | 9.89 | science fiction | 3 |
| Sapiens: A Brief History of Humankind | 81 | 455 | 5.62 | history | 5 |
| Buying Time | 78 | 337 | 4.32 | science fiction | 4 |
| Periodic Tales | 71 | 590 | 8.31 | science | 4 |
| Consider the Lobster and Other Essays | 70 | 346 | 4.94 | non fiction | 5 |
| The Nutmeg’s Curse | 69 | 703 | 10.19 | | 3 |
| The Third Plate: Field Notes on the Future of Food | 69 | 552 | 8.00 | food | 4 |
| The Omnivore’s Dilemma: A Natural History of Four Meals | 64 | 491 | 7.67 | food | 4 |
| The Korean War | 59 | 919 | 15.58 | war biography | 4 |
| Dune: The Butlerian Jihad | 59 | 698 | 11.83 | science fiction | 3 |
| The Trial | 57 | 215 | 3.77 | novel | 4 |
| The Hydrogen Sonata | 56 | 604 | 10.79 | science fiction | 3 |
| The Three-Body Problem (Remembrance of Earth’s Past) | 55 | 427 | 7.76 | science fiction | 4 |
| Big Bang | 52 | 916 | 17.62 | | 5 |
| 21 Lessons for the 21st Century | 52 | 389 | 7.48 | non fiction | 5 |
| The Forever War | 52 | 271 | 5.21 | science fiction | 5 |
| The Secret Life of Groceries: The Dark Miracle of the American Supermarket | 50 | 705 | 14.10 | food | 4 |
| A Brief History Of Time | 48 | 410 | 8.54 | | 4 |
| Travels in the Interior Districts of Africa, 1795-7 | 48 | 254 | 5.29 | history | 4 |
| In Search Of Schrodinger’s Cat | 45 | 309 | 6.87 | science | 4 |
| The New York Trilogy | 44 | 352 | 8.00 | crime | 5 |
| The Battle Of The Atlantic: The Allies’ Submarine Fight Against Hitler’s Gray Wolves Of The Sea | 44 | 337 | 7.66 | history | 4 |
| Leviathan Wakes | 43 | 648 | 15.07 | science fiction | 2 |
| A Memory Called Empire | 42 | 790 | 18.81 | science fiction | 3 |
| An Edible History of Humanity | 41 | 257 | 6.27 | food | 4 |
| Do No Harm Stories of Life, Death and Brain Surgery | 39 | 281 | 7.21 | medicine | 4 |
| Extreme Ownership: How U.S. Navy SEALs Lead and Win | 36 | 297 | 8.25 | management | 5 |
| Standing in Another Man’s Grave: A John Rebus Novel | 36 | 458 | 12.72 | crime | 3 |
| Swallow This | 35 | 273 | 7.80 | food | 2 |
| Packing for Mars | 33 | 313 | 9.48 | science | 3 |
| Project Hail Mary | 31 | 854 | 27.55 | science fiction | 4 |
| Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety | 31 | 958 | 30.90 | history | 3 |
| In Defense of Food | 31 | 232 | 7.48 | food | 3 |
| La fisica del diavolo (Italian Edition) | 30 | 434 | 14.47 | | 0 |
| Land grabbing. Come il mercato delle terre crea il nuovo colonialismo (Indi) (Italian Edition) | 30 | 222 | 7.40 | food | 5 |
| Solaris | 30 | 246 | 8.20 | science fiction | 4 |
| The Naked Sun | 30 | 275 | 9.17 | science fiction | 4 |
| F*** You Very Much: The surprising truth about why people are so rude | 28 | 313 | 11.18 | non fiction | 4 |
| Return From The Stars | 28 | 300 | 10.71 | science fiction | 3 |
| Rendezvous With Rama | 26 | 243 | 9.35 | science fiction | 5 |
| The Illustrated Man | 25 | 291 | 11.64 | science fiction | 3 |
| The Power of Habit: Why We Do What We Do in Life and Business | 25 | 382 | 15.28 | | 5 |
| Reality Is Not What It Seems: The Journey to Quantum Gravity | 22 | 221 | 10.05 | science | 4 |
| Slaughterhouse-Five (Kurt Vonnegut Series) | 21 | 180 | 8.57 | science fiction | 5 |
| Sbornie sacre, sbornie profane: L’ubriachezza dal Vecchio al Nuovo Mondo (Intersezioni) (Italian Edition) | 20 | 156 | 7.80 | history | 4 |
| Saints of the Shadow Bible | 20 | 478 | 23.90 | crime | 4 |
| Trash | 19 | 136 | 7.16 | non fiction | 3 |
| I signori del cibo. Viaggio nell’industria alimentare che sta distruggendo il pianeta (Italian Edition) | 18 | 284 | 15.78 | food | 5 |
| The 5th Wave | 18 | 480 | 26.67 | science fiction | 4 |
| The Martian: A Novel | 17 | 412 | 24.24 | science fiction | 5 |
| The Man in the High Castle | 17 | 291 | 17.12 | science fiction | 4 |
| Il paradiso maoista | 16 | 460 | 28.75 | | 4 |
| Spillover. L’evoluzione delle epidemie (2014) | 16 | 610 | 38.12 | medicine | 4 |
| Kitchen Confidential Paperback | 16 | 320 | 20.00 | food | 5 |
| Tears of the Giraffe | 16 | 204 | 12.75 | fiction | 4 |
| The Futurological Congress | 15 | 128 | 8.53 | science fiction | 4 |
| A Briefer History of Time | 15 | 133 | 8.87 | science | 4 |
| Salt Sugar Fat: How the Food Giants Hooked Us | 15 | 464 | 30.93 | food | 4 |
| Denominazione di origine inventata: Le bugie del marketing sui prodotti tipici italiani (Italian Edition) | 14 | 328 | 23.43 | | 3 |
| Okinawa: The Last Battle of World War II | 13 | 200 | 15.38 | history | 3 |
| Six Easy Pieces | 11 | 164 | 14.91 | science | 5 |
| Machines Like Me | 10 | 529 | 52.90 | science fiction | 4 |
| L’orribile karma della formica | 10 | 436 | 43.60 | | 0 |
| Pista nera | 9 | 249 | 27.67 | crime | 3 |
| Il vecchio e il mare | 9 | 84 | 9.33 | novel | 4 |
| Ender’s Game (The Ender Quintet) | 7 | 409 | 58.43 | science fiction | 5 |
| We Are the Weather | 6 | 236 | 39.33 | food | 5 |
| Micromégas | 6 | 25 | 4.17 | science fiction | 3 |
| Worlds Apart: Worlds | 6 | 243 | 40.50 | science fiction | 3 |
| Worlds | 6 | 262 | 43.67 | science fiction | 4 |
| Deep Descent | 6 | 275 | 45.83 | sea | 5 |
| Artemis. La prima città sulla luna (Italian Edition) | 5 | 387 | 77.40 | science fiction | 4 |
| L’anima delle macchine: Tecnodestino, dipendenza tecnologica e uomo virtuale | 5 | 256 | 51.20 | science | 3 |
| The Circle | 5 | 476 | 95.20 | fiction | 4 |
| 2001 - A Space Odyssey | 4 | 389 | 97.25 | | 0 |
| Making a Submarine Officer - A story of the USS San Francisco (SSN 711) | 3 | 295 | 98.33 | sea | 3 |
| The No. 1 Ladies’ Detective Agency | 3 | 230 | 76.67 | humour | 4 |
| Alien Disgelo | 2 | 0 | 0.00 | | 0 |
| Spaghetti robot. Il made in Italy che ci cambierà la vita | 2 | 222 | 111.00 | science | 3 |
| Morte dei Marmi (Contromano) (Italian Edition) | 2 | 88 | 44.00 | humour | 3 |
| Alien Volume 3 Icarus | 1 | 0 | 0.00 | | 3 |
| Make Your Bed | 1 | 61 | 61.00 | management | 5 |
| Mia nonna era un pesce | 0 | 40 | N/A | kids | 3 |
| 25 Things About Life | 0 | 26 | N/A | non fiction | 3 |
| Sette brevi lezioni di fisica | 0 | 52 | N/A | science | 4 |
| Who Moved My Cheese | 0 | 32 | N/A | management | 3 |
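The Days and Avg. Pages / Day columns rely on how Ruby subtracts dates. A minimal sketch follows; the two dates are hypothetical (the real ones live in the Calibre export) and are chosen only to give a 30-day span, while the page count is an arbitrary example value.

```ruby
require 'date'

# Hypothetical start and end dates.
started   = Date.new(2023, 3, 1)
completed = Date.new(2023, 3, 31)
pages     = 246

# Date - Date returns a Rational number of days, hence the to_i.
difference = completed - started
days = difference.to_i

# Same guard as in the block above: a book started and finished on the
# same day has a duration of 0, so the speed is reported as "N/A".
speed = days != 0 ? format("%.2f", pages / days.to_f) : "N/A"
same_day_speed = 0 != 0 ? format("%.2f", pages / 0.0) : "N/A"
```

Here days is 30 and speed is the string "8.20". The speed is kept as a formatted string rather than a number, which is why the "N/A" marker can share the same column.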

The results can be summarized using different tools. Rather than diving into R and making the table above into a data frame, we use datamash and Gnuplot.

The command line utility datamash lets us perform basic operations on CSV files. Here we compute basic summary statistics for column 2, that is, the number of days it takes to read a book:

echo "$bd" | datamash --header-out min 2 q1 2 median 2 q3 2 max 2 sstdev 2
min(field-2) q1(field-2) median(field-2) q3(field-2) max(field-2) sstdev(field-2)
0 10.5 30 55.5 681 88.987085010699
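For readers who want to see what these statistics mean, the sketch below computes them in plain Ruby on a short, made-up list of durations. The quantile function interpolates linearly between the two closest ranks; that datamash computes its quartiles in exactly this way is my assumption, and the function names are hypothetical.

```ruby
# Quantile with linear interpolation between the two closest ranks:
# the value sits at position (n - 1) * p in the sorted data.
def quantile(values, p)
  sorted = values.sort
  pos = (sorted.size - 1) * p
  lower = pos.floor
  upper = pos.ceil
  sorted[lower] + (pos - lower) * (sorted[upper] - sorted[lower])
end

# Sample standard deviation (divides by n - 1).
def sample_stdev(values)
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

days = [0, 10, 20, 30, 100, 681] # hypothetical durations, not the real data

summary = {
  min:    days.min,
  q1:     quantile(days, 0.25),
  median: quantile(days, 0.5),
  q3:     quantile(days, 0.75),
  max:    days.max,
  sstdev: sample_stdev(days)
}
```

On this sample q1 is 12.5, the median is 25.0, and q3 is 82.5. The single large value pulls the standard deviation up while leaving the quartiles almost untouched, which is the same pattern visible in the real output above.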

Then we use Gnuplot to draw the same data with a boxplot:

reset 

set terminal svg size 1200,800
set grid
set title "Books Duration"
set style data boxplot
set style boxplot
set xtics ("Duration" 1)

plot data using (1.0):2

boxplot.svg

Another interesting plot shows duration and length, to see whether there is any correlation between the two. In general we should expect longer books to take more time, but this is not necessarily the case.

reset 

set terminal svg size 1600,1200
set grid
set title "Books Reading Duration and Length"
set ylabel "Pages"

set xlabel "Days"
set xrange [0:150]
set mxtics 10

set grid mxtics mytics lc rgb("#AAAAAA")

plot data using 2:3 with points pt 5 notitle, \
     '' using 2:($3+15):($1) with labels notitle

duration-vs-length.svg
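The visual impression from the scatter plot can be complemented with a number. The sketch below computes the Pearson correlation coefficient in plain Ruby; the pearson function is my own helper, not part of the original page, and the six (days, pages) pairs are a small subset copied from the table above, so the result says nothing about the full data set.

```ruby
# Pearson correlation coefficient between two equally long numeric arrays.
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# A few (days, pages) pairs copied from the table above.
pairs = [[681, 388], [204, 1969], [121, 1595], [30, 246], [9, 84], [1, 61]]

r = pearson(pairs.map(&:first), pairs.map(&:last))
```

A value of r close to 1 would mean that longer books reliably take more days, a value close to 0 that duration and length are largely unrelated.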

The last two plots are dedicated to understanding which genres I read in fewer calendar days. This is not necessarily a measure of the quality of a book, since more complex books might take more time to read and yet be more interesting than books read quickly. On the other hand, it might indicate an increased interest in reading the book.

In general, the most natural structure for input data in Gnuplot is with each variable taking its own column. The boxplot style, however, can take a fourth column, which refers to a categorical variable used to split the data into one box per category.

reset 

set terminal svg size 1600,600

set title "Books Reading Speed by Genre"

set ylabel "Pages per Day"
set grid
set nokey
set style data boxplot
set style boxplot
set datafile missing "N/A"
set style fill transparent solid 0.1

plot data using (1.0):4:(0.5):5 lc variable

genre-by-page-speed.svg
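Gnuplot does the grouping by itself, so no extra code is needed for the plot. Purely as an illustration of what the categorical column and the set datafile missing "N/A" declaration amount to, here is a sketch of the same grouping in Ruby. The four rows are copied from the table above (without the rating column); the variable names are hypothetical.

```ruby
# Rows copied from the table above: title, days, pages, pages per day, genre.
rows = [
  ["Solaris",             30, 246, "8.20",  "science fiction"],
  ["The Naked Sun",       30, 275, "9.17",  "science fiction"],
  ["Pista nera",           9, 249, "27.67", "crime"],
  ["Who Moved My Cheese",  0,  32, "N/A",   "management"]
]

# Skip rows without a speed, then collect the speeds under each genre.
speeds_by_genre = rows
  .reject { |row| row[3] == "N/A" }
  .group_by { |row| row[4] }
  .transform_values { |group| group.map { |row| row[3].to_f } }
```

Each key of speeds_by_genre corresponds to one box in the plot, and the management row disappears because its speed is missing.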