Coloring (and Drawing) Outside the Lines in ggplot

Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos:

I had seen and appreciated Ann’s post on her makeover of the main graphic in [NPR’s story](http://www.npr.org/sections/money/2014/10/21/357629765/when-women-stopped-coding) and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.

Here’s the NPR graphic:

[Image: NPR Planet Money chart from "When Women Stopped Coding"]

It is actually pretty darn good on its own, but I also agree with Ann that direct labeling could have made it better. Here’s her makeover:

Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the `library` stuff out of the way:

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)

Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:

# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)
 
gender <- read.csv(fil, stringsAsFactors=FALSE)
 
# take a look at the CSV structure ----------------------------------------
 
glimpse(gender)
 
## Observations: 48
## Variables:
## $ date              (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School    (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School        (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science  (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...
 
tail(gender)
 
##    date Medical School Law School Physical Sciences Computer science
## 43 2008           0.48       0.47              0.41            0.177
## 44 2009           0.48       0.47              0.42            0.179
## 45 2010           0.48       0.47              0.41            0.182
## 46 2011           0.47         tk                tk            0.177
## 47 2012           0.47         tk                tk            0.182
## 48 2013           0.46         tk                              0.179

Those `tk` values are referred to in the [code that makes the NPR graphic](http://apps.npr.org/dailygraphics/graphics/women-cs/js/graphic.js) so we’ll replace them with `NA`s and make all the columns numeric:

# as.numeric() turns "tk" and blank strings into NA (expect a coercion warning)
gender <- mutate_each(gender, funs(as.numeric))
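If you'd rather not lean on the coercion warning, one option is to tell `read.csv` about the placeholders up front. This is only a sketch on a made-up three-row CSV (the `csv_text` and `toy` names are mine, not NPR's); `na.strings` is a standard `read.csv` argument, and the same idea should carry over to the real file:

```r
# Made-up CSV text that mimics the "tk" and blank placeholders in the NPR file
csv_text <- "date,Law.School,Computer.science
2010,0.47,0.182
2011,tk,0.177
2013,,0.179"

# Treat "tk" and empty fields as missing while reading, so the
# affected column is parsed as numeric instead of character
toy <- read.csv(text = csv_text, na.strings = c("tk", ""),
                stringsAsFactors = FALSE)

str(toy)
```

With the placeholders declared as missing, there should be no character columns left to convert afterwards.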

We should also clean up the column names since we’ll be using them for the legend and the direct labels:

colnames(gender) <- str_replace(colnames(gender), "\\.", " ")
 
gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5],
                                  ordered=TRUE))

That code link also has the colors NPR used for the graphic, so let’s define those now since we bothered to look at it:

gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]

We’ll need those names later on, which is why I named the values in the vector.
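As a quick illustration of why the names matter, here is a toy two-color vector (`toy_colors` is a made-up name, and this is not the full palette). A named vector lets you pull a color out by series name rather than by position, and `scale_color_manual` likewise matches named `values` to the factor levels by name:

```r
# Two of the colors from above, keyed by series name
toy_colors <- c("#11605E", "#D8472B")
names(toy_colors) <- c("Medical School", "Computer science")

# Look a color up by name; the order of the vector no longer matters
toy_colors[["Computer science"]]

# Reordering the vector does not change what a name resolves to
rev(toy_colors)[["Computer science"]]
```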

With the data, labels and colors defined, we can make a “standard” ggplot:

chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))
 
gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(labels=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg

[Image: Rplot01, the standard ggplot version of the chart]

That’s also pretty good on its own. It’s possible to make it look a bit more like the NPR chart, but it’s hard to format a title & subtitle in a ggplot title _and_ have it left-justified, so I opted for font style. It’s also possible to make the legend look like NPR’s but that’s not the point of the post.

So, how do we make this look more like Ann’s makeover?

First we need to get the last values for each of the variables so we know what point on the `y` axis we need to place the labels. That’s made a bit trickier with the `NA`s:

last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))
last_date <- tail(gender$date, 1) + 1 # one "year" past the final observation
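Here is the same idea in base R on a made-up data frame, in case the `sapply`/`last` combination looks opaque. The `toy`, `last_non_na`, `toy_last` and `toy_label_x` names are purely illustrative:

```r
# Made-up data: series "a" stops reporting two years before series "b"
toy <- data.frame(date = 2010:2013,
                  a = c(0.47, 0.47, NA, NA),
                  b = c(0.18, 0.17, 0.18, 0.17))

# Drop the missing values, then keep the final remaining one
last_non_na <- function(x) tail(x[!is.na(x)], 1)

# Named numeric vector: one y position per series
toy_last <- sapply(toy[c("a", "b")], last_non_na)

# A single x position, one "year" past the final date
toy_label_x <- max(toy$date) + 1

toy_last
toy_label_x
```

Note the `1` in `tail(x, 1)`: without it, `tail` returns the last six elements rather than just the final one.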

Next, we need to turn off the legend and increase the plot margin on the right-hand side:

gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))

I figured out those numbers by interactive trial and error, though I initially guessed `6` for the right-hand margin increase. This should also demonstrate one reason for the `gg <- gg +` madness you see in my code/posts: once you start doing more in ggplot, you end up with that idiom more often than not.

Now, we add the labels. We do it with custom annotations that are placed "one year" after the latest `x` value and at the same `y` value as the last reading of each area. We also color each label the same as its line, which is why we needed a named vector.

# one text grob per series, colored to match its line
for (i in seq_along(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8, 
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014, # one "year" past 2013
                               ymin=last_vals[i], ymax=last_vals[i])
}

Finally, we have to do some of the remaining work by hand, since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level. It’s not that scary or complex a task, though. Also, since we are manipulating the built ggplot object, we have to use `grid.draw` to present our chart:

gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)
 
gt$layout$clip[gt$layout$name=="panel"] <- "off"
 
grid.draw(gt)
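For what it's worth, ggplot2 3.0.0 and later can turn off clipping directly with `coord_cartesian(clip = "off")`, which avoids the gtable step. A minimal sketch, assuming such a version and using made-up data with `geom_text` instead of `annotation_custom`; the `toy`, `ends` and `p` names are illustrative, and this is an alternative to the approach above rather than a replacement for it:

```r
library(ggplot2)

# Made-up long-format data: one row per year and series
toy <- data.frame(
  date  = rep(2010:2013, times = 2),
  area  = rep(c("Law School", "Computer science"), each = 4),
  value = c(0.47, 0.47, 0.47, 0.47, 0.182, 0.177, 0.182, 0.179)
)

# One label row per series: the observation at the final date
ends <- toy[toy$date == max(toy$date), ]

p <- ggplot(toy, aes(x = date, y = value, color = area)) +
  geom_line() +
  # left-justified labels just past the end of each line
  geom_text(data = ends, aes(label = area),
            hjust = 0, nudge_x = 0.2, size = 3) +
  # allow drawing outside the panel area
  coord_cartesian(clip = "off") +
  theme_bw() +
  theme(legend.position = "none",
        plot.margin = grid::unit(c(1, 7, 2, 1), "lines"))
```

Because `p` stays a plain ggplot object here, printing it or passing it to `ggsave()` should keep the margin labels, though the margin values will likely need tuning. With real data that has trailing `NA`s, `ends` would need to hold the last non-missing row per series instead.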

Here’s the result:

[Image: Rplot02, the chart with direct labels in the right margin]

I’ve deliberately left the fonts a bit small and have not changed their positions on the `y`-axis to give readers a bit of homework. They both _should_ be changed and the plot margins could also be tweaked a tad. You can find the complete code [on GitHub](https://gist.github.com/hrbrmstr/83deb0baeabae0824389) so tweak away!

If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.


8 Comments

  1. Paul Brennan

    Nice post and very useful to see this done in ggplot.
    Thanks.
    You seem to have a bit of javascript comment line left in fourth coding panel above. It’s not in the github code but it might be good to remove it from the above page.

  5. jrag

    Thanks for your post, it’s very interesting

    I suggest to use ‘melt’ function from reshape2 package to generate gender_long more easily :

    gender_long <- melt(gender, id.vars=c("date"))

    The main difference is that ‘area’ column is now labelled ‘variable’

  6. Bea

    Thank you so much! One more question, when you save the graph, the text outside the plot seems to disappear for me. How can I avoid that? Thank you!
