Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos:
Is there a way to do this in #rstats #ggplot2 https://t.co/kxWQFlYpbB
— Jonathan Kalodimos (@JonKalodimos) August 27, 2015
I had seen and appreciated Ann’s post on her makeover of the main graphic in [NPR’s story](http://www.npr.org/sections/money/2014/10/21/357629765/when-women-stopped-coding) and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.
Here’s the NPR graphic:
It is actually pretty darn good on its own, but I also agree with Ann that direct labeling could have made it better. Here’s her makeover:
Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the `library` stuff out of the way:
```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)
```
Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:
```r
# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)

gender <- read.csv(fil, stringsAsFactors=FALSE)

# take a look at the CSV structure ----------------------------------------
glimpse(gender)
## Observations: 48
## Variables:
## $ date (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...

tail(gender)
## date Medical School Law School Physical Sciences Computer science
## 43 2008 0.48 0.47 0.41 0.177
## 44 2009 0.48 0.47 0.42 0.179
## 45 2010 0.48 0.47 0.41 0.182
## 46 2011 0.47 tk tk 0.177
## 47 2012 0.47 tk tk 0.182
## 48 2013 0.46 tk 0.179
```
Those `tk` values are referred to in the [code that makes the NPR graphic](http://apps.npr.org/dailygraphics/graphics/women-cs/js/graphic.js) so we’ll replace them with `NA`s and make all the columns numeric:
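```r
# as.numeric() cannot parse the "tk" strings, so they become NA (with a coercion warning)
# (mutate_each()/funs() are deprecated in later dplyr releases; across() is the newer idiom)
gender <- mutate_each(gender, funs(as.numeric))
```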
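To see what that coercion does without downloading anything, here is a minimal base R sketch. The `toy` data frame is a made-up stand-in that only mimics the shape of the NPR file, and wrapping the call in `suppressWarnings()` is an assumption on my part about how you might want to silence the "NAs introduced by coercion" message.

```r
# Toy stand-in for the NPR data (a few illustrative rows, not the real file)
toy <- data.frame(
  date = c(2010L, 2011L, 2012L),
  Law.School = c("0.47", "tk", "tk"),
  Computer.science = c(0.182, 0.177, 0.182),
  stringsAsFactors = FALSE
)

# Coerce every column to numeric; anything unparseable ("tk") becomes NA.
# Assigning into toy[] keeps the result a data frame.
toy[] <- lapply(toy, function(col) suppressWarnings(as.numeric(col)))

str(toy)
```

Every column ends up numeric, and the `tk` placeholders show up as `NA`, which `geom_line()` simply leaves out of the drawn line.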
We should also clean up the column names since we’ll be using them for the legend and the direct labels:
```r
colnames(gender) <- str_replace(colnames(gender), "\\.", " ")

gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5], ordered=TRUE))
```
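If the wide-to-long step is unfamiliar, the same reshaping can be sketched in base R. This is only an illustration of what `gather()` produces, not a replacement for it; the `toy` data frame holds two rows copied from the `tail()` output and the helper names (`toy`, `long`) are hypothetical.

```r
# Two rows and two fields from the data shown earlier, in "wide" form
toy <- data.frame(
  date = c(2009, 2010),
  "Law School" = c(0.47, 0.47),
  "Computer science" = c(0.179, 0.182),
  check.names = FALSE
)

# stack() piles the value columns on top of each other and records
# the source column name in a factor whose levels follow column order
long <- stack(toy[-1])
names(long) <- c("value", "area")

# Repeat the dates once per stacked column so each value keeps its year
long$date <- rep(toy$date, times = ncol(toy) - 1)
long <- long[c("date", "area", "value")]

long
levels(long$area)
```

The factor levels keep the original column order, which is what makes the legend (and later the labels) line up with the color vector.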
That code link also has the colors NPR used for the graphic, so let’s define those now since we bothered to look at it:
```r
gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]
```
We’ll need those names later on, which is why I named the values in the vector.
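A quick sketch of why the names matter: with a named vector, a color can be looked up by field label instead of by position. The names are typed out by hand here (an assumption, so the snippet runs without the data); they match the cleaned-up column names.

```r
# Same colors as above, with the names written out so this runs standalone
gender_colors <- c("#11605E", "#17807E", "#8BC0BF", "#D8472B")
names(gender_colors) <- c("Medical School", "Law School",
                          "Physical Sciences", "Computer science")

# Look a color up by label rather than by index
gender_colors["Computer science"]

# unname() drops the label when only the hex string is wanted
unname(gender_colors["Law School"])
```

`scale_color_manual()` also matches named `values` to factor levels by name, so the lines and any hand-placed labels stay in sync even if the column order changes.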
With the data, labels and colors defined, we can make a “standard” ggplot:
```r
chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))

gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(labels=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg
```
That’s also pretty good on its own. It’s possible to make it look a bit more like the NPR chart, but it’s hard to format a title & subtitle in a ggplot title _and_ have it left-justified, so I opted for font style. It’s also possible to make the legend look like NPR’s but that’s not the point of the post.
So, how do we make this look more like Ann’s makeover?
First we need to get the last values for each of the variables so we know what point on the `y` axis we need to place the labels. That’s made a bit trickier with the `NA`s:
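```r
# last non-NA value in each column: the y position for each label
last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))

# one year past the final date (2013 + 1 = 2014): the x position for the labels
# tail() needs n=1 here, otherwise it returns the last six dates
last_date <- tail(gender$date, 1) + 1 # doing this ^ would have made it a double
```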
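The "last non-missing value per column" idea can be sketched in base R on toy data. The column names `a` and `b` and their values are invented for illustration; only the technique carries over.

```r
# Toy columns, one of which ends in NAs (illustrative values only)
toy <- data.frame(
  date = 2009:2013,
  a = c(0.40, 0.41, 0.42, NA, NA),
  b = c(0.17, 0.18, 0.18, 0.18, 0.179)
)

# For each value column: drop the NAs, then keep the final element
last_vals <- sapply(toy[c("a", "b")], function(col) tail(col[!is.na(col)], 1))

# One "year" past the final x value; tail(x, 1) returns a single element
last_date <- tail(toy$date, 1) + 1

last_vals
last_date
```

Because `sapply()` is run over named columns, the result is a named vector, which is convenient when the same names index a color vector.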
Next, we need to turn off the legend and increase the plot margin on the right-hand side:
```r
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))
```
I figured out those #’s by interactive trial-and-error, though I initially guessed `6` for the right-hand margin increase. Also, this should demonstrate one reason for the `gg <- gg +` madness you see in my code/posts since, when you start doing more in ggplot, you end up with that idiom more oft than not.

Now, we add the labels. We do it with custom annotations that are placed "one year" after the latest `x` value and at the same `y` value as the last reading of each area. We also color the label the same as the line, which is why we needed a named vector.
```r
# one text grob per field, anchored at x=2014 and at that field's last y value
for (i in seq_along(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8,
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014,
                               ymin=last_vals[i], ymax=last_vals[i])
}
```
Finally, we have to do some of the remaining work by hand since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level, but it’s not that scary or complex of a task. Also, since we are manipulating the built ggplot object, we have to use `grid.draw` to present our chart:
```r
gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)

gt$layout$clip[gt$layout$name=="panel"] <- "off"

grid.draw(gt)
```
Here’s the result:
I’ve deliberately left the fonts a bit small and not changed their positions on the `y`-axis to give readers a bit of homework. They both _should_ be changed and the plot margins could also be tweaked a tad. You can find the complete code [on github](https://gist.github.com/hrbrmstr/83deb0baeabae0824389) so tweak away!
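One possible shortcut for the clipping step: ggplot2 3.0.0 and later (released after this post was written) accept `clip = "off"` in `coord_cartesian()`, so the gtable surgery may not be needed. The sketch below assumes such a version is installed and uses made-up data and series names; it also swaps `annotation_custom()` for a plain `geom_text()` layer, which is a different technique from the one in the post.

```r
library(ggplot2)

# Made-up series purely for illustration
toy <- data.frame(
  date  = rep(2010:2013, 2),
  value = c(0.40, 0.42, 0.45, 0.47, 0.20, 0.19, 0.18, 0.18),
  area  = rep(c("Series A", "Series B"), each = 4)
)

# Last observation of each series: where the labels will sit
ends <- toy[toy$date == max(toy$date), ]

gg <- ggplot(toy, aes(x = date, y = value, color = area)) +
  geom_line() +
  # negative hjust pushes the text just to the right of the final point
  geom_text(data = ends, aes(label = area), hjust = -0.1, size = 3) +
  coord_cartesian(clip = "off") +  # allow drawing outside the panel area
  theme_bw() +
  theme(legend.position = "none",
        plot.margin = grid::unit(c(1, 7, 2, 1), "lines"))

gg
```

Since the result is still an ordinary ggplot object, it prints the usual way and no `grid.draw()` call is required.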
If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.