Poynter did a nice interactive piece on world population by income (i.e. “How Many Live on How Much, and Where”). I’m always on the lookout for optimized shapefiles and clean data (I’m teaching a data science certificate program starting this Fall) and the speed of the site load and the easy availability of the data set made this one a “must acquire”. Rather than just repeat Poynter’s D3-goodness, here’s a way to look at the income data in series of small multiple choropleths—using R & ggplot2
—that involves:
- downloading data & shapefiles from a web site
- using
dplyr
&tidyr
for data munging - applying custom fill color scale mapping in
ggplot
- ordering plots with a custom facet order (using
factor
s) - tweaking the
theme
andaes
thetics for a nicely finished result
By using D3, Poynter inherently made the data available. Pop open the “Developer Tools” in any browser, reload the page and look at the “Network” tab and you’ll see a list of files (you can sometimes see things in the source code, but this technique is often faster). The income data is a well-formed CSV file http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv and their highly optimized world map was also easy to discern http://www.pewglobal.org/wp-content/lib/js/world-geo.json. We’ll start by grabbing the map and using the same map projection that Poynter did (Robinson). Don’t be put off by all the library
calls since one of the best parts of R is the ever-increasing repository of great packages to help you get things done.
library(httr) # getting data library(rgdal) # working with shapefile library(dplyr) # awesome data manipulation library(readr) # faster reading of CSV data library(stringi) # string manipulation library(stringr) # string manipulation library(tidyr) # reshaping data library(grid) # for 'unit' library(scales) # for 'percent' library(ggplot2) # plotting library(ggthemes) # theme_map # this ensures you only download the shapefile once and hides # errors and warnings. remove `try` and `invisible` to see messages try(invisible(GET("http://www.pewglobal.org/wp-content/lib/js/world-geo.json", write_disk("world-geo.json"))), silent=TRUE) # use ogrListLayers("world-geo.json") to see file type & # layer info to use in the call to readOGR world <- readOGR("world-geo.json", "OGRGeoJSON") world_wt <- spTransform(world, CRS("+proj=robin")) world_map <- fortify(world_wt)
I would have liked to do fortify(world_wt, region="name")
(since that makes working with filling in countries by name much easier in the choropleth part of the code) but that generated TopologyException
errors (I’ve seen this happen quite a bit with simplified/optimized shapefiles and some non-D3 geo-packages). One can sometimes fix those with a strategic rgeos::gBuffer
call, but that didn’t work well in this case. We can still use country names with a slight rejiggering of the fortified data frame using dplyr
:
world_map %>% left_join(data_frame(id=rownames(world@data), name=world@data$name)) %>% select(-id) %>% rename(id=name) -> world_map
Now it’s time to get the data. The CSV file has annoying spaces in it that causes R to interpret all the columns as strings, so we can use dplyr
again to get them into the format we want them in. Note that I’m also making the percentages decimals so we can use percent
later on to easily format them.
# a good exercise would be to repeat the download code above # rather and make repeated calls to an external resource read_csv("http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv") %>% mutate_each(funs(str_trim)) %>% filter(id != "None") %>% mutate_each(funs(as.numeric(.)/100), -name, -id) -> dat
For this post, we’ll only be working with the actual share percentages, so let’s:
- ignore the “change” columns
- convert the data frame from wide to long
- extract out the income levels (e.g. “Poor”, “Low Income”…)
- set a factor order for them so our plots will be in the correct sequence
dat %>% gather(share, value, starts_with("Share"), -name, -id) %>% select(-starts_with("Change")) %>% mutate(label=factor(stri_trans_totitle(str_match(share, "Share ([[:alpha:]- ]+),")[,2]), c("Poor", "Low Income", "Middle Income", "Upper-Middle Income", "High Income"), ordered=TRUE)) -> share_dat
The stringi
package is really handy (stringr
is built on it, too). The stri_trans_totitle
function alleviates some mundane string operations and the stri_replace_all_regex
(below) also allows us to do vectorized regular expression replacements without a ton of code.
To keep the charts aligned, we’ll use Poynter’s color scale (which was easy to extract from the site’s code) and use the same legend breaks via `cut’. We’ll also format the labels for these breaks to make our legend nicer to view.
# use same "cuts" as poynter poynter_scale_breaks <- c(0, 2.5, 5, 10, 25, 50, 75, 80, 100) sprintf("%2.1f-%s", poynter_scale_breaks, percent(lead(poynter_scale_breaks/100))) %>% stri_replace_all_regex(c("^0.0", "-NA%"), c("0", "%"), vectorize_all=FALSE) %>% head(-1) -> breaks_labels share_dat %>% mutate(`Share %`=cut(value, c(0, 2.5, 5, 10, 25, 50, 75, 80, 100)/100, breaks_labels))-> share_dat share_pal <- c("#eaecd8", "#d6dab3", "#c2c98b", "#949D48", "#6e7537", "#494E24", "#BB792A", "#7C441C", "#ffffff")
Finally, we get to the good part and start plotting the visualization. There are only two layers: the base map and the choropleth filled map. We then:
- apply our manual color palette to them
- remove the line color slashes in the legend boxes
- setup the overall plot label
- tell
ggplot
which coordinate system to use (in this casecoord_equal
is fine since we already projected the points) - apply a base theme that’s good for mapping
- tweak the text and ensure our legend is in the position we want it to be ing
gg <- ggplot() gg <- gg + geom_map(data=world_map, map=world_map, aes(x=long, y=lat, group=group, map_id=id), color="#7f7f7f", fill="white", size=0.15) gg <- gg + geom_map(data=share_dat, map=world_map, aes(map_id=name, fill=`Share %`), color="#7f7f7f", size=0.15) gg <- gg + scale_fill_manual(values=share_pal) gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA))) gg <- gg + labs(title="World Population by Income\n") gg <- gg + facet_wrap(~label, ncol=2) gg <- gg + coord_equal() gg <- gg + theme_map() gg <- gg + theme(panel.margin=unit(1, "lines")) gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24)) gg <- gg + theme(legend.title=element_text(face="bold", hjust=0, size=12)) gg <- gg + theme(legend.text=element_text(size=10)) gg <- gg + theme(strip.text=element_text(face="bold", size=10)) gg <- gg + theme(strip.background=element_blank()) gg <- gg + theme(legend.position="bottom") gg
And, here’s the result (click for larger version):
The optimized shapefile makes for a very fast plot and you can plot individual chorpleths by filter
ing the data and not using facets.
While there are a number of choropleth packages out there for R, learning how to do the core components on your own can (a) make you appreciate those packages a bit more (b) give you the skills to do them on your own when you need a more customized version. Many of the theme tweaks will also apply to the ggplot
-based choropleth packages.
With this base, it should be a fun exercise to the reader to do something similar with Poynter’s “percentage point change” choropleth. You’ll need to change the color palette and manipulate different data columns to get the same scales and visual representation they do. Drop a note in the comments if you give this a go!
You can find all the code in this post in one convenient gist.