This occurrence of the bi-annual corruption of the space-time continuum (i.e. changing to/from standard/daylight time) in the U.S. caused me to make a slight change to the code [from an older post](https://rud.is/b/2014/09/23/seeing-the-daylight-with-r/). The `daylight()` function now auto-discovers the date and location information (via [telize](http://www.telize.com/)) from the caller, which means all you have to do to get a plot like this:
is to source the [new gist](https://gist.github.com/hrbrmstr/e435d4fa0c31b8e1a9d0) like this:
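A sketch of that call, assuming `devtools` (the SHA1 below is a placeholder for the published hash; more on that in a moment):

```r
library(devtools)

# pin the gist to a specific revision; replace the placeholder with
# the SHA1 hash published alongside the gist
source_gist("e435d4fa0c31b8e1a9d0", sha1 = "<published-sha1-hash>")
```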
Remember that you should verify any code before blindly `source`ing it (in R or anywhere else) and make sure to use the SHA1 hash so you know you’re sourcing the proper code (and not potentially being pwnd).
Note that the granularity/accuracy of the geolocation is only as good as the Telize service (which uses MaxMind). The fact that this shows Vermont instead of Maine should make you all think thrice about trusting IP geolocation in general, especially you world-mapping cybersecurity folks.
Junk Charts [adeptly noted and fixed](http://junkcharts.typepad.com/junk_charts/2015/10/is-it-worth-the-drama.html) this excessively stylized chart from the WSJ this week:
Their take on it does reduce the ZOMGOSH WE ARE DOOMED! look and feel of the WSJ chart:
But, we can further reduce the drama by using a more neutral color encoding _and_ encoding both the # of outbreaks and the total size of the impacted flock populations _per week_ with a lollipop chart (and, thankfully, the USDA makes this data readily available):
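Here's a rough sketch of that lollipop approach (the `outbreaks` data frame and its `week`, `n_outbreaks` and `flock_total` columns are hypothetical stand-ins for the USDA data):

```r
library(ggplot2)
library(scales)

# stems encode total flock size; dot size encodes # of outbreaks
gg <- ggplot(outbreaks, aes(x = week, y = flock_total))
gg <- gg + geom_segment(aes(xend = week, yend = 0), color = "#9ecae1", size = 1)
gg <- gg + geom_point(aes(size = n_outbreaks), color = "#3182bd")
gg <- gg + scale_y_continuous(labels = comma)
gg <- gg + scale_size_continuous(name = "# outbreaks")
gg <- gg + labs(x = NULL, y = "Flock size impacted")
gg <- gg + theme_minimal()
gg
```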
There was some chatter on the twitters this week about a relatively new D3-based charting library called [TauCharts](http://taucharts.com/) (also @taucharts). The API looked pretty clean and robust, so I started working on an htmlwidget for it and was quickly joined by the Widget Master himself, @timelyportfolio.
TauCharts definitely has a “grammar of graphics” feel about it and the default aesthetics are super-nifty. While the developers are actively adding new features and “geoms”, the core points (think scatterplot), lines and bars (including horizontal bars!) geoms are quite robust and definitely ready for your dashboards.
Between the two of us, we have a _substantial_ part of the [charting library API](http://api.taucharts.com/) covered. I think the only major thing left unimplemented is composite charts (i.e. lines + bars + points on the same chart) and some minor tweaks around the edges.
While you can find it on [github](http://github.com/hrbrmstr/taucharts) and do the normal:
```r
devtools::install_github("hrbrmstr/taucharts")
```
or, even better, use the official initial release version; I'd suggest that one for the example below, mostly since it includes the data set I want to use to mimic the current, featured example on the [TauCharts homepage](http://taucharts.com/) and also has full documentation with examples.
Here’s all it takes to make a faceted scatterplot:
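A sketch of that example (the `cars_data` dataset ships with the package; the column names here are my best recollection of it, so treat them as assumptions):

```r
library(taucharts)

data(cars_data, package = "taucharts")

tauchart(cars_data) %>%
  tau_point("milespergallon", c("class", "price"), color = "class") %>%
  tau_trendline() %>%
  tau_legend() %>%
  tau_tooltip()
```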
*(Chart: "Hybrid cars fuel economy by price and class. It seems expensive cars are less efficient.")*
There are _tons_ more examples in the [TauCharts RPub](http://rpubs.com/hrbrmstr/taucharts) (and soon-to-be vignette) and @timelyportfolio will be featuring it in his weekly [widget update](http://www.buildingwidgets.com/).
Nathaniel Smith and Stéfan van der Walt presented a new colormap (for Python) at SciPy 2015 called viridis.
From the authors:
> The default colourmap in Matplotlib is the colourful rainbow-map called Jet, which is deficient in many ways: small changes in the data sometimes produce large perceptual differences and vice-versa; its lightness gradient is non-monotonic; and, it is not particularly robust against color-blind viewing. Thus, a new default colormap is needed — but no obvious candidate has been found. Here, we present our proposed new default colormap for Matplotlib, and expose the theory, tools, data exploration and motivations behind its design.
You can also find out a tad more about their other colormap designs (a.k.a. the runners-up), along with Parula, which is a proprietary MATLAB color map.
Simon Garnier (@sjmgarnier) took Nathaniel & Stéfan’s work and turned it into an R package.
Noam Ross (@noamross) & I piled on shortly thereafter to add some ggplot color scale_ functions which are (for now) only available in Simon’s github repo.
Rather than duplicate the examples already provided in the documentation of those functions, I thought there might be more efficacy in creating a post that helped showcase why you should switch from rainbow (et al) to viridis.
Since folks seem to like maps, we’ll work with one for the example, but let’s get some package machinations out of the way first:
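Something like the following (this particular package list is my assumption for reproducing the exercise):

```r
library(viridis)      # the new colormap
library(raster)       # reading & projecting the GeoTIFF
library(rasterVis)    # handy raster plotting
library(dichromat)    # simulating color-vision deficiencies
library(colorspace)   # desaturation
library(RColorBrewer) # the ColorBrewer gradient
```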
Now, we’ll need a map to work with so let’s grab a U.S. max temperature GeoTIFF raster from NOAA (from the bitter cold month of February 2015) and project it to something more reasonable:
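A sketch of that step (the GeoTIFF file name is a placeholder for the downloaded NOAA raster, and the equal-area projection string is just one reasonable choice):

```r
# read the downloaded GeoTIFF (placeholder file name)
tmax <- raster("tmax_feb2015.tif")

# project it to a Lambert azimuthal equal-area CRS
tmax_proj <- projectRaster(tmax, crs = CRS("+proj=laea +lat_0=45 +lon_0=-100"))
```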
Let’s take a look at various color palettes with different types of vision. We’ll use a 3×2 grid (a sketch of the palette setup follows this list) and:

- use 4 color palettes from grDevices,
- make a gradient palette from one of the ColorBrewer sequential palettes, and
- then (finally) use a viridis color palette.
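Here's a sketch of that palette setup, 256 steps each (the specific sequential ColorBrewer palette is my pick):

```r
pals <- list(
  rainbow = rainbow(256),
  heat    = heat.colors(256),
  terrain = terrain.colors(256),
  topo    = topo.colors(256),
  brewer  = colorRampPalette(brewer.pal(9, "YlOrRd"))(256),
  viridis = viridis(256)
)
```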
We’ll take this grid of 6 maps and view it through the eyes of three different types of color vision as well as a fully desaturated version. Note that I’m not adding much cruft to the map display (including legends) since this isn’t about the values so much as it is about the visual perception of the colormaps.
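If you want to approximate the vision-adjusted versions yourself, here's one way (the `dichromat` + `colorspace` combination is my choice; the original may have used different tooling):

```r
# simulate the three color-vision deficiency types plus full desaturation
# for one palette; repeat for each entry in `pals`
deutan <- dichromat(pals$viridis, type = "deutan")
protan <- dichromat(pals$viridis, type = "protan")
tritan <- dichromat(pals$viridis, type = "tritan")
gray   <- desaturate(pals$viridis)
```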
Remember you can select/click/tap the map grids for (slightly) larger versions.
All of the maps convey the differences in max temperature. If you happen to have “normal” color vision you should be drawn to the bottom two (ColorBrewer on the left and Viridis on the right). They are both sequential and convey the temperature changes more precisely (and they aren’t as gosh-awful ugly as the other four).
While the ColorBrewer gradient may be “good”, Viridis is designed to be:
- colorful & “pretty”
- sequential (to not impose other structure on exploratory data analysis visualizations)
- perceptually uniform (i.e. changes in the data should be accurately decoded by our brains) even when desaturated
- accessible to colorblind viewers
and seems to meet those goals quite well.
Take a look at each of the vision-adjusted examples:
Hopefully both the ColorBrewer gradient and Viridis palettes stood out as conveying the temperature data with more precision and more consistently across all non-standard vision types as you progressed through each one.
To see this for yourself in your own work, grab Simon’s package and start substituting viridis for some of your usual defaults to see if it makes a difference in helping you convey the story your data is trying to tell, both more accurately and for a more diverse audience. Remember, the github version (which will be on CRAN soon) has handy ggplot scale_ functions to make using viridis as painless as possible.
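For instance, a quick sketch using the github version's `scale_fill_viridis()` (the toy surface below is mine, purely to show the scale in action):

```r
library(ggplot2)
library(viridis)

df <- expand.grid(x = 1:100, y = 1:100)
df$z <- with(df, sin(x / 10) + cos(y / 10))

ggplot(df, aes(x, y, fill = z)) +
  geom_tile() +
  scale_fill_viridis()
```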
I also updated my Melbourne Walking EDA project to use the viridis palette instead of parula (which I was only really using in defiance of MATLAB’s inane restrictions).
Poynter did a nice interactive piece on world population by income (i.e. “How Many Live on How Much, and Where”). I’m always on the lookout for optimized shapefiles and clean data (I’m teaching a data science certificate program starting this Fall) and the speed of the site load and the easy availability of the data set made this one a “must acquire”. Rather than just repeat Poynter’s D3-goodness, here’s a way to look at the income data in a series of small-multiple choropleths—using R & ggplot2—that involves:
- downloading data & shapefiles from a web site
- using dplyr & tidyr for data munging
- applying custom fill color scale mapping in ggplot
- ordering plots with a custom facet order (using factors)
- tweaking the theme and aesthetics for a nicely finished result
By using D3, Poynter inherently made the data available. Pop open the “Developer Tools” in any browser, reload the page and look at the “Network” tab and you’ll see a list of files (you can sometimes see things in the source code, but this technique is often faster). The income data is a well-formed CSV file (http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv) and their highly optimized world map was also easy to discern (http://www.pewglobal.org/wp-content/lib/js/world-geo.json). We’ll start by grabbing the map and using the same map projection that Poynter did (Robinson). Don’t be put off by all the library calls since one of the best parts of R is the ever-increasing repository of great packages to help you get things done.
```r
library(httr)     # getting data
library(rgdal)    # working with shapefile
library(dplyr)    # awesome data manipulation
library(readr)    # faster reading of CSV data
library(stringi)  # string manipulation
library(stringr)  # string manipulation
library(tidyr)    # reshaping data
library(grid)     # for 'unit'
library(scales)   # for 'percent'
library(ggplot2)  # plotting
library(ggthemes) # theme_map

# this ensures you only download the shapefile once and hides
# errors and warnings. remove `try` and `invisible` to see messages
try(invisible(GET("http://www.pewglobal.org/wp-content/lib/js/world-geo.json",
                  write_disk("world-geo.json"))), silent=TRUE)

# use ogrListLayers("world-geo.json") to see file type &
# layer info to use in the call to readOGR
world    <- readOGR("world-geo.json", "OGRGeoJSON")
world_wt <- spTransform(world, CRS("+proj=robin"))
world_map <- fortify(world_wt)
```
I would have liked to do fortify(world_wt, region="name") (since that makes working with filling in countries by name much easier in the choropleth part of the code) but that generated TopologyException errors (I’ve seen this happen quite a bit with simplified/optimized shapefiles and some non-D3 geo-packages). One can sometimes fix those with a strategic rgeos::gBuffer call, but that didn’t work well in this case. We can still use country names with a slight rejiggering of the fortified data frame using dplyr:
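Here's a sketch of that rejiggering (it assumes the GeoJSON features carry a `name` property and that the fortified polygon ids line up with the rownames of the data slot):

```r
world_map %>%
  left_join(data_frame(id = rownames(world@data),
                       name = as.character(world@data$name)),
            by = "id") %>%
  select(-id) %>%
  rename(id = name) -> world_map
```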
Now it’s time to get the data. The CSV file has annoying spaces in it that cause R to interpret all the columns as strings, so we can use dplyr again to get them into the format we want. Note that I’m also making the percentages decimals so we can use `percent` later on to easily format them.
```r
# a good exercise would be to repeat the download code above
# rather than make repeated calls to an external resource
read_csv("http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv") %>%
  mutate_each(funs(str_trim)) %>%
  filter(id != "None") %>%
  mutate_each(funs(as.numeric(.)/100), -name, -id) -> dat
```
For this post, we’ll only be working with the actual share percentages, so let’s:
- ignore the “change” columns
- convert the data frame from wide to long
- extract out the income levels (e.g. “Poor”, “Low Income”…)
- set a factor order for them so our plots will be in the correct sequence
The stringi package is really handy (stringr is built on it, too). The stri_trans_totitle function alleviates some mundane string operations and the stri_replace_all_regex (below) also allows us to do vectorized regular expression replacements without a ton of code.
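Here's a sketch of those steps (the column-name pattern and the full list of income levels are assumptions about the CSV):

```r
dat %>%
  select(-contains("Change")) %>%             # ignore the "change" columns
  gather(category, share, -name, -id) %>%    # wide -> long
  mutate(category = stri_trans_totitle(
           stri_replace_all_regex(category, "^Share ([[:alpha:] -]+),.*$", "$1"))) %>%
  mutate(category = factor(category,          # factor order for the facets
                           levels = c("Poor", "Low Income", "Middle Income",
                                      "Upper-Middle Income", "High Income"),
                           ordered = TRUE)) -> share_dat
```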
To keep the charts aligned, we’ll use Poynter’s color scale (which was easy to extract from the site’s code) and use the same legend breaks via `cut`. We’ll also format the labels for these breaks to make our legend nicer to view.
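A sketch of the breaks and the faceted choropleth (the hex colors and break points below are stand-ins, not the values extracted from Poynter's code):

```r
share_pal    <- c("#f7fbff", "#c6dbef", "#6baed6", "#2171b5", "#08306b")
share_breaks <- c(0, 0.05, 0.15, 0.30, 0.60, 1)

share_dat$brk <- cut(share_dat$share,
                     breaks = share_breaks,
                     labels = percent(share_breaks[-1]),
                     include.lowest = TRUE)

gg <- ggplot()
gg <- gg + geom_map(data = world_map, map = world_map,
                    aes(x = long, y = lat, map_id = id),
                    color = "#7f7f7f", fill = "#ffffff", size = 0.15)
gg <- gg + geom_map(data = share_dat, map = world_map,
                    aes(map_id = name, fill = brk),
                    color = "#7f7f7f", size = 0.15)
gg <- gg + scale_fill_manual(name = "Share of population", values = share_pal)
gg <- gg + facet_wrap(~category, ncol = 2)
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme_map()
gg <- gg + theme(legend.position = "bottom")
gg
```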
And, here’s the result (click for larger version):
The optimized shapefile makes for a very fast plot and you can plot individual choropleths by filtering the data and not using facets.
While there are a number of choropleth packages out there for R, learning how to do the core components on your own can (a) make you appreciate those packages a bit more and (b) give you the skills to do them on your own when you need a more customized version. Many of the theme tweaks will also apply to the ggplot-based choropleth packages.
With this base, it should be a fun exercise to the reader to do something similar with Poynter’s “percentage point change” choropleth. You’ll need to change the color palette and manipulate different data columns to get the same scales and visual representation they do. Drop a note in the comments if you give this a go!
I’ve been slowly prodding the [metricsgraphics package](https://github.com/hrbrmstr/metricsgraphics/) towards a 1.0.0 release, but there are some rough edges that still need sorting out. One of them is the ability to handle passing in variables for the `x` & `y` accessor values (you _can_ pass in bare and quoted strings). This can now be achieved (in the `dev01` branch) via `mjs_plot_` and in `mjs_plot` proper in the github main branch thanks to a [PR](https://github.com/hrbrmstr/metricsgraphics/pull/31) by [Jonathan Owen](https://github.com/jrowen). If everything stays stable with the PR, I’ll just fold the code into `mjs_plot` for the `0.9.0` CRAN release.
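For example, a quick sketch of the standard-evaluation accessor (using `mtcars` purely for illustration):

```r
library(metricsgraphics)

x_var <- "wt"
y_var <- "mpg"

mjs_plot_(mtcars, x = x_var, y = y_var) %>%
  mjs_point()
```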
One other pending feature is the ability to turn _basic_ (single `geom_`) `ggplot` objects into `metricsgraphics` plots. Sometimes it’s just easier/nicer to “think” in `ggplot` and it may be the case that one might have coded a quick histogram/scatter/line plot in `ggplot` and want an equally quick interactive version. This can also now be achieved (again, in beta) via `as_mjsplot`. While the previous addition is fairly self-explanatory, this new one needs a few examples. Please note that the package installation is coming from the `dev01` branch:
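A minimal sketch of the conversion (again using `mtcars` for illustration):

```r
devtools::install_github("hrbrmstr/metricsgraphics", ref = "dev01")

library(ggplot2)
library(metricsgraphics)

# a basic single-geom ggplot object...
gg <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# ...turned into an interactive metricsgraphics plot
as_mjsplot(gg)
```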
As you can see, `as_mjsplot` will do its best to figure out the bins (if using `geom_histogram`) and also the axis labels. Support for converting `geom_vline` and `geom_hline` to markers and baselines (respectively) is a work in progress.
I’ve only done limited testing with some basic single `geom_` constructs, but if there are any bugs with it or feature requests (remember, the MetricsGraphics.js library has a very limited repertoire) please post an issue on GitHub tagging the `dev01` branch.
The recent announcement of the [start of egg rationing in the U.S.](http://www.washingtonpost.com/blogs/wonkblog/wp/2015/06/05/the-largest-grocer-in-the-texas-is-now-rationing-eggs/) made me curious enough about the avian flu outbreak to try to dig into the numbers a bit. I finally stumbled upon a [USDA site](http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVPBkqIwEP2WOexpCxMBAY-oo6Kybo01q3JJNSGB1GCgSNTi7zcwe3CnZnQ3h1Sl-3X369cdlKADSiRcRA5aVBLK7p14ZLVd2sMJtqPFbvyMox-_5nGw8Z3t0jWAowHgL06I_47friOvi3_Bk-VsiHcO2qMEJVTqWhfoCHUhFKGV1ExqUoq0gab9hhWQ6twQXtGz6l8gxQlKUjAodXFryYRioBgRklfNqW_i3X0RIG_xGdOMdm5F0pYoDZqZ1FQTEKQGKrighJftFdqOX01Fho7c9iiAzS3HG6WWm2HbSnmAzYXxyA3AG1L-R487Df-TntNFuHT9jVHQDWwczUywP44xjrxH8b2eDzL0gHsj-1Bk8TwxReabn_56ZeP1CB0NSf9LFmMX7f5TtdXDtmYoib9z5YadQnYTT-PcVABdWN2s0eHuDry7b3agN3y2A-jw6Q4YfnlZpeZD7Ke3REKZOoEh0jDOGtYMikppdLher4OzymCQVxdUn15PgdMK6-0lwM7obR9ayTx_evoNPxBrVg!!/?1dmy&urile=wcm:path:/APHIS_Content_Library/SA_Our_Focus/SA_Animal_Health/SA_Animal_Disease_Information/SA_Avian_Health/SA_Detections_by_States/) that had an embedded HTML table of flock outbreak statistics by state, county and date (also flock type and whether it was a commercial enterprise or “backyard” farm). Just looking at the sum of flock sizes on that page shows that nearly 50 million birds have been impacted since December, 2014.
We can scrape the data with R & `rvest` and then use the shapefile hexbins from [previous posts](http://rud.is/b/2015/05/15/u-s-drought-monitoring-with-hexbin-state-maps-in-r/) to watch the spread week-over-week.
The number of packages I ended up relying on was a bit surprising. Let’s get them out of the way before focusing on the scraping and hexbin-making:
```r
library(rvest)     # scraping
library(stringr)   # string manipulation
library(lubridate) # date conversion
library(dplyr)     # data munging
library(zoo)       # for locf
library(ggplot2)   # plotting
library(rgdal)     # map stuff
library(rgeos)     # map stuff
```
We also end up using `magrittr` and `tidyr` but only for one function, so you’ll see those with `::` in the code.
Grabbing the USDA page is pretty straightforward:
```r
url <- "http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/ct_avian_influenza_disease/!ut/p/a1/lVJbb4IwFP41e1qwFZDLI-oUnGgyswm8kAMUaAaFQNG4X7-ibnEPYtakDz3nO_kupyhAHgoYHGgGnFYMiv4daOFqa8vjKZad5c58wc7mY-Eaa13Z2qoA-AKA7xwL_53fvjpaP_-Gp_Z8jHcK2qMABTHjNc-RD3VO2zCuGCeMhwWNGmhOT7iFsOqaMK3irj2_gNESijAnUPD8tpLQlkBLQsrSqinPJi7tAwX2i4_5tSBgRUfYF_wM9mLqmCbIj2QzxZpMJMUYg6TGkSLBBCaSPEnSJIljXVH0q_kBdw_CO5sXkNnSslV9LQJTDRk7czGumy7GjnYFDOTrCw36XRJTRbt_mlo9VD1FgfuctqrVByA37szNBAPwXOpzR97gPi7tm30gb2AfQkxWVJH4ifvZLavFIsUQrA1JSUOaUV61HHnH43HUtQmMsuqA6vK9NJST9JluNlKwyPr7DT6YvRs!/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_pacific_flyway"

#' read in the data, extract the table and clean up the fields
#' also clean up the column names since they are fairly nasty
pg <- html(url)
```
If you poke at the source for the page you’ll see there are two tables in the code and we only need the first one. Also, if you scan the rendered table on the USDA page by eye you’ll see that the column names are horrible for data analysis work and they are also inconsistent in the values used for various columns. Furthermore, there are commas in the flock counts and it would be handy to have the date as an actual date type. We can extract the table we need and clean all that up in a reasonably-sized `dplyr` pipe:
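Here's a sketch of that pipe (the cleaned-up column names are stand-ins, not necessarily the ones from the original code):

```r
pg %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE) %>%
  magrittr::set_names(c("confirmation_date", "flock_type", "hpai_strain",
                        "state", "county", "special_note", "flock_size")) %>%
  mutate(flock_size = as.numeric(str_replace_all(flock_size, ",", "")),
         date = mdy(confirmation_date)) -> birds
```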
To make an animated map of cumulative flock totals by week, we’ll need to:

- group the `birds` data frame by week and state
- calculate the cumulative sums
- fill in the gaps where there are missing state/week combinations
- carry the last observations by state/week forward in this expanded data frame
- make breaks for data ranges so we can more intelligently map them to colors
This ends up being a longer `dplyr` pipe than I usually like to code (I think very long ones are hard to follow) but it gets the job done and is still pretty readable:
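A sketch of that pipe (column names carry over from the scraping sketch above; the breaks line up with the labels referenced in the plotting loop below, and the state-abbreviation mapping assumes full state names):

```r
birds %>%
  mutate(week = floor_date(date, "week")) %>%
  group_by(state, week) %>%
  summarise(total = sum(flock_size, na.rm = TRUE)) %>%
  mutate(cum_total = cumsum(total)) %>%       # cumulative sums per state
  ungroup() -> wk_totals

expand.grid(week = unique(wk_totals$week),
            state = unique(wk_totals$state),
            stringsAsFactors = FALSE) %>%
  left_join(wk_totals, by = c("week", "state")) %>%
  arrange(state, week) %>%
  group_by(state) %>%
  mutate(cum_total = zoo::na.locf(cum_total, na.rm = FALSE)) %>%  # carry forward
  ungroup() %>%
  mutate(brks = cut(cum_total,
                    breaks = c(0, 1e5, 1e6, 1e7, 5e7),
                    labels = c("under 100k", "100k-1m", "1m-10m", "10m-50m"),
                    include.lowest = TRUE)) %>%
  mutate(state_abb = state.abb[match(state, state.name)]) -> by_state_and_week
```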
From there, we:

- determine where we’re going to break the data up
- feed that into a loop
- partition the data in the loop
- render the plot to a file
- combine all the individual images into an animation
For this graphic, I’m doing something a bit extra. The color ranges for the hexbin choropleth go from very light to very dark, so it would be helpful if the titles for the states went from very dark to very light, matching the state colors. The lines that do this check for state breaks that fall in the last two values and appropriately assign `"black"` or `"white"` as the color.
```r
i <- 0

for (wk in unique(by_state_and_week$week)) {

  # filter by week
  by_state_and_week %>% filter(week == wk) -> this_wk

  # hack to let us color the state labels in white or black depending on
  # the value of the fill
  this_wk %>%
    filter(brks %in% c("1m-10m", "10m-50m")) %>%
    .$state_abb %>%
    unique -> white_states

  centers %>%
    mutate(txt_col = "black") %>%
    mutate(txt_col = ifelse(id %in% white_states, "white", "black")) -> centers

  # setup the plot
  gg <- ggplot()
  gg <- gg + geom_map(data=us_map, map=us_map,
                      aes(x=long, y=lat, map_id=id),
                      color="white", fill="#dddddd", size=2)
  gg <- gg + geom_map(data=this_wk, map=us_map,
                      aes(fill=brks, map_id=state_abb),
                      color="white", size=2)
  gg <- gg + geom_text(data=centers,
                       aes(label=id, x=x, y=y, color=txt_col), size=4)
  gg <- gg + scale_color_identity()
  gg <- gg + scale_fill_brewer(name="Combined flock size\n(all types)",
                               palette="RdPu", na.value="#dddddd", drop=FALSE)
  gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA)))
  gg <- gg + coord_map()
  gg <- gg + labs(x=NULL, y=NULL,
                  title=sprintf("U.S. Avian Flu Total Impact as of %s\n", wk))
  gg <- gg + theme_bw()
  gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(panel.grid=element_blank())
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_blank())
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.direction="horizontal")
  gg <- gg + theme(legend.title.align=1)

  # save the image. i'm using type="quartz" here since I'm on a Mac; use what
  # works for your system to ensure you get the best looking output png
  png(sprintf("output/%03d.png", i), width=800, height=500, type="quartz")
  print(gg)
  dev.off()

  i <- i + 1
}
```
We could use one of the R animation packages to actually make the animation, but I know ImageMagick pretty well so I just call it as a `system` command:
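Something along these lines (it assumes ImageMagick's `convert` is on your PATH; the delay value is arbitrary):

```r
system("convert -delay 75 -loop 1 output/*.png output/avian-flu.gif")
```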
If that’s a static image, open it in a new tab/window (or just click on it). I really didn’t want to do a looping gif, but if you do, just change the `-loop 1` to `-loop 0`.
Now, we can just re-run the code when the USDA refreshes the data.
The code, data and sample bitmaps are on [github](https://github.com/hrbrmstr/avianflu).
I set aside a small bit of time to give [rbokeh](https://github.com/bokeh/rbokeh) a try and figured I’d share a small bit of code that shows how to make the “same” chart in both ggplot2 and rbokeh.
#### What is Bokeh/rbokeh?
rbokeh is an [htmlwidget](http://htmlwidgets.org) wrapper for the [Bokeh](http://bokeh.pydata.org/en/latest/) visualization library that has become quite popular in Python circles. Bokeh makes creating interactive charts pretty simple and rbokeh lets you do it all with R syntax.
#### Comparing ggplot & rbokeh
This is not a comprehensive introduction into rbokeh. You can get that [here (officially)](http://hafen.github.io/rbokeh/). I merely wanted to show how a ggplot idiom would map to an rbokeh one for those that may be looking to try out the rbokeh library and are familiar with ggplot. They share a very common “grammar of graphics” base where you have a plot structure and add layers and aesthetics. They each do this a tad bit differently, though, as you’ll see.
First, let’s plot a line graph with some markers in ggplot. The data I’m using is a small time series whose cumulative sum we’ll plot via a line graph. It’s small enough to fit inline:
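Since the post's inline data isn't reproduced here, this stand-in has the same shape (a date column, a count column `n`, and a 3-row `events` data frame); the values are made up:

```r
dat <- data.frame(
  date = seq(as.Date("2015-01-01"), by = "1 week", length.out = 26),
  n    = rep(c(2, 3, 4), length.out = 26)
)

events <- data.frame(
  date  = as.Date(c("2015-02-14", "2015-04-01", "2015-06-06")),
  label = c("Event 1", "Event 2", "Event 3")
)
```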
The plot itself comes together in layers (see the sketch after this list):

- setup a base ggplot object
- add a layer of marker lines (which are the 3 `events` dates)
- add a layer of text for the marker lines
- add a layer of the actual line – note that we can use `cumsum(n)` vs pre-compute it
- setup scale and other aesthetic properties
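A sketch of those steps:

```r
library(ggplot2)

gg <- ggplot(dat, aes(x = date, y = cumsum(n)))
# marker lines for the three events (note the `as.numeric` hack for dates)
gg <- gg + geom_vline(data = events,
                      aes(xintercept = as.numeric(date)),
                      color = "#b2b2b2", linetype = "dashed")
gg <- gg + geom_text(data = events,
                     aes(x = date, y = 95, label = label),
                     color = "#7f7f7f", size = 3, hjust = -0.1)
gg <- gg + geom_line()
gg <- gg + scale_y_continuous(limits = c(0, 100))
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme_bw()
gg <- gg + theme(panel.border = element_blank())
gg
```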
Here, we set the `width` and `height` and configure some of the initial aesthetic options. Note that `outline_line_alpha=0` is the equivalent of `theme(panel.border=element_blank())`.
The markers and text do not work exactly as one might expect since there’s no way to specify a `data` parameter, so we have to set the colors manually. Also, since the target is a browser, points are specified in the same way you would with CSS. However, it’s a pretty easy translation from `geom_[hv]line` to `ly_abline` and `geom_text` to `ly_text`.
`ly_lines` works pretty much like `geom_line`.
Notice that both ggplot and rbokeh can grok dates for plotting (though we do not need the `as.numeric` hack for rbokeh).
rbokeh will auto-compute bounds like ggplot would but I wanted the scale to go from 0 to 100 in each plot. You can think of `y_range` as `ylim` in ggplot.
To configure the axes, you work directly with `x_axis` and `y_axis` parameters vs `theme` elements in ggplot. To turn off only lines, I set the alpha to 0 in each and did the same with the y axis minor tick marks.
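Putting those pieces together, here's a sketch of the rbokeh version (the exact theming arguments in the original may differ, and the axis-line/minor-tick alpha tweaks are omitted):

```r
library(rbokeh)

rb <- figure(width = 550, height = 375,
             outline_line_alpha = 0,   # ~ theme(panel.border=element_blank())
             y_range = c(0, 100))      # ~ the 0-100 y scale
rb <- rb %>% ly_abline(v = events$date, color = "#b2b2b2")
rb <- rb %>% ly_text(x = events$date, y = 95, text = events$label,
                     color = "#7f7f7f", font_size = "8pt")
rb <- rb %>% ly_lines(date, cumsum(n), data = dat)
rb
```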
Here’s the rbokeh result:
NOTE: you can save out the widget with:
```r
htmlwidgets::saveWidget(rb, file="rbokeh001.html")
```
and I like to use the following `iframe` settings to include the widgets:
Hopefully this helped a bit with translating some ggplot idioms over to rbokeh and developing a working mental model of rbokeh plots. As I play with it a bit more I’ll add some more examples here in the event there are “tricks” that need to be exposed. You can find the code [up on github](https://gist.github.com/hrbrmstr/a3a1be8132530b355bf9) and please feel free to drop a note in the comments if there are better ways of doing what I did or if you have other hints for folks.