rud.is

A quick, incomplete comparison of ggplot2 & rbokeh plotting idioms

2015-05-26 – 17:04
Posted in Charts & Graphs, d3, Data Visualization, DataVis, DataViz, HTML5, Javascript, R
Tagged post
Comments (5)

I set aside a small bit of time to give [rbokeh](https://github.com/bokeh/rbokeh) a try and figured I’d share a small bit of code that shows how to make the “same” chart in both ggplot2 and rbokeh.

#### What is Bokeh/rbokeh?

rbokeh is an [htmlwidget](http://htmlwidgets.org) wrapper for the [Bokeh](http://bokeh.pydata.org/en/latest/) visualization library that has become quite popular in Python circles. Bokeh makes creating interactive charts pretty simple and rbokeh lets you do it all with R syntax.

#### Comparing ggplot & rbokeh

This is not a comprehensive introduction into rbokeh. You can get that [here (officially)](http://hafen.github.io/rbokeh/). I merely wanted to show how a ggplot idiom would map to an rbokeh one for those that may be looking to try out the rbokeh library and are familiar with ggplot. They share a very common “grammar of graphics” base where you have a plot structure and add layers and aesthetics. They each do this a tad bit differently, though, as you’ll see.

First, let’s plot a line graph with some markers in ggplot. The data I’m using is a small time series that we’ll use to plot a cumulative sum of via a line graph. It’s small enough to fit inline:

library(ggplot2)
library(rbokeh)
library(htmlwidgets)
 
structure(list(wk = structure(c(16069, 16237, 16244, 16251, 16279,
16286, 16300, 16307, 16314, 16321, 16328, 16335, 16342, 16349,
16356, 16363, 16377, 16384, 16391, 16398, 16412, 16419, 16426,
16440, 16447, 16454, 16468, 16475, 16496, 16503, 16510, 16517,
16524, 16538, 16552, 16559, 16566, 16573), class = "Date"), n = c(1L,
1L, 1L, 1L, 3L, 1L, 3L, 2L, 4L, 2L, 3L, 2L, 5L, 5L, 1L, 1L, 3L,
3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 7L, 1L, 2L, 6L, 7L, 1L, 1L,
1L, 2L, 2L, 7L, 1L)), .Names = c("wk", "n"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -38L)) -> by_week
 
events <- data.frame(when=as.Date(c("2014-10-09", "2015-03-20", "2015-05-15")),
                     what=c("Thing1", "Thing2", "Thing2"))

The ggplot version is pretty straightforward:

gg <- ggplot()
gg <- gg + geom_vline(data=events,
                      aes(xintercept=as.numeric(when), color=what),
                      linetype="dashed", alpha=1/2)
gg <- gg + geom_text(data=events,
                     aes(x=when, y=1, label=what, color=what),
                     hjust=1.1, size=3)
gg <- gg + geom_line(data=by_week, aes(x=wk, y=cumsum(n)))
gg <- gg + scale_x_date(expand=c(0, 0))
gg <- gg + scale_y_continuous(limits=c(0, 100))
gg <- gg + labs(x=NULL, y="Cumulative Stuff")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

We:

– setup a base ggplot object
– add a layer of marker lines (which are the 3 `events` dates)
– add a layer of text for the marker lines
– add a layer of the actual line – note that we can use `cumsum(n)` vs pre-compute it
– setup scale and other aesthetic properties

That gives us this:

Here’s a similar structure in rbokeh:

figure(width=550, height=375,
       logo="grey", outline_line_alpha=0) %>%
  ly_abline(v=events$when, color=c("red", "blue", "blue"), type=2, alpha=1/4) %>%
  ly_text(x=events$when, y=5, color=c("red", "blue", "blue"),
          text=events$what, align="right", font_size="7pt") %>%
  ly_lines(x=wk, y=cumsum(n), data=by_week) %>%
  y_range(c(0, 100)) %>%
  x_axis(grid=FALSE, label=NULL,
         major_label_text_font_size="8pt",
         axis_line_alpha=0) %>%
  y_axis(grid=FALSE,
         label="Cumulative Stuff",
         minor_tick_line_alpha=0,
         axis_label_text_font_size="10pt",
         axis_line_alpha=0) -> rb
rb

Here, we set the `width` and `height` and configure some of the initial aesthetic options. Note that `outline_line_alpha=0` is the equivalent of `theme(panel.border=element_blank())`.

The markers and text do not work exactly as one might expect since there’s no way to specify a `data` parameter, so we have to set the colors manually. Also, since the target is a browser, points are specified in the same way you would with CSS. However, it’s a pretty easy translation from `geom_[hv]line` to `ly_abline` and `geom_text` to `ly_text`.

The `ly_lines` works pretty much like `geom_line`.

Notice that both ggplot and rbokeh can grok dates for plotting (though we do not need the `as.numeric` hack for rbokeh).

rbokeh will auto-compute bounds like ggplot would but I wanted the scale to go from 0 to 100 in each plot. You can think of `y_range` as `ylim` in ggplot.

To configure the axes, you work directly with `x_axis` and `y_axis` parameters vs `theme` elements in ggplot. To turn off only lines, I set the alpha to 0 in each and did the same with the y axis minor tick marks.

Here’s the rbokeh result:

NOTE: you can save out the widget with:

saveWidget(rb, file="rbokeh001.html")

and I like to use the following `iframe` settings to include the widgets:

<iframe style="max-width=100%" 
        src="rbokeh001.html" 
        sandbox="allow-same-origin allow-scripts" 
        width="100%" 
        height="400" 
        scrolling="no" 
        seamless="seamless" 
        frameBorder="0"></iframe>

#### Wrapping up

Hopefully this helped a bit with translating some ggplot idioms over to rbokeh and developing a working mental model of rbokeh plots. As I play with it a bit more I’ll add some more examples here in the event there are “tricks” that need to be exposed. You can find the code [up on github](https://gist.github.com/hrbrmstr/a3a1be8132530b355bf9) and please feel free to drop a note in the comments if there are better ways of doing what I did or if you have other hints for folks.

Scraping jQuery DataTable Programmatic JSON with R

2015-05-18 – 06:35
Posted in Data Analysis, HTML5, R, web scraping
Tagged post
Comments (2)

School of Data had a recent post how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

(i.e. by the controls & style of the controls available) and in the source:

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

Here, I just used Chrome’s Developer Tools to look a the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that’s it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src=\"http://cdn1.nflximg.net/images/9159/12119159.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/6195/20866195.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/3735/2243735.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn0.nflximg.net/images/2668/21112668.jpg\" width=\"55\" alt=\"Thumbnail\">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href=\"http://www.allflicks.net/category/thrillers/\">Thrillers</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

pre lang=”rsplus”>glimpse(recs$data)

Observations: 7448

Variables:

$ box_art (chr) “<img src=\”http://cdn1.nflximg.net/images/9159/12…

$ title (chr) “In the Bedroom”, “Wolfy: The Incredible Secret”, …

$ year (chr) “2001”, “2013”, “2006”, “2015”, “1993”, “2013”, “2…

$ rating (chr) “3.3”, “2.5”, “3.6”, “4”, “3.5”, “3.1”, “3.3”, “4….

$ category (chr) “<a href=\”http://www.allflicks.net/category/thril…

$ available (chr) “17 May 2015”, “17 May 2015”, “17 May 2015”, “17 M…

$ cast (chr) “Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T…

$ director (chr) “Todd Field”, “Éric Omond”, “Mucci Fassett, Nico R…

$ id (chr) “60022258”, “70302834”, “70053695”, “80028529”, “8…

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we run again to scrape the site (i.e. when the add more movies to the database) vs a manual cut & paste workflow.

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.

U.S. Drought Monitoring With Hexbin State Maps in R

2015-05-15 – 07:18
Posted in cartography, Data Visualization, DataVis, DataViz, maps, R
Tagged post
Leave a Comment

On the news, today, of the early stages of drought hitting the U.S. northeast states I decided to springboard off of yesterday’s post and show a more practical use of hexbin state maps than the built-in (and still purpose unknown to me) “bees” data.

The U.S. Drought Monitor site supplies more than just a pretty county-level map. There’s plenty of data and you can dynamically retrieve just data tables for the whole U.S., U.S. states and U.S. counties. Since we’re working with state hexbins, we just need the state-level data. Drought levels for all five stages are reported per-state, so we can take all this data and created a faceted/small-multiples map based on it.

This builds quite a bit on the previous work, so you’ll see some familiar code. Most of the new code is actually making the map look nice (the great part about this is that once you have the idiom down, it’s just a matter of running the script each day vs a billion mouse clicks). The other bit of new code is the data-retrieval component:

library(readr)
library(tidyr)
library(dplyr)

intensity <- c(D0="Abnormally Dry", D1="Moderate Drought", D2="Severe Drought", 
               D3="Extreme Drought", D4="Exceptional Drought")

today <- format(Sys.Date(), "%Y%m%d")

read_csv(sprintf("http://droughtmonitor.unl.edu/USDMStatistics.ashx/?mode=table&aoi=state&date=%s", today)) %>% 
  gather(drought_level, value, D0, D1, D2, D3, D4) %>% 
  mutate(intensity=factor(intensity[drought_level], 
                          levels=as.character(intensity), ordered=TRUE)) -> drought

This:

sets up a fast way to add the prettier description of the drought levels (besides D0, D1, etc)
dynamically uses today’s date as the parameter for the URL we read with read_csv (from the readr package)
covert the data from wide to long
adds the intensity description

The ggplot code will facet on the intensity level to make the overall map:

library(rgdal)
library(rgeos)
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(grid)

# get map from https://gist.github.com/hrbrmstr/51f961198f65509ad863#file-us_states_hexgrid-geojson

us <- readOGR("us_states_hexgrid.geojson", "OGRGeoJSON")

centers <- cbind.data.frame(data.frame(gCentroid(us, byid=TRUE), id=us@data$iso3166_2))

us_map <- fortify(us, region="iso3166_2")

intensity <- c(D0="Abnormally Dry", D1="Moderate Drought", D2="Severe Drought",
               D3="Extreme Drought", D4="Exceptional Drought")

today <- format(Sys.Date(), "%Y%m%d")

read_csv(sprintf("http://droughtmonitor.unl.edu/USDMStatistics.ashx/?mode=table&aoi=state&date=%s", today)) %>%
  gather(drought_level, value, D0, D1, D2, D3, D4) %>%
  mutate(intensity=factor(intensity[drought_level],
                          levels=as.character(intensity), ordered=TRUE)) -> drought

gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", size=0.5)
gg <- gg + geom_map(data=drought, map=us_map,
                    aes(fill=value, map_id=State))
gg <- gg + geom_map(data=drought, map=us_map,
                    aes(map_id=State),
                    fill="#ffffff", alpha=0, color="white",
                    show_guide=FALSE)
gg <- gg + geom_text(data=centers, aes(label=id, x=x, y=y), color="white", size=4)
gg <- gg + scale_fill_distiller(name="State\nDrought\nCoverage", palette="RdPu", na.value="#7f7f7f",
                                labels=sprintf("%d%%", c(0, 25, 50, 75, 100)))
gg <- gg + coord_map()
gg <- gg + facet_wrap(~intensity, ncol=2)
gg <- gg + labs(x=NULL, y=NULL, title=sprintf("U.S. Drought Conditions as of %s\n", Sys.Date()))
gg <- gg + theme_bw()
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.margin=unit(3, "lines"))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(face="bold", hjust=0, size=14))
gg <- gg + theme(legend.position=c(0.75, 0.15))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(legend.title.align=1)

png(sprintf("%s.png", today), width=800, height=800)
print(gg)
dev.off()

20150515

Now, you can easily animate these over time to show the progression/regression of the drought conditions. If you're sure your audience can work with SVG files, you can use those for very crisp/sharp maps (and even feed it to D3 or path editing tools). If you have an example of how you're using hexbin choropleths, drop a note in the comments. The code from above is also on github.

GeoJSON Hexagonal “Statebins” in R

2015-05-14 – 14:29
Posted in cartography, Data Visualization, DataVis, DataViz, gis, R
Tagged post
Comments (10)

There’s been lots of buzz about “statebin” maps of late. A recent tweet by @andrewxhill referencing work by @dannydb pointed to a nice shapefile (alternate link) that ends up being a really great way to handle statebin maps (and I feel like a fool for not considering it for a more generic solution earlier).

Here is the GeoJSON, Shapefile download for the hexgrid states map based on @dannydb's https://t.co/0exQcC8D5S pic.twitter.com/F8pNH79MoQ

— Andrew W Hill (@andrewxhill) May 14, 2015

Here’s a way to use the GeoJSON version in R. I like GeoJSON since it’s a single file vs a directory of files and is readable vs binary. If you’re in a TL;DR hurry, you can just review the code in this gist. Read on for expository.

Taking a look around

When you download the GeoJSON, it should be in a file called us_states_hexgrid.geojson. We can see what’s in there with R pretty easily:

library(rgdal)

ogrInfo("us_states_hexgrid.geojson", "OGRGeoJSON")

## Source: "us_states_hexgrid.geojson", layer: "OGRGeoJSON"
## Driver: GeoJSON number of rows 51 
## Feature type: wkbPolygon with 2 dimensions
## Extent: (-137.9747 26.39343) - (-69.90286 55.3132)
## CRS: +proj=longlat +datum=WGS84 +no_defs  
## Number of fields: 6 
##         name type length typeName
## 1 cartodb_id    0      0  Integer
## 2 created_at    4      0   String
## 3 updated_at    4      0   String
## 4      label    4      0   String
## 5       bees    2      0     Real
## 6  iso3166_2    4      0   String

Along with the basic shapefile goodness, we have some data, too! We’ll use all this to make a chorpleth hexbin of “bees” (I have no idea what this is but assume it has something to do with bee population, which is a serious problem on the planet right now). Let’s dig in.

Plotting the bins

First we need to read in the data, which is pretty simple:


us <- readOGR("us_states_hexgrid.geojson", "OGRGeoJSON")

That ends up being a fairly complex object with polygons and data. However, we can take a quick look at it with base R graphics:


plot(us)

base1

Yay! While we could do most (if not all) the remainder of the graphics in base R, I personally believe ggplot is more intuitive and expressive, so let's do the same thing with ggplot. First, we'll have to get the data structure into something ggplot can handle:

library(ggplot2)

us_map <- fortify(us, region="iso3166_2")

That gives us a data frame with the 2-letter state abbreviations as the "region" keys. Now we can do a basic ggplot:

ggplot(data=us_map, aes(map_id=id, x=long, y=lat)) + 
  geom_map(map=us_map, color="black", fill="white")

Rplot

Ugh. Talk about ugly. But, at least it works! Now all we need to do is turn it into a choropleth, remove some map chart junk and make it look prettier!

Upping the aesthetics

There's a pretty good idiom for making maps in R. There's a handy layer/geom called geom_map which takes care of a ton of details under the hood. We can use it for making outlines and fills and add as many layers of them as we want/need. For our needs, we'll:

put down a base layer of polygons
add a fill layer for our data
get rid of map chart junk

This is all pretty straightforward once you get the hang of it:

g <- ggplot()

# Plot base map -----------------------------------------------------------

gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", size=0.5)

# Plot filled polygons ----------------------------------------------------

gg <- gg + geom_map(data=us@data, map=us_map,
                    aes(fill=bees, map_id=iso3166_2))

# Remove chart junk for the “map" -----------------------------------------

gg <- gg + labs(x=NULL, y=NULL)
gg <- gg + theme_bw()
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg

Rplot01

Definitely better, but it still needs work. Outlines would be good and it definitely needs a better color palette. It would also be nice if the polygons weren't "warped". We can fix these issues by adding in a few other elements:

gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", size=0.5)
gg <- gg + geom_map(data=us@data, map=us_map,
                    aes(fill=bees, map_id=iso3166_2))

# Overlay borders without ugly line on legend -----------------------------

gg <- gg + geom_map(data=us@data, map=us_map,
                    aes(map_id=iso3166_2),
                    fill="#ffffff", alpha=0, color="white",
                    show_guide=FALSE)

# ColorBrewer scale; using distiller for discrete vs continuous -----------

gg <- gg + scale_fill_distiller(palette="RdPu", na.value="#7f7f7f")

# coord_map mercator works best for the display ---------------------------

gg <- gg + coord_map()

gg <- gg + labs(x=NULL, y=NULL)
gg <- gg + theme_bw()
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg

Rplot02

Much better. We use a "hack" to keep the legend free of white slash marks for the polygon outlines (see the comments for a less-hackish way) and coord_map to let the projection handle the "unwarping". By using the distiller fill, we get discrete color bins vs continuous shades (use what you feel is most appropriate, though, for your own work).

Where are we?

Most statebin/hexbin maps still (probably) need state labels since (a) Americans are notoriously bad at geography and (b) even if they were good at geography, we've removed much of the base references for folks to work from accurately.

The data exists in the shapefile, but to get the labels put in the centers of each polygon we have to do a bit of work:

library(rgeos)

centers <- cbind.data.frame(data.frame(gCentroid(us, byid=TRUE), id=us@data$iso3166_2))

That gets us a data frame of the x & y centers of each polygon along with the (abbreviated) state name. We can now add a layer with geom_text to place the label. The following is the complete solution:

library(rgdal)
library(rgeos)
library(ggplot2)

us <- readOGR("us_states_hexgrid.geojson", "OGRGeoJSON")

centers <- cbind.data.frame(data.frame(gCentroid(us, byid=TRUE), id=us@data$iso3166_2))

us_map <- fortify(us, region="iso3166_2")

ggplot(data=us_map, aes(map_id=id, x=long, y=lat)) + geom_map(map=us_map, color="black", fill="white")

gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", size=0.5)
gg <- gg + geom_map(data=us@data, map=us_map,
                    aes(fill=bees, map_id=iso3166_2))
gg <- gg + geom_map(data=us@data, map=us_map,
                    aes(map_id=iso3166_2),
                    fill="#ffffff", alpha=0, color="white",
                    show_guide=FALSE)
gg <- gg + geom_text(data=centers, aes(label=id, x=x, y=y), color="white", size=4)
gg <- gg + scale_fill_distiller(palette="RdPu", na.value="#7f7f7f")
gg <- gg + coord_map()
gg <- gg + labs(x=NULL, y=NULL)
gg <- gg + theme_bw()
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg

Rplot03

Wrapping up

This is a pretty neat way to work with "statebins" and I'll probably take some time over the summer to update my statebins package to use shapefiles and allow for more generic shapes. Ramnath Vaidyanathan has also done some work with statebins and javascript, so I'll see what I can do to merge all the functionality into one package.

If you've got an alternate way to work with these or have some interesting "bins" to show, drop a note in the comments.

quotebox – An NPR-like Embedded Twitter Quote Generator

I’m an avid NPR listener also follow a number of their programs and people on Twitter. I really dig their [quotable](https://github.com/nprapps/quotable) tweets. Here’s a sample of a recent one:

Minn. state senators cannot look other senators in the eye during floor debate. @ailsachang http://t.co/SfQBq4yyHQ pic.twitter.com/DNHGEiVA9j

— NPR News (@nprnews) May 8, 2015

I asked @brianboyer & @alykat (two member of the _awesome_ NPR Visuals team) if they had a workflow for this (they publish quite a bit of open source tools) and, well [of course they did](https://twitter.com/hrbrmstr/status/596656939720962048).

After poking around a bit, I determined that the workflow would be great for something like NPR but was a bit overkill for me and also did not meet all of my requirements. I really only needed something that would:

– work from the command-line (on OS X)
– take a JSON file for the quote
– generate an SVG (for blogging) and PNG (for tweeting)

Thus, [quotebox](http://github.com/hrbrmstr/quotebox) was born.

For the time being, `quotebox` is a python script that lets you pass in a JSON config file with quote components and spits out some files to use in blog posts & tweets. I can see it morphing into more, but for now you can create a quote file like:

{
  "quote" : "Making these quotes by hand is fine &amp; NPR Visuals web app is cool, but this has some advantages, too.",
  "source" : "Data-Driven Security co-author Bob Rudis on making “NPR-like quoteboxes” at the command-line.", 
  "logo" : "ddsec.png"
}

which will generate:

test

or (SVG version):

test

The script auto-adds “smart” quotes and an em-dash plus base64 encodes the logo file to a data URI so you can easily use the SVG file without needing to bring along the associated logo asset. The SVG itself has a couple (at least to me) interesting components. First, it uses the [Lato](https://www.google.com/fonts/specimen/Lato) font from Google Fonts via `defs`:

<defs>
  <style type="text/css">@import url('http://fonts.googleapis.com/css?family=Lato:300,900');</style>
</defs>

It then uses two `foreignObject` tags for the main quote and the attribution line. Here’s an example:

<foreignObject x="25" y="25" width="590" height="165">
  <body xmlns="http://www.w3.org/1999/xhtml"
    style="background: white; font-family: Lato; font-weight:900; font-stretch:normal; font-size:36px; line-height:36px; text-indent:-16px; margin-left:20px; color:#333333">
    <p>&#8220;Making these quotes by hand is fine &amp; NPR Visuals web app is cool, but this has some advantages, too.&#8221;</p>
  </body>
</foreignObject>

Like most SVG elements, you have full precision as to how you place and size the `foreignObject` and this object contains XHTML, so all text or graphics in them can be fully styled. Not every browser or command-line rendering engine supports this, but the latest Chrome & Safari browsers do along with the >=2.0.0 of [phantomjs](http://phantomjs.org/).

Once the SVG is built, the script uses phantomjs to render the resultant png file. You now have two assets (SVG & png) that can be used wherever you like, but posting it as a Twitter image post will require the use of png.

### Requirements & Usage

You’ll need python, [Pillow](https://python-pillow.github.io/) and [phantomjs](http://phantomjs.org/) to run the script and just need to execute

python quotebox.py test.json

to generate `test.svg` and `test.png`. On OS X I was using

python quotebox.py test.json && open test.svg && open test.png

to test out the image generation & conversion.

Note that many (most?) folks should use the NPR Visuals [quotable](https://github.com/nprapps/quotable) instead of this since it has full workflow management. This script was created primarily to oversimplify local usage (I don’t work in a newsroom) and play with SVG.

The python is not bullet proof (no error checking, lack of use of tempfiles/dirs) but it’s a decent basic foundation.

### Future work

There will definitely be command line switches for some options (quotes or no quotes) and possibly an extension to the JSON config file for fonts and colors. I’ll think supporting grabbing a logo from a URL is needed and automagic resizing to a series of constraints. I can also envision spinning up a quotebox app on Heroku or Google Apps and bridging it to a bookmarklet, but fork away and PR as you see fit. The creativity & ingenuity of the community knows no bounds.

New Fake Vulnerability Service – SVE

The SVE service is a fake new service solely put out there to facilitate some inane puzzle. You can reference SVE-2015-9999 via this URL and hopefully this will make it possible for google to get it back in their index.

More Airline Crashes via the Hadleyverse

I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:

Wrote some R code for looking at historical crash data https://t.co/bgrMj3PIZu #data #r #DataMining pic.twitter.com/zYzpOyD9JY

— PhilJ (@philjette) March 26, 2015

The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.

Just to get the setup bits out of the way, here’s all the packages I’ll be using:

library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)

Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.

The basic setup is to:

– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `
` and includes spurious newlines within a single `

` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)

Some things worth mentioning:

– `data_frame` is super-helpful in not-creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifesle` (yes, I guess I could submit a PR). The `.` just feels wrong in pipes to me, still
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary

Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):

#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
 
  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
 
  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }
 
  # get crash date
 
  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date
 
  # get location and operator
 
  loc_op <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
 
    data_frame(location=loc, operator=op)
 
  }))
 
  # get type & registration
 
  type_reg <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ
 
    pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg
 
    data_frame(type=typ, registration=reg)
 
  }))
 
  # get fatalities
 
  pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]), fatalties=as.numeric(x[3]), ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>% tail(-1) -> afg
 
  bind_cols(date, loc_op, type_reg, afg)
 
}

While that gets one year, it’s super-simple to get all crashes since 1950:

crashes <- bind_rows(pblapply(1950:2015, get_data))

Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.

For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.

If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.

A look at airline crashes in R with googlesheets, dplyr & ggplot2

2015-03-30 – 14:32
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R
Tagged post
Comments (5)

Over on The DO Loop, @RickWicklin does a nice job [visualizing the causes of airline crashes](http://blogs.sas.com/content/iml/2015/03/30/visualizing-airline-crashes/) in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well, but I also thought I’d take the opportunity to:

– Give @jennybc’s new [googlesheets](http://github.com/jennybc/googlesheets) a spin
– Show some `dplyr` & `tidyr` data wrangling (never can have too many examples)
– Crank out some `ggplot` zero-based streamgraph-y area charts for the data with some extra `ggplot` wrangling for good measure

I also decided to use the colors in the [original David McCandless/Kashan visualization](http://www.informationisbeautiful.net/visualizations/plane-truth-every-single-commercial-plane-crash-visualized/).

#### Getting The Data

As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site [makes the data available](https://docs.google.com/spreadsheet/ccc?key=0AjOUPqcIwvnjdEx2akx5ZjJXSk9oM1E3dWpqZFJ6Nmc&usp=drive_web#gid=1), so I copied it to my Google Drive and gave her package a go:

library(googlesheets)
library(ggplot2) # we'll need the rest of the libraries later
library(dplyr)   # but just getting them out of the way
library(tidyr)
 
# this will prompt for authentication the first time
my_sheets <- list_sheets()
 
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
 
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" 
 
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
 
# take a quick look
glimpse(flights_csv)
 
## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
 
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
 
# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv

#### A Bit of Cleanup

Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.

The area chart also needs full values for each category combo per year, so we use `expand` from `tidyr` with `left_join` and `mutate` to fill in the gaps.

Finally, we make proper, ordered labels:

flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% 
  mutate(n=ifelse(is.na(n), 0, n)) %>% 
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase, 
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat

I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgable in cybersecurity.

#### The Final Plot

I’m a big fan of an incremental, additive build idiom for `ggplot` graphics. By using the `gg <- gg + …` style one can move lines around, comment them out, etc without dealing with errant `+` signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows: - main `ggplot` call with mappings if the graph is short, otherwise add the mappings to the `geom`s - all `geom_` or `stat_` layers in the order I want them, and using line breaks to logically separate elements (like `aes`) or to wrap long lines for easier readability. - all `scale_` elements in order from axes to line to shape to color to fill to alpha; I'm not as consistent as I'd like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking - `facet` call (if any) - label setting, always with `labs` unless I really have a need for using `ggtitle` - base `theme_` call - all other `theme` elements, one per `gg <- gg +` line I know that's not everyone's cup of tea, but it's just how I roll `ggplot`-style. For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don't have Futura Medium.

flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")
 
gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) 
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x)) 
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, 
                             labels=c("Criminal", "Human Error",
                                      "Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg

That ultimately produces:

flights

with the facets ordered by takeoff, flying, approaching landing and actual landing phases. Overall, things have gotten way better, though I haven’t had time to look in to the _bump_ between 2005 and 2010 for landing crashes.

As an aside, Boeing has a [really nice PDF](http://www.boeing.com/news/techissues/pdf/statsum.pdf) on some of this data with quite a bit more detail.