
Poynter did a nice interactive piece on world population by income (i.e. “How Many Live on How Much, and Where”). I’m always on the lookout for optimized shapefiles and clean data (I’m teaching a data science certificate program starting this Fall) and the speed of the site load and the easy availability of the data set made this one a “must acquire”. Rather than just repeat Poynter’s D3-goodness, here’s a way to look at the income data in a series of small multiple choropleths—using R & ggplot2—that involves:

  • downloading data & shapefiles from a web site
  • using dplyr & tidyr for data munging
  • applying custom fill color scale mapping in ggplot
  • ordering plots with a custom facet order (using factors)
  • tweaking the theme and aesthetics for a nicely finished result

By using D3, Poynter inherently made the data available. Pop open the “Developer Tools” in any browser, reload the page and look at the “Network” tab and you’ll see a list of files (you can sometimes see things in the source code, but this technique is often faster). The income data is a well-formed CSV file (http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv) and their highly optimized world map was also easy to discern (http://www.pewglobal.org/wp-content/lib/js/world-geo.json). We’ll start by grabbing the map and using the same map projection that Poynter did (Robinson). Don’t be put off by all the library calls since one of the best parts of R is the ever-increasing repository of great packages to help you get things done.

library(httr)     # getting data
library(rgdal)    # working with shapefile
library(dplyr)    # awesome data manipulation
library(readr)    # faster reading of CSV data
library(stringi)  # string manipulation
library(stringr)  # string manipulation
library(tidyr)    # reshaping data
library(grid)     # for 'unit'
library(scales)   # for 'percent'
library(ggplot2)  # plotting
library(ggthemes) # theme_map
 
# this ensures you only download the shapefile once and hides
# errors and warnings. remove `try` and `invisible` to see messages
try(invisible(GET("http://www.pewglobal.org/wp-content/lib/js/world-geo.json",
                  write_disk("world-geo.json"))), silent=TRUE)
 
# use ogrListLayers("world-geo.json") to see file type & 
# layer info to use in the call to readOGR
 
world <- readOGR("world-geo.json", "OGRGeoJSON")
world_wt <- spTransform(world, CRS("+proj=robin"))
world_map <- fortify(world_wt)

I would have liked to do fortify(world_wt, region="name") (since that makes working with filling in countries by name much easier in the choropleth part of the code) but that generated TopologyException errors (I’ve seen this happen quite a bit with simplified/optimized shapefiles and some non-D3 geo-packages). One can sometimes fix those with a strategic rgeos::gBuffer call, but that didn’t work well in this case. We can still use country names with a slight rejiggering of the fortified data frame using dplyr:
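For reference, that repair idiom usually looks something like the following (a sketch only, and one that did not help with this particular file). The zero-width buffer forces a geometry rebuild, and re-attaching the data slot keeps fortify(world_fixed, region="name") an option:

library(rgeos)
 
# "zero-width buffer" repair attempt for self-intersecting geometries
world_fixed <- SpatialPolygonsDataFrame(gBuffer(world_wt, byid=TRUE, width=0),
                                        world_wt@data)
# fortify(world_fixed, region="name") # still errored for this shapefile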

world_map %>%
  left_join(data_frame(id=rownames(world@data), name=world@data$name)) %>%
  select(-id) %>%
  rename(id=name) -> world_map

Now it’s time to get the data. The CSV file has annoying spaces in it that cause R to interpret all the columns as strings, so we can use dplyr again to get them into the format we want. Note that I’m also making the percentages decimals so we can use percent later on to easily format them.

# a good exercise would be to repeat the download code above
# rather than make repeated calls to an external resource
read_csv("http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv") %>%
  mutate_each(funs(str_trim)) %>%
  filter(id != "None") %>%
  mutate_each(funs(as.numeric(.)/100), -name, -id) -> dat

For this post, we’ll only be working with the actual share percentages, so let’s:

  • ignore the “change” columns
  • convert the data frame from wide to long
  • extract out the income levels (e.g. “Poor”, “Low Income”…)
  • set a factor order for them so our plots will be in the correct sequence

dat %>%
  gather(share, value, starts_with("Share"), -name, -id) %>%
  select(-starts_with("Change")) %>%
  mutate(label=factor(stri_trans_totitle(str_match(share, "Share ([[:alpha:]- ]+),")[,2]),
                      c("Poor", "Low Income", "Middle Income", "Upper-Middle Income", "High Income"),
                      ordered=TRUE)) -> share_dat

The stringi package is really handy (stringr is built on it, too). The stri_trans_totitle function alleviates some mundane string operations and the stri_replace_all_regex (below) also allows us to do vectorized regular expression replacements without a ton of code.
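A tiny example makes the vectorize_all=FALSE behavior clearer: every pattern/replacement pair is applied, in order, to each input string (the strings below are just illustrative label fragments):

library(stringi)
 
stri_replace_all_regex(c("0.0-2.5%", "80.0-NA%"),
                       c("^0.0", "-NA%"), c("0", "%"),
                       vectorize_all=FALSE)
## [1] "0-2.5%" "80.0%"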

To keep the charts aligned, we’ll use Poynter’s color scale (which was easy to extract from the site’s code) and use the same legend breaks via `cut`. We’ll also format the labels for these breaks to make our legend nicer to view.

# use same "cuts" as poynter
poynter_scale_breaks <- c(0, 2.5, 5, 10, 25, 50, 75, 80, 100)
 
sprintf("%2.1f-%s", poynter_scale_breaks, percent(lead(poynter_scale_breaks/100))) %>%
  stri_replace_all_regex(c("^0.0", "-NA%"), c("0", "%"), vectorize_all=FALSE) %>%
  head(-1) -> breaks_labels
 
share_dat %>%
  mutate(`Share %`=cut(value,
                       poynter_scale_breaks/100,
                       labels=breaks_labels)) -> share_dat
 
share_pal <- c("#eaecd8", "#d6dab3", "#c2c98b", "#949D48", "#6e7537", "#494E24", "#BB792A", "#7C441C", "#ffffff")

Finally, we get to the good part and start plotting the visualization. There are only two layers: the base map and the choropleth filled map. We then:

  • apply our manual color palette to them
  • remove the line color slashes in the legend boxes
  • setup the overall plot label
  • tell ggplot which coordinate system to use (in this case coord_equal is fine since we already projected the points)
  • apply a base theme that’s good for mapping
  • tweak the text and ensure our legend is in the position we want it to be in

gg <- ggplot()
 
gg <- gg + geom_map(data=world_map, map=world_map,
                    aes(x=long, y=lat, group=group, map_id=id),
                    color="#7f7f7f", fill="white", size=0.15)
 
gg <- gg + geom_map(data=share_dat, map=world_map,
                    aes(map_id=name, fill=`Share %`),
                    color="#7f7f7f", size=0.15)
 
gg <- gg + scale_fill_manual(values=share_pal)
 
gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA)))
gg <- gg + labs(title="World Population by Income\n")
gg <- gg + facet_wrap(~label, ncol=2)
 
gg <- gg + coord_equal()
 
gg <- gg + theme_map()
 
gg <- gg + theme(panel.margin=unit(1, "lines"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
gg <- gg + theme(legend.title=element_text(face="bold", hjust=0, size=12))
gg <- gg + theme(legend.text=element_text(size=10))
gg <- gg + theme(strip.text=element_text(face="bold", size=10))
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(legend.position="bottom")
 
gg

And, here’s the result (click for larger version):

[figure: the small-multiple choropleths of world population share by income level]

The optimized shapefile makes for a very fast plot, and you can plot individual choropleths by filtering the data and not using facets.
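For example, a single choropleth for one income level is just a filter plus the same two map layers (a sketch reusing the objects from above; “Middle Income” is an arbitrary choice):

share_dat %>% filter(label == "Middle Income") -> middle
 
gg <- ggplot()
gg <- gg + geom_map(data=world_map, map=world_map,
                    aes(x=long, y=lat, group=group, map_id=id),
                    color="#7f7f7f", fill="white", size=0.15)
gg <- gg + geom_map(data=middle, map=world_map,
                    aes(map_id=name, fill=`Share %`),
                    color="#7f7f7f", size=0.15)
# drop=FALSE keeps the full legend so colors stay comparable across levels
gg <- gg + scale_fill_manual(values=share_pal, drop=FALSE)
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg <- gg + theme(legend.position="bottom")
gg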

While there are a number of choropleth packages out there for R, learning how to do the core components on your own can (a) make you appreciate those packages a bit more and (b) give you the skills to build your own when you need a more customized version. Many of the theme tweaks will also apply to the ggplot-based choropleth packages.

With this base, it should be a fun exercise for the reader to do something similar with Poynter’s “percentage point change” choropleth. You’ll need to change the color palette and manipulate different data columns to get the same scales and visual representation they do. Drop a note in the comments if you give this a go!

You can find all the code in this post in one convenient gist.

I’ve been slowly prodding the [metricsgraphics package](https://github.com/hrbrmstr/metricsgraphics/) towards a 1.0.0 release, but there are some rough edges that still need sorting out. One of them is the ability to handle passing in variables for the `x` & `y` accessor values (you _can_ pass in bare and quoted strings). This can now be achieved (in the `dev01` branch) via `mjs_plot_` and in `mjs_plot` proper in the github main branch thanks to a [PR](https://github.com/hrbrmstr/metricsgraphics/pull/31) by [Jonathan Owen](https://github.com/jrowen). If everything stays stable with the PR, I’ll just fold the code into `mjs_plot` for the `0.9.0` CRAN release.
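Here’s a quick sketch of what the standard-evaluation variant looks like (an assumption on my part that `mjs_plot_` mirrors the usual NSE/SE pairing and takes the accessors as strings, or variables holding strings):

library(metricsgraphics)
 
x_var <- "wt"
y_var <- "mpg"
 
# accessors passed as variables rather than bare/quoted names
mjs_plot_(mtcars, x=x_var, y=y_var) %>%
  mjs_point()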

One other pending feature is the ability to turn _basic_ (single `geom_`) `ggplot` objects into `metricsgraphics` plots. Sometimes it’s just easier/nicer to “think” in `ggplot` and it may be the case that one might have coded a quick histogram/scatter/line plot in `ggplot` and want an equally quick interactive version. This can also now be achieved (again, in beta) via `as_mjsplot`. While the previous addition is fairly self-explanatory, this new one needs a few examples. Please note that the package installation is coming from the `dev01` branch:

devtools::install_github("hrbrmstr/metricsgraphics", ref="dev01") 
 
library(metricsgraphics)
library(ggplot2)
 
dat <- data.frame(year=seq(1790, 1970, 10),
                  uspop=as.numeric(uspop))
 
set.seed(5689)
movies <- movies[sample(nrow(movies), 1000), ]
 
gg1 <- ggplot(dat, aes(x=year, y=uspop)) + geom_line()
gg2 <- ggplot(dat, aes(x=year, y=uspop)) + geom_point()
gg3 <- ggplot(movies, aes(rating)) + geom_histogram()
gg4 <- ggplot(movies, aes(rating)) + geom_histogram(binwidth = 0.1)
 
gg1
as_mjsplot(gg1)
 
gg2
as_mjsplot(gg2)
 
gg3
as_mjsplot(gg3)
 
gg4
as_mjsplot(gg4)

Which you can see below:

As you can see, `as_mjsplot` will do its best to figure out the bins (if using `geom_histogram`) and also the axis labels. Support for converting `geom_vline` and `geom_hline` to markers and baselines (respectively) is a work in progress.
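In the meantime, you can add those annotations by hand after the conversion using the package’s existing helpers (a sketch reusing `gg1` from above; the marker/baseline values are arbitrary):

# markers are vertical annotations, baselines are horizontal reference lines
as_mjsplot(gg1) %>%
  mjs_add_marker(1900, "1900 census") %>%
  mjs_add_baseline(150, "150M")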

I’ve only done limited testing with some basic single `geom_` constructs, but if there are any bugs with it or feature requests (remember, the MetricsGraphics.js library has a very limited repertoire) please post an issue on GitHub tagging the `dev01` branch.

@JennyBryan posted her [slides from the 2015 R Summit](https://github.com/jennybc/2015-06-28_r-summit-talk) and they are a must-read for instructors and even general stats/R-folk. She’s one of the foremost experts in R+GitHub and her personal and class workflows provide solid patterns worth emulation.

One thing she has mentioned a few times—and included in her R Summit talk—is the idea that you can lean on GitHub when official examples of a function are “kind of thin”. She uses a search for `vapply` as an example, showing how to [search for uses of `vapply` in CRAN](https://github.com/search?utf8=%E2%9C%93&q=vapply+extension%3AR+user%3Acran) (there’s a [read-only CRAN mirror on GitHub](http://www.r-pkg.org/)) and [in GitHub R code in general](https://github.com/search?utf8=%E2%9C%93&q=vapply+extension%3AR).

I remember throwing together a small function to kick up a browser from R for those URLs (in a response to one of her tweets), but realized this morning (after reading her slides last night) that it’s possible to not leave RStudio to get these GitHub search results (or, at least the first page of results). So, I threw together [this gist](https://gist.github.com/hrbrmstr/32e9c140129d7d51db52) which, when sourced, provides a `ghelp` function. This is the code:

ghelp <- function(topic, in_cran=TRUE) {
 
  require(htmltools) # for getting HTML to the viewer
  require(rvest)     # for scraping & munging HTML
 
  # github search URL base
  base_ext_url <- "https://github.com/search?utf8=%%E2%%9C%%93&q=%s+extension%%3AR"
  ext_url <- sprintf(base_ext_url, topic)
 
  # if searching with user:cran (the default) add that to the URL  
  if (in_cran) ext_url <- paste(ext_url, "+user%3Acran", sep="", collapse="")
 
  # at the time of writing, "rvest" and "xml2" are undergoing some changes, so
  # accommodate those of us who are on the bleeding edge of the hadleyverse
  # either way, we are just extracting out the results <div> for viewing in 
  # the viewer pane (it works in plain ol' R, too)
  if (packageVersion("rvest") < "0.2.0.9000") { 
    require(XML)
    pg <- html(ext_url)
    res_div <- paste(capture.output(html_node(pg, "div#code_search_results")), collapse="")
  } else {
    require(xml2)
    pg <- read_html(ext_url)
    res_div <- as.character(html_nodes(pg, "div#code_search_results"))
  }
 
  # clean up the HTML a bit   
  res_div <- gsub('How are these search results\\? <a href="/contact">Tell us!</a>', '', res_div)
  # include a link to the results at the top of the viewer
  res_div <- gsub('href="/', 'href="http://github.com/', res_div)
  # build the viewer page, getting CSS from github-proper and hiding some cruft
  for_view <- sprintf('<html><head><link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github/index-4157068649cead58a7dd42dc9c0f2dc5b01bcc77921bc077b357e48be23aa237.css" media="all" rel="stylesheet" /><style>body{padding:20px}</style></head><body><a href="%s">Show on GitHub</a><hr noshade size=1/>%s</body></html>', ext_url, res_div)
  # this makes it show in the viewer (or browser if you're using plain R)
  html_print(HTML(for_view))
 
}

Now, when you type `ghelp("vapply")`, you’ll get:

[screenshot: the GitHub search results rendered in RStudio]

in the viewer pane (and similar with `ghelp("vapply", in_cran=FALSE)`). Clicking the top link will take you to the search results page on GitHub (in your default web browser), and all the other links will pop out to a browser as well.

If you’re the trusting type, you can `devtools::source_gist("32e9c140129d7d51db52")` or just add this to your R startup functions (or add it to your personal helper package).

There’s definitely room for some CSS hacking and it would be fairly straightforward to get _all_ the search results into the viewer by following the pagination links and stitching them all together (an exercise left to the reader).
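A rough sketch of that pagination idea (untested; it assumes GitHub keeps using a `p` query parameter for search result pages, and it reuses `ext_url` plus the newer xml2/rvest path from `ghelp` above):

library(xml2)
library(rvest)
 
# grab the first few result pages and stitch the result <div>s together
pages <- lapply(1:5, function(p) {
  pg <- read_html(sprintf("%s&p=%d", ext_url, p))
  as.character(html_nodes(pg, "div#code_search_results"))
})
res_div <- paste(unlist(pages), collapse="<hr noshade size=1/>")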

@briandconnelly (of [pushoverr](http://crantastic.org/authors/4002) fame) made a super-cool post about [connecting R](http://bconnelly.net/2015/06/connecting-r-to-everything-with-ifttt/) to @IFTTT via IFTTT’s “Maker” channel. The IFTTT Maker interface to receive events is fairly straightforward and Brian’s code worked flawlessly, so it was easy to tweak a bit and [wrap into a package](https://github.com/hrbrmstr/nifffty).

To get started, you can clone an [example public IFTTT recipe](https://ifttt.com/recipes/300804-post-maker-event-values-to-dropbox-file) that posts data to a Dropbox folder. All you need to do is add the IFTTT channel to your account and add the API key to your `.Renviron` file, which will look something like:

IFTTT_API_KEY=uGlySTringOfCharACTers-

Sticking it in the `.Renviron` file is a nice platform-independent way of getting the variable in your environment (and you can set a similar environment variable in Travis for automated testing). NOTE that you’ll need to restart your R/RStudio session for that value to become available (what, you don’t have RStudio up all the time?).
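Once you’ve restarted, a quick sanity check that the key actually made it into the environment:

# should print your Maker channel key, not an empty string
Sys.getenv("IFTTT_API_KEY")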

Once you have the key in place and the IFTTT recipe setup, all you have to do is:

devtools::install_github("hrbrmstr/nifffty")
library(nifffty)
maker("rtest", "this", "is a", "test")

and you’ll have a file named something like `june_19__2015_at_0927am.txt` in a Dropbox folder that should have values like:

Value 1: this
Value 2: is a
Value 3: test

The potential IFTTT integrations are not limited to Dropbox (or iOS/Android notification as in Brian’s example). You can literally trigger _anything_ from R in the IFTTT catalogue. Go foRth and automate!

### DOing even MoRe

As stated, Brian’s code was complete, so making a package to post to IFTTT’s Maker channel was really easy. But, this is R and if we can get data _to_ it, things can be _even cooler_. So, I made a `receiver` function that will catch Maker channel web requests and do stuff with them. Since I have an Apple Watch, I decided I wanted to test the interface out with IFTTT’s [DO button](http://blog.ifttt.com/post/116563004243/do-for-ipad-and-apple-watch) and have R receive my coordinates whenever I press this:

[image: the DO button]

(That’s an actual screen capture from the Apple Watch).

If you have both R and an Apple Watch, you can create a DO app as such (NOTE that DO apps work fine on the iPhone _without_ an Apple Watch):

[image: the DO recipe configuration for the Maker channel]

That will `POST` a bit of JSON to whatever is listening for it. To get R to listen to it, you need to build a small script that I’ll call `listen.R`, create a function to process the data and register it with the instance of the [Rook](http://cran.r-project.org/web/packages/Rook/README.html) web server that it will automagically create for you. You can specify which IP address and/or port to listen on as well. NOTE that this R server _must_ be reachable _from_ the internet, so you may need to do port forwarding if behind a NAT firewall.

My sample `listen.R` looks like this:

library(nifffty)
 
# not the best name for a function but it works
do_it <- function(req) {
  require(jsonlite)
  print(fromJSON(names(req$POST())))
  writeLines(names(req$POST()), "/tmp/bob.txt")
}
 
# you can change the port to match what you put into IFTTT
rcvr <- receiver(port=10999, handler=do_it)
 
# this is not necessary but it's a good diagnostic for a test script
print(rcvr)
 
# keeps the script alive
while (TRUE) Sys.sleep(24 * 60 * 60)

You can run that from a command-line (I tested it on Ubuntu) as such:

bob@server:~$ Rscript listen.R
Loading required package: methods
Loading required package: Rook
Server started on 0.0.0.0:10999
[1] nifffty http://0.0.0.0:10999/custom/nifffty
 
Call browse() with an index number or name to run an application.

When you tap the DO button on the Apple Watch, you’ll see the following in the console:

Loading required package: jsonlite
 
Attaching package: ‘jsonlite’
 
The following object is masked from ‘package:utils’:
 
    View
 
$latitude
[1] 43.25931
 
$longitude
[1] -70.80062
 
$timestamp
[1] "June 19, 2015 at 04:45PM"

and the following in the file it was instructed to write to:

bob@server:~$ cat /tmp/bob.txt
{ "latitude": 43.2593566552207, "longitude": -70.8004647307757, "timestamp": "June 19, 2015 at 04:46PM" }

JSON is a good choice given how easy it is to work with in R and you can have R do _anything_ you can dream up with the data that’s sent to it. You can connect _any_ supported IFTTT component and hook any variable from said component into the JSON that R will receive.

### Coming Soon

The next step is to make a Shiny interface so it can receive and process real-time events. I’m hoping to make a sample Shiny app that will update my location on a map every time I tap the DO button and also compute some stats for it. I also need to put some polish on the receiver interface (add logging and some error checking).

If you have cool ideas or have hooked R up to IFTTT’s Maker channel to do something cool, drop a note in the comments and definitely submit any pull requests or issues [on github](https://github.com/hrbrmstr/nifffty).

I’m super-pleased to announce that the Benevolent CRAN Overlords [accepted the metricsgraphics package](http://cran.r-project.org/web/packages/metricsgraphics/index.html) into CRAN over the weekend. Now, you no longer need to rely on github/devtools to use [MetricsGraphics.js](http://metricsgraphicsjs.org/) charts from your R scripts. If you’re not familiar with `htmlwidgets`, take a look at [the official site for them](http://www.htmlwidgets.org/).

To make it easier to grok the package, I replicated many of the core [MetricsGraphics examples](http://metricsgraphicsjs.org/examples.htm) in the package [vignette](http://cran.r-project.org/web/packages/metricsgraphics/vignettes/introductiontometricsgraphics.html) (which is also below).
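If you just want to kick the tires, here’s a minimal sketch (essentially the package’s canonical starter example):

install.packages("metricsgraphics")
 
library(metricsgraphics)
 
# basic interactive scatterplot
mjs_plot(mtcars, x=wt, y=mpg) %>%
  mjs_point()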

I’ll be finishing up support for all of the features of MetricsGraphics library, most importantly `POSIX[cl]t` support for time ranges in the not-too-distant future. You can drop feature requests, questions or problems [over at github](https://github.com/hrbrmstr/metricsgraphics/issues).

The recent announcement of the [start of egg rationing in the U.S.](http://www.washingtonpost.com/blogs/wonkblog/wp/2015/06/05/the-largest-grocer-in-the-texas-is-now-rationing-eggs/) made me curious enough about the avian flu outbreak to try to dig into the numbers a bit. I finally stumbled upon a [USDA site](http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVPBkqIwEP2WOexpCxMBAY-oo6Kybo01q3JJNSGB1GCgSNTi7zcwe3CnZnQ3h1Sl-3X369cdlKADSiRcRA5aVBLK7p14ZLVd2sMJtqPFbvyMox-_5nGw8Z3t0jWAowHgL06I_47friOvi3_Bk-VsiHcO2qMEJVTqWhfoCHUhFKGV1ExqUoq0gab9hhWQ6twQXtGz6l8gxQlKUjAodXFryYRioBgRklfNqW_i3X0RIG_xGdOMdm5F0pYoDZqZ1FQTEKQGKrighJftFdqOX01Fho7c9iiAzS3HG6WWm2HbSnmAzYXxyA3AG1L-R487Df-TntNFuHT9jVHQDWwczUywP44xjrxH8b2eDzL0gHsj-1Bk8TwxReabn_56ZeP1CB0NSf9LFmMX7f5TtdXDtmYoib9z5YadQnYTT-PcVABdWN2s0eHuDry7b3agN3y2A-jw6Q4YfnlZpeZD7Ke3REKZOoEh0jDOGtYMikppdLher4OzymCQVxdUn15PgdMK6-0lwM7obR9ayTx_evoNPxBrVg!!/?1dmy&urile=wcm:path:/APHIS_Content_Library/SA_Our_Focus/SA_Animal_Health/SA_Animal_Disease_Information/SA_Avian_Health/SA_Detections_by_States/) that had an embedded HTML table of flock outbreak statistics by state, county and date (also flock type and whether it was a commercial enterprise or “backyard” farm). Just looking at the sum of flock sizes on that page shows that nearly 50 million birds have been impacted since December, 2014.

We can scrape the data with R & `rvest` and then use the shapefile hexbins from [previous posts](http://rud.is/b/2015/05/15/u-s-drought-monitoring-with-hexbin-state-maps-in-r/) to watch the spread week-over-week.

The number of packages I ended up relying on was a bit surprising. Let’s get them out of the way before focusing on the scraping and hexbin-making:

library(rvest)     # scraping
library(stringr)   # string manipulation
library(lubridate) # date conversion
library(dplyr)     # data munging
library(zoo)       # for locf
library(ggplot2)   # plotting
library(rgdal)     # map stuff
library(rgeos)     # map stuff

We also end up using `magrittr` and `tidyr` but only for one function, so you’ll see those with `::` in the code.
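One note before we start: the `us_map` and `centers` objects used in the plotting loop later on come from the hexbin shapefile work in the earlier post linked above. As a refresher, that setup looks roughly like this (it assumes you have the `us_states_hexgrid.geojson` file from that post):

# hexbin state map + state-label centers (see the drought-monitoring post)
us <- readOGR("us_states_hexgrid.geojson", "OGRGeoJSON")
centers <- data.frame(gCentroid(us, byid=TRUE), id=us@data$iso3166_2)
us_map <- fortify(us, region="iso3166_2")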

Grabbing the USDA page is pretty straightforward:

url <- "http://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/ct_avian_influenza_disease/!ut/p/a1/lVJbb4IwFP41e1qwFZDLI-oUnGgyswm8kAMUaAaFQNG4X7-ibnEPYtakDz3nO_kupyhAHgoYHGgGnFYMiv4daOFqa8vjKZad5c58wc7mY-Eaa13Z2qoA-AKA7xwL_53fvjpaP_-Gp_Z8jHcK2qMABTHjNc-RD3VO2zCuGCeMhwWNGmhOT7iFsOqaMK3irj2_gNESijAnUPD8tpLQlkBLQsrSqinPJi7tAwX2i4_5tSBgRUfYF_wM9mLqmCbIj2QzxZpMJMUYg6TGkSLBBCaSPEnSJIljXVH0q_kBdw_CO5sXkNnSslV9LQJTDRk7czGumy7GjnYFDOTrCw36XRJTRbt_mlo9VD1FgfuctqrVByA37szNBAPwXOpzR97gPi7tm30gb2AfQkxWVJH4ifvZLavFIsUQrA1JSUOaUV61HHnH43HUtQmMsuqA6vK9NJST9JluNlKwyPr7DT6YvRs!/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_pacific_flyway"
 
#' read in the data, extract the table and clean up the fields
#' also clean up the column names since they are fairly nasty
 
pg <- html(url)

If you poke at the source for the page you’ll see there are two tables in the code and we only need the first one. Also, if you scan the rendered table on the USDA page by eye you’ll see that the column names are horrible for data analysis work and they are also inconsistent in the values used for various columns. Furthermore, there are commas in the flock counts and it would be handy to have the date as an actual date type. We can extract the table we need and clean all that up in a reasonably-sized `dplyr` pipe:

pg %>%
  html_nodes("table") %>%
  magrittr::extract2(1) %>%
  html_table(header=TRUE) %>%
  filter(`Flock size`!="pending") %>%
  mutate(Species=str_replace(tolower(Species), "s$", ""),
         `Avian influenza subtype*`=str_replace_all(`Avian influenza subtype*`, " ", ""),
         `Flock size`=as.numeric(str_replace_all(`Flock size`, ",", "")),
         `Confirmation date`=as.Date(mdy(`Confirmation date`))) %>%
  rename(state=State, county=County, flyway=Flyway, flock_type=`Flock type`,
         species=Species, subtype=`Avian influenza subtype*`, date=`Confirmation date`,
         flock_size=`Flock size`) -> birds

Let’s take a look at what we have:

glimpse(birds)
 
## Observations: 202
## Variables:
## $ state      (chr) "Iowa", "Minnesota", "Minnesota", "Iowa", "Minnesota", "Iowa",...
## $ county     (chr) "Sac", "Renville", "Renville", "Hamilton", "Kandiyohi", "Hamil...
## $ flyway     (chr) "Mississippi", "Mississippi", "Mississippi", "Mississippi", "M...
## $ flock_type (chr) "Commercial", "Commercial", "Commercial", "Commercial", "Comme...
## $ species    (chr) "turkey", "chicken", "turkey", "turkey", "turkey", "turkey", "...
## $ subtype    (chr) "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM-H5N2", "EA/AM...
## $ date       (date) 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-04, 2015-06-03, 2...
## $ flock_size (dbl) 42200, 415000, 24800, 19600, 37000, 26200, 17200, 1115700, 159...

To make an animated map of cumulative flock totals by week, we’ll need to

– group the `birds` data frame by week and state
– calculate the cumulative sums
– fill in the gaps where there are missing state/week combinations
– carry the last observations by state/week forward in this expanded data frame
– make breaks for data ranges so we can more intelligently map them to colors

This ends up being a longer `dplyr` pipe than I usually like to code (I think very long ones are hard to follow) but it gets the job done and is still pretty readable:

birds %>%
  mutate(week=as.numeric(format(date, "%Y%U"))) %>%
  arrange(week) %>%
  group_by(week, state) %>%
  tally(flock_size) %>%
  group_by(state) %>%
  mutate(cum=cumsum(n)) %>%
  ungroup %>%
  select(week, state, cum) %>%
  mutate(week=as.Date(paste(week, 1), "%Y%U %u")) %>%
  left_join(tidyr::expand(., week, state), .) %>%
  group_by(state) %>%
  do(na.locf(.)) %>%
  mutate(state_abb=state.abb[match(state, state.name)],
         cum=as.numeric(ifelse(is.na(cum), 0, cum)),
         brks=cut(cum,
                  breaks=c(0, 200, 50000, 1000000, 10000000, 50000000),
                  labels=c("1-200", "201-50K", "50k-1m",
                           "1m-10m", "10m-50m"))) -> by_state_and_week

Now, we perform the standard animation steps:

– determine where we’re going to break the data up
– feed that into a loop
– partition the data in the loop
– render the plot to a file
– combine all the individual images into an animation

For this graphic, I’m doing something a bit extra. The color ranges for the hexbin choropleth go from very light to very dark, so it would be helpful if the titles for the states went from very dark to very light, matching the state colors. The lines that do this check for state breaks that fall in the last two values and appropriately assign `"black"` or `"white"` as the color.

i <- 0
 
for (wk in unique(by_state_and_week$week)) {
 
  # filter by week
 
  by_state_and_week %>% filter(week==wk) -> this_wk
 
  # hack to let us color the state labels in white or black depending on
  # the value of the fill
 
  this_wk %>%
    filter(brks %in% c("1m-10m", "10m-50m")) %>%
    .$state_abb %>%
    unique -> white_states
 
  centers %>%
    mutate(txt_col="black") %>%
    mutate(txt_col=ifelse(id %in% white_states, "white", "black")) -> centers
 
  # setup the plot
 
  gg <- ggplot()
  gg <- gg + geom_map(data=us_map, map=us_map,
                      aes(x=long, y=lat, map_id=id),
                      color="white", fill="#dddddd", size=2)
  gg <- gg + geom_map(data=this_wk, map=us_map,
                      aes(fill=brks, map_id=state_abb),
                      color="white", size=2)
  gg <- gg + geom_text(data=centers,
                       aes(label=id, x=x, y=y, color=txt_col), size=4)
  gg <- gg + scale_color_identity()
  gg <- gg + scale_fill_brewer(name="Combined flock size\n(all types)",
                               palette="RdPu", na.value="#dddddd", drop=FALSE)
  gg <- gg + guides(fill=guide_legend(override.aes=list(colour=NA)))
  gg <- gg + coord_map()
  gg <- gg + labs(x=NULL, y=NULL,
                  title=sprintf("U.S. Avian Flu Total Impact as of %s\n", wk))
  gg <- gg + theme_bw()
  gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(panel.grid=element_blank())
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_blank())
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.direction="horizontal")
  gg <- gg + theme(legend.title.align=1)
 
  # save the image
 
  # i'm using "quartz" here since I'm on a Mac. Use what works for you system to ensure you
  # get the best looking output png
 
  png(sprintf("output/%03d.png", i), width=800, height=500, type="quartz")
  print(gg)
  dev.off()
 
  i <- i + 1
 
}

We could use one of the R animation packages to actually make the animation, but I know ImageMagick pretty well so I just call it as a `system` command:

system("convert -delay 60 -loop 1 output/*png output/avian.gif")

All that results in:

[animation: cumulative U.S. avian flu impact by state, week over week]

If that’s a static image, open it in a new tab/window (or just click on it). I really didn’t want to do a looping gif, but if you want one, just change `-loop 1` to `-loop 0`.

Now, we can just re-run the code when the USDA refreshes the data.

The code, data and sample bitmaps are on [github](https://github.com/hrbrmstr/avianflu).

I set aside a small bit of time to give [rbokeh](https://github.com/bokeh/rbokeh) a try and figured I’d share a small bit of code that shows how to make the “same” chart in both ggplot2 and rbokeh.

#### What is Bokeh/rbokeh?

rbokeh is an [htmlwidget](http://htmlwidgets.org) wrapper for the [Bokeh](http://bokeh.pydata.org/en/latest/) visualization library that has become quite popular in Python circles. Bokeh makes creating interactive charts pretty simple and rbokeh lets you do it all with R syntax.

#### Comparing ggplot & rbokeh

This is not a comprehensive introduction into rbokeh. You can get that [here (officially)](http://hafen.github.io/rbokeh/). I merely wanted to show how a ggplot idiom would map to an rbokeh one for those that may be looking to try out the rbokeh library and are familiar with ggplot. They share a very common “grammar of graphics” base where you have a plot structure and add layers and aesthetics. They each do this a tad bit differently, though, as you’ll see.

First, let’s plot a line graph with some markers in ggplot. The data I’m using is a small time series whose cumulative sum we’ll plot as a line graph. It’s small enough to fit inline:

library(ggplot2)
library(rbokeh)
library(htmlwidgets)
 
structure(list(wk = structure(c(16069, 16237, 16244, 16251, 16279,
16286, 16300, 16307, 16314, 16321, 16328, 16335, 16342, 16349,
16356, 16363, 16377, 16384, 16391, 16398, 16412, 16419, 16426,
16440, 16447, 16454, 16468, 16475, 16496, 16503, 16510, 16517,
16524, 16538, 16552, 16559, 16566, 16573), class = "Date"), n = c(1L,
1L, 1L, 1L, 3L, 1L, 3L, 2L, 4L, 2L, 3L, 2L, 5L, 5L, 1L, 1L, 3L,
3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 7L, 1L, 2L, 6L, 7L, 1L, 1L,
1L, 2L, 2L, 7L, 1L)), .Names = c("wk", "n"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -38L)) -> by_week
 
events <- data.frame(when=as.Date(c("2014-10-09", "2015-03-20", "2015-05-15")),
                     what=c("Thing1", "Thing2", "Thing2"))

The ggplot version is pretty straightforward:

gg <- ggplot()
gg <- gg + geom_vline(data=events,
                      aes(xintercept=as.numeric(when), color=what),
                      linetype="dashed", alpha=1/2)
gg <- gg + geom_text(data=events,
                     aes(x=when, y=1, label=what, color=what),
                     hjust=1.1, size=3)
gg <- gg + geom_line(data=by_week, aes(x=wk, y=cumsum(n)))
gg <- gg + scale_x_date(expand=c(0, 0))
gg <- gg + scale_y_continuous(limits=c(0, 100))
gg <- gg + labs(x=NULL, y="Cumulative Stuff")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

We:

– setup a base ggplot object
– add a layer of marker lines (which are the 3 `events` dates)
– add a layer of text for the marker lines
– add a layer of the actual line – note that we can use `cumsum(n)` vs pre-compute it
– setup scale and other aesthetic properties

That gives us this:

[figure: the ggplot2 version of the chart]

Here’s a similar structure in rbokeh:

figure(width=550, height=375,
       logo="grey", outline_line_alpha=0) %>%
  ly_abline(v=events$when, color=c("red", "blue", "blue"), type=2, alpha=1/4) %>%
  ly_text(x=events$when, y=5, color=c("red", "blue", "blue"),
          text=events$what, align="right", font_size="7pt") %>%
  ly_lines(x=wk, y=cumsum(n), data=by_week) %>%
  y_range(c(0, 100)) %>%
  x_axis(grid=FALSE, label=NULL,
         major_label_text_font_size="8pt",
         axis_line_alpha=0) %>%
  y_axis(grid=FALSE,
         label="Cumulative Stuff",
         minor_tick_line_alpha=0,
         axis_label_text_font_size="10pt",
         axis_line_alpha=0) -> rb
rb

Here, we set the `width` and `height` and configure some of the initial aesthetic options. Note that `outline_line_alpha=0` is the equivalent of `theme(panel.border=element_blank())`.

The markers and text do not work exactly as one might expect since there’s no way to specify a `data` parameter, so we have to set the colors manually. Also, since the target is a browser, sizes are specified the way they would be in CSS (e.g. `"7pt"`). However, it’s a pretty easy translation from `geom_[hv]line` to `ly_abline` and `geom_text` to `ly_text`.

The `ly_lines` works pretty much like `geom_line`.

Notice that both ggplot and rbokeh can grok dates for plotting (though we do not need the `as.numeric` hack for rbokeh).

rbokeh will auto-compute bounds like ggplot would but I wanted the scale to go from 0 to 100 in each plot. You can think of `y_range` as `ylim` in ggplot.

To configure the axes, you work directly with `x_axis` and `y_axis` parameters vs `theme` elements in ggplot. To turn off only the axis lines, I set their alpha to 0 in each and did the same with the y axis minor tick marks.

Here’s the rbokeh result:

NOTE: you can save out the widget with:

saveWidget(rb, file="rbokeh001.html")

and I like to use the following `iframe` settings to include the widgets:

<iframe style="max-width:100%" 
        src="rbokeh001.html" 
        sandbox="allow-same-origin allow-scripts" 
        width="100%" 
        height="400" 
        scrolling="no" 
        seamless="seamless" 
        frameBorder="0"></iframe>

#### Wrapping up

Hopefully this helped a bit with translating some ggplot idioms over to rbokeh and developing a working mental model of rbokeh plots. As I play with it a bit more I’ll add some more examples here in the event there are “tricks” that need to be exposed. You can find the code [up on github](https://gist.github.com/hrbrmstr/a3a1be8132530b355bf9) and please feel free to drop a note in the comments if there are better ways of doing what I did or if you have other hints for folks.

School of Data had a recent post on how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

[screenshot: the AllFlicks list of Netflix movies & TV shows, with DataTables-style search & paging controls]

(i.e. by the controls & style of the controls available) and in the source:

[screenshot: the AllFlicks page source showing the jQuery DataTables includes]

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

[screenshot: Chrome Developer Tools showing the request to processing_us.php]

Here, I just used Chrome’s Developer Tools to look at the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:

[screenshot: the parsed query-string parameters for the same request]

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

### Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)
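As an aside, if you’d rather not carry that wall-of-URL around, the key server-side parameters can go in `httr`’s `query` argument instead (a sketch only: the per-column definitions from the full URL are omitted here, and the server may insist on them):

# same endpoint, with the important DataTables parameters made explicit
resp <- GET("http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php",
            query=list(draw=1, start=0, length=7448,
                       movies="true", shows="true",
                       documentaries="true", rating="netflix"))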

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src=\"http://cdn1.nflximg.net/images/9159/12119159.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/6195/20866195.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/3735/2243735.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn0.nflximg.net/images/2668/21112668.jpg\" width=\"55\" alt=\"Thumbnail\">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href=\"http://www.allflicks.net/category/thrillers/\">Thrillers</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

glimpse(recs$data)
 
## Observations: 7448
## Variables:
## $ box_art   (chr) "<img src=\"http://cdn1.nflximg.net/images/9159/12...
## $ title     (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ...
## $ year      (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2...
## $ rating    (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4....
## $ category  (chr) "<a href=\"http://www.allflicks.net/category/thril...
## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M...
## $ cast      (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T...
## $ director  (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R...
## $ id        (chr) "60022258", "70302834", "70053695", "80028529", "8...

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we can run again to scrape the site (i.e. when they add more movies to the database) vs a manual cut & paste workflow.
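Saving the table off for non-R workflows is a one-liner (filename is arbitrary):

write.csv(recs$data, "allflicks.csv", row.names=FALSE)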

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.