The SVE service is a fake service put out there solely to facilitate some inane puzzle. You can reference SVE-2015-9999 via this URL, and hopefully that will make it possible for Google to get it back into their index.
Author Archives: hrbrmstr
I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
> Wrote some R code for looking at historical crash data https://t.co/bgrMj3PIZu #data #r #DataMining pic.twitter.com/zYzpOyD9JY
>
> — PhilJ (@philjette) March 26, 2015
The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here’s all the packages I’ll be using:
```r
library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)
```
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
The basic setup is to:
– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `<br/>` separator and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as the previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
Some things worth mentioning:
– `data_frame` is super-helpful in not-creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR); the `.` just feels wrong in pipes to me, still (something like the sketch after this list is what I mean)
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
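To make the `gsub`/`ifelse` gripe concrete, here's the kind of hypothetical helper I have in mind. It is *not* part of `stringr`; just a sketch:

```r
# hypothetical pipe-friendly helper, NOT in stringr:
# NA-out a sentinel value mid-pipe without reaching for ifelse(.=="?", NA, .)
na_when <- function(x, value) {
  x[x == value] <- NA
  x
}

c("Cessna 172", "?") %>% na_when("?")
## [1] "Cessna 172" NA
```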
Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):
```r
#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {

  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"

  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }

  # get crash date

  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date

  # get location and operator

  loc_op <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op

    data_frame(location=loc, operator=op)

  }))

  # get type & registration

  type_reg <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg

    data_frame(type=typ, registration=reg)

  }))

  # get fatalities

  pg %>%
    html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]),
                 fatalities=as.numeric(x[3]),
                 ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>%
    tail(-1) -> afg

  bind_cols(date, loc_op, type_reg, afg)

}
```
While that gets one year, it’s super-simple to get all crashes since 1950:
```r
crashes <- bind_rows(pblapply(1950:2015, get_data))
```
Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
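If you want a quick, optional sanity check on what came back, a couple of `dplyr` verbs will do (this assumes the `crashes` object from above and the packages loaded earlier):

```r
# optional sanity check: crash counts by decade (year() is from lubridate)
crashes %>%
  mutate(decade=10 * (year(date) %/% 10)) %>%
  count(decade)
```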
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.
Over on The DO Loop, @RickWicklin does a nice job [visualizing the causes of airline crashes](http://blogs.sas.com/content/iml/2015/03/30/visualizing-airline-crashes/) in SAS using a mosaic plot. More often than not, I find mosaic plots a bit difficult to grok, but Rick's use was spot on and I believe it shows the data pretty well. I also thought I'd take the opportunity to:
– Give @jennybc’s new [googlesheets](http://github.com/jennybc/googlesheets) a spin
– Show some `dplyr` & `tidyr` data wrangling (never can have too many examples)
– Crank out some `ggplot` zero-based streamgraph-y area charts for the data with some extra `ggplot` wrangling for good measure
I also decided to use the colors in the [original David McCandless/Kashan visualization](http://www.informationisbeautiful.net/visualizations/plane-truth-every-single-commercial-plane-crash-visualized/).
#### Getting The Data
As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site [makes the data available](https://docs.google.com/spreadsheet/ccc?key=0AjOUPqcIwvnjdEx2akx5ZjJXSk9oM1E3dWpqZFJ6Nmc&usp=drive_web#gid=1), so I copied it to my Google Drive and gave her package a go:
```r
library(googlesheets)
library(ggplot2)
# we'll need the rest of the libraries later,
# but just getting them out of the way
library(dplyr)
library(tidyr)

# this will prompt for authentication the first time
my_sheets <- list_sheets()

# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)

## [1] "Copy of Flight Risk JSON" "Flight Risk JSON"

# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")

# take a quick look
glimpse(flights_csv)

## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal

# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv
```
#### A Bit of Cleanup
Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.
The area chart also needs full values for each category combo per year, so we use `expand` from `tidyr` with `left_join` and `mutate` to fill in the gaps.
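If that `expand`/`left_join` gap-filling pattern is new to you, here's a toy version with made-up data (not the flight data) showing what it does:

```r
# toy version of the expand()/left_join() gap-filling pattern (made-up data)
df <- data_frame(year=c(2000, 2000, 2001),
                 meta=c("a", "b", "a"),
                 n=c(5, 2, 3))

left_join(expand(df, year, meta), df, by=c("year", "meta")) %>%
  mutate(n=ifelse(is.na(n), 0, n))
# the missing (2001, "b") combination now shows up with n == 0
```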
Finally, we make proper, ordered labels:
```r
flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>%
  mutate(n=ifelse(is.na(n), 0, n)) %>%
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase,
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat
```
I probably took some liberties lumping "climb" in with "takeoff", but I'd have asked an expert for a production piece, just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgeable in cybersecurity.
#### The Final Plot
I'm a big fan of an incremental, additive build idiom for `ggplot` graphics. By using the `gg <- gg + …` style one can move lines around, comment them out, etc. without dealing with errant `+` signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build order as follows:

- main `ggplot` call with mappings if the graph is short, otherwise add the mappings to the `geom`s
- all `geom_` or `stat_` layers in the order I want them, using line breaks to logically separate elements (like `aes`) or to wrap long lines for easier readability
- all `scale_` elements in order from axes to line to shape to color to fill to alpha; I'm not as consistent as I'd like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
- `facet` call (if any)
- label setting, always with `labs` unless I really have a need for using `ggtitle`
- base `theme_` call
- all other `theme` elements, one per `gg <- gg +` line

I know that's not everyone's cup of tea, but it's just how I roll `ggplot`-style.

For this plot, I use a smoothed stacked area chart with a custom smoother, and Futura Medium for the text font. Substitute your own favorite font if you don't have Futura Medium.
```r
flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")

gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta))
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x))
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette,
                             labels=c("Criminal", "Human Error", "Mechanical",
                                      "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg
```
That ultimately produces:
with the facets ordered by the takeoff, en route, approach and landing phases. Overall, things have gotten way better, though I haven't had time to look into the _bump_ between 2005 and 2010 for landing crashes.
As an aside, Boeing has a [really nice PDF](http://www.boeing.com/news/techissues/pdf/statsum.pdf) on some of this data with quite a bit more detail.
It seems Ruben C. Arslan had the waffle idea about the same time I did. Apart from some extra-spiffy XKCD-like styling, one other thing his waffling routines allowed for was using FontAwesome icons. When you use an icon instead of a block you are really making a basic version of an isotype pictogram, and they can add a dimension to the story you're trying to tell without using any words. I've added two parameters to a pre-release CRAN version of `waffle` that I'd like folks to kick the tyres on a bit: `use_glyph`, which is either `FALSE` or a character string naming a FontAwesome icon (more on that in a bit), and `glyph_size`, a numeric value for the font size (needed since glyphs won't scale when the graphic resizes).
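Here's a minimal sketch of the two parameters working together. The icon name ("paw") and the sizes are arbitrary choices for illustration, and the FontAwesome setup described next has to be done first:

```r
# minimal sketch of use_glyph + glyph_size on the dev branch of waffle;
# FontAwesome must already be imported via extrafont (see setup below)
library(waffle)
library(extrafont)

waffle(c(dogs=7, cats=5), rows=3, use_glyph="paw", glyph_size=6)
```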
#### Fonts in R & waffle
One part of R that is (with apologies to Winston and others) weak is fonts. You can use fonts, but doing so is often not pretty (despite guidance on the subject) and not without problems: we tried using a custom font again for this year's DBIR graphics, failed miserably (again) due to font issues in R, and had to have the graphics folks substitute the fonts in.
To use the FontAwesome glyphs you need to:
- grab the `ttf` version from here
- install it on your system
- install the `extrafont` package
- run `font_import()` (get some coffee/scotch while you wait)
- load `extrafont` when you need to use these glyphs (a condensed code sketch of these steps follows)
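In code, the R-side steps boil down to something like this (a sketch; the font file itself still gets installed at the OS level):

```r
# condensed sketch of the R-side setup steps
install.packages("extrafont")
library(extrafont)
font_import() # the slow, coffee/scotch step: scans & imports system fonts
fonts()       # verify "FontAwesome" made the list
```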
Once you do that, you're probably ready to make isotype pictograms with `waffle`. I say probably since this process worked on two of my OS X systems but not a third: same R version, same RStudio version, same import process. (This is part of the reason for my lament about the state of fonts, since I'm not exactly an n00b with R, Macs or fonts.)
#### Making isotype pictograms
I did borrow some code from Ruben, but I hate typing unicode characters and I suspect most folks do as well. If you do any work in straight HTML/CSS, you know you can just refer to the various FontAwesome glyphs by name. To use FontAwesome glyphs with `waffle` you specify the font name (no `fa-` prefix) instead of the unicode character. If you want to see what's available (and don't want to bookmark the FontAwesome site) you can run either `fa_list()`, which will give you a list of available FontAwesome glyph names, or `fa_grep()` with a pattern. For example, running `fa_grep("car")` gives you:
## [1] "car" "caret-down" "caret-left" ## [4] "caret-right" "caret-square-o-down" "caret-square-o-left" ## [7] "caret-square-o-right" "caret-square-o-up" "caret-up" ## [10] "cart-arrow-down" "cart-plus" "cc-mastercard" ## [13] "credit-card" "shopping-cart"
Any `grep` regex will work in that function.
You'll need to `devtools::install_github("hrbrmstr/waffle", ref="cran")` to use the dev/pre-CRAN version of `waffle` before doing anything.
To make an isotype pictogram version of the health records breaches waffle chart, you can do the following:
```r
library(waffle)
library(extrafont)

parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)

waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"),
       use_glyph="medkit", size=8)
```
So, please kick the tyres, post comments about your font successes & woes and definitely link to any isotype pictograms you create.
Vis expert Naomi Robbins did an excellent [critique](http://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/) of the [graphics](http://www.jpost.com/Israel-Elections/Analysis-The-Israel-election-decided-by-one-vote-394229) that went along with an article on the Israeli election in the Jerusalem Post.
Non-uniform and color-blind-unfriendly categorical colors and disproportionate arc sizes are definitely three substantial issues in that series of visualizations. We can rectify all of them with two new packages of mine: [waffle](http://github.com/hrbrmstr/waffle) & [adobecolor](http://github.com/hrbrmstr/adobecolor). The former provides a good alternative to pie charts (no charts at all are a good alternative to pie charts) and the latter makes it possible to share color palettes without passing long strings of hex-encoded colors.
Using [XScope](http://xscopeapp.com/) I encoded a color-blind-friendly palette from [Brian Connelly](http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/) and saved the palette off as an Adobe Color file (`ACO`). I then took the values from the charts and mapped each party to a particular color. Then, I made ordered and proportional waffle charts using the values and aligned colors. The results are below:
# install.packages("waffle") # devtools::install_github("hrbrmstr/swatches") library(waffle) library(swatches) national_unity <- c(`Zionist Union (27)`=27, `Likud (27)`=27, `Kulanu (10)`=10, `Shas (7)`=7, `UTJ (6)`=6) right_wing <- c(`Likud (27)`=27, `Kulanu (10)`=10, `Bayit Yehudi (8)`=8, `Shas (7)`=7, `UTJ (6)`=6, `Yisrael Beytenu (5)`=5) herzog_led <- c(`Zionist Union (27)`=27, `Kulanu (10)`=10, `Shas (7)`=7, `UTJ (6)`=6, `Meretz (5)`=5) party_colors <- rev(read_aco("http://rud.is/dl/israel.aco")) zion <- party_colors[1] likud <- party_colors[2] kulanu <- party_colors[3] shas <- party_colors[4] utj <- party_colors[5] visrael <- party_colors[6] meretz <- party_colors[7] bayit <- party_colors[6] nw <- waffle(national_unity, rows=5, colors=c(zion, likud, kulanu, shas, utj), title="\nNational unity government") + theme(plot.title=element_text(size=12, face="bold")) rw <- waffle(right_wing, rows=5, colors=c(likud, kulanu, bayit, shas, utj, visrael), title="\nRight Wing", pad=3) + theme(plot.title=element_text(size=12, face="bold")) hw <- waffle(herzog_led, rows=5, colors=c(zion, kulanu, shas, utj, meretz), title="\nHerzog led", pad=5) + theme(plot.title=element_text(size=12, face="bold")) iron(nw, rw, hw)
If I knew my audience did not have color processing issues, I’d use a better palette. Regardless, these results are far better than the careless pies presented in the original story. The squares represent the same quantities in each chart and the colors also map to the parties.
Honestly, though, you could get a better idea with simple, un-tweaked base graphics bar charts:
```r
par(mfrow=c(3,1))
barplot(national_unity, col=c(zion, likud, kulanu, shas, utj),
        main="National unity government")
barplot(right_wing, col=c(likud, kulanu, bayit, shas, utj, visrael),
        main="Right Wing")
barplot(herzog_led, col=c(zion, kulanu, shas, utj, meretz),
        main="Herzog led")
```
Please consider your readers and the message you’re trying to convey when developing visualizations, especially when you have as large an audience as the Jerusalem Post.
install.packages("waffle")
and library(waffle)
vs the devtools
dance.My disdain for pie charts is fairly well-known, but I do concede that there are times one needs to communicate parts of a whole graphically verses using just words or a table. When that need arises, I’m partial to “waffle charts” or “square pie charts”. @eagereyes did a great post a while ago on them (make sure to read the ‘debate’ between Robert and @hadleywickham in the comments, too), so head there for the low-down on them. Rather than have every waffle chart I make be a one-off creation, I made an R package for them.
There is currently one function in the package, `waffle`, and said function doesn't (yet) mimic all the goodness of these charts as described in Robert's post. It does, however, do a pretty decent job covering the basics. Let's take the oft-cited New York Times "debt" graphic:
We can replicate that pretty closely in R. To make it as simple as possible, the `waffle` function takes a named numeric vector. If no names are specified, or you leave some names out, `LETTERS` will be used to fill in the gaps. The function takes your data quite literally, so if you give it a vector that sums to, say, 10,000, then it will try to create a ggplot object with 10,000 `geom_rect` elements. Needless to say, that's a bad idea. So, I suggest keeping the raw numbers in the vector and passing a scaled version of the vector to the function; that way you can play with the values to get the desired look. Here's the R version of the NYT graphic:
```r
# devtools::install_github("hrbrmstr/waffle")
library(waffle)

savings <- c(`Mortgage ($84,911)`=84911,
             `Auto and\ntuition loans ($14,414)`=14414,
             `Home equity loans ($10,062)`=10062,
             `Credit Cards ($8,565)`=8565)

waffle(savings/392, rows=7, size=0.5,
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"),
       title="Average Household Savings Each Year",
       xlab="1 square == $392")
```
This package evolved from a teensy gist I made earlier this year to help communicate the scope of the Anthem data breach in the US. Since then, a recent breach at Premera occurred and added to the tally. Here’s two views of that data, one with one square equalling one million people and another with one square equalling ten million people (using the blue shade from each of the company’s logos):
```r
parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)

waffle(parts, rows=8, size=1, colors=c("#969696", "#1879bf", "#009bda"),
       title="Health records breaches as fraction of US Population",
       xlab="One square == 1m ppl")
```

```r
waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"),
       title="Health records breaches as fraction of US Population",
       xlab="One square == 10m ppl")
```
I'm betting that gets a lot bluer by the end of the year.
The function returns a ggplot object, so fonts, sizes, etc. can all be customized (a quick example below), and the source is up on github for all to play with and contribute to.
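Since `waffle()` hands back a ggplot object, standard `ggplot2` theming applies. Here's a small sketch restyling the title of the breaches chart from above:

```r
# waffle() returns a ggplot object, so standard ggplot2 tweaks apply;
# e.g., restyling the title of the breaches chart from above
library(ggplot2)

wf <- waffle(parts, rows=8, size=1,
             colors=c("#969696", "#1879bf", "#009bda"),
             title="Health records breaches as fraction of US Population",
             xlab="One square == 1m ppl")

wf + theme(plot.title=element_text(size=16, face="bold"))
```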
Along with adding support for filling in the chart as shown in the @eagereyes post, an htmlwidget version is also coming. Standard drill applies: issues/enhancements to github issues, feedback and your own examples in the comments.
UPDATE: Thanks to a PR by @timelyportfolio, there is now a widget option in the package.
I’ve been seeing an uptick in static US “lower 48” maps with “meh” projections this year, possibly caused by a flood of new folks resolving to learn R but using pretty old documentation or tutorials. I’ve also been seeing an uptick in folks needing to geocode US city/state to lat/lon. I thought I’d tackle both in a quick post to show how to (simply) use a decent projection for lower 48 US maps and then how to use a _very_ basic package I wrote – [localgeo](http://github.com/hrbrmstr/localgeo) to avoid having to use an external API/service for basic city/state geocoding.
### Albers All The Way
I could just plot an Albers projected map, but it’s more fun to add some data. We’ll start with some setup libraries and then read in some recent earthquake data, then filter it for our map display:
```r
library(ggplot2)
library(dplyr)
library(readr) # devtools::install_github("hadley/readr")

# Earthquakes --------------------------------------------------------------

# get quake data
quakes <- read_csv("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_month.csv")

# filter all but lower 48 US
quakes %>%
  filter(latitude>=24.396308, latitude<=49.384358,
         longitude>=-124.848974, longitude<=-66.885444) -> quakes

# bin by .5
quakes$Magnitude <- as.numeric(as.character(cut(quakes$mag,
                                                breaks=c(2.5, 3, 3.5, 4, 4.5, 5),
                                                labels=c(2.5, 3, 3.5, 4, 4.5),
                                                include.lowest=TRUE)))
```
Many of my mapping posts use quite a few R geo libraries, but this one just needs `ggplot2`. We extract the US map data, turn it into something `ggplot` can work with, then plot our quakes on the map:
```r
us <- map_data("state")
us <- fortify(us, region="region")

# theme_map ----------------------------------------------------------------
devtools::source_gist("33baa3a79c5cfef0f6df")

# plot ---------------------------------------------------------------------
gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=quakes,
                      aes(x=longitude, y=latitude, size=Magnitude),
                      color="#cb181d", alpha=1/3)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg
```
### Local Geocoding
There are many APIs with corresponding R packages/functions to perform geocoding (one really spiffy recent one is [geocodeHERE](http://cran.r-project.org/web/packages/geocodeHERE/)). While Nokia’s service is less restrictive than Google’s, most of these sites are going to have some kind of restriction on the number of calls per second/minute/day. You could always install the [Data Science Toolkit](http://www.datasciencetoolkit.org/) locally (note: it was down as of the original posting of this blog) and perform the geocoding locally, but it does take some effort (and space/memory) to setup and get going.
If you have relatively clean data and only need city/state resolution, you can use a package I made – [localgeo](http://github.com/hrbrmstr/localgeo) as an alternative. I took a US Gov census shapefile and extracted city, state, lat, lon into a data frame and put a lightweight function shim over it (it’s doing nothing more than `dplyr::left_join`). It won’t handle nuances like “St. Paul, MN” == “Saint Paul, MN” and, for now, it requires you to do the city/state splitting, but I’ll be tweaking it over the year to be a bit more forgiving.
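To give a sense of what that shim amounts to, here's a rough sketch of the idea with a tiny, hypothetical lookup table (the package's real table comes from the census shapefile and is far bigger; coordinates below are approximate):

```r
# rough sketch of the idea behind localgeo; `city_tbl` is a hypothetical
# stand-in for the package's census-derived lookup table
library(dplyr)

city_tbl <- data.frame(city=c("Portland", "Berlin"),
                       state=c("ME", "NH"),
                       lon=c(-70.26, -71.19),
                       lat=c(43.66, 44.47),
                       stringsAsFactors=FALSE)

geocode_sketch <- function(city, state) {
  # little more than a left join against the lookup table
  left_join(data.frame(city=city, state=state, stringsAsFactors=FALSE),
            city_tbl, by=c("city", "state"))
}

geocode_sketch("Portland", "ME")
```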
We can give this a go and map the [greenest cities in the US in 2014](http://www.nerdwallet.com/blog/cities/greenest-cities-america/) as crowned by, er, Nerd Wallet. I went for “small data file with city/state in it”, so if you know of a better source I’ll gladly use it instead. Nerd Wallet used DataWrapper, so getting the actual data was easy and here’s a small example of how to get the file, perform the local geocoding and use an Albers projection for plotting the points. The code below assumes you’re still in the R session that used some of the `library` calls earlier in the post.
```r
library(httr)
library(localgeo) # devtools::install_github("hrbrmstr/localgeo")
library(tidyr)

# greenest cities -----------------------------------------------------------
# via: http://www.nerdwallet.com/blog/cities/greenest-cities-america/
url <- "https://gist.githubusercontent.com/hrbrmstr/1078fb798e3ab17556d2/raw/53a9af8c4e0e3137a0a8d4d6332f7a6073d93fb5/greenest.csv"
greenest <- read.table(text=content(GET(url), as="text"),
                       sep=",", header=TRUE, stringsAsFactors=FALSE)

greenest %>%
  separate(City, c("city", "state"), sep=", ") %>%
  filter(!state %in% c("AK", "HI")) -> greenest

greenest_geo <- geocode(greenest$city, greenest$state)

gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=greenest_geo, aes(x=lon, y=lat),
                      shape=21, color="#006d2c", fill="#a1d99b", size=4)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + labs(title="Greenest Cities")
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg
```
Let me reinforce that the `localgeo` package will most assuredly fail to geocode some city/state combinations. I’m looking for a more comprehensive shapefile to ensure I have the complete list of cities and I’ll be adding some code to help make the lookups more forgiving. It may at least help when you bump into an API limit and need to crank out something in a hurry.
In preparation for using some of our streamgraphs for production (PDF/print) graphics, I ended up having to hand-edit labels on one of the graphics in an Adobe product. This bumped up the priority on adding annotation functions to the streamgraph package (you really don't want to have to hand-edit graphics if at all possible, trust me). To illustrate them, I'll use unemployment data that I started gathering for a course I'm teaching in the Fall.
We’ll start with the setup and initial data gathering:
```r
library(dplyr)
library(streamgraph)
library(pbapply)

url <- "http://www.bls.gov/lau/ststdsadata.txt"
dat <- readLines(url)
```
This data is not exactly in a happy format (hit the URL in your browser and you’ll see what I mean). It was definitely made for line printers/human consumption and I feel bad for any human that has to stare at it. The function I’m using to extract data is not necessarily what I’d do to just read in the whole data, but it’s more for teaching something else than optimization. It’ll do for our purposes here:
```r
get_state_data <- function(state) {

  # lines either start with the state name or are "Month YYYY" headers
  section <- paste("^%s| (",
                   paste0(month.name, collapse="|"),
                   ") +[[:digit:]]{4}", sep="", collapse="")
  section <- sprintf(section, state)

  vals <- gsub("^ +| +$", "", grep(section, dat, value=TRUE))

  # even lines hold the data; strip the dot leaders before the numbers
  state_vals <- gsub("^.* \\.+", "", vals[seq(from=2, to=length(vals), by=2)])

  cols <- read.table(text=state_vals)
  cols$month <- as.Date(sprintf("01 %s", vals[seq(from=1, to=length(vals), by=2)]),
                        format="%d %B %Y")
  cols$state <- state

  cols %>%
    select(8:9, 1:8) %>%
    mutate(V1=as.numeric(gsub(",", "", V1)),
           V2=as.numeric(gsub(",", "", V2)),
           V4=as.numeric(gsub(",", "", V4)),
           V6=as.numeric(gsub(",", "", V6)),
           V3=V3/100, V5=V5/100, V7=V7/100) %>%
    rename(civ_pop=V1, labor_force=V2, labor_force_pct=V3,
           employed=V4, employed_pct=V5,
           unemployed=V6, unemployed_pct=V7)
}

state_unemployment <- bind_rows(pblapply(state.name, get_state_data))
```
This will give us a data frame of employment(/unemployment) rates for all the (US) states. I only wanted to focus on New England and a few others for the course example, so this bit filters the rest out:
```r
state_unemployment %>%
  filter(state %in% c("California", "Ohio", "Rhode Island", "Maine",
                      "Massachusetts", "Connecticut", "Vermont",
                      "New Hampshire", "Nebraska")) -> some
```
With that setup out of the way, let me introduce the two new functions: `sg_add_marker` and `sg_annotate`. `sg_add_marker` adds a vertical, dotted line that spans the height of the graph and is placed at the designated spot on the x axis. You can add an optional label for the marker by specifying the y position, label text, color, size, space away from the line and how it’s aligned – start (left), center (middle), right (end). This is primarily useful for placing the label on either side of the line.
`sg_annotate` is for adding text anywhere on the streamgraph. The original use for it was to label streams, but you can use it any way you think would add meaning to your streamgraph. You can see them both in action below, where I plot the streamgraph for unemployment (%) for the selected states, then label the start of each recession since 1980 (with the peak national unemployment rate) with a marker and also label each stream:
```r
streamgraph(some, "state", "unemployed_pct", "month") %>%
  sg_axis_x(tick_interval=10, tick_units="year", tick_format="%Y") %>%
  sg_axis_y(0) %>%
  sg_add_marker(x=as.Date("1981-07-01"), "1981 (10.8%)", anchor="end") %>%
  sg_add_marker(x=as.Date("1990-07-01"), "1990 (7.8%)", anchor="start") %>%
  sg_add_marker(x=as.Date("2001-03-01"), "2001 (6.3%)", anchor="end") %>%
  sg_add_marker(x=as.Date("2007-12-01"), "2007 (10.1%)", anchor="end") %>%
  sg_annotate(label="Vermont", x=as.Date("1978-04-01"), y=0.6, color="#ffffff") %>%
  sg_annotate(label="Maine", x=as.Date("1978-03-01"), y=0.30, color="#ffffff") %>%
  sg_annotate(label="Nebraska", x=as.Date("1977-06-01"), y=0.41, color="#ffffff") %>%
  sg_annotate(label="Massachusetts", x=as.Date("1977-06-01"), y=0.36, color="#ffffff") %>%
  sg_annotate(label="New Hampshire", x=as.Date("1978-03-01"), y=0.435, color="#ffffff") %>%
  sg_annotate(label="California", x=as.Date("1978-02-01"), y=0.175, color="#ffffff") %>%
  sg_annotate(label="Rhode Island", x=as.Date("1977-11-01"), y=0.55, color="#ffffff") %>%
  sg_annotate(label="Ohio", x=as.Date("1978-06-01"), y=0.485, color="#ffffff") %>%
  sg_annotate(label="Connecticut", x=as.Date("1978-01-01"), y=0.235, color="#ffffff") %>%
  sg_fill_tableau() %>%
  sg_legend(show=TRUE)
```
I probably could have positioned the annotations a bit better, but this should be a good enough example to get the general idea. I may add an option to place the marker vertical lines behind the streamgraph and will be adding some toggle options to the interactive version (to hide/show markers and/or annotations).
As usual, the package is up [on github](https://github.com/hrbrmstr/streamgraph) and a contiguous copy of the above snippets are in [this gist](https://gist.github.com/hrbrmstr/4e181ae045807ca3a858).
Three final notes. First, I suggest enabling the y axis while you're figuring out where the y position for a label should be (since the y axis range is calculated from the summed span of the data); a quick sketch of that follows. Second, the x axis works with both dates and continuous values, but your markers and annotations need to match what you set the streamgraph up with. Finally, just a tip: I've found [SVG Crowbar 2](http://nytimes.github.io/svg-crowbar/) to be super-helpful when I need to extract these streamgraphs for non-interactive reproduction. Just yank the SVG out with it and hand it (or a converted form of it) to whomever is handling final production and they should be able to work with it.
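For that first note, something like this while you're positioning things:

```r
# temporarily show the y axis while dialing in marker/annotation positions,
# then switch back to sg_axis_y(0) for the final version
streamgraph(some, "state", "unemployed_pct", "month") %>%
  sg_axis_y() %>%
  sg_fill_tableau()
```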