
Author Archives: hrbrmstr


Junk Charts did a post on [Don’t pick your tool before having your design](http://junkcharts.typepad.com/junk_charts/2015/09/dont-pick-your-tool-before-having-your-design.html) and made a claim that this:

[The chart in question, from the Junk Charts post]

_”cannot be produced directly from a tool (without contorting your body in various painful locations)”_.

I beg to differ.

With R & ggplot2, I get to both pick my tool and design at the same time since I have a very flexible and multi-purpose tool. I also don’t believe that the code below qualifies as “contortions”, though I am a ggplot2 fanboi. It’s no different than Excel folks clicking on radio buttons and color pickers, except my process is easily repeatable & scalable once finalized (this is not finalized as it’s not 100% parameterized but it’s not difficult to do that last part).

library(ggplot2)
 
dat <- data.frame(year=2010:2015,
                  penalties=c(627, 625, 653, 617, 661, 730))
 
# 5-year (2010-2014) average, the 2015 value and a label for the callout
avg <- data.frame(val=mean(head(dat$penalties, -1)),
                  last=dat$penalties[6],
                  lab="5-Yr\nAvg")
 
gg <- ggplot(dat, aes(x=year, y=penalties))
gg <- gg + geom_point()
gg <- gg + scale_x_continuous(breaks=c(2010, 2014, 2015), limits=c(NA, 2015.1))
gg <- gg + scale_y_continuous(breaks=c(600, 650, 700, 750), 
                              limits=c(599, 751), expand=c(0,0))
# dashed reference line at the 5-year average
gg <- gg + geom_segment(data=avg, aes(x=2010, xend=2015, y=val, yend=val), linetype="dashed")
# vertical segment showing the jump from the average up to the 2015 value
gg <- gg + geom_segment(data=avg, aes(x=2015, xend=2015, y=val, yend=last), color="steelblue")
# 'x' + label at the average; triangle (arrowhead) and 'x' at the 2015 value
gg <- gg + geom_point(data=avg, aes(x=2015, y=val), shape=4, size=3)
gg <- gg + geom_text(data=avg, aes(x=2015, y=val), label="5-Yr\nAvg", size=2.5, hjust=-0.3)
gg <- gg + geom_point(data=avg, aes(x=2015, y=700), shape=17, col="steelblue")
gg <- gg + geom_point(data=avg, aes(x=2015, y=730), shape=4, size=3)
gg <- gg + labs(x=NULL, y="Number of Penalties", 
                title="NFL Penalties Jumped 15% in the\nFirst 3 Weeks of the 2015 Season\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(panel.grid.major.x=element_blank())
gg <- gg + theme(panel.grid.major.y=element_line(color="white"))
gg <- gg + theme(panel.background=element_rect(fill="#f3f2f7"))
gg <- gg + theme(axis.ticks=element_blank())
gg

[The ggplot2 recreation of the chart]
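
As for that last parameterization step, here's a rough sketch (my own, deliberately minimal, and not the finalized code) of how the chart could be wrapped in a function so it can be re-run with another data frame that has the same year/penalties columns and a new title:

# a minimal parameterized wrapper (sketch only; most styling details omitted)
penalty_chart <- function(dat, title, y_lab="Number of Penalties") {
  avg <- data.frame(val=mean(head(dat$penalties, -1)),
                    last=tail(dat$penalties, 1))
  last_yr <- max(dat$year)
  ggplot(dat, aes(x=year, y=penalties)) +
    geom_point() +
    geom_segment(data=avg, aes(x=min(dat$year), xend=last_yr, y=val, yend=val),
                 linetype="dashed") +
    geom_segment(data=avg, aes(x=last_yr, xend=last_yr, y=val, yend=last),
                 color="steelblue") +
    labs(x=NULL, y=y_lab, title=title) +
    theme_bw()
}
 
# penalty_chart(dat, "NFL Penalties Jumped 15% in the\nFirst 3 Weeks of the 2015 Season\n")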

Avast, me hearties! It’s time four t’ annual International Talk Like a Pirate Day #rstats post!

(OK, I won’t make you suffer continuous pirate-speak for the entire post)

I tried to be a bit more practical this year and have two treasuRe chests for you to (hopefully) enjoy.

A Package Full o’ Pirates

I’ve covered the Anti-shipping Activity Messages (ASAM) Database for TLAPD before, but getting, updating and working with the data has more pain points than it should, so I wrapped a small package around it.

Here’s how to get all pirate attacks this year (2015) so far:

# devtools::install_github("hrbrmstr/asam")
library(asam)
 
data(asam_shp)
pirates <- subset(asam_shp,
                  grepl("pirate", Aggressor, ignore.case=TRUE) &
                  format(DateOfOcc, "%Y") == "2015")
 
nrow(pirates)
## [1] 78

It looks like there have been 78 registered pirate attacks this year. The National Geospatial-Intelligence Agency (NGA) marks the attacks by lat/lon and also by region/subregion, and I managed to obtain the official polygons for these regions, so we can plot the attacks on a world map and also show the subregions:

library(ggplot2)
library(ggthemes)  # theme_map()
 
# get the ASAM subregion polygons
subregions <- asam_subregions()
subregions_map <- fortify(subregions)
 
# get the world map
world <- map_data("world")
 
# get the points for the pirate attack occurrences
pirate_pts <- data.frame(pirates)
 
gg <- ggplot()
 
# world map layer
gg <- gg + geom_map(data=world, map=world,
                    aes(x=long, y=lat, map_id=region),
                    color="black", fill="#e7e7e7", size=0.15)
# ASAM regions layer
gg <- gg + geom_map(data=subregions_map, map=subregions_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", fill="white", size=0.15, alpha=0)
 
# attacks
gg <- gg + geom_point(data=pirate_pts, color="black", fill="yellow", 
                      aes(x=coords.x1, y=coords.x2), shape=21)
 
gg <- gg + xlim(-170, 170)
gg <- gg + ylim(-58, 75)
gg <- gg + coord_map("mollweide")
gg <- gg + theme_map()
gg <- gg + theme(panel.background=element_rect(fill="steelblue"))
gg

[World map (Mollweide) of the 2015 pirate attack locations with ASAM subregion boundaries]

There is quite a bit more data than just location, though, and we can work with it much better in an interactive map.

Makin’ Interactive Pirate Maps

Now, what makes the following an interactive pirate map is not so much the fact that we’ll be plotting points of pirate attacks on a Leaflet map, but we’ll also be using a pirate treasure map theme on the Leaflet map.

Let’s start with showing how to use a general pirate map theme before adding in the ASAM data.

You’ll have to pause here and head on over to MapBox to register for a (free) account. You’ll need to go through the gyrations to eventually get a public token and mapbox id to use the pirate map tiles they have. I store those in my .Renviron so I don’t have to cut/paste inane strings when I need to use this (or any other) API, and to keep them from prying eyes. Since MapBox exposes these strings in GET call URLs anyway, the use of environment variables here is strictly for convenience.
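
For reference, the two environment variables the code below expects would live in `~/.Renviron` as single lines like these (the values shown are placeholders):

MAPBOX_PUBLIC_TOKEN=pk.your-public-token-here
PIRATE_MAP_ID=youraccount.your-map-id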

library(leaflet)
 
mapbox_public_token <- Sys.getenv("MAPBOX_PUBLIC_TOKEN")
mapbox_map_id <- Sys.getenv("PIRATE_MAP_ID")
mapbox_url <- "https://a.tiles.mapbox.com/v4/%s/{z}/{x}/{y}.png?access_token=%s"
mapbox_tiles_template <- sprintf(mapbox_url, mapbox_map_id, mapbox_public_token)

Now, what good is a pirate map without an ‘X’ marking the spot for some treasure? For that, we’ll need an ‘X’ in a format we can use with Leaflet:

x_marker <- icons("http://rud.is/dl/x.png",
                  iconHeight=64, iconWidth=64,
                  iconAnchorX=32, iconAnchorY=32)

Now, we can display a pirate map for all scurvy dogs to see:

leaflet() %>%
  addTiles(mapbox_tiles_template) %>%
  setView(lng=-50.9249, lat=45.68929, zoom=3) %>%
  addMarkers(-70.2667, 43.6667, icon=x_marker)

NOTE: I have not buried treasure in Portland, Maine, but go nuts digging at that location if you still want to.

Pirates on Pirate Maps

We can make a [crude] interactive ASAM browser by combining our data from above with our new, pirate-y mapping capabilities:

library(asam)
library(sp)
library(dplyr)
library(leaflet)
 
data(asam_shp)
dat <- subset(asam_shp,
              DateOfOcc > as.Date("2015-01-01") &
                grepl("pirate", Aggressor, ignore.case=TRUE))
# could also do data.frame(dat)
dat <- bind_cols(dat@data, data.frame(coordinates(dat), stringsAsFactors=FALSE))

We’ll build a popup with the ASAM incident description fields and add it and the pirate incident points to a pirate-themed Leaflet map:

popup_template <- '<div style="background:#f3e0b5; padding:10px"><b>Date:</b> %s
<span style="float:right"><a target="_blank" href="https://msi.nga.mil/NGAPortal/msi/query_results.jsp?MSI_queryType=ASAM&amp;MSI_generalFilterType=SpecificNumber&amp;MSI_generalFilterValue=%s&amp;MSI_additionalFilterType1=None&amp;MSI_additionalFilterType2=-999&amp;MSI_additionalFilterValue1=-999&amp;MSI_additionalFilterValue2=-999&amp;MSI_outputOptionType1=SortBy&amp;MSI_outputOptionType2=-999&amp;MSI_outputOptionValue1=Date_DESC&amp;MSI_outputOptionValue2=-999&amp;MSI_MAP=-999">ASAM Record</a>
</span><br/>
<b>Victim:</b> %s<br/>
<b>Description:</b> %s</div>'
 
nona <- function(x) ifelse(is.na(x), " ", x)
 
pirate_pops <- sprintf(popup_template,
                       dat$DateOfOcc,
                       gsub("-", "_", dat$Reference),
                       dat$Victim,
                       paste(nona(dat$Descript),
                             nona(dat$Desc1), nona(dat$Desc2), nona(dat$Desc3),
                             nona(dat$Desc4), nona(dat$Desc5), nona(dat$Desc6),
                             sep=" "))
 
mapbox_public_token <- Sys.getenv("MAPBOX_PUBLIC_TOKEN")
mapbox_map_id <- Sys.getenv("PIRATE_MAP_ID")
mapbox_url <- "https://a.tiles.mapbox.com/v4/%s/{z}/{x}/{y}.png?access_token=%s"
mapbox_tiles_template <- sprintf(mapbox_url, mapbox_map_id, mapbox_public_token)
 
leaflet() %>%
  addTiles(mapbox_tiles_template) %>%
  setView(lng=-50.9249, lat=45.68929, zoom=3) %>%
  addCircles(dat$coords.x1, dat$coords.x2, radius=300,
             color="#664c1f", popup=pirate_pops)

Select any of the circle marks and you’ll get a popup with a description and link to the official ASAM record (like this):

[Example marker popup with the incident description and a link to the official ASAM record]

Fin

I’m not sure when I’ll get back to the asam package, but it could use some attention. The Aggressor field could be auto-cleaned to make it more usable and a dplyr-esque interface could be developed to select incidents. Also, since it includes a shapefile of subregions, that could also be used to do more spatial-oriented analyses of the incidents. It’s all there for any pirate lackey to pilfer.
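
If anyone wants a head start on that Aggressor cleanup, a crude first pass might look something like this (a sketch only; the recoding rules would need real attention):

library(asam)
 
data(asam_shp)
 
# normalize case/whitespace, then collapse the obvious pirate variants
agg <- tolower(trimws(asam_shp$Aggressor))
agg[grepl("pirat", agg)] <- "pirates"
head(sort(table(agg), decreasing=TRUE))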

Drop a note in the comments if you have any of your own pirate-y creations, or file an issue on github with feature requests & bug reports.

I glimpsed a [post](http://blagrants.blogspot.com/2015/09/nest-thermostat-and-r-creating-shiny.html) in the RSS feeds today on how to connect Nest data with a Shiny dashboard and was compelled to post a less brute-force way to get data from the Nest API. The authors of the Shiny+Nest post used `system` calls to `curl` and regular expression character vector operations to slice, dice & work with the Nest API/data. However, we have packages like `RCurl`, `curl` and `httr` that all make `system` calls unnecessary. Here’s how to use the Nest API in a more R-like (or `httr`-like) fashion.

### Nest & OAuth 2.0

Nest uses OAuth 2.0 for authentication. As the authors of the other post pointed out, you’ll need to go to `https://developer.nest.com/clients` (create an account if you don’t already have one) to set up a new client, making sure to leave the `OAuth Redirect URL` field _blank_. Copy the _Client ID_ for use later and store the _Client secret_ in your `.Renviron` file as `NEST_CONSUMER_SECRET=…` (restart any open R sessions so R will re-read the `.Renviron` file).
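
That `.Renviron` entry is just a single line (placeholder value shown):

NEST_CONSUMER_SECRET=your-client-secret-here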

Once your API client information is set up, you’ll need a way of working with it. We first create a Nest OAuth endpoint with the core URLs `httr` will use to authorize the client. Unfortunately, not all OAuth 2.0 endpoints work the same way. For the Nest API, an initial `state` parameter is required even when using PIN-based authentication (which we are in this example, since Nest doesn’t honor a dynamic callback URL as far as I can tell). This is how to set up the `httr` endpoint.

library(httr)
library(jsonlite)
 
nest <- oauth_endpoint(
  request=NULL,
  authorize="https://home.nest.com/login/oauth2?state=login",
  access="https://api.home.nest.com/oauth2/access_token"
)

Now, we need to set up the “app”. This is more of an “R” need than an “OAuth” need. Use the _Client ID_ you copied earlier. `httr` will pull the secret from the environment variable you created.

nest_app <- oauth_app("nest", key="a8bf6e0c-89a0-40ae-869a-943e928316f5")

With that out of the way, now we authorize the client. Because we’re using PIN-based authentication, the user will have to cut/paste the URL displayed in the R Console into a browser then cut/paste the PIN displayed in the browser back into the R Console. With `cache=TRUE` this will be a one-time event.

nest_token <- oauth2.0_token(nest, nest_app, use_oob=TRUE, cache=TRUE)

We can now use our shiny new token to make Nest API calls.

### Using the Nest API

Since this is not a detailed introduction to the Nest API in general, you may want to take the time to read [their documentation](https://developer.nest.com/documentation/topics/api-guides-and-reference). Here’s how to get a list of all the devices for the account and then read the data from the first thermostat. I’m gaming this a bit since I only have one Nest device and it is a thermostat, but you can use their simulator to play with more data.

To get all the devices in use, it’s just a call to the `devices` path. Again, not all OAuth 2.0 APIs work the same way, so instead of embedding the access token into the http request headers, you need to specify it in the query parameters:

req <- GET("https://developer-api.nest.com",
           path="devices",
           query=list(auth=nest_token$credentials$access_token))
stop_for_status(req)
devices <- fromJSON(content(req, as="text"))

Now, `devices` contains a very ugly list (most JSON APIs return really ugly responses) but we can easily get the ID of the first thermostat (which we’ll need to use in the next API call) by doing:

first_thermostat <- names(devices$thermostats)[1]

Use `str(devices)` to see what else is there for use.

Now, to get the data from the thermostat, all you have to do is:

req <- GET("https://developer-api.nest.com/",
           path=sprintf("devices/thermostats/%s", first_thermostat),
           query=list(auth=nest_token$credentials$access_token))
stop_for_status(req)
thermo <- data.frame(fromJSON(content(req, as="text")),
                     stringsAsFactors=FALSE)

And, you can display the current temperature/humidity via:

cat(thermo$ambient_temperature_f, "F / ", thermo$humidity, "%", sep="")

### Fin

Ideally, one would wrap this into a package (which I _may_ do but feel free to take this code and make one before I get the cycles to get to it) and add more error checking and convenience functions for working with the data. But, you can now adapt the [Nest+Shiny dashboard](http://blagrants.blogspot.com/2015/09/nest-thermostat-and-r-creating-shiny.html) post code to use proper API calls and data structures vs `system(“curl…”)` and `gsub()`.
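
As a starting point for such a package, a tiny convenience wrapper around the two `GET` calls above might look like this (a sketch with names I made up, not an existing API):

# hypothetical helper: fetch a Nest API path and return the parsed JSON
nest_get <- function(path="", token=nest_token) {
  req <- GET("https://developer-api.nest.com",
             path=path,
             query=list(auth=token$credentials$access_token))
  stop_for_status(req)
  fromJSON(content(req, as="text"))
}
 
# devices <- nest_get("devices")
# thermo  <- nest_get(sprintf("devices/thermostats/%s", first_thermostat))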

A contiguous version of the above code is in [this gist](https://gist.github.com/hrbrmstr/2da2312170a84340c14f).

A huge change is coming to ggplot2 and you can get a preview of it over at Hadley’s github repo. I’ve been keenly interested in this as I will be fixing, finishing & porting coord_proj to it once it’s done.

Hadley & Winston have re-built ggplot2 with an entirely new object-oriented system called ggproto. With ggproto it’s now possible to easily extend ggplot2 from within your own packages (since source() is so last century), often times with very little effort.

Before attempting to port coord_proj I wanted to work through adding a Geom and Stat, since I thought it would be cool to be able to have interpolated line charts (and it helps answer some recurring StackOverflow "spline"/ggplot2 questions) and I also prefer KernSmooth::bkde over the built-in density function (which geom_density and stat_density both use).

To that end, I’ve made a new github-installable package called ggalt (h/t to @jayjacobs for the better package name than I originally came up with) where I’ll be adding new Geoms, Stats, Coords (et al) as I craft them. For now, let me introduce both geom_xspline() and geom_bkde() to show how easy it is to incorporate new functionality into ggplot2.

While not a requirement, I think it's going to be a good idea to make a paired Geom and Stat when adding those types of functionality to ggplot2. I found it easier to work with custom parameters this way and it also makes it feel a bit more like the way ggplot2 itself works. For the interpolated line geom/stat I used R's graphics::xspline function. Here's all it took to give ggplot2 lines some curves (you can find the commented version on github):

geom_xspline <- function(mapping = NULL, data = NULL, stat = "xspline",
                      position = "identity", show.legend = NA,
                      inherit.aes = TRUE, na.rm = TRUE,
                      spline_shape=-0.25, open=TRUE, rep_ends=TRUE, ...) {
  layer(
    geom = GeomXspline,
    mapping = mapping,
    data = data,
    stat = stat,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(spline_shape=spline_shape,
                  open=open,
                  rep_ends=rep_ends,
                  ...)
  )
}

GeomXspline <- ggproto("GeomXspline", GeomLine,
  required_aes = c("x", "y"),
  default_aes = aes(colour = "black", size = 0.5, linetype = 1, alpha = NA)
)

stat_xspline <- function(mapping = NULL, data = NULL, geom = "line",
                     position = "identity", show.legend = NA, inherit.aes = TRUE,
                     spline_shape=-0.25, open=TRUE, rep_ends=TRUE, ...) {
  layer(
    stat = StatXspline,
    data = data,
    mapping = mapping,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(spline_shape=spline_shape,
                  open=open,
                  rep_ends=rep_ends,
                  ...
    )
  )
}

StatXspline <- ggproto("StatXspline", Stat,

  required_aes = c("x", "y"),

  compute_group = function(self, data, scales, params,
                           spline_shape=-0.25, open=TRUE, rep_ends=TRUE) {
    tf <- tempfile(fileext=".png")
    png(tf)
    plot.new()
    tmp <- xspline(data$x, data$y, spline_shape, open, rep_ends, draw=FALSE, NA, NA)
    invisible(dev.off())
    unlink(tf)

    data.frame(x=tmp$x, y=tmp$y)
  }
)

If that seems like a lot of code, it really isn't. What we have there are:

  • two functions that handle the Geom aspects &
  • two functions that handle the Stat aspects.

Let's look at the Stat functions first, though you can also just read the handy vignette.

Adding Stats

In this particular case, we have it easy. We get to use geom_line/GeomLine as the base geom for the layer since all we're doing is generating more points for it to draw line segments between. We create the interface to our new Stat with stat_xspline and add three new parameters with default values:

  • spline_shape
  • open
  • rep_ends

"Added three new parameters to what?" you ask? GeomLine/geom_line default to StatIdentity/stat_identity and if you look at the source code, that Stat just returns the data back in the form it came in. We're going to take these three new parameters and pass them to xspline and then return entirely new values back for ggplot2/grid to draw for us, so we tell it to call our new computation engine by giving it the StatXspline value to the layer. By using GeomLine/geom_line as the geom parameter, all we have to do is ensure we pass back the proper values. We do that in compute_group since ggplot2 will segment the incoming data into groups (via the group aesthetic) for us. We take each group and run them through the xspline with the parameters the user specified. If I didn't have to use the hack to work around what seems to be errant plot device issues in xspline, the call would be one line.

Adding Geoms

We pair up the Stat with a very basic Geom "shim" so we can use them interchangeably. It's the same idiom, an "object" function and the user-callable function. In this case, it's super-lightweight since we're really having geom_line do all the work for us. In a [very] future post, I'll cover more complex Geoms that require use of the underlying grid graphics system, but I suspect most of your own additions may be able to use the lightweight idiom here (and that's covered in the vignette).

Putting Our New Functions To Work

With our new additions to ggplot2, we can compare the output of geom_smooth to geom_xspline with some test data:

set.seed(1492)
dat <- data.frame(x=c(1:10, 1:10, 1:10),
                  y=c(sample(15:30, 10), 2*sample(15:30, 10), 3*sample(15:30, 10)),
                  group=factor(c(rep(1, 10), rep(2, 10), rep(3, 10)))
)

ggplot(dat, aes(x, y, group=group, color=factor(group))) +
  geom_point(color="black") +
  geom_smooth(se=FALSE, linetype="dashed", size=0.5) +
  geom_xspline(size=0.5)

[Test data: geom_smooth (dashed) vs. geom_xspline (solid)]

The github page has more examples for the function, but you don't have to be envious of the smooth D3 curves any more.

I realize this particular addition is not extremely helpful/beneficial, but the next one is. We'll look at adding a new/more accurate density Stat/Geom in the next installment and then discuss the "on-steroids" roxygen2 comments you'll end up using for your creations in part 3.

I engage with the Stack[Overflow|Exchange] community quite a bit and was super-happy @treycausey made the [Stack Overflow bot](https://twitter.com/StackOverflowR) (@StackOverflowR) since I’m also on Twitter a lot (mostly hanging out there these days).

However, questions exist in other Stack watering holes, like the [Geographic Information Systems Stack Exchange](http://gis.stackexchange.com/questions/tagged/r), [Cross Validated](http://stats.stackexchange.com/questions/tagged/r) and the fledgling [Data Science Stack Exchange](http://datascience.stackexchange.com/questions/tagged/r). They are all easy enough to follow in your favorite RSS reader (which _is_ @feedly, _right_?), but it’s equally easy to turn them into Twitter bots (here they are):

– @DataSciSERBot (Data Science Stack Exchange posts)
– @GISStackExchR (Geographic Information Systems posts)
– @RStatsStExBot (Cross Validated Stack Exchange posts)

They use [this IFTTT recipe](https://ifttt.com/recipes/322136-twitter-bot-for-data-science-stack-exchange-rstats-questions) to take the RSS feeds from each Stack community and turn them into tweets. Each forum (and, hence, Twitter bot) has _way_ less volume than @StackOverflowR and also tends to be of less general interest than @StackOverflowR. However, these specialized question forums [usually] have _really_ interesting problems & answers that I’ve learned a great deal from. You may get inspiration for a solution to something, see a really neat way to accomplish a task, get an idea for a new R package or just gain new and useful knowledge in areas previously unfamiliar to you.

They won’t be spamming the Twitter channel (the errant tag on one of the tweets was fixed in the recipe), so you’ll have to follow them or deliberately check in on them to see the updates.

I’ll try to keep an eye out for other Stack communities that feature R content and add them to the bot family. If you see any I’ve missed or am missing in the future, drop a comment here or a note to @hrbrmstr on Twitter.

Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos:

I had seen and appreciated Ann’s post on her makeover of the main graphic in [NPR’s story](http://www.npr.org/sections/money/2014/10/21/357629765/when-women-stopped-coding) and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.

Here’s the NPR graphic:

[NPR’s original “When Women Stopped Coding” graphic]

It is actually pretty darn good on its own, but I also agree with Ann that direct labeling could have made it better. Here’s her makeover:

Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the `library` stuff out of the way:

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)

Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:

# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)
 
gender <- read.csv(fil, stringsAsFactors=FALSE)
 
# take a look at the CSV structure ----------------------------------------
 
glimpse(gender)
 
## Observations: 48
## Variables:
## $ date              (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School    (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School        (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science  (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...
 
tail(gender)
 
##    date Medical School Law School Physical Sciences Computer science
## 43 2008           0.48       0.47              0.41            0.177
## 44 2009           0.48       0.47              0.42            0.179
## 45 2010           0.48       0.47              0.41            0.182
## 46 2011           0.47         tk                tk            0.177
## 47 2012           0.47         tk                tk            0.182
## 48 2013           0.46         tk                              0.179

Those `tk` values are referred to in the [code that makes the NPR graphic](http://apps.npr.org/dailygraphics/graphics/women-cs/js/graphic.js) so we’ll replace them with `NA`s and make all the columns numeric:

gender <- mutate_each(gender, funs(as.numeric))

We should also clean up the column names since we’ll be using them for the legend and the direct labels:

colnames(gender) <- str_replace(colnames(gender), "\\.", " ")
 
gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5],
                                  ordered=TRUE))

That code link also has the colors NPR used for the graphic, so let’s define those now since we bothered to look at it:

gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]

We’ll be needing those names later on, hence why I named the values in the vector.

With the data, labels and colors defined, we can make a “standard” ggplot:

chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))
 
gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(label=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg

[The “standard” ggplot2 version of the chart, with a legend]

That’s also pretty good on its own. It’s possible to make it look a bit more like the NPR chart, but it’s hard to format a title & subtitle in a ggplot title _and_ have it left-justified, so I opted for font style. It’s also possible to make the legend look like NPR’s, but that’s not the point of the post.

So, how do we make this look more like Ann’s makeover?

First we need to get the last values for each of the variables so we know what point on the `y` axis we need to place the labels. That’s made a bit trickier with the `NA`s:

last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))
last_date <- tail(gender$date, 1)+1 # doing this ^ wld have made it a double

Next, we need to turn off the legend and increase the plot margin on the right-hand side:

gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))

I figured out those #’s by interactive trial-and-error, though I initially guessed `6` for the right-hand margin increase. Also, this should demonstrate one reason for the `gg <- gg +` madness you see in my code/posts since, when you start doing more in ggplot, you end up with that idiom more oft than not. Now, we add the labels. We do it with custom annotations that are placed "one year" after the latest `x` value and at the same `y` value as the last reading of each area. We also color the label the same as the line, which is why we needed a named vector.

for (i in 1:length(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8, 
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014,
                               ymin=last_vals[i], ymax=last_vals[i])
}

Finally, we have to do some of the remaining work by hand since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level, but it’s not that scary or complex of a task. Also, since we are manipulating the built ggplot object, we have to use `grid.draw` to present our chart:

gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)
 
gt$layout$clip[gt$layout$name=="panel"] <- "off"
 
grid.draw(gt)

Here’s the result:

[The direct-labeled makeover version of the chart]

I’ve deliberately left the fonts a bit small and not changed their positions on the `y`-axis to give readers a bit of homework. They both _should_ be changed and the plot margins could also be tweaked a tad. You can find the complete code [on github](https://gist.github.com/hrbrmstr/83deb0baeabae0824389) so tweak away!
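
If you want a nudge on that homework, one possible set of tweaks (my own guesses, not the "official" answer) is to pull the two overlapping labels apart before the annotation loop and bump the label size:

# possible homework tweaks (one option among many)
last_vals["Law School"]     <- last_vals["Law School"] + 0.015
last_vals["Medical School"] <- last_vals["Medical School"] - 0.015
# ...and use gp=gpar(fontsize=10, ...) in the textGrob() call above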

If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.

UPDATE: `docxtractr` is now [on CRAN](https://cran.rstudio.com/web/packages/docxtractr/index.html)

———————

This is more of a follow-up to [yesterday’s post](http://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/). The hack and function in said post were fine, but they were limited to uniform tables and made you do more work than you had to. So, there’s now a `devtools`-installable package [on github](https://github.com/hrbrmstr/docxtractr) that makes it way easier to get information about the tables in a Word document and extract them—uniform or not.

There are plenty of examples in the GitHub README and also in the package examples. But, I will show the basic functionality here.

The package ships with four example Word documents, but we’ll work with the last one: `complex.docx`. It has five tables; the last two have varying columns and rows and look like this:

[The last two (non-uniform) tables in complex.docx]

Let’s read those two in:

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows 
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely


docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#> 
#>   Foo  Bar Baz
#> 1  Aa BbCc  NA
#> 2  Dd   Ee  Ff
#> 3  Gg   Hh  ii

docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#> 
#>    Foo Bar Baz
#> 1   Aa  Bb  Cc
#> 2   Dd  Ee  Ff
#> 3   Gg  Hh  Ii
#> 4 Jj88  Kk  Ll
#> 5       Uu  Ii
#> 6   Hh  Ii   h

It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.

Next steps are to allow specifying column types, to try to guess column types (`readr` has some nice functions for this) and perhaps to return more metadata (if possible).
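
In the meantime, `readr::type_convert()` is a decent stopgap for the type-guessing piece (a quick sketch using the table extracted above):

library(readr)
 
tbl <- docx_extract_tbl(complx, 5, header=TRUE)
type_convert(tbl)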

Feature requests & bug reports are most welcome [on GitHub](https://github.com/hrbrmstr/docxtractr/issues).

NOTE: after reading this post head on over to this new one as it has wrapped this functionality (and more!) into a package.

Also: docxtractr is now on CRAN


This was asked on twitter recently:

The answer is a very cautious “yes”. Much depends on how well-formed and un-formatted the table is.

Take this really simple docx file: data.docx.

It has a single table in it:

[The single table in data.docx]

Now, .docx files are just zipped directories, so rename that to data.zip, unzip it and navigate to data/word/document.xml and you’ll see something like this (though it’ll be more compressed):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
    <w:tbl>
        <w:tblPr>
            <w:tblStyle w:val="TableGrid"/>
            <w:tblW w:w="0" w:type="auto"/>
            <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
        </w:tblPr>
        <w:tblGrid>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2338"/>
            <w:gridCol w:w="2338"/>
        </w:tblGrid>
        <w:tr w:rsidR="00244D8A" w14:paraId="6808A6FE" w14:textId="77777777" w:rsidTr="00244D8A">
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="7D006905" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>This</w:t>
                    </w:r>
                </w:p>
            </w:tc>
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="13C9E52C" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>Is</w:t>
                    </w:r>
                </w:p>
            </w:tc>
...

We can easily make out a table structure with rows and columns. In the simplest cases (which is all I’ll cover in this post) where the rows and columns are uniform it’s pretty easy to grab the data:

library(xml2)

# read in the XML file
doc <- read_xml("data/word/document.xml")

# there is an egregious use of namespaces in these files
ns <- xml_ns(doc)

# extract all the table cells (this is assuming one table in the document)
cells <- xml_find_all(doc, ".//w:tbl/w:tr/w:tc", ns=ns)

# convert the cells to a matrix, then to a data.frame
dat <- data.frame(matrix(xml_text(cells), ncol=4, byrow=TRUE), 
                  stringsAsFactors=FALSE)

# if there are column headers, make them the column name and remove that line
colnames(dat) <- dat[1,]
dat <- dat[-1,]
rownames(dat) <- NULL

dat

##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo

You’ll need to clean up the column types, but you have at least freed the data from the evil file format it was in.
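
Cleaning up those column types can be as simple as letting base R guess them (one approach):

# let type.convert() guess numeric vs. character for each column
dat[] <- lapply(dat, type.convert, as.is=TRUE)
str(dat)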

If there is more than one table you can use XML node targeting to process each one separately or into a list. I’ve wrapped that functionality into a rudimentary function that will:

  • auto-copy a Word doc to a temporary location
  • rename it to a zip
  • unzip it to a temporary location
  • read in the document.xml
  • auto-determine the number of tables in the document
  • auto-calculate # rows & # columns per table
  • convert each table
  • return all the tables into a list
  • clean up the temporarily created items

library(xml2)

get_tbls <- function(word_doc) {
  
  tmpd <- tempdir()
  tmpf <- tempfile(tmpdir=tmpd, fileext=".zip")
  
  file.copy(word_doc, tmpf)
  unzip(tmpf, exdir=sprintf("%s/docdata", tmpd))
  
  doc <- read_xml(sprintf("%s/docdata/word/document.xml", tmpd))
  
  unlink(tmpf)
  unlink(sprintf("%s/docdata", tmpd), recursive=TRUE)

  ns <- xml_ns(doc)
  
  tbls <- xml_find_all(doc, ".//w:tbl", ns=ns)
  
  lapply(tbls, function(tbl) {
    
    cells <- xml_find_all(tbl, "./w:tr/w:tc", ns=ns)
    rows <- xml_find_all(tbl, "./w:tr", ns=ns)
    dat <- data.frame(matrix(xml_text(cells), 
                             ncol=(length(cells)/length(rows)), 
                             byrow=TRUE), 
                      stringsAsFactors=FALSE)
    colnames(dat) <- dat[1,]
    dat <- dat[-1,]
    rownames(dat) <- NULL
    dat
    
  })
  
}

Using this multi-table Word doc – data3.docx:

[The three tables in data3.docx]

we can extract the three tables thusly:

get_tbls("~/Dropbox/data3.docx")

## [[1]]
##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo
## 
## [[2]]
##   Foo Bar Baz
## 1  Aa  Bb  Cc
## 2  Dd  Ee  Ff
## 3  Gg  Hh  ii
## 
## [[3]]
##   Foo Bar
## 1  Aa  Bb
## 2  Dd  Ee
## 3  Gg  Hh
## 4  1    2
## 5  Zz  Jj
## 6  Tt  ii

This function tries to calculate the rows/columns per table but it does rely on a uniform table structure.

Have an alternate method or more feature-complete way of handling Word docs as tabular data sources? Then definitely drop a note in the comments.