The SVE service is a fake service put out there solely to facilitate some inane puzzle. You can reference SVE-2015-9999 via this URL, and hopefully that will make it possible for Google to get it back into their index.
Author Archives: hrbrmstr
I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
> Wrote some R code for looking at historical crash data https://t.co/bgrMj3PIZu #data #r #DataMining pic.twitter.com/zYzpOyD9JY
>
> — PhilJ (@philjette) March 26, 2015
The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here’s all the packages I’ll be using:
```r
library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)
```
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
The basic setup is to:
– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `<br/>` separator and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as the previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
Some things worth mentioning:
– `data_frame` is super-helpful in not-creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR); the `.` just feels wrong in pipes to me, still (something like the sketch after this list is what I mean)
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
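To make the `gsub`/`ifelse` gripe concrete, here's the kind of hypothetical helper I have in mind. It is *not* part of `stringr`; just a sketch:

```r
# hypothetical pipe-friendly helper, NOT in stringr:
# NA-out a sentinel value mid-pipe without reaching for ifelse(.=="?", NA, .)
na_when <- function(x, value) {
  x[x == value] <- NA
  x
}

c("Cessna 172", "?") %>% na_when("?")
## [1] "Cessna 172" NA
```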
Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):
```r
#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {

  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"

  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }

  # get crash date

  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date

  # get location and operator

  loc_op <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op

    data_frame(location=loc, operator=op)

  }))

  # get type & registration

  type_reg <- bind_rows(lapply(1:length(date), function(i) {

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ

    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg

    data_frame(type=typ, registration=reg)

  }))

  # get fatalities

  pg %>%
    html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]),
                 fatalities=as.numeric(x[3]),
                 ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>%
    tail(-1) -> afg

  bind_cols(date, loc_op, type_reg, afg)

}
```
While that gets one year, it’s super-simple to get all crashes since 1950:
```r
crashes <- bind_rows(pblapply(1950:2015, get_data))
```
Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
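If you want a quick, optional sanity check on what came back, a couple of `dplyr` verbs will do (this assumes the `crashes` object from above and the packages loaded earlier):

```r
# optional sanity check: crash counts by decade (year() is from lubridate)
crashes %>%
  mutate(decade=10 * (year(date) %/% 10)) %>%
  count(decade)
```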
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.
Over on The DO Loop, @RickWicklin does a nice job [visualizing the causes of airline crashes](http://blogs.sas.com/content/iml/2015/03/30/visualizing-airline-crashes/) in SAS using a mosaic plot. More often than not, I find mosaic plots a bit difficult to grok, but Rick's use was spot on and I believe it shows the data pretty well. I also thought I'd take the opportunity to:
– Give @jennybc’s new [googlesheets](http://github.com/jennybc/googlesheets) a spin
– Show some `dplyr` & `tidyr` data wrangling (never can have too many examples)
– Crank out some `ggplot` zero-based streamgraph-y area charts for the data with some extra `ggplot` wrangling for good measure
I also decided to use the colors in the [original David McCandless/Kashan visualization](http://www.informationisbeautiful.net/visualizations/plane-truth-every-single-commercial-plane-crash-visualized/).
#### Getting The Data
As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site [makes the data available](https://docs.google.com/spreadsheet/ccc?key=0AjOUPqcIwvnjdEx2akx5ZjJXSk9oM1E3dWpqZFJ6Nmc&usp=drive_web#gid=1), so I copied it to my Google Drive and gave her package a go:
```r
library(googlesheets)
library(ggplot2)
# we'll need the rest of the libraries later,
# but just getting them out of the way
library(dplyr)
library(tidyr)

# this will prompt for authentication the first time
my_sheets <- list_sheets()

# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)

## [1] "Copy of Flight Risk JSON" "Flight Risk JSON"

# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")

# take a quick look
glimpse(flights_csv)

## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal

# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv
```
#### A Bit of Cleanup
Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.
The area chart also needs full values for each category combo per year, so we use `expand` from `tidyr` with `left_join` and `mutate` to fill in the gaps.
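If that `expand`/`left_join` gap-filling pattern is new to you, here's a toy version with made-up data (not the flight data) showing what it does:

```r
# toy version of the expand()/left_join() gap-filling pattern (made-up data)
df <- data_frame(year=c(2000, 2000, 2001),
                 meta=c("a", "b", "a"),
                 n=c(5, 2, 3))

left_join(expand(df, year, meta), df, by=c("year", "meta")) %>%
  mutate(n=ifelse(is.na(n), 0, n))
# the missing (2001, "b") combination now shows up with n == 0
```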
Finally, we make proper, ordered labels:
```r
flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>%
  mutate(n=ifelse(is.na(n), 0, n)) %>%
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase,
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat
```
I probably took some liberties lumping "climb" in with "takeoff", but I'd have asked an expert for a production piece, just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgeable in cybersecurity.
#### The Final Plot
I'm a big fan of an incremental, additive build idiom for `ggplot` graphics. By using the `gg <- gg + …` style one can move lines around, comment them out, etc. without dealing with errant `+` signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build order as follows:

- main `ggplot` call with mappings if the graph is short, otherwise add the mappings to the `geom`s
- all `geom_` or `stat_` layers in the order I want them, using line breaks to logically separate elements (like `aes`) or to wrap long lines for easier readability
- all `scale_` elements in order from axes to line to shape to color to fill to alpha; I'm not as consistent as I'd like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
- `facet` call (if any)
- label setting, always with `labs` unless I really have a need for using `ggtitle`
- base `theme_` call
- all other `theme` elements, one per `gg <- gg +` line

I know that's not everyone's cup of tea, but it's just how I roll `ggplot`-style.

For this plot, I use a smoothed stacked area chart with a custom smoother, and Futura Medium for the text font. Substitute your own favorite font if you don't have Futura Medium.
```r
flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")

gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta))
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x))
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette,
                             labels=c("Criminal", "Human Error", "Mechanical",
                                      "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg
```
That ultimately produces:
with the facets ordered by the takeoff, en route, approach and landing phases. Overall, things have gotten way better, though I haven't had time to look into the _bump_ between 2005 and 2010 for landing crashes.
As an aside, Boeing has a [really nice PDF](http://www.boeing.com/news/techissues/pdf/statsum.pdf) on some of this data with quite a bit more detail.
It seems Ruben C. Arslan had the waffle idea about the same time I did. Apart from some extra-spiffy XKCD-like styling, one other thing his waffling routines allowed for was using FontAwesome icons. When you use an icon instead of a block you are really making a basic version of an isotype pictogram, and they can add a dimension to the story you're trying to tell without using any words. I've added two parameters to a pre-release CRAN version of `waffle` that I'd like folks to kick the tyres on a bit: `use_glyph`, which is either `FALSE` or a character string naming a FontAwesome icon (more on that in a bit), and `glyph_size`, a numeric value for the font size (needed since glyphs won't scale when the graphic resizes).
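Here's a minimal sketch of the two parameters working together. The icon name ("paw") and the sizes are arbitrary choices for illustration, and the FontAwesome setup described next has to be done first:

```r
# minimal sketch of use_glyph + glyph_size on the dev branch of waffle;
# FontAwesome must already be imported via extrafont (see setup below)
library(waffle)
library(extrafont)

waffle(c(dogs=7, cats=5), rows=3, use_glyph="paw", glyph_size=6)
```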
#### Fonts in R & waffle
One part of R that is (with apologies to Winston and others) weak is fonts. You can use fonts, but doing so is often not pretty (despite guidance on the subject) and not without problems: we tried using a custom font again for this year's DBIR graphics, failed miserably (again) due to font issues in R, and had to have the graphics folks substitute the fonts in.
To use the FontAwesome glyphs you need to:
- grab the `ttf` version from here
- install it on your system
- install the `extrafont` package
- run `font_import()` (get some coffee/scotch while you wait)
- load `extrafont` when you need to use these glyphs (a condensed code sketch of these steps follows)
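In code, the R-side steps boil down to something like this (a sketch; the font file itself still gets installed at the OS level):

```r
# condensed sketch of the R-side setup steps
install.packages("extrafont")
library(extrafont)
font_import() # the slow, coffee/scotch step: scans & imports system fonts
fonts()       # verify "FontAwesome" made the list
```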
Once you do that, you're probably ready to make isotype pictograms with `waffle`. I say probably since this process worked on two of my OS X systems but not a third: same R version, same RStudio version, same import process. (This is part of the reason for my lament about the state of fonts, since I'm not exactly an n00b with R, Macs or fonts.)
#### Making isotype pictograms
I did borrow some code from Ruben, but I hate typing unicode characters and I suspect most folks do as well. If you do any work in straight HTML/CSS, you know you can just refer to the various FontAwesome glyphs by name. To use FontAwesome glyphs with `waffle` you specify the font name (no `fa-` prefix) instead of the unicode character. If you want to see what's available (and don't want to bookmark the FontAwesome site) you can run either `fa_list()`, which will give you a list of available FontAwesome glyph names, or `fa_grep()` with a pattern. For example, running `fa_grep("car")` gives you:
## [1] "car" "caret-down" "caret-left" ## [4] "caret-right" "caret-square-o-down" "caret-square-o-left" ## [7] "caret-square-o-right" "caret-square-o-up" "caret-up" ## [10] "cart-arrow-down" "cart-plus" "cc-mastercard" ## [13] "credit-card" "shopping-cart"
Any `grep` regex will work in that function.
You'll need to `devtools::install_github("hrbrmstr/waffle", ref="cran")` to use the dev/pre-CRAN version of `waffle` before doing anything.
To make an isotype pictogram version of the health records breaches waffle chart, you can do the following:
```r
library(waffle)
library(extrafont)

parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)

waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"),
       use_glyph="medkit", size=8)
```
So, please kick the tyres, post comments about your font successes & woes and definitely link to any isotype pictograms you create.
Vis expert Naomi Robbins did an excellent [critique](http://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/) of the [graphics](http://www.jpost.com/Israel-Elections/Analysis-The-Israel-election-decided-by-one-vote-394229) that went along with an article on the Israeli election in the Jerusalem Post.
Non-uniform and color-blind-unfriendly categorical colors and disproportionate arc sizes are definitely three substantial issues in that series of visualizations. We can rectify all of them with two new packages of mine: [waffle](http://github.com/hrbrmstr/waffle) & [adobecolor](http://github.com/hrbrmstr/adobecolor). The former provides a good alternative to pie charts (no charts at all are a good alternative to pie charts) and the latter makes it possible to share color palettes without passing long strings of hex-encoded colors.
Using [XScope](http://xscopeapp.com/) I encoded a color-blind-friendly palette from [Brian Connelly](http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/) and saved the palette off as an Adobe Color file (`ACO`). I then took the values from the charts and mapped each party to a particular color. Then, I made ordered and proportional waffle charts using the values and aligned colors. The results are below:
# install.packages("waffle") # devtools::install_github("hrbrmstr/swatches") library(waffle) library(swatches) national_unity <- c(`Zionist Union (27)`=27, `Likud (27)`=27, `Kulanu (10)`=10, `Shas (7)`=7, `UTJ (6)`=6) right_wing <- c(`Likud (27)`=27, `Kulanu (10)`=10, `Bayit Yehudi (8)`=8, `Shas (7)`=7, `UTJ (6)`=6, `Yisrael Beytenu (5)`=5) herzog_led <- c(`Zionist Union (27)`=27, `Kulanu (10)`=10, `Shas (7)`=7, `UTJ (6)`=6, `Meretz (5)`=5) party_colors <- rev(read_aco("http://rud.is/dl/israel.aco")) zion <- party_colors[1] likud <- party_colors[2] kulanu <- party_colors[3] shas <- party_colors[4] utj <- party_colors[5] visrael <- party_colors[6] meretz <- party_colors[7] bayit <- party_colors[6] nw <- waffle(national_unity, rows=5, colors=c(zion, likud, kulanu, shas, utj), title="\nNational unity government") + theme(plot.title=element_text(size=12, face="bold")) rw <- waffle(right_wing, rows=5, colors=c(likud, kulanu, bayit, shas, utj, visrael), title="\nRight Wing", pad=3) + theme(plot.title=element_text(size=12, face="bold")) hw <- waffle(herzog_led, rows=5, colors=c(zion, kulanu, shas, utj, meretz), title="\nHerzog led", pad=5) + theme(plot.title=element_text(size=12, face="bold")) iron(nw, rw, hw)
If I knew my audience did not have color processing issues, I’d use a better palette. Regardless, these results are far better than the careless pies presented in the original story. The squares represent the same quantities in each chart and the colors also map to the parties.
Honestly, though, you could get a better idea with simple, un-tweaked base graphics bar charts:
```r
par(mfrow=c(3,1))
barplot(national_unity, col=c(zion, likud, kulanu, shas, utj),
        main="National unity government")
barplot(right_wing, col=c(likud, kulanu, bayit, shas, utj, visrael),
        main="Right Wing")
barplot(herzog_led, col=c(zion, kulanu, shas, utj, meretz),
        main="Herzog led")
```
Please consider your readers and the message you’re trying to convey when developing visualizations, especially when you have as large an audience as the Jerusalem Post.
install.packages("waffle")
and library(waffle)
vs the devtools
dance.My disdain for pie charts is fairly well-known, but I do concede that there are times one needs to communicate parts of a whole graphically verses using just words or a table. When that need arises, I’m partial to “waffle charts” or “square pie charts”. @eagereyes did a great post a while ago on them (make sure to read the ‘debate’ between Robert and @hadleywickham in the comments, too), so head there for the low-down on them. Rather than have every waffle chart I make be a one-off creation, I made an R package for them.
There is currently one function in the package, `waffle`, and said function doesn't (yet) mimic all the goodness of these charts as described in Robert's post. It does, however, do a pretty decent job covering the basics. Let's take the oft-cited New York Times "debt" graphic:
We can replicate that pretty closely in R. To make it as simple as possible, the `waffle` function takes a named numeric vector. If no names are specified, or you leave some names out, `LETTERS` will be used to fill in the gaps. The function takes your data quite literally, so if you give it a vector that sums to, say, 10,000, then it will try to create a ggplot object with 10,000 `geom_rect` elements. Needless to say, that's a bad idea. So, I suggest keeping the raw numbers in the vector and passing a scaled version of the vector to the function; that way you can play with the values to get the desired look. Here's the R version of the NYT graphic:
```r
# devtools::install_github("hrbrmstr/waffle")
library(waffle)

savings <- c(`Mortgage ($84,911)`=84911,
             `Auto and\ntuition loans ($14,414)`=14414,
             `Home equity loans ($10,062)`=10062,
             `Credit Cards ($8,565)`=8565)

waffle(savings/392, rows=7, size=0.5,
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"),
       title="Average Household Savings Each Year",
       xlab="1 square == $392")
```
This package evolved from a teensy gist I made earlier this year to help communicate the scope of the Anthem data breach in the US. Since then, a recent breach at Premera occurred and added to the tally. Here’s two views of that data, one with one square equalling one million people and another with one square equalling ten million people (using the blue shade from each of the company’s logos):
```r
parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)

waffle(parts, rows=8, size=1, colors=c("#969696", "#1879bf", "#009bda"),
       title="Health records breaches as fraction of US Population",
       xlab="One square == 1m ppl")
```

```r
waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"),
       title="Health records breaches as fraction of US Population",
       xlab="One square == 10m ppl")
```
I'm betting that gets a lot bluer by the end of the year.
The function returns a ggplot object, so fonts, sizes, etc. can all be customized (a quick example below), and the source is up on github for all to play with and contribute to.
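Since `waffle()` hands back a ggplot object, standard `ggplot2` theming applies. Here's a small sketch restyling the title of the breaches chart from above:

```r
# waffle() returns a ggplot object, so standard ggplot2 tweaks apply;
# e.g., restyling the title of the breaches chart from above
library(ggplot2)

wf <- waffle(parts, rows=8, size=1,
             colors=c("#969696", "#1879bf", "#009bda"),
             title="Health records breaches as fraction of US Population",
             xlab="One square == 1m ppl")

wf + theme(plot.title=element_text(size=16, face="bold"))
```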
Along with adding support for filling in the chart as shown in the @eagereyes post, an htmlwidget version is also coming. Standard drill applies: issues/enhancements to github issues, feedback and your own examples in the comments.
UPDATE: Thanks to a PR by @timelyportfolio, there is now a widget option in the package.
I’ve been seeing an uptick in static US “lower 48” maps with “meh” projections this year, possibly caused by a flood of new folks resolving to learn R but using pretty old documentation or tutorials. I’ve also been seeing an uptick in folks needing to geocode US city/state to lat/lon. I thought I’d tackle both in a quick post to show how to (simply) use a decent projection for lower 48 US maps and then how to use a _very_ basic package I wrote – [localgeo](http://github.com/hrbrmstr/localgeo) to avoid having to use an external API/service for basic city/state geocoding.
### Albers All The Way
I could just plot an Albers projected map, but it’s more fun to add some data. We’ll start with some setup libraries and then read in some recent earthquake data, then filter it for our map display:
```r
library(ggplot2)
library(dplyr)
library(readr) # devtools::install_github("hadley/readr")

# Earthquakes --------------------------------------------------------------

# get quake data
quakes <- read_csv("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_month.csv")

# filter all but lower 48 US
quakes %>%
  filter(latitude>=24.396308, latitude<=49.384358,
         longitude>=-124.848974, longitude<=-66.885444) -> quakes

# bin by .5
quakes$Magnitude <- as.numeric(as.character(cut(quakes$mag,
                                                breaks=c(2.5, 3, 3.5, 4, 4.5, 5),
                                                labels=c(2.5, 3, 3.5, 4, 4.5),
                                                include.lowest=TRUE)))
```
Many of my mapping posts use quite a few R geo libraries, but this one just needs `ggplot2`. We extract the US map data, turn it into something `ggplot` can work with, then plot our quakes on the map:
```r
us <- map_data("state")
us <- fortify(us, region="region")

# theme_map ----------------------------------------------------------------
devtools::source_gist("33baa3a79c5cfef0f6df")

# plot ---------------------------------------------------------------------
gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=quakes,
                      aes(x=longitude, y=latitude, size=Magnitude),
                      color="#cb181d", alpha=1/3)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg
```
### Local Geocoding
There are many APIs with corresponding R packages/functions to perform geocoding (one really spiffy recent one is [geocodeHERE](http://cran.r-project.org/web/packages/geocodeHERE/)). While Nokia’s service is less restrictive than Google’s, most of these sites are going to have some kind of restriction on the number of calls per second/minute/day. You could always install the [Data Science Toolkit](http://www.datasciencetoolkit.org/) locally (note: it was down as of the original posting of this blog) and perform the geocoding locally, but it does take some effort (and space/memory) to setup and get going.
If you have relatively clean data and only need city/state resolution, you can use a package I made – [localgeo](http://github.com/hrbrmstr/localgeo) as an alternative. I took a US Gov census shapefile and extracted city, state, lat, lon into a data frame and put a lightweight function shim over it (it’s doing nothing more than `dplyr::left_join`). It won’t handle nuances like “St. Paul, MN” == “Saint Paul, MN” and, for now, it requires you to do the city/state splitting, but I’ll be tweaking it over the year to be a bit more forgiving.
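To give a sense of what that shim amounts to, here's a rough sketch of the idea with a tiny, hypothetical lookup table (the package's real table comes from the census shapefile and is far bigger; coordinates below are approximate):

```r
# rough sketch of the idea behind localgeo; `city_tbl` is a hypothetical
# stand-in for the package's census-derived lookup table
library(dplyr)

city_tbl <- data.frame(city=c("Portland", "Berlin"),
                       state=c("ME", "NH"),
                       lon=c(-70.26, -71.19),
                       lat=c(43.66, 44.47),
                       stringsAsFactors=FALSE)

geocode_sketch <- function(city, state) {
  # little more than a left join against the lookup table
  left_join(data.frame(city=city, state=state, stringsAsFactors=FALSE),
            city_tbl, by=c("city", "state"))
}

geocode_sketch("Portland", "ME")
```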
We can give this a go and map the [greenest cities in the US in 2014](http://www.nerdwallet.com/blog/cities/greenest-cities-america/) as crowned by, er, Nerd Wallet. I went for “small data file with city/state in it”, so if you know of a better source I’ll gladly use it instead. Nerd Wallet used DataWrapper, so getting the actual data was easy and here’s a small example of how to get the file, perform the local geocoding and use an Albers projection for plotting the points. The code below assumes you’re still in the R session that used some of the `library` calls earlier in the post.
```r
library(httr)
library(localgeo) # devtools::install_github("hrbrmstr/localgeo")
library(tidyr)

# greenest cities -----------------------------------------------------------
# via: http://www.nerdwallet.com/blog/cities/greenest-cities-america/
url <- "https://gist.githubusercontent.com/hrbrmstr/1078fb798e3ab17556d2/raw/53a9af8c4e0e3137a0a8d4d6332f7a6073d93fb5/greenest.csv"
greenest <- read.table(text=content(GET(url), as="text"),
                       sep=",", header=TRUE, stringsAsFactors=FALSE)

greenest %>%
  separate(City, c("city", "state"), sep=", ") %>%
  filter(!state %in% c("AK", "HI")) -> greenest

greenest_geo <- geocode(greenest$city, greenest$state)

gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=greenest_geo, aes(x=lon, y=lat),
                      shape=21, color="#006d2c", fill="#a1d99b", size=4)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + labs(title="Greenest Cities")
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg
```
Let me reinforce that the `localgeo` package will most assuredly fail to geocode some city/state combinations. I’m looking for a more comprehensive shapefile to ensure I have the complete list of cities and I’ll be adding some code to help make the lookups more forgiving. It may at least help when you bump into an API limit and need to crank out something in a hurry.
In preparation for using some of our streamgraphs for production (PDF/print) graphics, I ended up having to hand-edit labels on one of the graphics in an Adobe product. This bumped up the priority on adding annotation functions to the streamgraph package (you really don't want to have to hand-edit graphics if at all possible, trust me). To illustrate them, I'll use unemployment data that I started gathering for a course I'm teaching in the Fall.
We’ll start with the setup and initial data gathering:
```r
library(dplyr)
library(streamgraph)
library(pbapply)

url <- "http://www.bls.gov/lau/ststdsadata.txt"
dat <- readLines(url)
```
This data is not exactly in a happy format (hit the URL in your browser and you’ll see what I mean). It was definitely made for line printers/human consumption and I feel bad for any human that has to stare at it. The function I’m using to extract data is not necessarily what I’d do to just read in the whole data, but it’s more for teaching something else than optimization. It’ll do for our purposes here:
```r
get_state_data <- function(state) {

  # lines either start with the state name or are "Month YYYY" headers
  section <- paste("^%s| (",
                   paste0(month.name, collapse="|"),
                   ") +[[:digit:]]{4}", sep="", collapse="")
  section <- sprintf(section, state)

  vals <- gsub("^ +| +$", "", grep(section, dat, value=TRUE))

  # even lines hold the data; strip the dot leaders before the numbers
  state_vals <- gsub("^.* \\.+", "", vals[seq(from=2, to=length(vals), by=2)])

  cols <- read.table(text=state_vals)
  cols$month <- as.Date(sprintf("01 %s", vals[seq(from=1, to=length(vals), by=2)]),
                        format="%d %B %Y")
  cols$state <- state

  cols %>%
    select(8:9, 1:8) %>%
    mutate(V1=as.numeric(gsub(",", "", V1)),
           V2=as.numeric(gsub(",", "", V2)),
           V4=as.numeric(gsub(",", "", V4)),
           V6=as.numeric(gsub(",", "", V6)),
           V3=V3/100, V5=V5/100, V7=V7/100) %>%
    rename(civ_pop=V1, labor_force=V2, labor_force_pct=V3,
           employed=V4, employed_pct=V5,
           unemployed=V6, unemployed_pct=V7)
}

state_unemployment <- bind_rows(pblapply(state.name, get_state_data))
```
This will give us a data frame of employment(/unemployment) rates for all the (US) states. I only wanted to focus on New England and a few others for the course example, so this bit filters the rest out:
```r
state_unemployment %>%
  filter(state %in% c("California", "Ohio", "Rhode Island", "Maine",
                      "Massachusetts", "Connecticut", "Vermont",
                      "New Hampshire", "Nebraska")) -> some
```
With that setup out of the way, let me introduce the two new functions: `sg_add_marker` and `sg_annotate`. `sg_add_marker` adds a vertical, dotted line that spans the height of the graph and is placed at the designated spot on the x axis. You can add an optional label for the marker by specifying the y position, label text, color, size, space away from the line and how it’s aligned – start (left), center (middle), right (end). This is primarily useful for placing the label on either side of the line.
`sg_annotate` is for adding text anywhere on the streamgraph. The original use for it was to label streams, but you can use it any way you think would add meaning to your streamgraph. You can see them both in action below, where I plot the streamgraph for unemployment (%) for the selected states, then label the start of each recession since 1980 (with the peak national unemployment rate) with a marker and also label each stream:
```r
streamgraph(some, "state", "unemployed_pct", "month") %>%
  sg_axis_x(tick_interval=10, tick_units="year", tick_format="%Y") %>%
  sg_axis_y(0) %>%
  sg_add_marker(x=as.Date("1981-07-01"), "1981 (10.8%)", anchor="end") %>%
  sg_add_marker(x=as.Date("1990-07-01"), "1990 (7.8%)", anchor="start") %>%
  sg_add_marker(x=as.Date("2001-03-01"), "2001 (6.3%)", anchor="end") %>%
  sg_add_marker(x=as.Date("2007-12-01"), "2007 (10.1%)", anchor="end") %>%
  sg_annotate(label="Vermont", x=as.Date("1978-04-01"), y=0.6, color="#ffffff") %>%
  sg_annotate(label="Maine", x=as.Date("1978-03-01"), y=0.30, color="#ffffff") %>%
  sg_annotate(label="Nebraska", x=as.Date("1977-06-01"), y=0.41, color="#ffffff") %>%
  sg_annotate(label="Massachusetts", x=as.Date("1977-06-01"), y=0.36, color="#ffffff") %>%
  sg_annotate(label="New Hampshire", x=as.Date("1978-03-01"), y=0.435, color="#ffffff") %>%
  sg_annotate(label="California", x=as.Date("1978-02-01"), y=0.175, color="#ffffff") %>%
  sg_annotate(label="Rhode Island", x=as.Date("1977-11-01"), y=0.55, color="#ffffff") %>%
  sg_annotate(label="Ohio", x=as.Date("1978-06-01"), y=0.485, color="#ffffff") %>%
  sg_annotate(label="Connecticut", x=as.Date("1978-01-01"), y=0.235, color="#ffffff") %>%
  sg_fill_tableau() %>%
  sg_legend(show=TRUE)
```
I probably could have positioned the annotations a bit better, but this should be a good enough example to get the general idea. I may add an option to place the marker vertical lines behind the streamgraph and will be adding some toggle options to the interactive version (to hide/show markers and/or annotations).
As usual, the package is up [on github](https://github.com/hrbrmstr/streamgraph) and a contiguous copy of the above snippets are in [this gist](https://gist.github.com/hrbrmstr/4e181ae045807ca3a858).
Three final notes. First, I suggest enabling the y axis while you're figuring out where the y position for a label should be (since the y axis range is calculated from the summed span of the data); a quick sketch of that follows. Second, the x axis works with both dates and continuous values, but your markers and annotations need to match what you set the streamgraph up with. Finally, just a tip: I've found [SVG Crowbar 2](http://nytimes.github.io/svg-crowbar/) to be super-helpful when I need to extract these streamgraphs for non-interactive reproduction. Just yank the SVG out with it and hand it (or a converted form of it) to whomever is handling final production and they should be able to work with it.
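For that first note, something like this while you're positioning things:

```r
# temporarily show the y axis while dialing in marker/annotation positions,
# then switch back to sg_axis_y(0) for the final version
streamgraph(some, "state", "unemployed_pct", "month") %>%
  sg_axis_y() %>%
  sg_fill_tableau()
```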