More Airline Crashes via the Hadleyverse

I saw a fly-by #rstats mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a GitHub repo by @philjette. It seems there’s a web site (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:

The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. dplyr, rvest, etc). I even submitted a pull request to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny ggplot2 graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.

Just to get the setup bits out of the way, here’s all the packages I’ll be using:

library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)

Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.

The basic setup is to:

  • grab the HTML for the page of a given year
  • extract and format the crash dates
  • extract location & operator information, which is made slightly annoying since the site uses a <br> and includes spurious newlines within a single <td> element
  • extract aircraft type and registration (same issues as previous column)
  • extract accident details, which are embedded in a highly formatted column that requires str_match_all to handle (well)

Some things worth mentioning:

  • data_frame is super-helpful in not-creating factors from the character vectors
  • bind_rows and bind_cols are a nice alternative to using data.table functions
  • I think stringr needs a more pipe-friendly replacement for gsub and, perhaps, even ifesle (yes, I guess I could submit a PR). The . just feels wrong in pipes to me, still
  • if you’re not using pbapply functions (free progress bars for everyone!) you should be, especially for long scraping operations
  • sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary

Here’s the new get_data function (updated per comment and to also add some more hadleyverse goodness):

#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
 
  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
 
  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }
 
  # get crash date
 
  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date
 
  # get location and operator
 
  loc_op <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
 
    data_frame(location=loc, operator=op)
 
  }))
 
  # get type & registration
 
  type_reg <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ
 
    pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg
 
    data_frame(type=typ, registration=reg)
 
  }))
 
  # get fatalities
 
  pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]), fatalties=as.numeric(x[3]), ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>% tail(-1) -> afg
 
  bind_cols(date, loc_op, type_reg, afg)
 
}

While that gets one year, it’s super-simple to get all crashes since 1950:

crashes <- bind_rows(pblapply(1950:2015, get_data))

Yep. That’s it. Now crashes contains a data.frame (well, tbl_df) of all the crashes since 1950, ready for further analysis.

For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.

If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.