More Airline Crashes via the Hadleyverse

I saw a fly-by `#rstats` mention of more airplane accident data on LinkedIn (via email), of all places, today, which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes.

The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.

Just to get the setup bits out of the way, here’s all the packages I’ll be using:

library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)

Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
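
As a rough illustration of that pattern (a default computed at call time plus an up-front range check), here is a minimal base-R sketch. The helper names `current_year()` and `check_year()` are hypothetical and are not part of the original code; the 1920 lower bound is the one used in `get_data` below.

```r
# Hypothetical helpers (not from the original post) that isolate the
# "default to the current year, then validate" idea.
current_year <- function() as.numeric(format(Sys.Date(), "%Y"))

check_year <- function(year = current_year()) {
  # Reject anything that is not a single, non-missing, whole number
  if (!is.numeric(year) || length(year) != 1 || is.na(year) || year != round(year)) {
    stop("year must be a single whole number", call. = FALSE)
  }
  # || is the scalar, short-circuiting operator intended for if() conditions
  if (year < 1920 || year > current_year()) {
    stop("year must be >=1920 and <=current year", call. = FALSE)
  }
  year
}

check_year(1950)  # returns the year unchanged when it is valid
```

Because the default is an expression rather than a constant, it is evaluated each time the function is called, so the "current year" never goes stale in a long-running session.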

The basic setup is to:

– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `<br>` and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
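
To make the line-break splitting step concrete without hitting the live site, here is a small offline sketch. The HTML is made up and only mimics the layout that the XPath expressions in `get_data` appear to assume (a child element inside each cell holding two lines separated by a `<br>`). It uses `read_html()`, which current rvest releases provide in place of `html()`.

```r
library(rvest)
library(stringr)

# Made-up HTML (an assumption, not copied from the site): the header cell has
# no child element, while data cells wrap "location<br>operator" in one.
html_txt <- '<table>
<tr><td>Date</td><td>Location / Operator</td></tr>
<tr><td>01 Jan 2015</td><td><font>
  Near Springfield, Example<br>Example Air
</font></td></tr>
<tr><td>02 Jan 2015</td><td><font>Off Sample Island<br>
  Sample Airways</font></td></tr>
</table>'

pg <- read_html(html_txt)

# Text node before the first <br> in each cell: the location
loc <- pg %>%
  html_nodes(xpath = "//table/tr/td[2]/*/br[1]/preceding-sibling::text()") %>%
  html_text() %>%
  str_trim() %>%
  str_replace_all("^(Near|Off) ", "")

# Text node after the first <br> in each cell: the operator
op <- pg %>%
  html_nodes(xpath = "//table/tr/td[2]/*/br[1]/following-sibling::text()") %>%
  html_text() %>%
  str_trim()

data.frame(location = loc, operator = op, stringsAsFactors = FALSE)
```

Note that `br[1]` matches the first `<br>` under every matching parent, so a single XPath query returns one value per table row.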

Some things worth mentioning:

– `data_frame` is super-helpful in not-creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR). The `.` just feels wrong in pipes to me, still
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
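
On the pipe-friendliness point, here is one possible way to avoid the `.` placeholder. It assumes newer releases of `stringr` and `dplyr` than the ones this post was written against: `str_squish()` and `na_if()` are, I believe, later additions to those packages. The input vector is made up for illustration.

```r
library(stringr)
library(dplyr)

# Made-up values with the kind of stray whitespace and "?" placeholders
# that the scraping code has to clean up
raw <- c("  Boeing\n 737 ", "?", "DC-3\n")

cleaned <- raw %>%
  str_squish() %>%  # trims both ends and collapses internal whitespace runs
  na_if("?")        # turns the "?" placeholder into NA, no `.` required
```

One difference from the regex used in `get_data`: `str_squish()` replaces an embedded newline with a single space instead of deleting it, which may or may not be what you want for a given column.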

Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):

#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
 
  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
 
  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }
 
  # get crash date
 
  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date
 
  # get location and operator
 
  loc_op <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
 
    data_frame(location=loc, operator=op)
 
  }))
 
  # get type & registration
 
  type_reg <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ
 
    pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg
 
    data_frame(type=typ, registration=reg)
 
  }))
 
  # get fatalities
 
  pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      # cell format appears to be fatalities/aboard(ground)
      data_frame(fatalities=as.numeric(x[2]), aboard=as.numeric(x[3]), ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>% tail(-1) -> afg
 
  bind_cols(date, loc_op, type_reg, afg)
 
}
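
The regular expression in the last step can be tried on its own. The sketch below uses made-up cell values and assumes, following Bruce Dudek's comment further down, that each cell reads fatalities/aboard(ground). It uses the vectorised `str_match()` rather than `str_match_all()`, purely to keep the example short.

```r
library(stringr)

# Made-up cell values in the "n/n(n)" shape the regex expects;
# "?" entries do not match and come back as NA
cells <- c("12/45(0)", "?/?(?)", "3/3(1)")

# One row per input; column 1 is the full match, columns 2-4 the groups
m <- str_match(cells, "([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)")

afg <- data.frame(
  fatalities = as.numeric(m[, 2]),
  aboard     = as.numeric(m[, 3]),
  ground     = as.numeric(m[, 4])
)
```

Cells that do not match the pattern produce a row of `NA` values instead of an error, which keeps the row count aligned with the other columns.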

While that gets one year, it’s super-simple to get all crashes since 1950:

crashes <- bind_rows(pblapply(1950:2015, get_data))

Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
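
For long runs like this, one failed download would abort the whole loop. A possible safeguard, sketched here with a stand-in function so it runs offline, is to wrap each call in `tryCatch()`. Both `fake_get_data()` and `safe_get()` are hypothetical names; in real use `get_data` would take the stand-in's place and `pblapply` could replace `lapply`.

```r
library(dplyr)

# Stand-in for get_data() (an assumption for this sketch);
# it deliberately fails for one year
fake_get_data <- function(year) {
  if (year == 2014) stop("simulated download failure")
  data.frame(year = year)
}

# Return NULL instead of stopping when a single year fails
safe_get <- function(year) {
  tryCatch(
    fake_get_data(year),
    error = function(e) {
      message("skipping ", year, ": ", conditionMessage(e))
      NULL  # bind_rows() ignores NULL list elements
    }
  )
}

results <- bind_rows(lapply(2013:2015, safe_get))
```

The skipped years are reported via `message()`, so they can be retried later without rerunning everything.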

For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.

If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.


10 Comments More Airline Crashes via the Hadleyverse

  2. Colin

    I managed to run the above code only by first running the following:
    crash_base = "http://www.planecrashinfo.com/%d/%d"

  3. Bill Clay (@expersso)

    A suggestion would be to use str_trim from the stringr package instead of gsub("(^[[:space:]]*|[[:space:]]*$|\\n)", "", .). And maybe dmy() from lubridate instead of as.Date(, format="%d %b %Y").

  4. Bruce Dudek

    I was prepared to use this as a beginning of my own transition to the hadleyverse. But error at the outset has me stumped:
    Error in ifelse(. == "?", NA, pg %>% html_nodes(xpath = sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>% html_text() %>% str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\\\n)", "")) :
    object ‘.’ not found

    sessionInfo()
    R version 3.1.3 (2015-03-09)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] lubridate_1.3.3 pbapply_1.1-1 stringr_0.6.2 magrittr_1.0.1 rvest_0.2.0 dplyr_0.4.1

    loaded via a namespace (and not attached):
    [1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 httr_0.5 lazyeval_0.1.10 memoise_0.2.1 parallel_3.1.3
    [8] plyr_1.8.1 Rcpp_0.11.5 RCurl_1.95-4.3 selectr_0.2-3 tools_3.1.3 XML_3.98-1.1

  5. Bruce Dudek

    Further debugging of the error I had above revealed that your code requires a more recent version of magrittr than I had installed. Works fine with magrittr_1.5.

  6. Bruce Dudek

    More detailed examination of the data suggests that you have reversed the labels for FATALITIES (also misspelled) and ABOARD. The original code by Phil Jette had it correct. Thanks for doing this. It has been a good learning exercise for me.
