I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
Wrote some R code for looking at historical crash data
https://t.co/bgrMj3PIZu
#data #r #DataMining pic.twitter.com/zYzpOyD9JY
— PhilJ (@philjette) March 26, 2015
The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here’s all the packages I’ll be using:
    library(dplyr)
    library(rvest)
    library(magrittr)
    library(stringr)
    library(lubridate)
    library(pbapply)
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
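The parameter checking is easy to try on its own. Here's a minimal sketch of the idea, pulled out into a standalone helper; `check_year` and `current_year` are hypothetical names I'm using for illustration and are not part of the scraper itself:

```r
# Hypothetical helpers (not in the scraper): validate a year argument
# before any network request is made.
current_year <- function() as.numeric(format(Sys.Date(), "%Y"))

check_year <- function(year = current_year()) {
  # reject anything that is not a single, non-missing number
  if (!is.numeric(year) || length(year) != 1 || is.na(year)) {
    stop("year must be a single number", call. = FALSE)
  }
  # the earliest year in the site's database is 1920
  if (year < 1920 || year > current_year()) {
    stop("year must be >=1920 and <=current year", call. = FALSE)
  }
  invisible(year)
}

check_year(1950)  # passes silently

# an out-of-range year raises an error; capture its message instead of stopping
res <- tryCatch(check_year(1900), error = function(e) conditionMessage(e))
print(res)
```

Checking up front means a bad year fails fast with a readable message rather than as an obscure HTTP or parsing error later on.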
The basic setup is to:
– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `<br>` and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
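To make that last step a bit more concrete, here's a small sketch of what `str_match_all` does with cells shaped like `n/n(n)`. The cell values are made up for illustration (they are not real site data), and the neutral column names `n1` and `n2` are my own placeholders:

```r
library(stringr)

# Made-up cell text in the "n/n(n)" shape the regex expects; not real site data
cells <- c("3/5(0)", "12/12(2)")

# one capture group per number
m <- str_match_all(cells, "([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)")

# each list element is a character matrix: the full match, then the three groups
parsed <- do.call(rbind, lapply(m, function(x) as.numeric(x[1, 2:4])))
colnames(parsed) <- c("n1", "n2", "ground")

print(parsed)
```

Since `str_match_all` returns a list with one matrix per input string, the `lapply` step is what turns the matches into something rectangular.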
Some things worth mentioning:
– `data_frame` is super-helpful in that it doesn't create `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR). The `.` just feels wrong in pipes to me, still
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
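Two of those points (the `.` placeholder and XPath) are easier to see on a tiny example. This is only a sketch: the HTML fragment is made up to mimic one table cell with two values separated by a `<br>`, and I'm assuming a current `rvest`, where `read_html` is the function for parsing a page:

```r
library(rvest)
library(stringr)
library(magrittr)

# Made-up HTML fragment mimicking one table cell: two values separated by <br>
pg <- read_html("<table><tr><td><font>Some Type<br>\n?\n</font></td></tr></table>")

# XPath can address the text node on either side of the <br> directly
pg %>%
  html_nodes(xpath = "//td/*/br[1]/preceding-sibling::text()") %>%
  html_text() %>%
  str_trim() -> typ

# `.` is magrittr's placeholder for the piped value; here it turns "?" into NA
pg %>%
  html_nodes(xpath = "//td/*/br[1]/following-sibling::text()") %>%
  html_text() %>%
  str_trim() %>%
  ifelse(. == "?", NA, .) -> reg

print(typ)
print(reg)
```

The `preceding-sibling::text()` and `following-sibling::text()` axes have no direct CSS selector equivalent, which is why XPath is the more natural fit for this kind of cell.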
Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):
    #' retrieve crash data for a given year
    #' defaults to current year
    #' earliest year in the database is 1920
    get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {

      crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"

      if (year < 1920 || year > as.numeric(format(Sys.Date(), "%Y"))) {
        stop("year must be >=1920 and <=current year", call.=FALSE)
      }

      # get crash date (the first row is the table header, so drop it)
      pg <- html(sprintf(crash_base, year, year))

      pg %>%
        html_nodes("table > tr > td:nth-child(1)") %>%
        html_text() %>%
        extract(-1) %>%
        dmy() %>%
        data_frame(date=.) -> date

      # get location and operator
      loc_op <- bind_rows(lapply(1:length(date), function(i) {

        pg %>%
          html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
          html_text() %>%
          str_trim() %>%
          str_replace_all("^(Near|Off) ", "") -> loc

        pg %>%
          html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
          html_text() %>%
          str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op

        data_frame(location=loc, operator=op)

      }))

      # get type & registration ("?" means unknown, so make it NA)
      type_reg <- bind_rows(lapply(1:length(date), function(i) {

        pg %>%
          html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
          html_text() %>%
          str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
          ifelse(.=="?", NA, .) -> typ

        pg %>%
          html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
          html_text() %>%
          str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
          ifelse(.=="?", NA, .) -> reg

        data_frame(type=typ, registration=reg)

      }))

      # get fatalities (the first number is fatalities, the second is aboard)
      pg %>%
        html_nodes("table > tr > td:nth-child(4)") %>%
        html_text() %>%
        str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
        lapply(function(x) {
          data_frame(fatalities=as.numeric(x[2]),
                     aboard=as.numeric(x[3]),
                     ground=as.numeric(x[4]))
        }) %>%
        bind_rows %>%
        tail(-1) -> afg

      bind_cols(date, loc_op, type_reg, afg)

    }
While that gets one year, it’s super-simple to get all crashes since 1950:
    crashes <- bind_rows(pblapply(1950:2015, get_data))
Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
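If the `bind_rows(lapply(...))` idiom is new to you, here's the same pattern on a stand-in function, so it runs without hitting the site. `fake_get_data` is a hypothetical stub that returns made-up values; `pblapply` is a drop-in replacement for `lapply` that adds the progress bar:

```r
library(dplyr)

# Stand-in for get_data(): returns a one-row data frame per year (made-up values)
fake_get_data <- function(year) {
  data.frame(year = year, n = 1L, stringsAsFactors = FALSE)
}

# lapply() yields a list of data frames; bind_rows() stacks them into one
all_years <- bind_rows(lapply(1950:1952, fake_get_data))

print(all_years)
```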
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.
8 Comments
I managed to run the above code only by first running the following:
    crash_base = "http://www.planecrashinfo.com/%d/%d"
thx. added the missing line back
A suggestion would be to use `str_trim` from the stringr package instead of `gsub("(^[[:space:]]*|[[:space:]]*$|\\n)", "", .)`, and maybe `dmy()` from lubridate instead of `as.Date(, format="%d %b %Y")`.
Yes! Two great suggestions to keep it even more in the Hadleyverse. I’ll make the changes later today. thx.
I was prepared to use this as a beginning of my own transition to the hadleyverse. But an error at the outset has me stumped:

    Error in ifelse(. == "?", NA,
      pg %>% html_nodes(xpath = sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>% html_text() %>% str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\\\n)", "")
    ) : object '.' not found
    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] lubridate_1.3.3 pbapply_1.1-1 stringr_0.6.2 magrittr_1.0.1 rvest_0.2.0 dplyr_0.4.1

    loaded via a namespace (and not attached):
    [1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 httr_0.5 lazyeval_0.1.10 memoise_0.2.1 parallel_3.1.3
    [8] plyr_1.8.1 Rcpp_0.11.5 RCurl_1.95-4.3 selectr_0.2-3 tools_3.1.3 XML_3.98-1.1
I believe I forgot a `library(lubridate)`. Post has been updated as well as the code in https://github.com/hrbrmstr/CrashData/blob/master/PlaneCrashesHadleyverse.R
Further debugging of the error I had above revealed that your code requires a more recent version of magrittr than I had installed. Works fine with magrittr_1.5.
More detailed examination of the data suggests that you have reversed the labels for FATALITIES (also misspelled) and ABOARD. The original code by Phil Jette had it correct. Thanks for doing this. It has been a good learning exercise for me.