I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here’s all the packages I’ll be using:
library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
The basic setup is to:
– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site uses a `
` and includes spurious newlines within a single `
` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
Some things worth mentioning:
– `data_frame` is super-helpful in not-creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifesle` (yes, I guess I could submit a PR). The `.` just feels wrong in pipes to me, still
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):
#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
stop("year must be >=1920 and <=current year", call.=FALSE)
}
# get crash date
pg <- html(sprintf(crash_base, year, year))
pg %>%
html_nodes("table > tr > td:nth-child(1)") %>%
html_text() %>%
extract(-1) %>%
dmy() %>%
data_frame(date=.) -> date
# get location and operator
loc_op <- bind_rows(lapply(1:length(date), function(i) {
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
html_text() %>%
str_trim() %>%
str_replace_all("^(Near|Off) ", "") -> loc
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
data_frame(location=loc, operator=op)
}))
# get type & registration
type_reg <- bind_rows(lapply(1:length(date), function(i) {
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
ifelse(.=="?", NA, .) -> typ
pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
ifelse(.=="?", NA, .) -> reg
data_frame(type=typ, registration=reg)
}))
# get fatalities
pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
html_text() %>%
str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
lapply(function(x) {
data_frame(aboard=as.numeric(x[2]), fatalties=as.numeric(x[3]), ground=as.numeric(x[4]))
}) %>%
bind_rows %>% tail(-1) -> afg
bind_cols(date, loc_op, type_reg, afg)
}
While that gets one year, it’s super-simple to get all crashes since 1950:
crashes <- bind_rows(pblapply(1950:2015, get_data))
Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.
Over on The DO Loop, @RickWicklin does a nice job [visualizing the causes of airline crashes](http://blogs.sas.com/content/iml/2015/03/30/visualizing-airline-crashes/) in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well, but I also thought I’d take the opportunity to:
– Give @jennybc’s new [googlesheets](http://github.com/jennybc/googlesheets) a spin
– Show some `dplyr` & `tidyr` data wrangling (never can have too many examples)
– Crank out some `ggplot` zero-based streamgraph-y area charts for the data with some extra `ggplot` wrangling for good measure
I also decided to use the colors in the [original David McCandless/Kashan visualization](http://www.informationisbeautiful.net/visualizations/plane-truth-every-single-commercial-plane-crash-visualized/).
#### Getting The Data
As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site [makes the data available](https://docs.google.com/spreadsheet/ccc?key=0AjOUPqcIwvnjdEx2akx5ZjJXSk9oM1E3dWpqZFJ6Nmc&usp=drive_web#gid=1), so I copied it to my Google Drive and gave her package a go:
library(googlesheets)
library(ggplot2) # we'll need the rest of the libraries later
library(dplyr) # but just getting them out of the way
library(tidyr)
# this will prompt for authentication the first time
my_sheets <- list_sheets()
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON"
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
# take a quick look
glimpse(flights_csv)
## Observations: 440
## Variables:
## $ date (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
# and we convert some columns while we're at it
flights_csv %>%
mutate(date=as.Date(date),
fat=as.numeric(fat),
px=as.numeric(px)) -> flights_csv
#### A Bit of Cleanup
Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.
The area chart also needs full values for each category combo per year, so we use `expand` from `tidyr` with `left_join` and `mutate` to fill in the gaps.
Finally, we make proper, ordered labels:
flights_csv %>%
mutate(year=as.numeric(format(date, "%Y"))) %>%
mutate(phase=tolower(phase),
phase=ifelse(grepl("take", phase), "takeoff", phase),
phase=ifelse(grepl("climb", phase), "takeoff", phase),
phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
count(year, meta, phase) %>%
left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>%
mutate(n=ifelse(is.na(n), 0, n)) %>%
filter(!phase %in% c("grounded", "unknown")) %>%
mutate(phase=factor(phase,
levels=c("takeoff", "en_route", "approach", "landing"),
labels=c("Takeoff", "En Route", "Approach", "Landing"),
ordered=TRUE)) -> flights_dat
I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgable in cybersecurity.
#### The Final Plot
I’m a big fan of an incremental, additive build idiom for `ggplot` graphics. By using the `gg <- gg + …` style one can move lines around, comment them out, etc without dealing with errant `+` signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows:
- main `ggplot` call with mappings if the graph is short, otherwise add the mappings to the `geom`s
- all `geom_` or `stat_` layers in the order I want them, and using line breaks to logically separate elements (like `aes`) or to wrap long lines for easier readability.
- all `scale_` elements in order from axes to line to shape to color to fill to alpha; I'm not as consistent as I'd like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
- `facet` call (if any)
- label setting, always with `labs` unless I really have a need for using `ggtitle`
- base `theme_` call
- all other `theme` elements, one per `gg <- gg +` line
I know that's not everyone's cup of tea, but it's just how I roll `ggplot`-style.
For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don't have Futura Medium.
flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")
gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta))
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
position="stack", method="gam", formula=y~s(x))
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette,
labels=c("Criminal", "Human Error",
"Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg
That ultimately produces:

with the facets ordered by takeoff, flying, approaching landing and actual landing phases. Overall, things have gotten way better, though I haven’t had time to look in to the _bump_ between 2005 and 2010 for landing crashes.
As an aside, Boeing has a [really nice PDF](http://www.boeing.com/news/techissues/pdf/statsum.pdf) on some of this data with quite a bit more detail.
|