A look at airline crashes in R with googlesheets, dplyr & ggplot2

Over on The DO Loop, @RickWicklin does a nice job visualizing the causes of airline crashes in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well, but I also thought I’d take the opportunity to:

  • Give @jennybc’s new googlesheets a spin
  • Show some dplyr & tidyr data wrangling (never can have too many examples)
  • Crank out some ggplot zero-based streamgraph-y area charts for the data with some extra ggplot wrangling for good measure

I also decided to use the colors in the original David McCandless/Kashan visualization.

Getting The Data

As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site makes the data available, so I copied it to my Google Drive and gave her package a go:

library(ggplot2) # we'll need the rest of the libraries later
library(dplyr)   # but just getting them out of the way
# this will prompt for authentication the first time
my_sheets <- list_sheets()
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" 
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
# take a quick look
## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
# and we convert some columns while we're at it
flights_csv %>%
         px=as.numeric(px)) -> flights_csv

A Bit of Cleanup

Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.

The area chart also needs full values for each category combo per year, so we use expand from tidyr with left_join and mutate to fill in the gaps.

Finally, we make proper, ordered labels:

flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% 
  mutate(n=ifelse(is.na(n), 0, n)) %>% 
  filter(!phase %in% c("grounded", "unknown")) %>%
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat

I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgable in cybersecurity.

The Final Plot

I’m a big fan of an incremental, additive build idiom for ggplot graphics. By using the gg <- gg + … style one can move lines around, comment them out, etc without dealing with errant + signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows:

  • main ggplot call with mappings if the graph is short, otherwise add the mappings to the geoms
  • all geom_ or stat_ layers in the order I want them, and using line breaks to logically separate elements (like aes) or to wrap long lines for easier readability.
  • all scale_ elements in order from axes to line to shape to color to fill to alpha; I’m not as consistent as I’d like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
  • facet call (if any)
  • label setting, always with labs unless I really have a need for using ggtitle
  • base theme_ call
  • all other theme elements, one per gg <- gg + line

I know that’s not everyone’s cup of tea, but it’s just how I roll ggplot-style.

For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don’t have Futura Medium.

flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")
gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) 
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x)) 
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, 
                             labels=c("Criminal", "Human Error",
                                      "Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))

That ultimately produces:


with the facets ordered by takeoff, flying, approaching landing and actual landing phases. Overall, things have gotten way better, though I haven’t had time to look in to the bump between 2005 and 2010 for landing crashes.

As an aside, Boeing has a really nice PDF on some of this data with quite a bit more detail.

Pre-CRAN waffle update – isotype pictograms

It seems Ruben C. Arslan had the waffle idea about the same time I did. Apart from some extra spiffy XKCD-like styling, one other thing his waffling routines allowed for was using FontAwesome icons. When you use an icon vs a block, you are really making a basic version of isotype pictograms. They can add a dimension to the story you’re trying to tell without using any words. I’ve added two parameters to a pre-release CRAN version that I’d like folks to kick the tyres on a bit. Said parameters are use_glyph– which is either FALSE or a character string for a FontAwesome icon (more on that in a bit) — and glyph_size — which is a numeric value for the font size since it won’t scale when the graphic resizes.

Fonts in R & waffle

One part of R that is (with apologies to Winston and others) weak is fonts. You can use fonts, but doing so is often not pretty (despite guidance on the subject) and not without problems (we tried using a custom font again for this year’s DBIR graphics and failed miserably — again — due to font issues and R and had to have the graphics folks substitute them in).

To use the FontAwesome glyphs you need to:

  • grab the ttf version from here
  • install it on your system
  • install the extrafont package
  • run font_import() (get some coffee/scotch while you wait)
  • load extrafont when you need to use these glyphs

Once you do that, you’re probably ready to make isotype pictograms with waffle. I say probably since this process worked on two of my OS X systems but not a third. Same R version. Same RStudio version. Same import process. (This is part of the reason for my lament of the state of fonts since I’m not exactly an n00b with either R, Macs or fonts.)

Making isotype pictograms

I did borrow some code from Ruben, but I hate typing unicode characters and I suspect most folks do as well. If you do any work in straight HTML/CSS, you know you can just refer to the various FontAwesome glyphs by name. To use FontAwesome glyphs with waffle you specify the font name (no fa- prefix) vs unicode character. If you want to see what’s available (and don’t want to bookmark the FontAwesome site) you can run either fa_list() which will give you a list of available FontAwesome glyph names or use fa_grep() and supply a pattern name. For example, running fa_grep("car") gives you:

##  [1] "car"                  "caret-down"           "caret-left"          
##  [4] "caret-right"          "caret-square-o-down"  "caret-square-o-left" 
##  [7] "caret-square-o-right" "caret-square-o-up"    "caret-up"            
## [10] "cart-arrow-down"      "cart-plus"            "cc-mastercard"       
## [13] "credit-card"          "shopping-cart"

Any grep regex will work in that function.

You’ll need to devtools::install_github("hrbrmstr/waffle", ref="cran") to use the dev/pre-CRAN version of waffle before doing anything.

To make an isotype pictogram version of the health records breaches waffle chart, you can do the following:

parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)
waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"),
       use_glyph="medkit", size=8)


So, please kick the tyres, post comments about your font successes & woes and definitely link to any isotype pictograms you create.