Come Fly With Me (well, not really) — Comparing Involuntary Disembarking Rates Across U.S. Airlines in R

By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the #rstats universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data made a quick, annual, aggregate look at this soon after the incident:

While informative, that visualization left me wanting for more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I had a quick glance at a tweet (a link to it in the comments wld be ++gd if anyone fav’d it) that had a screen capture from a PDF from the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report which can be derived from the annual index pages. I crafted the URL scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarking. But, I wrote the code and it’ll likely be useful to someone out there so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages. 
# do this separate from URL grabbing above and the PDF URL extraction
# below just to be even safer. 
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result so ignor ethem and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# was built-in functionality to download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three, separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge and the PDF does account for that (to some degree) but I’m not writing a news story and only care about the airlines with three years of data since I — for the most part — have only ever flown on ones in that list, so the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate by 10k passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data is in this gist and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

Buy on AmazonDDS Blog
DDS PodcastAmazon Author Page

Leave a Reply