
By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data and made a quick, annual, aggregate look at it soon after the incident:

While informative, that visualization left me wanting more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I caught a quick glance of a tweet (a link to it in the comments would be great if anyone favorited it) that had a screen capture from a PDF on the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report, which can be derived from the annual index pages. I crafted the URL-scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for involuntary disembarkings. But, I wrote the code and it’ll likely be useful to someone out there, so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages.
# do this separately from the URL grabbing above and the PDF URL extraction
# below just to be even safer.
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result so ignore them and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# were built-in functionality in download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))
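If you find yourself repeating that “skip if already downloaded” pattern, a tiny wrapper keeps the intent in one place. This is just a sketch (download_if_missing() is a hypothetical helper, not part of any package):

download_if_missing <- function(url, dest_dir = ".") {
  # mirror the walk() above: only hit the server if we don't have the file yet
  destfile <- file.path(dest_dir, basename(url))
  if (!file.exists(destfile)) download.file(url, destfile)
  invisible(destfile)
}

# equivalent to the walk() call above
walk(pdf_urls, download_if_missing, dest_dir = "atcr_pdfs")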

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge, and the PDF does account for that (to some degree), but I’m not writing a news story and only care about the airlines with three years of data since I have — for the most part — only ever flown on carriers in that list. So, the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate per 10K passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data are in this gist, and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

RStudio is a great way to work through analysis tasks, and I suspect most folks use the “desktop” version of the product on their local workstations.

The fine folks at RStudio also make a server version (the RStudio codebase can build either the server or the desktop product, and the two are generally at 100% feature parity for interactive use). You only need to set it up on a compatible server and then access it via any modern web browser to get virtually the same experience you have on the desktop.

I use RStudio Server as well as RStudio Desktop and have never been comfortable mixing web-browsing tasks and analysis tasks in the same browser (it’s one of the many reasons I dislike Jupyter notebooks). I also keep many apps open and would inevitably try to cmd-tab (I’m on macOS) between apps to find the RStudio Server one, only to realize I should have been keyboard-tabbing through Chrome tabs.

Now, it’s not too hard to fire up a separate Chrome or Safari instance to get a dedicated window for the server, but it’d be great if there were a way to make it “feel” more like an app — just like RStudio Desktop. Well, it turns out there is a way, using nativefier.

If you use the Slack standalone desktop client, the Atom text editor or a few other “modern” apps, they are pretty much just web pages wrapped in a browser shell using something like Electron. Jia Hao came up with the idea of being able to do the same thing for any web page.

To create a standalone RStudio Server client for a particular RStudio Server instance, just do the following after installing nativefier:

nativefier "https://rstudio.example.com:8787/" --name "RStudio @ Example"

Replace the URL with the one you currently use in-browser (and, please consider using SSL/TLS when connecting over the public internet) and use any name that will be identifiable to you. You get a safe, standalone application and will never have to worry about browser crashes impacting your workflow.

There are many customizations you can make to the app shell, and you can even use your own icons to represent servers differently. It’s all super-simple to set up and get working.

Note that for macOS folks there has been a pre-nativefier way to do this same thing with a tool called Fluid, but it uses the Apple WebKit/Safari shell vs the Chrome shell, and I prefer the Chrome shell and the cross-platform app-making ability.

Hopefully this quick R⁶ tip will help you corral your RStudio Server connections. And, don’t forget to join in on the R⁶ bandwagon and share your own quick tips, snippets and hints to help the broader R community level-up their work.

@visualisingdata rebroadcast this tweet today:

The Google Maps interface is a bit meh and the “formatted” data is also a bit meh but the data is useful when travelling (NOTE: always use a VPN in airports on both your mobile devices and laptops).

You don’t need their app or online access to take advantage of this data if you’re willing to do a tiny bit of work.

You’ll find an R-generated version of the airport Wi-Fi finder below or via this direct link.

The premise is simple:

  • grab the airport data KML that the Google Maps interface uses (NOTE: Visit the original URL from the Tweet occasionally to see if the KML URL changes)
  • make cleaner, more formatted text for the popups
  • toss up a leaflet map
  • add a searchable DT::datatable interface
  • render the R markdown page to HTML and keep it with you
  • refresh the generated HTML right before you go on your world tour

The code is directly embedded in the generated R markdown document and also in this gist so I won’t pollute this post with code blocks.

I’m hoping @bhaskar_vk will apply his mad, l33t Leaflet h@x0r $k1llz to this crude, quick hack and kick this up a notch as there is substantial room for improvement. If you give that a go as well, drop a note in the comments or on Twitter. Some ideas for improvement:

  • better markers (please, not those airplane ones :-)
  • way better text cleanup
  • link the table and map (I think that means using Shiny but I’m likely wrong about that)
  • make it better on mobile (it works on mobile but there are different design considerations to make it more usable on tiny glowing rectangles)
  • make a “trip planner” Shiny app, letting folks select their airports and produce a handy digital reference card for them as they move about the globe

To avoid “branding” confusion with R⁴ I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to tag your posts as R⁶ (or r6) and use the moniker as you see fit (I’ll figure out an acronym later).

I’ll eventually tag the 2 current “r4” posts as “r6”.

Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

Back in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017, we’re slated to get 12-18″ of snow up here in Maine, and @mrshrbrmstr asked how often this — snow in April — has occurred near us.

As with all of these R⁶ posts, exposition is minimal and the focus is generally to demonstrate one small concept.

What I’ve done here (first) is make a full tidyverse update to the snowfirst code posted in the aforementioned blog post. You’ll need to clone that repo if you’re trying to work verbatim from the code below (otherwise, just change the file-path code):

library(rprojroot)
library(stringi)
library(hrbrthemes)
library(tidyverse)

pre <- find_rstudio_root_file()

# Get and read in Maine precip ------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/state17_ME.txt.gz"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                                    col_names = c("coop_id", "year", "month", "element",
                                                  flatten_chr(map(1:31, ~paste("r_", c("v", "fm", "fq", "fs"),
                                                                               .x, sep=""))))),
         col_types = paste0("ciic", paste0(rep("iccc", 31), collapse=""), collapse=""),
         na = c("", "NA", "-", "-9999")) %>%
  gather(day, value, starts_with("r_v")) %>%
  select(-starts_with("r_")) %>%
  mutate(day = as.numeric(stri_replace_first_fixed(day, "r_v", ""))) %>%
  mutate(date = sprintf("%s-%02d-%02d", year, month, day)) -> daily_wx

# Read in stations ------------------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/ushcn-stations.txt"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = file.path(pre, "data", "ushcn-stations.txt"),
         col_positions = fwf_widths(c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                                    col_names = c("coop_id", "latitude", "longitude",
                                                  "elevation", "state", "name",
                                                  "component_1", "component_2",
                                                  "component_3", "utc_offset")),
         col_types = "cdddcccccc") -> stations

closestStation <- function(stations, lat, lon, restrict_to = NULL) {
  if (!is.null(restrict_to)) stations <- filter(stations, state == restrict_to)
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                            (stations$longitude-lon)^2))
  stations[index,]
}

# find the station closest to home ---------------------------------------------------

(near_me <- closestStation(stations, 43.2672, -70.8617, restrict_to="ME"))

Now that we have the data, the short lesson here is just exposing the fact that you can get blank facets for free with ggplot2. I’m pointing this out since many folks seem not to like reading R documentation, or miss things in said documentation (in fact, I had to be instructed today by @thomasp85 about a ggplot2 theme element setting that I didn’t know about and should have, since I do try to keep up).

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(
    n = n / 10, # readings are in 10ths of inches
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = lubridate::month(date, TRUE, FALSE)
  ) %>%
  ggplot(aes(x=date, y=n)) +
  geom_segment(aes(xend=date, yend=0), size=0.75, color="#9ecae1") +
  scale_y_continuous(limits=c(0, 65)) +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free") +
  labs(x=NULL, y="inches", title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="Y", axis_text_size=8)

Without ggplot2 helping us out we would have had to do some work to make those no-value facets show up. I also like how there are no x-axis labels since there’s no data. ggplot2::facet_wrap() has many, very granular options for customizing the appearance of facets:

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
           shrink = TRUE, labeller = "label_value", as.table = TRUE,
           switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

If you haven’t played with them, you can use this example to try them out.
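For instance, here’s a small, hypothetical tweak (not from the original post) that re-uses the previous plot via ggplot2::last_plot() and swaps in a vertical fill order with strip labels on the bottom:

# adding a new facet spec to an existing plot replaces the old one
last_plot() +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free",
             dir="v",                  # fill facets down columns instead of across rows
             strip.position="bottom")  # put the month labels below each panel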

Fin

Even though that visualization gets the message across, I kinda like this view a bit better:

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(n = n / 10) %>%
  complete(year, month=1:12) %>%
  mutate(
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = factor(lubridate::month(date, TRUE, FALSE), levels=rev(month.name))
  ) %>% 
  ggplot(aes(year, month_name)) +
  geom_tile(aes(fill=n), color="#b2b2b2", size=0.15) +
  scale_x_continuous(expand=c(0,0.15), position="top") +
  viridis::scale_fill_viridis(name = "Total inches", na.value="white") +
  labs(x=NULL, y=NULL, title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="", axis_text_size = 10) +
  guides(fill=guide_colourbar(label.position = "top", direction = "horizontal", title.vjust = 0)) +
  theme(legend.title = element_text(size=10)) +
  theme(legend.key.height = unit(0.5, "lines")) +
  theme(legend.position = c(0.9, 1.25))

The precision is lacking in the heatmap view, but you get a quick impression of when it has/hasn’t snowed. Plus, you get to use viridis.

All the updated code is in the snowfirst repo.

Crank out your own small code snippets or ideas to the R community. R⁶ is an open tag and perhaps we can band together to make a distributed cadre of helpful, digestible posts the R community can benefit from.

@eddelbuettel’s idea is a good one. (it’s a quick read…jump there and come back). We’ll avoid confusion and call it R⁶ over here. Feel free to don the superclass.

I often wait for a complete example or new package announcement to blog something when a briefly explained snippet might have sufficient utility for many R users. Also, tweets are fleeting and Twitter could end up on the island of misfit social media sites if it can’t generate revenue or find a giant buyer this year. Don’t get me wrong, Twitter convos are fine/useful, but blogs are at least semi-permanent, especially if you let them be hoovered up by the Internet Archive (“Save Page Now” on their site or use this handy Chrome extension).

I’ll tag all R⁶ posts as “r6” if you want to auto-filter those out of your stream or just page through them.

I’ll also lead off these micro-posts with a simple one: adding progress bars to your tidyverse purrr operations.

The purrr::map* functions enable expressive and type-safe vectorized operations. Mine are usually over a few million/billion IPv4 addresses or domain names/URLs and often involve moderately lengthy tasks, so I usually add the ability to incorporate progress bars to the functions I write (and I’m trying hard to get out of the bad habit of long-ish anonymous functions in purrr calls). The following is a toy example, but it’s a working example you can run in your interactive R session now:

library(tidyverse)

arduously_long_nchar <- function(input_var, .pb=NULL) {
  
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  Sys.sleep(1)
  
  nchar(input_var)
  
}

pb <- progress_estimated(length(letters))

map_int(letters, arduously_long_nchar, .pb=pb)

And, yes, I did make you wait ~26 seconds (unless you were intrepid enough to reduce the amount of sleep time :-)

If you happen to forget the progress bar object (or know you don’t need one):

map_int(letters, arduously_long_nchar)

the function still works (sans progress bars).

If you happen to also mess up what you pass in to the .pb parameter or get your progress bar out of sync with your object it won’t error out on you (it can be made much safer and wrapped in another function, say — tick_off(.pb) — but this is supposed to be a small post).
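In case you’re curious what that safer wrapper might look like, here is a minimal sketch of a hypothetical tick_off() (it’s not part of dplyr or purrr) that only advances the bar when a valid, unfinished progress object is passed in:

tick_off <- function(.pb = NULL) {
  # advance the bar only if we were handed a dplyr "Progress" object
  # that hasn't already reached its end
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  invisible(.pb)
}

arduously_long_nchar <- function(input_var, .pb=NULL) {
  tick_off(.pb)
  Sys.sleep(1)
  nchar(input_var)
}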

Comments/feedback/your-own-progress-methods are most welcome and encouraged.

A story about one of the retail chains (J.C. Penney) releasing their list of stores closing in 2017 crossed paths with my Feedly reading list today and jogged my memory that there were a number of chains closing many of their doors this year, and I wanted to see the impact that might have on various states.

I’m also doing this to add one more example of:

  • scraping (with content caching)
  • data cleaning
  • per-capita normalization
  • comparing salient information to other indicators

to the growing list of great examples out there by the extended R community. Plus, I feel compelled to try to keep up with @ma_salmon’s blogging pace.

Let’s jump right in…

library(httr)
library(rvest)
library(knitr)
library(kableExtra)
library(ggalt)
library(statebins)
library(hrbrthemes)
library(tidyverse)

options(knitr.table.format = "html")
update_geom_font_defaults(font_rc_light, size = 2.75)

“Closing” lists of four major retailers — K-Mart, Sears, Macy’s and J.C. Penney — abound (HTML-formatting a list seems to be the “easy way out” story-wise for many blogs and newspapers). We can dig a bit deeper than just a plain set of lists, but first we need the data.

The Boston Globe has a nice, predictable, mostly-uniform pattern to their closing-list “stories”, so we’ll use that data. Site content can change quickly, so it makes sense to try to cache content whenever possible as we scrape it. To that end, we’ll use httr::GET vs xml2::read_html since GET preserves all of the original request and response information, while read_html returns an external pointer that has no current support for serialization without extra work.

closings <- list(
  kmart = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-kmart-stores-closing-around/4kJ0YVofUWHy5QJXuPBAuM/story.html",
  sears = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-sears-stores-closing-around/yHaP6nV2C4gYw7KLhuWuFN/story.html",
  macys = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-macy-stores-closing-around/6TY8a3vy7yneKV1nYcwY7K/story.html",
    jcp = "https://www.bostonglobe.com/business/2017/03/17/the-full-list-penney-stores-closing-around/vhoHjI3k75k2pSuQt2mZpO/story.html"
)

saved_pgs <- "saved_store_urls.rds"

if (file.exists(saved_pgs)) {
  pgs <- read_rds(saved_pgs)
} else {
  pgs <- map(closings, GET)
  write_rds(pgs, saved_pgs)
}

This is what we get from that scraping effort:

map(pgs, content) %>%
  map(html_table) %>%
  walk(~glimpse(.[[1]]))
## Observations: 108
## Variables: 3
## $ X1 <chr> "300 Highway 78 E", "2003 US Hwy 280 Bypass", "3600 Wilson ...
## $ X2 <chr> "Jasper", "Phenix City", "Bakersfield", "Coalinga", "Kingsb...
## $ X3 <chr> "AL", "AL", "CA", "CA", "CA", "CA", "CO", "CO", "CT", "FL",...
## Observations: 42
## Variables: 4
## $ X1 <chr> "301 Cox Creek Pkwy", "1901 S Caraway Road", "90 Elm St; En...
## $ X2 <chr> "Florence", "Jonesboro", "Enfield", "Lake Wales", "Albany",...
## $ X3 <chr> "AL", "AR", "CT", "FL", "GA", "GA", "IN", "KS", "KY", "LA",...
## $ X4 <chr> "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "Y",...
## Observations: 68
## Variables: 6
## $ X1 <chr> "Mission Valley Apparel", "Paseo Nuevo", "*Laurel Plaza", "...
## $ X2 <chr> "San Diego", "Santa Barbara", "North Hollywood", "Simi Vall...
## $ X3 <chr> "CA", "CA", "CA", "CA", "FL", "FL", "FL", "FL", "FL", "GA",...
## $ X4 <int> 385000, 141000, 475000, 190000, 101000, 195000, 143000, 140...
## $ X5 <int> 1961, 1990, 1995, 2006, 1995, 2000, 1977, 1974, 2000, 1981,...
## $ X6 <int> 140, 77, 105, 105, 68, 83, 86, 73, 72, 69, 9, 57, 54, 87, 5...
## Observations: 138
## Variables: 3
## $ Mall/Shopping Center <chr> "Auburn Mall", "Tannehill Promenade", "Ga...
## $ City                 <chr> "Auburn", "Bessemer", "Gadsden", "Jasper"...
## $ State                <chr> "AL", "AL", "AL", "AL", "AR", "AR", "AZ",...

We now need to normalize the content of the lists.

map(pgs, content) %>%
  map(html_table) %>%
  map(~.[[1]]) %>%
  map_df(select, abb=3, .id = "store") -> closings

We’re ultimately just looking for city/state for this simple exercise, but one could do more precise geolocation (perhaps with rgeocodio) and combine that with local population data, job-loss estimates, current unemployment levels, etc. to make a real story out of the closings vs just doing the easy thing and publishing a list of stores.

count(closings, abb) %>%
  left_join(data_frame(name = state.name, abb = state.abb)) %>%
  left_join(usmap::statepop, by = c("abb"="abbr")) %>%
  mutate(per_capita = (n/pop_2015) * 1000000) %>%
  select(name, n, per_capita) -> closings_by_state

(NOTE: you can get the code for the entire Rmd via RPubs or GitHub)

Compared to unemployment/underutilization

I’d have used the epidata package to get the current state unemployment data but it’s not quite current enough, so we can either use a package to get data from the Bureau of Labor Statistics or just scrape it. A quick coin-flip says we’ll scrape the data.

We’ll use the U-6 rate since that is an expanded definition of “underutilization” including “total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers” and is likely to be more representative of the populations working at retail chains. I could be wrong. I only play an economist on TV. If you are an economist, please drop a note telling me where I erred in my thinking.

pg <- read_html("https://www.bls.gov/lau/stalt16q4.htm")

html_nodes(pg, "table#alternmeas16\\:IV") %>% 
  html_table(header = TRUE, fill = TRUE) %>%
  .[[1]] %>% 
  docxtractr::assign_colnames(1) %>% 
  rename(name=State) %>% 
  as_data_frame() %>% 
  slice(2:52) %>% 
  type_convert() %>% 
  left_join(closings_by_state, by="name") %>% 
  filter(!is.na(n)) -> with_unemp

ggplot(with_unemp, aes(per_capita, `U-6`)) +
  geom_label(aes(label=name), fill="#8c96c6", color="white", size=3.5, family=font_rc) +
  scale_x_continuous(limits=c(-0.125, 6.75)) +
  labs(x="Closings per-capita (1MM)", 
       y="BLS Labor Underutilization (U-6 rate)",
       title="Per-capita store closings compared to current BLS U-6 Rate") +
  theme_ipsum_rc(grid="XY")

If I were a reporter (again, I only play one on TV), I think I’d be digging a bit deeper on the impact of these (and the half-dozen or so other) retailers closing locations in New Mexico, Nevada, West Virginia, Wyoming (mebbe Maine, though I’m super-biased :-), North Dakota & South Dakota. I also hope @Marketplace does a few more stories on the changing retail landscape in the U.S. over the coming months to see if there are any overt consequences to the loss of jobs and anchor stores.

If you end up tinkering with the data, drop a note in the comments if something you discover piques your interest. For those interested in potentially marrying the data up with some additional cartography, there should be enough precision in the store lists to get distinct enough lat/lon pairs after geocoding (I did a quick test with rgeocodio) to make some interesting map views, especially if you can find more store closing lists.

This is a brief (and likely obvious, for some folks) post on the dplyr::case_when() function.

Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata to find individual and collective insight is often a painful task given the diversity in the internet ecosystem.

One attribute we try to collect in all our service scans is operating system (OS) version. For many of our minutiae-focused researchers, it’s vital to know if a host is using “CoreOS 899.17.0” vs “CoreOS 835.9.0”. For much of the aggregation and clustering work we do, “CoreOS” is just fine.

In broad scans for any given service the OS diversity can be YUGE. There may be upwards of 10 different variations each of Windows, Red Hat, Ubuntu, Debian, et al. present, along with a smattering of very infrequent OS types such as “Scientific Linux”. Plus, we can always count on probes returning many NA values for many discrete attribute queries, including OS type+version.

There are many ways to reduce a diverse list of OS type+version strings to a smaller target set. switch() and ifelse() are likely the go-to solutions for many of you reading this. If you are in those camps and haven’t tried dplyr::case_when(), read on!
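As a quick taste, here is a tiny, hypothetical comparison (toy data, not the scan data used below) of a nested ifelse() chain and its case_when() equivalent:

library(dplyr)

x <- c("apple", "apricot", "banana", NA)

# nested ifelse() works, but gets unwieldy as the number of cases grows,
# and the NA handling is only implicit (grepl() treats NA as a non-match here)
ifelse(grepl("^ap", x), "a-fruit",
       ifelse(grepl("^ba", x), "b-fruit", "other"))

# the same logic with case_when() reads top-to-bottom and makes the NA case explicit
case_when(
  grepl("^ap", x) ~ "a-fruit",
  grepl("^ba", x) ~ "b-fruit",
  is.na(x)        ~ "unknown",
  TRUE            ~ "other"
)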

Noise Reduction

To illustrate the utility of case_when(), let’s walk through an example. I created a tiny excerpt of just the OS type + version info from 500 observations out of a much larger internet scan. You can find that data at https://rud.is/dl/os.txt. Let’s take a look at the OS diversity:

library(ggalt)
library(hrbrthemes)
library(tidyverse)

os <- read_lines("https://rud.is/dl/os.txt", na = "NA")

str(table(os, useNA = "always"))
##  'table' int [1:28(1d)] 2 3 1 1 1 44 3 101 1 6 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ os: chr [1:28] "" "<unknown>" "Amazon Linux AMI 2016.03" "Amazon Linux AMI 2016.09" ...

sort(unique(os))
##  [1] ""                                           
##  [2] "<unknown>"                                  
##  [3] "Amazon Linux AMI 2016.03"                   
##  [4] "Amazon Linux AMI 2016.09"                   
##  [5] "Arch Linux"                                 
##  [6] "CentOS Linux 7 (Core)"                      
##  [7] "CoreOS 766.4.0"                             
##  [8] "CoreOS 899.17.0"                            
##  [9] "Debian GNU/Linux 7 (wheezy)"                
## [10] "Debian GNU/Linux 8 (jessie)"                
## [11] "Fedora 20 (Heisenbug)"                      
## [12] "linux"                                      
## [13] "openSUSE Leap 42.2"                         
## [14] "RancherOS v0.7.0"                           
## [15] "Red Hat Enterprise Linux Server 7.2 (Maipo)"
## [16] "Red Hat Enterprise Linux Server 7.3 (Maipo)"
## [17] "Ubuntu 14.04.1 LTS"                         
## [18] "Ubuntu 14.04.2 LTS"                         
## [19] "Ubuntu 14.04.3 LTS"                         
## [20] "Ubuntu 14.04.4 LTS"                         
## [21] "Ubuntu 14.04.5 LTS"                         
## [22] "Ubuntu 15.10"                               
## [23] "Ubuntu 16.04.1 LTS"                         
## [24] "Ubuntu 16.04.2 LTS"                         
## [25] "Ubuntu 16.10"                               
## [26] "Windows Server 2016 Datacenter"             
## [27] "Windows Server 2016 Standard"

There are 28 (including NA) different strings in just a tiny excerpt. Ugh.

If we want to group all Windows results as “Windows”, all Red Hat, CentOS and Fedora results as “Fedora”, all Ubuntu and Debian results as “Debian” and all CoreOS and Amazon results as “Amazon”, while keeping NAs distinguishable and lumping everything else as “Other”, it’s super-easy with case_when():

ELSE <- TRUE

case_when(
  grepl("Windows", os) ~ "Windows-ish",
  grepl("Red Hat|CentOS|Fedora", os) ~ "Fedora-ish",
  grepl("Ubuntu|Debian", os) ~ "Debian-ish",
  grepl("CoreOS|Amazon", os) ~ "Amazon-ish",
  is.na(os) ~ "Unknown",
  ELSE ~ "Other"
) %>%
  table() %>%
  as_data_frame() %>%
  set_names(c("os", "Node Count")) %>%
  arrange(`Node Count`) %>%
  mutate(os = factor(os, os)) %>%
  ggplot(aes(`Node Count`, os)) +
  geom_lollipop(horizontal = TRUE, size=1.5, color="#54278f") +
  scale_x_comma(limits=c(0,300)) +
  labs(y=NULL, title="OS Types") +
  theme_ipsum_rc(grid="X")

The clever formula (~) syntax used by case_when() enables you to cleanly and effortlessly reduce factor/categorical levels and also lets you preserve NA values (which I translated to “Unknown”). Since ELSE is used in the SQL CASE statement and dplyr::case_when() is a riff of said SQL cousin, I like to use an assigned ELSE to make it more visually explicit, but using TRUE is just as good (and, perhaps, better since TRUE can’t get namespace clobbered like the ELSE variable can).

FIN

If you’re in sequential or nested ifelse() Hades or are frustrated by switch() limitations, give dplyr::case_when() a try for your next project.

Epilogue

Not enough time earlier to add other methods, so this hint from @drob will have to suffice for now: