
By now, word of the forcible deplanement of a medical professional by United has reached even the remotest outposts in the universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data and made a quick, annual, aggregate look at it soon after the incident:

While informative, that visualization left me wanting more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I caught a quick glance of a tweet (a link to it in the comments would be most welcome if anyone fav’d it) that had a screen capture from a PDF on the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report, which can be derived from the annual index pages. I crafted the URL-scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarkings. But I wrote the code, and it’ll likely be useful to someone out there, so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages.
# do this separately from the URL grabbing above and the PDF URL extraction
# below just to be even safer.
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result, so ignore them and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# was built-in functionality to download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste of time since the PDF reports have monthly data in other categories, and it did provide me with three years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():
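If you want to poke at the raw extraction yourself, pdf_text() returns one long character string per page:

txt <- pdftools::pdf_text(file.path("atcr_pdfs", "2017MarchATCR.pdf"))
length(txt)                   # one element per PDF page
cat(substr(txt[[1]], 1, 300)) # peek at the start of page 1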

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge, and the PDF accounts for that (to some degree), but I’m not writing a news story and only care about the airlines with three years of data since I have (for the most part) only ever flown on ones in that list. So, the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate per 10K passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data are in this gist, and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

RStudio is a great way to work through analysis tasks, and I suspect most folks use the “desktop” version of the product on their local workstations.

The fine folks at RStudio also make a server version (the RStudio codebase can generate either server or desktop builds, and the two are generally at 100% feature parity for interactive use). You only need to set it up on a compatible server and then access it via any modern web browser to get virtually the same experience you have on the desktop.
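Setup is straightforward. Roughly (assuming a 64-bit Debian/Ubuntu box; the package version and URL below are examples, so grab the current ones from RStudio’s download page):

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
sudo gdebi rstudio-server-1.0.143-amd64.deb
# then browse to http://your-server:8787/ and log in with a local account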

I use RStudio Server as well as RStudio Desktop and have never been comfortable mixing web browsing tasks and analysis tasks in the same browser (it’s one of the many reasons I dislike jupyter notebooks). I also keep many apps open and inevitably would try to cmd-tab (I’m on macOS) between apps to find the RStudio Server one, only to realize I should have been keyboard-tabbing through Chrome tabs.

Now, it’s not too hard to fire up a separate Chrome or Safari instance for each server, but it’d be great if there were a way to make it “feel” more like an app, just like RStudio Desktop. Well, it turns out there is a way, using nativefier.

If you use the Slack standalone desktop client, the Atom text editor or a few other “modern” apps, they are pretty much just web pages wrapped in a browser shell using something like Electron. Jia Hao came up with the idea of being able to do the same thing for any web page.

To create a standalone RStudio Server client for a particular RStudio Server instance, just do the following after installing nativefier:

nativefier "https://rstudio.example.com:8787/" --name "RStudio @ Example"

Replace the URL with the one you currently use in-browser (and, please consider using SSL/TLS when connecting over the public internet) and use any name that will be identifiable to you. You get a safe, standalone application and will never have to worry about browser crashes impacting your workflow.

There are many customizations you can make to the app shell, and you can even use your own icons to represent servers differently. It’s all super-simple to set up and get working.
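For instance (hypothetical host and icon path; nativefier has many more flags):

nativefier "https://rstudio.example.com:8787/" --name "RStudio @ Example" --icon ~/icons/rstudio-example.png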

Note that for macOS folks there has been a way, pre-nativefier, to do this same thing with a tool called Fluid, but it uses the Apple WebKit/Safari shell vs the Chrome shell, and I prefer the Chrome shell and the cross-platform app-making ability.

Hopefully this quick R⁶ tip will help you corral your RStudio Server connections. And, don’t forget to join in on the R⁶ bandwagon and share your own quick tips, snippets and hints to help the broader R community level-up their work.

@visualisingdata rebroadcast this tweet today:

The Google Maps interface is a bit meh and the “formatted” data is also a bit meh, but the data is useful when travelling (NOTE: always use a VPN in airports, on both your mobile devices and laptops).

You don’t need their app or online access to take advantage of this data if you’re willing to do a tiny bit of work.

You’ll find an R-generated version of the airport Wi-Fi finder below or via this direct link.

The premise is simple:

  • grab the airport data KML that the Google Maps interface uses (NOTE: Visit the original URL from the Tweet occasionally to see if the KML URL changes)
  • make cleaner, more formatted text for the popups
  • toss up a leaflet map
  • add a searchable DT::datatable interface
  • render the R markdown page to HTML and keep it with you
  • refresh the generated HTML right before you go on your world tour

The code is directly embedded in the generated R markdown document and also in this gist so I won’t pollute this post with code blocks.
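That said, the skeleton is compact enough to sketch. A minimal, hypothetical version of those steps (the KML URL is a placeholder, so substitute the real one from the Google Maps interface; popup text cleanup is left out):

library(sf)
library(leaflet)
library(DT)

# placeholder: substitute the KML URL the Google Maps interface uses
kml_url <- "https://www.google.com/maps/d/kml?mid=PLACEHOLDER"
kml_fil <- "airport_wifi.kml"
if (!file.exists(kml_fil)) download.file(kml_url, kml_fil)

# KML placemarks come in with Name & Description fields
airports <- st_read(kml_fil, quiet = TRUE)

leaflet(airports) %>%
  addTiles() %>%
  addMarkers(popup = ~Description)

datatable(st_set_geometry(airports, NULL), filter = "top")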

I’m hoping @bhaskar_vk will apply his mad, l33t Leaflet h@x0r $k1llz to this crude, quick hack and kick this up a notch as there is substantial room for improvement. If you give that a go as well, drop a note in the comments or on Twitter. Some ideas for improvement:

  • better markers (please, not those airplane ones :-)
  • way better text cleanup
  • link the table and map (I think that means using Shiny but I’m likely wrong about that)
  • make it better on mobile (it works on mobile but there are different design considerations to make it more usable on tiny glowing rectangles)
  • make a “trip planner” Shiny app, letting folks select their airports and produce a handy digital reference card for them as they move about the globe

To avoid “branding” confusion with R⁴, I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to tag your posts as R⁶ (or r6) and use the moniker as you see fit (I’ll figure out an acronym later).

I’ll eventually tag the 2 current “r4” posts as “r6”.

Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

Back in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017, we’re slated to get 12-18″ of snow up here in Maine, and @mrshrbrmstr asked how often this (snow in April) has occurred near us.

As with all of these R⁶ posts, exposition is minimal and the focus is generally to demonstrate one small concept.

What I’ve done here (first) is make a full tidyverse update to the snowfirst code posted in the aforementioned blog post. You’ll need to clone that repo if you’re trying to work verbatim from the code below (otherwise just change file path code):

library(rprojroot)
library(stringi)
library(hrbrthemes)
library(tidyverse)

pre <- find_rstudio_root_file()

# Get and read in Maine precip ------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/state17_ME.txt.gz"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                                    col_names = c("coop_id", "year", "month", "element",
                                                  flatten_chr(map(1:31, ~paste("r_", c("v", "fm", "fq", "fs"),
                                                                               .x, sep=""))))),
         col_types = paste0("ciic", paste0(rep("iccc", 31), collapse=""), collapse=""),
         na = c("", "NA", "-", "-9999")) %>%
  gather(day, value, starts_with("r_v")) %>%
  select(-starts_with("r_")) %>%
  mutate(day = as.numeric(stri_replace_first_fixed(day, "r_v", ""))) %>%
  mutate(date = sprintf("%s-%02d-%02d", year, month, day)) -> daily_wx

# Read in stations ------------------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/ushcn-stations.txt"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                                    col_names = c("coop_id", "latitude", "longitude",
                                                  "elevation", "state", "name",
                                                  "component_1", "component_2",
                                                  "component_3", "utc_offset")),
         col_types = "cdddcccccc") -> stations

closestStation <- function(stations, lat, lon, restrict_to = NULL) {
  if (!is.null(restrict_to)) stations <- filter(stations, state == restrict_to)
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                            (stations$longitude-lon)^2))
  stations[index,]
}

# find the station closest to a given location ---------------------------------------

(near_me <- closestStation(stations, 43.2672, -70.8617, restrict_to="ME"))

Now that we have the data, the short lesson here is just exposing the fact that you get blank facets for free with ggplot2. I’m pointing this out as many folks seem to not like reading R documentation, or miss things in said documentation. (In fact, I had to be instructed today by @thomasp85 about a ggplot2 theme element setting that I didn’t know about and should have, since I do try to keep up.)

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(
    n = n / 10, # readings are in 10ths of inches
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = lubridate::month(date, TRUE, FALSE)
  ) %>%
  ggplot(aes(x=date, y=n)) +
  geom_segment(aes(xend=date, yend=0), size=0.75, color="#9ecae1") +
  scale_y_continuous(limits=c(0, 65)) +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free") +
  labs(x=NULL, y="inches", title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="Y", axis_text_size=8)

Without ggplot2 helping us out, we would have had to do some work to make those no-value facets show up. I also like how there are no x-axis labels where there’s no data. ggplot2::facet_wrap() has many, very granular options for customizing the appearance of facets:

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
           shrink = TRUE, labeller = "label_value", as.table = TRUE,
           switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

If you haven’t played with them, you can use this example to try them out.
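A couple of quick, purely illustrative variations to swap into the plot above:

# fill facets down columns instead of across rows
facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free", dir="v")

# move the strip labels below each panel
facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free", strip.position="bottom")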

FIN

Even though that visualization gets the message across, I kinda like this view a bit better:

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(n = n / 10) %>%
  complete(year, month=1:12) %>%
  mutate(
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = factor(lubridate::month(date, TRUE, FALSE), levels=rev(month.name))
  ) %>% 
  ggplot(aes(year, month_name)) +
  geom_tile(aes(fill=n), color="#b2b2b2", size=0.15) +
  scale_x_continuous(expand=c(0,0.15), position="top") +
  viridis::scale_fill_viridis(name = "Total inches", na.value="white") +
  labs(x=NULL, y=NULL, title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="", axis_text_size = 10) +
  guides(fill=guide_colourbar(label.position = "top", direction = "horizontal", title.vjust = 0)) +
  theme(legend.title = element_text(size=10)) +
  theme(legend.key.height = unit(0.5, "lines")) +
  theme(legend.position = c(0.9, 1.25))

The precision is lacking in the heatmap view, but you get a quick impression of when it has/hasn’t snowed. Plus, you get to use viridis!

All the updated code is in the snowfirst repo.

Crank out your own small code snippets or ideas to the R community. R⁶ is an open tag, and perhaps we can band together to make a distributed cadre of helpful, digestible posts the R community can benefit from.

@eddelbuettel’s idea is a good one (it’s a quick read…jump there and come back). We’ll avoid confusion and call it R⁶ over here. Feel free to don the superclass.

I often wait for a complete example or new package announcement to blog something, when a briefly explained snippet might have sufficient utility for many R users. Also, tweets are fleeting, and twitter could end up on the island of misfit social media sites if it can’t generate revenue or find a giant buyer this year. Don’t get me wrong, twitter convos are fine/useful, but blogs are at least semi-permanent, especially if you let them be hoovered up by the Internet Archive (“Save Page Now” on their site or use this handy Chrome extension).
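You can even trigger a snapshot from R. A minimal sketch using the Archive’s “Save Page Now” endpoint (be kind with request rates):

# ask the Internet Archive to snapshot a page
httr::GET(paste0("https://web.archive.org/save/", "https://rud.is/b/"))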

I’ll tag all R⁶ posts as “r6” if you want to auto-filter those out of your stream or just page through them.

I’ll also lead off these micro-posts with a simple one: adding progress bars to your tidyverse purrr operations.

The purrr::map* functions enable expressive and type-safe vectorized operations. Mine are usually over a few million/billion IPv4 addresses or domain names/URLs and often involve moderately lengthy tasks, so I usually add the ability to incorporate progress bars to functions I make (and I’m trying hard to get out of the bad habit of long-ish anonymous functions in purrr calls). The following is a toy example, but it’s a working example you can run in your interactive R session now:

library(tidyverse)

arduously_long_nchar <- function(input_var, .pb=NULL) {
  
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  Sys.sleep(1)
  
  nchar(input_var)
  
}

pb <- progress_estimated(length(letters))

map_int(letters, arduously_long_nchar, .pb=pb)

And, yes, I did make you wait ~26 seconds (unless you were intrepid enough to reduce the amount of sleep time :-)

If you happen to forget the progress bar object (or know you don’t need one):

map_int(letters, arduously_long_nchar)

the function still works (sans progress bars).

If you happen to mess up what you pass in to the .pb parameter, or get your progress bar out of sync with your object, it won’t error out on you. (It could be made much safer and wrapped in another function, say tick_off(.pb), but this is supposed to be a small post.)
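For the curious, that wrapper could be as simple as this sketch (tick_off() is a hypothetical name; it just centralizes the same guard used above):

# hypothetical helper: advance a progress bar only when it's safe to do so
tick_off <- function(.pb = NULL) {
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  invisible(.pb)
}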

Comments/feedback/your-own-progress-methods are most welcome and encouraged.

A story about one of the retail chains (J.C. Penney) releasing their list of stores closing in 2017 crossed paths with my Feedly reading list today and jogged my memory that a number of chains are closing many of their doors this year, and I wanted to see the impact that might have on various states.

I’m also doing this to add one more example of:

  • scraping (with content caching)
  • data cleaning
  • per-capita normalization
  • comparing salient information to other indicators

to the growing list of great examples out there by the extended R community. Plus, I feel compelled to try to keep up with @ma_salmon’s blogging pace.

Let’s jump right in…

library(httr)
library(rvest)
library(knitr)
library(kableExtra)
library(ggalt)
library(statebins)
library(hrbrthemes)
library(tidyverse)

options(knitr.table.format = "html")
update_geom_font_defaults(font_rc_light, size = 2.75)

“Closing” lists for four major retailers — K-Mart, Sears, Macy’s and J.C. Penney — abound (HTML-formatting a list seems to be the “easy way out” story-wise for many blogs and newspapers). We can dig a bit deeper than just a plain set of lists, but first we need the data.

The Boston Globe has a nice, predictable, mostly-uniform pattern to their list-closing “stories”, so we’ll use their data. Site content can change quickly, so it makes sense to cache content whenever possible as we scrape it. To that end, we’ll use httr::GET vs xml2::read_html since GET preserves all of the original request and response information, while read_html returns an external pointer that has no current support for serialization without extra work.

closings <- list(
  kmart = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-kmart-stores-closing-around/4kJ0YVofUWHy5QJXuPBAuM/story.html",
  sears = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-sears-stores-closing-around/yHaP6nV2C4gYw7KLhuWuFN/story.html",
  macys = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-macy-stores-closing-around/6TY8a3vy7yneKV1nYcwY7K/story.html",
    jcp = "https://www.bostonglobe.com/business/2017/03/17/the-full-list-penney-stores-closing-around/vhoHjI3k75k2pSuQt2mZpO/story.html"
)

saved_pgs <- "saved_store_urls.rds"

if (file.exists(saved_pgs)) {
  pgs <- read_rds(saved_pgs)
} else {
  pgs <- map(closings, GET)
  write_rds(pgs, saved_pgs)
}

This is what we get from that scraping effort:

map(pgs, content) %>%
  map(html_table) %>%
  walk(~glimpse(.[[1]]))
## Observations: 108
## Variables: 3
## $ X1 <chr> "300 Highway 78 E", "2003 US Hwy 280 Bypass", "3600 Wilson ...
## $ X2 <chr> "Jasper", "Phenix City", "Bakersfield", "Coalinga", "Kingsb...
## $ X3 <chr> "AL", "AL", "CA", "CA", "CA", "CA", "CO", "CO", "CT", "FL",...
## Observations: 42
## Variables: 4
## $ X1 <chr> "301 Cox Creek Pkwy", "1901 S Caraway Road", "90 Elm St; En...
## $ X2 <chr> "Florence", "Jonesboro", "Enfield", "Lake Wales", "Albany",...
## $ X3 <chr> "AL", "AR", "CT", "FL", "GA", "GA", "IN", "KS", "KY", "LA",...
## $ X4 <chr> "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "Y",...
## Observations: 68
## Variables: 6
## $ X1 <chr> "Mission Valley Apparel", "Paseo Nuevo", "*Laurel Plaza", "...
## $ X2 <chr> "San Diego", "Santa Barbara", "North Hollywood", "Simi Vall...
## $ X3 <chr> "CA", "CA", "CA", "CA", "FL", "FL", "FL", "FL", "FL", "GA",...
## $ X4 <int> 385000, 141000, 475000, 190000, 101000, 195000, 143000, 140...
## $ X5 <int> 1961, 1990, 1995, 2006, 1995, 2000, 1977, 1974, 2000, 1981,...
## $ X6 <int> 140, 77, 105, 105, 68, 83, 86, 73, 72, 69, 9, 57, 54, 87, 5...
## Observations: 138
## Variables: 3
## $ Mall/Shopping Center <chr> "Auburn Mall", "Tannehill Promenade", "Ga...
## $ City                 <chr> "Auburn", "Bessemer", "Gadsden", "Jasper"...
## $ State                <chr> "AL", "AL", "AL", "AL", "AR", "AR", "AZ",...

We now need to normalize the content of the lists.

# pull the state column (3rd in each table) & tag each row with the retailer
map(pgs, content) %>%
  map(html_table) %>%
  map(~.[[1]]) %>%
  map_df(select, abb=3, .id = "store") -> closings

We’re ultimately just looking for city/state for this simple exercise, but one could do more precise geolocation (perhaps with rgeocodio) and combine that with local population data, job-loss estimates, current unemployment levels, etc. to make a real story out of the closings vs just doing the easy thing and publishing a list of stores.
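As a starting point, a one-off geocoding call with rgeocodio (assuming its gio_geocode() interface and a Geocod.io API key in your environment) looks something like this, using an address from the scraped Sears list:

library(rgeocodio)

# requires a Geocod.io API key (GEOCODIO_API_KEY)
gio_geocode("301 Cox Creek Pkwy, Florence, AL")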

count(closings, abb) %>%
  left_join(data_frame(name = state.name, abb = state.abb)) %>%
  left_join(usmap::statepop, by = c("abb"="abbr")) %>%
  mutate(per_capita = (n/pop_2015) * 1000000) %>%
  select(name, n, per_capita) -> closings_by_state
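The knitr/kableExtra packages loaded earlier can render that into a readable HTML table. A minimal sketch (the column labels are my own):

arrange(closings_by_state, desc(per_capita)) %>%
  head(10) %>%
  kable(col.names = c("State", "Closings", "Closings per 1MM population"), digits = 2) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)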

(NOTE: you can get the code for the entire Rmd via RPubs or GitHub)

Compared to unemployment/underutilization

I’d have used the epidata package to get the current state unemployment data, but it’s not quite current enough, so we can either use a package to get data from the Bureau of Labor Statistics or just scrape it. A quick coin-flip says we’ll scrape the data.

We’ll use the U-6 rate since that is an expanded definition of “underutilization” including “total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers” and is likely to be more representative of the populations working at retail chains. I could be wrong. I only play an economist on TV. If you are an economist, please drop a note telling me where I erred in my thinking.

pg <- read_html("https://www.bls.gov/lau/stalt16q4.htm")

html_nodes(pg, "table#alternmeas16\\:IV") %>% 
  html_table(header = TRUE, fill = TRUE) %>%
  .[[1]] %>% 
  docxtractr::assign_colnames(1) %>% 
  rename(name=State) %>% 
  as_data_frame() %>% 
  slice(2:52) %>% 
  type_convert() %>% 
  left_join(closings_by_state, by="name") %>% 
  filter(!is.na(n)) -> with_unemp

ggplot(with_unemp, aes(per_capita, `U-6`)) +
  geom_label(aes(label=name), fill="#8c96c6", color="white", size=3.5, family=font_rc) +
  scale_x_continuous(limits=c(-0.125, 6.75)) +
  labs(x="Closings per-capita (1MM)", 
       y="BLS Labor Underutilization (U-6 rate)",
       title="Per-capita store closings compared to current BLS U-6 Rate") +
  theme_ipsum_rc(grid="XY")

If I were a reporter (again, I only play one on TV), I think I’d be digging a bit deeper on the impact of these (and the half-dozen or so other) retailers closing locations in New Mexico, Nevada, West Virginia, Wyoming (mebbe Maine, though I’m super-biased :-), North Dakota & South Dakota. I also hope @Marketplace does a few more stories on the changing retail landscape in the U.S. over the coming months to see if there are any overt consequences to the loss of jobs and anchor stores.

If you end up tinkering with the data, drop a note in the comments if something you discover piques your interest. For those interested in potentially marrying the data up with some additional cartography, there should be enough precision in the store lists to get distinct enough lat/lon pairs after geocoding (I did a quick test with rgeocodio) to make some interesting map views, especially if you can find more store-closing lists.

I listen to @NPR throughout the day (on most days), and a story on Ohmconnect piqued my interest (it aired 5 days prior to this post). The TLDR on Ohmconnect is that it ostensibly helps you save energy by making you more aware of consumption, and it can be enabled to control various bits of IoT you have in your abode to curtail wanton power usage.

OK. So…?

Such a service requires access to (possibly many) accounts and devices to facilitate said awareness and control. Now, it’s 2017, and there’s this thing called OAuth that makes granting such access quite a bit safer than it was in the “old days”, when you pretty much had to give out your main username and password to “connect” things.
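For the unfamiliar, the OAuth2 “dance” means an integrating service only ever holds a revocable token, never your password. In R/httr terms it looks roughly like this (the app name, client id/secret and endpoint URLs are all placeholders):

library(httr)

# register the client the provider issued to the integration
app <- oauth_app("example-integration", key = "CLIENT_ID", secret = "CLIENT_SECRET")

endpoint <- oauth_endpoint(
  authorize = "https://provider.example.com/oauth/authorize",
  access = "https://provider.example.com/oauth/token"
)

# pops a browser for user consent; yields a (revocable) token
token <- oauth2.0_token(endpoint, app)

# authenticated request with the token, never your password
GET("https://provider.example.com/api/v1/devices", config(token = token))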

It — apparently — is not 2017 wherever Ohmconnect developers reside, since they ask for your credentials for every service and integration you want enabled. Don’t believe me? Take a look:

That’s just from (mostly) the non-thermostat integrations. They ask for your credentials for all services. That’s insane.

I can understand that they may need power company credentials since such industries are usually far behind the curve when it comes to internet-enablement. That doesn’t mean it’s A Good Thing to provide said credentials, but it’s a necessary evil when a service provider has no support for OAuth and you really want to use some integration to their portal.

Virtually all of the possible Ohmconnect-supported service integrations have OAuth support. Here’s a list of the ones that do/don’t:

OAuth Support:

Appears to have no OAuth Support:

  • Lennox
  • Lutron
  • Radio Thermostat (Filtrete)
  • Revolv
  • WeMo

NOTE: The ones labeled as having no OAuth support may have either commercial OAuth support or hidden OAuth support. I’ll gladly modify the post if you leave a comment with official documentation showing they have OAuth support.

On the plus side, Ohmconnect developers now have some links they can follow to learn about OAuth and fix their woefully insecure service.

Why Are Credentials Bad?

Ohmconnect has to store your credentials for other services either in the clear or in some way that’s easy for them to reverse/decode. That means when criminals breach their servers (yes, when), they’ll get access to all the credentials you’ve entered on all those sites. Even if you’re one of the few who don’t use the same password everywhere and manage credentials in an app like @1Password, it’s still a pain to change them, and you’ll be at risk during whatever the time period is between breach and detection (which can be a very long time).

In the highly unlikely event they are doing the OAuth dance in the background for you (a complete violation of OAuth principles), they still take and process (and likely store) your credentials for that transaction.

Either way, the request for and use of credentials is either (at best) a naive attempt at simplifying the user experience or (at worst) a brazen disregard for accepted norms of modern user-service integration, for non-obvious reasons.

NOTE: I say “when” above as this would be a lovely target of choice for thieves given the types of data it can collect and the demographic that’s likely to use it.

What Can You Do?

Well, if you’re a current Ohmconnect user, you can cancel your account and change all the credentials for the services you connected. Yes, I’m being serious. If you really like their service, contact customer support, provide the above links, and demand that they use OAuth instead.

You should absolutely not connect the devices/services that are on the “Appears to have no OAuth Support” list above to any third-party service if that service needs your credentials to make the connection. There’s no excuse for a cloud-based service to not support OAuth and there are plenty of choices for home/device control. Pick another brand and use it instead.

If you aren’t an Ohmconnect user, I would not sign up until they support OAuth. By defaulting to the “easy” use of username & password they are showing they really don’t take your security & privacy seriously and that means they really don’t deserve your business.

FIN

It is my firm belief that @NPR should either remove the story or issue guidance along with it in text and in audio form. They showcased this company and have all but directly encouraged listeners to use it. Such recommendations should come after much more research, especially security-focused research (they can ask for help with that from many orgs that will give them good advice for free).

In case you’re wondering, I did poke them about this on Twitter immediately after the NPR story and my initial signup attempt, but they ignored said poke.

I’m also not providing any links to them given their lax security practices.