
The intrepid @ma_salmon cranked out another blog post, remixing classical music schedule data from Radio Swiss Classic. It’s a fun post and you should read it before continuing here.

Seriously, click the link and go read it before continuing.

No, I mean it. Click the link or the rest of this makes no sense ;-)

OK, good. You finally read her post.

Now, I’m riffing off of said post here for four reasons. Three of the reasons are short, one is longer.

The first, short one is: be kind to web servers when scraping. If you ran a site and suddenly got hit with 3,000+ immediately sequential requests you might not be able to handle it depending on your server config. At a minimum add a Sys.sleep(sample(seq(0,1,0.25), 1)) before each sequential scrape and — if you can spare the time — sample(5,1) would be even better for a delay.

The second, short one is: purrr::safely() is your bff when it comes to xml2::read_html() and other network-ops. The internet is fundamentally broken. Nodes die. Pages get lost. Links rot. You have to be able to handle exceptions and if you define something like s_read_html <- safely(read_html) then when you do s_read_html("https://example.com/") the $result component will be NULL if the network request failed but will contain valid, parsed HTML if it succeeds. It is silent by default and works quite well (as we’ll see below).
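
To make that concrete, here is a tiny, hypothetical sketch (the URLs are just placeholders) of what a safely()-wrapped call looks like on success vs. failure:

library(purrr)
library(xml2)

s_read_html <- safely(read_html)

good <- s_read_html("https://example.com/")          # placeholder URL
bad  <- s_read_html("https://no-such-host.invalid/") # guaranteed to fail

str(good$error)  # NULL
str(bad$result)  # NULL
# good$result holds the parsed HTML; bad$error holds the condition object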

The third, short one is: MPGA (Make Progress-bars Great Again). dplyr::progress_estimated() can really simplify the usage of progress bars in purrr calls (drop a note in the comments if the code is confusing and I’ll add some expository).
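
A bare-bones illustration of that pattern, with Sys.sleep() standing in for real work:

library(dplyr)
library(purrr)

pb <- progress_estimated(10)

walk(1:10, ~{
  pb$tick()$print()  # advance & redraw the progress bar
  Sys.sleep(0.2)     # stand-in for a real scrape/parse step
})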

The last requires the code example for context:

library(rvest)
library(stringi)
library(lubridate)
library(tidyverse)

s_read_html <- safely(read_html)

# helper for brevity
xtract_nodes <- function(node, css) {
  html_nodes(node, css) %>% html_text(trim = TRUE)
}

get_one_day_program <- function(date=Sys.Date(),
                                base_url="http://www.radioswissclassic.ch/en/music-programme/search/%s",
                                pb=NULL) {

  if (!is.null(pb)) pb$tick()$print()

  Sys.sleep(sample(seq(0,1,0.25), 1)) # ideally, make this sample(5,1)

  date <- ymd(date) # handles case where input is character ISO date

  pg <- s_read_html(sprintf(base_url, format(date, "%Y%m%d")))

  if (!is.null(pg$result)) {

    data_frame(

      date = date,
      duration = xtract_nodes(pg$result, 'div[class="playlist"] *
                                            span[class="time hidden-xs"]') %>% hm() %>% as.numeric(),
      artist = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="titletag"]'),
      title = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="artist"]'),

      hour = purrr::map(0:23, ~{
        if (.x<23) {
          nod <- html_nodes(pg$result,
                             xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')
                                                                 and (following-sibling::div[@id='%02d'])]", .x, .x+1))
        } else {
          nod <- html_nodes(pg$result,
                            xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')]", .x))
        }
        rep(.x, length(nod))
      }) %>%
        flatten_int()

    )

  } else {
    closeAllConnections()
    NULL
  }

}

search_dates <- seq(from = ymd("2008-09-01"), to = ymd("2017-04-22"), by = "1 day")

pb <- progress_estimated(length(search_dates[1:5]))
programs_df <- map_df(search_dates[1:5], get_one_day_program, pb=pb)
programs_df
## # A tibble: 825 × 5
##          date duration                    artist                                                                         title  hour
##        <date>    <dbl>                     <chr>                                                                         <chr> <int>
## 1  2008-09-01       60   Franz Anton Hoffmeister "Andante grazioso" From Flute Quartet In A Major (After Mozart's KV 331) (CH)     0
## 2  2008-09-01      360     Johann Nepomuk Hummel                              "Rondo brillante" Op. 56 For Piano And Orchestra     0
## 3  2008-09-01     1380            Franz Schubert                       "Andante con moto" From Symphony No. 9 In C Major D 944     0
## 4  2008-09-01     2340       Camille Saint-Saëns                                       Violin Concerto No. 1 In A Major Op. 20     0
## 5  2008-09-01     3000        Alexander Scriabin                                           Nocturne In A Flat Major Op. posth.     0
## 6  2008-09-01     3180        Alexander Glazunov                                          Valse From "Scènes de ballet" Op. 52     0
## 7  2008-09-01     3540 Carl Philipp Emanuel Bach                                                           Symphony In G Major     0
## 8  2008-09-01     4200            Giuseppe Verdi                      "O Signore, dal tetto natio" From The Opera "I Lombardi"     1
## 9  2008-09-01     4440             Franz Krommer                                 Clarinet Concerto In E Flat Major Op. 36 (CH)     1
## 10 2008-09-01     5820            Georges Onslow             "Andantino molto cantabile" From Symphony No. 4 In G Major Op. 71     1
## # ... with 815 more rows

One of the reasons Maëlle created her post was to use XPath. Now, I was around when XML was defined and I have a sad, long history with the format, so XPath & I are old friends-turned-adversaries. However, there are simpler ways to target some of the nodes.

xpath="//span[@class='time hidden-xs']//text()" is ++gd XPath but it doesn’t need to be if we switch to using html_nodes() which will automatically translate CSS selectors to XPath for us. That bit of XPath turns into div[class="playlist"] * span[class="time hidden-xs"]. Why the extra selector at the beginning? Read on!

div[class="playlist"] * span[class="time hidden-xs"] actually translates to the following XPath:

selectr::css_to_xpath('div[class="playlist"] * span[class="time hidden-xs"]')
## [1] "descendant-or-self::div[@class = 'playlist']/descendant::*/descendant::span[@class = 'time hidden-xs']"

I use the parent playlist <div> because a few of the code bits in Maëlle’s post have to subtract away the last node because the XPath expression is a bit too greedy and also gets the “now playing” info vs just the “what played that day” info. It’s not strictly necessary for the time-code but it is for the artist & title. You can see that it simplifies the scraping a bit.
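
A quick, illustrative check (it assumes a pg object from a single, interactive run of the function above) shows why the anchor matters, since the un-anchored selector also picks up the “now playing” block:

# assumes `pg` is the result of s_read_html() for one day, as in the function above
length(xtract_nodes(pg$result, 'span[class="artist"]'))                          # greedy: includes "now playing"
length(xtract_nodes(pg$result, 'div[class="playlist"] * span[class="artist"]'))  # anchored: the day's playlist only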

However, we can use XPath to scrape the “hour the song played” and use it to fill the resultant data frame.

This .//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row') and (following-sibling::div[@id='%02d'])] is not the most complex XPath, but it is pretty gnarly; it also shows the power of XPath. What we’re doing in that purrr::map() call (the one said XPath lives in) is:

  • if the hour is 0:22, then get all the sibling target nodes between one <div id="hh"> and the next <div id="hh">.
  • if the hour is 23, then get all the target nodes until there are no more siblings
  • for either result, make an integer vector containing the hour repeated n times (n being the number of songs played in that hour)
  • flatten it all into one big integer vector

(also: note that whitespace is your bff as well when it comes to formatting XPath queries)

If any read_html() request is “bad”, NULL will be returned instead of a data_frame, and purrr::map_df() will simply ignore it.
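
If that behaviour is new to you, here’s a tiny, self-contained illustration (the maybe_fetch() helper is made up) of map_df() quietly dropping NULL results:

library(purrr)
library(tibble)

maybe_fetch <- function(i) if (i == 2) NULL else tibble(id = i)  # pretend element 2 "failed"

map_df(1:3, maybe_fetch)
## a 2-row tibble: the NULL from i == 2 is silently skipped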

I only did 5 scrapes since I won’t be using the data, but it’s working well on other random sequences I tried.

I tossed in a few more alternative ways to get some of the data, which you can pick up on if you compare the code bits to each other.

Drop any questions, jibes or better XPath queries (once you post an XPath query on the internet the XPath wonks — like me — come out of hiding to prey on innocent bloggers) in the comments.

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of each store closing and also want to see how they stack up against each other, then let’s make a column of 100% bars and label total stores in each:

library(hrbrthemes)
library(tidyverse)

read.table(text='store,closing,total
"Radio Shack",550,1500
"Payless",400,2600
"Rue21",400,1100
"The Limited",250,250
"bebe",180,180
"Wet Seal",170,170
"Crocs",160,560
"JCPenny",138,1000
"American Apparel",110,110
"Kmart",109,735
"hhgregg",88,220
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>% 
  as_tibble() %>% 
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>% 
  arrange(desc(stay)) %>% 
  mutate(store=factor(store, levels=store)) -> closing_df

update_geom_font_defaults(font_rc)

ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL, 
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile lareger chains such as J.C. Penny and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025)) +
  theme(legend.direction="horizontal")

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.

It’s likely you’ve seen the news regarding yet-another researcher showing off a phishing domain attack. The technique is pretty simple:

  • find a target domain you want to emulate
  • register a homoglyph version of it
  • use the hacker’s favorite tool, Let’s Encrypt, to serve it up with a nice, shiny green lock icon
  • deploy some content
  • phish someone
  • Profit!

The phishing works since Internationalized Domain Names (IDNs) have been “a thing” for a while (anything for the registrars to make more money) and Let’s Encrypt provides a domain-laundering service for these attackers. But, why should attackers have all the fun! Let’s make some domain homoglyphs in R.

Have Glyph, Will Hack

Rob Dawson has a spiffy homoglyph generator and even has a huge glyph-alike file, but we don’t need the full list to don the hacker cap for this exercise. I’ve made a stripped-down version of it that has (mostly) glyphs that should display correctly in “western” locales. You can pull the full list and tweak the example to broaden the attack capabilities. Let’s take a look:

library(stringi)
library(urltools)
library(purrr)

URL <- "https://rud.is/dl/homoglyphs.txt" # trimmed down from https://github.com/codebox/homoglyph
fil <- basename(URL)
invisible(try(httr::GET(URL, httr::write_disk(fil)), silent = TRUE))

chars <- stri_read_lines(fil)
idx_char <- stri_sub(chars, 1,1)
stri_sub(chars, 1, 1) <-  ""
chars <- set_names(chars, idx_char)

tail(chars)
##                                         u 
##          "ʋυцս\u1d1cu??????????????????" 
##                                         v 
##        "νѵט\u1d20ⅴ∨⋁v??????????????????" 
##                                         w 
##                                      "w" 
##                                         x 
##                "×хᕁᕽ᙮ⅹ⤫⤬⨯x?????????????" 
##                                         y 
## "ɣʏγуүყ\u1d8c\u1effℽy??????????????????" 
##                                         z 
##                   "\u1d22z?????????????"

What we did there was to read in the homoglyph lines and create a lookup table for Latin characters. Now we need a transformation function.

to_homoglyph <- function(domain) {

  suf <- suffix_extract(domain)
  domain <- stri_replace_last_fixed(domain, sprintf(".%s", suf$suffix[1]), "")

  domain_split <- stri_split_boundaries(domain, type="character")[[1]]

  map_chr(domain_split, ~{
    found <-  chars[.x]
    pos <- sample(stri_count_boundaries(found, type="character"), 1)
    stri_sub(found, pos, pos)
  }) %>%
    c(".", suf$suffix[1]) %>%
    stri_join(collapse="")

}

The basic idea is to:

  • carve out the domain suffix (we need to ensure valid TLDs/suffixes are used in the final domain)
  • split the input domain into separate characters
  • select a homoglyph of the character at random
  • join the separate glyphs and the TLD/suffix back together.

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐?|?.com"

Now, that’s using all possible homoglyphs and it might not look like google.com to you, but imagine whittling down the list to ones that are really close to Latin character set matches. Or, imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not really give it a second thought if the page looked fine (or if you were on a mobile browser without a location bar showing).
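
One way to do that whittling is sketched below (this is not part of the original code; it leans on ICU transliteration via stringi and keeps only the confusables that map back to the same Latin base letter):

library(stringi)
library(purrr)

# `chars` is the named lookup vector built earlier
near_latin <- map2(chars, names(chars), function(glyphs, base) {
  g <- stri_split_boundaries(glyphs, type = "character")[[1]]
  keep(g, ~ identical(stri_trans_general(.x, "Any-Latin; Latin-ASCII"), base))
})

near_latin[["o"]]  # the "o" look-alikes that survive the filter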

What’s the solution?

Firefox has a configuration setting to turn these IDNs into punycode in the location bar. What does that mean? We can use the urltools::puny_encode() function to find out:

puny_encode("ƍ၀໐?|?.com")
## [1] "xn--|-npa992hbmb6w79iesa.com"

Most folks will be much less likely to trust that domain name (if they bother looking in the location bar). Note that it will still have the “everything’s fine” green Let’s Encrypt lock icon, but you shouldn’t be trusting SSL/TLS anymore for integrity or authenticity anyway.

Chrome Canary (super early bird alpha versions) expands IDNs to punycode by default today and a shorter-cycle release to stable channel is forthcoming. I’m told Edge does somewhat sane things with IDNs and if Safari doesn’t presently handle them Apple will likely release an interstitial security update to handle it.

FIN

See if you can generate some fun look-alikes, such as ???????.com and drop some latte change to register an IDN and add a free hacking certificate to it to see just how easy this entire process is. Note that attackers are automating this process, so they may have beaten you to your favorite homoglyph IDN.

If you’re on Chrome, give the Punycode Alert extension a go if you’d like some extra notification/protection from these domains.

NOTE: to_homoglyph() is not vectorised (it’s an exercise left to the reader).
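
If you do want a vectorised flavour, a minimal (and purely illustrative) wrapper is all it takes: just map over the inputs.

library(purrr)

# one obvious wrapper; not the only way to do it
to_homoglyphs <- function(domains) {
  map_chr(domains, to_homoglyph)
}

to_homoglyphs(c("google.com", "apple.com"))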

By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data and made a quick, annual, aggregate look at this soon after the incident:

While informative, that visualization left me wanting for more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I had a quick glance at a tweet (a link to it in the comments wld be ++gd if anyone fav’d it) that had a screen capture from a PDF from the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report which can be derived from the annual index pages. I crafted the URL scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarking. But, I wrote the code and it’ll likely be useful to someone out there so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages. 
# do this separate from URL grabbing above and the PDF URL extraction
# below just to be even safer. 
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result, so ignore them and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# was built-in functionality to download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three, separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge and the PDF does account for that (to some degree), but I’m not writing a news story and only care about the airlines with three years of data since I — for the most part — have only ever flown on ones in that list. So, the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate per 10K passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data are in this gist, and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

RStudio is a great way to work through analysis tasks, and I suspect most folks use the “desktop” version of the product on their local workstations.

The fine folks at RStudio also make a server version (the RStudio codebase can build either the server or the desktop product, and the two are generally at 100% feature parity when it comes to interactive use). You only need to set it up on a compatible server and then access it via any modern web browser to get virtually the same experience you have on the desktop.

I use RStudio Server as well as RStudio Desktop and have never been comfortable mixing web browsing tasks and analysis tasks in the same browser (it’s one of the many reasons I dislike jupyter notebooks). I also keep many apps open and inevitably would try to cmd-tab (I’m on macOS) between apps to find the RStudio server one only to realize I shld have been keyboard tabbing through Chrome tabs.

Now, it’s not too hard to fire up a separate Chrome or Safari instance just for the server, but it’d be great if there were a way to make it “feel” more like an app — just like RStudio Desktop. Well, it turns out there is a way, using nativefier.

If you use the Slack standalone desktop client, the Atom text editor or a few other “modern” apps, they are pretty much just web pages wrapped in a browser shell using something like Electron. Jia Hao came up with the idea of being able to do the same thing for any web page.

To create a standalone RStudio Server client for a particular RStudio Server instance, just do the following after installing nativefier:

nativefier "https://rstudio.example.com:8787/" --name "RStudio @ Example"

Replace the URL with the one you currently use in-browser (and, please consider using SSL/TLS when connecting over the public internet) and use any name that will be identifiable to you. You get a safe, standalone application and will never have to worry about browser crashes impacting your workflow.

There are many customizations you can make to the app shell and you can even use your own icons to represent servers differently. It’s all super-simple to setup and get working.

Note that for macOS folks there has been a pre-nativefier way to do this same thing with a tool called Fluid, but it uses the Apple WebKit/Safari shell vs the Chrome shell, and I prefer the Chrome shell and the cross-platform app-making ability.

Hopefully this quick R⁶ tip will help you corral your RStudio Server connections. And, don’t forget to join in on the R⁶ bandwagon and share your own quick tips, snippets and hints to help the broader R community level-up their work.

@visualisingdata rebroadcast this tweet today:

The Google Maps interface is a bit meh and the “formatted” data is also a bit meh but the data is useful when travelling (NOTE: always use a VPN in airports on both your mobile devices and laptops).

You don’t need their app or online access to take advantage of this data if you’re willing to do a tiny bit of work.

You’ll find an R-generated version of the airport Wi-Fi finder below or via this direct link.

The premise is simple:

  • grab the airport data KML that the Google Maps interface uses (NOTE: Visit the original URL from the Tweet occasionally to see if the KML URL changes)
  • make cleaner, more formatted text for the popups
  • toss up a leaflet map
  • add a searchable DT::datatable interface
  • render the R markdown page to HTML and keep it with you
  • refresh the generated HTML right before you go on your world tour

The code is directly embedded in the generated R markdown document and also in this gist so I won’t pollute this post with code blocks.

I’m hoping @bhaskar_vk will apply his mad, l33t Leaflet h@x0r $k1llz to this crude, quick hack and kick this up a notch as there is substantial room for improvement. If you give that a go as well, drop a note in the comments or on Twitter. Some ideas for improvement:

  • better markers (please, not those airplane ones :-)
  • way better text cleanup
  • link the table and map (I think that means using Shiny but I’m likely wrong about that)
  • make it better on mobile (it works on mobile but there are different design considerations to make it more usable on tiny glowing rectangles)
  • make a “trip planner” Shiny app, letting folks select their airports and produce a handy digital reference card for them as they move about the globe

To avoid “branding” confusion with R⁴ I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to use R⁶ (I’ll figure out an acronym later): tag your posts as R⁶ (or r6) and use the moniker as you see fit.

I’ll eventually tag the 2 current “r4” posts as “r6”.

Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

Back in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017 and we’re slated to get 12-18″ of snow up here in Maine and @mrshrbrmstr asked how often this — snow in May — has occurred near us.

As with all of these “R⁶” posts, exposition is minimal and the focus is generally on demonstrating one small concept.

What I’ve done here (first) is make a full tidyverse update to the snowfirst code posted in the aforementioned blog post. You’ll need to clone that repo if you’re trying to work verbatim from the code below (otherwise just change file path code):

library(rprojroot)
library(stringi)
library(hrbrthemes)
library(tidyverse)

pre <- find_rstudio_root_file()

# Get and read in Maine precip ------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/state17_ME.txt.gz"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                                    col_names = c("coop_id", "year", "month", "element",
                                                  flatten_chr(map(1:31, ~paste("r_", c("v", "fm", "fq", "fs"),
                                                                               .x, sep=""))))),
         col_types = paste0("ciic", paste0(rep("iccc", 31), collapse=""), collapse=""),
         na = c("", "NA", "-", "-9999")) %>%
  gather(day, value, starts_with("r_v")) %>%
  select(-starts_with("r_")) %>%
  mutate(day = as.numeric(stri_replace_first_fixed(day, "r_v", ""))) %>%
  mutate(date = sprintf("%s-%02d-%02d", year, month, day)) -> daily_wx

# Read in stations ------------------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/ushcn-stations.txt"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = file.path(pre, "data", "ushcn-stations.txt"),
         col_positions = fwf_widths(c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                                    col_names = c("coop_id", "latitude", "longitude",
                                                  "elevation", "state", "name",
                                                  "component_1", "component_2",
                                                  "component_3", "utc_offset")),
         col_types = "cdddcccccc") -> stations

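# find the station closest to a target lat/lon using naive Euclidean distance on
# degrees (good enough for picking a nearby station), optionally restricting the
# candidates to a single state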
closestStation <- function(stations, lat, lon, restrict_to = NULL) {
  if (!is.null(restrict_to)) stations <- filter(stations, state == restrict_to)
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                            (stations$longitude-lon)^2))
  stations[index,]
}

# compute total snow amounts per month ----------------------------------------------

(near_me <- closestStation(stations, 43.2672, -70.8617, restrict_to="ME"))

Now that we have the data, the short lesson here is just exposing the fact that you can get blank facets for free with ggplot2. I’m pointing this out as many folks seem to not like reading R documentation or miss things in said documentation (in fact, I had to be instructed today by @thomasp85 about a ggplot2 theme element setting that I didn’t know about and should have since I do try to keep up).

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(
    n = n / 10, # readings are in 10ths of inches
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = lubridate::month(date, TRUE, FALSE)
  ) %>%
  ggplot(aes(x=date, y=n)) +
  geom_segment(aes(xend=date, yend=0), size=0.75, color="#9ecae1") +
  scale_y_continuous(limits=c(0, 65)) +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free") +
  labs(x=NULL, y="inches", title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="Y", axis_text_size=8)

Without ggplot2 helping us out we would have had to do some work to make those no-value facets show up. I also like how there are no x-axis labels since there’s no data. ggplot2::facet_wrap() has many, very granular options for customizing the appearance of facets:

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
           shrink = TRUE, labeller = "label_value", as.table = TRUE,
           switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

If you haven’t played with them, you can use this example to try them out.
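
For example, here’s a (purely illustrative) variant of the call used above that exercises a few of those options at once:

facet_wrap(~month_name, ncol = 4, drop = FALSE, scales = "free_y",
           strip.position = "bottom", dir = "v")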

Fin

Even though that visualization gets the message across, I kinda like this view a bit better:

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(n = n / 10) %>%
  complete(year, month=1:12) %>%
  mutate(
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = factor(lubridate::month(date, TRUE, FALSE), levels=rev(month.name))
  ) %>% 
  ggplot(aes(year, month_name)) +
  geom_tile(aes(fill=n), color="#b2b2b2", size=0.15) +
  scale_x_continuous(expand=c(0,0.15), position="top") +
  viridis::scale_fill_viridis(name = "Total inches", na.value="white") +
  labs(x=NULL, y=NULL, title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="", axis_text_size = 10) +
  guides(fill=guide_colourbar(label.position = "top", direction = "horizontal", title.vjust = 0)) +
  theme(legend.title = element_text(size=10)) +
  theme(legend.key.height = unit(0.5, "lines")) +
  theme(legend.position = c(0.9, 1.25))

The precision is lacking in the heatmap view, but you get a quick impression of when it has/hasn’t snowed. Plus you get to use viridis.

All the updated code is in the snowfirst repo.

Crank out your own small code snippets or ideas for the R community. R⁶ is an open tag and perhaps we can band together to make a distributed cadre of helpful, digestible posts the R community can benefit from.