hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Travis-CI Flaw Exposed Some ‘Secure’ Environment Variable Contents

Tagging this as #rstats-related since many R coders use Travis-CI to automate package builds (and other things). Security researcher Ivan Vyshnevskyi did some ++gd responsible disclosure to the Travis-CI folks letting them know they were leaking the contents of “secure” environment variables in the build logs.

The TL;DR on “secure” environment variables is that they let you store secrets — such as OAuth keys or API tokens — ostensibly “securely” (they have to be decrypted to be used so someone/something has they keys to do that so it’s not really “secure”). That is, they should not leak them in build logs. Except that they did…for a bit.

As mentioned, this flaw was reported and is now fixed. Regen your “secrets” and keep an eye on Travis security announcements moving forward.

Plot the Vote: Making U.S. Senate & House Cartograms in R

Political machinations are a tad insane in the U.S. these days & I regularly hit up @ProPublica & @GovTrack sites (& sub to the GovTrack e-mail updates) as I try to be an informed citizen, especially since I’ve got a Senator and Representative who seem to be in the sway of ?.

I’ve always appreciated the ProPublica and GovTrack cartograms as they present a great deal of information in a compact space (especially the House versions). Something nudged me into starting an R package to let folks create them in R (mainly with ggplot2 but an htmlwidget version is planned), which I’ve dubbed voteogram.

With the voteogram package, you can:

pull ProPublica roll call vote data for the 101st Congress up through today (via roll_call())
plot ProPublica-esque Senate roll call vote cartograms
plot ProPublica-esque House roll call vote cartograms
plot GovTrack-esque House roll call vote cartograms

GovTrack uses — what I’ve seen @thosjleeper refer to as — a “parliamentary plot” for their version of the Senate roll call cartogram and sir Leeper already has that type of plot covered in ggparliament, so I’ve just focused on the other ones here.

Roll Call

You need data for these cartogram generation functions and you can specify your own populated data frame (the needed columns are in the manual pages for the cartogram plotters). However, you’ll likely want to plot existing data that others have tallied and ProPublica makes that super simple since each vote is in a standalone JSON file. All you have to do is specify whether you want the roll call vote for the house or senate, the Congress number (current one is 115), the session number (current one is 1) and the roll call vote number.

For example, we can see all the ~~idiots~~ Representatives who voted, recently, to ~~kill people~~ repeal the ACA with the following function call:

(h256 <- roll_call("house", 115, 1, 256))
## 115th Congress / Session: 1 / House Roll Call: 256 / May  4, 2017
## 
## American Health Care Act
## 
## Result: Passed

str(h256, max.level = 1)
## List of 29
##  $ vote_id              : chr "H_115_1_256"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 256
##  $ needed_to_pass       : int 216
##  $ date_of_vote         : chr "May  4, 2017"
##  $ time_of_vote         : chr "02:18 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "RECORDED VOTE"
##  $ question             : chr "On Passage"
##  $ description          : chr "American Health Care Act"
##  $ nyt_title            : chr "On Passage"
##  $ total_yes            : int 217
##  $ total_no             : int 213
##  $ total_not_voting     : int 1
##  $ gop_yes              : int 217
##  $ gop_no               : int 20
##  $ gop_not_voting       : int 1
##  $ dem_yes              : int 0
##  $ dem_no               : int 193
##  $ dem_not_voting       : int 0
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "No"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##  - attr(*, "class")= chr [1:2] "pprc" "list"

As you can see, it has a custom print function and the usable data (for cartographic needs) is in $votes. You can go to town with just that information, making bar charts or tracking individual Congress-critter votes.

Do your best to cache this data as you retrieve it. ProPublica is a non-profit and the JSON files are on AWS. While there’s a certain number of free bits of bandwidth-per-month allotted buy Amazon’s S3 service, best to make sure you’re not tipping them over on any given month. Plus, the vote data doesn’t change once it’s recorded. Consider donating to them if you decided to always grab fresh copies.

There’s a fortify function for this object (it’s classed pprc) so you can pass it right into ggplot() for use or pipe it into a dplyr chain for aggregation & filtering.

House Rules

With the data in hand, we can make some cartograms (the real purpose of the package). I riffed off the ProPublica colors (and haven’t fully finished copying them yet as I need to search for 2 more categories of Independent voting colors) but you can replace them with anything you want. Just reset the scale and use the names in the exposed color value vectors.

There’s also a theme_voteogram() which is designed to augment any base theme (like hrbrthemes::theme_ipsum_rc()) (it’s much like ggthemes::theme_map()).

Here’s the ProPublica view for that particular vote:

house_carto(rep) +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

The house_carto() function defaults to the ProPublica cartogram, but you can easily change that:

house_carto(rep, "gt") +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

Senate Drools

Again, the senate_carto() function only has the ProPublica-esque cartogram available and works pretty much the same way after getting the Senate vote data:

sen <- roll_call("senate", 115, 1, 110)

senate_carto(sen) +
  labs(title="Senate Vote 110 - Invokes Cloture on Neil Gorsuch Nomination") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

FIN

There’s a bit of work left to do in the package (including an htmlwidget version). You’re invited to file PRs or Issues as you are so moved.

Scrapeover Friday — a.k.a. Another R Scraping Makeover

I caught a glimpse of a tweet by @dataandme on Friday:

Using R & rvest to explore Malaysian property mkt: "Web Scraping: The Sequel, Propwall.my" https://t.co/daZOOJJfPN #rstats #rvest pic.twitter.com/u6QMhm4M3e

— Mara Averick (@dataandme) May 5, 2017

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics. The code in this post likely worked at the time it was posted and accomplished the desired goal (which means it was ultimately decent code). Said practice will ultimately help both you and others.

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule) which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to figure out how many pages we need to scrape from information in the first set of results.

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately calls (it appears this site has undergone some changes since the blog post and it’s unlikely the code in the post actually works now).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {
  
  if (!is.null(pb)) pb$tick()$print()
  
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php", 
       body = list(action = "star_property_classified_list_change_ajax", 
                   tab = "Most Relevance", 
                   page = as.integer(i), location = "Mont Kiara", 
                   category = "", listing = "For Sale", 
                   price = "", keywords = "Mont Kiara, Kuala Lumpur", 
                   filter_id = "17", filter_type = "Location", 
                   furnishing = "", builtup = "", 
                   tenure = "", view = "list", 
                   map = "on", blurb = "0"), 
       encode = "form") -> res
  
  stop_for_status(res)
  
  res <- content(res, as="parsed") 
  
  Sys.sleep(sample(seq(0,2,0.5), 1))
  
  res
  
}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser Developer Tools view. You can either transcribe it by hand or use the curlconverter package (which is temporarily off CRAN so you’ll need to get it from github) to auto-convert it to an httr::VERB request.

We also add a parameter (default to NULL) to support the use of a progress bar (so we can see what’s going on). If we pass in a populated dplyr progress bar, this will tick it down for us.

Now, we can use that to get the total number of listings.

get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>% 
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>% 
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>% 
  as.numeric() -> classified_ct

total_pages <- 1 + (classified_ct %/% 20)

We’ll setup another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  data_frame(
    link = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)" ) %>%  html_attr("href"),
    description = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)" ) %>% html_text(trim = TRUE)  
  )
}

Rather than chain calls to html_nodes() we take advantage of well-formed CSS selectors (which ultimately gets auto-translated to XPath strings). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity.

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

add a delay when you sequentially scrape pages from a site
abstract out common operations into functions
take advantage of purrr functions (or built-in *apply functions) to avoid for loops

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for ToS/ToC on the site suggested that scraping, automated grabbing or use of the underlying data in bulk was prohibited. Many sites have such restrictions — like IMDB (I mention that as it’s been used alot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for ToS such violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided just on those grounds. When I come across sites I need information from that have restrictive ToS I contact the site owner (when I can find them) and ask them for permission and have only been refused a small handful of times. Given those recent legal actions, it’s also to better be safe than sorry.

R⁶ — Using pandoc from R + A Neat Package For Reading Subtitles

Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. “minimal expository, keen focus”) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this:

to:

Does pandoc work?
=================

Simple document with **bold** and *italics*.

This is definitely a job that pandoc can handle.

pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.

Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and, would not be possible without — Haskell.

Doing the aforementioned conversion from docx to markdown is super-simple from R:

rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")

Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at pandoc.org to see the power available at your command.

Just One More Thing

This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist :-)

There’s a neat, non-on-CRAN package by François Keck called subtools — https://github.com/fkeck/subtools which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.

There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is python and I’ve never had time to fully replicate it in R.

Here’s a snippet (sans expository) that can get you started pulling in subtitles into R and tidytext. I would have written scraper code but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write the code. So, I grabbed the first season of “The Flash” and use the Bing sentiment lexicon from tidytext to see how the season looked.

The overall scoring for a given episode is naive and can definitely be improved upon.

Definitely drop a link to anything you create in the comments!

# devtools::install_github("fkeck/subtools")

library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

pb <- progress_estimated(length(fils))

map_df(1:length(fils), ~{

  pb$tick()$print()

  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)

}) %>% as_tibble() -> season_sentiments


count(season_sentiments, ep, sentiment) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1)))

Decomposing Composers with R

The intrepid @ma_salmon cranked out another blog post, remixing classical music schedule data from Radio Swiss Classic. It’s a fun post and you should read it before continuing here.

Seriously, click the link and go read it before continuing.

No, I mean it. Click the link or the rest of this makes no sense ;-)

OK, good. You finally read her ? post.

Now, I’m riffing off of said post here for four reasons. Three of the reasons are short, one is longer.

The first, short one is: be kind to web servers when scraping. If you ran a site and suddenly got hit with 3,000+ immediately sequential requests you might not be able to handle it depending on your server config. At a minimum add a Sys.sleep(sample(seq(0,1,0.25), 1)) before each sequential scrape and — if you can spare the time — sample(5,1) would be even better for a delay.

The second, short one is: purrr::safely() is your bff when it comes to xml2::read_html() and other network-ops. The internet is fundamentally broken. Nodes die. Pages get lost. Links rot. You have to be able to handle exceptions and if you define something like s_read_html <- safely(read_html) then when you do s_read_html("https://example.com/") the $result component will be NULL if the network request failed but will contain valid, parsed HTML if it succeeds. It is silent by default and works quite well (as we’ll see below).

The third, short one is: MPGA (Make Progress-bars Great Again). dplyr::progress_estimated() can really simplify the usage of progress bars in purrr calls (drop a note in the comments if the code is confusing and I’ll add some expository).

The last requires the code example for context:

library(rvest)
library(stringi)
library(lubridate)
library(tidyverse)

s_read_html <- safely(read_html)

# helper for brevity
xtract_nodes <- function(node, css) {
  html_nodes(node, css) %>% html_text(trim = TRUE)
}

get_one_day_program <- function(date=Sys.Date(),
                                base_url="http://www.radioswissclassic.ch/en/music-programme/search/%s",
                                pb=NULL) {

  if (!is.null(pb)) pb$tick()$print()

  Sys.sleep(sample(seq(0,1,0.25), 1)) # ideally, make this sample(5,1)

  date <- ymd(date) # handles case where input is character ISO date

  pg <- s_read_html(sprintf(base_url, format(date, "%Y%m%d")))

  if (!is.null(pg$result)) {

    data_frame(

      date = date,
      duration = xtract_nodes(pg$result, 'div[class="playlist"] *
                                            span[class="time hidden-xs"]') %>% hm() %>% as.numeric(),
      artist = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="titletag"]'),
      title = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="artist"]'),

      hour = purrr::map(0:23, ~{
        if (.x<23) {
          nod <- html_nodes(pg$result,
                             xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')
                                                                 and (following-sibling::div[@id='%02d'])]", .x, .x+1))
        } else {
          nod <- html_nodes(pg$result,
                            xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')]", .x))
        }
        rep(.x, length(nod))
      }) %>%
        flatten_int()

    )

  } else {
    closeAllConnections()
    NULL
  }

}

search_dates <- seq(from = ymd("2008-09-01"), to = ymd("2017-04-22"), by = "1 day")

pb <- progress_estimated(length(search_dates[1:5]))
programs_df <- map_df(search_dates[1:5], get_one_day_program, pb=pb)
programs_df
## # A tibble: 825 × 5
##          date duration                    artist                                                                         title  hour
##        <date>    <dbl>                     <chr>                                                                         <chr> <int>
## 1  2008-09-01       60   Franz Anton Hoffmeister "Andante grazioso" From Flute Quartet In A Major (After Mozart's KV 331) (CH)     0
## 2  2008-09-01      360     Johann Nepomuk Hummel                              "Rondo brillante" Op. 56 For Piano And Orchestra     0
## 3  2008-09-01     1380            Franz Schubert                       "Andante con moto" From Symphony No. 9 In C Major D 944     0
## 4  2008-09-01     2340       Camille Saint-Saëns                                       Violin Concerto No. 1 In A Major Op. 20     0
## 5  2008-09-01     3000        Alexander Scriabin                                           Nocturne In A Flat Major Op. posth.     0
## 6  2008-09-01     3180        Alexander Glazunov                                          Valse From "Scènes de ballet" Op. 52     0
## 7  2008-09-01     3540 Carl Philipp Emanuel Bach                                                           Symphony In G Major     0
## 8  2008-09-01     4200            Giuseppe Verdi                      "O Signore, dal tetto natio" From The Opera "I Lombardi"     1
## 9  2008-09-01     4440             Franz Krommer                                 Clarinet Concerto In E Flat Major Op. 36 (CH)     1
## 10 2008-09-01     5820            Georges Onslow             "Andantino molto cantabile" From Symphony No. 4 In G Major Op. 71     1
## # ... with 815 more rows

One of the reasons Maëlle created her post was to use XPath. Now, I was around when XML was defined and I have a sad, long history with the format, so XPath & I are old ~~friends~~ adversaries. However, there are simpler ways to target some of the nodes.

xpath="//span[@class='time hidden-xs']//text()" is ++gd XPath but it doesn’t need to be if we switch to using html_nodes() which will automatically translate CSS selectors to XPath for us. That bit of XPath turns into div[class="playlist"] * span[class="time hidden-xs"]. Why the extra selector at the beginning? Read on!

div[class="playlist"] * span[class="time hidden-xs"] actually translates to the following XPath:

selectr::css_to_xpath('div[class="playlist"] * span[class="time hidden-xs"]')
## [1] "descendant-or-self::div[@class = 'playlist']/descendant::*/descendant::span[@class = 'time hidden-xs']"

I use the parent playlist <div> because a few of the code bits in Maëlle’s post have to subtract away the last node because the XPath expression is a bit too greedy and also gets the “now playing” info vs just the “what played that day” info. It’s not strictly necessary for the time-code but it is for the artist & title. You can see that it simplifies the scraping a bit.

However, we can use XPath for to scrape the “hour the song played” and use it to fill the resultant data frame.

This .//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row') and (following-sibling::div[@id='%02d'])] is not the most complex XPath but it is pretty gnarly, yet it also shows the power of XPath. What we’re doing in that purrr::map() call (which said XPath is in) is:

if the hour is 0:22, then use get all the sibling target nodes between one <div id="hh"> and the next <div id="hh">.
if the hour is 23, then get all the target nodes until there are no sibling
for either result, make an integer vector containing the hour repeated n times (n being the length of the number of songs played in the hour)
flatten it all into one big integer vector

(also: note that whitespace is your bff as well when it comes to formatting XPath queries)

If any read_html request is “bad” NULL will be returned instead of a data_frame, which purrr::map_df() will ignore.

I only did 5 scrapes since I won’t be using the data, but it’s working well on other random sequences I tried.

I tossed in a few more alternative ways to get some of the data, which you can pick up on if you compare the each code bits to each other.

Drop any questions, jibes or better XPath queries (once you post an XPath query on the internet the XPath wonks — like me ? — come out of hiding to prey on innocent bloggers) in the comments.

Shuttering Pies With Retiring Stores

2017-04-21 – 22:32
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (7)

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of each store closing and also want to see how they stack up against each other, then let’s make a column of 100% bars and label total stores in each:

library(hrbrthemes)
library(tidyverse)

read.table(text='store,closing,total
"Radio Shack",550,1500
"Payless",400,2600
"Rue21",400,1100
"The Limited",250,250
"bebe",180,180
"Wet Seal",170,170
"Crocs",160,560
"JCPenny",138,1000
"American Apparel",110,110
"Kmart",109,735
"hhgregg",88,220
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>% 
  as_tibble() %>% 
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>% 
  arrange(desc(stay)) %>% 
  mutate(store=factor(store, levels=store)) -> closing_df

update_geom_font_defaults(font_rc)

ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL, 
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile lareger chains such as J.C. Penny and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025)) +
  theme(legend.direction="horizontal")

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.

When Homoglyphs Attack! Generating Phishing Domain Names with R

It’s likely you’ve seen the news regarding yet-another researcher showing off a phishing domain attack. The technique is pretty simple:

find a target domain you want to emulate
register a homoglpyh version of it
use ~~the hacker’s favorite tool,~~ Let’s Encrypt to serve it up with a nice, shiny green lock icon
deploy some content
phish someone
Profit!

The phishing works since International Domain Names have been “a thing” for a while (anything for the registrars to make more money) and Let’s Encrypt provides a domain-laundering service for these attackers. But, why should attackers have all the fun! Let’s make some domain homoglyphs in R.

Have Glyph, Will Hack

Rob Dawson has a spiffy homoglyph generator and even has a huge glyph-alike file, but we don’t need the full list to don the hacker cap for this exercise. I’ve made a stripped-down version of it that has (mostly) glyphs that should display correctly in “western” locales. You can pull the full list and tweak the example to broaden the attack capabilities. Let’s take a look:

library(stringi)
library(urltools)
library(purrr)

URL <- "https://rud.is/dl/homoglyphs.txt" # trimmed down from https://github.com/codebox/homoglyph
fil <- basename(URL)
invisible(try(httr::GET(URL, httr::write_disk(fil)), silent = TRUE))

chars <- stri_read_lines(fil)
idx_char <- stri_sub(chars, 1,1)
stri_sub(chars, 1, 1) <-  ""
chars <- set_names(chars, idx_char)

tail(chars)
##                                         u 
##          "ʋυцս\u1d1cｕ??????????????????" 
##                                         v 
##        "νѵט\u1d20ⅴ∨⋁ｖ??????????????????" 
##                                         w 
##                                      "ｗ" 
##                                         x 
##                "×хᕁᕽ᙮ⅹ⤫⤬⨯ｘ?????????????" 
##                                         y 
## "ɣʏγуүყ\u1d8c\u1effℽｙ??????????????????" 
##                                         z 
##                   "\u1d22ｚ?????????????"

What we did there was to read in the homoglpyh lines and create a lookup table for Latin characters. Now we need a transformation function.

to_homoglyph <- function(domain) {

  suf <- suffix_extract(domain)
  domain <- stri_replace_last_fixed(domain, sprintf(".%s", suf$suffix[1]), "")

  domain_split <- stri_split_boundaries(domain, type="character")[[1]]

  map_chr(domain_split, ~{
    found <-  chars[.x]
    pos <- sample(stri_count_boundaries(found, type="character"), 1)
    stri_sub(found, pos, pos)
  }) %>%
    c(".", suf$suffix[1]) %>%
    stri_join(collapse="")

}

The basic idea is to:

carve out the domain suffix (we need to ensure valid TLDs/suffixes are used in the final domain)
split the input domain into separate characters
select a homoglyph of the character at random
join the separate glpyhs and the TLD/suffix back together.

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐?|?.com"

Now, that’s using all possible homoglyphs and it might not look like google.com to you, but imagine whittling down the list to ones that are really close to Latin character set matches. Or, imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not really give it a second thought if the page looked fine (or were on a mobile browser without a location bar showing).

What’s the solution?

Firefox has a configuration setting to turn these IDNs into punycode in the location bar. What does that mean? We can use the urltools::puny_encode() function to find out:

puny_encode("ƍ၀໐?|?.com")
## [1] "xn--|-npa992hbmb6w79iesa.com"

Most folks will be much less likely to trust that domain name (if they bother looking in the location bar). Note that it will still have the “everything’s ?” green Let’s Encrypt lock icon, but you shouldn’t be trusting SSL/TLS anymore for integrity or authenticity anyway.

Chrome Canary (super early bird alpha versions) expands IDNs to punycode by default today and a shorter-cycle release to stable channel is forthcoming. I’m told Edge does somewhat sane things with IDNs and if Safari doesn’t presently handle them Apple will likely release an interstitial security update to handle it.

FIN

See if you can generate some fun look-alike’s, such as ???????.com and drop some latte change to register an IDN and add a free hacking certificate to it to see just how easy this entire process is. Note that attackers are automating this process, so they may have beat you to your favorite homoglyph IDN.

If you’re on Chrome, give the Punycode Alert extension a go if you’d like some extra notification/protection from these domains.

NOTE: to_homoglyph() is not vectorised (it’s an exercise left to the reader).

Come Fly With Me (well, not really) — Comparing Involuntary Disembarking Rates Across U.S. Airlines in R

2017-04-13 – 06:56
Posted in data wrangling, ggplot, R, web scraping
Tagged post
Leave a Comment

By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the #rstats universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data made a quick, annual, aggregate look at this soon after the incident:

Overall annual boarding denial % is small but there were more involuntary denials than I expected ?✈️

Data: https://t.co/PCtK3Ythgp pic.twitter.com/vnl7Sk5qpI

— Боб Рудіс (@hrbrmstr) April 11, 2017

While informative, that visualization left me wanting for more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I had a quick glance at a tweet (a link to it in the comments wld be ++gd if anyone fav’d it) that had a screen capture from a PDF from the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report which can be derived from the annual index pages. I crafted the URL scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarking. But, I wrote the code and it’ll likely be useful to someone out there so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages. 
# do this separate from URL grabbing above and the PDF URL extraction
# below just to be even safer. 
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result so ignor ethem and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# was built-in functionality to download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three, separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge and the PDF does account for that (to some degree) but I’m not writing a news story and only care about the airlines with three years of data since I — for the most part — have only ever flown on ones in that list, so the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate by 10k passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data is in this gist and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.