
Category Archives: R

R 4.0.0 has been out for a while now, and — apart from a case where merge() was slower than dirt — it’s been really stable, at least for me (I use it daily on macOS, Linux, and Windows). Sure, it came with some headline-grabbing features/upgrades, but I’ve started looking at what other useful nuggets might be in the changelog and decided to blog them as I find them.

Today’s nugget is the venerable stopifnot() function which was significantly enhanced by this PR by Neil Fultz.

Prior to R 4.0.0, if you wanted to use stopifnot() to perform some input validation (a.k.a. — in this case — [assertions](https://en.wikipedia.org/wiki/Assertion_(software_development))) you’d do something like this (I’m borrowing from Neil’s example):

some_ƒ <- function(alpha, gradtol, steptol, iterlim) {

  stopifnot(
    (is.numeric(alpha)),
    (length(alpha) == 1),
    (alpha > 0),
    (alpha < 1),
    (is.numeric(gradtol)),
    (length(gradtol) == 1),
    (gradtol > 0),
    (is.numeric(steptol)),
    (length(steptol) == 1),
    (steptol > 0),
    (is.numeric(iterlim)),
    (length(iterlim) == 1),
    (iterlim > 0)
   )

  message("Do something awesome")

}

When run with acceptable inputs we get:

some_ƒ(0.5, 3, 10, 100)
## Do something awesome

But, when run with something out of kilter:

some_ƒ("a", 3, 10, 100)
##  Error in some_ƒ("a", 3, 10, 100) : (is.numeric(alpha)) is not TRUE

we get a semi-useful but somewhat unfriendly message back. Sure, it points to the right expression, but we’re supposed to be the kinder, friendlier data science (and general purpose) language that cares a bit more about its users. To that end, many folks switch to doing something like this:

some_ƒ <- function(alpha, gradtol, steptol, iterlim) {

  if (!is.numeric(alpha))   { stop('Error: alpha should be numeric') }
  if (length(alpha) != 1)   { stop('Error: alpha should be a single value') }
  if (alpha < 0)            { stop('Error: alpha is negative') }
  if (alpha > 1)            { stop('Error: alpha is greater than one') }
  if (!is.numeric(gradtol)) { stop('Error: gradtol should be numeric') }
  if (length(gradtol) != 1) { stop('Error: gradtol should be a single value') }
  if (gradtol <= 0)         { stop('Error: gradtol should be positive') }
  if (!is.numeric(steptol)) { stop('Error: steptol should be numeric') }
  if (length(steptol) != 1) { stop('Error: steptol should be a single value') }
  if (steptol <= 0)         { stop('Error: steptol should be positive') }
  if (!is.numeric(iterlim)) { stop('Error: iterlim should be numeric') }
  if (length(iterlim) != 1) { stop('Error: iterlim should be a single value') }
  if (iterlim <= 0)         { stop('Error: iterlim should be positive') }

  message("Do something awesome")

}

which results in:

some_ƒ("a", 3, 10, 100)
##  Error in some_ƒ("a", 3, 10, 100) : Error: alpha should be numeric

(you can make even better error messages than that).
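For instance (a quick sketch, not something from Neil’s PR), you can echo back what the caller actually passed so they don’t have to guess:

# sketch: a friendlier check that reports the offending value and its class
if (!is.numeric(alpha)) {
  stop(
    sprintf(
      "`alpha` must be a number between 0 and 1, not a <%s> (got: %s)",
      class(alpha)[1], deparse(alpha)
    ),
    call. = FALSE # don't echo the whole call back at the user
  )
}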

Neil thought there had to be a better way, and made one! The ... expressions can be named and those names will become the error message:

some_ƒ <- function(alpha, gradtol, steptol, iterlim) {

  stopifnot(
    'alpha should be numeric'          = (is.numeric(alpha)),
    'alpha should be a single value'   = (length(alpha) == 1),
    'alpha is negative'                = (alpha > 0),
    'alpha is greater than one'        = (alpha < 1),
    'gradtol should be numeric'        = (is.numeric(gradtol)),
    'gradtol should be a single value' = (length(gradtol) == 1),
    'gradtol should be positive'       = (gradtol > 0),
    'steptol should be numeric'        = (is.numeric(steptol)),
    'steptol should be a single value' = (length(steptol) == 1),
    'steptol should be positive'       = (steptol > 0),
    'iterlim should be numeric'        = (is.numeric(iterlim)),
    'iterlim should be a single value' = (length(iterlim) == 1),
    'iterlim should be positive'       = (iterlim > 0)
   )

  message("Do something awesome")

}

some_ƒ("a", 3, 10, 100)

##  Error in some_ƒ("a", 3, 10, 100) : alpha should be numeric

Way easier to write and way more respectful to the caller.

Gratuitous Statistics

CRAN has ~2,600 packages that use stopifnot() in their package /R/ code with the following selected distributions (charts are all log10 scale):

stopifnot usage: files using it per package

Here are the packages with 50 or more files using stopifnot():

   pkg              n
   <chr>        <int>
 1 spatstat       252
 2 pracma         145
 3 QuACN           80
 4 raster          74
 5 spdep           61
 6 lavaan          54
 7 surveillance    53
 8 copula          50

stopifnot calls per-file

Here are the packages with one or more files that have 100 or more calls to stopifnot() in them:

   pkg                 fil                            ct
   <chr>               <chr>                       <int>
 1 ff                  ordermerge.R                  278
 2 OneArmPhaseTwoStudy zzz.R                         142
 3 bit64               integer64.R                   137
 4 updog               rflexdog.R                    124
 5 RNetCDF             RNetCDF.R                     123
 6 Rlda                rlda.R                        105
 7 aster2              transform.R                   105
 8 ads                 fads.R                        104
 9 georob              georob_exported_functions.R   104
10 bit64               highlevel64.R                 101

O_O That’s quite a bit of checking!

FIN

If you’re working on switching to R 4.0.0 or have switched, this and many other new features await! Drop a note in the comments with your favorite new feature (or, even better, a link to a blog post on said feature!).

As I get time to dig out some more nuggets I’ll add more posts to this series.

I caught this tweet by Terence Eden about using Twitter image alt-text to “PGP sign” tweets and my mind immediately went to “how can I abuse this for covert communications, malicious command-and-control, and embedding R code in tweets?”.

When you paste or upload an image to a tweet (in the web interface, at least) you have an opportunity to add “alt” text which is — in theory — supposed to help communicate the content of the image to folks using assistive technology. Terence figured out the alt-text limit on Twitter is large (~1,000 characters), which is plenty of room for useful R code.

I poked around for something to use as an example and settled on using data from COVID Stimulus Watch. The following makes the chart in this tweet — https://twitter.com/hrbrmstr/status/1261641887603179520.

I’m not posting the chart here b/c it’s nothing special, but the code for it is below.

library(hrbrthemes);

x <- read.csv("https://data.covidstimuluswatch.org/prog.php?&detail=export_csv")[,3:5];

x[,3] <- as.numeric(gsub("[$,]","",x[,3]));
x <- x[(x[,1]>20200400)&x[,3]>0,];
x[,1] <- as.Date(as.character(x[,1]),"%Y%m%d");

ggplot(x, aes(Award.Date, Grant.Amount, fill=Award.Type)) +
  geom_col() +
  scale_y_comma(
    labels = c("$0", "$5bn", "$10bn", "$15bn")
  ) +
  labs(
    title = "COVID Stimulus Watch: Grants",
    caption = "Source: https://data.covidstimuluswatch.org/prog.php?detail=opening"
  ) +
  theme_ipsum_es(grid="XY")

Semicolons are necessary b/c newlines are going to get stripped when we paste that code block into the alt-text entry box.

We can read that code back into R with some help from read_html() & {styler}:

library(rtweet)
library(rvest)
library(stringi)
library(purrr)    # for keep()
library(magrittr)

pg <- read_html("https://twitter.com/hrbrmstr/status/1261641887603179520")

html_nodes(pg, "img") %>% 
  html_attr("alt") %>% 
  keep(stri_detect_fixed, "library") %>% 
  styler::style_text()
library(hrbrthemes)
x <- read.csv("https://data.covidstimuluswatch.org/prog.php?&detail=export_csv")[, 3:5]
x[, 3] <- as.numeric(gsub("[$,]", "", x[, 3]))
x <- x[(x[, 1] > 20200400) & x[, 3] > 0, ]
x[, 1] <- as.Date(as.character(x[, 1]), "%Y%m%d")
ggplot(x, aes(Award.Date, Grant.Amount, fill = Award.Type)) +
  geom_col() +
  scale_y_comma(
    labels = c("$0", "$5bn", "$10bn", "$15bn")
  ) +
  labs(
    title = "COVID Stimulus Watch: Grants",
    caption = "Source: https://data.covidstimuluswatch.org/prog.php?detail=opening"
  ) +
  theme_ipsum_es(grid = "XY")

Twitter’s API does not seem to return alt-text (see UPDATE):

rtweet::lookup_statuses("1261641887603179520") %>% 
  jsonlite::toJSON(pretty=TRUE)
## [
##   {
##     "user_id": "5685812",
##     "status_id": "1261641887603179520",
##     "created_at": "2020-05-16 12:57:20",
##     "screen_name": "hrbrmstr",
##     "text": "Twitter's img alt-text limit is YUGE! So, we can abuse it for semi-covert comms channels, C2, or for \"embedding\" the code ## that makes this chart!\n\nUse `read_html()` on URL of this tweet; find 'img' nodes w/html_nodes(); extract 'alt' attr text w/## html_attr(). #rstats \n\nh/t @edent https://t.co/v5Ut8TzlRO",
##     "source": "Twitter Web App",
##     "display_text_width": 278,
##     "is_quote": false,
##     "is_retweet": false,
##     "favorite_count": 8,
##     "retweet_count": 2,
##     "hashtags": ["rstats"],
##     "symbols": [null],
##     "urls_url": [null],
##     "urls_t.co": [null],
##     "urls_expanded_url": [null],
##     "media_url": ["http://pbs.twimg.com/media/EYI_W-xWsAAZFeP.png"],
##     "media_t.co": ["https://t.co/v5Ut8TzlRO"],
##     "media_expanded_url": ["https://twitter.com/hrbrmstr/status/1261641887603179520/photo/1"],
##     "media_type": ["photo"],
##     "ext_media_url": ["http://pbs.twimg.com/media/EYI_W-xWsAAZFeP.png"],
##     "ext_media_t.co": ["https://t.co/v5Ut8TzlRO"],
##     "ext_media_expanded_url": ["https://twitter.com/hrbrmstr/status/1261641887603179520/photo/1"],
##     "mentions_user_id": ["14054507"],
##     "mentions_screen_name": ["edent"],
##     "lang": "en",
##     "geo_coords": ["NA", "NA"],
##     "coords_coords": ["NA", "NA"],
##     "bbox_coords": ["NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"],
##     "status_url": "https://twitter.com/hrbrmstr/status/1261641887603179520",
##     "name": "boB • Everywhere is Baltimore • Rudis",
##     "location": "Doors & Corners",
##     "description": "Don't look at me…I do what he does—just slower. 🇷 #rstats avuncular • pampa • #tired • 👨‍🍳 • ✝️ • Prìomh ## Neach-saidheans Dàta @ @rapid7",
##     "url": "https://t.co/RgY1wHjoqM",
##     "protected": false,
##     "followers_count": 11886,
##     "friends_count": 458,
##     "listed_count": 667,
##     "statuses_count": 84655,
##     "favourites_count": 15140,
##     "account_created_at": "2007-05-01 14:04:24",
##     "verified": true,
##     "profile_url": "https://t.co/RgY1wHjoqM",
##     "profile_expanded_url": "https://rud.is/b",
##     "profile_banner_url": "https://pbs.twimg.com/profile_banners/5685812/1398248552",
##     "profile_background_url": "http://abs.twimg.com/images/themes/theme15/bg.png",
##     "profile_image_url": "http://pbs.twimg.com/profile_images/824974380803334144/Vpmh_s3x_normal.jpg"
##   }
## ]

but I still need to poke over at the API docs to figure out if there is a way to get it more programmatically. (see UPDATE)

If we want to be incredibly irresponsible and daft (like a recently semi-shuttered R package installation service) we can throw caution to the wind and just plot it outright:

library(rtweet)
library(rvest)
library(stringi)
library(purrr)    # for keep()
library(magrittr)

pg <- read_html("https://twitter.com/hrbrmstr/status/1261641887603179520")

html_nodes(pg, "img") %>% 
  html_attr("alt") %>% 
  keep(stri_detect_fixed, "library") %>% 
  textConnection() %>% 
  source() %>% # THIS IS DANGEROUS DO NOT TRY THIS AT HOME
  print()

Seriously, though, don’t do that. Lots of bad things can happen when you source() from the internet.
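If you really want to experiment, a safer pattern (a sketch that reuses the extraction from above) is to write the recovered code to a file and read it yourself before running any of it:

library(rvest)
library(stringi)
library(purrr)
library(magrittr)

pg <- read_html("https://twitter.com/hrbrmstr/status/1261641887603179520")

html_nodes(pg, "img") %>% 
  html_attr("alt") %>% 
  keep(stri_detect_fixed, "library") %>% 
  styler::style_text() %>% 
  writeLines(tmp <- tempfile(fileext = ".R")) # stash the recovered code in a temp file

file.edit(tmp) # eyeball it; only run what you've actually read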

UPDATE (2020-05-17)

You can use:

paste(as.character(parse(text = "...")), collapse = "; ")

to “minify” R code for the alt-text.
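For example, here’s that trick applied to a throwaway snippet (a sketch; "latest.csv" is just a placeholder file name):

# the parse()/deparse() round-trip drops comments and newlines,
# leaving "; "-separated statements ready for the alt-text box
code <- '
library(hrbrthemes)
x <- read.csv("latest.csv")[, 3:5]
x[, 3] <- as.numeric(gsub("[$,]", "", x[, 3]))
'

minified <- paste(as.character(parse(text = code)), collapse = "; ")

cat(minified, "\n")
nchar(minified) # keep this under the ~1K alt-text limit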

And, you can use https://github.com/hrbrmstr/rtweet until it is PR’d back into {rtweet} proper to send tweets with image alt-text. The “status” functions also return any alt-text in a new ext_alt_text column.

FIN

Now, you can make your Twitter charts reproducible on-platform (until Twitter does something to thwart this new communication and file-sharing channel).

Since Twitter status URLs are just GET requests, orgs should consider running the content of those URLs through alt-text extractors just in case there’s some funny business going on across user endpoints.

The United States Centers for Disease Control (CDC from now on) has set up two new public surveillance resources for COVID-19. Together, COVIDView and COVID-NET provide similar weekly surveillance data as FluView does for influenza-like illnesses (ILI).

The COVIDView resources are HTML tables (O_O) and, while the COVID-NET interface provides a “download” button, there is no exposed API to make it easier for the epidemiological community to work with these datasets.

Enter {cdccovidview} — https://cinc.rud.is/web/packages/cdccovidview/ — which scrapes the tables and uses the hidden API in the same way [{cdcfluview}](https://cran.rstudio.com/web/packages/cdcfluview/index.html) does for the FluView data.

Weekly case, hospitalization, and mortality data are available at the national, state, and regional levels (where provided), and I tried to normalize the fields across the tables/datasets (I hate to pick on them when they’re down, but these two sites are seriously sub-optimal from a UX and general usage perspective).

After you follow the above URL for information on how to install the package, it should “just work”. No API keys are needed, but the CDC may change the layout of the tables and the field structure of the hidden API at any time, so keep an eye out for updates.

Using it is pretty simple: just use one of the functions to grab the data you want and then work with it.

library(cdccovidview)
library(hrbrthemes)
library(tidyverse)

hosp <- laboratory_confirmed_hospitalizations()

hosp
## # A tibble: 4,590 x 8
##    catchment      network   year  mmwr_year mmwr_week age_category cumulative_rate weekly_rate
##    <chr>          <chr>     <chr> <chr>     <chr>     <chr>                  <dbl>       <dbl>
##  1 Entire Network COVID-NET 2020  2020      10        0-4 yr                   0           0  
##  2 Entire Network COVID-NET 2020  2020      11        0-4 yr                   0           0  
##  3 Entire Network COVID-NET 2020  2020      12        0-4 yr                   0           0  
##  4 Entire Network COVID-NET 2020  2020      13        0-4 yr                   0.3         0.3
##  5 Entire Network COVID-NET 2020  2020      14        0-4 yr                   0.6         0.3
##  6 Entire Network COVID-NET 2020  2020      15        0-4 yr                  NA          NA  
##  7 Entire Network COVID-NET 2020  2020      16        0-4 yr                  NA          NA  
##  8 Entire Network COVID-NET 2020  2020      17        0-4 yr                  NA          NA  
##  9 Entire Network COVID-NET 2020  2020      18        0-4 yr                  NA          NA  
## 10 Entire Network COVID-NET 2020  2020      19        0-4 yr                  NA          NA  
## # … with 4,580 more rows

c(
  "0-4 yr", "5-17 yr", "18-49 yr", "50-64 yr", "65+ yr", "65-74 yr", "75-84 yr", "85+"
) -> age_f

mutate(hosp, start = mmwr_week_to_date(mmwr_year, mmwr_week)) %>%
  filter(!is.na(weekly_rate)) %>%
  filter(catchment == "Entire Network") %>%
  select(start, network, age_category, weekly_rate) %>%
  filter(age_category != "Overall") %>%
  mutate(age_category = factor(age_category, levels = age_f)) %>%
  ggplot() +
  geom_line(
    aes(start, weekly_rate)
  ) +
  scale_x_date(
    date_breaks = "2 weeks", date_labels = "%b\n%d"
  ) +
  facet_grid(network~age_category) +
  labs(
    x = NULL, y = "Rates per 100,000 pop",
    title = "COVID-NET Weekly Rates by Network and Age Group",
    caption = sprintf("Source: COVID-NET: COVID-19-Associated Hospitalization Surveillance Network, Centers for Disease Control and Prevention.\n<https://gis.cdc.gov/grasp/COVIDNet/COVID19_3.html>; Accessed on %s", Sys.Date())
  ) +
  theme_ipsum_es(grid="XY")

FIN

This is brand new and — as noted — things may change or break due to CDC site changes. I may have also missed a table or two (it’s a truly terrible site).

If you notice things are missing or would like a different interface to various data endpoints, drop an issue or PR wherever you’re most comfortable.

Stay safe!

Just a quick note that thanks to a gentle nudge an updated version of {uaparserjs} — a package that processes User Agent strings web clients send to servers — is making its way to all the CRAN mirrors and is also available on CINC. The most significant change is a much-overdue update to the user agent regex dictionary.

It takes something like this Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.106 Chrome/15.0.874.106 Safari/535.2 and turns it into a tidy data frame:

uaparserjs::ua_parse("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.106 Chrome/15.0.874.106 Safari/535.2")
## # A tibble: 1 x 9
##   userAgent                                   ua.family ua.major ua.minor ua.patch os.family os.major os.minor device.family
##   <chr>                                       <chr>     <chr>    <chr>    <chr>    <chr>     <chr>    <chr>    <chr>        
## 1 Mozilla/5.0 (X11; Linux x86_64) AppleWebKi… Chromium  15       0        874      Ubuntu    11       10       Other     

The “js” on the end of the package name is a nod that it uses the javascript ua-parser-core module via Jeroen’s seriously awesome {V8} package. Four years ago, {uaparserjs} did not work on Windows due to V8 VM stack limitations on that platform. Today, it works on all platforms!

’Tis no slouch, either, as it processes 100 user agent strings in ~20ms. No speed demon, but it should get the job done for most use cases.
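If you want to check that claim on your own machine, here’s a rough timing sketch (numbers will vary by system):

library(uaparserjs)

ua <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.106 Chrome/15.0.874.106 Safari/535.2"

# parse the same UA string 100 times and see how long it takes
system.time(
  res <- ua_parse(rep(ua, 100))
)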

There is an excellent C++-backed R version that is not on CRAN and has some heavy dependencies, but it is faster than the javascript version if you need to process scads of user agent strings (I tend to use at-scale Scala environments for this now, hence the long delay between updates).

Jeroen has an excellent writeup on how to use browserify to create an application bundle for javascript-backed R packages or scripts. There are some idiosyncrasies with the ua-parser-core reference implementation that were causing me no end of trouble with that method. On a lark, I tried:

$ webpack --mode="production" index.js -o bundle.js

and it worked perfectly on the first try (both in creating the app bundle and in that bundle working just as before). This is due in no small part to Jeroen getting the {V8} package to work with more recent lib-v8 releases (which is also why it works on Windows now). I’ll try to write up the webpack alternate method and PR it into the vignette as I get time.

FIN

As usual, kick the tyres, jump in with PRs or issues and — most of all — be safe out there!

For folks who interact with CRAN or R Core: they’re continuing to support our community during these crazy times so when you are in exchanges with them, definitely take some time out to add an extra nod of thanks for managing to do so whilst juggling the same things we all are.

Waffle House announced it was closing hundreds of stores this week due to SARS-CoV-2 (a.k.a. COVID-19). This move garnered quite a bit of media attention since former FEMA Administrator Craig Fugate used the restaurant chain as an indicator of both the immediate and overall severity of a natural disaster. [He’s not the only one](https://www.ehstoday.com/emergency-management/article/21906815/what-do-waffles-have-to-do-with-risk-management). The original concept was pretty straightforward:

For example, if a Waffle House store is open and offering a full menu, the index is green. If it is open but serving from a limited menu, it’s yellow. When the location has been forced to close, the index is red. Because Waffle House is well prepared for disasters, Kouvelis said, it’s rare for the index to hit red. For example, the Joplin, Mo., Waffle House survived the tornado and remained open.
 
“They know immediately which stores are going to be affected and they call their employees to know who can show up and who cannot,” he said. “They have temporary warehouses where they can store food and most importantly, they know they can operate without a full menu. This is a great example of a company that has learned from the past and developed an excellent emergency plan.”

SARS-CoV-2 is not a tropical storm, so conditions are a bit different and a tad more complex when it comes to gauging the severity of this particular disaster (mostly caused by inept politicians across the globe), which gave me an idea for how to make the Waffle House Index a proper index, i.e. a “statistical measure of change in a representative group of individual data points.”

In the case of an outbreak, rather than a simple green/yellow/red condition state, using the ratio of closed Waffle House locations to total locations as a numeric index — [0-1] — seems to make more sense since it may better help indicate:

  • when shelter-in-place became mandatory where a given restaurant is located
  • the severity of SARS-CoV-2-caused symptoms for a given location
  • disruptions in the supply chain for a given location due to SARS-CoV-2

I kinda desperately needed a covidistraction so I set out to see how hard it would be to build such an index metric.

Waffle House lets you find locations via a standard map/search interface. They provide lots of data via that map which can be used to figure out which stores are open and which are closed. There’s a nascent R package which contains all the recipes necessary for the data gathering. However, you don’t need to use it, since it powers wafflehouseindex.us, which collects the data whenever the store-closings info changes and provides a daily snapshot of the latest data (direct CSV link).
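If you want to compute the index yourself from that snapshot, a minimal sketch looks something like the following (the status column name and the "closed" coding are assumptions; check the CSV header first):

# sketch only: `status` and the "closed" value are guesses about the CSV layout
waffle <- read.csv("http://wafflehouseindex.us/data/latest.csv", stringsAsFactors = FALSE)

idx <- mean(tolower(waffle$status) == "closed", na.rm = TRUE) # share of closed locations

idx * 100 # expressed as a percentage, like the values quoted below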

The historical data will make it to a git repo at some point in the near future.

The current index value is 21.2, which increased quickly after the first value of 18.1 (that event was the catalyst for getting the site up and the package done), and the closed locations are on the map at the beginning of the post. I went with three qualitative levels on the gauge mostly to keep things simple.

There will absolutely be more location closings and it will be interesting (and, ultimately, very depressing and likely grave) to see how high the index goes and how long it stays above zero.

FIN

The metric is — for the time being — computed across all stores. As noted earlier, this could be broken down into regional index scores to intuit the aforementioned three indicators on a more local level. The historical data (apart from the first closings announcement) is being saved so it will be possible to go back and compute regional indexes when I’ve got more time.

I shall reiterate that you should grab the data from http://wafflehouseindex.us/data/latest.csv vs use the R package since there’s no point in dup’ing the gathering and the historical data will be up and maintained soon.

Stay safe, folks.

Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:

If we tweak the buffer space around the squares, I think the cartogram looks better:

but, you should likely use a different palette (see this Twitter thread for examples).

I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:

library(sf)
library(hrbrthemes) 
library(catchpole)
library(tidyverse)

delegates <- read_delegates()

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>% 
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

gg <- ggplot()
for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg + 
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = c(
      "Biden" = '#f0027f',
      "Sanders" = '#7fc97f',
      "Warren" = '#beaed4',
      "Buttigieg" = '#fdc086',
      "Klobuchar" = '#ffff99',
      "Gabbard" = '#386cb0',
      "Bloomberg" = '#bf5b17'
    ),
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(
    fill = guide_legend(
      override.aes = list(size = 4)
    )
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid="") +
  theme(legend.position = "bottom")

{ssdeepr}

Researcher pals over at Binary Edge added web page hashing (pre- and post-javascript scraping) to their platform using ssdeep. This approach is in the category of context triggered piecewise hashes (CTPH) (a.k.a. locality-sensitive hashing), similar to my R adaptation/packaging of Trend Micro’s tlsh.

Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.

I recommend using the hash_con() function if you need to read large blobs since it doesn’t require you to read everything into memory first (though hash_file() doesn’t either, but that’s a direct low-level call to the underlying ssdeep library file reader and not as flexible as R connections are).
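If you want a quick feel for how these hashes behave, here’s a sketch comparing a largish blob to a lightly modified copy of itself (ssdeep needs reasonably big inputs to produce meaningful scores, and the exact score will vary):

library(ssdeepr)

set.seed(2020)

blob_a <- paste(sample(letters, 1e5, replace = TRUE), collapse = "") # ~100KB of junk
blob_b <- blob_a
substr(blob_b, 50000, 50010) <- "CHANGEDHERE" # small, localized change

h_a <- hash_raw(charToRaw(blob_a))
h_b <- hash_raw(charToRaw(blob_b))

hash_compare(h_a, h_a) # identical input: 100
hash_compare(h_a, h_b) # small change: should still score high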

These types of hashes are great at seeing if something has changed on a website (or seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?

library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94

I picked on cran.biotools.fr as I saw they were well behind CRAN-proper on the monitoring page.

I noted that BE was doing pre- and post-javascript hashing as well. Why, you may ask? Well, websites behave differently with javascript running, plus they can behave differently when different user-agents are set. Let’s grab a page from Wikipedia a few different ways to show how they are not alike at all, depending on the retrieval context. First, let’s grab some web content!

library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")

h4 <- hash_raw(content(res, as = "raw"))

Now, let’s compare them:

hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled

hash_compare(h1, h2)
## [1] 0
hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents

hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness

hash_compare(h3, h4)
## [1] 0

For this example, just content size would have been enough to tell the difference (note, though, how h1 and h4 compare as equal despite slightly more bytes coming back with the {httr} method):

length(js1)
## [1] 432914

length(js2)
## [1] 270538

nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
    collapse = "\n"
  )
)
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099

FIN

If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!

The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly to try out the Rtools 4.0 toolchain vs deliberately wanting to support legacy platforms).

As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.

For folks who are smart enough not to go near Twitter, I’ve been on a hiatus from the platform insofar as reading the Twitter feed goes. “Why” isn’t the subject of this post so I won’t go into it, but I’ve broken this half-NYE resolution on more than one occasion and am very glad I did so late January when I caught a RT of this tweet by WSJ’s Brian McGill:

You can find it here, and a static copy of a recent one is below:

I kinda wanted to try to make a woefully imperfect static version of it in R with {ggplot2}, so I poked around at that URL’s XHR objects and javascript to see if I could find the cartogram and the data source.

The data source was easy as it’s an XHR loaded JSON file: https://asset.wsj.net/wsjnewsgraphics/election/2020/delegates.json.

The cartogram bits… were not. Brian’s two days of manual effort still needed to be put into something that goes onto a web page, and news outlets are super-talented at making compact, fast-loading interactive visualizations, which means one tool they use is “webpack”-esque bundling to combine many small javascript files into one. I did traipse through it to see if there was a back-end JSON or CSV somewhere but could not locate it. However, their cartogram library builds the SVG you see on the page, so you can use Developer Tools to inspect any element of the SVG, copy the whole SVG “outer HTML”, and save it to a local file.

After using an intercept proxy, it turns out this is a dynamically loaded resource, too: https://asset.wsj.net/wsjnewsgraphics/election/delegate-tracker/carto.svg.

That SVG has three top layer groups and has some wicked transforms in it. There was no way I was going to attempt a {statebins}-esque approach to this copycat project (i.e. convert the squares to a grid and map things manually like Brian did) but I had an idea and used Adobe Illustrator to remove the state names layer and the background polygon layer, then “flatten” the image (which — to over-simplify the explanation — flattens all the transforms), and save it back out.

Then, I added some magic metadata prescribed by svg2geojson to turn the SVG into a GeoJSON file (which {sf} can read!). (That sentence just made real cartographers & geocomp’ers weep, btw.)

Now, that I had something R could use in a bit of an easier fashion there was still work to be done. The SVG 1-px <rect> elements ended up coming across as POLYGONs and many, many more point-squares came along for the ride (in retrospect, I think they may have been the borders around the states, more on that in a bit).

I used {purrr} and st_coordinates() to figure out where all the 1-px “polygons” were in the {sf} object and isolated them, then added an index field (1:n, n being the number of delegate squares for a given state).
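That step looked roughly like this (a sketch with a hypothetical carto object, i.e. the {sf} object read from the converted GeoJSON; the size threshold is a guess you’d tune after inspecting the data):

library(sf)
library(tidyverse)

# `carto` is the sf object from st_read()-ing the svg2geojson output (hypothetical name)
map_dfr(seq_len(nrow(carto)), ~{
  xy <- st_coordinates(carto[.x, ])
  tibble(
    row = .x,
    w = diff(range(xy[, "X"])), # width of this polygon's bounding box
    h = diff(range(xy[, "Y"]))  # height of this polygon's bounding box
  )
}) %>%
  filter(w < 1e-4, h < 1e-4) %>% # keep only the tiny "1-px" squares
  pull(row) -> tiny_idx

squares <- carto[tiny_idx, ] # the delegate squares, ready for per-state indexing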

I read in the original SVG with {xml2} and extracted the named state groups. Thankfully the order and number of “blocks” matched the filtered {sf} object. I merged them together, turned the 1-px POLYGONs into POINTs, and made the final {sf} object which I put in the nascent {catchpole} package (location below). Here’s a quick view of it using plot():

library(catchpole) # hrbrmstr/catchpole

plot(delegates_map()[1])

delegates_map()
## Simple feature collection with 3979 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -121.9723 ymin: 37.36802 xmax: -121.9581 ymax: 37.37453
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    state idx                   geometry
## 1     WY   1 POINT (-121.9693 37.37221)
## 2     WY   2 POINT (-121.9693 37.37212)
## 3     WY   3 POINT (-121.9691 37.37221)
## 4     WY   4 POINT (-121.9691 37.37212)
## 5     WY   5 POINT (-121.9691 37.37203)
## 6     WY   6 POINT (-121.9691 37.37194)
## 7     WY   7 POINT (-121.9691 37.37185)
## 8     WY   8  POINT (-121.969 37.37221)
## 9     WY   9  POINT (-121.969 37.37212)
## 10    WY  10  POINT (-121.969 37.37203)

All that was needed was to try it out with the real data.

I simplified that process quite a bit in {catchpole} but also made it possible to work with the individual bits on your own. gg_catchpole() will fetch the WSJ delegate JSON and build the basic map for you using my dark “ipsum” theme:

library(sf)
library(catchpole) # hrbrmstr/catchpole
library(hrbrthemes)
library(tidyverse)

gg_catchpole() +
  theme_ft_rc(grid="") +
  theme(legend.position = "bottom")

BONUS!

Now that you have the WSJ JSON file, you can do other, basic visualizations with it:

library(hrbrthemes) 
library(waffle)
library(geofacet)
library(tidyverse)

jsonlite::fromJSON(
  url("https://asset.wsj.net/wsjnewsgraphics/election/2020/delegates.json"),
  simplifyDataFrame = FALSE
) -> del

c(
  "Biden" = "#5ac4c2",
  "Sanders" = "#63bc51",
  "Warren" = "#9574ae",
  "Buttigieg" = "#007bb1",
  "Klobuchar" = "#af973a",
  "Bloomberg" = "#AA4671",
  "Steyer" = "#4E4EAA",
  "Yang" = "#C76C48",
  "Gabbard" = "#7B8097"
) -> dcols

bind_cols(del$data$US$delCount) %>% 
  gather(candidate, delegates) %>% 
  filter(delegates > 0) %>%
  arrange(desc(delegates)) %>% 
  mutate(candidate = fct_inorder(candidate)) %>%
  ggplot(aes(candidate, delegates)) +
  geom_col(fill = ggthemes::tableau_color_pal()(1), width = 0.55) +
  labs(
    x = NULL, y = "# Delegates",
    title = "2020 Democrat POTUS Race Delegate Counts",
    subtitle = sprintf("Date: %s", Sys.Date()),
    caption = "Data source: WSJ <https://asset.wsj.net/wsjnewsgraphics/election/2020/delegates.json>\n@hrbrmstr #rstats"
  ) +
  theme_ipsum_rc(grid="Y")

bind_cols(del$data$US$delCount) %>% 
  gather(candidate, delegates) %>% 
  filter(delegates > 0) %>%
  arrange(desc(delegates)) %>% 
  mutate(candidate = fct_inorder(candidate)) %>%
  ggplot(aes(fill=candidate, values=delegates)) +
  geom_waffle(color = "white", size = 0.5) +
  scale_fill_manual(name = NULL, values = dcols) +
  coord_fixed() +
  labs(
    x = NULL, y = "# Delegates",
    title = "2020 Democrat POTUS Race Delegate Counts",
    subtitle = sprintf("Date: %s", Sys.Date()),
    caption = "Data source: WSJ <https://asset.wsj.net/wsjnewsgraphics/election/2020/delegates.json>\n@hrbrmstr #rstats"
  ) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle()

state_del <- del
state_del$data[["US"]] <- NULL

map_df(state_del$data, ~bind_cols(.x$delCount), .id = "state") %>% 
  gather(candidate, delegates, -state) %>% 
  filter(delegates > 0) %>% 
  ggplot(aes(candidate, delegates)) +
  geom_col(aes(fill = candidate), col = NA, width = 0.55) +
  scale_fill_manual(name = NULL, values = dcols) +
  facet_geo(~state) +
  labs(
    x = NULL, y = "# Delegates",
    title = "2020 Democrat POTUS Race Delegate Counts by State",
    subtitle = sprintf("Date: %s", Sys.Date()),
    caption = "Data source: WSJ <https://asset.wsj.net/wsjnewsgraphics/election/2020/delegates.json>\n@hrbrmstr #rstats"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank()) +
  theme(panel.spacing.x = unit(0.5, "lines")) +
  theme(panel.spacing.y = unit(0.1, "lines")) +
  theme(legend.position = c(0.95, 0.1)) +
  theme(legend.justification = c(1, 0))

FIN

More work needs to be done on the map and {catchpole} itself but there’s a sufficient base for others to experiment with (PRs and your own blog posts welcome!).

W/r/t the “more on that in a bit” bits: the extra polygons were very likely borders, and I think borders would help the cartogram, but we can make them with {sf}, too. We can also add in a layer for state names and/or just figure out the centroid for each point grouping (with {sf}) and get places for labels that way. Not sure I’ll have time for any of that (this whole process went quickly, believe it or not).
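For instance, here’s a quick sketch of deriving per-state label positions from the existing point groupings:

library(sf)
library(catchpole) # hrbrmstr/catchpole
library(tidyverse)

# one label point per state: union that state's delegate squares, then take the centroid
delegates_map() %>%
  group_by(state) %>%
  summarise() %>%               # summarise() on a grouped sf object unions the geometries
  st_centroid() -> state_labels # warns about lon/lat centroids; fine for label placement

ggplot() +
  geom_sf(data = delegates_map(), col = "#d9d9d9", shape = 22, size = 3) +
  geom_sf_text(data = state_labels, aes(label = state), size = 2, col = "#2b2b2b") +
  coord_sf(datum = NA) +
  theme_void()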

Also: ggiraph::geom_sf_interactive() can be used as a poor-dude’s popup to turn this (quickly) into an interactive piece.
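A sketch of that, reusing the gsf object from the big block above (tooltips only; you’d still want the fill scale and theming from before):

library(sf)
library(ggiraph)
library(hrbrthemes)
library(tidyverse)

# `gsf` is the delegates/candidates sf object built earlier
gg_int <- ggplot() +
  geom_sf_interactive(
    data = gsf,
    aes(fill = candidate, tooltip = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid = "") +
  theme(legend.position = "bottom")

girafe(ggobj = gg_int) # hover a square to see the candidate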

If you hit up https://git.rud.is/hrbrmstr/catchpole you’ll find the package and URLs to other social coding sites (though GitUgh has been plagued with downtime and degraded performance the past few weeks, so you should really think about moving your workloads to a real service).

Have fun mapping Über Tuesday and share your creations, PR’s, ideas, etc for the package wherever you’re most comfortable.

It seems that MX, DKIM, SPF, and DMARC records were just not enough acronyms (and setup tasks) for modern email setups, resulting in the creation of yet another acronym — BIMI, or Brand Indicators for Message Identification. The goal of BIMI is to “provide a mechanism for mail senders to publish a validated logotype that mail receivers can display with the senders’ messages.” You can read about the rationale for BIMI and the preliminary RFC for crafting BIMI DNS TXT records over a few caffeinated beverages. I’ll try to TL;DR the high points below.

The idea behind BIMI is to provide a visual indicator of the brand associated with a mail message; i.e. you’ll have an image to look at somewhere in the mail list display and/or mail message display of your mail client if it supports BIMI. This visual indicator is merely an image URL association with a brand mail domain through the use of a new special-prefix DNS TXT record. Mail intermediaries and mail clients are only supposed to allow presentation of BIMI-record provided images after verifying that the email domain itself conforms to the DMARC standard (which you should be using if you’re an organization/brand and shame on you if you’re not by now). In fact, the goal of BIMI is to help ensure:

  • the organization is legitimate
  • the domain names are controlled by the organization
  • the organization has current rights to display the indicator

Additionally, when BIMI validation is being performed, the party requesting validation must be currently authorized to do so by the organization and must be who they say they are.

If you’re having flashbacks to the lost era of when SSL certificates were supposed to have similar integrity assertions, you’re not alone (thanks, LE).

What’s Really Going On?

I’m not part of any working group associated with BIMI, I just measure and study the internet for a living. As someone who is as likely to use alpine to peruse mail as I am a thick email client or (heaven forbid) web client, BIMI will be of little value to me since I’m not really going to see said images anyway.

Reading through all the BIMI (and associated) RFCs, email security & email marketing vendor blogs/papers, and general RFC commentary, I don’t see BIMI solving any problem that well-armored DMARC configurations aren’t already solving. It appears to be driven mainly by brand marketing wonks who just want to shove brand logos in front of you and have one more way to track you.

Yep, tracking email perusals (even if it’s just a list view) will be one of the benefits (to brands and marketing firms) and is most assuredly a non-stated primary goal of this standard. To help illustrate this, let’s look at the BIMI record for one of the most notorious tracking brands on the planet, Verizon (in this case, Verizon Wireless). When you receive a BIMI-“enhanced” email from verizonwireless.com the infrastructure handling the email receipt will look for and process the BIMI header that was sent along for the ride and eventually query a TXT record for default._bimi.verizonwireless.com (or whatever the sender has specified instead of default — more on that in a bit). In this case the response will be:

v=BIMI1; l=https://ecrm.e.verizonwireless.com/AC/Global/Bling/Images/checkmark/verizon.svg;

which means the image they want displayed is at that URL. Your client will have to fetch that during an interactive session, so your IP address — at a minimum — will be leaked when that fetch happens.
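Pulling the logo URL back out of a record like that is a simple regex affair (same approach as the bulk processing further down):

library(stringi)

bimi_rec <- "v=BIMI1; l=https://ecrm.e.verizonwireless.com/AC/Global/Bling/Images/checkmark/verizon.svg;"

# capture everything after "l=" up to the next semicolon
stri_match_first_regex(bimi_rec, "l=([^;]+)")[, 2]
## [1] "https://ecrm.e.verizonwireless.com/AC/Global/Bling/Images/checkmark/verizon.svg"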

Brands can specify something other than the default selector with the email, so they could easily customize that to be a unique identifier which will “be you” and know when you’ve at least looked at said email in a list view (provided that’s how your email client shows it), if not in the email proper. Since this is a “high integrity” visual component of the message, it’s likely not going to be subject to the “do not load external images/content” rules you have set up (you do view emails with images turned off initially, right?).

So, this is likely just one more way the IETF RFC system is being abused by large corporations to continue to erode our privacy (and get their horribly designed logos in our faces).

Let’s see who are the early adopters of BIMI.

BIMI Through the Alexa Looking Glass

Amazon had stopped updating the Alexa Top 1m sites for a while but it’s been back for quite some time so we can use it to see how many sites in the top 1m have BIMI records.

We’ll use the {zdnsr} package (also on GitLab, SourceHut, BitBucket, and GitUgh) to perform a million default._bimi prefix queries and see how many valid BIMI TXT record responses we get.

library(zdnsr) # hrbrmstr/zdnsr on social coding sites
library(stringi)
library(urltools)
library(tidyverse)

refresh_publc_nameservers_list() # get a current list of active nameservers we can use

# read in the top1m
top1m <- read_csv("~/data/top-1m.csv", col_names = c("rank", "domain")) # http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

# fire off a million queries, storing good results where we can pick them up later
zdns_query(
  entities = sprintf("%s.%s", "default._bimi", top1m$domain),
  query_type = "TXT",
  num_nameservers = 500,
  output_file = "~/data/top1m-bimi.json"
)

# ~10-30m later depending on your system/network/randomly chosen resolvers

bmi <- jsonlite::stream_in(file("~/data/top1m-bimi.json")) # using jsonlite vs ndjson since i don't want a "flat" structure

idx <- which(lengths(bmi$data$answers) > 0) # find all the ones with non-0 results

# start making a tidy data structure
tibble(
  answer = bmi$data$answers[idx]
) %>%
  unnest(answer) %>%
  filter(grepl("^v=BIM", answer)) %>% # only want BIMI records, more on this in a bit
  mutate(
    l = stri_match_first_regex(answer, "l=([^;]+)")[,2], # get the image link
    l_dom = domain(l) # get the image domain
  ) %>% 
  bind_cols(
    suffix_extract(.$name) # so we can get the apex domain below
  ) %>% 
  mutate(
    name_apex = glue::glue("{domain}.{suffix}"),
    name_stripped = stri_replace_first_regex(
      name, "^default\\._bimi\\.", ""
    )
  ) %>% 
  select(name, name_stripped, name_apex, l, l_dom, answer) -> bimi_df

Here’s what we get:

bimi_df
## # A tibble: 321 x 6
##    name       name_stripped  name_apex  l                            l_dom               answer                       
##    <chr>      <chr>          <glue>     <chr>                        <chr>               <chr>                        
##  1 default._… ebay.com       ebay.com   https://ir.ebaystatic.com/p… ir.ebaystatic.com   v=BIMI1; l=https://ir.ebayst…
##  2 default._… linkedin.com   linkedin.… https://media.licdn.com/med… media.licdn.com     v=BIMI1; l=https://media.lic…
##  3 default._… wish.com       wish.com   https://wish.com/static/img… wish.com            v=BIMI1; l=https://wish.com/…
##  4 default._… dropbox.com    dropbox.c… https://cfl.dropboxstatic.c… cfl.dropboxstatic.… v=BIMI1; l=https://cfl.dropb…
##  5 default._… spotify.com    spotify.c… https://message-editor.scdn… message-editor.scd… v=BIMI1; l=https://message-e…
##  6 default._… ebay.co.uk     ebay.co.uk https://ir.ebaystatic.com/p… ir.ebaystatic.com   v=BIMI1; l=https://ir.ebayst…
##  7 default._… asos.com       asos.com   https://content.asos-media.… content.asos-media… v=BIMI1; l=https://content.a…
##  8 default._… wix.com        wix.com    https://valimail-app-prod-u… valimail-app-prod-… v=BIMI1; l=https://valimail-…
##  9 default._… cnn.com        cnn.com    https://amplify.valimail.co… amplify.valimail.c… v=BIMI1; l=https://amplify.v…
## 10 default._… salesforce.com salesforc… https://c1.sfdcstatic.com/c… c1.sfdcstatic.com   v=BIMI1; l=https://c1.sfdcst…
## # … with 311 more rows

I should re-run this mass query since it usually takes 3-4 runs to get a fully comprehensive set of results (I should also really use work’s infrastructure to do the lookups against the authoritative nameservers for each organization like we do for our FDNS studies, but this was a spur-of-the-moment project idea to see if we should add BIMI to our studies and my servers are “free” whereas AWS nodes most certainly are not).

To account for the aforementioned “comprehensiveness” issues, we’ll round up the total from the ~320 observed to 400 (the average difference between 1 and 4 bulk queries is more like 5% than 20%, but I’m in a generous mood), so 0.04% of the domains in the Alexa Top 1m have BIMI records. Not all of those domains are going to have MX records, but it’s safe to say less than 1% of the brands on the Alexa Top 1m have been early BIMI adopters. This is not surprising since it’s not really a fully baked standard and no real clients support it yet (AOL doesn’t count, apologies to the Oathers). Google claims to be “on board” with BIMI, so once they adopt it, we should see that percentage go up.

Tracking isn’t limited to a tricked out dynamic DNS configuration that customizes selectors for each recipient. Since many brands use third party services for all things email, those clearinghouses are set to get some great data on you if these preliminary results are any indicator:

count(bimi_df, l_dom, sort=TRUE)
## # A tibble: 255 x 2
##    l_dom                                                                          n
##    <chr>                                                                      <int>
##  1 irepo.primecp.com                                                             13
##  2 www.letakomat.sk                                                               9
##  3 valimail-app-prod-us-west-2-auth-manager-assets.s3.us-west-2.amazonaws.com     8
##  4 static.mailkit.eu                                                              7
##  5 astatic.ccmbg.com                                                              5
##  6 def0a2r1nm3zw.cloudfront.net                                                   4
##  7 static.be2.com                                                                 4
##  8 www.christin-medium.com                                                        4
##  9 amplify.valimail.com                                                           3
## 10 bimi-host.250ok.com                                                            3
## # … with 245 more rows

The above code counted how many BIMI URLs are hosted at a particular domain and the top 5 are all involved in turning you into the product for other brands.

Speaking of brands, these are the logos of the early adopters which I made by generating some HTML from an R script and screen capturing the browser result:
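That was little more than templating the logo URLs into <img> tags, something along these lines (a sketch; the output file name is arbitrary):

# turn the BIMI logo URLs into a simple HTML page we can screen-capture
writeLines(
  c(
    "<html><body style='background:#ffffff'>",
    sprintf(
      "<figure style='display:inline-block;margin:8px;text-align:center'><img src='%s' height='64'/><figcaption>%s</figcaption></figure>",
      bimi_df$l, bimi_df$name_stripped
    ),
    "</body></html>"
  ),
  "bimi-logos.html"
)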

FIN

The data from the successful BIMI results of the mass DNS query is at https://rud.is/dl/2020-02-21-bimi-responses.json.gz. Knowing there are results to be had, I’ll be setting up a regular (proper) mass-query of the Top 1m and see how things evolve over time and possibly get it on the work docket. We may just do a mass BIMI prefix query against all FDNS apex domains just to see a broader scale result, so stay tuned.

Drop a note if you discover any more insights from the data (there are a few in there I’m saving for a future post) or from your own BIMI inquiries; also drop a note if you have a good defense of BIMI other than marketing and tracking.