Skip navigation

(This is part 2 of n “quick hit” posts, each walking through some approaches to speeding up components of an iterative operation. Go here for part 1).

Thanks to the aforementioned previous post, we now have a super fast way of reading individual text files containing HTTP headers from HEAD requests into a character vector:

library(Rcpp)

vapply(
  X = fils, 
  FUN = cpp_read_file, # see previous post for the source for this C++ Rcpp function
  FUN.VALUE = character(1), 
  USE.NAMES = FALSE
) -> hdrs

head(hdrs, 2)
## [1] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:40:45 GMT\r\nServer: Apache\r\nLast-Modified: Sun, 26 Apr 2020 00:06:47 GMT\r\nETag: \"ace-ec1a0-5a4265fd413c0\"\r\nAccept-Ranges: bytes\r\nContent-Length: 967072\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"                                   
## [2] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:43:46 GMT\r\nServer: Apache\r\nLast-Modified: Wed, 05 Jun 2019 03:52:22 GMT\r\nETag: \"423-d99a0-58a8b864f8980\"\r\nAccept-Ranges: bytes\r\nContent-Length: 891296\r\nX-XSS-Protection: 1; mode=block\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"

However, I need the headers and values broken out so I can eventually get to the analysis I need to do, and a data frame of name/value columns would be the most helpful format. We’ll use {stringi} to help us build a function (explanation of what it’s doing is in comment annotations) that turns each unkempt string into a very kempt data frame:

library(stringi)

parse_headers <- function(x) {

  # split lines from into a character vector
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP x/x ...
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      data.frame(
        name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
        value = hdrs[,3]
      )
    } else { # if we don't have any
      NULL
    }

  })

}

parse_headers(hdrs[1:3])
## [[1]]
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload
## 
## [[2]]
##               name                         value
## 1             date Mon, 08 Jun 2020 14:43:46 GMT
## 2           server                        Apache
## 3    last_modified Wed, 05 Jun 2019 03:52:22 GMT
## 4             etag     "423-d99a0-58a8b864f8980"
## 5    accept_ranges                         bytes
## 6   content_length                        891296
## 7 x_xss_protection                 1; mode=block
## 8  x_frame_options                    SAMEORIGIN
## 9     content_type      application/x-msdownload
## 
## [[3]]
##           name                         value
## 1         date Mon, 08 Jun 2020 14:23:53 GMT
## 2       server                        Apache
## 3 content_type text/html; charset=iso-8859-1

parse_header(hdrs[1])
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload

Unfortunately, this takes almost 16 painful seconds to crunch through the ~75K text entries:

system.time(tmp <- parse_headers(hdrs))
##   user  system elapsed 
## 15.033   0.097  15.227 

as each call can be near 150 microseconds:

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr     min       lq     mean  median      uq     max neval
##    ph 143.328 146.8995 154.8609 148.361 158.121 415.332  1000

A big reason it takes so long is the data frame creation. If you’ve never looked at the source for data.frame() have a go at it — https://github.com/wch/r-source/blob/86532f5aa3d9880f4c1c9e74a417005616846a34/src/library/base/R/dataframe.R#L435-L603 — before continuing.

Back? Great! The {base} data.frame() has tons of guard rails to make sure you’re getting what you think you asked for across a myriad of use cases. I learned about a trick to make data frame creation faster when I started playing with {ggplot2} source. Said trick has virtually no guard rails — it just adds a class, and row.names attribute to a list — so you really should only use it in cases like this where you have a very good idea of the structure and values of the data frame you’re making. Here’s an even more simplified version of the function in the {ggplot2} source:

fast_frame <- function(x = list()) {

  lengths <- vapply(x, length, integer(1))
  n <- if (length(x) == 0 || min(lengths) == 0) 0 else max(lengths)
  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n) # help(.set_row_names) for info

  x

}

Now, we’ll change parse_headers() a bit to use that function instead of data.frame():

parse_headers <- function(x) {

  # split lines from into a character vector
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP x/x ...
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      fast_frame(
        list(
          name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
          value = hdrs[,3]
        )
      )
    } else { # if we don't have any
      NULL
    }

  })

}

Note that we had to pass in a list() to it vs bare name/value vectors.

How much faster is it? Quite a bit:

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr   min      lq     mean median      uq      max neval
##    ph 27.94 28.7205 34.66066 29.024 29.3785 4144.402  1000

This speedup means the painful ~15s is now just a tolerable ~3s:

system.time(tmp <- parse_headers(hdrs))
##  user  system elapsed 
## 2.901   0.011   2.918 

FIN

Normally, guard rails are awesome, and you can have even more safe code (which means safer and more reproducible analyses) when using {tidyverse} functions. As noted in the previous post, I’m doing a great deal of iterative work, have more than one set of headers I’m crunching on, and am testing out different approaches/theories, so going from 16 seconds to 3 seconds does truly speed up my efforts and has an even bigger impact when I process around 3 million raw header records.

I think I promised {future} work in this post (asynchronous pun not intended), but we’ll get to that eventually (probably the next post).

If you have your own favorite way to speedup data frame creation (or extracting target values from raw text records) drop a note in the comments!

(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.)

I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this:

HTTP/1.1 200 OK
Date: Mon, 08 Jun 2020 14:40:45 GMT
Server: Apache
Last-Modified: Sun, 26 Apr 2020 00:06:47 GMT
ETag: "ace-ec1a0-5a4265fd413c0"
Accept-Ranges: bytes
Content-Length: 967072
X-Frame-Options: SAMEORIGIN
Content-Type: application/x-msdownload

I do this quite a bit in R when we create new studies at work, but I’m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.

Reading in a bunch of files (each one into a string) is fairly straightforward in R since readChar() will do the work of reading and we just wrap that in an iterator:

length(fils)
## [1] 73514 

# check file size distribution
summary(
  vapply(
    X = fils,
    FUN = file.size,
    FUN.VALUE = numeric(1),
    USE.NAMES = FALSE
  )
)
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 19.0   266.0   297.0   294.8   330.0  1330.0 

# they're all super small

system.time(
  vapply(
    X = fils, 
    FUN = function(.f) readChar(.f, file.size(.f)), 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 2.754   1.716   4.475 

NOTE: You can use lapply() or sapply() to equal effect as they all come in around 5 seconds on a modern SSD-backed system.

Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but can we do better? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:

  • base::readChar()
  • readr::read_file()
  • stringi::stri_read_raw() (+ rawToChar())

Do any of them beat {base}? Let’s see (using the largest of the files):

library(stringi)
library(readr)
library(microbenchmark)

largest <- fils[which.max(sapply(fils, file.size))]

file.size(largest)
## [1] 1330

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000
##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000
##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000

I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. ({readr} has quite a bit of functionality baked into it — for good reasons — which doesn’t help it win any performance contests).

I still felt there had to be an even faster method, especially since I knew that the files all had HTTP response headers and that they every one of them could each be easily read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like “sands” some corners, really). We’ll do that right in R via {Rcpp} in this function (annotated in C++ code comments):

library(Rcpp)

cppFunction(code = '
String cpp_read_file(std::string fil) {

  // our input stream
  std::ifstream in(fil, std::ios::in | std::ios::binary);

  if (in) { // we can work with the file

  #ifdef Win32
    struct _stati64 st; // gosh i hate windows
    _wstati64(fil.cstr(), &st) // this shld work but I did not test it
  #else
    struct stat st;
    stat(fil.c_str(), &st);
  #endif

    std::string out(st.st_size, 0); // make a string buffer to hold the data

    in.seekg(0, std::ios::beg); // ensure we are at the beginning
    in.read(&out[0], st.st_size); // read in the file
    in.close();

    return(out);

  } else {
    return(NA_STRING); // file missing or other errors returns NA
  }

}
', includes = c(
  "#include <fstream>",
  "#include <string>",
  "#include <sys/stat.h>"
))

Is that going to be faster?

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  rcpp = cpp_read_file(largest),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  80.500  91.6910  96.82752  94.3475 100.6945 295.025  1000
##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000
##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000
##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000

It sure looks like it, but let’s put it to the test:

system.time(
  vapply(
    X = fils, 
    FUN = cpp_read_file, 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 0.446   1.244   1.693 

I’ll take a two-second wait over a five-second wait any day!

FIN

I have a few more cases coming up where there will be 3-5x the number of (similar) files that I’ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.

The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.

If I missed your favorite file slurping function, drop a note in the comments and I’ll update the post with new benchmarks.

The Marshall Project has a solid story and set of visualizations on the impact of COVID-19 in U.S. prisons. They keep the data (and vis) regularly updated. They do great work and this is an important topic, but this visualization breaks my “ordered grid” OCD:

To be fair, it’s not supposed to line up as the dots are part of an animation process that has them drop from top to bottom and appears to be designed to have an “organic” feel.

We can use the {waffle} package to iron out these wrinkled non-grids into some semblance of order, and try to replicate the chart as much as possible along the way.

Getting the Data

We first need the data and, thankfully, the MP folks provided it…just not in a way you’d expect (or that’s straightforward to use).

Do a “view source” on that URL in your browser and scroll down to line ~1,455 and you should see this:

That’s the data, right on the page, encoded in javascript🤔. This makes sense as it is fueling a javascript visualization and many sites are embedding data right on the page vs fetch via an XHR request to make it easier for web archives to store and retrieve working visualizations. We can totally work with this data, and we’ll do that now, along with getting some boilerplate out of the way:

library(V8)         # work with javascript data
library(stringi)    # string ops
library(rvest)      # web scrape
library(ggtext)     # pretty ggplot text with markdown
library(waffle)     # waffle charts // install_github("hrbrmstr/waffle")
library(hrbrthemes) # install_github("hrbrmstr/hrbrthemes") or don't use the font theme and pick another one
library(tidyerse)   # duh

gg <- glue::glue # for plot labels (later)

# get the page source
pg <- read_html("https://www.themarshallproject.org/2020/05/01/a-state-by-state-look-at-coronavirus-in-prisons")

# setup a V8 VM context
ctx <- v8()

# grab the "data" and make it a V8 VM object
html_nodes(pg, xpath=".//script[contains(., 'var STATES_DATA')]") %>%
  html_text() %>%
  ctx$eval()

# get the data into R
states_data <- ctx$get("STATES_DATA")

glimpse(states_data)
## Rows: 918
## Columns: 20
## $ ``                 <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",…
## $ name               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "G…
## $ abbreviation       <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "M…
## $ week_of            <chr> "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26"…
## $ unrevised_cases    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ cases              <chr> "0", "0", "0", "0", "1", "0", "NA", "0", "0", "4", "0", "0", "3", "0", "0", "0", "NA", "0", "0", "0", "9", "23…
## $ new_cases          <chr> "0", "0", "0", "0", "1", "0", "NA", "0", "0", "4", "0", "0", "3", "0", "0", "0", "NA", "0", "0", "0", "9", "23…
## $ filled_cases       <chr> "0", "0", "0", "0", "1", "0", "0", "0", "0", "4", "0", "0", "3", "0", "0", "0", "0", "0", "0", "0", "9", "23",…
## $ case_rate          <chr> "0", "0", "0", "0", "0.0812809883768187", "0", "NA", "0", "0", "0.728318857996031", "0", "0", "0.7865757734661…
## $ deaths             <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "NA", "0", "0", "0", "0", "0",…
## $ new_deaths         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "NA", "0", "0", "0", "0", "0",…
## $ filled_deaths      <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ death_rate         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0.182079714499008", "0", "0", "0", "0", "0", "0", "NA", "0", "0"…
## $ staff_multiples    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ prisoner_multiples <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ tested             <chr> "NA", "4", "15", "0", "166", "NA", "NA", "4", "NA", "NA", "NA", "10", "13", "NA", "NA", "0", "NA", "32", "NA",…
## $ as_of_date         <chr> "NA", "2019-12-31", "2020-04-15", "2020-02-29", "2020-04-15", "NA", "NA", "2020-04-15", "NA", "NA", "NA", "202…
## $ march_pop          <chr> "NA", "4997", "42282", "18181", "123030", "NA", "NA", "5042", "NA", "NA", "NA", "7816", "38140", "NA", "NA", "…
## $ april_pop          <chr> "NA", "4997", "41674", "18181", "118466", "NA", "NA", "4624", "NA", "NA", "NA", "7641", "36904", "NA", "NA", "…
## $ test_rate          <chr> "NA", "8.00480288172904", "3.54760891159359", "0", "13.4926440705519", "NA", "NA", "7.93335977786593", "NA", "…

The comments in the code go a long way, but jist is that we extract out the javascript block containing that var STATES_DATA… data, have {V8} wrangle it in javascript for us, then get the result and take a look at it. Now for the real work.

Making the Data Useful

We need — at a minimum — dates and numbers. We’re also going to mimic the visualization, so we’ll be dividing new case counts by 10 for the “1 dot == 10 cases” waffle chart and creating useful axis labels. This is pretty basic wrangling:

states_data %>% 
  select(week_of, new_cases) %>% 
  mutate(
    week_of = as.Date(week_of),
    new_cases = suppressWarnings(as.numeric(new_cases))
  ) %>%
  count(week_of, wt = new_cases) %>%
  arrange(week_of) %>%
  mutate(
    wk = format(week_of, "Week of\n%b %d"),
    div10 = as.integer(round(n/10)),
  ) %>%
  as_tibble() -> cases

glimpse(cases)
## Rows: 18
## Columns: 4
## $ week_of <date> 2020-03-26, 2020-04-01, 2020-04-08, 2020-04-15, 2020-04-22, 2020-04-29, 2020-05-06, 2020-05-13, 2020-05-20, 2020-05-27, …
## $ n       <dbl> 56, 268, 810, 1672, 6872, 4788, 5538, 5115, 3940, 5323, 6027, 3335, 2258, 2452, 3856, 4488, 7324, 6595
## $ wk      <chr> "Week of\nMar 26", "Week of\nApr 01", "Week of\nApr 08", "Week of\nApr 15", "Week of\nApr 22", "Week of\nApr 29", "Week o…
## $ div10   <int> 6, 27, 81, 167, 687, 479, 554, 512, 394, 532, 603, 334, 226, 245, 386, 449, 732, 660

Using the {waffle} package to make “waffle bar charts” means we’ll end up with panels/strips which will become “axis labels”. I like the fact that the MP folks did not label each week, so we’ll have to account for that as well. One of the simplest ways to do that is to make those labels spaces, but a unique number of them since we’re going to make an ordered factor to ensure the strips are in the right order. This is also pretty straightforward:

cases$wk[c(1, 3:5, 7:9, 11:13, 15:17)] <- stri_pad("", 1:13)
cases$wk <- fct_inorder(cases$wk)

The vector of numbers in the first line are the weeks we want to be blank and we’ll turn them into space-padded strings, each with an increasing number of spaces, then we’ll turn the entire vector of weeks into a factor in the right order.

Making the Chart

The rest is all {ggplot2} magic, so let’s get the whole plot code out of the way before talking about it:

ggplot() +
  geom_waffle(
    data = cases,
    aes(fill = "new cases", values = div10),
    flip = TRUE, n_cols = 10, radius = unit(3, "pt"),           
    color = "white"
  ) +
  geom_text(
    data = tibble(
      idx = c(1, 17, 18),
      wk = cases$wk[idx],
      y = (cases$div10[idx] %/% 10),
      actual_cases = cases$n[idx],
      lab = gg("{scales::comma(actual_cases, 1)} new\ncases")
    ),
    aes(1, y, label = lab),
    vjust = 0, hjust = 0, nudge_y = 2,
    size = 3.5, family = font_gs, lineheight = 0.875
  ) +
  scale_y_continuous(
    expand = c(0, 0.125),
    breaks = seq(0, 70, 10),
    labels = scales::comma(seq(0, 7000, 1000)),
    limits = c(0, 80)
  ) +
  scale_fill_manual(
    values = c("#366b7b")
  ) +
  facet_wrap(~wk, nrow=1, strip.position = "bottom") +
  coord_fixed() +
  labs(
    x = NULL, y = NULL,
    title = "There have been at least<br/><span style='color:#366b7b;font-size:32pt;'>**70,717**</span> <span style='font-size:24pt'>cases</span><br/>of coronavirus reported among prisoners.<br/><span style='font-size:18pt;'>**46,846** prisoners have recovered.</span>",
    subtitle = "Each <span style='color:#366b7b;font-size:16pt;'>&bull;</span> represents 10 new cases.",
    caption = "Source: (data) <www.themarshallproject.org/2020/05/01/a-state-by-state-look-at-coronavirus-in-prisons>"
  ) +
  theme_ipsum_gs(
    grid="Y",
    strip_text_family = font_gs, strip_text_face = "plain",
    plot_title_family = font_gs, plot_title_face = "plain",
    subtitle_family = font_an, subtitle_face = "plain", subtitle_size = 10
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    panel.spacing.x = unit(20, "pt"),
    plot.title = element_markdown(),
    plot.subtitle = element_markdown(),
  )

There’s quite a bit going on there, so let’s break it down:

  • We’re telling geom_waffle() to use our data, and giving it a single category to fill (as there is only one) along with the number of elements in the category. The radius parameter lets us have non-square “dots”, and n_cols + flip sets up the grid to match the one from MP.
  • We need labels on top, too (just three of them) so we’ll pick the vector indices of the ones with labels and get the week strip labels, y positions, new case counts for that day, and an appropriately formatted label and plot them. We’re starting the label at the first X position in each strip and plotting the labels at the height of the “bar”.
  • We’re customizing the Y scale to reflect the 1 == 10 representation of the data and using the same blue as MP did for the fill scale.
  • To get them all to mimic a real X axis, we’re ensuring there’s only one row of facets and putting the facet labels at the bottom.
  • By using coord_fixed we can get circles (or as close to them as you like)
  • We’re using some markdown in the labs(), courtesy of {ggtext}’s element_markdown() and setting some font stylings in the base theme (use a different one if you get font errors or read the docs). We rely on this to “fake” a legend.
  • Finally, we tweak strip positions and some formatting to produce:

(You likely need to view that in your own plot window in R/RStudio or zoom in a bit)

FIN

If you spend some more time on it you can get super-close to the Marshall Project’s finished product.

A bonus from scraping is that you also get two more datasets from the page: STATES_DATA and STATE_NOTES:

glimpse(ctx$get("STATES_RATES"))
## Rows: 51
## Columns: 23
## $ ``                 <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",…
## $ name               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "G…
## $ abbreviation       <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "M…
## $ unrevised_cases    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ cases              <chr> "165", "16", "684", "3789", "7066", "783", "1344", "508", "3898", "1113", "0", "763", "343", "729", "333", "91…
## $ new_cases          <chr> "27", "14", "115", "307", "608", "115", "1", "213", "1266", "49", "0", "69", "6", "1", "109", "1", "146", "50"…
## $ filled_cases       <chr> "165", "16", "684", "3789", "7066", "783", "1344", "508", "3898", "1113", "0", "763", "343", "729", "333", "91…
## $ case_rate          <chr> "77.9994327313983", "32.0192115269162", "164.131112924125", "2084.04378197019", "596.458055475833", "449.68986…
## $ deaths             <chr> "14", "0", "13", "25", "40", "3", "7", "7", "34", "26", "0", "0", "13", "20", "1", "4", "6", "16", "0", "8", "…
## $ new_deaths         <chr> "2", "0", "0", "9", "5", "0", "0", "0", "5", "1", "0", "0", "0", "0", "0", "0", "2", "0", "0", "0", "0", "0", …
## $ filled_deaths      <chr> "14", "0", "13", "25", "40", "3", "7", "7", "34", "26", "0", "0", "13", "20", "1", "4", "6", "16", "0", "8", "…
## $ death_rate         <chr> "6.61813368630046", "0", "3.11945097662811", "13.750618777845", "3.376496209883", "1.72294968986906", "5.72925…
## $ staff_multiples    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ prisoner_multiples <chr> "NA", "NA", "NA", "NA", "NA", "13723", "NA", "NA", "NA", "NA", "NA", "1884", "NA", "NA", "NA", "NA", "NA", "NA…
## $ as_of_date         <chr> "2020-01-31", "NA", "2020-04-15", "2020-02-29", "2020-04-15", "2020-03-31", "2020-04-01", "NA", "NA", "NA", "2…
## $ march_pop          <chr> "21154", "NA", "42282", "18181", "123030", "17600", "12422", "NA", "NA", "NA", "4631", "7816", "38140", "26891…
## $ april_pop          <chr> "21154", "NA", "41674", "18181", "118466", "17412", "12218", "NA", "NA", "NA", "4631", "7641", "36904", "26891…
## $ test_rate          <chr> "313.888626264536", "NA", "1176.99284925853", "4720.86243880975", "5048.53713301707", "4241.90213645762", "819…
## $ recovered          <chr> "41", "2", "376", "2970", "4940", "628", "1324", "391", "NA", "881", "0", "100", "307", "716", "208", "906", "…
## $ date               <chr> "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "2…
## $ case_ratio         <chr> "-45.6429050602488", "-7.23368674670148", "-19.650267894127", "1714.81334912848", "488.054058525172", "538.378…
## $ death_ratio        <chr> "149.040167449448", "-100", "-22.1877969354022", "1009.53669385711", "72.0346501657667", "-38.5633906796423", …
## $ test_ratio         <chr> "-74.3653779188402", "NA", "6.07104744754024", "224.056036286688", "205.156799892597", "441.281822953195", "34…

glimpse(ctx$get("STATE_NOTES"))
## Rows: 18
## Columns: 3
## $ state <chr> "TN", "TN", "VA", "NM", "NM", "MN", "MN", "VT", "RI", "RI", "MI", "MD", "CT", "AK", "DE", "HI", "LA", "MA"
## $ type  <chr> "prisoners", "staff", "staff", "prisoners", "staff", "prisoners", "staff", "prisoners", "prisoners", "staff", "prisoners", …
## $ text  <chr> "After testing everyone in all of their prisons, Tennessee has said it is releasing the total number of tests conducted and…

which means you can recreate all the visualizations on the page for practice (or to explore them more). You can check out {devoutsvg}(https://github.com/coolbutuseless/devoutsvg) if you want to try to mimic the gradient fills as well, though they will be supported directly in R in the next major version (and are now in R-devel).

The incredibly talented folks over at Bishop Fox were quite generous this week, providing a scanner for figuring out PAN-OS GlobalProtect versions. I’ve been using their decoding technique and date-based fingerprint table to keep an eye on patch status (over at $DAYJOB we help customers, organizations, and national cybersecurity centers get ahead of issues as best as we can).

We have at-scale platforms for scanning the internet and aren’t running the panos-scanner repo code, but I since there is Python code for doing this, I thought it might be fun to show R folks how to do the same thing and show folks how to use the {httr} package to build something similar (we won’t hit all the URLs their script assesses, primarily for brevity).

What Are We Doing Again?

Palo Alto makes many things, most of which are built on their custom linux distribution dubbed PAN-OS. One of the things they build is a VPN product that lets users remotely access internal company resources. It’s had quite the number of pretty horrible problems of late.

Folks contracted to assess security defenses (colloquially dubbed “pen-testers” tho no pens are usually involved) and practitioners within an organization who want to gain an independent view of what their internet perimeter looks like often assemble tools to perform
ad-hoc assessments. Sure, there are commercial tools for performing these assessments ($DAYJOB makes some!), but these open source tools make it possible for folks to learn from each other and for makers of products (like PAN-OS) to do a better job securing their creations.

In this case, the creation is a script that lets the caller figure out what version of PAN-OS is running on a given GlobalProtect box.

To follow along at home, you’ll need access to a PAN-OS system, as I’m not providing an IP address of one for you. It’s really not hard to find one (just google a bit or stand up a trial one from the vendor). Throughout the examples I’ll be using {glue} to replace ip and port in various function calls, so let’s get some setup bits out of the way:

library(httr)
library(tidyverse) # for read_fwf() (et al), pluck(), filter(), and %>%

gg <- glue::glue # no need to bring in the entire namespace just for this

Assuming you have a valid ip and port, let’s try making a request against your PAN-OS GlobalProtect (hereafter using “GP” so save keystrokes) system:

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp")
) -> res
## Error in curl::curl_fetch_memory(url, handle = handle) : 
##  SSL certificate problem: self signed certificate

We’re using a HEAD request as we really don’t need the contents of the remote file (unless you need to verify it truly is a PAN-OS GP server), just the metadata about it. You can use a traditional GET request if you like, though.

We immediately run into a snag since these boxes tend to use a self-signed SSL/TLS certificate which web clients aren’t thrilled about dealing with unless explicitly configured to. We can circumvent this with some configuration options, but you should not use the following incantations haphazardly. SSL/TLS no longer really means what it used to (thanks, Let’s Encrypt!) but you have no guarantees of what’s being delivered to you is legitimate if you hit a plaintext web site or one with an invalid certificate. Enough with the soapbox, let’s make the request:

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp"),
  config = httr::config(
    ssl_verifyhost =FALSE, 
    ssl_verifypeer = FALSE
  )
) -> res

httr::status_code(res)
## [1] 200

In that request, we’ve told the underlying {curl} library calls to not verify the validity of the host or peer certificates associated with the service. Again, don’t do this haphazardly to get around generic SSL/TLS problems when making normal API calls or scraping sites.

Since we only made a HEAD request, we’re just getting back headers, so let’s take a look at them:

str(httr::headers(res), 1)
## List of 18
##  $ date                     : chr "Fri, 10 Jul 2020 15:02:32 GMT"
##  $ content-type             : chr "text/html; charset=UTF-8"
##  $ content-length           : chr "11749"
##  $ connection               : chr "keep-alive"
##  $ etag                     : chr "\"7e0d5e2b6add\""
##  $ pragma                   : chr "no-cache"
##  $ cache-control            : chr "no-store, no-cache, must-revalidate, post-check=0, pre-check=0"
##  $ expires                  : chr "Thu, 19 Nov 1981 08:52:00 GMT"
##  $ x-frame-options          : chr "DENY"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; samesite=lax; secure; httponly"
##  $ strict-transport-security: chr "max-age=31536000;"
##  $ x-xss-protection         : chr "1; mode=block;"
##  $ x-content-type-options   : chr "nosniff"
##  $ content-security-policy  : chr "default-src 'self'; script-src 'self' 'unsafe-inline'; img-src * data:; style-src 'self' 'unsafe-inline';"
##  - attr(*, "class")= chr [1:2] "insensitive" "list"

As an aside, I’ve always found the use of PHP code in security products quite, er, fascinating.

The value we’re really looking for here is etag (which really looks like ETag in the raw response).

Bishop Fox (and others) figured out that that header value contains a timestamp in the last 8 characters. That timestamp maps to the release date of the particular PAN-OS version. Since Palo Alto maintains multiple, supported versions of PAN-OS and generally releases patches for them all at the same time, the mapping to an exact version is not super precise, but it’s sufficient to get an idea of whether that system is at a current, supported patch level.

The last 8 characters of 7e0d5e2b6add are 5e2b6add, which — as Bishop Fox notes in their repo — is just a hexadecimal encoding of the POSIX timestamp, in this case, 1579903709 or 2020-01-24 22:08:29 GMT (we only care about the date, so really 2020-01-24).

We can compute that with R, but first we need to note that the value is surrounded by " quotes, so we’ll have to deal with that during the processing:

httr::headers(res) %>% 
  pluck("etag") %>% 
  gsub('"', '', .) %>% 
  substr(5, 12) %>% 
  as.hexmode() %>% 
  as.integer() %>% 
  anytime::anytime(tz = "GMT") %>% 
  as.Date() -> version_date

version_date
## [1] "2020-01-24"

To get the associated version(s), we need to look the date up in their table, which is in a fixed-width format that we can read via:

read_fwf(
  file = "https://raw.githubusercontent.com/noperator/panos-scanner/master/version-table.txt",
  col_positions = fwf_widths(c(10, 14), c("version", "date")),
  col_types = "cc",
  trim_ws = TRUE
) %>% 
  mutate(
    date = lubridate::mdy(date)
  ) -> panos_trans

panos_trans
## # A tibble: 153 x 2
##    version  date      
##    <chr>    <date>    
##  1 6.0.0    2013-12-23
##  2 6.0.1    2014-02-26
##  3 6.0.2    2014-04-18
##  4 6.0.3    2014-05-29
##  5 6.0.4    2014-07-30
##  6 6.0.5    2014-09-04
##  7 6.0.5-h3 2014-10-07
##  8 6.0.6    2014-10-07
##  9 6.0.7    2014-11-18
## 10 6.0.8    2015-01-13
## # … with 143 more rows

Now, let’s see what version or versions this might be:

filter(panos_trans, date == version_date)
## # A tibble: 2 x 2
##   version date      
##   <chr>   <date>    
## 1 9.0.6   2020-01-24
## 2 9.1.1   2020-01-24

Putting It All Together

We can make a command line script for this (example) scanner:

#!env Rscript
library(purrr)

gg <- glue::glue

# we also use {httr}, {readr}, {lubridate}, {anytime}, and {jsonlite}

args <- commandArgs(trailingOnly = TRUE)

stopifnot(
  c(
    "Must supply both IP address and port" = length(args) == 2
  )
)

ip <- args[1]
port <-  args[2]

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp"),
  config = httr::config(
    ssl_verifyhost =FALSE, 
    ssl_verifypeer = FALSE
  )
) -> res

httr::headers(res) %>% 
  pluck("etag") %>% 
  gsub('"', '', .) %>% 
  substr(5, 12) %>% 
  as.hexmode() %>% 
  as.integer() %>% 
  anytime::anytime(tz = "GMT") %>% 
  as.Date() -> version_date

panos_trans <- readr::read_csv("panos-versions.txt", col_types = "cD")

res <- panos_trans[panos_trans[["date"]] == version_date,]

if (nrow(res) == 0) {
  cat(gg('{{"ip":"{ip}","port":"{port}","version"=null,"date"=null}}\n'))
} else {
  res$ip <- ip
  res$port <- port
  jsonlite::stream_out(res[,c("ip", "port", "version", "date")], verbose = FALSE)
}

Save that as panos-scanner.R and make it executable (provided you’re on a non-legacy operating system that doesn’t support such modern capabilities). Save panos_trans as a CSV file in the same directory and try it against another (sanitized IP/port) system:

./panos-scanner.R 10.20.30.40 5678                                                                                                                                                    1
{"ip":"10.20.30.40","port":"5678","version":"9.1.2","date":"2020-03-30"}

FIN

To be complete, the script should test all the URLs the ones in the script from Bishop Fox does and stand up many more guard rails to handle errors associated with unreachable hosts, getting headers from a system that is not a PAN-OS GP host, and ensuring the ETag is a valid date.

You can grab the code [from this repo](https://git.rud.is/hrbrmstr/2020-07-10-panos-rstats.

I’ve changed my years-long avatar to the blue Cap’ shield because the colors of our flag have no business being displayed in any venue until Donald Trump is no longer President (one way or another). The red/white/blue triad has been coopted by an authoritarian, sociopathic puppet and is now a symbol of fear, greed, hate, and evil. I refuse to be associated with it until the principles it is supposed to stand for are even remotely embodied by those who serve our country. I would like to hope that is in mid-January 2021, but I’m not optimistic we’ll have a peaceful change of power.

Be safe. Be well. Be an ally.

Tis been a long time coming, but a minor change to default S3 parameters in tibbles finally caused a push of {sergeant} — the R package that lets you use the Apache Drill REST API via {DBI}, {dplyr}, or directly — to CRAN.

The CRAN automatic processing system approved the release just under 19 minutes from submission.

What’s in version 0.9.0 (or, better: “what’s changed since CRAN 0.6.0?”):

  • API retries! (#ty @jameslamb)
  • Spelling! (#ty @trinker)
  • No Java! (me)
  • 0-row result handling! (#ty @scottcame)
  • progress bars! (#ty @scottcame)
  • fix auth issue! (#ty @Ahine)
  • better docker UX! (#ty @davidski)
  • column order preservation (fixes #18)
  • new `drill_functions()
  • much better error messages
  • 64-bit integer support
  • CTAS from a tbl (#29)
  • switch to {tinytest}
  • DBI tests

Some of this has been in past blog announcements, but now they’re all together in CRAN.

FIN

Apache Drill 1.18.0 (if there is one, the company that bought MapR has abandoned Drill so the fate of Drill is somewhat up in the air) will have support for streaming resultset, which will make the REST API actually usable for large data returns, so that’s on the menu for {sergeant} 1.0.0.

Getting {sergeant.caffeinated} on CRAN is also on the TODO as it contains the separated out JDBC bits.

Many thanks, again, to all the contributors who made 0.9.0 possible. As usual, file issues and/or PRs wherever you’re most comfortable.

I (and, apparently, Gandalf O_o) are pleased to announce that RSwitch version 1.7.0 has been released. (Direct Download)

RSwitch is a macOS menubar utility that:

  • makes it dead simple to manage multiple macOS R versions
  • use the latest RStudio daily builds
  • access remote RStudio Server sessions using in a purpose-built browser that lets you use keystrokes you’re used to in RStudio Desktop
  • see switch between active RStudio projects
  • access a catalog of helpful local and web resources

Version 1.7.0 fixes a few bugs and adds the following new features:

  • New command-line switcher. After a one-line command (shown here in the RSwitch Guide) you can now just rswitch #.#[.#] from the your favorite macOS terminal. This should make it easier to switch R versions in tests/local automation.

  • File uploads and exports to RStudio Server. RSwitch’s RStudio Server purpose-built browser and manager now supports file uploads and exports to RStudio Server sessions. This is super new functionality that has been tested with more than a few edge cases, but I never use this feature (that’s what scp and curl are for IMO), so if you have any problems using them, please drop an issue or comment.

  • Ensure RStudio owns .R and .Rmd files. macOS R folks who use other environments such as Xcode know all to well how other development tools just love to own .R files (and, sometimes, .Rmd files). While I use other editors to get R stuff done much more frequently these days, I still prefer it if RStudio owns these two file types, so there’s a preference you can set for it (off by default).

  • More resources. The “Reference Desk” adds a few more resources to save you some room in your browser bookmarks.

  • Works on 10.13. I relented and included macOS 10.13 in the build support. If Apple does, indeed, release a new alpha version of macOS this Spring, 10.13 support will be removed when 10.16 is released. You folks really need to upgrade macOS to stay safe.

The Guide has been updated to provide information on all the new features.

Please kick the tyres and let me know if you have any issues with the new release and, as you’ll see if you read the Guide, much of the new functionality came from user-suggestions, so don’t hesitate to drop ideas/wants as I may have just enough Swift coding talent to pull a few of them off.

(Here’s the direct download link, again, to prevent you having to scroll up to get your hands on RSwitch 1.7.0 :-)

R 4.0.0 has been out for a while, now, and — apart from a case where merge() was slower than dirt — it’s been really stable for at least me (I use it daily on macOS, Linux, and Windows). Sure, it came with some headline-grabbing features/upgrades, but I’ve started looking at what other useful nuggets might be in the changelog and decided to blog them as I find them.

Today’s nugget is the venerable stopifnot() function which was significantly enhanced by this PR by Neil Fultz.

Prior to R 4.0.0, if you wanted to use stopifnot() to perform some input validation (a.k.a. — in this case — [assertions])(https://en.wikipedia.org/wiki/Assertion_(software_development)) you’d do something like this (I’m borrowing from Neil’s example):

some_ƒ <- function(alpha, gradtol, steptol, interlim) {

  stopifnot(
    (is.numeric(alpha)),
    (length(alpha) == 1),
    (alpha > 0),
    (alpha < 1),
    (is.numeric(gradtol)),
    (length(gradtol) == 1),
    (gradtol > 0),
    (is.numeric(steptol)),
    (length(steptol) == 1),
    (steptol > 0),
    (is.numeric(interlim)),
    (length(interlim) == 1),
    (interlim > 0)
   )

  message("Do something awesome")

}

When run with acceptable inputs we get:

some_ƒ(0.5, 3, 10, 100)
## Do something awesome

But, when run with something out of kilter:

some_ƒ("a", 3, 10, 100)
##  Error in some_ƒ("a", 3, 10, 100) : (is.numeric(alpha)) is not TRUE

we get a semi-useful, but somewhat unfriendly message back. Sure, it points to the right expression, but we’re supposed to be the kinder, friendlier data science (and general purpose) language who cares a bit more about our users. To that end, many folks switch to doing something like this:

some_ƒ <- function(alpha, gradtol, steptol, interlim) {

  if (!is.numeric(alpha))   { stop('Error: alpha should be numeric') }
  if (length(alpha) != 1)   { stop('Error: alpha should be a single value'); }
  if (alpha < 0)            { stop('Error: alpha is negative'); }
  if (alpha > 1)            { stop('Error: alpha is greater than one'); }
  if (!is.numeric(gradtol)) { stop('Error: gradtol should be numeric') }
  if (length(gradtol) != 1) { stop('Error: gradtol should be a single value'); }
  if (gradtol <= 0)         { stop('Error: gradtol should be positive'); }
  if (!is.numeric(steptol)) { stop('Error: steptol should be numeric') }
  if (length(steptol) != 1) { stop('Error: steptol should be a single value'); }
  if (steptol <= 0)         { stop('Error: steptol should be positive'); }
  if (!is.numeric(iterlim)) { stop('Error: iterlim should be numeric') }
  if (length(iterlim) != 1) { stop('Error: iterlim should be a single value'); }
  if (iterlim <= 0)         { stop('Error: iterlim should be positive'); }

  message("Do something awesome")

}

which results in:

some_ƒ("a", 3, 10, 100)
##  Error in some_ƒ("a", 3, 10, 100) : Error: alpha should be numeric

(you can make even better error messages than that).

Neal thought there had to be a better way, and made one! The ... expressions can be named and those names will become the error message:

some_ƒ <- function(alpha, gradtol, steptol, interlim) {

  stopifnot(
    'alpha should be numeric'          = (is.numeric(alpha)),
    'alpha should be a single value'   = (length(alpha) == 1),
    'alpha is negative'                = (alpha > 0),
    'alpha is greater than one'        = (alpha < 1),
    'gradtol should be numeric'        = (is.numeric(gradtol)),
    'gradtol should be a single value' = (length(gradtol) == 1),
    'gradtol should be positive'       = (gradtol > 0),
    'steptol should be numeric'        = (is.numeric(steptol)),
    'steptol should be a single value' = (length(steptol) == 1),
    'steptol should be positive'       = (steptol > 0),
    'iterlim should be numeric'        = (is.numeric(interlim)),
    'iterlim should be a single value' = (length(interlim) == 1),
    'iterlim should be positive'       = (interlim > 0)
   )

  message("Do something awesome")

}

some_ƒ("a", 3, 10, 100)

##  Error in some_ƒ("a", 3, 10, 100) : alpha should be numeric

Way easier to write and way more respectful to the caller.

Gratuitous Statistics

CRAN has ~2,600 packages that use stopifnot() in their package /R/ code with the following selected distributions (charts are all log10 scale):

stopifnot usage: files using it per package

Here are the packages with 50 or more files using stopifnot():

   pkg              n
   <chr>        <int>
 1 spatstat       252
 2 pracma         145
 3 QuACN           80
 4 raster          74
 5 spdep           61
 6 lavaan          54
 7 surveillance    53
 8 copula          50

stopifnot calls per-file

Here are the packages with one or more files that have 100 or more calls to stopifnot() in them:

   pkg                 fil                            ct
   <chr>               <chr>                       <int>
 1 ff                  ordermerge.R                  278
 2 OneArmPhaseTwoStudy zzz.R                         142
 3 bit64               integer64.R                   137
 4 updog               rflexdog.R                    124
 5 RNetCDF             RNetCDF.R                     123
 6 Rlda                rlda.R                        105
 7 aster2              transform.R                   105
 8 ads                 fads.R                        104
 9 georob              georob_exported_functions.R   104
10 bit64               highlevel64.R                 101

O_O That’s quite a bit of checking!

FIN

If you’re working on switching to R 4.0.0 or have switched, this and many other new features await! Drop a note in the comments with your favorite new feature (or, even better, a link to a blog post on said feature!).

As I get time to dig out some more nuggets I’ll add more posts to this series.