
Category Archives: R

It’s been a while since I’ve posted anything R-related and, while this one will be brief, it may be of use to some R folks who have taken the leap into Big Sur and/or Apple Silicon. Stay to the end for an early Christmas 🎁!

Big Sur Report

As Twitter-folks know (before my permanent read hiatus began), I've been on Big Sur since the very first developer betas were out. I'm even on the latest beta (11.1b1) as I type this post.

Apple picked away at many nits that they introduced with Big Sur, but it’s still more of a “Vista” release than Catalina was given all the clicks involved with installing new software. However, despite making Simon’s life a bit more difficult (re: notarization of R binaries) I’m happy to report that R 4.0.3 and the latest RStudio 1.4 daily releases work great on Big Sur. To be on the safe side, I highly recommend putting the R command-line binaries and RStudio and R-GUI .apps into both “Developer Tools” and “Full Disk Access” in the Security & Privacy preferences pane. While not completely necessary, it may save some debugging (or clicks of “OK”) down the road.

The Xcode command-line tools are finally in a stable state and can be used instead of the massive Xcode.app installation. This was problematic for a few months, but Apple has been pretty consistent keeping it version-stable with Xcode-proper.

Homebrew is pretty much feature complete on Big Sur (for Intel-architecture Macs) and I’ve run everything from the {tidyverse} to {sf}-verse, ODBC/JDBC and more. Zero issues have come up, and with the pending public release (in a few weeks) of 11.1, I can safely say you should be fine moving your R analyses and workflows to Big Sur.

Apple Silicon Report

Along with all the other madness, 2020 has resurrected the "Processor Wars", with Apple making its own ARM 64-bit chips (the M1 series). The Mac mini flavor is fast, but I suspect some of the "feel" of that speed comes from the faster SSDs and beefed-up I/O plane to get to those SSDs. These newly released Mac models require Big Sur, so if you're not prepared to deal with that, you should hold off on a hardware upgrade.

Another big reason to hold off on a hardware upgrade is that the current M1 chips cannot handle more than 16GB of RAM. I do most of the workloads requiring memory heavy-lifting on a 128GB RAM, 8-core Ubuntu 20 server, so what would have been a deal breaker in the past is not much of one today. Still, R folks who have gotten used to 32GB+ of RAM on laptops/desktops will absolutely need to wait for the next-gen chips.

Most current macOS software — including R and RStudio — is going to run in the Rosetta 2 “translation environment”. I’ve not experienced the 20+ seconds of startup time that others have reported, but RStudio 1.4 did take noticeably (but not painfully) longer on the first run than it has on subsequent ones. Given how complex the RStudio app is (chromium engine, Qt, C++, Java, oh my!) I was concerned it might not work at all, but it does indeed work with absolutely no problems. Even the ODBC drivers (Athena, Drill, Postgres) I need to use in daily work all work great with R/RStudio.

This means things can only get even better (i.e. faster) as all these components are built with “Universal” processor support.

Homebrew can work, but it requires the use of the arch command to ensure everything runs in the Rosetta 2 translation environment, and nothing I've installed (from source) has required anything from Homebrew. Do not let that last sentence lull you into a false sense of excitement. So far, I've tested:

  • core {tidyverse}
  • {DBI}, {odbc}, {RJDBC}, and (hence) {dbplyr}
  • partially extended {ggplot2}-verse
  • {httr}, {rvest}, {xml2}
  • {V8}
  • a bunch of self-contained, C[++]-backed or just base R-backed stats packages

and they all work fine.

I have installed separate (non-Universal) binaries of fd, ripgrep, bat plus a few others, and almost have a working Rust toolchain up (both Rust and Go are very close to stable on Apple’s M1 architecture).

If there are specific packages/package ecosystems you'd like tested or benchmarked, drop a note in the comments. I'll likely be adding more field report updates over the coming weeks as I experiment with additional components.

Now, if you are a macOS R user, you already know — thanks to Tomas and Simon — that we are in wait mode for a working Fortran compiler before we will see a Universal build of R. The good news is that things are working fine under Rosetta 2 (so far).

RStudio Update

If there is any way you can use RStudio Desktop + Server Pro, I would heartily recommend that you do so. The remote R session in-app access capabilities are dreamy and addictive as all get-out.

I also (finally) made a (very stupid simple) PR into the dailies so that RStudio will be counted as a “Developer Tool” for Screen Time accounting.

RSwitch Update

🎁-time! As incentive to try Big Sur and/or Apple Silicon, I started work on version 2 of RSwitch which you can grab from — https://rud.is/rswitch/releases/RSwitch-2.0.0b.app.zip. For those new to RSwitch, it is a modern alternative to the legacy RSwitch that enabled easy switching of R versions on macOS.

The new version requires Big Sur as it's based on Apple's new "SwiftUI" and takes advantage of some components that Apple has not seen fit to make backwards compatible. The current version has fewer features than the 1.x series, but I'm still working out what should and should not be included in the app. Drop notes in the comments with feature requests (source will be up after 🦃 Day).

The biggest change is that the app is not just a menu but a popup window:

Each of those tappable download regions presents a cancellable download progress bar (all three can run at the same time), and any downloaded disk images will be auto-mounted. That third tappable download region is for downloading RStudio Pro dailies. You can get notifications for both (or neither) Community and Pro versions:

The R version switcher also provides more info about the installed versions:

(And, yes, the r79439 is what’s in Rversion.h of each of those installations. I kinda want to know why, now.)

The interface is very likely going to change dramatically, but the app is a Universal binary, so it will work on both M1 and Intel Big Sur Macs.

NOTE: it’s probably a good idea to put this into “Full Disk Access” in the Security & Privacy preferences pane, but you should likely wait until I post the source so you can either build it yourself or validate that it’s only doing what it says on the tin for yourself (the app is benign but you should be wary of any app you put on your Macs these days).

WARNING: For now it’s a Dark Mode only app so if you need it to be non-hideous in Light Mode, wait until next week as I add in support for both modes.

FIN

If you're running into issues, the macOS R mailing list is probably a better place to post problems with R on Big Sur/Apple Silicon, but feel free to drop a comment if there's something you want me to test or take a stab at, and/or change/add to RSwitch.

(The RSwitch 2.0.0b release yesterday had a bug that I think has been remedied in the current downloadable version.)

The CDC continues to “deliver” in 2020, this time by changing the JSON response of one of the hidden APIs that my {cdcfluview} package wraps. CDC: So helpful!

It was a quick fix, and version 0.9.2 passed automated CRAN checks in ~9.42 minutes! 💙 the CRAN team!

Plus, a special shout-out to Ian-McGovern (GH ID) for triaging the issue even before I had a chance to get 🙌 on ⌨️.

I also took the opportunity to switch to {progress} from {dplyr}’s progress_estimated() (which is deprecated).
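
If you're making the same migration in your own code, the pattern is essentially this (a minimal, hypothetical sketch; it is not the actual {cdcfluview} internals):

library(progress)

urls <- sprintf("https://example.com/page-%d", 1:10) # hypothetical work items

pb <- progress_bar$new(
  format = "  fetching [:bar] :percent eta: :eta",
  total = length(urls)
)

res <- lapply(urls, function(u) {
  pb$tick() # replaces the p$tick()$print() dance from progress_estimated()
  u         # real code would fetch/parse here
})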

You can wait for it on CRAN or find it on all the usual social source sharing suspects, including my own: https://git.rud.is/hrbrmstr/cdcfluview.

(This is part 2 of n “quick hit” posts, each walking through some approaches to speeding up components of an iterative operation. Go here for part 1).

Thanks to the aforementioned previous post, we now have a super fast way of reading individual text files containing HTTP headers from HEAD requests into a character vector:

library(Rcpp)

vapply(
  X = fils, 
  FUN = cpp_read_file, # see previous post for the source for this C++ Rcpp function
  FUN.VALUE = character(1), 
  USE.NAMES = FALSE
) -> hdrs

head(hdrs, 2)
## [1] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:40:45 GMT\r\nServer: Apache\r\nLast-Modified: Sun, 26 Apr 2020 00:06:47 GMT\r\nETag: \"ace-ec1a0-5a4265fd413c0\"\r\nAccept-Ranges: bytes\r\nContent-Length: 967072\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"                                   
## [2] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:43:46 GMT\r\nServer: Apache\r\nLast-Modified: Wed, 05 Jun 2019 03:52:22 GMT\r\nETag: \"423-d99a0-58a8b864f8980\"\r\nAccept-Ranges: bytes\r\nContent-Length: 891296\r\nX-XSS-Protection: 1; mode=block\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"

However, I need the headers and values broken out so I can eventually get to the analysis I need to do, and a data frame of name/value columns would be the most helpful format. We’ll use {stringi} to help us build a function (explanation of what it’s doing is in comment annotations) that turns each unkempt string into a very kempt data frame:

library(stringi)

parse_headers <- function(x) {

  # split each raw header string into a character vector of lines
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP x/x ...
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      data.frame(
        name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
        value = hdrs[,3]
      )
    } else { # if we don't have any
      NULL
    }

  })

}

parse_headers(hdrs[1:3])
## [[1]]
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload
## 
## [[2]]
##               name                         value
## 1             date Mon, 08 Jun 2020 14:43:46 GMT
## 2           server                        Apache
## 3    last_modified Wed, 05 Jun 2019 03:52:22 GMT
## 4             etag     "423-d99a0-58a8b864f8980"
## 5    accept_ranges                         bytes
## 6   content_length                        891296
## 7 x_xss_protection                 1; mode=block
## 8  x_frame_options                    SAMEORIGIN
## 9     content_type      application/x-msdownload
## 
## [[3]]
##           name                         value
## 1         date Mon, 08 Jun 2020 14:23:53 GMT
## 2       server                        Apache
## 3 content_type text/html; charset=iso-8859-1

parse_headers(hdrs[1])[[1]]
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload

Unfortunately, this takes almost 16 painful seconds to crunch through the ~75K text entries:

system.time(tmp <- parse_headers(hdrs))
##   user  system elapsed 
## 15.033   0.097  15.227 

since each call takes close to 150 microseconds:

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr     min       lq     mean  median      uq     max neval
##    ph 143.328 146.8995 154.8609 148.361 158.121 415.332  1000

A big reason it takes so long is the data frame creation. If you’ve never looked at the source for data.frame() have a go at it — https://github.com/wch/r-source/blob/86532f5aa3d9880f4c1c9e74a417005616846a34/src/library/base/R/dataframe.R#L435-L603 — before continuing.

Back? Great! The {base} data.frame() has tons of guard rails to make sure you're getting what you think you asked for across a myriad of use cases. I learned about a trick to make data frame creation faster when I started playing with the {ggplot2} source. Said trick has virtually no guard rails — it just adds a class and a row.names attribute to a list — so you really should only use it in cases like this where you have a very good idea of the structure and values of the data frame you're making. Here's an even more simplified version of the function in the {ggplot2} source:

fast_frame <- function(x = list()) {

  lengths <- vapply(x, length, integer(1))
  n <- if (length(x) == 0 || min(lengths) == 0) 0 else max(lengths)
  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n) # help(.set_row_names) for info

  x

}

Now, we’ll change parse_headers() a bit to use that function instead of data.frame():

parse_headers <- function(x) {

  # split each raw header string into a character vector of lines
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP x/x ...
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      fast_frame(
        list(
          name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
          value = hdrs[,3]
        )
      )
    } else { # if we don't have any
      NULL
    }

  })

}

Note that we had to pass in a list() to it vs bare name/value vectors.
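
If you want a quick sanity check of the barebones constructor before trusting it with real work, a toy call (illustrative values only) makes the contract clear:

# equal-length vectors in, data frame out; there are no guard rails, so it is
# on you to make sure the list elements really are the same length
fast_frame(list(name = c("server", "content_length"), value = c("Apache", "967072")))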

How much faster is it? Quite a bit:

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr   min      lq     mean median      uq      max neval
##    ph 27.94 28.7205 34.66066 29.024 29.3785 4144.402  1000

This speedup means the painful ~15s is now just a tolerable ~3s:

system.time(tmp <- parse_headers(hdrs))
##  user  system elapsed 
## 2.901   0.011   2.918 

FIN

Normally, guard rails are awesome, and you can have even safer code (which means safer and more reproducible analyses) when using {tidyverse} functions. As noted in the previous post, I'm doing a great deal of iterative work, have more than one set of headers I'm crunching on, and am testing out different approaches/theories, so going from 16 seconds to 3 seconds truly does speed up my efforts and has an even bigger impact when I process around 3 million raw header records.

I think I promised {future} work in this post (asynchronous pun not intended), but we’ll get to that eventually (probably the next post).

If you have your own favorite way to speedup data frame creation (or extracting target values from raw text records) drop a note in the comments!

(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.)

I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this:

HTTP/1.1 200 OK
Date: Mon, 08 Jun 2020 14:40:45 GMT
Server: Apache
Last-Modified: Sun, 26 Apr 2020 00:06:47 GMT
ETag: "ace-ec1a0-5a4265fd413c0"
Accept-Ranges: bytes
Content-Length: 967072
X-Frame-Options: SAMEORIGIN
Content-Type: application/x-msdownload

I do this quite a bit in R when we create new studies at work, but I’m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.

Reading in a bunch of files (each one into a string) is fairly straightforward in R since readChar() will do the work of reading and we just wrap that in an iterator:

length(fils)
## [1] 73514 

# check file size distribution
summary(
  vapply(
    X = fils,
    FUN = file.size,
    FUN.VALUE = numeric(1),
    USE.NAMES = FALSE
  )
)
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 19.0   266.0   297.0   294.8   330.0  1330.0 

# they're all super small

system.time(
  vapply(
    X = fils, 
    FUN = function(.f) readChar(.f, file.size(.f)), 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 2.754   1.716   4.475 

NOTE: You can use lapply() or sapply() to equal effect as they all come in around 5 seconds on a modern SSD-backed system.
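
For completeness, here's the sapply() flavor of the same read (as noted, it lands in the same ~5 second range):

system.time(
  sapply(
    X = fils, 
    FUN = function(.f) readChar(.f, file.size(.f)), 
    USE.NAMES = FALSE
  ) -> tmp 
)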

Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but can we do better? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:

  • base::readChar()
  • readr::read_file()
  • stringi::stri_read_raw() (+ rawToChar())

Do any of them beat {base}? Let’s see (using the largest of the files):

library(stringi)
library(readr)
library(microbenchmark)

largest <- fils[which.max(sapply(fils, file.size))]

file.size(largest)
## [1] 1330

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000
##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000
##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000

I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. ({readr} has quite a bit of functionality baked into it — for good reasons — which doesn’t help it win any performance contests).

I still felt there had to be an even faster method, especially since I knew that the files all held HTTP response headers and that every one of them could be easily read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like "sands" some corners, really). We'll do that right in R via {Rcpp} in this function (annotated in C++ code comments):

library(Rcpp)

cppFunction(code = '
String cpp_read_file(std::string fil) {

  // our input stream
  std::ifstream in(fil, std::ios::in | std::ios::binary);

  if (in) { // we can work with the file

  #ifdef Win32
    struct _stati64 st; // gosh i hate windows
    _stati64(fil.c_str(), &st); // should work, but I did not test it on Windows
  #else
    struct stat st;
    stat(fil.c_str(), &st);
  #endif

    std::string out(st.st_size, 0); // make a string buffer to hold the data

    in.seekg(0, std::ios::beg); // ensure we are at the beginning
    in.read(&out[0], st.st_size); // read in the file
    in.close();

    return(out);

  } else {
    return(NA_STRING); // file missing or other errors returns NA
  }

}
', includes = c(
  "#include <fstream>",
  "#include <string>",
  "#include <sys/stat.h>"
))

Is that going to be faster?

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  rcpp = cpp_read_file(largest),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  80.500  91.6910  96.82752  94.3475 100.6945 295.025  1000
##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000
##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000
##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000

It sure looks like it, but let’s put it to the test:

system.time(
  vapply(
    X = fils, 
    FUN = cpp_read_file, 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 0.446   1.244   1.693 

I’ll take a two-second wait over a five-second wait any day!

FIN

I have a few more cases coming up where there will be 3-5x the number of (similar) files that I’ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.

The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.

If I missed your favorite file slurping function, drop a note in the comments and I’ll update the post with new benchmarks.

The Marshall Project has a solid story and set of visualizations on the impact of COVID-19 in U.S. prisons. They keep the data (and vis) regularly updated. They do great work and this is an important topic, but this visualization breaks my “ordered grid” OCD:

To be fair, it’s not supposed to line up as the dots are part of an animation process that has them drop from top to bottom and appears to be designed to have an “organic” feel.

We can use the {waffle} package to iron out these wrinkled non-grids into some semblance of order, and try to replicate the chart as much as possible along the way.

Getting the Data

We first need the data and, thankfully, the MP folks provided it…just not in a way you’d expect (or that’s straightforward to use).

Do a “view source” on that URL in your browser and scroll down to line ~1,455 and you should see this:

That's the data, right on the page, encoded in javascript🤔. This makes sense as it is fueling a javascript visualization, and many sites are embedding data right on the page vs fetching it via an XHR request to make it easier for web archives to store and retrieve working visualizations. We can totally work with this data, and we'll do that now, along with getting some boilerplate out of the way:

library(V8)         # work with javascript data
library(stringi)    # string ops
library(rvest)      # web scrape
library(ggtext)     # pretty ggplot text with markdown
library(waffle)     # waffle charts // install_github("hrbrmstr/waffle")
library(hrbrthemes) # install_github("hrbrmstr/hrbrthemes") or don't use the font theme and pick another one
library(tidyverse)  # duh

gg <- glue::glue # for plot labels (later)

# get the page source
pg <- read_html("https://www.themarshallproject.org/2020/05/01/a-state-by-state-look-at-coronavirus-in-prisons")

# setup a V8 VM context
ctx <- v8()

# grab the "data" and make it a V8 VM object
html_nodes(pg, xpath=".//script[contains(., 'var STATES_DATA')]") %>%
  html_text() %>%
  ctx$eval()

# get the data into R
states_data <- ctx$get("STATES_DATA")

glimpse(states_data)
## Rows: 918
## Columns: 20
## $ ``                 <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",…
## $ name               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "G…
## $ abbreviation       <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "M…
## $ week_of            <chr> "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26", "2020-03-26"…
## $ unrevised_cases    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ cases              <chr> "0", "0", "0", "0", "1", "0", "NA", "0", "0", "4", "0", "0", "3", "0", "0", "0", "NA", "0", "0", "0", "9", "23…
## $ new_cases          <chr> "0", "0", "0", "0", "1", "0", "NA", "0", "0", "4", "0", "0", "3", "0", "0", "0", "NA", "0", "0", "0", "9", "23…
## $ filled_cases       <chr> "0", "0", "0", "0", "1", "0", "0", "0", "0", "4", "0", "0", "3", "0", "0", "0", "0", "0", "0", "0", "9", "23",…
## $ case_rate          <chr> "0", "0", "0", "0", "0.0812809883768187", "0", "NA", "0", "0", "0.728318857996031", "0", "0", "0.7865757734661…
## $ deaths             <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "NA", "0", "0", "0", "0", "0",…
## $ new_deaths         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "NA", "0", "0", "0", "0", "0",…
## $ filled_deaths      <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ death_rate         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0.182079714499008", "0", "0", "0", "0", "0", "0", "NA", "0", "0"…
## $ staff_multiples    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ prisoner_multiples <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ tested             <chr> "NA", "4", "15", "0", "166", "NA", "NA", "4", "NA", "NA", "NA", "10", "13", "NA", "NA", "0", "NA", "32", "NA",…
## $ as_of_date         <chr> "NA", "2019-12-31", "2020-04-15", "2020-02-29", "2020-04-15", "NA", "NA", "2020-04-15", "NA", "NA", "NA", "202…
## $ march_pop          <chr> "NA", "4997", "42282", "18181", "123030", "NA", "NA", "5042", "NA", "NA", "NA", "7816", "38140", "NA", "NA", "…
## $ april_pop          <chr> "NA", "4997", "41674", "18181", "118466", "NA", "NA", "4624", "NA", "NA", "NA", "7641", "36904", "NA", "NA", "…
## $ test_rate          <chr> "NA", "8.00480288172904", "3.54760891159359", "0", "13.4926440705519", "NA", "NA", "7.93335977786593", "NA", "…

The comments in the code go a long way, but the gist is that we extract the javascript block containing that var STATES_DATA… data, have {V8} wrangle it in javascript for us, then get the result and take a look at it. Now for the real work.

Making the Data Useful

We need — at a minimum — dates and numbers. We’re also going to mimic the visualization, so we’ll be dividing new case counts by 10 for the “1 dot == 10 cases” waffle chart and creating useful axis labels. This is pretty basic wrangling:

states_data %>% 
  select(week_of, new_cases) %>% 
  mutate(
    week_of = as.Date(week_of),
    new_cases = suppressWarnings(as.numeric(new_cases))
  ) %>%
  count(week_of, wt = new_cases) %>%
  arrange(week_of) %>%
  mutate(
    wk = format(week_of, "Week of\n%b %d"),
    div10 = as.integer(round(n/10)),
  ) %>%
  as_tibble() -> cases

glimpse(cases)
## Rows: 18
## Columns: 4
## $ week_of <date> 2020-03-26, 2020-04-01, 2020-04-08, 2020-04-15, 2020-04-22, 2020-04-29, 2020-05-06, 2020-05-13, 2020-05-20, 2020-05-27, …
## $ n       <dbl> 56, 268, 810, 1672, 6872, 4788, 5538, 5115, 3940, 5323, 6027, 3335, 2258, 2452, 3856, 4488, 7324, 6595
## $ wk      <chr> "Week of\nMar 26", "Week of\nApr 01", "Week of\nApr 08", "Week of\nApr 15", "Week of\nApr 22", "Week of\nApr 29", "Week o…
## $ div10   <int> 6, 27, 81, 167, 687, 479, 554, 512, 394, 532, 603, 334, 226, 245, 386, 449, 732, 660

Using the {waffle} package to make “waffle bar charts” means we’ll end up with panels/strips which will become “axis labels”. I like the fact that the MP folks did not label each week, so we’ll have to account for that as well. One of the simplest ways to do that is to make those labels spaces, but a unique number of them since we’re going to make an ordered factor to ensure the strips are in the right order. This is also pretty straightforward:

cases$wk[c(1, 3:5, 7:9, 11:13, 15:17)] <- stri_pad("", 1:13)
cases$wk <- fct_inorder(cases$wk)

The vector of numbers in the first line identifies the weeks we want to be blank; we turn those labels into space-padded strings, each with an increasing number of spaces, and then turn the entire vector of weeks into a factor in the right order.
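
The reason the pad widths increase is that each "blank" label still has to be a distinct string (and, hence, a distinct factor level); if they were all identical, facet_wrap() would collapse those weeks into a single facet:

# one, two, and three spaces: visually "blank", but three distinct strings
stri_pad("", 1:3)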

Making the Chart

The rest is all {ggplot2} magic, so let’s get the whole plot code out of the way before talking about it:

ggplot() +
  geom_waffle(
    data = cases,
    aes(fill = "new cases", values = div10),
    flip = TRUE, n_cols = 10, radius = unit(3, "pt"),           
    color = "white"
  ) +
  geom_text(
    data = tibble(
      idx = c(1, 17, 18),
      wk = cases$wk[idx],
      y = (cases$div10[idx] %/% 10),
      actual_cases = cases$n[idx],
      lab = gg("{scales::comma(actual_cases, 1)} new\ncases")
    ),
    aes(1, y, label = lab),
    vjust = 0, hjust = 0, nudge_y = 2,
    size = 3.5, family = font_gs, lineheight = 0.875
  ) +
  scale_y_continuous(
    expand = c(0, 0.125),
    breaks = seq(0, 70, 10),
    labels = scales::comma(seq(0, 7000, 1000)),
    limits = c(0, 80)
  ) +
  scale_fill_manual(
    values = c("#366b7b")
  ) +
  facet_wrap(~wk, nrow=1, strip.position = "bottom") +
  coord_fixed() +
  labs(
    x = NULL, y = NULL,
    title = "There have been at least<br/><span style='color:#366b7b;font-size:32pt;'>**70,717**</span> <span style='font-size:24pt'>cases</span><br/>of coronavirus reported among prisoners.<br/><span style='font-size:18pt;'>**46,846** prisoners have recovered.</span>",
    subtitle = "Each <span style='color:#366b7b;font-size:16pt;'>&bull;</span> represents 10 new cases.",
    caption = "Source: (data) <www.themarshallproject.org/2020/05/01/a-state-by-state-look-at-coronavirus-in-prisons>"
  ) +
  theme_ipsum_gs(
    grid="Y",
    strip_text_family = font_gs, strip_text_face = "plain",
    plot_title_family = font_gs, plot_title_face = "plain",
    subtitle_family = font_an, subtitle_face = "plain", subtitle_size = 10
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    panel.spacing.x = unit(20, "pt"),
    plot.title = element_markdown(),
    plot.subtitle = element_markdown(),
  )

There’s quite a bit going on there, so let’s break it down:

  • We’re telling geom_waffle() to use our data, and giving it a single category to fill (as there is only one) along with the number of elements in the category. The radius parameter lets us have non-square “dots”, and n_cols + flip sets up the grid to match the one from MP.
  • We need labels on top, too (just three of them) so we’ll pick the vector indices of the ones with labels and get the week strip labels, y positions, new case counts for that day, and an appropriately formatted label and plot them. We’re starting the label at the first X position in each strip and plotting the labels at the height of the “bar”.
  • We’re customizing the Y scale to reflect the 1 == 10 representation of the data and using the same blue as MP did for the fill scale.
  • To get them all to mimic a real X axis, we’re ensuring there’s only one row of facets and putting the facet labels at the bottom.
  • By using coord_fixed() we can get circles (or as close to them as you like).
  • We’re using some markdown in the labs(), courtesy of {ggtext}’s element_markdown() and setting some font stylings in the base theme (use a different one if you get font errors or read the docs). We rely on this to “fake” a legend.
  • Finally, we tweak strip positions and some formatting to produce:

(You likely need to view that in your own plot window in R/RStudio or zoom in a bit)

FIN

If you spend some more time on it you can get super-close to the Marshall Project’s finished product.

A bonus from scraping is that you also get two more datasets from the page: STATES_RATES and STATE_NOTES:

glimpse(ctx$get("STATES_RATES"))
## Rows: 51
## Columns: 23
## $ ``                 <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",…
## $ name               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "G…
## $ abbreviation       <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "M…
## $ unrevised_cases    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ cases              <chr> "165", "16", "684", "3789", "7066", "783", "1344", "508", "3898", "1113", "0", "763", "343", "729", "333", "91…
## $ new_cases          <chr> "27", "14", "115", "307", "608", "115", "1", "213", "1266", "49", "0", "69", "6", "1", "109", "1", "146", "50"…
## $ filled_cases       <chr> "165", "16", "684", "3789", "7066", "783", "1344", "508", "3898", "1113", "0", "763", "343", "729", "333", "91…
## $ case_rate          <chr> "77.9994327313983", "32.0192115269162", "164.131112924125", "2084.04378197019", "596.458055475833", "449.68986…
## $ deaths             <chr> "14", "0", "13", "25", "40", "3", "7", "7", "34", "26", "0", "0", "13", "20", "1", "4", "6", "16", "0", "8", "…
## $ new_deaths         <chr> "2", "0", "0", "9", "5", "0", "0", "0", "5", "1", "0", "0", "0", "0", "0", "0", "2", "0", "0", "0", "0", "0", …
## $ filled_deaths      <chr> "14", "0", "13", "25", "40", "3", "7", "7", "34", "26", "0", "0", "13", "20", "1", "4", "6", "16", "0", "8", "…
## $ death_rate         <chr> "6.61813368630046", "0", "3.11945097662811", "13.750618777845", "3.376496209883", "1.72294968986906", "5.72925…
## $ staff_multiples    <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "N…
## $ prisoner_multiples <chr> "NA", "NA", "NA", "NA", "NA", "13723", "NA", "NA", "NA", "NA", "NA", "1884", "NA", "NA", "NA", "NA", "NA", "NA…
## $ as_of_date         <chr> "2020-01-31", "NA", "2020-04-15", "2020-02-29", "2020-04-15", "2020-03-31", "2020-04-01", "NA", "NA", "NA", "2…
## $ march_pop          <chr> "21154", "NA", "42282", "18181", "123030", "17600", "12422", "NA", "NA", "NA", "4631", "7816", "38140", "26891…
## $ april_pop          <chr> "21154", "NA", "41674", "18181", "118466", "17412", "12218", "NA", "NA", "NA", "4631", "7641", "36904", "26891…
## $ test_rate          <chr> "313.888626264536", "NA", "1176.99284925853", "4720.86243880975", "5048.53713301707", "4241.90213645762", "819…
## $ recovered          <chr> "41", "2", "376", "2970", "4940", "628", "1324", "391", "NA", "881", "0", "100", "307", "716", "208", "906", "…
## $ date               <chr> "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "20200721", "2…
## $ case_ratio         <chr> "-45.6429050602488", "-7.23368674670148", "-19.650267894127", "1714.81334912848", "488.054058525172", "538.378…
## $ death_ratio        <chr> "149.040167449448", "-100", "-22.1877969354022", "1009.53669385711", "72.0346501657667", "-38.5633906796423", …
## $ test_ratio         <chr> "-74.3653779188402", "NA", "6.07104744754024", "224.056036286688", "205.156799892597", "441.281822953195", "34…

glimpse(ctx$get("STATE_NOTES"))
## Rows: 18
## Columns: 3
## $ state <chr> "TN", "TN", "VA", "NM", "NM", "MN", "MN", "VT", "RI", "RI", "MI", "MD", "CT", "AK", "DE", "HI", "LA", "MA"
## $ type  <chr> "prisoners", "staff", "staff", "prisoners", "staff", "prisoners", "staff", "prisoners", "prisoners", "staff", "prisoners", …
## $ text  <chr> "After testing everyone in all of their prisons, Tennessee has said it is releasing the total number of tests conducted and…

which means you can recreate all the visualizations on the page for practice (or to explore them more). You can check out [{devoutsvg}](https://github.com/coolbutuseless/devoutsvg) if you want to try to mimic the gradient fills as well, though they will be supported directly in R in the next major version (and are now in R-devel).

The incredibly talented folks over at Bishop Fox were quite generous this week, providing a scanner for figuring out PAN-OS GlobalProtect versions. I’ve been using their decoding technique and date-based fingerprint table to keep an eye on patch status (over at $DAYJOB we help customers, organizations, and national cybersecurity centers get ahead of issues as best as we can).

We have at-scale platforms for scanning the internet and aren't running the panos-scanner repo code, but since there is Python code for doing this, I thought it might be fun to show R folks how to use the {httr} package to build something similar (we won't hit all the URLs their script assesses, primarily for brevity).

What Are We Doing Again?

Palo Alto makes many things, most of which are built on their custom linux distribution dubbed PAN-OS. One of the things they build is a VPN product that lets users remotely access internal company resources. It’s had quite the number of pretty horrible problems of late.

Folks contracted to assess security defenses (colloquially dubbed "pen-testers" tho no pens are usually involved) and practitioners within an organization who want to gain an independent view of what their internet perimeter looks like often assemble tools to perform ad-hoc assessments. Sure, there are commercial tools for performing these assessments ($DAYJOB makes some!), but these open source tools make it possible for folks to learn from each other and for makers of products (like PAN-OS) to do a better job securing their creations.

In this case, the creation is a script that lets the caller figure out what version of PAN-OS is running on a given GlobalProtect box.

To follow along at home, you’ll need access to a PAN-OS system, as I’m not providing an IP address of one for you. It’s really not hard to find one (just google a bit or stand up a trial one from the vendor). Throughout the examples I’ll be using {glue} to replace ip and port in various function calls, so let’s get some setup bits out of the way:

library(httr)
library(tidyverse) # for read_fwf() (et al), pluck(), filter(), and %>%

gg <- glue::glue # no need to bring in the entire namespace just for this

Assuming you have a valid ip and port, let’s try making a request against your PAN-OS GlobalProtect (hereafter using “GP” so save keystrokes) system:

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp")
) -> res
## Error in curl::curl_fetch_memory(url, handle = handle) : 
##  SSL certificate problem: self signed certificate

We’re using a HEAD request as we really don’t need the contents of the remote file (unless you need to verify it truly is a PAN-OS GP server), just the metadata about it. You can use a traditional GET request if you like, though.

We immediately run into a snag since these boxes tend to use a self-signed SSL/TLS certificate which web clients aren’t thrilled about dealing with unless explicitly configured to. We can circumvent this with some configuration options, but you should not use the following incantations haphazardly. SSL/TLS no longer really means what it used to (thanks, Let’s Encrypt!) but you have no guarantees of what’s being delivered to you is legitimate if you hit a plaintext web site or one with an invalid certificate. Enough with the soapbox, let’s make the request:

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp"),
  config = httr::config(
    ssl_verifyhost = FALSE, 
    ssl_verifypeer = FALSE
  )
) -> res

httr::status_code(res)
## [1] 200

In that request, we’ve told the underlying {curl} library calls to not verify the validity of the host or peer certificates associated with the service. Again, don’t do this haphazardly to get around generic SSL/TLS problems when making normal API calls or scraping sites.

Since we only made a HEAD request, we’re just getting back headers, so let’s take a look at them:

str(httr::headers(res), 1)
## List of 18
##  $ date                     : chr "Fri, 10 Jul 2020 15:02:32 GMT"
##  $ content-type             : chr "text/html; charset=UTF-8"
##  $ content-length           : chr "11749"
##  $ connection               : chr "keep-alive"
##  $ etag                     : chr "\"7e0d5e2b6add\""
##  $ pragma                   : chr "no-cache"
##  $ cache-control            : chr "no-store, no-cache, must-revalidate, post-check=0, pre-check=0"
##  $ expires                  : chr "Thu, 19 Nov 1981 08:52:00 GMT"
##  $ x-frame-options          : chr "DENY"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; secure; HttpOnly"
##  $ set-cookie               : chr "PHPSESSID=bde5668131c14b765e3e75f8ed5514a0; path=/; samesite=lax; secure; httponly"
##  $ strict-transport-security: chr "max-age=31536000;"
##  $ x-xss-protection         : chr "1; mode=block;"
##  $ x-content-type-options   : chr "nosniff"
##  $ content-security-policy  : chr "default-src 'self'; script-src 'self' 'unsafe-inline'; img-src * data:; style-src 'self' 'unsafe-inline';"
##  - attr(*, "class")= chr [1:2] "insensitive" "list"

As an aside, I’ve always found the use of PHP code in security products quite, er, fascinating.

The value we’re really looking for here is etag (which really looks like ETag in the raw response).

Bishop Fox (and others) figured out that that header value contains a timestamp in the last 8 characters. That timestamp maps to the release date of the particular PAN-OS version. Since Palo Alto maintains multiple, supported versions of PAN-OS and generally releases patches for them all at the same time, the mapping to an exact version is not super precise, but it’s sufficient to get an idea of whether that system is at a current, supported patch level.

The last 8 characters of 7e0d5e2b6add are 5e2b6add, which — as Bishop Fox notes in their repo — is just a hexadecimal encoding of the POSIX timestamp, in this case, 1579903709 or 2020-01-24 22:08:29 GMT (we only care about the date, so really 2020-01-24).

We can compute that with R, but first we need to note that the value is surrounded by " quotes, so we’ll have to deal with that during the processing:

httr::headers(res) %>% 
  pluck("etag") %>% 
  gsub('"', '', .) %>% 
  substr(5, 12) %>% 
  as.hexmode() %>% 
  as.integer() %>% 
  anytime::anytime(tz = "GMT") %>% 
  as.Date() -> version_date

version_date
## [1] "2020-01-24"

To get the associated version(s), we need to look the date up in their table, which is in a fixed-width format that we can read via:

read_fwf(
  file = "https://raw.githubusercontent.com/noperator/panos-scanner/master/version-table.txt",
  col_positions = fwf_widths(c(10, 14), c("version", "date")),
  col_types = "cc",
  trim_ws = TRUE
) %>% 
  mutate(
    date = lubridate::mdy(date)
  ) -> panos_trans

panos_trans
## # A tibble: 153 x 2
##    version  date      
##    <chr>    <date>    
##  1 6.0.0    2013-12-23
##  2 6.0.1    2014-02-26
##  3 6.0.2    2014-04-18
##  4 6.0.3    2014-05-29
##  5 6.0.4    2014-07-30
##  6 6.0.5    2014-09-04
##  7 6.0.5-h3 2014-10-07
##  8 6.0.6    2014-10-07
##  9 6.0.7    2014-11-18
## 10 6.0.8    2015-01-13
## # … with 143 more rows

Now, let’s see what version or versions this might be:

filter(panos_trans, date == version_date)
## # A tibble: 2 x 2
##   version date      
##   <chr>   <date>    
## 1 9.0.6   2020-01-24
## 2 9.1.1   2020-01-24

Putting It All Together

We can make a command line script for this (example) scanner:

#!/usr/bin/env Rscript
library(purrr)

gg <- glue::glue

# we also use {httr}, {readr}, {lubridate}, {anytime}, and {jsonlite}

args <- commandArgs(trailingOnly = TRUE)

stopifnot(
  c(
    "Must supply both IP address and port" = length(args) == 2
  )
)

ip <- args[1]
port <- args[2]

httr::HEAD(
  url = gg("https://{ip}:{port}/global-protect/login.esp"),
  config = httr::config(
    ssl_verifyhost = FALSE, 
    ssl_verifypeer = FALSE
  )
) -> res

httr::headers(res) %>% 
  pluck("etag") %>% 
  gsub('"', '', .) %>% 
  substr(5, 12) %>% 
  as.hexmode() %>% 
  as.integer() %>% 
  anytime::anytime(tz = "GMT") %>% 
  as.Date() -> version_date

panos_trans <- readr::read_csv("panos-versions.txt", col_types = "cD")

res <- panos_trans[panos_trans[["date"]] == version_date,]

if (nrow(res) == 0) {
  cat(gg('{{"ip":"{ip}","port":"{port}","version"=null,"date"=null}}\n'))
} else {
  res$ip <- ip
  res$port <- port
  jsonlite::stream_out(res[,c("ip", "port", "version", "date")], verbose = FALSE)
}

Save that as panos-scanner.R and make it executable (provided you're on a non-legacy operating system that supports such modern capabilities). Save panos_trans as a CSV file in the same directory and try it against another (sanitized IP/port) system:

./panos-scanner.R 10.20.30.40 5678
{"ip":"10.20.30.40","port":"5678","version":"9.1.2","date":"2020-03-30"}

FIN

To be complete, the script should test all the URLs the Bishop Fox script does and stand up many more guard rails to handle errors associated with unreachable hosts, getting headers from a system that is not a PAN-OS GP host, and ensuring the ETag is a valid date.
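
Here's an illustrative sketch (my own take, not a port of the Bishop Fox logic) of what some of those guard rails might look like, using only calls we've already seen plus tryCatch() and a timeout; extending it to more candidate paths is just an lapply() over a character vector of paths:

# illustrative sketch only (not the full Bishop Fox URL list): time out slow hosts,
# survive connection errors, and bail when the response has no PAN-OS-looking ETag
safe_version_date <- function(ip, port, path = "global-protect/login.esp") {

  res <- tryCatch(
    httr::HEAD(
      url = gg("https://{ip}:{port}/{path}"),
      httr::timeout(10),
      config = httr::config(ssl_verifyhost = FALSE, ssl_verifypeer = FALSE)
    ),
    error = function(e) NULL # unreachable host, TLS handshake failure, etc.
  )
  if (is.null(res)) return(as.Date(NA))

  etag <- httr::headers(res)[["etag"]]
  if (is.null(etag)) return(as.Date(NA)) # no ETag header at all

  etag <- gsub('"', '', etag)
  if (nchar(etag) < 8) return(as.Date(NA)) # too short to hold a hex timestamp

  ts <- strtoi(substr(etag, nchar(etag) - 7, nchar(etag)), base = 16L)
  if (is.na(ts)) return(as.Date(NA)) # ETag tail was not a valid hex timestamp

  as.Date(anytime::anytime(ts, tz = "GMT"))
}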

You can grab the code [from this repo](https://git.rud.is/hrbrmstr/2020-07-10-panos-rstats).

Tis been a long time coming, but a minor change to default S3 parameters in tibbles finally caused a push of {sergeant} — the R package that lets you use the Apache Drill REST API via {DBI}, {dplyr}, or directly — to CRAN.

The CRAN automatic processing system approved the release just under 19 minutes from submission.

What’s in version 0.9.0 (or, better: “what’s changed since CRAN 0.6.0?”):

  • API retries! (#ty @jameslamb)
  • Spelling! (#ty @trinker)
  • No Java! (me)
  • 0-row result handling! (#ty @scottcame)
  • progress bars! (#ty @scottcame)
  • fix auth issue! (#ty @Ahine)
  • better docker UX! (#ty @davidski)
  • column order preservation (fixes #18)
  • new `drill_functions()`
  • much better error messages
  • 64-bit integer support
  • CTAS from a tbl (#29)
  • switch to {tinytest}
  • DBI tests

Some of this has been in past blog announcements, but now it's all together on CRAN.

FIN

Apache Drill 1.18.0 (if there is one; the company that bought MapR has abandoned Drill, so the fate of Drill is somewhat up in the air) will have support for streaming resultsets, which will make the REST API actually usable for large data returns, so that's on the menu for {sergeant} 1.0.0.

Getting {sergeant.caffeinated} on CRAN is also on the TODO as it contains the separated out JDBC bits.

Many thanks, again, to all the contributors who made 0.9.0 possible. As usual, file issues and/or PRs wherever you’re most comfortable.

I (and, apparently, Gandalf O_o) are pleased to announce that RSwitch version 1.7.0 has been released. (Direct Download)

RSwitch is a macOS menubar utility that:

  • makes it dead simple to manage multiple macOS R versions
  • use the latest RStudio daily builds
  • access remote RStudio Server sessions in a purpose-built browser that lets you use the keystrokes you're used to in RStudio Desktop
  • see and switch between active RStudio projects
  • access a catalog of helpful local and web resources

Version 1.7.0 fixes a few bugs and adds the following new features:

  • New command-line switcher. After a one-line command (shown here in the RSwitch Guide) you can now just run rswitch #.#[.#] from your favorite macOS terminal. This should make it easier to switch R versions in tests/local automation.

  • File uploads and exports to RStudio Server. RSwitch’s RStudio Server purpose-built browser and manager now supports file uploads and exports to RStudio Server sessions. This is super new functionality that has been tested with more than a few edge cases, but I never use this feature (that’s what scp and curl are for IMO), so if you have any problems using them, please drop an issue or comment.

  • Ensure RStudio owns .R and .Rmd files. macOS R folks who use other environments such as Xcode know all too well how other development tools just love to own .R files (and, sometimes, .Rmd files). While I use other editors to get R stuff done much more frequently these days, I still prefer it if RStudio owns these two file types, so there's a preference you can set for it (off by default).

  • More resources. The “Reference Desk” adds a few more resources to save you some room in your browser bookmarks.

  • Works on 10.13. I relented and included macOS 10.13 in the build support. If Apple does, indeed, release a new alpha version of macOS this Spring, 10.13 support will be removed when 10.16 is released. You folks really need to upgrade macOS to stay safe.

The Guide has been updated to provide information on all the new features.

Please kick the tyres and let me know if you have any issues with the new release and, as you’ll see if you read the Guide, much of the new functionality came from user-suggestions, so don’t hesitate to drop ideas/wants as I may have just enough Swift coding talent to pull a few of them off.

(Here’s the direct download link, again, to prevent you having to scroll up to get your hands on RSwitch 1.7.0 :-)