
(Leading this with the periodic warning/reminder that this blog occasionally breaks from technical content and has category-based RSS feeds which can be used to ensure one never sees non-technical content.)

Every decent human (which excludes 74,222,958 🇺🇸 who voted for this, now 100% undeniable, traitor) with knowledge of this past week’s tragic events is likely still processing — and will be for a while — what happened; I am no exception. The Feedly board I set up to save content I’ve been poring over has 113 articles in it, so far.

Different aspects of the costume-clad, treasonous chaos have hit me daily, if not hourly.

Two newspaper paragraphs, each one about a different victim, have bubbled up to surface thoughts more often than most of the other stories of the week.

One is about Erin Schaff, a brave, talented journalist from the New York Times:

Grabbing my press pass, they saw that my ID said The New York Times and became really angry. They threw me to the floor, trying to take my cameras. I started screaming for help as loudly as I could. No one came. People just watched. At this point, I thought I could be killed and no one would stop them. They ripped one of my cameras away from me, broke a lens on the other and ran away. [NYTimes]

No one came.

People just watched.

While I am deeply shocked, outraged, and saddened by what happened to Erin, I am not surprised, given that the President of the United States wants journalists to be executed for regularly giving his 2017-2020 reality show bad reviews by stating undeniable facts. Furthermore, he has continually cultivated disdain and hatred for the media in his regiment of cult followers.

Erin is lucky to be alive, even if that means living in the ruins of a failed, so-called democracy.

President Trump is responsible for Erin’s assault, and he is going to get away with it.

The other is about Ashli Babbitt, the troubled insurrectionist who died assaulting the Capitol:

With help from someone who hoisted her up, Babbitt began to step through a portion of the door where the glass had been broken out. An officer on the other side, who was wearing a suit and a surgical mask, immediately shot Babbitt in the neck. She fell to the floor. [WaPo]

After the February 2020 impeachment proceedings failed to do anything substantive, the President boasted of feeling “untouchable”; and, at a campaign rally in 2016, then-Republican presidential candidate Trump boasted that he could “shoot somebody and not lose any voters.”

Trump has taken the lives of hundreds of thousands of Americans. And, while the gun wasn’t in his stubby hand, he is fully responsible for this woman’s shooting and death.

So, Trump was right: he is going to get away with it.

Mike Pence, Ted Cruz, Josh Hawley, Lindsey Graham, Susan Collins, Mitch McConnell, and a few hundred other evil, self-serving, elected cowards are all unindicted co-conspirators in Erin’s assault and Ashli’s death, as are countless “news” and talk show hosts.

Beyond what happened to these two women, this traitorous cabal also helped orchestrate this week’s current crescendo to Trump’s term in office.

I say “current” because there’s a non-zero chance of increased violence and bloodshed before January 20th despite Trump being deplatformed.

I have lost all hope that Trump will face any tangible consequences for his actions, which will only serve to embolden other wanna-be dictators like Cruz and Hawley.

What’s worse is that even after Biden’s victory was finally 100% sealed and one of America’s most cherished institutions was ransacked, the Trump supporters near me (rural-ish Maine) still, proudly, have their 2020 Trump campaign signs up and were very likely laughing and cheering the insurrection while Erin was being assaulted and Ashli’s life was ebbing away.

Even the Court Evangelicals have doubled down in their support of Trump.

I (literally) pray I’m wrong, but it seems inevitable that the violence and bloodshed will continue through and after the 20th. As Biden tries to (also, literally) heal America by bringing science-fueled, centralized, enforced standards to quell the carnage of Covid, we will very likely and regularly see regional repeats of this week’s contemptible acts. As he and his administration attempt to right the many, many wrongs of the past four years (and more), these necessary actions will further push the ilk of this week to regularly manifest their entitlement-fueled rage.

But we are not in an antebellum period; the war has already begun.

I went completely daft this week and broke my months-long Twitter break due to the domestic terror event in my nation’s capital. I’ll likely be resuming the break starting today.

Whilst keeping up with the final descent of the U.S. into a fully failed state, I also noticed that a debate from months ago on CRAN URL checks was still going strong.

I briefly chimed in those months ago and again this week on the dangers of short URLs (not exactly the core topic of the debate, which centered on HTTP URL redirects, a protocol feature that URL shorteners happen to take advantage of).

Short URLs make a URL easier to type or remember (if you can still get a decent, short keyword to use after the /), but they’re dangerous. In case you’re one of the R folks who challenge my security chops, perhaps you’ll believe Bruce.

NOTE: Regular ol’ URLs can be, and are, dangerous, too, especially if they’re used in an http:// context vs an https:// context or are run by daft folks who think they’re capable of making a system fully impervious to attackers.
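
If you want to see where a shortened link actually lands before trusting it, {httr} will chase the redirect chain for you. Here’s a quick, hedged sketch (the bit.ly URL is one that turns up in the package scan below); HEAD() follows redirects by default and stashes the final location in the response:

library(httr)

# resolve a short URL to its ultimate destination before trusting it;
# `url` in the response is the final hop after all redirects
res <- HEAD("http://bit.ly/SnLi6h")

res$url          # where the short link actually lands
res$status_code  # 200 means the final hop resolved cleanly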

The pandemic has made “cyber” fairly hectic, so my plan to wrap up a safety checker and local package URL re-writer into a small, usable tool/package has no ETA on completion. However, that doesn’t mean you can’t gain visibility into the number, types, and safety of URLs in your locally installed packages.

The code below has exposition in the comments – and you can find it here as well — so I’ll close with it vs my usual “FIN”.

Stay safe out there, folks; and — to my not-so-‘United’-after-all States readers — stay strong! The nightmare of the last four years is almost over (though the cleanup — now both physical and metaphorical — is going to take a long time).

library(urltools)
library(stringi)
library(tidyverse)
# we're also using {clipr} and {tools} but via ::: and ::

# fairly comprehensive list of URL shorteners
shorteners <- read_lines("https://github.com/sambokai/ShortURL-Services-List/raw/master/shorturl-services-list.txt")

# opaque function baked into {tools}
# NOTE: this can take a while
db <- tools:::url_db_from_installed_packages(rownames(installed.packages()), verbose = TRUE)

as_tibble(db) %>% 
  distinct() %>%  # yep, even w/in a pkg there may be dups from ^^
  mutate(
    scheme = scheme(URL), # https or not
    dom = domain(URL)     # need this later to be able to compute apex domain
  )  %>% 
  filter(
    dom != "..", # prbly legit since it will be a relative "go up one directory" 
    !is.na(dom)  # the {tools} url_db_from_installed_packages() is not perfect
  ) %>% 
  bind_cols(
    suffix_extract(.$dom) # break them all down into component atoms
  ) %>% 
  select(-dom) %>% # this is now 'host' from ^^
  mutate(
    apex = sprintf("%s.%s", domain, suffix) # apex domain
  ) %>% 
  mutate(
    is_short = (host %in% shorteners) | (apex %in% shorteners) # does it use a shortener?
  ) -> db

db
## # A tibble: 12,623 x 9
##    URL        Parent    scheme host  subdomain domain suffix apex  is_short
##    <chr>      <chr>     <chr>  <chr> <chr>     <chr>  <chr>  <chr> <lgl>   
##  1 https://g… albersus… https  gith… NA        github com    gith… FALSE   
##  2 https://g… albersus… https  gith… NA        github com    gith… FALSE   
##  3 https://w… AnomalyD… https  www.… www       usenix org    usen… FALSE   
##  4 https://w… AnomalyD… https  www.… www       jstor  org    jsto… FALSE   
##  5 https://w… AnomalyD… https  www.… www       usenix org    usen… FALSE   
##  6 https://w… AnomalyD… https  www.… www       jstor  org    jsto… FALSE   
##  7 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
##  8 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
##  9 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
## 10 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
## # … with 12,613 more rows

# what packages do i have installed that use short URLS?
# a nice thing to do would be to file a PR to these authors

filter(db, is_short) %>% 
  select(
    URL,
    Parent,
    scheme
  )
## # A tibble: 5 x 3
##   URL                         Parent                   scheme
##   <chr>                       <chr>                    <chr> 
## 1 https://goo.gl/5KBjL5       fpp2/man/goog.Rd         https 
## 2 http://bit.ly/2016votecount geofacet/man/election.Rd http  
## 3 http://bit.ly/SnLi6h        knitr/man/knit.Rd        http  
## 4 https://bit.ly/magickintro  magick/man/magick.Rd     https 
## 5 http://bit.ly/2UaiYbo       ssh/doc/intro.html       http  

# what protocols are in use? (you'll note that some are borked and
# others got mangled by the {tools} function)

count(db, scheme, sort=TRUE)
## # A tibble: 5 x 2
##   scheme     n
##   <chr>  <int>
## 1 https  10007
## 2 http    2498
## 3 NA       113
## 4 ftp        4
## 5 `https     1

# what are the most used top-level sites?

count(db, host, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 1,108 x 3
##    host                      n     pct
##    <chr>                 <int>   <dbl>
##  1 docs.aws.amazon.com    3859 0.306  
##  2 github.com             2954 0.234  
##  3 cran.r-project.org      450 0.0356 
##  4 en.wikipedia.org        220 0.0174 
##  5 aws.amazon.com          204 0.0162 
##  6 doi.org                 181 0.0143 
##  7 wikipedia.org           132 0.0105 
##  8 developers.google.com   114 0.00903
##  9 stackoverflow.com       101 0.00800
## 10 gitlab.com               86 0.00681
## # … with 1,098 more rows

# same as ^^ but apex

count(db, apex, sort=TRUE) %>% 
  mutate(pct = n/sum(n)) 
## # A tibble: 743 x 3
##    apex                  n     pct
##    <chr>             <int>   <dbl>
##  1 amazon.com         4180 0.331  
##  2 github.com         2997 0.237  
##  3 r-project.org       563 0.0446 
##  4 wikipedia.org       352 0.0279 
##  5 doi.org             221 0.0175 
##  6 google.com          179 0.0142 
##  7 tidyverse.org       151 0.0120 
##  8 r-lib.org           137 0.0109 
##  9 rstudio.com         117 0.00927
## 10 stackoverflow.com   102 0.00808
## # … with 733 more rows

# See all the eavesdroppable, interceptable, 
# content-mutable-by-evil-MITM-network-operator URLs
# A nice thing to do would be to fix these and issue PRs

filter(db, scheme == "http") %>% 
  select(URL, Parent)
## # A tibble: 2,498 x 2
##    URL                                                 Parent              
##    <chr>                                               <chr>               
##  1 http://www.winfield.demon.nl                        antiword/DESCRIPTION
##  2 http://github.com/ropensci/antiword/issues          antiword/DESCRIPTION
##  3 http://dirk.eddelbuettel.com/code/anytime.html      anytime/DESCRIPTION 
##  4 http://arrayhelpers.r-forge.r-project.org/          arrayhelpers/DESCRI…
##  5 http://arrow.apache.org/blog/2019/01/25/r-spark-im… arrow/doc/arrow.html
##  6 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/accelera…
##  7 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/accelera…
##  8 http://docs.aws.amazon.com/AmazonS3/latest/dev/acl… aws.s3/man/acl.Rd   
##  9 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/bucket_e…
## 10 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/bucketli…
## # … with 2,488 more rows

# find the abusers of "http" URLs

filter(db, scheme == "http") %>% 
  select(URL, Parent) %>% 
  mutate(
    pkg = stri_match_first_regex(Parent, "(^[^/]+)")[,2]
  ) %>% 
  count(pkg, sort=TRUE)
## # A tibble: 265 x 2
##    pkg                        n
##    <chr>                  <int>
##  1 paws.security.identity   258
##  2 paws.management          152
##  3 XML                      129
##  4 paws.analytics            78
##  5 stringi                   70
##  6 paws                      57
##  7 RCurl                     51
##  8 igraph                    49
##  9 base                      47
## 10 aws.s3                    44
## # … with 255 more rows

# send all the apex domains to the clipboard

clipr::write_clip(unique(db$apex))

# go here to paste them into the domain search box
# most domain/URL checker APIs aren't free for more 
# than a cpl dozen URLs/domains

browseURL("https://www.bulkblacklist.com")

# paste what you clipped into the box and wait a while

Over Christmas break I teased some screencaps of some almost-natural “R”-looking code (this is a snippet):

Button("Run") {
  do { // calls to R can fail so there are lots of "try"s; poking at less ugly alternatives

    // handling dots in named calls is a WIP
    _  = try R.evalParse("options(tidyverse.quiet = TRUE )")

    // in practice this would be called once in a model
    try R.library("ggplot2")
    try R.library("hrbrthemes")
    try R.library("magick")

    // can mix initialization of an R list with Swift and R objects
    let mvals: RObject = [
      "month": [ "Jan", "Feb", "Mar", "Apr", "May", "Jun" ],
      "value": try R.sample(100, 6)
    ]

    // ggplot2! `mvals` is above, `col.hexValue` comes from the color picker
    // can't do R.as.data.frame b/c "dots" so this is a deliberately exposed alternate call
    let gg = try R.ggplot(R.as_data_frame(mvals)) +
      R.geom_col(R.aes_string("month", "value"), fill: col.hexValue) + // supports both [un]named
      R.scale_y_comma() +
      R.labs(
        x: rNULL, y: "# things",
        title: "Monthly Bars"
      ) +
      R.theme_ipsum_gs(grid: "Y")

    // an alternative to {magick} could be getting raw SVG from {svglite} device
    // we get Image view width/height and pass that to {magick}
    // either beats disk/ssd round-trip
    let fig = try R.image_graph(
      width: Double(imageRect.width), 
      height: Double(imageRect.height), 
      res: 144
    )

    try R.print(gg)
    _ = R.dev_off() // can't do R.dev.off b/c "dots" so this is a deliberately exposed alternate call

    let res = try R.image_write(fig, path: rNULL, format: "png")

    imgData = Data(res) // "imgData" is a reactive SwiftUI bound object; when it changes Image does too

  } catch {
  }

}

that works in Swift as part of a SwiftUI app that displays a ggplot2 plot inside of a macOS application.

It doesn’t shell out to R, but uses Swift 5’s native abilities to interface with R’s C interface.

I’m not ready to reveal that SwiftR code/library just yet (break’s over and the core bits still need some tweaking), but I can provide an interim resource: an online book about working with R’s C interface from Swift on macOS. It is uninspiringly called SwiftR — Using R from Swift.

There are, at present, six chapters that introduce the Swift+R concepts via command line apps. These aren’t terribly useful in and of themselves (shebanged R scripts work just fine, #tyvm), but command line machinations are a much lower barrier to entry than starting right in with SwiftUI (which begins in chapter seven).

FIN

If you’ve wanted a reason to burn ~20GB of drive space with an Xcode installation and start to learn Swift (or learn more about Swift) then this is a resource for you.

The topics in the chapters also provide a fairly decent (albeit incomplete) overview of R’s C interface and of how to work with C code from Swift in general.

So, take advantage of the remaining pandemic time and give it a 👀.

Feedback is welcome in the comments or the book code repo (book source repo is in progress).

Hope everyone has a safe and strong new year!

While the future of the Apache Drill ecosystem is somewhat in play (MapR — a major sponsoring org for the project — is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some great improvements, including JSON resultset streaming for the REST API. Alas, tweaking {sergeant} (my REST API R package) to handle that is not on the TODO for the foreseeable future, so I’ve been using {sergeant.caffeinated} — https://github.com/hrbrmstr/sergeant-caffeinated — (an RJDBC wrapper for the Drill JDBC interface) for quite a while, since it handles large resultsets quite nicely.

I broke out the RJDBC functionality from {sergeant} into this separate package since, despite the fact that it’s 2019/2020, many folks still have/had problems getting {rJava} to work (FWIW it’s a seamless install for me on Windows, Ubuntu, or macOS, even Apple Silicon macOS). The surgery to separate it was fairly hack-ish (one reason it’s not on CRAN) and it finally broke with the recent {dbplyr} 2.x release. I assumed fixing the caffeinated version was easier/quicker than the REST API version, so I dug in and am cautiously tossing it out for wider poking.

An All New Way To Use 💂☕️

Gone are the days of src_drill_jdbc(); enter the new era of more standardized {DBI} and {d[b]plyr} access to Apache Drill. To install this version you can do:

remotes::install_github("hrbrmstr/sergeant-caffeinated")

(more install options using safer and saner social coding sites coming soon).

Let’s load up the package(s) and perform some operations.

library(sergeant.caffeinated)

test_host <- Sys.getenv("DRILL_TEST_HOST", "localhost")

be_quiet()

(con <- dbConnect(drv = DrillJDBC(), sprintf("jdbc:drill:zk=%s", test_host)))
## <DrillJDBCConnection>

The DRILL_TEST_HOST environment variable contains the hostname or IP address of my/your Drill server, defaulting to localhost if none is found.

The be_quiet() function stops the Java engine from yelling at you with “illegal reflective access” warnings. If you see these in other {rJava}-powered packages, it means code in some classes in some Java archive files is doing some sketchy old-school things that newer JVMs aren’t happy about. At some point these warnings will become full-on errors, which will break many things. Unfortunately, Drill is still fairly tied to Java 8.x and has tons of introspecting code. The warnings are ugly, so if you want to get rid of them, just call this function before doing anything with Drill. (You’ll also notice log4j errors are finally gone!)
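
If you’re curious what be_quiet() is papering over: {rJava} reads the java.parameters option when the JVM first spins up, so any quieting flags have to be set before anything Java-backed loads. A hedged sketch of the general-purpose knob (the actual flags the package sets may well differ) looks like:

# hypothetical sketch: these flags must be set *before* the JVM initializes,
# i.e. before library(rJava) or anything JDBC-backed is loaded
options(
  java.parameters = c(
    "--add-opens=java.base/java.lang=ALL-UNNAMED",
    "--add-opens=java.base/java.nio=ALL-UNNAMED"
  )
)

library(rJava) # the JVM starts here, picking up the parameters above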

Now that we have a Drill JDBC connection, we can do something with it. All the DBI-ish operations work, but it’s 2020 and {d[b]plyr} is the bee’s knees, so we’ll just dive right in with that:

(db <- tbl(con, "cp.`employee.json`"))

## # Source:   table<cp.`employee.json`> [?? x 16]
## # Database: DrillJDBCConnection
##    employee_id full_name first_name last_name position_id position_title store_id
##          <dbl> <chr>     <chr>      <chr>           <dbl> <chr>             <dbl>
##  1           1 Sheri No… Sheri      Nowmer              1 President             0
##  2           2 Derrick … Derrick    Whelply             2 VP Country Ma…        0
##  3           4 Michael … Michael    Spence              2 VP Country Ma…        0
##  4           5 Maya Gut… Maya       Gutierrez           2 VP Country Ma…        0
##  5           6 Roberta … Roberta    Damstra             3 VP Informatio…        0
##  6           7 Rebecca … Rebecca    Kanagaki            4 VP Human Reso…        0
##  7           8 Kim Brun… Kim        Brunner            11 Store Manager         9
##  8           9 Brenda B… Brenda     Blumberg           11 Store Manager        21
##  9          10 Darren S… Darren     Stanz               5 VP Finance            0
## 10          11 Jonathan… Jonathan   Murraiin           11 Store Manager         1
## # … with more rows, and 9 more variables: department_id <dbl>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

Basically, that’s it: it “just works”.

FIN

If you’ve been a user of {sergeant.caffeinated} and really need src_drill_jdbc() back, drop an issue on GH or a note in the comments, and be sure to file issues if I’ve missed anything as you kick the tyres.

It’s been a while since I’ve posted anything R-related and, while this one will be brief, it may be of use to some R folks who have taken the leap into Big Sur and/or Apple Silicon. Stay to the end for an early Christmas 🎁!

Big Sur Report

As Twitter folks know (from before my permanent read hiatus began), I’ve been on Big Sur since the very first developer betas were out. I’m even on the latest beta (11.1b1) as I type this post.

Apple picked away at many of the nits they introduced with Big Sur, but it’s still more of a “Vista” release than Catalina was, given all the clicks involved in installing new software. However, despite making Simon’s life a bit more difficult (re: notarization of R binaries), I’m happy to report that R 4.0.3 and the latest RStudio 1.4 daily releases work great on Big Sur. To be on the safe side, I highly recommend putting the R command-line binaries and the RStudio and R-GUI .apps into both “Developer Tools” and “Full Disk Access” in the Security & Privacy preferences pane. While not completely necessary, it may save some debugging (or clicks of “OK”) down the road.

The Xcode command-line tools are finally in a stable state and can be used instead of the massive Xcode.app installation. This was problematic for a few months, but Apple has been pretty consistent keeping it version-stable with Xcode-proper.

Homebrew is pretty much feature complete on Big Sur (for Intel-architecture Macs) and I’ve run everything from the {tidyverse} to {sf}-verse, ODBC/JDBC and more. Zero issues have come up, and with the pending public release (in a few weeks) of 11.1, I can safely say you should be fine moving your R analyses and workflows to Big Sur.

Apple Silicon Report

Along with all the other madness, 2020 has resurrected the “Processor Wars”, with Apple making its own ARM 64-bit chips (the M1 series). The Mac mini flavor is fast, but I suspect some of the “feel” of that speed comes from the faster SSDs and the beefed-up I/O plane to get to those SSDs. These newly released Mac models require Big Sur, so if you’re not prepared to deal with that, you should hold off on a hardware upgrade.

Another big reason to hold off on a hardware upgrade is that the current M1 chips cannot handle more than 16GB of RAM. I do most of the workloads requiring memory heavy-lifting on a 128GB RAM, 8-core Ubuntu 20 server, so what would have been a deal breaker in the past is not much of one today. Still, R folks who have gotten used to 32GB+ of RAM on laptops/desktops will absolutely need to wait for the next-gen chips.

Most current macOS software — including R and RStudio — is going to run in the Rosetta 2 “translation environment”. I’ve not experienced the 20+ seconds of startup time that others have reported, but RStudio 1.4 did take noticeably (but not painfully) longer on the first run than it has on subsequent ones. Given how complex the RStudio app is (chromium engine, Qt, C++, Java, oh my!) I was concerned it might not work at all, but it does indeed work with absolutely no problems. Even the ODBC drivers (Athena, Drill, Postgres) I need to use in daily work all work great with R/RStudio.

This means things can only get even better (i.e. faster) as all these components are built with “Universal” processor support.

Homebrew can work, but it requires the use of the arch command to ensure everything is running in the Rosetta 2 translation environment (see the sketch after the list below), and nothing I’ve installed (from source) has required anything from Homebrew. Do not let that last sentence lull you into a false sense of excitement. So far, I’ve tested:

  • core {tidyverse}
  • {DBI}, {odbc}, {RJDBC}, and (hence) {dbplyr}
  • partially extended {ggplot2}-verse
  • {httr}, {rvest}, {xml2}
  • {V8}
  • a bunch of self-contained, C[++]-backed or just base R-backed stats packages

and they all work fine.
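
Here’s the promised sketch (with macOS-only assumptions baked in, including Intel Homebrew living in its usual /usr/local prefix): check which architecture the current R session reports, and force a command through Rosetta 2 with arch:

# "arm64" if R is running natively on Apple Silicon;
# "x86_64" if it is running under Rosetta 2 translation
Sys.info()[["machine"]]

# prefixing a command with `arch -x86_64` forces the Rosetta 2 environment;
# /usr/local/bin/brew is where Intel-architecture Homebrew typically lives
system2("arch", c("-x86_64", "/usr/local/bin/brew", "--version"), stdout = TRUE)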

I have installed separate (non-Universal) binaries of fd, ripgrep, bat plus a few others, and almost have a working Rust toolchain up (both Rust and Go are very close to stable on Apple’s M1 architecture).

If there are specific packages/package ecosystems you’d like tested or benchmarked, drop a note in the comments. I’ll likely be adding more field report updates over the coming weeks as I experiment with additional components.

Now, if you are a macOS R user, you already know — thanks to Tomas and Simon — that we are in wait mode for a working Fortran compiler before we will see a Universal build of R. The good news is that things are working fine under Rosetta 2 (so far).

RStudio Update

If there is any way you can use RStudio Desktop + Server Pro, I would heartily recommend that you do so. The remote R session in-app access capabilities are dreamy and addictive as all get-out.

I also (finally) made a (very stupid simple) PR into the dailies so that RStudio will be counted as a “Developer Tool” for Screen Time accounting.

RSwitch Update

🎁-time! As incentive to try Big Sur and/or Apple Silicon, I started work on version 2 of RSwitch which you can grab from — https://rud.is/rswitch/releases/RSwitch-2.0.0b.app.zip. For those new to RSwitch, it is a modern alternative to the legacy RSwitch that enabled easy switching of R versions on macOS.

The new version requires Big Sur as it’s based on Apple’s new “SwiftUI” and takes advantage of some components that Apple has not seen fit to make backwards compatible. The current version has fewer features than the 1.x series, but I’m still working out what should and should not be included in the app. Drop notes in the comments with feature requests (source will be up after 🦃 Day).

The biggest change is that the app is not just a menu but a popup window:

Each of those tappable download regions presents a cancellable download progress bar (all three can run at the same time), and any downloaded disk images will be auto-mounted. That third tappable download region is for downloading RStudio Pro dailies. You can get notifications for both (or neither) Community and Pro versions:

The R version switcher also provides more info about the installed versions:

(And, yes, the r79439 is what’s in Rversion.h of each of those installations. I kinda want to know why, now.)

The interface is very likely going to change dramatically, but the app is a Universal binary, so will work on M1 and Intel Big Sur Macs.

NOTE: it’s probably a good idea to put this into “Full Disk Access” in the Security & Privacy preferences pane, but you should likely wait until I post the source so you can either build it yourself or validate that it’s only doing what it says on the tin for yourself (the app is benign but you should be wary of any app you put on your Macs these days).

WARNING: For now it’s a Dark Mode only app so if you need it to be non-hideous in Light Mode, wait until next week as I add in support for both modes.

FIN

If you’re running into issues, the macOS R mailing list is probably a better place to post issues with R and Big Sur/Apple Silicon, but feel free to drop a comment if there’s something you want me to test/take a stab at, and/or change/add in RSwitch.

(The RSwitch 2.0.0b release yesterday had a bug that I think has been remedied in the current downloadable version.)

NOTE: There’s a unique feed URL for R/tech stuff — https://rud.is/b/category/r/feed/. If you hit the generic “subscribe” button b/c the vast majority of posts have been on that, this isn’t one of those posts and you should probably delete it and move on with more important things than the rantings of a silly man with a Captain America shield.


The last 4+ years — especially the last ~10 months — had taken a bigger personal toll than I realized. I spent much of President-Elect Joseph R. Biden Jr.’s and Vice President-elect Kamala Harris’ first speeches as duly & honestly selected leaders of this nation unabashedly tear-filled. The wave of relief was overwhelming. Hearing kind, vibrant, uplifting, and articulately + professionally delivered words was like the finest symphonic production compared to the ALL CAPS productions that we’ve been forced to consume for so long.

The outgoing (perhaps a neologism — “unpresidented” — should be used, since so much of what this person did was criminally unprecedented) loser did damage our nation severely, but I’m ashamed to admit just how much damage I let him, and those who support or detract from him, do to me.

President-elect Biden said this as part of his speech last night:

And to those who voted for President Trump, I understand your disappointment tonight.

I’ve lost a couple of elections myself.

But now, let’s give each other a chance.

It’s time to put away the harsh rhetoric.

To lower the temperature.

To see each other again.

To listen to each other again.

To make progress, we must stop treating our opponents as our enemy. We are not enemies. We are Americans.

The Bible tells us that to everything there is a season — a time to build, a time to reap, a time to sow. And a time to heal.

He went on to say:

Let this grim era of demonization in America begin to end — here and now.

The refusal of Democrats and Republicans to cooperate with one another is not due to some mysterious force beyond our control.

It’s a decision. It’s a choice we make.

And, still, further on:

We stand again at an inflection point.

We have the opportunity to defeat despair and to build a nation of prosperity and purpose.

We can do it. I know we can.

I’ve long talked about the battle for the soul of America.

We must restore the soul of America.

Our nation is shaped by the constant battle between our better angels and our darkest impulses.

It is time for our better angels to prevail.

What President-elect Biden did was socially engineer a Matthew 18:21-35 on me/us since what he’s calling on us (me) to do is forgive.

Forgive the Resident in Chief.

Forgive his supporters.

Forgive the right and left radicals whose severely flawed agendas have brought us to the brink of yet another antebellum.

Forgive the Evangelicals who sold out American Christianity for a chance to be court evangelicals and wield even greater earthly power than they already did.

Forgive owners of establishments and organizations that showed support for MAGA and the outgoing POTUS.

Forgive the extended family on my spouse’s side who proudly supported and still support what is obviously evil.

And, forgive myself for — amongst a myriad of other things — just how thoroughly my un-Christ-like hate, disdain, and despair have consumed me and my words/actions over the past 4+ years.

I wish I could say I’m eager to do this. I am not. The self-righteous, smug, superior hate and disdain feels pretty good, doesn’t it? It’s kinda warm and fiery in a wretched country bourbon sort of way. It feels soothingly justified, too, doesn’t it? I mean, hundreds of thousands of living, breathing, amazing humans in America died directly because of “these people” (ah, how comforting acerbic tribal terminology can be), didn’t they? How can I possibly forgive that?

Fortunately — yes, fortunately — I have to, and if you’re still reading this and feel similarly to the preceding paragraph, I would strongly suggest you have to as well.

I have to because it is the foundation of my Faith (which I seem to have let evil convince me to forget for a while) and because it’s a cancer that will eventually subsume me if I let it (and I already beat physical cancer once, so I’m not letting a spiritual, emotional, and intellectual one win either).

We all have to — on all sides, since “right” and “left” are far too large buckets — if Joe and Kamala have even a remote chance to lead America into healing.

Now, I am not naive. The road ahead is long and fraught with peril. We are a deeply divided nation. Repair will take decades if it happens at all.

I’ll start by striving to take Colossians 3:12-17 more seriously and faithfully than I have ever taken it before and be ready to perform whatever actions are necessary to help this be a time for myself and our nation to heal.

I say “strive” as I had planned to conclude with some “I forgive…”s, but I quite literally cannot type anything but ellipses after those two words yet. Hopefully it won’t take too long to get past that for most of the above list. I’m not sure forgiving the last item on it will happen any time soon, though.

Stay safe. Wear a mask. Be kind.

The CDC continues to “deliver” in 2020, this time by changing the JSON response of one of the hidden APIs that my {cdcfluview} package wraps. CDC: So helpful!

It was a quick fix, and version 0.9.2 passed automated CRAN checks in ~9.42 minutes! 💙 the CRAN team!

Plus, a special shout-out to Ian-McGovern (GH ID) for triaging the issue even before I had a chance to get 🙌 on ⌨️.

I also took the opportunity to switch from {dplyr}’s progress_estimated() (which is deprecated) to {progress}.
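
If you’re making the same swap in your own packages, the core {progress} idiom is small. A minimal sketch (in the real package the total is wired to the number of API requests being made):

library(progress)

# the miniature replacement for the deprecated dplyr::progress_estimated():
pb <- progress_bar$new(total = 10)

for (i in seq_len(10)) {
  pb$tick()      # advance the bar by one unit of work
  Sys.sleep(0.1) # stand-in for an actual API request
}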

You can wait for it on CRAN or find it on all the usual social source sharing suspects, including my own: https://git.rud.is/hrbrmstr/cdcfluview.

(This is part 2 of n “quick hit” posts, each walking through some approaches to speeding up components of an iterative operation. Go here for part 1).

Thanks to the aforementioned previous post, we now have a super fast way of reading individual text files containing HTTP headers from HEAD requests into a character vector:

library(Rcpp)

vapply(
  X = fils, 
  FUN = cpp_read_file, # see previous post for the source for this C++ Rcpp function
  FUN.VALUE = character(1), 
  USE.NAMES = FALSE
) -> hdrs

head(hdrs, 2)
## [1] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:40:45 GMT\r\nServer: Apache\r\nLast-Modified: Sun, 26 Apr 2020 00:06:47 GMT\r\nETag: \"ace-ec1a0-5a4265fd413c0\"\r\nAccept-Ranges: bytes\r\nContent-Length: 967072\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"                                   
## [2] "HTTP/1.1 200 OK\r\nDate: Mon, 08 Jun 2020 14:43:46 GMT\r\nServer: Apache\r\nLast-Modified: Wed, 05 Jun 2019 03:52:22 GMT\r\nETag: \"423-d99a0-58a8b864f8980\"\r\nAccept-Ranges: bytes\r\nContent-Length: 891296\r\nX-XSS-Protection: 1; mode=block\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Type: application/x-msdownload\r\n\r\n"

However, I need the headers and values broken out so I can eventually get to the analysis I need to do, and a data frame of name/value columns would be the most helpful format. We’ll use {stringi} to help us build a function (explanation of what it’s doing is in comment annotations) that turns each unkempt string into a very kempt data frame:

library(stringi)

parse_headers <- function(x) {

  # split each response into a character vector of lines
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP/1.x status line
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      data.frame(
        name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
        value = hdrs[,3]
      )
    } else { # if we don't have any
      NULL
    }

  })

}

parse_headers(hdrs[1:3])
## [[1]]
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload
## 
## [[2]]
##               name                         value
## 1             date Mon, 08 Jun 2020 14:43:46 GMT
## 2           server                        Apache
## 3    last_modified Wed, 05 Jun 2019 03:52:22 GMT
## 4             etag     "423-d99a0-58a8b864f8980"
## 5    accept_ranges                         bytes
## 6   content_length                        891296
## 7 x_xss_protection                 1; mode=block
## 8  x_frame_options                    SAMEORIGIN
## 9     content_type      application/x-msdownload
## 
## [[3]]
##           name                         value
## 1         date Mon, 08 Jun 2020 14:23:53 GMT
## 2       server                        Apache
## 3 content_type text/html; charset=iso-8859-1

parse_headers(hdrs[1])[[1]]
##              name                         value
## 1            date Mon, 08 Jun 2020 14:40:45 GMT
## 2          server                        Apache
## 3   last_modified Sun, 26 Apr 2020 00:06:47 GMT
## 4            etag     "ace-ec1a0-5a4265fd413c0"
## 5   accept_ranges                         bytes
## 6  content_length                        967072
## 7 x_frame_options                    SAMEORIGIN
## 8    content_type      application/x-msdownload

Unfortunately, this takes almost 16 painful seconds to crunch through the ~75K text entries:

system.time(tmp <- parse_headers(hdrs))
##   user  system elapsed 
## 15.033   0.097  15.227 

as each call can be near 150 microseconds:

library(microbenchmark)

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr     min       lq     mean  median      uq     max neval
##    ph 143.328 146.8995 154.8609 148.361 158.121 415.332  1000

A big reason it takes so long is the data frame creation. If you’ve never looked at the source for data.frame() have a go at it — https://github.com/wch/r-source/blob/86532f5aa3d9880f4c1c9e74a417005616846a34/src/library/base/R/dataframe.R#L435-L603 — before continuing.

Back? Great! The {base} data.frame() has tons of guard rails to make sure you’re getting what you think you asked for across a myriad of use cases. I learned about a trick to make data frame creation faster when I started playing with the {ggplot2} source. Said trick has virtually no guard rails (it just adds a class and a row.names attribute to a list), so you really should only use it in cases like this, where you have a very good idea of the structure and values of the data frame you’re making. Here’s an even more simplified version of the function in the {ggplot2} source:

fast_frame <- function(x = list()) {

  lengths <- vapply(x, length, integer(1))
  n <- if (length(x) == 0 || min(lengths) == 0) 0 else max(lengths)
  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n) # help(.set_row_names) for info

  x

}
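
A quick smoke test shows the result quacks like a regular data frame:

fast_frame(list(name = c("server", "content_type"), value = c("Apache", "text/html")))
##           name     value
## 1       server    Apache
## 2 content_type text/html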

Now, we’ll change parse_headers() a bit to use that function instead of data.frame():

parse_headers <- function(x) {

  # split each response into a character vector of lines
  split_hdrs <- stri_split_lines(x, omit_empty = TRUE)

  lapply(split_hdrs, function(lines) {

    # we don't care about the HTTP/1.x status line
    lines <- lines[-1]

    # make a matrix out of found NAME: VALUE
    hdrs <- stri_match_first_regex(lines, "^([^:]*):\\s*(.*)$")

    if (nrow(hdrs) > 0) { # if we have any
      fast_frame(
        list(
          name = stri_replace_all_fixed(stri_trans_tolower(hdrs[,2]), "-", "_"),
          value = hdrs[,3]
        )
      )
    } else { # if we don't have any
      NULL
    }

  })

}

Note that we had to pass in a list() to it vs bare name/value vectors.

How much faster is it? Quite a bit:

microbenchmark(
  ph = parse_headers(hdrs[1]),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##  expr   min      lq     mean median      uq      max neval
##    ph 27.94 28.7205 34.66066 29.024 29.3785 4144.402  1000

This speedup means the painful ~15s is now just a tolerable ~3s:

system.time(tmp <- parse_headers(hdrs))
##  user  system elapsed 
## 2.901   0.011   2.918 

FIN

Normally, guard rails are awesome, and you can have even safer code (which means safer and more reproducible analyses) when using {tidyverse} functions. As noted in the previous post, I’m doing a great deal of iterative work, have more than one set of headers I’m crunching on, and am testing out different approaches/theories, so going from 16 seconds to 3 seconds truly does speed up my efforts and has an even bigger impact when I process around 3 million raw header records.

I think I promised {future} work in this post (asynchronous pun not intended), but we’ll get to that eventually (probably the next post).

If you have your own favorite way to speed up data frame creation (or extracting target values from raw text records) drop a note in the comments!