
Category Archives: R

For the first time ever we got a new riding mower this weekend. We’ve always haggled to keep the mower the sellers were already using with any given house we’ve purchased over the years (at least, any house with a yard big enough to “require” a riding mower).

We ended up getting a model from John Deere and the manual (yes, I actually read the manual) noted they had a Mower Plus app to track “things”. Given the lack of built-in sensors I figured this was just something kitschy, but it’s not bad (they don’t do anything super-evil tracker-wise), and if you use it while mowing it tracks whatever your mobile device can track and will notify you of prescribed service events.

I ran the app through a local proxy server before using it in the field and it only phoned home to check the serial number and fetch weather info, which meant there was no hidden API to capture and mimic to get access to mowing data. But, that also meant it stores everything it needs locally in the app. Since I use an iOS mobile device (you should too; Android is a security mess) I figured that meant it stored everything in on-device SQLite databases.

A quick review of an immediate backup confirmed that this posit was correct and I started going crazy (as you’ll see in the aforelinked Twitter thread).
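Said “going crazy” was mostly poking at those databases with {DBI}/{RSQLite}; a minimal sketch of that poking looks something like the following (the file and table names are placeholders since the actual Core Data entity names in the app’s backup will differ):

library(DBI)

# connect to the SQLite database pulled out of the (unencrypted) device backup;
# the path and table name below are hypothetical
con <- dbConnect(RSQLite::SQLite(), "MowerPlus.sqlite")

dbListTables(con) # Core Data tables are the "Z"-prefixed ones

sessions <- dbReadTable(con, "ZMOWINGSESSION") # hypothetical mowing-session entity

dbDisconnect(con)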

However, it had these cray cray timestamps:

$ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, 581100290, 581100328, 581100308, 581…

Thanks to some fairly fast poking by @dabdine, it turns out these are Apple Cocoa Core Data timestamps. They can come in two flavors:

  • seconds since 2001-01-01 00:00:00
  • nanoseconds since ^^

The entire reason for this post (and some of the needless verbosity) was to get this into the internet’s long-ish term memory stores (Google/Internet Archive) so others can reference it as well.

If you come across one of these beasts, 2019 dates will likely start with 56, 57, 58, 59 or 60 (enumerating them to make it easier for Google searches to pick up this post) and 2020 dates would start with 60, 61, 62, or 63 (and that should be sufficient for savvy searchers). If you’re not having luck using Excel date tricks, {lubridate} functions or {anytime}, try transforming a sample timestamp with:

as.POSIXct(sample_timestamp, origin = "2001-01-01")

to see if you get a more “expected”/realistic value and know you’ve got a Core Data timestamp. You can even stick:

from_coredata_ts <- function(x, tz = NULL) {
  .POSIXct(ifelse(
    test = floor(log10(x)) >= 10, # If you're still using R in 2317 then good on ya and edit this
    yes = as.POSIXct(x/10e8, origin = "2001-01-01"), # nanoseconds coredata
    no = as.POSIXct(x, origin = "2001-01-01") # seconds coredata
  ), tz = tz)
}

into your RStudio snippets file or a personal “misc” package for quick access to converting both kinds of Core Data timestamps into proper R POSIXct objects.
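As a quick sanity check, feeding one of the ZTIMESTAMP values from above through that helper should land squarely in 2019 (the exact clock time shown will depend on the tz you pass):

from_coredata_ts(581100271, tz = "UTC")
## should print a datetime in mid-2019, i.e. right around when that mowing data was captured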

The Feedly category I have set up for git-stalking has indicated a fairly massive interest in Joshua Levy’s The Art of the Command Line. What is “The Art of the Command Line”? To quote the author(s):

Fluency on the command line is a skill often neglected or considered arcane, but it improves your flexibility and productivity as an engineer in both obvious and subtle ways. This is a selection of notes and tips on using the command-line that we’ve found useful when working on Linux. Some tips are elementary, and some are fairly specific, sophisticated, or obscure. This page is not long, but if you can use and recall all the items here, you know a lot.

It’s a great resource just the way it is (simple, plain markdown rendered in GitUgh). But, we can make it even greater with some help from rmarkdown::render() and some content slicing & dicing.

My initial thought was to grab the English version, put an R Markdown YAML header on it, remove some intro cruft and render it to standalone HTML. While that would be quick, easy and useful, it’s also very manual and brittle since updating it would require copy and paste; plus, it leaves out the translated versions.

So, goal number uno became “make a function to do this”. Then I realized “Hey! This is a resource for command line stuff so why not turn the function into a command line tool!”, and that became goal number II. (I have an internal posit that R adoption would be much higher if there were more easy-to-install command line utilities built in R; that’s one reason Python has a larger install base, since many folks end up just using the command line versions of modules they install. CRAN’s draconian rules on what you can do during a package install make this somewhat moot, tho. One could argue that CRAN is doing the right thing and that Python/PyPI are woefully insecure-by-default, which is also true.)

Goal Uno

Since we’re going to create a function it also makes sense to parameterize options for the language, doc-theme and highlight-theme.

The plan for this endeavour is pretty straightforward:

  • fetch the current set of translations available
  • check to make sure the desired translation is in ^^ set
  • grab a copy of the specified document
  • get the title (since that’s translated for each)
  • remove some unnecessary front-matter
  • turn the AUTHORS.md link into a proper link (vs relative)
  • add in the YAML header with the desired customizations
  • render the document to standalone HTML
  • optionally open it after render

And, this is what that looks like:

taotcl <- function(language = "", theme = "simplex", highlight = "espresso", output_dir = getwd(), open = TRUE) {

  language <- language[1]

  # find translations

  httr::GET(
    url = "https://api.github.com/repos/jlevy/the-art-of-command-line/contents/",
    httr::add_headers(
      `Accept` = "application/vnd.github.v3+json"
    ),
    httr::user_agent("taotcl R script; @hrbrmstr")
  ) -> res

  httr::stop_for_status(res)

  ls <- httr::content(res, as = "parsed")

  readmes <- Filter(function(.x) grepl("^README", .x), vapply(ls, `[[`, character(1), "name"))
  langs <- regmatches(readmes, regexpr("-[-[:alpha:]]+", readmes))

  # check to make sure a valid one was specified

  if (language != "") { # "" => English
    language <- sprintf("-%s", language)
    if (!(language %in% langs)) {
      stop(
        "Language '", sub("^-", "", language), 
        "' not found in repo. Current translations include: ",
        paste0(sprintf("'%s'", sub("^-", "", langs)), collapse = ", "),
        ".", call.=FALSE
      )
    }
  }

  # get the desired doc

  src <- "https://raw.githubusercontent.com/jlevy/the-art-of-command-line/master/README{language}.md"
  src <- glue::glue(src)

  l <- readLines(src)

  # find the title
  title <- sub("^#[[:space:]]*", "", l[which(grepl("^#[[:space:]]*", l))[1]])

  # figure out the cut line
  cowsay <- which(grepl("cowsay", l))[1]

  l <- l[-(1:(cowsay+1))] # cut

  # make the AUTHORS.md a useful link
  l <- gsub("(AUTHORS.md)", "(https://github.com/jlevy/the-art-of-command-line/blob/master/AUTHORS.md)", l, fixed = TRUE)

  theme <- theme[1]
  highlight <- highlight[1]

'---
title: "{title}"
author: "Joshua Levy"
email: "joshua@cal.berkeley.edu"
output: 
  html_document:
    theme: {theme}
    highlight: {highlight}
    toc: true
    toc_float: true
    toc_depth: 2
---

' -> yaml

  # fill in the YAML
  yaml <- glue::glue(yaml)

  tf <- tempfile(fileext = ".Rmd")
  on.exit(unlink(tf), add = TRUE)
  writeLines(c(yaml, l), tf)

  # render the doc
  rmarkdown::render(
    input = tf,
    output_file = sprintf("%s.html", tolower(gsub(" ", "-", title))),
    output_dir = output_dir[1],
    quiet = TRUE
  ) -> loc

  # open in browser
  if (open[1]) browseURL(loc)

  message("Rendered version is at '", loc, "'")

}

Running it with the defaults will have it look like this:

You don’t have to type it all in as that function is in the taotcl.R script over at my gitea / sourcehut / gitlab / gitugh.
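And, if you want something other than the defaults, a call might look like the following (the theme and highlight names are just examples from the stock R Markdown sets; the language code needs to be one of the repo’s current translations):

taotcl(
  language = "fr",        # render the French translation
  theme = "flatly",       # any R Markdown HTML document theme
  highlight = "haddock",  # any R Markdown code highlight theme
  output_dir = "~/Documents",
  open = TRUE
)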

Goal II

Now that we have a function we can call from R we just need a wrapper around it. I kinda like the way David Shih put together his {argparser} package (it’s on CRAN) so we’ll make a wrapper for our rendering function with it.

We have pretty much the same goal list as the function in that we want to let users specify customizations. There are some additional ones as well (this is not an exhaustive list but it was “just enough” for this go):

  1. Make it easy for folks on real operating systems to use it without the need to use Rscript
  2. Let folks know what required packages they need to install if any are missing
  3. Be quiet when loading packages
  4. Assume friendly/useful defaults
  5. Provide long and short parameters (some folks like short, some like long)

The first few lines of the finished script will accomplish #1-3:

#!/usr/bin/env Rscript

needed <- c("magrittr", "argparser", "httr", "glue", "rmarkdown")
installed <- rownames(installed.packages())
missing <- needed[!(needed %in% installed)]

if (length(missing)) stop("Please install the following packages: ", paste0(sprintf("'%s'", missing), collapse = ", "), call.=FALSE)

suppressPackageStartupMessages({
  for (pkg in needed) {
    require(package = pkg, quietly = TRUE, warn.conflicts = FALSE, character.only = TRUE)
  }
})

Line 1 is a “hashbang”/”shebang” and — provided the file has the execute bit set — will let folks on *nix/macOS run the file without deliberately invoking Rscript. The rest just does the package checks and loads.

We need a way to get command line parameters in, hence the use of {argparser}. We’ll create an arg_parser object and then add arguments using the {magrittr} pipe (%>%). You can add long/short argument names as well as help and defaults (plus note whether an argument is a flag/toggle). Once we have those setup, we tell {argparser} to process any arguments provided by the user:

arg_parser(
  description = "Render 'The Art of the Command Line' to HTML"
) %>% 
  add_argument(
    arg = "--language",
    help = 'Language to render. Leave unspecified for English. Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh"',
    type = "character",
    short = "-l",
    default = ""
  ) %>% 
  add_argument(
    arg = "--theme",
    help = "Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-t",
    default = "simplex"
  ) %>% 
  add_argument(
    arg = "--highlight",
    help = "Which R Markdown code highlight theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-c",
    default = "espresso"
  ) %>% 
  add_argument(
    arg = "--output-dir",
    help = "Where to store the rendered file. Defaults to current working directory.",
    type = "character",
    short = "-o",
    default = getwd()
  ) %>% 
  add_argument(
    arg = "--just-render",
    help = "Only render the document. Do not open in the system default browser. (Default is to render and open.)",
    short = "-j",
    flag = TRUE
  ) -> parser

opts <- argparser::parse_args(parser)

Once we have those (the taotcl() function would come next in the source) then it’s just a matter of calling the function:

taotcl(
  language = opts$language,
  theme = opts$theme,
  highlight = opts$highlight,
  output_dir = opts$output_dir,
  open = is.na(opts$just_render) | (!opts$just_render)
)

If the command line program we’ve just made is called with a -h or --help the user will get:

usage: ./taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

or (on Windows): Rscript taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

Render 'The Art of the Command Line' to HTML


flags:
  -h, --help                    show this help message and exit
  -j, --just-render             Only render the document. Do not open in the system default browser. 
                                (Default is to render and open.)

optional arguments:
  -x, --opts OPTS               RDS file containing argument values
  -l, --language LANGUAGE       Language to render. Leave unspecified for English. 
                                Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", 
                                "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh" [default: ]
  -t, --theme THEME             Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ [default: simplex]
  -c, --highlight HIGHLIGHT     Which R Markdown code highlight theme to use. Ref: https://l.rud.is/2JOibrZ [default: espresso]
  -o, --output-dir OUTPUT-DIR   Where to store the rendered file. Defaults to current working directory. 
                                [default: /your/current/directory/here]

If we run it with, say, ./taotcl.R --language ru -o /tmp the script will process the correct language version and render it to /tmp/искусство-командной-строки.html plus auto-open it for us. It should look like:

FIN

As noted, you can find the entire script over at my gitea / sourcehut / gitlab / gitugh. It’ll eventually get over to GitLab & GitUgh (and a few others as I’m expanding the scripts I use to support social coding diversity vs hegemony) as well.

Note that you can leave off the .R and the hashbang will still work just fine so it’ll be even more straightforward to use.

If you don’t want to go through all this and just want standalone rendered versions of the resource just drop a note in comments and I’ll toss up a small Shiny app which will let you specify params and get a rendered version. You can find weekly renders of all translations at https://rud.is/taotcl/.

Finally, r-lib has some handy packages to make R-built command line utilities much, much cooler (which is a minor suggestion that PRs are welcome if you want to add some flavor to this fairly vanilla utility).

A fair bit of time ago the {gdns} package made its way to CRAN to give R users the ability to use Google’s (at that time) nascent support for DNS over HTTPS (DoH). A bit later on Cloudflare also provided a global DoH endpoint and that begat the (not-on-CRAN) {dnsflare} package.

There are actually two ways to make these DoH queries: one via an HTTPS GET REST API and the other via HTTPS POST queries that use DNS wireformat queries and replies. While the POST side of DoH is pretty standardized/uniform the GET/REST API side is kind of the Wild West. I wanted a way to have support for both wireformat and REST idioms but also not have to write a gazillion packages to support the eventual plethora of diverse DoH GET/REST API services.

I “solved” this by first augmenting my (not-on-CRAN) {clandnstine} package to support the POST wireformat DoH queries (since the underlying {getdns} library supports decoding wireformat responses) and creating a very small {playdoh} package which provided generic support for (hopefully) any DoH GET/REST endpoint.

DoH vs DoT

I made the {clandnstine} package primarily to support making DNS over TLS (DoT) queries, but it makes sense to combine both DoH and DoT functionality into that package. The problem is that the legacy platform most of y’all R users are on (i.e. Windows) makes using that package problematic. Therefore, by separating out the DoH GET functionality into a standalone package I don’t have to write a pure R DNS wireformat response handler.

There are performance and other differences between DoH and DoT. I suspect most DNS providers and most open source DNS servers will eventually support both DoH and DoT, so which one you use will be up to your clients and use cases.

A Tale of Two (or More) Queries

We’ll issue a few queries over DoH and DoT to a few servers to ensure we’re getting the same results.

library(clandnstine) # both of these are on sourcehut (~hrbrmstr/pkgname), 
library(playdoh)     # or gitlab/gitugh (hrbrmstr/pkgname)

# DoT
x <- gdns_context()
gdns_query(x, "example.com", rr_type = "a")$just_address_answers$address_data
## [1] "93.184.216.34"

# DoH POST (wireformat)
doh_post("example.com", "a", server_path = doh_servers$quad9$url)$answer$rdata$ipv4_address
## [1] "93.184.216.34"

# DoH GET (rest)
doh_get("example.com", "a", service_path = doh_servers$securedns_eu$url)$data[1]
## [1] "93.184.216.34"

To support the, er, diversity of requirements across various GET/REST endpoints the playdoh::doh_get() function has an extra_params parameter which lets you specify any required extra REST query params. Both packages have an exposed global variable doh_servers which has both the URL and any required extra parameters.
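As a purely illustrative sketch, passing provider-specific bits through extra_params looks like the following (the ct parameter is hypothetical and not an actual requirement of that endpoint):

doh_get(
  "example.com", "a",
  service_path = doh_servers$securedns_eu$url,
  extra_params = list(ct = "application/dns-json") # placeholder for whatever extra REST params a provider requires
)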

FIN

As usual, kick the tyres, file issues and PRs where you like and if you do end up using either package drop a note in the comments.

I’ve talked about the retailpocalypse before and this morning I was greeted with the news about Dressbarn closing all 650 stores as I fired up a browser.

I tweeted some pix and data but not everyone is on Twitter so I’m just posting a blog-blurb here with the code and data links.

Code is below and at https://paste.sr.ht/~hrbrmstr/af6da1af0314426255c65bc2fc254e0abb2190c3.

Data is at https://rud.is/dl/dressbarn-locations.json.gz.
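If you just want the locations and don’t care about the scraping, something like this should pull the gzipped JSON into R (assuming the file stays at that URL):

tmp <- tempfile(fileext = ".json.gz")
download.file("https://rud.is/dl/dressbarn-locations.json.gz", tmp)

dressbarn_locations <- jsonlite::fromJSON(
  paste0(readLines(gzfile(tmp)), collapse = "")
)

unlink(tmp)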

Images are in a gallery below the code.

library(rvest)
library(stringi)
library(urltools)
library(worldtilegrid) # install from sh/gl/gh or just remove the theme_enhance_wtg() calls
library(hrbrthemes)    # provides ft_cols, theme_ft_rc() & theme_ipsum_rc() used below
library(statebins)
library(tidyverse)

# this is the dressbarn locations directory page
pg <- read_html("https://locations.dressbarn.com/")

# this is the selector to get the main links
html_nodes(pg, "a.Directory-listLink") %>% 
  html_attr("href") -> locs

# PRE-NOTE
# No sleep() code (I looked at the web site, saw how many self-requests it makes for all DB
# resources and concluded that link scrapes + full page captures would not be burdensome
# plus they're going out of business)

# basic idea here is to get all the main state location pages
# some states only have one store so the link goes right to that so handle that condition
# for ones with multiple stores get all the links on the state index page
# for links on state index page that have multiple stores in one area,
# grab all those; then, concatenate all the final target store links into one 
# character vector.

keep(locs, ~nchar(.x) == 2) %>% 
  sprintf("https://locations.dressbarn.com/%s", .) %>% # state has multiple listings
  map(
    ~read_html(.x) %>% 
      html_nodes("a.Directory-listLink") %>% 
      html_attr("href") %>% 
      sprintf("https://locations.dressbarn.com/%s", .)
  ) %>% 
  append(
    keep(locs, ~nchar(.x) > 2) %>% sprintf("https://locations.dressbarn.com/%s", .) # state has one store
  ) %>% 
  flatten_chr() %>% 
  map_if(
    ~stri_count_fixed(.x, "/") == 4, # 4 URL parts == there's another listing page layer
    ~read_html(.x) %>% 
      html_nodes("a.Teaser-titleLink") %>% 
      html_attr("href") %>% 
      stri_replace_first_fixed("../", "") %>% 
      sprintf("https://locations.dressbarn.com/%s", .)
  )  %>% 
  flatten_chr() -> listings

# make a tibble with the HTML source for the final store location pages
# so we don't end up doing multiple retrievals

tibble(
  listing = listings,
  html_src = map_chr(listings, ~httr::GET(.x) %>% httr::content(as = "text"))
) -> dress_barn

# save off our work in the event we have a (non-R-crashing) issue
tf <- tempfile(fileext = ".rds")
print(tf)
saveRDS(dress_barn, tf) 

# now, get data from the pages
#
# first, turn all the character vectors into something we can get HTML nodes from
#
# dressbarn web folks handily put an "uber" link on each page so we get lon/lat for free in that URL
# they also handily used an <address> semantic tag in the proper PostalAddress schema format
# so we can get locality and actual address, too
mutate(
  dress_barn,
  parsed = map(html_src, read_html),
  uber_link = 
    map_chr(
      parsed, ~html_nodes(.x, xpath=".//a[contains(@href, 'uber')]") %>% 
        html_attr("href") 
    ), 
  locality = map_chr(
    parsed, ~html_node(.x, xpath=".//address/meta[@itemprop = 'addressLocality']") %>% 
      html_attr("content")
  ),
  address = map_chr(
    parsed, ~html_node(.x, xpath=".//address/meta[@itemprop = 'streetAddress']") %>% 
      html_attr("content")
  ),
  state = stri_match_first_regex(
    dress_barn$listing, 
    "https://locations.dressbarn.com/([[:alpha:]]+)/.*$"
  )[,2]
) %>% 
  bind_cols(
    param_get(.$uber_link, c("dropoff%5Blatitude%5D", "dropoff%5Blongitude%5D")) %>% 
      as_tibble() %>% 
      set_names(c("lat", "lon")) %>%
      mutate_all(as.double)
  ) -> dress_barn

# save off our hard work with the HTML source so we can do more later if need be
select(dress_barn, -parsed) %>% 
  saveRDS("~/Data/dressbarn-with-src.rds")

# save off something others will want
select(dress_barn, -parsed, -html_src, -listing) %>% 
  jsonlite::toJSON() %>% 
  write_lines("~/Data/dressbarn-locations.json.gz")

# simple map
ggplot(dress_barn, aes(lon, lat)) + 
  geom_jitter(size = 0.25, color = ft_cols$yellow, alpha = 1/2) +
  coord_map("polyconic") +
  labs(
    title = "Locations of U.S. Dressbarn Stores",
    subtitle = "All 650 locations closing",
    caption = "Source: Dressbarn HTML store listings;\nData: <https://rud.is/dl/dressbarn-locations.json.gz> via @hrbrmstr"
  ) +
  theme_ft_rc(grid="") +
  theme_enhance_wtg()

unlink(tf) # cleanup 

count(dress_barn, state) %>% 
  left_join(tibble(name = state.name, state = tolower(state.abb))) %>% 
  left_join(usmap::statepop, by = c("name"="full")) %>% 
  mutate(per_capita = (n/pop_2015) * 1000000) %>% 
  arrange(desc(per_capita)) %>% 
  select(name, n, per_capita) %>% 
  arrange(desc(per_capita)) %>% 
  complete(name = state.name) %>% 
  statebins(state_col = "name", value_col = "per_capita") +
  labs(title = "Dressbarn State per-capita closings") +
  theme_ipsum_rc(grid="") +
  theme_enhance_wtg()

I caught a re-tweet of this tweet by @harry_stevens:

Harry’s thread and Observable post are great on their own and both show the power and utility of Observable javascript notebooks.

However, the re-tweet (which I’m not posting because it’s daft) took a swipe at both Python & R. Now, I’m all for a good swipe at Python (mostly to ensure we never forget all those broken spacebars and tab keys that language has caused) but I’ll gladly defend it and R together when it comes to Getting Things Done, even on deadline.

Let’s walk through what one of us might have done had we been in the same scenario as Harry.

Mapping On A Deadline

So, we have to create a map of historical tornado frequency trends on deadline.

We emailed researchers and received three txt files. One is a set of latitudes, another longitudes, and the final one is the trend value. It’s gridded data.

Download that ZIP and pretend you got three files in email (vs a nice ZIP), make a new RStudio project called “tornado”, and put those three files in a data/ directory at the project root. Let’s read them in and look at them:

library(hrbrthemes) # not 100% necessary but i like my ggplot2 theme(s) :-)
library(tidyverse)  # data wrangling & ggplot2

tibble(
  lat = scan(here::here("data/lats.txt")),
  lon = scan(here::here("data/lons.txt")),
  trend = scan(here::here("data/trends.txt"))
) -> tornado

You very likely never use the base::scan() function directly, but it’s handy here since we just have files of doubles with each value separated by whitespace. Now, let’s see what we have:

tornado
## # A tibble: 30,000 x 3
##      lat   lon trend
##    <dbl> <dbl> <dbl>
##  1 0.897 -180.     0
##  2 0.897 -179.     0
##  3 0.897 -178.     0
##  4 0.897 -176.     0
##  5 0.897 -175.     0
##  6 0.897 -174.     0
##  7 0.897 -173.     0
##  8 0.897 -172.     0
##  9 0.897 -170.     0
## 10 0.897 -169.     0
## # … with 29,990 more rows

summary(tornado)
##      lat               lon                 trend           
## Min.   : 0.8973   Min.   :-179.99808   Min.   :-0.4733610  
## 1st Qu.:22.0063   1st Qu.: -90.00066   1st Qu.: 0.0000000  
## Median :43.1154   Median :  -0.00323   Median : 0.0000000  
## Mean   :43.1154   Mean   :  -0.00323   Mean   : 0.0002756  
## 3rd Qu.:64.2245   3rd Qu.:  89.99419   3rd Qu.: 0.0000000  
## Max.   :85.3335   Max.   : 179.99161   Max.   : 0.6314569  

#+ grid-overview
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

#+ trend-overview
ggplot(tornado, aes(trend)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.05))

Since we’re looking for trends (in either direction) in just the United States, the latitude and longitude ranges will need to be shrunk down a bit (it does indeed look like globally gridded data) and we’ll be able to shrink the data set a bit more since we only want to look at the stronger trends.

We don’t really need modern R/ggplot2 mapping idioms for this project (i.e. the new {sf} ecosystem), so we’ll keep it “simple” (scare quotes since that’s a loaded term) and just use the built in maps and geom_map(). First, let’s get the U.S. states and extract their bounding boxes/limits:

maps::map("state", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% 
  fortify() %>% 
  as_tibble() -> state_map

xlim <- range(state_map$long)
ylim <- range(state_map$lat)

NOTE: I tend not to use the handy ggplot2::map_data() function since it ends up clobbering purrr::map() which I use heavily (though not in this post). I also try to use {sf} these days so this tends to not be an issue anymore anyway.
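For reference, the map_data() idiom that NOTE refers to would look roughly like this and yields the same columns/ranges:

# the more common idiom (see the NOTE above for why I avoid it here)
state_map <- as_tibble(ggplot2::map_data("state"))

xlim <- range(state_map$long)
ylim <- range(state_map$lat)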

Now, let’s focus in on the target area in the original paper and the Axios article:

filter(
  tornado,
  between(lon, -107, xlim[2]), between(lat, ylim[1], ylim[2]), # -107 gets us ~left-edge of TX
  ((trend < -0.07) | (trend > 0.07)) # approximates notebook selection range
) -> tornado

#+ grid-overview-2
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

Now we’re getting close to our final solution.

As stated in the Observable notebook and implied by the word “grid” these dots are centroids of grid rectangles. This means we really want boxes, not points. The article got all fancy but it’s not really necessary since we can use ggplot2::geom_tile() to get us said boxes:

#+ grid-overview-3
ggplot(tornado, aes(lon, lat)) +
  geom_tile(aes(fill = trend, color = trend))

Now, we just need to add in map layers, and tweak some aesthetics to make it look like a map. We’ll start naively:

#+ map-1
ggplot() +
  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +
  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = "black", size = 0.125, fill = NA
  )

Our gridded data is definitely covering the right/same areas so we just need to make this more suitable for an article. We’ll use Harry’s palette and layer in U.S. state borders, an overall country border, and approximate the title and legend aesthetics:

#+ map-final
c(
  "#023858", "#045a8d", "#0570b0", "#3690c0", "#74a9cf",
  "#a6bddb", "#d0d1e6", "#ece7f2", "#fff7fb", "#ffffff",
  "#ffffcc", "#ffeda0", "#fed976", "#feb24c", "#fd8d3c",
  "#fc4e2a", "#e31a1c", "#bd0026", "#800026"
) -> grad_cols # colors from article

ggplot() +

  # tile layer

  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +

  # state borders

  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = ft_cols$slate, size = 0.125, fill = NA
  ) +

  # usa border

  borders("usa", colour = "black", size = 0.5) +

  # color scales

  scale_colour_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +
  scale_fill_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +

  # make it Albers-ish and ensure we can fit the borders in 

  coord_map(
    projection = "polyconic",
    xlim = scales::expand_range(range(tornado$lon), add = 2),
    ylim = scales::expand_range(range(tornado$lat), add = 2)
  ) +

  # tweak legend aesthetics

  guides(
    colour = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    ),
    fill = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    )
  ) +
  labs(
    x = NULL, y = NULL
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.key.width = unit(4, "lines")) +
  theme(legend.key.height = unit(0.5, "lines"))

FIN

I went through some extra steps for folks new to R but the overall approach was at least as expedient as the Observable one and — despite the claims of the quite daft retweet — this is no less “shareable” or “reusable” than the Observable notebook. You can clone the repo (https://git.sr.ht/~hrbrmstr/tornado) and reuse this work immediately.

If you take a stab at an alternate approach — especially if you do use {sf} — definitely blog about it and drop a link here or on Twitter.

I’m using GitUgh links here b/c the issue was submitted there. Those not wishing to be surveilled by Microsoft can find the macOS QuickLook plugin project and {rdatainfo} project in SourceHut and GitLab (~hrbrmstr and hrbrmstr accounts respectively).

I hadn’t touched QuickLookR or {rdatainfo} at all since 2016 as it was really just an example proof of concept. Yet the suggestion to have it handle R Markdown (Rmd) files felt useful, so I updated {rdatainfo} to better handle data loading for rds, rdata, and rda file extensions and made a small update to the macOS {QuickLookR} QuickLook extension project to treat Rmd files as text files which can be previewed and then edited with the default Finder extension editor you (or your apps) have set for Rmd files.

The {rdatainfo} package is only needed if you need/want R data file preview support (i.e. it’s not necessary for R markdown files). Just unzip the plugin release and put it into ~/Library/QuickLook. Here are examples for the four file types (the example code under saveRDS() and save() was used to generate those data files and the R markdown file is the default one):

file icons

Rmd preview

rds preview

rdata preview
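If you want to generate similar test files to try the previews out yourself, a minimal sketch (using built-in datasets and arbitrary paths) would be:

# an rds file for the rds preview
saveRDS(mtcars, "~/Desktop/mtcars.rds")

# an rdata file holding a couple of objects for the rdata preview
save(mtcars, iris, file = "~/Desktop/example.rdata")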

FIN

This is my first Xcode app build under macOS 10.14 so definitely file issues if you’re having trouble installing or compiling (there are some new shared library “gotchas” that I don’t think apply to this Xcode project but may).

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{Rselenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while—even on freshly minted systems—so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “elements” panel. You’ll see that the quote content is in javascript right there in the source, but there are no quote <div> elements since those are generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it’ll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} over at my usual sourcehut / gitlab / gitugh spots.

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.
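To get a feel for what “translation” means here, dbplyr exposes translate_sql(), which shows the SQL a given R expression would turn into (output is approximate and will vary by dbplyr version and simulated backend):

library(dbplyr)

translate_sql(toupper(author), con = simulate_dbi())
## e.g. <SQL> UPPER(`author`)

# functions dbplyr doesn't know about get passed through as-is
translate_sql(DATE_TRUNC('month', created_at), con = simulate_dbi())
## e.g. <SQL> DATE_TRUNC('month', `created_at`)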

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn’t been a reference for this until Dewey & I accepted Gordon’s challenge. What makes this a “challenge” is that you first have to figure out what packages provide a d[b]plyr backend interface then figure out what SQL translations they offer (they don’t necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren’t in functional form). So the first step was just a look through CRAN for which packages import dbplyr and also adding in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.packages(pkgs) which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core “tell” (i.e. they implement a sql_translate_env() method):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
##    pkg       trans
##    <chr>     <chr>
##  1 bigrquery sql_translate_env.BigQueryConnection
##  2 civis     sql_translate_env.CivisConnection
##  3 dbplyr    sql_translate_env.ACCESS
##  4 dbplyr    sql_translate_env.Oracle
##  5 dbplyr    sql_translate_env.SQLiteConnection
##  6 dbplyr    sql_translate_env.Impala
##  7 dbplyr    sql_translate_env.OdbcConnection
##  8 dbplyr    sql_translate_env.MySQLConnection
##  9 dbplyr    sql_translate_env.PqConnection
## 10 dbplyr    sql_translate_env.PostgreSQLConnection
## # … with 18 more rows

Now we know the types of connections that package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
##    pkg       trans              ƒ
##    <chr>     <chr>              <chr>
##  1 bigrquery BigQueryConnection median
##  2 bigrquery BigQueryConnection gsub
##  3 bigrquery BigQueryConnection as.logical
##  4 bigrquery BigQueryConnection is.null
##  5 bigrquery BigQueryConnection case_when
##  6 bigrquery BigQueryConnection is.na
##  7 bigrquery BigQueryConnection if_else
##  8 bigrquery BigQueryConnection str_replace_all
##  9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn’t make the featured image linkable so I had to stick it in again to enable the link so folks can make it full size, which is absolutely necessary to see it.)

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.