
I was pretty brutal to Apple earlier this week in a Twitter thread that I tried to craft so it occurred in-line with the WWDC live stream (which might be something you want to remember as/if you read on). I really don’t care about “memojis” and I’m seriously dismayed by the pretty obvious fact that Apple intends to dumb down computing by shifting most folks from Macs to iPads. Their new “Pro” is for design folks and I’m not holding my breath for them to re-embrace the developer/data science communities with better laptops or smaller cheese graters.

The “meh” hardware/software announcements aren’t the worst parts of these events. The TED-esque scripting (including many failed attempts at faux “authentic” humor) is also becoming quite tedious. I joked about analyzing the “adverbs per minute” but it took a few days for their WWDC 2019 keynote video with a subtitle track to emerge. As a result, current time constraints prevent a dive into the subtitles themselves, but that doesn’t mean you can’t have some fun with them.

Read on to see how I scraped the subtitles or skip to the end to read more about this “Reader Challenge”.

Not So Subtle Subtitle Scraping

If you go to the aforelinked WWDC video URL you’ll see a control on the lower right to add a subtitle track. If you do that with browser Developer Tools open you’ll see what that does:

webdevtools subtitle index screen shot

These are WebVTT-formatted subtitles, a format/syntax that enables them to be displayed at the correct playback timecode. We can see how many of them there are by looking at the end of the file:

count of subtitles

So, there are 621 of them and each is requested individually (and super-fast, in-parallel). What do these individual requests look like? Just select one of them to take a look. They’re just plain text responses (it’s not a super-intricate format).

Let’s grab one of them to the clipboard and use the {curlconverter} package to turn that into an httr::GET() request via the straighten() and make_req() functions:
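In case you’ve not used {curlconverter} before, the basic flow looks something like this (a sketch; it assumes the request was copied from Developer Tools as a cURL command line):

library(curlconverter)

# straighten() parses cURL command line(s); with no arguments it looks at the clipboard
straighten() %>% 
  make_req() # build httr::GET()-compatible request function(s) from the parsed call(s)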

I went ahead and wrapped it into a fairly-well-named function, but the GET request is virtually untouched from the aforementioned process. I just added the {idx} template into the request URL so we can glue() the right index into it. It is likely that some headers could have been eliminated but I just went with what {curlconverter} processed and returned this time.

library(stringi)
library(subtools) # https://github.com/hrbrmstr/subtools ; (ORIG: https://github.com/fkeck/subtools)
library(tidytext)
library(purrrogress) # tidy progress bars for free!
library(tidyverse)

#' Fetches a subtitle by index from the 2019 Apple WWDC Keynote subtitle stream
get_subtitle <- function(idx = 1) {

  st_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_{idx}.webvtt"
  st_url <- glue::glue(st_url)

  httr::GET(
    url = st_url,
    httr::add_headers(
      `sec-ch-ua` = "Google Chrome 75",
      `Sec-Fetch-Mode` = "cors",
      Origin = "https://developer.apple.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
      Referer = "https://developer.apple.com/videos/play/wwdc2019/101/",
      `Sec-Fetch-Dest` = "empty",
      `Sec-Fetch-Site` = "cross-site"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  out <- stringi::stri_split_lines(out)

  purrr::flatten_chr(out)

}

Let’s see what one looks like:

(tmp <- get_subtitle(1))
## [1] "WEBVTT"                                          
## [2] "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000"
## [3] ""                                                
## [4] "3"                                               
## [5] "00:00:21.199 --> 00:00:22.333"                   
## [6] ">> FEMALE SPEAKER:"                              
## [7] "Don't stay up too late."                         
## [8] ""                                                
## [9] ""                 

Looking good! But, it’s just plain characters and I don’t feel like writing a subtitle parser. And, I don’t have to! François Keck has the {subtools} package which we can use. But, it (used to) only work on files. It now works on character vectors as well (but you’ll need to install it from my fork until the PR is merged). Let’s turn this set of noise into something we can use:

as_subtitle(tmp, format = "webvtt") %>% 
  flatten_df()

## # A tibble: 1 x 4
##   ID    Timecode.in  Timecode.out Text                                
##   <chr> <chr>        <chr>        <chr>                               
## 1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up to…

So tidy!

We now need to get all of the subtitles. We’ll do that fast since the video player retrieves them even faster than this iteration does:

# no crawl delay b/c the video player grabs these even faster than this code does
map(1:621, with_progress(get_subtitle)) %>% # with_progress gets you a progress bar for free
  map(as_subtitle, format = "webvtt") %>% 
  flatten_df() %>% 
  as_tibble() -> apple_subs

apple_subs
## # A tibble: 3,220 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 10    00:02:05.200 00:02:12.533 (Applause)                         
## 10 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## # … with 3,210 more rows

Streaming subtitles aren’t error-free and often get duplicated, so let’s see if that’s the case:


# Any dups?
distinct(apple_subs)
## # A tibble: 2,734 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## 10 12    00:02:15.633 00:02:18.166 Thank you.                         
## # … with 2,724 more rows

apple_subs <- distinct(apple_subs)

There were dups, but not anymore!

You can get that data frame via: http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz.
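If you’d rather not scrape, something along these lines (one of several ways) will pull that file down and read it in:

# grab the pre-scraped subtitle data frame
tf <- tempfile(fileext = ".csv.gz")
download.file("http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz", tf)
apple_subs <- readr::read_csv(tf)
unlink(tf)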

I wanted to see if these looked OK so I dumped just the text to a file and opened it up in Sublime Text to spot check:


apple_subs %>% 
  pull(Text) %>% 
  write_lines("/tmp/subs.txt")

system("subl /tmp/subs.txt") # dblchk

Since we have a good capture of what was spoken, we can start the analysis process:

distinct(apple_subs) %>% 
  filter(!grepl("^\\(|^>>", Text)) %>%
  unnest_tokens(word, Text) %>% 
  anti_join(get_stopwords()) %>% 
  count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 2,408 x 2
##    word      n
##    <chr> <int>
##  1 now     246
##  2 can     205
##  3 new     142
##  4 like    119
##  5 just    106
##  6 app      77
##  7 great    74
##  8 apple    69
##  9 right    64
## 10 apps     59
## # … with 2,398 more rows

And, that’s where I ran out of time.

Reader Challenge

You’ve got the cleaned WWDC 2019 Keynote subtitle track and access to my brutal WWDC 2019 Twitter thread. What fun can you have with it? I’d still like to know the adverbs-per-‘n’ (and what kind they were). But, what else can you discover? Is there a pattern of emotional manipulation through word choices at different times? Did they change tone/style throughout the event? What other questions can you ask and tease out with data?

Drop links to your creations (and separate links to code) in the comments and I’ll re-broadcast them on Twitter and gather them all up into a new post to see what y’all came up with.

FIN

There’s no deadline as I can keep on curating as new submissions come in. While this is most assuredly an R-focused blog, there is no restriction on the tools you use, either.

Hopefully this will be a fun/creative exercise for folks. If you have any questions about the scraping process or about the WebVTT format don’t hesitate to ping me here or on Twitter (@hrbrmstr).

A user of the {ggalt} package recently posted a question about how to add points to a geom_dumbbell() plot. For now, this is not something you can do with geom_dumbbell() itself, but with a bit of data wrangling you can accomplish it in a pretty straightforward manner with just your data and ggplot2. The example below uses 3 values per category but it should scale to n values per category (though after a certain n you should reconsider the use of a dumbbell chart in favour of a more appropriate way to visualize the message you’re trying to convey).

Here’s the setup:

library(hrbrthemes)
library(tidyverse)

tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
) -> xdf

Three values per category. The approach is pretty straightforward:

  • reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  • reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  • reshape the data frame & plot the points

I’ve put ^^ notes near each ggplot2 geom:

ggplot() +
  # reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      top_n(-1) %>% 
      slice(1) %>%
      ungroup(),
    aes(x = 0, xend = val, y = cat, yend = cat),
    linetype = "dotted", size = 0.5, color = "gray80"
  ) +
  # reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      summarise(start = range(val)[1], end = range(val)[2]) %>% 
      ungroup(),
    aes(x = start, xend = end, y = cat, yend = cat),
    color = "gray80", size = 2
  ) +
  # reshape the data frame & plot the points
  geom_point(
    data = gather(xdf, measure, value, -cat),
    aes(value, cat, color = measure), 
    size = 4
  ) +
  # i just extended the scale a bit + put axis on top; choose aesthetics that work 
  # for you
  scale_x_comma(position = "top", limits = c(0, 10)) +
  scale_color_ipsum(name = "A real legend title") +
  labs(
    x = "Description of the value", y = NULL,
    title = "A good plot title"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

And, here’s the result:

In a recent previous post I brazenly talked over the “hard parts” of how I got to the target SQLite file that houses “mowing history” for what has become my weekend obsession. So, we’ll cover just how to do that (find things in iOS backups) in this post along with how to deal with some “gotchas” if you’re doing this from macOS.

macOS (the Knife)

Kurt Weill, Bertolt Brecht, and Marc Blitzstein created some amazing lyrics that Bobby Darin did some sweet, sweet justice to:

I bring that up to talk about the cutting, biting, dangerous edge of macOS that is Apple’s somewhat mixed attempt at protecting your privacy and keeping sensitive files and directories out of sight from the sharp teeth of malware (and to re-pimp my {mactheknife} package). You can read up on Apple’s new protections more thoroughly over at The Eclectic Light Company. For the purposes of this blog post, Apple’s macOS Sandbox policies mean you have to do some extra steps to gain access to the folder and files associated with iOS backups (which is ~/Library/Application Support/MobileSync/Backup/).

If you want RStudio, R, and anything run with Rscript to access these sandboxed areas you’ll need to enable “Full Disk Access” for those apps and executables. First you’ll need to open System Preferences > Security & Privacy and then make the Privacy tab active. Keep that window open and tap the lock to unlock the settings.

Full Disk Access settings panel in macOS

Adding RStudio is easy. Just make Finder active and hit Cmd Shift A and then find and drag the “RStudio” application into the pane+tab you opened in the previous step. Back in the Finder, hit Cmd Shift G and paste in: /Library/Frameworks/R.framework/Resources/bin and go to that folder. Drag in R and Rscript each into the pane+tab from the aforementioned step. Finally (and this got me for a minute) you also need to (again, in Finder) hit Cmd Shift G and paste in /Library/Frameworks/R.framework/Versions/3.6/Resources/bin/exec and drag that R executable into the Security & Privacy Privacy/Full Disk Access pane+tab as well. When you’ve done all that, lock the System Preferences pane and close it.

It is important to note that you just gave “R” and anything that calls R from your user space complete (well, almost) access to every sandboxed area on your system. R is a general purpose programming and scripting language which means any bit of malicious code that knows you have added those executables can use R to read from and write to any area on your system.

It is also important to note that I had to use 3.6 vs the Current symlink for the last entry so that means you need to do this for each new R version you install.
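If you’d rather not squint at Finder paths, you can pull the version-specific locations straight from an R session (you’ll need to re-do the drag-and-drop after each R upgrade anyway):

# the .../Resources/bin directory holding the R & Rscript wrappers
R.home("bin")

# the actual R binary (the 'exec' entry that got me for a minute)
file.path(R.home("bin"), "exec", "R")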

I hope folks on legacy Windows OS installs didn’t skip over this part as you’ll need to go here to figure out where your iOS backups folder is to go through the rest of the post.

Sneakin’ Round The Corner

Windows folks hopefully read at least the last bit of the previous section to figure out where their iOS backups are. On macOS that’s ~/Library/Application Support/MobileSync/Backup/. You need a local backup there (most folks just use iCloud backups these days) and Apple tells you how to do this.

Once you know you’ve got an (unencrypted) backup just go to your iOS backups directory and list the files by date and note the name/path of the most recent backup. Now we can have some fun.
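You can do that listing in the Finder or straight from R; here’s a quick sketch (macOS path shown; substitute your Windows backup folder as needed):

backup_root <- path.expand("~/Library/Application Support/MobileSync/Backup")

backups <- list.dirs(backup_root, recursive = FALSE)
backup_info <- file.info(backups)

# most recently modified backup directory
rownames(backup_info)[which.max(backup_info$mtime)]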

library(XML) # to read plist (property list) files
library(tidyverse) # for printing and access to sqlite dbs

# replace this with the relative path to your most recent backup dir
mb <- "~/Library/Application Support/MobileSync/Backup/28500cd31b9580aaf5815c695ebd3ea5f7455628-20190601-165737"

list.files(mb, pattern = ".*\\.(db|plist)$")
## [1] "Info.plist"      "Manifest.db"      "Manifest.plist"      "Status.plist"

The above code looks for some key metadata files for iOS backups.

  • Info.plist has info on your device
  • Manifest.db has tons of info on all the files in the backup in a SQLite database
  • Manifest.plist has some additional metadata on the backup including applications included in the backup
  • Status.plist contains info on the status of the backup

Let’s take a look at the plists:

info_p <- file.path(mb, "Info.plist")
file.copy(info_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Info.plist", "/tmp/Info.plist"))

info <- XML::readKeyValueDB("/tmp/Info.plist")

str(info)
## List of 11
##  $ Device Name        : chr REDACTED
##  $ Display Name       : chr REDACTED
##  $ ICCID              : chr REDACTED
##  $ IMEI               : chr REDACTED
##  $ IPBE Backup Version: int 1
##  $ Last Backup Date   : POSIXct[1:1], format: "2019-06-01 21:23:02"
##  $ Phone Number       : chr REDACTED
##  $ Product Type       : chr REDACTED
##  $ Product Version    : chr REDACTED
##  $ Serial Number      : chr REDACTED
##  $ Unique Identifier  : chr REDACTED

status_p <- file.path(mb, "Status.plist")
file.copy(status_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Status.plist", "/tmp/Status.plist"))

status <- XML::readKeyValueDB("/tmp/Status.plist")

str(status)
## List of 6
##  $ BackupState  : chr "new"
##  $ Date         : POSIXct[1:1], format: "2019-06-01 21:22:53"
##  $ IsFullBackup : logi FALSE
##  $ SnapshotState: chr "finished"
##  $ UUID         : chr REDACTED
##  $ Version      : chr "3.3"

mainf_p <- file.path(mb, "Manifest.plist")
file.copy(mainf_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Manifest.plist", "/tmp/Manifest.plist"))

manifest <- XML::readKeyValueDB("/tmp/Manifest.plist")

str(manifest, 1)
## List of 8
##  $ Applications        :List of 745
##  $ BackupKeyBag        : chr __truncated__
##  $ Date                : POSIXct[1:1], format: "2019-06-01 20:57:40"
##  $ IsEncrypted         : logi FALSE
##  $ Lockdown            :List of 12
##  $ SystemDomainsVersion: chr "24.0"
##  $ Version             : chr "10.0"
##  $ WasPasscodeSet      : logi TRUE

You’ll note we’re making copies of these files (never play with system-managed files directly unless you know what you’re doing) and turning binary property lists into plain text XML property lists as well so we can read them with the XML::readKeyValueDB() function.
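Since we end up doing that copy/convert/read dance three times, here’s a tiny convenience wrapper (just a sketch for further plist spelunking; it isn’t part of any package):

read_ios_plist <- function(backup_dir, plist_name, work_dir = tempdir()) {
  src <- file.path(backup_dir, plist_name)
  dst <- file.path(work_dir, plist_name)
  file.copy(src, dst, overwrite = TRUE)                            # work on a copy, never the original
  system2("plutil", args = c("-convert", "xml1", "-o", dst, dst))  # binary plist -> XML
  XML::readKeyValueDB(dst)
}

# e.g. info <- read_ios_plist(mb, "Info.plist")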

Most of that information is fairly useless for this blog post but I figured you might like to see the hidden things the system knows about your devices. What we do want to check is whether the John Deere application and data made it into the backup. The Applications slot is a named list of application metadata. Let’s see if there’s anything Deere-ish in it:

grep("deere", names(manifest$Applications), ignore.case = TRUE, value = TRUE)
##                   key 
## "com.deere.mowerplus"

str(manifest$Applications$com.deere.mowerplus, 1)
## List of 4
##  $ CFBundleIdentifier   : chr "com.deere.mowerplus"
##  $ CFBundleVersion      : chr "180"
##  $ ContainerContentClass: chr "Data/Application"
##  $ Path                 : chr "/var/containers/Bundle/Application/30DF2640-A9AA-43A0-AD87-932CA513D75A/MowerPlus.app"

Aye! This means we should have some luck finding “mower” data in the Manifest SQLite database.

Now, we could try to follow UUIDs around but we can also take a stab at a less cumbersome approach. Let’s make a copy of the Manifest database and see what it holds:

mainf_d <- file.path(mb, "Manifest.db")
file.copy(mainf_d, "/tmp", overwrite = TRUE)
## [1] TRUE

(manifest_db <- src_sqlite("/tmp/Manifest.db"))
## src:  sqlite 3.22.0 [/private/tmp/Manifest.db]
## tbls: Files, Properties

We want to get to (hopefully) a SQLite file with the mowing data so we likely care about the Files table. Let’s take a look at the structure of that table:

(fils <- tbl(manifest_db, "Files"))
## # Source:   table<Files> [?? x 5]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    fileID           domain         relativePath           flags        file
##    <chr>            <chr>          <chr>                  <int>      <blob>
##  1 c1da4199a18d0b5… AppDomain-com… ""                         2 <raw 437 B>
##  2 7426ac0386e2887… AppDomain-com… Library                    2 <raw 444 B>
##  3 a6393e739e1ad37… AppDomain-com… Library/WebKit             2 <raw 444 B>
##  4 c54f5c77a5e970b… AppDomain-com… Library/WebKit/Websit…     2 <raw 458 B>
##  5 578f2c96f219e95… AppDomain-com… Library/WebKit/Websit…     2 <raw 465 B>
##  6 c8833032ce7c9e9… AppDomain-com… Library/WebKit/Websit…     2 <raw 481 B>
##  7 6af21902e595f7c… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
##  8 4c1c49324646af0… AppDomain-com… Library/WebKit/Websit…     2 <raw 471 B>
##  9 d0636bf9b5ba2ae… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
## 10 0b6bb30c8abaa4e… AppDomain-com… Library/Preferences        2 <raw 458 B>
## # … with more rows

If you have a ton of apps, this is a pretty big haystack to comb through. We may be able to narrow things down a bit, though, and we’ll start by seeing what that domain column holds:

distinct(fils, domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    domain                                 
##    <chr>                                  
##  1 AppDomain-Outils-OBD-Facile.EOBD-Facile
##  2 AppDomain-ch.threema.iapp              
##  3 AppDomain-co.humanco.Human             
##  4 AppDomain-co.ortatech.colr-app         
##  5 AppDomain-co.vero.app                  
##  6 AppDomain-com.7thg.Tides               
##  7 AppDomain-com.AerLingus                
##  8 AppDomain-com.BloomSky.BloomSky        
##  9 AppDomain-com.PunchThrough.LightBlue   
## 10 AppDomain-com.agilebits.onepassword-ios
## # … with more rows

So, these are app-specific and the bits after the - in each one look like the CFBundleIdentifiers from above. Let’s make sure:

filter(fils, lower(domain) %like% "%com.deere.mowerplus%") %>% 
  distinct(domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   domain                       
##   <chr>                        
## 1 AppDomain-com.deere.mowerplus

Aye! Let’s check to see what files are in there (and hope for a nice SQLite database):

filter(fils, domain == "AppDomain-com.deere.mowerplus") %>%
  select(relativePath) %>% 
  collect() 
## # A tibble: 14 x 1
##    relativePath                                                     
##    <chr>                                                            
##  1 ""                                                               
##  2 Library                                                          
##  3 Library/Preferences                                              
##  4 Library/Application Support                                      
##  5 Library/Application Support/com.crashlytics                      
##  6 Documents                                                        
##  7 Library/GoAppPersistentStore-GoMow.sqlite                        
##  8 Library/googleanalytics-aux-v4.sql                               
##  9 Library/googleanalytics-v3.sql                                   
## 10 Library/Preferences/com.deere.mowerplus.plist                    
## 11 Library/Application Support/ActivityCards.sqlite                 
## 12 Library/Application Support/com.crashlytics/CLSUserDefaults.plist
## 13 Library/googleanalytics-v2.sql                                   
## 14 Library/Application Support/MowTracking.sqlite

It turns out that last one is what we’re looking for. Now we just need a bit of crucial metadata to get to it:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID, relativePath)
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                            relativePath                           
##   <chr>                             <chr>                                  
## 1 ad0009ec04c44b544d37bfc7ab343869… Library/Application Support/MowTrackin…

That fileID maps to the seriously ugly directory tree that is the rest of the iOS backups folder (you likely looked into it and wondered “What the heck?!”). The top level is a 2-digit hex prefix with files underneath it (likely for performance reasons but a bit of obfuscation never hurts, too). We’ll get the whole string:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                                  
##   <chr>                                   
## 1 ad0009ec04c44b544d37bfc7ab3438697d23d618

and, then copy over the mowing database somewhere safe to work on:

file.copy(
  file.path(mb, "ad/ad0009ec04c44b544d37bfc7ab3438697d23d618"),
  "/tmp/mowtrack.sqlite",
  overwrite = TRUE
)
## [1] TRUE
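# Aside: that "ad/ad0009…" path follows directly from the two-hex-character
# prefix scheme mentioned above, so a tiny helper (illustrative only, not from
# any package) saves retyping it for other fileIDs:
backup_file_path <- function(backup_dir, file_id) {
  file.path(backup_dir, substr(file_id, 1, 2), file_id)
}
# backup_file_path(mb, "ad0009ec04c44b544d37bfc7ab3438697d23d618")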

mow <- src_sqlite("/tmp/mowtrack.sqlite")

mow
## src:  sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## tbls: Z_METADATA, Z_MODELCACHE, Z_PRIMARYKEY, ZACTIVITY, ZDEALER,
##   ZMOWALERT, ZMOWER, ZMOWLOCATION, ZSMARTCONNECTOR, ZUSER

tbl(mow, "ZMOWLOCATION") %>% 
  glimpse()
## Observations: ??
## Variables: 16
## Database: sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## $ Z_PK                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Z_ENT               <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ Z_OPT               <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZISPAUSEDPOINT      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZORDER              <int> 1, 2, 0, 11, 20, 58, 38, 43, 30, 25, 21, 10,…
## $ ZSESSION            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZSESSION2           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZALTITUDE           <dbl> 42.64804, 42.70590, 40.99661, 39.54770, 38.2…
## $ ZCOURSE             <dbl> 358.242188, 332.226562, 18.281250, 260.85937…
## $ ZHORIZONTALACCURACY <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ ZLATITUDE           <dbl> 43.25913, 43.25914, 43.25913, 43.25915, 43.2…
## $ ZLONGITUDE          <dbl> -70.80069, -70.80069, -70.80069, -70.80067, …
## $ ZSPEED              <dbl> 0.0000000, 0.4250179, 0.5592341, 0.3802792, …
## $ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, …
## $ ZVERTICALACCURACY   <dbl> 6, 6, 8, 6, 4, 4, 4, 3, 4, 4, 4, 6, 4, 4, 4,…
## $ ZKLVDATA            <blob> <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <…

(Shark) FIN

Even if you don’t have this mower app that I’m currently obsessed with, you now have a primer on how to get to SQLite databases stored by any application on your iOS device. That alone may unearth some fun projects for you to hack on. Plus, you also learned a bit on how to do some light forensics on iOS backups with R/RStudio.

If you did your own trawling and found something interesting definitely blog or tweet about it and drop a link in the comments.

For the first time ever we got a new riding mower this weekend. We’ve always haggled to keep the one the sellers were using with any given house we’ve purchased over the years (the ones big enough to have a yard that “requires” a riding mower).

We ended up getting a model from John Deere and the manual (yes, I actually read the manual) noted they had a Mower Plus app to track “things”. Given the lack of built-in sensors I figured this was just something kitschy but it’s not bad (they don’t do anything super-evil tracker-wise) and if you use it while mowing it tracks anything your mobile device can and will notify you of prescribed service events.

I ran the app through a local proxy server before using it in the field and it only phoned home to check the serial number and get weather info, which meant there was no hidden API to capture and mimic to get access to mowing data. But, that meant it stores everything it needs locally to the app. Since I use an iOS mobile device (you should too; Android is a security mess) I figured that meant it stored everything in on-device SQLite databases.

A quick review of an immediate backup confirmed that this posit was correct and I started going crazy (as you’ll see in the aforelinked Twitter thread).

However, it had these cray cray timestamps:

$ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, 581100290, 581100328, 581100308, 581…

Thanks to some fairly fast poking by @dabdine, it turns out these are Apple Cocoa Core Data timestamps. They can come in two flavors:

  • seconds since 2001-01-01 00:00:00
  • nanoseconds since ^^

The entire reason for this post (and, some of the needless verbosity) was to get this into the internet’s long-ish term memory stores (Google/Internet Archive) so others can reference it as well.

If you come across one of these beasts: 2019 dates will likely start with either 56, 57, 58, 59 or 60 (enumerating them to make Google searches more able to pick up this post) and 2020 dates would start with 60, 61, 62, or 63 (and that should be sufficient for savvy searchers). So, if you’re not having luck using Excel date tricks, {lubridate} functions or {anytime}, try transforming a sample timestamp with:

as.POSIXct(sample_timestamp, origin = "2001-01-01")

to see if you get a more “expected”/realistic value and know you’ve got a Core Data timestamp. You can even stick:

from_coredata_ts <- function(x, tz = NULL) {
  .POSIXct(ifelse(
    test = floor(log10(x)) >= 10, # If you're still using R in 2317 then good on ya and edit this
    yes = as.POSIXct(x/10e8, origin = "2001-01-01"), # nanoseconds coredata
    no = as.POSIXct(x, origin = "2001-01-01") # seconds coredata
  ), tz = tz)
}

into your RStudio snippets file or in a personal “misc” package for quick-access to converting both kinds of Core Data timestamps into proper R POSIXct objects.
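As a quick sanity check, feeding it one of the raw ZTIMESTAMP values shown earlier lands right around the date of the backup (June 2019) instead of the 1988 value a Unix-epoch origin would give you:

from_coredata_ts(581100271)
# ~ 2019-06-01 (rendered in your local time zone); the same number interpreted
# with a 1970-01-01 origin lands back in 1988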

The Feedly category I have set up for git-stalking has indicated a fairly massive interest in Joshua Levy’s The Art of the Command Line. What is “The Art of the Command Line”? To quote the author(s):

Fluency on the command line is a skill often neglected or considered arcane, but it improves your flexibility and productivity as an engineer in both obvious and subtle ways. This is a selection of notes and tips on using the command-line that we’ve found useful when working on Linux. Some tips are elementary, and some are fairly specific, sophisticated, or obscure. This page is not long, but if you can use and recall all the items here, you know a lot.

It’s a great resource just the way it is (simple, plain markdown rendered in GitUgh). But, we can make it even greater with some help from rmarkdown::render() and some content slicing & dicing.

My initial thought was to grab the English version, put an R Markdown YAML header on it, remove some intro cruft and render it to standalone HTML. While that would be quick, easy and useful it’s also very manual and brittle since updating it would require copy and paste; plus, it leaves out the translated versions.

So, goal number uno became “make a function to do this”. Then, I realized “Hey! This is a resource for command line stuff so why not turn the function into a command line tool!”. So this became goal number II. (I have an internal posit that R adoption would be much higher if there were more easy-to-install command line utilities built in R since that’s one reason Python has a larger install base and many folks end up just using the command line versions of modules they install. CRAN’s draconian rules on what you can do during a package install makes this somewhat moot, tho. One could argue that CRAN is doing the right thing and that Python/PyPI are woefully insecure-by-default which is also true.)

Goal Uno

Since we’re going to create a function it also makes sense to parameterize options for the language, doc-theme and highlight-theme.

The setup plan for this endeavour is pretty straightforward:

  • fetch the current set of translations available
  • check to make sure the desired translation is in ^^ set
  • grab a copy of the specified document
  • get the title (since that’s translated for each)
  • remove some unnecessary front-matter
  • turn the AUTHORS.md link into a proper link (vs relative)
  • add in the YAML header with the desired customizations
  • render the document to standalone HTML
  • optionally open it after render

And, this is what that looks like:

taotcl <- function(language = "", theme = "simplex", highlight = "espresso", output_dir = getwd(), open = TRUE) {

  language <- language[1]

  # find translations

  httr::GET(
    url = "https://api.github.com/repos/jlevy/the-art-of-command-line/contents/",
    httr::add_headers(
      `Accept` = "application/vnd.github.v3+json"
    ),
    httr::user_agent("taotcl R script; @hrbrmstr")
  ) -> res

  httr::stop_for_status(res)

  ls <- httr::content(res, as = "parsed")

  readmes <- Filter(function(.x) grepl("^README", .x), vapply(ls, `[[`, character(1), "name"))
  langs <- regmatches(readmes, regexpr("-[-[:alpha:]]+", readmes))

  # check to make sure a valid one was specified

  if (language != "") { # "" => English
    language <- sprintf("-%s", language)
    if (!(language %in% langs)) {
      stop(
        "Language '", sub("^-", "", language), 
        "' not found in repo. Current translations include: ",
        paste0(sprintf("'%s'", sub("^-", "", langs)), collapse = ", "),
        ".", call.=FALSE
      )
    }
  }

  # get the desired doc

  src <- "https://raw.githubusercontent.com/jlevy/the-art-of-command-line/master/README{language}.md"
  src <- glue::glue(src)

  l <- readLines(src)

  # find the title
  title <- sub("^#[[:space:]]*", "", l[which(grepl("^#[[:space:]]*", l))[1]])

  # figure out the cut line
  cowsay <- which(grepl("cowsay", l))[1]

  l <- l[-(1:(cowsay+1))] # cut

  # make the AUTHORS.md a useful link
  l <- gsub("(AUTHORS.md)", "(https://github.com/jlevy/the-art-of-command-line/blob/master/AUTHORS.md)", l, fixed = TRUE)

  theme <- theme[1]
  highlight <- highlight[1]

'---
title: "{title}"
author: "Joshua Levy"
email: "joshua@cal.berkeley.edu"
output: 
  html_document:
    theme: {theme}
    highlight: {highlight}
    toc: true
    toc_float: true
    toc_depth: 2
---

' -> yaml

  # fill in the YAML
  yaml <- glue::glue(yaml)

  tf <- tempfile(fileext = ".Rmd")
  on.exit(unlink(tf), add = TRUE)
  writeLines(c(yaml, l), tf)

  # render the doc
  rmarkdown::render(
    input = tf,
    output_file = sprintf("%s.html", tolower(gsub(" ", "-", title))),
    output_dir = output_dir[1],
    quiet = TRUE
  ) -> loc

  # open in browser
  if (open[1]) browseURL(loc)

  message("Rendered version is at '", loc, "'")

}

Running it with the defaults will have it look like this:

You don’t have to type it all in as that function is in the taotcl.R script over at my gitea / sourcehut / gitlab / gitugh.
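If you source it, a non-default call looks something like this (any of the listed translations and any valid R Markdown theme/highlighter combo will do; these particular picks are just an example):

# render the French translation with a different look into /tmp and open it
taotcl(language = "fr", theme = "darkly", highlight = "zenburn", output_dir = "/tmp")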

Goal II

Now that we have a function we can call from R we just need a wrapper around it. I kinda like the way David Shih put together his {argparser} package (it’s on CRAN) so we’ll make a wrapper for our rendering function with it.

We have pretty much the same goal list as the function in that we want to let users specify customizations. There are some additional ones as well (this is not an exhaustive list but it was “just enough” for this go):

  1. Make it easy for folks on real operating systems to use it without the need to use Rscript
  2. Let folks know what required packages they need to install if any are missing
  3. Be quiet when loading packages
  4. Assume friendly/useful defaults
  5. Provide long and short parameters (some folks like short, some like long)

The first few lines of the finished script will accomplish #1-3:

#!/usr/bin/env Rscript

needed <- c("magrittr", "argparser", "httr", "glue", "rmarkdown")
installed <- rownames(installed.packages())
missing <- needed[!(needed %in% installed)]

if (length(missing)) stop("Please install the following packages: ", paste0(sprintf("'%s'", missing), collapse = ", "), call.=FALSE)

suppressPackageStartupMessages({
  for (pkg in needed) {
    require(package = pkg, quietly = TRUE, warn.conflicts = FALSE, character.only = TRUE)
  }
})

Line 1 is a “hashbang”/”shebang” and — provided the file has the execute bit set — will let folks on *nix/macOS run the file without deliberately invoking Rscript. The rest just do the package checks and loads.

We need a way to get command line parameters in, hence the use of {argparser}. We’ll create an arg_parser object and then add arguments using the {magrittr} pipe (%>%). You can add long/short argument names as well as help and defaults (plus note whether an argument is a flag/toggle). Once we have those setup, we tell {argparser} to process any arguments provided by the user:

arg_parser(
  description = "Render 'The Art of the Command Line' to HTML"
) %>% 
  add_argument(
    arg = "--language",
    help = 'Language to render. Leave unspecified for English. Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh"',
    type = "character",
    short = "-l",
    default = ""
  ) %>% 
  add_argument(
    arg = "--theme",
    help = "Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-t",
    default = "simplex"
  ) %>% 
  add_argument(
    arg = "--highlight",
    help = "Which R Markdown code highlight theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-c",
    default = "espresso"
  ) %>% 
  add_argument(
    arg = "--output-dir",
    help = "Where to store the rendered file. Defaults to current working directory.",
    type = "character",
    short = "-o",
    default = getwd()
  ) %>% 
  add_argument(
    arg = "--just-render",
    help = "Only render the document. Do not open in the system default browser. (Default is to render and open.)",
    short = "-j",
    flag = TRUE
  ) -> parser

opts <- argparser::parse_args(parser)

Once we have those (the taotcl() function would come next in the source) then it’s just a matter of calling the function:

taotcl(
  language = opts$language,
  theme = opts$theme,
  highlight = opts$highlight,
  output_dir = opts$output_dir,
  open = is.na(opts$just_render) | (!opts$just_render)
)

If the command line program we’ve just made is called with a -h or --help the user will get:

usage: ./taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

or (on Windows): Rscript taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

Render 'The Art of the Command Line' to HTML


flags:
  -h, --help                    show this help message and exit
  -j, --just-render             Only render the document. Do not open in the system default browser. 
                                (Default is to render and open.)

optional arguments:
  -x, --opts OPTS               RDS file containing argument values
  -l, --language LANGUAGE       Language to render. Leave unspecified for English. 
                                Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", 
                                "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh" [default: ]
  -t, --theme THEME             Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ [default: simplex]
  -c, --highlight HIGHLIGHT     Which R Markdown code highlight theme to use. Ref: https://l.rud.is/2JOibrZ [default: espresso]
  -o, --output-dir OUTPUT-DIR   Where to store the rendered file. Defaults to current working directory. 
                                [default: /your/current/directory/here]

If we run it with, say, ./taotcl.R --language ru -o /tmp the script will process the correct language version and render it to /tmp/искусство-командной-строки.html plus auto-open it for us. It should look like:

FIN

As noted, you can find the entire script over at my gitea / sourcehut / gitlab / gitugh. It’ll eventually get over to GitLab & GitUgh (and a few others as I’m expanding the scripts I use to support social coding diversity vs hegemony) as well.

Note that you can leave off the .R and the hashbang will still work just fine so it’ll be even more straightforward to use.

If you don’t want to go through all this and just want standalone rendered versions of the resource just drop a note in comments and I’ll toss up a small Shiny app which will let you specify params and get a rendered version. You can find weekly renders of all translations at https://rud.is/taotcl/.

Finally, r-lib has some handy packages to make R-built command line utilities much, much cooler (which is a minor suggestion that PRs are welcome if you want to add some flavor to this fairly vanilla utility).

A fair bit of time ago the {gdns} package made its way to CRAN to give R users the ability to use Google’s (at that time) nascent support for DNS over HTTPS (DoH). A bit later on Cloudflare also provided a global DoH endpoint and that begat the (not-on-CRAN) {dnsflare} package.

There are actually two ways to make these DoH queries: one via an HTTPS GET REST API and the other via HTTPS POST queries that use DNS wireformat queries and replies. While the POST side of DoH is pretty standardized/uniform the GET/REST API side is kind of the Wild West. I wanted a way to have support for both wireformat and REST idioms but also not have to write a gazillion packages to support the eventual plethora of diverse DoH GET/REST API services.

I “solved” this by first augmenting my (not-on-CRAN) {clandnstine} package to support the POST wireformat DoH queries (since the underlying {getdns} library supports decoding wireformat responses) and creating a very small {playdoh} package which provided generic support for (hopefully) any DoH GET/REST endpoint.

DoH vs DoT

I made the {clandnstine} package primarily to support making DNS over TLS (DoT) queries but it makes sense to combine both DoH and DoT functionality into that package. The problem is that the legacy platform most of y’all R users are on (i.e. Windows) makes using that package problematic. Therefore, by separating out the DoH GET functionality into a standalone package I don’t have to write a DNS wireformat pure R response handler.

There are performance and other differences between DoH and DoT. I suspect most DNS providers and also most open source DNS servers will eventually support both DoH and DoT, so which one you use will be up to your clients and use cases.

A Tale of Two (or More) Queries

We’ll issue a few queries over DoH and DoT to a few servers to ensure we’re getting the same results.

library(clandnstine) # both of these are on sourcehut (~hrbrmstr/pkgname), 
library(playdoh)     # or gitlab/gitugh (hrbrmstr/pkgname)

# DoT
x <- gdns_context()
gdns_query(x, "example.com", rr_type = "a")$just_address_answers$address_data
## [1] "93.184.216.34"

# DoH POST (wireformat)
doh_post("example.com", "a", server_path = doh_servers$quad9$url)$answer$rdata$ipv4_address
## [1] "93.184.216.34"

# DoH GET (rest)
doh_get("example.com", "a", service_path = doh_servers$securedns_eu$url)$data[1]
## [1] "93.184.216.34"

To support the, er, diversity of requirements across various GET/REST endpoints the playdoh::doh_get() function has an extra_params parameter which lets you specify any required extra REST query params. Both packages have an exposed global variable doh_servers which has both the URL and any required extra parameters.
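Since the exact layout of doh_servers is package-defined (and endpoints come and go), a quick str() peek shows what’s available and which entries carry extra REST parameters; the commented call is just the shape of how you’d pass them along:

# peek at the exposed endpoint registry (URLs + any required extra params)
str(doh_servers, 2)

# then something along these lines, substituting whatever extra parameters your
# chosen endpoint requires (illustrative shape only):
# doh_get("example.com", "a", service_path = doh_servers$quad9$url, extra_params = ...)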

FIN

As usual, kick the tyres, file issues and PRs where you like and if you do end up using either package drop a note in the comments.

I’ve talked about the retailpocalypse before and this morning I was greeted with the news about Dressbarn closing all 650 stores as I fired up a browser.

I tweeted some pix and data but not everyone is on Twitter so I’m just posting a blog-blurb here with the code and data links.

Code is below and at https://paste.sr.ht/~hrbrmstr/af6da1af0314426255c65bc2fc254e0abb2190c3.

Data is at https://rud.is/dl/dressbarn-locations.json.gz.

Images are in a gallery below the code.

library(rvest)
library(stringi)
library(urltools)
library(hrbrthemes)    # for ft_cols, theme_ft_rc() & theme_ipsum_rc() used below
library(worldtilegrid) # install from sh/gl/gh or just remove the theme_enhance_wtg() calls
library(statebins)
library(tidyverse)

# this is the dressbarn locations directory page
pg <- read_html("https://locations.dressbarn.com/")

# this is the selector to get the main links
html_nodes(pg, "a.Directory-listLink") %>% 
  html_attr("href") -> locs

# PRE-NOTE
# No sleep() code (I looked at the web site, saw how many self-requests it makes for all DB
# resources and concluded that link scrapes + full page captures would not be burdensome
# plus they're going out of business)

# basic idea here is to get all the main state location pages
# some states only have one store so the link goes right to that so handle that condition
# for ones with multiple stores get all the links on the state index page
# for links on state index page that have multiple stores in one area,
# grab all those; then, concatenate all the final target store links into one 
# character vector.

keep(locs, ~nchar(.x) == 2) %>% 
  sprintf("https://locations.dressbarn.com/%s", .) %>% # state has multiple listings
  map(
    ~read_html(.x) %>% 
      html_nodes("a.Directory-listLink") %>% 
      html_attr("href") %>% 
      sprintf("https://locations.dressbarn.com/%s", .)
  ) %>% 
  append(
    keep(locs, ~nchar(.x) > 2) %>% sprintf("https://locations.dressbarn.com/%s", .) # state has one store
  ) %>% 
  flatten_chr() %>% 
  map_if(
    ~stri_count_fixed(.x, "/") == 4, # 4 URL parts == there's another listing page layer
    ~read_html(.x) %>% 
      html_nodes("a.Teaser-titleLink") %>% 
      html_attr("href") %>% 
      stri_replace_first_fixed("../", "") %>% 
      sprintf("https://locations.dressbarn.com/%s", .)
  )  %>% 
  flatten_chr() -> listings

# make a tibble with the HTML source for the final store location pages
# so we don't end up doing multiple retrievals

tibble(
  listing = listings,
  html_src = map_chr(listings, ~httr::GET(.x) %>% httr::content(as = "text"))
) -> dress_barn

# save off our work in the event we have a (non-R-crashing) issue
tf <- tempfile(fileext = ".rds")
print(tf)
saveRDS(dress_barn, tf) 

# now, get data from the pages
#
# first, turn all the character vectors into something we can get HTML nodes from
#
# dressbarn web folks handily put an "uber" link on each page so we get lon/lat for free in that URL
# they also handily used an <address> semantic tag in the proper PostalAddress schema format
# so we can get locality and actual address, too
mutate(
  dress_barn,
  parsed = map(html_src, read_html),
  uber_link = 
    map_chr(
      parsed, ~html_nodes(.x, xpath=".//a[contains(@href, 'uber')]") %>% 
        html_attr("href") 
    ), 
  locality = map_chr(
    parsed, ~html_node(.x, xpath=".//address/meta[@itemprop = 'addressLocality']") %>% 
      html_attr("content")
  ),
  address = map_chr(
    parsed, ~html_node(.x, xpath=".//address/meta[@itemprop = 'streetAddress']") %>% 
      html_attr("content")
  ),
  state = stri_match_first_regex(
    dress_barn$listing, 
    "https://locations.dressbarn.com/([[:alpha:]]+)/.*$"
  )[,2]
) %>% 
  bind_cols(
    param_get(.$uber_link, c("dropoff%5Blatitude%5D", "dropoff%5Blongitude%5D")) %>% 
      as_tibble() %>% 
      set_names(c("lat", "lon")) %>%
      mutate_all(as.double)
  ) -> dress_barn

# save off our hard work with the HTML source so we can do more later if need be
select(dress_barn, -parsed) %>% 
  saveRDS("~/Data/dressbarn-with-src.rds")

# save off something others will want
select(dress_barn, -parsed, -html_src, -listing) %>% 
  jsonlite::toJSON() %>% 
  write_lines("~/Data/dressbarn-locations.json.gz")

# simple map
ggplot(dress_barn, aes(lon, lat)) + 
  geom_jitter(size = 0.25, color = ft_cols$yellow, alpha = 1/2) +
  coord_map("polyconic") +
  labs(
    title = "Locations of U.S. Dressbarn Stores",
    subtitle = "All 650 locations closing",
    caption = "Source: Dressbarn HTML store listings;\nData: <https://rud.is/dl/dressbarn-locations.json.gz> via @hrbrmstr"
  ) +
  theme_ft_rc(grid="") +
  theme_enhance_wtg()

unlink(tf) # cleanup 

count(dress_barn, state) %>% 
  left_join(tibble(name = state.name, state = tolower(state.abb))) %>% 
  left_join(usmap::statepop, by = c("name"="full")) %>% 
  mutate(per_capita = (n/pop_2015) * 1000000) %>% 
  arrange(desc(per_capita)) %>% 
  select(name, n, per_capita) %>% 
  arrange(desc(per_capita)) %>% 
  complete(name = state.name) %>% 
  statebins(state_col = "name", value_col = "per_capita") +
  labs(title = "Dressbarn State per-capita closings") +
  theme_ipsum_rc(grid="") +
  theme_enhance_wtg()

I caught a re-tweet of this tweet by @harry_stevens:

Harry’s thread and Observable post are great on their own and both show the power and utility of Observable javascript notebooks.

However, the re-tweet (which I’m not posting because it’s daft) took a swipe at both Python & R. Now, I’m all for a good swipe at Python (mostly to ensure we never forget all those broken spacebars and tab keys that language has caused) but I’ll gladly defend it and R together when it comes to Getting Things Done, even on deadline.

Let’s walk through what one of us might have done had we been in the same scenario as Harry.

Mapping On A Deadline

So, we have to create a map of historical tornado frequency trends on deadline.

We emailed researchers and received three txt files. One is a set of latitudes, another longitudes, and the final one is the trend value. It’s gridded data.

Download that ZIP and pretend you got three files in email vs a nice ZIP and make a new RStudio project called “tornado” and put those three files in a local-to-the-project-root data/ directory. Let’s read them in and look at them:

library(hrbrthemes) # not 100% necessary but i like my ggplot2 theme(s) :-)
library(tidyverse)  # data wrangling & ggplot2

tibble(
  lat = scan(here::here("data/lats.txt")),
  lon = scan(here::here("data/lons.txt")),
  trend = scan(here::here("data/trends.txt"))
) -> tornado

You very likely never directly use the base::scan() function, but it’s handy here since we just have files of doubles with each value separated by whitespace. Now, let’s see what we have:

tornado
## # A tibble: 30,000 x 3
##      lat   lon trend
##    <dbl> <dbl> <dbl>
##  1 0.897 -180.     0
##  2 0.897 -179.     0
##  3 0.897 -178.     0
##  4 0.897 -176.     0
##  5 0.897 -175.     0
##  6 0.897 -174.     0
##  7 0.897 -173.     0
##  8 0.897 -172.     0
##  9 0.897 -170.     0
## 10 0.897 -169.     0
## # … with 29,990 more rows

summary(tornado)
##      lat               lon                 trend           
## Min.   : 0.8973   Min.   :-179.99808   Min.   :-0.4733610  
## 1st Qu.:22.0063   1st Qu.: -90.00066   1st Qu.: 0.0000000  
## Median :43.1154   Median :  -0.00323   Median : 0.0000000  
## Mean   :43.1154   Mean   :  -0.00323   Mean   : 0.0002756  
## 3rd Qu.:64.2245   3rd Qu.:  89.99419   3rd Qu.: 0.0000000  
## Max.   :85.3335   Max.   : 179.99161   Max.   : 0.6314569  

#+ grid-overview
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

#+ trend-overview
ggplot(tornado, aes(trend)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.05))

Since we’re looking for trends (either direction) in just the United States the latitude and longitude ranges will need to be shrunk down a bit (it does indeed look like globally gridded data) and we’ll be able to shrink the data set a bit more since we only want to look at large or small trends.

We don’t really need modern R/ggplot2 mapping idioms for this project (i.e. the new {sf} ecosystem), so we’ll keep it “simple” (scare quotes since that’s a loaded term) and just use the built in maps and geom_map(). First, let’s get the U.S. states and extract their bounding boxes/limits:

maps::map("state", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% 
  fortify() %>% 
  as_tibble() -> state_map

xlim <- range(state_map$long)
ylim <- range(state_map$lat)

NOTE: I tend not to use the handy ggplot2::map_data() function since it ends up clobbering purrr::map() which I use heavily (though not in this post). I also try to use {sf} these days so this tends to not be an issue anymore anyway.
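(For the curious: the {sf} flavor of that state-outline step would look roughly like this. It’s a sketch only; the rest of this post sticks with geom_map().)

library(sf)

# maps "map" objects have an st_as_sf() method, so the polygons come over directly
state_sf <- st_as_sf(maps::map("state", plot = FALSE, fill = TRUE))

bb <- st_bbox(state_sf)
xlim <- unname(bb[c("xmin", "xmax")])
ylim <- unname(bb[c("ymin", "ymax")])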

Now, let’s focus in on the target area in the original paper and the Axios article:

filter(
  tornado,
  between(lon, -107, xlim[2]), between(lat, ylim[1], ylim[2]), # -107 gets us ~left-edge of TX
  ((trend < -0.07) | (trend > 0.07)) # approximates notebook selection range
) -> tornado

#+ grid-overview-2
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

Now we’re getting close to our final solution.

As stated in the Observable notebook and implied by the word “grid” these dots are centroids of grid rectangles. This means we really want boxes, not points. The article got all fancy but it’s not really necessary since we can use ggplot2::geom_tile() to get us said boxes:

#+ grid-overview-3
ggplot(tornado, aes(lon, lat)) +
  geom_tile(aes(fill = trend, color = trend))

Now, we just need to add in map layers, and tweak some aesthetics to make it look like a map. We’ll start naively:

#+ map-1
ggplot() +
  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +
  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = "black", size = 0.125, fill = NA
  )

Our gridded data is definitely covering the right/same areas so we just need to make this more suitable for an article. We’ll use Harry’s palette and layer in U.S. state borders, an overall country border, and approximate the title and legend aesthetics:

#+ map-final
c(
  "#023858", "#045a8d", "#0570b0", "#3690c0", "#74a9cf",
  "#a6bddb", "#d0d1e6", "#ece7f2", "#fff7fb", "#ffffff",
  "#ffffcc", "#ffeda0", "#fed976", "#feb24c", "#fd8d3c",
  "#fc4e2a", "#e31a1c", "#bd0026", "#800026"
) -> grad_cols # colors from article

ggplot() +

  # tile layer

  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +

  # state borders

  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = ft_cols$slate, size = 0.125, fill = NA
  ) +

  # usa border

  borders("usa", colour = "black", size = 0.5) +

  # color scales

  scale_colour_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +
  scale_fill_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +

  # make it Albers-ish and ensure we can fit the borders in 

  coord_map(
    projection = "polyconic",
    xlim = scales::expand_range(range(tornado$lon), add = 2),
    ylim = scales::expand_range(range(tornado$lat), add = 2)
  ) +

  # tweak legend aesthetics

  guides(
    colour = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    ),
    fill = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    )
  ) +
  labs(
    x = NULL, y = NULL
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.key.width = unit(4, "lines")) +
  theme(legend.key.height = unit(0.5, "lines"))

FIN

I went through some extra steps for folks new to R but the overall approach was at least as expedient as the Observable one and — despite the claims by the quite daft retweet — this is no less “shareable” or “reusable” than the Observable notebook. You can clone the repo (https://git.sr.ht/~hrbrmstr/tornado) and reuse this work immediately.

If you take a stab at an alternate approach — especially if you do use {sf} — definitely blog about it and drop a link here or on Twitter.