
This morning, @kairyssdal tweeted out the following graphic from @axios:

If you’re doing the right thing and blocking evil social media javascript you can find the Axios story here and the graphic below:

I’m gonna say it: the chart is confusing. I grok what they were trying to do, but this is a clear example where a slopegraph would definitely be a better choice than a directional dumbbell chart. At the time I had ~5 minutes to spare so I did a quick makeover and a short howto thread. This post is an expansion on said thread; if you’re deciding whether to keep reading or move on, here’s what we’ll cover:

  • Making the choice between scraping or manual data entry
  • Quick introduction to EtherCalc
  • Why you might consider using EtherCalc for manual data entry over Excel or Google Sheets
  • Pulling data from EtherCalc into R
  • Making a slopegraph with the captured data
  • Customizing the slopegraph with the captured data to tell one or more stories

Read on if any or all of that captures your interest.

To scrape or not to scrape

Even if I wanted to scrape the site, Axios makes it pretty clear they are kinda not very nice people since — while it doesn’t mention scraping — that ToS link does indicate that:

(a) you will not license, sell, rent, lease, transfer, assign, distribute, host, or otherwise commercially exploit the Site or any content displayed on the Site; (b) you will not modify, make derivative works of, disassemble, reverse compile or reverse engineer any part of the Site; (c) you will not access the Site in order to build a similar or competitive website, product, or service; and (d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means. Unless otherwise indicated, any future release, update, or addition to the Site’s functionality will be subject to these Terms. All copyright and other proprietary notices on the Site (or on any content displayed on the Site) must be retained on all copies thereof.

(OH NO I COPIED THAT FROM THEIR SITE SO I AM ALREADY VIOLATING THEIR [unenforceable] TOS!)

There’s this thing called “Fair Use” and this makeover I’m doing is 100% covered under that. The Axios ToS and the ToS of many other sites try to prohibit such fair use and they generally lose those battles in court. I have and will be citing them as sources throughout this post and the post itself falls under “commentary and criticism”. Unlike many unethical scrapers who are just scavenging data they did no work to generate and whose work will not serve the better interest of the general community, this post is fully dedicated to sharing and education.

In reality, Axios likely has such draconian ToS due to all the horrible unethical scrapers who just want free, un-cited news content.

Anyway

Even if I could scrape, they don’t embed a javascript data object nor do they load an XHR JSON data blob to make the graphic. They use an idiom of loading a base image and then performing annotation via markup:

making it not worth taking the time to scrape.

That means data entry. 🙁

Using EtherCalc for fun and profit data entry

I dislike Microsoft Excel (even the modern versions of it) because it is overkill for data entry. I also dislike performing data entry in Google Sheets since that means I need to be cloud-connected. So, for small, local data entry needs I turn to EtherCalc. No internet access is required, nor is there a bloated app to run.

EtherCalc is a multiuser Google Sheets-like browser-based spreadsheet powered by javascript (both in-browser and the back-end). You can install it with:

$ npm install -g ethercalc

which assumes you have a working Node.js setup along with npm.

When you run:

$ ethercalc

you are given a URL to hit with your browser. Below is what that looks like with my data entry already complete:

It can use Redis or a local filesystem as a persistence layer and does support multiple folks editing the same document at the same time.

At this point I could just save it out manually to a CSV file and read it in the old-fashioned way, but EtherCalc has an API! So we can grab the data using {httr} calls, like this:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <dbl>            <dbl>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

where a983kmmne1i7 is the active document identifier.
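
As an aside: if you’d rather skip the explicit {httr} call, {readr} can read straight from that same EtherCalc CSV export endpoint (this assumes the same local instance and document identifier as above):

library(readr)

# same CSV export endpoint as the {httr} call above; readr handles the
# HTTP fetch and column type guessing for us
xdf <- read_csv("http://localhost:8000/a983kmmne1i7.csv")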

Now that we have the data, it’s time to start the makeover.

Stage 1: A basic slopegraph

(If you need a primer on slopegraphs, definitely check out this resource by @visualisingdata.)

We need to make a decision as to what’s going where on the slopegraph. I’m choosing to put what respondents actually read on the left and then what they say they want covered on the right. Regardless of order, we need to do a bit of data wrangling to take a first stab at the chart:

ggplot() +
  # draw the slope lines
  geom_segment(
    data = xdf,
    aes(
      x = "Actually read", y = actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    )
  ) +
  # left and right vertical bars
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  # left and right category text
  geom_text(data = xdf, aes("Actually read", actually_read, label = topic)) +
  geom_text(data = xdf, aes("Say they\nwant covered", say_want_covered, label = topic)) +
  scale_x_discrete(position = "top")

That chart isn’t winning any (good) awards any time soon. Apart from the non-aligned category labels, the categories aren’t in traditional order (rank “#1” being at the top on the left), plus we definitely need more information on the chart (title, subtitle, caption, etc.). We’ll reorder the labels and tweak some of the aesthetic problems away and switch the theme:

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered))) # get the category order right
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered) # reverse the Y axis

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    ),
    size = 0.25, color = "#b2b2b2"
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01
  ) +
  scale_x_discrete(position = "top") +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank())

That looks much better and I stopped there due to time constraints for the initial thread. However, the slope lines tend to be fairly hard to follow and we really should be telling a story with them. But what story do we want to focus on?

Story time

One aesthetic element we’ll want to immediately modify regardless of story is the line color. We can use the dir column for this:

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    )
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.2
    ),
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    ),
    labels = c(
      `-1` = "Topics Readers Want Covered < Topics Read",
      `0` = "Topics Read The Same Amount As They Want Covered",
      `1` = "Topics Read < Topics Readers Want Covered"
    )
  ) +
  guides(
    size = FALSE
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom") +
  theme(legend.direction = "vertical")

It’s still somewhat hard to pick out stories and the legend may be useful but it’s not ideal. Let’s highlight the different slope types with color, annotate them directly, and see what emerges:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered)))
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered)

arw <- arrow(length = unit(5, "pt"), type = "closed")

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    ), show.legend = FALSE
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  geom_curve(
    data = data.frame(), 
    aes(x = 1.2, y = -1, xend = 1.05, yend = -1.125), 
    color = ft_cols$red, arrow = arw
  ) +
  geom_segment(
    data = data.frame(), aes(x = 1.6, xend = 1.6, yend = -12.1, y = -12.9), 
    color = "#2b2b2b", arrow = arw
  ) +
  geom_curve(
    data = data.frame(), aes(x = 1.2, y = -14.1, xend = 1.1, yend = -13.6),
    curvature = -0.5, color = ft_cols$blue, arrow = arw
  ) +
  geom_text(
    data = data.frame(
      x = c(1.15, 1.6, 1.2),
      y = -c(1.2, 13, 14),
      hjust = c(0, 0.5, 0),
      vjust = c(0.5, 1, 0.5),
      lab = c(
        "Topics Readers Want Covered < Topics Read",
        "Topics Read The Same Amount\nAs They Want Covered",
        "Topics Read < Topics Readers Want Covered"
      ),
      stringsAsFactors = FALSE
    ),
    aes(x, y, hjust = hjust, vjust = vjust, label = lab),
    family = font_rc, size = 2.5, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.75,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    )
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme(axis.text.x = element_text(size = 12, face = "bold", color = "black")) +
  theme(axis.text.y = element_blank())

This first story indicates a potential social desirability bias in the respondents in that they claim to care more about health care, climate change, and education but really care more about more frivolous things (sports), base things (politics), and things they have almost no control over (national security).

Let’s switch the focus (only showing the modified aesthetic to avoid a code DoS):

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.75
    )
  ) +

Now we get to see just how far down on the priority list some of the “desired coverage” topics really sit. At least Health care is not at the bottom, but given how much technology controls our lives it’s a bit disconcerting to see technology down there.

What about the categories that did not differ in rank:

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.75,
      `1` = 0.2
    )
  ) +

Your guess is as good as mine why folks rated these the same (assuming the surveys had similar language).

FIN

Now that you’ve got the data (oh, right, I forgot to do that):

structure(list(topic = c("Health care", "Climate change", "Education", 
"Economics", "Science", "Technology", "Business", "National Security", 
"Politics", "Sports", "Immigration", "Arts & entertainment", 
"U.S. foreign policy", "Religion"), actually_read = c(7, 5, 11, 
6, 10, 14, 13, 1, 2, 3, 4, 8, 9, 12), say_want_covered = c(1, 
2, 3, 4, 7, 8, 11, 5, 10, 14, 6, 13, 9, 12)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec = structure(list(
    cols = list(topic = structure(list(), class = c("collector_character", 
    "collector")), actually_read = structure(list(), class = c("collector_double", 
    "collector")), say_want_covered = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

and some alternate views, perhaps you have an even better way to look at it. Drop a note in the comments with any of your creations or suggestions for improvement for the final versions shown here.
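
If the structure() blob above is awkward to work with, here’s the same data re-entered as a tribble along with the two wrangling steps the later plots rely on:

library(tidyverse)

# same data as the dput() output above, re-entered for readability
tribble(
  ~topic,                 ~actually_read, ~say_want_covered,
  "Health care",                       7,                 1,
  "Climate change",                    5,                 2,
  "Education",                        11,                 3,
  "Economics",                         6,                 4,
  "Science",                          10,                 7,
  "Technology",                       14,                 8,
  "Business",                         13,                11,
  "National Security",                 1,                 5,
  "Politics",                          2,                10,
  "Sports",                            3,                14,
  "Immigration",                       4,                 6,
  "Arts & entertainment",              8,                13,
  "U.S. foreign policy",               9,                 9,
  "Religion",                         12,                12
) -> xdf

# recreate the columns the slopegraph code expects
xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered)))
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered)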

I did another twitter thread on the aforeblogged MowerPlus database as I explored the tables after a second mow to determine what identified a unique mowing “session” (using John Deere’s terms).

This is the thread:

For those folks who are sanely averse to Twitter but have been following along on the blog, the overall TL;DR is that this time around the mowing took less time (since I didn’t have to go over areas twice) and that the ZSESSION column in the ZMOWLOCATION table holds the session id for a given mowing session.

I wrapped up how to get access to the MowerPlus SQLite DB that holds this data into one function and you can see a non-Twitter (and non-annotated) version of the Twitter thread here — this rmarkdown report — or below:

FIN

I made a small package for this since I’ll use it during mowing season, and you can check the other usual suspects (gitlab/bitbucket/gitugh) if you, too, want to use R to help analyze your mower data.

I was pretty brutal to Apple earlier this week in a Twitter thread that I tried to craft so it occurred in-line with the WWDC live stream (which might be something you want to remember as/if you read on). I really don’t care about “memojis” and I have serious dismay over what is a pretty obvious fact that Apple intends to dumb down computing by shifting most folks from Macs to iPads. Their new “Pro” is for design folks and I’m not holding my breath for them to re-embrace the developer/data science communities with better laptops or smaller cheese graters.

The “meh” hardware/software announcements aren’t the worst parts of these events. The TED-esque scripting (including many failed attempts at faux “authentic” humor) is also becoming quite tedious. I joked about analyzing the “adverbs per minute” but it took a few days for their WWDC 2019 keynote video with a subtitle track to emerge. As a result, current time constraints prevent a dive into the subtitles themselves, but that doesn’t mean you can’t have some fun with them.

Read on to see how I scraped the subtitles or skip to the end to read more about this “Reader Challenge”.

Not So Subtle Subtitle Scraping

If you go to the aforelinked WWDC video URL you’ll see a control on the lower right to add a subtitle track. If you do that with your browser’s Developer Tools open you’ll see what that does:

webdevtools subtitle index screen shot

These are WebVTT formatted subtitles which have a format/syntax that enable them to be displayed at the correct playback timecode. We can see how many of them there are by looking at the end of the file:

count of subtitles

So, there are 621 of them and each one is requested individually (and super-fast, in parallel). What do these individual requests look like? Just select one of them to take a look. They’re just plain text responses (it’s not a super-intricate format).

Let’s grab one of them to the clipboard and use the {curlconverter} package to turn that into an httr::GET() request via the straighten() and make_req() functions:

I went ahead and wrapped it into a fairly-well-named function, but the GET request is virtually untouched from the aforementioned process. I just added the {idx} template into the request URL so we can glue() the right index into it. It is likely that some headers could have been eliminated but I just went with what {curlconverter} processed and returned this time.

library(stringi)
library(subtools) # https://github.com/hrbrmstr/subtools ; (ORIG: https://github.com/fkeck/subtools)
library(tidytext)
library(purrrogress) # tidy progress bars for free!
library(tidyverse)

#' Fetches a subtitle by index from the 2019 Apple WWDC Keynote subtitle stream
get_subtitle <- function(idx = 1) {

  st_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_{idx}.webvtt"
  st_url <- glue::glue(st_url)

  httr::GET(
    url = st_url,
    httr::add_headers(
      `sec-ch-ua` = "Google Chrome 75",
      `Sec-Fetch-Mode` = "cors",
      Origin = "https://developer.apple.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
      Referer = "https://developer.apple.com/videos/play/wwdc2019/101/",
      `Sec-Fetch-Dest` = "empty",
      `Sec-Fetch-Site` = "cross-site"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  out <- stringi::stri_split_lines(out)

  purrr::flatten_chr(out)

}

Let’s see what one looks like:

(tmp <- get_subtitle(1))
## [1] "WEBVTT"                                          
## [2] "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000"
## [3] ""                                                
## [4] "3"                                               
## [5] "00:00:21.199 --> 00:00:22.333"                   
## [6] ">> FEMALE SPEAKER:"                              
## [7] "Don't stay up too late."                         
## [8] ""                                                
## [9] ""                 

Looking good! But, it’s just plain characters and I don’t feel like writing a subtitle parser. And, I don’t have to! François Keck has the {subtools} package which we can use. But, it (used to) only work on files. It now works on character vectors as well (but you’ll need to install it from my fork until the PR is merged). Let’s turn this set of noise into something we can use:

as_subtitle(tmp, format = "webvtt") %>% 
  flatten_df()

## # A tibble: 1 x 4
##   ID    Timecode.in  Timecode.out Text                                
##   <chr> <chr>        <chr>        <chr>                               
## 1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up to…

So tidy!
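
Before moving on: if you do need that fork of {subtools}, a hedged install one-liner (assuming you have {remotes} available) is:

# install the fork referenced above until the character-vector PR is merged upstream
remotes::install_github("hrbrmstr/subtools")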

We now need to get all of the subtitles. We’ll do that fast since the video player retrieves them even faster than this iteration does:

# no crawl delay b/c the video player grabs these even faster than this code does
map(1:621, with_progress(get_subtitle)) %>% # with_progress gets you a progress bar for free
  map(as_subtitle, format = "webvtt") %>% 
  flatten_df() %>% 
  as_tibble() -> apple_subs

apple_subs
## # A tibble: 3,220 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 10    00:02:05.200 00:02:12.533 (Applause)                         
## 10 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## # … with 3,210 more rows

Streaming subtitles aren’t error-free and often get duplicated, so let’s see if that’s the case:


# Any dups?
distinct(apple_subs)
## # A tibble: 2,734 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## 10 12    00:02:15.633 00:02:18.166 Thank you.                         
## # … with 2,724 more rows

apple_subs <- distinct(apple_subs)

There were dups, but not anymore!

You can get that data frame via: http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz.
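
If you just want to pick up the analysis from here, a hedged one-liner should pull that published file straight into R (recent {readr} versions download and decompress remote gz files on their own):

library(readr)

# the cleaned/deduplicated subtitle data frame published above
apple_subs <- read_csv("https://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz")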

I wanted to see if these looked OK so I dumped just the text to a file and opened it up in Sublime Text to spot check:


apple_subs %>% 
  pull(Text) %>% 
  write_lines("/tmp/subs.txt")

system("subl /tmp/subs.txt") # dblchk

Since we have a good capture of what was spoken, we can start the analysis process:

distinct(apple_subs) %>% 
  filter(!grepl("^\\(|^>>", Text)) %>%
  unnest_tokens(word, Text) %>% 
  anti_join(get_stopwords()) %>% 
  count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 2,408 x 2
##    word      n
##    <chr> <int>
##  1 now     246
##  2 can     205
##  3 new     142
##  4 like    119
##  5 just    106
##  6 app      77
##  7 great    74
##  8 apple    69
##  9 right    64
## 10 apps     59
## # … with 2,398 more rows

And, that’s when I ran out of time.

Reader Challenge

You’ve got the cleaned WWDC 2019 Keynote subtitle track and access to my brutal WWDC 2019 Twitter thread. What fun can you have with it? I’d still like to know the adverbs-per-‘n’ (and what kind they were). But, what else can you discover? Is there a pattern of emotional manipulation through word choices at different times? Did they change tone/style throughout the event? What other questions can you ask and tease out with data?

Drop links to your creations (and separate links to code) in the comments and I’ll re-broadcast them on Twitter and gather them all up into a new post to see what y’all came up with.

FIN

There’s no deadline as I can keep on curating as new submissions come in. While this is most assuredly an R-focused blog, there is no restriction on the tools you use.

Hopefully this will be a fun/creative exercise for folks. If you have any questions about the scraping process or about the WebVTT format don’t hesitate to ping me here or on Twitter (@hrbrmstr).

A user of the {ggalt} package recently posted a question about how to add points to a geom_dumbbell() plot. For now, this is not something you can do with geom_dumbbell() but with a bit of data wrangling you can do this in a pretty straightforward manner with just your data and ggplot2. The example below uses 3 values per category but it should scale to n values per category (though after a certain n you should reconsider the use of a dumbbell chart in favour of a more appropriate way to visualize the message you’re trying to convey).

Here’s the setup:

library(hrbrthemes)
library(tidyverse)

tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
) -> xdf

Three values per category. The approach is pretty straightforward:

  • reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  • reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  • reshape the data frame & plot the points

I’ve put ^^ notes near each ggplot2 geom:

ggplot() +
  # reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      top_n(-1) %>% 
      slice(1) %>%
      ungroup(),
    aes(x = 0, xend = val, y = cat, yend = cat),
    linetype = "dotted", size = 0.5, color = "gray80"
  ) +
  # reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      summarise(start = range(val)[1], end = range(val)[2]) %>% 
      ungroup(),
    aes(x = start, xend = end, y = cat, yend = cat),
    color = "gray80", size = 2
  ) +
  # reshape the data frame & plot the points
  geom_point(
    data = gather(xdf, measure, value, -cat),
    aes(value, cat, color = measure), 
    size = 4
  ) +
  # i just extended the scale a bit + put axis on top; choose aesthetics that work 
  # for you
  scale_x_comma(position = "top", limits = c(0, 10)) +
  scale_color_ipsum(name = "A real legend title") +
  labs(
    x = "Description of the value", y = NULL,
    title = "A good plot title"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

And, here’s the result:

In a recent previous post I brazenly talked over the “hard parts” of how I got to the target SQLite file that houses “mowing history” for what has become my weekend obsession. So, we’ll cover just how to do that (find things in iOS backups) in this post along with how to deal with some “gotchas” if you’re doing this from macOS.

macOS (the Knife)

Kurt Weill, Bertolt Brecht, and Marc Blitzstein created some amazing lyrics that Bobby Darin did some sweet, sweet justice to:

I bring that up to talk about the cutting, biting, dangerous edge of macOS that is Apple’s somewhat mixed attempt at protecting your privacy and keeping sensitive files and directories out of sight from the sharp teeth of malware (and to re-pimp my {mactheknife} package). You can read up on Apple’s new protections more thoroughly over at The Eclectic Light Company. For the purposes of this blog post, Apple’s macOS Sandbox policies mean you have to do some extra steps to gain access to the folder and files associated with iOS backups (which is ~/Library/Application Support/MobileSync/Backup/).

If you want RStudio, R, and anything run with Rscript to access these sandboxed areas you’ll need to enable “Full Disk Access” for those apps and executables. First you’ll need to open System Preferences > Security & Privacy and then make the Privacy tab active. Keep that window open and tap the lock to unlock the settings.

Full Disk Access settings panel in macOS


Adding RStudio is easy. Just make Finder active and hit Cmd Shift A and then find and drag the “RStudio” application into the pane+tab you opened in the previous step. Back in the Finder, hit Cmd Shift G and paste in: /Library/Frameworks/R.framework/Resources/bin and go to that folder. Drag in R and Rscript each into the pane+tab from the aforementioned step. Finally (and this got me for a minute) you also need to (again, in Finder) hit Cmd Shift G and paste in /Library/Frameworks/R.framework/Versions/3.6/Resources/bin/exec and drag that R executable into the Security & Privacy Privacy/Full Disk Access pane+tab as well. When you’ve done all that, lock the System Preferences pane and close it.

It is important to note that you just gave “R” and anything that calls R from your user space complete (well, almost) access to every sandboxed area on your system. R is a general purpose programming and scripting language which means any bit of malicious code that knows you have added those executables can use R to read from and write to any area on your system.

It is also important to note that I had to use 3.6 vs the Current symlink for the last entry so that means you need to do this for each new R version you install.

I hope folks on legacy Windows OS installs didn’t skip over this part as you’ll need to go here to figure out where your iOS backups folder is to go through the rest of the post.

Sneakin’ Round The Corner

Windows folks hopefully read at least the last bit of the previous section to figure out where their iOS backups are. On macOS that’s ~/Library/Application Support/MobileSync/Backup/. You need a local backup there (most folks just use iCloud backups these days) and Apple tells you how to do this.

Once you know you’ve got an (unencrypted) backup just go to your iOS backups directory and list the files by date and note the name/path of the most recent backup. Now we can have some fun.
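
If you’d rather do that listing from R (which also doubles as a check that the Full Disk Access grants from the previous section took effect), a minimal sketch for macOS looks like this:

# list local iOS backups, most recent first; a permissions error here means an
# R executable still needs to be added to the Full Disk Access pane
backup_root <- "~/Library/Application Support/MobileSync/Backup"
backups <- file.info(list.files(backup_root, full.names = TRUE))
backups[order(backups$mtime, decreasing = TRUE), "mtime", drop = FALSE]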

library(XML) # to read plist (property list) files
library(tidyverse) # for printing and access to sqlite dbs

# replace this with the relative path to your most recent backup dir
mb <- "~/Library/Application Support/MobileSync/Backup/28500cd31b9580aaf5815c695ebd3ea5f7455628-20190601-165737"

list.files(mb, pattern = ".*\\.(db|plist)$")
## [1] "Info.plist"      "Manifest.db"      "Manifest.plist"      "Status.plist"

The above code looks for some key metadata files for iOS backups.

  • Info.plist has info on your device
  • Manifest.db has tons of info on all the files in the backup in a SQLite database
  • Manifest.plist has some additional metadata on the backup including applications included in the backup
  • Status.plist contains info on the status of the backup

Let’s take a look at the plists:

info_p <- file.path(mb, "Info.plist")
file.copy(info_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Info.plist", "/tmp/Info.plist"))

info <- XML::readKeyValueDB("/tmp/Info.plist")

str(info)
## List of 11
##  $ Device Name        : chr REDACTED
##  $ Display Name       : chr REDACTED
##  $ ICCID              : chr REDACTED
##  $ IMEI               : chr REDACTED
##  $ IPBE Backup Version: int 1
##  $ Last Backup Date   : POSIXct[1:1], format: "2019-06-01 21:23:02"
##  $ Phone Number       : chr REDACTED
##  $ Product Type       : chr REDACTED
##  $ Product Version    : chr REDACTED
##  $ Serial Number      : chr REDACTED
##  $ Unique Identifier  : chr REDACTED

status_p <- file.path(mb, "Status.plist")
file.copy(status_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Status.plist", "/tmp/Status.plist"))

status <- XML::readKeyValueDB("/tmp/Status.plist")

str(status)
## List of 6
##  $ BackupState  : chr "new"
##  $ Date         : POSIXct[1:1], format: "2019-06-01 21:22:53"
##  $ IsFullBackup : logi FALSE
##  $ SnapshotState: chr "finished"
##  $ UUID         : chr REDACTED
##  $ Version      : chr "3.3"

mainf_p <- file.path(mb, "Manifest.plist")
file.copy(mainf_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Manifest.plist", "/tmp/Manifest.plist"))

manifest <- XML::readKeyValueDB("/tmp/Manifest.plist")

str(manifest, 1)
## List of 8
##  $ Applications        :List of 745
##  $ BackupKeyBag        : chr __truncated__
##  $ Date                : POSIXct[1:1], format: "2019-06-01 20:57:40"
##  $ IsEncrypted         : logi FALSE
##  $ Lockdown            :List of 12
##  $ SystemDomainsVersion: chr "24.0"
##  $ Version             : chr "10.0"
##  $ WasPasscodeSet      : logi TRUE

You’ll note we’re making copies of these files (never play with system-managed files directly unless you know what you’re doing) and turning binary property lists into plain text XML property lists as well so we can read them with the XML::readKeyValueDB() function.

Most of that information is fairly useless for this blog post but I figured you might like to see the hidden things the system knows about your devices. What we do want to check is to see if the John Deere application and data made it into the backup. The Applications slot is a named list of application metadata. Let’s see if there’s anything Deere-ish in it:

grep("deere", names(manifest$Applications), ignore.case = TRUE, value = TRUE)
##                   key 
## "com.deere.mowerplus"

str(manifest$Applications$com.deere.mowerplus, 1)
## List of 4
##  $ CFBundleIdentifier   : chr "com.deere.mowerplus"
##  $ CFBundleVersion      : chr "180"
##  $ ContainerContentClass: chr "Data/Application"
##  $ Path                 : chr "/var/containers/Bundle/Application/30DF2640-A9AA-43A0-AD87-932CA513D75A/MowerPlus.app"

Aye! This means we should have some luck finding “mower” data in the Manifest SQLite database.

Now, we could try to follow UUIDs around but we can also take a stab at a less cumbersome approach. Let’s make a copy of the Manifest database and see what it holds:

mainf_d <- file.path(mb, "Manifest.db")
file.copy(mainf_d, "/tmp", overwrite = TRUE)
## [1] TRUE

(manifest_db <- src_sqlite("/tmp/Manifest.db"))
## src:  sqlite 3.22.0 [/private/tmp/Manifest.db]
## tbls: Files, Properties

We want to get to (hopefully) a SQLite file with the mowing data so we likely care about the Files table. Let’s take a look at the structure of that table:

(fils <- tbl(manifest_db, "Files"))
## # Source:   table<Files> [?? x 5]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    fileID           domain         relativePath           flags        file
##    <chr>            <chr>          <chr>                  <int>      <blob>
##  1 c1da4199a18d0b5… AppDomain-com… ""                         2 <raw 437 B>
##  2 7426ac0386e2887… AppDomain-com… Library                    2 <raw 444 B>
##  3 a6393e739e1ad37… AppDomain-com… Library/WebKit             2 <raw 444 B>
##  4 c54f5c77a5e970b… AppDomain-com… Library/WebKit/Websit…     2 <raw 458 B>
##  5 578f2c96f219e95… AppDomain-com… Library/WebKit/Websit…     2 <raw 465 B>
##  6 c8833032ce7c9e9… AppDomain-com… Library/WebKit/Websit…     2 <raw 481 B>
##  7 6af21902e595f7c… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
##  8 4c1c49324646af0… AppDomain-com… Library/WebKit/Websit…     2 <raw 471 B>
##  9 d0636bf9b5ba2ae… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
## 10 0b6bb30c8abaa4e… AppDomain-com… Library/Preferences        2 <raw 458 B>
## # … with more rows

If you have a ton of apps, this is a pretty big haystack to comb through. We may be able to narrow things down a bit, though, and we’ll start by seeing what that domain column holds:

distinct(fils, domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    domain                                 
##    <chr>                                  
##  1 AppDomain-Outils-OBD-Facile.EOBD-Facile
##  2 AppDomain-ch.threema.iapp              
##  3 AppDomain-co.humanco.Human             
##  4 AppDomain-co.ortatech.colr-app         
##  5 AppDomain-co.vero.app                  
##  6 AppDomain-com.7thg.Tides               
##  7 AppDomain-com.AerLingus                
##  8 AppDomain-com.BloomSky.BloomSky        
##  9 AppDomain-com.PunchThrough.LightBlue   
## 10 AppDomain-com.agilebits.onepassword-ios
## # … with more rows

So, these are app-specific and the bits after the - in each one look like the CFBundleIdentifiers from above. Let’s make sure:

filter(fils, lower(domain) %like% "%com.deere.mowerplus%") %>% 
  distinct(domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   domain                       
##   <chr>                        
## 1 AppDomain-com.deere.mowerplus

Aye! Let’s check to see what files are in there (and hope for a nice SQLite database):

filter(fils, domain == "AppDomain-com.deere.mowerplus") %>%
  select(relativePath) %>% 
  collect() 
## # A tibble: 14 x 1
##    relativePath                                                     
##    <chr>                                                            
##  1 ""                                                               
##  2 Library                                                          
##  3 Library/Preferences                                              
##  4 Library/Application Support                                      
##  5 Library/Application Support/com.crashlytics                      
##  6 Documents                                                        
##  7 Library/GoAppPersistentStore-GoMow.sqlite                        
##  8 Library/googleanalytics-aux-v4.sql                               
##  9 Library/googleanalytics-v3.sql                                   
## 10 Library/Preferences/com.deere.mowerplus.plist                    
## 11 Library/Application Support/ActivityCards.sqlite                 
## 12 Library/Application Support/com.crashlytics/CLSUserDefaults.plist
## 13 Library/googleanalytics-v2.sql                                   
## 14 Library/Application Support/MowTracking.sqlite

It turns out that last one is what we’re looking for. Now we just need a bit of crucial metadata to get to it:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID, relativePath)
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                            relativePath                           
##   <chr>                             <chr>                                  
## 1 ad0009ec04c44b544d37bfc7ab343869… Library/Application Support/MowTrackin…

That fileID maps to the seriously ugly directory tree that is the rest of the iOS backups folder (you likely looked into it and wondered “What the heck?!”). The top level is a 2-digit hex prefix with files underneath it (likely for performance reasons but a bit of obfuscation never hurts, too). We’ll get the whole string:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                                  
##   <chr>                                   
## 1 ad0009ec04c44b544d37bfc7ab3438697d23d618

and, then copy over the mowing database somewhere safe to work on:

file.copy(
  file.path(mb, "ad/ad0009ec04c44b544d37bfc7ab3438697d23d618"),
  "/tmp/mowtrack.sqlite",
  overwrite = TRUE
)
## [1] TRUE

mow <- src_sqlite("/tmp/mowtrack.sqlite")

mow
## src:  sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## tbls: Z_METADATA, Z_MODELCACHE, Z_PRIMARYKEY, ZACTIVITY, ZDEALER,
##   ZMOWALERT, ZMOWER, ZMOWLOCATION, ZSMARTCONNECTOR, ZUSER

tbl(mow, "ZMOWLOCATION") %>% 
  glimpse()
## Observations: ??
## Variables: 16
## Database: sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## $ Z_PK                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Z_ENT               <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ Z_OPT               <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZISPAUSEDPOINT      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZORDER              <int> 1, 2, 0, 11, 20, 58, 38, 43, 30, 25, 21, 10,…
## $ ZSESSION            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZSESSION2           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZALTITUDE           <dbl> 42.64804, 42.70590, 40.99661, 39.54770, 38.2…
## $ ZCOURSE             <dbl> 358.242188, 332.226562, 18.281250, 260.85937…
## $ ZHORIZONTALACCURACY <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ ZLATITUDE           <dbl> 43.25913, 43.25914, 43.25913, 43.25915, 43.2…
## $ ZLONGITUDE          <dbl> -70.80069, -70.80069, -70.80069, -70.80067, …
## $ ZSPEED              <dbl> 0.0000000, 0.4250179, 0.5592341, 0.3802792, …
## $ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, …
## $ ZVERTICALACCURACY   <dbl> 6, 6, 8, 6, 4, 4, 4, 3, 4, 4, 4, 6, 4, 4, 4,…
## $ ZKLVDATA            <blob> <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <…
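
One small generalization of that manual “ad/…” path step: since iOS backups shard files into directories named after the first two characters of the fileID, a tiny helper (my own naming, not something from any package) can build the on-disk path for any row in Manifest.db:

# hypothetical helper: backups store each file under a two-character prefix
# directory derived from its fileID
backup_file_path <- function(backup_dir, file_id) {
  file.path(backup_dir, substr(file_id, 1, 2), file_id)
}

backup_file_path(mb, "ad0009ec04c44b544d37bfc7ab3438697d23d618")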

(Shark) FIN

Even if you don’t have this mower app that I’m currently obsessed with, you now have a primer on how to get to SQLite databases stored by any application on your iOS device. That alone may unearth some fun projects for you to hack on. Plus, you also learned a bit on how to do some light forensics on iOS backups with R/RStudio.

If you did your own trawling and found something interesting definitely blog or tweet about it and drop a link in the comments.

For the first time ever we got a new riding mower this weekend. We’ve always haggled to keep the one the sellers were using with any given house we’ve purchased over the years (the ones with yards big enough to “require” a riding mower).

We ended up getting a model from John Deere and the manual (yes, I actually read the manual) noted they had a Mower Plus app to track “things”. Given the lack of built-in sensors I figured this was just something kitschy, but it’s not bad (they don’t do anything super-evil tracker-wise) and if you use it while mowing it tracks anything your mobile device can and will notify you of prescribed service events.

I ran the app through a local proxy server before using it in the field and it only phoned home to check the serial number and get weather info, which meant there was no hidden API to capture and mimic to get access to mowing data. But, that meant it stores everything it needs locally in the app. Since I use an iOS mobile device (you should too, Android is a security mess) I figured that meant it stored everything in on-device SQLite databases.

A quick review of an immediate backup confirmed that this posit was correct and I started going crazy (as you’ll see in the aforelinked Twitter thread).

However, it had these cray cray timestamps:

$ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, 581100290, 581100328, 581100308, 581…

Thanks to some fairly fast poking by @dabdine, it turns out these are Apple Cocoa Core Data timestamps. They can come in two flavors:

  • seconds since 2001-01-01 00:00:00
  • nanoseconds since ^^

The entire reason for this post (and, some of the needless verbosity) was to get this into the internet’s long-ish term memory stores (Google/Internet Archive) so others can reference it as well.

If you come across one of these beasts: for 2019 dates they’ll likely start with 56, 57, 58, 59, or 60, and 2020 dates would start with 60, 61, 62, or 63 (enumerating them to make Google searches more able to pick up this post, and that should be sufficient for savvy searchers). If you’re not having luck using Excel date tricks, {lubridate} functions, or {anytime}, try transforming a sample timestamp with:

as.POSIXct(sample_timestamp, origin = "2001-01-01")

to see if you get a more “expected”/realistic value and know you’ve got a Core Data timestamp. You can even stick:

from_coredata_ts <- function(x, tz = NULL) {
  .POSIXct(ifelse(
    test = floor(log10(x)) >= 10, # If you're still using R in 2317 then good on ya and edit this
    yes = as.POSIXct(x/10e8, origin = "2001-01-01"), # nanoseconds coredata
    no = as.POSIXct(x, origin = "2001-01-01") # seconds coredata
  ), tz = tz)
}

into your RStudio snippets file or in a personal “misc” package for quick-access to converting both kinds of Core Data timestamps into proper R POSIXct objects.
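
As a quick sanity check, feeding one of the ZTIMESTAMP values shown earlier into that helper should land squarely on the day of the mow (the clock time will display in your local time zone):

# 581100271 is one of the ZTIMESTAMP values from ZMOWLOCATION above; treated as
# seconds since 2001-01-01 it decodes to 2019-06-01 (UTC)
from_coredata_ts(581100271)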

The Feedly category I have setup for git-stalking has indicated a fairly massive interest in Joshua Levy’s The Art of the Command Line. What is “The Art of the Command Line”? To quote the author(s):

Fluency on the command line is a skill often neglected or considered arcane, but it improves your flexibility and productivity as an engineer in both obvious and subtle ways. This is a selection of notes and tips on using the command-line that we’ve found useful when working on Linux. Some tips are elementary, and some are fairly specific, sophisticated, or obscure. This page is not long, but if you can use and recall all the items here, you know a lot.

It’s a great resource just the way it is (simple, plain markdown rendered in GitUgh). But, we can make it even greater with some help from rmarkdown::render() and some content slicing & dicing.

My initial thought was to grab the English version, put an R Markdown YAML header on it, remove some intro cruft and render it to standalone HTML. While that would be quick, easy and useful it’s also very manual and brittle since updating it would require copy and paste; plus, it leaves out the translated versions.

So, goal number uno became “make a function to do this”. Then, I realized “Hey! This is a resource for command line stuff so why not turn the function into a command line tool!”. So this became goal number II. (I have an internal posit that R adoption would be much higher if there were more easy-to-install command line utilities built in R since that’s one reason Python has a larger install base and many folks end up just using the command line versions of modules they install. CRAN’s draconian rules on what you can do during a package install makes this somewhat moot, tho. One could argue that CRAN is doing the right thing and that Python/PyPI are woefully insecure-by-default which is also true.)

Goal Uno

Since we’re going to create a function it also makes sense to parameterize options for the language, doc-theme and highlight-theme.

The setup plan for this endeavour is pretty straightforward:

  • fetch the current set of translations available
  • check to make sure the desired translation is in ^^ set
  • grab a copy of the specified document
  • get the title (since that’s translated for each)
  • remove some unnecessary front-matter
  • turn the AUTHORS.md link into a proper link (vs relative)
  • add in the YAML header with the desired customizations
  • render the document to standalone HTML
  • optionally open it after render

And, this is what that looks like:

taotcl <- function(language = "", theme = "simplex", highlight = "espresso", output_dir = getwd(), open = TRUE) {

  language <- language[1]

  # find translations

  httr::GET(
    url = "https://api.github.com/repos/jlevy/the-art-of-command-line/contents/",
    httr::add_headers(
      `Accept` = "application/vnd.github.v3+json"
    ),
    httr::user_agent("taotcl R script; @hrbrmstr")
  ) -> res

  httr::stop_for_status(res)

  ls <- httr::content(res, as = "parsed")

  readmes <- Filter(function(.x) grepl("^README", .x), vapply(ls, `[[`, character(1), "name"))
  langs <- regmatches(readmes, regexpr("-[-[:alpha:]]+", readmes))

  # check to make sure a valid one was specified

  if (language != "") { # "" => English
    language <- sprintf("-%s", language)
    if (!(language %in% langs)) {
      stop(
        "Language '", sub("^-", "", language), 
        "' not found in repo. Current translations include: ",
        paste0(sprintf("'%s'", sub("^-", "", langs)), collapse = ", "),
        ".", call.=FALSE
      )
    }
  }

  # get the desired doc

  src <- "https://raw.githubusercontent.com/jlevy/the-art-of-command-line/master/README{language}.md"
  src <- glue::glue(src)

  l <- readLines(src)

  # find the title
  title <- sub("^#[[:space:]]*", "", l[which(grepl("^#[[:space:]]*", l))[1]])

  # figure out the cut line
  cowsay <- which(grepl("cowsay", l))[1]

  l <- l[-(1:(cowsay+1))] # cut

  # make the AUTHORS.md a useful link
  l <- gsub("(AUTHORS.md)", "(https://github.com/jlevy/the-art-of-command-line/blob/master/AUTHORS.md)", l, fixed = TRUE)

  theme <- theme[1]
  highlight <- highlight[1]

'---
title: "{title}"
author: "Joshua Levy"
email: "joshua@cal.berkeley.edu"
output: 
  html_document:
    theme: {theme}
    highlight: {highlight}
    toc: true
    toc_float: true
    toc_depth: 2
---

' -> yaml

  # fill in the YAML
  yaml <- glue::glue(yaml)

  tf <- tempfile(fileext = ".Rmd")
  on.exit(unlink(tf), add = TRUE)
  writeLines(c(yaml, l), tf)

  # render the doc
  rmarkdown::render(
    input = tf,
    output_file = sprintf("%s.html", tolower(gsub(" ", "-", title))),
    output_dir = output_dir[1],
    quiet = TRUE
  ) -> loc

  # open in browser
  if (open[1]) browseURL(loc)

  message("Rendered version is at '", loc, "'")

}

Running it with the defaults will have it look like this:

You don’t have to type it all in as that function is in the taotcl.R script over at my gitea / sourcehut / gitlab / gitugh.

Goal II

Now that we have a function we can call from R we just need a wrapper around it. I kinda like the way David Shih put together his {argparser} package (it’s on CRAN) so we’ll make a wrapper for our rendering function with it.

We have pretty much the same goal list as the function in that we want to let users specify customizations. There are some additional ones as well (this is not an exhaustive list but it was “just enough” for this go):

  1. Make it easy for folks on real operating systems to use it without the need to use Rscript
  2. Let folks know what required packages they need to install if any are missing
  3. Be quiet when loading packages
  4. Assume friendly/useful defaults
  5. Provide long and short parameters (some folks like short, some like long)

The first few lines of the finished script will accomplish #1-3:

#!/usr/bin/env Rscript

needed <- c("magrittr", "argparser", "httr", "glue", "rmarkdown")
installed <- rownames(installed.packages())
missing <- needed[!(needed %in% installed)]

if (length(missing)) stop("Please install the following packages: ", paste0(sprintf("'%s'", missing), collapse = ", "), call.=FALSE)

suppressPackageStartupMessages({
  for (pkg in needed) {
    require(package = pkg, quietly = TRUE, warn.conflicts = FALSE, character.only = TRUE)
  }
})

Line 1 is a “hashbang”/”shebang” and — provided the file has the execute bit set — will let folks on *nix/macOS run the file without deliberately invoking Rscript. The rest just do the package checks and loads.

We need a way to get command line parameters in, hence the use of {argparser}. We’ll create an arg_parser object and then add arguments using the {magrittr} pipe (%>%). You can add long/short argument names as well as help and defaults (plus note whether an argument is a flag/toggle). Once we have those setup, we tell {argparser} to process any arguments provided by the user:

arg_parser(
  description = "Render 'The Art of the Command Line' to HTML"
) %>% 
  add_argument(
    arg = "--language",
    help = 'Language to render. Leave unspecified for English. Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh"',
    type = "character",
    short = "-l",
    default = ""
  ) %>% 
  add_argument(
    arg = "--theme",
    help = "Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-t",
    default = "simplex"
  ) %>% 
  add_argument(
    arg = "--highlight",
    help = "Which R Markdown code higlight theme to use. Ref: https://l.rud.is/2JOibrZ",
    type = "character",
    short = "-c",
    default = "espresso"
  ) %>% 
  add_argument(
    arg = "--output-dir",
    help = "Where to store the rendered file. Defaults to current working directory.",
    type = "character",
    short = "-o",
    default = getwd()
  ) %>% 
  add_argument(
    arg = "--just-render",
    help = "Only render the document. Do not open in the system default browser. (Default is to render and open.)",
    short = "-j",
    flag = TRUE
  ) -> parser

opts <- argparser::parse_args(parser)

Once we have those (the taotcl() function would come next in the source) then it’s just a matter of calling the function:

taotcl(
  language = opts$language,
  theme = opts$theme,
  highlight = opts$highlight,
  output_dir = opts$output_dir,
  open = is.na(opts$just_render) | (!opts$just_render)
)

If the command line program we’ve just made is called with a -h or --help the user will get:

usage: ./taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

or (on Windows): Rscript taotcl.R [--help] [--just-render] [--opts OPTS] [--language LANGUAGE] [--theme THEME] [--highlight HIGHLIGHT] [--output-dir OUTPUT-DIR]

Render 'The Art of the Command Line' to HTML


flags:
  -h, --help                    show this help message and exit
  -j, --just-render             Only render the document. Do not open in the system default browser. 
                                (Default is to render and open.)

optional arguments:
  -x, --opts OPTS               RDS file containing argument values
  -l, --language LANGUAGE       Language to render. Leave unspecified for English. 
                                Current known: "cs", "de", "el", "es", "fr", "id", "it", "ja", 
                                "ko", "pt", "ro", "ru", "sl", "uk", "zh-Hant", "zh" [default: ]
  -t, --theme THEME             Which R Markdown document theme to use. Ref: https://l.rud.is/2JOibrZ [default: simplex]
  -c, --highlight HIGHLIGHT     Which R Markdown code highlight theme to use. Ref: https://l.rud.is/2JOibrZ [default: espresso]
  -o, --output-dir OUTPUT-DIR   Where to store the rendered file. Defaults to current working directory. 
                                [default: /your/current/directory/here]

If we run it with, say, ./taotcl.R --language ru -o /tmp the script will process the correct language version and render it to /tmp/искусство-командной-строки.html plus auto-open it for us. It should look like:

FIN

As noted, you can find the entire script over at my gitea / sourcehut / gitlab / gitugh. It’ll eventually get over to GitLab & GitUgh (and a few others as I’m expanding the scripts I use to support social coding diversity vs hegemony) as well.

Note that you can leave off the .R and the hashbang will still work just fine so it’ll be even more straightforward to use.

If you don’t want to go through all this and just want standalone rendered versions of the resource just drop a note in comments and I’ll toss up a small Shiny app which will let you specify params and get a rendered version. You can find weekly renders of all translations at https://rud.is/taotcl/.

Finally, r-lib has some handy packages to make R-built command line utilities much, much cooler (which is a minor suggestion that PRs are welcome if you want to add some flavor to this fairly vanilla utility).
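
For instance (purely illustrative, not something the current script does), the {cli} package from r-lib could stand in for the closing message() call in taotcl():

library(cli)

# a friendlier version of the final status message; the path below is just a
# placeholder for wherever rmarkdown::render() put the output
loc <- "/tmp/the-art-of-the-command-line.html"
cli_alert_success("Rendered version is at {.path {loc}}")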

A fair bit of time ago the {gdns} package made its way to CRAN to give R users the ability to use Google’s (at that time) nascent support for DNS over HTTPS (DoH). A bit later on Cloudflare also provided a global DoH endpoint and that begat the (not-on-CRAN) {dnsflare} package.

There are actually two ways to make these DoH queries: one via an HTTPS GET REST API and the other via HTTPS POST queries that use DNS wireformat queries and replies. While the POST side of DoH is pretty standardized/uniform the GET/REST API side is kind of the Wild West. I wanted a way to have support for both wireformat and REST idioms but also not have to write a gazillion packages to support the eventual plethora of diverse DoH GET/REST API services.

I “solved” this by first augmenting my (not-on-CRAN) {clandnstine} package to support the POST wireformat DoH queries (since the underlying {getdns} library supports decoding wireformat responses) and creating a very small {playdoh} package which provided generic support for (hopefully) any DoH GET/REST endpoint.

DoH vs DoT

I made the {clandnstine} package primarily to support making DNS over TLS (DoT) queries but it makes sense to combine both DoH and DoT functionality into that package. The problem is that the legacy platform most of y’all R users are on (i.e. Windows) makes using that package problematic. Therefore, by separating out the DoH GET functionality into a standalone package I don’t have to write a pure-R DNS wireformat response handler.

There are performance and other differences between DoH and DoT. I suspect most DNS providers and also most open source DNS servers will eventually support both DoH and DoT, so which one you use will be up to your clients and use cases.

A Tale of Two (or More) Queries

We’ll issue a few queries over DoH and DoT to a few servers to ensure we’re getting the same results.

library(clandnstine) # both of these are on sourcehut (~hrbrmstr/pkgname), 
library(playdoh)     # or gitlab/gitugh (hrbrmstr/pkgname)

# DoT
x <- gdns_context()
gdns_query(x, "example.com", rr_type = "a")$just_address_answers$address_data
## [1] "93.184.216.34"

# DoH POST (wireformat)
doh_post("example.com", "a", server_path = doh_servers$quad9$url)$answer$rdata$ipv4_address
## [1] "93.184.216.34"

# DoH GET (rest)
doh_get("example.com", "a", service_path = doh_servers$securedns_eu$url)$data[1]
## [1] "93.184.216.34"

To support the, er, diversity of requirements across various GET/REST endpoints the playdoh::doh_get() function has an extra_params parameter which lets you specify any required extra REST query params. Both packages have an exposed global variable doh_servers which has both the URL and any required extra parameters.
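
A quick way to see which endpoints ship with the packages (and which of them declare extra parameters) is to poke at that exported object directly; a minimal sketch:

library(playdoh)

# doh_servers is the exported list of known endpoints; each entry carries the
# service URL plus any extra REST query parameters that endpoint requires
names(doh_servers)
str(doh_servers$quad9)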

FIN

As usual, kick the tyres, file issues and PRs where you like and if you do end up using either package drop a note in the comments.