hrbrmstr, Author at rud.is

Don’t look at me…I do what he does, just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Quick hit: ‘dig’-ging Into r-project.org DNS Records with {processx}

The r-project.org domain had some temporary technical difficulties this week (2019-29) that made reaching R-related resources problematic for a bunch of folks for a period of time. Incidents like this underscore the need for regional and network diversity when it comes to ensuring the availability of DNS services. That is, having two DNS servers does no good if they’re both connected to the same power source and/or network connection: if the power goes out or the network gets wonky, no client will be able to translate r-project.org to an IP address that it can then connect to.

I’m not at-keyboard much this week so only had time to take an external poke at the (new) r-project.org DNS configuration late yesterday and today before the sleepyhead vacationers emerged from slumber. To my surprise, the current r-project.org DNS setup allows full zone transfers, which means you can get the full “database” of r-project.org DNS records if you know the right incantations.

So, I wrote a small R function wrapper for the dig command using {processx}. Folks on the legacy Windows operating system are on their own for getting a copy of dig installed, but users of proper, modern operating systems like Linux or macOS should have it installed by default (or it will be an easy package manager grab away).

Wrapping dig

The R-wrapper for the dig command is pretty straightforward:

library(stringi) # string processing
library(processx) # system processes orchestration
library(tidyverse) # good data wrangling idioms

dig <- function(..., cat = TRUE) {

  processx::run(
    command = unname(Sys.which("dig")), # full path to the dig binary
    args = unlist(list(...))            # flatten `...` into a character vector
  ) -> out

  if (cat) message(out$stdout)

  invisible(out)

}

We expand the ellipsis (`...`) into command arguments, run the command, return the output and optionally display the output via message().
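To make the argument handling concrete, here is a minimal sketch of just that piece. `collect_args()` is a hypothetical helper name used only for illustration; it mirrors the `unlist(list(...))` idiom from the wrapper and does not call dig at all.

```r
# Hypothetical helper that mirrors only the argument handling of dig():
# everything passed via `...` is collected into one flat character vector.
collect_args <- function(...) unlist(list(...))

# Separate strings and pre-built character vectors end up in the same shape
a <- collect_args("+short", "NS", "@9.9.9.9", "r-project.org")
b <- collect_args(c("+short", "NS"), c("@9.9.9.9", "r-project.org"))

identical(a, b)
```

This means the wrapper should accept either style of call, which is handy when some of the arguments are built up programmatically.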

Let’s see if it works by getting the dig help:

dig("-h")
## Usage:  dig [@global-server] [domain] [q-type] [q-class] {q-opt}
##             {global-d-opt} host [@local-server] {local-d-opt}
##             [ host [@local-server] {local-d-opt} [...]]
## Where:  domain    is in the Domain Name System
##         q-class  is one of (in,hs,ch,...) [default: in]
##         q-type   is one of (a,any,mx,ns,soa,hinfo,axfr,txt,...) [default:a]
##                  (Use ixfr=version for type ixfr)
##         q-opt    is one of:
##                  -4                  (use IPv4 query transport only)
##                  -6                  (use IPv6 query transport only)
##                  -b address[#port]   (bind to source address/port)
##                  -c class            (specify query class)
##                  -f filename         (batch mode)
##                  -i                  (use IP6.INT for IPv6 reverse lookups)
##                  -k keyfile          (specify tsig key file)
##                  -m                  (enable memory usage debugging)
##                  -p port             (specify port number)
##                  -q name             (specify query name)
##                  -t type             (specify query type)
##                  -u                  (display times in usec instead of msec)
##                  -x dot-notation     (shortcut for reverse lookups)
##                  -y [hmac:]name:key  (specify named base64 tsig key)
##         d-opt    is of the form +keyword[=value], where keyword is:
##                  +[no]aaonly         (Set AA flag in query (+[no]aaflag))
##                  +[no]additional     (Control display of additional section)
##                  +[no]adflag         (Set AD flag in query (default on))
##                  +[no]all            (Set or clear all display flags)
##                  +[no]answer         (Control display of answer section)
##                  +[no]authority      (Control display of authority section)
##                  +[no]besteffort     (Try to parse even illegal messages)
##                  +bufsize=###        (Set EDNS0 Max UDP packet size)
##                  +[no]cdflag         (Set checking disabled flag in query)
##                  +[no]cl             (Control display of class in records)
##                  +[no]cmd            (Control display of command line)
##                  +[no]comments       (Control display of comment lines)
##                  +[no]crypto         (Control display of cryptographic fields in records)
##                  +[no]defname        (Use search list (+[no]search))
##                  +[no]dnssec         (Request DNSSEC records)
##                  +domain=###         (Set default domainname)
##                  +[no]edns[=###]     (Set EDNS version) [0]
##                  +ednsflags=###      (Set EDNS flag bits)
##                  +[no]ednsnegotiation (Set EDNS version negotiation)
##                  +ednsopt=###[:value] (Send specified EDNS option)
##                  +noednsopt          (Clear list of +ednsopt options)
##                  +[no]expire         (Request time to expire)
##                  +[no]fail           (Don't try next server on SERVFAIL)
##                  +[no]identify       (ID responders in short answers)
##                  +[no]idnout         (convert IDN response)
##                  +[no]ignore         (Don't revert to TCP for TC responses.)
##                  +[no]keepopen       (Keep the TCP socket open between queries)
##                  +[no]multiline      (Print records in an expanded format)
##                  +ndots=###          (Set search NDOTS value)
##                  +[no]nsid           (Request Name Server ID)
##                  +[no]nssearch       (Search all authoritative nameservers)
##                  +[no]onesoa         (AXFR prints only one soa record)
##                  +[no]opcode=###     (Set the opcode of the request)
##                  +[no]qr             (Print question before sending)
##                  +[no]question       (Control display of question section)
##                  +[no]recurse        (Recursive mode)
##                  +retry=###          (Set number of UDP retries) [2]
##                  +[no]rrcomments     (Control display of per-record comments)
##                  +[no]search         (Set whether to use searchlist)
##                  +[no]short          (Display nothing except short
##                                       form of answer)
##                  +[no]showsearch     (Search with intermediate results)
##                  +[no]split=##       (Split hex/base64 fields into chunks)
##                  +[no]stats          (Control display of statistics)
##                  +subnet=addr        (Set edns-client-subnet option)
##                  +[no]tcp            (TCP mode (+[no]vc))
##                  +time=###           (Set query timeout) [5]
##                  +[no]trace          (Trace delegation down from root [+dnssec])
##                  +tries=###          (Set number of UDP attempts) [3]
##                  +[no]ttlid          (Control display of ttls in records)
##                  +[no]vc             (TCP mode (+[no]tcp))
##         global d-opts and servers (before host name) affect all queries.
##         local d-opts and servers (after host name) affect only that lookup.
##         -h                           (print help and exit)
##         -v                           (print version and exit)

To get the r-project.org DNS records we first need to find its nameservers, which we can do via:

ns <- dig("+short", "NS", "@9.9.9.9", "r-project.org")
## ns1.wu-wien.ac.at.
## ns2.urbanek.info.
## ns1.urbanek.info.
## ns3.urbanek.info.
## ns4.urbanek.info.
## ns2.wu-wien.ac.at.

There are six of them (which IIRC is a few more than they had earlier this week). I wanted to see if any supported zone transfers. Here’s one way to do that:

stri_split_lines(ns$stdout, omit_empty = TRUE) %>% # split the response in stdout into lines
  flatten_chr() %>% # turn the list into a character vector
  map_df(~{ # make a data frame out of the following
    tibble(
      ns = .x, # the nameserver we are probing
      res = dig("+noall", "+answer", "AXFR", glue::glue("@{.x}"), "r-project.org", cat = FALSE) %>%  # the dig zone transfer request
        pluck("stdout") # we only want the `stdout` element of the {processx} return value
    )
  }) -> xdf

xdf
## # A tibble: 6 x 2
##   ns              res                                                  
##   <chr>           <chr>                                                
## 1 ns1.wu-wien.ac… "; Transfer failed.\n"                               
## 2 ns2.urbanek.in… "R-project.org.\t\t7200\tIN\tSOA\tns0.wu-wien.ac.at.…
## 3 ns1.urbanek.in… "R-project.org.\t\t7200\tIN\tSOA\tns0.wu-wien.ac.at.…
## 4 ns3.urbanek.in… "R-project.org.\t\t7200\tIN\tSOA\tns0.wu-wien.ac.at.…
## 5 ns4.urbanek.in… "R-project.org.\t\t7200\tIN\tSOA\tns0.wu-wien.ac.at.…
## 6 ns2.wu-wien.ac… "; Transfer failed.\n"

(NOTE: You may not get things in the same order if you try this at home due to the way DNS queries and responses work.)
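If you would rather not eyeball the tibble, the refusals can be flagged programmatically. The sketch below uses a small stand-in data frame (`probe`, a hypothetical name) holding two shortened responses copied from the output above, and assumes that a refused transfer always contains the text "Transfer failed".

```r
# Stand-in for `xdf`: two shortened responses in the same shape as the real ones
probe <- data.frame(
  ns  = c("ns1.wu-wien.ac.at.", "ns2.urbanek.info."),
  res = c("; Transfer failed.\n",
          "R-project.org.\t\t7200\tIN\tSOA\tns0.wu-wien.ac.at.\n"),
  stringsAsFactors = FALSE
)

# Assumption: dig reports a refused zone transfer with "Transfer failed"
probe$axfr_ok <- !grepl("Transfer failed", probe$res, fixed = TRUE)

probe[, c("ns", "axfr_ok")]
```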

So, two servers did not accept our request but four did. Let’s see what a set of zone transfer records looks like:

cat(xdf[["res"]][[2]]) 
## R-project.org.    7200  IN  SOA ns0.wu-wien.ac.at. postmaster.wu-wien.ac.at. 2019040400 3600 1800 604800 3600
## R-project.org.    7200  IN  NS  ns1.urbanek.info.
## R-project.org.    7200  IN  NS  ns1.wu-wien.ac.at.
## R-project.org.    7200  IN  NS  ns2.urbanek.info.
## R-project.org.    7200  IN  NS  ns2.wu-wien.ac.at.
## R-project.org.    7200  IN  NS  ns3.urbanek.info.
## R-project.org.    7200  IN  NS  ns4.urbanek.info.
## R-project.org.    7200  IN  A 137.208.57.37
## R-project.org.    7200  IN  MX  5 mc1.ethz.ch.
## R-project.org.    7200  IN  MX  5 mc2.ethz.ch.
## R-project.org.    7200  IN  MX  5 mc3.ethz.ch.
## R-project.org.    7200  IN  MX  5 mc4.ethz.ch.
## R-project.org.    7200  IN  TXT "v=spf1 ip4:129.132.119.208/32 ~all"
## cran.at.R-project.org.  7200  IN  CNAME cran.wu-wien.ac.at.
## beta.R-project.org. 7200  IN  A 137.208.57.37
## bugs.R-project.org. 7200  IN  CNAME rbugs.urbanek.info.
## cran.ch.R-project.org.  7200  IN  CNAME cran.wu-wien.ac.at.
## cloud.R-project.org.  7200  IN  CNAME d3caqzu56oq2n9.cloudfront.net.
## cran.R-project.org. 7200  IN  CNAME cran.wu-wien.ac.at.
## ftp.cran.R-project.org. 7200  IN  CNAME cran.wu-wien.ac.at.
## www.cran.R-project.org. 7200  IN  CNAME cran.wu-wien.ac.at.
## cran-archive.R-project.org. 7200 IN CNAME cran.wu-wien.ac.at.
## developer.R-project.org. 7200 IN  CNAME rdevel.urbanek.info.
## cran.es.R-project.org.  7200  IN  A 137.208.57.37
## ess.R-project.org.  7200  IN  CNAME ess.math.ethz.ch.
## journal.R-project.org.  7200  IN  CNAME cran.wu-wien.ac.at.
## mac.R-project.org.  7200  IN  CNAME r.research.att.com.
## portal.R-project.org. 7200  IN  CNAME r-project.org.
## r-forge.R-project.org.  7200  IN  CNAME r-forge.wu-wien.ac.at.
## *.r-forge.R-project.org. 7200 IN  CNAME r-forge.wu-wien.ac.at.
## search.R-project.org. 7200  IN  CNAME finzi.psych.upenn.edu.
## svn.R-project.org.  7200  IN  CNAME svn-stat.math.ethz.ch.
## translation.R-project.org. 7200 IN  CNAME translation.r-project.kr.
## cran.uk.R-project.org.  7200  IN  CNAME cran.wu-wien.ac.at.
## cran.us.R-project.org.  7200  IN  A 137.208.57.37
## user2004.R-project.org. 7200  IN  CNAME r-project.org.
## useR2006.R-project.org. 7200  IN  CNAME r-project.org.
## user2007.R-project.org. 7200  IN  CNAME r-project.org.
## useR2008.R-project.org. 7200  IN  CNAME r-project.org.
## useR2009.R-project.org. 7200  IN  CNAME r-project.org.
## user2010.R-project.org. 7200  IN  CNAME r-project.org.
## useR2011.R-project.org. 7200  IN  CNAME r-project.org.
## useR2012.R-project.org. 7200  IN  CNAME r-project.org.
## useR2013.R-project.org. 7200  IN  CNAME r-project.org.
## user2014.R-project.org. 7200  IN  CNAME user2014.github.io.
## useR2015.R-project.org. 7200  IN  CNAME r-project.org.
## useR2016.R-project.org. 7200  IN  CNAME user2016.github.io.
## useR2017.R-project.org. 7200  IN  CNAME r-project.org.
## useR2018.R-project.org. 7200  IN  CNAME user-2018.netlify.com.
## useR2019.R-project.org. 7200  IN  A 5.135.185.16
## wiki.R-project.org. 7200  IN  CNAME cran.wu-wien.ac.at.
## win-builder.R-project.org. 7200 IN  A 129.217.207.166
## win-builder.R-project.org. 7200 IN  MX  0 rdevel.urbanek.info.
## www.R-project.org.  7200  IN  CNAME cran.wu-wien.ac.at.
## R-project.org.    7200  IN  SOA ns0.wu-wien.ac.at. postmaster.wu-wien.ac.at. 2019040400 3600 1800 604800 3600

That’s not pretty, but it’s wrangle-able. Let’s turn it into a data frame:

xdf[["res"]][[2]] %>% # get the response text
  stri_split_lines(omit_empty = TRUE) %>%  # split it into lines
  flatten_chr() %>%  # turn it into a character vector
  stri_split_regex("[[:space:]]+", n = 5, simplify = TRUE) %>% # split at whitespace, limiting to five fields
  as_tibble(.name_repair = "unique") %>% # make it a tibble
  set_names(c("host", "ttl", "class", "record_type", "value")) %>% # better colnames
  mutate(host = stri_trans_tolower(host)) %>% # case matters not in DNS names
  print(n=nrow(.)) # see our results
## # A tibble: 55 x 5
##    host                     ttl   class record_type value                                                              
##    <chr>                    <chr> <chr> <chr>       <chr>                                                              
##  1 r-project.org.           7200  IN    SOA         ns0.wu-wien.ac.at. postmaster.wu-wien.ac.at. 2019040400 3600 1800 …
##  2 r-project.org.           7200  IN    NS          ns1.urbanek.info.                                                  
##  3 r-project.org.           7200  IN    NS          ns1.wu-wien.ac.at.                                                 
##  4 r-project.org.           7200  IN    NS          ns2.urbanek.info.                                                  
##  5 r-project.org.           7200  IN    NS          ns2.wu-wien.ac.at.                                                 
##  6 r-project.org.           7200  IN    NS          ns3.urbanek.info.                                                  
##  7 r-project.org.           7200  IN    NS          ns4.urbanek.info.                                                  
##  8 r-project.org.           7200  IN    A           137.208.57.37                                                      
##  9 r-project.org.           7200  IN    MX          5 mc1.ethz.ch.                                                     
## 10 r-project.org.           7200  IN    MX          5 mc2.ethz.ch.                                                     
## 11 r-project.org.           7200  IN    MX          5 mc3.ethz.ch.                                                     
## 12 r-project.org.           7200  IN    MX          5 mc4.ethz.ch.                                                     
## 13 r-project.org.           7200  IN    TXT         "\"v=spf1 ip4:129.132.119.208/32 ~all\""                           
## 14 cran.at.r-project.org.   7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 15 beta.r-project.org.      7200  IN    A           137.208.57.37                                                      
## 16 bugs.r-project.org.      7200  IN    CNAME       rbugs.urbanek.info.                                                
## 17 cran.ch.r-project.org.   7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 18 cloud.r-project.org.     7200  IN    CNAME       d3caqzu56oq2n9.cloudfront.net.                                     
## 19 cran.r-project.org.      7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 20 ftp.cran.r-project.org.  7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 21 www.cran.r-project.org.  7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 22 cran-archive.r-project.… 7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 23 developer.r-project.org. 7200  IN    CNAME       rdevel.urbanek.info.                                               
## 24 cran.es.r-project.org.   7200  IN    A           137.208.57.37                                                      
## 25 ess.r-project.org.       7200  IN    CNAME       ess.math.ethz.ch.                                                  
## 26 journal.r-project.org.   7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 27 mac.r-project.org.       7200  IN    CNAME       r.research.att.com.                                                
## 28 portal.r-project.org.    7200  IN    CNAME       r-project.org.                                                     
## 29 r-forge.r-project.org.   7200  IN    CNAME       r-forge.wu-wien.ac.at.                                             
## 30 *.r-forge.r-project.org. 7200  IN    CNAME       r-forge.wu-wien.ac.at.                                             
## 31 search.r-project.org.    7200  IN    CNAME       finzi.psych.upenn.edu.                                             
## 32 svn.r-project.org.       7200  IN    CNAME       svn-stat.math.ethz.ch.                                             
## 33 translation.r-project.o… 7200  IN    CNAME       translation.r-project.kr.                                          
## 34 cran.uk.r-project.org.   7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 35 cran.us.r-project.org.   7200  IN    A           137.208.57.37                                                      
## 36 user2004.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 37 user2006.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 38 user2007.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 39 user2008.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 40 user2009.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 41 user2010.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 42 user2011.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 43 user2012.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 44 user2013.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 45 user2014.r-project.org.  7200  IN    CNAME       user2014.github.io.                                                
## 46 user2015.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 47 user2016.r-project.org.  7200  IN    CNAME       user2016.github.io.                                                
## 48 user2017.r-project.org.  7200  IN    CNAME       r-project.org.                                                     
## 49 user2018.r-project.org.  7200  IN    CNAME       user-2018.netlify.com.                                             
## 50 user2019.r-project.org.  7200  IN    A           5.135.185.16                                                       
## 51 wiki.r-project.org.      7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 52 win-builder.r-project.o… 7200  IN    A           129.217.207.166                                                    
## 53 win-builder.r-project.o… 7200  IN    MX          0 rdevel.urbanek.info.                                             
## 54 www.r-project.org.       7200  IN    CNAME       cran.wu-wien.ac.at.                                                
## 55 r-project.org.           7200  IN    SOA         ns0.wu-wien.ac.at. postmaster.wu-wien.ac.at. 2019040400 3600 1800
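If {stringi} is not at hand, roughly the same five-field split can be sketched in base R. This is an alternative I am swapping in for illustration, not the pipeline used above; the sample lines are copied from the zone transfer output, and `zone_lines`, `field` and `records` are names made up for the example.

```r
# A few lines copied from the zone transfer output above
zone_lines <- c(
  "R-project.org.\t\t7200\tIN\tNS\tns1.urbanek.info.",
  "R-project.org.\t\t7200\tIN\tMX\t5 mc1.ethz.ch.",
  "cran.R-project.org.\t7200\tIN\tCNAME\tcran.wu-wien.ac.at.",
  "www.R-project.org.\t7200\tIN\tCNAME\tcran.wu-wien.ac.at."
)

# Four whitespace-delimited fields, then everything that is left as the value
field   <- "([^[:space:]]+)[[:space:]]+"
pattern <- paste0("^", strrep(field, 4), "(.*)$")
parts   <- regmatches(zone_lines, regexec(pattern, zone_lines))

# Drop the full match (first element) and keep the five captured fields
records <- as.data.frame(
  do.call(rbind, lapply(parts, function(p) p[-1])),
  stringsAsFactors = FALSE
)
names(records) <- c("host", "ttl", "class", "record_type", "value")
records$host <- tolower(records$host) # DNS names are case-insensitive

# Quick tally of the record types present
table(records$record_type)
```

Capturing "everything that is left" as the last field plays the same role as `n = 5` above: multi-part values such as the MX priority plus hostname stay together in `value`.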

FIN

Zone transfers are a quick way to get all the DNS information for a site. As such, it isn’t generally recommended to allow zone transfers from just anyone (though trying to keep anything secret in public DNS is a path generally fraught with peril given how easy it is to brute-force record lookups). However, if r-project.org zone transfers stay generally open, then you can use this method to keep a local copy of r-project.org host info and make local /etc/hosts (or the Windows equivalent) entries when issues like the one this past week arise.
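As a rough sketch of that last idea, the A records can be turned into hosts-file lines with a few lines of base R. The `records` data frame here is a hand-built stand-in with values copied from the output above. CNAME records are skipped on purpose, since a hosts file maps names to addresses and the CNAME targets would need to be resolved separately.

```r
# Records copied from the zone transfer output above (already split into fields)
records <- data.frame(
  host = c("r-project.org.", "beta.r-project.org.",
           "cran.r-project.org.", "win-builder.r-project.org."),
  record_type = c("A", "A", "CNAME", "A"),
  value = c("137.208.57.37", "137.208.57.37",
            "cran.wu-wien.ac.at.", "129.217.207.166"),
  stringsAsFactors = FALSE
)

# Only A records map a name straight to an IPv4 address
a_records <- records[records$record_type == "A", ]

# hosts-file lines are "<address> <name>"; drop the trailing root dot
hosts_lines <- sprintf("%s %s", a_records$value, sub("\\.$", "", a_records$host))

writeLines(hosts_lines)
```

Review the lines before appending them to /etc/hosts, and remember to remove them once DNS is healthy again, since stale entries will silently override real lookups.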

Quick Hit: Above the Fold; Hard wrapping text at ‘n’ characters

Despite being on holiday I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings.

The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops.

I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches this:

and extracts the embedded string tree and corresponding timezones array and turns both into something C++ can use.

Originally I just made a header file with the same long lines:

but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you.

The stringi::stri_wrap() function can easily take care of wrapping the time zone array elements for us:
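A rough sketch of that wrapping step, using base R's strwrap() as a dependency-free stand-in for stri_wrap() and a small hand-picked sample of time zone names (both are my assumptions here, not the package's helper code):

```r
# Small hand-picked sample of time zone names; the real array is much longer
tz <- c("Africa/Abidjan", "America/New_York", "Asia/Tokyo",
        "Europe/Vienna", "Pacific/Auckland")

# Quote each element as it would appear in a C++ array, then join with ", "
quoted <- paste0('"', tz, '"', collapse = ", ")

# Re-flow at whitespace so every line stays under 40 characters
wrapped <- strwrap(quoted, width = 40)

writeLines(wrapped)
```

Because the breaks only happen at the spaces after the commas, no quoted element is ever split across lines.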

but, I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that, here are three of them:

library(Rcpp)
library(stringi)
library(tidyverse)
library(hrbrthemes)
library(microbenchmark)

sourceCpp(code = "
#include <Rcpp.h>

// [[Rcpp::export]]
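std::vector< std::string > fold_cpp(const std::string& input, int width) {

  // guard against division by zero and negative widths
  if (width < 1) Rcpp::stop(\"width must be a positive integer\");

  const std::size_t w = static_cast<std::size_t>(width);
  const std::size_t sz = input.length() / w; // number of full-width chunks

  std::vector< std::string > out;
  out.reserve(sz + 1); // room for the full chunks plus a possible remainder

  for (std::size_t idx = 0; idx < sz; idx++) {
    out.push_back(
      input.substr(idx * w, w)
    );
  }

  // any leftover characters form a final, shorter chunk
  if (input.length() % w != 0) out.push_back(input.substr(w * sz));

  return(out);
}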
") 

fold_base <- function(input, width) {

  vapply(
    seq(1, nchar(input), width), 
    function(idx) substr(input, idx, idx + width - 1), 
    FUN.VALUE = character(1)
  )

}

fold_tidy <- function(input, width) {

  map_chr(
    seq(1, nchar(input), width),
    ~stri_sub(input, .x, length = width)
  ) 

}

(If you know of a package that has this type of function def leave a note in the comments).

Each one does the same thing: split the input into consecutive chunks of width characters (the last chunk may be shorter) and store each chunk as one element of a character vector. Let’s see what they do with this toy long string example:

(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = ""))
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc"

for (n in c(1, 7, 30, 40)) {

  print(fold_base(src, n))
  print(fold_tidy(src, n))
  print(fold_cpp(src, n))
  cat("\n")

}
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
## 
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"

So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:

map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{
  microbenchmark(
    base = fold_base(src, .x),
    tidy = fold_tidy(src, .x),
    cpp = fold_cpp(src, .x)
  ) %>% 
    mutate(width = .x) %>% 
    as_tibble()
}) %>% 
  mutate(
    width = factor(width, 
                   levels = sort(unique(width)), 
                   ordered = TRUE)
  ) -> bench_df

ggplot(bench_df, aes(expr, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = width, fill = width),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = "Split width:", 
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

ggplot(bench_df, aes(width, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = expr, fill = expr),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_x_discrete(
    labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70")
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  scale_fill_ft() +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = NULL,
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

The Rcpp version is both faster and more consistent than the other two implementations (though they get faster as the number of string subsetting operations decreases), but they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity. Despite that, I used the Rcpp version to turn the string tree long line into:

FIN

If you need to “fold” like this, how do you currently implement your solution? Found a bug or a better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.
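As one more candidate for the pile, here is a sketch of a vectorized base R variant that relies on substring() accepting vectors of start and stop positions, followed by one way the folded chunks could be written out as adjacent C++ string literals. `fold_substring()` and the C++ variable name `tree` are hypothetical, and the output step assumes the text contains no quotes or backslashes that would need escaping.

```r
# Vectorized base R fold: substring() takes vectors of start/stop positions
fold_substring <- function(input, width) {
  stopifnot(is.character(input), length(input) == 1L, width >= 1)
  if (nchar(input) == 0L) return(character(0))
  starts <- seq(1L, nchar(input), by = width)
  substring(input, starts, starts + width - 1L)
}

src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = "")
chunks <- fold_substring(src, 30)

# Emit each chunk as its own quoted literal; a C++ compiler joins adjacent
# literals. Assumes nothing in the text needs escaping.
cpp_lines <- paste0('  "', chunks, '"')
writeLines(c("static const char* tree =", cpp_lines, ";"))
```

I have not benchmarked this one against the three versions above, so treat it as a readability option and not a performance claim.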

Quick hit: Some ggplot2 Stat ❤️ for {logspline}

I’ve become a big fan of the {logspline} package over the past ~6 months and decided to wrap up a manual ggplot2 plotting process (well, it was at least in an RStudio snippet) into a small {ggplot2} Stat to make it easier to visualize various components of the fitted model.

If you’re new to logspline density estimation this paper by Stone, Hansen, Kooperberg, and Truong is a pretty accessible introduction to the methodology and the (primary) algorithm used in the {logspline} package.

If you’re already a user of the {logspline} package and have been just using the package’s base S3 plot function but wanted the functionality in {ggplot2} this was pretty much designed for you.

If you hit up https://git.rud.is/hrbrmstr/gglogspline there are links to all your fav social code collaboration services plus installation instructions. There’s a single Stat — stat_logspline() — which does all the hard work. By default it will make y the stat(density) computed field but there are other computed fields you can use as well:

count: computed count estimates (similar to ggplot2::stat_density())
probs: distribution function
survival: survival function
hazard: hazard function
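If those last three names are unfamiliar, this sketch shows how the quantities relate to the density by their standard definitions. It uses the standard normal distribution as a stand-in for a fitted logspline model, so it illustrates the math only and is not the package's code.

```r
x <- seq(-3, 3, by = 0.5)

density  <- dnorm(x)            # probability density f(x)
probs    <- pnorm(x)            # distribution function F(x)
survival <- 1 - probs           # survival function S(x) = 1 - F(x)
hazard   <- density / survival  # hazard function h(x) = f(x) / S(x)

round(data.frame(x, density, probs, survival, hazard), 4)
```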

You can take a look at the (light) comparison between geom_histogram(), geom_density(), ggalt::geom_bkde(), and gglogspline::stat_logspline() below (or use this link to bust the iframe):

FIN

For “general purpose” density estimation you can likely still rely on stat_density() but after you dig into the background of logspline and try it out in some appropriate use-cases you may grow to like it as much as I have.

The package itself is super-small and focused. One reason for that is it may be helpful for those who want to create or customize their own ggplot2 Stats. (I think I forgot to change the license to MIT before this post so I’ll rectify that ASAP; I default to AGPL in a fairly pathetic attempt to stop skeezy cybersecurity startups — pretty much 99% of them — from stealing code without giving back to the community).

As usual, kick the tyres and file issues or PRs as you feel moved (and wherever you feel comfortable social coding).

Introducing the {ethercalc} package

I mentioned EtherCalc in a previous post and managed to scrounge some time to put together a fledgling {ethercalc} package (it’s also on GitLab, SourceHut, Bitbucket and GitUgh, just sub out the appropriate URL prefix).

I’m creating a package-specific Docker image for EtherCalc (there are a couple out there but I’m not supporting their use with the package as they have a CORS configuration that makes EtherCalc API wrangling problematic), but I would highly recommend that you just use it via the npm module. To do that you need a working Node.js installation, and I highly recommend also running a local Redis instance (it’s super lightweight). Linux folks can use their fav package manager for that and macOS folks can use homebrew. Folks on the legacy Windows operating system can visit:

to get EtherCalc going.

I also recommend running EtherCalc and Redis together for performance reasons. EtherCalc will maintain a persistent store for your spreadsheets (they call them “rooms” since EtherCalc supports collaborative editing) with or without Redis, but using Redis makes all EtherCalc ops much, much faster.

Once you have Redis running (on localhost, which is the default) and Node.js + npm installed, you can do the following to install EtherCalc:

$ npm install -g ethercalc # may require `sudo` on some macOS or *nix systems

The -g tells npm to install the module globally and will work to ensure the ethercalc executable is on your PATH. Like many things one can install from Node.js or even Python, you may see a cadre of “warnings” and possibly even some “errors”. If you execute the following and see similar messages:

$ ethercalc --host=localhost ## IMPORTANT TO USE --host=localhost
Please connect to: http://localhost:8000/
Starting backend using webworker-threads
Falling back to vm.CreateContext backend
Express server listening on port 8000 in development mode
Zappa 0.5.0 "You can't do that on stage anymore" orchestrating the show
Connected to Redis Server: localhost:6379

and then go to the URL it gives you and you see something like this:

then you’re all set to continue.

A [Very] Brief EtherCalc Introduction

EtherCalc has a wiki. As such, please hit that to get more info on EtherCalc.

For now, if you hit that big, blue “Create Spreadsheet” button, you’ll see something pretty familiar if you’ve used Google Sheets, Excel, LibreOffice Calc (etc):

If you start ethercalc without the --host=localhost it listens on all network interfaces, so other folks on your network can not only use it as a local “cloud” spreadsheet app but also edit with you, just like Google Sheets.

I recommend playing around a bit in EtherCalc before continuing just to see that it is, indeed, a spreadsheet app like all the others you are familiar with, except it has a robust API that we can orchestrate from within R, now.

Working with {ethercalc}

You can install {ethercalc} from the aforelinked source or via:

install.packages("ethercalc", repos = "https://cinc.rud.is")

where you’ll get a binary install for Windows and macOS (binary builds are for R 3.5.x but should also work for 3.6.x installs).

If you don’t want to drop to a command line interface to start EtherCalc you can use ec_start() to run one that will only be live during your R session.

Once you have EtherCalc running you’ll need to put the URL into an ETHERCALC_HOST environment variable. I recommend adding the following to ~/.Renviron and restarting your R session:

ETHERCALC_HOST=http://localhost:8000

(You’ll get an interactive prompt to provide this if you don’t have the environment variable set up.)
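If restarting R is a hassle, here is a minimal sketch of setting the variable for just the current session. It uses only base R, and the URL is an assumption (the default local instance from the startup messages above):

```r
# Set the EtherCalc URL for the current R session only
# (assumes the default local instance from the startup messages above)
Sys.setenv(ETHERCALC_HOST = "http://localhost:8000")

# Read it back; Sys.getenv() returns "" when a variable is unset
host <- Sys.getenv("ETHERCALC_HOST")
nzchar(host)
```

Unlike the ~/.Renviron route, this setting disappears when the session ends.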

You can verify R can talk to your EtherCalc instance by executing ec_running() and reading the message or examining the (invisible) return value. Post a comment or file an issue (on your preferred social coding site) if you really think you’ve done everything right and still aren’t up and running by this point.

The use-case I set up in the previous blog post was to perform light data entry since scraping was both prohibited and would have taken more time given how the visualization was made. To start a new spreadsheet (remember, EtherCalc folks call these “rooms”), just do:

ec_new("for-blog")

And you should see this appear in your default web browser:

You can do ec_list() to see the names of all “saved” spreadsheets (ec_delete() can remove them, too).

We’ll type in the values from the previous post:

Now, to retrieve those values, we can do:

ec_read("for-blog", col_types="cii")
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <int>            <int>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

That function takes any (relevant to this package use-case) parameter that readr::read_csv() takes (since it uses that under the hood to parse the object that comes back from the API call). If someone adds or modifies any values you can just call ec_read() again to retrieve them.
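In readr's compact notation, "cii" means one character column followed by two integer columns. As a rough illustration that does not need a running EtherCalc instance, here is a base-R analogue using read.csv() and colClasses. This is a swapped-in technique, not what ec_read() does internally, and the two rows are copied from the output above:

```r
# Two rows of the survey data as CSV text
csv_text <- "topic,actually_read,say_want_covered
Health care,7,1
Climate change,5,2"

# Base-R analogue of readr's col_types = "cii":
# character, integer, integer
peek <- read.csv(
  text = csv_text,
  colClasses = c("character", "integer", "integer")
)

str(peek)
```

Without the column spec, readr would guess doubles for the two rank columns, which is what the {httr} version of this read shows in the previous post.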

The ec_export() function lets you download the contents of the spreadsheet (“room”) to a local:

CSV
JSON
HTML
Markdown
Excel

file (and it also returns the raw data directly to the R session). So you can do something like:

cat(rawToChar(ec_export("for-blog", "md", "~/Data/survey.md")))
## |topic|actually_read|say_want_covered|
## | ---- | ---- | ---- |
## |Health care|7|1|
## |Climate change|5|2|
## |Education|11|3|
## |Economics|6|4|
## |Science|10|7|
## |Technology|14|8|
## |Business|13|11|
## |National Security|1|5|
## |Politics|2|10|
## |Sports|3|14|
## |Immigration|4|6|
## |Arts & entertainment|8|13|
## |U.S. foreign policy|9|9|
## |Religion|12|12|
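The rawToChar() call is needed because the export comes back as a raw (byte) vector rather than a string. A small base-R sketch of that round trip, where the Markdown snippet is a made-up stand-in for what ec_export() returns:

```r
# A made-up Markdown fragment standing in for an export payload
md <- "|topic|actually_read|\n| ---- | ---- |\n|Health care|7|\n"

# charToRaw() gives the byte representation (a stand-in for the raw return value)
raw_bytes <- charToRaw(md)
is.raw(raw_bytes)

# rawToChar() turns the bytes back into a single string that cat() can print
round_trip <- rawToChar(raw_bytes)
cat(round_trip)
```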

You can also append to a spreadsheet right from R. We’ll sort that data frame (to prove the append is working and I’m not fibbing) and append it to the existing sheet (this is a toy example, but imagine appending to an always-running EtherCalc instance as a data logger, which folks actually do IRL):

ec_read("for-blog", col_types="cii") %>% 
  dplyr::arrange(desc(topic)) %>% 
  ec_append("for-blog")

Note that you can open up EtherCalc to any existing spreadsheets (“rooms”) via ec_view() as well.

FIN

It’s worth noting that EtherCalc appears to have a limit of around 500,000 “cells” per spreadsheet (“room”). I mention that because if you try to, say, ec_edit(ggplot2movies::movies, "movies") you would very likely crash the running EtherCalc instance had I not coded some guard rails into that function and the ec_append() function to stop you from doing that. It’s a sane limit IMO, and Google Sheets does something similar (per-tab) for similar reasons (and both limits are one reason I’m still against using a browser for “everything” given the limitations of javascript wrangling of DOM elements).
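To make the idea of a guard rail concrete, here is a hypothetical sketch of a cell-count check. The helper names and the exact comparison are assumptions for illustration and are not the package's actual code:

```r
# Hypothetical guard, NOT the actual {ethercalc} implementation
max_cells <- 500000  # approximate limit mentioned above

# A data frame's "cell" count is simply rows times columns
cell_count <- function(df) nrow(df) * ncol(df)

# TRUE when the data frame fits under the limit
within_cell_budget <- function(df, limit = max_cells) {
  cell_count(df) <= limit
}

cell_count(mtcars)          # 32 rows x 11 columns
within_cell_budget(mtcars)
```

A real guard would presumably stop() with a helpful message rather than return FALSE, but the arithmetic is the same.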

If you’re doing work on large-ish data, spreadsheets in general aren’t the best tools.

And, while you should avoid hand-wrangling data at all costs, ec_edit() is a much faster and feature-rich alternative to R’s edit() function on most systems.

I’ve shown off most of the current functionality of the {ethercalc} package in this post. One function I’ve left out is ec_cmd() which lets you completely orchestrate all EtherCalc operations. It’s powerful enough, and the EtherCalc command structure is gnarly enough, that we’ll have to cover it in a separate post. Also, stay tuned for the aforementioned package-specific EtherCalc Docker image.

Kick the tyres, contribute issues and/or PRs as moved (and on your preferred social coding site) and see if both EtherCalc and {ethercalc} might work for you in place of or along with Excel and/or Google Sheets.

Makeover Jambalaya: Beating Dumbbells into Slopegraphs Whilst Orchestrating EtherCalc

This morning, @kairyssdal tweeted out the following graphic from @axios:

Confusing, but interesting.

Data shows we're a nation of news consumption hypocrites – Axios https://t.co/O0lPSc4OV3

— Kai Ryssdal (@kairyssdal) June 11, 2019

If you’re doing the right thing and blocking evil social media javascript you can find the Axios story here and the graphic below:

I’m gonna say it: the chart is confusing. I grok what they were trying to do, but this is a clear example where a slopegraph would definitely be a better choice than a directional dumbbell chart. At the time I had ~5 minutes to spare so I did a quick makeover and a short howto thread. This post is an expansion on said thread and if you’re in the midst of making the decision to consider reading or moving on here’s what we’ll cover:

Making the choice between scraping or manual data entry
Quick introduction to EtherCalc
Why you might consider using EtherCalc for manual data entry over Excel or Google Sheets
Pulling data from EtherCalc into R
Making a slopegraph with the captured data
Customizing the slopegraph with the captured data to tell one or more stories

Read on if any or all of that captures your interest.

To scrape or not to scrape

Even if I wanted to scrape the site, Axios makes it pretty clear they are kinda not very nice people since — while it doesn’t mention scraping — that ToS link does indicate that:

(a) you will not license, sell, rent, lease, transfer, assign, distribute, host, or otherwise commercially exploit the Site or any content displayed on the Site; (b) you will not modify, make derivative works of, disassemble, reverse compile or reverse engineer any part of the Site; (c) you will not access the Site in order to build a similar or competitive website, product, or service; and (d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means. Unless otherwise indicated, any future release, update, or addition to the Site’s functionality will be subject to these Terms. All copyright and other proprietary notices on the Site (or on any content displayed on the Site) must be retained on all copies thereof.

(OH NO I COPIED THAT FROM THEIR SITE SO I AM ALREADY VIOLATING THEIR [unenforceable] TOS!)

There’s this thing called “Fair Use” and this makeover I’m doing is 100% covered under that. The Axios ToS and the ToS of many other sites try to prohibit such fair use and they generally lose those battles in court. I have and will be citing them as sources throughout this post and the post itself falls under “commentary and criticism”. Unlike many unethical scrapers who are just scavenging data they did no work to generate and whose work will not serve the better interest of the general community, this post is fully dedicated to sharing and education.

In reality, Axios likely has such draconian ToS due to all the horrible unethical scrapers who just want free, un-cited news content.

Anyway…

Even if I could scrape, they don’t embed a javascript data object nor do they load an XHR JSON data blob to make the graphic. They use an idiom of loading a base image and then performing annotation via markup:

making it not worth taking the time to scrape.

That means data entry.

Using EtherCalc for fun and profit data entry

I dislike Microsoft Excel (even the modern versions of it) because it is overkill for data entry. I also dislike performing data entry in Google Sheets since that means I need to be cloud-connected. So, for small, local data entry needs I turn to EtherCalc. No internet access is required, nor is there a bloated app to run.

EtherCalc is a multiuser Google Sheets-like browser-based spreadsheet powered by javascript (both in-browser and the back-end). You can install it with:

$ npm install -g ethercalc

which assumes you have a working Node.js setup along with npm.

When you run:

$ ethercalc

you are given a URL to hit with your browser. Below is what that looks like with my data entry already complete:

It can use Redis or a local filesystem as a persistence layer and does support multiple folks editing the same document at the same time.

At this point I could just save it out manually to a CSV file and read it in the old-fashioned way, but EtherCalc has an API! So we can grab the data using {httr} calls, like this:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <dbl>            <dbl>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

where a983kmmne1i7 is the active document identifier.
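Since the CSV endpoint is just the host plus the document identifier plus a .csv extension, a tiny helper can build it for any room. This is a sketch; ec_csv_url() is a hypothetical name and the default host is an assumption based on the local setup above:

```r
# Hypothetical helper: build the CSV URL for an EtherCalc room
# (default host assumes the local instance used in this post)
ec_csv_url <- function(room, host = "http://localhost:8000") {
  host <- sub("/+$", "", host)        # drop any trailing slash
  sprintf("%s/%s.csv", host, room)
}

ec_csv_url("a983kmmne1i7")
```

The result can be handed straight to httr::GET() as in the code above.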

Now that we have the data, it’s time to start the makeover.

Stage 1: A basic slopegraph

(If you need a primer on slopegraphs, definitely check out this resource by @visualisingdata.)

We need to make a decision as to what’s going where on the slopegraph. I’m choosing to put what respondents actually read on the left and then what they say they want covered on the right. Regardless of order, we need to do a bit of data wrangling to take a first stab at the chart:

ggplot() +
  # draw the slope lines
  geom_segment(
    data = xdf,
    aes(
      x = "Actually read", y = actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    )
  ) +
  # left and right vertical bars
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  # left and right category text
  geom_text(data = xdf, aes("Actually read", actually_read, label = topic)) +
  geom_text(data = xdf, aes("Say they\nwant covered", say_want_covered, label = topic)) +
  scale_x_discrete(position = "top")

That chart isn’t winning any (good) awards any time soon. Apart from the non-aligned category labels, the categories aren’t in traditional order (rank “#1” being at the top on the left), plus we definitely need more information on the chart (title, subtitle, caption, etc.). We’ll reorder the labels and tweak some of the aesthetic problems away and switch the theme:

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered))) # slope direction: -1, 0, or 1
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered) # negate ranks so #1 plots at the top

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    ),
    size = 0.25, color = "#b2b2b2"
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01
  ) +
  scale_x_discrete(position = "top") +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank())
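The two mutate() calls at the top of that block are easy to check in isolation. Here is a base-R sketch on three rows of the data (one per slope direction); the demo data frame is just for illustration:

```r
# Three rows from the survey data, one for each slope direction
demo <- data.frame(
  topic = c("Health care", "National Security", "Religion"),
  actually_read = c(7, 1, 12),
  say_want_covered = c(1, 5, 12),
  stringsAsFactors = FALSE
)

# sign() of the rank difference: 1, -1, or 0
demo$dir <- factor(sign(demo$actually_read - demo$say_want_covered))

# Negating the ranks makes rank 1 the largest y value, so it plots at the top
demo$actually_read <- -demo$actually_read
demo$say_want_covered <- -demo$say_want_covered

demo
```

A dir of -1 means the topic ranks higher in what people read than in what they say they want covered.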

That looks much better and I stopped there due to time constraints for the initial thread. However, the slope lines tend to be fairly hard to follow and we really should be telling a story with them. But what story do we want to focus on?

Story time

One aesthetic element we’ll want to immediately modify regardless of story is the line color. We can use the dir column for this:

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    )
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    ),
    labels = c(
      `-1` = "Topics Readers Want Covered < Topics Read",
      `0` = "Topics Read The Same Amount As They Want Covered",
      `1` = "Topics Read < Topics Readers Want Covered"
    )
  ) +
  guides(
    size = "none"
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom") +
  theme(legend.direction = "vertical")

It’s still somewhat hard to pick out stories and the legend may be useful but it’s not ideal. Let’s highlight the different slope types with color, annotate them directly, and see what emerges:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered)))
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered)

arw <- arrow(length = unit(5, "pt"), type = "closed")
#   x = c(1.2, 1.8, 1.9),
# y = -c(1, 13, 14),
# xend = c(1.05, 1.7, 1.6),
# yend = -c(1.125, 13, 14)
# ),
# aes(x, y , xend=xend, yend=yend),

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    ), show.legend = FALSE
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  geom_curve(
    data = data.frame(), 
    aes(x = 1.2, y = -1, xend = 1.05, yend = -1.125), 
    color = ft_cols$red, arrow = arw
  ) +
  geom_segment(
    data = data.frame(), aes(x = 1.6, xend = 1.6, yend = -12.1, y = -12.9), 
    color = "#2b2b2b", arrow = arw
  ) +
  geom_curve(
    data = data.frame(), aes(x = 1.2, y = -14.1, xend = 1.1, yend = -13.6),
    curvature = -0.5, color = ft_cols$blue, arrow = arw
  ) +
  geom_text(
    data = data.frame(
      x = c(1.15, 1.6, 1.2),
      y = -c(1.2, 13, 14),
      hjust = c(0, 0.5, 0),
      vjust = c(0.5, 1, 0.5),
      lab = c(
        "Topics Readers Want Covered < Topics Read",
        "Topics Read The Same Amount\nAs They Want Covered",
        "Topics Read < Topics Readers Want Covered"
      ),
      stringsAsFactors = FALSE
    ),
    aes(x, y, hjust = hjust, vjust = vjust, label = lab),
    family = font_rc, size = 2.5, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.75,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    )
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme(axis.text.x = element_text(size = 12, face = "bold", color = "black")) +
  theme(axis.text.y = element_blank())

This first story indicates a potential social desirability bias in the respondents in that they claim to care more about health care, climate change, and education but really care more about more frivolous things (sports), base things (politics), and things they have almost no control over (national security).

Let’s switch the focus (only showing the modified aesthetic to avoid a code DoS):

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.75
    )
  ) +

Now we get to see just how far down on the priority list some of the “desired coverage” topics really sit. At least Health care is not at the bottom, but given how much technology controls our lives it’s a bit disconcerting to see that at the bottom.

What about the categories that did not differ in rank:

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.75,
      `1` = 0.2
    )
  ) +

Your guess is as good as mine why folks rated these the same (assuming the surveys had similar language).
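For a quick tally behind the three views, here is a base-R sketch that counts topics per slope direction and lists the ones that did not move. The vectors are copied from the data shown in this post:

```r
topic <- c(
  "Health care", "Climate change", "Education", "Economics", "Science",
  "Technology", "Business", "National Security", "Politics", "Sports",
  "Immigration", "Arts & entertainment", "U.S. foreign policy", "Religion"
)
actually_read    <- c(7, 5, 11, 6, 10, 14, 13, 1, 2, 3, 4, 8, 9, 12)
say_want_covered <- c(1, 2, 3, 4, 7, 8, 11, 5, 10, 14, 6, 13, 9, 12)

# Same direction calculation used for the `dir` column
dir <- sign(actually_read - say_want_covered)

# How many topics fall into each direction
dir_counts <- table(dir)
dir_counts

# Topics whose rank is identical in both lists
same_rank <- topic[dir == 0]
same_rank
```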

FIN

Now that you’ve got the data (oh, right, I forgot to do that):

structure(list(topic = c("Health care", "Climate change", "Education", 
"Economics", "Science", "Technology", "Business", "National Security", 
"Politics", "Sports", "Immigration", "Arts & entertainment", 
"U.S. foreign policy", "Religion"), actually_read = c(7, 5, 11, 
6, 10, 14, 13, 1, 2, 3, 4, 8, 9, 12), say_want_covered = c(1, 
2, 3, 4, 7, 8, 11, 5, 10, 14, 6, 13, 9, 12)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec = structure(list(
    cols = list(topic = structure(list(), class = c("collector_character", 
    "collector")), actually_read = structure(list(), class = c("collector_double", 
    "collector")), say_want_covered = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

and some alternate views, perhaps you have an even better way to look at it. Drop a note in the comments with any of your creations or suggestions for improvement for the final versions shown here.

Wrapping Up Exploration of John Deere’s MowerPlus Database

I did another twitter thread on the aforeblogged MowerPlus database as I explored the tables after a second mow to determine what identified a unique mowing “session” (using John Deere’s terms).

This is the thread:

As forewarned, today was the second mow with the new @JohnDeere mower. I'll thread the exploration of what the db looks like with 2 mows logged + have a screen shot of the new macOS 10.15 iOS backup window. (will thread this, too)

— boB Rudis (@hrbrmstr) June 9, 2019

For those folks who are sanely averse to Twitter but have been following along in the blog, the overall TL;DR is twofold: this time around the mow took less time since I didn’t have to go over areas twice, and the ZSESSION column in the ZMOWLOCATION table holds the session id for a given mowing session.

I wrapped up how to get access to the MowerPlus SQLite DB that holds this data into one function and you can see a non-Twitter (and non-annotated) version of the Twitter thread here — this rmarkdown report — or below:

FIN

~~I’ll likely make~~ I made a small package for this since I’ll use it during mowing season, and you can check the other usual suspects (gitlab/bitbucket/gitugh) if you, too, want to use R to help analyze your mower data.

Reader Challenge: Have Some Fun with Apple’s WWDC 2019 Keynote Subtitles & Show Your Work

I was pretty brutal to Apple earlier this week in a Twitter thread that I tried to craft so it occurred in-line with the WWDC live stream (which might be something you want to remember as/if you read on). I really don’t care about “memojis” and I have serious dismay over what is a pretty obvious fact that Apple intends to dumb down computing by shifting most folks from Macs to iPads. Their new “Pro” is for design folks and I’m not holding my breath for them to re-embrace the developer/data science communities with better laptops or smaller cheese graters.

The “meh” hardware/software announcements aren’t the worst parts of these events. The TED-esque scripting (including many failed attempts at faux “authentic” humor) is also becoming quite tedious. I joked about analyzing the “adverbs per minute” but it took a few days for their WWDC 2019 keynote video with a subtitle track to emerge. As a result, current time constraints prevent a dive into the subtitles themselves, but that doesn’t mean you can’t have some fun with them.

Read on to see how I scraped the subtitles or skip to the end to read more about this “Reader Challenge”.

Not So Subtle Subtitle Scraping

If you go to the aforelinked WWDC video URL you’ll see a control on the lower right to add a subtitle track. If you do that with browser Developer Tools open you’ll see what that does:

These are WebVTT formatted subtitles which have a format/syntax that enable them to be displayed at the correct playback timecode. We can see how many of them there are by looking at the end of the file:

So, there are 621 of them and each is requested individually (and super-fast, in-parallel). What do these individual requests look like? Just select one of them to take a look. They’re just plain text responses (it’s not a super-intricate format).

Let’s grab one of them to the clipboard and use the {curlconverter} package to turn that into an httr::GET() request via the straighten() and make_req() functions:

I went ahead and wrapped it into a fairly-well-named function, but the GET request is virtually untouched from the aforementioned process. I just added the {idx} template into the request URL so we can glue() the right index into it. It is likely that some headers could have been eliminated but I just went with what {curlconverter} processed and returned this time.

library(stringi)
library(subtools) # https://github.com/hrbrmstr/subtools ; (ORIG: https://github.com/fkeck/subtools)
library(tidytext)
library(purrrogress) # tidy progress bars for free!
library(tidyverse)

#' Fetches a subtitle by index from the 2019 Apple WWDC Keynote subtitle stream
get_subtitle <- function(idx = 1) {

  st_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_{idx}.webvtt"
  st_url <- glue::glue(st_url)

  httr::GET(
    url = st_url,
    httr::add_headers(
      `sec-ch-ua` = "Google Chrome 75",
      `Sec-Fetch-Mode` = "cors",
      Origin = "https://developer.apple.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
      Referer = "https://developer.apple.com/videos/play/wwdc2019/101/",
      `Sec-Fetch-Dest` = "empty",
      `Sec-Fetch-Site` = "cross-site"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  out <- stringi::stri_split_lines(out)

  purrr::flatten_chr(out)

}

Let’s see what one looks like:

(tmp <- get_subtitle(1))
## [1] "WEBVTT"                                          
## [2] "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000"
## [3] ""                                                
## [4] "3"                                               
## [5] "00:00:21.199 --> 00:00:22.333"                   
## [6] ">> FEMALE SPEAKER:"                              
## [7] "Don't stay up too late."                         
## [8] ""                                                
## [9] ""

Looking good! But, it’s just plain characters and I don’t feel like writing a subtitle parser. And, I don’t have to! François Keck has the {subtools} package which we can use. But, it (used to) only work on files. It now works on character vectors as well (but you’ll need to install it from my fork until the PR is merged). Let’s turn this set of noise into something we can use:

as_subtitle(tmp, format = "webvtt") %>% 
  flatten_df()

## # A tibble: 1 x 4
##   ID    Timecode.in  Timecode.out Text                                
##   <chr> <chr>        <chr>        <chr>                               
## 1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up to…

So tidy!
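For the curious, here is roughly what pulling one cue apart involves. This is a base-R sketch with a hypothetical parse_single_cue() helper. It only handles a chunk containing a single cue like the one above and is not how {subtools} does it:

```r
# Lines as returned by get_subtitle(1) above
cue_lines <- c(
  "WEBVTT",
  "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000",
  "",
  "3",
  "00:00:21.199 --> 00:00:22.333",
  ">> FEMALE SPEAKER:",
  "Don't stay up too late.",
  "",
  ""
)

# Hypothetical helper: handles only a single-cue chunk
parse_single_cue <- function(lines) {
  timing <- grep("-->", lines, fixed = TRUE)[1]   # index of the timecode line
  times  <- strsplit(lines[timing], " --> ", fixed = TRUE)[[1]]
  rest   <- lines[seq_along(lines) > timing]      # everything after the timing line
  text   <- rest[nzchar(rest)]                    # drop blank lines
  data.frame(
    id       = lines[timing - 1],                 # cue id sits just above the timing
    time_in  = times[1],
    time_out = times[2],
    text     = paste(text, collapse = " "),
    stringsAsFactors = FALSE
  )
}

cue <- parse_single_cue(cue_lines)
cue
```

Real subtitle chunks can hold several cues, styling, and notes, which is why leaning on a proper parser is the better call.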

We now need to get all of the subtitles. We’ll do that fast since the video player retrieves them even faster than this iteration does:

# no crawl delay b/c the video player grabs these even faster than this code does
map(1:621, with_progress(get_subtitle)) %>% # with_progress gets you a progress bar for free
  map(as_subtitle, format = "webvtt") %>% 
  flatten_df() %>% 
  as_tibble() -> apple_subs

apple_subs
## # A tibble: 3,220 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 10    00:02:05.200 00:02:12.533 (Applause)                         
## 10 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## # … with 3,210 more rows

Streaming subtitles aren’t error-free and often get duplicated. Let’s see if that’s the case:


# Any dups?
distinct(apple_subs) 
## # A tibble: 2,734 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## 10 12    00:02:15.633 00:02:18.166 Thank you.                         
## # … with 2,724 more rows

apple_subs <- distinct(apple_subs)

There were dups, but not anymore!
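If {dplyr} isn’t handy, the same whole-row de-duplication can be sketched in base R with duplicated(). The toy data frame below just copies three rows from the output above (including the doubled applause):

```r
# Three rows from the scraped subtitles, including one exact duplicate
toy <- data.frame(
  ID           = c("10", "10", "11"),
  Timecode.in  = c("00:02:05.200", "00:02:05.200", "00:02:14.400"),
  Timecode.out = c("00:02:12.533", "00:02:12.533", "00:02:15.566"),
  Text         = c("(Applause)", "(Applause)", ">> TIM COOK: Wow."),
  stringsAsFactors = FALSE
)

# duplicated() flags rows that repeat an earlier row across all columns
deduped <- toy[!duplicated(toy), ]

# Number of rows dropped
nrow(toy) - nrow(deduped)
```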

You can get that data frame via: http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz.

I wanted to see if these looked OK so I dumped just the text to a file and opened it up in Sublime Text to spot check:


apple_subs %>%
  pull(Text) %>% 
  write_lines("/tmp/subs.txt") 

system("subl /tmp/subs.txt") # dblchk

Since we have a good capture of what was spoken, we can start the analysis process:

distinct(apple_subs) %>% 
  filter(!grepl("^\\(|^>>", Text)) %>%
  unnest_tokens(word, Text) %>% 
  anti_join(get_stopwords()) %>% 
  count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 2,408 x 2
##    word      n
##    <chr> <int>
##  1 now     246
##  2 can     205
##  3 new     142
##  4 like    119
##  5 just    106
##  6 app      77
##  7 great    74
##  8 apple    69
##  9 right    64
## 10 apps     59
## # … with 2,398 more rows

And, that’s where I ran out of time.

Reader Challenge

You’ve got the cleaned WWDC 2019 Keynote subtitle track and access to my brutal WWDC 2019 Twitter thread. What fun can you have with it? I’d still like to know the adverbs-per-‘n’ (and what kind they were). But, what else can you discover? Is there a pattern of emotional manipulation through word choices at different times? Did they change tone/style throughout the event? What other questions can you ask and tease out with data?
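Anything “per minute” needs the timecodes as numbers first. Here is one possible starting point in base R; timecode_to_seconds() is a hypothetical helper name and it assumes the HH:MM:SS.mmm layout seen in the data above:

```r
# Hypothetical helper: convert "HH:MM:SS.mmm" timecodes to seconds
timecode_to_seconds <- function(tc) {
  parts <- strsplit(tc, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) sum(as.numeric(p) * c(3600, 60, 1)),  # hours, minutes, seconds
    numeric(1)
  )
}

# The applause cue from the data above
start <- timecode_to_seconds("00:02:05.200")
end   <- timecode_to_seconds("00:02:12.533")

end - start  # duration of the cue in seconds
```

From there, binning cues by minute is a floor(start / 60) away.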

Drop links to your creations (and separate links to code) in the comments and I’ll re-broadcast them on Twitter and gather them all up into a new post to see what y’all came up with.

FIN

There’s no deadline as I can keep on curating as new submissions come in. While this is most assuredly an R-focused blog there is no restriction on the tools you use as well.

Hopefully this will be a fun/creative exercise for folks. If you have any questions about the scraping process or about the WebVTT format don’t hesitate to ping me here or on Twitter (@hrbrmstr).

Make Multi-point “dumbbell” Plots in ggplot2

A user of the {ggalt} package recently posted a question about how to add points to a geom_dumbbell() plot. For now, this is not something you can do with geom_dumbbell() but with a bit of data wrangling you can do this in a pretty straightforward manner with just your data and ggplot2. The example below uses 3 values per category but it should scale to n values per category (though after a certain n you should reconsider the use of a dumbbell chart in favour of a more appropriate way to visualize the message you’re trying to convey).

Here’s the setup:

library(hrbrthemes)
library(tidyverse)

tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
) -> xdf

Three values per category. The approach is pretty straightforward:

reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
reshape the data frame & plot the points
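If it helps to see the reshaped data before any plotting happens, here is a minimal sketch of the intermediate data frames. It rebuilds the same toy data, and the names `long_df` and `range_df` are just for illustration. I have also swapped in min()/max() for the range()[1]/range()[2] idiom; they return the same values.

```r
library(dplyr)
library(tidyr)

# Same toy data as the setup above
xdf <- tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
)

# Long form: one row per (category, measure) pair; this is what the points use
long_df <- gather(xdf, measure, val, -cat)

# Per-category extremes: the two ends of the thick segment
range_df <- long_df %>%
  group_by(cat) %>%
  summarise(start = min(val), end = max(val))

long_df
range_df
```

With this data, January runs from 1 to 5, February from 2 to 8 and March from 4 to 6. The dotted eye-tracking line stops at each category's start value.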

I’ve put those three notes as comments near each ggplot2 geom:

ggplot() +
  # reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      top_n(-1, val) %>% # lowest value per category (ties are trimmed by slice(1))
      slice(1) %>%
      ungroup(),
    aes(x = 0, xend = val, y = cat, yend = cat),
    linetype = "dotted", size = 0.5, color = "gray80"
  ) +
  # reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      summarise(start = range(val)[1], end = range(val)[2]) %>% 
      ungroup(),
    aes(x = start, xend = end, y = cat, yend = cat),
    color = "gray80", size = 2
  ) +
  # reshape the data frame & plot the points
  geom_point(
    data = gather(xdf, measure, value, -cat),
    aes(value, cat, color = measure), 
    size = 4
  ) +
  # i just extended the scale a bit + put axis on top; choose aesthetics that work 
  # for you
  scale_x_comma(position = "top", limits = c(0, 10)) +
  scale_color_ipsum(name = "A real legend title") +
  labs(
    x = "Description of the value", y = NULL,
    title = "A good plot title"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

And, here’s the result: