
Despite being on holiday I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings.

The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops.

I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches this:

and extracts the embedded string tree and corresponding timezones array and turns both into something C++ can use.

Originally I just made a header file with the same long lines:

but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you.

The stringi::stri_wrap() function can easily take care of wrapping the time zone array elements for us:
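
Something along these lines (a sketch, not the actual package helper; the timezone names below are just placeholders):

library(stringi)

# a few example timezone names standing in for the real array
tz_names <- c("America/New_York", "Europe/London", "Australia/Eucla", "Pacific/Chatham")

# wrap the quoted, comma-separated elements at ~60 characters so the
# generated header file has sane line lengths
stri_wrap(
  paste0(sprintf('"%s",', tz_names), collapse = " "),
  width = 60,
  normalize = FALSE
)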

but, I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that, here are three of them:

library(Rcpp)
library(stringi)
library(tidyverse)
library(hrbrthemes)
library(microbenchmark)

sourceCpp(code = "
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector< std::string > fold_cpp(const std::string& input, int width) {

  int sz = input.length() / width;

  std::vector< std::string > out;
  out.reserve(sz); // should make this more efficient

  for (unsigned long idx=0; idx<sz; idx++) {
    out.push_back(
      input.substr(idx*width, width)
    );
  }

  if (input.length() % width != 0) out.push_back(input.substr(width*sz));

  return(out);
}
") 

fold_base <- function(input, width) {

  vapply(
    seq(1, nchar(input), width), 
    function(idx) substr(input, idx, idx + width - 1), 
    FUN.VALUE = character(1)
  )

}

fold_tidy <- function(input, width) {

  map_chr(
    seq(1, nchar(input), width),
    ~stri_sub(input, .x, length = width)
  ) 

}

(If you know of a package that already has this type of function, definitely leave a note in the comments.)

Each one does the same thing: move n sequences of width characters into a new slot in a character vector. Let’s see what they do with this toy long string example:

(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = ""))
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc"

for (n in c(1, 7, 30, 40)) {

  print(fold_base(src, n))
  print(fold_tidy(src, n))
  print(fold_cpp(src, n))
  cat("\n")

}
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
## 
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
##  [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
##  [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"      
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"                          
## 
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"                
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"   

So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:

map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{
  microbenchmark(
    base = fold_base(src, .x),
    tidy = fold_tidy(src, .x),
    cpp = fold_cpp(src, .x)
  ) %>% 
    mutate(width = .x) %>% 
    as_tibble()
}) %>% 
  mutate(
    width = factor(width, 
                   levels = sort(unique(width)), 
                   ordered = TRUE)
  ) -> bench_df

ggplot(bench_df, aes(expr, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = width, fill = width),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = "Split width:", 
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

ggplot(bench_df, aes(width, time)) +
  ggbeeswarm::geom_quasirandom(
    aes(group = expr, fill = expr),
    groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
  ) +
  scale_x_discrete(
    labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70")
  ) +
  scale_y_comma(trans = "log10", position = "right") +
  scale_fill_ft() +
  coord_flip() +
  guides(
    fill = guide_legend(override.aes = list(alpha = 1))
  ) +
  labs(
    x = NULL, y = "Time (nanoseconds)",
    fill = NULL,
    title = "Performance comparison between 'fold' implementations"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = "top")

The Rcpp version is both faster and more consistent than the other two implementations (though they get faster as the number of string subsetting operations decreases); but, they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity. Despite that fact, I used the Rcpp version to turn the string tree long line into:

FIN

If you need to “fold” like this, how do you currently implement your solution? Found a bug or better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.

I’ve become a big fan of the {logspline} package over the past ~6 months and decided to wrap up a manual ggplot2 plotting process (well, it was at least in an RStudio snippet) into a small {ggplot2} Stat to make it easier to visualize various components of the fitted model.

If you’re new to logspline density estimation this paper by Stone, Hansen, Kooperberg, and Truong is a pretty accessible introduction to the methodology and the (primary) algorithm used in the {logspline} package.

If you’re already a user of the {logspline} package and have just been using the package’s base S3 plot function but wanted the functionality in {ggplot2}, this was pretty much designed for you.

If you hit up https://git.rud.is/hrbrmstr/gglogspline there are links to all your fav social code collaboration services plus installation instructions. There’s a single Stat — stat_logspline() — which does all the hard work. By default it will make y the stat(density) computed field but there are other computed fields you can use as well (a minimal usage sketch follows the list):

  • count: computed count estimates (similar to ggplot2::stat_density())
  • probs: distribution function
  • survival: survival function
  • hazard: hazard function
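
Here’s a minimal sketch of how that looks in practice (toy data; the computed field names are the ones listed above, accessed via ggplot2’s stat() helper):

library(ggplot2)
library(gglogspline) # install from https://git.rud.is/hrbrmstr/gglogspline (see above)

set.seed(1)
xdf <- data.frame(x = rnorm(1000))

# default: y is mapped to stat(density)
ggplot(xdf, aes(x)) +
  stat_logspline()

# any of the other computed fields can be swapped in, e.g. the hazard function
ggplot(xdf, aes(x)) +
  stat_logspline(aes(y = stat(hazard)))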

You can take a look at the (light) comparison between geom_histogram(), geom_density(), ggalt::geom_bkde(), and gglogspline::stat_logspline() below (or use this link to bust the iframe):

FIN

For “general purpose” density estimation you can likely still rely on stat_density() but after you dig into the background of logspline and try it out in some appropriate use-cases you may grow to like it as much as I have.

The package itself is super-small and focused. One reason for that is it may be helpful for those who want to create or customize their own ggplot2 Stats. (I think I forgot to change the license to MIT before this post so I’ll rectify that ASAP; I default to AGPL in a fairly pathetic attempt to stop skeezy cybersecurity startups — pretty much 99% of them — from stealing code without giving back to the community).

As usual, kick the tyres and file issues or PRs as you feel moved (and wherever you feel comfortable social coding).

I mentioned EtherCalc in a previous post and managed to scrounge some time to put together a fledgling {ethercalc} package (it’s also on GitLab, SourceHut, Bitbucket and GitUgh, just sub out the appropriate URL prefix).

I’m creating a package-specific Docker image (there are a couple out there but I’m not supporting their use with the package as they have a CORS configuration that makes EtherCalc API wrangling problematic) for EtherCalc but I would highly recommend that you just use it via the npm module. To do that you need a working Node.js installation and I highly recommend also running a local Redis instance (it’s super lightweight). Linux folks can use their fav package manager for that and macOS folks can use homebrew. Folks on the legacy Windows operating system can visit:

to get EtherCalc going.

I also recommend running EtherCalc and Redis together for performance reasons. EtherCalc will maintain a persistent store for your spreadsheets (they call them “rooms” since EtherCalc supports collaborative editing) with or without Redis, but using Redis makes all EtherCalc ops much, much faster.

Once you have Redis running (on localhost, which is the default) and Node.js + npm installed, you can do the following to install EtherCalc:

$ npm install -g ethercalc # may require `sudo` on some macOS or *nix systems

The -g tells npm to install the module globally and ensures the ethercalc executable is on your PATH. Like many things one can install from Node.js or even Python, you may see a cadre of “warnings” and possibly even some “errors”. If you execute the following and see similar messages:

$ ethercalc --host=localhost ## IMPORTANT TO USE --host=localhost
Please connect to: http://localhost:8000/
Starting backend using webworker-threads
Falling back to vm.CreateContext backend
Express server listening on port 8000 in development mode
Zappa 0.5.0 "You can't do that on stage anymore" orchestrating the show
Connected to Redis Server: localhost:6379

and then go to the URL it gives you and you see something like this:

then you’re all set to continue.

A [Very] Brief EtherCalc Introduction

EtherCalc has a wiki. As such, please hit that to get more info on EtherCalc.

For now, if you hit that big, blue “Create Spreadsheet” button, you’ll see something pretty familiar if you’ve used Google Sheets, Excel, LibreOffice Calc (etc):

If you start ethercalc without the --host=localhost it listens on all network interfaces, so other folks on your network can also use it as a local “cloud” spreadsheet app, but also edit with you, just like Google Sheets.

I recommend playing around a bit in EtherCalc before continuing just to see that it is, indeed, a spreadsheet app like all the others you are familiar with, except it has a robust API that we can orchestrate from within R, now.

Working with {ethercalc}

You can install {ethercalc} from the aforelinked source or via:

install.packages("ethercalc", repos = "https://cinc.rud.is")

where you’ll get a binary install for Windows and macOS (binary builds are for R 3.5.x but should also work for 3.6.x installs).

If you don’t want to drop to a command line interface to start EtherCalc you can use ec_start() to run one that will only be live during your R session.

Once you have EtherCalc running you’ll need to put the URL into an ETHERCALC_HOST environment variable. I recommend adding the following to ~/.Renviron and restarting your R session:

ETHERCALC_HOST=http://localhost:8000

(You’ll get an interactive prompt to provide this if you don’t have the environment variable setup.)

You can verify R can talk to your EtherCalc instance by executing ec_running() and reading the message or examining the (invisible) return value. Post a comment or file an issue (on your preferred social coding site) if you really think you’ve done everything right and still aren’t up and running by this point.
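
For example, a quick sanity check at the top of a script might look like this (a sketch; treating the invisible return value as a logical success flag is an assumption):

library(ethercalc)

ok <- ec_running()
if (!isTRUE(ok)) message("Can't reach EtherCalc -- check ETHERCALC_HOST")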

The use-case I setup in the previous blog post was to perform light data entry since scraping was both prohibited and would have taken more time given how the visualization was made. To start a new spreadsheet (remember, EtherCalc folks call these “rooms”), just do:

ec_new("for-blog")

And you should see this appear in your default web browser:

You can do ec_list() to see the names of all “saved” spreadsheets (ec_delete() can remove them, too).

We’ll type in the values from the previous post:

Now, to retrieve those values, we can do:

ec_read("for-blog", col_types="cii")
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <int>            <int>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

That function takes any (relevant to this package use-case) parameter that readr::read_csv() takes (since it uses that under the hood to parse the object that comes back from the API call). If someone adds or modifies any values you can just call ec_read() again to retrieve them.

The ec_export() function lets you download the contents of the spreadsheet (“room”) to a local:

  • CSV
  • JSON
  • HTML
  • Markdown
  • Excel

file (and it also returns the raw data directly to the R session). So you can do something like:

cat(rawToChar(ec_export("for-blog", "md", "~/Data/survey.md")))
## |topic|actually_read|say_want_covered|
## | ---- | ---- | ---- |
## |Health care|7|1|
## |Climate change|5|2|
## |Education|11|3|
## |Economics|6|4|
## |Science|10|7|
## |Technology|14|8|
## |Business|13|11|
## |National Security|1|5|
## |Politics|2|10|
## |Sports|3|14|
## |Immigration|4|6|
## |Arts & entertainment|8|13|
## |U.S. foreign policy|9|9|
## |Religion|12|12|

You can also append to a spreadsheet right from R. We’ll sort that data frame (to prove the append is working and I’m not fibbing) and append it to the existing sheet (this is a toy example, but imagine appending to an always-running EtherCalc instance as a data logger, which folks actually do IRL):

ec_read("for-blog", col_types="cii") %>% 
  dplyr::arrange(desc(topic)) %>% 
  ec_append("for-blog")

Note that you can open up EtherCalc to any existing spreadsheets (“rooms”) via ec_view() as well.

FIN

It’s worth noting that EtherCalc appears to have a limit of around 500,000 “cells” per spreadsheet (“room”). I mention that since, if you tried to, say, ec_edit(ggplot2movies::movies, "movies"), you would very likely have crashed the running EtherCalc instance had I not coded some guard rails into that function and ec_append() to stop you from doing that. It’s a sane limit IMO and Google Sheets does something similar (per-tab) for similar reasons (and both limits are one reason I’m still against using a browser for “everything” given the limitations of javascript wrangling of DOM elements).

If you’re doing work on large-ish data, spreadsheets in general aren’t the best tools.

And, while you should avoid hand-wrangling data at all costs, ec_edit() is a much faster and feature-rich alternative to R’s edit() function on most systems.
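
For instance, something along these lines (same call shape as the ec_edit() example above; the room name is arbitrary):

ec_edit(mtcars, "mtcars-tweaks") # pops the data frame into an EtherCalc "room"
# ...hand-edit cells in the browser...
ec_read("mtcars-tweaks")         # pull the edited values back into R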

I’ve shown off most of the current functionality of the {ethercalc} package in this post. One function I’ve left out is ec_cmd() which lets you completely orchestrate all EtherCalc operations. It’s powerful enough, and the EtherCalc command structure is gnarly enough, that we’ll have to cover it in a separate post. Also, stay tuned for the aforementioned package-specific EtherCalc Docker image.

Kick the tyres, contribute issues and/or PRs as moved (and on your preferred social coding site) and see if both EtherCalc and {ethercalc} might work for you in place of or along with Excel and/or Google Sheets.

This morning, @kairyssdal tweeted out the following graphic from @axios:

If you’re doing the right thing and blocking evil social media javascript you can find the Axios story here and the graphic below:

I’m gonna say it: the chart is confusing. I grok what they were trying to do, but this is a clear example where a slopegraph would definitely be a better choice than a directional dumbbell chart. At the time I had ~5 minutes to spare so I did a quick makeover and a short howto thread. This post is an expansion on said thread and, if you’re deciding whether to keep reading or move on, here’s what we’ll cover:

  • Making the choice between scraping or manual data entry
  • Quick introduction to EtherCalc
  • Why you might consider using EtherCalc for manual data entry over Excel or Google Sheets
  • Pulling data from EtherCalc into R
  • Making a slopegraph with the captured data
  • Customizing the slopegraph with the captured data to tell one or more stories

Read on if any or all of that captures your interest.

To scrape or not to scrape

Even if I wanted to scrape the site, Axios makes it pretty clear they are kinda not very nice people since — while it doesn’t mention scraping — that ToS link does indicate that:

(a) you will not license, sell, rent, lease, transfer, assign, distribute, host, or otherwise commercially exploit the Site or any content displayed on the Site; (b) you will not modify, make derivative works of, disassemble, reverse compile or reverse engineer any part of the Site; (c) you will not access the Site in order to build a similar or competitive website, product, or service; and (d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means. Unless otherwise indicated, any future release, update, or addition to the Site’s functionality will be subject to these Terms. All copyright and other proprietary notices on the Site (or on any content displayed on the Site) must be retained on all copies thereof.

(OH NO I COPIED THAT FROM THEIR SITE SO I AM ALREADY VIOLATING THEIR [unenforceable] TOS!)

There’s this thing called “Fair Use” and this makeover I’m doing is 100% covered under that. The Axios ToS and the ToS of many other sites try to prohibit such fair use and they generally lose those battles in court. I have and will be citing them as sources throughout this post and the post itself falls under “commentary and criticism”. Unlike many unethical scrapers who are just scavenging data they did no work to generate and whose work will not serve the better interest of the general community, this post is fully dedicated to sharing and education.

In reality, Axios likely has such draconian ToS due to all the horrible unethical scrapers who just want free, un-cited news content.

Anyway

Even if I could scrape, they don’t embed a javascript data object, nor do they load an XHR JSON data blob to make the graphic. They use an idiom of loading a base image and then performing annotation via markup:

making it not worth taking the time to scrape.

That means data entry. 🙁

Using EtherCalc for fun and profit data entry

I dislike Microsoft Excel (even the modern versions of it) because it is overkill for data entry. I also dislike performing data entry in Google Sheets since that means I need to be cloud-connected. So, for small, local data entry needs I turn to EtherCalc. No internet access is required, nor is there a bloated app to run.

EtherCalc is a multiuser Google Sheets-like browser-based spreadsheet powered by javascript (both in-browser and the back-end). You can install it with:

$ npm install -g ethercalc

which assumes you have a working Node.js setup along with npm.

When you run:

$ ethercalc

you are given a URL to hit with your browser. Below is what that looks like with my data entry already complete:

It can use Redis or a local filesystem as a persistence layer and does support multiple folks editing the same document at the same time.

At this point I could just save it out manually to a CSV file and read it in the old-fashioned way, but EtherCalc has an API! So we can grab the data using {httr} calls, like this:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <dbl>            <dbl>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

where a983kmmne1i7 is the active document identifier.

Now that we have the data, it’s time to start the makeover.

Stage 1: A basic slopegraph

(If you need a primer on slopegraphs, definitely check out this resource by @visualisingdata.)

We need to make a decision as to what’s going where on the slopegraph. I’m choosing to put what respondents actually read on the left and then what they say they want covered on the right. Regardless of order, we need to do a bit of data wrangling to take a first stab at the chart:

ggplot() +
  # draw the slope lines
  geom_segment(
    data = xdf,
    aes(
      x = "Actually read", y = actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    )
  ) +
  # left and right vertical bars
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  # left and right category text
  geom_text(data = xdf, aes("Actually read", actually_read, label = topic)) +
  geom_text(data = xdf, aes("Say they\nwant covered", say_want_covered, label = topic)) +
  scale_x_discrete(position = "top")

That chart isn’t winning any (good) awards any time soon. Apart from the non-aligned category labels, the categories aren’t in traditional order (rank “#1” being at the top on the left), plus we definitely need more information on the chart (title, subtitle, caption, etc.). We’ll reorder the labels and tweak some of the aesthetic problems away and switch the theme:

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered))) # get the category order right
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered) # reverse the Y axis

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    ),
    size = 0.25, color = "#b2b2b2"
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01
  ) +
  scale_x_discrete(position = "top") +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank())

That looks much better and I stopped there due to time constraints for the initial thread. However, the slope lines tend to be fairly hard to follow and we really should be telling a story with them. But what story do we want to focus on?

Story time

One aesthetic element we’ll want to immediately modify regardless of story is the line color. We can use the dir column for this:

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    )
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    ),
    labels = c(
      `-1` = "Topics Readers Want Covered < Topics Read",
      `0` = "Topics Read The Same Amount As They Want Covered",
      `1` = "Topics Read < Topics Readers Want Covered"
    )
  ) +
  guides(
    size = FALSE
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom") +
  theme(legend.direction = "vertical")

It’s still somewhat hard to pick out stories and the legend may be useful but it’s not ideal. Let’s highlight the different slope types with color, annotate them directly, and see what emerges:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered)))
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered)

arw <- arrow(length = unit(5, "pt"), type = "closed")
#   x = c(1.2, 1.8, 1.9),
# y = -c(1, 13, 14),
# xend = c(1.05, 1.7, 1.6),
# yend = -c(1.125, 13, 14)
# ),
# aes(x, y , xend=xend, yend=yend),

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    ), show.legend = FALSE
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  geom_curve(
    data = data.frame(), 
    aes(x = 1.2, y = -1, xend = 1.05, yend = -1.125), 
    color = ft_cols$red, arrow = arw
  ) +
  geom_segment(
    data = data.frame(), aes(x = 1.6, xend = 1.6, yend = -12.1, y = -12.9), 
    color = "#2b2b2b", arrow = arw
  ) +
  geom_curve(
    data = data.frame(), aes(x = 1.2, y = -14.1, xend = 1.1, yend = -13.6),
    curvature = -0.5, color = ft_cols$blue, arrow = arw
  ) +
  geom_text(
    data = data.frame(
      x = c(1.15, 1.6, 1.2),
      y = -c(1.2, 13, 14),
      hjust = c(0, 0.5, 0),
      vjust = c(0.5, 1, 0.5),
      lab = c(
        "Topics Readers Want Covered < Topics Read",
        "Topics Read The Same Amount\nAs They Want Covered",
        "Topics Read < Topics Readers Want Covered"
      ),
      stringsAsFactors = FALSE
    ),
    aes(x, y, hjust = hjust, vjust = vjust, label = lab),
    family = font_rc, size = 2.5, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.75,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    )
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme(axis.text.x = element_text(size = 12, face = "bold", color = "black")) +
  theme(axis.text.y = element_blank())

This first story indicates a potential social desirability bias in the respondents in that they claim to care more about health care, climate change, and education but really care more about more frivolous things (sports), base things (politics), and things they have almost no control over (national security).

Let’s switch the focus (only showing the modified aesthetic to avoid a code DoS):

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.75
    )
  ) +

Now we get to see just how far down on the priority list some of the “desired coverage” topics really sit. At least Health care is not at the bottom, but given how much technology controls our lives it’s a bit disconcerting to see that at the bottom.

What about the categories that did not differ in rank:

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.75,
      `1` = 0.2
    )
  ) +

Your guess is as good as mine as to why folks rated these the same (assuming the surveys had similar language).

FIN

Now that you’ve got the data (oh, right, I forgot to do that):

structure(list(topic = c("Health care", "Climate change", "Education", 
"Economics", "Science", "Technology", "Business", "National Security", 
"Politics", "Sports", "Immigration", "Arts & entertainment", 
"U.S. foreign policy", "Religion"), actually_read = c(7, 5, 11, 
6, 10, 14, 13, 1, 2, 3, 4, 8, 9, 12), say_want_covered = c(1, 
2, 3, 4, 7, 8, 11, 5, 10, 14, 6, 13, 9, 12)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec = structure(list(
    cols = list(topic = structure(list(), class = c("collector_character", 
    "collector")), actually_read = structure(list(), class = c("collector_double", 
    "collector")), say_want_covered = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

and some alternate views, perhaps you have an even better way to look at it. Drop a note in the comments with any of your creations or suggestions for improvement for the final versions shown here.

I did another twitter thread on the aforeblogged MowerPlus database as I explored the tables after a second mow to determine what identified a unique mowing “session” (using John Deere’s terms).

This is the thread:

For those folks who are sanely averse to Twitter but have been following along in the blog, the overall TLDR is that this time around it took less time since I didn’t have to go over areas twice and that the ZSESSION column in the ZMOWLOCATION table holds the session id for a given mowing session.
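
In code terms, that means pulling the points for a single session looks something like this (a sketch; it assumes you’ve copied MowTracking.sqlite out of an iOS backup and opened it as mow — see the iOS-backup post further down — and that a session with id 2 exists):

library(tidyverse)

tbl(mow, "ZMOWLOCATION") %>%
  filter(ZSESSION == 2) %>%                       # the mowing session of interest
  select(ZLATITUDE, ZLONGITUDE, ZTIMESTAMP) %>%   # just the GPS track
  collect()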

I wrapped up how to get access to the MowerPlus SQLite DB that holds this data into one function and you can see a non-Twitter (and non-annotated) version of the Twitter thread here — this rmarkdown report — or below:

FIN

I made a small package for this since I’ll use it during mowing season, and you can check the other usual suspects (gitlab/bitbucket/gitugh) if you, too, want to use R to help analyze your mower data.

I was pretty brutal to Apple earlier this week in a Twitter thread that I tried to craft so it occurred in-line with the WWDC live stream (which might be something you want to remember as/if you read on). I really don’t care about “memojis” and I have serious dismay over what is a pretty obvious fact that Apple intends to dumb down computing by shifting most folks from Macs to iPads. Their new “Pro” is for design folks and I’m not holding my breath for them to re-embrace the developer/data science communities with better laptops or smaller cheese graters.

The “meh” hardware/software announcements aren’t the worst parts of these events. The TED-esque scripting (including many failed attempts at faux “authentic” humor) is also becoming quite tedious. I joked about analyzing the “adverbs per minute” but it took a few days for their WWDC 2019 keynote video with a subtitle track to emerge. As a result, current time constraints prevent a dive into the subtitles themselves, but that doesn’t mean you can’t have some fun with them.

Read on to see how I scraped the subtitles or skip to the end to read more about this “Reader Challenge”.

Not So Subtle Subtitle Scraping

If you go to the aforelinked WWDC video URL you’ll see a control on the lower right to add a subtitle track. If you do that with browser Developer Tools open you’ll see what that does:

webdevtools subtitle index screen shot

These are WebVTT-formatted subtitles, which have a format/syntax that enables them to be displayed at the correct playback timecode. We can see how many of them there are by looking at the end of the file:

count of subtitles

So, there are 621 of them and each is requested individually (and super fast, in parallel). What do these individual requests look like? Just select one of them to take a look. They’re just plain text responses (it’s not a super-intricate format).

Let’s grab one of them to the clipboard and use the {curlconverter} package to turn that into an httr::GET() request via the straighten() and make_req() functions:
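
The general shape of that workflow is below (a sketch; it assumes the “Copy as cURL” text for one of those requests is on the clipboard):

library(curlconverter)

# parse the copied cURL command-line and turn it into a working httr request
# function you can then clean up by hand
req <- make_req(straighten())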

I went ahead and wrapped it into a fairly-well-named function, but the GET request is virtually untouched from the aforementioned process. I just added the {idx} template into the request URL so we can glue() the right index into it. It is likely that some headers could have been eliminated but I just went with what {curlconverter} processed and returned this time.

library(stringi)
library(subtools) # https://github.com/hrbrmstr/subtools ; (ORIG: https://github.com/fkeck/subtools)
library(tidytext)
library(purrrogress) # tidy progress bars for free!
library(tidyverse)

#' Fetches a subtitle by index from the 2019 Apple WWDC Keynote subtitle stream
get_subtitle <- function(idx = 1) {

  st_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_{idx}.webvtt"
  st_url <- glue::glue(st_url)

  httr::GET(
    url = st_url,
    httr::add_headers(
      `sec-ch-ua` = "Google Chrome 75",
      `Sec-Fetch-Mode` = "cors",
      Origin = "https://developer.apple.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
      Referer = "https://developer.apple.com/videos/play/wwdc2019/101/",
      `Sec-Fetch-Dest` = "empty",
      `Sec-Fetch-Site` = "cross-site"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  out <- stringi::stri_split_lines(out)

  purrr::flatten_chr(out)

}

Let’s see what one looks like:

(tmp <- get_subtitle(1))
## [1] "WEBVTT"                                          
## [2] "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000"
## [3] ""                                                
## [4] "3"                                               
## [5] "00:00:21.199 --> 00:00:22.333"                   
## [6] ">> FEMALE SPEAKER:"                              
## [7] "Don't stay up too late."                         
## [8] ""                                                
## [9] ""                 

Looking good! But, it’s just plain characters and I don’t feel like writing a subtitle parser. And, I don’t have to! François Keck has the {subtools} package which we can use. But, it (used to) only work on files. It now works on character vectors as well (but you’ll need to install it from my fork until the PR is merged). Let’s turn this set of noise into something we can use:

as_subtitle(tmp, format = "webvtt") %>% 
  flatten_df()

## # A tibble: 1 x 4
##   ID    Timecode.in  Timecode.out Text                                
##   <chr> <chr>        <chr>        <chr>                               
## 1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up to…

So tidy!

We now need to get all of the subtitles. We’ll do that fast since the video player retrieves them even faster than this iteration does:

# no crawl delay b/c the video player grabs these even faster than this code does
map(1:621, with_progress(get_subtitle)) %>% # with_progress gets you a progress bar for free
  map(as_subtitle, format = "webvtt") %>% 
  flatten_df() %>% 
  as_tibble() -> apple_subs

apple_subs
## # A tibble: 3,220 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 10    00:02:05.200 00:02:12.533 (Applause)                         
## 10 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## # … with 3,210 more rows

Streaming subtitles aren’t error-free and often get duplicated, let’s see if that’s the case:


# Any dups?
distinct(apple_subs)
## # A tibble: 2,734 x 4
##    ID    Timecode.in  Timecode.out Text                               
##    <chr> <chr>        <chr>        <chr>                              
##  1 3     00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
##  2 4     00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.          
##  3 5     00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.        
##  4 6     00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.              
##  5 7     00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.              
##  6 8     00:01:46.733 00:01:47.833 This is good.                      
##  7 9     00:01:49.566 00:01:52.666 (Music playing)                    
##  8 10    00:02:05.200 00:02:12.533 (Applause)                         
##  9 11    00:02:14.400 00:02:15.566 >> TIM COOK: Wow.                  
## 10 12    00:02:15.633 00:02:18.166 Thank you.                         
## # … with 2,724 more rows

apple_subs <- distinct(apple_subs)

There were dups, but not anymore!

You can get that data frame via: http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz.

I wanted to see if these looked OK so I dumped just the text to a file and opened it up in Sublime Text to spot-check:


apple_subs %>% 
  pull(Text) %>% 
  write_lines("/tmp/subs.txt")

system("subl /tmp/subs.txt") # dblchk

Since we have a good capture of what was spoken, we can start the analysis process:

distinct(apple_subs) %>% 
  filter(!grepl("^\\(|^>>", Text)) %>%
  unnest_tokens(word, Text) %>% 
  anti_join(get_stopwords()) %>% 
  count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 2,408 x 2
##    word      n
##    <chr> <int>
##  1 now     246
##  2 can     205
##  3 new     142
##  4 like    119
##  5 just    106
##  6 app      77
##  7 great    74
##  8 apple    69
##  9 right    64
## 10 apps     59
## # … with 2,398 more rows

And, that’s when I’ve run out of time.

Reader Challenge

You’ve got the cleaned WWDC 2019 Keynote subtitle track and access to my brutal WWDC 2019 Twitter thread. What fun can you have with it? I’d still like to know the adverbs-per-‘n’ (and what kind they were). But, what else can you discover? Is there a pattern of emotional manipulation through word choices at different times? Did they change tone/style throughout the event? What other questions can you ask and tease out with data?

Drop links to your creations (and separate links to code) in the comments and I’ll re-broadcast them on Twitter and gather them all up into a new post to see what y’all came up with.

FIN

There’s no deadline as I can keep on curating as new submissions come in. While this is most assuredly an R-focused blog there is no restriction on the tools you use as well.

Hopefully this will be a fun/creative exercise for folks. If you have any questions about the scraping process or about the WebVTT format don’t hesitate to ping me here or on Twitter (@hrbrmstr).

A user of the {ggalt} package recently posted a question about how to add points to a geom_dumbbell() plot. For now, this is not something you can do with geom_dumbbell() but with a bit of data wrangling you can do this in a pretty straightforward manner with just your data and ggplot2. The example below uses 3 values per category but it should scale to n values per category (though after a certain n you should reconsider the use of a dumbbell chart in favour of a more appropriate way to visualize the message you’re trying to convey).

Here’s the setup:

library(hrbrthemes)
library(tidyverse)

tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
) -> xdf

Three values per category. The approach is pretty straightforward:

  • reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  • reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  • reshape the data frame & plot the points

I’ve put ^^ notes near each ggplot2 geom:

ggplot() +
  # reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      top_n(-1) %>% 
      slice(1) %>%
      ungroup(),
    aes(x = 0, xend = val, y = cat, yend = cat),
    linetype = "dotted", size = 0.5, color = "gray80"
  ) +
  # reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      summarise(start = range(val)[1], end = range(val)[2]) %>% 
      ungroup(),
    aes(x = start, xend = end, y = cat, yend = cat),
    color = "gray80", size = 2
  ) +
  # reshape the data frame & plot the points
  geom_point(
    data = gather(xdf, measure, value, -cat),
    aes(value, cat, color = measure), 
    size = 4
  ) +
  # i just extended the scale a bit + put axis on top; choose aesthetics that work 
  # for you
  scale_x_comma(position = "top", limits = c(0, 10)) +
  scale_color_ipsum(name = "A real legend title") +
  labs(
    x = "Description of the value", y = NULL,
    title = "A good plot title"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

And, here’s the result:

In a recent previous post I brazenly talked over the “hard parts” of how I got to the target SQLite file that houses “mowing history” for what has become my weekend obsession. So, we’ll cover just how to do that (find things in iOS backups) in this post along with how to deal with some “gotchas” if you’re doing this from macOS.

macOS (the Knife)

Kurt Weill, Bertolt Brecht, and Marc Blitzstein created some amazing lyrics that Bobby Darin did some sweet, sweet justice to:

I bring that up to talk about the cutting, biting, dangerous edge of macOS that is Apple’s somewhat mixed attempt at protecting your privacy and keeping sensitive files and directories out of sight from the sharp teeth of malware (and to re-pimp my {mactheknife} package). You can read up on Apple’s new protections more thoroughly over at The Eclectic Light Company. For the purposes of this blog post, Apple’s macOS Sandbox policies mean you have to do some extra steps to gain access to the folder and files associated with iOS backups (which is ~/Library/Application Support/MobileSync/Backup/).

If you want RStudio, R, and anything run with Rscript to access these sandboxed areas you’ll need to enable “Full Disk Access” for those apps and executables. First you’ll need to open System Preferences > Security & Privacy and then make the Privacy tab active. Keep that window open and tap the lock to unlock the settings.

Full Disk Access settings panel in macOS

Adding RStudio is easy. Just make Finder active and hit Cmd Shift A and then find and drag the “RStudio” application into the pane+tab you opened in the previous step. Back in the Finder, hit Cmd Shift G and paste in: /Library/Frameworks/R.framework/Resources/bin and go to that folder. Drag both R and Rscript into the pane+tab from the aforementioned step. Finally (and this got me for a minute) you also need to (again, in Finder) hit Cmd Shift G and paste in /Library/Frameworks/R.framework/Versions/3.6/Resources/bin/exec and drag that R executable into the Security & Privacy > Privacy > Full Disk Access pane+tab as well. When you’ve done all that, lock the System Preferences pane and close it.

It is important to note that you just gave “R” and anything that calls R from your user space complete (well, almost) access to every sandboxed area on your system. R is a general purpose programming and scripting language which means any bit of malicious code that knows you have added those executables can use R to read from and write to any area on your system.

It is also important to note that I had to use 3.6 vs the Current symlink for the last entry so that means you need to do this for each new R version you install.

I hope folks on legacy Windows OS installs didn’t skip over this part as you’ll need to go here to figure out where your iOS backups folder is to go through the rest of the post.

Sneakin’ Round The Corner

Windows folks hopefully read at least the last bit of the previous section to figure out where their iOS backups are. On macOS that’s ~/Library/Application Support/MobileSync/Backup/. You need a local backup there (most folks just use iCloud backups these days) and Apple tells you how to do this.

Once you know you’ve got an (unencrypted) backup just go to your iOS backups directory and list the files by date and note the name/path of the most recent backup. Now we can have some fun.
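
If you’d rather find that from R, something like this works (macOS path shown; Windows folks should substitute the backup location from the link above):

# list the local backup directories and pick the most recently modified one
backup_root <- "~/Library/Application Support/MobileSync/Backup"
backups <- list.dirs(backup_root, recursive = FALSE)
backups[which.max(file.mtime(backups))]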

library(XML) # to read plist (property list) files
library(tidyverse) # for printing and access to sqlite dbs

# replace this with the relative path to your most recent backup dir
mb <- "~/Library/Application Support/MobileSync/Backup/28500cd31b9580aaf5815c695ebd3ea5f7455628-20190601-165737"

list.files(mb, pattern = ".*\\.(db|plist)$")
## [1] "Info.plist"      "Manifest.db"      "Manifest.plist"      "Status.plist"

The above code looks for some key metadata files for iOS backups.

  • Info.plist has info on your device
  • Manifest.db has tons of info on all the files in the backup in a SQLite database
  • Manifest.plist has some additional metadata on the backup including applications included in the backup
  • Status.plist contains info on the status of the backup

Let’s take a look at the plists:

info_p <- file.path(mb, "Info.plist")
file.copy(info_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Info.plist", "/tmp/Info.plist"))

info <- XML::readKeyValueDB("/tmp/Info.plist")

str(info)
## List of 11
##  $ Device Name        : chr REDACTED
##  $ Display Name       : chr REDACTED
##  $ ICCID              : chr REDACTED
##  $ IMEI               : chr REDACTED
##  $ IPBE Backup Version: int 1
##  $ Last Backup Date   : POSIXct[1:1], format: "2019-06-01 21:23:02"
##  $ Phone Number       : chr REDACTED
##  $ Product Type       : chr REDACTED
##  $ Product Version    : chr REDACTED
##  $ Serial Number      : chr REDACTED
##  $ Unique Identifier  : chr REDACTED

status_p <- file.path(mb, "Status.plist")
file.copy(status_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Status.plist", "/tmp/Status.plist"))

status <- XML::readKeyValueDB("/tmp/Status.plist")

str(status)
## List of 6
##  $ BackupState  : chr "new"
##  $ Date         : POSIXct[1:1], format: "2019-06-01 21:22:53"
##  $ IsFullBackup : logi FALSE
##  $ SnapshotState: chr "finished"
##  $ UUID         : chr REDACTED
##  $ Version      : chr "3.3"

mainf_p <- file.path(mb, "Manifest.plist")
file.copy(mainf_p, "/tmp", overwrite = TRUE)
system2("plutil", args=c("-convert", "xml1", "-o", "/tmp/Manifest.plist", "/tmp/Manifest.plist"))

manifest <- XML::readKeyValueDB("/tmp/Manifest.plist")

str(manifest, 1)
## List of 8
##  $ Applications        :List of 745
##  $ BackupKeyBag        : chr __truncated__
##  $ Date                : POSIXct[1:1], format: "2019-06-01 20:57:40"
##  $ IsEncrypted         : logi FALSE
##  $ Lockdown            :List of 12
##  $ SystemDomainsVersion: chr "24.0"
##  $ Version             : chr "10.0"
##  $ WasPasscodeSet      : logi TRUE

You’ll note we’re making copies of these files (never play with system-managed files directly unless you know what you’re doing) and turning binary property lists into plain text XML property lists as well so we can read them with the XML::readKeyValueDB() function.

Most of that information is fairly useless for this blog post but I figured you might like to see the hidden things the system knows about your devices. What we do want to check is to see if the John Deere application and data made it into the backup. The Applications slot is a named list of application metadata. Let’s see if there’s anything Deere-ish in it:

grep("deere", names(manifest$Applications), ignore.case = TRUE, value = TRUE)
##                   key 
## "com.deere.mowerplus"

str(manifest$Applications$com.deere.mowerplus, 1)
## List of 4
##  $ CFBundleIdentifier   : chr "com.deere.mowerplus"
##  $ CFBundleVersion      : chr "180"
##  $ ContainerContentClass: chr "Data/Application"
##  $ Path                 : chr "/var/containers/Bundle/Application/30DF2640-A9AA-43A0-AD87-932CA513D75A/MowerPlus.app"

Aye! This means we should have some luck finding “mower” data in the Manifest SQLite database.

Now, we could try to follow UUIDs around but we can also take a stab at a less cumbersome approach. Let’s make a copy of the Manifest database and see what it holds:

mainf_d <- file.path(mb, "Manifest.db")
file.copy(mainf_d, "/tmp", overwrite = TRUE)
## [1] TRUE

(manifest_db <- src_sqlite("/tmp/Manifest.db"))
## src:  sqlite 3.22.0 [/private/tmp/Manifest.db]
## tbls: Files, Properties

We want to get to (hopefully) a SQLite file with the mowing data so we likely care about the Files table. Let’s take a look at the structure of that table:

(fils <- tbl(manifest_db, "Files"))
## # Source:   table<Files> [?? x 5]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    fileID           domain         relativePath           flags        file
##    <chr>            <chr>          <chr>                  <int>      <blob>
##  1 c1da4199a18d0b5… AppDomain-com… ""                         2 <raw 437 B>
##  2 7426ac0386e2887… AppDomain-com… Library                    2 <raw 444 B>
##  3 a6393e739e1ad37… AppDomain-com… Library/WebKit             2 <raw 444 B>
##  4 c54f5c77a5e970b… AppDomain-com… Library/WebKit/Websit…     2 <raw 458 B>
##  5 578f2c96f219e95… AppDomain-com… Library/WebKit/Websit…     2 <raw 465 B>
##  6 c8833032ce7c9e9… AppDomain-com… Library/WebKit/Websit…     2 <raw 481 B>
##  7 6af21902e595f7c… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
##  8 4c1c49324646af0… AppDomain-com… Library/WebKit/Websit…     2 <raw 471 B>
##  9 d0636bf9b5ba2ae… AppDomain-com… Library/WebKit/Websit…     2 <raw 468 B>
## 10 0b6bb30c8abaa4e… AppDomain-com… Library/Preferences        2 <raw 458 B>
## # … with more rows

If you have a ton of apps, this is a pretty big haystack to comb through. We may be able to narrow things down a bit, though, and we’ll start by seeing what that domain column holds:

distinct(fils, domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##    domain                                 
##    <chr>                                  
##  1 AppDomain-Outils-OBD-Facile.EOBD-Facile
##  2 AppDomain-ch.threema.iapp              
##  3 AppDomain-co.humanco.Human             
##  4 AppDomain-co.ortatech.colr-app         
##  5 AppDomain-co.vero.app                  
##  6 AppDomain-com.7thg.Tides               
##  7 AppDomain-com.AerLingus                
##  8 AppDomain-com.BloomSky.BloomSky        
##  9 AppDomain-com.PunchThrough.LightBlue   
## 10 AppDomain-com.agilebits.onepassword-ios
## # … with more rows

So, these are app-specific and the bits after the - in each one look like the CFBundleIdentifiers from above. Let’s make sure:

filter(fils, lower(domain) %like% "%com.deere.mowerplus%") %>% 
  distinct(domain)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   domain                       
##   <chr>                        
## 1 AppDomain-com.deere.mowerplus

Aye! Let’s check to see what files are in there (and hope for a nice SQLite database):

filter(fils, domain == "AppDomain-com.deere.mowerplus") %>%
  select(relativePath) %>% 
  collect() 
## # A tibble: 14 x 1
##    relativePath                                                     
##    <chr>                                                            
##  1 ""                                                               
##  2 Library                                                          
##  3 Library/Preferences                                              
##  4 Library/Application Support                                      
##  5 Library/Application Support/com.crashlytics                      
##  6 Documents                                                        
##  7 Library/GoAppPersistentStore-GoMow.sqlite                        
##  8 Library/googleanalytics-aux-v4.sql                               
##  9 Library/googleanalytics-v3.sql                                   
## 10 Library/Preferences/com.deere.mowerplus.plist                    
## 11 Library/Application Support/ActivityCards.sqlite                 
## 12 Library/Application Support/com.crashlytics/CLSUserDefaults.plist
## 13 Library/googleanalytics-v2.sql                                   
## 14 Library/Application Support/MowTracking.sqlite

It turns out that last one is what we’re looking for. Now we just need a bit of crucial metadata to get to it:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID, relativePath)
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                            relativePath                           
##   <chr>                             <chr>                                  
## 1 ad0009ec04c44b544d37bfc7ab343869… Library/Application Support/MowTrackin…

That fileID maps to the seriously ugly directory tree that is the rest of the iOS backups folder (you likely looked into it and wondered “What the heck?!”). The top level is a 2-digit hex prefix with files underneath it (likely for performance reasons but a bit of obfuscation never hurts, too). We’ll get the whole string:

filter(fils, relativePath == "Library/Application Support/MowTracking.sqlite") %>% 
  select(fileID)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [/private/tmp/Manifest.db]
##   fileID                                  
##   <chr>                                   
## 1 ad0009ec04c44b544d37bfc7ab3438697d23d618
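
Since that prefix is just the first two characters of the fileID, the on-disk path can be assembled generically (a small sketch using the fileID above):

file_id <- "ad0009ec04c44b544d37bfc7ab3438697d23d618"
file.path(mb, substr(file_id, 1, 2), file_id)
# -> <backup dir>/ad/ad0009ec04c44b544d37bfc7ab3438697d23d618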

and then copy over the mowing database somewhere safe to work on:

file.copy(
  file.path(mb, "ad/ad0009ec04c44b544d37bfc7ab3438697d23d618"),
  "/tmp/mowtrack.sqlite",
  overwrite = TRUE
)
## [1] TRUE

mow <- src_sqlite("/tmp/mowtrack.sqlite")

mow
## src:  sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## tbls: Z_METADATA, Z_MODELCACHE, Z_PRIMARYKEY, ZACTIVITY, ZDEALER,
##   ZMOWALERT, ZMOWER, ZMOWLOCATION, ZSMARTCONNECTOR, ZUSER

tbl(mow, "ZMOWLOCATION") %>% 
  glimpse()
## Observations: ??
## Variables: 16
## Database: sqlite 3.22.0 [/private/tmp/mowtrack.sqlite]
## $ Z_PK                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Z_ENT               <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ Z_OPT               <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZISPAUSEDPOINT      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZORDER              <int> 1, 2, 0, 11, 20, 58, 38, 43, 30, 25, 21, 10,…
## $ ZSESSION            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ ZSESSION2           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ZALTITUDE           <dbl> 42.64804, 42.70590, 40.99661, 39.54770, 38.2…
## $ ZCOURSE             <dbl> 358.242188, 332.226562, 18.281250, 260.85937…
## $ ZHORIZONTALACCURACY <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ ZLATITUDE           <dbl> 43.25913, 43.25914, 43.25913, 43.25915, 43.2…
## $ ZLONGITUDE          <dbl> -70.80069, -70.80069, -70.80069, -70.80067, …
## $ ZSPEED              <dbl> 0.0000000, 0.4250179, 0.5592341, 0.3802792, …
## $ ZTIMESTAMP          <dbl> 581100271, 581100272, 581100270, 581100281, …
## $ ZVERTICALACCURACY   <dbl> 6, 6, 8, 6, 4, 4, 4, 3, 4, 4, 4, 6, 4, 4, 4,…
## $ ZKLVDATA            <blob> <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <…

(Shark) FIN

Even if you don’t have this mower app that I’m currently obsessed with, you now have a primer on how to get to SQLite databases stored by any application on your iOS device. That alone may unearth some fun projects for you to hack on. Plus, you also learned a bit on how to do some light forensics on iOS backups with R/RStudio.

If you did your own trawling and found something interesting definitely blog or tweet about it and drop a link in the comments.