
I’ve blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect() and then wait.

And, wait.

And, wait.

And, wait.

Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means “hung” R sessions and severe frustration, especially when you can log in to the AWS Athena console and see that the results are right there!

I’ve been crafting SQL by hand or using sql_render() to avoid this (when I remember to) but finally felt frustrated enough to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that is still, unfortunately, not a given on many systems).
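
If you haven’t used that workaround before, it amounts to something like this (a quick sketch that reuses the elb_logs table and con connection from the full example further below):

library(dplyr)
library(dbplyr)

# render the SQL behind the dplyr chain instead of collect()-ing it, then
# paste the query into the Athena console (or feed it to the AWS CLI) yourself
tbl(con, "elb_logs") %>%
  filter(requestip == "253.89.30.138") %>%
  sql_render()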

There are two functions below:

  • collect_async(), and
  • gather_results()

The collect_async() function is designed to be used like collect() but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line vs use the aws.s3 package, but that’s not everyone’s cup of tea.

Once I figure out the best package API for this I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) before waiting for perfection.

Here’s the code:

library(rJava)
library(awsjavasdk)
library(aws.signature)
library(aws.s3)
library(odbc)
library(tidyverse)
library(dbplyr)

#' Collect Amazon Athena query results asynchronously
#' 
#' Long running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has 
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena results are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#' 
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the 
#'        Simba ODBC connection)
#' @param region Your AWS region. All lower case with dashes (usually matches
#'        the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored 
#'        (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'        connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'         is designed to be passed to the companion `gather_results()` if you
#'         want to use the `aws.s3` package to retrieve the results. Otherwise,
#'         sync the file however you want using the query execution id.
#' @note You may need to change up the authentication provider depending on how 
#'       you use credentials with Athena
collect_async <- function(obj, schema, region, results_bucket) {

  ugly_query <- as.character(sql_render(obj))

  region <- toupper(region)
  region <- gsub("-", "_", region, fixed=TRUE)

  regions <- J("com.amazonaws.regions.Regions")

  available_regions <- grep("^[[:upper:][:digit:]_]+$", names(regions), value=TRUE)
  if (!region %in% available_regions) stop("Invalid region.", call.=FALSE)

  switch(
    region,
    "GovCloud" = regions$GovCloud,
    "US_EAST_1" = regions$US_EAST_1,
    "US_EAST_2" = regions$US_EAST_2,
    "US_WEST_1" = regions$US_WEST_1,
    "US_WEST_2" = regions$US_WEST_2,
    "EU_WEST_1" = regions$EU_WEST_1,
    "EU_WEST_2" = regions$EU_WEST_2,
    "EU_WEST_3" = regions$EU_WEST_3,
    "EU_CENTRAL_1" = regions$EU_CENTRAL_1,
    "AP_SOUTH_1" = regions$AP_SOUTH_1,
    "AP_SOUTHEAST_1" = regions$AP_SOUTHEAST_1,
    "AP_SOUTHEAST_2" = regions$AP_SOUTHEAST_2,
    "AP_NORTHEAST_1" = regions$AP_NORTHEAST_1,
    "AP_NORTHEAST_2" = regions$AP_NORTHEAST_2,
    "SA_EAST_1" = regions$SA_EAST_1,
    "CN_NORTH_1" = regions$CN_NORTH_1,
    "CN_NORTHWEST_1" = regions$CN_NORTHWEST_1,
    "CA_CENTRAL_1" = regions$CA_CENTRAL_1,
    "DEFAULT_REGION" = regions$DEFAULT_REGION
  ) -> region

  provider <- J("com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
  client <- J("com.amazonaws.services.athena.AmazonAthenaAsyncClientBuilder")

  my_client <- client$standard()
  my_client <- my_client$withRegion(region)
  my_client <- my_client$withCredentials(provider$getInstance())
  my_client <- my_client$build()

  queryExecutionContext <- .jnew("com.amazonaws.services.athena.model.QueryExecutionContext")
  context <- queryExecutionContext$withDatabase(schema)
  result <- .jnew("com.amazonaws.services.athena.model.ResultConfiguration")
  result$setOutputLocation(results_bucket)

  startQueryExecutionRequest <- .jnew("com.amazonaws.services.athena.model.StartQueryExecutionRequest")
  startQueryExecutionRequest$setQueryString(ugly_query)
  startQueryExecutionRequest$setQueryExecutionContext(context)
  startQueryExecutionRequest$setResultConfiguration(result)

  res <- my_client$startQueryExecutionAsync(startQueryExecutionRequest)

  r <- res$get()
  qex_id <- r$getQueryExecutionId()

  list(
    qex_id = qex_id,
    results_bucket = results_bucket
  )

}

#' Gather the results of an asynchronous query
#'
#' @md
#' @param async_result the result of a call to `collect_async()`
#' @return a data frame (tibble) or `NULL` if the query results are not ready yet
gather_results <- function(async_result) {
  if (bucket_exists(sprintf("%s/%s", async_result$results_bucket, async_result$qex_id))) {
    readr::read_csv(
      get_object(sprintf("%s/%s.csv", async_result$results_bucket, async_result$qex_id))
    )
  } else {
    message("Results are not in the designated bucket.")
    return(NULL)
  }
}
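
If you’d rather block until the results land, here’s one way to wrap gather_results() in a simple polling loop (a minimal sketch; tune the interval and retry count to your query sizes):

# poll for async results every few seconds until they show up (or we give up)
wait_for_results <- function(async_result, interval = 5, max_tries = 120) {
  for (i in seq_len(max_tries)) {
    res <- gather_results(async_result)
    if (!is.null(res)) return(res)
    Sys.sleep(interval)
  }
  stop("Gave up waiting for the Athena query results.", call. = FALSE)
}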

Now, we give it a go:

# Set up the credentials you're using
use_credentials("personal")

# load the AWS Java SDK classes
awsjavasdk::load_sdk()

# necessary for Simba ODBC and the async query ops
aws_region <- "us-east-1"
athena_schema <- "sampledb"
athena_results_bucket <- "s3://aws-athena-query-results-redacted"

# connect to Athena and the sample database
DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = athena_schema,
  AwsRegion = aws_region,
  AuthenticationType = "IAM Profile",
  AwsProfile = "personal",
  S3OutputLocation = athena_results_bucket
) -> con

# the sample table in the sample db/schema
elb_logs <- tbl(con, "elb_logs")

# create your dplyr chain. This one is small so I don't incur charges
# collect_async() MUST be the LAST item in the dplyr chain.
elb_logs %>%
  filter(requestip == "253.89.30.138") %>%
  collect_async(
    schema = athena_schema,
    region = aws_region,
    results_bucket = athena_results_bucket
  ) -> async_result

async_result
## $qex_id
## [1] "d5fe7754-919b-47c5-bd7d-3ccdb1a3a414"
## 
## $results_bucket
## [1] "s3://aws-athena-query-results-redacted"

# For long queries we can wait a bit but the function will tell us if the results
# are there or not.

gather_results(async_result)
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   elbname = col_character(),
##   requestip = col_character(),
##   requestport = col_integer(),
##   backendip = col_character(),
##   backendport = col_integer(),
##   requestprocessingtime = col_double(),
##   backendprocessingtime = col_double(),
##   clientresponsetime = col_double(),
##   elbresponsecode = col_integer(),
##   backendresponsecode = col_integer(),
##   receivedbytes = col_integer(),
##   sentbytes = col_integer(),
##   requestverb = col_character(),
##   url = col_character(),
##   protocol = col_character()
## )
## # A tibble: 1 x 16
##   timestamp           elbname requestip     requestport backendip     backendport
##   <dttm>              <chr>   <chr>               <int> <chr>               <int>
## 1 2014-09-29 03:24:38 lb-demo 253.89.30.138       20159 253.89.30.138        8888
## # ... with 10 more variables: requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <int>, backendresponsecode <int>,
## #   receivedbytes <int>, sentbytes <int>, requestverb <chr>, url <chr>, protocol <chr>

If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.

Both my osqueryr and mactheknife packages have had a few updates and I wanted to put together a fun example (it being Friday, and all) for what you can do with them. All my packages are now on GitHub and GitLab and I’ll be maintaining them on both so I can accommodate the comfort level of any and all contributors, but I will be prioritizing issues and PRs on GitLab ahead of any other platform. Having said that, I’ll mark non-CRAN packages with a # notcran comment in the source views so you know you need to install them from wherever you like to grab sketch packages from.

One table that osquery makes available under macOS is an inventory of all “apps” that macOS knows about. Previous posts have shown how to access these tables via the dplyr interface I built for osquery, but they involved multiple steps and as I started to use it more regularly (especially to explore the macOS 10.14 beta I’m running) I noticed that it could use some helper functions. One in particular — osq_expose_tables() — is pretty helpful in that it handles all the dplyr boilerplate code and makes table(s) available in the global environment by name. It takes a single table name or regular expression and then exposes all matching entities. While the function has a help page, it’s easier just to see it in action. Let’s expose the apps table:

library(osqueryr) # notcran
library(tidyverse)

osq_expose_tables("apps")

apps
## # Source:   table<apps> [?? x 19]
## # Database: OsqueryConnection
##    applescript_enab… bundle_executable    bundle_identifier   bundle_name  bundle_package_…
##    <chr>             <chr>                <chr>               <chr>        <chr>           
##  1 0                 1Password 6          com.agilebits.onep… 1Password 6  APPL            
##  2 0                 2BUA8C4S2C.com.agil… 2BUA8C4S2C.com.agi… 1Password m… APPL            
##  3 1                 Adium                com.adiumX.adiumX   Adium        APPL            
##  4 1                 Adobe Connect        com.adobe.adobecon… Adobe Conne… APPL            
##  5 1                 Adobe Illustrator    com.adobe.illustra… Illustrator… APPL            
##  6 ""                AIGPUSniffer         com.adobe.AIGPUSni… AIGPUSniffer APPL            
##  7 ""                CEPHtmlEngine Helper com.adobe.cep.CEPH… CEPHtmlEngi… APPL            
##  8 ""                CEPHtmlEngine        com.adobe.cep.CEPH… CEPHtmlEngi… APPL            
##  9 ""                LogTransport2        com.adobe.headligh… LogTranspor… APPL            
## 10 ""                droplet              ""                  Analyze Doc… APPL            
## # ... with more rows, and 14 more variables: bundle_short_version <chr>,
## #   bundle_version <chr>, category <chr>, compiler <chr>, copyright <chr>,
## #   development_region <chr>, display_name <chr>, element <chr>, environment <chr>,
## #   info_string <chr>, last_opened_time <chr>, minimum_system_version <chr>, name <chr>,
## #   path <chr>

There’s tons of info on all the apps macOS knows about, some of which are system services and “helper” apps (like Chrome’s auto-updater). One field, last_opened_time, caught my eye and I thought it would be handy to see which apps get little use (i.e. ones that haven’t been opened in a while) and which apps I use more frequently (i.e. ones with more recent “open” times). That last_opened_time is a fractional POSIX timestamp and, due to the way osquery created the schemas, it’s stored in a character field. That’s easy enough to convert, and we can then arrange() the whole list in descending order to see what gets used most frequently.
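
That conversion/sort step looks something like this (a small sketch, assuming you’ve exposed the apps table as above):

# convert the character epoch field and sort by recency
select(apps, name, last_opened_time) %>%
  collect() %>%
  mutate(last_opened_time = as.POSIXct(as.numeric(last_opened_time), origin = "1970-01-01")) %>%
  arrange(desc(last_opened_time))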

But, this is R and we can do better than a simple table or even a DT::datatable().

I recently added the ability to read macOS property lists (a.k.a. “plists”) to mactheknife by wrapping a Python module (plistlib). Since all (OK, “most”) macOS apps have an icon, I thought it would be fun to visualize the last opened frequency for each app using the app icons and ggplot2. Unfortunately, ImageMagick (and, thus, the magick package) cannot read macOS icns files, so you’ll need to do a brew install libicns before working with any of the remaining code since we’ll be relying on a command-line utility from that formula.

Let’s get the frontmatter out of the way:

library(sys)
library(magick)
library(osqueryr) # notcran
library(mactheknife) #notcran
library(ggimage)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

osq_expose_tables("apps")

# macOS will use a generic app icon when none is present in an app bundle; this is the location and we'll
# need to use it when our plist app spelunking comes up short

default_app <- "/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/GenericApplicationIcon.icns"

Next, we'll:

  • collect the apps table locally
  • filter out system-ish things (which we really don't care about for this post)
  • convert the last used time to something useful (and reduce it to a day resolution)
  • try to locate the property list for the app and read the path to the app icon file, substituting the generic one if not found (or other errors pop up):
select(apps, name, path, last_opened_time) %>%
  collect() %>%
  filter(!str_detect(path, "(^/System|usr|//System|/Library/|Helper|/Contents/|\\.service$)")) %>%
  mutate(lop_day = as.Date(anytime::anytime(as.numeric(last_opened_time)))) %>%
  mutate(icon = map_chr(path, ~{
    p <- read_plist(file.path(.x, "Contents", "Info.plist"))
    icns <- p$CFBundleIconFile[1]
    if (is.null(icns)) return(default_app)
    if (!str_detect(icns, "\\.icns$")) icns <- sprintf("%s.icns", icns)
    file.path(.x, "Contents", "Resources", icns)
  })) -> apps_df

apps_df
## # A tibble: 274 x 5
##    last_opened_time name                       path                      lop_day    icon                       
##    <chr>            <chr>                      <chr>                     <date>     <chr>                      
##  1 1529958322.11297 1Password 6.app            /Applications/1Password … 2018-06-25 /Applications/1Password 6.…
##  2 1523889402.80918 Adium.app                  /Applications/Adium.app   2018-04-16 /Applications/Adium.app/Co…
##  3 1516307513.7606  Adobe Connect.app          /Applications/Adobe Conn… 2018-01-18 /Applications/Adobe Connec…
##  4 1530044681.76677 Adobe Illustrator.app      /Applications/Adobe Illu… 2018-06-26 /Applications/Adobe Illust…
##  5 -1.0             Analyze Documents.app      /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  6 -1.0             Make Calendar.app          /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  7 -1.0             Contact Sheets.app         /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  8 -1.0             Export Flash Animation.app /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  9 -1.0             Web Gallery.app            /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
## 10 -1.0             Adobe InDesign CC 2018.app /Applications/Adobe InDe… 1969-12-31 /Applications/Adobe InDesi…
## # ... with 264 more rows

Since I really didn't feel like creating a package wrapper for libicns, we're going to use the sys package to make system calls to convert the icns files to png files. We really don't want to do this repeatedly for the same files if we ever run this again, so we'll setup a cache directory to hold our converted pngs.

Apps can (and, usually do) have multiple icons with varying sizes and are not guaranteed to have every common size available. So, we'll have the libicns icns2png utility extract all the icons and use the highest resolution one, using magick to reduce it to a 32x32 png bitmap.

# setup the cache dir -- use whatever you want
cache_dir <- path.expand("~/.r-icns-cache")
dir.create(cache_dir)

# create a unique name hash for more compact names
mutate(apps_df, icns_png = map_chr(icon, ~{
  hash <- digest::digest(.x, serialize=FALSE)
  file.path(cache_dir, sprintf("%s.png", hash))
})) -> apps_df

# find the icns2png program
icns2png <- unname(Sys.which("icns2png"))

# go through each icon file 
pb <- progress_estimated(length(apps_df$icns_png))
walk2(apps_df$icon, apps_df$icns_png, ~{

  pb$tick()$print() # progress!

  if (!file.exists(.y)) { # don't create it if it already exists

    td <- tempdir()

    # no icon file == use default one
    if (!file.exists(.x)) .x <- default_app

    # convert all of them to pngs
    sys::exec_internal(
      cmd = icns2png,
      args = c("-x", "-o", td, .x),
      error = FALSE
    ) -> res

    rawToChar(res$stdout) %>% # go through icns2png output
      str_split("\n") %>%
      flatten_chr() %>%
      keep(str_detect, "  Saved") %>% # find all the extracted icons
      last() %>% # use the last one
      str_replace(".* to /", "/") %>% # clean up the filename so we can read it in
      str_replace("\\.$", "") -> png

    # read and convert
    image_read(png) %>%
      image_resize(geometry_area(32, 32)) %>%
      image_write(.y)

  }

})

You can open up that cache directory with the macOS finder to find all the extracted/converted pngs.

Now, we're on the final leg of our app-use visualization journey.

Some system/utility apps have start-of-epoch dates due to the way the macOS installer tags them. We only want “recent” ones so I set an arbitrary cutoff date of the year 2000. Since many apps would have the same last opened date, I wanted to get a spread-out layout “for free”. One way to do that is to use ggbeeswarm::position_quasirandom():

filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  ggplot() +
  geom_image(
    aes(x="", lop_day, image = icns_png), size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ), 
    aes(x, y, label=label), family = font_an, size = 5 , hjust = 0,
    color = "lightslategray"
  ) +
  labs(x = NULL, y = "Last Opened Time") +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank())

There are tons of other ways to look at this data, and you can use the osquery daemon to log it regularly to get an extra level of detail. An interesting offshoot project would be to grab the latest RStudio dailies and see if you can wrangle a sweet D3 visualization from the app data we collected. Make sure to drop a note with your creations in the comments. You can find the full code in this snippet.

UPDATE (2018-07-07)

A commenter really wanted tooltips with app names. So did I, but neither plotly nor ggiraph supports ggimage, so we can’t get tooltips for free.

However, if you're willing to use the latest RStudio Preview or Daily editions, then we can "easily" use the new built-in D3 support to get some sketch tooltips.

First, we need to change up the plotting code a bit so we can get some base data to feed to D3:

filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  mutate(name = sub("\\.app", "", name)) %>% 
  ggplot() +
  geom_image(
    aes(x="", lop_day, image = icns_png, name=name), size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ), 
    aes(x, y, label=label), family = font_an, size = 5 , hjust = 0,
    color = "lightslategray"
  ) +
  labs(x = NULL, y = "Last Opened Time") +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank()) -> gg

gb <- ggplot_build(gg) # compute the layout

idf <- tbl_df(gb$data[[1]]) # extract the data
idf <- mutate(idf, image = sprintf("lib/imgs-1.0.0/%s", basename(image))) # munge the png paths so D3 can find them

write_rds(idf, "~/Data/apps.rds") # save off the data

Now, we just need some D3 javascript glue:

// !preview r2d3 data=data.frame(readRDS("~/Data/apps.rds")), d3_version = 4, dependencies = htmltools::htmlDependency(name = "imgs", version = "1.0.0", src = "~/.r-icns-cache", all_files = TRUE)

var margin = {top: 16, right: 32, bottom: 16, left: 32},
    width = width - margin.left - margin.right,
    height = height - margin.top - margin.bottom;

var x = d3.scaleLinear().range([0, width]);
var y = d3.scaleLinear().range([height, 0]);

x.domain([
  d3.min(data, function(d) { return d.x; }) - 0.05,
  d3.max(data, function(d) { return d.x; }) + 0.05
]);

y.domain([
  d3.min(data, function(d) { return d.y; }) - 16,
  d3.max(data, function(d) { return d.y; }) + 16
]);

var tooltip = d3.select("body")
    .append("div")
    .style("position", "absolute")
    .style("z-index", "10")
    .style("visibility", "hidden")
    .style("color", "blue")
    .style("background", "white")
    .style("padding", "5px")
    .style("font-family", "sans-serif")
    .text("");

svg.attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform",
          "translate(" + margin.left + "," + margin.top + ")");

var images = svg.selectAll("appimg")
      .data(data)
    .enter().append("svg:image")
      .attr("xlink:href",  function(d) { return d.image;})
      .attr("x", function(d) { return x(d.x);})
      .attr("y", function(d) { return y(d.y);})
      .attr("height", 32)
      .attr("width", 32)
      .on("mouseover", function(d) { return tooltip.style("visibility", "visible").text(d.name); })
  .on("mousemove", function(){ return tooltip.style("top", (event.pageY-10)+"px").style("left",(event.pageX+10)+"px"); })
  .on("mouseout", function(){ return tooltip.style("visibility", "hidden"); });

If you don't want to live dangerously, you can also save that script off and just use r2d3 directly:

r2d3::r2d3(
  data = data.frame(readRDS("~/Data/apps.rds")), 
  script = "~/Desktop/app-d3.js",
  d3_version = 4, 
  dependencies = htmltools::htmlDependency(
    name = "imgs", 
    version = "1.0.0", 
    src = "~/.r-icns-cache", 
    all_files = TRUE
  )
)

Either way gives us interactive tooltips:

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

and went into action, figuring I should blog the results since one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

The PDF in question was pretty uniform, save for the first and last pages (we’ll look at samples of those in a moment).


We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)

# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)

# read it in
doc <- pdf_text(mia_fil) 

Let's look at those three example pages:

cat(doc[[1]])
##                                   Defense POW/MIA Accounting Agency
##                                       Personnel Missing - Korea (PMKOR)
##                                        (Reported for ALL Unaccounted For)
##                                                                                                Total Unaccounted: 7,699
## Name                       Rank/Rate     Branch                           Year State/Territory
## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT
## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA
## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA

cat(doc[[2]])
## Name                          Rank/Rate Branch                     Year State/Territory
## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA
## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII
## ...
## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINA

cat(doc[[length(doc)]])
## Name                                                Rank/Rate           Branch                                              Year         State/Territory
## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA
## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO
## ...
## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA
## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS
##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
##                Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

  • there's a uniform header on each page
  • the "layout" mode returned uniform per-page, fixed-width columns
  • there's no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

  • convert the text to lines
  • find the header line
  • find the column start/end positions from the header on the page (since they are different for each page)
  • read it in with readr::read_fwf()
  • remove headers, preamble and epilogue cruft
  • turn it all into one data frame

# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{
  
  stri_split_lines(.x) %>% 
    flatten_chr() -> lines # want the lines from each page
  
  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>% 
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% 
    map(`[`, 1) %>% 
    flatten_int() -> starts
  
  # now get the ending locations; cheating and using `NA` for the last column  
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates 
  # a surprisingly usable fixed-width table, the core idiom is to find the start/end
  # of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read 
    read_fwf_q(col_positions = cols) %>%  # read it!
    .$result %>%                          # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%              # non-data lines
    filter(name != "Name") %>%            # remove headers from each page
    filter(!stri_detect_regex(name, "^(^This|Please|Report)")) # non-data lines (the last pg footer, rly)
  
}) -> xdf

xdf
## # A tibble: 7,699 x 5
##    name                       rank   branch                  year  state        
##    <chr>                      <chr>  <chr>                   <int> <chr>        
##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      
##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA 
##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  
##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   
##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      
##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS
##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  
##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     
##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     
## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE
## # ... with 7,689 more rows

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <-  "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"

mutate(xdf, year = factor(year)) %>% 
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% 
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width=0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name=NULL) +
  labs(x = NULL, title = title, subtitle = subtitle) +
  theme_ipsum_rc(grid="Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.

The forthcoming RStudio 1.2 release has a new “Jobs” feature for running and managing background R tasks.

I did a series of threaded screencaps on Twitter but that doesn’t do the feature justice.

So I threw together a quick ‘splainer on how to run R and Python code in the background (despite RStudio not natively supporting Python) while you get other stuff done, then work with the results.

A colleague asked if I would blog about how I crafted the grid of world tile grids in this post and I accepted the challenge. The technique isn’t too hard as it just builds on the initial work by Jon Schwabish and a handy file made by Maarten Lambrechts.

The Premise

For this particular use-case, I sifted through our internet scan data and classified a series of device families from their telnet banners then paired that with our country-level attribution data for each IPv4 address. I’m not generally “a fan” of rolling things up at a country level, but since many (most) of these devices are residential or small/medium-business routers, country-level attribution has some merit.

But, I’m also not a fan of country-level choropleths when it comes to “cyber”, nor am I inclined to use area-skewed cartograms since most folks still cannot interpret them. Both of those take up a ton of screen real estate, too, especially if you have more than one of them. Yet, I wanted to show a map-like structure without resorting to Hilbert IPv4 heatmaps, since those are not very readable by a general audience and become skewed when you have to move up from a 1 pixel == 1 Class C network block.

I think the tile grid is a great compromise since it avoids the “area” and projection skewness confusion that regular global choropleths cause while still preserving geographic & positional proximity. Sure, they’ll take some getting used to by casual readers, but I felt it was the best of all the tradeoffs.

The Setup

Here’s the data:


library(here)
library(hrbrthemes)
library(tidyverse)

wtg <- read_csv("https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")

glimpse(wtg)
 
## Observations: 192
## Variables: 11
## $ name             <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ alpha.2          <chr> "AF", "AL", "DZ", "AO", "AQ", "AG", "AR", "AM"...
## $ alpha.3          <chr> "AFG", "ALB", "DZA", "AGO", "ATA", "ATG", "ARG...
## $ country.code     <chr> "004", "008", "012", "024", "010", "028", "032...
## $ iso_3166.2       <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:...
## $ region           <chr> "Asia", "Europe", "Africa", "Africa", "Antarct...
## $ sub.region       <chr> "Southern Asia", "Southern Europe", "Northern ...
## $ region.code      <chr> "142", "150", "002", "002", NA, "019", "019", ...
## $ sub.region.code  <chr> "034", "039", "015", "017", NA, "029", "005", ...
## $ x                <dbl> 22, 15, 13, 13, 15, 7, 6, 20, 24, 15, 21, 4, 2...
## $ y                <dbl> 8, 9, 11, 17, 23, 4, 14, 6, 19, 6, 7, 2, 9, 8,...

routers <- read_csv(here::here("data", "routers.csv"))

routers
 
## # A tibble: 453,027 x 3
##    type     country_name           country_code
##    <chr>    <chr>                  <chr>       
##  1 mikrotik Slovak Republic        SK          
##  2 mikrotik Czechia                CZ          
##  3 mikrotik Colombia               CO          
##  4 mikrotik Bosnia and Herzegovina BA          
##  5 mikrotik Czechia                CZ          
##  6 mikrotik Brazil                 BR          
##  7 mikrotik Vietnam                VN          
##  8 mikrotik Brazil                 BR          
##  9 mikrotik India                  IN          
## 10 mikrotik Brazil                 BR          
## # ... with 453,017 more rows

distinct(routers, type) %>% 
  arrange(type) %>% 
  print(n=11)
 
## # A tibble: 11 x 1
##    type    
##    <chr>   
##  1 asus    
##  2 dlink   
##  3 huawei  
##  4 linksys 
##  5 mikrotik
##  6 netgear 
##  7 qnap    
##  8 tplink  
##  9 ubiquiti
## 10 upvel   
## 11 zte

So, we have 11 different device families under assault by “VPNFilter” and I wanted to show the global distribution of them. Knowing the compact world tile grid would facet well, I set off to make it happen.

Let’s get some decent names for facet labels:


real_names <- read_csv(here::here("data", "real_names.csv"))

real_names
 
## # A tibble: 11 x 2
##    type     lab             
##    <chr>    <chr>           
##  1 asus     Asus Device     
##  2 dlink    D-Link Devices  
##  3 huawei   Huawei Devices  
##  4 linksys  Linksys Devices 
##  5 mikrotik Mikrotik Devices
##  6 netgear  Netgear Devices 
##  7 qnap     QNAP Devices    
##  8 tplink   TP-Link Devices 
##  9 ubiquiti Ubiquiti Devices
## 10 upvel    Upvel Devices   
## 11 zte      ZTE Devices

Next, we need to summarise our scan results and pair them up with the world tile grid data and our real names:


count(routers, country_code, type) %>%  # summarise the data into # of device familes per country
  left_join(wtg, by = c("country_code" = "alpha.2")) %>% # join them up on the common field
  filter(!is.na(alpha.3)) %>% # we only want countries on the grid and maxmind attributes some things to meta-regions and anonymous proxies
  left_join(real_names) -> wtg_routers

glimpse(wtg_routers)

## Observations: 629
## Variables: 14
## $ country_code     <chr> "AE", "AE", "AE", "AF", "AF", "AF", "AG", "AL"...
## $ type             <chr> "asus", "huawei", "mikrotik", "huawei", "mikro...
## $ n                <int> 1, 12, 70, 12, 264, 27, 1, 941, 2081, 7, 2, 1,...
## $ name             <chr> "United Arab Emirates", "United Arab Emirates"...
## $ alpha.3          <chr> "ARE", "ARE", "ARE", "AFG", "AFG", "AFG", "ATG...
## $ country.code     <chr> "784", "784", "784", "004", "004", "004", "028...
## $ iso_3166.2       <chr> "ISO 3166-2:AE", "ISO 3166-2:AE", "ISO 3166-2:...
## $ region           <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia"...
## $ sub.region       <chr> "Western Asia", "Western Asia", "Western Asia"...
## $ region.code      <chr> "142", "142", "142", "142", "142", "142", "019...
## $ sub.region.code  <chr> "145", "145", "145", "034", "034", "034", "029...
## $ x                <dbl> 20, 20, 20, 22, 22, 22, 7, 15, 15, 15, 20, 20,...
## $ y                <dbl> 10, 10, 10, 8, 8, 8, 4, 9, 9, 9, 6, 6, 6, 6, 1...
## $ lab              <chr> "Asus Device", "Huawei Devices", "Mikrotik Dev...

Then, plot it:


ggplot(wtg_routers, aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

Doh! We forgot to ensure we had data for every country. Let’s try that again:


count(routers, country_code, type) %>%
  complete(country_code, type) %>%
  filter(!is.na(country_code)) %>%
  left_join(wtg, c("country_code" = "alpha.2")) %>%
  filter(!is.na(alpha.3)) %>%
  left_join(real_names) %>%
  complete(country_code, type, x=unique(wtg$x), y=unique(wtg$y)) %>%
  filter(!is.na(lab)) %>%
  ggplot(aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

That’s better.

We take advantage of ggplot2’s ability to facet and just ensure we have complete (even if NA) tiles for each panel.
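
If the complete() trick is new to you, here’s a tiny, self-contained illustration (toy data, not the scan results) of how it fills in the missing combinations with NA so every facet gets a full set of tiles:

library(tidyr)

toy <- tibble::tribble(
  ~country, ~type,      ~n,
  "BR",     "mikrotik", 10,
  "SK",     "huawei",    3
)

# expands to every country/type combination, filling the gaps with NA:
# BR/huawei NA, BR/mikrotik 10, SK/huawei 3, SK/mikrotik NA
complete(toy, country, type)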

Once consumers start seeing these used more, they’ll be able to pick up key markers (or one of us will come up with a notation that makes key markers more visible) and get specific information from the chart. I just wanted to show regional and global differences between vendors (and really give MikroTik users a swift kick in the patootie for being so bad with their kit).

FIN

You can find the RStudio project (code + data) here: http://rud.is/dl/tile-grid-grid.zip


NOTE: There is some iframed content in this post and you can bust out of it if you want to see the document in a full browser window.

Also, apologies for some lingering GitHub links. I’m waiting for all the repos to import into other services and haven’t had time to set up my own self-hosted public instance of any community-usable git-ish environment yet.


And So It Begins

After seeing Fira Sans in action in presentations at eRum 2018 I felt compelled to add hrbrthemes support for it, so I made a firasans extension that uses Fira Sans Condensed and Fira Code fonts for ggplot2 graphics.

But I really wanted to go the extra mile and make an R Markdown theme for it, yet I’m wary of both jQuery & Bootstrap, plus prefer Prism over HighlightJS. So I started work on “Prism Skeleton”, which is an R Markdown template that has most of the features you would expect and some new ones, plus uses Prism and Fira Sans/Code. You can try it out on your own if you use markdowntemplates but the “production” version is likely going to eventually go into the firasans package. (I use markdowntemplates as a playground for R Markdown experiments.)

The source for the iframe at the end of this document is here: https://rud.is/dl/hello-dorling.Rmd. There are some notable features (I’ll repeat a few from above):

  • Fira Sans for headers and text
  • Fira Code for all monospaced content (including source code)
  • No jQuery
  • No Bootstrap (it uses the ‘Skeleton’ CSS framework)
  • No HighlightJS (it uses the ‘Prism’ highlighter)
  • Extended YAML parameters (more on that in a bit)
  • Defaults to fig.retina=2 and the use of optipng or pngquant for PNG compression (so it expects them to be installed — ref this post by Zev Ross for more info and additional image use tips)
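
If you want the same defaults in your own templates, the underlying knitr setup is roughly this (a sketch; the optipng options are placeholders, and knitr also ships hook_pngquant() if you prefer pngquant):

# retina figures plus optipng compression of the generated PNGs
# (optipng needs to be installed and on your PATH)
knitr::knit_hooks$set(optipng = knitr::hook_optipng)
knitr::opts_chunk$set(fig.retina = 2, optipng = "-o7")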

“What’s this about ‘Dorling’?”

Oh, yes. You can read the iframe or busted out document for that bit. It’s a small package to make it easier to create Dorling cartograms based on previous work by @datagistips.

“You said something about ‘extended YAML’?”

Aye. Here’s the YAML excerpt from the Dorling Rmd:

---
title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
author: "boB Rudis"
navlink: "[rud.is](https://rud.is/b/)"
og:
  type: "article"
  title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
  url: "https://github.com/hrbrmstr/spdorling"
footer:
  - content: '[GitLab](https://gitlab.com/hrbrmstr)'
  - content: 'This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.'
date: "`r Sys.Date()`"
output: markdowntemplates::prismskel
---

The title, author & date should be familiar fields, but the author and date get some different placement since the goal is more of a flowing document than an academic report.

If navlink is present (it’s not required) there will be a static bar at the top of the HTML document with a link on the right (any content, really, but a link is what’s in the example). Remove navlink and no bar will be there.

The og section is for open graph tags and you customize them how you like. Open graph tags make it easier to share posts on social media or even Slack since they’ll auto-expand various content bits.

There’s also a custom footer (exclude it if you don’t want one) that can take multiple content sub-elements.

The goal isn’t so much to give you a 100% usable R Markdown template but something you can clone and customize for your own use. Since this example shows how to use custom fonts and a different code highlighter (which meant using some custom knitr hooks), it should be easier to customize than some of the other ones in the template playground package. FWIW I plan on adapting this for a work template this week.

The other big customization is the use of Prism with a dark theme. Again, you can clone + customize this at-will but I may add config options for all Prism themes at some point (mostly if there is interest).

FIN

(Well, almost fin)

Kick the tyres on both the new template and the new package and drop suggestions here for the time being (until I get fully transitioned to a new git-hosting platform). One TODO for spdorling is to increase the point count for the circle polygons but I’m sure folks can come up with enhancement requests to the API after y’all have played with it for a while.

As noted a few times, the Rmd example with the Dorling cartograms is below.

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:


library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:


get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:


get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but for now we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (which is super expensive since it also means writing out data to a file each time), or by turning to Python, which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python installs, don’t judge, and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.


biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/" 

That's much better.

Let's work with metadata for the whole directory:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents   
##    <chr>           <chr>                                <dbl> <list>     
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16 <raw [16]> 
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 <raw [110]>
##  3 1109.1968.pdf   com.apple.quarantine                    74 <raw [74]> 
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16 <raw [16]> 
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 <raw [116]>
##  6 1109.1968v1.pdf com.apple.quarantine                    74 <raw [74]> 
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86 <raw [86]> 
##  8 1804.09970.pdf  com.apple.quarantine                    82 <raw [82]> 
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16 <raw [16]> 
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 <raw [104]>
## # ... with 14 more rows

## count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##   <chr>                                <int>
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
  select(file, where_from) %>% 
  unnest() %>% 
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from                                                       
##    <chr>               <chr>                                                            
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
##  2 1109.1968.pdf       https://www.google.com/                                          
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
##  4 1109.1968v1.pdf     https://www.google.com/                                          
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
##  9 athena-ug.pdf       https://www.google.com/                                          
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)
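
If you only care about the direct download link (it’s usually the first URL recorded for each file), collapsing those is a one-liner (a sketch, assuming you saved the pipeline above into a tibble named downloads):

# keep just the first URL recorded for each file
downloads %>%
  group_by(file) %>%
  slice(1) %>%
  ungroup()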

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
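
For example, tagging (and untagging) a data file with your own attribute looks roughly like this (a sketch; set_xattr()/rm_xattr() are the setter/remover names as I recall them from the README, so double-check the exported functions in your install):

# tag a local data file with a custom attribute, read it back, then remove it
set_xattr("mydata.csv", "is.rud.note", "cleaned 2018-06-30")
get_xattr("mydata.csv", "is.rud.note")
rm_xattr("mydata.csv", "is.rud.note")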