Author Archives: hrbrmstr

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

and sprang into action, figuring I should blog the results since one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form. As you can see if you read through the Twitter thread, both tabulizer and other tools were introducing enough errors that the extracted data was neither complete nor trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

The PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:

(slideshow of sample PDF pages)

We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)

# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)

# read it in
doc <- pdf_text(mia_fil) 

Let's look at those three example pages:

cat(doc[[1]])
##                                   Defense POW/MIA Accounting Agency
##                                       Personnel Missing - Korea (PMKOR)
##                                        (Reported for ALL Unaccounted For)
##                                                                                                Total Unaccounted: 7,699
## Name                       Rank/Rate     Branch                           Year State/Territory
## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT
## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA
## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA

cat(doc[[2]])
## Name                          Rank/Rate Branch                     Year State/Territory
## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA
## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII
## ...
## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINA

cat(doc[[length(doc)]])
## Name                                                Rank/Rate           Branch                                              Year         State/Territory
## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA
## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO
## ...
## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA
## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS
##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
##                Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

  • there's a uniform header on each page
  • the "layout" mode returned uniform per-page, fixed-width columns
  • there are no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

  • convert the text to lines
  • find the header line
  • find the column start/end positions from the header on the page (since they are different for each page)
  • read it in with readr::read_fwf()
  • remove headers, preamble and epilogue cruft
  • turn it all into one data frame

# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{
  
  stri_split_lines(.x) %>% 
    flatten_chr() -> lines # want the lines from each page
  
  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>% 
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% 
    map(`[`, 1) %>% 
    flatten_int() -> starts
  
  # now get the ending locations; cheating and using `NA` for the last column  
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates 
  # a surprisingly usable fixed-width table, the core idiom is to find the start/end
  # of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read 
    read_fwf_q(col_positions = cols) %>%  # read it!
    .$result %>%                          # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%              # non-data lines
    filter(name != "Name") %>%            # remove headers from each page
    filter(!stri_detect_regex(name, "^(This|Please|Report)")) # non-data lines (the last pg footer, rly)
  
}) -> xdf

xdf
## # A tibble: 7,699 x 5
##    name                       rank   branch                  year  state        
##    <chr>                      <chr>  <chr>                   <chr> <chr>        
##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      
##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA 
##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  
##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   
##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      
##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS
##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  
##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     
##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     
## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE
## # ... with 7,689 more rows

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <-  "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"

mutate(xdf, year = factor(year)) %>% 
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% 
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width=0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name=NULL) +
  labs(x = NULL, title = title, subtitle = subtitle, caption = caption) +
  theme_ipsum_rc(grid="Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.

The forthcoming RStudio 1.2 release has a new “Jobs” feature for running and managing background R tasks.

I did a series of threaded screencaps on Twitter but that doesn’t do the feature justice.

So I threw together a quick ‘splainer on how to run R and Python code in the background (despite RStudio not natively supporting Python) while you get other stuff done, then work with the results.

A colleague asked if I would blog about how I crafted the grid of world tile grids in this post and I accepted the challenge. The technique isn’t too hard as it just builds on the initial work by Jon Schwabish and a handy file made by Maarten Lambrechts.

The Premise

For this particular use-case, I sifted through our internet scan data and classified a series of device families from their telnet banners then paired that with our country-level attribution data for each IPv4 address. I’m not generally “a fan” of rolling things up at a country level, but since many (most) of these devices are residential or small/medium-business routers, country-level attribution has some merit.

But, I’m also not a fan of country-level choropleths when it comes to “cyber”, nor am I fond of area-skewed cartograms since most folks still cannot interpret them. Both of those take up a ton of screen real estate, too, especially if you have more than one of them. Yet, I wanted to show a map-like structure without resorting to Hilbert IPv4 heatmaps since they are not very readable by a general audience and become skewed when you have to move beyond 1 pixel == 1 Class C network block.

I think the tile grid is a great compromise since it avoids the “area” and projection skewness confusion that regular global choropleths cause while still preserving geographic & positional proximity. Sure, they’ll take some getting used to by casual readers, but I felt it was the best of all the tradeoffs.

The Setup

Here’s the data:


library(here)
library(hrbrthemes)
library(tidyverse)

wtg <- read_csv("https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")

glimpse(wtg)
 
## Observations: 192
## Variables: 11
## $ name             <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ alpha.2          <chr> "AF", "AL", "DZ", "AO", "AQ", "AG", "AR", "AM"...
## $ alpha.3          <chr> "AFG", "ALB", "DZA", "AGO", "ATA", "ATG", "ARG...
## $ country.code     <chr> "004", "008", "012", "024", "010", "028", "032...
## $ iso_3166.2       <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:...
## $ region           <chr> "Asia", "Europe", "Africa", "Africa", "Antarct...
## $ sub.region       <chr> "Southern Asia", "Southern Europe", "Northern ...
## $ region.code      <chr> "142", "150", "002", "002", NA, "019", "019", ...
## $ sub.region.code  <chr> "034", "039", "015", "017", NA, "029", "005", ...
## $ x                <int> 22, 15, 13, 13, 15, 7, 6, 20, 24, 15, 21, 4, 2...
## $ y                <int> 8, 9, 11, 17, 23, 4, 14, 6, 19, 6, 7, 2, 9, 8,...

routers <- read_csv(here::here("data", "routers.csv"))

routers
 
## # A tibble: 453,027 x 3
##    type     country_name           country_code
##    <chr>    <chr>                  <chr>       
##  1 mikrotik Slovak Republic        SK          
##  2 mikrotik Czechia                CZ          
##  3 mikrotik Colombia               CO          
##  4 mikrotik Bosnia and Herzegovina BA          
##  5 mikrotik Czechia                CZ          
##  6 mikrotik Brazil                 BR          
##  7 mikrotik Vietnam                VN          
##  8 mikrotik Brazil                 BR          
##  9 mikrotik India                  IN          
## 10 mikrotik Brazil                 BR          
## # ... with 453,017 more rows

distinct(routers, type) %>% 
  arrange(type) %>% 
  print(n=11)
 
## # A tibble: 11 x 1
##    type    
##    <chr>   
##  1 asus    
##  2 dlink   
##  3 huawei  
##  4 linksys 
##  5 mikrotik
##  6 netgear 
##  7 qnap    
##  8 tplink  
##  9 ubiquiti
## 10 upvel   
## 11 zte

So, we have 11 different device families under assault by “VPNFilter” and I wanted to show the global distribution of them. Knowing the compact world tile grid would facet well, I set off to make it happen.

Let’s get some decent names for facet labels:


real_names <- read_csv(here::here("data", "real_names.csv"))

real_names
 
## # A tibble: 11 x 2
##    type     lab             
##    <chr>    <chr>           
##  1 asus     Asus Device     
##  2 dlink    D-Link Devices  
##  3 huawei   Huawei Devices  
##  4 linksys  Linksys Devices 
##  5 mikrotik Mikrotik Devices
##  6 netgear  Netgear Devices 
##  7 qnap     QNAP Devices    
##  8 tplink   TP-Link Devices 
##  9 ubiquiti Ubiquiti Devices
## 10 upvel    Upvel Devices   
## 11 zte      ZTE Devices

Next, we need to summarise our scan results and pair them up with the world tile grid data and our real names:


count(routers, country_code, type) %>%  # summarise the data into # of device families per country
  left_join(wtg, by = c("country_code" = "alpha.2")) %>% # join them up on the common field
  filter(!is.na(alpha.3)) %>% # we only want countries on the grid and maxmind attributes some things to meta-regions and anonymous proxies
  left_join(real_names) -> wtg_routers

glimpse(wtg_routers)

## Observations: 629
## Variables: 14
## $ country_code     <chr> "AE", "AE", "AE", "AF", "AF", "AF", "AG", "AL"...
## $ type             <chr> "asus", "huawei", "mikrotik", "huawei", "mikro...
## $ n                <int> 1, 12, 70, 12, 264, 27, 1, 941, 2081, 7, 2, 1,...
## $ name             <chr> "United Arab Emirates", "United Arab Emirates"...
## $ alpha.3          <chr> "ARE", "ARE", "ARE", "AFG", "AFG", "AFG", "ATG...
## $ country.code     <chr> "784", "784", "784", "004", "004", "004", "028...
## $ iso_3166.2       <chr> "ISO 3166-2:AE", "ISO 3166-2:AE", "ISO 3166-2:...
## $ region           <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia"...
## $ sub.region       <chr> "Western Asia", "Western Asia", "Western Asia"...
## $ region.code      <chr> "142", "142", "142", "142", "142", "142", "019...
## $ sub.region.code  <chr> "145", "145", "145", "034", "034", "034", "029...
## $ x                <int> 20, 20, 20, 22, 22, 22, 7, 15, 15, 15, 20, 20,...
## $ y                <int> 10, 10, 10, 8, 8, 8, 4, 9, 9, 9, 6, 6, 6, 6, 1...
## $ lab              <chr> "Asus Device", "Huawei Devices", "Mikrotik Dev...

Then, plot it:


ggplot(wtg_routers, aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

Doh! We forgot to ensure we had data for every country. Let’s try that again:


count(routers, country_code, type) %>%
  complete(country_code, type) %>%
  filter(!is.na(country_code)) %>%
  left_join(wtg, c("country_code" = "alpha.2")) %>%
  filter(!is.na(alpha.3)) %>%
  left_join(real_names) %>%
  complete(country_code, type, x=unique(wtg$x), y=unique(wtg$y)) %>%
  filter(!is.na(lab)) %>%
  ggplot(aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

That’s better.

We take advantage of ggplot2’s ability to facet and just ensure we have complete (even if NA) tiles for each panel.
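
If the role of complete() in that pipeline isn't obvious, here's a tiny, self-contained illustration (toy data, not the scan results) of how it fills in the combinations we never observed:

# toy data: two countries, two device types, but only three observed combos
toy <- tibble(
  country_code = c("US", "US", "DE"),
  type         = c("asus", "dlink", "asus"),
  n            = c(10, 5, 3)
)

# complete() adds the missing DE/dlink row with an NA count, so every facet
# can draw a full grid of tiles (NA tiles pick up the na.value fill)
complete(toy, country_code, type)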

Once consumers start seeing these used more, they’ll be able to pick up key markers (or one of us will come up with a notation that makes key markers more visible) and get specific information from the chart. I just wanted to show regional and global differences between vendors (and really give MikroTik users a swift kick in the patootie for being so bad with their kit).

FIN

You can find the RStudio project (code + data) here: http://rud.is/dl/tile-grid-grid.zip


NOTE: There is some iframed content in this post and you can bust out of it if you want to see the document in a full browser window.

Also, apologies for some lingering GitHub links. I’m waiting for all the repos to import into other services and haven’t had time to set up my own self-hosted public instance of any community-usable git-ish environment yet.


And So It Begins

After seeing Fira Sans in action in presentations at eRum 2018 I felt compelled to add hrbrthemes support for it, so I made a firasans extension that uses the Fira Sans Condensed and Fira Code fonts for ggplot2 graphics.

But I really wanted to go the extra mile and make an R Markdown theme for it, yet I’m weary of both jQuery & Bootstrap, plus prefer Prism over HighlightJS. So I started work on “Prism Skeleton”, which is an R Markdown template that has most of the features you would expect and some new ones, plus uses Prism and Fira Sans/Code. You can try it out on your own if you use markdowntemplates but the “production” version is likely going to eventually go into the firasans package. (I use markdowntemplates as a playground for R Markdown experiments.)

The source for the iframe at the end of this document is here: https://rud.is/dl/hello-dorling.Rmd. There are some notable features (I’ll repeat a few from above):

  • Fira Sans for headers and text
  • Fira Code for all monospaced content (including source code)
  • No jQuery
  • No Bootstrap (it uses the ‘Skeleton’ CSS framework)
  • No HighlightJS (it uses the ‘Prism’ highlighter)
  • Extended YAML parameters (more on that in a bit)
  • Defaults to fig.retina=2 and the use of optipng or pngquant for PNG compression (so it expects them to be installed — ref this post by Zev Ross for more info and additional image use tips)
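
For reference, here's a minimal sketch of how those defaults could be wired up in a template's setup chunk (an illustration, not necessarily the template's exact code; it assumes optipng/pngquant are installed and on your PATH):

# retina-quality figures by default
knitr::opts_chunk$set(fig.retina = 2)

# register knitr's built-in PNG compression hooks; enable them per-chunk with
# the chunk options optipng='' (or e.g. optipng='-o7') and pngquant='--speed=1'
knitr::knit_hooks$set(
  optipng  = knitr::hook_optipng,
  pngquant = knitr::hook_pngquant
)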

“What’s this about ‘Dorling’?”

Oh, yes. You can read the iframe or busted out document for that bit. It’s a small package to make it easier to create Dorling cartograms based on previous work by @datagistips.

“You said something about ‘extended YAML’?”

Aye. Here’s the YAML excerpt from the Dorling Rmd:

---
title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
author: "boB Rudis"
navlink: "[rud.is](https://rud.is/b/)"
og:
  type: "article"
  title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
  url: "https://github.com/hrbrmstr/spdorling"
footer:
  - content: '[GitLab](https://gitlab.com/hrbrmstr)'
  - content: 'This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.'
date: "`r Sys.Date()`"
output: markdowntemplates::prismskel
---

The title, author & date should be familiar fields, but the author and date get some different placement since the goal is more of a flowing document than an academic report.

If navlink is present (it’s not required) there will be a static bar at the top of the HTML document with a link on the right (any content, really, but a link is what’s in the example). Remove navlink and no bar will be there.

The og section is for open graph tags and you customize them how you like. Open graph tags make it easier to share posts on social media or even Slack since they’ll auto-expand various content bits.

There’s also a custom footer (exclude it if you don’t want one) that can take multiple content sub-elements.

The goal isn’t so much to give you a 100% usable R Markdown template but something you can clone and customize for your own use. Since this example shows how to use custom fonts and a different code highlighter (which meant using some custom knitr hooks), it should be easier to customize than some of the other ones in the template playground package. FWIW I plan on adapting this for a work template this week.

The other big customization is the use of Prism with a dark theme. Again, you can clone + customize this at-will but I may add config options for all Prism themes at some point (mostly if there is interest).

FIN

(Well, almost fin)

Kick the tyres on both the new template and the new package and drop suggestions here for the time being (until I get fully transitioned to a new git-hosting platform). One TODO for spdorling is to increase the point count for the circle polygons but I’m sure folks can come up with enhancement requests to the API after y’all have played with it for a while.

As noted a few times, the Rmd example with the Dorling cartograms is below.

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a character value & unique, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:


library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:


get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:


get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (which is super expensive since it also means writing data out to a file each time), or by turning to Python, which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python — don’t judge — installs and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.
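
In practice that priming looks something like the following (the interpreter path is just an example for my setup; point it at whichever Python you installed biplist into):

Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3") # example path; use your own
invisible(reticulate::py_config()) # initialize the Python bindings up front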


biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/" 

That's much better.

Let's work with metadata for the whole directory:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents   
##    <chr>           <chr>                                <dbl> <list>     
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16 <raw [16]> 
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 <raw [110]>
##  3 1109.1968.pdf   com.apple.quarantine                    74 <raw [74]> 
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16 <raw [16]> 
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 <raw [116]>
##  6 1109.1968v1.pdf com.apple.quarantine                    74 <raw [74]> 
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86 <raw [86]> 
##  8 1804.09970.pdf  com.apple.quarantine                    82 <raw [82]> 
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16 <raw [16]> 
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 <raw [104]>
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##   <chr>                                <int>
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
  select(file, where_from) %>% 
  unnest() %>% 
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from                                                       
##    <chr>               <chr>                                                            
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
##  2 1109.1968.pdf       https://www.google.com/                                          
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
##  4 1109.1968v1.pdf     https://www.google.com/                                          
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
##  9 athena-ug.pdf       https://www.google.com/                                          
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
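
As a rough sketch of that idea (I'm assuming the setter/remover are named set_xattr() and rm_xattr() with path/name/value arguments, and the attribute key is made up; check the package docs for the exact signatures), tagging a data file might look like:

# assumed function names & argument order -- verify against the xattrs README;
# "is.rud.notes" is just an illustrative attribute key
set_xattr("analysis.csv", "is.rud.notes", "cleaned 2018-06-19; source: DPAA PDF")
get_xattr("analysis.csv", "is.rud.notes")
rm_xattr("analysis.csv", "is.rud.notes")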

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:


library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group      size month day   year_hr path   date       pkg   fil   file_src             
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""                   
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let's find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:


filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) {  pkg_| __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' 

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:


infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing 

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text      
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""        
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr     
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::        
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`     
##  7     5     1     5    49    51      0 expr           FALSE    ""        
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`    
##  9     5     1     5     6    18     51 expr           FALSE    ""        
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-        
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:


pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition

# find all of ^^ sequences (there's a good twitter discussion on this abt a month ago)
idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq

# look for the rest of the sequences starting at each idx position
map_lgl(idx, ~{
  all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat)
}) -> found

f_defs <- idx[found] # starting indices of all the places where functions are defined

# filter ^^ to only find infix ones
infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")]

# there aren't too many, but remember we're just searching `util` functions
length(infix_defs)
## [1] 106

Now, write it out to a file so we can peruse the infix functions:


# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")
walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name                 n
 1 `%||%`              47
 2 `%+%`                7
 3 `%AND%`              4
 4 `%notin%`            4
 5 `%:::%`              3
 6 `%==%`               3
 7 `%!=%`               2
 8 `%*diag%`            2
 9 `%diag*%`            2
10 `%nin%`              2
11 `%OR%`               2
12 `%::%`               1
13 `%??%`               1
14 `%.%`                1
15 `%@%`                1
16 `%&&%`               1
17 `%&%`                1
18 `%+&%`               1
19 `%++%`               1
20 `%+|%`               1
21 `%<<%`               1
22 `%>>%`               1
23 `%~~%`               1
24 `%assert_class%`     1
25 `%contains%`         1
26 `%din%`              1
27 `%fin%`              1
28 `%identical%`        1
29 `%In%`               1
30 `%inr%`              1
31 `%M%`                1
32 `%notchin%`          1
33 `%or%`               1
34 `%p%`                1
35 `%pin%`              1
36 `%R%`                1
37 `%s%`                1
38 `%sub_in%`           1
39 `%sub_nin%`          1
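
For the curious, here's a rough sketch (not necessarily the exact code used) of one way to produce that summary from the parse data above:

# tally how often each infix name is defined across the `util` sources
tibble(name = getParseText(infix_parsed, infix_parsed$id[infix_defs])) %>%
  count(name, sort = TRUE)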

FIN

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.

I’m going to (eventually) write a full post on the package I’m mentioning in this one: osqueryr. The TLDR on osqueryr is that it is an R DBI wrapper (that has just enough glue to also be plugged into dbplyr) for osquery. The TLDR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”

In short, osquery turns the metadata and state information of your local system (or remote system(s)) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident response tasks, but you can use it to get just about anything and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.

If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:


library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall 3          
## 2 firefox  517846

then, a remote "connection":


con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox 1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency" I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.

You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames" -- SQL -- as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language and avoiding the chaos of the myriad of outputs from various system commands by making all results conform to a rectangular data structure.

Let's take this one-step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).

The osquery authors also took consistency and standards to yet-another-level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made, so they can dispatch the intent to the proper functions or shell commands and then send all the results back, or use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc. to produce rich, targeted rectangular output.

By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite the authors could direct all their focus on solving the problem they posited.

"Um, you're talking alot about everything but R now."

We're getting to the good (i.e. "R") part now.

Because the authors didn't try to become SQL parser writer experts and relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).

Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines a set of classes and methods that define what operations are possible and how they are performed."

If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.

But it gets even better.

As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet-another-daemon-and-custom-protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.

Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package -- ssh::ssh_exec_internal() -- when remote connections were specified. (Said effort could have been zero had I chosen a slightly different function in sys, too.)
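
To make that concrete, here's a heavily simplified sketch of the idea. This is not the actual osqueryr internals; the helper function and the host-handling convention are made up purely for illustration:

# hypothetical helper: run a query locally or over ssh and return a data frame
osq_query <- function(sql, host = NULL) {
  if (is.null(host)) {
    res <- sys::exec_internal("osqueryi", c("--json", sql))   # local call
  } else {
    sess <- ssh::ssh_connect(host)                            # remote call
    on.exit(ssh::ssh_disconnect(sess), add = TRUE)
    res <- ssh::ssh_exec_internal(sess, paste("osqueryi --json", shQuote(sql)))
  }
  jsonlite::fromJSON(rawToChar(res$stdout))                   # same shape either way
}

Usage would then be identical locally (osq_query("SELECT name FROM processes LIMIT 1")) and remotely (osq_query("SELECT name FROM processes LIMIT 1", host = "hrbrmstr@osq1")), which is really the whole consistency point.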

Relying on well-accepted standards made both osqueryi and the R DBI-driver work seamlessly without much code at all and definitely without a rat's nest of nested if/else statements and custom httr functions.

But it gets even more better-er

Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)

A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse but one core element of the tidyverse is dplyr, which can really level-up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.

And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL back-end you can now do something like this:


library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by="pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == TCP
  distinct(name, port, address, pid)
## # Source:   lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port 
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0    
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0    
##  6 0.0.0.0 WiFiAgent         493   0    
##  7 0.0.0.0 WiFiProxy         725   0    
##  8 0.0.0.0 com.docker.vpnkit 732   0    
##  9 0.0.0.0 identityservicesd 354   0    
## 10 0.0.0.0 loginwindow       111   0    
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0)

FIN

What did standards and consistency get us?

  • The osquery authors spent time solving a hard problem vs creating new data formats and protocols
  • Rectangular data (i.e. "data frame") provides consistency and structure, which ends up creating more freedom
  • "Standard" SQL enables a consistent means to work with rectangular data
  • ssh normalizes (secure) access across systems with a consistent protocol
  • A robust, well-defined standard mechanism for working with SQL databases enabled nigh instantaneous wiring up of a whole new back-end to R
  • The common idioms of ssh and sys made working with the new back-end on remote systems as easy as it is on a local system
  • Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new back-end with (pretty much) one line of code because of the defined standard and expectation of consistency (and works for local and remote)

Standards and consistency are pretty darned cool.