Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (the kind of work we need more of in cyber-land), I gave in to the impulse to do some gg-doodling with the “Have I Been Pwned” JSON data he used.

It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data

dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>% 
  mutate(BreachDate = as.Date(BreachDate)) %>% 
  as_tibble() -> breaches # tbl_df() is deprecated in newer dplyr releases

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>% 
  top_n(1, wt=PwnCount) %>% 
  ungroup() %>% 
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>% 
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>% 
  mutate(nlab = sprintf("n=%s", n)) %>% 
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half), 
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0,  color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc, 
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n), 
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes 
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())

Continuing the blog’s UDF theme of late, there are two new UDF kids in town:

Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url() function already?” My answer is “Sure, but it’s based on java.net.URL which is fundamentally broken.”

Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

Rather than just use boring SQL for an example, we’ll start with some SQL and then use R for a decent example of working with the two new UDFs.

Counting Lying Lock Icons

SSL/TLS is all the rage these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).
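To make the "ports even when they’re not there" idea concrete, here is a minimal base R sketch of scheme-based port inference. The `infer_port()` helper is hypothetical and only illustrates the concept; it is not how the Drill UDF is implemented, and it ignores edge cases such as userinfo and IPv6 literals.

```r
# Hypothetical helper (not part of the Drill UDFs): return the explicit port
# in a URL, or fall back to the well-known default for the scheme.
infer_port <- function(url) {
  default_ports <- c(http = 80L, https = 443L)

  # capture the scheme and the authority (host plus optional ":port")
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]+)"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) return(NA_integer_) # not something we can parse

  scheme <- tolower(m[2])
  authority <- m[3]

  # an explicit port wins over the scheme default
  if (grepl(":[0-9]+$", authority)) {
    return(as.integer(sub("^.*:([0-9]+)$", "\\1", authority)))
  }

  unname(default_ports[scheme]) # NA for schemes not listed above
}

infer_port("http://www.eestikirik.ee/")
infer_port("https://example.com:8443/path")
```

Schemes outside the small lookup table come back as `NA`, which mirrors the idea of filtering out rows where no port could be determined.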

The aforementioned URL tells us that the most current GFG dataset URL can be retrieved by inspecting the contents of this metadata URL.

There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

Let’s get the data:

library(sergeant)
library(tidyverse)

read_delim(
  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", 
  delim = " ", 
  col_names = FALSE,
  col_types = "ccc"
) -> gfg_update

dl_path <- file.path("~/Data/gfg_links.tsv.gz")

if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It's a tab separated file (TSV) which — while not a great data format — is workable with Drill.

Now we'll set up a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column (columns[1]; Drill's columns array is zero-indexed) because a significant percentage of the URLs in column 6 are malformed:

db <- src_drill()

tbl(db, "(
SELECT 
  b.host,
  port,
  b.rec.hostname AS hostname,
  b.rec.assigned AS assigned,
  b.rec.tld AS tld,
  b.rec.subdomain AS subdomain
FROM
  (SELECT
    host, port, suffix_extract(host) AS rec             -- break the hostname into components
  FROM
    (SELECT
      a.rec.host AS host, a.rec.port AS port
    FROM
      (SELECT 
        columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components
      FROM dfs.d.`/gfg_links.tsv.gz`) a
    WHERE a.rec.port IS NOT NULL                        -- filter out URL parsing failures
    )
  ) b
WHERE b.rec.tld IS NOT NULL                             -- filter out domain parsing failures
)") -> gfg_df

gfg_df
## # Database: DrillConnection
##    hostname  port host              subdomain assigned      tld  
##                                    
##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## # ... with more rows

While we could have done it all in SQL, we saved some bits for R:

distinct(gfg_df, assigned, port) %>% 
  count(port) %>% 
  collect() -> port_counts

port_counts
# A tibble: 2 x 2
   port     n
*  
1    80 20648
2   443 22178
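The distinct-then-count step is easy to sanity check on a tiny in-memory data frame. This base R sketch uses made-up values (the `toy` data is purely illustrative) and shows that a site reachable on both ports is counted once under each port, which is also what the `distinct(assigned, port)` call above implies.

```r
# Toy stand-in for the parsed GFG data; the values are made up for illustration
toy <- data.frame(
  assigned = c("a.com", "a.com", "b.org", "b.org", "c.net"),
  port     = c(80L, 80L, 443L, 80L, 443L),
  stringsAsFactors = FALSE
)

# keep one row per (site, port) pair, then count sites per port
pairs <- unique(toy[, c("assigned", "port")])
toy_counts <- as.data.frame(table(port = pairs$port), responseName = "n")

toy_counts
```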

You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

FIN

Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.

I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

Kick the tyres & file issues & PRs as necessary.

There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion of the metadata — often just tweet timestamp, the tweet creator and the tweet text — leaving to the analyst the trudging work of re-extracting hashtags, mentions, URLs (etc).

Twitter provides a tweet-text processing library for many languages. One of these languages is Java. Since it makes sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a set of user-defined functions (UDFs) for tweet metadata extraction. Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text.

What’s Inside the Tin?

There are five UDF functions in the package:

  • tw_parse_tweet(string): Parses the tweet text and returns a map column with the following named values:
    • weightedLength: (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration file
    • permillage: (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.
    • isValid: (boolean) indicates if input text length corresponds to a valid result.
    • display_start / display_end: (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.
    • valid_start / valid_end: (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
  • tw_extract_hashtags(string): Extracts all hashtags in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_screennames(string): Extracts all screennames in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_urls(string): Extracts all URLs in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_reply_screenname(string): Extracts the reply screenname (if any) from the tweet text into a VARCHAR.
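The permillage value is easiest to read with a small worked calculation. The sketch below assumes a maximum weighted length of 280; that figure is an assumption on my part, though it is consistent with the sample output in the Usage section (a weighted length of 154 and a permillage of 550). The function names are hypothetical and this is not the library's own code.

```r
# Assumption: the maximum weighted tweet length is 280
permillage <- function(weighted_length, max_weighted_length = 280L) {
  # proportion of the maximum, expressed per thousand (integer division)
  (as.integer(weighted_length) * 1000L) %/% as.integer(max_weighted_length)
}

# Simplified validity check: non-empty and not over the maximum
# (the real library applies further rules, e.g. around invalid characters)
fits_limit <- function(weighted_length, max_weighted_length = 280L) {
  weighted_length > 0 && permillage(weighted_length, max_weighted_length) <= 1000L
}

permillage(154L)
fits_limit(154L)
fits_limit(300L)
```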

The repo has all the necessary bits and info to help you compile and load the necessary JARs, but if you’re in a hurry you can just copy all the files in the target directory to your local jars/3rdparty directory and restart Drill.

Usage

Here’s an example of how to call each UDF along with the output:

SELECT 
  tw_extract_screennames(tweetText) AS mentions,
  tw_extract_hashtags(tweetText) AS tags,
  tw_extract_urls(tweetText) AS urls,
  tw_extract_reply_screenname(tweetText) AS reply_to,
  tw_parse_tweet(tweetText) AS tweet_meta
FROM
  (SELECT 
     '@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
   FROM (VALUES((1))))

+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+
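For comparison, this is roughly what the "trudging work" looks like when done by hand in base R with naive regular expressions. It is a deliberately simplified sketch: the patterns are my own rough approximations and do not follow the full twitter-text rules, so expect differences on trickier input (punctuation, non-ASCII hashtags, URLs without a scheme).

```r
tweet <- "@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE"

# return every match of `pattern` in a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# naive patterns; strip the leading sigil to match the UDF output style
tags     <- sub("^#", "", extract_all(tweet, "#\\w+"))
mentions <- sub("^@", "", extract_all(tweet, "@\\w+"))
urls     <- extract_all(tweet, "https?://\\S+")

tags
mentions
urls
```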

FIN

Kick the tyres and file issues and PRs as needed.

A previous post explored how to deal with Amazon Athena queries asynchronously. The function presented is a beast, though it is on purpose (to provide options for folks).

In reality, nobody really wants to use rJava wrappers much anymore and dealing with icky Python library calls directly just feels wrong, plus Python functions often return truly daft/ugly data structures. R users deserve better than that.

R is not a first-class citizen in Amazon-land and while the cloudyr project does a fine job building native-R packages for various Amazon services, the fact remains that the official Amazon SDKs are in other languages. The reticulate package provides an elegant interface to Python so it seemed to make sense to go ahead and wrap the boto3 Athena client into something more R-like and toss in the collect_async() function for good measure.

Dependencies

I forced a dependency on Python 3.5 because friends don’t let friends rely on dated, fragmented ecosystems. Python versions can gracefully (mostly) coexist so there should be no pain/angst associated with keeping an updated Python 3 environment around. As noted in the package, I highly recommend adding RETICULATE_PYTHON=/usr/local/bin/python3 to your R environment (~/.Renviron is a good place to store it) since it will help reticulate find the proper target.
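If editing ~/.Renviron is not convenient, the same variable can be set for a single session. This is a minimal sketch; the path is the one suggested above and may well differ on your system, and the variable needs to be set before reticulate initializes Python for it to have any effect.

```r
# Session-only equivalent of adding this line to ~/.Renviron:
#   RETICULATE_PYTHON=/usr/local/bin/python3
Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3")

# confirm what R sees, and whether that interpreter actually exists locally
python_path <- Sys.getenv("RETICULATE_PYTHON")
python_path
file.exists(python_path)
```

Once reticulate is loaded, `reticulate::py_config()` reports which interpreter was actually picked up.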

If boto3 is not installed, you will need to do pip3 install boto3 to ensure you have the necessary Python module available and associated with your Python 3 installation.

It may seem obvious, but an Amazon AWS account is also required and you should be familiar with the Athena service and AWS services in general. Most of the roto.athena functions have a set of optional parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token
  • region_name
  • profile_name

Ideally, these should be set up in the proper configuration files and you should let boto3 handle the details of retrieving them. One parameter you will see used in many of my examples is profile_name = "personal". I have numerous AWS accounts and manage them via the profile ids. By ensuring the AWS configuration files are thoroughly populated, I avoid the need to load and pass around the various keys and/or tokens most AWS SDK API calls require. You can read more about profile management in the official docs: 1, 2.

Usage

The project README and package manual pages are populated and have a smattering of usage examples. It is likely you will really just want to execute a manually prepared SQL query and retrieve the results or do the dplyr dance and collect the results asynchronously. We’ll cover both of those use-cases now, starting with a manual SQL query.

If you have not deleted it, your Athena instance comes with a sampledb that contains an elb_logs table. We’ll use that for our example queries. First, let’s get the packages we’ll be using out of the way:

library(DBI) # for dplyr access later
library(odbc) # for dplyr access later
library(roto.athena) # hrbrmstr/roto.athena on gh or gl
library(tidyverse) # b/c it rocks

Now, we’ll prepare and execute the query. This is a super-simple one:

query <- "SELECT COUNT(requestip) AS ct FROM elb_logs"

start_query_execution(
  query = query,
  database = "sampledb",
  output_location = "s3://aws-athena-query-results-redacted",
  profile_name = "personal"
) -> qex_id

The qex_id contains the query execution id. We can pass that along to get information on the status of the query:

get_query_execution(qex_id, profile_name = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "7f8d8bd6-9fe6-4a26-a021-ee10470c1048"
## $ query                "SELECT COUNT(requestip) AS ct FROM elb_logs"
## $ output_location      "s3://aws-athena-query-results-redacted/7f...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:06.468000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA

If the state is not SUCCEEDED then you’ll need to be patient before trying to retrieve the results.
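Being patient can be automated with a small polling loop. The sketch below is a generic helper of my own (`wait_for_state()` is hypothetical and not part of roto.athena); it takes any function that returns the current state and stops once Athena's terminal states (SUCCEEDED, FAILED or CANCELLED) show up. A stand-in status function is used so the example runs without an AWS account.

```r
# Hypothetical helper: poll `get_state()` until a terminal state is reached
wait_for_state <- function(get_state, interval = 2, max_tries = 30) {
  terminal <- c("SUCCEEDED", "FAILED", "CANCELLED")
  for (attempt in seq_len(max_tries)) {
    state <- get_state()
    if (state %in% terminal) return(state)
    Sys.sleep(interval) # back off before asking again
  }
  stop("Query did not reach a terminal state in time.", call. = FALSE)
}

# Stand-in for a real status call: RUNNING twice, then SUCCEEDED
fake_states <- c("RUNNING", "RUNNING", "SUCCEEDED")
calls <- 0
fake_get_state <- function() {
  calls <<- calls + 1
  fake_states[calls]
}

final_state <- wait_for_state(fake_get_state, interval = 0)
final_state
```

With the real package, the status function could presumably be something like `function() get_query_execution(qex_id, profile_name = "personal")$state`, given the `state` column in the output above.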

get_query_results(qex_id, profile_name = "personal")
## # A tibble: 1 x 1
##   ct             
##   
## 1 4229

Now, we’ll use dplyr via the Athena ODBC driver:

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AwsProfile = "personal",
  AuthenticationType = "IAM Profile",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

elb_logs <- tbl(con, "elb_logs")

I've got the ODBC DBI fragment in a parameterized RStudio snippet; others may find that a time-saver if they're not doing it already.

Now to build and submit the query:

mutate(elb_logs, tsday = substr(timestamp, 1, 10)) %>%
  filter(tsday == "2014-09-29") %>%
  select(requestip, requestprocessingtime) %>%
  collect_async(
    database = "sampledb",
    output_location = "s3://aws-athena-query-results-redacted",
    profile_name = "personal"
  ) -> qex_id

As noted in the previous blog post, collect_async() turns the dplyr chain into a SQL query, then fires off the whole thing to start_query_execution() for you and returns the query execution id:

get_query_execution(qex_id, profile_name = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "95bd158b-7790-42ba-aa83-e7436c3470fe"
## $ query                "SELECT \"requestip\", \"requestprocessing...
## $ output_location      "s3://aws-athena-query-results-redacted/95...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:12.817000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA

Again, you'll need to be patient and wait for the state to be SUCCEEDED to retrieve the results.

get_query_results(qex_id, profile_name = "personal")
## # A tibble: 774 x 2
##    requestip       requestprocessingtime
##                               
##  1 255.48.150.122              0.0000900
##  2 249.213.227.93              0.0000970
##  3 245.108.120.229             0.0000870
##  4 241.112.203.216             0.0000940
##  5 241.43.107.223              0.0000760
##  6 249.117.98.137              0.0000830
##  7 250.134.112.194             0.0000630
##  8 250.200.171.222             0.0000540
##  9 248.193.76.218              0.0000820
## 10 250.57.61.131               0.0000870
## # ... with 764 more rows

You can also use the query execution id to sync the resultant CSV from S3. Which one is more performant is definitely something you'll need to test since it varies with AWS region, result set size, your network connection and other environmental factors. One benefit of using get_query_results() is that it uses the column types to set the data frame column types appropriately (I still need to set up a full test of all possible types so not all are handled yet).
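For the S3 route, the only thing needed is the object location, which Athena derives from the output location and the query execution id. A minimal sketch, assuming the default layout where results land directly under the output location as `<id>.csv` (the helper name is hypothetical):

```r
# Hypothetical helper: build the S3 URI of an Athena result CSV
athena_result_uri <- function(output_location, qex_id) {
  # drop any trailing slash so we do not end up with "//" in the URI
  sprintf("%s/%s.csv", sub("/+$", "", output_location), qex_id)
}

uri <- athena_result_uri(
  "s3://aws-athena-query-results-redacted",
  "95bd158b-7790-42ba-aa83-e7436c3470fe"
)

# command line one could run (or pass to system()) to fetch the file
cmd <- sprintf("aws s3 cp %s .", shQuote(uri))

uri
cmd
```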

Kick the tyres

The package is up on both GitLab and GitHub and any and all feedback (i.e. Issues) or tweaks (i.e. PRs) are most welcome.

I’ve blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect() and then wait.

And, wait.

And, wait.

And, wait.

Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means “hung” R sessions and severe frustration, especially when you can log in to the AWS Athena console and see that the results are right there!!

I’ve been crafting SQL by hand or using sql_render() by hand to avoid this (when I remember to) but finally felt sufficient frustration to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that still is not an easy given on many systems unfortunately).

There are two functions below:

  • collect_async(), and
  • gather_results()

The collect_async() function is designed to be used like collect() but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line vs use the aws.s3 package but that’s not everyone’s cup of tea.

Once I figure out the best package API for this I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) before waiting for perfection.

Here’s the code:

library(rJava)
library(awsjavasdk)
library(aws.signature)
library(aws.s3)
library(odbc)
library(tidyverse)
library(dbplyr)

#' Collect Amazon Athena query results asynchronously
#' 
#' Long running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has 
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena results are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#' 
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the 
#'        Simba ODBC connection)
#' @param region Your AWS region. All lower case with dashes (usually matches
#'        the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored 
#'        (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'        connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'         is designed to be passed to the companion `gather_results()` if you
#'         want to use the `aws.s3` package to retrieve the results. Otherwise,
#'         sync the file however you want using the query execution id.
#' @note You may need to change up the authentication provider depending on how 
#'       you use credentials with Athena
collect_async <- function(obj, schema, region, results_bucket) {

  ugly_query <- as.character(sql_render(obj))

  region <- toupper(region)
  region <- gsub("-", "_", region, fixed=TRUE)

  regions <- J("com.amazonaws.regions.Regions")

  available_regions <- grep("^[[:upper:][:digit:]_]+$", names(regions), value=TRUE)
  if (!region %in% available_regions) stop("Invalid region.", call.=FALSE)

  switch(
    region,
    "GovCloud" = regions$GovCloud,
    "US_EAST_1" = regions$US_EAST_1,
    "US_EAST_2" = regions$US_EAST_2,
    "US_WEST_1" = regions$US_WEST_1,
    "US_WEST_2" = regions$US_WEST_2,
    "EU_WEST_1" = regions$EU_WEST_1,
    "EU_WEST_2" = regions$EU_WEST_2,
    "EU_WEST_3" = regions$EU_WEST_3,
    "EU_CENTRAL_1" = regions$EU_CENTRAL_1,
    "AP_SOUTH_1" = regions$AP_SOUTH_1,
    "AP_SOUTHEAST_1" = regions$AP_SOUTHEAST_1,
    "AP_SOUTHEAST_2" = regions$AP_SOUTHEAST_2,
    "AP_NORTHEAST_1" = regions$AP_NORTHEAST_1,
    "AP_NORTHEAST_2" = regions$AP_NORTHEAST_2,
    "SA_EAST_1" = regions$SA_EAST_1,
    "CN_NORTH_1" = regions$CN_NORTH_1,
    "CN_NORTHWEST_1" = regions$CN_NORTHWEST_1,
    "CA_CENTRAL_1" = regions$CA_CENTRAL_1,
    "DEFAULT_REGION" = regions$DEFAULT_REGION
  ) -> region

  provider <- J("com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
  client <- J("com.amazonaws.services.athena.AmazonAthenaAsyncClientBuilder")

  my_client <- client$standard()
  my_client <- my_client$withRegion(region)
  my_client <- my_client$withCredentials(provider$getInstance())
  my_client <- my_client$build()

  queryExecutionContext <- .jnew("com.amazonaws.services.athena.model.QueryExecutionContext")
  context <- queryExecutionContext$withDatabase(schema)
  result <- .jnew("com.amazonaws.services.athena.model.ResultConfiguration")
  result$setOutputLocation(results_bucket)

  startQueryExecutionRequest <- .jnew("com.amazonaws.services.athena.model.StartQueryExecutionRequest")
  startQueryExecutionRequest$setQueryString(ugly_query)
  startQueryExecutionRequest$setQueryExecutionContext(context)
  startQueryExecutionRequest$setResultConfiguration(result)

  res <- my_client$startQueryExecutionAsync(startQueryExecutionRequest)

  r <- res$get()
  qex_id <- r$getQueryExecutionId()

  list(
    qex_id = qex_id,
    results_bucket = results_bucket
  )

}

#' Gather the results of an asynchronous query
#'
#' @md
#' @param async_result the result of a call to `collect_async()`
#' @return a data frame (tibble) or `NULL` if the query results are not ready yet
gather_results <- function(async_result) {
  if (bucket_exists(sprintf("%s/%s", async_result$results_bucket, async_result$qex_id))) {
    readr::read_csv(
      get_object(sprintf("%s/%s.csv", async_result$results_bucket, async_result$qex_id))
    )
  } else {
    message("Results are not in the designated bucket.")
    return(NULL)
  }
}
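The region handling at the top of collect_async() can be tried in isolation. This small sketch (the `normalize_region()` name is mine) reproduces just the string transformation that turns an AWS region id into the name of the matching Java enum constant. Note that because the input is upper-cased first, the mixed-case "GovCloud" label in the switch() above can never match as written.

```r
# Same transformation collect_async() applies before the lookup
normalize_region <- function(region) {
  gsub("-", "_", toupper(region), fixed = TRUE)
}

normalize_region("us-east-1")
normalize_region("ap-southeast-2")
normalize_region("GovCloud") # upper-cased, so it no longer equals "GovCloud"
```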

Now, we give it a go:

# Setup the credentials you're using
use_credentials("personal")

# load the AWS Java SDK classes
awsjavasdk::load_sdk()

# necessary for Simba ODBC and the async query ops
aws_region <- "us-east-1"
athena_schema <- "sampledb"
athena_results_bucket <- "s3://aws-athena-query-results-redacted"

# connect to Athena and the sample database
DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = athena_schema,
  AwsRegion = aws_region,
  AuthenticationType = "IAM Profile",
  AwsProfile = "personal",
  S3OutputLocation = athena_results_bucket
) -> con

# the sample table in the sample db/schema
elb_logs <- tbl(con, "elb_logs")

# create your dplyr chain. This one is small so I don't incur charges
# collect_async() MUST be the LAST item in the dplyr chain.
elb_logs %>%
  filter(requestip == "253.89.30.138") %>%
  collect_async(
    schema = athena_schema,
    region = aws_region,
    results_bucket = athena_results_bucket
  ) -> async_result

async_result
## $qex_id
## [1] "d5fe7754-919b-47c5-bd7d-3ccdb1a3a414"
## 
## $results_bucket
## [1] "s3://aws-athena-query-results-redacted"

# For long queries we can wait a bit but the function will tell us if the results
# are there or not.

gather_results(async_result)
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   elbname = col_character(),
##   requestip = col_character(),
##   requestport = col_integer(),
##   backendip = col_character(),
##   backendport = col_integer(),
##   requestprocessingtime = col_double(),
##   backendprocessingtime = col_double(),
##   clientresponsetime = col_double(),
##   elbresponsecode = col_integer(),
##   backendresponsecode = col_integer(),
##   receivedbytes = col_integer(),
##   sentbytes = col_integer(),
##   requestverb = col_character(),
##   url = col_character(),
##   protocol = col_character()
## )
## # A tibble: 1 x 16
##   timestamp           elbname requestip     requestport backendip     backendport
##                                                   
## 1 2014-09-29 03:24:38 lb-demo 253.89.30.138       20159 253.89.30.138        8888
## # ... with 10 more variables: requestprocessingtime , backendprocessingtime ,
## #   clientresponsetime , elbresponsecode , backendresponsecode ,
## #   receivedbytes , sentbytes , requestverb , url , protocol 

If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.

Both my osqueryr and mactheknife packages have had a few updates and I wanted to put together a fun example (it being Friday, and all) for what you can do with them. All my packages are now on GitHub and GitLab and I’ll be maintaining them on both so I can accommodate the comfort-level of any and all contributors but will be prioritizing issues and PRs on GitLab ahead of any other platform. Having said that, I’ll mark non-CRAN packages with a # notcran comment in the source views so you know you need to install it from wherever you like to grab sketch packages from.

One table that osquery makes available under macOS is an inventory of all “apps” that macOS knows about. Previous posts have shown how to access these tables via the dplyr interface I built for osquery, but they involved multiple steps and as I started to use it more regularly (especially to explore the macOS 10.14 beta I’m running) I noticed that it could use some helper functions. One in particular — osq_expose_tables() — is pretty helpful in that it handles all the dplyr boilerplate code and makes table(s) available in the global environment by name. It takes a single table name or regular expression and then exposes all matching entities. While the function has a help page, it’s easier just to see it in action. Let’s expose the apps table:

library(osqueryr) # notcran
library(tidyverse)

osq_expose_tables("apps")

apps
## # Source:   table [?? x 19]
## # Database: OsqueryConnection
##    applescript_enab… bundle_executable    bundle_identifier   bundle_name  bundle_package_…
##                                                                   
##  1 0                 1Password 6          com.agilebits.onep… 1Password 6  APPL            
##  2 0                 2BUA8C4S2C.com.agil… 2BUA8C4S2C.com.agi… 1Password m… APPL            
##  3 1                 Adium                com.adiumX.adiumX   Adium        APPL            
##  4 1                 Adobe Connect        com.adobe.adobecon… Adobe Conne… APPL            
##  5 1                 Adobe Illustrator    com.adobe.illustra… Illustrator… APPL            
##  6 ""                AIGPUSniffer         com.adobe.AIGPUSni… AIGPUSniffer APPL            
##  7 ""                CEPHtmlEngine Helper com.adobe.cep.CEPH… CEPHtmlEngi… APPL            
##  8 ""                CEPHtmlEngine        com.adobe.cep.CEPH… CEPHtmlEngi… APPL            
##  9 ""                LogTransport2        com.adobe.headligh… LogTranspor… APPL            
## 10 ""                droplet              ""                  Analyze Doc… APPL            
## # ... with more rows, and 14 more variables: bundle_short_version ,
## #   bundle_version , category , compiler , copyright ,
## #   development_region , display_name , element , environment ,
## #   info_string , last_opened_time , minimum_system_version , name ,
## #   path 

There’s tons of info on all the apps macOS knows about, some of which are system services and “helper” apps (like Chrome’s auto-updater). One field — last_opened_time — caught my eye and I thought it would be handy to see which apps had little use (i.e. ones that haven’t been opened in a while) and which apps I might use more frequently (i.e. ones with more recent “open” times). That last_opened_time is a fractional POSIX timestamp and, due to the way osquery created the schemas, it’s in a character field. That’s easy enough to convert and then arrange() the whole list in descending order to let you see what you use most frequently.
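The conversion itself is a one-liner in base R. This sketch uses two values taken from the osquery output shown further down and converts them in UTC; treating non-positive values as "no recorded open" is my own assumption about what the -1.0 entries mean.

```r
# character timestamps as they come out of the osquery `apps` table
last_opened <- c("1529958322.11297", "-1.0")

# character -> numeric seconds since the epoch -> POSIXct -> Date
secs <- as.numeric(last_opened)
opened_at <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
opened_day <- as.Date(opened_at, tz = "UTC")

# Assumption: non-positive values mean the app has no recorded open time
opened_day[secs <= 0] <- NA

opened_day
```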

But, this is R and we can do better than a simple table or even a DT::datatable().

I recently added the ability to read macOS property lists (a.k.a. “plists”) to mactheknife by wrapping a Python module (plistlib). Since all (OK, “most”) macOS apps have an icon, I thought it would be fun to visualize the last opened frequency for each app using the app icons and ggplot2. Unfortunately, ImageMagick (and, thus, the magick package) cannot read macOS icns files, so you’ll need to do a brew install libicns before working with any of the remaining code since we’ll be relying on a command-line utility from that formula.

Let’s get the frontmatter out of the way:

library(sys)
library(magick)
library(osqueryr) # notcran
library(mactheknife) # notcran
library(ggimage)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

osq_expose_tables("apps")

# macOS will use a generic app icon when none is present in an app bundle; this is the location and we'll
# need to use it when our plist app spelunking comes up short

default_app <- "/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/GenericApplicationIcon.icns"

Next, we'll:

  • collect the apps table locally
  • filter out system-ish things (which we really don't care about for this post)
  • convert the last used time to something useful (and reduce it to a day resolution)
  • try to locate the property list for the app and read the path to the app icon file, substituting the generic one if not found (or other errors pop up):

select(apps, name, path, last_opened_time) %>%
  collect() %>%
  filter(!str_detect(path, "(^/System|usr|//System|/Library/|Helper|/Contents/|\\.service$)")) %>%
  mutate(lop_day = as.Date(anytime::anytime(as.numeric(last_opened_time)))) %>%
  mutate(icon = map_chr(path, ~{
    p <- read_plist(file.path(.x, "Contents", "Info.plist"))
    icns <- p$CFBundleIconFile[1]
    if (is.null(icns)) return(default_app)
    if (!str_detect(icns, "\\.icns$")) icns <- sprintf("%s.icns", icns)
    file.path(.x, "Contents", "Resources", icns)
  })) -> apps_df

apps_df
## # A tibble: 274 x 5
##    last_opened_time name                       path                      lop_day    icon                       
##    <chr>            <chr>                      <chr>                     <date>     <chr>
##  1 1529958322.11297 1Password 6.app            /Applications/1Password … 2018-06-25 /Applications/1Password 6.…
##  2 1523889402.80918 Adium.app                  /Applications/Adium.app   2018-04-16 /Applications/Adium.app/Co…
##  3 1516307513.7606  Adobe Connect.app          /Applications/Adobe Conn… 2018-01-18 /Applications/Adobe Connec…
##  4 1530044681.76677 Adobe Illustrator.app      /Applications/Adobe Illu… 2018-06-26 /Applications/Adobe Illust…
##  5 -1.0             Analyze Documents.app      /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  6 -1.0             Make Calendar.app          /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  7 -1.0             Contact Sheets.app         /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  8 -1.0             Export Flash Animation.app /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
##  9 -1.0             Web Gallery.app            /Applications/Adobe Illu… 1969-12-31 /Applications/Adobe Illust…
## 10 -1.0             Adobe InDesign CC 2018.app /Applications/Adobe InDe… 1969-12-31 /Applications/Adobe InDesi…
## # ... with 264 more rows

Since I really didn't feel like creating a package wrapper for libicns, we're going to use the sys package to make system calls to convert the icns files to png files. We really don't want to do this repeatedly for the same files if we ever run this again, so we'll setup a cache directory to hold our converted pngs.

Apps can (and, usually do) have multiple icons with varying sizes and are not guaranteed to have every common size available. So, we'll have the libicns icns2png utility extract all the icons and use the highest resolution one, using magick to reduce it to a 32x32 png bitmap.

# setup the cache dir -- use whatever you want
cache_dir <- path.expand("~/.r-icns-cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir) # no warning on re-runs

# create a unique name hash for more compact names
mutate(apps_df, icns_png = map_chr(icon, ~{
  hash <- digest::digest(.x, serialize=FALSE)
  file.path(cache_dir, sprintf("%s.png", hash))
})) -> apps_df

# find the icns2png program
icns2png <- unname(Sys.which("icns2png"))

# go through each icon file 
pb <- progress_estimated(length(apps_df$icns_png))
walk2(apps_df$icon, apps_df$icns_png, ~{

  pb$tick()$print() # progress!

  if (!file.exists(.y)) { # don't create it if it already exists

    td <- tempdir()

    # no icon file == use default one
    if (!file.exists(.x)) .x <- default_app

    # convert all of them to pngs
    sys::exec_internal(
      cmd = icns2png,
      args = c("-x", "-o", td, .x),
      error = FALSE
    ) -> res

    rawToChar(res$stdout) %>% # go through icns2png output
      str_split("\n") %>%
      flatten_chr() %>%
      keep(str_detect, "  Saved") %>% # find all the extracted icons
      last() %>% # use the last one
      str_replace(".* to /", "/") %>% # clean up the filename so we can read it in
      str_replace("\\.$", "") -> png

    # read and convert
    image_read(png) %>%
      image_resize(geometry_area(32, 32)) %>%
      image_write(.y)

  }

})

You can open up that cache directory with the macOS Finder to find all the extracted/converted pngs.
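If you want to reuse the caching idea elsewhere, it boils down to two things: derive a stable file name from the source path, and skip the work when that file already exists. Below is a rough sketch of just that part. The cache_path() and convert_cached() helpers, the fake converter and the example path are all hypothetical stand-ins of mine (the converter takes the place of the icns2png + magick steps above):

```r
library(digest)

# hypothetical helper: map a source path to a stable, compact cache file name
cache_path <- function(src, cache_dir) {
  file.path(cache_dir, sprintf("%s.png", digest(src, algo = "md5", serialize = FALSE)))
}

# hypothetical helper: call `convert(src, dest)` only when dest is not cached yet
convert_cached <- function(src, cache_dir, convert) {
  dest <- cache_path(src, cache_dir)
  if (!file.exists(dest)) convert(src, dest)
  dest
}

# throwaway cache directory for the demo
demo_cache <- tempfile("icns-cache-demo-")
dir.create(demo_cache)

calls <- 0
fake_convert <- function(src, dest) { # stand-in for the real conversion
  calls <<- calls + 1
  writeLines(src, dest)
}

src <- "/Applications/Example.app/Contents/Resources/Example.icns" # made-up path
first  <- convert_cached(src, demo_cache, fake_convert)
second <- convert_cached(src, demo_cache, fake_convert) # cache hit, no conversion

calls # the converter ran once for two requests
```

Hashing the full path (rather than using the icon's base name) keeps two apps that both ship an "AppIcon.icns" from clobbering each other in the cache.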

Now, we're on the final leg of our app-use visualization journey.

Some system/utility apps have start-of-epoch dates due to the way the macOS installer tags them. We only want "recent" ones so I set an arbitrary cutoff date of the year 2000. Since many apps would have the same last opened date, I wanted to get a spread out layout "for free". One way to do that is to use ggbeeswarm::position_quasirandom():

filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  ggplot() +
  geom_image(
    aes(x="", lop_day, image = icns_png), size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ), 
    aes(x, y, label=label), family = font_an, size = 5 , hjust = 0,
    color = "lightslategray"
  ) +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank())

There are tons of other ways to look at this data and you can use the osquery daemon to log this data regularly so you can get an extra level of detail. An interesting offshoot project would be to grab the latest RStudio dailies and see if you can wrangle a sweet D3 visualization from the app data we collected. Make sure to drop a link to your creations in the comments. You can find the full code in this snippet.

UPDATE (2018-07-07)

A commenter really wanted tooltips with app names. So did I, but neither plotly nor ggiraph supports ggimage so we can't get tooltips for free.

However, if you're willing to use the latest RStudio Preview or Daily editions, then we can "easily" use the new built-in D3 support to get some sketch tooltips.

First, we need to change up the plotting code a bit so we can get some base data to feed to D3:

filter(apps_df, lop_day > as.Date("2000-01-01")) %>%
  mutate(name = sub("\\.app$", "", name)) %>% # strip only a trailing ".app"
  ggplot() +
  geom_image(
    aes(x="", lop_day, image = icns_png, name=name), size = 0.033,
    position = position_quasirandom(width = 0.5)
  ) +
  geom_text(
    data = data_frame(
      x = c(0.6, 0.6),
      y = as.Date(c("2018-05-01", "2017-09-15")),
      label = c("More recently used ↑", "Not used in a while ↓")
    ), 
    aes(x, y, label=label), family = font_an, size = 5 , hjust = 0,
    color = "lightslategray"
  ) +
  labs(
    x = NULL, y = NULL,
    title = "macOS 'Last Used' App History"
  ) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x = element_blank()) -> gg

gb <- ggplot_build(gg) # compute the layout

idf <- tbl_df(gb$data[[1]]) # extract the data
idf <- mutate(idf, image = sprintf("lib/imgs-1.0.0/%s", basename(image))) # munge the png paths so D3 can find them

write_rds(idf, "~/Data/apps.rds") # save off the data
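If ggplot_build() is new to you, here is a small sketch of what it hands back. It uses the built-in mtcars data and a jittered point layer rather than the app icons, so the plot and object names are stand-ins, but the idea is the same: the first element of the data slot holds the computed positions for the first layer.

```r
library(ggplot2)

# a discrete x axis with a position adjustment, like the icon plot above
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

built <- ggplot_build(p)   # runs the stats, scales and position adjustments
layer1 <- built$data[[1]]  # one data frame per layer, in layer order

# x is now numeric: the discrete level's integer position plus the jitter offset
head(layer1[, c("x", "y")])
```

Those numeric x/y pairs are what get fed to the D3 scales below.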

Now, we just need some D3 javascript glue:

// !preview r2d3 data=data.frame(readRDS("~/Data/apps.rds")), d3_version = 4, dependencies = htmltools::htmlDependency(name = "imgs", version = "1.0.0", src = "~/.r-icns-cache", all_files = TRUE)

var margin = {top: 16, right: 32, bottom: 16, left: 32},
    width = width - margin.left - margin.right,
    height = height - margin.top - margin.bottom;

var x = d3.scaleLinear().range([0, width]);
var y = d3.scaleLinear().range([height, 0]);

x.domain([
  d3.min(data, function(d) { return d.x; }) - 0.05,
  d3.max(data, function(d) { return d.x; }) + 0.05
]);

y.domain([
  d3.min(data, function(d) { return d.y; }) - 16,
  d3.max(data, function(d) { return d.y; }) + 16
]);

var tooltip = d3.select("body")
    .append("div")
    .style("position", "absolute")
    .style("z-index", "10")
    .style("visibility", "hidden")
    .style("color", "blue")
    .style("background", "white")
    .style("padding", "5px")
    .style("font-family", "sans-serif")
    .text("");

svg.attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform",
          "translate(" + margin.left + "," + margin.top + ")");

var images = svg.selectAll("appimg")
      .data(data)
    .enter().append("svg:image")
      .attr("xlink:href",  function(d) { return d.image;})
      .attr("x", function(d) { return x(d.x);})
      .attr("y", function(d) { return y(d.y);})
      .attr("height", 32)
      .attr("width", 32)
      .on("mouseover", function(d) { return tooltip.style("visibility", "visible").text(d.name); })
      .on("mousemove", function() { return tooltip.style("top", (d3.event.pageY-10)+"px").style("left", (d3.event.pageX+10)+"px"); })
      .on("mouseout", function() { return tooltip.style("visibility", "hidden"); });

If you don't want to live dangerously, you can also save that script off and just use r2d3 directly:

r2d3::r2d3(
  data = data.frame(readRDS("~/Data/apps.rds")), 
  script = "~/Desktop/app-d3.js",
  d3_version = 4, 
  dependencies = htmltools::htmlDependency(
    name = "imgs", 
    version = "1.0.0", 
    src = "~/.r-icns-cache", 
    all_files = TRUE
  )
)

Either way gives us interactive tooltips:

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form. As you can see if you read through the Twitter thread, both Tabulizer and other tools were introducing enough errors that the extracted data was neither complete nor trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

The PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:

We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)

# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)

# read it in
doc <- pdf_text(mia_fil) 

Let's look at those three example pages:

cat(doc[[1]])
##                                   Defense POW/MIA Accounting Agency
##                                       Personnel Missing - Korea (PMKOR)
##                                        (Reported for ALL Unaccounted For)
##                                                                                                Total Unaccounted: 7,699
## Name                       Rank/Rate     Branch                           Year State/Territory
## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT
## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA
## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA

cat(doc[[2]])
## Name                          Rank/Rate Branch                     Year State/Territory
## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA
## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII
## ...
## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINA

cat(doc[[length(doc)]])
## Name                                                Rank/Rate           Branch                                              Year         State/Territory
## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA
## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO
## ...
## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA
## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS
##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
##                Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

  • there's a uniform header on each page
  • the "layout" mode returned uniform per-page, fixed-width columns
  • there's no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

  • convert the text to lines
  • find the header line
  • find the column start/end positions from the header on the page (since they are different for each page)
  • read it in with readr::read_fwf()
  • remove headers, preamble and epilogue cruft
  • turn it all into one data frame

# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{
  
  stri_split_lines(.x) %>% 
    flatten_chr() -> lines # want the lines from each page
  
  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>% 
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% 
    map(`[`, 1) %>% 
    flatten_int() -> starts
  
  # now get the ending locations; cheating and using `NA` for the last column  
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates 
  # a surprisingly usable fixed-width table, the core idiom is to find the start/end
  # of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read 
    read_fwf_q(col_positions = cols) %>%  # read it!
    .$result %>%                          # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%              # non-data lines
    filter(name != "Name") %>%            # remove headers from each page
    filter(!stri_detect_regex(name, "^(This|Please|Report)")) # non-data lines (the last pg footer, rly)
  
}) -> xdf

xdf
## # A tibble: 7,699 x 5
##    name                       rank   branch                  year  state        
##    <chr>                      <chr>  <chr>                   <chr> <chr>
##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      
##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA 
##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  
##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   
##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      
##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS
##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  
##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     
##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     
## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE
## # ... with 7,689 more rows
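If the header-as-canary trick feels like magic, here is a minimal base R sketch of just that step. It swaps readr::read_fwf() for plain regexpr() and substring(), and rebuilds two records from the page-two sample with sprintf() so the column widths are explicit. The fmt string and the object names are mine, not part of the code above.

```r
# rebuild a header and two records with page-two-style fixed widths
fmt <- "%-30s%-10s%-27s%-5s%s"
header <- sprintf(fmt, "Name", "Rank/Rate", "Branch", "Year", "State/Territory")
rows <- c(
  sprintf(fmt, "AKERS, RICHARD ALLEN", "1st Lt", "UNITED STATES ARMY", "1951", "PENNSYLVANIA"),
  sprintf(fmt, "AKI, CLARENCE HALONA", "Sgt", "UNITED STATES ARMY", "1950", "HAWAII")
)

# where does each column label start in the header?
keys <- c("Name", "Rank", "Branch", "Year", "State")
starts <- vapply(keys, function(k) as.integer(regexpr(k, header, fixed = TRUE)),
                 integer(1), USE.NAMES = FALSE)

# each column ends right before the next one starts; the last runs to end of line
ends <- c(starts[-1] - 1L, max(nchar(rows)))

# slice every row at those positions and trim the padding
cols <- lapply(seq_along(starts), function(i) trimws(substring(rows, starts[i], ends[i])))
parsed <- as.data.frame(
  setNames(cols, c("name", "rank", "branch", "year", "state")),
  stringsAsFactors = FALSE
)

parsed
```

Because the positions are recomputed from each page's own header, it doesn't matter that poppler pads every page differently.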

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <-  "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"

mutate(xdf, year = factor(year)) %>% 
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% 
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width=0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name=NULL) +
  labs(x = NULL, title = title, subtitle = subtitle, caption = caption) +
  theme_ipsum_rc(grid="Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.