The wayback package has been updated to retrieve mementos more efficiently and adds support for working with the Internet Archive’s advanced search+scrape API.

Search/Scrape

The search/scrape interface lets you examine the IA collections and download what you are after (programmatically). The main function is ia_scrape() but you can also paginate through results with the helper functions provided.

To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First, we need to search the collection then choose a target URL to retrieve and finally download it. The identifier is the key element to ensure we can retrieve the information about a particular collection.

library(wayback)

nasa <- ia_scrape("collection:nasa", count=100L)

tibble:::print.tbl_df(nasa)
## # A tibble: 100 x 3
##    identifier addeddate            title                                       
##    <chr>      <chr>                <chr>
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit         
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit           
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit                 
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit               
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant                       
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit                      
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise           
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop                        
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit                        
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))

## # A tibble: 6 x 4
##   file                       link                                                               last_mod          size 
## 1 00-042-154.jpg             https://archive.org/download/00-042-154/00-042-154.jpg             06-Nov-2000 15:34 1.2M 
## 2 00-042-154_archive.torrent https://archive.org/download/00-042-154/00-042-154_archive.torrent 06-Jul-2018 11:14 1.8K 
## 3 00-042-154_files.xml       https://archive.org/download/00-042-154/00-042-154_files.xml       06-Jul-2018 11:14 1.7K 
## 4 00-042-154_meta.xml        https://archive.org/download/00-042-154/00-042-154_meta.xml        03-Jun-2016 02:06 1.4K 
## 5 00-042-154_thumb.jpg       https://archive.org/download/00-042-154/00-042-154_thumb.jpg       26-Aug-2009 16:30 7.7K 
## 6 __ia_thumb.jpg             https://archive.org/download/00-042-154/__ia_thumb.jpg             06-Jul-2018 11:14 26.6K

download.file(item$link[1], file.path("man/figures", item$file[1]))

I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.
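
To that end, here's a hedged sketch of pulling the media type along for the ride. It assumes ia_scrape() exposes a fields argument for the scrape API's field list (check ?ia_scrape for the exact parameter name before relying on it):

# `fields` is an assumption about the ia_scrape() signature; consult the docs
nasa_details <- ia_scrape(
  "collection:nasa",
  fields = c("identifier", "title", "mediatype"),
  count = 100L
)

dplyr::filter(nasa_details, mediatype == "image")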

The API is still not set in stone, so you're encouraged to submit questions/suggestions.

FIN

The vignette is embedded below and frame-busted here. It covers a very helpful and practical use-case identified recently by an OP on StackOverflow.

There's also a new pkgdown-gen'd site for the package.

Issues & PRs welcome at your community coding site of choice.

I was chatting with some cyber-mates at a recent event and the topic of cyber attacks on the U.S. power grid came up (as it often does these days). The conversation was brief, but the topic made its way into active memory and resurfaced when I saw today’s Data Is Plural newsletter, which noted that “Utility companies are required to report major power outages and other ‘electric disturbance events’ to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets.” (Follow the links to the newsletter to get the URLs for the site, since Jeremy deserves the traffic.)

Many of us data nerds use the Data Is Plural newsletters as fodder for class assignments, blog posts or personal “data katas”. This time, I was after cyber attack data.

When you head to the annual reports URL, you’re greeted with a SharePoint-driven HTML table:

So, our options are PDF or XLS (and I mean .xls, too; they’re not modern .xlsx files). We’ll opt for the latter and cache them locally before working on them. One “gotcha” is that the hrefs look like this: https://www.oe.netl.doe.gov/download.aspx?type=OE417XLS&ID=78 — i.e. no filenames. But the filenames come along for the ride in a content-disposition response header when an HTTP GET or HEAD request is issued. We’ll use this metadata instead of siphoning off the year from the first column of the table:

library(rvest)
library(readxl)
library(httr) # for HEAD()/GET()/write_disk() below
library(tidyverse)

doe <- read_html("https://www.oe.netl.doe.gov/OE417_annual_summary.aspx")

dir.create("~/Data/doe-cache-dir", showWarnings = FALSE)

html_nodes(doe, xpath=".//a[contains(., 'XLS')]") %>%
  html_attr("href") %>%
  { .pb <<- progress_estimated(length(.)) ; . } %>% # we likely don't rly need progress bars tho
  walk(~{

    .pb$tick()$print()

    dl_url <- sprintf("https://www.oe.netl.doe.gov/%s", .x)

    res <- HEAD(dl_url) # so we can get the filename
    stop_for_status(res) # halt on network errors

    fil <- str_replace(
      string = res$headers['content-disposition'],
      pattern = "attachment; filename=",
      replacement = "~/Data/doe-cache-dir/"
    )

    if (!file.exists(fil)) { # this pattern allows us to issue a lightweight HTTP HEAD request, then cache and refresh w/o wasting server/our bandwidth/cpu
      res <- GET(dl_url, httr::write_disk(fil))
      stop_for_status(res)
      Sys.sleep(5) # be kind to the server(s) but only if we're downloading data files since HEAD requests don't really tax services
    }

  })

Let's do a quick check for the likelihood of uniformity. Some of these files go back to 2002 and I suspect they're more geared for "printing" (the PDF counterparts were a clue) than programmatic processing:

# check to see if the files are all the same (spoiler alert: they're not)
list.files("~/Data/doe-cache-dir", "xls", full.names=TRUE) %>%
  map_df(~list(
    fil = basename(.x),
    ncols = read_xls(.x, col_names=FALSE, col_types="text") %>% ncol()
  )) -> cols_profile

cols_profile
## # A tibble: 17 x 2
##    fil                     ncols
##    <chr>                   <int>
##  1 2002_Annual_Summary.xls     8
##  2 2003_Annual_Summary.xls     8
##  3 2004_Annual_Summary.xls     8
##  4 2005_Annual_Summary.xls     8
##  5 2006_Annual_Summary.xls     8
##  6 2007_Annual_Summary.xls     8
##  7 2008_Annual_Summary.xls     8
##  8 2009_Annual_Summary.xls     8
##  9 2010_Annual_Summary.xls     8
## 10 2011_Annual_Summary.xls     9
## 11 2012_Annual_Summary.xls     9
## 12 2013_Annual_Summary.xls     9
## 13 2014_Annual_Summary.xls     9
## 14 2015_Annual_Summary.xls    11
## 15 2016_Annual_Summary.xls    11
## 16 2017_Annual_Summary.xls    11
## 17 2018_Annual_Summary.xls    11

O_o

At this point, I paused and wanted to see what was going on in the minds of the DoE staffers charged with releasing this data.

(You can grab the macOS Quick Look preview snaps of all of those here.)

From 2002 to 2010 the Excel documents are clearly designed for print as the target, complete with month breaklines and repeated (+ heavily formatted) headers. They even left other tabs around (inconsistently).

Things got a little better between 2011 and 2014, but we still have month breaks and occasional, repeated headers (someone likely learned how to generate headers-per-page in Excel in 2011 then the administration changed hands and new staffers came in and fubar'd 2012 a bit before going back to the slightly better format).

Prior to 2015, the print-as-target trumped programmatic access. Interestingly enough, this is roughly when "data science" was on the upswing (in a major way):



Starting with 2015 we have a "month" column, more uniformity for specifying dates & times and more care given to other value fields, so kudos to the then and current staffers who made our data-machinating lives easier.

This really is, I believe, a byproduct of modern "data literacy". Folks in charge of gathering and publishing data are realizing there are multiple ways others want/need to consume the data. The original purpose for this data was to hand a report to someone after regulations were put in place to mandate notifications. I'm willing to bet nobody did anything with this data for a few years. Staffers either learned to wield Excel better or new staffers came in with this new knowledge. Finally, recent years clearly show that the staffers realize that folks are as (or more) likely to programmatically consume this information as they are to read a long list of events. More work is needed (and an API or CSV/JSON output would be super cool) but it's great to see data literacy alive and well in the halls of the U.S. gov.

Said modern format changes do not really help us work with the complete data set and the more recent files have issues all their own, including inconsistency in the way the date/time columns are represented in Excel cells.

By golly, we're still going to try to read all these files in and work with them (for at least the purpose I originally set out on). We'll have to deal with the differences in columns and come up with a way to remove non-data rows. I also kinda at least want dates as dates. Here's my stab at an initial clean-up (there's lots more work to do, though):

map2(cols_profile$fil, cols_profile$ncols, ~{

  if (.y == 8) { # handle 8 cols

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "region", "time", "area", "event_type", "loss", "customers_affected", "date_restored"),
      col_types = c("date", "text", "text", "text", "text", "text", "text", "date")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 9) { # handle 9 cols

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "time_began", "date_restored", "time_restored", "area", "region", "event_type", "loss", "customers_affected"),
      col_types = c("date", "guess", "date", "guess", "text", "text", "text", "text", "text")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 11) { # handle 11 cols

    # note that the date columns aren't uniform in the Excel spreadsheets even in these more data-literate files :-(

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("month", "date_began", "time_began", "date_restored", "time_restored", "area", "region", "alert_criteria", "event_type", "loss", "customers_affected"),
      col_types = c("text", "text", "guess", "text", "guess", "text", "text", "text", "text", "text", "text")
    )) %>%
      mutate(
        date_began = case_when(
          str_detect(date_began, "/") ~ suppressWarnings(as.Date(date_began, format="%m/%d/%Y")),
          str_detect(date_began, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_began), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      mutate(
        date_restored = case_when(
          str_detect(date_restored, "/") ~ suppressWarnings(as.Date(date_restored, format="%m/%d/%Y")),
          str_detect(date_restored, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_restored), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      filter(!is.na(date_began))

  }

}) -> reports
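
The Excel-serial branch of those case_when() calls leans on the 1899-12-30 origin convention (Excel stores dates as serial day counts, and that origin compensates for its phantom 1900 leap day). A quick sanity check:

# Excel's serial number for 2018-01-01 is 43101; the 1899-12-30 origin recovers it
as.Date(43101, origin = "1899-12-30")
## [1] "2018-01-01"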

reports[[1]]
## # A tibble: 23 x 8
##    date_began region time                area      event_type      loss  customers_affec… date_restored
##    <date>     <chr>  <chr>               <chr>     <chr>           <chr> <chr>            <dttm>
##  1 2002-01-30 SPP    0.25                Oklahoma  Ice Storm       500   1881134          2002-02-07 12:00:00
##  2 2002-01-29 SPP    Evening             Metropol… Ice Storm       500-… 270000           NA
##  3 2002-01-30 SPP    0.66666666666666663 Missouri  Ice Storm       210   95000            2002-02-10 21:00:00
##  4 2002-02-27 WSCC   0.45000000000000001 Californ… Interruption o… 300   255000           2002-02-27 11:35:00
##  5 2002-03-09 ECAR   0                   Lower Pe… Severe Weather  190   190000           2002-03-11 12:00:00
##  6 2002-04-08 WSCC   0.625               Arizona   Vandalism/      0     0                2002-04-09 00:00:00
##  7 2002-07-09 WSCC   0.51875000000000004 Californ… Interruption o… 240   1 PG&E           2002-07-09 19:54:00
##  8 2002-07-19 WSCC   0.49375000000000002 Californ… Interruption o… 240   1 PG&E           2002-07-19 16:30:00
##  9 2002-07-20 NPCC   0.52777777777777779 New York  Fire            278   63500            2002-07-20 20:12:00
## 10 2002-08-02 MAIN   0.52986111111111112 Illinois  Interruption o… 232   53565            2002-08-02 18:36:00
## # ... with 13 more rows

reports[[10]]
## # A tibble: 307 x 9
##    date_began time_began  date_restored       time_restored  area    region event_type loss  customers_affec…
##                                                    
##  1 2011-01-11 0.96388888… 2011-01-11 00:00:00 0.96388888888… Athens… NPCC   Electrica… 0     0
##  2 2011-01-12 0.25        2011-01-12 00:00:00 0.58333333333… Massac… NPCC   Winter St… N/A   80000
##  3 2011-01-13 0.30624999… 2011-01-13 00:00:00 0.34236111111… North … FRCC   Firm Syst… 150   20900
##  4 2011-01-18 0.58333333… 2011-01-18 00:00:00 0.58333333333… Whitma… NPCC   Vandalism  0     0
##  5 2011-01-23 0.29166666… 2011-01-23 00:00:00 0.54166666666… Frankl… WECC   Vandalism  0     0
##  6 2011-01-24 0.55555555… 2011-01-24 00:00:00 0.5625         Newman… WECC   Suspiciou… 0     0
##  7 2011-01-25 0.14097222… 2011-01-25 00:00:00 0.45833333333… Newark… RFC    Vandalism  0     0
##  8 2011-01-26 0.39236111… 2011-01-27 00:00:00 0.70833333333… Carson… WECC   Suspected… 0     0
##  9 2011-01-26 0.39791666… 2011-01-27 00:00:00 0.62708333333… Michig… RFC    Vandalism  0     0
## 10 2011-01-26 0.70833333… 2011-01-31 00:00:00 0.33333333333… Montgo… RFC    Winter St… N/A   210000
## # ... with 297 more rows

reports[[17]]
## # A tibble: 120 x 11
##    month  date_began time_began date_restored time_restored area   region alert_criteria     event_type loss
##                                                          
##  1 Janua… 2018-01-01 0.7645833… 2018-01-02    0.7576388888… Tenne… SERC   Public appeal to … Severe We… Unkn…
##  2 Janua… 2018-01-01 0.7381944… NA            Unknown       Texas: TRE    Public appeal to … Severe We… Unkn…
##  3 Janua… 2018-01-01 0.9006944… 2018-01-02    0.4375        Tenne… SERC   Public appeal to … System Op… Unkn…
##  4 Janua… 2018-01-02 0.4166666… 2018-02-12    0.3333333333… New Y… NPCC   Fuel supply emerg… Fuel Supp… 675
##  5 Janua… 2018-01-02 0.3125     NA            Unknown       South… SERC   Public appeal to … Severe We… 0
##  6 Janua… 2018-01-02 0.28125    2018-01-02    0.375         North… SERC   System-wide volta… Severe We… 14998
##  7 Janua… 2018-01-04 0.0756944… 2018-01-04    0.0895833333… Texas… TRE    Physical attack t… Actual Ph… Unkn…
##  8 Janua… 2018-01-12 0.5472222… 2018-01-12    0.6201388888… Michi… RF     Cyber event that … System Op… 41
##  9 Janua… 2018-01-15 0.1805555… 2018-01-18    0.2416666666… Texas: TRE    Public appeal to … Severe We… Unkn…
## 10 Janua… 2018-01-16 0.625      2018-01-18    0.5416666666… Tenne… SERC   Public appeal to … Severe We… Unkn…
## # ... with 110 more rows, and 1 more variable: customers_affected 

If you'd have handled the above differently, it'd be great if you could drop a note in the comments (for both my benefit and that of any other readers who have kindly made it this far into this tome).

At this point, I really just want to finally see if there are any "cyber" events in the data set and when/where they were. To do that, let's whittle down the columns a bit and make one data frame out of all the reports:

map_df(reports, ~{
  select(.x, date_began, region, area, event_type, customers_affected, date_restored) %>%
    mutate(date_restored = as.Date(date_restored)) %>%
    mutate(
      customers_affected = suppressWarnings(
        str_replace_all(customers_affected, "\\-.*$|[[:punct:]]+|[[:alpha:]]+", "") %>%
          as.numeric()
      )
    ) %>%
    mutate(date_restored = as.Date(ifelse(is.na(date_restored), date_began, date_restored), origin = "1970-01-01"))
}) -> events

events
## # A tibble: 2,243 x 6
##    date_began region area                          event_type                 customers_affect… date_restored
##    <date>     <chr>  <chr>                         <chr>                      <dbl>             <date>
##  1 2002-01-30 SPP    Oklahoma                      Ice Storm                            1881134 2002-02-07
##  2 2002-01-29 SPP    Metropolitan Kansas City Area Ice Storm                             270000 2002-01-29
##  3 2002-01-30 SPP    Missouri                      Ice Storm                              95000 2002-02-10
##  4 2002-02-27 WSCC   California                    Interruption of Firm Load             255000 2002-02-27
##  5 2002-03-09 ECAR   Lower Peninsula of Michigan   Severe Weather                        190000 2002-03-11
##  6 2002-04-08 WSCC   Arizona                       Vandalism/                                 0 2002-04-09
##  7 2002-07-09 WSCC   California                    Interruption of Firm Power                 1 2002-07-09
##  8 2002-07-19 WSCC   California                    Interruption of Firm Powe…                 1 2002-07-19
##  9 2002-07-20 NPCC   New York                      Fire                                   63500 2002-07-20
## 10 2002-08-02 MAIN   Illinois                      Interruption of Firm Power             53565 2002-08-02
## # ... with 2,233 more rows

Now we're cookin' with gas!

Let's do a quick check to make sure things look OK:

count(events, event_type, sort=TRUE)
## # A tibble: 390 x 2
##    event_type                         n
##    <chr>                          <int>
##  1 Severe Weather                   369
##  2 Vandalism                        216
##  3 Severe Weather - Thunderstorms    97
##  4 Suspected Physical Attack         87
##  5 System Operations                 74
##  6 Severe Thunderstorms              70
##  7 Winter Storm                      51
##  8 Ice Storm                         42
##  9 Physical Attack - Vandalism       40
## 10 High Winds                        33
## # ... with 380 more rows

Those events+quantities seem to make sense. Now, for my ultimate goal:

filter(events, grepl("cyber|hack", event_type, ignore.case=TRUE)) # yep, grepl() is still in muscle memory
## # A tibble: 19 x 6
##    date_began region area                               event_type             customers_affec… date_restored
##    <date>     <chr>  <chr>                              <chr>                  <dbl>            <date>
##  1 2003-01-25 ECAR   Cincinnati, Ohio                   Cyber Threat From Int…               NA 2003-01-25
##  2 2011-02-03 RFC    Bowie, Maryland                    Suspected Cyber Attack                0 2011-02-03
##  3 2011-02-17 WECC   Roseville, California              Suspected Cyber Attack                0 2011-02-23
##  4 2011-03-14 RFC    Baltimore, Maryland                Suspected Cyber Attack               NA 2011-03-14
##  5 2011-04-03 SERC   Unknown                            Suspected Cyber Attack                0 2011-04-05
##  6 2011-07-08 RFC    PJM Corporate Office, Pennsylvania Suspected Cyber Attack               NA 2011-07-11
##  7 2011-12-21 WECC   Boise, Idaho                       Suspected Cyber Attack                0 2011-12-21
##  8 2012-01-17 TRE    Austin, Texas                      Suspected Cyber Attack                0 2012-01-17
##  9 2012-02-17 SERC   Little Rock, Arkansas              Suspected Cyber Attack               NA 2012-02-17
## 10 2012-11-15 MRO    Iowa; Michigan                     Suspected Cyber Attack               NA 2012-11-15
## 11 2013-06-21 MRO    Michigan, Iowa                     Suspected Cyber Attack               NA 2013-10-30
## 12 2013-10-16 SERC   Roxboro Plant, North Carolina      Cyber Event with Pote…                0 2013-10-16
## 13 2014-03-20 NPCC   New York                           Suspected Cyber Attack               NA 2014-03-20
## 14 2014-10-21 MRO    Carmel, Indiana                    Suspected Cyber Attack               NA 2014-10-21
## 15 2014-12-30 NPCC   New Hampshire, Massachusetts, Mai… Suspected Cyber Attack               NA 2014-12-31
## 16 2016-02-07 NPCC   New York: Orange County            Cyber Attack                         NA 2016-02-07
## 17 2016-04-12 WECC   Washington: Pend Oreille County    Cyber Attack                          0 2016-04-12
## 18 2016-11-09 WECC   California: Stanislaus County, Sa… Cyber Attack                          0 2016-11-09
## 19 2016-12-13 WECC   California: Riverside County;      Cyber Event                           0 2016-12-13


FIN

There's a great deal of work left out of this power-outage data cleanup work:

  • Turn outage start/end info into POSIXct objects (see the sketch just after this list)
  • Normalize area (make it a data_frame column with state and municipality so it can be unnested nicely)
  • Normalize event_type since many of the phrases used are equivalent and some have more than one categorization
  • Normalize loss somehow and do a better job with customers_affected (I did not double-check my work and I think there are errors in that column, now, but I didn't need it for my goal).
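
A minimal sketch of that first bullet, assuming the time_* columns hold Excel day-fractions as the printouts above suggest (the free-text times in the older 8-column files would need separate handling):

# Excel stores times as fractions of a day, so fraction * 86400 gives seconds
# (non-numeric time values will coerce to NA with a warning)
reports[[10]] %>%
  mutate(
    began    = as.POSIXct(date_began) + round(as.numeric(time_began) * 86400),
    restored = as.POSIXct(as.Date(date_restored)) + round(as.numeric(time_restored) * 86400)
  ) %>%
  select(began, restored, everything())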

Since GitLab snippets are terrible, awful, slow things I've grudgingly posted the above code (contiguously) over at GitHub.

If you wrangle the data more and/or come up with other insights drop a note in the comments with a link to your post.

The sergeant package has a minor update that adds REST API coverage for two “new” storage endpoints that make it possible to add, update and remove storage configurations on-the-fly without using the GUI or manually updating a config file.

This is an especially handy feature when paired with Drill’s new, official Docker container since that means we can:

  • fire up a clean Drill instance
  • modify the storage configuration (to, say, point to a local file system directory)
  • execute SQL ops
  • destroy the Drill instance

all from within R.

This is even more handy for those of us who prefer handling JSON data in Drill rather than in R directly or with sparklyr.

Quick Example

In a few weeks most of the following verbose-code-snippets will have a more diminutive and user-friendly interface within sergeant, but for now we’ll perform the above bulleted steps with some data that was used in a recent new package which was also generated by another recent new package. The zdnsr::zdns_exec() function ultimately generates a deeply nested JSON file that I really prefer working with in Drill before shunting it into R. Said file is stored, say, in the ~/drilldat directory.

Now, I have Drill running all the time on almost every system I use, but we’ll pretend I don’t for this example. I’ve run zdns_exec() and generated the JSON file and it’s in the aforementioned directory. Let’s fire up an instance and connect to it:

library(sergeant) # git[hu|la]b:hrbrmstr/sergeant
library(dplyr)

docker <- Sys.which("docker") # you do need docker. it's a big dependency, but worth it IMO

(system2(
  command = docker,  
  args = c(
    "run", "-i", 
    "--name", "drill-1.14.0", 
    "-p", "8047:8047", 
    "-v", paste0(c(path.expand("~/drilldat"), "/drilldat"), collapse=":"),
    "--detach", 
    "-t", "drill/apache-drill:1.14.0",
    "/bin/bash"
  ),
  stdout = TRUE
) -> drill_container)
## [1] "d6bc79548fa073d3bfbd32528a12669d753e7a19a6258e1be310e1db378f0e0d"

The above snippet fires up a Drill Docker container (downloads it, too, if not already local) and wires up a virtual directory to it.

We should wait a couple seconds and make sure we can connect to it:

drill_connection() %>% 
  drill_active()
## [1] TRUE

Now, we need to add a storage configuration so we can access our virtual directory. Rather than modify dfs we’ll add a drilldat plugin that will work with the local filesystem just like dfs does:

drill_connection() %>%
  drill_mod_storage(
    name = "drilldat",
    config = '
{
  "config" : {
    "connection" : "file:///", 
    "enabled" : true,
    "formats" : null,
    "type" : "file",
    "workspaces" : {
      "root" : {
        "location" : "/drilldat",
        "writable" : true,
        "defaultInputFormat": null
      }
    }
  },
  "name" : "drilldat"
}
')
## $result
## [1] "success"

Now, we can perform all the Drill ops sergeant has to offer, including ones like this:

(db <- src_drill("localhost"))
## src:  DrillConnection
## tbls: cp.default, dfs.default, dfs.root, dfs.tmp, drilldat.default, drilldat.root,
##   INFORMATION_SCHEMA, sys

tbl(db, "drilldat.root.`/*.json`")
## # Source:   table [?? x 10]
## # Database: DrillConnection
##    data                                              name   error class status timestamp          
##                                                                    
##  1 "{\"authorities\":[{\"ttl\":180,\"type\":\"SOA\"… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  2 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  3 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  4 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  5 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  6 "{\"authorities\":[{\"ttl\":1799,\"type\":\"SOA\… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  7 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  8 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  9 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
## 10 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
## # ... with more rows

(tbl(db, "(
SELECT
 b.answers.name AS question,
 b.answers.answer AS answer
FROM (
 SELECT 
   FLATTEN(a.data.answers) AS answers
 FROM 
   drilldat.root.`/*.json` a
 WHERE 
   (a.status = 'NOERROR')
) b
)") %>% 
 collect() -> dmarc_recs)
## # A tibble: 1,250 x 2
##    question             answer                                                                    
##  * <chr>                <chr>
##  1 _dmarc.washjeff.edu  v=DMARC1; p=none                                                          
##  2 _dmarc.barry.edu     v=DMARC1; p=none; rua=mailto:dmpost@barry.edu,mailto:7cc566d7@mxtoolbox.d…
##  3 _dmarc.yhc.edu       v=DMARC1; pct=100; p=none                                                 
##  4 _dmarc.aacc.edu      v=DMARC1;p=none; rua=mailto:DKIM_DMARC@aacc.edu;ruf=mailto:DKIM_DMARC@aac…
##  5 _dmarc.sagu.edu      v=DMARC1; p=none; rua=mailto:Office365contact@sagu.edu; ruf=mailto:Office…
##  6 _dmarc.colostate.edu v=DMARC1; p=none; pct=100; rua=mailto:re+anahykughvo@dmarc.postmarkapp.co…
##  7 _dmarc.wne.edu       v=DMARC1;p=quarantine;sp=none;fo=1;ri=86400;pct=50;rua=mailto:dmarcreply@…
##  8 _dmarc.csuglobal.edu v=DMARC1; p=none;                                                         
##  9 _dmarc.devry.edu     v=DMARC1; p=none; pct=100; rua=mailto:devry@rua.agari.com; ruf=mailto:dev…
## 10 _dmarc.sullivan.edu  v=DMARC1; p=none; rua=mailto:mcambron@sullivan.edu; ruf=mailto:mcambron@s…
## # ... with 1,240 more rows

Finally (when done), we can terminate the Drill container:

system2(
  command = "docker",
  args = c("rm", "-f", drill_container)
)

FIN

Those system2() calls are hard on the fingers and a pain to type/remember, so they’ll be wrapped in some sergeant utility functions (I’m hesitant to add a reticulate dependency to sergeant, which is necessary to use the docker package, hence the system call wrapper approach).
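
Something along these lines, perhaps (the function and argument names here are purely illustrative, not the eventual sergeant API):

# hypothetical helpers -- NOT the final sergeant interface
drill_docker_up <- function(data_dir = "~/drilldat",
                            image = "drill/apache-drill:1.14.0",
                            name = "drill-1.14.0") {
  system2(
    command = Sys.which("docker"),
    args = c(
      "run", "-i",
      "--name", name,
      "-p", "8047:8047",
      "-v", paste0(path.expand(data_dir), ":/drilldat"),
      "--detach",
      "-t", image,
      "/bin/bash"
    ),
    stdout = TRUE
  )
}

drill_docker_down <- function(container_id) {
  system2(Sys.which("docker"), c("rm", "-f", container_id))
}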

Check your favorite repository for more sergeant updates and file issues if you have suggestions for how you’d like this Docker API for Drill to be controlled.

Nowadays (I’ve seen that word used so much in journal articles lately that I could not resist using it) I’m using world tile grids more frequently as the need arises to convey the state of exposure of various services at a global (country) scale. Given that necessity fosters invention it seemed that having a ggplot2 geom for world tile grids would make my life easier and also make work more efficient.

To that end, there’s a nascent ggplot2 extension package for making world tile grids called (uncreatively) worldtilegrid. It’s also at GitHub if you’re more comfortable working with the code from there.

The README has examples but we’ll walk through another one here and work with life expectancy data curated by Our World in Data. Rather than draw out this post to a less tenable length with an explanation of how to pull the XHR JSON data into R, just grab the CSV linked with the visualization on that page and substitute the path in the code for where you stored it.

We do need to clean up this data a bit since it has some issues. Let’s do that and carve out some slices into two new data frames so we can work with the most recent curated year and some historical data:

library(hrbrthemes) # gitlab/github :: hrbrmstr/hrbrthemes (use devtools installers)
library(worldtilegrid) # gitlab/github :: hrbrmstr/worldtilegrid (use devtools installers)
library(tidyverse)

cols(
  Entity = col_character(),
  Code = col_character(),
  Year = col_integer(),
  `Life expectancy (Clio-Infra up to 1949; UN Population Division for 1950 to 2015)` = col_double()
) -> le_cols

read_csv("~/Downloads/life-expectancy.csv", col_types = le_cols) %>%
  set_names(c("country", "iso3c", "year", "life_expectancy")) -> lifexp

# clean it up a bit since it's not great data and bust it into groups
# that match the vis on the site vs use continuous raw data, esp since
# the data is really just estimates
filter(lifexp, !is.na(iso3c)) %>%
  filter(nchar(iso3c) == 3) %>%
  mutate(
    grp = cut(
      x = life_expectancy,
      breaks = c(10, 20, 30, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85),
      include.lowest = TRUE
    )
  ) -> lifexp

last_year <-  filter(lifexp, year == last(year))
a_few_years <- filter(lifexp, year %in% c(seq(1900, 2015, 20), 2015))

Now we can use it to make some cartograms.

Using the World Tile Grid geom

You can reference a previous post for what it would normally take to create these tile grids. The idea was to simplify the API down to the caller just needing to specify some country names/ISO[23]Cs (to a country aesthetic) and a value (to a fill aesthetic) and let the package do the rest. To that end, here's how to display the life expectancy data for 2015:

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") + # yep, you can sum up the entire blog post to this one line
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 2015",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number\nof years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to\nstay the same throughout its life",
    caption = "Data source: "

  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

You can add labels via geom_text() and use the wtg stat for it as it provides a number of computed variables you can work with:

  • x,y: the X,Y position of the tile
  • name: Country name (e.g. Afghanistan)
  • country.code: ISO2C country code abbreviation (e.g. AF)
  • iso_3166.2: Full ISO 3166 2-letter abbreviation code (e.g. ISO 3166-2:AF)
  • region: Region name (e.g. Asia)
  • sub.region: Sub-region name (e.g. Southern Asia)
  • region.code: Region code (e.g. 142)
  • sub.region.code: Sub-region code (e.g. 034)

Labeling should be a deliberate decision and used sparingly/with care.
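
For example, here's a hedged sketch of labeling with the ISO2C codes (the computed-variable name comes from the list above; the exact stat()/aesthetic plumbing may shift while the package API settles):

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  geom_text(
    aes(country = iso3c, label = stat(country.code)), # computed variable from the wtg stat
    stat = "wtg", size = 2, color = "white"
  ) +
  coord_equal() +
  theme_ft_rc(grid="") +
  theme(axis.text = element_blank(), legend.position = "bottom")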

Easier World Tile Grid Facets

Making faceted charts is also straightforward. Just use ggplot2's faceting subsystem:

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 1900-2015 (selected years)",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number of years a newborn infant\nwould live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life",
    caption = "Data source: "
  ) +
  facet_wrap(~year) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

(You may want to open that one up in a separate tab/window.)

Animation might have been a better choice than facets (which is an exercise left to the reader).

FIN

The package API is still in "rapidly changing" mode, but feel free to kick the tyres and file issues or PRs on your community coding platform of choice as needed.

As noted in the package, the world tile grid concept credit goes to Jon Schwabish and the CSV data used to create the package exposed wtg data frame is the work of Maarten Lambrechts.

Apple has rung the death knell for 32-bit macOS apps and, if you’re running a recent macOS version on your Mac (which you should, so you can get security updates), you likely see this alert from time to time:

If you’re like me, you click through that and keep working but later ponder just how many of those apps you have. They are definitely going away, so knowing if your favourite app is on the chopping block is likely a good idea.

You can get this information via the “About This Mac” → “System Report” app and sorting via one of the table fields.

R folks are data folks and we know we can do better than that. But, first we need to get the data. Thankfully, we can get this via the system_profiler command-line utility since it can both display user-friendly information in the terminal and also generate an XML version of the information to work with. We won’t need to head to the terminal for this work, though, since there are many ways to execute the command from R and read the generated output.

Executing System Calls from R

Base R provides two core methods for issuing a system call:

  • system()
  • system2()

Note that there are other functions provided with a base R installation that can also issue system commands and process the “piped” output, but we’ll focus on these deliberate invocation ones.

Functions in two other packages can also assist with this task and we’ll include a look at:

  • processx::run()
  • sys::exec_internal()

as well.

Why leave base R for this task? Truthfully, we really don’t need to, but both sys and processx have other tasks which do make them handy tools in your package toolbox. Having said that, keep reading since we’re going to end up not choosing the built-in functions for this task.

This is the command line we need to execute:

system_profiler -xml -detailLevel full SPApplicationsDataType

Let’s load up all the packages we’ll be needing and execute this command-line all four ways, then briefly discuss the differences:

library(sys)
library(processx)
library(microbenchmark)
library(xml2)
library(tidyverse)

system(
  command = "system_profiler -xml -detailLevel full SPApplicationsDataType",
  intern = TRUE
) -> apps_system

str(apps_system)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

system2(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  stdout = TRUE
) -> apps_system2

str(apps_system2)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

processx::run(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  spinner = TRUE
) -> apps_processx_run

str(apps_processx_run)
## List of 4
##  $ status : int 0
##  $ stdout : chr "XML STRING THAT prism.js won't let me show"
##  $ stderr : chr ""
##  $ timeout: logi FALSE

sys::exec_internal(
  cmd = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType")
) -> apps_sys_exec_internal

str(apps_sys_exec_internal)
## List of 3
##  $ status: int 0
##  $ stdout: raw [1:331133] 3c 3f 78 6d ...
##  $ stderr: raw(0) 

The core difference between system() and the rest is that you need to shQuote() your command for system() whereas that’s taken care of for you by the others (so they’re a bit safer by default since you’re more than likely going to forget to shQuote()).
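
A quick illustration (mdls is just a handy macOS command that takes a path argument; the path is illustrative):

# with system() you must quote anything with spaces yourself...
system(paste("mdls", shQuote("/Applications/Google Chrome.app")), intern = TRUE)

# ...whereas processx/sys pass arguments straight to the program (no shell), so no quoting dance
processx::run("mdls", "/Applications/Google Chrome.app")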

You can definitely notice the main differences in return objects. The built-in functions just give us the character data from the standard output stream (stdout) and the last two return a more structured object that provides more explicit information about the job we just executed. The base ones can provide this detail, but it’s a twisty maze of remembering which options do what vs the more (IMO) straightforward approach both processx and sys take.

You’ll also notice a difference in stdout between processx and sys with the latter giving us a raw vector vs a character vector. This gives us a great deal of power and flexibility. It also turns out to be a great choice for processing command-line-generated XML data. Here’s why:

microbenchmark(
  sys = xml2::read_xml(apps_sys_exec_internal$stdout),
  processx = xml2::read_xml(apps_processx_run$stdout)
)
## Unit: milliseconds
##      expr      min       lq      mean    median        uq      max neval
##       sys 4.086492  4.60078  9.085143  5.508814  5.906942 207.6495   100
##  processx 9.510356 10.98282 14.275499 12.054810 13.292234 163.9870   100

It turns out xml2::read_xml() makes much quicker work of the raw vector data (though, I mean, really—are we really going to care about those ~5ms IRL?).

We’ll move on to the real reason for the post, but definitely explore both sys and processx since they are both super-handy packages.

“Can we please just find the 32-bit apps already?”

No problem. Well, actually, there is a minor annoyance. These are property list XML files and I’ll confess that I truly hate this format. There are “dictionary arrays” of key and value nodes, but those nodes are siblings vs directly associated pairs. So, we have to use the sibling relationship to work with them. It’s not hard, per se, just (again, IMO) suboptimal.

Let’s take a look at it:

apps <- read_xml(apps_sys_exec_internal$stdout)

xml_find_all(apps, "//array/dict")
## {xml_nodeset (476)}
##  [1] <dict>\n  <key>_SPCommandLineArguments</key>\n  <array>\n     ...
##  [2] <dict>\n  <key>_name</key>\n  <string>Sublime Text</string>\n  <key>h ...
##  [3] <dict>\n  <key>_name</key>\n  <string>System Preferences</string>\n   ...
##  [4] <dict>\n  <key>_name</key>\n  <string>Google Chrome Canary</string>\n ...
##  [5] <dict>\n  <key>_name</key>\n  <string>Google Chrome</string>\n   ...
##  [6] <dict>\n  <key>_name</key>\n  <string>Dropbox</string>\n  <key>has64B ...
##  [7] <dict>\n  <key>_name</key>\n  <string>Keypad</string>\n  <key>has64Bi ...
##  [8] <dict>\n  <key>_name</key>\n  <string>Garmin WebUpdater</string>\n  < ...
##  [9] <dict>\n  <key>_name</key>\n  <string>LaTeXiT</string>\n  <key>has64B ...
## [10] <dict>\n  <key>_name</key>\n  <string>CocoaPacketAnalyzer</string>\n  ...
## [11] <dict>\n  <key>_name</key>\n  <string>Janetter</string>\n  <key>has64 ...
## [12] <dict>\n  <key>_name</key>\n  <string>VMware Fusion</string>\n   ...
## [13] <dict>\n  <key>_name</key>\n  <string>Photo Library Migration Utility ...
## [14] <dict>\n  <key>_name</key>\n  <string>Setup Assistant</string>\n  <ke ...
## [15] <dict>\n  <key>_name</key>\n  <string>Siri</string>\n  <key>has64BitI ...
## [16] <dict>\n  <key>_name</key>\n  <string>Software Update</string>\n  <ke ...
## [17] <dict>\n  <key>_name</key>\n  <string>Spotlight</string>\n  <key>has6 ...
## [18] <dict>\n  <key>_name</key>\n  <string>Stocks</string>\n  <key>has64Bi ...
## [19] <dict>\n  <key>_name</key>\n  <string>SystemUIServer</string>\n  <key ...
## [20] <dict>\n  <key>_name</key>\n  <string>UniversalAccessControl</string> ...
## ...

Let’s look at a sample record using xml_view() from the htmltidy package:

xml_find_all(apps, "//array/dict[key='_name']")[1] %>% 
  htmltidy::xml_view()

Be wary of using xml_view() on giant XML structures since it’ll freeze up RStudio for a bit and even slows down Chrome since the resultant, composed DOM object can get ginormous.

Now we know we can use has64BitIntelCode for filtering once we get to the data. Let’s read in all the apps, cherry-picking the fields and then just look at the 32-bit apps:

xml_find_all(apps, "//array/dict[key='_name']") %>% 
  map_df(~{
    list(
      name = xml_find_first(.x, ".//string") %>% xml_text(),
      path = xml_find_first(.x, ".//key[.='path']/following-sibling::string") %>% xml_text(),
      is_64bit = xml_find_first(.x, ".//key[.='has64BitIntelCode']/following-sibling::string") %>% xml_text() 
    )
  }) %>% 
  filter(is_64bit == "no") %>% 
  arrange(name) %>% 
  select(-is_64bit)
## # A tibble: 30 x 2
##    name                      path                                           
##    <chr>                     <chr>
##  1 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  2 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  3 AAM Updates Notifier      /Applications/Utilities/Adobe Application Mana…
##  4 AAMLauncherUtil           /Applications/Utilities/Adobe Application Mana…
##  5 ACR_9_10                  /Library/Application Support/Adobe/Uninstall/A…
##  6 Adobe Application Manager /Applications/Utilities/Adobe Application Mana…
##  7 adobe_licutil             /Applications/Utilities/Adobe Application Mana…
##  8 Audacity                  /Applications/Audacity.app                     
##  9 COCM_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## 10 COPS_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## # ... with 20 more rows

The Adobe helper apps are longstanding 32-bit “offenders”. Many of these death-row apps fall into the “helper” category and will hopefully get some attention from their developers. I do find it amusing that Apple kinda wants us to prod the developers to get their collective acts together.

FIN

This exhaustive search finds all of the 32-bit apps residing on your system. If you just want to see the ones you’ve executed and macOS has kept track of, you can drop to a command-line and do:

sudo Rscript -e 'knitr::kable((dplyr::select(dplyr::tbl(dplyr::src_sqlite("/var/db/SystemPolicyConfiguration/ExecPolicy"), "legacy_exec_history_v3"), responsible_path)))'

You need elevated privileges to access those files, so please read that whole line to make sure I’m not having you rm -rf /.
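
If that one-liner is hard to parse, here's the same pipeline spread out (the same caveat about elevated privileges applies; src_sqlite() is superseded in newer dplyr/dbplyr releases but still works):

library(dplyr)

con <- src_sqlite("/var/db/SystemPolicyConfiguration/ExecPolicy")

tbl(con, "legacy_exec_history_v3") %>%
  select(responsible_path) %>%
  collect() %>%
  knitr::kable()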

Remember to take some time to explore the sys and processx packages and perhaps bundle up the salient bits of this post into a script so you can occasionally check to see the 32-bit eradication progress.

It’s been over a year since Headless Chrome was introduced and it has matured greatly over that time and has acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or render javascript-interpreted HTML by supplying some simple parameters. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.

R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).

However, that nigh-mundane addition is not the only new feature in decapitated.

Introducing gepetto

While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.

Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.

Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.

A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at-will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but should be of sufficient utility to add to your web-scraping toolbox.

Installing gepetto

There are instructions over at the repo on installing gepetto but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto() which should (hopefully) go as smoothly as this provided you have a fairly recent version of Node.js installed along with npm:

The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps to ensure smoother sailing than using the version of Chrome you may use for browsing.

Working with gepetto

Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:

$ gepetto
? Launch browser!
? gepetto running on: http://localhost:3000

NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.

You can then go to http://localhost:3000 in your browser and should see this:

Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.

If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:

The Swagger definition JSON is also at http://localhost:3000/swagger.json.

The API documentation will be a bit more robust as the module’s corners are rounded out.

“But, this is supposed to be an R post…”

Yes. Yes it is.

If you followed along in the previous section and started gepetto from a command-line interface, kill the running service and fire up your favourite R environment and let’s scrape some content!

library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>% 
  gep_active()
## [1] TRUE

Anything other than a “running” response means there’s something wrong and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
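
For instance, since gpid is a processx process object, a couple of its methods come in handy here (whether the read_* methods are available depends on how start_gepetto() wired up the process I/O):

gpid$is_alive()           # TRUE while the gepetto service is up
# gpid$read_error_lines() # peek at stderr if gep_active() comes back FALSE
# gpid$kill()             # hard-stop the service when you're done with it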

Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().

gepetto() %>% 
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%  
  html_attr("src") %>% 
  head(10)
## [1] "audio/2chainz_4.mp3"        "audio/2chainz_yeah2.mp3"   
## [3] "audio/2chainz_tellem.mp3"   "audio/2chainz_tru.mp3"     
## [5] "audio/2chainz_unh3.mp3"     "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3"  "audio/2pac_4.mp3"          
## [9] "audio/2pac_5.mp3"           "audio/2pac_6.mp3"

Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.

We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:

gepetto() %>% 
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72

Astute readers will notice it returns a magick object so you can work with it immediately.
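
For example, a small sketch of post-processing that capture (image_scale()/image_write() are standard magick verbs; the output filename is arbitrary):

library(magick)

img <- gepetto() %>%
  gep_render_magick("https://nytimes.com/")

image_scale(img, "600") %>%          # shrink the (very tall) capture
  image_write("nyt-front-page.png")  # and write it out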

I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since the gep_render_magick() call above actually captured the entire page, which you can see for yourself (the thumbnail doesn’t do it justice).

Testing gep_render_pdf() is an exercise left to the reader.

FIN

The gepetto REST API is at version 0.1.0 meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.

If you’ve got a directory full of Bro NSM logs, it’s easy to work with them in Apache Drill since they’re just tab-separated values (TSV) files by default. The most tedious part is mapping the columns to proper types and hopefully this saves at least one person from typing it out manually:

SELECT 
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid,
                    columns[2]             AS id_orig_h,
                    columns[3]             AS id_orig_p,
                    columns[4]             AS id_resp_h,
                    columns[5]             AS id_resp_p,
                    columns[6]             AS proto,
                    columns[7]             AS service,
              CAST( columns[8] AS DOUBLE)  AS duration,
              CAST( columns[9] AS INTEGER) AS orig_bytes,
              CAST(columns[10] AS INTEGER) AS resp_bytes,
                   columns[11]             AS conn_state,
                   columns[12]             AS local_orig,
                   columns[13]             AS local_resp,
              CAST(columns[14] AS INTEGER) AS missed_bytes,
                   columns[15]             AS history,
              CAST(columns[16] AS INTEGER) AS orig_packets,
              CAST(columns[17] AS INTEGER) AS orig_ip_bytes,
              CAST(columns[18] AS INTEGER) AS resp_pkts,
              CAST(columns[19] AS INTEGER) AS resp_ip_bytes,
                   columns[20]             AS tunnel_parents
FROM dfs.brologs.`/201808/*`

You can either store them all under a single workspace with a default input type or soft-link/rename them to end in .tsv (it’s unlikely you want to change all .log files to be read as TSV everywhere).

While you could just use the logs this way, consider using CTAS (CREATE TABLE AS SELECT) to move them to Parquet. The query above already creates typed columns, and queries against the Parquet copies will generally be much faster.
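
A hedged sketch of that CTAS step, driven from R via sergeant (the target workspace and path are illustrative; any writable workspace such as dfs.tmp will do):

library(sergeant)

dc <- drill_connection("localhost")

drill_query(dc, "
CREATE TABLE dfs.tmp.`/bro_conn_201808` AS
SELECT
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid,
                    columns[2]             AS id_orig_h,
              CAST( columns[9] AS INTEGER) AS orig_bytes,
              CAST(columns[10] AS INTEGER) AS resp_bytes
  -- (remaining casts omitted for brevity; see the full SELECT above)
FROM dfs.brologs.`/201808/*`
")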

Apache Drill 1.14.0 was recently released, bringing with it many new features and a temporary incompatibility with the current rev of the MapR ODBC drivers. The Drill community expects new ODBC drivers to arrive shortly. The sergeant package is an alternative to ODBC for R users as it provides a dplyr interface to the REST API, along with a JDBC interface and functions to work directly with the REST API in a more programmatic fashion.

First-class dplyr-citizen support for the sergeant JDBC interface

I’ve been primarily using the ODBC interface for a while now, since it’s dead simple to use with dplyr (as has been noted in my still-unfinished, short cookbook on wiring up Drill and R). The ODBC incompatibility is pretty severe since it’s at the protobuf level, but sergeant::src_drill() is an easy swap-out and does not have any issues since it works against the REST API. Unfortunately, the query endpoint of the REST API mangles the field order when it returns query results. This really isn’t too painful since it’s easy to add in a select() call after gathering query results to reorder things. However, it’s painful enough that it facilitated rounding out some of the corners of the JDBC interface.
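
For example, reusing the drilldat JSON table from the earlier Docker + Drill example, the post-collect() reorder looks something like this (the column order is just illustrative):

library(sergeant)
library(dplyr)

db <- src_drill("localhost")

tbl(db, "drilldat.root.`/*.json`") %>%
  collect() %>%
  select(name, status, timestamp, everything()) # put the fields back in a sane order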

sergeant::drill_jdbc() now returns a <DrillJDBCConnection> object which was necessary to add dplyr classes for just enough bits to enable smooth operation with the tbl() function (without breaking all your other RJDBC usage in the same session). The next blog section will use the new JDBC interface with dplyr as it introduces one of Drill’s new features.

Query Image Metadata with Apache Drill 1.14.0

There are quite a few R packages for reading image/media metadata. Since that seems to be en vogue, R folks might be interested in Drill’s new image metadata format plugin. Just point Drill at a directory of files and you can use a familiar dplyr interface to get the deets on your pirated torrent archive (er, family photo inventory).

You first need to follow the directions at the aforelinked resource and add the following format to the formats: section.

formats: {
     "image": {
       "type": "image",
       "extensions": [
         "jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif",
         "ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p",
         "m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw",
         "crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f"
       ],
       "fileSystemMetadata": true,
       "descriptive": true,
       "timeZone": null
     }  
   }

Note that the configuration snippet on Drill’s site (as of the date-stamp on this post) did not have a , after the ] for the extensions array, so copy this one instead.

I created a media workspace and set the defaultInputFormat to image. Here’s a naive first look at what you can get back from a simple query to a jpg directory under it (using the new JDBC interface and dplyr):

library(sergeant)
library(tidyverse)

(con <- drill_jdbc("bigd:2181"))
## <DrillJDBCConnection>

tbl(con, "dfs.media.`/jpg/*`") %>%
  glimpse()
## Observations: ??
## Variables: 28
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <chr> "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      <chr> "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             <chr> "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             <chr> "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <chr> "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   <chr> "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              <chr> "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              <chr> "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        <chr> "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             <chr> "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          <chr> "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         <chr> "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

That’s quite a bit of metadata, but the Drill format plugin page kinda fibs a bit about column types since we see many chrs there. You may be quick to question the sergeant package, but this isn’t using the REST interface and we can use DBI calls to ask Drill what it’s sending us:

dbSendQuery(con, "SELECT * FROM dfs.media.`/jpg/*`") %>%
  dbColumnInfo()
##         field.name        field.type data.type            name
## 1         FileSize CHARACTER VARYING character        FileSize
## 2     FileDateTime CHARACTER VARYING character    FileDateTime
## 3           Format CHARACTER VARYING character          Format
## 4       PixelWidth CHARACTER VARYING character      PixelWidth
## 5      PixelHeight CHARACTER VARYING character     PixelHeight
## 6     BitsPerPixel CHARACTER VARYING character    BitsPerPixel
## 7         DPIWidth CHARACTER VARYING character        DPIWidth
## 8        DPIHeight CHARACTER VARYING character       DPIHeight
## 9       Orientaion CHARACTER VARYING character      Orientaion
## 10       ColorMode CHARACTER VARYING character       ColorMode
## 11        HasAlpha CHARACTER VARYING character        HasAlpha
## 12        Duration CHARACTER VARYING character        Duration
## 13      VideoCodec CHARACTER VARYING character      VideoCodec
## 14       FrameRate CHARACTER VARYING character       FrameRate
## 15      AudioCodec CHARACTER VARYING character      AudioCodec
## 16 AudioSampleSize CHARACTER VARYING character AudioSampleSize
## 17 AudioSampleRate CHARACTER VARYING character AudioSampleRate
## 18            JPEG               MAP character            JPEG
## 19            JFIF               MAP character            JFIF
## 20        ExifIFD0               MAP character        ExifIFD0
## 21      ExifSubIFD               MAP character      ExifSubIFD
## 22  AppleMakernote               MAP character  AppleMakernote
## 23             GPS               MAP character             GPS
## 24             XMP               MAP character             XMP
## 25       Photoshop               MAP character       Photoshop
## 26            IPTC               MAP character            IPTC
## 27         Huffman               MAP character         Huffman
## 28        FileType               MAP character        FileType
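
Since Drill reports nearly everything here as CHARACTER VARYING, you could also push the type coercion down to Drill with a CAST or two. The snippet below is an untested sketch of that idea (we’ll end up doing the conversion in R a bit later instead):

# hedged sketch: have Drill hand back numbers by casting the VARCHAR columns in SQL
tbl(con, sql("
SELECT CAST(PixelWidth AS INTEGER) AS PixelWidth, CAST(PixelHeight AS INTEGER) AS PixelHeight
FROM dfs.media.`/jpg/*`
"))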

We can still work with the results, but there’s also a pretty key element missing: the media filename. The reason it’s not in the listing is that filename is an implicit column that we have to ask for. So, we need to modify our query to be something like this:

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  glimpse()
## Observations: ??
## Variables: 29
## $ fname            <chr> "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <chr> "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      <chr> "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             <chr> "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             <chr> "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <chr> "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   <chr> "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              <chr> "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              <chr> "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        <chr> "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             <chr> "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          <chr> "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         <chr> "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...
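
As an aside, filename isn’t the only implicit column Drill exposes for file-system queries: the Drill docs also list fqn, filepath, and suffix, and (as far as I can tell) they can be requested the same way if you need to tell directories or file types apart:

# hedged sketch: fqn/filepath/suffix are Drill's other implicit file-system columns
tbl(con, sql("SELECT filename AS fname, suffix, filepath, * FROM dfs.media.`/jpg/*`"))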

We could work with the “map” columns via Drill’s SQL (there’s a hedged sketch of that just after the next glimpse), but this is just metadata and, even if there are many files, most R folks have sufficient system memory these days to collect it all and work with it locally. There’s nothing stopping you from staying on the SQL side, though, and it may be a better choice if you’ll be using this to process huge archives. But, we’ll do this in R and convert a bunch of field types along the way:

from_map <- function(x) { map(x, jsonlite::fromJSON) } # turn the JSON-encoded MAP columns into R lists

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  collect() %>%
  mutate_at(
    .vars = vars(
      JPEG, JFIF, ExifSubIFD, AppleMakernote, GPS, XMP, Photoshop, IPTC, Huffman, FileType
    ),
    .funs=funs(from_map)
  ) %>%
  mutate_at(
    .vars = vars(
      PixelWidth, PixelHeight, DPIWidth, DPIHeight, FrameRate, AudioSampleSize, AudioSampleRate
    ),
    .funs=funs(as.numeric)
  ) %>%
  glimpse() -> media_df
## Observations: 11
## Variables: 29
## $ fname            <chr> "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <dbl> 4032, 4032, 4032, 4032, 4032, 4032, 3024, 4032...
## $ PixelHeight      <dbl> 3024, 3024, 3024, 3024, 3024, 3024, 4032, 3024...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <dbl> 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ DPIHeight        <dbl> 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioSampleRate  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ JPEG             <list> [["Baseline", "8 bits", "3024 pixels", "4032 ...
## $ JFIF             <list> [["1.1", "none", "72 dots", "72 dots", "0", "...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <list> [["1/2227 sec", "f/1.8", "Program normal", "2...
## $ AppleMakernote   <list> [["5", "[558 values]", "[104 values]", "1", "...
## $ GPS              <list> [["N", "44° 19' 6.3\"", "W", "-68° 11' 22.39\...
## $ XMP              <list> [["4", ["2017-06-22T14:28:04"], ["2017-06-22T...
## $ Photoshop        <list> [["48 89 11 77 33 105 192 33 170 252 63 34 43...
## $ IPTC             <list> [["UTF-8", "2", "14:28:04", "2017:06:22", "20...
## $ Huffman          <list> [["4 Huffman tables"], ["4 Huffman tables"], ...
## $ FileType         <list> [["JPEG", "Joint Photographic Experts Group",...
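
As promised, here’s a rough look at staying on the Drill SQL side for those nested MAP columns. Treat it as a hedged sketch: the dot-notation-on-a-table-alias syntax comes from Drill’s nested-data docs, the Make/Model keys are just the ones visible in the ExifIFD0 output above, and I haven’t battle-tested it against the image plugin:

# hedged sketch: pull individual fields out of a MAP column inside Drill itself
tbl(con, sql("
SELECT filename AS fname, t.ExifIFD0.Make AS make, t.ExifIFD0.Model AS model
FROM dfs.media.`/jpg/*` t
")) %>%
  collect()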

Now, we can do anything with the data, including getting the average file size:

mutate(media_df, FileSize = str_replace(FileSize, " bytes", "") %>% as.numeric()) %>%
  summarise(mean(FileSize))
## # A tibble: 1 x 1
##   `mean(FileSize)`
##              <dbl>
## 1         3878963.
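
One more, since the nested data is now easy to poke at: ExifIFD0 didn’t make it into the from_map() pass above (it’s still a JSON string in media_df), so we can parse it with the same helper and tally photos by camera model. The Model key comes straight from the glimpse output, and this assumes every file actually carries that tag:

media_df %>%
  mutate(ExifIFD0 = from_map(ExifIFD0)) %>%       # parse the one remaining JSON-string metadata column
  mutate(camera = map_chr(ExifIFD0, "Model")) %>% # extract the camera model (assumes the tag is present)
  count(camera, sort = TRUE)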

FIN

The enhancements to the JDBC interface have only been given a light workout but seem to be doing well so far. Kick the tyres and file an issue if you run into any problems. ODBC users should not have to wait long for new drivers, and src_drill() aficionados can keep chugging along as before without issue.

For those new to Apache Drill, there’s now an official Docker image for it so you can get up and running without adding too much cruft to your local systems. I may add support for spinning up and managing a Drill container to the sergeant package, so keep your eyes on pushes to the repo.

Also keep an eye on the mini-cookbook as I’ll be modifying it to account for the new package changes and to introduce additional Drill features.