
Despite having sailed through the core components of this year’s Talk Like A Pirate Day R post a few months ago, time has been an enemy of late, so this will be a short post that others can build on, especially since there’s lots more ground to cover in the data.

DMC-WhAt?

Since this is TLAPD, I’ll pilfer some of the explanation from GitHub itself:

The Digital Millennium Copyright Act (DMCA) “provides a safe harbor for service providers that host user-generated content. Since even a single claim of copyright infringement can carry statutory damages of up to $150,000, the possibility of being held liable for user-generated content could be very harmful for service providers. With potential damages multiplied across millions of users, cloud-computing and user-generated content sites like YouTube, Facebook, or GitHub probably never would have existed without the DMCA (or at least not without passing some of that cost downstream to their users).”

“The DMCA addresses this issue by creating a copyright liability safe harbor for internet service providers hosting allegedly infringing user-generated content. Essentially, so long as a service provider follows the DMCA’s notice-and-takedown rules, it won’t be liable for copyright infringement based on user-generated content. Because of this, it is important for GitHub to maintain its DMCA safe-harbor status.”

(I’ll save you from a long fact- and opinion-based diatribe on the DMCA, but suffice it to say it’s done far more harm than good IMO. Also, hopefully the “piracy” connection makes sense, now :-)

If your initial reaction was “What does the DMCA have to do with GitHub?” it likely (quickly) turned to “Oh…GitHub is really just a version-controlled file sharing service…”. As such it has to have a robust takedown policy and process.

I don’t know if Microsoft is going to keep the practice of being open about DMCA requests now that they own GitHub nor do I know if they’ll use the same process on themselves (since, as we’ll see, they have issued DMCA requests to GitHub in the past). For now, we’ll assume they will, thus making the code from this post usable in the future to check on the status of DMCA requests over a longer period of time. But first we need the data.

Hunting for treasure in the data hoard

Unsurprisingly, GitHub stores DMCA data on GitHub. Ironically, they store it openly — in-part — to shine a light on what giant, global megacorps like Microsoft are doing. Feel free to use one of the many R packages to clone the repo, but a simple command-line git clone git@github.com:github/dmca.git is quick and efficient (not everything needs to be done from R).
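If you would rather stay in R for the clone step, here is a minimal sketch using the git2r package (one of the many packages alluded to above); the target path is just an example, so adjust it to wherever you want the repo to live:

# one in-R alternative to the command-line clone (sketch, not required)
# the local_path below is an example; use whatever prefix you like
git2r::clone(
  url = "https://github.com/github/dmca.git",
  local_path = "/data/github/dmca"
)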

The directory structure looks like this:

├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
├── 2017-02-01-RBoyApps-2.md
├── 2017-02-15-DeutscheBank.md
├── 2017-03-13-Jetbrains.md
├── 2017-06-26-Wipro-Counternotice.md
├── 2017-06-30-AdflyLink.md
├── 2017-07-28-Toontown-2.md
├── 2017-08-31-Tourzan.md
├── 2017-09-04-Random-House.md
├── 2017-09-05-RandomHouse-2.md
├── 2017-09-18-RandomHouse.md
├── 2017-09-19-Ragnarok.md
├── 2017-10-10-Broadcom.md
├── 2018
├── 2018-02-01-NihonAdSystems.md
├── 2018-03-03-TuneIn.md
├── 2018-03-16-Wabg.md
├── 2018-05-17-Packt.md
├── 2018-06-12-Suning.md
├── 2018-07-31-Pearson.md
├── CONTRIBUTING.md
├── data
└── README.md

Unfortunately, the data directory contains fools’ gold (it’s just high-level summary data).

We want DMCA filer names, repo names, file names and the DMCA notice text (though we’ll be leaving NLP projects up to the intrepid readers). For that, it will mean processing the directories of notices.

Notices are named (sadly, with some inconsistency) like this: 2018-03-15-Microsoft.md. Year, month, date and name of org. The contents are text-versions of correspondence (usually email text) that have some requirements in order to be processed. There’s also an online form one can fill out but it’s pretty much a free text field with some semblance of structure. It’s up to humans to follow that structure and — as such — there is inconsistency in the text as well. (Perhaps this is a great lesson that non-constrained inputs and human-originated filenames aren’t a great plan for curating data stores.)

You may have seen what look like takedown files in the top level of the repo. I have no idea if they are legit (since they aren’t in the structured directories) so we’ll be ignoring them.

When I took a look at the directories, some files end in .markdown but most end in .md. We’ll cover both cases (you’ll need to replace /data/github/dmca with the prefix where you stored the repo):

library(tools)
library(stringi)
library(hrbrthemes)
library(tidyverse)

list.files(
  path = sprintf("/data/github/dmca/%s", 2011:2018), 
  pattern = "\\.md$|\\.markdown$",
  full.names = TRUE
) -> dmca_files

As noted previously, we’re going to focus on DMCA views over time, look at organizations who filed DMCA notices and the notice content. It turns out the filenames also distinguish whether a notice is a takedown request or a counter-notice (i.e. an “oops…my bad…” by a takedown originator) or a retraction, so we’ll collect that metadata as well. Finally, we’ll slurp up the text along the way.

Again, I’ve taken a pass at this and found out the following:

  • Some dates are coded incorrectly (infrequently enough to be able to fix them with a few simple rules)
  • Some org names are coded incorrectly (often enough to skew counts, so we need to deal with it)
  • Counter-notice and retraction tags are inconsistent, so we need to deal with that as well

It’s an ugly pipeline, so I’ve annotated these initial steps to make what’s going on a bit clearer:

map_df(dmca_files, ~{
  
  file_path_sans_ext(.x) %>% # remove extension
    basename() %>% # get just the filename
    stri_match_all_regex(
      "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)" # try to find the date and the org
    ) %>% 
    unlist() -> date_org
  
  if (is.na(date_org[2])) { # handle a special case where the date pattern above didn't work
    file_path_sans_ext(.x) %>% 
      basename() %>%
      stri_match_all_regex(
        "([[:digit:]]{4}-[[:digit:]]{2})-(.*)"
      ) %>% 
      unlist() -> date_org
  }
  
  # a few files are still broken so we'll deal with them as special cases
  
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    date_org <- c("", "2017-11-06", "1776")
  } else if (stri_detect_fixed(.x, "2017/2017-Offensive-Security-7.md")) {
    date_org <- c("", "2017-12-30", "Offensive-Security-7")
  } else if (stri_detect_fixed(.x, "2017/Offensive-Security-6.md")) {
    date_org <- c("", "2017-12-29", "Offensive-Security-6")
  }
  
  # we used a somewhat liberal regex to capture dates since some are 
  # still broken. We'll deal with those first, then turn them
  # into proper Date objects
  
  list(
    notice_day = case_when(
      date_org[2] == "2015-12-3"  ~ "2015-12-03",
      date_org[2] == "2015-12-7"  ~ "2015-12-07",
      date_org[2] == "2016-08"    ~ "2016-08-01",
      date_org[2] == "2016-10-7"  ~ "2016-10-07",
      date_org[2] == "2016-11-1"  ~ "2016-11-01",
      date_org[2] == "2016-11-3"  ~ "2016-11-03",
      date_org[2] == "2017-06"    ~ "2017-06-01",
      date_org[2] == "0107-05-22" ~ "2017-05-22",
      date_org[2] == "2017-11-1"  ~ "2017-11-01",
      TRUE ~ date_org[2]
    ) %>% 
      lubridate::ymd(),
    notice_org = date_org[3] %>% # sometimes the org name is messed up so we need to clean it up
      stri_replace_last_regex("[-]*[[:digit:]]+$", "") %>% 
      stri_replace_all_fixed("-", " "),
    notice_content = list(read_lines(.x)) # grab the content
  ) -> ret
  
  # and there are still some broken org names
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    ret$notice_org <- "1776"
  } 
  
  ret
  
}) -> dmca

dmca
## # A tibble: 4,460 x 3
##    notice_day notice_org                   notice_content
##                                         
##  1 2011-01-27 sony                             
##  2 2011-01-28 tera                             
##  3 2011-01-31 sony                             
##  4 2011-02-03 sony counternotice                
##  5 2011-02-03 sony                          
##  6 2011-03-24 oracle                            
##  7 2011-03-30 mentor graphics                  
##  8 2011-05-24 cpp virtual world operations     
##  9 2011-06-07 sony                             
## 10 2011-06-13 diablominer                      
## # ... with 4,450 more rows

Much better. We’ve got more deck-swabbing to do, now, to tag the counter-notice and retractions:

mutate(
  dmca,
  counter_notice = stri_detect_regex(notice_org, "counternotice|counter notice"), # regex (not fixed) so the alternation handles the inconsistency
  retraction = stri_detect_fixed(notice_org, "retraction"), 
  notice_org = stri_trans_tolower(notice_org) %>% 
    stri_replace_first_regex("[ ]*(counternotice|counter notice)[ ]*", "") %>% # clean up org names with tags
    stri_replace_first_regex("[ ]*retraction[ ]*", "")
) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##                                              
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

I’ve lower-cased the org names to make it easier to wrangle them since we do, indeed, need to wrangle them.

I’m super-not-proud of the following code block, but I went into it thinking the org name corrections would be infrequent. But, as I worked with the supposedly-cleaned data, I kept adding correction rules and eventually created a monster:

mutate(
  dmca,
  notice_org = case_when(
    stri_detect_fixed(notice_org, "accenture")        ~ "accenture",
    stri_detect_fixed(notice_org, "adobe")            ~ "adobe",
    stri_detect_fixed(notice_org, "amazon")           ~ "amazon",
    stri_detect_fixed(notice_org, "ansible")          ~ "ansible",
    stri_detect_fixed(notice_org, "aspengrove")       ~ "aspengrove",
    stri_detect_fixed(notice_org, "apple")            ~ "apple",
    stri_detect_fixed(notice_org, "aws")              ~ "aws",
    stri_detect_fixed(notice_org, "blizzard")         ~ "blizzard",
    stri_detect_fixed(notice_org, "o reilly")         ~ "oreilly",
    stri_detect_fixed(notice_org, "random")           ~ "random house",
    stri_detect_fixed(notice_org, "casado")           ~ "casadocodigo",
    stri_detect_fixed(notice_org, "ccp")              ~ "ccp",
    stri_detect_fixed(notice_org, "cisco")            ~ "cisco",
    stri_detect_fixed(notice_org, "cloudsixteen")     ~ "cloud sixteen",
    stri_detect_fixed(notice_org, "collinsharper")    ~ "collins ’harper",
    stri_detect_fixed(notice_org, "contentanalytics") ~ "content analytics",
    stri_detect_fixed(notice_org, "packt")            ~ "packt",
    stri_detect_fixed(notice_org, "penguin")          ~ "penguin",
    stri_detect_fixed(notice_org, "wiley")            ~ "wiley",
    stri_detect_fixed(notice_org, "wind river")       ~ "windriver",
    stri_detect_fixed(notice_org, "windriver")        ~ "windriver",
    stri_detect_fixed(notice_org, "wireframe")        ~ "wireframe shader",
    stri_detect_fixed(notice_org, "listen")           ~ "listen",
    stri_detect_fixed(notice_org, "wpecommerce")      ~ "wpecommerce",
    stri_detect_fixed(notice_org, "yahoo")            ~ "yahoo",
    stri_detect_fixed(notice_org, "youtube")          ~ "youtube",
    stri_detect_fixed(notice_org, "x pressive")       ~ "xpressive",
    stri_detect_fixed(notice_org, "ximalaya")         ~ "ximalaya",
    stri_detect_fixed(notice_org, "pragmatic")        ~ "pragmatic",
    stri_detect_fixed(notice_org, "evadeee")          ~ "evadeee",
    stri_detect_fixed(notice_org, "iaai")             ~ "iaai",
    stri_detect_fixed(notice_org, "line corp")        ~ "line corporation",
    stri_detect_fixed(notice_org, "mediumrare")       ~ "medium rare",
    stri_detect_fixed(notice_org, "profittrailer")    ~ "profit trailer",
    stri_detect_fixed(notice_org, "smartadmin")       ~ "smart admin",
    stri_detect_fixed(notice_org, "microsoft")        ~ "microsoft",
    stri_detect_fixed(notice_org, "monotype")         ~ "monotype",
    stri_detect_fixed(notice_org, "qualcomm")         ~ "qualcomm",
    stri_detect_fixed(notice_org, "pearson")          ~ "pearson",
    stri_detect_fixed(notice_org, "sony")             ~ "sony",
    stri_detect_fixed(notice_org, "oxford")           ~ "oxford",
    stri_detect_fixed(notice_org, "oracle")           ~ "oracle",
    stri_detect_fixed(notice_org, "out fit")          ~ "outfit",
    stri_detect_fixed(notice_org, "nihon")            ~ "nihon",
    stri_detect_fixed(notice_org, "opencv")           ~ "opencv",
    stri_detect_fixed(notice_org, "newsis")           ~ "newsis",
    stri_detect_fixed(notice_org, "nostarch")         ~ "nostarch",
    stri_detect_fixed(notice_org, "stardog")          ~ "stardog",
    stri_detect_fixed(notice_org, "mswindows")        ~ "microsoft",
    stri_detect_fixed(notice_org, "moody")            ~ "moody",
    stri_detect_fixed(notice_org, "minecraft")        ~ "minecraft",
    stri_detect_fixed(notice_org, "medinasoftware")   ~ "medina software",
    stri_detect_fixed(notice_org, "linecorporation")  ~ "line corporation",
    stri_detect_fixed(notice_org, "steroarts")        ~ "stereoarts",
    stri_detect_fixed(notice_org, "mathworks")        ~ "mathworks",
    stri_detect_fixed(notice_org, "tmssoftware")      ~ "tmssoftware",
    stri_detect_fixed(notice_org, "toontown")         ~ "toontown",
    stri_detect_fixed(notice_org, "wahoo")            ~ "wahoo",
    stri_detect_fixed(notice_org, "webkul")           ~ "webkul",
    stri_detect_fixed(notice_org, "whmcs")            ~ "whmcs",
    stri_detect_fixed(notice_org, "viber")            ~ "viber",
    stri_detect_fixed(notice_org, "totalfree")        ~ "totalfreedom",
    stri_detect_fixed(notice_org, "successacademies") ~ "success academies",
    stri_detect_fixed(notice_org, "ecgwaves")         ~ "ecgwaves",
    stri_detect_fixed(notice_org, "synology")         ~ "synology",
    stri_detect_fixed(notice_org, "infistar")         ~ "infistar’",
    stri_detect_fixed(notice_org, "galleria")         ~ "galleria",
    stri_detect_fixed(notice_org, "jadoo")            ~ "jadoo",
    stri_detect_fixed(notice_org, "dofustouch")       ~ "dofus touch",
    stri_detect_fixed(notice_org, "gravityforms")     ~ "gravity forms",
    stri_detect_fixed(notice_org, "fujiannewland")    ~ "fujian newland",
    stri_detect_fixed(notice_org, "dk uk")            ~ "dk",
    stri_detect_fixed(notice_org, "dk us")            ~ "dk",
    stri_detect_fixed(notice_org, "dkuk")             ~ "dk",
    stri_detect_fixed(notice_org, "dkus")             ~ "dk",
    stri_detect_fixed(notice_org, "facet")            ~ "facet",
    stri_detect_fixed(notice_org, "fh admin")         ~ "fhadmin",
    stri_detect_fixed(notice_org, "electronicarts")   ~ "electronic arts",
    stri_detect_fixed(notice_org, "daikonforge")      ~ "daikon forge",
    stri_detect_fixed(notice_org, "corgiengine")      ~ "corgi engine",
    stri_detect_fixed(notice_org, "epicgames")        ~ "epic  games",
    stri_detect_fixed(notice_org, "essentialmode")    ~ "essentialmode",
    stri_detect_fixed(notice_org, "jetbrains")        ~ "jetbrains",
    stri_detect_fixed(notice_org, "foxy")             ~ "foxy themes",
    stri_detect_fixed(notice_org, "cambridgemobile")  ~ "cambridge mobile",
    stri_detect_fixed(notice_org, "offensive")        ~ "offensive security",
    stri_detect_fixed(notice_org, "outfit")           ~ "outfit",
    stri_detect_fixed(notice_org, "haihuan")          ~ "shanghai haihuan",
    stri_detect_fixed(notice_org, "schuster")         ~ "simon & schuster",
    stri_detect_fixed(notice_org, "silicon")          ~ "silicon labs",
    TRUE ~ notice_org
  )) %>% 
  arrange(notice_day) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##                                              
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

You are heartily encouraged to create a translation table in place of that monstrosity.
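For example, a small pattern-to-canonical-name lookup table plus a helper would be far easier to maintain than the giant case_when(). Here’s a minimal, hypothetical sketch (the org_fixes table and fix_org() helper are not from the original pipeline, and only a few of the rules are shown):

# hypothetical sketch: a pattern -> canonical-name translation table
# (only a handful of the rules from the case_when() above are shown)
org_fixes <- tibble::tribble(
  ~pattern,       ~canonical,
  "mswindows",    "microsoft",
  "microsoft",    "microsoft",
  "random",       "random house",
  "wind ?river",  "windriver",
  "dk ?u[sk]",    "dk"
)

fix_org <- function(org, fixes = org_fixes) {
  idx <- purrr::map_int(org, ~{
    hit <- which(stringi::stri_detect_regex(.x, fixes$pattern))
    if (length(hit) > 0) hit[1] else NA_integer_ # first matching rule wins
  })
  ifelse(is.na(idx), org, fixes$canonical[idx])
}

# dmca <- mutate(dmca, notice_org = fix_org(notice_org))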

But, we finally have usable data. You can avoid the above by downloading https://rud.is/dl/github-dmca.json.gz and using jsonlite::stream_in() or ndjson::stream_in() to get the above data frame.

Hoisting the mizzen sailplots

Let’s see what the notice submission frequency looks like over time:

# assuming you downloaded it as suggested
jsonlite::stream_in(gzfile("~/Data/github-dmca.json.gz")) %>% 
  tbl_df() %>% 
  mutate(notice_day = as.Date(notice_day)) -> dmca

filter(dmca, !retraction) %>% 
  mutate(
    notice_year = lubridate::year(notice_day),
    notice_ym = as.Date(format(notice_day, "%Y-%m-01"))
  ) %>% 
  dplyr::count(notice_ym) %>% 
  arrange(notice_ym) %>% 
  ggplot(aes(notice_ym, n)) +
  ggalt::stat_xspline(
    geom="area", fill=alpha(ft_cols$blue, 1/3), color=ft_cols$blue
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Notices", 
    title = "GitHub DMCA Notices by Month Since 2011"
  ) +
  theme_ft_rc(grid="XY")

I’m not naive, but that growth was a bit of a shocker, which made me want to jump in and see who the top-filers were:

count(dmca, notice_org, sort=TRUE)
## # A tibble: 1,948 x 2
##    notice_org             n
##                  
##  1 webkul                92
##  2 pearson               90
##  3 stereoarts            86
##  4 qualcomm              72
##  5 codility              71
##  6 random house          62
##  7 outfit                57
##  8 offensive security    49
##  9 sensetime             46
## 10 penguin               44
## # ... with 1,938 more rows

“Webkul” is an enterprise eCommerce (I kinda miss all the dashed “e-” prefixes we used to use back in the day) platform. I mention that since I didn’t know what it was either. There are some recognizable names there like “Pearson”, “Random House” and “Penguin”, which make sense since it’s easy to improperly share e-books (modern non-dashed idioms be darned).

Let’s see the top 15 orgs by year since 2015 (since that’s when DMCA filings really started picking up and because I like 2×2 grids). We’ll also leave out counter-notices and retractions and alpha-order it since I want to be able to scan the names more than I want to see rank:

filter(dmca, !retraction, !counter_notice, notice_day >= as.Date("2015-01-01")) %>%
  mutate(
    notice_year = lubridate::year(notice_day),
  ) %>% 
  dplyr::count(notice_year, notice_org) %>% 
  group_by(notice_year) %>% 
  top_n(15) %>% 
  slice(1:15) %>% 
  dplyr::ungroup() %>%
  mutate( # a-z order with "a" on top 
    notice_org = factor(notice_org, levels = unique(sort(notice_org, decreasing = TRUE)))
  ) %>% 
  ggplot(aes(n, notice_org, xend=0, yend=notice_org)) +
  geom_segment(size = 2, color = ft_cols$peach) +
  facet_wrap(~notice_year, scales = "free") +
  scale_x_comma(limits=c(0, 60)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 15 GitHub DMCA Filers by Year Since 2015"
  ) +
  theme_ft_rc(grid="X")

Let’s look at rogues’ gallery of the pirates themselves:

dmca %>% 
  mutate(
    ghusers = notice_content %>% 
      map(~{
        stri_match_all_regex(.x, "http[s]*://github.com/([^/]+)/.*") %>% 
          discard(~is.na(.x[,1])) %>% 
          map_chr(~.x[,2]) %>% 
          unique() %>% 
          discard(`==`, "github") %>% 
          discard(~grepl(" ", .x))
      })
  ) %>% 
  unnest(ghusers) %>% 
  dplyr::count(ghusers, sort=TRUE) %>% 
  print() -> offenders
## # A tibble: 18,396 x 2
##    ghusers           n
##             
##  1 RyanTech         16
##  2 sdgdsffdsfff     12
##  3 gamamaru6005     10
##  4 ranrolls         10
##  5 web-padawan      10
##  6 alexinfopruna     8
##  7 cyr2242           8
##  8 liveqmock         8
##  9 promosirupiah     8
## 10 RandyMcMillan     8
## # ... with 18,386 more rows

As you might expect, most users have only one or two complaints filed against them, since it was likely an oversight more than malice on their part:

ggplot(offenders, aes(x="", n)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(breaks=1:16, limits=c(1,16)) +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of GitHub DMCA Complaints Received by a User"
  ) +
  theme_ft_rc(grid="X")

But, there are hundreds of digital buccaneers, and we can have a bit of fun with them especially since I noticed quite a few had default (generated) avatars with lots of white in them (presenting this with a pirate hat-tip to Maëlle & Lucy):

library(magick)

dir.create("gh-pirates")
dir.create("gh-pirates-jpeg")

# this kinda spoils the surprise; i should have renamed it
download.file("https://rud.is/dl/jolly-roger.jpeg", "jolly-roger.jpeg")

ghs <- safely(gh::gh) # no need to add cruft to our namespace for one function 

filter(offenders, n>2) %>% 
  pull(ghusers) %>% 
  { .pb <<- progress_estimated(length(.)); . } %>% # there are a few hundred of them
  walk(~{
    .pb$tick()$print()
    user <- ghs(sprintf("/users/%s", .x))$result # the get-user and then download avatar idiom shld help us not bust GH API rate limits
    if (!is.null(user)) {
      download.file(user$avatar_url, file.path("gh-pirates", .x), quiet=TRUE) # can't assume avatar file type
    }
  })

# we'll convert them all to jpeg and resize them at the same time plus make sure they aren't greyscale
dir.create("gh-pirates-jpeg")
list.files("gh-pirates", full.names = TRUE, recursive = FALSE) %>%
  walk(~{
    image_read(.x) %>% 
      image_scale("72x72") %>% 
      image_convert("jpeg", type = "TrueColor", colorspace = "rgb") %>% 
      image_write(
        path = file.path("gh-pirates-jpeg", sprintf("%s.jpeg", basename(.x))), 
        format = "jpeg"
      )
  })

set.seed(20180919) # seemed appropriate for TLAPD
RsimMosaic::composeMosaicFromImageRandomOptim( # this takes a bit
  originalImageFileName = "jolly-roger.jpeg",
  outputImageFileName = "gh-pirates-flag.jpeg",
  imagesToUseInMosaic = "gh-pirates-jpeg",
  removeTiles = TRUE,
  fracLibSizeThreshold = 0.1
)

Finally, we’ll look at the types of pilfered files. To do that, we’ll first naively look for github repo URLs (there are github.io ones in there too, though, which is an exercise left to ye corsairs):

mutate(
  dmca,
  files = notice_content %>% 
    map(~{
      paste0(.x, collapse = " ") %>% 
        stri_extract_all_regex(gh_url_pattern, omit_no_match=FALSE, opts_regex = stri_opts_regex(TRUE)) %>% 
        unlist() %>% 
        stri_replace_last_regex("[[:punct:]]+$", "")
    })
) -> dmca_with_files
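Note that gh_url_pattern isn’t defined in the excerpt above. A rough, hypothetical stand-in (likely less robust than the original) would be something along these lines:

# rough, hypothetical stand-in for the undefined gh_url_pattern;
# trailing punctuation gets stripped in the pipeline above anyway
gh_url_pattern <- "\\bhttps?://github\\.com/[^\\s\"'<>]+"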

Now, we can see just how many resources/repos/files are in a complaint:

filter(dmca_with_files, map_lgl(files, ~!is.na(.x[1]))) %>% 
  select(notice_day, notice_org, files) %>% 
  mutate(num_refs = lengths(files)) %>%
  arrange(desc(num_refs)) %>%  # take a peek at the heavy hitters
  print() -> files_with_counts
## # A tibble: 4,020 x 4
##    notice_day notice_org files         num_refs
##                          
##  1 2014-08-27 monotype        2504
##  2 2011-02-03 sony            1160
##  3 2016-06-08 monotype        1015
##  4 2018-04-05 hexrays            906
##  5 2016-06-15 ibo                877
##  6 2016-08-18 jetbrains          777
##  7 2017-10-14 cengage            611
##  8 2016-08-23 yahoo              556
##  9 2017-08-30 altis              529
## 10 2015-09-22 jetbrains          468
## # ... with 4,010 more rows

ggplot(files_with_counts, aes(x="", num_refs)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(trans="log10") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of Files/Repos per-GitHub DMCA Complaint",
    caption = "Note: Log10 Scale"
  ) +
  theme_ft_rc(grid="X")

And, what are the most offensive file types (per-year):

mutate(
  files_with_counts, 
  extensions = map(files, ~tools::file_ext(.x) %>% 
    discard(`==` , "")
  )
) %>% 
  select(notice_day, notice_org, extensions) %>% 
  unnest(extensions) %>% 
  mutate(year = lubridate::year(notice_day)) -> file_types

count(file_types, year, extensions) %>% 
  filter(year >= 2014) %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  slice(1:10) %>% 
  ungroup() %>% 
  ggplot(aes(year, n)) +
  ggrepel::geom_text_repel(
    aes(label = extensions, size=n), 
    color = ft_cols$green, family=font_ps, show.legend=FALSE
  ) +
  scale_size(range = c(3, 10)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 10 File-type GitHub DMCA Takedowns Per-year"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y=element_blank())

It’s not all code (lots of fonts and books) but there are plenty of source code files in those annual lists.

FIN

That’s it for this year’s TLAPD post. You’ve got the data and some starter code so build away! There are plenty more insights left to find and if you do take a stab at finding your own treasure, definitely leave a note in the comments.

The wayback package has had an update to more efficiently retrieve mementos and added support for working with the Internet Archive’s advanced search+scrape API.

Search/Scrape

The search/scrape interface lets you examine the IA collections and download what you are after (programmatically). The main function is ia_scrape() but you can also paginate through results with the helper functions provided.

To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First, we need to search the collection then choose a target URL to retrieve and finally download it. The identifier is the key element to ensure we can retrieve the information about a particular collection.

library(wayback)

nasa <- ia_scrape("collection:nasa", count=100L)

tibble:::print.tbl_df(nasa)
## # A tibble: 100 x 3
##    identifier addeddate            title                                       
##                                                                 
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit         
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit           
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit                 
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit               
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant                       
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit                      
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise           
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop                        
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit                        
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))

## # A tibble: 6 x 4
##   file                       link                                                               last_mod          size 
## 1 00-042-154.jpg             https://archive.org/download/00-042-154/00-042-154.jpg             06-Nov-2000 15:34 1.2M 
## 2 00-042-154_archive.torrent https://archive.org/download/00-042-154/00-042-154_archive.torrent 06-Jul-2018 11:14 1.8K 
## 3 00-042-154_files.xml       https://archive.org/download/00-042-154/00-042-154_files.xml       06-Jul-2018 11:14 1.7K 
## 4 00-042-154_meta.xml        https://archive.org/download/00-042-154/00-042-154_meta.xml        03-Jun-2016 02:06 1.4K 
## 5 00-042-154_thumb.jpg       https://archive.org/download/00-042-154/00-042-154_thumb.jpg       26-Aug-2009 16:30 7.7K 
## 6 __ia_thumb.jpg             https://archive.org/download/00-042-154/__ia_thumb.jpg             06-Jul-2018 11:14 26.6K

download.file(item$link[1], file.path("man/figures", item$file[1]))

I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.

The API is still not set in stone, so you're encouraged to submit questions/suggestions.

FIN

The vignette is embedded below and frame-busted here. It covers a very helpful and practical use-case identified recently by an OP on StackOverflow.

There's also a new pkgdown-gen'd site for the package.

Issues & PRs welcome at your community coding site of choice.

I was chatting with some cyber-mates at a recent event and the topic of cyber attacks on the U.S. power grid came up (as it often does these days). The conversation was brief, but the topic made its way into active memory and resurfaced when I saw today’s Data Is Plural newsletter, which noted that “Utility companies are required to report major power outages and other ‘electric disturbance events’ to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets.” (Follow the links in the newsletter to get the URLs for the site, since Jeremy deserves the traffic.)

Many of us data nerds use the Data Is Plural newsletters as fodder for class assignments, blog posts or personal “data katas”. This time, I was after cyber attack data.

When you head to the annual reports URL, you’re greeted with a SharePoint-driven HTML table:

So, our options are PDF or XLS (and I mean .xls, too; they’re not modern .xlsx files). We’ll opt for the latter and cache the files locally before working on them. One “gotcha” is that the hrefs look like this: https://www.oe.netl.doe.gov/download.aspx?type=OE417XLS&ID=78 — i.e. no filenames. But the filenames come along for the ride in a content-disposition response header when an HTTP GET or HEAD request is issued, so we’ll use that metadata instead of siphoning off the year from the first column of the table:

library(rvest)
library(readxl)
library(httr)      # for HEAD()/GET()/stop_for_status() used below
library(tidyverse)

doe <- read_html("https://www.oe.netl.doe.gov/OE417_annual_summary.aspx")

dir.create("~/Data/doe-cache-dir", showWarnings = FALSE)

html_nodes(doe, xpath=".//a[contains(., 'XLS')]") %>%
  html_attr("href") %>%
  { .pb <<- progress_estimated(length(.)) ; . } %>% # we likely don't rly need progress bars tho
  walk(~{

    .pb$tick()$print()

    dl_url <- sprintf("https://www.oe.netl.doe.gov/%s", .x)

    res <- HEAD(dl_url) # so we can get the filename
    stop_for_status(res) # halt on network errors

    fil <- str_replace(
      string = res$headers['content-disposition'],
      pattern = "attachment; filename=",
      replacement = "~/Data/doe-cache-dir/"
    )

    if (!file.exists(fil)) { # this pattern allows us to issue a lightweight HTTP HEAD request, then cache and refresh w/o wasting server/our bandwidth/cpu
      res <- GET(dl_url, httr::write_disk(fil))
      stop_for_status(res)
      Sys.sleep(5) # be kind to the server(s) but only if we're downloading data files since HEAD requests don't really tax services
    }

  })

Let's do a quick check for the likelihood of uniformity. Some of these files go back to 2002 and I suspect they're more geared for "printing" (the PDF counterparts were a clue) than programmatic processing:

# check to see if the files are all the same (spoiler alert: they're not)
list.files("~/Data/doe-cache-dir", "xls", full.names=TRUE) %>%
  map_df(~list(
    fil = basename(.x),
    ncols = read_xls(.x, col_names=FALSE, col_types="text") %>% ncol()
  )) -> cols_profile

cols_profile
## # A tibble: 17 x 2
##    fil                     ncols
##                       
##  1 2002_Annual_Summary.xls     8
##  2 2003_Annual_Summary.xls     8
##  3 2004_Annual_Summary.xls     8
##  4 2005_Annual_Summary.xls     8
##  5 2006_Annual_Summary.xls     8
##  6 2007_Annual_Summary.xls     8
##  7 2008_Annual_Summary.xls     8
##  8 2009_Annual_Summary.xls     8
##  9 2010_Annual_Summary.xls     8
## 10 2011_Annual_Summary.xls     9
## 11 2012_Annual_Summary.xls     9
## 12 2013_Annual_Summary.xls     9
## 13 2014_Annual_Summary.xls     9
## 14 2015_Annual_Summary.xls    11
## 15 2016_Annual_Summary.xls    11
## 16 2017_Annual_Summary.xls    11
## 17 2018_Annual_Summary.xls    11

O_o

At this point, I paused and wanted to see what was going on in the minds of the DoE staffers charged with releasing this data.

(You can grab the macOS Quick Look preview snaps of all of those here.)

From 2002 to 2010 the Excel documents are clearly designed for print as the target, complete with month breaklines and repeated (+ heavily formatted) headers. They even left other tabs around (inconsistently).

Things got a little better between 2011 and 2014, but we still have month breaks and occasional, repeated headers (someone likely learned how to generate headers-per-page in Excel in 2011 then the administration changed hands and new staffers came in and fubar'd 2012 a bit before going back to the slightly better format).

Prior to 2015, the print-as-target trumped programmatic access. Interestingly enough, this is roughly when "data science" was on the upswing (in a major way):



Starting with 2015 we have a "month" column, more uniformity for specifying dates & times and more care given to other value fields, so kudos to the then and current staffers who made our data-machinating lives easier.

This really is, I believe, a byproduct of modern "data literacy". Folks in charge of gathering and publishing data are realizing there are multiple ways others want/need to consume the data. The original purpose for this data was to hand a report to someone after regulations were put in place to mandate notifications. I'm willing to bet nobody did anything with this data for a few years. Staffers either learned to wield Excel better or new staffers came in with this new knowledge. Finally, recent years clearly show that the staffers realize folks are as (or more) likely to programmatically consume this information as they are to read a long list of events. More work is needed (an API or CSV/JSON output would be super cool) but it's great to see data literacy alive and well in the halls of the U.S. gov.

Said modern format changes do not really help us work with the complete data set and the more recent files have issues all their own, including inconsistency in the way the date/time columns are represented in Excel cells.

By golly, we're still going to try to read all these files in and work with them (for at least the purpose I originally set out on). We'll have to deal with the differences in columns and come up with a way to remove non-data rows. I also kinda at least want dates as dates. Here's my stab at an initial clean-up (there's lots more work to do, though):

map2(cols_profile$fil, cols_profile$ncols, ~{

  if (.y == 8) { # handle 8 cols

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "region", "time", "area", "event_type", "loss", "customers_affected", "date_restored"),
      col_types = c("date", "text", "text", "text", "text", "text", "text", "date")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 9) { # handle 9 cols

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "time_began", "date_restored", "time_restored", "area", "region", "event_type", "loss", "customers_affected"),
      col_types = c("date", "guess", "date", "guess", "text", "text", "text", "text", "text")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 11) { # handle 11 cols

    # note that the date columns aren't uniform in the Excel spreadsheets even in these more data-literate files :-(

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("month", "date_began", "time_began", "date_restored", "time_restored", "area", "region", "alert_criteria", "event_type", "loss", "customers_affected"),
      col_types = c("text", "text", "guess", "text", "guess", "text", "text", "text", "text", "text", "text")
    )) %>%
      mutate(
        date_began = case_when(
          str_detect(date_began, "/") ~ suppressWarnings(as.Date(date_began, format="%m/%d/%Y")),
          str_detect(date_began, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_began), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      mutate(
        date_restored = case_when(
          str_detect(date_restored, "/") ~ suppressWarnings(as.Date(date_restored, format="%m/%d/%Y")),
          str_detect(date_restored, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_restored), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      filter(!is.na(date_began))

  }

}) -> reports

reports[[1]]
## # A tibble: 23 x 8
##    date_began region time                area      event_type      loss  customers_affec… date_restored
##                                                       
##  1 2002-01-30 SPP    0.25                Oklahoma  Ice Storm       500   1881134          2002-02-07 12:00:00
##  2 2002-01-29 SPP    Evening             Metropol… Ice Storm       500-… 270000           NA
##  3 2002-01-30 SPP    0.66666666666666663 Missouri  Ice Storm       210   95000            2002-02-10 21:00:00
##  4 2002-02-27 WSCC   0.45000000000000001 Californ… Interruption o… 300   255000           2002-02-27 11:35:00
##  5 2002-03-09 ECAR   0                   Lower Pe… Severe Weather  190   190000           2002-03-11 12:00:00
##  6 2002-04-08 WSCC   0.625               Arizona   Vandalism/      0     0                2002-04-09 00:00:00
##  7 2002-07-09 WSCC   0.51875000000000004 Californ… Interruption o… 240   1 PG&E           2002-07-09 19:54:00
##  8 2002-07-19 WSCC   0.49375000000000002 Californ… Interruption o… 240   1 PG&E           2002-07-19 16:30:00
##  9 2002-07-20 NPCC   0.52777777777777779 New York  Fire            278   63500            2002-07-20 20:12:00
## 10 2002-08-02 MAIN   0.52986111111111112 Illinois  Interruption o… 232   53565            2002-08-02 18:36:00
## # ... with 13 more rows

reports[[10]]
## # A tibble: 307 x 9
##    date_began time_began  date_restored       time_restored  area    region event_type loss  customers_affec…
##                                                    
##  1 2011-01-11 0.96388888… 2011-01-11 00:00:00 0.96388888888… Athens… NPCC   Electrica… 0     0
##  2 2011-01-12 0.25        2011-01-12 00:00:00 0.58333333333… Massac… NPCC   Winter St… N/A   80000
##  3 2011-01-13 0.30624999… 2011-01-13 00:00:00 0.34236111111… North … FRCC   Firm Syst… 150   20900
##  4 2011-01-18 0.58333333… 2011-01-18 00:00:00 0.58333333333… Whitma… NPCC   Vandalism  0     0
##  5 2011-01-23 0.29166666… 2011-01-23 00:00:00 0.54166666666… Frankl… WECC   Vandalism  0     0
##  6 2011-01-24 0.55555555… 2011-01-24 00:00:00 0.5625         Newman… WECC   Suspiciou… 0     0
##  7 2011-01-25 0.14097222… 2011-01-25 00:00:00 0.45833333333… Newark… RFC    Vandalism  0     0
##  8 2011-01-26 0.39236111… 2011-01-27 00:00:00 0.70833333333… Carson… WECC   Suspected… 0     0
##  9 2011-01-26 0.39791666… 2011-01-27 00:00:00 0.62708333333… Michig… RFC    Vandalism  0     0
## 10 2011-01-26 0.70833333… 2011-01-31 00:00:00 0.33333333333… Montgo… RFC    Winter St… N/A   210000
## # ... with 297 more rows

reports[[17]]
## # A tibble: 120 x 11
##    month  date_began time_began date_restored time_restored area   region alert_criteria     event_type loss
##                                                          
##  1 Janua… 2018-01-01 0.7645833… 2018-01-02    0.7576388888… Tenne… SERC   Public appeal to … Severe We… Unkn…
##  2 Janua… 2018-01-01 0.7381944… NA            Unknown       Texas: TRE    Public appeal to … Severe We… Unkn…
##  3 Janua… 2018-01-01 0.9006944… 2018-01-02    0.4375        Tenne… SERC   Public appeal to … System Op… Unkn…
##  4 Janua… 2018-01-02 0.4166666… 2018-02-12    0.3333333333… New Y… NPCC   Fuel supply emerg… Fuel Supp… 675
##  5 Janua… 2018-01-02 0.3125     NA            Unknown       South… SERC   Public appeal to … Severe We… 0
##  6 Janua… 2018-01-02 0.28125    2018-01-02    0.375         North… SERC   System-wide volta… Severe We… 14998
##  7 Janua… 2018-01-04 0.0756944… 2018-01-04    0.0895833333… Texas… TRE    Physical attack t… Actual Ph… Unkn…
##  8 Janua… 2018-01-12 0.5472222… 2018-01-12    0.6201388888… Michi… RF     Cyber event that … System Op… 41
##  9 Janua… 2018-01-15 0.1805555… 2018-01-18    0.2416666666… Texas: TRE    Public appeal to … Severe We… Unkn…
## 10 Janua… 2018-01-16 0.625      2018-01-18    0.5416666666… Tenne… SERC   Public appeal to … Severe We… Unkn…
## # ... with 110 more rows, and 1 more variable: customers_affected 

If you'd've handled the above differently, it'd be great if you could drop a note in the comments (for both my benefit and that of any other readers who have kindly made it this far into this tome).

At this point, I really just want to finally see if there are any "cyber" events in the data set and when/where they were. To do that, let's whittle down the columns a bit and make one data frame out of all the reports:

map_df(reports, ~{
  select(.x, date_began, region, area, event_type, customers_affected, date_restored) %>%
    mutate(date_restored = as.Date(date_restored)) %>%
    mutate(
      customers_affected = suppressWarnings(
        str_replace_all(customers_affected, "\\-.*$|[[:punct:]]+|[[:alpha:]]+", "") %>%
          as.numeric()
      )
    ) %>%
    mutate(date_restored = as.Date(ifelse(is.na(date_restored), date_began, date_restored), origin = "1970-01-01"))
}) -> events

events
## # A tibble: 2,243 x 6
##    date_began region area                          event_type                 customers_affect… date_restored
##                                                                       
##  1 2002-01-30 SPP    Oklahoma                      Ice Storm                            1881134 2002-02-07
##  2 2002-01-29 SPP    Metropolitan Kansas City Area Ice Storm                             270000 2002-01-29
##  3 2002-01-30 SPP    Missouri                      Ice Storm                              95000 2002-02-10
##  4 2002-02-27 WSCC   California                    Interruption of Firm Load             255000 2002-02-27
##  5 2002-03-09 ECAR   Lower Peninsula of Michigan   Severe Weather                        190000 2002-03-11
##  6 2002-04-08 WSCC   Arizona                       Vandalism/                                 0 2002-04-09
##  7 2002-07-09 WSCC   California                    Interruption of Firm Power                 1 2002-07-09
##  8 2002-07-19 WSCC   California                    Interruption of Firm Powe…                 1 2002-07-19
##  9 2002-07-20 NPCC   New York                      Fire                                   63500 2002-07-20
## 10 2002-08-02 MAIN   Illinois                      Interruption of Firm Power             53565 2002-08-02
## # ... with 2,233 more rows

Now we're cookin' with gas!

Let's do a quick check to make sure things look OK:

count(events, event_type, sort=TRUE)
## # A tibble: 390 x 2
##    event_type                         n
##                              
##  1 Severe Weather                   369
##  2 Vandalism                        216
##  3 Severe Weather - Thunderstorms    97
##  4 Suspected Physical Attack         87
##  5 System Operations                 74
##  6 Severe Thunderstorms              70
##  7 Winter Storm                      51
##  8 Ice Storm                         42
##  9 Physical Attack - Vandalism       40
## 10 High Winds                        33
## # ... with 380 more rows

Those events+quantities seem to make sense. Now, for my ultimate goal:

filter(events, grepl("cyber|hack", event_type, ignore.case=TRUE)) # yep, grepl() is still in muscle memory
## # A tibble: 19 x 6
##    date_began region area                               event_type             customers_affec… date_restored
##                                                                       
##  1 2003-01-25 ECAR   Cincinnati, Ohio                   Cyber Threat From Int…               NA 2003-01-25
##  2 2011-02-03 RFC    Bowie, Maryland                    Suspected Cyber Attack                0 2011-02-03
##  3 2011-02-17 WECC   Roseville, California              Suspected Cyber Attack                0 2011-02-23
##  4 2011-03-14 RFC    Baltimore, Maryland                Suspected Cyber Attack               NA 2011-03-14
##  5 2011-04-03 SERC   Unknown                            Suspected Cyber Attack                0 2011-04-05
##  6 2011-07-08 RFC    PJM Corporate Office, Pennsylvania Suspected Cyber Attack               NA 2011-07-11
##  7 2011-12-21 WECC   Boise, Idaho                       Suspected Cyber Attack                0 2011-12-21
##  8 2012-01-17 TRE    Austin, Texas                      Suspected Cyber Attack                0 2012-01-17
##  9 2012-02-17 SERC   Little Rock, Arkansas              Suspected Cyber Attack               NA 2012-02-17
## 10 2012-11-15 MRO    Iowa; Michigan                     Suspected Cyber Attack               NA 2012-11-15
## 11 2013-06-21 MRO    Michigan, Iowa                     Suspected Cyber Attack               NA 2013-10-30
## 12 2013-10-16 SERC   Roxboro Plant, North Carolina      Cyber Event with Pote…                0 2013-10-16
## 13 2014-03-20 NPCC   New York                           Suspected Cyber Attack               NA 2014-03-20
## 14 2014-10-21 MRO    Carmel, Indiana                    Suspected Cyber Attack               NA 2014-10-21
## 15 2014-12-30 NPCC   New Hampshire, Massachusetts, Mai… Suspected Cyber Attack               NA 2014-12-31
## 16 2016-02-07 NPCC   New York: Orange County            Cyber Attack                         NA 2016-02-07
## 17 2016-04-12 WECC   Washington: Pend Oreille County    Cyber Attack                          0 2016-04-12
## 18 2016-11-09 WECC   California: Stanislaus County, Sa… Cyber Attack                          0 2016-11-09
## 19 2016-12-13 WECC   California: Riverside County;      Cyber Event                           0 2016-12-13


FIN

There's a great deal of work left out of this power-outage data cleanup work:

  • Turn outage start/end info into POSIXct objects (see the sketch just after this list)
  • Normalize area (make it a data_frame column with state and municipality so it can be unnested nicely)
  • Normalize event_type since many of the phrases used are equivalent and some have more than one categorization
  • Normalize loss somehow and do a better job with customers_affected (I did not double-check my work and I think there are errors in that column, now, but I didn't need it for my goal).
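For that first item, here is a minimal, untested sketch that combines the dates with the Excel fraction-of-day time values, using the 2011 report from above; the time columns come in as text in some files, so coercion warnings are likely:

# sketch: turn date_began + the Excel fraction-of-day time_began value
# (e.g. 0.25 == 06:00) into a single POSIXct column; expect warnings
# where the time cells hold free text ("Evening", "Unknown", etc.)
reports[[10]] %>%
  mutate(
    began = as.POSIXct(date_began) + round(as.numeric(time_began) * 86400)
  ) %>%
  select(date_began, time_began, began)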

Since GitLab snippets are terrible, awful, slow things I've grudgingly posted the above code (contiguously) over at GitHub.

If you wrangle the data more and/or come up with other insights drop a note in the comments with a link to your post.

The sergeant package has a minor update that adds REST API coverage for two “new” storage endpoints that make it possible to add, update and remove storage configurations on-the-fly without using the GUI or manually updating a config file.

This is an especially handy feature when paired with Drill’s new, official Docker container since that means we can:

  • fire up a clean Drill instance
  • modify the storage configuration (to, say, point to a local file system directory)
  • execute SQL ops
  • destroy the Drill instance

all from within R.

This is even more handy for those of us who prefer handling JSON data in Drill than in R directly or with sparklyr.

Quick Example

In a few weeks most of the following verbose-code-snippets will have a more diminutive and user-friendly interface within sergeant, but for now we’ll perform the above bulleted steps with some data that was used in a recent new package which was also generated by another recent new package. The zdnsr::zdns_exec() function ultimately generates a deeply nested JSON file that I really prefer working with in Drill before shunting it into R. Said file is stored, say, in the ~/drilldat directory.

Now, I have Drill running all the time on almost every system I use, but we’ll pretend I don’t for this example. I’ve run zdns_exec() and generated the JSON file and it’s in the aforementioned directory. Let’s fire up an instance and connect to it:

library(sergeant) # git[hu|la]b:hrbrmstr/sergeant
library(dplyr)

docker <- Sys.which("docker") # you do need docker. it's a big dependency, but worth it IMO

(system2(
  command = docker,  
  args = c(
    "run", "-i", 
    "--name", "drill-1.14.0", 
    "-p", "8047:8047", 
    "-v", paste0(c(path.expand("~/drilldat"), "/drilldat"), collapse=":"),
    "--detach", 
    "-t", "drill/apache-drill:1.14.0",
    "/bin/bash"
  ),
  stdout = TRUE
) -> drill_container)
## [1] "d6bc79548fa073d3bfbd32528a12669d753e7a19a6258e1be310e1db378f0e0d"

The above snippet fires up a Drill Docker container (downloads it, too, if not already local) and wires up a virtual directory to it.

We should wait a couple seconds and make sure we can connect to it:

drill_connection() %>% 
  drill_active()
## [1] TRUE

Now, we need to add a storage configuration so we can access our virtual directory. Rather than modify dfs we’ll add a drilldat plugin that will work with the local filesystem just like dfs does:

drill_connection() %>%
  drill_mod_storage(
    name = "drilldat",
    config = '
{
  "config" : {
    "connection" : "file:///", 
    "enabled" : true,
    "formats" : null,
    "type" : "file",
    "workspaces" : {
      "root" : {
        "location" : "/drilldat",
        "writable" : true,
        "defaultInputFormat": null
      }
    }
  },
  "name" : "drilldat"
}
')
## $result
## [1] "success"

Now, we can perform all the Drill ops sergeant has to offer, including ones like this:

(db <- src_drill("localhost"))
## src:  DrillConnection
## tbls: cp.default, dfs.default, dfs.root, dfs.tmp, drilldat.default, drilldat.root,
##   INFORMATION_SCHEMA, sys

tbl(db, "drilldat.root.`/*.json`")
## # Source:   table [?? x 10]
## # Database: DrillConnection
##    data                                              name   error class status timestamp          
##                                                                    
##  1 "{\"authorities\":[{\"ttl\":180,\"type\":\"SOA\"… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  2 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  3 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  4 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  5 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  6 "{\"authorities\":[{\"ttl\":1799,\"type\":\"SOA\… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  7 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  8 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  9 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
## 10 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
## # ... with more rows

(tbl(db, "(
SELECT
 b.answers.name AS question,
 b.answers.answer AS answer
FROM (
 SELECT 
   FLATTEN(a.data.answers) AS answers
 FROM 
   drilldat.root.`/*.json` a
 WHERE 
   (a.status = 'NOERROR')
) b
)") %>% 
 collect() -> dmarc_recs)
## # A tibble: 1,250 x 2
##    question             answer                                                                    
##  *                                                                                      
##  1 _dmarc.washjeff.edu  v=DMARC1; p=none                                                          
##  2 _dmarc.barry.edu     v=DMARC1; p=none; rua=mailto:dmpost@barry.edu,mailto:7cc566d7@mxtoolbox.d…
##  3 _dmarc.yhc.edu       v=DMARC1; pct=100; p=none                                                 
##  4 _dmarc.aacc.edu      v=DMARC1;p=none; rua=mailto:DKIM_DMARC@aacc.edu;ruf=mailto:DKIM_DMARC@aac…
##  5 _dmarc.sagu.edu      v=DMARC1; p=none; rua=mailto:Office365contact@sagu.edu; ruf=mailto:Office…
##  6 _dmarc.colostate.edu v=DMARC1; p=none; pct=100; rua=mailto:re+anahykughvo@dmarc.postmarkapp.co…
##  7 _dmarc.wne.edu       v=DMARC1;p=quarantine;sp=none;fo=1;ri=86400;pct=50;rua=mailto:dmarcreply@…
##  8 _dmarc.csuglobal.edu v=DMARC1; p=none;                                                         
##  9 _dmarc.devry.edu     v=DMARC1; p=none; pct=100; rua=mailto:devry@rua.agari.com; ruf=mailto:dev…
## 10 _dmarc.sullivan.edu  v=DMARC1; p=none; rua=mailto:mcambron@sullivan.edu; ruf=mailto:mcambron@s…
## # ... with 1,240 more rows

Finally (when done), we can terminate the Drill container:

system2(
  command = "docker",
  args = c("rm", "-f", drill_container)
)

FIN

Those system2() calls are verbose and a pain to type/remember, so they’ll be wrapped in some sergeant utility functions (I’m hesitant to add a reticulate dependency to sergeant, which is necessary to use the docker package, hence the system-call wrapper approach).

Check your favorite repository for more sergeant updates and file issues if you have suggestions for how you’d like this Docker API for Drill to be controlled.

Nowadays (I’ve seen that word used so much in journal articles lately that I could not resist using it) I’m using world tile grids more frequently as the need arises to convey the state of exposure of various services at a global (country) scale. Given that necessity fosters invention it seemed that having a ggplot2 geom for world tile grids would make my life easier and also make work more efficient.

To that end, there’s a nascent ggplot2 extension package for making world tile grids called (uncreatively) worldtilegrid. It’s also on GitHub if you’re more comfortable working with the code from there.

The README has examples but we’ll walk through another one here and work with life expectancy data curated by Our World in Data. Rather than draw out this post to a less tenable length with an explanation of how to pull the XHR JSON data into R, just grab the CSV linked with the visualization on that page and substitute the path in the code for where you stored it.

We do need to clean up this data a bit since it has some issues. Let’s do that and carve out some slices into two new data frames so we can work with the most recent curated year and some historical data:

library(hrbrthemes) # gitlab/github :: hrbrmstr/hrbrthemes (use devtools installers)
library(worldtilegrid) # gitlab/github :: hrbrmstr/worldtilegrid (use devtools installers)
library(tidyverse)

cols(
  Entity = col_character(),
  Code = col_character(),
  Year = col_integer(),
  `Life expectancy (Clio-Infra up to 1949; UN Population Division for 1950 to 2015)` = col_double()
) -> le_cols

read_csv("~/Downloads/life-expectancy.csv", col_types = le_cols) %>%
  set_names(c("country", "iso3c", "year", "life_expectancy")) -> lifexp

# clean it up a bit since it's not great data and bust it into groups
# that match the vis on the site vs use continuous raw data, esp since
# the data is really just estimates
filter(lifexp, !is.na(iso3c)) %>%
  filter(nchar(iso3c) == 3) %>%
  mutate(
    grp = cut(
      x = life_expectancy,
      breaks = c(10, 20, 30, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85),
      include.lowest = TRUE
    )
  ) -> lifexp

last_year <-  filter(lifexp, year == last(year))
a_few_years <- filter(lifexp, year %in% c(seq(1900, 2015, 20), 2015))

Now we can use it to make some cartograms.

Using the World Tile Grid geom

You can reference a previous post for what it would normally take to create these tile grids. The idea was to simplify the API down to the caller just needing to specify some country names/ISO[23]Cs (to a country aesthetic) and a value (to a fill aesthetic) and let the package do the rest. To that end, here's how to display the life expectancy data for 2015:

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") + # yep, you can sum up the entire blog post in this one line
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 2015",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number\nof years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to\nstay the same throughout its life",
    caption = "Data source: "

  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

You can add labels via geom_text() and use the wtg stat for it as it provides a number of computed variables you can work with:

  • x,y: the X,Y position of the tile
  • name: Country name (e.g. Afghanistan)
  • country.code: ISO2C country code abbreviation (e.g. AF)
  • iso_3166.2: Full ISO 3166 2-letter abbreviation code (e.g. ISO 3166-2:AF)
  • region: Region name (e.g. Asia)
  • sub.region: Sub-region name (e.g. Southern Asia)
  • region.code: Region code (e.g. 142)
  • sub.region.code: Sub-region code (e.g. 034)

Labeling should be a deliberate decision and used sparingly/with care.
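
If you do decide to label, here’s a minimal sketch (it assumes geom_text() accepts the country aesthetic when paired with stat = "wtg" and that the computed variables listed above are reachable via stat()):

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  geom_text(
    aes(country = iso3c, label = stat(country.code)), # assumes the wtg stat exposes country.code (see list above)
    stat = "wtg", size = 2, color = "#f0f0f0"
  ) +
  viridis::scale_fill_viridis(name = "Life Expectancy", discrete = TRUE, drop = FALSE) +
  coord_equal() +
  theme_ft_rc(grid = "") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom")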

Easier World Tile Grid Facets

Making faceted charts is also straightforward. Just use ggplot2's faceting subsystem:

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 1900-2015 (selected years)",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number of years a newborn infant\nwould live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life",
    caption = "Data source: "
  ) +
  facet_wrap(~year) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

(You may want to open that one up in a separate tab/window.)

Animation might have been a better choice than facets (an exercise left to the reader).

FIN

The package API is still in "rapidly changing" mode, but feel free to kick the tyres and file issues or PRs on your community coding platform of choice as needed.

As noted in the package, the world tile grid concept credit goes to Jon Schwabish and the CSV data used to create the package exposed wtg data frame is the work of Maarten Lambrechts.

Apple has rung the death knell for 32-bit macOS apps and, if you’re running a recent macOS version on your Mac (which you should be, so you can get security updates), you likely see this alert from time to time:

If you’re like me, you click through that and keep working but later ponder just how many of those apps you have. They are definitely going away, so knowing if your favourite app is on the chopping block is likely a good idea.

You can get this information via the “About This Mac” → “System Report” app and sorting via one of the table fields.

R folks are data folks and we know we can do better than that. But, first we need to get the data. Thankfully, we can get this via the system_profiler command-line utility since it can both display user-friendly information in the terminal and also generate an XML version of the information to work with. We won’t need to head to the terminal for this work, though, since there are many ways to execute the command from R and read the generated output.

Executing System Calls from R

Base R provides two core methods for issuing a system call:

  • system()
  • system2()

Note that there are other functions provided with a base R installation that can also issue system commands and process the “piped” output, but we’ll focus on these two deliberate-invocation functions.

Functions in two other packages can also assist with this task and we’ll include a look at:

  • processx::run()
  • sys::exec_internal()

as well.

Why leave base R for this task? Truthfully, we really don’t need to, but both sys and processx bring other capabilities that make them handy tools in your package toolbox. Having said that, keep reading, since we’re going to end up not choosing the built-in functions for this task.

This is the command line we need to execute:

system_profiler -xml -detailLevel full SPApplicationsDataType

Let’s load up all the packages we’ll be needing and execute this command-line all four ways, then briefly discuss the differences:

library(sys)
library(processx)
library(microbenchmark)
library(xml2)
library(tidyverse)

system(
  command = "system_profiler -xml -detailLevel full SPApplicationsDataType",
  intern = TRUE
) -> apps_system

str(apps_system)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

system2(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  stdout = TRUE
) -> apps_system2

str(apps_system2)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

processx::run(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  spinner = TRUE
) -> apps_processx_run

str(apps_processx_run)
## List of 4
##  $ status : int 0
##  $ stdout : chr "XML STRING THAT prism.js won't let me show"
##  $ stderr : chr ""
##  $ timeout: logi FALSE

sys::exec_internal(
  cmd = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType")
) -> apps_sys_exec_internal

str(apps_sys_exec_internal)
## List of 3
##  $ status: int 0
##  $ stdout: raw [1:331133] 3c 3f 78 6d ...
##  $ stderr: raw(0) 

The core difference between system() and the rest is that you need to shQuote() for system(), whereas that’s taken care of for you by the others (so they’re a bit safer by default, since you’re more than likely going to forget to shQuote()).
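
A contrived illustration of that difference, using a macOS path that happens to contain a space:

dir_with_space <- "/Library/Application Support"

# system() hands one big string to the shell, so quoting is on you:
system(paste("ls", shQuote(dir_with_space)), intern = TRUE)

# processx (and sys) pass the argument vector straight to the program --
# no shell, no manual quoting needed:
processx::run("ls", args = dir_with_space)$stdout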

You can definitely notice the main differences in return objects. The built-in functions just give us the character data from the standard output stream (stdout) and the last two return a more structured object that provides more explicit information about the job we just executed. The base ones can provide this detail, but it’s a twisty maze of remembering which options do what vs the more (IMO) straightforward approach both processx and sys take.

You’ll also notice a difference in stdout between processx and sys with the latter giving us a raw vector vs a character vector. This gives us a great deal of power and flexibility. It also turns out to be a great choice for processing command-line-generated XML data. Here’s why:

microbenchmark(
  sys = xml2::read_xml(apps_sys_exec_internal$stdout),
  processx = xml2::read_xml(apps_processx_run$stdout)
)
## Unit: milliseconds
##      expr      min       lq      mean    median        uq      max neval
##       sys 4.086492  4.60078  9.085143  5.508814  5.906942 207.6495   100
##  processx 9.510356 10.98282 14.275499 12.054810 13.292234 163.9870   100

It turns out xml2::read_xml() makes much quicker work of the raw vector data (though, I mean, really—are we really going to care about those ~5ms IRL?).

We’ll move on to the real reason for the post, but definitely explore both sys and processx since they are both super-handy packages.

“Can we please just find the 32-bit apps already?”

No problem. Well, actually, there is a minor annoyance. These are property list XML files and I’ll confess that I truly hate this format. There are “dictionary arrays” of key and value nodes, but those nodes are siblings vs directly associated pairs. So, we have to use the sibling relationship to work with them. It’s not hard, per se, just (again, IMO) suboptimal.

Let’s take a look at it:

apps <- read_xml(apps_sys_exec_internal$stdout)

xml_find_all(apps, "//array/dict")
## {xml_nodeset (476)}
##  [1] <dict>\n  <key>_SPCommandLineArguments</key>\n  <array>\n     ...
##  [2] <dict>\n  <key>_name</key>\n  <string>Sublime Text</string>\n  <key>h ...
##  [3] <dict>\n  <key>_name</key>\n  <string>System Preferences</string>\n   ...
##  [4] <dict>\n  <key>_name</key>\n  <string>Google Chrome Canary</string>\n ...
##  [5] <dict>\n  <key>_name</key>\n  <string>Google Chrome</string>\n   ...
##  [6] <dict>\n  <key>_name</key>\n  <string>Dropbox</string>\n  <key>has64B ...
##  [7] <dict>\n  <key>_name</key>\n  <string>Keypad</string>\n  <key>has64Bi ...
##  [8] <dict>\n  <key>_name</key>\n  <string>Garmin WebUpdater</string>\n  < ...
##  [9] <dict>\n  <key>_name</key>\n  <string>LaTeXiT</string>\n  <key>has64B ...
## [10] <dict>\n  <key>_name</key>\n  <string>CocoaPacketAnalyzer</string>\n  ...
## [11] <dict>\n  <key>_name</key>\n  <string>Janetter</string>\n  <key>has64 ...
## [12] <dict>\n  <key>_name</key>\n  <string>VMware Fusion</string>\n   ...
## [13] <dict>\n  <key>_name</key>\n  <string>Photo Library Migration Utility ...
## [14] <dict>\n  <key>_name</key>\n  <string>Setup Assistant</string>\n  <ke ...
## [15] <dict>\n  <key>_name</key>\n  <string>Siri</string>\n  <key>has64BitI ...
## [16] <dict>\n  <key>_name</key>\n  <string>Software Update</string>\n  <ke ...
## [17] <dict>\n  <key>_name</key>\n  <string>Spotlight</string>\n  <key>has6 ...
## [18] <dict>\n  <key>_name</key>\n  <string>Stocks</string>\n  <key>has64Bi ...
## [19] <dict>\n  <key>_name</key>\n  <string>SystemUIServer</string>\n  <key ...
## [20] <dict>\n  <key>_name</key>\n  <string>UniversalAccessControl</string> ...
## ...

Let’s look at a sample record using xml_view() from the htmltidy package:

xml_find_all(apps, "//array/dict[key='_name']")[1] %>% 
  htmltidy::xml_view()

Be wary of using xml_view() on giant XML structures: it’ll freeze up RStudio for a bit and can even slow down Chrome, since the resultant, composed DOM object can get ginormous.

Now we know we can use has64BitIntelCode for filtering once we get to the data. Let’s read in all the apps, cherry-picking the fields and then just look at the 32-bit apps:

xml_find_all(apps, "//array/dict[key='_name']") %>% 
  map_df(~{
    list(
      name = xml_find_first(.x, ".//string") %>% xml_text(),
      path = xml_find_first(.x, ".//key[.='path']/following-sibling::string") %>% xml_text(),
      is_64bit = xml_find_first(.x, ".//key[.='has64BitIntelCode']/following-sibling::string") %>% xml_text() 
    )
  }) %>% 
  filter(is_64bit == "no") %>% 
  arrange(name) %>% 
  select(-is_64bit)
## # A tibble: 30 x 2
##    name                      path                                           
##    <chr>                     <chr>
##  1 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  2 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  3 AAM Updates Notifier      /Applications/Utilities/Adobe Application Mana…
##  4 AAMLauncherUtil           /Applications/Utilities/Adobe Application Mana…
##  5 ACR_9_10                  /Library/Application Support/Adobe/Uninstall/A…
##  6 Adobe Application Manager /Applications/Utilities/Adobe Application Mana…
##  7 adobe_licutil             /Applications/Utilities/Adobe Application Mana…
##  8 Audacity                  /Applications/Audacity.app                     
##  9 COCM_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## 10 COPS_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## # ... with 20 more rows

The Adobe helper apps are longstanding 32-bit “offenders”. Many of these death-row apps fall into the “helper” category and will hopefully get some attention from their developers. I do find it amusing that Apple kinda wants us to prod the developers to get their collective acts together.

FIN

This exhaustive search finds all of the 32-bit apps residing on your system. If you just want to see the ones you’ve executed and that macOS has kept track of, you can drop to a command-line and do:

sudo Rscript -e 'knitr::kable((dplyr::select(dplyr::tbl(dplyr::src_sqlite("/var/db/SystemPolicyConfiguration/ExecPolicy"), "legacy_exec_history_v3"), responsible_path)))'

You need elevated privileges to access those files, so please read that whole line to make sure I’m not having you rm -rf /.

Remember to take some time to explore the sys and processx packages and perhaps bundle up the salient bits of this post into a script so you can occasionally check to see the 32-bit eradication progress.
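
If you do bundle it up, the core of that script could be little more than the above wrapped in a function (a sketch; find_32bit_apps() is a made-up name and it assumes the packages loaded earlier in the post are attached):

# made-up helper name; just the post's pipeline in reusable form
find_32bit_apps <- function() {

  sys::exec_internal(
    cmd = "system_profiler",
    args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType")
  ) -> res

  apps <- xml2::read_xml(res$stdout)

  xml2::xml_find_all(apps, "//array/dict[key='_name']") %>%
    purrr::map_df(~{
      list(
        name = xml2::xml_text(xml2::xml_find_first(.x, ".//string")),
        path = xml2::xml_text(xml2::xml_find_first(.x, ".//key[.='path']/following-sibling::string")),
        is_64bit = xml2::xml_text(xml2::xml_find_first(.x, ".//key[.='has64BitIntelCode']/following-sibling::string"))
      )
    }) %>%
    dplyr::filter(is_64bit == "no") %>%
    dplyr::arrange(name) %>%
    dplyr::select(-is_64bit)

}

# find_32bit_apps()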

It’s been over a year since Headless Chrome was introduced; it has matured greatly over that time and acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or render javascript-interpreted HTML by supplying some simple parameters. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.

R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).

However, that nigh-mundane addition is not the only new feature in decapitated.

Introducing gepetto

While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community-supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high-level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.

Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.

Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.

A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at-will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but should be of sufficient utility to add to your web-scraping toolbox.

Installing gepetto

There are instructions over at the repo on installing gepetto, but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto(), which should (hopefully) go as smoothly as this, provided you have a fairly recent version of Node.js installed along with npm:

The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps to ensure smoother sailing than using the version of Chrome you may use for browsing.

Working with gepetto

Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:

$ gepetto
Launch browser!
gepetto running on: http://localhost:3000

NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.

You can then go to http://localhost:3000 in your browser and should see this:

Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.

If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:

The Swagger definition JSON is also at http://localhost:3000/swagger.json.

The API documentation will be a bit more robust as the module’s corners are rounded out.

“But, this is supposed to be an R post…”

Yes. Yes it is.

If you followed along in the previous section and started gepetto from a command-line interface, kill the running service and fire up your favourite R environment and let’s scrape some content!

library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>% 
  gep_active()
## [1] TRUE

Anything other than a “running” response means there’s something wrong and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
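
For instance, since gpid is a processx process object, the usual methods are right there (a quick sketch; reading stderr only works if start_gepetto() opened it as a pipe):

gpid$is_alive()           # TRUE while the service is up
# gpid$read_error_lines() # peek at stderr, if it was piped at startup
# gpid$kill()             # stop the service when you're done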

Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().

gepetto() %>% 
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%  
  html_attr("src") %>% 
  head(10)
## [1] "audio/2chainz_4.mp3"        "audio/2chainz_yeah2.mp3"   
## [3] "audio/2chainz_tellem.mp3"   "audio/2chainz_tru.mp3"     
## [5] "audio/2chainz_unh3.mp3"     "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3"  "audio/2pac_4.mp3"          
## [9] "audio/2pac_5.mp3"           "audio/2pac_6.mp3"

Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.

We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:

gepetto() %>% 
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72

Astute readers will notice it returns a magick object so you can work with it immediately.
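
For instance, assigning the render to an object lets you hand it straight to the usual magick verbs (the object and output filename here are just examples):

library(magick)

gepetto() %>%
  gep_render_magick("https://nytimes.com/") -> pg

image_scale(pg, "640") %>%          # shrink it down
  image_write("nyt-front-page.png") # and save it off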

I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since the gep_render_magick() call above actually captured the entire page, which you can see for yourself (the thumbnail doesn’t do it justice).

Testing gep_render_pdf() is an exercise left to the reader.

FIN

The gepetto REST API is at version 0.1.0 meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.

If you’ve got a directory full of Bro NSM logs, it’s easy to work with them in Apache Drill since they’re just tab-separated values (TSV) files by default. The most tedious part is mapping the columns to proper types and hopefully this saves at least one person from typing it out manually:

SELECT 
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid,
                    columns[2]             AS id_orig_h,
                    columns[3]             AS id_orig_p,
                    columns[4]             AS id_resp_h,
                    columns[5]             AS id_resp_p,
                    columns[6]             AS proto,
                    columns[7]             AS service,
              CAST( columns[8] AS DOUBLE)  AS duration,
              CAST( columns[9] AS INTEGER) AS orig_bytes,
              CAST(columns[10] AS INTEGER) AS resp_bytes,
                   columns[11]             AS conn_state,
                   columns[12]             AS local_orig,
                   columns[13]             AS local_resp,
              CAST(columns[14] AS INTEGER) AS missed_bytes,
                   columns[15]             AS history,
              CAST(columns[16] AS INTEGER) AS orig_pkts,
              CAST(columns[17] AS INTEGER) AS orig_ip_bytes,
              CAST(columns[18] AS INTEGER) AS resp_pkts,
              CAST(columns[19] AS INTEGER) AS resp_ip_bytes,
                   columns[20]             AS tunnel_parents
FROM dfs.brologs.`/201808/*`

You can either store them all under a single workspace with a default input type or soft-link/rename them to end in .tsv (it’s unlikely you want to change all .log files to be read as TSV everywhere).

While you could just use the logs this way, consider using CTAS to move them to Parquet. The above will create typed columns and queries against the Parquet versions will generally be much faster.
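
Something along these lines will do the conversion (the dfs.tmp workspace and the table name are placeholders; any writable workspace will do), and you can fire it off from R via sergeant if that’s where you already are:

library(sergeant) # gitlab/github :: hrbrmstr/sergeant

dc <- drill_connection("localhost")

# dfs.tmp + the table name are placeholders; reuse the full column list
# from the query above in the SELECT
drill_query(dc, "
CREATE TABLE dfs.tmp.`bro_conn_parquet` AS
SELECT
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid
  -- ...remaining columns exactly as in the query above...
FROM dfs.brologs.`/201808/*`
")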