rud.is

Roll Your Own Federal Government Shutdown-caused SSL Certificate Expiration Monitor in R

By now, even remote villages on uncharted islands in the Pacific know that the U.S. is in the midst of a protracted partial government shutdown. It’s having real impacts on the lives of Federal government workers but they aren’t the only ones. Much of the interaction Federal agencies have with the populace takes place online and the gateway to most of these services/information is a web site.

There are Federal standards that require U.S. government web sites to use SSL/TLS certificates and those certificates have something in common with, say, a loaf of bread you buy at the store: they expire. In all but the best of orgs — or we zany folks who use L e t ‘ s E n c r y p t and further propel internet denizens into a false sense of safety & privacy — renewing certificates involves manual labor/human intervention. For a good chunk of U.S. Federal agencies, those particular humans aren’t around. If a site’s SSL certificate expires and isn’t re-issued, it causes browsers to do funny things, like this:

Now, some of these sites are configured improperly in many ways, including them serving pages on both http and https (vs redirecting to https immediately upon receiving an http connection). But, browsers like Chrome will generally try https first and scare you into not viewing the site.

But, how big a problem could this really be? We can find out with a fairly diminutive R script that:

grabs a list of Federal agency domains (thanks to the GSA)
tries to make a SSL/TLS connection (via the openssl package) to the apex domain or www. prefixed apex domain
find the expiration date for the cert
do some simple date math

I’ve commented the script below pretty well so I’ll refrain from further blathering:

library(furrr)
library(openssl)
library(janitor)
library(memoise)
library(hrbrthemes)
library(tidyverse)

# fetch the GSA CSV:

read_csv(
  file = "https://raw.githubusercontent.com/GSA/data/master/dotgov-domains/current-federal.csv",
  col_types = "ccccccc"
) %>% 
  janitor::clean_names() -> xdf

# make openssl::download_ssl_cert calls safer in the even there
# are network/connection issues
.dl_cert <- possibly(openssl::download_ssl_cert, otherwise = NULL)

# memoise the downloader just in case we need to break the iterator
# below or another coding error causes it to break (the cached values
# will go away in a new R session or if you manually purge them)
dl_cert <- memoise::memoise(.dl_cert)

# we'll do this in parallel to save time (~1,200 domains)
plan(multiprocess)

# now follow the process described in the bullet points
future_map_dfr(xdf$domain_name, ~{

  who <- .x

  crt <- dl_cert(who)  

  if (!is.null(crt)) {
    # shld be the first cert and expires is second validity field
    expires <- crt[[1]]$validity[2] 
  } else {
    crt <- dl_cert(sprintf("www.%s", who)) # may be on www b/c "gov"
    if (!is.null(crt)) {
      expires <- crt[[1]]$validity[2]
    } else {
      expires <- NA_character_  
    }
  }

  # keep a copy of the apex domain, the expiration field and the cert
  # (in the event you want to see just how un-optimized the U.S. IT 
  # infrastructure is by how many stupid vendors they use for certs)
  tibble(
    who = who,
    expires = expires,
    cert = list(crt)
  )

}) -> cdf

Now, lets make strings into proper dates, count only the dates starting with the date of the shutdown to the end of 2019 (b/c the reckless human at the helm is borderline insane enough to do that) and plot the timeline:

filter(cdf, !is.na(expires)) %>% 
  mutate(
    expires = as.Date(
      as.POSIXct(expires, format="%b %d %H:%M:%S %Y")
    )
  ) %>% 
  arrange(expires) 
  count(expires) %>% 
  filter(
    expires >= as.Date("2018-12-22"), 
    expires <= as.Date("2019-12-31")
  ) %>% 
  ggplot(aes(expires, n)) +
  geom_vline(
    xintercept = Sys.Date(), linetype="dotted", size=0.25, color = "white"
  ) +
  geom_label(
    data = data.frame(), 
    aes(x = Sys.Date(), y = Inf, label = "Today"),
    color = "black", vjust = 1
  ) +
  geom_segment(aes(xend=expires, yend=0), color = ft_cols$peach) + 
  scale_x_date(name=NULL, date_breaks="1 month", date_labels="%b") +
  scale_y_comma("# Federal Agency Certs") +
  labs(title = "2019 Federal Agency ShutdownCertpoalypse") +
  theme_ft_rc(grid="Y")

Now, I’m unwarrantedly optimistic that this debacle could be over by the end of January. How many certs (by agency) could go bad by then?

left_join(cdf, xdf, by=c("who"="domain_name")) %>% 
  mutate(
    expires = as.Date(
      as.POSIXct(expires, format="%b %d %H:%M:%S %Y")
    )
  ) %>% 
  filter(
    expires >= as.Date("2018-12-22"),
    expires <= as.Date("2019-01-31")
  ) %>% 
  count(agency, sort = TRUE)
## # A tibble: 10 x 2
##    agency                                          n
##    <chr>                                       <int>
##  1 Government Publishing Office                    8
##  2 Department of Commerce                          4
##  3 Department of Defense                           3
##  4 Department of Housing and Urban Development     3
##  5 Department of Justice                           3
##  6 Department of Energy                            1
##  7 Department of Health and Human Services         1
##  8 Department of State                             1
##  9 Department of the Interior                      1
## 10 Department of the Treasury                      1

Ugh.

FIN

Not every agency is fully shutdown and not all workers in charge of cert renewals are furloughed (or being forced to work without pay). But, this one other area shows the possible unintended consequences of making rash, partisan decisions (something both Democrats & Republicans excel at).

You can find the contiguous R code at 2018-01-10-shutdown-certpocalypse.R and definitely try to explore the contents of those certificates.

Waffle Geoms & Other Miscellaneous In-Development Package Updates

More than just sergeant has been hacked on recently, so here’s a run-down of various ? updates:

waffle

The square pie chart generating waffle? package now contains a nascent geom_waffle() so you can do things like this:

library(hrbrthemes)
library(waffle)
library(tidyverse)

tibble(
  parts = factor(rep(month.abb[1:3], 3), levels=month.abb[1:3]),
  values = c(10, 20, 30, 6, 14, 40, 30, 20, 10),
  fct = c(rep("Thing 1", 3), rep("Thing 2", 3), rep("Thing 3", 3))
) -> xdf

ggplot(xdf, aes(fill=parts, values=values)) +
  geom_waffle(color = "white", size=1.125, n_rows = 6) +
  facet_wrap(~fct, ncol=1) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  ggthemes::scale_fill_tableau(name=NULL) +
  coord_equal() +
  labs(
    title = "Faceted Waffle Geoms"
  ) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle()

and get:

It’s super brand new so pls file issues (wherev you like besides blog comments as they’re not conducive to package triaging) if anything breaks or you need more aesthetic configuration options. NOTE: You need to use the 1.0.0 branch as noted in the master branch README.

markdowntemplates

I had to take a quick peek at markdowntemplates? due to a question from a blog reader about the Jupyter notebook generation functionality. While I was in the code I added two new bits to the knit: markdowntemplates::to_jupyter code. First is the option to specify a run: parameter in the YAML header so you can just knit the document to a Jupyter notebook without executing the chunks:

---
title: "ggplot2 example"
knit: markdowntemplates::to_jupyter
run: false
---

If run is not present it defaults to true.

The other add is a bit of intelligence to whether it should include %load_ext rpy2.ipython (the Jupyter “magic” that lets it execute R chunks). If no R code chunks are present, rpy2.ipython will not be loaded.

securitytrails

SecurityTrails is a service for cybersecurity researchers & defenders that provides tools and an API to aid in querying for all sorts of current and historical information on domains and IP addresses. It now (finally) has a mostly-complete R package securitytrails?. They’re research partners of $DAYJOB and their API is ?? so give it a spin if you are looking to broaden your threat-y API collection.

astools

Keeping the cyber theme going for a bit, next up is astools)? which are “Tools to Work With Autonomous System (‘AS’) Network and Organization Data”. Autonomous Systems (AS) are at the core of the internet (we all live in one) and this package provides tools to fetch AS data/metadata from various sources and work with it in R. For instance, we can grab the latest RouteViews data:

(rv_df <- routeviews_latest())
## # A tibble: 786,035 x 6
##    cidr         asn   minimum_ip maximum_ip  min_numeric max_numeric
##    <chr>        <chr> <chr>      <chr>             <dbl>       <dbl>
##  1 1.0.0.0/24   13335 1.0.0.0    1.0.0.255      16777216    16777471
##  2 1.0.4.0/22   56203 1.0.4.0    1.0.7.255      16778240    16779263
##  3 1.0.4.0/24   56203 1.0.4.0    1.0.4.255      16778240    16778495
##  4 1.0.5.0/24   56203 1.0.5.0    1.0.5.255      16778496    16778751
##  5 1.0.6.0/24   56203 1.0.6.0    1.0.6.255      16778752    16779007
##  6 1.0.7.0/24   56203 1.0.7.0    1.0.7.255      16779008    16779263
##  7 1.0.16.0/24  2519  1.0.16.0   1.0.16.255     16781312    16781567
##  8 1.0.64.0/18  18144 1.0.64.0   1.0.127.255    16793600    16809983
##  9 1.0.128.0/17 23969 1.0.128.0  1.0.255.255    16809984    16842751
## 10 1.0.128.0/18 23969 1.0.128.0  1.0.191.255    16809984    16826367
## # ... with 786,025 more rows

That, in turn, can work with iptools::ip_to_asn() so we can figure out which AS an IP address lives in:

rv_trie <- as_asntrie(rv_df)

iptools::ip_to_asn(rv_trie, "174.62.167.97")
## [1] "7922"

It can also fetch AS name info:

asnames_current()
## # A tibble: 63,453 x 4
##    asn   handle       asinfo                                                iso2c
##    <chr> <chr>        <chr>                                                 <chr>
##  1 1     LVLT-1       Level 3 Parent, LLC                                   US   
##  2 2     UDEL-DCN     University of Delaware                                US   
##  3 3     MIT-GATEWAYS Massachusetts Institute of Technology                 US   
##  4 4     ISI-AS       University of Southern California                     US   
##  5 5     SYMBOLICS    Symbolics, Inc.                                       US   
##  6 6     BULL-HN      Bull HN Information Systems Inc.                      US   
##  7 7     DSTL         DSTL                                                  GB   
##  8 8     RICE-AS      Rice University                                       US   
##  9 9     CMU-ROUTER   Carnegie Mellon University                            US   
## 10 10    CSNET-EXT-AS CSNET Coordination and Information Center (CSNET-CIC) US   
## # ... with 63,443 more rows

which we can use for further enrichment:

routeviews_latest() %>% 
  left_join(asnames_current())
## Joining, by = "asn"

## # A tibble: 786,035 x 9
##    cidr         asn   minimum_ip maximum_ip  min_numeric max_numeric handle            asinfo                     iso2c
##    <chr>        <chr> <chr>      <chr>             <dbl>       <dbl> <chr>             <chr>                      <chr>
##  1 1.0.0.0/24   13335 1.0.0.0    1.0.0.255      16777216    16777471 CLOUDFLARENET     Cloudflare, Inc.           US   
##  2 1.0.4.0/22   56203 1.0.4.0    1.0.7.255      16778240    16779263 GTELECOM-AUSTRAL… Gtelecom-AUSTRALIA         AU   
##  3 1.0.4.0/24   56203 1.0.4.0    1.0.4.255      16778240    16778495 GTELECOM-AUSTRAL… Gtelecom-AUSTRALIA         AU   
##  4 1.0.5.0/24   56203 1.0.5.0    1.0.5.255      16778496    16778751 GTELECOM-AUSTRAL… Gtelecom-AUSTRALIA         AU   
##  5 1.0.6.0/24   56203 1.0.6.0    1.0.6.255      16778752    16779007 GTELECOM-AUSTRAL… Gtelecom-AUSTRALIA         AU   
##  6 1.0.7.0/24   56203 1.0.7.0    1.0.7.255      16779008    16779263 GTELECOM-AUSTRAL… Gtelecom-AUSTRALIA         AU   
##  7 1.0.16.0/24  2519  1.0.16.0   1.0.16.255     16781312    16781567 VECTANT           ARTERIA Networks Corporat… JP   
##  8 1.0.64.0/18  18144 1.0.64.0   1.0.127.255    16793600    16809983 AS-ENECOM         Energia Communications,In… JP   
##  9 1.0.128.0/17 23969 1.0.128.0  1.0.255.255    16809984    16842751 TOT-NET           TOT Public Company Limited TH   
## 10 1.0.128.0/18 23969 1.0.128.0  1.0.191.255    16809984    16826367 TOT-NET           TOT Public Company Limited TH   
## # ... with 786,025 more rows

Note that routeviews_latest() and asnames_current() cache the data so there is no re-downloading unless you clear the local cache.

docxtractr

The docxtractr? package recently got a CRAN push due to some changes in the tibble ? but it also include a new feature that lets you accept or reject “tracked changes” before trying to extract tables/comments from a document without harming/changing the original document.

ednstest

DNS Flag Day is fast approaching. What is “DNS Flag Day”? It’s a day when yet-another cabal of large-scale DNS providers and tech heavy hitters decided that they know what’s best for the internet and are mandating compliance with RFC 6891 (EDNS). Honestly, there’s no good reason to run crappy DNS servers and no good reason not to support EDNS.

You could just go to the flag day site and test your provider (by entering your domain name, if you have one). But, you can also load the package, and run it locally (it still calls their API since it’s open and provides a very detailed results page if your DNS server isn’t compliant). You can just run it to get compact output and an auto-load of the report page in your browser or save off the returned object and inspect it to see what tests failed.

I ran it on a few domains that are likely familiar to readers and this is what it showed:

edns_test("rud.is")
## EDNS compliance test for [rud.is] has ✔ PASSED!
## Report URL: https://ednscomp.isc.org/ednscomp/60049cb032

edns_test("rstudio.com")
## EDNS compliance test for [rstudio.com] has ✖ FAILED
## Report URL: https://ednscomp.isc.org/ednscomp/54e2057229

edns_test("r-project.org")
## EDNS compliance test for [r-project.org] has ✔ PASSED!
## Report URL: https://ednscomp.isc.org/ednscomp/839ee9c9af

The print() function in the package also has some minimal cli? and crayon? usage in it if you’re looking to jazz up your R console output.

ulid

Finally, there’s ulid? which is a package to make “Universally Unique Lexicographically Sortable Identifiers in R”. These ULIDs have the following features:

128-bit compatibility with UUID
1.21e+24 unique ULIDs per millisecond
Lexicographically sortable!
Canonically encoded as a 26 character string, as opposed to the 36 character UUID
Uses Crockford’s base32 for better efficiency and readability (5 bits per character)
Case insensitive
No special characters (URL safe)
Monotonic sort order (correctly detects and handles the same millisecond)

They’re made up of

 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits

The timestamp is a 48 bit integer representing UNIX-time in milliseconds and the randomness is an 80 bit cryptographically secure source of randomness (where possible). Read more in the full specification.

You can get one ULID easily:

ulid::ULIDgenerate()
## [1] "0001E2ERKHVPKZJ6FA6ZWHH1KS"

Generate a whole bunch of ’em:

(u <- ulid::ULIDgenerate(20))
##  [1] "0001E2ERKHVX5QF5D59SX2E65T" "0001E2ERKHKD6MHKYB1G8JHN5X" "0001E2ERKHTK0XEHVV2G5877K9" "0001E2ERKHKFGG5NPN24PC1N0W"
##  [5] "0001E2ERKH3F48CAKJCVMSCBKS" "0001E2ERKHF3N0B94VK05GTXCW" "0001E2ERKH24GCJ2CT3Z5WM1FD" "0001E2ERKH381RJ232KK7SMWQW"
##  [9] "0001E2ERKH7NAZ1T4HR4ZRQRND" "0001E2ERKHSATC17G2QAPYXE0C" "0001E2ERKH76R83NFST3MZNW84" "0001E2ERKHFKS52SD8WJ8FHXMV"
## [13] "0001E2ERKHQM6VBM5JB235JJ1W" "0001E2ERKHXG2KNYWHHFS8X69Z" "0001E2ERKHQW821KPRM4GQFANJ" "0001E2ERKHD5KWTM5S345A3RP4"
## [17] "0001E2ERKH0D901W6KX66B1BHE" "0001E2ERKHKPHZBFSC16FC7FFC" "0001E2ERKHQQH7315GMY8HRYXV" "0001E2ERKH016YBAJAB7K9777T"

and “unmarshal” them (which gets you the timestamp back):

unmarshal(u)
##                     ts              rnd
## 1  2018-12-29 07:02:57 VX5QF5D59SX2E65T
## 2  2018-12-29 07:02:57 KD6MHKYB1G8JHN5X
## 3  2018-12-29 07:02:57 TK0XEHVV2G5877K9
## 4  2018-12-29 07:02:57 KFGG5NPN24PC1N0W
## 5  2018-12-29 07:02:57 3F48CAKJCVMSCBKS
## 6  2018-12-29 07:02:57 F3N0B94VK05GTXCW
## 7  2018-12-29 07:02:57 24GCJ2CT3Z5WM1FD
## 8  2018-12-29 07:02:57 381RJ232KK7SMWQW
## 9  2018-12-29 07:02:57 7NAZ1T4HR4ZRQRND
## 10 2018-12-29 07:02:57 SATC17G2QAPYXE0C
## 11 2018-12-29 07:02:57 76R83NFST3MZNW84
## 12 2018-12-29 07:02:57 FKS52SD8WJ8FHXMV
## 13 2018-12-29 07:02:57 QM6VBM5JB235JJ1W
## 14 2018-12-29 07:02:57 XG2KNYWHHFS8X69Z
## 15 2018-12-29 07:02:57 QW821KPRM4GQFANJ
## 16 2018-12-29 07:02:57 D5KWTM5S345A3RP4
## 17 2018-12-29 07:02:57 0D901W6KX66B1BHE
## 18 2018-12-29 07:02:57 KPHZBFSC16FC7FFC
## 19 2018-12-29 07:02:57 QQH7315GMY8HRYXV
## 20 2018-12-29 07:02:57 016YBAJAB7K9777T

and can even supply your own timestamp:

(ut <- ts_generate(as.POSIXct("2017-11-01 15:00:00", origin="1970-01-01")))
## [1] "0001CZM6DGE66RJEY4N05F5R95"

unmarshal(ut)
##                    ts              rnd
## 1 2017-11-01 15:00:00 E66RJEY4N05F5R95

FIN

Kick the tyres & file issues/PRs as needed and definitely give sr.ht a spin for your code-hosting needs. It’s 100% free and open source software made up of mini-services that let you use only what you need. Zero javacript on site and no tracking/adverts. Plus, no evil giant megacorps doing heaven knows what with your browser, repos, habits and intellectual property.

On the Road to 0.8.0 — Some Additional New Features Coming in the sergeant Package

It was probably not difficult to discern from my previous Drill-themed post that I’m fairly excited about the Apache Drill 1.15.0 release. I’ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.

Drill Docker Goodness

Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single node standalone mode so you’re not going to be running this in “production” (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you haven’t already played with it since you don’t have do do much except run the Docker image.

I’ve simplified this even more thanks to @rgfitzjohn’s most excellent stevedore? package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate. The new drill_up() function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.

Just running the vanilla image isn’t enough since your goal is likely to do more than work with the built-in cp data source. The default container launch scenario also doesn’t hook up any local filesystem paths to the container so you really can’t do much other than cp-oriented queries. Rather than make you do all the work of figuring out how to machinate Docker command line arguments and manually configure a workspace that points to a local filesystem area in the Drill web admin GUI the drill_up() function provides a data_dir argument (that defaults to the getwd() where you are in your R session) which will then auto-wire up that path into the container and create a dfs.d workspace which auto-points to it for you. Here’s a sample execution:

library(sergeant)
library(tidyverse)

dr <- drill_up(data_dir = "~/Data")
## Drill container started. Waiting for the service to become active (this may take up to 30s).
## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0

You can use dc$stop() to stop the container or use the printed container id to do it from the command line.

We’ll use this containerized Drill instance with the next feature but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)

Taking the Tedium out of CTAS

@dseverski, an intrepid R, Drill & sergeant user noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.

Prior to 1.14.0 sergeant had no way to accurately, precisely tell data types of the columns coming back since the REST API didn’t provide them (as noted in the previous Drill post). Now, it did rely on the JSON types to create the initial data frames but id also did something **kinda horribad*: it ran readr::type_convert() on the result sets ?. Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didn’t seem to happen).

With 1.15.0, the readr::type_convert() crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:

# using the Drill container we just started above

write_csv(nycflights13::flights, "~/Data/flights.csvh")

con <- src_drill("localhost")

tbl(con, "dfs.d.`flights.csvh`") %>% 
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           <chr> "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013…
## $ month          <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ day            <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ dep_time       <chr> "517", "533", "542", "544", "554", "554", "555", "557", "557"…
## $ sched_dep_time <chr> "515", "529", "540", "545", "600", "558", "600", "600", "600"…
## $ dep_delay      <chr> "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"…
## $ arr_time       <chr> "830", "850", "923", "1004", "812", "740", "913", "709", "838…
## $ sched_arr_time <chr> "819", "830", "850", "1022", "837", "728", "854", "723", "846…
## $ arr_delay      <chr> "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         <chr> "1545", "1714", "1141", "725", "461", "1696", "507", "5708", …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       <chr> "227", "227", "160", "183", "116", "150", "158", "53", "140",…
## $ distance       <chr> "1400", "1416", "1089", "1576", "762", "719", "1065", "229", …
## $ hour           <chr> "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "…
## $ minute         <chr> "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "…
## $ time_hour      <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1…

So the package does what it finally should have been doing all along. But, as noted, that’s not great if you just wanted to quickly work with a directory of CSV files. In theory, you’re supposed to use Drill’s CREATE TABLE AS then do a bunch of CASTS and TO_s to get proper data types. But who has time for that?

David had a stellar idea, might sergeant be able to automagically create CTAS statements from a query?. Yes. Yes it just might be able to do that with the new ctas_profile() function.

Let’s pipe the previous tbl() into ctas_profile() and see what we get:

tbl(con, "dfs.d.`flights.csvh`") %>% 
  ctas_profile() %>% 
  cat()

-- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 **

CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'FORMATSTRING') AS `time_hour` -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z]. 
FROM (SELECT * FROM dfs.d.`flights.csvh`)


-- TIMESTAMP and/or DATE columns were detected.
Drill's date/time format string reference can be found at:
--
-- <http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html>

There’s a parameter for the new table name which will cause the CHANGE____ME to go away and when the function finds TIMESTAMP or DATE fields it knows to switch to their TO_ cousins and gives sample data with a reminder that you need to make a format string (I’ll eventually auto-generate them unless someone PRs it first). And, since nodoby but Java programmers remember Joda format strings (they’re different than what you’re used to) it provides a handy link to them if it detects the presence of those column types.

Now, we don’t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl():

tbl(con, sql("
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'yyyy-MM-dd''T''HH:mm:ssZ') AS `time_hour` -- [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.`flights.csvh`)
")) %>% 
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <dbl> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5…
## $ sched_dep_time <dbl> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -…
## $ arr_time       <dbl> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, …
## $ sched_arr_time <dbl> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, …
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0…
## $ time_hour      <dttm> 2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0…

Ahhhh… Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string so add a comment if you need the actual string.)

FIN

As you can see questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though y’all still seem to be stuck on GH ?).

NOTE: I’ll be subbing out most install_github() links in READMEs and future blog posts for install_git() counterparts pointing to my sr.ht repos (as I co-locate/migrate them there).

You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0").

‘data:’ Scraping & Chart Reproduction : Arrows of Environmental Destruction

2019-01-03 – 10:26
Posted in data wrangling, DataVis, DataViz, ggplot, R, web scraping
Comments (3)

Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past the vast majority of its regulatory corpus is quite sane and has made Americans safer and healthier as a result. What’s happened to an EPA left in the hands of evil (yep, “evil”) in the past two years is beyond lamentable and we likely have two more years of lamenting ahead of us (unless you actually like your water with a coal ash chaser).

The authors of the article made this chart to show the stark contrast between 2017 and 2018 when it comes to regulatory actions for eight acts:

Clean Air Act (CAA)
Clean Water Act (CWA)
Emergency Planning and Community Right to Know Act (EPCRA)
Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)
Resource Conservation and Recovery Act (RCRA)
Safe Drinking Water Act (SDWA)
Toxic Substances Control Act (TSCA)
– Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)

They made this arrow chart (via Datawrapper):

For some reason, that chart sparked a “I really need to make that in R” moment, and thus begat this post.

I’ve got a geom for dumbbell charts but that’s not going to work for this arrow chart since I really wanted to (mostly) reproduce it the way it was. Here’s my go at it.

Data First

Datawrapper embeds have a handy “Get the data” link in them but it’s not a link to a file. It’s a javascript-generated data: href so you either need to click on the link and download it or be hard-headed like I am go the way of pain and scrape it (reproducibility FTW). Let’s get packages and data gathering code out of the way. I’ll exposit a bit more about said data gathering after the code block:

library(stringi)
library(rvest)
library(hrbrthemes) # git[la|hu]b / hrbrmstr / hrbrthemes
library(tidyverse)

article <- read_html("https://theconversation.com/the-epa-has-backed-off-enforcement-under-trump-here-are-the-numbers-108640")

html_node(article, "iframe#psm7n") %>% # find the iframe
  html_attr("src") %>% # get iframe URL
  read_html() %>%  # read it in
  html_node(xpath=".//script[contains(., 'data: ')]") %>% # find the javascript section with the data
  html_text() %>% # get that section
  stri_split_lines() %>% # split into lines so we can target the actual data element
  unlist() %>% 
  keep(stri_detect_fixed, 'data: "Fiscal') %>% # just get the data line
  stri_trim_both() %>% # prep it for extraction
  stri_replace_first_fixed('data: "', "") %>% 
  stri_replace_last_fixed('"', "") %>% 
  stri_replace_all_fixed("\\n", "\n") %>% # make lines lines
  stri_split_lines() %>% 
  unlist() %>%
  stri_split_fixed("\\t") %>% # we now have a list of vectors
  map_dfc(~set_names(list(.x[2:length(.x)]), .x[1])) %>%  # first element of each vector is colname
  type_convert(col_types = "cddn") %>% # get real types
  set_names(c("act", "y2018", "y2017", "pct")) -> psm

psm
## # A tibble: 8 x 4
##   act    y2018 y2017   pct
##   <chr>  <dbl> <dbl> <dbl>
## 1 CAA      199   405   -51
## 2 CERCLA   147   194   -24
## 3 CWA      320   565   -43
## 4 EPCRA     56   107   -48
## 5 FIFRA    363   910   -60
## 6 RCRA     149   275   -46
## 7 SDWA     121   178   -32
## 8 TSCA      80   152   -47

Inside the main article URL content there’s an iframe load:

<p><iframe id="psm7n" class="tc-infographic-datawrapper" src="https://datawrapper.dwcdn.net/psm7n/2/" height="400px" width="100%" style="border: none" frameborder="0"></iframe></p>

We grab the contents of that iframe link (https://datawrapper.dwcdn.net/psm7n/2/) which has a data: line way down towards the bottom of one of the last javascript blocks:

That ugly line gets transformed into a link that will download as a normal CSV file, but we have to do the above wrangling on it before we can get it into a format we can work with.

Now, we can make the chart.

Chart Time!

Let’s get the Y axis in the right order:

psm %>%
  arrange(desc(y2017)) %>%
  mutate(act = factor(act, levels = rev(act))) -> psm

Next, we setup X axis breaks and also get the max value for some positioning calculations (so we don’t hardcode values):

# setup x axis breaks and max value for label position computation
x_breaks <- pretty(c(psm$y2018, psm$y2017))
max_val <- max(x_breaks)

I have two minor nitpicks about the original chart (and changes to them as a result). First, I really don’t like the Y axis gridlines but I do believe we need something to help the eye move horizontally and associate each label to its respective geom. Instead of gridlines I opt for a diminutive dotted line from 0 to the first (min) value.

The second nitpick is that — while the chart has the act information in the caption area — the caption is in alpha order vs the order the act acronyms appear in the data. If it was an alpha bullet list I might not complain, but I chose to modify the order to fit the chart, which we build dynamically with the help of this vector:

# act info for caption
c(
  "CAA" = "Clean Air Act (CAA)",
  "CWA" = "Clean Water Act (CWA)",
  "EPCRA" = "Emergency Planning and Community Right to Know Act (EPCRA)",
  "FIFRA" = "Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)",
  "RCRA" = "Resource Conservation and Recovery Act (RCRA)",
  "SDWA" = "Safe Drinking Water Act (SDWA)",
  "TSCA" = "Toxic Substances Control Act (TSCA)",
  "CERCLA" = "Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)"
) -> acts

w125 <- scales::wrap_format(125) # help us word wrap at ~125 chars

# order the vector and turn it into wrapped lines
act_info <- w125(paste0(unname(acts[as.character(psm$act)]), collapse = "; "))

Now, we can generate the geoms. It looks like alot of code, but I like to use newlines to help structure ggplot2 calls. I still miss my old gg <- gg + idiom but RStudio makes it way too easy to execute the whole expression with just the use of + so I’ve succumbed to their behaviour modification. To break it down w/o code, we essentially need:

the arrows for each act
the 2017 and 2018 direct label values for each act
the 2017 and 2018 top “titles”
segments for ^^
title, subtitle and caption(s)

We use percent-maths to position labels and other objects so the code can be re-used for other arrow plots (hardcoding to the data values is likely fine, but you’ll end up tweaking the numbers more and wasting ~2-5m per new chart).

  # dots from 0 to minval
  geom_segment(
    aes(0, act, xend = y2018, yend = act),
    linetype = "dotted", color = "#b2b2b2", size = 0.33
  ) +

  # minval label
  geom_label(
    aes(y2018, act, label = y2018),
    label.size = 0, hjust = 1, size = 3.5, family = font_rc
  ) +

  # maxval label
  geom_label(
    aes(y2017 + (0.0015 * y2017), act, label = y2017),
    label.size = 0, hjust = 0, size = 3.5, family = font_rc
  ) +

  # the measure line+arrow
  geom_segment(
    aes(y2018, act, xend = y2017, yend = act),
    color = "#4a90e2", size = 0.75, # I pulled the color value from the original chart
    arrow = arrow(ends = "first", length = unit(5, "pt"))
  ) +

  # top of chart year (min)
  geom_label(
    data = head(psm, 1),
    aes(y2018, 9, label = "2018"),
    hjust = 0, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # top of chart year (max)
  geom_label(
    data = head(psm, 1),
    aes(y2017, 9, label = "2017"),
    hjust = 1, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # bar from top of chart year label to first minval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2018 + (0.005 * max_val), 8.5, 
      xend = y2018 + (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # bar from top of chart year label to first maxval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2017 - (0.005 * max_val), 8.5, 
      xend = y2017 - (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # fix x axis scale and place breaks
  scale_x_comma(limits = c(0, max_val), breaks = seq(0, max_val, 200)) +

  # make room for top "titles"
  scale_y_discrete(expand = c(0, 1)) +

  labs(
    y = NULL,
    title = "Decline by statute",
    subtitle = "The number of civil cases the EPA brought to conclusion has dropped across a number of federal statutes,\nincluding the Clean Air Act (CAA) and others.",
    x = act_info,
    caption = "Original Chart/Data: The Conversation, CC-BY-ND;<https://bit.ly/2VuJrOT>; Source: Environmental Data & Government Initiative <https://bit.ly/2VpcFyl>"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.x = element_text(color = ft_cols$slate)) +
  theme(axis.title.x = element_text(
    hjust = 0, size = 10, face = "italic", color = ft_cols$gray, margin = margin(t = 10)
  )) +
  theme(plot.caption = element_text(hjust = 0))

Here’s the result:

(it even looks ok in “batman” mode):

FIN

With Microsoft owning GitHub I’m not using gists anymore and the GitLab “snippets” equivalent is just too dog-slow to use, so starting in 2019 I’m self-hosing contiguous R example code used in the blog posts. For the moment, that means links to plain R files but I may just setup gitea for them sometime before the end of Q1. You can find a contiguous, commented version of the above code in here.

If you do your own makeover don’t forget to drop a link to your creation(s) in the comments!

Apache Drill 1.15.0 + sergeant 0.8.0 = pcapng Support, Proper Column Types & Mounds of New Metadata

Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture enables connectivity to multiple datastores.

To ring in the new year the Drill team knocked out a new 1.15.0 release with a cadre of new functionality including:

CROSS JOIN support
New metadata tables for files, system functiosn, and system functions
pcapng support
enhanced datetime functions
better security
and, web UI improvements

One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL|GH] and make some accommodations for some of these new features.

Ushering In A New Order

Drill REST API queries return a "columns" field and "metadata" field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite to parse the results and there’s no support for 64-bit integers in jsonlite (more on this later).

We’ll use the example from DRILL-6847 and use the example provided by Charles Givre in his Jira issue since it will let me demonstrate more of that “mostly” comment and show off another new feature:

library(sergeant) # 0.8.0 branch of sergeant on gitlab or github
library(tidyverse)

con <- src_drill("localhost")

x <- tbl(con, "cp.`employee.json`")

mutate(x, employee_id = as.integer64(employee_id)) %>% 
  mutate(position_id = as.integer64(position_id)) %>% 
  select(
    employee_id, full_name, first_name, last_name, 
    position_id, position_title
  ) -> bigint_result

The above is (logically):

SELECT 
  CAST (employee_id AS INT) AS employee_id,
  full_name,
  first_name, 
  last_name, 
  CAST (position_id AS BIGINT) AS position_id, 
  position_title 
FROM cp.`employee.json`

What do we get when we take a preview of the result?

bigint_result
## # Source:   lazy query [?? x 6]
## # Database: DrillConnection
##    employee_id full_name  first_name last_name position_id position_title 
##          <dbl> <chr>      <chr>      <chr>           <dbl> <chr>          
##  1           1 Sheri Now… Sheri      Nowmer              1 President      
##  2           2 Derrick W… Derrick    Whelply             2 VP Country Man…
##  3           4 Michael S… Michael    Spence              2 VP Country Man…
##  4           5 Maya Guti… Maya       Gutierrez           2 VP Country Man…
##  5           6 Roberta D… Roberta    Damstra             3 VP Information…
##  6           7 Rebecca K… Rebecca    Kanagaki            4 VP Human Resou…
##  7           8 Kim Brunn… Kim        Brunner            11 Store Manager  
##  8           9 Brenda Bl… Brenda     Blumberg           11 Store Manager  
##  9          10 Darren St… Darren     Stanz               5 VP Finance     
## 10          11 Jonathan … Jonathan   Murraiin           11 Store Manager  
## # ... with more rows

Warning message:
One or more columns are of type BIGINT. The sergeant package currently uses jsonlite::fromJSON()
to process Drill REST API result sets. Since jsonlite does not support 64-bit integers BIGINT 
columns are initially converted to numeric since that's how jsonlite::fromJSON() works. This is
problematic for many reasons, including trying to use 'dplyr' idioms with said converted 
BIGINT-to-numeric columns. It is recommended that you 'CAST' BIGINT columns to 'VARCHAR' prior to
working with them from R/'dplyr'.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill 
with the MapR ODBC drivers.

This informational warning will only be shown once per R session and you can disable them from 
appearing by setting the 'sergeant.bigint.warnonce' option to 'FALSE' 
(i.e. options(sergeant.bigint.warnonce = FALSE)).

The first thing sergeant users will notice is proper column order (before it just returned the columns in the order they came back in the JSON rows[] structure). The second thing is that we didn’t get integer64s back. Instead, we got doubles plus an information warning about why and what you can do about it. Said warning only displays once per-session and can be silenced with the option sergeant.bigint.warnonce. i.e. just put:

options(sergeant.bigint.warnonce = FALSE)

in your script or ~/.Rprofile and you won’t hear from it again.

The as.integer64() we used is not from the bit64 package but an internal sergeant package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT ).

You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr at the C++-level to provide direct in-package support for BIGINTs as well.

Better Error Messages

Drill query errors that the sergeant package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:

tbl(con, "cp.employees.json")
## # Source:   table<cp.employees.json> [?? x 4]
## # Database: DrillConnection

Warning message:
VALIDATION ERROR: From line 2, column 6 to line 2, column 24: Object 'cp.employees.json' not found

Original Query:

  1: SELECT *
  2: FROM `cp.employees.json`
  3: LIMIT 10

Query Profile Error Link:
http://localhost:8047/profiles/079fc8cf-19c6-4c78-95a9-0b949a3ecf4c

As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig in to the gnarly details of complex query issues. As you work with this and find edge cases I missed for messages, drop an issue on your social-coding site of choice.

SUPPORT ALL THE PCAPs!

Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] }, to the "formats" section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).

We’ll use one of the Wireshark example captures to demonstrate:

pcaps <- tbl(con, "dfs.caps.`*.pcapng`")

glimpse(pcaps)
## Observations: ??
## Variables: 25
## $ tcp_flags_ece_ecn_capable            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece_congestion_experienced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_psh                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type                                 <chr> "TCP", "TCP", "TCP", "TCP...
## $ tcp_flags_cwr                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ dst_ip                               <chr> "74.125.28.139", "10.254....
## $ src_ip                               <chr> "10.254.157.208", "74.125...
## $ tcp_flags_fin                        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags                            <int> 17, 17, 16, 16, 16, 0, 0,...
## $ tcp_flags_ack                        <int> 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ src_mac_address                      <chr> "00:05:9A:3C:7A:00", "00:...
## $ tcp_flags_syn                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_rst                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ timestamp                            <dttm> 2015-04-14 07:19:25, 201...
## $ tcp_session                          <dbl> 8.353837e+17, 8.353837e+1...
## $ packet_data                          <chr> "\"3DU...<z...E..(J.@.......
## $ tcp_parsed_flags                     <chr> "ACK|FIN", "ACK|FIN", "AC...
## $ tcp_flags_ns                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ src_port                             <int> 60268, 443, 60268, 58382,...
## $ packet_length                        <int> 54, 54, 54, 55, 66, 78, 7...
## $ tcp_flags_urg                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_ack                              <int> 662445631, 1496589825, 66...
## $ dst_port                             <int> 443, 60268, 443, 29216, 5...
## $ dst_mac_address                      <chr> "00:11:22:33:44:55", "00:...

count(pcaps, src_ip, dst_ip, sort=TRUE)
## # Source:     lazy query [?? x 3]
## # Database:   DrillConnection
## # Groups:     src_ip
## # Ordered by: desc(n)
##    src_ip         dst_ip             n
##    <chr>          <chr>          <dbl>
##  1 10.254.157.208 10.254.158.25    298
##  2 10.254.158.25  10.254.157.208   204
##  3 174.137.42.81  10.254.157.208    76
##  4 10.254.157.208 10.254.158.8      54
##  5 10.254.158.8   10.254.157.208    49
##  6 74.125.28.102  10.254.157.208    49
##  7 10.254.157.208 74.125.28.102     44
##  8 10.254.157.208 174.137.42.81     41
##  9 54.84.98.25    10.254.157.208    25
## 10 157.55.56.168  10.254.157.208    25
## # ... with more rows

More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.

Drill Metadata As Data

Drill has provided ways to lookup Drill operational information as actual tables but the Drill team has added support for even more metadata-as-data queries.

First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:

tbl(con, "information_schema.`schemata`") %>% 
  filter(SCHEMA_NAME == "dfs.caps") %>% 
  print() %>% 
  pull(SCHEMA_NAME) -> pcap_schema
## # Source:   lazy query [?? x 9]
## # Database: DrillConnection
##   CATALOG_NAME SCHEMA_NAME SCHEMA_OWNER TYPE  IS_MUTABLE
##   <chr>        <chr>       <chr>        <chr> <chr>     
## 1 DRILL        dfs.caps    <owner>      file  NO

tbl(con, "information_schema.`files`") %>% 
  filter(schema_name == pcap_schema) %>% 
  glimpse()
## Observations: ??
## Variables: 13
## $ SCHEMA_NAME       <chr> "dfs.caps"
## $ ROOT_SCHEMA_NAME  <chr> "dfs"
## $ WORKSPACE_NAME    <chr> "caps"
## $ FILE_NAME         <chr> "dof-short-capture.pcapng"
## $ RELATIVE_PATH     <chr> "dof-short-capture.pcapng"
## $ IS_DIRECTORY      <lgl> FALSE
## $ IS_FILE           <lgl> TRUE
## $ LENGTH            <dbl> 634280
## $ OWNER             <chr> "hrbrmstr"
## $ GROUP             <chr> "staff"
## $ PERMISSION        <chr> "rw-r--r--"
## $ ACCESS_TIME       <dttm> 1969-12-31 19:00:00
## $ MODIFICATION_TIME <dttm> 2019-01-01 19:12:17

The Drill system options table now has full descriptions for the options and also provides a new table that knows about all of Drills functions and all your custom UDFs. drill_opts() and drill_functions() return a data frame of all this info and have an optional browse parameter which, if set to TRUE, will show a DT interactive data table for them. I find this especially handy when I forget something like regexp_like syntax (I use alot of back-ends and many are wildly different) and can now do this:

FIN

Keep on the lookout for the rapidjsonr/BIGINT integration and more new features of the sergeant package. NOTE: The better error messages have been ported over to the sergeant.caffeinated package (the RJDBC interface) and the other niceties will make their way into that package soon as well.

So, make sure you’re using the 0.8.0 GL / GH, kick the tyres, file issues where you’re most comfortable working.

May your queries all be optimized and results sets complete in the new year!

Exploring 2018 R-bloggers & R Weekly Posts with Feedly & the ‘seymour’ package

Well, 2018 has flown by and today seems like an appropriate time to take a look at the landscape of R bloggerdom as seen through the eyes of readers of R-bloggers and R Weekly. We’ll do this via a new package designed to make it easier to treat Feedly as a data source: seymour [GL | GH] (which is a pun-ified name based on a well-known phrase from Little Shop of Horrors).

The seymour package builds upon an introductory Feedly API blog post from back in April 2018 and covers most of the “getters” in the API (i.e. you won’t be adding anything to or modifying anything in Feedly through this package unless you PR into it with said functions). An impetus for finally creating the package came about when I realized that you don’t need a Feedly account to use the search or stream endpoints. You do get more data back if you have a developer token and can also access your own custom Feedly components if you have one. If you are a “knowledge worker” and do not have a Feedly account (and, really, a Feedly Pro account) you are missing out. But, this isn’t a rah-rah post about Feedly, it’s a rah-rah post about R! Onwards!

Feeling Out The Feeds

There are a bunch of different ways to get Feedly metadata about an RSS feed. One easy way is to just use the RSS feed URL itself:

library(seymour) # git[la|hu]b/hrbrmstr/seymour
library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes
library(lubridate)
library(tidyverse)

r_bloggers <- feedly_feed_meta("http://feeds.feedburner.com/RBloggers")
r_weekly <- feedly_feed_meta("https://rweekly.org/atom.xml")
r_weekly_live <- feedly_feed_meta("https://feeds.feedburner.com/rweeklylive")

glimpse(r_bloggers)
## Observations: 1
## Variables: 14
## $ feedId      <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ id          <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ title       <chr> "R-bloggers"
## $ subscribers <int> 24518
## $ updated     <dbl> 1.546227e+12
## $ velocity    <dbl> 44.3
## $ website     <chr> "https://www.r-bloggers.com"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ language    <chr> "en"
## $ contentType <chr> "longform"
## $ description <chr> "Daily news and tutorials about R, contributed by ...

glimpse(r_weekly)
## Observations: 1
## Variables: 13
## $ feedId      <chr> "feed/https://rweekly.org/atom.xml"
## $ id          <chr> "feed/https://rweekly.org/atom.xml"
## $ title       <chr> "RWeekly.org - Blogs to Learn R from the Community"
## $ subscribers <int> 876
## $ updated     <dbl> 1.546235e+12
## $ velocity    <dbl> 1.1
## $ website     <chr> "https://rweekly.org/"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ contentType <chr> "longform"
## $ language    <chr> "en"

glimpse(r_weekly_live)
## Observations: 1
## Variables: 9
## $ id          <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ feedId      <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ title       <chr> "R Weekly Live: R Focus"
## $ subscribers <int> 1
## $ updated     <dbl> 1.5461e+12
## $ velocity    <dbl> 14.7
## $ website     <chr> "https://rweekly.org/live"
## $ language    <chr> "en"
## $ description <chr> "Live Updates from R Weekly"

Feedly uses some special terms, one of which (above) is velocity. “Velocity” is simply the average number of articles published weekly (Feedly’s platform updates that every few weeks for each feed). R-bloggers has over 24,000 Feedly subscribers so any post-rankings we do here should be fairly representative. I included both the “live” and the week-based R Weekly feeds as I wanted to compare post coverage between R-bloggers and R Weekly in terms of raw content.

On the other hand, R Weekly’s “weekly” RSS feed has less than 1,000 subscribers. WAT?! While I have mostly nothing against R-bloggers-proper I heartily encourage ardent readers to also subscribe to R Weekly and perhaps even consider switching to it (or at least adding the individual blog feeds they monitor to your own Feedly). It wasn’t until the Feedly API that I had any idea of how many folks were really viewing my R blog posts since we must provide a full post RSS feed to R-bloggers and get very little in return (at least in terms of data). R Weekly uses a link counter but redirects all clicks to the blog author’s site where we can use logs or analytics platforms to measure engagement. R Weekly is also run by a group of volunteers (more eyes == more posts they catch!) and has a Patreon where the current combined weekly net is likely not enough to buy each volunteer a latte. No ads, a great team and direct engagement stats for the community of R bloggers seems like a great deal for $1.00 USD. If you weren’t persuaded by the above rant, then perhaps at least consider installing this (from source that you control).

Lastly, I believe I’m that “1” subscriber to R Weekly Live O_o. But, I digress.

We’ve got the feedIds (which can be used as “stream” ids) so let’s get cracking!

Binding Up The Posts

We need to use the feedId in calls to feedly_stream() to get the individual posts. The API claims there’s a temporal parameter that allows one to get posts only after a certain date but I couldn’t get it to work (PRs are welcome on any community source code portal you’re most comfortable in if you’re craftier than I am). As a result, we need to make a guess as to how many calls we need to make for two of the three feeds. Basic maths of 44 * 52 / 1000 suggests ~3 should suffice for R Weekly (live) and R-bloggers but let’s do 5 to be safe. We should be able to get R Weekly (weekly) in one go.

r_weekly_wk <- feedly_stream(r_weekly$feedId)

range(r_weekly_wk$items$published) # my preview of this said it got back to 2016!
## [1] "2016-05-20 20:00:00 EDT" "2018-12-30 19:00:00 EST"

# NOTE: If this were more than 3 I'd use a loop/iterator
# In reality, I should make a helper function to do this for you (PRs welcome)

r_blog_1 <- feedly_stream(r_bloggers$feedId)
r_blog_2 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_1$continuation)
r_blog_3 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_2$continuation)

r_weekly_live_1 <- feedly_stream(r_weekly_live$feedId)
r_weekly_live_2 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_1$continuation)
r_weekly_live_3 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_2$continuation)

bind_rows(r_blog_1$items, r_blog_2$items, r_blog_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_blog_stream

bind_rows(r_weekly_live_1$items, r_weekly_live_2$items, r_weekly_live_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_weekly_live_stream

r_weekly_wk_stream <- filter(r_weekly_wk$items, published >= as.Date("2018-01-01"))

Let’s take a look:

glimpse(r_weekly_wk_stream)
## Observations: 54
## Variables: 27
## $ id                  <chr> "2nIALmjjlFcpPJKakm2k8hjka0FzpApixM7HHu8B0...
## $ originid            <chr> "https://rweekly.org/2018-53", "https://rw...
## $ fingerprint         <chr> "114357f1", "199f78d0", "9adc236e", "63f99...
## $ title               <chr> "R Weekly 2018-53 vroom, Classification", ...
## $ updated             <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ crawled             <dttm> 2018-12-31 00:51:39, 2018-12-23 23:46:49,...
## $ published           <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ alternate           <list> [<https://rweekly.org/2018-53.html, text/...
## $ canonicalurl        <chr> "https://rweekly.org/2018-53.html", "https...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ engagement          <int> 1, 5, 5, 3, 2, 3, 1, 2, 3, 2, 4, 3, 2, 2, ...
## $ engagementrate      <dbl> 0.33, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ content_content     <chr> "<p>Hello and welcome to this new issue!</...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://rweekly.org/atom.xml", "feed...
## $ origin_title        <chr> "RWeekly.org - Blogs to Learn R from the C...
## $ origin_htmlurl      <chr> "https://rweekly.org/", "https://rweekly.o...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://github.com/rweekly/image/raw/mast...
## $ visual_width        <int> 372, 672, 1000, 1000, 1000, 1001, 1000, 10...
## $ visual_height       <int> 479, 480, 480, 556, 714, 624, 237, 381, 36...
## $ visual_contenttype  <chr> "image/png", "image/png", "image/gif", "im...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_weekly_live_stream)
## Observations: 1,333
## Variables: 27
## $ id                  <chr> "rhkRVQ8KjjGRDQxeehIj6RRIBGntdni0ZHwPTR8B3...
## $ originid            <chr> "https://link.rweekly.org/ckb", "https://l...
## $ fingerprint         <chr> "c11a0782", "c1897fc3", "c0b36206", "7049e...
## $ title               <chr> "Top Tweets of 2018", "My #Best9of2018 twe...
## $ crawled             <dttm> 2018-12-29 11:11:52, 2018-12-28 11:24:22,...
## $ published           <dttm> 2018-12-28 19:00:00, 2018-12-27 19:00:00,...
## $ canonical           <list> [<https://link.rweekly.org/ckb, text/html...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RWeeklyL...
## $ unread              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ tags                <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ canonicalurl        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ampurl              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ cdnampurl           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ engagement          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ summary_content     <chr> "<p>maraaverick.rbind.io</p><img width=\"1...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://feeds.feedburner.com/rweekly...
## $ origin_title        <chr> "R Weekly Live: R Focus", "R Weekly Live: ...
## $ origin_htmlurl      <chr> "https://rweekly.org/live", "https://rweek...
## $ visual_url          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_processor    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_width        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_height       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_contenttype  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_blog_stream)
## Observations: 2,332
## Variables: 34
## $ id                  <chr> "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP...
## $ keywords            <list> ["R bloggers", "R bloggers", "R bloggers"...
## $ originid            <chr> "https://datascienceplus.com/?p=19513", "h...
## $ fingerprint         <chr> "2f32071a", "332f9548", "2e6f8adb", "3d7ed...
## $ title               <chr> "Leaf Plant Classification: Statistical Le...
## $ crawled             <dttm> 2018-12-30 22:35:22, 2018-12-30 19:01:25,...
## $ published           <dttm> 2018-12-30 19:26:20, 2018-12-30 13:18:00,...
## $ canonical           <list> [<https://www.r-bloggers.com/leaf-plant-c...
## $ author              <chr> "Giorgio Garziano", "Sascha W.", "Economet...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RBlogger...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ entities            <list> [<c("nlp/f/entity/en/-/leaf plant classif...
## $ engagement          <int> 50, 39, 482, 135, 33, 12, 13, 41, 50, 31, ...
## $ engagementrate      <dbl> 1.43, 0.98, 8.76, 2.45, 0.59, 0.21, 0.22, ...
## $ enclosure           <list> [NULL, NULL, NULL, NULL, <c("https://0.gr...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ updatecount         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ content_content     <chr> "<p><div><div><div><div data-show-faces=\"...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ summary_content     <chr> "CategoriesAdvanced Modeling\nTags\nLinear...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/http://feeds.feedburner.com/RBlogger...
## $ origin_title        <chr> "R-bloggers", "R-bloggers", "R-bloggers", ...
## $ origin_htmlurl      <chr> "https://www.r-bloggers.com", "https://www...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://i0.wp.com/datascienceplus.com/wp-...
## $ visual_width        <int> 383, 400, NA, 286, 456, 250, 450, 456, 397...
## $ visual_height       <int> 309, 300, NA, 490, 253, 247, 450, 253, 333...
## $ visual_contenttype  <chr> "image/png", "image/png", NA, "image/png",...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

And also check how far into December for each did I get as of this post? (I’ll check again after the 31 and update if needed).

range(r_weekly_wk_stream$published)
## [1] "2018-01-07 19:00:00 EST" "2018-12-30 19:00:00 EST"

range(r_blog_stream$published)
## [1] "2018-01-01 11:00:27 EST" "2018-12-30 19:26:20 EST"

range(r_weekly_live_stream$published)
## [1] "2018-01-01 19:00:00 EST" "2018-12-28 19:00:00 EST"

Digging Into The Weeds Feeds

In the above glimpses there’s another special term, engagement. Feedly defines this as an “indicator of how popular this entry is. The higher the number, the more readers have read, saved or shared this particular entry”. We’ll use this to look at the most “engaged” content in a bit. What’s noticeable from the start is that R Weekly Live has 1,333 entries and R-bloggers has 2,330 entries (so, nearly double the number of entries). Those counts are a bit of “fake news” when it comes to overall unique posts as can be seen by:

bind_rows(
  mutate(r_weekly_live_stream, src = "R Weekly (Live)"),
  mutate(r_blog_stream, src = "R-bloggers")
) %>% 
  mutate(wk = lubridate::week(published)) -> y2018

filter(y2018, title == "RcppArmadillo 0.9.100.5.0") %>% 
  select(src, title, originid, published) %>% 
  gt::gt()

src	title	originid	published
R Weekly (Live)	RcppArmadillo 0.9.100.5.0	https://link.rweekly.org/bg6	2018-08-17 07:55:00
R Weekly (Live)	RcppArmadillo 0.9.100.5.0	https://link.rweekly.org/bfr	2018-08-16 21:20:00
R-bloggers	RcppArmadillo 0.9.100.5.0	https://www.r-bloggers.com/?guid=f8865e8a004f772bdb64e3c4763a0fe5	2018-08-17 08:00:00
R-bloggers	RcppArmadillo 0.9.100.5.0	https://www.r-bloggers.com/?guid=3046299f73344a927f787322c867233b	2018-08-16 21:20:00

Feedly has many processes going on behind the scenes to identify new entries and update entries as original sources are modified. This “duplication” (thankfully) doesn’t happen alot:

count(y2018, src, wk, title, sort=TRUE) %>% 
  filter(n > 1) %>% 
  arrange(wk) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("wk", "n"), decimals = 0)

src	wk	title	n
R-bloggers	3	conapomx data package	2
R Weekly (Live)	5	R in Latin America	2
R Weekly (Live)	12	Truncated Poisson distributions in R and Stan by @ellis2013nz	2
R Weekly (Live)	17	Regression Modeling Strategies	2
R Weekly (Live)	18	How much work is onboarding?	2
R Weekly (Live)	18	Survey books, courses and tools by @ellis2013nz	2
R-bloggers	20	Beautiful and Powerful Correlation Tables in R	2
R Weekly (Live)	24	R Consortium is soliciting your feedback on R package best practices	2
R Weekly (Live)	33	RcppArmadillo 0.9.100.5.0	2
R-bloggers	33	RcppArmadillo 0.9.100.5.0	2
R-bloggers	39	Individual level data	2
R Weekly (Live)	41	How R gets built on Windows	2
R Weekly (Live)	41	R Consortium grant applications due October 31	2
R Weekly (Live)	41	The Economist’s Big Mac Index is calculated with R	2
R Weekly (Live)	42	A small logical change with big impact	2
R Weekly (Live)	42	Maryland’s Bridge Safety, reported using R	2
R-bloggers	47	OneR – fascinating insights through simple rules	2

In fact, it happens infrequently enough that I’m going to let the “noise” stay in the data since Feedly technically is tracking some content change.

Let’s look at the week-over-week curation counts (neither source publishes original content, so using the term “published” seems ill fitting) for each:

count(y2018, src, wk) %>% 
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0, color = src), show.legend = FALSE) +
  facet_wrap(~src, ncol=1, scales="free_x") + 
  labs(
    x = "Week #", y = "# Posts", 
    title = "Weekly Post Curation Stats for R-bloggers & R Weekly (Live)"
  ) +
  theme_ft_rc(grid="Y")

Despite R-bloggers having curated more overall content, there’s plenty to read each week for consumers of either/both aggregators.

Speaking of consuming, let’s look at the distribution of engagement scores for both aggregators:

group_by(y2018, src) %>% 
  summarise(v = list(broom::tidy(summary(engagement)))) %>% 
  unnest()
## # A tibble: 2 x 8
##   src             minimum    q1 median  mean    q3 maximum    na
##   <chr>             <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 R Weekly (Live)       0     0    0     0       0       0  1060
## 2 R-bloggers            1    16   32.5  58.7    70    2023    NA

Well, it seems that it’s more difficult for Feedly to track engagement for the link-only R Weekly (Live) feed, so we’ll have to focus on R-bloggers for engagement views. Summary values are fine, but we can get a picture of the engagement distribution (we’ll do it monthly to get a bit more granularity, too):

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>% 
  ggplot(aes(month, engagement)) +
  geom_violin() +
  ggbeeswarm::geom_quasirandom(
    groupOnX = TRUE, size = 2, color = "#2b2b2b", fill = ft_cols$green,
    shape = 21, stroke = 0.25
  ) +
  scale_y_comma(trans = "log10") +
  labs(
    x = NULL, y = "Engagement Score",
    title = "Monthly Post Engagement Distributions for R-bloggers Curated Posts",
    caption = "NOTE: Y-axis log10 Scale"
  ) +
  theme_ft_rc(grid="Y")

I wasn’t expecting each month’s distribution to be so similar. There are definitely outliers in terms of positive engagement so we should be able see what types of R-focused content piques the interest of the ~25,000 Feedly subscribers of R-bloggers.

filter(y2018, src == "R-bloggers") %>% 
  group_by(author) %>% 
  summarise(n_posts = n(), total_eng = sum(engagement), avg_eng = mean(engagement), med_eng = median(engagement)) %>% 
  arrange(desc(n_posts)) %>% 
  slice(1:20) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("n_posts", "total_eng", "avg_eng", "med_eng"), decimals = 0)

author	n_posts	total_eng	avg_eng	med_eng
David Smith	116	9,791	84	47
John Mount	94	4,614	49	33
rOpenSci – open tools for open science	89	2,967	33	19
Thinking inside the box	85	1,510	18	14
R Views	60	4,142	69	47
hrbrmstr	55	1,179	21	16
Dr. Shirin Glander	54	2,747	51	25
xi’an	49	990	20	12
Mango Solutions	42	1,221	29	17
Econometrics and Free Software	33	2,858	87	60
business-science.io – Articles	31	4,484	145	70
NA	31	1,724	56	40
statcompute	29	1,329	46	33
Ryan Sheehy	25	1,271	51	45
Keith Goldfeld	24	1,305	54	43
free range statistics – R	23	440	19	12
Jakob Gepp	21	348	17	13
Tal Galili	21	1,587	76	22
Jozef’s Rblog	18	1,617	90	65
arthur charpentier	16	1,320	82	68

It is absolutely no surprise David comes in at number one in both post count and almost every engagement summary statistic since he’s a veritable blogging machine and creates + curates some super interesting content (whereas your’s truly doesn’t even make the median engagement cut ?).

What were the most engaging posts?

filter(y2018, src == "R-bloggers") %>% 
  arrange(desc(engagement)) %>% 
  mutate(published = as.Date(published)) %>% 
  select(engagement, title, published, author) %>% 
  slice(1:50) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("engagement"), decimals = 0)

engagement	title	published	author
2,023	Happy Birthday R	2018-08-27	eoda GmbH
1,132	15 Types of Regression you should know	2018-03-25	ListenData
697	R and Python: How to Integrate the Best of Both into Your Data Science Workflow	2018-10-08	business-science.io – Articles
690	Ultimate Python Cheatsheet: Data Science Workflow with Python	2018-11-18	business-science.io – Articles
639	Data Analysis with Python Course: How to read, wrangle, and analyze data	2018-10-31	Andrew Treadway
617	Machine Learning Results in R: one plot to rule them all!	2018-07-18	Bernardo Lares
614	R tip: Use Radix Sort	2018-08-21	John Mount
610	Data science courses in R (/python/etc.) for $10 at Udemy (Sitewide Sale until Aug 26th)	2018-08-24	Tal Galili
575	Why R for data science – and not Python?	2018-12-02	Learning Machines
560	Case Study: How To Build A High Performance Data Science Team	2018-09-18	business-science.io – Articles
516	R 3.5.0 is released! (major release with many new features)	2018-04-24	Tal Galili
482	R or Python? Why not both? Using Anaconda Python within R with {reticulate}	2018-12-30	Econometrics and Free Software
479	Sankey Diagram for the 2018 FIFA World Cup Forecast	2018-06-10	Achim Zeileis
477	5 amazing free tools that can help with publishing R results and blogging	2018-12-22	Jozef’s Rblog
462	What’s the difference between data science, machine learning, and artificial intelligence?	2018-01-09	David Robinson
456	XKCD “Curve Fitting”, in R	2018-09-28	David Smith
450	The prequel to the drake R package	2018-02-06	rOpenSci – open tools for open science
449	Who wrote that anonymous NYT op-ed? Text similarity analyses with R	2018-09-07	David Smith
437	Elegant regression results tables and plots in R: the finalfit package	2018-05-16	Ewen Harrison
428	How to implement neural networks in R	2018-01-12	David Smith
426	Data transformation in #tidyverse style: package sjmisc updated #rstats	2018-02-06	Daniel
413	Neural Networks Are Essentially Polynomial Regression	2018-06-20	matloff
403	Custom R charts coming to Excel	2018-05-11	David Smith
379	A perfect RStudio layout	2018-05-22	Ilya Kashnitsky
370	Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 1: Basics	2018-10-25	Mel Moreno and Mathieu Basille
368	The Financial Times and BBC use R for publication graphics	2018-06-27	David Smith
367	Dealing with The Problem of Multicollinearity in R	2018-08-16	Perceptive Analytics
367	Excel is obsolete. Here are the Top 2 alternatives from R and Python.	2018-03-13	Appsilon Data Science Blog
365	New R Cheatsheet: Data Science Workflow with R	2018-11-04	business-science.io – Articles
361	Tips for analyzing Excel data in R	2018-08-30	David Smith
360	Importing 30GB of data in R with sparklyr	2018-02-16	Econometrics and Free Software
358	Scraping a website with 5 lines of R code	2018-01-24	David Smith
356	Clustering the Bible	2018-12-27	Learning Machines
356	Finally, You Can Plot H2O Decision Trees in R	2018-12-26	Gregory Kanevsky
356	Geocomputation with R – the afterword	2018-12-12	Rstats on Jakub Nowosad’s website
347	Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R	2018-04-18	business-science.io – Articles
343	Run Python from R	2018-03-27	Deepanshu Bhalla
336	Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)	2018-07-24	Bernardo Lares
332	R Generation: 25 Years of R	2018-08-01	David Smith
329	How to extract data from a PDF file with R	2018-01-05	Packt Publishing
325	R or Python? Python or R? The ongoing debate.	2018-01-28	tomaztsql
322	How to perform Logistic Regression, LDA, & QDA in R	2018-01-05	Prashant Shekhar
321	Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity	2018-09-06	David Robinson
311	Intuition for principal component analysis (PCA)	2018-12-06	Learning Machines
310	Packages for Getting Started with Time Series Analysis in R	2018-02-18	atmathew
309	Announcing the R Markdown Book	2018-07-13	Yihui Xie
307	Automated Email Reports with R	2018-11-01	JOURNEYOFANALYTICS
304	future.apply – Parallelize Any Base R Apply Function	2018-06-23	JottR on R
298	How to build your own Neural Network from scratch in R	2018-10-09	Posts on Tychobra
293	RStudio 1.2 Preview: SQL Integration	2018-10-02	Jonathan McPherson

Weekly & monthly curated post descriptive statstic patterns haven’t changed much since the April post:

filter(y2018, src == "R-bloggers") %>% 
  mutate(wkday = lubridate::wday(published, label = TRUE, abbr = TRUE)) %>%
  count(wkday) %>% 
  ggplot(aes(wkday, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Day-of-week Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month) %>% 
  ggplot(aes(month, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Monthly Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

Surprisingly, monthly post count consistency (or even posting something each month) is not a common trait amongst the top 20 (by total engagement) authors:

w20 <- scales::wrap_format(20)

filter(y2018, src == "R-bloggers") %>% 
  filter(!is.na(author)) %>% # some posts don't have author attribution
  mutate(author_t = map_chr(w20(author), paste0, collapse="\n")) %>% # we need to wrap for facet titles (below)
  count(author, author_t, wt=engagement, sort=TRUE) %>% # get total author engagement
  slice(1:20) %>% # top 20
  { .auth_ordr <<- . ; . } %>% # we use the order later
  left_join(filter(y2018, src == "R-bloggers"), "author") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month, author_t, sort = TRUE) %>% 
  mutate(author_t = factor(author_t, levels = .auth_ordr$author_t)) %>% 
  ggplot(aes(month, nn, author_t)) +
  geom_col(width = 0.5) +
  scale_x_discrete(labels=substring(month.abb, 1, 1)) +
  scale_y_comma() +
  facet_wrap(~author_t) +
  labs(
    x = NULL, y = "Curated Post Count",
    title = "Monthly Curated Post Counts-per-Author (Top 20 by Engagement)",
    subtitle = "Arranged by Total Author Engagement"
  ) +
  theme_ft_rc(grid="yY")

Overall, most authors favor shorter titles for their posts:

filter(y2018, src == "R-bloggers") %>% 
  mutate(
    `Character Count Distribution` = nchar(title), 
    `Word Count Distribution` = stringi::stri_count_boundaries(title, type = "word")
  ) %>% 
  select(id, `Character Count Distribution`, `Word Count Distribution`) %>% 
  gather(measure, value, -id) %>% 
  ggplot(aes(value)) +
  ggalt::geom_bkde(alpha=1/3, color = ft_cols$slate, fill = ft_cols$slate) +
  scale_y_continuous(expand=c(0,0)) +
  facet_wrap(~measure, scales = "free") +
  labs(
    x = NULL, y = "Density",
    title = "Title Character/Word Count Distributions",
    subtitle = "~38 characters/11 words seems to be the sweet spot for most authors",
    caption = "Note Free X/Y Scales"
  ) +
  theme_ft_rc(grid="XY")

This post is already kinda tome-length so I’ll leave it to y’all to grab the data and dig in a bit more.

A Word About Using The `content_content` Field For R-bloggers Posts

Since R-bloggers requires a full feed from contributors, they, in-turn, post a “kinda” full-feed back out. I say “kinda” as they still haven’t fixed a reported bug in their processing engine which causes issues in (at least) Feedly’s RSS processing engine. If you use Feedly, take a look at the R-bloggers RSS feed entry for the recent “R or Python? Why not both? Using Anaconda Python within R with {reticulate}” post. It cuts off near “Let’s check its type:”. This is due to the way the < character is processed by the R-bloggers ingestion engine which turns the ## <class 'pandas.core.frame.DataFrame'> in the original post and doesn’t even display right on the R-bloggers page as it mangles the input and turns the descriptive output into an actuall <class> tag: <class 'pandas.core.frame.dataframe'=""></class>. It’s really an issue on both sides, but R-bloggers is doing the mangling and should seriously consider addressing it in 2019.

Since it is still not fixed, it forces you to go to R-bloggers (clicks FTW? and may partly explain why that example post has a 400+ engagement score) unless you scroll back up to the top of the Feedly view and go to the author’s blog page. Given that tibble output invariably has a < right up top, your best bet for getting more direct views of your own content is to get a code-block with printed ## < output in it as close to the beginning as possible (perhaps start each post with a print(tbl_df(mtcars)))? ?).

Putting post-view-hacking levity aside, this content mangling means you can’t trust the content_content column in the stream data frame to have all the content; that is, if you were planning on taking the provided data and doing some topic clustering or content-based feature extraction for other stats/ML ops you’re out of luck and need to crawl the original site URLs on your own to get the main content for such analyses.

A Bit More About seymour

The seymour package has the following API functions:

feedly_access_token: Retrieve the Feedly Developer Token
feedly_collections: Retrieve Feedly Connections
feedly_feed_meta: Retrieve Metadata for a Feed
feedly_opml: Retrieve Your Feedly OPML File
feedly_profile: Retrieve Your Feedly Profile
feedly_search_contents: Search content of a stream
feedly_search_title: Find feeds based on title, url or ‘#topic’
feedly_stream: Retrieve contents of a Feedly “stream”
feedly_tags: Retrieve List of Tags

along with following helper function (which we’ll introduce in a minute):

render_stream: Render a Feedly Stream Data Frame to RMarkdown

and, the following helper reference (as Feedly has some “universal” streams):

global_resource_ids: Global Resource Ids Helper Reference

The render_stream() function is semi-useful on its own but was designed as more of a “you may want to replicate this on your own” (i.e. have a look at the source code and riff off of it). “Streams” are individual feeds, collections or even “boards” you make and with this new API package and the power of R Markdown, you can make your own “newsletter” like this:

fp <- feedly_profile() # get profile to get my id

# use the id to get my "security" category feed in my feedly
fs <- feedly_stream(sprintf("user/%s/category/security", fp$id))

# get the top 10 items with engagement >= third quartile of all posts
# and don't include duplicates in the report
mutate(fs$items, published = as.Date(published)) %>% 
  filter(published >= as.Date("2018-12-01")) %>%
  filter(engagement > fivenum(engagement)[4]) %>% 
  filter(!is.na(summary_content)) %>% 
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>% 
  distinct(alt_url, .keep_all = TRUE) %>% 
  slice(1:10) -> for_report

# render the report
render_stream(
  feedly_stream = for_report, 
  title = "Cybersecurity News", 
  include_visual = TRUE,
  browse = TRUE
)

Which makes the following Rmd and HTML. (So, no need to “upgrade” to “Teams” to make newsletters!).

FIN

As noted, the 2018 data for R Weekly (Live) & R-bloggers is available and you can find the seymour package on [GL | GH].

If you’re not a Feedly user I strongly encourage you to give it a go! And, if you don’t subscribe to R Weekly, you should make that your first New Year’s Resolution.

Here’s looking to another year of great R content across the R blogosphere!

Certifiably Gone Phishing

2018-12-23 – 17:28
Posted in Cybersecurity, Encryption, Information Security, R
Comments (3)

Phishing is [still] the primary way attackers either commit a primary criminal act (i.e. phish a target to, say, install ransomware) or is the initial vehicle used to gain a foothold in an organization so they can perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank which is a collaborative clearing house for data and information about phishing on the Internet. Also, PhishTank provides an open API for developers and researchers to integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations the data is also useful for security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns which include (amongst many other elements) network (internet) location of the phishing site, industry being targeted, domain names being used, what type of sites are being cloned/copied and a feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and — of these — which type of SSL certificates are “en vogue”.

Phishing sites are increasingly using and relying on SSL certificates because we in the information security industry spent a decade instructing the general internet surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties to install phishing sites/pages since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt have made it possible for attacker to setup their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decade’s old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put you API key into the PHISHTANK_API_KEY environment variable which is best done via your ~/.Renviron file).

Let’s setup all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to utilize your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either using the time-stamped approach below or using some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl) # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium) # git[la|hu]b/hrbrmstr/aquarium
library(gt) # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below. Now, we get a copy of the current PhishTank dataset & cache it:

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf)
## Observations: 16,446
## Variables: 9
## $ phish_id          <chr> "5884184", "5884138", "5884136", "5884135", ...
## $ url               <chr> "http://internetbanking-bancointer.com.br/lo...
## $ phish_detail_url  <chr> "http://www.phishtank.com/phish_detail.php?p...
## $ submission_time   <dttm> 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2...
## $ verified          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ verification_time <dttm> 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2...
## $ online            <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ details           <list> [<209.132.252.7, 209.132.252.0/24, 7296 468...
## $ target            <chr> "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign the URL of the site along with a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1])
## List of 1
##  $ :'data.frame':    1 obs. of  6 variables:
##   ..$ ip_address        : chr "209.132.252.7"
##   ..$ cidr_block        : chr "209.132.252.0/24"
##   ..$ announcing_network: chr "7296 468"
##   ..$ rir               : chr "arin"
##   ..$ country           : chr "US"
##   ..$ detail_time       : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent

Let’s ee how many are using SSL:

(ssl_ct)
## [1] 383

(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than a recent “50% of all phishing sites use encryption” statistic going around of late. There are many reasons for the difference:

PhishTank doesn’t have all phishing sites in it
We just looked at a week of examples
Some sites were offline at the time of access attempt
Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the 20% deviation, 30% is still a decent percentage, and a green, “everything’s ??” icon is a still a valued prize so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation that so we’ll grab them in parallel. It’s also quite possible the “online”status above data frame glimpse is inaccurate (sites can go offline quickly) so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {

  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")

} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25

(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to the blog, when attackers want to use SSL for the lock icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top p”apex” domains](https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE)
## # A tibble: 255 x 2
##    apex                              n
##    <chr>                         <int>
##  1 000webhostapp.com                42
##  2 google.com                       17
##  3 umbler.net                        8
##  4 sharepoint.com                    6
##  5 com-fl.cz                         5
##  6 lbcpzonasegurabeta-viabcp.com     4
##  7 windows.net                       4
##  8 ashaaudio.net                     3
##  9 brijprints.com                    3
## 10 portaleisp.com                    3
## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here is usually pointing to) Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart give the pervasive trust associated with those properties). There are 241 distinct apex domains in this 1-week set so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parswers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN	n	pct
DST Root CA X3	96	26.82%
COMODO RSA Certification Authority	93	25.98%
DigiCert Global Root G2	45	12.57%
Baltimore CyberTrust Root	30	8.38%
GlobalSign	27	7.54%
DigiCert Global Root CA	15	4.19%
Go Daddy Root Certificate Authority – G2	14	3.91%
COMODO ECC Certification Authority	11	3.07%
Actalis Authentication Root CA	9	2.51%
GlobalSign Root CA	4	1.12%
Amazon Root CA 1	3	0.84%
Let’s Encrypt Authority X3	3	0.84%
AddTrust External CA Root	2	0.56%
DigiCert High Assurance EV Root CA	2	0.56%
USERTrust RSA Certification Authority	2	0.56%
GeoTrust Global CA	1	0.28%
SecureTrust CA	1	0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Now, Comodo is not far behind and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide and it’s important for you to read the comments near the lines of code making assumptions about order of returned issuer information above. Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN	n	pct
Let’s Encrypt Authority X3	99	27.65%
cPanel\, Inc. Certification Authority	75	20.95%
RapidSSL TLS RSA CA G1	45	12.57%
Google Internet Authority G3	24	6.70%
COMODO RSA Domain Validation Secure Server CA	20	5.59%
CloudFlare Inc ECC CA-2	18	5.03%
Go Daddy Secure Certificate Authority – G2	14	3.91%
COMODO ECC Domain Validation Secure Server CA 2	11	3.07%
Actalis Domain Validation Server CA G1	9	2.51%
RapidSSL RSA CA 2018	9	2.51%
Microsoft IT TLS CA 1	6	1.68%
Microsoft IT TLS CA 5	6	1.68%
DigiCert SHA2 Secure Server CA	5	1.40%
Amazon	3	0.84%
GlobalSign CloudSSL CA – SHA256 – G3	2	0.56%
GTS CA 1O1	2	0.56%
AlphaSSL CA – SHA256 – G2	1	0.28%
DigiCert SHA2 Extended Validation Server CA	1	0.28%
DigiCert SHA2 High Assurance Server CA	1	0.28%
Don Dominio / MrDomain RSA DV CA	1	0.28%
GlobalSign Extended Validation CA – SHA256 – G3	1	0.28%
GlobalSign Organization Validation CA – SHA256 – G2	1	0.28%
RapidSSL SHA256 CA	1	0.28%
TrustAsia TLS RSA CA	1	0.28%
USERTrust RSA Domain Validation Secure Server CA	1	0.28%
NA	1	0.28%

LE is number one again! But, it’s important to note that these issuer CommonNames can roll up into a single issuing organization given just how messed up integrity and encryption capability is when it comes to web site certs, so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers).

FIN

There are tons of avenues to explore with this data, so I hope this post whet your collective appetites sufficiently for you to dig into it, especially if you have some dowm-time coming.

Let me also take this opportunity to resissue guidance I and many others have uttered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind the form about to enter your personal information and credit card info into.

More “Scraping Ethics Gone Awry” and “Why Do This When There’s a Free API?”

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this).

Today, they sent me this gem by @JeromeDeveloper: Scrapy and Scrapyrt: how to create your own API from (almost) any website. Go ahead and click it. Give the Medium author the ? they so desperately crave (and to provide context for the rant below).

I have no issue with @JeromeDeveloper’s coding prowess, nor Scrapy/Scrapyrt. In fact, I’m a huge fan of the folks at ScrapingHub, so much so that I wrote splashr to enable use of their Splash server from R.

My issue is with the example the author chose to use.

CoinMarketCap provides cryptocurrency prices and other cryptocurrency info. I use it to track cryptocurrency prices to see which currency attackers who pwn devices to install illegal cryptocurrency rigs might be switching to next and to get a feel for when they’ll stop mining and go back to just stealing data and breaking things.

CoinMarketCap has an API with a generous free tier with the following text in their Terms & Conditions (which, in the U.S. [soon] may stupidly be explicitly repeated & required on each page that scraping is prohibited on vs a universal site link):

You may not, and shall not, copy, reproduce, download, “screen scrape”, store, transmit, broadcast, publish, modify, create a derivative work from, display, perform, distribute, redistribute, sell, license, rent, lease or otherwise use, transfer (either in printed, electronic or other format) or exploit any Content, in whole or in part, in any way that does not comply with these Terms without our prior written permission.

There is only one reason (apart from complete oblivion) to use CoinMarketCap as an example: to show folks how clever you are at bypassing site restrictions and eventually avoiding paying for an API to get data that you did absolutely nothing to help gather, curate and setup infrastructure for. There is no mention of “be sure what you are doing is legal/ethical”, just a casual caution to not abuse the Scrapyrt technology since it may get you banned.

Ethics matter across every area of “data science” (of which, scraping is one component). Just because you can do something doesn’t mean you should and just because you don’t like Terms & Conditions and want to grift the work of others for fun, profit & ? also doesn’t mean you should; and, it definitely doesn’t mean you should be advocating others do it as well.

Ironically, Medium itself places restrictions on what you can do:

Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.

yet they advocated I read and heed a post which violates similar terms of another site. So I wonder how they’d feel if I did a riff of that post and showed how to setup a hackish-API to scrape all their content. O_o

FIN