Earlier today, @noamross posted to Twitter:

The answer was a 1:1 “file upload” curl to httr translation:

httr::POST(
  url = "https://file.io",
  encode = "multipart",
  body = list(file = httr::upload_file("/some/path/to/file"))
)

but I wanted to do more than that since Noam took 20 minutes out of his day this week (with no advance warning) to speak to my intro-to-stats class about his work and R.

The Twitter request was (ultimately) a question on how to use R to post content to https://file.io. They have a really simple API, and the timespan from Noam’s request to the initial commit of a fully functional package was roughly 17 minutes. The end product included the ability to post files, strings and R data (something that seemed like a good thing to add).
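For the curious, posting in-memory R data boils down to serializing it to a temporary file and reusing the multipart pattern above (a sketch, not necessarily the package's internals):

# sketch: share an in-memory R object via file.io by serializing it first
# (mirrors the multipart POST above; the actual fileio internals may differ)
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp)

res <- httr::POST(
  url = "https://file.io",
  encode = "multipart",
  body = list(file = httr::upload_file(tmp))
)

httr::content(res) # a list with the success/key/link/expiry fields shown further below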

Not too long after came a v0.1.0 release complete with tests and passing CRAN checks on all platforms.

Noam also suggested I do a screencast:

I don’t normally do screencasts but had some conference call time so folks can follow along at home:

That’s not the best screencast in the world, but it’s very representative of the workflow I used. A great deal of boilerplate package machinations is accomplished with this bash script.

I wasn’t happy with the hurried function name choices I made nor was I thrilled with the package title, description, tests and basic docs, so I revamped all those into another release. That took a while, mostly due to constantly triggering API warnings about being rate-limited.

So, if you have a file of 5 GB or less, a character vector or in-memory R data you’d like to ephemerally share with others, take the fileio package for a spin:

devtools::install_github("hrbrmstr/fileio")

fileio::fi_post_text("TWFrZSBzdXJlIHRvIEAgbWUgb24gVHdpdHRlciBpZiB5b3UgZGVjb2RlIHRoaXM=")
##   success    key                   link  expiry
## 1    TRUE n18ZSB https://file.io/n18ZSB 14 days

(bonus points if you can figure out what that seemingly random sequence of characters says).

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I use JDBC and Java outside of R, so I’m comfortable with Java, JDBC and keeping R working with Java. I notified the Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC and macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java, and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc + dplyr connection almost instantly:

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib", 
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "elb_logs")

some_tbl
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp    elbname requestip  requestport backendip backendport
##    <chr>        <chr>   <chr>            <int> <chr>           <int>
##  1 2014-09-26T… lb-demo 249.6.80.…        5123 249.6.80…        8888
##  2 2014-09-26T… lb-demo 246.22.15…        5123 248.178.…        8888
##  3 2014-09-26T… lb-demo 248.179.3…       45667 254.70.2…         443
##  4 2014-09-26T… lb-demo 243.2.127…       14496 248.178.…          80
##  5 2014-09-26T… lb-demo 247.76.18…        6887 252.0.81…        8888
##  6 2014-09-26T… lb-demo 254.110.3…       22052 248.178.…        8888
##  7 2014-09-26T… lb-demo 249.113.2…       24902 245.241.…        8888
##  8 2014-09-26T… lb-demo 246.128.7…        5123 244.202.…        8888
##  9 2014-09-26T… lb-demo 249.6.80.…       24902 255.226.…        8888
## 10 2014-09-26T… lb-demo 253.102.6…        6887 246.22.1…        8888
## # ... with more rows, and 10 more variables:
## #   requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <chr>,
## #   backendresponsecode <chr>, receivedbytes <S3: integer64>,
## #   sentbytes <S3: integer64>, requestverb <chr>, url <chr>,
## #   protocol <chr>

The TL;DR is that I can now use 100% dplyr idioms with Athena instead of having to bolt that support onto the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
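To make the dplyr point concrete, here's the kind of thing that now “just works” (a quick sketch against the elb_logs table shown above):

# tally ELB responses by backend response code, then pull the result locally
some_tbl %>%
  count(elbname, backendresponsecode, sort = TRUE) %>%
  collect()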

The downside is that I’m now even more likely to run up the AWS bill ;-)

What About Drill?

I also maintain the sergeant package, which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago, which is why I made the package (it was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "CLUSTERID",
  ZkQuorum = "HOST:2181",
  AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con

(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id  
##    <S3: integer> <chr>        <chr>      <chr>     <S3: integer> <chr>            <S3: inte>
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0         
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0         
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0         
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0         
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0         
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0         
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9         
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21        
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0         
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1         
## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n              
##    <chr>                     <S3: integer64>
##  1 Store Temporary Checker   268            
##  2 Store Temporary Stocker   264            
##  3 Store Permanent Checker   226            
##  4 Store Permanent Stocker   222            
##  5 Store Shift Supervisor    52             
##  6 Store Permanent Butcher   32             
##  7 Store Manager             24             
##  8 Store Assistant Manager   24             
##  9 Store Information Systems 16             
## 10 HQ Finance and Accounting 8              
## # ... with more rows

Apart from having to do that sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ve updated my ongoing Using Apache Drill with R e-book to include this information. I will keep maintaining the existing sergeant package but will also be including some additional methods to provide ODBC usage guidance and, potentially, other helpers if any “gotchas” arise.

FIN

The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. For some of our needs, though, hand-crafted queries will still be necessary since they are far more optimized than what would likely get pieced together via dplyr verbs. However, those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr verb magic afterwards.
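In other words, something like this (a sketch — the query itself is just illustrative, with column names taken from the elb_logs output above):

# hand-tuned SQL goes into sql(); dplyr verbs can pick up from there
hand_tuned <- tbl(con, sql("
  SELECT elbname, requestip, backendprocessingtime
  FROM elb_logs
  WHERE elbresponsecode = '200'
"))

hand_tuned %>%
  group_by(elbname) %>%
  summarise(avg_backend = mean(backendprocessingtime, na.rm = TRUE))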

Today is, indeed, a good day to query!

This week’s edition of Data is Plural had two really fun data sets. One is serious fun (the first comprehensive data set on U.S. evictions), and the other I knew about but had forgotten: the Federal Register Executive Order (EO) data set(s).

The EO data is also comprehensive as the summary JSON (or CSV) files have links to more metadata and even more links to the full-text in various formats.

What follows is a quick post to help bootstrap folks who may want to do some tidy text mining on this data. We’ll look at EOs-per-year (per-POTUS) and also take a look at the “top 5 first words” in the titles of the EOs (also by POTUS).

Ingesting the Data

The EO main page has a list of EO JSON files by POTUS. We’re going to scrape this so we can classify the EOs by POTUS (we could also just use the Federal Register API since @thosjleeper wrote a spiffy package to access it):

library(rvest)
library(stringi)
library(pluralize) # devtools::install_github("hrbrmstr/pluralize")
library(hrbrthemes)
library(tidyverse)

#' Retrieve the Federal Register main EO page so we can get the links for each POTUS
pg <- read_html("https://www.federalregister.gov/executive-orders") 

#' Find the POTUS EO data nodes, excluding the one for "All"
html_nodes(pg, "ul.bulk-files") %>% 
  html_nodes(xpath = ".//li[span[a[contains(@href, 'json')]] and 
                            not(span[contains(., 'All')])]") -> potus_nodes

#' Turn the POTUS info into a data frame with the POTUS name and EO JSON link,
#' then retrieve the JSON file and make a data frame of individual data elements
data_frame(
  potus = html_nodes(potus_nodes, "span:nth-of-type(1)") %>% html_text(),
  eo_link = html_nodes(potus_nodes, "a[href *= 'json']") %>% 
    html_attr("href") %>% 
    sprintf("https://www.federalregister.gov%s", .)
) %>% 
  mutate(eo = map(eo_link, jsonlite::fromJSON)) %>% 
  mutate(eo = map(eo, "results")) %>% 
  unnest() -> eo_df

glimpse(eo_df)
## Observations: 887
## Variables: 16
## $ potus                  <chr> "Donald Trump", "Donald Trump", "Donald Trump", "Donald Trump", "Donald Trump", "D...
## $ eo_link                <chr> "https://www.federalregister.gov/documents/search.json?conditions%5Bcorrection%5D=...
## $ citation               <chr> "82 FR 8351", "82 FR 8657", "82 FR 8793", "82 FR 8799", "82 FR 8977", "82 FR 9333"...
## $ document_number        <chr> "2017-01799", "2017-02029", "2017-02095", "2017-02102", "2017-02281", "2017-02450"...
## $ end_page               <int> 8352, 8658, 8797, 8803, 8982, 9338, 9341, 9966, 10693, 10696, 10698, 10700, 12287,...
## $ executive_order_notes  <chr> NA, "See: EO 13807, August 15, 2017", NA, NA, "See: EO 13780, March 6, 2017", "Sup...
## $ executive_order_number <int> 13765, 13766, 13767, 13768, 13769, 13770, 13771, 13772, 13773, 13774, 13775, 13776...
## $ html_url               <chr> "https://www.federalregister.gov/documents/2017/01/24/2017-01799/minimizing-the-ec...
## $ pdf_url                <chr> "https://www.gpo.gov/fdsys/pkg/FR-2017-01-24/pdf/2017-01799.pdf", "https://www.gpo...
## $ publication_date       <chr> "2017-01-24", "2017-01-30", "2017-01-30", "2017-01-30", "2017-02-01", "2017-02-03"...
## $ signing_date           <chr> "2017-01-20", "2017-01-24", "2017-01-25", "2017-01-25", "2017-01-27", "2017-01-28"...
## $ start_page             <int> 8351, 8657, 8793, 8799, 8977, 9333, 9339, 9965, 10691, 10695, 10697, 10699, 12285,...
## $ title                  <chr> "Minimizing the Economic Burden of the Patient Protection and Affordable Care Act ...
## $ full_text_xml_url      <chr> "https://www.federalregister.gov/documents/full_text/xml/2017/01/24/2017-01799.xml...
## $ body_html_url          <chr> "https://www.federalregister.gov/documents/full_text/html/2017/01/24/2017-01799.ht...
## $ json_url               <chr> "https://www.federalregister.gov/api/v1/documents/2017-01799.json", "https://www.f...

EOs By Year

To see how many EOs were signed per-year, per-POTUS, we’ll convert the signing_date into a year (and return it back to a Date object so we get spiffier plot labels), factor order the POTUS names and mark the start of each POTUS term. I’m not usually a fan of stacked bar charts, but since there will only be — at most — two segments, I think they work well and it also shows just how many EOs are established in year one of a POTUS term:

mutate(eo_df, year = lubridate::year(signing_date)) %>% 
  mutate(year = as.Date(sprintf("%s-01-01", year))) %>% 
  count(year, potus) %>%
  mutate(
    potus = factor(
      potus, 
      levels = c("Donald Trump", "Barack Obama", "George W. Bush", "William J. Clinton")
    )
  ) %>%
  ggplot(aes(year, n, group=potus)) +
  geom_col(position = "stack", aes(fill = potus)) +
  scale_x_date(
    name = NULL,
    expand = c(0,0),
    breaks = as.Date(c("1993-01-01", "2001-01-01", "2009-01-01", "2017-01-01")),
    date_labels = "%Y",
    limits = as.Date(c("1992-01-01", "2020-12-31"))
  ) +
  scale_y_comma(name = "# EOs") +
  scale_fill_ipsum(name = NULL) +
  guides(fill = guide_legend(reverse=TRUE)) +
  labs(
    title = "Number of Executive Orders Signed Per-Year, Per-POTUS",
    subtitle = "1993-Present",
    caption = "Source: Federal Register <https://www.federalregister.gov/executive-orders>"
  ) +
  theme_ipsum_rc(grid = "Y") +
  theme(legend.position = "bottom")

Favourite First (Title) Words

I’ll let some eager tidy text miners go to town on the full text links and just focus on one aspect of the EO titles: the “first” words. These are generally words like “Amending”, “Establishing”, “Promoting”, etc., meant to give citizens a quick idea of what the order is supposed to be doing. We’ll remove common words, turn plurals into singulars and also get rid of years/dates to make the data a bit more useful, then focus on the “top 5” first words used by each POTUS (and show all the first words across each POTUS). I’m using raw counts here (since this is a quick post) but another view normalized by percent of all POTUS EOs might prove more interesting/valuable:
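(The code below assumes a titles_df data frame holding each EO’s first title word; the extraction step isn’t shown above, so here’s one way to reconstruct it — the stop-word list is just an illustrative assumption.)

# reconstruct the "first word" data frame used below (an assumed helper step)
mutate(eo_df, first_word = stri_extract_first_words(title)) %>%       # first word of each title
  filter(!stri_detect_regex(first_word, "^[0-9]+$")) %>%              # drop years/dates
  filter(!first_word %in% c("The", "A", "An", "To", "Of")) %>%        # drop common words (illustrative list)
  select(potus, first_word) -> titles_df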

mutate(titles_df, first_word = singularize(first_word)) %>% 
  count(potus, first_word, sort=TRUE) %>% 
  filter(!stri_detect_regex(first_word, "President|Federal|National")) %>%
  mutate(first_word = stri_replace_all_fixed(first_word, "Establishment", "Establishing")) %>% 
  mutate(first_word = stri_replace_all_fixed(first_word, "Amendment", "Amending")) -> first_words

group_by(first_words, potus) %>% 
    top_n(5) %>%  
    ungroup() %>% 
    distinct(first_word) %>% 
    pull(first_word) -> all_first_words

filter(first_words, first_word %in% all_first_words) %>% 
  mutate(
    potus = factor(
      potus, 
      levels = c("Donald Trump", "Barack Obama", "George W. Bush", "William J. Clinton")
    )
  ) %>% 
  mutate(
    first_word = factor(
      first_word, 
      levels = rev(sort(unique(first_word)))
    )
  ) -> first_df

ggplot(first_df, aes(n, first_word)) +
  geom_segment(aes(xend=0, yend=first_word, color=potus), size=4) +
  scale_x_comma(limits=c(0,40)) +
  scale_y_discrete(limits = sort(unique(first_df$first_word))) +
  facet_wrap(~potus, scales = "free", ncol = 2) +
  labs(
    x = "# EOs",
    y = NULL,
    title = "Top 5 Executive Order 'First Words' by POTUS",
    subtitle = "1993-Present",
    caption = "Source: Federal Register <https://www.federalregister.gov/executive-orders>"
  ) +
  theme_ipsum_rc(grid="X", strip_text_face = "bold") +
  theme(panel.spacing.x = unit(5, "lines")) +
  theme(legend.position="none")

FWIW I expected more “Revocation”/”Removing” from the current tangerine-in-chief, but there’s plenty “Enforcing” and “Blocking” to make up for it (being the “tough guy” that he likes to pretend he is).

FIN

There’s way more that can be done with this data set and hopefully folks will take it for a spin and come up with their own interesting views. If you do, drop a note in the comments with a link to your creation(s)!

The code blocks are all combined into this gist.

If you come here often you’ve noticed that I’ve been writing a semi-frequent series on using the Feedly API with R.

A recent post was created to help someone use the API. It worked for them but — as you can see in the comment — an assertion was made that these items were “locked away”. This is far from the case.

Feedly lets you hook Dropbox up to Feedly. That does a bunch of things, the first of which is that your Dropbox folder (i.e. ~/Dropbox) now has a ~/Dropbox/Apps/Feedly Vault directory where Feedly will store all sorts of wonderful items:

.
├── OPML Backup
├── Saved For Later
└── Tags

Copies of your OPML file (the XML container that has the references to all the RSS feeds you subscribe to) are backed up in OPML Backup every time there is a change. I’ve made 127 changes to my RSS feeds since 2014 and they’re all there, ready to be processed with R or some other, inferior programming language.
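If you want to poke at the most recent backup programmatically, a few lines of xml2 will do (a sketch; it assumes the backup filenames sort chronologically — verify against your own copy):

library(xml2)

opml_dir <- path.expand("~/Dropbox/Apps/Feedly Vault/OPML Backup")
latest <- tail(sort(list.files(opml_dir, full.names = TRUE)), 1)

# each subscribed feed is an <outline> node with an xmlUrl attribute
feeds <- xml_find_all(read_xml(latest), ".//outline[@xmlUrl]")

data.frame(
  title = xml_attr(feeds, "title"),
  feed_url = xml_attr(feeds, "xmlUrl"),
  stringsAsFactors = FALSE
)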

The Saved for Later folder has a set of sub-directories by year:

Saved For Later/
├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
└── 2018

Inside each of those annums are HTML files for all the posts you’ve, well, saved for later. The HTML contains the view you saw in the Feedly reader pane.

Astute readers will notice directories for 2011, 2012 and 2013. Feedly was not around back then. So, what are they? They are the “saved posts” you had when/if you used Google Reader (back in the day) and did an initial import from GReader to Feedly to begin your new RSS journey. (Feedly devs are 100% awesome).

Similarly, the Tags folder has copies of the HTML for anything you’ve filed under a tag/board.

So, if you’re not keen on using the Feedly API but want direct or programmatic access to your OPML file and saved content, look no further than a simple Dropbox directory traversal.
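For example, pulling the titles of everything you saved in a given year is just a directory listing plus a little rvest (a sketch; it assumes the saved pages are complete HTML documents with <title> elements):

library(rvest)
library(tidyverse)

# list one year's worth of saved-for-later pages
saved_2017 <- list.files(
  path.expand("~/Dropbox/Apps/Feedly Vault/Saved For Later/2017"),
  pattern = "\\.html$", full.names = TRUE
)

# grab each page's title
map_df(saved_2017, ~tibble(
  file = basename(.x),
  title = read_html(.x) %>% html_node("title") %>% html_text()
))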

@mkjcktzn asked if one can access Feedly “Saved for Later” items via the API. The answer is “Yes!”, and it builds off of that previous post. You’ll need to read it and get your authentication key (still no package yet) before continuing.

We’ll use most (I think “all”) of the code from the previous post, so let’s bring that over here:

library(httr)
library(tidyverse)

.pkgenv <- new.env(parent=emptyenv())
.pkgenv$token <- Sys.getenv("FEEDLY_ACCESS_TOKEN")

.feedly_token <- function() return(.pkgenv$token)

feedly_stream <- function(stream_id, ct=100L, continuation=NULL) {
  
  ct <- as.integer(ct)
  
  if (!is.null(continuation)) ct <- 1000L
  
  httr::GET(
    url = "https://cloud.feedly.com/v3/streams/contents",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    ),
    query = list(
      streamId = stream_id,
      count = ct,
      continuation = continuation
    )
  ) -> res
  
  httr::stop_for_status(res)
  
  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)
  
  res
  
}

According to the Feedly API Overview there is a “global resource id” which is formatted like user/:userId/tag/global.saved and defined as “Users can save articles for later. Equivalent of starring articles in Google Reader.”

The “Saved for Later” feature is quite handy and all we need to do to get access to it is substitute our user id for :userId. To do that, we’ll build a helper function:

feedly_profile <- function() {
  
  httr::GET(
    url = "https://cloud.feedly.com/v3/profile",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    )
  ) -> res
  
  httr::stop_for_status(res)
  
  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)
  
  class(res) <- c("feedly_profile")
  
  res
  
}

When that function is called, it returns a ton of user profile information in a list, including the id that we need:

me <- feedly_profile()

str(me, 1)
## List of 46
##  $ id                          : chr "9b61e777-6ee2-476d-a158-03050694896a"
##  $ client                      : chr "feedly"
##  $ email                       : chr "...@example.com"
##  $ wave                        : chr "2013.26"
##  $ logins                      :'data.frame': 4 obs. of  6 variables:
##  $ product                     : chr "Feedly..."
##  $ picture                     : chr "https://..."
##  $ twitter                     : chr "hrbrmstr"
##  $ givenName                   : chr "..."
##  $ evernoteUserId              : chr "112233"
##  $ familyName                  : chr "..."
##  $ google                      : chr "1100199130101939"
##  $ gender                      : chr "..."
##  $ windowsLiveId               : chr "1020d010389281e3"
##  $ twitterUserId               : chr "99119939"
##  $ twitterProfileBannerImageUrl: chr "https://..."
##  $ evernoteStoreUrl            : chr "https://..."
##  $ evernoteWebApiPrefix        : chr "https://..."
##  $ evernotePartialOAuth        : logi ...
##  $ dropboxUid                  : chr "54555"
##  $ subscriptionPaymentProvider : chr "......"
##  $ productExpiration           : num 2.65e+12
##  $ subscriptionRenewalDate     : num 2.65e+12
##  $ subscriptionStatus          : chr "Active"
##  $ upgradeDate                 : num 2.5e+12
##  $ backupTags                  : logi TRUE
##  $ backupOpml                  : logi TRUE
##  $ dropboxConnected            : logi TRUE
##  $ twitterConnected            : logi TRUE
##  $ customGivenName             : chr "..."
##  $ customFamilyName            : chr "..."
##  $ customEmail                 : chr "...@example.com"
##  $ pocketUsername              : chr "...@example.com"
##  $ windowsLivePartialOAuth     : logi TRUE
##  $ facebookConnected           : logi FALSE
##  $ productRenewalAmount        : int 1111
##  $ evernoteConnected           : logi TRUE
##  $ pocketConnected             : logi TRUE
##  $ wordPressConnected          : logi FALSE
##  $ windowsLiveConnected        : logi TRUE
##  $ dropboxOpmlBackup           : logi TRUE
##  $ dropboxTagBackup            : logi TRUE
##  $ backupPageFormat            : chr "Html"
##  $ dropboxFormat               : chr "Html"
##  $ locale                      : chr "en_US"
##  $ fullName                    : chr "..."
##  - attr(*, "class")= chr "feedly_profile"

(You didn’t think I wouldn’t redact that, did you? Note that I made up a unique id as well.)

Now we can call our stream function and get the results:

entries <- feedly_stream(sprintf("user/%s/tag/global.saved", me$id))

str(entries$items, 1)
# output not shown as you don't really need to see what I've Saved for Later

The structure is the same as in the previous post.

Now, you can go to town and programmatically access your Feedly “Saved for Later” entries.
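If you have more saved items than fit in a single response, the continuation parameter of feedly_stream() handles paging (a sketch; it assumes the API returns a continuation field whenever more results are available):

# page through all Saved for Later entries using the continuation token
stream_id <- sprintf("user/%s/tag/global.saved", me$id)

res <- feedly_stream(stream_id)
pages <- list(res$items)

while (!is.null(res$continuation)) {
  res <- feedly_stream(stream_id, continuation = res$continuation)
  pages[[length(pages) + 1]] <- res$items
}

all_saved <- dplyr::bind_rows(pages)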

You can also find more “Resource Ids” and “Global Resource Ids” formats on the API Overview page.

Since I just railed against Congress for being a bit two-faced about privacy I thought some rud.is site disclosure would be in order.

At present, third-party tracking is limited to:

  • Something in my WordPress configuration adding a DNS pre-fetch for fonts.googleapis.com. There are a few other DNS pre-fetches that I’m also going to try to eradicate (but that aren’t showing up in my uBlock Origin, likely due to /etc/hosts blocks);
  • Gravatar (which displays avatars near comment author names). I’m torn on this one, but Gravatar is owned by Automattic (who owns WordPress). See next bullet on that;
  • WordPress. Vain site stats tracking, JetPack uptime warnings and some other WordPress pings happen (including some automatic short-linking) as well as the previous bullet bits. I’m not likely going to do the site surgery necessary to stop this but you have full disclosure and can easily avoid pings to those sites via uBlock Origin site-specific rules;
  • SendPulse; I’m running an experiment on user behaviours when it comes to authorizing web notifications (and I just kinda ruined said experiment). I’ll be disabling it later this year (after a full year of it being on so I can have more than just a few sentences to say).

The above came from an in-browser uBlock Origin report.

I ran a splashr::render_har() — which is how I measured things for the Congressional privacy post — on one of my pages and this is the result:

tld                 n
1 rud.is           67
2 wp.com           21
3 gravatar.com      6
4 wordpress.com     3
5 w.org             3
6 sendpulse.com     2

Props on WordPress capturing w.org! I’m still ticked Microsoft stole bob.com from me ages ago.

As you can see, most resources load from my site and none come from Twitter, Facebook or Google Plus.
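If you want to run the same tally on your own site, it goes something like this (a rough sketch: it assumes the HAR object can be walked as a plain list per the HAR spec — the exact accessors may differ by splashr version — and it leans on urltools for the domain munging):

library(splashr)   # needs a running Splash instance
library(urltools)
library(tidyverse)

pg_har <- render_har(splash_local, "https://rud.is/b/")

# HAR spec layout: log -> entries -> request -> url (assumed list structure)
map_chr(pg_har$log$entries, c("request", "url")) %>%
  domain() %>%
  suffix_extract() %>%
  count(tld = paste(domain, suffix, sep = "."), sort = TRUE)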

I run WordPress for a ton of reasons too long to go into for this post, so I’m likely not going to change anything about that list (apart from the DNS pre-fetching).

Hopefully that will abate any concerns visitors might have, especially after reading the post about Congress.

I apologize up-front for using bad words in this post.

Said bad words include “Facebook”, “Mark Zuckerberg” and many referrals to entities within the U.S. Government. Given the topic, it cannot be helped.

I’ve also left the R tag on this despite only showing some ggplot2 plots and Markdown tables. See the end of the post for how to get access to the code & data. R was used solely and extensively for the work behind the words.


This week Congress put on a show as they summoned the current Facebook CEO — Mark Zuckerberg — down to Washington, D.C. to demonstrate how little most of them know about how the modern internet and social networks actually work plus chest-thump to prove to their constituents they really and truly care about you.

These Congress-critters offered such proof in the guise of railing against Facebook for how they’ve handled your data. Note that I should really say our data since they do have an extensive profile database on me and most everyone else even if they’re not Facebook platform users (full disclosure: I do not have a Facebook account).

Ostensibly, this data-mishandling impacted your privacy. Most of the committee members wanted any constituent viewers to come away believing they and their fellow Congress-critters truly care about your privacy.

Fortunately, we have a few ways to measure this “caring” and the remainder of this post will explore how much members of the U.S. House and Senate care about your privacy when you visit their official .gov web sites. Future posts may explore campaign web sites and other metrics, but what better place to show they care about you than right there in their digital houses?

Privacy Primer

When you visit a web site with any browser, the main URL pulls in resources to aid in the composition and functionality of the page. These could be:

  • HTML (the main page is very likely HTML unless it’s just a media URL)
  • images (png, jpg, gif, “svg”, etc),
  • fonts
  • CSS (the “style sheet” that tells the browser how to decorate and position elements on the page)
  • binary objects (such as embedded PDF files or “protocol buffer” content)
  • XML or JSON
  • JavaScript

(plus some others)

When you go to, say, www.example.com the site does not have to load all the resources from example.com domains. In fact, it’s rare to find a modern site which does not use resources from one or more third party sites.

When each resource is loaded, (generally) some information about you goes along for the ride. At a minimum, the request time and source (your) IP address are exposed and — unless you’re really careful/paranoid — the referring site, browser configuration and even cookies are available to the third-party sites. It does not take many of these data points to (pretty much) uniquely identify you. And, this is just for “benign” content like images. We’ll get to JavaScript in a bit.

As you move along the web, these third-party touch-points add up. To demonstrate this, I did my best to de-privatize my browser and OS configuration and visited 12 web sites while keeping a fresh install of Firefox Lightbeam running. Here’s the result:

Each main circle is a distinct/main site and the triangles are resources the site tried to load. The red triangles indicate a common third-party resource that was loaded by two or more sites. Each of those red triangles knows where you’ve been (again, unless you’ve been very careful/paranoid) and can use that information to enhance their knowledge about you.

It gets a bit worse with JavaScript content since a much stronger fingerprint can be created for you (you can learn more about fingerprints at this spiffy EFF site). Plus, JavaScript code can try to pilfer cookies, “hack” the browser, serve up malicious adverts, measure time-on-site, and even enlist you in a cryptomining army.

There are other issues with trusting loaded browser content, but we’ll cover that a bit further into the investigation.

Measuring “Caring”

The word “privacy” was used over 100 times each day by both Zuckerberg and our Congress-critters. Senators and House members made it pretty clear Facebook should care more about your privacy. Implicit in said posit is that they, themselves, must care about your privacy. I’m sure they’ll be glad to point out all along the midterm campaign trails just how much they’re doing to protect your privacy.

We don’t just have to take their word for it. After berating Facebook’s chief college dropout and chastising the largest social network on the planet, we can see just how much of “you” these representatives give to Facebook (and other sites) and also how much they protect you when you decide to pay them a digital visit.

For this metrics experiment, I built a crawler using R and my splashr package which, in turn, uses ScrapingHub’s open source Splash. Splash is an automation framework that lets you programmatically visit a site just like a human would with a real browser.

Normally when one scrapes content from the internet they’re just grabbing the plain, single HTML file that is at the target of a URL. Splash lets us behave like a browser and capture all the resources — images, CSS, fonts, JavaScript — the site loads and will also execute any JavaScript, so it will also capture resources each script may itself load.
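In splashr terms that capture is a single call per site (a sketch; it assumes a Splash instance is already running locally, e.g. via Docker, and that splash_local points at it):

library(splashr)

# capture the full browser load — HTML, images, CSS, fonts, JavaScript —
# for one member's official site (URL is just one example from my state)
king_har <- render_har(splash_local, "https://www.king.senate.gov/")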

By capturing the entire browser experience for the main page of each member of Congress we can get a pretty good idea of just how much each one cares about your digital privacy, and just how much they secretly love Facebook.

Let’s take a look, first, at where you go when you digitally visit a Congress-critter.

Network/Hosting/DNS

Each House and Senate member has an official (not campaign) site that is hosted on a .gov domain and served up from a handful of IP addresses across the following (n is the number of Congress-critter web sites):

asn       aso                                    n
AS5511    Orange                               425
AS7016    Comcast Cable Communications, LLC     95
AS20940   Akamai International B.V.             13
AS1999    U.S. House of Representatives          6
AS7843    Time Warner Cable Internet LLC         1
AS16625   Akamai Technologies, Inc.              1

“Orange” is really Akamai and Akamai is a giant content delivery network which helps web sites efficiently provide content to your browser and can offer Denial of Service (DoS) protection. Most sites are behind Akamai, which means you “touch” Akamai every time you visit the site. They know you were there, but I know a sufficient body of folks who work at Akamai and I’m fairly certain they’re not too evil. Virtually no representative solely uses House/Senate infrastructure, but this is almost a necessity given how easy it is to take down a site with a DoS attack and how polarized politics is in America.

To get to those IP addresses, DNS names like www.king.senate.gov (one of the Senators from my state) need to be translated to IP addresses. DNS queries are also data gold mines and everyone from your ISP to the DNS server that knows the name-to-IP mapping likely sees your IP address. Here are the DNS servers that serve up the directory lookups for all of the House and Senate domains:

nameserver gov_hosted
e4776.g.akamaiedge.net. FALSE
wc.house.gov.edgekey.net. FALSE
e509.b.akamaiedge.net. FALSE
evsan2.senate.gov.edgekey.net. FALSE
e485.b.akamaiedge.net. FALSE
evsan1.senate.gov.edgekey.net. FALSE
e483.g.akamaiedge.net. FALSE
evsan3.senate.gov.edgekey.net. FALSE
wwwhdv1.house.gov. TRUE
firesideweb02cc.house.gov. TRUE
firesideweb01cc.house.gov. TRUE
firesideweb03cc.house.gov. TRUE
dchouse01cc.house.gov. TRUE
c3pocc.house.gov. TRUE
ceweb.house.gov. TRUE
wwwd2-cdn.house.gov. TRUE
45press.house.gov. TRUE
gopweb1a.house.gov. TRUE
eleven11web.house.gov. TRUE
frontierweb.house.gov. TRUE
primitivesocialweb.house.gov. TRUE

Akamai kinda does need to serve up DNS for the sites they host, so this list also makes sense. But, you’ve now had two touch-points logged and we haven’t even loaded a single web page yet.

Safe? & Secure? Connections

When we finally make a connection to a Congress-critter’s site, it is going to be over SSL/TLS. They all support it (which is good, but SSL/TLS confidentiality is not as bullet-proof as many “HTTPS Everywhere” proponents would like to con you into believing). However, I took a look at the SSL certificates for House and Senate sites. Here’s a sampling from, again, my state (one House representative):

The *.house.gov “Common Name (CN)” is a wildcard certificate. Many SSL certificates have just one valid CN, but it’s also possible to list alternate, valid “alt” names that can all use the same, single certificate. Wildcard certificates ease the burden of administration, but it also means that if, say, I managed to get my hands on the certificate chain and private key file, I could set up vladimirputin.house.gov somewhere and your browser would think it’s A-OK. Granted, there are far more Representatives than there are Senators and their tenure length is pretty erratic these days, so I can sort of forgive them for taking the easy route, but I also in no way, shape or form believe they protect those chains and private keys well.

In contrast, the Senate can and does embed the alt-names:
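You can verify this from R, too; a quick sketch with the openssl package (the field names of the returned certificate list are from memory, so treat them as assumptions):

library(openssl)

# pull the served certificate chain and inspect the leaf cert
cc <- download_ssl_cert("www.king.senate.gov", 443)
as.list(cc[[1]])$subject
as.list(cc[[1]])$alt_names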

Are We There Yet?

We’ve got the IP address of the site and established a “secure” connection. Now it’s time to grab the index page and all the rest of the resources that come along for the ride. As noted in the Privacy Primer (above), the loading of third-party resources is problematic from a privacy (and security) perspective. Just how many third party resources do House and Senate member sites rely on?

To figure that out, I tallied up all of the non-.gov resources loaded by each web site and plotted the distribution of House and Senate (separately) in a “beeswarm” plot with a boxplot shadowing underneath so you can make out the pertinent quantiles:
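The plot itself is straightforward once you have a per-site tally (a sketch with hypothetical column names — chamber and n_third_party — using ggbeeswarm for the beeswarm layer):

library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

# third_party_counts is assumed to hold one row per member site
ggplot(third_party_counts, aes(chamber, n_third_party)) +
  geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.3) +  # the "shadow" boxplot
  geom_quasirandom(size = 1, alpha = 0.6) +                     # the beeswarm layer
  labs(x = NULL, y = "# third-party resources loaded") +
  theme_ipsum_rc(grid = "Y")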

As noted, the median is around 30 for both House and Senate member sites. In other words, they value your browsing privacy so little that most Congress-critters gladly share your browser session with many other sites.

We also talked about confidentiality above. If an https site loads http resources, the contents of what you see on the page cannot be guaranteed. So, how responsible are they when it comes to at least ensuring these third-party resources are loaded over https?

You’re mostly covered from a pseudo-confidentiality perspective, but what are they serving up to you? Here’s a summary of the MIME types being delivered to you:

MIME Type Number of Resources Loaded
image/jpeg 6,445
image/png 3,512
text/html 2,850
text/css 1,830
image/gif 1,518
text/javascript 1,512
font/ttf 1,266
video/mp4 974
application/json 673
application/javascript 670
application/x-javascript 353
application/octet-stream 187
application/font-woff2 99
image/bmp 44
image/svg+xml 39
text/plain 33
application/xml 15
image/jpeg, video/mp2t 12
application/x-protobuf 9
binary/octet-stream 5
font/woff 4
image/jpg 4
application/font-woff 2
application/vnd.google.gdata.error+xml 1

We’ll cover some of these in more detail a bit further into the post.

Facebook & “Friends”

Facebook started all this, so just how cozy are these Congress-critters with Facebook?

Turns out that both Senators and House members are very comfortable letting you give Facebook a love-tap when you come visit their sites since over 60% of House and 40% of Senate sites use 2 or more Facebook resources. Not all Facebook resources are created equal[ly evil] and we’ll look at some of the more invasive ones soon.

Facebook is not the only devil out there. I added in the public filter list from Disconnect and the numbers go up from 60% to 70% for the House and from 40% to 60% for the Senate when it comes to a larger corpus of known tracking sites/resources.

Here’s a list of some (first 20) of the top domains (with one of Twitter’s media-serving domains taking the individual top-spot):

Main third-party domain # of ‘pings’ %
twimg.com 764 13.7%
fbcdn.net 655 11.8%
twitter.com 573 10.3%
google-analytics.com 489 8.8%
doubleclick.net 462 8.3%
facebook.com 451 8.1%
gstatic.com 385 6.9%
fonts.googleapis.com 270 4.9%
youtube.com 246 4.4%
google.com 183 3.3%
maps.googleapis.com 144 2.6%
webtrendslive.com 95 1.7%
instagram.com 75 1.3%
bootstrapcdn.com 68 1.2%
cdninstagram.com 63 1.1%
fonts.net 51 0.9%
ajax.googleapis.com 50 0.9%
staticflickr.com 34 0.6%
translate.googleapis.com 34 0.6%
sharethis.com 32 0.6%

So, when you go to check out what your representative is ‘officially’ up to, you’re being served…up on a silver platter to a plethora of sites where you are the product.

It’s starting to look like Congress-folk aren’t as sincere about your privacy as they may have led us all to believe this week.

A [Java]Script for Success[ful Privacy Destruction]

As stated earlier, not all third-party content is created equally malicious. JavaScript resources run code in your browser on your device and while there are limits to what it can do, those limits diminish weekly as crafty coders figure out more ways to use JavaScript to collect information and perform shady or malicious deeds.

So, how many House/Senate sites load one or more third-party JavaScript resources?

Virtually all of them.

To make matters worse, no .gov or third-party resource of any kind was loaded using subresource integrity validation. Subresource integrity validation means that the site owner — at some point — ensured that the resource being loaded was not malicious and then created a fingerprint for it and told your browser what that fingerprint is so it can compare it to what got loaded. If the fingerprints don’t match, the content is not loaded/executed. Using subresource integrity is not trivial since it requires a top-notch content management team and failure to synchronize/checkpoint third-party content fingerprints will result in resources failing to load.
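Computing the fingerprint itself is not the hard part; here’s a sketch of generating an SRI value in R (the URL is a placeholder, and the result would go into a script tag’s integrity="sha384-…" attribute alongside crossorigin="anonymous"):

library(openssl)
library(httr)

# fetch a third-party script (placeholder URL) and compute its SRI hash
res <- GET("https://example.com/some/library.min.js")
sprintf("sha384-%s", base64_encode(sha384(content(res, as = "raw"))))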

Congress was quick to demand that Facebook implement stronger policies and controls, but they, themselves, cannot be bothered.

Future Work

There are plenty more avenues to explore in this data set (such as “security headers” — they all 100% use strict-transport-security pretty well, but are deeply deficient in others) and more targets for future works, such as the campaign sites of House and Senate members. I may follow up with a look at a specific slice from this data set (the members of the committees who were berating Zuckerberg this week).

The bottom line is that while the beating Facebook took this week was just, those inflicting the pain have a long way to go themselves before they can truly judge what other social media and general internet sites do when it comes to ensuring the safety and privacy of their visitors.

In other words, “Legislator, regulate thyself” before thou regulatest others.

FIN

Apart from some egregiously bad (or benign) examples, I tried not to “name and shame”. I also won’t answer any questions about facets by party since that really doesn’t matter too much: they’re all pretty bad when it comes to understanding and implementing privacy and safety on their sites.

The data set can be found over at Zenodo (alternately, click/tap/select the badge below). I converted the R data frame to ndjson/streaming JSON/jsonlines (however you refer to the format) and tested it out in Apache Drill.
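For reference, the ndjson conversion is a one-liner with jsonlite (congress_df below is a placeholder name for the crawl data frame):

library(jsonlite)

# write the crawl results out as newline-delimited JSON (ndjson/jsonlines)
stream_out(congress_df, gzfile("congress-sites.json.gz"))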

I’ll toss up some R code using data extracts later this week (meaning by April 20th).

DOI

@RMHoge asked the following on Twitter:

Here’s one way to convert an epub to plain text that doesn’t rely on pandoc (pandoc can easily do this and ships with RStudio, but shelling out for it is cheating :-)

We’ll need some help (NOTE that 2 of these are GitHub-only packages):

library(archive) # install_github("jimhester/archive") + 3rd party library
library(hgr) # install_github("hrbrmstr/hgr")
library(stringi)
library(tidyverse)

We’ll use one of @hadleywickham’s books since it’s O’Reilly and they do epubs well. The archive package lets us treat the epub (which is really just a ZIP file) as a mini-filesystem and embraces “tidy” so we have lovely data frames to work with:

bk_src <- "~/Data/R Packages.epub"

bk <- archive::archive(bk_src)

bk
## # A tibble: 92 x 3
##    path                           size date               
##    <chr>                         <dbl> <dttm>             
##  1 mimetype                        20. 2015-03-24 21:49:16
##  2 OEBPS/assets/cover.png      211616. 2015-06-03 16:16:56
##  3 OEBPS/content.opf            10193. 2015-03-24 21:49:16
##  4 OEBPS/toc.ncx                30037. 2015-03-24 21:49:16
##  5 OEBPS/cover.html               315. 2015-03-24 21:49:16
##  6 OEBPS/titlepage01.html         466. 2015-03-24 21:49:16
##  7 OEBPS/copyright-page01.html   3286. 2015-03-24 21:49:16
##  8 OEBPS/toc01.html             17557. 2015-03-24 21:49:16
##  9 OEBPS/preface01.html         17784. 2015-03-24 21:49:16
## 10 OEBPS/part01.html              444. 2015-03-24 21:49:16
## # ... with 82 more rows

We care not about crufty bits and only want HTML files (NOTE: I use html for the pattern since they can be .xhtml files as well):
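The filter is a one-liner (the same expression shows up again in the full conversion below):

filter(bk, stri_detect_fixed(path, "html"))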

## # A tibble: 26 x 3
##    path                          size date               
##    <chr>                        <dbl> <dttm>             
##  1 OEBPS/cover.html              315. 2015-03-24 21:49:16
##  2 OEBPS/titlepage01.html        466. 2015-03-24 21:49:16
##  3 OEBPS/copyright-page01.html  3286. 2015-03-24 21:49:16
##  4 OEBPS/toc01.html            17557. 2015-03-24 21:49:16
##  5 OEBPS/preface01.html        17784. 2015-03-24 21:49:16
##  6 OEBPS/part01.html             444. 2015-03-24 21:49:16
##  7 OEBPS/ch01.html             12007. 2015-03-24 21:49:16
##  8 OEBPS/ch02.html             28633. 2015-03-24 21:49:18
##  9 OEBPS/part02.html             454. 2015-03-24 21:49:18
## 10 OEBPS/ch03.html             28629. 2015-03-24 21:49:18
## # ... with 16 more rows

Let’s read in one file (as a test) and convert it to text and show the first few lines of it:

archive::archive_read(bk, "OEBPS/preface01.html") %>%
  read_lines() %>%
  paste0(collapse = "\n") -> chapter

hgr::clean_text(chapter) %>%
  stri_sub(1, 1000) %>%
  cat()
## Preface
## 
## 
## In This Book
## 
## This book will guide you from being a user of R packages to being a creator of R packages. In , you’ll learn why mastering this skill is so important, and why it’s easier than you think. Next, you’ll learn about the basic structure of a package, and the forms it can take, in . The subsequent chapters go into more detail about each component. They’re roughly organized in order of importance:
## 
## 
##  The most important directory is R/, where your R code lives. A package with just this directory is still a useful package. (And indeed, if you stop reading the book after this chapter, you’ll have still learned some useful new skills.)
##  
##  The DESCRIPTION lets you describe what your package needs to work. If you’re sharing your package, you’ll also use the DESCRIPTION to describe what it does, who can use it (the license), and who to contact if things go wrong.
##  
##  If you want other people (including “future you”!) to understand how to use the functions in your package, you’

hgr::clean_text() uses some XSLT magic to pull text. My jericho package can often do a better job, but it’s rJava-based and therefore a bit painful for some folks to get running.

Now, we’ll convert all the files:

filter(bk, stri_detect_fixed(path, "html")) %>%
  mutate(content = map_chr(path, ~{
    archive::archive_read(bk, .x) %>%
      read_lines() %>%
      paste0(collapse = "\n") %>%
      hgr::clean_text()
  })) %>%
  print(n=27)
## # A tibble: 26 x 4
##    path                          size date                content         
##    <chr>                        <dbl> <dttm>              <chr>           
##  1 OEBPS/cover.html              315. 2015-03-24 21:49:16 Cover           
##  2 OEBPS/titlepage01.html        466. 2015-03-24 21:49:16 "R Packages\n\n…
##  3 OEBPS/copyright-page01.html  3286. 2015-03-24 21:49:16 "R Packages\n\n…
##  4 OEBPS/toc01.html            17557. 2015-03-24 21:49:16 "navPrefaceIn T…
##  5 OEBPS/preface01.html        17784. 2015-03-24 21:49:16 "Preface\n\n\nI…
##  6 OEBPS/part01.html             444. 2015-03-24 21:49:16 Getting Started 
##  7 OEBPS/ch01.html             12007. 2015-03-24 21:49:16 "Introduction\n…
##  8 OEBPS/ch02.html             28633. 2015-03-24 21:49:18 "Package Struct…
##  9 OEBPS/part02.html             454. 2015-03-24 21:49:18 Package Compone…
## 10 OEBPS/ch03.html             28629. 2015-03-24 21:49:18 "R Code\n\nThe …
## 11 OEBPS/ch04.html             31275. 2015-03-24 21:49:18 "Package Metada…
## 12 OEBPS/ch05.html             42089. 2015-03-24 21:49:18 "Object Documen…
## 13 OEBPS/ch06.html             31484. 2015-03-24 21:49:18 "Vignettes: Lon…
## 14 OEBPS/ch07.html             28594. 2015-03-24 21:49:18 "Testing\n\nTes…
## 15 OEBPS/ch08.html             30808. 2015-03-24 21:49:18 "Namespace\n\nT…
## 16 OEBPS/ch09.html             12125. 2015-03-24 21:49:18 "External Data\…
## 17 OEBPS/ch10.html             42013. 2015-03-24 21:49:18 "Compiled Code\…
## 18 OEBPS/ch11.html              8933. 2015-03-24 21:49:18 "Installed File…
## 19 OEBPS/ch12.html              3897. 2015-03-24 21:49:18 "Other Componen…
## 20 OEBPS/part03.html             446. 2015-03-24 21:49:18 Best Practices  
## 21 OEBPS/ch13.html             59493. 2015-03-24 21:49:18 "Git and GitHub…
## 22 OEBPS/ch14.html             44702. 2015-03-24 21:49:18 "Automated Chec…
## 23 OEBPS/ch15.html             39450. 2015-03-24 21:49:18 "Releasing a Pa…
## 24 OEBPS/ix01.html             75277. 2015-03-24 21:49:20 IndexAad hoc te…
## 25 OEBPS/colophon01.html         974. 2015-03-24 21:49:20 "About the Auth…
## 26 OEBPS/colophon02.html        1653. 2015-03-24 21:49:20 "Colophon\n\nTh…

I wasn’t planning on wrapping this into a package anytime soon since it’s a pretty basic flow that may not require one, but it has since been wrapped into a small package dubbed pubcrawl.

Drop a note in the comments with your hints/workflows on converting epub to plaintext!