
In the previous post I tried to explain what Content Security Policies (CSPs) are and how to work with them in R. In case you didn’t RTFPost the TLDR is that CSPs give you control over what can be loaded along with your web content and can optionally be configured to generate a violation report for any attempt to violate the policy you create. While you don’t need to specify a report URI you really should since at the very least you’ll know if you errantly missed a given host, wildcard, or path. You’ll also know when there’s been malicious or just plain skeezy activity going on with third-parties and your content (which is part of the whole point of CSPs).

There’s an “R” category tag on this post (so it’s hitting R-bloggers, et al) since it’s part of an unnumbered series on working with CSPs in R and the next posts will show how to analyze the JSON-formatted reports that are generated. But, to analyze such reports you kinda need a way to get them first. So, we’re going to setup a “serverless” workflow in Amazon AWS to shove CSP reports into a well-organized structure in S3 from which we’ll be able to access, ingest, and analyze them.

Sure, there are services out there that will (legitimately, for free) let you forward violation reports to them, but if you can do this for “free” on your own and avoid handing data to a third party that profits from it (financially or in do-gooder reputation), I can’t fathom an argument for giving up that control.

Note that all you need is an internet-accessible HTTPS endpoint that can take an HTTP POST request with a JSON payload and then store it somewhere, so if you want to, say, use the plumber package to handle these requests without resorting to AWS, then by all means do so! (And, blog about it!)
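If you do go the plumber route, a minimal sketch (untested; the endpoint path, file names and directory are made up) might look something like the following. It just appends each POSTed report to a directory for later analysis, and you’d still need to front it with TLS:

#* Accept a CSP violation report and stash it for later analysis
#* @post /csp-report
function(req) {
  dir.create("csp-reports", showWarnings = FALSE)
  out <- file.path("csp-reports", sprintf("report-%s.json", as.numeric(Sys.time())))
  writeLines(req$postBody, out) # raw JSON body as sent by the browser
  list(status = "ok")
}

# run it with something like: plumber::plumb("csp-plumber.R")$run(port = 8000)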

AWS “Serverless” CSP Report Workflow Prerequisites

You’re obviously going to need an Amazon AWS account and will also need the AWS Command Line Interface tools installed plus an IAM user that has permissions to use CloudFormation. AWS has been around a while now so yet-another-howto on signing up for AWS, installing the CLI tools and generating an IAM user would be, at-best, redundant. Amazon has decent intro resources and, honestly, it’s 2019 and having some familiarity with how to work with at least one cloud provider is pretty much a necessary skillset at this point depending on what part of “tech” you’re in. If you’re new to AWS then follow the links in this paragraph, run through some basics and jump back to enter the four commands you’ll need to run to bootstrap your CSP collection setup.

Bootstrapping an S3 CSP Collector in AWS

We’re going to use this CloudFormation workflow to bootstrap the CSP collection process and you should skim the yaml file to see what’s going on. Said yaml is “infrastructure as code”, meaning it’s a series of configuration directives to generate AWS services for you (i.e. no pointing-and-clicking) and, perhaps more importantly, destroy them for you if you no longer want to keep this active.

The CF Output directive will be the URI you’re going to use in the report-uri/report-to CSP directives and is something we’ll be querying for at the end of the setup process.

The first set of resources are AWS Glue templates which would enable wiring up the CSP report results into AWS Athena. Glue is a nice ETL framework but it’s kinda expensive if set in active mode (Amazon calls it ‘crawler’ mode), so this CloudFormation recipe only creates the Glue template but does not activate it. This section can (as the repo author notes) be deleted, but it does no harm and costs nothing extra so leaving it in is fine as well.

The next bit sets up an AWS Firehose configuration, which is a silly-sounding name for a workflow that defines where “streaming” data gets stored. This “firehose” config is just going to set up a path to an S3 bucket and then configure the necessary permissions associated with said bucket. This is where we’re going to pull data from in the next post.

The aforementioned “firehose” can take streaming data from all kinds of input sources. Our data source is going to be a POSTed JSON HTTP interaction from a browser, so we need something that listens for these POST requests and wires them up to the “firehose”. For that we need an API gateway, and that’s what the penultimate section sets up for us. It instructs AWS to set up an API endpoint that listens for POST requests, tells it the data type (JSON) it will be handling and then tells it which AWS Lambda to call, which is in the last section.

Said lambda code is in the repo’s index.js and is a short Node.js script to post-process the CSP report JSON into something slightly more usable in a data analysis context (the folks who designed the violation report format clearly did not have data science folks in mind given the liberal use of - in field names).
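To make the dash-iness concrete, here’s a tiny sketch of the kind of cleanup you might want, whether the lambda does it or you do it later in R (the JSON string is a made-up, minimal example of the browser’s report format):

library(jsonlite)

raw <- '{"csp-report":{"document-uri":"https://example.com/","violated-directive":"img-src","blocked-uri":"https://evil.example.net/x.png"}}'

rpt <- fromJSON(raw)[["csp-report"]]
names(rpt) <- gsub("-", "_", names(rpt), fixed = TRUE) # underscores play nicer with R/SQL column names

str(rpt)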

If the above sounds super-complex just to get CSP reports, you’re not wrong. We trade off the cost and tedium of self-hosting and securing a standalone-yet-simple JSON POST handling server for a moderately complex workflow that involves multiple types of moving parts in AWS. The downside is having to gain a more-than-casual familiarity with AWS components. The plus side is that this is pretty much free unless your site is wildly popular and either constantly under XSS attack or your CSP policy is woefully misconfigured.

“‘Free’, you say?!” Yep. Free. (OK, “mostly” free)

  • AWS API Gateway: 1,000,000 HTTP REST API calls (our POST reqs that call the lambda code) per month are free
  • AWS Lambda (the index.js runner which sends data to the “firehose”): 1,000,000 free requests per month and 400,000 seconds of compute time per month (the index.js takes ~1s to run)
  • AWS Firehose (the bit that shoves data into S3): first 500 TB/month is $0.029 USD
  • AWS S3: First 50 TB / month is $0.023 per GB (the CSP JSON POSTs gzip’d are usually <1K each) + some super-fractional (of a penny) costs for PUTting data into S3 and copying data from S3.

A well-crafted CSP and a typical site should end up costing you way less than $1.00 USD/month and you can monitor it all via the console or with alerts (change your region, if needed). Plus, you can destroy it at any time with one command (we haven’t built it yet so we’ll see this in a bit).
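As a rough illustration, here’s a back-of-the-envelope in R using the prices above and a made-up volume of 100,000 reports per month at ~1KB each (your numbers will differ):

reports    <- 100000               # hypothetical monthly report volume
gb         <- reports * 1e3 / 1e9  # ~1KB per report -> GB ingested/stored
api_lambda <- 0                    # both comfortably within their free tiers at this volume
firehose   <- gb * 0.029           # $0.029 per GB ingested
s3         <- gb * 0.023           # $0.023 per GB-month stored (ignoring fractional PUT/GET costs)

api_lambda + firehose + s3         # ~$0.005 USD/month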

Launching the Bootstrap

As the repo says, do:

$ git clone git@github.com:michaelbanfield/serverless-csp-report-to.git # get the repo
$ cd serverless-csp-report-to # go to the dir
$ aws s3 mb s3://some-unique-and-decent-bucket-name-to-hold-the-lambda-code/ # pick a good name that you'll recognize
$ aws cloudformation package \
    --template-file template.yaml \
    --s3-bucket <bucket-you-just-created> \
    --output-template-file packaged-template.yaml # generate the build template

$ aws cloudformation deploy \
    --template-file /path/to/packaged-template.yaml \
    --stack-name CSPReporter \
    --capabilities CAPABILITY_IAM # launch the build

It’ll take a minute or two and when it is done just do:

$ aws cloudformation describe-stacks \
    --query "Stacks[0].Outputs[0].OutputValue" \
    --output text \
    --stack-name CSPReporter

To get the URL you’ll use in the reporting directives.
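Whatever that output value turns out to be, it’s what goes into the policy you serve; e.g. (the API Gateway hostname and path here are placeholders, not the real stack output):

Content-Security-Policy: default-src 'self'; report-uri https://abc123.execute-api.us-east-1.amazonaws.com/Prod/report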

To get rid of all these created resources you can go into the console and do it or just do:

$ aws cloudformation delete-stack --stack-name CSPReporter

To see the bucket that was created for the CSP reports just do:

$ aws s3 ls | grep firehose

FIN

If you’re experienced with AWS that was likely not a big deal. If you’re new or inexperienced with AWS this is not a bad way to get some experience with a “serverless” API setup since it’s cheap, easy to delete and touches on a number of key components within AWS.

You can browse through the AWS console to see all of what was created and eventually tweak the CF yaml to bend it to your own will.

Next time we’ll dive in to CSP violation report analysis with R.

REMINDER to — regardless of the source (whether it’s me, RStudio, spiffy R package authors, or big names like AWS/Microsoft/etc.) — always at least spot check the code you’re about to install or execute. Everyone needs to start developing and honing a zero-trust mindset when it comes to even installing apps from app stores on your phones/tablets, let alone allowing random R, C[++], Python, Go, Rust, Haskell, … code to execute on your laptops and servers. This is one reason I went through the sections in the YAML and deliberately linked to the index.js. Not knowing what the code does can lead to unfortunate situations down the line.

NOTE: If you have an alternative Terraform configuration for this, drop a note in the comments since TF is a somewhat more “modern” and less AWS-centric “infrastructure as code” framework. Also, if you’ve done this with Azure or other providers, drop a note in the comments since it may be of use to folks who aren’t interested in using AWS. Finally, if you do make a plumber server for this, drop a link to a post on how you did it and perhaps discuss the costs & headaches involved.

I blathered a lot about HTTP headers in the last post.

In the event you wanted to dig deeper I threw together a small package that will let you grab HTTP headers from a given URL and take a look at them. The README has examples for most things but we’ll go through a bit of them here as well.

For those that just want to play, you can do:

install.packages("hdrs", repos = "https://cinc.rud.is/")

hdrs::explore_app()

and use the diminutive Shiny app to explore a site’s security headers or look at all the headers they return. (Oh, yeah…if you read the previous post then looked at the above screenshot you’ll notice how completely useless IP blocking is to determined individuals.)

NOTE: There are binaries for macOS and Windows at my CINC repo for hdrs so you’ll be getting those if you use the above method. Use type='source' on that call or use various remotes package functions to install the source package (after reading it b/c you really shouldn’t trust any package, ever) from:

Moving Ahead

Let’s use the command-line to poke at my newfound most favorite site to use in security-related examples:

library(hdrs)

assess_security_headers("https://cran.r-project.org/") %>% 
  dplyr::select(-url)
## # A tibble: 13 x 4
##    header                            value                  status_code message            
##  * <chr>                             <chr>                  <chr>       <chr>              
##  1 access-control-allow-origin       NA                     WARN        Header not set     
##  2 content-security-policy           NA                     WARN        Header not set     
##  3 expect-ct                         NA                     WARN        Header not set     
##  4 feature-policy                    NA                     WARN        Header not set     
##  5 public-key-pins                   NA                     WARN        Header not set     
##  6 referrer-policy                   NA                     WARN        Header not set     
##  7 server                            Apache/2.4.10 (Debian) NOTE        Server header found
##  8 strict-transport-security         NA                     WARN        Header not set     
##  9 x-content-type-options            NA                     WARN        Header not set     
## 10 x-frame-options                   NA                     WARN        Header not set     
## 11 x-permitted-cross-domain-policies NA                     WARN        Header not set     
## 12 x-powered-by                      NA                     WARN        Header not set     
## 13 x-xss-protection                  NA                     WARN        Header not set     

Ouch. Not exactly a great result (so, perhaps it matters little how poorly maintained the downstream mirrors are after all, or maybe it’s perfectly fine to run a five year old web server with some fun vulns).

Anyway…

The assess_security_headers() function looks at 13 modern “security-oriented” HTTP headers, performs a very light efficacy assessment and returns the results.

  • access-control-allow-origin
  • content-security-policy
  • expect-ct
  • feature-policy
  • server
  • public-key-pins
  • referrer-policy
  • strict-transport-security
  • x-content-type-options
  • x-frame-options
  • x-permitted-cross-domain-policies
  • x-powered-by
  • x-xss-protection

Since you likely do not have every HTTP header’s name, potential values, suggested values, and overall purpose memorized, you can also pass in include_ref = TRUE to the function to get more information with decent textual descriptions like you saw in the screenshot (the Shiny app omits many fields).
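For example (same URL as before; output elided since the result is wide):

assess_security_headers("https://cran.r-project.org/", include_ref = TRUE) %>%
  dplyr::glimpse()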

The full reference is available in a data element:

data("http_headers")

dplyr::glimpse(http_headers)
## Observations: 184
## Variables: 14
## $ header_field_name    <chr> "A-IM", "Accept", "Accept-Additions", "Accept-Charset", "Accept-Datetime", "Accept-Encoding…
## $ type_1               <chr> "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", "Permanent", …
## $ protocol             <chr> "http", "http", "http", "http", "http", "http", "http", "http", "http", "http", "http", "ht…
## $ status               <chr> "", "standard", "", "standard", "informational", "standard", "", "standard", "", "standard"…
## $ reference            <chr> "https://tools.ietf.org/html/rfc3229#section-10.5.3", "https://tools.ietf.org/html/rfc7231#…
## $ type_2               <chr> "Request", "Request", "Request", "Request", "Request", "Request", "Request", "Request", "Re…
## $ enable               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL…
## $ required             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, TRUE, TRUE, NA, TRUE, NA, NA, NA, TRUE, NA, NA, NA, NA, NA,…
## $ https                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, TRUE, TRUE, NA, TRUE, NA, NA, NA, TRUE, NA, NA, NA, NA, NA,…
## $ security_description <chr> "", "", "", "", "", "", "", "", "Sometimes an HTTP intermediary might try to detect viruses…
## $ security_reference   <chr> "", "", "", "", "", "", "", "", "https://tools.ietf.org/html/rfc5789#section-5", "https://t…
## $ recommendations      <chr> "", "", "", "", "", "", "", "", "Antivirus software scans for viruses or worms.", "Servers …
## $ cwe                  <chr> "", "", "", "", "", "", "", "", "CWE-509: Replicating Malicious Code (Virus or Worm)", "CWE…
## $ cwe_url              <chr> "\r", "\r", "\r", "\r", "\r", "\r", "\r", "\r", "https://cwe.mitre.org/data/definitions/509…

There will eventually be a lovely vignette with well-formatted sections that include the above information so you can reference it at your leisure (it’s great bedtime reading).

The http_headers object is fully documented but here’s what those fields mean:

  • header_field_name: header field
  • type_1: Permanent (in a standard); Provisional (experimental); Personal (unofficial)
  • protocol: should always be http for now but may be different (e.g. quic)
  • status: blank == unknown; otherwise the value describes the status well
  • reference: where to look for more info
  • type_2: Request (should only be found in requests); Response (should only be found in responses); Request/Response (found in either); Reserved (not in use yet)
  • enable: should you have this enabled
  • required: Is this header required
  • https: HTTPS-specific header?
  • security_description: Information on the header
  • security_reference: Extra external reference information on the header
  • recommendations: Recommended setting(s)
  • cwe: Associated Common Weakness Enumeration (CWE) identifier
  • cwe_url: Associated CWE URL
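Those fields also make for quick slices; for instance, a one-liner (a sketch using the columns shown above) to list the headers the reference data says you should enable, along with their recommendations:

data("http_headers", package = "hdrs")

dplyr::filter(http_headers, enable) %>%
  dplyr::select(header_field_name, recommendations)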

Even Moar Headers

HTTP servers can spit out tons of headers and we can catch’em all with hdrs::explain_headers(). That function grabs the headers, merges in the full metadata from http_headers and returns a big ol’ data frame. We’ll only pull out the security reference URL for this last example:

explain_headers("https://community.rstudio.com/") %>% 
  dplyr::select(header, value, security_reference)
## # A tibble: 18 x 3
##    header               value                                                         security_reference                   
##    <chr>                <chr>                                                         <chr>                                
##  1 cache-control        no-cache, no-store                                            https://tools.ietf.org/html/rfc7234#…
##  2 connection           keep-alive                                                    ""                                   
##  3 content-encoding     gzip                                                          https://en.wikipedia.org/wiki/BREACH…
##  4 content-security-po… base-uri 'none'; object-src 'none'; script-src 'unsafe-eval'… https://www.owasp.org/index.php/List…
##  5 content-type         text/html; charset=utf-8                                      https://tools.ietf.org/html/rfc7231#…
##  6 date                 Tue, 05 Mar 2019 20:40:31 GMT                                 ""                                   
##  7 referrer-policy      strict-origin-when-cross-origin                               NA                                   
##  8 server               nginx                                                         https://tools.ietf.org/html/rfc7231#…
##  9 strict-transport-se… max-age=31536000                                              https://tools.ietf.org/html/rfc6797  
## 10 vary                 Accept-Encoding                                               ""                                   
## 11 x-content-type-opti… nosniff                                                       https://www.owasp.org/index.php/List…
## 12 x-discourse-route    list/latest                                                   NA                                   
## 13 x-download-options   noopen                                                        NA                                   
## 14 x-frame-options      SAMEORIGIN                                                    https://tools.ietf.org/html/rfc7034  
## 15 x-permitted-cross-d… none                                                          NA                                   
## 16 x-request-id         12322c6e-b47e-4960-b384-32138097886c                          NA                                   
## 17 x-runtime            0.106664                                                      NA                                   
## 18 x-xss-protection     1; mode=block                                                 https://www.owasp.org/index.php/List…

FIN

Have some fun and poke at some headers. Perhaps even use this to do a survey of key web sites in your field of work/study and see how well they rate. As usual, post PRs & issues at your fav social coding site.

Not flagging this with an “R” tag since I don’t want to spam R-bloggers, but I mentioned here that I’d be disabling logging on https://cinc.rud.is and https://git.rud.is and I wanted to follow up on that with an addendum: I’ve opted to disable user/system tracking in the access logs of both those sites rather than drop access logs entirely.

I’ve configured nginx with ipscrub and have defined a new log_format:

log_format special '$remote_addr_ipscrub - [$time_local] "$request" $status $body_bytes_sent';

that anonymizes the incoming IP address and does not log referrer (it only logs what you see above).

That way I can get usage info without tracking you b/c I rly have no need for the extra info.

It was super-easy to install and I can recommend it for anyone who runs nginx and wants to avoid capturing information you really don’t need to measure engagement.

The “Changes on CRAN” section of the latest issue of The R Journal (Vol. 10/2, December 2018) had this short blurb entitled “CRAN mirror security”:

Currently, there are 100 official CRAN mirrors, 68 of which provide both secure downloads via ‘https’ and use secure mirroring from the CRAN master (via rsync through ssh tunnels). Since the R 3.4.0 release, chooseCRANmirror() offers these mirrors in preference to the others which are not fully secured (yet).

I would have linked to the R Journal section quoted above but I can’t because I’m blocked from accessing all resources at the IP address serving cran.r-project.org from my business-class internet connection likely due to me having a personal CRAN mirror (that was following the rules, which I also cannot link to since I can’t get to the site).

That word — “security” — is one of the most misunderstood and misused terms in modern times in many contexts. The context for the use here is cybersecurity and since CRAN (and others in the R community) seem to equate transport-layer uber-obfuscation with actual security/safety I thought it would be useful for R users in general to get a more complete picture of these so-called “secure” hosts. I also did this since I had to figure out another way to continue to have a CRAN mirror and needed to validate which nodes both supported + allowed mirroring and were at least somewhat trustworthy.

Unless there is something truly egregious in a given section I’m just going to present data with some commentary (I’m unamused abt being blocked so some commentary has an unusually sharp edge) and refrain from stating “X is 👍|👎” since the goal is really to help you make the best decision of which mirror to use on your own.

The full Rproj supporting the snippets in this post (and including the data gathered by the post) can be found in my new R blog projects.

We’re going to need a few supporting packages so let’s get those out of the way:

library(xml2)
library(httr)
library(curl)
library(stringi)
library(urltools)
library(ipinfo) # install.packages("ipinfo", repos = "https://cinc.rud.is/")
library(openssl)
library(furrr)
library(vershist) # install.packages("vershist", repos = "https://cinc.rud.is/")
library(ggalt)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

What Is “Secure”?

As noted, CRAN folks seem to think encryption == security, since the criteria for making that claim in the R Journal were transport-layer encryption for rsync (via ssh) mirroring from CRAN to a downstream mirror and a downstream mirror providing an https transport for shuffling package binaries and sources from said mirror to your local system(s). I find that about as adorable as the rhetoric from the Let’s Encrypt cabal, since this https gets you:

  • in theory protection from person-in-the-middle attacks that could otherwise fiddle with the package bits in transport
  • protection from your organization or ISP knowing what specific package you were grabbing; note that unless you’ve got a setup where your DNS requests are also encrypted the entity that controls your transport layer does indeed know exactly where you’re going.

and…that’s about it.

The soon-to-be-gone-and-formerly-green-in-most-browsers lock icon alone tells you nothing about the configuration of any site you’re connecting to and using rsync over ssh provides no assurance as to what else is on the CRAN mirror server(s), what else is using the mirror server(s), how many admins/users have shell access to those system(s) nor anything else about the cyber hygiene of those systems.

So, we’re going to look at (not necessarily in this order & non-exhaustively since this isn’t a penetration test and only lightweight introspection has been performed):

  • how many servers are involved in a given mirror URL
  • SSL certificate information including issuer, strength, and just how many other domains can use the cert
  • the actual server SSL transport configuration to see just how many CRAN mirrors have HIGH or CRITICAL SSL configuration issues
  • the use (or lack thereof) of HTTP “security” headers (I mean, the server is supposed to be “secure”, right?)
  • how much other “junk” is running on a given CRAN mirror (the more running services the greater the attack surface)

We’ll use R for most of this, too (I’m likely never going to rewrite longstanding SSL testers in/for R).

Let’s dig in.

Acquiring Most of the Metadata

It can take a little while to run some of the data gathering steps so the project repo includes the already-gathered data. But, we’ll show the work on the first bit of reconnaissance which involves:

  • Slurping the SSL certificate from the first server in each CRAN mirror entry (again, I can’t link to the mirror page because I literally can’t see CRAN or the main R site anymore)
  • Performing an HTTP HEAD request (to minimize server bandwidth & CPU usage) of the full CRAN mirror URL (we have to since load balancers or proxies could re-route us to a completely different server otherwise)
  • Getting an IP address for each CRAN mirror
  • Getting metadata about that IP address

This is all done below:

if (!file.exists(here::here("data/mir-dat.rds"))) {
  mdoc <- xml2::read_xml(here::here("data/mirrors.html"), as_html = TRUE)

  xml_find_all(mdoc, ".//td/a[contains(@href, 'https')]") %>%
    xml_attr("href") %>%
    unique() -> ssl_mirrors

  plan(multiprocess)

  # safety first
  dl_cert <- possibly(openssl::download_ssl_cert, NULL)
  HEAD_ <- possibly(httr::HEAD, NULL)
  dig <- possibly(curl::nslookup, NULL)
  query_ip_ <- possibly(ipinfo::query_ip, NULL)

  ssl_mirrors %>%
    future_map(~{
      host <- domain(.x)
      ip <- dig(host, TRUE)
      ip_info <- if (length(ip)) query_ip_(ip) else NULL
      list(
        host = host,
        cert = dl_cert(host),
        head = HEAD_(.x),
        ip = ip,
        ip_info = ip_info
      )
    }) -> mir_dat

  saveRDS(mir_dat, here::here("data/mir-dat.rds"))
} else {
  mir_dat <- readRDS(here::here("data/mir-dat.rds"))
}

# take a look

str(mir_dat[1], 3)
## List of 1
##  $ :List of 5
##   ..$ host   : chr "cloud.r-project.org"
##   ..$ cert   :List of 4
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   ..$ head   :List of 10
##   .. ..$ url        : chr "https://cloud.r-project.org/"
##   .. ..$ status_code: int 200
##   .. ..$ headers    :List of 13
##   .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##   .. ..$ all_headers:List of 1
##   .. ..$ cookies    :'data.frame':   0 obs. of  7 variables:
##   .. ..$ content    : raw(0) 
##   .. ..$ date       : POSIXct[1:1], format: "2018-11-29 09:41:27"
##   .. ..$ times      : Named num [1:6] 0 0.0507 0.0512 0.0666 0.0796 ...
##   .. .. ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
##   .. ..$ request    :List of 7
##   .. .. ..- attr(*, "class")= chr "request"
##   .. ..$ handle     :Class 'curl_handle' <externalptr> 
##   .. ..- attr(*, "class")= chr "response"
##   ..$ ip     : chr "52.85.89.62"
##   ..$ ip_info:List of 8
##   .. ..$ ip      : chr "52.85.89.62"
##   .. ..$ hostname: chr "server-52-85-89-62.jfk6.r.cloudfront.net"
##   .. ..$ city    : chr "Seattle"
##   .. ..$ region  : chr "Washington"
##   .. ..$ country : chr "US"
##   .. ..$ loc     : chr "47.6348,-122.3450"
##   .. ..$ postal  : chr "98109"
##   .. ..$ org     : chr "AS16509 Amazon.com, Inc."

Note that two sites failed to respond so they were excluded from all analyses.

A Gratuitous Map of “Secure” CRAN Servers

Since ipinfo.io’s API returns lat/lng geolocation information, why not start with a map (it’s going to be the kindest section of this post):

maps::map("world", ".", exact = FALSE, plot = FALSE,  fill = TRUE) %>%
  fortify() %>%
  filter(region != "Antarctica") -> world

map_chr(mir_dat, ~.x$ip_info$loc) %>%
  stri_split_fixed(pattern = ",", n = 2, simplify = TRUE) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  mutate_all(list(as.numeric)) -> wheres_cran

ggplot() +
  ggalt::geom_cartogram(
    data = world, map = world, aes(long, lat, map_id=region),
    color = ft_cols$gray, size = 0.125
  ) +
  geom_point(
    data = wheres_cran, aes(V2, V1), size = 2,
    color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21
  ) +
  ggalt::coord_proj("+proj=wintri") +
  labs(
    x = NULL, y = NULL,
    title = "Geolocation of HTTPS-'enabled' CRAN Mirrors"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text = element_blank())

Shakespearean Security

What’s in a [Subject Alternative] name? That which we call a site secure. By using dozens of other names would smell as not really secure at all? —Hackmeyo & Pwndmeyet (II, ii, 1-2)

The average internet user likely has no idea that one SSL certificate can front a gazillion sites. I’m not just talking a wildcard cert (e.g. using *.rud.is for all rud.is subdomains which I try not to do for many reasons), I’m talking dozens of subject alternative names. Let’s examine some data since an example is better than blathering:

# extract some of the gathered metadata into a data frame
map_df(mir_dat, ~{
  tibble(
    host = .x$host,
    s_issuer = .x$cert[[1]]$issuer %||% NA_character_,
    i_issuer = .x$cert[[2]]$issuer %||% NA_character_,
    algo = .x$cert[[1]]$algorithm %||% NA_character_,
    names = .x$cert[[1]]$alt_names %||% NA_character_,
    nm_ct = length(.x$cert[[1]]$alt_names),
    key_size = .x$cert[[1]]$pubkey$size %||% NA_integer_
  )
}) -> certs

certs <- filter(certs, complete.cases(certs))

count(certs, host, sort=TRUE) %>%
  ggplot() +
  geom_quasirandom(
    aes("", n), size = 2,
    color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Servers",
    title = "Distribution of the number of alt-names in CRAN mirror certificates"
  ) +
  theme_ft_rc(grid="Y")

Most only front a couple but there are some with a crazy number of domains. We can look at a slice of cran.cnr.berkeley.edu:

filter(certs, host == "cran.cnr.berkeley.edu") %>%
  select(names) %>%
  head(20)
names
nature.berkeley.edu
ag-labor.cnr.berkeley.edu
agro-laboral.cnr.berkeley.edu
agroecology.berkeley.edu
anthoff.erg.berkeley.edu
are-dev.cnr.berkeley.edu
are-prod.cnr.berkeley.edu
are-qa.cnr.berkeley.edu
are.berkeley.edu
arebeta.berkeley.edu
areweb.berkeley.edu
atkins-dev.cnr.berkeley.edu
atkins-prod.cnr.berkeley.edu
atkins-qa.cnr.berkeley.edu
atkins.berkeley.edu
bakerlab-dev.cnr.berkeley.edu
bakerlab-prod.cnr.berkeley.edu
bakerlab-qa.cnr.berkeley.edu
bamg.cnr.berkeley.edu
beahrselp-dev.cnr.berkeley.edu

The project repo has some more examples and you can examine as many as you like.

For some CRAN mirrors the certificate is used all over the place at the hosting organization. That alone isn’t bad, but organizations are generally terrible at protecting the secrets associated with certificate generation (just look at how many Google/Apple app store apps are found monthly to be using absconded-with enterprise certs) and since each server with these uber-certs has copies of public & private bits users had better hope that mal-intentioned ne’er-do-wells do not get copies of them (making it easier to impersonate any one of those, especially if an attacker controls DNS).

This Berkeley uber-cert is also kinda cute since it mixes alt-names for dev, prod & qa systems across many different apps/projects (dev systems are notoriously maintained improperly in virtually every organization).

There are legitimate reasons and circumstances for wildcard certs and taking advantage of SANs. You can examine what other CRAN mirrors do and judge for yourself which ones are Doing It Kinda OK.

Size (and Algorithm) Matters

In some crazy twist of pleasant surprises most of the mirrors seem to do OK when it comes to the algorithm and key size used for the certificate(s):

distinct(certs, host, algo, key_size) %>%
  count(algo, key_size, sort=TRUE)
algo                      key_size    n
sha256WithRSAEncryption       2048   59
sha256WithRSAEncryption       4096   13
ecdsa-with-SHA256              256    2
sha256WithRSAEncryption        256    1
sha256WithRSAEncryption        384    1
sha512WithRSAEncryption       2048    1
sha512WithRSAEncryption       4096    1

You can go to the mirror list and hit up SSL Labs Interactive Server Test (which has links to many ‘splainers) or use the ssllabs🔗 R package to get the grade of each site. I dig into the state of config and transport issues below but will suggest that you stick with sites with ecdsa certs or sha256 and higher numbers if you want a general, quick bit of guidance.

Where Do They Get All These Wonderful Certs?

Certs come from somewhere. You can self-generate play ones, set up your own internal/legit certificate authority and augment trust chains, or go to a bona-fide certificate authority to get a certificate.

Your browsers and operating systems have a built-in set of certificate authorities they trust and you can use ssllabs::get_root_certs()🔗 to see an up-to-date list of ones for Mozilla, Apple, Android, Java & Windows. In the age of Let’s Encrypt, certificates have almost no monetary value and virtually no integrity value so where they come from isn’t as important as it used to be, but it’s kinda fun to poke at it anyway:

distinct(certs, host, i_issuer) %>%
  count(i_issuer, sort = TRUE) %>%
  head(28)
i_issuer n
CN=DST Root CA X3,O=Digital Signature Trust Co. 20
CN=COMODO RSA Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB 7
CN=DigiCert Assured ID Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US 7
CN=DigiCert Global Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US 6
CN=DigiCert High Assurance EV Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US 6
CN=QuoVadis Root CA 2 G3,O=QuoVadis Limited,C=BM 5
CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New Jersey,C=US 5
CN=GlobalSign Root CA,OU=Root CA,O=GlobalSign nv-sa,C=BE 4
CN=Trusted Root CA SHA256 G2,O=GlobalSign nv-sa,OU=Trusted Root,C=BE 3
CN=COMODO ECC Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB 2
CN=DFN-Verein PCA Global – G01,OU=DFN-PKI,O=DFN-Verein,C=DE 2
OU=Security Communication RootCA2,O=SECOM Trust Systems CO.\,LTD.,C=JP 2
CN=AddTrust External CA Root,OU=AddTrust External TTP Network,O=AddTrust AB,C=SE 1
CN=Amazon Root CA 1,O=Amazon,C=US 1
CN=Baltimore CyberTrust Root,OU=CyberTrust,O=Baltimore,C=IE 1
CN=Certum Trusted Network CA,OU=Certum Certification Authority,O=Unizeto Technologies S.A.,C=PL 1
CN=DFN-Verein Certification Authority 2,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE 1
CN=Go Daddy Root Certificate Authority – G2,O=GoDaddy.com\, Inc.,L=Scottsdale,ST=Arizona,C=US 1
CN=InCommon RSA Server CA,OU=InCommon,O=Internet2,L=Ann Arbor,ST=MI,C=US 1
CN=QuoVadis Root CA 2,O=QuoVadis Limited,C=BM 1
CN=QuoVadis Root Certification Authority,OU=Root Certification Authority,O=QuoVadis Limited,C=BM 1

That first one is Let’s Encrypt, which is not unexpected since they’re free and super easy to set up/maintain (especially for phishing campaigns).

A “fun” exercise might be to Google/DDG around for historical compromises tied to these CAs (look in the subject ones too if you’re playing with the data at home) and see what, eh, issues they’ve had.

You might want to keep more of an eye on this whole “boring” CA bit, too, since some trust stores are noodling on the idea of trusting surveillance firms and you never know what Microsoft or Google is going to do to placate authoritarian regimes and allow into their trust stores.

At this point in the exercise you’ve got

  • how many domains a certificate fronts
  • certificate strength
  • certificate birthplace

to use when formulating your own decision on what CRAN mirror to use.

But, as noted, certificate breeding is not enough. Let’s dive into the next areas.

It’s In The Way That You Use It

You can’t just look at a cert to evaluate site security. Sure, you can spend 4 days and use the aforementioned ssllabs package to get the rating for each cert (well, if a site has already been assessed then an API call just returns the cached grade rather than triggering a fresh assessment, so you could prime the cache with 4 other ppl in one day and then everyone else can use the cached values and not burn the rate limit), or go one-by-one in the SSL Labs test site, but we can also use a tool like testssl.sh🔗 to gather technical data via interactive protocol examination.

I’m being a bit harsh in this post, so fair’s fair and here are the plaintext results from my own run of testssl.sh for rud.is along with ones from Qualys:

As you can see in the detail pages, I am having an issue with the provider of my .is domain (severe limitation on DNS record counts and types) so I fail CAA checks because I literally can’t add an entry for it nor can I use a different nameserver. Feel encouraged to pick nits about that tho as that should provide sufficient impetus to take two weeks of IRL time and some USD to actually get it transferred (yay. international. domain. providers.)

The project repo has all the results from a weekend run on the CRAN mirrors. No special options were chosen for the runs.
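If you want to reproduce that gathering step yourself, here’s a rough sketch of driving testssl.sh from R (assuming it’s installed and on your PATH, and that its --jsonfile flag behaves as documented):

run_testssl <- function(url, out_dir = here::here("data/ssl")) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out <- file.path(out_dir, sprintf("%s.json", urltools::domain(url)))
  system2("testssl.sh", c("--jsonfile", shQuote(out), shQuote(url))) # one JSON findings file per mirror
  invisible(out)
}

# run_testssl("https://cloud.r-project.org/")

With the per-mirror JSON files in data/ssl we can summarize the findings by severity: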

list.files(here::here("data/ssl"), "json$", full.names = TRUE) %>%
  map_df(jsonlite::fromJSON) %>%
  as_tibble() -> ssl_tests

# severity levels, in order, for the factor/columns below
sev <- c("OK", "LOW", "MEDIUM", "HIGH", "WARN", "CRITICAL")

group_by(ssl_tests, ip) %>%
  count(severity) %>%
  ungroup() %>%
  complete(ip = unique(ip), severity = sev) %>%
  mutate(severity = factor(severity, levels = sev)) %>% # order left->right by severity
  arrange(ip) %>%
  mutate(ip = factor(ip, levels = rev(unique(ip)))) %>% # order alpha by mirror name so it's easier to ref
  ggplot(aes(severity, ip, fill=n)) +
  geom_tile(color = "#b2b2b2", size = 0.125) +
  scale_x_discrete(name = NULL, expand = c(0,0.1), position = "top") +
  scale_y_discrete(name = NULL, expand = c(0,0)) +
  viridis::scale_fill_viridis(
    name = "# Tests", option = "cividis", na.value = ft_cols$gray
  ) +
  labs(
    title = "CRAN Mirror SSL Test Summary Findings by Severity"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text.y = element_text(size = 8, family = "mono")) -> gg

# We're going to move the title vs have too wide of a plot

gb <- ggplot2::ggplotGrob(gg)
gb$layout$l[gb$layout$name %in% "title"] <- 2

grid::grid.newpage()
grid::grid.draw(gb)

Thankfully most SSL checks come back OK. Unfortunately, many do not:

filter(ssl_tests, severity == "HIGH") %>% 
  count(id, sort = TRUE)

id                          n
BREACH                     42
cipherlist_3DES_IDEA       37
cipher_order               34
RC4                        16
cipher_negotiated          10
LOGJAM-common_primes        9
POODLE_SSL                  6
SSLv3                       6
cert_expiration_status      1
cert_notAfter               1
fallback_SCSV               1
LOGJAM                      1
secure_client_renego        1

filter(ssl_tests, severity == "CRITICAL") %>% 
  count(id, sort = TRUE)

id                          n
cipherlist_LOW             16
TLS1_1                      5
CCS                         2
cert_chain_of_trust         1
cipherlist_aNULL            1
cipherlist_EXPORT           1
DROWN                       1
FREAK                       1
ROBOT                       1
SSLv2                       1

Some CRAN mirror site admins aren’t keeping up with secure SSL configurations. If you’re not familiar with some of the acronyms, here are a few (fairly layman-friendly) links:

You’d be hard-pressed to have me say that the presence of these is the end of the world (I mean, you’re trusting random servers to provide packages for you which may run in secure enclaves on production code, so how important can this really be?) but I also wouldn’t attach the word “secure” to any CRAN mirror with HIGH or CRITICAL SSL configuration weaknesses.

Getting Ahead[er] Of Myself

We did the httr::HEAD() request primarily to capture HTTP headers. And, we definitely got some!

map_df(mir_dat, ~{

  if (length(.x$head$headers) == 0) return(NULL)

  host <- .x$host

  flatten_df(.x$head$headers) %>%
    gather(name, value) %>%
    mutate(host = host)

}) -> hdrs

count(hdrs, name, sort=TRUE) %>%
  head(nrow(.))
name n
content-type 79
date 79
server 79
last-modified 72
content-length 67
accept-ranges 65
etag 65
content-encoding 38
connection 28
vary 28
strict-transport-security 13
x-frame-options 8
x-content-type-options 7
cache-control 4
expires 3
x-xss-protection 3
cf-ray 2
expect-ct 2
set-cookie 2
via 2
ms-author-via 1
pragma 1
referrer-policy 1
upgrade 1
x-amz-cf-id 1
x-cache 1
x-permitted-cross-domain 1
x-powered-by 1
x-robots-tag 1
x-tuna-mirror-id 1
x-ua-compatible 1

There are a handful of “security” headers that kinda matter so we’ll see how many “secure” CRAN mirrors use “security” headers:

c(
  "content-security-policy", "x-frame-options", "x-xss-protection",
  "x-content-type-options", "strict-transport-security", "referrer-policy"
) -> secure_headers

count(hdrs, name, sort=TRUE) %>%
  filter(name %in% secure_headers)
name n
strict-transport-security 13
x-frame-options 8
x-content-type-options 7
x-xss-protection 3
referrer-policy 1

I’m honestly shocked any were in use but only a handful or two are using even one “security” header. cran.csiro.au uses all five of the above so good on ya Commonwealth Scientific and Industrial Research Organisation!

I keep putting the word “security” in quotes as R does nothing with these headers when you do an install.packages(). As a whole they’re important but mostly when it comes to your safety when browsing those CRAN mirrors.

I would have liked to have seen at least one with some Content-Security-Policy header, but a girl can at least dream.

Version Aversion

There’s another HTTP response header we can look at, the Server one, which is generally there to help attackers figure out whether they should target you further for HTTP server and application attacks. No, I mean it! Back in the day when geeks ruled the internets — and it wasn’t just a platform for cat pictures and pwnd IP cameras — things like the Server header were cool because it might help us create server-specific interactions and build cool stuff. Yes, modern day REST APIs are likely better in the long run but the naiveté of the silver age of the internet was definitely something special (and also led to the chaos we have now). But, I digress.

In theory, no HTTP server in its rightly configured digital mind would tell you what it’s running down to the version level, but most do. (Again, feel free to pick nits that I let the world know I run nginx…or do I). Assuming the CRAN mirrors haven’t been configured to deceive attackers and report what folks told them to report, we can survey what they run behind the browser window:

filter(hdrs, name == "server") %>%
  separate(
    value, c("kind", "version"), sep="/", fill="right", extra="merge"
  ) -> svr

count(svr, kind, sort=TRUE)
kind                                  n
Apache                               57
nginx                                15
cloudflare                            2
CSIRO                                 1
Hiawatha v10.8.4                      1
High Performance 8bit Web Server      1
none                                  1
openresty                             1

I really hope Cloudflare is donating bandwidth vs charging these mirror sites. They’ve likely benefitted greatly from the diverse FOSS projects many of these sites serve. (I hadn’t said anything bad about Cloudflare yet so I had to get one in before the end).

Lots run Apache (makes sense since CRAN-proper does too, not that I can validate that from home since I’m IP blocked…bitter much, hrbrmstr?). Many run nginx. CSIRO likely names their server that on purpose and hasn’t actually written their own web server. Hiawatha is, indeed, a valid web server. While there are also “high performance 8bit web servers” out there I’m willing to bet that’s a joke header value along with “none”. Finally, “openresty” is also a valid web server (it’s nginx++).

We’ll pick on Apache and nginx and see how current patch levels are. Not all return a version number but a good chunk do:

apache_httpd_version_history() %>%
  arrange(rls_date) %>%
  mutate(
    vers = factor(as.character(vers), levels = as.character(vers))
  ) -> apa_all

filter(svr, kind == "Apache") %>%
  filter(!is.na(version)) %>%
  mutate(version = stri_replace_all_regex(version, " .*$", "")) %>%
  count(version) %>%
  separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>%
  mutate(pat = ifelse(is.na(pat), 1, pat)) %>%
  mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>%
  mutate(v = factor(v, levels = apa_all$vers)) %>%
  arrange(v) -> apa_vers

filter(apa_all, vers %in% apa_vers$v) %>%
  arrange(rls_date) %>%
  group_by(rls_year) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rls_date) -> apa_yrs

ggplot() +
  geom_blank(
    data = apa_vers, aes(v, n)
  ) +
  geom_segment(
    data = apa_yrs, aes(vers, 0, xend=vers, yend=Inf),
    linetype = "dotted", size = 0.25, color = "white"
  ) +
  geom_segment(
    data = apa_vers, aes(v, n, xend=v, yend=0),
    color = ft_cols$gray, size = 8
  ) +
  geom_label(
    data = apa_yrs, aes(vers, Inf, label = rls_year),
    family = font_rc, color = "white", fill = "#262a31", size = 4,
    vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0
  ) +
  scale_y_comma(limits = c(0, 15)) +
  labs(
    x = "Apache Version #", y = "# Servers",
    title = "CRAN Mirrors Apache Version History"
  ) +
  theme_ft_rc(grid="Y") +
  theme(axis.text.x = element_text(family = "mono", size = 8, color = "white"))

O_O

I’ll let you decide if a six-year-old version of Apache indicates how well a mirror site is run or not. Sure, mitigations could be in place but I see no statement of efficacy on any site so we’ll go with #lazyadmin.

But, it’s gotta be better with nginx, right? It’s all cool & modern!

nginx_version_history() %>%
  arrange(rls_date) %>%
  mutate(
    vers = factor(as.character(vers), levels = as.character(vers))
  ) -> ngx_all

filter(svr, kind == "nginx") %>%
  filter(!is.na(version)) %>%
  mutate(version = stri_replace_all_regex(version, " .*$", "")) %>%
  count(version) %>%
  separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>%
  mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>%
  mutate(v = factor(v, levels = ngx_all$vers)) %>%
  arrange(v) -> ngx_vers

filter(ngx_all, vers %in% ngx_vers$v) %>%
  arrange(rls_date) %>%
  group_by(rls_year) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rls_date) -> ngx_yrs

ggplot() +
  geom_blank(
    data = ngx_vers, aes(v, n)
  ) +
  geom_segment(
    data = ngx_yrs, aes(vers, 0, xend=vers, yend=Inf),
    linetype = "dotted", size = 0.25, color = "white"
  ) +
  geom_segment(
    data = ngx_vers, aes(v, n, xend=v, yend=0),
    color = ft_cols$gray, size = 8
  ) +
  geom_label(
    data = ngx_yrs, aes(vers, Inf, label = rls_year),
    family = font_rc, color = "white", fill = "#262a31", size = 4,
    vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0
  ) +
  scale_y_comma(limits = c(0, 15)) +
  labs(
    x = "nginx Version #", y = "# Servers",
    title = "CRAN Mirrors nginx Version History"
  ) +
  theme_ft_rc(grid="Y") +
  theme(axis.text.x = element_text(family = "mono", color = "white"))

🤨

I will close out this penultimate section with a “thank you!” to the admins at Georg-August-Universität Göttingen and Yamagata University for keeping up with web server patches.

You Made It This Far

If I had known you’d read to the nigh bitter end I would have made cookies. You’ll have to just accept the ones the blog gives your browser (those taste pretty bland tho).

The last lightweight element we’ll look at is “what else do these ‘secure’ CRAN mirrors run”?

To do this, we’ll turn to Rapid7 OpenData and look at what else is running on the IP addresses used by these CRAN mirrors. We already know some certs are promiscuous, so what about the servers themselves?

cran_mirror_other_things <- readRDS(here::here("data/cran-mirror-other-things.rds"))

# "top" 20
distinct(cran_mirror_other_things, ip, port) %>%
  count(ip, sort = TRUE) %>%
  head(20)
ip n
104.25.94.23 8
143.107.10.17 7
104.27.133.206 5
137.208.57.37 5
192.75.96.254 5
208.81.1.244 5
119.40.117.175 4
130.225.254.116 4
133.24.248.17 4
14.49.99.238 4
148.205.148.16 4
190.64.49.124 4
194.214.26.146 4
200.236.31.1 4
201.159.221.67 4
202.90.159.172 4
217.31.202.63 4
222.66.109.32 4
45.63.11.93 4
62.44.96.11 4

Four isn’t bad since we kinda expect at least 80, 443 and 21 (FTP) to be running. We’ll take those away and look at the distribution:

distinct(cran_mirror_other_things, ip, port) %>%
  filter(!(port %in% c(21, 80, 443))) %>%
  count(ip) %>%
  count(n) %>%
  mutate(n = factor(n)) %>%
  ggplot() +
  geom_segment(
    aes(n, nn, xend = n, yend = 0), size = 10, color = ft_cols$gray
  ) +
  scale_y_comma() +
  labs(
    x = "Total number of running services", y = "# hosts",
    title = "How many other services do CRAN mirrors run?",
    subtitle = "NOTE: Not counting 80/443/21"
  ) +
  theme_ft_rc(grid="Y")

So, what are these other ports?

distinct(cran_mirror_other_things, ip, port) %>%
  count(port, sort=TRUE)
port n
80 75
443 75
21 29
22 18
8080 6
25 5
53 2
2082 2
2086 2
8000 2
8008 2
8443 2
111 1
465 1
587 1
993 1
995 1
2083 1
2087 1

22 is SSH, 53 is DNS, and 8000/8008/8080/8443 are web high ports usually associated with admin or API endpoints and generally a bad sign when exposed externally (especially on a “secure” mirror server). 25/465/587/993/995 all deal with mail sending and reading (not exactly a great service to have on a “secure” mirror server). I didn’t poke too hard but 208[2367] tend to be cPanel admin ports and those being internet-accessible is also not great.

Port 111 is sunrpc and is a really bad thing to expose to the internet or to run at all. But, the server is a “secure” CRAN mirror, so perhaps everything is fine.
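If you want to carry those labels into the analysis, here’s a small lookup (service names per the descriptions above; a sketch you could left-join onto the port counts):

tibble::tribble(
  ~port, ~service,
     22, "ssh",
     25, "smtp",
     53, "dns",
    111, "sunrpc",
    465, "smtps",
    587, "smtp (submission)",
    993, "imaps",
    995, "pop3s",
   2082, "cPanel",
   2083, "cPanel (ssl)",
   2086, "WHM",
   2087, "WHM (ssl)",
   8000, "http-alt",
   8008, "http-alt",
   8080, "http-alt",
   8443, "https-alt"
) -> port_svc

distinct(cran_mirror_other_things, ip, port) %>%
  left_join(port_svc, by = "port")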

FIN

While I hope this post informs, I’ve worked in cybersecurity for ages and — as a result — don’t really expect anything to change. Tomorrow, I’ll still be blocked from the main CRAN & r-project.org site despite having better “security” than the vast majority of these “secure” CRAN mirrors (and despite having followed the rules). Also, CRAN mirror settings tend to be fairly invisible since most modern R users use the RStudio default (which is really not a bad choice from any “security” analysis angle), choose the first item in the mirror-chooser (Russian roulette!), or live with the setting in the site-wide Rprofile anyway (org-wide risk acceptance/“blame the admin”).

Since I only stated it way back up top (WordPress says this is ~3,900 words but much of that is [I think] code) you can get the full R project for this and examine the data yourself. There is a bit more data and code in the project since I also looked up the IP addresses in Rapid7’s FDNS OpenData study set to really see how many domains point to a particular CRAN mirror but really didn’t want to drag the post on any further.

Now, where did I put those Python 3 & Julia Jupyter notebooks…

The in-dev htmlunit package for javascript-“enabled” web-scraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-“enabled” sites. The Chrome emulation is now also on-par with Chrome 72 series (my Chrome beta is at 73.0.3683.56 so it’s super close to very current).

In reality, the update was to the htmlunitjars package where the main project JAR and dependent JARs all received a refresh.

The README and tests were all re-run on both packages and Travis is happy.

If you’ve got a working rJava installation (aye, it’s 2019 and that’s still “a thing”) then you can just do:

install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is/")

to get them installed and start playing with the DSL or work directly with the Java classes.

FIN

As usual, use your preferred social coding site to log feature requests or problems.

I’ve got a work-in-progress drat-ified CRAN-like repo for (eventually) all my packages over at CINC🔗 (“CINC is not CRAN” and it also sounds like “sync”). This is in parallel with a co-location/migration of all my packages to SourceHut (just waiting for the sr.ht alpha API to be baked) and a self-hosted public Gitea instance. Everything will still be on that legacy social coding site y’all use but the ultimate goal is to have all installs be possible via the CINC repository (i.e. install.packages()) or via a remotes::install_git() install from this standalone or any social coding site.

I’ll eventually publish the workflow but the idea is to customize a pkgdown YAML file in each package repo so the navbar has links back to CINC and other pages (this will take some time as I seem to have made a lot of little packages over the years) and then to add a package to the CINC repo:
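The per-package flow itself is short; here’s a sketch (package path and repo path are placeholders) using the drat/tools calls discussed below:

# build the source tarball, then drop it into the CINC drat tree
tarball <- devtools::build("~/packages/somepkg")   # hypothetical package path

drat::insertPackage(tarball, repodir = "/path/to/cinc")

# drat regenerates the index on insert; this is the same call you'd run
# after manually removing a tarball from the tree
tools::write_PACKAGES("/path/to/cinc/src/contrib", type = "source")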

The above processes helped shine a light on some bad README practices I’ve had and also on how to make it a bit easier (in the future) to install C[++]-backed packages. Speaking of READMEs, I also need to get all the READMEs updated to use either install.packages() from CINC or a remotes install from Gitea.

Another couple of goals are to possibly get binary package versions added (though that’s going to be interesting orchestration exercise) and see if I can’t get some notary🔗 concepts implemented.

It’s actually been a fun mini-project since the drat part is as simple as drat::insertPackage('PKG', '/path/to/cinc') (#ty Dirk!) — though I need to think through some logic around maintaining Archive versions and also around deleting packages, which drat doesn’t do yet but is also as simple as removing tarballs and running tools::write_PACKAGES().

As an aside, I also drat-ified all our $WORK packages and made that repo work-internally-accessible via static S3 web hosting. At $0.023 USD per GB (per-month) for just hosting the objects and $0.0004 USD per 1,000 GET requests (plus minimal setup charges for SSL) it’s super cheap and also super-easy to maintain. Drop a note in the comments if you’re interested in more details of the S3 drat setup.
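In case it helps picture it, the consumer side of an S3-hosted drat repo is just install.packages() pointed at wherever the tree is served from (the package name and URL below are placeholders):

install.packages("ourpkg", repos = "https://pkgs.example.com") # hypothetical package & repo URL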

FIN

After a few more weeks’ baking period, the self-hosted Gitea and CINC sites will have all non-error web-logging disabled and error logs won’t save IP addresses or referrers (I welcome anyone who wants to third-party audit the nginx configs), since another goal is also to help folks not be a product for tech startups or giant, soulless, global multi-national companies with a history of being horrendously evil.

Be on the lookout for a full writeup with code in the coming weeks.

P.S.

For Safari-users on 10.14+ I’ve made some tweaks to the “batman mode” version of the site. If you do use Safari (but…why?!) and have any issues with readability in “dark mode” just drop a note in the comments and I’ll see what I can do.

There are two fledgling rJava-based R packages that enable working with the AWS SDK for Athena:

They’re both needed to conform with the way CRAN likes rJava-based packages that have large JAR dependencies to be submitted. The goal is to eventually have wrappers for anything R folks need under the AWS Java SDK menu.

All package pairs will eventually cohabitate under the Cloudy R Project once each gets to 90% API coverage, passes CRAN checks and has passing Travis checks.

One thing I did get working right up front was the asynchronous dplyr chain query execution collect_async(), so if you need that and would rather not use reticulated wrappers, now’s your chance.

You would be correct in assuming this is an offshoot of the recent work on updating metis. My primary impetus for this is to remove the reticulate dependency from our Dockerized production setups but I also have discovered I like the Java libraries more than the boto3-based ones (not really a shocker there if you know my views on Python). As a result I should be able to quickly wrap most any library you may need (see below).

FIN

The next major wrapper coming is S3 (there are bits of it implemented in awsathena now but that’s temporary) and — for now — you can toss a comment here or file an issue in any of the social coding sites you like for priority wrapping of other AWS Java SDK libraries. Also, if you want some experience working with rJava packages in a judgement-free zone, drop a note into these or any new AWS rJava-based package repos and I’ll gladly walk you through your first PR.

Those outside the Colonies may not know that Payless—a national chain that made footwear affordable for millions of ‘Muricans who can’t spare $100.00 USD for a pair of shoes their 7-year-old will outgrow in a year—is closing. CNBC also had a story that featured a choropleth with a tiny button at the bottom that indicated one could get the data:

I should have known this would turn out to be a chore since they used Tableau—the platform of choice when you want to take advantage of all the free software libraries they use to power their premier platform which, in turn, locks up all the data for you so others can’t adopt, adapt and improve. Go. Egregious. Predatory. Capitalism.

Anyway.

I wanted the data to do some real analysis vs produce a fairly unhelpful visualization (TLDR: layer in Census data for areas impacted, estimate job losses, compute nearest similar Payless stores to see impact on transportation-challenged homes, etc. Y’now, citizen data journalism-y things) so I pressed the button and watched for the URL in Chrome (aye, for those that remember I moved to Firefox et al in 2018, I switched back; more on that in March) and copied it to try to make this post actually reproducible (a novel concept for Tableau fanbois):

library(tibble)
library(readr)

# https://www.cnbc.com/2019/02/19/heres-a-map-of-where-payless-shoesource-is-closing-2500-stores.html

tfil <- "~/Data/Sheet_3_data.csv"

download.file(
  "https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",
  tfil
)
## trying URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## Error in download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  : 
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## In addition: Warning message:
## In download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  :
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true': HTTP status was '410 Gone'

WAT

Truth be told I expected a time-boxed URL of some sort (prior experience FTW). Selenium or Splash were potential alternatives but I didn’t want to research the legality of more forceful scraping (I just wanted the data) so I manually downloaded the file (*the horror*) and proceeded to read it in. Well, try to read it in:

read_csv(tfil)
## Parsed with column specification:
## cols(
##   A = col_logical()
## )
## Warning: 2092 parsing failures.
## row col           expected actual                      file
##   1   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   2   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   3   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   4   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   5   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
## ... ... .................. ...... .........................
## See problems(...) for more details.
## 
## # A tibble: 2,090 x 1
##    A    
##    <lgl>
##  1 NA   
##  2 NA   
##  3 NA   
##  4 NA   
##  5 NA   
##  6 NA   
##  7 NA   
##  8 NA   
##  9 NA   
## 10 NA   
## # … with 2,080 more rows

WAT

Getting a single column back from readr::read_[ct]sv() is (generally) a tell-tale sign that the file format is amiss. Before donning a deerstalker (I just wanted the data!) I tried to just use good ol’ read.csv():

read.csv(tfil, stringsAsFactors=FALSE)
## Error in make.names(col.names, unique = TRUE) : 
##   invalid multibyte string at '<ff><fe>A'
## In addition: Warning messages:
## 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 1 appears to contain embedded nulls
## 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 2 appears to contain embedded nulls
## 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 3 appears to contain embedded nulls
## 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 4 appears to contain embedded nulls
## 5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 5 appears to contain embedded nulls

WAT

Actually the “WAT” isn’t really warranted since read.csv() gave us some super-valuable info via invalid multibyte string at '<ff><fe>A'. FF FE is a big signal that we’re working with a file in another encoding, as it’s a common “magic” byte-order-mark sequence at the start of such files.

But, I didn’t want to delve into my Columbo persona… I. Just. Wanted. The. Data. So, I tried the mind-bendingly fast and flexible helper from data.table:

data.table::fread(tfil)
## Error in data.table::fread(tfil) : 
##   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

AHA. UTF-16 (maybe). Let’s poke at the raw file:

x <- readBin(tfil, "raw", file.size(tfil)) ## also: read_file_raw(tfil)

x[1:100]
##   [1] ff fe 41 00 64 00 64 00 72 00 65 00 73 00 73 00 09 00 43 00
##  [21] 69 00 74 00 79 00 09 00 43 00 6f 00 75 00 6e 00 74 00 72 00
##  [41] 79 00 09 00 49 00 6e 00 64 00 65 00 78 00 09 00 4c 00 61 00
##  [61] 62 00 65 00 6c 00 09 00 4c 00 61 00 74 00 69 00 74 00 75 00
##  [81] 64 00 65 00 09 00 4c 00 6f 00 6e 00 67 00 69 00 74 00 75 00

There’s our ff fe (which raises the possibility it’s UTF-16) but that 41 00 harkens back to UTF-16’s older sibling UCS-2. The 0x00s are embedded NULs (likely to get bytes aligned). And, there are a lot of 09s. Y’know what they are? They’re <tab>s. That’s right: Tableau named a file full of TSV records, in an unnecessarily elaborate encoding, as CSV. Perhaps they broke the “T” on all their keyboards typing their product name so much.
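
If you’d rather confirm those byte-level observations in code than eyeball the hex dump, here’s a minimal base-R sketch using the x raw vector from above (the specific checks are mine, not part of the original sleuthing):

# does the file start with the ff fe byte-order mark used by UTF-16LE / UCS-2LE?
identical(x[1:2], as.raw(c(0xff, 0xfe)))   # TRUE for this file (see the hex dump above)

# how many raw 0x09 (tab) bytes are lurking in there?
sum(x == as.raw(0x09))

# and 0x09 really is a <tab>
rawToChar(as.raw(0x09))                    # "\t"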

Living A Boy’s [Data] Adventure Tale

At this point we have:

  • no way to support an automated, reproducible workflow
  • an ill-named file for what it contains
  • an overly-encoded file for what it contains
  • many wasted minutes (which is likely by design to have us give up and just use Tableau. No. Way.)

At this point I was in full-on Rockford Files (pun intended) mode and delved down to the command line to use an old, trusted sidekick, enca:

$ enca -L none Sheet_3_data.csv
## Universal character set 2 bytes; UCS-2; BMP
##   LF line terminators
##   Byte order reversed in pairs (1,2 -> 2,1)
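
If you’d rather not leave the R console for that step, you can shell out to enca with system2(); here’s a one-line sketch of my own, assuming enca is installed and on your PATH:

system2("enca", c("-L", "none", path.expand(tfil)), stdout = TRUE)  # expand the ~ in tfil before handing it off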

Now, all we have to do is specify the encoding!

read_tsv(tfil, locale = locale(encoding = "UCS-2LE"))
## Error in guess_header_(datasource, tokenizer, locale) : 
##   Incomplete multibyte sequence

WAT

Unlike the other 99% of the time (mebbe 99.9%) you use it, the tidyverse doesn’t have your back in this situation (but it does have your backlog in that it’s on the TODO).

Y’know who does have your back? Base R!:

# NB: the %>% pipe comes from magrittr (or dplyr); only tibble & readr were attached above
read.csv(tfil, sep="\t", fileEncoding = "UCS-2LE", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##    <chr>   <chr> <chr>   <int> <chr>    <dbl>     <dbl>
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables:
## #   Number.of.Records <int>, State <chr>, Store.Number <int>,
## #   Store.count <int>, Zip.code <chr>, State.Usps <chr>,
## #   statename <chr>

WAT WOOT!

Note that read.csv(tfil, sep="\t", fileEncoding = "UTF-16LE", stringsAsFactors=FALSE) would have worked equally well.

The Road Not [Originally] Taken

Since this activity decimated productivity, for giggles I turned to another trusted R sidekick, the stringi package, to see what it said:

library(stringi)

stri_enc_detect(x)
## [[1]]
##      Encoding Language Confidence
## 1    UTF-16LE                1.00
## 2  ISO-8859-1       pt       0.61
## 3  ISO-8859-2       cs       0.39
## 4    UTF-16BE                0.10
## 5   Shift_JIS       ja       0.10
## 6     GB18030       zh       0.10
## 7      EUC-JP       ja       0.10
## 8      EUC-KR       ko       0.10
## 9        Big5       zh       0.10
## 10 ISO-8859-9       tr       0.01

And, just so it’s primed in the Google caches for future searchers, another way to get this data (and other data that’s even gnarlier but similar in form) into R would have been:

# stri_read_lines() sorted out the encoding for us and handed back UTF-8 text
stri_read_lines(tfil) %>% 
  paste0(collapse="\n") %>% 
  read.csv(text=., sep="\t", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##    <chr>   <chr> <chr>   <dbl> <chr>    <dbl>     <dbl>
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables: `Number of
## #   Records` <dbl>, State <chr>, `Store Number` <dbl>, `Store
## #   count` <dbl>, `Zip code` <chr>, `State Usps` <chr>,
## #   statename <chr>

(with similar dances to use read_csv() or fread()).
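
For the curious, those dances would look roughly like the following. These are untested sketches of my own: 2019-era readr treats a literal string containing newlines as inline data (newer readr wants it wrapped in I()), and data.table::fread() accepts already-decoded lines via its text= argument.

# readr variant: collapse the re-encoded lines into one string and parse it as TSV
stri_read_lines(tfil) %>% 
  paste0(collapse="\n") %>% 
  read_tsv()

# data.table variant: decode to UTF-8 first (per fread()'s own error message), then use text=
con <- file(tfil, encoding = "UCS-2LE")
lines_utf8 <- readLines(con)
close(con)
data.table::fread(text = lines_utf8, sep = "\t")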

FIN

The night’s quest to do some real work with the data was DoS’d by what I’ll brazenly call a deliberate attempt to dissuade doing exactly that in anything but a commercial program. But, understanding the impact of yet another massive retail store closing is super-important and it looks like it may be up to us (since the media is too distracted by incompetent leaders and inexperienced junior NY representatives) to do the work, so it is left for another eve.

Folks who’d like to do the same can grab the actual, UTF-8-encoded CSV from this site; it has also been run through janitor::clean_names() so there are proper column names and types to work with.

Speaking of which, here’s the cols spec for that CSV:

cols(
  address = col_character(),
  city = col_character(),
  country = col_character(),
  index = col_double(),
  label = col_character(),
  latitude = col_double(),
  longitude = col_double(),
  number_of_records = col_double(),
  state = col_character(),
  store_number = col_double(),
  store_count = col_double(),
  zip_code = col_character(),
  state_usps = col_character(),
  statename = col_character()
)

If you do anything with the data, blog about it and post a link in the comments so I and others can learn from what you’ve discovered! It’s already kinda scary that one doesn’t even need a basemap to see just how much a part of ‘Murica Payless was:
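
If you want to reproduce that no-basemap view yourself, here’s a minimal sketch of my own. It assumes you’ve saved the cleaned, UTF-8 CSV locally as payless-closings.csv (a hypothetical filename) with the janitor::clean_names() column names from the cols() spec above:

library(readr)
library(dplyr)
library(ggplot2)

# optionally pass the cols() spec shown above via col_types=
closings <- read_csv("payless-closings.csv")

closings %>% 
  filter(country == "United States") %>%   # the file has a country column; keep the 'Murica view
  ggplot(aes(longitude, latitude)) +
  geom_point(size = 0.5, alpha = 1/3) +
  coord_quickmap() +                       # no basemap needed; the dots do the work
  theme_minimal() +
  labs(title = "Payless store closings", x = NULL, y = NULL)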