
Phishing is [still] the primary way attackers either commit the criminal act itself (i.e. phish a target to, say, install ransomware) or gain an initial foothold in an organization so they can perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank, a collaborative clearing house for data and information about phishing on the Internet. PhishTank also provides an open API for developers and researchers to integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations, the data is also useful for security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns, which include (amongst many other elements) the network (internet) location of the phishing site, the industry being targeted, the domain names being used, the types of sites being cloned/copied and the feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and, of those, which types of SSL certificates are “en vogue”.

Phishing sites are increasingly using and relying on SSL certificates because we in the information security industry spent a decade instructing the general internet-surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties to install phishing sites/pages since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt has made it possible for attackers to set up their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decades-old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put your API key into the PHISHTANK_API_KEY environment variable, which is best done via your ~/.Renviron file).
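For instance, here’s a quick (and minimal) sanity check that the key is visible to your R session; the key value shown below is a placeholder, not a real key:

# better: add PHISHTANK_API_KEY=... to ~/.Renviron and restart R; Sys.setenv() is a session-only stopgap
Sys.setenv(PHISHTANK_API_KEY = "0123456789abcdef") # placeholder value

nzchar(Sys.getenv("PHISHTANK_API_KEY")) # should be TRUE before calling any aquarium functions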

Let’s set up all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to utilize your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either using the time-stamped approach below or using some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl) # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium) # git[la|hu]b/hrbrmstr/aquarium
library(gt) # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below; a rough sketch of such stand-ins follows.
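Something along these lines (an untested sketch built on the urltools package, which is my substitution and not part of the original workflow) might do the trick:

library(urltools)

domain <- function(x) urltools::domain(x) # full hostname from a URL

apex_domain <- function(x) { # hostname -> registrable/apex domain via the public suffix list
  parts <- urltools::suffix_extract(x)
  ifelse(is.na(parts$domain), parts$host, paste(parts$domain, parts$suffix, sep = "."))
}

Now, we get a copy of the current PhishTank dataset & cache it: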

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf)
## Observations: 16,446
## Variables: 9
## $ phish_id          <chr> "5884184", "5884138", "5884136", "5884135", ...
## $ url               <chr> "http://internetbanking-bancointer.com.br/lo...
## $ phish_detail_url  <chr> "http://www.phishtank.com/phish_detail.php?p...
## $ submission_time   <dttm> 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2...
## $ verified          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ verification_time <dttm> 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2...
## $ online            <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ details           <list> [<209.132.252.7, 209.132.252.0/24, 7296 468...
## $ target            <chr> "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign, the URL of the site, and a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1])
## List of 1
##  $ :'data.frame':    1 obs. of  6 variables:
##   ..$ ip_address        : chr "209.132.252.7"
##   ..$ cidr_block        : chr "209.132.252.0/24"
##   ..$ announcing_network: chr "7296 468"
##   ..$ rir               : chr "arin"
##   ..$ country           : chr "US"
##   ..$ detail_time       : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent

Let’s see how many are using SSL:

(ssl_ct)
## [1] 383

(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than a recent “50% of all phishing sites use encryption” statistic going around of late. There are many reasons for the difference:

  • PhishTank doesn’t have all phishing sites in it
  • We just looked at a week of examples
  • Some sites were offline at the time of access attempt
  • Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the 20% deviation, 30% is still a decent percentage, and a green “everything’s secure” lock icon is still a valued prize, so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation, so we’ll grab them in parallel. It’s also quite possible the “online” status in the data frame glimpse above is inaccurate (sites can go offline quickly), so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {

  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")

} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let’s see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25

(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to this post, when attackers want to use SSL for the lock-icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top “apex” domains (https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE)
## # A tibble: 255 x 2
##    apex                              n
##    <chr>                         <int>
##  1 000webhostapp.com                42
##  2 google.com                       17
##  3 umbler.net                        8
##  4 sharepoint.com                    6
##  5 com-fl.cz                         5
##  6 lbcpzonasegurabeta-viabcp.com     4
##  7 windows.net                       4
##  8 ashaaudio.net                     3
##  9 brijprints.com                    3
## 10 portaleisp.com                    3
## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here usually points to), Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart given the pervasive trust associated with those properties). There are 241 distinct apex domains in this 1-week set, so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parswers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN                                          n    pct
DST Root CA X3                             96  26.82%
COMODO RSA Certification Authority         93  25.98%
DigiCert Global Root G2                    45  12.57%
Baltimore CyberTrust Root                  30   8.38%
GlobalSign                                 27   7.54%
DigiCert Global Root CA                    15   4.19%
Go Daddy Root Certificate Authority – G2   14   3.91%
COMODO ECC Certification Authority         11   3.07%
Actalis Authentication Root CA              9   2.51%
GlobalSign Root CA                          4   1.12%
Amazon Root CA 1                            3   0.84%
Let’s Encrypt Authority X3                  3   0.84%
AddTrust External CA Root                   2   0.56%
DigiCert High Assurance EV Root CA          2   0.56%
USERTrust RSA Certification Authority       2   0.56%
GeoTrust Global CA                          1   0.28%
SecureTrust CA                              1   0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Comodo is not far behind and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide. (It’s important to read the comments near the lines of code above that make assumptions about the order of the returned issuer information.) Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN                                                     n    pct
Let’s Encrypt Authority X3                            99  27.65%
cPanel\, Inc. Certification Authority                 75  20.95%
RapidSSL TLS RSA CA G1                                45  12.57%
Google Internet Authority G3                          24   6.70%
COMODO RSA Domain Validation Secure Server CA         20   5.59%
CloudFlare Inc ECC CA-2                               18   5.03%
Go Daddy Secure Certificate Authority – G2            14   3.91%
COMODO ECC Domain Validation Secure Server CA 2       11   3.07%
Actalis Domain Validation Server CA G1                 9   2.51%
RapidSSL RSA CA 2018                                   9   2.51%
Microsoft IT TLS CA 1                                  6   1.68%
Microsoft IT TLS CA 5                                  6   1.68%
DigiCert SHA2 Secure Server CA                         5   1.40%
Amazon                                                 3   0.84%
GlobalSign CloudSSL CA – SHA256 – G3                   2   0.56%
GTS CA 1O1                                             2   0.56%
AlphaSSL CA – SHA256 – G2                              1   0.28%
DigiCert SHA2 Extended Validation Server CA            1   0.28%
DigiCert SHA2 High Assurance Server CA                 1   0.28%
Don Dominio / MrDomain RSA DV CA                       1   0.28%
GlobalSign Extended Validation CA – SHA256 – G3        1   0.28%
GlobalSign Organization Validation CA – SHA256 – G2    1   0.28%
RapidSSL SHA256 CA                                     1   0.28%
TrustAsia TLS RSA CA                                   1   0.28%
USERTrust RSA Domain Validation Secure Server CA       1   0.28%
NA                                                     1   0.28%

LE is number one again! But it’s important to note that these issuer CommonNames can roll up into a single issuing organization, given just how messed up integrity and encryption capability is when it comes to web site certs, so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers; a starter sketch follows).
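If you want to take a crack at that, one crude approach is to map the intermediary CommonNames to a parent organization with some substring rules (the groupings below are my guesses, not authoritative CA ownership data):

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  mutate(
    issuing_org = case_when( # very rough roll-up rules
      stri_detect_regex(CN, "Let.s Encrypt|DST Root")    ~ "Let's Encrypt / ISRG",
      stri_detect_regex(CN, "COMODO|USERTrust|cPanel")   ~ "Comodo / Sectigo",
      stri_detect_regex(CN, "DigiCert|RapidSSL|GeoTrust") ~ "DigiCert",
      stri_detect_regex(CN, "GlobalSign|AlphaSSL")        ~ "GlobalSign",
      stri_detect_regex(CN, "Google|GTS")                 ~ "Google",
      stri_detect_regex(CN, "Microsoft")                  ~ "Microsoft",
      TRUE ~ CN
    )
  ) %>%
  count(issuing_org, sort = TRUE)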

FIN

There are tons of avenues to explore with this data, so I hope this post has whetted your collective appetites sufficiently for you to dig into it, especially if you have some down-time coming.

Let me also take this opportunity to reissue guidance I and many others have uttered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind the form you’re about to enter your personal information and credit card info into.

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this).

Today, they sent me this gem by @JeromeDeveloper: Scrapy and Scrapyrt: how to create your own API from (almost) any website. Go ahead and click it. Give the Medium author the claps they so desperately crave (and get some context for the rant below).

I have no issue with @JeromeDeveloper’s coding prowess, nor Scrapy/Scrapyrt. In fact, I’m a huge fan of the folks at ScrapingHub, so much so that I wrote splashr to enable use of their Splash server from R.

My issue is with the example the author chose to use.

CoinMarketCap provides cryptocurrency prices and other cryptocurrency info. I use it to track cryptocurrency prices to see which currency attackers who pwn devices to install illegal cryptocurrency rigs might be switching to next and to get a feel for when they’ll stop mining and go back to just stealing data and breaking things.

CoinMarketCap has an API with a generous free tier and the following text in their Terms & Conditions (which, in the U.S., may [soon] stupidly have to be explicitly repeated on every page where scraping is prohibited vs in a universal site link):

You may not, and shall not, copy, reproduce, download, “screen scrape”, store, transmit, broadcast, publish, modify, create a derivative work from, display, perform, distribute, redistribute, sell, license, rent, lease or otherwise use, transfer (either in printed, electronic or other format) or exploit any Content, in whole or in part, in any way that does not comply with these Terms without our prior written permission.

There is only one reason (apart from complete oblivion) to use CoinMarketCap as an example: to show folks how clever you are at bypassing site restrictions and eventually avoiding paying for an API to get data that you did absolutely nothing to help gather, curate and set up infrastructure for. There is no mention of “be sure what you are doing is legal/ethical”, just a casual caution not to abuse the Scrapyrt technology since it may get you banned.

Ethics matter across every area of “data science” (of which scraping is one component). Just because you can do something doesn’t mean you should, and just because you don’t like Terms & Conditions and want to grift the work of others for fun & profit also doesn’t mean you should; and it definitely doesn’t mean you should be advocating that others do it as well.

Ironically, Medium itself places restrictions on what you can do:

Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.

yet they advocated I read and heed a post which violates similar terms of another site. So I wonder how they’d feel if I did a riff of that post and showed how to set up a hackish API to scrape all their content. O_o

The libcurl library (the foundational library behind the RCurl and curl packages) has used OpenSSL’s default ciphers since version 7.56.0 (October 4, 2017). If you’re a regular updater of curl/httr you should be fairly current with these cipher suites, but if you’re not a keen updater or use RCurl for your web-content tasks, you are likely not working with a recent cipher list and may start running into trouble as the internet’s self-proclaimed web guardians continue their wild-abandon push towards “HTTPS Everywhere”.
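Before going further, a quick (and admittedly minimal) way to see which libcurl and TLS backend your R client packages are working with is:

curl::curl_version()[c("version", "ssl_version")] # libcurl version + TLS backend the curl package sees
packageVersion("curl")
packageVersion("openssl")
packageVersion("RCurl")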

Why is this important? Well, as a web consumer (via browsers) you likely haven’t run into any issues when visiting SSL/TLS-enabled sites since most browsers update super-frequently and bring along modern additions to cipher suites with them. Cipher suites are one of the backbones of assurance when it comes to secure connections to servers and stronger/different ciphers are continually added to openssl (and other libraries). If a server (rightfully) only supports a modern, seriously secure TLS configuration, clients that do not have such support won’t be able to connect and you’ll see errors such as:

SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

You can test what a server supports via tools like SSL Test. I’d point to a command-line tool but there are enough R users on crippled Windows systems that it’s just easier to have you point and click to see. If you are game to try a command-line tool then give testssl a go from an RStudio terminal (I use that suggestion specifically to be platform agnostic as I cannot assume R Windows users know how to use a sane shell). The testssl script has virtually no dependencies so it should “work everywhere”. Note that both SSL Test and testssl make quite a few connections to a site, so make sure you’re only using your own server(s) as test targets unless you have permission from others to use theirs (go ahead and hit mine if you like).
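If you do go the testssl route, something like the following can even be driven from an R session (the clone location and flags here are illustrative, not gospel):

# assumes you've cloned https://github.com/drwetter/testssl.sh into ~/testssl.sh
testssl <- path.expand("~/testssl.sh/testssl.sh")
out <- system2(testssl, args = c("--fast", "https://rud.is/"), stdout = TRUE)
head(out, 20) # skim the first chunk of the report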

You can also see what your R client packages support. One could run:

library(magrittr)

read.table(
  text = system("openssl ciphers -v", intern=TRUE) %>% 
    gsub("[[:alpha:]]+=", "", .)
) %>% 
  setNames(
    c("ciphername", "protoccol_version", "key_exchange", "authentication", 
      "symmetric_encryption_method", "message_authentication_code")
  )

in an attempt to do that via the openssl binary on your system, but Windows users likely won’t be able to run that (unlike every other modern OS, including iOS) and it might not show you what your installed R client packages can handle since they may be using different libraries.

So, another platform-agnostic option (one that requires a call to a web site, so there’s some privacy leakage) is to use How’s My SSL.

ssl_check_url <- "https://www.howsmyssl.com/a/check"

jsonlite::fromJSON(
  readLines(url(ssl_check_url), warn = FALSE)
) -> base_chk

jsonlite::fromJSON(
  RCurl::getURL(ssl_check_url)
) -> rcurl_chk

jsonlite::fromJSON(
  rawToChar(
    curl::curl_fetch_memory(ssl_check_url)$content
  )
) -> curl_chk

Compare the $given_cipher_suites for each of those to see how they differ and also take a look at $rating. macOS and Linux users should have fairly uniform results for all three. Windows users may be in for a sad awakening (I mean, you’re used to that on a regular basis, so it’s cool). You can also configure how you communicate what you support via the ssl_cipher_list cURL option (capitalization is a bit different with RCurl, but I kinda want you to use the curl package so you’re on your own to translate). Note that you can’t game the system and claim you can handle a cipher you don’t actually have.
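For example (a minimal comparison; the field names come from the How’s My SSL JSON response):

# how many ciphers each client offered
lengths(list(
  base  = base_chk$given_cipher_suites,
  RCurl = rcurl_chk$given_cipher_suites,
  curl  = curl_chk$given_cipher_suites
))

# ciphers the curl package offered that the base R connection did not
setdiff(curl_chk$given_cipher_suites, base_chk$given_cipher_suites)

# and the overall ratings
c(base = base_chk$rating, RCurl = rcurl_chk$rating, curl = curl_chk$rating)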

FIN

You should try to stay current with the OpenSSL (or equivalent) library on your operating system, with the libcurl library on your system, and with the curl, openssl, and RCurl packages. You may have production rules requiring lead-times for changing configs, but these should be in the “test when they make it to CRAN and install as-soon-as-possible-thereafter” category.

Despite their now inherent evil status, GitHub has some tools other repository aggregators do not. One such tool is the free vulnerability alert service which will scan repositories for outdated+vulnerable dependencies.

Now, “R” is nowhere near a first-class citizen on the internet writ large, including in software development tooling (e.g. the Travis-CI and GitLab continuous integration recipes are community maintained vs a first-class/supported offering). This also means that GitHub’s service will never check for nor alert when a pure R package has security issues, mostly because there are only a teensy few of us who even bother to check packages for issues once in a while and there’s no real way to easily report said issues into the CVE process (though I guess I could, given that my $DAYJOB is an official CVE issuer). So the integrity & safety of the R package ecosystem is still in the “trust me, everything’s fine!!” state. Given that, any extra way to keep even some R packages less insecure is great.

So, right now you’re thinking “you click-baited us with a title that your lede just said isn’t possible… WTHeck?!”.

It’s true that GitHub does not consider R a first-class citizen, but it does support Java and:

    available.packages() %>% 
      dplyr::as_data_frame() %>% 
      tidyr::separate_rows(Imports, sep=",[[:space:]]*") %>% # we really just
      tidyr::separate_rows(Depends, sep=",[[:space:]]*") %>% # need these two
      tidyr::separate_rows(Suggests, sep=",[[:space:]]*") %>%
      tidyr::separate_rows(Enhances, sep=",[[:space:]]*") %>%
      dplyr::select(Package, Imports, Depends) %>% 
      filter(
        grepl("rJava", Imports) | grepl("rJava", Depends) # i.e. rJava in Imports or Depends
      ) %>% 
      dplyr::distinct(Package) %>% 
      dplyr::summarise(total_pkgs_using_rjava = n())
    ## # A tibble: 1 x 1
    ##   total_pkgs_using_rjava
    ##                    <int>
    ## 1                     66

according to ☝ there are 66 CRAN packages that require rJava, seven of which explicitly provide only JARs (a compressed directory tree of supporting Java classes). There are more CRAN-unpublished rJava-based projects on GitLab & GitHub, but it’s likely that public-facing rJava packages that include or depend on public JAR-dependent projects still number less than ~200. Given the now >13K packages in CRAN, this is a tiny subset but with the sorry state of R security, anything is better than nothing.

Having said that, one big reason (IMO) for the lack of Java-wrapped CRAN or “devtools”-only released rJava-dependent packages is that it’s 2018 and you still have better odds of winning a Vegas jackpot than you do of getting rJava to work on your workstation in less than 4 tries, especially after an OS upgrade. That’s sad since there are many wonderful, solid and useful Java libraries that would be super-handy for many workflows, yet most of us package-writers (I’m including myself) would rather spin wheels getting C++ or Rust libraries working with R than try to make it easier for regular R users to tap into that rich Java ecosystem.

But, I digress.

For the handful of us that do write and use rJava-based packages, we can better serve our userbase by deliberately putting those R+Java repos on GitHub. Now, I hear you. They’re evil and by doing this one of the most evil corporations on the planet can make money with your metadata (and, possibly just blatantly steal your code for use in-product without credit) but I’ll give that up on a case-by-case basis to make it easier to keep users safe.

Why will this enhance safety? Go take a look at one of my non-CRAN rJava-backed packages: pdfbox. It has this awesome “in-your-face” security warning banner:

The vulnerability is CVE-2018-11797, which has a baseline-computed “high severity” rating and the following specific weakness: “In Apache PDFBox 1.8.0 to 1.8.15 and 2.0.0RC1 to 2.0.11, a carefully crafted PDF file can trigger an extremely long running computation when parsing the page tree.” So, it’s a process denial-of-service vulnerability. You’ll also note I haven’t updated the JARs yet (mostly since it’s not a code-execution vulnerability).

I knew about this 28 days ago (I’ve been incredibly busy and there’s a lot of blather required to talk about it, hence the delay in blogging) thanks to the GitHub service and will resolve it when I get some free time over the Thanksgiving break. I received an alert for this, there are hooks for security alerts (so one can auto-create an issue), and there’s a warning for users; any of them could file an issue to let me know it’s super-important to them that I get it fixed (or they could be super-awesome and file a PR :-).

FIN

The TLDR is (first) a note to package authors who use rJava: bite the GitHub bullet and take advantage of this free service; and (second) a note to users: encourage the authors of packages you use to adopt this service and keep a watchful eye out for any security alerts for code you depend on to get things done.

A (perhaps) third and final note is for all of us to be continually mindful about the safety & integrity of the R package ecosystem and do what we can to keep moving it forward.

If you’re an R/RStudio user who has migrated to Mojave (macOS 10.14) or are contemplating migrating to it, you will likely eventually run into an issue where you’re trying to access resources that are in Apple’s new hardened filesystem sandboxes.

Rather than reinvent the wheel by blathering about what that means, Google/DuckDuckGo for macos privacy tcc full disk access if you want more background info, as there have been a ton of blogs on this topic.

The TLDR for getting your R/RStudio bits to work in these sandboxed areas is to grant them “Full Disk Access” (if that sounds scary, it is) permissions. You can do that by adding both the R executable and RStudio executable (drag their icons) to the Full Disk Access pane under the Privacy tab of System Preferences => Security & Privacy:

I also used the Finder’s “Go To” command to head on over to /usr/local/bin, used the “Show Original” popup on both R and Rscript, and dragged their fully qualified path binaries into the pane as well (you can’t see them b/c the pane is far too small). The symbolic links might be sufficient, but I’ve been running this way since the betas and just re-drag the versioned R/Rscript binaries each time I upgrade (or rebuild) R.
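If you want a crude way to check whether a given R session actually has that access, try listing a TCC-protected location; ~/Library/Mail is one such spot (this is a rough heuristic on my part, not an official API):

# without Full Disk Access this typically comes back empty rather than erroring
mail_dir <- path.expand("~/Library/Mail")
res <- tryCatch(list.files(mail_dir, all.files = TRUE, no.. = TRUE), error = function(e) character(0))

if (length(res) > 0) {
  message("This R session appears to have Full Disk Access.")
} else {
  message("No Full Disk Access (or the folder is genuinely empty).")
}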

If you do grant FDA to R/RStudio, just make sure to be even more careful about trusting R code you run from the internet or R packages you install from untrusted sources (like GitHub) since R/RStudio are now potential choice conduits for malicious code that wants to get at your seekret things.


The CBC covered the recent (as of the original post-time on this blog entry) Quebec elections and used a well-crafted hex grid map to display results:

They have a great ‘splainer on why they use this type of map.

Thinking that it may be useful for others, I used a browser Developer Tools inspector to yank out the javascript-created SVG and wrangled out the hexes using svg2geojson and put them into a GeoJSON file along with some metadata that I extracted from the minified javascript on the CBC’s site and turned into a data frame using the V8 package. Since most of the aforementioned work was mouse clicking and ~8 (disjointed) lines of accompanying super-basic R code, there’s not really much to show wrangling-wise1, but I can show an example of using the new GeoJSON file in R and the sf package:

library(sf)
library(ggplot2)

# get the GeoJSON file from: https://gitlab.com/hrbrmstr/quebec-hex-ridings or https://github.com/hrbrmstr/quebec-hex-ridings
sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill = regionname)) +
  coord_sf(datum = NA) +
  ggthemes::scale_fill_tableau(name = NULL, "Tableau 20") +
  ggthemes::theme_map() +
  theme(legend.position = "bottom")

And, with a little more ggplot2-tweaking and some magick, we can even put it in the CBC-styled border:

library(sf)
library(magick)
library(ggplot2)

plt <- image_graph(1488, 1191, bg = "white")
sf::st_read("quebec-ridings.geojson", quiet=TRUE, stringsAsFactors=FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill=regionname)) +
  coord_sf(datum=NA) +
  scale_x_continuous(expand=c(0,2)) +
  scale_y_continuous(expand=c(0,0)) +
  ggthemes::theme_map() +
  theme(plot.margin = margin(t=150)) +
  theme(legend.position = "none")
dev.off()

# get this bkgrnd img from the repo
image_composite(plt, image_read("imgs/background.png")) %>% 
  image_write("imgs/composite-map.png")

You can tweak the border color with magick as needed, and there’s a background2.png in the imgs directory in the repo that has the white inset that you can further composite as needed.

With a teensy bit of work you should be able to adjust the stroke color via aes() to separate things as the CBC did.
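Something along these lines (an untested sketch) is one way to go about it:

sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>%
  ggplot() +
  geom_sf(aes(fill = regionname, color = regionname), size = 0.25, show.legend = FALSE) + # stroke now follows the region, too
  coord_sf(datum = NA) +
  ggthemes::theme_map()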

FIN

It’s important to re-state that the CBC made the original polygons for the hexes (well, they made a set of grid points and open source software turned it into a set of SVG paths) and the background images. All I did was some extra bit of wrangling and conversionating2.

1 I can toss a screencast if there’s sufficient interest.
2 Totally not a word.

In my semi-daily run of brew update I noticed that proj4 had been updated to 5.2. I kinda “squeee“‘d since (as the release notes show) the Equal Earth projection was added to it (+proj=eqearth).

As the team who created the projection describes it: “The Equal Earth map projection is a new equal-area pseudocylindrical projection for world maps. It is inspired by the widely used Robinson projection, but unlike the Robinson projection, retains the relative size of areas. The projection equations are simple to implement and fast to evaluate. Continental outlines are shown in a visually pleasing and balanced way.”

They continue: “The Robinson and Equal Earth projections share a similar outer shape[…] Upon close inspection, however, the way that they differ becomes apparent. The Equal Earth with a height-to-width ratio of 1:2.05 is slightly wider than the Robinson at 1:1.97. On the Equal Earth, the continents in tropical and mid-latitude areas are more elongated (north to south) and polar areas are more flattened. This is a consequence of Equal Earth being equal-area in contrast to the Robinson that moderately exaggerates high-latitude areas.”

Here’s a comparison of it to other, similar, projections:

©Taylor & Francis Group, 2018. All rights reserved.

Map projections are a pretty nerdy thing, but this one even got the attention of Newsweek.

To use this new projection in R right now, you’ll need to install the proj4 package from source after upgrading to the new proj4 library. That was a simple brew upgrade for me, and Linux users can do the various package-manager incantations to get theirs as well. Windows users can be jealous for a while until updated package binaries make their way to CRAN (or switch to a real platform).

After a fresh source install of proj4 all you have to do is:

library(ggalt) # git[la|hu]b/hrbrmstr/ggalt
library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes
library(ggplot2)

world <- map_data("world")

ggplot() +
  geom_map(
    map = world, data = world,
    aes(long, lat, map_id = region), 
    color = ft_cols$white, fill = ft_cols$slate,
    size = 0.125
  ) +
  coord_proj("+proj=eqearth") +
  labs(
    x = NULL, y = NULL,
    title = "Equal Earth Projection (+proj=eqearth)"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank())

to get:

Remember, friends don't let friends use Mercator.

Despite having sailed through the core components of this year’s Talk Like A Pirate Day R post a few months ago, time has been an enemy of late, so this will be a short post that others can build off of, especially since there’s lots more ground to cover in the data.

DMC-WhAt?

Since this is TLAPD, I’ll pilfer some of the explanation from GitHub itself:

The Digital Millennium Copyright Act (DMCA) “provides a safe harbor for service providers that host user-generated content. Since even a single claim of copyright infringement can carry statutory damages of up to $150,000, the possibility of being held liable for user-generated content could be very harmful for service providers. With potential damages multiplied across millions of users, cloud-computing and user-generated content sites like YouTube, Facebook, or GitHub probably never would have existed without the DMCA (or at least not without passing some of that cost downstream to their users).”

“The DMCA addresses this issue by creating a copyright liability safe harbor for internet service providers hosting allegedly infringing user-generated content. Essentially, so long as a service provider follows the DMCA’s notice-and-takedown rules, it won’t be liable for copyright infringement based on user-generated content. Because of this, it is important for GitHub to maintain its DMCA safe-harbor status.”

(I’ll save you from a long fact- and opinion-based diatribe on the DMCA, but suffice it to say it’s done far more harm than good IMO. Also, hopefully the “piracy” connection makes sense, now :-)

If your initial reaction was “What does the DMCA have to do with GitHub?” it likely (quickly) turned to “Oh…GitHub is really just a version-controlled file sharing service…”. As such it has to have a robust takedown policy and process.

I don’t know if Microsoft is going to keep the practice of being open about DMCA requests now that they own GitHub nor do I know if they’ll use the same process on themselves (since, as we’ll see, they have issued DMCA requests to GitHub in the past). For now, we’ll assume they will, thus making the code from this post usable in the future to check on the status of DMCA requests over a longer period of time. But first we need the data.

Hunting for treasure in the data hoard

Unsurprisingly, GitHub stores DMCA data on GitHub. Ironically, they store it openly — in-part — to shine a light on what giant, global megacorps like Microsoft are doing. Feel free to use one of the many R packages to clone the repo, but a simple command-line git clone git@github.com:github/dmca.git is quick and efficient (not everything needs to be done from R).

The directory structure looks like this:

├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
├── 2017-02-01-RBoyApps-2.md
├── 2017-02-15-DeutscheBank.md
├── 2017-03-13-Jetbrains.md
├── 2017-06-26-Wipro-Counternotice.md
├── 2017-06-30-AdflyLink.md
├── 2017-07-28-Toontown-2.md
├── 2017-08-31-Tourzan.md
├── 2017-09-04-Random-House.md
├── 2017-09-05-RandomHouse-2.md
├── 2017-09-18-RandomHouse.md
├── 2017-09-19-Ragnarok.md
├── 2017-10-10-Broadcom.md
├── 2018
├── 2018-02-01-NihonAdSystems.md
├── 2018-03-03-TuneIn.md
├── 2018-03-16-Wabg.md
├── 2018-05-17-Packt.md
├── 2018-06-12-Suning.md
├── 2018-07-31-Pearson.md
├── CONTRIBUTING.md
├── data
└── README.md

Unfortunately, the data directory contains fools’ gold (it’s just high-level summary data).

We want DMCA filer names, repo names, file names and the DMCA notice text (though we’ll be leaving NLP projects up to the intrepid readers). For that, it will mean processing the directories of notices.

Notices are named (sadly, with some inconsistency) like this: 2018-03-15-Microsoft.md. Year, month, date and name of org. The contents are text-versions of correspondence (usually email text) that have some requirements in order to be processed. There’s also an online form one can fill out but it’s pretty much a free text field with some semblance of structure. It’s up to humans to follow that structure and — as such — there is inconsistency in the text as well. (Perhaps this is a great lesson that non-constrained inputs and human-originated filenames aren’t a great plan for curating data stores.)

You may have seen what look like takedown files in the top level of the repo. I have no idea if they are legit (since they aren’t in the structured directories) so we’ll be ignoring them.

When I took a look at the directories, some files end in .markdown but most end in .md. We’ll cover both instances (you’ll need to replace /data/github/dmca with the prefix where you stored the repo):

library(tools)
library(stringi)
library(hrbrthemes)
library(tidyverse)

list.files(
  path = sprintf("/data/github/dmca/%s", 2011:2018), 
  pattern = "\\.md$|\\.markdown$",
  full.names = TRUE
) -> dmca_files

As noted previously, we’re going to focus on DMCA views over time, look at organizations who filed DMCA notices and the notice content. It turns out the filenames also distinguish whether a notice is a takedown request or a counter-notice (i.e. an “oops…my bad…” by a takedown originator) or a retraction, so we’ll collect that metadata as well. Finally, we’ll slurp up the text along the way.

Again, I’ve taken a pass at this and found out the following:

  • Some dates are coded incorrectly (infrequently enough to be able to use some causal rules to fix)
  • Some org names are coded incorrectly (often enough to skew counts, so we need to deal with it)
  • Counter-notice and retraction tags are inconsistent, so we need to deal with that as well

It’s an ugly pipeline, so I’ve annotated these initial steps to make what’s going on a bit clearer:

map_df(dmca_files, ~{
  
  file_path_sans_ext(.x) %>% # remove extension
    basename() %>% # get just the filename
    stri_match_all_regex(
      "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)" # try to find the date and the org
    ) %>% 
    unlist() -> date_org
  
  if (is.na(date_org[2])) { # handle a special case where the date pattern above didn't work
    file_path_sans_ext(.x) %>% 
      basename() %>%
      stri_match_all_regex(
        "([[:digit:]]{4}-[[:digit:]]{2})-(.*)"
      ) %>% 
      unlist() -> date_org
  }
  
  # a few files are still broken so we'll deal with them as special cases
  
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    date_org <- c("", "2017-11-06", "1776")
  } else if (stri_detect_fixed(.x, "2017/2017-Offensive-Security-7.md")) {
    date_org <- c("", "2017-12-30", "Offensive-Security-7")
  } else if (stri_detect_fixed(.x, "2017/Offensive-Security-6.md")) {
    date_org <- c("", "2017-12-29", "Offensive-Security-6")
  }
  
  # we used a somewhat liberal regex to capture dates since some are 
  # still broken. We'll deal with those first, then turn them
  # into proper Date objects
  
  list(
    notice_day = case_when(
      date_org[2] == "2015-12-3"  ~ "2015-12-03",
      date_org[2] == "2015-12-7"  ~ "2015-12-07",
      date_org[2] == "2016-08"    ~ "2016-08-01",
      date_org[2] == "2016-10-7"  ~ "2016-10-07",
      date_org[2] == "2016-11-1"  ~ "2016-11-01",
      date_org[2] == "2016-11-3"  ~ "2016-11-03",
      date_org[2] == "2017-06"    ~ "2017-06-01",
      date_org[2] == "0107-05-22" ~ "2017-05-22",
      date_org[2] == "2017-11-1"  ~ "2017-11-01",
      TRUE ~ date_org[2]
    ) %>% 
      lubridate::ymd(),
    notice_org = date_org[3] %>% # sometimes the org name is messed up so we need to clean it up
      stri_replace_last_regex("[-]*[[:digit:]]+$", "") %>% 
      stri_replace_all_fixed("-", " "),
    notice_content = list(read_lines(.x)) # grab the content
  ) -> ret
  
  # and there are still some broken org names
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    ret$notice_org <- "1776"
  } 
  
  ret
  
}) -> dmca

dmca
## # A tibble: 4,460 x 3
##    notice_day notice_org                   notice_content
##    <date>     <chr>                        <list>
##  1 2011-01-27 sony                             
##  2 2011-01-28 tera                             
##  3 2011-01-31 sony                             
##  4 2011-02-03 sony counternotice                
##  5 2011-02-03 sony                          
##  6 2011-03-24 oracle                            
##  7 2011-03-30 mentor graphics                  
##  8 2011-05-24 cpp virtual world operations     
##  9 2011-06-07 sony                             
## 10 2011-06-13 diablominer                      
## # ... with 4,450 more rows

Much better. We’ve got more deck-swabbing to do, now, to tag the counter-notices and retractions:

mutate(
  dmca,
  counter_notice = stri_detect_regex(notice_org, "counternotice|counter notice"), # regex (not fixed) so the alternation works
  retraction = stri_detect_fixed(notice_org, "retraction"), 
  notice_org = stri_trans_tolower(notice_org) %>% 
    stri_replace_first_regex("\ *(counternotice|counter notice)\ *", "") %>% # clean up org names with tags
    stri_replace_first_regex("\ *retraction\ *", "")
) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##    <date>     <chr>             <list>         <lgl>          <lgl>
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

I’ve lower-cased the org names to make it easier to wrangle them since we do, indeed, need to wrangle them.

I’m super-not-proud of the following code block, but I went into it thinking the org name corrections would be infrequent. But, as I worked with the supposedly-cleaned data, I kept adding correction rules and eventually created a monster:

mutate(
  dmca,
  notice_org = case_when(
    stri_detect_fixed(notice_org, "accenture")        ~ "accenture",
    stri_detect_fixed(notice_org, "adobe")            ~ "adobe",
    stri_detect_fixed(notice_org, "amazon")           ~ "amazon",
    stri_detect_fixed(notice_org, "ansible")          ~ "ansible",
    stri_detect_fixed(notice_org, "aspengrove")       ~ "aspengrove",
    stri_detect_fixed(notice_org, "apple")            ~ "apple",
    stri_detect_fixed(notice_org, "aws")              ~ "aws",
    stri_detect_fixed(notice_org, "blizzard")         ~ "blizzard",
    stri_detect_fixed(notice_org, "o reilly")         ~ "oreilly",
    stri_detect_fixed(notice_org, "random")           ~ "random house",
    stri_detect_fixed(notice_org, "casado")           ~ "casadocodigo",
    stri_detect_fixed(notice_org, "ccp")              ~ "ccp",
    stri_detect_fixed(notice_org, "cisco")            ~ "cisco",
    stri_detect_fixed(notice_org, "cloudsixteen")     ~ "cloud sixteen",
    stri_detect_fixed(notice_org, "collinsharper")    ~ "collins ’harper",
    stri_detect_fixed(notice_org, "contentanalytics") ~ "content analytics",
    stri_detect_fixed(notice_org, "packt")            ~ "packt",
    stri_detect_fixed(notice_org, "penguin")          ~ "penguin",
    stri_detect_fixed(notice_org, "wiley")            ~ "wiley",
    stri_detect_fixed(notice_org, "wind river")       ~ "windriver",
    stri_detect_fixed(notice_org, "windriver")        ~ "windriver",
    stri_detect_fixed(notice_org, "wireframe")        ~ "wireframe shader",
    stri_detect_fixed(notice_org, "listen")           ~ "listen",
    stri_detect_fixed(notice_org, "wpecommerce")      ~ "wpecommerce",
    stri_detect_fixed(notice_org, "yahoo")            ~ "yahoo",
    stri_detect_fixed(notice_org, "youtube")          ~ "youtube",
    stri_detect_fixed(notice_org, "x pressive")       ~ "xpressive",
    stri_detect_fixed(notice_org, "ximalaya")         ~ "ximalaya",
    stri_detect_fixed(notice_org, "pragmatic")        ~ "pragmatic",
    stri_detect_fixed(notice_org, "evadeee")          ~ "evadeee",
    stri_detect_fixed(notice_org, "iaai")             ~ "iaai",
    stri_detect_fixed(notice_org, "line corp")        ~ "line corporation",
    stri_detect_fixed(notice_org, "mediumrare")       ~ "medium rare",
    stri_detect_fixed(notice_org, "profittrailer")    ~ "profit trailer",
    stri_detect_fixed(notice_org, "smartadmin")       ~ "smart admin",
    stri_detect_fixed(notice_org, "microsoft")        ~ "microsoft",
    stri_detect_fixed(notice_org, "monotype")         ~ "monotype",
    stri_detect_fixed(notice_org, "qualcomm")         ~ "qualcomm",
    stri_detect_fixed(notice_org, "pearson")          ~ "pearson",
    stri_detect_fixed(notice_org, "sony")             ~ "sony",
    stri_detect_fixed(notice_org, "oxford")           ~ "oxford",
    stri_detect_fixed(notice_org, "oracle")           ~ "oracle",
    stri_detect_fixed(notice_org, "out fit")          ~ "outfit",
    stri_detect_fixed(notice_org, "nihon")            ~ "nihon",
    stri_detect_fixed(notice_org, "opencv")           ~ "opencv",
    stri_detect_fixed(notice_org, "newsis")           ~ "newsis",
    stri_detect_fixed(notice_org, "nostarch")         ~ "nostarch",
    stri_detect_fixed(notice_org, "stardog")          ~ "stardog",
    stri_detect_fixed(notice_org, "mswindows")        ~ "microsoft",
    stri_detect_fixed(notice_org, "moody")            ~ "moody",
    stri_detect_fixed(notice_org, "minecraft")        ~ "minecraft",
    stri_detect_fixed(notice_org, "medinasoftware")   ~ "medina software",
    stri_detect_fixed(notice_org, "linecorporation")  ~ "line corporation",
    stri_detect_fixed(notice_org, "steroarts")        ~ "stereoarts",
    stri_detect_fixed(notice_org, "mathworks")        ~ "mathworks",
    stri_detect_fixed(notice_org, "tmssoftware")      ~ "tmssoftware",
    stri_detect_fixed(notice_org, "toontown")         ~ "toontown",
    stri_detect_fixed(notice_org, "wahoo")            ~ "wahoo",
    stri_detect_fixed(notice_org, "webkul")           ~ "webkul",
    stri_detect_fixed(notice_org, "whmcs")            ~ "whmcs",
    stri_detect_fixed(notice_org, "viber")            ~ "viber",
    stri_detect_fixed(notice_org, "totalfree")        ~ "totalfreedom",
    stri_detect_fixed(notice_org, "successacademies") ~ "success academies",
    stri_detect_fixed(notice_org, "ecgwaves")         ~ "ecgwaves",
    stri_detect_fixed(notice_org, "synology")         ~ "synology",
    stri_detect_fixed(notice_org, "infistar")         ~ "infistar’",
    stri_detect_fixed(notice_org, "galleria")         ~ "galleria",
    stri_detect_fixed(notice_org, "jadoo")            ~ "jadoo",
    stri_detect_fixed(notice_org, "dofustouch")       ~ "dofus touch",
    stri_detect_fixed(notice_org, "gravityforms")     ~ "gravity forms",
    stri_detect_fixed(notice_org, "fujiannewland")    ~ "fujian newland",
    stri_detect_fixed(notice_org, "dk uk")            ~ "dk",
    stri_detect_fixed(notice_org, "dk us")            ~ "dk",
    stri_detect_fixed(notice_org, "dkuk")             ~ "dk",
    stri_detect_fixed(notice_org, "dkus")             ~ "dk",
    stri_detect_fixed(notice_org, "facet")            ~ "facet",
    stri_detect_fixed(notice_org, "fh admin")         ~ "fhadmin",
    stri_detect_fixed(notice_org, "electronicarts")   ~ "electronic arts",
    stri_detect_fixed(notice_org, "daikonforge")      ~ "daikon forge",
    stri_detect_fixed(notice_org, "corgiengine")      ~ "corgi engine",
    stri_detect_fixed(notice_org, "epicgames")        ~ "epic  games",
    stri_detect_fixed(notice_org, "essentialmode")    ~ "essentialmode",
    stri_detect_fixed(notice_org, "jetbrains")        ~ "jetbrains",
    stri_detect_fixed(notice_org, "foxy")             ~ "foxy themes",
    stri_detect_fixed(notice_org, "cambridgemobile")  ~ "cambridge mobile",
    stri_detect_fixed(notice_org, "offensive")        ~ "offensive security",
    stri_detect_fixed(notice_org, "outfit")           ~ "outfit",
    stri_detect_fixed(notice_org, "haihuan")          ~ "shanghai haihuan",
    stri_detect_fixed(notice_org, "schuster")         ~ "simon & schuster",
    stri_detect_fixed(notice_org, "silicon")          ~ "silicon labs",
    TRUE ~ notice_org
  )) %>% 
  arrange(notice_day) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##    <date>     <chr>             <list>         <lgl>          <lgl>
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

You are heartily encouraged to create a translation table in place of that monstrosity.
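One (sketchy, only partially filled-in) way to do that is a small lookup table of regex patterns and canonical names that you match against, rather than a wall of case_when() branches:

org_xlat <- tribble(
  ~pattern,               ~org_clean,
  "accenture",            "accenture",
  "adobe",                "adobe",
  "random",               "random house",
  "wind ?river",          "windriver",
  "mswindows|microsoft",  "microsoft"
  # ... extend with the remaining rules from above
)

fix_org <- function(org) { # return the first matching canonical name, else the original
  hit <- purrr::detect_index(org_xlat$pattern, ~stri_detect_regex(org, .x))
  if (hit == 0) org else org_xlat$org_clean[hit]
}

dmca <- mutate(dmca, notice_org = map_chr(notice_org, fix_org))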

But, we finally have usable data. You can avoid the above by downloading https://rud.is/dl/github-dmca.json.gz and using jsonlite::stream_in() or ndjson::stream_in() to get the above data frame.

Hoisting the mizzen sailplots

Let’s see what the notice submission frequency looks like over time:

# assuming you downloaded it as suggested
jsonlite::stream_in(gzfile("~/Data/github-dmca.json.gz")) %>% 
  tbl_df() %>% 
  mutate(notice_day = as.Date(notice_day)) -> dmca

filter(dmca, !retraction) %>% 
  mutate(
    notice_year = lubridate::year(notice_day),
    notice_ym = as.Date(format(notice_day, "%Y-%m-01"))
  ) %>% 
  dplyr::count(notice_ym) %>% 
  arrange(notice_ym) %>% 
  ggplot(aes(notice_ym, n)) +
  ggalt::stat_xspline(
    geom="area", fill=alpha(ft_cols$blue, 1/3), color=ft_cols$blue
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Notices", 
    title = "GitHub DMCA Notices by Month Since 2011"
  ) +
  theme_ft_rc(grid="XY")

I’m not naive, but that growth was a bit of a shocker, which made me want to jump in and see who the top-filers were:

count(dmca, notice_org, sort=TRUE)
## # A tibble: 1,948 x 2
##    notice_org             n
##    <chr>              <int>
##  1 webkul                92
##  2 pearson               90
##  3 stereoarts            86
##  4 qualcomm              72
##  5 codility              71
##  6 random house          62
##  7 outfit                57
##  8 offensive security    49
##  9 sensetime             46
## 10 penguin               44
## # ... with 1,938 more rows

“Webkul” is an enterprise eCommerce platform (I kinda miss all the dashed “e-” prefixes we used to use back in the day). I mention that since I didn’t know what it was either. There are some recognizable names there like “Pearson”, “Random House” and “Penguin”, which make sense since it’s easy to improperly share e-books (modern non-dashed idioms be darned).

Let’s see the top 15 orgs by year since 2015 (since that’s when DMCA filings really started picking up and because I like 2×2 grids). We’ll also leave out counter-notices and retractions and alpha-order it since I want to be able to scan the names more than I want to see rank:

filter(dmca, !retraction, !counter_notice, notice_day >= as.Date("2015-01-01")) %>%
  mutate(
    notice_year = lubridate::year(notice_day),
  ) %>% 
  dplyr::count(notice_year, notice_org) %>% 
  group_by(notice_year) %>% 
  top_n(15) %>% 
  slice(1:15) %>% 
  dplyr::ungroup() %>%
  mutate( # a-z order with "a" on top 
    notice_org = factor(notice_org, levels = unique(sort(notice_org, decreasing = TRUE)))
  ) %>% 
  ggplot(aes(n, notice_org, xend=0, yend=notice_org)) +
  geom_segment(size = 2, color = ft_cols$peach) +
  facet_wrap(~notice_year, scales = "free") +
  scale_x_comma(limits=c(0, 60)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 15 GitHub DMCA Filers by Year Since 2015"
  ) +
  theme_ft_rc(grid="X")

Let’s look at the rogues’ gallery of the pirates themselves:

dmca %>% 
  mutate(
    ghusers = notice_content %>% 
      map(~{
        stri_match_all_regex(.x, "http[s]*://github.com/([^/]+)/.*") %>% 
          discard(~is.na(.x[,1])) %>% 
          map_chr(~.x[,2]) %>% 
          unique() %>% 
          discard(`==`, "github") %>% 
          discard(~grepl(" ", .x))
      })
  ) %>% 
  unnest(ghusers) %>% 
  dplyr::count(ghusers, sort=TRUE) %>% 
  print() -> offenders
## # A tibble: 18,396 x 2
##    ghusers           n
##    <chr>         <int>
##  1 RyanTech         16
##  2 sdgdsffdsfff     12
##  3 gamamaru6005     10
##  4 ranrolls         10
##  5 web-padawan      10
##  6 alexinfopruna     8
##  7 cyr2242           8
##  8 liveqmock         8
##  9 promosirupiah     8
## 10 RandyMcMillan     8
## # ... with 18,386 more rows

As you might expect, most users have only one or two complaints filed against them since it was likely an oversight more than malice on their part:

ggplot(offenders, aes(x="", n)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(breaks=1:16, limits=c(1,16)) +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of GitHub DMCA Complaints Received by a User"
  ) +
  theme_ft_rc(grid="X")

But, there are hundreds of digital buccaneers, and we can have a bit of fun with them especially since I noticed quite a few had default (generated) avatars with lots of white in them (presenting this with a pirate hat-tip to Maëlle & Lucy):

library(magick)

dir.create("gh-pirates")
dir.create("gh-pirates-jpeg")

# this kinda spoils the surprise; i should have renamed it
download.file("https://rud.is/dl/jolly-roger.jpeg", "jolly-roger.jpeg")

ghs <- safely(gh::gh) # no need to add cruft to our namespace for one function 

filter(offenders, n>2) %>% 
  pull(ghusers) %>% 
  { .pb <<- progress_estimated(length(.)); . } %>% # there are a few hundred of them
  walk(~{
    .pb$tick()$print()
    user <- ghs(sprintf("/users/%s", .x))$result # the get-user and then download avatar idiom shld help us not bust GH API rate limits
    if (!is.null(user)) {
      download.file(user$avatar_url, file.path("gh-pirates", .x), quiet=TRUE) # can't assume avatar file type
    }
  })

# we'll convert them all to jpeg and resize them at the same time plus make sure they aren't greyscale
dir.create("gh-pirates-jpeg")
list.files("gh-pirates", full.names = TRUE, recursive = FALSE) %>%
  walk(~{
    image_read(.x) %>% 
      image_scale("72x72") %>% 
      image_convert("jpeg", type = "TrueColor", colorspace = "rgb") %>% 
      image_write(
        path = file.path("gh-pirates-jpeg", sprintf("%s.jpeg", basename(.x))), 
        format = "jpeg"
      )
  })

set.seed(20180919) # seemed appropriate for TLAPD
RsimMosaic::composeMosaicFromImageRandomOptim( # this takes a bit
  originalImageFileName = "jolly-roger.jpeg",
  outputImageFileName = "gh-pirates-flag.jpeg",
  imagesToUseInMosaic = "gh-pirates-jpeg",
  removeTiles = TRUE,
  fracLibSizeThreshold = 0.1
)

Finally, we’ll look at the types of pilfered files. To do that, we’ll first naively look for github repo URLs (there are github.io ones in there too, though, which is an exercise left to ye corsairs):

# NOTE: gh_url_pattern is not defined earlier in the post; this is an assumed stand-in
# regex for GitHub repo/file URLs (adjust as needed)
gh_url_pattern <- "https?://github\\.com/[^[:space:]\"'<>\\)\\]]+"

mutate(
  dmca,
  files = notice_content %>% 
    map(~{
      paste0(.x, collapse = " ") %>% 
        stri_extract_all_regex(gh_url_pattern, omit_no_match=FALSE, opts_regex = stri_opts_regex(TRUE)) %>% 
        unlist() %>% 
        stri_replace_last_regex("[[:punct:]]+$", "")
    })
) -> dmca_with_files

Now, we can see just how many resources/repos/files are in a complaint:

filter(dmca_with_files, map_lgl(files, ~!is.na(.x[1]))) %>% 
  select(notice_day, notice_org, files) %>% 
  mutate(num_refs = lengths(files)) %>%
  arrange(desc(num_refs)) %>%  # take a peek at the heavy hitters
  print() -> files_with_counts
## # A tibble: 4,020 x 4
##    notice_day notice_org files         num_refs
##    <date>     <chr>      <list>           <int>
##  1 2014-08-27 monotype        2504
##  2 2011-02-03 sony            1160
##  3 2016-06-08 monotype        1015
##  4 2018-04-05 hexrays            906
##  5 2016-06-15 ibo                877
##  6 2016-08-18 jetbrains          777
##  7 2017-10-14 cengage            611
##  8 2016-08-23 yahoo              556
##  9 2017-08-30 altis              529
## 10 2015-09-22 jetbrains          468
## # ... with 4,010 more rows

ggplot(files_with_counts, aes(x="", num_refs)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(trans="log10") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of Files/Repos per-GitHub DMCA Complaint",
    caption = "Note: Log10 Scale"
  ) +
  theme_ft_rc(grid="X")

And, what are the most offensive file types (per-year):

mutate(
  files_with_counts, 
  extensions = map(files, ~tools::file_ext(.x) %>% 
    discard(`==` , "")
  )
) %>% 
  select(notice_day, notice_org, extensions) %>% 
  unnest(extensions) %>% 
  mutate(year = lubridate::year(notice_day)) -> file_types

count(file_types, year, extensions) %>% 
  filter(year >= 2014) %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  slice(1:10) %>% 
  ungroup() %>% 
  ggplot(aes(year, n)) +
  ggrepel::geom_text_repel(
    aes(label = extensions, size=n), 
    color = ft_cols$green, family=font_ps, show.legend=FALSE
  ) +
  scale_size(range = c(3, 10)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 10 File-type GitHub DMCA Takedowns Per-year"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y=element_blank())

It’s not all code (lots of fonts and books) but there are plenty of source code files in those annual lists.
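To eyeball that split, here’s one rough cut (the extension buckets are my own guesses, nothing official):

mutate(
  file_types,
  kind = case_when(
    extensions %in% c("ttf", "otf", "woff", "woff2", "eot")          ~ "font",
    extensions %in% c("pdf", "epub", "mobi", "azw3")                 ~ "book/doc",
    extensions %in% c("js", "php", "py", "java", "c", "cpp", "cs",
                      "rb", "go", "html", "css")                     ~ "code",
    TRUE                                                             ~ "other"
  )
) %>%
  count(kind, sort = TRUE)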

FIN

That’s it for this year’s TLAPD post. You’ve got the data and some starter code so build away! There are plenty more insights left to find and if you do take a stab at finding your own treasure, definitely leave a note in the comments.