
Category Archives: TLAPD

Despite having sailed through the core components of this year’s Talk Like A Pirate Day R post a few months ago, time has been an enemy of late, so this will be a short post that others can build off of, especially since there’s lots more knife work to do on the data.

DMC-WhAt?

Since this is TLAPD, I’ll pilfer some of the explanation from GitHub itself:

The Digital Millennium Copyright Act (DMCA) “provides a safe harbor for service providers that host user-generated content. Since even a single claim of copyright infringement can carry statutory damages of up to $150,000, the possibility of being held liable for user-generated content could be very harmful for service providers. With potential damages multiplied across millions of users, cloud-computing and user-generated content sites like YouTube, Facebook, or GitHub probably never would have existed without the DMCA (or at least not without passing some of that cost downstream to their users).”

“The DMCA addresses this issue by creating a copyright liability safe harbor for internet service providers hosting allegedly infringing user-generated content. Essentially, so long as a service provider follows the DMCA’s notice-and-takedown rules, it won’t be liable for copyright infringement based on user-generated content. Because of this, it is important for GitHub to maintain its DMCA safe-harbor status.”

(I’ll save you from a long fact- and opinion-based diatribe on the DMCA, but suffice it to say it’s done far more harm than good IMO. Also, hopefully the “piracy” connection makes sense, now :-)

If your initial reaction was “What does the DMCA have to do with GitHub?” it likely (quickly) turned to “Oh…GitHub is really just a version-controlled file sharing service…”. As such it has to have a robust takedown policy and process.

I don’t know if Microsoft is going to keep the practice of being open about DMCA requests now that they own GitHub, nor do I know if they’ll use the same process on themselves (since, as we’ll see, they have issued DMCA requests to GitHub in the past). For now, we’ll assume they will, thus making the code from this post usable in the future to check on the status of DMCA requests over a longer period of time. But first we need the data.

Hunting for treasure in the data hoard

Unsurprisingly, GitHub stores DMCA data on GitHub. Ironically, they store it openly — in-part — to shine a light on what giant, global megacorps like Microsoft are doing. Feel free to use one of the many R packages to clone the repo, but a simple command-line git clone git@github.com:github/dmca.git is quick and efficient (not everything needs to be done from R).
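
If you’d rather stay in R for that step anyway, here’s a minimal sketch using the git2r package (the local path is just an example; use whatever prefix you like):

library(git2r)

# clone the public DMCA repo over HTTPS into a local directory
git2r::clone(
  url        = "https://github.com/github/dmca.git",
  local_path = "/data/github/dmca"
)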

The directory structure looks like this:

├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
├── 2017-02-01-RBoyApps-2.md
├── 2017-02-15-DeutscheBank.md
├── 2017-03-13-Jetbrains.md
├── 2017-06-26-Wipro-Counternotice.md
├── 2017-06-30-AdflyLink.md
├── 2017-07-28-Toontown-2.md
├── 2017-08-31-Tourzan.md
├── 2017-09-04-Random-House.md
├── 2017-09-05-RandomHouse-2.md
├── 2017-09-18-RandomHouse.md
├── 2017-09-19-Ragnarok.md
├── 2017-10-10-Broadcom.md
├── 2018
├── 2018-02-01-NihonAdSystems.md
├── 2018-03-03-TuneIn.md
├── 2018-03-16-Wabg.md
├── 2018-05-17-Packt.md
├── 2018-06-12-Suning.md
├── 2018-07-31-Pearson.md
├── CONTRIBUTING.md
├── data
└── README.md

Unfortunately, the data directory contains fools’ gold (it’s just high-level summary data).

We want DMCA filer names, repo names, file names and the DMCA notice text (though we’ll be leaving NLP projects up to the intrepid readers). For that, it will mean processing the directories of notices.

Notices are named (sadly, with some inconsistency) like this: 2018-03-15-Microsoft.md. Year, month, date and name of org. The contents are text-versions of correspondence (usually email text) that have some requirements in order to be processed. There’s also an online form one can fill out but it’s pretty much a free text field with some semblance of structure. It’s up to humans to follow that structure and — as such — there is inconsistency in the text as well. (Perhaps this is a great lesson that non-constrained inputs and human-originated filenames aren’t a great plan for curating data stores.)
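
As a quick illustration of that naming scheme, a single well-formed filename (extension already stripped) decomposes with the same regex idiom used in the pipeline below:

library(stringi)

stri_match_first_regex(
  "2018-03-15-Microsoft",
  "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)"
)
# -> the full match, then "2018-03-15" (the date) and "Microsoft" (the org)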

You may have seen what look like takedown files in the top level of the repo. I have no idea if they are legit (since they aren’t in the structured directories) so we’ll be ignoring them.

When I took a look at the directories, some files end in .markdown but most end in .md. We’ll cover both cases (you’ll need to replace /data/github/dmca with the prefix where you stored the repo):

library(tools)
library(stringi)
library(hrbrthemes)
library(tidyverse)

list.files(
  path = sprintf("/data/github/dmca/%s", 2011:2018), 
  pattern = "\\.md$|\\.markdown$",
  full.names = TRUE
) -> dmca_files

As noted previously, we’re going to focus on DMCA views over time, look at organizations who filed DMCA notices and the notice content. It turns out the filenames also distinguish whether a notice is a takedown request or a counter-notice (i.e. an “oops…my bad…” by a takedown originator) or a retraction, so we’ll collect that metadata as well. Finally, we’ll slurp up the text along the way.

Again, I’ve taken a pass at this and found out the following:

  • Some dates are coded incorrectly (infrequently enough to fix with a handful of case-by-case rules)
  • Some org names are coded incorrectly (often enough to skew counts, so we need to deal with it)
  • Counter-notice and retraction tags are inconsistent, so we need to deal with that as well

It’s an ugly pipeline, so I’ve annotated these initial steps to make what’s going on a bit clearer:

map_df(dmca_files, ~{
  
  file_path_sans_ext(.x) %>% # remove extension
    basename() %>% # get just the filename
    stri_match_all_regex(
      "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)" # try to find the date and the org
    ) %>% 
    unlist() -> date_org
  
  if (is.na(date_org[2])) { # handle a special case where the date pattern above didn't work
    file_path_sans_ext(.x) %>% 
      basename() %>%
      stri_match_all_regex(
        "([[:digit:]]{4}-[[:digit:]]{2})-(.*)"
      ) %>% 
      unlist() -> date_org
  }
  
  # a few files are still broken so we'll deal with them as special cases
  
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    date_org <- c("", "2017-11-06", "1776")
  } else if (stri_detect_fixed(.x, "2017/2017-Offensive-Security-7.md")) {
    date_org <- c("", "2017-12-30", "Offensive-Security-7")
  } else if (stri_detect_fixed(.x, "2017/Offensive-Security-6.md")) {
    date_org <- c("", "2017-12-29", "Offensive-Security-6")
  }
  
  # we used a somewhat liberal regex to capture dates since some are 
  # still broken. We'll deal with those first, then turn them
  # into proper Date objects
  
  list(
    notice_day = case_when(
      date_org[2] == "2015-12-3"  ~ "2015-12-03",
      date_org[2] == "2015-12-7"  ~ "2015-12-07",
      date_org[2] == "2016-08"    ~ "2016-08-01",
      date_org[2] == "2016-10-7"  ~ "2016-10-07",
      date_org[2] == "2016-11-1"  ~ "2016-11-01",
      date_org[2] == "2016-11-3"  ~ "2016-11-03",
      date_org[2] == "2017-06"    ~ "2017-06-01",
      date_org[2] == "0107-05-22" ~ "2017-05-22",
      date_org[2] == "2017-11-1"  ~ "2017-11-01",
      TRUE ~ date_org[2]
    ) %>% 
      lubridate::ymd(),
    notice_org = date_org[3] %>% # sometimes the org name is messed up so we need to clean it up
      stri_replace_last_regex("[-]*[[:digit:]]+$", "") %>% 
      stri_replace_all_fixed("-", " "),
    notice_content = list(read_lines(.x)) # grab the content
  ) -> ret
  
  # and there are still some broken org names
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    ret$notice_org <- "1776"
  } 
  
  ret
  
}) -> dmca

dmca
## # A tibble: 4,460 x 3
##    notice_day notice_org                   notice_content
##    <date>     <chr>                        <list>
##  1 2011-01-27 sony                             
##  2 2011-01-28 tera                             
##  3 2011-01-31 sony                             
##  4 2011-02-03 sony counternotice                
##  5 2011-02-03 sony                          
##  6 2011-03-24 oracle                            
##  7 2011-03-30 mentor graphics                  
##  8 2011-05-24 cpp virtual world operations     
##  9 2011-06-07 sony                             
## 10 2011-06-13 diablominer                      
## # ... with 4,450 more rows

Much better. We’ve got more deck-swabbing to do, now, to tag the counter-notice and retractions:

mutate(
  dmca,
  counter_notice = stri_detect_regex(notice_org, "counter ?notice", case_insensitive = TRUE), # tags are inconsistent ("counternotice" & "counter notice")
  retraction = stri_detect_regex(notice_org, "retraction", case_insensitive = TRUE), 
  notice_org = stri_trans_tolower(notice_org) %>% 
    stri_replace_first_regex(" *counter ?notice *", "") %>% # clean up org names with tags
    stri_replace_first_regex(" *retraction *", "")
) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##    <date>     <chr>             <list>         <lgl>          <lgl>
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

I’ve lower-cased the org names to make it easier to wrangle them since we do, indeed, need to wrangle them.

I’m super-not-proud of the following code block, but I went into it thinking the org name corrections would be infrequent. But, as I worked with the supposedly-cleaned data, I kept adding correction rules and eventually created a monster:

mutate(
  dmca,
  notice_org = case_when(
    stri_detect_fixed(notice_org, "accenture")        ~ "accenture",
    stri_detect_fixed(notice_org, "adobe")            ~ "adobe",
    stri_detect_fixed(notice_org, "amazon")           ~ "amazon",
    stri_detect_fixed(notice_org, "ansible")          ~ "ansible",
    stri_detect_fixed(notice_org, "aspengrove")       ~ "aspengrove",
    stri_detect_fixed(notice_org, "apple")            ~ "apple",
    stri_detect_fixed(notice_org, "aws")              ~ "aws",
    stri_detect_fixed(notice_org, "blizzard")         ~ "blizzard",
    stri_detect_fixed(notice_org, "o reilly")         ~ "oreilly",
    stri_detect_fixed(notice_org, "random")           ~ "random house",
    stri_detect_fixed(notice_org, "casado")           ~ "casadocodigo",
    stri_detect_fixed(notice_org, "ccp")              ~ "ccp",
    stri_detect_fixed(notice_org, "cisco")            ~ "cisco",
    stri_detect_fixed(notice_org, "cloudsixteen")     ~ "cloud sixteen",
    stri_detect_fixed(notice_org, "collinsharper")    ~ "collins ’harper",
    stri_detect_fixed(notice_org, "contentanalytics") ~ "content analytics",
    stri_detect_fixed(notice_org, "packt")            ~ "packt",
    stri_detect_fixed(notice_org, "penguin")          ~ "penguin",
    stri_detect_fixed(notice_org, "wiley")            ~ "wiley",
    stri_detect_fixed(notice_org, "wind river")       ~ "windriver",
    stri_detect_fixed(notice_org, "windriver")        ~ "windriver",
    stri_detect_fixed(notice_org, "wireframe")        ~ "wireframe shader",
    stri_detect_fixed(notice_org, "listen")           ~ "listen",
    stri_detect_fixed(notice_org, "wpecommerce")      ~ "wpecommerce",
    stri_detect_fixed(notice_org, "yahoo")            ~ "yahoo",
    stri_detect_fixed(notice_org, "youtube")          ~ "youtube",
    stri_detect_fixed(notice_org, "x pressive")       ~ "xpressive",
    stri_detect_fixed(notice_org, "ximalaya")         ~ "ximalaya",
    stri_detect_fixed(notice_org, "pragmatic")        ~ "pragmatic",
    stri_detect_fixed(notice_org, "evadeee")          ~ "evadeee",
    stri_detect_fixed(notice_org, "iaai")             ~ "iaai",
    stri_detect_fixed(notice_org, "line corp")        ~ "line corporation",
    stri_detect_fixed(notice_org, "mediumrare")       ~ "medium rare",
    stri_detect_fixed(notice_org, "profittrailer")    ~ "profit trailer",
    stri_detect_fixed(notice_org, "smartadmin")       ~ "smart admin",
    stri_detect_fixed(notice_org, "microsoft")        ~ "microsoft",
    stri_detect_fixed(notice_org, "monotype")         ~ "monotype",
    stri_detect_fixed(notice_org, "qualcomm")         ~ "qualcomm",
    stri_detect_fixed(notice_org, "pearson")          ~ "pearson",
    stri_detect_fixed(notice_org, "sony")             ~ "sony",
    stri_detect_fixed(notice_org, "oxford")           ~ "oxford",
    stri_detect_fixed(notice_org, "oracle")           ~ "oracle",
    stri_detect_fixed(notice_org, "out fit")          ~ "outfit",
    stri_detect_fixed(notice_org, "nihon")            ~ "nihon",
    stri_detect_fixed(notice_org, "opencv")           ~ "opencv",
    stri_detect_fixed(notice_org, "newsis")           ~ "newsis",
    stri_detect_fixed(notice_org, "nostarch")         ~ "nostarch",
    stri_detect_fixed(notice_org, "stardog")          ~ "stardog",
    stri_detect_fixed(notice_org, "mswindows")        ~ "microsoft",
    stri_detect_fixed(notice_org, "moody")            ~ "moody",
    stri_detect_fixed(notice_org, "minecraft")        ~ "minecraft",
    stri_detect_fixed(notice_org, "medinasoftware")   ~ "medina software",
    stri_detect_fixed(notice_org, "linecorporation")  ~ "line corporation",
    stri_detect_fixed(notice_org, "steroarts")        ~ "stereoarts",
    stri_detect_fixed(notice_org, "mathworks")        ~ "mathworks",
    stri_detect_fixed(notice_org, "tmssoftware")      ~ "tmssoftware",
    stri_detect_fixed(notice_org, "toontown")         ~ "toontown",
    stri_detect_fixed(notice_org, "wahoo")            ~ "wahoo",
    stri_detect_fixed(notice_org, "webkul")           ~ "webkul",
    stri_detect_fixed(notice_org, "whmcs")            ~ "whmcs",
    stri_detect_fixed(notice_org, "viber")            ~ "viber",
    stri_detect_fixed(notice_org, "totalfree")        ~ "totalfreedom",
    stri_detect_fixed(notice_org, "successacademies") ~ "success academies",
    stri_detect_fixed(notice_org, "ecgwaves")         ~ "ecgwaves",
    stri_detect_fixed(notice_org, "synology")         ~ "synology",
    stri_detect_fixed(notice_org, "infistar")         ~ "infistar’",
    stri_detect_fixed(notice_org, "galleria")         ~ "galleria",
    stri_detect_fixed(notice_org, "jadoo")            ~ "jadoo",
    stri_detect_fixed(notice_org, "dofustouch")       ~ "dofus touch",
    stri_detect_fixed(notice_org, "gravityforms")     ~ "gravity forms",
    stri_detect_fixed(notice_org, "fujiannewland")    ~ "fujian newland",
    stri_detect_fixed(notice_org, "dk uk")            ~ "dk",
    stri_detect_fixed(notice_org, "dk us")            ~ "dk",
    stri_detect_fixed(notice_org, "dkuk")             ~ "dk",
    stri_detect_fixed(notice_org, "dkus")             ~ "dk",
    stri_detect_fixed(notice_org, "facet")            ~ "facet",
    stri_detect_fixed(notice_org, "fh admin")         ~ "fhadmin",
    stri_detect_fixed(notice_org, "electronicarts")   ~ "electronic arts",
    stri_detect_fixed(notice_org, "daikonforge")      ~ "daikon forge",
    stri_detect_fixed(notice_org, "corgiengine")      ~ "corgi engine",
    stri_detect_fixed(notice_org, "epicgames")        ~ "epic  games",
    stri_detect_fixed(notice_org, "essentialmode")    ~ "essentialmode",
    stri_detect_fixed(notice_org, "jetbrains")        ~ "jetbrains",
    stri_detect_fixed(notice_org, "foxy")             ~ "foxy themes",
    stri_detect_fixed(notice_org, "cambridgemobile")  ~ "cambridge mobile",
    stri_detect_fixed(notice_org, "offensive")        ~ "offensive security",
    stri_detect_fixed(notice_org, "outfit")           ~ "outfit",
    stri_detect_fixed(notice_org, "haihuan")          ~ "shanghai haihuan",
    stri_detect_fixed(notice_org, "schuster")         ~ "simon & schuster",
    stri_detect_fixed(notice_org, "silicon")          ~ "silicon labs",
    TRUE ~ notice_org
  )) %>% 
  arrange(notice_day) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##    <date>     <chr>             <list>         <lgl>          <lgl>
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

You are heartily encouraged to create a translation table in place of that monstrosity.
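
One rough sketch of that translation-table approach: keep a small tibble of regex-to-canonical-name pairs and apply the first matching rule (the rows below are just a few illustrative examples, not the full mapping):

org_map <- tribble(
  ~pattern,              ~canonical,
  "random",              "random house",
  "microsoft|mswindows", "microsoft",
  "^dk ?u[ks]",          "dk",
  "wind ?river",         "windriver"
)

mutate(
  dmca,
  notice_org = map_chr(notice_org, ~{
    hit <- which(stri_detect_regex(.x, org_map$pattern)) # first rule that matches wins
    if (length(hit) > 0) org_map$canonical[hit[1]] else .x
  })
) -> dmca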

But, we finally have usable data. You can avoid the above by downloading https://rud.is/dl/github-dmca.json.gz and using jsonlite::stream_in() or ndjson::stream_in() to get the above data frame.

Hoisting the mizzen sailplots

Let’s see what the notice submission frequency looks like over time:

# assuming you downloaded it as suggested
jsonlite::stream_in(gzfile("~/Data/github-dmca.json.gz")) %>% 
  tbl_df() %>% 
  mutate(notice_day = as.Date(notice_day)) -> dmca

filter(dmca, !retraction) %>% 
  mutate(
    notice_year = lubridate::year(notice_day),
    notice_ym = as.Date(format(notice_day, "%Y-%m-01"))
  ) %>% 
  dplyr::count(notice_ym) %>% 
  arrange(notice_ym) %>% 
  ggplot(aes(notice_ym, n)) +
  ggalt::stat_xspline(
    geom="area", fill=alpha(ft_cols$blue, 1/3), color=ft_cols$blue
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Notices", 
    title = "GitHub DMCA Notices by Month Since 2011"
  ) +
  theme_ft_rc(grid="XY")

I’m not naive, but that growth was a bit of a shocker, which made me want to jump in and see who the top-filers were:

count(dmca, notice_org, sort=TRUE)
## # A tibble: 1,948 x 2
##    notice_org             n
##    <chr>              <int>
##  1 webkul                92
##  2 pearson               90
##  3 stereoarts            86
##  4 qualcomm              72
##  5 codility              71
##  6 random house          62
##  7 outfit                57
##  8 offensive security    49
##  9 sensetime             46
## 10 penguin               44
## # ... with 1,938 more rows

“Webkul” is an enterprise eCommerce platform (I kinda miss all the dashed “e-” prefixes we used to use back in the day). I mention that since I didn’t know what it was either. There are some recognizable names there like “Pearson”, “Random House” and “Penguin”, which make sense since it’s easy to improperly share e-books (modern non-dashed idioms be darned).

Let’s see the top 15 orgs by year since 2015 (since that’s when DMCA filings really started picking up and because I like 2×2 grids). We’ll also leave out counter-notices and retractions and alpha-order it since I want to be able to scan the names more than I want to see rank:

filter(dmca, !retraction, !counter_notice, notice_day >= as.Date("2015-01-01")) %>%
  mutate(
    notice_year = lubridate::year(notice_day),
  ) %>% 
  dplyr::count(notice_year, notice_org) %>% 
  group_by(notice_year) %>% 
  top_n(15) %>% 
  slice(1:15) %>% 
  dplyr::ungroup() %>%
  mutate( # a-z order with "a" on top 
    notice_org = factor(notice_org, levels = unique(sort(notice_org, decreasing = TRUE)))
  ) %>% 
  ggplot(aes(n, notice_org, xend=0, yend=notice_org)) +
  geom_segment(size = 2, color = ft_cols$peach) +
  facet_wrap(~notice_year, scales = "free") +
  scale_x_comma(limits=c(0, 60)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 15 GitHub DMCA Filers by Year Since 2015"
  ) +
  theme_ft_rc(grid="X")

Let’s look at the rogues’ gallery of the pirates themselves:

dmca %>% 
  mutate(
    ghusers = notice_content %>% 
      map(~{
        stri_match_all_regex(.x, "http[s]*://github.com/([^/]+)/.*") %>% 
          discard(~is.na(.x[,1])) %>% 
          map_chr(~.x[,2]) %>% 
          unique() %>% 
          discard(`==`, "github") %>% 
          discard(~grepl(" ", .x))
      })
  ) %>% 
  unnest(ghusers) %>% 
  dplyr::count(ghusers, sort=TRUE) %>% 
  print() -> offenders
## # A tibble: 18,396 x 2
##    ghusers           n
##    <chr>         <int>
##  1 RyanTech         16
##  2 sdgdsffdsfff     12
##  3 gamamaru6005     10
##  4 ranrolls         10
##  5 web-padawan      10
##  6 alexinfopruna     8
##  7 cyr2242           8
##  8 liveqmock         8
##  9 promosirupiah     8
## 10 RandyMcMillan     8
## # ... with 18,386 more rows

As you might expect, most users have only one or two complaints filed against them, since it was likely oversight more than malice on their part:

ggplot(offenders, aes(x="", n)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(breaks=1:16, limits=c(1,16)) +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of GitHub DMCA Complaints Received by a User"
  ) +
  theme_ft_rc(grid="X")

But, there are hundreds of digital buccaneers, and we can have a bit of fun with them especially since I noticed quite a few had default (generated) avatars with lots of white in them (presenting this with a pirate hat-tip to Maëlle & Lucy):

library(magick)

dir.create("gh-pirates")
dir.create("gh-pirates-jpeg")

# this kinda spoils the surprise; i should have renamed it
download.file("https://rud.is/dl/jolly-roger.jpeg", "jolly-roger.jpeg")

ghs <- safely(gh::gh) # no need to add cruft to our namespace for one function 

filter(offenders, n>2) %>% 
  pull(ghusers) %>% 
  { .pb <<- progress_estimated(length(.)); . } %>% # there are a few hundred of them
  walk(~{
    .pb$tick()$print()
    user <- ghs(sprintf("/users/%s", .x))$result # the get-user-then-download-avatar idiom should help us not bust GH API rate limits
    if (!is.null(user)) {
      download.file(user$avatar_url, file.path("gh-pirates", .x), quiet=TRUE) # can't assume avatar file type
    }
  })

# we'll convert them all to jpeg and resize them at the same time plus make sure they aren't greyscale
dir.create("gh-pirates-jpeg")
list.files("gh-pirates", full.names = TRUE, recursive = FALSE) %>%
  walk(~{
    image_read(.x) %>% 
      image_scale("72x72") %>% 
      image_convert("jpeg", type = "TrueColor", colorspace = "rgb") %>% 
      image_write(
        path = file.path("gh-pirates-jpeg", sprintf("%s.jpeg", basename(.x))), 
        format = "jpeg"
      )
  })

set.seed(20180919) # seemed appropriate for TLAPD
RsimMosaic::composeMosaicFromImageRandomOptim( # this takes a bit
  originalImageFileName = "jolly-roger.jpeg",
  outputImageFileName = "gh-pirates-flag.jpeg",
  imagesToUseInMosaic = "gh-pirates-jpeg",
  removeTiles = TRUE,
  fracLibSizeThreshold = 0.1
)

Finally, we’ll look at the types of pilfered files. To do that, we’ll first naively look for github repo URLs (there are github.io ones in there too, though, which is an exercise left to ye corsairs):

mutate(
  dmca,
  files = notice_content %>% 
    map(~{
      paste0(.x, collapse = " ") %>% 
        stri_extract_all_regex(gh_url_pattern, omit_no_match=FALSE, opts_regex = stri_opts_regex(TRUE)) %>% 
        unlist() %>% 
        stri_replace_last_regex("[[:punct:]]+$", "")
    })
) -> dmca_with_files
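
One wrinkle: gh_url_pattern isn’t defined in this excerpt. Something deliberately loose along these lines will do as a stand-in (it is not the exact pattern from the original post, so tighten it up if you only want repository file links):

# naive GitHub URL matcher (illustrative)
gh_url_pattern <- "\\bhttps?://github\\.com/[^[:space:]\"'<>\\)\\]]+"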

Now, we can see just how many resources/repos/files are in a complaint:

filter(dmca_with_files, map_lgl(files, ~!is.na(.x[1]))) %>% 
  select(notice_day, notice_org, files) %>% 
  mutate(num_refs = lengths(files)) %>%
  arrange(desc(num_refs)) %>%  # take a peek at the heavy hitters
  print() -> files_with_counts
## # A tibble: 4,020 x 4
##    notice_day notice_org files         num_refs
##    <date>     <chr>      <list>        <int>
##  1 2014-08-27 monotype        2504
##  2 2011-02-03 sony            1160
##  3 2016-06-08 monotype        1015
##  4 2018-04-05 hexrays            906
##  5 2016-06-15 ibo                877
##  6 2016-08-18 jetbrains          777
##  7 2017-10-14 cengage            611
##  8 2016-08-23 yahoo              556
##  9 2017-08-30 altis              529
## 10 2015-09-22 jetbrains          468
## # ... with 4,010 more rows

ggplot(files_with_counts, aes(x="", num_refs)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(trans="log10") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of Files/Repos per-GitHub DMCA Complaint",
    caption = "Note: Log10 Scale"
  ) +
  theme_ft_rc(grid="X")

And, what are the most offensive file types (per-year):

mutate(
  files_with_counts, 
  extensions = map(files, ~tools::file_ext(.x) %>% 
    discard(`==` , "")
  )
) %>% 
  select(notice_day, notice_org, extensions) %>% 
  unnest(extensions) %>% 
  mutate(year = lubridate::year(notice_day)) -> file_types

count(file_types, year, extensions) %>% 
  filter(year >= 2014) %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  slice(1:10) %>% 
  ungroup() %>% 
  ggplot(aes(year, n)) +
  ggrepel::geom_text_repel(
    aes(label = extensions, size=n), 
    color = ft_cols$green, family=font_ps, show.legend=FALSE
  ) +
  scale_size(range = c(3, 10)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 10 File-type GitHub DMCA Takedowns Per-year"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y=element_blank())

It’s not all code (lots of fonts and books) but there are plenty of source code files in those annual lists.

FIN

That’s it for this year’s TLAPD post. You’ve got the data and some starter code so build away! There are plenty more insights left to find and if you do take a stab at finding your own treasure, definitely leave a note in the comments.

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect and explore data on modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!

But, we don’t have a license to abuse their site.
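
One quick thing to check before hammering away is whether they declare a ‘crawl-delay’; the robotstxt package exposes that too (an empty result means no such directive is set):

rt <- robotstxt::robotstxt(domain = "https://www.icc-ccs.org/")
rt$crawl_delay # empty here, so there's no site-specified delay for us to honor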

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still be polite pirates and crawl responsibly, using a 5 second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
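
If you want a bit more armor than safely() alone, here’s the kind of thing I mean (the settings are illustrative, not gospel): an explicit user agent, a timeout, and a few polite retries via httr::RETRY():

s_GET <- safely(function(url) {
  RETRY(
    "GET", url,
    user_agent("piracy-data-research; you@example.com"), # identify yourself to the site
    timeout(30),    # don't hang forever on a slow response
    times = 3,      # retry transient failures a few times
    pause_min = 5   # keep the retry pacing polite, too
  )
})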

I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice looking table is a sea mirage, ready to trap unwary crawlers in a pit of despair. The UX is built dynamically from javascript content embedded in each page.

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# spin up a V8 context so we can evaluate the embedded javascript below
ctx <- V8::v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.

Next, I set up a cols object because we’re going to be extracting data from text as text and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small munging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.

Now we can iterate over the good (complete) responses.

The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.

Because ICC used the same Joomla plugin over the years, the data is uniform, but also can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.
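
For instance, here’s what mfga() does with the kind of repeated labels that show up in the raw records (make.unique() is what keeps the duplicate Date: fields from colliding):

mfga(c("Date:", "Date:", "Type of Attack", "Type of Vessel"))
## [1] "date"           "date_1"         "type_of_attack" "type_of_vessel"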

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on Github and don’t hesitate to ask questions in the comments or file an issue. If you make something blog it! Share your ideas and creations with the rest of the R (or other language) communities!

Avast, me hearties! It’s time four t’ annual International Talk Like a Pirate Day #rstats post!

(OK, I won’t make you suffer continuous pirate-speak for the entire post)

I tried to be a bit more practical this year and have two treasuRe chests for you to (hopefully) enjoy.

A Package Full o’ Pirates

I’ve covered the Anti-shipping Activity Messages (ASAM) Database for TLAPD before, but getting, updating and working with the data has more pain points than it should, so I wrapped a small package around it.

Here’s how to get all pirate attacks this year (2015) so far:

# devtools::install_github("hrbrmstr/asam")
library(asam)
 
data(asam_shp)
pirates <- subset(asam_shp,
                  grepl("pirate", Aggressor, ignore.case=TRUE) &
                  format(DateOfOcc, "%Y") == "2015")
 
nrow(pirates)
## [1] 78

It looks like there have been 78 registered pirate attacks this year. The National Geospatial Intelligence Agency (NGIA) marks the attacks by lat/lon and also by region/subregion, and I managed to obtain the official polygons for these regions, so we can plot these attacks on a world map and also show the subregions:

library(ggplot2)
library(ggthemes) # theme_map() comes from here
 
# get the ASAM subregion polygons
subregions <- asam_subregions()
subregions_map <- fortify(subregions)
 
# get the world map
world <- map_data("world")
 
# get the points for the pirate attack occurrences
pirate_pts <- data.frame(pirates)
 
gg <- ggplot()
 
# world map layer
gg <- gg + geom_map(data=world, map=world,
                    aes(x=long, y=lat, map_id=region),
                    color="black", fill="#e7e7e7", size=0.15)
# ASAM regions layer
gg <- gg + geom_map(data=subregions_map, map=subregions_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", fill="white", size=0.15, alpha=0)
 
# attacks
gg <- gg + geom_point(data=pirate_pts, color="black", fill="yellow", 
                      aes(x=coords.x1, y=coords.x2), shape=21)
 
gg <- gg + xlim(-170, 170)
gg <- gg + ylim(-58, 75)
gg <- gg + coord_map("mollweide")
gg <- gg + theme_map()
gg <- gg + theme(panel.background=element_rect(fill="steelblue"))
gg

(static world map of the 2015 pirate attacks, with the ASAM subregion overlays)

There is quite a bit more data than just location, though, and we can work with it much better in an interactive map.

Makin’ Interactive Pirate Maps

Now, what makes the following an interactive pirate map is not so much the fact that we’ll be plotting points of pirate attacks on a Leaflet map, but we’ll also be using a pirate treasure map theme on the Leaflet map.

Let’s start with showing how to use a general pirate map theme before adding in the ASAM data.

You’ll have to pause here and head on over to MapBox to register for a (free) account. You’ll need to go through the gyrations to eventually get a public token and mapbox id to use the pirate map tiles they have. I store those in my .Renviron so I don’t have to cut/paste inane strings when I need to use this, or other, APIs or need to keep them from prying eyes. Since MapBox exposes these strings in GET call URLs, the use of environment variables is strictly for convenience in this case.
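
If you haven’t used .Renviron before, the entries are just KEY=value lines that R reads at startup; for the code below they’d look something like this (the values are placeholders, obviously):

# in ~/.Renviron (restart R after editing)
MAPBOX_PUBLIC_TOKEN=pk.your-public-token-goes-here
PIRATE_MAP_ID=youraccount.your-pirate-map-id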

library(leaflet)
 
mapbox_public_token <- Sys.getenv("MAPBOX_PUBLIC_TOKEN")
mapbox_map_id <- Sys.getenv("PIRATE_MAP_ID")
mapbox_url <- "https://a.tiles.mapbox.com/v4/%s/{z}/{x}/{y}.png?access_token=%s"
mapbox_tiles_template <- sprintf(mapbox_url, mapbox_map_id, mapbox_public_token)

Now, what good is a pirate map without an ‘X’ marking the spot for some treasure? For that we’ll need an ‘X’ in a format we can use with Leaflet:

x_marker <- icons("http://rud.is/dl/x.png",
                  iconHeight=64, iconWidth=64,
                  iconAnchorX=32, iconAnchorY=32)

Now, we can display a pirate map for all scurvy dogs to see:

leaflet() %>%
  addTiles(mapbox_tiles_template) %>%
  setView(lng=-50.9249, lat=45.68929, zoom=3) %>%
  addMarkers(-70.2667, 43.6667, icon=x_marker)

NOTE: I have not buried treasure in Portland, Maine, but go nuts digging at that location if you still want to.

Pirates on Pirate Maps

We can make a [crude] interactive ASAM browser by combining our data from above with our new, pirate-y mapping capabilities:

library(asam)
library(sp)
library(dplyr)
library(leaflet)
 
data(asam_shp)
dat <- subset(asam_shp,
              DateOfOcc > as.Date("2015-01-01") &
                grepl("pirate", Aggressor, ignore.case=TRUE))
# could also do data.frame(dat)
dat <- bind_cols(dat@data, data.frame(coordinates(dat), stringsAsFactors=FALSE))

We’ll build a popup with the ASAM incident description fields and add it and the pirate incident points to a pirate-themed Leaflet map:

popup_template <- '<div style="background:#f3e0b5; padding:10px"><b>Date:</b> %s
<span style="float:right"><a target="_blank" href="https://msi.nga.mil/NGAPortal/msi/query_results.jsp?MSI_queryType=ASAM&amp;MSI_generalFilterType=SpecificNumber&amp;MSI_generalFilterValue=%s&amp;MSI_additionalFilterType1=None&amp;MSI_additionalFilterType2=-999&amp;MSI_additionalFilterValue1=-999&amp;MSI_additionalFilterValue2=-999&amp;MSI_outputOptionType1=SortBy&amp;MSI_outputOptionType2=-999&amp;MSI_outputOptionValue1=Date_DESC&amp;MSI_outputOptionValue2=-999&amp;MSI_MAP=-999">ASAM Record</a>
</span><br/>
<b>Victim:</b> %s<br/>
<b>Description:</b> %s</div>'
 
nona <- function(x) ifelse(is.na(x), " ", x)
 
pirate_pops <- sprintf(popup_template,
                       dat$date,
                       gsub("-", "_", dat$Reference),
                       dat$Victim,
                       paste(nona(dat$Descript), # paste() so sep=" " actually applies
                             nona(dat$Desc1), nona(dat$Desc2), nona(dat$Desc3),
                             nona(dat$Desc4), nona(dat$Desc5), nona(dat$Desc6),
                             sep=" "))
 
mapbox_public_token <- Sys.getenv("MAPBOX_PUBLIC_TOKEN")
mapbox_map_id <- Sys.getenv("PIRATE_MAP_ID")
mapbox_url <- "https://a.tiles.mapbox.com/v4/%s/{z}/{x}/{y}.png?access_token=%s"
mapbox_tiles_template <- sprintf(mapbox_url, mapbox_map_id, mapbox_public_token)
 
leaflet() %>%
  addTiles(mapbox_tiles_template) %>%
  setView(lng=-50.9249, lat=45.68929, zoom=3) %>%
  addCircles(dat$coords.x1, dat$coords.x2, radius=300,
             color="#664c1f", popup=pirate_pops)

Select any of the circle marks and you’ll get a popup with a description and link to the official ASAM record (like this):

(screenshot of an example ASAM record popup)

Fin

I’m not sure when I’ll get back to the asam package, but it could use some attention. The Aggressor field could be auto-cleaned to make it more usable and a dplyr-esque interface could be developed to select incidents. Also, since it includes a shapefile of subregions, that could also be used to do more spatial-oriented analyses of the incidents. It’s all there for any pirate lackey to pilfer.
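
If any such lackey wants a head start on the dplyr-esque interface, here’s a rough sketch of the shape it could take (the function name and arguments are made up for illustration; it just wraps the same subset/bind_cols dance used above):

library(asam)
library(sp)
library(dplyr)

# hypothetical helper: ASAM records as a tibble, with a couple of simple filters
asam_incidents <- function(aggressor = NULL, since = NULL) {
  data(asam_shp, envir = environment())
  out <- as_tibble(bind_cols(
    asam_shp@data,
    as.data.frame(coordinates(asam_shp))
  ))
  if (!is.null(aggressor)) out <- filter(out, grepl(aggressor, Aggressor, ignore.case = TRUE))
  if (!is.null(since))     out <- filter(out, DateOfOcc >= as.Date(since))
  out
}

# e.g., all pirate incidents since the start of 2015
asam_incidents("pirate", since = "2015-01-01")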

Drop a note in the comments if you have any of your own pirate-y creations, or file an issue on GitHub for feature requests & bug reports.