I sailed through the core components of this year’s Talk Like A Pirate Day R post a few months ago, but time has been an enemy of late, so this will be a short post that others can build on, especially since there’s plenty more knife work to do on the data.
DMC-WhAt?
Since this is TLAPD, I’ll pilfer some of the explanation from GitHub itself:
The Digital Millennium Copyright Act (DMCA) “provides a safe harbor for service providers that host user-generated content. Since even a single claim of copyright infringement can carry statutory damages of up to $150,000, the possibility of being held liable for user-generated content could be very harmful for service providers. With potential damages multiplied across millions of users, cloud-computing and user-generated content sites like YouTube, Facebook, or GitHub probably never would have existed without the DMCA (or at least not without passing some of that cost downstream to their users).”
“The DMCA addresses this issue by creating a copyright liability safe harbor for internet service providers hosting allegedly infringing user-generated content. Essentially, so long as a service provider follows the DMCA’s notice-and-takedown rules, it won’t be liable for copyright infringement based on user-generated content. Because of this, it is important for GitHub to maintain its DMCA safe-harbor status.”
(I’ll save you from a long fact- and opinion-based diatribe on the DMCA, but suffice it to say it’s done far more harm than good, IMO. Also, hopefully the “piracy” connection makes sense now :-)
If your initial reaction was “What does the DMCA have to do with GitHub?” it likely (quickly) turned to “Oh…GitHub is really just a version-controlled file sharing service…”. As such it has to have a robust takedown policy and process.
I don’t know if Microsoft will keep the practice of being open about DMCA requests now that they own GitHub, nor do I know whether they’ll apply the same process to themselves (since, as we’ll see, they have issued DMCA requests to GitHub in the past). For now, we’ll assume they will, which makes the code from this post usable in the future to check on the status of DMCA requests over a longer period of time. But first we need the data.
Hunting for treasure in the data hoard
Unsurprisingly, GitHub stores DMCA data on GitHub. Ironically, they store it openly, in part, to shine a light on what giant, global megacorps like Microsoft are doing. Feel free to use one of the many R packages to clone the repo, but a simple command-line git clone git@github.com:github/dmca.git is quick and efficient (not everything needs to be done from R).
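If ye prefer to stay in R even for the cloning step, the git2r package can do the same thing. A minimal sketch (the /data/github/dmca destination is just the prefix used later in this post; adjust to taste):
library(git2r)
# clone the public DMCA repo to the local prefix used in the rest of the post
git2r::clone(
  url = "https://github.com/github/dmca.git",
  local_path = "/data/github/dmca"
)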
The directory structure looks like this:
├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
├── 2017-02-01-RBoyApps-2.md
├── 2017-02-15-DeutscheBank.md
├── 2017-03-13-Jetbrains.md
├── 2017-06-26-Wipro-Counternotice.md
├── 2017-06-30-AdflyLink.md
├── 2017-07-28-Toontown-2.md
├── 2017-08-31-Tourzan.md
├── 2017-09-04-Random-House.md
├── 2017-09-05-RandomHouse-2.md
├── 2017-09-18-RandomHouse.md
├── 2017-09-19-Ragnarok.md
├── 2017-10-10-Broadcom.md
├── 2018
├── 2018-02-01-NihonAdSystems.md
├── 2018-03-03-TuneIn.md
├── 2018-03-16-Wabg.md
├── 2018-05-17-Packt.md
├── 2018-06-12-Suning.md
├── 2018-07-31-Pearson.md
├── CONTRIBUTING.md
├── data
└── README.md
Unfortunately, the data directory contains fools’ gold (it’s just high-level summary data).
We want DMCA filer names, repo names, file names and the DMCA notice text (though we’ll be leaving NLP projects up to intrepid readers). Getting those means processing the directories of notices themselves.
Notices are named (sadly, with some inconsistency) like this: 2018-03-15-Microsoft.md. Year, month, day and name of org. The contents are text versions of correspondence (usually email text) that have some requirements in order to be processed. There’s also an online form one can fill out, but it’s pretty much a free-text field with some semblance of structure. It’s up to humans to follow that structure and, as such, there is inconsistency in the text as well. (Perhaps this is a great lesson that non-constrained inputs and human-originated filenames aren’t a great plan for curating data stores.)
You may have seen what look like takedown files in the top level of the repo. I have no idea if they are legit (since they aren’t in the structured directories) so we’ll be ignoring them.
When I took a look at the directories, some files end in .markdown but most end in .md. We’ll cover both cases (you’ll need to replace /data/github/dmca with the prefix where you stored the repo):
library(tools)
library(stringi)
library(hrbrthemes)
library(tidyverse)
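# gather every notice file (.md or .markdown) from the per-year directories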
list.files(
path = sprintf("/data/github/dmca/%s", 2011:2018),
pattern = "\\.md$|\\.markdown$",
full.names = TRUE
) -> dmca_files
As noted previously, we’re going to focus on DMCA notices over time, look at the organizations who filed them and grab the notice content. It turns out the filenames also distinguish whether a notice is a takedown request, a counter-notice (i.e. a challenge to the takedown from the accused repo owner) or a retraction (an “oops…my bad…” from the takedown originator), so we’ll collect that metadata as well. Finally, we’ll slurp up the text along the way.
Again, I’ve taken a pass at this and found out the following:
- Some dates are coded incorrectly (infrequently enough to be able to fix them with a few ad-hoc rules)
- Some org names are coded incorrectly (often enough to skew counts, so we need to deal with it)
- Counter-notice and retraction tags are inconsistent, so we need to deal with that as well
It’s an ugly pipeline, so I’ve annotated these initial steps to make what’s going on a bit clearer:
map_df(dmca_files, ~{
file_path_sans_ext(.x) %>% # remove extension
basename() %>% # get just the filename
stri_match_all_regex(
"([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)" # try to find the date and the org
) %>%
unlist() -> date_org
if (is.na(date_org[2])) { # handle a special case where the date pattern above didn't work
file_path_sans_ext(.x) %>%
basename() %>%
stri_match_all_regex(
"([[:digit:]]{4}-[[:digit:]]{2})-(.*)"
) %>%
unlist() -> date_org
}
# a few files are still broken so we'll deal with them as special cases
if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
date_org <- c("", "2017-11-06", "1776")
} else if (stri_detect_fixed(.x, "2017/2017-Offensive-Security-7.md")) {
date_org <- c("", "2017-12-30", "Offensive-Security-7")
} else if (stri_detect_fixed(.x, "2017/Offensive-Security-6.md")) {
date_org <- c("", "2017-12-29", "Offensive-Security-6")
}
# we used a somewhat liberal regex to capture dates since some are
# still broken. We'll deal with those first, then turn them
# into proper Date objects
list(
notice_day = case_when(
date_org[2] == "2015-12-3" ~ "2015-12-03",
date_org[2] == "2015-12-7" ~ "2015-12-07",
date_org[2] == "2016-08" ~ "2016-08-01",
date_org[2] == "2016-10-7" ~ "2016-10-07",
date_org[2] == "2016-11-1" ~ "2016-11-01",
date_org[2] == "2016-11-3" ~ "2016-11-03",
date_org[2] == "2017-06" ~ "2017-06-01",
date_org[2] == "0107-05-22" ~ "2017-05-22",
date_org[2] == "2017-11-1" ~ "2017-11-01",
TRUE ~ date_org[2]
) %>%
lubridate::ymd(),
notice_org = date_org[3] %>% # sometimes the org name is messed up so we need to clean it up
stri_replace_last_regex("[-]*[[:digit:]]+$", "") %>%
stri_replace_all_fixed("-", " "),
notice_content = list(read_lines(.x)) # grab the content
) -> ret
# and there are still some broken org names
if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
ret$notice_org <- "1776"
}
ret
}) -> dmca
dmca
## # A tibble: 4,460 x 3
## notice_day notice_org notice_content
##    <date>     <chr>      <list>
## 1 2011-01-27 sony
## 2 2011-01-28 tera
## 3 2011-01-31 sony
## 4 2011-02-03 sony counternotice
## 5 2011-02-03 sony
## 6 2011-03-24 oracle
## 7 2011-03-30 mentor graphics
## 8 2011-05-24 cpp virtual world operations
## 9 2011-06-07 sony
## 10 2011-06-13 diablominer
## # ... with 4,450 more rows
Much better. We’ve got more deck-swabbing to do, now, to tag the counter-notices and retractions:
mutate(
dmca,
counter_notice = stri_detect_regex(notice_org, "counternotice|counter notice", case_insensitive = TRUE), # handle inconsistency (regex so the alternation works; ignore case since names aren't lower-cased yet)
retraction = stri_detect_regex(notice_org, "retraction", case_insensitive = TRUE),
notice_org = stri_trans_tolower(notice_org) %>%
stri_replace_first_regex(" *(counternotice|counter notice) *", "") %>% # clean up org names with tags
stri_replace_first_regex(" *retraction *", "")
) -> dmca
dmca
## # A tibble: 4,460 x 5
## notice_day notice_org notice_content counter_notice retraction
##    <date>     <chr>      <list>         <lgl>          <lgl>
## 1 2011-01-27 sony FALSE FALSE
## 2 2011-01-28 tera FALSE FALSE
## 3 2011-01-31 sony FALSE FALSE
## 4 2011-02-03 sony FALSE FALSE
## 5 2011-02-03 sony FALSE FALSE
## 6 2011-03-24 oracle FALSE FALSE
## 7 2011-03-30 mentor graphics FALSE FALSE
## 8 2011-05-24 cpp virtual worl… FALSE FALSE
## 9 2011-06-07 sony FALSE FALSE
## 10 2011-06-13 diablominer FALSE FALSE
## # ... with 4,450 more rows
I’ve lower-cased the org names to make it easier to wrangle them since we do, indeed, need to wrangle them.
I’m super-not-proud of the following code block, but I went into it thinking the org name corrections would be infrequent. But, as I worked with the supposedly-cleaned data, I kept adding correction rules and eventually created a monster:
mutate(
dmca,
notice_org = case_when(
stri_detect_fixed(notice_org, "accenture") ~ "accenture",
stri_detect_fixed(notice_org, "adobe") ~ "adobe",
stri_detect_fixed(notice_org, "amazon") ~ "amazon",
stri_detect_fixed(notice_org, "ansible") ~ "ansible",
stri_detect_fixed(notice_org, "aspengrove") ~ "aspengrove",
stri_detect_fixed(notice_org, "apple") ~ "apple",
stri_detect_fixed(notice_org, "aws") ~ "aws",
stri_detect_fixed(notice_org, "blizzard") ~ "blizzard",
stri_detect_fixed(notice_org, "o reilly") ~ "oreilly",
stri_detect_fixed(notice_org, "random") ~ "random house",
stri_detect_fixed(notice_org, "casado") ~ "casadocodigo",
stri_detect_fixed(notice_org, "ccp") ~ "ccp",
stri_detect_fixed(notice_org, "cisco") ~ "cisco",
stri_detect_fixed(notice_org, "cloudsixteen") ~ "cloud sixteen",
stri_detect_fixed(notice_org, "collinsharper") ~ "collins ’harper",
stri_detect_fixed(notice_org, "contentanalytics") ~ "content analytics",
stri_detect_fixed(notice_org, "packt") ~ "packt",
stri_detect_fixed(notice_org, "penguin") ~ "penguin",
stri_detect_fixed(notice_org, "wiley") ~ "wiley",
stri_detect_fixed(notice_org, "wind river") ~ "windriver",
stri_detect_fixed(notice_org, "windriver") ~ "windriver",
stri_detect_fixed(notice_org, "wireframe") ~ "wireframe shader",
stri_detect_fixed(notice_org, "listen") ~ "listen",
stri_detect_fixed(notice_org, "wpecommerce") ~ "wpecommerce",
stri_detect_fixed(notice_org, "yahoo") ~ "yahoo",
stri_detect_fixed(notice_org, "youtube") ~ "youtube",
stri_detect_fixed(notice_org, "x pressive") ~ "xpressive",
stri_detect_fixed(notice_org, "ximalaya") ~ "ximalaya",
stri_detect_fixed(notice_org, "pragmatic") ~ "pragmatic",
stri_detect_fixed(notice_org, "evadeee") ~ "evadeee",
stri_detect_fixed(notice_org, "iaai") ~ "iaai",
stri_detect_fixed(notice_org, "line corp") ~ "line corporation",
stri_detect_fixed(notice_org, "mediumrare") ~ "medium rare",
stri_detect_fixed(notice_org, "profittrailer") ~ "profit trailer",
stri_detect_fixed(notice_org, "smartadmin") ~ "smart admin",
stri_detect_fixed(notice_org, "microsoft") ~ "microsoft",
stri_detect_fixed(notice_org, "monotype") ~ "monotype",
stri_detect_fixed(notice_org, "qualcomm") ~ "qualcomm",
stri_detect_fixed(notice_org, "pearson") ~ "pearson",
stri_detect_fixed(notice_org, "sony") ~ "sony",
stri_detect_fixed(notice_org, "oxford") ~ "oxford",
stri_detect_fixed(notice_org, "oracle") ~ "oracle",
stri_detect_fixed(notice_org, "out fit") ~ "outfit",
stri_detect_fixed(notice_org, "nihon") ~ "nihon",
stri_detect_fixed(notice_org, "opencv") ~ "opencv",
stri_detect_fixed(notice_org, "newsis") ~ "newsis",
stri_detect_fixed(notice_org, "nostarch") ~ "nostarch",
stri_detect_fixed(notice_org, "stardog") ~ "stardog",
stri_detect_fixed(notice_org, "mswindows") ~ "microsoft",
stri_detect_fixed(notice_org, "moody") ~ "moody",
stri_detect_fixed(notice_org, "minecraft") ~ "minecraft",
stri_detect_fixed(notice_org, "medinasoftware") ~ "medina software",
stri_detect_fixed(notice_org, "linecorporation") ~ "line corporation",
stri_detect_fixed(notice_org, "steroarts") ~ "stereoarts",
stri_detect_fixed(notice_org, "mathworks") ~ "mathworks",
stri_detect_fixed(notice_org, "tmssoftware") ~ "tmssoftware",
stri_detect_fixed(notice_org, "toontown") ~ "toontown",
stri_detect_fixed(notice_org, "wahoo") ~ "wahoo",
stri_detect_fixed(notice_org, "webkul") ~ "webkul",
stri_detect_fixed(notice_org, "whmcs") ~ "whmcs",
stri_detect_fixed(notice_org, "viber") ~ "viber",
stri_detect_fixed(notice_org, "totalfree") ~ "totalfreedom",
stri_detect_fixed(notice_org, "successacademies") ~ "success academies",
stri_detect_fixed(notice_org, "ecgwaves") ~ "ecgwaves",
stri_detect_fixed(notice_org, "synology") ~ "synology",
stri_detect_fixed(notice_org, "infistar") ~ "infistar’",
stri_detect_fixed(notice_org, "galleria") ~ "galleria",
stri_detect_fixed(notice_org, "jadoo") ~ "jadoo",
stri_detect_fixed(notice_org, "dofustouch") ~ "dofus touch",
stri_detect_fixed(notice_org, "gravityforms") ~ "gravity forms",
stri_detect_fixed(notice_org, "fujiannewland") ~ "fujian newland",
stri_detect_fixed(notice_org, "dk uk") ~ "dk",
stri_detect_fixed(notice_org, "dk us") ~ "dk",
stri_detect_fixed(notice_org, "dkuk") ~ "dk",
stri_detect_fixed(notice_org, "dkus") ~ "dk",
stri_detect_fixed(notice_org, "facet") ~ "facet",
stri_detect_fixed(notice_org, "fh admin") ~ "fhadmin",
stri_detect_fixed(notice_org, "electronicarts") ~ "electronic arts",
stri_detect_fixed(notice_org, "daikonforge") ~ "daikon forge",
stri_detect_fixed(notice_org, "corgiengine") ~ "corgi engine",
stri_detect_fixed(notice_org, "epicgames") ~ "epic games",
stri_detect_fixed(notice_org, "essentialmode") ~ "essentialmode",
stri_detect_fixed(notice_org, "jetbrains") ~ "jetbrains",
stri_detect_fixed(notice_org, "foxy") ~ "foxy themes",
stri_detect_fixed(notice_org, "cambridgemobile") ~ "cambridge mobile",
stri_detect_fixed(notice_org, "offensive") ~ "offensive security",
stri_detect_fixed(notice_org, "outfit") ~ "outfit",
stri_detect_fixed(notice_org, "haihuan") ~ "shanghai haihuan",
stri_detect_fixed(notice_org, "schuster") ~ "simon & schuster",
stri_detect_fixed(notice_org, "silicon") ~ "silicon labs",
TRUE ~ notice_org
)) %>%
arrange(notice_day) -> dmca
dmca
## # A tibble: 4,460 x 5
## notice_day notice_org notice_content counter_notice retraction
##    <date>     <chr>      <list>         <lgl>          <lgl>
## 1 2011-01-27 sony FALSE FALSE
## 2 2011-01-28 tera FALSE FALSE
## 3 2011-01-31 sony FALSE FALSE
## 4 2011-02-03 sony FALSE FALSE
## 5 2011-02-03 sony FALSE FALSE
## 6 2011-03-24 oracle FALSE FALSE
## 7 2011-03-30 mentor graphics FALSE FALSE
## 8 2011-05-24 cpp virtual worl… FALSE FALSE
## 9 2011-06-07 sony FALSE FALSE
## 10 2011-06-13 diablominer FALSE FALSE
## # ... with 4,450 more rows
You are heartily encouraged to create a translation table in place of that monstrosity.
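If ye take that advice, a minimal sketch of the translation-table approach might look like this (the patterns shown are just a few hypothetical examples; you’d populate the table from the rules above):
# one row per pattern -> canonical name; this table replaces the giant case_when()
org_map <- tribble(
  ~pattern,              ~canonical,
  "microsoft|mswindows", "microsoft",
  "random",              "random house",
  "wind ?river",         "windriver",
  "dk ?u[ks]",           "dk"
)
# apply the first matching rule, falling back to the original name
fix_org <- function(org) {
  hit <- which(stri_detect_regex(org, org_map$pattern))[1]
  if (is.na(hit)) org else org_map$canonical[hit]
}
mutate(dmca, notice_org = map_chr(notice_org, fix_org)) %>%
  arrange(notice_day) -> dmca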
But, we finally have usable data. You can avoid the above by downloading https://rud.is/dl/github-dmca.json.gz and using jsonlite::stream_in() or ndjson::stream_in() to get the above data frame.
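(That file is just the data frame streamed out as gzip’d newline-delimited JSON. If ye want to snapshot yer own cleaned copy the same way, something like this will do; it’s an assumption about how the linked file was made, but it round-trips fine:)
# write the cleaned data frame out as gzip'd ndjson for later re-use
jsonlite::stream_out(dmca, gzfile("github-dmca.json.gz"))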
Hoisting the mizzen sailplots
Let’s see what the notice submission frequency looks like over time:
# assuming you downloaded it as suggested
jsonlite::stream_in(gzfile("~/Data/github-dmca.json.gz")) %>%
tbl_df() %>%
mutate(notice_day = as.Date(notice_day)) -> dmca
filter(dmca, !retraction) %>%
mutate(
notice_year = lubridate::year(notice_day),
notice_ym = as.Date(format(notice_day, "%Y-%m-01"))
) %>%
dplyr::count(notice_ym) %>%
arrange(notice_ym) %>%
ggplot(aes(notice_ym, n)) +
ggalt::stat_xspline(
geom="area", fill=alpha(ft_cols$blue, 1/3), color=ft_cols$blue
) +
scale_y_comma() +
labs(
x = NULL, y = "# Notices",
title = "GitHub DMCA Notices by Month Since 2011"
) +
theme_ft_rc(grid="XY")
I’m not naive, but that growth was a bit of a shocker, which made me want to jump in and see who the top-filers were:
count(dmca, notice_org, sort=TRUE)
## # A tibble: 1,948 x 2
## notice_org n
##    <chr>      <int>
## 1 webkul 92
## 2 pearson 90
## 3 stereoarts 86
## 4 qualcomm 72
## 5 codility 71
## 6 random house 62
## 7 outfit 57
## 8 offensive security 49
## 9 sensetime 46
## 10 penguin 44
## # ... with 1,938 more rows
“Webkul” is an enterprise eCommerce platform (I kinda miss all the dashed “e-” prefixes we used to use back in the day). I mention that since I didn’t know what it was either. There are some recognizable names there like “Pearson”, “Random House” and “Penguin”, which make sense since it’s easy to improperly share e-books (modern non-dashed idioms be darned).
Let’s see the top 15 orgs by year since 2015 (since that’s when DMCA filings really started picking up and because I like 2×2 grids). We’ll also leave out counter-notices and retractions and alpha-order it since I want to be able to scan the names more than I want to see rank:
filter(dmca, !retraction, !counter_notice, notice_day >= as.Date("2015-01-01")) %>%
mutate(
notice_year = lubridate::year(notice_day)
) %>%
dplyr::count(notice_year, notice_org) %>%
group_by(notice_year) %>%
top_n(15) %>%
slice(1:15) %>%
dplyr::ungroup() %>%
mutate( # a-z order with "a" on top
notice_org = factor(notice_org, levels = unique(sort(notice_org, decreasing = TRUE)))
) %>%
ggplot(aes(n, notice_org, xend=0, yend=notice_org)) +
geom_segment(size = 2, color = ft_cols$peach) +
facet_wrap(~notice_year, scales = "free") +
scale_x_comma(limits=c(0, 60)) +
labs(
x = NULL, y = NULL,
title = "Top 15 GitHub DMCA Filers by Year Since 2015"
) +
theme_ft_rc(grid="X")
Let’s look at the rogues’ gallery of the pirates themselves:
dmca %>%
mutate(
ghusers = notice_content %>%
map(~{
stri_match_all_regex(.x, "http[s]*://github.com/([^/]+)/.*") %>%
discard(~is.na(.x[,1])) %>%
map_chr(~.x[,2]) %>%
unique() %>%
discard(`==`, "github") %>%
discard(~grepl(" ", .x))
})
) %>%
unnest(ghusers) %>%
dplyr::count(ghusers, sort=TRUE) %>%
print() -> offenders
## # A tibble: 18,396 x 2
## ghusers n
##    <chr>        <int>
## 1 RyanTech 16
## 2 sdgdsffdsfff 12
## 3 gamamaru6005 10
## 4 ranrolls 10
## 5 web-padawan 10
## 6 alexinfopruna 8
## 7 cyr2242 8
## 8 liveqmock 8
## 9 promosirupiah 8
## 10 RandyMcMillan 8
## # ... with 18,386 more rows
As you might expect, most users have only one or two complaints filed against them, since it was likely an oversight more than malice on their part:
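If ye want the raw numbers behind that claim before plotting it, a quick count of the counts does the trick:
# how many users have 1 complaint filed against them, how many have 2, etc.
dplyr::count(offenders, n)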
ggplot(offenders, aes(x="", n)) +
ggbeeswarm::geom_quasirandom(
color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
shape = 21, size = 3, stroke = 0.125
) +
scale_y_comma(breaks=1:16, limits=c(1,16)) +
coord_flip() +
labs(
x = NULL, y = NULL,
title = "Distribution of the Number of GitHub DMCA Complaints Received by a User"
) +
theme_ft_rc(grid="X")
But, there are hundreds of digital buccaneers, and we can have a bit of fun with them especially since I noticed quite a few had default (generated) avatars with lots of white in them (presenting this with a pirate hat-tip to Maëlle & Lucy):
library(magick)
dir.create("gh-pirates")
dir.create("gh-pirates-jpeg")
# this kinda spoils the surprise; i should have renamed it
download.file("https://rud.is/dl/jolly-roger.jpeg", "jolly-roger.jpeg")
ghs <- safely(gh::gh) # no need to add cruft to our namespace for one function
filter(offenders, n>2) %>%
pull(ghusers) %>%
{ .pb <<- progress_estimated(length(.)); . } %>% # there are a few hundred of them
walk(~{
.pb$tick()$print()
user <- ghs(sprintf("/users/%s", .x))$result # the get-user-then-download-avatar idiom should help us not bust GH API rate limits
if (!is.null(user)) {
download.file(user$avatar_url, file.path("gh-pirates", .x), quiet=TRUE) # can't assume avatar file type
}
})
# we'll convert them all to jpeg and resize them at the same time plus make sure they aren't greyscale
dir.create("gh-pirates-jpeg")
list.files("gh-pirates", full.names = TRUE, recursive = FALSE) %>%
walk(~{
image_read(.x) %>%
image_scale("72x72") %>%
image_convert("jpeg", type = "TrueColor", colorspace = "rgb") %>%
image_write(
path = file.path("gh-pirates-jpeg", sprintf("%s.jpeg", basename(.x))),
format = "jpeg"
)
})
set.seed(20180919) # seemed appropriate for TLAPD
RsimMosaic::composeMosaicFromImageRandomOptim( # this takes a bit
originalImageFileName = "jolly-roger.jpeg",
outputImageFileName = "gh-pirates-flag.jpeg",
imagesToUseInMosaic = "gh-pirates-jpeg",
removeTiles = TRUE,
fracLibSizeThreshold = 0.1
)
Finally, we’ll look at the types of pilfered files. To do that, we’ll first naively look for GitHub repo URLs (there are github.io ones in there too, though, which is an exercise left to ye corsairs):
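One note: gh_url_pattern isn’t defined anywhere above, so here’s a deliberately liberal stand-in (an assumption on my part, not necessarily the exact pattern behind the original numbers); any trailing punctuation it grabs gets stripped in the next step anyway:
# liberal match for GitHub URLs; trailing punctuation is cleaned off later in the pipeline
gh_url_pattern <- "http[s]*://github\\.com/[^[:space:]\"'<>]+"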
mutate(
dmca,
files = notice_content %>%
map(~{
paste0(.x, collapse = " ") %>%
stri_extract_all_regex(gh_url_pattern, omit_no_match=FALSE, opts_regex = stri_opts_regex(TRUE)) %>%
unlist() %>%
stri_replace_last_regex("[[:punct:]]+$", "")
})
) -> dmca_with_files
Now, we can see just how many resources/repos/files are in a complaint:
filter(dmca_with_files, map_lgl(files, ~!is.na(.x[1]))) %>%
select(notice_day, notice_org, files) %>%
mutate(num_refs = lengths(files)) %>%
arrange(desc(num_refs)) %>% # take a peek at the heavy hitters
print() -> files_with_counts
## # A tibble: 4,020 x 4
## notice_day notice_org files num_refs
##    <date>     <chr>      <list>  <int>
## 1 2014-08-27 monotype 2504
## 2 2011-02-03 sony 1160
## 3 2016-06-08 monotype 1015
## 4 2018-04-05 hexrays 906
## 5 2016-06-15 ibo 877
## 6 2016-08-18 jetbrains 777
## 7 2017-10-14 cengage 611
## 8 2016-08-23 yahoo 556
## 9 2017-08-30 altis 529
## 10 2015-09-22 jetbrains 468
## # ... with 4,010 more rows
ggplot(files_with_counts, aes(x="", num_refs)) +
ggbeeswarm::geom_quasirandom(
color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
shape = 21, size = 3, stroke = 0.125
) +
scale_y_comma(trans="log10") +
coord_flip() +
labs(
x = NULL, y = NULL,
title = "Distribution of the Number of Files/Repos per-GitHub DMCA Complaint",
caption = "Note: Log10 Scale"
) +
theme_ft_rc(grid="X")
And, what are the most offensive file types (per-year):
mutate(
files_with_counts,
extensions = map(files, ~tools::file_ext(.x) %>%
discard(`==` , "")
)
) %>%
select(notice_day, notice_org, extensions) %>%
unnest(extensions) %>%
mutate(year = lubridate::year(notice_day)) -> file_types
count(file_types, year, extensions) %>%
filter(year >= 2014) %>%
group_by(year) %>%
top_n(10) %>%
slice(1:10) %>%
ungroup() %>%
ggplot(aes(year, n)) +
ggrepel::geom_text_repel(
aes(label = extensions, size=n),
color = ft_cols$green, family=font_ps, show.legend=FALSE
) +
scale_size(range = c(3, 10)) +
labs(
x = NULL, y = NULL,
title = "Top 10 File-type GitHub DMCA Takedowns Per-year"
) +
theme_ft_rc(grid="X") +
theme(axis.text.y=element_blank())
It’s not all code (lots of fonts and books) but there are plenty of source code files in those annual lists.
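If ye’d rather see the overall ranking than eyeball the annual panels, a one-liner does it:
# most frequently targeted file extensions across all years
count(file_types, extensions, sort = TRUE)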
FIN
That’s it for this year’s TLAPD post. You’ve got the data and some starter code so build away! There are plenty more insights left to find and if you do take a stab at finding your own treasure, definitely leave a note in the comments.