Just about a week ago @thosjleeper posited something on Twitter w/r/t how many CRAN packages had associations with GitHub (i.e. how many used GitHub for development). The `DESCRIPTION` file (that comes with all R packages) has some fields that can house this information. Most folks who do use GitHub for R package development seem to put the repository URL in the `URL` field, which looks something like this:
(that’s from @ma_salmon’s ++good `geoparser` package)
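To make the `URL` field idea concrete, here is a minimal sketch that parses a `DESCRIPTION`-style record with base R's `read.dcf()` and checks the `URL` field for a GitHub link. The package name, user name and field values are hypothetical and only there for illustration; they are not taken from any real package.

```r
# Hypothetical DESCRIPTION contents (not from a real package)
desc_text <- c(
  "Package: examplepkg",
  "Title: A Hypothetical Package",
  "Version: 0.1.0",
  "URL: https://github.com/exampleuser/examplepkg",
  "BugReports: https://github.com/exampleuser/examplepkg/issues"
)

# DESCRIPTION files use the Debian control file format, which read.dcf() parses
con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

# read.dcf() returns a character matrix with one column per field
url_field <- desc[1, "URL"]

# Naive check: does the URL field mention github.com at all?
has_github <- grepl("github\\.com", url_field, ignore.case = TRUE)
```

Pointing `read.dcf()` at a real `DESCRIPTION` file path instead of the text connection should work the same way.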
There may be traces of GitHub URLs in other fields, but I took this initial naïve assumption and added a step to my daily home CRAN mirror scripts (what, you don’t have your own CRAN mirror at home, too?) to generate two files that you, the R community, can use whenever you want to inspect GitHub R packages:
– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata) (R data file)
– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json) (JSON file)
You can read these files straight from those URLs, but they change only once a day, so you might want to `download.file()` them rather than waste bandwidth re-reading them throughout the day.
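One way to avoid re-fetching during the day is to keep a local copy and only download when it is missing or stale. The sketch below is one possible approach, not part of the mirror scripts: `needs_refresh()` and `fetch_ghcran()` are hypothetical helper names, and the 24-hour threshold is an assumption based on the daily update schedule.

```r
# Hypothetical helper: TRUE when the local copy is missing or older than max_age_hours
needs_refresh <- function(path, max_age_hours = 24) {
  if (!file.exists(path)) return(TRUE)
  age <- difftime(Sys.time(), file.info(path)$mtime, units = "hours")
  as.numeric(age) > max_age_hours
}

# Hypothetical helper: download the R data file only when the cached copy is stale
fetch_ghcran <- function(path = "ghcran.Rdata",
                         url = "http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata") {
  if (needs_refresh(path)) {
    # mode = "wb" keeps the binary file intact on Windows
    download.file(url, path, mode = "wb")
  }
  invisible(path)
}

# Usage (needs network access on the first call of the day):
# load(fetch_ghcran())
```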
As of this post there were 1,544 packages meeting my naïve criteria.
One interesting side-discovery of this effort was that there are 122 “distinct” `DESCRIPTION` fields in use, though some of those are mixed-case versions of each other (118 unique ones once case is ignored). Plus, the rules for field names don’t seem to be as hard and fast as one might think. Some examples:
– `acknowledgement`, `acknowledgements`, `acknowledgments`
– `bioviews`, `biocviews`
– `keyword`, `keywords`
– `reference`, `references`, `reference manual`
– `systemrequirement`, `systemrequirements`, `systemrequirementsnote`
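The gap between the case-sensitive and case-insensitive counts can be reproduced on a small scale. This is a minimal sketch with a made-up vector of field names (the real list comes from parsing every `DESCRIPTION` file on CRAN, which is not shown here):

```r
# Hypothetical field names as they might appear across DESCRIPTION files
fields <- c("Package", "URL", "Url", "BugReports", "bugreports",
            "Keywords", "keyword")

n_raw    <- length(unique(fields))            # case-sensitive distinct count
n_folded <- length(unique(tolower(fields)))   # distinct count after folding case

# Group the original spellings by their lower-cased form and keep
# only the groups that have more than one spelling
variants <- split(fields, tolower(fields))
mixed_case <- variants[lengths(variants) > 1]
```

Note that case folding only catches variants like `URL` vs `Url`; singular/plural pairs such as `keyword` and `keywords` still count as separate fields.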
You can see the usage counts for all the fields in the table below:
But, I digress.
Who has the most CRAN packages with associated GitHub repositories? The code below _mostly_ answers this. I say “mostly” since I don’t handle edge cases in the `URL` field (look at it to see what I mean). It’s also possible that there are traces of GitHub in other fields, and I’ll address those in my local CRAN parser at some point. Feel free to post your findings, fixes or enhancements in the comments.
```r
library(dplyr)
library(tidyr)
library(stringi)
library(DT)

# Grab the current snapshot once, then load the `ghcran` data frame from disk
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata",
              "ghcran.Rdata", mode="wb")
load("ghcran.Rdata")

ghcran$URL %>%
  # capture the first "owner/repo" that follows github.com in each URL field
  stri_match_first_regex("://github\\.com/(.*?/[[:alnum:]_\\-\\.]+)") %>%
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("url", "repos")) %>%
  filter(!is.na(repos)) %>%                                  # drop rows with no GitHub match
  separate(repos, c("author", "repo"), "/", extra="drop") %>%
  count(author) %>%                                          # packages per GitHub account
  arrange(desc(n)) %>%
  datatable()
```
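To see what the extraction step does without downloading anything, here is a small base R sketch of the same idea run against a handful of made-up `URL` values. The URLs and account names are hypothetical, and the pattern is a base R approximation of the `stringi` one above rather than an exact copy. It also shows two of the edge cases mentioned earlier: a field holding several comma-separated URLs and a link that points below the repository root.

```r
# Hypothetical URL field values, including a multi-URL field and a sub-path
urls <- c(
  "https://github.com/exampleuser/pkgone",
  "http://example.org/docs, https://github.com/exampleuser/pkgtwo",
  "https://github.com/otheruser/pkgthree/issues",
  "http://example.org/no-github-here"
)

# Capture the account and the repository name that follow github.com
pattern <- "://github\\.com/([^/[:space:],]+)/([[:alnum:]_.-]+)"
m <- regmatches(urls, regexec(pattern, urls))

# Entries with no match come back as character(0); drop them
hits <- Filter(length, m)

# Element 1 is the full match, 2 the account, 3 the repository
owners <- vapply(hits, `[`, character(1), 2)
repos  <- vapply(hits, `[`, character(1), 3)

# Packages per account, most prolific first
counts <- sort(table(owners), decreasing = TRUE)
```

Only the first GitHub link in each field is picked up here, which mirrors the "first match" behaviour of the pipeline above.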