CRAN Packages on GitHub (and some CRAN DESCRIPTION observations)

About a week ago, @thosjleeper wondered on Twitter how many CRAN packages are associated with GitHub (i.e. how many use GitHub for development). The `DESCRIPTION` file (which comes with every R package) has some fields that can hold this information, and most folks who use GitHub for R package development seem to put the repository URL in the `URL` field, which looks something like this:

`URL: http://github.com/ropenscilabs/geoparser`

(that’s from @ma_salmon’s ++good `geoparser` package)
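
If you want to poke at that field yourself, here is a minimal sketch using base R's `read.dcf()`. The `DESCRIPTION` contents are a made-up stand-in written to a temp file (only the `URL` line comes from the example above), and the `grepl()` test is just the same naïve "does it mention GitHub" check, not a robust parser.

```r
# Hypothetical DESCRIPTION contents; only the URL line is taken from the post
desc_lines <- c(
  "Package: geoparser",
  "Title: Placeholder title for illustration",
  "URL: http://github.com/ropenscilabs/geoparser"
)
desc_file <- tempfile()
writeLines(desc_lines, desc_file)

# read.dcf() returns a character matrix with one column per requested field
desc <- read.dcf(desc_file, fields = c("Package", "URL"))
gh_url <- unname(desc[1, "URL"])

# Naive check: does the URL field mention github.com at all?
on_github <- grepl("github.com", gh_url, fixed = TRUE)
```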

There may be traces of GitHub URLs in other fields, but I took this initial naïve assumption and added a step to my daily home CRAN mirror scripts (what, you don’t have your own CRAN mirror at home, too?) to generate two files that you, the R community, can use whenever you want to inspect GitHub R packages:

– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata) (R data file)
– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json) (json file)

You can use:

`load(url("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata"))`

`jsonlite::fromJSON("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json")`

to read these files, but they only change once a day, so you might want to `download.file()` them rather than waste bandwidth re-reading them throughout the day.
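
One way to do that is a small "only re-download when stale" wrapper. This is a rough sketch, not something from my mirror scripts: `fetch_if_stale()`, its `max_age_hours` default and the `fetch` argument (there so the downloader can be swapped out) are all names I made up for illustration.

```r
# Hypothetical helper: re-download `url` to `dest` only when the local copy
# is missing or older than `max_age_hours` (the files change once a day).
fetch_if_stale <- function(url, dest, max_age_hours = 24,
                           fetch = utils::download.file) {
  is_fresh <- file.exists(dest) &&
    as.numeric(difftime(Sys.time(), file.mtime(dest), units = "hours")) <= max_age_hours
  if (is_fresh) {
    return(invisible(FALSE))  # local copy is recent enough, skip the download
  }
  fetch(url, dest, mode = "wb")  # binary mode, since one of the files is .Rdata
  invisible(TRUE)
}

# Usage (not run here, needs network access):
# fetch_if_stale(
#   "http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json",
#   "ghcran.json"
# )
# ghcran <- jsonlite::fromJSON("ghcran.json")
```

The function returns `TRUE` (invisibly) when it actually downloaded and `FALSE` when it reused the local file, which is handy if you want to log cache hits.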

As of this post there were 1,544 packages meeting my naïve criteria.

One interesting side discovery from this effort: there are 122 “distinct” `DESCRIPTION` fields in use, but some of those are mixed-case versions of each other (118 unique ones once case is ignored). Plus, there doesn’t seem to be as hard and fast a rule about the fields as one might think. Some examples:

– `acknowledgement`, `acknowledgements`, `acknowledgments`
– `bioviews`, `biocviews`
– `keyword`, `keywords`
– `reference`, `references`, `reference manual`
– `systemrequirement`, `systemrequirements`, `systemrequirementsnote`
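
The "distinct" vs "unique" tallies boil down to counting field names before and after case-folding. Here is a toy sketch of that idea; the `field_names` vector is invented for illustration and is not the real CRAN data, so the counts it produces have nothing to do with the 122/118 figures.

```r
# Hypothetical input: field names gathered from a handful of DESCRIPTION files
field_names <- c("Package", "URL", "Keywords", "keywords", "Keyword",
                 "BugReports", "Acknowledgements", "Acknowledgments", "URL")

n_distinct_as_written <- length(unique(field_names))           # case-sensitive
n_distinct_folded     <- length(unique(tolower(field_names)))  # case-insensitive

# Usage counts per case-folded field name, most common first
field_counts <- sort(table(tolower(field_names)), decreasing = TRUE)
```

Note that case-folding only merges `Keywords`/`keywords`; near-duplicates like `keyword` vs `keywords` still count separately, which is the kind of inconsistency listed above.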

You can see the usage counts for all the fields in the table below:

But, I digress.

Who has the most CRAN packages with associated GitHub repositories? The code below _mostly_ answers this. I say “mostly” since I don’t handle edge cases in the `URL` field (look at it to see what I mean). It’s also possible that there are traces of GitHub in other fields, and I’ll address those in my local CRAN parser at some point. Feel free to post your findings, fixes or enhancements in the comments.

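```r
library(dplyr)
library(tidyr)
library(stringi)
library(DT)

# Grab the daily file once and work from the local copy
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata",
              "ghcran.Rdata")

# Loads the `ghcran` data frame into the workspace
load("ghcran.Rdata")

ghcran$URL %>% 
  # first "owner/repo" match per URL field; column 1 is the full match, column 2 the capture
  stri_match_first_regex("://github.com/(.*?/[[:alnum:]_\\-\\.]+)") %>% 
  as.data.frame(stringsAsFactors=FALSE) %>% 
  setNames(c("url", "repos")) %>% 
  filter(!is.na(repos)) %>%                                  # drop fields with no GitHub URL
  separate(repos, c("author", "repo"), "/", extra="drop") %>% 
  count(author) %>%                                          # packages per GitHub account
  arrange(desc(n)) %>% 
  datatable()
```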
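
As for those edge cases: a `URL` field can hold several comma-separated URLs, and the snippet above only looks at the first GitHub match in each. Below is one possible way to collect every `owner/repo` pair, sketched in base R so it needs no extra packages. The `url_field` values are invented examples (`someuser/somepkg` is not a real repository that I know of), and the pattern is my guess at a reasonable one, not what my CRAN parser uses.

```r
# Hypothetical URL field values showing the edge cases: several URLs in one
# field, https vs http, trailing slashes, and non-GitHub homepages.
url_field <- c(
  "http://github.com/ropenscilabs/geoparser",
  "https://example.org/pkg, https://github.com/someuser/somepkg/",
  "https://example.org/only-a-homepage"
)

# Grab every owner/repo pair, not just the first match per field
pattern <- "github\\.com/[[:alnum:]_.-]+/[[:alnum:]_.-]+"
hits <- regmatches(url_field, gregexpr(pattern, url_field))

# Strip the host and split into owner and repo
pairs <- sub("^github\\.com/", "", unlist(hits))
parts <- strsplit(pairs, "/", fixed = TRUE)
repos <- data.frame(
  author = vapply(parts, `[`, character(1), 1),
  repo   = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```

The same extraction could be pointed at other fields as well, though a repo name followed by `.git` or a stray trailing period would still need extra cleanup.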

6 Comments

- Seth
- Posted 2016-07-10 at 12:31
Wow. I’m the only person to put my Twitter handle in the description.
- David F. Severski
- Posted 2016-07-10 at 13:20
Neat parsing! Two quick quibbles – your jsonlite::fromJSON example is referencing the Rdata file instead of the JSON file and the Rdata file, when loaded under Win R3.3.1 gives a corruption error of “input has been corrupted, with LF replaced by CR”. The JSON file parses cleanly.
- - hrbrmstr
  - Posted 2016-07-10 at 13:54
  thx. :-) fixed #1 and give #2 a try (replaced a writeLines with just write in the bits that run on my server)
- gaborcsardi
- Posted 2016-07-11 at 03:26
Nice analysis. You could also look at BugReports, as many packages link that to GH and URL to a proper homepage. E.g. http://crandb.r-pkg.org/igraph

Btw. you don’t need a local mirror for this, DESCRIPTION files are available “live” from CRANDB, you can download all of them with a single query, just remove the limit from this: http://crandb.r-pkg.org/-/latest?limit=3
- - Marcin Kosiński
  - Posted 2016-07-11 at 10:59
  I would say BugReports it’s better, since I’m also using URL for my gh-page
- - hrbrmstr
  - Posted 2016-07-11 at 11:07
  Aye. +1 for CRANDB. I’ve got a local CRAN mirror for other reasons (part of which is local CI).