CRAN Packages on GitHub (and some CRAN DESCRIPTION observations)

Just about a week ago @thosjleeper posited a question on twitter w/r/t how many CRAN packages had associations with GitHub (i.e. how many use GitHub for development). The `DESCRIPTION` file (that comes with all R packages) has some fields that can house this information, and most folks who use GitHub for R package development seem to put the repository URL in the `URL` field, which looks something like this:

`URL: http://github.com/ropenscilabs/geoparser`

(that’s from @ma_salmon’s ++good `geoparser` package)
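
If you want to peek at that field for a package you have installed, base R will hand it to you without any parsing on your part (using `geoparser` here purely as an example, assuming you have it installed):

# pull just the URL field from an installed package's DESCRIPTION;
# with a single field requested this returns a plain character string
packageDescription("geoparser", fields = "URL")
## [1] "http://github.com/ropenscilabs/geoparser"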

There may be traces of GitHub URLs in other fields, but I took this initial naïve assumption and added a step to my daily home CRAN mirror scripts (what, you don’t have your own CRAN mirror at home, too?) to generate two files that you, the R community, can use whenever you want to inspect GitHub R packages:

– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata) (R data file)
– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json) (json file)

You can use:

`load(url("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata"))`

or

`jsonlite::fromJSON("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json")`

to read these files, but they only change once a day, so you might want to `download.file()` them vs waste bandwidth re-reading them intra-day.
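
A minimal caching pattern looks something like this (the local filename is arbitrary):

# grab a local copy once...
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json",
              "ghcran.json")

# ...then read from disk as often as you like
ghcran <- jsonlite::fromJSON("ghcran.json")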

As of this post there were 1,544 packages meeting my naïve criteria.

One interesting side-discovery of this effort was that there are 122 “distinct” `DESCRIPTION` fields in use, but some of those are just mixed-case versions of each other (118 unique once you fold case). Plus, there doesn’t seem to be as hard and fast a rule on the field names as one might think. Some examples:

– `acknowledgement`, `acknowledgements`, `acknowledgments`
– `bioviews`, `biocviews`
– `keyword`, `keywords`
– `reference`, `references`, `reference manual`
– `systemrequirement`, `systemrequirements`, `systemrequirementsnote`
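
Since `DESCRIPTION` files are plain Debian Control Format, base R's `read.dcf()` will give you the field names, so spotting the case-folded duplicates is straightforward. A minimal sketch (the path is a placeholder; across a whole mirror you'd accumulate the names per-package):

# DESCRIPTION is DCF, so read.dcf() returns a matrix whose
# column names are the field names
fields <- colnames(read.dcf("path/to/DESCRIPTION"))

length(unique(fields))           # "distinct" field names
length(unique(tolower(fields)))  # unique ones after case-folding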

You can see the usage counts for all the fields in the table below:

[interactive DT table: `DESCRIPTION` field usage counts]

But, I digress.

Who has the most CRAN packages with associated GitHub repositories? The code below _mostly_ answers this. I say “mostly” since I don’t handle edge cases in the `URL` field (look at it to see what I mean). It’s also possible that there are traces of GitHub in other fields, and I’ll address those in my local CRAN parser at some point. Feel free to post your findings, fixes or enhancements in the comments.

library(dplyr)
library(tidyr)
library(stringi)
library(DT)

# mode="wb" keeps Windows from translating line endings in the
# binary Rdata file (which corrupts it on load)
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata",
              "ghcran.Rdata", mode="wb")

load("ghcran.Rdata")

ghcran$URL %>% 
  stri_match_first_regex("://github.com/(.*?/[[:alnum:]_\\-\\.]+)") %>%  # capture "owner/repo"
  as.data.frame(stringsAsFactors=FALSE) %>% 
  setNames(c("url", "repos")) %>% 
  filter(!is.na(repos)) %>%                                # drop packages with no GitHub match
  separate(repos, c("author", "repo"), "/", extra="drop") %>% 
  count(author) %>%                                        # tally repos per GitHub account
  arrange(desc(n)) %>% 
  datatable()
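
One of those `URL` edge cases: the field may hold several URLs separated by commas or whitespace, and the GitHub one isn't always first. A slightly more forgiving sketch (same naïve regex, just applied to each URL in the field):

# split each URL field on commas/whitespace, then take the first
# GitHub match from any of the pieces
repos <- sapply(stri_split_regex(ghcran$URL, "[,\\s]+"), function(urls) {
  m <- stri_match_first_regex(urls, "://github.com/(.*?/[[:alnum:]_\\-\\.]+)")[,2]
  m[which(!is.na(m))[1]]  # first GitHub hit, NA if none
})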



