
The NIH is [moving forward](http://www.npr.org/sections/health-shots/2016/08/04/488387729/nih-plans-to-lift-ban-on-research-funds-for-part-human-part-animal-embryos) with plans to financially support & encourage human-animal chimera research.

You can find more info over at the [NIH blog](http://osp.od.nih.gov/under-the-poliscope/2016/08/next-steps-research-using-animal-embryos-containing-human-cells).

Chimeras have been a longstanding subject of science fiction and fantasy, and many authors have explored them to help inform the ethics debate. A fairly recent exploration of this has been through the [Fullmetal Alchemist](http://www.fullmetalalchemist.com/) anime/manga series. TL;DR: it doesn’t go so well, even with animal-animal chimera hybrids.

Yes, that’s fiction, and the current NIH proposals are nowhere near as audacious as what’s described in the FMA series. But, surprisingly, you won’t find many vocal critics of human-animal chimera research, since the modern “open scientific community” is pretty harshly judgmental of anyone who tries to limit or challenge “science” in any way (since it’s their religion; everyone believes in something, whether they claim to or not). Open, logical and, more importantly, effective criticisms of purported “progress” are often career-limiting moves. All this at a time in history when the current generation of scientists seems to excel at ignoring the potential for unintended consequences of their work.

Folks can (and should) [add their comments](http://grants.nih.gov/grants/rfi/rfi.cfm?ID=57) either for or against this proposal. Not commenting means you agree with the NIH plans and are fine with your U.S. tax dollars & government resources going to support this research; it also means you are on the hook when this eventually goes horribly, horribly wrong.

For those commenting to show their _lack_ of support, I augmented a statement from an [interview with Dr. Stuart Newman](http://www.beliefnet.com/news/science-religion/2005/05/the-peril-and-promise-of-mix-and-match-biotech.aspx), who is a vocal detractor of human-animal chimeras (so much so that he tried to prevent the practice through the USPTO process, an effort which eventually failed), for my submission:

I agree with Dr. Stuart A. Newman that, like every human activity, biotechnology is open to wise and foolish uses. The profit motive, coupled with an uncritical acceptance of the notion that new technology is the main way to human advancement, often leads to hype and incautious applications. In fact, existing technologies – sanitation, keeping water and air unpolluted, enabling poor people to eat enough and well-off people not too much, providing birth control and maternal and infant health services – would save more lives over the coming century than all foreseeable biotechnological applications.

I see no viable way for the NIH to prevent wanton abuse once they open the doors to this type of research. As a taxpayer, a well-read individual and someone who does have a sense of morality, I would rather precious, scant financial and bureaucratic resources go into known, proven endeavors that can have substantial, real, immediate impact.

As the translation from the FMA series states: _”Humans have a limitless desire to use their knowledge in real life…the desire to see what you can do with the power that is given to them…the desire to understand all the secrets in this world and experiment with them.”_ My reasoning and my faith suggest that there are definitely doors that should remain closed.

From: [Why Corporate America Is Leaving the Suburbs for the City](http://nyti.ms/2adP5Rn):

>_We wanted energy, vibrancy and diversity, and to accelerate a change in our culture by moving downtown._

translation:

>_We want to begin a process of strategically removing more highly paid, legacy employees who can’t commit 12 hours a day to our company and replacing them with younger folks we can take advantage of._

The move to forgo the addition of parking spaces in these new city HQs and encourage the use of mass transit is also interesting, given the [current state](http://www.salon.com/2015/03/01/american_mass_transit_is_dying/) [of mass transit systems](http://www.sfchronicle.com/bayarea/article/BART-shutdown-underscores-aging-system-s-6916061.php) [in America](http://thehill.com/policy/transportation/289278-dc-metro-proposes-permanent-earlier-closing-times-on-subway). Will these corporations be kicking in greenbacks for infrastructure/capacity improvement? Methinks not.

Remember, kids, these are soulless, giant, multinational corporations that place “shareholder value” over **everything else**. Also, remember that you’ll be a “legacy” worker someday, too.

Hopefully some startup will jump in to buy up all the forthcoming empty suburban campus spaces and turn them into indoor farms.

This is another purrr-focused post but it’s also an homage to the nascent magick package (R interface to ImageMagick) by @opencpu.

We’re starting to see/feel the impact of the increasing drought up here in southern Maine. I’ve used the data from the U.S. Drought Monitor before on the blog, but they also provide shapefiles and this seemed like a good opportunity to further demonstrate the utility of purrr and make animations directly using magick. Plus, I wanted to see the progression of the drought. Putting library() statements for purrr, magick and broom together was completely random, but I now feel compelled to find a set of functions to put into a cauldron package. But, I digress.

What does this demonstrate?

Apart from giving you an idea of the extent of the drought, working through this will help you:

  • use the quietly() function (which captures a function’s warnings instead of spewing them to the console; there’s a quick sketch of it just after this list)
  • see another example of a formula function
  • illustrate the utility of map_df(), and
  • see how to create an animation pipeline for magick
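
A quick, minimal sketch of what quietly() does (a toy example, not part of the drought code): it wraps a function so that a call returns a list containing the result along with any captured output, warnings and messages, rather than emitting them.

library(purrr)

quiet_log <- quietly(log)
out <- quiet_log(-1)

out$result
## [1] NaN

out$warnings
## [1] "NaNs produced"

That $result element is also why you’ll see usa$result in the map code below.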

Comments are in the code and the drought gif is at the end. I deliberately only had it loop once, so refresh the image if you want to see the progression again. Also, drop a note in the comments if anything needs more exposition. (NOTE: I was fairly bad and did virtually no file cleanup in the function, so you’ll have half a year’s shapefiles in your getwd(). Consider the cleanup an exercise for the reader :-)

library(rgdal)
library(sp)
library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(ggthemes)
library(rgeos)

# the witch's brew
library(purrr)
library(broom)
library(magick)

#' Get a drought map shapefile and turn it into a PNG
drought_map <- function(wk) {

  # need to hush some chatty functions
  hush_tidy <- quietly(tidy)

  # some are more stubborn than others
  old_warn <- getOption("warn")
  options(warn=-1)

  week <- format(wk, "%Y%m%d")

  # get the drought shapefile only if we don't have it already
  URL <- sprintf("http://droughtmonitor.unl.edu/data/shapefiles_m/USDM_%s_M.zip", week)
  (fil <- basename(URL))
  if (!file.exists(fil)) download.file(URL, fil)
  unzip(fil)

  # read in the shapefile and reduce the polygon complexity
  dr <- readOGR(sprintf("USDM_%s.shp", week),
                sprintf("USDM_%s", week),
                verbose=FALSE,
                stringsAsFactors=FALSE)

  dr <- SpatialPolygonsDataFrame(gSimplify(dr, 0.01, TRUE), dr@data)

  # separate out each drought level into its own fortified data.frame
  map(dr$DM, ~subset(dr, DM==.)) %>%
    map(hush_tidy) %>%
    map_df("result", .id="DM") -> m

  # get a CONUS base map (probably could have done map_data("usa"), too)
  usa_composite() %>%
    subset(!(iso_3166_2 %in% c("AK", "HI"))) %>%
    hush_tidy() -> usa

  usa <- usa$result # an artifact of using quietly()

  # this is all Ushey's fault. the utility of cmd-enter to run
  # the entire ggplot2 chain (in RStudio) turns out to have a
  # greater productivity boost (I measured it) than my shortcuts for
  # gg <- gg + snippets and hand-editing the "+" bits out when
  # editing old plot constructs. I'm not giving up on gg <- gg + tho

  # Note putting the "base" layer on top since we don't really
  # want to deal with alpha levels of the drought polygons and
  # we're only plotting the outline of the us/states, not filling
  # the interior(s).

  ggplot() +
    geom_map(data=m, map=m,
             aes(long, lat, fill=DM, map_id=id),
             color="#2b2b2b", size=0.05) +
    geom_map(data=usa, map=usa, aes(long, lat, map_id=id),
             color="#2b2b2b88", fill=NA, size=0.1) +
    scale_fill_brewer("Drought Level", palette="YlOrBr") +
    coord_map("polyconic", xlim=c(-130, -65), ylim=c(25, 50)) +
    labs(x=sprintf("Week: %s", wk)) +
    theme_map() +
    theme(axis.title=element_text()) +
    theme(axis.title.x=element_text()) +
    theme(axis.title.y=element_blank()) +
    theme(legend.position="bottom") +
    theme(legend.direction="horizontal") -> gg

  options(warn=old_warn) # put things back the way they were

  outfil <- sprintf("gg-dm-%s.png", wk)
  ggsave(outfil, gg, width=8, height=5)

  outfil

}

# - create a vector of weeks (minus the current one)
# - create the individual map PNGs
# - read the individual map PNGs into a list
# - join the images together
# - create the animated gif structure
# - write the gif to a file

seq(as.Date("2016-01-05"), Sys.Date(), by="1 week") %>%
  head(-1) %>%
  map(drought_map) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps=2, loop=1) %>%
  image_write("drought.gif")

NOTE: an updated, comment-free version of the above code block is in this gist and uses spdplyr::filter() vs subset(), keeps downloaded files tidy in a temporary directory and includes a progress bar vs raw, ugly download.file() messages.

I’ve converted the vast majority of my *apply usage over to purrr functions. In an attempt to make this a quick post, I’ll refrain from going into all the benefits of the purrr package. Instead, I’ll show just one thing that’s super helpful: formula functions.

After seeing this Quartz article using a visualization to compare the frequency and volume of mass shootings, I wanted to grab the data to look at it with a stats-eye (humans are ++gd at visually identifying patterns, but we’re also ++gd at misinterpreting them, plus stats validates visual assumptions). I’m not going into that here, but will use the grabbing of the data to illustrate the formula functions. Note that there’s quite a bit of “setup” here for just one example, so I guess I kinda am attempting to shill the purrr package and the “piping tidyverse” just a tad.

If you head on over to the site with the data you’ll see you can download files for all four years. In theory, these are all individual years, but the names of the files gave me pause:

  • MST Data 2013 - 2015.csv
  • MST Data 2014 - 2015.csv
  • MST Data 2015 - 2015.csv
  • Mass Shooting Data 2016 - 2016.csv

So, they may all be individual years, but the naming consistency isn’t there and it’s better to double check than to assume.

First, we can check to see if the column names are the same (we can eyeball this since there are only four files and a small # of columns):

library(purrr)
library(readr)

dir() %>% 
  map(read_csv) %>% 
  map(colnames)

## [[1]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[2]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[3]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[4]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"

A quick inspection of the date column shows it’s in month/day/year format and we want to know if each file only spans one year. This is where the elegance of the formula function comes in:

library(lubridate)

dir() %>% 
  map(read_csv) %>% 
  map(~range(mdy(.$date))) # <--- the *entire* post was to show this one line ;-)

## [[1]]
## [1] "2016-01-06" "2016-07-25"
## 
## [[2]]
## [1] "2013-01-01" "2013-12-31"
## 
## [[3]]
## [1] "2014-01-01" "2014-12-29"
## 
## [[4]]
## [1] "2015-01-01" "2015-12-31"

To break that down a bit:

  • dir() returns a vector of filenames in the current directory
  • the first map() reads each of those files in and creates a list with four elements, each being a tibble (data_frame / data.frame)
  • the second map() iterates over those data frames and calls a newly created anonymous function which converts the date column to a proper Date data type then gets the range of those dates, ultimately resulting in a four element list, with each element being a two element vector of Dates

For you “basers” out there, this is what that looks like old school style:

fils <- dir()
dfs <- lapply(fils, read.csv, stringsAsFactors=FALSE)
lapply(dfs, function(x) range(as.Date(x$date, format="%m/%e/%Y")))

or

lapply(dir(), function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  range(as.Date(df$date, format="%m/%e/%Y"))
})

You eliminate the function(x) { } boilerplate and get pre-defined variables (either .x or ., and, if needed, .y) to compose your maps and pipes very cleanly and succinctly, while still being super-readable.
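
If it helps, here’s a tiny, self-contained illustration of the shorthand (toy data, not the shootings files) in both its one- and two-argument forms:

library(purrr)

map_dbl(1:3, ~ .x ^ 2)   # same as function(x) x ^ 2; "." also works here
## [1] 1 4 9

map2_chr(letters[1:3], 1:3, ~ paste0(.x, .y))   # two inputs: .x and .y
## [1] "a1" "b2" "c3"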

After performing this inspection (i.e. confirming that each file contains incidents for only a single year), we can now automate the data ingestion:

library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)

read_html("https://www.massshootingtracker.org/data") %>% 
  html_nodes("a[href^='https://docs.goo']") %>% 
  html_attr("href") %>% 
  map_df(read_csv) %>% 
  mutate(date=mdy(date)) -> shootings

Here’s what that looks like w/o the tidyverse/piping:

library(XML)

doc <- htmlParse("http://www.massshootingtracker.org/data") # note the necessary downgrade to "http"

dfs <- xpathApply(doc, "//a[contains(@href, 'https://docs.goo')]", function(x) {
  csv <- xmlGetAttr(x, "href")
  df <- read.csv(csv, stringsAsFactors=FALSE)
  df$date <- as.Date(df$date, format="%m/%e/%Y")
  df
})

shootings <- do.call(rbind, dfs)

Even hardcore “basers” may have to admit that the piping/tidyverse version is ultimately better.

Give the purrr package a first (or second) look if you haven’t switched over to it. Type safety, readable anonymous functions and C-backed fast functional idioms will mean that your code may ultimately be purrrfect.

UPDATE #1

I received a question in the comments regarding how I came to that CSS selector for the gdocs CSV URLs, so I made a quick video of the exact steps I took. Exposition below the film.

Right-click “Inspect” in Chrome is my go-to method for finding what I’m after. This isn’t the place to dive deep into the dark art of web page spelunking, but in this case, when I saw there were four similar anchor (<a>) tags that pointed to the CSV “files”, I took the easy way out and just built a selector based on the href attribute value (or, more specifically, the characters at the start of the href attribute). However, all four ways below end up targeting the same four elements:

pg <- read_html("https://www.massshootingtracker.org/data")

html_nodes(pg, "a.btn.btn-default")
html_nodes(pg, "a[href^='https://docs.goo']")
html_nodes(pg, xpath=".//a[@class='btn btn-default']")
html_nodes(pg, xpath=".//a[contains(@href, 'https://docs.goo')]")

UPDATE #2

Due to some feedback (an embedded tweet in the original post), I swapped out list.files() in favour of dir() (though, as someone who detests DOS/Windows, typing that function name is super-painful).

I’ve been updating some existing packages and GitHub-releasing new ones (before a CRAN push). Most are “cyber”-related, but there are some general purpose ones. Here’s a quick overview:

  • docxtractr (CRAN, now, v0.2.0) was initially designed to make it easy to get data tables out of MS Word (docx) documents. The update removes use of a deprecated xml2 package function and adds the ability to extract comments from Word docs.
  • slackr (CRAN, now v1.4.2) lets you pass R code, plots, data objects and arbitrary text to Slack from R. A Slack API change introduced some changes that broke the package. Said changes have been compensated for.
  • iptools (GitHub, v0.5.0) makes it super easy and quick to work with IPv4 (and some IPv6) addresses in R. The dev updates add NA support to checkers/validators (by @quominus), Hilbert-space coordinate generators and faster ways to test IPv4 address membership in CIDR blocks (one involves using the triebeard pkg by Oliver and works well with BGP dumps read in with mrt [below] and the other uses optimized integer searches).
  • hubway (GitHub, v0.1.0) provides programmatic access to the JSON data from the bike-sharing service Hubway. I’m using it to build a predictive model of bike availability in Boston.
  • myip (GitHub, v0.1.0) provides unified access to numerous “what’s my IP address?” services on the internet. May merge this with other work @wrathematics has been doing.
  • mrt (GitHub, v0.1.0) small, C-backed package that wraps libBGPdump and reads Route Views MRT BGP dumps.
  • algorithmia (GitHub, v0.1.0) provides an R wrapper to the Algorithmia web service, enabling use of a wide range of hosted algorithms using local or cloud data.
  • ssllabs (GitHub, v0.1.0) provides an R wrapper to the SSL Labs (SSL/TLS cert checker) API
  • accidents (GitHub v0.1.0) an R data package containing historical U.S. NHTSA accident data.
  • htmltidy (GitHub, v0.1.0) an R wrapper to the HTML Tidy library that cleans up gnarly HTML/XHTML, making it easier to parse with rvest.
  • gdns (GitHub, v0.1.0) an R wrapper to the Google “DNS over HTTPS” API.
  • ohby (GitHub, v0.1.0) an R wrapper to the nascent ohby URL & content shortener

If any of the GitHub-only pkgs are of [major] use to folks, let me know and I’ll prioritize getting CI tests wired up and CRAN submissions started.

In other news, ggalt is gearing up for an August CRAN release, so if you have any ggplot2 extensions that need a home, fork that repo and PR them my way before the middle of the month.

Finally, I’ve had many folks contribute code and bug reports to packages and wanted to close with a YUGE “thank you!” to all of you who did so.

The insanely productive elf-lord @quominus put together a small package ([`triebeard`](https://github.com/ironholds/triebeard)) that exposes an API for [radix/prefix tries](https://en.wikipedia.org/wiki/Trie) at both the R and Rcpp levels. I know he had some personal needs for this, and we both kinda need it to augment some functions in our `iptools` package. Despite `triebeard` having both a vignette and function-level examples, I thought it might be good to show a real-world use of the package (at least in the cyber real world): fast determination of which [autonomous system](https://en.wikipedia.org/wiki/Autonomous_system_(Internet)) an IPv4 address is in (if it’s in one at all).

I’m not going to delve too deep into routing (you can find a good primer [here](http://www.kixtart.org/forums/ubbthreads.php?ubb=showflat&Number=81619&site_id=1#import) and one that puts routing in the context of radix tries [here](http://www.juniper.net/documentation/en_US/junos14.1/topics/usage-guidelines/policy-configuring-route-lists-for-use-in-routing-policy-match-conditions.html)), but there exist, essentially, abbreviated tables of which IP addresses belong to a particular network. These tables live in routers on your local networks and across the internet. Groups of these networks (on the internet) are composed into those autonomous systems I mentioned earlier, and these tables are used to get the packets that make up the cat videos you watch routed to you as efficiently as possible.

When dealing with cybersecurity data science, it’s often useful to know which autonomous system an IP address belongs in. The world is indeed full of peril and in it there are many dark places. It’s a dangerous business, going out on the internet and we sometimes find it possible to identify unusually malicious autonomous systems by looking up suspicious IP addresses en masse. These mappings look something like this:

CIDR            ASN
1.0.0.0/24      47872
1.0.4.0/24      56203
1.0.5.0/24      56203
1.0.6.0/24      56203
1.0.7.0/24      38803
1.0.48.0/20     49597
1.0.64.0/18     18144

Each CIDR has a start and end IP address which can ultimately be converted to integers. Now, one _could_ just sequentially compare start and end ranges to see which CIDR an IP address belongs in, but there are (as of the day of this post) `647,563` CIDRs to compare against, which—in the worst case—would mean having to traverse through the entire list to find the match (or discover there is no match). There are some trivial ways to slightly optimize this, but the search times could still be fairly long, especially when you’re trying to match a billion IPv4 addresses to ASNs.

By storing each network prefix (the first `mask` bits of the CIDR’s base address, where `mask` is the number after the `/`) in binary form (strings of 1’s and 0’s) as the trie keys, we get much faster lookups (at most ~32 comparisons in the worst case vs 647,563).
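
To make that concrete before building the real thing, here’s a toy example (with made-up, non-routing prefixes) of longest-prefix matching in `triebeard`:

library(triebeard)

toy <- trie(keys=c("0000", "000010", "11"),
            values=c("net A", "net B", "net C"))

longest_match(toy, c("000010111", "110001", "011111"))
## [1] "net B" "net C" NA

The first lookup matches both "0000" and "000010" and the longer prefix wins; the last lookup matches nothing, so we get an NA (which will come in handy later).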

I made an initial, naïve, mostly-straight-R implementation as a precursor to a lower-level Rcpp implementation in our `iptools` package, and to illustrate this use of the `triebeard` package.

One thing we’ll need is a function to convert an IPv4 address (in long integer form) into a binary character string. We _could_ do this with base R, but it’ll be super-slow and it doesn’t take much effort to create it with an Rcpp inline function:

library(Rcpp)
library(inline)
library(iptools) # for ip_to_numeric(), used below

ip_to_binary_string <- rcpp(signature(x="integer"), "
  NumericVector xx(x);

  std::vector<double> X(xx.begin(),xx.end());
  std::vector<std::string> output(X.size());

  for (unsigned int i=0; i<X.size(); i++){

    if ((i % 10000) == 0) Rcpp::checkUserInterrupt();

    output[i] = std::bitset<32>(X[i]).to_string();

  }

  return(Rcpp::wrap(output));
")

ip_to_binary_string(ip_to_numeric("192.168.1.1"))
## [1] "11000000101010000000000100000001"

We take a vector from R and use some C++ standard library functions to convert them to bits. I vectorized this in C++ for speed (which is just a fancy way to say I used a `for` loop). In this case, our short cut will not make for a long delay.
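
For comparison, a dependency-free base R version might look like the sketch below (same idea, just noticeably slower on millions of addresses since the loop happens in R):

# assumes x is a numeric (double) vector of IPv4 addresses, as returned
# by iptools::ip_to_numeric()
ip_to_binary_string_r <- function(x) {
  vapply(x, function(ip) paste0(floor(ip / 2^(31:0)) %% 2, collapse=""), character(1))
}

ip_to_binary_string_r(ip_to_numeric("192.168.1.1"))
## [1] "11000000101010000000000100000001"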

Now, we’ll need a CIDR file. There are [historical ones](http://data.4tu.nl/repository/uuid:d4d23b8e-2077-4592-8b47-cb476ad16e12) available, and I use one that I generated the day of this post (and referenced in the code block below). You can use [`pyasn`](https://github.com/hadiasghari/pyasn) to make new ones daily (relegating mindless, automated, menial data retrieval tasks to the python goblins, like one should).

library(iptools)
library(stringi)
library(dplyr)
library(purrr)
library(readr)
library(tidyr)

asn_dat_url <- "http://rud.is/dl/asn-20160712.1600.dat.gz"
asn_dat_fil <- basename(asn_dat_url)
if (!file.exists(asn_dat_fil)) download.file(asn_dat_url, asn_dat_fil)

rip <- read_tsv(asn_dat_fil, comment=";", col_names=c("cidr", "asn"))
rip %>%
  separate(cidr, c("ip", "mask"), "/") %>%
  mutate(prefix=stri_sub(ip_to_binary_string(ip_to_numeric(ip)), 1, mask)) -> rip_df

rip_df
## # A tibble: 647,557 x 4
##           ip  mask   asn                   prefix
##        <chr> <chr> <int>                    <chr>
## 1    1.0.0.0    24 47872 000000010000000000000000
## 2    1.0.4.0    24 56203 000000010000000000000100
## 3    1.0.5.0    24 56203 000000010000000000000101
## 4    1.0.6.0    24 56203 000000010000000000000110
## 5    1.0.7.0    24 38803 000000010000000000000111
## 6   1.0.48.0    20 49597     00000001000000000011
## 7   1.0.64.0    18 18144       000000010000000001
## 8  1.0.128.0    17  9737        00000001000000001
## 9  1.0.128.0    18  9737       000000010000000010
## 10 1.0.128.0    19  9737      0000000100000000100
## # ... with 647,547 more rows

You can save off that `data_frame` to an R data file to pull in later (but it’s pretty fast to regenerate).
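
If you do decide to cache it, something like this works (the filename is arbitrary):

saveRDS(rip_df, "rip_df.rds")    # cache the prefix table...
rip_df <- readRDS("rip_df.rds")  # ...and pull it back in later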

Now, we create the trie, using the prefix we calculated and a value we’ll piece together for this example:

library(triebeard)

rip_trie <- trie(rip_df$prefix, sprintf("%s/%s|%s", rip_df$ip, rip_df$mask, rip_df$asn))

Yep, that’s it. If you ran this yourself, it should have taken less than 2s on most modern systems to create the nigh 700,000 element trie.

Now, we’ll generate a million random IP addresses and look them up:

set.seed(1492)
data_frame(lkp=ip_random(1000000),
           lkp_bin=ip_to_binary_string(ip_to_numeric(lkp)),
           long=longest_match(rip_trie, lkp_bin)) -> lkp_df

lkp_df
## # A tibble: 1,000,000 x 3
##               lkp                          lkp_bin                long
##             <chr>                            <chr>               <chr>
## 1   35.251.195.57 00100011111110111100001100111001  35.248.0.0/13|4323
## 2     28.57.78.42 00011100001110010100111000101010                <NA>
## 3   24.60.146.202 00011000001111001001001011001010   24.60.0.0/14|7922
## 4    14.236.36.53 00001110111011000010010000110101                <NA>
## 5   7.146.253.182 00000111100100101111110110110110                <NA>
## 6     2.9.228.172 00000010000010011110010010101100     2.9.0.0/16|3215
## 7  108.111.124.79 01101100011011110111110001001111 108.111.0.0/16|3651
## 8    65.78.24.214 01000001010011100001100011010110   65.78.0.0/19|6079
## 9   50.48.151.239 00110010001100001001011111101111   50.48.0.0/13|5650
## 10  97.231.13.131 01100001111001110000110110000011   97.128.0.0/9|6167
## # ... with 999,990 more rows

On most modern systems, that should have taken less than 3s.

The `NA` values are not busted lookups. Many IP networks are assigned but not accessible (see [this](https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks) for more info). You can validate this with `cymruservices::bulk_origin()` on your own, too.

The trie structure for these CIDRs takes up approximately 9MB of RAM, a small price to pay for speedy lookups (and, memory really is not what the heart desires, anyway). Hopefully the `triebeard` package will help you speed up your own lookups, and stay tuned for a new version of `iptools` with some new and enhanced functions.

Just about a week ago @thosjleeper posited something on twitter w/r/t how many CRAN packages had associations with GitHub (i.e. how many use GitHub for development). The `DESCRIPTION` file (that comes with all R packages) has some fields that can house this information, and most folks who do use GitHub for R package development seem to use the `URL` field to hold the repository URL, which looks something like this:

`URL: http://github.com/ropenscilabs/geoparser`

(that’s from @ma_salmon’s ++good `geoparser` package)

There may be traces of GitHub URLs in other fields, but I took this initial naïve assumption and added a step to my daily home CRAN mirror scripts (what, you don’t have your own CRAN mirror at home, too?) to generate two files that you, the R community, can use whenever you want to inspect GitHub R packages:

– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata) (R data file)
– [`http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json`](http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json) (json file)

You can use:

`load(url("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata"))`

or

`jsonlite::fromJSON("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json")`

to read these files but they change only daily, so you might want to `download.file()` them vs waste bandwidth re-reading them intra-day.

As of this post there were 1,544 packages meeting my naïve criteria.

One interesting side-discovery of this effort was to discover that there are 122 “distinct” `DESCRIPTION` fields in-use, but some of those are mixed-case versions of each other (118 total unique ones). Plus, there doesn’t seem to be as hard and fast a rule on the fields as one might think. Some examples:

– `acknowledgement`, `acknowledgements`, `acknowledgments`
– `bioviews`, `biocviews`
– `keyword`, `keywords`
– `reference`, `references`, `reference manual`
– `systemrequirement`, `systemrequirements`, `systemrequirementsnote`

You can see the usage counts for all the fields in the interactive table in the original post.

But, I digress.

Who has the most CRAN packages with associated GitHub repositories? The code below _mostly_ answers this. I say “mostly” since I don’t handle edge cases in the `URL` field (look at it to see what I mean). It’s also possible that there are traces of GitHub in other fields, and I’ll address those in my local CRAN parser at some point. Feel free to post your findings, fixes or enhancements in the comments.

library(dplyr)
library(tidyr)
library(stringi)
library(DT)

download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata",
              "ghcran.Rdata")

load("ghcran.Rdata")

ghcran$URL %>% 
  stri_match_first_regex("://github.com/(.*?/[[:alnum:]_\\-\\.]+)") %>% 
  as.data.frame(stringsAsFactors=FALSE) %>% 
  setNames(c("url", "repos")) %>% 
  filter(!is.na(repos)) %>% 
  separate(repos, c("author", "repo"), "/", extra="drop") %>% 
  count(author) %>% 
  arrange(desc(n)) %>% 
  datatable()



The @pewresearch folks have been collecting political survey data for quite a while, and I noticed the [visualization below](http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/#interactive) referenced in a [Tableau vis contest entry](https://www.interworks.com/blog/rrouse/2016/06/24/politics-viz-contest-plotting-political-polarization):

[Screenshot: Pew Research Center, “Political Polarization and Growing Ideological Consistency” interactive]

Those are filled [frequency polygons](http://onlinestatbook.com/2/graphing_distributions/freq_poly.html), which are super-easy to replicate in ggplot2, especially since Pew even _kind of_ made the data available via their interactive visualization (it’s available in other Pew resources, just not as compact). So, we can look at all 5 study years for both the general population and politically active respondents with `ggplot2` facets, incorporating the use of `V8`, `dplyr`, `tidyr`, `purrr` and some R spatial functions along the way.

The first code block has the “data”, data transformations and initial plot code. The “data” is really javascript blocks picked up from the `view-source:` of the interactive visualization. We use the `V8` package to get this data then bend it to our will for visuals.

library(V8)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(rgeos)
library(sp)

ctx <- v8()
ctx$eval("
	var party_data = [
		[{
			name: 'Dem',
			data: [0.57,1.60,1.89,3.49,3.96,6.56,7.23,8.54,9.10,9.45,9.30,9.15,7.74,6.80,4.66,4.32,2.14,1.95,0.87,0.57,0.12]
		},{
			name: 'REP',
			data: [0.03,0.22,0.28,1.49,1.66,2.77,3.26,4.98,5.36,7.28,7.72,8.16,8.86,8.88,8.64,8.00,6.20,5.80,4.87,4.20,1.34]
		}],
		[{
			name: 'Dem',
			data: [1.22,2.78,3.28,5.12,6.15,7.77,8.24,9.35,9.73,9.19,8.83,8.47,5.98,5.17,3.62,2.87,1.06,0.75,0.20,0.15,0.04]
		}, {
			name: 'REP',
			data: [0.23,0.49,0.65,2.23,2.62,4.06,5.02,7.53,7.70,7.28,7.72,8.15,8.87,8.47,7.08,6.27,4.29,3.99,3.54,2.79,1.03]
		}],
		[{
			name: 'Dem',
			data: [2.07,3.57,4.21,6.74,7.95,8.41,8.58,9.07,8.98,8.46,8.47,8.49,5.39,3.62,2.11,1.98,1.00,0.55,0.17,0.17,0.00]
		}, {
			name: 'REP',
			data: [0.19,0.71,1.04,2.17,2.07,3.65,4.92,7.28,8.26,9.64,9.59,9.55,7.91,7.74,6.84,6.01,4.37,3.46,2.09,1.65,0.86]
		}],
		[{
			name: 'Dem',
			data: [2.97,4.09,4.28,6.65,7.90,8.37,8.16,8.74,8.61,8.15,7.74,7.32,4.88,4.82,2.79,2.07,0.96,0.78,0.41,0.29,0.02]
		}, {
			name: 'REP',
			data: [0.04,0.21,0.28,0.88,1.29,2.64,3.08,4.92,5.84,6.65,6.79,6.92,8.50,8.61,8.05,8.00,7.52,7.51,5.61,4.17,2.50]
		}],
		[{
			name: 'Dem',
			data: [4.81,6.04,6.57,7.67,7.84,8.09,8.24,8.91,8.60,6.92,6.69,6.47,4.22,3.85,1.97,1.69,0.66,0.49,0.14,0.10,0.03]
		}, {
			name: 'REP',
			data: [0.11,0.36,0.49,1.23,1.35,2.35,2.83,4.63,5.09,6.12,6.27,6.41,7.88,8.03,7.58,8.26,8.12,7.29,6.38,5.89,3.34]
		}],
	];

	var party_engaged_data = [
		[{
			name: 'Dem',
			data: [0.88,2.19,2.61,4.00,4.76,6.72,7.71,8.45,8.03,8.79,8.79,8.80,7.23,6.13,4.53,4.31,2.22,2.01,1.05,0.66,0.13]
		}, {
			name: 'REP',
			data: [0.00,0.09,0.09,0.95,1.21,1.67,2.24,3.22,3.70,6.24,6.43,6.62,8.01,8.42,8.97,8.48,7.45,7.68,8.64,7.37,2.53]
		}],
		[{
			name: 'Dem',
			data: [1.61,3.35,4.25,6.75,8.01,8.20,8.23,9.14,8.94,8.68,8.46,8.25,4.62,3.51,2.91,2.63,1.19,0.74,0.24,0.17,0.12]
		},{
			name: 'REP',
			data: [0.21,0.38,0.68,1.62,1.55,2.55,3.99,4.65,4.31,5.78,6.28,6.79,8.47,9.01,8.61,8.34,7.16,6.50,6.10,4.78,2.25]
		}],
		[{
			name: 'Dem',
			data: [3.09,4.89,6.22,9.40,9.65,9.20,8.99,6.48,7.36,7.67,6.95,6.22,4.53,3.79,2.19,2.02,0.74,0.07,0.27,0.27,0.00]
		}, {
			name: 'REP',
			data: [0.29,0.59,0.67,2.11,2.03,2.67,4.12,6.55,6.93,8.42,8.79,9.17,7.33,6.84,7.42,7.25,6.36,5.32,3.35,2.57,1.24]
		}],
		[{
			name: 'Dem',
			data: [6.00,5.24,5.11,7.66,9.25,8.25,8.00,8.09,8.12,7.05,6.59,6.12,4.25,4.07,2.30,1.49,0.98,0.80,0.42,0.16,0.06]
		}, {
			name: 'REP',
			data: [0.00,0.13,0.13,0.48,0.97,2.10,2.73,3.14,3.64,5.04,5.30,5.56,6.87,6.75,8.03,9.33,11.01,10.49,7.61,6.02,4.68]
		}],
		[{
			name: 'Dem',
			data: [9.53,9.68,10.35,9.33,9.34,7.59,6.67,6.41,6.60,5.21,4.84,4.47,2.90,2.61,1.37,1.14,0.73,0.59,0.30,0.28,0.06]
		}, {
			name: 'REP',
			data: [0.15,0.11,0.13,0.46,0.52,1.18,1.45,2.46,2.84,4.15,4.37,4.60,6.36,6.66,7.34,9.09,11.40,10.53,10.58,9.85,5.76]
		}],
	];
")

years <- c(1994, 1999, 2004, 2011, 2014)

# Transform the javascript data -------------------------------------------

party_data <- ctx$get("party_data")
map_df(1:length(party_data), function(i) {
  x <- party_data[[i]]
  names(x$data) <- x$name
  dat <- as.data.frame(x$data)
  bind_cols(dat, data_frame(x=-10:10, year=rep(years[i], nrow(dat))))
}) -> party_data

party_engaged_data <- ctx$get("party_engaged_data")
map_df(1:length(party_engaged_data), function(i) {
  x <- party_engaged_data[[i]]
  names(x$data) <- x$name
  dat <- as.data.frame(x$data)
  bind_cols(dat, data_frame(x=-10:10, year=rep(years[i], nrow(dat))))
}) -> party_engaged_data

# We need it in long form -------------------------------------------------

gather(party_data, party, pct, -x, -year) %>%
  mutate(party=factor(party, levels=c("REP", "Dem"))) -> party_data_long

gather(party_engaged_data, party, pct, -x, -year) %>%
  mutate(party=factor(party, levels=c("REP", "Dem"))) -> party_engaged_data_long

# Traditional frequency polygon plots -------------------------------------

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party), alpha=0.5)
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (General Population)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_engaged_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party), alpha=0.5)
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (Politically Active)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

[Plot: Political Polarization, 1994-2014 (General Population), alpha-blended ribbons]

[Plot: Political Polarization, 1994-2014 (Politically Active), alpha-blended ribbons]

That approach provides a similar effect to the Pew & Interworks visuals, using alpha transparency to blend the areas where the polygons intersect. But I _really_ kinda like the way both Pew & Interworks did their visualizations without alpha blending while still highlighting the intersected areas. We can do that in R as well with a bit more work by:

– grouping each data frame by year
– turning each set of points (Dem & Rep) into R polygons
– computing the intersection of those polygons
– turning that intersection back into a data frame
– adding this new polygon to the plots while also removing the alpha blend

Here’s what that looks like in code:

# Setup a function to do the polygon intersection -------------------------

polysect <- function(df) {

  bind_rows(data_frame(x=-10, pct=0),
            select(filter(df, party=="Dem"), x, pct),
            data_frame(x=10, pct=0)) %>%
    as.matrix() %>%
    Polygon() %>%
    list() %>%
    Polygons(1) %>%
    list() %>%
    SpatialPolygons() -> dem

  bind_rows(data_frame(x=-10, pct=0),
            select(filter(df, party=="REP"), x, pct),
            data_frame(x=10, pct=0)) %>%
    as.matrix() %>%
    Polygon() %>%
    list() %>%
    Polygons(1) %>%
    list() %>%
    SpatialPolygons() -> rep

  inter <- gIntersection(dem, rep)
  inter <- as.data.frame(inter@polygons[[1]]@Polygons[[1]]@coords)[c(-1, -25),]
  inter <- mutate(inter, year=df$year[1])
  inter

}

# Get the intersected area ------------------------------------------------

group_by(party_data_long, year) %>%
  do(polysect(.)) -> general_sect

group_by(party_engaged_data_long, year) %>%
  do(polysect(.)) -> engaged_sect


# Try the plots again -----------------------------------------------------

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party))
gg <- gg + geom_ribbon(data=general_sect, aes(x=x, ymin=0, ymax=y), color="#666979", fill="#666979")
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (General Population)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_engaged_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party))
gg <- gg + geom_ribbon(data=engaged_sect, aes(x=x, ymin=0, ymax=y), color="#666979", fill="#666979")
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (Politically Active)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

[Plot: Political Polarization, 1994-2014 (General Population), with the intersection filled]

[Plot: Political Polarization, 1994-2014 (Politically Active), with the intersection filled]

Without much extra effort we now have what I believe to be a more striking set of visuals. (And, I should probably make a `points_to_spatial_polys()` convenience function.)
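
If I did write that hypothetical `points_to_spatial_polys()` helper, it would likely just wrap the Polygon/Polygons/SpatialPolygons chain from `polysect()`, something like this sketch:

library(sp)

points_to_spatial_polys <- function(xy, id="poly") {
  # xy: a two-column matrix/data.frame of vertex coordinates, in drawing order
  SpatialPolygons(list(Polygons(list(Polygon(as.matrix(xy))), ID=id)))
}

polysect() could then build its dem and rep objects with two calls to that instead of repeating the chain.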

You’ll find the “overall” group data as well as the party median values in [the Pew HTML source code](view-source:http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/) if you want to try to fully replicate their visualizations.