hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Candy Coated Confidence Intervals

@mrshrbrmstr hinted that she would like this post by @RickWicklin translated into R for her stats class. She’s quite capable of cranking out the translation of the core component of that post — a call to chisq.test — but she wanted to show the entire post (in R) and really didn’t have time (she’s teaching a full load of classes and is department chair + a mom). I suggested that I, too, was a bit short on time which resulted in her putting out a call to the twitterverse for assistance which ultimately ended up coercing me into tackling the problem.

I won’t re-create Rick’s post or my riff of it here since you can check out the RPubs page for it and also get the source (you can get the source from the Rmd, too, but some folks like gists better).

So, why a blog post if not to present the translation?

Two reasons: I needed tidy Goodman simultaneous confidence intervals (SCIs) and Rick’s final plot was just begging to have “real” M&M’s as the point “geom”.

S[c]imple & Tidy SCIs

We’ve got options for calculating simultaneous CIs in R and I could have just used DescTools::MultinomCI except that I wanted a tibble and it returns a matrix plus it only has three of the more common methods implemented (yes, I am the ultimate package snob). I recalled that the CoinMinD package was tailor made for working with SCIs and has many more methods implemented, but the output is actually only that: print()ed to console output.

Yes, I shouted in disbelief at the glowing rectangle in front of me when I noticed that almost as loudly as you did when you read that sentence.

The algorithms implemented in CoinMinD are just dandy and the package is coming up on it’s 4th birthday. So, as a present from it (via me) to the R community, I whipped together scimple which generates tidy tibbles and has a function scimple_ci() which is similar to binom::binom.confint() in that it will generate the SCIs for all the available (non-Bayesian) methods, including Goodman.

Kick the tyres (pls!) and drop issues and/or PRs as you see fit.

You can’t plot just one

Rick’s post analyzes distributions of M&M’s so I went to the official M&M’s site to grab the official colors for the ones in his data set. I casually went about making the rest of the post with standard points with a superimposed white “m” when it dawned on me that the M&M’s site used those lentils (yes, it seems the candies are called lentils, or at least their icons are) were all over the site. After some site spelunking with Chrome Developer Tools I had the URLs for the candies in question and managed to use the nascent ggimage::geom_image() to place them on the plot:

The plot is a bit sparse as you have to get the aspect ratio just right to keep those tasty, tiny circles as circles.

The new geom_image() opens up many new possibilities for R visualizations (and not all are good possibilities). I think @mrshrbrmstr’s students got a kick out of a stats-y plot having real M&M’s on it so it worked OK this time. Just be wary of using gratuitous imagery and overdoing your watermarking.

As stated earlier you can get the code and see how you can improve upon Rick’s original post and my attempt at a quick riff. If you do end up cranking something out, drop a comment here or a tweet (@hrbrmstr) to show off your creation(s)!

hrbrthemes 0.1.0 is now on CRAN

I’m pleased to announce the inaugural release of my hrbrthemes (0.1.0) package on CRAN

The primary goal of said package is to provide opinionated typographical and other aesthetic defaults for ggplot2 charts.

Two core themes are included:

theme_ipsum() – an Arial Narrow-based theme
theme_ipsum_rc() – a Roboto Condensed-based theme.

The Roboto Condensed Google Font comes with the package along with an installer for said font (it’s an R installer, you still need to install it on your OS manually).

Other niceties include:

scale_[xy]_comma() — shortcut for expand=c(0,0), labels=scales::comma
scale_[xy]_percent() — shortcut for expand=c(0,0), labels=scales::percent
scale_[color|fill]_ipsum() — discrete scale with 9 colors
gg_check() — pass-through spell checker for ggplot2 label elements

Source version is tracked on GitHub.

Critiques, bug reports and enhancement requests are most welcome as GitHub issues.

On Watering Holes, Trust, Defensible Systems and Data Science Community Security

I’ve been threatening to do a series on “data science community security” for a while and had cause to issue this inaugural post today. It all started with this:

Hey #rstats folks: don't do this. Srsly. Don't do this. Pls. Will blog why. Just don't do this. https://t.co/qkem5ruEBi

— boB Rudis (@hrbrmstr) February 23, 2017

Let me begin with the following: @henrikbengtsson is an awesome member of the #rstats community. He makes great things and I trust his code and intentions. This post is not about him, it’s about raising awareness regarding security in the data science community.

I can totally see why folks would like Henrik’s tool. Package dependency management — including installing packages — is not the most beloved of R tasks, especially for new R users or those who prefer performing their respective science or statistical work vs delve deep into the innards of R. The suggestion to use:

source('http://callr.org/install#knitr')

no doubt came from a realization of how cumbersome it can be to deal with said dependency management. You can even ostensibly see what the script does since Henrik provides a link to it right on the page.

So, why the call to not use it?

For starters, if you do want to use this approach, grab his script and make a local copy of it. Read it. Try to grok what it does. Then, use it locally. It will likely be a time-/effort-saver for many R users.

My call was to not source it from the internet.

Why? To answer that I need to talk about trust.

hrbrmstr’s Hierarchy of Package Trust

When you install a package on your system you’re bringing someone else’s code into your personal work space. When you try to use said code with a library() call, R has a few mechanisms to run code on package startup. So, when you just install and load a package you’re executing real code in the context of your local user. Some of that code may be interpreted R code. Some may be calling compiled code. Some of it may be trying to execute binaries (apps) that are already on your system.

Stop and think about that for a second.

If you saw a USB stick outside your office with a label “Cool/Useful R Package” would you insert it into your system and install the package? (Please tell me you answered “No!” :-)

With that in mind, I have a personal “HieraRchy of Package Trust” that I try to stick by:

Tier 1

This should be a pretty obvious one, but if it’s my own code/server or my org’s code/server there’s inherent trust.

Tier 2

When you type install.pacakges() and rely on a known CRAN mirror, MRAN server or Bioconductor download using https you’re getting quite a bit in the exchange.

CRAN GuaRdians at least took some time to review the package. They won’t catch every possible potentially malicious bit and the efficacy of evaluating statistical outcomes is also left to the package user. What you’re getting — at least from the main cran.r-project.org repo and RStudio’s repos — are reviewed packages served from decently secured systems run by organizations with good intentions. Your trust in other mirror servers is up to you but there are no guarantees of security on them. I’ve evaluated the main CRAN and RStudio setups (remotely) and am comfortable with them but I also use my own personal, internal CRAN mirror for many reasons, including security.

Revolution-cum-Microsoft MRAN is also pretty trustworthy, especially since Microsoft has quite a bit to lose if there are security issues there.

Bioconductor also has solid package management practices in place, but I don’t use that ecosystem much (at all, really) so can’t speak too much more about it except that I’m comfortable enough with it to put it with the others at that level.

Tier 3

If I’m part of a known R cabal in private collaboration, I also trust it, but it’s still raw source and I have to scan through code to ensure the efficacy of it, so it’s a bit further down the list.

Tier 4

If I know the contributors to a public source repo, I’ll also consider trusting it, but I will still need to read through the source and doubly-so if there is compiled code involved.

Tiers 5 & 6

If the repo source is a new/out-of-the-blue contributor to the R community or hosted personally, it will be relegated to the “check back later” task list and definitely not installed without a thorough reading of the source.

NOTE

There are caveats to the list above — like CRAN R packages that download pre-compiled Windows libraries from GitHub — that I’ll go into in other posts, along with a demonstration of the perils of trust that I hope doesn’t get Hadley too upset (you’ll see why in said future post ?).

Also note that there is no place on said hierarchy for the random USB stick of cool/useful R code. #justdontdoit

Watering Holes

The places where folks come together to collaborate have a colloquial security name: a “watering hole”. Attackers use these known places to perform — you guessed it — “watering hole” attacks. They figure out where you go, who/what you trust and use that to do bad things. I personally don’t know of any current source-code attacks, but data scientists are being targeted in other ways by attackers. If attackers sense there is a source code soft-spot it will only be a matter of time before they begin to use that vector for attack. The next section mentions one possible attacker that you’re likely not thinking of as an “attacker”.

This isn’t FUD.

Governments, competitors and criminals know that the keys to the 21st century economy (for a while, anyway) reside in data and with those who can gather, analyze and derive insight from data. Not all of us have to worry about this, but many of us do and you should not dismiss the value of the work you’re doing, especially if you’re not performing open research. Imagine if a tiny bit of data exfiltration code managed to get on your Spark cluster or even your own laptop. This can easily happen with a tampered package (remember the incident a few years ago with usage tracking code in R scripts?).

A Bit More On https

I glossed over the https bit above, but by downloading a package over SSL/TLS you’re ensuring that the bits of code aren’t modified in transit from the server to your system and what you’re downloading is also not shown to prying eyes. That’s important since you really want to be sure you’re getting what you think your getting (i.e. no bits are changed) and you may be working in areas your oppressive, authoritarian government doesn’t approve of, such as protecting the environment or tracking global climate change (?).

The use of https also does show — at least in a limited sense — that the maintainers of the server knew enough to actually setup SSL/TLS and thought — at least for a minute or two — about security. The crazy move to “Let’s Encrypt” everything is a topic for another, non-R post, but you can use that service to get free certificates for your own sites with a pretty easy installation mechanism.

I re-mention SSL/TLS as a segue back to the original topic…

Back to the topic at hand

So, what’s so bad about:

source('http://callr.org/install#knitr')

On preview: nothing. Henrik’s a good dude and you can ostensibly see what that script is doing.

On review: much.

I won’t go into a great deal of detail, but that server is running a RHEL 5 server with 15 internet services enabled, ranging from FTP to mail to web serving along with two database servers (both older versions) directly exposed to the internet. The default serving mode is http and the SSL certificate it does have is not trusted by any common certificate store.

None of that was found by any super-elite security mechanism.

Point your various clients at those services on that system and you’ll get a response. To put it bluntly, that system is 100% vulnerable to attack. (How to setup a defensible system is a topic for another post.).

In other words, if said mechanism becomes a popular “watering hole” for easy installation of R packages, it’s also a pretty easy target for attackers to take surreptitious control of and then inject whatever they want, along with keeping track of what’s being installed, by whom and from which internet locale.

Plain base::source() does nothing on your end to validate the integrity of that script. It’s like using devtools::source() or devtools::source_gist() without the sha1 parameter (which uses a hash to validate the integrity of what you’re sourcing). Plus, it seems you cannot do:

devtools::source_url('http://callr.org/install#knitr', sha1="2c1c7fe56ea5b5127f0e709db9169691cc50544a")

since the httr call that lies beneath appears to be stripping away the #… bits. So, there’s no way to run this remotely with any level of integrity or confidentiality.

TLDR

If you like this script (it’s pretty handy) put it in a local directory and source it from there.

Fin

I can’t promise a frequency for “security in the data science community” posts but will endeavor to crank out a few more before summer. Note also that R is not the only ecosystem with these issues, so Python, Julia, node.js and other communities should not get smug :-)

Our pursuit of open, collaborative work has left us vulnerable to bad intentioned ne’er-do-wells and it’s important to at least be aware of the vulnerabilities in our processes, workflows and practices. I’m not saying you have to be wary of every devtools::instal_github() that you do, but you are now armed with information that might help you think twice about how often you trust calls to do such things.

In the meantime, download @henrikbengtsson’s script, thank him for making a very useful tool and run it locally (provided you’re cool with potentially installing things from non-CRAN repos :-)

Putting It All Together

The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (github-only for now) packages.

Labor force participation rate

The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I’ll borrow the definition from Investopedia:

The participation rate is a measure of the active portion of an economy’s labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it’s an easy target if you’re naive). “Yay” partisan democracy.

Since the LFPR has nuances when looked at categorically, let’s take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it).

We can easily grab this data with epidata::get_labor_force_participation_rate()(and, we’ll setup some library() calls while we’re at it:

library(epidata)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(ggalt)
library(tidyverse)
library(stringi)

part_rate <- get_labor_force_participation_rate("e")

glimpse(part_rate)
## Observations: 457
## Variables: 7
## $ date              <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01...
## $ all               <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0...
## $ less_than_hs      <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0...
## $ high_school       <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0...
## $ some_college      <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0...
## $ bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0...
## $ advanced_degree   <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0...

One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But, let’s change the labels so they are a bit easier on the eyes in the facets and switch the facet order from alphabetical to something more useful:

gather(part_rate, category, rate, -date) %>% 
  mutate(category=stri_replace_all_fixed(category, "_", " "),
         category=stri_trans_totitle(category),
         category=stri_replace_last_regex(category, "Hs$", "High School"),
         category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree", "Some College", 
                                            "High School", "Less Than High School", "All")))  -> part_rate

Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:

ggplot(part_rate) +
  geom_line(aes(date, rate, group=category)) +
  scale_y_percent(limits=c(0.3, 0.9)) +
  facet_wrap(~category, scales="free") +
  labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "), 
       y="Participation rate (%)", 
       title="U.S. Labor Force Participation Rate",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="XY", axis="XY")

The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter.

It’s also interesting to see the direct contrast between the starting historical rate and current rate (you could also do the same with min/max rates, etc.) We can use a “dumbbell” chart to compare the 1978 value to today’s value, but we’ll need to reshape the data a bit first:

group_by(part_rate, category) %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  spread(date, rate) %>% 
  ungroup() %>% 
  filter(category != "All") %>% 
  mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range

filter(part_rate, category=="Advanced Degree") %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  mutate(lab=lubridate::year(date)) -> lab_df

(We’ll be using the extra data frame to add labels the chart.)

Now, we can compare the various ranges, once again tweaking aesthetics a bit:

ggplot(rate_range) +
  geom_dumbbell(aes(y=category, x=`1978-12-01`, xend=`2016-12-01`),
                size=3, color="#e3e2e1",
                colour_x = "#5b8124", colour_xend = "#bad744",
                dot_guide=TRUE, dot_guide_size=0.25) +
  geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) +
  scale_x_percent(limits=c(0.375, 0.9)) +
  labs(x=NULL, y=NULL,
       title=sprintf("U.S. Labor Force Participation Rate %s-Present", lab_df$lab[1]),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="X")

Fin

One takeaway from both these charts is that it’s probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function — along with most other functions in the epidata package — also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.

ggalt 0.4.0 now on CRAN

I’m uncontainably excited to report that the ggplot2 extension package ggalt is now on CRAN.

The absolute best part of this package is the R community members who contributed suggestions and new geoms, stats, annotations and integration features. This release would not be possible without the PRs from:

Ben Bolker
Ben Marwick
Jan Schulz
Carson Sievert

and a host of folks who have made suggestions and have put up with broken GitHub builds along the way. Y’all are awesome.

Please see the vignette and graphics-annotated help pages for info on everything that’s available. Some highlights include:

multiple ways to render splines (so you can make those cool, smoothed D3-esque line charts :-)
geom_cartogram() which replicates the old functionality of geom_map() so your old mapping code doesn’t break anymore
a re-re-mended coord_proj() (but, read on about why you should be re-thinking of how you do maps in ggplot2)
lollipop charts (geom_lollipop())
dumbbell charts (geom_dumbbell())
step-ribbon charts
the ability to easily encircle points (beyond those boring ellipses)
byte formatters (i.e. turn 1024 to 1 Kb, etc)
better integration with plotly

If you do any mapping in ggplot2, please follow the machinations of geom_sf() and the sf package. Ed and the rest of the R spatial community have done 100% outstanding work here and it is going to change how you think and work spatially in R forever (in an awesome way). I hope to retire coord_proj() and geom_cartogram() some day soon thanks to all their hard work.

Your contributions, feedback and suggestions are welcome and encouraged. The next steps for me w/r/t ggalt are ensuring 100% plotly coverage since it’s the best way to make your ggplot2 plots interactive. There are a few more additions that didn’t make it into this release that I’ll also be integrating.

Please make sure to say “thank you” to the above contributors if you see them in person on on the internets. They’ve done great work and are exemplary examples of how awesome and talented the R community is.

Spelunking XHRs (XMLHttpRequests) with splashr

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

But if you use the new Docker image and the add_tempdir=TRUE parameter it can render any local HTML file.

The other new bits are helpers to identify content types in the HAR types. Along with get_content_type():

library(tidyverse)

map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA

there are many is_...() functions for logical tests.

But, one of the more interesting is_() functions is is_xhr(). Sites with dynamic content usually load said content via an XMLHttpRequest or XHR for short. Modern web apps usually return JSON in said requests and, for questions like this one on StackOverflow it’s usually better to grab the JSON and use it for data than it is to scrape the table made from JavaScript calls.

Now, it’s not too hard to open Developer Tools and find those XHR requests, but we can also use splashr to programmatically find them. We have to do a bit more work and use the new execute_lua() function since we need to give the page time to load up all the data. (I’ll eventually write a mini-R-DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks). Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" ## "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/## annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/## barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{"practice_name":"Pediatrics, Inc.", "address_1":"577 Main ## Street", "city":&q"| __truncated__ "{"practice_name":"Crown Colony Pediatrics", ## "address_1":"500 Congress Street, Suite 1F""| __truncated__ "{"practice_name":"Pediatricians ## Inc.", "address_1":"955 Main Street", "city":"| __truncated__ ## "{"practice_name":"Lexington Pediatrics", "address_1":"19 Muzzey Street, Suite 105", &qu"| ## __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General ## Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/## provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/## patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()

UPDATE So, I wrote a mini-DSL for this:

splash_local %>%
  splash_response_body(TRUE) %>% 
  splash_go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D") %>% 
  splash_wait(2) %>% 
  splash_har() -> res

which should make it easier to perform basic “go-wait-retrieve” operations.

It’s unlikely we want to rely on a running Splash instance for our production work, so I’ll be making a helper function to turn HAR XHR requests into a httr function calls, similar to the way curlconverter works.

Geom❤️

ggplot() +
  geom_heart() +
  coord_equal() +
  labs(title="Happy Valentine's Day") +
  theme_heart()

Presented without exposition (since it’s a silly Geom)

This particular ❤️ math pilfered this morning from @dmarcelinobr:

library(ggplot2)

geom_heart <- function(..., colour = "#67001f", size = 0.5, fill = "#b2182b",
                       mul = 1.0, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
  
  data <- data.frame(t=seq(0, 10*pi, by=0.1))
  
  x <- function(t) 16*sin(t)^3
  y <- function(t) 13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t)
  
  data$x <- x(data$t) * mul
  data$y <- y(data$t) * mul
  
  data <- rbind(data, data[1,])
  
  layer(
    data = data,
    mapping = aes(x=x, y=y),
    stat = "identity",
    geom = ggplot2::GeomPolygon,
    position = "identity",
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      size = size,
      colour = colour,
      fill = fill,
      ...
    )
  )
  
}

theme_heart <- function() {
  ggthemes::theme_map(base_family = "Zapfino") +
    theme(plot.title=element_text(hjust=0.5, size=28)) +
    theme(plot.margin=margin(30,30,30,30))
}

Diving Into Dynamic Website Content with splashr

If you do enough web scraping, you’ll eventually hit a wall that the trusty httr verbs (that sit beneath rvest) cannot really overcome: dynamically created content (via javascript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still stick with httr verbs — if you can figure out what those requests are — and code-up the right parameters (browser “Developer Tools” menus/views and my curlconverter package are super handy for this). Unfortunately, some sites require actual in-page rendering and that’s when scraping turns into a modest chore.

For dynamic sites, the RSelenium and/or seleniumPipes packages are super-handy tools to have in the toolbox. They interface with Selenium which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as phantomjs. Getting the server component of Selenium running was often a source of pain for R folks, but the new docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.

However, sometimes all you need is the rendering part and for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in python and uses QT webkit for rendering. To avoid deluging your system with all of the Splash dependencies you can use the docker images. In fact, I made it dead easy to do so. Read on!

Going for a dip

The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joind in the fun to add some tweaks & enhancements to it that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with it in splashr I wanted to make it as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor") you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:

library(splashr)

install_splash()
splash_svr <- start_splash()

The install_splash() function will pull the correct image to your local system and you’ll need that splash_svr object later on to stop the container. Now, you can have Splash running on any host, but this post assumes you’re running it locally.

We can test to see if the server is active:

splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008

Now, we’re ready to scrape!

We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial since it doesn’t use XHR but does require rendering and it doesn’t prohibit scraping in the Terms of Service (don’t violate Terms of Service, it is both unethical and could get you blocked, fined or worse).

Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:

You’re saying “HEY. That has <table> in the HTML so why not just use rvest? Well, you can validate the lack of <table>s in the “view source” view of the page or with:

library(rvest)

pg <- read_html("http://www.techstars.com/companies/")
html_nodes(pg, "table")
## {xml_nodeset (0)}

Now, let’s do it with splashr:

splash("localhost") %>% 
  render_html("http://www.techstars.com/companies/", wait=5) -> pg
  
html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table class="table75"><tbody>\n<tr>\n<th>Status</th>\n        <th>Number of Com ...
##  [2] <table class="table75"><tbody>\n<tr>\n<th colspan="2">Impact</th>\n      </tr>\n ...
##  [3] <table class="table75"><tbody>\n<tr>\n<th>Class</th>\n        <th>#Co's</th>\n   ...
##  [4] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Anywhere 2017 Q1</th>\ ...
##  [5] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Atlanta 2016 Summer</t ...
##  [6] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2013 Fall</th>\ ...
##  [7] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2014 Summer</th ...
##  [8] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2015 Spring</th ...
##  [9] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2016 Spring</th ...
## [10] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2014</th>\n   ...
## [11] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2015 Spring</ ...
## [12] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2016 Winter</ ...
## [13] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Cape Town 201 ...
## [14] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2015 Summ ...
## [15] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2016 Summ ...
## [16] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Tel Aviv 2016 ...
## [17] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2015 Summer</th ...
## [18] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2016 Summer</th ...
## [19] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2009 Spring</th ...
## [20] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2010 Spring</th ...
## ...##

We need to set the wait parameter (5 seconds was likely overkill) to give the javascript callbacks time to run. Now you can go crazy turning that into data.

Candid Camera

You can also take snapshots (pictures) of websites with splashr, like this (apologies if you start drooling on your keyboard):

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")

The snapshot functions return magick objects, so you can do anything you’d like with them.

HARd Work

Since Splash is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools package to visualize and further analyze the data. For the sake of brevity, here’s just the main print() output from a site:

splash("localhost") %>% 
  render_har("https://www.r-bloggers.com/")

## --------HAR VERSION-------- 
## HAR specification version: 1.2 
## --------HAR CREATOR-------- 
## Created by: Splash 
## version: 2.3.1 
## --------HAR BROWSER-------- 
## Browser: QWebKit 
## version: 538.1 
## --------HAR PAGES-------- 
## Page id: 1 , Page title: R-bloggers | R news and tutorials contributed by (750) R bloggers 
## --------HAR ENTRIES-------- 
## Number of entries: 130 
## REQUESTS: 
## Page: 1 
## Number of entries: 130 
##   -  https://www.r-bloggers.com/ 
##   -  https://www.r-bloggers.com/wp-content/themes/magazine-basic-child/style.css 
##   -  https://www.r-bloggers.com/wp-content/plugins/mashsharer/assets/css/mashsb.min.cs... 
##   -  https://www.r-bloggers.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?... 
##   -  https://www.r-bloggers.com/wp-content/plugins/jetpack/css/jetpack.css?ver=4.4.2 
##      ........ 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/10579991_10152371745729891_26331957... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/14962601_10210947974726136_38966601... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/c0.8.50.50/p50x50/311082_286149511398044_4... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/11046696_917285094960943_6143235831... 
##   -  https://static.xx.fbcdn.net/rsrc.php/v3/y2/r/0iTJ2XCgjBy.png

FIN

You can also do some basic scripting in Splash with lua and coding up an interface with that capability is on the TODO as is adding final tests and enabling tweaking the Docker configurations to support more fun things that Splash can do.

File an issue on github if you have feature requests or problems and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.

Don’t forget to stop_splash(splash_svr) when you’re finished scraping!