
Category Archives: web scraping

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to work with sites that generate the content you need with javascript. Those are fairly big external dependencies to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while—even on freshly minted systems—so that should not be a show stopper for folks (Java+R ease of installation is still far from guaranteed, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “elements” panel. You’ll see that the quote content is right there on the page inside a javascript block, but the source has no quote <div> elements because they’re generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it’ll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} on your preferred social coding site.

The in-dev htmlunit package for javascript-“enabled” web-scraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-“enabled” sites. The Chrome emulation is now also on-par with Chrome 72 series (my Chrome beta is at 73.0.3683.56 so it’s super close to very current).

In reality, the update was to the htmlunitjars package where the main project JAR and dependent JARs all received a refresh.

The README and tests were all re-run on both packages and Travis is happy.

If you’ve got a working rJava installation (aye, it’s 2019 and that’s still “a thing”) then you can just do:

install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is/")

to get them installed and start playing with the DSL or work directly with the Java classes.
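
If you do want to poke at the Java classes directly, here is a minimal, hypothetical sketch using {rJava} and HtmlUnit’s own WebClient/HtmlPage API (2.x class names). It assumes that loading {htmlunitjars} puts the HtmlUnit jars on the JVM classpath; check the package docs for the exact mechanism.

library(rJava)
library(htmlunitjars) # assumption: loading this puts the HtmlUnit jars on the classpath
library(rvest)

.jinit() # start the JVM if it is not already running

wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"))  # HtmlUnit's top-level "browser"
wc$getOptions()$setThrowExceptionOnScriptError(FALSE)    # be tolerant of noisy site javascript

pg <- wc$getPage("http://quotes.toscrape.com/js/")       # returns an HtmlPage; javascript runs on load

read_html(pg$asXml()) %>%                                # serialize the rendered DOM for {rvest}
  html_nodes("div.quote > span.text") %>%
  html_text(trim = TRUE)

wc$close() # release the underlying resources

In practice the package functions and mini-DSL cover the common cases, so dropping to this level should be rare.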

FIN

As usual, use your preferred social coding site to log feature requests or problems.

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap of the reticulated docker package for the pure-R stevedore package, which will make it loads more compatible across the landscape of R installs since it removes a somewhat heavy dependency on a working Python environment (something quite challenging to consistently achieve in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast as an increasing number of sites are changing what type of HTML (et al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN release. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of the Python dependency) I’ve integrated the changes from the 0.6.0 branch into the master branch, but you can follow the machinations of the 0.6.0 branch up until the CRAN release.

Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past the vast majority of its regulatory corpus is quite sane and has made Americans safer and healthier as a result. What’s happened to an EPA left in the hands of evil (yep, “evil”) in the past two years is beyond lamentable and we likely have two more years of lamenting ahead of us (unless you actually like your water with a coal ash chaser).

The authors of the article made this chart to show the stark contrast between 2017 and 2018 when it comes to regulatory actions for eight acts:

  • Clean Air Act (CAA)
  • Clean Water Act (CWA)
  • Emergency Planning and Community Right to Know Act (EPCRA)
  • Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)
  • Resource Conservation and Recovery Act (RCRA)
  • Safe Drinking Water Act (SDWA)
  • Toxic Substances Control Act (TSCA)
  • Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)

They made this arrow chart (via Datawrapper):

For some reason, that chart sparked an “I really need to make that in R” moment, and thus begat this post.

I’ve got a geom for dumbbell charts but that’s not going to work for this arrow chart since I really wanted to (mostly) reproduce it the way it was. Here’s my go at it.

Data First

Datawrapper embeds have a handy “Get the data” link in them but it’s not a link to a file. It’s a javascript-generated data: href so you either need to click on the link and download it or be hard-headed like I am and go the way of pain and scrape it (reproducibility FTW). Let’s get packages and data gathering code out of the way. I’ll exposit a bit more about said data gathering after the code block:

library(stringi)
library(rvest)
library(hrbrthemes) # git[la|hu]b / hrbrmstr / hrbrthemes
library(tidyverse)

article <- read_html("https://theconversation.com/the-epa-has-backed-off-enforcement-under-trump-here-are-the-numbers-108640")

html_node(article, "iframe#psm7n") %>% # find the iframe
  html_attr("src") %>% # get iframe URL
  read_html() %>%  # read it in
  html_node(xpath=".//script[contains(., 'data: ')]") %>% # find the javascript section with the data
  html_text() %>% # get that section
  stri_split_lines() %>% # split into lines so we can target the actual data element
  unlist() %>% 
  keep(stri_detect_fixed, 'data: "Fiscal') %>% # just get the data line
  stri_trim_both() %>% # prep it for extraction
  stri_replace_first_fixed('data: "', "") %>% 
  stri_replace_last_fixed('"', "") %>% 
  stri_replace_all_fixed("\\n", "\n") %>% # make lines lines
  stri_split_lines() %>% 
  unlist() %>%
  stri_split_fixed("\\t") %>% # we now have a list of vectors
  map_dfc(~set_names(list(.x[2:length(.x)]), .x[1])) %>%  # first element of each vector is colname
  type_convert(col_types = "cddn") %>% # get real types
  set_names(c("act", "y2018", "y2017", "pct")) -> psm

psm
## # A tibble: 8 x 4
##   act    y2018 y2017   pct
##   <chr>  <dbl> <dbl> <dbl>
## 1 CAA      199   405   -51
## 2 CERCLA   147   194   -24
## 3 CWA      320   565   -43
## 4 EPCRA     56   107   -48
## 5 FIFRA    363   910   -60
## 6 RCRA     149   275   -46
## 7 SDWA     121   178   -32
## 8 TSCA      80   152   -47

Inside the main article URL content there’s an iframe load:

<p><iframe id="psm7n" class="tc-infographic-datawrapper" src="https://datawrapper.dwcdn.net/psm7n/2/" height="400px" width="100%" style="border: none" frameborder="0"></iframe></p>

We grab the contents of that iframe link (https://datawrapper.dwcdn.net/psm7n/2/) which has a data: line way down towards the bottom of one of the last javascript blocks:

That ugly line gets transformed into a link that will download as a normal CSV file, but we have to do the above wrangling on it before we can get it into a format we can work with.

Now, we can make the chart.

Chart Time!

Let’s get the Y axis in the right order:

psm %>%
  arrange(desc(y2017)) %>%
  mutate(act = factor(act, levels = rev(act))) -> psm

Next, we setup X axis breaks and also get the max value for some positioning calculations (so we don’t hardcode values):

# setup x axis breaks and max value for label position computation
x_breaks <- pretty(c(psm$y2018, psm$y2017))
max_val <- max(x_breaks)

I have two minor nitpicks about the original chart (and changes to them as a result). First, I really don’t like the Y axis gridlines but I do believe we need something to help the eye move horizontally and associate each label to its respective geom. Instead of gridlines I opt for a diminutive dotted line from 0 to the first (min) value.

The second nitpick is that — while the chart has the act information in the caption area — the caption is in alpha order vs the order the act acronyms appear in the data. If it was an alpha bullet list I might not complain, but I chose to modify the order to fit the chart, which we build dynamically with the help of this vector:

# act info for caption
c(
  "CAA" = "Clean Air Act (CAA)",
  "CWA" = "Clean Water Act (CWA)",
  "EPCRA" = "Emergency Planning and Community Right to Know Act (EPCRA)",
  "FIFRA" = "Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)",
  "RCRA" = "Resource Conservation and Recovery Act (RCRA)",
  "SDWA" = "Safe Drinking Water Act (SDWA)",
  "TSCA" = "Toxic Substances Control Act (TSCA)",
  "CERCLA" = "Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)"
) -> acts

w125 <- scales::wrap_format(125) # help us word wrap at ~125 chars

# order the vector and turn it into wrapped lines
act_info <- w125(paste0(unname(acts[as.character(psm$act)]), collapse = "; "))

Now, we can generate the geoms. It looks like a lot of code, but I like to use newlines to help structure ggplot2 calls. I still miss my old gg <- gg + idiom but RStudio makes it way too easy to execute the whole expression with just the use of + so I’ve succumbed to their behaviour modification. To break it down w/o code, we essentially need:

  • the arrows for each act
  • the 2017 and 2018 direct label values for each act
  • the 2017 and 2018 top “titles”
  • segments for ^^
  • title, subtitle and caption(s)

We use percent-maths to position labels and other objects so the code can be re-used for other arrow plots (hardcoding to the data values is likely fine, but you’ll end up tweaking the numbers more and wasting ~2-5m per new chart).

ggplot(psm) + # psm is the default data for the geoms that follow

  # dots from 0 to minval
  geom_segment(
    aes(0, act, xend = y2018, yend = act),
    linetype = "dotted", color = "#b2b2b2", size = 0.33
  ) +

  # minval label
  geom_label(
    aes(y2018, act, label = y2018),
    label.size = 0, hjust = 1, size = 3.5, family = font_rc
  ) +

  # maxval label
  geom_label(
    aes(y2017 + (0.0015 * y2017), act, label = y2017),
    label.size = 0, hjust = 0, size = 3.5, family = font_rc
  ) +

  # the measure line+arrow
  geom_segment(
    aes(y2018, act, xend = y2017, yend = act),
    color = "#4a90e2", size = 0.75, # I pulled the color value from the original chart
    arrow = arrow(ends = "first", length = unit(5, "pt"))
  ) +

  # top of chart year (min)
  geom_label(
    data = head(psm, 1),
    aes(y2018, 9, label = "2018"),
    hjust = 0, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # top of chart year (max)
  geom_label(
    data = head(psm, 1),
    aes(y2017, 9, label = "2017"),
    hjust = 1, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # bar from top of chart year label to first minval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2018 + (0.005 * max_val), 8.5, 
      xend = y2018 + (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # bar from top of chart year label to first maxval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2017 - (0.005 * max_val), 8.5, 
      xend = y2017 - (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # fix x axis scale and place breaks
  scale_x_comma(limits = c(0, max_val), breaks = seq(0, max_val, 200)) +

  # make room for top "titles"
  scale_y_discrete(expand = c(0, 1)) +

  labs(
    y = NULL,
    title = "Decline by statute",
    subtitle = "The number of civil cases the EPA brought to conclusion has dropped across a number of federal statutes,\nincluding the Clean Air Act (CAA) and others.",
    x = act_info,
    caption = "Original Chart/Data: The Conversation, CC-BY-ND;<https://bit.ly/2VuJrOT>; Source: Environmental Data & Government Initiative <https://bit.ly/2VpcFyl>"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.x = element_text(color = ft_cols$slate)) +
  theme(axis.title.x = element_text(
    hjust = 0, size = 10, face = "italic", color = ft_cols$gray, margin = margin(t = 10)
  )) +
  theme(plot.caption = element_text(hjust = 0))

Here’s the result:

(it even looks ok in “batman” mode):

FIN

With Microsoft owning GitHub I’m not using gists anymore and the GitLab “snippets” equivalent is just too dog-slow to use, so starting in 2019 I’m self-hosting contiguous R example code used in the blog posts. For the moment, that means links to plain R files but I may just setup gitea for them sometime before the end of Q1. You can find a contiguous, commented version of the above code here.

If you do your own makeover don’t forget to drop a link to your creation(s) in the comments!

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this).

Today, they sent me this gem by @JeromeDeveloper: Scrapy and Scrapyrt: how to create your own API from (almost) any website. Go ahead and click it. Give the Medium author the claps they so desperately crave (and get the context for the rant below).

I have no issue with @JeromeDeveloper’s coding prowess, nor Scrapy/Scrapyrt. In fact, I’m a huge fan of the folks at ScrapingHub, so much so that I wrote splashr to enable use of their Splash server from R.

My issue is with the example the author chose to use.

CoinMarketCap provides cryptocurrency prices and other cryptocurrency info. I use it to track cryptocurrency prices to see which currency attackers who pwn devices to install illegal cryptocurrency rigs might be switching to next and to get a feel for when they’ll stop mining and go back to just stealing data and breaking things.

CoinMarketCap has an API with a generous free tier and the following text in their Terms & Conditions (which, in the U.S., may [soon] stupidly have to be explicitly repeated on each page where scraping is prohibited vs living at a universal site link):

You may not, and shall not, copy, reproduce, download, “screen scrape”, store, transmit, broadcast, publish, modify, create a derivative work from, display, perform, distribute, redistribute, sell, license, rent, lease or otherwise use, transfer (either in printed, electronic or other format) or exploit any Content, in whole or in part, in any way that does not comply with these Terms without our prior written permission.

There is only one reason (apart from complete oblivion) to use CoinMarketCap as an example: to show folks how clever you are at bypassing site restrictions and eventually avoiding paying for an API to get data that you did absolutely nothing to help gather, curate and setup infrastructure for. There is no mention of “be sure what you are doing is legal/ethical”, just a casual caution to not abuse the Scrapyrt technology since it may get you banned.

Ethics matter across every area of “data science” (of which scraping is one component). Just because you can do something doesn’t mean you should, and just because you don’t like Terms & Conditions and want to grift the work of others for fun & profit also doesn’t mean you should; and it definitely doesn’t mean you should be advocating that others do it as well.

Ironically, Medium itself places restrictions on what you can do:

Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.

yet they advocated I read and heed a post which violates similar terms of another site. So I wonder how they’d feel if I did a riff of that post and showed how to setup a hackish-API to scrape all their content. O_o
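
As an aside, if you want a programmatic gut-check of what a site’s robots.txt permits before you scrape it, a small sketch using the {robotstxt} package (not otherwise used in this post) might look like this:

library(robotstxt)

# fetch & parse https://medium.com/robots.txt
rt <- robotstxt(domain = "medium.com")

rt$check(paths = "/", bot = "*") # is a generic crawler allowed to fetch the root path?

# or the one-liner version
paths_allowed(paths = "/", domain = "medium.com", bot = "*")

Keep in mind robots.txt only speaks to crawling; it won’t surface Terms of Service language like the passages quoted above, so it’s a starting point, not a substitute for reading the terms.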

It’s been over a year since Headless Chrome was introduced and it has matured greatly over that time and has acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or render javascript-interpreted HTML by supplying some simple parameters. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.
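
To make that concrete, here’s a small sketch that drives a headless Chrome binary straight from R with system2(). The binary name is an assumption (it’s google-chrome or chromium-browser on Linux and a full application path on macOS/Windows), so adjust it for your system:

chrome_bin <- Sys.getenv("CHROME_BIN", unset = "google-chrome") # assumption: adjust per platform

# render the javascript-built DOM and capture the resulting HTML
dom <- system2(
  chrome_bin,
  args = c("--headless", "--disable-gpu", "--dump-dom", "http://quotes.toscrape.com/js/"),
  stdout = TRUE
)

# or print a page to PDF
system2(
  chrome_bin,
  args = c("--headless", "--disable-gpu", "--print-to-pdf=quotes.pdf", "http://quotes.toscrape.com/js/")
)

The decapitated package described below wraps this kind of invocation (and more) so you don’t have to shell out by hand.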

R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).

However, that nigh-mundane addition is not the only new feature in decapitated.

Introducing gepetto

While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community-supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high-level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.

Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.

Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.

A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but it should be of sufficient utility to add to your web-scraping toolbox.

Installing gepetto

There are instructions over at the repo on installing gepetto but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto() which should (hopefully) go as smoothly as this provided you have a fairly recent version of Node.js installed along with npm:

The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps to ensure smoother sailing than using the version of Chrome you may use for browsing.

Working with gepetto

Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:

$ gepetto
Launch browser!
gepetto running on: http://localhost:3000

NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.

You can then go to http://localhost:3000 in your browser and should see this:

Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.

If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:

The Swagger definition JSON is also at http://localhost:3000/swagger.json.

The API documentation will be a bit more robust as the module’s corners are rounded out.

“But, this is supposed to be an R post…”

Yes. Yes it is.

If you followed along in the previous section and started gepetto from a command-line interface, kill the running service and fire up your favourite R environment and let’s scrape some content!

library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>% 
  gep_active()
## [1] TRUE

Anything other than a “running” response means there’s something wrong and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
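
For the curious, a few of those processx methods look like this (a quick sketch; read_error_lines() assumes start_gepetto() launched the process with its stderr piped, which may not be how the package sets things up):

gpid$is_alive()         # is the gepetto process still running?
gpid$get_exit_status()  # NULL while alive, an exit code once it has died
gpid$read_error_lines() # peek at stderr, assuming it was piped at startup
# gpid$kill()           # and stop it when you're done poking around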

Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().

gepetto() %>% 
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%  
  html_attr("src") %>% 
  head(10)
## [1] "audio/2chainz_4.mp3"        "audio/2chainz_yeah2.mp3"   
## [3] "audio/2chainz_tellem.mp3"   "audio/2chainz_tru.mp3"     
## [5] "audio/2chainz_unh3.mp3"     "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3"  "audio/2pac_4.mp3"          
## [9] "audio/2pac_5.mp3"           "audio/2pac_6.mp3"

Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.

We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:

gepetto() %>% 
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72

Astute readers will notice it returns a magick object so you can work with it immediately.

I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since the gep_render_magick() actually captured the entire page which you can see for yourself (the thumbnail doesn’t do it justice).
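
Since the render comes back as a {magick} object, you can keep piping it through the usual {magick} verbs. A minimal sketch (assigning the render to a variable first):

library(magick)

img <- gepetto() %>% 
  gep_render_magick("https://nytimes.com/")

img %>% 
  image_crop("1440x900+0+0") %>%  # keep just the top of that very tall capture
  image_scale("720") %>%          # shrink it down
  image_write("nyt-front-page.png")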

Testing gep_render_pdf() is an exercise left to the reader.

FIN

The gepetto REST API is at version 0.1.0 meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.

The development version of splashr now support authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to setup your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python) following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).

This works well when combined with furrr, an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium might look like this:

library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

list_of_urls_with_unique_urls <- c("http://...", "http://...", ...)

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on", 
    user = "your splash api username", 
    pass = "your splash api password"
  ) %>% 
    splash_response_body(TRUE) %>% # we want to get all the content 
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from 
    splash_go(org_url) %>% 
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers=5) # don't overwhelm the default setup or your internet connection

future_map(list_of_urls_with_unique_urls, ~{
  
  org <- safe_splash(.x) # go get it!
  
  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    
    HARtools::writeHAR( # HAR format saves *everything*. the files are YUGE
      har = org$result, 
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    
    sprintf("Successfully retrieved %s", .x)
    
  }
  
}) -> results

(Those with a keen eye will grok why splashr supports Splash API basic authentication, now)

The parallel iterator will return a list which we can flatten to a character vector and check for errors. (I don’t flatten by default since it’s safer to get a list back, as a list can hold anything, while map_chr() likes to check for proper objects.) Something like:

flatten_chr(results) %>% 
  keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"

Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more workers or reduce your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
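
One sketch of such a wrapper, a hypothetical splash_with_retry() layered on the safe_splash() helper from above (max_tries and the pause length are arbitrary choices):

splash_with_retry <- function(org_url, max_tries = 3, pause = 5) {
  for (i in seq_len(max_tries)) {
    res <- safe_splash(org_url)           # the safely()-wrapped call from above
    if (!is.null(res$result)) return(res) # we got a HAR back, so stop retrying
    message(sprintf("Attempt %s/%s for %s failed; waiting %ss...", i, max_tries, org_url, pause))
    Sys.sleep(pause)                      # give the Splash workers a breather
  }
  res # hand back the last (failed) result so the caller can log the error
}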

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.

This past weekend, violent windstorms raged through New England. We — along with over 500,000 other Mainers — went “dark” in the wee hours of Monday morning and (this post was published on Thursday AM) we still have no utility-provided power nor high-speed internet access. The children have turned iFeral, and being a remote worker has been made that much more challenging. Earlier in the week, even making a cellular phone call (not a Google Voice or other VoIP-ish call, just pressing buttons in the phone “app” in iOS) resulted in an “All circuits are busy” message vs human contact. (I had to repair our generator at some point between then and now, but it’s all a blur at this point.)

Late Tuesday night, we checked the Central Maine Power outage info and were greeted with a “November 4, 2017” estimate. After regaining composure, we doubled down on the fact that we’d be extreme indoor glamping for a while longer.

As noted, I cope by coding and have a history of scraping Central Maine Power’s site for outage info. I ceased when I discovered the site & twitter bot I mentioned in a recent post, as it does that for the entirety of the U.S. (though many power companies continue to make it difficult to scrape their outage info).

However, I wanted to see just how many other streets were in the same position as we are (I should note that less than a mile from us there are folks with power and internet, due mostly to their proximity to “vital” resources and how screwed up the Maine power grid is). Rather than reuse existing code, I wanted to make a modern, tidyverse edition of those scrapers. If you follow enough paths in the aforementioned outage site, you’ll see that you eventually come to a page with a pretty ugly iframe that lets you poke around counties and towns. The following code traverses that tree to get street-level outage info:

library(rvest)
library(stringi)
library(hrbrthemes)
library(tidyverse)

# helper to make numbers from comma strings; i still find it amusing that
# this has never been a core "S" or base "R" function given that the
# central goal of both languages is to work with data
to_num <- function(x) { as.numeric(stri_replace_all_fixed(x, ",", "")) }

# top of the tree
pg <- read_html("http://www3.cmpco.com/OutageReports/CMP.html")

# basic idiom all the way down is to find the links to traverse until we get to
# the street level data, plucking the timestamp of the CMP report along the way
html_nodes(pg, "a") %>% 
  map_df(~{
    
    county <- stri_trans_totitle(html_text(.x))
    cpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
    message(sprintf("Processing %s...", county))
    
    html_nodes(cpg, xpath=".//td[not(@colspan = '2') and not(@align = 'right')][1]/a") %>% 
      map_df(~{
        
        town <- stri_trans_totitle(html_text(.x))
        tpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
        message(sprintf("  - %s", town))
    
        html_node(tpg, xpath=".//td[contains(., 'Update:')]") %>%
          html_text() %>%
          stri_replace_first_regex("Update: ", "") %>%
          as.POSIXct("%b %d, %Y %I:%M %p", tz="America/New_York") -> ts
        
        html_node(tpg, "table") %>% 
          html_table() %>% 
          docxtractr::assign_colnames(3) %>%  
          docxtractr::mcga() %>% # in github version
          mutate(street = stri_trans_totitle(street)) %>% 
          mutate_at(vars(-estimated_restoration, -street), .funs=to_num) %>% 
          filter(!is.na(total_customersby_street)) %>% 
          mutate(timestamp = ts) %>% 
          mutate(county = county) %>% 
          mutate(town = town) %>% 
          tbl_df()
      })
    
  }) -> xdf

xdf <- mutate(xdf, estimated_restoration = as.POSIXct(estimated_restoration, "%b %d, %Y %I:%M %p", tz="America/New_York"))

xdf
## # A tibble: 10,601 x 7
##           street total_customersby_street customerswithout_power estimated_restoration           timestamp       county   town
##            <chr>                    <dbl>                  <dbl>                <dttm>              <dttm>        <chr>  <chr>
##  1        2Nd St                       59                     14                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  2        3Rd St                      128                     53                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  3        4Th St                       89                     15                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  4        5Th St                       56                      3                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  5     Adams Ave                        4                      4   2017-11-03 19:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  6     Allain St                        8                      8                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  7     Atwood St                        6                      3   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  8    Baxter Ave                       32                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  9     Beaver Rd                        9                      4   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## 10 Bellflower Dr                       10                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## # ... with 10,591 more rows

One truly upsetting revelation from the data is the number of folks still in an “Assessing” condition (i.e. no restoration time estimate):

filter(xdf, is.na(estimated_restoration)) %>% 
  summarise(streets = n(), customers_impacted = sum(total_customersby_street))
## # A tibble: 1 x 2
##   streets customers_impacted
##     <int>              <dbl>
## 1    2255              42067

I’m thankful (for them and us) that Winter has not yet hit and that the weather has been and is going to be sufficiently mild to not make things life-threatening for most folks (it does get cold in northern Maine at this time of year).

It’s About Time

We can get an overview of when things are slated to get better for everyone but the folks I just mentioned:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>%
  count(day) %>% 
  ggplot(aes(day, n)) +
  geom_col() +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma() +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y")

It seems that most of us are in the same “November 4th” bucket. But, we can also see that Central Maine Power’s data curation leaves much to be desired since there should be no dates in the past in that chart, but there are.
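
You can confirm that directly from the scraped data with a quick check for restoration estimates that were already in the past at the time of the scrape:

filter(xdf, estimated_restoration < Sys.time()) %>% 
  count(county, sort = TRUE)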

With the scraping data above, we can explore the outage info in many ways, but — as time and bandwidth are precious commodities — I’ll leave you with the total number of customers still without power:

count(xdf, wt = customerswithout_power) %>% pull(n)
## [1] 153465

and, a county-level view of the outage:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>% 
  count(county, day) %>% 
  complete(county, day, fill=list(n=0)) %>% 
  filter(day >= Sys.Date()) %>% 
  ggplot(aes(day, n)) +
  geom_segment(aes(xend=day, yend=0), color="steelblue", size=4) +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma(limits=c(0,1250)) +
  facet_wrap(~county, scales="free", ncol=5) +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates by County",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y", strip_text_face = "bold", axis="xy") +
  theme(panel.spacing.x=unit(3, "lines")) +
  theme(panel.spacing.y=unit(2, "lines"))

FIN

In a way, I wish I had continued scraping CMP data, since the power outages site I mentioned doesn’t seem to provide access to the raw data, and getting a historical perspective of the outage locations (analyzed by geography and other demographics) might be interesting.

Hopefully the scraping code will be useful for some folks. It was definitely therapeutic for me :-)