If you do enough web scraping, you’ll eventually hit a wall that the trusty httr
verbs (that sit beneath rvest
) cannot really overcome: dynamically created content (via javascript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still stick with httr
verbs — if you can figure out what those requests are — and code up the right parameters (browser “Developer Tools” menus/views and my curlconverter
package are super handy for this). Unfortunately, some sites require actual in-page rendering and that’s when scraping turns into a modest chore.
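For example, here’s a rough sketch of that approach with httr; the endpoint and query parameters are purely illustrative stand-ins for what you’d lift from the Developer Tools “Network” tab:

library(httr)

# hypothetical XHR endpoint spotted in the browser's "Network" view;
# the URL, query parameters and header are illustrative only
res <- GET(
  "https://example.com/api/companies",
  query = list(page = 1, per_page = 50),
  add_headers(`X-Requested-With` = "XMLHttpRequest")
)
stop_for_status(res)
dat <- content(res, as = "parsed")  # usually JSON already parsed into a list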
For dynamic sites, the RSelenium
and/or seleniumPipes
packages are super-handy tools to have in the toolbox. They interface with Selenium which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as phantomjs
. Getting the server component of Selenium running was often a source of pain for R folks, but the new docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.
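If you’d rather go the Selenium route, a minimal RSelenium sketch looks something like this (it assumes you already have a Selenium server container listening on the default port 4444):

library(RSelenium)

# connect to a Selenium server that is already running (e.g. via a docker image)
rem <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "chrome")
rem$open()
rem$navigate("http://www.techstars.com/companies/")
page_source <- rem$getPageSource()[[1]]  # fully rendered HTML as a character string
rem$close()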
However, sometimes all you need is the rendering part and for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in python and uses QT webkit for rendering. To avoid deluging your system with all of the Splash dependencies, you can use the docker images. In fact, I made it dead easy to do so. Read on!
Going for a dip
The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joined in the fun to add some tweaks & enhancements that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with splashr, I wanted to make it as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor") you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:
library(splashr)
install_splash()
splash_svr <- start_splash()
The install_splash()
function will pull the correct image to your local system and you’ll need that splash_svr
object later on to stop the container. Now, you can have Splash
running on any host, but this post assumes you’re running it locally.
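You can point splashr at a container on another box just by handing splash() a different host. A quick sketch (the hostname is illustrative, and the explicit port argument assumes Splash’s default of 8050, which you can normally omit):

# local container (what the rest of this post uses)
splash("localhost") %>% splash_active()

# a container running on another box (hostname is illustrative)
splash("splash.example.com", port = 8050L) %>% splash_active()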
We can test to see if the server is active:
splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008
Now, we’re ready to scrape!
We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial since it doesn’t use XHR but does require rendering and it doesn’t prohibit scraping in the Terms of Service (don’t violate Terms of Service; it’s unethical and could get you blocked, fined or worse).
Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:
You’re saying “HEY. That has <table> in the HTML, so why not just use rvest?” Well, you can validate the lack of <table>s in the “view source” view of the page or with:
library(rvest)
pg <- read_html("http://www.techstars.com/companies/")
html_nodes(pg, "table")
## {xml_nodeset (0)}
Now, let’s do it with splashr
:
splash("localhost") %>%
render_html("http://www.techstars.com/companies/", wait=5) -> pg
html_nodes(pg, "table")
## {xml_nodeset (89)}
## [1] <table class="table75"><tbody>\n<tr>\n<th>Status</th>\n <th>Number of Com ...
## [2] <table class="table75"><tbody>\n<tr>\n<th colspan="2">Impact</th>\n </tr>\n ...
## [3] <table class="table75"><tbody>\n<tr>\n<th>Class</th>\n <th>#Co's</th>\n ...
## [4] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Anywhere 2017 Q1</th>\ ...
## [5] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Atlanta 2016 Summer</t ...
## [6] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2013 Fall</th>\ ...
## [7] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2014 Summer</th ...
## [8] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2015 Spring</th ...
## [9] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2016 Spring</th ...
## [10] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2014</th>\n ...
## [11] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2015 Spring</ ...
## [12] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2016 Winter</ ...
## [13] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Cape Town 201 ...
## [14] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2015 Summ ...
## [15] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2016 Summ ...
## [16] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Tel Aviv 2016 ...
## [17] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2015 Summer</th ...
## [18] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2016 Summer</th ...
## [19] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2009 Spring</th ...
## [20] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2010 Spring</th ...
## ...
We need to set the wait
parameter (5 seconds was likely overkill) to give the javascript callbacks time to run. Now you can go crazy turning that into data.
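For instance, here’s a quick sketch of turning that into a data frame with rvest (it assumes the third node in the set really is the “Summary by Class” table, which the listing above suggests):

# grab the <table> nodes from the rendered page and convert the
# "Summary by Class" table (node 3 in the listing above) into a data frame
tables <- html_nodes(pg, "table")
summary_by_class <- html_table(tables[[3]], fill = TRUE)
head(summary_by_class)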
Candid Camera
You can also take snapshots (pictures) of websites with splashr
, like this (apologies if you start drooling on your keyboard):
splash("localhost") %>%
render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")
The snapshot functions return magick
objects, so you can do anything you’d like with them.
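For example, a quick sketch of saving a capture to disk (the file name is just for illustration):

library(magick)

img <- splash("localhost") %>%
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")

image_info(img)                     # dimensions & size of the capture
image_write(img, path = "p5x.png")  # write it out as a PNG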
HARd Work
Since Splash
is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools
package to visualize and further analyze the data. For the sake of brevity, here’s just the main print()
output from a site:
splash("localhost") %>%
render_har("https://www.r-bloggers.com/")
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 2.3.1
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 538.1
## --------HAR PAGES--------
## Page id: 1 , Page title: R-bloggers | R news and tutorials contributed by (750) R bloggers
## --------HAR ENTRIES--------
## Number of entries: 130
## REQUESTS:
## Page: 1
## Number of entries: 130
## - https://www.r-bloggers.com/
## - https://www.r-bloggers.com/wp-content/themes/magazine-basic-child/style.css
## - https://www.r-bloggers.com/wp-content/plugins/mashsharer/assets/css/mashsb.min.cs...
## - https://www.r-bloggers.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?...
## - https://www.r-bloggers.com/wp-content/plugins/jetpack/css/jetpack.css?ver=4.4.2
## ........
## - https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/10579991_10152371745729891_26331957...
## - https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/14962601_10210947974726136_38966601...
## - https://scontent.xx.fbcdn.net/v/t1.0-1/c0.8.50.50/p50x50/311082_286149511398044_4...
## - https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/11046696_917285094960943_6143235831...
## - https://static.xx.fbcdn.net/rsrc.php/v3/y2/r/0iTJ2XCgjBy.png
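If you want to poke at those entries programmatically rather than just print them, here’s a rough sketch. It assumes the returned object mirrors the standard HAR layout (log$entries, with each entry holding a request$url); adjust accordingly if your version structures it differently:

har <- splash("localhost") %>%
  render_har("https://www.r-bloggers.com/")

# pull out the URL of every request the page made
urls <- vapply(har$log$entries, function(e) e$request$url, character(1))
head(urls)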
FIN
You can also do some basic scripting in Splash with lua, and coding up an interface to that capability is on the TODO, as is adding final tests and enabling tweaks to the Docker configurations to support more fun things that Splash can do.
File an issue on github if you have feature requests or problems and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.
Don’t forget to stop_splash(splash_svr)
when you’re finished scraping!
10 Comments
Do you think you could use this to scrape data from the webpage referenced in this question: http://stackoverflow.com/questions/41525681/how-to-extract-data-from-an-interactive-chart
Or do you have any other ideas for how to extract the data? Thank you
Hi, first of all thank you for your post.
I am probably sloppy but there is no way I can successfully execute the command 'install_splash()', as my output is:
Error in sys::exec_internal(docker, args = args, error = TRUE) :
Executing '/usr/bin/docker' failed with status 1
(My OS is Ubuntu 16.04.2 LTS.)
However, Docker has been installed and I can myself execute it through command-line. What should I do? Thanks!
Thx for taking a look at the pkg! Could you post this as an issue on the github repo? It’ll make it easier to do back-and-forth debugging.
Hi – thanks for all your work on great packages and making so many things easier. I am trying out the splashr package and it all works great for me (even on win10 … ). I am trying to scrape some js-rendered sites and it seems to me that I would need to increase the wait arg in render_html(). When I try the same command with render_png() instead I notice that only a small portion of the page is loaded, therefore my idea. However it seems like its max value is 10. Is there any way I can get around this – ideally letting the page load for like a minute and then rendering the html? Thanks again and keep up the great work!
Thx for kicking the tyres on the pkg and for the Windows success report!
Indeed, according to https://splash.readthedocs.io/en/stable/api.html#render-html the high-level render_* wrappers do have a 10s hard-coded value (and I shld handle that in R, which I will do). You’ll have to do a bit more with the lua DSL wrapper functions. Hopefully this pastes well here:
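Something along these lines (a sketch only: the URL and 30-second wait are illustrative, splash:go/splash:wait/splash:html are the standard Splash scripting calls, and execute_lua() is the splashr helper that ships the script to the server; depending on the version you may need rawToChar()/read_html() on the result):

lua_script <- '
function main(splash)
  splash:go("http://www.techstars.com/companies/")
  splash:wait(30)
  return splash:html()
end
'

system.time(
  res <- splash("localhost") %>% execute_lua(lua_script)
)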
(I wrapped it with system.time() to show it does what it says it does wait-wise ;-) You get tons more granular control with lua scripts or the lua-R DSL wrapper I made.
At times, the Splash (not my code, but the VM) seems to get into an interesting state when lua code is sent and I’m still working on whether it’s my script composition and sending or their interpreter that’s causing it to be a bit wonky. Give that a go and let me know if there are issues. -thx
Thanks for sharing! I got it working, and it did indeed grab a page that I previously had to manually save the source of.
There was a lot of fiddling about to get it to that point, including installing some other R packages to get these to work, installing some other Ubuntu packages to get Docker to install, and installing the docker Python package. It was fiddly but not hard, since I could see what needed to be done. (I ought to have documented my process so that I could help others.)
Hi Bob, thank you for this post!
I followed all the steps and got to the point where you ran:
splash("localhost") %>% splash_active().
However, I got an output of FALSE and I am not sure why.
I checked docker, and the container with the splashr image is running, but I cannot seem to verify that in R. Would it be possible that the container is not running locally? If this is the case, how would I go about finding the host for the container?
Any help would be greatly appreciated! Thanks
Hey Alex. This made it to Spam for some reason. Can you file this as an issue in the splashr repo and provide some system details?
Thanks for sharing. I am getting the below error when I run "splash_svr <- start_splash()". Any workaround for this?
Error in py_module_import(module, convert = convert) :
ImportError: cannot import name 'NpipeAdapter' from 'docker.transport'
yep. use a more recent version of splashr. the pkg switched from using docker to stevedore so the python requirement is no longer there and shld work better.