It’s been over a year since Headless Chrome was introduced, and in that time it has matured greatly and acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program: supply a few simple parameters and it will generate PDFs or images, or render javascript-interpreted HTML. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.
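To make that concrete, here is a rough sketch of driving Headless Chrome as a plain CLI tool from R via system2(). The flags are standard Headless Chrome switches; the binary name is a placeholder, since it varies by platform and install method:

# NOTE: the binary name is an assumption -- it may be "google-chrome",
# "chromium-browser", "chrome", etc. depending on your platform
chrome_bin <- "google-chrome"

# render a page to PDF
system2(chrome_bin, c("--headless", "--disable-gpu",
                      "--print-to-pdf=example.pdf", "https://example.com/"))

# capture the javascript-interpreted DOM as a character vector
dom <- system2(chrome_bin, c("--headless", "--disable-gpu", "--dump-dom",
                             "https://example.com/"), stdout = TRUE)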
R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice). However, that nigh-mundane addition is not the only new feature in decapitated.
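For reference, the existing CLI-wrapper workflow looks roughly like the sketch below. The function names are how I recall the decapitated API (including the newer Chromium download helper), so treat them as assumptions and double-check against the package docs:

library(decapitated)

# fetch a package-managed Chromium build instead of using the system Chrome;
# function name assumed -- verify against the package documentation
download_chromium()

chrome_version()                                  # confirm which binary will be used
doc <- chrome_read_html("https://example.com/")   # javascript-rendered page as an xml2 doc
chrome_shot("https://example.com/")               # grab a screenshot of the page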
Introducing gepetto
While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community-supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high-level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.
Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.
Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept, I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.
A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but it should be of sufficient utility to add to your web-scraping toolbox.
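To give a feel for the shape of that, here is a purely illustrative httr round-trip. The endpoint path below is hypothetical; the real routes are listed in the Swagger docs shown in the next sections:

library(httr)

# NOTE: "/render_html" is a made-up path for illustration only --
# consult the gepetto API docs (shown below) for the actual endpoints
res <- GET("http://localhost:3000/render_html",
           query = list(url = "https://example.com/"))
content(res, as = "text")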
Installing gepetto
There are instructions over at the repo on installing gepetto, but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto(), which should (hopefully) go as smoothly as this provided you have a fairly recent version of Node.js installed along with npm:
The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps to ensure smoother sailing than using the version of Chrome you may use for browsing.
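Spelled out, the R-side shortcut amounts to something like this; the repo location is an assumption, so check the package README for the canonical source:

# install the development version of decapitated, then let it install gepetto;
# repo location assumed -- see the package README for the canonical one
devtools::install_github("hrbrmstr/decapitated")
decapitated::install_gepetto()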
Working with gepetto
Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:
$ gepetto
? Launch browser!
? gepetto running on: http://localhost:3000
NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.
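If you start the service from R (as shown further down), the same idea applies, assuming the launcher inherits the R session’s environment (which is the processx default):

# set these before launching; child processes started from R inherit them by default
Sys.setenv(HOST = "127.0.0.1", PORT = "4000")
gpid <- start_gepetto()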
You can then go to http://localhost:3000 in your browser and should see this:
Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.
If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:
The Swagger definition JSON is also at http://localhost:3000/swagger.json.
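Since the Swagger definition is plain JSON, you can also pull the endpoint list straight into R. This assumes jsonlite is installed and that the definition follows the usual Swagger layout with a top-level paths object:

library(jsonlite)

swagger <- fromJSON("http://localhost:3000/swagger.json")
names(swagger$paths)   # the available API routes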
The API documentation will be a bit more robust as the module’s corners are rounded out.
“But, this is supposed to be an R post…”
Yes. Yes it is.
If you followed along in the previous section and started gepetto from a command-line interface, kill the running service, fire up your favourite R environment, and let’s scrape some content!
library(rvest)
library(decapitated)
library(tidyverse)
gpid <- start_gepetto()
gpid
## PROCESS 'gepetto', running, pid 60827.
gepetto() %>%
  gep_active()
## [1] TRUE
Anything other than a “running” response means there’s something wrong, and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
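For the curious, here are a few of the processx methods that help with diagnosis. Whether the error stream is actually captured depends on how start_gepetto() wires up the process, so treat the read_error_lines() call as an assumption:

gpid$is_alive()          # TRUE while the service is up
gpid$get_exit_status()   # NULL while running, an exit code if it died
gpid$read_error_lines()  # startup errors, if stderr was piped by start_gepetto()
gpid$kill()              # stop the service when you're done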
Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().
gepetto() %>%
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%
  html_attr("src") %>%
  head(10)
## [1] "audio/2chainz_4.mp3" "audio/2chainz_yeah2.mp3"
## [3] "audio/2chainz_tellem.mp3" "audio/2chainz_tru.mp3"
## [5] "audio/2chainz_unh3.mp3" "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3" "audio/2pac_4.mp3"
## [9] "audio/2pac_5.mp3" "audio/2pac_6.mp3"
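Since the tidyverse is already loaded, it’s a short hop from that character vector to a tidy summary. The artist extraction below assumes the artist name is the chunk of the filename before the first underscore, which is what the sample output above suggests:

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%
  html_attr("src") -> srcs

tibble(src = srcs) %>%
  mutate(artist = str_extract(basename(src), "^[^_]+")) %>%
  count(artist, sort = TRUE)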
Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.
We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:
gepetto() %>%
  gep_render_magick("https://nytimes.com/")
## format width height colorspace matte filesize density
## 1 PNG 1440 6828 sRGB TRUE 0 72x72
Astute readers will notice it returns a magick object so you can work with it immediately.
I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since gep_render_magick() actually captured the entire page, which you can see for yourself (the thumbnail doesn’t do it justice).
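Until that lands, magick itself can do the trimming. A minimal sketch, re-capturing the page and keeping roughly the first screenful (the geometry strings are just illustrative):

library(magick)

gepetto() %>%
  gep_render_magick("https://nytimes.com/") -> nyt_png

nyt_png %>%
  image_crop("1440x900+0+0") %>%   # keep the top of the full-page capture
  image_scale("600")               # scale down to a 600px-wide thumbnail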
Testing gep_render_pdf() is an exercise left to the reader.
FIN
The gepetto REST API is at version 0.1.0, meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.