R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper around the Headless Chrome CLI. The package has been updated more recently to enable downloading a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).
However, that nigh-mundane addition is not the only new feature in decapitated.
While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust, community-supported interface to Headless Chrome instrumentation dubbed puppeteer.
Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.
Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept, I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting Headless Chrome for Splash’s embedded browser engine.
A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and pages can then be created at will with no process-startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but it should be of sufficient utility to add to your web-scraping toolbox.
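To make the contrast concrete, here’s a rough sketch of the two modes, using decapitated’s CLI-oriented chrome_dump_dom() helper on one side and the gepetto service (shown in full later in this post) on the other; treat it as illustrative rather than canonical:

```r
library(decapitated)

# CLI route: every call shells out and launches a brand-new headless Chrome,
# so you pay the process-startup cost on each page
chrome_dump_dom("https://example.com")

# REST route: start the gepetto service once; Chrome stays resident and
# subsequent pages are created with no per-page startup overhead
gpid <- start_gepetto()
gepetto() %>% gep_render_html("https://example.com")
```
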
There are instructions over at the repo on installing gepetto, but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto(), which should (hopefully) go smoothly provided you have a fairly recent version of Node.js installed along with npm.
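For reference, the shortcut looks something like this (the GitHub repo path here is an assumption; substitute the GitLab remote if that’s your preference):

```r
# grab the development version of decapitated, then let it fetch gepetto;
# install_gepetto() needs a recent Node.js + npm on your PATH
devtools::install_github("hrbrmstr/decapitated")
decapitated::install_gepetto()
```
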
The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps ensure smoother sailing than using the version of Chrome you may use for browsing.
Working with gepetto
Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter
gepetto. You should see something like this:
$ gepetto
Launch browser!
gepetto running on: http://localhost:3000
NOTE: You can use a different port by setting the PORT environment variable accordingly before startup.
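For example (a sketch; assumes the gepetto launcher is on your PATH):

```shell
# serve gepetto on port 3001 instead of the default 3000
PORT=3001 gepetto
```
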
You can then go to http://localhost:3000 in your browser and should see this:
Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.
If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:
The Swagger definition JSON is also at http://localhost:3000/swagger.json.
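If you’d rather poke at the definition from a terminal, something like this should work while gepetto is running (pretty-printing via Python’s stdlib is just one convenient option):

```shell
# fetch and pretty-print the Swagger/OpenAPI definition
curl -s http://localhost:3000/swagger.json | python -m json.tool
```
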
The API documentation will be a bit more robust as the module’s corners are rounded out.
“But, this is supposed to be an R post…”
Yes. Yes it is.
If you followed along in the previous section and started
gepetto from a command-line interface, kill the running service and fire up your favourite R environment and let’s scrape some content!
library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>%
  gep_active()
## TRUE
Anything other than a “running” response means there’s something wrong, and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
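A quick sketch of that kind of post-mortem, using standard processx methods on the handle (whether stderr is readable depends on how start_gepetto() wired up the process):

```r
gpid$is_alive()          # FALSE means the service has died
gpid$get_exit_status()   # NULL while running; an exit code once it has exited
gpid$read_error_lines()  # recent stderr output, if stderr was piped at startup
```
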
gepetto() %>%
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%
  html_attr("src") %>%
  head(10)
##  "audio/2chainz_4.mp3"       "audio/2chainz_yeah2.mp3"
##  "audio/2chainz_tellem.mp3"  "audio/2chainz_tru.mp3"
##  "audio/2chainz_unh3.mp3"    "audio/2chainz_watchout.mp3"
##  "audio/2chainz_whistle.mp3" "audio/2pac_4.mp3"
##  "audio/2pac_5.mp3"          "audio/2pac_6.mp3"
Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with the Headless Chrome CLI directly.
We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:
gepetto() %>%
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72
Astute readers will notice it returns a magick object, so you can work with it immediately.
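Which means the usual magick verbs apply directly. A minimal sketch (the output filename is just an example):

```r
library(magick)
library(decapitated)

# capture the page, then treat the result like any other magick image
img <- gepetto() %>% gep_render_magick("https://nytimes.com/")

img %>%
  image_resize("400x") %>%            # scale the very tall capture down
  image_write("nyt-front-page.png")   # write it out as a PNG
```
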
I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since the gep_render_magick() call actually captured the entire page, which you can see for yourself (the thumbnail doesn’t do it justice).
gep_render_pdf() is an exercise left to the reader.
The gepetto REST API is at version 0.1.0, meaning it’s new, raw and likely to change (quickly, too). Jump on board in whichever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.