Introducing ‘gepetto’ — a Splash-like REST API to Headless Chrome

It’s been over a year since Headless Chrome was introduced. It has matured greatly in that time and has acquired a pretty large user base. The TL;DR is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or javascript-rendered HTML by supplying some simple parameters. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.
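As a rough illustration of that CLI usage, the sketch below builds the argument vector for a few common one-shot runs. The flags (--headless, --disable-gpu, --dump-dom, --screenshot, --print-to-pdf) are the ones documented for Headless Chrome. The helper name chrome_args() is an assumption made for this example and is not part of any package.

```r
# Hypothetical helper (not part of decapitated): build the CLI arguments
# for a one-shot Headless Chrome run.
chrome_args <- function(url, mode = c("html", "png", "pdf"), out = NULL) {
  mode <- match.arg(mode)
  action <- switch(
    mode,
    html = "--dump-dom",
    png  = if (is.null(out)) "--screenshot" else paste0("--screenshot=", out),
    pdf  = if (is.null(out)) "--print-to-pdf" else paste0("--print-to-pdf=", out)
  )
  c("--headless", "--disable-gpu", action, url)
}

chrome_args("https://example.com/", "pdf", out = "example.pdf")
```

The resulting vector can be handed to system2() along with the path to a Chrome or Chromium binary, which varies by platform.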

R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).

However, that nigh-mundane addition is not the only new feature in decapitated.

Introducing gepetto

While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community-supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high-level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.

Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.

Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.

A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but should be of sufficient utility to add to your web-scraping toolbox.

Installing gepetto

There are instructions over at the repo on installing gepetto, but R users can try a shortcut: grab the latest version of decapitated from Git[La|Hu]b and run decapitated::install_gepetto(). Provided you have a fairly recent version of Node.js installed along with npm, that should (hopefully) go smoothly.

The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps to ensure smoother sailing than using the version of Chrome you may use for browsing.

Working with gepetto

Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:

$ gepetto
Launch browser!
gepetto running on: http://localhost:3000

NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.
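If you do change the host or port, anything that talks to the service needs the matching base URL. The sketch below reads the same HOST and PORT environment variables from R and falls back to the defaults shown in the startup output. The helper name gepetto_base_url() is hypothetical and not part of decapitated.

```r
# Hypothetical helper: derive the gepetto base URL from the HOST and PORT
# environment variables, falling back to localhost:3000 when they are unset.
gepetto_base_url <- function() {
  host <- Sys.getenv("HOST", unset = "localhost")
  port <- Sys.getenv("PORT", unset = "3000")
  sprintf("http://%s:%s", host, port)
}

# Example: point at a non-default host and port.
Sys.setenv(HOST = "127.0.0.1", PORT = "8080")
gepetto_base_url()
```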

You can then go to http://localhost:3000 in your browser, where you should see the gepetto web interface.

Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.

If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints.

The Swagger definition JSON is also at http://localhost:3000/swagger.json.
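Since the definition is plain JSON, it can be inspected from R as well. Here is a minimal sketch that lists the endpoint paths in a parsed Swagger definition. The list_endpoints() helper and the demo paths are made up for illustration; the only assumption about gepetto itself is that its definition uses the standard top-level "paths" key.

```r
library(jsonlite)

# List the endpoint paths declared in a parsed Swagger definition.
# Swagger documents keep them under the top-level "paths" key.
list_endpoints <- function(spec) {
  sort(names(spec$paths))
}

# A tiny stand-in definition with made-up paths, used here so the example
# runs without a server.
demo_spec <- fromJSON('{"swagger": "2.0", "paths": {"/b": {}, "/a": {}}}')
list_endpoints(demo_spec)

# Against a running gepetto instance (assumes the default host and port):
# list_endpoints(fromJSON("http://localhost:3000/swagger.json"))
```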

The API documentation will be a bit more robust as the module’s corners are rounded out.

“But, this is supposed to be an R post…”

Yes. Yes it is.

If you followed along in the previous section and started gepetto from a command-line interface, kill the running service, fire up your favourite R environment, and let’s scrape some content!

library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>% 
  gep_active()
## [1] TRUE

Anything other than a “running” response means there’s something wrong and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
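For the inspection step, something like the following sketch may help. It relies only on standard processx methods (is_alive(), get_exit_status(), has_error_connection() and read_error_lines()). The gepetto_status() helper is hypothetical, and whether stderr is actually piped depends on how start_gepetto() launches the process, which is why the helper checks first.

```r
# Hypothetical helper: summarise the state of a processx process handle
# such as the gpid object returned by start_gepetto().
gepetto_status <- function(proc) {
  alive <- proc$is_alive()
  list(
    alive = alive,
    # The exit status is only meaningful once the process has stopped.
    exit_status = if (alive) NA_integer_ else proc$get_exit_status(),
    # Only read stderr if the process was started with it piped.
    stderr = if (proc$has_error_connection()) {
      proc$read_error_lines()
    } else {
      character(0)
    }
  )
}
```

Calling str(gepetto_status(gpid)) then gives a quick view of whether the service died and, if the error stream is available, what it said.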

Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().

gepetto() %>% 
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%  
  html_attr("src") %>% 
  head(10)
## [1] "audio/2chainz_4.mp3"        "audio/2chainz_yeah2.mp3"   
## [3] "audio/2chainz_tellem.mp3"   "audio/2chainz_tru.mp3"     
## [5] "audio/2chainz_unh3.mp3"     "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3"  "audio/2pac_4.mp3"          
## [9] "audio/2pac_5.mp3"           "audio/2pac_6.mp3"
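For contrast, here is a small offline sketch of why a plain xml2::read_html() call comes up empty on pages like this. The HTML is a made-up stand-in, not the real therapboard.com markup: its audio element only exists once a browser runs the script.

```r
library(xml2)
library(rvest)

# A made-up page that only adds its <source> element when a browser
# executes the script.
static_html <- '<html><body><div id="board"></div>
<script>
  document.getElementById("board").innerHTML = "<source src=audio/example.mp3>";
</script>
</body></html>'

# read_html() parses markup but never runs javascript, so the same XPath
# finds nothing in the unrendered document.
static_doc <- read_html(static_html)
static_hits <- html_nodes(static_doc, xpath = ".//source[contains(@src, 'mp3')]")
length(static_hits)
```

The script body is treated as text rather than markup, so the element never makes it into the parsed tree unless something like Headless Chrome renders the page first.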

Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.

We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:

gepetto() %>% 
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72

Astute readers will notice it returns a magick object so you can work with it immediately.

I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that because gep_render_magick() actually captured the entire page, which is why the image above is 6,828 pixels tall.
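Because the result is an ordinary magick object, a full-page capture can be trimmed after the fact. The sketch below keeps only the top of the page; crop_to_top() is a hypothetical helper, and a blank image with the same dimensions stands in for the real capture so the example runs without a server.

```r
library(magick)

# Hypothetical helper: keep only the top `height` pixels of a full-page
# capture, at the capture's full width.
crop_to_top <- function(img, height = 900) {
  info <- image_info(img)
  geometry <- sprintf("%dx%d+0+0", as.integer(info$width[1]), as.integer(height))
  image_crop(img, geometry)
}

# Stand-in for a gep_render_magick() result: a blank image with the same
# dimensions as the capture above.
page <- image_blank(width = 1440, height = 6828, color = "white")
image_info(crop_to_top(page))[, c("width", "height")]
```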

Testing gep_render_pdf() is an exercise left to the reader.

FIN

The gepetto REST API is at version 0.1.0 meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.
