I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it. The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The… Continue reading
Post Category → web scraping
htmlunitjars Updated to 2.34.0
The in-dev htmlunit package for javascript-“enabled” web-scraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library, and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-“enabled” sites. The Chrome emulation is now also on par with Chrome 72… Continue reading
splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration
The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days). The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with… Continue reading
‘data:’ Scraping & Chart Reproduction : Arrows of Environmental Destruction
Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdan on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past, the vast majority of its regulatory corpus is quite sane and has… Continue reading
More “Scraping Ethics Gone Awry” and “Why Do This When There’s a Free API?”
I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this). Today, they sent me this gem by @JeromeDeveloper: Scrapy and… Continue reading
Introducing ‘gepetto’ — a Splash-like REST API to Headless Chrome
It’s been over a year since Headless Chrome was introduced and it has matured greatly over that time and has acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or render javascript-interpreted HTML by supplying… Continue reading
In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium
The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an… Continue reading
Yet-Another-Power Outages Post : Full Tidyverse Edition
This past weekend, violent windstorms raged through New England. We — along with over 500,000 other Mainers — went “dark” in the wee hours of Monday morning and (this post was published on Thursday AM) we still have no utility-provided power nor high-speed internet access. The children have turned iFeral, and being a remote worker… Continue reading