htmlunitjars Updated to 2.34.0

The in-dev htmlunit package for javascript-“enabled” web-scraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-“enabled” sites. The Chrome emulation is now also on-par with Chrome 72… Continue reading

splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days). The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with… Continue reading

‘data:’ Scraping & Chart Reproduction : Arrows of Environmental Destruction

Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past the vast majority of its regulatory corpus is quite sane and has… Continue reading

More “Scraping Ethics Gone Awry” and “Why Do This When There’s a Free API?”

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this). Today, they sent me this gem by @JeromeDeveloper: Scrapy and… Continue reading

In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

The development version of splashr now support authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an… Continue reading

Identify & Analyze Web Site Tech Stacks With rappalyzer

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities),… Continue reading