In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an… Continue reading

Connecting Apache Zeppelin Up to Amazon Athena with an IAM Profile Name

Apache Zeppelin is a “notebook” alternative to Jupyter (and other) notebooks. It supports a plethora of kernels/interpreters and can do a ton of things that this post isn’t going to discuss (perhaps future ones will, especially since it’s the first “notebook” environment I’ve been able to tolerate for longer than a week). One really cool… Continue reading

Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

Continuing the blog’s UDF theme of late, there are two new UDF kids in town: drill-url-tools, for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post), and drill-domain-tools, for slicing & dicing internet domain names (IDNs). Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you… Continue reading

A new ‘boto3’ Amazon Athena client wrapper with dplyr async query support

A previous post explored how to deal with Amazon Athena queries asynchronously. The function presented is a beast, though it is on purpose (to provide options for folks). In reality, nobody really wants to use rJava wrappers much anymore and dealing with icky Python library calls directly just feels wrong, plus Python functions often return… Continue reading