While the future of the Apache Drill ecosystem is somewhat in-play (MapR — a major sponsoring org for the project — is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some great improvements, including JSON resultset streaming for the REST API. Alas, tweaking {sergeant} (my REST API R package) to handle that is not on the TODO for the foreseeable future, so I’ve been using {sergeant.caffeinated} — https://github.com/hrbrmstr/sergeant-caffeinated — (a RJDBC wrapper for the Drill JDBC interface) for quite a while since it handles large resultsets quite nicely.
I broke out the RJDBC functionality from {sergeant} into this separate package since, despite the fact that it’s 2019/2020, many folks still have/had problems getting {rJava} to work (FWIW it’s a seamless install for me on Windows, Ubuntu, or macOS, even Apple Silicon macOS). The surgery to separate it was fairly hack-ish (one reason it’s not on CRAN) and it finally broke with the recent {dbplyr} 2.x release. I assumed fixing the caffeinated version was easier/quicker than the REST API version, so I dug in and am cautiously tossing it out for wider poking.
An All New Way To Use 💂☕️
Gone are the days of src_drill_jdbc()
, but enter in the new term of more standardized {DBI} and {d[b]plyr} access to Apache Drill. To install this version you can do:
remotes::install_github("hrbrmstr/sergeant-caffeinated")
(more install options using safer and saner social coding sites coming soon).
Let’s load up the package(s) and perform some operations.
library(sergeant.caffeinated)
test_host <- Sys.getenv("DRILL_TEST_HOST", "localhost")
be_quiet()
(con <- dbConnect(drv = DrillJDBC(), sprintf("jdbc:drill:zk=%s", test_host)))
## <DrillJDBCConnection>
The DRILL_TEST_HOST
environment variable contains the hostname or IP address of my/your Drill server, defaulting to localhost
if none is found.
The be_quiet()
function stops the Java engine from yelling at you with “illegal reflective access” warnings. If you see this in other rJava-powered packages it means code in some classes in some Java archive files are doing some sketchy old-school things that newer JVMs aren’t happy about. At some point, these warnings become full-on errors which will break many things. Unfortunately, Drill is still fairly tied to Java 8.x and has tons of introspecting code. The errors are ugly, so if you want to get rid of them, just call this function before doing anything with Drill. (You’ll also notice log4j errors are finally gone!)
Now that we have a Drill JDBC connection, we can do something with it. All the DBI-ish operations work, but it’s 2020 and {d[b]ply} is the bee’s knees, so we’ll just dive right in with that:
(db <- tbl(con, "cp.`employee.json`"))
## # Source: table<cp.`employee.json`> [?? x 16]
## # Database: DrillJDBCConnection
## employee_id full_name first_name last_name position_id position_title store_id
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Sheri No… Sheri Nowmer 1 President 0
## 2 2 Derrick … Derrick Whelply 2 VP Country Ma… 0
## 3 4 Michael … Michael Spence 2 VP Country Ma… 0
## 4 5 Maya Gut… Maya Gutierrez 2 VP Country Ma… 0
## 5 6 Roberta … Roberta Damstra 3 VP Informatio… 0
## 6 7 Rebecca … Rebecca Kanagaki 4 VP Human Reso… 0
## 7 8 Kim Brun… Kim Brunner 11 Store Manager 9
## 8 9 Brenda B… Brenda Blumberg 11 Store Manager 21
## 9 10 Darren S… Darren Stanz 5 VP Finance 0
## 10 11 Jonathan… Jonathan Murraiin 11 Store Manager 1
## # … with more rows, and 9 more variables: department_id <dbl>, birth_date <chr>,
## # hire_date <chr>, salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## # marital_status <chr>, gender <chr>, management_role <chr>
Basically, that’s it: it “just works”.
FIN
If you’ve been a user of {sergeant.caffeinated} and really need src_drill_jdbc()
back, drop an issue on GH or a note in the comments, and be sure to file issues if I’ve missed anything as you kick the tyres.
One Comment
Nice. I had inadvertently bumped some dependencies and was hitting quoting issues with the REST interface in sergeant. I had been meaning to roll back and trace back what was causing the problem, but it looks like you’ve already got it manged. Thanks!
3 Trackbacks/Pingbacks
[…] *** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2020/11/20/updated-apache-drill-r-jdbc-interface-package-sergeant-caffeinated-with-… […]
[…] by data_admin [This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page […]
[…] While the future of the Apache Drill ecosystem is somewhat in-play (MapR — a major sponsoring org for the project — is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some… Continue reading → […]