I recently mentioned that I’ve been working on a development version of an Apache Drill R package called
sergeant. Here’s a lifted “TLDR” on Drill:
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.
It also supports reading formats such as:
- [CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
- Hadoop Sequence Files
It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.
There’s already an R package on CRAN to work with Drill:
DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks
httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.
sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an
httr progress bar for data downloading and you can wrap the calls with
httr::with_verbose(…) if you really like seeing cURL messages.
The other thing
sergeant has going for it is a nascent
dplyr interface. Presently, this is a hack-ish wrapper around the
JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs it’s own DBI driver (like is second-cousin Preso has) to avoid collisions withy any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.
For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth
dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:
SELECT gender , marital_status , COUNT(*) AS n FROM cp.`employee.json` GROUP BY gender , marital_status
drill-localhost SQL shell? Or:
library(RJDBC) library(dplyr) library(sergeant) ds <- src_drill("localhost:31010", use_zk=FALSE) db <- tbl(ds, "cp.`employee.json`") count(db, gender, marital_status) %>% collect()
(NOTE: that SQL statement is what ultimately gets sent to Drill from
tbl_df idioms don’t translate 1:1 to all other
src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).
src_drill() function uses the JDBC Drill driver and, hence, has an
RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a
DBI interface around their REST API to facilitate the use of
dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the
RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).
You’ve said “parquet” alot…
Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:
CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;
Kick the tyres
The REST and JDBC functions are solid (I’ve been using them at work for a while) and the
dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a
dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.
File an issue with a feature request or bug report.
I expect to have this CRAN-able in January, 2017.
Pingback: sergeant : An R Boot Camp for Apache Drill - Use-R!Use-R!
Pingback: sergeant : An R Boot Camp for Apache Drill – Mubashir Qasim
Pingback: sergeant : An R Boot Camp for Apache Drill | A bunch of data