I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.
It also supports reading formats such as:
- Avro
- [CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
- Parquet
- Hadoop Sequence Files
It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.
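To give a flavor of that single-workstation workflow, here’s a minimal sketch using sergeant’s REST helpers against a hypothetical directory of local parquet files (drill_connection() and drill_query() are from the current dev version, and the dfs.tmp workspace and path are placeholders you’d adapt to your own storage configuration):

library(sergeant)

# talk to a drill-embedded / drill-localhost instance over the REST API
dc <- drill_connection("localhost")

# hypothetical path: aggregate straight off local parquet files
drill_query(dc, "
SELECT dst_ip, COUNT(*) AS n
FROM dfs.tmp.`/data/event_logs/*.parquet`
GROUP BY dst_ip
")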
Why sergeant?
There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.
The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around viewing and setting Drill options and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading, and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.
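A rough sketch of what that looks like with the dev-version REST helpers (the function names and signatures below are assumptions based on the current dev version and may shift before the CRAN release):

library(sergeant)
library(httr)

dc <- drill_connection("localhost")  # REST API endpoint

drill_active(dc)   # sanity check: is Drill up and responding?
drill_options(dc)  # view server-side Drill options

# cURL chatter only when you actually want it
httr::with_verbose(
  drill_query(dc, "SELECT * FROM cp.`employee.json` LIMIT 5")
)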
The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs its own DBI driver (like its second cousin Presto has) to avoid collisions with any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and to expose more of Drill’s built-in SQL functions.
SQL vs dplyr
For some truly complex data machinations you’re going to want to work at the SQL level, and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files, if only to appreciate how much gnashing of teeth dplyr saves you from. But using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:
SELECT gender , marital_status , COUNT(*) AS n
FROM cp.`employee.json`
GROUP BY gender , marital_status
in a drill-embedded or drill-localhost SQL shell? Or:
library(RJDBC)
library(dplyr)
library(sergeant)

# src_drill() currently rides on the Drill JDBC driver
ds <- src_drill("localhost:31010", use_zk=FALSE)

# point a dplyr tbl at Drill's built-in employee.json sample
db <- tbl(ds, "cp.`employee.json`")

count(db, gender, marital_status) %>% collect()
(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)
Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).
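For instance, here’s a sketch of chaining a few more verbs against the same tbl (it only uses the gender and marital_status fields from the example above and assumes the sample data’s “F”/“M” gender codes; the SQL that gets generated won’t be identical to what you’d hand-write, but it gets the job done):

db %>%
  filter(gender == "F") %>%   # becomes a WHERE clause
  count(marital_status) %>%   # becomes GROUP BY + COUNT(*)
  collect()                   # pull the result back into R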
sergeant and dplyr
The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).
You’ve said “parquet” a lot…
Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts of data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including the log and event data I have to work with quite a bit, and it’s great being able to prototype on a single workstation, then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:
CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts
FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;
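If you’d rather stay in R for that kind of housekeeping, the same statement can be shipped over the REST interface. Here’s a sketch (dfs.destination and dfs.source are whatever workspaces you’ve defined in your Drill storage configuration; drill_query() is the dev-version REST helper):

library(sergeant)

dc <- drill_connection("localhost")

# same CTAS as above, just driven from R
drill_query(dc, "
CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts
FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`
")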
Kick the tyres
The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully baked). There are plenty of examples (including a dplyr::left_join() between parquet and JSON data) in the README, and all the exposed functions have documentation.
File an issue with a feature request or bug report.
I expect to have this CRAN-able in January, 2017.