Skip navigation

Category Archives: Apache Drill

I used reading a directory of CSVs as the foundational example in my recent post on idioms.

During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill.

I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.

You can get ahead of those posts by doing two things:

  1. Download, install and test your Apache Drill setup (it’s literally 10 minutes on any platform)
  2. Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example)

My goals for this post are really to just to pique your interest enough in Drill and parquet files (yes, I’m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it’s worth your time to do Step #1 above.

Getting EPA Air Quality Data

The EPA has air quality data going back to 1990 (so, 27 files as of this post). They’re ~1-4MB ZIP compressed and ~10-30MB uncompressed.

You can use the following code to grab them all with the caveat that the libcurl method of performing simultaneous downloads caused some pretty severe issues — like R crashing — for some of my students who use Windows. There are plenty of examples for doing sequential downloads of a list of URLs out there that folks should be able to get all the files even if this succinct method does not work on your platform.

dir.create("airq")

urls <- sprintf("https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/annual_all_%d.zip", 1990L:2016L)
fils <- sprintf("airq/%s", basename(urls))

download.file(urls, fils, method = "libcurl")

I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size and I’ve got a super-fast internet connection (no long-lived sockets) so it should be fine.

Putting all those files under the “control” of Drill is what the next post is for. For now, i’m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it’s a nice, compact task that’s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.

Data Setup

I’ve converted the EPA annual ZIP files into bzip2 format. ZIP is fine for storage and downloads but it’s not a great format for data analysis tasks. gzip would be slightly faster but it’s not easily splittable and — even though I’m not using the data in a Hadoop context — I think it’s wiser to not have to re-process data later on if I ever had to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there’s no need to waste space.

All the following files are in a regular filesystem directory accessible to both Drill and R:

> (epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2"))
 [1] "annual_all_1990.csv.bz2" "annual_all_1991.csv.bz2" "annual_all_1992.csv.bz2"
 [4] "annual_all_1993.csv.bz2" "annual_all_1994.csv.bz2" "annual_all_1995.csv.bz2"
 [7] "annual_all_1996.csv.bz2" "annual_all_1997.csv.bz2" "annual_all_1998.csv.bz2"
[10] "annual_all_1999.csv.bz2" "annual_all_2000.csv.bz2" "annual_all_2001.csv.bz2"
[13] "annual_all_2002.csv.bz2" "annual_all_2003.csv.bz2" "annual_all_2004.csv.bz2"
[16] "annual_all_2005.csv.bz2" "annual_all_2006.csv.bz2" "annual_all_2007.csv.bz2"
[19] "annual_all_2008.csv.bz2" "annual_all_2009.csv.bz2" "annual_all_2010.csv.bz2"
[22] "annual_all_2011.csv.bz2" "annual_all_2012.csv.bz2" "annual_all_2013.csv.bz2"
[25] "annual_all_2014.csv.bz2" "annual_all_2015.csv.bz2" "annual_all_2016.csv.bz2"

Drill can directly read plain or compressed JSON, CSV and Apache web server log files plus can treat a directory tree of them as a single data source. It can also read parquet & avro files (both are used frequently in distributed “big data” setups) and access MySQL, MongoDB and other JDBC resources as well as query data stored in Amazon S3 and HDFS (I’ve already mentioned it works fine in plain ‘ol filesystems, too).

I’ve tweaked my Drill configuration to support reading column header info from .csv files (which I’ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. COLUMN[0]) so having named columns makes for less verbose query code.

I turned those individual bzip2 files into parquet format with one Drill query:

CREATE TABLE dfs.pq.`/epa/annual.parquet` AS 
  SELECT * FROM dfs.csv.`/epa/annual/*.csv.bz2`

Future posts will explain the dfs... component but they are likely familiar path specifications for folks used to Spark and are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test) followed by the tail end path spec to the subdirectories and/or target files. That one statement said ‘take all the CSV files in that directory and make one big table out of them”.

The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We’ll see how that speeds up things in a bit.

Benchmark Setup

The tests were performed on a maxed out 2016 13″ MacBook Pro.

There are 55 columns of data in the EPA annual summary files.

To give both read_csv and fread some benchmark boosts, we’ll define the columns up-front and pass those in to each function on data ingestion and I’ll leave them out of this post for brevity (they’re just a cols() specification and colClasses vector). Drill gets no similar help for this at least when it comes to CSV processing.

I’m also disabling progress & verbose reporting in both fread and read_csv despite not stopping Drill from writing out log messages.

Now, we need some setup code to connect to drill and read in the list of files, plus we’ll setup the five benchmark functions to read in all the files and get the list of distinct years from each.

library(sergeant)
library(data.table)
library(tidyverse)

(epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2", full.names = TRUE))

db <- src_drill("localhost")

# Remember, defining ct & ct_dt - the column types specifications - have been left out for brevity

mb_drill_csv <- function() {
  epa_annual <- tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")
  select(epa_annual, Year) %>% 
    distinct(Year) %>% 
    collect()
}

mb_drill_parquet <- function() {
  epa_annual_pq <- tbl(db, "dfs.pq.`/epa/annual.parquet`")
  select(epa_annual_pq, Year) %>% 
    distinct(Year) %>% 
    collect()
}

mb_tidyverse <- function() {
  map_df(epa_annual_fils, read_csv, col_types = ct, progress = FALSE) -> tmp
  unique(tmp$Year)
}

mb_datatable <- function() {
  rbindlist(
    lapply(
      epa_annual_fils, function(x) { 
        fread(sprintf("bzip2 -c -d %s", x), 
              colClasses = ct_dt, showProgress = FALSE, 
              verbose = FALSE) })) -> tmp
  unique(tmp$Year)
}

mb_rda <- function() {
  read_rds("~/Data/rds/epa/annual.rds") -> tmp
  unique(tmp$Year)
}

microbenchmark(
  csv = { mb_drill_csv()     },
   pq = { mb_drill_parquet() },
   df = { mb_tidyverse()     },
   dt = { mb_datatable()     },
  rda = { mb_rda()           },
  times = 5
) -> mb

Yep, it’s really as simple as:

tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")

to have Drill treat a directory tree as a single table. It’s also not necessary for all the columns to be in all the files (i.e. you get the bind_rows/map_df/rbindlist behaviour for “free”).

I’m only doing 5 evaluations here since I don’t want to make you wait if you’re going to try this at home now or after the Drill series. I’ve run it with a more robust benchmark configuration and the results are aligned with this one.

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5
   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5
   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5
   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5
  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5

The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they’re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).

Drill’s CSV slurping handily beats the other R methods even with some handicaps the others did not have.

This particular example is gamed a bit, which helped parquet to ultimately “win”. Since Drill can target the singular column (Year) that was asked for, it doesn’t need to read all the extra columns just to compute the final product (the distinct list of years).

IMO both the Drill CSV ingestion and Drill parquet access provide compelling enough use-cases to use them over the other three methods, especially since they are easily transferrable to remote Drill servers or clusters with virtually no code changes. A single node Drillbit (like R) is constrained by the memory on that individual system, so it’s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.

FIN

I’m sure some member of some other tribe will come up with an example that proves superiority of their particular tribal computations. I’m hoping one of those tribes is the R/Spark tribe so that can get added into the mix (using Spark standalone is much like using Drill, but with more stats/ML functions directly available).

I’m hopeful that this post has showcased enough of Drill’s utility to general R users that you’ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial having both a precision tools as well as a Swiss Army knife — which is what Drill really is — handy.

You can find the sergeant package on GitHub.

2021-11-04 UPDATE: Just use {arrow}.


Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w/o needing a Hadoop cluster but scale up almost effortlessly). It’s also a bit more lightweight than Spark and a nice alternative to Spark if you only need data wrangling and not the functionality in Spark’s MLlib.

When you’re in this larger-data world, parquet files are one of the core data storage formats. They’re designed to be compact and are optimized for columnar operations. Unlike CSV, JSON files or even R Data files, it’s not necessary to read or scan an entire parquet file to filter, select, aggregate, etc across columns. Unfortunately, parquet files aren’t first-class citizens in R. Well, they aren’t now, but thanks to this project it might not be too difficult to make an R interface to them. But, for now, you have to use some other means to convert or read parquet files.

Spark and sparklyr can help you write parquet files but I don’t need to run Spark all the time.

If you’re already a Drill user, you already know how easy it is to make parquet files with Drill:

CREATE TABLE dfs.tmp.sampleparquet AS 
  (SELECT trans_id, 
   cast(`date` AS date) transdate, 
   cast(`time` AS time) transtime, 
   cast(amount AS double) amountm,
   user_info, marketing_info, trans_info 
   FROM dfs.`/Users/drilluser/sample.json`);

If you’re not used to SQL, that may seem very ugly/foreign/verbose to you and you can thank Hadley for designing a better grammar of tidyness that seamlessly builds SQL queries like that behind the scenes for you. That SQL statement uses a JSON file as a data source (which you can do with Drill) make sure the field data types are correct by explicitly casting them to SQL data types (which is a good habit to get into even if it is verbose) and then tells Drill to make a parquet file (it’s actually a directory of parquet files) from it.

I’ve been working on an R package — sergeant — that provides RJDBC, direct REST and dplyr interfaces to Apache Drill for a while now. There are a number of complexities associated with creating a function to help users make parquet files from R data frames in Drill (which is why said function still does not exist in sergeant):

  • Is Drill installed or does there need to be a helper set of functions for installing and running Drill in embedded mode?
  • Even if there’s a Drill cluster running, does the user — perhaps — want to do the conversion locally in embedded mode? Embedded is way easier since all the files are local. The only real way to convert a data frame to Drill is to save a data frame to a temporary, interim file and them have Drill read it in. In a cluster mode where your local filesystem is not part of the cluster, that would mean finding the right way to get the file to the cluster. Which leads to the next item…
  • Where does the user want the necessary temporary files stored? Local dfs. file system? HDFS?
  • Do we need two different methods? One for quick conversion and one that forces explicit column data type casting?
  • Do we need to support giving the user explicit casting control and column selection capability?
  • Who put the bomp in the bomp, bomp, bomp?

OK, perhaps not that last one (but I think it still remains a mystery despite claims by Jan and Dean).

It’s difficult to wrap something like that up in a simple package that will make 80% of the possible user-base happy (having Drill and Spark operate behind the scenes like “magic” seems like a bad idea to me despite how well sparklyr masks the complexity).

As I continue to work that out (you are encouraged to file an issue with your opines on it at the gh repo) here’s a small R script that you can use it to turn R data frames into parquet files:

library(sergeant)
library(tidyverse)

# make a place to hold our temp files
# this is kinda super destructive. make sure you have the path right
unlink("/tmp/pqtrans", recursive=TRUE, force=TRUE)
dir.create("/tmp/pqtrans", showWarnings=FALSE)

# save off a large-ish tibble
write_csv(nycflights13::flights, "/tmp/pqtrans/flights.csvh")

# connect to drill
db <- src_drill("localhost")

# make the parquet file
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/pqtrans/flights.parquet` AS SELECT * FROM dfs.tmp.`/pqtrans/flights.csvh`
")
## # A tibble: 1 × 2
##   `Number of records written` Fragment
## *                       <int>    <chr>
## 1                      336776      0_0

# prove we did it
list.files("/tmp/pqtrans", recursive=TRUE, include.dirs=TRUE)
## [1] "flights.csvh"                  "flights.parquet"              
## [3] "flights.parquet/0_0_0.parquet"

# prove it again
flights <- tbl(db, "dfs.tmp.`/pqtrans/flights.parquet`")

flights
## Source:   query [?? x 19]
## Database: Drill 1.9.0 [localhost:8047] [8GB direct memory]
## 
##    flight arr_delay distance  year tailnum dep_time sched_dep_time origin
##     <int>     <dbl>    <dbl> <int>   <chr>    <int>          <int>  <chr>
## 1    1545        11     1400  2013  N14228      517            515    EWR
## 2    1714        20     1416  2013  N24211      533            529    LGA
## 3    1141        33     1089  2013  N619AA      542            540    JFK
## 4     725       -18     1576  2013  N804JB      544            545    JFK
## 5     461       -25      762  2013  N668DN      554            600    LGA
## 6    1696        12      719  2013  N39463      554            558    EWR
## 7     507        19     1065  2013  N516JB      555            600    EWR
## 8    5708       -14      229  2013  N829AS      557            600    LGA
## 9      79        -8      944  2013  N593JB      557            600    JFK
## 10    301         8      733  2013  N3ALAA      558            600    LGA
## # ... with more rows, and 11 more variables: sched_arr_time <int>,
## #   dep_delay <dbl>, dest <chr>, minute <dbl>, carrier <chr>, month <int>,
## #   hour <dbl>, arr_time <int>, air_time <dbl>, time_hour <dttm>,
## #   day <int>

# work with the drill parquet file
count(flights, year, origin) %>%
  collect()
## Source: local data frame [3 x 3]
## Groups: year [1]
## 
##    year origin      n
## * <int>  <chr>  <int>
## 1  2013    EWR 120835
## 2  2013    LGA 104662
## 3  2013    JFK 111279

That snippet:

  • assumes Drill is running, which is really as easy as entering drill-embedded at a shell prompt, but try out Drill in 10 Minutes if you don’t believe me
  • dfs.tmp points to /tmp (i.e. you need to modify that if yours doesn’t…see, I told you this wasn’t simple)
  • assumes we’re OK with letting Drill figure out column types
  • assumes we want ALL THE COLUMNS
  • uses the .csvh extension which tells Drill to read the column names from the first line so we don’t have to create the schema from scratch
  • is slow because of ↑ due to the need to create the csvh file first
  • exploits the fact that we can give dplyr the cold shoulder and talk directly to Drill anytime we feel like it with DBI calls by using the $con list field (the dbGetQuery(db$con, …) line).

It’s a naive and destructive snippet, but does provide a means to get your data frames into parquet and into Drill.

Most of my Drill parquet needs are converting ~20-100K JSON files a day into parquet, which is why I haven’t focused on making a nice interface for this particular use case (data frame to parquet) in R. Ultimately, I’ll likely go the “wrap parquet-cpp route” (unless you’re working on that, which — if you are — you should @-ref me in that gh-repo of yours so I can help out). But, if having a sergeant function to do this conversion would help you, drop an issue in the repo.

I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.

It also supports reading formats such as:

  • Avro
  • [CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
  • Parquet
  • Hadoop Sequence Files

It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.

Why sergeant?

There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.

The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.

The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs it’s own DBI driver (like is second-cousin Preso has) to avoid collisions withy any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.

SQL vs dplyr

For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:

SELECT  gender ,  marital_status , COUNT(*) AS  n 
FROM  cp.`employee.json` 
GROUP BY  gender ,  marital_status

in a drill-embedded or drill-localhost SQL shell? Or:

library(RJDBC)
library(dplyr)
library(sergeant)

ds <- src_drill("localhost:31010", use_zk=FALSE)

db <- tbl(ds, "cp.`employee.json`") 

count(db, gender, marital_status) %>% collect()

(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)

Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).

sergeant and dplyr

The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).

You’ve said “parquet” alot…

Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:

CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts 
  FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;

Kick the tyres

The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.

File an issue with a feature request or bug report.

I expect to have this CRAN-able in January, 2017.