
Category Archives: drill

Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture [that] enables connectivity to multiple datastores.

To ring in the new year, the Drill team knocked out a new 1.15.0 release with a raft of new functionality.

One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL|GH] and make some accommodations for some of these new features.

Ushering In A New Order

Drill REST API queries return a "columns" field and "metadata" field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite to parse the results and there’s no support for 64-bit integers in jsonlite (more on this later).
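
Here is roughly what the relevant bits of a Drill 1.15 REST /query.json response look like, and how that ordering can be applied (a trimmed sketch; the field values below are illustrative):

res <- jsonlite::fromJSON('{
  "columns":  ["employee_id", "full_name"],
  "metadata": ["BIGINT", "VARCHAR"],
  "rows": [
    {"full_name": "Sheri Nowmer",    "employee_id": "1"},
    {"full_name": "Derrick Whelply", "employee_id": "2"}
  ]
}')

res$rows[, res$columns] # put the columns back in the order Drill reported them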

We’ll use the example from DRILL-6847 provided by Charles Givre in his Jira issue since it will let me demonstrate more of that “mostly” comment and show off another new feature:

library(sergeant) # 0.8.0 branch of sergeant on gitlab or github
library(tidyverse)

con <- src_drill("localhost")

x <- tbl(con, "cp.`employee.json`")

mutate(x, employee_id = as.integer64(employee_id)) %>% 
  mutate(position_id = as.integer64(position_id)) %>% 
  select(
    employee_id, full_name, first_name, last_name, 
    position_id, position_title
  ) -> bigint_result

The above is (logically):

SELECT 
  CAST (employee_id AS BIGINT) AS employee_id,
  full_name,
  first_name, 
  last_name, 
  CAST (position_id AS BIGINT) AS position_id, 
  position_title 
FROM cp.`employee.json`
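
(You can also sanity-check that translation yourself; dbplyr's show_query() should print the SQL a lazy tbl will send to Drill.)

bigint_result %>% 
  dplyr::show_query()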

What do we get when we take a preview of the result?

bigint_result
## # Source:   lazy query [?? x 6]
## # Database: DrillConnection
##    employee_id full_name  first_name last_name position_id position_title 
##          <dbl> <chr>      <chr>      <chr>           <dbl> <chr>          
##  1           1 Sheri Now… Sheri      Nowmer              1 President      
##  2           2 Derrick W… Derrick    Whelply             2 VP Country Man…
##  3           4 Michael S… Michael    Spence              2 VP Country Man…
##  4           5 Maya Guti… Maya       Gutierrez           2 VP Country Man…
##  5           6 Roberta D… Roberta    Damstra             3 VP Information…
##  6           7 Rebecca K… Rebecca    Kanagaki            4 VP Human Resou…
##  7           8 Kim Brunn… Kim        Brunner            11 Store Manager  
##  8           9 Brenda Bl… Brenda     Blumberg           11 Store Manager  
##  9          10 Darren St… Darren     Stanz               5 VP Finance     
## 10          11 Jonathan … Jonathan   Murraiin           11 Store Manager  
## # ... with more rows
Warning message:
One or more columns are of type BIGINT. The sergeant package currently uses jsonlite::fromJSON()
to process Drill REST API result sets. Since jsonlite does not support 64-bit integers BIGINT 
columns are initially converted to numeric since that's how jsonlite::fromJSON() works. This is
problematic for many reasons, including trying to use 'dplyr' idioms with said converted 
BIGINT-to-numeric columns. It is recommended that you 'CAST' BIGINT columns to 'VARCHAR' prior to
working with them from R/'dplyr'.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill 
with the MapR ODBC drivers.

This informational warning will only be shown once per R session and you can disable them from 
appearing by setting the 'sergeant.bigint.warnonce' option to 'FALSE' 
(i.e. options(sergeant.bigint.warnonce = FALSE)). 

The first thing sergeant users will notice is proper column order (before, it just returned the columns in the order they came back in the JSON rows[] structure). The second thing is that we didn’t get integer64s back. Instead, we got doubles plus an informational warning about why and what you can do about it. Said warning only displays once per session and can be silenced with the sergeant.bigint.warnonce option, i.e. just put:

options(sergeant.bigint.warnonce = FALSE)

in your script or ~/.Rprofile and you won’t hear from it again.

The as.integer64() we used is not from the bit64 package but an internal sergeant package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT ).
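
For the curious, the snippet below is a sketch of the general dbplyr pattern for that kind of translation; it is not sergeant's actual internals and the translator name is made up:

library(dbplyr)

# Sketch only (NOT sergeant's real code): teach a backend that as.integer64(x)
# should render as CAST(x AS BIGINT).
drill_like_scalar <- sql_translator(
  .parent = base_scalar,
  as.integer64 = sql_cast("BIGINT"),  # -> CAST(<x> AS BIGINT)
  as.character = sql_cast("VARCHAR")  # handy given the BIGINT-to-VARCHAR advice above
)

A backend exposes a translator like this through its sql_translation() (older dbplyr: sql_translate_env()) method so that mutate()/filter() calls render into the right SQL.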

You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr at the C++-level to provide direct in-package support for BIGINTs as well.

Better Error Messages

Drill query errors that the sergeant package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:

tbl(con, "cp.employees.json")
## # Source:   table<cp.employees.json> [?? x 4]
## # Database: DrillConnection
Warning message:
VALIDATION ERROR: From line 2, column 6 to line 2, column 24: Object 'cp.employees.json' not found

Original Query:

  1: SELECT *
  2: FROM `cp.employees.json`
  3: LIMIT 10

Query Profile Error Link:
http://localhost:8047/profiles/079fc8cf-19c6-4c78-95a9-0b949a3ecf4c 

As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig in to the gnarly details of complex query issues. As you work with this and find edge cases I missed for messages, drop an issue on your social-coding site of choice.

SUPPORT ALL THE PCAPs!

Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] }, to the "formats" section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).
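
If you'd rather stay in R for that, here's a rough sketch using the storage endpoints that show up later in this archive. Everything below is illustrative: the location is a placeholder and, since drill_mod_storage() replaces the named plugin's entire configuration, you'd fold the format and workspace into your existing dfs config rather than posting this fragment verbatim:

library(sergeant)

# Illustrative only: a file-system plugin config carrying a `caps` workspace
# whose default input format is pcapng. Merge with your real dfs settings.
drill_mod_storage(
  drill_connection("localhost"),
  name = "dfs",
  config = '
{
  "name" : "dfs",
  "config" : {
    "type" : "file",
    "connection" : "file:///",
    "enabled" : true,
    "workspaces" : {
      "caps" : {
        "location" : "/path/to/your/pcaps",
        "writable" : false,
        "defaultInputFormat" : "pcapng"
      }
    },
    "formats" : {
      "pcapng" : { "type" : "pcapng", "extensions" : ["pcapng"] }
    }
  }
}
')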

We’ll use one of the Wireshark example captures to demonstrate:

pcaps <- tbl(con, "dfs.caps.`*.pcapng`")

glimpse(pcaps)
## Observations: ??
## Variables: 25
## $ tcp_flags_ece_ecn_capable            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece_congestion_experienced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_psh                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type                                 <chr> "TCP", "TCP", "TCP", "TCP...
## $ tcp_flags_cwr                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ dst_ip                               <chr> "74.125.28.139", "10.254....
## $ src_ip                               <chr> "10.254.157.208", "74.125...
## $ tcp_flags_fin                        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags                            <int> 17, 17, 16, 16, 16, 0, 0,...
## $ tcp_flags_ack                        <int> 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ src_mac_address                      <chr> "00:05:9A:3C:7A:00", "00:...
## $ tcp_flags_syn                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_rst                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ timestamp                            <dttm> 2015-04-14 07:19:25, 201...
## $ tcp_session                          <dbl> 8.353837e+17, 8.353837e+1...
## $ packet_data                          <chr> "\"3DU...<z...E..(J.@.......
## $ tcp_parsed_flags                     <chr> "ACK|FIN", "ACK|FIN", "AC...
## $ tcp_flags_ns                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ src_port                             <int> 60268, 443, 60268, 58382,...
## $ packet_length                        <int> 54, 54, 54, 55, 66, 78, 7...
## $ tcp_flags_urg                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_ack                              <int> 662445631, 1496589825, 66...
## $ dst_port                             <int> 443, 60268, 443, 29216, 5...
## $ dst_mac_address                      <chr> "00:11:22:33:44:55", "00:...

count(pcaps, src_ip, dst_ip, sort=TRUE)
## # Source:     lazy query [?? x 3]
## # Database:   DrillConnection
## # Groups:     src_ip
## # Ordered by: desc(n)
##    src_ip         dst_ip             n
##    <chr>          <chr>          <dbl>
##  1 10.254.157.208 10.254.158.25    298
##  2 10.254.158.25  10.254.157.208   204
##  3 174.137.42.81  10.254.157.208    76
##  4 10.254.157.208 10.254.158.8      54
##  5 10.254.158.8   10.254.157.208    49
##  6 74.125.28.102  10.254.157.208    49
##  7 10.254.157.208 74.125.28.102     44
##  8 10.254.157.208 174.137.42.81     41
##  9 54.84.98.25    10.254.157.208    25
## 10 157.55.56.168  10.254.157.208    25
## # ... with more rows

More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.

Drill Metadata As Data

Drill has provided ways to look up Drill operational information as actual tables, but the Drill team has added support for even more metadata-as-data queries.

First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:

tbl(con, "information_schema.`schemata`") %>% 
  filter(SCHEMA_NAME == "dfs.caps") %>% 
  print() %>% 
  pull(SCHEMA_NAME) -> pcap_schema
## # Source:   lazy query [?? x 9]
## # Database: DrillConnection
##   CATALOG_NAME SCHEMA_NAME SCHEMA_OWNER TYPE  IS_MUTABLE
##   <chr>        <chr>       <chr>        <chr> <chr>     
## 1 DRILL        dfs.caps    <owner>      file  NO

tbl(con, "information_schema.`files`") %>% 
  filter(schema_name == pcap_schema) %>% 
  glimpse()
## Observations: ??
## Variables: 13
## $ SCHEMA_NAME       <chr> "dfs.caps"
## $ ROOT_SCHEMA_NAME  <chr> "dfs"
## $ WORKSPACE_NAME    <chr> "caps"
## $ FILE_NAME         <chr> "dof-short-capture.pcapng"
## $ RELATIVE_PATH     <chr> "dof-short-capture.pcapng"
## $ IS_DIRECTORY      <lgl> FALSE
## $ IS_FILE           <lgl> TRUE
## $ LENGTH            <dbl> 634280
## $ OWNER             <chr> "hrbrmstr"
## $ GROUP             <chr> "staff"
## $ PERMISSION        <chr> "rw-r--r--"
## $ ACCESS_TIME       <dttm> 1969-12-31 19:00:00
## $ MODIFICATION_TIME <dttm> 2019-01-01 19:12:17

The Drill system options table now has full descriptions for the options, and there is also a new table that knows about all of Drill’s functions and all your custom UDFs. drill_opts() and drill_functions() return a data frame of all this info and have an optional browse parameter which, if set to TRUE, will show an interactive DT data table for them. I find this especially handy when I forget something like regexp_like syntax (I use a lot of back-ends and many are wildly different) and can now do this:
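
(The exact call details below are a sketch; check the 0.8.0 branch docs for the argument specifics.)

dc <- drill_connection("localhost")

drill_functions(dc)                 # data frame of built-in functions + your UDFs
drill_functions(dc, browse = TRUE)  # same info as an interactive DT table
drill_opts(dc, browse = TRUE)       # system options, now with full descriptions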

FIN

Keep on the lookout for the rapidjsonr/BIGINT integration and more new features of the sergeant package. NOTE: The better error messages have been ported over to the sergeant.caffeinated package (the RJDBC interface) and the other niceties will make their way into that package soon as well.

So, make sure you’re using the 0.8.0 branch (GL / GH), kick the tyres, and file issues wherever you’re most comfortable working.

May your queries all be optimized and results sets complete in the new year!

The sergeant package has a minor update that adds REST API coverage for two “new” storage endpoints that make it possible to add, update and remove storage configurations on-the-fly without using the GUI or manually updating a config file.

This is an especially handy feature when paired with Drill’s new, official Docker container since that means we can:

  • fire up a clean Drill instance
  • modify the storage configuration (to, say, point to a local file system directory)
  • execute SQL ops
  • destroy the Drill instance

all from within R.

This is even more handy for those of us who prefer handling JSON data in Drill rather than directly in R or with sparklyr.

Quick Example

In a few weeks most of the following verbose-code-snippets will have a more diminutive and user-friendly interface within sergeant, but for now we’ll perform the above bulleted steps with some data that was used in a recent new package which was also generated by another recent new package. The zdnsr::zdns_exec() function ultimately generates a deeply nested JSON file that I really prefer working with in Drill before shunting it into R. Said file is stored, say, in the ~/drilldat directory.

Now, I have Drill running all the time on almost every system I use, but we’ll pretend I don’t for this example. I’ve run zdns_exec() and generated the JSON file and it’s in the aforementioned directory. Let’s fire up an instance and connect to it:

library(sergeant) # git[hu|la]b:hrbrmstr/sergeant
library(dplyr)

docker <- Sys.which("docker") # you do need docker. it's a big dependency, but worth it IMO

(system2(
  command = docker,  
  args = c(
    "run", "-i", 
    "--name", "drill-1.14.0", 
    "-p", "8047:8047", 
    "-v", paste0(c(path.expand("~/drilldat"), "/drilldat"), collapse=":"),
    "--detach", 
    "-t", "drill/apache-drill:1.14.0",
    "/bin/bash"
  ),
  stdout = TRUE
) -> drill_container)
## [1] "d6bc79548fa073d3bfbd32528a12669d753e7a19a6258e1be310e1db378f0e0d"

The above snippet fires up a Drill Docker container (downloads it, too, if not already local) and wires up a virtual directory to it.

We should wait a couple seconds and make sure we can connect to it:

drill_connection() %>% 
  drill_active()
## [1] TRUE
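
Rather than guessing at the sleep, a tiny poll loop (just a sketch) works too:

dc <- drill_connection("localhost")

# give the container up to ~30 seconds to finish booting
for (i in 1:30) {
  up <- try(drill_active(dc), silent = TRUE) # may fail while Drill is starting
  if (isTRUE(up)) break
  Sys.sleep(1)
}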

Now, we need to add a storage configuration so we can access our virtual directory. Rather than modify dfs we’ll add a drilldat plugin that will work with the local filesystem just like dfs does:

drill_connection() %>%
  drill_mod_storage(
    name = "drilldat",
    config = '
{
  "config" : {
    "connection" : "file:///", 
    "enabled" : true,
    "formats" : null,
    "type" : "file",
    "workspaces" : {
      "root" : {
        "location" : "/drilldat",
        "writable" : true,
        "defaultInputFormat": null
      }
    }
  },
  "name" : "drilldat"
}
')
## $result
## [1] "success"

Now, we can perform all the Drill ops sergeant has to offer, including ones like this:

(db <- src_drill("localhost"))
## src:  DrillConnection
## tbls: cp.default, dfs.default, dfs.root, dfs.tmp, drilldat.default, drilldat.root,
##   INFORMATION_SCHEMA, sys

tbl(db, "drilldat.root.`/*.json`")
## # Source:   table [?? x 10]
## # Database: DrillConnection
##    data                                              name   error class status timestamp          
##                                                                    
##  1 "{\"authorities\":[{\"ttl\":180,\"type\":\"SOA\"… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  2 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  3 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  4 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  5 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  6 "{\"authorities\":[{\"ttl\":1799,\"type\":\"SOA\… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  7 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  8 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  9 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
## 10 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
## # ... with more rows

(tbl(db, "(
SELECT
 b.answers.name AS question,
 b.answers.answer AS answer
FROM (
 SELECT 
   FLATTEN(a.data.answers) AS answers
 FROM 
   drilldat.root.`/*.json` a
 WHERE 
   (a.status = 'NOERROR')
) b
)") %>% 
 collect() -> dmarc_recs)
## # A tibble: 1,250 x 2
##    question             answer                                                                    
##  *                                                                                      
##  1 _dmarc.washjeff.edu  v=DMARC1; p=none                                                          
##  2 _dmarc.barry.edu     v=DMARC1; p=none; rua=mailto:dmpost@barry.edu,mailto:7cc566d7@mxtoolbox.d…
##  3 _dmarc.yhc.edu       v=DMARC1; pct=100; p=none                                                 
##  4 _dmarc.aacc.edu      v=DMARC1;p=none; rua=mailto:DKIM_DMARC@aacc.edu;ruf=mailto:DKIM_DMARC@aac…
##  5 _dmarc.sagu.edu      v=DMARC1; p=none; rua=mailto:Office365contact@sagu.edu; ruf=mailto:Office…
##  6 _dmarc.colostate.edu v=DMARC1; p=none; pct=100; rua=mailto:re+anahykughvo@dmarc.postmarkapp.co…
##  7 _dmarc.wne.edu       v=DMARC1;p=quarantine;sp=none;fo=1;ri=86400;pct=50;rua=mailto:dmarcreply@…
##  8 _dmarc.csuglobal.edu v=DMARC1; p=none;                                                         
##  9 _dmarc.devry.edu     v=DMARC1; p=none; pct=100; rua=mailto:devry@rua.agari.com; ruf=mailto:dev…
## 10 _dmarc.sullivan.edu  v=DMARC1; p=none; rua=mailto:mcambron@sullivan.edu; ruf=mailto:mcambron@s…
## # ... with 1,240 more rows

Finally (when done), we can terminate the Drill container:

system2(
  command = "docker",
  args = c("rm", "-f", drill_container)
)

FIN

Those system2() calls are a pain to type and remember, so they’ll be wrapped in some sergeant utility functions (I’m hesitant to add a reticulate dependency to sergeant, which would be necessary to use the docker package, hence the system-call wrapper approach).
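
Something like this, perhaps (a sketch of the kind of wrapper meant above; the function names are made up and not part of sergeant yet):

# Hypothetical wrappers around the system2() calls used earlier in this post.
drill_docker_up <- function(data_dir = "~/drilldat", version = "1.14.0") {
  system2(
    command = Sys.which("docker"),
    args = c(
      "run", "-i", "--detach",
      "--name", sprintf("drill-%s", version),
      "-p", "8047:8047",
      "-v", paste0(path.expand(data_dir), ":/drilldat"),
      "-t", sprintf("drill/apache-drill:%s", version),
      "/bin/bash"
    ),
    stdout = TRUE
  )
}

drill_docker_down <- function(container_id) {
  system2(Sys.which("docker"), args = c("rm", "-f", container_id))
}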

Check your favorite repository for more sergeant updates and file issues if you have suggestions for how you’d like this Docker API for Drill to be controlled.

If you’ve got a directory full of Bro NSM logs, it’s easy to work with them in Apache Drill since they’re just tab-separated values (TSV) files by default. The most tedious part is mapping the columns to proper types and hopefully this saves at least one person from typing it out manually:

SELECT 
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid,
                    columns[2]             AS id_orig_h,
                    columns[3]             AS id_orig_p,
                    columns[4]             AS id_resp_h,
                    columns[5]             AS id_resp_p,
                    columns[6]             AS proto,
                    columns[7]             AS service,
              CAST( columns[8] AS DOUBLE)  AS duration,
              CAST( columns[9] AS INTEGER) AS orig_bytes,
              CAST(columns[10] AS INTEGER) AS resp_bytes,
                   columns[11]             AS conn_state,
                   columns[12]             AS local_orig,
                   columns[13]             AS local_resp,
              CAST(columns[14] AS INTEGER) AS missed_bytes,
                   columns[15]             AS history,
              CAST(columns[16] AS INTEGER) AS orig_packets,
              CAST(columns[17] AS INTEGER) AS orig_ip_bytes,
              CAST(columns[18] AS INTEGER) AS resp_pkts,
              CAST(columns[19] AS INTEGER) AS resp_ip_bytes,
                   columns[20]             AS tunnel_parents
FROM dfs.brologs.`/201808/*`

You can either store them all under a single workspace with a default input type or soft-link/rename them to end in .tsv (it’s unlikely you want to change all .log files to be read as TSV everywhere).

While you could just use the logs this way, consider using CTAS to move them to Parquet. The above will create typed columns and the queries will generally be much faster.
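
A minimal sketch of that CTAS step via sergeant (the target table name is made up, dfs.tmp is used because it is writable by default, and the column list is abbreviated; use the full typed SELECT above in practice):

library(sergeant)

dc <- drill_connection("localhost")

# Abbreviated CTAS: writes typed Parquet that later queries can hit directly.
drill_query(dc, "
CREATE TABLE dfs.tmp.`conn_201808` AS
SELECT
  TO_TIMESTAMP(CAST(columns[0] AS DOUBLE)) AS ts,
                    columns[1]             AS uid,
                    columns[2]             AS id_orig_h,
              CAST( columns[9] AS INTEGER) AS orig_bytes
FROM dfs.brologs.`/201808/*`
")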

Apache Drill 1.14.0 was recently released, bringing with it many new features and a temporary incompatibility with the current rev of the MapR ODBC drivers. The Drill community expects new ODBC drivers to arrive shortly. The sergeant package is an alternative to ODBC for R users as it provides a dplyr interface to the REST API along with a JDBC interface and functions to work directly with the REST API in a more programmatic fashion.

First-class dplyr-citizen support for the sergeant JDBC interface

I’ve been primarily using the ODBC interface for a while now since it’s dead simple to use with dplyr (as noted in my still-unfinished, short cookbook on wiring up Drill and R). The ODBC incompatibility is pretty severe since it’s at the protobuf level, but sergeant::src_drill() is an easy swap-out and does not have any issues since it works against the REST API. Unfortunately, the query endpoint of the REST API mangles the field order when it returns query results. This really isn’t too painful since it’s easy to add a select() call after gathering query results to reorder things. However, it was painful enough that it prompted me to round off some of the rough corners of the JDBC interface.

sergeant::drill_jdbc() now returns a <DrillJDBCConnection> object which was necessary to add dplyr classes for just enough bits to enable smooth operation with the tbl() function (without breaking all your other RJDBC usage in the same session). The next blog section will use the new JDBC interface with dplyr as it introduces one of Drill’s new features.

Query Image Metadata with Apache Drill 1.14.0

There are quite a few R packages for reading image/media metadata. Since that seems to be en vogue, R folks might be interested in Drill’s new image metadata format plugin. Just point Drill to a directory of files and you can use a familiar dplyr interface to get the deets on your family photo inventory.

You first need to follow the directions at the aforelinked resource and add the following format to the formats: section.

formats: {
     "image": {
       "type": "image",
       "extensions": [
         "jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif",
         "ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p",
         "m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw",
         "crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f"
       ],
       "fileSystemMetadata": true,
       "descriptive": true,
       "timeZone": null
     }  
   }

Note that the configuration snippet on Drill’s site (as of the date-stamp on this post) did not have a , after the ] for the extensions array, so copy this one instead.

I created a media workspace and set the defaultInputFormat to image. Here’s a naive first look at what you can get back from a simple query to a jpg directory under it (using the new JDBC interface and dplyr):

library(sergeant)
library(tidyverse)

(con <- drill_jdbc("bigd:2181"))
## <DrillJDBCConnection>

tbl(con, "dfs.media.`/jpg/*`") %>%
  glimpse()
## Observations: ??
## Variables: 28
## $ FileSize         "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         "false", "false", "false", "false", "false", "...
## $ Duration         "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

That’s quite a bit of metadata, but the Drill format plugin page kinda fibs a bit about column types since we see many chrs there. You may be quick to question the sergeant package, but this isn’t using the REST interface and we can use DBI calls to ask Drill what it’s sending us:

dbSendQuery(con, "SELECT * FROM dfs.media.`/jpg/*`") %>%
  dbColumnInfo()
##         field.name        field.type data.type            name
## 1         FileSize CHARACTER VARYING character        FileSize
## 2     FileDateTime CHARACTER VARYING character    FileDateTime
## 3           Format CHARACTER VARYING character          Format
## 4       PixelWidth CHARACTER VARYING character      PixelWidth
## 5      PixelHeight CHARACTER VARYING character     PixelHeight
## 6     BitsPerPixel CHARACTER VARYING character    BitsPerPixel
## 7         DPIWidth CHARACTER VARYING character        DPIWidth
## 8        DPIHeight CHARACTER VARYING character       DPIHeight
## 9       Orientaion CHARACTER VARYING character      Orientaion
## 10       ColorMode CHARACTER VARYING character       ColorMode
## 11        HasAlpha CHARACTER VARYING character        HasAlpha
## 12        Duration CHARACTER VARYING character        Duration
## 13      VideoCodec CHARACTER VARYING character      VideoCodec
## 14       FrameRate CHARACTER VARYING character       FrameRate
## 15      AudioCodec CHARACTER VARYING character      AudioCodec
## 16 AudioSampleSize CHARACTER VARYING character AudioSampleSize
## 17 AudioSampleRate CHARACTER VARYING character AudioSampleRate
## 18            JPEG               MAP character            JPEG
## 19            JFIF               MAP character            JFIF
## 20        ExifIFD0               MAP character        ExifIFD0
## 21      ExifSubIFD               MAP character      ExifSubIFD
## 22  AppleMakernote               MAP character  AppleMakernote
## 23             GPS               MAP character             GPS
## 24             XMP               MAP character             XMP
## 25       Photoshop               MAP character       Photoshop
## 26            IPTC               MAP character            IPTC
## 27         Huffman               MAP character         Huffman
## 28        FileType               MAP character        FileType

We can still work with the results, but there’s also a pretty key element missing: the media filename. The reason it’s not in the listing is that filename is an implicit column that we have to ask for. So, we need to modify our query to be something like this:

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  glimpse()
## Observations: ??
## Variables: 29
## $ fname            "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         "false", "false", "false", "false", "false", "...
## $ Duration         "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

We could work with the “map” columns with Drill’s SQL, but this is just metadata and even if there are many files, most R folks have sufficient system memory these days to collect it all and work with it locally. There’s nothing stopping you from working on the SQL side, though, and it may be a better choice if you’ll be using this to process huge archives. But, we’ll do this in R and convert a bunch of field types along the way:

from_map <- function(x) { map(x, jsonlite::fromJSON)}

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  collect() %>%
  mutate_at(
    .vars = vars(
      JPEG, JFIF, ExifSubIFD, AppleMakernote, GPS, XMP, Photoshop, IPTC, Huffman, FileType
    ),
    .funs=funs(from_map)
  ) %>%
  mutate_at(
    .vars = vars(
      PixelWidth, PixelHeight, DPIWidth, DPIHeight, FrameRate, AudioSampleSize, AudioSampleRate
    ),
    .funs=funs(as.numeric)
  ) %>%
  glimpse() -> media_df
## Observations: 11
## Variables: 29
## $ fname            "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       4032, 4032, 4032, 4032, 4032, 4032, 3024, 4032...
## $ PixelHeight      3024, 3024, 3024, 3024, 3024, 3024, 4032, 3024...
## $ BitsPerPixel     "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ DPIHeight        72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ Orientaion       "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         "false", "false", "false", "false", "false", "...
## $ Duration         "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioCodec       "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioSampleRate  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ JPEG             [["Baseline", "8 bits", "3024 pixels", "4032 ...
## $ JFIF             [["1.1", "none", "72 dots", "72 dots", "0", "...
## $ ExifIFD0         "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       [["1/2227 sec", "f/1.8", "Program normal", "2...
## $ AppleMakernote   [["5", "[558 values]", "[104 values]", "1", "...
## $ GPS              [["N", "44° 19' 6.3\"", "W", "-68° 11' 22.39\...
## $ XMP              [["4", ["2017-06-22T14:28:04"], ["2017-06-22T...
## $ Photoshop        [["48 89 11 77 33 105 192 33 170 252 63 34 43...
## $ IPTC             [["UTF-8", "2", "14:28:04", "2017:06:22", "20...
## $ Huffman          [["4 Huffman tables"], ["4 Huffman tables"], ...
## $ FileType         [["JPEG", "Joint Photographic Experts Group",...

Now, we can do anything with the data, including getting the average file size:

mutate(media_df, FileSize = str_replace(FileSize, " bytes", "") %>% as.numeric()) %>%
  summarise(mean(FileSize))
## # A tibble: 1 x 1
##   `mean(FileSize)`
##              
## 1         3878963.

FIN

The enhancements to the JDBC interface have only been given a light workout but seem to be doing well so far. Kick the tyres and file an issue if you have any problems. ODBC users should not have to wait long for new drivers and src_drill() aficionados can keep chugging along as before without issue.

For those new to Apache Drill, there’s now an official Docker image for it so you can get up and running without adding too much cruft to your local systems. I may add support for spinning up and managing a Drill container to the sergeant package, so keep your eyes on pushes to the repo.

Also keep an eye on the mini-cookbook as I’ll be modifying it to account for the new package changes and introduce additional new Drill features.

A previous post showed how to use a different authentication provider to wire up Apache Zeppelin and Amazon Athena. As noted in that post, Zeppelin is a “notebook” alternative to Jupyter (and other) notebooks. Unlike Jupyter, I can tolerate Zeppelin and it’s got some nifty features like plug-and-play JDBC access. Plus it can do some nifty things automagically, like turn the output of a simple aggregation query into a visualization like the one shown at the top of the post.

Drill + Zeppelin (normally that’d be a deadly combination)

The steps to wire-up Apache Drill are similar to those for Athena:

  • Go to the Interpreter menu (it’s a drop down of the top right main Zeppelin menu bar)
  • Create a new one (I named mine — unimaginatively — drill)
  • Set the default.driver to org.apache.drill.jdbc.Driver
  • Fill in the necessary JDBC default.url. I use jdbc:drill:zk=localhost:2181 and you can have multiple ones if you have need to connect to more than one Drill cluster.
  • Setup authentication parameters if you need to
  • Under Dependencies, add an artifact and use the path to the JAR in $DRILL_HOME/jars/jdbc-driver/. In my case that’s /usr/local/drill/jars/jdbc-driver/drill-jdbc-all-1.14.0.jar

We can use one of Drill’s built-in datasets to test out our connection.

You can do the same thing in the Query box in the Drill web interface, but — even with the ACE editor now being embedded on the page — I think the Zeppelin interface is better, and it’s possible to sequence a number of steps in the same notebook (i.e. use a %shell paragraph to grab some JSON data and copy it to a Drill-accessible folder, have a %drill paragraph right below it convert the data to Parquet, a %spark paragraph below that do some ML on the data, and a %knitr block make a proper visualization with R).

Drill + PostgreSQL/etc.

By now, you’ve likely figured out it’s the same, heh, drill for other databases with JDBC support.

For PostgreSQL (i.e. a %postgres interpreter) the setup is much the same:

  • Obtain the PostgreSQL JDBC driver and wire its location up as an artifact
  • Use org.postgresql.Driver for the default.driver
  • Enter an appropriate default.url for your setup, which is more than likely jdbc:postgresql://localhost:5432/… ; if not (i.e. the Zeppelin node and PostgreSQL node are on different systems), then you’ll need to ensure PostgreSQL is configured to listen on an interface that is accessible to the Zeppelin node
  • Enter authentication info and fire up a test query for data that you have

FIN

Fear not! There shall be no more simple “wiring-up” posts unless a commenter asks for one or there’s some weird complexity with a new one that deserves some exposition, especially since there are plenty of examples over at the Zeppelin main site. Hopefully these posts have encouraged you to give Zeppelin a try if you haven’t already. I tried working with very early versions of Zeppelin a while back and let it sit for a while, so give it a chance if you did the same. Version 0.9.0 is about to come out and it’s looking even better than 0.8.0, plus it dropped in perfectly on Ubuntu and macOS (even 10.14 beta), too.

Drop a note in the comments if you have any tips/tricks for connecting to other JDBC sources.

Continuing the blog’s UDF theme of late, there are two new UDF kids in town: one for parsing URLs and one for working with internationalized domain names (IDNs).

Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url() function already?” My answer is “Sure, but it’s based on java.net.URL which is fundamentally broken.”

Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

Rather than just use boring SQL for an example, we’ll start with some SQL and then use R for a decent example of working with the two new UDFs.

Counting Lying Lock Icons

SSL/TLS is all the rage these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they aren’t explicit in the URL).

If you go to the aforementioned URL you’ll see that the most current GFG dataset URL can be retrieved by inspecting the contents of a metadata URL (the lastupdate.txt file fetched in the code below).

There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

Let’s get the data:

library(sergeant)
library(tidyverse)

read_delim(
  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", 
  delim = " ", 
  col_names = FALSE,
  col_types = "ccc"
) -> gfg_update

dl_path <- file.path("~/Data/gfg_links.tsv.gz")

if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

Those operations have placed the GFG data set in a place where my local Drill instance can get to it. It's a tab-separated values (TSV) file which — while not a great data format — is workable with Drill.

Now we'll set up a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:

db <- src_drill()

tbl(db, "(
SELECT 
  b.host,
  port,
  b.rec.hostname AS hostname,
  b.rec.assigned AS assigned,
  b.rec.tld AS tld,
  b.rec.subdomain AS subdomain
FROM
  (SELECT
    host, port, suffix_extract(host) AS rec             -- break the hostname into components
  FROM
    (SELECT
      a.rec.host AS host, a.rec.port AS port
    FROM
      (SELECT 
        columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components
      FROM dfs.d.`/gfg_links.tsv.gz`) a
    WHERE a.rec.port IS NOT NULL                        -- filter out URL parsing failures
    )
  ) b
WHERE b.rec.tld IS NOT NULL                             -- filter out domain parsing failures
)") -> gfg_df

gfg_df
## # Database: DrillConnection
##    hostname  port host              subdomain assigned      tld  
##                                    
##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## # ... with more rows

While we could have done it all in SQL, we saved some bits for R:

distinct(gfg_df, assigned, port) %>% 
  count(port) %>% 
  collect() -> port_counts

port_counts
# A tibble: 2 x 2
   port     n
*  
1    80 20648
2   443 22178

You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

FIN

Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.
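
For comparison, here's roughly the same slicing done locally with urltools (fine for the single-node, in-memory case; the URL is just an illustrative one built from the data set above):

library(urltools)

u <- "http://www.eestikirik.ee/some/page.html"

url_parse(u)              # scheme / domain / port / path / parameter / fragment
suffix_extract(domain(u)) # host, subdomain, assigned domain, public suffix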

I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

Kick the tyres & file issues & PRs as necessary.

There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion of the metadata — often just tweet timestamp, the tweet creator and the tweet text — leaving to the analyst the trudging work of re-extracting hashtags, mentions, URLs (etc).

Twitter provides a tweet-text processing library for many languages. One of these languages is Java. Since it makes sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a tweet metadata extraction set of user-defined functions (UDFs). Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text.

What’s Inside the Tin?

There are five UDFs in the package:

  • tw_parse_tweet(string): Parses the tweet text and returns a map column with the following named values:
    • weightedLength: (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration file
    • permillage: (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.
    • isValid: (boolean) indicates if input text length corresponds to a valid result.
    • display_start / display_end: (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.
    • valid_start / valid_end: (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
  • tw_extract_hashtags(string): Extracts all hashtags in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_screennames(string): Extracts all screennames in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_urls(string): Extracts all URLs in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_reply_screenname(string): Extracts the reply screenname (if any) from the tweet text into a VARCHAR.

The repo has all the necessary bits and info to help you compile and load the necessary JARs, but those in a hurry can just copy all the files in the target directory to your local jars/3rdparty directory and restart Drill.

Usage

Here’s an example of how to call each UDF along with the output:

SELECT 
  tw_extract_screennames(tweetText) AS mentions,
  tw_extract_hashtags(tweetText) AS tags,
  tw_extract_urls(tweetText) AS urls,
  tw_extract_reply_screenname(tweetText) AS reply_to,
  tw_parse_tweet(tweetText) AS tweet_meta
FROM
  (SELECT 
     '@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
   FROM (VALUES((1))))

+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+

FIN

Kick the tyres and file issues and PRs as needed.

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R anyway, so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the Twitterverse about it and it started a thread.

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc + dplyr connection almost instantly:

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib", 
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "elb_logs")

some_tbl
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp    elbname requestip  requestport backendip backendport
##    <chr>        <chr>   <chr>            <int> <chr>           <int>
##  1 2014-09-26T… lb-demo 249.6.80.…        5123 249.6.80…        8888
##  2 2014-09-26T… lb-demo 246.22.15…        5123 248.178.…        8888
##  3 2014-09-26T… lb-demo 248.179.3…       45667 254.70.2…         443
##  4 2014-09-26T… lb-demo 243.2.127…       14496 248.178.…          80
##  5 2014-09-26T… lb-demo 247.76.18…        6887 252.0.81…        8888
##  6 2014-09-26T… lb-demo 254.110.3…       22052 248.178.…        8888
##  7 2014-09-26T… lb-demo 249.113.2…       24902 245.241.…        8888
##  8 2014-09-26T… lb-demo 246.128.7…        5123 244.202.…        8888
##  9 2014-09-26T… lb-demo 249.6.80.…       24902 255.226.…        8888
## 10 2014-09-26T… lb-demo 253.102.6…        6887 246.22.1…        8888
## # ... with more rows, and 10 more variables:
## #   requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <chr>,
## #   backendresponsecode <chr>, receivedbytes <S3: integer64>,
## #   sentbytes <S3: integer64>, requestverb <chr>, url <chr>,
## #   protocol <chr>

The TL;DR is that I can now use 100% dplyr idioms with Athena instead of the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).

The downside is that I’m now even more likely to run up the AWS bill ;-)

What About Drill?

I also maintain the sergeant package, which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago, so I made the package (which was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "CLUSTERID",
  ZkQuorum = "HOST:2181",
  AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;
ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con

(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id  
##    <S3: integer> <chr>        <chr>      <chr>     <S3: integer> <chr>            <S3: inte>
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0         
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0         
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0         
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0         
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0         
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0         
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9         
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21        
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0         
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1         
## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n              
##    <chr>                     <S3: integer64>
##  1 Store Temporary Checker   268            
##  2 Store Temporary Stocker   264            
##  3 Store Permanent Checker   226            
##  4 Store Permanent Stocker   222            
##  5 Store Shift Supervisor    52             
##  6 Store Permanent Butcher   32             
##  7 Store Manager             24             
##  8 Store Assistant Manager   24             
##  9 Store Information Systems 16             
## 10 HQ Finance and Accounting 8              
## # ... with more rows

Apart from having to do that sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ve updated my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing sergeant package and will be including some additional methods that provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.

FIN

The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the dplyr verbs. That said, those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr verb magic afterwards.
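
For example (a sketch using the elb_logs sample table from earlier; the WHERE clause is made up), you can wrap a tuned query in sql() and keep layering dplyr on top:

# Hand-written SQL as the source, dplyr verbs on top (column names come from
# the elb_logs sample shown earlier; the filter value is illustrative).
tbl(con, sql("
  SELECT elbname, requestip, backendprocessingtime
  FROM elb_logs
  WHERE elbresponsecode = '200'
")) %>%
  group_by(elbname) %>%
  summarise(avg_backend = mean(backendprocessingtime, na.rm = TRUE)) %>%
  collect()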

Today is, indeed, a good day to query!