I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis
package? which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the #rstats Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):
The (GitHub only for now) #rstats metis package for wiring up R to @amazonathema via RJDBC now uses & includes the new Simba Athena JDBC Driver 2.0.2 JAR https://t.co/wvwV6IxCNd (cc: @dabdine)
— hrbrmstr (@hrbrmstr) April 20, 2018
If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc
package with the ODBC driver for Athena.
I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.
Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc
+ dplyr
connection almost instantly:
library(odbc)
library(tidyverse)
DBI::dbConnect(
odbc::odbc(),
driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
Schema = "sampledb",
AwsRegion = "us-east-1",
AuthenticationType = "Default Credentials",
S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con
some_tbl <- tbl(con, "elb_logs")
some_tbl
## # Source: table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## timestamp elbname requestip requestport backendip backendport
## <chr> <chr> <chr> <int> <chr> <int>
## 1 2014-09-26T… lb-demo 249.6.80.… 5123 249.6.80… 8888
## 2 2014-09-26T… lb-demo 246.22.15… 5123 248.178.… 8888
## 3 2014-09-26T… lb-demo 248.179.3… 45667 254.70.2… 443
## 4 2014-09-26T… lb-demo 243.2.127… 14496 248.178.… 80
## 5 2014-09-26T… lb-demo 247.76.18… 6887 252.0.81… 8888
## 6 2014-09-26T… lb-demo 254.110.3… 22052 248.178.… 8888
## 7 2014-09-26T… lb-demo 249.113.2… 24902 245.241.… 8888
## 8 2014-09-26T… lb-demo 246.128.7… 5123 244.202.… 8888
## 9 2014-09-26T… lb-demo 249.6.80.… 24902 255.226.… 8888
## 10 2014-09-26T… lb-demo 253.102.6… 6887 246.22.1… 8888
## # ... with more rows, and 10 more variables:
## # requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## # clientresponsetime <dbl>, elbresponsecode <chr>,
## # backendresponsecode <chr>, receivedbytes <S3: integer64>,
## # sentbytes <S3: integer64>, requestverb <chr>, url <chr>,
## # protocol <chr>##
The TLDR is that I can now use 100% dplyr
idioms with Athena vs add one to the RJDBC driver I made for metis
. The metis
package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
The downside is that I’m now even more likely to run up the AWS bill ;-)
What About Drill?
I also maintain the sergeant
package? which provides REST API and REST query access to Apache Drill along with a REST API DBI
driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago so I made the package (which was also a great learning experience).
I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.
library(odbc)
library(tidyverse)
DBI::dbConnect(
odbc::odbc(),
driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
ConnectionType = "Zookeeper",
AuthenticationType = "No Authentication",
ZKCLusterID = "CLUSTERID",
ZkQuorum = "HOST:2181",
AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;
ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con
(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source: SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
## employee_id full_name first_name last_name position_id position_title store_id
## <S3: integer> <chr> <chr> <chr> <S3: integer> <chr> <S3: inte>
## 1 1 Sheri Nowmer Sheri Nowmer 1 President 0
## 2 2 Derrick Whe… Derrick Whelply 2 VP Country Mana… 0
## 3 4 Michael Spe… Michael Spence 2 VP Country Mana… 0
## 4 5 Maya Gutier… Maya Gutierrez 2 VP Country Mana… 0
## 5 6 Roberta Dam… Roberta Damstra 3 VP Information … 0
## 6 7 Rebecca Kan… Rebecca Kanagaki 4 VP Human Resour… 0
## 7 8 Kim Brunner Kim Brunner 11 Store Manager 9
## 8 9 Brenda Blum… Brenda Blumberg 11 Store Manager 21
## 9 10 Darren Stanz Darren Stanz 5 VP Finance 0
## 10 11 Jonathan Mu… Jonathan Murraiin 11 Store Manager 1
## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>,
## # hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>,
## # marital_status <chr>, gender <chr>, management_role <chr>##
count(employee, position_title, sort=TRUE)
## # Source: lazy query [?? x 2]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
## position_title n
## <chr> <S3: integer64>
## 1 Store Temporary Checker 268
## 2 Store Temporary Stocker 264
## 3 Store Permanent Checker 226
## 4 Store Permanent Stocker 222
## 5 Store Shift Supervisor 52
## 6 Store Permanent Butcher 32
## 7 Store Manager 24
## 8 Store Assistant Manager 24
## 9 Store Information Systems 16
## 10 HQ Finance and Accounting 8
## # ... with more rows##
Apart from having to do that sql(…)
to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr
verbs in under ten minutes (total).
You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters and I’ve updated my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing sergeant
package but also be including some additional methods provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.
FIN
The odbc
package is super-slick and it’s refreshing to be able to use dplyr
verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the dplyr
verbs. However, those queries can also be put right into sql()
with the Athena ODBC driver connection and used via the same dplyr
verb magic afterwards.
Today is, indeed, a good day to query!
8 Trackbacks/Pingbacks
[…] article was first published on R – rud.is, and kindly contributed to […]
[…] blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you […]
[…] blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare […]
[…] blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare […]
[…] blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you […]
[…] ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided […]
[…] free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided […]
[…] free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided […]