Most of the examples of working with most of the AWS services show basic username & password authentication. That’s all well-and-good, but many shops use the AWS Security Token Service to provide temporary credentials and session tokens to limit exposure and provide more uniform multi-factor authentication. At my workplace, Frank Mitchell created a nice electron app to make it super easy to create and re-up these credentials. The downside of this is that all AWS service usage for work requires using these credentials and I was having the darndest time trying to get Athena’s JDBC driver working with it (but I wasn’t spending alot of time on it as I tend to mirror research data to a local, beefy Apache Drill server).
I finally noticed the
com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider
class and decided to give the following a go (you will need to point fil
to wherever you have the Athena jar file):
library(RJDBC)
library(tidyverse)
fil <- "~/Drivers/AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
aws <- ini::read.ini("~/.aws/credentials")
Sys.setenv(AWS_ACCESS_KEY_ID = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_access_key_id)
Sys.setenv(AWS_SECRET_ACCESS_KEY = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_secret_access_key)
Sys.setenv(AWS_SESSION_TOKEN = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_session_token)
provider <- "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("AWS_S3_STAGING_DIR"),
schema_name="DEFAULT_DB_SCHEMA_NAME",
aws_credentials_provider_class=provider)
dbListTables(con)
dbListFields(con, "SOME_TABLE_IN_THE_DEFAULT_DB")
dbGetQuery(con, "SELECT * FROM DEFAULT_DB_SCHEMA_NAME.SOME_TABLE_IN_THE_DEFAULT_DB limit 10;")
YMMV on Windows (comments about what does and does not work on Windows are welcome).
The provider
line was the key element I was missing prior to last night.
The Awsaml utility monitors/maintains entries under it’s purview credentials
file and keeps consistent profile ids, so I keep that AWS_PROFILE
setting in my ~/.Renviron
.
I also keep the default S3 Athena data staging bucket in an environment variable as well.
If you provide a default schema_name
then you can list tables and fields but queries need fully qualified database (Amazon calls them “schemas”) dot table name.
Initial attempts to have this setup “just work” with dplyr
0.6.0 (the forthcoming EPIC release) were unsuccessful but I’ll poke at all this when I get time and likely write a small Athena package to help smooth over rougher areas.
Y’all likely figured all this out way before I did, but in the event someone else is looking for the information, it should be google-able now.