Skip navigation

Tag Archives: athena

Apache Zeppelin is a “notebook” alternative to Jupyter (and other) notebooks. It supports a plethora of kernels/interpreters and can do a ton of things that this post isn’t going to discuss (perhaps future ones will, especially since it’s the first “notebook” environment I’ve been able to tolerate for longer than a week).

One really cool feature of Zeppelin is the ability for it to wire it up to databases via JDBC and use it interactive queries. A future post will provide instructions for Apache Drill, but this one’s about wiring up Amazon Athena and Apache Zeppelin. A big reason to do this is that image at the top of the post. The query interface is far nicer than the Amazon console and — while RStudio is going to have similar features in the 1.2 release — Zeppelin has some advantages over it, especially as 0.9.0 moves to final release.

If you use basic credentials in Athena, this post can help you connect up.

At $DAYJOB we use an open source application that we developed — Awsaml? at $DAYJOB — which provides automagically rotated temporary AWS credentials every hour after a successful initial multi-factor authentication (you should think about doing this, too).

Because it uses a non default profile name we need to use a different authentication class when using the Athena JDBC interface.

To somewhat dup the aforelinked post, you’ll need to download the driver that matches your version of the JDK and the JDBC data standards.

I like to put JARs like this in /usr/local/jars (just remember where you put it).

Now, just create a Zeppelin interpreter named athena (or whatever you like). Set the default.driver to com.simba.athena.jdbc.Driver and the JDBC string to this horribly long entity:


jdbc:awsathena://athena.us-east-1.amazonaws.com:443;S3OutputLocation=s3://aws-athena-query-results-something-us-east-1;Schema=default;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments="your-profile-name"

I intentionally left it un-wrapped so it’s easier to copy. Here are the individual parts (separating the bullets at the semicolons):

  • jdbc:awsathena://athena.us-east-1.amazonaws.com:443 (use what you need to here)
  • S3OutputLocation=s3://aws-athena-query-results-something-us-east-1 (wherever Athena can write to)
  • Schema=default (the schema you’ll use)
  • AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider (this is the proper class to select a profile by name)
  • AwsCredentialsProviderArguments="your-profile-name" (this is the profile name you want to use)

NOTE: You can use other JDBC driver parameters as well. I just focused on the minimum ones to keep it simple.

Blank-out any username/password fields (which, in theory, won’t be referenced anyway) and then scroll down and add the JAR you’re using an artifact. In my case that’s /usr/local/jars/AthenaJDBC42_2.0.2.jar.

Now, you can use a stored profile and hopefully rotating creds to work with %athena interpreter blocks in Zeppelin.

Most of the examples of working with most of the AWS services show basic username & password authentication. That’s all well-and-good, but many shops use the AWS Security Token Service to provide temporary credentials and session tokens to limit exposure and provide more uniform multi-factor authentication. At my workplace, Frank Mitchell created a nice electron app to make it super easy to create and re-up these credentials. The downside of this is that all AWS service usage for work requires using these credentials and I was having the darndest time trying to get Athena’s JDBC driver working with it (but I wasn’t spending alot of time on it as I tend to mirror research data to a local, beefy Apache Drill server).

I finally noticed the

com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider

class and decided to give the following a go (you will need to point fil to wherever you have the Athena jar file):

library(RJDBC)
library(tidyverse)

fil <- "~/Drivers/AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

aws <- ini::read.ini("~/.aws/credentials")

Sys.setenv(AWS_ACCESS_KEY_ID = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_access_key_id)
Sys.setenv(AWS_SECRET_ACCESS_KEY = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_secret_access_key)
Sys.setenv(AWS_SESSION_TOKEN = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_session_token)

provider <- "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"

con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir=Sys.getenv("AWS_S3_STAGING_DIR"),
                 schema_name="DEFAULT_DB_SCHEMA_NAME",
                 aws_credentials_provider_class=provider)

dbListTables(con)

dbListFields(con, "SOME_TABLE_IN_THE_DEFAULT_DB")

dbGetQuery(con, "SELECT * FROM DEFAULT_DB_SCHEMA_NAME.SOME_TABLE_IN_THE_DEFAULT_DB limit 10;")

YMMV on Windows (comments about what does and does not work on Windows are welcome).

The provider line was the key element I was missing prior to last night.

The Awsaml utility monitors/maintains entries under it’s purview credentials file and keeps consistent profile ids, so I keep that AWS_PROFILE setting in my ~/.Renviron.

I also keep the default S3 Athena data staging bucket in an environment variable as well.

If you provide a default schema_name then you can list tables and fields but queries need fully qualified database (Amazon calls them “schemas”) dot table name.

Initial attempts to have this setup “just work” with dplyr 0.6.0 (the forthcoming EPIC release) were unsuccessful but I’ll poke at all this when I get time and likely write a small Athena package to help smooth over rougher areas.

Y’all likely figured all this out way before I did, but in the event someone else is looking for the information, it should be google-able now.