
Category Archives: Linux

I must preface this post with the posit that if you’re doing anything interactive() with Amazon Athena you should seriously consider just using their free ODBC drivers, as that’s the easiest way to wire Athena up to R, DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided Linux ODBC drivers to work on your flavor of Linux.

However

There are times—say, when you’re trying to stand up an R service in your Kubernetes cluster which bridges data in Athena to analyses & visualizations in R—when ODBC drivers can be more of a hindrance than a help and JDBC is the path of least resistance.

Sure, there’s the in-CRAN AWR.Athena package but it’s a fairly constrained and low-feature RJDBC shim which gets the basic job done but not much more.

Enter:

a trio of packages—metis.jars, metis, and metis.tidy—which aims to make it super-straightforward to wire up R to Amazon Athena when ODBC is not available.

Why Three Packages?

For starters, there are CRAN hopes for the metis-trio, and one key component of that is separating the JARs into one package (metis.jars) and the actual functionality into the others (metis and metis.tidy). We’ll see how the CRAN attempt goes since the JAR package weighs in heavily enough to warrant a NOTE. Packaging the driver removes the need for you to pre-load the JAR (locally or into, say, a Docker image) or perform a package-initiated download dance the way AWR.Athena does (which I still don’t understand hasn’t gotten it kicked off CRAN, but ¯\_(ツ)_/¯).

metis.jars also has three helper functions which do some (basic) fun things:

library(metis.jars)

simba_driver_version()
## [1] "02.00.06.1008"

athena_supported_types()
##  [1] "BOOLEAN"   "TINYINT"   "SMALLINT"  "INT"       "INTEGER"  
##  [6] "BIGINT"    "REAL"      "FLOAT"     "DOUBLE"    "DECIMAL"  
## [11] "DATE"      "TIMESTAMP" "BINARY"    "VARBINARY" "CHAR"     
## [16] "VARCHAR"   "STRING"    "ARRAY"     "MAP"       "ROW"      
## [21] "STRUCT"   

metis_jar_path()
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/metis.jars/java/AthenaJDBC42_2.0.6.jar"

The first uses the rJava interface to query the driver version directly (since Amazon seems to update the Simba JAR twice a year). By splitting the JAR out into its own package, updates can be made to the two sibling packages more frequently without crushing CRAN’s disk space. metis.jars is also versioned to the included JAR, so configuration management will be easier for folks.

The second, type-lister function exists because there’s hope Amazon will add support for all Presto data types, especially IPADDRESS. It, again, performs JDBC driver introspection to collect the currently supported types.

Finally, the third function abstracts the JAR location from the metis package or even your own interface package should you choose to depend on it.
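
If you’re curious how that sort of driver introspection can work, here’s a minimal rJava sketch. Treat it as an illustration, not the package internals: the Simba driver class name is an assumption on my part, and the real simba_driver_version() returns the full four-part version string rather than the major/minor integers shown here.

library(rJava)
library(metis.jars)

# assumption: the bundled Simba Athena JAR exposes the standard driver class
.jinit()
.jaddClassPath(metis_jar_path())

drv <- .jnew("com/simba/athena/jdbc/Driver")

# every JDBC driver implements getMajorVersion()/getMinorVersion()
.jcall(drv, "I", "getMajorVersion")
.jcall(drv, "I", "getMinorVersion")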

OK, But Why Not Just Two?

The metis package is a more functional RJDBC superclass of a DBI wrapper than AWR.Athena. One thing it does that its CRAN cousin cannot is handle BIGINTs properly:

library(metis)
library(dplyr) # for %>%

dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

dbGetQuery(con, "
SELECT
  CAST('chr' AS CHAR(4)) achar,
  CAST('varchr' AS VARCHAR) avarchr,
  CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
  CAST(100.1 AS DOUBLE) AS justadbl,
  CAST(127 AS TINYINT) AS asmallint,
  CAST(100 AS INTEGER) AS justanint,
  CAST(100000000000000000 AS BIGINT) AS abigint,
  CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
  ARRAY[1, 2, 3] AS arr1,
  ARRAY['1', '2, 3', '4'] AS arr2,
  MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
  CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
  CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
") %>% 
  dplyr::glimpse()
## Observations: 1
## Variables: 13
## $ achar     <chr> "chr "
## $ avarchr   <chr> "varchr"
## $ tsday     <date> 2014-09-29
## $ justadbl  <dbl> 100.1
## $ asmallint <int> 127
## $ justanint <int> 100
## $ abigint   <S3: integer64> 100000000000000000
## $ is_get    <lgl> TRUE
## $ arr1      <chr> "1, 2, 3"
## $ arr2      <chr> "1, 2, 3, 4"
## $ mp        <chr> "{bar=2, foo=1}"
## $ rw        <chr> "{x=1, y=2.0}"
## $ js        <chr> "\"{\\\"a\\\":1}\""

Presto/Athena arrays and maps and rows and JSON come across as characters from the Athena driver and they’re formatted so badly that there’s little hope of full R support for list columns for them. But, you do get real, big integers with metis along with full support for all other current Athena types.

R folk who may be users of the old, standalone metis package need to be aware of some things.

First, dbConnect() has breaking changes. The snake_case names that still exist in the higher-level athena_jdbc() function are gone. In exchange for this pain, you now have full naming-parity with all the Athena JDBC connection properties and can more easily use alternate credential providers (something metis’ CRAN cousin simply cannot do for you), as illustrated in the example above and in the package README.
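
As a quick, hedged illustration of what that buys you: the Simba driver documents other provider classes under the same shaded com.simba.athena.amazonaws.auth namespace, so (assuming I have the class name right) running inside AWS with an instance profile means you can skip the properties file entirely:

library(metis)

# hypothetical: when running on an EC2 instance (or anything else with an
# instance profile), use the instance-profile credentials provider that
# ships inside the Simba driver instead of a credentials properties file
dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass =
    "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider"
) -> con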

The metis package also makes it easier to see documentation for all available Athena connection properties since it has a vignette with a descriptive table of all of them (rendered here).

There is also nascent support for the “streaming API” (TLDR: faster result set downloads) but that won’t be fully tested until some AWS policy tweaks happen this week.

Gotcha. But, Why Not Just Two?

As awesome as it is (including base Docker image support) the tidyverse is not without overhead in terms of compilation time and dependencies, both of which are especially painful on Linux systems and some Docker environments. You can absolutely get by with some well-crafted SQL and JDBC and the thinner the image the easier it is to deploy and scale.

But! The tidyverse is so helpful that ensuring smooth support for Athena is critical. On its own, metis wires up to dplyr/dbplyr fine, but by providing (in metis.tidy) some enhanced db_data_type() support (primarily for BIGINT) and some extra 💙 in sql_translate_env() (for those of us who continue to mindlessly use R-only verbs like grep() or as.POSIXct() in non-R contexts) we can level-up both interactive() use and tidyverse-infused service use:

library(metis.tidy)
library(dbplyr)
library(dplyr)

metis::dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

elb_logs <- tbl(con, "elb_logs")

filter(elb_logs, grepl("20", elbresponsecode)) %>%
  mutate(
    tsday = as.Date(substring(timestamp, 1L, 10L)),
    host = url_extract_host(url),
    proto_version = regexp_extract(protocol, "([[:digit:]\\.]+)")
  ) %>%
  select(tsday, host, receivedbytes, requestprocessingtime, proto_version) %>%
  head(1) %>%
  glimpse()
## Observations: ??
## Variables: 5
## Database: AthenaConnection
## $ tsday                 <date> 2014-09-29
## $ host                  <chr> "www.abcxyz.com"
## $ receivedbytes         <S3: integer64> 0
## $ requestprocessingtime <dbl> 9.5e-05
## $ proto_version         <chr> "1.1"

FIN

A fairly big impetus for this radical refactoring was the need to use the Athena JDBC interface in R at $DAYJOB in a serverless context. So, if I/we needed it, others may as well. All three packages have tests (that work with my personal Athena setup which is easily replicated since it’s just the default schema & table you get when you enable Athena), pass CRAN checks and will be live in a real production environment by the time you read this.

Note that I do have CRAN plans for these three amigos, but all three packages will need to go in at the same time and I need to get tests into Travis and prove they run there before submitting. Now’s the time for feature requests, problem reports or issues. Until SourceHut’s (sr.ht) API is finished, said contributions are best left to GitLab (preferably) or GitHub (if you must continue to fill the coffers of giant multinational companies that undermine your freedom).

POSTSCRIPT

One other reason for re-visiting metis was this R-crashing rJava issue that is really a Simba Athena implementation issue (OS signals in a JDBC driver, rly?).

This Rprofile entry:

options(
  "java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs")
)

has been a solid workaround until rJava is updated. Note that metis.jars warns about this on load if it detects your setup is at risk.

Over on the [Data Driven Security Blog](http://datadrivensecurity.info/blog/posts/2014/Apr/making-better-dns-txt-record-lookups-with-rcpp/) there’s a post on how to use `Rcpp` to interface with an external library (in this case `ldns` for DNS lookups). It builds on [another post](http://datadrivensecurity.info/blog/posts/2014/Apr/firewall-busting-asn-lookups/) which uses `system()` to make a call to `dig` to lookup DNS `TXT` records.

The core code is below and at both the aforementioned blog post and [this gist](https://gist.github.com/hrbrmstr/11286662). The post walks you through creating a simple interface and a future post will cover how to build a full package interface to an external library.
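
For a sense of the `system()`-based approach that earlier post builds on, here’s a minimal sketch of my own (not the gist code; it assumes `dig` is on your `PATH`):

# quick-and-dirty TXT record lookups by shelling out to dig; this is the
# sort of call the Rcpp/ldns version replaces with a proper library interface
dig_txt <- function(domain) {
  system(sprintf("dig +short TXT %s", domain), intern = TRUE)
}

dig_txt("google.com")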

I was trying to convey my backup workflow/setup to @joeday in 140 characters and it just wasn’t working very well. Twitter – as one might expect – is not exactly the place for detailed technical discussions, but it does provide fertile ground to spark ideas and dialogue. I told @geekshui that I’d blog my setup and that turned out to be just enough of a catalyst to force me to iron out my strategy for rud.is and future (if any) non-cooking/family blogging.

Background

I’m [still] a die-hard OS X user, despite the increasing gatekeeper motif Apple is sporting these days. My main computer is a MacBook Pro which I would stupidly run back into a burning building to rescue. Everything is on it. Everything. I digitize receipts, house our multimedia, spin out VMs like a DJ, create, compose, torrent, rip, zip and hack from it. Consequently, ensuring my data is available is kinda important to me.

I’ve been around computers long enough to have learned some painful lessons from four simple characters: MTBF. Drives break. Electronics fail. It’s an undeniable fact. The only way to recover from these failures is to have a good strategy for keeping your data available.

Strategy #1: Backups

While hard to digest on Twitter, my backup strategy is pretty straightforward. I use Time Machine for OS-managed full system backups. I rotate these between two large (1TB & 2TB) hard drives and I retire one large hard drive each year (MTBF…remember?). This gets me individual file recovery pretty quickly over a decent time period and a bit of hardware peace of mind.

I also have two 2.5″ IEEE 1394 drives that I SuperDuper!/Carbon Copy Clone images to every time Apple issues a 10.x.y update. Again, I rotate between them since I really don’t trust drive manufacturers. I haven’t relied on TrueCrypt for a while (which would make for an ugly workflow) for system volumes, but it’s easy to clone disks that have FileVault protected data as long as you do so from an account that does not use or rely on FileVault data.

Both Time Machine and the drive cloning can occur while I’m sleeping, so no workflow is impacted.

Strategy #2: Dropbox

I have to start by sharing just how much I <3 Dropbox. I don’t use the free service as I grew weary of keeping within the paltry limits. Getting a paid sub to it provides more than just freedom from minutiae. I now get (as long as they have no hiccups) full recovery back as far as I want in the event I do actually lose a file or two. I have Dropbox configured on my MacBook Pro, a home Windows machine and a home Linux box. This means that even if I lose the drive on my Mac, I can get some of my non-sensitive data back from one of the other Dropbox-enabled systems (which is much faster than recovering from backups). It also means that I can get right back to work on a different system – as long as I have not used an OS X-specific program.

I could rant for quite a while about Dropbox, but it should be pretty obvious why this is part of my backup strategy.

Strategy #3: rsync.net

While Dropbox houses non-sensitive data offsite (again, assuming no service hiccups), there is a subset of my information that I do want housed off-site in the event there is a catastrophic issue with our abode. For that, I have been using rsync.net since its inception. They provide outstanding customer support, have a unique view and practices around warrants and fully understand the needs of technical users concerned about availability and privacy.

There are some other things we do to ensure a refresh of the content on media drives that get hooked up to our PS3 or displays, but the above three steps are how I ensure that I always have access to the data that enables my workflow.