I was chatting with a fellow Amazon Athena user and the topic of using Presto functions such as approx_distinct() via {d[b]plyr} came up and it seems it might not be fully common knowledge that any non-already translated function is passed to the destination intact. That means you can just “use” approx_distinct()
and it will work just fine. Here’s an example using the ODBC {DBI} interface:
library(dbplyr)
library(tidyverse)
# My personal Athena workgroup has been upgraded to "engine 2"
# so Presto 0.217 functions are available. Only noting that for
# folks who may not keep up with AWS announcements.
#
# https://prestodb.io/docs/0.217/index.html
DBI::dbConnect(
odbc::odbc(),
driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
Schema = "sampledb",
AwsRegion = "us-east-1",
AuthenticationType = "IAM Profile",
AWSProfile = "personal",
MaxCatalogNameLen = 0L,
MaxSchemaNameLen = 0L,
MaxColumnNameLen = 0L,
MaxTableNameLen = 0L,
UseResultsetStreaming = 1L,
StringColumnLength = 32 * 1024L,
S3OutputLocation = "s3://accessible-bucket/"
) -> con
# this comes with Athena
elb_logs <- tbl(con, "elb_logs")
elb_logs
## # Source: table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## timestamp elbname requestip requestport backendip backendport
## <chr> <chr> <chr> <int> <chr> <int>
## 1 2014-09-… lb-demo 251.51.8… 17141 251.111.… 8000
## 2 2014-09-… lb-demo 244.201.… 17141 244.140.… 8888
## 3 2014-09-… lb-demo 242.204.… 17141 255.196.… 8888
## 4 2014-09-… lb-demo 251.51.8… 17141 255.129.… 8888
## 5 2014-09-… lb-demo 242.241.… 17141 255.129.… 8899
## 6 2014-09-… lb-demo 243.198.… 17141 255.129.… 8888
## 7 2014-09-… lb-demo 244.119.… 17141 242.89.1… 80
## 8 2014-09-… lb-demo 254.173.… 17141 251.51.8… 8000
## 9 2014-09-… lb-demo 243.198.… 17141 254.149.… 8888
## 10 2014-09-… lb-demo 249.185.… 17141 241.36.2… 8888
## # … with more rows, and 10 more variables: requestprocessingtime <dbl>,
## # backendprocessingtime <dbl>, clientresponsetime <dbl>,
## # elbresponsecode <chr>, backendresponsecode <chr>,
## # receivedbytes <int64>, sentbytes <int64>, requestverb <chr>,
## # url <chr>, protocol <chr>
elb_logs %>%
summarise(d = n_distinct(backendip)) # 0.62 seconds
## # Source: lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## d
## <int64>
## 1 2311
# https://prestodb.io/docs/0.217/functions/aggregate.html#approx_distinct
elb_logs %>%
summarise(d = approx_distinct(backendip)) # 0.49 seconds
## # Source: lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## d
## <int64>
## 1 2386
In this toy example there’s no real reason to use this alternate function, but on my datasets using the approximator version dramatically reduces query time, reduces query cost, and produces results that by default have a standard error of 2.3% (which is fine for the use-cases I apply this to). There’s an alternate signature which lets you supply the standard error, as well.
If you’re curious as to what functions are translated by default, just use sql_translate_env()
on the connection object:
sql_translate_env(con)
## <sql_variant>
## scalar: -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar: %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar: as_date, as_datetime, as.character, as.Date,
## scalar: as.double, as.integer, as.integer64, as.logical,
## scalar: as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar: bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar: bitwXor, c, case_when, ceil, ceiling, coalesce, cos,
## scalar: cosh, cot, coth, day, desc, exp, floor, hour, if,
## scalar: if_else, ifelse, is.na, is.null, log, log10, mday,
## scalar: minute, month, na_if, nchar, now, paste, paste0, pmax,
## scalar: pmin, qday, round, second, sign, sin, sinh, sql, sqrt,
## scalar: str_c, str_conv, str_count, str_detect, str_dup,
## scalar: str_extract, str_extract_all, str_flatten, str_glue,
## scalar: str_glue_data, str_interp, str_length, str_locate,
## scalar: str_locate_all, str_match, str_match_all, str_order,
## scalar: str_pad, str_remove, str_remove_all, str_replace,
## scalar: str_replace_all, str_replace_na, str_sort, str_split,
## scalar: str_split_fixed, str_squish, str_sub, str_subset,
## scalar: str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar: str_trunc, str_view, str_view_all, str_which,
## scalar: str_wrap, substr, substring, switch, tan, tanh, today,
## scalar: tolower, toupper, trimws, wday, xor, yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window: cume_dist, cummax, cummean, cummin, cumsum,
## window: dense_rank, first, lag, last, lead, max, mean, median,
## window: min, min_rank, n, n_distinct, nth, ntile, order_by,
## window: percent_rank, quantile, rank, row_number, sd, sum, var
The release of the latest versions of {d[b]plyr} destroyed a lazy, bad, hack I was using to cast columns to JSON (you’ll note the lack of a cast()
function above, which is necessary for Athena since the syntax is not that of a function call). I’m _very_glad they did since it’s bad to rely on undocumented functionality and, honestly, it’s pretty straightforward to make an “official” translation for them.
First, we need the class of this Athena ODBC connection:
class(con)
## [1] "Amazon Athena"
## attr(,"package")
## [1] ".GlobalEnv"
We’ll need to write a sql_translation.Amazon Athena()
function for this connection class and we’ll start with writing one that doesn’t handle our casting just to show the basic setup:
`sql_translation.Amazon Athena` <- function(x) {
sql_variant(
dbplyr::base_odbc_scalar,
dbplyr::base_odbc_agg,
dbplyr::base_odbc_win
)
}
All that function is doing (now) is setting up the default translators you’ve seen in the above output listings.
To make it do something else, we need to add casting translator helpers, which fall under the “scalar” category. This, too, is pretty straightforward since {dbplyr} makes it possible to just extend a parent set of category translators:
sql_translator(
.parent = dbplyr::base_odbc_scalar,
cast_as = function(x, y) dbplyr::build_sql("CAST(", x, " AS ", y, ")"),
try_cast_as = function(x, y) dbplyr::build_sql("TRY_CAST(", x, " AS ", y, ")")
) -> athena_scalar
`sql_translation.Amazon Athena` <- function(x) {
sql_variant(
athena_scalar,
dbplyr::base_odbc_agg,
dbplyr::base_odbc_win
)
}
Now, let’s see if it really knows about our new casting functions:
sql_translate_env(con)
## <sql_variant>
## scalar: -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar: %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar: as_date, as_datetime, as.character, as.Date,
## scalar: as.double, as.integer, as.integer64, as.logical,
## scalar: as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar: bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar: bitwXor, c, case_when, cast_as, ceil, ceiling,
## scalar: coalesce, cos, cosh, cot, coth, day, desc, exp, floor,
## scalar: hour, if, if_else, ifelse, is.na, is.null, log, log10,
## scalar: mday, minute, month, na_if, nchar, now, paste, paste0,
## scalar: pmax, pmin, qday, round, second, sign, sin, sinh, sql,
## scalar: sqrt, str_c, str_conv, str_count, str_detect, str_dup,
## scalar: str_extract, str_extract_all, str_flatten, str_glue,
## scalar: str_glue_data, str_interp, str_length, str_locate,
## scalar: str_locate_all, str_match, str_match_all, str_order,
## scalar: str_pad, str_remove, str_remove_all, str_replace,
## scalar: str_replace_all, str_replace_na, str_sort, str_split,
## scalar: str_split_fixed, str_squish, str_sub, str_subset,
## scalar: str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar: str_trunc, str_view, str_view_all, str_which,
## scalar: str_wrap, substr, substring, switch, tan, tanh, today,
## scalar: tolower, toupper, trimws, try_cast_as, wday, xor,
## scalar: yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window: cume_dist, cummax, cummean, cummin, cumsum,
## window: dense_rank, first, lag, last, lead, max, mean, median,
## window: min, min_rank, n, n_distinct, nth, ntile, order_by,
## window: percent_rank, quantile, rank, row_number, sd, sum, var
Aye! Let’s test it out.
Unfortunately, this boring, default database has no MAP
columns to really show this off, but we can convert a simple character column into JSON just to get the idea:
elb_logs %>%
select(backendip)
## # Source: lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## backendip
## <chr>
## 1 249.6.80.219
## 2 248.178.189.65
## 3 254.70.228.23
## 4 248.178.189.65
## 5 252.0.81.65
## 6 248.178.189.65
## 7 245.241.133.121
## 8 244.202.183.67
## 9 255.226.190.127
## 10 246.22.152.210
## # … with more rows
elb_logs %>%
select(backendip) %>%
mutate(
backendip = cast_as(backendip, JSON)
)
## # Source: lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
## backendip
## <chr>
## 1 "\"244.238.214.120\""
## 2 "\"248.99.214.228\""
## 3 "\"243.3.190.175\""
## 4 "\"246.235.181.255\""
## 5 "\"241.112.203.216\""
## 6 "\"240.147.242.82\""
## 7 "\"248.99.214.228\""
## 8 "\"248.99.214.228\""
## 9 "\"253.161.243.121\""
## 10 "\"248.99.214.228\""
## # … with more rows
FIN
Despite the {tidyverse} documentation being written with care and clarity, this part of the R ecosystem is so extensive and evolving that watching out for all the doors and corners can be tricky. It’s easy for the short paragraph on the “untranslated function” capability to be overlooked and it may be hard to fully grok the translation concept without an IRL example.
Hopefully this helped (even if only a little) demystify these two areas of {d[b]plyr}.
Parler Whack-a-Mole
(this is an unrolled Twitter thread converted to the blog since one never knows how long content will be preserved anywhere anymore)
It looks like @StackPath (
NetCDN[.]com
redirects to them) is enabling insurrection-mongers. They’re frontingnews[.]parler[.]com
.It seems they (Parler) have a second domain
dicecrm[.]com
with the actual content, too.dicecrm[.]com
is hosted in @awscloud, so it looks like Parler folks are smarter than Bezos’ minions. Amazon might want to take this down before it gets going (again).They load JS via @Google tag manager (you can see in the HTML src). The
GA_MEASUREMENT_ID
is “G-P76KHELPLT
“In site source screenshot in the first tweet there’s a reference to
twexit[.]com
. DNS for it shows they also haveleftwexit[.]com
(which is a very odd site)."Twexit" is being enabled by @awscloud @GoDaddy and @WordPress/@automattic plus @StackPath.
While the main page has (unsurprisingly) busted HTML, they’re using their old sitemap[.]xml — https://carbon.now.sh/mdyJbvddCvZaGu2tOnD6 — which has a singular recent (whining) entry:
http://dicecrm[.]com/updates/facebook-continues-their-confusing-hypocritical-stifling-of-free-speech-
Looks like @Shareaholic is also enabling Parler. Their “
shareaholic:site_id
” is “f7b53d75b2e7afdc512ea898bbbff585
“.One of the CDN content refs is this (attached img). It’s loading content for Parler from
free[ ]pressers[.]com
, which is a pretty nutjob fake news site enabled by @IBMcloud (so IBM is enabling Parler as well). thefree[ ]pressers
Twitter is equally nutjob.I suspect Parler is going to keep rejiggering this nutjob-fueled content network knowing that AWS, IBM (et al) won't play whack-a-mole and are rly just waiting for our collective memory and attention to fade so they can go back to making $ from divisiveness, greed, & hate.
protip: perhaps not spin up a new FQDN with such hastily-crafted garbage behind it when you know lots of very technically-resourced 👀 are on you.
Originally tweeted by (@hrbrmstr) on 2021-01-29.