Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:
Is there a reference document somewhere of which dplyr commands work on various database backends? #rstats
— Gordon Shotwell (@gshotwell) April 9, 2019
Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.
I should note that if you only really care about the backends that come with dbplyr, @paleolimbot has you covered with this post, which also shows you the translated SQL!
Executing The Plan
There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on those defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks (such as string manipulation) have a default noop translation.
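You can poke at those defaults without standing up a database at all, since dbplyr can render SQL against a simulated connection. A quick sketch (translate_sql() and simulate_dbi() are real dbplyr exports; the exact SQL emitted can vary by dbplyr version):

library(dbplyr)

translate_sql(substr(name, 1, 3), con = simulate_dbi()) # a pre-mapped translation
## <SQL> SUBSTR(`name`, 1, 3)

translate_sql(regexp_matches(name, "x."), con = simulate_dbi()) # unmapped functions pass through as-is
## <SQL> regexp_matches(`name`, 'x.')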
If you do have to switch across backends with any frequency, knowing which backend supports which functions might be nice, but there wasn’t a reference for this until Dewey & I accepted Gordon’s challenge. What makes this a “challenge” is that you first have to figure out which packages provide a d[b]plyr backend interface, then figure out what SQL translations they offer (they don’t necessarily have to inherit from the ones provided by dbplyr and may add others to account for SQL clauses that aren’t in functional form). So the first step was just a look through CRAN for packages that import dbplyr, plus some I knew were on GitHub:
library(stringi)
library(hrbrthemes)
library(tidyverse)
# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
"arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
"dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
"healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
"mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
"pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
"sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs
I ended up doing install.packages(pkgs), which was easy since I have a home CRAN mirror and use macOS (so binary package installs).
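In case it helps anyone following along, that step is roughly this (a sketch; the handful of GitHub-only packages would need remotes::install_github() instead):

install.packages(setdiff(pkgs, rownames(installed.packages()))) # skip pkgs already present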
The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core “tell” (i.e. they implement a sql_translate_env() method):
(map_df(pkgs, ~{
tibble(
pkg = .x,
trans = loadNamespace(.x) %>%
names() %>%
keep(stri_detect_fixed, "sql_translate")
)
}) -> xdf)
## # A tibble: 28 x 2
## pkg trans
## <chr> <chr>
## 1 bigrquery sql_translate_env.BigQueryConnection
## 2 civis sql_translate_env.CivisConnection
## 3 dbplyr sql_translate_env.ACCESS
## 4 dbplyr sql_translate_env.Oracle
## 5 dbplyr sql_translate_env.SQLiteConnection
## 6 dbplyr sql_translate_env.Impala
## 7 dbplyr sql_translate_env.OdbcConnection
## 8 dbplyr sql_translate_env.MySQLConnection
## 9 dbplyr sql_translate_env.PqConnection
## 10 dbplyr sql_translate_env.PostgreSQLConnection
## # … with 18 more rows
Now we know which connection types each package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.
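Here’s that trick in miniature (a sketch run in the same session as above; “Impala” is one of the connection classes from the output up top):

con <- NA
class(con) <- "Impala" # dummy object that dispatches to dbplyr's Impala method
env <- sql_translate_env(con)
names(env)              # the scalar / aggregate / window translator environments
head(names(env$scalar)) # a few of the scalar translations on offer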
However, some may take the defaults from dbplyr and not override them, so we also need to test whether they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:
(filter(xdf, stri_detect_fixed(trans, ".")) %>%
filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling
filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
mutate(ƒ = map(trans, ~{
# get the sql translate functions
con <- NA
cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
class(con) <- cls
env <- sql_translate_env(con)
# but ^^ rly isn't a nice, tidy object, it's a list of environments
# with functions in it so we have to iterate through it to extract
# the function names.
map_df(env, ~{
part <- .x
fs <- names(part)
# but it's not just good enough to do that b/c a given function name
# might just implement the "sql_not_supported()" pass through. So we have
# to actually look to see if the function body has a "stop()" call in it
# and ignore it if it does.
map_df(fs, ~{
tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
filter(!stri_detect_fixed(src, "stop(")) %>%
filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
select(-src)
})
})
})) %>%
unnest(ƒ) %>%
mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
## pkg trans ƒ
## <chr> <chr> <chr>
## 1 bigrquery BigQueryConnection median
## 2 bigrquery BigQueryConnection gsub
## 3 bigrquery BigQueryConnection as.logical
## 4 bigrquery BigQueryConnection is.null
## 5 bigrquery BigQueryConnection case_when
## 6 bigrquery BigQueryConnection is.na
## 7 bigrquery BigQueryConnection if_else
## 8 bigrquery BigQueryConnection str_replace_all
## 9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows
The rest is all just ggplot2 basics:
mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
mutate(n = 1) %>% # heatmap block on
complete(db, ƒ) %>% # complete the heatmap
arrange(ƒ) %>%
mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
ggplot(aes(db, ƒ)) +
geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
scale_x_discrete(expand=c(0,0.1), position = "top") +
scale_fill_continuous(na.value="white") +
labs(
x = NULL, y = NULL,
title = "SQL Function Support In Known d[b]plyr Backends"
) +
theme_ipsum_ps(grid="", axis_text_size = 9) + # theme_ipsum_ps() needs the dev version of hrbrthemes; sub in a different theme if you only have the CRAN release
theme(axis.text.y = element_text(family = "mono", size = 7))
Which makes:
(WP wouldn’t make the featured image linkable so I had to stick it in again to enable the link so folks can make it full size, which is absolutely necessary to see it.)
FIN
If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.
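For a quick taste of that unstandardization without leaving the console, the simulated connections that ship with dbplyr will render backend-specific SQL for you (a sketch; the exact translations depend on your dbplyr version):

library(dbplyr)

translate_sql(paste0(first, last), con = simulate_mysql()) # CONCAT_WS()-style concatenation
translate_sql(paste0(first, last), con = simulate_mssql()) # `+` string concatenation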
Contiguous code for the above is over at SourceHut.
5 Comments
Very cool, I learned a lot of R apart from the intent of the code.
The heuristic of looking for stop( must get you most of the way there, but I did stumble upon one edge case where it was a miss: the sqrt() function. I see the following:

con <- NA
class(con) <- "Oracle"
sql_translate_env(con)[[1]] %>% as.list() %>% pluck("sqrt")
# function (...)
# {
#   args <- list(...)
#   if (!is.null(n) && length(args) != n) {
#     stop("Invalid number of args to SQL ", f, ". Expecting ",
#       n, call. = FALSE)
#   }
#   if (any(names2(args) != "")) {
#     warning("Named arguments ignored for SQL ", f, call. = FALSE)
#   }
#   build_sql(sql(f), args)
# }
I interpret this to mean that sqrt() is implemented, even though it has stop( in it. I however don’t have a better heuristic to offer…

Thank you for digging! I was secretly hoping someone would do that (I did not spend a great deal of time on the approach, and the stop was a big and, admittedly, potentially weak assumption). I’ll improve the heuristic and update the blog and ensure I refer to this comment when I do.
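For the curious, one way that improvement might look (a sketch, not the final fix: it assumes the sql_not_supported() stubs in dbplyr stop() with the message “is not available in this SQL variant”, which is worth verifying against your installed version):

is_not_supported <- function(f) {
  # match the specific "not supported" message rather than any stop() call,
  # so argument-count checks like the one in sqrt() above don't false-positive
  src <- paste0(as.character(body(f)), collapse = "; ")
  stringi::stri_detect_fixed(src, "not available in this SQL variant")
}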
This post was particularly relevant for me, since I have begun using Spark through ‘sparklyr’. Though not a purely SQL issue, I have found that certain dplyr verbs do not translate well (or at all) into Spark calls. I suppose the alternative is to explicitly write out SQL and use Spark SQL directly, rather than relying upon dplyr translations. Thanks again for your ongoing and informative posts.
Is there any chance you can update this for 2022?
Sure! I’ll sync CRAN and run it this weekend and make a new post. Will tag you when I do.