Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn’t been a reference for this until Dewey & I accepted Gordon’s challenge. What makes this a “challenge” is that you first have to figure out what packages provide a d[b]plyr backend interface then figure out what SQL translations they offer (they don’t necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren’t in functional form). So the first step was just a look through CRAN for which packages import dbplyr and also adding in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.pkgs(pkgs) which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core “tell” (i.e. they implement sql_translate()):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
##    pkg       trans
##    <chr>     <chr>
##  1 bigrquery sql_translate_env.BigQueryConnection
##  2 civis     sql_translate_env.CivisConnection
##  3 dbplyr    sql_translate_env.ACCESS
##  4 dbplyr    sql_translate_env.Oracle
##  5 dbplyr    sql_translate_env.SQLiteConnection
##  6 dbplyr    sql_translate_env.Impala
##  7 dbplyr    sql_translate_env.OdbcConnection
##  8 dbplyr    sql_translate_env.MySQLConnection
##  9 dbplyr    sql_translate_env.PqConnection
## 10 dbplyr    sql_translate_env.PostgreSQLConnection
## # … with 18 more rows

Now we know the types of connections that package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
##    pkg       trans              ƒ
##    <chr>     <chr>              <chr>
##  1 bigrquery BigQueryConnection median
##  2 bigrquery BigQueryConnection gsub
##  3 bigrquery BigQueryConnection as.logical
##  4 bigrquery BigQueryConnection is.null
##  5 bigrquery BigQueryConnection case_when
##  6 bigrquery BigQueryConnection is.na
##  7 bigrquery BigQueryConnection if_else
##  8 bigrquery BigQueryConnection str_replace_all
##  9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn’t make the featured image linkable so I had to stick it in again to enabled the link so folks can make it full size which is absolutely necessary to see it).

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.

Cover image from Data-Driven Security
Amazon Author Page

4 Comments Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends

  1. Pingback: Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends – Data Science Austria

  2. Matthew Pettis

    Very cool, I learned a lot of R apart from the intent of the code.

    The heuristic of looking for stop( must get you most of the way there, but I did stumble upon one edge case where it was a miss — for the sqrt() function. I see the following:

    con <- NA
    class(con) <- "Oracle"
    sql_translate_env(con)[[1]] %>% as.list() %>% pluck("sqrt")

    # function (...)
    # {
    # args <- list(...)
    # if (!is.null(n) && length(args) != n) {
    # stop("Invalid number of args to SQL ", f, ". Expecting ",
    # n, call. = FALSE)
    # }
    # if (any(names2(args) != "")) {
    # warning("Named arguments ignored for SQL ", f, call. = FALSE)
    # }
    # build_sql(sql(f), args)
    # }

    I interpret this to mean that sqrt() is implemented, even though it has stop( in it. I however don’t have a better heuristic to offer…

    Reply
    1. hrbrmstr

      Thank you for digging! I was secretly hoping someone would do that (I did not spend a great deal of time on the approach and the stop was a big — and, admittedly, potentially weak — assumption). I’ll improve the heuristic and update the blog and ensure I refer to this comment when I do.

      Reply
  3. T.A. Davis

    This post was particularly relevant for me, since I have begun using Spark through ‘sparklyr’. Though not a purely SQL issue, I have found that certain dplyr verbs do not translate well (or at all) into Spark calls. I suppose the alternative is to explicitly write out SQL and use Spark SQL directly, rather than relying upon dplyr translations. Thanks again for your ongoing and informative posts.

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.