
I caught a re-tweet of this tweet by @harry_stevens:

Harry’s thread and Observable post are great on their own and both show the power and utility of Observable javascript notebooks.

However, the re-tweet (which I’m not posting because it’s daft) took a swipe at both Python & R. Now, I’m all for a good swipe at Python (mostly to ensure we never forget all those broken spacebars and tab keys that language has caused) but I’ll gladly defend it and R together when it comes to Getting Things Done, even on deadline.

Let’s walk through what one of us might have done had we been in the same scenario as Harry.

Mapping On A Deadline

So, we have to create a map of historical tornado frequency trends on deadline.

We emailed researchers and received three txt files: one holds latitudes, another longitudes, and the final one the trend values. It's gridded data.

Download that ZIP (and pretend you got three separate files in email instead of a nice ZIP), make a new RStudio project called "tornado", and put those three files in a data/ directory at the project root. Let's read them in and look at them:

library(hrbrthemes) # not 100% necessary but i like my ggplot2 theme(s) :-)
library(tidyverse)  # data wrangling & ggplot2

tibble(
  lat = scan(here::here("data/lats.txt")),
  lon = scan(here::here("data/lons.txt")),
  trend = scan(here::here("data/trends.txt"))
) -> tornado

You very likely never directly use the base::scan() function, but it’s handy here since we just have files of doubles with each value separated by whitespace. Now, let’s see what we have:

tornado
## # A tibble: 30,000 x 3
##      lat   lon trend
##    <dbl> <dbl> <dbl>
##  1 0.897 -180.     0
##  2 0.897 -179.     0
##  3 0.897 -178.     0
##  4 0.897 -176.     0
##  5 0.897 -175.     0
##  6 0.897 -174.     0
##  7 0.897 -173.     0
##  8 0.897 -172.     0
##  9 0.897 -170.     0
## 10 0.897 -169.     0
## # … with 29,990 more rows

summary(tornado)
##      lat               lon                 trend           
## Min.   : 0.8973   Min.   :-179.99808   Min.   :-0.4733610  
## 1st Qu.:22.0063   1st Qu.: -90.00066   1st Qu.: 0.0000000  
## Median :43.1154   Median :  -0.00323   Median : 0.0000000  
## Mean   :43.1154   Mean   :  -0.00323   Mean   : 0.0002756  
## 3rd Qu.:64.2245   3rd Qu.:  89.99419   3rd Qu.: 0.0000000  
## Max.   :85.3335   Max.   : 179.99161   Max.   : 0.6314569  

#+ grid-overview
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

#+ trend-overview
ggplot(tornado, aes(trend)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.05))

Since we're looking for trends (in either direction) in just the United States, the latitude and longitude ranges will need to be shrunk down a bit (it does indeed look like globally gridded data), and we'll be able to shrink the data set a bit more since we only want to look at strongly positive or negative trends.

We don’t really need modern R/ggplot2 mapping idioms for this project (i.e. the new {sf} ecosystem), so we’ll keep it “simple” (scare quotes since that’s a loaded term) and just use the built in maps and geom_map(). First, let’s get the U.S. states and extract their bounding boxes/limits:

maps::map("state", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% 
  fortify() %>% 
  as_tibble() -> state_map

xlim <- range(state_map$long)
ylim <- range(state_map$lat)

NOTE: I tend not to use the handy ggplot2::map_data() function since it ends up clobbering purrr::map() which I use heavily (though not in this post). I also try to use {sf} these days so this tends not to be an issue anymore anyway.
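For reference, the map_data() route yields the same structure as the fortify() pipeline above (long, lat, group, order, region, subregion columns), so this minimal sketch is equivalent:

state_map <- tibble::as_tibble(ggplot2::map_data("state")) # same columns as the fortify() route

xlim <- range(state_map$long)
ylim <- range(state_map$lat)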

Now, let’s focus in on the target area in the original paper and the Axios article:

filter(
  tornado,
  between(lon, -107, xlim[2]), between(lat, ylim[1], ylim[2]), # -107 gets us ~left-edge of TX
  ((trend < -0.07) | (trend > 0.07)) # approximates notebook selection range
) -> tornado

#+ grid-overview-2
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

Now we’re getting close to our final solution.

As stated in the Observable notebook and implied by the word “grid” these dots are centroids of grid rectangles. This means we really want boxes, not points. The article got all fancy but it’s not really necessary since we can use ggplot2::geom_tile() to get us said boxes:

#+ grid-overview-3
ggplot(tornado, aes(lon, lat)) +
  geom_tile(aes(fill = trend, color = trend))

Now, we just need to add in map layers, and tweak some aesthetics to make it look like a map. We’ll start naively:

#+ map-1
ggplot() +
  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +
  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = "black", size = 0.125, fill = NA
  )

Our gridded data is definitely covering the right/same areas so we just need to make this more suitable for an article. We’ll use Harry’s palette and layer in U.S. state borders, an overall country border, and approximate the title and legend aesthetics:

#+ map-final
c(
  "#023858", "#045a8d", "#0570b0", "#3690c0", "#74a9cf",
  "#a6bddb", "#d0d1e6", "#ece7f2", "#fff7fb", "#ffffff",
  "#ffffcc", "#ffeda0", "#fed976", "#feb24c", "#fd8d3c",
  "#fc4e2a", "#e31a1c", "#bd0026", "#800026"
) -> grad_cols # colors from article

ggplot() +

  # tile layer

  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +

  # state borders

  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = ft_cols$slate, size = 0.125, fill = NA
  ) +

  # usa border

  borders("usa", colour = "black", size = 0.5) +

  # color scales

  scale_colour_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +
  scale_fill_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +

  # make it Albers-ish and ensure we can fit the borders in 

  coord_map(
    projection = "polyconic",
    xlim = scales::expand_range(range(tornado$lon), add = 2),
    ylim = scales::expand_range(range(tornado$lat), add = 2)
  ) +

  # tweak legend aesthetics

  guides(
    colour = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    ),
    fill = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    )
  ) +
  labs(
    x = NULL, y = NULL
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.key.width = unit(4, "lines")) +
  theme(legend.key.height = unit(0.5, "lines"))

FIN

I went through some extra steps for folks new to R, but the overall approach was at the very least as expedient as the Observable one and — despite the claims by the quite daft retweet — this is no less "shareable" or "reusable" than the Observable notebook. You can clone the repo (https://git.sr.ht/~hrbrmstr/tornado) and reuse this work immediately.

If you take a stab at an alternate approach — especially if you do use {sf} — definitely blog about it and drop a link here or on Twitter.

I’m using GitUgh links here b/c the issue was submitted there. Those not wishing to be surveilled by Microsoft can find the macOS QuickLook plugin project and {rdatainfo} project in SourceHut and GitLab (~hrbrmstr and hrbrmstr accounts respectively).

I hadn't touched QuickLookR or {rdatainfo} at all since 2016, as they were really just proof-of-concept examples. Yet the suggestion to have it handle R Markdown (Rmd) files felt useful, so I updated {rdatainfo} to better handle data loading for rds, rdata, and rda file extensions and made a small update to the macOS QuickLookR QuickLook extension project to treat Rmd files as text files, which can be previewed and then edited with the default Finder editor you (or your apps) have set for Rmd files.

The {rdatainfo} package is only needed if you need/want R data file preview support (i.e. it’s not necessary for R markdown files). Just unzip the plugin release and put it into ~/Library/QuickLook. Here are examples for the four file types (the example code under saveRDS() and save() was used to generate those data files and the R markdown file is the default one):

[file icons]

[Rmd preview]

[rds preview]

[rdata preview]
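For illustration, saveRDS()/save() calls along these lines (the file names and objects here are hypothetical, not the exact ones behind the screenshots) are all it takes to produce files the plugin will preview:

saveRDS(mtcars, "~/Desktop/mtcars.rds")                  # single-object rds

save(mtcars, iris, file = "~/Desktop/cars-flowers.rda")  # multi-object rda

airq <- airquality
save(airq, file = "~/Desktop/airquality.RData")          # rdata/RData works the same way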

FIN

This is my first Xcode app build under macOS 10.14 so definitely file issues if you’re having trouble installing or compiling (there are some new shared library “gotchas” that I don’t think apply to this Xcode project but may).

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you'd have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven't had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while—even on freshly minted systems—so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the "view source" version of the page in your browser and the developer tools "elements" panel and you'll see that the content is in the javascript right there on the site, but the source has no <div> elements because they're generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like "Selector Gadget" to be more harmful than helpful. You're really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it'll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} at:

UPDATE 2019-04-17 — The example at the bottom which shows that the, er, randomly chosen site has the offending <meta> tag present is an old result. As of this update timestamp, that robots noindex tag is not on the site. Since the presence status of that tag is in flux, it will continue to be monitored.


Say your organization has done something pretty terrible. Terrible enough that you really didn't want to acknowledge it initially but eventually blogged about it. You haven't added a blog post since, so that entry sits at the top of your blog index page, which Google can and will keep indexing since it's been linked to from this site, which has a high rating in their massive internal database.

If you wanted to help ensure nobody finds that original page, there are lots of ways to do that.

First, you could add a Disallow entry in your robots.txt for it. Ironically, some organizations don’t go that route but do try to prevent Google (et al) from indexing their terms of use and privacy policy, which might suggest they don’t want to have a historical record that folks could compare changes to, and perhaps are even planning changes (might be good if more than just me saves off some copies of that now).

Now, robots.txt modifications are fairly straightforward. And, they are also super easy to check.
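For instance, a quick check from the R console (swap in whichever site you care about):

robots <- readLines("https://www.datacamp.com/robots.txt", warn = FALSE)
robots[grepl("^(User-[Aa]gent|Disallow):", robots)] # show the crawler rules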

So, what if you wanted to hide your offense from Google (et al) and not make it obvious in your robots.txt? For that, you can use a special <meta> tag in the header of your site.

This is an example of what that looks like:

[screenshot: datacamp]

but that may be hard to see, so let’s look at it up close:

<meta name="robots" content="noindex" class="next-head" />
<title class="next-head">A note to our community (article) - DataCamp</title>
<link rel="canonical" href="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />
<meta property="og:url" content="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />

That initial <meta> tag will generally be respected by all search engines.

And, if you want to really be sneaky, you can add a special X-Robots-Tag: noindex HTTP header to your web server for any page you want to have no permanent record of and sneak past even more eyes.

Unfortunately, some absolute novices who did know how to do the <meta> tag trick aren’t bright enough to do the sneakier version and get caught. Here’s an example of a site that doesn’t use the super stealthy header approach:

[screenshot: datacamp]

FIN

So, if you’re going to be childish and evil, now you know what you really should do to try to keep things out of public view.

Also, if you’re one of the folks who likes to see justice be done, you now know where to check and can use this R snippet to do so whenever you like. Just substitute the randomly chosen site/page below for one that you want to monitor.

library(httr)
library(xml2)

httr::GET(
  url = "https://www.datacamp.com/community/blog/note-to-our-community"
) -> res

data.frame(
  name = names(res$all_headers[[1]]$headers), # if there is more than one set (i.e. redirects) you'll need to iterate
  value = unlist(res$all_headers[[1]]$headers, use.names = FALSE)
) -> hdrs

hdrs[grepl("robots", hdrs[["name"]], ignore.case = TRUE), ] # the header, if present, is "X-Robots-Tag"
## [1] name  value
## <0 rows> (or 0-length row.names)

httr::content(res) %>% 
  xml_find_all(".//meta[@name='robots']")
## {xml_nodeset (1)}
## [1] <meta name="robots" content="noindex" class="next-head">\n

readLines("https://www.datacamp.com/robots.txt")
## [1] "User-Agent: *"                                                              
## [2] "Disallow: /users/auth/linkedin/callback"                                    
## [3] "Disallow: /terms-of-use"                                                    
## [4] "Disallow: /privacy-policy"                                                  
## [5] "Disallow: /create/how"                                                      
## [6] "Sitemap: http://assets.datacamp.com/sitemaps/main/production/sitemap.xml.gz"

Thank you for reading to the end of this note to our community.

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.
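If you want a concrete feel for what "translation" means here, {dbplyr} ships simulated connections you can poke at without a live database. A minimal sketch (the exact SQL output depends on your {dbplyr} version):

library(dbplyr)

translate_sql(substr(name, 1, 3), con = simulate_dbi())      # the default translation
translate_sql(paste0(name, "!"), con = simulate_mysql())     # a backend-specific variant
translate_sql(paste0(name, "!"), con = simulate_postgres())  # and another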

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn't been a reference for this until Dewey & I accepted Gordon's challenge. What makes this a "challenge" is that you first have to figure out what packages provide a d[b]plyr backend interface, then figure out what SQL translations they offer (they don't necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren't in functional form). So the first step was just to look through CRAN for packages that import dbplyr and add in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.packages(pkgs), which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core "tell" (i.e. they implement sql_translate_env()):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
##    pkg       trans
##    <chr>     <chr>
##  1 bigrquery sql_translate_env.BigQueryConnection
##  2 civis     sql_translate_env.CivisConnection
##  3 dbplyr    sql_translate_env.ACCESS
##  4 dbplyr    sql_translate_env.Oracle
##  5 dbplyr    sql_translate_env.SQLiteConnection
##  6 dbplyr    sql_translate_env.Impala
##  7 dbplyr    sql_translate_env.OdbcConnection
##  8 dbplyr    sql_translate_env.MySQLConnection
##  9 dbplyr    sql_translate_env.PqConnection
## 10 dbplyr    sql_translate_env.PostgreSQLConnection
## # … with 18 more rows

Now we know the types of connections that package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
##    pkg       trans              ƒ
##    <chr>     <chr>              <chr>
##  1 bigrquery BigQueryConnection median
##  2 bigrquery BigQueryConnection gsub
##  3 bigrquery BigQueryConnection as.logical
##  4 bigrquery BigQueryConnection is.null
##  5 bigrquery BigQueryConnection case_when
##  6 bigrquery BigQueryConnection is.na
##  7 bigrquery BigQueryConnection if_else
##  8 bigrquery BigQueryConnection str_replace_all
##  9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn't make the featured image linkable so I had to stick it in again to enable the link so folks can make it full size, which is absolutely necessary to see it).

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.

Researching “the internet” (i.e. $DAYJOB) means having to deal with a ton of “unique” (I’m being kind) data formats. This is ultimately a tale of how I performed full-text searches across one of them.

It all started off innocently enough. This past week I needed to be able to do full-text searches across metadata about who is using which parts of the internet. Normally I don't need to do that at scale and can just go to RIPE's excellent resource and manage to find what I need on the first page. However, this time I needed all the resultant info and noticed an interesting foible on that full text search interface. To reproduce it, enter something like "domino's" (for the record, I'm not researching Domino's Pizza — nor would I ever consume it — but a Twitter ad happened to fly by for Domino's and I just typed it for kicks) into the field and page around, keeping an eye on the results. I think they still use Solr for indexing/searching and aren't passing in all they need to keep session context or something. Anyway, suffice it to say it was fairly useless (I filed a bug report, so I'm not just complaining, and I wish more sites had the same easy error-report filing capability the RIPE folks do).

If it were just searching for precise data in one field, that’s not really an issue since we have ALL THE WHOIS IP THINGS in Parquet. But:

  • I really hate giving Amazon money (even if it’s $WORK money) for Athena queries
  • Full text search across all columns is not one of Parquet’s strengths
  • This is a third bullet b/c I feel compelled to have a minimum of three points in bullet lists likely thanks to an overbearing middle-school English teacher

Since I have a modest analytics server setup at home, I figured I'd take the opportunity to re-brush-up on either Elasticsearch or Couchbase since both are pretty great at free text searching JSON data. Except…this isn't JSON data. It's records formatted like this:

#
# The contents of this file are subject to 
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

as-block:       AS7 - AS7
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

as-block:       AS28 - AS28
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

The "keys" (the colon-ified line prefixes) vary and there are other record types (which I don't need) that have other prefixes in them, plus those #-prefixed comments are not necessarily only at the top. But, after judicious use of stringi::stri_enc_toutf8(), stringi::stri_split_regex() and some vectorized record targeting (see the sketch after the sample records below), they're pretty easily converted to lovely ndjson data like this (random selection further in the conversion):

{"descr":"Reseau Teleinformatique de l'Education Nationale Educational and research network for Luxembourg","admin_c":"DUMY-RIPE","as_set":"AS-RESTENA","members":"AS2602, AS42909, AS51966, AS49624","mnt_by":"AS2602-MNT","notify":"noc@restena.lu","tech_c":"DUMY-RIPE"}
{"descr":"CWIX ASes announced to EBONE","admin_c":"DUMY-RIPE","as_set":"AS-TMPEBONECWIX","members":"AS3727, AS4445, AS4610, AS4624, AS4637, AS4654, AS4655, AS4656, AS4659 AS4681, AS4696, AS4714, AS4849, AS5089, AS5090, AS5532, AS5551, AS5559 AS5655, AS6081, AS6255, AS6292, AS6618, AS6639","mnt_by":"EBONE-MNT","notify":"staff@ebone.net","tech_c":"DUMY-RIPE"}
{"descr":"ASs accepted by DFN from the University of Cologne","admin_c":"DUMY-RIPE","as_set":"AS-DFNFROMCOLOGNE","members":"AS5520 AS6733","mnt_by":"DFN-MNT","tech_c":"DUMY-RIPE"}
{"descr":"NetMatters UK","admin_c":"DUMY-RIPE","as_set":"AS-NETMATTERS","members":"AS6765 AS3344","mnt_by":"AS8407-MNT","tech_c":"DUMY-RIPE"}

I went with Couchbase since it handles ndjson import by default and — as you know since you read the comparison in the aforelinked article — it can easily index all fields by default without you having to do virtually anything. Plus, Couchbase has been around long enough that it generally installs without pain and has a fairly decent web admin panel. Here’s a snapshot of the final import:

and here’s the config for the “all” full text index:

{
  "type": "fulltext-index",
  "name": "all",
  "uuid": "481bc7ed642dddfb",
  "sourceType": "couchbase",
  "sourceName": "ripe",
  "sourceUUID": "3ffbbe0c0923f233ffe0fc96c652262d",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch",
      "kvStoreName": ""
    }
  },
  "sourceParams": {}
}

You Said This Is A Post With R Code

Very true! We’ll get to that in a minute.

Going with Couchbase introduced a different problem: there’s almost no R support for Couchbase. Sure, Couchbase has a gnarly, two-year old, raw httr::-prefixed bit of a tutorial post but that’s not really as cool as if there were a library(couchbase). I mean, you can check GitUgh or CRAN or a more general search yourself if you’d like but it’s going to come up bupkis.

If you were expecting a big reveal, right now, that I’ve got a feature-packed, full R Couchbase package ready to roll…you didn’t actually read the title of the post. What I do have is a set of functions that — given server/connection metadata, a bucket, a full text index, and a query — will return all matching documents (I still do not like that term for “record”) for said set of parameters:

# function code is in: https://paste.sr.ht/~hrbrmstr/051f5d5400644952a3ad2cf8664b84e2cbb9ac6b

cb_fts("domino's", "all", "ripe")
## # A tibble: 120 x 9
##    admin_c   country descr                      inetnum                  mnt_by      netname  status    tech_c  notify         
##    <chr>     <chr>   <chr>                      <chr>                    <chr>       <chr>    <chr>     <chr>   <chr>          
##  1 DUMY-RIPE FR      OPEN IP DOMINO'S PIZZA     79.141.8.44 - 79.141.8.… ALPHALINK-… OPEN-IP  ASSIGNED… DUMY-R… NA             
##  2 DUMY-RIPE NL      Domino's Pizza TILBURG     62.21.176.160 - 62.21.1… AS286-MNT   OTS2634… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  3 DUMY-RIPE NL      Domino's Pizza EINDHOVEN   62.132.252.168 - 62.132… AS286-MNT   OTS2270… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  4 DUMY-RIPE NL      Domino's Pizza SPYKENISSE  194.123.233.232 - 194.1… AS286-MNT   OTS69259 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  5 DUMY-RIPE NL      Domino's AMSTERDAM         37.74.38.188 - 37.74.38… AS286-MNT   OTS6103… ASSIGNED… DUMY-R… kpn-ip-office@…
##  6 DUMY-RIPE NL      Domino's Pizza VOORSCHOTEN 92.66.116.136 - 92.66.1… AS286-MNT   OTS1914… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  7 DUMY-RIPE NL      Domino's Pizza Doetinchem… 212.241.42.136 - 212.24… AS286-MNT   OTS2301… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  8 DUMY-RIPE NL      Domino's Pizza AMSTERDAM   194.120.45.224 - 194.12… AS286-MNT   OTS82906 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  9 DUMY-RIPE NL      Domino's Pizza [Woerden] … 62.41.228.80 - 62.41.22… AS286-MNT   OTS2024… ASSIGNED… DUMY-R… ip-reg@kpn.net 
## 10 DUMY-RIPE NL      Domino's Pizza GRONINGEN   188.203.128.0 - 188.203… AS286-MNT   OTS3767… ASSIGNED… DUMY-R… kpn-ip-office@…
## # … with 110 more rows

It’s not fancy.

It meets the needs of a narrow use case.

It’s not in a standalone package (which is triggering my R code OCD something fierce).

But, it's seriously fast, got me back to "work mode" with a minimum of hassle, and now there's some google-able Couchbase R code that isn't just bare httr calls, which may help someone else who's on a quest for how to work with Couchbase in R.

The first function, cb_fts(), uses the /api/index/{index-name}/query API endpoint to paginate through results of the full text search and retrieve all matching document id keys, then calls the second function, cb_get_records_from_keys(), which uses the /query/service API endpoint, issues a SELECT * FROM {bucket} USE KEYS {keys} query with all the found document (record) key ids, and returns the result set. Nothing fancier than that.
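Stripped down, those two calls look roughly like this (a sketch with assumed localhost ports and credentials, and no pagination or error handling; the real functions live in the paste linked above):

library(httr)
library(jsonlite)

fts_host  <- "http://localhost:8094" # full text search service
n1ql_host <- "http://localhost:8093" # query service

# full text search: collect the matching document id keys
res <- POST(
  url = sprintf("%s/api/index/%s/query", fts_host, "all"),
  authenticate("user", "pass"), content_type_json(),
  body = toJSON(list(query = list(query = "domino's"), size = 100L, from = 0L), auto_unbox = TRUE)
)
ids <- vapply(content(res)$hits, `[[`, character(1), "id")

# then fetch the full records for those keys
res <- POST(
  url = sprintf("%s/query/service", n1ql_host),
  authenticate("user", "pass"), encode = "form",
  body = list(statement = sprintf("SELECT * FROM ripe USE KEYS %s", toJSON(ids)))
)
str(content(res)$results, 1)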

FIN

While I do not have these functions in a standalone, Couchbase-focused package I do have them in the package associated with this particular project. If you do know of a Couchbase R package (please don’t link to JDBC/ODBC drivers as I’m not going to buy) please link to them in the comments.

If you have other strategies for how to deal with these “un-packages”, please blog about it and post a link as well! I’m curious how others balance the package/not-a-package/un-package tension, especially when you may need to depend on a series of functions across projects.

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
  td_add(td, sample(100, 1), 1)
}

quantile(td)
## [1]   1.00000  25.62222  53.09883  74.75522 100.00000

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1]   1.00000  25.91914  50.79468  74.76439 100.00000

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It's pretty darned fast:

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq        mean    median       uq       max neval
##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100
##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.
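A quick side-by-side with the td and x objects from above makes that point (the extremes match; the middle quantiles will be close but not identical):

round(quantile(td), 2)
round(quantile(x), 2)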

FIN

This was a quick (but complete) wrapper and could use some tyre kicking. I've a mind to add serialization to the C implementation so I can then enable [de]serialization on the R side since that would (IMO) make t-digest ops more useful in an R context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!

I saw a second post on turning htmlwidgets into interactive Twitter Player cards and felt somewhat compelled to make creating said entities a bit easier so posited the following:

I figured 40+ 💙 could not be wrong, so thus begat widgetcard:

To make this post as short as possible, the TLDR is that you just pass in an htmlwidget and some required parameters and you get back a deployable interactive Twitter Player card as an archive file and local directory. The example code is almost as short since we’re cheating and using the immensely helpful plotly package to turn a ggplot2 vis into something interactive.

First, make the vis:

library(ssh)
library(plotly)
library(ggplot2)
library(widgetcard)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() -> gg

Now, we create a local preview image for the plot we just made since we need one for the card:

preview <- gg_preview(gg)

NOTE that you can use any image you want. This function streamlines the process for plotly plots created from ggplot2 plots. There are links to image sizing guidelines in the package help files.

Now, we convert our ggplot2 object to a plotly object and create the Twitter Player card. Note that Twitter really doesn’t like standalone widgets being used as Twitter Player card links due to their heavyweight size. Therefore, card_widget() creates a non-standalone widget but bundles everything up into a single directory and deployable archive.

ggplotly(gg) %>% 
  card_widget(
    output_dir = "~/widgets/tc",
    name_prefix = "tc",
    preview_img = preview,
    html_title = "A way better title",
    card_twitter_handle = "@hrbrmstr",
    card_title = "Basic ggplot2 example",
    card_description = "This is a sample caRd demonstrating card_widget()",
    card_image_url_prefix = "https://rud.is/vis/tc/",
    card_player_url_prefix = "https://rud.is/vis/tc/",
    card_player_width = 480,
    card_player_height = 480
  ) -> arch_fil

Here’s what the resulting directory structure looks like:

tc
├── tc.html
├── tc.png
└── tc_files
    ├── crosstalk-1.0.0
    │   ├── css
    │   │   └── crosstalk.css
    │   └── js
    │       ├── crosstalk.js
    │       ├── crosstalk.js.map
    │       ├── crosstalk.min.js
    │       └── crosstalk.min.js.map
    ├── htmlwidgets-1.3
    │   └── htmlwidgets.js
    ├── jquery-1.11.3
    │   ├── jquery-AUTHORS.txt
    │   ├── jquery.js
    │   ├── jquery.min.js
    │   └── jquery.min.map
    ├── plotly-binding-4.8.0
    │   └── plotly.js
    ├── plotly-htmlwidgets-css-1.39.2
    │   └── plotly-htmlwidgets.css
    ├── plotly-main-1.39.2
    │   └── plotly-latest.min.js
    ├── pymjs-1.3.2
    │   ├── pym.v1.js
    │   └── pym.v1.min.js
    └── typedarray-0.1
        └── typedarray.min.js

(There’s also a tc.tgz at the same level as the tc directory.)

The widget is iframe’d using widgetframe and then saved out using htmlwidgets::saveWidget().
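If you wanted to do that pairing by hand it would look something like this (just a sketch of the idea, not widgetcard's actual internals; the output path is hypothetical):

library(plotly)
library(widgetframe)
library(htmlwidgets)

w  <- ggplotly(gg)
fw <- frameableWidget(w) # adds the pym.js plumbing so the widget is iframe-friendly
saveWidget(fw, file = "~/widgets/tc/tc.html", selfcontained = FALSE) # non-standalone: assets land in tc_files/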

Now, for deploying this to a web server, one could use a method like this to scp the deployable archive:

sess <- ssh_connect(Sys.getenv("SSH_HOST"))

invisible(scp_upload(
  sess, files = arch_fil, Sys.getenv("REMOTE_VIS_DIR"), verbose = FALSE
))

ssh_exec_wait(
  sess,
  command = c(
    sprintf("cd %s", Sys.getenv("REMOTE_VIS_DIR")),
    sprintf("tar -xzf %s", basename(arch_fil))
  )
)

Alternatively, you can use other workflows to transfer and expand the archive or copy output to your static blog host.

Make sure to test anything you build with Twitter’s validator before tweeting it out.

FIN

This works but is super nascent and could use some serious IRL tyre kicking and brutal feedback. Pick the least offensive social coding site you prefer and file issues & PRs at will.