Author Archives: hrbrmstr

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

and sprang into action, figuring I should blog the results since one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form. As you can see if you read through the Twitter thread, both tabulizer and other tools were introducing enough errors that the extracted data was neither complete nor trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

The PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:

(slideshow of sample PDF pages)

We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)

# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)

# read it in
doc <- pdf_text(mia_fil) 

Let's look at those three example pages:

cat(doc[[1]])
##                                   Defense POW/MIA Accounting Agency
##                                       Personnel Missing - Korea (PMKOR)
##                                        (Reported for ALL Unaccounted For)
##                                                                                                Total Unaccounted: 7,699
## Name                       Rank/Rate     Branch                           Year State/Territory
## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT
## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA
## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA

cat(doc[[2]])
## Name                          Rank/Rate Branch                     Year State/Territory
## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA
## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII
## ...
## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINA

cat(doc[[length(doc)]])
## Name                                                Rank/Rate           Branch                                              Year         State/Territory
## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA
## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO
## ...
## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA
## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS
##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
##                Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

  • there's a uniform header on each page
  • the "layout" mode returned uniform per-page, fixed-width columns
  • there are no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

  • convert the text to lines
  • find the header line
  • find the column start/end positions from the header on the page (since they are different for each page)
  • read it in with readr::read_fwf()
  • remove headers, preamble and epilogue cruft
  • turn it all into one data frame

# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{
  
  stri_split_lines(.x) %>% 
    flatten_chr() -> lines # want the lines from each page
  
  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>% 
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% 
    map(`[`, 1) %>% 
    flatten_int() -> starts
  
  # now get the ending locations; cheating and using `NA` for the last column  
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates 
  # a surprisingly usable fixed-width table, the core idiom is to find the start/end
  # of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read 
    read_fwf_q(col_positions = cols) %>%  # read it!
    .$result %>%                          # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%              # non-data lines
    filter(name != "Name") %>%            # remove headers from each page
    filter(!stri_detect_regex(name, "^(This|Please|Report)")) # non-data lines (the last pg footer, rly)
  
}) -> xdf

xdf
## # A tibble: 7,699 x 5
##    name                       rank   branch                  year  state        
##    <chr>                      <chr>  <chr>                   <chr> <chr>        
##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      
##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA 
##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  
##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   
##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      
##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS
##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  
##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     
##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     
## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE
## # ... with 7,689 more rows

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <-  "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"

mutate(xdf, year = factor(year)) %>% 
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% 
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width=0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name=NULL) +
  labs(x = NULL, title = title, subtitle = subtitle, caption = caption) +
  theme_ipsum_rc(grid="Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.

The forthcoming RStudio 1.2 release has a new “Jobs” feature for running and managing background R tasks.

I did a series of threaded screencaps on Twitter but that doesn’t do the feature justice.

So I threw together a quick ‘splainer on how to run R and Python code in the background (despite RStudio not natively supporting Python) while you get other stuff done, then work with the results.

A colleague asked if I would blog about how I crafted the grid of world tile grids in this post and I accepted the challenge. The technique isn’t too hard as it just builds on the initial work by Jon Schwabish and a handy file made by Maarten Lambrechts.

The Premise

For this particular use-case, I sifted through our internet scan data and classified a series of device families from their telnet banners then paired that with our country-level attribution data for each IPv4 address. I’m not generally “a fan” of rolling things up at a country level, but since many (most) of these devices are residential or small/medium-business routers, country-level attribution has some merit.

But, I’m also not a fan of country-level choropleths when it comes to “cyber”, nor am I fond of area-skewed cartograms since most folks still cannot interpret them. Both of those take up a ton of screen real estate, too, especially if you have more than one of them. Yet, I wanted to show a map-like structure without resorting to Hilbert IPv4 heatmaps since they are not very readable by a general audience and become skewed when you have to move beyond 1 pixel == 1 Class C network block.

I think the tile grid is a great compromise since it avoids the “area” and projection skewness confusion that regular global choropleths cause while still preserving geographic & positional proximity. Sure, they’ll take some getting used to by casual readers, but I felt it was the best of all the tradeoffs.

The Setup

Here’s the data:


library(here)
library(hrbrthemes)
library(tidyverse)

wtg <- read_csv("https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")

glimpse(wtg)
 
## Observations: 192
## Variables: 11
## $ name             <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ alpha.2          <chr> "AF", "AL", "DZ", "AO", "AQ", "AG", "AR", "AM"...
## $ alpha.3          <chr> "AFG", "ALB", "DZA", "AGO", "ATA", "ATG", "ARG...
## $ country.code     <chr> "004", "008", "012", "024", "010", "028", "032...
## $ iso_3166.2       <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:...
## $ region           <chr> "Asia", "Europe", "Africa", "Africa", "Antarct...
## $ sub.region       <chr> "Southern Asia", "Southern Europe", "Northern ...
## $ region.code      <chr> "142", "150", "002", "002", NA, "019", "019", ...
## $ sub.region.code  <chr> "034", "039", "015", "017", NA, "029", "005", ...
## $ x                <int> 22, 15, 13, 13, 15, 7, 6, 20, 24, 15, 21, 4, 2...
## $ y                <int> 8, 9, 11, 17, 23, 4, 14, 6, 19, 6, 7, 2, 9, 8,...

routers <- read_csv(here::here("data", "routers.csv"))

routers
 
## # A tibble: 453,027 x 3
##    type     country_name           country_code
##    <chr>    <chr>                  <chr>       
##  1 mikrotik Slovak Republic        SK          
##  2 mikrotik Czechia                CZ          
##  3 mikrotik Colombia               CO          
##  4 mikrotik Bosnia and Herzegovina BA          
##  5 mikrotik Czechia                CZ          
##  6 mikrotik Brazil                 BR          
##  7 mikrotik Vietnam                VN          
##  8 mikrotik Brazil                 BR          
##  9 mikrotik India                  IN          
## 10 mikrotik Brazil                 BR          
## # ... with 453,017 more rows

distinct(routers, type) %>% 
  arrange(type) %>% 
  print(n=11)
 
## # A tibble: 11 x 1
##    type    
##    <chr>   
##  1 asus    
##  2 dlink   
##  3 huawei  
##  4 linksys 
##  5 mikrotik
##  6 netgear 
##  7 qnap    
##  8 tplink  
##  9 ubiquiti
## 10 upvel   
## 11 zte

So, we have 11 different device families under assault by “VPNFilter” and I wanted to show the global distribution of them. Knowing the compact world tile grid would facet well, I set off to make it happen.

Let’s get some decent names for facet labels:


real_names <- read_csv(here::here("data", "real_names.csv"))

real_names
 
## # A tibble: 11 x 2
##    type     lab             
##    <chr>    <chr>           
##  1 asus     Asus Device     
##  2 dlink    D-Link Devices  
##  3 huawei   Huawei Devices  
##  4 linksys  Linksys Devices 
##  5 mikrotik Mikrotik Devices
##  6 netgear  Netgear Devices 
##  7 qnap     QNAP Devices    
##  8 tplink   TP-Link Devices 
##  9 ubiquiti Ubiquiti Devices
## 10 upvel    Upvel Devices   
## 11 zte      ZTE Devices

Next, we need to summarise our scan results and pair them up with the world tile grid data and our real names:


count(routers, country_code, type) %>%  # summarise the data into # of device families per country
  left_join(wtg, by = c("country_code" = "alpha.2")) %>% # join them up on the common field
  filter(!is.na(alpha.3)) %>% # we only want countries on the grid and maxmind attributes some things to meta-regions and anonymous proxies
  left_join(real_names) -> wtg_routers

glimpse(wtg_routers)

## Observations: 629
## Variables: 14
## $ country_code     <chr> "AE", "AE", "AE", "AF", "AF", "AF", "AG", "AL"...
## $ type             <chr> "asus", "huawei", "mikrotik", "huawei", "mikro...
## $ n                <int> 1, 12, 70, 12, 264, 27, 1, 941, 2081, 7, 2, 1,...
## $ name             <chr> "United Arab Emirates", "United Arab Emirates"...
## $ alpha.3          <chr> "ARE", "ARE", "ARE", "AFG", "AFG", "AFG", "ATG...
## $ country.code     <chr> "784", "784", "784", "004", "004", "004", "028...
## $ iso_3166.2       <chr> "ISO 3166-2:AE", "ISO 3166-2:AE", "ISO 3166-2:...
## $ region           <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia"...
## $ sub.region       <chr> "Western Asia", "Western Asia", "Western Asia"...
## $ region.code      <chr> "142", "142", "142", "142", "142", "142", "019...
## $ sub.region.code  <chr> "145", "145", "145", "034", "034", "034", "029...
## $ x                <int> 20, 20, 20, 22, 22, 22, 7, 15, 15, 15, 20, 20,...
## $ y                <int> 10, 10, 10, 8, 8, 8, 4, 9, 9, 9, 6, 6, 6, 6, 1...
## $ lab              <chr> "Asus Device", "Huawei Devices", "Mikrotik Dev...

Then, plot it:


ggplot(wtg_routers, aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

Doh! We forgot to ensure we had data for every country. Let’s try that again:


count(routers, country_code, type) %>%
  complete(country_code, type) %>%
  filter(!is.na(country_code)) %>%
  left_join(wtg, c("country_code" = "alpha.2")) %>%
  filter(!is.na(alpha.3)) %>%
  left_join(real_names) %>%
  complete(country_code, type, x=unique(wtg$x), y=unique(wtg$y)) %>%
  filter(!is.na(lab)) %>%
  ggplot(aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

That’s better.

We take advantage of ggplot2’s ability to facet and just ensure we have complete (even if NA) tiles for each panel.
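
If the role of complete() in that pipeline isn't obvious, here's a tiny, self-contained illustration (toy data, not the scan results) of how it fills in the combinations we never observed:

# toy data: two countries, two device types, but only three observed combos
toy <- tibble(
  country_code = c("US", "US", "DE"),
  type         = c("asus", "dlink", "asus"),
  n            = c(10, 5, 3)
)

# complete() adds the missing DE/dlink row with an NA count, so every facet
# can draw a full grid of tiles (NA tiles pick up the na.value fill)
complete(toy, country_code, type)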

Once consumers start seeing these used more, they’ll be able to pick up key markers (or one of us will come up with a notation that makes key markers more visible) and get specific information from the chart. I just wanted to show regional and global differences between vendors (and really give MikroTik users a swift kick in the patootie for being so bad with their kit).

FIN

You can find the RStudio project (code + data) here: http://rud.is/dl/tile-grid-grid.zip


NOTE: There is some iframed content in this post and you can bust out of it if you want to see the document in a full browser window.

Also, apologies for some lingering GitHub links. I’m waiting for all the repos to import into other services and haven’t had time to set up my own self-hosted public instance of any community-usable git-ish environment yet.


And So It Begins

After seeing Fira Sans in action in presentations at eRum 2018 I felt compelled to add hrbrthemes support for it, so I made a firasans extension that uses the Fira Sans Condensed and Fira Code fonts for ggplot2 graphics.

But I really wanted to go the extra mile and make an R Markdown theme for it, yet I’m weary of both jQuery & Bootstrap, plus prefer Prism over HighlightJS. So I started work on “Prism Skeleton”, which is an R Markdown template that has most of the features you would expect and some new ones, plus uses Prism and Fira Sans/Code. You can try it out on your own if you use markdowntemplates but the “production” version is likely going to eventually go into the firasans package. (I use markdowntemplates as a playground for R Markdown experiments.)

The source for the iframe at the end of this document is here: https://rud.is/dl/hello-dorling.Rmd. There are some notable features (I’ll repeat a few from above):

  • Fira Sans for headers and text
  • Fira Code for all monospaced content (including source code)
  • No jQuery
  • No Bootstrap (it uses the ‘Skeleton’ CSS framework)
  • No HighlightJS (it uses the ‘Prism’ highlighter)
  • Extended YAML parameters (more on that in a bit)
  • Defaults to fig.retina=2 and the use of optipng or pngquant for PNG compression (so it expects them to be installed — ref this post by Zev Ross for more info and additional image use tips)
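
For reference, here's a minimal sketch of how those defaults could be wired up in a template's setup chunk (an illustration, not necessarily the template's exact code; it assumes optipng/pngquant are installed and on your PATH):

# retina-quality figures by default
knitr::opts_chunk$set(fig.retina = 2)

# register knitr's built-in PNG compression hooks; enable them per-chunk with
# the chunk options optipng='' (or e.g. optipng='-o7') and pngquant='--speed=1'
knitr::knit_hooks$set(
  optipng  = knitr::hook_optipng,
  pngquant = knitr::hook_pngquant
)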

“What’s this about ‘Dorling’?”

Oh, yes. You can read the iframe or busted out document for that bit. It’s a small package to make it easier to create Dorling cartograms based on previous work by @datagistips.

“You said something about ‘extended YAML’?”

Aye. Here’s the YAML excerpt from the Dorling Rmd:

---
title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
author: "boB Rudis"
navlink: "[rud.is](https://rud.is/b/)"
og:
  type: "article"
  title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
  url: "https://github.com/hrbrmstr/spdorling"
footer:
  - content: '[GitLab](https://gitlab.com/hrbrmstr)'
  - content: 'This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.'
date: "`r Sys.Date()`"
output: markdowntemplates::prismskel
---

The title, author & date should be familiar fields, but the author and date get some different placement since the goal is more of a flowing document than an academic report.

If navlink is present (it’s not required) there will be a static bar at the top of the HTML document with a link on the right (any content, really, but a link is what’s in the example). Remove navlink and no bar will be there.

The og section is for open graph tags and you customize them how you like. Open graph tags make it easier to share posts on social media or even Slack since they’ll auto-expand various content bits.

There’s also a custom footer (exclude it if you don’t want one) that can take multiple content sub-elements.

The goal isn’t so much to give you a 100% usable R Markdown template but something you can clone and customize for your own use. Since this example shows how to use custom fonts and a different code highlighter (which meant using some custom knitr hooks), it should be easier to customize than some of the other ones in the template playground package. FWIW I plan on adapting this for a work template this week.

The other big customization is the use of Prism with a dark theme. Again, you can clone + customize this at-will but I may add config options for all Prism themes at some point (mostly if there is interest).

FIN

(Well, almost fin)

Kick the tyres on both the new template and the new package and drop suggestions here for the time being (until I get fully transitioned to a new git-hosting platform). One TODO for spdorling is to increase the point count for the circle polygons but I’m sure folks can come up with enhancement requests to the API after y’all have played with it for a while.

As noted a few times, the Rmd example with the Dorling cartograms is below.

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a character value & unique, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:


library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:


get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:


get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (which is super expensive since it also means writing data out to a file each time), or by turning to Python, which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python — don’t judge — installs and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.
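
In practice that priming looks something like the following (the interpreter path is just an example for my setup; point it at whichever Python you installed biplist into):

Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3") # example path; use your own
invisible(reticulate::py_config()) # initialize the Python bindings up front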


biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/" 

That's much better.

Let's work with metadata for the whole directory:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents   
##    <chr>           <chr>                                <dbl> <list>     
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16 <raw [16]> 
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 <raw [110]>
##  3 1109.1968.pdf   com.apple.quarantine                    74 <raw [74]> 
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16 <raw [16]> 
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 <raw [116]>
##  6 1109.1968v1.pdf com.apple.quarantine                    74 <raw [74]> 
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86 <raw [86]> 
##  8 1804.09970.pdf  com.apple.quarantine                    82 <raw [82]> 
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16 <raw [16]> 
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 <raw [104]>
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##   <chr>                                <int>
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
  select(file, where_from) %>% 
  unnest() %>% 
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from                                                       
##    <chr>               <chr>                                                            
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
##  2 1109.1968.pdf       https://www.google.com/                                          
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
##  4 1109.1968v1.pdf     https://www.google.com/                                          
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
##  9 athena-ug.pdf       https://www.google.com/                                          
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
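
As a rough sketch of that idea (I'm assuming the setter/remover are named set_xattr() and rm_xattr() with path/name/value arguments, and the attribute key is made up; check the package docs for the exact signatures), tagging a data file might look like:

# assumed function names & argument order -- verify against the xattrs README;
# "is.rud.notes" is just an illustrative attribute key
set_xattr("analysis.csv", "is.rud.notes", "cleaned 2018-06-19; source: DPAA PDF")
get_xattr("analysis.csv", "is.rud.notes")
rm_xattr("analysis.csv", "is.rud.notes")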

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:


library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group      size month day   year_hr path   date       pkg   fil   file_src             
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""                   
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let's find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:


filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) {  pkg_| __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' 

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:


infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing 

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text      
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""        
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr     
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::        
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`     
##  7     5     1     5    49    51      0 expr           FALSE    ""        
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`    
##  9     5     1     5     6    18     51 expr           FALSE    ""        
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-        
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:


pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition

# find all of ^^ sequences (there's a good twitter discussion on this abt a month ago)
idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq

# look for the rest of the sequences starting at each idx position
map_lgl(idx, ~{
  all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat)
}) -> found

f_defs <- idx[found] # starting indices of all the places where functions are defined

# filter ^^ to only find infix ones
infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")]

# there aren't too many, but remember we're just searching `util` functions
length(infix_defs)
## [1] 106

Now, write it out to a file so we can peruse the infix functions:


# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")
walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name                 n
 1 `%||%`              47
 2 `%+%`                7
 3 `%AND%`              4
 4 `%notin%`            4
 5 `%:::%`              3
 6 `%==%`               3
 7 `%!=%`               2
 8 `%*diag%`            2
 9 `%diag*%`            2
10 `%nin%`              2
11 `%OR%`               2
12 `%::%`               1
13 `%??%`               1
14 `%.%`                1
15 `%@%`                1
16 `%&&%`               1
17 `%&%`                1
18 `%+&%`               1
19 `%++%`               1
20 `%+|%`               1
21 `%<<%`               1
22 `%>>%`               1
23 `%~~%`               1
24 `%assert_class%`     1
25 `%contains%`         1
26 `%din%`              1
27 `%fin%`              1
28 `%identical%`        1
29 `%In%`               1
30 `%inr%`              1
31 `%M%`                1
32 `%notchin%`          1
33 `%or%`               1
34 `%p%`                1
35 `%pin%`              1
36 `%R%`                1
37 `%s%`                1
38 `%sub_in%`           1
39 `%sub_nin%`          1
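
For the curious, here's a rough sketch (not necessarily the exact code used) of one way to produce that summary from the parse data above:

# tally how often each infix name is defined across the `util` sources
tibble(name = getParseText(infix_parsed, infix_parsed$id[infix_defs])) %>%
  count(name, sort = TRUE)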

FIN

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.

I’m going to (eventually) write a full post on the package I’m mentioning in this one: osqueryr. The TLDR on osqueryr is that it is an R DBI wrapper (that has just enough glue to also be plugged into dbplyr) for osquery. The TLDR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”

In short, osquery turns the metadata and state information of your local system (or remote system(s)) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident response tasks, but you can use it to get just about anything and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.

If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:


library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall 3          
## 2 firefox  517846

then, a remote "connection":


con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox 1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency" I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.

You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames" -- SQL -- as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language and avoiding the chaos of the myriad of outputs from various system commands by making all results conform to a rectangular data structure.

Let's take this one-step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).

The osquery authors also took consistency and standards to yet-another-level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made, so they can dispatch the intent to the proper functions or shell commands and then send all the results back, or use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc. to produce rich, targeted rectangular output.

By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite the authors could direct all their focus on solving the problem they posited.

"Um, you're talking alot about everything but R now."

We're getting to the good (i.e. "R") part now.

Because the authors didn't try to become SQL parser writer experts and relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).

Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines a set of classes and methods that define what operations are possible and how they are performed."

If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.

But it gets even better.

As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet-another-daemon-and-custom-protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.

Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package -- ssh::ssh_exec_internal() -- when remote connections were specified. (Said effort could have been zero had I chosen a slightly different function in sys, too.)
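
To make that concrete, here's a heavily simplified sketch of the idea. This is not the actual osqueryr internals; the helper function and the host-handling convention are made up purely for illustration:

# hypothetical helper: run a query locally or over ssh and return a data frame
osq_query <- function(sql, host = NULL) {
  if (is.null(host)) {
    res <- sys::exec_internal("osqueryi", c("--json", sql))   # local call
  } else {
    sess <- ssh::ssh_connect(host)                            # remote call
    on.exit(ssh::ssh_disconnect(sess), add = TRUE)
    res <- ssh::ssh_exec_internal(sess, paste("osqueryi --json", shQuote(sql)))
  }
  jsonlite::fromJSON(rawToChar(res$stdout))                   # same shape either way
}

Usage would then be identical locally (osq_query("SELECT name FROM processes LIMIT 1")) and remotely (osq_query("SELECT name FROM processes LIMIT 1", host = "hrbrmstr@osq1")), which is really the whole consistency point.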

Relying on well-accepted standards made both osqueryi and the R DBI-driver work seamlessly without much code at all and definitely without a rat's nest of nested if/else statements and custom httr functions.

But it gets even more better-er

Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)

A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse but one core element of the tidyverse is dplyr, which can really level-up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.

And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL back-end you can now do something like this:


library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by="pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == TCP
  distinct(name, port, address, pid)
## # Source:   lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port 
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0    
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0    
##  6 0.0.0.0 WiFiAgent         493   0    
##  7 0.0.0.0 WiFiProxy         725   0    
##  8 0.0.0.0 com.docker.vpnkit 732   0    
##  9 0.0.0.0 identityservicesd 354   0    
## 10 0.0.0.0 loginwindow       111   0    
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0)

FIN

What did standards and consistency get us?

  • The osquery authors spent time solving a hard problem vs creating new data formats and protocols
  • Rectangular data (i.e. "data frame") provides consistency and structure, which ends up creating more freedom
  • "Standard" SQL enables a consistent means to work with rectangular data
  • ssh normalizes (secure) access across systems with a consistent protocol
  • A robust, well-defined standard mechanism for working with SQL databases enabled nigh instantaneous wiring up of a whole new back-end to R
  • The common idioms of ssh and sys made working with the new back-end on remote systems as easy as it is on a local system
  • Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new back-end with (pretty much) one line of code because of the defined standard and expectation of consistency (and works for local and remote)

Standards and consistency are pretty darned cool.