
>_Stop making people suicidal. Stop telling people they’re going to be killed. Stop terrifying children. Stop giving racism free advertising. Stop trying to convince Americans that all the other Americans hate them. Stop. Stop. Stop._

Just. [Stop](http://slatestarcodex.com/2016/11/16/you-are-still-crying-wolf/).

2016-08-13 UPDATE: Fortune has a story on this and the driver does seem to be tax-related rather than ideological. @thosjleeper suggested something similar about a week ago.

If you’re even remotely following the super-insane circus that is the 2016 U.S. POTUS election, you’ve no doubt seen a resurgence of _”if X gets elected, I’m moving to Y”_ claims from folks who are “anti” one candidate or another. The [Washington Examiner](http://www.washingtonexaminer.com/americans-renouncing-citizenship-near-record-highs/article/2598074) did a story on last quarter’s U.S. expatriation numbers. I didn’t realize we had a department in charge of tracking and posting that data, but we do, thanks to inane bureaucratic compliance laws.

I should have put _”posting that data”_ in quotes as it’s collected quarterly and posted ~2 months later in non-uniform HTML and PDF form across individual posts in a unique/custom Federal Register publishing system. How’s that hope and change in “open government data” working out for y’all?

The data is organized enough that we can take a look at the history of expatriation with some help from R. Along the way we’ll:

– see how to make parameterized web requests a bit cleaner with `httr`
– get even _more_ practice using the `purrr` package
– perhaps learn a new “trick” when using the `stringi` package
– show how we can “make do” living in a non-XPath 2 world (it’s actually pretty much at XPath 3 now, too #sigh)

A manual hunt on that system will eventually reveal a search URL that you can feed to `read.csv()` (to grab a list of URLs that point to the data, not the data itself #ugh). Those URLs are _gnarly_ (you’ll see what I mean if you do the hunt), but the standardized query parameters buried in those egregiously long URLs can be expressed far more readably with `httr::GET()` directly, especially since `httr::content()` will auto-convert the resultant CSV to a `tibble` for us (the site sets the response MIME type appropriately).

Unfortunately, when using the `6039G` search parameter (the expatriate tracking form ID) we do need to filter out non-quarterly report documents since the bureaucrats must have their ancillary TPS reports.

library(dplyr)
library(httr)
library(rvest)
library(purrr)
library(lubridate)
library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(ggalt)
library(grid)
library(scales)
library(magrittr)
library(stringi)

GET("https://www.federalregister.gov/articles/search.csv",
    query=list(`conditions[agency_ids][]`=254,
               `conditions[publication_date][gte]`="01/01/2006",
               `conditions[publication_date][lte]`="7/29/2016",
               `conditions[term]`="6039G",
               `conditions[type][]`="NOTICE")) %>%
  content("parsed") %>%
  filter(grepl("^Quarterly", title)) -> register

glimpse(register)
## Observations: 44
## Variables: 9
## $ citation         <chr> "81 FR 50058", "81 FR 27198", "81 FR 65...
## $ document_number  <chr> "2016-18029", "2016-10578", "2016-02312...
## $ title            <chr> "Quarterly Publication of Individuals, ...
## $ publication_date <chr> "07/29/2016", "05/05/2016", "02/08/2016...
## $ type             <chr> "Notice", "Notice", "Notice", "Notice",...
## $ agency_names     <chr> "Treasury Department; Internal Revenue ...
## $ html_url         <chr> "https://www.federalregister.gov/articl...
## $ page_length      <int> 9, 17, 16, 20, 8, 20, 16, 12, 9, 15, 8,...
## $ qtr              <date> 2016-06-30, 2016-03-31, 2015-12-31, 20...

Now, we grab the content at each of the `html_url`s and save them off to be kind to bandwidth and/or folks with slow connections (so you don’t have to re-grab the HTML):

docs <- map(register$html_url, read_html)
saveRDS(docs, file="deserters.rds")

That generates a list of parsed HTML documents.
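
On subsequent runs you can skip the scraping entirely with a quick cache check (a convenience sketch reusing the deserters.rds file name from above; xml2 documents are backed by external pointers, so if a saved copy doesn’t survive a new R session, just re-run the `map()` above):

# use the cached parsed docs if we already have them, otherwise scrape & save
if (file.exists("deserters.rds")) {
  docs <- readRDS("deserters.rds")
} else {
  docs <- map(register$html_url, read_html)
  saveRDS(docs, file="deserters.rds")
}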

The reporting dates aren’t 100% consistent (i.e. not always “n” weeks after the collection date), but the data collection dates _embedded textually in each report_ are (mostly…some vary in the use of upper/lower case). So, we exploit the fact that these are boring legal documents that use the same boilerplate language for various phrases and extract the “quarter ending” dates so we know which year/quarter each report covers:

register %<>%
  mutate(qtr=map_chr(docs, ~stri_match_all_regex(html_text(.), "quarter ending ([[:alnum:], ]+)\\.",
                                                     opts_regex=stri_opts_regex(case_insensitive=TRUE))[[1]][,2]),
         qtr=mdy(qtr))

I don’t often use that particular `magrittr` pipe, but it “feels right” in this case and is handy in a pinch.
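
For anyone who hasn’t adopted `%<>%`, the chunk above is equivalent to the more familiar (if slightly more repetitive) assign-and-pipe form:

register <- register %>%
  mutate(qtr=map_chr(docs, ~stri_match_all_regex(html_text(.), "quarter ending ([[:alnum:], ]+)\\.",
                                                 opts_regex=stri_opts_regex(case_insensitive=TRUE))[[1]][,2]),
         qtr=mdy(qtr))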

If you visit some of the URLs directly, you’ll see that there are tables and/or lists of the expats’ names. However, the naming & formatting conventions for those lists are woefully inconsistent *and* (as I noted earlier) there’s no XPath 2 support in R, so we have to write a slightly more verbose XPath query to target the table we need, accounting for the vastly different column-name structures across the reports.

NOTE: Older HTML pages may not have HTML tables at all and some only reference PDFs, so don’t rely on this code working beyond these particular dates (at least consistently).

We’ll also tidy up the data into a neat `tibble` for plotting.

map(docs, ~html_nodes(., xpath=".//table[contains(., 'First name') or
                                         contains(., 'FIRST NAME') or
                                         contains(., 'FNAME')]")) %>%
  map(~html_table(.)[[1]]) -> tabs

data_frame(date=register$qtr, count=map_int(tabs, nrow)) %>%
  filter(format(as.Date(date), "%Y") >= 2006) -> left
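
If you need to run this against older notices that may lack an HTML table entirely, a slightly more defensive variant (a sketch; it assumes an empty node set means the data only exists as a PDF) drops those documents first. Just remember you’d then have to keep track of which rows of `register` were dropped before lining the counts back up with the quarters:

# same XPath, but discard documents with no matching table before html_table()
map(docs, ~html_nodes(., xpath=".//table[contains(., 'First name') or
                                         contains(., 'FIRST NAME') or
                                         contains(., 'FNAME')]")) %>%
  keep(~length(.) > 0) %>%
  map(~html_table(.)[[1]]) -> tabs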

With the data wrangling work out of the way, we can tally up the throngs of folks desperate for greener pastures. First, by quarter:

gg <- ggplot(left, aes(date, count))
gg <- gg + geom_lollipop()
gg <- gg + geom_label(data=data.frame(),
                      aes(x=min(left$date), y=1500, label="# individuals"),
                      family="Arial Narrow", fontface="italic", size=3, label.size=0, hjust=0)
gg <- gg + scale_x_date(expand=c(0,14), limits=range(left$date))
gg <- gg + scale_y_continuous(expand=c(0,0), label=comma, limits=c(0,1520))
gg <- gg + labs(x=NULL, y=NULL,
                title="A Decade of Desertion",
                subtitle="Quarterly counts of U.S. individuals who have chosen to expatriate (2006-2016)",
                caption="Source: https://www.federalregister.gov/")
gg <- gg + theme_hrbrmstr_an(grid="Y")
gg

(plot: quarterly expatriation counts, 2006-2016)

and, then annually:

left %>%
  mutate(year=format(date, "%Y")) %>%
  count(year, wt=count) %>%
  ggplot(aes(year, n)) -> gg

gg <- gg + geom_bar(stat="identity", width=0.6)
gg <- gg + geom_label(data=data.frame(), aes(x=0, y=5000, label="# individuals"),
                      family="Arial Narrow", fontface="italic", size=3, label.size=0, hjust=0)
gg <- gg + scale_y_continuous(expand=c(0,0), label=comma, limits=c(0,5100))
gg <- gg + labs(x=NULL, y=NULL,
                title="A Decade of Desertion",
                subtitle="Annual counts of U.S. individuals who have chosen to expatriate (2006-2016)",
                caption="Source: https://www.federalregister.gov/")
gg <- gg + theme_hrbrmstr_an(grid="Y")
gg

(plot: annual expatriation counts, 2006-2016)

The exodus isn’t _massive_ but it’s more than I expected. It’d be interesting to line up U.S. tax code changes, the enactment of other compliance regulations and general news events against this series to see if there are underlying reasons for the overall annual increases as well as the dips in some quarters (which could just be data-collection hiccups by the Feds…after all, this is government work). If you want to do all the math for correcting survey errors, it’d also be interesting to normalize the counts by population, track the data back to 1996 (when HIPAA mandated the creation & publication of this quarterly list) and then see if you can predict where we’ll be at the end of this year (though I suspect political events are a motivator for at least a decent fraction of some quarters).
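
As a starting point for that normalization, something like the following would do. (Purely illustrative: `us_pop` is a hypothetical tibble of annual U.S. population estimates with `year` and `population` columns that you’d have to source yourself, e.g. from Census Bureau data.)

# per-million-residents expatriation rate by year
# (`us_pop` is a made-up tibble: year <chr>, population <dbl>)
left %>%
  mutate(year=format(date, "%Y")) %>%
  count(year, wt=count) %>%
  left_join(us_pop, by="year") %>%
  mutate(per_million=(n/population)*1e6)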

I had tried to convert my data-saving workflows to [`feather`](https://github.com/wesm/feather/tree/master/R) but there have been [issues](https://github.com/wesm/feather/issues/155) with it supporting large files (that seem to be near resolution), so I’ve been continuing to use R Data files for local saving of processed/cleaned data.

I make _many_ of these files and sometimes I do it as a one-off effort, thinking that I’ll come back to it quickly. Inevitably, I don’t, and I also end up naming those one-offs badly. I made a small [R helper package](https://github.com/hrbrmstr/rdatainfo) to make it easier to check out these files at the command-line (via a `bash` function), but it hit me that it’d be even easier if there was a way to use the macOS Quick Look feature (hitting the space bar on a file icon) to see the previews.

Thus, [`QuickLookR`](https://github.com/hrbrmstr/QuickLookR) was born.

You need to [download the ZIP file](https://github.com/hrbrmstr/QuickLookR/releases/tag/v0.1.0), unzip it and save the `QuickLookR.qlgenerator` component into `~/Library/QuickLook`. Then `devtools::install_github("hrbrmstr/rdatainfo")` in an R session. If you’ve got R/Rscript in the standard `/usr/local/bin` location, then you should be able to hit the space bar on any `.rdata`, `.rda` or `.rds` file and see a `str()` preview like this:

(screenshot: Quick Look `str()` preview of an R data file)
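
Conceptually, the preview is just surfacing what you’d see at the console anyway; roughly (a sketch of the idea, not the plugin internals):

# what the Quick Look preview boils down to, conceptually
str(readRDS("deserters.rds"))       # .rds files
str(mget(load("some_objects.rda"))) # .rda/.rdata files ("some_objects.rda" is a placeholder name)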

I haven’t cracked open Xcode in a while and my Objective-C is super-rusty, but this works on my El Capitan MacBook Pro (though I’m trying to see why some `.rds` files embedded in packages on my system have no previews).

If you have suggestions or issues, please use [github](https://github.com/hrbrmstr/QuickLookR/issues) to file them. For issues, it’d be really helpful if you included a copy of or link to files that don’t work well.

For the next revision, I plan on generating prettier HTML-based previews and linking against `R.framework` to avoid a call out to the system.

If Wes/Hadley really have fixed `feather`, I’ll be making a QuickLook plugin for that file format as well in the very near future.

The NIH is [moving forward](http://www.npr.org/sections/health-shots/2016/08/04/488387729/nih-plans-to-lift-ban-on-research-funds-for-part-human-part-animal-embryos) with plans to financially support & encourage human-animal chimera research.

You can find more info over at the [NIH blog](http://osp.od.nih.gov/under-the-poliscope/2016/08/next-steps-research-using-animal-embryos-containing-human-cells).

Chimeras have been a longstanding subject of science fiction/fantasy and many authors have used the theme to help inform the ethics debate. A fairly recent exploration of this has been through the [Fullmetal Alchemist](http://www.fullmetalalchemist.com/) anime/manga series. TL;DR: it doesn’t go so well, even with animal-animal chimera hybrids.

Yes, that’s fiction, and the current NIH proposals are nowhere near as audacious as what’s described in the FMA series. But, surprisingly, you won’t find a large number of vocal critics of human-animal chimera research, since the modern “open scientific community” is actually pretty harshly judgmental of anyone who tries to limit or challenge “science” in any way (since it’s their religion, as everyone believes in something whether they claim to or not). Open, logical and (more importantly) effective criticisms of purported “progress” are often career-limiting moves. All this at a time in history when the current generation of scientists seems to excel at ignoring the potential for unintended consequences of their work.

Folks can (and should) [add their comments](http://grants.nih.gov/grants/rfi/rfi.cfm?ID=57) either for or against this proposal. Not commenting means you agree with the NIH plans and are fine with your U.S. tax dollars & government resources going toward this research; it also means you are on the hook when this eventually goes horribly, horribly wrong.

For those commenting to show their _lack_ of support, I augmented a statement from an [interview with Dr. Stuart Newman](http://www.beliefnet.com/news/science-religion/2005/05/the-peril-and-promise-of-mix-and-match-biotech.aspx), who is a vocal detractor of human-animal chimeras (so much so that he tried to prevent such research through the USPTO process, which ultimately failed), for my submission:

I agree with Dr. Stuart A. Newman that, like every human activity, biotechnology is open to wise and foolish uses. The profit motive, coupled with an uncritical acceptance of the notion that new technology is the main way to human advancement, often leads to hype and incautious applications. In fact, existing technologies – sanitation, keeping water and air unpolluted, enabling poor people to eat enough and well-off people not too much, providing birth control and maternal and infant health services – would save more lives over the coming century than all foreseeable biotechnological applications.

I see no viable way for the NIH to prevent wanton abuse once they open the doors to this type of research. As a taxpayer, a well-read individual and someone who does have a sense of morality, I would rather precious, scant financial and bureaucratic resources go into known, proven endeavors that can have substantial, real, immediate impact.

As the translation from the FMA series states: _”Humans have a limitless desire to use their knowledge in real life…the desire to see what you can do with the power that is given to them…the desire to understand all the secrets in this world and experiment with them.”_ My reasoning and my faith suggest that there are definitely doors that should remain closed.

From: [Why Corporate America Is Leaving the Suburbs for the City](http://nyti.ms/2adP5Rn):

>_We wanted energy, vibrancy and diversity, and to accelerate a change in our culture by moving downtown._

translation:

>_We want to begin a process of strategically removing more highly paid, legacy employees who can’t commit 12 hours a day to our company and replacing them with younger folks we can take advantage of._

The move to forgo the addition of parking spaces in these new city HQs and encourage the use of mass transit is also interesting, given the [current state](http://www.salon.com/2015/03/01/american_mass_transit_is_dying/) [of mass transit systems](http://www.sfchronicle.com/bayarea/article/BART-shutdown-underscores-aging-system-s-6916061.php) [in America](http://thehill.com/policy/transportation/289278-dc-metro-proposes-permanent-earlier-closing-times-on-subway). Will these corporations be kicking in greenbacks for infrastructure/capacity improvement? Methinks not.

Remember, kids, these are soulless, giant, multinational corporations that place “shareholder value” over **everything else**. Also, remember that you’ll be a “legacy” worker someday, too.

Hopefully some startup will jump in to buy up all the forthcoming empty suburban campus spaces and turn them into indoor farms.

This is another purrr-focused post but it’s also an homage to the nascent magick package (R interface to ImageMagick) by @opencpu.

We’re starting to see/feel the impact of the increasing drought up here in southern Maine. I’ve used the data from the U.S. Drought Monitor before on the blog, but they also provide shapefiles and this seemed like a good opportunity to further demonstrate the utility of purrr and make animations directly using magick. Plus, I wanted to see the progression of the drought. Putting library() statements for purrr, magick and broom together was completely random, but I now feel compelled to find a set of functions to put into a cauldron package. But, I digress.

What does this demonstrate?

Apart from giving you an idea of the extent of the drought, working through this will help you:

  • use the quietly() function (which automagically captures a function’s warnings instead of printing them; see the tiny standalone example right after this list)
  • see another example of a formula function
  • illustrate the utility of map_df(), and
  • see how to create an animation pipeline for magick
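
As a quick aside (not part of the walkthrough proper), here’s `quietly()` on its own so you can see what it hands back:

# quietly() wraps a function so warnings/messages/printed output are
# captured in the result instead of being displayed
hushed_log <- quietly(log)
res <- hushed_log(-1)
res$result    # NaN
res$warnings  # "NaNs produced"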

Comments are in the code and the drought gif is at the end. I deliberately only had it loop once, so refresh the image if you want to see the progression again. Also, drop a note in the comments if anything needs more exposition. (NOTE: I was fairly bad and did virtually no file cleanup in the function, so you’ll have half a year’s shapefiles in your getwd(). Consider the cleanup an exercise for the reader :-)

library(rgdal)
library(sp)
library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(ggthemes)
library(rgeos)

# the witch's brew
library(purrr)
library(broom)
library(magick)

#' Get a drought map shapefile and turn it into a PNG
drought_map <- function(wk) {

  # need to hush some chatty functions
  hush_tidy <- quietly(tidy)

  # some are more stubborn than others
  old_warn <- getOption("warn")
  options(warn=-1)

  week <- format(wk, "%Y%m%d")

  # get the drought shapefile only if we don't have it already
  URL <- sprintf("http://droughtmonitor.unl.edu/data/shapefiles_m/USDM_%s_M.zip", week)
  (fil <- basename(URL))
  if (!file.exists(fil)) download.file(URL, fil)
  unzip(fil)

  # read in the shapefile and reduce the polygon complexity
  dr <- readOGR(sprintf("USDM_%s.shp", week),
                sprintf("USDM_%s", week),
                verbose=FALSE,
                stringsAsFactors=FALSE)

  dr <- SpatialPolygonsDataFrame(gSimplify(dr, 0.01, TRUE), dr@data)

  # separate out each drought level into its own fortified data.frame
  map(dr$DM, ~subset(dr, DM==.)) %>%
    map(hush_tidy) %>%
    map_df("result", .id="DM") -> m

  # get a conus base map (prbly cld have done map_data("usa"), too)
  usa_composite() %>%
    subset(!(iso_3166_2 %in% c("AK", "HI"))) %>%
    hush_tidy() -> usa

  usa <- usa$result # an artifact of using quietly()

  # this is all Ushey's fault. the utility of cmd-enter to run
  # the entire ggplot2 chain (in RStudio) turns out to have a
  # greater productivity boost (I measured it) than my shortcuts for
  # gg <- gg + snippets and hand-editing the "+" bits out when
  # editing old plot constructs. I'm not giving up on gg <- gg + tho

  # Note putting the "base" layer on top since we don't really
  # want to deal with alpha levels of the drought polygons and
  # we're only plotting the outline of the us/states, not filling
  # the interior(s).

  ggplot() +
    geom_map(data=m, map=m,
             aes(long, lat, fill=DM, map_id=id),
             color="#2b2b2b", size=0.05) +
    geom_map(data=usa, map=usa, aes(long, lat, map_id=id),
             color="#2b2b2b88", fill=NA, size=0.1) +
    scale_fill_brewer("Drought Level", palette="YlOrBr") +
    coord_map("polyconic", xlim=c(-130, -65), ylim=c(25, 50)) +
    labs(x=sprintf("Week: %s", wk)) +
    theme_map() +
    theme(axis.title=element_text()) +
    theme(axis.title.x=element_text()) +
    theme(axis.title.y=element_blank()) +
    theme(legend.position="bottom") +
    theme(legend.direction="horizontal") -> gg

  options(warn=old_warn) # put things back the way they were

  outfil <- sprintf("gg-dm-%s.png", wk)
  ggsave(outfil, gg, width=8, height=5)

  outfil

}

# - create a vector of weeks (minus the current one)
# - create the individual map PNGs
# - read the individual map PNGs into a list
# - join the images together
# - create the animated gif structure
# - write the gif to a file

seq(as.Date("2016-01-05"), Sys.Date(), by="1 week") %>%
  head(-1) %>%
  map(drought_map) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps=2, loop=1) %>%
  image_write("drought.gif")

NOTE: an updated, comment-free version of the above code block is in this gist and uses spdplyr::filter() vs subset(), keeps downloaded files tidy in a temporary directory and includes a progress bar vs raw, ugly download.file() messages.

I’ve converted the vast majority of my *apply usage over to purrr functions. In an attempt to make this a quick post, I’ll refrain from going into all the benefits of the purrr package. Instead, I’ll show just one thing that’s super helpful: formula functions.

After seeing this Quartz article using a visualization to compare the frequency and volume of mass shootings, I wanted to grab the data to look at it with a stats-eye (humans are ++gd at visually identifying patterns, but we’re also ++gd at misinterpreting them, plus stats validates visual assumptions). I’m not going into that here, but I will use the data-grabbing step to illustrate the formula functions. Note that there’s quite a bit of “setup” here for just one example, so I guess I kinda am attempting to shill the purrr package and the “piping tidyverse” just a tad.

If you head on over to the site with the data you’ll see you can download files for all four years. In theory, these are all individual years, but the names of the files gave me pause:

  • MST Data 2013 - 2015.csv
  • MST Data 2014 - 2015.csv
  • MST Data 2015 - 2015.csv
  • Mass Shooting Data 2016 - 2016.csv

So, they may all be individual years, but the naming consistency isn’t there and it’s better to double check than to assume.

First, we can check to see if the column names are the same (we can eyeball this since there are only four files and a small # of columns):

library(purrr)
library(readr)

dir() %>% 
  map(read_csv) %>% 
  map(colnames)

## [[1]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[2]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[3]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[4]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"

A quick inspection of the date column shows it’s in month/day/year format and we want to know if each file only spans one year. This is where the elegance of the formula function comes in:

library(lubridate)

dir() %>% 
  map(read_csv) %>% 
  map(~range(mdy(.$date))) # <--- the *entire* post was to show this one line ;-)

## [[1]]
## [1] "2016-01-06" "2016-07-25"
## 
## [[2]]
## [1] "2013-01-01" "2013-12-31"
## 
## [[3]]
## [1] "2014-01-01" "2014-12-29"
## 
## [[4]]
## [1] "2015-01-01" "2015-12-31"

To break that down a bit:

  • dir() returns a vector of filenames in the current directory
  • the first map() reads each of those files in and creates a list with four elements, each being a tibble (data_frame / data.frame)
  • the second map() iterates over those data frames and calls a newly created anonymous function which converts the date column to a proper Date data type then gets the range of those dates, ultimately resulting in a four element list, with each element being a two element vector of Dates

For you “basers” out there, this is what that looks like old school style:

fils <- dir()
dfs <- lapply(fils, read.csv, stringsAsFactors=FALSE)
lapply(dfs, function(x) range(as.Date(x$date, format="%m/%e/%Y")))

or

lapply(dir(), function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  range(as.Date(df$date, format="%m/%e/%Y"))
})

You eliminate the function(x) { } and get pre-defined vars (either .x or . and, if needed, .y) to compose your maps and pipes very cleanly and succinctly, while still staying super-readable.
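
The same shorthand carries over to two-argument mapping, where .x and .y stand in for the first and second inputs (a quick illustration, not part of the shooting-data example):

map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)
## [1] 11 22 33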

After performing this inspection (i.e. confirming that each file contains incidents for only a single year), we can now automate the data ingestion:

library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)

read_html("https://www.massshootingtracker.org/data") %>% 
  html_nodes("a[href^='https://docs.goo']") %>% 
  html_attr("href") %>% 
  map_df(read_csv) %>% 
  mutate(date=mdy(date)) -> shootings

Here’s what that looks like w/o the tidyverse/piping:

library(XML)

doc <- htmlParse("http://www.massshootingtracker.org/data") # note the necessary downgrade to "http"

dfs <- xpathApply(doc, "//a[contains(@href, 'https://docs.goo')]", function(x) {
  csv <- xmlGetAttr(x, "href")
  df <- read.csv(csv, stringsAsFactors=FALSE)
  df$date <- as.Date(df$date, format="%m/%e/%Y")
  df
})

shootings <- do.call(rbind, dfs)

Even hardcore “basers” may have to admit that the piping/tidyverse version is ultimately better.

Give the purrr package a first (or second) look if you haven’t switched over to it. Type safety, readable anonymous functions and C-backed fast functional idioms will mean that your code may ultimately be purrrfect.
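
The “type safety” bit deserves a concrete (if contrived) example: the typed map_*() variants refuse to silently hand you something other than what you asked for.

map(1:3, ~ . * 1.5)      # always a list
map_dbl(1:3, ~ . * 1.5)  # a double vector: 1.5 3.0 4.5
map_int(1:3, ~ . * 1.5)  # errors, since the results can't be losslessly coerced to integer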

UPDATE #1

I received a question in the comments regarding how I came to that CSS selector for the gdocs CSV URLs, so I made a quick video of the exact steps I took. Exposition below the film.

Right-click “Inspect” in Chrome is my go-to method for finding what I’m after. This isn’t the place to dive deep into the dark art of web page spelunking, but in this case, when I saw there were four similar anchor (<a>) tags that pointed to the CSV “files”, I took the easy way out and just built a selector based on the href attribute value (or, more specifically, the characters at the start of the href attribute). However, all four ways below end up targeting the same four elements:

pg <- read_html("https://www.massshootingtracker.org/data")

html_nodes(pg, "a.btn.btn-default")
html_nodes(pg, "a[href^='https://docs.goo']")
html_nodes(pg, xpath=".//a[@class='btn btn-default']")
html_nodes(pg, xpath=".//a[contains(@href, 'https://docs.goo')]")

UPDATE #2

Due to a suggestion on Twitter, I swapped out list.files() in favour of dir() (though, as someone who detests DOS/Windows, typing that function name is super-painful).
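
For the record, the two are interchangeable; dir() is just base R’s shorter alias for list.files():

identical(dir("."), list.files("."))
## [1] TRUE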

I’ve been updating some existing packages and GitHub-releasing new ones (before a CRAN push). Most are “cyber”-related, but there are some general purpose ones. Here’s a quick overview:

  • docxtractr (CRAN, now v0.2.0) was initially designed to make it easy to get data tables out of MS Word (docx) documents. The update removes use of a deprecated xml2 package function and adds the ability to extract comments from Word docs.
  • slackr (CRAN, now v1.4.2) lets you pass R code, plots, data objects and arbitrary text to Slack from R. A Slack API change broke the package; this release compensates for it.
  • iptools (GitHub, v0.5.0) makes it super easy and quick to work with IPv4 (and some IPv6) addresses in R. The dev updates add NA support to checkers/validators (by @quonimous), Hilbert-space coordinate generators and faster ways to test IPv4 address membership in CIDR blocks (one involves using the triebeard pkg by Oliver and works well with BGP dumps read in via mrt [below]; the other uses optimized integer searches).
  • hubway (GitHub, v0.1.0) provides programmatic access to the JSON data from the bike-sharing service Hubway. I’m using it to build a predictive model of bike availability in Boston.
  • myip (GitHub, v0.1.0) provides unified access to numerous “what’s my IP address?” services on the internet. May merge this with other work @wrathematics has been doing.
  • mrt (GitHub, v0.1.0) small, C-backed package that wraps libBGPdump and reads Route Views MRT BGP dumps.
  • algorithmia (GitHub, v0.1.0) provides an R wrapper to the Algorithmia web service, enabling use of a wide range of hosted algorithms using local or cloud data.
  • ssllabs (GitHub, v0.1.0) provides an R wrapper to the SSL Labs (SSL/TLS cert checker) API
  • accidents (GitHub, v0.1.0) an R data package containing historical U.S. NHTSA accident data.
  • htmltidy (GitHub, v0.1.0) an R wrapper to the HTML Tidy library that cleans up gnarly HTML/XHTML, making it easier to parse with rvest.
  • gdns (GitHub, v0.1.0) an R wrapper to the Google “DNS over HTTPS” API.
  • ohby (GitHub, v0.1.0) an R wrapper to the nascent ohby URL & content shortener
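
If you want to kick the tires on a few of the GitHub-only ones, a quick batch install is easy enough (a sketch; the repo paths are assumed from the package names above, so double-check them against the links first):

# assumes these repos live under the "hrbrmstr" GitHub account -- verify first
purrr::walk(c("hrbrmstr/iptools", "hrbrmstr/htmltidy", "hrbrmstr/gdns"),
            devtools::install_github)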

If any of the GitHub-only pkgs are of [major] use to folks, let me know and I’ll prioritize getting CI tests wired up and CRAN submissions started.

In other news, ggalt is gearing up for an August CRAN release, so if you have any ggplot2 extensions that need a home, fork that repo and PR them my way before the middle of the month.

Finally, I’ve had many folks contribute code and bug reports to packages and wanted to close with a YUGE “thank you!” to all of you who did so.