hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Minding the zoo[keeper] with R

I’ve been drafting a new R package — [`sergeant`](https://github.com/hrbrmstr/sergeant) — to work with [Apache Drill](http://drill.apache.org/) and have found it much easier to manage having Drill operating in a single node cluster vs `drill-embedded` mode (esp when I need to add a couple more nodes for additional capacity). That means running [Apache Zookeeper](https://zookeeper.apache.org/) and I’ve had the occasional need to ping the Zookeeper process to see various stats (especially when I add some systems to the cluster).

Yes, it’s very easy+straightforward to `nc hostname 2181` from the command line to issue commands (or use the Zookeeper CLIs) but when I’m in R I like to stay in R. To that end, I made a small reference class that makes it super-easy to query Zookeeper from within R. Said class will eventually be in a sister package to `sergeant` (and, a more detailed post on `sergeant` is forthcoming), but there may be others who want/need to poke at Zookeeper processes with [4-letter words](https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands) and this will help you stay within R. The only command not available is `stmk` since I don’t have the pressing need to set the trace mask.

After you source the self-contained reference class you connect and then issue commands as so:

zk <- zookeeper$new(host="drill.local")

zk$ruok()

zk$conf()
zk$stat()

Drop a note in the comments (since this isn’t on github, yet) with any questions/issues.

zookeeper <- setRefClass(

  Class="zookeeper",

  fields=list(
    host="character",
    port="integer",
    timeout="integer"
  ),

  methods=list(

    initialize=function(..., host="localhost", port=2181L, timeout=30L) {
      host <<- host ; port <<- port ; timeout <<- timeout; callSuper(...)
    },

    available_commands=function() {
      return(sort(c("conf", "envi", "stat", "srvr", "whcs", "wchc", "wchp", "mntr",
                    "cons", "crst", "srst", "dump", "ruok", "dirs", "isro", "gtmk")))
    },

    connect=function(host="localhost", port=2181L, timeout=30L) {
      .self$host <- host ; .self$port <- port ; .self$timeout <- timeout
    },

    conf=function() {
      res <- .self$send_cmd("conf")
      res <- stringi::stri_split_fixed(res, "=", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    envi=function() {
      res <- .self$send_cmd("envi")
      res <- stringi::stri_split_fixed(res[-1], "=", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    stat=function() {
      res <- .self$send_cmd("stat")
      version <- stri_replace_first_regex(res[1], "^Zoo.*: ", "")
      res <- res[-(1:2)]
      sep <- which(res=="")
      clients <- stri_trim(res[1:(sep-1)])
      zstats <- stringi::stri_split_fixed(res[(sep+1):length(res)], ": ", 2, simplify=TRUE)
      zstats <- as.list(setNames(zstats[,2], zstats[,1]))
      list(version=version, clients=clients, stats=zstats)
    },

    srvr=function() {
      res <- .self$send_cmd("srvr")
      zstats <- stringi::stri_split_fixed(res, ": ", 2, simplify=TRUE)
      as.list(setNames(zstats[,2], zstats[,1]))
    },

    wchs=function() {
      res <- .self$send_cmd("wchs")
      conn_path <- stri_match_first_regex(res[1], "^([[:digit:]]+) connections watching ([[:digit:]]+) paths")
      tot_watch <- stri_match_first_regex(res[2], "Total watches:([[:digit:]]+)")
      list(connections_watching=conn_path[,2], paths=conn_path[,3], total_watches=tot_watch[,2])
    },

    wchc=function() {
      stri_trim(.self$send_cmd("wchc")) %>% discard(`==`, "") -> res
      setNames(list(res[2:length(res)]), res[1])
    },

    wchp=function() {
      .self$send_cmd("wchp") %>% stri_trim() %>% discard(`==`, "") -> res
      data.frame(
        path=qq[seq(1, length(qq), 2)],
        address=qq[seq(2, length(qq), 2)],
        stringsAsFactors=FALSE
      )
    },

    mntr=function() {
      res <- .self$send_cmd("mntr")
      res <- stringi::stri_split_fixed(res, "\t", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    cons=function() { list(clients=stri_trim(.self$send_cmd("cons") %>% discard(`==`, ""))) },
    crst=function() { message(.self$send_cmd("crst")) ; invisible() },
    srst=function() { message(.self$send_cmd("srst")) ; invisible() },
    dump=function() { paste0(.self$send_cmd("dump"), collapse="\n") },
    ruok=function() { .self$send_cmd("ruok") == "imok" },
    dirs=function() { .self$send_cmd("dirs") },
    isro=function() { .self$send_cmd("isro") },
    gtmk=function() { R.utils::intToBin(as.integer(.self$send_cmd("gtmk"))) },

    send_cmd=function(cmd) {
      require(purrr)
      require(stringi)
      require(R.utils)
      sock <- purrr::safely(socketConnection)
      con <- sock(host=.self$host, port=.self$port, blocking=TRUE, open="r+", timeout=.self$timeout)
      if (!is.null(con$result)) {
        con <- con$result
        cat(cmd, file=con)
        response <- readLines(con, warn=FALSE)
        a <- try(close(con))
        purrr::flatten_chr(stringi::stri_split_lines(response))
      } else {
        warning(sprintf("Error connecting to [%s:%s]", host, port))
      }
    }

  )

)

The Actual, Über, Ultimate Fake News List

It is surprisingly shorter (or longer if you grok regular expressions) than you think it might be: http[s]://.*.

Like it or not, bias is everywhere and tis woefully insidious. “Mainstream media” [MSM]; Alt-left; Alt-right; and, everything in-between creates and promotes “fake” news. Hourly. It always has. Due to the speed at which [dis]information travels now, this is the ultimate era of whatever _you_ believe is “true” is actually “true”, regardless of the morality or (unadulterated, non-theoretical) scientific facts. Sadly, it always has been this way and it always will be this way. Failing to recognize Truth tis the nature of an inherently flawed, broken species.

The reason the MSM is having trouble defining and coming to the grips with “fake news” is that it — itself — has been a witting and unwitting co-conspirator to it since before the invention of the printing press. Tis hard to see the big picture when one navel-gazes so much.

Nobody wants to be wrong and — right now — the only ones who _are_ wrong are those who disagree with _you_. Fundamentally, everyone wants their own aberrations to be normalized and we’re obliging that _in spades_ in our 21st century society. Go. Team.

Rather than sit back and accept said aberrations as the norm, verify **everything** that is presented as “fact”. That may mean opening an honest-to-goodness book or three. Sadly, given the sorry state of public & university libraries, you may need to go to multiple ones to finally get at an actual, real, fact. It may also mean coming to grips with that what you want *so desperately to be “OK”* is actually not OK. Just realize you have permission to be wrong.

Naively believe *nothing* (including this rant blog post). Challenge everything. Seek to recognize, acknowledge and promote Truth wherever it shines. Finally, don’t mistake your own desires for said Truth.

Survey on Data Science In Two Year Colleges

The ASA (American Statistical Association) has been working in collaboration with the ACM (Association for Computing Machinery) on developing a data science curriculum for Two Year Colleges. Part of this development is the need to understand the private-sector demand for two-year college data science graduates and the prevalence of the need to invest in the continuing education of existing employees for gap-filling critical skills shortages in “knowledge workers”.

By taking this survey, participating in a workshop/summit or — especially — submitting a letter of collaboration you will be helping to prepare your organization and our combined workforce meet the challenges of succeeding in an increasingly data-driven world.

Take The ACS/ACM Data Science Survey

Optionally (or additionally), you may contact Mary Rudis directly (mrudis@ccsnh.edu) for more information on this joint ASA/ACM initiative and how you can help.

Interacting With Amazon Athena from R

This is a short post for those looking to test out Amazon Athena with R.

Amazon makes Athena available via JDBC, so you can use RJDBC to query data. All you need is their JAR file and some setup information. Here’s how to get the JAR file to the current working directory:

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

To avoid putting credentials in code, you can store the AWS key and secret you’re using for the queries in ATHENA_USER and ATHENA_PASSWORD environment variables via ~/.Renviron. You’ll also need an S3 bucket writable by those credentials for the Athena staging directory. With that info in hand, it’s easy to connect:

library(RJDBC)
library(dplyr)

drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir="s3://accessible-bucket",
                                   user=Sys.getenv("ATHENA_USER"),
                                   password=Sys.getenv("ATHENA_PASSWORD"))

Even if you have no data configured in Athena, you can check out the test data available to all:

dbListTables(con)
## [1] "elb_logs"

If that worked, then you should be able to query data (using the fully qualified table name in this case):

dbGetQuery(con, "SELECT * FROM sampledb.elb_logs LIMIT 10") %>% 
  dplyr::glimpse()
## Observations: 10
## Variables: 16
## $ timestamp             <chr> "2014-09-27T00:00:25.424956Z", "2014-09-27T00:00:56.439218Z", "2014-09-27T00:01:27.441734Z", "2014-09-27T00:01:58.366715Z", "2014-09-27T00:02:29.446363Z", "2014-09-2...
## $ elbname               <chr> "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo"
## $ requestip             <chr> "241.230.198.83", "252.26.60.51", "250.244.20.109", "247.59.58.167", "254.64.224.54", "245.195.140.77", "245.195.140.77", "243.71.49.173", "240.139.5.14", "251.192.4...
## $ requestport           <dbl> 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026
## $ backendip             <chr> "251.192.40.76", "249.89.116.3", "251.111.156.171", "251.139.91.156", "251.111.156.171", "254.64.224.54", "254.64.224.54", "250.244.20.109", "247.65.176.249", "250.2...
## $ backendport           <dbl> 443, 8888, 8888, 8888, 8000, 8888, 8888, 8888, 8888, 8888
## $ requestprocessingtime <dbl> 9.1e-05, 9.4e-05, 8.4e-05, 9.7e-05, 9.1e-05, 9.3e-05, 9.4e-05, 8.3e-05, 9.0e-05, 9.0e-05
## $ backendprocessingtime <dbl> 0.046598, 0.038973, 0.047054, 0.039845, 0.061461, 0.037791, 0.047035, 0.048792, 0.045724, 0.029918
## $ clientresponsetime    <dbl> 4.9e-05, 4.7e-05, 4.9e-05, 4.9e-05, 4.0e-05, 7.7e-05, 7.5e-05, 7.3e-05, 4.0e-05, 6.7e-05
## $ elbresponsecode       <chr> "200", "200", "200", "200", "200", "200", "200", "200", "200", "200"
## $ backendresponsecode   <chr> "200", "200", "200", "200", "200", "400", "400", "200", "200", "200"
## $ receivedbytes         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sentbytes             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ requestverb           <chr> "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET"
## $ url                   <chr> "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http:/...
## $ protocol              <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"

And, you can disconnect when done:

dbDisconnect(con)

You should probably store the JAR file in a central location and refer to it that way in “production” scripts.

Now, you can go crazy querying data and racking up AWS charges ?.

Jumping on the “Banned” Wagon

>_The parent proposed a committee made up of parents and teachers of different cultural backgrounds come up with a list of books that are inclusive for all students._

Where’s the [ACLU](https://www.aclu.org/issues/free-speech/artistic-expression/banned-books)? They’d be right there if this was an alt-right’er asking to ban books with salacious content they deem (rightly or wrongly) “inappropriate” or “harmful” to teens. I hope they step up and fight the fight the good fight here.

We’re rapidly losing — or may have already lost (as a society) — the concept of building resilience & strength through adversity. I’m hopeful that the diversity we bring into this country with immigrants and refugees (if the borders don’t close shut or we scare them away) that know what true adversity is will eventually counteract this downward spiral.

Refs:

–
–
– (mp3 backup in the event they take down the audio record)

The Most Important Hashtag for 2017

When something pops up in the news, on Twitter, Facebook or even (ugh) Gab claiming or teasing findings based on data I believe it’s more important than ever to reply with some polite text and a `#showmethedata` hashtag. We desperately needed it this year and we’re absolutely going to need it in 2017 and beyond.

The catalyst for this is a recent [New Yorker story](http://nymag.com/daily/intelligencer/2016/11/activists-urge-hillary-clinton-to-challenge-election-results.html) about “computer scientists” playing off of heated election emotions and making claims in a non-public, partisan meeting with no release of data to the public or to independent, non-partisan groups with credible, ethical data analysis teams.

I believe agents of parties in power and agents of the parties who want to be in power are going to be misusing, abusing and fabricating data at a velocity and volume we’ve not seen before. If you care about the truth (the real truth, not a “necessary truth” based on an agenda) and are a capable data-worker it’s nigh your civic duty to keep in check those that are want to deceive.

**UPDATE** 2016-11-23 10:30:00 EST

Haderman has an (ugh) [Medium post](https://medium.com/@jhalderm/want-to-know-if-the-election-was-hacked-look-at-the-ballots-c61a6113b0ba#.vvcd9bguw) (ugh for using Medium vs the post content) and, as usual, the media causes more controversy than necessary. He has the best intentions and future confidentiality, integrity and availability of our electoral infrastructure at heart.

The Devil is in the Details

The [first public informational video](https://www.greatagain.gov/news/message-president-elect-donald-j-trump.html) from the PEOTUS didn’t add a full transcript of the video to the web site and did not provide (at least as of 0700 EST on 2016-11-22) their own text annotations/captions to the video.

Google’s (YouTube’s) auto-captioning (for the most part) worked and it’s most likely “just good enough” to enable the PETOUS’s team to avoid an all-out A.D.A. violation (which is probably the least of their upcoming legal worries). This is a forgotten small detail in an ever-growing list of forgotten small and large details. I’m also surprised that no progressive web site bothered to put up a transcription for those that need it.

Since “the press” did a terrible job holding the outgoing POTUS accountable during his two terms and also woefully botched their coverage of the 2016 election (NOTE: I didn’t vote for either major party candidate, but did write-in folks for POTUS & veep), I guess it’s up to all of us to help document history.

Here’s a moderately cleaned up version from the auto-generated Google SRT stream, presented without any commentary:

A Message from President-Elect Donald J. Trump

Today I would like to provide the American people with an update on the White House transition and our policy plans for the first 100 days.

Our transition team is working very smoothly, efficiently and effectively. Truly great and talented men and women – patriots — indeed are being brought in and many will soon be a part of our government helping us to make America great again.

My agenda will be based on a simple core principle: putting America first whether it’s producing steel, building cars or curing disease. I want the next generation of production and innovation to happen right here in our great homeland America creating wealth and jobs for American workers.

As part of this plan I’ve asked my transition team to develop a list of executive actions we can take on day one to restore our laws and bring back our jobs (about time) these include the following on trade: I’m going to issue our notification of intent to withdraw from the trans-pacific partnership — a potential disaster for our country. Instead we will negotiate fair bilateral trade deals that bring jobs and industry back onto American shores.

On energy: I will cancel job-killing restrictions on the production of American energy including shale energy and clean coal creating many millions of high-paying jobs. That’s what we want, that’s what we’ve been waiting for.

On regulation I will formulate a role which says that for every one new regulation two old regulations must be eliminated (so important).

For national security I will ask the Department of Defense and the chairman of the Joint Chiefs of Staff to develop a comprehensive plan to protect America’s vital infrastructure from cyberattacks and all other form of attacks.

On immigration: I will direct the Department of Labor to investigate all abuses of visa programs that undercut the American worker.

On ethics reform: as part of our plan to “drain the swamp” we will impose a five-year ban executive officials becoming lobbyists after they leave the administration and a lifetime ban on executive officials lobbying on behalf of a foreign government.

These are just a few of the steps we will take to reform Washington and rebuild our middle class I will provide more updates in the coming days as we work together to make America great again for everyone and I mean everyone.

For those technically inclined, you can grab that feed using the following R code. You’ll first need to ensure Developer Tools is open in your browser and tick the “CC” button in the video before starting the video. Look for a network request that begins with `https://www.youtube.com/api/timedtext` and use the “Copy as cURL” feature (all three major, cross-platform browsers support it), then run the code immediately after it (the `straighten()` function will take the data from the clipboard).

library(curlconverter)
library(httr)
library(xml2)

straighten() %>% 
  make_req() -> req

res <- req[[1]]()

content(res) %>% 
  xml_find_all(".//body/p") %>% 
  xml_text() %>% 
  paste0(collapse="") %>% 
  writeLines("speech.txt")

In this emergent “post-truth” world, we’re all going to need to be citizen data journalists and it’s going to be even more important than ever to ensure any data-driven pieces we write are honest, well-sourced, open to review & civil commentary, plus be fully reproducible. We all will also need to hold those in power — including the media — accountable. Despite whatever leanings you have, cherry-picking data, slicing a bit off the edges of the truth and injecting a bias because of your passionate views that you believe are right are only going to exacerbate the existing problems.

This is not cool.

First it was OpenDNS selling their souls (and, [y]our data) to Cisco (whom I don’t trust at all with my data).

Now, it’s Dyn — — doing something even worse (purely my own opinion).

I’m currently evaluating offerings by [FoolDNS](http://www.fooldns.com/fooldns-community/english-version/) & [GreenTeam](http://members.greentm.co.uk/) as alternatives and I’ll post updates as I review & test them.

I’m also in search of an open source, RPi-able DNS server with regularly updated Squid-like categorical lists and the ability to white list domains (suggestions welcome in the comments).

I’m a cybersecurity data scientist who knows just what can be done with this type of data when handed to `$BIGCORP`, and I’m far more concerned with Oracle than Cisco, but I’d rather work with a smaller company who has more reason to not sell me out.