R Archives - Page 26 of 56

Category Archives: R

R⁶ Series — Random Sampling From Apache Drill Tables With R & sergeant

(For first-timers, R⁶ tagged posts are short & sweet with minimal expository; R⁶ feed)

At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.

Here’s how to uniformly sample data from Apache Drill using the sergeant package:

library(sergeant)

db <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")

summarise(tbl, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##          n
##      <int>
## 1 19977415

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.01) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##        n
##    <int>
## 1 199808

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.50) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##         n
##     <int>
## 1 9988797

And, for groups (using a different/larger “database”):

fdns <- tbl(db, "dfs.fdns.`201708`")

summarise(fdns, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##            n
##        <int>
## 1 1895133100

filter(fdns, type %in% c("cname", "txt")) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##   <chr>    <int>
## 1 cname 15389064
## 2   txt 67576750

filter(fdns, type %in% c("cname", "txt")) %>% 
  group_by(type) %>% 
  mutate(r=rand()) %>% 
  ungroup() %>% 
  filter(r <= 0.15) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##   <chr>    <int>
## 1 cname  2307604
## 2   txt 10132672

I will (hopefully) be better at cranking these bite-sized posts more frequently in 2018.

mqtt Development Log : On DSLs, Rcpp Modules and Custom Formula Functions

I know some folks had a bit of fun with the previous post since it exposed the fact that I left out unique MQTT client id generation from the initial 0.1.0 release of the in-development package (client ids need to be unique).

There have been some serious improvements since said post and I thought a (hopefully not-too-frequent) blog-journal of the development of this particular package might be interesting/useful to some folks, especially since I’m delving into some not-oft-blogged (anywhere) topics as I use some new tricks in this particular package.

Thank The Great Maker for C++

I’m comfortable and not-too-shabby at wrapping C/C++ things with an R bow and I felt quite daft seeing this after I had started banging on the mosquitto C interface. Yep, that’s right: it has a C++ interface. It’s waaaaay easier (in my experience) bridging C++ libraries since Dirk/Romain’s (et al, as I know there are many worker hands involved as well) Rcpp has so many tools available to do that very thing.

As an aside, if you do any work with Rcpp or want to start doing work in Rcpp, Masaki E. Tsuda’s Rcpp for Everyone is an excellent tome.

I hadn’t used Rcpp Modules before (that link goes to a succinct but very helpful post by James Curran) but they make exposing C++ library functionality even easier than I had experienced before. So easy, in fact, that it made it possible to whip out an alpha version of a “domain specific language” (or a pipe-able, customized API — however you want to frame these things in your head) for the package. But, I’m getting ahead of myself.

The mosquittopp class in the mosqpp namespace is much like the third bowl of porridge: just right. It’s neither too low-level nor too high-level and it was well thought out enough that it only required a bit of tweaking to use as an Rcpp Module.

First there are more than a few char * parameters that needed to be std::strings. So, something like:

int username_pw_set(const char *username, const char *password);

becomes:

int username_pw_set(std::string username, std::string password);

in our custom wrapper class.

Since the whole point of the mqtt package is to work in R vs C[++] or any other language, the callbacks — the functions that do the work when message, publish, subscribe, etc. events are triggered — need to be in R. I wanted to have some default callbacks during the testing phase and they’re really straightforward to setup in Rcpp:

Rcpp::Environment pkg_env = Rcpp::Environment::namespace_env("mqtt");

Rcpp::Function ccb = pkg_env[".mqtt_connect_cb"];
Rcpp::Function dcb = pkg_env[".mqtt_disconnect_cb"];
Rcpp::Function pcb = pkg_env[".mqtt_publish_cb"];
Rcpp::Function mcb = pkg_env[".mqtt_message_cb"];
Rcpp::Function scb = pkg_env[".mqtt_subscribe_cb"];
Rcpp::Function ucb = pkg_env[".mqtt_unsubscribe_cb"];
Rcpp::Function lcb = pkg_env[".mqtt_log_cb"];
Rcpp::Function ecb = pkg_env[".mqtt_error_cb"];

The handy thing about that approach is you don’t need to export the functions (it works like the ::: does).

But the kicker is the Rcpp Module syntactic sugar:

RCPP_MODULE(MQTT) {

  using namespace Rcpp;

  class_<mqtt_r>("mqtt_r")
    .constructor<std::string, std::string, int>("id/host/port constructor")
    .constructor<std::string, std::string, int, std::string, std::string>("id/host/port/user/pass constructor")
    .constructor<std::string, std::string, int, Rcpp::Function, Rcpp::Function, Rcpp::Function>("id/host/post/con/mess/discon constructor")
    .method("connect", &mqtt_r::connect)
    .method("disconnect", &mqtt_r::disconnect)
    .method("reconnect", &mqtt_r::reconnect)
    .method("username_pw_set", &mqtt_r::username_pw_set)
    .method("loop_start", &mqtt_r::loop_start)
    .method("loop_stop", &mqtt_r::loop_stop)
    .method("loop", &mqtt_r::loop)
    .method("publish_raw", &mqtt_r::publish_raw)
    .method("publish_chr", &mqtt_r::publish_chr)
    .method("subscribe", &mqtt_r::subscribe)
    .method("unsubscribe", &mqtt_r::unsubscribe)
    .method("set_connection_cb", &mqtt_r::set_connection_cb)
    .method("set_discconn_cb", &mqtt_r::set_discconn_cb)
    .method("set_publish_cb", &mqtt_r::set_publish_cb)
    .method("set_message_cb", &mqtt_r::set_message_cb)
    .method("set_subscribe_cb", &mqtt_r::set_subscribe_cb)
    .method("set_unsubscribe_cb", &mqtt_r::set_unsubscribe_cb)
    .method("set_log_cb", &mqtt_r::set_log_cb)
    .method("set_error_cb", &mqtt_r::set_error_cb)
   ;

}

That, combined with RcppModules: MQTT in the DESCRIPTION file and a MQTT <- Rcpp::Module("MQTT") just above where you’d put an .onLoad handler means you can do something like (internally, since it’s not exported):

mqtt_obj <- MQTT$mqtt_r

mqtt_conn_obj <- new(mqtt_obj, "unique_client_id", "test.mosquitto.org", 1883L)

and have access to each of those methods right from R (e.g. mqtt_conn_obj$subscribe(0, "topic", 0)).

If you’re careful with your C++ class code, you’ll be able to breeze through exposing functionality.

Because of the existence of Rcpp Modules, I was able to do what follows in the next section in near record time.

“The stump of a `%>%` he held tight in his teeth”

I felt compelled to get a Christmas reference in the post and it’s relevant to this section. I like %>%, recommend the use of %>% and use %>% in my own day-to-day R coding (it’s even creeping into internal package code, though I still try not to do that). I knew I wanted to expose a certain way of approaching MQTT workflows in this mqtt package and that meant coming up with an initial — but far from complete — mini-language or pipe-able API for it. Here’s the current thinking/implementation:

Setup connection parameters with mqtt_broker(). For now, it takes some parameters, but there is a URI scheme for MQTT so I want to be able to support that as well at some point.
Support authentication with mqtt_username_pw(). There will also be a function for dealing with certificates and other security-ish features which look similar to this.
Make it dead-easy to subscribe to topics and associate callbacks with mqtt_subscribe() (more on this below)
Support an “until you kill it” workflow with mqtt_run() that loops either forever or for a certain number of iterations
Support user-controlled iterations with mqtt_begin(), mqtt_loop() and mqtt_end(). An example (in a bit) should help explain this further, but this one is likely to be especially useful in a Shiny context.

Now, as hopefully came across in the previous post, the heart of MQTT message processing is the callback function. You write a function with a contractually defined set of parameters and operate on the values passed in. While we should all likely get into a better habit of using named function objects vs anonymous functions, anonymous functions are super handy, and short ones don’t cause code to get too gnarly. However, in this new DSL/API I’ve cooked up, each topic message callback has six parameters, so that means if you want to use an anonymous function (vs a named one) you have to do something like this in message_subscribe():

mqtt_subscribe("sometopic",  function(id, topic, payload, qos, retain, con) {})

That’s not very succinct, elegant or handy. Since those are three attributes I absolutely ? about most things related to R, I had to do something about it.

Since I’m highly attached to the ~{} syntax introduced with purrr and now expanding across the Tidyverse, I decided to make a custom version of it for message_subscribe(). As a result, the code above can be written as:

mqtt_subscribe("sometopic",  ~{})

and, you can reference id, topic, payload, etc inside those brackets without the verbose function declaration.

How is this accomplished? Via:

as_message_callback <- function(x, env = rlang::caller_env()) {
  rlang::coerce_type(
    x, rlang::friendly_type("function"),
    closure = { x },
    formula = {
      if (length(x) > 2) rlang::abort("Can't convert a two-sided formula to an mqtt message callback function")
      f <- function() { x }
      formals(f) <- alist(id=, topic=, payload=, qos=, retain=, con=)
      body(f) <- rlang::f_rhs(x)
      f
    }
  )
}

It’s a shortened version of some Tidyverse code that’s more generic in nature. That as_message_callback() function looks to see if you’ve passed in a ~{} or a named/anonymous function. If ~{} was used, that function builds a function with the contractually obligated/expected signature, otherwise it shoves in what you gave it.

A code example is worth a thousand words (which is, in fact, the precise number of “words” up until this snippet, at least insofar as the WordPress editor counts them):

library(mqtt)

# We're going to subscribe to *three* BBC subtitle feeds at the same time!
#
# We'll distinguish between them by coloring the topic and text differently.

# this is a named function object that displays BBC 2's subtitle feed when it get messages
moar_bbc <- function(id, topic, payload, qos, retain, con) {
  if (topic == "bbc/subtitles/bbc_two_england/raw") {
    cat(crayon::cyan(topic), crayon::blue(readBin(payload, "character")), "\n", sep=" ")
  }
}

mqtt_broker("makmeunique", "test.mosquitto.org", 1883L) %>% # connection info
  
  mqtt_silence(c("all")) %>% # silence all the development screen messages
  
  # subscribe to BBC 1's topic using a fully specified anonyous function
  
  mqtt_subscribe(
    "bbc/subtitles/bbc_one_london/raw",
    function(id, topic, payload, qos, retain, con) { # regular anonymous function
      if (topic == "bbc/subtitles/bbc_one_london/raw")
        cat(crayon::yellow(topic), crayon::green(readBin(payload, "character")), "\n", sep=" ")
    }) %>%
  
  # as you can see we can pipe-chain as many subscriptions as we like. the package 
  # handles the details of calling each of them. This makes it possible to have
  # very focused handlers vs lots of "if/then/case_when" impossible-to-read functions.
  
  # Ahh. A tidy, elegant, succinct ~{} function instead
  
  mqtt_subscribe("bbc/subtitles/bbc_news24/raw", ~{ # tilde shortcut function (passing in named, pre-known params)
    if (topic == "bbc/subtitles/bbc_news24/raw")
      cat(crayon::yellow(topic), crayon::red(readBin(payload, "character")), "\n", sep=" ")
  }) %>%
  
  # And, a boring, but -- in the long run, better (IMO) -- named function object
  
  mqtt_subscribe("bbc/subtitles/bbc_two_england/raw", moar_bbc) %>% # named function
  
  mqtt_run() -> res # this runs until you Ctrl-C

There’s in-code commentary, so I’ll refrain from blathering about it more here except for noting there are a staggering amount of depressing stories on BBC News and an equally staggering amount of un-hrbrmstr-like language use in BBC One and BBC Two shows. Apologies if any of the GH README.md snippets or animated screenshots ever cause offense, as it’s quite unintentional.

But you said something about begin/end/loop before?

Quite right! For that we’ll use a different example.

I came across a topic — “sfxrider/+/locations” — on broker.mqttdashboard.com. It looks like live data from folks who do transportation work for “Shadowfax Technologies” (which is a crowd-sourced transportation/logistics provider in India). It publishes the following in the payload:

| device:6170774037 | latitude:28.518363 | longitude:77.095753 | timestamp:1513539899000 |
| device:6170774037 | latitude:28.518075 | longitude:77.09555 | timestamp:1513539909000 |
| device:6170774037 | latitude:28.518015 | longitude:77.095488 | timestamp:1513539918000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539921000 |
| device:6170774037 | latitude:28.518018 | longitude:77.095492 | timestamp:1513539928000 |
| device:6170774037 | latitude:28.518022 | longitude:77.095495 | timestamp:1513539938000 |
| device:6170774037 | latitude:28.518025 | longitude:77.095505 | timestamp:1513539947000 |
| device:6170774037 | latitude:28.518048 | longitude:77.095527 | timestamp:1513539957000 |
| device:6170774037 | latitude:28.518075 | longitude:77.095573 | timestamp:1513539967000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539975000 |
| device:6170774037 | latitude:28.518205 | longitude:77.095603 | timestamp:1513539977000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095587 | timestamp:1513539986000 |
| device:6170774037 | latitude:28.518202 | longitude:77.095578 | timestamp:1513539996000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095578 | timestamp:1513540006000 |
| device:6170774037 | latitude:28.518203 | longitude:77.095577 | timestamp:1513540015000 |
| device:6170774037 | latitude:28.518208 | longitude:77.095577 | timestamp:1513540025000 |

Let’s turn that into proper, usable, JSON (we’ll just cat() it out for this post):

library(mqtt)
library(purrr)
library(stringi)

# turn the pipe-separated, colon-delimeted lines into a proper list
.decode_payload <- function(.x) {
  .x <- readBin(.x, "character")
  .x <- stri_match_all_regex(.x, "([[:alpha:]]+):([[:digit:]\\.]+)")[[1]][,2:3]
  .x <- as.list(setNames(as.numeric(.x[,2]), .x[,1]))
  .x$timestamp <- as.POSIXct(.x$timestamp/1000, origin="1970-01-01 00:00:00")
  .x
}

# do it safely as the payload in MQTT can be anything
decode_payload <- purrr::safely(.decode_payload)

# change the client id
mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_run(times = 10000) -> out

What if you wanted do that one-by-one so you could plot the data live in a Shiny map? Well, we won’t do that in this post, but the user-controlled loop version would look like this:

mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_begin() -> tracker # _begin!! not _run!!

# call this individually and have the callback update a
# larger scoped variable or Redis or a database. You
# can also just loop like this `for` setup.

for (i in 1:25) mqtt_loop(tracker, timeout = 1000)

mqtt_end(tracker) # this cleans up stuff!

FIN

Whew. 1,164 words later and I hope I’ve kept your interest through it all. I’ve updated the GH repo for the package and also updated the requirements for the package in the README. I’m also working on a configure script (mimicking @opencpu’s ‘anti-conf’ approach) and found Win32 library binaries that should make this easier to get up and running on Windows, so stay tuned for the next installment and don’t hesitate to jump on board with issues, questions, comments or PRs.

The goal for the next post is to cover reading from either that logistics feed or OwnTracks and dynamically display points on a map with Shiny. Stay tuned!

Inter-operate with ‘MQTT’ Message Brokers With R (a.k.a. Live! BBC! Subtitles!)

Most of us see the internet through the lens of browsers and apps on our laptops, desktops, watches, TVs and mobile devices. These displays are showing us — for the most part — content designed for human consumption. Sure, apps handle API interactions, but even most of that communication happens over ports 80 or 443. But, there are lots of ports out there; 0:65535, in fact (at least TCP-wise). And, all of them have some kind of data, and most of that is still targeted to something for us.

What if I told you the machines are also talking to each other using a thin/efficient protocol that allows one, tiny sensor to talk to hundreds — if not thousands — of systems without even a drop of silicon-laced sweat? How can a mere, constrained sensor do that? Well, it doesn’t do it alone. Many of them share their data over a fairly new protocol dubbed MQTT (Message Queuing Telemetry Transport).

An MQTT broker watches for devices to publish data under various topics and then also watches for other systems to subscribe to said topics and handles the rest of the interchange. The protocol is lightweight enough that fairly low-powered (CPU- and literal electric-use-wise) devices can easily send their data chunks up to a broker, and the entire protocol is robust enough to support a plethora of connections and an equal plethora of types of data.

Why am I telling you all this?

Devices that publish to MQTT brokers tend to be in the spectrum of what folks sadly call the “Internet of Things”. It’s a terrible, ambiguous name, but it’s all over the media and most folks have some idea what it means. In the context of MQTT, you can think of it as, say, a single temperature sensor publishing it’s data to an MQTT broker so many other things — including programs written by humans to capture, log and analyze that data — can receive it. This is starting to sound like something that might be right up R’s alley.

There are also potential use-cases where an online data processing system might want to publish data to many clients without said clients having to poll a poor, single-threaded R server constantly.

Having MQTT connectivity for R could be really interesting.

And, now we have the beginnings of said connectivity with the mqtt? package.

Another Package? Really?

Yes, really.

Besides the huge potential for having a direct R-bridge to the MQTT world, I’m work-interested in MQTT since we’ve found over 35,000 of them on the default, plaintext port for MQTT (1883) alone:

There are enough of them that I don’t even need to show a base map.

Some of these servers require authentication and others aren’t doing much of anything. But, there are a number of them hosted by corporations and individuals that are exposing real data. OwnTracks seems to be one of the more popular self-/badly-hosted ones.

Then, there are others — like test.mosquitto.org — which deliberately run open MQTT servers for “testing”. There definitely is testing going on there, but there are also real services using it as a production broker. The mqtt package is based on the mosquitto C library, so it’s only fitting that we show a few examples from its own test site here.

For now, there’s really one function: topic_subscribe(). Eventually, R will be able to publish to a broker and do more robust data collection operations (say, to make a live MQTT dashboard in Shiny). The topic_subscribe() function is an all-in one tool that enables you to:

connect to a broker
subscribe to a topic
pass in R callback functions which will be executed on connect, disconnect and when new messages come in

That’s plenty of functionality to do some fun things.

(Update: if you had tried this during the ~24 hours after the blog post was up, you may have run into an issue where the client id was not unique and got wonky results. That’s fixed, now, but it’s a good idea to use your own, unique client id when making MQTT requests).

What’s the frequencytemperature, Kenneth?

The mosquitto test server has one topic — /outbox/crouton-demo/temperature — which is a fake temperature sensor that just sends data periodically so you have something to test with. Let’s capture 50 samples and plot them.

Since we’re using a callback we have to use the tricksy <<- operator to store/update variables outside the callback function. And, we should pre-allocate space for said data to avoid needlessly growing objects. Here’s a complete code-block:

library(mqtt) # devtools::install_github("hrbrmstr/mqtt")
library(jsonlite)
library(hrbrthemes)
library(tidyverse)

i <- 0 # initialize our counter
max_recs <- 50 # max number of readings to get

readings <- vector("character", max_recs)

# our callback function
temp_cb <- function(id, topic, payload, qos, retain) {

  i <<- i + 1 # update the counter
  readings[i] <<- readBin(payload, "character") # update our larger-scoped vector

  return(if (i==max_recs) "quit" else "go") # need to send at least "". "quit" == done

}

topic_subscribe(
  topic = "/outbox/crouton-demo/temperature",
  message_callback=temp_cb
)

# each reading looks like this:
# {"update": {"labels":[4631],"series":[[68]]}}
map(readings, fromJSON) %>%
  map(unlist) %>%
  map_df(as.list) %>%
  ggplot(aes(update.labels, update.series)) +
  geom_line() +
  geom_point() +
  labs(x="Reading", y="Temp (F)", title="Temperature via MQTT") +
  theme_ipsum_rc(grid="XY")

We setup temp_cb() to be our callback and topic_subscribe() ensures that the underlying mosquitto library will call it every time a new message is published to that topic. The chart really shows how synthetic the data is.

Subtitles from the Edge

Temperature sensors are just the sort of thing that MQTT was designed for. But, we don’t need to be stodgy about our use of MQTT.

Just about a year ago from this post, the BBC launched live subtitles for iPlayer. Residents of the Colonies may not know what iPlayer is, but it’s the “app” that lets UK citizens watch BBC programmes on glowing rectangles that aren’t proper tellys. Live subtitles are hard to produce well (and get right) and the BBC making the effort to do so also on their digital platform is quite commendable. We U.S. folks will likely be charged $0.99 for each set of digital subtitles now that net neutrality is gone.

Now, some clever person(s) wired up some of these live subtitles to MQTT topics. We can wire up our own code in R to read them live:

bbc_callback <- function(id, topic, payload, qos, retain) {
  cat(crayon::green(readBin(payload, "character")), "\n", sep="")
  return("") # ctrl-c will terminate
}

mqtt::topic_subscribe(topic = "bbc/subtitles/bbc_news24/raw",
                      connection_callback=mqtt::mqtt_silent_connection_callback,
                      message_callback=bbc_callback)

In this case, control-c terminates things (cleanly).

You could easily modify the above code to have a bot that monitors for certain keywords then sends windowed chunks of subtitled text to some other system (Slack, database, etc). Or, create an online tidy text analysis workflow from them.

Shiny MQTT

This is an update to the post. I had posited (below) the potential for future use in a Shiny context. If some missed points are acceptable, you can use this in Shiny now. Here’s code for a live viewer of the first temperature data example (above). (NOTE: I heavily ~~borrowed~~stole from Miles/Alicia’s cool webrockets example for this:

library(shiny)
library(mqtt)
library(hrbrthemes)
library(tidyverse)

ui <- fluidPage(
  plotOutput('plot')
)

server <- function(input, output) {

  get_temps <- function(n) {

    i <- 0
    max_recs <- n
    readings <- vector("character", max_recs)

    temp_cb <- function(id, topic, payload, qos, retain) {
      i <<- i + 1
      readings[i] <<- readBin(payload, "character")
      return(if (i==max_recs) "quit" else "go")
    }

    mqtt::topic_subscribe(topic = "/outbox/crouton-demo/temperature",
                    connection_callback = mqtt::mqtt_silent_connection_callback,
                    message_callback = temp_cb)

    purrr::map(readings, jsonlite::fromJSON) %>%
      purrr::map(unlist) %>%
      purrr::map_df(as.list)

  }

  values <- reactiveValues(x = NULL, y = NULL)

  observeEvent(invalidateLater(450), {
    new_response <- get_temps(1)
    if (length(new_response) != 0) {
      values$x <- c(values$x, new_response$update.labels)
      values$y <- c(values$y, new_response$update.series)
    }
  }, ignoreNULL = FALSE)

  output$plot <- renderPlot({
    xdf <- data.frame(xval = values$x, yval = values$y)
    ggplot(xdf, aes(x = xval, y=yval)) + 
      geom_line() +
      geom_point() +
      theme_ipsum_rc(grid="XY")
  })

}

shinyApp(ui = ui, server = server)

FIN

The code is on GitHub and all input/contributions are welcome and encouraged. Some necessary TBDs are authentication & encryption. But, how would you like the API to look for using it, say, in Shiny apps? What should publishing look like? What helper functions would be useful (ones to slice & dice topic names or another to convert raw message text more safely)? Should there be an R MQTT “DSL”? Lots of things to ponder and so many sites to “test”!

P.S.

In case you are concerned about the unusually boring R package name, I wanted to use RIoT (lower-cased, of course) but riot is, alas, already taken.

A Workaround For When Anti-DDoS Also Means Anti-Data

More sites are turning to services like Cloudflare due to just how stupid-easy it is to DDoS — perform a (possibly Distributed) Denial of Service attack on — a site. Sometimes the DDoS is intentional (malicious). Sometimes it’s because your bot didn’t play nice (stop that, btw). Sadly, at some point, most of us with “vital” sites are going to have to pay protection money to one of these services unless law enforcement or ISPs do a better job stopping DDoS (killing the plethora of pwnd IoT devices that make up one of the largest for-rent DDoS services out there would be a good start).

Soapbox aside, sites like this one — https://www.bitmarket.pl/docs.php?file=api_public.html — (which was giving an SO poster trouble) have DDoS protection enabled but they also want you to be able to automate the downloads (this one even calls it an “API”). However, try to grab one of the files there with your browser and you’ll likely see a Cloudflare interstitial page which eventually gets you the data.

Try the same thing with download.file() or httr::GET() and you’ll run into trouble since neither of those two functions have a way to perform the javascript challenge execution which ultimately is posted (well, GETted in this case) to a checker endpoint which eventually redirects to the original URL with enough ??? to ensure you won’t be bothered again.

Cloudflare has captcha and other types of interstitials, but if you happen on the 503+javascript challenge one, have I got a package for you! Meet: cfhttr?.

The singular function (for now) — cf_GET() — does the following:

Makes an httr::GET() call with the initial URL
Checks to ensure it’s both on Cloudflare and is using the javascript challenge protection scheme
Slices the javascript and tweaks it enough to enable running it in V8
Retrieves the challenge computation from V8
Posts (well, httr::GET()s it since that’s what Cloudflare expects) the challenge form with the proper Referer header and hopefully passes the test so you get your content.

devtools::install_github("hrbrmstr/cfhttr")

library(cfhttr)

res <- cf_GET("https://www.bitmarket.pl/graphs/BTCPLN/90m.json")
## Waiting 5 seconds...

str(httr::content(res, as="parsed"))
## List of 90
##  $ :List of 6
##   ..$ time : int 1512908160
##   ..$ open : chr "48000.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48000.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00124821"
##  $ :List of 6
##   ..$ time : int 1512908220
##   ..$ open : chr "48100.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48100.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00000000"
##  $ :List of 6
##   ..$ time : int 1512908280
##   ..$ open : chr "48100.00000000"
##   ..$ high : chr "48100.00000000"
##   ..$ low  : chr "48100.00000000"
##   ..$ close: chr "48100.00000000"
##   ..$ vol  : chr "0.00000000"
## ...

FIN

If you end up using this in workflows and run into a problem, it likely means that Cloudflare changed the challenge code page. Please file an issue so I can update the code.

The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

Let’s try to do the whole thing in R (of course!).

We’ll need to:

Grab the lyrics
Parse the lyrics
Get pricing data
Compute some statistics
Make some (hopefully) pretty charts

This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.

Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

Grab the code up on GitHub.

(Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).

A Public Apology to Tal/R-Bloggers + The Trouble With Tibbles

Over the past few weeks, I had been noticing that some posts in the R-bloggers feed were getting truncated in Feedly. I don’t remember when I noticed that since I usually click through immediately from the headline entry to the R-bloggers page vs read in Feedly since ultimately I want to get to the author’s site to see it formatted they way they intended it to be.

I let frustration get the better of me and — without verifying with Tal first — tweeted in error in said frustration. I’m not going to perform an extensive validation on the R-Bloggers feed always pushing out full content as Tal say they do.

Tal (and any other folks who work with Tal on R-Bloggers): I apologize for the tweet content and the tone of the tweet, but more apologize for not reaching out directly as if I had the following would have likely emerged after dual investigations. So, I’m really daft on at least two accounts. I will also apologize again, in-person, if we manage to cross paths in 2018. Hopefully said apology will be over a delightful beverage (on me — well, hopefully you won’t actually dump said beverage on me, but you’d be right to do so).

The truncated posts (anyone with Feedly can likely validate my experience) seems to be a combination of issues with a common thread: the tibble. I’m going to use the tibble 1.2.0 post [R-Bloggers link] from RStudio as an example. I have to use pictures (apologies), but you’ll see why in a bit.

The Trouble With Tibbles

This is a snap from the early part of the aforementioned post:

Here’s that content on R-Bloggers:

Now you get to play that favorite childhood game of yours: spot the difference.

You should notice the angle-bracket type headers are missing on R-Bloggers version of the post.

While they are visually missing, they are not — in fact — missing. They are there:

But, HTML wonks have likely already figured out the issue.

Here’s what the source view from RStudio’s blog looks like:

One more opportunity to play “spot the difference”.

That difference can wreak havoc with further HTML/XML post-processors (inspect the elements in different browsers or via rvest/xml2) and it seems Feedly’s ingestion process is doing the truncation when it hits invalid HTML.

This means that tibble output will likely cause more posts to be truncated in feed viewers pulling from R-Bloggers (I verified this with a sample of other, recent posts that I knew used tibble output).

Both R-Bloggers and Feedly should work on said issues. I’ll be pinging Feedly and I’m sure after I tweet this post out Tal will see it.

FIN

Tal: I promise to bring up any further issues with you directly and re-iterate my apology one more time.

Sentiment Analysis of “A Christmas Carol”

2017-11-29 – 18:16
Posted in data wrangling, R, text-mining
Tagged post
Comments (7)

Our family has been reading, listening to and watching “A Christmas Carol” for just abt 30 years now. I got it into my crazy noggin to perform a sentiment analysis on it the other day and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.

One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with:

(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)

Let us begin!

STAVE I: hrbrmstr’s Code

We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:

library(rprojroot)
library(gutenbergr)
library(hrbrthemes)
library(stringi)
library(tidytext)
library(tidyverse)

rt <- find_rstudio_root_file()

carol_rds <- file.path(rt, "data", "carol.rds")

if (!file.exists(carol_rds)) {
  carol_df <- gutenberg_download("46")
  write_rds(carol_df, carol_rds)
} else {
  carol_df <- read_rds(carol_rds)
}

How did I know to use 46? We can use gutenberg_works() to get to that info:

gutenberg_works(author=="Dickens, Charles")
## # A tibble: 74 x 8
##    gutenberg_id                                                                                    title
##           <int>                                                                                    <chr>
##  1           46                             A Christmas Carol in Prose; Being a Ghost Story of Christmas
##  2           98                                                                     A Tale of Two Cities
##  3          564                                                               The Mystery of Edwin Drood
##  4          580                                                                      The Pickwick Papers
##  5          588                                                                  Master Humphrey's Clock
##  6          644                                                  The Haunted Man and the Ghost's Bargain
##  7          650                                                                      Pictures from Italy
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
##  9          675                                                                           American Notes
## 10          678                                          The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author <chr>, gutenberg_author_id <int>, language <chr>,
## #   gutenberg_bookshelf <chr>, rights <chr>, has_text <lgl>

STAVE II: The first of three wrangles

We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:

#' Convenience only
carol_txt <- carol_df$text

# Just want the chapters (staves)
carol_txt <- carol_txt[-(1:(which(grepl("STAVE I:", carol_txt)))-1)]

#' We'll need this later to make prettier facet titles
data_frame(
  stave = 1:5,
  title = sprintf("Stave %s: %s", stave, carol_txt[stri_detect_fixed(carol_txt, "STAVE")] %>%
    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
    stri_trans_totitle())
) -> stave_titles

stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.

You’ll also see that I made sure we started at the first stave break vs include the title bits in the analysis.

Now, we need to prep the text for text analysis.

STAVE III: The second of three wrangles

There are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!

Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.

#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
data_frame(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token="regex", pattern="STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>% 
  group_by(stave) %>%
  mutate(para = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token="sentences") %>% 
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens

carol_tokens
##  A tibble: 28,710 x 4
##   stave  para  sent   word
##   <int> <int> <int>  <chr>
## 1     1     1     1 marley
## 2     1     1     1    was
## 3     1     1     1   dead
## 4     1     1     1     to
## 5     1     1     1  begin
## 6     1     1     1   with
## 7     1     1     1  there
## 8     1     1     1     is
## 9     1     1     1     no
## 0     1     1     1  doubt
##  ... with 28,700 more rows

By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.

STAVE IV: The third of three wrangles

Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment-al posts (including this one) take a naive approach with basic match and only looking at 1-grams. One reason I didn’t go further was to make the code accessible to new R folk (since I primarily blog for new R folk :-). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.

#' Retrieve sentiments and compute them.
#'
#' I left the `index` in vs just use `paragraph` since it'll make this easier to reuse
#' this block (which I'm not doing but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent

STAVE V: The end of it

Now, we just need to do some really basic ggplot-ing to to get to our desired result:

ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend=index, yend=0, color=title), size=0.33) +
  scale_x_comma(limits=range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales="free_x", ncol=5) +
  labs(x=NULL, y="Sentiment",
       title="Sentiment Analysis of A Christmas Carol",
       subtitle="By stave & ¶",
       caption="Humbug!") +
  theme_ipsum_rc(grid="Y", axis_text_size = 8, strip_text_face = "italic", strip_text_size = 10.5) +
  theme(legend.position="none")

You’ll want to tap/click on that to make it bigger.

Despite using a naive analysis, I think it tracks pretty well with the flow of the book.

Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.

The truly terrible (-10 sentiment) paragraph also makes sense:

Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.

(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)

filter(
  carol_tokens, stave == 1,
  para == filter(carol_with_sent, stave==1) %>% 
    filter(sentiment == min(sentiment)) %>% 
    pull(index)
)

Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.

Stave three (Christmas present) has the highest:

The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.

and lowest (sentiment) points of the entire book:

And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.

Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative sentiment paragraph (and the one that follows) are pretty dark, though:

They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.

Finally, Stave five is both short and positive (whew!). Which I heartily agree with!

FIN

The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.

Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.

For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”

voteogram Is Now On CRAN

Earlier this year, I made a package that riffed off of ProPublica’s really neat voting cartograms (maps) for the U.S. House and Senate. You can see one for disaster relief spending in the House and one for the ACA “Skinny Repeal” in the Senate.

We can replicate both here with the voteogram package (minus the interactivity, for now):

library(voteogram)
library(ggplot2)

hr_566 <- roll_call(critter="house", number=115, session=1, rcall=566)

house_carto(hr_566) +
  coord_equal() +
  theme_voteogram()

sen_179 <- roll_call(critter="senate", number=115, session=1, rcall=179)

senate_carto(sen_179) +
  coord_equal() +
  theme_voteogram()

I think folks might have more fun with the roll_call() objects though:

str(hr_566)
## List of 29
##  $ vote_id              : chr "H_115_1_566"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 566
##  $ needed_to_pass       : int 282
##  $ date_of_vote         : chr "October 12, 2017"
##  $ time_of_vote         : chr "03:23 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "2/3 YEA-AND-NAY"
##  $ question             : chr "On Motion to Suspend the Rules and Agree"
##  $ description          : chr "Providing for the concurrence by the House in the Senate amendment to H.R. ## 2266, with an amendment"
##  $ nyt_title            : chr "On Motion to Suspend the Rules and Agree"
##  $ total_yes            : int 353
##  $ total_no             : int 69
##  $ total_not_voting     : int 11
##  $ gop_yes              : int 164
##  $ gop_no               : int 69
##  $ gop_not_voting       : int 7
##  $ dem_yes              : int 189
##  $ dem_no               : int 0
##  $ dem_not_voting       : int 5
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "Yes"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##   ..$ bioguide_id         : chr [1:435] "A000374" "A000370" "A000055" "A000371" ...
##   ..$ role_id             : int [1:435] 274 294 224 427 268 131 388 320 590 206 ...
##   ..$ member_name         : chr [1:435] "Ralph Abraham" "Alma Adams" "Robert B. Aderholt" "Pete Aguilar" ...
##   ..$ sort_name           : chr [1:435] "Abraham" "Adams" "Aderholt" "Aguilar" ...
##   ..$ party               : chr [1:435] "R" "D" "R" "D" ...
##   ..$ state_abbrev        : chr [1:435] "LA" "NC" "AL" "CA" ...
##   ..$ display_state_abbrev: chr [1:435] "La." "N.C." "Ala." "Calif." ...
##   ..$ district            : int [1:435] 5 12 4 31 12 3 2 19 36 2 ...
##   ..$ position            : chr [1:435] "Yes" "Yes" "Yes" "Yes" ...
##   ..$ dw_nominate         : num [1:435] 0.493 -0.462 0.36 -0.273 0.614 0.684 0.388 NA 0.716 NA ...
##   ..$ pp_id               : chr [1:435] "LA_5" "NC_12" "AL_4" "CA_31" ...
##  - attr(*, "class")= chr [1:2] "pprc" "list"

as they hold tons of info on the votes.

I need to explore the following a bit more but there are some definite “patterns” in the way the 115th Senate has voted this year:

library(hrbrthemes)

# I made a mistake in how I exposed these that I'll correct next month
# but we need to munge it a bit anyway for this view
fills <- voteogram:::vote_carto_fill
names(fills) <- tolower(names(fills))

rcalls <- map(1:280, ~voteogram::roll_call(critter="senate", session=1, number=115, rcall=.x))
# save it off so you don't have to waste those calls again
write_rds(rcalls, "2017-115-1-sen-280-roll-calls.rds")

# do a bit of wrangling
map_df(rcalls, ~{
  mutate(.x$votes, vote_id = .x$vote_id) %>% 
    arrange(party, position) %>% 
    mutate(fill = tolower(sprintf("%s-%s", party, position))) %>% 
    mutate(ques = .x$question) %>% 
    mutate(x = 1:n())
}) -> votes_df

# plot it
ggplot(votes_df, aes(x=x, y=vote_id, fill=fill)) +
  geom_tile() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  scale_fill_manual(name=NULL, values=fills) +
  labs(x=NULL, y=NULL, title="Senate Roll Call Votes",
       subtitle="2017 / 115th, Session 1, Votes 1-280",
       caption="Note free-Y scales") +
  facet_wrap(~ques, scales="free_y", ncol=3) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position="right")

Hopefully I’ll get some time to dig into the differences and report on anything interesting. If you get to it before me definitely link to your blog post in a comment!

FIN

I still want to make an htmlwidgets version of the plots and also add the ability to get the index of roll call votes by Congress number and session to make it easier to iterate.

I’m also seriously considering creating different palettes. I used the ones from the source interactive site but am not 100% happy with them. Suggestions/PRs welcome.

Hopefully this package will make it easier for U.S. folks to track what’s going on in Congress and keep their representatives more accountable to the truth.

Everything’s on GitHub so please file issues, questions or PRs there.