
NOTE: The likelihood of this recipe being added to the recent practice bookdown book is slim, but I’ll try to keep the same format for the blog post.

Problem

You want to collect all the tweets in a Twitter thread.

Solution

Use a few key functions in rtweet to piece the thread elements back together.

Discussion

In Twitterland, a “thread” is a series of tweets by an author that form a reply chain to each other, enabling them to be displayed sequentially as a larger & (ostensibly) more cohesive message. Even with the recent 280-character tweet-length increase, threads are still popular and used daily. They’re very easy to distinguish on Twitter, but there is no Twitter API call to collect up all the pieces of these threads.

Let’s build a function — get_thread() — that will take as input a starting thread URL or status id and return a data frame of all the tweets in the thread (in order). As a bonus, we’ll also add an option to include all first-level retweets and replies to each threaded tweet (that, too, happens quite a bit).

There are documentation snippets in the code block (below), but the essence of the function is:

  • first, finding the tweet that belongs to the status id to get some metadata
  • then doing a search for tweets from the author that occurred after that tweet (we do this to save on API calls and we grab a bunch of them)
  • rather than do a bunch of things by hand, we make from/to pairs to feed as edges into igraph
  • once that’s done, separate out the graph into unique subgraphs and find the one containing the starting status id
  • since that subgraph is just a set of status ids, rebuild the data frame from it and put it in order.

There may be occasions where we want to grab the replies or RTs of any of the original thread tweets. They aren’t always useful, but when they are it’d be good to have that context. So, we’ll add an option that — if TRUE — will cause the function to go down the list of threaded tweets and pull the first-level replies and RTs (excluding the ones from the author). We’ll do this using the Twitter search API as it’ll ultimately save on API calls and it puts the filtering closer to the data (I’m generally “a fan” of putting computation as close to the data as possible for any given task). If there are any, they’ll be in a replies column which can be unnested at will.

Here’s the complete function:

get_thread <- function(first_status, include_replies=FALSE, .timeline_history=3000) {

  require(rtweet, quietly=TRUE, warn.conflicts=FALSE)
  require(igraph, quietly=TRUE, warn.conflicts=FALSE)
  require(tidyverse, quietly=TRUE, warn.conflicts=FALSE)

  first_status <- if (str_detect(first_status[1], "^https?://")) basename(first_status[1]) else first_status[1]

  # get first status
  orig <- rtweet::lookup_tweets(first_status)

  # grab the author's timeline to search
  author_timeline <- rtweet::get_timeline(orig$screen_name, n=.timeline_history, since_id=first_status)

  # build a data frame containing from/to pairs (anything the author
  # replied to) that also includes the `first_status` id.
  suppressWarnings(
    dplyr::filter(
      author_timeline,
      (status_id == first_status) | (reply_to_screen_name == orig$screen_name)
    ) %>%
      dplyr::select(status_id, reply_to_status_id) %>%
      igraph::graph_from_data_frame() -> g
  ) # build a graph from this

  # decompose the graph into unique subgraphs and convert them to data frames
  igraph::decompose(g) %>%
    purrr::map(igraph::as_data_frame) -> threads_dfs

  # find the thread containing our `first_status` id

  thread_df <- purrr::keep(threads_dfs, ~any(which(unique(unlist(.x, use.names=FALSE)) == first_status)))

  # BONUS: we get them in the order we need!
  thread_order <- purrr::discard(rev(unique(unlist(thread_df))), str_detect, "NA")

  # filter out the thread from the timeline corpus & sort it
  dplyr::filter(author_timeline, status_id %in% pull(thread_df[[1]], from)) %>%
    dplyr::mutate(status_id = factor(status_id, levels=thread_order)) %>%
    dplyr::arrange(status_id) -> tweet_thread

  if (include_replies) {
    # for each status, lookup 1st-level references to it, excluding ones from the original author
    mutate(
      tweet_thread,
      replies = purrr::map(
        as.character(status_id),
        ~rtweet::search_tweets(sprintf("%s -from:%s", .x, orig$screen_name[1]))
      )
    ) -> tweet_thread
  }

  class(tweet_thread) <-  c("tweet_thread", class(tweet_thread))

  return(tweet_thread)

}

Now, if we grab this thread, the function will return the following:

xdf <- get_thread("https://twitter.com/petersagal/status/952910499825451009")

glimpse(select(xdf, 1:5))
## Observations: 10
## Variables: 5
## $ status_id   <fctr> 952910499825451009, 952910695804305408, 952911012990193664, 952911632077852679, 9529121...
## $ created_at  <dttm> 2018-01-15 14:29:02, 2018-01-15 14:29:48, 2018-01-15 14:31:04, 2018-01-15 14:33:31, 201...
## $ user_id     <chr> "14985228", "14985228", "14985228", "14985228", "14985228", "14985228", "14985228", "149...
## $ screen_name <chr> "petersagal", "petersagal", "petersagal", "petersagal", "petersagal", "petersagal", "pet...
## $ text        <chr> "Funny you mention that. I talked to Minniejean (Brown) Trickey, one of the Little Rock ...

purrr::map(xdf$text, strwrap) %>% 
  purrr::map_chr(paste0, collapse="\n") %>% 
  cat(sep="\n\n")
## Funny you mention that. I talked to Minniejean (Brown) Trickey, one of the Little Rock Nine, about
## that very day in front of CHS for my documentary, "Constitution USA." https://t.co/MRwtlfZtvp
## 
## You would think that of all people, she would be satisfied with the government's response to racism
## and hate. Ike sent the 101st Airborne to escort her to class!
## 
## But what I didn't know is that after the 101st left, CHS expelled her on a trumped up charge of
## assault after she spilled some chili on a white student.
## 
## She spilled some chili. After being tripped by another white kid. "We got rid of one of them!" the
## teachers bragged.
## 
## Then, of course, rather than continue to allow black students to attend CHS, the governor of Alabama
## closed the schools. https://t.co/2DfBEI0OTL "The_Lost_Year"
## 
## Ms Brown looked around the country post high school. She saw Jim Crow, firehoses turned on Blacks,
## the murder of the Birmingham Four and the Mississippi Three. She moved to Canada.
## 
## As of 2012, she found herself coming back to Little Rock, a place she told me she never wanted to
## see again. But she had family. And the National Historic Site center was there. She liked to drop
## by, talk to the kids about what happened.
## 
## Now she lives in Little Rock full time. She doesn't care that her name is inscribed on a bench in
## front of the school. She doesn't care that your dad welcomed her back in '99.  She spends time at
## the Center, telling people what really happened. You should go talk to her.
## 
## (Sorry: Arkansas, obviously. Typing too quickly.)
## 
## Here's me, talking to Ms Trickey and Marty Sammon, who served with the 101st at Little Rock. Buddy
## Squiers on camera. CHS is off to the left. https://t.co/ft4LUBf3sr
## 
## https://t.co/EHLbe1finj

The replies data frame looks much the same as the thread data frame — it’s essentially just another rtweet data frame, so we won’t waste electrons showing it.

While that map/map/cat sequence isn’t bad to type, it’d be more convenient if we had a print() method for this structure (this is one reason we added a class to it). It’d be even spiffier if this print() method made it easier to distinguish the main thread from the RTs/replies — but still show those extra bits of info. We’ll use the crayon package for added emphasis:

print.tweet_thread <- function(x, ...) {
  
  cat(crayon::cyan(sprintf("@%s - %s\n\n", x$screen_name[1], x$created_at[1])))
  
  if (!("replies" %in% colnames(x))) x$replies <- purrr::map(1:nrow(x), ~list())
  
  purrr::walk2(x$text, x$replies, ~{
    
    cat(crayon::green(paste0(strwrap(.x), collapse="\n")), "\n\n", sep="")
    
    if (length(.y) > 0) {
      purrr::walk2(.y$screen_name, .y$text, ~{
        sprintf("@%s\n%s", .x, .y) %>%
          strwrap(indent=8, exdent=8) %>%
          paste0(collapse="\n") %>%
          crayon::silver$italic() %>%
          cat("\n\n", sep="")
      })
    }
    
  })
  
}

Let’s re-capture the tweet thread but also include replies this time and print it out:

ydf <- get_thread("https://twitter.com/petersagal/status/952910499825451009", include_replies=TRUE)

ydf
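
If you’d rather work with those replies as data than read them in the console, the replies list-column can be flattened. A quick sketch (assuming dplyr & tidyr are loaded; the thread’s own status_id gets renamed so it doesn’t clash with the status_id column inside each reply data frame):

library(dplyr)
library(tidyr)

# flatten the list-column of reply data frames;
# threaded tweets with no replies simply drop out
ydf %>%
  select(thread_status_id = status_id, replies) %>%
  unnest(replies)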

See Also

I’ve git-chatted with Sir Kearney to see where to best put this function. I mention that as there are some upcoming posts that kick the aforeblogged tweet_shot() up a notch or two and all of this may work better in a tweetview package.

Regardless, drop a note in the comments if there are other bits of functionality or function options you think belong in get_thread().

The new year begins with me being on the hook to crank out a book on advanced web-scraping in R by July (more on that in a future blog post). The bookdown package seemed to be the best way to go about doing this, but I had only played with its toy/default examples and wanted to test out the platform with a “Hello, World”-like example of a “real” book to iron out issues and avoid more refactoring later on than I know I will have to do. I’ve been on an rtweet kick as of late (I have no idea why) and had an e-copy of O’Reilly’s 21 Recipes for Mining Twitter in their synced Dropbox folder (it was a free giveaway a few years ago), so I decided to make an rtweet version of it as a bookdown project.

You can find the GitHub repo for it here and the rendered version here. NOTE: I will likely not finish the remaining two chapters (I need to spend the time on the real book :-) but will gladly add you as a co-author if you shoot over a PR.

I began with Sean Kross’ quick start and decided to work primarily in Sublime Text and use a Makefile to manage the build process. Since the goal was to iron out kinks for a real production book, here’s a bullet list of some tips as a result of figuring out what worked for me:

  • Get Yihui Xie’s book. I have a physical copy but having either will help you when things get frustrating (and they do get frustrating at times)
  • Use git. However you instantiate the project, use git source control so you don’t lose your hard work. Note that the default .gitignore keeps some files out of version control! You may want to modify the line with *.rds in .gitignore to be a bit less brutal if you happen to generate rds files outside of the project but use them in chapter examples. Also, make sure to put other, sensitive items (like .httr-oauth) in that .gitignore to avoid having to reset credentials.
  • Use a Makefile. I like RStudio, but have far more editing tools in Sublime Text for book-ish work. Plus it has an easy build system manager, and I find it easier to navigate files. (There’s a minimal Makefile sketch right after this list.)
  • Make liberal use of code chunks. Chapter 16 has a structure that I used in many of the chapters. One block for library calls (no caching); load fonts (hidden, and primarily for PDF rendering); named, cached logical sections that go with the flow of the chapter text; custom figure dimensions to ensure they come out as desired. Caching will speed up rendering time immensely.
  • Use saved data and a mixture of echo=FALSE, eval=TRUE, echo=TRUE, eval=FALSE for things you generated outside of the book source code (because they may be long running things you don’t want to wait for even once in rendering) but want to show in the book (perhaps with slightly modified source).
  • Despite using git, create a daily compressed archive of the directory tree and stick it on Dropbox (that can be part of the Makefile). Your work is valuable and you need to make sure it’s backed up.
  • Learn about references. Yihui Xie’s book shows how to deal with in- and cross-chapter references, read and use them!
  • Use a bookdown::word_document2 vs PDF and make a custom Word template for it. The default PDF output is fine for basic things, but you’ll want to generate a better one from Word.
  • When things stop rendering properly save your recently edited files and go back in time with git to a working start. This happened to me a few times as I worked across different machines. git makes glitches almost stress free.
  • Use rsync for publishing. I need to add this to the Makefile but one, short command-line call can publish your work in seconds to a web server.
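
As promised above, here’s a minimal sketch of the kind of Makefile I’m describing. The target names, file names and server path are placeholders for your own setup (and remember: make recipes must be indented with tabs):

# render the gitbook & PDF versions of the book
all:
	Rscript -e 'bookdown::render_book("index.Rmd", "bookdown::gitbook")'
	Rscript -e 'bookdown::render_book("index.Rmd", "bookdown::pdf_book")'

# daily compressed archive of the whole project tree, dropped into Dropbox
backup:
	tar -czf ~/Dropbox/book-backups/book-$$(date +%F).tar.gz .

# one short rsync call publishes the rendered book in seconds
publish:
	rsync -avz --delete _book/ user@example.com:/var/www/book/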

I’ll likely have more tips as the year goes on and will have a follow-up post for using web server access logs to generate “kindle-like” reading statistics for your tomes.

(You can find all R⁶ posts here)

UPDATE 2018-01-01 — this has been added to rtweet (GH version).

A Twitter discussion that spawned from Maëlle’s recent look-back post turned into a quick function for capturing an image of a Tweet/thread using webshot, rtweet, magick and glue.

Pass in a status id or a twitter URL and the function will grab an image of the mobile version of the tweet.

The ultimate goal is to make a function that builds a tweet using only R and magick. This will have to do until the new year.

tweet_shot <- function(statusid_or_url, zoom=3) {

  require(glue, quietly=TRUE)
  require(rtweet, quietly=TRUE)
  require(magick, quietly=TRUE)
  require(webshot, quietly=TRUE)

  x <- statusid_or_url[1]

  is_url <- grepl("^https?://", x)

  if (is_url) {

    is_twitter <- grepl("twitter", x)
    stopifnot(is_twitter)

    is_status <- grepl("status", x)
    stopifnot(is_status)

    already_mobile <- grepl("://mobile\\.", x)
    if (!already_mobile) x <- sub("://twi", "://mobile.twi", x)

  } else {

    x <- rtweet::lookup_tweets(x)
    stopifnot(nrow(x) > 0)
    x <- glue_data(x, "https://mobile.twitter.com/{screen_name}/status/{status_id}")

  }

  tf <- tempfile(fileext = ".png")
  on.exit(unlink(tf), add=TRUE)

  webshot(url=x, file=tf, zoom=zoom)

  img <- image_read(tf)
  img <- image_trim(img)

  if (zoom > 1) img <- image_scale(img, scales::percent(1/zoom))

  img

}

Now just do one of these:

tweet_shot("947082036019388416")
tweet_shot("https://twitter.com/jhollist/status/947082036019388416")

to get a trimmed PNG image of the rendered tweet.

2017 is nearly at an end. We humans seem to need these cycles to help us on our path forward and have, throughout history, used these annual demarcation points as a time of reflection of what was, what is and what shall come next.

To that end, I decided it was about time to help quantify a part of the soon-to-be previous annum in R through the fabrication of a reusable template. Said template contains various incantations that will enable the wielder to enumerate their social contributions on:

  • StackOverflow
  • GitHub
  • Twitter
  • WordPress

through the use of a parameterized R markdown document.
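
If you haven’t used parameterized R Markdown before, the mechanics boil down to a params: block in the Rmd YAML header plus a single render() call. A minimal sketch (the parameter names below are illustrative stand-ins, not necessarily the template’s actual ones):

# illustrative parameter names; substitute whatever the template defines
rmarkdown::render(
  "2017-year-in-review.Rmd",
  params = list(
    stackoverflow_user = "hrbrmstr",
    github_user = "hrbrmstr",
    twitter_user = "hrbrmstr",
    wordpress_url = "https://rud.is/b/"
  )
)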

The result of one such execution can be found here (for those who want a glimpse of what I was publicly up to in 2017).

Want to see where you contributed the most on SO? There’s a vis for that:

What about your GitHub activity? There’s a vis for that, too:

Perhaps you just want to see your top blog posts for the year. There’s also a vis for that:

Or — maybe — you just want to see how much you blathered on Twitter. There’s even a vis for that:

Take the Rmd for a spin. File issues & PRs for things that need work and take some time to look back on 2017 with a more quantified eye than you may have in years past.

Here’s to 2018 being full of magic, awe, wonder and delight for us all!

It’s been a long time coming, but swatches is now on CRAN.

What is “swatches”?

First off, swatches has nothing to do with those faux-luxury brand Swiss-made timepieces. swatches is all about color.

R/CRAN has plenty of color picking packages. The colourlovers package by @thosjleeper is one of my favs. But, color palettes have been around for ages. Adobe has two: Adobe Color (ACO) and Adobe Swatch Exchange (ASE); GIMP has “GPL”; OpenOffice has “SOC” and KDE has the unimaginative “colors”. So. Many. Formats. Wouldn’t it be great if there were a package that read them all in with a simple read_palette() function? Well, now there is.

I threw together a fledgling version of swatches a few years ago to read in ACO files from a $DAYJOB at the time and it cascaded from there. I decided to resurrect it and get it on CRAN to support a forthcoming “year in review” post that will make its way to your RSS feeds on-or-about December 31st.

True Colors Shining Through

Let’s say you want to get ahead of the game in 2018 and start preparing to dazzle your audience by using a palette that incorporates PANTONE’s 2018 Color of the Year (yes, that’s “a thing”): Ultra Violet.

If you scroll down there, you’ll see a download link for an ASE version of the palettes. We can skip that and start with some R code:

library(swatches)
library(hrbrthemes)
library(tidyverse)

download.file("https://www.pantone.com/images/pages/21348/adobe-ase/Pantone-COY18-Palette-ASE-files.zip", "ultra_violet.zip")
unique(dirname((unzip("ultra_violet.zip"))))
## [1] "./Pantone COY18 Palette ASE files"
## [2] "./__MACOSX/Pantone COY18 Palette ASE files"


dir("./Pantone COY18 Palette ASE files")
#  [1] "PantoneCOY18-Attitude.ase"         "PantoneCOY18-Desert Sunset.ase"   
#  [3] "PantoneCOY18-Drama Queen.ase"      "PantoneCOY18-Floral Fantasies.ase"
#  [5] "PantoneCOY18-Intrigue.ase"         "PantoneCOY18-Kindred Spirits.ase" 
#  [7] "PantoneCOY18-Purple Haze.ase"      "PantoneCOY18-Quietude.ase"

Ah, if only the designers cleaned up their ZIP file.

We’ve got eight palettes to poke at, and hopefully one will be decent enough to use for our plots.

Let’s take a look:

par(mfrow=c(8,1))

dir("./Pantone COY18 Palette ASE files", full.names=TRUE) %>% 
  walk(~{
    pal_name <- gsub("(^[[:alnum:]]+-|\\.ase$)", "", basename(.x))
    show_palette(read_palette(.x))
    title(pal_name)
  })

par(mfrow=c(1,1))

I had initially thought I’d go for “Attitude”, but f.lux kicked in and “Intrigue” warmed better, so let’s go with that one.

(intrigue <- read_palette("./Pantone COY18 Palette ASE files/PantoneCOY18-Intrigue.ase"))
## PANTONE 19-4053 TCX PANTONE 17-4328 TCX PANTONE 18-3838 TCX PANTONE 18-0324 TCX PANTONE 19-3917 TCX 
##           "#195190"           "#3686A0"           "#5F4B8B"           "#757A4E"           "#4E4B51" 
## PANTONE 15-0927 TCX PANTONE 14-5002 TCX PANTONE 14-3949 TCX 
##           "#BD9865"           "#A2A2A1"           "#B7C0D7"

Having the PANTONE names is all-well-and-good, but those are going to be less-useful in a ggplot2 context due to the way factors are mapped to names in character color vectors in manual scales, so let’s head that off at the pass:

(intrigue <- read_palette("./Pantone COY18 Palette ASE files/PantoneCOY18-Intrigue.ase", use_names=FALSE))
## [1] "#195190" "#3686A0" "#5F4B8B" "#757A4E" "#4E4B51" "#BD9865" "#A2A2A1" "#B7C0D7"

Beautiful.

Let’s put our new color scale to work! We’ve got 8 colors to work with, but won’t need all of them (at least for a quick example):

ggplot(economics_long, aes(date, value)) +
  geom_area(aes(fill=variable)) +
  scale_y_comma() +
  scale_fill_manual(values=intrigue) +
  facet_wrap(~variable, scales = "free", nrow = 2, strip.position = "bottom") +
  theme_ipsum_rc(grid="XY", strip_text_face="bold") +
  theme(strip.placement = "outside") +
  theme(legend.position=c(0.85, 0.2))

This is far from a perfect palette, but it definitely helped illustrate basic package usage without inflicting ocular damage (remember: I could have picked an obnoxious Christmas palette :-)

More Practical Uses

If your workplace or the workplace you’re consulting for has brand guidelines, then they likely have swatches in one of the supported formats. Lots do.

You can keep those color swatches in their native format and try out different ones as your designers refresh their baseline styles.
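
As a sketch of that workflow (the brand file name here is made up), point read_palette() at whatever the designers ship and wire the result into manual scales:

library(swatches)
library(ggplot2)

# hypothetical swatch file from your design team
brand_cols <- read_palette("acme-brand-guidelines.ase", use_names = FALSE)

# drop-in scales that track the swatch file contents
scale_fill_brand <- function(...) scale_fill_manual(values = brand_cols, ...)
scale_colour_brand <- function(...) scale_colour_manual(values = brand_cols, ...)

When the designers refresh their styles, re-reading the file updates every plot that uses these scales.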

FIN

As always, kick the tyres, file issues, questions or PRs and hopefully the package will help refresh some designs for all of us in the coming year.

GitHub (2017-12-21 post-time) started adding obnoxious boxes to their activity feed. I use that to discover new projects/developers. While I also have it in RSS and that’s nice and compact, I do browse the activity feed directly.
Those giant boxes had to go.

If you’ve got uBlock installed, these rules filter them out:

github.com##.follow > .body > .py-3.border-gray-light.border-bottom.flex-items-baseline.d-flex > .width-full.flex-column.d-flex > .my-2.p-3.rounded-1.border

github.com##.watch_started > .body > .py-3.border-gray-light.border-bottom.flex-items-baseline.d-flex > .width-full.flex-column.d-flex > .my-2.p-3.rounded-1.border

github.com##.create > .body > .py-3.border-gray-light.border-bottom.flex-items-baseline.d-flex > .width-full.flex-column.d-flex > .my-2.p-3.rounded-1.border

github.com##.public > .body > .py-3.border-gray-light.border-bottom.flex-items-baseline.d-flex > .width-full.flex-column.d-flex > .my-2.p-3.rounded-1.border

(For first-timers, R⁶ tagged posts are short & sweet with minimal expository; R⁶ feed)

At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.

Here’s how to uniformly sample data from Apache Drill using the sergeant package:

library(sergeant)

db <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")

summarise(tbl, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##          n
##      <int>
## 1 19977415

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.01) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##        n
##    <int>
## 1 199808

mutate(tbl, r=rand()) %>% 
  filter(r <= 0.50) %>% 
  summarise(n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##         n
##     <int>
## 1 9988797

And, for groups (using a different/larger “database”):

fdns <- tbl(db, "dfs.fdns.`201708`")

summarise(fdns, n=n())
## # Source:   lazy query [?? x 1]
## # Database: DrillConnection
##            n
##        <int>
## 1 1895133100

filter(fdns, type %in% c("cname", "txt")) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##   <chr>    <int>
## 1 cname 15389064
## 2   txt 67576750

filter(fdns, type %in% c("cname", "txt")) %>% 
  group_by(type) %>% 
  mutate(r=rand()) %>% 
  ungroup() %>% 
  filter(r <= 0.15) %>% 
  count(type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type        n
##   <chr>    <int>
## 1 cname  2307604
## 2   txt 10132672

I will (hopefully) be better at cranking these bite-sized posts more frequently in 2018.

I know some folks had a bit of fun with the previous post since it exposed the fact that I left out unique MQTT client id generation from the initial 0.1.0 release of the in-development package (client ids need to be unique).

There have been some serious improvements since said post and I thought a (hopefully not-too-frequent) blog-journal of the development of this particular package might be interesting/useful to some folks, especially since I’m delving into some not-oft-blogged (anywhere) topics as I use some new tricks in this particular package.

Thank The Great Maker for C++

I’m comfortable and not-too-shabby at wrapping C/C++ things with an R bow and I felt quite daft seeing this after I had started banging on the mosquitto C interface. Yep, that’s right: it has a C++ interface. It’s waaaaay easier (in my experience) bridging C++ libraries since Dirk/Romain’s (et al, as I know there are many worker hands involved as well) Rcpp has so many tools available to do that very thing.

As an aside, if you do any work with Rcpp or want to start doing work in Rcpp, Masaki E. Tsuda’s Rcpp for Everyone is an excellent tome.

I hadn’t used Rcpp Modules before (that link goes to a succinct but very helpful post by James Curran), but they make exposing C++ library functionality even easier than what I had previously experienced. So easy, in fact, that it made it possible to whip out an alpha version of a “domain specific language” (or a pipe-able, customized API — however you want to frame these things in your head) for the package. But, I’m getting ahead of myself.

The mosquittopp class in the mosqpp namespace is much like the third bowl of porridge: just right. It’s neither too low-level nor too high-level and it was well thought out enough that it only required a bit of tweaking to use as an Rcpp Module.

First, there are more than a few char * parameters that needed to become std::strings. So, something like:

int username_pw_set(const char *username, const char *password);

becomes:

int username_pw_set(std::string username, std::string password);

in our custom wrapper class.

Since the whole point of the mqtt package is to work in R vs C[++] or any other language, the callbacks — the functions that do the work when message, publish, subscribe, etc. events are triggered — need to be in R. I wanted to have some default callbacks during the testing phase and they’re really straightforward to set up in Rcpp:

Rcpp::Environment pkg_env = Rcpp::Environment::namespace_env("mqtt");

Rcpp::Function ccb = pkg_env[".mqtt_connect_cb"];
Rcpp::Function dcb = pkg_env[".mqtt_disconnect_cb"];
Rcpp::Function pcb = pkg_env[".mqtt_publish_cb"];
Rcpp::Function mcb = pkg_env[".mqtt_message_cb"];
Rcpp::Function scb = pkg_env[".mqtt_subscribe_cb"];
Rcpp::Function ucb = pkg_env[".mqtt_unsubscribe_cb"];
Rcpp::Function lcb = pkg_env[".mqtt_log_cb"];
Rcpp::Function ecb = pkg_env[".mqtt_error_cb"];

The handy thing about that approach is that you don’t need to export the functions (it works like ::: does).
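
In R terms, that environment lookup is roughly the equivalent of:

# approximately what the pkg_env[...] lookups above do, expressed in R
get(".mqtt_connect_cb", envir = asNamespace("mqtt"))
# i.e. the same thing as mqtt:::.mqtt_connect_cb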

But the kicker is the Rcpp Module syntactic sugar:

RCPP_MODULE(MQTT) {

  using namespace Rcpp;

  class_<mqtt_r>("mqtt_r")
    .constructor<std::string, std::string, int>("id/host/port constructor")
    .constructor<std::string, std::string, int, std::string, std::string>("id/host/port/user/pass constructor")
    .constructor<std::string, std::string, int, Rcpp::Function, Rcpp::Function, Rcpp::Function>("id/host/port/con/mess/discon constructor")
    .method("connect", &mqtt_r::connect)
    .method("disconnect", &mqtt_r::disconnect)
    .method("reconnect", &mqtt_r::reconnect)
    .method("username_pw_set", &mqtt_r::username_pw_set)
    .method("loop_start", &mqtt_r::loop_start)
    .method("loop_stop", &mqtt_r::loop_stop)
    .method("loop", &mqtt_r::loop)
    .method("publish_raw", &mqtt_r::publish_raw)
    .method("publish_chr", &mqtt_r::publish_chr)
    .method("subscribe", &mqtt_r::subscribe)
    .method("unsubscribe", &mqtt_r::unsubscribe)
    .method("set_connection_cb", &mqtt_r::set_connection_cb)
    .method("set_discconn_cb", &mqtt_r::set_discconn_cb)
    .method("set_publish_cb", &mqtt_r::set_publish_cb)
    .method("set_message_cb", &mqtt_r::set_message_cb)
    .method("set_subscribe_cb", &mqtt_r::set_subscribe_cb)
    .method("set_unsubscribe_cb", &mqtt_r::set_unsubscribe_cb)
    .method("set_log_cb", &mqtt_r::set_log_cb)
    .method("set_error_cb", &mqtt_r::set_error_cb)
   ;

}

That, combined with RcppModules: MQTT in the DESCRIPTION file and an MQTT <- Rcpp::Module("MQTT") just above where you’d put an .onLoad handler, means you can do something like (internally, since it’s not exported):

mqtt_obj <- MQTT$mqtt_r

mqtt_conn_obj <- new(mqtt_obj, "unique_client_id", "test.mosquitto.org", 1883L)

and have access to each of those methods right from R (e.g. mqtt_conn_obj$subscribe(0, "topic", 0)).

If you’re careful with your C++ class code, you’ll be able to breeze through exposing functionality.

Because of the existence of Rcpp Modules, I was able to do what follows in the next section in near record time.

“The stump of a %>% he held tight in his teeth”

I felt compelled to get a Christmas reference in the post and it’s relevant to this section. I like %>%, recommend the use of %>% and use %>% in my own day-to-day R coding (it’s even creeping into internal package code, though I still try not to do that). I knew I wanted to expose a certain way of approaching MQTT workflows in this mqtt package and that meant coming up with an initial — but far from complete — mini-language or pipe-able API for it. Here’s the current thinking/implementation:

  • Setup connection parameters with mqtt_broker(). For now, it takes some parameters, but there is a URI scheme for MQTT so I want to be able to support that as well at some point.
  • Support authentication with mqtt_username_pw(). There will also be a function for dealing with certificates and other security-ish features which look similar to this.
  • Make it dead-easy to subscribe to topics and associate callbacks with mqtt_subscribe() (more on this below)
  • Support an “until you kill it” workflow with mqtt_run() that loops either forever or for a certain number of iterations
  • Support user-controlled iterations with mqtt_begin(), mqtt_loop() and mqtt_end(). An example (in a bit) should help explain this further, but this one is likely to be especially useful in a Shiny context.

Now, as hopefully came across in the previous post, the heart of MQTT message processing is the callback function. You write a function with a contractually defined set of parameters and operate on the values passed in. While we should all likely get into a better habit of using named function objects vs anonymous functions, anonymous functions are super handy, and short ones don’t cause code to get too gnarly. However, in this new DSL/API I’ve cooked up, each topic message callback has six parameters, so if you want to use an anonymous function (vs a named one) you have to do something like this in mqtt_subscribe():

mqtt_subscribe("sometopic",  function(id, topic, payload, qos, retain, con) {})

That’s not very succinct, elegant or handy. Since those are three attributes I absolutely ♥ about most things related to R, I had to do something about it.

Since I’m highly attached to the ~{} syntax introduced with purrr and now expanding across the Tidyverse, I decided to make a custom version of it for mqtt_subscribe(). As a result, the code above can be written as:

mqtt_subscribe("sometopic",  ~{})

and, you can reference id, topic, payload, etc inside those brackets without the verbose function declaration.

How is this accomplished? Via:

as_message_callback <- function(x, env = rlang::caller_env()) {
  rlang::coerce_type(
    x, rlang::friendly_type("function"),
    closure = { x },
    formula = {
      if (length(x) > 2) rlang::abort("Can't convert a two-sided formula to an mqtt message callback function")
      f <- function() { x }
      formals(f) <- alist(id=, topic=, payload=, qos=, retain=, con=)
      body(f) <- rlang::f_rhs(x)
      f
    }
  )
}

It’s a shortened version of some Tidyverse code that’s more generic in nature. That as_message_callback() function looks to see if you’ve passed in a ~{} or a named/anonymous function. If ~{} was used, that function builds a function with the contractually obligated/expected signature, otherwise it shoves in what you gave it.
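
A quick sanity check of that builder, using it exactly as defined above:

# a one-sided ~{} formula comes back as a full six-parameter callback
f <- as_message_callback(~{ cat(topic, readBin(payload, "character"), "\n") })

names(formals(f))
## [1] "id"      "topic"   "payload" "qos"     "retain"  "con"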

A code example is worth a thousand words (which is, in fact, the precise number of “words” up until this snippet, at least insofar as the WordPress editor counts them):

library(mqtt)

# We're going to subscribe to *three* BBC subtitle feeds at the same time!
#
# We'll distinguish between them by coloring the topic and text differently.

# this is a named function object that displays BBC 2's subtitle feed when it gets messages
moar_bbc <- function(id, topic, payload, qos, retain, con) {
  if (topic == "bbc/subtitles/bbc_two_england/raw") {
    cat(crayon::cyan(topic), crayon::blue(readBin(payload, "character")), "\n", sep=" ")
  }
}

mqtt_broker("makmeunique", "test.mosquitto.org", 1883L) %>% # connection info
  
  mqtt_silence(c("all")) %>% # silence all the development screen messages
  
  # subscribe to BBC 1's topic using a fully specified anonymous function
  
  mqtt_subscribe(
    "bbc/subtitles/bbc_one_london/raw",
    function(id, topic, payload, qos, retain, con) { # regular anonymous function
      if (topic == "bbc/subtitles/bbc_one_london/raw")
        cat(crayon::yellow(topic), crayon::green(readBin(payload, "character")), "\n", sep=" ")
    }) %>%
  
  # as you can see we can pipe-chain as many subscriptions as we like. the package 
  # handles the details of calling each of them. This makes it possible to have
  # very focused handlers vs lots of "if/then/case_when" impossible-to-read functions.
  
  # Ahh. A tidy, elegant, succinct ~{} function instead
  
  mqtt_subscribe("bbc/subtitles/bbc_news24/raw", ~{ # tilde shortcut function (passing in named, pre-known params)
    if (topic == "bbc/subtitles/bbc_news24/raw")
      cat(crayon::yellow(topic), crayon::red(readBin(payload, "character")), "\n", sep=" ")
  }) %>%
  
  # And, a boring, but -- in the long run, better (IMO) -- named function object
  
  mqtt_subscribe("bbc/subtitles/bbc_two_england/raw", moar_bbc) %>% # named function
  
  mqtt_run() -> res # this runs until you Ctrl-C

There’s in-code commentary, so I’ll refrain from blathering about it more here except for noting there are a staggering amount of depressing stories on BBC News and an equally staggering amount of un-hrbrmstr-like language use in BBC One and BBC Two shows. Apologies if any of the GH README.md snippets or animated screenshots ever cause offense, as it’s quite unintentional.

But you said something about begin/end/loop before?

Quite right! For that we’ll use a different example.

I came across a topic — “sfxrider/+/locations” — on broker.mqttdashboard.com. It looks like live data from folks who do transportation work for “Shadowfax Technologies” (which is a crowd-sourced transportation/logistics provider in India). It publishes the following in the payload:

| device:6170774037 | latitude:28.518363 | longitude:77.095753 | timestamp:1513539899000 |
| device:6170774037 | latitude:28.518075 | longitude:77.09555 | timestamp:1513539909000 |
| device:6170774037 | latitude:28.518015 | longitude:77.095488 | timestamp:1513539918000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539921000 |
| device:6170774037 | latitude:28.518018 | longitude:77.095492 | timestamp:1513539928000 |
| device:6170774037 | latitude:28.518022 | longitude:77.095495 | timestamp:1513539938000 |
| device:6170774037 | latitude:28.518025 | longitude:77.095505 | timestamp:1513539947000 |
| device:6170774037 | latitude:28.518048 | longitude:77.095527 | timestamp:1513539957000 |
| device:6170774037 | latitude:28.518075 | longitude:77.095573 | timestamp:1513539967000 |
| device:8690150597 | latitude:28.550963 | longitude:77.13432 | timestamp:1513539975000 |
| device:6170774037 | latitude:28.518205 | longitude:77.095603 | timestamp:1513539977000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095587 | timestamp:1513539986000 |
| device:6170774037 | latitude:28.518202 | longitude:77.095578 | timestamp:1513539996000 |
| device:6170774037 | latitude:28.5182 | longitude:77.095578 | timestamp:1513540006000 |
| device:6170774037 | latitude:28.518203 | longitude:77.095577 | timestamp:1513540015000 |
| device:6170774037 | latitude:28.518208 | longitude:77.095577 | timestamp:1513540025000 |

Let’s turn that into proper, usable, JSON (we’ll just cat() it out for this post):

library(mqtt)
library(purrr)
library(stringi)

# turn the pipe-separated, colon-delimited lines into a proper list
.decode_payload <- function(.x) {
  .x <- readBin(.x, "character")
  .x <- stri_match_all_regex(.x, "([[:alpha:]]+):([[:digit:]\\.]+)")[[1]][,2:3]
  .x <- as.list(setNames(as.numeric(.x[,2]), .x[,1]))
  .x$timestamp <- as.POSIXct(.x$timestamp/1000, origin="1970-01-01 00:00:00")
  .x
}

# do it safely as the payload in MQTT can be anything
decode_payload <- purrr::safely(.decode_payload)

# change the client id
mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_run(times = 10000) -> out

What if you wanted to do that one-by-one so you could plot the data live in a Shiny map? Well, we won’t do that in this post, but the user-controlled loop version would look like this:

mqtt_broker("makemeuique", "broker.mqttdashboard.com", 1883L) %>%
  mqtt_silence(c("all")) %>%
  mqtt_subscribe("sfxrider/+/locations", ~{
    x <- decode_payload(payload)$result
    if (!is.null(x)) {
      cat(crayon::yellow(jsonlite::toJSON(x, auto_unbox=TRUE), "\n", sep=""))
    }
  }) %>%
  mqtt_begin() -> tracker # _begin!! not _run!!

# call this individually and have the callback update a
# larger-scoped variable, Redis, or a database. You can
# also just loop with a `for` setup like this:

for (i in 1:25) mqtt_loop(tracker, timeout = 1000)

mqtt_end(tracker) # this cleans up stuff!

FIN

Whew. 1,164 words later and I hope I’ve kept your interest through it all. I’ve updated the GH repo for the package and also updated the requirements for the package in the README. I’m also working on a configure script (mimicking @opencpu’s ‘anti-conf’ approach) and found Win32 library binaries that should make this easier to get up and running on Windows, so stay tuned for the next installment and don’t hesitate to jump on board with issues, questions, comments or PRs.

The goal for the next post is to cover reading from either that logistics feed or OwnTracks and dynamically display points on a map with Shiny. Stay tuned!