Skip navigation

Category Archives: R

A long time ago, in a github repo far, far away there lived a tiny package that made it possible to create equal area, square U.S. state cartograms in R dubbed statebins?. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins

There were three different functions in the old-style package:

  • one for discrete scales (it automated ‘cuts’)
  • one for continuous scales
  • one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

library(statebins)
library(tidyverse)

data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault") +
  theme_statebins(legend_position="right")

Two things should stand out there:

  • you got scale_fill_distiller() for free!
  • labels are dark/light depending on the tile color

Before we go into ^^, it may be helpful to show the new function interface:

library(statebins)
library(tidyverse)

statebins(state_data, state_col = "state", value_col = "value",
  dark_label = "black", light_label = "white", font_size = 3,
  state_border_col = "white", state_border_size = 2,
  ggplot2_scale_function = ggplot2::scale_fill_distiller, ...)

You pass in the state name/abbreviation & value columns like the old interface but also specify colors for the dark & light labels (set hex code color with 00 ending alpha values if you don’t want labels but Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit) and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use and the named parameters for it (that’s what the ... is for).

So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger.

The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output.

But that’s not all!

There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that):

library(statebins)
library(tidyverse)
library(viridis)

data(USArrests)

# make up some data for the example

rownames_to_column(USArrests, "state") %>%
  bind_rows(
    data_frame(
      state = c("Virgin Islands", "Puerto Rico", "New York City"),
      Murder = rep(mean(max(USArrests$Murder),3)),
      Assault = rep(mean(max(USArrests$Assault),3)),
      Rape = rep(mean(max(USArrests$Rape),3)),
      UrbanPop = c(93, 95, 100)
    )
  ) -> us_arrests

statebins(us_arrests, value_col="Assault",
          ggplot2_scale_function = viridis::scale_fill_viridis) +
  labs(title="USArrests + made up data") +
  theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv(system.file("extdata", "wapostates.csv", package="statebins")))

mutate(
  adat, 
  share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4"))
) %>% 
  statebins(
    value_col = "share", 
    ggplot2_scale_function = scale_fill_brewer,
    name = "Share of workforce with jobs lost or threatened by trade"
  ) +
  labs(title = "1994-2000") +
  theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

library(statebins)
library(tidyverse)

election_2012 <- suppressMessages(read_csv(system.file("extdata", "election2012.csv", package="statebins")))

mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>% 
  statebins(
    font_size=4, dark_label = "white", light_label = "white",
    ggplot2_scale_function = scale_fill_manual,
    name = "Winner",
    values = c(Romney = "#2166ac", Obama = "#b2182b")
  ) +
  theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

library(statebins)
library(tidyverse)
data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault", round=TRUE) +
  theme_statebins(legend_position="right")

MOAR BREAKING NEWS: geom & faceting

Thomas Wood suggested that faceting wld be nice, but that would really require a Geom. So, I took a stab at one:

library(statebins)
library(cdcfluview)
library(hrbrthemes)
library(tidyverse)

flu <- ili_weekly_activity_indicators(2017)

ggplot(flu, aes(state=statename, fill=activity_level)) +
  geom_statebins() +
  coord_equal() +
  viridis::scale_fill_viridis(
    name = "ILI Activity Level  ", limits=c(0,10), breaks=0:10, option = "magma", direction = -1
  ) +
  facet_wrap(~weekend) +
  labs(title="2017-18 Flu Season ILI Activity Level") +
  theme_statebins(base_family = font_ps) +
  theme(plot.title=element_text(size=16, hjust=0)) +
  theme(plot.margin = margin(30,30,30,30))

FIN

It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package.

So, kick the tyres and don’t hesitate to shoot over some feedback!

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font iself is just different enough from Helvetica that I felt it was worth including in hrbrthemes?. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones. I don’t always set them that was as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases) thanks to the new flush_ticks()

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes)
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  scale_y_continuous(expand=c(0,0), limits=c(10, 50)) +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  facet_wrap(~class, scales="free") +
  labs(
    title="IBM Plex Sans Test",
    subtitle="This is a subtitle to see the how it looks in IBM Plex Sans",
    caption="Source: hrbrthemes & IBM"
  ) +
  theme_ipsum_ps(grid="XY", axis="xy") +
  theme(legend.position="none") -> gg

flush_ticks(gg)

As indicated, this won’t presently work for free-scales but it’s going to save me a few minutes for each plot I would have done the more manual way. But, releasing it into the wild will help catch edge cases I haven’t considered yet. In the coming weeks, I’m going to add I’ve added an option to flush_ticks() to generate theme(axis.text...) statements vs just return the plot so you can include the theme() code directly vs have to rely on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.

By now, virtually every major media outlet has covered the “280 Apocalypse”™. For those still not “in the know”, Twitter recently moved the tweet character cap to 280 after a “successful” beta test (some of us have different ideas of what “success” looks like).

I had been on a hiatus from the platform for a while and planned to (and did) jump back into the fray today but wanted to see what my timeline looked like tweet-length-wise. It’s a simple endeavour: use rtweet to grab the timeline, count the characters per-tweet and look up the results. I posted the results of said process to — of course — Twitter and some folks asked me for the code.

Others used it and there were some discussions as to why timelines looked similar (distribution-wise) with not many Tweets going over 140 characters. One posit I had was that it might be due to client-side limitations since I noted that Twitter for macOS — a terrible client they haven’t updated in ages (but there really aren’t any good ones) — still caps tweets at 140 characters. Others, like Buffer on the web, do have support for 280, so I modified the code a bit to look at the distribution by client.

Rather than bore you with my own timeline analysis, and to help the results be a tad more reproducible (which was another discussion that spawned from the tweet-length tweet), here’s a bit of code that tries to grab the last 3,000 tweets with the #rstats hashtag and plots the distribution by Twitter client:

library(rtweet)
library(ggalt)
library(rprojroot)
library(hrbrthemes)
library(tidyverse)

rt <- find_rstudio_root_file()
rstats_tweet_data_file <- file.path(rt, "data", "2017-11-13-rstats-tweet-search-results.rds")

if (!file.exists(rstats_tweet_data_file)) {
  rstats <- search_tweets("#rstats", 3000) # setting up rtweet is an exercise left to the reader
  write_rds(rstats, rstats_tweet_data_file)
} else {
  rstats <- read_rds(rstats_tweet_data_file)
}

rstats <- mutate(rstats, tweet_length=map_int(text, nchar))  # get the tweet length for each tweet

count(rstats, source) %>%
  filter(n > 5) -> usable_sources  # need data for density + I wanted a nice grid :-)

# We want max tweet length & total # of tweets for sorting & labeling facets
filter(rstats, source %in% usable_sources$source) %>%
  group_by(source) %>%
  summarise(max=max(tweet_length), n=n()) %>%
  arrange(desc(max)) -> ordr

# four breaks per panel regardless of the scales (we're using free-y scales)
there_are_FOUR_breaks <- function(limits) { seq(limits[1], limits[2], length.out=4) }

mutate(rstats) %>%
  filter(source %in% usable_sources$source) %>%
  mutate(source = factor(source, levels=ordr$source,
                         labels=sprintf("%s (n=%s)", ordr$source, ordr$n))) %>%
  ggplot(aes(tweet_length)) +
  geom_bkde(aes(color=source, fill=source), bandwidth=5, alpha=2/3) +
  geom_vline(xintercept=140, linetype="dashed", size=0.25) +
  scale_x_comma(breaks=seq(0,280,70), limits=c(0,280)) +
  scale_y_continuous(breaks=there_are_FOUR_breaks, expand=c(0,0)) +
  facet_wrap(~source, scales="free", ncol=5) +
  labs(x="Tweet length", y="Density",
       title="Tweet length distributions by Twitter client (4.5 days #rstats)",
       subtitle="Twitter client facets in decreasing order of ones with >140 length tweets",
       caption="NOTE free Y axis scales\nBrought to you by #rstats, rtweet & ggalt") +
  theme_ipsum_rc(grid="XY", strip_text_face="bold", strip_text_size=8, axis_text_size=7) +
  theme(panel.spacing.x=unit(5, "pt")) +
  theme(panel.spacing.y=unit(5, "pt")) +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(axis.text.y=element_blank()) +
  theme(legend.position="none")

FIN

While the 140 barrier has definitely been broken, it has not been abused (yet) but the naive character counting is also not perfect since it looks like it doesn’t “count” the same way Twitter-proper does (image “attachments”, as an example, are counted as characters here here but they aren’t counted that way in Twitter clients). Bots are also counted as Twitter clients.

It’ll be interesting to track this in a few months as folks start to inch-then-blaze past the former hard-limit.

Give the code (or use your timeline info) a go and post a link with your results! You can find an RStudio project directory over on GitHub ?.

Working remotely has many benefits, but if you work remotely in an area like, say, rural Maine, one of those benefits is not massively speedy internet connections. Being able to go fast and furious on the internet is one of the many things I miss about our time in Seattle and it is unlikely that we’ll be seeing Google Fiber in my small town any time soon. One other issue is that residential plans from evil giants like Comcast come with things like “bandwidth caps”. I suspect many WFH-ers can live within those limits, but I work with internet-scale data and often shunt extracts or whole datasets to the DatCave™ server farm for local processing. As such, I pay an extra penalty as a Comcast “Business-class” user that has little benefit besides getting slightly higher QoS and some faster service response times when there are issues.

Why go into all that? Well, I recently decided to bump up the connection from 100 Mb/s to 150 Mb/s (and managed to do so w/o increasing the overall monthly bill at all) but wanted to have a way to regularly verify I am getting what I’m paying for without having to go to an interactive “speed test” site.

There’s a handy speedtest-cli?, which is a python-based module with a command-line interface that can perform speed tests against Ookla’s legacy server. (You’ll notice that link forwards you to their new socket-based test service; neither speedtest-cli nor the code in the package being shown here uses that yet)

I run plenty of ruby, python, node and go (et al) programs on the command-line, but I wanted a way to measure this in R-proper (storing individual test results) on a regular basis as well as run something from the command-line whenever I wanted an interactive test. Thus begat speedtest?.

Testing the Need for Speed

After you devtools::install_github("hrbrmstr/speedtest") the package, you can either type speedtest::spd_test() at an R console or Rscript --quiet -e 'speedtest::spd_test()' on the command-line to see something like the following:

What you see there is a short-hand version of what’s available in the package:

  • spd_best_servers: Find “best” servers (latency-wise) from master server list
  • spd_closest_servers: Find “closest” servers (geography-wise) from master server list
  • spd_compute_bandwidth: Compute bandwidth from bytes transferred and time taken
  • spd_config: Retrieve client configuration information for the speedtest
  • spd_download_test: Perform a download speed/bandwidth test
  • spd_servers: Retrieve a list of SpeedTest servers
  • spd_upload_test: Perform an upload speed/bandwidth test
  • spd_test: Test your internet speed/bandwidth

The general idiom is to grab the test configuration file, collect the master list of servers, figure out which servers you’re going to test against and perform upload/download tests + collect the resultant statistics:

library(speedtest)
library(stringi)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

config <- spd_config()

servers <- spd_servers(config=config)
closest_servers <- spd_closest_servers(servers, config=config)
only_the_best_severs <- spd_best_servers(closest_servers, config)

glimpse(spd_download_test(closest_servers[1,], config=config))
## Observations: 1
## Variables: 15
## $ url     <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat     <dbl> 42.3875
## $ lng     <dbl> -71.1
## $ name    <chr> "Somerville, MA"
## $ country <chr> "United States"
## $ cc      <chr> "US"
## $ sponsor <chr> "Axcelx Technologies LLC"
## $ id      <chr> "5960"
## $ host    <chr> "speed0.xcelx.net:8080"
## $ url2    <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min     <dbl> 14.40439
## $ mean    <dbl> 60.06834
## $ median  <dbl> 55.28457
## $ max     <dbl> 127.9436
## $ sd      <dbl> 34.20695

glimpse(spd_upload_test(only_the_best_severs[1,], config=config))
## Observations: 1
## Variables: 18
## $ ping_time      <dbl> 0.02712567
## $ total_time     <dbl> 0.059917
## $ retrieval_time <dbl> 2.3e-05
## $ url            <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat            <dbl> 42.3875
## $ lng            <dbl> -71.1
## $ name           <chr> "Somerville, MA"
## $ country        <chr> "United States"
## $ cc             <chr> "US"
## $ sponsor        <chr> "Axcelx Technologies LLC"
## $ id             <chr> "5960"
## $ host           <chr> "speed0.xcelx.net:8080"
## $ url2           <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min            <dbl> 6.240858
## $ mean           <dbl> 9.527599
## $ median         <dbl> 9.303148
## $ max            <dbl> 12.56686
## $ sd             <dbl> 2.451778
```

Spinning More Wheels

The default operation for each test function is to return summary statistics. If you dig into the package source, or look at how the legacy measuring site performs the speed test tasks, you’ll see that the process involves uploading and downloading a series of files (or, more generally, random data) of increasing size and calculating how long it takes to perform those actions. Smaller size tests will have skewed results on fast connections with low latency, which is why the sites do the incremental tests (and why you see the “needles” move the way they do on interactive tests). We can disable summary statistics and retrieve all of results of individual tests (as we’ll do in a bit).

Speed tests are also highly dependent on the processing capability of the target server as well as the network location (hops away), “quality” and load. You’ll often see interactive sites choose the “best” server. There are many ways to do that. One is geography, which has some merit, but should not be the only factor used. Another is latency, which is comparing a small connection test against each server and measuring which one comes back the fastest.

It’s straightforward to show the impact of each with a simple test harness. We’ll pick 3 geographic “closest” geographic servers, 3 “best” latency servers (those may overlap with “closest”) and 3 randomly chosen ones:

set.seed(8675309)

bind_rows(

  closest_servers[1:3,] %>%
    mutate(type="closest"),

  only_the_best_severs[1:3,] %>%
    mutate(type="best"),

  filter(servers, !(id %in% c(closest_servers[1:3,]$id, only_the_best_severs[1:3,]$id))) %>%
    sample_n(3) %>%
    mutate(type="random")

) %>%
  group_by(type) %>%
  ungroup() -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 9 x 5
##                   sponsor            name       country
##                     <chr>           <chr>         <chr>
## 1 Axcelx Technologies LLC  Somerville, MA United States
## 2                 Comcast      Boston, MA United States
## 3            Starry, Inc.      Boston, MA United States
## 4 Axcelx Technologies LLC  Somerville, MA United States
## 5 Norwood Light Broadband     Norwood, MA United States
## 6       CCI - New England  Providence, RI United States
## 7                 PirxNet         Gliwice        Poland
## 8           Interoute VDC Los Angeles, CA United States
## 9                   UNPAD         Bandung     Indonesia
## # ... with 2 more variables: host <chr>, type <chr>

As predicted, there are some duplicates, but we’ll perform upload and download tests for each of them and compare the results. First, download:

map_df(1:nrow(to_compare), ~{
  spd_download_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> dl_results_full

mutate(dl_results_full, type=stri_trans_totitle(type)) %>%
  ggplot(aes(type, bw, fill=type)) +
  geom_quasirandom(aes(size=size, color=type), width=0.15, shape=21, stroke=0.25) +
  scale_y_continuous(expand=c(0,5), labels=c(sprintf("%s", seq(0,150,50)), "200 Mb/s"), limits=c(0,200)) +
  scale_size(range=c(2,6)) +
  scale_color_manual(values=c(Random="#b2b2b2", Best="#2b2b2b", Closest="#2b2b2b")) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Download bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

I’m going to avoid hammering really remote servers with the upload test (truth-be-told I just don’t want to wait as long as I know it’s going to take):

bind_rows(
  closest_servers[1:3,] %>% mutate(type="closest"),
  only_the_best_severs[1:3,] %>% mutate(type="best")
) %>%
  distinct(.keep_all=TRUE) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 6 x 5
##                   sponsor           name       country
##                     <chr>          <chr>         <chr>
## 1 Axcelx Technologies LLC Somerville, MA United States
## 2                 Comcast     Boston, MA United States
## 3            Starry, Inc.     Boston, MA United States
## 4 Axcelx Technologies LLC Somerville, MA United States
## 5 Norwood Light Broadband    Norwood, MA United States
## 6       CCI - New England Providence, RI United States
## # ... with 2 more variables: host <chr>, type <chr>

map_df(1:nrow(to_compare), ~{
  spd_upload_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> ul_results_full

ggplot(ul_results_full, aes(x="Upload Test", y=bw)) +
  geom_quasirandom(aes(size=size, fill="col"), width=0.1, shape=21, stroke=0.25, color="#2b2b2b") +
  scale_y_continuous(expand=c(0,0.5), breaks=seq(0,16,4),
                     labels=c(sprintf("%s", seq(0,12,4)), "16 Mb/s"), limits=c(0,16)) +
  scale_size(range=c(2,6)) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Upload bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

As indicated, data payload size definitely impacts the speed tests (as does “proximity”), but you’ll also see variance if you run the same tests again with the same target servers. Note, also, that the “command-line” ready function tries to make the quickest target selection and only chooses the max values.

FIN

The github site has a list of TODOs and any/all are welcome to pile on (full credit for any contributions, as usual). So kick the tyres, file issues and start measuring!

Unlike @noamross, I am not an epidemiologist (NOTE: Noam battles pandemics before breakfast, so be super nice to him) but I do like to find kindred methodologies in other disciplines to help foster the growth of cybersecurity into something beyond it’s current Barnum & Bailey state. I also love finding and exposing hidden APIs and especially enjoy killing Adobe Flash. How does all that relate to cdcfluview?

cdcfluview? was one of my first R packages. Someone, somwewhere, was trying to do something with Selenium to automate downloading of data from the CDC’s FluView Portal. It was — and, some of it still is — a Flash-based site that locked up useful data behind application screens that did little more than burn ones retinas and force folks to keep Flash alive and, hence, their browsers insecure.

Rather than let the requester suffer under the weight of a pretty significant external dependency, I used the magic of the browser “developer tools” inspector to see that it was making fairly innocuous and useful XHR requests for real data. The package sat on GitHub for a while and eventually made its way to CRAN.

Times change and Flash is dying, so the CDC paid some serious benjamins to have the site re-done in HTML, replicating the horrible UX and terrible visualizations (so. many. pie. charts.). Said revamp also caused changes to the back-end APIs and forced breaking changes. Craig McGowan jumped to the rescue and fixed some core functionality issues, but so much changed — and so much was added — that I felt it was time for a modern re-write of the cdcfluview package.

This is a pretty solid, real-world example of how dangerous it is to rely on hidden APIs. If Craig hadn’t both notified me and gone the extra mile to make a PR, I’d’ve been in the dark until I tried to commiserate (I always seem get the flu no matter what I do) with code and found my package erroring out.

Enter: cdcfluview 0.7.0.

What’s Different?

Unfortunately, everything; which is one reason I’m writing this post.

First, to have folks that are using current-gen cdcfluview kick the tyres and let me know (via issues) if you need any old API compatibility back. This isn’t anywhere near the most popular package on CRAN, but it does have users (even, I’m told, within the CDC) and I want to make sure I do as little to disrupt them as possible. But, the current package API maps much more closely to the way the revamped portal works and presents data, so I’m hoping it’s a good net-new vs crushing blow to productivity.

Speaking of maps, the package now has actual maps! A new cdc_basemap() function returns the GeoJSON files that the CDC uses in their web views as sf objects. And, there are tons of maps and multi-labeled features to tie data to:

Here’s what’s in the tin:

  • age_group_distribution: Age Group Distribution of Influenza Positive Tests Reported by Public Health Laboratories
  • cdc_basemap: Retrieve CDC U.S. Basemaps
  • geographic_spread: State and Territorial Epidemiologists Reports of Geographic Spread of Influenza
  • hospitalizations: Laboratory-Confirmed Influenza Hospitalizations
  • ilinet: Retrieve ILINet Surveillance Data
  • ili_weekly_activity_indicators: Retrieve weekly state-level ILI indicators per-state for a given season
  • pi_mortality: Pneumonia and Influenza Mortality Surveillance
  • state_data_providers: Retrieve metadata about U.S. State CDC Provider Data
  • surveillance_areas: Retrieve a list of valid sub-regions for each surveillance area.
  • who_nrevss: Retrieve WHO/NREVSS Surveillance Data
  • mmwr_week: Convert a Date to an MMWR day+week+year
  • mmwr_weekday: Convert a Date to an MMWR weekday
  • mmwr_week_to_date: Convert an MMWR year+week or year+week+day to a Date object

Plus there’s a new data object mmwrid_map that makes it super-easy to convert arcane MMWR identifiers to real date objects.

The README has plenty of charts and examples, so I won’t take up post-space with said code or images.

Curiously Enough

Along the way, I was able to discern that there’s a hidden layer of this new, hidden API. Exposing said layer should be as easy as figuring out the right keyword and I’m hoping a bit of fuzzing will do the trick on that. It will be interesting to see what extra data that unlocks. (Yes, I just said relying on hidden APIs is dangerous; and, relying on hidden, hidden APIs is doubly so. I’m just a glutton for punishment.)

I was also able to discern that multiple people or teams worked on this revamp and said folks did not communicate with each other. The per-app APIs are woefully inconsistent. Furthermore, someone goofed and forgot to expose some pretty critical information from a few data retrieval operations (said data is also missing on the clickable download versions, too). Hopefully they’ll be addressing the issue soon (the site is technically in beta release).

FIN

If you’ve been a user of cdcfluview please give the new API a try and file issues with anything you see. All contributors — testers, modders, enhancers — will get full DESCRIPTION credit (so, please also include how you’d like to be cited).

Finally, please do check out the CDC FluView Portal. It’s gosh awful horribad. I know there are some spiffy Shiny experts out there who could run rings around that portal and I’ll be glad to add you as a collaborator if you contribute a Shiny app (or two!) to the package. If you’d rather go your own route with a self-contained, self-published package, just let me know what API changes you’d like and I’ll gladly accommodate. The goal is to help epidemiologists and other researchers keep us all safe.

So, go get your flu shot!!! Then, kick the tyres on this package update and don’t hesitate to convey your criticisms, patches or accolades.

Now to get to my promised final review of cyphr (I’ve not forgotten @ma_salmon ;-)

Despite my week-long Twitter consumption sabbatical (helped — in part — by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. Said gem is the text of a very recent ruling from a District Court in Texas and deals with a favourite subject of mine: robots.txt.

The background of the case is that there were two parties who both ran websites for oil and gas professionals that include job postings. One party filed a lawsuit against the other asserting that the they hacked into their system and accessed and used various information in violation of the Computer Fraud and Abuse Act (CFAA), the Stored Wire and Electronic Communications and Transactional Records Access Act (SWECTRA), the Racketeer Influenced and Corrupt Organizations Act (RICO), the Texas Harmful Access by Computer Act (THACA), the Texas Theft Liability Act (TTLA), and the Texas Uniform Trade Secrets Act (TUTS). They also asserted common law claims of misappropriation of confidential information, conversion, trespass to chattels, fraud, breach of fiduciary duty, unfair competition, tortious interference with present and prospective business relationships, civil conspiracy, and aiding and abetting.

The other party filed a motion for dismissal on a number of grounds involving legalese on Terms & Conditions, a rebuttal of CFAA claims and really gnarly legalese around copyrights. There are more than a few paragraphs that make me glad none of my offspring have gone into or have a desire to go into the legal profession. One TLDR here is that T&Cs do, in fact, matter (though that is definitely dependent upon the legal climate where you live or have a case filed against you). We’re going to focus on the DMCA claim which leads us to the robots.txt part.

I shall also preface the rest with “IANAL”, but I don’t think a review of this case requires a law degree.

Command-Shift-R

To refresh memories (or create lasting new ones), robots.txt is a file that is placed at the top of a web site domain (i.e. https://rud.is/robots.txt) that contains robots exclusion standard rules. These rules tell bots (NOTE: if you write a scraper, you’ve written a scraping bot) what they can or cannot scrape and what — if any — delay should be placed between scraping efforts by said bot.

R has two CRAN packages for dealing with these files/rules: robotstxt by Peter Meissner and spiderbar by me. They are not competitors, but are designed much like Reese’s Peanut Butter cups — to go together (though Peter did some wicked good testing and noted a possible error in the underlying C++ library I use that could generate Type I or Type II in certain circumstances) and each has some specialization. I note them now because you don’t have an excuse not to check robots.txt given two CRAN packages being available. Python folks have (at a minimum) robotparser and reppy. Node, Go and other, modern languages all have at least one module/library/package available as well. No. Excuses.

Your Point?

(Y’all are always in a rush, eh?)

This October, 2017 Texas ruling references a 2007 ruling by a District Court in Pennsylvania. I dug in a bit through searchable Federal case law for mentions of robots.txt and there aren’t many U.S. cases that mention this control, though I am amused a small cadre of paralegals had to type robots.txt over-and-over again.

The dismissal request on the grounds that the CFAA did not apply was summarily rejected. Why? The defendant provided proof that they monitor for scraping activity that violates the robots.txt rules and that they use the Windows Firewall (ugh, they use Windows for web serving) to block offending IP addresses when they discover them.

Nuances came out further along in the dismissal text noting that user-interactive viewing of the member profiles on the site was well-within the T&Cs but that the defendant “never authorized [the use of] automated bots to download over 500,000 profiles” nor to have that data used for commercial purposes.

The kicker (for me, anyway) is the last paragraph of the document in the Conclusion where the defendant asserts that:

  • robots.txt is in fact a bona-fide technological measure to effectively control access to copyright materials
  • the “Internet industry” (I seriously dislike lawyers for wording just like that) has recognized robots.txt as a standard for controlling automated access to resources
  • robots.txt has been a valid enforcement mechanism since 1994

The good bit is: -“Whether it actually qualifies in this case will be determined definitively at summary judgment or by a jury.”_ To me, this sounds like a ruling by a jury/judge in favor of robots.txt could mean that it becomes much stronger case law for future misuse claims.

With that in mind:

  • Site owners: USE robots.txt, if — for no other reason — to aid legitimate researchers who want to make use of your data for valid scientific purposes, education or to create non-infringing content or analyses that will be a benefit to the public good. You can also use it to legally protect your content (but there are definitely nuances around how you do that).
  • Scrapers: Check and obey robots.txt rules. You have no technological excuse not to and not doing so really appears that it could come back to haunt you in the very near future.

FIN

I’ve setup an alert for when future rulings come out for this case and will toss up another post here or on the work-blog (so I can run it by our very-non-skeezy legal team) when it pops up again.

“Best Friends” image by Andy Kelly. Used with permission.

This past weekend, violent windstorms raged through New England. We — along with over 500,000 other Mainers — went “dark” in the wee hours of Monday morning and (this post was published on Thursday AM) we still have no utility-provided power nor high-speed internet access. The children have turned iFeral, and being a remote worker has been made that much more challenging. Earlier in the week, even making a cellular phone call (not an Google Voice or other VoIP-ish call, just pressing buttons in the phone “app” in iOS) resulted in an “All circuits are busy” message vs human contact. (I had to repair our generator at some point between then and now, but it’s all a blur at this point).

Late Tuesday night, we checked the Central Maine Power outage info and were greeted with a “November 4, 2017” estimate. After regaining composure, we doubled down on the fact that we’d be extreme indoor glamping for a while longer.

As noted, I cope by coding and have a history of scraping Central Maine Power’s site for outage info. I ceased when I discovered the site & twitter bot I mentioned in a recent post, as it does that for the entirety of the U.S. (though many power companies continue make it difficult to scrape their outage info).

However, I wanted to see just how many other streets were in the same position as we are (I should note that less than a mile from us there are folks with power and internet, due mostly to their proximity to “vital” resources and how screwed up the Maine power grid is). Rather than reuse existing code, I wanted to make a modern, tidyverse edition of scrapers. If you follow enough paths in the aforementioned outage site, you’ll see that you eventually come to a page with a pretty ugly iframe that lets you poke around counties and towns. The following code traverses that tree to get street-level outage info:

library(rvest)
library(stringi)
library(hrbrthemes)
library(tidyverse)

# helper to make numbers from comma strings; i still find it amusing that
# this has never been a core "S" or base "R" function given that the
# central goal of both languages are to work with data
to_num <- function(x) { as.numeric(stri_replace_all_fixed(x, ",", "")) }

# top of the tree
pg <- read_html("http://www3.cmpco.com/OutageReports/CMP.html")

# basic idiom all the way down is to find the links to traverse until we get to
# the street level data, plucking the timestamp of the CMP report along the way
html_nodes(pg, "a") %>% 
  map_df(~{
    
    county <- stri_trans_totitle(html_text(.x))
    cpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
    message(sprintf("Processing %s...", county))
    
    html_nodes(cpg, xpath=".//td[not(@colspan = '2') and not(@align = 'right')][1]/a") %>% 
      map_df(~{
        
        town <- stri_trans_totitle(html_text(.x))
        tpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
        message(sprintf("  - %s", town))
    
        html_node(tpg, xpath=".//td[contains(., 'Update:')]") %>%
          html_text() %>%
          stri_replace_first_regex("Update: ", "") %>%
          as.POSIXct("%b %d, %Y %I:%M %p", tz="America/New_York") -> ts
        
        html_node(tpg, "table") %>% 
          html_table() %>% 
          docxtractr::assign_colnames(3) %>%  
          docxtractr::mcga() %>% # in github version
          mutate(street = stri_trans_totitle(street)) %>% 
          mutate_at(vars(-estimated_restoration, -street), .funs=to_num) %>% 
          filter(!is.na(total_customersby_street)) %>% 
          mutate(timestamp = ts) %>% 
          mutate(county = county) %>% 
          mutate(town = town) %>% 
          tbl_df()
      })
    
  }) -> xdf

xdf <- mutate(xdf, estimated_restoration = as.POSIXct(estimated_restoration, "%b %d, %Y %I:%M %p", tz="America/New_York"))

xdf
## # A tibble: 10,601 x 7
##           street total_customersby_street customerswithout_power estimated_restoration           timestamp       county   town
##            <chr>                    <dbl>                  <dbl>                <dttm>              <dttm>        <chr>  <chr>
##  1        2Nd St                       59                     14                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  2        3Rd St                      128                     53                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  3        4Th St                       89                     15                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  4        5Th St                       56                      3                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  5     Adams Ave                        4                      4   2017-11-03 19:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  6     Allain St                        8                      8                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  7     Atwood St                        6                      3   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  8    Baxter Ave                       32                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  9     Beaver Rd                        9                      4   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## 10 Bellflower Dr                       10                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## # ... with 10,591 more rows

One truly upsetting revelation from data is the number of folks still in an “Assessing” condition (i.e. no restoration time estimate):

filter(xdf, is.na(estimated_restoration)) %>% 
  summarise(streets = n(), customers_impacted = sum(total_customersby_street))
## # A tibble: 1 x 2
##   streets customers_impacted
##     <int>              <dbl>
## 1    2255              42067

I’m thankful (for them and us) that Winter has not yet hit and that the weather has been and is going to be sufficiently mild to not make things life-threatening for most folks (it does get cold in northern Maine at this time of year).

It’s About Time

We can get an overview of when things are slated to get better for everyone but the folks I just mentioned:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>%
  count(day) %>% 
  ggplot(aes(day, n)) +
  geom_col() +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma() +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y")

It seems that most of us are in the same “November 4th” bucket. But, we can also see that Central Maine Power’s data curation leaves much to be desired since there should be no dates in the past in that chart, but there are.

With the scraping data above, we can explore the outage info in many ways, but — as time and bandwidth are precious commodities — I’ll leave you with the total number of customers still without power:

count(xdf, wt = customerswithout_power) %>% pull(n)
## [1] 153465

and, a county-level view of the outage:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>% 
  count(county, day) %>% 
  complete(county, day, fill=list(n=0)) %>% 
  filter(day >= Sys.Date()) %>% 
  ggplot(aes(day, n)) +
  geom_segment(aes(xend=day, yend=0), color="steelblue", size=4) +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma(limits=c(0,1250)) +
  facet_wrap(~county, scales="free", ncol=5) +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates by County",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y", strip_text_face = "bold", axis="xy") +
  theme(panel.spacing.x=unit(3, "lines")) +
  theme(panel.spacing.y=unit(2, "lines"))

FIN

In a way, I wish I had continued scraping CMP data since the power outages site I mentioned doesn’t seem to provide access to the raw data and getting a historical perspective of the outage locations and analyzing by geography and other demographics might be interesting.

Hopefully the scraping code will be useful for some folks. It was definitely therapeutic for me :-)

As many folks know, I live in semi-rural Maine and we were hit pretty hard with a wind+rain storm Sunday to Monday. The hrbrmstr compound had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday and still does not as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2]
    mul <- case_when(
      stri_detect_regex(val, "[Kk]") ~ 1000,
      stri_detect_regex(val, "[Mm]") ~ 1000000,
      TRUE ~ 1
    ) 
    val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "")
    as.numeric(val) * mul 
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change but you can tweak this if they do):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% 
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% 
  mutate(number_out = words_to_num(text)) %>%  
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us <https://twitter.com/PowerOutage_us>"
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(24, 18, 16, 14, 12),
    card = c(22, 16, 14, 12, 10)
  ) -> font_sizes
  
  theme_ipsum_rc(
    grid = grid, 
    plot_title_size = font_sizes[1], 
    subtitle_size = font_sizes[2], 
    axis_title_size = font_sizes[3], 
    axis_text_size = font_sizes[4], 
    caption_size = font_sizes[5]
  )
  
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (now) use rtweet as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may opt-in to bypass the current method since it relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140c compliance) plus choose a style (stream or card, defaulting to stream) and decide on whether you’re cool with the “retina” default.

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It als returns the magick object it creates in the event you want to do something more with it:

gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"), 
                     retina=TRUE, send = FALSE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(w=1024, h=512),
    card = c(w=800, h=320)
  ) -> dims
  
  dims["res"] <- 72
  
  if (retina) dims <- dims * 2
  
  fig <- image_graph(width=dims["w"], height=dims["h"], res=dims["res"])
  print(g)
  dev.off()
  
  if (send) {
    
    message("Posting image to twitter...")
    
    tf <- tempfile(fileext = ".png")
    image_write(fig, tf, format="png")
    
    # Create an app at apps.twitter.com w/callback URL of http://127.0.0.1:1410
    # Save the app name, consumer key and secret to the following
    # Environment variables
    
    app <- oauth_app(
      appname = Sys.getenv("TWITTER_APP_NAME"),
      key = Sys.getenv("TWITTER_CONSUMER_KEY"),
      secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
    )
    
    twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), app)
    
    POST(
      url = "https://api.twitter.com/1.1/statuses/update_with_media.json",
      config(token = twitter_token), 
      body = list(
        status = status,
        media = upload_file(path.expand(tf))
      )
    ) -> res
    
    warn_for_status(res)
    
    unlink(tf)
    
  }
  
  fig
  
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in-fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me for anything for a while.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized, it was hastily crafted).