
Earlier this year, I made a package that riffed off of ProPublica’s really neat voting cartograms (maps) for the U.S. House and Senate. You can see one for disaster relief spending in the House and one for the ACA “Skinny Repeal” in the Senate.

We can replicate both here with the voteogram package (minus the interactivity, for now):

library(voteogram)
library(ggplot2)

hr_566 <- roll_call(critter="house", number=115, session=1, rcall=566)

house_carto(hr_566) +
  coord_equal() +
  theme_voteogram()

sen_179 <- roll_call(critter="senate", number=115, session=1, rcall=179)

senate_carto(sen_179) +
  coord_equal() +
  theme_voteogram()

I think folks might have more fun with the roll_call() objects though:

str(hr_566)
## List of 29
##  $ vote_id              : chr "H_115_1_566"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 566
##  $ needed_to_pass       : int 282
##  $ date_of_vote         : chr "October 12, 2017"
##  $ time_of_vote         : chr "03:23 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "2/3 YEA-AND-NAY"
##  $ question             : chr "On Motion to Suspend the Rules and Agree"
##  $ description          : chr "Providing for the concurrence by the House in the Senate amendment to H.R. 2266, with an amendment"
##  $ nyt_title            : chr "On Motion to Suspend the Rules and Agree"
##  $ total_yes            : int 353
##  $ total_no             : int 69
##  $ total_not_voting     : int 11
##  $ gop_yes              : int 164
##  $ gop_no               : int 69
##  $ gop_not_voting       : int 7
##  $ dem_yes              : int 189
##  $ dem_no               : int 0
##  $ dem_not_voting       : int 5
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "Yes"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##   ..$ bioguide_id         : chr [1:435] "A000374" "A000370" "A000055" "A000371" ...
##   ..$ role_id             : int [1:435] 274 294 224 427 268 131 388 320 590 206 ...
##   ..$ member_name         : chr [1:435] "Ralph Abraham" "Alma Adams" "Robert B. Aderholt" "Pete Aguilar" ...
##   ..$ sort_name           : chr [1:435] "Abraham" "Adams" "Aderholt" "Aguilar" ...
##   ..$ party               : chr [1:435] "R" "D" "R" "D" ...
##   ..$ state_abbrev        : chr [1:435] "LA" "NC" "AL" "CA" ...
##   ..$ display_state_abbrev: chr [1:435] "La." "N.C." "Ala." "Calif." ...
##   ..$ district            : int [1:435] 5 12 4 31 12 3 2 19 36 2 ...
##   ..$ position            : chr [1:435] "Yes" "Yes" "Yes" "Yes" ...
##   ..$ dw_nominate         : num [1:435] 0.493 -0.462 0.36 -0.273 0.614 0.684 0.388 NA 0.716 NA ...
##   ..$ pp_id               : chr [1:435] "LA_5" "NC_12" "AL_4" "CA_31" ...
##  - attr(*, "class")= chr [1:2] "pprc" "list"

as they hold tons of info on the votes.
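Since the summary fields are just list elements (the str() output above shows their names), it's easy to pull them into a tidy one-row summary. Here's a minimal sketch using the hr_566 object from above:

library(tibble)

# one-row summary built from the top-level pprc fields shown above
tibble(
  vote_id = hr_566$vote_id,
  result = hr_566$result,
  yes = hr_566$total_yes,
  no = hr_566$total_no,
  not_voting = hr_566$total_not_voting
)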

I need to explore the following a bit more but there are some definite “patterns” in the way the 115th Senate has voted this year:

library(hrbrthemes)
library(tidyverse) # for purrr::map(), dplyr & readr::write_rds() used below

# I made a mistake in how I exposed these that I'll correct next month
# but we need to munge it a bit anyway for this view
fills <- voteogram:::vote_carto_fill
names(fills) <- tolower(names(fills))

rcalls <- map(1:280, ~voteogram::roll_call(critter="senate", session=1, number=115, rcall=.x))
# save it off so you don't have to waste those calls again
write_rds(rcalls, "2017-115-1-sen-280-roll-calls.rds")

# do a bit of wrangling
map_df(rcalls, ~{
  mutate(.x$votes, vote_id = .x$vote_id) %>% 
    arrange(party, position) %>% 
    mutate(fill = tolower(sprintf("%s-%s", party, position))) %>% 
    mutate(ques = .x$question) %>% 
    mutate(x = 1:n())
}) -> votes_df

# plot it
ggplot(votes_df, aes(x=x, y=vote_id, fill=fill)) +
  geom_tile() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  scale_fill_manual(name=NULL, values=fills) +
  labs(x=NULL, y=NULL, title="Senate Roll Call Votes",
       subtitle="2017 / 115th, Session 1, Votes 1-280",
       caption="Note free-Y scales") +
  facet_wrap(~ques, scales="free_y", ncol=3) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position="right")
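If you prefer a numeric view of those same patterns, a quick tally of the party/position fill categories per question type (using the votes_df we just built) is a start:

count(votes_df, ques, fill, sort=TRUE)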

Hopefully I’ll get some time to dig into the differences and report on anything interesting. If you get to it before me definitely link to your blog post in a comment!

FIN

I still want to make an htmlwidgets version of the plots and also add the ability to get the index of roll call votes by Congress number and session to make it easier to iterate.

I’m also seriously considering creating different palettes. I used the ones from the source interactive site but am not 100% happy with them. Suggestions/PRs welcome.

Hopefully this package will make it easier for U.S. folks to track what’s going on in Congress and keep their representatives more accountable to the truth.

Everything’s on GitHub so please file issues, questions or PRs there.

NOTE: This is mainly for those of us in the Colonies, but some tips apply globally.

Black Friday / Cyber Monday / Cyber November / Holiday Shopping is upon us. You’re going to buy stuff. You’re going to use digital transactions to do so. Here are some tips in a semi-coherent order:

  • Sign up for a “reputable” credit card (is there such a thing? FinServs are pretty evil) with a low interest rate/cash back, multi-factor authentication on their web/app, a limit on total credit and a per-transaction limit. This card is just for shopping. Pay for petrol and groceries with something else.
  • Assign that to your PayPal, Amazon, Apple Pay, et al accounts and keep that as your only physical & digital card for your shopping sprees until the season ends.
  • Set up multi-factor auth on PayPal, Amazon, Apple Pay and anywhere else you shop. Don’t shop where you can’t do this.
  • Use Amazon or a site that accepts PayPal, Apple Pay, or Amazon payments. Yes, all those orgs are evil. But they do a better job than most when it comes to account security.
  • Use Quantum Firefox or the latest Chrome Betas to shop online. Nothing else. Check for updates daily & apply when they are out.
  • Double-check URLs when shopping. Make sure you’re on the site you want to be on. Let’s Encrypt made it super easy for attackers to pwn you this season. You can afford an extra 5 minutes since that’ll save you years battling identity theft or account bankruptcy.
  • Type all URLs into Google’s safety net — https://transparencyreport.google.com/safe-browsing/search — if at all possible before even considering trusting them.
  • Don’t use any storefront that uses a Let’s Encrypt certificate. Any.
  • Never let sites store your credit card or bank info.
  • Never shop on a site that has any errors associated with their SSL/TLS certificates. Let’s Encrypt killed the integrity of the lock icon and well-resourced adversaries can thwart the encryption, but the opportunistic attackers likely to try to pwn you are going to be stopped.
  • Avoid shopping with Apps. App developers are generally daft and have wretched security practices baked into their apps.
  • Use “Private Browsing” mode to shop if at all possible and start new browser sessions per-site. Your shopping habits and purchase info are as valuable as (or even more valuable than) your card digits, especially to trackers.
  • Use uBlock Origin or other reputable ad-blockers and tracking blockers to prevent orgs from tracking you as you shop. A good hosts file wouldn’t hurt, either.
  • Use Quad9 as your DNS provider starting now.
  • Never shop online from public Wi-Fi.
  • Don’t shop online from your company’s network (even the “guest” network). They track you. They all do or at least send data (whether they know it or not) to security appliance and “cloud” services that will use it against you or to profit off of you.
  • Absolutely do not use a store’s Wi-Fi to shop.
  • If using Amazon, avoid third-party sellers if at all possible. Scammers abound.
  • Never use social networks to share what you just purchased.
  • Never “SQUEEE” on social media that any shipments are “arriving today” and you’re “so excited!”.
  • Don’t use that daft, new Amazon video-delivery-bluetooth-alexa lock thing. Ever.
  • If you can afford it, use an in-home (not cloud-based) security camera pointed at the place where deliveries come and review the footage daily if you are expecting deliveries.
  • In-person/brick-and-mortar shopping should be done at chip+pin establishments or use cash at all others.
  • Review your day’s purchases online at the end of the day or the next morning.
  • Report all issues immediately to authorities then the establishments.

Why this particular slice of advice?

The U.S. moved to chip & signature in October of 2016. This has forced attackers to find different, creative ways to get your credit card info. Yes, there were scads of breaches this year, but a good chunk of digital crime is plain ol’ theft. Web sites make great targets. Public Wi-Fi makes a great target. You need to protect yourself since no store, org, bank, politician or authority really cares that your identity was stolen. If they did, we wouldn’t be in the breach mess we’re in now.

Attackers know you’re in deep “breach fatigue” and figure you’re all in a “Meh. Nothing matters” mood. Don’t be pwnd! A wrong move could put you in identity theft limbo for years.

The Identity Theft Resource Center — http://www.idtheftcenter.org/ — is a great resource and can definitely help you in the right direction if you don’t follow the above advice and run into issues.

Stay safe this shopping season!

(one more time: sub to the feed if you’re only on the blog for /datasci items)


Daughter #2 got married to a wonderful chap yesterday. I wanted to preserve the text of my speech and blessing for them here. I know it’s on their wedding video but I’d like to archive it into the historical record as well.


Hey everyone.

I’m Bob, Victoria’s dad. Her mom, Mary, and I want to express our deepest thanks to you all for sharing in this joyous day with us, Victoria, Kyle and Kyle’s parents.

Soon, you’ll be hearing a song that I believe fully describes the foundation of Kyle & Victoria’s new life together.

A story of grace.

A story of love.

A story of new beginnings.

And, a story of individual faiths united in the One who created an entire universe to make this very day possible for this very moment for them.

Before that, I have just a few words to say to the bride, the groom and to all here.

First: Victoria.

Your mom & I have watched you grow up from a small, fragile baby girl (to audience: whose nickname was “spud”, btw, given her potato-like appearance at birth…I’m so dead now) into a strong, dynamic, loving, caring force de majeure. When you set upon a goal, there is no subtlety; just bold determination, founded in firm convictions, demanding flawless execution.

You have a sense of justice and purpose that — at times — fills us with sheer awe, and an unwavering Faith that is both inspiring and contagious to everyone around.

For 25 years, we’ve seen your tears, but have also seen you beam with joy; we’ve seen you stumble, but have seen you get back up and run as if you could literally see the angels at your side. We’ve seen your disappointments but we’ve also seen you accomplish every goal you’ve set out to achieve. All along the way, you’ve made us proud and given us incredible memories to cherish, including today.

Now, for Kyle:

We’ve had an amazing opportunity to get to know Kyle as he’s been staying with us for a good part of this year. Since we first met you, Kyle, we’ve seen you express love, kindness and respect to Victoria. We’ve caught your transfixed gazes on her as she would dive into one of her meandering, nigh-endless stories at the dinner table (yep, I was watching you :-). We’ve seen you make her paramount in your life. I’ve come to know you as a talented, intelligent, soft-spoken gentleman and have seen your Faith regularly demonstrated in action. I thank your parents for raising such a man, and — while it is difficult to let go of my “little girl” — I know I am doing so into the hand of someone who will indeed love and cherish her for their days to come.

For all of us:

While this is most definitely Kyle & Victoria’s day, each of us here has helped to shape them into who they are today. And, thankfully, our jobs are not over! Our prayer is that we will all continue to give them the love and support they so richly deserve as they turn the page to start this new chapter of their lives together.

Let us all raise a glass to the new couple:

A toast to my daughter. She was a gift from God and I will always be grateful to have been given the honor of being her father. May God continue to bless her and her new husband, Kyle. Grant them safety, love, and happiness all the days of their lives.

A long time ago, in a github repo far, far away there lived a tiny package that made it possible to create equal area, square U.S. state cartograms in R dubbed statebins. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins

There were three different functions in the old-style package:

  • one for discrete scales (it automated ‘cuts’)
  • one for continuous scales
  • one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

library(statebins)
library(tidyverse)

data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault") +
  theme_statebins(legend_position="right")

Two things should stand out there:

  • you got scale_fill_distiller() for free!
  • labels are dark/light depending on the tile color

Before we go into ^^, it may be helpful to show the new function interface:

library(statebins)
library(tidyverse)

statebins(state_data, state_col = "state", value_col = "value",
  dark_label = "black", light_label = "white", font_size = 3,
  state_border_col = "white", state_border_size = 2,
  ggplot2_scale_function = ggplot2::scale_fill_distiller, ...)

You pass in the state name/abbreviation & value columns like the old interface but also specify colors for the dark & light labels (use hex colors with a trailing 00 alpha if you don’t want labels, but Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit) and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use and the named parameters for it (that’s what the ... is for).

So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger.

The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output.
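For example, here’s a minimal sketch (reusing the USArrests data from above; the title and extra theme() tweak are just illustrative):

library(statebins)
library(ggplot2)

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col="Assault", name = "Assault") +
  labs(title="Assault arrests per 100,000 (1973)") +
  theme_statebins(legend_position="bottom") +
  theme(plot.title=element_text(size=16, hjust=0))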

But that’s not all!

There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that):

library(statebins)
library(tidyverse)
library(viridis)

data(USArrests)

# make up some data for the example

rownames_to_column(USArrests, "state") %>%
  bind_rows(
    data_frame(
      state = c("Virgin Islands", "Puerto Rico", "New York City"),
      Murder = rep(mean(USArrests$Murder), 3),
      Assault = rep(mean(USArrests$Assault), 3),
      Rape = rep(mean(USArrests$Rape), 3),
      UrbanPop = c(93, 95, 100)
    )
  ) -> us_arrests

statebins(us_arrests, value_col="Assault",
          ggplot2_scale_function = viridis::scale_fill_viridis) +
  labs(title="USArrests + made up data") +
  theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv(system.file("extdata", "wapostates.csv", package="statebins")))

mutate(
  adat, 
  share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4"))
) %>% 
  statebins(
    value_col = "share", 
    ggplot2_scale_function = scale_fill_brewer,
    name = "Share of workforce with jobs lost or threatened by trade"
  ) +
  labs(title = "1994-2000") +
  theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

library(statebins)
library(tidyverse)

election_2012 <- suppressMessages(read_csv(system.file("extdata", "election2012.csv", package="statebins")))

mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>% 
  statebins(
    font_size=4, dark_label = "white", light_label = "white",
    ggplot2_scale_function = scale_fill_manual,
    name = "Winner",
    values = c(Romney = "#2166ac", Obama = "#b2182b")
  ) +
  theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

library(statebins)
library(tidyverse)
data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault", round=TRUE) +
  theme_statebins(legend_position="right")

MOAR BREAKING NEWS: geom & faceting

Thomas Wood suggested that faceting would be nice, but that would really require a Geom. So, I took a stab at one:

library(statebins)
library(cdcfluview)
library(hrbrthemes)
library(tidyverse)

flu <- ili_weekly_activity_indicators(2017)

ggplot(flu, aes(state=statename, fill=activity_level)) +
  geom_statebins() +
  coord_equal() +
  viridis::scale_fill_viridis(
    name = "ILI Activity Level  ", limits=c(0,10), breaks=0:10, option = "magma", direction = -1
  ) +
  facet_wrap(~weekend) +
  labs(title="2017-18 Flu Season ILI Activity Level") +
  theme_statebins(base_family = font_ps) +
  theme(plot.title=element_text(size=16, hjust=0)) +
  theme(plot.margin = margin(30,30,30,30))

FIN

It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package.

So, kick the tyres and don’t hesitate to shoot over some feedback!

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font itself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones). I don’t always set them that way as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases) thanks to the new flush_ticks().

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes)
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  scale_y_continuous(expand=c(0,0), limits=c(10, 50)) +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  facet_wrap(~class, scales="free") +
  labs(
    title="IBM Plex Sans Test",
    subtitle="This is a subtitle to see the how it looks in IBM Plex Sans",
    caption="Source: hrbrthemes & IBM"
  ) +
  theme_ipsum_ps(grid="XY", axis="xy") +
  theme(legend.position="none") -> gg

flush_ticks(gg)

As indicated, this won’t presently work for free scales but it’s going to save me a few minutes for each plot I would have done the more manual way. But, releasing it into the wild will help catch edge cases I haven’t considered yet. I’ve also added an option to flush_ticks() to generate theme(axis.text...) statements vs just return the plot, so you can include the theme() code directly rather than rely on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).
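For reference, here’s roughly what the hand-crafted version looks like in a simple, non-faceted case (a sketch: the hjust vector needs one entry per x-axis break, with the ends flushed outward):

library(ggplot2)

# manual flush: breaks 1:8 give 8 labels, so left-justify the first,
# right-justify the last and centre the six in between
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  theme(axis.text.x=element_text(hjust=c(0, rep(0.5, 6), 1)))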

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.

By now, virtually every major media outlet has covered the “280 Apocalypse”™. For those still not “in the know”, Twitter recently moved the tweet character cap to 280 after a “successful” beta test (some of us have different ideas of what “success” looks like).

I had been on a hiatus from the platform for a while and planned to (and did) jump back into the fray today but wanted to see what my timeline looked like tweet-length-wise. It’s a simple endeavour: use rtweet to grab the timeline, count the characters per-tweet and look up the results. I posted the results of said process to — of course — Twitter and some folks asked me for the code.

Others used it and there were some discussions as to why timelines looked similar (distribution-wise) with not many Tweets going over 140 characters. One posit I had was that it might be due to client-side limitations since I noted that Twitter for macOS — a terrible client they haven’t updated in ages (but there really aren’t any good ones) — still caps tweets at 140 characters. Others, like Buffer on the web, do have support for 280, so I modified the code a bit to look at the distribution by client.

Rather than bore you with my own timeline analysis, and to help the results be a tad more reproducible (which was another discussion that spawned from the tweet-length tweet), here’s a bit of code that tries to grab the last 3,000 tweets with the #rstats hashtag and plots the distribution by Twitter client:

library(rtweet)
library(ggalt)
library(rprojroot)
library(hrbrthemes)
library(tidyverse)

rt <- find_rstudio_root_file()
rstats_tweet_data_file <- file.path(rt, "data", "2017-11-13-rstats-tweet-search-results.rds")

if (!file.exists(rstats_tweet_data_file)) {
  rstats <- search_tweets("#rstats", 3000) # setting up rtweet is an exercise left to the reader
  write_rds(rstats, rstats_tweet_data_file)
} else {
  rstats <- read_rds(rstats_tweet_data_file)
}

rstats <- mutate(rstats, tweet_length=map_int(text, nchar))  # get the tweet length for each tweet

count(rstats, source) %>%
  filter(n > 5) -> usable_sources  # need data for density + I wanted a nice grid :-)

# We want max tweet length & total # of tweets for sorting & labeling facets
filter(rstats, source %in% usable_sources$source) %>%
  group_by(source) %>%
  summarise(max=max(tweet_length), n=n()) %>%
  arrange(desc(max)) -> ordr

# four breaks per panel regardless of the scales (we're using free-y scales)
there_are_FOUR_breaks <- function(limits) { seq(limits[1], limits[2], length.out=4) }

filter(rstats, source %in% usable_sources$source) %>%
  mutate(source = factor(source, levels=ordr$source,
                         labels=sprintf("%s (n=%s)", ordr$source, ordr$n))) %>%
  ggplot(aes(tweet_length)) +
  geom_bkde(aes(color=source, fill=source), bandwidth=5, alpha=2/3) +
  geom_vline(xintercept=140, linetype="dashed", size=0.25) +
  scale_x_comma(breaks=seq(0,280,70), limits=c(0,280)) +
  scale_y_continuous(breaks=there_are_FOUR_breaks, expand=c(0,0)) +
  facet_wrap(~source, scales="free", ncol=5) +
  labs(x="Tweet length", y="Density",
       title="Tweet length distributions by Twitter client (4.5 days #rstats)",
       subtitle="Twitter client facets in decreasing order of ones with >140 length tweets",
       caption="NOTE free Y axis scales\nBrought to you by #rstats, rtweet & ggalt") +
  theme_ipsum_rc(grid="XY", strip_text_face="bold", strip_text_size=8, axis_text_size=7) +
  theme(panel.spacing.x=unit(5, "pt")) +
  theme(panel.spacing.y=unit(5, "pt")) +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(axis.text.y=element_blank()) +
  theme(legend.position="none")

FIN

While the 140 barrier has definitely been broken, it has not been abused (yet), but the naive character counting is also not perfect since it looks like it doesn’t “count” the same way Twitter-proper does (image “attachments”, as an example, are counted as characters here but they aren’t counted that way in Twitter clients). Bots are also counted as Twitter clients.
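A slightly less naive count would strip t.co links (that’s how media and quote “attachments” tend to surface in the text field) before counting. It’s only a rough approximation of Twitter’s own length rules, but it closes some of the gap; a sketch:

library(stringi)

# drop t.co links before counting; still not Twitter's exact length algorithm
mutate(
  rstats,
  tweet_length_no_links = map_int(stri_replace_all_regex(text, "\\s*https?://t\\.co/\\S+", ""), nchar)
) -> rstats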

It’ll be interesting to track this in a few months as folks start to inch-then-blaze past the former hard-limit.

Give the code (or use your timeline info) a go and post a link with your results! You can find an RStudio project directory over on GitHub.

Working remotely has many benefits, but if you work remotely in an area like, say, rural Maine, one of those benefits is not massively speedy internet connections. Being able to go fast and furious on the internet is one of the many things I miss about our time in Seattle and it is unlikely that we’ll be seeing Google Fiber in my small town any time soon. One other issue is that residential plans from evil giants like Comcast come with things like “bandwidth caps”. I suspect many WFH-ers can live within those limits, but I work with internet-scale data and often shunt extracts or whole datasets to the DatCave™ server farm for local processing. As such, I pay an extra penalty as a Comcast “Business-class” user that has little benefit besides getting slightly higher QoS and some faster service response times when there are issues.

Why go into all that? Well, I recently decided to bump up the connection from 100 Mb/s to 150 Mb/s (and managed to do so w/o increasing the overall monthly bill at all) but wanted to have a way to regularly verify I am getting what I’m paying for without having to go to an interactive “speed test” site.

There’s a handy speedtest-cli, which is a python-based module with a command-line interface that can perform speed tests against Ookla’s legacy server. (You’ll notice that link forwards you to their new socket-based test service; neither speedtest-cli nor the code in the package being shown here uses that yet)

I run plenty of ruby, python, node and go (et al) programs on the command-line, but I wanted a way to measure this in R-proper (storing individual test results) on a regular basis as well as run something from the command-line whenever I wanted an interactive test. Thus begat speedtest.

Testing the Need for Speed

After you devtools::install_github("hrbrmstr/speedtest") the package, you can either type speedtest::spd_test() at an R console or Rscript --quiet -e 'speedtest::spd_test()' on the command-line to see something like the following:

What you see there is a short-hand version of what’s available in the package:

  • spd_best_servers: Find “best” servers (latency-wise) from master server list
  • spd_closest_servers: Find “closest” servers (geography-wise) from master server list
  • spd_compute_bandwidth: Compute bandwidth from bytes transferred and time taken
  • spd_config: Retrieve client configuration information for the speedtest
  • spd_download_test: Perform a download speed/bandwidth test
  • spd_servers: Retrieve a list of SpeedTest servers
  • spd_upload_test: Perform an upload speed/bandwidth test
  • spd_test: Test your internet speed/bandwidth

The general idiom is to grab the test configuration file, collect the master list of servers, figure out which servers you’re going to test against and perform upload/download tests + collect the resultant statistics:

library(speedtest)
library(stringi)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

config <- spd_config()

servers <- spd_servers(config=config)
closest_servers <- spd_closest_servers(servers, config=config)
only_the_best_servers <- spd_best_servers(closest_servers, config)

glimpse(spd_download_test(closest_servers[1,], config=config))
## Observations: 1
## Variables: 15
## $ url     <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat     <dbl> 42.3875
## $ lng     <dbl> -71.1
## $ name    <chr> "Somerville, MA"
## $ country <chr> "United States"
## $ cc      <chr> "US"
## $ sponsor <chr> "Axcelx Technologies LLC"
## $ id      <chr> "5960"
## $ host    <chr> "speed0.xcelx.net:8080"
## $ url2    <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min     <dbl> 14.40439
## $ mean    <dbl> 60.06834
## $ median  <dbl> 55.28457
## $ max     <dbl> 127.9436
## $ sd      <dbl> 34.20695

glimpse(spd_upload_test(only_the_best_servers[1,], config=config))
## Observations: 1
## Variables: 18
## $ ping_time      <dbl> 0.02712567
## $ total_time     <dbl> 0.059917
## $ retrieval_time <dbl> 2.3e-05
## $ url            <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat            <dbl> 42.3875
## $ lng            <dbl> -71.1
## $ name           <chr> "Somerville, MA"
## $ country        <chr> "United States"
## $ cc             <chr> "US"
## $ sponsor        <chr> "Axcelx Technologies LLC"
## $ id             <chr> "5960"
## $ host           <chr> "speed0.xcelx.net:8080"
## $ url2           <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min            <dbl> 6.240858
## $ mean           <dbl> 9.527599
## $ median         <dbl> 9.303148
## $ max            <dbl> 12.56686
## $ sd             <dbl> 2.451778

Spinning More Wheels

The default operation for each test function is to return summary statistics. If you dig into the package source, or look at how the legacy measuring site performs the speed test tasks, you’ll see that the process involves uploading and downloading a series of files (or, more generally, random data) of increasing size and calculating how long it takes to perform those actions. Smaller size tests will have skewed results on fast connections with low latency, which is why the sites do the incremental tests (and why you see the “needles” move the way they do on interactive tests). We can disable summary statistics and retrieve all of the results of individual tests (as we’ll do in a bit).

Speed tests are also highly dependent on the processing capability of the target server as well as the network location (hops away), “quality” and load. You’ll often see interactive sites choose the “best” server. There are many ways to do that. One is geography, which has some merit, but should not be the only factor used. Another is latency: run a small connection test against each server and measure which one comes back the fastest.

It’s straightforward to show the impact of each with a simple test harness. We’ll pick the 3 “closest” geographic servers, 3 “best” latency servers (those may overlap with “closest”) and 3 randomly chosen ones:

set.seed(8675309)

bind_rows(

  closest_servers[1:3,] %>%
    mutate(type="closest"),

  only_the_best_servers[1:3,] %>%
    mutate(type="best"),

  filter(servers, !(id %in% c(closest_servers[1:3,]$id, only_the_best_servers[1:3,]$id))) %>%
    sample_n(3) %>%
    mutate(type="random")

) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 9 x 5
##                   sponsor            name       country
##                     <chr>           <chr>         <chr>
## 1 Axcelx Technologies LLC  Somerville, MA United States
## 2                 Comcast      Boston, MA United States
## 3            Starry, Inc.      Boston, MA United States
## 4 Axcelx Technologies LLC  Somerville, MA United States
## 5 Norwood Light Broadband     Norwood, MA United States
## 6       CCI - New England  Providence, RI United States
## 7                 PirxNet         Gliwice        Poland
## 8           Interoute VDC Los Angeles, CA United States
## 9                   UNPAD         Bandung     Indonesia
## # ... with 2 more variables: host <chr>, type <chr>

As predicted, there are some duplicates, but we’ll perform upload and download tests for each of them and compare the results. First, download:

map_df(1:nrow(to_compare), ~{
  spd_download_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> dl_results_full

mutate(dl_results_full, type=stri_trans_totitle(type)) %>%
  ggplot(aes(type, bw, fill=type)) +
  geom_quasirandom(aes(size=size, color=type), width=0.15, shape=21, stroke=0.25) +
  scale_y_continuous(expand=c(0,5), labels=c(sprintf("%s", seq(0,150,50)), "200 Mb/s"), limits=c(0,200)) +
  scale_size(range=c(2,6)) +
  scale_color_manual(values=c(Random="#b2b2b2", Best="#2b2b2b", Closest="#2b2b2b")) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Download bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

I’m going to avoid hammering really remote servers with the upload test (truth-be-told I just don’t want to wait as long as I know it’s going to take):

bind_rows(
  closest_servers[1:3,] %>% mutate(type="closest"),
  only_the_best_servers[1:3,] %>% mutate(type="best")
) %>%
  distinct(.keep_all=TRUE) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 6 x 5
##                   sponsor           name       country
##                     <chr>          <chr>         <chr>
## 1 Axcelx Technologies LLC Somerville, MA United States
## 2                 Comcast     Boston, MA United States
## 3            Starry, Inc.     Boston, MA United States
## 4 Axcelx Technologies LLC Somerville, MA United States
## 5 Norwood Light Broadband    Norwood, MA United States
## 6       CCI - New England Providence, RI United States
## # ... with 2 more variables: host <chr>, type <chr>

map_df(1:nrow(to_compare), ~{
  spd_upload_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> ul_results_full

ggplot(ul_results_full, aes(x="Upload Test", y=bw)) +
  geom_quasirandom(aes(size=size, fill="col"), width=0.1, shape=21, stroke=0.25, color="#2b2b2b") +
  scale_y_continuous(expand=c(0,0.5), breaks=seq(0,16,4),
                     labels=c(sprintf("%s", seq(0,12,4)), "16 Mb/s"), limits=c(0,16)) +
  scale_size(range=c(2,6)) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Upload bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

As indicated, data payload size definitely impacts the speed tests (as does “proximity”), but you’ll also see variance if you run the same tests again with the same target servers. Note, also, that the “command-line” ready function tries to make the quickest target selection and only chooses the max values.
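For comparison, collapsing the full per-file results down to a single headline number the way the interactive tests do is just a max over the measured bandwidth column (a quick sketch using the dl_results_full and ul_results_full objects built above):

summarise(dl_results_full, download_max=max(bw, na.rm=TRUE))
summarise(ul_results_full, upload_max=max(bw, na.rm=TRUE))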

FIN

The github site has a list of TODOs and any/all are welcome to pile on (full credit for any contributions, as usual). So kick the tyres, file issues and start measuring!

Unlike @noamross, I am not an epidemiologist (NOTE: Noam battles pandemics before breakfast, so be super nice to him) but I do like to find kindred methodologies in other disciplines to help foster the growth of cybersecurity into something beyond its current Barnum & Bailey state. I also love finding and exposing hidden APIs and especially enjoy killing Adobe Flash. How does all that relate to cdcfluview?

cdcfluview was one of my first R packages. Someone, somewhere, was trying to do something with Selenium to automate downloading of data from the CDC’s FluView Portal. It was — and, some of it still is — a Flash-based site that locked up useful data behind application screens that did little more than burn one’s retinas and force folks to keep Flash alive and, hence, their browsers insecure.

Rather than let the requester suffer under the weight of a pretty significant external dependency, I used the magic of the browser “developer tools” inspector to see that it was making fairly innocuous and useful XHR requests for real data. The package sat on GitHub for a while and eventually made its way to CRAN.

Times change and Flash is dying, so the CDC paid some serious benjamins to have the site re-done in HTML, replicating the horrible UX and terrible visualizations (so. many. pie. charts.). Said revamp also caused changes to the back-end APIs and forced breaking changes. Craig McGowan jumped to the rescue and fixed some core functionality issues, but so much changed — and so much was added — that I felt it was time for a modern re-write of the cdcfluview package.

This is a pretty solid, real-world example of how dangerous it is to rely on hidden APIs. If Craig hadn’t both notified me and gone the extra mile to make a PR, I’d’ve been in the dark until I tried to commiserate with code (I always seem to get the flu no matter what I do) and found my package erroring out.

Enter: cdcfluview 0.7.0.

What’s Different?

Unfortunately, everything; which is one reason I’m writing this post.

First, to have folks that are using current-gen cdcfluview kick the tyres and let me know (via issues) if you need any old API compatibility back. This isn’t anywhere near the most popular package on CRAN, but it does have users (even, I’m told, within the CDC) and I want to make sure I do as little to disrupt them as possible. But, the current package API maps much more closely to the way the revamped portal works and presents data, so I’m hoping it’s a good net-new vs crushing blow to productivity.

Speaking of maps, the package now has actual maps! A new cdc_basemap() function returns the GeoJSON files that the CDC uses in their web views as sf objects. And, there are tons of maps and multi-labeled features to tie data to:
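Here’s a minimal sketch of pulling one of those basemaps and plotting it (the "national" basemap name is an assumption on my part; check ?cdc_basemap for the accepted values):

library(cdcfluview)
library(sf)
library(ggplot2)

# "national" is an assumed basemap name -- see ?cdc_basemap for the real choices
natl <- cdc_basemap("national")

ggplot(natl) +
  geom_sf(fill="white", color="#2b2b2b", size=0.25) +
  theme_void()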

Here’s what’s in the tin:

  • age_group_distribution: Age Group Distribution of Influenza Positive Tests Reported by Public Health Laboratories
  • cdc_basemap: Retrieve CDC U.S. Basemaps
  • geographic_spread: State and Territorial Epidemiologists Reports of Geographic Spread of Influenza
  • hospitalizations: Laboratory-Confirmed Influenza Hospitalizations
  • ilinet: Retrieve ILINet Surveillance Data
  • ili_weekly_activity_indicators: Retrieve weekly state-level ILI indicators per-state for a given season
  • pi_mortality: Pneumonia and Influenza Mortality Surveillance
  • state_data_providers: Retrieve metadata about U.S. State CDC Provider Data
  • surveillance_areas: Retrieve a list of valid sub-regions for each surveillance area.
  • who_nrevss: Retrieve WHO/NREVSS Surveillance Data
  • mmwr_week: Convert a Date to an MMWR day+week+year
  • mmwr_weekday: Convert a Date to an MMWR weekday
  • mmwr_week_to_date: Convert an MMWR year+week or year+week+day to a Date object

Plus there’s a new data object mmwrid_map that makes it super-easy to convert arcane MMWR identifiers to real date objects.
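The MMWR helpers are handy on their own, too; a quick sketch (the argument order for mmwr_week_to_date() is assumed here, so check its help page):

library(cdcfluview)

mmwr_week(Sys.Date())        # MMWR year/week/day for today's date
mmwr_week_to_date(2017, 40)  # Date for MMWR week 40 of 2017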

The README has plenty of charts and examples, so I won’t take up post-space with said code or images.

Curiously Enough

Along the way, I was able to discern that there’s a hidden layer of this new, hidden API. Exposing said layer should be as easy as figuring out the right keyword and I’m hoping a bit of fuzzing will do the trick on that. It will be interesting to see what extra data that unlocks. (Yes, I just said relying on hidden APIs is dangerous; and, relying on hidden, hidden APIs is doubly so. I’m just a glutton for punishment.)

I was also able to discern that multiple people or teams worked on this revamp and said folks did not communicate with each other. The per-app APIs are woefully inconsistent. Furthermore, someone goofed and forgot to expose some pretty critical information from a few data retrieval operations (said data is also missing on the clickable download versions, too). Hopefully they’ll be addressing the issue soon (the site is technically in beta release).

FIN

If you’ve been a user of cdcfluview please give the new API a try and file issues with anything you see. All contributors — testers, modders, enhancers — will get full DESCRIPTION credit (so, please also include how you’d like to be cited).

Finally, please do check out the CDC FluView Portal. It’s gosh awful horribad. I know there are some spiffy Shiny experts out there who could run rings around that portal and I’ll be glad to add you as a collaborator if you contribute a Shiny app (or two!) to the package. If you’d rather go your own route with a self-contained, self-published package, just let me know what API changes you’d like and I’ll gladly accommodate. The goal is to help epidemiologists and other researchers keep us all safe.

So, go get your flu shot!!! Then, kick the tyres on this package update and don’t hesitate to convey your criticisms, patches or accolades.

Now to get to my promised final review of cyphr (I’ve not forgotten @ma_salmon ;-)