
Our family has been reading, listening to and watching “A Christmas Carol” for just about 30 years now. I got it into my crazy noggin to perform a sentiment analysis on it the other day and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.

One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with the tidytext package and their Text Mining with R book.

(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)

Let us begin!

STAVE I: hrbrmstr’s Code

We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:

library(rprojroot)
library(gutenbergr)
library(hrbrthemes)
library(stringi)
library(tidytext)
library(tidyverse)

rt <- find_rstudio_root_file()

carol_rds <- file.path(rt, "data", "carol.rds")

if (!file.exists(carol_rds)) {
  carol_df <- gutenberg_download("46")
  write_rds(carol_df, carol_rds)
} else {
  carol_df <- read_rds(carol_rds)
}

How did I know to use 46? We can use gutenberg_works() to get to that info:

gutenberg_works(author=="Dickens, Charles")
## # A tibble: 74 x 8
##    gutenberg_id                                                                                    title
##           <int>                                                                                    <chr>
##  1           46                             A Christmas Carol in Prose; Being a Ghost Story of Christmas
##  2           98                                                                     A Tale of Two Cities
##  3          564                                                               The Mystery of Edwin Drood
##  4          580                                                                      The Pickwick Papers
##  5          588                                                                  Master Humphrey's Clock
##  6          644                                                  The Haunted Man and the Ghost's Bargain
##  7          650                                                                      Pictures from Italy
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
##  9          675                                                                           American Notes
## 10          678                                          The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author <chr>, gutenberg_author_id <int>, language <chr>,
## #   gutenberg_bookshelf <chr>, rights <chr>, has_text <lgl>

STAVE II: The first of three wrangles

We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:

#' Convenience only
carol_txt <- carol_df$text

# Just want the chapters (staves)
carol_txt <- carol_txt[-(1:(which(grepl("STAVE I:", carol_txt)) - 1))]

#' We'll need this later to make prettier facet titles
data_frame(
  stave = 1:5,
  title = sprintf("Stave %s: %s", stave, carol_txt[stri_detect_fixed(carol_txt, "STAVE")] %>%
    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
    stri_trans_totitle())
) -> stave_titles

stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.
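
For instance, here is one equivalent take on the same munging using stringr instead of stringi (just a sketch; stave_lines and stave_titles_alt are illustrative names, not objects used later in this post):

library(stringr)

stave_lines <- carol_txt[str_detect(carol_txt, "STAVE")]

data_frame(
  stave = 1:5,
  title = sprintf(
    "Stave %s: %s",
    stave,
    str_to_title(str_replace(stave_lines, "^STAVE [[:alpha:]]{1,3}: ", ""))
  )
) -> stave_titles_alt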

You’ll also see that I made sure we start at the first stave break rather than including the title bits in the analysis.

Now, we need to prep the text for text analysis.

STAVE III: The second of three wrangles

There are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!

Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.

#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
data_frame(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token="regex", pattern="STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>% 
  group_by(stave) %>%
  mutate(para = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token="sentences") %>% 
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens

carol_tokens
## # A tibble: 28,710 x 4
##   stave  para  sent   word
##   <int> <int> <int>  <chr>
## 1     1     1     1 marley
## 2     1     1     1    was
## 3     1     1     1   dead
## 4     1     1     1     to
## 5     1     1     1  begin
## 6     1     1     1   with
## 7     1     1     1  there
## 8     1     1     1     is
## 9     1     1     1     no
## 10     1     1     1  doubt
## # ... with 28,700 more rows

By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.
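
For instance, a couple of quick illustrations of what those grouping combinations buy you (not part of the original analysis, just examples using the columns shown above):

count(carol_tokens, stave)                      # words per stave

group_by(carol_tokens, stave, para) %>%         # sentences per paragraph
  summarise(n_sentences = n_distinct(sent)) %>%
  ungroup()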

STAVE IV: The third of three wrangles

Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment-al posts (including this one) take a naive approach with basic matching, looking only at 1-grams. One reason I didn’t go further was to make the code accessible to new R folk (since I primarily blog for new R folk :-). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.

#' Retrieve sentiments and compute them.
#'
#' I left the `index` in vs just using `paragraph` since it makes it easier to reuse
#' this block (which I'm not doing here, but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent

STAVE V: The end of it

Now, we just need to do some really basic ggplot-ing to get to our desired result:

ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend=index, yend=0, color=title), size=0.33) +
  scale_x_comma(limits=range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales="free_x", ncol=5) +
  labs(x=NULL, y="Sentiment",
       title="Sentiment Analysis of A Christmas Carol",
       subtitle="By stave & ¶",
       caption="Humbug!") +
  theme_ipsum_rc(grid="Y", axis_text_size = 8, strip_text_face = "italic", strip_text_size = 10.5) +
  theme(legend.position="none")

You’ll want to tap/click on that to make it bigger.

Despite using a naive analysis, I think it tracks pretty well with the flow of the book.

Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.

The truly terrible (-10 sentiment) paragraph also makes sense:

Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.

(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)

filter(
  carol_tokens, stave == 1,
  para == filter(carol_with_sent, stave==1) %>% 
    filter(sentiment == min(sentiment)) %>% 
    pull(index)
)

Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.

Stave three (Christmas present) has the highest:

The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.

and lowest (sentiment) points of the entire book:

And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.

Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative sentiment paragraph (and the one that follows) are pretty dark, though:

They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.

Finally, Stave five is both short and positive (whew!). Which I heartily agree with!

FIN

The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.

Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.

For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”

Earlier this year, I made a package that riffed off of ProPublica’s really neat voting cartograms (maps) for the U.S. House and Senate. You can see one for disaster relief spending in the House and one for the ACA “Skinny Repeal” in the Senate.

We can replicate both here with the voteogram package (minus the interactivity, for now):

library(voteogram)
library(ggplot2)

hr_566 <- roll_call(critter="house", number=115, session=1, rcall=566)

house_carto(hr_566) +
  coord_equal() +
  theme_voteogram()

sen_179 <- roll_call(critter="senate", number=115, session=1, rcall=179)

senate_carto(sen_179) +
  coord_equal() +
  theme_voteogram()

I think folks might have more fun with the roll_call() objects though:

str(hr_566)
## List of 29
##  $ vote_id              : chr "H_115_1_566"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 566
##  $ needed_to_pass       : int 282
##  $ date_of_vote         : chr "October 12, 2017"
##  $ time_of_vote         : chr "03:23 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "2/3 YEA-AND-NAY"
##  $ question             : chr "On Motion to Suspend the Rules and Agree"
##  $ description          : chr "Providing for the concurrence by the House in the Senate amendment to H.R. 2266, with an amendment"
##  $ nyt_title            : chr "On Motion to Suspend the Rules and Agree"
##  $ total_yes            : int 353
##  $ total_no             : int 69
##  $ total_not_voting     : int 11
##  $ gop_yes              : int 164
##  $ gop_no               : int 69
##  $ gop_not_voting       : int 7
##  $ dem_yes              : int 189
##  $ dem_no               : int 0
##  $ dem_not_voting       : int 5
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "Yes"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##   ..$ bioguide_id         : chr [1:435] "A000374" "A000370" "A000055" "A000371" ...
##   ..$ role_id             : int [1:435] 274 294 224 427 268 131 388 320 590 206 ...
##   ..$ member_name         : chr [1:435] "Ralph Abraham" "Alma Adams" "Robert B. Aderholt" "Pete Aguilar" ...
##   ..$ sort_name           : chr [1:435] "Abraham" "Adams" "Aderholt" "Aguilar" ...
##   ..$ party               : chr [1:435] "R" "D" "R" "D" ...
##   ..$ state_abbrev        : chr [1:435] "LA" "NC" "AL" "CA" ...
##   ..$ display_state_abbrev: chr [1:435] "La." "N.C." "Ala." "Calif." ...
##   ..$ district            : int [1:435] 5 12 4 31 12 3 2 19 36 2 ...
##   ..$ position            : chr [1:435] "Yes" "Yes" "Yes" "Yes" ...
##   ..$ dw_nominate         : num [1:435] 0.493 -0.462 0.36 -0.273 0.614 0.684 0.388 NA 0.716 NA ...
##   ..$ pp_id               : chr [1:435] "LA_5" "NC_12" "AL_4" "CA_31" ...
##  - attr(*, "class")= chr [1:2] "pprc" "list"

as they hold tons of info on the votes.
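
For example, a quick party-by-position breakdown falls straight out of the embedded votes tibble (a small illustration using the columns shown in the str() output above):

library(tidyverse)

count(hr_566$votes, party, position) %>%
  spread(position, n, fill = 0)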

I need to explore the following a bit more but there are some definite “patterns” in the way the 115th Senate has voted this year:

library(hrbrthemes)
library(tidyverse) # for the purrr, readr & dplyr helpers used below

# I made a mistake in how I exposed these that I'll correct next month
# but we need to munge it a bit anyway for this view
fills <- voteogram:::vote_carto_fill
names(fills) <- tolower(names(fills))

rcalls <- map(1:280, ~voteogram::roll_call(critter="senate", session=1, number=115, rcall=.x))
# save it off so you don't have to waste those calls again
write_rds(rcalls, "2017-115-1-sen-280-roll-calls.rds")

# do a bit of wrangling
map_df(rcalls, ~{
  mutate(.x$votes, vote_id = .x$vote_id) %>% 
    arrange(party, position) %>% 
    mutate(fill = tolower(sprintf("%s-%s", party, position))) %>% 
    mutate(ques = .x$question) %>% 
    mutate(x = 1:n())
}) -> votes_df

# plot it
ggplot(votes_df, aes(x=x, y=vote_id, fill=fill)) +
  geom_tile() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  scale_fill_manual(name=NULL, values=fills) +
  labs(x=NULL, y=NULL, title="Senate Roll Call Votes",
       subtitle="2017 / 115th, Session 1, Votes 1-280",
       caption="Note free-Y scales") +
  facet_wrap(~ques, scales="free_y", ncol=3) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position="right")

Hopefully I’ll get some time to dig into the differences and report on anything interesting. If you get to it before me definitely link to your blog post in a comment!

FIN

I still want to make an htmlwidgets version of the plots and also add the ability to get the index of roll call votes by Congress number and session to make it easier to iterate.
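
In the meantime, if you replicate the 280-vote loop from above yourself, wrapping roll_call() defensively is one way to iterate without a single failed request killing the whole run (a sketch using purrr, not a package feature):

library(purrr)

# any roll call that errors out just becomes NULL instead of stopping the map()
safe_roll_call <- possibly(voteogram::roll_call, otherwise = NULL)

rcalls <- map(1:280, ~safe_roll_call(critter="senate", session=1, number=115, rcall=.x))
rcalls <- compact(rcalls)  # drop the failures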

I’m also seriously considering creating different palettes. I used the ones from the source interactive site but am not 100% happy with them. Suggestions/PRs welcome.

Hopefully this package will make it easier for U.S. folks to track what’s going on in Congress and keep their representatives more accountable to the truth.

Everything’s on GitHub so please file issues, questions or PRs there.

NOTE: This is mainly for those of us in the Colonies, but some tips apply globally.

Black Friday / Cyber Monday / Cyber November / Holiday shopping is upon us. You’re going to buy stuff. You’re going to use digital transactions to do so. Here are some tips in a semi-coherent order:

  • Sign up for a “reputable” credit card (is there such a thing? FinServs are pretty evil) with a low interest rate or cash back, multi-factor authentication on their web/app, a limit on total credit and a per-transaction limit. This card is just for shopping. Pay for petrol and groceries with something else.
  • Assign that to your PayPal, Amazon, Apple Pay, et al accounts and keep that as your only physical & digital card for your shopping sprees until the season ends.
  • Set up multi-factor auth on PayPal, Amazon, Apple Pay and anywhere else you shop. Don’t shop where you can’t do this.
  • Use Amazon or a site that accepts PayPal, Apple Pay, or Amazon payments. Yes, all those orgs are evil. But they do a better job than most when it comes to account security.
  • Use Firefox Quantum or the latest Chrome betas to shop online. Nothing else. Check for updates daily & apply them when they are out.
  • Double-check URLs when shopping. Make sure you’re on the site you want to be on. Let’s Encrypt made it super easy for attackers to pwn you this season. You can afford an extra 5 minutes since that’ll save you years battling identity theft or account bankruptcy.
  • Type all URLs into Google’s safety net — https://transparencyreport.google.com/safe-browsing/search — if at all possible before even considering trusting them.
  • Don’t use any storefront that uses a Let’s Encrypt certificate. Any.
  • Never let sites store your credit card or bank info.
  • Never shop on a site that has any errors associated with their SSL/TLS certificates. Let’s Encrypt killed the integrity of the lock icon, and well-resourced adversaries can thwart the encryption, but the opportunistic attackers likely to try to pwn you will be stopped.
  • Avoid shopping with Apps. App developers are generally daft and have wretched security practices baked into their apps.
  • Use “Private Browsing” mode to shop if at all possible and start new browser sessions per-site. Your shopping habits and purchase info are as valuable as (or even more valuable than) your card digits, especially to trackers.
  • Use uBlock Origin or other reputable ad-blockers and tracking blockers to prevent orgs from tracking you as you shop. A good hosts file wouldn’t hurt, either.
  • Use Quad9 as your DNS provider starting now.
  • Never shop online from public Wi-Fi.
  • Don’t shop online from your company’s network (even the “guest” network). They track you. They all do or at least send data (whether they know it or not) to security appliance and “cloud” services that will use it against you or to profit off of you.
  • Absolutely do not use a store’s Wi-Fi to shop.
  • If using Amazon, avoid third-party sellers if at all possible. Scammers abound.
  • Never use social networks to share what you just purchased.
  • Never “SQUEEE” on social media that any shipments are “arriving today” and you’re “so excited!”.
  • Don’t use that daft, new Amazon video-delivery-bluetooth-alexa lock thing. Ever.
  • If you can afford it, use an in-home (not cloud-based) security camera pointed at the place where deliveries come and review the footage daily if you are expecting deliveries.
  • In-person/brick-and-mortar shopping should be done at chip+pin establishments or use cash at all others.
  • Review your day’s purchases online at the end of the day or the next morning.
  • Report all issues immediately to authorities then the establishments.

Why this particular slice of advice?

The U.S. moved to chip & signature in October of 2016. This has forced attackers to find different, creative ways to get your credit card info. Yes, there were scads of breaches this year, but a good chunk of digital crime is plain ol’ theft. Web sites make great targets. Public Wi-Fi makes a great target. You need to protect yourself since no store, org, bank, politician or authority really cares that your identity was stolen. If they did, we wouldn’t be in the breach mess we’re in now.

Attackers know you’re in deep “breach fatigue” and figure you’re all in a “Meh. Nothing matters” mood. Don’t be pwnd! A wrong move could put you in identity theft limbo for years.

The Identity Theft Resource Center — http://www.idtheftcenter.org/ — is a great resource and can definitely help you in the right direction if you don’t follow the above advice and run into issues.

Stay safe this shopping season!

(one more time: sub to the feed if you’re only on the blog for /datasci items)


Daughter #2 got married to a wonderful chap yesterday. I wanted to preserve the text of my speech and blessing for them here. I know it’s on their wedding video but I’d like to archive it into the historical record as well.


Hey everyone.

I’m Bob, Victoria’s dad. Her mom, Mary, and I want to express our deepest thanks to you all for sharing in this joyous day with us, Victoria, Kyle and Kyle’s parents.

Soon, you’ll be hearing a song that I believe fully describes the foundation of Kyle & Victoria’s new life together.

A story of grace.

A story of love.

A story of new beginnings.

And, a story of individual faiths united in the One who created an entire universe to make this very day possible for this very moment for them.

Before that, I have just a few words to say to the bride, the groom and to all here.

First: Victoria.

Your mom & I have watched you grow up from a small, fragile baby girl (to audience: whose nickname was “spud”, btw, given her potato-like appearance at birth…I’m so dead now) into a strong, dynamic, loving, caring force de majeure. When you set upon a goal, there is no subtlety; just bold determination, founded in firm convictions, demanding flawless execution.

You have a sense of justice and purpose that — at times — fills us with sheer awe, and an unwavering Faith that is both inspiring and contagious to everyone around.

For 25 years, we’ve seen your tears, but have also seen you beam with joy; we’ve seen you stumble, but have seen you get back up and run as if you could literally see the angels at your side. We’ve seen your disappointments but we’ve also seen you accomplish every goal you’ve set out to achieve. All along the way, you’ve made us proud and given us incredible memories to cherish, including today.

Now, for Kyle:

We’ve had an amazing opportunity to get to know Kyle as he’s been staying with us for a good part of this year. Since we first met you, Kyle, we’ve seen you express love, kindness and respect to Victoria. We’ve caught your transfixed gazes on her as she would dive into one of her meandering, nigh-endless stories at the dinner table (yep, I was watching you :-). We’ve seen you make her paramount in your life. I’ve come to know you as a talented, intelligent, soft-spoken gentleman and have seen your Faith regularly demonstrated in action. I thank your parents for raising such a man, and — while it is difficult to let go of my “little girl” — I know I am doing so into the hand of someone who will indeed love and cherish her for their days to come.

For all of us:

While this is most definitely Kyle & Victoria’s day, each of us here has helped to shape them into who they are today. And, thankfully, our jobs are not over! Our prayer is that we will all continue to give them the love and support they so richly deserve as they turn the page to start this new chapter of their lives together.

Let us all raise a glass to the new couple:

A toast to my daughter. She was a gift from God and I will always be grateful to have been given the honor of being her father. May God continue to bless her and her new husband, Kyle. Grant them safety, love, and happiness all the days of their lives.

A long time ago, in a github repo far, far away there lived a tiny package that made it possible to create equal area, square U.S. state cartograms in R dubbed statebins. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins

There were three different functions in the old-style package:

  • one for discrete scales (it automated ‘cuts’)
  • one for continuous scales
  • one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

library(statebins)
library(tidyverse)

data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault") +
  theme_statebins(legend_position="right")

Two things should stand out there:

  • you got scale_fill_distiller() for free!
  • labels are dark/light depending on the tile color

Before we go into ^^, it may be helpful to show the new function interface:

library(statebins)
library(tidyverse)

statebins(state_data, state_col = "state", value_col = "value",
  dark_label = "black", light_label = "white", font_size = 3,
  state_border_col = "white", state_border_size = 2,
  ggplot2_scale_function = ggplot2::scale_fill_distiller, ...)

You pass in the state name/abbreviation & value columns like the old interface, but you also specify colors for the dark & light labels (use hex colors ending in 00 alpha values if you don’t want labels at all, though Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit), and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use, along with its named parameters (that’s what the ... is for).

So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger.
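
For the curious, the automatic dark/light label picking boils down to a relative-luminance check on each tile’s fill color. This is just a sketch of the general idea (not the package’s actual internals); pick_label_color() is a made-up helper name:

# compute relative luminance of each fill and flip the label color on a threshold
pick_label_color <- function(fill_hex, dark = "black", light = "white", threshold = 0.5) {
  rgb <- grDevices::col2rgb(fill_hex) / 255
  luminance <- 0.2126 * rgb[1, ] + 0.7152 * rgb[2, ] + 0.0722 * rgb[3, ]
  ifelse(luminance > threshold, dark, light)
}

pick_label_color(c("#fff7ec", "#7f0000"))
## [1] "black" "white"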

The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output.

But that’s not all!

There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that):

library(statebins)
library(tidyverse)
library(viridis)

data(USArrests)

# make up some data for the example

rownames_to_column(USArrests, "state") %>%
  bind_rows(
    data_frame(
      state = c("Virgin Islands", "Puerto Rico", "New York City"),
      Murder = rep(mean(max(USArrests$Murder),3)),
      Assault = rep(mean(max(USArrests$Assault),3)),
      Rape = rep(mean(max(USArrests$Rape),3)),
      UrbanPop = c(93, 95, 100)
    )
  ) -> us_arrests

statebins(us_arrests, value_col="Assault",
          ggplot2_scale_function = viridis::scale_fill_viridis) +
  labs(title="USArrests + made up data") +
  theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv(system.file("extdata", "wapostates.csv", package="statebins")))

mutate(
  adat, 
  share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4"))
) %>% 
  statebins(
    value_col = "share", 
    ggplot2_scale_function = scale_fill_brewer,
    name = "Share of workforce with jobs lost or threatened by trade"
  ) +
  labs(title = "1994-2000") +
  theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

library(statebins)
library(tidyverse)

election_2012 <- suppressMessages(read_csv(system.file("extdata", "election2012.csv", package="statebins")))

mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>% 
  statebins(
    font_size=4, dark_label = "white", light_label = "white",
    ggplot2_scale_function = scale_fill_manual,
    name = "Winner",
    values = c(Romney = "#2166ac", Obama = "#b2182b")
  ) +
  theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

library(statebins)
library(tidyverse)
data(USArrests)

USArrests$state <- rownames(USArrests)
statebins(USArrests, value_col="Assault", name = "Assault", round=TRUE) +
  theme_statebins(legend_position="right")

MOAR BREAKING NEWS: geom & faceting

Thomas Wood suggested that faceting would be nice, but that would really require a Geom. So, I took a stab at one:

library(statebins)
library(cdcfluview)
library(hrbrthemes)
library(tidyverse)

flu <- ili_weekly_activity_indicators(2017)

ggplot(flu, aes(state=statename, fill=activity_level)) +
  geom_statebins() +
  coord_equal() +
  viridis::scale_fill_viridis(
    name = "ILI Activity Level  ", limits=c(0,10), breaks=0:10, option = "magma", direction = -1
  ) +
  facet_wrap(~weekend) +
  labs(title="2017-18 Flu Season ILI Activity Level") +
  theme_statebins(base_family = font_ps) +
  theme(plot.title=element_text(size=16, hjust=0)) +
  theme(plot.margin = margin(30,30,30,30))

FIN

It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package.

So, kick the tyres and don’t hesitate to shoot over some feedback!

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font itself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones). I don’t always set them that way as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases) thanks to the new flush_ticks().
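
Hand-crafting it looks something like the following (a sketch with made-up break counts — here eight x breaks and five y breaks — where gg is a plot you have already built; this vectorized hjust/vjust trick is the same one flush_ticks() automates):

gg +
  theme(axis.text.x = element_text(hjust = c(0, rep(0.5, 6), 1))) +  # flush the first/last x labels
  theme(axis.text.y = element_text(vjust = c(0, rep(0.5, 3), 1)))    # same idea on the y axis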

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes)
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  scale_y_continuous(expand=c(0,0), limits=c(10, 50)) +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  facet_wrap(~class, scales="free") +
  labs(
    title="IBM Plex Sans Test",
    subtitle="This is a subtitle to see the how it looks in IBM Plex Sans",
    caption="Source: hrbrthemes & IBM"
  ) +
  theme_ipsum_ps(grid="XY", axis="xy") +
  theme(legend.position="none") -> gg

flush_ticks(gg)

As indicated, this won’t presently work for free scales, but it’s going to save me a few minutes for each plot I would have done the more manual way. Releasing it into the wild will also help catch edge cases I haven’t considered yet. I’ve added an option to flush_ticks() to generate theme(axis.text...) statements vs just returning the plot, so you can include the theme() code directly rather than relying on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.

By now, virtually every major media outlet has covered the “280 Apocalypse”™. For those still not “in the know”, Twitter recently moved the tweet character cap to 280 after a “successful” beta test (some of us have different ideas of what “success” looks like).

I had been on a hiatus from the platform for a while and planned to (and did) jump back into the fray today but wanted to see what my timeline looked like tweet-length-wise. It’s a simple endeavour: use rtweet to grab the timeline, count the characters per-tweet and look up the results. I posted the results of said process to — of course — Twitter and some folks asked me for the code.
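
The timeline version is only a few lines (a sketch — the screen name and tweet count here are illustrative, and rtweet authentication setup is assumed):

library(rtweet)
library(tidyverse)

my_tl <- get_timeline("hrbrmstr", n = 3200)   # grab a timeline (auth already configured)

mutate(my_tl, tweet_length = nchar(text)) %>%
  count(over_140 = tweet_length > 140)        # how many broke the old cap?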

Others used it and there were some discussions as to why timelines looked similar (distribution-wise) with not many Tweets going over 140 characters. One posit I had was that it might be due to client-side limitations since I noted that Twitter for macOS — a terrible client they haven’t updated in ages (but there really aren’t any good ones) — still caps tweets at 140 characters. Others, like Buffer on the web, do have support for 280, so I modified the code a bit to look at the distribution by client.

Rather than bore you with my own timeline analysis, and to help the results be a tad more reproducible (which was another discussion that spawned from the tweet-length tweet), here’s a bit of code that tries to grab the last 3,000 tweets with the #rstats hashtag and plots the distribution by Twitter client:

library(rtweet)
library(ggalt)
library(rprojroot)
library(hrbrthemes)
library(tidyverse)

rt <- find_rstudio_root_file()
rstats_tweet_data_file <- file.path(rt, "data", "2017-11-13-rstats-tweet-search-results.rds")

if (!file.exists(rstats_tweet_data_file)) {
  rstats <- search_tweets("#rstats", 3000) # setting up rtweet is an exercise left to the reader
  write_rds(rstats, rstats_tweet_data_file)
} else {
  rstats <- read_rds(rstats_tweet_data_file)
}

rstats <- mutate(rstats, tweet_length=map_int(text, nchar))  # get the tweet length for each tweet

count(rstats, source) %>%
  filter(n > 5) -> usable_sources  # need data for density + I wanted a nice grid :-)

# We want max tweet length & total # of tweets for sorting & labeling facets
filter(rstats, source %in% usable_sources$source) %>%
  group_by(source) %>%
  summarise(max=max(tweet_length), n=n()) %>%
  arrange(desc(max)) -> ordr

# four breaks per panel regardless of the scales (we're using free-y scales)
there_are_FOUR_breaks <- function(limits) { seq(limits[1], limits[2], length.out=4) }

filter(rstats, source %in% usable_sources$source) %>%
  mutate(source = factor(source, levels=ordr$source,
                         labels=sprintf("%s (n=%s)", ordr$source, ordr$n))) %>%
  ggplot(aes(tweet_length)) +
  geom_bkde(aes(color=source, fill=source), bandwidth=5, alpha=2/3) +
  geom_vline(xintercept=140, linetype="dashed", size=0.25) +
  scale_x_comma(breaks=seq(0,280,70), limits=c(0,280)) +
  scale_y_continuous(breaks=there_are_FOUR_breaks, expand=c(0,0)) +
  facet_wrap(~source, scales="free", ncol=5) +
  labs(x="Tweet length", y="Density",
       title="Tweet length distributions by Twitter client (4.5 days #rstats)",
       subtitle="Twitter client facets in decreasing order of ones with >140 length tweets",
       caption="NOTE free Y axis scales\nBrought to you by #rstats, rtweet & ggalt") +
  theme_ipsum_rc(grid="XY", strip_text_face="bold", strip_text_size=8, axis_text_size=7) +
  theme(panel.spacing.x=unit(5, "pt")) +
  theme(panel.spacing.y=unit(5, "pt")) +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(axis.text.y=element_blank()) +
  theme(legend.position="none")

FIN

While the 140 barrier has definitely been broken, it has not been abused (yet), but the naive character counting is also not perfect since it looks like it doesn’t “count” the same way Twitter-proper does (image “attachments”, as an example, are counted as characters here but they aren’t counted that way in Twitter clients). Bots are also counted as Twitter clients.
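
If you want to get a bit closer to Twitter’s own accounting, one rough approximation is to strip the t.co link entities from the text before counting (still naive — a sketch against the rstats data frame from above, not Twitter’s real algorithm):

mutate(
  rstats,
  # drop t.co links (media attachments, quoted tweets, shared URLs) before counting
  tweet_length_adj = nchar(gsub("\\s*https?://t\\.co/\\S+", "", text))
) %>%
  count(over_140 = tweet_length_adj > 140)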

It’ll be interesting to track this in a few months as folks start to inch-then-blaze past the former hard-limit.

Give the code (or use your timeline info) a go and post a link with your results! You can find an RStudio project directory over on GitHub.

Working remotely has many benefits, but if you work remotely in an area like, say, rural Maine, one of those benefits is not massively speedy internet connections. Being able to go fast and furious on the internet is one of the many things I miss about our time in Seattle and it is unlikely that we’ll be seeing Google Fiber in my small town any time soon. One other issue is that residential plans from evil giants like Comcast come with things like “bandwidth caps”. I suspect many WFH-ers can live within those limits, but I work with internet-scale data and often shunt extracts or whole datasets to the DatCave™ server farm for local processing. As such, I pay an extra penalty as a Comcast “Business-class” user that has little benefit besides getting slightly higher QoS and some faster service response times when there are issues.

Why go into all that? Well, I recently decided to bump up the connection from 100 Mb/s to 150 Mb/s (and managed to do so w/o increasing the overall monthly bill at all) but wanted to have a way to regularly verify I am getting what I’m paying for without having to go to an interactive “speed test” site.

There’s a handy speedtest-cli, which is a Python-based module with a command-line interface that can perform speed tests against Ookla’s legacy server. (You’ll notice that link forwards you to their new socket-based test service; neither speedtest-cli nor the code in the package being shown here uses that yet.)

I run plenty of ruby, python, node and go (et al) programs on the command-line, but I wanted a way to measure this in R-proper (storing individual test results) on a regular basis as well as run something from the command-line whenever I wanted an interactive test. Thus begat speedtest.

Testing the Need for Speed

After you devtools::install_github("hrbrmstr/speedtest") the package, you can either type speedtest::spd_test() at an R console or Rscript --quiet -e 'speedtest::spd_test()' on the command-line to see something like the following:

What you see there is a short-hand version of what’s available in the package:

  • spd_best_servers: Find “best” servers (latency-wise) from master server list
  • spd_closest_servers: Find “closest” servers (geography-wise) from master server list
  • spd_compute_bandwidth: Compute bandwidth from bytes transferred and time taken
  • spd_config: Retrieve client configuration information for the speedtest
  • spd_download_test: Perform a download speed/bandwidth test
  • spd_servers: Retrieve a list of SpeedTest servers
  • spd_upload_test: Perform an upload speed/bandwidth test
  • spd_test: Test your internet speed/bandwidth

The general idiom is to grab the test configuration file, collect the master list of servers, figure out which servers you’re going to test against and perform upload/download tests + collect the resultant statistics:

library(speedtest)
library(stringi)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

config <- spd_config()

servers <- spd_servers(config=config)
closest_servers <- spd_closest_servers(servers, config=config)
only_the_best_severs <- spd_best_servers(closest_servers, config)

glimpse(spd_download_test(closest_servers[1,], config=config))
## Observations: 1
## Variables: 15
## $ url     <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat     <dbl> 42.3875
## $ lng     <dbl> -71.1
## $ name    <chr> "Somerville, MA"
## $ country <chr> "United States"
## $ cc      <chr> "US"
## $ sponsor <chr> "Axcelx Technologies LLC"
## $ id      <chr> "5960"
## $ host    <chr> "speed0.xcelx.net:8080"
## $ url2    <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min     <dbl> 14.40439
## $ mean    <dbl> 60.06834
## $ median  <dbl> 55.28457
## $ max     <dbl> 127.9436
## $ sd      <dbl> 34.20695

glimpse(spd_upload_test(only_the_best_severs[1,], config=config))
## Observations: 1
## Variables: 18
## $ ping_time      <dbl> 0.02712567
## $ total_time     <dbl> 0.059917
## $ retrieval_time <dbl> 2.3e-05
## $ url            <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat            <dbl> 42.3875
## $ lng            <dbl> -71.1
## $ name           <chr> "Somerville, MA"
## $ country        <chr> "United States"
## $ cc             <chr> "US"
## $ sponsor        <chr> "Axcelx Technologies LLC"
## $ id             <chr> "5960"
## $ host           <chr> "speed0.xcelx.net:8080"
## $ url2           <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min            <dbl> 6.240858
## $ mean           <dbl> 9.527599
## $ median         <dbl> 9.303148
## $ max            <dbl> 12.56686
## $ sd             <dbl> 2.451778

Spinning More Wheels

The default operation for each test function is to return summary statistics. If you dig into the package source, or look at how the legacy measuring site performs the speed test tasks, you’ll see that the process involves uploading and downloading a series of files (or, more generally, random data) of increasing size and calculating how long it takes to perform those actions. Smaller size tests will have skewed results on fast connections with low latency, which is why the sites do the incremental tests (and why you see the “needles” move the way they do on interactive tests). We can disable summary statistics and retrieve all of the results of the individual tests (as we’ll do in a bit).
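
The core arithmetic behind a bandwidth figure is simple enough to sketch — bits transferred divided by seconds taken, scaled to megabits per second. This is just an illustration of the idea (compute_bw_mbps() is a made-up name, not the package’s spd_compute_bandwidth() implementation):

compute_bw_mbps <- function(bytes, secs) {
  (bytes * 8) / secs / 1e6  # bytes -> bits, then Mb/s
}

compute_bw_mbps(bytes = 12e6, secs = 0.75)
## [1] 128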

Speed tests are also highly dependent on the processing capability of the target server as well as the network location (hops away), “quality” and load. You’ll often see interactive sites choose the “best” server. There are many ways to do that. One is geography, which has some merit, but should not be the only factor used. Another is latency, which is comparing a small connection test against each server and measuring which one comes back the fastest.
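
Conceptually, the latency approach boils down to timing a tiny request against each candidate and keeping the fastest responders — roughly what spd_best_servers() does for you. A back-of-the-envelope version (a sketch, assuming the server tibbles expose a url column as the glimpse output above suggests):

library(httr)

probe <- map_dbl(closest_servers$url[1:5], ~system.time(GET(.x))[["elapsed"]])
closest_servers[order(probe)[1:3], ]  # the three fastest responders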

It’s straightforward to show the impact of each with a simple test harness. We’ll pick the 3 “closest” servers (geographically), the 3 “best” servers (latency-wise; those may overlap with “closest”) and 3 randomly chosen ones:

set.seed(8675309)

bind_rows(

  closest_servers[1:3,] %>%
    mutate(type="closest"),

  only_the_best_severs[1:3,] %>%
    mutate(type="best"),

  filter(servers, !(id %in% c(closest_servers[1:3,]$id, only_the_best_severs[1:3,]$id))) %>%
    sample_n(3) %>%
    mutate(type="random")

) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 9 x 5
##                   sponsor            name       country
##                     <chr>           <chr>         <chr>
## 1 Axcelx Technologies LLC  Somerville, MA United States
## 2                 Comcast      Boston, MA United States
## 3            Starry, Inc.      Boston, MA United States
## 4 Axcelx Technologies LLC  Somerville, MA United States
## 5 Norwood Light Broadband     Norwood, MA United States
## 6       CCI - New England  Providence, RI United States
## 7                 PirxNet         Gliwice        Poland
## 8           Interoute VDC Los Angeles, CA United States
## 9                   UNPAD         Bandung     Indonesia
## # ... with 2 more variables: host <chr>, type <chr>

As predicted, there are some duplicates, but we’ll perform upload and download tests for each of them and compare the results. First, download:

map_df(1:nrow(to_compare), ~{
  spd_download_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> dl_results_full

mutate(dl_results_full, type=stri_trans_totitle(type)) %>%
  ggplot(aes(type, bw, fill=type)) +
  geom_quasirandom(aes(size=size, color=type), width=0.15, shape=21, stroke=0.25) +
  scale_y_continuous(expand=c(0,5), labels=c(sprintf("%s", seq(0,150,50)), "200 Mb/s"), limits=c(0,200)) +
  scale_size(range=c(2,6)) +
  scale_color_manual(values=c(Random="#b2b2b2", Best="#2b2b2b", Closest="#2b2b2b")) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Download bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

I’m going to avoid hammering really remote servers with the upload test (truth-be-told I just don’t want to wait as long as I know it’s going to take):

bind_rows(
  closest_servers[1:3,] %>% mutate(type="closest"),
  only_the_best_severs[1:3,] %>% mutate(type="best")
) %>%
  distinct(.keep_all=TRUE) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 6 x 5
##                   sponsor           name       country
##                     <chr>          <chr>         <chr>
## 1 Axcelx Technologies LLC Somerville, MA United States
## 2                 Comcast     Boston, MA United States
## 3            Starry, Inc.     Boston, MA United States
## 4 Axcelx Technologies LLC Somerville, MA United States
## 5 Norwood Light Broadband    Norwood, MA United States
## 6       CCI - New England Providence, RI United States
## # ... with 2 more variables: host <chr>, type <chr>

map_df(1:nrow(to_compare), ~{
  spd_upload_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> ul_results_full

ggplot(ul_results_full, aes(x="Upload Test", y=bw)) +
  geom_quasirandom(aes(size=size, fill="col"), width=0.1, shape=21, stroke=0.25, color="#2b2b2b") +
  scale_y_continuous(expand=c(0,0.5), breaks=seq(0,16,4),
                     labels=c(sprintf("%s", seq(0,12,4)), "16 Mb/s"), limits=c(0,16)) +
  scale_size(range=c(2,6)) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Upload bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

As indicated, data payload size definitely impacts the speed tests (as does “proximity”), but you’ll also see variance if you run the same tests again with the same target servers. Note, also, that the “command-line” ready function tries to make the quickest target selection and only chooses the max values.
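
For instance, you can pull the same kind of “best observed” figure out of the full download results gathered above with a quick summarise():

group_by(dl_results_full, type) %>%
  summarise(max_bw = max(bw), median_bw = median(bw)) %>%
  arrange(desc(max_bw))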

FIN

The github site has a list of TODOs and any/all are welcome to pile on (full credit for any contributions, as usual). So kick the tyres, file issues and start measuring!