As many folks know, I live in semi-rural Maine and we were hit pretty hard by a wind+rain storm from Sunday into Monday. The hrbrmstr compound has had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday, and still does not as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2]
    mul <- case_when(
      stri_detect_regex(val, "[Kk]") ~ 1000,
      stri_detect_regex(val, "[Mm]") ~ 1000000,
      TRUE ~ 1
    ) 
    val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "")
    as.numeric(val) * mul 
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change, but you can tweak this code if it does):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% 
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% 
  mutate(number_out = words_to_num(text)) %>%  
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us <https://twitter.com/PowerOutage_us>"
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).
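
As a quick sanity check of that helper, here’s what it does with a few made-up tweet strings (the text below is illustrative, not actual @PowerOutage_us output):

words_to_num(c(
  "1.2 Million #PowerOutages in the #NorthEast",
  "850K #PowerOutages on the #EastCoast",
  "712 #PowerOutages in #Maine"
))
## [1] 1200000  850000     712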

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible to shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(24, 18, 16, 14, 12),
    card = c(22, 16, 14, 12, 10)
  ) -> font_sizes
  
  theme_ipsum_rc(
    grid = grid, 
    plot_title_size = font_sizes[1], 
    subtitle_size = font_sizes[2], 
    axis_title_size = font_sizes[3], 
    axis_text_size = font_sizes[4], 
    caption_size = font_sizes[5]
  )
  
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (for now) use rtweet, as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may eventually swap in rtweet and bypass the current method, which relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140-character compliance), plus choose a style (stream or card, defaulting to stream) and decide whether you’re cool with the “retina” default.
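
That length check could be as simple as a guard near the top of the function; a minimal sketch (not part of the current code):

if (nchar(status) > 140) {
  warning("Status text is longer than 140 characters and may be rejected or truncated.")
}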

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It also returns the magick object it creates in the event you want to do something more with it:

gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"), 
                     retina=TRUE, send = FALSE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(w=1024, h=512),
    card = c(w=800, h=320)
  ) -> dims
  
  dims["res"] <- 72
  
  if (retina) dims <- dims * 2
  
  fig <- image_graph(width=dims["w"], height=dims["h"], res=dims["res"])
  print(g)
  dev.off()
  
  if (send) {
    
    message("Posting image to twitter...")
    
    tf <- tempfile(fileext = ".png")
    image_write(fig, tf, format="png")
    
    # Create an app at apps.twitter.com w/callback URL of http://127.0.0.1:1410
    # Save the app name, consumer key and secret to the following
    # Environment variables
    
    app <- oauth_app(
      appname = Sys.getenv("TWITTER_APP_NAME"),
      key = Sys.getenv("TWITTER_CONSUMER_KEY"),
      secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
    )
    
    twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), app)
    
    POST(
      url = "https://api.twitter.com/1.1/statuses/update_with_media.json",
      config(token = twitter_token), 
      body = list(
        status = status,
        media = upload_file(path.expand(tf))
      )
    ) -> res
    
    warn_for_status(res)
    
    unlink(tf)
    
  }
  
  fig
  
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me about anything.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized; it was hastily crafted.)
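
Speaking of which, here’s one possible vectorized rewrite of that helper (just a sketch; same logic as the original, minus the per-element map_dbl()):

words_to_num2 <- function(x) {
  # everything before " #PowerOutages" is the outage estimate
  val <- stringi::stri_match_first_regex(x, "^([[:print:]]+) #PowerOutages")[, 2]
  mul <- dplyr::case_when(
    stringi::stri_detect_regex(val, "[Kk]") ~ 1e3,
    stringi::stri_detect_regex(val, "[Mm]") ~ 1e6,
    TRUE ~ 1
  )
  as.numeric(stringi::stri_replace_all_regex(val, "[^[:digit:]\\.]", "")) * mul
}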

Way back in July of 2009, the first version of the twitteR package was published by Geoff Jentry on CRAN. Since then it has seen 28 updates, finally breaking the 0.x.y barrier into 1.x.y territory in March of 2013 and receiving its last update in July of 2015.

For a very long time, the twitteR package was the way to siphon precious nuggets of 140 character data from that platform and is the top hit when one searches for r twitter package. It even ha[sd] its own mailing list and is quite popular, judging by the total downloads stats from RStudio’s CRAN logs.

I blog today to suggest there is a better way to work with Twitter data from R, especially if your central use-case is searching Twitter and mining tweet data. This new way is rtweet by Michael Kearney. It popped up on the scene back in August of 2016 and receives quite a bit of attention from the developer, especially on GitHub.

This post is short and mostly designed to convince you to (a) try out the package and (b) blog and tweet about the package — if you do agree that it’s the best modern way to work with Twitter from R — to raise awareness about it. Because of that focus, I won’t be delving into all of rtweet’s seekrits, but you can explore them yourself on its spiffy pkgdown site.

While both packages have nigh complete access to the Twitter API, I posit that the quintessential use-case for working with Twitter in R is searching through tweets/users and then performing various types of data mining on the retrieved results. To that end, I’m going to show one use-case (out of many potential ones) that will both save you API-time and post-API munging time in order to convince you to switch to rtweet and spread the word about it.

Data-mining 300 #rstats Tweets: A Play in Two Acts

We’ll search Twitter for #rstats-tagged tweets with both twitteR and rtweet, starting with the former:

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj

Both packages follow similar idioms, and you need to have done some prep work by creating a Twitter “app” (both packages have instructions for that).

That operation took about 3 seconds on a fast internet connection and a wicked fast computer. What you get back is definitely usable data, but it’s in lists of custom objects. This is due to the way that package models the Twitter API onto custom R objects. It’s elegant, but also likely overkill for most operations. You can use something like purrr::map_df(r_folks, as.data.frame) to get that list into a data frame, but there are some other “gotchas”, such as text encoding (on a later run of this code both dplyr::glimpse() and str() gave me “invalid multibyte string” errors, but that same thing did not happen with rtweet).
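
For completeness, that list-to-data-frame step might look like this (twitteR also ships a twListToDF() helper that does the same job):

# flatten the list of twitteR status objects into a single data frame
r_folks_df <- purrr::map_df(r_folks, as.data.frame)
# or, equivalently:
# r_folks_df <- twitteR::twListToDF(r_folks)

dplyr::glimpse(r_folks_df)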

Here’s the rtweet version:

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                    <chr> "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                        <chr> "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                     <dttm> 2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                      <chr> "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                           <chr> "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                  <int> 0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id              <chr> NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                           <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                         <chr> "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                       <chr> NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                      <chr> NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded             <chr> NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                   <chr> "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                  <chr> "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name           <chr> NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id               <chr> NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                       <chr> "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

That took about 1.5 seconds and provides a tidy, immediately usable data structure.

But, that’s not all!

Michael has support for accessing the Twitter streaming API and also has some handy plot functions for quickly exploring retrieved Twitter content, enabling you to make pretty spiffy plots like this with almost no effort (this was pirated from Michael’s package website):
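A rough sketch of the sort of calls involved (function names are from rtweet; exact arguments and defaults may differ between versions):

library(rtweet)

# sample the streaming API for ~30 seconds of #rstats tweets
rt_stream <- stream_tweets("#rstats", timeout = 30)

# quick time-series plot of the earlier search results
ts_plot(rtweet_folks, by = "hours")
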

Fin

If the legacy twitteR package is already in your workflows, there may be little to gain. But, I would suggest that R folks give rtweet a try and blog about your experiences. It’ll give others a chance to see usage in different contexts and will also help spread the word about this alternative package and help bump up its pagerank.

As I said on Twitter:

Make sure to try out the GitHub version as well since it has gained some new functionality not currently in CRAN and don’t hesitate to ping @kearneymw on Twitter (though he may regret me suggesting that :-).
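
For reference, that’s a one-liner (the repo path below is where the package lived at the time of writing):

devtools::install_github("mkearney/rtweet")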

I’ve blogged a bit about robots.txt — the rules file that documents a site’s “robots exclusion” standard that instructs web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and is often ignored by crawlers and content owners alike.

There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with the robots exclusion standard, including the name (security.txt) and format (policy directives are defined with a simple syntax — see Chapter 5 of the Debian Policy Manual).

One core difference is that this file is intended for humans. If you are a general user and visit a site and notice something “off” (security-wise), or if you are an honest, honorable security researcher who found a vulnerability or weakness on a site, this security.txt file should make it easier to contact the appropriate folks at the site to help them identify and resolve security issues. The IETF abstract summarizes the intent well:

When security risks in web services are discovered by independent security researchers who understand the severity of the risk, they often lack the channels to properly disclose them. As a result, security issues may be left unreported. Security.txt defines a standard to help organizations define the process for security researchers to securely disclose security vulnerabilities.

A big change from robots.txt is where the security.txt file goes. The IETF standard is still in draft state so the location may change, but the current thinking is to have it go into /.well-known/security.txt vs being placed in the top level root (i.e. it’s not supposed to be in /security.txt). If you aren’t familiar with the .well-known directory, give RFC 5785 a read.

You can visit the general information site to find out more and install a development version of a Chrome extension that will make it easier to pull up this info in your browser if you find an issue.

Here’s the security.txt for my site:

Contact: bob@rud.is
Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
Disclosure: Full

With that info, you know where to contact me, have the ability to encrypt your message and know that I’ll give you credit and will disclose the bugs openly.

So, Why the [R] tag?

Ah, yes. This post is in the R RSS category feed for a reason. I do at-scale analysis of the web for a living and will be tracking the adoption of security.txt across the internet (initially with the Umbrella Top 1m and a choice list of sites with more categorical data associated with them) over time. My esteemed colleague @jhartftw is handling the crawling part, but I needed a way to speedily read in these files for a broader analysis. So, I made an R package: securitytxt.

It’s pretty easy to use. Here’s how to install it and use one of the functions to generate a security.txt target URL for a site:

devtools::install_github("hrbrmstr/securitytxt")

library(securitytxt)

(xurl <- sectxt_url("https://rud.is/b"))
## [1] "https://rud.is/.well-known/security.txt"

This is how you read in and parse a security.txt file:

(x <- sectxt(url(xurl)))
## <Web Security Policies Object>
## Contact: bob@rud.is
## Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## Disclosure: Full

And, this is how you turn that into a usable data frame:

sectxt_info(x)
##          key                                                                                         value
## 1    contact                                                                                    bob@rud.is
## 2 encryption https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## 3 disclosure                                                                                          Full

There’s also a function to validate that the keys are within the current IETF standard. That will become more useful once the standard moves out of draft status.
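
If you want to try it now, it’s a one-liner against the parsed object (I’m assuming here the validator is exposed as sectxt_validate(); check the package reference for the exact name):

# validate the parsed keys against the current draft (function name assumed)
sectxt_validate(x)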

FIN

So, definitely adopt the standard and feel invited to kick the tyres on the package. Don’t hesitate to jump on board if you have ideas for how you’d like to extend the package, and drop a note in the comments if you have questions on it or on adopting the standard for your site.

NOTE: If the usual aggregators are picking this up and there are humans curating said aggregators, this post is/was not intended as something to go into the “data science” aggregation sites. Just personal commentary with code in the event someone stumbles across it and wanted to double check me. These “data-dives” help me cope with these types of horrible events.

The “data science” feed URL is https://rud.is/b/category/r/feed/.

I saw the story about body camera footage from an officer-involved police stop & fatal shooting in Salt Lake City.
The individual killed was a felon — convicted of aggravated assault — with an outstanding warrant.

He tried to run. At some point in the brief chase he pivoted and appeared to be reaching for a weapon — likely a knife, which was confirmed after the fact.

One officer pulled a Taser. Another pulled a gun. Officer Fox — the one who fired the gun — said he was terrified by how close Mr. Harmon was to the officers when Mr. Harmon stopped and turned toward them.

I wasn’t there. I don’t risk getting injured or killed in the line of duty every day. I don’t face down armed suspects in fast-moving, tense situations.

But, I’m weary of this being a cut+paste story that is a nigh weekly event in America.

Officers are killed by suspects as well. It’s equally tragic.

Below is just “data”. Just a visual documentary of where we are 17-ish years into the 21st Century in America.

And, most of America seems to be OK with this. Then again, most of America is OK with the “price of freedom” being one mass shooting a day.

I’m not.

I scaled the Y axis the same in both faceted charts to make it easier to glance across both sets of tragedies.

This was generated on Sunday, October 8, 2017. If you run the code after that date, remove the saved data files and tweak the Y-scale limits since the death toll will rise.

library(httr)
library(rvest)
library(stringi)
library(hrbrthemes)
library(tidyverse)

read.table(sep=":", stringsAsFactors=FALSE, header=TRUE, 
           text="race:description
W:White, non-Hispanic
B:Black, non-Hispanic
H:Hispanic
N:Native American
A:Asian
None:Other/Unknown
O:Other") -> rdf

wapo_data_url <- "https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv"
shootings_file <- basename(wapo_data_url)
if (!file.exists(shootings_file)) download.file(wapo_data_url, shootings_file)

cols(
  id = col_integer(),
  name = col_character(),
  date = col_date(format = ""),
  manner_of_death = col_character(),
  armed = col_character(),
  age = col_integer(),
  gender = col_character(),
  race = col_character(),
  city = col_character(),
  state = col_character(),
  signs_of_mental_illness = col_character(),
  threat_level = col_character(),
  flee = col_character(),
  body_camera = col_character()
) -> shootings_cols

read_csv(shootings_file, col_types = shootings_cols) %>% 
  mutate(yr = lubridate::year(date), wk = lubridate::week(date)) %>% 
  filter(yr >= 2017) %>% 
  mutate(race = ifelse(is.na(race), "None", race)) %>% 
  mutate(race = ifelse(race=="O", "None", race)) %>% 
  count(race, wk) %>% 
  left_join(rdf, by="race") %>% 
  mutate(description = factor(description, levels=rdf$description)) -> xdf

lod_url <- "https://www.odmp.org/search/year/2017?ref=sidebar"
lod_rds <- "officer_lod.rds"
if (!file.exists(lod_rds)) {
  res <- httr::GET(lod_url)
  write_rds(res, lod_rds)
} else {
  res <- read_rds(lod_rds)
}
pg <- httr::content(res, as="parsed", encoding = "UTF-8")

html_nodes(pg, xpath=".//table[contains(., 'Detective Chad William Parque')]") %>% 
  html_nodes(xpath=".//td[contains(., 'EOW')]") %>% 
  html_text() %>% 
  stri_extract_all_regex("(EOW:[[:space:]]+(.*)\n|Cause of Death:[[:space:]]+(.*)\n)", simplify = TRUE) %>% 
  as_data_frame() %>% 
  mutate_all(~{
    stri_replace_first_regex(.x, "^[[:alpha:][:space:]]+: ", "") %>% 
      stri_trim_both() 
    }
  ) %>% 
  set_names(c("day", "cause")) %>% 
  mutate(day = as.Date(day, "%A, %B %e, %Y"), wk = lubridate::week(day)) %>% 
  count(wk, cause) -> odf 

ggplot(xdf, aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0)) +
  scale_y_comma(limits=c(0,15)) +
  facet_wrap(~description, scales="free_x") +
  labs(x="2017 Week #", y="# Deaths",
       title="Weekly Fatal Police Shootings in 2017",
       subtitle=sprintf("2017 total: %s", scales::comma(sum(xdf$n))),
       caption="Source: https://www.washingtonpost.com/graphics/national/police-shootings-2017/") +
  theme_ipsum_rc(grid="Y")

count(odf, cause, wt=n, sort=TRUE) -> ordr

mutate(odf, cause = factor(cause, levels=ordr$cause)) %>% 
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0)) +
  scale_x_continuous(limits=c(0, 40)) +
  scale_y_comma(limits=c(0,15)) +
  facet_wrap(~cause, ncol=3, scales="free_x") +
  labs(x="2017 Week #", y="# Deaths",
       title="Weekly Officer Line of Duty Deaths in 2017",
       subtitle=sprintf("2017 total: %s", scales::comma(sum(odf$n))),
       caption="Source: https://www.odmp.org/search/year/2017") +
  theme_ipsum_rc(grid="Y")

The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page where you can interactively work with the API if that’s how you roll.

Since my newsflash package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed to be a good home for a Third Eye interface to live.

Basic usage

You can read long-form details of the Third Eye service on their site. The TLDR is that they provide two feeds:

  • a “raw” one which has massive duplicates and tons of errors
  • a “clean” one that filters out duplicates, cleans up the text and is much better to work with

You can retrieve either with newsflash::read_chyrons() but the default is to use the clean feed. If you are studying text processing and/or NLP/text-cleanup via machine learning, then the raw feed may be very interesting for you. I suspect most data journalists will want to use the clean feed that also powers the IA chyron twitter bots.

Since it’s the Internet Archive, they’re awesome at providing metadata about their data. Heck, even their metadata has metadata about metadata. We can use the fact that they provide a metadata feed to enable listing available chyron archive dates:

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

list_chyrons()
## # A tibble: 61 x 3
##            ts    type     size
##        <date>   <chr>    <dbl>
##  1 2017-09-30 cleaned   539061
##  2 2017-09-30     raw 17927121
##  3 2017-09-29 cleaned   635812
##  4 2017-09-29     raw 19234407
##  5 2017-09-28 cleaned   414067
##  6 2017-09-28     raw 12663606
##  7 2017-09-27 cleaned   613474
##  8 2017-09-27     raw 20442644
##  9 2017-09-26 cleaned   659930
## 10 2017-09-26     raw 19942951
## # ... with 51 more rows

Reading the chyrons in only requires passing in a Date object or a YYYY-mm-dd format date string:

chyrons <- read_chyrons("2017-09-30")


glimpse(chyrons)
## Observations: 2,729
## Variables: 5
## $ ts       <dttm> 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30...
## $ channel  <chr> "BBCNEWS", "CNNW", "FOXNEWSW", "BBCNEWS", "CNNW", "MSNBCW", "BBCNEWS", "CN...
## $ duration <int> 18, 42, 26, 10, 47, 19, 14, 62, 26, 11, 45, 17, 35, 11, 62, 32, 35, 35, 15...
## $ details  <chr> "BBCNEWS_20170929_233000_Race_and_Pace/start/1800", "CNNW_20170929_230000_...
## $ text     <chr> "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT", "TRUMP BRAGS...

You get five columns in a data frame on a successful retrieval:

  • ts (POSIXct) chyron timestamp
  • channel (character) news channel the chyron appeared on
  • duration (integer) see Description
  • details (character) Internet Archive details path
  • text (character) the chyron text

We’ll talk about the details path in a bit. The text is likely what you want, so here’s a sample:

head(chyrons$text, 30)
##  [1] "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT"                                                                                                                                                                                            
##  [2] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nAnderson Cooper"                                                                                                                                      
##  [3] "ALIFORNIA STUDENT SWIPES 'MAGA' HAT"                                                                                                                                                                                                                   
##  [4] "US HEALTH SECRETARY QUITS. Mr Price apologised for use O126 private \\ufb02ights since May\\nUS HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nLAURA BICKER. Washington"                                                     
##  [5] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
##  [6] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
##  [7] "US HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nUS HEALTH SECRETARY QUITS. Government otficials required to take commercial \\ufb02ights\\nUS HEALTH SECRETARY QUITS. Scandal emerged after..."                            
##  [8] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
##  [9] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nTRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                        
## [10] "US HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for use O126..."                       
## [11] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nHHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL. . Ryan Nobles (J\\\\N Washington Correspondent"                                                                                                       
## [12] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I"                                                                                                                                                                                                      
## [13] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT"                                                                                                                                                              
## [14] "US HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for..."                                
## [15] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
## [16] "BIZARRE LIBERAL MELTDOWNS I\\nTUCKER & THE CAT IN THE HAT I. . _ < 'rnnwnn FAD! cnm tint-\\ufb01nk"                                                                                                                                                    
## [17] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [18] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
## [19] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nBIZARRE LIBERAL MELTDOWN I"                                                                                                                                                                         
## [20] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                                                                     
## [21] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                                                                        
## [22] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nSCHOOL LIBRARIAN REJECTS DR. SEUSS. BOOKS GIFTED BY MELANIA TRUMP. . _' tnnx'nkr"                                                                                                                   
## [23] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [24] "YEMEN WAR CRIMES. UN Human Rights Council agrees on investigation\\nINIGO MENDEZ DE VIGO. Spanish Education Minister"                                                                                                                                  
## [25] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\"\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\". . Mavor Carmen Yulin Cruz San Juan,..."
## [26] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS"                                                                                                                                                                                                        
## [27] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
## [28] "TRUMP ASIA TOUR. US President to visit Japan, South Korea and China\\nYEMEN WAR CRIMES. UN Human Rights Council agrees on investigation"                                                                                                               
## [29] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE\\nSAN JUAN MAYOR: \\\"MAD AS HELLII OVER HURRICANE RESPONSE. . Dr. Saniav Gupta (J\\\\N Chief Medical Correspondent"                                                                       
## [30] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE"

Be warned: even the “clean” text is often kinda messy.

For now, there are only four channels, so it’s easy to show a quick example. Since chyrons are supposed to be super-important things you need to know NOW, let’s see how many times Puerto Rico was mentioned on them in the above archive. NOTE: This is a quick example, not a thorough one. I’m searching for some key letter combinations to see just mentions of something looking like “Puerto Rico”. “San Juan” and other text that might be associated with the topic aren’t being considered for this toy example.

mutate(
  chyrons, 
  hour = lubridate::hour(ts),
  text = tolower(text),
  mention = grepl("erto ri", text)
) %>% 
  filter(mention) %>% 
  count(hour, channel) %>% 
  ggplot(aes(hour, n)) +
  geom_segment(aes(xend=hour, yend=0)) +
  scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
                     labels=sprintf("%02d:00", seq(0, 23, 6))) +
  scale_y_continuous(name="# Chyrons", limits=c(0,30)) +
  facet_wrap(~channel, scales="free") +
  labs(title="Chyrons mentioning 'Puerto Rico' per hour per channel",
       subtitle="Chyron date: 2017-09-30",
       caption="Source: Internet Archive Third Eye project & <github.com/hrbrmstr/newsflash>") +
  theme_ipsum_rc(grid="Y")

Details, details, details

Entries in the details column look like this:

head(chyrons$details)
## [1] "BBCNEWS_20170929_233000_Race_and_Pace/start/1800"                   
## [2] "CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600"              
## [3] "FOXNEWSW_20170929_230000_The_Story_With_Martha_MacCallum/start/3600"
## [4] "BBCNEWS_20170930_000000_BBC_News/start/60"                          
## [5] "CNNW_20170930_000000_Anderson_Cooper_360/start/60"                  
## [6] "MSNBCW_20170930_000000_All_In_With_Chris_Hayes/start/60"

They are path fragments that can be attached to a URL prefix to see the news clip from that station on that day/time. newsflash::view_clip() does that work for you:

view_clip(chyrons$details[2])

The URL for that is https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660 in the event the iframe load failed or you really like being annoyed with cable news shows.
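
If you’d rather build that URL yourself, the details fragment has everything you need; this little sketch reproduces the URL above (the 60-second end offset is inferred from that example):

details <- chyrons$details[2]
start <- as.integer(basename(details))  # the trailing /start/<seconds> offset
sprintf("https://archive.org/details/%s/end/%d", details, start + 60)
## [1] "https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660"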

FIN

Grab the package on GitHub, kick the tyres and don’t hesitate to file issues, questions or jump on board with package development. There’s plenty of room for improvement before it hits CRAN and your ideas are most welcome.

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what they are using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter in the URL of your own blog or company and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers    <NA>        100
## 2     Debian     Operating Systems    headers    <NA>        100
## 3  Modernizr JavaScript Frameworks     script    <NA>        100
## 4      Nginx           Web Servers    headers    <NA>        100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript, it was trivial to down-port it to a V8-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner rather than later (well, wanted it in R, anyway).
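
If you haven’t used V8 from R before, the basic mechanism looks like this (a trivial, self-contained sketch of the approach, not rappalyzer’s actual internals):

library(V8)

ctx <- v8()                                       # spin up a JavaScript context
ctx$eval("function add(a, b) { return a + b; }")  # stand-in for the down-ported Wappalyzer JS
ctx$call("add", 1, 2)                             # call into the JS from R
## [1] 3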

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to a more modern version of the V8 source code, which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8: the ones that are slow are slow in all the language ports. This is due to the way the regular expressions are checked against the site content, and to the use of regular expressions in the first place. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and avoid some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the expository and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt <- find_rstudio_root_file()

if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {

  f1k <- read_csv(file.path(rt, "data", "f1k.csv"))

  targets <- pull(f1k, website)

  results <- list()
  errors <- list()

  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }

  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }

  pool <- multi_set(total_con = 20)

  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })

  multi_run(pool = pool)

  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")

} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl package to set up all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results <- keep(results, ~.x$status_code < 300)
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results

  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr request object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future package.
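
For the curious, a parallel run might look roughly like this with future (just a sketch; it skips the per-result url tagging and the progress bar from the sequential version):

library(future)

plan(multisession)

# one future per response, then collect the results
futs <- lapply(results, function(res) future(rappalyze(res)))
rapp_results <- dplyr::bind_rows(lapply(futs, value))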

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf <- distinct(rapp_results, name, sector, category, tech)

sort(unique(xdf$category))
##  [1] "Advertising Networks"  "Analytics"             "Blogs"
##  [4] "Cache Tools"           "Captchas"              "CMS"
##  [7] "Ecommerce"             "Editors"               "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics"   "Maps"
## [13] "Marketing Automation"  "Miscellaneous"         "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers"          "Video Players"
## [19] "Web Frameworks"        "Widgets"

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s a lot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set, so definitely make sure to share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf <- count(xdf, sector, tech)

cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)

circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
             color_max = "hsl(341,30%,40%)", width = "100%", height="800px")

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality into a package (if warranted). Some make it to CRAN (like those two), others stay on GitHub.

This (short) post is about two new ones: webhose and pigeon.

The webhose package is an API interface package to https://webhose.io/, which is an interesting service that scrapes the web & “dark web” and provides a short but handy API to retrieving the content using a fairly intuitive query language.

The pigeon package is a hastily-hacked-together wrapper around pgn-extract, a cross-platform utility written in C for working with chess game data in PGN format.

I’m not going to have any time to round out the corners on either of those packages but will gladly make time to help anyone who wants to jump on board to either (or both!) of them.

Working on either package will let you get your feet wet (or, wetter) with R package development. They both need (a quick setup sketch follows this list):

  • more functions + docs!
  • test harness setup
  • Travis-CI, code coverage & AppVeyor configs
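
If you want to tackle those infrastructure bits, the devtools helpers cover most of it (a sketch; these were the helper names at the time of writing and have since migrated to usethis):

# run from within the package project
devtools::use_testthat()   # testthat harness skeleton
devtools::use_travis()     # .travis.yml for Travis-CI
devtools::use_appveyor()   # appveyor.yml for Windows builds
devtools::use_coverage()   # code coverage (codecov/coveralls) config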

Working on webhose will give you experience dealing with HTTP APIs (and their API is super clean to work with) and possibly introduce you to an area of research you’re not already in.

Working on pigeon will give you experience integrating C[++] & R code (and the C-library I’ve hack-wrapped definitely needs some care & feeding to ensure no memory leaks are present).

Neither is “mission critical”. The world will gladly go on w/o either of them. But, if you wanted a judgement-free place to hone your R package skills or try/learn new things (and ask questions along the way), file an issue, drop a note in the comments or hit me up on Twitter. If you’re an experienced R package-r and want to “own” either of these, that’s great as well!

I’d encourage all nascent R coders to adopt “SODD” and use SO as a place to hone your skills while you help others (and you don’t need to write a package for every answer :-).

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar. 

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).

I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(

  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

But, I don’t care about Crawl-Delay!

OK, fine. Do you care about fetchability? We can speed that up, too!

rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)