Our family has been reading, listening to and watching “A Christmas Carol” for just about 30 years now. The other day I got it into my crazy noggin to perform a sentiment analysis on it and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.
One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with the tidytext package and the book that goes with it.
(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)
Let us begin!
STAVE I: hrbrmstr’s Code
We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:
library(rprojroot)
library(gutenbergr)
library(hrbrthemes)
library(stringi)
library(tidytext)
library(tidyverse)
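# Locate the RStudio project root so the paths work from any working directory
rt <- find_rstudio_root_file()
carol_rds <- file.path(rt, "data", "carol.rds")
# Download once, then reuse the cached copy on later runs
if (!file.exists(carol_rds)) {
  carol_df <- gutenberg_download("46")
  write_rds(carol_df, carol_rds)
} else {
  carol_df <- read_rds(carol_rds)
}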
How did I know to use 46? We can use gutenberg_works() to get to that info:
gutenberg_works(author=="Dickens, Charles")
## # A tibble: 74 x 8
## gutenberg_id title
## <int> <chr>
## 1 46 A Christmas Carol in Prose; Being a Ghost Story of Christmas
## 2 98 A Tale of Two Cities
## 3 564 The Mystery of Edwin Drood
## 4 580 The Pickwick Papers
## 5 588 Master Humphrey's Clock
## 6 644 The Haunted Man and the Ghost's Bargain
## 7 650 Pictures from Italy
## 8 653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
## 9 675 American Notes
## 10 678 The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author <chr>, gutenberg_author_id <int>, language <chr>,
## # gutenberg_bookshelf <chr>, rights <chr>, has_text <lgl>
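If you end up caching more than one download, the fetch-once-then-read-from-disk pattern from the start of this stave can be folded into a small helper. This is only a sketch: `cache_rds()` and `fake_download()` are hypothetical names (not part of any package), and it uses base R's `saveRDS()`/`readRDS()` rather than the readr wrappers so it runs without extra packages.

```r
# Hypothetical helper: return the cached object if it exists,
# otherwise call `fetch()`, save the result, and return it.
cache_rds <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  # Create the target directory if needed (e.g. a fresh project with no data/ dir)
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  value <- fetch()
  saveRDS(value, path)
  value
}

# Stand-in for a real download so the sketch runs offline; in the post
# this would be something like function() gutenberg_download("46")
download_calls <- 0
fake_download <- function() {
  download_calls <<- download_calls + 1
  data.frame(
    text = c("STAVE I: MARLEY'S GHOST", "Marley was dead: to begin with."),
    stringsAsFactors = FALSE
  )
}

demo_path <- file.path(tempdir(), "carol-cache-demo", "carol.rds")
unlink(demo_path)  # make sure the demo starts without a cached file

first  <- cache_rds(demo_path, fake_download)  # calls fake_download() and writes the file
second <- cache_rds(demo_path, fake_download)  # reads the file; no second "download"
```

Passing the fetch step in as a function keeps the helper ignorant of where the data comes from, so the same wrapper would work for any slow call whose result you want to keep around.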
STAVE II: The first of three wrangles
We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:
#' Convenience only
carol_txt <- carol_df$text
# Just want the chapters (staves): drop everything before the first stave heading
first_stave <- which(grepl("STAVE I:", carol_txt))[1]
carol_txt <- carol_txt[first_stave:length(carol_txt)]
#' We'll need this later to make prettier facet titles
tibble(
  stave = 1:5,
  title = sprintf("Stave %s: %s", stave, carol_txt[stri_detect_fixed(carol_txt, "STAVE")] %>%
    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
    stri_trans_totitle())
) -> stave_titles
stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.
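As one of those other ways, here is a rough base-R take on the same title munging. It is a sketch under assumptions: the `headings` vector is a made-up two-line sample in the "STAVE <numeral>: <TITLE>" shape the regex above expects, and `to_title()` is a hypothetical helper that only upper-cases letters following whitespace, so it is a simpler stand-in for stri_trans_totitle() rather than an equivalent.

```r
# Made-up sample of heading lines (an assumption, not pulled from the download)
headings <- c(
  "STAVE I: MARLEY'S GHOST",
  "STAVE II: THE FIRST OF THE THREE SPIRITS"
)

# Hypothetical helper: lower-case everything, then upper-case the first
# letter of each whitespace-separated word (so "marley's" stays "Marley's")
to_title <- function(x) {
  gsub("(^|\\s)([a-z])", "\\1\\U\\2", tolower(x), perl = TRUE)
}

# Strip the "STAVE <numeral>: " prefix, then title-case what is left
titles <- to_title(sub("^STAVE [[:alpha:]]{1,3}: ", "", headings))

stave_titles_base <- data.frame(
  stave = seq_along(titles),
  title = sprintf("Stave %s: %s", seq_along(titles), titles),
  stringsAsFactors = FALSE
)
stave_titles_base
```

Splitting on whitespace rather than a word boundary is deliberate: a `\\b`-based version would also capitalise the "s" after the apostrophe.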
You’ll also see that I made sure we started at the first stave break rather than including the title bits in the analysis.
Now, we need to prep the text for text analysis.
STAVE III: The second of three wrangles
There are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!
Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.
#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
tibble(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token="regex", pattern="STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>%
  group_by(stave) %>%
  mutate(para = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token="sentences") %>%
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>%
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens
carol_tokens
## A tibble: 28,710 x 4
## stave para sent word
## <int> <int> <int> <chr>
## 1 1 1 1 marley
## 2 1 1 1 was
## 3 1 1 1 dead
## 4 1 1 1 to
## 5 1 1 1 begin
## 6 1 1 1 with
## 7 1 1 1 there
## 8 1 1 1 is
## 9 1 1 1 no
## 10 1 1 1 doubt
## ... with 28,700 more rows
By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.
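To make the grouping point concrete, here is a small sketch on a made-up token table with the same stave/para/sent/word columns as carol_tokens. The `toy_tokens` data is invented for illustration, and I'm using base R's aggregate() so it runs without the tidyverse; count() or group_by() from dplyr would express the same idea.

```r
# Invented tokens in the same shape as carol_tokens (an assumption for the demo)
toy_tokens <- data.frame(
  stave = c(1, 1, 1, 1, 2, 2),
  para  = c(1, 1, 2, 2, 1, 1),
  sent  = c(1, 2, 1, 1, 1, 1),
  word  = c("marley", "dead", "scrooge", "knew", "spirit", "past"),
  stringsAsFactors = FALSE
)

# Coarse grouping: how many words in each stave?
words_per_stave <- aggregate(word ~ stave, data = toy_tokens, FUN = length)

# Fine grouping: how many words in each sentence of each paragraph of each stave?
words_per_sentence <- aggregate(word ~ stave + para + sent, data = toy_tokens, FUN = length)

words_per_stave
words_per_sentence
```

In both results the `word` column holds the count. The only thing that changed between the two summaries is the set of index columns on the right-hand side of the formula.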
STAVE IV: The third of three wrangles
Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment-al posts (including this one) take a naive approach: basic word matching, looking only at 1-grams. One reason I didn’t go further was to keep the code accessible to new R folk (since I primarily blog for new R folk :-). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.
#' Retrieve sentiments and compute them.
#'
#' I left the `index` in vs just using `para` since it'll make it easier to reuse
#' this block (which I'm not doing but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent
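If the join-count-subtract chain is hard to picture, this toy version may help. It is a sketch, not the pipeline above: `toy_tokens` and `toy_lexicon` are invented (the lexicon is four hand-picked words, NOT the NRC lexicon), and it uses base R's merge() and aggregate() so it runs without any packages.

```r
# Invented tokens, same idea as carol_tokens minus the sentence index
toy_tokens <- data.frame(
  stave = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  para  = c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2),
  word  = c("dismal", "horror", "face", "merry", "nephew",
            "joy", "merry", "dismal", "the", "clock"),
  stringsAsFactors = FALSE
)

# Tiny hand-made lexicon; an assumption for illustration only
toy_lexicon <- data.frame(
  word      = c("merry", "joy", "dismal", "horror"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: only words found in the lexicon survive
matched <- merge(toy_tokens, toy_lexicon, by = "word")

# +1 per positive word, -1 per negative word; the sum is positive - negative
matched$score <- ifelse(matched$sentiment == "positive", 1, -1)
para_sentiment <- aggregate(score ~ stave + para, data = matched, FUN = sum)
para_sentiment
```

Note that stave 2, paragraph 2 is missing from the result because none of its words are in the lexicon. The inner join in the real pipeline has the same effect: paragraphs with no lexicon matches at all never reach the plot.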
STAVE V: The end of it
Now, we just need to do some really basic ggplot-ing to get to our desired result:
ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend=index, yend=0, color=title), size=0.33) +
  scale_x_comma(limits=range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales="free_x", ncol=5) +
  labs(x=NULL, y="Sentiment",
       title="Sentiment Analysis of A Christmas Carol",
       subtitle="By stave & ¶",
       caption="Humbug!") +
  theme_ipsum_rc(grid="Y", axis_text_size = 8, strip_text_face = "italic", strip_text_size = 10.5) +
  theme(legend.position="none")
You’ll want to tap/click on that to make it bigger.
Despite using a naive analysis, I think it tracks pretty well with the flow of the book.
Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.
The truly terrible (-10 sentiment) paragraph also makes sense:
Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.
(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)
filter(
  carol_tokens, stave == 1,
  # %in% (not ==) so this still works if several paragraphs tie for the minimum
  para %in% (
    filter(carol_with_sent, stave == 1) %>%
      filter(sentiment == min(sentiment)) %>%
      pull(index)
  )
)
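To reuse that template for other staves, or for the high points instead of the low ones, the lookup can be pulled into a function. A minimal sketch, assuming a table shaped like carol_with_sent: `toy_sent` is invented data and `extreme_paras()` is a hypothetical helper name, written in base R so it runs on its own.

```r
# Invented per-paragraph scores with the same key columns as carol_with_sent
toy_sent <- data.frame(
  stave     = c(1, 1, 1, 2, 2),
  index     = c(1, 2, 3, 1, 2),
  sentiment = c(-3, 2, -10, 4, -1)
)

# Hypothetical helper: paragraph index(es) with the extreme sentiment in one stave.
# `pick` is min for the most negative paragraph, max for the most positive.
extreme_paras <- function(sent_df, which_stave, pick = min) {
  in_stave <- sent_df[sent_df$stave == which_stave, ]
  in_stave$index[in_stave$sentiment == pick(in_stave$sentiment)]
}

lowest_stave_1  <- extreme_paras(toy_sent, 1)       # most negative paragraph in stave 1
highest_stave_2 <- extreme_paras(toy_sent, 2, max)  # most positive paragraph in stave 2
```

The returned index (or indexes, when there is a tie) can then be used to filter the token table with `%in%`, as in the snippet above.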
Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.
Stave three (Christmas present) has the highest:
The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.
and lowest (sentiment) points of the entire book:
And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.
Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative-sentiment paragraph and the one that follows are pretty dark, though:
They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.
Finally, Stave five is both short and positive (whew!). Which I heartily agree with!
FIN
The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.
Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.
For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”
4 Comments
Nice post! Really enjoyed working through it. I think the “scale_x_comma(limits=range(carol_with_sent$index))” line in the ggplot code prevents the scales="free_x" argument from working in facet_wrap. Just a minor observation.
Nay, it’s desired behaviour. I want the labels on the axis on each facet but also want the last one to show it’s shorter. It’s a hack to get that behaviour. I use it quite a bit at work :-)
hi! great post, i learned a lot about regex and stringi functions here so thanks!
i seem to be having a problem with the plot:
in theme_ipsum_rc(), i get an “unused argument” error for (axis_text_size = 8). i checked the documentation for theme_ipsum_rc() and i saw that the only similar argument is (axis_title_size = __). Is this a typo or am I missing something completely here?
If you can, try the development version of ggplot2 on GitHub. The one on CRAN should work with recent CRAN versions of ggplot2, though, so check that first.