Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. “minimal expository, keen focus”) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this:

to:
Does pandoc work?
=================
Simple document with **bold** and *italics*.This is definitely a job that pandoc can handle.
pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.
Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and, would not be possible without — Haskell.
Doing the aforementioned conversion from docx to markdown is super-simple from R:
rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at pandoc.org to see the power available at your command.
Just One More Thing
This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist :-)
There’s a neat, non-on-CRAN package by François Keck called subtools — https://github.com/fkeck/subtools which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.
There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is python and I’ve never had time to fully replicate it in R.
Here’s a snippet (sans expository) that can get you started pulling in subtitles into R and tidytext. I would have written scraper code but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write the code. So, I grabbed the first season of “The Flash” and use the Bing sentiment lexicon from tidytext to see how the season looked.
The overall scoring for a given episode is naive and can definitely be improved upon.
Definitely drop a link to anything you create in the comments!
# devtools::install_github("fkeck/subtools")
library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)
data(stop_words)
bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")
fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)
pb <- progress_estimated(length(fils))
map_df(1:length(fils), ~{
  pb$tick()$print()
  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)
}) %>% as_tibble() -> season_sentiments
count(season_sentiments, ep, sentiment) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) -> bing_sent
ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1))) 
				
 
2 Comments
Thank you for this post, I just wanted to drop you a line to let you know that I’m really appreciating the R6 series. It’s lowered my activation energy to posting small bits of info too.
(there’s also a typo after “Yes, you read that correctly. Your”)
thx for the kind words and ‘your’ keen eyes :-)
4 Trackbacks/Pingbacks
[…] Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. “minimal expository, keen focus) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this: to: This is definitely a job… Continue reading → […]
[…] Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. “minimal expository, keen focus) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this: to: This is definitely a job… Continue reading → […]
[…] article was first published on R – rud.is, and kindly contributed to […]
[…] article was first published on R – rud.is, and kindly contributed to […]