Once I realized that my planned, larger post would not come to fruition today, I took the R⁶ (i.e. “minimal expository, keen focus”) post route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this:

*[image: a simple Word document with a heading plus bold and italic text]*

to:
```
Does pandoc work?
=================

Simple document with **bold** and *italics*.

This is definitely a job that pandoc
can handle.
```
pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.
Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and, would not be possible without — Haskell.
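If you want to double-check which pandoc binary rmarkdown will actually call (RStudio bundles its own copy), the package ships a couple of helpers for exactly that:

```r
library(rmarkdown)

# TRUE if a usable pandoc binary was found (RStudio bundles one)
pandoc_available()

# the version of the pandoc binary rmarkdown will invoke
pandoc_version()
```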
Doing the aforementioned conversion from docx to markdown is super-simple from R:
```r
rmarkdown::pandoc_convert("simple.docx", "markdown", output = "simple.md")
```
Give the help on rmarkdown::pandoc_convert() a read, as well as the very thorough and helpful documentation over at pandoc.org, to see the power available at your command.
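As one small taste of that power, pandoc_convert() will pass extra command-line flags straight through to pandoc via its options argument. A quick sketch (the file names here are just placeholders): --wrap=none disables hard line-wrapping in the markdown output, and --extract-media pulls any embedded images out of the docx:

```r
# placeholders: swap in your own input/output paths
rmarkdown::pandoc_convert(
  "report.docx", "markdown", output = "report.md",
  options = c("--wrap=none", "--extract-media=./media")
)
```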
Just One More Thing
This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist :-)
There’s a neat, not-on-CRAN package by François Keck called subtools (https://github.com/fkeck/subtools) which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.
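As a quick, hedged taste (the file name below is a placeholder; these are the same subtools functions used in the snippet further down), reading a single subtitle file and re-chunking it into sentences looks like this:

```r
# devtools::install_github("fkeck/subtools")
library(subtools)

# "episode_01.srt" is a placeholder path; sentencify() re-chunks
# the timed subtitle cues into full sentences
subs <- sentencify(read.subtitles("episode_01.srt"))

# the parsed text lives in the $subtitles data frame ("Text" column)
head(subs$subtitles$Text)
```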
There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is in Python and I’ve never had the time to fully replicate it in R.
Here’s a snippet (sans expository) that can get you started pulling subtitles into R and tidytext. I would have written scraper code, but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write it. So, I grabbed the first season of “The Flash” and used the Bing sentiment lexicon from tidytext to see how the season looked.
The overall scoring for a given episode is naive and can definitely be improved upon; a hedged sketch of one possible refinement follows the main snippet below. Definitely drop a link to anything you create in the comments!
```r
# devtools::install_github("fkeck/subtools")
library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

# one .srt file per episode of season 1
fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

pb <- progress_estimated(length(fils))

# read each episode, tokenize to words, drop stop words and
# keep only words present in both sentiment lexicons
map_df(1:length(fils), ~{

  pb$tick()$print()

  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by = "word") %>%
    inner_join(bing, by = "word") %>%
    inner_join(afinn, by = "word") %>%
    mutate(season = 1, ep = .x)

}) %>%
  as_tibble() -> season_sentiments

# share of positive vs negative words per episode; the explicit
# group_by() keeps the share within-episode (newer dplyr's count()
# returns an ungrouped tibble). Negative share is flipped below
# the axis for plotting.
count(season_sentiments, ep, sentiment) %>%
  group_by(ep) %>%
  mutate(pct = n / sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) %>%
  ungroup() -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment == "positive"),
              aes(ep, ymin = 0, ymax = pct, fill = sentiment), alpha = 3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment == "negative"),
              aes(ep, ymin = 0, ymax = pct, fill = sentiment), alpha = 3/4) +
  scale_x_continuous(expand = c(0, 0.5), breaks = seq(1, 23, 2)) +
  scale_y_continuous(expand = c(0, 0), limits = c(-1, 1),
                     labels = c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x = "Season 1 Episode", y = NULL, title = "The Flash — Season 1",
       subtitle = "Sentiment balance per episode") +
  scale_fill_ipsum(name = "Sentiment") +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_ipsum_rc(grid = "Y") +
  theme(axis.text.y = element_text(vjust = c(0, 0.5, 0.5, 0.5, 1)))
```
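Since the AFINN scores are already joined in above, one (still simplistic) refinement is to summarize each episode by its mean AFINN score rather than the raw positive/negative word balance. A minimal sketch, with one caveat: depending on your tidytext/textdata vintage the AFINN column is named value (older releases called it score), so adjust accordingly:

```r
# a sketch of a slightly less naive per-episode score using AFINN;
# change "value" to "score" if your lexicon version uses the old name
season_sentiments %>%
  group_by(ep) %>%
  summarise(afinn_mean = mean(value, na.rm = TRUE)) %>%
  ggplot(aes(ep, afinn_mean)) +
  geom_col() +
  labs(x = "Season 1 Episode", y = "Mean AFINN score",
       title = "The Flash, Season 1",
       subtitle = "An alternative per-episode sentiment summary")
```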