I was pretty brutal to Apple earlier this week in a Twitter thread that I tried to craft so it ran in-line with the WWDC live stream (which might be something you want to remember as/if you read on). I really don’t care about “memojis” and I’m seriously dismayed by the pretty obvious fact that Apple intends to dumb down computing by shifting most folks from Macs to iPads. Their new “Pro” is for design folks, and I’m not holding my breath for them to re-embrace the developer/data science communities with better laptops or smaller cheese graters.
The “meh” hardware/software announcements aren’t the worst parts of these events. The TED-esque scripting (including many failed attempts at faux “authentic” humor) is also becoming quite tedious. I joked about analyzing the “adverbs per minute”, but it took a few days for their WWDC 2019 keynote video to emerge with a subtitle track. As a result, current time constraints prevent a dive into the subtitles themselves, but that doesn’t mean you can’t have some fun with them.
Read on to see how I scraped the subtitles or skip to the end to read more about this “Reader Challenge”.
Not So Subtle Subtitle Scraping
If you go to the aforelinked WWDC video URL you’ll see a control on the lower right to add a subtitle track. If you do that with the browser Developer Tools open, you’ll see what that does:
These are WebVTT-formatted subtitles, a format/syntax that enables them to be displayed at the correct playback timecode. We can see how many of them there are by looking at the end of the file:
So, there are 621 of them, and each is requested individually (and super fast, in parallel). What do these individual requests look like? Just select one of them to take a look. They’re plain-text responses (it’s not a super-intricate format).
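If you’d rather confirm that count programmatically than squint at a network-tab screenshot, a quick sanity check (just a sketch, using the per-chunk URL pattern we’ll extract below) is to probe the boundary with httr::HEAD():

library(httr)

# per-chunk subtitle URL pattern (the same one the captured request below uses)
vtt_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_%d.webvtt"

status_code(HEAD(sprintf(vtt_url, 621))) # expect 200 (the last chunk exists)
status_code(HEAD(sprintf(vtt_url, 622))) # expect a 4xx (we're past the end)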
Let’s grab one of them to the clipboard and use the {curlconverter} package to turn it into an httr::GET() request via the straighten() and make_req() functions:
I went ahead and wrapped it into a fairly-well-named function, but the GET request is virtually untouched from the aforementioned process. I just added the {idx} template into the request URL so we can glue() the right index into it. Some headers could likely have been eliminated, but I just went with what {curlconverter} processed and returned this time.
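If you haven’t used {curlconverter} before, the round-trip is tiny. A minimal sketch, assuming you’ve done “Copy as cURL” in the browser dev tools so the request is sitting on the clipboard:

library(curlconverter)

straighten() %>% # parse the cURL command line(s) on the clipboard
  make_req()     # turn the parsed request(s) into working httr functions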
library(stringi)
library(subtools) # https://github.com/hrbrmstr/subtools ; (ORIG: https://github.com/fkeck/subtools)
library(tidytext)
library(purrrogress) # tidy progress bars for free!
library(tidyverse)

#' Fetches a subtitle by index from the 2019 Apple WWDC Keynote subtitle stream
get_subtitle <- function(idx = 1) {

  st_url <- "https://p-events-delivery.akamaized.net/3004qzusahnbjppuwydgjzsdyzsippar/vod3/cc2/eng4/prog_index_{idx}.webvtt"
  st_url <- glue::glue(st_url)

  httr::GET(
    url = st_url,
    httr::add_headers(
      `sec-ch-ua` = "Google Chrome 75",
      `Sec-Fetch-Mode` = "cors",
      Origin = "https://developer.apple.com",
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
      Referer = "https://developer.apple.com/videos/play/wwdc2019/101/",
      `Sec-Fetch-Dest` = "empty",
      `Sec-Fetch-Site` = "cross-site"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  out <- stringi::stri_split_lines(out)

  purrr::flatten_chr(out)

}
Let’s see what one looks like:
(tmp <- get_subtitle(1))
## [1] "WEBVTT"
## [2] "X-TIMESTAMP-MAP=MPEGTS:181083,LOCAL:00:00:00.000"
## [3] ""
## [4] "3"
## [5] "00:00:21.199 --> 00:00:22.333"
## [6] ">> FEMALE SPEAKER:"
## [7] "Don't stay up too late."
## [8] ""
## [9] ""
Looking good! But it’s just plain characters and I don’t feel like writing a subtitle parser. And, I don’t have to! François Keck has the {subtools} package, which we can use. It used to only work on files, but it now works on character vectors as well (you’ll need to install it from my fork until the PR is merged). Let’s turn this pile of noise into something we can use:
as_subtitle(tmp, format = "webvtt") %>%
  flatten_df()
## # A tibble: 1 x 4
## ID Timecode.in Timecode.out Text
## <chr> <chr> <chr> <chr>
## 1 3 00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up to…
So tidy!
We now need to get all of the subtitles. We’ll do that fast since the video player retrieves them even faster than this iteration does:
# no crawl delay b/c the video player grabs these even faster than this code does
map(1:621, with_progress(get_subtitle)) %>% # with_progress gets you a progress bar for free
  map(as_subtitle, format = "webvtt") %>%
  flatten_df() %>%
  as_tibble() -> apple_subs
apple_subs
## # A tibble: 3,220 x 4
## ID Timecode.in Timecode.out Text
## <chr> <chr> <chr> <chr>
## 1 3 00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
## 2 4 00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.
## 3 5 00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.
## 4 6 00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.
## 5 7 00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.
## 6 8 00:01:46.733 00:01:47.833 This is good.
## 7 9 00:01:49.566 00:01:52.666 (Music playing)
## 8 10 00:02:05.200 00:02:12.533 (Applause)
## 9 10 00:02:05.200 00:02:12.533 (Applause)
## 10 11 00:02:14.400 00:02:15.566 >> TIM COOK: Wow.
## # … with 3,210 more rows
Streaming subtitles aren’t error-free and often get duplicated. Let’s see if that’s the case:
# Any dups?
distinct(apple_subs)
## # A tibble: 2,734 x 4
## ID Timecode.in Timecode.out Text
## <chr> <chr> <chr> <chr>
## 1 3 00:00:21.199 00:00:22.333 >> FEMALE SPEAKER: Don't stay up t…
## 2 4 00:01:10.933 00:01:11.933 >> MALE SPEAKER: Come on.
## 3 5 00:01:36.500 00:01:37.166 >> MALE SPEAKER: All right.
## 4 6 00:01:40.966 00:01:41.733 >> MALE SPEAKER: Yes.
## 5 7 00:01:45.733 00:01:46.666 >> MALE SPEAKER: Woo.
## 6 8 00:01:46.733 00:01:47.833 This is good.
## 7 9 00:01:49.566 00:01:52.666 (Music playing)
## 8 10 00:02:05.200 00:02:12.533 (Applause)
## 9 11 00:02:14.400 00:02:15.566 >> TIM COOK: Wow.
## 10 12 00:02:15.633 00:02:18.166 Thank you.
## # … with 2,724 more rows
apple_subs <- distinct(apple_subs)
There were dups, but not anymore!
You can get that data frame via: http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz.
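If you just want to pick up from here, something like this should work (a sketch; readr should handle downloading and decompressing the gzipped CSV, though you may want to specify column types explicitly):

# read the cleaned subtitle data frame straight from the gzipped CSV above
apple_subs <- readr::read_csv("http://rud.is/dl/2019-wwdc-keynote-subtitles.csv.gz")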
I wanted to see if these looked OK, so I dumped just the text to a file and opened it up in Sublime Text to spot-check:
apple_subs %>%
  pull(Text) %>%
  write_lines("/tmp/subs.txt")

system("subl /tmp/subs.txt") # dblchk
Since we have a good capture of what was spoken, we can start the analysis process:
distinct(apple_subs) %>%
  filter(!grepl("^\\(|^>>", Text)) %>%
  unnest_tokens(word, Text) %>%
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 2,408 x 2
## word n
## <chr> <int>
## 1 now 246
## 2 can 205
## 3 new 142
## 4 like 119
## 5 just 106
## 6 app 77
## 7 great 74
## 8 apple 69
## 9 right 64
## 10 apps 59
## # … with 2,398 more rows
And, that’s when I’ve run out of time.
Reader Challenge
You’ve got the cleaned WWDC 2019 Keynote subtitle track and access to my brutal WWDC 2019 Twitter thread. What fun can you have with it? I’d still like to know the adverbs-per-‘n’ (and what kind they were); one possible starting point for that is sketched below. But what else can you discover? Is there a pattern of emotional manipulation through word choice at different points? Did they change tone/style throughout the event? What other questions can you ask and tease out with the data?
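For the adverb question, here’s one possible starting point (a sketch, not part of the original analysis; it assumes the {udpipe} package and its English part-of-speech model):

library(udpipe)
library(tidyverse)

# one-time: download & load the English POS-tagging model
ud_model <- udpipe_download_model("english")
ud_model <- udpipe_load_model(ud_model$file_model)

# tag every subtitle line, keep only the adverbs, and count lemmas
udpipe_annotate(ud_model, x = apple_subs$Text) %>%
  as.data.frame() %>%
  filter(upos == "ADV") %>%
  count(lemma, sort = TRUE)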
Drop links to your creations (and separate links to code) in the comments and I’ll re-broadcast them on Twitter and gather them all up into a new post to see what y’all came up with.
FIN
There’s no deadline, as I can keep curating as new submissions come in. And while this is most assuredly an R-focused blog, there’s no restriction on the tools you use, either.
Hopefully this will be a fun/creative exercise for folks. If you have any questions about the scraping process or the WebVTT format, don’t hesitate to ping me here or on Twitter (@hrbrmstr).