Earlier this year, the GDELT Project released their Television Explorer, which enables API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables”, which summarise the “top” topics and phrases across news stations every fifteen minutes. You should read that (long-ish) intro, as there are many caveats to the data source, and I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).
The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.
There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).
To start, we’ll need the data!
We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT, which is how GDELT stores the file timestamps). The function takes care of generating the proper 15-minute sequences.
library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)
from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")
trends <- top_trending_range(from, to)
glimpse(trends)
## Observations: 233
## Variables: 5
## $ ts <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...
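As an aside, here is a rough base R sketch of what "generating the proper 15-minute sequences" amounts to. This is not the package's actual code; the America/New_York time zone and the slot_from, slot_to and slots names are assumptions made purely for illustration.

```r
# Assumption: the local time zone is America/New_York; swap in your own.
slot_from <- as.POSIXct("2017-09-07 00:00:00", tz = "America/New_York")
slot_to   <- as.POSIXct("2017-09-09 12:00:00", tz = "America/New_York")

# One timestamp per 15-minute slot, inclusive of both endpoints
slots <- seq(slot_from, slot_to, by = "15 min")

# 60 hours * 4 slots per hour, plus the starting slot
length(slots)

# The same instants rendered in GMT
head(format(slots, tz = "GMT", usetz = TRUE), 3)
```

That range spans 241 slots, while the glimpse() output reports 233 observations, which is consistent with the gaps in file availability mentioned earlier.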
The glimpse() output above shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on station_top_topics:
glimpse(trends$station_top_topics[[1]])
## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...
Each individual data frame has the top topics of each tracked station.
To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not a 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:
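select(trends, ts, station_top_topics) %>%
  unnest(station_top_topics) %>%   # one row per timestamp + station
  unnest(Topics) %>%               # one row per timestamp + station + topic
  mutate(day = as.Date(ts)) %>%
  rename(station=Station, topic=Topics) %>%
  count(day, topic) %>%
  group_by(day) %>%
  top_n(25, n) %>%                 # can return more than 25 rows when counts tie
  arrange(day, desc(n)) %>%        # sort first so slicing trims from the bottom
  slice(1:25) %>%
  mutate(rnk = 25:1) -> top_25_trends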
glimpse(top_25_trends)
## Observations: 75
## Variables: 4
## $ day <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...
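If the double unnest feels opaque, here is a minimal sketch on a hand-made stand-in for trends. The toy and toy_counts names and every topic value in it are made up for illustration (this is not GDELT data), and it assumes a reasonably recent tidyr where unnest() takes the column to expand.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for `trends`: two 15-minute slots, two stations each
toy <- tibble(
  ts = as.POSIXct(c("2017-09-07 00:00:00", "2017-09-07 00:15:00"), tz = "GMT"),
  station_top_topics = list(
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "florida"), c("irma", "daca"))),
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "harvey"), c("florida", "irma")))
  )
)

toy_counts <- toy %>%
  unnest(station_top_topics) %>%   # one row per slot + station
  unnest(Topics) %>%               # one row per slot + station + topic
  mutate(day = as.Date(ts)) %>%
  count(day, topic = Topics, sort = TRUE)

toy_counts
```

Each count is the number of station/time-slot combinations in which a topic appeared in a top list, which is why “mentions” is only a loose label. In the toy data, "irma" shows up in all four station/slot combinations, so it lands first with n = 4.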
Now, it’s just a matter of some ggplotting:
ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL,
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +
  theme(legend.direction="horizontal")
Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!
UPDATE
As a result of a tweet by @arnicas, you can find a per-day, per-station view (top 10 only) here.