Teasing Out Top Daily Topics with GDELT’s Television Explorer

Earlier this year, the GDELT Project released its Television Explorer, which enables API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables”, which summarise the “top” topics and phrases across news stations every fifteen minutes. You should read that (long-ish) intro, as there are many caveats to the data source. I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.
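As a small illustration of that time zone conversion, here is a base-R sketch (not the package internals). The "America/New_York" zone is only an assumption for the example; substitute your own.

```r
# A local-time POSIXct value (the time zone here is an assumed example)
local_time <- as.POSIXct("2017-09-07 00:00:00", tz = "America/New_York")

# The same instant rendered in GMT; only the display changes, not the instant
gmt_label <- format(local_time, format = "%Y-%m-%d %H:%M:%S",
                    tz = "GMT", usetz = TRUE)
gmt_label
## [1] "2017-09-07 04:00:00 GMT"
```

Midnight on the US east coast in early September is 04:00 GMT, so the first file requested for a "local midnight" start is the 04:00 one.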

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")

trends <- top_trending_range(from, to)

glimpse(trends)
## Observations: 233
## Variables: 5
## $ ts                       <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics  <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics  <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics       <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...
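
A quick sanity check on that row count: a gap-free run of 15-minute files between those two times would have 241 slots (if both endpoints are included and there is no daylight-saving change in the window), so 233 rows suggests a handful of files were missing. Here is a minimal base-R sketch of how you might spot the missing slots. Note that missing_slots() is a hypothetical helper, not part of newsflash, and the "observed" vector below is synthetic; with real data you would pass trends$ts instead.

```r
# Hypothetical helper (not part of newsflash): return the 15-minute slots
# between `from` and `to` that are absent from `observed`.
missing_slots <- function(from, to, observed) {
  expected <- seq(from, to, by = "15 min")
  # Compare the underlying epoch seconds so time zone attributes don't matter
  expected[!(as.numeric(expected) %in% as.numeric(observed))]
}

from_utc <- as.POSIXct("2017-09-07 00:00:00", tz = "UTC")
to_utc   <- as.POSIXct("2017-09-09 12:00:00", tz = "UTC")

expected <- seq(from_utc, to_utc, by = "15 min")
length(expected)
## [1] 241

# Synthetic stand-in for trends$ts: drop three slots to simulate gaps
observed <- expected[-c(5, 6, 100)]

gaps <- missing_slots(from_utc, to_utc, observed)
length(gaps)
## [1] 3
```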

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

glimpse(trends$station_top_topics[[1]])
## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics  <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...

Each individual data frame has the top topics of each tracked station.

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

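select(trends, ts, station_top_topics) %>% 
  unnest(station_top_topics) %>%            # one row per timestamp and station
  unnest(Topics) %>%                        # one row per timestamp, station and topic
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>%                     # topic "mentions" per day
  group_by(day) %>% 
  top_n(25, n) %>%                          # top 25 by count; ties can return more than 25
  slice(1:25) %>%                           # so trim each day to exactly 25 rows
  arrange(day, desc(n)) %>% 
  mutate(rnk = 25:1) -> top_25_trends       # rnk 25 = most-mentioned topic of the day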

glimpse(top_25_trends)
## Observations: 75
## Variables: 4
## $ day   <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n     <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk   <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...
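
Since the GDELT files are sometimes unavailable, it can help to try the same reshaping on a hand-made data frame first. The sketch below is not the real data: the toy tibble is invented to mimic the nested shape shown above (a list-column of data frames, each with its own list-column of topics), and it keeps only the top 2 topics rather than 25.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented stand-in for `trends`: two timestamps, two stations each
toy <- tibble(
  ts = as.POSIXct(c("2017-09-07 00:00:00", "2017-09-07 00:15:00"), tz = "UTC"),
  station_top_topics = list(
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "florida"), c("irma", "daca"))),
    tibble(Station = c("CNN", "MSNBC"),
           Topics  = list(c("irma", "harvey"), c("florida", "irma")))
  )
)

toy_top <- toy %>%
  unnest(station_top_topics) %>%   # one row per timestamp and station
  unnest(Topics) %>%               # one row per timestamp, station and topic
  mutate(day = as.Date(ts)) %>%
  rename(station = Station, topic = Topics) %>%
  count(day, topic) %>%            # mentions per day
  arrange(day, desc(n)) %>%
  group_by(day) %>%
  filter(row_number() <= 2) %>%    # keep the 2 most-mentioned topics per day
  ungroup()

toy_top$topic
## [1] "irma"    "florida"
toy_top$n
## [1] 4 2
```

Sorting before trimming means ties are broken by the sorted order rather than by whatever order the rows happened to arrive in.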

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +
  theme(legend.direction="horizontal")

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!

UPDATE

As a result of a tweet by @arnicas, you can find a per-day, per-station view (top 10 only) here.
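
The code behind that view isn't reproduced in this post, but the idea is the same pipeline with station added to the grouping. Here is a rough sketch under that assumption, again on invented data (the long tibble stands in for the unnested trends, and all object names are mine):

```r
library(dplyr)
library(tibble)

# Invented long-format stand-in for the unnested trends data
long <- tibble(
  day     = as.Date("2017-09-07"),
  station = c("CNN", "CNN", "CNN", "MSNBC", "MSNBC", "MSNBC"),
  topic   = c("irma", "irma", "florida", "daca", "irma", "daca")
)

keep_n <- 10  # the per-station view is described as top 10 only

per_station <- long %>%
  count(day, station, topic) %>%        # mentions per day, station and topic
  arrange(day, station, desc(n)) %>%
  group_by(day, station) %>%
  filter(row_number() <= keep_n) %>%    # at most keep_n topics per group
  mutate(rnk = rev(row_number())) %>%   # highest rnk = most-mentioned topic
  ungroup()

per_station$topic
## [1] "irma"    "florida" "daca"    "irma"
per_station$rnk
## [1] 2 1 2 1
```

From there, adding facet_wrap(~station) to a plot like the one above would be one way to get a panel per station.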

Comments

  8. Sayon Ghosh

    iatv_top_trending(from,to) turns out empty tibbles for any date starting around 2018-03-21
    is it an issue that can be fixed ?
