Exploring News Coverage With newsflash

I was enthused to see a mention of this on the GDELT blog since I’ve been working on an R package dubbed newsflash to work with the API that the form front-ends.

Given the current climate, I feel compelled to note that I’m neither a Clinton supporter/defender/advocate nor a ? supporter/defender/advocate) in any way, shape or form. I’m only using the example for replication and I’m very glad the article author stayed (pretty much) non-partisan apart from some color commentary about the predictability of network coverage of certain topics.

For now, the newsflash package is configured to grab raw count data, not the percent summaries since folks using R to grab this data probably want to do their own work with it. I used the following to try to replicate the author’s findings:

library(newsflash)
library(ggalt) # github version
library(hrbrmisc) # github only
library(tidyverse)
starts <- seq(as.Date("2015-01-01"), (as.Date("2017-01-26")-30), "30 days")
ends <- as.character(starts + 29)
ends[length(ends)] <- ""

pb <- progress_estimated(length(starts))
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y)
})

clinton_timeline <- map_df(emails, "timeline")

sum(clinton_timeline$value)
## [1] 34778

count(clinton_timeline, station, wt=value, sort=TRUE) %>%
  mutate(pct=n/sum(n), pct_lab=sprintf("%s (%s)", scales::comma(n), scales::percent(pct)),
         station=factor(station, levels=rev(station))) -> timeline_df

timeline_df

## # A tibble: 7 × 4
##             station     n         pct        pct_lab
##              <fctr> <int>       <dbl>          <chr>
## 1          FOX News 14807 0.425757663 14,807 (42.6%)
## 2      FOX Business  7607 0.218730232  7,607 (21.9%)
## 3               CNN  5434 0.156248203  5,434 (15.6%)
## 4             MSNBC  4413 0.126890563  4,413 (12.7%)
## 5 Aljazeera America  1234 0.035482201   1,234 (3.5%)
## 6         Bloomberg   980 0.028178734     980 (2.8%)
## 7              CNBC   303 0.008712404     303 (0.9%)

NOTE: I had to break up the queries since the bulk one across the two dates bump up against the API limits and may be providing helper functions for that before CRAN release.

While my package matches the total from the news article and sample query: 34,778 results my percentages are different since it’s the percentages across the raw counts for the included stations. “Percent of Sentences” (result “n” divided by the number of all sentences for each station in the time frame) — which the author used — seems to have some utility so I’ll probably add that as a query parameter or add a new function.

Tidy news text

The package also is designed to work with the tidytext package (it’s on CRAN) and provides a top_text() function which can return a tidytext-ready tibble or a plain character vector for use in other text processing packages. If you were curious as to whether this API has good data behind it, we can take a naive peek with the help of tidytext:

library(tidytext)

tops <- map_df(emails, top_text)
anti_join(tops, stop_words) %>% 
  filter(!(word %in% c("clinton", "hillary", "server", "emails", "mail", "email",
                       "mails", "secretary", "clinton's", "secretary"))) %>% 
  count(word, sort=TRUE) %>% 
  print(n=20)

## # A tibble: 26,861 × 2
##             word     n
##            <chr> <int>
## 1        private 12683
## 2     department  9262
## 3            fbi  7250
## 4       campaign  6790
## 5     classified  6337
## 6          trump  6228
## 7    information  6147
## 8  investigation  5111
## 9         people  5029
## 10          time  4739
## 11      personal  4514
## 12     president  4448
## 13        donald  4011
## 14    foundation  3972
## 15          news  3918
## 16     questions  3043
## 17           top  2862
## 18    government  2799
## 19          bill  2698
## 20      reporter  2684

I’d say the API is doing just fine.

Fin

The package also has some other bits from the API in it and if this has piqued your interest, please leave all package feature requests or problems as a github issue.

Many thanks to the Internet Archive / GDELT for making this API possible. Data like this would be amazing in any time, but is almost invaluable now.

3 Comments

- ANDREW G CLARK
- Posted 2017-03-13 at 20:33
- Permalink
- Reply
Great work, Bob

Trying to vary slightly
querytv(“clinton”, “email,emails,server”, timespan=”custom”, startdate=x, enddate=y, filternetwork = “AFFNETALL”) and get error
Error: lexical error: inside a string, ‘\’ occurs before a character which it may not.
h! /xx tt4wt2nqt'' mnh!_\8 tt4wt2nqt” nz(l `-‘8 tt4w
(right here) ——^

same if putting in individual broadcaster e.g. “AFFNETABC”. Works if specifically enter filternetwork = “NATIONAL”
- - hrbrmstr
  - Posted 2017-03-14 at 09:10
  - Permalink
  - Reply
  can you post as an issue to https://github.com/hrbrmstr/newsflash/issues (pls) ?
  - - ANDREW G CLARK
    - Posted 2017-03-14 at 11:35
    - Permalink
    - Reply
    Have done
    BTW, I am not familiar with US TV but would I be right in saying that shows are repeated e.g the Lou Dobbs Tonight show at 23:00 hrs Day 1 is the same as the 03:00 one Day 2?

rud.is