I was enthused to see a mention of this on the GDELT blog since I’ve been working on an R package dubbed newsflash to work with the API that the form front-ends.
Given the current climate, I feel compelled to note that I’m neither a Clinton supporter/defender/advocate nor a ? supporter/defender/advocate in any way, shape or form. I’m only using the example for replication, and I’m very glad the article author stayed (pretty much) non-partisan apart from some color commentary about the predictability of network coverage of certain topics.
For now, the newsflash package is configured to grab raw count data, not the percent summaries, since folks using R to grab this data probably want to do their own work with it. I used the following to try to replicate the author’s findings:
library(newsflash)
library(ggalt)     # github version
library(hrbrmisc)  # github only
library(tidyverse)

# 30-day query windows starting at 2015-01-01
starts <- seq(as.Date("2015-01-01"), (as.Date("2017-01-26")-30), "30 days")
ends <- as.character(starts + 29)
ends[length(ends)] <- ""  # blank end date for the final window

pb <- progress_estimated(length(starts))

# one API query per window
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y)
})
clinton_timeline <- map_df(emails, "timeline")
sum(clinton_timeline$value)
## [1] 34778
count(clinton_timeline, station, wt=value, sort=TRUE) %>%
  mutate(pct=n/sum(n), pct_lab=sprintf("%s (%s)", scales::comma(n), scales::percent(pct)),
         station=factor(station, levels=rev(station))) -> timeline_df
timeline_df
## # A tibble: 7 × 4
##             station     n         pct        pct_lab
##              <fctr> <int>       <dbl>          <chr>
## 1          FOX News 14807 0.425757663 14,807 (42.6%)
## 2      FOX Business  7607 0.218730232  7,607 (21.9%)
## 3               CNN  5434 0.156248203  5,434 (15.6%)
## 4             MSNBC  4413 0.126890563  4,413 (12.7%)
## 5 Aljazeera America  1234 0.035482201   1,234 (3.5%)
## 6         Bloomberg   980 0.028178734     980 (2.8%)
## 7              CNBC   303 0.008712404     303 (0.9%)
NOTE: I had to break up the queries since a single bulk query across the two dates bumps up against the API limits. I may provide helper functions for that before the CRAN release.
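A helper along those lines could look something like the sketch below. The name make_windows() and its arguments are hypothetical and not part of newsflash. Unlike the code above, which leaves the final end date blank, this sketch clamps the last window to the requested end date, and it uses only base R.

```r
# Hypothetical helper (NOT part of newsflash): split a date range into
# consecutive, non-overlapping windows of at most `days` days each.
make_windows <- function(from, to, days = 30) {
  from <- as.Date(from)
  to <- as.Date(to)
  stopifnot(from <= to, days >= 1)

  starts <- seq(from, to, by = paste(days, "days"))
  ends <- starts + (days - 1)
  ends[ends > to] <- to  # clamp the final window to the requested end date

  data.frame(start = starts, end = ends)
}

windows <- make_windows("2015-01-01", "2017-01-26")
head(windows, 3)
```

Each row could then be passed to a per-window query in the same way the starts and ends vectors are used above.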
While my package matches the total from the news article and sample query (34,778 results), my percentages are different since they are computed across the raw counts for the included stations. “Percent of Sentences” (result “n” divided by the number of all sentences for each station in the time frame), which the author used, seems to have some utility, so I’ll probably add that as a query parameter or add a new function.
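To make the difference between the two percentages concrete, here is a small base R sketch. The counts are the ones from the tibble above, but the total_sentences values are made-up placeholders purely for illustration (they are not GDELT figures), so only the share_of_matches column reflects real data.

```r
# Raw match counts per station, copied from the tibble above.
counts <- c(
  "FOX News" = 14807, "FOX Business" = 7607, "CNN" = 5434, "MSNBC" = 4413,
  "Aljazeera America" = 1234, "Bloomberg" = 980, "CNBC" = 303
)

# What the tibble above reports: each station's share of all matches.
share_of_matches <- counts / sum(counts)

# HYPOTHETICAL placeholder denominators (NOT real GDELT numbers): the total
# number of sentences each station aired in the time frame.
total_sentences <- c(
  "FOX News" = 2e6, "FOX Business" = 1e6, "CNN" = 2e6, "MSNBC" = 2e6,
  "Aljazeera America" = 1e6, "Bloomberg" = 1e6, "CNBC" = 1e6
)

# "Percent of Sentences": matches divided by that station's own total.
pct_of_sentences <- counts / total_sentences[names(counts)]

round(100 * cbind(share_of_matches, pct_of_sentences), 2)
```

The first column always sums to 100% across the included stations, while the second normalizes each station by its own volume, so the two need not rank stations the same way.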
Tidy news text
The package is also designed to work with the tidytext package (it’s on CRAN) and provides a top_text() function which can return a tidytext-ready tibble or a plain character vector for use in other text processing packages. If you were curious as to whether this API has good data behind it, we can take a naive peek with the help of tidytext:
library(tidytext)

tops <- map_df(emails, top_text)

anti_join(tops, stop_words) %>%
  filter(!(word %in% c("clinton", "hillary", "server", "emails", "mail", "email",
                       "mails", "secretary", "clinton's", "secretary"))) %>%
  count(word, sort=TRUE) %>%
  print(n=20)
## # A tibble: 26,861 × 2
##             word     n
##            <chr> <int>
## 1        private 12683
## 2     department  9262
## 3            fbi  7250
## 4       campaign  6790
## 5     classified  6337
## 6          trump  6228
## 7    information  6147
## 8  investigation  5111
## 9         people  5029
## 10          time  4739
## 11      personal  4514
## 12     president  4448
## 13        donald  4011
## 14    foundation  3972
## 15          news  3918
## 16     questions  3043
## 17           top  2862
## 18    government  2799
## 19          bill  2698
## 20      reporter  2684
I’d say the API is doing just fine.
Fin
The package also has some other bits from the API in it. If this has piqued your interest, please leave any package feature requests or problems as a GitHub issue.
Many thanks to the Internet Archive / GDELT for making this API possible. Data like this would be amazing in any time, but is almost invaluable now.
3 Comments
Great work, Bob
Trying to vary slightly with
query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y, filternetwork = "AFFNETALL")
and I get this error:
Error: lexical error: inside a string, ‘\’ occurs before a character which it may not.
h!
/xx tt4w
t2nqt'' mnh!
_\8 tt4wt2n
qt” nz(l `-‘8 tt4w(right here) ——^
Same if I put in an individual broadcaster, e.g. "AFFNETABC". It works if I specifically enter filternetwork = "NATIONAL".
can you post as an issue to https://github.com/hrbrmstr/newsflash/issues (pls) ?
Have done
BTW, I am not familiar with US TV, but would I be right in saying that shows are repeated, e.g. the Lou Dobbs Tonight show at 23:00 hrs Day 1 is the same as the 03:00 one Day 2?