International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect and explore data on modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many sources for pirate data. I’ve blogged about a few in the past, but I came across a new (to me) one from the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!
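We can also ask robotstxt directly whether specific target paths are fair game. Here’s an illustrative spot check with robotstxt::paths_allowed():

robotstxt::paths_allowed(
  paths = "/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345",
  domain = "www.icc-ccs.org"
)
# expect TRUE, since none of the Disallow rules above touch these paths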

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5- or 10-second delay is the most common setting (when sites bother to include that directive in their robots.txt file at all). ICC’s site does not define one, but we’ll still pirate crawl responsibly and use a 5-second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET(), but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
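If you want a bit more belt-and-suspenders, httr config helpers such as user_agent() and timeout() can be passed along as well. This is purely illustrative (the contact string below is a placeholder):

s_GET(
  target_urls[1],
  user_agent("pirate-data-research; you@example.com"), # identify yourself to the site
  timeout(30)                                          # don't hang forever on a dead connection
)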

I also like to save off the crawl results so I can go back to the raw files (if needed) vs re-scraping the site (this crawl takes a while). I do it two ways here: first using raw httr response objects (including any “broken” ones), then filtering out the “complete” responses and saving them in WARC format, which is a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the main code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# spin up a V8 context so we can evaluate the on-page javascript below
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not uniform across all years, so this helps keep the data wrangling idiom generic.
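For example, here’s what mfga() does to a handful of raw labels (note the duplicate handling):

mfga(c("Attack Number", "Date:", "Date:", "Type of Attack"))
## [1] "attack_number"  "date"           "date_1"         "type_of_attack"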

Next, I set up a cols object because we’re going to be extracting the data from text as text, and I think it’s cleaner to type_convert() at the end than to have a slew of as.numeric() (et al.) statements sprinkled through the code for small munging tasks. You’ll note that at the end of the munging pipeline I still need to do some manual conversions.
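Here’s a toy illustration of that idea (made-up values): keep everything as character while wrangling, then convert the types in one pass at the end.

tibble(id = c("1", "2"), date = c("2017-06-01", "2017-07-04")) %>%
  type_convert(col_types = cols(id = col_integer(), date = col_date(format = "")))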

Now we can iterate over the good (complete) responses.

The purrr::safely() wrapper shoves the real httr response under result, so we focus on that, then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from that evaluation.
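Here’s that V8 round-trip in miniature, using made-up data rather than the site’s actual payload:

ctx <- v8()
ctx$eval('var dat = [{ "label": "Date:", "value": "2017-01-01" }];')
str(ctx$get("dat", flatten = TRUE))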

Because ICC has used the same Joomla plugin over the years, the data format is largely uniform, but individual records can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on GitHub, and don’t hesitate to ask questions in the comments or file an issue. If you make something, blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

(If folks want some exposition here, let me know in the comments and I’ll rig up an R Markdown document with more step-by-step details.)

Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# "borrow" the files from SmokyMountains.com, but be nice and cache them to
# avoid hitting their web server for every iteration

c("https://smokymountains.com/wp-content/themes/smcom-2015/to-delete/js/us.json",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage2.tsv",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage-2017.csv") %>%
  walk(~{
    sav_tmp <- file.path(root, "data", basename(.x))
    if (!file.exists(sav_tmp)) download.file(.x, sav_tmp)
  })

# next, we read in the GeoJSON file twice. first, to get the states
states_sf <- read_sf(file.path(root, "data", "us.json"), "states", stringsAsFactors = FALSE)

# we only want the continental US
states_sf <- filter(states_sf, !(id %in% c("2", "15", "72", "78")))

# it doesn't have a CRS so we give it one
st_crs(states_sf) <- 4326

# I ran into hiccups using coord_sf() to do this, so we convert it to Albers here
states_sf <- st_transform(states_sf, 5070)


# next, we read in the counties
counties_sf <- read_sf(file.path(root, "data", "us.json"), "counties", stringsAsFactors = FALSE)
st_crs(counties_sf) <- 4326
counties_sf <- st_transform(counties_sf, 5070)

# now, we read in the foliage data
foliage <- read_tsv(file.path(root, "data", "foliage-2017.csv"),
                    col_types = cols(.default=col_double(), id=col_character()))

# and, since we have a lovely sf tidy data frame, bind it together
left_join(counties_sf, foliage, "id") %>%
  filter(!is.na(rate1)) -> foliage_sf

# now, we do some munging so we have better labels and so we can
# iterate over the weeks
gather(foliage_sf, week, value, -id, -geometry) %>%
  mutate(value = factor(value)) %>%
  filter(week != "rate1") %>%
  mutate(week = factor(week,
                       levels=unique(week),
                       labels=format(seq(as.Date("2017-08-26"),
                                         as.Date("2017-11-11"), "1 week"),
                                     "%b %d"))) -> foliage_sf

# now we make a ggplot object for each week and save it out to a png
pb <- progress_estimated(nlevels(foliage_sf$week))
walk(1:nlevels(foliage_sf$week), ~{

  pb$tick()$print()

  xdf <- filter(foliage_sf, week == levels(week)[.x])

  ggplot() +
    geom_sf(data=xdf, aes(fill=value), size=0.05, color="#2b2b2b") +
    geom_sf(data=states_sf, color="white", size=0.125, fill=NA) +
    viridis::scale_fill_viridis(
      name=NULL,
      discrete = TRUE,
      labels=c("No Change", "Minimal", "Patchy", "Partial", "Near Peak", "Peak", "Past Peak"),
      drop=FALSE
    ) +
    labs(title=sprintf("Foliage: %s ", unique(xdf$week))) +
    ggthemes::theme_map() +
    theme(panel.grid=element_line(color="#00000000")) +
    theme(panel.grid.major=element_line(color="#00000000")) +
    theme(legend.position="right") -> gg

  ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)

})

# we read them all back in and animate the foliage
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1)

insert(post, "{ 'standard_disclaimer' : 'My opinion, not my employer\'s' }")

This is a post about the fictional company FredCo. If the context or details presented by the post seem familiar, it’s purely coincidental. This is, again, a fictional story.

Let’s say FredCo had a pretty big breach that (fictionally) garnered media, Twitterverse, tech-world and Government-level attention and that we have some spurious details that let us sit back in our armchairs to opine about. What might have helped create the debacle at FredCo?

Despite (fictional) endless mainstream media coverage and a good chunk of ‘on background’ infosec-media clandestine blatherings we know very little about the breach itself (though it’s been fictionally, officially blamed on failure to patch Apache Struts). We know even less (fictionally officially) about the internal reach of the breach (apart from the limited consumer impact official disclosures). We know even less than that (fictionally officially) about how FredCo operates internally (process-wise).

But, I’ve (fictionally) seen:

  • a detailed breakdown of the number of domains, subdomains, and hosts FredCo “manages”.
  • the open port/service configurations of the public components of those domains
  • public information from individuals who are more willing to (fictionally) violate the CFAA than I am to get more than just port configuration information
  • a 2012/3 SAS 1 Type II report about FredCo controls
  • testimonies from FredCo execs regarding efficacy of $SECURITY_TECHNOLOGY and 3 videos purporting to be indicative of expert opinion on how to use BIIGG DATERZ to achieve cybersecurity success
  • the board & management structure + senior management bonus structures, complete with incentive-based objectives they were graded on

so, I’m going to blather a bit about how this fictional event should finally tear down the Potemkin village that is the combination of the Regulatory+Audit Industrial Complex and the Cybersecurity Industrial Complex.

“Tear down” with respect to the goal being to help individuals understand that a significant portion of organizations you entrust with your data are not incentivized or equipped to protect your data and that these same conditions exist in more critical areas — such as transportation, health care, and critical infrastructure — and you should expect a failure on the scale of FredCo — only with real, harmful impact — if nothing ends up changing soon.

From the top

There is boilerplate mention of “security” in the senior executives’ objectives in both the 2015 & 2016 14A filings:

  • CEO: “Employing advanced analytics and technology to help drive client growth, security, efficiency and profitability.”
  • CFO: “Continuing to advance and execute global enterprise risk management processes, including directing increased investment in data security, disaster recovery and regulatory compliance capabilities.”
  • CLO: “Continuing to refine and build out the Company’s global security organization.”
  • President, Workforce Solutions: None
  • CHRO: None
  • President – US Information Services: None

You’ll be happy to know that they all received either “Distinguished” or “Exceeds” on their appraisals and earned a multiplier of their bonus & compensation targets as a result.

Furthermore, there is no one in the make-up of FredCo’s board of directors who has shown an interest or specialization in cybersecurity.

From the camera-positioned 50-yard line on instant replay, the board and shareholders of FredCo did not think protection of your identity and extremely personal information was important enough to include in three top executives’ directives and performance measures, and gave it little more than boilerplate mention for the others. Investigators who look into FredCo’s breach should dig deep into the last decade of the detailed measures for these objectives. I have first-hand experience with how these types of HR processes are managed in large orgs, which is why I’m encouraging this area for investigation.

“Security” is a terrible term, but it only works when it is an emergent property of the business processes of an organization. That means it must be contextual for every worker. Some colleagues suggest individual workers should not have to care about cybersecurity when making decisions or doing work, but even minimum-wage retail and grocery store clerks are educated about shoplifting risks and are given tools, tips and techniques to prevent loss. When your HR organization is not incentivized to help create and maintain a cybersecurity-aware culture from the top, you’re going to have problems, and when there are no cybersecurity-oriented targets for the CIO or even business process owners, don’t expect your holey screen door to keep out predators.

Awwwdit, Part I

NOTE: I’m not calling out any particular audit organization as I’ve only seen one fictional official report.

The Regulatory+Audit Industrial Complex is a lucrative business cabal. Governments and large business meta-agencies create structures where processes can be measured, verified and given a big green ✅. This validation exercise is generally done in one or more ways:

  • simple questionnaire, very high level questions, no veracity validation
  • more detailed questionnaire, mid-level questions, usually some in-person lightweight checking
  • detailed questionnaire, but with topics that can be sliced-and-diced by the legal+technical professions to mean literally anything, measured in-person by (usually) extremely junior reviewers with little-to-no domain expertise who follow review playbooks, get overwhelmed with log entries and scope-refinement+reduction and who end up being steered towards “important” but non-material findings

Sure, there are good audits and good auditors, but I will posit they are the rare diamonds in a bucket of zirconia.

We need to cover some technical ground before covering this further, though.

Shocking Struts

We’ll take the stated breach cause at face value: failure to patch a remotely accessible vulnerability in Apache Struts. This was presented as the singular issue enabling attackers to walk (with crutches) away with scads of identity-theft-enabling personal data, administrator passwords, database passwords, and the recipe for the winning entry in the macaroni salad competition at last year’s HR annual picnic. Who knew one Java library had so much power!

We don’t know the architecture of all the web apps at FredCo. However, your security posture should not be a Jenga game tower, easily destroyed by removing one peg. These are all (generally) components of externally-facing applications at the scale of FredCo:

  • routers
  • switches
  • firewalls
  • load balancers
  • operating systems
  • application servers
  • middleware servers
  • database servers
  • customized code

These are mimicked (to varying levels of efficacy) across:

  • development
  • test
  • staging
  • production

environments.

They may coexist (in various layers of the network) with:

  • HR systems
  • Finance systems
  • Intranet servers
  • Active Directory
  • General user workstations
  • Executive workstations
  • Developer workstations
  • Mobile devices
  • Remote access infrastructure (i.e. VPNs)

A properly incentivized organization ensures there are logical and physical separation between/isolation of “stuff that matters” and that varying levels of authentication & authorization are applied to ensure access is restricted.

Keeping all that “secure” requires:

  • managing thousands of devices (servers, network components, laptops, desktops, mobile devices)
  • managing thousands of identities
  • managing thousands of configurations across systems, networks and devices
  • managing hundreds to thousands of connections between internal and external networks
  • managing thousands of rules
  • managing thousands of vulnerabilities (as they become known)
  • managing a secure development life cycle across hundreds or thousands of applications

Remember, though, that FredCo ostensibly managed all of that well and the data loss was solely due to one Java library.

If your executives (all of them) and workers (all of them) are not incentivized with that list in mind, you will have problems. But let’s put those security challenges back in the context of the audit role.

Awwwdit, Part II

The post is already long, so we’ll make this quick.

If I dropped you off — yes, you, because you’re likely as capable as the auditors mentioned in the previous section — into that environment once a year, do you think you’d be able to ferret out issues based on convoluted network diagrams, poorly documented firewall rules and source code, and non-standard checklists of user access management processes?

Let’s say I dropped you in months before the known Struts vulnerability and re-answer the question.

The burden placed on internal and — especially — external auditors is great and they are pretty much set up for failure from engagement number one.

Couple IT complexity with the fact that many orgs like FredCo aren’t required to do more than ensure financial reporting processes are ✅.

But, even if there were more technical, security-oriented audits performed, you’d likely have ten different report findings by as many firms or auditors, especially if they were point-in-time audits. Furthermore, FredCo has had decades of point-in-time audits by hundreds of auditors and dozens of firms. The conditions of the breach were likely not net-new, so how did decades of systemic IT failures go unnoticed by this cabal?

IT audit functions are a multi-billion dollar business. FredCo is partially the result of the built-in cracks in the way verification is performed in orgs. In other words, I posit the Regulatory+Audit Industrial Complex bears some of the responsibility for FredCo’s breach.

Divisive Devices

From the (now removed) testimonials & videos, it was clear there may have been a “blinky light” problem in the mindset of those responsible for cybersecurity at FredCo. Relying solely on the capabilities of one or more devices (they are usually appliances with blinky lights) and thinking that storing petabytes of log data is going to stop “bad guys” is a great recipe for a breach parfait.

But, the Cybersecurity Industrial Complex continues to dole out LED-laden boxes with the fervor of a U.S. doctor handing out opioids. Sure, they are just giving orgs what they want, but that doesn’t make it responsible behaviour. Just like the opioid problem, the “device” issue is likely causing cyber-sickness in more organizations than you’d like to admit. You may even know someone who works at an org with a box addiction.

I posit the Cybersecurity Industrial Complex bears some of the responsibility for FredCo’s breach, especially when you consider the hundreds of marketing e-mails I’ve seen post-FredCo breach telling me how CyberBox XJ9-11 would have stopped FredCo’s attackers cold.

A Matter of Trust

If removing a Struts peg from FredCo’s IT Jenga board caused the fictional tower to crash:

  • What do you think the B2B infrastructure looks like?
  • How do you think endpoints are managed?
  • What isolation, segmentation and access controls really exist?
  • How effective do you think their security awareness program is?
  • How many apps are architected & managed as poorly as the breached one?
  • How many shadow IT deployments exist in the ☁️ with your data in it?
  • How can you trust FredCo with anything of importance?

Fictional FIN

In this fictional world I’ve created, one ending is:

  • all B2B connections to FredCo have been severed
  • lawyers at a thousand firms are working on language for filings to cancel all B2B contracts with FredCo
  • FredCo was de-listed from exchanges
  • FredCo executives are defending against a slew of criminal and civil charges
  • The U.S. Congress and U.K. Parliament have come together to undertake a joint review of regulatory and audit practices spanning both countries (since it impacted both countries and the Reg+Audit cabal spans both countries they decided to save time and money) resulting in sweeping changes
  • The SEC has mandated detailed cybersecurity objectives be placed on all senior management executives at all public companies and have forced results of those objectives assessments to be part of a new filing requirement.
  • The SEC has also mandated that at least one voting board member of public companies must have demonstrated experience with cybersecurity
  • The FTC creates and enforces standards on cybersecurity product advertising practices
  • You have understood that nobody has your back when it comes to managing your sensitive, personal data and that you must become an active participant in helping to ensure your elected representatives hold all organizations accountable when it comes to taking their responsibilities seriously.

but, another is:

  • FredCo’s stock bounces back
  • FredCo loses no business partners
  • FredCo’s current & former execs faced no civil or criminal charges
  • Congress makes a bit of opportunistic, temporary bluster for the sake of 2018 elections but doesn’t do anything more than berate FredCo publicly
  • You’re so tired of all these breaches and data loss that you go back to playing “Clash of Clans” on your mobile phone and do nothing.

I’ve blathered about trust before 1 2, but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident.

The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary:

SK-CSIRT identified malicious software libraries in the official Python package
repository, PyPI, posing as well known libraries. A prominent example is a fake
package urllib-1.21.1.tar.gz, based upon a well known package
urllib3-1.21.1.tar.gz.
Such packages may have been downloaded by unwitting developer or administrator
by various means, including the popular “pip” utility (pip install urllib).
There is evidence that the fake packages have indeed been downloaded and
incorporated into software multiple times between June 2017 and September 2017.

Words are great but, unlike some other FOSS projects (*cough* R *cough*), the PyPI folks have authoritative log data regarding package downloads from PyPI. This means we can begin to quantify the exposure. The Google BigQuery SQL was pretty straightforward:

SELECT timestamp, file.project as package, country_code, file.version AS version
FROM (
  (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-01-22'), TIMESTAMP('2017-09-15')))
)
WHERE file.project IN ('acqusition', 'apidev-coop', 'bzip', 'crypt', 'django-server',
                       'pwd', 'setup-tools', 'telnet', 'urlib3', 'urllib')

Let’s see what the daily downloads of the malicious packages look like:

Thanks to Curtis Doty (@dotysan on GH) I learned that the BigQuery table can be further filtered to exclude mirror-to-mirror traffic. The data for that is now in the GH repository and the chart in this callout shows that the exposure was very, very (very) limited:

But, we need counts of the mal-package doppelgangers (i.e. the good packages) to truly understand the scope of exposure:

Thankfully, the SK-CSIRT folks caught this in time and the exposure was limited. But those are some popular tools that were targeted and it’s super-easy to sneak these into requirements.txt and scripts since the names are similar and the functionality is duplicated.

I’ll further note that the crypt package was “good” at some point in time, then went away and was replaced with the nefarious one. That seems like a pretty big PyPI oversight (vis-a-vis package retirement & name re-use), but I’m not casting stones. R’s devtools::install_github() and wanton source()ing are just as bad, and the non-CRAN ecosystem is an even more varmint-prone “wild west” environment.

Furthermore, this is a potential exposure issue in many FOSS package repository ecosystems. On the one hand, these are open environments with tons of room for experimentation, creativity and collaboration. On the other hand, they’re all-too-easy targets for malicious hackers to prey upon.

I, unfortunately, have no quick-fix solutions to offer. “Review your code and dependencies” is about the best I can suggest until individual ecosystems work on better integrity & authenticity controls or there is a cross-ecosystem effort to establish “best practices” and perhaps even staffed, verified, audited, free services that work like a sheriff+notary to help ensure the safety of projects relying on open source components.

Python folks: double check that you weren’t a victim here (it’s super easy to type some of those package names wrong, and hopefully you’ve noticed builds failing if you had done so).

R folks: don’t be smug, watch your GitHub dependencies and double check your projects.
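If you want to do that Python double-check from R, here’s one (purely illustrative) way, assuming pip is on your PATH:

# names taken from the BigQuery filter above
bad <- c("acqusition", "apidev-coop", "bzip", "crypt", "django-server",
         "pwd", "setup-tools", "telnet", "urlib3", "urllib")

installed <- system2("pip", c("list", "--format=freeze"), stdout = TRUE)
intersect(tolower(sub("==.*$", "", installed)), bad) # anything returned is bad news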

You can find the data and the scripts used to generate the charts (ironically enough) on GitHub.

Finally: I just want to close with a “thank you!” to PyPI’s Donald Stufft who (quickly!) pointed me to a blog post detailing the BigQuery setup.

I’ve blogged about my in-development R package hgr before, and it’s slowly getting to a CRAN release. There are two new features that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order.

New S3 print() Method

Objects created with hgr::just_the_facts() used to be just list objects. Technically they are still list objects but they are also classed as hgr objects. The main reason for this was to support the new default print() method.

When you print() an hgr object, the new S3 method will extract the $content part of the object and pass it through some htmltools functions to display the readability-enhanced content in a browser (whatever R is configured to use for that on your system; if you read this blog you likely use RStudio, so for most folks it will show up in the Viewer pane). This enables visual inspection of the content and (as previously stated) is pretty much only useful in an interactive context.
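Conceptually, the method boils down to something like this (a minimal sketch, not the package’s actual code):

print.hgr <- function(x, ...) {
  htmltools::html_print(htmltools::HTML(x$content)) # render $content in the viewer/browser
  invisible(x)
}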

Rather than show you that now, we’ll see it in the context of the next feature.

‘Just The Facts’ RStudio Addin

I’m going to break with tradition and use a Medium post as an example, mostly because one of the things I detest about that content platform is just how little reading I can do on it (ironic since they touted things like typography when it launched). We’ll use this site as the example (an interesting JS article @timelyportfolio RT’d):

Now, that’s full-screen on a 13″ 2016 MacBook Pro and is pretty much useless. Now, I could (and, would normally) just use the Mercury Reader extension to strip away the cruft:

But, what if I wanted the content itself in R (for further processing) or didn’t feel like using (or policy prohibits using) a Chrome extension? Enter: “Just The Facts”, RStudio Addin edition.

Just copy a text URL to the clipboard and choose the ‘Just The Facts’ addin:

and, you’ll get a few items as a result:

  • code executed in your R console that will…
  • add a new, unique object in your global environment with the results of a call to hgr::just_the_facts() on the copied URL, and…
  • the object auto-print()ed in the console, which…
  • displays the content in your RStudio Viewer
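Under the hood, the plumbing for an addin like that could look roughly like the following. This is a hypothetical sketch: the object-naming scheme and the clipr/rstudioapi calls are my assumptions, not the package’s actual implementation.

jtf_addin <- function() {
  url <- clipr::read_clip()                                     # the copied URL
  obj <- sprintf("jtf_%s", format(Sys.time(), "%Y%m%d_%H%M%S")) # unique-ish object name
  rstudioapi::sendToConsole(
    sprintf('%s <- hgr::just_the_facts("%s"); %s', obj, url, obj),
    execute = TRUE                                              # run it in the user's console
  )
}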

An image (or two) is definitely worth a 1,000 (ok, 48) words:

FIN

I’m likely going to crank out another addin that works more like a browser (with a URL bar in-Viewer) but also keeps an in-environment log of requests so you can use or archive them afterwards.

Take it for a spin and don’t be shy in the GH issues page!

Putting this here to make it easier for others who try to Google this topic to find it without having to dig up and tediously search through other UDFs (user-defined functions).

I was/am making a custom UDF for base64 decoding/encoding and ran into:

SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: #### (expected: range(0, 256))

It’s incredibly easy to “fix” (and, if my Java weren’t so rusty I’d likely have seen it sooner) but I found this idiom in the spatial UDFs for Drill that enables increasing the default buffer size:

buffer = out.buffer = buffer.reallocIfNeeded(outputSize);

Hopefully this will prevent someone else from spinning a few minutes trying to tackle this use case. I had even looked at the source for the DrillBuf class and did not manage to put 2 + 2 together for some reason.

Earlier this year, the GDELT Project released their Television Explorer, which enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should read that (long-ish) intro as there are many caveats to the data source, and I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.
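Just to illustrate the underlying idea (this is not the package’s internal code), the file timestamps march along in 15-minute GMT steps:

head(seq(as.POSIXct("2017-09-07 00:00:00", tz = "GMT"),
         as.POSIXct("2017-09-07 02:00:00", tz = "GMT"),
         by = "15 mins"))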

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")

trends <- top_trending_range(from, to)

glimpse(trends)
## Observations: 233
## Variables: 5
## $ ts                       <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics  <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics  <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics       <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

glimpse(trends$station_top_topics[[1]])
## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics  <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...

Each individual data frame has the top topics of each tracked station.

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

select(trends, ts, station_top_topics) %>% 
  unnest() %>% 
  unnest() %>% 
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>% 
  group_by(day) %>% 
  top_n(25) %>% 
  slice(1:25) %>% 
  arrange(day, desc(n)) %>% 
  mutate(rnk = 25:1) -> top_25_trends

glimpse(top_25_trends)
## Observations: 75
## Variables: 4
## $ day   <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n     <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk   <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +
  theme(legend.direction="horizontal")

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!

UPDATE

As a result of a tweet by @arnicas, you can find a per-day, per-station view (top 10 only) here.

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package dubbed htm2txt has hit CRAN; it’s 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be slow compared to other regular expression engines).

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho, which builds upon the Jericho HTML parser created by Martin Jericho and used in at-scale initiatives like the Internet Archive. Yes, I’m trading Python for Java, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know of a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time to wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain they do a comparable — if not better — job of extracting clean, pertinent text.
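For a quick, eyeball-level comparison (illustrative only), something like this works well interactively:

# peek at the first few hundred characters from each converter for the same page
cat(substr(html_to_text(sites_html[3]), 1, 300), "\n\n")
cat(substr(htm2txt(sites_html[3]), 1, 300), "\n")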

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.