As many folks know, I live in semi-rural Maine and we were hit pretty hard by a wind+rain storm from Sunday into Monday. The hrbrmstr compound has had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday, and still does not as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2]
    mul <- case_when(
      stri_detect_regex(val, "[Kk]") ~ 1000,
      stri_detect_regex(val, "[Mm]") ~ 1000000,
      TRUE ~ 1
    ) 
    val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "")
    as.numeric(val) * mul 
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change, but you can tweak this code if it does):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% 
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% 
  mutate(number_out = words_to_num(text)) %>%  
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us <https://twitter.com/PowerOutage_us>"
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).
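
As a quick sanity check of that helper, here’s what it does with a few made-up tweet strings (the text below is illustrative, not actual @PowerOutage_us output):

words_to_num(c(
  "1.2 Million #PowerOutages in the #NorthEast",
  "850K #PowerOutages on the #EastCoast",
  "712 #PowerOutages in #Maine"
))
## [1] 1200000  850000     712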

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible to shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(24, 18, 16, 14, 12),
    card = c(22, 16, 14, 12, 10)
  ) -> font_sizes
  
  theme_ipsum_rc(
    grid = grid, 
    plot_title_size = font_sizes[1], 
    subtitle_size = font_sizes[2], 
    axis_title_size = font_sizes[3], 
    axis_text_size = font_sizes[4], 
    caption_size = font_sizes[5]
  )
  
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (for now) use rtweet, as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may eventually swap in rtweet and bypass the current method, which relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140-character compliance), plus choose a style (stream or card, defaulting to stream) and decide whether you’re cool with the “retina” default.
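
That length check could be as simple as a guard near the top of the function; a minimal sketch (not part of the current code):

if (nchar(status) > 140) {
  warning("Status text is longer than 140 characters and may be rejected or truncated.")
}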

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It also returns the magick object it creates in the event you want to do something more with it:

gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"), 
                     retina=TRUE, send = FALSE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(w=1024, h=512),
    card = c(w=800, h=320)
  ) -> dims
  
  dims["res"] <- 72
  
  if (retina) dims <- dims * 2
  
  fig <- image_graph(width=dims["w"], height=dims["h"], res=dims["res"])
  print(g)
  dev.off()
  
  if (send) {
    
    message("Posting image to twitter...")
    
    tf <- tempfile(fileext = ".png")
    image_write(fig, tf, format="png")
    
    # Create an app at apps.twitter.com w/callback URL of http://127.0.0.1:1410
    # Save the app name, consumer key and secret to the following
    # Environment variables
    
    app <- oauth_app(
      appname = Sys.getenv("TWITTER_APP_NAME"),
      key = Sys.getenv("TWITTER_CONSUMER_KEY"),
      secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
    )
    
    twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), app)
    
    POST(
      url = "https://api.twitter.com/1.1/statuses/update_with_media.json",
      config(token = twitter_token), 
      body = list(
        status = status,
        media = upload_file(path.expand(tf))
      )
    ) -> res
    
    warn_for_status(res)
    
    unlink(tf)
    
  }
  
  fig
  
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me about anything.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized; it was hastily crafted.)
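
Speaking of which, here’s one possible vectorized rewrite of that helper (just a sketch; same logic as the original, minus the per-element map_dbl()):

words_to_num2 <- function(x) {
  # everything before " #PowerOutages" is the outage estimate
  val <- stringi::stri_match_first_regex(x, "^([[:print:]]+) #PowerOutages")[, 2]
  mul <- dplyr::case_when(
    stringi::stri_detect_regex(val, "[Kk]") ~ 1e3,
    stringi::stri_detect_regex(val, "[Mm]") ~ 1e6,
    TRUE ~ 1
  )
  as.numeric(stringi::stri_replace_all_regex(val, "[^[:digit:]\\.]", "")) * mul
}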

Way back in July of 2009, the first version of the twitteR package was published by Geoff Jentry on CRAN. Since then it has seen 28 updates, finally breaking the 0.x.y barrier into 1.x.y territory in March of 2013 and receiving its last update in July of 2015.

For a very long time, the twitteR package was the way to siphon precious nuggets of 140 character data from that platform and is the top hit when one searches for r twitter package. It even ha[sd] its own mailing list and is quite popular, judging by the total downloads stats from RStudio’s CRAN logs.

I blog today to suggest there is a better way to work with Twitter data from R, especially if your central use-case is searching Twitter and mining tweet data. This new way is rtweet by Michael Kearney. It popped up on the scene back in August of 2016 and receives quite a bit of attention from the developer, especially on GitHub.

This post is short and mostly designed to convince you to (a) try out the package and (b) blog and tweet about the package — if you do agree that it’s the best modern way to work with Twitter from R — to raise awareness about it. Because of that focus, I won’t be delving into all of rtweet’s seekrits, but you can explore them yourself on its spiffy pkgdown site.

While both packages have nigh complete access to the Twitter API, I posit that the quintessential use-case for working with Twitter in R is searching through tweets/users and then performing various types of data mining on the retrieved results. To that end, I’m going to show one use-case (out of many potential ones) that will both save you API-time and post-API munging time in order to convince you to switch to rtweet and spread the word about it.

Data-mining 300 #rstats Tweets: A Play in Two Acts

We’ll search Twitter for #rstats-tagged tweets with both twitteR and rtweet, starting with the former:

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj

Both packages follow similar idioms, and you need to have done some prep work by creating a Twitter “app” (both packages have instructions for that).

That operation took about 3 seconds on a fast internet connection and a wicked fast computer. What you get back is definitely usable data, but it’s in lists of custom objects. This is due to the way that package models the Twitter API onto custom R objects. It’s elegant, but also likely overkill for most operations. You can use something like purrr::map_df(r_folks, as.data.frame) to get that list into a data frame, but there are some other “gotchas”, such as text encoding (on a later run of this code both dplyr::glimpse() and str() gave me “invalid multibyte string” errors, but that same thing did not happen with rtweet).
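
For completeness, that list-to-data-frame step might look like this (twitteR also ships a twListToDF() helper that does the same job):

# flatten the list of twitteR status objects into a single data frame
r_folks_df <- purrr::map_df(r_folks, as.data.frame)
# or, equivalently:
# r_folks_df <- twitteR::twListToDF(r_folks)

dplyr::glimpse(r_folks_df)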

Here’s the rtweet version:

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                    <chr> "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                        <chr> "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                     <dttm> 2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                      <chr> "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                           <chr> "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                  <int> 0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id              <chr> NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                           <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                         <chr> "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                       <chr> NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                      <chr> NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded             <chr> NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                   <chr> "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                  <chr> "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name           <chr> NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id               <chr> NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                       <chr> "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

That took about 1.5 seconds and provides a tidy, immediately usable data structure.

But, that’s not all!

Michael has support for accessing the Twitter streaming API and also has some handy plot functions for quickly exploring retrieved Twitter content, enabling you to make pretty spiffy plots like this with almost no effort (this was pirated from Michael’s package website):
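A rough sketch of the sort of calls involved (function names are from rtweet; exact arguments and defaults may differ between versions):

library(rtweet)

# sample the streaming API for ~30 seconds of #rstats tweets
rt_stream <- stream_tweets("#rstats", timeout = 30)

# quick time-series plot of the earlier search results
ts_plot(rtweet_folks, by = "hours")
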

Fin

If the legacy twitteR package is already in your workflows, there may be little to gain. But, I would suggest that R folks give rtweet a try and blog about your experiences. It’ll give others a chance to see usage in different contexts and will also help spread the word about this alternative package and help bump up its pagerank.

As I said on Twitter:

Make sure to try out the GitHub version as well since it has gained some new functionality not currently in CRAN and don’t hesitate to ping @kearneymw on Twitter (though he may regret me suggesting that :-).
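
For reference, that’s a one-liner (the repo path below is where the package lived at the time of writing):

devtools::install_github("mkearney/rtweet")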

I’ve blogged a bit about robots.txt — the rules file that documents a site’s “robots exclusion” standard that instructs web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and is often ignored by crawlers and content owners alike.

There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with the robots exclusion standard, including the name (security.txt) and format (policy directives are defined with a simple syntax — see Chapter 5 of the Debian Policy Manual).

One core difference is that this file is intended for humans. If you are a general user and visit a site and notice something “off” (security-wise), or if you are an honest, honorable security researcher who found a vulnerability or weakness on a site, this security.txt file should make it easier to contact the appropriate folks at the site to help them identify and resolve security issues. The IETF abstract summarizes the intent well:

When security risks in web services are discovered by independent security researchers who understand the severity of the risk, they often lack the channels to properly disclose them. As a result, security issues may be left unreported. Security.txt defines a standard to help organizations define the process for security researchers to securely disclose security vulnerabilities.

A big change from robots.txt is where the security.txt file goes. The IETF standard is still in draft state so the location may change, but the current thinking is to have it go into /.well-known/security.txt vs being placed in the top level root (i.e. it’s not supposed to be in /security.txt). If you aren’t familiar with the .well-known directory, give RFC 5785 a read.

You can visit the general information site to find out more and install a development version of a Chrome extension that will make it easier to pull up this info in your browser if you find an issue.

Here’s the security.txt for my site:

Contact: bob@rud.is
Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
Disclosure: Full

With that info, you know where to contact me, have the ability to encrypt your message and know that I’ll give you credit and will disclose the bugs openly.

So, Why the [R] tag?

Ah, yes. This post is in the R RSS category feed for a reason. I do at-scale analysis of the web for a living and will be tracking the adoption of security.txt across the internet (initially with the Umbrella Top 1m and a choice list of sites with more categorical data associated with them) over time. My esteemed colleague @jhartftw is handling the crawling part, but I needed a way to speedily read in these files for a broader analysis. So, I made an R package: securitytxt.

It’s pretty easy to use. Here’s how to install it and use one of the functions to generate a security.txt target URL for a site:

devtools::install_github("hrbrmstr/securitytxt")

library(securitytxt)

(xurl <- sectxt_url("https://rud.is/b"))
## [1] "https://rud.is/.well-known/security.txt"

This is how you read in and parse a security.txt file:

(x <- sectxt(url(xurl)))
## <Web Security Policies Object>
## Contact: bob@rud.is
## Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## Disclosure: Full

And, this is how you turn that into a usable data frame:

sectxt_info(x)
##          key                                                                                         value
## 1    contact                                                                                    bob@rud.is
## 2 encryption https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## 3 disclosure                                                                                          Full

There’s also a function to validate that the keys are within the current IETF standard. That will become more useful once the standard moves out of draft status.
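
If you want to try it now, it’s a one-liner against the parsed object (I’m assuming here the validator is exposed as sectxt_validate(); check the package reference for the exact name):

# validate the parsed keys against the current draft (function name assumed)
sectxt_validate(x)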

FIN

So, definitely adopt the standard and feel invited to kick the tyres on the package. Don’t hesitate to jump on board if you have ideas for how you’d like to extend the package, and drop a note in the comments if you have questions on it or on adopting the standard for your site.

NOTE: If the usual aggregators are picking this up and there are humans curating said aggregators, this post is/was not intended as something to go into the “data science” aggregation sites. Just personal commentary with code in the event someone stumbles across it and wanted to double check me. These “data-dives” help me cope with these types of horrible events.

The “data science” feed URL is https://rud.is/b/category/r/feed/.

I saw the story about body camera footage from an officer-involved police stop & fatal shooting in Salt Lake City.
The individual killed was a felon — convicted of aggravated assault — with an outstanding warrant.

He tried to run. At some point in the brief chase he pivoted and appeared to be reaching for a weapon — likely a knife, which was confirmed after the fact.

One officer pulled a Taser. Another pulled a gun. Officer Fox — the one who fired the gun — said he was terrified by how close Mr. Harmon was to the officers when Mr. Harmon stopped and turned toward them.

I wasn’t there. I don’t risk getting injured or killed in the line of duty every day. I don’t face down armed suspects in fast-moving, tense situations.

But, I’m weary of this being a cut+paste story that is a nigh weekly event in America.

Officers are killed by suspects as well. It’s equally tragic.

Below is just “data”. Just a visual documentary of where we are 17-ish years into the 21st Century in America.

And, most of America seems to be OK with this. Then again, most of America is OK with the “price of freedom” being one mass shooting a day.

I’m not.

I scaled the Y axis the same in both faceted charts to make it easier to glance across both sets of tragedies.

This was generated on Sunday, October 8, 2017. If you run the code after that date, remove the saved data files and tweak the Y-scale limits since the death toll will rise.

library(httr)
library(rvest)
library(stringi)
library(hrbrthemes)
library(tidyverse)

read.table(sep=":", stringsAsFactors=FALSE, header=TRUE, 
           text="race:description
W:White, non-Hispanic
B:Black, non-Hispanic
H:Hispanic
N:Native American
A:Asian
None:Other/Unknown
O:Other") -> rdf

wapo_data_url <- "https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv"
shootings_file <- basename(wapo_data_url)
if (!file.exists(shootings_file)) download.file(wapo_data_url, shootings_file)

cols(
  id = col_integer(),
  name = col_character(),
  date = col_date(format = ""),
  manner_of_death = col_character(),
  armed = col_character(),
  age = col_integer(),
  gender = col_character(),
  race = col_character(),
  city = col_character(),
  state = col_character(),
  signs_of_mental_illness = col_character(),
  threat_level = col_character(),
  flee = col_character(),
  body_camera = col_character()
) -> shootings_cols

read_csv(shootings_file, col_types = shootings_cols) %>% 
  mutate(yr = lubridate::year(date), wk = lubridate::week(date)) %>% 
  filter(yr >= 2017) %>% 
  mutate(race = ifelse(is.na(race), "None", race)) %>% 
  mutate(race = ifelse(race=="O", "None", race)) %>% 
  count(race, wk) %>% 
  left_join(rdf, by="race") %>% 
  mutate(description = factor(description, levels=rdf$description)) -> xdf

lod_url <- "https://www.odmp.org/search/year/2017?ref=sidebar"
lod_rds <- "officer_lod.rds"
if (!file.exists(lod_rds)) {
  res <- httr::GET(lod_url)
  write_rds(res, lod_rds)
} else {
  res <- read_rds(lod_rds)
}
pg <- httr::content(res, as="parsed", encoding = "UTF-8")

html_nodes(pg, xpath=".//table[contains(., 'Detective Chad William Parque')]") %>% 
  html_nodes(xpath=".//td[contains(., 'EOW')]") %>% 
  html_text() %>% 
  stri_extract_all_regex("(EOW:[[:space:]]+(.*)\n|Cause of Death:[[:space:]]+(.*)\n)", simplify = TRUE) %>% 
  as_data_frame() %>% 
  mutate_all(~{
    stri_replace_first_regex(.x, "^[[:alpha:][:space:]]+: ", "") %>% 
      stri_trim_both() 
    }
  ) %>% 
  set_names(c("day", "cause")) %>% 
  mutate(day = as.Date(day, "%A, %B %e, %Y"), wk = lubridate::week(day)) %>% 
  count(wk, cause) -> odf 

ggplot(xdf, aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0)) +
  scale_y_comma(limits=c(0,15)) +
  facet_wrap(~description, scales="free_x") +
  labs(x="2017 Week #", y="# Deaths",
       title="Weekly Fatal Police Shootings in 2017",
       subtitle=sprintf("2017 total: %s", scales::comma(sum(xdf$n))),
       caption="Source: https://www.washingtonpost.com/graphics/national/police-shootings-2017/") +
  theme_ipsum_rc(grid="Y")

count(odf, cause, wt=n, sort=TRUE) -> ordr

mutate(odf, cause = factor(cause, levels=ordr$cause)) %>% 
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0)) +
  scale_x_continuous(limits=c(0, 40)) +
  scale_y_comma(limits=c(0,15)) +
  facet_wrap(~cause, ncol=3, scales="free_x") +
  labs(x="2017 Week #", y="# Deaths",
       title="Weekly Officer Line of Duty Deaths in 2017",
       subtitle=sprintf("2017 total: %s", scales::comma(sum(odf$n))),
       caption="Source: https://www.odmp.org/search/year/2017") +
  theme_ipsum_rc(grid="Y")

The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page where you can interactively work with the API if that’s how you roll.

Since my newsflash package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed to be a good home for a Third Eye interface to live.

Basic usage

You can read long-form details of the Third Eye service on their site. The TLDR is that they provide two feeds:

  • a “raw” one which has massive duplicates and tons of errors
  • a “clean” one that filters out duplicates, cleans up the text and is much better to work with

You can retrieve either with newsflash::read_chyrons() but the default is to use the clean feed. If you are studying text processing and/or NLP/text-cleanup via machine learning, then the raw feed may be very interesting for you. I suspect most data journalists will want to use the clean feed that also powers the IA chyron twitter bots.

Since it’s the Internet Archive, they’re awesome at providing metadata about their data. Heck, even their metadata has metadata about metadata. We can use the fact that they provide a metadata feed to enable listing available chyron archive dates:

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

list_chyrons()
## # A tibble: 61 x 3
##            ts    type     size
##        <date>   <chr>    <dbl>
##  1 2017-09-30 cleaned   539061
##  2 2017-09-30     raw 17927121
##  3 2017-09-29 cleaned   635812
##  4 2017-09-29     raw 19234407
##  5 2017-09-28 cleaned   414067
##  6 2017-09-28     raw 12663606
##  7 2017-09-27 cleaned   613474
##  8 2017-09-27     raw 20442644
##  9 2017-09-26 cleaned   659930
## 10 2017-09-26     raw 19942951
## # ... with 51 more rows

Reading the chyrons in only requires passing in a Date object or a YYYY-mm-dd format date string:

chyrons <- read_chyrons("2017-09-30")


glimpse(chyrons)
## Observations: 2,729
## Variables: 5
## $ ts       <dttm> 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30...
## $ channel  <chr> "BBCNEWS", "CNNW", "FOXNEWSW", "BBCNEWS", "CNNW", "MSNBCW", "BBCNEWS", "CN...
## $ duration <int> 18, 42, 26, 10, 47, 19, 14, 62, 26, 11, 45, 17, 35, 11, 62, 32, 35, 35, 15...
## $ details  <chr> "BBCNEWS_20170929_233000_Race_and_Pace/start/1800", "CNNW_20170929_230000_...
## $ text     <chr> "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT", "TRUMP BRAGS...

You get five columns in a data frame on a successful retrieval:

  • ts (POSIXct) chyron timestamp
  • channel (character) news channel the chyron appeared on
  • duration (integer) see Description
  • details (character) Internet Archive details path
  • text (character) the chyron text

We’ll talk about the details path in a bit. The text is likely what you want, so here’s a sample:

head(chyrons$text, 30)
##  [1] "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT"                                                                                                                                                                                            
##  [2] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nAnderson Cooper"                                                                                                                                      
##  [3] "ALIFORNIA STUDENT SWIPES 'MAGA' HAT"                                                                                                                                                                                                                   
##  [4] "US HEALTH SECRETARY QUITS. Mr Price apologised for use O126 private \\ufb02ights since May\\nUS HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nLAURA BICKER. Washington"                                                     
##  [5] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
##  [6] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
##  [7] "US HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nUS HEALTH SECRETARY QUITS. Government otficials required to take commercial \\ufb02ights\\nUS HEALTH SECRETARY QUITS. Scandal emerged after..."                            
##  [8] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
##  [9] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nTRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                        
## [10] "US HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for use O126..."                       
## [11] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nHHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL. . Ryan Nobles (J\\\\N Washington Correspondent"                                                                                                       
## [12] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I"                                                                                                                                                                                                      
## [13] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT"                                                                                                                                                              
## [14] "US HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for..."                                
## [15] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
## [16] "BIZARRE LIBERAL MELTDOWNS I\\nTUCKER & THE CAT IN THE HAT I. . _ < 'rnnwnn FAD! cnm tint-\\ufb01nk"                                                                                                                                                    
## [17] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [18] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
## [19] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nBIZARRE LIBERAL MELTDOWN I"                                                                                                                                                                         
## [20] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                                                                     
## [21] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                                                                        
## [22] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nSCHOOL LIBRARIAN REJECTS DR. SEUSS. BOOKS GIFTED BY MELANIA TRUMP. . _' tnnx'nkr"                                                                                                                   
## [23] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [24] "YEMEN WAR CRIMES. UN Human Rights Council agrees on investigation\\nINIGO MENDEZ DE VIGO. Spanish Education Minister"                                                                                                                                  
## [25] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\"\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\". . Mavor Carmen Yulin Cruz San Juan,..."
## [26] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS"                                                                                                                                                                                                        
## [27] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
## [28] "TRUMP ASIA TOUR. US President to visit Japan, South Korea and China\\nYEMEN WAR CRIMES. UN Human Rights Council agrees on investigation"                                                                                                               
## [29] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE\\nSAN JUAN MAYOR: \\\"MAD AS HELLII OVER HURRICANE RESPONSE. . Dr. Saniav Gupta (J\\\\N Chief Medical Correspondent"                                                                       
## [30] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE"

Be warned: even the “clean” text is often kinda messy.

For now, there are only four channels, so it’s easy to show a quick example. Since chyrons are supposed to be super-important things you need to know NOW, let’s see how many times Puerto Rico was mentioned on them in the above archive. NOTE: This is a quick example, not a thorough one. I’m searching for some key letter combinations to see just mentions of something looking like “Puerto Rico”. “San Juan” and other text that might be associated with the topic aren’t being considered for this toy example.

mutate(
  chyrons, 
  hour = lubridate::hour(ts),
  text = tolower(text),
  mention = grepl("erto ri", text)
) %>% 
  filter(mention) %>% 
  count(hour, channel) %>% 
  ggplot(aes(hour, n)) +
  geom_segment(aes(xend=hour, yend=0)) +
  scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
                     labels=sprintf("%02d:00", seq(0, 23, 6))) +
  scale_y_continuous(name="# Chyrons", limits=c(0,30)) +
  facet_wrap(~channel, scales="free") +
  labs(title="Chyrons mentioning 'Puerto Rico' per hour per channel",
       subtitle="Chyron date: 2017-09-30",
       caption="Source: Internet Archive Third Eye project & <github.com/hrbrmstr/newsflash>") +
  theme_ipsum_rc(grid="Y")

Details, details, details

Entries in the details column look like this:

head(chyrons$details)
## [1] "BBCNEWS_20170929_233000_Race_and_Pace/start/1800"                   
## [2] "CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600"              
## [3] "FOXNEWSW_20170929_230000_The_Story_With_Martha_MacCallum/start/3600"
## [4] "BBCNEWS_20170930_000000_BBC_News/start/60"                          
## [5] "CNNW_20170930_000000_Anderson_Cooper_360/start/60"                  
## [6] "MSNBCW_20170930_000000_All_In_With_Chris_Hayes/start/60"

They are path fragments that can be attached to a URL prefix to see the news clip from that station on that day/time. newsflash::view_clip() does that work for you:

view_clip(chyrons$details[2])

The URL for that is https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660 in the event the iframe load failed or you really like being annoyed with cable news shows.
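
If you’d rather build that URL yourself, the details fragment has everything you need; this little sketch reproduces the URL above (the 60-second end offset is inferred from that example):

details <- chyrons$details[2]
start <- as.integer(basename(details))  # the trailing /start/<seconds> offset
sprintf("https://archive.org/details/%s/end/%d", details, start + 60)
## [1] "https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660"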

FIN

Grab the package on GitHub, kick the tyres and don’t hesitate to file issues, questions or jump on board with package development. There’s plenty of room for improvement before it hits CRAN and your ideas are most welcome.

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what they are using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter in the URL of your own blog or company and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers    <NA>        100
## 2     Debian     Operating Systems    headers    <NA>        100
## 3  Modernizr JavaScript Frameworks     script    <NA>        100
## 4      Nginx           Web Servers    headers    <NA>        100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript, it was trivial to down-port it to a V8-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner rather than later (well, wanted it in R, anyway).
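
If you haven’t used V8 from R before, the basic mechanism looks like this (a trivial, self-contained sketch of the approach, not rappalyzer’s actual internals):

library(V8)

ctx <- v8()                                       # spin up a JavaScript context
ctx$eval("function add(a, b) { return a + b; }")  # stand-in for the down-ported Wappalyzer JS
ctx$call("add", 1, 2)                             # call into the JS from R
## [1] 3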

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to a more modern version of the V8 source code, which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8: the ones that are slow are slow in all the language ports. This is due to the way the regular expressions are checked against the site content, and to the use of regular expressions in the first place. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and avoid some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the expository and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt <- find_rstudio_root_file()

if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {

  f1k <- read_csv(file.path(rt, "data", "f1k.csv"))

  targets <- pull(f1k, website)

  results <- list()
  errors <- list()

  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }

  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }

  pool <- multi_set(total_con = 20)

  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })

  multi_run(pool = pool)

  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")

} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl package to set up all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results <- keep(results, ~.x$status_code < 300)
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results

  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr request object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future package.
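
For the curious, a parallel run might look roughly like this with future (just a sketch; it skips the per-result url tagging and the progress bar from the sequential version):

library(future)

plan(multisession)

# one future per response, then collect the results
futs <- lapply(results, function(res) future(rappalyze(res)))
rapp_results <- dplyr::bind_rows(lapply(futs, value))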

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf <- distinct(rapp_results, name, sector, category, tech)

sort(unique(xdf$category))
##  [1] "Advertising Networks"  "Analytics"             "Blogs"
##  [4] "Cache Tools"           "Captchas"              "CMS"
##  [7] "Ecommerce"             "Editors"               "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics"   "Maps"
## [13] "Marketing Automation"  "Miscellaneous"         "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers"          "Video Players"
## [19] "Web Frameworks"        "Widgets"

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s a lot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set, so definitely make sure to share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf <- count(xdf, sector, tech)

cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)

circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
             color_max = "hsl(341,30%,40%)", width = "100%", height="800px")

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality into a package (if warranted). Some make it to CRAN (like those two), others stay on GitHub.

This (short) post is about two new ones: webhose and pigeon.

The webhose package is an API interface package to https://webhose.io/, which is an interesting service that scrapes the web & “dark web” and provides a short but handy API to retrieving the content using a fairly intuitive query language.

The pigeon package is a hastily-hacked-together wrapper around pgn-extract, a cross-platform utility written in C for working with chess game data in PGN format.

I’m not going to have any time to round out the corners on either of those packages but will gladly make time to help anyone who wants to jump on board to either (or both!) of them.

Working on either package will let you get your feet wet (or, wetter) with R package development. They both need (a quick setup sketch follows this list):

  • more functions + docs!
  • test harness setup
  • Travis-CI, code coverage & AppVeyor configs
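
If you want to tackle those infrastructure bits, the devtools helpers cover most of it (a sketch; these were the helper names at the time of writing and have since migrated to usethis):

# run from within the package project
devtools::use_testthat()   # testthat harness skeleton
devtools::use_travis()     # .travis.yml for Travis-CI
devtools::use_appveyor()   # appveyor.yml for Windows builds
devtools::use_coverage()   # code coverage (codecov/coveralls) config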

Working on webhose will give you experience dealing with HTTP APIs (and their API is super clean to work with) and possibly introduce you to an area of research you’re not already in.

Working on pigeon will give you experience integrating C[++] & R code (and the C-library I’ve hack-wrapped definitely needs some care & feeding to ensure no memory leaks are present).

Neither is “mission critical”. The world will gladly go on w/o either of them. But, if you wanted a judgement-free place to hone your R package skills or try/learn new things (and ask questions along the way), file an issue, drop a note in the comments or hit me up on Twitter. If you’re an experienced R package-r and want to “own” either of these, that’s great as well!

I’d encourage all nascent R coders to adopt “SODD” and use SO as a place to hone your skills while you help others (and you don’t need to write a package for every answer :-).

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar. 

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).

I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(

  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

But, I don’t care about Crawl-Delay!

OK, fine. Do you care about fetchability? We can speed that up, too!

rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)