Skip navigation

Category Archives: R

Way back in July of 2009, the first version of the twitteR package was published by Geoff Jentry in CRAN. Since then it has seen 28 updates, finally breaking the 0.x.y barrier into 1.x.y territory in March of 2013 and receiving it’s last update in July of 2015.

For a very long time, the twitteR package was the way to siphon precious nuggets of 140 character data from that platform and is the top hit when one searches for r twitter package. It even ha[sd] it’s own mailing list and is quite popular, judging by RStudio’s CRAN logs total downloads stats .

I blog today to suggest there is a better way to work with Twitter data from R, especially if your central use-case is searching Twitter and mining tweet data. This new way is rtweet by Michael Kearney. It popped up on the scene back in August of 2016 and receives quite a bit of ? from the developer, especially on GitHub.

This post is short and mostly designed to convince you to (a) try out the package and (b) blog and tweet about the package — if you do agree that it’s the best modern way to work with Twitter from R — to raise awareness about it. Because of that focus, I won’t be delving into all of rtweet‘s seekrits, but you can explore them yourself on it’s spiffy pkgdown site.

While both packages have nigh complete access to the Twitter API, I posit that the quintessential use-case for working with Twitter in R is searching through tweets/users and then performing various types of data mining on the retrieved results. To that end, I’m going to show one use-case (out of many potential ones) that will both save you API-time and post-API munging time in order to convince you to switch to rtweet and spread the word about it.

Data-mining 300 #rstats Tweets : A Play in Two Acts

We’ll search Twitter for #rstats-tagged tweets with both twitteR and rtweet, starting with the former:

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj

Both packages follow the similar idioms and you need to have done some prep-work by creating a Twitter “app” (both packages have instructions for that).

That operation took about 3 seconds on a fast internet connection and wicked fast computer. What you get back is definitely usable data, but it’s in lists of custom objects. This is due to the way that package models the Twitter API on to custom R objects. It’s elegant, but also likely overkill for most operations. You can use something like purrr::map_df(r_folks, as.data.frame) to get that list into a data frame, there are some other “gotchas”, such as text encoding (on a later run of this code both dplyr::glimpse() and str() gave me “invalid multibyte string” errors but that same thing did not happen with rtweet.

Here’s the rtweet version:

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                    <chr> "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                        <chr> "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                     <dttm> 2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                      <chr> "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                           <chr> "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                  <int> 0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id              <chr> NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                           <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                         <chr> "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                       <chr> NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                      <chr> NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded             <chr> NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                   <chr> "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                  <chr> "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name           <chr> NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id               <chr> NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                       <chr> "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

That took about 1.5 seconds and provides a tidy, immediately usable data structure.

But, that’s not all!

Michael has support for accessing the Twitter streaming API and also has some handy plot functions for quickly exploring retrieved Twitter content, enabling you to make pretty spiffy plots like this with almost no effort (this was pirated from Michael’s package website):

Fin

If the legacy twitteR package is already in your workflows, there may be little to gain. But, I would suggest that R folks give rtweet a try and blog about your experiences. It’ll give others a chance to see usage in different contacts and will also help spread the word about this alternative package and help bump up it’s pagerank.

As I said on Twitter:

Make sure to try out the GitHub version as well since it has gained some new functionality not currently in CRAN and don’t hesitate to ping @kearneymw on Twitter (though he may regret me suggesting that :-).

I’ve blogged a bit about robots.txt — the rules file that documents a sites “robots exclusion” standard that instructs web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and often ignored by crawlers and content owners alike.

There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with robots exclusion standard, including the name (security.txt) and format (policy directives are defined with simple syntax — see Chapter 5 of the Debian Policy Manual).

One core difference is that this file is intended for humans. If you are are a general user and visit a site and notice something “off” (security-wise) or if you are an honest, honorable security researcher who found a vulnerability or weakness on a site, this security.txt file should make it easier to contact the appropriate folks at the site to help them identify and resolve security issues. The IETF abstract summarizes the intent well:

When security risks in web services are discovered by independent security researchers who understand the severity of the risk, they often lack the channels to properly disclose them. As a result, security issues may be left unreported. Security.txt defines a standard to help organizations define the process for security researchers to securely disclose security vulnerabilities.

A big change from robots.txt is where the security.txt file goes. The IETF standard is still in draft state so the location may change, but the current thinking is to have it go into /.well-known/security.txt vs being placed in the top level root (i.e. it’s not supposed to be in /security.txt). If you aren’t familiar with the .well-known directory, give RFC 5785 a read.

You can visit the general information site to find out more and install a development version of a Chrome extension that will make it easier for pull up this info in your browser if you find an issue.

Here’s the security.txt for my site:

Contact: bob@rud.is
Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
Disclosure: Full

With that info, you know where to contact me, have the ability to encrypt your message and know that I’ll give you credit and will disclose the bugs openly.

So, Why the [R] tag?

Ah, yes. This post is in the R RSS category feed for a reason. I do at-scale analysis of the web for a living and will be tracking the adoption of security.txt across the internet (initially with the Umbrella Top 1m and a choice list of sites with more categorical data associated with them) over time. My esteemed colleague @jhartftw is handling the crawling part, but I needed a way to speedily read in these files for a broader analysis. So, I made an R package: securitytxt?.

It’s pretty easy to use. Here’s how to install it and use one of the functions to generate a security.txt target URL for a site:

devtools::install_github("hrbrmstr/securitytxt")

library(securitytxt)

(xurl <- sectxt_url("https://rud.is/b"))
## [1] "https://rud.is/.well-known/security.txt"

This is how you read in and parse a security.txt file:

(x <- sectxt(url(xurl)))
## <Web Security Policies Object>
## Contact: bob@rud.is
## Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## Disclosure: Full

And, this is how you turn that into a usable data frame:

sectxt_info(x)
##          key                                                                                         value
## 1    contact                                                                                    bob@rud.is
## 2 encryption https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## 3 disclosure                                                                                          Full

There’s also a function to validate that the keys are within the current IETF standard. That will become more useful once the standard moves out of draft status.

FIN

So, definitely adopt the standard and feel invited to kick the tyres on the package. Don’t hesitate to jump on board if you have ideas for how you’d like to extend the package, and drop a note in the comments if you have questions on it or on adopting the standard for your site.

The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page where you can interactively work with the API if that’s how you roll.

Since my newsflash? package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed to be a good home for a Third Eye interface to live.

Basic usage

You can read long-form details of the Third Eye service on their site. The TLDR is that they provide two feeds:

  • a “raw” one which has massive duplicates and tons of errors
  • a “clean” one that filters out duplicates, cleans up the text and is much better to work with

You can retrieve either with newsflash::read_chyrons() but the default is to use the clean feed. If you are studying text processing and or NLP/text-cleanup via machine learning, then the raw feed may be very interesting for you. I suspect most data journalists will want to use the clean feed that also powers the IA chyron twitter bots.

Since it’s the Internet Archive, they’re awesome at providing metadata about their data. Heck, even their metadata has metadata about metadata. We can use the fact that they provide a metadata feed to enable listing available chyron archive dates:

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

list_chyrons()
## # A tibble: 61 x 3
##            ts    type     size
##        <date>   <chr>    <dbl>
##  1 2017-09-30 cleaned   539061
##  2 2017-09-30     raw 17927121
##  3 2017-09-29 cleaned   635812
##  4 2017-09-29     raw 19234407
##  5 2017-09-28 cleaned   414067
##  6 2017-09-28     raw 12663606
##  7 2017-09-27 cleaned   613474
##  8 2017-09-27     raw 20442644
##  9 2017-09-26 cleaned   659930
## 10 2017-09-26     raw 19942951
## # ... with 51 more rows

Reading the chyrons in only requires passing in a Date object or a YYYY-mm-dd format date string:

chyrons <- read_chyrons("2017-09-30")


glimpse(chyrons)
## Observations: 2,729
## Variables: 5
## $ ts       <dttm> 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30...
## $ channel  <chr> "BBCNEWS", "CNNW", "FOXNEWSW", "BBCNEWS", "CNNW", "MSNBCW", "BBCNEWS", "CN...
## $ duration <int> 18, 42, 26, 10, 47, 19, 14, 62, 26, 11, 45, 17, 35, 11, 62, 32, 35, 35, 15...
## $ details  <chr> "BBCNEWS_20170929_233000_Race_and_Pace/start/1800", "CNNW_20170929_230000_...
## $ text     <chr> "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT", "TRUMP BRAGS...

You get five columns in a data frame on a successful retrieval:

  • ts (POSIXct) chyron timestamp
  • channel (character) news channel the chyron appeared on
  • duration (integer) see Description
  • details (character) Internet Archive details path
  • text (character) the chyron text

We’ll talk about the details path in a bit. The text is likely what you want, so here’s a sample:

head(chyrons$text, 30)
##  [1] "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT"                                                                                                                                                                                            
##  [2] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nAnderson Cooper"                                                                                                                                      
##  [3] "ALIFORNIA STUDENT SWIPES 'MAGA' HAT"                                                                                                                                                                                                                   
##  [4] "US HEALTH SECRETARY QUITS. Mr Price apologised for use O126 private \\ufb02ights since May\\nUS HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nLAURA BICKER. Washington"                                                     
##  [5] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
##  [6] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
##  [7] "US HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nUS HEALTH SECRETARY QUITS. Government otficials required to take commercial \\ufb02ights\\nUS HEALTH SECRETARY QUITS. Scandal emerged after..."                            
##  [8] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
##  [9] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nTRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                        
## [10] "US HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for use O126..."                       
## [11] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nHHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL. . Ryan Nobles (J\\\\N Washington Correspondent"                                                                                                       
## [12] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I"                                                                                                                                                                                                      
## [13] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT"                                                                                                                                                              
## [14] "US HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for..."                                
## [15] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"                                                                                                                                                                                                        
## [16] "BIZARRE LIBERAL MELTDOWNS I\\nTUCKER & THE CAT IN THE HAT I. . _ < 'rnnwnn FAD! cnm tint-\\ufb01nk"                                                                                                                                                    
## [17] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [18] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                       
## [19] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nBIZARRE LIBERAL MELTDOWN I"                                                                                                                                                                         
## [20] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"                                                                                                                                                                                     
## [21] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"                                                                                                                                                        
## [22] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nSCHOOL LIBRARIAN REJECTS DR. SEUSS. BOOKS GIFTED BY MELANIA TRUMP. . _' tnnx'nkr"                                                                                                                   
## [23] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                        
## [24] "YEMEN WAR CRIMES. UN Human Rights Council agrees on investigation\\nINIGO MENDEZ DE VIGO. Spanish Education Minister"                                                                                                                                  
## [25] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\"\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\". . Mavor Carmen Yulin Cruz San Juan,..."
## [26] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS"                                                                                                                                                                                                        
## [27] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"                                                                                                                                                                                                            
## [28] "TRUMP ASIA TOUR. US President to visit Japan, South Korea and China\\nYEMEN WAR CRIMES. UN Human Rights Council agrees on investigation"                                                                                                               
## [29] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE\\nSAN JUAN MAYOR: \\\"MAD AS HELLII OVER HURRICANE RESPONSE. . Dr. Saniav Gupta (J\\\\N Chief Medical Correspondent"                                                                       
## [30] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE"

Be warned: even the “clean” text is often kinda messy.

For now, there are only four channels, so it’s easy to show a quick example. Since chyrons are supposed to be super-important things you need to know NOW, let’s see how many times Puerto Rico was mentioned on them in the above archive. NOTE: This is a quick example, not a thorough one. I’m searching for some key letter combinations to see just mentions of something looking like “Puerto Rico”. “San Juan” and other text that might be associated with the topic aren’t being considered for this toy example.

mutate(
  chyrons, 
  hour = lubridate::hour(ts),
  text = tolower(text),
  mention = grepl("erto ri", text)
) %>% 
  filter(mention) %>% 
  count(hour, channel) %>% 
  ggplot(aes(hour, n)) +
  geom_segment(aes(xend=hour, yend=0)) +
  scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
                     labels=sprintf("%02d:00", seq(0, 23, 6))) +
  scale_y_continuous(name="# Chyrons", limits=c(0,30)) +
  facet_wrap(~channel, scales="free") +
  labs(title="Chyrons mentioning 'Puerto Rico' per hour per channel",
       subtitle="Chyron date: 2017-09-30",
       caption="Source: Internet Archive Third Eye project & <github.com/hrbrmstr/newsflash>") +
  theme_ipsum_rc(grid="Y")

Details, details, details

Entries in details column look like this:

head(chyrons$details)
## [1] "BBCNEWS_20170929_233000_Race_and_Pace/start/1800"                   
## [2] "CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600"              
## [3] "FOXNEWSW_20170929_230000_The_Story_With_Martha_MacCallum/start/3600"
## [4] "BBCNEWS_20170930_000000_BBC_News/start/60"                          
## [5] "CNNW_20170930_000000_Anderson_Cooper_360/start/60"                  
## [6] "MSNBCW_20170930_000000_All_In_With_Chris_Hayes/start/60"

They are path fragments that can be attached to a URL prefix to see the news clip from that station on that day/time. newsflash::view_clip() does that work for you:

view_clip(chyrons$details[2])

The URL for that is https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660 in the event the iframe load failed or you really like being annoyed with cable news shows.

FIN

Grab the package on GitHub, kick the tyres and don’t hesitate to file issues, questions or jump on board with package development. There’s plenty of room for improvement before it hits CRAN and your ideas are most welcome.

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what they are using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter in the URL of your own blog or company and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json?. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer? package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com? is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers    <NA>        100
## 2     Debian     Operating Systems    headers    <NA>        100
## 3  Modernizr JavaScript Frameworks     script    <NA>        100
## 4      Nginx           Web Servers    headers    <NA>        100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript it was trivial to down-port it to a V8?-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner than later (well, wanted it in R, anyway).

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera? (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to more modern version of the V8 source code which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8. The ones that slow are slow in all the language ports. This is due to the way the regular expressions are checked against the site content and for just using regular expressions to begin with. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and avoiding some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the expository and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt <- find_rstudio_root_file()

if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {

  f1k <- read_csv(file.path(rt, "data", "f1k.csv"))

  targets <- pull(f1k, website)

  results <- list()
  errors <- list()

  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }

  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }

  pool <- multi_set(total_con = 20)

  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })

  multi_run(pool = pool)

  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")

} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl? package to setup all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results <- keep(results, ~.x$status_code < 300)
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results

  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr request object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future? package.

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf <- distinct(rapp_results, name, sector, category, tech)

sort(unique(xdf$category))
##  [1] "Advertising Networks"  "Analytics"             "Blogs"
##  [4] "Cache Tools"           "Captchas"              "CMS"
##  [7] "Ecommerce"             "Editors"               "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics"   "Maps"
## [13] "Marketing Automation"  "Miscellaneous"         "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers"          "Video Players"
## [19] "Web Frameworks"        "Widgets"

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s alot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set and definitely make sure to share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf <- count(xdf, sector, tech)

cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)

circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
             color_max = "hsl(341,30%,40%)", width = "100%", height="800px")

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality into a package (if warranted). Some make it to CRAN (like those two), others stay on GitHub.

This (short) post is about two new ones: webhose? and pigeon?.

The webhose package is an API interface package to https://webhose.io/, which is an interesting service that scrapes the web & “dark web” and provides a short but handy API to retrieving the content using a fairly intuitive query language.

The pigeon package is a hastily-hacked-together wrapper around pgn-extract, a cross-platform utility written in C for working with chess game data in PGN format.

I’m not going to have any time to round out the corners on either of those packages but will gladly make time to help anyone who wants to jump on board to either (or both!) of them.

Working on either package will let you get your feet wet (or, wetter) with R package development. They both need:

  • more functions + docs!
  • test harness setup
  • Travis-CI, code coverage & AppVeyor configs

Working on webhose will give you experience dealing with HTTP APIs (and their API is super clean to work with) and possibly introduce you to an area of research you’re not already in.

Working on pigeon will give you experience integrating C[++] & R code (and the C-library I’ve hack-wrapped definitely needs some care & feeding to ensure no memory leaks are present).

Neither is “mission critical”. The world will gladly go on w/o either of them. But, if you wanted a judgement-free place to hone your R package skills or try/learn new things (and ask questions along the way), file an issue, drop a note in the comments or hit me up on Twitter. If you’re an experienced R package-r and want to “own” either of these, that’s ? as well!

I’d encourage all nascent R coders to adopt “SODD” and use SO as a place to hone your skills while you help others (and you don’t need to write a package for every answer :-).

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar. 

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).

I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(

  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster ?.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

But, I don’t care about Crawl-Delay!

OK, fine. Do you care about fetchability? We can speed that up, too!

rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way on how to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many of sources for pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate crawl responsibly and use a 5 second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.

I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turing the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.

Next, I setup a cols object because we’re going to be extracting data from text as text and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small mumnging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.

Now we can iterate over the good (complete) responses.

The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.

Because ICC used the same Joomla plugin over the years, the data is uniform, but also can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data oh github), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on Github and don’t hesitate to ask questions in the comments or file an issue. If you make something blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

(If folks want some exposition here, let me know in the comments and I’ll rig up an R Markdown document with more step-by-step details.)

Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# "borrow" the files from SmokyMountains.com, but be nice and cache them to
# avoid hitting their web server for every iteration

c("https://smokymountains.com/wp-content/themes/smcom-2015/to-delete/js/us.json",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage2.tsv",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage-2017.csv") %>%
  walk(~{
    sav_tmp <- file.path(root, "data", basename(.x))
    if (!file.exists(sav_tmp)) download.file(.x, sav_tmp)
  })

# next, we read in the GeoJSON file twice. first, to get the counties
states_sf <- read_sf(file.path(root, "data", "us.json"), "states", stringsAsFactors = FALSE)

# we only want the continental US
states_sf <- filter(states_sf, !(id %in% c("2", "15", "72", "78")))

# it doesn't have a CRS so we give it one
st_crs(states_sf) <- 4326

# I ran into hiccups using coord_sf() to do this, so we convert it to Albers here
states_sf <- st_transform(states_sf, 5070)


# next we read in the states
counties_sf <- read_sf(file.path(root, "data", "us.json"), "counties", stringsAsFactors = FALSE)
st_crs(counties_sf) <- 4326
counties_sf <- st_transform(counties_sf, 5070)

# now, we read in the foliage data
foliage <- read_tsv(file.path(root, "data", "foliage-2017.csv"),
                    col_types = cols(.default=col_double(), id=col_character()))

# and, since we have a lovely sf tidy data frame, bind it together
left_join(counties_sf, foliage, "id") %>%
  filter(!is.na(rate1)) -> foliage_sf

# now, we do some munging so we have better labels and so we can
# iterate over the weeks
gather(foliage_sf, week, value, -id, -geometry) %>%
  mutate(value = factor(value)) %>%
  filter(week != "rate1") %>%
  mutate(week = factor(week,
                       levels=unique(week),
                       labels=format(seq(as.Date("2017-08-26"),
                                         as.Date("2017-11-11"), "1 week"),
                                     "%b %d"))) -> foliage_sf

# now we make a ggplot object for each week and save it out to a png
pb <- progress_estimated(nlevels(foliage_sf$week))
walk(1:nlevels(foliage_sf$week), ~{

  pb$tick()$print()

  xdf <- filter(foliage_sf, week == levels(week)[.x])

  ggplot() +
    geom_sf(data=xdf, aes(fill=value), size=0.05, color="#2b2b2b") +
    geom_sf(data=states_sf, color="white", size=0.125, fill=NA) +
    viridis::scale_fill_viridis(
      name=NULL,
      discrete = TRUE,
      labels=c("No Change", "Minimal", "Patchy", "Partial", "Near Peak", "Peak", "Past Peak"),
      drop=FALSE
    ) +
    labs(title=sprintf("Foliage: %s ", unique(xdf$week))) +
    ggthemes::theme_map() +
    theme(panel.grid=element_line(color="#00000000")) +
    theme(panel.grid.major=element_line(color="#00000000")) +
    theme(legend.position="right") -> gg

  ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)

})

# we read them all back in and animate the foliage
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1)