
Search Results for: splashr

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this).

Today, they sent me this gem by @JeromeDeveloper: Scrapy and Scrapyrt: how to create your own API from (almost) any website. Go ahead and click it. Give the Medium author the claps they so desperately crave (and get some context for the rant below).

I have no issue with @JeromeDeveloper’s coding prowess, nor Scrapy/Scrapyrt. In fact, I’m a huge fan of the folks at ScrapingHub, so much so that I wrote splashr to enable use of their Splash server from R.

My issue is with the example the author chose to use.

CoinMarketCap provides cryptocurrency prices and other cryptocurrency info. I use it to track cryptocurrency prices, to see which currency attackers who pwn devices to install illicit cryptomining rigs might be switching to next, and to get a feel for when they’ll stop mining and go back to just stealing data and breaking things.

CoinMarketCap has an API with a generous free tier, and the following text in their Terms & Conditions (which, in the U.S., may [soon] stupidly have to be explicitly repeated on every page where scraping is prohibited rather than stated once in a universal site link):

You may not, and shall not, copy, reproduce, download, “screen scrape”, store, transmit, broadcast, publish, modify, create a derivative work from, display, perform, distribute, redistribute, sell, license, rent, lease or otherwise use, transfer (either in printed, electronic or other format) or exploit any Content, in whole or in part, in any way that does not comply with these Terms without our prior written permission.

There is only one reason (apart from complete oblivion) to use CoinMarketCap as an example: to show folks how clever you are at bypassing site restrictions and eventually avoiding paying for an API to get data that you did absolutely nothing to help gather, curate and setup infrastructure for. There is no mention of “be sure what you are doing is legal/ethical”, just a casual caution to not abuse the Scrapyrt technology since it may get you banned.

Ethics matter across every area of “data science” (of which scraping is one component). Just because you can do something doesn’t mean you should, and just because you don’t like Terms & Conditions and want to grift the work of others for fun, profit & claps also doesn’t mean you should; and it definitely doesn’t mean you should be advocating that others do it as well.

Ironically, Medium itself places restrictions on what you can do:

Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.

yet they advocated I read and heed a post which violates similar terms of another site. So I wonder how they’d feel if I did a riff of that post and showed how to set up a hackish API to scrape all their content. O_o

Since I just railed against Congress for being a bit two-faced about privacy I thought some rud.is site disclosure would be in order.

At present, third-party tracking is limited to:

  • Something in my WordPress configuration adding a DNS pre-fetch for fonts.googleapis.com. There are a few other DNS pre-fetches that I’m also going to try to eradicate (but that aren’t showing up in my uBlock Origin, likely due to /etc/hosts blocks);
  • Gravatar (which displays logos near comment author names). I’m torn on this one but Gravatar is owned by Automattic (who owns WordPress). See next bullet on that;
  • WordPress. Vain site stats tracking, JetPack uptime warnings and some other WordPress pings happen (including some automatic short-linking) as well as the previous bullet bits. I’m not likely going to do the site surgery necessary to stop this but you have full disclosure and can easily avoid pings to those sites via uBlock Origin site-specific rules;
  • SendPulse; I’m running an experiment on user behaviours when it comes to authorizing web notifications (and I just kinda ruined said experiment). I’ll be disabling it later this year (after a full year of it being on so I can have more than just a few sentences to say).

The above came from an in-browser uBlock Origin report.

I ran a splashr::render_har() — which is how I measured things for the Congressional privacy post — on one of my pages and this is the result:

tld                 n
1 rud.is           67
2 wp.com           21
3 gravatar.com      6
4 wordpress.com     3
5 w.org             3
6 sendpulse.com     2

Props on WordPress capturing w.org! I’m still ticked Microsoft stole bob.com from me ages ago.

As you can see, most resources load from my site and none come from Twitter, Facebook or Google Plus.
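For the curious, here’s a rough sketch of how that tally might be reproduced with splashr::render_har() (assuming a local Splash instance is running — e.g. via splashr::start_splash() — and that the urltools package is handy; the URL is just an example):

library(splashr)
library(urltools)
library(tidyverse)

# capture the full browser experience (HAR) for one page
pg_har <- render_har(url = "https://rud.is/b/")

# pull every requested URL out of the HAR log and reduce each to its apex domain
urls <- map_chr(pg_har$log$entries, ~.x$request$url)
doms <- suffix_extract(domain(urls))

count(doms, tld = sprintf("%s.%s", domain, suffix), sort = TRUE)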

I run WordPress for a ton of reasons too long to go into for this post, so I’m likely not going to change anything about that list (apart from the DNS pre-fetching).

Hopefully that will abate any concerns visitors might have, especially after reading the post about Congress.

I apologize up-front for using bad words in this post.

Said bad words include “Facebook”, “Mark Zuckerberg” and many referrals to entities within the U.S. Government. Given the topic, it cannot be helped.

I’ve also left the R tag on this despite only showing some ggplot2 plots and Markdown tables. See the end of the post for how to get access to the code & data. R was used solely and extensively for the work behind the words.


This week Congress put on a show as they summoned the current Facebook CEO — Mark Zuckerberg — down to Washington, D.C. to demonstrate how little most of them know about how the modern internet and social networks actually work plus chest-thump to prove to their constituents they really and truly care about you.

These Congress-critters offered such proof in the guise of railing against Facebook for how they’ve handled your data. Note that I should really say our data since they do have an extensive profile database on me and most everyone else even if they’re not Facebook platform users (full disclosure: I do not have a Facebook account).

Ostensibly, this data-mishandling impacted your privacy. Most of the committee members wanted any constituent viewers to come away believing they and their fellow Congress-critters truly care about your privacy.

Fortunately, we have a few ways to measure this “caring” and the remainder of this post will explore how much members of the U.S. House and Senate care about your privacy when you visit their official .gov web sites. Future posts may explore campaign web sites and other metrics, but what better place to show they care about you than right there in their digital houses.

Privacy Primer

When you visit a web site with any browser, the main URL pulls in resources to aid in the composition and functionality of the page. These could be:

  • HTML (the main page is very likely HTML unless it’s just a media URL)
  • images (png, jpg, gif, “svg”, etc)
  • fonts
  • CSS (the “style sheet” that tells the browser how to decorate and position elements on the page)
  • binary objects (such as embedded PDF files or “protocol buffer” content)
  • XML or JSON
  • JavaScript

(plus some others)

When you go to, say, www.example.com the site does not have to load all the resources from example.com domains. In fact, it’s rare to find a modern site which does not use resources from one or more third party sites.

When each resource is loaded, some information about you (generally) goes along for the ride. At a minimum, the request time and source (your) IP address are exposed and — unless you’re really careful/paranoid — the referring site, browser configuration and even cookies are available to the third party sites. It does not take many of these data points to (pretty much) uniquely identify you. And, this is just for “benign” content like images. We’ll get to JavaScript in a bit.

As you move along the web, these third-party touch-points add up. To demonstrate this, I did my best to de-privatize my browser and OS configuration and visited 12 web sites while keeping a fresh Firefox install with the Lightbeam extension running. Here’s the result:

Each main circle is a distinct/main site and the triangles are resources the site tried to load. The red triangles indicate a common third-party resource that was loaded by two or more sites. Each of those red triangles knows where you’ve been (again, unless you’ve been very careful/paranoid) and can use that information to enhance their knowledge about you.

It gets a bit worse with JavaScript content since a much stronger fingerprint can be created for you (you can learn more about fingerprints at this spiffy EFF site). Plus, JavaScript code can try to pilfer cookies, “hack” the browser, serve up malicious adverts, measure time-on-site, and even enlist you in a cryptomining army.

There are other issues with trusting loaded browser content, but we’ll cover that a bit further into the investigation.

Measuring “Caring”

The word “privacy” was used over 100 times each day by both Zuckerberg and our Congress-critters. Senators and House members made it pretty clear Facebook should care more about your privacy. Implicit in said posit is that they, themselves, must care about your privacy. I’m sure they’ll be glad to point out all along the midterm campaign trails just how much they’re doing to protect your privacy.

We don’t just have to take their word for it. After berating Facebook’s chief college dropout and chastising the largest social network on the planet, we can see just how much of “you” these representatives give to Facebook (and other sites) and also how much they protect you when you decide to pay them a digital visit.

For this metrics experiment, I built a crawler using R and my splashr package which, in turn, uses ScrapingHub’s open source Splash. Splash is an automation framework that lets you programmatically visit a site just like a human would with a real browser.

Normally when one scrapes content from the internet they’re just grabbing the plain, single HTML file that is at the target of a URL. Splash lets us behave like a browser and capture all the resources — images, CSS, fonts, JavaScript — the site loads and will also execute any JavaScript, so it will also capture resources each script may itself load.

By capturing the entire browser experience for the main page of each member of Congress we can get a pretty good idea of just how much each one cares about your digital privacy, and just how much they secretly love Facebook.
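A minimal sketch of what that capture loop might look like (not the actual crawler; it assumes a hypothetical character vector member_sites of House/Senate homepage URLs and a Splash instance running locally):

library(splashr)
library(purrr)

# member_sites: hypothetical vector of official .gov homepage URLs
hars <- map(member_sites, ~{
  Sys.sleep(5)                    # be polite between requests
  render_har(url = .x, wait = 2)  # give javascript-loaded resources a moment to land
})

saveRDS(hars, "congress-hars.rds") # keep the raw captures around for later analysis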

Let’s take a look, first, at where you go when you digitally visit a Congress-critter.

Network/Hosting/DNS

Each House and Senate member has an official (not campaign) site that is hosted on a .gov domain and served up from a handful of IP addresses across the following (n is the number of Congress-critter web sites):

asn       aso                                      n
AS5511    Orange                                 425
AS7016    Comcast Cable Communications, LLC       95
AS20940   Akamai International B.V.               13
AS1999    U.S. House of Representatives            6
AS7843    Time Warner Cable Internet LLC           1
AS16625   Akamai Technologies, Inc.                1

“Orange” is really Akamai, and Akamai is a giant content delivery network which helps web sites efficiently provide content to your browser and can offer Denial of Service (DoS) protection. Most of these sites are behind Akamai, which means you “touch” Akamai every time you visit them. They know you were there, but I know a sufficient body of folks who work at Akamai and I’m fairly certain they’re not too evil. Virtually no representative solely uses House/Senate infrastructure, but this is almost a necessity given how easy it is to take down a site with a DoS attack and how polarized politics is in America.

To get to those IP addresses, DNS names like www.king.senate.gov (one of the Senators from my state) need to be translated to IP addresses. DNS queries are also data gold mines and everyone from your ISP to the DNS server that knows the name-to-IP mapping likely sees your IP address. Here are the DNS servers that serve up the directory lookups for all of the House and Senate domains:

nameserver gov_hosted
e4776.g.akamaiedge.net. FALSE
wc.house.gov.edgekey.net. FALSE
e509.b.akamaiedge.net. FALSE
evsan2.senate.gov.edgekey.net. FALSE
e485.b.akamaiedge.net. FALSE
evsan1.senate.gov.edgekey.net. FALSE
e483.g.akamaiedge.net. FALSE
evsan3.senate.gov.edgekey.net. FALSE
wwwhdv1.house.gov. TRUE
firesideweb02cc.house.gov. TRUE
firesideweb01cc.house.gov. TRUE
firesideweb03cc.house.gov. TRUE
dchouse01cc.house.gov. TRUE
c3pocc.house.gov. TRUE
ceweb.house.gov. TRUE
wwwd2-cdn.house.gov. TRUE
45press.house.gov. TRUE
gopweb1a.house.gov. TRUE
eleven11web.house.gov. TRUE
frontierweb.house.gov. TRUE
primitivesocialweb.house.gov. TRUE

Akamai kinda does need to serve up DNS for the sites they host, so this list also makes sense. But, you’ve now had two touch-points logged and we haven’t even loaded a single web page yet.
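As an aside, you can watch that name-to-IP translation happen from R yourself; a quick sketch using the curl and cymruservices packages (the Senate site above is just an example):

ips <- curl::nslookup("www.king.senate.gov", multiple = TRUE)

# who announces those addresses (same bulk_origin() helper used later in this post)
cymruservices::bulk_origin(ips)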

Safe? & Secure? Connections

When we finally make a connection to a Congress-critter’s site, it is going to be over SSL/TLS. They all support it (which is good, but SSL/TLS confidentiality is not as bullet-proof as many “HTTPS Everywhere” proponents would like to con you into believing). However, I took a look at the SSL certificates for House and Senate sites. Here’s a sampling from, again, my state (one House representative):

The *.house.gov “Common Name (CN)” is a wildcard certificate. Many SSL certificates have just one valid CN, but it’s also possible to list alternate, valid “alt” names that can all use the same, single certificate. Wildcard certificates ease the burden of administration, but they also mean that if, say, I managed to get my hands on the certificate chain and private key file, I could set up vladimirputin.house.gov somewhere and your browser would think it’s A-OK. Granted, there are far more Representatives than there are Senators and their tenure length is pretty erratic these days, so I can sort of forgive them for taking the easy route, but I also in no way, shape or form believe they protect those chains and private keys well.

In contrast, the Senate can and does embed the alt-names:
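If you want to poke at these certificates yourself, the openssl package can pull a site’s chain down for inspection (a sketch; exact list element names may vary slightly across package versions):

library(openssl)

chain <- download_ssl_cert("www.king.senate.gov", 443)
cert  <- as.list(chain[[1]])

cert$subject    # the CN the certificate was issued for
cert$alt_names  # any subject alternative names baked into it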

Are We There Yet?

We’ve got the IP address of the site and established a “secure” connection. Now it’s time to grab the index page and all the rest of the resources that come along for the ride. As noted in the Privacy Primer (above), the loading of third-party resources is problematic from a privacy (and security) perspective. Just how many third party resources do House and Senate member sites rely on?

To figure that out, I tallied up all of the non-.gov resources loaded by each web site and plotted the distribution of House and Senate (separately) in a “beeswarm” plot with a boxplot shadowing underneath so you can make out the pertinent quantiles:

As noted, the median is around 30 for both House and Senate member sites. In other words, they value your browsing privacy so little that most Congress-critters gladly share your browser session with many other sites.
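For reference, this kind of beeswarm-over-boxplot view is straightforward to build with ggbeeswarm; a sketch, assuming a hypothetical data frame third_party with chamber and n_third_party columns:

library(ggplot2)
library(ggbeeswarm)

ggplot(third_party, aes(chamber, n_third_party)) +
  geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.3) +   # the quantile "shadow"
  geom_quasirandom(size = 1, alpha = 0.5) +                      # one point per member site
  labs(x = NULL, y = "# of third-party (non-.gov) resources loaded")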

We also talked about confidentiality above. If an https site loads http resources, the contents of what you see on the page cannot be guaranteed. So, how responsible are they when it comes to at least ensuring these third-party resources are loaded over https?

You’re mostly covered from a pseudo-confidentiality perspective, but what are they serving up to you? Here’s a summary of the MIME types being delivered to you:

MIME Type Number of Resources Loaded
image/jpeg 6,445
image/png 3,512
text/html 2,850
text/css 1,830
image/gif 1,518
text/javascript 1,512
font/ttf 1,266
video/mp4 974
application/json 673
application/javascript 670
application/x-javascript 353
application/octet-stream 187
application/font-woff2 99
image/bmp 44
image/svg+xml 39
text/plain 33
application/xml 15
image/jpeg, video/mp2t 12
application/x-protobuf 9
binary/octet-stream 5
font/woff 4
image/jpg 4
application/font-woff 2
application/vnd.google.gdata.error+xml 1

We’ll cover some of these in more detail a bit further into the post.
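A sketch of how both the https check and the MIME tally above might be pulled from the HAR captures (again assuming the hypothetical hars list from earlier):

library(tidyverse)

# flatten one HAR capture into a data frame of requested URLs + returned MIME types
entry_df <- function(har) {
  map_df(har$log$entries, ~data_frame(
    url  = .x$request$url,
    mime = .x$response$content$mimeType %||% NA_character_
  ))
}

resources <- map_df(hars, entry_df)

count(resources, scheme = ifelse(grepl("^https://", url), "https", "http"))
count(resources, mime, sort = TRUE)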

Facebook & “Friends”

Facebook started all this, so just how cozy are these Congress-critters with Facebook?

Turns out that both Senators and House members are very comfortable letting you give Facebook a love-tap when you come visit their sites since over 60% of House and 40% of Senate sites use 2 or more Facebook resources. Not all Facebook resources are created equal[ly evil] and we’ll look at some of the more invasive ones soon.

Facebook is not the only devil out there. I added in the public filter list from Disconnect and the numbers go up from 60% to 70% for the House and from 40% to 60% for the Senate when it comes to a larger corpus of known tracking sites/resources.
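Here’s roughly how that Facebook measure could be computed from the same hypothetical hars list (the domain regex is illustrative, not exhaustive):

fb_rx <- "facebook\\.com|facebook\\.net|fbcdn\\.net"

urls_of <- function(har) vapply(har$log$entries, function(e) e$request$url, character(1))

uses_fb <- vapply(hars, function(h) sum(grepl(fb_rx, urls_of(h))) >= 2, logical(1))

mean(uses_fb) # fraction of member sites loading 2+ Facebook-owned resources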

Here’s a list of some (first 20) of the top domains (with one of Twitter’s media-serving domains taking the individual top-spot):

Main third-party domain # of ‘pings’ %
twimg.com 764 13.7%
fbcdn.net 655 11.8%
twitter.com 573 10.3%
google-analytics.com 489 8.8%
doubleclick.net 462 8.3%
facebook.com 451 8.1%
gstatic.com 385 6.9%
fonts.googleapis.com 270 4.9%
youtube.com 246 4.4%
google.com 183 3.3%
maps.googleapis.com 144 2.6%
webtrendslive.com 95 1.7%
instagram.com 75 1.3%
bootstrapcdn.com 68 1.2%
cdninstagram.com 63 1.1%
fonts.net 51 0.9%
ajax.googleapis.com 50 0.9%
staticflickr.com 34 0.6%
translate.googleapis.com 34 0.6%
sharethis.com 32 0.6%

So, when you go to check out what your representative is ‘officially’ up to, you’re being served…up on a silver platter to a plethora of sites where you are the product.

It’s starting to look like Congress-folk aren’t as sincere about your privacy as they may have led us all to believe this week.

A [Java]Script for Success[ful Privacy Destruction]

As stated earlier, not all third-party content is created equally malicious. JavaScript resources run code in your browser on your device and while there are limits to what it can do, those limits diminish weekly as crafty coders figure out more ways to use JavaScript to collect information and perform shady or malicious deeds.

So, how many House/Senate sites load one or more third-party JavaScript resources?

Virtually all of them.

To make matters worse, no .gov or third-party resource of any kind was loaded using subresource integrity validation. Subresource integrity validation means that the site owner — at some point — ensured that the resource being loaded was not malicious and then created a fingerprint for it and told your browser what that fingerprint is so it can compare it to what got loaded. If the fingerprints don’t match, the content is not loaded/executed. Using subresource integrity is not trivial since it requires a top-notch content management team and failure to synchronize/checkpoint third-party content fingerprints will result in resources failing to load.
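Generating the integrity value itself is not the hard part; here’s a sketch in R of computing a sha384 digest for a (hypothetical) third-party script, which would then go into the script tag’s integrity="sha384-…" attribute:

library(openssl)

con <- url("https://example.com/widget.js", "rb")  # hypothetical third-party resource
js  <- readBin(con, "raw", n = 10e6)               # grab the exact bytes being served
close(con)

sprintf("sha384-%s", base64_encode(sha384(js)))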

Congress was quick to demand that Facebook implement stronger policies and controls, but they, themselves, cannot be bothered.

Future Work

There are plenty more avenues to explore in this data set (such as “security headers” — strict-transport-security is used well pretty much across the board, but they are deeply deficient in other headers) and more targets for future works, such as the campaign sites of House and Senate members. I may follow up with a look at a specific slice from this data set (the members of the committees who were berating Zuckerberg this week).

The bottom line is that while the beating Facebook took this week was just, those inflicting the pain have a long way to go themselves before they can truly judge what other social media and general internet sites do when it comes to ensuring the safety and privacy of their visitors.

In other words: “Legislator, regulate thyself” before regulating others.

FIN

Apart from some egregiously bad (or benign) examples, I tried not to “name and shame”. I also won’t answer any questions about facets by party since that really doesn’t matter too much as they’re all pretty bad when it comes to understanding and implementing privacy and safety on their sites.

The data set can be found over at Zenodo (alternately, click/tap/select the badge below). I converted the R data frame to ndjson/streaming JSON/jsonlines (however you refer to the format) and tested it out in Apache Drill.

I’ll toss up some R code using data extracts later this week (meaning by April 20th).

DOI

A quick Friday post to let folks know about three in-development R packages that you’re encouraged to poke the tyres o[fn] and also jump in and file issues or PRs for.

Alleviating aversion to versions

I introduced a “version chart” in a recent post and one key element of tagging years (which are really helpful to get a feel for scope of exposure + technical/cyber-debt) is knowing the dates of product version releases. You can pay for such a database but it’s also possible to cobble one together, and that activity will be much easier as time goes on with the vershist package.

Here’s a sample:

apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""   
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""   
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""   
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""   
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""   
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""   
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""   
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""   
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""   
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""   
## # ... with 19 more rows

Not all vendored-software uses semantic versioning and many have terrible schemes that make it really hard to create an ordered factor, but when that is possible, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.
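As a quick example of why the ordered factor is handy, version comparisons work the way you’d hope (a sketch using the Apache httpd data above; the cutoff is arbitrary):

apache <- apache_httpd_version_history()

# everything at or after 1.3.9, in proper version order (not string order)
dplyr::filter(apache, vers >= "1.3.9")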

It has current support for:

  • Apache httpd
  • Apple iOS
  • Google Chrome
  • lighttpd
  • memcached
  • MongoDB
  • MySQL
  • nginx
  • openresty
  • openssh
  • sendmail
  • SQLite

and I’ll add more over time.

Thanks to @bikesRdata there will be a …_latest() function for each vendor, and I’ll likely add some helper functions so you only need to call one function with a parameter vs individual ones for each vendor. I’ll also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).

And, they call it a MIME…a MIME!

I’ve had the wand package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus the “magic” database is too big to embed in the package and due to the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and it really needs regular updates for consistent output on all platforms).

A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues which spawned a quick release of simplemagic. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” off of a handful of the more common types folks might want to use this package for and a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.

list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>% 
  purrr::map_df(~{
    dplyr::data_frame(
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>% 
  tidyr::unnest()
## # A tibble: 85 x 2
##    fil                        mime                                                                     
##    <chr>                      <chr>                                                                    
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  4 test_1.2.class             application/java-vm                                                      
##  5 test_1.3.class             application/java-vm                                                      
##  6 test_1.4.class             application/java-vm                                                      
##  7 test_1.5.class             application/java-vm                                                      
##  8 test_128_44_jstereo.mp3    audio/mp3                                                                
##  9 test_excel_2000.xls        application/msword                                                       
## 10 test_excel_spreadsheet.xml application/xml      
## ...

File issues or PRs if you need more header-magic introspected guesses.

NOTE: The rtika package could theoretically do a more comprehensive job since Apache Tika has an amazing assortment of file-type introspect-ors. Also, an interesting academic exercise might be to collect a sufficient corpus of varying files, pull the first 512-4096 bytes of each, do some feature generation and write an ML-based classifier for files with a confidence level + MIME-type output.

Site promiscuity detection

urlscan.io is a fun site since it frees you from the tedium (and expense/privacy-concerns) of using a JavaScript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr package, but you have the benefit of a third party making the connection with urlscan.io vs requests coming from your IP space.

I’m waiting on an API key so I can write the “submit a scan request programmatically” function, but—until then—you can retrieve existing sites in their database or manually enter one for later retrieval.

The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:

cnn_db <- urlscan::urlscan_search("domain:cnn.com")

latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)

latest_scan_results$scan_result$lists$ips
##  [1] "151.101.65.67"   "151.101.113.67"  "2.19.34.83"     
##  [4] "2.20.22.7"       "2.16.186.112"    "54.192.197.56"  
##  [7] "151.101.114.202" "83.136.250.242"  "157.166.238.142"
## [10] "13.32.217.114"   "23.67.129.200"   "2.18.234.21"    
## [13] "13.32.145.105"   "151.101.112.175" "172.217.21.194" 
## [16] "52.73.250.52"    "172.217.18.162"  "216.58.210.2"   
## [19] "172.217.23.130"  "34.238.24.243"   "13.107.21.200"  
## [22] "13.32.159.194"   "2.18.234.190"    "104.244.43.16"  
## [25] "54.192.199.124"  "95.172.94.57"    "138.108.6.20"   
## [28] "63.140.33.27"    "2.19.43.224"     "151.101.114.2"  
## [31] "74.201.198.92"   "54.76.62.59"     "151.101.113.194"
## [34] "2.18.233.186"    "216.58.207.70"   "95.172.94.20"   
## [37] "104.244.42.5"    "2.18.234.36"     "52.94.218.7"    
## [40] "62.67.193.96"    "62.67.193.41"    "69.172.216.55"  
## [43] "13.32.145.124"   "50.31.185.52"    "54.210.114.183" 
## [46] "74.120.149.167"  "64.202.112.28"   "185.60.216.19"  
## [49] "54.192.197.119"  "185.60.216.35"   "46.137.176.25"  
## [52] "52.73.56.77"     "178.250.2.67"    "54.229.189.67"  
## [55] "185.33.223.197"  "104.244.42.3"    "50.16.188.173"  
## [58] "50.16.238.189"   "52.59.88.2"      "52.38.152.125"  
## [61] "185.33.223.80"   "216.58.207.65"   "2.18.235.40"    
## [64] "69.172.216.58"   "107.23.150.218"  "34.192.246.235" 
## [67] "107.23.209.129"  "13.32.145.107"   "35.157.255.181" 
## [70] "34.228.72.179"   "69.172.216.111"  "34.205.202.95"

latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"

latest_scan_results$scan_result$lists$domains
##  [1] "cdn.cnn.com"                    "edition.i.cdn.cnn.com"         
##  [3] "edition.cnn.com"                "dt.adsafeprotected.com"        
##  [5] "pixel.adsafeprotected.com"      "securepubads.g.doubleclick.net"
##  [7] "tpc.googlesyndication.com"      "z.moatads.com"                 
##  [9] "mabping.chartbeat.net"          "fastlane.rubiconproject.com"   
## [11] "b.sharethrough.com"             "geo.moatads.com"               
## [13] "static.adsafeprotected.com"     "beacon.krxd.net"               
## [15] "revee.outbrain.com"             "smetrics.cnn.com"              
## [17] "pagead2.googlesyndication.com"  "secure.adnxs.com"              
## [19] "0914.global.ssl.fastly.net"     "cdn.livefyre.com"              
## [21] "logx.optimizely.com"            "cdn.krxd.net"                  
## [23] "s0.2mdn.net"                    "as-sec.casalemedia.com"        
## [25] "errors.client.optimizely.com"   "social-login.cnn.com"          
## [27] "invocation.combotag.com"        "sb.scorecardresearch.com"      
## [29] "secure-us.imrworldwide.com"     "bat.bing.com"                  
## [31] "jadserve.postrelease.com"       "ssl.cdn.turner.com"            
## [33] "cnn.sdk.beemray.com"            "static.chartbeat.com"          
## [35] "native.sharethrough.com"        "www.cnn.com"                   
## [37] "btlr.sharethrough.com"          "platform-cdn.sharethrough.com" 
## [39] "pixel.moatads.com"              "www.summerhamster.com"         
## [41] "mms.cnn.com"                    "ping.chartbeat.net"            
## [43] "analytics.twitter.com"          "sharethrough.adnxs.com"        
## [45] "match.adsrvr.org"               "gum.criteo.com"                
## [47] "www.facebook.com"               "d3qdfnco3bamip.cloudfront.net" 
## [49] "connect.facebook.net"           "log.outbrain.com"              
## [51] "serve2.combotag.com"            "rva.outbrain.com"              
## [53] "odb.outbrain.com"               "dynaimage.cdn.cnn.com"         
## [55] "data.api.cnn.io"                "aax.amazon-adsystem.com"       
## [57] "cdns.gigya.com"                 "t.co"                          
## [59] "pixel.quantserve.com"           "ad.doubleclick.net"            
## [61] "cdn3.optimizely.com"            "w.usabilla.com"                
## [63] "amplifypixel.outbrain.com"      "tr.outbrain.com"               
## [65] "mab.chartbeat.com"              "data.cnn.com"                  
## [67] "widgets.outbrain.com"           "secure.quantserve.com"         
## [69] "static.ads-twitter.com"         "amplify.outbrain.com"          
## [71] "tag.bounceexchange.com"         "adservice.google.com"          
## [73] "adservice.google.com.ua"        "www.googletagservices.com"     
## [75] "cdn.adsafeprotected.com"        "js-sec.indexww.com"            
## [77] "ads.rubiconproject.com"         "c.amazon-adsystem.com"         
## [79] "www.ugdturner.com"              "a.postrelease.com"             
## [81] "cdn.optimizely.com"             "cnn.com"

O_o

FIN

Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way on how to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate crawl responsibly and use a 5 second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.

I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors (crawlers) in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# a V8 context to evaluate the carved-out javascript in (used below via ctx$eval())
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.
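For example (hypothetical label strings of the sort that show up in the scraped content):

mfga(c("Attack Number", "Type of Attack", "Date / Time"))
## [1] "attack_number"  "type_of_attack" "date_time"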

Next, I set up a cols object because we’re going to be extracting data from text as text and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small munging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.

Now we can iterate over the good (complete) responses.

The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.

Because ICC used the same Joomla plugin over the years, the data is uniform, but also can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on Github and don’t hesitate to ask questions in the comments or file an issue. If you make something blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney:

When I got distracted by a PDF in the CRAN doc/contrib directory: Short-refcard.pdf. I’m not a big reference card user but students really like them and after seeing what it was I remembered having seen the document ages ago, but never associated it with CRAN before.

I saw:

by Tom Short, EPRI PEAC, tshort@epri-peac.com 2004-11-07 Granted to the public domain. See www. Rpad. org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission).

at the top of the card. The link (which I’ve made unclickable for reasons you’ll see in a sec — don’t visit that URL) was clickable and I tapped it as I wanted to see if it had changed since 2004.

You can open that image in a new tab to see the full, rendered site and take a moment to see if you can find the section that links to objectionable — and, potentially malicious — content. It’s easy to spot.

I made a likely correct assumption that Tom Short had nothing to do with this and wanted to dig into it a bit further to see when this may have happened. So, don your bestest deerstalker and follow along.

Digging In Domain Land

We’ll need some helpers to poke around this data in a safe manner:

library(wayback) # devtools::install_github("hrbrmstr/wayback")
library(ggTimeSeries) # devtools::install_github("AtherEnergy/ggTimeSeries")
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(passivetotal) # devtools::install_github("hrbrmstr/passivetotal")
library(cymruservices)
library(magick)
library(hrbrthemes) # for theme_ipsum_rc() used in the plots below
library(tidyverse)

(You’ll need to get a RiskIQ PassiveTotal key to use those functions. Also, please donate to Archive.org if you use the wayback package.)

Now, let’s see if the main Rpad content URL is in the wayback machine:

glimpse(archive_available("http://www.rpad.org/Rpad/"))
## Observations: 1
## Variables: 5
## $ url        <chr> "http://www.rpad.org/Rpad/"
## $ available  <lgl> TRUE
## $ closet_url <chr> "http://web.archive.org/web/20170813053454/http://ww...
## $ timestamp  <dttm> 2017-08-13
## $ status     <chr> "200"

It is! Let’s see how many versions of it are in the archive:

x <- cdx_basic_query("http://www.rpad.org/Rpad/")

ts_range <- range(x$timestamp)

count(x, timestamp) %>%
  ggplot(aes(timestamp, n)) +
  geom_segment(aes(xend=timestamp, yend=0)) +
  labs(x=NULL, y="# changes in year", title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="Y")

count(x, timestamp) %>%
  mutate(Year = lubridate::year(timestamp)) %>%
  complete(timestamp=seq(ts_range[1], ts_range[2], "1 day"))  %>%
  filter(!is.na(timestamp), !is.na(Year)) %>%
  ggplot(aes(date = timestamp, fill = n)) +
  stat_calendar_heatmap() +
  viridis::scale_fill_viridis(na.value="white", option = "magma") +
  facet_wrap(~Year, ncol=1) +
  labs(x=NULL, y=NULL, title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(panel.spacing = grid::unit(0.5, "lines"))

There’s a big span between 2008/9 and 2016/17. Let’s poke around there a bit. First 2016:

tm <- get_timemap("http://www.rpad.org/Rpad/")

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2016))
## # A tibble: 1 x 5
##       rel                                                                   link  type
##     <chr>                                                                  <chr> <chr>
## 1 memento http://web.archive.org/web/20160629104907/http://www.rpad.org:80/Rpad/  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2016 <- render_png(url = rurl$link))

Hrm. Could be server or network errors.

Let’s go back to 2009.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2009))
## # A tibble: 4 x 5
##       rel                                                                  link  type
##     <chr>                                                                 <chr> <chr>
## 1 memento     http://web.archive.org/web/20090219192601/http://rpad.org:80/Rpad  <NA>
## 2 memento http://web.archive.org/web/20090322163146/http://www.rpad.org:80/Rpad  <NA>
## 3 memento http://web.archive.org/web/20090422082321/http://www.rpad.org:80/Rpad  <NA>
## 4 memento http://web.archive.org/web/20090524155658/http://www.rpad.org:80/Rpad  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2009 <- render_png(url = rurl$link[4]))

If you poke around that, it looks like the original Rpad content, so it was “safe” back then.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2017))
## # A tibble: 6 x 5
##       rel                                                                link  type
##     <chr>                                                               <chr> <chr>
## 1 memento  http://web.archive.org/web/20170323222705/http://www.rpad.org/Rpad  <NA>
## 2 memento http://web.archive.org/web/20170331042213/http://www.rpad.org/Rpad/  <NA>
## 3 memento http://web.archive.org/web/20170412070515/http://www.rpad.org/Rpad/  <NA>
## 4 memento http://web.archive.org/web/20170518023345/http://www.rpad.org/Rpad/  <NA>
## 5 memento http://web.archive.org/web/20170702130918/http://www.rpad.org/Rpad/  <NA>
## 6 memento http://web.archive.org/web/20170813053454/http://www.rpad.org/Rpad/  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2017 <- render_png(url = rurl$link[1]))

I won’t break your browser and add another giant image, but that one has the icky content. So, it’s a relatively recent takeover and it’s likely that whoever added the icky content links did so to try to ensure those domains and URLs have both good SEO and a positive reputation.

Let’s see if they were dumb enough to make their info public:

rwho <- passive_whois("rpad.org")
str(rwho, 1)
## List of 18
##  $ registryUpdatedAt: chr "2016-10-05"
##  $ admin            :List of 10
##  $ domain           : chr "rpad.org"
##  $ registrant       :List of 10
##  $ telephone        : chr "5078365503"
##  $ organization     : chr "WhoisGuard, Inc."
##  $ billing          : Named list()
##  $ lastLoadedAt     : chr "2017-03-14"
##  $ nameServers      : chr [1:2] "ns-1147.awsdns-15.org" "ns-781.awsdns-33.net"
##  $ whoisServer      : chr "whois.publicinterestregistry.net"
##  $ registered       : chr "2004-06-15"
##  $ contactEmail     : chr "411233718f2a4cad96274be88d39e804.protect@whoisguard.com"
##  $ name             : chr "WhoisGuard Protected"
##  $ expiresAt        : chr "2018-06-15"
##  $ registrar        : chr "eNom, Inc."
##  $ compact          :List of 10
##  $ zone             : Named list()
##  $ tech             :List of 10

Nope. #sigh

Is this site considered “malicious”?

(rclass <- passive_classification("rpad.org"))
## $everCompromised
## NULL

Nope. #sigh

What’s the hosting history for the site?

rdns <- passive_dns("rpad.org")
rorig <- bulk_origin(rdns$results$resolve)

tbl_df(rdns$results) %>%
  type_convert() %>%
  select(firstSeen, resolve) %>%
  left_join(select(rorig, resolve=ip, as_name=as_name)) %>% 
  arrange(firstSeen) %>%
  print(n=100)
## # A tibble: 88 x 3
##              firstSeen        resolve                                              as_name
##                 <dttm>          <chr>                                                <chr>
##  1 2009-12-18 11:15:20  144.58.240.79      EPRI-PA - Electric Power Research Institute, US
##  2 2016-06-19 00:00:00 208.91.197.132 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  3 2016-07-29 00:00:00  208.91.197.27 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  4 2016-08-12 20:46:15  54.230.14.253                     AMAZON-02 - Amazon.com, Inc., US
##  5 2016-08-16 14:21:17  54.230.94.206                     AMAZON-02 - Amazon.com, Inc., US
##  6 2016-08-19 20:57:04  54.230.95.249                     AMAZON-02 - Amazon.com, Inc., US
##  7 2016-08-26 20:54:02 54.192.197.200                     AMAZON-02 - Amazon.com, Inc., US
##  8 2016-09-12 10:35:41   52.84.40.164                     AMAZON-02 - Amazon.com, Inc., US
##  9 2016-09-17 07:43:03  54.230.11.212                     AMAZON-02 - Amazon.com, Inc., US
## 10 2016-09-23 18:17:50 54.230.202.223                     AMAZON-02 - Amazon.com, Inc., US
## 11 2016-09-30 19:47:31 52.222.174.253                     AMAZON-02 - Amazon.com, Inc., US
## 12 2016-10-24 17:44:38  52.85.112.250                     AMAZON-02 - Amazon.com, Inc., US
## 13 2016-10-28 18:14:16 52.222.174.231                     AMAZON-02 - Amazon.com, Inc., US
## 14 2016-11-11 10:44:22 54.240.162.201                     AMAZON-02 - Amazon.com, Inc., US
## 15 2016-11-17 04:34:15 54.192.197.242                     AMAZON-02 - Amazon.com, Inc., US
## 16 2016-12-16 17:49:29   52.84.32.234                     AMAZON-02 - Amazon.com, Inc., US
## 17 2016-12-19 02:34:32 54.230.141.240                     AMAZON-02 - Amazon.com, Inc., US
## 18 2016-12-23 14:25:32  54.192.37.182                     AMAZON-02 - Amazon.com, Inc., US
## 19 2017-01-20 17:26:28  52.84.126.252                     AMAZON-02 - Amazon.com, Inc., US
## 20 2017-02-03 15:28:24   52.85.94.225                     AMAZON-02 - Amazon.com, Inc., US
## 21 2017-02-10 19:06:07   52.85.94.252                     AMAZON-02 - Amazon.com, Inc., US
## 22 2017-02-17 21:37:21   52.85.63.229                     AMAZON-02 - Amazon.com, Inc., US
## 23 2017-02-24 21:43:45   52.85.63.225                     AMAZON-02 - Amazon.com, Inc., US
## 24 2017-03-05 12:06:32  54.192.19.242                     AMAZON-02 - Amazon.com, Inc., US
## 25 2017-04-01 00:41:07 54.192.203.223                     AMAZON-02 - Amazon.com, Inc., US
## 26 2017-05-19 00:00:00   13.32.246.44                     AMAZON-02 - Amazon.com, Inc., US
## 27 2017-05-28 00:00:00    52.84.74.38                     AMAZON-02 - Amazon.com, Inc., US
## 28 2017-06-07 08:10:32  54.230.15.154                     AMAZON-02 - Amazon.com, Inc., US
## 29 2017-06-07 08:10:32  54.230.15.142                     AMAZON-02 - Amazon.com, Inc., US
## 30 2017-06-07 08:10:32  54.230.15.168                     AMAZON-02 - Amazon.com, Inc., US
## 31 2017-06-07 08:10:32   54.230.15.57                     AMAZON-02 - Amazon.com, Inc., US
## 32 2017-06-07 08:10:32   54.230.15.36                     AMAZON-02 - Amazon.com, Inc., US
## 33 2017-06-07 08:10:32  54.230.15.129                     AMAZON-02 - Amazon.com, Inc., US
## 34 2017-06-07 08:10:32   54.230.15.61                     AMAZON-02 - Amazon.com, Inc., US
## 35 2017-06-07 08:10:32   54.230.15.51                     AMAZON-02 - Amazon.com, Inc., US
## 36 2017-07-16 09:51:12 54.230.187.155                     AMAZON-02 - Amazon.com, Inc., US
## 37 2017-07-16 09:51:12 54.230.187.184                     AMAZON-02 - Amazon.com, Inc., US
## 38 2017-07-16 09:51:12 54.230.187.125                     AMAZON-02 - Amazon.com, Inc., US
## 39 2017-07-16 09:51:12  54.230.187.91                     AMAZON-02 - Amazon.com, Inc., US
## 40 2017-07-16 09:51:12  54.230.187.74                     AMAZON-02 - Amazon.com, Inc., US
## 41 2017-07-16 09:51:12  54.230.187.36                     AMAZON-02 - Amazon.com, Inc., US
## 42 2017-07-16 09:51:12 54.230.187.197                     AMAZON-02 - Amazon.com, Inc., US
## 43 2017-07-16 09:51:12 54.230.187.185                     AMAZON-02 - Amazon.com, Inc., US
## 44 2017-07-17 13:10:13 54.239.168.225                     AMAZON-02 - Amazon.com, Inc., US
## 45 2017-08-06 01:14:07  52.222.149.75                     AMAZON-02 - Amazon.com, Inc., US
## 46 2017-08-06 01:14:07 52.222.149.172                     AMAZON-02 - Amazon.com, Inc., US
## 47 2017-08-06 01:14:07 52.222.149.245                     AMAZON-02 - Amazon.com, Inc., US
## 48 2017-08-06 01:14:07  52.222.149.41                     AMAZON-02 - Amazon.com, Inc., US
## 49 2017-08-06 01:14:07  52.222.149.38                     AMAZON-02 - Amazon.com, Inc., US
## 50 2017-08-06 01:14:07 52.222.149.141                     AMAZON-02 - Amazon.com, Inc., US
## 51 2017-08-06 01:14:07 52.222.149.163                     AMAZON-02 - Amazon.com, Inc., US
## 52 2017-08-06 01:14:07  52.222.149.26                     AMAZON-02 - Amazon.com, Inc., US
## 53 2017-08-11 19:11:08 216.137.61.247                     AMAZON-02 - Amazon.com, Inc., US
## 54 2017-08-21 20:44:52  13.32.253.116                     AMAZON-02 - Amazon.com, Inc., US
## 55 2017-08-21 20:44:52  13.32.253.247                     AMAZON-02 - Amazon.com, Inc., US
## 56 2017-08-21 20:44:52  13.32.253.117                     AMAZON-02 - Amazon.com, Inc., US
## 57 2017-08-21 20:44:52  13.32.253.112                     AMAZON-02 - Amazon.com, Inc., US
## 58 2017-08-21 20:44:52   13.32.253.42                     AMAZON-02 - Amazon.com, Inc., US
## 59 2017-08-21 20:44:52  13.32.253.162                     AMAZON-02 - Amazon.com, Inc., US
## 60 2017-08-21 20:44:52  13.32.253.233                     AMAZON-02 - Amazon.com, Inc., US
## 61 2017-08-21 20:44:52   13.32.253.29                     AMAZON-02 - Amazon.com, Inc., US
## 62 2017-08-23 14:24:15 216.137.61.164                     AMAZON-02 - Amazon.com, Inc., US
## 63 2017-08-23 14:24:15 216.137.61.146                     AMAZON-02 - Amazon.com, Inc., US
## 64 2017-08-23 14:24:15  216.137.61.21                     AMAZON-02 - Amazon.com, Inc., US
## 65 2017-08-23 14:24:15 216.137.61.154                     AMAZON-02 - Amazon.com, Inc., US
## 66 2017-08-23 14:24:15 216.137.61.250                     AMAZON-02 - Amazon.com, Inc., US
## 67 2017-08-23 14:24:15 216.137.61.217                     AMAZON-02 - Amazon.com, Inc., US
## 68 2017-08-23 14:24:15  216.137.61.54                     AMAZON-02 - Amazon.com, Inc., US
## 69 2017-08-25 19:21:58  13.32.218.245                     AMAZON-02 - Amazon.com, Inc., US
## 70 2017-08-26 09:41:34   52.85.173.67                     AMAZON-02 - Amazon.com, Inc., US
## 71 2017-08-26 09:41:34  52.85.173.186                     AMAZON-02 - Amazon.com, Inc., US
## 72 2017-08-26 09:41:34  52.85.173.131                     AMAZON-02 - Amazon.com, Inc., US
## 73 2017-08-26 09:41:34   52.85.173.18                     AMAZON-02 - Amazon.com, Inc., US
## 74 2017-08-26 09:41:34   52.85.173.91                     AMAZON-02 - Amazon.com, Inc., US
## 75 2017-08-26 09:41:34  52.85.173.174                     AMAZON-02 - Amazon.com, Inc., US
## 76 2017-08-26 09:41:34  52.85.173.210                     AMAZON-02 - Amazon.com, Inc., US
## 77 2017-08-26 09:41:34   52.85.173.88                     AMAZON-02 - Amazon.com, Inc., US
## 78 2017-08-27 22:02:41  13.32.253.169                     AMAZON-02 - Amazon.com, Inc., US
## 79 2017-08-27 22:02:41  13.32.253.203                     AMAZON-02 - Amazon.com, Inc., US
## 80 2017-08-27 22:02:41  13.32.253.209                     AMAZON-02 - Amazon.com, Inc., US
## 81 2017-08-29 13:17:37 54.230.141.201                     AMAZON-02 - Amazon.com, Inc., US
## 82 2017-08-29 13:17:37  54.230.141.83                     AMAZON-02 - Amazon.com, Inc., US
## 83 2017-08-29 13:17:37  54.230.141.30                     AMAZON-02 - Amazon.com, Inc., US
## 84 2017-08-29 13:17:37 54.230.141.193                     AMAZON-02 - Amazon.com, Inc., US
## 85 2017-08-29 13:17:37 54.230.141.152                     AMAZON-02 - Amazon.com, Inc., US
## 86 2017-08-29 13:17:37 54.230.141.161                     AMAZON-02 - Amazon.com, Inc., US
## 87 2017-08-29 13:17:37  54.230.141.38                     AMAZON-02 - Amazon.com, Inc., US
## 88 2017-08-29 13:17:37 54.230.141.151                     AMAZON-02 - Amazon.com, Inc., US

Unfortunately, I expected this. The owner keeps moving it around on AWS infrastructure.

So What?

This was an innocent link in a document on CRAN that went to a site that looked legit. A clever individual or organization found the dead domain and saw an opportunity to legitimize some fairly nasty stuff.

Now, I realize nobody is likely using “Rpad” anymore, but this type of situation can happen to any registered domain. If this individual or organization were doing more than trying to make objectionable content legit, they likely could have succeeded, especially if they enticed you with a shiny new devtools::install_…() link with promises of statistically sound animated cat emoji gif creation tools. They did an eerily good job of making this particular site still seem legit.

There’s nothing most folks can do to “fix” that site or have it removed. I’m not sure CRAN should remove the otherwise-helpful PDF, but since it contains a clickable link to the hijacked domain, suggesting that they do something about it might be worthwhile.

You’ll see that I used the splashr package (which has been submitted to CRAN but not there yet). It’s a good way to work with potentially malicious web content since you can “see” it and mine content from it without putting your own system at risk.
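If you want a feel for that workflow, here’s a minimal sketch (not the exact code used for the lookups above). It assumes a Splash instance is already running locally (e.g. via the ScrapingHub Docker image or splashr::start_splash()) and the URL is just a placeholder:

# minimal sketch; assumes a local Splash instance is up and the URL is a placeholder

library(splashr)
library(rvest)

sp <- splash("localhost")   # handle to the local Splash server
splash_active(sp)           # should be TRUE if the instance is reachable

# the page renders inside Splash's sandboxed browser (not on your box); you get
# the post-javascript HTML back and can mine it with rvest at your leisure
pg <- render_html(sp, url = "http://example.com/")   # placeholder URL
html_text(html_nodes(pg, "title"))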

After going through this, I’ll see what I can do to put some bows on some of the devel-only packages and get them into CRAN so there’s a bit more assurance around using them.

I’m an army of one when it comes to fielding R-related security issues, but if you do come across suspicious items (like this or icky/malicious in other ways) don’t hesitate to drop me an @ or DM on Twitter.

Once I realized that my planned, larger post would not come to fruition today, I took the R⁶ post (i.e. “minimal expository, keen focus”) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this (the screenshot here showed the original Word document: a heading plus some bold and italic text) converted to:

Does pandoc work?
=================

Simple document with **bold** and *italics*.

This is definitely a job that pandoc can handle.

pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.

Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and, would not be possible without — Haskell.
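If you’re curious which pandoc your rmarkdown setup is actually shelling out to, the rmarkdown package has a couple of helpers you can poke at from the console:

library(rmarkdown)

pandoc_available()   # TRUE if rmarkdown can locate a pandoc binary
pandoc_version()     # the version of the binary it will use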

Doing the aforementioned conversion from docx to markdown is super-simple from R:

rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")

Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at pandoc.org to see the power available at your command.
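And, if a whole folder of these documents needs converting, a small purrr-powered sketch can batch it (the docs/ folder and the extra pandoc options here are just illustrative assumptions):

library(rmarkdown)
library(purrr)

# hypothetical folder of "lightly formatted" Word docs
docx_files <- list.files("docs", pattern = "\\.docx$", full.names = TRUE)

walk(docx_files, function(f) {
  pandoc_convert(
    input = basename(f),
    to = "markdown",
    output = sub("\\.docx$", ".md", basename(f)),  # lands next to the .docx
    wd = dirname(f),                               # run pandoc in that folder
    options = c("--wrap=none",             # don't hard-wrap the markdown
                "--extract-media=media")   # pull embedded images into ./media
  )
})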

Just One More Thing

This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist :-)

There’s a neat, not-on-CRAN package by François Keck called subtools (https://github.com/fkeck/subtools) which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.

There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is in Python and I’ve never had time to fully replicate it in R.

Here’s a snippet (sans expository) that can get you started pulling subtitles into R and tidytext. I would have written scraper code, but the various subtitle aggregation sites make that a task better suited for something like my splashr package and I just had no cycles to write it. So, I grabbed the first season of “The Flash” and used the Bing sentiment lexicon from tidytext to see how the season looked.

The overall scoring for a given episode is naive and can definitely be improved upon.

Definitely drop a link to anything you create in the comments!

# devtools::install_github("fkeck/subtools")

library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

# grab the Bing & AFINN sentiment lexicons via tidytext
bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

# one SubRip (.srt) file per Season 1 episode (grabbed by hand; no scraper here)
fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

# simple progress meter since parsing a whole season takes a bit
pb <- progress_estimated(length(fils))

# for each episode: read the subtitles, tokenize to words, drop stop words,
# and keep the words that appear in both sentiment lexicons
map_df(seq_along(fils), ~{

  pb$tick()$print()

  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)

}) %>% as_tibble() -> season_sentiments


# share of positive vs negative words per episode (grouping by episode so the
# denominator is that episode's word count, then flipping negative below zero)
count(season_sentiments, ep, sentiment) %>%
  group_by(ep) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) %>%
  ungroup() -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1)))