
The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap of the reticulate-based docker package for the pure-R stevedore package, which makes splashr far more compatible across the landscape of R installs since it removes a somewhat heavy dependency on a working Python environment (something quite challenging to achieve consistently in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast, as an increasing number of sites change what type of HTML (et al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN release. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of the Python dependency), I’ve merged the 0.6.0 changes into the master branch, but you can follow the machinations of the 0.6.0 branch up until the CRAN release.

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running, and it’s best to run it in a Docker container.
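A minimal authenticated session looks something like this (the host and credentials below are placeholders, not a real setup):

library(splashr)

sp <- splash(host = "splash.example.com", user = "scraper", pass = "s3kr1t")

sp %>% 
  splash_active() # verify the instance is reachable before doing real work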

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can make parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python), following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port, 8036, to watch in your browser (the HAProxy stats page).

This works well when combined with furrr, an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium might look like this:

library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

sites <- c("http://...", "http://...", ...) # a character vector of unique URLs

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on", 
    user = "your splash api username", 
    pass = "your splash api password"
  ) %>% 
    splash_response_body(TRUE) %>% # we want to get all the content 
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from 
    splash_go(org_url) %>% 
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers=5) # don't overwhelm the default setup or your internet connection

future_map(sites, ~{
  
  org <- safe_splash(.x) # go get it!
  
  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    
    HARtools::writeHAR( # HAR format saves *everything*. the files are YUGE
      har = org$result, 
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    
    sprintf("Successfully retrieved %s", .x)
    
  }
  
}) -> results

(Those with a keen eye will now grok why splashr supports Splash API basic authentication.)

The parallel iterator returns a list which we can flatten to a character vector to check for errors. (I don’t flatten by default since it’s safer to get a list back, as it can hold anything, whereas map_chr() likes to check for proper objects.) Something like:

flatten_chr(results) %>% 
  keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"

Timeouts suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may mean you need to tweak the Aquarium configuration to add more workers or reduce the worker count in your plan(…) call. It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
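One minimal way to bolt retries onto the workflow above (a sketch; tune tries and pause to your own setup):

splash_retry <- function(org_url, tries = 3, pause = 5) {
  for (i in seq_len(tries)) {
    res <- safe_splash(org_url) # the safely()-wrapped call from above
    if (!is.null(res$result)) return(res$result) # success
    Sys.sleep(pause) # give the cluster (and the site) a breather before retrying
  }
  NULL # give up; the caller can log the URL for a later pass
}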

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.

I’m pleased to announce that splashr is now on CRAN.

(That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")).

The package is an R interface to the Splash javascript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the hood.

I’ve blogged about splashr before and the package comes with three vignettes that (hopefully) provide a solid introduction to using the web scraping framework.

More features — including additional DSL functions — will be added in the coming months, but please kick the tyres and file an issue with problems or suggestions.

Many thanks to all who took it for a spin and provided suggestions and even more thanks to the CRAN team for a speedy onboarding.

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

library(splashr)
library(DiagrammeR) # for the example widget below
library(htmlwidgets) # for saveWidget()

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

But if you use the new Docker image and the add_tempdir=TRUE parameter it can render any local HTML file.

The other new bits are helpers to identify content types in HAR entries. Along with get_content_type():

library(tidyverse)

map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA

there are many is_...() functions for logical tests.
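For instance, a quick tally of the content types from that same example is one pipe away:

map_chr(rud_har$log$entries, get_content_type) %>% 
  table(useNA = "ifany") %>% 
  sort(decreasing = TRUE)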

But one of the more interesting is_() functions is is_xhr(). Sites with dynamic content usually load said content via an XMLHttpRequest (XHR for short). Modern web apps usually return JSON in said requests and, for questions like this one on StackOverflow, it’s usually better to grab the JSON and use it for data than it is to scrape the table made from JavaScript calls.

Now, it’s not too hard to open Developer Tools and find those XHR requests, but we can also use splashr to find them programmatically. We have to do a bit more work and use the new execute_lua() function, since we need to give the page time to load up all the data. (I’ll eventually write a mini R DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks.) Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## [1] 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{\"practice_name\":\"Pediatrics, Inc.\", \"address_1\":\"577 Main Street\", \"city\":"| __truncated__ "{\"practice_name\":\"Crown Colony Pediatrics\", \"address_1\":\"500 Congress Street, Suite 1F\""| __truncated__ "{\"practice_name\":\"Pediatricians Inc.\", \"address_1\":\"955 Main Street\", \"city\":"| __truncated__ "{\"practice_name\":\"Lexington Pediatrics\", \"address_1\":\"19 Muzzey Street, Suite 105\", "| __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()

UPDATE: So, I wrote a mini-DSL for this:

splash_local %>%
  splash_response_body(TRUE) %>% 
  splash_go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D") %>% 
  splash_wait(2) %>% 
  splash_har() -> res

which should make it easier to perform basic “go-wait-retrieve” operations.
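From there, res holds the same HAR structure we pulled apart above, so locating the XHR entry should work the same way:

which(map_lgl(res$log$entries, is_xhr))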

It’s unlikely we’ll want to rely on a running Splash instance for production work, so I’ll be making a helper function to turn HAR XHR requests into httr function calls, similar to the way curlconverter works.
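As a rough illustration of the idea: a HAR entry carries everything needed to rebuild its request, so a hand-rolled first cut (shown here for illustration only, not the eventual helper) might look like:

har_entry_to_httr <- function(entry) {
  req <- entry$request
  hdrs <- setNames( # HAR headers are name/value pairs
    map_chr(req$headers, "value"),
    map_chr(req$headers, "name")
  )
  httr::VERB(req$method, url = req$url, httr::add_headers(.headers = hdrs))
}

# e.g. replay the XHR we found earlier without Splash running
har_entry_to_httr(pg$log$entries[[10]])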

If you do enough web scraping, you’ll eventually hit a wall that the trusty httr verbs (that sit beneath rvest) cannot really overcome: dynamically created content (via javascript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still stick with httr verbs — if you can figure out what those requests are — and code-up the right parameters (browser “Developer Tools” menus/views and my curlconverter package are super handy for this). Unfortunately, some sites require actual in-page rendering and that’s when scraping turns into a modest chore.

For dynamic sites, the RSelenium and/or seleniumPipes packages are super-handy tools to have in the toolbox. They interface with Selenium which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as phantomjs. Getting the server component of Selenium running was often a source of pain for R folks, but the new docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.

However, sometimes all you need is the rendering part and, for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in Python and uses QT WebKit for rendering. To avoid deluging your system with all of the Splash dependencies, you can use the docker images. In fact, I made it dead easy to do so. Read on!

Going for a dip

The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joined in the fun to add some tweaks & enhancements that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with it in splashr, I wanted to make that as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor"), you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:

library(splashr)

install_splash()
splash_svr <- start_splash()

The install_splash() function will pull the correct image to your local system and you’ll need that splash_svr object later on to stop the container. Now, you can have Splash running on any host, but this post assumes you’re running it locally.

We can test to see if the server is active:

splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008

Now, we’re ready to scrape!

We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial, since it doesn’t use XHR but does require rendering, and it doesn’t prohibit scraping in its Terms of Service (don’t violate Terms of Service; doing so is both unethical and could get you blocked, fined or worse).

Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:

You’re saying “HEY. That has <table> in the HTML so why not just use rvest?” Well, you can validate the lack of <table>s in the “view source” view of the page or with:

library(rvest)

pg <- read_html("http://www.techstars.com/companies/")
html_nodes(pg, "table")
## {xml_nodeset (0)}

Now, let’s do it with splashr:

splash("localhost") %>% 
  render_html("http://www.techstars.com/companies/", wait=5) -> pg
  
html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table class="table75"><tbody>\n<tr>\n<th>Status</th>\n        <th>Number of Com ...
##  [2] <table class="table75"><tbody>\n<tr>\n<th colspan="2">Impact</th>\n      </tr>\n ...
##  [3] <table class="table75"><tbody>\n<tr>\n<th>Class</th>\n        <th>#Co's</th>\n   ...
##  [4] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Anywhere 2017 Q1</th>\ ...
##  [5] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Atlanta 2016 Summer</t ...
##  [6] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2013 Fall</th>\ ...
##  [7] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2014 Summer</th ...
##  [8] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2015 Spring</th ...
##  [9] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2016 Spring</th ...
## [10] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2014</th>\n   ...
## [11] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2015 Spring</ ...
## [12] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2016 Winter</ ...
## [13] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Cape Town 201 ...
## [14] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2015 Summ ...
## [15] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2016 Summ ...
## [16] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Tel Aviv 2016 ...
## [17] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2015 Summer</th ...
## [18] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2016 Summer</th ...
## [19] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2009 Spring</th ...
## [20] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2010 Spring</th ...
## ...

We need to set the wait parameter (5 seconds was likely overkill) to give the javascript callbacks time to run. Now you can go crazy turning that into data.
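Grabbing that “Summary by Class” table is now a one-liner, assuming (as the output above suggests) that it is the third of the table75 tables:

html_nodes(pg, "table.table75")[[3]] %>% 
  html_table(header = TRUE)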

Candid Camera

You can also take snapshots (pictures) of websites with splashr, like this (apologies if you start drooling on your keyboard):

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")

The snapshot functions return magick objects, so you can do anything you’d like with them.
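For example (a minimal sketch, assuming you also have the magick package installed), you can scale a capture down and write it to disk:

library(magick)

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x") %>% 
  image_scale("800") %>% # shrink to 800px wide
  image_write("p5x.png") # save the snapshot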

HARd Work

Since Splash is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools package to visualize and further analyze the data. For the sake of brevity, here’s just the main print() output from a site:

splash("localhost") %>% 
  render_har("https://www.r-bloggers.com/")

## --------HAR VERSION-------- 
## HAR specification version: 1.2 
## --------HAR CREATOR-------- 
## Created by: Splash 
## version: 2.3.1 
## --------HAR BROWSER-------- 
## Browser: QWebKit 
## version: 538.1 
## --------HAR PAGES-------- 
## Page id: 1 , Page title: R-bloggers | R news and tutorials contributed by (750) R bloggers 
## --------HAR ENTRIES-------- 
## Number of entries: 130 
## REQUESTS: 
## Page: 1 
## Number of entries: 130 
##   -  https://www.r-bloggers.com/ 
##   -  https://www.r-bloggers.com/wp-content/themes/magazine-basic-child/style.css 
##   -  https://www.r-bloggers.com/wp-content/plugins/mashsharer/assets/css/mashsb.min.cs... 
##   -  https://www.r-bloggers.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?... 
##   -  https://www.r-bloggers.com/wp-content/plugins/jetpack/css/jetpack.css?ver=4.4.2 
##      ........ 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/10579991_10152371745729891_26331957... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/14962601_10210947974726136_38966601... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/c0.8.50.50/p50x50/311082_286149511398044_4... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/11046696_917285094960943_6143235831... 
##   -  https://static.xx.fbcdn.net/rsrc.php/v3/y2/r/0iTJ2XCgjBy.png

FIN

You can also do some basic scripting in Splash with Lua. Coding up an interface to that capability is on the TODO list, as is adding final tests and enabling tweaks to the Docker configurations to support more fun things that Splash can do.

File an issue on GitHub if you have feature requests or problems, and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.

Don’t forget to stop_splash(splash_svr) when you’re finished scraping!

Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:

If we tweak the buffer space around the squares, I think the cartogram looks better:

but, you should likely use a different palette (see this Twitter thread for examples).

I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:

library(sf)
library(hrbrthemes) 
library(catchpole)
library(tidyverse)

delegates <- read_delegates()

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>% 
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

gg <- ggplot()
for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg + 
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = c(
      "Biden" = '#f0027f',
      "Sanders" = '#7fc97f',
      "Warren" = '#beaed4',
      "Buttigieg" = '#fdc086',
      "Klobuchar" = '#ffff99',
      "Gabbard" = '#386cb0',
      "Bloomberg" = '#bf5b17'
    ),
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(
    fill = guide_legend(
      override.aes = list(size = 4)
    )
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid="") +
  theme(legend.position = "bottom")

{ssdeepr}

Researcher pals over at Binary Edge added web page hashing (pre- and post-javascript scraping) to their platform using ssdeep. This approach is in the category of context triggered piecewise hashes (CTPH) (also known as locality-sensitive hashing), similar to my R adaptation/packaging of Trend Micro’s tlsh.

Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.

I recommend using the hash_con() function if you need to read large blobs since it doesn’t require you to read everything into memory first (though hash_file() doesn’t either, but that’s a direct low-level call to the underlying ssdeep library file reader and not as flexible as R connections are).
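Both produce the same style of hash, so (as a quick sketch, with a placeholder path) either of these works on a local file:

h_a <- hash_file("/path/to/big-dump.html") # low-level ssdeep file reader
h_b <- hash_con(file("/path/to/big-dump.html")) # R connection; url(), gzfile(), etc. work too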

These types of hashes are great for seeing if something has changed on a website (or for seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?

library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94

I picked on cran.biotools.fr since I saw it was well behind CRAN-proper on the monitoring page.

I noted that BE was doing pre- and post-javascript hashing as well. Why, you may ask? Well, websites behave differently with javascript running, plus they can behave differently when different user-agents are set. Let’s grab a page from Wikipedia a few different ways to show how they are not alike at all, depending on the retrieval context. First, let’s grab some web content!

library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")

h4 <- hash_raw(content(res, as = "raw"))

Now, let’s compare them:

hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled

hash_compare(h1, h2)
## [1] 0
hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents

hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness

hash_compare(h3, h4)
## [1] 0

For this example, content size alone would have been enough to tell the difference (mostly; note how the h1/h4 hashes compare as equal despite more characters coming back with the {httr} method):

length(js1)
## [1] 432914

length(js2)
## [1] 270538

nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
    collapse = "\n"
  )
)
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099

FIN

If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!

The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly to try out the Rtools 4.0 toolchain vs deliberately wanting to support legacy platforms).

As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while—even on freshly minted systems—so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “Elements” panel and you’ll see that the content is in javascript right there on the site, but the source has no <div> elements because they’re generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping), as it’ll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).
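For instance, the quotes sandbox paginates at /js/page/N/, so (a quick sketch) you can walk the first few pages with nothing more than hu_read_html() and count the quotes on each:

map_int(1:3, ~{
  sprintf("http://quotes.toscrape.com/js/page/%d/", .x) %>% 
    hu_read_html() %>% 
    html_nodes("div.quote") %>% 
    length()
})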

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} at:

A soon-to-be-organized list of R packages for use in cybersecurity research, DFIR, risk analysis, metadata collection, document/data processing and more (not just by me, but the current list is made up of ones I’ve created or resurrected). If you want your packages to appear here, add the r-cyber topic to your GitLab or GitHub repos and this list will be automagically updated on a periodic basis.

  • AnomalyDetection : ⏰ Anomaly Detection with R (separately maintained fork of Twitter’s AnomalyDetection) (r, rstats, anomaly-detection, anomalydetection, r-cyber)
  • aquarium : Validate ‘Phishing’ ‘URLs’ with the ‘PhishTank’ Service (r, rstats, phishing, phishtank, r-cyber)
  • astools : ⚒ Tools to Work With Autonomous System (‘AS’) Network and Organization Data (r, rstats, autonomous-systems, routeviews, r-cyber)
  • blackmagic : Automagically Convert XML to JSON and JSON to XML (r, rstats, xmltojson, xml-to-json, xml-js, json-to-xml-converter, json-to-xml, jsontoxml, r-cyber)
  • burrp : Tools to Import and Process ‘PortSwigger’ ‘Burp’ Proxy Data (rstats, r, burpsuite, proxy, har, r-cyber)
  • carbondater : Estimate the Age of Web Resources (r, rstats, r-cyber)
  • cc : ⛏ Extract metadata of a specific target based on the results of “commoncrawl.org” (r, rstats, common-crawl, domains, urls, reconnaissance, recon, r-cyber)
  • cdx : Query Web Archive Crawl Indexes (‘CDX’) (r, rstats, cdx, web-archives, r-cyber)
  • censys : R interface to the Censys “cyber”/scans search engine • https://www.censys.io/tutorial (censys-data, censys-api, r, rstats, r-cyber)
  • clandnstine : ㊙️ Perform ‘DNS’ over ‘TLS’ Queries (r, rstats, dns-over-tls, getdnsapi, getdns, dns, r-cyber)
  • crafter : An R package to work with PCAPs (r, rstats, pcap, pcap-files, pcap-analyzer, packet-capture, r-cyber)
  • curlconverter : ➰➡️➖ Translate cURL command lines into parameters for use with httr or actual httr calls (R) (curl, httr, r, rstats, r-cyber)
  • curlparse : Parse ‘URLs’ with ‘libcurl’ (r, rstats, libcurl, url-parse, r-cyber)
  • cymruservices : package that provides interfaces to various Team Cymru Services (r, rstats, team-cymru-webservice, malware-hash-registry, bogons, r-cyber)
  • czdaptools : R tools for downloading zone data from ICANN’s CZDS application (r, rstats, r-cyber, czdap)
  • decapitated : Headless ‘Chrome’ Orchestration in R (r, rstats, headless-chrome, web-scraping, javascript, r-cyber)
  • devd : Install, Start and Stop ‘devd’ Instances from R (r, rstats, devd, r-cyber)
  • dnsflare : ❓ Query ‘Cloudflare’ Domain Name System (‘DNS’) Servers over ‘HTTPS’ (r, dns, dns-over-https, cloudflare, cloudflare-1-dot-1-dot-1-dot-1, 1-dot-1-dot-1-dot-1, rstats, r-cyber)
  • dnshelpers : ℹ Tools to Process ‘DNS’ Response Data (r, rstats, dns, dns-parser, r-cyber)
  • domaintools : R API interface to the DomainTools API (r, rstats, domaintools, domaintools-api, r-cyber)
  • dshield : Query ‘SANS’ ‘DShield’ ‘API’ (r, rstats, dshield, isc, r-cyber)
  • exiv : Read and Write ‘Exif’ Image/Media Tags with R (r, rstats, exiv2-library, exiv2, exif, r-cyber)
  • gdns : Tools to work with the Google DNS over HTTPS API in R (spf-record, google-dns, dns, rstats, r, r-cyber)
  • gepetto : ScrapingHub Splash-like REST API for Headless Chrome (headless-chrome, nodejs, node-js, npm, hapi, hapijs, splash, r-cyber)
  • greynoise : Query the ‘GreyNoise Intelligence’ ‘API’ in R (r, rstats, r-cyber, greynoise-intelligence)
  • hgr : Tools to Work with the ‘Postlight’ ‘Mercury’ ‘API’ — https://mercury.postlight.com/web-parser/ — in R (r, rstats, postlight-mercury-api, postlight, r-cyber)
  • hormel : ⚙️ Retrieve and Process ‘Spamhaus’ Zone/Host Metadata (r, rstats, spamhaus, spam, block-list, r-cyber)
  • htmltidy : Tidy Up and Test XPath Queries on HTML and XML Content in R (r, rstats, html, xml, r-cyber)
  • htmlunit : ☕️ Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library (r, rstats, htmlunit, web-scraping, javascript, r-cyber)
  • htmlunitjars : ☕️ Java Archive Wrapper Supporting the ‘htmlunit’ Package (r, rstats, rjava, htmlunit, web-scraping, r-cyber)
  • ipapi : An R package to geolocate IPv4/6 addresses and/or domain names using ip-api.com’s API (r, rstats, r-cyber, ipapi)
  • ipinfo : ℹ Collect Metadata on ‘IP’ Addresses and Autonomous Systems (r, rstats, ipv4, ipv6, asn, ipinfo, r-cyber)
  • ipstack : ⛏ Tools to Query ‘IP’ Address Information from the ‘ipstack’ ‘API’ (r, rstats, ipstack, ip-reputation, ip-geolocation, r-cyber)
  • iptools : A toolkit for manipulating, validating and testing IP addresses and ranges, along with datasets relating to IP add… (iptools, rstats, r, ipv4-address, r-cyber)
  • iptrie : Efficiently Store and Query ‘IPv4’ Internet Addresses with Associated Data (r, rstats, ip-address, cidr, trie, r-cyber, ipv4-trie, ipv4-address, internet-address, ip-trie)
  • jerichojars : Java Archive Wrapper Supporting the ‘jericho’ R Package (r, rstats, r-cyber, jeric)
  • jwatr : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R (r, rstats, java, warc, r-cyber)
  • longurl : ℹ️ Small R package for no-API-required URL expansion (r, rstats, url-shortener, url, r-cyber)
  • mactheknife : Various ‘macOS’-oriented Tools and Utilities in R (r, rstats, reticulate, python, ds-store, macos, r-cyber)
  • MACtools : ⬢ Tools to Work with Media Access Control (‘MAC’) Addresses (r, rstats, mac-address, r-cyber, mac-age-database)
  • mhn : Analyze and Visualize Data from Modern Honey Network Servers with R (r, rstats, mhn, r-cyber, honeypot)
  • middlechild : R interface to MITM (r, rstats, r-cyber, mitm, mitmproxy)
  • mqtt : Interoperate with ‘MQTT’ Message Brokers with R (r, rstats, mqtt, mosquitto, r-cyber)
  • mrt : Tools to Retrieve and Process ‘BGP’ Files in R (r, rstats, mrt, rib, router, bgp, r-cyber)
  • msgxtractr : Extract contents from Outlook ‘.msg’ files in R (r, rstats, outlook, msg, attachment, r-cyber)
  • myip : Tools to Determine Your Public ‘IP’ Address in R (r, rstats, ip-address, ip-info, httpbin, icanhazip, ip-echo, amazon-checkip, akamai-whatismyp, opendns-checkip, r-cyber)
  • ndjson : ♨️ Wicked-Fast Streaming ‘JSON’ (‘ndjson’) Reader in R (r, ndjson, rstats, json, r-cyber)
  • newsflash : Tools to Work with the Internet Archive and GDELT Television Explorer in R (internet-archive, gdelt-television-explorer, r, rstats, r-cyber)
  • nmapr : Perform Network Discovery and Security Auditing with ‘nmap’ in R (r, rstats, nmap, r-cyber)
  • ooni : Tools to Access the Open Observatory of Network Interference (‘OONI’) (r, rstats, ooni, censorship, internet-measurements, r-cyber)
  • opengraph : Tools to Mine ‘Open Graph’-like Tags From ‘HTML’ Content (r, rstats, opengraph, r-cyber)
  • osqueryr : ⁇ ‘osquery’ ‘DBI’ and ‘dbplyr’ Interface for R (r, rstats, osquery, dplyr, tidyverse, dbi, r-cyber)
  • passivetotal : Useful tools for working with the PassiveTotal API in R (r, rstats, passivetotal, passive-dns-data, r-cyber)
  • passwordrandom : Access the PasswordRandom.com API in R (r, rstats, r-cyber)
  • pastebin : Tools to work with the pastebin API in R (rstats, r, pastebin, pastebin-client, r-cyber)
  • pdfbox : ◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper) (r, rstats, pdfbox, pdf-document, pdf-files, pdfbox-wrapper, r-cyber)
  • pdfboxjars : Java ‘.jar’ Files for ‘pdfbox’ (r, rstats, java, r-cyber)
  • porc : Tools to Work with ‘Snort’ Rules, Logs and Data (r, rstats, snort, snort-rules, cybersecurity, cyber, r-cyber)
  • publicwww : Query the ‘PublicWWW’ Source Code Search Engine in R (r, rstats, publicwww, r-cyber)
  • radb : Tools to Query the ‘Merit’ ‘RADb’ Network Route Server (r, rstats, merit, radb, r-cyber)
  • rappalyzer : :: WIP :: R port of Wappalyzer (r, rstats, wappalyzer, r-cyber)
  • reapr : →ℹ️ Reap Information from Websites (r, rstats, web-scraping, r-cyber, rvest, html, xpath)
  • rgeocodio : Tools to Work with the https://geocod.io/ API (r, rstats, geocodio, geocoding, reverse-geocoding, reverse-geocode, r-cyber)
  • ripestat : Query and Retrieve Data from the ‘RIPEstat’ ‘API’ (r, rstats, ripe, ripestat, r-cyber)
  • robotify : Browser extension to check for and preview a site’s robots.txt in a new tab (if it exists) (browser-extension, robots-txt, r-cyber)
  • rpdns : R port of CIRCL.LU’s PyPDNS Python module https://github.com/CIRCL/PyPDNS (r, rstats, passive-dns, circl-lu, dns, r-cyber)
  • rrecog : Pattern Recognition for Hosts, Services and Content (r, rstats, recognizer, rapid7, r-cyber)
  • scamtracker : R package interface to the BBB ScamTracker : https://www.bbb.org/scamtracker/us (r, rstats, r-cyber, scamtracker)
  • securitytrails : Tools to Query the ‘SecurityTrails’ ‘API’ (r, rstats, securitytrails, cybersecurity, ipv4, ipv6, threat-intelligence, domain-name, whois-lookup, r-cyber)
  • securitytxt : Identify and Parse Web Security Policies Files in R (r, rstats, securitytxt, r-cyber)
  • sergeant : Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (drill, parquet-files, sql, dplyr, r, rstats, apache-drill, r-cyber)
  • shodan : R package to work with the Shodan API (r, rstats, shodan, shodan-api, r-cyber)
  • simplemagic : Lightweight File ‘MIME’ Type Detection Based On Contents or Extension (r, rstats, magic, mime, mime-types, file-types, r-cyber)
  • speedtest : Measure upload/download speed/bandwidth for your network with R (r, rstats, bandwidth-test, bandwidth, bandwidth-monitor, r-cyber)
  • spiderbar : Lightweight R wrapper around rep-cpp for robots.txt (Robots Exclusion Protocol) parsing and path testing in R (r, rstats, robots-exclusion-protocol, robots-txt, r-cyber)
  • splashr : Tools to Work with the ‘Splash’ JavaScript Rendering Service in R (r, rstats, web-scraping, splash, selenium, phantomjs, har, r-cyber)
  • ssllabs : Tools to Work with the SSL Labs API in R (r, rstats, r-cyber, ssllabs, ssl-labs)
  • threatcrowd : R tools to work with the ThreatCrowd API (r, rstats, threatcrowd, r-cyber)
  • tidyweb : Easily Install and Load Modern Web-Scraping Packages (r, rstats, r-cyber)
  • tlsh : #️⃣ Local Sensitivity Hashing Using the ‘Trend Micro’ ‘TLSH’ Implementation (based on https://github.com/trendmicro/… (r, rstats, tlsh, lsh, lsh-implmentation, r-cyber)
  • tlsobs : Tools to Work with the ‘Mozilla’ ‘TLS’ Observatory ‘API’ in R (r, rstats, mozilla-observatory, r-cyber)
  • udpprobe : Send User Datagram Protocol (‘UDP’) Probes and Receive Responses in R (r, rstats, udp-client, udp, ubiquiti, r-cyber)
  • ulid : ⚙️ Universally Unique Lexicographically Sortable Identifiers in R (r, rstats, ulid, uuid, r-cyber)
  • urldiversity : Quantify ‘URL’ Diversity and Apply Popular Biodiversity Indices to a ‘URL’ Collection (r, rstats, species-diversity, url, urls, uri, r-cyber)
  • urlscan : Analyze Websites and Resources They Request (r, rstats, urlscan, analyze-websites, scanning, urlscan-io, r-cyber)
  • vershist : Collect Version Histories For Vendor Products (rstats, r, semantic-versions, version-check, version-checker, r-cyber, release-history)
  • wand : R interface to libmagic – returns file mime type (r, rstats, magic-bytes, file, r-cyber)
  • warc : Tools to Work with the Web Archive Ecosystem in R (r, rstats, warc, r-cyber, warc-ecosystem, warc-files)
  • wayback : ⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs (r, rstats, wayback-machine, internet-archive, web-scraping, r-cyber, memento, wayback)
  • webhose : Tools to Work with the ‘webhose.io’ ‘API’ in R (r, rstats, r-cyber, webhose)
  • whoisxmlapi : ❔ R package to interface with the WhoisXMLAPI.com service (r, rstats, whoisxmlapi, r-cyber)
  • xattrs : Work With Filesystem Object Extended Attributes — https://hrbrmstr.github.io/xattrs/index.html (r, rstats, xattr, xattr-support, r-cyber)
  • xforce : Tools to Gather Threat Intelligence from ‘IBM’ ‘X-Force’ (r, rstats, ibm-xforce, threat-intel, threat-intelligence, r-cyber)
  • zdnsr : Perform Bulk ‘DNS’ Queries Using ‘zdns’ (r, rstats, zdns, bulk-dns, r-cyber)