Spelunking XHRs (XMLHttpRequests) with splashr

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

library(splashr)
library(DiagrammeR)
library(htmlwidgets) # for saveWidget()

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

With the new Docker image and the add_tempdir=TRUE parameter, splashr can render any local HTML file, not just htmlwidgets.

The other new bits are helpers to identify content types in the HAR entries. Along with get_content_type():

library(tidyverse)

# rud_har is assumed to be a HAR object captured earlier (e.g. with render_har())
map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA

there are many is_...() functions for logical tests.
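
As a rough sketch of what these helpers make possible, the content types can be tallied to see what a page is made of. The vector below is a hand-copied subset of the output above (an assumption, so the snippet runs without a live Splash instance); with a real HAR object it would come from the map_chr() call.

```r
# Hand-copied subset of the content types shown above; with a real HAR
# object this vector would come from map_chr(rud_har$log$entries, get_content_type).
content_types <- c(
  "text/html", "text/css", "text/css", "application/javascript",
  "image/png", "font/ttf", "image/svg+xml", NA, "image/gif", NA
)

# Count how often each type appears, keeping the NA entries visible.
type_counts <- table(content_types, useNA = "ifany")

# Logical test in the spirit of the is_...() helpers: which entries are CSS?
# (NA content types are treated as "not CSS".)
is_css_entry <- !is.na(content_types) & content_types == "text/css"

print(sort(type_counts, decreasing = TRUE))
print(which(is_css_entry))
```

The NA entries are worth keeping in the tally, since they mark responses with no recorded content type.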

One of the more interesting is_...() functions is is_xhr(). Sites with dynamic content usually load it via an XMLHttpRequest (XHR for short). Modern web apps usually return JSON in those requests and, for questions like this one on StackOverflow, it’s usually better to grab the JSON and use it as data than to scrape the table built by JavaScript calls.
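
For intuition, one common heuristic for spotting an XHR is the X-Requested-With: XMLHttpRequest request header that many JavaScript libraries set. The sketch below applies that heuristic to hand-built, HAR-shaped lists. The looks_like_xhr() helper and the mock entries are hypothetical, and this is not necessarily how is_xhr() is implemented.

```r
# Hypothetical helper: TRUE when a HAR-style entry carries the
# "X-Requested-With: XMLHttpRequest" request header (case-insensitive).
looks_like_xhr <- function(entry) {
  headers <- entry$request$headers
  if (length(headers) == 0) return(FALSE)
  header_names  <- tolower(vapply(headers, function(h) h$name, character(1)))
  header_values <- tolower(vapply(headers, function(h) h$value, character(1)))
  any(header_names == "x-requested-with" & header_values == "xmlhttprequest")
}

# Mock entries (made up for illustration) shaped like HAR log entries.
entries <- list(
  list(request = list(url = "http://example.com/",
                      headers = list(list(name = "Accept", value = "text/html")))),
  list(request = list(url = "http://example.com/api/data",
                      headers = list(list(name = "Accept", value = "application/json"),
                                     list(name = "X-Requested-With", value = "XMLHttpRequest")))),
  list(request = list(url = "http://example.com/style.css", headers = list()))
)

# Indices of the entries that look like XHRs.
xhr_idx <- which(vapply(entries, looks_like_xhr, logical(1)))
print(xhr_idx)
```

Requests that omit the header would be missed by this heuristic, so treat it as a first pass rather than a guarantee.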

Now, it’s not too hard to open Developer Tools and find those XHR requests, but we can also use splashr to programmatically find them. We have to do a bit more work and use the new execute_lua() function since we need to give the page time to load up all the data. (I’ll eventually write a mini-R-DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks). Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{"practice_name":"Pediatrics, Inc.", "address_1":"577 Main Street", "city":&q"| __truncated__ "{"practice_name":"Crown Colony Pediatrics", "address_1":"500 Congress Street, Suite 1F""| __truncated__ "{"practice_name":"Pediatricians Inc.", "address_1":"955 Main Street", "city":"| __truncated__ "{"practice_name":"Lexington Pediatrics", "address_1":"19 Muzzey Street, Suite 105", &qu"| __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()
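
The pipeline above works because the response body stored in the HAR entry is base64-encoded text. A self-contained round trip on a made-up payload shows each step. The field values below are invented, the mock entry is hypothetical, and jsonlite's own base64 helpers stand in for openssl::base64_decode() so that only one package is needed.

```r
library(jsonlite)

# Made-up JSON payload standing in for an XHR response body.
payload <- '{"TotalPages":2,"TotalRecords":3,"Records":[{"FirstName":"A","Suffix":"MD"},{"FirstName":"B","Suffix":"RN"}]}'

# Encode it the way it would appear in a HAR entry's response$content$text.
encoded <- base64_enc(charToRaw(payload))

# Mock entry (hypothetical) mirroring the path used above.
entry <- list(response = list(content = list(text = encoded)))

# Decode: base64 text -> raw bytes -> character string -> R list.
decoded <- fromJSON(rawToChar(base64_dec(entry$response$content$text)))

str(decoded)
```

As in the real output, fromJSON() simplifies the array of records into a data frame, which is what makes the XHR payload so much more convenient than the rendered table.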

UPDATE: So, I wrote a mini-DSL for this:

splash_local %>%
  splash_response_body(TRUE) %>% 
  splash_go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D") %>% 
  splash_wait(2) %>% 
  splash_har() -> res

which should make it easier to perform basic “go-wait-retrieve” operations.

It’s unlikely we want to rely on a running Splash instance for our production work, so I’ll be making a helper function to turn HAR XHR requests into httr function calls, similar to the way curlconverter works.
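
As a rough idea of what such a helper might look like, the sketch below builds (but does not run) an httr::GET() call from a HAR-shaped list. The har_entry_to_call() name, the mock entry, and its URL are hypothetical. This is not the eventual splashr helper, and it only handles GET requests with simple headers.

```r
# Hypothetical helper: turn a HAR-style entry into an unevaluated
# httr::GET() call. Nothing is sent over the network here.
har_entry_to_call <- function(entry) {
  headers <- entry$request$headers
  header_values <- vapply(headers, function(h) h$value, character(1))
  names(header_values) <- vapply(headers, function(h) h$name, character(1))

  # Build httr::add_headers(`Name` = "value", ...) as a call object.
  add_headers_call <- as.call(c(list(quote(httr::add_headers)), as.list(header_values)))

  as.call(list(quote(httr::GET), url = entry$request$url, add_headers_call))
}

# Mock XHR entry (made up for illustration).
xhr_entry <- list(request = list(
  method  = "GET",
  url     = "http://example.com/api/data?page=1",
  headers = list(
    list(name = "Accept", value = "application/json"),
    list(name = "X-Requested-With", value = "XMLHttpRequest")
  )
))

xhr_call <- har_entry_to_call(xhr_entry)
print(xhr_call)
# eval(xhr_call) would perform the request, assuming httr is installed.
```

Returning a call object rather than executing it keeps the sketch inspectable: the generated code can be printed, saved to a script, or evaluated later without Splash running.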

