Quick Hit: Scraping javascript-"enabled" Sites with {htmlunit}

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TL;DR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to lug around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while, even on freshly minted systems, so that should not be a show stopper for folks (though a painless Java+R install is still far from guaranteed).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “elements” panel. You’ll see that the content is right there in javascript in the page source, but the source has no quote <div> elements because they’re generated dynamically after the page loads.

Figure: view-source view of the toscrape javascript example site

Figure: devtools elements view of the toscrape javascript example site

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it’ll definitely come back to bite you some day.
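To make the view-source vs. elements difference concrete, here is a minimal offline sketch. The `static_src` string is a made-up stand-in for what a server might send (it is not the real toscrape markup): the data sits in a `<script>` tag and the quote `<div>` elements only exist once a browser runs the script. A static parser never runs it, so it sees the script but no quotes.

```r
library(rvest)

# Hypothetical page source: the data is in a <script>, and the quote <div>s
# would only be created by a browser executing that script
static_src <- '<html><body>
<div class="container"></div>
<script>
  var data = [{"text": "Quote one"}, {"text": "Quote two"}];
  for (var i in data) {
    var el = document.createElement("div");
    el.className = "quote";
    el.textContent = data[i].text;
    document.body.appendChild(el);
  }
</script>
</body></html>'

pg_static <- read_html(static_src)

# What "view source" shows: no quote <div>s at all
n_quote_divs <- length(html_nodes(pg_static, "div.quote"))

# ...but the <script> carrying the data is present
has_data_script <- length(
  html_nodes(pg_static, xpath = ".//script[contains(., 'var data')]")
) > 0

n_quote_divs
has_data_script
```

Checking for a data-bearing `<script>` like this is a quick way to tell whether a page needs a javascript-capable tool before reaching for one.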

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […
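The `tags` column above is a list column, which is not always convenient downstream. The sketch below runs offline against a small, made-up fragment (`rendered`, shaped loosely like the rendered quote markup, which is an assumption) and shows one way to build the same kind of tibble and then collapse each quote's tags into a single string.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical rendered markup, standing in for what hu_read_html() returns
rendered <- read_html('
<div class="quote">
  <span class="text">Quote one</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags">Tags: <a class="tag">change</a> <a class="tag">world</a></div>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <span>by <small class="author">Author B</small></span>
  <div class="tags">Tags: <a class="tag">humor</a></div>
</div>')

quotes <- html_nodes(rendered, "div.quote")

quotes_df <- tibble(
  # html_node() (singular) returns exactly one result per quote, so rows stay aligned
  quote  = quotes %>% html_node("span.text") %>% html_text(trim = TRUE),
  author = quotes %>% html_node("small.author") %>% html_text(trim = TRUE),
  # one character vector of tags per quote
  tags   = map(quotes, ~ html_nodes(.x, "a.tag") %>% html_text(trim = TRUE))
)

# Collapse each quote's tags into a single comma-separated string
quotes_df$tags_flat <- map_chr(quotes_df$tags, paste, collapse = ", ")

quotes_df
```

Working per quote with `html_node()` rather than selecting every author on the page at once is a small design choice: if one quote were missing an author, that row would get `NA` instead of the columns silently falling out of alignment.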

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).
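The strip-the-loop-then-evaluate idea behind the {V8} approach can also be tried offline. This is only a sketch under stated assumptions: `script_txt` is a made-up stand-in for the script tag's text, and base R's `sub()` is swapped in for the {stringi} call used earlier. The loop has to go because V8 has no `document` object, so only the data literal can be evaluated.

```r
library(V8)

# Hypothetical stand-in for the text of the page's <script> tag
script_txt <- '
var data = [
  {"tags": ["change", "world"], "author": {"name": "Author A", "slug": "Author-A"}, "text": "Quote one"},
  {"tags": ["humor"], "author": {"name": "Author B", "slug": "Author-B"}, "text": "Quote two"}
];
for (var i in data) {
  var el = document.createElement("div");
  el.textContent = data[i].text;
  document.body.appendChild(el);
}
'

# Delete everything from the DOM-building loop onward, keeping only `var data = ...`
js_data_only <- sub("for \\(var[\\s\\S]*", "", script_txt, perl = TRUE)

ctx <- v8()
invisible(ctx$eval(js_data_only)) # define `data` inside the V8 context

quotes_raw <- ctx$get("data") # comes back as a data frame with a nested `author` column

# Spread the nested author data frame into plain columns (author.name, author.slug)
quotes_flat <- jsonlite::flatten(quotes_raw)

names(quotes_flat)
```

This route depends on the data being embedded in the page and on the regex matching the script's layout, which is why it is more fragile than letting {htmlunit} execute the javascript.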

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} at:

3 Comments

- Frank
- Posted 2019-04-29 at 09:11
Impossible to install packages :
Warning in install.packages :
unable to access index for repository https://cinc.rud.is/bin/windows/contrib/3.6:
cannot open URL ‘https://cinc.rud.is/bin/windows/contrib/3.6/PACKAGES’

or : cannot open URL ‘https://api.github.com/repos/hrbrmstr/htmlunitjars/tarball/master’

or : cannot open URL ‘https://api.github.com/repos/hrbrmstr/htmlunit/tarball/master’
- - hrbrmstr
  - Posted 2019-04-29 at 09:14
  try adding type="source" to the install.packages() line.
  - - FRANK
    - Posted 2019-05-01 at 04:01
    Thank you very much,
    
    It was a problem with Java installation that we can see on the post :
    
    https://rud.is/b/2019/04/27/quick-hit-scraping-javascript-enabled-sites-with-htmlunit/
    
    Now all works very well
