

{"id":12162,"date":"2019-04-27T13:20:50","date_gmt":"2019-04-27T18:20:50","guid":{"rendered":"https:\/\/rud.is\/b\/?p=12162"},"modified":"2019-04-29T09:15:24","modified_gmt":"2019-04-29T14:15:24","slug":"quick-hit-scraping-javascript-enabled-sites-with-htmlunit","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/","title":{"rendered":"Quick Hit: Scraping javascript-&#8220;enabled&#8221; Sites with {htmlunit}"},"content":{"rendered":"<p>I&#8217;ve mentioned <a href=\"https:\/\/rud.is\/b\/2019\/02\/28\/htmlunitjars-updated-to-2-34-0\/\"><code>{htmlunit}<\/code><\/a> in passing before, but did not put any code in the blog post. Since I just updated <code>{htmlunitjars}<\/code> to the latest and greatest version, now might be a good time to do a quick demo of it.<\/p>\n<p>The <code>{htmlunit}<\/code>\/<code>{htmlunitjars}<\/code> packages make the functionality of the <a href=\"http:\/\/htmlunit.sourceforge.net\/\">HtmlUnit Java library<\/a> available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you&#8217;d have to use Selenium\/<code>{RSelenium}<\/code>, Splash\/<code>{splashr}<\/code> or Chrome\/<code>{decapitated}<\/code> to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to lug around with you, especially if all you need is a quick way of getting dynamic content. 
While <code>{htmlunit}<\/code> does have an <code>{rJava}<\/code> dependency, I haven&#8217;t had any issues getting Java working with R on Windows, Ubuntu\/Debian or macOS in a very long while&#8212;even on freshly minted systems&#8212;so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).<\/p>\n<p>To demonstrate the capabilities of <code>{htmlunit}<\/code> we&#8217;ll work with a site that&#8217;s dedicated to practicing web scraping&#8212;<code>toscrape.com<\/code>&#8212;and, specifically, the <a href=\"http:\/\/quotes.toscrape.com\/js\/\">javascript generated sandbox site<\/a>. It looks like this:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"12163\" data-permalink=\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/quotes-to-scrape-01\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=2220%2C1808&amp;ssl=1\" data-orig-size=\"2220,1808\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"quotes-to-scrape-01\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=300%2C244&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=510%2C415&amp;ssl=1\" 
src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?resize=510%2C415&#038;ssl=1\" alt=\"\" width=\"510\" height=\"415\" class=\"aligncenter size-full wp-image-12163\" \/><\/a><\/p>\n<p>Now bring up both the &#8220;view source&#8221; version of the page in your browser and the developer tools &#8220;elements&#8221; panel. You&#8217;ll see that the quote data is right there in the page&#8217;s javascript, but the source has no quote <code>&lt;div&gt;<\/code> elements because they&#8217;re generated dynamically after the page loads.<\/p>\n\n\t\t<style type=\"text\/css\">\n\t\t\t#gallery-1 {\n\t\t\t\tmargin: auto;\n\t\t\t}\n\t\t\t#gallery-1 .gallery-item {\n\t\t\t\tfloat: left;\n\t\t\t\tmargin-top: 10px;\n\t\t\t\ttext-align: center;\n\t\t\t\twidth: 33%;\n\t\t\t}\n\t\t\t#gallery-1 img {\n\t\t\t\tborder: 2px solid #cfcfcf;\n\t\t\t}\n\t\t\t#gallery-1 .gallery-caption {\n\t\t\t\tmargin-left: 0;\n\t\t\t}\n\t\t\t\/* see gallery_shortcode() in wp-includes\/media.php *\/\n\t\t<\/style>\n\t\t<div data-carousel-extra='{&quot;blog_id&quot;:1,&quot;permalink&quot;:&quot;https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/&quot;}' id='gallery-1' class='gallery galleryid-12162 gallery-columns-3 gallery-size-thumbnail'><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a href='https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/quotes-to-scrape-03\/'><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"122\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=150%2C122&amp;ssl=1\" class=\"attachment-thumbnail size-thumbnail\" alt=\"\" aria-describedby=\"gallery-1-12165\" data-attachment-id=\"12165\" data-permalink=\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/quotes-to-scrape-03\/\" 
data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=1110%2C904&amp;ssl=1\" data-orig-size=\"1110,904\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"quotes-to-scrape-03\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;view source view of toscrape javascript example site&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=300%2C244&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=510%2C415&amp;ssl=1\" \/><\/a>\n\t\t\t<\/dt>\n\t\t\t\t<dd class='wp-caption-text gallery-caption' id='gallery-1-12165'>\n\t\t\t\tview source view of toscrape javascript example site\n\t\t\t\t<\/dd><\/dl><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a href='https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/quotes-to-scrape-02\/'><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"104\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-02.png?fit=150%2C104&amp;ssl=1\" class=\"attachment-thumbnail size-thumbnail\" alt=\"View Source &amp; DevTools Elements Views\" aria-describedby=\"gallery-1-12164\" data-attachment-id=\"12164\" data-permalink=\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/quotes-to-scrape-02\/\" 
data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-02.png?fit=1192%2C830&amp;ssl=1\" data-orig-size=\"1192,830\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"View Source &amp;#038; DevTools Elements Views\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;devtools elements view of toscrape javascript example site&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-02.png?fit=300%2C209&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-02.png?fit=510%2C355&amp;ssl=1\" \/><\/a>\n\t\t\t<\/dt>\n\t\t\t\t<dd class='wp-caption-text gallery-caption' id='gallery-1-12164'>\n\t\t\t\tdevtools elements view of toscrape javascript example site\n\t\t\t\t<\/dd><\/dl>\n\t\t\t<br style='clear: both' \/>\n\t\t<\/div>\n\n<p>The critical difference between those two views is one reason I consider the use of tools like &#8220;Selector Gadget&#8221; to be more harmful than helpful. You&#8217;re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it&#8217;ll definitely come back to bite you some day.<\/p>\n<p>Let&#8217;s try to grab that first page of quotes. 
Note that to run all the code you&#8217;ll need to install both <code>{htmlunitjars}<\/code> and <code>{htmlunit}<\/code> which can be done via: <code>install.packages(c(\"htmlunitjars\", \"htmlunit\"), repos = \"https:\/\/cinc.rud.is\", type=\"source\")<\/code>.<\/p>\n<p>First, we&#8217;ll try just plain ol&#8217; <code>{rvest}<\/code>:<\/p>\n<pre><code class=\"language-r\">library(rvest)\n\npg &lt;- read_html(\"http:\/\/quotes.toscrape.com\/js\/\")\n\nhtml_nodes(pg, \"div.quote\")\n## {xml_nodeset (0)}\n<\/code><\/pre>\n<p>Getting no content back is to be expected since no javascript is executed. Now, we&#8217;ll use <code>{htmlunit}<\/code> to see if we can get to the actual content:<\/p>\n<pre><code class=\"language-r\">library(htmlunit)\nlibrary(rvest)\nlibrary(purrr)\nlibrary(tibble)\n\njs_pg &lt;- hu_read_html(\"http:\/\/quotes.toscrape.com\/js\/\")\n\nhtml_nodes(js_pg, \"div.quote\")\n## {xml_nodeset (10)}\n##  [1] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cThe world as we h ...\n##  [2] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cIt is our choices ...\n##  [3] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cThere are only tw ...\n##  [4] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cThe person, be it ...\n##  [5] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cImperfection is b ...\n##  [6] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cTry not to become ...\n##  [7] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cIt is better to b ...\n##  [8] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cI have not failed ...\n##  [9] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cA 
woman is like a ...\n## [10] &lt;div class=\"quote\"&gt;\\r\\n        &lt;span class=\"text\"&gt;\\r\\n          \u201cA day without sun ...\n<\/code><\/pre>\n<p>I loaded up <code>{purrr}<\/code> and <code>{tibble}<\/code> for a reason so let&#8217;s use them to make a nice data frame from the content:<\/p>\n<pre><code class=\"language-r\">tibble(\n  quote = html_nodes(js_pg, \"div.quote &gt; span.text\") %&gt;% html_text(trim=TRUE),\n  author = html_nodes(js_pg, \"div.quote &gt; span &gt; small.author\") %&gt;% html_text(trim=TRUE),\n  tags = html_nodes(js_pg, \"div.quote\") %&gt;% \n    map(~html_nodes(.x, \"div.tags &gt; a.tag\") %&gt;% html_text(trim=TRUE))\n)\n## # A tibble: 10 x 3\n##    quote                                                            author         tags   \n##    &lt;chr&gt;                                                            &lt;chr&gt;          &lt;list&gt; \n##  1 \u201cThe world as we have created it is a process of our thinking. \u2026 Albert Einste\u2026 &lt;chr [\u2026\n##  2 \u201cIt is our choices, Harry, that show what we truly are, far mor\u2026 J.K. Rowling   &lt;chr [\u2026\n##  3 \u201cThere are only two ways to live your life. One is as though no\u2026 Albert Einste\u2026 &lt;chr [\u2026\n##  4 \u201cThe person, be it gentleman or lady, who has not pleasure in a\u2026 Jane Austen    &lt;chr [\u2026\n##  5 \u201cImperfection is beauty, madness is genius and it's better to b\u2026 Marilyn Monroe &lt;chr [\u2026\n##  6 \u201cTry not to become a man of success. Rather become a man of val\u2026 Albert Einste\u2026 &lt;chr [\u2026\n##  7 \u201cIt is better to be hated for what you are than to be loved for\u2026 Andr\u00e9 Gide     &lt;chr [\u2026\n##  8 \u201cI have not failed. I've just found 10,000 ways that won't work\u2026 Thomas A. 
Edi\u2026 &lt;chr [\u2026\n##  9 \u201cA woman is like a tea bag; you never know how strong it is unt\u2026 Eleanor Roose\u2026 &lt;chr [\u2026\n## 10 \u201cA day without sunshine is like, you know, night.\u201d               Steve Martin   &lt;chr [\u2026\n<\/code><\/pre>\n<p>To be fair, we didn&#8217;t <em>really<\/em> need <code>{htmlunit}<\/code> for this site. The javascript data comes along with the page and it&#8217;s in a decent form so we could also use <code>{V8}<\/code>:<\/p>\n<pre><code class=\"language-r\">library(V8)\nlibrary(stringi)\n\nctx &lt;- v8()\n\nhtml_node(pg, xpath=\".\/\/script[contains(., 'data')]\") %&gt;%  # target the &lt;script&gt; tag with the data\n  html_text() %&gt;% # get the text of the tag body\n  stri_replace_all_regex(\"for \\\\(var[[:print:][:space:]]*\", \"\", multiline=TRUE) %&gt;% # delete everything after the `var data=` content\n  ctx$eval() # pass it to V8\n\nctx$get(\"data\") %&gt;% # get the data from V8\n  as_tibble() %&gt;%  # tibbles rock\n  janitor::clean_names() # the names do not so make them better\n## # A tibble: 10 x 3\n##    tags    author$name   $goodreads_link        $slug     text                            \n##    &lt;list&gt;  &lt;chr&gt;         &lt;chr&gt;                  &lt;chr&gt;     &lt;chr&gt;                           \n##  1 &lt;chr [\u2026 Albert Einst\u2026 \/author\/show\/9810.Alb\u2026 Albert-E\u2026 \u201cThe world as we have created i\u2026\n##  2 &lt;chr [\u2026 J.K. 
Rowling  \/author\/show\/1077326.\u2026 J-K-Rowl\u2026 \u201cIt is our choices, Harry, that\u2026\n##  3 &lt;chr [\u2026 Albert Einst\u2026 \/author\/show\/9810.Alb\u2026 Albert-E\u2026 \u201cThere are only two ways to liv\u2026\n##  4 &lt;chr [\u2026 Jane Austen   \/author\/show\/1265.Jan\u2026 Jane-Aus\u2026 \u201cThe person, be it gentleman or\u2026\n##  5 &lt;chr [\u2026 Marilyn Monr\u2026 \/author\/show\/82952.Ma\u2026 Marilyn-\u2026 \u201cImperfection is beauty, madnes\u2026\n##  6 &lt;chr [\u2026 Albert Einst\u2026 \/author\/show\/9810.Alb\u2026 Albert-E\u2026 \u201cTry not to become a man of suc\u2026\n##  7 &lt;chr [\u2026 Andr\u00e9 Gide    \/author\/show\/7617.And\u2026 Andre-Gi\u2026 \u201cIt is better to be hated for w\u2026\n##  8 &lt;chr [\u2026 Thomas A. Ed\u2026 \/author\/show\/3091287.\u2026 Thomas-A\u2026 \u201cI have not failed. I've just f\u2026\n##  9 &lt;chr [\u2026 Eleanor Roos\u2026 \/author\/show\/44566.El\u2026 Eleanor-\u2026 \u201cA woman is like a tea bag; you\u2026\n## 10 &lt;chr [\u2026 Steve Martin  \/author\/show\/7103.Ste\u2026 Steve-Ma\u2026 \u201cA day without sunshine is like\u2026\n<\/code><\/pre>\n<p>But, the <code>{htmlunit}<\/code> code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).<\/p>\n<h3>FIN<\/h3>\n<p>While <code>{htmlunit}<\/code> is great, it won&#8217;t work on super complex sites as it&#8217;s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content.  
The package also contains a mini-DSL for more complex page scraping tasks.<\/p>\n<p>You can find both <code>{htmlunit}<\/code> and <code>{htmlunitjars}<\/code> at:<\/p>\n<ul>\n<li>SourceHut: <a href=\"https:\/\/git.sr.ht\/~hrbrmstr\/htmlunitjars\">https:\/\/git.sr.ht\/~hrbrmstr\/htmlunitjars<\/a> \/ <a href=\"https:\/\/git.sr.ht\/~hrbrmstr\/htmlunit\">https:\/\/git.sr.ht\/~hrbrmstr\/htmlunit<\/a><\/li>\n<li>GitLab: <a href=\"https:\/\/gitlab.com\/hrbrmstr\/htmlunitjars\">https:\/\/gitlab.com\/hrbrmstr\/htmlunitjars<\/a> \/ <a href=\"https:\/\/gitlab.com\/hrbrmstr\/htmlunit\">https:\/\/gitlab.com\/hrbrmstr\/htmlunit<\/a><\/li>\n<li>GitUgh: <a href=\"https:\/\/github.com\/hrbrmstr\/htmlunitjars\">https:\/\/github.com\/hrbrmstr\/htmlunitjars<\/a> \/ <a href=\"https:\/\/github.com\/hrbrmstr\/htmlunit\">https:\/\/github.com\/hrbrmstr\/htmlunit<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it. The {htmlunit}\/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. 
The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91,725],"tags":[],"class_list":["post-12162","post","type-post","status-publish","format-standard","hentry","category-r","category-web-scraping"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Quick Hit: Scraping javascript-&quot;enabled&quot; Sites with {htmlunit} - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Quick Hit: Scraping javascript-&quot;enabled&quot; Sites with {htmlunit} - rud.is\" \/>\n<meta property=\"og:description\" content=\"I&#8217;ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it. The {htmlunit}\/{htmunitjars} packages make the functionality of the HtmlUnit Java libray available to R. 
The [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2019-04-27T18:20:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-04-29T14:15:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=150%2C122&amp;ssl=1\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Quick Hit: Scraping javascript-&#8220;enabled&#8221; Sites with 
{htmlunit}\",\"datePublished\":\"2019-04-27T18:20:50+00:00\",\"dateModified\":\"2019-04-29T14:15:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\"},\"wordCount\":573,\"commentCount\":3,\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"image\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png\",\"articleSection\":[\"R\",\"web scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\",\"url\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\",\"name\":\"Quick Hit: Scraping javascript-\\\"enabled\\\" Sites with {htmlunit} - 
rud.is\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png\",\"datePublished\":\"2019-04-27T18:20:50+00:00\",\"dateModified\":\"2019-04-29T14:15:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage\",\"url\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=2220%2C1808&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=2220%2C1808&ssl=1\",\"width\":2220,\"height\":1808},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/rud.is\/b\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Quick Hit: Scraping javascript-&#8220;enabled&#8221; Sites with {htmlunit}\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/rud.is\/b\/#website\",\"url\":\"https:\/\/rud.is\/b\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/rud.is\/b\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\/\/rud.is\"],\"url\":\"https:\/\/rud.is\/b\/author\/hrbrmstr\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Quick Hit: Scraping javascript-\"enabled\" Sites with {htmlunit} - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/","og_locale":"en_US","og_type":"article","og_title":"Quick Hit: Scraping javascript-\"enabled\" Sites with {htmlunit} - rud.is","og_description":"I&#8217;ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it. The {htmlunit}\/{htmunitjars} packages make the functionality of the HtmlUnit Java libray available to R. The [&hellip;]","og_url":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/","og_site_name":"rud.is","article_published_time":"2019-04-27T18:20:50+00:00","article_modified_time":"2019-04-29T14:15:24+00:00","og_image":[{"url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-03.png?fit=150%2C122&amp;ssl=1","type":"","width":"","height":""}],"author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Quick Hit: Scraping javascript-&#8220;enabled&#8221; Sites with {htmlunit}","datePublished":"2019-04-27T18:20:50+00:00","dateModified":"2019-04-29T14:15:24+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/"},"wordCount":573,"commentCount":3,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"image":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage"},"thumbnailUrl":"https:\/\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png","articleSection":["R","web scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/","url":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/","name":"Quick Hit: Scraping javascript-\"enabled\" Sites with {htmlunit} - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"primaryImageOfPage":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage"},"image":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage"},"thumbnailUrl":"https:\/\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png","datePublished":"2019-04-27T18:20:50+00:00","dateModified":"2019-04-29T14:15:24+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#primaryimage","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=2220%2C1808&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/04\/quotes-to-scrape-01.png?fit=2220%2C1808&ssl=1","width":2220,"height":1808},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2019\/04\/27\/quick-hit-scraping-javascript-enabled-sites-with-htmlunit\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Quick Hit: Scraping javascript-&#8220;enabled&#8221; Sites with {htmlunit}"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. 
#rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p23idr-3aa","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":12013,"url":"https:\/\/rud.is\/b\/2019\/02\/28\/htmlunitjars-updated-to-2-34-0\/","url_meta":{"origin":12162,"position":0},"title":"htmlunitjars Updated to 2.34.0","author":"hrbrmstr","date":"2019-02-28","format":false,"excerpt":"The in-dev htmlunit package for javascript-\"enabled\" web-scraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-\"enabled\" sites. The Chrome emulation is now\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6115,"url":"https:\/\/rud.is\/b\/2017\/07\/25\/r%e2%81%b6-general-attys-distributions\/","url_meta":{"origin":12162,"position":1},"title":"R\u2076 \u2014 General (Attys) Distributions","author":"hrbrmstr","date":"2017-07-25","format":false,"excerpt":"Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). 
A\u2026","rel":"","context":"In &quot;Data Visualization&quot;","block_context":{"text":"Data Visualization","link":"https:\/\/rud.is\/b\/category\/data-visualization\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":5004,"url":"https:\/\/rud.is\/b\/2017\/02\/09\/diving-into-dynamic-website-content-with-splashr\/","url_meta":{"origin":12162,"position":2},"title":"Diving Into Dynamic Website Content with splashr","author":"hrbrmstr","date":"2017-02-09","format":false,"excerpt":"If you do enough web scraping, you'll eventually hit a wall that the trusty httr verbs (that sit beneath rvest) cannot really overcome: dynamically created content (via javascript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2717,"url":"https:\/\/rud.is\/b\/2013\/09\/25\/scraping-content-from-google-groups\/","url_meta":{"origin":12162,"position":3},"title":"Scraping Content From Google Groups","author":"hrbrmstr","date":"2013-09-25","format":false,"excerpt":"I was helping a friend out who wanted to build a word cloud from the text in Google Groups posts. 
If you've made any efforts to try to get content out of Google Groups you know that the only way to do so is to ensure you subscribe to the\u2026","rel":"","context":"In &quot;Development&quot;","block_context":{"text":"Development","link":"https:\/\/rud.is\/b\/category\/development\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2013\/09\/Untitled_and_input_text_not_updated_-_Google_Groups.png?fit=535%2C356&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2013\/09\/Untitled_and_input_text_not_updated_-_Google_Groups.png?fit=535%2C356&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2013\/09\/Untitled_and_input_text_not_updated_-_Google_Groups.png?fit=535%2C356&ssl=1&resize=525%2C300 1.5x"},"classes":[]},{"id":11765,"url":"https:\/\/rud.is\/b\/2019\/01\/14\/splashr-0-6-0-now-uses-the-cran-nascent-stevedore-package-for-docker-orchestration\/","url_meta":{"origin":12162,"position":4},"title":"splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration","author":"hrbrmstr","date":"2019-01-14","format":false,"excerpt":"The splashr package [srht|GL|GH] \u2014 an alternative to Selenium for javascript-enabled\/browser-emulated web scraping \u2014 is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days). 
The major change from version 0.5.x (which never made it to CRAN) is a swap out of\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11383,"url":"https:\/\/rud.is\/b\/2018\/08\/13\/in-brief-splashr-update-high-performance-scraping-with-splashr-furrr-teamhg-memexs-aquarium\/","url_meta":{"origin":12162,"position":5},"title":"In-brief: splashr update + High Performance Scraping with splashr, furrr &#038; TeamHG-Memex&#8217;s Aquarium","author":"hrbrmstr","date":"2018-08-13","format":false,"excerpt":"The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and\/or Splash: the latter is a lightweight alternative to tools like Selenium\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/12162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=12162"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/12162\/revisions"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=12162"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=12162"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\
/b\/wp-json\/wp\/v2\/tags?post=12162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}