

{"id":6465,"date":"2017-09-25T06:17:04","date_gmt":"2017-09-25T11:17:04","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6465"},"modified":"2018-03-07T17:01:43","modified_gmt":"2018-03-07T22:01:43","slug":"speeding-up-digital-arachinds","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","title":{"rendered":"Speeding Up Digital Arachnids"},"content":{"rendered":"<pre>\nspiderbar, spiderbar \nReads robots rules from afar.\nCrawls the web, any size; \nFetches with respect, never lies.\nLook Out! \nHere comes the spiderbar.\n\nIs it fast? \nListen bud, \nIt's got C++ under the hood.\nCan you scrape, from a site?\nTest with can_fetch(), TRUE == alright\nHey, there \nThere goes the spiderbar. \n<\/pre>\n<p>(Check the end of the post if you don&#8217;t recognize the lyrical riff.)<\/p>\n<h3>Face front, true believer!<\/h3>\n<p>I&#8217;ve used and blogged about Peter Meissner&#8217;s most excellent <a href=\"https:\/\/cran.rstudio.com\/web\/packages\/robotstxt\/\"><code>robotstxt<\/code><\/a> package <a href=\"https:\/\/rud.is\/b\/2017\/09\/19\/pirating-web-content-responsibly-with-r\/\">before<\/a>. It&#8217;s an essential tool for any ethical web scraper.<\/p>\n<p>But (there&#8217;s always a &#8220;<em>but<\/em>&#8221;, right?), it was a definite bottleneck for an <a href=\"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/\">unintended package use case<\/a> earlier this year (yes, I still have not rounded out the corners on my forthcoming &#8220;crawl delay&#8221; post).<\/p>\n<p>I needed something faster for my bulk <code>Crawl-Delay<\/code> analysis, which led me to <a href=\"https:\/\/github.com\/seomoz\/rep-cpp\">this small, spiffy C++ library<\/a> for parsing <code>robots.txt<\/code> files. 
After a tiny bit of wrangling, that C++ library has turned into a small R package <a href=\"https:\/\/cran.r-project.org\/web\/packages\/spiderbar\/\"><code>spiderbar<\/code><\/a> which is now hitting a CRAN mirror near you, soon. (CRAN &#8212; rightly so &#8212; did not like the unoriginal name <code>rep<\/code>).<\/p>\n<h3>How much faster?<\/h3>\n<p>I&#8217;m glad you asked!<\/p>\n<p>Let&#8217;s take a look at one benchmark: parsing <code>robots.txt<\/code> and extracting <code>Crawl-delay<\/code> entries. Just how much faster is <code>spiderbar<\/code>?<\/p>\n<pre id=\"spiderbar01\"><code class=\"language-r\">library(spiderbar)\r\nlibrary(robotstxt)\r\nlibrary(microbenchmark)\r\nlibrary(tidyverse)\r\nlibrary(hrbrthemes)\r\n\r\nrob &lt;- get_robotstxt(&quot;imdb.com&quot;)\r\n\r\nmicrobenchmark(\r\n\r\n  robotstxt = {\r\n    x &lt;- parse_robotstxt(rob)\r\n    x$crawl_delay\r\n  },\r\n\r\n  spiderbar = {\r\n    y &lt;- robxp(rob)\r\n    crawl_delays(y)\r\n  }\r\n\r\n) -&gt; mb1\r\n\r\nupdate_geom_defaults(&quot;violin&quot;, list(colour = &quot;#4575b4&quot;, fill=&quot;#abd9e9&quot;))\r\n\r\nautoplot(mb1) +\r\n  scale_y_comma(name=&quot;nanoseconds&quot;, trans=&quot;log10&quot;) +\r\n  labs(title=&quot;Microbenchmark results for parsing &#039;robots.txt&#039; and extracting &#039;Crawl-delay&#039; entries&quot;,\r\n       subtitle=&quot;Compares performance between robotstxt &amp; spiderbar packages. 
Lower values are better.&quot;) +\r\n  theme_ipsum_rc(grid=&quot;Xx&quot;)<\/code><\/pre>\n<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"6475\" data-permalink=\"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/spiderbar_1-1\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&amp;ssl=1\" data-orig-size=\"1920,1152\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"spiderbar_1-1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=510%2C306&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?resize=510%2C306&#038;ssl=1\" alt=\"\" width=\"510\" height=\"306\" class=\"aligncenter size-full wp-image-6475\" \/><\/a><\/p>\n<p>As you can see, it&#8217;s just <em>a tad bit faster<\/em>.<\/p>\n<p>Now, you won&#8217;t notice that temporal gain in an interactive context, but you absolutely will if you are cranking through a few million <code>robots.txt<\/code> files across a few thousand WARC files from the Common Crawl.<\/p>\n<h3>But, I don&#8217;t care about <code>Crawl-Delay<\/code>!<\/h3>\n<p>OK, fine. Do you care about fetchability? 
We can speed that up, too!<\/p>\n<pre id=\"spiderbar02\"><code class=\"language-r\">rob_txt &lt;- parse_robotstxt(rob)\r\nrob_spi &lt;- robxp(rob)\r\n\r\nmicrobenchmark(\r\n\r\n  robotstxt = {\r\n    robotstxt:::path_allowed(rob_txt$permissions, &quot;\/Vote&quot;)\r\n  },\r\n\r\n  spiderbar = {\r\n    can_fetch(rob_spi, &quot;\/Vote&quot;)\r\n  }\r\n\r\n) -&gt; mb2\r\n\r\nautoplot(mb2) +\r\n  scale_y_comma(name=&quot;nanoseconds&quot;, trans=&quot;log10&quot;) +\r\n  labs(title=&quot;Microbenchmark results for testing resource &#039;fetchability&#039;&quot;,\r\n       subtitle=&quot;Compares performance between robotstxt &amp; spiderbar packages. Lower values are better.&quot;) +\r\n  theme_ipsum_rc(grid=&quot;Xx&quot;)<\/code><\/pre>\n<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_2-1.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"6476\" data-permalink=\"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/spiderbar_2-1\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_2-1.png?fit=1920%2C1152&amp;ssl=1\" data-orig-size=\"1920,1152\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"spiderbar_2-1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_2-1.png?fit=510%2C306&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_2-1.png?resize=510%2C306&#038;ssl=1\" alt=\"\" width=\"510\" 
height=\"306\" class=\"aligncenter size-full wp-image-6476\" \/><\/a><\/p>\n<h3>Vectorized or it didn&#8217;t happen.<\/h3>\n<p>(<em>Gosh, even Spider-Man got more respect!<\/em>)<\/p>\n<p>OK, this is a tough crowd, but we&#8217;ve got vectorization covered as well:<\/p>\n<pre id=\"spiderbar03\"><code class=\"language-r\">microbenchmark(\r\n\r\n  robotstxt = {\r\n    paths_allowed(c(&quot;\/ShowAll\/this\/that&quot;, &quot;\/Soundtracks\/this\/that&quot;, &quot;\/Tsearch\/this\/that&quot;), &quot;imdb.com&quot;)\r\n  },\r\n\r\n  spiderbar = {\r\n    can_fetch(rob_spi, c(&quot;\/ShowAll\/this\/that&quot;, &quot;\/Soundtracks\/this\/that&quot;, &quot;\/Tsearch\/this\/that&quot;))\r\n  }\r\n\r\n) -&gt; mb3\r\n\r\nautoplot(mb3) +\r\n  scale_y_comma(name=&quot;nanoseconds&quot;, trans=&quot;log10&quot;) +\r\n  labs(title=&quot;Microbenchmark results for testing multiple resource &#039;fetchability&#039;&quot;,\r\n       subtitle=&quot;Compares performance between robotstxt &amp; spiderbar packages. 
Lower values are better.&quot;) +\r\n  theme_ipsum_rc(grid=&quot;Xx&quot;)<\/code><\/pre>\n<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_3-1.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"6477\" data-permalink=\"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/spiderbar_3-1\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_3-1.png?fit=1920%2C1152&amp;ssl=1\" data-orig-size=\"1920,1152\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"spiderbar_3-1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_3-1.png?fit=510%2C306&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_3-1.png?resize=510%2C306&#038;ssl=1\" alt=\"\" width=\"510\" height=\"306\" class=\"aligncenter size-full wp-image-6477\" \/><\/a><\/p>\n<h3>Excelsior!<\/h3>\n<p>Peter&#8217;s package does more than this one since it helps find the <code>robots.txt<\/code> files and provides helpful data frames for more robots exclusion protocol content. And, <a href=\"https:\/\/github.com\/hrbrmstr\/spiderbar\/issues\/1#issuecomment-331636193\">we&#8217;ve got some plans<\/a> for package interoperability. 
So, stay tuned, true believer, for more spider-y goodness.<\/p>\n<p>You can check out the code and leave package questions or comments <a href=\"https:\/\/github.com\/hrbrmstr\/spiderbar\">on GitHub<\/a>.<\/p>\n<p><iframe loading=\"lazy\" width=\"427\" height=\"240\" src=\"https:\/\/www.youtube.com\/embed\/SUtziaZlDeE\" frameborder=\"0\" allowfullscreen><\/iframe><\/p>\n<p><i>(Hrm&hellip;<b>Peter<\/b> Parker was Spider-Man and <b>Peter<\/b> Meissner wrote <code>robotstxt<\/code> which is all about spiders. Coincidence?! I think not!)<\/i><\/p>\n","protected":false},"excerpt":{"rendered":"<p>spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It&#8217;s got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes the spiderbar. (Check the end [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6475,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91,725],"tags":[810],"class_list":["post-6465","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-r","category-web-scraping","tag-post"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Speeding Up Digital Arachnids - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Speeding Up Digital Arachnids - rud.is\" \/>\n<meta property=\"og:description\" content=\"spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It&#039;s got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes the spiderbar. (Check the end [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-09-25T11:17:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-03-07T22:01:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1152\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Speeding Up Digital Arachnids\",\"datePublished\":\"2017-09-25T11:17:04+00:00\",\"dateModified\":\"2018-03-07T22:01:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/\"},\"wordCount\":347,\"commentCount\":2,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/09\\\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1\",\"keywords\":[\"post\"],\"articleSection\":[\"R\",\"web scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/\",\"name\":\"Speeding Up Digital Arachnids - 
rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/09\\\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1\",\"datePublished\":\"2017-09-25T11:17:04+00:00\",\"dateModified\":\"2018-03-07T22:01:43+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#primaryimage\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/09\\\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/09\\\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1\",\"width\":1920,\"height\":1152},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/09\\\/25\\\/speeding-up-digital-arachinds\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Speeding Up Digital Arachnids\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Speeding Up Digital Arachnids - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","og_locale":"en_US","og_type":"article","og_title":"Speeding Up Digital Arachnids - rud.is","og_description":"spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes the spiderbar. (Check the end [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","og_site_name":"rud.is","article_published_time":"2017-09-25T11:17:04+00:00","article_modified_time":"2018-03-07T22:01:43+00:00","og_image":[{"width":1920,"height":1152,"url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","type":"image\/png"}],"author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Speeding Up Digital Arachnids","datePublished":"2017-09-25T11:17:04+00:00","dateModified":"2018-03-07T22:01:43+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/"},"wordCount":347,"commentCount":2,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","keywords":["post"],"articleSection":["R","web scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","url":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","name":"Speeding Up Digital Arachnids - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"primaryImageOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#primaryimage"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","datePublished":"2017-09-25T11:17:04+00:00","dateModified":"2018-03-07T22:01:43+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#primaryimage","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","width":1920,"height":1152},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Speeding Up Digital Arachnids"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1920%2C1152&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1Gh","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6206,"url":"https:\/\/rud.is\/b\/2017\/08\/29\/new-cran-package-announcement-splashr\/","url_meta":{"origin":6465,"position":0},"title":"New CRAN Package Announcement: splashr","author":"hrbrmstr","date":"2017-08-29","format":false,"excerpt":"I'm pleased to announce that splashr is now on CRAN. 
(That image was generated with splashr::render_png(url = \"https:\/\/cran.r-project.org\/web\/packages\/splashr\/\")). The package is an R interface to the Splash javascript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/08\/splashr.png?fit=1066%2C1108&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/08\/splashr.png?fit=1066%2C1108&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/08\/splashr.png?fit=1066%2C1108&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/08\/splashr.png?fit=1066%2C1108&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/08\/splashr.png?fit=1066%2C1108&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":12060,"url":"https:\/\/rud.is\/b\/2019\/03\/05\/heads-up-roll-your-own-http-headers-investigations-with-the-hdrs-package\/","url_meta":{"origin":6465,"position":1},"title":"Head&#8217;s Up! Roll Your Own HTTP Headers Investigations with the &#8216;hdrs&#8217; Package","author":"hrbrmstr","date":"2019-03-05","format":false,"excerpt":"I blathered a lot about HTTP headers in the last post. In the event you wanted to dig deeper I threw together a small package that will let you grab HTTP headers from a given URL and take a look at them. 
The README has examples for most things but we'll\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11765,"url":"https:\/\/rud.is\/b\/2019\/01\/14\/splashr-0-6-0-now-uses-the-cran-nascent-stevedore-package-for-docker-orchestration\/","url_meta":{"origin":6465,"position":2},"title":"splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration","author":"hrbrmstr","date":"2019-01-14","format":false,"excerpt":"The splashr package [srht|GL|GH] \u2014 an alternative to Selenium for javascript-enabled\/browser-emulated web scraping \u2014 is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days). The major change from version 0.5.x (which never made it to CRAN) is a swap out of\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12004,"url":"https:\/\/rud.is\/b\/2019\/02\/28\/drat-all-the-%f0%9f%93%a6-enabling-easier-package-discovery-and-installation-with-your-own-cran-like-repo-for-your-packages\/","url_meta":{"origin":6465,"position":3},"title":"drat All The ?! : Enabling Easier Package Discovery and Installation with Your Own CRAN-like Repo for Your Packages","author":"hrbrmstr","date":"2019-02-28","format":false,"excerpt":"I've got a work-in-progress drat-ified CRAN-like repo for (eventually) all my packages over at CINC? (\"CINC is not CRAN\" and it also sounds like \"sync\"). 
This is in parallel with a co-location\/migration of all my packages to SourceHut (just waiting for the sr.ht alpha API to be baked) and a\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":4490,"url":"https:\/\/rud.is\/b\/2016\/07\/05\/a-simple-prediction-web-service-using-the-new-firery-package\/","url_meta":{"origin":6465,"position":4},"title":"A Simple Prediction Web Service Using the New fiery Package","author":"hrbrmstr","date":"2016-07-05","format":false,"excerpt":"[`fiery`](https:\/\/github.com\/thomasp85\/fiery) is a new `Rook`\/`httuv`-based R web server in town created by @thomasp85 that aims to fill the gap between raw http & websockets and Shiny with a flexible framework for handling requests and serving up responses. The intent of this post is to provide a quick-start to using it\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11859,"url":"https:\/\/rud.is\/b\/2019\/02\/03\/r-package-update-urlscan\/","url_meta":{"origin":6465,"position":5},"title":"R Package Update: urlscan","author":"hrbrmstr","date":"2019-02-03","format":false,"excerpt":"The urlscan? package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io's authentication requirement when submitting a link for analysis. The service is handy if you want to learn about the details \u2014 all the gory technical details \u2014 for a website. 
For instance, say\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6465","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=6465"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6465\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media\/6475"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=6465"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=6465"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=6465"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}