

{"id":6164,"date":"2017-08-22T12:53:34","date_gmt":"2017-08-22T17:53:34","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6164"},"modified":"2018-03-07T17:06:55","modified_gmt":"2018-03-07T22:06:55","slug":"caching-httr-requests-this-means-warc","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","title":{"rendered":"Caching httr Requests? This means WAR[C]!"},"content":{"rendered":"<p>I&#8217;ve blathered about my <a href=\"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/\"><code>crawl_delay<\/code> project before<\/a> and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of <a href=\"https:\/\/www.loc.gov\/preservation\/digital\/formats\/fdd\/fdd000236.shtml\">Web Archive (WARC)<\/a> files. While I have a nascent package on github to work with WARC files it&#8217;s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).<\/p>\n<p>One of those implementations is <a href=\"https:\/\/sbforge.org\/display\/JWAT\/JWAT\">JWAT<\/a>, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well-enough that I decided to take it for a spin as a set of R packages that wrap it with <code>rJava<\/code>. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files &#8212; since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space &#8212; and another for the actual package that does the work. 
They are:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/hrbrmstr\/jwatjars\"><code>jwatjars<\/code><\/a><\/li>\n<li><a href=\"https:\/\/github.com\/hrbrmstr\/jwatr\"><code>jwatr<\/code><\/a><\/li>\n<\/ul>\n<p>I&#8217;ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn&#8217;t considered before: pairing WARC files with <code>httr<\/code> web scraping tasks to maintain a local cache of what you&#8217;ve scraped.<\/p>\n<p>Web scraping consumes network &amp; compute resources on the server end that you typically don&#8217;t own and &#8212; in many cases &#8212; do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won&#8217;t change.<\/p>\n<p>The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).<\/p>\n<p>To that end I&#8217;ve put together the beginning of some &#8220;WARC wrappers&#8221; for <code>httr<\/code> verbs that make it seamless to cache scraping or API results as you gather and process them. Let&#8217;s work through an example using the <a href=\"https:\/\/data.police.uk\/\">U.K. 
open data portal on crime and policing<\/a> API.<\/p>\n<p>First, we&#8217;ll need some helpers:<\/p>\n<pre id=\"warcwrap01\"><code class=\"language-r\">library(rJava)\r\nlibrary(jwatjars) # devtools::install_github(&quot;hrbrmstr\/jwatjars&quot;)\r\nlibrary(jwatr) # devtools::install_github(&quot;hrbrmstr\/jwatr&quot;)\r\nlibrary(httr)\r\nlibrary(jsonlite)\r\nlibrary(tidyverse)<\/code><\/pre>\n<p>Just doing <code>library(jwatr)<\/code> would have covered much of that but I wanted to show some of the work R does behind the scenes for you.<\/p>\n<p>Now, we&#8217;ll grab some neighbourhood and crime info:<\/p>\n<pre id=\"warcwrap02\"><code class=\"language-r\">wf &lt;- warc_file(&quot;~\/Data\/wrap-test&quot;)\r\n\r\nres &lt;- warc_GET(wf, &quot;https:\/\/data.police.uk\/api\/leicestershire\/neighbourhoods&quot;)\r\n\r\nstr(jsonlite::fromJSON(content(res, as=&quot;text&quot;)), 2)\r\n## &#039;data.frame&#039;:\t67 obs. of  2 variables:\r\n##  $ id  : chr  &quot;NC04&quot; &quot;NC66&quot; &quot;NC67&quot; &quot;NC68&quot; ...\r\n##  $ name: chr  &quot;City Centre&quot; &quot;Cultural Quarter&quot; &quot;Riverside&quot; &quot;Clarendon Park&quot; ...\r\n\r\nres &lt;- warc_GET(wf, &quot;https:\/\/data.police.uk\/api\/crimes-street\/all-crime&quot;,\r\n                query = list(lat=52.629729, lng=-1.131592, date=&quot;2017-01&quot;))\r\n\r\nres &lt;- warc_GET(wf, &quot;https:\/\/data.police.uk\/api\/crimes-at-location&quot;,\r\n                query = list(location_id=&quot;884227&quot;, date=&quot;2017-02&quot;))\r\n\r\nclose_warc_file(wf)<\/code><\/pre>\n<p>As you can see, the standard <code>httr<\/code> <code>response<\/code> object is returned for processing, and the HTTP response itself is being stored away for us as we process it.<\/p>\n<pre id=\"warcwrap03\"><code class=\"language-r\">file.info(&quot;~\/Data\/wrap-test.warc.gz&quot;)$size\r\n## [1] 76020<\/code><\/pre>\n<p>We can use these results later and, pretty easily, since the WARC file will be read in as 
a tidy R tibble (fancy data frame):<\/p>\n<pre id=\"warcwrap04\"><code class=\"language-r\">xdf &lt;- read_warc(&quot;~\/Data\/wrap-test.warc.gz&quot;, include_payload = TRUE)\r\n\r\nglimpse(xdf)\r\n## Observations: 3\r\n## Variables: 14\r\n## $ target_uri                 &lt;chr&gt; &quot;https:\/\/data.police.uk\/api\/leicestershire\/neighbourhoods&quot;, &quot;https:\/\/data.police.uk\/api\/crimes-street...\r\n## $ ip_address                 &lt;chr&gt; &quot;54.76.101.128&quot;, &quot;54.76.101.128&quot;, &quot;54.76.101.128&quot;\r\n## $ warc_content_type          &lt;chr&gt; &quot;application\/http; msgtype=response&quot;, &quot;application\/http; msgtype=response&quot;, &quot;application\/http; msgtyp...\r\n## $ warc_type                  &lt;chr&gt; &quot;response&quot;, &quot;response&quot;, &quot;response&quot;\r\n## $ content_length             &lt;dbl&gt; 2984, 511564, 688\r\n## $ payload_type               &lt;chr&gt; &quot;application\/json&quot;, &quot;application\/json&quot;, &quot;application\/json&quot;\r\n## $ profile                    &lt;chr&gt; NA, NA, NA\r\n## $ date                       &lt;dttm&gt; 2017-08-22, 2017-08-22, 2017-08-22\r\n## $ http_status_code           &lt;dbl&gt; 200, 200, 200\r\n## $ http_protocol_content_type &lt;chr&gt; &quot;application\/json&quot;, &quot;application\/json&quot;, &quot;application\/json&quot;\r\n## $ http_version               &lt;chr&gt; &quot;HTTP\/1.1&quot;, &quot;HTTP\/1.1&quot;, &quot;HTTP\/1.1&quot;\r\n## $ http_raw_headers           &lt;list&gt; [&lt;48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...\r\n## $ warc_record_id             &lt;chr&gt; &quot;&lt;urn:uuid:2ae3e851-a1cf-466a-8f73-9681aab25d0c&gt;&quot;, &quot;&lt;urn:uuid:77b30905-37f7-4c78-a120-2a008e194f94&gt;&quot;,...\r\n## $ payload                    &lt;list&gt; [&lt;5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 
74...\r\n\r\nxdf$target_uri\r\n## [1] &quot;https:\/\/data.police.uk\/api\/leicestershire\/neighbourhoods&quot;                                   \r\n## [2] &quot;https:\/\/data.police.uk\/api\/crimes-street\/all-crime?lat=52.629729&amp;lng=-1.131592&amp;date=2017-01&quot;\r\n## [3] &quot;https:\/\/data.police.uk\/api\/crimes-at-location?location_id=884227&amp;date=2017-02&quot; <\/code><\/pre>\n<p>The URLs are all there, so it will be easier to map the original calls to them.<\/p>\n<p>Now, the <code>payload<\/code> field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it&#8217;s JSON content (that&#8217;s what the API returns), we can just decode it:<\/p>\n<pre id=\"warcwrap05\"><code class=\"language-r\">for (i in 1:nrow(xdf)) {\r\n  res &lt;- jsonlite::fromJSON(readBin(xdf$payload[[i]], &quot;character&quot;))\r\n  print(str(res, 2))\r\n}\r\n## &#039;data.frame&#039;: 67 obs. of  2 variables:\r\n##  $ id  : chr  &quot;NC04&quot; &quot;NC66&quot; &quot;NC67&quot; &quot;NC68&quot; ...\r\n##  $ name: chr  &quot;City Centre&quot; &quot;Cultural Quarter&quot; &quot;Riverside&quot; &quot;Clarendon Park&quot; ...\r\n## NULL\r\n## &#039;data.frame&#039;: 1318 obs. of  9 variables:\r\n##  $ category        : chr  &quot;anti-social-behaviour&quot; &quot;anti-social-behaviour&quot; &quot;anti-social-behaviour&quot; &quot;anti-social-behaviour&quot; ...\r\n##  $ location_type   : chr  &quot;Force&quot; &quot;Force&quot; &quot;Force&quot; &quot;Force&quot; ...\r\n##  $ location        :&#039;data.frame&#039;: 1318 obs. of  3 variables:\r\n##   ..$ latitude : chr  &quot;52.616961&quot; &quot;52.629963&quot; &quot;52.641646&quot; &quot;52.635184&quot; ...\r\n##   ..$ street   :&#039;data.frame&#039;: 1318 obs. 
of  2 variables:\r\n##   ..$ longitude: chr  &quot;-1.120719&quot; &quot;-1.122291&quot; &quot;-1.131486&quot; &quot;-1.135455&quot; ...\r\n##  $ context         : chr  &quot;&quot; &quot;&quot; &quot;&quot; &quot;&quot; ...\r\n##  $ outcome_status  :&#039;data.frame&#039;: 1318 obs. of  2 variables:\r\n##   ..$ category: chr  NA NA NA NA ...\r\n##   ..$ date    : chr  NA NA NA NA ...\r\n##  $ persistent_id   : chr  &quot;&quot; &quot;&quot; &quot;&quot; &quot;&quot; ...\r\n##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...\r\n##  $ location_subtype: chr  &quot;&quot; &quot;&quot; &quot;&quot; &quot;&quot; ...\r\n##  $ month           : chr  &quot;2017-01&quot; &quot;2017-01&quot; &quot;2017-01&quot; &quot;2017-01&quot; ...\r\n## NULL\r\n## &#039;data.frame&#039;: 1 obs. of  9 variables:\r\n##  $ category        : chr &quot;violent-crime&quot;\r\n##  $ location_type   : chr &quot;Force&quot;\r\n##  $ location        :&#039;data.frame&#039;: 1 obs. of  3 variables:\r\n##   ..$ latitude : chr &quot;52.643950&quot;\r\n##   ..$ street   :&#039;data.frame&#039;: 1 obs. of  2 variables:\r\n##   ..$ longitude: chr &quot;-1.143042&quot;\r\n##  $ context         : chr &quot;&quot;\r\n##  $ outcome_status  :&#039;data.frame&#039;: 1 obs. 
of  2 variables:\r\n##   ..$ category: chr &quot;Unable to prosecute suspect&quot;\r\n##   ..$ date    : chr &quot;2017-02&quot;\r\n##  $ persistent_id   : chr &quot;4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735&quot;\r\n##  $ id              : int 54726925\r\n##  $ location_subtype: chr &quot;&quot;\r\n##  $ month           : chr &quot;2017-02&quot;\r\n## NULL<\/code><\/pre>\n<p>We can also use a <code>jwatr<\/code> helper function &#8212; <code>payload_content()<\/code> &#8212; which mimics the <code>httr::content()<\/code> function:<\/p>\n<pre id=\"warcwrap06\"><code class=\"language-r\">for (i in 1:nrow(xdf)) {\r\n  \r\n  payload_content(\r\n    xdf$target_uri[i], \r\n    xdf$http_protocol_content_type[i], \r\n    xdf$http_raw_headers[[i]], \r\n    xdf$payload[[i]], as = &quot;text&quot;\r\n  ) %&gt;% \r\n    jsonlite::fromJSON() -&gt; res\r\n  \r\n  print(str(res, 2))\r\n  \r\n}<\/code><\/pre>\n<p>The same output is printed, so I&#8217;m saving some blog content space by not including it.<\/p>\n<h3>Future Work<\/h3>\n<p>I kept this example small, but ideally one would write a <code>warcinfo<\/code> record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC <code>request<\/code> record as well as a <code>response<\/code> record. But, I wanted to toss this out there to get feedback on the idiom and on what functionality should be added.<\/p>\n<p>So, please kick the tyres and file as many <a href=\"https:\/\/github.com\/hrbrmstr\/jwatr\/issues\">issues<\/a> as you have time or interest to. 
I&#8217;m still designing the full package API and making refinements to existing functions, so there&#8217;s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it&#8217;s a tad [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91,725],"tags":[810,799],"class_list":["post-6164","post","type-post","status-publish","format-standard","hentry","category-r","category-web-scraping","tag-post","tag-warc"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Caching httr Requests? This means WAR[C]! - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Caching httr Requests? 
This means WAR[C]! - rud.is\" \/>\n<meta property=\"og:description\" content=\"I&#8217;ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it&#8217;s a tad [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-08-22T17:53:34+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-03-07T22:06:55+00:00\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Caching httr Requests? 
This means WAR[C]!\",\"datePublished\":\"2017-08-22T17:53:34+00:00\",\"dateModified\":\"2018-03-07T22:06:55+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/\"},\"wordCount\":758,\"commentCount\":3,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"keywords\":[\"post\",\"warc\"],\"articleSection\":[\"R\",\"web scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/\",\"name\":\"Caching httr Requests? This means WAR[C]! - rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"datePublished\":\"2017-08-22T17:53:34+00:00\",\"dateModified\":\"2018-03-07T22:06:55+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/22\\\/caching-httr-requests-this-means-warc\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Caching httr Requests? This means WAR[C]!\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Caching httr Requests? This means WAR[C]! 
- rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","og_locale":"en_US","og_type":"article","og_title":"Caching httr Requests? This means WAR[C]! - rud.is","og_description":"I&#8217;ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it&#8217;s a tad [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","og_site_name":"rud.is","article_published_time":"2017-08-22T17:53:34+00:00","article_modified_time":"2018-03-07T22:06:55+00:00","author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Caching httr Requests? 
This means WAR[C]!","datePublished":"2017-08-22T17:53:34+00:00","dateModified":"2018-03-07T22:06:55+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/"},"wordCount":758,"commentCount":3,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"keywords":["post","warc"],"articleSection":["R","web scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","url":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","name":"Caching httr Requests? This means WAR[C]! - rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"datePublished":"2017-08-22T17:53:34+00:00","dateModified":"2018-03-07T22:06:55+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Caching httr Requests? This means WAR[C]!"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1Bq","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6465,"url":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","url_meta":{"origin":6164,"position":0},"title":"Speeding Up Digital Arachnids","author":"hrbrmstr","date":"2017-09-25","format":false,"excerpt":"spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? 
Test with can_fetch(), TRUE == alright Hey, there There goes\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":6134,"url":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","url_meta":{"origin":6164,"position":1},"title":"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with R","author":"hrbrmstr","date":"2017-07-28","format":false,"excerpt":"One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats \/\/ Ethics in Web Scraping https:\/\/t.co\/y5YxvzB8Fd\u2014 boB Rudis (@hrbrmstr) July 26, 2017 If you\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=350%2C200 1x, 
https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":6385,"url":"https:\/\/rud.is\/b\/2017\/09\/19\/pirating-web-content-responsibly-with-r\/","url_meta":{"origin":6164,"position":2},"title":"Pirating Web Content Responsibly With R","author":"hrbrmstr","date":"2017-09-19","format":false,"excerpt":"International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":12366,"url":"https:\/\/rud.is\/b\/2019\/06\/26\/quick-hit-above-the-fold-hard-wrapping-text-at-n-characters\/","url_meta":{"origin":6164,"position":3},"title":"Quick Hit: Above the Fold; Hard wrapping text at &#8216;n&#8217; 
characters","author":"hrbrmstr","date":"2019-06-26","format":false,"excerpt":"Despite being on holiday I'm getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I've been working on a PR into {lutz}, a package by @andyteucher that turns lat\/lng pairs into timezone strings. The\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/06\/lutz-widths-02.png?fit=1200%2C628&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/06\/lutz-widths-02.png?fit=1200%2C628&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/06\/lutz-widths-02.png?fit=1200%2C628&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/06\/lutz-widths-02.png?fit=1200%2C628&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2019\/06\/lutz-widths-02.png?fit=1200%2C628&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":9475,"url":"https:\/\/rud.is\/b\/2018\/04\/06\/adding-macos-touch-bar-support-to-rstudio\/","url_meta":{"origin":6164,"position":4},"title":"Adding macOS Touch Bar Support to RStudio","author":"hrbrmstr","date":"2018-04-06","format":false,"excerpt":"Modern MacBook Pros have a fairly useless (c'mon, admit it!) \"Touch Bar\" that did little more than cause severe ire in the developer community after turning a full-fledged, tactile Escape key into a hollow version if its former self. 
Having said, that, some apps do make OK use of it,\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1243,"url":"https:\/\/rud.is\/b\/2012\/06\/10\/slopegraphs-everywhere\/","url_meta":{"origin":6164,"position":5},"title":"Slopegraphs Everywhere","author":"hrbrmstr","date":"2012-06-10","format":false,"excerpt":"Not much progress over the weekend on my latest obsession (been busy enjoying some non-rainy days here in Maine). So, here are some other slopegraph implementations\/resources I've found through mining the internets: A spiffy editable javascript implementation defensive slopegraphs slopegraphs in tableau an interesting d3 implementation another javascript implementation an\u2026","rel":"","context":"In &quot;Charts &amp; Graphs&quot;","block_context":{"text":"Charts &amp; Graphs","link":"https:\/\/rud.is\/b\/category\/charts-graphs\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6164","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=6164"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6164\/revisions"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=6164"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=6164"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=6164"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel
}","templated":true}]}}