Skip navigation

Tag Archives: warc

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well-enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are:

I’ll exposit on the full package at some later date, but I wanted to post a snippet showng that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame':	67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                 <chr> "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                 <chr> "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type          <chr> "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                  <chr> "response", "response", "response"
## $ content_length             <dbl> 2984, 511564, 688
## $ payload_type               <chr> "application/json", "application/json", "application/json"
## $ profile                    <chr> NA, NA, NA
## $ date                       <dttm> 2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code           <dbl> 200, 200, 200
## $ http_protocol_content_type <chr> "application/json", "application/json", "application/json"
## $ http_version               <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers           <list> [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id             <chr> "<urn:uuid:2ae3e851-a1cf-466a-8f73-9681aab25d0c>", "<urn:uuid:77b30905-37f7-4c78-a120-2a008e194f94>",...
## $ payload                    <list> [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"                                   
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02" 

The URLs are all there, so it will be easier to map the original calls to them.

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of  9 variables:
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of  3 variables:
##   ..$ latitude : chr  "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of  2 variables:
##   ..$ longitude: chr  "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr  "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of  2 variables:
##   ..$ category: chr  NA NA NA NA ...
##   ..$ date    : chr  NA NA NA NA ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ month           : chr  "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of  9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of  3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of  2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of  2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file and I need to add options and functionality to store the a WARC request record as well as a responserecord`. But, I wanted to toss this out there to get feedback on the idiom and what possible desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing function, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.