

Putting this here to make it easier for others who try to Google this topic to find it w/o having to find and tediously search through other UDFs (user-defined functions).

I was/am making a custom UDF for base64 decoding/encoding and ran into:

SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: #### (expected: range(0, 256))

It’s incredibly easy to “fix” (and, if my Java weren’t so rusty, I’d likely have seen it sooner). I found this idiom in the spatial UDFs for Drill, which increases the default buffer size:

buffer = out.buffer = buffer.reallocIfNeeded(outputSize);

Hopefully this will prevent someone else from spinning their wheels for a few minutes on this use case. I had even looked at the source for the DrillBuf class and did not manage to put 2 + 2 together for some reason.

Earlier this year, the GDELT Project released their Television Explorer that enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables”, which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should read that (long-ish) intro as there are many caveats to the data source, and I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.
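For intuition, that expansion is roughly equivalent to this base R sketch (illustrative only, not the package source):

# illustrative only: convert a local-time range to GMT and expand it into
# the 15-minute steps GDELT uses for its file timestamps
rng_from <- as.POSIXct("2017-09-07 00:00:00")
rng_to   <- as.POSIXct("2017-09-07 02:00:00")

seq(
  as.POSIXct(format(rng_from, tz = "GMT"), tz = "GMT"),
  as.POSIXct(format(rng_to, tz = "GMT"), tz = "GMT"),
  by = "15 mins"
)

With that out of the way, here’s the actual retrieval: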

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")

trends <- top_trending_range(from, to)

glimpse(trends)
## Observations: 233
## Variables: 5
## $ ts                       <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics  <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics  <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics       <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

glimpse(trends$station_top_topics[[1]])
## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics  <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...

Each individual data frame has the top topics of each tracked station.
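For example, here’s one quick way (a sketch) to peek at a single station’s list from the first 15-minute window:

# pull CNN's top topics from the first 15-minute window (quick sketch)
trends$station_top_topics[[1]] %>%
  filter(Station == "CNN") %>%
  pull(Topics) %>%
  flatten_chr() %>%
  head()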

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

select(trends, ts, station_top_topics) %>% 
  unnest() %>% 
  unnest() %>% 
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>% 
  group_by(day) %>% 
  top_n(25) %>% 
  slice(1:25) %>% 
  arrange(day, desc(n)) %>% 
  mutate(rnk = 25:1) -> top_25_trends

glimpse(top_25_trends)
## Observations: 75
## Variables: 4
## $ day   <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n     <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk   <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +
  theme(legend.direction="horizontal")

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!

UPDATE

As a result of a tweet by @arnicas, you can find a per-day, per-station view (top 10 only) here.

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be slow compared to other regular expression engines).

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho, which builds upon work by Martin Jericho that is used in at-scale initiatives like the Internet Archive. Yes, I’m trading Python for Java, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know of a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time to wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain the jericho functions do a comparable — if not better — job of extracting clean, pertinent text.
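If you want a quick, truncated side-by-side on your own system, something like this (a sketch) works:

# eyeball the first few hundred characters of each conversion (sketch)
cat(substr(html_to_text(sites_html[1]), 1, 300), "\n\n")
cat(substr(htm2txt(sites_html[1]), 1, 300), "\n")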

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.

I’m pleased to announce that splashr is now on CRAN.

(That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")).

The package is an R interface to the Splash JavaScript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the hood.
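If you want a quick taste before digging into the docs, a minimal session looks roughly like this (a sketch; it assumes you have a Splash instance running and reachable at the default local host/port, e.g. via the package’s Docker helpers):

library(splashr)
library(rvest)

# render the package's own CRAN page through Splash and poke at the result
pg <- render_html(url = "https://cran.r-project.org/web/packages/splashr/")
html_text(html_nodes(pg, "h2"))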

I’ve blogged about splashr before, and the package comes with three vignettes that (hopefully) provide a solid introduction to using the web scraping framework.

More features — including additional DSL functions — will be added in the coming months, but please kick the tyres and file an issue with problems or suggestions.

Many thanks to all who took it for a spin and provided suggestions and even more thanks to the CRAN team for a speedy onboarding.

I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney.

Then I got distracted by a PDF in the CRAN doc/contrib directory: Short-refcard.pdf. I’m not a big reference card user, but students really like them, and after seeing what it was I remembered having seen the document ages ago, though I had never associated it with CRAN before.

I saw:

by Tom Short, EPRI PEAC, tshort@epri-peac.com 2004-11-07 Granted to the public domain. See www. Rpad. org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission).

at the top of the card. The link (which I’ve made unclickable for reasons you’ll see in a sec — don’t visit that URL) was clickable and I tapped it as I wanted to see if it had changed since 2004.

You can open that image in a new tab to see the full, rendered site and take a moment to see if you can find the section that links to objectionable — and, potentially malicious — content. It’s easy to spot.

I made a likely correct assumption that Tom Short had nothing to do with this and wanted to dig into it a bit further. So, don your bestest deerstalker and follow along as we work out when this takeover may have happened.

Digging In Domain Land

We’ll need some helpers to poke around this data in a safe manner:

library(wayback) # devtools::install_github("hrbrmstr/wayback")
library(ggTimeSeries) # devtools::install_github("AtherEnergy/ggTimeSeries")
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(passivetotal) # devtools::install_github("hrbrmstr/passivetotal")
library(cymruservices)
library(hrbrthemes) # for theme_ipsum_rc()
library(magick)
library(tidyverse)

(You’ll need to get a RiskIQ PassiveTotal key to use those functions. Also, please donate to Archive.org if you use the wayback package.)

Now, let’s see if the main Rpad content URL is in the wayback machine:

glimpse(archive_available("http://www.rpad.org/Rpad/"))
## Observations: 1
## Variables: 5
## $ url        <chr> "http://www.rpad.org/Rpad/"
## $ available  <lgl> TRUE
## $ closet_url <chr> "http://web.archive.org/web/20170813053454/http://ww...
## $ timestamp  <dttm> 2017-08-13
## $ status     <chr> "200"

It is! Let’s see how many versions of it are in the archive:

x <- cdx_basic_query("http://www.rpad.org/Rpad/")

ts_range <- range(x$timestamp)

count(x, timestamp) %>%
  ggplot(aes(timestamp, n)) +
  geom_segment(aes(xend=timestamp, yend=0)) +
  labs(x=NULL, y="# changes in year", title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="Y")

count(x, timestamp) %>%
  mutate(Year = lubridate::year(timestamp)) %>%
  complete(timestamp=seq(ts_range[1], ts_range[2], "1 day"))  %>%
  filter(!is.na(timestamp), !is.na(Year)) %>%
  ggplot(aes(date = timestamp, fill = n)) +
  stat_calendar_heatmap() +
  viridis::scale_fill_viridis(na.value="white", option = "magma") +
  facet_wrap(~Year, ncol=1) +
  labs(x=NULL, y=NULL, title="rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(panel.spacing = grid::unit(0.5, "lines"))

There’s a big span between 2008/9 and 2016/17. Let’s poke around there a bit. First 2016:

tm <- get_timemap("http://www.rpad.org/Rpad/")

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2016))
## # A tibble: 1 x 5
##       rel                                                                   link  type
##     <chr>                                                                  <chr> <chr>
## 1 memento http://web.archive.org/web/20160629104907/http://www.rpad.org:80/Rpad/  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2016 <- render_png(url = rurl$link))

Hrm. Could be server or network errors.

Let’s go back to 2009.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2009))
## # A tibble: 4 x 5
##       rel                                                                  link  type
##     <chr>                                                                 <chr> <chr>
## 1 memento     http://web.archive.org/web/20090219192601/http://rpad.org:80/Rpad  <NA>
## 2 memento http://web.archive.org/web/20090322163146/http://www.rpad.org:80/Rpad  <NA>
## 3 memento http://web.archive.org/web/20090422082321/http://www.rpad.org:80/Rpad  <NA>
## 4 memento http://web.archive.org/web/20090524155658/http://www.rpad.org:80/Rpad  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2009 <- render_png(url = rurl$link[4]))

If you poke around that, it looks like the original Rpad content, so it was “safe” back then.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2017))
## # A tibble: 6 x 5
##       rel                                                                link  type
##     <chr>                                                               <chr> <chr>
## 1 memento  http://web.archive.org/web/20170323222705/http://www.rpad.org/Rpad  <NA>
## 2 memento http://web.archive.org/web/20170331042213/http://www.rpad.org/Rpad/  <NA>
## 3 memento http://web.archive.org/web/20170412070515/http://www.rpad.org/Rpad/  <NA>
## 4 memento http://web.archive.org/web/20170518023345/http://www.rpad.org/Rpad/  <NA>
## 5 memento http://web.archive.org/web/20170702130918/http://www.rpad.org/Rpad/  <NA>
## 6 memento http://web.archive.org/web/20170813053454/http://www.rpad.org/Rpad/  <NA>
## # ... with 2 more variables: from <chr>, datetime <chr>

(p2017 <- render_png(url = rurl$link[1]))

I won’t break your browser and add another giant image, but that one has the icky content. So, it’s a relatively recent takeover and it’s likely that whoever added the icky content links did so to try to ensure those domains and URLs have both good SEO and a positive reputation.

Let’s see if they were dumb enough to make their info public:

rwho <- passive_whois("rpad.org")
str(rwho, 1)
## List of 18
##  $ registryUpdatedAt: chr "2016-10-05"
##  $ admin            :List of 10
##  $ domain           : chr "rpad.org"
##  $ registrant       :List of 10
##  $ telephone        : chr "5078365503"
##  $ organization     : chr "WhoisGuard, Inc."
##  $ billing          : Named list()
##  $ lastLoadedAt     : chr "2017-03-14"
##  $ nameServers      : chr [1:2] "ns-1147.awsdns-15.org" "ns-781.awsdns-33.net"
##  $ whoisServer      : chr "whois.publicinterestregistry.net"
##  $ registered       : chr "2004-06-15"
##  $ contactEmail     : chr "411233718f2a4cad96274be88d39e804.protect@whoisguard.com"
##  $ name             : chr "WhoisGuard Protected"
##  $ expiresAt        : chr "2018-06-15"
##  $ registrar        : chr "eNom, Inc."
##  $ compact          :List of 10
##  $ zone             : Named list()
##  $ tech             :List of 10

Nope. #sigh

Is this site considered “malicious”?

(rclass <- passive_classification("rpad.org"))
## $everCompromised
## NULL

Nope. #sigh

What’s the hosting history for the site?

rdns <- passive_dns("rpad.org")
rorig <- bulk_origin(rdns$results$resolve)

tbl_df(rdns$results) %>%
  type_convert() %>%
  select(firstSeen, resolve) %>%
  left_join(select(rorig, resolve=ip, as_name=as_name)) %>% 
  arrange(firstSeen) %>%
  print(n=100)
## # A tibble: 88 x 3
##              firstSeen        resolve                                              as_name
##                 <dttm>          <chr>                                                <chr>
##  1 2009-12-18 11:15:20  144.58.240.79      EPRI-PA - Electric Power Research Institute, US
##  2 2016-06-19 00:00:00 208.91.197.132 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  3 2016-07-29 00:00:00  208.91.197.27 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  4 2016-08-12 20:46:15  54.230.14.253                     AMAZON-02 - Amazon.com, Inc., US
##  5 2016-08-16 14:21:17  54.230.94.206                     AMAZON-02 - Amazon.com, Inc., US
##  6 2016-08-19 20:57:04  54.230.95.249                     AMAZON-02 - Amazon.com, Inc., US
##  7 2016-08-26 20:54:02 54.192.197.200                     AMAZON-02 - Amazon.com, Inc., US
##  8 2016-09-12 10:35:41   52.84.40.164                     AMAZON-02 - Amazon.com, Inc., US
##  9 2016-09-17 07:43:03  54.230.11.212                     AMAZON-02 - Amazon.com, Inc., US
## 10 2016-09-23 18:17:50 54.230.202.223                     AMAZON-02 - Amazon.com, Inc., US
## 11 2016-09-30 19:47:31 52.222.174.253                     AMAZON-02 - Amazon.com, Inc., US
## 12 2016-10-24 17:44:38  52.85.112.250                     AMAZON-02 - Amazon.com, Inc., US
## 13 2016-10-28 18:14:16 52.222.174.231                     AMAZON-02 - Amazon.com, Inc., US
## 14 2016-11-11 10:44:22 54.240.162.201                     AMAZON-02 - Amazon.com, Inc., US
## 15 2016-11-17 04:34:15 54.192.197.242                     AMAZON-02 - Amazon.com, Inc., US
## 16 2016-12-16 17:49:29   52.84.32.234                     AMAZON-02 - Amazon.com, Inc., US
## 17 2016-12-19 02:34:32 54.230.141.240                     AMAZON-02 - Amazon.com, Inc., US
## 18 2016-12-23 14:25:32  54.192.37.182                     AMAZON-02 - Amazon.com, Inc., US
## 19 2017-01-20 17:26:28  52.84.126.252                     AMAZON-02 - Amazon.com, Inc., US
## 20 2017-02-03 15:28:24   52.85.94.225                     AMAZON-02 - Amazon.com, Inc., US
## 21 2017-02-10 19:06:07   52.85.94.252                     AMAZON-02 - Amazon.com, Inc., US
## 22 2017-02-17 21:37:21   52.85.63.229                     AMAZON-02 - Amazon.com, Inc., US
## 23 2017-02-24 21:43:45   52.85.63.225                     AMAZON-02 - Amazon.com, Inc., US
## 24 2017-03-05 12:06:32  54.192.19.242                     AMAZON-02 - Amazon.com, Inc., US
## 25 2017-04-01 00:41:07 54.192.203.223                     AMAZON-02 - Amazon.com, Inc., US
## 26 2017-05-19 00:00:00   13.32.246.44                     AMAZON-02 - Amazon.com, Inc., US
## 27 2017-05-28 00:00:00    52.84.74.38                     AMAZON-02 - Amazon.com, Inc., US
## 28 2017-06-07 08:10:32  54.230.15.154                     AMAZON-02 - Amazon.com, Inc., US
## 29 2017-06-07 08:10:32  54.230.15.142                     AMAZON-02 - Amazon.com, Inc., US
## 30 2017-06-07 08:10:32  54.230.15.168                     AMAZON-02 - Amazon.com, Inc., US
## 31 2017-06-07 08:10:32   54.230.15.57                     AMAZON-02 - Amazon.com, Inc., US
## 32 2017-06-07 08:10:32   54.230.15.36                     AMAZON-02 - Amazon.com, Inc., US
## 33 2017-06-07 08:10:32  54.230.15.129                     AMAZON-02 - Amazon.com, Inc., US
## 34 2017-06-07 08:10:32   54.230.15.61                     AMAZON-02 - Amazon.com, Inc., US
## 35 2017-06-07 08:10:32   54.230.15.51                     AMAZON-02 - Amazon.com, Inc., US
## 36 2017-07-16 09:51:12 54.230.187.155                     AMAZON-02 - Amazon.com, Inc., US
## 37 2017-07-16 09:51:12 54.230.187.184                     AMAZON-02 - Amazon.com, Inc., US
## 38 2017-07-16 09:51:12 54.230.187.125                     AMAZON-02 - Amazon.com, Inc., US
## 39 2017-07-16 09:51:12  54.230.187.91                     AMAZON-02 - Amazon.com, Inc., US
## 40 2017-07-16 09:51:12  54.230.187.74                     AMAZON-02 - Amazon.com, Inc., US
## 41 2017-07-16 09:51:12  54.230.187.36                     AMAZON-02 - Amazon.com, Inc., US
## 42 2017-07-16 09:51:12 54.230.187.197                     AMAZON-02 - Amazon.com, Inc., US
## 43 2017-07-16 09:51:12 54.230.187.185                     AMAZON-02 - Amazon.com, Inc., US
## 44 2017-07-17 13:10:13 54.239.168.225                     AMAZON-02 - Amazon.com, Inc., US
## 45 2017-08-06 01:14:07  52.222.149.75                     AMAZON-02 - Amazon.com, Inc., US
## 46 2017-08-06 01:14:07 52.222.149.172                     AMAZON-02 - Amazon.com, Inc., US
## 47 2017-08-06 01:14:07 52.222.149.245                     AMAZON-02 - Amazon.com, Inc., US
## 48 2017-08-06 01:14:07  52.222.149.41                     AMAZON-02 - Amazon.com, Inc., US
## 49 2017-08-06 01:14:07  52.222.149.38                     AMAZON-02 - Amazon.com, Inc., US
## 50 2017-08-06 01:14:07 52.222.149.141                     AMAZON-02 - Amazon.com, Inc., US
## 51 2017-08-06 01:14:07 52.222.149.163                     AMAZON-02 - Amazon.com, Inc., US
## 52 2017-08-06 01:14:07  52.222.149.26                     AMAZON-02 - Amazon.com, Inc., US
## 53 2017-08-11 19:11:08 216.137.61.247                     AMAZON-02 - Amazon.com, Inc., US
## 54 2017-08-21 20:44:52  13.32.253.116                     AMAZON-02 - Amazon.com, Inc., US
## 55 2017-08-21 20:44:52  13.32.253.247                     AMAZON-02 - Amazon.com, Inc., US
## 56 2017-08-21 20:44:52  13.32.253.117                     AMAZON-02 - Amazon.com, Inc., US
## 57 2017-08-21 20:44:52  13.32.253.112                     AMAZON-02 - Amazon.com, Inc., US
## 58 2017-08-21 20:44:52   13.32.253.42                     AMAZON-02 - Amazon.com, Inc., US
## 59 2017-08-21 20:44:52  13.32.253.162                     AMAZON-02 - Amazon.com, Inc., US
## 60 2017-08-21 20:44:52  13.32.253.233                     AMAZON-02 - Amazon.com, Inc., US
## 61 2017-08-21 20:44:52   13.32.253.29                     AMAZON-02 - Amazon.com, Inc., US
## 62 2017-08-23 14:24:15 216.137.61.164                     AMAZON-02 - Amazon.com, Inc., US
## 63 2017-08-23 14:24:15 216.137.61.146                     AMAZON-02 - Amazon.com, Inc., US
## 64 2017-08-23 14:24:15  216.137.61.21                     AMAZON-02 - Amazon.com, Inc., US
## 65 2017-08-23 14:24:15 216.137.61.154                     AMAZON-02 - Amazon.com, Inc., US
## 66 2017-08-23 14:24:15 216.137.61.250                     AMAZON-02 - Amazon.com, Inc., US
## 67 2017-08-23 14:24:15 216.137.61.217                     AMAZON-02 - Amazon.com, Inc., US
## 68 2017-08-23 14:24:15  216.137.61.54                     AMAZON-02 - Amazon.com, Inc., US
## 69 2017-08-25 19:21:58  13.32.218.245                     AMAZON-02 - Amazon.com, Inc., US
## 70 2017-08-26 09:41:34   52.85.173.67                     AMAZON-02 - Amazon.com, Inc., US
## 71 2017-08-26 09:41:34  52.85.173.186                     AMAZON-02 - Amazon.com, Inc., US
## 72 2017-08-26 09:41:34  52.85.173.131                     AMAZON-02 - Amazon.com, Inc., US
## 73 2017-08-26 09:41:34   52.85.173.18                     AMAZON-02 - Amazon.com, Inc., US
## 74 2017-08-26 09:41:34   52.85.173.91                     AMAZON-02 - Amazon.com, Inc., US
## 75 2017-08-26 09:41:34  52.85.173.174                     AMAZON-02 - Amazon.com, Inc., US
## 76 2017-08-26 09:41:34  52.85.173.210                     AMAZON-02 - Amazon.com, Inc., US
## 77 2017-08-26 09:41:34   52.85.173.88                     AMAZON-02 - Amazon.com, Inc., US
## 78 2017-08-27 22:02:41  13.32.253.169                     AMAZON-02 - Amazon.com, Inc., US
## 79 2017-08-27 22:02:41  13.32.253.203                     AMAZON-02 - Amazon.com, Inc., US
## 80 2017-08-27 22:02:41  13.32.253.209                     AMAZON-02 - Amazon.com, Inc., US
## 81 2017-08-29 13:17:37 54.230.141.201                     AMAZON-02 - Amazon.com, Inc., US
## 82 2017-08-29 13:17:37  54.230.141.83                     AMAZON-02 - Amazon.com, Inc., US
## 83 2017-08-29 13:17:37  54.230.141.30                     AMAZON-02 - Amazon.com, Inc., US
## 84 2017-08-29 13:17:37 54.230.141.193                     AMAZON-02 - Amazon.com, Inc., US
## 85 2017-08-29 13:17:37 54.230.141.152                     AMAZON-02 - Amazon.com, Inc., US
## 86 2017-08-29 13:17:37 54.230.141.161                     AMAZON-02 - Amazon.com, Inc., US
## 87 2017-08-29 13:17:37  54.230.141.38                     AMAZON-02 - Amazon.com, Inc., US
## 88 2017-08-29 13:17:37 54.230.141.151                     AMAZON-02 - Amazon.com, Inc., US

Unfortunately, I expected this. The owner keeps moving it around on AWS infrastructure.
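A quick tally of the origin AS names (reusing the rorig lookup from above) makes that churn plain:

# summarise the 88 resolutions by origin AS (quick sketch)
count(rorig, as_name, sort = TRUE)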

So What?

This was an innocent link in a document on CRAN that went to a site that looked legit. A clever individual or organization found the dead domain and saw an opportunity to legitimize some fairly nasty stuff.

Now, I realize nobody is likely using “Rpad” anymore, but this type of situation can happen to any registered domain. If this individual or organization were doing more than trying to make objectionable content legit, they likely could have succeeded, especially if they enticed you with a shiny new devtools::install_…() link with promises of statistically sound animated cat emoji gif creation tools. They did an eerily good job of making this particular site still seem legit.

There’s nothing most folks can do to “fix” that site or have it removed. I’m not sure CRAN should remove the helpful PDF, but since it contains a clickable link to the now-hijacked site, suggesting removal (or an update) might be a good thing.

You’ll see that I used the splashr package (which has been submitted to CRAN but not there yet). It’s a good way to work with potentially malicious web content since you can “see” it and mine content from it without putting your own system at risk.

After going through this, I’ll see what I can do to put some bows on some of the devel-only packages and get them into CRAN so there’s a bit more assurance around using them.

I’m an army of one when it comes to fielding R-related security issues, but if you do come across suspicious items (like this or icky/malicious in other ways) don’t hesitate to drop me an @ or DM on Twitter.

There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format, which helped me throw together a small R package — msgxtractr — that can read in these ‘.msg’ files and produce a list as a result.

I had previously created a quick version of this by wrapping a Python module, but that’s a path fraught with peril and it did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces of C code to provide a singular function, read_msg(), that smashes open bottled-up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).

Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.

Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.

devtools::install_github("hrbrmstr/msgxtractr")

library(msgxtractr)

print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))

## List of 7
##  $ headers         :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  18 variables:
##   ..$ Return-path               : chr "<brizhou@gmail.com>"
##   ..$ Received                  :List of 1
##   .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n<brianzhou@me.com>; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Original-recipient        : chr "rfc822;brianzhou@me.com"
##   ..$ Received-SPF              : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
##   ..$ DKIM-Signature            : chr "v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
##   ..$ MIME-version              : chr "1.0"
##   ..$ X-Received                : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Date                      : chr "Mon, 18 Nov 2013 10:26:24 +0200"
##   ..$ Message-id                : chr "<CADtJ4eNjQSkGcBtVteCiTF+YFG89+AcHxK3QZ=-Mt48xygkvdQ@mail.gmail.com>"
##   ..$ Subject                   : chr "Test for TIF files"
##   ..$ From                      : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ To                        : chr "brianzhou@me.com"
##   ..$ Cc                        : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ Content-type              : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
##   ..$ Authentication-results    : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__
##   ..$ x-icloud-spam-score       : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
##   ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
##   ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
##  $ sender          :List of 2
##   ..$ sender_email: chr "brizhou@gmail.com"
##   ..$ sender_name : chr "Brian Zhou"
##  $ recipients      :List of 2
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brianzhou@me.com"
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brizhou@gmail.com"
##  $ subject         : chr "Test for TIF files"
##  $ body            : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
##  $ attachments     :List of 2
##   ..$ :List of 4
##   .. ..$ filename     : chr "importOl.tif"
##   .. ..$ long_filename: chr "import OleFileIO.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:969674] 49 49 2a 00 ...
##   ..$ :List of 4
##   .. ..$ filename     : chr "raisedva.tif"
##   .. ..$ long_filename: chr "raised value error.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:1033142] 49 49 2a 00 ...
##  $ display_envelope:List of 2
##   ..$ display_cc: chr "Brian Zhou"
##   ..$ display_to: chr "brianzhou@me.com"
## NULL

NOTE: Don’t try to read those TIFF images with magick or even the tiff package. The content seems to have some strange tags/fields. But, saving an attachment to disk (use writeBin()) and opening it with Preview (or your favorite image viewer) should work (it did for me; I converted the result to PNG for the original post).
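Here’s what that save-and-view step looks like (a sketch using the bundled sample message; the filename comes straight from the attachment metadata):

# write the first attachment's raw bytes to disk for an external viewer (sketch)
msg <- read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))

writeBin(
  msg$attachments[[1]]$content,
  msg$attachments[[1]]$long_filename
)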

I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document, and it usually does a good job, but there are some pages it fails miserably on since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.

Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.

What do I mean by “clean txt”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"utf-8\">\n <meta name=\"apple-mobile-web-app-capable\" content=\"yes\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" />\n <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black\" />\n <link rel=\"shortcut icon\" href=\"/assets/flat-ui/images/favicon.ico\">\n\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://ropensci.org/feed.xml\" />\n\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/bootstrap/css/bootstrap.css\">\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/css/flat-ui.css\">\n\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/icon-font.css\">\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/animations.css\">\n <link rel=\"stylesheet\" href=\"/static/css/style.css\">\n <link href=\"/assets/css/ss-social/webfonts/ss-social.css\" rel=\"stylesheet\" />\n <link href=\"/assets/css/ss-standard/webfonts/ss-standard.css\" rel=\"stylesheet\"/>\n <link rel=\"stylesheet\" href=\"/static/css/github.css\">\n <script type=\"text/javascript\" src=\"//use.typekit.net/djn7rbd.js\"></script>\n <script type=\"text/javascript\">try{Typekit.load();}catch(e){}</script>\n <script src=\"/static/highlight.pack.js\"></script>\n <script>hljs.initHighlightingOnLoad();</script>\n\n <title>Onboarding visdat, a tool for preliminary visualisation of whole dataframes</title>\n <meta name=\"keywords\" content=\"R, software, package, review, community, visdat, data-visualisation\" />\n <meta name=\"description\" content=\"\" />\n <meta name=\"resource_type\" content=\"website\"/>\n <!– RDFa Metadata (in DublinCore) –>\n <meta property=\"dc:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"dc:creator\" content=\"\" />\n <meta property=\"dc:date\" content=\"\" />\n <meta property=\"dc:format\" content=\"text/html\" />\n <meta property=\"dc:language\" content=\"en\" />\n <meta property=\"dc:identifier\" content=\"/blog/blog/2017/08/22/visdat\" />\n <meta property=\"dc:rights\" content=\"CC0\" />\n <meta property=\"dc:source\" content=\"\" />\n <meta property=\"dc:subject\" content=\"Ecology\" />\n <meta property=\"dc:type\" content=\"website\" />\n <!– RDFa Metadata (in OpenGraph) –>\n <meta property=\"og:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"og:author\" content=\"/index.html#me\" /> <!– Should be Liquid? URI? –>\n <meta property=\"http://ogp.me/ns/profile#first_name\" content=\"\"/>\n <meta property=\"http://ogp.me/ns/profile#last_name\" content=\"\"/>\n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module to just get the clean, article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)

That text is now in good enough shape to tidy.
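By “tidy” I mean, for example, something like this tidytext tokenization (an illustrative sketch; tidytext isn’t otherwise used in this post):

# tokenize the cleaned text for downstream analysis (illustrative sketch)
library(tidytext)

clean_txt <- doc$summary() %>% read_xml() %>% xml_text()

tibble::tibble(txt = clean_txt) %>%
  unnest_tokens(word, txt) %>%
  dplyr::count(word, sort = TRUE)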

Here’s the same version with clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files, it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding, solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are jwatjars (the bundled JWAT JAR files) and jwatr (the R functions that do the work).

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame':	67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.
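Since it’s a plain httr response, the usual accessors work on it, too (a small sketch):

# res behaves like any other httr response object
httr::status_code(res)
httr::headers(res)[["content-type"]]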

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later, and pretty easily, since the WARC file is read back in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                 <chr> "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                 <chr> "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type          <chr> "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                  <chr> "response", "response", "response"
## $ content_length             <dbl> 2984, 511564, 688
## $ payload_type               <chr> "application/json", "application/json", "application/json"
## $ profile                    <chr> NA, NA, NA
## $ date                       <dttm> 2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code           <dbl> 200, 200, 200
## $ http_protocol_content_type <chr> "application/json", "application/json", "application/json"
## $ http_version               <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers           <list> [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id             <chr> "<urn:uuid:2ae3e851-a1cf-466a-8f73-9681aab25d0c>", "<urn:uuid:77b30905-37f7-4c78-a120-2a008e194f94>",...
## $ payload                    <list> [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"                                   
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02" 

The URLs are all there, so it will be easier to map the original calls to them.
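If you want the parameters back out of those stored URIs, httr::parse_url() gets you most of the way there (a sketch):

# recover the query parameters from the stored request URIs (sketch)
map(xdf$target_uri, ~httr::parse_url(.x)$query)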

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of  9 variables:
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of  3 variables:
##   ..$ latitude : chr  "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of  2 variables:
##   ..$ longitude: chr  "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr  "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of  2 variables:
##   ..$ category: chr  NA NA NA NA ...
##   ..$ date    : chr  NA NA NA NA ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ month           : chr  "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of  9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of  3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of  2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of  2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record alongside each response record. But, I wanted to toss this out there to get feedback on the idiom and what possible desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.