Speeding Up Digital Arachnids

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar.

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but”, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my forthcoming “crawl delay” post).

I needed something faster for my bulk Crawl-Delay analysis, which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package, spiderbar, which should be hitting a CRAN mirror near you soon. (CRAN, rightly so, did not like the unoriginal name rep.)

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

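# Fetch imdb.com's robots.txt once so both packages parse the same text
rob <- get_robotstxt("imdb.com")

microbenchmark(

  # robotstxt: parse, then read the crawl_delay element of the result
  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  # spiderbar: parse with the C++ backend, then extract the delays
  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1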

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.
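To make the bulk case a bit more concrete, here is a minimal sketch of the parse-then-extract loop. The two robots.txt bodies and their host names are made up for illustration; in a real run they would come out of WARC records, and that extraction step is not shown here.

```r
library(spiderbar)

# Hypothetical stand-ins for robots.txt bodies pulled out of WARC records
robots_bodies <- list(
  "example-a.test" = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n",
  "example-b.test" = "User-agent: *\nDisallow:\n"
)

# Parse each body once, then pull out its per-agent Crawl-delay table
delays <- lapply(robots_bodies, function(txt) crawl_delays(robxp(txt)))

delays
```

Each element of `delays` is whatever `crawl_delays()` returns for that one file, keyed by the (invented) host name, so the results can be combined or summarised afterwards however the analysis needs.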

But, I don’t care about `Crawl-Delay`!

OK, fine. Do you care about fetchability? We can speed that up, too!

# Parse once up front with each package so only the lookup is timed
rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")
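If you want to poke at can_fetch() without touching the network, a hand-written robots.txt works just as well. This is only a small sketch: the rules and the `examplebot` user agent are invented, and the third call assumes can_fetch() takes a user agent string after the path (with `"*"` as the default).

```r
library(spiderbar)

# A made-up robots.txt: everyone is kept out of /private/,
# and the hypothetical "examplebot" is kept out of everything
robots_txt <- paste(
  "User-agent: *",
  "Disallow: /private/",
  "",
  "User-agent: examplebot",
  "Disallow: /",
  sep = "\n"
)

rt <- robxp(robots_txt)

open_page   <- can_fetch(rt, "/public/index.html")               # default agent
closed_page <- can_fetch(rt, "/private/report.html")             # default agent
bot_page    <- can_fetch(rt, "/public/index.html", "examplebot") # named agent

c(open_page, closed_page, bot_page)
```

The object returned by robxp() is parsed once and can be queried as many times as needed, which is the same split the benchmark above relies on.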

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")
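Since can_fetch() accepts a vector of paths, the logical vector it hands back can be used to filter a crawl queue directly. A minimal sketch, with a made-up robots.txt and made-up paths:

```r
library(spiderbar)

# Hypothetical rules: only /private/ is off limits
rt <- robxp("User-agent: *\nDisallow: /private/\n")

# Hypothetical crawl queue
queue <- c("/public/a.html", "/private/b.html", "/about")

# Keep only the paths the rules allow
allowed <- queue[can_fetch(rt, queue)]

allowed
```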

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)


rud.is