spiderbar, spiderbar
Reads robots rules from afar.
Crawls the web, any size;
Fetches with respect, never lies.
Look out! Here comes the spiderbar.

Is it fast? Listen bud,
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there
There goes the spiderbar.
(Check the end of the post if you don’t recognize the lyrical riff.)
Face front, true believer!
I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.
But (there’s always a “but”, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my forthcoming “crawl delay” post).
I needed something faster for my bulk Crawl-Delay analysis, which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package, spiderbar, which is now hitting a CRAN mirror near you, soon. (CRAN, rightly so, did not like the unoriginal name rep.)
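If you want a quick taste before the benchmarks, the whole spiderbar workflow boils down to parsing once and querying the result. This is just a minimal sketch using the same functions exercised below, with robotstxt’s get_robotstxt() doing the fetching:

library(robotstxt)   # get_robotstxt() retrieves a site's robots.txt
library(spiderbar)   # robxp(), can_fetch(), crawl_delays()

rob <- get_robotstxt("imdb.com")   # fetch the rules once
rx  <- robxp(rob)                  # parse them into a query-able object

can_fetch(rx, "/Vote")   # TRUE == alright
crawl_delays(rx)         # any Crawl-delay entries declared in the file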
How much faster?
I’m glad you asked!
Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?
library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

# grab imdb.com's robots.txt once so both packages parse identical content
rob <- get_robotstxt("imdb.com")

microbenchmark(
  robotstxt = {
    x <- parse_robotstxt(rob)   # full parse via the robotstxt package
    x$crawl_delay
  },
  spiderbar = {
    y <- robxp(rob)             # parse via the C++-backed spiderbar object
    crawl_delays(y)
  }
) -> mb1
update_geom_defaults("violin", list(colour = "#4575b4", fill = "#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
As you can see, it’s just a tad bit faster.
Now, you won’t notice that temporal gain in an interactive context, but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.
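For flavor, here’s a rough sketch of what that bulk pass can look like. It assumes you’ve already pulled the robots.txt response bodies out of the WARC records into a character vector (the WARC-reading step is omitted and robots_bodies is a made-up stand-in); all it does is map robxp() + crawl_delays() over each body, with map_df() coming along for free since tidyverse is already loaded above.

# robots_bodies: character vector of robots.txt contents extracted from the
# WARC responses beforehand (hypothetical stand-in for a few million entries)
robots_bodies <- c(rob)

# parse each body with spiderbar and stack all the Crawl-delay entries
delays <- map_df(robots_bodies, ~ crawl_delays(robxp(.x)))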
But, I don’t care about Crawl-Delay!
OK, fine. Do you care about fetchability? We can speed that up, too!
# parse once up front so we only time the "is this path allowed?" check
rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(
  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },
  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }
) -> mb2
autoplot(mb2) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for testing resource 'fetchability'",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
Vectorized or it didn’t happen.
(Gosh, even Spider-Man got more respect!)
OK, this is a tough crowd, but we’ve got vectorization covered as well:
# same check, but across several paths in one call
microbenchmark(
  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },
  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }
) -> mb3
autoplot(mb3) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
Excelsior!
Peter’s package does more than this one, since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.
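As a rough illustration of that richer output (using only objects created above), the list parse_robotstxt() returns includes the two data frames the benchmarks poked at; the exact set of components can vary by package version, so treat this as a sketch rather than a reference:

str(rob_txt$permissions)   # per-agent permission rules as a data frame
str(rob_txt$crawl_delay)   # Crawl-delay entries as a data frame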
You can check out the code and leave package questions or comments on GitHub.
(Hrm… Peter Parker was Spider-Man and Peter Meissner wrote robotstxt, which is all about spiders. Coincidence?! I think not!)