Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many websites share similar “tech stacks”: the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities),… Continue reading
Post Category → web scraping
Speeding Up Digital Arachnids
spiderbar, spiderbar
Reads robots rules from afar.
Crawls the web, any size;
Fetches with respect, never lies.
Look Out!
Here comes the spiderbar.
Is it fast? Listen bud,
It’s got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there
There goes the spiderbar.
(Check the end… Continue reading
Pirating Web Content Responsibly With R
International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no ‘rrrrrr’ abuse in this post,… Continue reading
Reticulating Readability
I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document. It usually does a good job, but there are some pages… Continue reading
Caching httr Requests? This means WAR[C]!
I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on GitHub to work with WARC files, it’s a tad… Continue reading
Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R
One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats // Ethics in Web Scraping https://t.co/y5YxvzB8Fd — boB Rudis (@hrbrmstr) July 26, 2017 If you load that up that… Continue reading
R⁶ — Scraping Images To PDFs
I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but… Continue reading
Scrapeover Friday — a.k.a. Another R Scraping Makeover
I caught a glimpse of a tweet by @dataandme on Friday: Using R & rvest to explore Malaysian property mkt: "Web Scraping: The Sequel, Propwall.my" https://t.co/daZOOJJfPN #rstats #rvest pic.twitter.com/u6QMhm4M3e — Mara Averick (@dataandme) May 5, 2017 Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have… Continue reading