I’ve blogged a bit about robots.txt, the rules file that documents a site’s “robots exclusion” standard and instructs web crawlers on what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and is often ignored by crawlers… Continue reading
Posts Tagged → post
Tragic Documentation
NOTE: If the usual aggregators are picking this up and there are humans curating said aggregators, this post is/was not intended as something to go into the “data science” aggregation sites. It is just personal commentary with code, in the event someone stumbles across it and wants to double-check me. These “data-dives” help me cope with… Continue reading
Retrieve & process TV News chyrons with newsflash
The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available…. Continue reading
Identify & Analyze Web Site Tech Stacks With rappalyzer
Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities),… Continue reading
SODD — StackOverflow Driven-Development
I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality into a package (if warranted)…. Continue reading
Speeding Up Digital Arachnids
spiderbar, spiderbar / Reads robots rules from afar. / Crawls the web, any size; / Fetches with respect, never lies. / Look Out! / Here comes the spiderbar. / Is it fast? / Listen bud, / It’s got C++ under the hood. / Can you scrape, from a site? / Test with can_fetch(), TRUE == alright / Hey, there / There goes the spiderbar. (Check the end… Continue reading
Pirating Web Content Responsibly With R
International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no ‘rrrrrr’ abuse in this post,… Continue reading
Mapping Fall Foliage with sf
I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet: “Where to see the best fall foliage, based on your location: https://t.co/12pQU29ksB pic.twitter.com/JiywYVpmno” (Vox, @voxdotcom, September 18, 2017). Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild #rstats web, here’s a short… Continue reading