Identify & Analyze Web Site Tech Stacks With rappalyzer

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many websites share similar “tech stacks”: the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities),…

Reticulating Readability

I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove everything that is not “main text content” from an HTML document. It usually does a good job, but there are some pages…

R⁶ — Scraping Images To PDFs

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on the topic. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that list all the impending Radio Shack store closings. It’s the most comprehensive list I’ve found, but…

Scrapeover Friday — a.k.a. Another R Scraping Makeover

I caught a glimpse of a tweet by @dataandme on Friday: “Using R & rvest to explore Malaysian property mkt: "Web Scraping: The Sequel, Propwall.my" https://t.co/daZOOJJfPN #rstats #rvest pic.twitter.com/u6QMhm4M3e” (Mara Averick, @dataandme, May 5, 2017). Mara is, without a doubt, the best data science promoter in the Twitterverse. She seems to have…