R⁶ — Scraping Images To PDFs

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that list all the impending Radio Shack store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download all the images.

CNN’s ToS prevent automated data gathering from CNN proper. But they hosted these images on DocumentCloud (note the assets.documentcloud.org URLs), which, from a quick glance at its ToS, has no similar restrictions. That means you get an R⁶ post on how to grab the 38 individual images and combine them into one PDF. I did all this in the hope of OCRing the text, which has not panned out too well, since the image quality and font were likely chosen deliberately to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

- using sprintf() to take a template and build a vector of URLs
- using dplyr progress bars
- customizing httr verb options to ensure you can get to content
- using purrr to iterate through a process of turning raw image bytes into image content (via magick) and turning a list of images into a PDF
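
As a quick standalone illustration of the first item: sprintf() is vectorized, so one template plus a vector of page numbers yields one URL per page. This is only a sketch; the name page_urls is a hypothetical one chosen for the demo.

```r
# sprintf() recycles the template across 1:38, substituting each
# page number for the %s placeholder
url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

page_urls <- sprintf(url_template, 1:38)  # page_urls is a hypothetical name

length(page_urls)   # one URL per page
head(page_urls, 2)  # inspect the first two URLs
```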

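library(httr)
library(magick)
library(tidyverse)

# %s is replaced by the page number (1 through 38)
url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

# dplyr progress bar with one tick per page
pb <- progress_estimated(38)

# Fetch every page, sending browser-like headers with each request
sprintf(url_template, 1:38) %>% 
  map(~{
    pb$tick()$print()
    GET(url = .x, 
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8", 
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html", 
          authority = "assets.documentcloud.org"))
  }) -> store_list_pages

# Pull the raw bytes out of each response, read them as images,
# join them into one multi-frame image, and write that out as a PDF
map(store_list_pages, content, as = "raw") %>% 
  map(image_read) %>% 
  reduce(image_join) %>% 
  image_write("combined_pages.pdf", format = "pdf")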
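
To try the join-and-write step in isolation, without touching the network, here is a minimal sketch that uses blank placeholder images in place of the downloaded GIFs. The image sizes, colors, and output path are assumptions made up for the demo. As far as the magick documentation indicates, image_join() also accepts a list of images directly, which is an alternative to reduce(image_join).

```r
library(magick)

# Three blank placeholder "pages" standing in for the downloaded GIFs
# (sizes and colors are arbitrary, chosen only for this demo)
pages <- list(
  image_blank(200, 300, color = "#FFFFFF"),
  image_blank(200, 300, color = "#EEEEEE"),
  image_blank(200, 300, color = "#DDDDDD")
)

# image_join() combines the list into one multi-frame image
combined <- image_join(pages)
length(combined)  # number of frames

# Each frame becomes one page when the image is written as a PDF
out_file <- file.path(tempdir(), "demo_pages.pdf")  # hypothetical output path
image_write(combined, out_file, format = "pdf")
file.exists(out_file)
```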

I figured out the DocumentCloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.
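
One way to check such options before looping over all 38 pages is to build the headers once and try a single request. The following is a hedged sketch, not the original workflow: img_headers and fetch_page() are hypothetical names, and the stop_for_status() check is an addition for illustration.

```r
library(httr)

# Headers copied from the browser request, built once so they can be reused
img_headers <- add_headers(
  accept = "image/webp,image/apng,image/*,*/*;q=0.8",
  referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html",
  authority = "assets.documentcloud.org"
)

# Hypothetical helper: fetch one page and return its raw bytes,
# raising an R error if the server answers with an HTTP error status
fetch_page <- function(page_url, headers = img_headers) {
  res <- GET(url = page_url, headers)
  stop_for_status(res)
  content(res, as = "raw")
}

# Usage (needs network access):
# bytes <- fetch_page("<URL of the first page image>")
```

If a single call returns image bytes, the same headers should be safe to reuse across the full vector of URLs.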

If any academic-y folks have a ~~test subject~~summer intern with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you’d have my eternal thanks.

2 Comments

- timothyjkiely
- Posted 2017-06-05 at 09:57
Great post! Do you have another post or maybe some reference materials about analyzing and choosing the correct parameters for add_headers()? You mention Chrome Developer Tools and your curlconverter package. Thanks!
- Matt Stiles
- Posted 2017-06-06 at 01:19
That’s DocumentCloud, it appears. Anyway, all the geocoded data are here: http://i.cdn.turner.com/money/interactive/technology/radio-shack-closings-map/scripts/data.js?Sdfdsf

rud.is