Access the Internet Archive Advanced Search/Scrape API with wayback (+ links to a new vignette & pkgdown site)

The wayback? package has had an update to more efficiently retrieve mementos and added support for working with the Internet Archive’s advanced search+scrape API.

Search/Scrape

The search/scrape interface lets you examine the IA collections and download what you are after (programmatically). The main function is ia_scrape() but you can also paginate through results with the helper functions provided.

To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First, we need to search the collection then choose a target URL to retrieve and finally download it. The identifier is the key element to ensure we can retrieve the information about a particular collection.

library(wayback)

nasa <- ia_scrape("collection:nasa", count=100L)

tibble:::print.tbl_df(nasa)
## # A tibble: 100 x 3
##    identifier addeddate            title                                       
##                                                                 
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit         
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit           
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit                 
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit               
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant                       
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit                      
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise           
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop                        
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit                        
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))

## # A tibble: 6 x 4
##   file                       link                                                               last_mod          size 
## 1 00-042-154.jpg             https://archive.org/download/00-042-154/00-042-154.jpg             06-Nov-2000 15:34 1.2M 
## 2 00-042-154_archive.torrent https://archive.org/download/00-042-154/00-042-154_archive.torrent 06-Jul-2018 11:14 1.8K 
## 3 00-042-154_files.xml       https://archive.org/download/00-042-154/00-042-154_files.xml       06-Jul-2018 11:14 1.7K 
## 4 00-042-154_meta.xml        https://archive.org/download/00-042-154/00-042-154_meta.xml        03-Jun-2016 02:06 1.4K 
## 5 00-042-154_thumb.jpg       https://archive.org/download/00-042-154/00-042-154_thumb.jpg       26-Aug-2009 16:30 7.7K 
## 6 __ia_thumb.jpg             https://archive.org/download/00-042-154/__ia_thumb.jpg             06-Jul-2018 11:14 26.6K

download.file(item$link[1], file.path("man/figures", item$file[1]))

I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.

The API is still not sealed in stone, so you're encouraged to submit questions/suggestions.

FIN

The vignette is embedded below and frame-busted here. It covers a very helpful and practical use-case identified recently by an OP on StackOverflow.

There's also a new pkgdown-gen'd site for the package.

Issues & PRs welcome at your community coding site of choice.

Cover image from Data-Driven Security
Amazon Author Page

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.