UPDATE: curlconverter will now return (as the function return value) a working R function. See the README for examples.


When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:

[Screenshot: the LA Times NH Primary Live Results page]

Sometimes it’s as simple as opening up your browser’s “Developer Tools” console and looking for XHR (XMLHttpRequest) calls:

[Screenshot: XHR requests listed in the browser’s Developer Tools]

You can actually see a preview of those requests (usually JSON):

[Screenshot: Developer Tools preview of a request made by http://graphics.latimes.com/election-2016-new-hampshire-results/]

While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library, libcurl) that you can use to grab data from URIs. The RCurl and curl packages in R are built on that underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' \
  -H 'Pragma: no-cache' \
  -H 'DNT: 1' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' \
  -H 'Accept: */*' \
  -H 'Cache-Control: no-cache' \
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' \
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' \
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' \
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

  • straighten(): returns a list with all of the necessary parts to craft an httr POST or GET call
  • make_req(): returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).

Let’s show what happens using the election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still tediously).
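To make the manual route concrete, here is a minimal sketch of turning those list items into a request. It does not call curlconverter at all: parsed is a hand-built, abbreviated stand-in for the first element of the list shown above, and run_parsed() is a hypothetical helper name, not something the package provides.

```r
# Hand-built stand-in for straighten(REP)[[1]], abbreviated from the output above
parsed <- list(
  url = "http://graphics.latimes.com/election-2016-31146-feed.json",
  method = "get",
  headers = list(
    `X-Requested-With` = "XMLHttpRequest",
    Accept = "*/*",
    Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"
  ),
  cookies = list(
    s_fid = "79D97B8B22CA721F-2DD12ACE392FF3B2",
    s_cc = "true"
  )
)

# Flatten the named lists into the named character vectors that
# httr::add_headers(.headers = ) and httr::set_cookies(.cookies = ) accept
hdrs <- unlist(parsed$headers)
cks <- unlist(parsed$cookies)

# Hypothetical helper: nothing goes over the network until it is called
run_parsed <- function(p, headers, cookies) {
  httr::VERB(
    verb = toupper(p$method),
    url = p$url,
    httr::add_headers(.headers = headers),
    httr::set_cookies(.cookies = cookies)
  )
}

# res <- run_parsed(parsed, hdrs, cks)  # needs httr and a live endpoint
```

Every header and cookie still has to pass through your hands once, which is the tedium being referred to.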

curlconverter’s make_req() will try to do this conversion for you automagically using httr’s little-used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all those headers, but deleting the ones you don’t need is much easier than building the call by hand through trial and error. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do a lot of this “scraping without scraping”).
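As a rough sketch of what a pruned call might look like once wrapped in a function: fetch_rep_results() is a hypothetical name, and the choice to keep only the X-Requested-With and Referer headers is an assumption (this particular server may want more or fewer). It assumes httr and jsonlite are installed.

```r
# Hypothetical wrapper around a trimmed version of the generated VERB() call.
# Which headers the server actually requires is an assumption; add back any
# that turn out to be needed.
fetch_rep_results <- function() {
  res <- httr::VERB(
    verb = "GET",
    url = "http://graphics.latimes.com/election-2016-31146-feed.json",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest",
      Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"
    )
  )
  httr::stop_for_status(res)  # turn HTTP errors into R errors
  # Parse the JSON body into R lists / data frames
  jsonlite::fromJSON(httr::content(res, as = "text", encoding = "UTF-8"))
}

# Not run here, since the 2016 feed may no longer be online:
# rep <- fetch_rep_results()
# str(rep, max.level = 1)
```

If the trimmed call starts returning errors or empty responses, add the removed headers back one at a time until it behaves like the browser request again.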

You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).

8 Comments

  1. How do you execute the curl request once you have the make_req() object? Using curl_fetch_memory(REQ), or stop_for_status(REQ); content(REQ), it just doesn’t work with your example request.

    • it outputs source for a VERB() function that you have to source. I’m working on having it return a proper R function, but I envision most folks are just going to want the VERB() source so they can customize it.

    • Your desired functionality is now present in the most recent version (0.5.0)

  2. Using the cURL (copied from Chrome Dev Tools) below, I get the following error: Error in .f(.x[[i]], …) : attempt to apply non-function

    curl ‘https://www.orbitz.com/Hotel-Search?inpAjax=true&responsive=true’ -H ‘Cookie: AustinLocale=en_US; __gads=ID=c97a6fa8d58b98a5:T=1446198813:S=ALNI_MaMdY-7wbLEVOSu48PKgSBtgSxFYw; _br_uid_1=uid’%’3D6174891861156’%’3A; MC1=GUID=cbb90f129dc84368bf537dc83c597c48; abucket=CgUBEVag/9pUixzDBNj8Ag==; aspp=v.1,0|||||||||||||; tpid=v.1,70201; btpdb.ZsFRwSu.dGZjLjQ3MTMyMA=REFZUw; NSC_JOnfsaedefkaausbcxbjviedpem5qcq=ffffffff09c02b8245525d5f4f58455e445a4a422f1b; logging=CED7103AB7B82130||wl-103.cpwm.orbitz.net; iEAPID=0,; NSC_ufbmfbg.tel.80_dt_ufbmfbg=ffffffff09c038d345525d5f4f58455e445a4a4217b9; AMCV_C00802BE5330A8350A490D4C’%’40AdobeOrg=793872103’%’7CMCIDTS’%’7C16843’%’7CMCMID’%’7C27210446768186842151111767959025392376’%’7CMCAAMLH-1455530150’%’7C6’%’7CMCAAMB-1455798672’%’7CNRX38WO0n5BH8Th-nqAG_A’%’7CMCAID’%’7CNONE; lsrc=v.1,02/25/2016; cesc=’%’7B’%’7D; s_cc=true; utag_main=v_id:015264e7595f0022a99cd72e534805073003606b00ac2$_sn:4$_ss:0$_st:1455196627017$_pn:3’%’3Bexp-session$ses_id:1455193874041’%’3Bexp-session; __utma=263700051.856429117.1454925354.1454925354.1455193874.2; __utmc=263700051; __utmz=263700051.1454925354.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=FE37EDE82B901E08; NSC_xxx.pscjua.dpn.443_gxe=ffffffff09e4087545525d5f4f58455e445a4a42378b; mbox=PC#1453474061720-408766.26_22#1457792937|session#1455200936467-458533#1455202797; IPE_S_112084=0; JSESSION=b3766587-1ed2-4cc3-bdfa-bd6b1580a3ef; _br_uid_2=uid’%’3D6174891861156’%’3A’%’3A_uid’%’3D6174891861156’%’3A; _gat_test=1; _ga=GA1.2.844953304.1446198810; _gat=1; NSC_xxx.pscjua.dpn.80_gxe=ffffffff09e3087545525d5f4f58455e445a4a423660; WT_FPC=id=7e5108f6-d0e9-43db-bb2b-cb93ce0d0a32:lv=1455175778685:ss=1455175737631; curr=USD; anon=13f81c3c-527e-4cec-b118-4eb72e1adafc; linfo=v.4,|0|0|255|1|0||||||||1033|0|0||0|0|0|-1|-1’ -H ‘Origin: https://www.orbitz.com‘ -H ‘Accept-Encoding: gzip, deflate’ -H ‘Accept-Language: de,en-US;q=0.8,en;q=0.6’ -H ‘User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36’ -H ‘Content-type: application/x-www-form-urlencoded’ -H ‘Accept: application/json, text/javascript’ -H ‘Referer: https://www.orbitz.com/Hotel-Search?’ -H ‘X-Requested-With: XMLHttpRequest’ -H ‘Connection: keep-alive’ –data ‘destination=Toulouse’%’2C+France&startDate=03/30/2016&endDate=04/01/2016&adults=2&children=&hotelName=&star=&regionId=3475&hashParam=’ –compressed

    • can you file this as an issue? https://github.com/hrbrmstr/curlconverter/issues

    • but given “access, monitor or copy any content or information of this Website using any robot, spider, scraper or other automated means or any manual process for any purpose without our express written permission;” on their Terms of Use I’m probably not going to delve into this until I get a legit cURL. From a quick diagnosis, it’s an issue with the underlying JavaScript library.

    • and, it didn’t work for me initially since WordPress turned the quotes into “smart” quotes. I just went to their site and did a search and pulled a similar URL (with the most recent version of the pkg) and it generated a perfectly usable function.


One Trackback/Pingback

  1. […] article was first published on R – rud.is, and kindly contributed to […]
