
Category Archives: web scraping

By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the universe. Since the news brought this practice to global attention, I found some aggregate U.S. government data and made a quick, annual look at it soon after the incident:

While informative, that visualization left me wanting more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I caught a quick glance at a tweet (a link to it in the comments would be much appreciated if anyone favorited it) that had a screen capture from a PDF on the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report which can be derived from the annual index pages. I crafted the URL scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarking. But, I wrote the code and it’ll likely be useful to someone out there so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages.
# do this separately from the URL grabbing above and the PDF URL extraction
# below just to be even safer.
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result so ignore them and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# was built-in functionality to download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge and the PDF does account for that (to some degree), but I’m not writing a news story and only care about the airlines with three years of data since, for the most part, I have only ever flown on ones in that list. So, the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate per 10K passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data are in this gist and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

library(splashr)
library(htmlwidgets)  # for saveWidget()
library(DiagrammeR)   # for the example widget below

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

But if you use the new Docker image and the add_tempdir=TRUE parameter it can render any local HTML file.

The other new bits are helpers to identify content types in the HAR entries. Along with get_content_type():

library(tidyverse)

map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA

there are many is_...() functions for logical tests.
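
For a quick overview, those content types can also be tallied; here’s a small sketch that uses only get_content_type() (shown above) plus base R:

# tally the content types across the HAR entries (rud_har from above)
map_chr(rud_har$log$entries, get_content_type) %>%
  table(useNA = "ifany") %>%
  sort(decreasing = TRUE)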

But, one of the more interesting is_() functions is is_xhr(). Sites with dynamic content usually load said content via an XMLHttpRequest or XHR for short. Modern web apps usually return JSON in said requests and, for questions like this one on StackOverflow it’s usually better to grab the JSON and use it for data than it is to scrape the table made from JavaScript calls.

Now, it’s not too hard to open Developer Tools and find those XHR requests, but we can also use splashr to programmatically find them. We have to do a bit more work and use the new execute_lua() function since we need to give the page time to load up all the data. (I’ll eventually write a mini-R-DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks). Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{\"practice_name\":\"Pediatrics, Inc.\", \"address_1\":\"577 Main Street\", \"city\":&q"| __truncated__ "{\"practice_name\":\"Crown Colony Pediatrics\", \"address_1\":\"500 Congress Street, Suite 1F\""| __truncated__ "{\"practice_name\":\"Pediatricians Inc.\", \"address_1\":\"955 Main Street\", \"city\":"| __truncated__ "{\"practice_name\":\"Lexington Pediatrics\", \"address_1\":\"19 Muzzey Street, Suite 105\", &qu"| __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()

UPDATE So, I wrote a mini-DSL for this:

splash_local %>%
  splash_response_body(TRUE) %>% 
  splash_go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D") %>% 
  splash_wait(2) %>% 
  splash_har() -> res

which should make it easier to perform basic “go-wait-retrieve” operations.

It’s unlikely we want to rely on a running Splash instance for our production work, so I’ll be making a helper function to turn HAR XHR requests into httr function calls, similar to the way curlconverter works.
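
That helper doesn’t exist yet, but here’s a rough sketch of the idea, using the pg HAR object and is_xhr() from above plus httr’s VERB(); it assumes the entries follow the standard HAR layout (request$method, request$url):

library(httr)

# find the first XHR entry in the HAR and re-issue its request with httr
xhr_idx <- which(map_lgl(pg$log$entries, is_xhr))[1]
xhr_req <- pg$log$entries[[xhr_idx]]$request

xhr_res <- VERB(verb = toupper(xhr_req$method), url = xhr_req$url)
jsonlite::fromJSON(content(xhr_res, as = "text"))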

If you do enough web scraping, you’ll eventually hit a wall that the trusty httr verbs (that sit beneath rvest) cannot really overcome: dynamically created content (via JavaScript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still stick with httr verbs — if you can figure out what those requests are — and code up the right parameters (browser “Developer Tools” menus/views and my curlconverter package are super handy for this). Unfortunately, some sites require actual in-page rendering and that’s when scraping turns into a modest chore.

For dynamic sites, the RSelenium and/or seleniumPipes packages are super-handy tools to have in the toolbox. They interface with Selenium which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as phantomjs. Getting the server component of Selenium running was often a source of pain for R folks, but the new docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.

However, sometimes all you need is the rendering part and for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in Python and uses QT WebKit for rendering. To avoid deluging your system with all of the Splash dependencies, you can use the Docker images. In fact, I made it dead easy to do so. Read on!

Going for a dip

The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joined in the fun to add some tweaks & enhancements to it that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with it in splashr, I wanted to make it as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor") you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:

library(splashr)

install_splash()
splash_svr <- start_splash()

The install_splash() function will pull the correct image to your local system and you’ll need that splash_svr object later on to stop the container. Now, you can have Splash running on any host, but this post assumes you’re running it locally.

We can test to see if the server is active:

splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008

Now, we’re ready to scrape!

We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial since it doesn’t use XHR but does require rendering and it doesn’t prohibit scraping in the Terms of Service (don’t violate Terms of Service, it is both unethical and could get you blocked, fined or worse).

Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:

You’re saying “HEY. That has <table> in the HTML so why not just use rvest?” Well, you can validate the lack of <table>s in the “view source” view of the page or with:

library(rvest)

pg <- read_html("http://www.techstars.com/companies/")
html_nodes(pg, "table")
## {xml_nodeset (0)}

Now, let’s do it with splashr:

splash("localhost") %>% 
  render_html("http://www.techstars.com/companies/", wait=5) -> pg
  
html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table class="table75"><tbody>\n<tr>\n<th>Status</th>\n        <th>Number of Com ...
##  [2] <table class="table75"><tbody>\n<tr>\n<th colspan="2">Impact</th>\n      </tr>\n ...
##  [3] <table class="table75"><tbody>\n<tr>\n<th>Class</th>\n        <th>#Co's</th>\n   ...
##  [4] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Anywhere 2017 Q1</th>\ ...
##  [5] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Atlanta 2016 Summer</t ...
##  [6] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2013 Fall</th>\ ...
##  [7] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2014 Summer</th ...
##  [8] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2015 Spring</th ...
##  [9] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2016 Spring</th ...
## [10] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2014</th>\n   ...
## [11] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2015 Spring</ ...
## [12] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2016 Winter</ ...
## [13] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Cape Town 201 ...
## [14] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2015 Summ ...
## [15] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2016 Summ ...
## [16] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Tel Aviv 2016 ...
## [17] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2015 Summer</th ...
## [18] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2016 Summer</th ...
## [19] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2009 Spring</th ...
## [20] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2010 Spring</th ...
## ...

We need to set the wait parameter (5 seconds was likely overkill) to give the JavaScript callbacks time to run. Now you can go crazy turning that into data.
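
For instance, here’s a small sketch of that last step; it assumes the third table75 node shown above is the “Summary by Class” table:

# turn the "Summary by Class" <table> into a data frame with rvest
summary_by_class <- html_table(html_nodes(pg, "table.table75")[[3]], header = TRUE)
head(summary_by_class)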

Candid Camera

You can also take snapshots (pictures) of websites with splashr, like this (apologies if you start drooling on your keyboard):

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")

The snapshot functions return magick objects, so you can do anything you’d like with them.
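
For example (a quick sketch, using the magick package those objects come from), you could write the snapshot to disk or scale it down to a thumbnail:

library(magick)

splash("localhost") %>%
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x") -> p5x_shot

image_write(p5x_shot, path = "p5x.png", format = "png")  # save to disk
image_scale(p5x_shot, "400")                             # 400px-wide thumbnail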

HARd Work

Since Splash is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools package to visualize and further analyze the data. For the sake of brevity, here’s just the main print() output from a site:

splash("localhost") %>% 
  render_har("https://www.r-bloggers.com/")

## --------HAR VERSION-------- 
## HAR specification version: 1.2 
## --------HAR CREATOR-------- 
## Created by: Splash 
## version: 2.3.1 
## --------HAR BROWSER-------- 
## Browser: QWebKit 
## version: 538.1 
## --------HAR PAGES-------- 
## Page id: 1 , Page title: R-bloggers | R news and tutorials contributed by (750) R bloggers 
## --------HAR ENTRIES-------- 
## Number of entries: 130 
## REQUESTS: 
## Page: 1 
## Number of entries: 130 
##   -  https://www.r-bloggers.com/ 
##   -  https://www.r-bloggers.com/wp-content/themes/magazine-basic-child/style.css 
##   -  https://www.r-bloggers.com/wp-content/plugins/mashsharer/assets/css/mashsb.min.cs... 
##   -  https://www.r-bloggers.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?... 
##   -  https://www.r-bloggers.com/wp-content/plugins/jetpack/css/jetpack.css?ver=4.4.2 
##      ........ 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/10579991_10152371745729891_26331957... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/14962601_10210947974726136_38966601... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/c0.8.50.50/p50x50/311082_286149511398044_4... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/11046696_917285094960943_6143235831... 
##   -  https://static.xx.fbcdn.net/rsrc.php/v3/y2/r/0iTJ2XCgjBy.png

FIN

You can also do some basic scripting in Splash with Lua. Coding up an interface to that capability is on the TODO, as is adding final tests and enabling tweaks to the Docker configurations to support more fun things that Splash can do.

File an issue on github if you have feature requests or problems and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.

Don’t forget to stop_splash(splash_svr) when you’re finished scraping!

UPDATE curlconverter will now return (as the function return value) a working R function. See the README for examples.


When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:


Sometimes it’s as simple as opening up your browser’s “Developer Tools” console and looking for XHR (XML HTTP Request) calls:


You can actually see a preview of those requests (usually JSON):


While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The RCurl and curl packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' \
  -H 'Pragma: no-cache' \
  -H 'DNT: 1' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' \
  -H 'Accept: */*' \
  -H 'Cache-Control: no-cache' \
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' \
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' \
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' \
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

  • straighten() : which returns a list with all of the necessary parts to craft an httr POST or GET call
  • make_req() : which actually returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).

Let’s show what happens using the election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still tediously).
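
For instance, a sketch of that manual route (using the fields shown in the JSON above) might look like:

library(httr)

req <- resp[[1]]

GET(req$url,
    do.call(add_headers, as.list(unlist(req$headers))),
    do.call(set_cookies, as.list(unlist(req$cookies))))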

curlconverter’s make_req() will try to do this conversion for you automagically using httr’s little-used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all those headers, but it’s much easier to delete the ones you don’t need than to build the request by trial and error. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do a lot of this “scraping without scraping”).

You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).

As I was putting together the [coord_proj](https://rud.is/b/2015/07/24/a-path-towards-easier-map-projection-machinations-with-ggplot2/) ggplot2 extension I had posted a [gist](https://gist.github.com/hrbrmstr/363e33f74e2972c93ca7) that I shared on Twitter. Said gist received a comment (several, in fact) and a bunch of us were painfully reminded of the fact that there is no built-in way to receive notifications from said comment activity.

@jennybryan posited that it could be possible to use IFTTT as a broker for these notifications, but after some checking that ended up not being directly doable since there are no “gist comment” triggers to act upon in IFTTT.

There are a few standalone Ruby gems that programmatically retrieve gist comments but I wasn’t interested in managing a Ruby workflow [ugh]. I did find a Heroku-hosted service – https://gh-rss.herokuapp.com/ – that will turn gist comments into an RSS/Atom feed (based on Ruby again). I gave it a shot and hooked it up to IFTTT but my feed is far enough down on the food chain there that it never gets updated. It was possible to deploy that app on my own Heroku instance, but—again—I’m not interested in managing a Ruby workflow.

The Ruby scripts pretty much:

– grab your main gist RSS/Atom feed
– visit each gist in the feed
– extract comments & comment metadata from them (if any)
– return a composite data structure you can do anything with

That’s super-easy to duplicate in R, so I decided to build a small R script that does all that and generates an RSS/Atom file which I added to my Feedly feeds (I’m pretty much always scanning RSS, so really didn’t need the IFTTT notification setup). I put it into a `cron` job that runs every hour. When Feedly refreshes the feed, a new entry will appear whenever there’s a new comment.

The script is below and [on github](https://gist.github.com/hrbrmstr/0ad1ced217edd137de27) (ironically as a gist). Here’s what you’ll grok from the code:

– one way to deal with the “default namespace” issue in R+XML
– one way to deal with error checking for scraping
– how to build an XML file (and, specifically, an RSS/Atom feed) with R
– how to escape XML entities with R
– how to get an XML object as a character string in R

You’ll definitely need to tweak this a bit for your own setup, but it should be a fairly complete starting point for you to work from. To see the output, grab the [generated feed](http://dds.ec/hrbrmstrgcfeed.xml).

# Roll your own GitHub Gist Comments Feed in R
 
library(xml2)    # github version
library(rvest)   # github version
library(stringr) # for str_trim & str_replace
library(dplyr)   # for data_frame & bind_rows
library(pbapply) # free progress bars for everyone!
library(XML)     # to build the RSS feed
 
who <- "hrbrmstr" # CHANGE ME!
 
# Grab the user's gist feed -----------------------------------------------
 
gist_feed <- sprintf("https://gist.github.com/%s.atom", who)
feed_pg <- read_xml(gist_feed)
ns <- xml_ns_rename(xml_ns(feed_pg), d1 = "feed")
 
# Extract the links & titles of the gists in the feed ---------------------
 
links <-  xml_attr(xml_find_all(feed_pg, "//feed:entry/feed:link", ns), "href")
titles <-  xml_text(xml_find_all(feed_pg, "//feed:entry/feed:title", ns))
 
#' This function does the hard part by iterating over the
#' links/titles and building a tbl_df of all the comments per-gist
get_comments <- function(links, titles) {
 
  bind_rows(pblapply(1:length(links), function(i) {
 
    # get gist
 
    pg <- read_html(links[i])
 
    # look for comments
 
    ref <- tryCatch(html_attr(html_nodes(pg, "div.timeline-comment-wrapper a[href^='#gistcomment']"), "href"),
                    error=function(e) character(0))
 
    # in theory if 'ref' exists then the rest will
 
    if (length(ref) != 0) {
 
      # if there were comments, get all the metadata we care about
 
      author <- html_text(html_nodes(pg, "div.timeline-comment-wrapper a.author"))
      timestamp <- html_attr(html_nodes(pg, "div.timeline-comment-wrapper time"), "datetime")
      contentpg <- str_trim(html_text(html_nodes(pg, "div.timeline-comment-wrapper div.comment-body")))
 
    } else {
      ref <- author <- timestamp <- contentpg <- character(0)
    }
 
    # bind_rows ignores length 0 tbl_df's
    if (sum(lengths(list(ref, author, timestamp, contentpg))==0)) {
      return(data_frame())
    }
 
    return(data_frame(title=titles[i], link=links[i],
                      ref=ref, author=author,
                      timestamp=timestamp, contentpg=contentpg))
 
  }))
 
}
 
comments <- get_comments(links, titles)
 
feed <- xmlTree("feed")
feed$addNode("id", sprintf("user:%s", who))
feed$addNode("title", sprintf("%s's gist comments", who))
feed$addNode("icon", "https://assets-cdn.github.com/favicon.ico")
feed$addNode("link", attrs=list(href=sprintf("https://github.com/%s", who)))
feed$addNode("updated", format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz="GMT"))
 
for (i in 1:nrow(comments)) {
 
  feed$addNode("entry", close=FALSE)
    feed$addNode("id", sprintf("gist:comment:%s:%s", who, comments[i, "timestamp"]))
    feed$addNode("link", attrs=list(href=sprintf("%s%s", comments[i, "link"], comments[i, "ref"])))
    feed$addNode("title", sprintf("Comment by %s", comments[i, "author"]))
    feed$addNode("updated", comments[i, "timestamp"])
    feed$addNode("author", close=FALSE)
      feed$addNode("name", comments[i, "author"])
    feed$closeTag()
    feed$addNode("content", saveXML(xmlTextNode(as.character(comments[i, "contentpg"])), prefix=""), 
                 attrs=list(type="html"))
  feed$closeTag()
 
}
 
rss <- str_replace(saveXML(feed), "<feed>", '<feed xmlns="http://www.w3.org/2005/Atom">')
 
writeLines(rss, con="feed.xml")

To get that RSS feed into something that an internet service can process you have to make sure that `feed.xml` is being written to a directory that translates to a publicly accessible web location (mine is at [http://dds.ec/hrbrmstrgcfeed.xml](http://dds.ec/hrbrmstrgcfeed.xml) if you want to see it).

On the internet-facing Ubuntu box that generated the feed I’ve got a `cron` entry:

30  * * * * /home/bob/bin/gengcfeed.R

which means it’s going to check github once an hour (at half past) for comment updates. Tune said parameters to your liking.

At the top of `gengcfeed.R` I have an `Rscript` shebang:

#!/usr/bin/Rscript

and the execute bit is set on the file.

Run the file by hand, first, and then test the feed via [https://validator.w3.org/feed/](https://validator.w3.org/feed/) to ensure it’s accessible and that it validates correctly. Now you can enter that feed URL into your favorite newsfeed reader (I use @feedly).

School of Data had a recent post on how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:


(i.e. by the controls & style of the controls available) and in the source:


The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.
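
As a rough sketch, you can even check for that from R by looking for DataTables script references in the static page source (the "datatables" filename fragment is an assumption about how the library file is typically named):

library(rvest)

pg <- read_html("http://www.allflicks.net/")
script_srcs <- html_attr(html_nodes(pg, "script[src]"), "src")
any(grepl("datatables", script_srcs, ignore.case = TRUE))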


Here, I just used Chrome’s Developer Tools to look at the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that, it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src=\"http://cdn1.nflximg.net/images/9159/12119159.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/6195/20866195.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/3735/2243735.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn0.nflximg.net/images/2668/21112668.jpg\" width=\"55\" alt=\"Thumbnail\">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href=\"http://www.allflicks.net/category/thrillers/\">Thrillers</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

glimpse(recs$data)

## Observations: 7448
## Variables:
## $ box_art   (chr) "<img src=\"http://cdn1.nflximg.net/images/9159/12...
## $ title     (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ...
## $ year      (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2...
## $ rating    (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4....
## $ category  (chr) "<a href=\"http://www.allflicks.net/category/thril...
## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M...
## $ cast      (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T...
## $ director  (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R...
## $ id        (chr) "60022258", "70302834", "70053695", "80028529", "8...

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we can run again to scrape the site (i.e. when they add more movies to the database) vs a manual cut & paste workflow.
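
For example, a quick sketch of the CSV route (dropping the HTML-heavy thumbnail column first):

library(readr)

recs$data %>%
  dplyr::select(-box_art) %>%
  write_csv("allflicks.csv")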

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.