Scrapeover Friday — a.k.a. Another R Scraping Makeover

I caught a glimpse of a tweet by @dataandme on Friday.

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics; the practice ultimately helps both you and others. The code in the post likely worked at the time it was published and accomplished the desired goal (which means it was ultimately decent code).

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule), which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to figure out how many pages we need to scrape from information in the first set of results.
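To make the for loop point concrete, here’s the general shape of that change (a toy illustration, not code from the original post):

# growing a data frame inside a for loop...
results <- data.frame()
for (i in 1:5) {
  results <- rbind(results, data.frame(page = i, value = i^2))
}

# ...vs. letting purrr iterate and row-bind for us
library(purrr)
results <- map_df(1:5, ~data.frame(page = .x, value = .x^2))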

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately calls (it appears this site has undergone some changes since the blog post and it’s unlikely the code in the post actually works now).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {
  
  # if we were handed a dplyr progress bar, tick it down so we can watch progress
  if (!is.null(pb)) pb$tick()$print()
  
  # this is the XHR POST request the listing page fires; `page` is the only
  # field we vary between calls
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php", 
       body = list(action = "star_property_classified_list_change_ajax", 
                   tab = "Most Relevance", 
                   page = as.integer(i), location = "Mont Kiara", 
                   category = "", listing = "For Sale", 
                   price = "", keywords = "Mont Kiara, Kuala Lumpur", 
                   filter_id = "17", filter_type = "Location", 
                   furnishing = "", builtup = "", 
                   tenure = "", view = "list", 
                   map = "on", blurb = "0"), 
       encode = "form") -> res
  
  stop_for_status(res)   # bail out early on HTTP errors
  
  res <- content(res, as="parsed") 
  
  # be polite to the server: pause 0-2 seconds between requests
  Sys.sleep(sample(seq(0,2,0.5), 1))
  
  res
  
}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser’s Developer Tools. You can either transcribe it by hand or use the curlconverter package (temporarily off CRAN, so you’ll need to install it from GitHub) to auto-convert it to an httr::VERB() request.
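If you go the curlconverter route, the workflow looks roughly like this (a sketch assuming the GitHub version’s straighten()/make_req() pair; start by using “Copy as cURL” on the XHR request in the Network tab so the command is on your clipboard):

# devtools::install_github("hrbrmstr/curlconverter")
library(curlconverter)

req     <- straighten()   # parse the copied cURL command line (from the clipboard)
req_fun <- make_req(req)  # build httr::VERB() request functions from it

# req_fun[[1]]() replays the request; its body/headers can then be tidied up
# by hand into something like get_page() above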

We also add a pb parameter (defaulting to NULL) to support the use of a progress bar (so we can see what’s going on). A dplyr progress bar created with progress_estimated() is a reference-class object: it keeps its own self-contained environment (holding, among other things, the number of ticks left to count down). Calling pb$tick() decrements that internal counter and returns pb itself, so the print method (kept separate in case you want to tick without printing) can be chained onto it, much the way one would chain method calls in JavaScript. If we pass a populated progress bar into get_page(), each call ticks it down for us.
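Here’s a tiny standalone illustration of that chaining (not part of the scraper; dplyr and purrr come along with the tidyverse load above):

pb <- progress_estimated(3)

walk(1:3, ~{
  pb$tick()$print()   # decrement the internal counter, then print the bar
  Sys.sleep(0.2)      # stand-in for real work
})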

Now, we can use that to get the total number of listings.

get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>%  # the link that reports the total count
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>%             # grab the trailing number (e.g. "1,234")
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>%                       # drop the thousands separator
  as.numeric() -> classified_ct

total_pages <- 1 + (classified_ct %/% 20)  # ~20 listings per page; a count that's an exact multiple of 20 just costs one extra (empty) request

We’ll set up another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  # the link and the description live in the same anchor nodes, so select
  # them once and pull out both the attribute and the text
  listings <- html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)")
  data_frame(
    link = html_attr(listings, "href"),
    description = html_text(listings, trim = TRUE)
  )
}

Rather than chain calls to html_nodes(), we take advantage of a well-formed CSS selector (which ultimately gets auto-translated to an XPath string). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity.
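If you’re curious about that translation, rvest hands CSS selectors to the selectr package under the hood, so you can inspect the generated XPath yourself (purely illustrative; nothing in the scraper needs this):

selectr::css_to_xpath("div#list-content > div.media * h4.media-heading > a:nth-of-type(1)")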

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.
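A quick, optional sanity check at this point is to compare what we collected against the count the site reported (and peek at the columns we built):

nrow(listings_df)   # should be in the neighborhood of classified_ct
classified_ct

glimpse(listings_df)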

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

  • add a delay when you sequentially scrape pages from a site
  • abstract out common operations into functions
  • take advantage of purrr functions (or built-in *apply functions) to avoid for loops

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for the ToS/ToC on the site suggested that scraping, automated grabbing, or use of the underlying data in bulk was prohibited. Many sites do have such restrictions, like IMDb (I mention that one since it’s been used a lot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for such ToS violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided on those grounds alone. When I come across sites with restrictive ToS that I need information from, I contact the site owner (when I can find them) and ask for permission; I’ve only been refused a small handful of times. Given those recent legal actions, it’s also better to be safe than sorry.
