Scrapeover Friday — a.k.a. Another R Scraping Makeover

I caught a glimpse of a tweet by @dataandme on Friday:

Using R & rvest to explore Malaysian property mkt: "Web Scraping: The Sequel, Propwall.my" https://t.co/daZOOJJfPN #rstats #rvest pic.twitter.com/u6QMhm4M3e

— Mara Averick (@dataandme) May 5, 2017

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics; the practice helps both you and others. The code in this post likely worked at the time it was posted and accomplished the desired goal (which means it was ultimately decent code).

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule) which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to figure out how many pages we need to scrape from information in the first set of results.

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately calls (it appears this site has undergone some changes since the blog post and it’s unlikely the code in the post actually works now).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {
  
  # advance the progress bar, if one was supplied
  if (!is.null(pb)) pb$tick()$print()
  
  # replay the XHR POST request the site makes for page i of the results
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php", 
       body = list(action = "star_property_classified_list_change_ajax", 
                   tab = "Most Relevance", 
                   page = as.integer(i), location = "Mont Kiara", 
                   category = "", listing = "For Sale", 
                   price = "", keywords = "Mont Kiara, Kuala Lumpur", 
                   filter_id = "17", filter_type = "Location", 
                   furnishing = "", builtup = "", 
                   tenure = "", view = "list", 
                   map = "on", blurb = "0"), 
       encode = "form") -> res
  
  # fail loudly on HTTP errors
  stop_for_status(res)
  
  res <- content(res, as="parsed") 
  
  # pause 0.5 to 2 seconds so there is always a delay between requests
  Sys.sleep(sample(seq(0.5,2,0.5), 1))
  
  res
  
}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser’s Developer Tools. You can either transcribe it by hand or use the curlconverter package (which is temporarily off CRAN, so you’ll need to get it from GitHub) to auto-convert it to an httr::VERB request.

We also add a parameter (defaulting to NULL) to support the use of a progress bar (so we can see what’s going on). If we pass in a populated dplyr progress bar, this will tick it down for us.
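The pb$tick()$print() chain works because tick() hands back the bar itself. Here is a minimal sketch of that pattern using base R reference classes. The Ticker class and its show_progress() method are made up for illustration and are not dplyr's implementation.

```r
library(methods)

# Hypothetical counter class; NOT how dplyr implements its progress bar
Ticker <- setRefClass(
  "Ticker",
  fields = list(n = "numeric", i = "numeric"),
  methods = list(
    tick = function() {
      i <<- i + 1        # update the object's own state in place
      invisible(.self)   # returning the object is what makes chaining possible
    },
    show_progress = function() {
      cat(sprintf("%d of %d\n", as.integer(i), as.integer(n)))
      invisible(.self)
    }
  )
)

pb <- Ticker$new(n = 3, i = 0)
pb$tick()$show_progress()  # tick, then print
pb$tick()$tick()           # tick twice without printing
```

Keeping the printing in its own method means a caller can advance the counter silently when no output is wanted.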

Now, we can use that to get the total number of listings.

get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>% 
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>% 
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>% 
  as.numeric() -> classified_ct

# 20 listings per page; ceiling() avoids an extra page when the count is an exact multiple of 20
total_pages <- ceiling(classified_ct / 20)
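A quick check of the page arithmetic, assuming 20 listings per page. Rounding up with ceiling() avoids requesting an extra, empty page when the count is an exact multiple of 20, which a `1 + n %/% 20` form would do. The sample text and the pages_needed() helper are invented for illustration, and the parsing uses base R rather than stringi so the sketch runs without extra packages.

```r
# Hypothetical helper; per_page = 20 is an assumption taken from the post
pages_needed <- function(n_listings, per_page = 20) {
  ceiling(n_listings / per_page)
}

# Made-up link text standing in for the real "Classifieds:" node
txt <- "Classifieds: 1,234"

# grab the trailing run of digits and commas, then drop the commas
raw_count <- regmatches(txt, regexpr("[0-9,]+$", txt))
n <- as.numeric(gsub(",", "", raw_count, fixed = TRUE))

pages_needed(n)   # pages for the made-up count
pages_needed(40)  # exact multiple of 20: two pages, not three
```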

We’ll set up another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  # select the first link in each listing heading once, then pull both fields from it
  links <- html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)")
  tibble(
    link = html_attr(links, "href"),
    description = html_text(links, trim = TRUE)
  )
}

Rather than chain calls to html_nodes(), we take advantage of well-formed CSS selectors (which ultimately get auto-translated to XPath strings). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity.
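To see the selector at work without touching the site, here is a small sketch against made-up markup. The HTML, listing names and paths are invented; only the selector string comes from the function above.

```r
library(rvest)

# Made-up markup that mimics the nesting the selector expects
html <- '<div id="list-content">
  <div class="media"><div class="media-body">
    <h4 class="media-heading"><a href="/listing/1"> Condo A </a><a href="/agent/1">Agent</a></h4>
  </div></div>
  <div class="media"><div class="media-body">
    <h4 class="media-heading"><a href="/listing/2">Condo B</a></h4>
  </div></div>
</div>'

pg <- read_html(html)

# one selector does the work of several chained html_nodes() calls
sel <- "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)"
anchors <- html_nodes(pg, sel)

links  <- html_attr(anchors, "href")       # first link in each heading only
titles <- html_text(anchors, trim = TRUE)  # surrounding whitespace trimmed
```

The second link in the first heading is skipped because a:nth-of-type(1) only matches the first anchor inside each h4.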

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

- add a delay when you sequentially scrape pages from a site
- abstract out common operations into functions
- take advantage of purrr functions (or built-in *apply functions) to avoid for loops
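For anyone who prefers the built-in *apply route, the three points can be sketched in base R like this. fetch_page() is a hypothetical stand-in that fabricates rows instead of making a request, so the example runs offline.

```r
# Hypothetical stand-in for a real page fetch; returns two fake rows per page
fetch_page <- function(i, delay = 0.1) {
  Sys.sleep(delay)  # pause between sequential requests
  data.frame(
    page = i,
    item = paste0("listing-", i, "-", 1:2),
    stringsAsFactors = FALSE
  )
}

# lapply() replaces the for loop; do.call(rbind, ...) stacks the per-page results
pages <- lapply(1:3, fetch_page)
all_rows <- do.call(rbind, pages)
```

With real requests you would likely want a longer, randomized delay than the 0.1 seconds used here.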

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for ToS/ToC on the site suggested that scraping, automated grabbing or use of the underlying data in bulk was prohibited. Many sites have such restrictions, IMDB among them (I mention that as it’s been used a lot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for such ToS violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided just on those grounds. When I come across sites I need information from that have restrictive ToS, I contact the site owner (when I can find them) and ask for permission, and I have only been refused a small handful of times. Given those recent legal actions, it’s also better to be safe than sorry.

3 Comments

- Robert M. McDonnell (@RobertMylesMc)
- Posted 2017-05-07 at 12:39
Hi hrbrmstr. Mind if I ask you a question? How does pb$tick()$print() work?
- - hrbrmstr
  - Posted 2017-05-07 at 16:54
  Sure! Outside the function, a dplyr progress bar is created with progress_estimated(). The variable it creates is an R reference class object, which makes it easier to keep its own self-contained environment (in this case, it stores the starting number of ticks to count down from). pb$tick() ticks down the internal counter, but you may want it to tick down without printing anything, so the print method is separate. The call to pb$tick() just returns pb, so you can chain ref class method calls much as one would in JavaScript.
  - - Robert McDonnell
    - Posted 2017-05-21 at 15:29
    Nice, thanks! :-)


rud.is