The intrepid @ma_salmon cranked out another blog post, remixing classical music schedule data from Radio Swiss Classic. It’s a fun post and you should read it before continuing here.
Seriously, click the link and go read it before continuing.
No, I mean it. Click the link or the rest of this makes no sense ;-)
OK, good. You finally read her ? post.
Now, I’m riffing off of said post here for four reasons. Three of the reasons are short, one is longer.
The first, short one is: be kind to web servers when scraping. If you ran a site and suddenly got hit with 3,000+ immediately sequential requests you might not be able to handle it depending on your server config. At a minimum add a Sys.sleep(sample(seq(0,1,0.25), 1))
before each sequential scrape and — if you can spare the time — sample(5,1)
would be even better for a delay.
The second, short one is: purrr::safely()
is your bff when it comes to xml2::read_html()
and other network-ops. The internet is fundamentally broken. Nodes die. Pages get lost. Links rot. You have to be able to handle exceptions and if you define something like s_read_html <- safely(read_html)
then when you do s_read_html("https://example.com/")
the $result
component will be NULL
if the network request failed but will contain valid, parsed HTML if it succeeds. It is silent by default and works quite well (as we’ll see below).
The third, short one is: MPGA (Make Progress-bars Great Again). dplyr::progress_estimated()
can really simplify the usage of progress bars in purrr
calls (drop a note in the comments if the code is confusing and I’ll add some expository).
The last requires the code example for context:
library(rvest)
library(stringi)
library(lubridate)
library(tidyverse)
s_read_html <- safely(read_html)
# helper for brevity
xtract_nodes <- function(node, css) {
html_nodes(node, css) %>% html_text(trim = TRUE)
}
get_one_day_program <- function(date=Sys.Date(),
base_url="http://www.radioswissclassic.ch/en/music-programme/search/%s",
pb=NULL) {
if (!is.null(pb)) pb$tick()$print()
Sys.sleep(sample(seq(0,1,0.25), 1)) # ideally, make this sample(5,1)
date <- ymd(date) # handles case where input is character ISO date
pg <- s_read_html(sprintf(base_url, format(date, "%Y%m%d")))
if (!is.null(pg$result)) {
data_frame(
date = date,
duration = xtract_nodes(pg$result, 'div[class="playlist"] *
span[class="time hidden-xs"]') %>% hm() %>% as.numeric(),
artist = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="titletag"]'),
title = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="artist"]'),
hour = purrr::map(0:23, ~{
if (.x<23) {
nod <- html_nodes(pg$result,
xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')
and (following-sibling::div[@id='%02d'])]", .x, .x+1))
} else {
nod <- html_nodes(pg$result,
xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')]", .x))
}
rep(.x, length(nod))
}) %>%
flatten_int()
)
} else {
closeAllConnections()
NULL
}
}
search_dates <- seq(from = ymd("2008-09-01"), to = ymd("2017-04-22"), by = "1 day")
pb <- progress_estimated(length(search_dates[1:5]))
programs_df <- map_df(search_dates[1:5], get_one_day_program, pb=pb)
programs_df
## # A tibble: 825 × 5
## date duration artist title hour
## <date> <dbl> <chr> <chr> <int>
## 1 2008-09-01 60 Franz Anton Hoffmeister "Andante grazioso" From Flute Quartet In A Major (After Mozart's KV 331) (CH) 0
## 2 2008-09-01 360 Johann Nepomuk Hummel "Rondo brillante" Op. 56 For Piano And Orchestra 0
## 3 2008-09-01 1380 Franz Schubert "Andante con moto" From Symphony No. 9 In C Major D 944 0
## 4 2008-09-01 2340 Camille Saint-Saëns Violin Concerto No. 1 In A Major Op. 20 0
## 5 2008-09-01 3000 Alexander Scriabin Nocturne In A Flat Major Op. posth. 0
## 6 2008-09-01 3180 Alexander Glazunov Valse From "Scènes de ballet" Op. 52 0
## 7 2008-09-01 3540 Carl Philipp Emanuel Bach Symphony In G Major 0
## 8 2008-09-01 4200 Giuseppe Verdi "O Signore, dal tetto natio" From The Opera "I Lombardi" 1
## 9 2008-09-01 4440 Franz Krommer Clarinet Concerto In E Flat Major Op. 36 (CH) 1
## 10 2008-09-01 5820 Georges Onslow "Andantino molto cantabile" From Symphony No. 4 In G Major Op. 71 1
## # ... with 815 more rows
One of the reasons Maëlle created her post was to use XPath. Now, I was around when XML was defined and I have a sad, long history with the format, so XPath & I are old friends adversaries. However, there are simpler ways to target some of the nodes.
xpath="//span[@class='time hidden-xs']//text()"
is ++gd XPath but it doesn’t need to be if we switch to using html_nodes()
which will automatically translate CSS selectors to XPath for us. That bit of XPath turns into div[class="playlist"] * span[class="time hidden-xs"]
. Why the extra selector at the beginning? Read on!
div[class="playlist"] * span[class="time hidden-xs"]
actually translates to the following XPath:
selectr::css_to_xpath('div[class="playlist"] * span[class="time hidden-xs"]')
## [1] "descendant-or-self::div[@class = 'playlist']/descendant::*/descendant::span[@class = 'time hidden-xs']"
I use the parent playlist
<div>
because a few of the code bits in Maëlle’s post have to subtract away the last node because the XPath expression is a bit too greedy and also gets the “now playing” info vs just the “what played that day” info. It’s not strictly necessary for the time-code but it is for the artist & title. You can see that it simplifies the scraping a bit.
However, we can use XPath for to scrape the “hour the song played” and use it to fill the resultant data frame.
This .//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row') and (following-sibling::div[@id='%02d'])]
is not the most complex XPath but it is pretty gnarly, yet it also shows the power of XPath. What we’re doing in that purrr::map()
call (which said XPath is in) is:
- if the hour is
0:22
, then use get all the sibling target nodes between one <div id="hh">
and the next <div id="hh">
.
- if the hour is
23
, then get all the target nodes until there are no sibling
- for either result, make an integer vector containing the hour repeated
n
times (n
being the length of the number of songs played in the hour)
- flatten it all into one big integer vector
(also: note that whitespace is your bff as well when it comes to formatting XPath queries)
If any read_html
request is “bad” NULL
will be returned instead of a data_frame
, which purrr::map_df()
will ignore.
I only did 5 scrapes since I won’t be using the data, but it’s working well on other random sequences I tried.
I tossed in a few more alternative ways to get some of the data, which you can pick up on if you compare the each code bits to each other.
Drop any questions, jibes or better XPath queries (once you post an XPath query on the internet the XPath wonks — like me ? — come out of hiding to prey on innocent bloggers) in the comments.