Skip navigation

The intrepid @ma_salmon cranked out another blog post, remixing classical music schedule data from Radio Swiss Classic. It’s a fun post and you should read it before continuing here.

Seriously, click the link and go read it before continuing.

No, I mean it. Click the link or the rest of this makes no sense ;-)

OK, good. You finally read her ? post.

Now, I’m riffing off of said post here for four reasons. Three of the reasons are short, one is longer.

The first, short one is: be kind to web servers when scraping. If you ran a site and suddenly got hit with 3,000+ immediately sequential requests you might not be able to handle it depending on your server config. At a minimum add a Sys.sleep(sample(seq(0,1,0.25), 1)) before each sequential scrape and — if you can spare the time — sample(5,1) would be even better for a delay.

The second, short one is: purrr::safely() is your bff when it comes to xml2::read_html() and other network-ops. The internet is fundamentally broken. Nodes die. Pages get lost. Links rot. You have to be able to handle exceptions and if you define something like s_read_html <- safely(read_html) then when you do s_read_html("https://example.com/") the $result component will be NULL if the network request failed but will contain valid, parsed HTML if it succeeds. It is silent by default and works quite well (as we’ll see below).

The third, short one is: MPGA (Make Progress-bars Great Again). dplyr::progress_estimated() can really simplify the usage of progress bars in purrr calls (drop a note in the comments if the code is confusing and I’ll add some expository).

The last requires the code example for context:

library(rvest)
library(stringi)
library(lubridate)
library(tidyverse)

s_read_html <- safely(read_html)

# helper for brevity
xtract_nodes <- function(node, css) {
  html_nodes(node, css) %>% html_text(trim = TRUE)
}

get_one_day_program <- function(date=Sys.Date(),
                                base_url="http://www.radioswissclassic.ch/en/music-programme/search/%s",
                                pb=NULL) {

  if (!is.null(pb)) pb$tick()$print()

  Sys.sleep(sample(seq(0,1,0.25), 1)) # ideally, make this sample(5,1)

  date <- ymd(date) # handles case where input is character ISO date

  pg <- s_read_html(sprintf(base_url, format(date, "%Y%m%d")))

  if (!is.null(pg$result)) {

    data_frame(

      date = date,
      duration = xtract_nodes(pg$result, 'div[class="playlist"] *
                                            span[class="time hidden-xs"]') %>% hm() %>% as.numeric(),
      artist = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="titletag"]'),
      title = xtract_nodes(pg$result, 'div[class="playlist"] * span[class="artist"]'),

      hour = purrr::map(0:23, ~{
        if (.x<23) {
          nod <- html_nodes(pg$result,
                             xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')
                                                                 and (following-sibling::div[@id='%02d'])]", .x, .x+1))
        } else {
          nod <- html_nodes(pg$result,
                            xpath=sprintf(".//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row')]", .x))
        }
        rep(.x, length(nod))
      }) %>%
        flatten_int()

    )

  } else {
    closeAllConnections()
    NULL
  }

}

search_dates <- seq(from = ymd("2008-09-01"), to = ymd("2017-04-22"), by = "1 day")

pb <- progress_estimated(length(search_dates[1:5]))
programs_df <- map_df(search_dates[1:5], get_one_day_program, pb=pb)
programs_df
## # A tibble: 825 × 5
##          date duration                    artist                                                                         title  hour
##        <date>    <dbl>                     <chr>                                                                         <chr> <int>
## 1  2008-09-01       60   Franz Anton Hoffmeister "Andante grazioso" From Flute Quartet In A Major (After Mozart's KV 331) (CH)     0
## 2  2008-09-01      360     Johann Nepomuk Hummel                              "Rondo brillante" Op. 56 For Piano And Orchestra     0
## 3  2008-09-01     1380            Franz Schubert                       "Andante con moto" From Symphony No. 9 In C Major D 944     0
## 4  2008-09-01     2340       Camille Saint-Saëns                                       Violin Concerto No. 1 In A Major Op. 20     0
## 5  2008-09-01     3000        Alexander Scriabin                                           Nocturne In A Flat Major Op. posth.     0
## 6  2008-09-01     3180        Alexander Glazunov                                          Valse From "Scènes de ballet" Op. 52     0
## 7  2008-09-01     3540 Carl Philipp Emanuel Bach                                                           Symphony In G Major     0
## 8  2008-09-01     4200            Giuseppe Verdi                      "O Signore, dal tetto natio" From The Opera "I Lombardi"     1
## 9  2008-09-01     4440             Franz Krommer                                 Clarinet Concerto In E Flat Major Op. 36 (CH)     1
## 10 2008-09-01     5820            Georges Onslow             "Andantino molto cantabile" From Symphony No. 4 In G Major Op. 71     1
## # ... with 815 more rows

One of the reasons Maëlle created her post was to use XPath. Now, I was around when XML was defined and I have a sad, long history with the format, so XPath & I are old friends adversaries. However, there are simpler ways to target some of the nodes.

xpath="//span[@class='time hidden-xs']//text()" is ++gd XPath but it doesn’t need to be if we switch to using html_nodes() which will automatically translate CSS selectors to XPath for us. That bit of XPath turns into div[class="playlist"] * span[class="time hidden-xs"]. Why the extra selector at the beginning? Read on!

div[class="playlist"] * span[class="time hidden-xs"] actually translates to the following XPath:

selectr::css_to_xpath('div[class="playlist"] * span[class="time hidden-xs"]')
## [1] "descendant-or-self::div[@class = 'playlist']/descendant::*/descendant::span[@class = 'time hidden-xs']"

I use the parent playlist <div> because a few of the code bits in Maëlle’s post have to subtract away the last node because the XPath expression is a bit too greedy and also gets the “now playing” info vs just the “what played that day” info. It’s not strictly necessary for the time-code but it is for the artist & title. You can see that it simplifies the scraping a bit.

However, we can use XPath for to scrape the “hour the song played” and use it to fill the resultant data frame.

This .//div[@id='%02d']/following-sibling::div[contains(@class, 'item-row') and (following-sibling::div[@id='%02d'])] is not the most complex XPath but it is pretty gnarly, yet it also shows the power of XPath. What we’re doing in that purrr::map() call (which said XPath is in) is:

  • if the hour is 0:22, then use get all the sibling target nodes between one <div id="hh"> and the next <div id="hh">.
  • if the hour is 23, then get all the target nodes until there are no sibling
  • for either result, make an integer vector containing the hour repeated n times (n being the length of the number of songs played in the hour)
  • flatten it all into one big integer vector

(also: note that whitespace is your bff as well when it comes to formatting XPath queries)

If any read_html request is “bad” NULL will be returned instead of a data_frame, which purrr::map_df() will ignore.

I only did 5 scrapes since I won’t be using the data, but it’s working well on other random sequences I tried.

I tossed in a few more alternative ways to get some of the data, which you can pick up on if you compare the each code bits to each other.

Drop any questions, jibes or better XPath queries (once you post an XPath query on the internet the XPath wonks — like me ? — come out of hiding to prey on innocent bloggers) in the comments.

4 Trackbacks/Pingbacks

  1. By Decomposing Composers with R – Cyber Security on 23 Apr 2017 at 8:08 pm

    […] The intrepid @ma_salmon cranked out another blog post, remixing classical music schedule data from Radio Swiss Classic. It’s a fun post and you should read it before continuing here. Seriously, click the link and go read it before continuing. No, I mean it. Click the link or the rest of this makes no sense ;-)… Continue reading → […]

  2. […] that I used Bob Rudis’ code as an example of how to insert a progress bar which made me feel quite cool. I also inserted a […]

  3. […] that I used Bob Rudis’ code as an example of how to insert a progress bar which made me feel quite cool. I also inserted a […]

  4. […] się na szybkie zbanowanie wykorzystamy sztuczkę, którą znalazłem na blogu Maëlle (w ślad za rud.is). Cały poniższy kod można zapisać w jednym skrypcie i uruchomić tylko […]

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.