Guillaume Pressiat (@GuillaumePressiat) did a solid post & video on using Selenium to scrape a paginated table from understat[.]com/league/EPL/2020 (I just cannot bring myself to provide an active link to any SportsBall site). He does a great job walking folks through acquiring & orchestrating the heavy dependency that is Selenium.
I did a quick “look at browser Developer Tools” tweet a few weeks back that included the entire code for retrieving the Forbes billionaires list via the JSON file the Forbes site loads through an XHR request. That tweet was a response to a similarly fine article by another R user on using Selenium to do the same thing.
If you find yourself thwarted by rvest::read_html() not returning “nodes that are clearly there” it is likely due to the page rendering nodes dynamically via javascript. Selenium orchestrates full or headless browsers and lets you scrape the dynamically rendered DOM. You can see this yourself if you first view the source of an HTML page (via the browser’s “view source” menu) and then use Developer Tools to inspect the browser session. The “view source” view (in Blink-based browsers, at least) will be the raw, unrendered source HTML from the site and the DevTools “Elements” tab will have the rendered DOM elements.
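The difference between raw source and rendered DOM is easy to reproduce offline. The sketch below uses a made-up page (the `league-table` id and `demoData` variable are assumptions for illustration, not anything from the real site) where the table would only exist after a browser runs the script:

```r
library(rvest)

# Hypothetical page: the table is built in the browser from embedded JSON,
# so the raw HTML holds only an empty container and a <script> tag.
raw_html <- '<html><body>
  <div id="league-table"></div>
  <script>var demoData = JSON.parse(\'[{"team":"A","pts":3}]\');</script>
</body></html>'

pg_demo <- read_html(raw_html)

# No table cells exist until something actually runs the javascript
n_cells <- length(html_nodes(pg_demo, "td"))

# The data is still in the raw source, inside the <script> node
n_scripts <- length(
  html_nodes(pg_demo, xpath = ".//script[contains(., 'JSON.parse')]")
)
```

Here `n_cells` counts nodes that a browser would have created, while `n_scripts` counts what is really in the source that `read_html()` sees.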
The “Network” tab of DevTools has an “XHR” tab of its own, but if you try to use it on this SportsBall site to see the JSON it loads, you’ll be bitterly disappointed: while the site does indeed render JSON into HTML DOM nodes dynamically, that JSON is embedded in the web page itself.
We can work in two different ways without the use of Selenium.
First, we’ll “cheat” and use the {V8} package, which is an R interface to a javascript virtual machine, the type of which browsers use to run javascript on web pages. I say “cheat” because we’re still depending on a chunk of a browser engine.
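As a warm-up, here is a minimal sketch of what handing javascript to {V8} looks like, using a made-up one-line script in the same hex-escaped style the site uses (`demoData` and its contents are assumptions for illustration):

```r
library(V8)

demo_ctx <- v8()  # a fresh javascript VM, no browser involved

# Hypothetical script body; the escapes spell out [{"id":"1"}]
demo_js <- "var demoData = JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x221\\x22\\x7D\\x5D');"

demo_ctx$eval(demo_js)            # V8 decodes the \x escapes and parses the JSON
demo <- demo_ctx$get("demoData")  # the variable comes back as an R object
```

The javascript engine does all the unescaping and parsing, so the R side only has to ask for the variable by name.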
Let’s get some boilerplate out of the way:
library(V8) # V8 engine
library(rvest) # Scraping
library(stringi) # String manipulation which we'll use later
library(tidyverse) # Duh
ctx <- v8() # create a new instance of the javascript VM
pg <- read_html("https://understat.com/league/EPL/2020") # read sportsball page
If you examine the SportsBall page you’ll see JSON.parse in a few different locations. Let’s target them all:
html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]")
## {xml_nodeset (4)}
## [1] <script>\n\tvar datesData \t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x2214086 ...
## [2] <script>\n\tvar teamsData = JSON.parse('\\x7B\\x2271\\x22\\x3A\\x7B\\x22id\\x22 ...
## [3] <script>\n\tvar playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x22647\ ...
## [4] <script>\n\t\tWebFont.load({\n\t\t\tgoogle: {\n\t\t\t\tfamilies: ['Barlow:500', ...
We don’t need that last one; the first three contain all the data we need.
Turning that into data is pretty straightforward work:
html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>%
  .[1:3] %>% # only want the first three nodes
  html_text() %>% # turn the nodes into text
  walk(ctx$eval) # tell V8 to evaluate the javascript
The VM we created now has those three variables:
ctx$get(JS("Object.keys(global)"))
## [1] "print" "console" "global" "datesData" "_week"
## [6] "_year" "teamsData" "playersData"
and we can retrieve them like this:
as_tibble(ctx$get("datesData"))
## A tibble: 380 x 8
## id isResult h$id $title $short_title a$id $title $short_title goals$h $a
## <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 14086 TRUE 228 Fulham FLH 83 Arsen… ARS 0 3
## 2 14087 TRUE 78 Cryst… CRY 74 South… SOU 1 0
## 3 14090 TRUE 87 Liver… LIV 245 Leeds LED 4 3
## 4 14091 TRUE 81 West … WHU 86 Newca… NEW 0 2
## 5 14092 TRUE 76 West … WBA 75 Leice… LEI 0 3
## 6 14093 TRUE 82 Totte… TOT 72 Evert… EVE 0 1
## 7 14094 TRUE 238 Sheff… SHE 229 Wolve… WOL 0 2
## 8 14095 TRUE 220 Brigh… BRI 80 Chels… CHE 1 3
## 9 14096 TRUE 72 Evert… EVE 76 West … WBA 5 2
## 10 14097 TRUE 245 Leeds LED 228 Fulham FLH 4 3
## # … with 370 more rows, and 6 more variables: xG$h <chr>, $a <chr>, datetime <chr>,
## # forecast$w <chr>, $d <chr>, $l <chr>
Note that we need to do some extra processing of the second one to make it a bit tidier:
ctx$get("teamsData") %>%
  map_df(~{
    .x$history$id <- .x$id
    .x$history$title <- .x$title
    .x$history
  }) %>%
  as_tibble()
## # A tibble: 616 x 21
## h_a xG xGA npxG npxGA ppda$att $def ppda_allowed$att $def deep
## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 h 0.805 0.850 0.805 0.0885 89 20 247 14 17
## 2 a 2.03 0.535 2.03 0.535 307 33 143 24 10
## 3 h 3.08 1.66 3.08 1.66 365 25 119 25 7
## 4 a 0.874 0.672 0.874 0.672 212 23 210 24 7
## 5 h 1.50 2.38 1.50 2.38 225 17 124 34 7
## 6 h 2.45 1.00 1.69 1.00 161 23 164 22 5
## 7 a 1.99 1.39 1.99 1.39 331 24 169 15 16
## 8 h 1.77 1.50 1.77 1.50 257 14 208 17 6
## 9 a 2.39 0.572 1.63 0.572 144 11 289 23 8
## 10 a 1.27 1.14 0.508 1.14 162 28 166 20 5
## # … with 606 more rows, and 13 more variables: deep_allowed <int>, scored <int>,
## # missed <int>, xpts <dbl>, result <chr>, date <chr>, wins <int>, draws <int>,
## # loses <int>, pts <int>, npxGD <dbl>, id <chr>, title <chr>
The last one does not need any extra help:
as_tibble(ctx$get("playersData"))
## # A tibble: 505 x 18
## id player_name games time goals xG assists xA shots key_passes
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 647 Harry Kane 29 2557 19 17.6… 13 6.73… 113 39
## 2 1250 Mohamed Sa… 30 2529 19 16.1… 3 4.55… 99 40
## 3 1228 Bruno Fern… 31 2659 16 13.4… 11 10.8… 95 87
## 4 453 Son Heung-… 30 2509 14 9.35… 9 8.03… 55 56
## 5 822 Patrick Ba… 31 2572 14 14.8… 7 3.44… 93 24
## 6 5555 Dominic Ca… 26 2248 14 15.7… 0 0.95… 65 13
## 7 3277 Alexandre … 27 1818 13 11.8… 2 1.77… 43 21
## 8 314 Ilkay Günd… 24 1776 12 8.64… 1 3.29… 45 35
## 9 755 Jamie Vardy 27 2230 12 16.0… 7 4.26… 64 22
## 10 8865 Ollie Watk… 30 2700 12 13.7… 3 4.15… 81 36
## # … with 495 more rows, and 8 more variables: yellow_cards <chr>, red_cards <chr>,
## # position <chr>, team_title <chr>, npg <chr>, npxG <chr>, xGChain <chr>,
## # xGBuildup <chr>
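The datesData result above carries nested data-frame columns (h, a, goals and so on). If flat columns are easier to work with, one option is jsonlite::flatten(). This is a sketch on a made-up record shaped like one datesData row, not something the original workflow requires:

```r
library(jsonlite)

# Hypothetical record in the shape of a single datesData entry
demo_json <- '[{"id":"14086","h":{"id":"228","title":"Fulham"},"goals":{"h":"0","a":"3"}}]'

nested <- fromJSON(demo_json)  # h and goals become data-frame columns

# Use the explicit namespace: {purrr} (loaded by tidyverse) also exports flatten()
flat <- jsonlite::flatten(nested)

names(flat)
```

The nested columns are spliced into the parent data frame with dotted names, so the home team title ends up in a column called `h.title`.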
We don’t really need {V8} for this, though, if we’re willing to use some regular expressions. We have to be a bit careful since some extra, non-JSON data comes along for the ride with that first <script> tag (see the embedded image above).
We perform the same initial setup (get the text of the first three <script> tags), then we erase everything that isn’t JSON data (so all the var declarations and javascript punctuation). By using comments = TRUE in the call to stri_replace_all_regex() we can provide documentation along with the (ugly) regex.
The creator of the SportsBall site did some encoding to make the string easier to shove into a <script> tag, so we need to undo that by converting the hex escapes to URL percent-escapes (replace \x with %) and then decoding them with curl::curl_unescape().
We could have made the regular expression uglier to avoid the other javascript cruft in the first <script> tag, but it’s just as easy to split them all into lines and pull the first line out.
Then, it’s just a matter of running each one through jsonlite::fromJSON(). I kept it a list and just set the names as the names of the variables above.
html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>%
  .[1:3] %>%
  html_text() %>%
  stri_replace_all_regex("
    ^[^\\(]+\\(' # remove everything from the beginning of the line to the first ('
    | # OR
    '\\)[;,][[:space:]]* # remove the last ') and everything after it
    $
  ", "", comments = TRUE, multiline = TRUE) %>%
  stri_replace_all_fixed("\\x", "%") %>%
  curl::curl_unescape() %>%
  stri_split_lines() %>%
  map_chr(1) %>%
  map(jsonlite::fromJSON) %>%
  map(as_tibble) %>%
  set_names(c("datesData", "teamsData", "playersData")) %>%
  str(3)
## List of 3
## $ datesData : tibble [380 × 8] (S3: tbl_df/tbl/data.frame)
## ..$ id : chr [1:380] "14086" "14087" "14090" "14091" ...
## ..$ isResult: logi [1:380] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..$ h :'data.frame': 380 obs. of 3 variables:
## ..$ a :'data.frame': 380 obs. of 3 variables:
## ..$ goals :'data.frame': 380 obs. of 2 variables:
## ..$ xG :'data.frame': 380 obs. of 2 variables:
## ..$ datetime: chr [1:380] "2020-09-12 11:30:00" "2020-09-12 14:00:00" "2020-09-12 16:30:00" "2020-09-12 19:00:00" ...
## ..$ forecast:'data.frame': 380 obs. of 3 variables:
## $ teamsData : tibble [3 × 20] (S3: tbl_df/tbl/data.frame)
## ..$ 71 :List of 3
## ..$ 72 :List of 3
## ..$ 74 :List of 3
## ..$ 75 :List of 3
## ..$ 76 :List of 3
## ..$ 78 :List of 3
## ..$ 80 :List of 3
## ..$ 81 :List of 3
## ..$ 82 :List of 3
## ..$ 83 :List of 3
## ..$ 86 :List of 3
## ..$ 87 :List of 3
## ..$ 88 :List of 3
## ..$ 89 :List of 3
## ..$ 92 :List of 3
## ..$ 220:List of 3
## ..$ 228:List of 3
## ..$ 229:List of 3
## ..$ 238:List of 3
## ..$ 245:List of 3
## $ playersData: tibble [505 × 18] (S3: tbl_df/tbl/data.frame)
## ..$ id : chr [1:505] "647" "1250" "1228" "453" ...
## ..$ player_name : chr [1:505] "Harry Kane" "Mohamed Salah" "Bruno Fernandes" "Son Heung-Min" ...
## ..$ games : chr [1:505] "29" "30" "31" "30" ...
## ..$ time : chr [1:505] "2557" "2529" "2659" "2509" ...
## ..$ goals : chr [1:505] "19" "19" "16" "14" ...
## ..$ xG : chr [1:505] "17.650331255048513" "16.19410896115005" "13.438796618022025" "9.352356541901827" ...
## ..$ assists : chr [1:505] "13" "3" "11" "9" ...
## ..$ xA : chr [1:505] "6.7384555246680975" "4.557050030678511" "10.812157344073057" "8.036493374034762" ...
## ..$ shots : chr [1:505] "113" "99" "95" "55" ...
## ..$ key_passes : chr [1:505] "39" "40" "87" "56" ...
## ..$ yellow_cards: chr [1:505] "1" "0" "5" "0" ...
## ..$ red_cards : chr [1:505] "0" "0" "0" "0" ...
## ..$ position : chr [1:505] "F" "F S" "M S" "F M S" ...
## ..$ team_title : chr [1:505] "Tottenham" "Liverpool" "Manchester United" "Tottenham" ...
## ..$ npg : chr [1:505] "15" "13" "8" "14" ...
## ..$ npxG : chr [1:505] "14.605655785650015" "11.627095961943269" "6.5883138151839375" "9.352356541901827" ...
## ..$ xGChain : chr [1:505] "20.556765687651932" "21.694580920040607" "22.04182725213468" "17.928756553679705" ...
## ..$ xGBuildup : chr [1:505] "3.99019683804363" "8.287332298234105" "8.843060294166207" "5.881684513762593" ...
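To see the strip-and-decode steps in isolation, here is a sketch that runs them on a made-up one-line script (`demoData` and its payload are assumptions for illustration). It uses a compact version of the regex rather than the commented one above:

```r
library(stringi)

# Hypothetical script body; the escapes spell out [{"id":"1"}]
demo_js <- "var demoData = JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x221\\x22\\x7D\\x5D');"

# 1. strip the javascript wrapper, leaving only the escaped payload
escaped <- stri_replace_all_regex(demo_js, "^[^(]+\\('|'\\)[;,]\\s*$", "")

# 2. turn \xNN into %NN so a URL decoder understands it
percent <- stri_replace_all_fixed(escaped, "\\x", "%")

# 3. decode the percent-escapes, then parse the JSON
decoded <- curl::curl_unescape(percent)
demo <- jsonlite::fromJSON(decoded)
```

One caveat with this trick: it relies on the payload not containing literal % characters of its own, since those would be misread by the URL decoder.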
You can use the cleanup code from the {V8} example to reshape that second element, and readr::type_convert() can help you turn the character vectors into something more useful.
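A minimal sketch of that conversion, on made-up rows shaped like playersData (the tibble below is an assumption for illustration; only three of the eighteen columns are shown):

```r
library(readr)
library(tibble)

# Hypothetical rows: everything arrives from the JSON as character
players_chr <- tibble(
  player_name = c("Harry Kane", "Mohamed Salah"),
  games = c("29", "30"),
  xG = c("17.650331255048513", "16.19410896115005")
)

# type_convert() re-guesses each character column's type;
# suppressMessages() hides the column specification it prints
players <- suppressMessages(type_convert(players_chr))
```

Columns that look numeric become numeric and the names stay character. Since id columns would be converted as well, you may want to pass an explicit `col_types` if they should stay as strings.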
FIN
It really always pays to take a look at the DevTools pane before introducing heavy dependencies. More sites are using very straightforward idioms that make the JSON source data behind dynamically rendered pages readily available. Further, sites often add extra fields that you don’t see rendered, but may be useful to have around as you work with the resulting data.
2 Comments
Hi,
Thank you for your post following mine, cool.
I’m not a big fan of SportsBall stats by the way. It was a technical interest to me.
Using json part of the page to have raw data was my first idea, see here https://stackoverflow.com/a/67032500/10527496
Stackoverflow user has then answered: “Thanks, however, I am more concerned with the process of scraping the table from the webpage.” So I try RSelenium for fun (during pseudo-lockdown).
Finally we’re there and your V8 usage is really interesting to me!
Your method was perfect! Plus the video ‘splainer was great for folks who have different learning styles. And, it’s great to have modern R+Selenium posts (some of the older ones from nearly a decade ago are really crufty!)
I try not to let SO cookies hit my browser anymore, but I rly shld have looked there.
Originally published at rud.is by hrbrmstr: https://rud.is/b/2021/04/12/check-developer-tools-first-to-avoid-heavy-ish-dependencies/