The urlscan package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn all the gory technical details about a website.
For instance, say you wanted to check on r-project.org. You could go to the site manually, enter the URL into the request bar, and wait for the result:

Or, you can use R! First pick your preferred social coding site, read through the source (this is going to be standing advice for every post starting with this one: don’t blindly trust code from any social coding site), and then install the urlscan package:
devtools::install_git("https://git.sr.ht/~hrbrmstr/urlscan")
# or
devtools::install_gitlab("hrbrmstr/urlscan")
# or
devtools::install_github("hrbrmstr/urlscan")
Next, head back over to urlscan.io and grab an API key (it’s free). Stick that in your ~/.Renviron under URLSCAN_API_KEY and then readRenviron("~/.Renviron") in your R console.
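If you haven’t used ~/.Renviron before, the whole setup amounts to something like this (the key value shown is just a placeholder, not a real key):

```r
# In ~/.Renviron, add a line like:
#
#   URLSCAN_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
#
# then reload it and verify the key is visible to the current session:
readRenviron("~/.Renviron")
Sys.getenv("URLSCAN_API_KEY")
```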
Now, let’s check out r-project.org.
library(urlscan)
library(tidyverse)
rproj <- urlscan_search("r-project.org")
rproj
## URL Submitted: https://r-project.org/
## Submission ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Submission Type: public
## Submission Note: Submission successful
There is more data in that rproj object, but we have enough to retrieve more detailed results. Note that the site will return an error from urlscan_result() if it hasn’t finished the analysis yet.
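If you’re scripting this end-to-end, one way to handle that not-finished-yet error is a small retry loop. The fetch_when_ready() helper below is my own hypothetical sketch (it is not part of the urlscan package):

```r
# Hypothetical helper (not part of the urlscan package): retry urlscan_result()
# a few times, sleeping between attempts, since scans take a bit to finish.
fetch_when_ready <- function(scan_id, tries = 10, wait = 5) {
  for (i in seq_len(tries)) {
    res <- tryCatch(urlscan_result(scan_id), error = function(e) NULL)
    if (!is.null(res)) return(res)   # analysis finished; hand the result back
    Sys.sleep(wait)                  # give urlscan.io time to finish
  }
  stop("Scan ", scan_id, " was not ready after ", tries, " attempts")
}
```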
rproj_res <- urlscan_result("eb2a5da1-dc0d-43e9-8236-dbc340b53772", include_shot = TRUE)
rproj_res
## URL: https://www.r-project.org/
## Scan ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Malicious: FALSE
## Ad Blocked: FALSE
## Total Links: 8
## Secure Requests: 19
## Secure Req %: 100%
That rproj_res object holds quite a bit of data and makes no assumptions about how you want to use it, so you will need to do some wrangling to dig out what you need. The rproj_res$scan_result entry contains the following elements:
task: Information about the submission: time, method, options, links to screenshot/DOM
page: High-level information about the page: geolocation, IP, PTR
lists: Lists of domains, IPs, URLs, ASNs, servers, hashes
data: All of the requests/responses, links, cookies, messages
meta: Processor output: ASN, GeoIP, AdBlock, Google Safe Browsing
stats: Computed stats (by type, protocol, IP, etc.)
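Those names are easy to confirm by inspecting the object yourself:

```r
# Peek at the top-level structure of the scan result without dumping everything
str(rproj_res$scan_result, max.level = 1)

# or just list the element names
names(rproj_res$scan_result)
```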
Let’s see how many domains the R Core folks are allowing to track you (if you’re not on a legacy Windows OS you can find curlparse at https://git.sr.ht/~hrbrmstr/curlparse or git[la|hu]b/hrbrmstr/curlparse):
curlparse::domain(rproj_res$scan_result$lists$urls) %>% # you can use urltools::domain() instead of curlparse
  table(dnn = "domain") %>%
  broom::tidy() %>%
  arrange(desc(n))
## # A tibble: 7 x 2
## domain n
## <chr> <int>
## 1 platform.twitter.com 7
## 2 www.r-project.org 5
## 3 pbs.twimg.com 3
## 4 syndication.twitter.com 2
## 5 ajax.googleapis.com 1
## 6 cdn.syndication.twimg.com 1
## 7 r-project.org 1
Ironically, this is also how I learned that they allow Twitter to execute JavaScript insecurely in your browser (no subresource integrity and no content security policy). Twitter JavaScript is blocked via multiple means at the hrbrmstr compound, so I couldn’t see the widget myself.
Since I added the include_shot = TRUE option, we also get a page screenshot back (as a magick object):
rproj_res$screenshot
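And, since it comes back as a magick object, anything the magick package can do applies; for example, saving it to disk (the filename here is just an example):

```r
library(magick)

# Write the screenshot out as a PNG (the path is illustrative)
image_write(rproj_res$screenshot, path = "r-project-screenshot.png", format = "png")

# Quick sanity check on the image dimensions
image_info(rproj_res$screenshot)
```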
FIN
There’s tons of metadata to explore about web sites with this package, so jump in, kick the tyres, have fun, and file issues/PRs as needed!