A quick Friday post to let folks know about three in-development R packages that you’re encouraged to kick the tyres of, and also jump in and file issues or PRs for.
Alleviating aversion to versions
I introduced a “version chart” in a recent post, and one key element of tagging years (which are really helpful for getting a feel for scope of exposure and technical/cyber-debt) is knowing the dates of product version releases. You can pay for such a database, but it’s also possible to cobble one together, and that activity will be much easier as time goes on with the vershist package.
Here’s a sample:
apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""
## # ... with 19 more rows
Not all vendor software uses semantic versioning, and many vendors have terrible schemes that make it really hard to create an ordered factor; but when that is possible, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.
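To see why the ordered factor matters, here’s a tiny base-R sketch (hypothetical version strings, not vershist output): plain string comparison mis-orders versions, while an ordered factor built from release order compares chronologically.

```r
# Base-R sketch (hypothetical versions, not vershist's API): an ordered
# factor whose levels follow release order compares chronologically,
# where lexical sorting would put "1.3.12" before "1.3.4".
vers <- c("1.3.0", "1.3.4", "1.3.12", "2.0.0")
f <- factor(vers, levels = vers, ordered = TRUE)
f[f >= "1.3.4"]   # every release at or after 1.3.4, in release order
```

Compare that with `sort(vers)`, which (in a C locale) yields `"1.3.0" "1.3.12" "1.3.4" "2.0.0"` — exactly the mis-ordering an ordered factor avoids.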
It has current support for:
- Apache httpd
- Apple iOS
- Google Chrome
- lighttpd
- memcached
- MongoDB
- MySQL
- nginx
- openresty
- openssh
- sendmail
- SQLite
and I’ll add more over time.
Thanks to @bikesRdata, there will be a …_latest() function for each vendor. I’ll likely add some helper functions so you only need to call one function with a parameter (vs individual ones for each vendor), and will also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).
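That caching layer isn’t written yet, but the idea can be sketched in a few lines of base R (a hypothetical session-level cache, not the package’s eventual API):

```r
# Hypothetical sketch of a session-level cache (not vershist's actual API):
# run the expensive scrape/clone/munge once per vendor, then reuse the result.
cached_versions <- local({
  cache <- new.env(parent = emptyenv())
  function(vendor, fetch) {
    if (!exists(vendor, envir = cache)) assign(vendor, fetch(), envir = cache)
    get(vendor, envir = cache)
  }
})

# e.g. cached_versions("apache", apache_httpd_version_history) would only
# scrape on the first call; later calls return the cached data frame.
```

A persistent (on-disk) cache with an expiry window would work the same way, just swapping the environment for `readRDS()`/`saveRDS()` in a user cache directory.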
And, they call it a MIME…a MIME!
I’ve had the wand package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus, the “magic” database is too big to embed in the package and, due to the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and it really needs regular updates for consistent output on all platforms).
A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues, which spawned a quick release of simplemagic. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” off of a handful of the more common types folks might want to use this package for, and a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.
library(magrittr) # for %>%

list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>%
  purrr::map_df(~{
    dplyr::data_frame( # tibble::tibble() in newer releases
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>%
  tidyr::unnest()
## # A tibble: 85 x 2
##    fil                        mime
##    <chr>                      <chr>
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##  4 test_1.2.class             application/java-vm
##  5 test_1.3.class             application/java-vm
##  6 test_1.4.class             application/java-vm
##  7 test_1.5.class             application/java-vm
##  8 test_128_44_jstereo.mp3    audio/mp3
##  9 test_excel_2000.xls        application/msword
## 10 test_excel_spreadsheet.xml application/xml
## ...
File issues or PRs if you need more header-magic introspected guesses.
NOTE: The rtika package could theoretically do a more comprehensive job, since Apache Tika has an amazing assortment of file-type introspect-ors. Also, an interesting academic exercise might be to collect a sufficient corpus of varying files, pull the first 512–4096 bytes of each, do some feature generation, and write an ML-based classifier for files with a confidence level + MIME-type output.
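The first step of that exercise — grabbing the leading bytes of each file for feature generation — is simple in base R. Here’s a sketch; the file and its PNG signature are fabricated purely for illustration:

```r
# Sketch of the byte-grabbing step for the proposed classifier corpus.
# The file and its magic number are written here only so the example runs.
png_magic <- as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A))
f <- tempfile(fileext = ".png")
writeBin(png_magic, f)

con <- file(f, "rb")
head_bytes <- readBin(con, what = "raw", n = 512)  # first <= 512 bytes
close(con)

# A magic-number check is then just a prefix comparison on those bytes;
# an ML approach would instead turn head_bytes into features (n-grams, entropy, ...)
is_png <- length(head_bytes) >= 8 && identical(head_bytes[1:8], png_magic)
```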
Site promiscuity detection
urlscan.io is a fun site, since it frees you from the tedium (and expense/privacy concerns) of using a javascript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr package, but you have the benefit of a third party making the connection with urlscan.io vs requests coming from your IP space.
I’m waiting on an API key so I can write the “submit a scan request programmatically” function, but—until then—you can retrieve existing sites in their database or manually enter one for later retrieval.
The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:
cnn_db <- urlscan::urlscan_search("domain:cnn.com")
latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)
latest_scan_results$scan_result$lists$ips
## [1] "151.101.65.67" "151.101.113.67" "2.19.34.83"
## [4] "2.20.22.7" "2.16.186.112" "54.192.197.56"
## [7] "151.101.114.202" "83.136.250.242" "157.166.238.142"
## [10] "13.32.217.114" "23.67.129.200" "2.18.234.21"
## [13] "13.32.145.105" "151.101.112.175" "172.217.21.194"
## [16] "52.73.250.52" "172.217.18.162" "216.58.210.2"
## [19] "172.217.23.130" "34.238.24.243" "13.107.21.200"
## [22] "13.32.159.194" "2.18.234.190" "104.244.43.16"
## [25] "54.192.199.124" "95.172.94.57" "138.108.6.20"
## [28] "63.140.33.27" "2.19.43.224" "151.101.114.2"
## [31] "74.201.198.92" "54.76.62.59" "151.101.113.194"
## [34] "2.18.233.186" "216.58.207.70" "95.172.94.20"
## [37] "104.244.42.5" "2.18.234.36" "52.94.218.7"
## [40] "62.67.193.96" "62.67.193.41" "69.172.216.55"
## [43] "13.32.145.124" "50.31.185.52" "54.210.114.183"
## [46] "74.120.149.167" "64.202.112.28" "185.60.216.19"
## [49] "54.192.197.119" "185.60.216.35" "46.137.176.25"
## [52] "52.73.56.77" "178.250.2.67" "54.229.189.67"
## [55] "185.33.223.197" "104.244.42.3" "50.16.188.173"
## [58] "50.16.238.189" "52.59.88.2" "52.38.152.125"
## [61] "185.33.223.80" "216.58.207.65" "2.18.235.40"
## [64] "69.172.216.58" "107.23.150.218" "34.192.246.235"
## [67] "107.23.209.129" "13.32.145.107" "35.157.255.181"
## [70] "34.228.72.179" "69.172.216.111" "34.205.202.95"
latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"
latest_scan_results$scan_result$lists$domains
## [1] "cdn.cnn.com" "edition.i.cdn.cnn.com"
## [3] "edition.cnn.com" "dt.adsafeprotected.com"
## [5] "pixel.adsafeprotected.com" "securepubads.g.doubleclick.net"
## [7] "tpc.googlesyndication.com" "z.moatads.com"
## [9] "mabping.chartbeat.net" "fastlane.rubiconproject.com"
## [11] "b.sharethrough.com" "geo.moatads.com"
## [13] "static.adsafeprotected.com" "beacon.krxd.net"
## [15] "revee.outbrain.com" "smetrics.cnn.com"
## [17] "pagead2.googlesyndication.com" "secure.adnxs.com"
## [19] "0914.global.ssl.fastly.net" "cdn.livefyre.com"
## [21] "logx.optimizely.com" "cdn.krxd.net"
## [23] "s0.2mdn.net" "as-sec.casalemedia.com"
## [25] "errors.client.optimizely.com" "social-login.cnn.com"
## [27] "invocation.combotag.com" "sb.scorecardresearch.com"
## [29] "secure-us.imrworldwide.com" "bat.bing.com"
## [31] "jadserve.postrelease.com" "ssl.cdn.turner.com"
## [33] "cnn.sdk.beemray.com" "static.chartbeat.com"
## [35] "native.sharethrough.com" "www.cnn.com"
## [37] "btlr.sharethrough.com" "platform-cdn.sharethrough.com"
## [39] "pixel.moatads.com" "www.summerhamster.com"
## [41] "mms.cnn.com" "ping.chartbeat.net"
## [43] "analytics.twitter.com" "sharethrough.adnxs.com"
## [45] "match.adsrvr.org" "gum.criteo.com"
## [47] "www.facebook.com" "d3qdfnco3bamip.cloudfront.net"
## [49] "connect.facebook.net" "log.outbrain.com"
## [51] "serve2.combotag.com" "rva.outbrain.com"
## [53] "odb.outbrain.com" "dynaimage.cdn.cnn.com"
## [55] "data.api.cnn.io" "aax.amazon-adsystem.com"
## [57] "cdns.gigya.com" "t.co"
## [59] "pixel.quantserve.com" "ad.doubleclick.net"
## [61] "cdn3.optimizely.com" "w.usabilla.com"
## [63] "amplifypixel.outbrain.com" "tr.outbrain.com"
## [65] "mab.chartbeat.com" "data.cnn.com"
## [67] "widgets.outbrain.com" "secure.quantserve.com"
## [69] "static.ads-twitter.com" "amplify.outbrain.com"
## [71] "tag.bounceexchange.com" "adservice.google.com"
## [73] "adservice.google.com.ua" "www.googletagservices.com"
## [75] "cdn.adsafeprotected.com" "js-sec.indexww.com"
## [77] "ads.rubiconproject.com" "c.amazon-adsystem.com"
## [79] "www.ugdturner.com" "a.postrelease.com"
## [81] "cdn.optimizely.com" "cnn.com"
O_o
FIN
Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.