On MIMEs, software versions and web site promiscuity (a.k.a. three new packages to round out the week)

A quick Friday post to let folks know about three in-development R packages that you’re encouraged to poke the tyres o[fn] and also jump in and file issues or PRs for.

Alleviating aversion to versions

I introduced a “version chart” in a recent post and one key element of tagging years (which are really helpful to get a feel for scope of exposure + technical/cyber-debt) is knowing the dates of product version releases. You can pay for such a database but it’s also possible to cobble one together, and that activity will be much easier as time goes on with the vershist? package.

Here’s a sample:

apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""   
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""   
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""   
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""   
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""   
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""   
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""   
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""   
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""   
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""   
## # ... with 19 more rows

Not all vendored-software uses semantic versioning and many have terrible schemes that make it really hard to create an ordered factor, but when that is possible, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.

It has current support for:

Apache httpd
Apple iOS
Google Chrome
lighttpd
memcached
MongoDB
MySQL
nginx
openresty
openssh
sendmail
SQLite

and I’ll add more over time.

Thanks to @bikesRdata there will be a …_latest() function for each vendor and I’ll likely add some helper functions so you only need to call one function with a parameter vs individual ones for each version and will also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).

And, they all it a MIME…a MIME!

I’ve had the wand? package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus the “magic” database is too big to embed in the package and due to the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and it really needs regular updates for consistent output on all platforms).

A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues which spawned a quick release of simplemagic?. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” off of a handful of the more common types folks might want to use this package for and a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.

list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>% 
  purrr::map_df(~{
    dplyr::data_frame(
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>% 
  tidyr::unnest()
## # A tibble: 85 x 2
##    fil                        mime                                                                     
##    <chr>                      <chr>                                                                    
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  4 test_1.2.class             application/java-vm                                                      
##  5 test_1.3.class             application/java-vm                                                      
##  6 test_1.4.class             application/java-vm                                                      
##  7 test_1.5.class             application/java-vm                                                      
##  8 test_128_44_jstereo.mp3    audio/mp3                                                                
##  9 test_excel_2000.xls        application/msword                                                       
## 10 test_excel_spreadsheet.xml application/xml      
## ...

File issues or PRs if you need more header-magic introspected guesses.

NOTE: The rtika? package could theoretically do a more comprehensive job since Apache Tika has an amazing assortment of file-type introspect-ors. Also, an interesting academic exercise might be to collect a sufficient corpus of varying files, pull the first 512-4096 bytes of each, do some feature generation and write an ML-based classifier for files with a confidence level + MIME-type output.

Site promiscuity detection

urlscan is a fun site since it frees you from the tedium (and expense/privacy-concerns) of using a javascript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr? package, but you have the benefit of a third-party making the connection with urlscan.io vs requests coming from your IP space.

I’m waiting on an API key so I can write the “submit a scan request programmatically” function, but—until then—you can retrieve existing sites in their database or manually enter one for later retrieval.

The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:

cnn_db <- urlscan::urlscan_search("domain:cnn.com")

latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)

latest_scan_results$scan_result$lists$ips
##  [1] "151.101.65.67"   "151.101.113.67"  "2.19.34.83"     
##  [4] "2.20.22.7"       "2.16.186.112"    "54.192.197.56"  
##  [7] "151.101.114.202" "83.136.250.242"  "157.166.238.142"
## [10] "13.32.217.114"   "23.67.129.200"   "2.18.234.21"    
## [13] "13.32.145.105"   "151.101.112.175" "172.217.21.194" 
## [16] "52.73.250.52"    "172.217.18.162"  "216.58.210.2"   
## [19] "172.217.23.130"  "34.238.24.243"   "13.107.21.200"  
## [22] "13.32.159.194"   "2.18.234.190"    "104.244.43.16"  
## [25] "54.192.199.124"  "95.172.94.57"    "138.108.6.20"   
## [28] "63.140.33.27"    "2.19.43.224"     "151.101.114.2"  
## [31] "74.201.198.92"   "54.76.62.59"     "151.101.113.194"
## [34] "2.18.233.186"    "216.58.207.70"   "95.172.94.20"   
## [37] "104.244.42.5"    "2.18.234.36"     "52.94.218.7"    
## [40] "62.67.193.96"    "62.67.193.41"    "69.172.216.55"  
## [43] "13.32.145.124"   "50.31.185.52"    "54.210.114.183" 
## [46] "74.120.149.167"  "64.202.112.28"   "185.60.216.19"  
## [49] "54.192.197.119"  "185.60.216.35"   "46.137.176.25"  
## [52] "52.73.56.77"     "178.250.2.67"    "54.229.189.67"  
## [55] "185.33.223.197"  "104.244.42.3"    "50.16.188.173"  
## [58] "50.16.238.189"   "52.59.88.2"      "52.38.152.125"  
## [61] "185.33.223.80"   "216.58.207.65"   "2.18.235.40"    
## [64] "69.172.216.58"   "107.23.150.218"  "34.192.246.235" 
## [67] "107.23.209.129"  "13.32.145.107"   "35.157.255.181" 
## [70] "34.228.72.179"   "69.172.216.111"  "34.205.202.95"

latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"

latest_scan_results$scan_result$lists$domains
##  [1] "cdn.cnn.com"                    "edition.i.cdn.cnn.com"         
##  [3] "edition.cnn.com"                "dt.adsafeprotected.com"        
##  [5] "pixel.adsafeprotected.com"      "securepubads.g.doubleclick.net"
##  [7] "tpc.googlesyndication.com"      "z.moatads.com"                 
##  [9] "mabping.chartbeat.net"          "fastlane.rubiconproject.com"   
## [11] "b.sharethrough.com"             "geo.moatads.com"               
## [13] "static.adsafeprotected.com"     "beacon.krxd.net"               
## [15] "revee.outbrain.com"             "smetrics.cnn.com"              
## [17] "pagead2.googlesyndication.com"  "secure.adnxs.com"              
## [19] "0914.global.ssl.fastly.net"     "cdn.livefyre.com"              
## [21] "logx.optimizely.com"            "cdn.krxd.net"                  
## [23] "s0.2mdn.net"                    "as-sec.casalemedia.com"        
## [25] "errors.client.optimizely.com"   "social-login.cnn.com"          
## [27] "invocation.combotag.com"        "sb.scorecardresearch.com"      
## [29] "secure-us.imrworldwide.com"     "bat.bing.com"                  
## [31] "jadserve.postrelease.com"       "ssl.cdn.turner.com"            
## [33] "cnn.sdk.beemray.com"            "static.chartbeat.com"          
## [35] "native.sharethrough.com"        "www.cnn.com"                   
## [37] "btlr.sharethrough.com"          "platform-cdn.sharethrough.com" 
## [39] "pixel.moatads.com"              "www.summerhamster.com"         
## [41] "mms.cnn.com"                    "ping.chartbeat.net"            
## [43] "analytics.twitter.com"          "sharethrough.adnxs.com"        
## [45] "match.adsrvr.org"               "gum.criteo.com"                
## [47] "www.facebook.com"               "d3qdfnco3bamip.cloudfront.net" 
## [49] "connect.facebook.net"           "log.outbrain.com"              
## [51] "serve2.combotag.com"            "rva.outbrain.com"              
## [53] "odb.outbrain.com"               "dynaimage.cdn.cnn.com"         
## [55] "data.api.cnn.io"                "aax.amazon-adsystem.com"       
## [57] "cdns.gigya.com"                 "t.co"                          
## [59] "pixel.quantserve.com"           "ad.doubleclick.net"            
## [61] "cdn3.optimizely.com"            "w.usabilla.com"                
## [63] "amplifypixel.outbrain.com"      "tr.outbrain.com"               
## [65] "mab.chartbeat.com"              "data.cnn.com"                  
## [67] "widgets.outbrain.com"           "secure.quantserve.com"         
## [69] "static.ads-twitter.com"         "amplify.outbrain.com"          
## [71] "tag.bounceexchange.com"         "adservice.google.com"          
## [73] "adservice.google.com.ua"        "www.googletagservices.com"     
## [75] "cdn.adsafeprotected.com"        "js-sec.indexww.com"            
## [77] "ads.rubiconproject.com"         "c.amazon-adsystem.com"         
## [79] "www.ugdturner.com"              "a.postrelease.com"             
## [81] "cdn.optimizely.com"             "cnn.com"

O_o

FIN

Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.

2 Trackbacks/Pingbacks

By Two New Testing-themed Addins + One New and One Updated CRAN Package | rud.is on 27 Jul 2019 at 7:47 am

[…] First up is a full rewrite of the {wand} pacakge which still uses magic but is 100% R code (vs a mix of compiled C and R code) and now works on Windows. A newer version will be up on CRAN in a bit that has additional MIME mappings and enables specifying a custom extension mapping data frame. You’ve seen this beast on the blog before, though, by another name. […]
By Two New Testing-themed Addins + One New and One Updated CRAN Package – Data Science Austria on 28 Jul 2019 at 10:47 am

[…] First up is a full rewrite of the {wand} pacakge which still uses magic but is 100% R code (vs a mix of compiled C and R code) and now works on Windows. A newer version will be up on CRAN in a bit that has additional MIME mappings and enables specifying a custom extension mapping data frame. You’ve seen this beast on the blog before, though, by another name. […]

rud.is