
There’s a yuge chance you’re reading this post (at least initially) on R-Bloggers right now (though you should also check out R Weekly and add their live feed to your RSS reader pronto!). It’s a central “watering hole” for R folks and is read by many (IIRC over 20,000 Feedly users have it in their OPML).

I’m addicted to Feedly and waited years for them to publish their API. They have, and there will eventually be a package for it (go for it if you want to get’er done before me since I won’t have time to do it justice for a while). As just parenthetically noted, I’ve started work on one and have scaffolded just enough to give R folks a present: almost 5 years of R-Bloggers data (posts, engagement rates, authors, etc.). But, you’ll have to put up with some exposition, first.

Digging In

We’ll need some packages to help with this exposition and extraction. Plus, you’ll need to go to https://developer.feedly.com/ to get your developer token (NOTE: this requires a “Pro” account, or a regular account and manually doing the OAuth dance to get an access token; any final “Feedly package” by myself or others will likely use OAuth) and store it in your ~/.Renviron as FEEDLY_ACCESS_TOKEN.
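
If you want to sanity-check that last bit before running anything below, here’s a minimal check (the .Renviron line is just the expected format, not something you can copy verbatim):

# In ~/.Renviron (restart R after editing):
#   FEEDLY_ACCESS_TOKEN=your-token-here

# the rest of the post assumes this returns TRUE
nzchar(Sys.getenv("FEEDLY_ACCESS_TOKEN"))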

I’ve sliced and diced bits from the (non-published) fledgling package to give a peek behind the API covers. There’s plenty of exposition in the following code block comment header to describe what it does:

#' Simplifying some example package setup for this non-pkg example
.pkgenv <- new.env(parent=emptyenv())
.pkgenv$token <- Sys.getenv("FEEDLY_ACCESS_TOKEN")

#' In reality, this is more complex since the non-toy example has to
#' refresh tokens when they expire.
.feedly_token <- function() {
  return(.pkgenv$token)
}

#' Get a chunk of a Feedly "stream"
#'
#' For the purposes of this short example, consider a
#' "stream" to be all the historical items in a feed.
#' (Note: the definition is more complex than that)
#'
#' Max "page size" (mad numbner of items returned in a single call)
#' is 1,000. For example simplicity, there's a blanket assumption
#' that if `continuation` is actually present, the caller is
#' savvy and asked for a large number of items (e.g. 10,000).
#' Therefore, assume we're paging by the thousands.
#'
#' @md
#' @param stream_id the id of the stream (for this example, a feed id)
#' @param ct number of items to retrieve (API will only return 1,000
#'        items for a single response and populate `continuation`
#'        with a value that should be passed to subsequent calls
#'        to page through the results; `ct` will be reset to 1,000
#'        internally if this is the case)
#' @param continuation see `ct`
#' @references <https://developer.feedly.com/v3/streams/>
#' @return for this example, an ugly `list`
feedly_stream <- function(stream_id, ct=100L, continuation=NULL) {

  ct <- as.integer(ct)

  if (!is.null(continuation)) ct <- 1000L

  httr::GET(
    url = "https://cloud.feedly.com/v3/streams/contents",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    ),
    query = list(
      streamId = stream_id,
      count = ct,
      continuation = continuation
    )
  ) -> res

  httr::stop_for_status(res)

  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)

  res

}

We’ll grab 10,000 Feedly entries for the R-Bloggers feed stream:

r_bloggers_feed_id <- "feed/http://feeds.feedburner.com/RBloggers"

rb_stream <- feedly_stream(r_bloggers_feed_id, 10000L)

# preallocate space
streams <- vector("list", 10)
streams[1L] <- list(rb_stream)

# gotta catch'em all!
idx <- 2L
while(length(rb_stream$continuation) > 0) {
  cat(".", sep="") # poor dude's progress par
  feedly_stream(
    stream_id = r_bloggers_feed_id,
    ct = 1000L,
    continuation = rb_stream$continuation
  ) -> rb_stream
  streams[idx] <- list(rb_stream)
  idx <- idx + 1L
}
cat("\n")

For those who aren’t used to piecing together bits from APIs like this (and for those who don’t have a Pro account, didn’t want to write OAuth code, or don’t use Feedly and so can’t reproduce the example), here’s some dissection:

str(streams, 1)
## List of 12
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 6 # No "continuation" in this one

str(streams[[1]], 1)
## List of 7
##  $ id          : chr "feed/http://feeds.feedburner.com/RBloggers"
##  $ title       : chr "R-bloggers"
##  $ direction   : chr "ltr"
##  $ updated     : num 1.52e+12
##  $ alternate   :'data.frame':	1 obs. of  2 variables:
##  $ continuation: chr "15f457e2b66:160d6e:8cbd7d4f"
##  $ items       :'data.frame':	1000 obs. of  22 variables:

glimpse(streams[[1]]$items)
## Observations: 1,000
## Variables: 22
## $ id             <chr> "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP10=_1628f55fc26:7feb...
## $ keywords       <list> ["R bloggers", "R bloggers", "R bloggers", "R bloggers", "R b...
## $ originId       <chr> "https://tjmahr.github.io/ridgelines-in-bayesplot-1-5-0-releas...
## $ fingerprint    <chr> "f96c93f7", "9b2344db", "ca3762c8", "980635d0", "fbd60fac", "6...
## $ content        <data.frame> c("<p><div><div><div><div data-show-faces=\"false\" dat...
## $ title          <chr> "Ridgelines in bayesplot 1.5.0", "Mathematical art in R", "R a...
## $ published      <dbl> 1.522732e+12, 1.522796e+12, 1.522714e+12, 1.522714e+12, 1.5227...
## $ crawled        <dbl> 1.522823e+12, 1.522809e+12, 1.522794e+12, 1.522793e+12, 1.5227...
## $ canonical      <list> [<https://www.r-bloggers.com/ridgelines-in-bayesplot-1-5-0/, ...
## $ origin         <data.frame> c("feed/http://feeds.feedburner.com/RBloggers", "feed/h...
## $ author         <chr> "Higher Order Functions", "David Smith", "R Views", "rOpenSci ...
## $ alternate      <list> [<http://feedproxy.google.com/~r/RBloggers/~3/O5DIWloFJO8/, t...
## $ summary        <data.frame> c("At the end of March, Jonah Gabry and I released\nbay...
## $ visual         <data.frame> c("feedly-nikon-v3.1", "feedly-nikon-v3.1", "feedly-nik...
## $ unread         <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories     <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c3d/category/big data...
## $ engagement     <int> 9, 37, 52, 15, 78, 35, 31, 9, 28, 2, 21, 8, 25, 11, 21, 29, 12...
## $ engagementRate <dbl> 0.41, 1.37, 1.58, 0.45, 2.23, 0.97, 0.84, 0.23, 0.72, 0.05, 0....
## $ recrawled      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1.522807e+12, NA, NA, NA, NA, ...
## $ tags           <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
## $ decorations    <data.frame> c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",...
## $ enclosure      <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...

The structure of those entries is defined in the Feedly API docs.

We’ll extract the bits we want to use for the rest of the post and clean it up a bit:

map_df(streams, ~{
  select(.x$items, title, author, published, engagement) %>%
    mutate(published = anytime::anydate(published / 1000)) %>% # overly-high-resolution timestamp
    tbl_df()
}) -> xdf

glimpse(xdf)
## Observations: 11,421
## Variables: 4
## $ title      <chr> "Ridgelines in bayesplot 1.5.0", "Mathematical art in R", "R and T...
## $ author     <chr> "Higher Order Functions", "David Smith", "R Views", "rOpenSci - op...
## $ published  <date> 2018-04-03, 2018-04-03, 2018-04-02, 2018-04-02, 2018-04-03, 2018-...
## $ engagement <int> 9, 37, 52, 15, 78, 35, 31, 9, 28, 2, 21, 8, 25, 11, 21, 29, 12, 11...

Using an arbitrary “10,000” extract didn’t give us full months:

range(xdf$published)
## [1] "2013-05-31" "2018-04-03"

so we’ll filter out the incomplete bits and add in some additional temporal metadata:

xdf %>%
  filter(
    published > as.Date("2013-05-31"),  # complete months
    published < as.Date("2018-04-01")
  ) %>%
  mutate(
    year = as.integer(lubridate::year(published)),
    month = lubridate::month(published, label=TRUE, abbr=TRUE),
    wday = lubridate::wday(published, label=TRUE, abbr=TRUE),
    ym = as.Date(format(published, "%Y-%m-01"))
  ) -> xdf

I’m only going to do some light analysis work with engagement data (how “popular” a post was) but the full post summary and body content is available in the data dump you’re going to get at the end (this is reminding me of the Sesame Street “Monster at the End of This Book” story). That means enterprising folk can do some tidy text mining to cluster away some additional insights.
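
If you want a head start on that, here’s a minimal sketch, assuming the nested `summary` data frame in each stream’s `items` keeps the `content` column shown in the glimpse() above (some items may lack a summary, so expect to add some guards):

library(tidytext)
library(tidyverse)

# tokenize the post summaries and count non-stopword terms
map_df(streams, ~{
  data_frame(
    title   = .x$items$title,
    summary = .x$items$summary$content
  )
}) %>%
  unnest_tokens(word, summary) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)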

Thankfully, there’s not a ton of missing engagement data:

sum(is.na(xdf$engagement)) / nrow(xdf)
## [1] 0.06506849

broom::tidy(summary(xdf$engagement))
##   minimum q1 median     mean q3 maximum  na
## 1       0  5     20 69.27219 75    4785 741

Let’s look at post count over time, first:

count(xdf, ym) %>%
  arrange(ym) %>%
  ggplot(aes(ym, n)) +
  ggforce::geom_bspline0(color="lightslategray") +
  scale_x_date(expand=c(0,0.5)) +
  labs(
    x=NULL, y="Post count",
    title="R-Bloggers Post Count",
    subtitle="June 2013 — March 2018"
  ) +
  theme_ipsum_ps(grid="XY")

It’ll be interesting to watch that over this year and compare 2017 to 2018 given how “hot” 2017 seems to have been. To turn a Mythbuster phrase: a neat “try this at home” exercise would be to tease out some “whys” for various spikes (which likely means some post content spelunking).
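
Here’s a hedged starting point for that exercise (the month below is just a placeholder; pick a spike from the chart):

# find the busiest months, then eyeball what was posted in them
count(xdf, ym, sort = TRUE) %>%
  head(3)

filter(xdf, ym == as.Date("2017-01-01")) %>%  # substitute a spike month from above
  arrange(desc(engagement)) %>%
  select(published, author, title, engagement)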

Let’s see if any days are more popular than others:

count(xdf, wday) %>%
  ggplot(aes(wday, n)) +
  geom_col(fill="lightslategray", width=0.65) +
  scale_y_comma() +
  labs(
    x=NULL, y="Post count",
    title="R-Bloggers Aggregate Post Count By Day of Week"
  ) +
  theme_ipsum_ps(grid="Y")

Weekends are sleepy and there are some “go-getters” at the beginning of the week. More “try this at home” would be to see if any individuals have “patterns” by day of week (or even time of day, since that’s also available in the published time stamp).
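
Since xdf collapsed published down to a date, a time-of-day look needs to go back to the raw streams. A rough sketch (assuming the epoch-millisecond timestamps shown in the glimpse() above; tidyverse loaded as elsewhere in the post):

map_df(streams, ~{
  data_frame(
    author    = .x$items$author,
    published = anytime::anytime(.x$items$published / 1000)
  )
}) %>%
  mutate(hour = lubridate::hour(published)) %>%
  count(hour) %>%
  ggplot(aes(hour, n)) +
  geom_col(fill="lightslategray") +
  labs(x="Hour of day", y="Post count")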

The summary() above told us we have a pretty skewed engagement distribution, but it’s always nice to visualise just how bad it is:

ggplot(xdf, aes(engagement)) +
  geom_density(aes(y=calc(count)), fill="lightslategray", alpha=2/3) +
  scale_x_comma() +
  scale_y_comma() +
  labs(
    x=NULL, y="Engagement",
    title = "R-Bloggers Post Engagement Distribution",
    subtitle = "June 2013 — March 2018"
  ) +
  theme_ipsum_ps(grid="XY")

That graph is the story of my daily life dealing with internet data. Couldn’t even get a break when trying to have some fun. #sigh

We’ll close with the “all time top 10” based on total engagement:

count(xdf, author, wt=engagement, sort=TRUE)
## # A tibble: 1,065 x 2
##    author               n
##  1 David Smith      87381
##  2 Tal Galili       29302
##  3 Joseph Rickert   16846
##  4 DataCamp Blog    14402
##  5 DataCamp         14208
##  6 John Mount       13274
##  7 Francis Smart     8506
##  8 hadleywickham     8129
##  9 hrbrmstr          7855
## 10 Sharp Sight Labs  7620
## # ... with 1,055 more rows

@revodavid is a blogging machine, and that top-spot is well-deserved given the plethora of interesting, useful and fun content he shares. And, it looks like someone only needs to blog a bit more this year to overtake @hadley (I’m comin’ fer ya, Hadley!).

FIN

As promised, you can get the data in a ~30MB RDS file via https://rud.is/dl/r-bloggers-feedly-streams.rds and can then use the extraction-to-data-frame example from above to work with the bits you care about.
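
Assuming the RDS holds the same streams list built above (which is what the extraction example expects), getting back to a usable data frame looks like:

# after downloading the RDS to your working directory
streams <- readRDS("r-bloggers-feedly-streams.rds")

map_df(streams, ~{
  select(.x$items, title, author, published, engagement) %>%
    mutate(published = anytime::anydate(published / 1000)) %>%
    tbl_df()
}) -> xdf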

Hopefully folks will have some fun with this and share their results!

The 2018 IEEE Security & Privacy Conference is in May but they’ve posted their full proceedings and it’s better to grab them early than to wait for it to become part of a paid journal offering.

There are a lot of papers. Not all match my interests but (fortunately?) many did and I’ve filtered down a list of the more interesting (to me) ones. It’s encouraging to see academic cybersecurity researchers branching out across a whole host of areas.

I can’t promise a “the morning paper”-esque daily treatment of these on the blog but I’ll likely exposit a few of them over the coming weeks. I’ve emoji’d a few that stood out. Order is the order I read them in (no other meaning to the order).

A while back, Medium blogger ‘Nykolas Z’ posted results from a globally distributed DNS resolver test to find the speediest provider (NOTE: speed is not the only consideration when choosing an alternative DNS provider). While the test methodology is not provided (the “scientific method” has yet to fully penetrate “cyber”) the data is provided… in text form in <blockquote>s. O_o
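
If you want to replicate the extraction, here’s a hypothetical sketch with rvest (the URL is a placeholder for Nykolas’ Medium post, and you’d still have to hand-munge the resulting text into columns):

library(rvest)

# placeholder URL; substitute the actual Medium post
pg <- read_html("https://medium.com/@nykolas.z/REPLACE-WITH-POST-SLUG")

# pull the raw text out of the <blockquote> elements
html_nodes(pg, "blockquote") %>%
  html_text(trim = TRUE)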

While Nykolas ranked them, a visual comparison teases out some interesting differences between the providers. However, Cloudflare seems to be the clear winner (click/tap chart for larger version):

I’m going to give Cloudflare a few weeks to “settle in”, set up a series of geographically distributed RIPE Atlas probes for them and the others on the list Nykolas provided, then measure them with the same probe sets and frequencies for a few months and report back.

Some enterprising internet explorers have already begun monitoring 1.1.1.1 (that link may take a few seconds to show data since it performs a live search; a screen shot of the first page of results is below).

You have to have been living under a rock to not know about Cloudflare’s new 1.1.1.1 DNS offering. I won’t go into “privacy”, “security” or “speed” concepts in this post since that’s a pretty huge topic to distill for folks given the, now, plethora of confusing (and pretty technical) options that exist to support one or more of those goals.

Instead, I’ll remind R folks about the gdns package which provides a query interface to Google’s DNS-over-HTTPS JSON API and announce dnsflare, which wraps the new and similar offering by Cloudflare. In fact, Cloudflare adopted Google’s response format so they’re pretty interchangeable:

str(gdns::query("r-project.org"))
## List of 10
##  $ Status            : int 0
##  $ TC                : logi FALSE
##  $ RD                : logi TRUE
##  $ RA                : logi TRUE
##  $ AD                : logi FALSE
##  $ CD                : logi FALSE
##  $ Question          :'data.frame': 1 obs. of  2 variables:
##   ..$ name: chr "r-project.org."
##   ..$ type: int 1
##  $ Answer            :'data.frame': 1 obs. of  4 variables:
##   ..$ name: chr "r-project.org."
##   ..$ type: int 1
##   ..$ TTL : int 2095
##   ..$ data: chr "137.208.57.37"
##  $ Additional        : list()
##  $ edns_client_subnet: chr "0.0.0.0/0"

str(dnsflare::query("r-project.org"))
## List of 8
##  $ Status  : int 0
##  $ TC      : logi FALSE
##  $ RD      : logi TRUE
##  $ RA      : logi TRUE
##  $ AD      : logi FALSE
##  $ CD      : logi FALSE
##  $ Question:'data.frame': 1 obs. of  2 variables:
##   ..$ name: chr "r-project.org."
##   ..$ type: int 1
##  $ Answer  :'data.frame': 1 obs. of  4 variables:
##   ..$ name: chr "r-project.org."
##   ..$ type: int 1
##   ..$ TTL : int 1420
##   ..$ data: chr "137.208.57.37"
##  - attr(*, "class")= chr "cf_dns_result"

The packages are primarily of use for internet researchers who need to lookup DNS-y things either as a data source in-and-of itself or to add metadata to names or IP addresses in other data sets.
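
For example, here’s a quick sketch of tacking A-record metadata onto a handful of domains, relying on the `Answer` data frame structure shown above (CNAME chains and empty answers will need extra handling):

library(tidyverse)

domains <- c("r-project.org", "rud.is", "cran.r-project.org")

map_df(domains, ~{
  ans <- gdns::query(.x)$Answer
  if (is.data.frame(ans) && nrow(ans) > 0) {
    data_frame(domain = .x, ip = ans$data, ttl = ans$TTL)
  } else {
    data_frame(domain = .x, ip = NA_character_, ttl = NA_integer_)
  }
})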

I need to do some work on ensuring they both are on-par feature-wise (named-classes, similar print and batch query methods, etc) and should, perhaps, consider retiring gdns in favour of a new meta-DNS package that wraps all of these since I suspect all the cool kids will be setting these up, soon. (Naming suggestions welcome!)

There’s also getdns, which has very little but stub test code in it (for now) since it was unclear how quickly these new, modern DNS services would take off. But, since they have, that project will be revisited this year (jump in if ye like!) as it is (roughly) a “non-JSON” version of what gdns and dnsflare are.

If you know of other, similar services that can be wrapped, drop a note in the comments or as an issue on one of those repos and also file an issue there if you have preferred response formats or have functionality you’d like implemented.

The fs package makes it super quick and easy to find out just how much “package hoarding” you’ve been doing:

library(fs)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")
library(igraph) 
library(ggraph) # devtools::install_github("thomasp85/ggraph")
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(tidyverse)

installed.packages() %>%
  as_data_frame() %>%
  mutate(pkg_dir = sprintf("%s/%s", LibPath, Package)) %>%
  select(pkg_dir) %>%
  mutate(pkg_dir_size = map_dbl(pkg_dir, ~{
    fs::dir_info(.x, all=TRUE, recursive=TRUE) %>%
      summarise(tot_dir_size = sum(size)) %>% 
      pull(tot_dir_size)
  })) %>% 
  summarise(
    total_size_of_all_installed_packages=ggalt::Gb(sum(pkg_dir_size))
  ) %>% 
  unlist()
## total_size_of_all_installed_packages 
##                             "1.6 Gb"

While you can modify the above and peruse the list of packages/directories in tabular format or programmatically, you can also do a bit more work to get a visual overview of package size (click/tap the image for a larger view):

installed.packages() %>%
  as_data_frame() %>%
  mutate(pkg_dir = sprintf("%s/%s", LibPath, Package)) %>%
  mutate(dir_info = map(pkg_dir, fs::dir_info, all=TRUE, recursive=TRUE)) %>% 
  mutate(dir_size = map_dbl(dir_info, ~sum(.x$size))) -> xdf

select(xdf, Package, dir_size) %>% 
  mutate(grp = "ROOT") %>% 
  add_row(grp = "ROOT", Package="ROOT", dir_size=0) %>% 
  select(grp, Package, dir_size) %>% 
  arrange(desc(dir_size)) -> gdf

select(gdf, -grp) %>% 
  mutate(lab = sprintf("%s\n(%s)", Package, ggalt::Mb(dir_size))) %>% 
  mutate(lab = ifelse(dir_size > 1500000, lab, "")) -> vdf

g <- graph_from_data_frame(gdf, vertices=vdf)

ggraph(g, "treemap", weight=dir_size) +
  geom_node_tile(fill="lightslategray", size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=dir_size), 
    color="#cccccc", family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans="sqrt", range = c(0.5, 8)) +
  ggraph::theme_graph(base_family = font_ps) +
  theme(legend.position="none")

treemap of package disk consumption

Challenge

Do some wrangling with the above data and turn it into a package “disk explorer” with @timelyportfolio’s d3treeR package.
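
One possible starting point, assuming d3treeR’s d3tree2() accepts treemap::treemap() output the way its README shows (treat the rootname argument as an assumption, too):

library(treemap)
library(d3treeR) # devtools::install_github("timelyportfolio/d3treeR")

# build (but don't draw) a treemap object from the package-size data frame above
select(xdf, Package, dir_size) %>%
  mutate(dir_size = as.numeric(dir_size)) %>%
  as.data.frame() %>%
  treemap(index = "Package", vSize = "dir_size", draw = FALSE) -> tm

# hand it to d3treeR for an interactive, zoomable version
d3tree2(tm, rootname = "Installed packages")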

(R⁶ == brief, low-expository posts)

@yoniceedee suggested I look at the Cambridge Analytica “whistleblower” testimony proceedings:

I value the resources @yoniceedee tosses my way (they often send me down twisted paths like this one, though :-) but I really dislike spending any amount of time on YouTube and can consume text content much faster than even accelerated video playback.

Google auto-generated captions for that video and you can display them by clicking below the video on the right and enabling the transcript, which slowly (well, in my frame of reference) loads into the upper-right. That’s still sub-optimal since we need to be on the YouTube page to read/scroll. There’s no “export” option, so my initial instinct was to go to Developer Tools, look for the https://www.youtube.com/service_ajax?name=getTranscriptEndpoint URL, “Copy the Response” to the clipboard, save it to a file, then do some JSON/list wrangling (the transcript JSON URL is in the snippet below):

library(tidyverse)

trscrpt <- jsonlite::fromJSON("https://rud.is/dl/ca-transcript.json")

runs <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$formattedStartOffset$runs
cues <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$cues

data_frame(
  mark = map_chr(runs, ~.x$text),
  text = map_chr(cues, ~.x$transcriptCueRenderer$cue$runs[[1]]$text)  
) %>% 
  separate(mark, c("minute", "second"), sep=":", remove = FALSE, convert = TRUE) 
## # A tibble: 3,247 x 4
##    mark  minute second text                                    
##    <chr>  <int>  <int> <chr>                                   
##  1 00:00      0      0 all sort of yeah web of things if it's a
##  2 00:02      0      2 franchise then there's a kind of        
##  3 00:03      0      3 ultimately there's a there's a there's a
##  4 00:05      0      5 coordinator of that franchise or someone
##  5 00:07      0      7 who's a you got a that franchise is well
##  6 00:09      0      9 well when I was there that was Alexander
##  7 00:13      0     13 Nixon Steve banning but that's that's a 
##  8 00:16      0     16 question you should be asking aiq yeah  
##  9 00:18      0     18 yeah and just got to a IQ and the GSR   
## 10 00:24      0     24 state from gts-r that's other Hogan data
## # ... with 3,237 more rows

But, then I remembered YouTube has an API for this and threw together a quick script to grab them that way as well:

# the API needs these scopes

c(
  "https://www.googleapis.com/auth/youtube.force-ssl",
  "https://www.googleapis.com/auth/youtubepartner"
) -> scope_list

# oauth dance

httr::oauth_app(
  appname = "google",
  key = Sys.getenv("GOOGLE_APP_SECRET"),
  secret = Sys.getenv("GOOGLE_APP_KEY")
) -> captions_app

httr::oauth2.0_token(
  endpoint = httr::oauth_endpoints("google"),
  app = captions_app,
  scope = scope_list,
  cache = TRUE
) -> google_token

# list the available captions for this video
# (captions can be in one or more languages)

httr::GET(
  url = "https://www.googleapis.com/youtube/v3/captions",
  query = list(
    part = "snippet",
    videoId = "f2Sxob3fl0k" # the v=string in the YouTube URL
  ),
  httr::config(token = google_token)
) -> caps_list

# I'm cheating since I know there's only one but you'd want
# to introspect `caps_list` before blindly doing this for 
# other videos.

httr::GET(
  url = sprintf(
    "https://www.googleapis.com/youtube/v3/captions/%s",
    httr::content(caps_list)$items[[1]]$id
  ),
  httr::config(token = google_token)
) -> caps

# strangely enough, the JSON response "feels" better than this
# one, though this is a standard format that's parseable quite well.

cat(rawToChar(httr::content(caps)))
## 0:00:00.000,0:00:03.659
## all sort of yeah web of things if it's a
## 
## 0:00:02.490,0:00:05.819
## franchise then there's a kind of
## 
## 0:00:03.659,0:00:07.589
## ultimately there's a there's a there's a
## 
## 0:00:05.819,0:00:09.660
## coordinator of that franchise or someone
## 
## 0:00:07.589,0:00:13.139
## who's a you got a that franchise is well
## 
## 0:00:09.660,0:00:16.230
## well when I was there that was Alexander
## ...

Neither a reflection on active memory nor a quick Duck Duck Go search (I try not to use Google Search anymore) seemed to point to an existing R resource for this, hence the quick post in the event the snippet is helpful to anyone else.

If you do know of an R package/snippet that does this already, please shoot a note into the comments so others can find it.

A quick Friday post to let folks know about three in-development R packages that you’re encouraged to poke the tyres of and also jump in and file issues or PRs for.

Alleviating aversion to versions

I introduced a “version chart” in a recent post and one key element of tagging years (which are really helpful to get a feel for scope of exposure + technical/cyber-debt) is knowing the dates of product version releases. You can pay for such a database but it’s also possible to cobble one together, and that activity will be much easier as time goes on with the vershist package.

Here’s a sample:

apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""   
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""   
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""   
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""   
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""   
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""   
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""   
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""   
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""   
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""   
## # ... with 19 more rows

Not all vendored-software uses semantic versioning and many have terrible schemes that make it really hard to create an ordered factor, but when that is possible, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.
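
As a tiny example of that, the release-year column and the ordered vers factor from the output above make quick work of questions like “how many releases per year?” or “which of two versions is older?” (this assumes the in-development vershist package is installed):

library(tidyverse)

vh <- apache_httpd_version_history()

# releases per year, straight from the columns shown above
count(vh, rls_year)

# the ordered factor means plain comparisons respect release order
vh$vers[2] > vh$vers[1] # 1.3.1 sorts after 1.3.0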

It has current support for:

  • Apache httpd
  • Apple iOS
  • Google Chrome
  • lighttpd
  • memcached
  • MongoDB
  • MySQL
  • nginx
  • openresty
  • openssh
  • sendmail
  • SQLite

and I’ll add more over time.

Thanks to @bikesRdata there will be a …_latest() function for each vendor and I’ll likely add some helper functions so you only need to call one function with a parameter vs individual ones for each vendor, and will also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).

And, they call it a MIME…a MIME!

I’ve had the wand package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus, the “magic” database is too big to embed in the package and, due to the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and it really needs regular updates for consistent output on all platforms).

A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues, which spawned a quick release of simplemagic. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” off of a handful of the more common types folks might want to use this package for and a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.

list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>% 
  purrr::map_df(~{
    dplyr::data_frame(
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>% 
  tidyr::unnest()
## # A tibble: 85 x 2
##    fil                        mime                                                                     
##    <chr>                      <chr>                                                                    
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  4 test_1.2.class             application/java-vm                                                      
##  5 test_1.3.class             application/java-vm                                                      
##  6 test_1.4.class             application/java-vm                                                      
##  7 test_1.5.class             application/java-vm                                                      
##  8 test_128_44_jstereo.mp3    audio/mp3                                                                
##  9 test_excel_2000.xls        application/msword                                                       
## 10 test_excel_spreadsheet.xml application/xml      
## ...

File issues or PRs if you need more header-magic introspected guesses.

NOTE: The rtika package could theoretically do a more comprehensive job since Apache Tika has an amazing assortment of file-type introspectors. Also, an interesting academic exercise might be to collect a sufficient corpus of varying files, pull the first 512-4096 bytes of each, do some feature generation and write an ML-based classifier for files with a confidence level + MIME-type output.
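
To make that last idea a bit more concrete, here’s a rough sketch of the feature-generation step (byte frequencies from the first 4,096 bytes of each file; the classifier itself is left as the exercise it was meant to be):

# normalized byte-frequency features for a single file
byte_features <- function(path, n = 4096) {
  bytes <- readBin(path, what = "raw", n = n)
  tabulate(as.integer(bytes) + 1L, nbins = 256) / max(length(bytes), 1L)
}

# build a files-by-256 feature matrix from the simplemagic test corpus
sample_files <- list.files(
  system.file("extdat", package = "simplemagic"), full.names = TRUE
)
feature_matrix <- t(vapply(sample_files, byte_features, numeric(256)))
dim(feature_matrix)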

Site promiscuity detection

urlscan.io is a fun site since it frees you from the tedium (and expense/privacy concerns) of using a javascript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr package, but you have the benefit of a third-party making the connection with urlscan.io vs requests coming from your IP space.

I’m waiting on an API key so I can write the “submit a scan request programmatically” function, but—until then—you can retrieve existing sites in their database or manually enter one for later retrieval.

The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:

cnn_db <- urlscan::urlscan_search("domain:cnn.com")

latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)

latest_scan_results$scan_result$lists$ips
##  [1] "151.101.65.67"   "151.101.113.67"  "2.19.34.83"     
##  [4] "2.20.22.7"       "2.16.186.112"    "54.192.197.56"  
##  [7] "151.101.114.202" "83.136.250.242"  "157.166.238.142"
## [10] "13.32.217.114"   "23.67.129.200"   "2.18.234.21"    
## [13] "13.32.145.105"   "151.101.112.175" "172.217.21.194" 
## [16] "52.73.250.52"    "172.217.18.162"  "216.58.210.2"   
## [19] "172.217.23.130"  "34.238.24.243"   "13.107.21.200"  
## [22] "13.32.159.194"   "2.18.234.190"    "104.244.43.16"  
## [25] "54.192.199.124"  "95.172.94.57"    "138.108.6.20"   
## [28] "63.140.33.27"    "2.19.43.224"     "151.101.114.2"  
## [31] "74.201.198.92"   "54.76.62.59"     "151.101.113.194"
## [34] "2.18.233.186"    "216.58.207.70"   "95.172.94.20"   
## [37] "104.244.42.5"    "2.18.234.36"     "52.94.218.7"    
## [40] "62.67.193.96"    "62.67.193.41"    "69.172.216.55"  
## [43] "13.32.145.124"   "50.31.185.52"    "54.210.114.183" 
## [46] "74.120.149.167"  "64.202.112.28"   "185.60.216.19"  
## [49] "54.192.197.119"  "185.60.216.35"   "46.137.176.25"  
## [52] "52.73.56.77"     "178.250.2.67"    "54.229.189.67"  
## [55] "185.33.223.197"  "104.244.42.3"    "50.16.188.173"  
## [58] "50.16.238.189"   "52.59.88.2"      "52.38.152.125"  
## [61] "185.33.223.80"   "216.58.207.65"   "2.18.235.40"    
## [64] "69.172.216.58"   "107.23.150.218"  "34.192.246.235" 
## [67] "107.23.209.129"  "13.32.145.107"   "35.157.255.181" 
## [70] "34.228.72.179"   "69.172.216.111"  "34.205.202.95"

latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"

latest_scan_results$scan_result$lists$domains
##  [1] "cdn.cnn.com"                    "edition.i.cdn.cnn.com"         
##  [3] "edition.cnn.com"                "dt.adsafeprotected.com"        
##  [5] "pixel.adsafeprotected.com"      "securepubads.g.doubleclick.net"
##  [7] "tpc.googlesyndication.com"      "z.moatads.com"                 
##  [9] "mabping.chartbeat.net"          "fastlane.rubiconproject.com"   
## [11] "b.sharethrough.com"             "geo.moatads.com"               
## [13] "static.adsafeprotected.com"     "beacon.krxd.net"               
## [15] "revee.outbrain.com"             "smetrics.cnn.com"              
## [17] "pagead2.googlesyndication.com"  "secure.adnxs.com"              
## [19] "0914.global.ssl.fastly.net"     "cdn.livefyre.com"              
## [21] "logx.optimizely.com"            "cdn.krxd.net"                  
## [23] "s0.2mdn.net"                    "as-sec.casalemedia.com"        
## [25] "errors.client.optimizely.com"   "social-login.cnn.com"          
## [27] "invocation.combotag.com"        "sb.scorecardresearch.com"      
## [29] "secure-us.imrworldwide.com"     "bat.bing.com"                  
## [31] "jadserve.postrelease.com"       "ssl.cdn.turner.com"            
## [33] "cnn.sdk.beemray.com"            "static.chartbeat.com"          
## [35] "native.sharethrough.com"        "www.cnn.com"                   
## [37] "btlr.sharethrough.com"          "platform-cdn.sharethrough.com" 
## [39] "pixel.moatads.com"              "www.summerhamster.com"         
## [41] "mms.cnn.com"                    "ping.chartbeat.net"            
## [43] "analytics.twitter.com"          "sharethrough.adnxs.com"        
## [45] "match.adsrvr.org"               "gum.criteo.com"                
## [47] "www.facebook.com"               "d3qdfnco3bamip.cloudfront.net" 
## [49] "connect.facebook.net"           "log.outbrain.com"              
## [51] "serve2.combotag.com"            "rva.outbrain.com"              
## [53] "odb.outbrain.com"               "dynaimage.cdn.cnn.com"         
## [55] "data.api.cnn.io"                "aax.amazon-adsystem.com"       
## [57] "cdns.gigya.com"                 "t.co"                          
## [59] "pixel.quantserve.com"           "ad.doubleclick.net"            
## [61] "cdn3.optimizely.com"            "w.usabilla.com"                
## [63] "amplifypixel.outbrain.com"      "tr.outbrain.com"               
## [65] "mab.chartbeat.com"              "data.cnn.com"                  
## [67] "widgets.outbrain.com"           "secure.quantserve.com"         
## [69] "static.ads-twitter.com"         "amplify.outbrain.com"          
## [71] "tag.bounceexchange.com"         "adservice.google.com"          
## [73] "adservice.google.com.ua"        "www.googletagservices.com"     
## [75] "cdn.adsafeprotected.com"        "js-sec.indexww.com"            
## [77] "ads.rubiconproject.com"         "c.amazon-adsystem.com"         
## [79] "www.ugdturner.com"              "a.postrelease.com"             
## [81] "cdn.optimizely.com"             "cnn.com"

O_o

FIN

Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.

(FWIW I think I even caused myself pain due to the title of this blog post).

Kaiser Fung (@junkcharts) did a makeover post on this chart about U.S. steel tariffs:

Kaiser’s makeover is good (Note: just because I said “good” does not mean I’m endorsing the use of pie charts):

But, I’m curious as to what others would do with the data. Here’s my stab at a single-geom makeover:

library(waffle)
library(viridis)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(tidyverse)

data_frame(
  country = c("Rest of World", "Canada*", "Brazil*", "South Korea", "Mexico", 
              "Russia", "Turkey", "Japan", "Taiwan", "Germany", "India"),
  pct = c(22, 16, 13, 10, 9, 9, 7, 5, 4, 3, 2)
) %>% 
  mutate(country = sprintf("%s (%s%%)", country, pct)) %>% 
  waffle(
    colors = c("gray70", viridis_pal(option = "plasma")(10))
  ) +
  labs(
    title = "U.S. Steel Imports — YTD 2017 Percent of Volume",
    subtitle = "Ten nations account for ~80% of U.S. steel imports.",
    caption = "Source: IHS Global Trade Atlas • YTD through September 2017\n* Canada & Brazil are not impacted by the proposed tariffs"
  ) +
  theme_ipsum_ps() +
  theme(legend.position = "top") +
  theme(axis.text = element_blank()) +
  theme(title = element_text(hjust=0.5)) +
  theme(plot.title = element_text(hjust=0.5)) +
  theme(plot.subtitle = element_text(hjust=0.5)) +
  theme(plot.caption = element_text(hjust=1))

The percentages are included in the legend titles in the event that some readers of the chart may want to know the specific numbers, but my feeling for the intent of the original pac-man pies was to provide a list that didn’t include China-proper (despite 45 using them to rile up his base) and give a sense of proportion for the “top 10”. The waffle chart isn’t perfect for it, but it is one option.

How would you use the data (provided in the R snippet) to communicate the message you think needs to be communicated? Drop a note in the comments with a link to your creation(s) if you do give the data a spin.