Well, 2018 has flown by and today seems like an appropriate time to take a look at the landscape of R bloggerdom as seen through the eyes of readers of R-bloggers and R Weekly. We’ll do this via a new package designed to make it easier to treat Feedly as a data source: seymour [GL | GH] (which is a pun-ified name based on a well-known phrase from Little Shop of Horrors).

The seymour package builds upon an introductory Feedly API blog post from back in April 2018 and covers most of the “getters” in the API (i.e. you won’t be adding anything to or modifying anything in Feedly through this package unless you PR into it with said functions). An impetus for finally creating the package came about when I realized that you don’t need a Feedly account to use the search or stream endpoints. You do get more data back if you have a developer token and can also access your own custom Feedly components if you have one. If you are a “knowledge worker” and do not have a Feedly account (and, really, a Feedly Pro account) you are missing out. But, this isn’t a rah-rah post about Feedly, it’s a rah-rah post about R! Onwards!

Feeling Out The Feeds

There are a bunch of different ways to get Feedly metadata about an RSS feed. One easy way is to just use the RSS feed URL itself:

library(seymour) # git[la|hu]b/hrbrmstr/seymour
library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes
library(lubridate)
library(tidyverse)
r_bloggers <- feedly_feed_meta("http://feeds.feedburner.com/RBloggers")
r_weekly <- feedly_feed_meta("https://rweekly.org/atom.xml")
r_weekly_live <- feedly_feed_meta("https://feeds.feedburner.com/rweeklylive")

glimpse(r_bloggers)
## Observations: 1
## Variables: 14
## $ feedId      <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ id          <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ title       <chr> "R-bloggers"
## $ subscribers <int> 24518
## $ updated     <dbl> 1.546227e+12
## $ velocity    <dbl> 44.3
## $ website     <chr> "https://www.r-bloggers.com"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ language    <chr> "en"
## $ contentType <chr> "longform"
## $ description <chr> "Daily news and tutorials about R, contributed by ...

glimpse(r_weekly)
## Observations: 1
## Variables: 13
## $ feedId      <chr> "feed/https://rweekly.org/atom.xml"
## $ id          <chr> "feed/https://rweekly.org/atom.xml"
## $ title       <chr> "RWeekly.org - Blogs to Learn R from the Community"
## $ subscribers <int> 876
## $ updated     <dbl> 1.546235e+12
## $ velocity    <dbl> 1.1
## $ website     <chr> "https://rweekly.org/"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ contentType <chr> "longform"
## $ language    <chr> "en"

glimpse(r_weekly_live)
## Observations: 1
## Variables: 9
## $ id          <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ feedId      <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ title       <chr> "R Weekly Live: R Focus"
## $ subscribers <int> 1
## $ updated     <dbl> 1.5461e+12
## $ velocity    <dbl> 14.7
## $ website     <chr> "https://rweekly.org/live"
## $ language    <chr> "en"
## $ description <chr> "Live Updates from R Weekly"

Feedly uses some special terms, one of which (above) is velocity. “Velocity” is simply the average number of articles published weekly (Feedly’s platform updates that every few weeks for each feed). R-bloggers has over 24,000 Feedly subscribers so any post-rankings we do here should be fairly representative. I included both the “live” and the week-based R Weekly feeds as I wanted to compare post coverage between R-bloggers and R Weekly in terms of raw content.

On the other hand, R Weekly’s “weekly” RSS feed has fewer than 1,000 subscribers. WAT?! While I have mostly nothing against R-bloggers-proper, I heartily encourage ardent readers to also subscribe to R Weekly and perhaps even consider switching to it (or at least adding the individual blog feeds they monitor to your own Feedly). It wasn’t until the Feedly API came along that I had any idea of how many folks were really viewing my R blog posts, since we must provide a full-post RSS feed to R-bloggers and get very little in return (at least in terms of data). R Weekly uses a link counter but redirects all clicks to the blog author’s site, where we can use logs or analytics platforms to measure engagement. R Weekly is also run by a group of volunteers (more eyes == more posts they catch!) and has a Patreon where the current combined weekly net is likely not enough to buy each volunteer a latte. No ads, a great team and direct engagement stats for the community of R bloggers seems like a great deal for $1.00 USD. If you weren’t persuaded by the above rant, then perhaps at least consider installing this (from source that you control).

Lastly, I believe I’m that “1” subscriber to R Weekly Live O_o. But, I digress.

We’ve got the feedIds (which can be used as “stream” ids) so let’s get cracking!

Binding Up The Posts

We need to use the feedId in calls to feedly_stream() to get the individual posts. The API claims there’s a temporal parameter that allows one to get posts only after a certain date but I couldn’t get it to work (PRs are welcome on any community source code portal you’re most comfortable in if you’re craftier than I am). As a result, we need to make a guess as to how many calls we need to make for two of the three feeds. Basic maths of 44 * 52 / 1000 suggests ~3 should suffice for R Weekly (live) and R-bloggers but let’s do 5 to be safe. We should be able to get R Weekly (weekly) in one go.

r_weekly_wk <- feedly_stream(r_weekly$feedId)

range(r_weekly_wk$items$published) # my preview of this said it got back to 2016!
## [1] "2016-05-20 20:00:00 EDT" "2018-12-30 19:00:00 EST"

# NOTE: If this were more than 3 I'd use a loop/iterator
# In reality, I should make a helper function to do this for you (PRs welcome)

r_blog_1 <- feedly_stream(r_bloggers$feedId)
r_blog_2 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_1$continuation)
r_blog_3 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_2$continuation)

r_weekly_live_1 <- feedly_stream(r_weekly_live$feedId)
r_weekly_live_2 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_1$continuation)
r_weekly_live_3 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_2$continuation)

bind_rows(r_blog_1$items, r_blog_2$items, r_blog_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_blog_stream

bind_rows(r_weekly_live_1$items, r_weekly_live_2$items, r_weekly_live_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_weekly_live_stream

r_weekly_wk_stream <- filter(r_weekly_wk$items, published >= as.Date("2018-01-01"))
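As the comment in the block above hints, the continuation-token dance is easy to wrap in a helper. Here’s a minimal sketch (a hypothetical feedly_stream_all(), not something seymour provides) that assumes feedly_stream() returns a NULL continuation once the last page is reached:

# Hypothetical helper: page through a stream by following the continuation
# token returned by feedly_stream(), then row-bind all the items.
feedly_stream_all <- function(stream_id, max_calls = 5) {
  pages <- vector("list", max_calls)
  continuation <- NULL
  for (i in seq_len(max_calls)) {
    res <- if (is.null(continuation)) {
      feedly_stream(stream_id)
    } else {
      feedly_stream(stream_id, continuation = continuation)
    }
    pages[[i]] <- res$items
    continuation <- res$continuation
    if (is.null(continuation)) break # no more pages
  }
  dplyr::bind_rows(pages)
}

# e.g. feedly_stream_all(r_bloggers$feedId) %>%
#   filter(published >= as.Date("2018-01-01"))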

Let’s take a look:

glimpse(r_weekly_wk_stream)
## Observations: 54
## Variables: 27
## $ id                  <chr> "2nIALmjjlFcpPJKakm2k8hjka0FzpApixM7HHu8B0...
## $ originid            <chr> "https://rweekly.org/2018-53", "https://rw...
## $ fingerprint         <chr> "114357f1", "199f78d0", "9adc236e", "63f99...
## $ title               <chr> "R Weekly 2018-53 vroom, Classification", ...
## $ updated             <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ crawled             <dttm> 2018-12-31 00:51:39, 2018-12-23 23:46:49,...
## $ published           <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ alternate           <list> [<https://rweekly.org/2018-53.html, text/...
## $ canonicalurl        <chr> "https://rweekly.org/2018-53.html", "https...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ engagement          <int> 1, 5, 5, 3, 2, 3, 1, 2, 3, 2, 4, 3, 2, 2, ...
## $ engagementrate      <dbl> 0.33, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ content_content     <chr> "<p>Hello and welcome to this new issue!</...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://rweekly.org/atom.xml", "feed...
## $ origin_title        <chr> "RWeekly.org - Blogs to Learn R from the C...
## $ origin_htmlurl      <chr> "https://rweekly.org/", "https://rweekly.o...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://github.com/rweekly/image/raw/mast...
## $ visual_width        <int> 372, 672, 1000, 1000, 1000, 1001, 1000, 10...
## $ visual_height       <int> 479, 480, 480, 556, 714, 624, 237, 381, 36...
## $ visual_contenttype  <chr> "image/png", "image/png", "image/gif", "im...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_weekly_live_stream)
## Observations: 1,333
## Variables: 27
## $ id                  <chr> "rhkRVQ8KjjGRDQxeehIj6RRIBGntdni0ZHwPTR8B3...
## $ originid            <chr> "https://link.rweekly.org/ckb", "https://l...
## $ fingerprint         <chr> "c11a0782", "c1897fc3", "c0b36206", "7049e...
## $ title               <chr> "Top Tweets of 2018", "My #Best9of2018 twe...
## $ crawled             <dttm> 2018-12-29 11:11:52, 2018-12-28 11:24:22,...
## $ published           <dttm> 2018-12-28 19:00:00, 2018-12-27 19:00:00,...
## $ canonical           <list> [<https://link.rweekly.org/ckb, text/html...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RWeeklyL...
## $ unread              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ tags                <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ canonicalurl        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ampurl              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ cdnampurl           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ engagement          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ summary_content     <chr> "<p>maraaverick.rbind.io</p><img width=\"1...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://feeds.feedburner.com/rweekly...
## $ origin_title        <chr> "R Weekly Live: R Focus", "R Weekly Live: ...
## $ origin_htmlurl      <chr> "https://rweekly.org/live", "https://rweek...
## $ visual_url          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_processor    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_width        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_height       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_contenttype  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_blog_stream)
## Observations: 2,332
## Variables: 34
## $ id                  <chr> "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP...
## $ keywords            <list> ["R bloggers", "R bloggers", "R bloggers"...
## $ originid            <chr> "https://datascienceplus.com/?p=19513", "h...
## $ fingerprint         <chr> "2f32071a", "332f9548", "2e6f8adb", "3d7ed...
## $ title               <chr> "Leaf Plant Classification: Statistical Le...
## $ crawled             <dttm> 2018-12-30 22:35:22, 2018-12-30 19:01:25,...
## $ published           <dttm> 2018-12-30 19:26:20, 2018-12-30 13:18:00,...
## $ canonical           <list> [<https://www.r-bloggers.com/leaf-plant-c...
## $ author              <chr> "Giorgio Garziano", "Sascha W.", "Economet...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RBlogger...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ entities            <list> [<c("nlp/f/entity/en/-/leaf plant classif...
## $ engagement          <int> 50, 39, 482, 135, 33, 12, 13, 41, 50, 31, ...
## $ engagementrate      <dbl> 1.43, 0.98, 8.76, 2.45, 0.59, 0.21, 0.22, ...
## $ enclosure           <list> [NULL, NULL, NULL, NULL, <c("https://0.gr...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ updatecount         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ content_content     <chr> "<p><div><div><div><div data-show-faces=\"...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ summary_content     <chr> "CategoriesAdvanced Modeling\nTags\nLinear...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/http://feeds.feedburner.com/RBlogger...
## $ origin_title        <chr> "R-bloggers", "R-bloggers", "R-bloggers", ...
## $ origin_htmlurl      <chr> "https://www.r-bloggers.com", "https://www...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://i0.wp.com/datascienceplus.com/wp-...
## $ visual_width        <int> 383, 400, NA, 286, 456, 250, 450, 456, 397...
## $ visual_height       <int> 309, 300, NA, 490, 253, 247, 450, 253, 333...
## $ visual_contenttype  <chr> "image/png", "image/png", NA, "image/png",...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Let’s also check how far into December each feed got as of this post (I’ll check again after the 31st and update if needed):

range(r_weekly_wk_stream$published)
## [1] "2018-01-07 19:00:00 EST" "2018-12-30 19:00:00 EST"

range(r_blog_stream$published)
## [1] "2018-01-01 11:00:27 EST" "2018-12-30 19:26:20 EST"

range(r_weekly_live_stream$published)
## [1] "2018-01-01 19:00:00 EST" "2018-12-28 19:00:00 EST"

Digging Into The Weeds Feeds

In the above glimpses there’s another special term, engagement. Feedly defines this as an “indicator of how popular this entry is. The higher the number, the more readers have read, saved or shared this particular entry”. We’ll use this to look at the most “engaged” content in a bit. What’s noticeable from the start is that R Weekly Live has 1,333 entries and R-bloggers has 2,332 entries (so, nearly double the number of entries). Those counts are a bit of “fake news” when it comes to overall unique posts, as can be seen by:

bind_rows(
  mutate(r_weekly_live_stream, src = "R Weekly (Live)"),
  mutate(r_blog_stream, src = "R-bloggers")
) %>% 
  mutate(wk = lubridate::week(published)) -> y2018

filter(y2018, title == "RcppArmadillo 0.9.100.5.0") %>% 
  select(src, title, originid, published) %>% 
  gt::gt()

src title originid published
R Weekly (Live) RcppArmadillo 0.9.100.5.0 https://link.rweekly.org/bg6 2018-08-17 07:55:00
R Weekly (Live) RcppArmadillo 0.9.100.5.0 https://link.rweekly.org/bfr 2018-08-16 21:20:00
R-bloggers RcppArmadillo 0.9.100.5.0 https://www.r-bloggers.com/?guid=f8865e8a004f772bdb64e3c4763a0fe5 2018-08-17 08:00:00
R-bloggers RcppArmadillo 0.9.100.5.0 https://www.r-bloggers.com/?guid=3046299f73344a927f787322c867233b 2018-08-16 21:20:00

Feedly has many processes going on behind the scenes to identify new entries and update entries as original sources are modified. This “duplication” (thankfully) doesn’t happen a lot:

count(y2018, src, wk, title, sort=TRUE) %>% 
  filter(n > 1) %>% 
  arrange(wk) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("wk", "n"), decimals = 0)

src wk title n
R-bloggers 3 conapomx data package 2
R Weekly (Live) 5 R in Latin America 2
R Weekly (Live) 12 Truncated Poisson distributions in R and Stan by @ellis2013nz 2
R Weekly (Live) 17 Regression Modeling Strategies 2
R Weekly (Live) 18 How much work is onboarding? 2
R Weekly (Live) 18 Survey books, courses and tools by @ellis2013nz 2
R-bloggers 20 Beautiful and Powerful Correlation Tables in R 2
R Weekly (Live) 24 R Consortium is soliciting your feedback on R package best practices 2
R Weekly (Live) 33 RcppArmadillo 0.9.100.5.0 2
R-bloggers 33 RcppArmadillo 0.9.100.5.0 2
R-bloggers 39 Individual level data 2
R Weekly (Live) 41 How R gets built on Windows 2
R Weekly (Live) 41 R Consortium grant applications due October 31 2
R Weekly (Live) 41 The Economist’s Big Mac Index is calculated with R 2
R Weekly (Live) 42 A small logical change with big impact 2
R Weekly (Live) 42 Maryland’s Bridge Safety, reported using R 2
R-bloggers 47 OneR – fascinating insights through simple rules 2

In fact, it happens infrequently enough that I’m going to let the “noise” stay in the data since Feedly technically is tracking some content change.

Let’s look at the week-over-week curation counts (neither source publishes original content, so using the term “published” seems ill fitting) for each:

count(y2018, src, wk) %>% 
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0, color = src), show.legend = FALSE) +
  facet_wrap(~src, ncol=1, scales="free_x") + 
  labs(
    x = "Week #", y = "# Posts", 
    title = "Weekly Post Curation Stats for R-bloggers & R Weekly (Live)"
  ) +
  theme_ft_rc(grid="Y")

week over week

Despite R-bloggers having curated more overall content, there’s plenty to read each week for consumers of either/both aggregators.

Speaking of consuming, let’s look at the distribution of engagement scores for both aggregators:

group_by(y2018, src) %>% 
  summarise(v = list(broom::tidy(summary(engagement)))) %>% 
  unnest()
## # A tibble: 2 x 8
##   src             minimum    q1 median  mean    q3 maximum    na
##   <chr>             <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 R Weekly (Live)       0     0    0     0       0       0  1060
## 2 R-bloggers            1    16   32.5  58.7    70    2023    NA

Well, it seems that it’s more difficult for Feedly to track engagement for the link-only R Weekly (Live) feed, so we’ll have to focus on R-bloggers for engagement views. Summary values are fine, but we can get a picture of the engagement distribution (we’ll do it monthly to get a bit more granularity, too):

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>% 
  ggplot(aes(month, engagement)) +
  geom_violin() +
  ggbeeswarm::geom_quasirandom(
    groupOnX = TRUE, size = 2, color = "#2b2b2b", fill = ft_cols$green,
    shape = 21, stroke = 0.25
  ) +
  scale_y_comma(trans = "log10") +
  labs(
    x = NULL, y = "Engagement Score",
    title = "Monthly Post Engagement Distributions for R-bloggers Curated Posts",
    caption = "NOTE: Y-axis log10 Scale"
  ) +
  theme_ft_rc(grid="Y")

post engagement distribution

I wasn’t expecting each month’s distribution to be so similar. There are definitely outliers in terms of positive engagement, so we should be able to see what types of R-focused content pique the interest of the ~25,000 Feedly subscribers of R-bloggers.

filter(y2018, src == "R-bloggers") %>% 
  group_by(author) %>% 
  summarise(n_posts = n(), total_eng = sum(engagement), avg_eng = mean(engagement), med_eng = median(engagement)) %>% 
  arrange(desc(n_posts)) %>% 
  slice(1:20) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("n_posts", "total_eng", "avg_eng", "med_eng"), decimals = 0)

author n_posts total_eng avg_eng med_eng
David Smith 116 9,791 84 47
John Mount 94 4,614 49 33
rOpenSci – open tools for open science 89 2,967 33 19
Thinking inside the box 85 1,510 18 14
R Views 60 4,142 69 47
hrbrmstr 55 1,179 21 16
Dr. Shirin Glander 54 2,747 51 25
xi’an 49 990 20 12
Mango Solutions 42 1,221 29 17
Econometrics and Free Software 33 2,858 87 60
business-science.io – Articles 31 4,484 145 70
NA 31 1,724 56 40
statcompute 29 1,329 46 33
Ryan Sheehy 25 1,271 51 45
Keith Goldfeld 24 1,305 54 43
free range statistics – R 23 440 19 12
Jakob Gepp 21 348 17 13
Tal Galili 21 1,587 76 22
Jozef’s Rblog 18 1,617 90 65
arthur charpentier 16 1,320 82 68

It is absolutely no surprise David comes in at number one in both post count and almost every engagement summary statistic since he’s a veritable blogging machine and creates + curates some super interesting content (whereas yours truly doesn’t even make the median engagement cut).

What were the most engaging posts?

filter(y2018, src == "R-bloggers") %>% 
  arrange(desc(engagement)) %>% 
  mutate(published = as.Date(published)) %>% 
  select(engagement, title, published, author) %>% 
  slice(1:50) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("engagement"), decimals = 0)

engagement title published author
2,023 Happy Birthday R 2018-08-27 eoda GmbH
1,132 15 Types of Regression you should know 2018-03-25 ListenData
697 R and Python: How to Integrate the Best of Both into Your Data Science Workflow 2018-10-08 business-science.io – Articles
690 Ultimate Python Cheatsheet: Data Science Workflow with Python 2018-11-18 business-science.io – Articles
639 Data Analysis with Python Course: How to read, wrangle, and analyze data 2018-10-31 Andrew Treadway
617 Machine Learning Results in R: one plot to rule them all! 2018-07-18 Bernardo Lares
614 R tip: Use Radix Sort 2018-08-21 John Mount
610 Data science courses in R (/python/etc.) for $10 at Udemy (Sitewide Sale until Aug 26th) 2018-08-24 Tal Galili
575 Why R for data science – and not Python? 2018-12-02 Learning Machines
560 Case Study: How To Build A High Performance Data Science Team 2018-09-18 business-science.io – Articles
516 R 3.5.0 is released! (major release with many new features) 2018-04-24 Tal Galili
482 R or Python? Why not both? Using Anaconda Python within R with {reticulate} 2018-12-30 Econometrics and Free Software
479 Sankey Diagram for the 2018 FIFA World Cup Forecast 2018-06-10 Achim Zeileis
477 5 amazing free tools that can help with publishing R results and blogging 2018-12-22 Jozef’s Rblog
462 What’s the difference between data science, machine learning, and artificial intelligence? 2018-01-09 David Robinson
456 XKCD “Curve Fitting”, in R 2018-09-28 David Smith
450 The prequel to the drake R package 2018-02-06 rOpenSci – open tools for open science
449 Who wrote that anonymous NYT op-ed? Text similarity analyses with R 2018-09-07 David Smith
437 Elegant regression results tables and plots in R: the finalfit package 2018-05-16 Ewen Harrison
428 How to implement neural networks in R 2018-01-12 David Smith
426 Data transformation in style: package sjmisc updated 2018-02-06 Daniel
413 Neural Networks Are Essentially Polynomial Regression 2018-06-20 matloff
403 Custom R charts coming to Excel 2018-05-11 David Smith
379 A perfect RStudio layout 2018-05-22 Ilya Kashnitsky
370 Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 1: Basics 2018-10-25 Mel Moreno and Mathieu Basille
368 The Financial Times and BBC use R for publication graphics 2018-06-27 David Smith
367 Dealing with The Problem of Multicollinearity in R 2018-08-16 Perceptive Analytics
367 Excel is obsolete. Here are the Top 2 alternatives from R and Python. 2018-03-13 Appsilon Data Science Blog
365 New R Cheatsheet: Data Science Workflow with R 2018-11-04 business-science.io – Articles
361 Tips for analyzing Excel data in R 2018-08-30 David Smith
360 Importing 30GB of data in R with sparklyr 2018-02-16 Econometrics and Free Software
358 Scraping a website with 5 lines of R code 2018-01-24 David Smith
356 Clustering the Bible 2018-12-27 Learning Machines
356 Finally, You Can Plot H2O Decision Trees in R 2018-12-26 Gregory Kanevsky
356 Geocomputation with R – the afterword 2018-12-12 Rstats on Jakub Nowosad’s website
347 Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R 2018-04-18 business-science.io – Articles
343 Run Python from R 2018-03-27 Deepanshu Bhalla
336 Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models) 2018-07-24 Bernardo Lares
332 R Generation: 25 Years of R 2018-08-01 David Smith
329 How to extract data from a PDF file with R 2018-01-05 Packt Publishing
325 R or Python? Python or R? The ongoing debate. 2018-01-28 tomaztsql
322 How to perform Logistic Regression, LDA, & QDA in R 2018-01-05 Prashant Shekhar
321 Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity 2018-09-06 David Robinson
311 Intuition for principal component analysis (PCA) 2018-12-06 Learning Machines
310 Packages for Getting Started with Time Series Analysis in R 2018-02-18 atmathew
309 Announcing the R Markdown Book 2018-07-13 Yihui Xie
307 Automated Email Reports with R 2018-11-01 JOURNEYOFANALYTICS
304 future.apply – Parallelize Any Base R Apply Function 2018-06-23 JottR on R
298 How to build your own Neural Network from scratch in R 2018-10-09 Posts on Tychobra
293 RStudio 1.2 Preview: SQL Integration 2018-10-02 Jonathan McPherson

Weekly & monthly curated post descriptive statistic patterns haven’t changed much since the April post:

filter(y2018, src == "R-bloggers") %>% 
  mutate(wkday = lubridate::wday(published, label = TRUE, abbr = TRUE)) %>%
  count(wkday) %>% 
  ggplot(aes(wkday, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Day-of-week Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

day of week view

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month) %>% 
  ggplot(aes(month, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Monthly Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

month view

Surprisingly, monthly post count consistency (or even posting something each month) is not a common trait amongst the top 20 (by total engagement) authors:

w20 <- scales::wrap_format(20)

filter(y2018, src == "R-bloggers") %>% 
  filter(!is.na(author)) %>% # some posts don't have author attribution
  mutate(author_t = map_chr(w20(author), paste0, collapse="\n")) %>% # we need to wrap for facet titles (below)
  count(author, author_t, wt=engagement, sort=TRUE) %>% # get total author engagement
  slice(1:20) %>% # top 20
  { .auth_ordr <<- . ; . } %>% # we use the order later
  left_join(filter(y2018, src == "R-bloggers"), "author") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month, author_t, sort = TRUE) %>% 
  mutate(author_t = factor(author_t, levels = .auth_ordr$author_t)) %>% 
  ggplot(aes(month, nn, author_t)) +
  geom_col(width = 0.5) +
  scale_x_discrete(labels=substring(month.abb, 1, 1)) +
  scale_y_comma() +
  facet_wrap(~author_t) +
  labs(
    x = NULL, y = "Curated Post Count",
    title = "Monthly Curated Post Counts-per-Author (Top 20 by Engagement)",
    subtitle = "Arranged by Total Author Engagement"
  ) +
  theme_ft_rc(grid="yY")

Overall, most authors favor shorter titles for their posts:

filter(y2018, src == "R-bloggers") %>% 
  mutate(
    `Character Count Distribution` = nchar(title), 
    `Word Count Distribution` = stringi::stri_count_boundaries(title, type = "word")
  ) %>% 
  select(id, `Character Count Distribution`, `Word Count Distribution`) %>% 
  gather(measure, value, -id) %>% 
  ggplot(aes(value)) +
  ggalt::geom_bkde(alpha=1/3, color = ft_cols$slate, fill = ft_cols$slate) +
  scale_y_continuous(expand=c(0,0)) +
  facet_wrap(~measure, scales = "free") +
  labs(
    x = NULL, y = "Density",
    title = "Title Character/Word Count Distributions",
    subtitle = "~38 characters/11 words seems to be the sweet spot for most authors",
    caption = "Note Free X/Y Scales"
  ) +
  theme_ft_rc(grid="XY")

This post is already kinda tome-length so I’ll leave it to y’all to grab the data and dig in a bit more.

A Word About Using The content_content Field For R-bloggers Posts

Since R-bloggers requires a full feed from contributors, they, in turn, post a “kinda” full feed back out. I say “kinda” as they still haven’t fixed a reported bug in their processing engine which causes issues in (at least) Feedly’s RSS processing engine. If you use Feedly, take a look at the R-bloggers RSS feed entry for the recent “R or Python? Why not both? Using Anaconda Python within R with {reticulate}” post. It cuts off near “Let’s check its type:”. This is due to the way the < character is processed by the R-bloggers ingestion engine: the ## <class 'pandas.core.frame.DataFrame'> line from the original post doesn’t even display right on the R-bloggers page because the engine mangles the input and turns that descriptive output into an actual <class> tag: <class &#39;pandas.core.frame.dataframe&#39;=""></class>. It’s really an issue on both sides, but R-bloggers is doing the mangling and should seriously consider addressing it in 2019.
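To make that failure mode concrete, here’s a tiny sketch of the kind of escaping the ingestion engine should be doing; it uses htmltools::htmlEscape() purely as an illustration and is not anything R-bloggers actually runs:

library(htmltools)

raw_line <- "## <class 'pandas.core.frame.DataFrame'>"

# Unescaped, an HTML renderer treats "<class ...>" as an opening tag and
# mangles everything after it; escaping < and > keeps the line as literal text.
htmlEscape(raw_line)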

Since it is still not fixed, it forces you to go to R-bloggers (clicks FTW? and may partly explain why that example post has a 400+ engagement score) unless you scroll back up to the top of the Feedly view and go to the author’s blog page. Given that tibble output invariably has a < right up top, your best bet for getting more direct views of your own content is to get a code-block with printed ## < output in it as close to the beginning as possible (perhaps start each post with a print(tbl_df(mtcars))?).

Putting post-view-hacking levity aside, this content mangling means you can’t trust the content_content column in the stream data frame to have all the content; that is, if you were planning on taking the provided data and doing some topic clustering or content-based feature extraction for other stats/ML ops you’re out of luck and need to crawl the original site URLs on your own to get the main content for such analyses.
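If you do need the full text, a rough sketch of such a crawl might look like the following. The “article” CSS selector and the fetch_body() helper are assumptions for illustration (most blog themes need their own selector), and you should throttle requests and respect robots.txt in any real run:

library(rvest)

# Pull the first link for each entry out of the `alternate` list-column,
# fetch the page, and extract the main content node (selector is a guess).
fetch_body <- function(url) {
  pg <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(pg)) return(NA_character_)
  html_text(html_node(pg, "article"), trim = TRUE)
}

r_blog_stream %>%
  mutate(post_url = map_chr(alternate, ~.x[[1]])) %>%
  slice(1:5) %>%                                   # small sample; be polite
  mutate(full_text = map_chr(post_url, fetch_body)) -> crawl_sample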

A Bit More About seymour

The seymour package has the following API functions:

  • feedly_access_token: Retrieve the Feedly Developer Token
  • feedly_collections: Retrieve Feedly Collections
  • feedly_feed_meta: Retrieve Metadata for a Feed
  • feedly_opml: Retrieve Your Feedly OPML File
  • feedly_profile: Retrieve Your Feedly Profile
  • feedly_search_contents: Search content of a stream
  • feedly_search_title: Find feeds based on title, url or ‘#topic’
  • feedly_stream: Retrieve contents of a Feedly “stream”
  • feedly_tags: Retrieve List of Tags

along with following helper function (which we’ll introduce in a minute):

  • render_stream: Render a Feedly Stream Data Frame to RMarkdown

and, the following helper reference (as Feedly has some “universal” streams):

  • global_resource_ids: Global Resource Ids Helper Reference

The render_stream() function is semi-useful on its own but was designed as more of a “you may want to replicate this on your own” (i.e. have a look at the source code and riff off of it). “Streams” are individual feeds, collections or even “boards” you make and with this new API package and the power of R Markdown, you can make your own “newsletter” like this:

fp <- feedly_profile() # get profile to get my id

# use the id to get my "security" category feed in my feedly
fs <- feedly_stream(sprintf("user/%s/category/security", fp$id))

# get the top 10 items with engagement >= third quartile of all posts
# and don't include duplicates in the report
mutate(fs$items, published = as.Date(published)) %>% 
  filter(published >= as.Date("2018-12-01")) %>%
  filter(engagement > fivenum(engagement)[4]) %>% 
  filter(!is.na(summary_content)) %>% 
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>% 
  distinct(alt_url, .keep_all = TRUE) %>% 
  slice(1:10) -> for_report

# render the report
render_stream(
  feedly_stream = for_report, 
  title = "Cybersecurity News", 
  include_visual = TRUE,
  browse = TRUE
)

Which makes the following Rmd and HTML. (So, no need to “upgrade” to “Teams” to make newsletters!).

FIN

As noted, the 2018 data for R Weekly (Live) & R-bloggers is available and you can find the seymour package on [GL | GH].

If you’re not a Feedly user I strongly encourage you to give it a go! And, if you don’t subscribe to R Weekly, you should make that your first New Year’s Resolution.

Here’s looking to another year of great R content across the R blogosphere!

Phishing is [still] the primary way attackers either commit a criminal act directly (i.e. phish a target to, say, install ransomware) or gain an initial foothold in an organization so they can perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank which is a collaborative clearing house for data and information about phishing on the Internet. Also, PhishTank provides an open API for developers and researchers to integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations the data is also useful for security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns which include (amongst many other elements) network (internet) location of the phishing site, industry being targeted, domain names being used, what type of sites are being cloned/copied and a feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and — of these — which type of SSL certificates are “en vogue”.

Phishing sites are increasingly using and relying on SSL certificates because we in the information security industry spent a decade instructing the general internet surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties to install phishing sites/pages since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt has made it possible for attackers to set up their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decades-old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put your API key into the PHISHTANK_API_KEY environment variable, which is best done via your ~/.Renviron file).
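For reference, that’s just a single line in ~/.Renviron (the key below is a placeholder):

# Add this line to ~/.Renviron (usethis::edit_r_environ() opens it), then restart R:
# PHISHTANK_API_KEY=your-phishtank-api-key-here

nzchar(Sys.getenv("PHISHTANK_API_KEY")) # quick sanity check that it's set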

Let’s setup all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to utilize your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either using the time-stamped approach below or using some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl) # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium) # git[la|hu]b/hrbrmstr/aquarium
library(gt) # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below (a sketch of one possible substitute appears after the filtering code below).

Now, we get a copy of the current PhishTank dataset & cache it:

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf)
## Observations: 16,446
## Variables: 9
## $ phish_id          <chr> "5884184", "5884138", "5884136", "5884135", ...
## $ url               <chr> "http://internetbanking-bancointer.com.br/lo...
## $ phish_detail_url  <chr> "http://www.phishtank.com/phish_detail.php?p...
## $ submission_time   <dttm> 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2...
## $ verified          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ verification_time <dttm> 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2...
## $ online            <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ details           <list> [<209.132.252.7, 209.132.252.0/24, 7296 468...
## $ target            <chr> "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign, the URL of the site, along with a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1])
## List of 1
##  $ :'data.frame':    1 obs. of  6 variables:
##   ..$ ip_address        : chr "209.132.252.7"
##   ..$ cidr_block        : chr "209.132.252.0/24"
##   ..$ announcing_network: chr "7296 468"
##   ..$ rir               : chr "arin"
##   ..$ country           : chr "US"
##   ..$ detail_time       : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent
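If psl/curlparse aren’t an option on your platform (per the note above), here’s one possible stand-in built on the CRAN urltools package; this substitution is an assumption on my part and hasn’t been checked against psl’s public-suffix handling:

library(urltools)

u <- "https://login.example.co.uk/account/verify"  # toy URL for illustration

host <- urltools::domain(u)               # hostname, a domain() stand-in
sfx  <- urltools::suffix_extract(host)    # splits host into subdomain/domain/suffix
paste(sfx$domain, sfx$suffix, sep = ".")  # registrable domain, an apex_domain() stand-in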

Let’s see how many are using SSL:

(ssl_ct)
## [1] 383

(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than a recent “50% of all phishing sites use encryption” statistic going around of late. There are many reasons for the difference:

  • PhishTank doesn’t have all phishing sites in it
  • We just looked at a week of examples
  • Some sites were offline at the time of access attempt
  • Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the 20% deviation, 30% is still a decent percentage, and a green “everything’s secure” lock icon is still a valued prize, so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation, so we’ll grab them in parallel. It’s also quite possible the “online” status in the data frame glimpse above is inaccurate (sites can go offline quickly), so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {

  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")

} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let’s see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25

(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to the blog, when attackers want to use SSL for the lock icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top “apex” domains (https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE)
## # A tibble: 255 x 2
##    apex                              n
##    <chr>                         <int>
##  1 000webhostapp.com                42
##  2 google.com                       17
##  3 umbler.net                        8
##  4 sharepoint.com                    6
##  5 com-fl.cz                         5
##  6 lbcpzonasegurabeta-viabcp.com     4
##  7 windows.net                       4
##  8 ashaaudio.net                     3
##  9 brijprints.com                    3
## 10 portaleisp.com                    3
## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here usually points to), Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart given the pervasive trust associated with those properties). There are 255 distinct apex domains in this 1-week set, so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parswers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN n pct
DST Root CA X3 96 26.82%
COMODO RSA Certification Authority 93 25.98%
DigiCert Global Root G2 45 12.57%
Baltimore CyberTrust Root 30 8.38%
GlobalSign 27 7.54%
DigiCert Global Root CA 15 4.19%
Go Daddy Root Certificate Authority – G2 14 3.91%
COMODO ECC Certification Authority 11 3.07%
Actalis Authentication Root CA 9 2.51%
GlobalSign Root CA 4 1.12%
Amazon Root CA 1 3 0.84%
Let’s Encrypt Authority X3 3 0.84%
AddTrust External CA Root 2 0.56%
DigiCert High Assurance EV Root CA 2 0.56%
USERTrust RSA Certification Authority 2 0.56%
GeoTrust Global CA 1 0.28%
SecureTrust CA 1 0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Comodo is not far behind and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide. (It’s important to read the comments near the lines of code above that make assumptions about the order of the returned issuer information.) Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN n pct
Let’s Encrypt Authority X3 99 27.65%
cPanel\, Inc. Certification Authority 75 20.95%
RapidSSL TLS RSA CA G1 45 12.57%
Google Internet Authority G3 24 6.70%
COMODO RSA Domain Validation Secure Server CA 20 5.59%
CloudFlare Inc ECC CA-2 18 5.03%
Go Daddy Secure Certificate Authority – G2 14 3.91%
COMODO ECC Domain Validation Secure Server CA 2 11 3.07%
Actalis Domain Validation Server CA G1 9 2.51%
RapidSSL RSA CA 2018 9 2.51%
Microsoft IT TLS CA 1 6 1.68%
Microsoft IT TLS CA 5 6 1.68%
DigiCert SHA2 Secure Server CA 5 1.40%
Amazon 3 0.84%
GlobalSign CloudSSL CA – SHA256 – G3 2 0.56%
GTS CA 1O1 2 0.56%
AlphaSSL CA – SHA256 – G2 1 0.28%
DigiCert SHA2 Extended Validation Server CA 1 0.28%
DigiCert SHA2 High Assurance Server CA 1 0.28%
Don Dominio / MrDomain RSA DV CA 1 0.28%
GlobalSign Extended Validation CA – SHA256 – G3 1 0.28%
GlobalSign Organization Validation CA – SHA256 – G2 1 0.28%
RapidSSL SHA256 CA 1 0.28%
TrustAsia TLS RSA CA 1 0.28%
USERTrust RSA Domain Validation Secure Server CA 1 0.28%
NA 1 0.28%

LE is number one again! But, it’s important to note that these issuer CommonNames can roll up into a single issuing organization given just how messed up integrity and encryption capability is when it comes to web site certs, so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers).
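For instance, a quick and deliberately incomplete rollup of the intermediary CNs might look like the following; the mapping is my own rough guess at CA ownership, not authoritative data:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  mutate(
    issuing_org = case_when(                  # hand-rolled, illustrative mapping only
      grepl("Let.s Encrypt", CN)              ~ "Let's Encrypt",
      grepl("COMODO|USERTrust|cPanel", CN)    ~ "Comodo/Sectigo",
      grepl("DigiCert|RapidSSL|GeoTrust", CN) ~ "DigiCert",
      grepl("GlobalSign|AlphaSSL", CN)        ~ "GlobalSign",
      grepl("Google|GTS", CN)                 ~ "Google Trust Services",
      grepl("Go Daddy", CN)                   ~ "GoDaddy",
      grepl("Microsoft", CN)                  ~ "Microsoft",
      TRUE                                    ~ CN
    )
  ) %>%
  count(issuing_org, sort = TRUE)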

FIN

There are tons of avenues to explore with this data, so I hope this post whets your collective appetites sufficiently for you to dig into it, especially if you have some down-time coming.

Let me also take this opportunity to reissue guidance I and many others have uttered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind the form you’re about to enter your personal information and credit card info into.

I can’t seem to free my infrequently-viewed email inbox from “you might like!” notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I’d’ve been blissfully unaware of it and would have saved you the trouble of reading this).

Today, they sent me this gem by @JeromeDeveloper: Scrapy and Scrapyrt: how to create your own API from (almost) any website. Go ahead and click it. Give the Medium author the claps they so desperately crave (and to provide context for the rant below).

I have no issue with @JeromeDeveloper’s coding prowess, nor Scrapy/Scrapyrt. In fact, I’m a huge fan of the folks at ScrapingHub, so much so that I wrote splashr to enable use of their Splash server from R.

My issue is with the example the author chose to use.

CoinMarketCap provides cryptocurrency prices and other cryptocurrency info. I use it to track cryptocurrency prices to see which currency attackers who pwn devices to install illegal cryptocurrency rigs might be switching to next and to get a feel for when they’ll stop mining and go back to just stealing data and breaking things.

CoinMarketCap has an API with a generous free tier and the following text in their Terms & Conditions (which, in the U.S., may [soon] stupidly have to be explicitly repeated on every page where scraping is prohibited rather than in a single, universal site link):

You may not, and shall not, copy, reproduce, download, “screen scrape”, store, transmit, broadcast, publish, modify, create a derivative work from, display, perform, distribute, redistribute, sell, license, rent, lease or otherwise use, transfer (either in printed, electronic or other format) or exploit any Content, in whole or in part, in any way that does not comply with these Terms without our prior written permission.

There is only one reason (apart from complete oblivion) to use CoinMarketCap as an example: to show folks how clever you are at bypassing site restrictions and eventually avoiding paying for an API to get data that you did absolutely nothing to help gather, curate and setup infrastructure for. There is no mention of “be sure what you are doing is legal/ethical”, just a casual caution to not abuse the Scrapyrt technology since it may get you banned.

Ethics matter across every area of “data science” (of which scraping is one component). Just because you can do something doesn’t mean you should, and just because you don’t like Terms & Conditions and want to grift the work of others for fun & profit also doesn’t mean you should; and, it definitely doesn’t mean you should be advocating others do it as well.

Ironically, Medium itself places restrictions on what you can do:

Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.

yet they advocated I read and heed a post which violates similar terms of another site. So I wonder how they’d feel if I did a riff of that post and showed how to setup a hackish-API to scrape all their content. O_o

The libcurl library (the foundational library behind the RCurl and curl packages) has switched to using OpenSSL’s default ciphers as of version 7.56.0 (October 4, 2017). If you’re a regular updater of curl/httr you should be fairly current with these cipher suites, but if you’re not a keen updater or use RCurl for your web-content tasks, you are likely not working with a recent cipher list and may start running into trouble as the internet’s self-proclaimed web guardians continue their wild-abandon push towards “HTTPS Everywhere”.

Why is this important? Well, as a web consumer (via browsers) you likely haven’t run into any issues when visiting SSL/TLS-enabled sites since most browsers update super-frequently and bring along modern additions to cipher suites with them. Cipher suites are one of the backbones of assurance when it comes to secure connections to servers and stronger/different ciphers are continually added to openssl (and other libraries). If a server (rightfully) only supports a modern, seriously secure TLS configuration, clients that do not have such support won’t be able to connect and you’ll see errors such as:

SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

You can test what a server supports via tools like SSL Test. I’d point to a command-line tool but there are enough R users on crippled Windows systems that it’s just easier to have you point and click to see. If you are game to try a command-line tool then give testssl a go from an RStudio terminal (I use that suggestion specifically to be platform agnostic as I cannot assume R Windows users know how to use a sane shell). The testssl script has virtually no dependencies so it should “work everywhere”. Note that both SSL Test and testssl make quite a few connections to a site, so make sure you’re only using your own server(s) as test targets unless you have permission from others to use theirs (go ahead and hit mine if you like).

You can also see what your R client packages support. One could run:

library(magrittr)

read.table(
  text = system("openssl ciphers -v", intern=TRUE) %>% 
    gsub("[[:alpha:]]+=", "", .)
) %>% 
  setNames(
    c("ciphername", "protoccol_version", "key_exchange", "authentication", 
      "symmetric_encryption_method", "message_authentication_code")
  )

in an attempt to do that via the openssl binary on your system, but Windows users likely won’t be able to run that (unlike every other modern OS including iOS) and it might not show you what your installed R client packages can handle since they may be using different libraries.

So, another platform-agnostic (but requiring a call to a web site, so some privacy leakage) is to use How’s My SSL.

ssl_check_url <- "https://www.howsmyssl.com/a/check"

jsonlite::fromJSON(
  readLines(url(ssl_check_url), warn = FALSE)
) -> base_chk

jsonlite::fromJSON(
  RCurl::getURL(ssl_check_url)
) -> rcurl_chk

jsonlite::fromJSON(
  rawToChar(
    curl::curl_fetch_memory(ssl_check_url)$content
  )
) -> curl_chk

Compare the $given_cipher_suites for each of those to see how they differ and also take a look at $rating. macOS and Linux users should have fairly uniform results for all three. Windows users may be in for a sad awakening (I mean, you’re used to that on a regular basis, so it’s cool). You can also configure how you communicate what you support via the ssl_cipher_list cURL option (capitalization is a bit different with RCurl, but I kinda want you to use the curl package so you’re on your own to translate). Note that you can’t game the system and claim you can handle a cipher you don’t actually have.
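For example (building on the three objects above; the cipher string passed to ssl_cipher_list is just an arbitrary illustration, and you can only narrow to ciphers your build already supports):

# Ciphers the curl package offers that RCurl does not (and vice versa)
setdiff(curl_chk$given_cipher_suites, rcurl_chk$given_cipher_suites)
setdiff(rcurl_chk$given_cipher_suites, curl_chk$given_cipher_suites)

# Overall ratings How's My SSL assigned to each client
c(base = base_chk$rating, RCurl = rcurl_chk$rating, curl = curl_chk$rating)

# Advertise a narrower cipher list via the curl package
h <- curl::new_handle(ssl_cipher_list = "ECDHE-RSA-AES256-GCM-SHA384")
jsonlite::fromJSON(
  rawToChar(curl::curl_fetch_memory(ssl_check_url, handle = h)$content)
)$given_cipher_suites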

FIN

You should try to stay current with the OpenSSL (or equivalent) library on your operating system and also with the libcurl library on your system and then the curl, openssl, and RCurl packages. You may have production-rules requiring lead-times for changing configs but these should be in the “test when they make it to CRAN and install as-soon-as-possible-thereafter” category.

Despite their now inherent evil status, GitHub has some tools other repository aggregators do not. One such tool is the free vulnerability alert service which will scan repositories for outdated+vulnerable dependencies.

Now, “R” is nowhere near a first-class citizen in the internet writ large, including software development tooling (e.g. the Travis-CI and GitLab continuous integration recipes are community maintained vs a first-class/supported offering). This also means that GitHub’s service will never check for nor alert when a pure R package has security issues, mostly due to the fact that there are only a teensy few of us who even bother to check packages for issues once in a while, and there’s no real way to report said issues into the CVE process easily (though I guess I could, given that my $DAYJOB is an official CVE issuer). So the integrity & safety of the R package ecosystem is still in the “trust me, everything’s fine!” state. Given that, any extra way to keep even some R packages less insecure is great.

So, right now you’re thinking “you click-baited us with a title that your lede just said isn’t possible…WTHeck?!”.

It’s true that GitHub does not consider R a first-class citizen, but it does support Java and:

    available.packages() %>% 
      dplyr::as_data_frame() %>% 
      tidyr::separate_rows(Imports, sep=",[[:space:]]*") %>% # we really just
      tidyr::separate_rows(Depends, sep=",[[:space:]]*") %>% # need these two
      tidyr::separate_rows(Suggests, sep=",[[:space:]]*") %>%
      tidyr::separate_rows(Enhances, sep=",[[:space:]]*") %>%
      dplyr::select(Package, Imports, Depends) %>% 
      dplyr::filter(
        grepl("rJava", Imports) | grepl("rJava", Depends) # rJava in either field
      ) %>% 
      dplyr::distinct(Package) %>% 
      dplyr::summarise(total_pkgs_using_rjava = n())
    ## # A tibble: 1 x 1
    ##   total_pkgs_using_rjava
    ##                    <int>
    ## 1                     66

according to ☝ there are 66 CRAN packages that require rJava, seven of which explicitly provide only JARs (a compressed directory tree of supporting Java classes). There are more CRAN-unpublished rJava-based projects on GitLab & GitHub, but it’s likely that public-facing rJava packages that include or depend on public JAR-dependent projects still number less than ~200. Given the now >13K packages in CRAN, this is a tiny subset but with the sorry state of R security, anything is better than nothing.

Having said that, one big reason (IMO) for the lack of Java-wrapped CRAN or “devtools”-only released rJava-dependent packages is that it’s 2018 and you still have better odds of winning a Vegas jackpot than you do getting rJava to work on your workstation in fewer than 4 tries, especially after an OS upgrade. That’s sad since there are many wonderful, solid and useful Java libraries that would be super-handy for many workflows, yet most of us package-writers (I’m including myself) prefer to spin our wheels getting C++ or Rust libraries working with R rather than try to make it easier for regular R users to tap into that rich Java ecosystem.

But, I digress.

For the handful of us that do write and use rJava-based packages, we can better serve our userbase by deliberately putting those R+Java repos on GitHub. Now, I hear you. They’re evil and by doing this one of the most evil corporations on the planet can make money with your metadata (and, possibly just blatantly steal your code for use in-product without credit) but I’ll give that up on a case-by-case basis to make it easier to keep users safe.

Why will this enhance safety? Go take a look at one of my non-CRAN rJava-backed packages: pdfbox. It has this awesome “in-your-face” security warning banner:

The vulnerability is CVE-2018-11797, which is baseline computed to be “high severity” with the following specific weakness: In Apache PDFBox 1.8.0 to 1.8.15 and 2.0.0RC1 to 2.0.11, a carefully crafted PDF file can trigger an extremely long running computation when parsing the page tree. So, it’s a process denial of service vulnerability. You’ll also note I haven’t updated the JARs yet (mostly since it’s not a code-execution vulnerability).

I knew about this 28 days ago (I’ve been incredibly busy and there’s a lot of blather required to talk about it, hence the delay in blogging) thanks to the GitHub service, and will resolve it when I get some free time over the Thanksgiving break. I received an alert for this, there are hooks for security alerts (so one can auto-create an issue), and there’s a warning for users, any of whom could file an issue to let me know it’s super-important to them that I get it fixed (or they could be super-awesome and file a PR :-).

FIN

The TLDR is, first, a note to package authors who use rJava: bite the GitHub bullet and take advantage of this free service; and, second, a note to users: encourage the authors of packages you use to adopt the service, and keep a watchful eye out for any security alerts for code you depend on to get things done.

A (perhaps) third and final note is for all of us to be continually mindful of the safety & integrity of the R package ecosystem and to do what we can to keep moving it forward.

If you’re an R/RStudio user who has migrated to Mojave (macOS 10.14) or are contemplating migrating to it, you will likely eventually run into an issue where you’re trying to access resources that are in Apple’s new hardened filesystem sandboxes.

Rather than reinvent the wheel by blathering about what that means, give these links a visit if you want more background info:

and/or Google/DuckDuckGo for “macos privacy tcc full disk access”, as there have been a ton of blog posts on this topic.

The TLDR for getting your R/RStudio bits to work in these sandboxed areas is to grant them “Full Disk Access” permissions (if that sounds scary, it is). You can do that by adding both the R and RStudio executables (drag their icons) to the Full Disk Access pane under the Privacy tab of System Preferences => Security & Privacy:

I also used the Finder’s “Go To” command to head on over to /usr/local/bin, used the “Show Original” popup on both R and Rscript, and dragged the fully qualified binaries into the pane as well (you can’t see them in the screenshot b/c the pane is far too small). The symbolic links might be sufficient, but I’ve been running this way since the betas and just re-drag the versioned R/Rscript binaries each time I upgrade (or rebuild) R.
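
If you’d rather not click around in Finder, here’s a quick way to see the real, versioned binaries behind those symlinks from an R console (the /usr/local/bin paths are just the standard CRAN-installer locations, so adjust if yours differ):

# resolve the symlinks so you know exactly which binaries to drag into the pane
Sys.readlink(c("/usr/local/bin/R", "/usr/local/bin/Rscript"))

# or get fully resolved paths (normalizePath() follows the links all the way down)
normalizePath(c("/usr/local/bin/R", "/usr/local/bin/Rscript"))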

If you do grant FDA to R/RStudio, just make sure to be even more careful about trusting R code you run from the internet or R packages you install from untrusted sources (like GitHub), since R/RStudio are now potential choice conduits for malicious code that wants to get at your seekret things.

Photo by Alexander Dummer on Unsplash

The CBC covered the recent (as of the original post-time on this blog entry) Quebec elections and used a well-crafted hex grid map to display results:

They have a great ‘splainer on why they use this type of map.

Thinking that it may be useful for others, I used a browser Developer Tools inspector to yank out the javascript-created SVG, wrangled out the hexes using svg2geojson, and put them into a GeoJSON file along with some metadata that I extracted from the minified javascript on the CBC’s site and turned into a data frame using the V8 package. Since most of the aforementioned work was mouse clicking and ~8 (disjointed) lines of accompanying super-basic R code, there’s not really much to show wrangling-wise1, but I can show an example of using the new GeoJSON file in R with the sf package:

library(sf)
library(ggplot2)

# get the GeoJSON file from: https://gitlab.com/hrbrmstr/quebec-hex-ridings or https://github.com/hrbrmstr/quebec-hex-ridings
sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill = regionname)) +
  coord_sf(datum = NA) +
  ggthemes::scale_fill_tableau(name = NULL, "Tableau 20") +
  ggthemes::theme_map() +
  theme(legend.position = "bottom")

And, with a little more ggplot2-tweaking and some magick, we can even put it in the CBC-styled border:

library(sf)
library(magick)
library(ggplot2)

plt <- image_graph(1488, 1191, bg = "white") # open a magick drawing device sized to the background art
sf::st_read("quebec-ridings.geojson", quiet=TRUE, stringsAsFactors=FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill=regionname)) +
  coord_sf(datum=NA) +
  scale_x_continuous(expand=c(0,2)) +
  scale_y_continuous(expand=c(0,0)) +
  ggthemes::theme_map() +
  theme(plot.margin = margin(t=150)) + # extra top margin so the composited header artwork fits
  theme(legend.position = "none")
dev.off() # close the device; plt now holds the rendered map

# get this bkgrnd img from the repo
image_composite(plt, image_read("imgs/background.png")) %>% 
  image_write("imgs/composite-map.png")

You can tweak the border color with magick as needed, and there’s a background2.png in the imgs directory of the repo with the white inset that you can composite further if you like.

With a teensy bit of work you should be able to adjust the stroke color via aes() to separate things as the CBC did.
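
For instance, something along these lines should get you close. It’s only a sketch: regionname is the column used above, and the grey stroke scale is an arbitrary choice just to make the region boundaries stand out:

library(sf)
library(ggplot2)

# map the hex outline colour to the region as well, with its own darker scale
sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill = regionname, colour = regionname), size = 0.33) +
  coord_sf(datum = NA) +
  ggthemes::scale_fill_tableau(name = NULL, "Tableau 20") +
  scale_colour_grey(start = 0.2, end = 0.4, guide = "none") +
  ggthemes::theme_map() +
  theme(legend.position = "bottom")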

FIN

It’s important to re-state that the CBC made the original polygons for the hexes (well, they made a set of grid points and open source software turned them into a set of SVG paths) and the background images. All I did was a bit of extra wrangling and conversionating2.

1 I can toss a screencast if there’s sufficient interest.
2 Totally not a word.

In my semi-daily run of brew update I noticed that proj4 had been updated to 5.2. I kinda “squeee”-ed since (as the release notes show) the Equal Earth projection was added to it (+proj=eqearth).

As the team who created the projection describes it: “The Equal Earth map projection is a new equal-area pseudocylindrical projection for world maps. It is inspired by the widely used Robinson projection, but unlike the Robinson projection, retains the relative size of areas. The projection equations are simple to implement and fast to evaluate. Continental outlines are shown in a visually pleasing and balanced way.”

They continue: “The Robinson and Equal Earth projections share a similar outer shape[…] Upon close inspection, however, the way that they differ becomes apparent. The Equal Earth with a height-to-width ratio of 1:2.05 is slightly wider than the Robinson at 1:1.97. On the Equal Earth, the continents in tropical and mid-latitude areas are more elongated (north to south) and polar areas are more flattened. This is a consequence of Equal Earth being equal-area in contrast to the Robinson that moderately exaggerates high-latitude areas.”
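
Since the authors stress how simple the equations are, here’s a rough, spherical sketch of the forward transform in plain R, using the A1–A4 polynomial coefficients from the Equal Earth paper. It’s illustration only; proj4/PROJ already implements +proj=eqearth for you (with proper ellipsoid handling), so don’t use this for real work:

# forward Equal Earth (spherical, unit radius): lon/lat in degrees -> x/y
eqearth_fwd <- function(lon_deg, lat_deg) {
  A1 <- 1.340264; A2 <- -0.081106; A3 <- 0.000893; A4 <- 0.003796
  M <- sqrt(3) / 2
  lambda <- lon_deg * pi / 180
  phi <- lat_deg * pi / 180
  theta <- asin(M * sin(phi)) # parametric latitude
  y <- theta * (A1 + A2 * theta^2 + theta^6 * (A3 + A4 * theta^2))
  x <- lambda * cos(theta) /
    (M * (A1 + 3 * A2 * theta^2 + theta^6 * (7 * A3 + 9 * A4 * theta^2)))
  data.frame(x = x, y = y)
}

eqearth_fwd(c(0, -70, 100), c(45, -33, 60))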

Here’s a comparison of it to other, similar, projections:

©Taylor & Francis Group, 2018. All rights reserved.

Map projections are a pretty nerdy thing, but this one even got the attention of Newsweek.

To use this new projection in R right now, you’ll need to install the proj4 package from source after upgrading your system proj4 library. That was a simple brew upgrade for me, and Linux users can do their various package-manager incantations to get theirs as well. Windows users can be jealous for a while until updated package binaries make their way to CRAN (or switch to a real platform).
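
For reference, the whole dance looks roughly like this (the Homebrew formula name and the lon/lat test point are just illustrative, so adapt to your platform):

# in a terminal first: brew upgrade proj   (or your platform's equivalent, to get PROJ >= 5.2)
install.packages("proj4", type = "source") # rebuild the R binding against the new library

# quick sanity check that the new projection is available (projects lon 0, lat 45)
proj4::project(matrix(c(0, 45), ncol = 2), "+proj=eqearth")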

After a fresh source install of proj4 all you have to do is:

library(ggalt) # git[la|hu]b/hrbrmstr/ggalt
library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes
library(ggplot2)

world <- map_data("world")

ggplot() +
  geom_map(
    map = world, data = world,
    aes(long, lat, map_id = region), 
    color = ft_cols$white, fill = ft_cols$slate,
    size = 0.125
  ) +
  coord_proj("+proj=eqearth") +
  labs(
    x = NULL, y = NULL,
    title = "Equal Earth Projection (+proj=eqearth)"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank())

to get:

Remember, friends don't let friends use Mercator.