Exploring R-Bloggers Posts with the Feedly API

There’s a yuge chance you’re reading this post (at least initially) on R-Bloggers right now (though you should also check out R Weekly and add their live feed to your RSS reader pronto!). It’s a central “watering hole” for R folks and is read by many (IIRC over 20,000 Feedly users have it in their OPML).

I’m addicted to Feedly and waited years for them to publish their API. They have, and there will eventually be a package for it (go for it if you want to get’er done before me, since I won’t have time to do it justice for a while). As just parenthetically noted, I’ve started work on one and have scaffolded just enough to give R folks a present: almost 5 years of R-Bloggers data (posts, engagement rates, authors, etc.). But you’ll have to put up with some exposition first.

Digging In

We’ll need some packages to help with this exposition and extraction. Plus, you’ll need to go to https://developer.feedly.com/ to get your developer token (NOTE: this requires either a “Pro” account, or a regular account plus manually doing the OAuth dance to get an access token; any final “Feedly package” by myself or others will likely use OAuth) and store it in your ~/.Renviron as FEEDLY_ACCESS_TOKEN.
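
Here’s a minimal sketch of that token setup, assuming the variable is named FEEDLY_ACCESS_TOKEN as described above. The helper name `require_token()` is made up for this illustration and is not part of any Feedly package; the ~/.Renviron entry is shown as a comment with a placeholder value.

```r
# ~/.Renviron would contain a line like this (placeholder value):
# FEEDLY_ACCESS_TOKEN=your-token-here

# Hypothetical helper: fail early, with a readable message, if the
# environment variable is missing or empty.
require_token <- function(var = "FEEDLY_ACCESS_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop(
      sprintf("%s is not set; add it to ~/.Renviron and restart R", var),
      call. = FALSE
    )
  }
  token
}
```

Failing early like this is just a convenience: an empty token would otherwise only surface later as an HTTP authorization error.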

I’ve sliced and diced bits from the (non-published) fledgling package to give a peek behind the API covers. There’s plenty of exposition in the following code block comment header to describe what it does:

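# Attached packages assumed by the rest of the post (others are called via `::`)
library(tidyverse)   # map_df(), select(), mutate(), glimpse(), ggplot()
library(hrbrthemes)  # theme_ipsum_ps(), scale_x_comma(), scale_y_comma()

#' Simplifying some example package setup for this non-pkg example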
.pkgenv <- new.env(parent=emptyenv())
.pkgenv$token <- Sys.getenv("FEEDLY_ACCESS_TOKEN")

#' In reality, this is more complex since the non-toy example has to
#' refresh tokens when they expire.
.feedly_token <- function() {
  return(.pkgenv$token)
}

#' Get a chunk of a Feedly "stream"
#'
#' For the purposes of this short example, consider a
#' "stream" to be all the historical items in a feed.
#' (Note: the definition is more complex than that)
#'
#' Max "page size" (max number of items returned in a single call)
#' is 1,000. For example simplicity, there's a blanket assumption
#' that if `continuation` is actually present, the caller is
#' savvy and asked for a large number of items (e.g. 10,000).
#' Therefore, assume we're paging by the thousands.
#'
#' @md
#' @param stream_id the id of the stream (for this example, a feed id)
#' @param ct number of items to retrieve (API will only return 1,000
#'        items for a single response and populate `continuation`
#'        with a value that should be passed to subsequent calls
#'        to page through the results; `ct` will be reset to 1,000
#'        internally if this is the case)
#' @param continuation see `ct`
#' @references <https://developer.feedly.com/v3/streams/>
#' @return for this example, an ugly `list`
feedly_stream <- function(stream_id, ct=100L, continuation=NULL) {

  ct <- as.integer(ct)

  if (!is.null(continuation)) ct <- 1000L

  httr::GET(
    url = "https://cloud.feedly.com/v3/streams/contents",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    ),
    query = list(
      streamId = stream_id,
      count = ct,
      continuation = continuation
    )
  ) -> res

  httr::stop_for_status(res)

  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)

  res

}

We’ll ask for 10,000 Feedly entries for the R-Bloggers feed stream, then keep following `continuation` until a response comes back without one:

r_bloggers_feed_id <- "feed/http://feeds.feedburner.com/RBloggers"

rb_stream <- feedly_stream(r_bloggers_feed_id, 10000L)

# preallocate space (the list simply grows if more than 10 pages come back)
streams <- vector("list", 10)
streams[1L] <- list(rb_stream)

# gotta catch'em all!
idx <- 2L
while(length(rb_stream$continuation) > 0) {
  cat(".", sep="") # poor dude's progress bar
  feedly_stream(
    stream_id = r_bloggers_feed_id,
    ct = 1000L,
    continuation = rb_stream$continuation
  ) -> rb_stream
  streams[idx] <- list(rb_stream)
  idx <- idx + 1L
}
cat("\n")
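
If the paging logic is hard to picture without a token, here’s a network-free sketch of the same “follow the continuation” pattern. Everything in it is made up for illustration: `fake_stream()` is a stand-in for `feedly_stream()` that serves 2,500 fake items, and `collect_pages()` is a hypothetical helper, not part of any package.

```r
# Stand-in for feedly_stream(): serves `total` fake items in pages of up
# to `page` and returns a `continuation` only while more items remain.
fake_stream <- function(continuation = NULL, total = 2500L, page = 1000L) {
  start <- if (is.null(continuation)) 1L else as.integer(continuation)
  end <- min(start + page - 1L, total)
  res <- list(items = data.frame(id = seq(start, end)))
  if (end < total) res$continuation <- as.character(end + 1L)
  res
}

# Keep requesting pages until a response arrives without `continuation`.
collect_pages <- function(fetch) {
  pages <- list()
  res <- fetch(NULL)
  pages[[1L]] <- res
  while (length(res$continuation) > 0) {
    res <- fetch(res$continuation)  # always pass the most recent value
    pages[[length(pages) + 1L]] <- res
  }
  pages
}

pages <- collect_pages(fake_stream)
length(pages)                                              # pages fetched
sum(vapply(pages, function(p) nrow(p$items), integer(1)))  # total items
```

The detail that matters is that each request uses the `continuation` from the response just received; reusing an older value would presumably just hand back a page you already have.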

For those who aren’t used to piecing together bits from APIs like this (and for those who don’t have a Pro account, didn’t want to write OAuth code, or don’t use Feedly and so can’t reproduce the post example), here’s some dissection:

str(streams, 1)
## List of 12
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 6 # No "continuation" in this one

str(streams[[1]], 1)
## List of 7
##  $ id          : chr "feed/http://feeds.feedburner.com/RBloggers"
##  $ title       : chr "R-bloggers"
##  $ direction   : chr "ltr"
##  $ updated     : num 1.52e+12
##  $ alternate   :'data.frame':	1 obs. of  2 variables:
##  $ continuation: chr "15f457e2b66:160d6e:8cbd7d4f"
##  $ items       :'data.frame':	1000 obs. of  22 variables:

glimpse(streams[[1]]$items)
## Observations: 1,000
## Variables: 22
## $ id             <chr> "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP10=_1628f55fc26:7feb...
## $ keywords       <list> ["R bloggers", "R bloggers", "R bloggers", "R bloggers", "R b...
## $ originId       <chr> "https://tjmahr.github.io/ridgelines-in-bayesplot-1-5-0-releas...
## $ fingerprint    <chr> "f96c93f7", "9b2344db", "ca3762c8", "980635d0", "fbd60fac", "6...
## $ content        <data.frame> c("<p><div><div><div><div data-show-faces=\"false\" dat...
## $ title          <chr> "Ridgelines in bayesplot 1.5.0", "Mathematical art in R", "R a...
## $ published      <dbl> 1.522732e+12, 1.522796e+12, 1.522714e+12, 1.522714e+12, 1.5227...
## $ crawled        <dbl> 1.522823e+12, 1.522809e+12, 1.522794e+12, 1.522793e+12, 1.5227...
## $ canonical      <list> [<https://www.r-bloggers.com/ridgelines-in-bayesplot-1-5-0/, ...
## $ origin         <data.frame> c("feed/http://feeds.feedburner.com/RBloggers", "feed/h...
## $ author         <chr> "Higher Order Functions", "David Smith", "R Views", "rOpenSci ...
## $ alternate      <list> [<http://feedproxy.google.com/~r/RBloggers/~3/O5DIWloFJO8/, t...
## $ summary        <data.frame> c("At the end of March, Jonah Gabry and I released\nbay...
## $ visual         <data.frame> c("feedly-nikon-v3.1", "feedly-nikon-v3.1", "feedly-nik...
## $ unread         <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories     <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c3d/category/big data...
## $ engagement     <int> 9, 37, 52, 15, 78, 35, 31, 9, 28, 2, 21, 8, 25, 11, 21, 29, 12...
## $ engagementRate <dbl> 0.41, 1.37, 1.58, 0.45, 2.23, 0.97, 0.84, 0.23, 0.72, 0.05, 0....
## $ recrawled      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1.522807e+12, NA, NA, NA, NA, ...
## $ tags           <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
## $ decorations    <data.frame> c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",...
## $ enclosure      <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...

That entry structure is defined in the Feedly API docs.

We’ll extract the bits we want to use for the rest of the post and clean them up a bit:

map_df(streams, ~{
  select(.x$items, title, author, published, engagement) %>%
    mutate(published = anytime::anydate(published / 1000)) %>% # overly-high-resolution timestamp
    tibble::as_tibble() # tbl_df() is deprecated in current dplyr
}) -> xdf

glimpse(xdf)
## Observations: 11,421
## Variables: 4
## $ title      <chr> "Ridgelines in bayesplot 1.5.0", "Mathematical art in R", "R and T...
## $ author     <chr> "Higher Order Functions", "David Smith", "R Views", "rOpenSci - op...
## $ published  <date> 2018-04-03, 2018-04-03, 2018-04-02, 2018-04-02, 2018-04-03, 2018-...
## $ engagement <int> 9, 37, 52, 15, 78, 35, 31, 9, 28, 2, 21, 8, 25, 11, 21, 29, 12, 11...
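
The `published / 1000` step is there because the timestamps are epoch milliseconds. Here’s a base R sketch of the same conversion, for anyone who wants to see it without anytime. The helper name `ms_to_date()` is hypothetical, and the input is the (rounded) first `published` value as displayed in the earlier glimpse() output.

```r
# Convert epoch milliseconds to a Date (UTC), base R only.
ms_to_date <- function(ms) {
  as.Date(as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC"), tz = "UTC")
}

ms_to_date(1522732000000)
## [1] "2018-04-03"
```

This pins everything to UTC, which is an assumption of the sketch; a post published near midnight could land on a different date in another time zone.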

Using an arbitrary “10,000” extract didn’t give us full months:

range(xdf$published)
## [1] "2013-05-31" "2018-04-03"

so we’ll filter out the incomplete bits and add in some additional temporal metadata:

xdf %>%
  filter(
    published > as.Date("2013-05-31"),  # complete months
    published < as.Date("2018-04-01")
  ) %>%
  mutate(
    year = as.integer(lubridate::year(published)),
    month = lubridate::month(published, label=TRUE, abbr=TRUE),
    wday = lubridate::wday(published, label=TRUE, abbr=TRUE),
    ym = as.Date(format(published, "%Y-%m-01"))
  ) -> xdf
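
To make the filter bounds and the `ym` column concrete, here’s a tiny base R sketch on four made-up dates (two on the edges we drop, two we keep):

```r
d <- as.Date(c("2013-05-31", "2013-06-01", "2018-03-31", "2018-04-03"))

# Keep only dates inside complete months (same bounds as the filter above).
keep <- d > as.Date("2013-05-31") & d < as.Date("2018-04-01")
d_complete <- d[keep]

# Floor each remaining date to the first day of its month.
ym <- as.Date(format(d_complete, "%Y-%m-01"))
ym
## [1] "2013-06-01" "2018-03-01"
```

Flooring to the first of the month gives a real Date to group and plot by, which keeps the x axis a proper date scale.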

I’m only going to do some light analysis work with engagement data (how “popular” a post was) but the full post summary and body content is available in the data dump you’re going to get at the end (this is reminding me of the Sesame Street “Monster at the End of This Book” story). That means enterprising folk can do some tidy text mining to cluster away some additional insights.

Thankfully, there’s not a ton of missing engagement data:

sum(is.na(xdf$engagement)) / nrow(xdf)
## [1] 0.06506849

broom::tidy(summary(xdf$engagement))
##   minimum q1 median     mean q3 maximum  na
## 1       0  5     20 69.27219 75    4785 741
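
A quick sketch of the two checks above on a made-up vector (these are not the real engagement values): the share of missing values, and the mean-versus-median gap that hints at skew.

```r
engagement <- c(0, 5, 20, NA, 75, 4785, NA, 12)  # made-up values

# The mean of a logical vector is the proportion of TRUEs, so this
# matches sum(is.na(x)) / length(x).
na_share <- mean(is.na(engagement))
na_share

# A mean far above the median is a quick hint of right skew.
med <- median(engagement, na.rm = TRUE)
avg <- mean(engagement, na.rm = TRUE)
c(median = med, mean = avg)
```

In the real data the same gap shows up as a median of 20 against a mean of about 69.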

Let’s look at post count over time, first:

count(xdf, ym) %>%
  arrange(ym) %>%
  ggplot(aes(ym, n)) +
  ggforce::geom_bspline0(color="lightslategray") +
  scale_x_date(expand=c(0,0.5)) +
  labs(
    x=NULL, y="Post count",
    title="R-Bloggers Post Count",
    subtitle="June 2013 — March 2018"
  ) +
  theme_ipsum_ps(grid="XY")

It’ll be interesting to watch that over this year and compare 2017 to 2018 given how “hot” 2017 seems to have been. To turn a Mythbuster phrase: a neat “try this at home” exercise would be to tease out some “whys” for various spikes (which likely means some post content spelunking).

Let’s see if any days are more popular than others:

count(xdf, wday) %>%
  ggplot(aes(wday, n)) +
  geom_col(fill="lightslategray", width=0.65) +
  scale_y_comma() +
  labs(
    x=NULL, y="Post count",
    title="R-Bloggers Aggregate Post Count By Day of Week"
  ) +
  theme_ipsum_ps(grid="Y")

Weekends are sleepy and there are some “go-getters” at the beginning of the week. More “try this at home” would be to see if any individuals have “patterns” by day of week (or even time of day, since that’s also available in the published time stamp).

The summary() above told us we have a pretty skewed engagement distribution, but it’s always nice to visualise just how bad it is:

ggplot(xdf, aes(engagement)) +
  geom_density(aes(y=after_stat(count)), fill="lightslategray", alpha=2/3) + # calc() only existed in a dev version of ggplot2
  scale_x_comma() +
  scale_y_comma() +
  labs(
    x="Engagement", y="Count",
    title = "R-Bloggers Post Engagement Distribution",
    subtitle = "June 2013 — March 2018"
  ) +
  theme_ipsum_ps(grid="XY")

That graph is the story of my daily life dealing with internet data. Couldn’t even get a break when trying to have some fun. #sigh

We’ll close with the “all time top 10” based on total engagement:

count(xdf, author, wt=engagement, sort=TRUE)
## # A tibble: 1,065 x 2
##    author               n
##  1 David Smith      87381
##  2 Tal Galili       29302
##  3 Joseph Rickert   16846
##  4 DataCamp Blog    14402
##  5 DataCamp         14208
##  6 John Mount       13274
##  7 Francis Smart     8506
##  8 hadleywickham     8129
##  9 hrbrmstr          7855
## 10 Sharp Sight Labs  7620
## # ... with 1,055 more rows
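
For anyone not used to the `wt=` argument: as I understand it, `count()` with a weight sums that column per group (ignoring missing values) instead of counting rows. Here’s a base R sketch of the same idea on made-up data; the authors and numbers are invented.

```r
posts <- data.frame(
  author = c("a", "b", "a", "c", "b", "a"),
  engagement = c(10, 5, NA, 7, 20, 3),
  stringsAsFactors = FALSE
)

# Sum engagement per author, dropping missing values, then sort descending.
totals <- vapply(
  split(posts$engagement, posts$author),
  sum, numeric(1), na.rm = TRUE
)
totals <- sort(totals, decreasing = TRUE)
totals
```

Keep in mind that a total like this rewards volume as much as popularity; dividing by the number of posts per author would tell a different story.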

@revodavid is a blogging machine, and that top-spot is well-deserved given the plethora of interesting, useful and fun content he shares. And, it looks like someone only needs to blog a bit more this year to overtake @hadley (I’m comin’ fer ya, Hadley!).

FIN

As promised, you can get the data in a ~30MB RDS file via https://rud.is/dl/r-bloggers-feedly-streams.rds and can then use the extraction-to-data-frame example from above to work with the bits you care about.
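
Here’s a rough sketch of that workflow in base R. It assumes the RDS holds the same list-of-pages structure as `streams` above (each element carrying an `items` data frame); I’m inferring that from the post rather than from the file itself. `extract_posts()` is a hypothetical helper, and the toy list stands in for the real download so the example runs offline.

```r
# Hypothetical helper: pull a few flat columns out of every page and
# stack them into one data frame.
extract_posts <- function(streams,
                          cols = c("title", "author", "published", "engagement")) {
  pages <- lapply(streams, function(s) s$items[, cols, drop = FALSE])
  do.call(rbind, pages)
}

# Toy stand-in with the assumed shape, round-tripped through an RDS file.
toy <- list(
  list(items = data.frame(
    title = c("t1", "t2"), author = c("a", "b"),
    published = c(1522732000000, 1522796000000),
    engagement = c(9L, 37L), extra = 1:2,
    stringsAsFactors = FALSE
  )),
  list(items = data.frame(
    title = "t3", author = "a",
    published = 1522714000000,
    engagement = 52L, extra = 3L,
    stringsAsFactors = FALSE
  ))
)
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, rds_file)

posts <- extract_posts(readRDS(rds_file))
nrow(posts)

# For the real file, download it first, then read it the same way:
# download.file("https://rud.is/dl/r-bloggers-feedly-streams.rds",
#               "r-bloggers-feedly-streams.rds", mode = "wb")
# streams <- readRDS("r-bloggers-feedly-streams.rds")
```

Selecting only flat columns before stacking is deliberate: the nested data frame columns (such as `content` and `summary`) are fiddlier to bind row-wise.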

Hopefully folks will have some fun with this and share their results!

7 Comments

- MikeJackTzen (2018-04-16 at 12:28): this is awesome. if you have time, can you have a part 2 that focuses on accessing / exporting your own ‘saved for later’ feedly articles?
  - hrbrmstr (2018-04-16 at 13:16): Sure! I might even try to make it an RStudio addin.
  - hrbrmstr (2018-04-16 at 16:00): Here ya go: https://rud.is/b/2018/04/16/by-request-retrieving-your-feedly-saved-for-later-entries/ !
- Zehra (2023-07-27 at 10:02): Is there a way to fetch news data between desired dates? Thank you in advance.
  - hrbrmstr (2023-07-28 at 06:32): Well met! I’ll try to check before the remaining sub runs out (Feedly pulled some very skeezy things on me, all in the name of greed, and I abandoned them in favor of Inoreader a while ago).
    - Zehra (2023-07-30 at 10:26): Thank you, will be waiting!
- Zehra (2023-07-30 at 11:09): I am retrieving data from Feedly using the feedly_stream function. Even though I am using the continuation id from the earlier call, it does not stop fetching the old data, and I still get the same data when I compare results from different API calls. I could not find any other packages for fetching data from Feedly. Is there a way to fetch data from the oldest to the newest on the stream?
