I’m uncontainably excited to report that the ggplot2 extension package ggalt is now on CRAN.

The absolute best part of this package is the R community members who contributed suggestions and new geoms, stats, annotations and integration features. This release would not be possible without the PRs from:

  • Ben Bolker
  • Ben Marwick
  • Jan Schulz
  • Carson Sievert

and a host of folks who have made suggestions and have put up with broken GitHub builds along the way. Y’all are awesome.

Please see the vignette and graphics-annotated help pages for info on everything that’s available. Some highlights include (a quick sketch of a couple of them follows the list):

  • multiple ways to render splines (so you can make those cool, smoothed D3-esque line charts :-)
  • geom_cartogram() which replicates the old functionality of geom_map() so your old mapping code doesn’t break anymore
  • a re-re-mended coord_proj() (but, read on about why you should be re-thinking of how you do maps in ggplot2)
  • lollipop charts (geom_lollipop())
  • dumbbell charts (geom_dumbbell())
  • step-ribbon charts
  • the ability to easily encircle points (beyond those boring ellipses)
  • byte formatters (i.e. turn 1024 to 1 Kb, etc)
  • better integration with plotly
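
As promised, a minimal sketch of two of the new geoms (the data frame, column names and values below are made up purely for illustration):

library(ggplot2)
library(ggalt)

df <- data.frame(team=c("A", "B", "C"), y2015=c(20, 35, 40), y2016=c(35, 50, 45))

# dumbbell: one segment per team, from the 2015 value to the 2016 value
ggplot(df, aes(y=team, x=y2015, xend=y2016)) +
  geom_dumbbell(colour_x="#a3c4dc", colour_xend="#0e668b", size=1)

# lollipop: a thin stem with a dot at the value
ggplot(df, aes(x=team, y=y2016)) +
  geom_lollipop(point.size=3)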

If you do any mapping in ggplot2, please follow the machinations of geom_sf() and the sf package. Ed and the rest of the R spatial community have done 100% outstanding work here and it is going to change how you think and work spatially in R forever (in an awesome way). I hope to retire coord_proj() and geom_cartogram() some day soon thanks to all their hard work.

Your contributions, feedback and suggestions are welcome and encouraged. The next steps for me w/r/t ggalt are ensuring 100% plotly coverage since it’s the best way to make your ggplot2 plots interactive. There are a few more additions that didn’t make it into this release that I’ll also be integrating.

Please make sure to say “thank you” to the above contributors if you see them in person or on the internets. They’ve done great work and are shining examples of how awesome and talented the R community is.

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

library(splashr)
library(DiagrammeR)
library(htmlwidgets) # for saveWidget()
library(magrittr)    # for %>%

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

But if you use the new Docker image and the add_tempdir=TRUE parameter it can render any local HTML file.

The other new bits are helpers to identify content types in HAR entries. Along with get_content_type():

library(tidyverse)

map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA
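
Those content-type strings make quick filtering easy. For example, here’s a small sketch that pulls out the indexes of the image responses in that same HAR:

ctypes <- map_chr(rud_har$log$entries, get_content_type)
which(grepl("^image/", ctypes))  # grepl() returns FALSE for the NA entries, so they drop out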

There are also many is_...() functions for logical tests.

But, one of the more interesting is_() functions is is_xhr(). Sites with dynamic content usually load said content via an XMLHttpRequest or XHR for short. Modern web apps usually return JSON in said requests and, for questions like this one on StackOverflow it’s usually better to grab the JSON and use it for data than it is to scrape the table made from JavaScript calls.

Now, it’s not too hard to open Developer Tools and find those XHR requests, but we can also use splashr to programmatically find them. We have to do a bit more work and use the new execute_lua() function since we need to give the page time to load up all the data. (I’ll eventually write a mini-R-DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks). Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{"practice_name":"Pediatrics, Inc.", "address_1":"577 Main Street", "city":&q"| __truncated__ "{"practice_name":"Crown Colony Pediatrics", "address_1":"500 Congress Street, Suite 1F""| __truncated__ "{"practice_name":"Pediatricians Inc.", "address_1":"955 Main Street", "city":"| __truncated__ "{"practice_name":"Lexington Pediatrics", "address_1":"19 Muzzey Street, Suite 105", &qu"| __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()

UPDATE: So, I wrote a mini-DSL for this:

splash_local %>%
  splash_response_body(TRUE) %>% 
  splash_go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D") %>% 
  splash_wait(2) %>% 
  splash_har() -> res

which should make it easier to perform basic “go-wait-retrieve” operations.

It’s unlikely we want to rely on a running Splash instance for our production work, so I’ll be making a helper function to turn HAR XHR requests into httr function calls, similar to the way curlconverter works.
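
As a rough sketch of that idea (hand-rolled, not the eventual helper; it assumes the HAR entry exposes the request URL as $request$url per the HAR spec, and that the XHR we found above is a plain GET needing no extra headers or cookies):

library(httr)
library(jsonlite)

xhr_req <- pg$log$entries[[10]]$request  # the XHR entry we identified above

res <- GET(xhr_req$url, accept_json())

fromJSON(content(res, as="text", encoding="UTF-8")) %>% 
  glimpse()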

ggplot() +
  geom_heart() +
  coord_equal() +
  labs(title="Happy Valentine's Day") +
  theme_heart()

Presented without exposition (since it’s a silly Geom).

This particular ❤️ math pilfered this morning from @dmarcelinobr:

library(ggplot2)

geom_heart <- function(..., colour = "#67001f", size = 0.5, fill = "#b2182b",
                       mul = 1.0, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
  
  # parametric heart curve: the classic 16*sin(t)^3 / 13*cos(t) - ... form
  data <- data.frame(t=seq(0, 10*pi, by=0.1))
  
  x <- function(t) 16*sin(t)^3
  y <- function(t) 13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t)
  
  data$x <- x(data$t) * mul
  data$y <- y(data$t) * mul
  
  # repeat the first point so the polygon closes cleanly
  data <- rbind(data, data[1,])
  
  layer(
    data = data,
    mapping = aes(x=x, y=y),
    stat = "identity",
    geom = ggplot2::GeomPolygon,
    position = "identity",
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      size = size,
      colour = colour,
      fill = fill,
      ...
    )
  )
  
}

theme_heart <- function() {
  ggthemes::theme_map(base_family = "Zapfino") +
    theme(plot.title=element_text(hjust=0.5, size=28)) +
    theme(plot.margin=margin(30,30,30,30))
}

If you do enough web scraping, you’ll eventually hit a wall that the trusty httr verbs (that sit beneath rvest) cannot really overcome: dynamically created content (via javascript) on a site. If the site was nice enough to use XHR requests to load the dynamic content, you can generally still stick with httr verbs — if you can figure out what those requests are — and code-up the right parameters (browser “Developer Tools” menus/views and my curlconverter package are super handy for this). Unfortunately, some sites require actual in-page rendering and that’s when scraping turns into a modest chore.

For dynamic sites, the RSelenium and/or seleniumPipes packages are super-handy tools to have in the toolbox. They interface with Selenium which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as phantomjs. Getting the server component of Selenium running was often a source of pain for R folks, but the new docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.

However, sometimes all you need is the rendering part and for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in Python and uses Qt WebKit for rendering. To avoid deluging your system with all of the Splash dependencies you can use the docker images. In fact, I made it dead easy to do so. Read on!

Going for a dip

The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joined in the fun to add some tweaks & enhancements to it that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with it in splashr I wanted to make it as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor") you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:

library(splashr)

install_splash()
splash_svr <- start_splash()

The install_splash() function will pull the correct image to your local system and you’ll need that splash_svr object later on to stop the container. Now, you can have Splash running on any host, but this post assumes you’re running it locally.

We can test to see if the server is active:

splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008

Now, we’re ready to scrape!

We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial since it doesn’t use XHR but does require rendering and it doesn’t prohibit scraping in the Terms of Service (don’t violate Terms of Service, it is both unethical and could get you blocked, fined or worse).

Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:

You’re saying “HEY, that has <table> in the HTML, so why not just use rvest?” Well, you can validate the lack of <table>s in the “view source” view of the page or with:

library(rvest)

pg <- read_html("http://www.techstars.com/companies/")
html_nodes(pg, "table")
## {xml_nodeset (0)}

Now, let’s do it with splashr:

splash("localhost") %>% 
  render_html("http://www.techstars.com/companies/", wait=5) -> pg
  
html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table class="table75"><tbody>\n<tr>\n<th>Status</th>\n        <th>Number of Com ...
##  [2] <table class="table75"><tbody>\n<tr>\n<th colspan="2">Impact</th>\n      </tr>\n ...
##  [3] <table class="table75"><tbody>\n<tr>\n<th>Class</th>\n        <th>#Co's</th>\n   ...
##  [4] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Anywhere 2017 Q1</th>\ ...
##  [5] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Atlanta 2016 Summer</t ...
##  [6] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2013 Fall</th>\ ...
##  [7] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2014 Summer</th ...
##  [8] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2015 Spring</th ...
##  [9] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Austin 2016 Spring</th ...
## [10] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2014</th>\n   ...
## [11] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2015 Spring</ ...
## [12] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays 2016 Winter</ ...
## [13] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Cape Town 201 ...
## [14] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2015 Summ ...
## [15] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays NYC 2016 Summ ...
## [16] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Barclays Tel Aviv 2016 ...
## [17] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2015 Summer</th ...
## [18] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Berlin 2016 Summer</th ...
## [19] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2009 Spring</th ...
## [20] <table><tbody>\n<tr>\n<th class="batch_class" colspan="4">Boston 2010 Spring</th ...
## ...

We need to set the wait parameter (5 seconds was likely overkill) to give the javascript callbacks time to run. Now you can go crazy turning that into data.
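
For instance, a quick sketch of that last step (assuming, per the listing above, the third <table> is the “Summary by Class” one):

html_nodes(pg, "table")[[3]] %>% 
  html_table(header=TRUE)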

Candid Camera

You can also take snapshots (pictures) of websites with splashr, like this (apologies if you start drooling on your keyboard):

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x")

The snapshot functions return magick objects, so you can do anything you’d like with them.
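
For example, here’s a small sketch that scales a snapshot down and writes it to disk with magick:

library(magick)

splash("localhost") %>% 
  render_png("https://www.cervelo.com/en/triathlon/p-series/p5x") %>% 
  image_scale("600") %>% 
  image_write("p5x.png")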

HARd Work

Since Splash is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools package to visualize and further analyze the data. For the sake of brevity, here’s just the main print() output from a site:

splash("localhost") %>% 
  render_har("https://www.r-bloggers.com/")

## --------HAR VERSION-------- 
## HAR specification version: 1.2 
## --------HAR CREATOR-------- 
## Created by: Splash 
## version: 2.3.1 
## --------HAR BROWSER-------- 
## Browser: QWebKit 
## version: 538.1 
## --------HAR PAGES-------- 
## Page id: 1 , Page title: R-bloggers | R news and tutorials contributed by (750) R bloggers 
## --------HAR ENTRIES-------- 
## Number of entries: 130 
## REQUESTS: 
## Page: 1 
## Number of entries: 130 
##   -  https://www.r-bloggers.com/ 
##   -  https://www.r-bloggers.com/wp-content/themes/magazine-basic-child/style.css 
##   -  https://www.r-bloggers.com/wp-content/plugins/mashsharer/assets/css/mashsb.min.cs... 
##   -  https://www.r-bloggers.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?... 
##   -  https://www.r-bloggers.com/wp-content/plugins/jetpack/css/jetpack.css?ver=4.4.2 
##      ........ 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/10579991_10152371745729891_26331957... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/14962601_10210947974726136_38966601... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/c0.8.50.50/p50x50/311082_286149511398044_4... 
##   -  https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/11046696_917285094960943_6143235831... 
##   -  https://static.xx.fbcdn.net/rsrc.php/v3/y2/r/0iTJ2XCgjBy.png

FIN

You can also do some basic scripting in Splash with lua and coding up an interface with that capability is on the TODO as is adding final tests and enabling tweaking the Docker configurations to support more fun things that Splash can do.

File an issue on github if you have feature requests or problems and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.

Don’t forget to stop_splash(splash_svr) when you’re finished scraping!

I made a promise to someone that my next blog would be about stringi vs stringr and I intend to keep said promise.

stringr and stringi do “string operations”: find, replace, match, extract, convert, transform, etc.

The stringr package is now part of the tidyverse, is 100% focused on string processing and is pretty much a wrapper package for stringi. The stringi package wraps chunks of the icu4c library, but the stringi API framing was actually based on the patterns in the stringr package API. stringr did not wrap stringi at the time but does now, and stringi strays a bit (on occasion) from string processing since the entire icu4c library is at its disposal. Confused? Good! There’s more!

The impetus for asking me to blog about this is that I’m known to say “just use stringi” in situations where someone has taken a stringr “shortcut”. Let me explain why.

Readers Digest

First, you need to read pages 4-5 of the stringi manual [PDF] and then the stringr vignette. I’m not duplicating the information on those pages. The TL;DR on them is:

  • that stringr makes some (valid) assumptions about defaults for the stringi calls it wraps
  • stringr is much easier to initially grok as it’s very focused and has far fewer functions
  • they both use ICU regular expressions
  • stringi includes more than string processing and has far more total functions:

As noted, stringr wraps stringi calls (for the most part) and some of the stringr functions reference more than one stringi function:

That’s my primary defense for “just use stringi” — stringr “just uses” it and you are forced to install stringi on every system stringr is on, so why introduce another dependency into your code?
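
You can see the wrapping for yourself from the console (the exact body varies by stringr version, but you’ll find the stri_*() dispatch in there):

library(stringr)

body(str_detect)
# ...switches on the pattern type and calls stri_detect_fixed() /
# stri_detect_coll() / stri_detect_regex() under the hood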

All Wrapped Up

These are the stringi functions with a 1:~1 correspondence to stringr functions:

stri_c stri_conv stri_count stri_detect stri_dup stri_extract stri_extract_all stri_join stri_length stri_locate stri_locate_all stri_match stri_match_all stri_order stri_pad stri_replace stri_replace_all stri_replace_na stri_sort stri_split stri_split_fixed stri_sub stri_sub<- stri_subset stri_trim stri_wrap

I used 1:~1 since at the heart of the string processing capabilities of both packages lies the concept of granular control of matching behaviour. Specifically, there are four modes (so it’s really 1:4?):

  • fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets
  • coll: Compare strings respecting standard collation rules
  • regex: The default. Uses ICU regular expressions
  • boundary: Match boundaries between things

stringr has function modifiers around pattern to handle those whereas stringi requires explicit function calls. So, you’d do the following to replace a fixed char/byte sequence in each package:

  • stri_replace_all_fixed("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", ".", "#")
  • str_replace_all("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", fixed("."), "#")

In that case there’s not much in the way of keystroke savings, but the default mode of stringr is to use regex replacement, so there you do save both an i and a _regex, at the cost of one extra function call between you and your goal. When you work with multi-gigabyte character structures (as I do), those milliseconds often add up. If keystrokes > milliseconds in your workflow, you may want to stick with stringr.
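
For reference, the equivalent regex-mode calls look like this:

library(stringr)
library(stringi)

x <- "Lorem i.sum dolor sit amet, conse.tetur adipisicing elit."

str_replace_all(x, "[aeiou]", "#")        # stringr: regex is the default mode
stri_replace_all_regex(x, "[aeiou]", "#") # stringi: the mode is explicit in the name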

Treasure Hunting in stringi

If you take some time to look at what’s in stringi you’ll find quite a bit (I excluded the fixed/coll/regex/boundary versions for brevity):

That’s an SVG, so zoom in as much as you need to read it.

These are stringi gems (sampled in the snippet after the list):

  • stri_stats_general (stats abt a character vector)
  • stri_trans_totitle (For When You Want Title Case)
  • stri_flatten (paste0 but better defaults)
  • stri_rand_strings (random strings)
  • stri_rand_lipsum (random Lorem Ipsum lines!)
  • stri_count_words, stri_extract_all_words, stri_extract_first_words, stri_extract_last_words
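
A quick taste of a few of those:

library(stringi)

stri_trans_totitle("for when you want title case")
stri_flatten(c("a", "b", "c"), collapse="-")
stri_rand_strings(3, 10)  # three random alphanumeric strings of length 10
stri_count_words("How many words are in this sentence?")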

Plus it has some helpful operators:

  • %s!=%, %s!==%, %s+%, %s<%, %s<=%, %s==%, %s===%, %s>%, %s>=%, %stri!=%, %stri!==%, %stri+%, %stri<%, %stri<=%, %stri==%, %stri===%, %stri>%, %stri>=%

Of those, %s+% is ++handy for string concatenation.
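
For instance:

library(stringi)

"splash" %s+% "r"  # "splashr"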

Prior to readr, these were my go-to line/raw readers/writer: stri_read_raw, stri_read_lines, and stri_write_lines.

It also handles gnarly character encoding operations in a cross-platform, predictable manner.
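
For example, transliteration and encoding detection are both one-liners (a small sketch):

library(stringi)

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")  # "Zazolc gesla jazn"
stri_enc_detect("façade")[[1]]  # candidate encodings with confidence scores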

FIN

To do a full comparison justice would have required writing a mini-book which is something I can’t spare cycles on, so my primary goals were to make sure folks knew stringr wrapped stringi and to show that stringi has much more to offer than you probably knew. If you start to get hooked on some of the more “fun” or utilitarian functions in stringi it’s probably worth switching to it. If string ops are ancillary operations to you and you normally work in regex-land, then you’re not missing out on anything and can save a few keystrokes here and there by using stringr.

Comments are extremely encouraged for this post as I’m curious whether you knew about stringi before and when/where/how you use it vs stringr (or why you don’t).

I was enthused to see a mention of this on the GDELT blog since I’ve been working on an R package dubbed newsflash to work with the API that the form front-ends.

Given the current climate, I feel compelled to note that I’m neither a Clinton supporter/defender/advocate nor a ? supporter/defender/advocate in any way, shape or form. I’m only using the example for replication and I’m very glad the article author stayed (pretty much) non-partisan apart from some color commentary about the predictability of network coverage of certain topics.

For now, the newsflash package is configured to grab raw count data, not the percent summaries since folks using R to grab this data probably want to do their own work with it. I used the following to try to replicate the author’s findings:

library(newsflash)
library(ggalt) # github version
library(hrbrmisc) # github only
library(tidyverse)
starts <- seq(as.Date("2015-01-01"), (as.Date("2017-01-26")-30), "30 days")
ends <- as.character(starts + 29)
ends[length(ends)] <- ""

pb <- progress_estimated(length(starts))
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y)
})

clinton_timeline <- map_df(emails, "timeline")

sum(clinton_timeline$value)
## [1] 34778

count(clinton_timeline, station, wt=value, sort=TRUE) %>%
  mutate(pct=n/sum(n), pct_lab=sprintf("%s (%s)", scales::comma(n), scales::percent(pct)),
         station=factor(station, levels=rev(station))) -> timeline_df

timeline_df

## # A tibble: 7 × 4
##             station     n         pct        pct_lab
##              <fctr> <int>       <dbl>          <chr>
## 1          FOX News 14807 0.425757663 14,807 (42.6%)
## 2      FOX Business  7607 0.218730232  7,607 (21.9%)
## 3               CNN  5434 0.156248203  5,434 (15.6%)
## 4             MSNBC  4413 0.126890563  4,413 (12.7%)
## 5 Aljazeera America  1234 0.035482201   1,234 (3.5%)
## 6         Bloomberg   980 0.028178734     980 (2.8%)
## 7              CNBC   303 0.008712404     303 (0.9%)

NOTE: I had to break up the queries since a bulk one across the two dates bumps up against the API limits; I may provide helper functions for that before the CRAN release.

While my package matches the total from the news article and sample query (34,778 results), my percentages are different since they are computed across the raw counts for the included stations. “Percent of Sentences” (result “n” divided by the number of all sentences for each station in the time frame), which the author used, seems to have some utility, so I’ll probably add that as a query parameter or a new function.

Tidy news text

The package also is designed to work with the tidytext package (it’s on CRAN) and provides a top_text() function which can return a tidytext-ready tibble or a plain character vector for use in other text processing packages. If you were curious as to whether this API has good data behind it, we can take a naive peek with the help of tidytext:

library(tidytext)

tops <- map_df(emails, top_text)
anti_join(tops, stop_words) %>% 
  filter(!(word %in% c("clinton", "hillary", "server", "emails", "mail", "email",
                       "mails", "secretary", "clinton's", "secretary"))) %>% 
  count(word, sort=TRUE) %>% 
  print(n=20)

## # A tibble: 26,861 × 2
##             word     n
##            <chr> <int>
## 1        private 12683
## 2     department  9262
## 3            fbi  7250
## 4       campaign  6790
## 5     classified  6337
## 6          trump  6228
## 7    information  6147
## 8  investigation  5111
## 9         people  5029
## 10          time  4739
## 11      personal  4514
## 12     president  4448
## 13        donald  4011
## 14    foundation  3972
## 15          news  3918
## 16     questions  3043
## 17           top  2862
## 18    government  2799
## 19          bill  2698
## 20      reporter  2684

I’d say the API is doing just fine.

Fin

The package also has some other bits from the API in it and if this has piqued your interest, please leave all package feature requests or problems as a github issue.

Many thanks to the Internet Archive / GDELT for making this API possible. Data like this would be amazing in any time, but is almost invaluable now.

I need to be up-front about something: I’m somewhat partially at fault for ? being elected. While I did not vote for him, I could not in any good conscience vote for his Democratic rival. I wrote in a ticket that had one Democrat and one Republican on it. The “who” doesn’t matter and my district in Maine went abundantly for ?’s opponent, so there was no real impact of my direct choice but I did actively point out the massive flaws in his opponent. Said flaws were many and I believe we’d be in a different bad place, but not equally as bad of a place now with her. But, that’s in the past and we’ve got a new reality to deal with, now.

This is a (hopefully) brief post about finding a way out of this mess we’re in. It’s far from comprehensive, but there’s honest-to-goodness evil afoot that needs to be met head on.

Brand Damage

You’ll note I’m not using either of their names. Branding is extremely important to both of them, but is the almost singular focus of ?. His name is his hotel brand, company brand and global identifier. Using it continues to add it to the history books and can only help inflate the power of that brand. First and foremost, do not use his name in public posts, articles, papers, etc. “POTUS”, “The President”, “The Commander in Chief”, “?” (chosen to match his skin/hair color, complexion and that comb-over tuft) are all sufficient references since there is date-context with virtually anything we post these days. Don’t help build up his brand. Don’t populate historical repositories with his name. Don’t give him what he wants most of all: attention.

Document and Defend with Data

Speaking of the historical record, we need to be blogging and publishing the actual facts based on data regularly. We also need to save data as there are signs of a deliberate government purge going on. I’m not sure how successful said purge will be in the long run and I suspect that the long-term effects of data purging and corruption by this administration will have lasting unintended consequences.

Join/support @datarefuge to save data & preserve the historical record.

Install the Wayback Machine plugin and take the 2 seconds per site you visit to click it.

Create blog posts, tweets, news articles and papers that counter bad facts with good/accurate/honest ones. Don’t make stuff up (even a little). Validate your posits before publishing. Write said posts in a respectful tone.

Support the Media

When the POTUS’ Chief Strategist says things like “The media should be embarrassed and humiliated and keep its mouth shut and just listen for a while” it’s a deliberate attempt to curtail the Press and eventually there will be more actions to actually suppress Press freedom.

I’m not a liberal (I probably have no convenient definition) and I think the Press gave Obama a free ride during his eight year rule. They are definitely making up for that now, mostly because their very livelihoods are at stake.

The problem with them is that they are continuing to let themselves be manipulated by ?. He’s a master at this manipulation. Creating a story about the size of his hands in a picture delegitimizes you as a purveyor of news, especially when — as you’re watching his hands — he’s separating families, normalizing bigotry and undermining the Constitution. Forget about the hands and even forget about the hotels (for now). There was even a recent story trying to compare email servers (the comparison is very flawed). Stop it.

Encourage reporters to focus on things that actually matter and provide pointers to verifiable data they can use to call out the lack of veracity in ?’s policies. Personal blog posts are fleeting things but an NYT, WSJ (etc) story will live on.

Be Kind

I’ve heard and read some terrible language about rural America from what I can only classify as “liberals” in the week this post was written. Intellectual hubris and actual, visceral disdain for those who don’t think a certain way were two major reasons why ? got elected. The actual reasons he got elected are diverse and very nuanced.

Regardless of political leaning, pick your head up from your glowing rectangles and go out of your way to regularly talk to someone who doesn’t look, dress, think, eat, etc like you. Engage everyone with compassion. Regularly challenge your own beliefs.

There is a wedge that I estimate is about 1/8th of the way into the core of America now. Perpetuating this ideological “us vs them” mindset is only going to fuel the fires that created the conditions we’re in now and drive the wedge in further. The only way out is through compassion.

Remember: all life matters. Your degree, profession, bank balance or faith alignment doesn’t give you the right to believe you are better than anyone else.

FIN (for now)

I’ll probably move most of future opines to a new medium (not uppercase Medium) as you may be getting this drivel when you want recipes or R code (even though there are separate feeds for them).

Dear Leader has made good on his campaign promise to “crack down” on immigration from “dangerous” countries. I wanted to both see one side of the impact of that decree — how many potential immigrants per year might this be impacting — and toss up some code that shows how to free data from PDF documents using the @rOpenSci tabulizer package, authored by @thosjleeper (since knowing how to find, free and validate the veracity of U.S. gov data is kinda ++paramount now).

This is just one view and I encourage others to find, grab and blog other visa-related data and other government data in general.

So, the data is locked up in this PDF document:

As PDF documents go, it’s not horribad since the tables are fairly regular. But I’m not transcribing that by hand, and traditional PDF text-extraction tools on the command line or in R would require writing more code than I have time for right now.

Enter: tabulizer — an R package that wraps tabula Java functions and makes them simple to use. I’m only showing one aspect of it here and you should check out the aforelinked tutorial to see all the features.

First, we need to setup our environment, download the PDF and extract the tables with tabulizer:

library(tabulizer)
library(hrbrmisc)
library(ggalt)
library(stringi)
library(tidyverse)

URL <- "https://travel.state.gov/content/dam/visas/Statistics/AnnualReports/FY2016AnnualReport/FY16AnnualReport-TableIII.pdf"
fil <- sprintf("%s", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

tabs <- tabulizer::extract_tables("FY16AnnualReport-TableIII.pdf")

You should str(tabs) in your R session. It found all our data, but put it into a list with 7 elements. You actually need to peruse this list to see where it mis-aligned columns. In the “old days”, reading this in and cleaning it up would have taken the form of splitting & replacing elements in character vectors. Now, after our inspection, we can exclude rows we don’t want, move columns around and get a nice tidy data frame with very little effort:

bind_rows(
  tbl_df(tabs[[1]][-1,]),
  tbl_df(tabs[[2]][-c(12,13),]),
  tbl_df(tabs[[3]][-c(7, 10:11), -2]),
  tbl_df(tabs[[4]][-21,]),
  tbl_df(tabs[[5]]),
  tbl_df(tabs[[6]][-c(6:7, 30:32),]),
  tbl_df(tabs[[7]][-c(11:12, 25:27),])
) %>%
  setNames(c("foreign_state", "immediate_relatives",  "special_mmigrants",
             "family_preference", "employment_preference", "diversity_immigrants","total")) %>% 
  mutate_each(funs(make_numeric), -foreign_state) %>%
  mutate(foreign_state=trimws(foreign_state)) -> total_visas_2016

I’ve cleaned up PDFs before and that code was a joy to write compared to previous efforts. No use of purrr since I was referencing the list structure in the console as I entered in the various matrix coordinates to edit out.

Finally, we can extract the target “bad” countries and see how many human beings could be impacted this year by referencing immigration stats for last year:

filter(total_visas_2016, foreign_state %in% c("Iran", "Iraq", "Libya", "Somalia", "Sudan", "Syria", "Yemen")) %>%
  gather(preference, value, -foreign_state) %>%
  mutate(preference=stri_replace_all_fixed(preference, "_", " " )) %>%
  mutate(preference=stri_trans_totitle(preference)) -> banned_visas

ggplot(banned_visas, aes(foreign_state, value)) +
  geom_col(width=0.65) +
  scale_y_continuous(expand=c(0,5), label=scales::comma) +
  facet_wrap(~preference, scales="free_y") +
  labs(x="# Visas", y=NULL, title="Immigrant Visas Issued (2016)",
       subtitle="By Foreign State of Chargeability or Place of Birth; Fiscal Year 2016; [Total n=31,804] — Note free Y scales",
       caption="Visa types explanation: https://travel.state.gov/content/visas/en/general/all-visa-categories.html\nSource: https://travel.state.gov/content/visas/en/law-and-policy/statistics/annual-reports/report-of-the-visa-office-2016.html") +
  theme_hrbrmstr_msc(grid="Y") +
  theme(axis.text=element_text(size=12))

~32,000 human beings potentially impacted, many of whom will remain separated from family (“family preference”); plus, the business impact of losing access to skilled labor (“employment preference”).

Go forth and find more US gov data to free (before it disappears)!