Skip navigation

Category Archives: HTML5

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be sluggish compared to other types of regular expression computations).

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho which builds upon work created by Martin Jericho which is used in at-scale initiatives like the Internet Archive. Yes, I’m trading Java for Python, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain they do a comparable — if not better — job of extracting clean, pertinent text.

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.

This made the rounds on social media last week:

One of the original versions was static and was not nearly as popular, but—as you can see—this one went viral.

Despite the public’s infatuation with circles (I’m lookin’ at you, pie charts), I’m not going to reproduce this polar coordinate visualization in ggplot2. I believe others have already done so (or are doing so) and you can mimic the animation pretty easily with `coord_polar()` and @drob’s enhanced ggplot2 animation tools.

NOTE: If you’re more interested in the stats/science than a spirograph or colorful D3 animation (below), Gavin Simpson (@ucfagls) has an [awesome post](http://www.fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/) with a detailed view of the HadCRUT data set.

## HadCRUT in R

I noticed that [the original data source](http://www.metoffice.gov.uk/hadobs/hadcrut4/), had 12 fields, two of which (columns 11 & 12) are the lower+upper bounds of the 95% confidence interval of the combined effects of all the uncertainties described in the HadCRUT4 error model (measurement and sampling, bias and coverage uncertainties). The spinning vis of doom may be mesmerizing, but it only shows the median. I thought it might be fun to try to make a good looking visualization using the CI as well (you can pick one of the other pairs to try this at home), both in R and then in D3. I chose D3 for the animated version mostly to play with the new 4.0 main branch, but I think it’s possible to do more with dynamic visualizations in D3 than it is with R (and it doesn’t require stop-motion techniques).

The following code:

– reads in the data set (and saves it locally to be nice to their bandwidth bill)
– does some munging to get fields we need
– saves a version out for use with D3
– uses `geom_segment()` + `geom_point()` to do the heavy lifting
– colors the segments by year using the `viridis` palette (the Plasma version)
– labels the plot by decade using facets and some fun facet margin “tricks” to make it look like the x-axis labels are on top

library(readr)    # read_table() / write_csv()
library(dplyr)
library(zoo)      # as.yearmon()
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(viridis)

URL <- "http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.monthly_ns_avg.txt"
fil <- sprintf("data/%s", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

global_temps <- read_table(fil, col_names=FALSE)

global_temps %>%
  select(year_mon=1, median=2, lower=11, upper=12) %>%
  mutate(year_mon=as.Date(as.yearmon(year_mon, format="%Y/%m")),
         year=as.numeric(format(year_mon, "%Y")),
         decade=(year %/% 10) * 10,
         month=format(year_mon, "%b")) %>%
  mutate(month=factor(month, levels=month.abb)) %>%
  filter(year != 2016) -> global_temps

# for D3 vis
write_csv(global_temps, "data/temps.csv")

#+ hadcrut, fig.retina=2, fig.width=12, fig.height=6
gg <- ggplot(global_temps)
gg <- gg + geom_segment(aes(x=year_mon, xend=year_mon, y=lower, yend=upper, color=year), size=0.2)
gg <- gg + geom_point(aes(x=year_mon, y=median), color="white", shape=".", size=0.01)
gg <- gg + scale_x_date(name="Median in white", expand=c(0,0.5))
gg <- gg + scale_y_continuous(name=NULL, breaks=c(0, 1.5, 2),
                              labels=c("0°C", "1.5°C", "2.0°C"), limits=c(-1.6, 2.25))
gg <- gg + scale_color_viridis(option="C")
gg <- gg + facet_wrap(~decade, nrow=1, scales="free_x")
gg <- gg + labs(title="Global Temperature Change (1850-2016)",
                subtitle="Using lower and upper bounds of the 95% confidence interval of the combined effects of all the uncertainties described in the HadCRUT4 error model (measurement and sampling, bias and coverage uncertainties; fields 11 & 12)",
                caption="HadCRUT4 (http://www.metoffice.gov.uk/hadobs/hadcrut4/index.html)")
gg <- gg + theme_hrbrmstr_my(grid="XY")
gg <- gg + theme(panel.background=element_rect(fill="black", color="#2b2b2b", size=0.15))
gg <- gg + theme(panel.margin=margin(0,0,0,0))
gg <- gg + theme(panel.grid.major.y=element_line(color="#b2182b", size=0.25))
gg <- gg + theme(strip.text=element_text(hjust=0.5))
gg <- gg + theme(axis.title.x=element_text(hjust=0, margin=margin(t=-10)))
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(size=12, color="#b2182b"))
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin=margin(10, 10, 10, 10))
gg <- gg + theme(plot.caption=element_text(margin=margin(t=-6)))
gg

hadcrut

(Click image for larger version)

My `theme_hrbrmstr_my()` required the Myriad Pro font, so you’ll need to use one of the other themes in the `hrbrmisc` package or fill in some `theme()` details on your own.

## HadCRUT in D3

While the static visualization is pretty, we can kick it up a bit with some basic animations. Rather than make a multi-file HTML+js+D3+CSS example, this is all self-contained (apart from the data) in a single `index.html` file (some folks asked for the next D3 example to be self-contained).

Some nice new features of D3 4.0 (that I ended up using here):

– easier to use `scale`s
– less verbose `axis` creation
– `viridis` is now a first-class citizen

Mike Bostock has spent much time refining the API for [D3 4.0](https://github.com/d3/d3) and it shows. I’m definitely looking forward to playing with it over the rest of the year.

The vis is below but you can bust the `iframe` via [https://rud.is/projects/hadcrut4/](https://rud.is/projects/hadcrut4/).

I have it setup as “click to view” out of laziness. It’s not hard to make it trigger on `div` scroll visibility, but this way you also get to repeat the visualization animation without it looping incessantly.

If you end up playing with the D3 code, definitely change the width. I had to make it a bit smaller to fit it into the blog theme.

## Fin

You can find the source for both the R & D3 visualizations [on github](https://github.com/hrbrmstr/hadcrut).

This post comes hot off the heels of the [nigh-feature-complete release of `vegalite`](http://rud.is/b/2016/02/27/create-vega-lite-specs-widgets-with-the-vegalite-package/) (virtually all the components of Vega-Lite are now implemented and just need real-world user testing). I’ve had a few and seen a few questions about “why Vega-Lite”? I _think_ my previous post gave some good answers to “why”. However, Vega-Lite and Vega provide different ways to think about composing statistical graphs than folks seem to be used to (which is part of the “why?”).

Vega-Lite attempts to simplify the way charts are specified (i.e. the way you create a “spec”) in Vega. Vega-proper is rich and complex. You interleave data, operations on data, chart aesthetics and chart element interactions all in one giant JSON file. Vega-Lite 1.0 is definitely more limited than Vega-proper and even when it does add more interactivity (like “brushing”) it will _still_ be more limited, _on purpose_. The reduction in complexity makes it more accessible to both humans and apps, especially apps that don’t grok the Grammar of Graphics (GoG) well.

Even though `ggplot2` lets you mix and match statistical operations on data, I’m going to demonstrate the difference in paradigms/idioms through a single chart. I grabbed the [FRED data on historical WTI crude oil prices](https://research.stlouisfed.org/fred2/series/DCOILWTICO) and will show a chart that displays the minimum monthly price per-decade for a barrel of this cancerous, greed-inducing, global-conflict-generating, atmosphere-destroying black gold.

The data consists of records of daily prices (USD) for this commodity. That means we have to:

1. compute the decade
2. compute the month
3. determine the minimum price by month and decade
4. plot the values

The goal of each idiom is to provide a way to reproduce and communicate the “research”.

Here’s the idiomatic way of doing this with Vega-Lite:

library(vegalite)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3))) -> oil
 
# i created a CSV and moved the file to my server for easier embedding but
# could just have easily embedded the data in the spec.
# remember, you can pipe a vegalite object to embed_spec() to
# get javascript embed code.
 
vegalite() %>%
  add_data("http://rud.is/dl/crude.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("value", "quantitative", aggregate="min") %>%
  encode_color("decade", "nominal") %>%
  timeunit_x("month") %>%
  axis_y(title="", format="$3d") %>%
  axis_x(labelAngle=45, labelAlign="left", 
         title="Min price for Crude Oil (WTI) by month/decade, 1986-present") %>%
  mark_tick(thickness=3) %>%
  legend_color(title="Decade", orient="left")

Here’s the “spec” that creates (wordpress was having issues with it, hence the gist embed):

And, here’s the resulting visualization:

The grouping and aggregation operations operate in-chart-craft-situ. You have to carefully, visually parse either the spec or the R code that creates the spec to really grasp what’s going on. A different way of looking at this is that you embed everything you need to reproduce the transformations and visual encodings in a single, simple JSON file.

Here’s what I believe to be the modern, idiomatic way to do this in R + `ggplot2`:

library(ggplot2)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3)),
         month=factor(format(as.Date(date), "%B"),
                      levels=month.name)) -> oil
 
filter(oil, !is.na(value)) %>%
  group_by(decade, month) %>%
  summarise(value=min(value)) %>%
  ungroup() -> oil_summary
 
ggplot(oil_summary, aes(x=month, y=value, group=decade)) +
  geom_point(aes(color=decade), shape=95, size=8) +
  scale_y_continuous(labels=scales::dollar) +
  scale_color_manual(name="Decade", 
                     values=c("#d42a2f", "#fd7f28", "#339f34", "#d42a2f")) +
  labs(x="Min price for Crude Oil (WTI) by month/decade, 1986-present", y=NULL) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-45, hjust=0)) +
  theme(legend.position="left") +
  theme(legend.key=element_blank()) +
  theme(plot.margin=grid::unit(rep(1, 4), "cm"))

(To stave off some comments, yes I do know you can be Vega-like and compute with arbitrary functions within ggplot2. This was meant to show what I’ve seen to be the modern, recommended idiom.)

You really don’t even need to know R (for the most part) to grok what’s going on. Data is acquired and transformed and we map that into the plot. Yes, you can do the same thing with Vega[-Lite] (i.e. munge the data ahead of time and just churn out marks) but _you’re not encouraged to_. The power of the Vega paradigm is that you _do blend data and operations together_ and they _stay together_.

To make the R+ggplot2 code reproducible the entirety of the script has to be shipped. It’s really the same as shipping the Vega[-Lite] spec, though since you need to reproduce either the JSON or the R code in environments that support the code (R just happens to support both ggplot2 & Vega-Lite now :-).

I like the latter approach but can appreciate both (otherwise I wouldn’t have written the `vegalite` package). I also think Vega-Lite will catch on more than Vega-proper did (though Vega itself is in use and you use under the covers whenever you use `ggvis`). If Vega-Lite does nothing more than improve visualization literacy—you _must_ understand core vis terms to use it—and foster the notion for the need for serialization, reproduction and sharing of basic statistical charts, it will have been an amazing success in my book.

[Vega-Lite](http://vega.github.io/vega-lite/) 1.0 was [released this past week](https://medium.com/@uwdata/introducing-vega-lite-438f9215f09e#.yfkl0tp1c). I had been meaning to play with it for a while but I’ve been burned before by working with unstable APIs and was waiting for this to bake to a stable release. Thankfully, there were no new shows in the Fire TV, Apple TV or Netflix queues, enabling some fast-paced nocturnal coding to make an [R `htmlwidget`s interface](https://github.com/hrbrmstr/vegalite) to the Vega-Lite code before the week was out.

What is “Vega” and why “-Lite”? [Vega](http://vega.github.io/) is _”a full declarative visualization grammar, suitable for expressive custom interactive visualization design and programmatic generation.”_ Vega-Lite _”provides a higher-level grammar for visual analysis, comparable to ggplot or Tableau, that generates complete Vega specifications.”_ Vega-Lite compiles to Vega and is more compact and accessible than Vega (IMO). Both are just JSON data files with a particular schema that let you encode the data, encodings and aesthetics for statistical charts.

Even I don’t like to write JSON by hand and I can’t imagine anyone really wanting to do that. I see Vega and Vega-Lite as amazing ways to serialize statistical charts from ggplot2 or even Tableau (or any Grammar of Graphics-friendly creation tool) and to pass around for use in other programs—like [Voyager](http://vega.github.io/voyager/) or [Pole★](http://vega.github.io/polestar/)—or directly on the web. It is “glued” to D3 (given the way data transformations are encoded and colors are specified) but it’s a pretty weak glue and one could make a Vega or Vega-Lite spec render to anything given some elbow grease.

But, enough words! Here’s how to make a simple Vega-Lite bar chart using `vegalite`:

# devtools::install_github("hrbrmstr/vegalite")
library(vegalite)
 
dat <- jsonlite::fromJSON('[
    {"a": "A","b": 28}, {"a": "B","b": 55}, {"a": "C","b": 43},
    {"a": "D","b": 91}, {"a": "E","b": 81}, {"a": "F","b": 53},
    {"a": "G","b": 19}, {"a": "H","b": 87}, {"a": "I","b": 52}
  ]')
 
vegalite() %>% 
  add_data(dat) %>%
  encode_x("a", "ordinal") %>%
  encode_y("b", "quantitative") %>%
  mark_bar()

Note that bar graph you see above is _not_ a PNG file or `iframe`d widget. If you `view-source:` you’ll see that I was able to take the Vega-Lite generated spec for that widget code (done by piping the widget to `to_spec()`) and just insert it into this post via:

<style media="screen">.wpvegadiv { display:inline-block; margin:auto }</style>
 
<center><div id="vlvis1" class="wpvegadiv"></div></center>
 
<script>
var spec1 = JSON.parse('{"description":"","data":{"values":[{"a":"A","b":28},{"a":"B","b":55},{"a":"C","b":43},{"a":"D","b":91},{"a":"E","b":81},{"a":"F","b":53},{"a":"G","b":19},{"a":"H","b":87},{"a":"I","b":52}]},"mark":"bar","encoding":{"x":{"field":"a","type":"ordinal"},"y":{"field":"b","type":"quantitative"}},"config":[],"embed":{"renderer":"svg","actions":{"export":false,"source":false,"editor":false}}} ');
 
var embedSpec = { "mode": "vega-lite", "spec": spec1, "renderer": spec1.embed.renderer, "actions": spec1.embed.actions };
 
vg.embed("#vlvis1", embedSpec, function(error, result) {});
</script>

I did have have all the necessary js libs pre-loaded like you see [in this example](http://vega.github.io/vega-lite/tutorials/getting_started.html#embed). You can use the `embed_spec()` function to generate most of that for you, too.

This means you can use R to gather, clean, tidy and analyze data. Then, generate a visualization based on that data with `vegalite`. _Then_ generate a lightweight JSON spec from it and easily embed it anywhere without having to rig up a way to get a widget working or ship giant R markdown created files (like [this one](http://rud.is/projects/vegalite01.html) which has many full `vegalite` widgets on it).

One powerful feature of Vega/Vega-Lite is that the data does not have to be embedded in the spec.

Take this streamgraph visualization about unemployment levels across various industries over time:

vegalite() %>%
  cell_size(500, 300) %>%
  add_data("https://vega.github.io/vega-editor/app/data/unemployment-across-industries.json") %>%
  encode_x("date", "temporal") %>%
  encode_y("count", "quantitative", aggregate="sum") %>%
  encode_color("series", "nominal") %>%
  scale_color_nominal(range="category20b") %>%
  timeunit_x("yearmonth") %>%
  scale_x_time(nice="month") %>%
  axis_x(axisWidth=0, format="%Y", labelAngle=0) %>%
  mark_area(interpolate="basis", stack="center")

The URL you see in the R code is placed into the JSON spec. That means whenever that data changes and the visualization is refreshed, you see updated content without going back to R (or js code).

Now, dynamically-created visualizations are great, but what if you want to actually let your viewers have a copy of it? With Vega/Vega-Lite, you don’t need to resort to hackish bookmarklets, just change a configuration option to enable an export link:

vegalite(export=TRUE) %>%
  add_data("https://vega.github.io/vega-editor/app/data/seattle-weather.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("*", "quantitative", aggregate="count") %>%
  encode_color("weather", "nominal") %>%
  scale_color_nominal(domain=c("sun","fog","drizzle","rain","snow"),
                      range=c("#e7ba52","#c7c7c7","#aec7e8","#1f77b4","#9467bd")) %>%
  timeunit_x("month") %>%
  axis_x(title="Month") %>% 
  mark_bar()

(You can style/place that link however/wherever you want. It’s a simple classed `

`.)

If you choose a `canvas` renderer, the “export” option will be PNG vs SVG.

The package is nearly (~98%) feature complete to the 1.0 Vega-Lite standard. There are some tedious bits from the Vega-Lite spec remaining to be encoded. I’ve transcribed much of the Vega-Lite documentation to R function & package documentation with links back to the Vega-Lite sources if you need more detail.

I’m hoping to be able to code up an “`as_spec()`” function to enable quick conversion of ggplot2-created graphics to Vega-Lite (and support converting a ggplot2 object to a Vega-Lite spec in `to_spec()`) but that won’t be for a while unless someone wants to jump on board and implement an Vega expression creator/parser in R for me :-)

You can work with the current code [on github](https://github.com/hrbrmstr/vegalite) and/or jump on board to help with package development or file an issue with an idea or a bug. Please note that this package is under _heavy development_ and the function interface is very likely to change as I and others work with it and develop more streamlined ways to handle the encodings. Check back to the github repo often to find out what’s different (there will be a `NEWS` file posted soon and maintained as well).

I’ve been slowly prodding the [metricsgraphics package](https://github.com/hrbrmstr/metricsgraphics/) towards a 1.0.0 release, but there are some rough edges that still need sorting out. One of them is the ability to handle passing in variables for the `x` & `y` accessor values (you _can_ pass in bare and quoted strings). This can now be achieved (in the `dev01` branch) via `mjs_plot_` and in `mjs_plot` proper in the github main branch thanks to a [PR](https://github.com/hrbrmstr/metricsgraphics/pull/31) by [Jonathan Owen](https://github.com/jrowen). If everything stays stable with the PR, I’ll just fold the code into `mjs_plot` for the `0.9.0` CRAN release.

One other pending feature is the ability to turn _basic_ (single `geom_`) `ggplot` objects into `metricsgraphics` plots. Sometimes it’s just easier/nicer to “think” in `ggplot` and it may be the case that one might have coded a quick histogram/scatter/line plot in `ggplot` and want an equally quick interactive version. This can also now be achieved (again, in beta) via `as_mjsplot`. While the previous addition is fairly self-explanatory, this new one needs a few examples. Please note that the package installation is coming from the `dev01` branch:

devtools::install_github("hrbrmstr/metricsgraphics", ref="dev01") 
 
library(metricsgraphics)
library(ggplot2)
 
dat <- data.frame(year=seq(1790, 1970, 10),
                  uspop=as.numeric(uspop))
 
set.seed(5689)
movies <- movies[sample(nrow(movies), 1000), ]
 
gg1 <- ggplot(dat, aes(x=year, y=uspop)) + geom_line()
gg2 <- ggplot(dat, aes(x=year, y=uspop)) + geom_point()
gg3 <- ggplot(movies, aes(rating)) + geom_histogram()
gg4 <- ggplot(movies, aes(rating)) + geom_histogram(binwidth = 0.1)
 
gg1
as_mjsplot(gg1)
 
gg2
as_mjsplot(gg2)
 
gg3
as_mjsplot(gg3)
 
gg4
as_mjsplot(gg4)

Which you can see below:

As you can see, `as_mjsplot` will do it’s best to figure out the bins (if using `geom_histogram`) and also axis labels. Support for converting `geom_vline` and `geom_hline` to markers and baselines (respectively) is a work in progress.

I’ve only done limited testing with some basic single `geom_` constructs, but if there are any bugs with it or feature requests (remember, the MetricsGraphics.js library has a very limited repertoire) please post an issue on GitHub tagging the `dev01` branch.

I set aside a small bit of time to give [rbokeh](https://github.com/bokeh/rbokeh) a try and figured I’d share a small bit of code that shows how to make the “same” chart in both ggplot2 and rbokeh.

#### What is Bokeh/rbokeh?

rbokeh is an [htmlwidget](http://htmlwidgets.org) wrapper for the [Bokeh](http://bokeh.pydata.org/en/latest/) visualization library that has become quite popular in Python circles. Bokeh makes creating interactive charts pretty simple and rbokeh lets you do it all with R syntax.

#### Comparing ggplot & rbokeh

This is not a comprehensive introduction into rbokeh. You can get that [here (officially)](http://hafen.github.io/rbokeh/). I merely wanted to show how a ggplot idiom would map to an rbokeh one for those that may be looking to try out the rbokeh library and are familiar with ggplot. They share a very common “grammar of graphics” base where you have a plot structure and add layers and aesthetics. They each do this a tad bit differently, though, as you’ll see.

First, let’s plot a line graph with some markers in ggplot. The data I’m using is a small time series that we’ll use to plot a cumulative sum of via a line graph. It’s small enough to fit inline:

library(ggplot2)
library(rbokeh)
library(htmlwidgets)
 
structure(list(wk = structure(c(16069, 16237, 16244, 16251, 16279,
16286, 16300, 16307, 16314, 16321, 16328, 16335, 16342, 16349,
16356, 16363, 16377, 16384, 16391, 16398, 16412, 16419, 16426,
16440, 16447, 16454, 16468, 16475, 16496, 16503, 16510, 16517,
16524, 16538, 16552, 16559, 16566, 16573), class = "Date"), n = c(1L,
1L, 1L, 1L, 3L, 1L, 3L, 2L, 4L, 2L, 3L, 2L, 5L, 5L, 1L, 1L, 3L,
3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 7L, 1L, 2L, 6L, 7L, 1L, 1L,
1L, 2L, 2L, 7L, 1L)), .Names = c("wk", "n"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -38L)) -> by_week
 
events <- data.frame(when=as.Date(c("2014-10-09", "2015-03-20", "2015-05-15")),
                     what=c("Thing1", "Thing2", "Thing2"))

The ggplot version is pretty straightforward:

gg <- ggplot()
gg <- gg + geom_vline(data=events,
                      aes(xintercept=as.numeric(when), color=what),
                      linetype="dashed", alpha=1/2)
gg <- gg + geom_text(data=events,
                     aes(x=when, y=1, label=what, color=what),
                     hjust=1.1, size=3)
gg <- gg + geom_line(data=by_week, aes(x=wk, y=cumsum(n)))
gg <- gg + scale_x_date(expand=c(0, 0))
gg <- gg + scale_y_continuous(limits=c(0, 100))
gg <- gg + labs(x=NULL, y="Cumulative Stuff")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

We:

– setup a base ggplot object
– add a layer of marker lines (which are the 3 `events` dates)
– add a layer of text for the marker lines
– add a layer of the actual line – note that we can use `cumsum(n)` vs pre-compute it
– setup scale and other aesthetic properties

That gives us this:

gg

Here’s a similar structure in rbokeh:

figure(width=550, height=375,
       logo="grey", outline_line_alpha=0) %>%
  ly_abline(v=events$when, color=c("red", "blue", "blue"), type=2, alpha=1/4) %>%
  ly_text(x=events$when, y=5, color=c("red", "blue", "blue"),
          text=events$what, align="right", font_size="7pt") %>%
  ly_lines(x=wk, y=cumsum(n), data=by_week) %>%
  y_range(c(0, 100)) %>%
  x_axis(grid=FALSE, label=NULL,
         major_label_text_font_size="8pt",
         axis_line_alpha=0) %>%
  y_axis(grid=FALSE,
         label="Cumulative Stuff",
         minor_tick_line_alpha=0,
         axis_label_text_font_size="10pt",
         axis_line_alpha=0) -> rb
rb

Here, we set the `width` and `height` and configure some of the initial aesthetic options. Note that `outline_line_alpha=0` is the equivalent of `theme(panel.border=element_blank())`.

The markers and text do not work exactly as one might expect since there’s no way to specify a `data` parameter, so we have to set the colors manually. Also, since the target is a browser, points are specified in the same way you would with CSS. However, it’s a pretty easy translation from `geom_[hv]line` to `ly_abline` and `geom_text` to `ly_text`.

The `ly_lines` works pretty much like `geom_line`.

Notice that both ggplot and rbokeh can grok dates for plotting (though we do not need the `as.numeric` hack for rbokeh).

rbokeh will auto-compute bounds like ggplot would but I wanted the scale to go from 0 to 100 in each plot. You can think of `y_range` as `ylim` in ggplot.

To configure the axes, you work directly with `x_axis` and `y_axis` parameters vs `theme` elements in ggplot. To turn off only lines, I set the alpha to 0 in each and did the same with the y axis minor tick marks.

Here’s the rbokeh result:

NOTE: you can save out the widget with:

saveWidget(rb, file="rbokeh001.html")

and I like to use the following `iframe` settings to include the widgets:

<iframe style="max-width=100%" 
        src="rbokeh001.html" 
        sandbox="allow-same-origin allow-scripts" 
        width="100%" 
        height="400" 
        scrolling="no" 
        seamless="seamless" 
        frameBorder="0"></iframe>

#### Wrapping up

Hopefully this helped a bit with translating some ggplot idioms over to rbokeh and developing a working mental model of rbokeh plots. As I play with it a bit more I’ll add some more examples here in the event there are “tricks” that need to be exposed. You can find the code [up on github](https://gist.github.com/hrbrmstr/a3a1be8132530b355bf9) and please feel free to drop a note in the comments if there are better ways of doing what I did or if you have other hints for folks.

School of Data had a recent post how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

List_of_Netflix_Movies_and_TV_Shows___AllFlicks

(i.e. by the controls & style of the controls available) and in the source:

view-source_www_allflicks_net

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

Developer_Tools_-_http___www_allflicks_net_

Here, I just used Chrome’s Developer Tools to look a the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:
Developer_Tools_-_http___www_allflicks_net_ 2

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that’s it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src=\"http://cdn1.nflximg.net/images/9159/12119159.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/6195/20866195.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/3735/2243735.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn0.nflximg.net/images/2668/21112668.jpg\" width=\"55\" alt=\"Thumbnail\">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href=\"http://www.allflicks.net/category/thrillers/\">Thrillers</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

<

pre lang=”rsplus”>glimpse(recs$data)

Observations: 7448

Variables:

$ box_art (chr) “<img src=\”http://cdn1.nflximg.net/images/9159/12…

$ title (chr) “In the Bedroom”, “Wolfy: The Incredible Secret”, …

$ year (chr) “2001”, “2013”, “2006”, “2015”, “1993”, “2013”, “2…

$ rating (chr) “3.3”, “2.5”, “3.6”, “4”, “3.5”, “3.1”, “3.3”, “4….

$ category (chr) “<a href=\”http://www.allflicks.net/category/thril…

$ available (chr) “17 May 2015”, “17 May 2015”, “17 May 2015”, “17 M…

$ cast (chr) “Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T…

$ director (chr) “Todd Field”, “Éric Omond”, “Mucci Fassett, Nico R…

$ id (chr) “60022258”, “70302834”, “70053695”, “80028529”, “8…

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we run again to scrape the site (i.e. when the add more movies to the database) vs a manual cut & paste workflow.

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.

I’ve updated my [metricsgraphics](https://github.com/hrbrmstr/metricsgraphics) package to version [0.7](https://github.com/hrbrmstr/metricsgraphics/releases/tag/v0.7). The core [MetricsGraphics](http://metricsgraphicsjs.org) JavaScript library has been updated to version 2.1.0 (from 1.1.0). Two blog-worthy features since releasing version 0.5 are `mjs_grid` (which is a `grid.arrange`-like equivalent for `metricsgraphics` plots and `mjs_add_rollover` which lets you add your own custom rollover text to the plots.

### The Grid

The `grid.arrange` (and `arrangeGrob`) functions from the `gridExtra` package come in handy when combining `ggplot2` charts. I wanted a similar way to arrange independent or linked `metricsgraphics` charts, hence `mjs_grid` was born.

`mjs_grid` uses the tag functions in `htmltools` to arrange `metricsgraphics` plot objects into an HTML `

` structure. At present, only uniform tables are supported, but I’m working on making the grid feature more flexible (just like `grid.arrange`). The current functionality is pretty straightforward:

– You build individual `metricsgraphics` plots;
– Optionally combine them in a `list`;
– Pass in the plots/lists into `mjs_grid`;
– Tell `mjs_grid` how many rows & columns are in the grid; and
– Specify the column widths

But, code > words, so here are some examples. To avoid code repetition, note that you’ll need the following packages available to run most of the snippets below:

library(metricsgraphics)
library(htmlwidgets)
library(htmltools)
library(dplyr)

First, we’ll combine a few example plots:

tmp <- data.frame(year=seq(1790, 1970, 10), uspop=as.numeric(uspop))
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line() %>%
  mjs_add_marker(1850, "Something Wonderful") %>%
  mjs_add_baseline(150, "Something Awful") -> mjs1
 
mjs_plot(rnorm(10000), width=300, height=300) %>%
  mjs_histogram(bins=30, bar_margin=1) -> mjs2
 
movies <- ggplot2::movies[sample(nrow(ggplot2::movies), 1000), ]
mjs_plot(movies$rating, width=300, height=300) %>% mjs_histogram() -> mjs3
 
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line(area=TRUE) -> mjs4
 
mjs_grid(mjs1, mjs2, mjs3, mjs4, ncol=2, nrow=2)

Since your can pass a `list` as a parameter, you can generate many (similar) plots and then grid-display them without too much code. This one generates 7 random histograms with linked rollovers and displays them in grid. Note that this example has `mjs_grid` using the same algorithm `grid.arrange` does for auto-computing “optimal” grid size.

lapply(1:7, function(x) {
  mjs_plot(rnorm(10000, mean=x/2, sd=x), width=250, height=250, linked=TRUE) %>%
    mjs_histogram(bar_margin=2) %>%
    mjs_labs(x_label=sprintf("Plot %d", x))
}) -> plots
 
mjs_grid(plots)

And, you can use `do` from `dplyr` to get `ggplot2` `facet_`-like behavior (though, one could argue that interactive graphics should use controls/selectors vs facets). This example uses the `tips` dataset from `reshape2` and creates a list of plots that are then passed to `mjs_grid`:

tips <- reshape2::tips
a <- tips %>%
  mutate(percent=tip/total_bill,
         day=factor(day, levels=c("Thur", "Fri", "Sat", "Sun"), ordered=TRUE)) %>%
  group_by(day) %>%
  do( plot={ day_label <- unique(.$day)
             mjs_plot(., x=total_bill, y=percent, width=275, height=275, left=100) %>%
               mjs_point(color_accessor=sex, color_type="category") %>%
               mjs_labs(x_label=sprintf("Total Bill (%s)", day_label), y_label="Tip %") })
 
mjs_grid(a$plot, ncol=2, nrow=2, widths=c(0.5, 0.5))

### Rollovers

I’ve had a few requests to support the use of different rollovers and this is a first stab at exposing MetricsGraphics’ native functionality to users of the `metricsgraphics` package. The API changed from MG 1.1.0 to 2.2.0, so I’m _kinda_ glad I waited for this. It requires knowledge of javascript, D3 and the use of `{{ID}}` as part of the CSS node selector when targeting the MetricsGraphics SVG element that displays the rollover text. Here is a crude, but illustrative example of how to take advantage of this feature (mouseover the graphics to see the altered text):

set.seed(1492)
dat <- data.frame(date=seq(as.Date("2014-01-01"),
                           as.Date("2014-01-31"),
                           by="1 day"),
                  value=rnorm(n=31, mean=0, sd=2))
 
dat %>%
  mjs_plot(x=date, y=value, width=500, height=300) %>%
  mjs_line() %>%
  mjs_axis_x(xax_format = "date") %>%
  mjs_add_mouseover("function(d, i) {
                $('{{ID}} svg .mg-active-datapoint')
                    .text('custom text : ' + d.date + ' ' + i);
                 }")

### Postremo

If you are using `metricsgraphics`, drop a link in the comments here to show others how you’re using it! If you need/want some functionality (I’m hoping to get `xts` support into the 0.8 release) that isn’t already in existing feature requests or something’s broken for you, post a new [issue on github](https://github.com/hrbrmstr/metricsgraphics/issues).