Category Archives: R

>UPDATE: time spent per task factor order was wrong before. now fixed.

I caught this tweet today:

The WSJ folks usually do a great job, but this was either rushed or not completely thought through. There’s no way you’re going to be able to do any real comparisons between the segments across pies and direct pie % labels kinda mean they should have just made a table if they were going to phone it in.

Despite the fact that today is Pi[e] Day, these pies need to go.

If the intent was to primarily allow comparison of hours in-task, leaving some ability to compare the same time category across tasks, then bars are probably the way to go (you could do a parallel coordinates plot, but those look like tangled guitar strings to me, so I’ll stick with bars). Here’s one possible alternative using R & ggplot2. Since I provide the data, please link to your own creations as I’d love to see how others would represent the data.

NOTE: I left direct bar labels off deliberately. My view is that (a) this is designed to be a relative comparison vs precise comparison & (b) it’s survey data and if we’re going to add #’s I’d feel compelled to communicate margin of error, etc. I don’t think that’s necessary.

library(ggplot2)
library(grid)
library(scales)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(tidyr)
 
dat <- read.table(text=
"Task|less_than_one_hour_per_week|one_to_four_hours_per_week|one_to_three_hours_a_day|four_or_more_hours_a_day
Basic exploratory data analysis|11|32|46|12
Data cleaning|19|42|31|7
Machine learning, statistics|34|29|27|10
Creating visualizations|23|41|29|7
Presenting analysis|27|47|20|6
Extract, transform, load|43|32|20|5", sep="|", header=TRUE, stringsAsFactors=FALSE)
 
amount_trans <- c("less_than_one_hour_per_week"="<1 hr/\nwk", 
                  "one_to_four_hours_per_week"="1-4 hrs/\nwk", 
                  "one_to_three_hours_a_day"="1-3 hrs/\nday", 
                  "four_or_more_hours_a_day"="4+ hrs/\nday")
 
dat <- gather(dat, amount, value, -Task)
dat$value <- dat$value / 100
dat$amount <- factor(amount_trans[dat$amount], levels=amount_trans)
 
title_trans <- c("Basic exploratory data analysis"="Basic exploratory\ndata analysis", 
                 "Data cleaning"="Data\ncleaning", 
                 "Machine learning, statistics"="Machine learning,\nstatistics", 
                 "Creating visualizations"="Creating\nvisualizations", 
                 "Presenting analysis"="Presenting\nanalysis", 
                 "Extract, transform, load"="Extract,\ntransform, load")
 
dat$Task <-factor(title_trans[dat$Task], levels=title_trans)
 
gg <- ggplot(dat, aes(x=amount, y=value, fill=amount))
gg <- gg + geom_bar(stat="identity", width=0.75, color="#2b2b2b", size=0.05)
gg <- gg + scale_y_continuous(expand=c(0,0), labels=percent, limits=c(0, 0.5))
gg <- gg + scale_x_discrete(expand=c(0,1))
gg <- gg + scale_fill_manual(name="", values=c("#a6cdd9", "#d2e4ee", "#b7b079", "#efc750"))
gg <- gg + facet_wrap(~Task, scales="free")
gg <- gg + labs(x=NULL, y=NULL, title="Where Does the Time Go?")
gg <- gg + theme_hrbrmstr(grid="Y", axis="x", plot_title_margin=9)
gg <- gg + theme(panel.background=element_rect(fill="#efefef", color=NA))
gg <- gg + theme(strip.background=element_rect(fill="#858585", color=NA))
gg <- gg + theme(strip.text=element_text(family="OpenSans-CondensedBold", size=12, color="white", hjust=0.5))
gg <- gg + theme(panel.margin.x=unit(1, "cm"))
gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid.major.y=element_line(color="#b2b2b2"))
gg <- gg + theme(axis.text.x=element_text(margin=margin(t=-10)))
gg <- gg + theme(axis.text.y=element_text(margin=margin(r=-10)))
 
ggplot_with_subtitle(gg, 
                     "The amount of time spent on various tasks by surveyed non-managers in data-science positions.",
                     fontfamily="OpenSans-CondensedLight", fontsize=12, bottom_margin=16)

[Figure: the resulting faceted bar charts, one panel per task]
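The relabel-and-order step in the code above leans on a named character vector doing double duty as a lookup table and as the level order. Here is a minimal base R sketch of just that trick, using two of the four categories and a made-up `raw` vector (an assumption for illustration, not survey data):

```r
# Named vector: names are the raw column names, values are display labels.
# The order of the entries is the order we want on the x-axis.
amount_trans <- c("less_than_one_hour_per_week" = "<1 hr/wk",
                  "four_or_more_hours_a_day"    = "4+ hrs/day")

# Hypothetical raw values, deliberately out of order
raw <- c("four_or_more_hours_a_day",
         "less_than_one_hour_per_week",
         "four_or_more_hours_a_day")

# Index by name to translate, then pin the level order to the vector's order
amount <- factor(unname(amount_trans[raw]), levels = unname(amount_trans))

levels(amount)      # display labels, in the order defined above
as.integer(amount)  # position of each value within that order
```

Without `levels=`, `factor()` falls back to sorted order, which rarely matches the order you actually want on an axis.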

UPDATE: A newer blog post explaining the new ggplot2 additions: http://rud.is/b/2016/03/16/supreme-annotations/

UPDATE: this capability (+ more) is being rolled into ggplot2-proper. PR will be absorbed into ggplot2 main branch soon. exciting, annotated times ahead!

UPDATE: fontsize issue has been fixed & there’s a Shiny gadget available for interactively making subtitles. More info at the end of the post.

Subtitles aren’t always necessary for plots, but I began to use them enough that I whipped up a function for ggplot2 that does a decent job adding a subtitle to a finished plot object. More than a few folks have tried their hand at this in the past and this is just my incremental contribution until there’s proper support in ggplot2 (someone’s bound to add it via PR at some point).

We’ll nigh fully recreate the following plot from this WaPo article:

[Figure: the original WaPo bar chart]

Here’s a stab at that w/o the subtitle:

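library(ggplot2)
library(scales)
library(grid)   # gpar(), textGrob(), grid.draw() are used by ggplot_with_subtitle() below
library(gtable) # gtable_add_rows(), gtable_add_grob() are used by ggplot_with_subtitle() below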
 
data.frame(
  yrs=c("1789-90", "1849-50", "1909-10", "1965-66", "2015-16"),
  pct=c(0.526, 0.795, 0.713, 0.575, 0.365),
  xtralabs=c("", "Highest:\n", "", "", "Lowest:\n")
) -> hill_lawyers
 
gg <- ggplot(hill_lawyers, aes(yrs, pct))
gg <- gg + geom_bar(stat="identity", width=0.65)
gg <- gg + geom_label(aes(label=sprintf("%s%s", xtralabs, percent(pct))),
                      vjust=-0.4, family=c(rep("FranklinGothic-Book", 4),"FranklinGothic-Heavy"), 
                      lineheight=0.9, size=4, label.size=0)
gg <- gg + scale_x_discrete()
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0.0, 1.0), labels=percent)
gg <- gg + labs(x=NULL, y=NULL, title="Fewer and fewer lawyers on the Hill")
gg <- gg + theme_minimal(base_family="FranklinGothic-Book")
gg <- gg + theme(axis.line=element_line(color="#2b2b2b", size=0.5))
gg <- gg + theme(axis.line.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(family=c(rep("FranklinGothic-Book", 4),
                                                   "FranklinGothic-Heavy")))
gg <- gg + theme(panel.grid.major.x=element_blank())
gg <- gg + theme(panel.grid.major.y=element_line(color="#b2b2b2", size=0.1))
gg <- gg + theme(panel.grid.minor.y=element_blank())
gg <- gg + theme(plot.title=element_text(hjust=0, 
                                         family="FranklinGothic-Heavy", 
                                         margin=margin(b=10)))
gg

[Figure: the recreated bar chart, without a subtitle]

(There are some “tricks” in that plotting code that may be worth spending an extra minute or two to mull over if you didn’t realize some of the function parameters were vectorized, or that you could get a white background with no border for text labels so grid lines don’t get in the way.)

Ideally, a subtitle would be part of the gtable that gets made underneath the covers so it will “travel well” with the plot object itself. The function below makes a textGrob from whatever text we pass into it and does just that; it inserts the new grob into a new table row.

#' Add a subtitle to a ggplot object and draw plot on current graphics device.
#' 
#' @param gg ggplot2 object
#' @param label subtitle label
#' @param fontfamily font family to use. The function doesn't pull any font 
#'        information from \code{gg} so you should consider specifying fonts
#'        for the plot itself and here. Or send me code to make this smarter :-)
#' @param fontsize font size
#' @param hjust,vjust horizontal/vertical justification 
#' @param bottom_margin space between bottom of subtitle and plot (\code{pts})
#' @param newpage draw new (empty) page first?
#' @param vp viewport to draw plot in
#' @param ... parameters passed to \code{gpar} in call to \code{textGrob}
#' @return Invisibly returns the result of \code{\link{ggplot_build}}, which
#'   is a list with components that contain the plot itself, the data,
#'   information about the scales, panels etc.
ggplot_with_subtitle <- function(gg, 
                                 label="", 
                                 fontfamily=NULL,
                                 fontsize=10,
                                 hjust=0, vjust=0, 
                                 bottom_margin=5.5,
                                 newpage=is.null(vp),
                                 vp=NULL,
                                 ...) {
 
  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize=fontsize, ...)
  } else {
    gpr <- gpar(fontfamily=fontfamily, fontsize=fontsize, ...)
  }
 
  # x follows hjust and y follows vjust (the defaults are both 0)
  subtitle <- textGrob(label, x=unit(hjust, "npc"), y=unit(vjust, "npc"), 
                       hjust=hjust, vjust=vjust,
                       gp=gpr)
 
  data <- ggplot_build(gg)
 
  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)
 
  if (newpage) grid.newpage()
 
  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }
 
  invisible(data)
 
}

The roxygen comments should give you an idea of how to work with it, and here it is in action:

subtitle <- "The percentage of Congressional members that are lawyers has been\ncontinuously dropping since the 1960s"
 
ggplot_with_subtitle(gg, subtitle,
                     fontfamily="FranklinGothic-Book",
                     bottom_margin=20, lineheight=0.9)

[Figure: the bar chart with the subtitle added]

It deals with long annotations pretty well, too (I strwrapped the source text for the below at 100 characters). The text is senseless here, but it’s just for show (I had it handy…don’t judge…you’re getting free code :-):

[Figure: the same chart with a long, wrapped subtitle]
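For reference, the wrapping mentioned above can be done in base R along these lines. This is only a sketch: `long_text` is placeholder text, not the annotation in the image.

```r
# Placeholder text standing in for a long annotation
long_text <- paste(rep("Subtitles sometimes need to run long.", 8), collapse = " ")

# strwrap() breaks at word boundaries so each line is shorter than `width`;
# collapsing with "\n" gives one string usable as a multi-line label
wrapped <- paste(strwrap(long_text, width = 100), collapse = "\n")

cat(wrapped, "\n")
```

The resulting string can be passed as the `label` argument in the same way as the hand-wrapped subtitle earlier.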

I think this beats manually re-creating the wheel, even if you only infrequently use subtitles. It definitely beats hand-editing plots and is a bit more elegant and functional than using grid.arrange (et al) to mimic the functionality. It also beats futzing with panel margins and clipping to shoehorn a frankenmashup mess of geom_text or annotation_custom calls.

Kick the tyres, tell me where it breaks and if I can cover enough edge cases (or make it smarter) I’ll add it to my ggalt package.
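As the UPDATE at the top of this post notes, this capability was headed into ggplot2 itself. Assuming you have ggplot2 2.2.0 or later installed (an assumption about your setup), `labs()` accepts a `subtitle` directly. A minimal sketch with placeholder data:

```r
library(ggplot2)

# Small illustrative data frame (values are placeholders)
df <- data.frame(grp = c("a", "b", "c"), val = c(0.2, 0.5, 0.3))

p <- ggplot(df, aes(grp, val)) +
  geom_col(width = 0.65) +
  labs(x = NULL, y = NULL,
       title = "A title",
       subtitle = "A subtitle handled by ggplot2 itself,\nincluding line breaks") +
  theme_minimal() +
  # space between the subtitle and the plot panel
  theme(plot.subtitle = element_text(margin = margin(b = 20)))

p
```

With this approach the subtitle picks up theme settings, so fonts can be set through `plot.subtitle` instead of being passed separately.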

Shiny Subtitle Gadget

Thanks to:

you can now play with an experimental Shiny gadget which you can load by devtools::install_github("hrbrmstr/hrbrmisc") (that’s a temporary home for it, I use this pkg for testing/playing). Just select a ggplot2 object variable name in RStudio and then select “Add subtitle” from the Addins menu and give it a whirl. It looks like this:

[Figure: screenshot of the subtitle gadget in RStudio]

>UPDATE: I rejiggered the function to actually now, y’know, do what it says it should do :-)

A friend, we’ll call him _Alen_, put a call out for some function that could take an image and produce a per-row “histogram” along the edge for the number of filled-in points. That requirement eventually scope-creeped to wanting “histograms” on both the edge and bottom. In essence, there was a desire to be able to compare the number of pixels in each row/line to each other.

Now, you’re all like _”Well, you used ggplot to make the image so…”_ Yeah, not so much. They had done some basic charting in D3. And, it turns out, that it would be handy to compare the data between different images since they had different sets of data they were charting in the same place.

I can’t show you their images as they are part of super seekrit research which will eventually solve world hunger and land a family on Mars. But, I _can_ do a minor re-creation. I made a really simple D3 page that draws random lines in a specified color. Like this:



You can view the source of that page to see the dead-simple D3 that generates that. You’ll see something different in that image every time since it’s javascript and js has no decent built-in random routines (well it does _now_ but the engine functionality in browsers hasn’t caught up yet). So, you won’t be able to 100% replicate the results below but it will work.

First, we need to be able to get the image from the `div` into a bitmap so we can do some pixel counting. We’ll use the new `webshot` package for that.

library(webshot)
 
tmppng1 <- tempfile(fileext=".png")
webshot("http://rud.is/projects/randomlines.html?linecol=f6743d", 
        file=tmppng1,
        selector="#vis")

The image that produced looks like this:

[Figure: the captured image of random lines]

To make the “histograms” on the right and bottom, we’ll use the `raster` capabilities in R to let us treat the data like a matrix so we can easily add columns and rows. I made a function (below) that takes in a `png` file and either a list of colors to look for or a list of colors to exclude and the color you want the “histograms” to be drawn in. This way you can just exclude the background and annotation colors or count specific sets of colors. The counting is fueled by `fastmatch` which makes for super-fast comparisons.

#' Make a "row color density" histogram for an image file
#' 
#' Takes a file path to a png and displays it with a histogram of 
#' pixel density
#' 
#' @param img_file path to png file
#' @param target_colors,ignore_colors colors to count or ignore. Either 
#'        \code{target_colors} or \code{ignore_colors} should be \code{NULL}. Whichever is
#'        not \code{NULL} should be a vector of hex strings (can be huge vector of 
#'        hex strings as it uses \code{fastmatch}). The alpha channel is thrown away 
#'        if any, so you only need to specify \code{#rrggbb} hex strings
#' @param hist_col color to use for the density histogram line
#' @param plot draw the result on the current graphics device?
selective_image_color_histogram <- function(img_file, 
                                            target_colors=NULL,
                                            ignore_colors=c("#ffffff", "#000000"),
                                            hist_col="steelblue",
                                            plot=TRUE) {
 
  require(png)
  require(grid)
  require(raster)
  require(fastmatch)
  require(gridExtra)
 
  "%fmin%" <- function(x, table) { fmatch(x, table, nomatch = 0) > 0 }
  "%!fmin%" <- function(x, table) { !fmatch(x, table, nomatch = 0) > 0 }
 
  if (is.null(target_colors) && is.null(ignore_colors)) {
    stop("Only one of 'target_colors' or 'ignore_colors' can be 'NULL'", call.=FALSE)
  }
 
  # clean up params
  target_colors <- tolower(target_colors)  
  ignore_colors <- tolower(ignore_colors)  
 
  # read in file and convert to usable data structure  
  png_file <- readPNG(img_file)
  img <- substr(tolower(as.matrix(as.raster(png_file))), 1, 7)
 
  if (length(target_colors)==0) {
    tf_img <- matrix(img %!fmin% ignore_colors, nrow=nrow(img), ncol=ncol(img))
  } else {
    tf_img <- matrix(img %fmin% target_colors, nrow=nrow(img), ncol=ncol(img))
  }  
 
  # count the pixels
  wvals <- rowSums(tf_img)
  hvals <- colSums(tf_img)
 
  # add a slight right & bottom margin
  wdth <- max(wvals) + round(0.1*max(wvals))
  hght <- max(hvals) + round(0.1*max(hvals))
 
  # create the "histogram" 
  col_mat <- matrix(rep("#ffffff", wdth*nrow(img)), nrow=nrow(img), ncol=wdth)
  for (row in seq_len(nrow(img))) { 
    # seq_len() leaves rows with a zero count empty (1:0 would fill column 1)
    col_mat[row, seq_len(wvals[row])] <- hist_col
  }
 
  # make bigger image
  new_img <- cbind(img, col_mat)
 
  # create the "histogram"
  row_mat <- matrix(rep("#ffffff", hght*ncol(new_img)), ncol=ncol(new_img), nrow=hght)
  for (col in seq_len(ncol(img))) { 
    # seq_len() leaves columns with a zero count empty (1:0 would fill row 1)
    row_mat[seq_len(hvals[col]), col] <- hist_col
  }
 
  # make a new bigger image and turn it into something we can use with 
  # grid since we can also use it with ggplot this way if we really wanted to
  # and friends don't let friends use base graphics
  rg1 <- rasterGrob(rbind(new_img, row_mat))
 
  # if we want to plot it, now is the time
  if (plot) grid.arrange(rg1)
 
  # return a list with each "histogram"
  return(list(row_hist=wvals, col_hist=hvals))
 
}

After reading in the `png` as a raster, the function counts up all the specified pixels by row and extends the matrix width-wise. Then it does the same by column and extends the matrix height-wise. Finally, it makes a `rasterGrob` (b/c friends don’t let friends use base graphics) and optionally plots the output. It also returns the counts by row and by column. That will let us compare between images.
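To make the counting step concrete, here is a toy base R version on a 3x3 "image". It is a sketch under a few assumptions: the matrix is hand-made and not read from a png, and plain `%in%` stands in for `fastmatch`, which is fine at this size.

```r
# A tiny 3x3 "image" of hex colors standing in for as.raster() output
img <- matrix(c("#ffffff", "#f6743d", "#f6743d",
                "#ffffff", "#ffffff", "#000000",
                "#f6743d", "#ffffff", "#ffffff"),
              nrow = 3, byrow = TRUE)

ignore_colors <- c("#ffffff", "#000000")

# TRUE where a pixel is NOT one of the ignored colors; %in% returns a plain
# vector, so reshape it back to the image's dimensions
keep <- matrix(!(img %in% ignore_colors), nrow = nrow(img), ncol = ncol(img))

row_counts <- rowSums(keep)  # 2 0 1
col_counts <- colSums(keep)  # 1 1 1

# Build the right-hand "bars": one row per image row, filled from the left.
# seq_len() keeps rows with a zero count empty.
bars <- matrix("#ffffff", nrow = nrow(img), ncol = max(row_counts) + 1)
for (r in seq_len(nrow(img))) bars[r, seq_len(row_counts[r])] <- "steelblue"

# Extend the image width-wise, as the function does with cbind()
wider <- cbind(img, bars)
dim(wider)  # 3 6
```

The column "histogram" works the same way with `colSums()` and `rbind()`.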

Now we can do:

a <- selective_image_color_histogram(tmppng1, hist_col="#f6743d", plot=TRUE)

[Figure: the first image with its row and column “histograms”]

And, make a counterpart image for it:

tmppng2 <- tempfile(fileext=".png")
webshot("http://rud.is/projects/randomlines.html?linecol=80b1d4", 
        file=tmppng2,
        selector="#vis")
 
b <- selective_image_color_histogram(tmppng2, hist_col="#80b1d4", plot=TRUE)

[Figure: the second image with its row and column “histograms”]

You can definitely visually compare to see which ones had more “activity” in which row(s) (or column(s)) but why not let R do that for you (you’ll probably need to change the font to something boring like `”Helvetica”`)?

library(ggplot2)
library(dplyr)
 
gg <- ggplot(data_frame(x=1:length(a$row_hist),
                        diff=a$row_hist - b$row_hist,
                        `A vs B`=factor(sign(diff), levels=c(-1, 0, 1), 
                                        labels=c("A", "Neutral", "B"))))
gg <- gg + geom_segment(aes(x=x, xend=x, y=0, yend=diff, color=`A vs B`))
gg <- gg + scale_x_continuous(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + scale_color_manual(values=c("#f6743d", "#2b2b2b", "#80b1d4"))
gg <- gg + labs(x="Row", y="Difference")
gg <- gg + coord_flip()
gg <- gg + ggthemes::theme_tufte(base_family="URW Geometric Semi Bold")
gg <- gg + theme(panel.grid=element_line(color="#2b2b2b", size=0.15))
gg <- gg + theme(panel.grid.major.y=element_blank())
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.y=element_blank())
gg <- gg + theme(axis.title.x=element_text(hjust=0))
gg <- gg + theme(axis.title.y=element_text(hjust=0))
gg

[Figure: per-row difference chart comparing A and B]

This way, you let the _power of data science_ show you the answer. (The column processing chart is an exercise left to the reader).

The code may only be useful to _Alen_, but it was a fun and quick enough exercise that I thought it might be useful to the broader community.

Poke holes or improve upon it at will and tell me how horrible my code is in the comments (I have not looked to see if I subtracted in the right direction as I’m on solo dad duty for a cpl days and #4 is hungry).
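On the "did I subtract in the right direction" question, a tiny check with made-up counts makes the sign convention easy to see. This sketch assumes the label is meant to name the image with more pixels in a given row (my reading, not something the post states), and so it orders the labels differently from the plotting code above.

```r
# Made-up per-row counts for two images
a_rows <- c(5, 2, 7)
b_rows <- c(3, 2, 9)

d <- a_rows - b_rows   # 2 0 -2

# With d = A - B, -1 means B had more pixels in that row and +1 means A did
who <- factor(sign(d), levels = c(-1, 0, 1), labels = c("B", "Neutral", "A"))

data.frame(row = seq_along(d), diff = d, more_in = who)
```

If you adopt this label order in the plot, the `scale_color_manual()` values would need to be reordered to match.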

It’s usually a good thing when my R and infosec worlds collide. Unfortunately, this time it’s a script that R folk running on OS X can use to see if they are using a version of XQuartz that has a nasty vulnerability in the framework it uses to auto-update. If this test comes back with the warning, try to refrain from using XQuartz on insecure networks until the developers fix the issue.

**UPDATE**

Thanks to a gist prodding by @bearloga, here’s a script to scan all your applications for the vulnerability:

library(purrr)
library(dplyr)
library(XML)
 
read_plist <- safely(readKeyValueDB)
safe_compare <- safely(compareVersion)
 
apps <- list.dirs(c("/Applications", "/Applications/Utilities"), recursive=FALSE)
 
# if you have something further than this far down that's bad you're on your own
 
for (i in 1:4) {
  moar_dirs <- grep("app$", apps, value=TRUE, invert=TRUE)
  if (length(moar_dirs) > 0) { apps <- c(apps, list.dirs(moar_dirs, recursive=FALSE)) }
}
apps <- unique(grep("app$", apps, value=TRUE))
 
pb <- txtProgressBar(0, length(apps), style=3)
 
suppressWarnings(map_df(1:length(apps), function(i) {
 
  x <- apps[i]
 
  setTxtProgressBar(pb, i)
 
  is_vuln <- FALSE
  version <- ""
 
  app_name <- sub("\\.app$", "", basename(x))
  app_loc <- sub("^/", "", dirname(x))
 
  to_look <- c(sprintf("%s/Contents/Frameworks/Autoupdate.app/Contents/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Autoupdate.app/Contents/Info.plist", x))
 
  is_there <- map_lgl(c(sprintf("%s/Contents/Frameworks/Sparkle.framework/", x), to_look), file.exists)
 
  has_sparkle <- any(is_there)
 
  to_look <- to_look[which(is_there[-1])]
 
  discard(map_chr(to_look, function(x) {
    read_plist(x)$result$CFBundleShortVersionString %||% NA_character_
  }), is.na) -> vs
 
  if (any(map_dbl(vs, function(v) { safe_compare(v, "1.16.1")$result %||% -1 }) < 0)) {
    is_vuln <- TRUE
    version <- vs[1]
  }
 
  data_frame(app_loc, app_name, has_sparkle, is_vuln, version)
 
})) -> app_scan_results
 
close(pb)
 
select(arrange(filter(app_scan_results, has_sparkle), app_loc, app_name), -has_sparkle)
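The heart of the scan is the version test against `1.16.1`. A standalone sketch of that comparison follows; `is_older()` is a hypothetical helper name for illustration, and `compareVersion()` comes from the `utils` package that ships with R.

```r
# The scan above treats Sparkle versions older than this as vulnerable
fixed_in <- "1.16.1"

# Hypothetical helper: TRUE when the single version string `v` sorts before `fixed`
is_older <- function(v, fixed = fixed_in) {
  compareVersion(v, fixed) < 0
}

compareVersion("1.5", fixed_in)     # -1
compareVersion("1.16.1", fixed_in)  #  0
compareVersion("1.17", fixed_in)    #  1

is_older("1.5")  # TRUE
```

`compareVersion()` compares the dot-separated parts as numbers, which is why `1.5` counts as older than `1.16.1`. A plain string comparison would not give that ordering.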

My wife tricked me into a partial-weekend project to try to get all the primary/caucus results to-date on a map (the whole US). This is challenging since not all states use counties as boundaries for aggregate results. I’m still piecing together some shapefiles for the primary/caucus summation boundaries for a couple remaining states but I didn’t want to let the data source for the election results go without a mention.

The bestest part of the `iframe` below (which can be busted with [this link](/projects/primaryplotting.html)) is the CNN JSON link. You can discover those with Developer Tools on any modern browser. Here’s [the rest](https://gist.github.com/hrbrmstr/25a53e2fcaee2aafa908) of those links (using a gist to add enough layers of redirection to hopefully keep this data free/available).

It’s really well-formatted JSON. As of this post, not all those links completely work (the Maine & PR results weren’t certified yet). Please credit the hard-working folks at CNN whenever/wherever you use this data (if you use it at all). Making a resource like this available is a great service (even if it wasn’t 100% intentional).

The rest of the post shows how to display the voting % per top-candidate in each Texas county. Because Texas uses counties for roll-up aggregation, we can also use `tigris` to get great maps.



NOTE: you won’t need to use this function if you use the [development version](https://github.com/yihui/knitr) of `knitr`


Winston Chang released his [`webshot`](https://github.com/wch/webshot) package to CRAN this past week. The package wraps the immensely useful [`phantomjs`](http://phantomjs.org/) utility and makes it dirt simple to capture whole or partial web pages in R. One beautiful bonus feature of `webshot` is that you can install `phantomjs` with it (getting `phantomjs` to work on Windows is a pain).

You can do many things with the `webshot` package but I hastily drafted this post to put forth a means to generate a static image from an `htmlwidget`. I won’t elaborate much since I included a fully `roxygen`-doc’d function below, but the essence of `capture_widget()` is to pass in an `htmlwidget` object and have it rendered for you to a `png` file and get back either:

– a file system `path` reference (e.g. `/path/to/widget.png`)
– a `markdown` image reference (e.g. `![](file:///path/to/widget.png)`)
– an `html` image reference (e.g. `<img src='file:///path/to/widget.png'/>`), or
– an `inline` base64 encoded HTML image reference (e.g. `<img src='data:image/png;base64,…'/>`)

which you can then use in R markdown documents knitted to PDF (or in any other context).

Take a look at the function, poke the tyres and drop suggestions in the comments. I’ll add this to one of my widgets soon so folks can submit complaints or enhancements via issues & PRs on github.

To use the function, just pipe a sized widget to it and use the output from it.

#' Capture a static (png) version of a widget (e.g. for use in a PDF knitr document)
#'
#' Widgets are generally interactive beasts rendered in an HTML DOM with
#' javascript. That makes them unusable in PDF documents. However, many widgets
#' initial views would work well as static images. This function renders a widget
#' to a file and makes it usable in a number of contexts.
#'
#' What is returned depends on the value of \code{output}. By default (\code{"path"}),
#' the full disk path will be returned. If \code{markdown} is specified, a markdown
#' string will be returned with a \code{file:///...} URL. If \code{html} is
#' specified, an \code{<img src='file:///...'/>} tag will be returned and if
#' \code{inline} is specified, a base64 encoded \code{<img>} tag will be returned
#' (just like you'd see in a self-contained HTML file from \code{knitr}).
#'
#' @importFrom webshot webshot
#' @importFrom base64 img
#' @param wdgt htmlwidget to capture
#' @param output how to return the results of the capture (see Details section)
#' @param height,width it's important for many widgets to be responsive in HTML
#'        documents. PDFs are static beasts and having a fixed image size works
#'        better for them. \code{height} & \code{width} will be passed into the
#'        rendering process, which means you should probably specify similar
#'        values in your widget creation process so the captured \code{<div>}
#'        size matches the size you specify here.
#' @param png_render_path by default, this will be a temporary file location but
#'        a fully qualified filename (with extension) can be specified. It's up to
#'        the caller to free the storage when finished with the resource.
#' @return See Details
#' @export
capture_widget <- function(wdgt,
                           output=c("path", "markdown", "html", "inline"),
                           height, width,
                           png_render_path=tempfile(fileext=".png")) {
 
  wdgt_html_tf <- tempfile(fileext=".html")
 
  # save the widget that was passed in
  htmlwidgets::saveWidget(wdgt, wdgt_html_tf)
 
  # render it to the requested png location
  webshot::webshot(url=sprintf("file://%s", wdgt_html_tf),
                   selector="#htmlwidget_container",
                   file=png_render_path,
                   vwidth=width, vheight=height)
 
  # done with HTML
  unlink(wdgt_html_tf)
 
  switch(match.arg(output, c("path", "markdown", "html", "inline")),
             `path`=png_render_path,
         `markdown`=sprintf("![widget](file://%s)", png_render_path),
             `html`=sprintf("<img src='file://%s'/>", png_render_path),
           `inline`=base64::img(png_render_path))
 
}
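The return-value handling at the end of the function is a common `match.arg()` plus `switch()` pattern. Here it is on its own in base R, as a sketch: `format_image_ref()` and the `/tmp/widget.png` path are hypothetical names for illustration, and the `inline` branch is left out so nothing beyond base R is needed.

```r
# Hypothetical helper that mirrors the return-value logic only
format_image_ref <- function(path, output = c("path", "markdown", "html")) {
  output <- match.arg(output)  # first choice is the default; unknown values error out
  switch(output,
         path     = path,
         markdown = sprintf("![widget](file://%s)", path),
         html     = sprintf("<img src='file://%s'/>", path))
}

format_image_ref("/tmp/widget.png")
format_image_ref("/tmp/widget.png", "markdown")
format_image_ref("/tmp/widget.png", "html")
```

Listing the choices in the function signature documents them and lets `match.arg()` reject typos before any rendering work is done.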

This post comes hot off the heels of the [nigh-feature-complete release of `vegalite`](http://rud.is/b/2016/02/27/create-vega-lite-specs-widgets-with-the-vegalite-package/) (virtually all the components of Vega-Lite are now implemented and just need real-world user testing). I’ve had a few and seen a few questions about “why Vega-Lite”? I _think_ my previous post gave some good answers to “why”. However, Vega-Lite and Vega provide different ways to think about composing statistical graphs than folks seem to be used to (which is part of the “why?”).

Vega-Lite attempts to simplify the way charts are specified (i.e. the way you create a “spec”) in Vega. Vega-proper is rich and complex. You interleave data, operations on data, chart aesthetics and chart element interactions all in one giant JSON file. Vega-Lite 1.0 is definitely more limited than Vega-proper and even when it does add more interactivity (like “brushing”) it will _still_ be more limited, _on purpose_. The reduction in complexity makes it more accessible to both humans and apps, especially apps that don’t grok the Grammar of Graphics (GoG) well.

Even though `ggplot2` lets you mix and match statistical operations on data, I’m going to demonstrate the difference in paradigms/idioms through a single chart. I grabbed the [FRED data on historical WTI crude oil prices](https://research.stlouisfed.org/fred2/series/DCOILWTICO) and will show a chart that displays the minimum monthly price per-decade for a barrel of this cancerous, greed-inducing, global-conflict-generating, atmosphere-destroying black gold.

The data consists of records of daily prices (USD) for this commodity. That means we have to:

1. compute the decade
2. compute the month
3. determine the minimum price by month and decade
4. plot the values
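
Steps 1 to 3 can be sketched in base R on a handful of rows. The prices here are made up for illustration and are not FRED data.

```r
# Made-up daily prices, for illustration only
oil <- data.frame(
  date  = as.Date(c("1986-01-02", "1986-01-03", "1999-01-04", "1999-02-01")),
  value = c(25.5, 26.0, 12.4, 12.3)
)

# 1. decade: first three digits of the year plus a "0"
oil$decade <- sprintf("%s0", substring(format(oil$date), 1, 3))

# 2. month: looked up by number so it does not depend on the locale
oil$month <- factor(month.name[as.integer(format(oil$date, "%m"))],
                    levels = month.name)

# 3. minimum price per decade and month
oil_summary <- aggregate(value ~ decade + month, data = oil, FUN = min)
oil_summary
```

Step 4, the plot, is where the two idioms shown in the rest of the post differ.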

The goal of each idiom is to provide a way to reproduce and communicate the “research”.

Here’s the idiomatic way of doing this with Vega-Lite:

library(vegalite)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3))) -> oil
 
# i created a CSV and moved the file to my server for easier embedding but
# could just have easily embedded the data in the spec.
# remember, you can pipe a vegalite object to embed_spec() to
# get javascript embed code.
 
vegalite() %>%
  add_data("http://rud.is/dl/crude.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("value", "quantitative", aggregate="min") %>%
  encode_color("decade", "nominal") %>%
  timeunit_x("month") %>%
  axis_y(title="", format="$3d") %>%
  axis_x(labelAngle=45, labelAlign="left", 
         title="Min price for Crude Oil (WTI) by month/decade, 1986-present") %>%
  mark_tick(thickness=3) %>%
  legend_color(title="Decade", orient="left")

Here’s the “spec” that creates (wordpress was having issues with it, hence the gist embed):

And, here’s the resulting visualization:

The grouping and aggregation operations operate in-chart-craft-situ. You have to carefully, visually parse either the spec or the R code that creates the spec to really grasp what’s going on. A different way of looking at this is that you embed everything you need to reproduce the transformations and visual encodings in a single, simple JSON file.

Here’s what I believe to be the modern, idiomatic way to do this in R + `ggplot2`:

library(ggplot2)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3)),
         month=factor(format(as.Date(date), "%B"),
                      levels=month.name)) -> oil
 
filter(oil, !is.na(value)) %>%
  group_by(decade, month) %>%
  summarise(value=min(value)) %>%
  ungroup() -> oil_summary
 
ggplot(oil_summary, aes(x=month, y=value, group=decade)) +
  geom_point(aes(color=decade), shape=95, size=8) +
  scale_y_continuous(labels=scales::dollar) +
  scale_color_manual(name="Decade", 
                     values=c("#d42a2f", "#fd7f28", "#339f34", "#d42a2f")) +
  labs(x="Min price for Crude Oil (WTI) by month/decade, 1986-present", y=NULL) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-45, hjust=0)) +
  theme(legend.position="left") +
  theme(legend.key=element_blank()) +
  theme(plot.margin=grid::unit(rep(1, 4), "cm"))

(To stave off some comments, yes I do know you can be Vega-like and compute with arbitrary functions within ggplot2. This was meant to show what I’ve seen to be the modern, recommended idiom.)

You really don’t even need to know R (for the most part) to grok what’s going on. Data is acquired and transformed and we map that into the plot. Yes, you can do the same thing with Vega[-Lite] (i.e. munge the data ahead of time and just churn out marks) but _you’re not encouraged to_. The power of the Vega paradigm is that you _do blend data and operations together_ and they _stay together_.

To make the R+ggplot2 code reproducible the entirety of the script has to be shipped. It’s really the same as shipping the Vega[-Lite] spec, though, since you need to reproduce either the JSON or the R code in environments that support the code (R just happens to support both ggplot2 & Vega-Lite now :-).

I like the latter approach but can appreciate both (otherwise I wouldn’t have written the `vegalite` package). I also think Vega-Lite will catch on more than Vega-proper did (though Vega itself is in use and you use it under the covers whenever you use `ggvis`). If Vega-Lite does nothing more than improve visualization literacy (you _must_ understand core vis terms to use it) and foster the notion of the need for serialization, reproduction and sharing of basic statistical charts, it will have been an amazing success in my book.

[Vega-Lite](http://vega.github.io/vega-lite/) 1.0 was [released this past week](https://medium.com/@uwdata/introducing-vega-lite-438f9215f09e#.yfkl0tp1c). I had been meaning to play with it for a while but I’ve been burned before by working with unstable APIs and was waiting for this to bake to a stable release. Thankfully, there were no new shows in the Fire TV, Apple TV or Netflix queues, enabling some fast-paced nocturnal coding to make an [R `htmlwidget`s interface](https://github.com/hrbrmstr/vegalite) to the Vega-Lite code before the week was out.

What is “Vega” and why “-Lite”? [Vega](http://vega.github.io/) is _“a full declarative visualization grammar, suitable for expressive custom interactive visualization design and programmatic generation.”_ Vega-Lite _“provides a higher-level grammar for visual analysis, comparable to ggplot or Tableau, that generates complete Vega specifications.”_ Vega-Lite compiles to Vega and is more compact and accessible than Vega (IMO). Both are just JSON data files with a particular schema that let you encode the data, encodings and aesthetics for statistical charts.
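As a rough illustration of the “just JSON” point, here is a minimal sketch that assembles a small bar chart spec as a plain R list and serializes it with `jsonlite`. This is not how the `vegalite` package builds specs internally; the `spec` list and its layout are my own assumption, modeled on the spec embedded later in this post.

```r
library(jsonlite)

# Same toy data as the bar chart example in this post
dat <- data.frame(
  a = LETTERS[1:9],
  b = c(28, 55, 43, 91, 81, 53, 19, 87, 52),
  stringsAsFactors = FALSE
)

# Assumption: a hand-built spec for illustration only. It holds the data,
# the mark type, and the field-to-channel encodings.
spec <- list(
  data = list(values = dat),   # data frame rows become an array of objects
  mark = "bar",
  encoding = list(
    x = list(field = "a", type = "ordinal"),
    y = list(field = "b", type = "quantitative")
  )
)

# auto_unbox keeps length-1 vectors as JSON scalars rather than arrays
spec_json <- toJSON(spec, auto_unbox = TRUE, pretty = TRUE)
cat(spec_json)
```

Everything a renderer needs (data, mark, encodings) sits in that one JSON document, which is what makes it easy to pass around.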

Even I don’t like to write JSON by hand and I can’t imagine anyone really wanting to do that. I see Vega and Vega-Lite as amazing ways to serialize statistical charts from ggplot2 or even Tableau (or any Grammar of Graphics-friendly creation tool) and to pass around for use in other programs—like [Voyager](http://vega.github.io/voyager/) or [Pole★](http://vega.github.io/polestar/)—or directly on the web. It is “glued” to D3 (given the way data transformations are encoded and colors are specified) but it’s a pretty weak glue and one could make a Vega or Vega-Lite spec render to anything given some elbow grease.

But, enough words! Here’s how to make a simple Vega-Lite bar chart using `vegalite`:

# devtools::install_github("hrbrmstr/vegalite")
library(vegalite)
 
dat <- jsonlite::fromJSON('[
    {"a": "A","b": 28}, {"a": "B","b": 55}, {"a": "C","b": 43},
    {"a": "D","b": 91}, {"a": "E","b": 81}, {"a": "F","b": 53},
    {"a": "G","b": 19}, {"a": "H","b": 87}, {"a": "I","b": 52}
  ]')
 
vegalite() %>% 
  add_data(dat) %>%
  encode_x("a", "ordinal") %>%
  encode_y("b", "quantitative") %>%
  mark_bar()

Note that the bar graph you see above is _not_ a PNG file or `iframe`d widget. If you `view-source:` you’ll see that I was able to take the Vega-Lite spec generated for that widget (done by piping the widget to `to_spec()`) and just insert it into this post via:

<style media="screen">.wpvegadiv { display:inline-block; margin:auto }</style>
 
<center><div id="vlvis1" class="wpvegadiv"></div></center>
 
<script>
var spec1 = JSON.parse('{"description":"","data":{"values":[{"a":"A","b":28},{"a":"B","b":55},{"a":"C","b":43},{"a":"D","b":91},{"a":"E","b":81},{"a":"F","b":53},{"a":"G","b":19},{"a":"H","b":87},{"a":"I","b":52}]},"mark":"bar","encoding":{"x":{"field":"a","type":"ordinal"},"y":{"field":"b","type":"quantitative"}},"config":[],"embed":{"renderer":"svg","actions":{"export":false,"source":false,"editor":false}}} ');
 
var embedSpec = { "mode": "vega-lite", "spec": spec1, "renderer": spec1.embed.renderer, "actions": spec1.embed.actions };
 
vg.embed("#vlvis1", embedSpec, function(error, result) {});
</script>

I did have all the necessary js libs pre-loaded like you see [in this example](http://vega.github.io/vega-lite/tutorials/getting_started.html#embed). You can use the `embed_spec()` function to generate most of that for you, too.
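To make the embedding pattern concrete, here is a base-R sketch of the kind of wrapper that could produce such a snippet from a spec string. `embed_html()` is a hypothetical helper written for this illustration; it is not the package's `embed_spec()` and its output is a simplified version of the markup shown above (it still assumes the Vega, Vega-Lite and embed js libs are already loaded on the page).

```r
# Hypothetical helper (not part of `vegalite`): wrap a Vega-Lite JSON spec
# in a <div> + <script> pair similar to the one used in this post.
embed_html <- function(spec_json, element_id = "vlvis1") {
  sprintf(
    paste(
      '<div id="%s"></div>',
      '<script>',
      'var embedSpec = { "mode": "vega-lite", "spec": %s };',
      'vg.embed("#%s", embedSpec, function(error, result) {});',
      '</script>',
      sep = "\n"
    ),
    element_id, spec_json, element_id
  )
}

# A tiny spec string to exercise the helper
spec_json <- '{"data":{"values":[{"a":"A","b":28}]},"mark":"bar","encoding":{"x":{"field":"a","type":"ordinal"},"y":{"field":"b","type":"quantitative"}}}'

html <- embed_html(spec_json)
cat(html)
```

The spec is dropped in as a JavaScript object literal rather than going through `JSON.parse()`, which sidesteps quoting problems if the spec contains single quotes.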

This means you can use R to gather, clean, tidy and analyze data. Then, generate a visualization based on that data with `vegalite`. _Then_ generate a lightweight JSON spec from it and easily embed it anywhere without having to rig up a way to get a widget working or ship giant R markdown created files (like [this one](http://rud.is/projects/vegalite01.html) which has many full `vegalite` widgets on it).

One powerful feature of Vega/Vega-Lite is that the data does not have to be embedded in the spec.

Take this streamgraph visualization about unemployment levels across various industries over time:

vegalite() %>%
  cell_size(500, 300) %>%
  add_data("https://vega.github.io/vega-editor/app/data/unemployment-across-industries.json") %>%
  encode_x("date", "temporal") %>%
  encode_y("count", "quantitative", aggregate="sum") %>%
  encode_color("series", "nominal") %>%
  scale_color_nominal(range="category20b") %>%
  timeunit_x("yearmonth") %>%
  scale_x_time(nice="month") %>%
  axis_x(axisWidth=0, format="%Y", labelAngle=0) %>%
  mark_area(interpolate="basis", stack="center")

The URL you see in the R code is placed into the JSON spec. That means whenever that data changes and the visualization is refreshed, you see updated content without going back to R (or js code).
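A small sketch of what that looks like at the spec level, again built by hand with `jsonlite`. It is a trimmed-down approximation (an assumption on my part, leaving out the scale, axis and stacking settings), not the exact spec `vegalite` emits for the streamgraph.

```r
library(jsonlite)

# Assumption: a simplified spec for illustration. The data block carries
# only a URL, so no observations are stored in the spec itself.
spec <- list(
  data = list(
    url = "https://vega.github.io/vega-editor/app/data/unemployment-across-industries.json"
  ),
  mark = "area",
  encoding = list(
    x = list(field = "date", type = "temporal", timeUnit = "yearmonth"),
    y = list(field = "count", type = "quantitative", aggregate = "sum"),
    color = list(field = "series", type = "nominal")
  )
)

spec_json <- toJSON(spec, auto_unbox = TRUE, pretty = TRUE)
cat(spec_json)
```

Compare this with a spec that uses `data.values`: the document stays the same size no matter how large the remote file grows, and the renderer fetches the data each time the chart loads.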

Now, dynamically-created visualizations are great, but what if you want to actually let your viewers have a copy of it? With Vega/Vega-Lite, you don’t need to resort to hackish bookmarklets, just change a configuration option to enable an export link:

vegalite(export=TRUE) %>%
  add_data("https://vega.github.io/vega-editor/app/data/seattle-weather.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("*", "quantitative", aggregate="count") %>%
  encode_color("weather", "nominal") %>%
  scale_color_nominal(domain=c("sun","fog","drizzle","rain","snow"),
                      range=c("#e7ba52","#c7c7c7","#aec7e8","#1f77b4","#9467bd")) %>%
  timeunit_x("month") %>%
  axis_x(title="Month") %>% 
  mark_bar()

(You can style/place that link however/wherever you want. It’s a simple classed HTML element.)

If you choose a `canvas` renderer, the “export” option will be PNG vs SVG.

The package is nearly (~98%) feature complete to the 1.0 Vega-Lite standard. There are some tedious bits from the Vega-Lite spec remaining to be encoded. I’ve transcribed much of the Vega-Lite documentation to R function & package documentation with links back to the Vega-Lite sources if you need more detail.

I’m hoping to be able to code up an “`as_spec()`” function to enable quick conversion of ggplot2-created graphics to Vega-Lite (and support converting a ggplot2 object to a Vega-Lite spec in `to_spec()`) but that won’t be for a while unless someone wants to jump on board and implement a Vega expression creator/parser in R for me :-)

You can work with the current code [on github](https://github.com/hrbrmstr/vegalite) and/or jump on board to help with package development or file an issue with an idea or a bug. Please note that this package is under _heavy development_ and the function interface is very likely to change as I and others work with it and develop more streamlined ways to handle the encodings. Check back to the github repo often to find out what’s different (there will be a `NEWS` file posted soon and maintained as well).