This post comes hot on the heels of the [nigh-feature-complete release of `vegalite`](http://rud.is/b/2016/02/27/create-vega-lite-specs-widgets-with-the-vegalite-package/) (virtually all the components of Vega-Lite are now implemented and just need real-world user testing). I’ve received a few, and seen a few more, questions asking “why Vega-Lite?” I _think_ my previous post gave some good answers to “why”. However, Vega-Lite and Vega offer a different way of thinking about composing statistical graphs than folks seem to be used to (which is part of the “why?”).
Vega-Lite attempts to simplify the way charts are specified (i.e. the way you create a “spec”) in Vega. Vega-proper is rich and complex. You interleave data, operations on data, chart aesthetics and chart element interactions all in one giant JSON file. Vega-Lite 1.0 is definitely more limited than Vega-proper and even when it does add more interactivity (like “brushing”) it will _still_ be more limited, _on purpose_. The reduction in complexity makes it more accessible to both humans and apps, especially apps that don’t grok the Grammar of Graphics (GoG) well.
Even though `ggplot2` lets you mix and match statistical operations on data, I’m going to demonstrate the difference in paradigms/idioms through a single chart. I grabbed the [FRED data on historical WTI crude oil prices](https://research.stlouisfed.org/fred2/series/DCOILWTICO) and will show a chart that displays the minimum monthly price per-decade for a barrel of this cancerous, greed-inducing, global-conflict-generating, atmosphere-destroying black gold.
The data consists of records of daily prices (USD) for this commodity. That means we have to:
1. compute the decade
2. compute the month
3. determine the minimum price by month and decade
4. plot the values
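As a rough sketch of steps 1 through 3, here is a base R version that runs on a tiny data frame. The `oil_toy` name and every price in it are invented purely for illustration (they are not FRED values), and the month is derived from the month number so the example does not depend on the locale:

```r
# Toy stand-in for the daily series; these prices are made up.
oil_toy <- data.frame(
  date  = as.Date(c("1986-01-02", "1986-01-03", "1986-02-03",
                    "1999-01-04", "1999-01-05", "2008-07-03")),
  value = c(25, 26, 15, 12, NA, 145)
)

# 1. compute the decade: first three digits of the year plus a trailing zero
oil_toy$decade <- sprintf("%s0", substring(format(oil_toy$date, "%Y"), 1, 3))

# 2. compute the month as a factor so months sort in calendar order
oil_toy$month <- factor(month.name[as.integer(format(oil_toy$date, "%m"))],
                        levels = month.name)

# 3. minimum price by month and decade; the formula interface drops NA rows
oil_min <- aggregate(value ~ decade + month, data = oil_toy, FUN = min)

print(oil_min)
```

Step 4, the plot itself, is where the two idioms in this post part ways.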
The goal of each idiom is to provide a way to reproduce and communicate the “research”.
Here’s the idiomatic way of doing this with Vega-Lite:
```r
library(vegalite)
library(quantmod)
library(dplyr)

getSymbols("DCOILWTICO", src="FRED")

data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3))) -> oil

# i created a CSV and moved the file to my server for easier embedding but
# could just have easily embedded the data in the spec.
# remember, you can pipe a vegalite object to embed_spec() to
# get javascript embed code.

vegalite() %>%
  add_data("http://rud.is/dl/crude.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("value", "quantitative", aggregate="min") %>%
  encode_color("decade", "nominal") %>%
  timeunit_x("month") %>%
  axis_y(title="", format="$3d") %>%
  axis_x(labelAngle=45, labelAlign="left",
         title="Min price for Crude Oil (WTI) by month/decade, 1986-present") %>%
  mark_tick(thickness=3) %>%
  legend_color(title="Decade", orient="left")
```
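The CSV export mentioned in the code comments isn’t shown. A minimal sketch of what that step might look like follows; the two-row `oil` data frame is a made-up stand-in for the real one, and the temporary file path is an assumption for illustration only:

```r
# Hypothetical stand-in for the `oil` data frame (made-up values).
oil <- data.frame(
  date   = as.Date(c("1986-01-02", "1999-01-04")),
  value  = c(25, 12),
  decade = c("1980", "1990"),
  stringsAsFactors = FALSE
)

# Write the columns the chart encodes; row names would only add noise.
csv_path <- file.path(tempdir(), "crude.csv")
write.csv(oil, csv_path, row.names = FALSE)

# Read the header back to confirm it matches the field names
# referenced by encode_x(), encode_y() and encode_color().
csv_fields <- names(read.csv(csv_path))
print(csv_fields)
```

The point is only that the column names in the file have to line up with the field names used in the encodings.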
Here’s the “spec” that code creates (WordPress was having issues with it, hence the gist embed):
And, here’s the resulting visualization:
The grouping and aggregation operations operate in-chart-craft-situ. You have to carefully, visually parse either the spec or the R code that creates the spec to really grasp what’s going on. A different way of looking at this is that you embed everything you need to reproduce the transformations and visual encodings in a single, simple JSON file.
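To make “everything in a single JSON file” a bit more concrete, here is a hand-written approximation of the core of such a spec, built as a plain R list and serialized with `jsonlite`. This is not the output of the `vegalite` package and not the gist from this post; it only sketches the data source, the mark and the three encodings, and leaves out the axis, legend and mark-thickness settings:

```r
library(jsonlite)

# Approximate Vega-Lite structure: the aggregation ("min") and the
# time unit ("month") live inside the encodings, next to the data source.
spec <- list(
  data = list(url = "http://rud.is/dl/crude.csv"),
  mark = "tick",
  encoding = list(
    x     = list(field = "date",   type = "temporal", timeUnit = "month"),
    y     = list(field = "value",  type = "quantitative", aggregate = "min"),
    color = list(field = "decade", type = "nominal")
  )
)

# auto_unbox keeps length-one vectors as JSON scalars instead of arrays.
spec_json <- toJSON(spec, auto_unbox = TRUE, pretty = TRUE)
cat(spec_json)
```

Notice there is no separate summarise step anywhere: the `aggregate` and `timeUnit` entries are the transformation.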
Here’s what I believe to be the modern, idiomatic way to do this in R + `ggplot2`:
```r
library(ggplot2)
library(quantmod)
library(dplyr)

getSymbols("DCOILWTICO", src="FRED")

data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3)),
         month=factor(format(as.Date(date), "%B"),
                      levels=month.name)) -> oil

filter(oil, !is.na(value)) %>%
  group_by(decade, month) %>%
  summarise(value=min(value)) %>%
  ungroup() -> oil_summary

ggplot(oil_summary, aes(x=month, y=value, group=decade)) +
  geom_point(aes(color=decade), shape=95, size=8) +
  scale_y_continuous(labels=scales::dollar) +
  scale_color_manual(name="Decade",
                     values=c("#d42a2f", "#fd7f28", "#339f34", "#d42a2f")) +
  labs(x="Min price for Crude Oil (WTI) by month/decade, 1986-present", y=NULL) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-45, hjust=0)) +
  theme(legend.position="left") +
  theme(legend.key=element_blank()) +
  theme(plot.margin=grid::unit(rep(1, 4), "cm"))
```
(To stave off some comments, yes I do know you can be Vega-like and compute with arbitrary functions within ggplot2. This was meant to show what I’ve seen to be the modern, recommended idiom.)
You really don’t even need to know R (for the most part) to grok what’s going on. Data is acquired and transformed and we map that into the plot. Yes, you can do the same thing with Vega[-Lite] (i.e. munge the data ahead of time and just churn out marks) but _you’re not encouraged to_. The power of the Vega paradigm is that you _do blend data and operations together_ and they _stay together_.
To make the R+ggplot2 code reproducible, the entirety of the script has to be shipped. It’s really the same as shipping the Vega[-Lite] spec, though, since you need to run either the JSON or the R code in an environment that supports it (R just happens to support both ggplot2 & Vega-Lite now :-).
I like the latter approach but can appreciate both (otherwise I wouldn’t have written the `vegalite` package). I also think Vega-Lite will catch on more than Vega-proper did (though Vega itself is in use, and you use it under the covers whenever you use `ggvis`). If Vega-Lite does nothing more than improve visualization literacy (you _must_ understand core vis terms to use it) and foster the notion that basic statistical charts need to be serialized, reproduced and shared, it will have been an amazing success in my book.