Category Archives: Charts & Graphs

A look at airline crashes in R with googlesheets, dplyr & ggplot2

2015-03-30 – 14:32
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R

Over on The DO Loop, @RickWicklin does a nice job [visualizing the causes of airline crashes](http://blogs.sas.com/content/iml/2015/03/30/visualizing-airline-crashes/) in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well. Still, I thought I’d take the opportunity to:

– Give @jennybc’s new [googlesheets](http://github.com/jennybc/googlesheets) a spin
– Show some `dplyr` & `tidyr` data wrangling (never can have too many examples)
– Crank out some `ggplot` zero-based streamgraph-y area charts for the data with some extra `ggplot` wrangling for good measure

I also decided to use the colors in the [original David McCandless/Kashan visualization](http://www.informationisbeautiful.net/visualizations/plane-truth-every-single-commercial-plane-crash-visualized/).

#### Getting The Data

As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site [makes the data available](https://docs.google.com/spreadsheet/ccc?key=0AjOUPqcIwvnjdEx2akx5ZjJXSk9oM1E3dWpqZFJ6Nmc&usp=drive_web#gid=1), so I copied it to my Google Drive and gave her package a go:

library(googlesheets)
library(ggplot2) # we'll need the rest of the libraries later
library(dplyr)   # but just getting them out of the way
library(tidyr)
 
# this will prompt for authentication the first time
my_sheets <- list_sheets()
 
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
 
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" 
 
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
 
# take a quick look
glimpse(flights_csv)
 
## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
 
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
 
# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv

#### A Bit of Cleanup

Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.

The area chart also needs full values for each category combo per year, so we use `expand` from `tidyr` with `left_join` and `mutate` to fill in the gaps.

Finally, we make proper, ordered labels:

flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% 
  mutate(n=ifelse(is.na(n), 0, n)) %>% 
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase, 
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat
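If the `expand` + `left_join` step feels a bit opaque, the same gap-filling idea can be sketched in base R on a toy data frame. Note that this swaps in `expand.grid` and `merge` for the `tidyr`/`dplyr` calls used above, and the counts are invented purely for illustration:

```r
# toy counts: there is no row for 1994 + "weather"
counts <- data.frame(year=c(1993, 1993, 1994),
                     meta=c("human_error", "weather", "human_error"),
                     n=c(2, 1, 3),
                     stringsAsFactors=FALSE)

# every year/meta combination, whether or not it was observed
full <- expand.grid(year=unique(counts$year),
                    meta=unique(counts$meta),
                    stringsAsFactors=FALSE)

# keep all combinations, then turn the missing counts into zeroes
filled <- merge(full, counts, by=c("year", "meta"), all.x=TRUE)
filled$n[is.na(filled$n)] <- 0
filled <- filled[order(filled$year, filled$meta), ]
filled
```

Without the zero rows, a stacked area for a sparse category has nothing to draw for the missing years, which is why the fill-in matters for this kind of chart.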

I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgeable in cybersecurity.
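To see exactly what that lumping does, here is a small base R sketch of the same `grepl`/`ifelse` recoding on a toy vector. The values `take_off` and `final_approach` are made up for the example; the others mirror phase values that show up in the sheet:

```r
# toy phase values (two of them hypothetical, see above)
phase <- tolower(c("Take_off", "initial_climb", "en_route",
                   "approach", "final_approach", "landing"))

# same three recoding rules as in the pipeline above
phase <- ifelse(grepl("take", phase), "takeoff", phase)
phase <- ifelse(grepl("climb", phase), "takeoff", phase)
phase <- ifelse(grepl("ap", phase), "approach", phase)

table(phase)
```

The `"ap"` pattern is deliberately loose, so it will sweep up anything containing those two letters. That is fine for this data, but worth double-checking against `unique(phase)` if the sheet ever changes.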

#### The Final Plot

I’m a big fan of an incremental, additive build idiom for `ggplot` graphics. By using the `gg <- gg + …` style one can move lines around, comment them out, etc without dealing with errant `+` signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows:

– main `ggplot` call with mappings if the graph is short, otherwise add the mappings to the `geom`s
– all `geom_` or `stat_` layers in the order I want them, and using line breaks to logically separate elements (like `aes`) or to wrap long lines for easier readability
– all `scale_` elements in order from axes to line to shape to color to fill to alpha; I'm not as consistent as I'd like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
– `facet` call (if any)
– label setting, always with `labs` unless I really have a need for using `ggtitle`
– base `theme_` call
– all other `theme` elements, one per `gg <- gg +` line

I know that's not everyone's cup of tea, but it's just how I roll `ggplot`-style.

For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don't have Futura Medium.

flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")
 
gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) 
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x)) 
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, 
                             labels=c("Criminal", "Human Error",
                                      "Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg

That ultimately produces:

flights

with the facets ordered by takeoff, flying, approaching landing and actual landing phases. Overall, things have gotten way better, though I haven’t had time to look into the _bump_ between 2005 and 2010 for landing crashes.

As an aside, Boeing has a [really nice PDF](http://www.boeing.com/news/techissues/pdf/statsum.pdf) on some of this data with quite a bit more detail.

Fixing Colors & Proportions in Jerusalem Post Election Graphics

2015-03-20 – 20:13
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R

Vis expert Naomi Robbins did an excellent [critique](http://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/) of the [graphics](http://www.jpost.com/Israel-Elections/Analysis-The-Israel-election-decided-by-one-vote-394229) that went along with an article on the Israeli election in the Jerusalem Post.

Non-uniform and color-blind-unfriendly categorical colors and disproportionate arc sizes are definitely three substantial issues in that series of visualizations. We can rectify all of them with two new packages of mine: [waffle](http://github.com/hrbrmstr/waffle) & [adobecolor](http://github.com/hrbrmstr/adobecolor). The former provides a good alternative to pie charts (no charts at all are a good alternative to pie charts) and the latter makes it possible to share color palettes without passing long strings of hex-encoded colors.

Using [XScope](http://xscopeapp.com/) I encoded a color-blind-friendly palette from [Brian Connelly](http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/) and saved the palette off as an Adobe Color file (`ACO`). I then took the values from the charts and mapped each party to a particular color. Then, I made ordered and proportional waffle charts using the values and aligned colors. The results are below:

# install.packages("waffle")
# devtools::install_github("hrbrmstr/swatches")
 
library(waffle)
library(swatches)
 
national_unity <- c(`Zionist Union (27)`=27,
                    `Likud (27)`=27,
                    `Kulanu (10)`=10,
                    `Shas (7)`=7,
                    `UTJ (6)`=6)
 
right_wing <- c(`Likud (27)`=27,
                `Kulanu (10)`=10,
                `Bayit Yehudi (8)`=8,
                `Shas (7)`=7,
                `UTJ (6)`=6,
                `Yisrael Beytenu (5)`=5)
 
herzog_led <- c(`Zionist Union (27)`=27,
                `Kulanu (10)`=10,
                `Shas (7)`=7,
                `UTJ (6)`=6,
                `Meretz (5)`=5)
 
party_colors <- rev(read_aco("http://rud.is/dl/israel.aco"))
 
zion <- party_colors[1]
likud <- party_colors[2]
kulanu <- party_colors[3]
shas <- party_colors[4]
utj <- party_colors[5]
visrael <- party_colors[6]
meretz <- party_colors[7]
bayit <- party_colors[6]
 
nw <- waffle(national_unity, rows=5,
             colors=c(zion, likud, kulanu, shas, utj),
             title="\nNational unity government") +
  theme(plot.title=element_text(size=12, face="bold"))
 
rw <- waffle(right_wing, rows=5,
             colors=c(likud, kulanu, bayit, shas, utj, visrael),
             title="\nRight Wing", pad=3) +
  theme(plot.title=element_text(size=12, face="bold"))
 
hw <- waffle(herzog_led, rows=5,
             colors=c(zion, kulanu, shas, utj, meretz),
             title="\nHerzog led", pad=5) +
  theme(plot.title=element_text(size=12, face="bold"))
 
iron(nw, rw, hw)

israel

If I knew my audience did not have color processing issues, I’d use a better palette. Regardless, these results are far better than the careless pies presented in the original story. The squares represent the same quantities in each chart and the colors also map to the parties.

Honestly, though, you could get a better idea with simple, un-tweaked base graphics bar charts:

par(mfrow=c(3,1))
barplot(national_unity, col=c(zion, likud, kulanu, shas, utj), main="National unity government")
barplot(right_wing, col=c(likud, kulanu, bayit, shas, utj, visrael), main="Right Wing")
barplot(herzog_led, col=c(zion, kulanu, shas, utj, meretz), main="Herzog led")

isrbar

Please consider your readers and the message you’re trying to convey when developing visualizations, especially when you have as large an audience as the Jerusalem Post.

Making waffle charts in R (with the new ‘waffle’ package)

2015-03-18 – 15:44
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R

NOTE: The waffle package (sans JavaScript-y goodness) is up on CRAN so you can do an install.packages("waffle") and library(waffle) vs the devtools dance.

My disdain for pie charts is fairly well-known, but I do concede that there are times one needs to communicate parts of a whole graphically versus using just words or a table. When that need arises, I’m partial to “waffle charts” or “square pie charts”. @eagereyes did a great post a while ago on them (make sure to read the ‘debate’ between Robert and @hadleywickham in the comments, too), so head there for the low-down on them. Rather than have every waffle chart I make be a one-off creation, I made an R package for them.

There is currently one function in the package, `waffle`, and said function doesn’t mimic all the goodness of these charts as described in Robert’s post (yet). It does, however, do a pretty decent job covering the basics. Let’s take the oft-cited New York Times “debt” graphic:

We can replicate that pretty closely in R. To make it as simple as possible, the waffle function takes a named numeric vector. If no names are specified, or you leave some names out, LETTERS will be used to fill in the gaps. The function takes your data quite literally, so if you give it a vector that sums up to, say, 10,000, then the function will try to create a ggplot object with 10,000 geom_rect elements. Needless to say, that’s a bad idea. So, I suggest using the raw numbers in the vector and passing in a scaled version of the vector to the function. That way, you can play with the values to get the desired look. Here’s the R version of the NYT graphic:

# devtools::install_github("hrbrmstr/waffle")
library(waffle)
savings <- c(`Mortgage ($84,911)`=84911, `Auto and\ntuition loans ($14,414)`=14414, 
             `Home equity loans ($10,062)`=10062, `Credit Cards ($8,565)`=8565)
waffle(savings/392, rows=7, size=0.5, 
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"), 
       title="Average Household Savings Each Year", 
       xlab="1 square == $392")

savings
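If you want a feel for how many squares a given divisor will produce before handing the vector to `waffle`, a little base R arithmetic does the trick. This is only a rough sketch: the shortened names are mine, and I'm assuming simple rounding here, which may not match exactly how `waffle` treats fractional values.

```r
# raw values from the example above (names shortened for the sketch)
savings <- c(Mortgage=84911, `Auto and tuition loans`=14414,
             `Home equity loans`=10062, `Credit Cards`=8565)

# total number of squares for a few candidate "dollars per square" values
divisors <- c(100, 392, 1000)
squares <- sapply(divisors, function(d) sum(round(savings/d)))
names(squares) <- divisors
squares
```

Anything that lands in the low hundreds of squares tends to stay readable; once you're into the thousands, a bigger divisor is probably in order.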

This package evolved from a teensy gist I made earlier this year to help communicate the scope of the Anthem data breach in the US. Since then, a breach at Premera has added to the tally. Here are two views of that data, one with one square equalling one million people and another with one square equalling ten million people (using the blue shade from each company’s logo):

parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)
 
waffle(parts, rows=8, size=1, colors=c("#969696", "#1879bf", "#009bda"), 
       title="Health records breaches as fraction of US Population", 
       xlab="One square == 1m ppl")

320

waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"), 
       title="Health records breaches as fraction of US Population", 
       xlab="One square == 10m ppl")

I’m betting that gets a lot bluer by the end of the year.

The function returns a ggplot object, so fonts, sizes, etc can all be customized and the source is up on github for all to play with and contribute to.

Along with adding support for filling in the chart as shown in the @eagereyes post, there will also be an htmlwidget version coming. Standard drill applies: issues/enhancements to github issues, feedback and your own examples in the comments.

UPDATE

Thanks to a PR by @timelyportfolio, there is now a widget option in the package.

Streamgraph package now supports continuous x axis scale

2015-03-07 – 13:08
Posted in Charts & Graphs, d3, Data Visualization, DataVis, DataViz, R

A post on [StackOverflow](http://stackoverflow.com/questions/28725604/streamgraphs-dataviz-in-r-wont-plot) asked about using a continuous variable for the x-axis (vs dates) in my [streamgraph package](http://github.com/hrbrmstr/streamgraph). While I provided a workaround for the question, it helped me bump up the priority for adding support for continuous x axis scales. With the [DBIR](http://www.verizonenterprise.com/DBIR/) halfway behind me now, I kicked out a new rev of the package/widget that has support for continuous scales.

Using the data from the SO post, you can see there’s not much difference in how you use continuous vs date scales:

library(streamgraph)
 
dat <- read.table(text="week variable value
40     rev1  372.096
40     rev2  506.880
40     rev3 1411.200
40     rev4  198.528
40     rev5   60.800
43     rev1  342.912
43     rev2  501.120
43     rev3  132.352
43     rev4  267.712
43     rev5   82.368
44     rev1  357.504
44     rev2  466.560", header=TRUE)
 
dat %>% 
  streamgraph("variable","value","week", scale="continuous") %>% 
  sg_axis_x(tick_format="d")

Product Revenue

I’ll be adding support for using a categorical variable on the x axis soon. Once that’s done, it’ll be time to do the CRAN dance.

New release (0.7) of metricsgraphics htmlwidget — grids & rollovers

2015-02-01 – 10:34
Posted in Charts & Graphs, d3, Data Visualization, DataVis, DataViz, HTML5, R

I’ve updated my [metricsgraphics](https://github.com/hrbrmstr/metricsgraphics) package to version [0.7](https://github.com/hrbrmstr/metricsgraphics/releases/tag/v0.7). The core [MetricsGraphics](http://metricsgraphicsjs.org) JavaScript library has been updated to version 2.1.0 (from 1.1.0). Two blog-worthy features since releasing version 0.5 are `mjs_grid` (which is a `grid.arrange`-like equivalent for `metricsgraphics` plots) and `mjs_add_mouseover`, which lets you add your own custom rollover text to the plots.

### The Grid

The `grid.arrange` (and `arrangeGrob`) functions from the `gridExtra` package come in handy when combining `ggplot2` charts. I wanted a similar way to arrange independent or linked `metricsgraphics` charts, hence `mjs_grid` was born.

`mjs_grid` uses the tag functions in `htmltools` to arrange `metricsgraphics` plot objects into an HTML `<table>` structure. At present, only uniform tables are supported, but I’m working on making the grid feature more flexible (just like `grid.arrange`). The current functionality is pretty straightforward:

– You build individual `metricsgraphics` plots;
– Optionally combine them in a `list`;
– Pass in the plots/lists into `mjs_grid`;
– Tell `mjs_grid` how many rows & columns are in the grid; and
– Specify the column widths

But, code > words, so here are some examples. To avoid code repetition, note that you’ll need the following packages available to run most of the snippets below:

library(metricsgraphics)
library(htmlwidgets)
library(htmltools)
library(dplyr)

First, we’ll combine a few example plots:

tmp <- data.frame(year=seq(1790, 1970, 10), uspop=as.numeric(uspop))
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line() %>%
  mjs_add_marker(1850, "Something Wonderful") %>%
  mjs_add_baseline(150, "Something Awful") -> mjs1
 
mjs_plot(rnorm(10000), width=300, height=300) %>%
  mjs_histogram(bins=30, bar_margin=1) -> mjs2
 
# note: newer ggplot2 releases ship this dataset in the separate ggplot2movies package
movies <- ggplot2::movies[sample(nrow(ggplot2::movies), 1000), ]
mjs_plot(movies$rating, width=300, height=300) %>% mjs_histogram() -> mjs3
 
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line(area=TRUE) -> mjs4
 
mjs_grid(mjs1, mjs2, mjs3, mjs4, ncol=2, nrow=2)

Since you can pass a `list` as a parameter, you can generate many (similar) plots and then grid-display them without too much code. This one generates 7 random histograms with linked rollovers and displays them in a grid. Note that this example has `mjs_grid` using the same algorithm `grid.arrange` does for auto-computing “optimal” grid size.

lapply(1:7, function(x) {
  mjs_plot(rnorm(10000, mean=x/2, sd=x), width=250, height=250, linked=TRUE) %>%
    mjs_histogram(bar_margin=2) %>%
    mjs_labs(x_label=sprintf("Plot %d", x))
}) -> plots
 
mjs_grid(plots)
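For the curious, here is a rough idea of what that auto-sizing looks like. As far as I can tell, `gridExtra` falls back on `grDevices::n2mfrow` when you give it neither rows nor columns, so treat that as an assumption rather than gospel. The sketch just tabulates what base R suggests for 1 to 8 plots:

```r
# rows/columns that n2mfrow suggests for 1..8 plots
sizes <- t(sapply(1:8, grDevices::n2mfrow))
colnames(sizes) <- c("rows", "cols")
cbind(plots=1:8, sizes)
```

The grid it picks always has at least as many cells as plots, which is why a list of 7 plots ends up with a couple of empty slots.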

And, you can use `do` from `dplyr` to get `ggplot2` `facet_`-like behavior (though, one could argue that interactive graphics should use controls/selectors vs facets). This example uses the `tips` dataset from `reshape2` and creates a list of plots that are then passed to `mjs_grid`:

tips <- reshape2::tips
a <- tips %>%
  mutate(percent=tip/total_bill,
         day=factor(day, levels=c("Thur", "Fri", "Sat", "Sun"), ordered=TRUE)) %>%
  group_by(day) %>%
  do( plot={ day_label <- unique(.$day)
             mjs_plot(., x=total_bill, y=percent, width=275, height=275, left=100) %>%
               mjs_point(color_accessor=sex, color_type="category") %>%
               mjs_labs(x_label=sprintf("Total Bill (%s)", day_label), y_label="Tip %") })
 
mjs_grid(a$plot, ncol=2, nrow=2, widths=c(0.5, 0.5))

### Rollovers

I’ve had a few requests to support the use of different rollovers and this is a first stab at exposing MetricsGraphics’ native functionality to users of the `metricsgraphics` package. The API changed from MG 1.1.0 to 2.2.0, so I’m _kinda_ glad I waited for this. It requires knowledge of javascript, D3 and the use of `{{ID}}` as part of the CSS node selector when targeting the MetricsGraphics SVG element that displays the rollover text. Here is a crude, but illustrative example of how to take advantage of this feature (mouseover the graphics to see the altered text):

set.seed(1492)
dat <- data.frame(date=seq(as.Date("2014-01-01"),
                           as.Date("2014-01-31"),
                           by="1 day"),
                  value=rnorm(n=31, mean=0, sd=2))
 
dat %>%
  mjs_plot(x=date, y=value, width=500, height=300) %>%
  mjs_line() %>%
  mjs_axis_x(xax_format = "date") %>%
  mjs_add_mouseover("function(d, i) {
                $('{{ID}} svg .mg-active-datapoint')
                    .text('custom text : ' + d.date + ' ' + i);
                 }")
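To make the `{{ID}}` idea a bit more concrete, here is a minimal base R sketch of placeholder substitution. This is not the package's actual code: `fill_id` and the element id `htmlwidget-1234` are made up for illustration, and I'm only assuming that the placeholder ends up replaced by a CSS id selector before the JavaScript runs.

```r
# hypothetical helper: swap the {{ID}} placeholder for a CSS id selector
fill_id <- function(js, element_id) {
  gsub("{{ID}}", paste0("#", element_id), js, fixed=TRUE)
}

js <- "function(d, i) { $('{{ID}} svg .mg-active-datapoint').text('row ' + i); }"

# the selector now targets one specific widget on the page
cat(fill_id(js, "htmlwidget-1234"), "\n")
```

Scoping the selector to a single element is what keeps a custom rollover on one chart from clobbering the text on every other chart on the page.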

### Postremo

If you are using `metricsgraphics`, drop a link in the comments here to show others how you’re using it! If you need/want some functionality (I’m hoping to get `xts` support into the 0.8 release) that isn’t already in existing feature requests or something’s broken for you, post a new [issue on github](https://github.com/hrbrmstr/metricsgraphics/issues).

NASA GISS’s Annual Global Temperature Anomaly Trends (dplyr/ggplot version)

2015-01-18 – 09:50
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R, Weather

D Kelly O’Day did a [great post](https://chartsgraphs.wordpress.com/2015/01/16/nasa-gisss-annual-global-temperature-anomaly-trends/) on charting NASA’s Goddard Institute for Space Studies (GISS) temperature anomaly data, but it sticks with base R for data munging & plotting. While there’s absolutely nothing wrong with base R operations, I thought a modern take on the chart using `dplyr`, `magrittr` & `tidyr` for data manipulation and `ggplot2` for formatting would be helpful for the scores of new folk learning R this year (our little language is becoming [all the rage](http://redmonk.com/sogrady/2015/01/14/language-rankings-1-15/), it seems). I also really enjoy working with weather data.

Before further exposition, here’s the result:

forwp

I made liberal use of the “piping” idiom encouraged by `magrittr`, `dplyr` and other new R packages, including the forward assignment operator `->` (which may put some folks off a bit). That also meant using `magrittr`’s aliases for `[` and `[[`, which are more readable in pipes.
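If those aliases are new to you, here is a tiny sketch on a throwaway string (the text and the `body_lines` name are just for the example). `extract2` stands in for `[[` and `extract` for `[`, and the trailing `->` assigns the result of the whole pipe:

```r
library(magrittr)

txt <- "header\nalpha\nbeta"

txt %>%
  strsplit("\n") %>%   # a list of length 1
  extract2(1) %>%      # same as [[1]]: the character vector of lines
  extract(2:3) ->      # same as [2:3]: drop the header line
  body_lines

# the non-piped equivalent gives the same answer
identical(body_lines, strsplit(txt, "\n")[[1]][2:3])
```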

I don’t use `library(tidyr)` since `tidyr`’s `extract` conflicts with `magrittr`’s, but you’ll see a `tidyr::gather` in the code for wide-to-long data shaping.

I chose to use the monthly temperature anomaly data as a base layer in the chart as a contrast to the decade- and annual-mean anomalies. I also marked the hottest monthly and annual-mean anomalies and framed the decades with vertical markers.

There are no hardcoded years or decades anywhere in the `ggplot2` code, so this should be quite reusable as the data source gets updated.
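The trick that keeps the decades un-hardcoded is plain integer division. A quick sketch with a few made-up years shows how the decade and its start/end years fall out of `%/%`:

```r
year <- c(1989, 1990, 1999, 2000)

# integer division by 10 drops the last digit of the year
decade <- year %/% 10

spans <- data.frame(year, decade,
                    start=decade*10,    # first year of the decade
                    end=decade*10+9)    # last year of the decade
spans
```

Whatever years show up in the data, the decade markers and decade means follow along without any edits to the plotting code.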

As I come back to the chart, I think there may be a bit too much “chart junk” on it, but you can tweak it to your own aesthetic preferences (if you do, drop a note in the comments with a link to your creation).

The code is below and in [this gist](https://gist.github.com/hrbrmstr/07ba10fb4c3fe9c9f3a0).

library(httr)
library(magrittr)
library(dplyr)
library(ggplot2)
 
# data retrieval ----------------------------------------------------------
 
# the user agent string was necessary for me; YMMV
 
pg <- GET("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt",
          user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A"))
 
# extract monthly data ----------------------------------------------------
 
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  grep("^[[:digit:]]", ., value=TRUE) -> lines
 
# extract column names ----------------------------------------------------
 
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  extract(8) %>%
  strsplit("\ +") %>%
  extract2(1) -> lines_colnames
 
# make data frame ---------------------------------------------------------
 
data <- read.table(text=lines, stringsAsFactors=FALSE)
colnames(data) <- lines_colnames
 
# transform data frame ----------------------------------------------------
 
data %>%
  tidyr::gather(month, value, Jan, Feb, Mar, Apr, May, Jun,
                       Jul, Aug, Sep, Oct, Nov, Dec) %>%     # wide to long
  mutate(value=value/100) %>%                                # convert to degree Celsius change
  select(year=Year, month, value) %>%                        # only need these fields
  mutate(date=as.Date(sprintf("%d-%d-%d", year,              # make proper dates; match() turns
                              match(month, month.abb), 1)),  # "Jan".."Dec" into 1..12
         decade=year %/% 10,                                 # calc decade
         start=decade*10, end=decade*10+9) %>%               # calc decade start/end
  group_by(decade) %>%
    mutate(decade_mean=mean(value)) %>%                      # calc decade mean
  group_by(year) %>%
    mutate(annum_mean=mean(value)) %>%                       # calc annual mean
  ungroup -> data
 
# start plot --------------------------------------------------------------
 
gg <- ggplot()
 
# decade vertical markers -------------------------------------------------
 
gg <- gg + geom_vline(data=data %>% select(end),
                      aes(xintercept=as.numeric(as.Date(sprintf("%d-12-31", end)))),
                          size=0.5, color="#4575b4", linetype="dotted", alpha=0.5)
 
# monthly data ------------------------------------------------------------
 
gg <- gg + geom_line(data=data, aes(x=date, y=value, color="monthly anomaly"),
                     size=0.35, alpha=0.25)
gg <- gg + geom_point(data=data, aes(x=date, y=value, color="monthly anomaly"),
                      size=0.75, alpha=0.5)
 
# decade mean -------------------------------------------------------------
 
gg <- gg + geom_segment(data=data %>% distinct(decade, decade_mean, start, end),
                        aes(x=as.Date(sprintf("%d-01-01", start)),
                            xend=as.Date(sprintf("%d-12-31", end)),
                            y=decade_mean, yend=decade_mean,
                            color="decade mean anomaly"),
                        linetype="dashed")
 
# annual data -------------------------------------------------------------
 
gg <- gg + geom_line(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
                      size=0.5)
gg <- gg + geom_point(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
                      size=2)
 
# additional annotations --------------------------------------------------
 
# max annual mean anomaly horizontal marker/text
 
gg <- gg + geom_hline(yintercept=max(data$annum_mean),  alpha=0.9,
                      color="#d73027", linetype="dashed", size=0.25)
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", as.integer(mean(range(data$year))))),
                    y=max(data$annum_mean),
                    color="#d73027", alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual mean anomaly %2.1fºC", max(data$annum_mean)))
 
gg <- gg + geom_hline(yintercept=max(data$value),  alpha=0.9,
                      color="#7f7f7f", linetype="dashed", size=0.25)
 
# max annual anomaly horizontal marker/text
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", as.integer(mean(range(data$year))))),
                    y=max(data$value),
                    color="#7f7f7f",  alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual anomaly %2.1fºC", max(data$value)))
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", range(data$year)[2])),
                    y=min(data$value), size=3, hjust=1,
                    label="Data: http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt")
 
# set colors --------------------------------------------------------------
 
gg <- gg + scale_color_manual(name="", values=c("#d73027", "#4575b4", "#7f7f7f"))
 
# set x axis limits -------------------------------------------------------
 
gg <- gg + scale_x_date(expand=c(0, 1),
                        limits=c(as.Date(sprintf("%d-01-01", range(data$year)[1])),
                                 as.Date(sprintf("%d-12-31", range(data$year)[2]))))
 
# add labels --------------------------------------------------------------
 
gg <- gg + labs(x=NULL, y="GLOBAL Temp Anomalies in 1.0ºC",
                title=sprintf("GISS Land and Sea Temperature Annual Anomaly Trend (%d to %d)\n",
                              range(data$year)[1], range(data$year)[2]))
 
# theme/legend tweaks -----------------------------------------------------
 
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position=c(0.9, 0.2))
gg <- gg + theme(legend.key=element_blank())
gg <- gg + theme(legend.background=element_blank())
gg

New R Package: metricsgraphics

2015-01-08 – 16:50
Posted in Charts & Graphs, d3, DataVis, DataViz, R, RStudio

Mozilla released the [MetricsGraphics.js library](http://metricsgraphicsjs.org/) back in November of 2014 ([gh repo](https://github.com/mozilla/metrics-graphics)) and it was greeted with great fanfare. Its primary focus is on crisp, clean layouts for interactive time-series data, but they have support for other chart types as well (though said support is far from comprehensive).

I had been pondering building an R package to help generate these charts when Ramnath Vaidyanathan, Kenton Russell & JJ Allaire came up with the insanely awesome [htmlwidgets](http://www.htmlwidgets.org/) R package, which is the best javascript<->R bridge to-date. Here’s a quick take on how to make a basic line chart before going into some package (and MetricsGraphics) details:

library(metricsgraphics)
 
tmp <- data.frame(year=seq(1790, 1970, 10), uspop=as.numeric(uspop))
 
tmp %>%
  mjs_plot(x=year, y=uspop) %>%
  mjs_line() %>%
  mjs_add_marker(1850, "Something Wonderful") %>%
  mjs_add_baseline(150, "Something Awful")

Example of Basic MetricsGraphics Chart

One of the package goals (which should be evident from the example) is that it had to conform to the new “piping” idiom, made popular through the [magrittr](https://github.com/smbache/magrittr), [ggvis](http://ggvis.rstudio.com/) and [dplyr](http://github.com/dplyr) packages. This made it possible to avoid one function with a ton of parameters and helps break out the chart building into logical steps. While it may not have the flexibility of `ggplot2`, you can do some neat things with MetricsGraphics charts, like use multiple lines:

set.seed(1492)
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4))
 
stocks %>%
  mjs_plot(x=time, y=X, width=500, height=350) %>%
  mjs_line() %>%
  mjs_add_line(Y) %>%
  mjs_add_line(Z) %>%
  mjs_axis_x(xax_format="date") %>%
  mjs_add_legend(c("X", "Y", "Z"))

Stocks X, Y & Z over time

and pretty configurable scatterplots:

library(RColorBrewer)
 
mtcars %>%
  mjs_plot(x=wt, y=mpg, width=500, height=350) %>%
  mjs_point(color_accessor=cyl,
            x_rug=TRUE, y_rug=TRUE,
            size_accessor=carb,
            size_range=c(5, 10),
            color_type="category",
            color_range=brewer.pal(n=11, name="RdBu")[c(1, 5, 11)]) %>%
  mjs_labs(x="Weight of Car", y="Miles per Gallon")

Motor Trend Cars – mpg~wt

The `htmlwidgets` developers go into [great detail](http://www.htmlwidgets.org/develop_intro.html) on how to create a widget, but there are some central points I’ll cover and potentially reiterate.

First, use `htmlwidgets::scaffoldWidget` to kickstart your project. It’ll set up the essentials and free up your time to work on the interface components. You will need to edit the generated `yaml` file to use the minified javascript files for things like jquery or d3 since Chrome will be unhappy if you don’t.

Next, remember that all you’re doing is building an R object with data to be passed into a javascript function/environment. MetricsGraphics made this a bit easier for me since the main graphic configuration is one giant parameter list (take a look at the `metricsgraphics.js` source in github).

Third, if you need to customize the html generation function in the main `packagename_html` file, ensure you pass in `class` to the main `div` element. I was very pleased to discover that you can return a list of HTML elements vs a single one:

metricsgraphics_html <- function(id, style, class, ...) {
  list(tags$div(id = id, class = class, style=style),
       tags$div(id = sprintf("%s-legend", id), class = sprintf("%s-legend", class)))
}

and that may eventually enable support for facet-like functionality without manually creating multiple plots.

Fourth, try to build around the piping idiom. It makes it so much easier to add parameters and manage the data environment.
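As a rough illustration of why the piping idiom helps, the sketch below builds up a plain parameter list one step at a time. The function names (`chart_spec`, `add_line`, `add_marker`) are hypothetical and are not part of `metricsgraphics`. It also assumes R 4.1 or later so it can use the native `|>` pipe without extra packages; `%>%` from `magrittr` would chain these functions the same way.

```r
# Hypothetical helpers, not the metricsgraphics API: each one takes the
# chart spec as its first argument and returns a modified copy, which is
# what makes the functions chainable with a pipe.
chart_spec <- function(data, x, y) {
  list(data = data, x = x, y = y, lines = character(0), markers = list())
}

add_line <- function(spec, column) {
  spec$lines <- c(spec$lines, column)
  spec
}

add_marker <- function(spec, at, label) {
  spec$markers[[length(spec$markers) + 1]] <- list(at = at, label = label)
  spec
}

# Same built-in dataset as the first example in this post.
pop <- data.frame(year = seq(1790, 1970, 10), uspop = as.numeric(uspop))

spec <- chart_spec(pop, x = "year", y = "uspop") |>
  add_line("uspop") |>
  add_marker(1850, "Something Wonderful")

str(spec, max.level = 1)
```

Since every step only touches the list it was handed, adding a new option later means writing one more small function and leaving the existing signatures alone.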

Fifth, use `iframe`s for embedding your visualizations in other documents (like this blog post). It avoids potential namespace collisions and frees you from having to cut/paste HTML from one doc to another.

And, lastly, remember that you can generate your own `elementId` in the event you need to use it with your javascript visualization library (like I had to).
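One way to generate such an id is sketched below. `make_element_id` is a hypothetical helper, and the prefix and suffix length are arbitrary choices for illustration, not what the package itself does.

```r
# Hypothetical helper: build a DOM-safe id from a fixed prefix and a random
# lowercase alphanumeric suffix. Prefix and length are illustrative choices.
make_element_id <- function(prefix = "mjs", n = 10) {
  suffix <- paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
  sprintf("%s-%s", prefix, suffix)
}

id <- make_element_id()
legend_id <- sprintf("%s-legend", id)  # same "<id>-legend" pattern as the div above

id
legend_id
```

The resulting string could then be supplied as the `elementId` argument of `htmlwidgets::createWidget()`, so both the R side and the javascript side know the id up front.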

Currently, `metricsgraphics` is at 0.4.1 and has support for most of the basic chart types along with linking charts (in `Rmd` files). You can install it from the [github repo](https://github.com/hrbrmstr/metricsgraphics) and make sure to file all issues or feature requests there. If you make something with it like @abresler [did](http://asbcllc.com/blog/2015/January/ww2_tanks/), drop a note in the comments!

Now, go forth and wrap some libraries!

Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”)

2014-11-27 – 10:17
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, gis, maps, R

@awpiii and I were trading news about the power outages in Maine & New Hampshire last night and he tweeted the link to the @PSNH [Outage Map](http://www.psnh.com/outage/). As if the Bing Maps tiles weren’t bad enough, the use of a categorical color scale instead of a sequential one^{[[1](http://earthobservatory.nasa.gov/blogs/elegantfigures/2011/05/20/qualitative-vs-sequential-color-scales/)]} caused sufficient angst that I whipped up an alternate version in R between making pies and bread for Thanksgiving (even with power being out for us).

PSNH provides a text version of outages (by town) that ends up being a pretty clean HTML table, and a quick Google search led me to a fairly efficient town-level [shapefile](http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/adjacent-states-town-boundaries.html) for New Hampshire. With these data files at the ready, it was time to make a better map.

**Step 0 – Environment Setup**

So, I lied. There are six steps. “5” just works way better in attention-grabbing list headlines. The first one is setting up the project and loading all the libraries we’ll need. I use RStudio for most of my R coding and their IDE has the concept of a “project” which has its own working directory, workspace, history, and source documents separate from any other RStudio windows you have open. They are a great way to organize your analyses and experiments. I have my own “new project” [script](http://rud.is/dl/newprj.sh) that sets up additional directory structures, configures the `Rproj` file with my preferences and initializes a git repository for the project.

I also use the setup step to load up a ggplot2 map theme I keep in a gist.

library(sp)
library(rgdal)
library(dplyr)
library(rvest)
library(stringi)
library(scales)
library(RColorBrewer)
library(ggplot2)
 
# for theme_map
devtools::source_gist("https://gist.github.com/hrbrmstr/33baa3a79c5cfef0f6df")

**Step 1 – Read in the map**

This is literally a one-liner:

nh <- readOGR("data/nhtowns/NHTOWNS_POLY.shp", "NHTOWNS_POLY")

My projects all have a `data` directory and that’s where I normally store shapefiles. I used `ogrinfo -al NHTOWNS_POLY.shp` at the command line to determine the layer name.

**Step 2 – Read in the outage data**

The `rvest` package is nothing short of amazing. It makes very quick work of web scraping and—despite some newlines in the mix—this qualifies as a one-liner in my book:

outage <- html("http://www.psnh.com/outagelist/") %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[1]]

That bit of code grabs the whole page, extracts all the HTML tables (there is just one in this example), turns them into a list of data frames and then returns the first one.

**Step 3 – Data wrangling**

While this step is definitely not as succinct as the two previous ones, it’s pretty straightforward:

outage <- outage[complete.cases(outage),]
colnames(outage) <- c("id", "total_customers", "without_power", "percentage_out")
outage$id <- stri_trans_totitle(outage$id)
outage$out <- cut(outage$without_power,
    breaks=c(0, 25, 100, 500, 1000, 5000, 10000, 20000, 40000),
    labels=c("1 - 25", "26 - 100", "101 - 500", "501 - 1,000",
             "1,001 - 5,000", "5,001 - 10,000", "10,001 - 20,000",
             "20,001 - 40,000"))

We filter out the `NA`’s (this expunges the “total” row), rename the columns, convert the town name to the same case used in the shapefile (NOTE: I could have just `toupper`ed all the town names, but I really like this function from the `stringi` package) and then use `cut` to make an 8-level factor out of the customer outage count (to match the PSNH map legend).
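To see how `cut` treats the bin edges, here is a small sketch that runs the same breaks and labels over a handful of made-up counts (the numbers are illustrative, not PSNH data):

```r
# Made-up outage counts, chosen to sit on and around the bin edges.
without_power <- c(0, 1, 25, 26, 1000, 1001, 40000)

bins <- cut(without_power,
            breaks = c(0, 25, 100, 500, 1000, 5000, 10000, 20000, 40000),
            labels = c("1 - 25", "26 - 100", "101 - 500", "501 - 1,000",
                       "1,001 - 5,000", "5,001 - 10,000", "10,001 - 20,000",
                       "20,001 - 40,000"))

# Intervals are right-closed by default, so 25 lands in "1 - 25" and
# 26 in "26 - 100"; a count of 0 falls outside (0, 25] and becomes NA.
data.frame(without_power, bins)
nlevels(bins)
```

A town with zero customers out therefore ends up as `NA` in the factor, which is worth keeping in mind if the scraped table ever includes such rows.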

**Step 4 – Preparing the map for plotting with `ggplot`**

This is another one-liner:

nh_map <- fortify(nh, region="NAME")

and makes it possible to use the town names when specifying the polygon regions we want to fill with our spiffy color scheme.
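Because the fill only shows up where an outage `id` matches a map `id`, a quick sanity check before plotting is to look for names that appear on only one side. The sketch below uses small stand-in data frames with made-up rows rather than the real `nh_map` and `outage` objects:

```r
# Stand-ins for the fortified map and the scraped outage table; only the
# `id` columns matter for this check. "Keen" is a deliberate misspelling.
map_ids    <- data.frame(id = c("Concord", "Dover", "Keene", "Nashua"))
outage_toy <- data.frame(id = c("Concord", "Nashua", "Keen"),
                         without_power = c(120, 45, 9))

# Outage rows whose name matches no polygon id would not get filled.
unmatched <- setdiff(outage_toy$id, unique(map_ids$id))
unmatched

# Towns with no outage row simply keep the base fill.
no_outage <- setdiff(unique(map_ids$id), outage_toy$id)
no_outage
```

An empty `unmatched` vector suggests the case conversion in Step 3 lined the names up with the shapefile.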

**Step 5 – Plotting the map**

It is totally possible to do this in one line, but many kittens will lose their lives if you do. I like this way of structuring the creation of a `ggplot` graphic since it makes it very easy to comment out or add various layers or customizations without worrying about stray `+` signs.

gg <- ggplot(data=nh_map, aes(map_id=id))
gg <- gg + geom_map(map=nh_map, aes(x=long, y=lat),
                    color="#0e0e0e", fill="white", size=0.2)
gg <- gg + geom_map(data=outage, map=nh_map, aes(fill=out),
                    color="#0e0e0e", size=0.2)
gg <- gg + scale_fill_brewer(type="seq", palette="RdPu",
                             name="Number of\ncustomer outages\nin each town")
gg <- gg + coord_equal()
gg <- gg + labs(title=sprintf("%s Total PSNH Customers Without Power",
                              comma(sum(outage$without_power))))
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg

That sequence starts the base `ggplot` object creation, sets up the base map colors and then overlays the town outage colors. We use the `RdPu` [Color Brewer](http://colorbrewer2.org/) sequential palette and give the legend the same title as the PSNH counterpart.

The shapefile is already projected (Lambert Conformal Conic—take a look at it with `ogrinfo -al`), so we can get away with using `coord_equal` vs re-projecting it, and we do a tally of outages to stick in the title. My base `theme_map` is designed for Maine, hence the extra `theme` call to move the legend.
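For anyone curious what the title tally produces, this minimal sketch builds the same string from made-up counts. It uses base R’s `format(..., big.mark=",")` as a stand-in for `scales::comma` so it runs without extra packages:

```r
# Made-up counts standing in for outage$without_power.
without_power <- c(12, 340, 11993)

# format(..., big.mark = ",") is a base R stand-in for scales::comma().
title <- sprintf("%s Total PSNH Customers Without Power",
                 format(sum(without_power), big.mark = ","))
title
```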

**The Finished Product**

Crisp SVG polygons, no cluttered Bing Maps tiles and a proper color palette.

![img](http://rud.is/dl/psnh.svg)

All the code is [up on github](https://github.com/hrbrmstr/psnh).