
Category Archives: Data Visualization

New release (0.7) of metricsgraphics htmlwidget — grids & rollovers

2015-02-01 – 10:34
Posted in Charts & Graphs, d3, Data Visualization, DataVis, DataViz, HTML5, R

I’ve updated my [metricsgraphics](https://github.com/hrbrmstr/metricsgraphics) package to version [0.7](https://github.com/hrbrmstr/metricsgraphics/releases/tag/v0.7). The core [MetricsGraphics](http://metricsgraphicsjs.org) JavaScript library has been updated to version 2.1.0 (from 1.1.0). Two blog-worthy features added since version 0.5 are `mjs_grid` (a `grid.arrange`-like equivalent for `metricsgraphics` plots) and `mjs_add_rollover`, which lets you add your own custom rollover text to the plots.

### The Grid

The `grid.arrange` (and `arrangeGrob`) functions from the `gridExtra` package come in handy when combining `ggplot2` charts. I wanted a similar way to arrange independent or linked `metricsgraphics` charts, hence `mjs_grid` was born.

`mjs_grid` uses the tag functions in `htmltools` to arrange `metricsgraphics` plot objects into an HTML `<table>` structure. At present, only uniform tables are supported, but I’m working on making the grid feature more flexible (just like `grid.arrange`). The current functionality is pretty straightforward:

– You build individual `metricsgraphics` plots;
– Optionally combine them in a `list`;
– Pass in the plots/lists into `mjs_grid`;
– Tell `mjs_grid` how many rows & columns are in the grid; and
– Specify the column widths

But, code > words, so here are some examples. To avoid code repetition, note that you’ll need the following packages available to run most of the snippets below:

library(metricsgraphics)
library(htmlwidgets)
library(htmltools)
library(dplyr)

First, we’ll combine a few example plots:

tmp <- data.frame(year=seq(1790, 1970, 10), uspop=as.numeric(uspop))
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line() %>%
  mjs_add_marker(1850, "Something Wonderful") %>%
  mjs_add_baseline(150, "Something Awful") -> mjs1
 
mjs_plot(rnorm(10000), width=300, height=300) %>%
  mjs_histogram(bins=30, bar_margin=1) -> mjs2
 
movies <- ggplot2::movies[sample(nrow(ggplot2::movies), 1000), ]
mjs_plot(movies$rating, width=300, height=300) %>% mjs_histogram() -> mjs3
 
tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line(area=TRUE) -> mjs4
 
mjs_grid(mjs1, mjs2, mjs3, mjs4, ncol=2, nrow=2)

Since you can pass a `list` as a parameter, you can generate many (similar) plots and then grid-display them without too much code. This one generates 7 random histograms with linked rollovers and displays them in a grid. Note that this example has `mjs_grid` using the same algorithm `grid.arrange` does for auto-computing the “optimal” grid size.

lapply(1:7, function(x) {
  mjs_plot(rnorm(10000, mean=x/2, sd=x), width=250, height=250, linked=TRUE) %>%
    mjs_histogram(bar_margin=2) %>%
    mjs_labs(x_label=sprintf("Plot %d", x))
}) -> plots
 
mjs_grid(plots)
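
To get a feel for what an auto-computed grid size looks like, here is a small base R sketch. It assumes the heuristic in question is `grDevices::n2mfrow()`, which `gridExtra` falls back on when neither `nrow` nor `ncol` is given; that `mjs_grid` uses exactly this call is an assumption here, and `grid_dims` is a hypothetical helper name.

```r
# Assumption: the "optimal" layout heuristic is grDevices::n2mfrow(),
# which returns c(rows, cols) for a given number of plots.
grid_dims <- function(n_plots) {
  dims <- grDevices::n2mfrow(n_plots)
  c(nrow = dims[1], ncol = dims[2])
}

for (n in c(2, 4, 7)) {
  d <- grid_dims(n)
  cat(sprintf("%d plots -> %d rows x %d cols\n", n, d[["nrow"]], d[["ncol"]]))
}
```

Under that assumption, the seven histograms above end up in a 3 x 3 layout with two empty cells.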

And, you can use `do` from `dplyr` to get `ggplot2` `facet_`-like behavior (though, one could argue that interactive graphics should use controls/selectors vs facets). This example uses the `tips` dataset from `reshape2` and creates a list of plots that are then passed to `mjs_grid`:

tips <- reshape2::tips
a <- tips %>%
  mutate(percent=tip/total_bill,
         day=factor(day, levels=c("Thur", "Fri", "Sat", "Sun"), ordered=TRUE)) %>%
  group_by(day) %>%
  do( plot={ day_label <- unique(.$day)
             mjs_plot(., x=total_bill, y=percent, width=275, height=275, left=100) %>%
               mjs_point(color_accessor=sex, color_type="category") %>%
               mjs_labs(x_label=sprintf("Total Bill (%s)", day_label), y_label="Tip %") })
 
mjs_grid(a$plot, ncol=2, nrow=2, widths=c(0.5, 0.5))

### Rollovers

I’ve had a few requests to support the use of different rollovers and this is a first stab at exposing MetricsGraphics’ native functionality to users of the `metricsgraphics` package. The API changed from MG 1.1.0 to 2.1.0, so I’m _kinda_ glad I waited for this. It requires knowledge of JavaScript and D3, plus the use of `{{ID}}` as part of the CSS node selector when targeting the MetricsGraphics SVG element that displays the rollover text. Here is a crude, but illustrative example of how to take advantage of this feature (mouseover the graphics to see the altered text):

set.seed(1492)
dat <- data.frame(date=seq(as.Date("2014-01-01"),
                           as.Date("2014-01-31"),
                           by="1 day"),
                  value=rnorm(n=31, mean=0, sd=2))
 
dat %>%
  mjs_plot(x=date, y=value, width=500, height=300) %>%
  mjs_line() %>%
  mjs_axis_x(xax_format = "date") %>%
  mjs_add_mouseover("function(d, i) {
                $('{{ID}} svg .mg-active-datapoint')
                    .text('custom text : ' + d.date + ' ' + i);
                 }")

### Postremo

If you are using `metricsgraphics`, drop a link in the comments here to show others how you’re using it! If you need/want some functionality (I’m hoping to get `xts` support into the 0.8 release) that isn’t already in existing feature requests or something’s broken for you, post a new [issue on github](https://github.com/hrbrmstr/metricsgraphics/issues).

NASA GISS’s Annual Global Temperature Anomaly Trends (dplyr/ggplot version)

2015-01-18 – 09:50
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, R, Weather

D Kelly O’Day did a [great post](https://chartsgraphs.wordpress.com/2015/01/16/nasa-gisss-annual-global-temperature-anomaly-trends/) on charting NASA’s Goddard Institute for Space Studies (GISS) temperature anomaly data, but it sticks with base R for data munging & plotting. While there’s absolutely nothing wrong with base R operations, I thought a modern take on the chart using `dplyr`, `magrittr` & `tidyr` for data manipulation and `ggplot2` for formatting would be helpful for the scores of new folk learning R this year (our little language is becoming [all the rage](http://redmonk.com/sogrady/2015/01/14/language-rankings-1-15/), it seems). I also really enjoy working with weather data.

Before further exposition, here’s the result:

[Image: forwp]

I made liberal use of the “piping” idiom encouraged by `magrittr`, `dplyr` and other new R packages, including the forward assignment operator `->` (which may put some folks off a bit). That also meant using `magrittr`’s aliases for `[` and `[[` (`extract` and `extract2`), which are more readable in pipes.

I don’t use `library(tidyr)` since `tidyr`’s `extract` conflicts with `magrittr`’s, but you’ll see a `tidyr::gather` in the code for wide-to-long data shaping.

I chose to use the monthly temperature anomaly data as a base layer in the chart as a contrast to the decade- and annual-mean anomalies. I also marked the hottest annual- and annual-mean anomalies and framed the decades with vertical markers.

There are no hardcoded years or decades anywhere in the `ggplot2` code, so this should be quite reusable as the data source gets updated.
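
The decade framing falls out of integer division on the year, which is why nothing needs to be hardcoded. A minimal base R sketch of that bucketing, using a few hand-picked years purely for illustration:

```r
years  <- c(1880L, 1889L, 1990L, 2014L)   # illustrative values only
decade <- years %/% 10L                   # integer division drops the last digit
start  <- decade * 10L                    # first year of each decade
end    <- start + 9L                      # last year of each decade
data.frame(year = years, decade, start, end)
```

When the data source gains a new year, the same arithmetic places it in the right decade without any code changes.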

As I come back to the chart, I think there may be a bit too much “chart junk” on it, but you can tweak it to your own aesthetic preferences (if you do, drop a note in the comments with a link to your creation).

The code is below and in [this gist](https://gist.github.com/hrbrmstr/07ba10fb4c3fe9c9f3a0).

library(httr)
library(magrittr)
library(dplyr)
library(ggplot2)
 
# data retrieval ----------------------------------------------------------
 
# the user agent string was necessary for me; YMMV
 
pg <- GET("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt",
          user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A"))
 
# extract monthly data ----------------------------------------------------
 
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  grep("^[[:digit:]]", ., value=TRUE) -> lines
 
# extract column names ----------------------------------------------------
 
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  extract(8) %>%
  strsplit(" +") %>%
  extract2(1) -> lines_colnames
 
# make data frame ---------------------------------------------------------
 
data <- read.table(text=lines, stringsAsFactors=FALSE)
colnames(data) <- lines_colnames
 
# transform data frame ----------------------------------------------------
 
data %>%
  tidyr::gather(month, value, Jan, Feb, Mar, Apr, May, Jun,
                       Jul, Aug, Sep, Oct, Nov, Dec) %>%     # wide to long
  mutate(value=value/100) %>%                                # convert to degree Celsius change
  select(year=Year, month, value) %>%                        # only need these fields
  mutate(date=as.Date(sprintf("%d-%d-%d", year, month, 1)),  # make proper dates
         decade=year %/% 10,                                 # calc decade
         start=decade*10, end=decade*10+9) %>%               # calc decade start/end
  group_by(decade) %>%
    mutate(decade_mean=mean(value)) %>%                      # calc decade mean
  group_by(year) %>%
    mutate(annum_mean=mean(value)) %>%                       # calc annual mean
  ungroup -> data
 
# start plot --------------------------------------------------------------
 
gg <- ggplot()
 
# decade vertical markers -------------------------------------------------
 
gg <- gg + geom_vline(data=data %>% select(end),
                      aes(xintercept=as.numeric(as.Date(sprintf("%d-12-31", end)))),
                          size=0.5, color="#4575b4", linetype="dotted", alpha=0.5)
 
# monthly data ------------------------------------------------------------
 
gg <- gg + geom_line(data=data, aes(x=date, y=value, color="monthly anomaly"),
                     size=0.35, alpha=0.25)
gg <- gg + geom_point(data=data, aes(x=date, y=value, color="monthly anomaly"),
                      size=0.75, alpha=0.5)
 
# decade mean -------------------------------------------------------------
 
gg <- gg + geom_segment(data=data %>% distinct(decade, decade_mean, start, end),
                        aes(x=as.Date(sprintf("%d-01-01", start)),
                            xend=as.Date(sprintf("%d-12-31", end)),
                            y=decade_mean, yend=decade_mean,
                            color="decade mean anomaly"),
                        linetype="dashed")
 
# annual data -------------------------------------------------------------
 
gg <- gg + geom_line(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
                      size=0.5)
gg <- gg + geom_point(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
                      size=2)
 
# additional annotations --------------------------------------------------
 
# max annual mean anomaly horizontal marker/text
 
gg <- gg + geom_hline(yintercept=max(data$annum_mean),  alpha=0.9,
                      color="#d73027", linetype="dashed", size=0.25)
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", mean(range(data$year)))),
                    y=max(data$annum_mean),
                    color="#d73027", alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual mean anomaly %2.1fºC", max(data$annum_mean)))
 
gg <- gg + geom_hline(yintercept=max(data$value),  alpha=0.9,
                      color="#7f7f7f", linetype="dashed", size=0.25)
 
# max annual anomaly horizontal marker/text
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", mean(range(data$year)))),
                    y=max(data$value),
                    color="#7f7f7f",  alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual anomaly %2.1fºC", max(data$value)))
 
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", range(data$year)[2])),
                    y=min(data$value), size=3, hjust=1,
                    label="Data: http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt")
 
# set colors --------------------------------------------------------------
 
gg <- gg + scale_color_manual(name="", values=c("#d73027", "#4575b4", "#7f7f7f"))
 
# set x axis limits -------------------------------------------------------
 
gg <- gg + scale_x_date(expand=c(0, 1),
                        limits=c(as.Date(sprintf("%d-01-01", range(data$year)[1])),
                                 as.Date(sprintf("%d-12-31", range(data$year)[2]))))
 
# add labels --------------------------------------------------------------
 
gg <- gg + labs(x=NULL, y="GLOBAL Temp Anomalies in 1.0ºC",
                title=sprintf("GISS Land and Sea Temperature Annual Anomaly Trend (%d to %d)\n",
                              range(data$year)[1], range(data$year)[2]))
 
# theme/legend tweaks -----------------------------------------------------
 
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position=c(0.9, 0.2))
gg <- gg + theme(legend.key=element_blank())
gg <- gg + theme(legend.background=element_blank())
gg

Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”)

2014-11-27 – 10:17
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, gis, maps, R

@awpiii and I were trading news about the power outages in Maine & New Hampshire last night and he tweeted the link to the @PSNH [Outage Map](http://www.psnh.com/outage/). As if the Bing Maps tiles weren’t bad enough, the use of a categorical color scale instead of a sequential one [[1](http://earthobservatory.nasa.gov/blogs/elegantfigures/2011/05/20/qualitative-vs-sequential-color-scales/)] caused sufficient angst that I whipped up an alternate version in R between making pies and bread for Thanksgiving (even with power being out for us).

PSNH provides a text version of outages (by town) that ends up being a pretty clean HTML table, and a quick Google search led me to a fairly efficient town-level [shapefile](http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/adjacent-states-town-boundaries.html) for New Hampshire. With these data files at the ready, it was time to make a better map.

**Step 0 – Environment Setup**

So, I lied. There are six steps. “5” just works way better in attention-grabbing list headlines. The first one is setting up the project and loading all the libraries we’ll need. I use RStudio for most of my R coding and their IDE has the concept of a “project” which has its own working directory, workspace, history, and source documents separate from any other RStudio windows you have open. They are a great way to organize your analyses and experiments. I have my own “new project” [script](http://rud.is/dl/newprj.sh) that sets up additional directory structures, configures the `Rproj` file with my preferences and initializes a git repository for the project.

I also use the setup step to load up a ggplot2 map theme I keep in a gist.

library(sp)
library(rgdal)
library(dplyr)
library(rvest)
library(stringi)
library(scales)
library(RColorBrewer)
library(ggplot2)
 
# for theme_map
devtools::source_gist("https://gist.github.com/hrbrmstr/33baa3a79c5cfef0f6df")

**Step 1 – Read in the map**

This is literally a one-liner:

nh <- readOGR("data/nhtowns/NHTOWNS_POLY.shp", "NHTOWNS_POLY")

My projects all have a `data` directory and that’s where I normally store shapefiles. I used `ogrinfo -al NHTOWNS_POLY.shp` at the command line to determine the layer name.

**Step 2 – Read in the outage data**

The `rvest` package is nothing short of amazing. It makes very quick work of web scraping and—despite some newlines in the mix—this qualifies as a one-liner in my book:

outage <- html("http://www.psnh.com/outagelist/") %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[1]]

That bit of code grabs the whole page, extracts all the HTML tables (there is just one in this example), turns it into a list of data frames and then returns the first one.

**Step 3 – Data wrangling**

While this step is definitely not as succinct as the two previous ones, it’s pretty straightforward:

outage <- outage[complete.cases(outage),]
colnames(outage) <- c("id", "total_customers", "without_power", "percentage_out")
outage$id <- stri_trans_totitle(outage$id)
outage$out <- cut(outage$without_power,
    breaks=c(0, 25, 100, 500, 1000, 5000, 10000, 20000, 40000),
    labels=c("1 - 25", "26 - 100", "101 - 500", "501 - 1,000",
             "1,001 - 5,000", "5,001- 10,000", "10,001 - 20,000",
             "20,001 - 40,000"))

We filter out the `NA`’s (this expunges the “total” row), rename the columns, convert the town name to the same case used in the shapefile (NOTE: I could have just `toupper`ed all the town names, but I really like this function from the `stringi` package) and then use `cut` to make an 8-level factor out of the customer outage count (to match the PSNH map legend).
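
One detail of `cut` worth keeping in mind: intervals are right-closed by default, so a town with exactly 25 customers out lands in the first bin, while a count of 0 falls outside `(0, 25]` and becomes `NA`. A small sketch with made-up counts and a shortened set of breaks:

```r
counts <- c(0, 10, 25, 26, 750)            # made-up outage counts
bins <- cut(counts,
            breaks = c(0, 25, 100, 500, 1000),
            labels = c("1 - 25", "26 - 100", "101 - 500", "501 - 1,000"))
data.frame(counts, bins)
```

If zero-outage towns should get their own color instead of dropping out as `NA`, adding `include.lowest=TRUE` or an extra break below zero is one option.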

**Step 4 – Preparing the map for plotting with `ggplot`**

This is another one-liner:

nh_map <- fortify(nh, region="NAME")

and makes it possible to use the town names when specifying the polygon regions we want to fill with our spiffy color scheme.

**Step 5 – Plotting the map**

It is totally possible to do this in one line, but many kittens will lose their lives if you do. I like this way of structuring the creation of a `ggplot` graphic since it makes it very easy to comment out or add various layers or customizations without worrying about stray `+` signs.

gg <- ggplot(data=nh_map, aes(map_id=id))
gg <- gg + geom_map(map=nh_map, aes(x=long, y=lat),
                    color="#0e0e0e", fill="white", size=0.2)
gg <- gg + geom_map(data=outage, map=nh_map, aes(fill=out),
                    color="#0e0e0e", size=0.2)
gg <- gg + scale_fill_brewer(type="seq", palette="RdPu",
                             name="Number of\ncustomer outages\nin each town")
gg <- gg + coord_equal()
gg <- gg + labs(title=sprintf("%s Total PSNH Customers Without Power",
                              comma(sum(outage$without_power))))
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg

That sequence starts the base `ggplot` object creation, sets up the base map colors and then overlays the town outage colors. We use the `RdPu` [Color Brewer](http://colorbrewer2.org/) sequential palette and give the legend the same title as the PSNH counterpart.

The shapefile is already projected (Lambert Conformal Conic—take a look at it with `ogrinfo -al`), so we can get away with using `coord_equal` vs re-projecting it, and we do a tally of outages to stick in the title. My base `theme_map` is designed for Maine, hence the extra `theme` call to move the legend.

**The Finished Product**

Crisp SVG polygons, no cluttered Bing Maps tiles and a proper color palette.

![img](http://rud.is/dl/psnh.svg)

All the code is [up on github](https://github.com/hrbrmstr/psnh).

Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions

2014-11-26 – 01:47
Posted in Charts & Graphs, Data Analysis, Data Visualization, DataVis, DataViz, maine, R, Weather

UPDATE: You can now run this as a local Shiny app by entering `shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE)` at an R prompt (provided all the dependent libraries listed below are installed) or use it interactively over at Shiny Apps.

The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains [historical daily climate records](http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html) for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their [data files](http://cdiac.ornl.gov/ftp/ushcn_daily/) indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).

To start, we’ll need a bit more than base R to get the job done:

library(pbapply)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(stringi)

In all honesty, `pbapply`, `dplyr` and `stringi` are not necessary, but they definitely make life easier by (respectively) giving us:

– free progress bars for `*apply` operations,
– high efficacy data manipulation idioms, and
– a handy utility for converting strings to title case.

With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have `#` characters in their titles, so we have to account for that (via `comment.char=""`) when we call `read.fwf`. After reading in the station database we use a naive-but-usable distance calculation to find the closest station:

stations <- read.fwf("data/ushcn-stations.txt",
                     widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                     col.names=c("coop_id", "latitude", "longitude", "elevation",
                                 "state", "name", "component_1", "component_2",
                                 "component_3", "utc_offset"),
                     colClasses=c("character", "numeric", "numeric", "numeric",
                                  "character", "character", "character", "character",
                                  "character", "character"),
                     comment.char="", strip.white=TRUE)
 
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                          (stations$longitude-lon)^2))
  stations[index,]
}
 
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
 
##     coop_id latitude longitude elevation state   name component_1 component_2 component_3 utc_offset
## 633  272174    43.15    -70.95      24.4    NH DURHAM      ------      ------      ------         +5
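
If the flat-plane shortcut ever becomes a problem (stations far apart, or far from the equator), a great-circle version is not much more code. The sketch below uses the haversine formula with an assumed mean Earth radius of 6371 km; `haversine_km` and `closest_station_gc` are hypothetical names, not part of the original code, and the two toy stations reuse coordinates printed in this post.

```r
# Great-circle distance via the haversine formula (mean Earth radius assumed).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical drop-in alternative to closestStation()
closest_station_gc <- function(stations, lat, lon) {
  d <- haversine_km(lat, lon, stations$latitude, stations$longitude)
  stations[which.min(d), ]
}

# Two stations with coordinates taken from the output in this post
toy <- data.frame(name = c("DURHAM", "PORTLAND JETPORT"),
                  latitude = c(43.15, 43.6497),
                  longitude = c(-70.95, -70.3003),
                  stringsAsFactors = FALSE)
closest_station_gc(toy, 43.2672, -70.8617)$name
```

For stations this close together both approaches agree, so the simpler function is fine for this analysis.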

As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:

closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
##    coop_id latitude longitude elevation state             name component_1 component_2 component_3 utc_offset
## 10  176905  43.6497  -70.3003      13.7    ME PORTLAND JETPORT      ------      ------      ------         +5

The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same evaluation for Portland, Maine.

Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:

| Variable | Columns | Type |
|----------|---------|-----------|
| COOP ID | 1-6 | Character |
| YEAR | 7-10 | Integer |
| MONTH | 11-12 | Integer |
| ELEMENT | 13-16 | Character |
| VALUE1 | 17-21 | Integer |
| MFLAG1 | 22 | Character |
| QFLAG1 | 23 | Character |
| SFLAG1 | 24 | Character |
| VALUE2 | 25-29 | Integer |
| MFLAG2 | 30 | Character |
| QFLAG2 | 31 | Character |
| SFLAG2 | 32 | Character |
| … | … | … |
| VALUE31 | 257-261 | Integer |
| MFLAG31 | 262 | Character |
| QFLAG31 | 263 | Character |
| SFLAG31 | 264 | Character |

We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use `read.fwf` again, we’ll take a different approach (since we ultimately need the data in long format) and use `readLines` to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.

Next, we set up nested `lapply` calls to build a long data frame from each month, then combine them all together into a single data frame:

snow <- readLines("data/state27_NH.txt")
 
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
 
snow_dat <- rbindlist(pblapply(snow, function(x) {
 
  rbindlist(lapply(1:31, function(i) {
 
    # record format described here:
    # http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt
 
    start <- 17 + (i-1)*8
 
    list(coop_id=substr(x, 1, 6),
         date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
         element=substr(x, 13, 16),
         value=as.numeric(substr(x, start, start+4)),
         mflag=substr(x, start+5, start+5),
         qflag=substr(x, start+6, start+6),
         sflag=substr(x, start+7, start+7))
 
  }))
 
}))
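
The offset arithmetic is easy to sanity-check against the record layout before running it over a whole state file. This sketch builds one synthetic 264-character record (the readings are made up, and `day_value` is a hypothetical helper) and pulls values back out with the same `17 + (i-1)*8` calculation:

```r
# Build one synthetic 264-character record: 16 header characters followed by
# 31 day slots of 8 characters each (5-wide value plus three 1-character flags).
values <- c(0L, 25L, rep(-9999L, 29))            # made-up readings
slots  <- paste0(sprintf("%5d", values), "   ")  # blank M/Q/S flags
rec    <- paste0("272174", "1950", "01", "SNOW", paste(slots, collapse = ""))

# Same offset arithmetic as in the post: day i starts at column 17 + (i-1)*8
day_value <- function(rec, i) {
  start <- 17 + (i - 1) * 8
  as.numeric(substr(rec, start, start + 4))
}

nchar(rec)
sapply(c(1, 2, 31), function(i) day_value(rec, i))
```

Day 31 starting at column 257 lines up with the `VALUE31` entry in the format description.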

Now, we’ll clean up the records even further by removing invalid entries (those with a `value` == `-9999`) and convert record dates to actual `Date` objects and filter out invalid dates:

snow_dat <- snow_dat %>% filter(value != -9999)
 
# since the data file has 31 days for each record regardless of whether
# that's valid or not, we take a shortcut to remove invalid dates by doing
# a vectorized Date conversion, then removing records with NA dates
 
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!is.na(date))
 
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")
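
The `NA` shortcut works because impossible calendar dates fail to parse. A tiny illustration with hand-written date strings; note that this sketch passes an explicit `format`, which is a choice made here so the result does not depend on how `as.Date` guesses the format from the first element:

```r
candidates <- c("1950-02-28", "1950-02-29", "1950-02-30", "1952-02-29")
# An explicit format avoids as.Date() guessing from the first element
parsed <- as.Date(candidates, format = "%Y-%m-%d")
candidates[!is.na(parsed)]   # keeps only the dates that really exist
```

Feb 29 survives only for 1952, which was a leap year.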

Given that winter in the U.S. spans two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value, then add the total number of days in the year to those values for dates early in the year (the code uses a day-of-year cutoff of 180, which is a bit past the end of May). We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.

snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
                       snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),
                       snow_dat$doy)
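
To see what the shift does, here is the same calculation on two illustrative dates, one in late autumn and one in the following January:

```r
d   <- as.Date(c("1950-11-15", "1951-01-10"))   # illustrative dates
doy <- as.numeric(format(d, "%j"))              # day of year
# number of days in each date's calendar year (365 or 366)
days_in_year <- as.numeric(format(as.Date(sprintf("%s-12-31", format(d, "%Y"))), "%j"))
shifted <- ifelse(doy <= 180, doy + days_in_year, doy)
data.frame(date = d, doy, shifted)
```

The January date now sorts after the November one, which is what keeps a winter in one piece on the plot axis.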

Now, the fun begins. We use (mostly) `dplyr` to extract the first snowfall day from each year, then make a dot-line plot from the data:

first <- snow_dat %>%
  filter(value>0) %>%                           # ignore 0 values
  filter(date>=as.Date("1950-01-01")) %>%       # start at 1950 (arbitrary)
  merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
  group_by(coop_id, year) %>%                   # group by station and year
  arrange(doy) %>%                              # sort by our munged day of year
  filter(row_number(doy) == 1) %>%              # grab the first entry by group
  select(name, state, date, value, doy)         # we only need some variables
 
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
 
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg

While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use `summary` to figure out the quartiles (for labeling the chart) for the actual date values:

wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
 
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
 
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
                              limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg

Finally, we’ll combine them together to get the finished product:

grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))

And, do the same for Portland:


There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.

You can find the complete, commented code up on [github](https://github.com/hrbrmstr/snowfirst).

Spending Seized Assets – A State-by-State Per-capita Comparison in R

2014-10-13 – 16:28
Posted in Charts & Graphs, Data Analysis, data driven security, Data Visualization, DataVis, DataViz, maps, R

The Washington Post did another great story+vis, this time on states [Spending seized assets](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/).

According to their sub-head:

>_Since 2008, about 5,400 police agencies have spent $2.5 billion in proceeds from cash and property seized under federal civil forfeiture laws. Police suspected the assets were linked to crime, although in 81 percent of cases no one was indicted._

Their interactive visualization lets you drill down into each state to examine the spending in each category. Since the WaPo team made the [data available](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/data/all.json) [JSON], I thought it might be interesting to take a look at a comparison across states (i.e. who are the “big spenders” of this seized hoard). Here’s a snippet of the JSON:

{"states": [
  {
  "st": "AK",
  "stn": "Alaska",
  "total": 8470032,
  "cats":
     [{ "weapons": 1649832, 
     "electronicSurv": 402490, 
     "infoRewards": 760730, 
     "travTrain": 848128, 
     "commPrograms": 121664, 
     "salaryOvertime": 776766, 
     "other": 1487613, 
     "commComp": 1288439, 
     "buildImprov": 1134370 }],
  "agencies": [
     {
     "aid": "AK0012700",
     "aname": "Airport Police & Fire Ted Stevens Anch Int'L Arpt",
     "total": 611553,
     "cats":
        [{ "weapons": 214296, "travTrain": 44467, "other": 215464, "commComp": 127308, "buildImprov": 10019 }]
     },
     {
     "aid": "AK0010100",
     "aname": "Anchorage Police Department",
     "total": 3961497,
     "cats":
        [{ "weapons": 1104777, "electronicSurv": 94741, "infoRewards": 743230, "travTrain": 409474, "salaryOvertime": 770709, "other": 395317, "commComp": 249220, "buildImprov": 194029 }]
     },

Getting the data was easy (in R, of course!). Let’s set up the packages we’ll need:

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(grid)
library(statebins)
library(gridExtra)

We also need `jsonlite`, but only to parse the data (which I’ve downloaded locally), so we’ll just do that in one standalone line:

data <- jsonlite::fromJSON("all.json", simplifyVector=FALSE)

It’s not fair (or valid) to just compare totals since some states have a larger population than others, so we’ll show the data twice, once in raw totals and once with a per-capita lens. For that, we’ll need population data:

pop <- read.csv("http://www.census.gov/popest/data/state/asrh/2013/files/SCPRC-EST2013-18+POP-RES.csv", stringsAsFactors=FALSE)
colnames(pop) <- c("sumlev", "region", "divison", "state", "stn", "pop2013", "pop18p2013", "pcntest18p")
pop$stn <- gsub(" of ", " Of ", pop$stn)

We have to fix the `District of Columbia` since the WaPo data capitalizes the `Of`.

Now we need to extract the agency data. This is really straightforward with some help from the `data.table` package:

agencies <- rbindlist(lapply(data$states, function(x) {
  rbindlist(lapply(x$agencies, function(y) {
    data.table(st=x$st, stn=x$stn, aid=y$aid, aname=y$aname, rbindlist(y$cats))
  }), fill=TRUE)
}), fill=TRUE)

The `rbindlist` `fill` option is super-handy in the event we have varying columns (and, we do in this case). It’s also wicked-fast.
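
To make the `fill` behaviour concrete, here is a base R sketch of the same idea (it is not how `data.table` implements it): pad each piece with `NA` for the columns it lacks, then row-bind. The helper name `bind_fill` is hypothetical, and the two toy rows borrow a few values from the JSON snippet above.

```r
# hypothetical helper: row-bind data frames whose columns differ,
# filling the gaps with NA (roughly the result rbindlist(..., fill=TRUE) gives)
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA  # add missing columns
    d[all_cols]                                          # common column order
  })
  do.call(rbind, padded)
}

# two agencies that report different spending categories
a <- data.frame(aid="AK0012700", weapons=214296, travTrain=44467,
                stringsAsFactors=FALSE)
b <- data.frame(aid="AK0010100", weapons=1104777, electronicSurv=94741,
                stringsAsFactors=FALSE)

res <- bind_fill(list(a, b))
res
```

Each agency ends up with one row and an `NA` wherever it reported nothing for a category, which is why the later sums use `na.rm=TRUE`.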

Now, we use some `dplyr` and `tidyr` to integrate the population information and summarize our data (OK, we cheat and use `merge`, but some habits are hard to break):

c_st <- agencies %>%
  merge(pop[,5:6], all.x=TRUE, by="stn") %>%
  gather(category, value, -st, -stn, -pop2013, -aid, -aname) %>%
  group_by(st, category, pop2013) %>%
  summarise(total=sum(value, na.rm=TRUE), per_capita=sum(value, na.rm=TRUE)/pop2013) %>%
  select(st, category, total, per_capita)
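
Stripped of the pipeline, the summary step is "sum by state and category, then divide by that state's population". A base R sketch with made-up numbers (the populations and dollar amounts below are invented for illustration and do not come from the WaPo or Census files):

```r
# toy spending rows, one per agency/category (values are invented)
spend <- data.frame(
  stn      = c("Alaska", "Alaska", "Texas", "Texas"),
  category = c("weapons", "weapons", "weapons", "other"),
  value    = c(100, 300, 1000, NA),
  stringsAsFactors = FALSE
)

# toy population table keyed on the same state-name column
pop_toy <- data.frame(stn=c("Alaska", "Texas"), pop2013=c(10, 500),
                      stringsAsFactors=FALSE)

# total per state/category, treating missing values as zero spending
totals <- aggregate(spend["value"], by=spend[c("stn", "category")],
                    FUN=sum, na.rm=TRUE)
names(totals)[names(totals) == "value"] <- "total"

# attach the population and compute the per-capita figure
per_cap <- merge(totals, pop_toy, by="stn", all.x=TRUE)
per_cap$per_capita <- per_cap$total / per_cap$pop2013
per_cap[order(-per_cap$per_capita), c("stn", "category", "total", "per_capita")]
```

In this toy data the larger state has the bigger raw total but the smaller per-capita figure, which is the effect the two sets of charts below are meant to expose.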

Let’s use a series of bar charts to compare state against state. We’ll do the initial view with just raw totals. There are 9 charts, so this graphic scrolls a bit and you can select it to make it larger:

# hack for ordering the bars, by kohske: http://stackoverflow.com/a/5414445/1457051 #####
 
c_st <- transform(c_st, category2=factor(paste(st, category)))
c_st <- transform(c_st, category2=reorder(category2, rank(-total)))
 
# pretty names #####
 
levels(c_st$category) <- c("Weapons", "Travel, training", "Other",
                           "Communications, computers", "Building improvements",
                           "Electronic surveillance", "Information, rewards",
                           "Salary, overtime", "Community programs")
gg <- ggplot(c_st, aes(x=category2, y=total))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

Comparison of Spending Category by State (raw totals)

There are definitely a few, repeating “big spenders” in that view, but is that the _real_ story? Let’s take another look, but factoring in state population:

# change bar order to match per-capita calculation #####
 
c_st <- transform(c_st, category2=reorder(category2, rank(-per_capita)))
 
# per-capita bar plot #####
 
gg <- ggplot(c_st, aes(x=category2, y=per_capita))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

Comparison of Spending Category by State (per-capita)

That certainly changes things! Alaska, West Virginia, and D.C. definitely stand out for “Weapons”, “Other” & “Information”, respectively (what’s Rhode Island hiding in “Other”?!), and the “top 10” in each category are very different from the raw totals view. We can look at this per-capita view with the `statebins` package as well:

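library(RColorBrewer) # provides brewer.pal(), used below

st_pl <- vector("list", 1+length(unique(c_st$category)))
 
j <- 0
for (i in unique(c_st$category)) {
  j <- j + 1
  # subset via c_st$category so this works on a plain data.frame as well as a data.table
  st_pl[[j]] <- statebins_continuous(c_st[c_st$category==i,], state_col="st", value_col="per_capita") +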
    scale_fill_gradientn(labels=dollar, colours=brewer.pal(6, "PuBu"), name=i) +
    theme(legend.key.width=unit(2, "cm"))
}
st_pl[[1+length(unique(c_st$category))]] <- list(ncol=1)
 
grid.arrange(st_pl[[1]], st_pl[[2]], st_pl[[3]],
             st_pl[[4]], st_pl[[5]], st_pl[[6]],
             st_pl[[7]], st_pl[[8]], st_pl[[9]], ncol=3)

Per-capita “Statebins” view of WaPo Seizure Data

(Doing this exercise also showed me I need to add some flexibility to the `statebins` package).

The [gist](https://gist.github.com/hrbrmstr/27b8f44f573539dc2971) shows how to build a top-level category data table (along with the rest of the code in this post). I may spin this data up into an interactive D3 visualization in the next week or two (as I think it might work better than large faceted bar charts), so stay tuned!

A huge thank you to the WaPo team for making data available to others. Go forth and poke at it with your own questions and see what you can come up with (perhaps comparing by area of state)!

Plot Me Like a Hurricane (a.k.a. animating historical North Atlantic basin tropical storm tracks)

2014-10-07 – 22:34
Posted in Data Visualization, DataVis, DataViz, maps, R, Weather
Tagged post
Comments (2)

Markus Gesmann (@MarkusGesmann) did a beautiful job [Visualising the seasonality of Atlantic windstorms](http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html) using small multiples, which was inspired by both a [post](http://freakonometrics.hypotheses.org/17113) by Arthur Charpentier (@freakonometrics) on using Markov spatial processes to “generate” hurricanes (which was [tweaked a bit](http://robertgrantstats.wordpress.com/2014/10/01/transparent-hurricane-paths-in-r/) by Robert Grant (@robertstats)) and [Gaston Sanchez](https://github.com/gastonstat)’s [Visualizing Hurricane Trajectories](http://rpubs.com/gaston/hurricanes) RPub.

I have [some history](http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/) with hurricane data and thought I’d jump on the bandwagon using the same data and making some stop-frame animations. I borrowed from previous work (hence starting with all the credits above) but have used `dplyr` idioms for data-frame filtering & mutating and my own month/year extraction code.

The first animation accumulates storm tracks in-year and displays the names of the storms in a list down the left side while the second does a full historical accumulation of tracks. I changed the storm path gradient but kept most of the other formatting bits and made the plots suitable for 1080p output/playback.

Rather than go the `ffmpeg` route, I used [ImageMagick](http://www.imagemagick.org/) since it makes equally quick work out of converting a bunch of `png` files to an `mp4` file. I made the animations go quickly, but they can be advanced forward/back one frame at a time in any decent player.

library(maps)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(RColorBrewer)
 
# takes in a numeric vector and returns a sequence from low to high
rangeseq <- function(x, by=1) {
  rng <- range(x)
  seq(from=rng[1], to=rng[2], by=by)
}
 
# extract the months (as a factor of full month names) from
# a date+time "x" that can be converted to a POSIXct object
extractMonth <- function(x) {
  months <- format(as.POSIXct(x), "%m")
  factor(months, labels=month.name[rangeseq(as.numeric(months))])
}
 
# extract the years (as a factor of 4-digit years) from
# a date+time "x" that can be converted to a POSIXct object
extractYear <- function(x) {
  factor(as.numeric(format(as.POSIXct(x), "%Y")))
}
 
# get from: ftp://eclipse.ncdc.noaa.gov/pub/ibtracs/v03r06/all/csv/Allstorms.ibtracs_all.v03r06.csv.gz
storms_file <- "data/Allstorms.ibtracs_all.v03r06.csv"
storms <-  fread(storms_file, skip=10, select=1:18)
 
col_names <- c("Season", "Num", "Basin", "Sub_basin", "Name", "ISO_time", "Nature",
             "Latitude", "Longitude", "Wind.kt", "Pressure.mb", "Degrees_North", "Deegrees_East")
setnames(storms, paste0("V", c(2:12, 17, 18)), col_names)
 
# use dplyr idioms to filter & mutate the data
 
storms <- storms %>%
  filter(Latitude > -999,                                  # remove missing data
         Longitude > -999,
         Wind.kt > 0,
         !(Name %in% c("UNNAMED", "NONAME:UNNAMED"))) %>%
  mutate(Basin=gsub(" ", "", Basin),                       # clean up fields
         ID=paste(Name, Season, sep="."),
         Month=extractMonth(ISO_time),
         Year=extractYear(ISO_time)) %>%
  filter(Season >= 1989, Basin %in% "NA")                  # limit to North Atlantic basin
 
season_range <- paste(range(storms$Season), collapse=" - ")
knots_range <- range(storms$Wind.kt)
 
# setup base plotting parameters (these won't change)
 
base <- ggplot()
base <- base + geom_polygon(data=map_data("world"),
                            aes(x=long, y=lat, group=group),
                            fill="gray25", colour="gray25", size=0.2)
base <- base + scale_color_gradientn(colours=rev(brewer.pal(n=9, name="RdBu")),
                                     space="Lab", limits=knots_range)
base <- base + xlim(-138, -20) + ylim(3, 55)
base <- base + coord_map()
base <- base + labs(x=NULL, y=NULL, title=NULL, colour = "Wind (knots)")
base <- base + theme_bw()
base <- base + theme(text=element_text(family="Arial", face="plain", size=rel(5)),
                     panel.background = element_rect(fill = "gray10", colour = "gray30"),
                     panel.margin = unit(c(0,0), "lines"),
                     panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     plot.margin = unit(c(0,0,0,0), "lines"),
                     axis.text.x = element_blank(),
                     axis.text.y = element_blank(),
                     axis.ticks = element_blank(),
                     legend.position = c(0.25, 0.1),
                     legend.background = element_rect(fill="gray10", color="gray10"),
                     legend.text = element_text(color="white", size=rel(2)),
                     legend.title = element_text(color="white", size=rel(5)),
                     legend.direction = "horizontal")
 
# loop over each year, producing plot files that accumulate tracks over each month
 
for (year in unique(storms$Year)) {
 
  storm_ids <- unique(storms[storms$Year==year,]$ID)
 
  for (i in 1:length(storm_ids)) {
 
    storms_yr <- storms %>% filter(Year==year, ID %in% storm_ids[1:i])
 
    # stuff takes a while, so it's good to have a progress message
    message(sprintf("%s %s", year, storm_ids[i]))
 
    gg <- base
    gg <- gg + geom_path(data=storms_yr,
                         aes(x=Longitude, y=Latitude, group=ID, colour=Wind.kt),
                         size=1.0, alpha=1/4)
    gg <- gg + geom_text(label=year, aes(x=-135, y=51), size=rel(6), color="white", vjust=1)
    gg <- gg + geom_text(label=paste(gsub(".[[:digit:]]+$", "", storm_ids[1:i]), collapse="\n"),
                         aes(x=-135, y=49.5), size=rel(4.5), color="white", vjust=1)
 
    # change "quartz" to "cairo" if you're not on OS X
 
    png(filename=sprintf("output/%s%03d.png", year, i),
        width=1920, height=1080, type="quartz", bg="gray25")
    print(gg)
    dev.off()
 
  }
 
}
 
# convert to mp4 animation - needs imagemagick
system("convert -delay 8 output/*png output/hurr-1.mp4")
# unlink("output/*png") # do this after verifying convert works

# take an alternate approach for accumulating the entire hurricane history
# start with the base, but add to the ggplot object in a loop, which will
# accumulate all the tracks.
 
gg <- base
 
for (year in unique(storms$Year)) {
 
  storm_ids <- unique(storms[storms$Year==year,]$ID)
 
  for (i in 1:length(storm_ids)) {
 
    storms_yr <- storms %>% filter(ID %in% storm_ids[i])
 
    message(sprintf("%s %s", year, storm_ids[i]))
    gg <- gg + geom_path(data=storms_yr,
                         aes(x=Longitude, y=Latitude, group=ID, colour=Wind.kt),
                         size=1.0, alpha=1/4)
 
    png(filename=sprintf("output/%s%03d.png", year, i),
        width=1920, height=1080, type="quartz", bg="gray25")
    print(gg)
    dev.off()
 
  }
 
}
 
system("convert -delay 8 output/*png output/hurr-2.mp4")
# unlink("output/*png") # do this after verifying convert works
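
A small sanity check on the frame naming: `convert` is handed `output/*png`, and the shell expands that glob in sorted order, so the zero-padded `%03d` counter is what keeps file order equal to plotting order within a year. The sketch below reuses the `sprintf` pattern from the loops above; the year and the frame numbers are arbitrary.

```r
# build file names the same way the loops above do
frame_name <- function(year, i) sprintf("output/%s%03d.png", year, i)

padded   <- frame_name(2013, c(1, 2, 10, 11))
unpadded <- sprintf("output/%s%d.png", 2013, c(1, 2, 10, 11))

# sorted as plain strings, the padded names keep their original order
identical(sort(padded, method="radix"), padded)

# without padding, frame 10 sorts ahead of frame 2
sort(unpadded, method="radix")
```

Three digits are enough as long as no single year has 1,000 or more frames.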

Full code in [this gist](https://gist.github.com/hrbrmstr/23bf06784e898871dd61).

Seeing the (day)light with R

2014-09-23 – 12:28
Posted in Charts & Graphs, Data Visualization, DataVis, R, Weather
Tagged post
Comments (4)

The arrival of the autumnal equinox foreshadows the reality of longer nights and shorter days here in the northeast US. We can both see that reality and distract ourselves from it at the same time by firing up RStudio (or your favorite editor) and taking a look at the sunrise & sunset times based on our map coordinates using some functions from the R {maptools} package.

The sunriset function takes in a lat/lon pair, a range of dates and whether we want sunrise or sunset calculated and returns when those ephemeral events occur. For example, we can see the sunrise time for Portsmouth, NH on Christmas day this year (2014) via:

library(maptools)

# these functions need the lat/lon in an unusual format
portsmouth <- matrix(c(-70.762553, 43.071755), nrow=1)
for_date <- as.POSIXct("2014-12-25", tz="America/New_York")
sunriset(portsmouth, for_date, direction="sunrise", POSIXct.out=TRUE)

##         day_frac                time
## newlon 0.3007444 2014-12-25 07:13:04

We can pass in a vector of dates to this function, and that means we’ll have data points we can work with to visualize this change. Let’s wrap the sequence generation into a function of our own and extract:

– sunrise
– sunset
– solar noon
– number of hours of daylight

for every day in the sequence, returning the result as a data frame.

# adapted from http://r.789695.n4.nabble.com/maptools-sunrise-sunset-function-td874148.html
ephemeris <- function(lat, lon, date, span=1, tz="UTC") {

  # convert to the format we need
  lon.lat <- matrix(c(lon, lat), nrow=1)

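  # make our sequence of days (note: as.POSIXct() on a bare date gives local
  # midnight, not noon, so rows around a daylight saving changeover, such as
  # 2014-11-02 in the output below, can come out shifted)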
  day <- as.POSIXct(date, tz=tz)
  sequence <- seq(from=day, length.out=span , by="days")

  # get our data
  sunrise <- sunriset(lon.lat, sequence, direction="sunrise", POSIXct.out=TRUE)
  sunset <- sunriset(lon.lat, sequence, direction="sunset", POSIXct.out=TRUE)
  solar_noon <- solarnoon(lon.lat, sequence, POSIXct.out=TRUE)

  # build a data frame from the vectors
  data.frame(date=as.Date(sunrise$time),
             sunrise=as.numeric(format(sunrise$time, "%H%M")),
             solarnoon=as.numeric(format(solar_noon$time, "%H%M")),
             sunset=as.numeric(format(sunset$time, "%H%M")),
             day_length=as.numeric(difftime(sunset$time, sunrise$time, units="hours")))

}
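
A caveat on date sequences that span a daylight saving changeover: in base R, `seq()` on a `POSIXct` with `by="days"` steps in fixed 24-hour increments, whereas `by="DSTday"` keeps the same clock time each day. A minimal sketch of the difference (the noon start time is my choice for illustration, and this assumes the system's time zone database knows `America/New_York`):

```r
start <- as.POSIXct("2014-11-01 12:00:00", tz="America/New_York")

# fixed 86400-second steps: the clock time slips once DST ends on Nov 2
fixed_steps <- seq(from=start, by="days", length.out=3)
format(fixed_steps, "%Y-%m-%d %H:%M")

# "DSTday" steps: same clock time every day
clock_steps <- seq(from=start, by="DSTday", length.out=3)
format(clock_steps, "%Y-%m-%d %H:%M")
```

Whether switching `ephemeris()` to `"DSTday"` steps changes what `sunriset()` returns is something to verify against real output rather than assume.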

Now we can take a look at these values over 10 days near All Hallows Eve:

ephemeris(43.071755, -70.762553, "2014-10-31", 10, tz="America/New_York")

##          date sunrise solarnoon sunset day_length
## 1  2014-10-31     716      1226   1736  10.332477
## 2  2014-11-01     717      1226   1734  10.289145
## 3  2014-11-02     518      1026   1533  10.246169
## 4  2014-11-03     620      1126   1632  10.203563
## 5  2014-11-04     621      1126   1631  10.161346
## 6  2014-11-05     622      1126   1629  10.119535
## 7  2014-11-06     624      1126   1628  10.078148
## 8  2014-11-07     625      1126   1627  10.037204
## 9  2014-11-08     626      1126   1626   9.996721
## 10 2014-11-09     627      1126   1625   9.956719
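
Note that the sunrise, solarnoon and sunset columns are clock times packed into plain numbers (716 means 07:16), which is what lets them go straight onto a continuous y-axis later. A quick sketch of that encoding and of the day-length arithmetic, using hand-typed timestamps rather than real `sunriset()` output:

```r
# two hand-typed timestamps (illustrative, not computed by maptools)
rise <- as.POSIXct("2014-10-31 07:16:00", tz="UTC")
down <- as.POSIXct("2014-10-31 17:36:00", tz="UTC")

# "%H%M" yields "0716"; as.numeric() then drops the leading zero
hhmm <- c(as.numeric(format(rise, "%H%M", tz="UTC")),
          as.numeric(format(down, "%H%M", tz="UTC")))
hhmm

# asking difftime() for hours explicitly avoids relying on its automatic unit choice
day_length <- as.numeric(difftime(down, rise, units="hours"))
day_length
```

The packed values are labels rather than quantities: subtracting them (1736 - 716) does not give the day length, which is why `day_length` is computed from the timestamps themselves.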

We now have everything we need to visualize the seasonal daylight changes. We’ll use ggplot (with some help from the gridExtra package) and build a two-panel graph: one panel gives us a “ribbon” view of which hours of the day are in daylight and the other shows the change in the total number of hours of daylight per day. We’ll build the function so that it will:

– optionally show the current date/time (TRUE by default)
– optionally show when solar noon is (FALSE by default)
– optionally plot the graphs (TRUE by default)
– return an arrangeGrob of the charts in the event we want to use them in other charts

library(ggplot2)
library(scales)
library(gridExtra)

# create two formatter functions for the y-axis display

# for graph #1 y-axis
time_format <- function(hrmn) substr(sprintf("%04d", hrmn),1,2)

# for graph #2 y-axis
pad5 <- function(num) sprintf("%2d", num)

daylight <- function(lat, lon, place, start_date, span=2, tz="UTC", 
                     show_solar_noon=FALSE, show_now=TRUE, plot=TRUE) {

  stopifnot(span>=2) # really doesn't make much sense to plot 1 value

  srss <- ephemeris(lat, lon, start_date, span, tz)

  x_label = ""

  gg <- ggplot(srss, aes(x=date))
  gg <- gg + geom_ribbon(aes(ymin=sunrise, ymax=sunset), fill="#ffeda0")

  if (show_solar_noon) gg <- gg + geom_line(aes(y=solarnoon), color="#fd8d3c")

  if (show_now) {
    gg <- gg + geom_vline(xintercept=as.numeric(as.Date(Sys.time())), color="#800026", linetype="longdash", size=0.25)
    gg <- gg + geom_hline(yintercept=as.numeric(format(Sys.time(), "%H%M")), color="#800026", linetype="longdash", size=0.25)
    x_label = sprintf("Current Date / Time: %s", format(Sys.time(), "%Y-%m-%d / %H:%M"))
  }

  gg <- gg + scale_x_date(expand=c(0,0), labels=date_format("%b '%y"))
  gg <- gg + scale_y_continuous(labels=time_format, limits=c(0,2400), breaks=seq(0, 2400, 200), expand=c(0,0))
  gg <- gg + labs(x=x_label, y="",
                  title=sprintf("Sunrise/set for %s\n%s ", place, paste0(range(srss$date), sep=" ", collapse="to ")))
  gg <- gg + theme_bw()
  gg <- gg + theme(panel.background=element_rect(fill="#525252"))
  gg <- gg + theme(panel.grid=element_blank())

  gg1 <- ggplot(srss, aes(x=date, y=day_length))
  gg1 <- gg1 + geom_area(fill="#ffeda0")
  gg1 <- gg1 + geom_line(color="#525252")

  if (show_now) gg1 <- gg1 + geom_vline(xintercept=as.numeric(as.Date(Sys.time())), color="#800026", linetype="longdash", size=0.25)

  gg1 <- gg1 + scale_x_date(expand=c(0,0), labels=date_format("%b '%y"))
  gg1 <- gg1 + scale_y_continuous(labels=pad5, limits=c(0,24), expand=c(0,0))
  gg1 <- gg1 + labs(x="", y="", title="Day(light) Length (hrs)")
  gg1 <- gg1 + theme_bw()

  if (plot) grid.arrange(gg, gg1, nrow=2)

  arrangeGrob(gg, gg1, nrow=2)

}
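
For reference, here is what the two label formatters do to the axis breaks. Both functions are copied from the code above; only the sample inputs are new.

```r
time_format <- function(hrmn) substr(sprintf("%04d", hrmn), 1, 2)
pad5 <- function(num) sprintf("%2d", num)

# HHMM breaks become two-digit hour labels
time_format(seq(0, 2400, 800))

# hour counts are right-aligned to two characters
pad5(c(0, 12, 24))
```

Both rely on `sprintf` accepting whole-number doubles for `%d`, which is what `seq()` and the default ggplot breaks hand over here.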

We can test out our new function using the same location and graph the sunlight data for a year starting September 1, 2014 (select graph for full-size version):

daylight(43.071755, -70.762553, "Portsmouth, NH", "2014-09-01", 365, tz="America/New_York")

With the longer nights approaching we can further enhance the plotting function to add markers for solstices and perhaps even make a new version that compares sunlight across different geographical locations.

Complete code example is in this gist.

Charting/Mapping the Scottish Vote with R (an rvest/dplyr/tidyr/TopoJSON/ggplot tutorial)

2014-09-20 – 10:49
Posted in Charts & Graphs, d3, Data Visualization, DataVis, DataViz, maps, R
Tagged post
Comments (15)

The BBC did a pretty good job [live tracking the Scotland secession vote](http://www.bbc.com/news/events/scotland-decides/results), but I really didn’t like the color scheme they chose. I decided to use the final tally site as the basis for another tutorial using the tools from the Hadleyverse, taking advantage of the fact that newer `gdal` libraries can read in [TopoJSON](https://github.com/mbostock/topojson)/GeoJSON files, meaning we can use _most_ of the maps the D3-ers create/use right in R.

We’ll need a few R packages to help us get, clean, format and chart the data:

library(rvest)
library(dplyr)
library(httr) # >0.5
library(tidyr)
library(gpclib)
library(rgeos)
library(sp)
library(maptools)
library(rgdal) # needs gdal > 1.11.0
library(ggplot2)
library(reshape2)
library(gridExtra)
library(scales) # for percent(), used in the bar chart labels

The new `rvest` package makes it super-fun (and easy) to get data out of web pages (as I’ve [mentioned on the sister blog](http://datadrivensecurity.info/blog/posts/2014/Sep/migrating-to-rvest/)), but said data is still web page data, usually geared towards making things render well in a browser, and we end up having to clean up the extracted fields to get useful data. Since we usually want a data frame from the extraction, an `rvest` idiom I’ve been playing with involves bundling the element extraction & cleanup code into one function and then using that to build the columns:

# extract data from rvest-ed <div>'s and clean it up a bit
# pass in the rvested HTML object and the CSS selector to extract, also 
# indicating whether we want a number or character vector returned
 
extractAndCleanup <- function(data, selector, make_numeric=FALSE) {
  x <- data %>% html_nodes(selector) %>% html_text()
  x <- gsub("^[[:punct:][:space:]]*|[[:punct:][:space:]]*$", "", x)
  if (make_numeric) x <- as.numeric(gsub("[,[:space:]]*", "", x))
  x
}
 
bbc_vote <- html("http://www.bbc.com/news/events/scotland-decides/results")
 
secede <- data.frame(
  council=bbc_vote %>% extractAndCleanup(".body-row__cell--council"),
  electorate=bbc_vote %>% extractAndCleanup(".body-row__cell--electorate", TRUE),
  yes=bbc_vote %>% extractAndCleanup(".body-row__cell--yes", TRUE),
  no=bbc_vote %>% extractAndCleanup(".body-row__cell--no", TRUE),
  stringsAsFactors=FALSE)
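
To see what the cleanup regexes do without touching the network, here is the same pair of `gsub()` calls applied to a few hand-made strings. The samples are invented, not scraped from the BBC page, and `cleanup` is a hypothetical name for the non-`rvest` half of the function above.

```r
cleanup <- function(x, make_numeric=FALSE) {
  # strip leading/trailing punctuation and whitespace
  x <- gsub("^[[:punct:][:space:]]*|[[:punct:][:space:]]*$", "", x)
  # drop thousands separators and inner whitespace before converting
  if (make_numeric) x <- as.numeric(gsub("[,[:space:]]*", "", x))
  x
}

# text cells: the outer noise goes, the inner "&" stays
cleanup(c("  Dumfries & Galloway \n", "(Glasgow)"))

# numeric cells: commas and a trailing percent sign are removed
cleanup(c(" 122,791 ", "\t45.3%"), make_numeric=TRUE)
```

Because only the ends of each string are trimmed, an ampersand inside a council name survives this step and has to be dealt with separately.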

We can then compute whether the vote tally was to secede or not and assign a color in the event we choose to use base graphics for plotting (we won’t for this tutorial). I chose a muted version of the Union Jack red and the official Scottish blue for this exercise.

secede <- secede %>% mutate(gone=yes>no,
                            color=ifelse(gone, "#0065BD", "#CF142B77"))

Getting the map from the BBC site is just as simple. An inspection of the site in any decent browser with a “Developer” mode lets us see the elements being downloaded. The BBC map reads its data from `http://static.bbci.co.uk/news/1.49.0-1192/js/data/maps/l/scotland-elections.js`, which is a TopoJSON object wrapped in two lines of extra JavaScript code. We’ll grab that file, clean it up and read the map into R using `httr`’s new-ish ability to save to a data file:

dir.create("data", showWarnings=FALSE) # the target directory must exist before write_disk() runs
GET("http://static.bbci.co.uk/news/1.49.0-1192/js/data/maps/l/scotland-elections.js",
    write_disk("data/scotland.json"), progress())
tmp <- readLines("data/scotland.json")
writeLines(tmp[2], "data/scotland.json")
 
map <- readOGR("data/scotland.json", "scotland-elections")

We’ll want to work with the map using Council names, so we need to ensure the names from the extracted `div` elements match what’s in the TopoJSON file:

secede$council %in% map@data$name
 
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

It looks like we’ll need to clean the names up a bit, but thankfully the names aren’t too far off:

secede$council <- gsub("&", "and", secede$council)
secede[secede$council=="Edinburgh",]$council = "City of Edinburgh"
secede[secede$council=="Glasgow",]$council = "Glasgow City"
secede[secede$council=="Comhairle nan Eilean Siar",]$council = "Na h-Eileanan an Iar"
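
When `%in%` returns a wall of `TRUE`/`FALSE`, `setdiff()` is a quick way to list exactly which names still fail to match. A sketch on a hand-picked subset of names (the two vectors below are stand-ins for `secede$council` and `map@data$name`, not the full data, and the spelling of the ampersand entry is an assumption):

```r
scraped   <- c("Aberdeen City", "Edinburgh", "Glasgow", "Perth & Kinross")
map_names <- c("Aberdeen City", "City of Edinburgh", "Glasgow City",
               "Perth and Kinross")

# names in the scraped table with no partner in the map
setdiff(scraped, map_names)

# apply the same kind of fixes as above, then re-check
fixed <- gsub("&", "and", scraped)
fixed[fixed == "Edinburgh"] <- "City of Edinburgh"
fixed[fixed == "Glasgow"]   <- "Glasgow City"

# fail loudly if anything is still unmatched
stopifnot(all(fixed %in% map_names))
```

Ending the cleanup with a `stopifnot()` like this means a future change to either name list surfaces as an error instead of as silently missing polygons after the merge.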

If we were using base graphics for plotting, we’d also have to ensure the data was in the right order:

secede$council <- factor(secede$council, map@data$name, ordered=TRUE)
secede <- secede %>% arrange(council)

We’re going to use `ggplot` for the mapping portion, but the normal `fortify` process didn’t work on this TopoJSON file (some polygon errors emerged), so we’ll take another route: do the Council name↔id mapping after the `fortify` call, then merge the rest of our data into the map data frame:

map_df <- fortify(map)
 
# manually associate the map id's with the Council names and vote data
councils <- data.frame(id=0:(length(map@data$name)-1),
                       council=as.character(map@data$name))
map_df <- merge(map_df, councils, by="id")
map_df <- merge(map_df, secede, by="council")

Now we can generate the choropleth:

gg <- ggplot()
gg <- gg + geom_map(data=map_df, map=map_df,
                    aes(map_id=id, x=long, y=lat, group=group, fill=color),
                    color="white", size=0.25)
gg <- gg + scale_fill_manual(values=rev(unique(secede$color)),
                             labels=c("Yes", "No"), name="Secede?")
gg <- gg + xlim(extendrange(r=range(coordinates(map)[,1]), f=0.15))
gg <- gg + ylim(extendrange(r=range(coordinates(map)[,2]), f=0.07))
gg <- gg + coord_map()
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())

A choropleth is all well and good but, since we have the data, let’s add the bar chart to complete the presentation. We’ll combine some `dplyr` and `tidyr` calls to melt and subset our data frame:

secede_m <- secede %>%
  gather(variable, value, -council) %>%
  filter(variable %in% c("yes", "no")) %>%
  mutate(value=as.numeric(value))

For this exercise, we’ll plot the 100% stacked bars in order of the “No” votes, and we’ll pre-process this ordering to make the `ggplot` code easier on the eyes. We start by merging some data back into our melted data frame so we can build the sorted factor by the “No” value column and then make sure the Councils will be in that order:

secede_m <- merge(secede_m, secede, by="council")
secede_m$variable <- factor(secede_m$variable,
                            levels=c("yes", "no"), ordered=TRUE)
secede_m <- secede_m %>% arrange(no, variable)
secede_m$council <- factor(secede_m$council,
                           unique(secede_m$council), ordered=TRUE)
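
The ordering trick works because `ggplot` draws a discrete axis in factor-level order, so sorting the rows and then fixing the levels to the order of first appearance is enough. A base R sketch with invented councils and counts:

```r
votes <- data.frame(council=c("B", "A", "C", "B", "A", "C"),
                    variable=rep(c("yes", "no"), each=3),
                    no=c(30, 10, 20, 30, 10, 20),
                    stringsAsFactors=FALSE)

# sort by the "no" count, then freeze that order into the factor levels
votes <- votes[order(votes$no, votes$variable), ]
votes$council <- factor(votes$council, levels=unique(votes$council),
                        ordered=TRUE)
levels(votes$council)
```

Without the explicit `levels=` argument, `factor()` would fall back to alphabetical order and the sort would have no effect on the plot.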

Finally, we can create the 100% stacked bar plot and combine it with the choropleth to build the final product:

gg1 <- ggplot(secede_m, aes(x=council, y=value, fill=factor(variable)))
gg1 <- gg1 + geom_bar(stat="identity", position="fill")
gg1 <- gg1 + scale_fill_manual(values=rev(unique(secede$color)),
                             labels=c("Yes", "No"), name="Secede?")
gg1 <- gg1 + geom_hline(yintercept=0.50, color="gray80")
gg1 <- gg1 + geom_text(aes(label=percent(yes/100)), y=0.08, color="white", size=3)
gg1 <- gg1 + geom_text(aes(label=percent(no/100)), y=0.92, color="white", size=3)
gg1 <- gg1 + coord_flip()
gg1 <- gg1 + labs(x="", y="")
gg1 <- gg1 + theme_bw()
gg1 <- gg1 + theme(panel.grid=element_blank())
gg1 <- gg1 + theme(legend.position="top")
gg1 <- gg1 + theme(panel.border=element_blank())
gg1 <- gg1 + theme(axis.ticks=element_blank())
gg1 <- gg1 + theme(axis.text.x=element_blank())
 
vote <- arrangeGrob(gg1, gg, ncol=2,
                     main=textGrob("Scotland Votes", gp=gpar(fontsize=20)))


I’ve bundled this code up into its own [github repo](https://github.com/hrbrmstr/secede-2014). The full project example has a few extra features:

– it shows how to save the resultant data frame to an R data file (in case the BBC nukes the site)
– also saves the cleaned-up JSON (getting minimal Scotland shapefiles is tricky so this one’s a keeper even with the polygon errors)
– wraps all that in `if` statements so future analysis/vis can work with or without the live data being available.

Hadley really has to stop making R so fun to work with :-)

UPDATE

Paul Drake commented that the BBC choropleth (and, hence, my direct clone of it) could be made more informative by showing the vote difference. Since that only means changing two lines of code, here it is in situ rather than in a new post.

gg <- gg + geom_map(data=map_df, map=map_df,
                    aes(map_id=id, x=long, y=lat, group=group, fill=yes-no),
                    color="white", size=0.25)
gg <- gg + scale_fill_gradient(low="#CF142B", high="#0065BD", 
                               name="Secede?\n(vote margin)", guide="legend")