
Category Archives: maine

Version 0.6.0 of the hrbrthemes package should be hitting a CRAN mirror near you soon. Apart from some general documentation and code cleanup this release includes the dark theme folks have been seeing in blog posts and tweets over the past few months. It’s called theme_ft_rc() since it is an homage to the wonderful new chart theme developed by the @ft_data crew over at the Financial Times (you can see examples from their work here).

While there was nothing stopping folks from using the GitHub version, the CRAN release makes it more widely available. There are still intermittent issues with fonts for some folks which I’ll be working on for the next release.
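
If you’re among the folks hitting font issues, the usual culprits are missing font files or a stale extrafont registry. Here’s a minimal troubleshooting sketch (hedged: helper names are from recent hrbrthemes releases and the exact steps vary by platform), using the Roboto Condensed face that theme_ft_rc() expects:

library(hrbrthemes)
library(extrafont)

# copy/register the Roboto Condensed files that ship with the package,
# then refresh extrafont's font database
import_roboto_condensed() # also tells you where the TTF files live so you can install them system-wide
loadfonts(quiet = TRUE)

# the family should appear here before theme_ft_rc() will find it
"Roboto Condensed" %in% fonts()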

Since you’ve already seen lots of examples of these charts I won’t just make a gratuitous example using the theme. I will, however, make some charts based on a new data package dubbed iceout. The iceout package was originally conceived by Ben Tupper from the Bigelow Laboratory for Ocean Sciences. I keep an eye on fellow Mainers’ repositories and I did not realize (but should have known) that researchers keep track of when inland bodies of water freeze and thaw. The package name is derived from the term used for the thaw measurements (“ice-out” or “ice-off”).

Before I became obsessed with this data and got the package to its current state, the original codebase worked off of a USGS Lake Ice-Out Data for New England dataset that focused solely on New England and only went up to 2005. Some digging revealed that

  • Maine’s Department of Agriculture and Forestry has maintained online records since 2003; and,
  • Minnesota’s Department of Natural Resources maintains a comprehensive database of records going back to the 1800s.

But I hit the jackpot after discovering the U.S. National Snow & Ice Data Center’s Global Lake and River Ice Phenology dataset which:

… contains freeze and breakup dates and other ice cover descriptive data for 865 lakes and rivers. Of the 542 water bodies that have records longer than 19 years, 370 are in North America and 172 are in Eurasia; 249 have records longer than 50 years; and 66 longer than 100 years. A few have data prior to 1845. These data, from water bodies distributed around the Northern Hemisphere, allow analysis of broad spatial patterns as well as long-term temporal patterns.

So, I converted the original package to a data package containing all four of those datasets plus some interactive functions for pulling “live” data and a set of “builders” to regenerate the databases. Let’s take a quick look at what’s in the NSIDC data and the global coverage area:

library(iceout) # github/hrbrmstr/iceout
library(hrbrthemes) 
library(ggplot2)
library(dplyr)

data("nsidc_iceout")

glimpse(nsidc_iceout)
## Observations: 35,918
## Variables: 37
## $ lakecode                <chr> "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1…
## $ lakename                <chr> "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Su…
## $ lakeorriver             <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
## $ season                  <chr> "1443-44", "1444-45", "1445-46", "1446-47", "1447-48", "1448…
## $ iceon_year              <dbl> 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, …
## $ iceon_month             <dbl> 12, 11, 12, 12, 11, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, …
## $ iceon_day               <dbl> 8, 23, 1, 2, 30, 8, 13, 8, 23, 28, 3, 5, 1, 5, 6, 20, 10, 15…
## $ iceoff_year             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_month            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_day              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ duration                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ latitude                <dbl> 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.1…
## $ longitude               <dbl> 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.…
## $ country                 <chr> "Japan", "Japan", "Japan", "Japan", "Japan", "Japan", "Japan…
## $ froze                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ obs_comments            <chr> "calendar correction for ice_on: -30 days of original data; …
## $ area_drained            <dbl> 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, …
## $ bow_comments            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ conductivity_us         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ elevation               <dbl> 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, …
## $ filename                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ initials                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ inlet_streams           <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ landuse_code            <chr> "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAF…
## $ largest_city_population <dbl> 52000, 52000, 52000, 52000, 52000, 52000, 52000, 52000, 5200…
## $ max_depth               <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, …
## $ mean_depth              <dbl> 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, …
## $ median_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ power_plant_discharge   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ secchi_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ shoreline               <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ surface_area            <dbl> 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, …
## $ state                   <chr> "Nagano Prefecture", "Nagano Prefecture", "Nagano Prefecture…
## $ iceon_date              <date> 1443-12-08, 1444-11-23, 1445-12-01, 1446-12-02, 1447-11-30,…
## $ iceon_doy               <dbl> 342, 328, 335, 336, 334, 343, 347, 342, 357, 333, 337, 339, …
## $ iceout_date             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ iceout_doy              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

maps::map("world", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>%
  fortify() -> wrld

ggplot() + 
  ggalt::geom_cartogram(
    data = wrld, map = wrld, aes(long, lat, map_id=region), 
    fill="#3B454A",  color = "white", size = 0.125
  ) +
  geom_point(
    data = distinct(nsidc_iceout, lakeorriver, longitude, latitude),
    aes(longitude, latitude, fill = lakeorriver), 
    size = 1.5, color = "#2b2b2b", stroke = 0.125, shape = 21
  ) +
  scale_fill_manual(
    name = NULL, values = c("L"="#fdbf6f", "R"="#1f78b4"), labels=c("L" = "Lake", "R" = "River")
  ) +
  ggalt::coord_proj("+proj=wintri", ylim = range(nsidc_iceout$latitude, na.rm = TRUE)) +
  labs(title = "NSIDC Dataset Coverage") +
  theme_ft_rc(grid="") +
  theme(legend.position = c(0.375, 0.1)) +
  theme(axis.text = element_blank(), axis.title = element_blank())

W00t! Lots of data (though not all of the extra features are populated for all readings/areas)!
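
A quick, throwaway way to verify that sparseness claim is to look at the proportion of NAs per column (base R; nothing package-specific):

# fraction of missing values in each column, worst offenders first
head(sort(colMeans(is.na(nsidc_iceout)), decreasing = TRUE), 10)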

I think the reason the ice-out data garnered my obsession was how it can be used as another indicator that we are indeed in the midst of a climate transformation. Let’s look at the historical ice-out information for Maine inland bodies of water:

filter(nsidc_iceout, country == "United States", state == "ME") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% # we want the Y axis formatted as month-day so we choose a leap year to ensure we get leap dates (if any)
  ggplot(aes(iceoff_year, iceout_date)) +
  geom_point(aes(color = lakename), size = 0.5, alpha=1/4) +
  geom_smooth(aes(color = lakename), se=FALSE, method = "loess", size=0.25) +
  scale_y_date(date_labels = "%b-%d") +
  labs(
    x = NULL, y = "Ice-out Month/Day", color = NULL,
    title = "Historical Ice-out Data/Trends for Maine Inland Bodies of Water"
  ) +
  theme_ft_rc(grid="XY")

You can follow that code-pattern to look at other states. It’s also fun to look at the ice-out date distributions by latitude grouping:

filter(nsidc_iceout, !is.na(latitude) & !is.na(longitude) & !is.na(iceout_date)) %>% 
  filter(country == "United States") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% 
  mutate(lat_grp = cut(latitude, scales::pretty_breaks(5)(latitude), ordered_result = TRUE)) %>% 
  arrange(desc(iceoff_year)) %>% 
  ggplot() +
  ggbeeswarm::geom_quasirandom(
    aes(lat_grp, iceout_date, fill = iceoff_year), groupOnX = TRUE, 
    shape = 21, size =1, color = "white", stroke = 0.125, alpha=1/2
  ) +
  scale_y_date(date_labels = "%b-%d") +
  viridis::scale_fill_viridis(name = "Year", option = "magma") +
  labs(
    x = "Latitude Grouping", y = "Ice-out Month/Day",
    title = "U.S. Ice-out Historical Day/Month Distributions by Latitude Grouping"
  ) +
  theme_ft_rc(grid="Y")
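
Following that same code-pattern for other states, one option is to wrap the filtering and plotting in a small helper (a sketch only; it assumes the two-letter state abbreviations used for U.S. records, as in the Maine example above):

plot_state_iceout <- function(state_abbr, ctry = "United States") {
  filter(nsidc_iceout, country == ctry, state == state_abbr) %>%
    mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>%
    ggplot(aes(iceoff_year, iceout_date)) +
    geom_point(aes(color = lakename), size = 0.5, alpha = 1/4) +
    geom_smooth(aes(color = lakename), se = FALSE, method = "loess", size = 0.25) +
    scale_y_date(date_labels = "%b-%d") +
    labs(
      x = NULL, y = "Ice-out Month/Day", color = NULL,
      title = sprintf("Historical Ice-out Data/Trends (%s)", state_abbr)
    ) +
    theme_ft_rc(grid = "XY")
}

plot_state_iceout("MN") # e.g. Minnesota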

If you want to focus on individual lakes there’s a Shiny app for that (well, one for the U.S. anyway).

After loading the package, just enter explore_us() at an R console and you’ll see something like this:

The leaflet view will zoom to each new lake selected and the graph will be updated as well.
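
The app ships with the package, but the zoom-and-update pattern is easy to replicate if you want to roll your own. Here’s a bare-bones sketch of that interaction (this is not the package’s code, just the general idea rebuilt from the bundled nsidc_iceout data):

library(shiny)
library(leaflet)
library(ggplot2)
library(dplyr)
library(iceout)

data("nsidc_iceout")

us_lakes <- filter(nsidc_iceout, country == "United States", !is.na(latitude), !is.na(iceout_doy))

ui <- fluidPage(
  selectInput("lake", "Lake:", choices = sort(unique(us_lakes$lakename))),
  leafletOutput("map", height = 350),
  plotOutput("trend", height = 250)
)

server <- function(input, output, session) {

  output$map <- renderLeaflet({
    leaflet(distinct(us_lakes, lakename, latitude, longitude)) %>%
      addTiles() %>%
      addCircleMarkers(~longitude, ~latitude, label = ~lakename, radius = 4)
  })

  # zoom the leaflet view whenever a new lake is picked
  observeEvent(input$lake, {
    sel <- filter(us_lakes, lakename == input$lake)
    leafletProxy("map") %>%
      flyTo(lng = sel$longitude[1], lat = sel$latitude[1], zoom = 8)
  })

  # ...and redraw the ice-out trend chart for that lake
  output$trend <- renderPlot({
    filter(us_lakes, lakename == input$lake) %>%
      ggplot(aes(iceoff_year, iceout_doy)) +
      geom_point(alpha = 1/3) +
      geom_smooth(se = FALSE, method = "loess") +
      labs(x = NULL, y = "Ice-out day of year", title = input$lake)
  })

}

shinyApp(ui, server)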

Other Package News

The sergeant package is reaching a stable point in the 0.8.0 branch (mostly due to David Severski’s tireless help finding bugs) and should be headed to CRAN soon. Get your issues or PRs in if you want them CRANdied.

I’ve finally updated the Java library dependencies in pdfboxjars so pdfbox will no longer cause GitHub to tell you or me that it is insecure.

There’s a new package dubbed reapr that is aimed somewhere at the intersection of curl + httr + rvest. Fundamentally, it provides some coder-uplift when scraping data. The README has examples but here’s what you get on an initial scrape of this blog’s index page:

reapr::reap_url("http://rud.is/b")
##                Title: rud.is | "In God we trust. All others must bring data"
##         Original URL: http://rud.is/b
##            Final URL: https://rud.is/b/
##           Crawl-Date: 2019-01-17 19:51:09
##               Status: 200
##         Content-Type: text/html; charset=UTF-8
##                 Size: 50 kB
##           IP Address: 104.236.112.222
##                 Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
##                       label[1], noscript[1], section[1], title[1],
##                       aside[2], nav[2], ul[2], style[5], img[6],
##                       input[6], article[8], time[8], footer[9], h1[9],
##                       header[9], p[10], li[19], meta[20], div[31],
##                       script[40], span[49], link[53], a[94]
##           # Comments: 17
##   Total Request Time: 2.093s

The reap_url() function:

  • Uses httr::GET() to make web connections and retrieve content which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
  • Returns a richer set of data. After the httr::response object is obtained many tasks are performed including:
    • timestamping of the URL crawl
    • extraction of the asked-for URL and the final URL (in the case
      of redirects)
    • extraction of the IP address of the target server
    • extraction of both plaintext and parsed (xml_document) HTML
    • extraction of the plaintext webpage <title> (if any)
    • generation of a dynamic list of tags in the document which can be
      fed directly to HTML/XML search/retrieval functions (which may
      speed up node discovery)
    • extraction of the text of all comments in the HTML document
    • inclusion of the full httr::response object with the returned
      object
    • extraction of the time it took to make the complete request

I’m still wrestling with the API so definitely file issues with suggestions (wherever you’re most comfortable socially coding).
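
If you’re curious what a few of those steps look like without reapr, here’s a hand-rolled approximation using httr + xml2 directly (this is not reapr’s implementation, just the shape of the work it wraps up for you):

library(httr)
library(xml2)

res <- GET("https://rud.is/b/", user_agent("Mozilla/5.0"))

crawl_time <- Sys.time()                                         # timestamp the crawl
final_url  <- res$url                                            # final URL after redirects
raw_html   <- content(res, as = "text", encoding = "UTF-8")      # plaintext HTML
parsed     <- content(res, as = "parsed", encoding = "UTF-8")    # xml_document
page_title <- xml_text(xml_find_first(parsed, "//title"))
tag_counts <- table(xml_name(xml_find_all(parsed, "//body//*"))) # tag inventory
comments   <- xml_text(xml_find_all(parsed, "//comment()"))      # HTML comments
total_time <- res$times[["total"]]                               # request time (seconds)

sort(tag_counts, decreasing = TRUE)[1:10]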

Speaking of IP addresses (bullet 3 above), I finally got some time to study the gdns C library (a modern DNS API library) and created the clandnstine package. The package name jeu de mots is due to the fact that the intent is to have it solely support DNS over TLS requests, since regular DNS is plaintext, enables ISP spying/injection and is generally fraught with peril. All forms of DNS lookups are supported. The catch is that you have to point it at a DNS over TLS-capable resolver. The package defaults to Quad9 (9.9.9.9) because I trust them more than Google or Cloudflare (btw: that’s not saying much as I trust used car salesfolks more than all three of them). Keep an eye (or RSS reader) peeled on the $WORK blog over the coming weeks as I’ll have some analysis and data on a few hundred DNS over TLS endpoints you can use thanks to a new study developed by cow-orkers Jon Hart and Shan Skidar.

There’s also a toy package forecequotes that is more “have fun with the cli & crayon packages” than anything else. But if you like Star Wars, random quote APIs and want to integrate richer command line interface output into your work, then definitely give it a peek.
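
If “richer command line interface output” is an unfamiliar phrase, this is the flavor of thing cli + crayon make easy (generic calls, nothing specific to the package):

library(cli)
library(crayon)

cli_h1("Quote of the day")
cli_alert_success("Fetched 1 quote from the API")
cat(bold(green('"Do. Or do not. There is no try."')), "\n")
cli_alert_info(paste0("Source: ", blue("The Empire Strikes Back")))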

Finally, I haven’t used R’s direct C interface in a while (since Rcpp is addictive and handy) and wanted to keep those skills fresh, so I made a wrapper to an old (in internet years) IP address trie C library. The underlying library is much slower than what we use in iptools but it works, does a bit more than its iptools counterpart and covers data marshaling, external pointer handling, and attribute/class setting so it may be a half-decent reference package for using the R<->C bridge.

FIN

If you know of more/better ice-out data please drop an issue in the Bigelow Labs’ iceout repo and I’ll get it integrated. And, if you do your own ice-out exploration definitely blog about it, tell R Weekly and drop a note in the comments.

Here are links to all the mentioned packages grouped by social coding platform (so you can interact/collaborate wherever you feel most comfortable working):

sr.ht

GitLab

GitHub

’Tis the season for finding out how well Maine fisherfolk did last year; specifically, Maine lobsterfolk.

Most of the news sites in Maine do a feature on the annual landings (here’s one from Bangor Daily News). There was a marked decline — the largest ever — in both poundage and revenue in 2017 and many sources point to the need to improve fishery management to help ensure both the environmental and economic health of the state.

My preferred view for this annual catch comparison is a connected scatterplot, tracing a path along the years. That way you get the feel of a time-series with the actual poundage-to-value without having to resort to two charts or (heaven forbid) a dual-geom/dual-axis chart.

The State of Maine Department of Marine Resources makes the data available but it’s in a PDF:

Thankfully, the PDF is not obfuscated and is just a plain table so it’s easy to parse and turn into:

The code to retrieve the PDF, parse it and produce said connected scatterplot is below.

library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

lobster_by_county_url <- "https://www.maine.gov/dmr/commercial-fishing/landings/documents/lobster.county.pdf"
lobster_by_county_fil <- basename(lobster_by_county_url)

if (!file.exists(lobster_by_county_fil)) download.file(lobster_by_county_url, lobster_by_county_fil)

# read in the PDF
lobster_by_county_pgs <- pdftools::pdf_text(lobster_by_county_fil)

map(lobster_by_county_pgs, stri_split_lines) %>% # split each page into lines
  flatten() %>%
  flatten_chr() %>%
  keep(stri_detect_fixed, "$") %>% # keep only lines with "$" in them
  stri_trim_both() %>% # clean up white space
  stri_split_regex("\\ +", simplify = TRUE) %>% # get the columns
  as_data_frame() %>%
  mutate_at(c("V3", "V4"), lucr::from_currency) %>% # turn the formatted text into numbers
  set_names(c("year", "county", "pounds", "value")) %>% # better column names
  filter(county != "TOTAL") %>% # we'll calculate our own, thank you
  mutate(year = as.Date(sprintf("%s-01-01", year))) %>% # I like years to be years for plotting
  mutate(county = stri_trans_totitle(county)) -> lobster_by_county_df

arrange(lobster_by_county_df, year) %>%
  mutate(value = value / 1000000, pounds = pounds / 1000000) %>% # easier on the eyes
  group_by(year) %>%
  summarise(pounds = sum(pounds), value = sum(value)) %>%
  mutate(year_lab = lubridate::year(year)) %>%
  mutate(highlight = ifelse(year_lab == 2017, "2017", "Other")) %>% # so we can highlight 2017
  ggplot(aes(pounds, value)) +
  geom_path() +
  geom_label(
    aes(label = year_lab, color = highlight, size = highlight),
    family = font_ps, show.legend = FALSE
  ) +
  scale_x_comma(name = "Pounds (millions) →", limits = c(0, 150)) +
  scale_y_comma(name = "$ USD (millions) →", limits = c(0, 600)) +
  scale_color_manual(values = c("2017" = "#742111", "Other" = "#2b2b2b")) +
  scale_size_manual(values = c("2017" = 6, "Other" = 4)) +
  labs(
    title = "Historical Maine Fisheries Landings Data — Lobster (1964-2017)",
    subtitle = "All counties combined; Not adjusted for inflation",
    caption = "The 2002 & 2003 landings may possibly reflect the increased effort by DMR to collect voluntary landings from some lobster dealers;\nLobster reporting became mandatory in 2004 for all Maine dealers buying directly from harvesters.\nSource: <https://www.maine.gov/dmr/commercial-fishing/landings/historical-data.html>"
  ) +
  theme_ipsum_ps(grid = "XY")

Even though it’s still at version `0.4`, the `ggvis` package has quite a bit of functionality and is highly useful for exploratory data analysis (EDA). I wanted to see how geographical visualizations would work under it, so I put together six examples that show how to use various features of `ggvis` for presenting static & interactive cartographic creations. Specifically, the combined exercises demonstrate:

– basic map creation
– basic maps with points/labels
– dynamic choropleths (with various scales & tooltips)
– applying projections and custom color fills (w/tooltips)
– apply projections and projecting coordinates for plotting (w/tooltips that handle missing data well)

If you want to skip the post and head straight to the code you can [head on over to github](https://github.com/hrbrmstr/ggvis-maps), [peruse the R markdown file on RPubs](http://rpubs.com/hrbrmstr/ggvis-maps) or play with the [shiny version](https://hrbrmstr.shinyapps.io/ggvis-maps/). You’ll need that code to actually run any of the snippets below since I’m leaving out some code-cruft for brevity. Also, all the map graphics below were generated by saving the `ggvis` output as PNG files (for best browser compatibility), right from the `ggvis` renderer popup. Click/tap each for a larger version.

### Basic Polygons


Even though we still need the help of `ggplot2`’s `fortify`, it’s pretty straightforward to crank out a basic map in `ggvis`:

maine <- readOGR("data/maine.geojson", "OGRGeoJSON")
 
map <- ggplot2::fortify(maine, region="name")
 
map %>%
  ggvis(~long, ~lat) %>%
  group_by(group, id) %>%
  layer_paths(strokeOpacity:=0.5, stroke:="#7f7f7f") %>%
  hide_legend("fill") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=400, height=600, keep_aspect=TRUE)

The code is very similar to one of the ways we render the same image in `ggplot`. We first read in the shapefile, convert it into a data frame we can use for plotting, group the polygons properly, render them with `layer_paths` and get rid of chart junk. Now, `ggvis` (to my knowledge as of this post) has no equivalent of `coord_map`, so we have to rely on the positioning in the projection and work out the proper `height` and `width` parameters to use with a uniform aspect ratio (`keep_aspect=TRUE`).
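
One way to ballpark those `height`/`width` values is to derive them from the bounding box of the fortified polygons (a rough heuristic, not a real projection; the cosine term just corrects the longitude span for latitude):

# bounding box of the fortified map data
x_rng <- range(map$long, na.rm = TRUE)
y_rng <- range(map$lat, na.rm = TRUE)

# approximate height:width ratio for unprojected lon/lat data
asp <- diff(y_rng) / (diff(x_rng) * cos(mean(y_rng) * pi / 180))

w <- 400
h <- round(w * asp) # lands near the 600 used above for Maine's extent
c(width = w, height = h)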

>For those not familiar with `ggvis`, the `~` operator lets us tell `ggvis` which columns (or expressions using columns) to map to function parameters and the `:=` operator just tells it to use a raw, un-scaled value. You can find out more about [why the tilde was chosen](https://github.com/rstudio/ggvis/issues/173) or about the [various other special operators](http://ggvis.rstudio.com/ggvis-basics.html).

### Basic Annotations


You can annotate maps in an equally straightforward way.

county_centers <- maine %>%
  gCentroid(byid=TRUE) %>%
  data.frame %>%
  cbind(name=maine$name %>% gsub(" County, ME", "", .) )
 
map %>%
  group_by(group, id) %>%
  ggvis(~long, ~lat) %>%
  layer_paths(strokeWidth:=0.25, stroke:="#7f7f7f") %>%
  layer_points(data=county_centers, x=~x, y=~y, size:=8) %>%
  layer_text(data=county_centers,
             x=~x+0.05, y=~y, text:=~name,
             baseline:="middle", fontSize:=8) %>%
  hide_legend("fill") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=400, height=600, keep_aspect=TRUE)

>Note that the `group_by` works either before or after the `ggvis` call. Consistent pipe idioms FTW!

Here, we’re making a data frame out of the county centroids and names then using that in a call to `layer_points` and `layer_text`. Note how you can change the data source for each layer (just like in `ggplot`) and use expressions just like in `ggplot` (we moved the text just slightly to the right of the dot).

>Since `ggvis` outputs [vega](http://trifacta.github.io/vega/) and uses [D3](http://d3js.org/) for rendering, you should probably take a peek at those frameworks as it will help you understand the parameter name differences between `ggvis` and `ggplot`.

### Basic Choropleths


There are actually two examples of this basic state choropleth in the code, but one just uses a different color scale, so I’ll just post the code for one here. This is also designed for interactivity (it has tooltips and lets you change the fill variable) so you should run it locally or look at the [shiny version](https://hrbrmstr.shinyapps.io/ggvis-maps/).

# read in some crime & population data for maine counties
me_pop <- read.csv("data/me_pop.csv", stringsAsFactors=FALSE)
me_crime <- read.csv("data/me_crime.csv", stringsAsFactors=FALSE)
 
# get it into a form we can use (and only use 2013 data)
 
crime_1k <- me_crime %>%
  filter(year==2013) %>%
  select(1,5:12) %>%
  left_join(me_pop) %>%
  mutate(murder_1k=1000*(murder/population_2010),
         rape_1k=1000*(rape/population_2010),
         robbery_1k=1000*(robbery/population_2010),
         aggravated_assault_1k=1000*(aggravated_assault/population_2010),
         burglary_1k=1000*(burglary/population_2010),
         larceny_1k=1000*(larceny/population_2010),
         motor_vehicle_theft_1k=1000*(motor_vehicle_theft/population_2010),
         arson_1k=1000*(arson/population_2010))
 
# normalize the county names
 
map %<>% mutate(id=gsub(" County, ME", "", id)) %>%
  left_join(crime_1k, by=c("id"="county"))
 
# this is for the tooltip. it does a lookup into the crime data frame and
# then uses those values for the popup
 
crime_values <- function(x) {
  if(is.null(x)) return(NULL)
  y <- me_crime %>% filter(year==2013, county==x$id) %>% select(1,5:12)
  sprintf("<table width='100%%'>%s</table>",
          paste0("<tr><td style='text-align:left'>", names(y),
         ":</td><td style='text-align:right'>", format(y), collapse="</td></tr>"))
}
 
map %>%
  group_by(group, id) %>%
  ggvis(~long, ~lat) %>%
  layer_paths(fill=input_select(label="Crime:",
                                choices=crime_1k %>%
                                  select(ends_with("1k")) %>%
                                  colnames %>% sort,
                                id="Crime",
                                map=as.name),
              strokeWidth:=0.5, stroke:="white") %>%
  scale_numeric("fill", range=c("#bfd3e6", "#8c6bb1" ,"#4d004b")) %>%
  add_tooltip(crime_values, "hover") %>%
  add_legend("fill", title="Crime Rate/1K Pop") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=400, height=600, keep_aspect=TRUE)

You can omit the `input_select` bit if you just want to do a single choropleth (just map `fill` to a single variable). The `input_select` tells `ggvis` to make a minimal bootstrap sidebar-layout scaffold around the actual graphic to enable variable interaction. In this case we let the user explore different types of crimes (by 1K population) and we also have a tooltip that shows the #’s of each crime in each county as we hover.
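
For reference, the static single-variable version of that choropleth just swaps the `input_select` for a direct column mapping (sketched here with the burglary rate; any of the `*_1k` columns will do):

map %>%
  group_by(group, id) %>%
  ggvis(~long, ~lat) %>%
  layer_paths(fill=~burglary_1k, strokeWidth:=0.5, stroke:="white") %>%
  scale_numeric("fill", range=c("#bfd3e6", "#8c6bb1", "#4d004b")) %>%
  add_legend("fill", title="Burglaries/1K Pop") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=400, height=600, keep_aspect=TRUE)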

### Projections and Custom Colors


We’re pretty much (mostly) re-creating a [previous post](http://rud.is/b/2014/11/16/moving-the-earth-well-alaska-hawaii-with-r/) in this example and making a projected U.S. map with drought data (as of 2014-12-23).

us <- readOGR("data/us.geojson", "OGRGeoJSON")
us <- us[!us$STATEFP %in% c("02", "15", "72"),]
 
# same method to change the projection
 
us_aea <- spTransform(us, CRS("+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"))
 
map <- ggplot2::fortify(us_aea, region="GEOID")
 
droughts <- read.csv("data/dm_export_county_20141223.csv")
droughts$id <- sprintf("%05d", as.numeric(as.character(droughts$FIPS)))
droughts$total <- with(droughts, (D0+D1+D2+D3+D4)/5)
 
map_d <- merge(map, droughts, all.x=TRUE)
 
# pre-make custom colors per county
 
ramp <- colorRampPalette(c("white", brewer.pal(n=9, name="YlOrRd")), space="Lab")
 
map_d$fill_col <- as.character(cut(map_d$total, seq(0,100,10), include.lowest=TRUE, labels=ramp(10)))
map_d$fill_col <- ifelse(is.na(map_d$fill_col), "#FFFFFF", map_d$fill_col)
 
drought_values <- function(x) {
  if(is.null(x) | !(x$id %in% droughts$id)) return(NULL)
  y <- droughts %>% filter(id==x$id) %>% select(1,3,4,6:10)
  sprintf("<table width='100%%'>%s</table>",
          paste0("<tr><td style='text-align:left'>", names(y),
         ":</td><td style='text-align:right'>", format(y), collapse="</td></tr>"))
}
 
map_d %>%
  group_by(group, id) %>%
  ggvis(~long, ~lat) %>%
  layer_paths(fill:=~fill_col, strokeOpacity := 0.5, strokeWidth := 0.25) %>%
  add_tooltip(drought_values, "hover") %>%
  hide_legend("fill") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=900, height=600, keep_aspect=TRUE)

It’s really similar to the previous code (and you may/should be familiar with the Albers transform from the previous post).

### World Domination


world <- readOGR("data/ne_50m_admin_0_countries.geojson", layer="OGRGeoJSON")
world <- world[!world$iso_a3 %in% c("ATA"),]
world <- spTransform(world, CRS("+proj=wintri"))
 
map_w <- ggplot2::fortify(world, region="iso_a3")
 
# really quick way to get coords from a KML file
 
launch_sites <- rbindlist(lapply(ogrListLayers("data/launch-sites.kml")[c(1:2,4:9)], function(layer) {
  tmp <- readOGR("data/launch-sites.kml", layer)
  places <- data.table(coordinates(tmp)[,1:2], as.character(tmp$Name))
}))
setnames(launch_sites, colnames(launch_sites), c("lon", "lat", "name"))
 
# now, project the coordinates we extracted
 
coordinates(launch_sites) <-  ~lon+lat
launch_sites <- as.data.frame(SpatialPointsDataFrame(spTransform(
  SpatialPoints(launch_sites, CRS("+proj=longlat")), CRS("+proj=wintri")),
  launch_sites@data))
 
map_w %>%
  group_by(group, id) %>%
  ggvis(~long, ~lat) %>%
  layer_paths(fill:="#252525", stroke:="white", strokeOpacity:=0.5, strokeWidth:=0.25) %>%
  layer_points(data=launch_sites, 
               x=~lon, y=~lat, 
               fill:="#cb181d", stroke:="white", 
               size:=25, fillOpacity:=0.5, strokeWidth:=0.25) %>%
  hide_legend("fill") %>%
  hide_axis("x") %>% hide_axis("y") %>%
  set_options(width=900, height=500, keep_aspect=TRUE)

The main difference in this example is the re-projection of the data we’re using. I grabbed a KML file of rocket launch sites from Wikipedia and made it into a data frame, then [re]projected those points into Winkel-Tripel for use with the Winkel-Tripel world map made at the beginning of the example. The `ggplot` `coord_map` handles these transforms for you, so until there’s a `ggvis` equivalent, you’ll need to do it this way (though, there’s no Winkel-Tripel projection in the `mapproject` package, so you kinda need to do it this way for `ggplot` as well for this projection).

### Wrapping Up

There’s code up on github for the “normal”, `Rmd` and Shiny versions of these examples. Give each a go and try tweaking various parameters, changing up the tooltips or using your own data. Don’t forget to drop a note in the comments with any of your creations and use github for any code issues.

UPDATE: You can now run this as a local Shiny app by entering shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE) at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.

The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains [historical daily climate records](http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html) for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their [data files](http://cdiac.ornl.gov/ftp/ushcn_daily/) indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).

To start, we’ll need a bit more than base R to get the job done:

library(pbapply)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(stringi)

In all honesty, `pbapply`, `dplyr` and `stringi` are not necessary, but they definitely make life easier by (respectively) giving us:

– free progress bars for `*apply` operations,
– highly effective data manipulation idioms, and
– a handy utility for converting strings to title case.

With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have `#` characters in their titles, so we have to account for that when we call `read.fwf`. After reading in the station database we use a naive-but-usable distance calculation to find the closest station:

stations <- read.fwf("data/ushcn-stations.txt",
                     widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                     col.names=c("coop_id", "latitude", "longitude", "elevation",
                                 "state", "name", "component_1", "component_2",
                                 "component_3", "utc_offset"),
                     colClasses=c("character", "numeric", "numeric", "numeric",
                                  "character", "character", "character", "character",
                                  "character", "character"),
                     comment.char="", strip.white=TRUE)
 
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                          (stations$longitude-lon)^2))
  stations[index,]
}
 
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
 
##     coop_id latitude longitude elevation state   name component_1 component_2 component_3 utc_offset
633  272174    43.15    -70.95      24.4    NH DURHAM      ------      ------      ------         +5

As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:

closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
##    coop_id latitude longitude elevation state             name component_1 component_2 component_3 utc_offset
10  176905  43.6497  -70.3003      13.7    ME PORTLAND JETPORT      ------      ------      ------         +5

The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same evaluation for Portland, Maine.

Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:

Variable   Columns   Type
COOP ID    1-6       Character
YEAR       7-10      Integer
MONTH      11-12     Integer
ELEMENT    13-16     Character
VALUE1     17-21     Integer
MFLAG1     22        Character
QFLAG1     23        Character
SFLAG1     24        Character
VALUE2     25-29     Integer
MFLAG2     30        Character
QFLAG2     31        Character
SFLAG2     32        Character
…          …         (pattern repeats for each of the 31 days)
VALUE31    257-261   Integer
MFLAG31    262       Character
QFLAG31    263       Character
SFLAG31    264       Character

We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use `read.fwf` again, we’ll take a different approach (since we ultimately need the data in long format) and use `readLines` to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.

Next, we setup nested `lapply` calls to build a long data frame from each month then combine them all together into a single data frame:

snow <- readLines("data/state27_NH.txt")
 
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
 
snow_dat <- rbindlist(pblapply(snow, function(x) {
 
  rbindlist(lapply(1:31, function(i) {
 
    # record format described here:
    # http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt
 
    start <- 17 + (i-1)*8
 
    list(coop_id=substr(x, 1, 6),
         date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
         element=substr(x, 13, 16),
         value=as.numeric(substr(x, start, start+4)),
         mflag=substr(x, start+5, start+5),
         qflag=substr(x, start+6, start+6),
         sflag=substr(x, start+7, start+7))
 
  }))
 
}))

Now, we’ll clean up the records even further by removing invalid entries (those with a `value` == `-9999`) and convert record dates to actual `Date` objects and filter out invalid dates:

snow_dat <- snow_dat %>% filter(value != -9999)
 
# since the data file has 31 days for each record regardless of whether
# that's valid or not, we take a shortcut to remove invalid dates by doing
# a vectorized Date conversion, then removing records with NA dates
 
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!is.na(date))
 
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")

Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value then add the number of total days in the start year to those values for all months <= May. We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.

snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
                       snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),
                       snow_dat$doy)

Now, the fun begins. We use (mostly) `dplyr` to extract the first snowfall day from each year, then make a dot-line plot from the data:

first <- snow_dat %>%
  filter(value>0) %>%                           # ignore 0 values
  filter(date>=as.Date("1950-01-01")) %>%       # start at 1950 (arbitrary)
  merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
  group_by(coop_id, year) %>%                   # group by station and year
  arrange(doy) %>%                              # sort by our munged day of year
  filter(row_number(doy) == 1) %>%              # grab the first entry by group
  select(name, state, date, value, doy)         # we only need some variables
 
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
 
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg

While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use `summary` to figure out the quartiles (for labeling the chart) for the actual date values:

wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
 
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
 
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
                              limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg

Finally, we’ll combine them together to get the finished product:

grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))


And, do the same for Portland:

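That is, run the same pipeline with the Maine state file and the Portland Jetport station id (176905) swapped in. A sketch (the Maine daily file should follow the same stateNN_XX.txt naming pattern as the NH one, so adjust the name if your download differs):

snow_me <- readLines("data/state17_ME.txt") # assumed file name for the Maine records
snow_me <- grep("SNOW", snow_me, value=TRUE)
snow_me <- grep("^176905", snow_me, value=TRUE)
# ...then re-run the rbindlist()/cleanup/plotting steps above on snow_me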

There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.
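
A couple of cheap sanity checks along those lines (run here against the NH objects built above; the 264-character minimum comes from the record layout table earlier):

# every daily record should be at least 264 characters wide (16-char header + 31 x 8-char day groups)
sum(nchar(snow) < 264)

# how many retained readings carry a non-blank quality flag?
table(snow_dat$qflag[snow_dat$qflag != " "])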

You can find the complete, commented code up on [github](https://github.com/hrbrmstr/snowfirst).

Most of my free coding time has been spent tweaking a D3-based live power outage tracker for Central Maine Power customers (there’s also a woefully less-featured Shiny app for it). There is some R associated with the D3 vis, but it’s limited to a cron job that makes the CSV files for the sparklines in the D3 vis by

  • reading historical outage data from a database of scraped readings I’ve been keeping
  • filling out the time series
  • reducing it to an hourly time series, and
  • trimming the data set to the last 30 days of records:
#!/usr/bin/Rscript
# running in a cron job so no spurious text output pls
options(warn=-1)
options(show.error.messages=FALSE)
 
suppressMessages(library(methods))
suppressMessages(library(zoo))
library(chron)
library(xts)
library(reshape2)
library(DBI)
library(RMySQL)
 
m <- dbDriver("MySQL");
con <- dbConnect(m, user='DBUSER', password='DBPASSWORD', host='localhost', dbname='DBNAME');
res <- dbSendQuery(con, "SELECT * FROM outage") # cld just pull the 2 fields I need
outages <- fetch(res, n = -1)
outages$ts <- as.POSIXct(gsub("\\:[0-9]+\\..*$","", outages$ts), format="%Y-%m-%d %H:%M")
 
# for each county we have data for
for (county in unique(outages$county)) {
 
  # get the 2 fields I need (shld prbly filter that in the SELECT)
  outage.raw <- outages[outages$county == county,c(1,4)]
 
  # make it a zoo object
  outage.zoo <- zoo(outage.raw$withoutpower, outage.raw$ts)
 
  # fill out the 15-minute readings
  complete.zoo <- merge(outage.zoo, zoo(, seq(start(outage.zoo), max(outages$ts), by="15 min")), all=TRUE)
  complete.zoo[is.na(complete.zoo)] <- 0
 
  # shrink to hourly and trim at 30 days
  hourly.zoo <- last(to.hourly(complete.zoo), "30 days")
 
  # crank out a CSV  
  df <- data.frame(hourly.zoo)
  df <- data.frame(ts=rownames(df), withoutPower=df$complete.zoo.High)
 
  write.csv(df, sprintf("OUTPUT_LOCATION/%s.csv",county), row.names=FALSE)
 
}

I knew there were other power companies in Maine, but CMP is the largest, so I focused my attention on getting data from it. After hearing an outage update on MPBN I decided to see if Bangor Hydro Electric had scrape-able content and it turns out there was a lovely (well, as lovely as XML can be) XML file delivered on the page with this “meh” Google push-pin map:

(screenshot: Bangor Hydro Electric outage map)

The XML file is used to build the markers for the map and has marker tags that look like this:

<marker name="Exeter" outages="18" 
        lat="44.96218" lng="-69.12253" 
        streets="BUTTERS  RD;CHAMPEON  RD;GREENBUSH  RD;" 
        reflabels="87757-1;84329-1;85032-1;"/>

I’m really only tracking county-level data and BHE does not provide that, even in the huge table of street-level outages that you’ll see on that outage page. I decided to use R to aggregate the BHE data to the county level via the “point-in-polygon” method.

Getting Right To the Point

To perform the aggregation in R, I needed county-level polygons for Maine. I already had that thanks to the previous work, but I wanted to optimize the search process, so I took the US counties shapefile and used OGR from the GDAL (Geospatial Data Abstraction Library) suite to extract just the Maine county polygons:

ogr2ogr -f "ESRI Shapefile" \
        -where "STATE_NAME = 'MAINE'" maine.shp counties.shp

You can see both a reduction in file size and complexity via ogrinfo:

$ ll *shp
-rwxr-xr-x@ 1 bob  staff  1517624 Jan  8  2010 counties.shp
-rw-r--r--  1 bob  staff    12588 Dec 26 23:03 maine.shp
$ ogrinfo -sql "SELECT COUNT(*) FROM counties" counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(counties):0
  COUNT_* (Integer) = 3141
$ ogrinfo -sql "SELECT COUNT(*) FROM maine" maine.shp
INFO: Open of 'maine.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: maine
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(maine):0
  COUNT_* (Integer) = 16

The conversion reduces the file size from 1.5MB to ~12K and shrinks the number of polygons from 3,141 to 16. The counties.shp and maine.shp shapefiles were built from U.S. census data and have scads more information that you might want to use (e.g., for correlation with the outage info, though storms are the prime causal entity for the outages :-):

$ ogrinfo -al -so counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 3141
Extent: (-176.806138, 18.921786) - (-66.969271, 71.406235)
Layer SRS WKT:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137.0,298.257223563]],
    PRIMEM["Greenwich",0.0],
    UNIT["Degree",0.0174532925199433]]
NAME: String (32.0)
STATE_NAME: String (25.0)
STATE_FIPS: String (2.0)
CNTY_FIPS: String (3.0)
FIPS: String (5.0)
POP2000: Integer (9.0)
POP2007: Integer (9.0)
POP00_SQMI: Real (10.1)
POP07_SQMI: Real (7.1)
WHITE: Integer (9.0)
BLACK: Integer (9.0)
AMERI_ES: Integer (9.0)
ASIAN: Integer (9.0)
HAWN_PI: Integer (9.0)
OTHER: Integer (9.0)
MULT_RACE: Integer (9.0)
HISPANIC: Integer (9.0)
MALES: Integer (9.0)
FEMALES: Integer (9.0)
AGE_UNDER5: Integer (9.0)
AGE_5_17: Integer (9.0)
AGE_18_21: Integer (9.0)
AGE_22_29: Integer (9.0)
AGE_30_39: Integer (9.0)
AGE_40_49: Integer (9.0)
AGE_50_64: Integer (9.0)
AGE_65_UP: Integer (9.0)
MED_AGE: Real (9.1)
MED_AGE_M: Real (9.1)
MED_AGE_F: Real (9.1)
HOUSEHOLDS: Integer (9.0)
AVE_HH_SZ: Real (9.2)
HSEHLD_1_M: Integer (9.0)
HSEHLD_1_F: Integer (9.0)
MARHH_CHD: Integer (9.0)
MARHH_NO_C: Integer (9.0)
MHH_CHILD: Integer (9.0)
FHH_CHILD: Integer (9.0)
FAMILIES: Integer (9.0)
AVE_FAM_SZ: Real (9.2)
HSE_UNITS: Integer (9.0)
VACANT: Integer (9.0)
OWNER_OCC: Integer (9.0)
RENTER_OCC: Integer (9.0)
NO_FARMS97: Real (11.0)
AVG_SIZE97: Real (11.0)
CROP_ACR97: Real (11.0)
AVG_SALE97: Real (7.2)
SQMI: Real (8.1)
OID: Integer (9.0)

With the new shapefile in hand, the basic workflow to get BHE outages at the county level is:

  • Read and parse the BHE outages XML file to get the lat/long pairs
  • Build a SpatialPoints object out of those pairs
  • Make a SpatialPolygonsDataFrame out of the reduced Maine counties shapefile
  • Overlay the points in the polygons and get the feature metadata intersection (including county)
  • Aggregate the outage data

The R code (below) does all that and is liberally commented. One has to appreciate how succinct the XML parsing is and the beautiful simplicity of the over() function (which does all the really hard work).

library(XML)
library(maptools)
library(sp)
library(plyr)
 
# Small script to get county-level outage info from Bangor Hydro
# Electric's town(-ish) level info
#
# BHE's outage google push-pin map is at
#   http://apps.bhe.com/about/outages/outage_map.cfm
 
# read BHE outage XML file that was intended for the google map
# yep. One. Line. #takethatpython
 
doc <- xmlTreeParse("http://apps.bhe.com/about/outages/outage_map.xml", 
                    useInternalNodes=TRUE)
 
# xmlToDataFrame() coughed up blood on that simple file, so we have to
# resort to menial labor to bend the XML to our will
 
doc.ls <- xmlToList(doc)
doc.attrs <- doc.ls$.attrs
doc.ls$.attrs <- NULL
 
# this does the data frame conversion magic, tho it winces a bit
 
suppressWarnings(doc.df <- data.frame(do.call(rbind, doc.ls), 
                                      stringsAsFactors=FALSE))
 
# need numbers for some of the columns (vs strings)
 
doc.df$outages <- as.numeric(doc.df$outages)
doc.df$lat <- as.numeric(doc.df$lat)
doc.df$lng <- as.numeric(doc.df$lng)
 
# SpatialPoints likes matrices, note that it's in LON, LAT order
# that always messes me up for some reason
 
doc.m <- as.matrix(doc.df[,c(4,3)])
doc.pts <- SpatialPoints(doc.m)
 
# I trimmed down the country-wide counties file from
#   http://www.baruch.cuny.edu/geoportal/data/esri/usa/census/counties.zip
# with
#   ogr2ogr -f "ESRI Shapefile" -where "STATE_NAME = 'MAINE'" maine.shp counties.shp
# to both save load time and reduce the number of iterations for over() later
 
counties <- readShapePoly("maine.shp", repair=TRUE, IDvar="NAME")
 
# So, all the above was pretty much just for this next line, which does
# the "is this point 'a' in polygon 'b'" test automagically for us.
 
found.pts <- over(doc.pts, counties)
 
# steal the column we need (county name) and squirrel it away with outage count
 
doc.df$county <- found.pts$NAME
doc.sub <- doc.df[,c(2,7)]
 
# aggregate the result to get outage count by county
 
count(doc.sub, c("county"), wt_var="outages")
 
##      county freq
##1    Hancock 4440
##2  Penobscot  869
##3      Waldo   28
##4 Washington  545
##5       <NA>  328

Astute readers will notice unresolved points (the NAs). I suspect they are right on coastal boundaries that were probably missed in these simplified county polygons. We can see what they are by looking at the NA entries in the merged data frame:

doc.df[is.na(doc.df$county),c(1:4)]
           name outages      lat       lng
35    Deer Isle       1 44.22451 -68.67778
38   Harborside     166 44.34900 -68.81555
39     Sorrento      43 44.47341 -68.17723
62    Bucksport      71 44.57369 -68.79562
70    Penobscot      40 44.44552 -68.81780
78      Bernard       1 44.24119 -68.35585
79   Stonington       5 44.15619 -68.66669
80 Roque Bluffs       1 44.61286 -67.47971

But a map would be more useful for those not familiar with Maine geography/extents:


library(ggplot2)
 
ff = fortify(counties, region = "NAME")
 
missing <- doc.df[is.na(doc.df$county),]
 
gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_path(aes(group = group), size=0.15, fill="black")
gg <- gg + geom_point(data=missing, aes(x=lng, y=lat), 
                      color="#feb24c", size=3)
gg <- gg + coord_map(xlim=extendrange(range(missing$lng)), ylim=extendrange(range(missing$lat)))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

The “zoom in” is done by taking and slightly extending the range of the extracted points via range() and extendrange(), reproduced below:

range(missing$lng)
[1] -68.81780 -67.47971
range(missing$lat)
[1] 44.15619 44.61286
 
extendrange(range(missing$lng))
[1] -68.88470 -67.41281
extendrange(range(missing$lat))
[1] 44.13336 44.63569

It turns out my suspicion was right, so to use this in “production” I’ll need a more accurate shapefile for Maine counties (which I have, but Descent is calling me, so it will have to wait for another day).

I’ll leave you with a non-Google push-pin map of outages that you can build upon (it needs some tweaking):


gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_polygon(aes(group = group), size=0.15, fill="black", color="#7f7f7f")
gg <- gg + geom_point(data=doc.df, aes(x=lng, y=lat, alpha=outages, size=outages), 
                      color="#feb24c")
gg <- gg + coord_map(xlim=c(-71.5,-66.75), ylim=c(43,47.5))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

You can find all the R code in one, compact gist.

It started with a local R version and migrated to a Shiny version and is now in full D3 glory.

Some down time gave me the opportunity to start a basic D3 version of the outage map, but it needs a bit of work as it relies on a page meta refresh to update (every 5 minutes) vs an inline element dynamic refresh. The fam was getting a bit irked at coding time on Thanksgiving, so keep watching the following gists for updates after the holiday:

UPDATE: A Shiny (dynamic) version of this is now available.

We had yet-another power outage this morning due to the weird weather patterns of the week and it was the final catalyst I needed to crank out some R code to map the affected counties.

Central Maine Power provides an outage portal where folks can both report outages and see areas impacted by outages. They use an SAP web service that generates the outage table and the aforelinked page just embeds that URL (http://www3.cmpco.com/OutageReports/CMP.html) as an iframe. We can use the XML package in R to grab that HTML file, parse it, extract the table and then send the data to ggplot.

It should be a good starting point for anyone wishing to do something similar. The next itch to scratch for me on this is a live D3 map that uses the outage table with drill-down capabilities to the linked data.

library(maps)
library(maptools)
library(ggplot2)
library(plyr)
library(XML)
 
cmp.url <- "http://www3.cmpco.com/OutageReports/CMP.html"
# get outage table (first one on the cmp.url page)
cmp.node <- getNodeSet(htmlParse(cmp.url),"//table")[[1]]
cmp.tab <- readHTMLTable(cmp.node,
                         header=c("subregion","total.customers","without.power"),
                         skip.rows=c(1,2,3),
                         trim=TRUE, stringsAsFactors=FALSE)
 
# clean up the table so it's easier to work with
cmp.tab <- cmp.tab[-nrow(cmp.tab),] # get rid of last row
cmp.tab$subregion <- tolower(cmp.tab$subregion)
cmp.tab$total.customers <- as.numeric(gsub(",","",cmp.tab$total.customers))
cmp.tab$without.power <- as.numeric(gsub(",","",cmp.tab$without.power))
 
# get maine map with counties
county.df <- map_data('county')
me <- subset(county.df, region=="maine")
 
# get a copy with just the affected counties
out <- subset(me, subregion %in% cmp.tab$subregion)
 
# add outage info to it
out <- join(out, cmp.tab)
 
# plot the map
gg <- ggplot(me, aes(long, lat, group=group))
gg <- gg + geom_polygon(fill=NA, colour='gray50', size=0.25)
gg <- gg + geom_polygon(data=out, aes(long, lat, group=group, fill=without.power), 
                        colour='gray50', size=0.25)
gg <- gg + scale_fill_gradient2(low="#FFFFCC", mid="#FD8D3C", high="#800026")
gg <- gg + coord_map()
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="", title="CMP (Maine) Customers Without Power by County")
gg <- gg + theme(panel.border = element_blank(),
                 panel.background = element_blank(),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="left",
                 legend.title=element_blank())
gg


Don’t let the morons in Congress stop you from visiting our fair state during this beautiful, colorful autumn season. Below is a Google Maps view of known, open campgrounds that will let you experience the best our state has to offer this time of year.

Zoom, pan and click on the URLs for more campground info.