Tag Archives: post
The Washingon Post did another great story+vis, this time on states [Spending seized assets](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/).
According to their sub-head:
>_Since 2008, about 5,400 police agencies have spent $2.5 billion in proceeds from cash and property seized under federal civil forfeiture laws. Police suspected the assets were linked to crime, although in 81 percent of cases no one was indicted._
Their interactive visualization lets you drill down into each state to examine the spending in each category. Since the WaPo team made the [data available](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/data/all.json) [JSON] I thought it might be interesting to take a look at a comparison across states (i.e. who are the “big spenders” of this siezed hoarde). Here’s a snippet of the JSON:
{"states": [ { "st": "AK", "stn": "Alaska", "total": 8470032, "cats": [{ "weapons": 1649832, "electronicSurv": 402490, "infoRewards": 760730, "travTrain": 848128, "commPrograms": 121664, "salaryOvertime": 776766, "other": 1487613, "commComp": 1288439, "buildImprov": 1134370 }], "agencies": [ { "aid": "AK0012700", "aname": "Airport Police & Fire Ted Stevens Anch Int'L Arpt", "total": 611553, "cats": [{ "weapons": 214296, "travTrain": 44467, "other": 215464, "commComp": 127308, "buildImprov": 10019 }] }, { "aid": "AK0010100", "aname": "Anchorage Police Department", "total": 3961497, "cats": [{ "weapons": 1104777, "electronicSurv": 94741, "infoRewards": 743230, "travTrain": 409474, "salaryOvertime": 770709, "other": 395317, "commComp": 249220, "buildImprov": 194029 }] },
Getting the data was easy (in R, of course!). Let’s setup the packages we’ll need:
library(data.table) library(dplyr) library(tidyr) library(ggplot2) library(scales) library(grid) library(statebins) library(gridExtra)
We also need `jsonlite`, but only to parse the data (which I’ve downloaded locally), so we’ll just do that in one standalone line:
data <- jsonlite::fromJSON("all.json", simplifyVector=FALSE)
It’s not fair (or valid) to just compare totals since some states have a larger population than others, so we’ll show the data twice, once in raw totals and once with a per-capita lens. For that, we’ll need population data:
pop <- read.csv("http://www.census.gov/popest/data/state/asrh/2013/files/SCPRC-EST2013-18+POP-RES.csv", stringsAsFactors=FALSE) colnames(pop) <- c("sumlev", "region", "divison", "state", "stn", "pop2013", "pop18p2013", "pcntest18p") pop$stn <- gsub(" of ", " Of ", pop$stn)
We have to fix the `District of Columbia` since the WaPo data capitalizes the `Of`.
Now we need to extract the agency data. This is really straightforward with some help from the `data.table` package:
agencies <- rbindlist(lapply(data$states, function(x) { rbindlist(lapply(x$agencies, function(y) { data.table(st=x$st, stn=x$stn, aid=y$aid, aname=y$aname, rbindlist(y$cats)) }), fill=TRUE) }), fill=TRUE)
The `rbindlist` `fill` option is super-handy in the event we have varying columns (and, we do in this case). It’s also wicked-fast.
Now, we use some `dplyr` and `tidyr` to integrate the population information and summarize our data (OK, we cheat and use `melt`, but some habits are hard to break):
c_st <- agencies %>% merge(pop[,5:6], all.x=TRUE, by="stn") %>% gather(category, value, -st, -stn, -pop2013, -aid, -aname) %>% group_by(st, category, pop2013) %>% summarise(total=sum(value, na.rm=TRUE), per_capita=sum(value, na.rm=TRUE)/pop2013) %>% select(st, category, total, per_capita)
Let’s use a series of bar charts to compare state-against state. We’ll do the initial view with just raw totals. There are 9 charts, so this graphic scrolls a bit and you can select it to make it larger:
# hack to ordering the bars by kohske : http://stackoverflow.com/a/5414445/1457051 ##### c_st <- transform(c_st, category2=factor(paste(st, category))) c_st <- transform(c_st, category2=reorder(category2, rank(-total))) # pretty names ##### levels(c_st$category) <- c("Weapons", "Travel, training", "Other", "Communications, computers", "Building improvements", "Electronic surveillance", "Information, rewards", "Salary, overtime", "Community programs") gg <- ggplot(c_st, aes(x=category2, y=total)) gg <- gg + geom_bar(stat="identity", aes(fill=category)) gg <- gg + scale_y_continuous(labels=dollar) gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2) gg <- gg + facet_wrap(~category, scales = "free", ncol=1) gg <- gg + labs(x="", y="") gg <- gg + theme_bw() gg <- gg + theme(strip.background=element_blank()) gg <- gg + theme(strip.text=element_text(size=15, face="bold")) gg <- gg + theme(panel.margin=unit(2, "lines")) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(legend.position="none") gg
Comparison of Spending Category by State (raw totals)
There are definitely a few, repeating “big spenders” in that view, but is that the _real_ story? Let’s take another look, but factoring in state population:
# change bar order to match per-capita calcuation ##### c_st <- transform(c_st, category2=reorder(category2, rank(-per_capita))) # per-capita bar plot ##### gg <- ggplot(c_st, aes(x=category2, y=per_capita)) gg <- gg + geom_bar(stat="identity", aes(fill=category)) gg <- gg + scale_y_continuous(labels=dollar) gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2) gg <- gg + facet_wrap(~category, scales = "free", ncol=1) gg <- gg + labs(x="", y="") gg <- gg + theme_bw() gg <- gg + theme(strip.background=element_blank()) gg <- gg + theme(strip.text=element_text(size=15, face="bold")) gg <- gg + theme(panel.margin=unit(2, "lines")) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(legend.position="none") gg
Comparison of Spending Category by State (per-capita)
That certainly changes things! Alaska, West Virginia, and D.C. definitely stand out for “Weapons”, “Other” & “Information”, respectively, (what’s Rhode Island hiding in “Other”?!) and the “top 10” in each category are very different from the raw total’s view. We can look at this per-capita view with the `statebins` package as well:
st_pl <- vector("list", 1+length(unique(c_st$category))) j <- 0 for (i in unique(c_st$category)) { j <- j + 1 st_pl[[j]] <- statebins_continuous(c_st[category==i,], state_col="st", value_col="per_capita") + scale_fill_gradientn(labels=dollar, colours=brewer.pal(6, "PuBu"), name=i) + theme(legend.key.width=unit(2, "cm")) } st_pl[[1+length(unique(c_st$category))]] <- list(ncol=1) grid.arrange(st_pl[[1]], st_pl[[2]], st_pl[[3]], st_pl[[4]], st_pl[[5]], st_pl[[6]], st_pl[[7]], st_pl[[8]], st_pl[[9]], ncol=3)
Per-capita “Statebins” view of WaPo Seizure Data
(Doing this exercise also showed me I need to add some flexibility to the `statebins` package).
The (https://gist.github.com/hrbrmstr/27b8f44f573539dc2971) shows how to build a top-level category data table (along with the rest of the code in this post). I may spin this data up into an interactive D3 visualization in the next week or two (as I think it might work better than large faceted bar charts), so stay tuned!
A huge thank you to the WaPo team for making data available to others. Go forth and poke at it with your own questions and see what you can come up with (perhaps comparing by area of state)!
Markus Gessman (@MarkusGesmann) did a beautiful job [Visualising the seasonality of Atlantic windstorms](http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html) using small multiples, which was inspired by both a [post](http://freakonometrics.hypotheses.org/17113) by Arthur Charpentier (@freakonometrics) on using Markov spatial processes to “generate” hurricanes—which was [tweaked a bit](http://robertgrantstats.wordpress.com/2014/10/01/transparent-hurricane-paths-in-r/) by Robert Grant (@robertstats)—and [Gaston Sanchez](https://github.com/gastonstat)’s [Visualizing Hurricane Trajectories](http://rpubs.com/gaston/hurricanes) RPub.
I have [some history](http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/) with hurricane data and thought I’d jump on the bandwagon using the same data and making some stop-frame animations. I borrowed from previous work (hence starting with all the credits above) but have used `dplyr` idioms for data-frame filtering & mutating and my own month/year extraction code.
The first animation accumulates storm tracks in-year and displays the names of the storms in a list down the left side while the second does a full historical accumulation of tracks. I changed the storm path gradient but kept most of the other formatting bits and made the plots suitable for 1080p output/playback.
Rather than go the `ffmpeg` route, I used [ImageMagick](http://www.imagemagick.org/) since it makes equally quick work out of converting a bunch of `png` files to an `mp4` file. I made the animations go quickly, but they can be advanced forward/back one frame at a time in any decent player.
library(maps) library(data.table) library(dplyr) library(ggplot2) library(grid) library(RColorBrewer) # takes in a numeric vector and returns a sequence from low to high rangeseq <- function(x, by=1) { rng <- range(x) seq(from=rng[1], to=rng[2], by=by) } # etract the months (as a factor of full month names) from # a date+time "x" that can be converted to a POSIXct object, extractMonth <- function(x) { months <- format(as.POSIXct(x), "%m") factor(months, labels=month.name[rangeseq(as.numeric(months))]) } # etract the years (as a factor of full 4-charater-digit years) from # a date+time "x" that can be converted to a POSIXct object, extractYear <- function(x) { factor(as.numeric(format(as.POSIXct(x), "%Y"))) } # get from: ftp://eclipse.ncdc.noaa.gov/pub/ibtracs/v03r06/all/csv/Allstorms.ibtracs_all.v03r06.csv.gz storms_file <- "data/Allstorms.ibtracs_all.v03r06.csv" storms <- fread(storms_file, skip=10, select=1:18) col_names <- c("Season", "Num", "Basin", "Sub_basin", "Name", "ISO_time", "Nature", "Latitude", "Longitude", "Wind.kt", "Pressure.mb", "Degrees_North", "Deegrees_East") setnames(storms, paste0("V", c(2:12, 17, 18)), col_names) # use dplyr idioms to filter & mutate the data storms <- storms %>% filter(Latitude > -999, # remove missing data Longitude > -999, Wind.kt > 0, !(Name %in% c("UNNAMED", "NONAME:UNNAMED"))) %>% mutate(Basin=gsub(" ", "", Basin), # clean up fields ID=paste(Name, Season, sep="."), Month=extractMonth(ISO_time), Year=extractYear(ISO_time)) %>% filter(Season >= 1989, Basin %in% "NA") # limit to North Atlantic basin season_range <- paste(range(storms$Season), collapse=" - ") knots_range <- range(storms$Wind.kt) # setup base plotting parameters (these won't change) base <- ggplot() base <- base + geom_polygon(data=map_data("world"), aes(x=long, y=lat, group=group), fill="gray25", colour="gray25", size=0.2) base <- base + scale_color_gradientn(colours=rev(brewer.pal(n=9, name="RdBu")), space="Lab", limits=knots_range) base <- base + xlim(-138, -20) + ylim(3, 55) base <- base + coord_map() base <- base + labs(x=NULL, y=NULL, title=NULL, colour = "Wind (knots)") base <- base + theme_bw() base <- base + theme(text=element_text(family="Arial", face="plain", size=rel(5)), panel.background = element_rect(fill = "gray10", colour = "gray30"), panel.margin = unit(c(0,0), "lines"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.margin = unit(c(0,0,0,0), "lines"), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), legend.position = c(0.25, 0.1), legend.background = element_rect(fill="gray10", color="gray10"), legend.text = element_text(color="white", size=rel(2)), legend.title = element_text(color="white", size=rel(5)), legend.direction = "horizontal") # loop over each year, producing plot files that accumulate tracks over each month for (year in unique(storms$Year)) { storm_ids <- unique(storms[storms$Year==year,]$ID) for (i in 1:length(storm_ids)) { storms_yr <- storms %>% filter(Year==year, ID %in% storm_ids[1:i]) # stuff takes a while, so it's good to have a progress message message(sprintf("%s %s", year, storm_ids[i])) gg <- base gg <- gg + geom_path(data=storms_yr, aes(x=Longitude, y=Latitude, group=ID, colour=Wind.kt), size=1.0, alpha=1/4) gg <- gg + geom_text(label=year, aes(x=-135, y=51), size=rel(6), color="white", vjust=1) gg <- gg + geom_text(label=paste(gsub(".[[:digit:]]+$", "", storm_ids[1:i]), collapse="\n"), aes(x=-135, y=49.5), size=rel(4.5), color="white", vjust=1) # change "quartz" to "cairo" if you're not on OS X png(filename=sprintf("output/%s%03d.png", year, i), width=1920, height=1080, type="quartz", bg="gray25") print(gg) dev.off() } } # convert to mp4 animation - needs imagemagick system("convert -delay 8 output/*png output/hurr-1.mp4") # unlink("output/*png") # do this after verifying convert works
# take an alternate approach for accumulating the entire hurricane history # start with the base, but add to the ggplot object in a loop, which will # accumulate all the tracks. gg <- base for (year in unique(storms$Year)) { storm_ids <- unique(storms[storms$Year==year,]$ID) for (i in 1:length(storm_ids)) { storms_yr <- storms %>% filter(ID %in% storm_ids[i]) message(sprintf("%s %s", year, storm_ids[i])) gg <- gg + geom_path(data=storms_yr, aes(x=Longitude, y=Latitude, group=ID, colour=Wind.kt), size=1.0, alpha=1/4) png(filename=sprintf("output/%s%03d.png", year, i), width=1920, height=1080, type="quartz", bg="gray25") print(gg) dev.off() } } system("convert -delay 8 output/*png output/hurr-2.mp4") # unlink("output/*png") # do this after verifying convert works
Full code in [this gist](https://gist.github.com/hrbrmstr/23bf06784e898871dd61).
I saw a re-tweet for this “Man From Mars Radio Hat” advertisement and—while it’s awesome as a bit of nostalgia on it’s own—it also reveals a bit about our (US) Postal history:
The “Zone” field is not for modern ZIP codes, but for postal zones which were implemented back in 1943. The Smithsonian [online postal museum](http://postalmuseum.si.edu/) explains what they are in their [excerpt](http://postalmuseum.si.edu/zipcodecampaign/p2.html) from Carl H. Scheele’s book, _”A Short History of the Mail Service”_:
>_The basic idea of postal zoning came into being in 1943, in the midst of World War II. At this time, U.S. annual mail volume was growing steadily (the years between 1940 and 1943 saw an increase from about 28 billion pieces of mail to about 33 billion pieces of mail). In addition, post offices across the country were forced to hire inexperienced clerks as more experienced staff members went off to serve in the war. The creation of a postal zone system was deemed necessary in order to improve the accuracy and efficiency of mail sorting and delivery. Zone numbers were implemented in 124 of the country’s largest urban areas, and individual delivery districts within these areas were given one or two digit codes, to be written as part of the address after the city name. These early postal zone codes faced little opposition and caught on quickly with businesses and the American public._
Simpler times and smaller world back then.
When I used one of the Scotland TopoJSON files for a recent post, it really hit me just how much D3 cartography envy I had/have as an R user. Don’t get me wrong, I can conjure up D3 maps pretty well [1] [2] and the utility of an interactive map visualization goes without saying, but we can make great static maps in R without a great deal of effort, so I decided to replicate a few core examples from the D3 topojson gallery in R.
I chose five somewhat different examples, each focusing on various aspects of creating map layers and trying to not be too U.S. focused. Here they are (hit the main link to go to the gist for the example and the bl.ocks
URL to see it’s D3 counterpart):
- Swiss Cantons – R version of http://bl.ocks.org/mbostock/4207744
- Costa Rica – R version of http://bl.ocks.org/arce/9357998
- Area Choropleth – R version of http://bl.ocks.org/mbostock/4206573
- Block Counties – R version of http://bl.ocks.org/mbostock/4206857
- County Circles (OK, Ovals) – R version of http://bl.ocks.org/mbostock/4206975
I used the TopoJSON/GeoJSON files provided with each example, so you’ll need a recent gdal (>= 1.11), and—consequently—a suitable build of rgdal
) to work through the examples.
The Core Mapping Idiom
While the details may vary with each project you work on, the basic flow to present a map in R with ggplot
are:
- read in a map features (I use
readOGR
in these examples) - convert that into something
ggplot
can handle - identify values you wish to pair with those features (optional if we’re just plotting a plain map)
- determine which portion of the map is to be displayed
- plot the map features
Words & abbreviations mean things, just like map symbols mean things, and if you’re wondering what this “OGR” is, here’s the answer from the official FAQ:
OGR used to stand for OpenGIS Simple Features Reference Implementation. However, since OGR is not fully compliant with the OpenGIS Simple Feature specification and is not approved as a reference implementation of the spec the name was changed to OGR Simple Features Library. The only meaning of OGR in this name is historical. OGR is also the prefix used everywhere in the source of the library for class names, filenames, etc.
The readOGR
function can work with a wide variety of file formats and OGR files can hold a wide variety of data. The most basic use for our mapping is to read in these TopoJSON/GeoJSON files and use the right features from them to make our maps. Features/layers can be almost anything (counties, states, countries, rivers, lakes, etc) and we can see what features we want to work with by using the ogrListLayers
function (you can do this from an operating system command line as well, but we’ll stay in R for now). Let’s take a look at the layers available in the map from the Costa Rica example:
ogrListLayers("division.json") ## [1] "limites" "provincias" "cantones" "distritos" attr(,"driver") ## [1] "GeoJSON" attr(,"nlayers") ## [1] 4
Those translate to “country”, “provinces”, “cantons”, & “districts”. Each layer has polygons and associated data for the polygons (and overall layer), including information about the type of projection. If you’re sensing a “math trigger warning”, fear not; I won’t be delving into to much more cartographic detail as you probably just want to see the maps & code.
Swiss Cantons
If you’re from the U.S. you (most likely) have no idea what a canton is. The quickest explanation is that it is an administrative division within a country and, in this specific example, the 26 cantons of Switzerland are the member states of the federal state of Switzerland.
The D3 Swiss Cantons uses a TopoJSON/GeoJSON file that has only one layer (i.e. the cantons) along with metadata about the canton id
and name
:
ogrInfo("readme-swiss.json", "cantons") ## Source: "readme-swiss.json", layer: "cantons" ## Driver: GeoJSON number of rows 26 ## Feature type: wkbPolygon with 2 dimensions ## Extent: (5.956 45.818) - (10.492 47.808) ## Number of fields: 2 ## name type length typeName ## 1 id 4 0 String ## 2 name 4 0 String
NOTE: you should learn to get pretty adept with the OGR functions or command-line tools as you can do some really amazing things with them, including extracting only certain features, simplifying the polygons or fixing issues. Some of the TopoJSON/GeoJSON files you’ll find with D3 examples may have missing or invalid components and you can fix some of them with these tools. We’ll be working around errors and missing values in these examples.
The D3 example displays the canton name
at the centroid of the polygon, so that’s what we’ll do in R:
library(rgeos) library(rgdal) # needs gdal > 1.11.0 library(ggplot2) # ggplot map theme devtools::source_gist("https://gist.github.com/hrbrmstr/33baa3a79c5cfef0f6df") map = readOGR("readme-swiss.json", "cantons") map_df <- fortify(map)
The map
object is a SpatialPolygonsDataFrame
and has a fairly complex structure:
slotNames(map) ## [1] "data" "polygons" "plotOrder" "bbox" ## [5] "proj4string" names(map) ## [1] "id" "name" # execute these on your own and poke around the data structures after determining the class class(map@data) class(map@polygons) class(map@plotOrder) class(map@bbox) class(map@proj4string)
The fortify
function turns all that into something we can use with ggplot
. Normally, we’d be able to get fortify
to associate the canton name
to the polygon points it encodes via the region
parameter. That did not work with these TopoJSON/GeoJSON files and I didn’t really poke around much to determine why since it’s easy enough to work around. In this case, I manually merged the names with the fortified map data frame.
# create mapping for id # to name since "region=" won't work dat <- data.frame(id=0:(length(map@data$name)-1), canton=map@data$name) map_df <- merge(map_df, dat, by="id")
We can get the centroid via the gCentroid
function, and we’ll make a data frame of those center points and the name of the canton for use with a geom_text
layer after plotting the base outlines of the cantons (with a rather bland fill, but I didn’t pick the color):
# find canton centers centers <- data.frame(gCentroid(map, byid=TRUE)) centers$canton <- dat$canton # make a map! gg <- ggplot() gg <- gg + geom_map(data=map_df, map=map_df, aes(map_id=id, x=long, y=lat, group=group), color="#ffffff", fill="#bbbbbb", size=0.25) # gg <- gg + geom_point(data=centers, aes(x=x, y=y)) gg <- gg + geom_text(data=centers, aes(label=canton, x=x, y=y), size=3) gg <- gg + coord_map() gg <- gg + labs(x="", y="", title="Swiss Cantons") gg <- gg + theme_map()
The coord_map()
works with the mapproj
package to help us display maps in reasonable projections (or really dumb ones). The default is "mercator"
and we’ll stick with that since the D3 examples use it (but, winkel-tripel FTW!).
Here’s the result of our hard work (select map for larger version):
If you ignore the exposition above and just take into account non-blank source code lines, we did all that in ~16LOC and have a scaleable SVG file as a result. You can have some fun with the above code and remove the static fill="#bbbbbb"
and move it to the mapping
aesthetic parameter and tie it’s value to the canton name.
Costa Rica
The TopoJSON/GeoJSON file provided with the D3 example is a good example of encoding multiple layers into a single file (see the first ogrListLayers
above). We’ll create a fortified version of each layer and then plot each with a geom_map
layer using different line colors, sizes and fills:
limites = readOGR("division.json", "limites") provincias = readOGR("division.json", "provincias") cantones = readOGR("division.json", "cantones") distritos = readOGR("division.json", "distritos") limites_df <- fortify(limites) cantones_df <- fortify(cantones) distritos_df <- fortify(distritos) provincias_df <- fortify(provincias) gg <- ggplot() gg <- gg + geom_map(data=limites_df, map=limites_df, aes(map_id=id, x=long, y=lat, group=group), color="white", fill="#dddddd", size=0.25) gg <- gg + geom_map(data=cantones_df, map=cantones_df, aes(map_id=id, x=long, y=lat, group=group), color="red", fill="#ffffff00", size=0.2) gg <- gg + geom_map(data=distritos_df, map=distritos_df, aes(map_id=id, x=long, y=lat, group=group), color="#999999", fill="#ffffff00", size=0.1) gg <- gg + geom_map(data=provincias_df, map=provincias_df, aes(map_id=id, x=long, y=lat, group=group), color="black", fill="#ffffff00", size=0.33) gg <- gg + coord_map() gg <- gg + labs(x="", y="", title="Costa Rica TopoJSON") gg <- gg + theme_map()
The result is pretty neat and virtually identical to the D3 version:
Try playing around with the order of the geom_map
layers (or remove some) and also the line color/size/fill & alpha values to see how it changes the map.
Area Choropleth
I’m not a huge fan of the colors used in the D3 version and I’m not going to spend any time moving Hawaii & Alaska around (that’s a whole different post). But, I will show how to make a similar area choropleth:
# read in the county borders map = readOGR("us.json", "counties") # calculate (well retrieve) the area since it's part of the polygon structure # and associate it with the polygon id so we can use it later. We need to do # the merge manually again since the "us.json" file threw errors again when # trying to use the fortify "region" parameter. map_area <- data.frame(id=0:(length(map@data$id)-1), area=sapply(slot(map, "polygons"), slot, "area") ) # read in the state borders states = readOGR("us.json", "states") states_df <- fortify(states) # create map data frame and merge area info map_df <- fortify(map) map_df <- merge(map_df, map_area, by="id") gg <- ggplot() # thin white border around counties and shades of yellow-green for area (log scale) gg <- gg + geom_map(data=map_df, map=map_df, aes(map_id=id, x=long, y=lat, group=group, fill=log1p(area)), color="white", size=0.05) # thick white border for states gg <- gg + geom_map(data=states_df, map=states_df, aes(map_id=id, x=long, y=lat, group=group), color="white", size=0.5, alpha=0) gg <- gg + scale_fill_continuous(low="#ccebc5", high="#084081") # US continental extents - not showing alaska & hawaii gg <- gg + xlim(-124.848974, -66.885444) gg <- gg + ylim(24.396308, 49.384358) gg <- gg + coord_map() gg <- gg + labs(x="", y="", title="Area Choropleth") gg <- gg + theme_map() gg <- gg + theme(legend.position="none")
Play with the colors and use different values instead of the polygon area (perhaps use sample
or runif
to generate some data) to see how it changes the choropleth outcome.
Blocky Counties
The example from the D3 wiki is more “how to work with shapefiles and map coordinates” than it is useful, but we have the same flexibility in R, so we’ll make the same plot by using the bbox
function to make a data frame of bounding boxes we can use with geom_rect
(there’s no geom_map
in this example, just using the coordinate system to plot boxes):
# use the topojson from the bl.ocks example map = readOGR("us.json", "counties") # build our map data frame of rects map_df <- do.call("rbind", lapply(map@polygons, function(p) { b <- bbox(p) # get bounding box of polygon and put it into a form we can use later data.frame(xmin=b["x", "min"], xmax=b["x", "max"], ymin=b["y", "min"], ymax=b["y", "max"]) })) map_df$id <- map$id # add the id even though we aren't using it now gg <- ggplot(data=map_df) gg <- gg + geom_rect(aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), color="steelblue", alpha=0, size=0.25) # continental us only gg <- gg + xlim(-124.848974, -66.885444) gg <- gg + ylim(24.396308, 49.384358) gg <- gg + coord_map() gg <- gg + labs(x="", y="", title="Blocky Counties") gg <- gg + theme_map() gg <- gg + theme(legend.position="none")
To re-emphasize we’re just working with ggplot
layers, so play around and, perhaps color in only the odd numbered counties.
County Circles (OK, Ovals)
The last D3 example I’m copying swaps squares for circles, which makes this more of a challenge to do in R+ggplot since ggplot
has no “circle” geom (and holey geom_point
s do not count). So, we’ll borrow and slightly adapt a function from StackOverflow by joran that builds a data frame of polygon points derived by a center & diameter. We’ll add an id
value (for each of the counties) and make one really big data frame (well, big for use in ggplot
) that we can then plot as grouped geom_path
s. Unlike our cantons example, the gCentroid
function coughed up errors on this TopoJSON/GeoJSON file, so I resorted to approximating the center from the rectangular bounding box. Also, I don’t project the circle coordinates before plotting, so they’re ovals. While it doesn’t mirror the D3 example perfectly, it should help reinforce how to work with the map’s metadata and draw arbitrary components on a map:
# adapted from http://stackoverflow.com/questions/6862742/draw-a-circle-with-ggplot2 # computes a circle from a given diameter. we add "id" so we can have one big # data frame and group them for plotting circleFun <- function(id, center = c(0,0),diameter = 1, npoints = 100){ r = diameter / 2 tt <- seq(0,2*pi,length.out = npoints) xx <- center[1] + r * cos(tt) yy <- center[2] + r * sin(tt) return(data.frame(id=id, x = xx, y = yy)) } # us topojson from the bl.ocks example map = readOGR("us.json", "counties") # this topojson file gives rgeos_getcentroid errors here # so we approximate the centroid map_df <- do.call("rbind", lapply(map@polygons, function(p) { b <- bbox(p) data.frame(x=b["x", "min"] + ((b["x", "max"] - b["x", "min"]) / 2), y=b["y", "min"] + ((b["y", "max"] - b["y", "min"]) / 2)) })) # get area & diameter map_df$area <- sapply(slot(map, "polygons"), slot, "area") map_df$diameter <- sqrt(map_df$area / pi) * 2 # make our big data frame of circles circles <- do.call("rbind", lapply(1:nrow(map_df), function(i) { circleFun(i, c(map_df[i,]$x, map_df[i,]$y), map_df[i,]$diameter) })) gg <- ggplot(data=circles, aes(x=x, y=y, group=id)) gg <- gg + geom_path(color="steelblue", size=0.25) # continental us gg <- gg + xlim(-124.848974, -66.885444) gg <- gg + ylim(24.396308, 49.384358) gg <- gg + coord_map() gg <- gg + labs(x="", y="", title="County Circles (OK, Ovals)") gg <- gg + theme_map() gg <- gg + theme(legend.position="none")
If you poke around a bit at the various map libraries in R, you should be able to figure out how to get those plotted as circles (and learn alot in the process).
Wrapping Up
R ggplot
maps won’t and shouldn’t replace D3 maps for many reasons, paramount of which is interactivity. The generated SVG files are also fairly large and the non-SVG versions don’t look nearly as crisp (and aren’t as flexible). However, this should be a decent introductory primer on mapping and shapefiles and might come in handy when you want to use R to enhance maps with other data and write out (yep, R can read and write OGR) your own shapefiles for use in D3 (or other tools/languages).
Don’t forget that all source code (including TopoJSON/GeoJSON files and sample SVGs) are in their own gists:
If you figure out what is causing some of the errors I mentioned or make some epic maps of your own, don’t hesitate to drop a note in the comments.
The arrival of the autumnal equinox foreshadows the reality of longer nights and shorter days here in the northeast US. We can both see that reality and distract ourselves from it at the same time by firing up RStudio (or your favorite editor) and taking a look at the sunrise & sunset times based on our map coordinates using some functions from the R {maptools} package.
The sunriset
function takes in a lat/lon pair, a range of dates and whether we want sunrise or sunset calculated and returns when those ephemeral events occur. For example, we can see the sunrise time for Portsmouth, NH on Christmas day this year (2014) via:
library(maptools)
# these functions need the lat/lon in an unusual format
portsmouth <- matrix(c(-70.762553, 43.071755), nrow=1)
for_date <- as.POSIXct("2014-12-25", tz="America/New_York")
sunriset(portsmouth, for_date, direction="sunrise", POSIXct.out=TRUE)
## day_frac time
## newlon 0.3007444 2014-12-25 07:13:04
We can pass in a vector of dates, to this function, and that means we’ll have data points we can work with to visualize this change. Let’s wrap the sequence generation into a function of our own and extract:
- sunrise
- sunset
- solar noon
- # hours of daylight
for every day in the sequence, returning the result as a data frame.
# adapted from http://r.789695.n4.nabble.com/maptools-sunrise-sunset-function-td874148.html
ephemeris <- function(lat, lon, date, span=1, tz="UTC") {
# convert to the format we need
lon.lat <- matrix(c(lon, lat), nrow=1)
# make our sequence - using noon gets us around daylight saving time issues
day <- as.POSIXct(date, tz=tz)
sequence <- seq(from=day, length.out=span , by="days")
# get our data
sunrise <- sunriset(lon.lat, sequence, direction="sunrise", POSIXct.out=TRUE)
sunset <- sunriset(lon.lat, sequence, direction="sunset", POSIXct.out=TRUE)
solar_noon <- solarnoon(lon.lat, sequence, POSIXct.out=TRUE)
# build a data frame from the vectors
data.frame(date=as.Date(sunrise$time),
sunrise=as.numeric(format(sunrise$time, "%H%M")),
solarnoon=as.numeric(format(solar_noon$time, "%H%M")),
sunset=as.numeric(format(sunset$time, "%H%M")),
day_length=as.numeric(sunset$time-sunrise$time))
}
Now we can take a look at these values over 10 days near All Hallows Eve:
ephemeris(43.071755, -70.762553, "2014-10-31", 10, tz="America/New_York")
## date sunrise solarnoon sunset day_length
## 1 2014-10-31 716 1226 1736 10.332477
## 2 2014-11-01 717 1226 1734 10.289145
## 3 2014-11-02 518 1026 1533 10.246169
## 4 2014-11-03 620 1126 1632 10.203563
## 5 2014-11-04 621 1126 1631 10.161346
## 6 2014-11-05 622 1126 1629 10.119535
## 7 2014-11-06 624 1126 1628 10.078148
## 8 2014-11-07 625 1126 1627 10.037204
## 9 2014-11-08 626 1126 1626 9.996721
## 10 2014-11-09 627 1126 1625 9.956719
We now have everything we need to visualize the seasonal daylight changes. We’ll use ggplot
(with some help from the grid
package) and build a two panel graph, one that gives us a “ribbon” view of what hours of the day are in daylight and the other just showing the changes in the total number of hours of daylight available during the day. We’ll build the function so that it will:
- optionally show the current date/time (
TRUE
by default) - optionally show when solar noon is (
FALSE
by default) - optionally plot the graphs (
TRUE
by default) - return an
arrangeGrob
of the charts in the event we want to use them in other charts
library(ggplot2)
library(scales)
library(gridExtra)
# create two formatter functions for the x-axis display
# for graph #1 y-axis
time_format <- function(hrmn) substr(sprintf("%04d", hrmn),1,2)
# for graph #2 y-axis
pad5 <- function(num) sprintf("%2d", num)
daylight <- function(lat, lon, place, start_date, span=2, tz="UTC",
show_solar_noon=FALSE, show_now=TRUE, plot=TRUE) {
stopifnot(span>=2) # really doesn't make much sense to plot 1 value
srss <- ephemeris(lat, lon, start_date, span, tz)
x_label = ""
gg <- ggplot(srss, aes(x=date))
gg <- gg + geom_ribbon(aes(ymin=sunrise, ymax=sunset), fill="#ffeda0")
if (show_solar_noon) gg <- gg + geom_line(aes(y=solarnoon), color="#fd8d3c")
if (show_now) {
gg <- gg + geom_vline(xintercept=as.numeric(as.Date(Sys.time())), color="#800026", linetype="longdash", size=0.25)
gg <- gg + geom_hline(yintercept=as.numeric(format(Sys.time(), "%H%M")), color="#800026", linetype="longdash", size=0.25)
x_label = sprintf("Current Date / Time: %s", format(Sys.time(), "%Y-%m-%d / %H:%M"))
}
gg <- gg + scale_x_date(expand=c(0,0), labels=date_format("%b '%y"))
gg <- gg + scale_y_continuous(labels=time_format, limits=c(0,2400), breaks=seq(0, 2400, 200), expand=c(0,0))
gg <- gg + labs(x=x_label, y="",
title=sprintf("Sunrise/set for %s\n%s ", place, paste0(range(srss$date), sep=" ", collapse="to ")))
gg <- gg + theme_bw()
gg <- gg + theme(panel.background=element_rect(fill="#525252"))
gg <- gg + theme(panel.grid=element_blank())
gg1 <- ggplot(srss, aes(x=date, y=day_length))
gg1 <- gg1 + geom_area(fill="#ffeda0")
gg1 <- gg1 + geom_line(color="#525252")
if (show_now) gg1 <- gg1 + geom_vline(xintercept=as.numeric(as.Date(Sys.time())), color="#800026", linetype="longdash", size=0.25)
gg1 <- gg1 + scale_x_date(expand=c(0,0), labels=date_format("%b '%y"))
gg1 <- gg1 + scale_y_continuous(labels=pad5, limits=c(0,24), expand=c(0,0))
gg1 <- gg1 + labs(x="", y="", title="Day(light) Length (hrs)")
gg1 <- gg1 + theme_bw()
if (plot) grid.arrange(gg, gg1, nrow=2)
arrangeGrob(gg, gg1, nrow=2)
}
We can test our our new function using the same location and graph the sunlight data for a year starting September 1, 2014 (select graph for full-size version):
daylight(43.071755, -70.762553, "Portsmouth, NH", "2014-09-01", 365, tz="America/New_York")
With the longer nights approaching we can further enhance the plotting function to add markers for solstices and perhaps even make a new version that compares sunlight across different geographical locations.
Complete code example is in this gist.
The BBC did a pretty good job [live tracking the Scotland secession vote](http://www.bbc.com/news/events/scotland-decides/results), but I really didn’t like the color scheme they chose and decided to use the final tally site as the basis for another tutorial using the tools from the Hadleyverse and taking advantage of the fact that newer `gdal` libraries can read in [TopoJSON](https://github.com/mbostock/topojson)/GeoJSON files, meaning we can use _most_ of the maps the D3-ers create/use right in R.
We’ll need a few R packages to help us get, clean, format and chart the data:
library(rvest) library(dplyr) library(httr) # >0.5 library(tidyr) library(gpclib) library(rgeos) library(sp) library(maptools) library(rgdal) # needs gdal > 1.11.0 library(ggplot2) library(reshape2) library(gridExtra)
The new `rvest` package makes it super-fun (and easy) to get data out of web pages (as I’ve [mentioned on the sister blog](http://datadrivensecurity.info/blog/posts/2014/Sep/migrating-to-rvest/)), but said data is still web page data, usually geared towards making things render well in a browser, and we end up having to clean up the extracted fields to get useful data. Since we usually want a data frame from the extraction, an `rvest` idiom I’ve been playing with involves bundling the element extraction & cleanup code into one function and then using that to build the columns:
# extract data from rvest-ed <div>'s and clean it up a bit # pass in the rvested HTML object and the CSS selector to extract, also # indicating whether we want a number or character vector returned extractAndCleanup <- function(data, selector, make_numeric=FALSE) { x <- data %>% html_nodes(selector) %>% html_text() x <- gsub("^[[:punct:][:space:]]*|[[:punct:][:space:]]*$", "", x) if (make_numeric) x <- as.numeric(gsub("[,[:space:]]*", "", x)) x } bbc_vote <- html("http://www.bbc.com/news/events/scotland-decides/results") secede <- data.frame( council=bbc_vote %>% extractAndCleanup(".body-row__cell--council"), electorate=bbc_vote %>% extractAndCleanup(".body-row__cell--electorate", TRUE), yes=bbc_vote %>% extractAndCleanup(".body-row__cell--yes", TRUE), no=bbc_vote %>% extractAndCleanup(".body-row__cell--no", TRUE), stringsAsFactors=FALSE)
We can then compute whether the vote tally was to secede or not and assign a color in the event we choose to use base graphics for plotting (we won’t for this tutorial). I chose a muted version of the Union Jack red and the official Scottish blue for this exercise.
secede <- secede %>% mutate(gone=yes>no, color=ifelse(gone, "#0065BD", "#CF142B77"))
Getting the map from the BBC site is just as simple. An inspection of the site in any decent browser with a “Developer” mode lets us see the elements being downloaded. For the BBC map, it reads the data from: `http://static.bbci.co.uk/news/1.49.0-1192/js/data/maps/l/scotland-elections.js` which is a TopoJSON object wrapped in two lines of extra javascript code. We’ll grab that file, clean it up and read the map into R using `httr`’s new-ish ability to save to a data file:
GET("http://static.bbci.co.uk/news/1.49.0-1192/js/data/maps/l/scotland-elections.js", write_disk("data/scotland.json"), progress()) tmp <- readLines("data/scotland.json") dir.create("data") writeLines(tmp[2], "data/scotland.json") map <- readOGR("data/scotland.json", "scotland-elections")
We’ll want to work with the map using Council names, so we need to ensure the names from the extracted `div` elements match what’s in the TopoJSON file:
secede$council %in% map@data$name ## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE ## [13] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE ## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
It looks like we’ll need to clean the names up a bit, but thankfully the names aren’t too far off:
secede$council <- gsub("&", "and", secede$council) secede[secede$council=="Edinburgh",]$council = "City of Edinburgh" secede[secede$council=="Glasgow",]$council = "Glasgow City" secede[secede$council=="Comhairle nan Eilean Siar",]$council = "Na h-Eileanan an Iar"
If we were using base graphics for plotting, we’d also have to ensure the data was in the right order:
secede$council <- factor(secede$council, map@data$name, ordered=TRUE) secede <- secede %>% arrange(council)
We’re going to use `ggplot` for the mapping portion, but the normal `fortify` process didn’t work on this TopoJSON file (some polygon errors emerged), so we’ll take another route and do the data Council name↔id mapping after the `fortify` call and merge the rest of our data into the map data frame:
map_df <- fortify(map) # manually associate the map id's with the Council names and vote data councils <- data.frame(id=0:(length(map@data$name)-1), council=as.character(map@data$name)) map_df <- merge(map_df, councils, by="id") map_df <- merge(map_df, secede, by="council")
Now we can generate the choropleth:
gg <- ggplot() gg <- gg + geom_map(data=map_df, map=map_df, aes(map_id=id, x=long, y=lat, group=group, fill=color), color="white", size=0.25) gg <- gg + scale_fill_manual(values=rev(unique(secede$color)), labels=c("Yes", "No"), name="Secede?") gg <- gg + xlim(extendrange(r=range(coordinates(map)[,1]), f=0.15)) gg <- gg + ylim(extendrange(r=range(coordinates(map)[,2]), f=0.07)) gg <- gg + coord_map() gg <- gg + labs(x="", y="") gg <- gg + theme_bw() gg <- gg + theme(panel.grid=element_blank()) gg <- gg + theme(legend.position="none") gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(axis.ticks=element_blank()) gg <- gg + theme(axis.text=element_blank())
A choropleth is all well-and-good, but—since we have the data–let’s add the bar chart to complete the presentation. We’ll combine some `dplyr` and `tidyr` calls to melt and subset our data frame:
secede_m <- secede %>% gather(variable, value, -council) %>% filter(variable %in% c("yes", "no")) %>% mutate(value=as.numeric(value))
For this exercise, we’ll plot the 100% stacked bars in order of the “No” votes, and we’ll pre-process this ordering to make the `ggplot` code easier on the eyes. We start by merging some data back into our melted data frame so we can build the sorted factor by the “No” value column and then make sure the Councils will be in that order:
secede_m <- merge(secede_m, secede, by="council") secede_m$variable <- factor(secede_m$variable, levels=c("yes", "no"), ordered=TRUE) secede_m <- secede_m %>% arrange(no, variable) secede_m$council <- factor(secede_m$council, unique(secede_m$council), ordered=TRUE)
Finally, we can create the 100% stacked bar plot and combine it with the choropleth to build the final product:
gg1 <- ggplot(secede_m, aes(x=council, y=value, fill=factor(variable))) gg1 <- gg1 + geom_bar(stat="identity", position="fill") gg1 <- gg1 + scale_fill_manual(values=rev(unique(secede$color)), labels=c("Yes", "No"), name="Secede?") gg1 <- gg1 + geom_hline(yintercept=0.50, color="gray80") gg1 <- gg1 + geom_text(aes(label=percent(yes/100)), y=0.08, color="white", size=3) gg1 <- gg1 + geom_text(aes(label=percent(no/100)), y=0.92, color="white", size=3) gg1 <- gg1 + coord_flip() gg1 <- gg1 + labs(x="", y="") gg1 <- gg1 + theme_bw() gg1 <- gg1 + theme(panel.grid=element_blank()) gg1 <- gg1 + theme(legend.position="top") gg1 <- gg1 + theme(panel.border=element_blank()) gg1 <- gg1 + theme(axis.ticks=element_blank()) gg1 <- gg1 + theme(axis.text.x=element_blank()) vote <- arrangeGrob(gg1, gg, ncol=2, main=textGrob("Scotland Votes", gp=gpar(fontsize=20)))
(Click for larger version)
I’ve bundled this code up into it’s own [github repo](https://github.com/hrbrmstr/secede-2014). The full project example has a few extra features as
– it shows how to save the resultant data frame to an R data file (in case the BBC nukes the site)
– also saves the cleaned-up JSON (getting minimal Scotland shapefiles is tricky so this one’s a keeper even with the polygon errors)
– wraps all that in `if` statements so future analysis/vis can work with or without the live data being available.
Hadley really has to stop making R so fun to work with :-)
UPDATE
Based on a comment by Paul Drake suggesting that the BBC choropleth (and, hence, my direct clone of it) could be made more informative by showing the vote difference. Since it’s just changing two lines of code, here it is in-situ vs creating a new post.
gg <- gg + geom_map(data=map_df, map=map_df, aes(map_id=id, x=long, y=lat, group=group, fill=yes-no), color="white", size=0.25) gg <- gg + scale_fill_gradient(low="#CF142B", high="#0065BD", name="Secede?\n(vote margin)", guide="legend")
I’ve operated a [Weather Underground](http://www.wunderground.com/) [Personal Weather Station](http://www.wunderground.com/weatherstation/about.asp) (PWS) [[KMEBERWI7](http://www.wunderground.com/personal-weather-station/dashboard?ID=KMEBERWI7#history)] off-and-on (hardware issues notwithstanding) for as long as I can remember, and I thought it was about time to finally do an Alfred↔PWS mashup. My personal requirements were quite modest:
– 5 reading history (including most current)
– Ability to copy all the current day’s readings as CSV
– Quickly get to my PWS data w/o a bookmark
[Alfred](http://www.alfredapp.com/) makes all this possible via customized workflows that support many scripting environments, including Python. Here’s a quick preview before going into the details:
It’s a fairly simple workflow:
– Grab today’s “raw” data (and clean it up)
– Select the last 5 entries
– Connect a URL action and use the full CSV as clipboard contents for any copy action
The full Python code is below and [on github](https://github.com/hrbrmstr/alfred-pws), and you can hit that github link or [Packal](http://www.packal.org/workflow/pws-history) for the compiled workflow. It’s been tested on Mavericks, but more eyes are always welcome.
Customizing the workflow for your own/favorite PWS is as simple as changing the `station` variable.
There’s plenty of room for improvement, including
– performing a background download of the PWS data
– using a sparkline graph as the icon
– customizing which data fields are returned
– providing commands to get/set your/favorite PWS
– providing options for the “copy” return type (currently CSV, but potentially XML or JSON)
Don’t hesitate to post issues or pull requests and check back for updates (as I’m sure some of those improvements will be making their way into the workflow).
import re import csv import sys import datetime from lib import requests from workflow import Workflow, web from StringIO import StringIO # retrieve today's history for station "X" def get_wx_data(station): tdy = datetime.datetime.today() # construct the URL for "today" url = 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=%s&day=%d&month=%d&year=%d&graphspan=day&format=1' % (station, tdy.day, tdy.month, tdy.year) r = web.get(url) # get the data r.raise_for_status() # report any errors return(re.sub("\n\n", "\n", re.sub("^\n|<br>", "", r.text))) # clean up the output & pass it back to main control # main workflow control def main(wf): station = "KMEBERWI7" # change to use your own/favorite station resp = get_wx_data(station) # only want last 5 readings, change this to whatever you want max = 5 i=0 for row in reversed(list(csv.reader(StringIO(resp)))): wf.add_item(title=row[0] + " | " + row[1] + "F | " + row[3] + "in | " + row[8] + "%", subtitle=station, # so you know where you're getting data from arg=station, # passed as query to URL action - http://www.wunderground.com/personal-weather-station/dashboard?ID={query}#history valid=True, # it can be opened in the browser icon="/System/Library/CoreServices/CoreTypes.bundle/Contents/Resources/ToolbarInfo.icns", # info icon copytext=resp) # get the whole CSV file in a copy if (i==max): break i += 1 # output to alfred wf.send_feedback() if __name__ == u"__main__": wf = Workflow() sys.exit(wf.run(main))