52vis Archives


52Vis Week #3 – Waste Not, Want Not

2016-04-13 – 23:35
Posted in 52vis, Charts & Graphs, DataVis, DataViz, ggplot, maps, R

The Wall Street Journal did a project piece [a while back](http://projects.wsj.com/waste-lands/) called _“Waste Lands: America’s Forgotten Nuclear Legacy”_. They dug through [Department of Energy](http://www.lm.doe.gov/default.aspx?id=2602) and [CDC](http://www.cdc.gov/niosh/ocas/ocasawe.html) data to provide an overview of the lingering residue of this toxic time in America’s past (somehow, I have to believe the fracking phenomenon of our modern era will end up doing far more damage in the long run).

Being a somewhat interactive piece, I was able to tease out the data source behind it for this week’s challenge. I’m, once again, removing the obvious vis and re-creating a non-interactive version of the WSJ’s main map view (with some additional details).

There’s definitely a story or two here, but I felt that the overall message fell a bit flat the way the WSJ folks told it. Can you find an angle or question that tells a tale in a more compelling fashion? I added some hints in the code snippet below (and in the repo) as to how you might find additional details for each toxic site (and said details are super-scrape-able with `rvest`). I also noticed some additional external data sets that could be brought in (but I’ll leave that detective work to our contestants).
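
As a rough sketch of that scraping hint (not code from the repo): the detail-page URL pattern is the one noted in the comments of the snippet further down, but the helper names `site_url()` and `scrape_site()` and the `"p"` CSS selector are placeholders. The real page structure would need inspecting before picking selectors.

```r
# Hypothetical helpers for pulling detail pages; requires the rvest package
# (install.packages("rvest")) only when scrape_site() is actually called.

# Build the WSJ detail-page URL for one or more `site.slug` values
site_url <- function(slug) {
  sprintf("http://projects.wsj.com/waste-lands/site/%s/", slug)
}

# Fetch one detail page and return the text of the matched nodes.
# ASSUMPTION: `css = "p"` just grabs paragraph text; swap in a selector
# that matches the parts of the page you actually care about.
scrape_site <- function(slug, css = "p") {
  pg <- rvest::read_html(site_url(slug))
  rvest::html_text(rvest::html_nodes(pg, css), trim = TRUE)
}

site_url("1-ac-spark-plug-dort-highway-plant")
```

If you loop this over every slug, adding a `Sys.sleep()` between requests would be the polite thing to do.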

If you’re up to the task, fork [this week’s repo](https://github.com/52vis/2016-15), create a subdirectory for your submission and shoot a PR my way (notifying folks here in the comments about your submission is also encouraged).

Entries accepted up until 2016-04-20 23:59:59 EDT.

Hadley has volunteered a signed book and I think I’ll take him up on the offer for this week’s prize (unless you really want a copy of Data-Driven Security :-).

One last note: I’ve secured `52vis.com` and will be getting that configured for next week’s contest. It’ll be a nice showcase site for all the submissions.

library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
library(rgeos)
library(maptools)
library(ggplot2)
library(ggalt)
library(ggthemes)
library(jsonlite)
library(tidyr)
library(dplyr)
library(purrr)
 
#' WSJ Waste Lands: http://projects.wsj.com/waste-lands/
#' Data from: http://www.lm.doe.gov/default.aspx?id=2602 &
#'            http://www.cdc.gov/niosh/ocas/ocasawe.html
 
sites <- fromJSON("sites.json", flatten=TRUE)
 
#' need to replace the 0-length data.frames with at least one row of `NA`s
#' so we can safely `unnest()` them later
 
sites$locations <- map(sites$locations, function(x) {
  if (nrow(x) == 0) {
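    # typed NAs keep the column types consistent with the non-empty data.frames
    data_frame(latitude=NA_real_, longitude=NA_real_, postal_code=NA_character_,
               name=NA_character_, street_address=NA_character_)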
  } else {
    x
  }
})
 
#' we'll need this later
 
sites$site.rating <- factor(sites$site.rating,
                           levels=c(3:0),
                           labels=c("Remote or no potential for radioactive contamination, based on criteria at the time of FUSRAP review.",
                                    "Referred to another agency or program, no authority to clean up under FUSRAP, or status unclear.",
                                    "Cleanup declared complete under FUSRAP.",
                                    "Cleanup in progress under the Formerly Utilized Sites Remedial Action Program (FUSRAP)."))
 
#' One teensy discrepancy:
 
nrow(sites)
 
#' ## [1] 517
#'
#' The stacked bars total on the WSJ site is 515.
#' Further complication is that sites$locations is a list column with nested
#' data.frames:
 
sites <- unnest(sites, locations)
 
nrow(sites)
 
#' ## [1] 549
 
sum(complete.cases(sites[,c("longitude", "latitude")]))
 
#' ## [1] 352
#'
#' So, just mapping long/lat is going to miss part of the story. But, I'm just
#' providing a kick-start for folks, so I'll just map long/lat :-)
 
glimpse(sites)
 
#' ## Observations: 549
#' ## Variables: 11
#' ## $ site.city      (chr) "Flint", "Albuquerque", "Buffalo", "Los...
#' ## $ site.name      (chr) "AC Spark Plug, Dort Highway Plant", "A...
#' ## $ site.rating    (fctr) Remote or no potential for radioactive...
#' ## $ site.state     (chr) "MI", "NM", "NY", "NM", "PA", "NY", "OH...
#' ## $ site.state_ap  (chr) "Mich.", "N.M.", "N.Y.", "N.M.", "Pa.",...
#' ## $ site.slug      (chr) "1-ac-spark-plug-dort-highway-plant", "...
#' ## $ latitude       (dbl) 43.02938, NA, NA, 35.88883, 39.95295, 4...
#' ## $ longitude      (dbl) -83.65525, NA, NA, -106.30502, -75.5927...
#' ## $ postal_code    (chr) "48506", NA, NA, "87544", "19382", "100...
#' ## $ name           (chr) "", NA, NA, "", "", "", "Former Buildin...
#' ## $ street_address (chr) "1300 North Dort Highway", NA, NA, "Pue...
 
#' Note that `site.slug` can be used with this URL:
#' `http://projects.wsj.com/waste-lands/site/SITE.SLUG.HERE/` to get to
#' detail pages on the WSJ site.
 
#' I want to use my `albersusa` mutated U.S. shapefile for this (NOTE: I'm moving
#' `albersusa` into one of the rOpenSci packages soon vs publishing it standalone to CRAN)
#' so I need to mutate the Alaska points (there are no Hawaii points).
#' This step is *not necessary* unless you plan on displaying points on this
#' mutated map. I also realized I need to provide a mutated projection translation
#' function for AK & HI for the mutated Albers maps.
 
tmp  <- data.frame(dplyr::select(filter(sites, site.state=="AK"), longitude, latitude))
coordinates(tmp) <- ~longitude+latitude
proj4string(tmp) <- CRS(us_longlat_proj)
tmp <- spTransform(tmp, CRS(us_laea_proj))
tmp <- elide(tmp, rotate=-50)
tmp <- elide(tmp, scale=max(apply(bbox(tmp), 1, diff)) / 2.3)
tmp <- elide(tmp, shift=c(-2100000, -2500000))
proj4string(tmp) <- CRS(us_laea_proj)
tmp <- spTransform(tmp, CRS(us_longlat_proj))
tmp <- as.data.frame(tmp)
 
sites[sites$site.state=="AK",]$longitude <- tmp$x
sites[sites$site.state=="AK",]$latitude <- tmp$y
 
#' and now we plot the sites
 
us_map <- fortify(usa_composite(), region="name")
 
gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="#2b2b2b", size=0.15, fill="#e5e3df")
gg <- gg + geom_point(data=sites, aes(x=longitude, y=latitude, fill=site.rating),
                      shape=21, color="white", stroke=1, alpha=1, size=3)
gg <- gg + scale_fill_manual(name="", values=c("#00a0b0", "#edc951", "#6a4a3c", "#eb6841"))
gg <- gg + coord_proj(us_laea_proj)
gg <- gg + guides(fill=guide_legend(override.aes=list(alpha=1, stroke=0.2, color="#2b2b2b", size=4)))
gg <- gg + labs(title="Waste Lands: America's Forgotten Nuclear Legacy",
                 caption="Data from the WSJ")
gg <- gg + theme_map()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.direction="vertical")
gg <- gg + theme(legend.key=element_blank())
gg <- gg + theme(plot.title=element_text(size=18, face="bold", hjust=0.5))
gg
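
For anyone on a newer `tidyr` (1.0.0 or later), the pad-with-`NA`s-then-`unnest()` step above can probably be collapsed into a single call using `keep_empty = TRUE`. This is only a minimal sketch on made-up toy data: the `toy` tibble and its values are hypothetical and are not taken from `sites.json`.

```r
library(tidyr)
library(tibble)

# Toy stand-in for `sites`: a list column of nested data frames,
# one of which (site "B") has zero rows
toy <- tibble(
  site.name = c("A", "B", "C"),
  locations = list(
    tibble(latitude = c(43.0, 43.1), longitude = c(-83.6, -83.7)),
    tibble(latitude = numeric(0), longitude = numeric(0)),
    tibble(latitude = 35.9, longitude = -106.3)
  )
)

# keep_empty = TRUE keeps site "B" as a single row of NAs
# instead of silently dropping it
flat <- unnest(toy, locations, keep_empty = TRUE)

nrow(flat)                                               # 4
sum(complete.cases(flat[, c("longitude", "latitude")]))  # 3
```

The same two checks used above (`nrow()` and `complete.cases()`) show how many rows survive and how many of them can actually be placed on a map.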

52 Vis Week #2 Wrap Up

2016-04-13 – 23:07
Posted in 52vis, Data Visualization, DataVis, DataViz, R

I’ve been staring at this homeless data set for a few weeks now since I’m using it both here and in the data science class I’m teaching. It’s been one of the most mindful data sets I’ve worked with in a while. Even when reduced to pure numbers in named columns, the names really stick with you…_“Unsheltered Homeless People in Families”_…_“Unsheltered Chronically Homeless”_…_“Homeless Veterans”_…_“Unsheltered Homeless Unaccompanied Youth”_. These are real people, really hurting.

That’s one of the _superpowers_ “Data Science” gives you: the ability to shed light on the things that matter and to tell meaningful stories that people _need_ to hear. From my interactions with some of the folks who submitted entries, I know they, too, were impacted by the stories contained in this data set. Let’s see what they uncovered.

(All the code & un-shrunk visualizations are in the [52vis github repo](https://github.com/52vis/2016-14))

Camille compared a point-in-time view of one of the most vulnerable parts of the population—youth (under 25)—with the overall population:

The youth homelessness situation in Nevada seems especially disconcerting and I wonder how much better/worse it might be if we factored in the 25 & under U.S. census information (I’m really a bit reluctant to run those numbers for fear it’ll be even worse).

Craine Munton submitted our first D3 entry! I ended up tweaking some of the JS & CSS `href`s (to fix non-sync’d files) and you can see the full version [here](https://rud.is/52vis/2016-14/cmrunton/). I’m going to try to embed it below as well (I’ll leave it up even if it’s not fully sized well. Just hit the aforementioned URL for the full-browser version).

Craine focused on another vulnerable and sometimes forgotten segment: those that put their lives on the line for our freedom and the safety and security of threatened people groups around the globe.

Joshua Kunst is an incredibly talented individual who has made a number of stunning visualizations in R. He used htmlwidgets to tell [a captivating story](https://rud.is/52vis/2016-14/jbkunst/) that ends in (statistically inferred) _hope_.

Hit the [full page](https://rud.is/52vis/2016-14/jbkunst/) for the frame-busted visualization.

Jake Kaupp took inspiration from Alberto Cairo and created some truly novel visualizations. I’m putting the easiest to embed first:

Jake has [written a superb piece](https://jkaupp.github.io/) on his creation, included an [interactive Shiny app](https://jkaupp.shinyapps.io/52vis_Homeless/) and brought in extra data to try to come to grips with this data. Definitely take time to read his post (even if it means you never get back to this post).

His small-multiples view is below but you should click on it to see it in full-browser view.

Jonathan Carroll (another fellow rOpenSci’er) created a companion [blog post](http://jcarroll.com.au/2016/04/10/52vis-week-2-challenge/) for his animated choropleth entry:

I really like how it highlights the differences per year and a number of statistical/computational choices he made.

Julia Silge focused on youth as well, asking two compelling questions (you can also read her [exposition](http://rpubs.com/juliasilge/170499)):

(In seeing this second youth-focused vis and also having a clearer picture of the areas of greatest concern, I’m wondering if there’s a climate/weather-oriented reason for certain areas standing out when it comes to homeless youth issues.)

Philipp Ottolinger took a statistical look at youth and veterans:

Make sure to dedicate some cycles to [check out his approach](https://github.com/ottlngr/2016-14/blob/ottlngr/ottlngr/homeless_ottlngr.Rmd).

@patternproject did not succumb to the temptation to draw a map just because “it’s U.S. State data” and chose, instead, to look across time and geography to tease out patterns using a heatmap.

I _really_ like this novel approach and am now rethinking my approach to geo-temporal visualizations.

Xan Gregg looked at sheltered vs unsheltered homeless populations from a few different viewpoints (including animation).

(Again, beautiful work in JMP).

### We have a winner

The diversity and craftsmanship of these entries were amazing to see, as was the care and attention each submitter took when dealing with this truly tough subject. I was personally impacted by each one and I hope it raised awareness in the broader data science community.

I couldn’t choose between Joshua Kunst’s & Jake Kaupp’s entries so they’re tied for first place and they’ll both be getting a copy of [Data-Driven Security](http://dds.ec/amzn).

Joshua & Jake: hit up bob at rudis dot net to claim your prize!

A $50.00 donation has also been made to the [National Coalition for the Homeless](http://nationalhomeless.org/), dedicated by name to each of the contest participants.

52 Vis Week 1 Winners!

The response to 52Vis has exceeded expectations and there have been great entries for both weeks. It’s time to award some prizes!

### Week 1 – Send in the Drones

I’ll take [this week](https://github.com/52vis/2016-13) in comment submission order (remember, the rules changed to submission via PR in Week 2).

NOTE: WordPress seems to have “eaten” the animations on upload, so _please_ check out the direct links to them (they are worth the clicks).

First up is a straightforward but really colorful take on the data by J. Alexander Branham (in R):

You can see his code and commentary (he went into detail on both the code and thought process) on [his blog](http://jabranham.com/blog/2016/03/ggplot-maps.html). Extra points for the Albers projection!

Next up is Jérôme Laurent who did an equally great job explaining his thought process behind his approach and some great code exposition that produced this really neat time-lapse animation:

You can see Jérôme’s [blog](https://jerome-laurent-pro.github.io/2016-04-01-dataviz-week13/) for all the code and ‘splainin.

Timothy Kiely took a super neat approach and mapped out the drone geographic density, but then expanded on his vision to look at the density over time (with a paired bar chart!).

Balázs Dukai also took the density approach in his [RPubs submission](http://rpubs.com/BalazsDukai/vis_2016-13) (with code) but attacked it from a small multiples perspective. It was interesting to see California move in and out of prominence and I think this will be an interesting project to replicate as the FAA collects more data.

Fellow rOpenSci 2016 participant Julia Silge wanted to see the distribution of sightings throughout the week and took a straightforward faceted bar chart approach:

BUT she also doubled-down on stats to make her case. Read her [superb exposition](http://rpubs.com/juliasilge/168308) to see the conclusion!

I was _extremely_ excited to see a [non-R submission](https://github.com/xangregg/dronetimes) from Xan Gregg. Using JMP & JSL (SAS components), Xan walks us through the dreaded data-munging component (that R folks got for free from me) and points out some really interesting oddities in the data that suggest there are issues with data quality. I really loved the in-depth explanation, and the scatterplot that was produced to help diagnose data issues is really well-crafted.

Philipp Ottolinger was super-kind enough to go back and PR his submission so all his code is in the 52vis repo, but you should also [check it out in his repo](https://github.com/ottlngr/52Vis/blob/master/01/01_52Vis.md) since he has a lot of other really cool things to peruse. He tried to see if drones were a weekend or daily occurrence in his approach:

Jacob Barnett was the first Leaflet submission (with [full code](https://gist.github.com/barnettjacob/58601c78f22616a02c3d3e1fa1aea724#file-flight_data-png)) and explored the geographic prevalence of the drone sightings:

@patternproject [explored](https://github.com/patternproject/r.rudis.challenge1) the density by week of year:

and this is going to be a really neat graph to watch over time as more of these flying annoyances take to the skies. I’m re-running that code later this year to see if the weeks with greater density stay the same.

Mukul Chaware did some [text analysis](https://github.com/mukul13/2016-13/tree/master/mukul) as well as temporal queries on the data. One of the more interesting charts is the day vs night one:

but he also used the text analysis to see where there were LEO notifications or not and did multiple views using that as a pivot (really neat idea). Check out his [full submission](https://github.com/mukul13/2016-13/tree/master/mukul) for some great exposition and analysis.

Lastly is Andrew Heiss’ [inquiry to discover](https://www.andrewheiss.com/blog/2016/04/03/drone-sightings-in-the-us-visualized/) whether we have a hobbyist problem or an official problem with these flying digital buzzards:

The pairing of external data, combined with the gorgeous map truly ruled the day, but he went further to answer another question: which state had the most (per capita) drone sightings:

### The Results Are In

It was really excellent work by everyone and the different approaches by the submitters show exactly what I was hoping this project would show: that there are many ways to approach the data that you have and finding the question that really takes hold of you can ultimately deliver amazing results.

For this week:

– Andrew Heiss takes the #1 spot and gets not only a digital copy of [Data-Driven Security](http://dds.ec/amzn) but also a $25.00 Amazon gift card (I actually had a sponsor who wanted to remain anonymous for this one)
– Xan Gregg takes a #2 spot and gets a copy of Data-Driven Security for an excellent analysis and being willing to be the first non-R submitter!

Andrew & Xan: please e-mail me at bob at rudis dot net so I can get your prizes to you!

Next post will be the Week 2 winners! Then, Week 3’s challenge!