
Category Archives: R

Moritz Stefaner started off 2016 with a [very spiffy post](http://truth-and-beauty.net/experiments/ach-ingen-zell/) on _”a visual exploration of the spatial patterns in the endings of German town and village names”_. Moritz was [exploring some new data processing & visualization tools](https://github.com/moritzstefaner/ach-ingen-zell) for the post, but when I saw what he was doing I wondered how hard it would be to do something similar in R and also used it as an opportunity to start practicing a new habit in 2016: packages vs projects.

To state the goals for this homage more precisely, the plan was to:

– use data sets as close as possible to the ones Moritz has in his github repo, _including_ the ones in pure javascript
– generate an HTML page as output that hews as closely as possible to the style of Moritz’s visualization
– use R for _everything_ (i.e. no “cheating” by sneaking in some javascript via `htmlwidgets`)
– bundle everything into a package to take advantage of all the good stuff that comes with R package validation

You may want to [take a look at the result](http://rud.is/zellingenach.html) to see if you want to continue reading (I hope you will!).

### The Setup
By using an R package as the framework for the visualization, it’s possible to keep the data with the code and also organize and document the code in a way that makes it easy for folks to use and explore without cutting and pasting (or `source`ing) code. It also makes it possible to list all the dependencies for the project and help ensure they’ll be installed when someone tries to work with it.

While I _could_ have converted Moritz’s processed data into R data files, I left the CSV intact and the javascript file of suffix groupings intact as well, to show that R is extremely flexible when it comes to data processing (which is a “duh” for most folks by this point, but the use of javascript data structures might give some folks ideas as to how to reduce data duplication between projects). Both of these files get stored in the `inst/alt` folder of the source package. I also end up using some CSS for the final visualization and placed that into a file in the same directory, which makes the code that generates the HTML a bit cleaner.

Because R processes some things automatically when it interacts with a package (like running `.onAttach` when the package is attached), one can have it provide helpful instructions (in this case, how to generate the visualization) in similar fashion to the `ggplot2` loading messages.
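Something along these lines (a sketch; the actual wording in the package may differ):

# illustrative .onAttach hook that nudges the user toward the entry point
.onAttach <- function(libname, pkgname) {
  packageStartupMessage("Use display_maps() to generate the visualization.")
}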

Similarly, both the package itself and the package functions have documentation to help folks understand what the package and each component are doing.
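Function-level docs are just roxygen2 comments above each function; an illustrative block (not copied verbatim from the package) looks like:

#' Read the German place-name data and tag each town with its suffix group(s)
#'
#' @return a data.frame of place names, coordinates and matched suffix groups
#' @export
read_places <- function() {
  # ... see the package source for the real body ...
}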

### The Fun Stuff
The CSV file of places looks something like this:

name,latitude,longitude
Nierskanal,49.01,13.23
Zwiefelhof,49.22,11.18
Zwiefaltendorf,48.21,9.51
Zwiefalten,48.23,9.46
Zwiedorf,53.69,13.05
Zwickgabel,48.58,8.31
Zwickau,50.72,12.48
Zwethau,51.58,13.04
Zwesten,51.05,9.17

and the suffix groupings list looks like this:

const suffixList = [
  ["ach", "a", "aa", "ah"],
  ["ar", "ahr"],
  ["ate", "te", "nit", "net"],
  ["au", "aue", "oog", "ooge", "ohe", "oie"],
  ["bach", "bach", "bek", "beken", "beck", "bke"],
  ["berg", "bergen", "barg", "bargen"],
  ["born", "bronn"],
  ["bruch", "broich", "brook", "brock", "brauk"],
  ["bruck", "brück", "brügge"],
  ...
];

While `read.csv` (no need for `readr` as the file is small) can handle the CSV file, we use the `V8` package to source the javascript and convert it to an R object:

library(V8)
 
# make a JS context, source the suffix file from the package, pull 'suffixList' back into R
ct <- v8()
ct$source(system.file("alt/suffixlist.js", package="zellingenach"))
ct$get("suffixList")

We actually turn that into a vector of regular expressions (for town name ending checking) and a list of vectors (for the HTML visualization creation). Check out `suffix_regex()` and `suffix_names()` in the source code.
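Roughly speaking, the regex-building looks like this (a sketch; the real `suffix_regex()` lives in the package source):

# collapse each grouping into a single "name ends with any of these" regex
suffix_regex <- function(suffix_list) {
  sapply(suffix_list, function(endings) sprintf("(%s)$", paste0(endings, collapse="|")))
}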

The `read_places()` function builds a `data.frame` of the places combined with the suffix grouping(s) they belong to:

# read in the file (stri_detect_regex() is from stringi; filter()/mutate() are from dplyr)
plc <- read.csv(system.file("alt/placenames_de.tsv", package="zellingenach"),
                stringsAsFactors=FALSE)
 
# iterate over each suffix and identify which place names match the grouping
lapply(suf, function(regex) {
  which(stri_detect_regex(plc$name, regex))
}) -> matched_endings
 
plc$found <- ""
 
# add which grouping(s) the place was found to a new column
for(i in 1:length(matched_endings)) {
  where_found <- matched_endings[[i]]
  plc$found[where_found] <-
    paste0(plc$found[where_found], sprintf("%d|", i))
}
 
# some don't match so get rid of them
mutate(filter(plc, found != ""), found=sub("\\|$", "", found))

I do something a bit different than Moritz in that I allow towns to be part of multiple suffix groups, since:

– I’m neither a historian nor expert in German town naming conventions, and
– the javascript version and this R version both take a naive approach to suffix mapping.

This means my numbers (for the _”#### places”_ label) will be different for some of my maps.

R has similar shortcut functions (Moritz uses D3) to make hexgrids out of shapefiles. Here’s the entirety of `create_hexgrid()`:

library(raster) # getData()
library(sp)     # spsample() & HexPoints2SpatialPolygons()
 
# country outline (GADM level 0) for Germany
de_shp <- getData("GADM", country="DEU", level=0, path=tempdir())
 
# sample hexagonal points over the outline, then turn them into hexagon polygons
de_hex_pts <- spsample(de_shp, type="hexagonal", n=10000, cellsize=0.19,
                       offset=c(0.5, 0.5), pretty=TRUE)
 
HexPoints2SpatialPolygons(de_hex_pts)

You can play with `cellsize` to change the number of hexes. I tried to find a good number to get close to the count in Moritz’s maps.

This all gets put together in `make_maps()`, where we use `ggplot2` to build 52 gridded heatmaps (one for each suffix grouping). I used a log of the counts to map to a binned viridis color scale, so my colors come out a bit different from Moritz’s, but the overall patterns are on par with his.
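The core of each panel looks roughly like this (a sketch, not the actual `make_maps()` internals; `hex_df` stands in for a fortified hex-polygon data.frame with `long`/`lat`/`id` columns and a per-hex `count` of matched places):

library(ggplot2)
library(viridis)
 
# bin the log counts and map the bins to the viridis palette
gg <- ggplot(hex_df, aes(x=long, y=lat, group=id))
gg <- gg + geom_polygon(aes(fill=cut(log1p(count), breaks=5)), color=NA)
gg <- gg + scale_fill_viridis(discrete=TRUE)
gg <- gg + coord_equal()
gg <- gg + theme_void()
gg <- gg + theme(legend.position="none")
gg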

Finally, `display_maps()` takes the list created by `make_maps()` and builds out an HTML page, using the `htmltools` package for the page framework and `svglite::htmlSVG` to make SVGs of the ggplot objects. NOTE that you can use the `output_file` option of `display_maps()` to send the HTML to a file as well as display it in the viewer/browser.
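The assembly boils down to something like this (a sketch that uses `svglite::xmlSVG` for brevity; the package itself leans on `htmlSVG`, and `gg_list` stands in for the list `make_maps()` returns):

library(htmltools)
library(svglite)
 
# render each ggplot object to an inline SVG string
svgs <- lapply(gg_list, function(gg) {
  HTML(as.character(xmlSVG(print(gg), width=4, height=4)))
})
 
# wrap the SVGs in divs so they reflow responsively, then view (or save) the page
pg <- tagList(
  tags$style("div.map { display:inline-block; margin:6px; }"),
  lapply(svgs, function(svg) tags$div(class="map", svg))
)
 
html_print(pg)                        # view in the viewer/browser
# save_html(pg, "zellingenach.html")  # or write out an HTML file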

### Fin
Because the project is in a package, we can run package checks to see if we’re missing anything, including other package dependencies, function documentation and other details that the package tools are gleeful to point out. We can also include code to test out our various components to ensure they are behaving as expected (i.e. generating the right data/output).

One nice thing about the output is that it’s “responsive”, which means it handles multiple screen sizes quite well. So, if your screen is huge, you’ll have many map boxes on one line and if it’s small (like the `iframe` below) it will have fewer.

You’ll see that my maps are a bit bigger than Moritz’s. This is due to both the hex grid size and the fact that the SVG output is just slightly larger overall than the ones made by D3. Of note: I noticed some suffix subtitle components wrapped at the “-”, so I converted the plain hyphens to non-breaking hyphens.
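That substitution is a one-liner along these lines (illustrative; the exact call in the package may differ):

# swap plain hyphens for non-breaking hyphens (U+2011) so subtitles don't wrap
subtitle <- gsub("-", "\u2011", subtitle, fixed=TRUE)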

The one downside to using a package for this is that it’s harder to post complete code into a blog post, but you can [clone the repo](https://github.com/hrbrmstr/zellingenach) to look at the code, or skip the dissection and just generate the visualization locally via:

install.packages("ggalt")
# OR: devtools::install_github("hrbrmstr/ggalt") 
devtools::install_github("hrbrmstr/zellingenach")
display_maps()

By targeting SVG & HTML, we can make a cross-platform, crisp and responsive visualization all without leaving RStudio.

If you caught any errors or made something cool with any of the code, please drop an issue on github or a note in the comments (respectively)!

If you prefer a single- `source`-able version, please see [this gist](https://gist.github.com/hrbrmstr/f3d2568ad0f27b2384d3).

Happy New YeaR!

James Austin (@awhstin) made some #spiffy 4-panel maps with base R graphics but also posited he didn’t use ggplot2 because:

_ggplot2 and maps currently do not support world maps at this point, which does not give us a great overall view._

That is certainly a box I would not put ggplot2 into, especially with the newly updated R maps (et al) packages, ggplot2 2.0 and my (still in development) ggalt package (though this was all possible before ggplot2 2.0 and ggalt). NOTE: I have no idea why I get so defensive about ggplot2 besides the fact that it’s one of the best visualization tools ever created.

Here’s all you need to use the built-in facet options of ggplot2 to make the 4-panel plot (as James points out, you can get the data file from here: CLIWOC15.csv):

library(ggplot2)  # FYI you need v2.0
library(dplyr)    # yes, i could have not done this and just used 'subset' instead of 'filter'
library(ggalt)    # devtools::install_github("hrbrmstr/ggalt")
library(ggthemes) # theme_map and tableau colors
 
world <- map_data("world")
world <- world[world$region != "Antarctica",] # intercourse antarctica
 
dat <- read.csv("CLIWOC15.csv")        # having factors here by default isn't a bad thing
dat <- filter(dat, Nation != "Sweden") # I kinda feel bad for Sweden but 4 panels look better than 5 and it doesn't have much data
 
gg <- ggplot()
gg <- gg + geom_map(data=world, map=world,
                    aes(x=long, y=lat, map_id=region),
                    color="white", fill="#7f7f7f", size=0.05, alpha=1/4)
gg <- gg + geom_point(data=dat, 
                      aes(x=Lon3, y=Lat3, color=Nation), 
                      size=0.15, alpha=1/100)
gg <- gg + scale_color_tableau()
gg <- gg + coord_proj("+proj=wintri")
gg <- gg + facet_wrap(~Nation)
gg <- gg + theme_map()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(legend.position="none")
gg


You can use a separate shapefile if you want, but this is quite minimalist (a feature James suggests is desirable) and emphasizes the routes quite nicely IMO.

It’s been a while since I’ve updated my [metricsgraphics package](https://cran.r-project.org/web/packages/metricsgraphics/index.html). The hit list for changes includes:

– Fixes for the new ggplot2 release (metricsgraphics uses the `movies` data set which is now in the `ggplot2movies` package)
– Updated all javascript libraries to the most recent versions
– Borrowed the ability to add CSS rules to a widget from taucharts (`mjs_add_css_rule`)
– Added a metricsgraphics plugin to enable line chart region annotation (`mjs_annotate_region`)
– Enabled explicit coloring line/area charts (it was a new feature in the underlying Metrics-Graphics library)
– You can use bare or quoted names when specifying the x & y accessors and can also use a variable name
– You can now use the metricsgraphics title & description capabilities, but doing so voids any predictable/specified widget height/width, and the description functionality is really only suited for bootstrap templates

I think all that can be demonstrated in the following snippet:

library(metricsgraphics)
library(dplyr) # for filter() and %>%
 
dat <- read.csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&a=07&b=9&c=1996&d=11&e=21&f=2015&g=d&ignore=.csv",
                stringsAsFactors=FALSE)
 
DATE <- "Date"
 
dat %>%
  filter(Date>="2008-01-01") %>% 
  mjs_plot(DATE, y="Low", title="AAPL Stock (2008-Present)", width=800, height=500) %>% 
  mjs_line(color="#6a3d9a") %>% 
  mjs_add_line(High, color="#ff7f00") %>% 
  mjs_axis_x(xax_format="date") %>% 
  mjs_add_css_rule("{{ID}} .blk { fill:black }") %>%
  mjs_annotate_region("2013-01-01", "2013-12-31", "Volatility", "blk") %>% 
  mjs_add_marker("2014-06-09", "Split") %>% 
  mjs_add_marker("2012-09-12", "iPhone 5") %>% 
  mjs_add_legend(c("Low", "High"))

NOTE: I’m still trying to figure out why WebKit on Safari renders the em dashes and Chrome does not.

I woke up this morning to a [headline story from the Washington Post](https://www.washingtonpost.com/news/the-fix/wp/2015/12/10/to-many-christian-terrorists-arent-true-christians-but-muslim-terrorists-are-true-muslims/) on _“Americans are twice as willing to distance Christian extremists from their religion as Muslims”_. This post is not about the content of the headline or story. It _is_ about the horrible pie chart WaPo led the article with:

(WaPo’s lead graphic: exploded pie charts of the survey results)

This isn’t just a rant of a madman against pie charts. While I _am_ vehemently opposed to them, we did cover them [in our book](https://books.google.com/books?id=7DqwAgAAQBAJ&pg=PA146&lpg=PA146&dq=data-driven+security+pie+chart&source=bl&ots=Cy1iJylsHd&sig=a6Hz1JB-QYLq6H0VZJpPleJgRkQ&hl=en&sa=X&ved=0ahUKEwj79uqt_tjJAhVG0iYKHS0uDn4Q6AEIMzAH#v=onepage&q=data-driven%20security%20pie%20chart&f=false) and my co-author (@jayjacobs) and the incredibly talented @annkemery both agree there are often cases where they are appropriate. Even using their less-sensitive sensibilities, this would not be one of those cases.

So, what—exactly—is the problem? WaPo tried to enable comparison between pies by exploding them and using colors to indicate similar fear levels, mapping shades to entries in the top legend. Your eye has to move around a bit to take everything in and remember the mapping as you focus on each slice (and you will end up doing that, given that each category is colored differently). Their whole goal was to enable the reader to see the change in sentiment towards terrorism since this time last year.

Hrm. Two dates. Small set of values. Desire to quickly compare change in value/slope. **This sounds like a job for a slopegraph!**

The article and graphic are based on a [survey](http://publicreligion.org/research/2015/12/survey-nearly-half-of-americans-worried-that-they-or-their-family-will-be-a-victim-of-terrorism/). Thankfully the [complete survey data was made available](http://publicreligion.org/site/wp-content/uploads/2015/12/December-2015-PRRI-RNS-Topline1.pdf), which made it easy to do a makeover (in R of course). Here’s the result:

(slopegraph remake of the WaPo chart)

Each category change is clearly visible, you don’t need to remember color association and you even know the actual values*.

The R code is below and in [this gist](https://gist.github.com/hrbrmstr/9bf4f93dffc1df48fe27). How would you make the WaPo chart better (drop a note in the comments with a link to your own makeover)?

library(tidyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(dplyr)
 
# Easiest way to transcribe the PDF table
# The slope calculation will enable us to color the lines/points based on up/down
dat <- data_frame(`2014-11-01`=c(0.11, 0.22, 0.35, 0.31, 0.01),
                  `2015-12-01`=c(0.17, 0.30, 0.30, 0.23, 0.00),
                  slope=factor(sign(`2014-11-01` - `2015-12-01`)),
                  fear_level=c("Very worried", "Somewhat worried", "Not too worried",
                               "Not at all", "Don't know/refused"))
 
# Transform that into something we can use
dat <- gather(dat, month, value, -fear_level, -slope)
 
# We need real dates for the X-axis manipulation
dat <- mutate(dat, month=as.Date(as.character(month)))
 
# Since 2 categories have the same ending value, we need to
# take care of that (this is one of a few "gotchas" in slopegraph preparation)
end_lab <- dat %>%
  filter(month==as.Date("2015-12-01")) %>%
  group_by(value) %>%
  summarise(lab=sprintf("%s", paste(fear_level, collapse=", ")))
 
gg <- ggplot(dat)
 
# line
gg <- gg + geom_line(aes(x=month, y=value, color=slope, group=fear_level), size=1)
# points
gg <- gg + geom_point(aes(x=month, y=value, fill=slope, group=fear_level),
                      color="white", shape=21, size=2.5)
 
# left labels
gg <- gg + geom_text(data=filter(dat, month==as.Date("2014-11-01")),
                     aes(x=month, y=value, label=sprintf("%s — %s  ", fear_level, percent(value))),
                     hjust=1, size=3)
# right labels
gg <- gg + geom_text(data=end_lab,
                     aes(x=as.Date("2015-12-01"), y=value,
                         label=sprintf("  %s — %s", percent(value), lab)),
                     hjust=0, size=3)
 
# Here we do some slightly tricky x-axis formatting to ensure we have enough
# space for the in-panel labels, only show the months we need and have
# the month labels display properly
gg <- gg + scale_x_date(expand=c(0.125, 0),
                        labels=date_format("%b\n%Y"),
                        breaks=c(as.Date("2014-11-01"), as.Date("2015-12-01")),
                        limits=c(as.Date("2014-02-01"), as.Date("2016-12-01")))
gg <- gg + scale_y_continuous()
 
# I used colors from the article
gg <- gg + scale_color_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + scale_fill_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + labs(x=NULL, y=NULL, title="Fear of terror attacks (change since last year)\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.y=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0.5))
gg

* Well, it’s a survey. To add insult to injury, it’s a sentiment-based survey given right after an attack likely to be attributed to terrorism. Also, there is a margin of error that isn’t communicated in either visualization. So while there is “data”, trust it at your own peril.

Microsoft’s newfound desire to make themselves desirable to the hipster development community has caused them to make many things [open](https://github.com/Microsoft/) and/or free of late. One of these manifestations is [Visual Studio Code](https://code.visualstudio.com/), an [Atom](https://atom.io/)-ish editor for us code jockeys. I have friends at Microsoft and the Revolution R folks are there now, so I try to give things from Redmond a shot more than I previously would, especially when they make things for Mac.

VS Code is so much like Atom (or even [Sublime Text](http://www.sublimetext.com/)) that I won’t go into a full review of it. Suffice it to say it has a file selector pane, editor panes, output panes, snippets, theme support and is pretty extensible. One requirement I appreciate is that it forces you to think of code in terms of projects (you select a directory to edit in) and I also appreciate that they made git a first-class citizen.

Since I do not spend much time building large, compiled applications (this—along with web apps—seems to be VS Code’s sweet spot) there isn’t much initial appeal for me. It also lacks the “intellisense” support for the main language I use (R), so I’m left with basic syntax highlighting (the ’90s called and want their basic editor capabilities back).

None of that would initially drive me away from using something like VS Code, and I may end up using it for HTML/CSS/JavaScript projects or even fire it up when I need to do some work in Python or Go. But I won’t be using it for R any time soon. While the aforementioned lack of “intellisense” for R is an issue, I don’t rely heavily on auto-completion for R; it does, however, occasionally speed up typing and definitely helps with the more esoteric function definitions in equally esoteric packages.

The biggest show-stopper for VS Code is the lack of a REPL (a read-eval-print loop) for R. I can fire up an R script in Sublime Text or even Atom and run individual lines of code that are executed in an R session that runs in the background and outputs in an editor pane. It works well but it is (unsurprisingly) a far cry from the tight integration of similar functionality in RStudio. VS Code can run R scripts (it just runs the code through R as you would at the command-line) but has no REPL for R, which means you end up executing the entire script as you go along. No saved state (more on that in a second) means that the beautiful data frame your code created that took 10 minutes to build will take 10 minutes to build every time you tweak model parameters or ggplot2 aesthetics. Granted, you could call R with `--save`, but then you have to check for the presence of data structures in your code (so you might as well be programming in non-interactive Python).
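That check looks something like this (`big_df` and `build_big_df()` are purely illustrative names):

# only rebuild the expensive object if it isn't already in the --save'd workspace
if (!exists("big_df")) {
  big_df <- build_big_df()  # hypothetical 10-minute data-prep step
}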

An offshoot of the details behind this show-stopper is that you do not get graphics output in a window. You get a single PDF of all plots, just as you would if you ran the R script at the command-line. If you’ve been spoiled by RStudio or even cutting and pasting code from an editor into the R GUI, you will immediately miss the graphics viewer pane.

Unless Microsoft (or some community contributors who desperately want to use R in VS Code for some reason) add some of this functionality to VS Code (including support for seamlessly spinning R scripts and knitting R markdown documents), I cannot recommend it to anyone in the R community.

Having said that, here’s the `tasks.json` configuration if you want to be able to hit `Command-Shift-B` in an R script in VS Code and have it execute and display the output. This configuration is for the official R Project build of R and should work even after an R version upgrade.

{
	"version": "0.1.0",
	"command": "/Library/Frameworks/R.framework/Resources/bin/R",
	"showOutput": "always",
	"args": [
		"--no-restore",
		"--no-save",
		"--quiet",
		"--file=${file}"
	]
}

If you are using VS Code for R (on any platform) your comments would be especially welcome. It’d be great to hear why you’re using it and how you’ve configured it to make you as productive as RStudio or ESS has made others.

[MonetDBLite](https://www.monetdb.org/blog/monetdblite-r) (for R) was announced/released today and, while the examples they provide are compelling, there’s a “gotcha” for potential new folks using SQL in general and SQL + MonetDB + R together. The toy example on the site shows dumping `mtcars` with `dbWriteTable` and then doing things. Real-world CSV files have headers and commas (MonetDB by default expects no headers and `|` as a separator). Also, you need to make a MonetDB table (with a schema) before copying your _giant_ CSV file full of data into it. That’s a pain to do by hand.

Here’s another toy example that shows how to:

– use a specific directory for the embedded MonetDB files
– *auto-generate* the `CREATE TABLE` syntax from a sample of the real-world CSV file
– load the data from the real-world CSV file (i.e. skipping the header and using a `,` as a delimiter)
– wire it up to R & dplyr

It’s very similar to the MonetDBLite toy example but may help folks get up and running in the real world with less frustration.

library(MonetDBLite)
library(MonetDB.R)
library(dplyr)
 
# use built-in mtcars to make a CSV file
# we're more likely to find a file in this format vs what dbWriteTable produces
# i.e. it has a header and commas for separator
write.csv(add_rownames(mtcars, "auto"), "mtcars.csv", row.names=FALSE)
 
# make a connection and get rid of the old table if it exists since
# we are just playing around. in real life you probably want to keep
# the giant table there vs recreate it every time
mdb <- dbConnect(MonetDBLite(), "/full/path/to/your/preferred/monetdb/data/dir")
try(invisible(dbSendQuery(mdb, "DROP TABLE mtcars")), silent=TRUE)
 
# now we guess the column types by reading in a small fraction of the rows
guess <- read.csv("mtcars.csv", stringsAsFactors=FALSE, nrows=1000)
create <- sprintf("CREATE TABLE mtcars ( %s )", 
                  paste0(sprintf('"%s" %s', colnames(guess), 
                                 sapply(guess, dbDataType, dbObj=mdb)), collapse=","))
 
# we build the table creation dynamically from what we've learned from guessing
invisible(dbSendQuery(mdb, create))
 
# and then we load the data into the database, skipping the header and specifying a comma
invisible(dbSendQuery(mdb, "COPY OFFSET 2 
                                 INTO mtcars 
                                 FROM '/full/path/to/where/you/wrote/the/csv/to/mtcars.csv' USING  DELIMITERS ','"))
 
# now wire it up to dplyr
mdb_src <- src_monetdb(embedded="/full/path/to/your/preferred/monetdb/data/dir")
mdb_mtcars <- tbl(mdb_src, "mtcars")
 
# and have some fun
count(mdb_mtcars, cyl)
 
## Source: MonetDB  ()
## From: <derived table> [?? x 2]
## 
##      cyl     n
##    (int) (dbl)
## 1      6     7
## 2      4    11
## 3      8    14
## ..   ...   ...

Cybersecurity is a domain that really likes surveys, or at the very least it has many folks within it who like to conduct and report on surveys. One recent survey on threat intelligence is in its second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I’d like to focus on alternate ways to visualize the comparison across years.

We’ll use the data that makes up this chart (Figure 3 from the report), since it’s pretty representative of the remainder of the figures.

Let’s start by reproducing this figure with ggplot2:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(ggthemes)
library(extrafont)

loadfonts(quiet=TRUE)

read.csv("question.csv", stringsAsFactors=FALSE) %>%
  gather(year, value, -belief) %>%
  mutate(year=factor(sub("y", "", year)),
         belief=str_wrap(belief, 40)) -> question

beliefs <- unique(question$belief)
question$belief <- factor(beliefs, levels=rev(beliefs[c(1,2,4,5,3,7,6)]))

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:

library(samplesize4surveys)

moe_2014 <- e4p(19915, 701, 0.5)
## With the parameters of this function: N = 19915 n =  701 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.709879 . 
## The margin of error is 3.635614 . 
## 

moe_2015 <- e4p(18705, 692, 0.5)
## With the parameters of this function: N = 18705 n =  692 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.730449 . 
## The margin of error is 3.655773 .
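
As a quick sanity check, those figures line up with the standard finite-population-corrected margin-of-error formula:

# margin of error with finite population correction (p = 0.5, 95% CI => z = 1.96)
moe <- function(N, n, p=0.5, z=1.96) {
  z * sqrt((N - n) / (N - 1)) * sqrt(p * (1 - p) / n)
}
 
moe(19915, 701) # ~0.0364 (2014)
moe(18705, 692) # ~0.0366 (2015)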

They are both roughly 3.65%, so let's take a look at our dodged bar chart again with this new information:

mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity",
                    position=position_dodge(0.85),
                    color="white", width=0.85)
gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.85),
                         size=1.5, color="#bdbdbd")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Hrm. There seems to be a bit of overlap. Let's just focus on that:

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.25),
                         size=1, color="#bdbdbd", fatten=1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

The report actually makes hard claims based on the year-over-year change in the answers to many of the questions (not just this chart). Most have these overlapping intervals. Now, I understand that a paying customer who asks for a report wouldn't really be satisfied with a one-pager saying "See last year's report", but not communicating the uncertainty in these results seems like a significant omission.

But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a "dumbbell chart".

question %>%
  group_by(belief) %>%
  mutate(line_col=ifelse(diff(value)<0, "2015", "2014"),
         hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>%
  ungroup() -> question

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

I've used line color to indicate whether the 2015 value increased or decreased from 2014.

But, we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin-of-error interval. While it doesn't show the discrete start/end margin for each year, it does help to show that making definitive statements on the value comparisons is not exactly a good idea:

group_by(question, belief) %>%
  summarize(xmin=min(ymin), xmax=max(ymax)) -> band

gg <- ggplot(question)
gg <- gg + geom_segment(data=band,
                        aes(x=xmin, xend=xmax, y=belief, yend=belief),
                        color="#bdbdbd", alpha=0.5, size=3)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col),
                     show.legend=FALSE)
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Finally, the year-to-year nature of the data was just begging for a slopegraph:

question %>% mutate(vjust=0.5) -> question
question[(question$belief=="Makes threat data more actionable") &
           (question$year=="2015"),]$vjust <- -1
question[(question$belief=="Reduces the cost of detecting and\npreventing cyber attacks") &
           (question$year=="2015"),]$vjust <- 1.5

question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018"))

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white")
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=year, y=value,
                         label=sprintf("\u2000%s %s", percent(value),
                                       gsub("\n", " ", belief)),
                         vjust=vjust), hjust=0, size=3)
gg <- gg + geom_text(data=filter(question, year=="2014"),
                     aes(x=year, y=value, label=percent(value)),
                     hjust=1.3, size=3)
gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE)
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0.5))
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

It doesn't help communicate uncertainty but it's a nice alternative to bars.

Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).

Perhaps those conducting surveys (or data analyses in general) could subscribe to a "data visualizer's" paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:

"Practice two things in your dealings with data: either help or do not harm the reader."

The full Rmd and data for this post are in this gist.

This occurrence of the bi-annual corruption of the space-time continuum (i.e. changing to/from standard/daylight time) in the U.S. caused me to make a slight change to the code [from an older post](https://rud.is/b/2014/09/23/seeing-the-daylight-with-r/). The `daylight()` function now auto-discovers the date and location information (via [telize](http://www.telize.com/)) from the caller, which means all you have to do to get a plot like this:

(daylight plot for the auto-detected location)

is to source the [new gist](https://gist.github.com/hrbrmstr/e435d4fa0c31b8e1a9d0) like this:

devtools::source_gist("e435d4fa0c31b8e1a9d0", sha1="64e859227266dc5f9008b3b3959a19fea373fee6")

Remember that you should verify any code before blindly `source`ing it (in R or anywhere else) and make sure to use the SHA1 hash so you know you’re sourcing the proper code (and not potentially being pwnd).

Note that the granularity/accuracy of the geolocation is only as good as the Telize service (which uses MaxMind). The fact that this shows Vermont instead of Maine should make you all think thrice about trusting IP geolocation in general, especially you world-mapping cybersecurity folks.
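For the curious, the location auto-discovery boils down to a call along these lines (a sketch; the gist has the real code):

# IP-based geolocation via the telize service (accuracy courtesy of MaxMind, as noted above)
library(jsonlite)
loc <- fromJSON("http://www.telize.com/geoip")
str(loc)  # the latitude/longitude (and region) fields drive the plot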

Sadly, the darkest of days is still yet to come.