
I woke up this morning to a [headline story from the Washington Post](https://www.washingtonpost.com/news/the-fix/wp/2015/12/10/to-many-christian-terrorists-arent-true-christians-but-muslim-terrorists-are-true-muslims/) on _“Americans are twice as willing to distance Christian extremists from their religion as Muslims”_. This post is not about the content of the headline or story. It _is_ about the horrible pie chart WaPo led the article with:

[The WaPo pie charts in question]

This isn’t just a rant of a madman against pie charts. While I _am_ vehemently opposed to them, we did cover them [in our book](https://books.google.com/books?id=7DqwAgAAQBAJ&pg=PA146&lpg=PA146&dq=data-driven+security+pie+chart&source=bl&ots=Cy1iJylsHd&sig=a6Hz1JB-QYLq6H0VZJpPleJgRkQ&hl=en&sa=X&ved=0ahUKEwj79uqt_tjJAhVG0iYKHS0uDn4Q6AEIMzAH#v=onepage&q=data-driven%20security%20pie%20chart&f=false) and my co-author (@jayjacobs) and the incredibly talented @annkemery both agree there are often cases where they are appropriate. Even using their less-sensitive sensibilities, this would not be one of those cases.

So, what—exactly—is the problem? WaPo tried to enable comparison between pies by exploding them and using colors to indicate similar fear levels, mapping shades to entries in the top legend. Your eye has to move around a bit to take everything in and remember the mapping as you focus on each slice (which you will end up doing, since each category is colored differently). Their whole goal was to enable the reader to see the change in sentiment towards terrorism since this time last year.

Hrm. Two dates. Small set of values. Desire to quickly compare change in value/slope. **This sounds like a job for a slopegraph!**

The article and graphic are based on a [survey](http://publicreligion.org/research/2015/12/survey-nearly-half-of-americans-worried-that-they-or-their-family-will-be-a-victim-of-terrorism/). Thankfully the [complete survey data was made available](http://publicreligion.org/site/wp-content/uploads/2015/12/December-2015-PRRI-RNS-Topline1.pdf), which made it easy to do a makeover (in R of course). Here’s the result:

[Slopegraph makeover of the WaPo chart]

Each category change is clearly visible, you don’t need to remember any color associations, and you even get the actual values*.

The R code is below and in [this gist](https://gist.github.com/hrbrmstr/9bf4f93dffc1df48fe27). How would you make the WaPo chart better (drop a note in the comments with a link to your own makeover)?

library(tidyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(dplyr)
 
# Easiest way to transcribe the PDF table
# The slope calculation will enable us to color the lines/points based on up/down
dat <- data_frame(`2014-11-01`=c(0.11, 0.22, 0.35, 0.31, 0.01),
                  `2015-12-01`=c(0.17, 0.30, 0.30, 0.23, 0.00),
                  slope=factor(sign(`2014-11-01` - `2015-12-01`)),
                  fear_level=c("Very worried", "Somewhat worried", "Not too worried",
                               "Not at all", "Don't know/refused"))
 
# Transform that into something we can use
dat <- gather(dat, month, value, -fear_level, -slope)
 
# We need real dates for the X-axis manipulation
dat <- mutate(dat, month=as.Date(as.character(month)))
 
# Since 2 categories have the same ending value, we need to
# take care of that (this is one of a few "gotchas" in slopegraph preparation)
end_lab <- dat %>%
  filter(month==as.Date("2015-12-01")) %>%
  group_by(value) %>%
  summarise(lab=sprintf("%s", paste(fear_level, collapse=", ")))
 
gg <- ggplot(dat)
 
# line
gg <- gg + geom_line(aes(x=month, y=value, color=slope, group=fear_level), size=1)
# points
gg <- gg + geom_point(aes(x=month, y=value, fill=slope, group=fear_level),
                      color="white", shape=21, size=2.5)
 
# left labels
gg <- gg + geom_text(data=filter(dat, month==as.Date("2014-11-01")),
                     aes(x=month, y=value, label=sprintf("%s — %s  ", fear_level, percent(value))),
                     hjust=1, size=3)
# right labels
gg <- gg + geom_text(data=end_lab,
                     aes(x=as.Date("2015-12-01"), y=value,
                         label=sprintf("  %s — %s", percent(value), lab)),
                     hjust=0, size=3)
 
# Here we do some slightly tricky x-axis formatting to ensure we have enough
# space for the in-panel labels, only show the months we need and have
# the month labels display properly
gg <- gg + scale_x_date(expand=c(0.125, 0),
                        labels=date_format("%b\n%Y"),
                        breaks=c(as.Date("2014-11-01"), as.Date("2015-12-01")),
                        limits=c(as.Date("2014-02-01"), as.Date("2016-12-01")))
gg <- gg + scale_y_continuous()
 
# I used colors from the article
gg <- gg + scale_color_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + scale_fill_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + labs(x=NULL, y=NULL, title="Fear of terror attacks (change since last year)\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.y=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0.5))
gg
* Well, it’s a survey. To add insult to injury, it’s a sentiment-based survey given right after an attack likely to be attributed to terrorism. Also, there is a margin of error that isn’t communicated in either visualization. So while there is “data”, trust it at your own peril.

This week (Thanksgiving-ish, 2015) I start a new adventure at Rapid7! You can read more about why over at [the official announcement](https://community.rapid7.com/community/rapid7-news/blog/2015/11/23/why-i-joined-rapid7).

Rapid7 has an amazing amount of data, a world-class team and a determination to deliver cutting-edge solutions that make it possible for organizations to detect and deter those who seek to do them harm.

I’m super-excited for the opportunity to infuse “data-science” into more of our existing solutions, share our discoveries with the community and deliver even more data-driven products and services to our customers.

Microsoft’s newfound desire to make themselves desirable to the hipster development community has caused them to make many things [open](https://github.com/Microsoft/) and/or free of late. One of these manifestations is [Visual Studio Code](https://code.visualstudio.com/), an [Atom](https://atom.io/)-ish editor for us code jockeys. I have friends at Microsoft and the Revolution R folks are there now, so I try to give things from Redmond a shot more than I previously would, especially when they make things for Mac.

VS Code is so much like Atom (or even [Sublime Text](http://www.sublimetext.com/)) that I won’t go into a full review of it. Suffice it to say it has a file selector pane, editor panes, output panes, snippets, theme support and is pretty extensible. One requirement I appreciate is that it forces you to think of code in terms of projects (you select a directory to edit in) and I also appreciate that they made git a first-class citizen.

Since I do not spend much time building large, compiled applications (this—along with web apps—seems to be VS Code’s sweet spot) there isn’t much initial appeal for me. It also lacks the “intellisense” support for the main language I use (R) so I’m left with basic syntax highlighting (the ’90s called and want their basic editor capabilities back).

None of that would initially drive me away from using something like VS Code and I may end up using it for HTML/CSS/JavaScript projects or even fire it up when I need to do some work in Python or Go. But I won’t be using it for R any time soon. The aforementioned lack of “intellisense” for R is an issue, though not a fatal one: I don’t rely on auto-completion, but it does occasionally speed up typing and it definitely helps with the more esoteric function definitions in equally esoteric packages.

The biggest show-stopper for VS Code is the lack of REPL (a read-eval-print loop) for R. I can fire up an R script in Sublime Text or even Atom and run individual lines of code that are executed in an R session that runs in the background and outputs in an editor pane. It works well but it is (unsurprisingly) a far cry from the tight integration of similar functionality in RStudio. VS Code can run R scripts (it just runs the code through R as you would at the command-line) but has no REPL for R, which means you end up executing the entire script as you go along. No saved state (more on that in a second) means that the beautiful data frame your code created that took 10 minutes to build will take 10 minutes to build every time you tweak model parameters or ggplot2 aesthetics. Granted, you could call R with `--save` but then you have to check for the presence of data structures in your code (so you might as well be programming in non-interactive Python).
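For the curious, that “check for presence” dance looks something like this (a minimal sketch; the object and builder function names are hypothetical):

# run via: R --save --quiet --file=analysis.R
# (R restores the saved workspace by default on the next run)
# guard the expensive step so a re-run reuses the restored object
if (!exists("big_df")) {
  big_df <- build_big_df() # hypothetical ten-minute data frame build
}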

An offshoot of the details behind this show-stopper is that you do not get graphics output in a window. You get a single PDF of all plots, just as you would if you ran the R script at the command-line. If you’ve been spoiled by RStudio or even cutting and pasting code from an editor into the R GUI, you will immediately miss the graphics viewer pane.

Unless Microsoft (or some community contributors who desperately want to use R in VS Code for some reason) adds some of this functionality to VS Code (including support for seamlessly spinning R scripts and knitting R markdown documents), I cannot recommend it to anyone in the R community.

Having said that, here’s the `tasks.json` configuration if you want to be able to hit `Command-Shift-B` in an R script in VS Code and have it execute and display the output. This configuration is for the official R Project build of R and should work even after an R version upgrade (`--no-restore` and `--no-save` keep each run stateless, `--quiet` suppresses the startup banner and `--file` points R at the file in the active editor tab):

{
	"version": "0.1.0",
	"command": "/Library/Frameworks/R.framework/Resources/bin/R",
	"showOutput": "always",
	"args": [
		"--no-restore",
		"--no-save",
		"--quiet",
		"--file=${file}"
	]
}

If you are using VS Code for R (on any platform) your comments would be especially welcome. It’d be great to hear why you’re using it and how you’ve configured it to make you as productive as RStudio or ESS makes others.


Apple made the @justgetflux folks remove their [iOS sideloaded app](https://justgetflux.com/sideload/) due to the use of private APIs (a violation of the Apple Developer agreement). The ZIP archive has been pulled from their site (and it really has, too).

This “sideloading”—i.e. installing directly to your device after compiling it from source—_was_ an interesting way to distribute the app. I visually scanned the source code before sideloading (I code for iOS in both Objective C and Swift) and there seemed to be nothing nefarious in it and it works _really_ well. HOWEVER, I’m 100% sure an Xcode project ZIP archive of f.lux is going to hit the torrent and file-sharing sites pretty quickly. But, I’m also 99% sure that folks who want to do Really Bad Things™ to and with your iOS devices will gladly add some code to that Xcode project and most folks won’t take the time (or do not have the knowledge/experience) to validate the veracity of the code before using it.

So, first I implore iOS users to _not_ grab this sideloaded project from torrent or file-sharing sites since you will be putting your devices at risk if you do so.

Since some folks (but not very many, I suspect, since it does involve real work) will no doubt leave my warning unheeded, *please* run:

shasum -a 256 f.lux-xcode-master.zip

from a Terminal or iTerm2 prompt. If you don’t get:

38f463ee5780a4f2b0160f9fa21dbe3c78c5d80d3c093e4ff553aaca230e2898

as a result *DO NOT SIDE LOAD THE APP*. It means the good/safe f.lux source code/project has been modified by someone since November 11th, 2015 and that you are putting your device (i.e. the confidentiality, integrity and availability of your private data) at risk by installing it. You can also post it to [this site](http://hash.online-convert.com/sha256-generator) (I verified that it produces the correct result).
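If you’d rather stay in R, here’s a quick equivalent with the `digest` package (a sketch; the filename assumes the archive sits in your working directory):

library(digest)

# compute the SHA-256 of the downloaded archive and compare it to the
# known-good hash of the November 11th, 2015 source
digest("f.lux-xcode-master.zip", algo="sha256", file=TRUE) ==
  "38f463ee5780a4f2b0160f9fa21dbe3c78c5d80d3c093e4ff553aaca230e2898"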

And, please, do not jailbreak your devices. You take a relatively safe operating system and pretty much turn it into an Android (check out the [2015 DBIR](http://verizonenterprise.com/DBIR) for more info on iOS vs Android security based on real data vs hype).

[MonetDBLite](https://www.monetdb.org/blog/monetdblite-r) (for R) was announced/released today and, while the examples they provide are compelling, there’s a “gotcha” for potential new folks using SQL in general and SQL + MonetDB + R together. The toy example on the site shows dumping `mtcars` with `dbWriteTable` and then doing things. Real-world CSV files have headers and commas (MonetDB by default expects no headers and `|` as a separator). Also, you need to make a MonetDB table (with a schema) before copying your _giant_ CSV file full of data into it. That’s a pain to do by hand.

Here’s another toy example that shows how to:

- use a specific directory for the embedded MonetDB files
- *auto-generate* the `CREATE TABLE` syntax from a sample of the real-world CSV file
- load the data from the real-world CSV file (i.e. skipping the header and using a `,` as a delimiter)
- wire it up to R & dplyr

It’s very similar to the MonetDBLite toy example but may help folks get up and running in the real world with less frustration.

library(MonetDBLite)
library(MonetDB.R)
library(dplyr)
 
# use built-in mtcars to make a CSV file
# we're more likely to find a file in this format vs what dbWriteTable produces
# i.e. it has a header and commas for separator
write.csv(add_rownames(mtcars, "auto"), "mtcars.csv", row.names=FALSE)
 
# make a connection and get rid of the old table if it exists since
# we are just playing around. in real life you probably want to keep
# the giant table there vs recreate it every time
mdb <- dbConnect(MonetDBLite(), "/full/path/to/your/preferred/monetdb/data/dir")
try(invisible(dbSendQuery(mdb, "DROP TABLE mtcars")), silent=TRUE)
 
# now we guess the column types by reading in a small fraction of the rows
guess <- read.csv("mtcars.csv", stringsAsFactors=FALSE, nrows=1000)
create <- sprintf("CREATE TABLE mtcars ( %s )", 
                  paste0(sprintf('"%s" %s', colnames(guess), 
                                 sapply(guess, dbDataType, dbObj=mdb)), collapse=","))
 
# we build the table creation dynamically from what we've learned from guessing
invisible(dbSendQuery(mdb, create))
 
# and then we load the data into the database, skipping the header and specifying a comma
invisible(dbSendQuery(mdb, "COPY OFFSET 2 
                                 INTO mtcars 
                                 FROM '/full/path/to/where/you/wrote/the/csv/to/mtcars.csv' USING  DELIMITERS ','"))
 
# now wire it up to dplyr
mdb_src <- src_monetdb(embedded="/full/path/to/your/preferred/monetdb/data/dir")
mdb_mtcars <- tbl(mdb_src, "mtcars")
 
# and have some fun
count(mdb_mtcars, cyl)
 
## Source: MonetDB  ()
## From: <derived table> [?? x 2]
## 
##      cyl     n
##    (int) (dbl)
## 1      6     7
## 2      4    11
## 3      8    14
## ..   ...   ...
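And, since it’s a dplyr source, other verbs get translated to SQL and run inside MonetDB as well; only the results come back to R. A quick, illustrative example:

# filtering/aggregation happen in the database, not in R
mdb_mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(avg_hp=mean(hp))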

Cybersecurity is a domain that really likes surveys, or at the very least it has many folks within it who like to conduct and report on surveys. One recent survey on threat intelligence is in its second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I'd like to focus on alternate ways to visualize the comparison across years.

We’ll use the data that makes up this chart (Figure 3 from the report):

[Figure 3 from the report: year-over-year grouped bars]

since it’s pretty representative of the remainder of the figures.

Let’s start by reproducing this figure with ggplot2:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(ggthemes)
library(extrafont)

loadfonts(quiet=TRUE)

read.csv("question.csv", stringsAsFactors=FALSE) %>%
  gather(year, value, -belief) %>%
  mutate(year=factor(sub("y", "", year)),
         belief=str_wrap(belief, 40)) -> question

beliefs <- unique(question$belief)
# reorder the belief factor levels to match the report's ordering
# (the length-7 vector recycles cleanly across the 7 beliefs x 2 years)
question$belief <- factor(beliefs, levels=rev(beliefs[c(1,2,4,5,3,7,6)]))

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:

library(samplesize4surveys)

moe_2014 <- e4p(19915, 701, 0.5)
## With the parameters of this function: N = 19915 n =  701 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.709879 . 
## The margin of error is 3.635614 . 
## 

moe_2015 <- e4p(18705, 692, 0.5)
## With the parameters of this function: N = 18705 n =  692 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.730449 . 
## The margin of error is 3.655773 .
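You can sanity-check those numbers with the textbook margin of error for a proportion plus a finite population correction (my own back-of-the-napkin version, not code from the report):

# MOE = z * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1)) at 95% confidence, in percent
moe <- function(N, n, p=0.5) {
  100 * qnorm(0.975) * sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
}

moe(19915, 701) # ~3.64
moe(18705, 692) # ~3.66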

They are both roughly 3.65%, so let's take a look at our dodged bar chart again with this new information:

mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity",
                    position=position_dodge(0.85),
                    color="white", width=0.85)
gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.85),
                         size=1.5, color="#bdbdbd")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Hrm. There seems to be a bit of overlap. Let's just focus on that:

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.25),
                         size=1, color="#bdbdbd", fatten=1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

The report actually makes hard claims based on the year-over-year change in the answers to many of the questions (not just this chart). Most have these overlapping intervals. Now, I understand that when a paying customer asks for a report, they wouldn't really be satisfied with a one-pager saying "See last year's report", but not communicating the uncertainty in these results seems like a significant omission.

But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a "dumbbell chart".

question %>%
  group_by(belief) %>%
  mutate(line_col=ifelse(diff(value)<0, "2015", "2014"),
         hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>%
  ungroup() -> question

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

I've used line color to indicate whether the 2015 value increased or decreased from 2014.

But, we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin of error interval. While it doesn't show the discrete start/end margin for each year, it does help to show that making definitive statements on the value comparisons is not exactly a good idea:

group_by(question, belief) %>%
  summarize(xmin=min(ymin), xmax=max(ymax)) -> band

gg <- ggplot(question)
gg <- gg + geom_segment(data=band,
                        aes(x=xmin, xend=xmax, y=belief, yend=belief),
                        color="#bdbdbd", alpha=0.5, size=3)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col),
                     show.legend=FALSE)
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Finally, the year-to-year nature of the data was just begging for a slopegraph:

# nudge two right-edge labels that would otherwise overlap
question %>% mutate(vjust=0.5) -> question
question[(question$belief=="Makes threat data more actionable") &
           (question$year=="2015"),]$vjust <- -1
question[(question$belief=="Reduces the cost of detecting and\npreventing cyber attacks") &
           (question$year=="2015"),]$vjust <- 1.5

# pad the year factor with empty levels so the long right-hand labels
# have room (scale_x_discrete(drop=FALSE) keeps the empty slots)
question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018"))

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white")
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=year, y=value,
                         label=sprintf("\u2000%s %s", percent(value),
                                       gsub("\n", " ", belief)),
                         vjust=vjust), hjust=0, size=3)
gg <- gg + geom_text(data=filter(question, year=="2014"),
                     aes(x=year, y=value, label=percent(value)),
                     hjust=1.3, size=3)
gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE)
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

It doesn't help communicate uncertainty but it's a nice alternative to bars.

Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).

Perhaps those conducting surveys (or data analyses in general) could subscribe to a "data visualizer's" paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:

"Practice two things in your dealings with data: either help or do not harm the reader."

The full Rmd and data for this post are in this gist.

This occurrence of the bi-annual corruption of the space-time continuum (i.e. changing to/from standard/daylight time) in the U.S. caused me to make a slight change to the code [from an older post](https://rud.is/b/2014/09/23/seeing-the-daylight-with-r/). The `daylight()` function now auto-discovers the date and location information (via [telize](http://www.telize.com/)) from the caller, which means all you have to do to get a plot like this:

[The daylight() plot]

is to source the [new gist](https://gist.github.com/hrbrmstr/e435d4fa0c31b8e1a9d0) like this:

devtools::source_gist("e435d4fa0c31b8e1a9d0", sha1="64e859227266dc5f9008b3b3959a19fea373fee6")

Remember that you should verify any code before blindly `source`ing it (in R or anywhere else) and make sure to use the SHA1 hash so you know you’re sourcing the proper code (and not potentially being pwnd).

Note that the granularity/accuracy of the geolocation is only as good as the Telize service (which uses MaxMind). The fact that this shows Vermont instead of Maine should make you all think thrice about trusting IP geolocation in general, especially you world-mapping cybersecurity folks.
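For the curious, the location auto-discovery boils down to something like this (a sketch, assuming telize’s `/geoip` JSON endpoint and its `latitude`/`longitude` fields):

library(jsonlite)

# telize returns the caller's IP geolocation as JSON; the latitude and
# longitude then feed the sunrise/sunset calculations
where <- fromJSON("http://www.telize.com/geoip")
where$latitude
where$longitude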

Sadly, the darkest of days is still yet to come.

Junk Charts [adeptly noted and fixed](http://junkcharts.typepad.com/junk_charts/2015/10/is-it-worth-the-drama.html) this excessively stylized chart from the WSJ this week:

[The WSJ avian flu chart]

Their take on it does reduce the ZOMGOSH WE ARE DOOMED! look and feel of the WSJ chart:

[The Junk Charts makeover]

But, we can further reduce the drama by using a more neutral color encoding _and_ encoding both the # of outbreaks and the total size of the impacted flock populations _per week_ with a lollipop chart (and, thankfully, the USDA makes this data readily available):

library(xml2)
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(viridis)
library(ggthemes)
 
pg <- read_html("https://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVNNb-IwEP0tPewx2OSbI_QDwkdBRbuQXKyJ4yTWJnYUG1D-fZ10D7RqadcHS555M_PmPRkl6IgSAWdegOZSQNW_E58stwt7PMN2NN9PHnH0_OdpE64DZ7twDSA2APzFmeL39dtV5Pf1L3i2eBjjvYMOKEEJFbrRJYqhKbkiVArNhCYVT1tou19YAZGnluSSntTwAsFrqEjJoNLldSTjioFihItctvWwxFv6zEFc4zOmGe3TiqQdURo0M62pJsBJA5TnnJK86i7Q9fwayjMU57ZPAezccnwvtdwM21aah9hcGHtuCP6Y5v_0uLHwj_S8n08XbrA2CrqhjaMHUxxMNhhH_nf1g57fdBgAtyz7MGT-ODNDnta7YLW08cpDsSEZfMli4qL9f6q2_IEsdru53xSmLejS6g1Gx5vGv6WvjB8CnxmPjp8af5ihxJNBpIqeX1HJdPgQ8VSkTmiItCxnLWtHpVQaHS-Xy-ikMhgV8oya-ncdOh23_r6E2PGqYrerD9O7u1eBlNG5/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_full_list")
 
dat <- html_table(html_nodes(pg, "table"))[[1]]
 
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         week = format(`Confirmation date`, "%Y-%U"),
         week_start = as.Date(sprintf("%s-1", week), "%Y-%U-%u") ,
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  select(week, week_start, `Flock size`) %>% 
  filter(!is.na(`Flock size`)) %>% 
  group_by(week_start) %>% 
  summarize(outbreaks=n(), 
            flock_total=sum(`Flock size`)) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=week_start, y=outbreaks))
gg <- gg + geom_vline(xintercept=as.numeric(first$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=week_start, y=25), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=week_start, y=25), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=week_start, xend=week_start, y=0, yend=outbreaks, color=flock_total), size=0.5)
gg <- gg + geom_point(aes(size=flock_total, fill=flock_total), shape=21)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="# Outbreaks", title="Avian Flu Impact by Week (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[Lollipop chart: Avian Flu Impact by Week (2015)]

If we really want to see the discrete events, we can do that with our less-ZOMGOSH color scheme, too:

dat <- html_table(html_nodes(pg, "table"))[[1]]
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  filter(!is.na(`Flock size`)) %>% 
  rename(date=`Confirmation date`) %>% 
  arrange(date) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=date, y=`Flock size`))
gg <- gg + geom_vline(xintercept=as.numeric(first$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=date, y=3000000), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=date, y=3000000), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=date, xend=date, y=0, yend=`Flock size`, color=`Flock size`), size=0.5, alpha=1)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + scale_y_continuous(label=comma)
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="Flock size", title="Avian Flu Impact (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[Per-detection chart: Avian Flu Impact (2015)]

Neither of those is ever going to sell any ads, tho.