hrbrmstr, Author at rud.is

Don't look at me…I do what he does — just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Using MonetDB[Lite] with real-world CSV files

[MonetDBLite](https://www.monetdb.org/blog/monetdblite-r) (for R) was announced/released today and, while the examples they provide are compelling, there’s a “gotcha” for potential new folks using SQL in general and SQL + MonetDB + R together. The toy example on the site shows dumping `mtcars` with `dbWriteTable` and then doing things. Real-world CSV files have headers and commas (MonetDB by default expects no headers and `|` as a separator). Also, you need to make a MonetDB table (with a schema) before copying your _giant_ CSV file full of data into it. That’s a pain to do by hand.

Here’s another toy example that shows how to:

– use a specific directory for the embedded MonetDB files
– *auto-generate* the `CREATE TABLE` syntax from a sample of the real-world CSV file
– load the data from the real-world CSV file (i.e. skipping the header and using a `,` as a delimiter)
– wire it up to R & dplyr

It’s very similar to the MonetDBLite toy example but may help folks get up and running in the real world with less frustration.

library(MonetDBLite)
library(MonetDB.R)
library(dplyr)
 
# use built-in mtcars to make a CSV file
# we're more likely to find a file in this format vs what dbWriteTable produces
# i.e. it has a header and commas for separator
write.csv(add_rownames(mtcars, "auto"), "mtcars.csv", row.names=FALSE)
 
# make a connection and get rid of the old table if it exists since
# we are just playing around. in real life you prbly want to keep
# the giant table there vs recreate it every time
mdb <- dbConnect(MonetDBLite(), "/full/path/to/your/preferred/monetdb/data/dir")
try(invisible(dbSendQuery(mdb, "DROP TABLE mtcars")), silent=TRUE)
 
# now we guess the column types by reading in a small fraction of the rows
guess <- read.csv("mtcars.csv", stringsAsFactors=FALSE, nrows=1000)
create <- sprintf("CREATE TABLE mtcars ( %s )", 
                  paste0(sprintf('"%s" %s', colnames(guess), 
                                 sapply(guess, dbDataType, dbObj=mdb)), collapse=","))
 
# we build the table creation dynamically from what we've learned from guessing
invisible(dbSendQuery(mdb, create))
 
# and then we load the data into the database, skipping the header and specifying a comma
invisible(dbSendQuery(mdb, "COPY OFFSET 2 
                                 INTO mtcars 
                                 FROM '/full/path/to/where/you/wrote/the/csv/to/mtcars.csv' USING  DELIMITERS ','"))
 
# now wire it up to dplyr
mdb_src <- src_monetdb(embedded="/full/path/to/your/preferred/monetdb/data/dir")
mdb_mtcars <- tbl(mdb_src, "mtcars")
 
# and have some fun
count(mdb_mtcars, cyl)
 
## Source: MonetDB  ()
## From: <derived table> [?? x 2]
## 
##      cyl     n
##    (int) (dbl)
## 1      6     7
## 2      4    11
## 3      8    14
## ..   ...   ...
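
If you'd rather eyeball the `CREATE TABLE` text before touching a database at all, the guessing step can be sketched in plain R. This is only a sketch: `sql_type()` and `build_create_table()` are hypothetical helpers (they are not part of MonetDB.R), and the type mapping is my assumption standing in for whatever `dbDataType()` returns for your connection.

```r
# Hypothetical helper: map an R column to a generic SQL type name.
# The mapping is an assumption; dbDataType() may pick different types.
sql_type <- function(x) {
  if (is.logical(x)) "BOOLEAN"
  else if (is.integer(x)) "INTEGER"
  else if (is.numeric(x)) "DOUBLE PRECISION"
  else "VARCHAR(255)"
}

# Hypothetical helper: read a sample of the CSV and build the DDL string.
build_create_table <- function(csv_path, table_name, sample_rows = 1000) {
  guess <- read.csv(csv_path, stringsAsFactors = FALSE, nrows = sample_rows)
  cols <- sprintf('"%s" %s', colnames(guess),
                  vapply(guess, sql_type, character(1)))
  sprintf("CREATE TABLE %s ( %s )", table_name, paste(cols, collapse = ", "))
}

# Try it on a small CSV with a header row and comma separators
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(auto = rownames(mtcars), mtcars[, c("mpg", "cyl")]),
          tmp, row.names = FALSE)

ddl <- build_create_table(tmp, "mtcars")
cat(ddl, "\n")
```

Printing the statement first makes it easy to spot a column that was guessed wrong from the sample (say, a numeric ID that should be text) before the big load starts.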

Visualizing Survey Data : Comparison Between Observations

2015-11-08 – 12:46
Posted in Cybersecurity, Data Analysis, data driven security, Data Visualization, DataVis, DataViz, ggplot, R, slopegraph

Cybersecurity is a domain that really likes surveys, or at the very least it has many folks within it that like to conduct and report on surveys. One recent survey on threat intelligence is in its second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I’d like to focus on alternate ways to visualize the comparison across years.

We’ll use the data that makes up this chart (Figure 3 from the report):

since it’s pretty representative of the remainder of the figures.

Let’s start by reproducing this figure with ggplot2:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(ggthemes)
library(extrafont)

loadfonts(quiet=TRUE)

read.csv("question.csv", stringsAsFactors=FALSE) %>%
  gather(year, value, -belief) %>%
  mutate(year=factor(sub("y", "", year)),
         belief=str_wrap(belief, 40)) -> question

beliefs <- unique(question$belief)
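# build the factor from the full column (not the 7 unique values) so nothing relies on recycling
question$belief <- factor(question$belief, levels=rev(beliefs[c(1,2,4,5,3,7,6)]))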

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg
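
The reshaping step at the top of that code expects `question.csv` to be "wide": one row per belief and one column per year (the `sub("y", "", year)` call suggests column names like `y2014` and `y2015`). Here is a base R sketch of the same wide-to-long move. The belief labels and values are made up for illustration; they are not the survey's numbers.

```r
# Made-up stand-in for question.csv (NOT the survey data)
wide <- data.frame(
  belief = c("Reason A", "Reason B"),
  y2014  = c(0.40, 0.30),
  y2015  = c(0.55, 0.25),
  stringsAsFactors = FALSE
)

# Stack the year columns into one row per belief/year pair
long <- data.frame(
  belief = rep(wide$belief, times = 2),
  year   = factor(rep(c("2014", "2015"), each = nrow(wide))),
  value  = c(wide$y2014, wide$y2015),
  stringsAsFactors = FALSE
)

print(long)
```

The long form is what lets ggplot2 map `year` to fill/color and `value` to position in a single layer.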

Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:

library(samplesize4surveys)

moe_2014 <- e4p(19915, 701, 0.5)
## With the parameters of this function: N = 19915 n =  701 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.709879 . 
## The margin of error is 3.635614 . 
## 

moe_2015 <- e4p(18705, 692, 0.5)
## With the parameters of this function: N = 18705 n =  692 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.730449 . 
## The margin of error is 3.655773 .
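
If you don't have that package handy, the same numbers can be approximated by hand. The sketch below assumes simple random sampling with a finite population correction and a normal approximation; I'm inferring that formula from the printed output rather than from the package source, and `moe_pct()` is a name I made up.

```r
# Margin of error (in percentage points) for a proportion P, sample size n,
# population size N. Assumed formula: z * sqrt((1 - n/N) * P * (1 - P) / n)
moe_pct <- function(N, n, P = 0.5, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)             # two-sided critical value
  se <- sqrt((1 - n / N) * P * (1 - P) / n)   # standard error with fpc
  100 * z * se
}

moe_2014_by_hand <- moe_pct(19915, 701)
moe_2015_by_hand <- moe_pct(18705, 692)

print(c(moe_2014_by_hand, moe_2015_by_hand))
```

Both values should land very close to the margins of error printed above. Using `P = 0.5` is the conservative choice since it maximizes `P * (1 - P)`.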

They are both roughly 3.65% so let's take a look at our dodged bar chart again with this new information:

mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity",
                    position=position_dodge(0.85),
                    color="white", width=0.85)
gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.85),
                         size=1.5, color="#bdbdbd")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Hrm. There seems to be a bit of overlap. Let's just focus on that:

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.25),
                         size=1, color="#bdbdbd", fatten=1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

The report actually makes hard claims based on the year-over-year change in the answers to many of the questions (not just this chart). Most have these overlapping intervals. Now, I understand that a paying customer who asks for a report wouldn't really be satisfied with a one-pager saying "See last year's report", but not communicating the uncertainty in these results seems like a significant omission.
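
A quick way to flag which comparisons have overlapping intervals, without a chart, is to test whether the two point estimates are within twice the margin of error of each other. This is a rough sketch with made-up proportions (not the survey's values), `intervals_overlap()` is a hypothetical helper, and overlapping intervals are a conservative eyeball check rather than a formal significance test.

```r
# TRUE when [p1 - moe, p1 + moe] and [p2 - moe, p2 + moe] overlap
intervals_overlap <- function(p1, p2, moe = 0.0365) {
  abs(p1 - p2) <= 2 * moe
}

# Made-up year-over-year pairs for illustration
y2014 <- c(0.50, 0.60)
y2015 <- c(0.55, 0.70)

overlap <- intervals_overlap(y2014, y2015)
print(overlap)
```

With these toy numbers the first pair (a 5 point change) overlaps and the second (a 10 point change) does not.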

But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a "dumbbell chart".

question %>%
  group_by(belief) %>%
  mutate(line_col=ifelse(diff(value)<0, "2015", "2014"),
         hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>%
  ungroup() -> question

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

I've used line color to indicate whether the 2015 value increased or decreased from 2014.
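
The direction logic boils down to taking `diff()` of the two yearly values within each belief. Here it is in isolation as a base R sketch on made-up values (not the survey's). It assumes the rows within each belief are ordered 2014 then 2015, which is what makes the sign of `diff()` meaningful.

```r
# Made-up data, ordered by year within each belief
toy <- data.frame(
  belief = rep(c("A", "B"), each = 2),
  year   = rep(c("2014", "2015"), times = 2),
  value  = c(0.40, 0.55, 0.30, 0.25),
  stringsAsFactors = FALSE
)

# One change value per belief: 2015 minus 2014
change <- sapply(split(toy$value, toy$belief), diff)

direction <- ifelse(change < 0, "decreased", "increased")
print(direction)
```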

But, we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin of error interval. While it doesn't show the discrete start/end margin for each year it does help to show that making definitive statements on the value comparisons is not exactly a good idea:

group_by(question, belief) %>%
  summarize(xmin=min(ymin), xmax=max(ymax)) -> band

gg <- ggplot(question)
gg <- gg + geom_segment(data=band,
                        aes(x=xmin, xend=xmax, y=belief, yend=belief),
                        color="#bdbdbd", alpha=0.5, size=3)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col),
                     show.legend=FALSE)
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Finally, the year-to-year nature of the data was just begging for a slopegraph:

question %>% mutate(vjust=0.5) -> question
question[(question$belief=="Makes threat data more actionable") &
           (question$year=="2015"),]$vjust <- -1
question[(question$belief=="Reduces the cost of detecting and\npreventing cyber attacks") &
           (question$year=="2015"),]$vjust <- 1.5

question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018"))

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white")
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=year, y=value,
                         label=sprintf("\u2000%s %s", percent(value),
                                       gsub("\n", " ", belief)),
                         vjust=vjust), hjust=0, size=3)
gg <- gg + geom_text(data=filter(question, year=="2014"),
                     aes(x=year, y=value, label=percent(value)),
                     hjust=1.3, size=3)
gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE)
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

It doesn't help communicate uncertainty but it's a nice alternative to bars.

Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).

Perhaps those conducting surveys (or data analyses in general) could subscribe to a "data visualizers" paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:

"Practice two things in your dealings with data: either help or do not harm the reader."

The full Rmd and data for this post are in this gist.

An Ephemeral Update to daylight()

2015-11-02 – 08:45
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R

This occurrence of the bi-annual corruption of the space-time continuum (i.e. changing to/from standard/daylight time) in the U.S. caused me to make a slight change to the code [from an older post](https://rud.is/b/2014/09/23/seeing-the-daylight-with-r/). The `daylight()` function now auto-discovers the date and location information (via [telize](http://www.telize.com/)) from the caller, which means all you have to do to get a plot like this:

is to source the [new gist](https://gist.github.com/hrbrmstr/e435d4fa0c31b8e1a9d0) like this:

devtools::source_gist("e435d4fa0c31b8e1a9d0", sha1="64e859227266dc5f9008b3b3959a19fea373fee6")

Remember that you should verify any code before blindly `source`ing it (in R or anywhere else) and make sure to use the SHA1 hash so you know you’re sourcing the proper code (and not potentially being pwnd).

Note that the granularity/accuracy of the geolocation is only as good as the Telize service (which uses MaxMind). The fact that this shows Vermont instead of Maine should make you all think thrice about trusting IP geolocation in general, especially you world-mapping cybersecurity folks.

Sadly, the darkest of days is still yet to come.

Less Drama, More Encoding

2015-10-24 – 10:43
Posted in Charts & Graphs, Data Analysis, Data Visualization, DataVis, DataViz, ggplot, R

Junk Charts [adeptly noted and fixed](http://junkcharts.typepad.com/junk_charts/2015/10/is-it-worth-the-drama.html) this excessively stylized chart from the WSJ this week:

Their take on it does reduce the ZOMGOSH WE ARE DOOMED! look and feel of the WSJ chart:

But, we can further reduce the drama by using a more neutral color encoding _and_ encode both the # of outbreaks and total size of the impacted flock populations _per week_ with a lollipop chart (and, thankfully the USDA makes this data readily available):

library(xml2)
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(viridis)
library(ggthemes)
 
pg <- read_html("https://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVNNb-IwEP0tPewx2OSbI_QDwkdBRbuQXKyJ4yTWJnYUG1D-fZ10D7RqadcHS555M_PmPRkl6IgSAWdegOZSQNW_E58stwt7PMN2NN9PHnH0_OdpE64DZ7twDSA2APzFmeL39dtV5Pf1L3i2eBjjvYMOKEEJFbrRJYqhKbkiVArNhCYVT1tou19YAZGnluSSntTwAsFrqEjJoNLldSTjioFihItctvWwxFv6zEFc4zOmGe3TiqQdURo0M62pJsBJA5TnnJK86i7Q9fwayjMU57ZPAezccnwvtdwM21aah9hcGHtuCP6Y5v_0uLHwj_S8n08XbrA2CrqhjaMHUxxMNhhH_nf1g57fdBgAtyz7MGT-ODNDnta7YLW08cpDsSEZfMli4qL9f6q2_IEsdru53xSmLejS6g1Gx5vGv6WvjB8CnxmPjp8af5ihxJNBpIqeX1HJdPgQ8VSkTmiItCxnLWtHpVQaHS-Xy-ikMhgV8oya-ncdOh23_r6E2PGqYrerD9O7u1eBlNG5/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_full_list")
 
dat <- html_table(html_nodes(pg, "table"))[[1]]
 
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         week = format(`Confirmation date`, "%Y-%U"),
         week_start = as.Date(sprintf("%s-1", week), "%Y-%U-%u") ,
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  select(week, week_start, `Flock size`) %>% 
  filter(!is.na(`Flock size`)) %>% 
  group_by(week_start) %>% 
  summarize(outbreaks=n(), 
            flock_total=sum(`Flock size`)) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=week_start, y=outbreaks))
gg <- gg + geom_vline(xintercept=as.numeric(first$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=week_start, y=25), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=week_start, y=25), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=week_start, xend=week_start, y=0, yend=outbreaks, color=flock_total), size=0.5)
gg <- gg + geom_point(aes(size=flock_total, fill=flock_total), shape=21)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="# Outbreaks", title="Avian Flu Impact by Week (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg
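
The weekly bucketing in that pipeline is easy to miss, so here it is on its own. The two dates are arbitrary examples I picked (not rows from the USDA table). The idea: format each date as year plus week-of-year (`%U`), then parse that label back with a weekday of 1 to get a single date per week to plot against. How `%U` is parsed can vary by platform, so treat this as a sketch.

```r
# Two arbitrary example dates that fall in the same calendar week
d <- as.Date(c("2015-04-15", "2015-04-17"))

# Label each date with its year and week-of-year
week <- format(d, "%Y-%U")

# Turn the label back into a date: weekday 1 (Monday) of that week
week_start <- as.Date(sprintf("%s-1", week), "%Y-%U-%u")

print(data.frame(date = d, week = week, week_start = week_start))
```

Grouping by `week_start` instead of the `week` string keeps the x axis a real `Date`, which is what `scale_x_date()` needs.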

If we really want to see the discrete events, we can do that with our less-ZOMGOSH color scheme, too:

dat <- html_table(html_nodes(pg, "table"))[[1]]
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  filter(!is.na(`Flock size`)) %>% 
  rename(date=`Confirmation date`) %>% 
  arrange(date) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=date, y=`Flock size`))
gg <- gg + geom_vline(xintercept=as.numeric(first$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=date, y=3000000), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=date, y=3000000), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=date, xend=date, y=0, yend=`Flock size`, color=`Flock size`), size=0.5, alpha=1)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + scale_y_continuous(label=comma)
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="Flock size", title="Avian Flu Impact (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

Neither of those is ever going to sell any ads, tho.

Installing R on OS X – “100% Homebrew Edition”

In a previous post I provided “mouse-heavy” instructions for getting R running on your Mac. A few of the comments suggested that an “all Homebrew” solution may be preferable for some folks. Now, there are issues with this since getting “support” for what may be R issues will be very difficult on the official mailing lists as you’ll immediately be told to “use the official distribution” by some stalwart R folks (this happens on StackOverflow and other forums as well). However, if you have a thick skin and can be somewhat self-sustaining, Homebrew is a superb alternative to setting up your R environment (and other things) on your OS X system.

What is ‘Homebrew’?

Homebrew is the “missing package manager for OS X”. It’s similar to apt, yum and other package managers on linux/BSD that enable you to install open source (and other types of) packages without having to do the download→unarchive→compile→curse→google→compile→curse→google some more→compile→smile→test→install dance manually. MacPorts is another third-party package manager for OS X, but I use Homebrew, so you get Homebrew examples here.

Homebrew’s inventory of packages comes from github repositories that contain “Formulas” for where to get package components and how to (literally) make them work on OS X. Here’s a small-ish example (some Formula are fairly long/involved/complex) of a Homebrew Formula for the cowsay utility (what, you don’t like to have ASCII animals give you handy messages?):

class Cowsay < Formula
  desc "Configurable talking characters in ASCII art"
  homepage "https://web.archive.org/web/20120225123719/http://www.nog.net/~tony/warez/cowsay.shtml"
  url "http://ftp.acc.umu.se/mirror/cdimage/snapshot/Debian/pool/main/c/cowsay/cowsay_3.03.orig.tar.gz"
  sha256 "0b8672a7ac2b51183780db72618b42af8ec1ce02f6c05fe612510b650540b2af"

  bottle do
    cellar :any_skip_relocation
    revision 1
    sha256 "c041ce7fbf41fd89bf620ae848e3b36fe1e69ab3e2dfca18bc2f2e79cfe8063a" => :el_capitan
    sha256 "ffacfb987481394174267fd987dea52607825e3542d1ea3d0b7aa4ccf7ea5cc5" => :yosemite
    sha256 "12c41b969af30817a4dc7ec25572fe1b707b9d4dcb46d8cc06d22264594219c1" => :mavericks
  end

  # Official download is 404:
  # url "http://www.nog.net/~tony/warez/cowsay-3.03.tar.gz"

  def install
    system "/bin/sh", "install.sh", prefix
    mv prefix/"man", share
  end

  test do
    output = shell_output("#{bin}/cowsay moo")
    assert output.include?("moo")  # bubble
    assert output.include?("^__^") # cow
  end
end

It has:

– a description of what the package is
– the official location of the program/library’s “home”
– the main URL for the contents of the program/library
– weird hex strings to help the Homebrew ecosystem not pwn you
– instructions for how to install (with optional patching of problematic code on particular setups)
– test/validation instructions

(You can/should overlook the fact they use icky Ruby for this whole thing.)

There are thousands of Formula in the main Homebrew repository and you can “tap” other (properly organized) GitHub repositories for other (usually task-specific) formula. We’ll need to do this for R.

Finally, the Homebrew community has also come up with the notion of Casks where actual binary OS X programs (and other things) can be installed. We’ll use this as well for some ancillary components.

Apart from the ease of initial setup, the truly compelling part of using Homebrew is that all it takes to update components/libraries is to do a:

brew update && brew upgrade

from a Terminal prompt. You should get into the habit of issuing those commands daily-ish.

Yes, you will need to become comfortable in the Terminal (or, preferably, iTerm 2) to use the Homebrew ecosystem (though there are some efforts to make this more GUI-user friendly).

Using Homebrew to Create & Maintain Your R Installation

I won’t provide much (if any) color commentary to the commands below. Suffice it to say that in a few short lines of a script, you’ll end up having:

– R (with gfortran and the vast majority of required support libraries for many packages)
– Oracle Java (a later step in the sequence ensures R knows about your Java install)
– X11 (XQuartz)
– MacTeX
– RStudio
– extra SVG, XML, curl, geo-munging and C++ support libraries
– A cool font for RStudio (FiraCode, though that’s not necessary)
– iTerm 2 (optional)
– GitUp git gui (optional)

If it’s your first time using Homebrew you’ll need to do this:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

NOTE that I’m generally against piping output of curl to run raw on your system without validation, so you can download https://raw.githubusercontent.com/Homebrew/install/master/install and verify it (or ask a security friend to verify it), but you’ll be trusting the Homebrew ecosystem to not pwn your system for the rest of your time using it, so as long as you trust that I gave you the proper URL to get to the installer, cut/paste away.

Once Homebrew is set up I’d recommend copying and pasting each line (below) one-by-one to get familiar with Homebrew operations and messaging.

This can be a pretty scary experience if you’re not used to the command-line, so drop a note in the comments or on Twitter (target @hrbrmstr and use the #rstats #homebrew tags) if some things go awry or get confusing and I’ll try to help as much as I can.

brew tap caskroom/cask
brew install brew-cask

brew install Caskroom/cask/xquartz
brew cask install java

brew tap homebrew/science
brew install R --with-openblas # added --with-openblas based on a spiffy comment by Lincoln Mullen

brew install Caskroom/cask/rstudio


# For latex:
brew cask install mactex

# OR YOU CAN DO
brew install gnupg
brew cask install basictex # suggested by @noamross
# plus the following, provided by Kras.Pten
# read why here: http://rud.is/b/2015/10/22/installing-r-on-os-x-100-homebrew-edition/#comment-10603
sudo tlmgr update --self
sudo tlmgr update --all
sudo tlmgr install titling framed inconsolata

# DO NOT DO BOTH!

brew install libsvg curl libxml2 gdal geos boost

R CMD javareconf JAVA_CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Headers

brew tap caskroom/fonts
brew cask install font-fira-code
brew cask install iterm2
brew cask install gitup           # if you want a GUI for git stuff (h/t @jennybryan)

That’s quite a bit less clicking/typing than what was required in the previous post.

Fin

I validated that entire configuration on a completely fresh installation of El Capitan (OS X 10.11) in a VM. At the end, I had a fully-functioning data-science setup. Hopefully, you will as well.
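
One way to sanity-check the result from inside R itself is to ask the session what it can see. This is a rough sketch, not an exhaustive test: the list of command names is my guess at what the steps above put on the `PATH`, and an empty string from `Sys.which()` just means that command was not found.

```r
# Which external tools can this R session find on the PATH?
# (command names are assumptions based on the packages installed above)
tools <- Sys.which(c("brew", "java", "pdflatex", "gdal-config"))
print(tools)

# Was this R build compiled with these graphics capabilities?
print(capabilities(c("cairo", "png", "jpeg")))

# Which R is actually running?
print(R.version.string)
```

Run it from RStudio as well as from R in the Terminal: GUI apps on OS X don't always inherit the same `PATH`, so the two can disagree.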

If you have suggestions for other Homebrew things to add to make a good “data science OS X setup”, drop a note in the comments!

P.S.

Once you have a full Homebrew & “Cask” setup, the way to keep up-to-date with everything is more like:

brew update && brew upgrade brew-cask && brew cleanup && brew cask cleanup

but a complete “how to use Homebrew” guide is beyond the scope of this post.

Installing R on OS X

NOTE: The comments are a must read for this. Some excellent additional advice and “gotchas” by some super-helpful readers.

I was in a conversation with an academic colleague (wicked smart dude) and the subject of installing R came up (NOTE: this will happen to you, too, if you ever have the misfortune to have a face-to-face convo with me ;-). They noted that getting up and running with R was not as seamless as one would like it to be and, to be honest, I have to agree, especially after typing the rest of this post out.

I recently had a similar experience helping folks who use Windows get R & RStudio up and running and that’s even more of a nightmare, especially if you do not have Administrator privileges (or, perhaps I just scare easily).

Prior to these experiences, I never really stopped to consider just how much less friendly the installation process of R is when compared to Excel, Tableau or other apps one might use for data analysis and visualization. Hopefully this will become a top priority for the R Consortium.

Since this colleague uses OS X, I offered to put together instructions for how to get R & RStudio installed and finally had 5 minutes to crank out a blog post to help the broader community with the information.

Get R

Go to https://cran.rstudio.com/bin/macosx/
If you’re on a recent version of OS X, download the first .pkg link. As of this post, it’s R-3.2.2.pkg and the direct URL is https://cran.rstudio.com/bin/macosx/R-3.2.2.pkg
Install the package by double-clicking on it and working through the prompts.

Verify R itself is working

Look in the Applications folder for the R application.
Double-click it and you should see an R console window.
If that did not work, try installing R again
Once you’ve verified R is working, quit the app

Download RStudio

RStudio is an integrated development environment for R that will make your life and coding easier.

Go to https://www.rstudio.com/products/rstudio/download/ and download the RStudio version for Mac OS X. It’ll say something like RStudio 0.99.486 - Mac OS X 10.6+ (64-bit). The direct URL for that version (which was current at the time of this post) is https://download1.rstudio.org/RStudio-0.99.486.dmg
Double-click that DMG file
Drag the RStudio icon to the Applications folder.

Verify RStudio & R are working together

Look in the Applications folder for the RStudio application.
Double-click it and you should see an RStudio window with four panes.

From now on, just start RStudio when you want to work in R.

[Optional] Install XQuartz

Some functions in R require an “X11 Server” and/or libraries associated with an X11 server. Apple does not provide this software with OS X anymore so you have to do it on your own via a third-party application called XQuartz.

Go to http://xquartz.macosforge.org/landing/ and download the top-most “quick download” disk image file. As of this post, that’s version 2.7.8 and this is the direct link http://xquartz.macosforge.org/downloads/SL/XQuartz-2.7.8.dmg
Double click the downloaded disk image then double click on the XQuartz.pkg and follow the installation steps.

[Optional] Install Xcode Tools

Some R packages require compilation. That requires utilities not installed on OS X by default. You can wait to do the following until it’s needed, but since you’re already installing things…

Get Xcode https://itunes.apple.com/us/app/xcode/id497799835?mt=12 and install it like any “normal” Mac application
When the installation is done, open Xcode then close it just to verify it installed correctly
Find and open the Terminal program in the Utilities folder under the Applications folder
Paste the following into the Terminal window and hit enter/return (accept any dialog/prompt):

xcode-select --install

Close the Terminal application
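
To check from R whether the command line tools are visible, something like the following should do. It's a minimal sketch: the three command names are my assumption of what a source package build typically looks for, and finding them on the `PATH` doesn't guarantee every package will compile.

```r
# Look for common build tools on the PATH ("" means not found)
build_tools <- Sys.which(c("make", "clang", "gcc"))

not_found <- names(build_tools)[build_tools == ""]

if (length(not_found) == 0) {
  message("Build tools found; source packages have a fighting chance.")
} else {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
}
```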

[Optional] Set yourself up for easier future compiled package installation

Some R packages need additional libraries to work and most aren’t on your system by default. There are a myriad of ways to get these libraries, and the way I obtain them is via the homebrew utility. You can save yourself the trouble of installing homebrew later by doing the following now:

Find and open the Terminal program in the Utilities folder under the Applications folder
Paste the following into the Terminal window and hit enter/return:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Read and accept the various prompts until it’s installed
Close the Terminal application

You can now do brew install xyz in the future when a library is needed to support a package. Drop a note in the comments if you’d like this discussed more in a future blog post.

[Optional] [If you have an hour+ to kill] Install MacTeX

R has an academic history and there are many semi-advanced functions that are tied to something called latex. Installing latex for OS X is not hard, just time (and bandwidth) consuming (it’s about the same size as a new OS X installer). If you delve into package creation or do more detailed output work in R, you’ll want to install MacTeX sooner rather than later.

Go to https://tug.org/mactex/mactex-download.html and download the latest MacTeX.pkg file. This direct link should always work (until they change up the way their site works) http://tug.org/cgi-bin/mactex-download/MacTeX.pkg
Double-click the pkg installer and follow the prompts. The defaults should all be good enough for now.
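
If you want to confirm from inside R which of the optional pieces above actually landed on the system, here is a minimal sketch in base R. `has_tool()` and `has_dev_tools()` are names made up for illustration, and the developer-tools check assumes `xcode-select -p` exits with status 0 once a developer directory is configured. It only sees the PATH that your R session sees, which can differ from the one in Terminal.

```r
# has_tool() is a hypothetical helper, not part of any package:
# TRUE when the command can be found on the PATH that R sees
has_tool <- function(cmd) unname(nzchar(Sys.which(cmd)))

# Assumption: `xcode-select -p` exits with status 0 only when a developer
# directory (Xcode or the command line tools) is configured
has_dev_tools <- function() {
  if (!has_tool("xcode-select")) return(FALSE)
  status <- suppressWarnings(
    system2("xcode-select", "-p", stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

# one logical flag per optional step
setup <- c(
  dev_tools = has_dev_tools(),
  homebrew  = has_tool("brew"),
  latex     = has_tool("pdflatex")
)
print(setup)
```

A `FALSE` here is not proof that something is missing (the tool may just live outside R’s PATH), so treat it as a hint to go look rather than a verdict.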

Fin

If you have any changes/additions/etc., drop a note in the comments. I may even stick this on GitHub to make it easier to contribute in the future.

Replicating NatGeo’s “Proper” Earthquake Map in R

2015-10-04 – 20:01
Posted in cartography, Data Visualization, DataVis, DataViz, ggplot, gis, maps, R
Tagged post
Comments (22)

I saw this post over at NatGeo over the weekend and felt compelled to replicate this:

with ggplot2.

Three shapefiles later and we have it close enough to toss into a post (and I really don’t believe the continent names are necessary).

library(rgdal)
library(ggplot2)
library(ggthemes)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")

# grab these from http://rud.is/dl/quakefiles.tgz

world <- readOGR("countries.geo.json", "OGRGeoJSON", stringsAsFactors=FALSE)
plates <- readOGR("plates.json", "OGRGeoJSON", stringsAsFactors=FALSE)
quakes <- readOGR("quakes.json", "OGRGeoJSON", stringsAsFactors=FALSE)

world_map <- fortify(world)
plates_map <- fortify(plates)
quakes_dat <- data.frame(quakes)
quakes_dat$trans <- quakes_dat$mag %% 5

gg <- ggplot()
gg <- gg + geom_cartogram(data=world_map, map=world_map,
                          aes(x=long, y=lat, map_id=id),
                          color="white", size=0.15, fill="#d8d8d6")
gg <- gg + geom_cartogram(data=plates_map, map=plates_map,
                          aes(x=long, y=lat, map_id=id),
                          color="black", size=0.1, fill="#00000000", alpha=0)
gg <- gg + geom_point(data=quakes_dat,
                      aes(x=coords.x1, y=coords.x2, size=trans),
                      shape=1, alpha=1/3, color="#d47e5d", fill="#00000000")
gg <- gg + geom_point(data=subset(quakes_dat, mag>7.5),
                      aes(x=coords.x1, y=coords.x2, size=trans),
                      shape=1, alpha=1, color="black", fill="#00000000")
gg <- gg + geom_text(data=subset(quakes_dat, mag>7.5),
                     aes(x=coords.x1, y=coords.x2, label=sprintf("Mag %2.1f", mag)),
                     color="black", size=3, vjust=c(3.9, 3.9, 5), fontface="bold")
gg <- gg + scale_size(name="Magnitude", trans="exp", labels=c(5:8), range=c(1, 20))
gg <- gg + coord_map("mollweide")
gg <- gg + theme_map()
gg <- gg + theme(legend.position=c(0.05, 0.99))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(legend.key=element_rect(color="#00000000"))
gg

I can only imagine how many mouse clicks that would be in a GIS program.
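
One part of that code that is easy to read past is how magnitude turns into point size. Here is a rough sketch, in base R, of what appears to be going on: `mag %% 5` rebases the magnitudes so a 5 becomes 0, and `trans="exp"` then has the size scale work on the exponential of that value. The magnitudes below are made-up round numbers, not rows from `quakes.json`, and the sketch assumes every quake in the file falls between 5 and 10 (anything outside that would wrap around under `%%`).

```r
# illustrative magnitudes only (not taken from quakes.json)
mag <- c(5, 6, 7, 8)

# same rebasing as quakes_dat$trans: 5 -> 0, 6 -> 1, ...
trans <- mag %% 5

# what an exponential transform does to those rebased values
scaled <- exp(trans)

data.frame(mag, trans, scaled = round(scaled, 2))
# scaled column: 1.00, 2.72, 7.39, 20.09
```

Each extra unit of magnitude multiplies the transformed value by about 2.7, which would explain why the handful of big quakes stand out so strongly once the scale spreads them over `range=c(1, 20)`.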

Addendum

This comment has links to where I acquired the shapefiles (good q by Michael)
Thanks to this comment there’s now code in a gist that makes a map re-centered on the Pacific

For Some Definition of “Directly” and/or “Contort”

2015-09-30 – 09:21
Posted in Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (5)

Junk Charts did a post on [Don’t pick your tool before having your design](http://junkcharts.typepad.com/junk_charts/2015/09/dont-pick-your-tool-before-having-your-design.html) and made a claim that this:

_“cannot be produced directly from a tool (without contorting your body in various painful locations)”_.

I beg to differ.

With R & ggplot2, I get to both pick my tool and design at the same time since I have a very flexible and multi-purpose tool. I also don’t believe that the code below qualifies as “contortions”, though I am a ggplot2 fanboi. It’s no different than Excel folks clicking on radio buttons and color pickers, except my process is easily repeatable & scalable once finalized (this is not finalized as it’s not 100% parameterized but it’s not difficult to do that last part).

library(ggplot2)
 
dat <- data.frame(year=2010:2015,
                  penalties=c(627, 625, 653, 617, 661, 730))
 
avg <- data.frame(val=mean(head(dat$penalties, -1)),
                  last=dat$penalties[6],
                  lab="5-Yr\nAvg")
 
gg <- ggplot(dat, aes(x=year, y=penalties))
gg <- gg + geom_point()
gg <- gg + scale_x_continuous(breaks=c(2010, 2014, 2015), limits=c(NA, 2015.1))
gg <- gg + scale_y_continuous(breaks=c(600, 650, 700, 750), 
                              limits=c(599, 751), expand=c(0,0))
gg <- gg + geom_segment(data=avg, aes(x=2010, xend=2015, y=val, yend=val), linetype="dashed")
gg <- gg + geom_segment(data=avg, aes(x=2015, xend=2015, y=val, yend=last), color="steelblue")
gg <- gg + geom_point(data=avg, aes(x=2015, y=val), shape=4, size=3)
gg <- gg + geom_text(data=avg, aes(x=2015, y=val), label="5-Yr\nAvg", size=2.5, hjust=-0.3)
gg <- gg + geom_point(data=avg, aes(x=2015, y=700), shape=17, col="steelblue")
gg <- gg + geom_point(data=avg, aes(x=2015, y=730), shape=4, size=3)
gg <- gg + labs(x=NULL, y="Number of Penalties", 
                title="NFL Penalties Jumped 15% in the\nFirst 3 Weeks of the 2015 Season\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(panel.grid.major.x=element_blank())
gg <- gg + theme(panel.grid.major.y=element_line(color="white"))
gg <- gg + theme(panel.background=element_rect(fill="#f3f2f7"))
gg <- gg + theme(axis.ticks=element_blank())
gg
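
As a sanity check on the title, the jump can be worked out from the same numbers. This is a small sketch and not part of the chart code; `five_yr_avg` and `pct_jump` are names introduced here for illustration.

```r
penalties <- c(627, 625, 653, 617, 661, 730)  # 2010-2015, same values as dat

# average of the first five seasons (2010-2014)
five_yr_avg <- mean(head(penalties, -1))

# percentage change of 2015 relative to that average
pct_jump <- (tail(penalties, 1) / five_yr_avg - 1) * 100

five_yr_avg         # 636.6
round(pct_jump, 1)  # 14.7
```

That rounds to the 15% in the title. Values computed this way could also stand in for the hard-coded 2015 and 730 in the plot calls, which is one way to approach the parameterization that was left undone.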