This isn’t a post about politics. I do have opinions about the now infamous e-mail server (which will no doubt come out here), but when the WSJ folks made it possible to search the Clinton email releases I thought it would be fun to get the data into R to show how well the igraph and ggnetwork packages could work together, and also show how to use svgPanZoom to make it a bit easier to poke around the resulting hairball network.
NOTE: There are a couple “Assignment” blocks in here. My Elements of Data Science students are no doubt following the blog by now so those are meant for you :-) Other intrepid readers can ignore them.
We’ll need some packages:
library(jsonlite) # read in the JSON data from the API
library(dplyr) # data munging
library(igraph) # work with graphs in R
library(ggnetwork) # devtools::install_github("briatte/ggnetwork")
library(intergraph) # ggnetwork needs this to wield igraph things
library(ggrepel) # fancy, non-overlapping labels
library(svgPanZoom) # zoom, zoom
library(SVGAnnotation) # to help svgPanZoom; it's an Omegahat package, not on CRAN
library(DT) # pretty tables
There’s an API backing the WSJ web app. It’s not advertised, but it’s not hidden either. They were kind enough to make this resource available to the public to help them make up their minds as to whether this was a horrible, awful, terrible, inexcusable breach of national security through conceit, hubris and naïvety (see, I have opinions :-) – or not – so we really shouldn’t constantly hit their API just because we want to work with the data on our own.
To that end, we grab the data from the API and save the R object off so we can work with the local copy whenever we want to.
if (!file.exists("clinton_emails.rda")) {
clinton_emails <- fromJSON("http://graphics.wsj.com/hillary-clinton-email-documents/api/search.php?subject=&text=&to=&from=&start=&end=&sort=docDate&order=desc&docid=&limit=27159&offset=0")$rows
save(clinton_emails, file="clinton_emails.rda")
}
load("clinton_emails.rda")
There are some from/to pairs with multiple recipients (~140). We can munge those into shape, but I’m just going to get rid of them since this is just a post about visualizing the network. strsplit and tidyr::unnest can help if you want to preserve that small number of emails, as in the sketch below.
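For the curious, something along these lines should do it (a sketch only, not run for this post; expanded_emails is just an illustrative name):

clinton_emails %>%
  mutate(to=strsplit(to, ";")) %>%  # multi-recipient "to" becomes a list-column
  tidyr::unnest(to) %>%             # one row per recipient
  mutate(to=trimws(to)) -> expanded_emails

We’re not doing that here, though; dropping those rows is simpler: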
clinton_emails %>%
mutate(from=trimws(from),
to=trimws(to)) %>%
filter(from != "") %>%
filter(to != "") %>%
filter(!grepl(";", from)) %>%
filter(!grepl(";", to)) -> clinton_emails
Assignment: Reduce the number of filter() statements in that code block.
Data with “from” and “to” characteristics lend themselves to graphs. Graphs (in yet another opinion of mine) are inherently objects designed for computation and this could be a fun data set to use to learn some basic graph theory. Let’s make a graph first:
gr <- graph_from_data_frame(clinton_emails[,c("from", "to")], directed=FALSE)
Aye, that was all we needed to do. Just tell igraph where the “from” and “to” bits are and it does the rest. You can add extra data to nodes & edges, but this will do just fine for this post.
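As an aside, any extra columns in the edge data frame automatically become edge attributes. A quick sketch (docDate is an assumption on my part; the API query sorts on it, so the rows almost certainly carry it):

# columns beyond from/to become edge attributes automatically
gr2 <- graph_from_data_frame(clinton_emails[,c("from", "to", "docDate")], directed=FALSE)
head(E(gr2)$docDate)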
First, we’ll take a look at the degree centrality so we can properly size the nodes for the final vis.
V(gr)$size <- centralization.degree(gr)$res
datatable(arrange(data_frame(person=V(gr)$name, centrality_degree=V(gr)$size), desc(centrality_degree)))
The names with higher degrees shouldn’t be a shocker. This is all about former Secretary Clinton and if you google a bit (or just follow politics like some folks follow $SPORTSBALL) you’ll grok why the others are so high on the e-mail frequency list.
Note that this is a bit different than just doing a simple crosstab count:
datatable(arrange(ungroup(count(clinton_emails, from, to)), desc(n)))
Assignment: “pipify” that code block.
That does show that there are a large number of redundant edges. We’ll combine them by simplifying the graph and storing the sum of the edge connections in the weight attribute (the sum only lands there if the edges already have a weight attribute, which is why we give every edge a weight of 1 first).
E(gr)$weight <- 1 # every edge starts out counting once
g <- simplify(gr, edge.attr.comb="sum") # collapse duplicate edges, summing their weights
You can use that weight computationally or to size the line connections between vertices. That will largely be an exercise left to the reader (there’s a small nudge below).
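The nudge: strength() sums the weights of the edges touching each vertex (a weighted degree), and on the visual side you can map weight to the size aesthetic. A minimal sketch:

# weighted degree: sum of incident edge weights per vertex
V(g)$weighted_size <- strength(g)
# and in the plot below, adding size=weight to the geom_edges() aes()
# would scale line width by the number of combined e-mails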
Since we’re just going to visualize the network, we’ll pick a layout, and one of my favs is Fruchterman–Reingold. Here’s where we’ll use ggnetwork.
First, we set a random seed, since you’ll get a different orientation each time if you don’t (the layout algorithm starts at a random point). Then we tell ggnetwork to use the FR algorithm to do its work.
set.seed(1492)
dat <- ggnetwork(g, layout="fruchtermanreingold", arrow.gap=0, cell.jitter=0)
What do arrow.gap and cell.jitter do? Be curious! Hit up help and play with the settings.
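For instance (the alternate values here are arbitrary, purely to see how the output shifts; ?ggnetwork has the details):

# compare against the arrow.gap=0 / cell.jitter=0 version above
dat_alt <- ggnetwork(g, layout="fruchtermanreingold", arrow.gap=0.01, cell.jitter=0.75)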
It’s astonishingly easy to get this graph into ggplot2 now (thanks to ggnetwork). geom_edges + geom_nodes understand the attribute data associated with those graph components, so you can play with how you want various aesthetics mapped.
I add a “repelling label” to the nodes with higher centrality so it’s easier to see who the “top talkers” are.
Finally, I pass the ggplot object to svgPlot and svgPanZoom to make it easier to generate a huge graph but still make it explorable.
It may look tiny, but pan/zoom like you would a google map to navigate the graph.
ggplot() +
geom_edges(data=dat,
aes(x=x, y=y, xend=xend, yend=yend),
color="grey50", curvature=0.1, size=0.15, alpha=1/2) +
geom_nodes(data=dat,
aes(x=x, y=y, xend=xend, yend=yend, size=sqrt(size)),
alpha=1/3) +
geom_label_repel(data=unique(dat[dat$size>50, c("x", "y", "vertex.names")]),
aes(x=x, y=y, label=vertex.names),
size=2, color="#8856a7") +
theme_blank() +
theme(legend.position="none") -> gg
svgPanZoom(svgPlot(show(gg), height=15, width=15),
width="960px",
controlIconsEnabled=TRUE)
What are those tiny pairs of disconnected mini-graphs doing in there? It’s doubtful those folks had e-mail accounts on this illegal, mismanaged server so I’m positing that they are parsing errors by the WSJ. Take a look for yourself:
clinton_emails %>%
filter(from != "Hillary Clinton" & to != "Hillary Clinton") %>%
datatable()
The WSJ also has the PDFs available. They (thankfully) appear to all contain text vs images (various U.S. government offices have a reputation for handing over image PDFs vs text-content PDFs to make it harder to work with them, especially with FOIA requests).
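Getting at that text from R is straightforward (a minimal sketch; pdftools isn’t in the library list above, and the filename is hypothetical since you’d need to download a PDF from the WSJ app first):

library(pdftools)
pages <- pdf_text("clinton_email.pdf") # one character string per page
cat(substr(pages[1], 1, 500))          # peek at the start of the first page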
If time permits, future posts will expand on the graph component (from the algorithmic side) and do a bit of text mining & visualization on the subjects and PDF text.
You can find the code for this post in this gist.