R Archives - Page 42 of 56

Category Archives: R

Working with the Clinton State Dept Email Dumps in R (Part 1: Graphs)

I put this together after experimenting with `ggplot2` and `ggnetwork` earlier this week. The changes I made added `svgPanZoom` into the mix. Consequently, it has a widget in it, so it was just easier to embed the full R markdown HTML into an `iframe` than to try to extract the content piecemeal into WP.

You can bust the `iframe` via .

Read on to see how to grab some JSON, create edge list, do some basic graph stats with `igraph` and generate an interactive visualization with `ggplot2` and `svgPanZoom`.

Making Faceted Heatmaps with ggplot2

We were doing some exploratory data analysis on some attacker data at work and one of the things I was interested is what were “working hours” by country. Now, I don’t put a great deal of faith in the precision of geolocated IP addresses since every geolocation database that exists thinks I live in Vermont (I don’t) and I know that these databases rely on a pretty “meh” distributed process for getting this local data. However, at a country level, the errors are tolerable provided you use a decent geolocation provider. Since a rant about the precision of IP address geolocation was not the point of this post, let’s move on.

One of the best ways to visualize these “working hours” is a temporal heatmap. Jay & I made a couple as part of our inaugural Data-Driven Security Book blog post to show how much of our collected lives were lost during the creation of our tome.

I have some paired-down, simulated data based on the attacker data we were looking at. Rather than the complete data set, I’m providing 200,000 “events” (RDP login attempts, to be precise) in the eventlog.csv file in the data/ directory that have the timestamp, and the source_country ISO 3166-1 alpha-2 country code (which is the source of the attack) plus the tz time zone of the source IP address. Let’s have a look:

library(data.table)  # faster fread() and better weekdays()
library(dplyr)       # consistent data.frame operations
library(purrr)       # consistent & safe list/vector munging
library(tidyr)       # consistent data.frame cleaning
library(lubridate)   # date manipulation
library(countrycode) # turn country codes into pretty names
library(ggplot2)     # base plots are for Coursera professors
library(scales)      # pairs nicely with ggplot2 for plot label formatting
library(gridExtra)   # a helper for arranging individual ggplot objects
library(ggthemes)    # has a clean theme for ggplot2
library(viridis)     # best. color. palette. evar.
library(knitr)       # kable : prettier data.frame output

attacks <- tbl_df(fread("data/eventlog.csv"))

kable(head(attacks))

timestamp	source_country	tz
2015-03-12T15:59:16.718901Z	CN	Asia/Shanghai
2015-03-12T16:00:48.841746Z	FR	Europe/Paris
2015-03-12T16:02:26.731256Z	CN	Asia/Shanghai
2015-03-12T16:02:38.469907Z	US	America/Chicago
2015-03-12T16:03:22.201903Z	CN	Asia/Shanghai
2015-03-12T16:03:45.984616Z	CN	Asia/Shanghai

For a temporal heatmap, we’re going to need the weekday and hour (or as granular as you want to get). I use a factor here so I can have ordered weekdays. I need the source timezone weekday/hour so we have to get a bit creative since the time zone parameter to virtually every date/time operation in R only handles a single element vector.

make_hr_wkday <- function(cc, ts, tz) {

  real_times <- ymd_hms(ts, tz=tz[1], quiet=TRUE)

  data_frame(source_country=cc,
             wkday=weekdays(as.Date(real_times, tz=tz[1])),
             hour=format(real_times, "%H", tz=tz[1]))

}

group_by(attacks, tz) %>%
  do(make_hr_wkday(.$source_country, .$timestamp, .$tz)) %>%
  ungroup() %>%
  mutate(wkday=factor(wkday,
                      levels=levels(weekdays(0, FALSE)))) -> attacks
kable(head(attacks))

tz	source_country	wkday	hour
Africa/Cairo	BG	Saturday	22
Africa/Cairo	TW	Sunday	08
Africa/Cairo	TW	Sunday	10
Africa/Cairo	CN	Sunday	13
Africa/Cairo	US	Sunday	17
Africa/Cairo	CA	Monday	13

It’s pretty straightforward to make an overall heatmap of activity. Group & count the number of “attacks” by weekday and hour then use geom_tile(). I’m going to clutter up the pristine ggplot2 commands with some explanation for those still learning ggplot2:

wkdays <- count(attacks, wkday, hour)

kable(head(wkdays))

wkday	hour	n
Sunday	00	1076
Sunday	01	1307
Sunday	02	1189
Sunday	03	1301
Sunday	04	1145
Sunday	05	1313

Here, we’re just feeding in the new data.frame we just created to ggplot and telling it we want to use the hour column for the x-axis, the wkday column for the y-axis and that we are doing a continuous scale fill by the n aggregated count:

gg <- ggplot(wkdays, aes(x=hour, y=wkday, fill=n))

This does all the hard work. geom_tile() will make tiles at each x&y location we’ve already specified. I knew we had events for every hour, but if you had missing days or hours, you could use tidyr::complete() to fill those in. We’re also telling it to use a thin (0.1 units) white border to separate the tiles.

gg <- gg + geom_tile(color="white", size=0.1)

this has some additional magic in that it’s an awesome color scale. Read the viridis package vignette for more info. By specifying a name here, we get a nice label on the legend.

gg <- gg + scale_fill_viridis(name="# Events", label=comma)

This ensures the plot will have a 1:1 aspect ratio (i.e. geom_tile()–which draws rectangles–will draw nice squares).

gg <- gg + coord_equal()

This tells ggplot to not use an x- or y-axis label and to also not reserve any space for them. I used a pretty bland but descriptive title. If I worked for some other security company I’d’ve added “ZOMGOSH CHINA!” to it.

gg <- gg + labs(x=NULL, y=NULL, title="Events per weekday & time of day")

Here’s what makes the plot look really nice. I customize a number of theme elements, starting with a base theme of theme_tufte() from the ggthemes package. It removes alot of chart junk without having to do it manually.

gg <- gg + theme_tufte(base_family="Helvetica")

I like my plot titles left-aligned. For hjust:

0 == left
0.5 == centered
1 == right

gg <- gg + theme(plot.title=element_text(hjust=0))

We don’t want any tick marks on the axes and I want the text to be slightly smaller than the default.

gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_text(size=7))

For the legend, I just needed to tweak the title and text sizes a wee bit.

gg <- gg + theme(legend.title=element_text(size=8))
gg <- gg + theme(legend.text=element_text(size=6))
gg

(NOTE: there’s an alternate version of this post with SVG graphics and nicer tables)

That’s great, but what if we wanted the heatmap breakdown by country? We’ll do this two ways, first with each country’s heatmap using the same scale, then with each one using it’s own scale. That will let us compare at a macro and micro level.

For either view, I want to rank-order the countries and want nice country names versus 2-letter abbreviations. We’ll do that first:

count(attacks, source_country) %>%
  mutate(percent=percent(n/sum(n)), count=comma(n)) %>%
  mutate(country=sprintf("%s (%s)",
                         countrycode(source_country, "iso2c", "country.name"),
                         source_country)) %>%
  arrange(desc(n)) -> events_by_country

kable(events_by_country[,5:3])

country	count	percent
China (CN)	85,243	42.6%
United States (US)	48,684	24.3%
Korea, Republic of (KR)	12,648	6.3%
Netherlands (NL)	8,572	4.3%
Viet Nam (VN)	6,340	3.2%
Taiwan, Province of China (TW)	3,469	1.7%
United Kingdom (GB)	3,266	1.6%
France (FR)	3,252	1.6%
Ukraine (UA)	2,219	1.1%
Germany (DE)	2,055	1.0%
Argentina (AR)	1,793	0.9%
Canada (CA)	1,646	0.8%
Russian Federation (RU)	1,633	0.8%
Japan (JP)	1,476	0.7%
Singapore (SG)	1,278	0.6%
Hong Kong (HK)	1,239	0.6%
…	…	…

Now, we’ll do a simple ggplot facet, but also exclude the top 2 attacking countries since they skew things a bit (and, we’ll see them in the last vis):

filter(attacks, source_country %in% events_by_country$source_country[3:12]) %>%
  count(source_country, wkday, hour) %>%
  ungroup() %>%
  left_join(events_by_country[,c(1,5)]) %>%
  complete(country, wkday, hour, fill=list(n=0)) %>%
  mutate(country=factor(country,
                        levels=events_by_country$country[3:12])) -> cc_heat

Before we go all crazy and plot, let me explain ^^ a bit. I’m filtering by the top 10 (excluding the top 2) countries, then doing the group/count. I need the pretty country info, so I’m joining that to the result. Not all countries attacked every day/hour, so we use that complete() operation I mentioned earlier to ensure we have values for all countries for each day/hour combination. Finally, I want to print the heatmaps in order, so I turn the country into an ordered factor.

gg <- ggplot(cc_heat, aes(x=hour, y=wkday, fill=n))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + scale_fill_viridis(name="# Events")
gg <- gg + coord_equal()
gg <- gg + facet_wrap(~country, ncol=2)
gg <- gg + labs(x=NULL, y=NULL, title="Events per weekday & time of day by country\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_text(size=5))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(plot.title=element_text(hjust=0))
gg <- gg + theme(strip.text=element_text(hjust=0))
gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
gg <- gg + theme(legend.title=element_text(size=6))
gg <- gg + theme(legend.title.align=1)
gg <- gg + theme(legend.text=element_text(size=6))
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
gg <- gg + theme(legend.key.width=unit(1, "cm"))
gg

To get individual scales for each country we need to make n separate ggplot object and combine then using gridExtra::grid.arrange. It’s pretty much the same setup as before, only without the facet call. We’ll do the top 16 countries (not excluding anything) this way (pick any number you want, though provided you like scrolling). I didn’t bother with a legend title since you kinda know what you’re looking at by now :-)

count(attacks, source_country, wkday, hour) %>%
  ungroup() %>%
  left_join(events_by_country[,c(1,5)]) %>%
  complete(country, wkday, hour, fill=list(n=0)) %>%
  mutate(country=factor(country,
                        levels=events_by_country$country)) -> cc_heat2

lapply(events_by_country$country[1:16], function(cc) {
  gg <- ggplot(filter(cc_heat2, country==cc),
               aes(x=hour, y=wkday, fill=n, frame=country))
  gg <- gg + geom_tile(color="white", size=0.1)
  gg <- gg + scale_x_discrete(expand=c(0,0))
  gg <- gg + scale_y_discrete(expand=c(0,0))
  gg <- gg + scale_fill_viridis(name="")
  gg <- gg + coord_equal()
  gg <- gg + labs(x=NULL, y=NULL,
                  title=sprintf("%s", cc))
  gg <- gg + theme_tufte(base_family="Helvetica")
  gg <- gg + theme(axis.ticks=element_blank())
  gg <- gg + theme(axis.text=element_text(size=5))
  gg <- gg + theme(panel.border=element_blank())
  gg <- gg + theme(plot.title=element_text(hjust=0, size=6))
  gg <- gg + theme(panel.margin.x=unit(0.5, "cm"))
  gg <- gg + theme(panel.margin.y=unit(0.5, "cm"))
  gg <- gg + theme(legend.title=element_text(size=6))
  gg <- gg + theme(legend.title.align=1)
  gg <- gg + theme(legend.text=element_text(size=6))
  gg <- gg + theme(legend.position="bottom")
  gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
  gg <- gg + theme(legend.key.width=unit(1, "cm"))
  gg
}) -> cclist

cclist[["ncol"]] <- 2

do.call(grid.arrange, cclist)

You can find the data and source for this R markdown document on github. You’ll need to devtools::install_github("hrbrmstr/hrbrmrkdn") first since I’m using a custom template (or just change the output: to html_document in the YAML header).

Plot the new SVG R logo with ggplot2

High resolution and SVG versions of the new R logo are finally available.

I converted the SVG to WKT (file here) which means we can use it like we would a shapefile in R. That includes plotting!

Here’s a short example of how to read that WKT and plot the logo using ggplot2:

library(sp)
library(maptools)
library(rgeos)
library(ggplot2)
library(ggthemes)
 
r_wkt_gist_file <- "https://gist.githubusercontent.com/hrbrmstr/07d0ccf14c2ff109f55a/raw/db274a39b8f024468f8550d7aeaabb83c576f7ef/rlogo.wkt"
if (!file.exists("rlogo.wkt")) download.file(r_wkt_gist_file, "rlogo.wkt")
rlogo <- readWKT(paste0(readLines("rlogo.wkt", warn=FALSE)))
 
rlogo_shp <- SpatialPolygonsDataFrame(rlogo, data.frame(poly=c("halo", "r")))
rlogo_poly <- fortify(rlogo_shp, region="poly")
 
ggplot(rlogo_poly) + 
  geom_polygon(aes(x=long, y=lat, group=id, fill=id)) + 
  scale_fill_manual(values=c(halo="#b8babf", r="#1e63b5")) +
  coord_equal() + 
  theme_map() + 
  theme(legend.position="none")

Craft httr calls cleverly with curlconverter

UPDATE curlconverter will now return (as the function return value) a working R function. See the README for examples

When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:

primary

Sometimes it’s as simple as opening up your browsers “Developer Tools” console and looking for XHR (XML HTTP Requests) calls:

XHR

You can actually see a preview of those requests (usually JSON):

Developer_Tools_-_http___graphics_latimes_com_election-2016-new-hampshire-results_

While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The RCurl and curl packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' 
  -H 'Pragma: no-cache' 
  -H 'DNT: 1' 
  -H 'Accept-Encoding: gzip, deflate, sdch'
  -H 'X-Requested-With: XMLHttpRequest' 
  -H 'Accept-Language: en-US,en;q=0.8' 
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' 
  -H 'Accept: */*' 
  -H 'Cache-Control: no-cache' 
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' 
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' 
  -H 'Connection: keep-alive' 
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT'
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' 
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

straighten() : which returns a list with all of the necessary parts to craft an httr POST or GET call
make_req() : which actually _returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).

Let’s show what happens using election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still tediously).

curlconverter‘s make_req() will try to do this conversion for you automagically using httr‘s little used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)

VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all those headers, but you just need to delete what you don’t need vs trial-and-error build by hand. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do alot of this “scraping without scraping”).

You can find the package on gitub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).

Alternate R Markdown Templates

The `knitr`/R markdown system is a great way to organize reports and analyses. However, the built-in ones (that come with RStudio/the `rmarkdown` package) rely on Bootstrap and also use jQuery. There’s nothing wrong with that, but the generated standalone HTML documents (which are a great way to distribute reports) don’t really need all that cruft and it’s fun & informative to check out new frameworks from time-to-time. Also, jQuery is a heavy crutch I’m working hard to not need anymore.

To that end, I created a package — [`markdowntemplates`](https://github.com/hrbrmstr/markdowntemplates) — that contains three alternate templates that you can use out of the box or (hopefully) clone, customize and use in your future work. One template is based on the [Bulma CSS framework](http://bulma.io), the other is based on the [Skeleton CSS framework](http://getskeleton.com) and the last one is a super-minimal template with no formatting (i.e. it’s a good one to build on).

The github repo has screenshots of the basic formatting.

I tried to keep with the base formatting of each theme, but I went a bit crazy and showed how to have a fixed banner in the Skeleton version.

### How it works

There are three directories inside `inst/rmarkdown/templates` each has a similar structure:

– a `resources` directory with CSS (and potentially javascript)
– a `skeleton` directory which holds example `Rmd` “skeleton” files
– a `base.html` file which is the parameterized HTML template for the Rmd
– a `template.yaml` file which is how RStudio/`knitr` knows there’s one or more R markdown templates in your package

The `minimal` `base.html` is small enough to include here:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"$if(lang)$ lang="$lang$" xml:lang="$lang$"$endif$>
 
<head>
 
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">
 
<title>$if(title)$$title$$endif$</title>
 
$for(header-includes)$
$header-includes$
$endfor$
 
$if(highlightjs)$
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
      href="$highlightjs$/$if(highlightjs-theme)$$highlightjs-theme$$else$default$endif$.css"
      $if(html5)$$else$type="text/css" $endif$/>
<script src="$highlightjs$/highlight.js"></script>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
   window.setTimeout(function() {
      hljs.initHighlighting();
   }, 0);
}
</script>
$endif$
 
$if(highlighting-css)$
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
$highlighting-css$
</style>
$endif$
 
$for(css)$
<link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
$endfor$
 
</head>
 
<body>
<div class="container">
 
<h1>$if(title)$$title$$endif$</h1>
 
$for(include-before)$
$include-before$
$endfor$
 
$if(toc)$
<div id="$idprefix$TOC">
$toc$
</div>
$endif$
 
$body$
 
$for(include-after)$
$include-after$
$endfor$
 
</div>
 
$if(mathjax-url)$
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "$mathjax-url$";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
$endif$
 
</body>
</html>

I kept a bit of the RStudio template code for source code formatting, but grokking the actual template language should be pretty straightforward. You should be able to pick out `$title$` in there and you can add as many parameters to the `Rmd` YAML section as you like (with corresponding counterparts in that template). I added a corresponding, exported R function for each supported template to show how easy it is to customize the parameters while also accepting further customizations in the YAML of each `Rmd`.

Imagine building a base template with your personal or organization’s branding *and* having it set apart from the cookie-cutter RStudio `rmarkdown` Bootstrap/jQuery template. Or, create course-specific templates like the [`s20x` package](https://github.com/cran/s20x). Once you peek behind the curtain to see how things are done, it’s not so complex and the sky is the limit.

I’ll try to get these in CRAN as soon as possible. If you have a preference for another CSS framework (I’m thinking of adding a “Metro” CSS kit and a Google web starter CSS kit), shoot me an issue or PR and I’ll incorporate it. The more examples we have the easier it will be for folks to create new templates.

Any & all feedback is most welcome.

Cobble XPath Interactively with the xmlview Package

2016-01-13 – 12:35
Posted in data wrangling, htmlwidgets, R, xml
Tagged post
Comments (1)

(If you don’t know what XML is, you should probably [read a primer](https://en.wikipedia.org/wiki/XML) before reading this post,)

When working with data, one inevitably comes across things encoded in XML. I’m in the “anti-XML” camp, but deal with my fair share of XML in “cyber” and help out enough people who have to work with XML that I’ve become pretty proficient when slicing & dicing it.

R has two main packages to deal with XML: the original `XML` package and the more lightweight and modern `xml2` package. If you really need all the power of `libxml2` (the C library that powers both packages) or are _creating_ XML from R, then you probably know your way around the `XML` package and are pretty self-sufficient.

Most folks can get by with the `xml2` package if their goal is to work with XML data. By “work with” I mean read in files or data from APIs that come in XML format and have to find nuggets of gold in between all those `<` and `>` tags. To do so requires finding what you need and that means using a query language called `XPath` to pinpoint the node(s) you are after. Working with `XPath` can be pretty daunting for those who went to school to ultimately cure diseases, build high-performing stock portfolios, target advertising to everyone or perform a host of other real work. Becoming an expert in `XPath` was not something on the bucket list but to work with XML you will need to be familiar with it.

The [`xmlview`](https://github.com/hrbrmstr/xmlview) package provides a way to visually inspect XML and interactively test out `XPath` expressions. It’s as simple to use as:

devtools::install_github("ramnathv/htmlwidgets") # we use some bleeding edge features
devtools::install_github("hrbrmstr/xmlview")
library(xml2)
library(xmlview)
 
# plain text XML
xml_view("<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>")
 
# read-in XML document
doc <- read_xml("http://www.npr.org/rss/rss.php?id=1001")
xml_view(doc, add_filter=TRUE)

(There’s also an experimental `xml_tree_view()` in there by @timelyportfolio that we’ll be adding features to at a pretty rapid pace.)

Here’s a screenshot of it in action:

There are options to change the CSS styling for the formatted code. Yep, it will format and highlight XML for you so it’s easier to work with. There’s an animated gif of a screencast over [on github](https://github.com/hrbrmstr/xmlview) as well.

Once you perfect your `XPath` expression, hit the “R” button and it will generate the code you can copy back into RStudio. It understands namespaces but try not to stuff a huge XML document in there as browsers don’t work well with large data elements (the viewer is an `htmlwidget` and is, hence, browser-based).

It works with plain character XML/HTML, and many `xml2` data types. I have no current plans for `XML` package object support but toss up an issue on github if you really need it (or, better yet, a PR). If there are other desired features (especially from educators), please post a request in github issue as well.

Watch for more features in the coming weeks and a CRAN release once the bleeding edge `htmlwidgets` packages makes it to CRAN.

The Force: Accounted, in R

2016-01-09 – 10:47
Posted in Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (1)

Despite being a cybersecurity professional, it’s pretty easy to social engineer me:

https://t.co/R3qVHEMTLo @hrbrmstr is there an R package? ;)

— Thorsten G. (@SaThaRiel74) January 8, 2016

I’ll note that @jayjacobs does it all the time to me.

I took Thorsten’s tweet as a challenge to ggplot2-ize the Bloomberg visualizations as best as possible.

All the code in [on github](https://github.com/hrbrmstr/forceaccounted) and you can see the finished product (knitted from an Rmd file) [on this project page](http://rud.is/projects/force_accounted.html) or mini-scroll below in the `iframe`.

I encourage folks to look at the project (it’s actually a package) source as it has quite a bit of data munging and ggplot2 “tricks” that could be useful in “real” visualizations.

iptools 0.3.0 (“Violet Packet”) Now on CRAN with Windows Support!

2016-01-08 – 10:44
Posted in Cybersecurity, DNS, Information Security, R
Tagged post
Comments (1)

`iptools` is a set of tools for working with IP addresses. Not just work, but work _fast_. It’s backed by `Rcpp` and now uses the [AsioHeaders](http://dirk.eddelbuettel.com/blog/2016/01/07/#asioheaders_1.11.0-1) package by Dirk Eddelbuettel, which means it no longer needs to _link_ against the monolithic Boost libraries and *works on Windows*!

What can you do with it? One thing you can do is take a vector of domain names and turn them into IP addresses:

library(iptools)
 
hostname_to_ip(c("rud.is", "dds.ec", "ironholds.org", "google.com"))
 
## [[1]]
## [1] "104.236.112.222"
## 
## [[2]]
## [1] "162.243.111.4"
## 
## [[3]]
## [1] "104.131.2.226"
## 
## [[4]]
##  [1] "2607:f8b0:400b:80a::100e" "74.125.226.101"           "74.125.226.102"          
##  [4] "74.125.226.100"           "74.125.226.96"            "74.125.226.104"          
##  [7] "74.125.226.99"            "74.125.226.103"           "74.125.226.105"          
## [10] "74.125.226.98"            "74.125.226.97"            "74.125.226.110"

That means you can pump a bunch of domain names from logs into `iptools` and get current IP address allocations out for them.

You can also do the reverse:

library(magrittr)
library(purrr)
library(iptools)
 
hostname_to_ip(c("rud.is", "dds.ec", "ironholds.org", "google.com")) %>% 
  flatten_chr() %>% 
  ip_to_hostname() %>% 
  flatten_chr()
 
##  [1] "104.236.112.222"           "dds.ec"                    "104.131.2.226"            
##  [4] "yyz08s13-in-x0e.1e100.net" "yyz08s13-in-f5.1e100.net"  "yyz08s13-in-f6.1e100.net" 
##  [7] "yyz08s13-in-f4.1e100.net"  "yyz08s13-in-f0.1e100.net"  "yyz08s13-in-f8.1e100.net" 
## [10] "yyz08s13-in-f3.1e100.net"  "yyz08s13-in-f7.1e100.net"  "yyz08s13-in-f9.1e100.net" 
## [13] "yyz08s13-in-f2.1e100.net"  "yyz08s13-in-f1.1e100.net"  "yyz08s13-in-f14.1e100.net"

Notice that it handled IPv6 addresses and also cases where no reverse mapping existed for an IP address.

You can convert IPv4 addresses to and from long integer format (the 4 octet version of IPv4 addresses is primarily to make them easier for humans to grok), generate random IP addresses for testing, test IP addresses for validity and type and also reference data sets with registered assignments (so you can see allocated IP groups). Plus, it includes `xff_extract()` which can help identify an actual IP address (helpful when connections come from behind proxies).

We can’t thank Dirk enough for cranking out `AsioHeaders` since it means there will be many more network/”cyber” packages coming for R and available on every platform.

You can find `iptools` version `0.3.0` [on CRAN](https://cran.r-project.org/web/packages/iptools/) now (it may take your mirror a bit to catch up), grab the source [release](https://github.com/hrbrmstr/iptools/releases/tag/v0.3.0) on GitHub or check out the [repo](https://github.com/hrbrmstr/iptools/), poke around, submit issues and/or contribute!

Isn’t it great when an R package can help you with resolutions in the new year?