R Archives - Page 45 of 56

Category Archives: R

Three New #rstats Twitter Bots To Follow

2015-09-06 – 07:30
Posted in cartography, data science, gis, R, Stack Exchange, Stack Overflow, statistics
Tagged post
Comments (1)

I engage with the Stack[Overflow|Exchange] community quite a bit and was super-happy @treycausey made the [Stack Overflow #rstats bot](https://twitter.com/StackOverflowR) (@StackOverflowR) since I’m also on Twitter alot (mostly hanging out in #rstats these days).

However, #rstats questions exist in other Stack watering holes, like the [Geographic Information Systems Stack Exchange](http://gis.stackexchange.com/questions/tagged/r). [Cross Validated](http://stats.stackexchange.com/questions/tagged/r) and the fledgling [Data Science Stack Exchange](http://datascience.stackexchange.com/questions/tagged/r). They are all easy enough to follow in your favorite RSS reader (which _is_ @feedly, _right_?), but it’s also equally as easy to turn them into Twitter #rstats bots (here they are):

– @DataSciSERBot (Data Science Stack Exchange #rstats posts)
– @GISStackExchR (Geographic Information Systems #rstats posts)
– @RStatsStExBot (Cross Validated Stack Exchange #rstats posts)

They use [this IFTTT recipe](https://ifttt.com/recipes/322136-twitter-bot-for-data-science-stack-exchange-rstats-questions) to take the RSS feeds from each Stack community and turn them into tweets. Each forum (and, hence, Twitter bot) has _way_ less volume than @StackOverflowR and also tend to be of less general interest than @StackOverflowR. However, each specialized #rstats question forums [usually] have _really_ interesting problems & answers that I’ve learned a great deal from. You may get inspiration for a solution to something, see a really neat way to accomplish a task, get an idea for a new R package or just gain new and useful knowledge in areas previously unfamiliar to you.

They won’t be spamming the #rstats Twitter channel (the errant #rstats tag on one of the tweets was fixed in the recipe), so you’ll have to follow them or deliberately check in on them to see the updates.

I’ll try to keep an eye out for other Stack communities that feature #rstats and add them to the bot family. If you see any I’ve missed or am missing in the future, drop a comment here or note to @hrbrmstr on Twitter.

Coloring (and Drawing) Outside the Lines in ggplot

2015-08-27 – 09:36
Posted in Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (8)

Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos:

Is there a way to do this in #rstats #ggplot2 https://t.co/kxWQFlYpbB

— Jonathan Kalodimos (@JonKalodimos) August 27, 2015

I had seen and appreciated Ann’s post on her makeover of the main graphic in [NPR’s story](http://www.npr.org/sections/money/2014/10/21/357629765/when-women-stopped-coding) and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.

Here’s the NPR graphic:

When_Women_Stopped_Coding___Planet_Money___NPR

It is actually pretty darn good on it’s own, but I also agree with Ann that direct labeling could have made it better. Here’s her makeover:

Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the `library` stuff out of the way:

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)

Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:

# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)
 
gender <- read.csv(fil, stringsAsFactors=FALSE)
 
# take a look at the CSV structure ----------------------------------------
 
glimpse(gender)
 
## Observations: 48
## Variables:
## $ date              (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School    (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School        (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science  (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...
 
tail(gender)
 
##    date Medical School Law School Physical Sciences Computer science
## 43 2008           0.48       0.47              0.41            0.177
## 44 2009           0.48       0.47              0.42            0.179
## 45 2010           0.48       0.47              0.41            0.182
## 46 2011           0.47         tk                tk            0.177
## 47 2012           0.47         tk                tk            0.182
## 48 2013           0.46         tk                              0.179

Those `tk` values are referred to in the [code that makes the NPR graphic](http://apps.npr.org/dailygraphics/graphics/women-cs/js/graphic.js) so we’ll replace them with `NA`s and make all the columns numeric:

gender <- mutate_each(gender, funs(as.numeric))

We should also clean up the column names since we’ll be using them for the legend and the direct labels:

colnames(gender) <- str_replace(colnames(gender), "\\.", " ")
 
gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5],
                                  ordered=TRUE))

That that code link also has the colors NPR used for the graphic, so let’s define those now since we bothered to look at it:

gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]

We’ll be needing those names later on, hence why I named the values in the vector.

With the data, labels and colors defined, we can make a “standard” ggplot:

chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))
 
gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(label=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg

Rplot01

That’s also pretty good on it’s own. It’s possible to make it look a bit more like the NPR chart, but it’s hard to format a title & subtitle in a ggplot title _and_ have it left-justified, so I opted for font style. It’s also possible to make the legend look like NPR’s but that’s not the point of the post.

So, how do we make this look more like Ann’s makeover?

First we need to get the last values for each of the variables so we know what point on the `y` axis we need to place the labels. That’s made a bit trickier with the `NA`s:

last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))
last_date <- tail(gender$date)+1 # doing this ^ wld have made it a double

Next, we need to turn off the legend and increase the plot margin on the right-hand side:

gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))

I figured out those #’s by interactive trial-and-error, though I initially guessed `6` for the right-hand margin increase. Also, this should demonstrate one reason for the `gg <- gg +` madness you see in my code/posts since, when you start doing more in ggplot, you end up with that idiom more oft than not. Now, we add the labels. We do it with with custom annotations that are placed "one year" after the latest `x` value and at the same `y` value as the last reading of each area. We also color the label the same as the line, which is why we needed a named vector.

for (i in 1:length(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8, 
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014,
                               ymin=last_vals[i], ymax=last_vals[i])
}

Finally, we have to do some of the remaining work by hand since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level, but it’s not that scary or complex of a task. Also, since we are manipulating the built ggplot object, we have to use `grid.draw` to present our chart:

gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)
 
gt$layout$clip[gt$layout$name=="panel"] <- "off"
 
grid.draw(gt)

Here’s the result:

Rplot02

I’ve deliberately left the fonts a bit small and not-changed their positions on the `y`-axis to give readers a bit of homework. They both _should_ be changed and the plot margins could also be tweaked a tad. You can find the complete code [on github](https://gist.github.com/hrbrmstr/83deb0baeabae0824389) so tweak away!

If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.

New Pacakge “docxtractr” – Easily Extract Tables From Microsoft Word Docs

UPDATE: `docxtractr` is now [on CRAN](https://cran.rstudio.com/web/packages/docxtractr/index.html)

———————

This is more of a follow-up from [yesterday’s post](http://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/). The hack and function in said post was fine, but it was limited to uniform tables and made you do more work than you had to. So, there’s now a `devtools`-installable package [on github](https://github.com/hrbrmstr/docxtractr) that makes it way easier to get information about the tables in a Word document and extract them—uniform or not.

There are plenty of examples in the GitHub README and also in the package examples. But, I will show the basic functionality here.

The package ships with four example Word documents, but we’ll work with the last one: `complex.doc`. It has five tables and the last two have varying columns and rows and look like:

complex

Let’s read those two in:

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows 
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely


docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#> 
#>   Foo  Bar Baz
#> 1  Aa BbCc  NA
#> 2  Dd   Ee  Ff
#> 3  Gg   Hh  ii

docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#> 
#>    Foo Bar Baz
#> 1   Aa  Bb  Cc
#> 2   Dd  Ee  Ff
#> 3   Gg  Hh  Ii
#> 4 Jj88  Kk  Ll
#> 5       Uu  Ii
#> 6   Hh  Ii   h

It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.

Next steps are to both allow specifying column types and try to guess column types (`readr` has some nice functions for this) and perhaps return more metadata (if possible).

Feature requests & bug reports are most welcome [on GitHub](https://github.com/hrbrmstr/docxtractr/issues).

Using R To Get Data Out Of Word Docs

NOTE: after reading this post head on over to this new one as it has wrapped this functionality (and more!) into a package.

Also: docxtractr is now on CRAN

This was asked on twitter recently:

Is it possible to import data entered in MS Word into R – I have multiple tables in 235 files that need importing #rstats

— Richard Telford (@richardjtelford) August 23, 2015

The answer is a very cautious “yes”. Much depends on how well-formed and un-formatted the table is.

Take this really simple docx file: data.docx.

It has a single table in it:

data_docx

Now, .docx files are just zipped directories, so rename that to data.zip, unzip it and navigate to data/word/document.xml and you’ll see something like this (though it’ll be more compressed):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
    <w:tbl>
        <w:tblPr>
            <w:tblStyle w:val="TableGrid"/>
            <w:tblW w:w="0" w:type="auto"/>
            <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
        </w:tblPr>
        <w:tblGrid>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2338"/>
            <w:gridCol w:w="2338"/>
        </w:tblGrid>
        <w:tr w:rsidR="00244D8A" w14:paraId="6808A6FE" w14:textId="77777777" w:rsidTr="00244D8A">
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="7D006905" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>This</w:t>
                    </w:r>
                </w:p>
            </w:tc>
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="13C9E52C" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>Is</w:t>
                    </w:r>
                </w:p>
            </w:tc>
...

We can easily make out a table structure with rows and columns. In the simplest cases (which is all I’ll cover in this post) where the rows and columns are uniform it’s pretty easy to grab the data:

library(xml2)

# read in the XML file
doc <- read_xml("data/word/document.xml")

# there is an egregious use of namespaces in these files
ns <- xml_ns(doc)

# extract all the table cells (this is assuming one table in the document)
cells <- xml_find_all(doc, ".//w:tbl/w:tr/w:tc", ns=ns)

# convert the cells to a matrix then to a data.frame)
dat <- data.frame(matrix(xml_text(cells), ncol=4, byrow=TRUE), 
                  stringsAsFactors=FALSE)

# if there are column headers, make them the column name and remove that line
colnames(dat) <- dat[1,]
dat <- dat[-1,]
rownames(dat) <- NULL

dat

##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo

You’ll need to clean up the column types, but you have at least freed the data from the evil file format it was in.

If there is more than one table you can use XML node targeting to process each one separately or into a list. I’ve wrapped that functionality into a rudimentary function that will:

auto-copy a Word doc to a temporary location
rename it to a zip
unzip it to a temporary location
read in the document.xml
auto-determine the number of tables in the document
auto-calculate # rows & # columns per table
convert each table
return all the tables into a list
clean up the temporarily created items

library(xml2)

get_tbls <- function(word_doc) {
  
  tmpd <- tempdir()
  tmpf <- tempfile(tmpdir=tmpd, fileext=".zip")
  
  file.copy(word_doc, tmpf)
  unzip(tmpf, exdir=sprintf("%s/docdata", tmpd))
  
  doc <- read_xml(sprintf("%s/docdata/word/document.xml", tmpd))
  
  unlink(tmpf)
  unlink(sprintf("%s/docdata", tmpd), recursive=TRUE)

  ns <- xml_ns(doc)
  
  tbls <- xml_find_all(doc, ".//w:tbl", ns=ns)
  
  lapply(tbls, function(tbl) {
    
    cells <- xml_find_all(tbl, "./w:tr/w:tc", ns=ns)
    rows <- xml_find_all(tbl, "./w:tr", ns=ns)
    dat <- data.frame(matrix(xml_text(cells), 
                             ncol=(length(cells)/length(rows)), 
                             byrow=TRUE), 
                      stringsAsFactors=FALSE)
    colnames(dat) <- dat[1,]
    dat <- dat[-1,]
    rownames(dat) <- NULL
    dat
    
  })
  
}

Using this multi-table Word doc – doc3:

data3

we can extract the three tables thusly:

get_tbls("~/Dropbox/data3.docx")

## [[1]]
##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo
## 
## [[2]]
##   Foo Bar Baz
## 1  Aa  Bb  Cc
## 2  Dd  Ee  Ff
## 3  Gg  Hh  ii
## 
## [[3]]
##   Foo Bar
## 1  Aa  Bb
## 2  Dd  Ee
## 3  Gg  Hh
## 4  1    2
## 5  Zz  Jj
## 6  Tt  ii

This function tries to calculate the rows/columns per table but it does rely on a uniform table structure.

Have an alternate method or more feature-complete way of handling Word docs as tabular data sources? Then definitely drop a note in the comments.

Doh! I Could Have Had Just Used V8!

An R user recently had the need to split a “full, human name” into component parts to retrieve first & last names. The full names could be anything from something simple like _”David Regan”_ to more complex & diverse such as _”John Smith Jr.”_, _”Izaque Iuzuru Nagata”_ or _”Christian Schmit de la Breli”_. Despite the face that I’m _pretty good_ at searching GitHub & CRAN for R stuff, my quest came up empty (though a teensy part of me swears I saw this type of thing in a package somewhere). I _did_ manage to find Python & node.js modules that carved up human names but really didn’t have the time to re-implement their functionality from scratch in R (or, preferably, Rcpp).

Rather than rely on the Python bridge to R (yuck) I decided to use @opencpu’s [V8 package](https://cran.rstudio.com/web/packages/V8/index.html) to wrap a part of the node.js [humanparser](https://github.com/chovy/humanparser) module. If you’re not familiar with V8, it provides the ability to run JavaScript code within R and makes it possible to pass variables into JavaScript functions and get data back in return. All the magic happens via a JSON data passing & Rcpp wrappers (and, of course, the super-awesome code Jeroen writes).

Working with JavaScript in R is as simple as creating an instance of the JavaScript V8 interpreter, loading up the JavaScript code that makes the functions work:

library(V8)
 
ct <- new_context()
ct$source(system.file("js/underscore.js", package="V8"))
ct$call("_.filter", mtcars, JS("function(x){return x.mpg < 15}"))
 
#>                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#> Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
#> Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4

There are many more examples in the [V8 vignette](https://cran.rstudio.com/web/packages/V8/vignettes/v8_intro.html).

For `humanparser` I needed to use Underscore.js (it comes with V8) and a [function](https://github.com/chovy/humanparser/blob/master/index.js#L5-L74) from `humanparser` that I carved out to work the way I wanted it to. You can look at the innards of the package [on github](https://github.com/hrbrmstr/humanparser)—specifically, [this file](https://github.com/hrbrmstr/humanparser/blob/master/R/humanparser.r) (it’s _really_ small)— and, to use the two new functions the package exposes it’s as simple as doing:

devtools::install_github("hrbrmstr/humanparser")
 
library(humanparser)
 
parse_name("John Smith Jr.")
 
#> $firstName
#> [1] "John"
#> 
#> $suffix
#> [1] "Jr."
#> 
#> $lastName
#> [1] "Smith"
#> 
#> $fullName
#> [1] "John Smith Jr."

or the following to convert a bunch of ’em:

full_names <- c("David Regan", "Izaque Iuzuru Nagata", 
                "Christian Schmit de la Breli", "Peter Doyle", "Hans R.Bruetsch", 
                "Marcus Reichel", "Per-Axel Koch", "Louis Van der Walt", 
                "Mario Adamek", "Ugur Tozsekerli", "Judit Ludvai" )
 
parse_names(full_names)
 
#> Source: local data frame [11 x 4]
#> 
#>    firstName     lastName                     fullName middleName
#> 1      David        Regan                  David Regan         NA
#> 2     Izaque       Nagata         Izaque Iuzuru Nagata     Iuzuru
#> 3  Christian  de la Breli Christian Schmit de la Breli     Schmit
#> 4      Peter        Doyle                  Peter Doyle         NA
#> 5       Hans   R.Bruetsch              Hans R.Bruetsch         NA
#> 6     Marcus      Reichel               Marcus Reichel         NA
#> 7   Per-Axel         Koch                Per-Axel Koch         NA
#> 8      Louis Van der Walt           Louis Van der Walt         NA
#> 9      Mario       Adamek                 Mario Adamek         NA
#> 10      Ugur   Tozsekerli              Ugur Tozsekerli         NA
#> 11     Judit       Ludvai                 Judit Ludvai         NA

Now, the functions in this package won’t win any land-speed records since we’re going from R to C[++] to JavaScript and back, passing JSON-converted data back & forth, so I pwnd @quominus into making a full Rcpp-based human, full-name parser. And, he’s nearly done! So, keep an eye on [humaniformat](https://github.com/Ironholds/humaniformat) since it will no doubt be in CRAN soon.

The real point of this post is that there are _tons_ of JavaScript modules that will work well with the V8 package and let you get immediate functionality for something that might not be in R yet. You can prototype quickly (it took almost no time to make that package and you don’t even need to go that far), then optimize later. So, next time—if you can’t find some functionality directly in R—see if you can get by with a JavaScript shim, then convert to full R/Rcpp when/if you need to go into production.

If you’ve done any creative V8 hacks, drop a note in the comments!

Track Hurricane Danny (Interactively) with R + leaflet

2015-08-20 – 11:41
Posted in Data Visualization, DataVis, DataViz, R, Weather
Tagged post
Comments (5)

poster image

Danny became the [first hurricane of the 2015 Season](http://www.accuweather.com/en/weather-news/atlantic-gives-birth-to-tropical-depression-four-danny/51857239), so it’s a good time to revisit how one might be able to track them with R.

We’ll pull track data from [Unisys](http://weather.unisys.com/hurricane/atlantic/2015/index.php) and just look at Danny, but it should be easy to extrapolate from the code.

For this visualization, we’ll use [leaflet](http://rstudio.github.io/leaflet/) since it’s all the rage and makes the plots interactive without any real work (thanks to the very real work by the HTML Widgets folks and the Leaflet.JS folks).

Let’s get the library calls out of the way:

library(leaflet)
library(stringi)
library(htmltools)
library(RColorBrewer)

Now, we’ll get the tracks:

danny <- readLines("http://weather.unisys.com/hurricane/atlantic/2015/DANNY/track.dat")

Why aren’t we using `read.csv` or `read.table` directly, you ask? Well, the data is in a _really_ ugly format thanks to the spaces in the `STATUS` column and two prefix lines:

Date: 18-20 AUG 2015
Hurricane-1 DANNY
ADV  LAT    LON      TIME     WIND  PR  STAT
  1  10.60  -36.50 08/18/15Z   30  1009 TROPICAL DEPRESSION
  2  10.90  -37.50 08/18/21Z    -     - TROPICAL DEPRESSION
  3  11.20  -38.80 08/19/03Z    -     - TROPICAL DEPRESSION
  4  11.30  -40.20 08/19/09Z    -     - TROPICAL DEPRESSION
  5  11.20  -41.10 08/19/15Z    -     - TROPICAL DEPRESSION
  6  11.50  -42.00 08/19/21Z    -     - TROPICAL DEPRESSION
  7  12.10  -42.70 08/20/03Z    -     - TROPICAL DEPRESSION
  8  12.20  -43.70 08/20/09Z    -     - TROPICAL DEPRESSION
  9  12.50  -44.80 08/20/15Z    -     - TROPICAL DEPRESSION
+12  13.10  -46.00 08/21/00Z   70     - HURRICANE-1
+24  14.00  -47.60 08/21/12Z   75     - HURRICANE-1
+36  14.70  -49.40 08/22/00Z   75     - HURRICANE-1
+48  15.20  -51.50 08/22/12Z   70     - HURRICANE-1
+72  16.00  -56.40 08/23/12Z   65     - HURRICANE-1
+96  16.90  -61.70 08/24/12Z   65     - HURRICANE-1
+120  18.00  -66.60 08/25/12Z   55     - TROPICAL STORM

But, we can put that into shape pretty easily, using `gsub` to make it easier to read everything with `read.table` and we just skip over the first two lines (we’d use them if we were doing other things with more of the data).

danny_dat <- read.table(textConnection(gsub("TROPICAL ", "TROPICAL_", danny[3:length(danny)])), 
           header=TRUE, stringsAsFactors=FALSE)

Now, let’s make the data a bit prettier to work with:

# make storm type names prettier
danny_dat$STAT <- stri_trans_totitle(gsub("_", " ", danny_dat$STAT))
 
# make column names prettier
colnames(danny_dat) <- c("advisory", "lat", "lon", "time", "wind_speed", "pressure", "status")

Those steps weren’t absolutely necessary, but why do something half-baked (unless it’s chocolate chip cookies)?

Let’s pick better colors than Unisys did. We’ll use a color-blind safe palette from Color Brewer:

danny_dat$color <- as.character(factor(danny_dat$status, 
                          levels=c("Tropical Depression", "Tropical Storm",
                                   "Hurricane-1", "Hurricane-2", "Hurricane-3",
                                   "Hurricane-4", "Hurricane-5"),
                          labels=rev(brewer.pal(7, "YlOrBr"))))

And, now for the map! We’ll make lines for the path that was already traced by Danny, then make interactive points for the forecast locations from the advisory data:

last_advisory <- tail(which(grepl("^[[:digit:]]+$", danny_dat$advisory)), 1)
 
# draw the map
leaflet() %>% 
  addTiles() %>% 
  addPolylines(data=danny_dat[1:last_advisory,], ~lon, ~lat, color=~color) -> tmp_map
 
if (last_advisory < nrow(danny_dat)) {
 
   tmp_map <- tmp_map %>% 
     addCircles(data=danny_dat[last_advisory:nrow(danny_dat),], ~lon, ~lat, color=~color, fill=~color, radius=25000,
             popup=~sprintf("<b>Advisory forecast for +%sh (%s)</b><hr noshade size='1'/>
                           Position: %3.2f, %3.2f<br/>
                           Expected strength: <span style='color:%s'><strong>%s</strong></span><br/>
                           Forecast wind: %s (knots)<br/>Forecast pressure: %s",
                           htmlEscape(advisory), htmlEscape(time), htmlEscape(lon),
                           htmlEscape(lat), htmlEscape(color), htmlEscape(status), 
                           htmlEscape(wind_speed), htmlEscape(pressure)))
}
 
html_print(tmp_map)

Click on one of the circles to see the popup.

The entire source code is in [this gist](https://gist.github.com/hrbrmstr/e3253ddd353f1a489bb4) and, provided you have the proper packages installed, you can run this at any time with:

devtools::source_gist("e3253ddd353f1a489bb4", sha1="00074e03e92c48c470dc182f67c91ccac612107e")

The use of the `sha1` hash parameter will help ensure you aren’t being asked to run a potentially modified & harmful gist, but you should visit the gist first to make sure I’m not messing with you (which, I’m not).

If you riff off of this or have suggestions for improvement, drop a note here or in the gist comments.

cdcfluview – On The Way to “CRAN 7K”

I like to turn coincidence into convergence whenever possible. This weekend, a user of [cdcfluview](http://github.com/hrbrmstr/cdcfluview) had a question that caused me to notice a difference in behaviour between the package was interacting with CDC FluView API, so I updated the package to accommodate the change and the user.

Around the same time, @recology_ tweeted:

we're at 6983 #rstats pkgs https://t.co/4i6dWF103I – quick, someone submit 17 more so we can hit 7K

— Scott Chamberlain (@recology_) August 9, 2015

Finally, the [2015-2016 flu season](http://www.cdc.gov/flu/about/season/flu-season-2015-2016.htm) is also fast approaching (so a three-fer!), making a CRAN leap for `cdcfluview` quite timely.

Since I can’t let @quonimous have all the `#rstats` fun & glory, I added a function and two data sets to `cdcfluview`, did the CRAN dance and it’s now [on CRAN](https://cran.rstudio.com/web/packages/cdcfluview/). I also did the github dance to have it’s entry in the [Web Technologies Task View](https://cran.rstudio.com/web/views/WebTechnologies.html) updated.

The new function lets you grab the XML behind the high-level [Weekly US Map: Influenza Summary Update](http://www.cdc.gov/flu/weekly/usmap.htm) (I’ll be adding a function to make a similar plot that won’t require gosh-awful Flash) and the data sets provide metadata about the composition of HHS regions and Census regions, making it easier to compose factors, add labels to maps or even segment maps/combine polygons. The existing two grab current & historical detailed national & state influenza data.

There’s an example on github and in the

(https://rud.is/b/2015/01/10/new-r-package-cdcfluview-retrieve-flu-data-from-cdcs-fluview-portal/).

If you have any other data you need freed from the confines of the CDC FluView Flash portal, please file an issue & paste a screen shot (if you are comfortable with most browser Developer Tools views, even a dump of the request or “as cURL” URL would be awesome).

Adding a CRAN Search Engine to Chrome

Riffing off of [the previous post](http://rud.is/b/2015/08/05/speeding-up-your-quests-for-r-stuff/), here’s a way to quickly search CRAN (the @RStudio flavor) from the Chrome search bar.

– Paste `chrome://settings/searchEngines` into your location bar and hit return/enter
– Scroll down until the input boxes show, enabling you to add a search engine
– For _”Add a new search engine”_ put “`CRAN`”
– For _”Keyword”_ put “`R`”, “`rstats`” or “`CRAN`”, but “`R`” is super easy to type, though it may not be optimal for you :-)
– For _”URL with %s in place of query”_ put the following:

https://www.google.com/search?as_q=%s&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=cran.rstudio.com&as_occt=any&safe=images&as_filetype=&as_rights=&gws_rd=ssl

(you may be able to trim that URL a bit, if desired)

Save that and then in the Chrome location bar hit `R` then `[TAB]` and you’ll be sending the query to a custom Google search that only looks on CRAN (specifically the @RStudio CRAN mirror).