
Author Archives: hrbrmstr


I had no intention of blogging this, but @jayjacobs convinced me otherwise. I was curious about the recent (end of March, 2014) [California earthquake](http://www.latimes.com/local/lanow/la-me-ln-an-estimated-17-million-people-felt-51-earthquake-in-california-20140331,0,2465821.story#axzz2xfGBteq0) “storm” and did a quick plot for “fun” and personal use with `ggmap`/`ggplot`.

I used data from the [Southern California Earthquake Center](http://www.data.scec.org/recent/recenteqs/Maps/Los_Angeles.html) (that I cleaned up a bit and that you can find [here](/dl/quakes.dat)) but would have used the USGS quake data if the site hadn’t been down when I tried to get it from there.

The code/process isn’t exactly rocket-science, but if you’re looking for a simple way to layer some data on a “real” map (vs handling shapefiles on your own) then this is a really compact/self-contained tutorial/example.

You can find the code & data over at [github](https://gist.github.com/hrbrmstr/9921419) as well.

There’s lots of ’splainin in the comments (which are probably easier to read on the github site), but drop a note in the comments here or on Twitter if anything needs further explanation. The graphic is SVG, so use a proper browser :-) or run the code in R if you can’t see it here.


(Figure: “Recent Earthquakes in CA & NV” quake points over a Los Angeles base map, with a stacked bar chart of daily quake counts by magnitude below; click for larger version)

library(ggplot2)
library(ggmap)
library(plyr)
library(grid)
library(gridExtra)
 
# read in cleaned up data
dat <- read.table("quakes.dat", header=TRUE, stringsAsFactors=FALSE)
 
# map decimal magnitudes into an integer range
dat$m <- cut(dat$MAG, c(0:10))
 
# convert to dates
dat$DATE <- as.Date(dat$DATE)
 
# so we can re-order the data frame
dat <- dat[order(dat$DATE),]
 
# not 100% necessary, but get just the numeric portion of the cut factor
dat$Magnitude <- factor(as.numeric(dat$m))
 
# sum up by date for the barplot
dat.sum <- count(dat, .(DATE, Magnitude))
 
# start the ggmap bit
# It's super-handy that it understands things like "Los Angeles" #spoffy
# I like the 'toner' version. Would also use a stamen map but I can't get 
# to it consistently from behind a proxy server
la <- get_map(location="Los Angeles", zoom=10, color="bw", maptype="toner")
 
# get base map layer
gg <- ggmap(la) 
 
# add points. Note that the plot will produce warnings for all points not in the
# lat/lon range of the base map layer. Also note that i'm encoding magnitude by
# size and color and using alpha for depth. because of the way the data is sorted
# the most recent quakes in the set should be on top
gg <- gg + geom_point(data=dat,
                      mapping=aes(x=LON, y=LAT, 
                                  size=MAG, fill=m, alpha=DEPTH), shape=21, color="black")
 
# this takes the magnitude domain and maps it to a better range of values (IMO)
gg <- gg + scale_size_continuous(range=c(1,15))
 
# this bit makes the right size color ramp. i like the reversed view better for this map
gg <- gg + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
gg <- gg + ggtitle("Recent Earthquakes in CA & NV")
 
# no need for a legend as the bars are pretty much the legend
gg <- gg + theme(legend.position="none")
 
 
# now for the bars. we work with the summarized data frame
gg.1 <- ggplot(dat.sum, aes(x=DATE, y=freq, group=Magnitude))
 
# normally, i dislike stacked bar charts, but this is one time i think they work well
gg.1 <- gg.1 + geom_bar(aes(fill=Magnitude), position="stack", stat="identity")
 
# fancy, schmanzy color mapping again
gg.1 <- gg.1 + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
 
# show the data source!
gg.1 <- gg.1 + labs(x="Data from: http://www.data.scec.org/recent/recenteqs/Maps/Los_Angeles.html", y="Quake Count")
gg.1 <- gg.1 + theme_bw() #stopthegray
 
# use grid.arrange to make the sizes work well
grid.arrange(gg, gg.1, nrow=2, ncol=1, heights=c(3,1))

Andy Kirk (@visualisingdata) & Lynn Cherny (@arnicas) tweeted about the Guardian Word Count service/archive site, lamenting the lack of visualizations.

This gave me a chance to bust out another [Shiny](http://www.rstudio.com/shiny/) app over on our [Data Driven Security](http://datadrivensecurity.info) [shiny server](http://shiny.dds.ec/guardian-words/).

I used my trusty “`Google-Drive-spreadsheet-IMPORTHTML-to-CSV`” workflow (you can access the automagically updated data [here](https://docs.google.com/spreadsheets/d/10CZhMhpFxTPWcLauam-ydKeFrdNgHEIehKznVMHFRM0/pubhtml)) to make the CSV that updates daily on the site and is referenced by the Shiny/R code.
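
In case you’d rather skip the Shiny app and poke at the numbers directly, the same published sheet can be pulled straight into R. Here’s a minimal sketch that assumes the sheet is also reachable via the `export?format=csv&gid=0` substitution described in the Sochi medals post below (the `gid` value is an assumption):

library(RCurl)

# CSV export of the auto-updating published sheet (gid=0 is an assumption)
guardian.csv.URL <- "https://docs.google.com/spreadsheets/d/10CZhMhpFxTPWcLauam-ydKeFrdNgHEIehKznVMHFRM0/export?format=csv&gid=0"

# read the daily Guardian word counts into a data frame
words <- read.csv(textConnection(getURL(guardian.csv.URL)),
                  stringsAsFactors=FALSE)

str(words)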

The code has been [gist-ified](https://gist.github.com/hrbrmstr/9570488), and I’ll be re-visiting it to refactor the `data.frame` creation bits and add some more charts as the data set gets larger.


(Don’t forget to take a peek at our new book, [Data-Driven Security](http://bit.ly/ddsec)!)

I shot a quick post over at the [Data Driven Security blog](http://bit.ly/1hyqJiT) explaining how to separate Twitter data gathering from R code via the Ruby `t` ([github repo](https://github.com/sferik/t)) command. Using `t` frees R code from having to be a Twitter processor and lets the analyst focus on analysis and visualization, plus you can use `t` as a substitute for Twitter GUIs if you’d rather play at the command-line:

$ t timeline ddsecblog
   @DDSecBlog
   Monitoring Credential Dumps Plus Using Twitter As a Data Source http://t.co/ThYbjRI9Za
 
   @DDSecBlog
   Nice intro to R + stats // Data Analysis and Statistical Inference free @datacamp_com course
   http://t.co/FC44FF9DSp
 
   @DDSecBlog
   Very accessible paper & cool approach to detection // Nazca: Detecting Malware Distribution in
   Large-Scale Networks http://t.co/fqrSaFvUK2
 
   @DDSecBlog
   Start of a new series by new contributing blogger @spttnnh! // @AlienVault rep db Longitudinal
   Study Part 1 : http://t.co/XM7m4zP0tr
 
   ...

The DDSec post shows how to mine the well-formatted output from the @dumpmon Twitter bot to visualize dump trends over time, and has the code in-line and over at the [DDSec github repo](https://github.com/ddsbook/blog/blob/master/extra/src/R/dumpmon.R) [R].
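
If you’d rather sketch something similar without pulling the repo, the rough shape of the idea looks like this (a hedged sketch, not the DDSec post’s actual code): dump @dumpmon’s timeline to CSV with `t` and tally tweets per day in R. The column names below assume `t`’s standard CSV header (`ID`, `Posted at`, `Screen name`, `Text`); check your output and adjust if yours differs.

# at the shell (flags assume current `t` gem behavior):
#   t timeline dumpmon --csv --number 1000 > dumpmon.csv

library(plyr)
library(ggplot2)

# "Posted at" becomes "Posted.at" once read.csv() fixes up the names
dumps <- read.csv("dumpmon.csv", stringsAsFactors=FALSE)

# grab just the date portion of the timestamp
dumps$day <- as.Date(dumps$Posted.at)

# tweets (i.e. dump reports) per day
daily <- count(dumps, "day")

gg <- ggplot(daily, aes(x=day, y=freq))
gg <- gg + geom_line()
gg <- gg + labs(x=NULL, y="@dumpmon tweets/day")
gg <- gg + theme_bw()
gg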

I’m posting this mostly to show how to:

– use the Google spreadsheet data-munging “hack” from the [previous post](http://rud.is/b/2014/02/11/live-google-spreadsheet-for-keeping-track-of-sochi-medals/) in a Shiny context
– include it seamlessly into a web page, and
– run it locally without a great deal of wrangling

The code for the app is [in this gist](https://gist.github.com/hrbrmstr/8949172). It is unsurprisingly just like [some spiffy other code](http://www.r-bloggers.com/winter-olympic-medal-standings-presented-by-r/) you’ve seen, apart from my aesthetic choices (Sochi blue! lines+dots! and current rankings next to country names).

I won’t regurgitate the code here since it’s just as easy to view on [github](https://gist.github.com/hrbrmstr/8949172). You’re seeing the live results of the app below (unless you’ve been more conservative than most folks with your browser security settings),

but the app is actually hosted over at [Data Driven Security](http://shiny.dds.ec/sochi2014/), a blog and (woefully underpowered so reload if it coughs up blood, pls) Shiny server that I run with @jayjacobs. It appears in this WordPress post with the help of an `IFRAME`. It’s essentially the same technique the RStudio/Shiny folks use in many of their own examples.

The app uses [bootstrapPage()](http://www.rdocumentation.org/packages/shiny/functions/bootstrapPage) to help make a more responsive layout which will react nicely in an `IFRAME` setting (since you won’t know the width of the browser area you’re trying to fit the Shiny output into).

In the `ui.R` file, I have the [plotOutput()](http://www.rdocumentation.org/packages/shiny/functions/plotOutput) configured to scale to 100% of container width:

plotOutput("medalsPlot", width="100%")

and then create a seamless `IFRAME` that also sizes to max-width:

<iframe src="http://shiny.dds.ec/sochi2014/" 
        style="max-width:100%" 
        width="100%"
        height="500px"
        scrolling="no" 
        frameborder="no" 
        seamless="seamless">
</iframe>
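
For reference, a stripped-down `bootstrapPage()`-style `ui.R` that pairs with that `IFRAME` might look something like this sketch (only `medalsPlot` comes from the snippet above; the heading and height are illustrative):

library(shiny)

shinyUI(bootstrapPage(

  # small heading for the embedded view
  h4("Sochi 2014 Medal Standings"),

  # let the plot track the width of whatever container it ends up in
  plotOutput("medalsPlot", width="100%", height="500px")

))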

The *really cool* part (IMO) about many Shiny apps is that you don’t need to rely on the external server to work with the visualization/output. Provided that:

– the authors have coded their app to support local execution…
– and presented the necessary `ui.R`, `server.R`, `global.R`, HTML/CSS & data files either as a github gist or a zip/gz/tar.gz file…
– and **you** have the necessary libraries installed

then, you can start the app with a simple [Rscript](http://www.rdocumentation.org/packages/utils/functions/Rscript) one-liner:

Rscript -e "shiny::runGist(8949172, launch.browser=TRUE)"

or

Rscript -e "shiny::runUrl('http://dds.ec/apps/sochi2014.tar.gz', launch.browser=TRUE)"

There is *some* danger doing this if you haven’t read through the R code prior, since it’s possible to stick some fairly malicious operations in an R script (hey, I’m an infosec professional, so we’re always paranoid :-). But, if you stick with using a gist and do examine the code, you should be fine.

The “medals” R post by [TRInker](http://trinkerrstuff.wordpress.com/2014/02/09/sochi-olympic-medals-2/) (re-blogged by [Revolutions](http://blog.revolutionanalytics.com/2014/02/winter-olympic-medal-standings-presented-by-r.html)) was spiffy and a live example of why there’s no point in not publishing raw data.

You don’t need to have R (or any other language) do the scraping, though. The “`IMPORTHTML`” function (yes, function names seem to be ALL CAPS now over at Google Drive) in Google Drive Spreadsheets can easily do the scraping with just a simple:

=IMPORTHTML("http://www.sochi2014.com/en/medal-standings","table",0)

that will refresh on demand and every hour.

Here’s a [live URL](https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0) that will give back a CSV of the results which can easily be used in R thusly:

library(RCurl)
 
sochi.medals.URL = "https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0"
medals <- read.csv(textConnection(getURL(sochi.medals.URL)), 
                   stringsAsFactors = FALSE)
 
str(medals)
 
'data.frame':  89 obs. of  6 variables:
$ Rank   : chr  "1" "2" "3" "4" ...
$ Country: chr  "Norway" "Canada" "Netherlands" "United States" ...
$ Gold   : int  4 4 3 2 2 1 1 1 1 1 ...
$ Silver : int  3 3 2 1 0 3 2 0 0 0 ...
$ Bronze : int  4 2 3 3 0 3 0 1 0 0 ...
$ Total  : int  11 9 8 6 2 7 3 2 1 1 ...
 
print(medals)
 
   Rank                                   Country Gold Silver Bronze Total
1     1                                    Norway    4      3      4    11
2     2                                    Canada    4      3      2     9
3     3                               Netherlands    3      2      3     8
4     4                             United States    2      1      3     6
5     5                                   Germany    2      0      0     2
6     6                              Russian Fed.    1      3      3     7
7     7                                   Austria    1      2      0     3
8     8                                    France    1      0      1     2
9    =9                                   Belarus    1      0      0     1
10   =9                                     Korea    1      0      0     1
11   =9                                    Poland    1      0      0     1
12   =9                                  Slovakia    1      0      0     1
13   =9                               Switzerland    1      0      0     1
14   14                                    Sweden    0      3      1     4
15   15                            Czech Republic    0      2      1     3
16   16                                  Slovenia    0      1      2     3
17   17                                     Italy    0      1      1     2
18  =18                                     China    0      1      0     1
19  =18                                   Finland    0      1      0     1
20  =20                             Great Britain    0      0      1     1
21  =20                                   Ukraine    0      0      1     1
22    -                                   Albania    0      0      0     0
23    -                                   Andorra    0      0      0     0
24    -                                 Argentina    0      0      0     0
25    -                                   Armenia    0      0      0     0
...
87    -                             Virgin Isl, B    0      0      0     0
88    -                            Virgin Isl, US    0      0      0     0
89    -                                  Zimbabwe    0      0      0     0

This frees you up from dealing with the scraping and lets you focus solely on the data.
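
As a quick, hedged example of that (not from the original post): drop the medal-less countries and chart the totals, picking up right where the `read.csv()` above leaves off.

library(ggplot2)

# keep only countries that have actually medaled
winners <- medals[medals$Total > 0, ]

# order countries by total medal count so the chart reads top-down
winners$Country <- reorder(winners$Country, winners$Total)

gg <- ggplot(winners, aes(x=Country, y=Total))
gg <- gg + geom_bar(stat="identity", fill="#0099cc")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y="Total medals (so far)")
gg <- gg + theme_bw()
gg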

You can set this up in your own Google Docs as well; just make sure to publish the spreadsheet to the web (with ‘everyone’ read permissions), strip off the `pubhtml` at the end of the published URL and add `export?format=csv&gid=0` in its place.
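
In R terms, that URL surgery is just a string substitution; here’s a tiny sketch with a placeholder spreadsheet key:

# the "pubhtml" URL you get from the Google Docs publish dialog
# (YOUR_SHEET_KEY is a placeholder, not a real spreadsheet)
pub.url <- "https://docs.google.com/spreadsheets/d/YOUR_SHEET_KEY/pubhtml"

# swap the trailing "pubhtml" for the CSV export endpoint
csv.url <- sub("pubhtml$", "export?format=csv&gid=0", pub.url)

csv.url
# [1] "https://docs.google.com/spreadsheets/d/YOUR_SHEET_KEY/export?format=csv&gid=0"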

Jay Jacobs (@jayjacobs)—my co-author of the soon-to-be-released book [Data-Driven Security](http://amzn.to/ddsec)—& I have been hard at work over at the book’s [sister-blog](http://dds.ec/blog) cranking out code to help security domain experts delve into the dark art of data science.

We’ve covered quite a bit of ground since January 1st, but I’m using this post to focus more on what we’ve produced using R, since that’s our go-to language.

Jay used the blog to do a [long-form answer](http://datadrivensecurity.info/blog/posts/2014/Jan/severski/) to a question asked by @dseverski on the [SIRA](http://societyinforisk.org) mailing list and I piled on by adding a [Shiny app](http://datadrivensecurity.info/blog/posts/2014/Jan/solvo-mediocris/) into the mix (both posts make for a pretty `#spiffy` introduction to expert-opinion risk analyses in R).

Jay continued by [releasing a new honeypot data set](http://datadrivensecurity.info/blog/data/2014/01/marx.gz) and corresponding two-part[[1](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part1/),[2](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part2/)] post series to jump start analyses on that data. (There’s a D3 geo-visualization stuck in-between those posts if you’re into that sort of thing).

I got it into my head to start a project to build a [password dump analytics tool](http://datadrivensecurity.info/blog/posts/2014/Feb/ripal/) in R (with **much** more coming soon on that, including a full-on R package + Shiny app combo) and also continue the discussion we started in the book on the need for the infusion of reproducible research principles and practices in the information security domain by building off of @sucuri_security’s [Darkleech botnet](http://datadrivensecurity.info/blog/posts/2014/Feb/reproducible-research-sucuri-darkleech-data/) research.

You can follow along at home with the blog via its [RSS feed](http://datadrivensecurity.info/blog/feeds/all.atom.xml) or via the @ddsecblog Twitter account. You can also **play** along at home if you feel you have something to contribute. It’s as simple as a github pull request and some really straightforward markdown. Take a look at the blog’s [github repo](https://github.com/ddsbook/blog) and hit me up (@hrbrmstr) for details if you’ve got something to share.

If I made a Venn diagram of the cross-section of readers of this blog and the [Data Driven Security](http://dds.ec/) web sites it might be indistinguishable from a pure circle. However, just in case there are a few stragglers out there, I figured one more post on the fact that the new book by @jayjacobs & me is available _now_ in electronic form (not pre-order) wouldn’t hurt. The print book is still making its way from dead trees to store shelves and should be ready for the expected February 17th debut.

Here’s the list of links to e-tailers (man, I hate that term) who have it available for the various e-readers out there.

– [Amazon/Kindle](http://www.amazon.com/gp/product/B00I1Y7THY/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00I1Y7THY&linkCode=as2&tag=rudisdotnet-20)
– [B&N/Nook](http://www.barnesandnoble.com/w/data-driven-security-jay-jacobs/1117239036?ean=9781118793725)
– [Google Books](https://play.google.com/store/books/details?id=LQigAgAAQBAJ)
– [Kobo](http://store.kobobooks.com/en-US/ebook/data-driven-security)

If you happen to catch it out in the wild and not on this list, drop me (@hrbrmstr) a note `#pls`.

And, a huge thank you! to everyone for their kind accolades yesterday (esp to those who’ve purchased the book :-)

This tweet by @moorehn (who is usually a superb economic journalist) really bugged me.

I grabbed the raw data from EPI (http://www.epi.org/files/2012/data-swa/jobs-data/Employment%20to%20population%20ratio%20(EPOPs).xls), properly started the graph at 0 on the y-axis, and also broke out men & women (since the Excel spreadsheet had the data). It’s a really different picture:

(Figure: employment-to-population ratio, ages 25–54, men vs. women, 1975–2011)

I’m not saying employment is great right now, but it’s nowhere near a “ski jump”. So much for the state of data journalism at the start of 2014.

Here’s the hastily crafted R-code:

library(ggplot2)
library(ggthemes)
library(reshape2)
 
a <- read.csv("empvyear.csv")
b <- melt(a, id.vars="Year")
 
gg <- ggplot(data=b, aes(x=Year, y=value, group=variable))
gg <- gg + geom_line(aes(color=variable))
gg <- gg + ylim(0, 100)
gg <- gg + theme_economist()
gg <- gg + labs(x="Year", y="Employment as share of population (%)", 
                title="Employment-to-population ratio, age 25–54, 1975–2011")
gg <- gg + theme(legend.title = element_blank())
gg

And, here’s the data extracted from the Excel file:

Year,Men,Women
1975,89.0,51.0
1976,89.5,52.9
1977,90.1,54.8
1978,91.0,57.3
1979,91.1,59.0
1980,89.4,60.1
1981,89.0,61.2
1982,86.5,61.2
1983,86.1,62.0
1984,88.4,63.9
1985,88.7,65.3
1986,88.5,66.6
1987,89.0,68.2
1988,89.5,69.3
1989,89.9,70.4
1990,89.1,70.6
1991,87.5,70.1
1992,86.8,70.1
1993,87.0,70.4
1994,87.2,71.5
1995,87.6,72.2
1996,87.9,72.8
1997,88.4,73.5
1998,88.8,73.6
1999,89.0,74.1
2000,89.0,74.2
2001,87.9,73.4
2002,86.6,72.3
2003,85.9,72.0
2004,86.3,71.8
2005,86.9,72.0
2006,87.3,72.5
2007,87.5,72.5
2008,86.0,72.3
2009,81.5,70.2
2010,81.0,69.3
2011,81.4,69.0