Archive for the ‘Programming’ Category

Rforecastio 1.2.0 Bug-fix Update

Not even going to put an R category on this since I don’t want to pollute R-bloggers with this tiny post, but I had to provide the option to let folks specify ssl.verifypeer=FALSE (so I made it a generic option to pass in any CURL parameters), and I had a couple of gaping bugs that I missed because I hadn’t cleared out my environment before building & testing.
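
For the curious, a hypothetical sketch of what that call might look like is below; the pass-through argument name is made up purely for illustration, so check the package documentation for the real signature:

library(Rforecastio)
 
# Hypothetical sketch only: the point of the 1.2.0 change is that extra
# CURL options can be handed through to RCurl. The ".opts" argument name
# below is illustrative and may not match the actual function signature.
fio.api.key <- readLines("~/.forecast.io") # never hard-code API keys
fio.list <- fio.forecast(fio.api.key, "43.2673", "-70.8618",
                         .opts=list(ssl.verifypeer=FALSE))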

Rforecastio Package Update (1.1.0)

I’ve bumped up the version number of Rforecastio (github) to 1.1.0. The new features are:

  • removing the SSL certificate bypass check (it doesn’t need it anymore)
  • using plyr for easier conversion of JSON->data frame
  • adding in a new daily forecast data frame
  • roxygen2 inline documentation
library(Rforecastio)
library(ggplot2)
library(plyr)
 
# NEVER put API keys in revision control systems or source code!
fio.api.key = readLines("~/.forecast.io")
 
my.latitude = "43.2673"
my.longitude = "-70.8618"
 
fio.list <- fio.forecast(fio.api.key, my.latitude, my.longitude)
 
fio.gg <- ggplot(data=fio.list$hourly.df, aes(x=time, y=temperature))
fio.gg <- fio.gg + labs(y="Readings", x="Time", title="Hourly Readings")
fio.gg <- fio.gg + geom_line(aes(y=humidity*100), color="green")
fio.gg <- fio.gg + geom_line(aes(y=temperature), color="red")
fio.gg <- fio.gg + geom_line(aes(y=dewPoint), color="blue")
fio.gg <- fio.gg + theme_bw()
fio.gg

(image: daily)

fio.gg <- ggplot(data=fio.list$daily.df, aes(x=time, y=temperature))
fio.gg <- fio.gg + labs(y="Readings", x="Time", title="Daily Readings")
fio.gg <- fio.gg + geom_line(aes(y=humidity*100), color="green")
fio.gg <- fio.gg + geom_line(aes(y=temperatureMax), color="red")
fio.gg <- fio.gg + geom_line(aes(y=temperatureMin), color="red", linetype=2)
fio.gg <- fio.gg + geom_line(aes(y=dewPoint), color="blue")
fio.gg <- fio.gg + theme_bw()
fio.gg

(image: hourly)

Moving From system() calls to Rcpp Interfaces

Over on the Data Driven Security Blog there’s a post on how to use Rcpp to interface with an external library (in this case ldns for DNS lookups). It builds on another post which uses system() to make a call to dig to lookup DNS TXT records.

The core code is available at both the aforementioned blog post and this gist. The post walks you through creating a simple interface, and a future post will cover how to build a full package interface to an external library.
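
Since the embedded gist doesn’t render here, below is a minimal sketch of the general Rcpp pattern the post builds on. This is not the ldns code from the post, just the smallest possible illustration of compiling C++ inline and exposing it as an ordinary R function:

library(Rcpp)
 
# compile a trivial C++ function and expose it to R; interfacing to an
# external library follows the same pattern, with the library's headers
# and link flags added to the compilation step
cppFunction('
  int timesTwo(int x) {
    return x * 2;
  }
')
 
timesTwo(21)
## [1] 42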

Points, Polygons and Power Outages

Most of my free coding time has been spent tweaking a D3-based live power outage tracker for Central Maine Power customers (there’s also a woefully less-featured Shiny app for it). There is some R associated with the D3 vis, but it’s limited to a cron job that makes the CSV files for the sparklines in the D3 vis by:

  • reading historical outage data from a database of scraped readings I’ve been keeping
  • filling out the time series
  • reducing it to an hourly time series, and
  • trimming the data set to the last 30 days of records:

#!/usr/bin/Rscript
# running in a cron job so no spurious text output pls
options(warn=-1)
options(show.error.messages=FALSE)
 
suppressMessages(library(methods))
suppressMessages(library(zoo))
library(chron)
library(xts)
library(reshape2)
library(DBI)
library(RMySQL)
 
m <- dbDriver("MySQL");
con <- dbConnect(m, user='DBUSER', password='DBPASSWORD', host='localhost', dbname='DBNAME');
res <- dbSendQuery(con, "SELECT * FROM outage") # cld just pull the 2 fields I need
outages <- fetch(res, n = -1)
outages$ts <- as.POSIXct(gsub("\\:[0-9]+\\..*$","", outages$ts), format="%Y-%m-%d %H:%M")
 
# for each county we have data for
for (county in unique(outages$county)) {
 
  # get the 2 fields I need (shld prbly filter that in the SELECT)
  outage.raw <- outages[outages$county == county,c(1,4)]
 
  # make it a zoo object
  outage.zoo <- zoo(outage.raw$withoutpower, outage.raw$ts)
 
  # fill out the 15-minute readings
  complete.zoo <- merge(outage.zoo, zoo(, seq(start(outage.zoo), max(outages$ts), by="15 min")), all=TRUE)
  complete.zoo[is.na(complete.zoo)] <- 0
 
  # shrink to hourly and trim at 30 days
  hourly.zoo <- last(to.hourly(complete.zoo), "30 days")
 
  # crank out a CSV  
  df <- data.frame(hourly.zoo)
  df <- data.frame(ts=rownames(df), withoutPower=df$complete.zoo.High)
 
  write.csv(df, sprintf("OUTPUT_LOCATION/%s.csv",county), row.names=FALSE)
 
}

I knew there were other power companies in Maine, but CMP is the largest, so I focused my attention on getting data from it. After hearing an outage update on MPBN, I decided to see if Bangor Hydro Electric had scrape-able content, and it turns out there was a lovely (well, as lovely as XML can be) XML file delivered on the page with this “meh” Google push-pin map:

(image: Bangor Hydro Electric outage map)

The XML file is used to build the markers for the map and has marker tags that look like this:

<marker name="Exeter" outages="18" 
        lat="44.96218" lng="-69.12253" 
        streets="BUTTERS  RD;CHAMPEON  RD;GREENBUSH  RD;" 
        reflabels="87757-1;84329-1;85032-1;"/>

I’m really only tracking county-level data and BHE does not provide that, even in the huge table of street-level outages that you’ll see on that outage page. I decided to use R to aggregate the BHE data to the county level via the “point-in-polygon” method.

Getting Right To the Point

To perform the aggregation in R, I needed county-level polygons for Maine. I already had that thanks to the previous work, but I wanted to optimize the search process, so I took the US counties shapefile and used OGR from the GDAL (Geospatial Data Abstraction Library) suite to extract just the Maine county polygons:

ogr2ogr -f "ESRI Shapefile" \
        -where "STATE_NAME = 'MAINE'" maine.shp counties.shp

You can see both a reduction in file size and complexity via ogrinfo:

$ ll *shp
-rwxr-xr-x@ 1 bob  staff  1517624 Jan  8  2010 counties.shp
-rw-r--r--  1 bob  staff    12588 Dec 26 23:03 maine.shp
$ ogrinfo -sql "SELECT COUNT(*) FROM counties" counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(counties):0
  COUNT_* (Integer) = 3141
$ ogrinfo -sql "SELECT COUNT(*) FROM maine" maine.shp
INFO: Open of 'maine.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: maine
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(maine):0
  COUNT_* (Integer) = 16

The conversion reduces the file size from 1.5MB to ~12K and shrinks the number of polygons from 3,141 to 16. The counties.shp and maine.shp shapefiles were built from U.S. census data and have scads more information you might want to use (e.g., for correlation with the outage info, though storms are the prime cause of the outages :-):

$ ogrinfo -al -so counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 3141
Extent: (-176.806138, 18.921786) - (-66.969271, 71.406235)
Layer SRS WKT:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137.0,298.257223563]],
    PRIMEM["Greenwich",0.0],
    UNIT["Degree",0.0174532925199433]]
NAME: String (32.0)
STATE_NAME: String (25.0)
STATE_FIPS: String (2.0)
CNTY_FIPS: String (3.0)
FIPS: String (5.0)
POP2000: Integer (9.0)
POP2007: Integer (9.0)
POP00_SQMI: Real (10.1)
POP07_SQMI: Real (7.1)
WHITE: Integer (9.0)
BLACK: Integer (9.0)
AMERI_ES: Integer (9.0)
ASIAN: Integer (9.0)
HAWN_PI: Integer (9.0)
OTHER: Integer (9.0)
MULT_RACE: Integer (9.0)
HISPANIC: Integer (9.0)
MALES: Integer (9.0)
FEMALES: Integer (9.0)
AGE_UNDER5: Integer (9.0)
AGE_5_17: Integer (9.0)
AGE_18_21: Integer (9.0)
AGE_22_29: Integer (9.0)
AGE_30_39: Integer (9.0)
AGE_40_49: Integer (9.0)
AGE_50_64: Integer (9.0)
AGE_65_UP: Integer (9.0)
MED_AGE: Real (9.1)
MED_AGE_M: Real (9.1)
MED_AGE_F: Real (9.1)
HOUSEHOLDS: Integer (9.0)
AVE_HH_SZ: Real (9.2)
HSEHLD_1_M: Integer (9.0)
HSEHLD_1_F: Integer (9.0)
MARHH_CHD: Integer (9.0)
MARHH_NO_C: Integer (9.0)
MHH_CHILD: Integer (9.0)
FHH_CHILD: Integer (9.0)
FAMILIES: Integer (9.0)
AVE_FAM_SZ: Real (9.2)
HSE_UNITS: Integer (9.0)
VACANT: Integer (9.0)
OWNER_OCC: Integer (9.0)
RENTER_OCC: Integer (9.0)
NO_FARMS97: Real (11.0)
AVG_SIZE97: Real (11.0)
CROP_ACR97: Real (11.0)
AVG_SALE97: Real (7.2)
SQMI: Real (8.1)
OID: Integer (9.0)

With the new shapefile in hand, the basic workflow to get BHE outages at the county level is:

  • Read and parse the BHE outages XML file to get the lat/long pairs
  • Build a SpatialPoints object out of those pairs
  • Make a SpatialPolygonsDataFrame out of the reduced Maine counties shapefile
  • Overlay the points in the polygons and get the feature metadata intersection (including county)
  • Aggregate the outage data

The R code (below) does all that and is liberally commented. One has to appreciate how succinct the XML parsing is and the beautiful simplicity of the over() function (which does all the really hard work).

library(XML)
library(maptools)
library(sp)
library(plyr)
 
# Small script to get county-level outage info from Bangor Hydro
# Electric's town(-ish) level info
#
# BHE's outage google push-pin map is at
#   http://apps.bhe.com/about/outages/outage_map.cfm
 
# read BHE outage XML file that was intended for the google map
# yep. One. Line. #takethatpython
 
doc <- xmlTreeParse("http://apps.bhe.com/about/outages/outage_map.xml", 
                    useInternalNodes=TRUE)
 
# xmlToDataFrame() coughed up blood on that simple file, so we have to
# resort to menial labor to bend the XML to our will
 
doc.ls <- xmlToList(doc)
doc.attrs <- doc.ls$.attrs
doc.ls$.attrs <- NULL
 
# this does the data frame conversion magic, tho it winces a bit
 
suppressWarnings(doc.df <- data.frame(do.call(rbind, doc.ls), 
                                      stringsAsFactors=FALSE))
 
# need numbers for some of the columns (vs strings)
 
doc.df$outages <- as.numeric(doc.df$outages)
doc.df$lat <- as.numeric(doc.df$lat)
doc.df$lng <- as.numeric(doc.df$lng)
 
# SpatialPoints likes matrices, note that it's in LON, LAT order
# that always messes me up for some reason
 
doc.m <- as.matrix(doc.df[,c(4,3)])
doc.pts <- SpatialPoints(doc.m)
 
# I trimmed down the country-wide counties file from
#   http://www.baruch.cuny.edu/geoportal/data/esri/usa/census/counties.zip
# with
#   ogr2ogr -f "ESRI Shapefile" -where "STATE_NAME = 'MAINE'" maine.shp counties.shp
# to both save load time and reduce the number of iterations for over() later
 
counties <- readShapePoly("maine.shp", repair=TRUE, IDvar="NAME")
 
# So, all the above was pretty much just for this next line, which does
# the "is this point 'a' in polygon 'b'?" check automagically for us.
 
found.pts <- over(doc.pts, counties)
 
# steal the column we need (county name) and squirrel it away with outage count
 
doc.df$county <- found.pts$NAME
doc.sub <- doc.df[,c(2,7)]
 
# aggregate the result to get outage count by county
 
count(doc.sub, c("county"), wt_var="outages")
 
##      county freq
##1    Hancock 4440
##2  Penobscot  869
##3      Waldo   28
##4 Washington  545
##5       <NA>  328

Astute readers will notice unresolved points (the NAs). I suspect they are right on coastal boundaries that were probably missed in these simplified county polygons. We can see what they are by looking at the NA entries in the merged data frame:

doc.df[is.na(doc.df$county),c(1:4)]
           name outages      lat       lng
35    Deer Isle       1 44.22451 -68.67778
38   Harborside     166 44.34900 -68.81555
39     Sorrento      43 44.47341 -68.17723
62    Bucksport      71 44.57369 -68.79562
70    Penobscot      40 44.44552 -68.81780
78      Bernard       1 44.24119 -68.35585
79   Stonington       5 44.15619 -68.66669
80 Roque Bluffs       1 44.61286 -67.47971

But a map would be more useful for those not familiar with Maine geography/extents:

(image: map of the unresolved outage points)

library(ggplot2)
 
ff = fortify(counties, region = "NAME")
 
missing <- doc.df[is.na(doc.df$county),]
 
gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_path(aes(group = group), size=0.15, fill="black")
gg <- gg + geom_point(data=missing, aes(x=lng, y=lat), 
                      color="#feb24c", size=3)
gg <- gg + coord_map(xlim=extendrange(range(missing$lng)), ylim=extendrange(range(missing$lat)))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

The “zoom in” is done by taking the range of the extracted points and extending it slightly via range() and extendrange(), reproduced below:

range(missing$lng)
[1] -68.81780 -67.47971
range(missing$lat)
[1] 44.15619 44.61286
 
extendrange(range(missing$lng))
[1] -68.88470 -67.41281
extendrange(range(missing$lat))
[1] 44.13336 44.63569

It turns out my suspicion was right, so to use this in “production” I’ll need a more accurate shapefile for Maine counties (which I have, but Descent is calling me, so it will have to wait for another day).

I’ll leave you with a non-Google push-pin map of outages that you can build upon (it needs some tweaking):

(image: statewide map of outage points)

gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_polygon(aes(group = group), size=0.15, fill="black", color="#7f7f7f")
gg <- gg + geom_point(data=doc.df, aes(x=lng, y=lat, alpha=outages, size=outages), 
                      color="#feb24c")
gg <- gg + coord_map(xlim=c(-71.5,-66.75), ylim=c(43,47.5))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

You can find all the R code in one, compact gist.

One More Time…Mapping Maine Power Outages with D3

(image: Central Maine Power live outage map)

It started with a local R version and migrated to a Shiny version and is now in full D3 glory.

Some down time gave me the opportunity to start a basic D3 version of the outage map, but it needs a bit of work as it relies on a page meta refresh to update (every 5 minutes) vs an inline element dynamic refresh. The fam was getting a bit irked at coding time on Thanksgiving, so keep watching the gists for updates after the holiday.

Reverse IP Address Lookups With R (From Simple To Bulk/Asynchronous)

R lacks some of the more “utilitarian” features found in other scripting languages that were (or are) geared, at least initially, towards systems administration. One of the most frustrating gaps for security data scientists is the inability to perform basic IP address manipulations, including reverse DNS resolution (even though R has nsl(), which is just glue to gethostbyname()!).

If you need to perform reverse resolution, the only two viable options are to (a) pre-resolve a list of IP addresses or (b) whip up something in R that takes advantage of its ability to perform system calls. Yes, one could write a C/C++ API library that accesses native resolver routines, but that becomes a pain to maintain across platforms. System calls also create some cross-platform issues, but they are usually easier for the typical R user to overcome.

Assuming the dig command is available on your Linux, BSD or Mac OS system, it’s pretty trivial to pass a list of IP addresses to a simple sapply() one-liner:

resolved = sapply(ips, function(x) system(sprintf("dig -x %s +short",x), intern=TRUE))

That works for fairly small lists of addresses, but doesn’t scale well to hundreds or thousands of addresses. (Also, @jayjacobs kinda hates my one-liners #true.)

A better way is to generate a batch query to dig, but the results will be synchronous, which could take A Very Long Time depending on the size of the list and types of results.
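
A minimal sketch of that batch approach (assuming dig is on the PATH and using its -f option to read queries from a file; still synchronous, as noted):

batch.dig <- function(ips) {
  # write one reverse query per line in dig's batch-file format
  query.file <- tempfile()
  writeLines(sprintf("-x %s +short", ips), query.file)
  # run the whole batch in a single dig invocation
  out <- system(sprintf("dig -f %s", query.file), intern=TRUE)
  unlink(query.file)
  out
}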

The best way (IMO) to tackle this problem is to perform an asynchronous batch query and post-process the results, which we can do with a little help from adns (which homebrew users can install with a quick “brew install adns“).

Once adns is installed, it’s just a matter of writing out a query list, performing the asynchronous batch lookup, parsing the results and re-integrating with the original IP list (which is necessary since errant or unresponsive reverse queries will not be returned by the adns system call).

library(plyr) # laply() is used in ip.to.host() below
 
#pretend this is A Very Long List of IPs
ip.list = c("1.1.1.1", "2.3.4.99", "1.1.1.2", "2.3.4.100", "70.196.7.32", 
  "146.160.21.171", "146.160.21.172", "146.160.21.186", "2.3.4.101", 
  "216.167.176.93", "1.1.1.3", "2.3.4.5", "2.3.4.88", "2.3.4.9", 
  "98.208.205.1", "24.34.218.80", "159.127.124.209", "70.196.198.151", 
  "70.192.72.48", "173.192.34.24", "65.214.243.208", "173.45.242.179", 
  "184.106.97.102", "198.61.171.18", "71.184.118.37", "70.215.200.159", 
  "184.107.87.105", "174.121.93.90", "172.17.96.139", "108.59.250.112", 
  "24.63.14.4")
 
# "ips" is a list of IP addresses
ip.to.host <- function(ips) {
  # save out a list of IP addresses in adnshost reverse query format
  # if you're going to be using this in "production", you *might*
  # want to consider using tempfile() #justsayin
  writeLines(laply(ips, function(x) sprintf("-i%s",x)),"/tmp/ips.in")
  # call adnshost with the file
  # requires adnshost :: http://www.chiark.greenend.org.uk/~ian/adns/
  system.output <- system("cat /tmp/ips.in | adnshost -f",intern=TRUE)
  # keep our file system tidy
  unlink("/tmp/ips.in")
  # clean up the result
  cleaned.result <- gsub("\\.in-addr\\.arpa","",system.output)
  # split the reply
  split.result <- strsplit(cleaned.result," PTR ")
  # make a data frame of the reply
  result.df <- data.frame(do.call(rbind, lapply(split.result, rbind)))
  colnames(result.df) <- c("IP","hostname")
  # reverse the octets in the IP address list
  result.df$IP <- sapply(as.character(result.df$IP), function(x) {
    y <- unlist(strsplit(x,"\\."))
    sprintf("%s.%s.%s.%s",y[4],y[3],y[2],y[1])
  })
  # fill errant lookups with "NA"
  final.result <- merge(ips,result.df,by.x="x",by.y="IP",all.x=TRUE)
  colnames(final.result) = c("IP","hostname")
  return(final.result)
}
 
resolved.df <- ip.to.host(ip.list)
head(resolved.df,n=10)
 
                IP                                   hostname
1          1.1.1.1                                       <NA>
2          1.1.1.2                                       <NA>
3          1.1.1.3                                       <NA>
4   108.59.250.112      vps-1068142-5314.manage.myhosting.com
5   146.160.21.171                                       <NA>
6   146.160.21.172                                       <NA>
7   146.160.21.186                                       <NA>
8  159.127.124.209                                       <NA>
9    172.17.96.139                                       <NA>
10   173.192.34.24 173.192.34.24-static.reverse.softlayer.com

If you wish to suppress adns error messages and any resultant R warnings, you can add an “ignore.stderr=TRUE” to the system() call and an “options(warn=-1)” to the function itself (remember to get/reset the current value). I kinda like leaving them in, though, as it shows progress is being made.
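
Inside ip.to.host(), that suppression might look something like the following sketch (it saves the current warning level and restores it when the function exits):

# replace the system() call in ip.to.host() with something like:
old.warn <- getOption("warn")
options(warn=-1)
on.exit(options(warn=old.warn), add=TRUE)
system.output <- system("cat /tmp/ips.in | adnshost -f",
                        intern=TRUE, ignore.stderr=TRUE)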

Whether you end up using a one-liner or the asynchronous function, it would be a spiffy idea to set up a local caching server, such as Unbound, to speed up subsequent queries (because you will undoubtedly have subsequent queries unless your R scripts are perfect on the first go-round).
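
If you do stand up a local resolver, the one-liner version only needs dig pointed at it; a sketch, assuming Unbound is listening on 127.0.0.1:

resolved <- sapply(ips, function(x) {
  # "@127.0.0.1" tells dig to use the local caching resolver
  system(sprintf("dig @127.0.0.1 -x %s +short", x), intern=TRUE)
})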

If you’ve solved the “efficient reverse DNS query problem” a different way in R, drop a note in the comments! I know quite a few folks who’d love to buy you a tasty beverage!

You can find similar, handy IP address and other security-oriented R code in our (me & @jayjacobs’) upcoming book on security data analysis and visualization.

My Picks From #PyCon2013

While you can (and should) view all the presentations from #PyCon2013, here are my picks for the ones that interested me the most, as they focus on scaling, mapping, automation (both web & electronics) and data analysis.

A huge thanks to the speakers and conference organizers for making these resources freely available, especially to those of us who were not able to attend the conference.

Retrieve IP ASN & BGP Peer Info With R

This is part of a larger project I’m working on, but it’s useful enough to share (github version coming soon).

The fine folks at @TeamCymru have a great service to map IP addresses to ASN/BGP information en masse.

There are libraries for Python, Perl and other languages but none for R (that I could find). So, I threw together a quick set of functions to interface to @TeamCymru’s service. Unlike many other modern services, this one isn’t XML or JSON over a RESTful interface, so the code uses a socketConnection() over the standard WHOIS TCP port to post and retrieve simple text lists.

#
# bulkorigin.R - perform bulk IP to ASN mapping via Team Cymru whois service
#
# Author: @hrbrmstr
# Version: 0.1
# Date: 2013-02-07
#
# Copyright 2013 Bob Rudis
# 
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#   
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 
library(plyr)
 
# short function to trim leading/trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
 
BulkOrigin <- function(ip.list,host="v4.whois.cymru.com",port=43) {
 
  # Retrieves BGP Origin ASN info for a list of IP addresses
  #
  # NOTE: IPv4 version
  #
  # NOTE: Team Cymru's service is NOT a GeoIP service!
  # Do not use this function for that as your results will not
  # be accurate.
  #
  # Args:
  #   ip.list : character vector of IP addresses
  #   host: which server to hit for lookup (defaults to Team Cymru's server)
  #   port: TCP port to use (defaults to 43)
  #
  # Returns:
  #   data frame of BGP Origin ASN lookup results
 
 
  # setup query
  cmd = "begin\nverbose\n" 
  ips = paste(unlist(ip.list), collapse="\n")
  cmd = sprintf("%s%s\nend\n",cmd,ips)
 
  # setup connection and post query
  con = socketConnection(host=host,port=port,blocking=TRUE,open="r+")  
  cat(cmd,file=con)
  response = readLines(con)
  close(con)
 
  # trim header, split fields and convert results
  response = response[2:length(response)]
  response = lapply(response,function(n) {
    sapply(strsplit(n,"|",fixed=TRUE),trim)
  })  
  response = adply(response,c(1))
  response = response[,2:length(response)]
  names(response) = c("AS","IP","BGP.Prefix","CC","Registry","Allocated","AS.Name")
 
  return(response)
 
}
 
BulkPeer <- function(ip.list,host="v4-peer.whois.cymru.com",port=43) {
 
  # Retrieves BGP Peer ASN info for a list of IP addresses
  #
  # NOTE: IPv4 version
  #
  # NOTE: Team Cymru's service is NOT a GeoIP service!
  # Do not use this function for that as your results will not
  # be accurate.
  #
  # Args:
  #   ip.list : character vector of IP addresses
  #   host: which server to hit for lookup (defaults to Team Cymru's server)
  #   port: TCP port to use (defaults to 43)
  #
  # Returns:
  #   data frame of BGP Peer ASN lookup results
 
 
  # setup query
  cmd = "begin\nverbose\n" 
  ips = paste(unlist(ip.list), collapse="\n")
  cmd = sprintf("%s%s\nend\n",cmd,ips)
 
  # setup connection and post query
  con = socketConnection(host=host,port=port,blocking=TRUE,open="r+")  
  cat(cmd,file=con)
  response = readLines(con)
  close(con)
 
  # trim header, split fields and convert results
  response = response[2:length(response)]
  response = lapply(response,function(n) {
    sapply(strsplit(n,"|",fixed=TRUE),trim)
  })  
  response = adply(response,c(1))
  response = response[,2:length(response)]
  names(response) = c("Peer.AS","IP","BGP.Prefix","CC","Registry","Allocated","Peer.AS.Name")
  return(response)
 
}

Take a list of IPs, make an IP connection, formulate a bulk query and convert the results. Here’s a small script to test it:

ips = c("100.43.81.11","100.43.81.7")
origin = BulkOrigin(ips)
str(origin)
peers = BulkPeer(ips)
str(peers)

That code outputs:

'data.frame':	2 obs. of  7 variables:
 $ AS        : chr  "13238" "13238"
 $ IP        : chr  "100.43.81.11" "100.43.81.7"
 $ BGP.Prefix: chr  "100.43.64.0/19" "100.43.64.0/19"
 $ CC        : chr  "US" "US"
 $ Registry  : chr  "arin" "arin"
 $ Allocated : chr  "2011-12-06" "2011-12-06"
 $ AS.Name   : chr  "YANDEX Yandex LLC" "YANDEX Yandex LLC"

and

'data.frame':	8 obs. of  7 variables:
 $ Peer.AS     : chr  "174" "3257" "9002" "10310" ...
 $ IP          : chr  "100.43.81.11" "100.43.81.11" "100.43.81.11" "100.43.81.11" ...
 $ BGP.Prefix  : chr  "100.43.64.0/19" "100.43.64.0/19" "100.43.64.0/19" "100.43.64.0/19" ...
 $ CC          : chr  "US" "US" "US" "US" ...
 $ Registry    : chr  "arin" "arin" "arin" "arin" ...
 $ Allocated   : chr  "2011-12-06" "2011-12-06" "2011-12-06" "2011-12-06" ...
 $ Peer.AS.Name: chr  "COGENT Cogent/PSI" "TINET-BACKBONE Tinet SpA" "RETN-AS ReTN.net Autonomous System" "YAHOO-1 - Yahoo!" ...

respectively for each str().

Nothing super-sexy, but it’s part of a mission I’m on to make IP addresses “first class citizens” in R. I’m starting with building some smaller functions that accumulate IP metadata and will ultimately collect them all into a compact R library.

In the interim, I thought these two routines might be useful to some folks.

With just these two functions, you can use various graphing libraries to get a picture of the network connectivity. Here’s a small sample to get you started:

library(igraph)
 
ips = c("100.43.81.11")
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
g = graph.empty() + vertices(c(ips, peers$Peer.AS, origin$AS),size=30)
V(g)$label = c(ips, peers$Peer.AS, origin$AS)
e = lapply(peers$Peer.AS,function(x) {
  c(origin$AS,x)
})
g = g + edges(unlist(e))
g = g + edge(ips, origin$AS)
g$layout = layout.circle
plot(g)

(image: igraph plot of the origin ASN and its BGP peers)

If you know of any other R libraries or code that provide functions that operate on IP addresses or interface to services that provide IP address metadata, please drop a note in the comments or ping me on Twitter.

SHODAN API in R (With Examples)

Folks may debate the merits of the SHODAN tool, but in my opinion it’s a valuable resource, especially if used for “good”. What is SHODAN? I think ThreatPost summed it up nicely:

“Shodan is a Web based search engine that discovers Internet facing computers, including desktops, servers and routers. The engine, created by programmer John Matherly, allows users to filter searches for systems running a specific type of application (say, Apache Web servers or FTP) and filter results by geographic region. The search engine indexes host ’banners,’ which include meta-data sent between a server and client and includes information such as the type of software run, what services are available and so on.”

I’m in R quite a bit these days and thought it would be useful to have access to the SHODAN API in R. I have a very rudimentary version of the API (search only) up on github which can be integrated into your R environment thus:

library(devtools)
install_github("Rshodan","hrbrmstr")
library(shodan)
help(shodan) # you don't really need to do this cmd

It’ll eventually be in CRAN, but I have some cleanup work to do before the maintainers will accept the submission. If you are new to R, there are a slew of dependencies you’ll need to add to the base R installation. Here’s a good description of how to do that on pretty much every platform.

After I tweeted the above reference, @shawnmer asked the following:

https://twitter.com/shawnmer/status/290904140782137344

That is not an unreasonable request, especially if one is new to R (or SHODAN). I had been working on this post and a more extended example, and was finally able to get enough code done to warrant publishing it. You can do far more in R than these simple charts & graphs. Imagine taking data from multiple searches (either across time or across ports) and doing a statistical comparison. Or, use some of the image processing & recognition libraries within R, along with a package such as RCurl, to fetch images from open webcams and attempt to identify people or objects. The following should be enough for most folks to get started.
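
As a taste of that last idea, here’s a hypothetical snippet for grabbing a single webcam frame with RCurl; the URL is a placeholder, and in practice you’d build it from fields in a SHODAN result:

library(RCurl)
 
# placeholder address: substitute a host/path from an actual search result
img.url <- "http://198.51.100.23/snapshot.jpg"
raw.img <- getBinaryURL(img.url)   # fetch the raw image bytes
writeBin(raw.img, "snapshot.jpg")  # save for an image-processing package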

You can cut/paste the source code here or download the whole source file.

The fundamental shortcut this library provides over just trying to code it yourself is taking the JSON response from SHODAN and turning it into an R data frame. That is not as trivial as you might think, and you may want to look at the source code for the library to see where I grabbed some of that code from. I’m also not 100% convinced it’s going to work under all circumstances (hence part of the 0.1 status).
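
To see why, here’s a generic sketch of the flattening problem (not the package’s internal code): SHODAN result records don’t all carry the same fields, so a naive rbind() falls over and something like plyr’s rbind.fill() is needed to pad the gaps:

library(RJSONIO)
library(plyr)
 
# toy stand-in for a SHODAN response: two records with different fields
json.txt <- '{"matches":[{"ip":"192.0.2.1","port":80,"os":"Linux 2.6.x"},
                         {"ip":"192.0.2.2","port":443}]}'
resp <- fromJSON(json.txt)
 
# one-row data frame per record; rbind.fill() pads missing columns with NA
df <- rbind.fill(lapply(resp$matches, function(m) {
  as.data.frame(t(unlist(m)), stringsAsFactors=FALSE)
}))
df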

library(shodan)
library(ggplot2)
library(xtable)
library(maps)
library(rworldmap)
library(ggthemes)
 
 
# if you're behind a proxy, setting this will help
# but it's strongly suggested/encouraged that you stick the values in a file and 
# read them in vs paste them in a script
# options(RCurlOptions = list(proxy="proxyhost", proxyuserpwd="user:pass"))
 
setSHODANKey("~/.shodankey")
 
# query example taken from Michael “theprez98” Schearer's DEFCON 18 presentation
# https://www.defcon.org/images/defcon-18/dc-18-presentations/Schearer/DEFCON-18-Schearer-SHODAN.pdf
 
# find all Cisco IOS devices that may have an unauthenticated admin login
# setting trace to be TRUE to see the progress of the query
result = SHODANQuery(query="cisco last-modified www-authenticate",trace=TRUE)
 
#find the first 100 found memcached instances
#result = SHODANQuery(query='port:11211',limit=100,trace=TRUE)
 
df = result$matches
 
# aggregate result by operating system
# you can use this one if you want to filter out NA's completely
#df.summary.by.os = ddply(df, .(os), summarise, N=sum(as.numeric(factor(os))))
#this one provides count of NA's (i.e. unidentified systems)
df.summary.by.os = ddply(df, .(os), summarise, N=length(os))
 
# sort & see the results in a text table
df.summary.by.os = transform(df.summary.by.os, os = reorder(os, -N))
df.summary.by.os

That will yield:

                os   N
1      Linux 2.4.x  60
2      Linux 2.6.x   6
3 Linux recent 2.4   2
4     Windows 2000   2
5   Windows 7 or 8  10
6       Windows XP   8
7             <NA> 112


You can plot it with:

# plot a bar chart of them
(ggplot(df.summary.by.os,aes(x=os,y=N,fill=os)) + 
   geom_bar(stat="identity") + 
   theme_few() +
   labs(y="Count",title="SHODAN Search Results by OS"))

to yield:

(image: bar chart of SHODAN search results by OS)

and:

world = map_data("world")
(ggplot() +
   geom_polygon(data=world, aes(x=long, y=lat, group=group)) +
   geom_point(data=df, aes(x=longitude, y=latitude), colour="#EE760033",size=1.75) +
   labs(x="",y="") +
   theme_few())

(image: SHODAN search results plotted on a world map)

You can easily do the same by country:

# sort & view the results by country
# see above if you don't want to filter out NA's
df.summary.by.country_code = ddply(df, .(country_code, country_name), summarise, N=sum(!is.na(country_code)))
df.summary.by.country_code = transform(df.summary.by.country_code, country_code = reorder(country_code, -N))
 
df.summary.by.country_code
##    country_code              country_name  N
## 1            AR                 Argentina  2
## 2            AT                   Austria  2
## 3            AU                 Australia  2
## 4            BE                   Belgium  2
## 5            BN         Brunei Darussalam  2
## 6            BR                    Brazil 14
## 7            CA                    Canada 16
## 8            CN                     China  6
## 9            CO                  Colombia  4
## 10           CZ            Czech Republic  2
## 11           DE                   Germany 12
## 12           EE                   Estonia  4
## 13           ES                     Spain  4
## 14           FR                    France 10
## 15           HK                 Hong Kong  2
## 16           HU                   Hungary  2
## 17           IN                     India 10
## 18           IR Iran, Islamic Republic of  4
## 19           IT                     Italy  4
## 20           LV                    Latvia  4
## 21           MX                    Mexico  2
## 22           PK                  Pakistan  4
## 23           PL                    Poland 16
## 24           RU        Russian Federation 14
## 25           SG                 Singapore  2
## 26           SK                  Slovakia  2
## 27           TW                    Taiwan  6
## 28           UA                   Ukraine  2
## 29           US             United States 28
## 30           VE                 Venezuela  2
## 31         <NA>                      <NA>  0


(ggplot(df.summary.by.country_code,aes(x=country_code,y=N)) + 
  geom_bar(stat="identity") +
  theme_few() +
  labs(y="Count",x="Country",title="SHODAN Search Results by Country"))

(image: bar chart of SHODAN search results by country)

And, easily generate the must-have choropleth:

# except make a choropleth
# using the very simple rworldmap process
shodanChoropleth = joinCountryData2Map( df.summary.by.country_code, joinCode = "ISO2", nameJoinColumn = "country_code")
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(shodanChoropleth, nameColumnToPlot="N",colourPalette="terrain",catMethod="fixedWidth")

(image: choropleth of SHODAN search results by country)

Again, producing pretty pictures is all well and good, but it’s best to start with some good questions you need answered to make any visualization worthwhile. In the coming weeks, I’ll do some posts that show what types of questions you may want to ask/answer with R & SHODAN.

I encourage folks that have issues, concerns or requests to use github vs post in the comments, but I’ll try to respond to either as quickly as possible.

Easier HTML Table-scraping For Scripts With Google Drive

We had our first, real snowfall of the season in Maine today, and that usually means school delays/closings. Our “local” station – @WCSH6 – has a Storm Center Closings page as well as an SMS notification service. I decided this morning that I needed a command line version (and, eventually, a version that sends me a Twitter DM), but I was also tight for time (a lunchtime meeting ending early is responsible for this blog post).

While I’ve consumed my share of Beautiful Soup and can throw down some mechanize with the best of them, it occurred to me that there might be an even easier way, one that may also help avoid the eventual blocking that comes with scraping a site directly.

I set up a Google Drive spreadsheet to use the importHTML formula to read in the closings table on the page:

=importHTML("http://www.wcsh6.com/weather/severe_weather/cancellations_closings/default.aspx","table",0)

Then I did a File→Publish to the web and set up Sheet 1 to “Automatically republish when changes are made” and also to have the link be to the CSV version of the data:

(image: screenshot of the Google Drive publish-to-web settings)

The raw output looks a bit like:

Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
,,
American Legion Post 155 - Naples,Closed,12/17/2012 12:49:00

The conversion has some “blank” lines but that’s easy enough to filter out with some quick bash:

curl --silent "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv" | grep -v "^,,"

And, looking for the specific school(s) of our kids is an easy grep as well.
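
The same thing works in R if you’d rather stay in one language; a sketch using the published-CSV link from the curl example (the school name is just a stand-in, and it assumes an R build that can read https URLs directly; otherwise fetch the text with RCurl first):

closings <- read.csv("https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv",
                     stringsAsFactors=FALSE)
closings <- closings[closings$Name != "",]  # drop the "blank" rows
# hypothetical school name: swap in the one(s) you actually care about
closings[grepl("Windham", closings$Name, ignore.case=TRUE),]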

The reason this is interesting is that importHTML is dynamic and will re-convert the HTML table each time the code retrieves the CSV URL. Couple that with the fact that Google is far less likely to be blocked than my IP address(es), and this seems to be a pretty nice alternative to traditional parsing.

If I get some time over the break, I’ll do a quick benchmark of using this method over some python and perl scraping/parsing methods.
