
Category Archives: R

Naomi Robbins is running a graph makeover challenge over at her Forbes blog and this is my entry for the B2B/B2C Traffic Sources one (click for larger version):

And, here’s the R source for how to generate it:

library(ggplot2)
 
df = read.csv("b2bb2c.csv")
 
ggplot(data=df,aes(x=Site,y=Percentage,fill=Site)) +
  geom_bar(stat="identity") +
  facet_grid(Venue ~ .) +
  coord_flip() +
  # opts() has since been removed from ggplot2; labs() + theme() do the same job
  labs(title="Social Traffic Sources for B2B & B2C Companies") +
  theme(legend.position = "none") +
  # label each bar with its percentage just past the end of the bar
  # (the fixed aesthetics belong outside aes(); stat_bin can't be used here)
  geom_text(aes(label=sprintf("%d%%",Percentage)), vjust=0, hjust=-0.2, size=10)

And, here’s the data:

Site     Venue  Percentage
Facebook B2B    72
LinkedIn B2B    16
Twitter  B2B    12
Facebook B2C    84
LinkedIn B2C    1
Twitter  B2C    15

I chose to go with a latticed bar chart as I think it helps show the relative portions within each category (B2B/B2C) and also enables quick comparisons across categories for all three factors.
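
For comparison, here's a minimal sketch (not part of the original makeover entry) of the non-faceted alternative, with B2B and B2C dodged side by side for each site; it assumes the same df and library loaded above:

# hedged sketch: grouped (dodged) bars instead of facets, for comparison
ggplot(data=df, aes(x=Site, y=Percentage, fill=Venue)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  labs(title="Social Traffic Sources for B2B & B2C Companies")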

UPDATE: As indicated in the code comments, Google took down the cone KML files. I’ll be changing the code to use the NHC archived cone files later tonight.

I will (most likely) not be littering the blog with any more updates to the ‘Sandy’ code unless they are really significant. You can follow along at home with any changes over at github.

Of note (in terms of changes) is the addition of the 5-day forecast cone and some annotations:

As indicated in the code comments, Google took down the cone KML files. I’ll be changing the code to use the NHC archived cone files later tonight.

NOTE: There is significantly updated code on github for the Sandy ‘R’ dataviz.

This is a follow-up post to the quickly crafted Watch Sandy in “R” post last night. I noticed that Google provided the KML on their crisis map and wanted to show how easy it is to add it to previous code.

I added comments inline to make it easier to follow along.

library(maps)
library(maptools)
library(rgdal)
 
# helper to trim the trailing whitespace that the paste() calls below leave behind
trim.trailing <- function (x) sub("\\s+$", "", x)
 
# get track data from Unisys
wx = read.table(file="http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat", skip=3,fill=TRUE)
 
# join last two columns (type of storm) that read.table didn't parse as one
# and give the data frame real column names
wx$V7 = trim.trailing(paste(wx$V7,wx$V8," "))
wx$V8 = NULL
colnames(wx) = c("Advisory","Latitude","Longitude","Time","WindSpeed","Pressure","Status")
 
# annotate the name with forecast (+12/+24/etc) projections
wx$Advisory = unlist(strsplit(toString(wx$Advisory),", "))
wx$a = ""
wx$a[grep("\\+",wx$Advisory)] = wx$Advisory[grep("\\+",wx$Advisory)]
wx$Status = trim.trailing(paste(wx$Status,wx$a," "))
 
# cheap way to make past plots one color and forecast plots another
wx$color = "red"
wx$color[grep("\\+",wx$Advisory)] = "orange"
 
# only want part of the map
xlim=c(-90,-60)
 
# plot the map itself
map("state", interior = FALSE, xlim=xlim)
map("state", boundary = FALSE, col="gray", add = TRUE,xlim=xlim)
 
# plot the track with triangles and colors
lines(x=wx$Longitude,y=wx$Latitude,col="black",cex=0.75)
points(x=wx$Longitude,y=wx$Latitude,col=wx$color,pch=17,cex=0.75)
 
# annotate it with the current & projected strength status + forecast
text(x=wx$Longitude,y=wx$Latitude,col='blue',labels=wx$Status,adj=c(-.15),cex=0.33)
 
# get the forecast "cone" from the KML that google's crisis map provides
# NOTE: you need curl & unzip (with funzip) on your system
# NOTE: Google didn't uniquely name the cone they provide, so this will break
#       post-Sandy. I suggest saving the last KML & track files out if you
#       want to save this for posterity
 
# this gets the cone and makes it into a string that getKMLcoordinates can process
coneKML = paste(unlist(system("curl -s -o - http://mw1.google.com/mw-weather/maps_hurricanes/three_day_cone.kmz | funzip", intern = TRUE)),collapse="")
 
# process the coneKML and make a data frame out of it
coords = getKMLcoordinates(textConnection(coneKML))
coords = data.frame(SpatialPoints(coords, CRS('+proj=longlat')))
 
# this plots the cone and leaves the track visible
polygon(coords$X1,coords$X2, pch=16, border="red", col='#FF000033',cex=0.25)
 
# no one puts Sandy in a box (well, except us)
box()

UPDATE:

  • Significantly updated code on github
  • Well, a couple folks asked how to make it more “centered” on the hurricane and stop the labels from chopping off, so I modified the previous code a bit to show how to do that (one way to approach it is sketched just below).
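
Here's a minimal sketch of one way to approach that centering (the actual change is in the github code): derive the plot window from the wx track data frame built in the code above instead of hard-coding xlim, and pad the right edge so the status labels aren't clipped.

# hedged sketch: compute the map window from the track itself, with extra
# room on the right so the text() labels don't get chopped off at the edge
xlim = range(wx$Longitude) + c(-2, 6)
ylim = range(wx$Latitude) + c(-2, 2)
map("state", interior = FALSE, xlim = xlim, ylim = ylim)
map("state", boundary = FALSE, col = "gray", add = TRUE, xlim = xlim, ylim = ylim)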

As indicated in the code comments, Google took down the cone KML files. I’ll be changing the code to use the NHC archived cone files later tonight.

While this will pale in comparison to the #spiffy Sandy graphics on Weather Underground and other sites (including Google’s “Crisis Center” maps), I thought it might be interesting to show folks how they can keep an eye on “Sandy” using R.

library(maps)
 
trim.trailing <- function (x) sub("\\s+$", "", x)
 
wx = read.table(file="http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat", skip=3,fill=TRUE)
wx$V7 = trim.trailing(paste(wx$V7,wx$V8," "))
wx$V8 = NULL
colnames(wx) = c("Advisory","Latitude","Longitude","Time","WindSpeed","Pressure","Status")
 
wx$Advisory = unlist(strsplit(toString(wx$Advisory),", "))
wx$a = ""
wx$a[grep("\\+",wx$Advisory)] = wx$Advisory[grep("\\+",wx$Advisory)]
wx$Status = trim.trailing(paste(wx$Status,wx$a," "))
 
wx$color = "red"
wx$color[grep("\\+",wx$Advisory)] = "orange"
 
xlim=c(-90,-60)
 
map("state", interior = FALSE, xlim=xlim)
map("state", boundary = FALSE, col="gray", add = TRUE,xlim=xlim)
 
points(x=wx$Longitude,y=wx$Latitude,col=wx$color,pch=17,cex=0.75)
 
text(x=wx$Longitude,y=wx$Latitude,col='blue',labels=wx$Status,adj=c(-.15),cex=0.33)
box()

This short script reads in data from the Unisys Hurricane Weather Center and plots colored symbols on a US map with the forecast projections. I may be extending this, but it’s a good starting point for anyone who wants to play with mapping relatively live data from the internet (or learn some more R).
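
One extension worth making right away (especially since, as the cone note above shows, these advisory files tend to disappear): cache the Unisys track file locally before parsing it. A minimal sketch, assuming the same URL as above:

# hedged sketch: save a local copy of the track data for posterity, then
# parse the cached file instead of hitting the Unisys site on every run
track.url = "http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat"
track.file = "sandy-track.dat"
if (!file.exists(track.file)) download.file(track.url, track.file)
wx = read.table(file=track.file, skip=3, fill=TRUE)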

There’s a good FAQ on how to do the MongoDB query -> R data frame but I wanted to post a more complete example that included the database connection and query setup since I suspect there are folks new to Mongo who would appreciate the end-to-end view. The code is fully annotated with comments, and I’ll caveat that this was for pulling data from my solar radiation sensor (it provides some context for the query and values).

library(rmongodb)
library(chron) # NOTE: you don't need this for Mongo; it's for the sensor readings plot
 
# connect to mongodb server on host and connect to db
mongo = mongo.create(host="MONGODB_HOST",db="DATABASE_NAME")
 
if (mongo.is.connected(mongo)) {
 
  # this sets up the query (there are other mongo.bson.buffer.append… functions for other value types)
  today = format(Sys.time(), "%Y-%m-%d")
  buf = mongo.bson.buffer.create()
  mongo.bson.buffer.append.string(buf,"date",today)
  query = mongo.bson.from.buffer(buf)
 
  # run the query and get total results & the starting db cursor  
  todays.readings.count = mongo.count(mongo,"solar.readings",query)
  todays.readings.cursor = mongo.find(mongo,"solar.readings",query)
 
  # setup some vectors to hold our results  
  time = vector("character",todays.readings.count)
  lux = vector("numeric",todays.readings.count)
  full = vector("numeric",todays.readings.count)
  IR = vector("numeric",todays.readings.count)
 
  i = 1
 
  # iterate over the results with the cursor    
  while (mongo.cursor.next(todays.readings.cursor)) {
 
    # get the values of the current record
    cval = mongo.cursor.value(todays.readings.cursor)
 
    # split it out into our vectors    
    time[i] = mongo.bson.value(cval,"time")
    full[i] = mongo.bson.value(cval,"Full")
    lux[i] = mongo.bson.value(cval,"Lux")
    IR[i] = mongo.bson.value(cval,"IR")
 
    i = i + 1
 
  }
 
  # packages all our values up into a data frame  
  df = as.data.frame(list(time=time,full=full,lux=lux,IR=IR))
 
  # (for my wx data, I need 'time' as an actual time value)  
  df$Time = times(df$time)
  df$time = NULL
 
  par(mfrow=c(3,1))
  plot(df$full~df$Time,type="l",col="blue",lwd="1",xlab="",ylab="Full Spectrum",main=paste(today," Solar Radiation Readings"))
  plot(df$lux~df$Time,type="l",col="blue",lwd="1",xlab="",ylab="Lux (calculated)")
  plot(df$IR~df$Time,type="l",col="blue",lwd="1",xlab="Time",ylab="IR")
  par(mfrow=c(1,1))
 
}
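
One thing the example above skips for brevity: rmongodb won't clean up after you, so when you're done it's good practice to release the cursor and the connection. A quick sketch (it belongs inside the if block, after the loop and plotting):

# hedged sketch: release the cursor and close the connection when finished
mongo.cursor.destroy(todays.readings.cursor)
mongo.destroy(mongo)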

NOTE: A great deal of this post comes from @jayjacobs as he took a conversation we were having about thoughts on ways to look at the data and just ran like the Flash with it.

Did you know that – if you’re a US citizen – you have approximately a 1 in 5 chance of getting the flu this year? If you’re a male (no regional bias for this one), you have a 1 in 400 chance of developing Hodgkin’s Disease and a 1 in 5,000 chance of dying from testicular cancer.

Moving away from medical stats, if you’re a NJ resident, you have a 1 in 1,000 chance of winning $275 in the straight “Pick 3” lottery and a 1 in 13,983,816 chance of jackpotting the “Pick 6”.

What does this have to do with botnets? Well, we’ve determined that – if you’re a US resident – you have a 1 in 6,000 chance of getting the ZeroAccess flu (or winning the ZeroAccess lottery, whichever makes you feel better). Don’t believe me? Let’s look at the data.

For starters, we’re working with this file which is a summary file by US state that includes actual state population, the number of internet users in that state and the number of bots in that state (data is from Internet World Statistics). As an example, Maine has:

  • 1,332,155 residents
  • 1,102,933 internet users
  • 219 bot infections

(To aspiring security data scientists out there, I should point out that we’ve had to gather or crunch through much of the data we’re using on our own. While @fsecure gave us a great beginning, there’s no free data lunch.)

Where’d we get the 1 : 6000 figure? We can do some quick R math and view the histogram and summary data:

#read in the summary data
df <- read.csv("zerogeo.csv", header=T)
 
# calculate how many people for 1 bot infection per state:
df$per <- round(df$intUsers/df$bots)
 
# plot histogram of the spread
hist(df$per, breaks=10, col="#CCCCFF", freq=T, main="Internet Users per Bot Infection")
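
And the summary data mentioned above is one more line (same df as the snippet above):

# hedged sketch: the quick numeric summary to go along with the histogram
summary(df$per)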

Along with the infection rate/risk, we can also do a quick linear regression to see if there’s a correlation between the number of internet users in a state and the infection rate of that state:

# "lm" is an R function that, amongst other things, can be used for linear regression
# so we use it to perform a quick regression on how internet users describe bot infections
users <- lm(df$bots~df$intUsers)
 
# and, R makes it easy to plot that model
plot(df$intUsers, df$bots, xlab="Internet Users", ylab="Bots", pch=19, cex=0.7, col="#3333AA")
abline(users, col="#3333AA")

Apart from some outliers (more on that in another post), there is – as Jay puts it – a “very strong (statistical) relationship between the population of internet users and the infection rate in the states.” Some of you may be saying “Duh?!” right about now, but all we’ve had up until this point are dots or colors on a map. We’ve taken that superficial view (yes, it’s just really eye candy) and given it some depth and meaning.
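
If you want the numbers behind that claim rather than just the fitted line, a quick look at the model object does it (a sketch, using the users fit from above):

# hedged sketch: summary() reports the coefficients, p-values and R-squared
summary(users)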

We’re pulling some demographic data from the US Census and will be doing another data summarization at the ZIP code level to see what other aspects come into play (I’m really focused on analyzing median income by ZIP code to see if/how that describes bot presence).
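
Nothing from that analysis is ready to share yet, but the shape of it will look something like this sketch (the zipSummary file and its column names are hypothetical placeholders, not real data):

# hedged/hypothetical sketch of the planned ZIP-level look: does median
# income describe bot presence? (file and column names are placeholders)
zipSummary <- read.csv("zipSummary.csv", header=T)
income.fit <- lm(zipSummary$bots ~ zipSummary$medianIncome)
plot(zipSummary$medianIncome, zipSummary$bots, xlab="Median Income", ylab="Bots", pch=19, cex=0.7, col="#3333AA")
abline(income.fit, col="#3333AA")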

If you made it this far, I’d really like to know what you would have thought the ZeroAccess “flu” chances were before seeing that it’s 1 : 6,000 (since your guesstimate was probably based on the map views).

Finally, Jay used the summary data to work up a choropleth in R:

# setup our environment
library(ggplot2)
library(maps)
library(colorspace)
 
# read the data
zero <- read.csv("zerogeo.csv", header=T)
 
# extract state geometries from maps library
states <- map_data("state")
 
# this "cleans up the data" to make it easier to merge with the built in state data
zero.clean <- data.frame(region=tolower(zero$state), 
                         perBot=round(zero$intUsers/zero$bots),
                         intUsers=zero$intUsers)
choro <- merge(states, zero.clean, sort = FALSE, by = "region")
 
choro <- choro[order(choro$order),]
 
# "bin" the data to enable us to use a better set of colors
choro$botBreaks <- cut(choro$perBot, 10)
 
# get the plot
c1 = qplot(long, lat, data = choro, group = group, fill = botBreaks, geom = "polygon", 
      main="Population of Internet Users to One Zero Access Botnet Infection") +
        theme(axis.line=element_blank(),axis.text.x=element_blank(),
              axis.text.y=element_blank(),axis.ticks=element_blank(),
              axis.title.x=element_blank(),
              axis.title.y=element_blank(),
              panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
              panel.grid.minor=element_blank(),plot.background=element_blank())
 
# display it with modified color scheme (we hate the default ggplot2 blue)
c1 + scale_fill_brewer(palette = "Reds")

Since F-Secure was #spiffy enough to provide us with GeoIP data for mapping the scope of the ZeroAccess botnet, I thought that some aspiring infosec data scientists might want to see how to use something besides Google Maps & Google Earth to view the data.

If you look at the CSV file, it’s formatted as such (this is a small portion…the file is ~140K lines):

CL,"-34.9833","-71.2333"
PT,"38.679","-9.1569"
US,"42.4163","-70.9969"
BR,"-21.8667","-51.8333"

While that’s useful, we don’t need quotes and a header would be nice (esp for some of the tools I’ll be showing), so a quick cleanup in vi gives us:

Code,Latitude,Longitude
CL,-34.9833,-71.2333
PT,38.679,-9.1569
US,42.4163,-70.9969
BR,-21.8667,-51.8333
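
If you'd rather skip the vi step, read.csv can do the cleanup for you: it strips the quotes on its own, and col.names supplies the missing header. A minimal sketch against the raw file:

# hedged sketch: read the raw (quoted, header-less) file directly,
# supplying the column names ourselves
bots = read.csv("ZeroAccessGeoIPs.csv", header=FALSE,
                col.names=c("Code","Latitude","Longitude"))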

With just this information, we can see how much of the United States is covered in ZeroAccess with just a few lines of R:

# read in the csv file
bots = read.csv("ZeroAccessGeoIPs.csv")
 
# load the maps library
library(maps)
 
# draw the US outline in black and state boundaries in gray
map("state", interior = FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
 
# plot the latitude & longitudes with a small dot
points(x=bots$Longitude,y=bots$Latitude,col='red',cex=0.25)

Can you pwn me now? (click for larger map)

If you want to see how bad your state is, it’s just as simple. Using my state (Maine) it’s just a matter of swapping out the map statements with more specific data:

bots = read.csv("ZeroAccessGeoIPs.csv")
library(maps)
 
# draw Maine state boundary in black and counties in gray
map("state","maine",interior=FALSE)
map("county","maine",boundary=FALSE,col="gray",add=TRUE)
 
points(x=bots$Longitude,y=bots$Latitude,col='red',cex=0.25)

We’re either really tech/security-savvy or don’t do much computin’ up here (click for larger map)

Because of the way the maps library handles geo-plotting, there are points outside the actual map boundaries.
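
If those stray points bother you, one way to drop them is with map.where() from the maps library, which reports the polygon (if any) a coordinate falls in; this is a hedged sketch rather than what the original post did:

# hedged sketch: keep only the points that actually fall inside Maine
# (map.where() returns the containing polygon name, or NA if outside)
where = map.where("state", bots$Longitude, bots$Latitude)
inMaine = !is.na(where) & grepl("^maine", where)
points(x=bots$Longitude[inMaine], y=bots$Latitude[inMaine], col='red', cex=0.25)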

You can even get a quick and dirty geo-heatmap without too much trouble:

bots = read.csv("ZeroAccessGeoIPs.csv")
 
# load the ggplot2 library
library(ggplot2)
 
# create a plot object for the heatmap
zeroheat <- qplot(xlab="Longitude",ylab="Latitude",main="ZeroAccess Botnet",geom="blank",x=bots$Longitude,y=bots$Latitude,data=bots)  + stat_bin2d(bins =300,aes(fill = log1p(..count..))) 
 
# display the heatmap
zeroheat


Click for larger map

Try playing around with the bins to see how that impacts the plots (stat_bin2d(…) divides the “map” into “buckets” (or bins), and that tells ggplot how to color-code the output).

If you were to pre-process the data a bit, or craft some ugly R code, a more traditional choropleth can easily be created as well. The interesting part about using a non-boundaried plot is that this ZeroAccess network almost defines every continent for us (which is kinda scary).

That’s just a taste of what you can do with just a few, simple lines of R. If I have some time, I’ll toss up some examples in Python as well. Definitely drop a note in the comments if you put together some #spiffy visualizations with the data they provided.

As promised, this post is a bit more graphical, but I feel the need to stress the importance of the first few points in chapter 2 of the book (i.e. the difference between mean and average and why variance is meaningful). These are fundamental concepts for future work.

The “pumpkin” example (2.1) gives us an opportunity to do some very basic R:

pumpkins <- c(1,1,1,3,3,591) # build a vector
mean(pumpkins)               # mean (average)
var(pumpkins)                # variance
sd(pumpkins)                 # standard deviation

(as you can see, I’m still trying to find the best way to embed R source code)

We move from pumpkins to babies for Example 2.2 (you’ll need the full source from the previous examples, which includes all the solutions for this example, to make the rest of the code snippets work). Here, we can quickly compute and compare the standard deviations (with difference) and the means (with difference) to help us analyze the statistical significance questions in the chapter:

sd(firstbabies$prglength)
sd(notfirstbabies$prglength)
sd(firstbabies$prglength) - sd(notfirstbabies$prglength)
 
mean(firstbabies$prglength)
mean(notfirstbabies$prglength)
mean(firstbabies$prglength) - mean(notfirstbabies$prglength)

You’ll see the power of R’s hist function in a moment, but you should be a bit surprised when you see the output if you enter the following to solve Example 2.3:

mode(firstbabies$prglength)

That’s right, R does not have a built-in mode function. It’s pretty straightforward to compute, tho:

names(sort(-table(firstbabies$prglength))[1])

(notice how “straightforward” != “simple”)

We have to use the table function to generate a table of value frequencies. The result is a named vector: each frequency is an element, and the value it counts rides along as that element’s name (a string). Using “-” negates all the frequencies (but keeps the names attached), and sort orders the structure so we can use index “[1]” to get to the entry we’re looking for. By using the names function, we get the string representing the value with the highest frequency. You can see this iteratively by breaking out the code:

table(firstbabies$prglength)
str(table(firstbabies$prglength))
sort(table(firstbabies$prglength))
sort(table(firstbabies$prglength))[1] # without the "-"
sort(-table(firstbabies$prglength))[1]
names(sort(-table(firstbabies$prglength))[1])

There are a plethora of other ways to compute the mode, but this one seems to work well for my needs.
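
If you'd rather have it wrapped up as a reusable function, here's one common alternative (a sketch; unlike the names() approach above it returns the value itself rather than a string):

# hedged sketch: a small reusable mode function; which.max() picks the first
# value with the highest frequency, so ties go to whichever appears first
stat.mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat.mode(firstbabies$prglength)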

Pictures Or It Didn’t Happen

I did debate putting the rest of this post into a separate example, but if you’ve stuck through this far, you deserve some stats candy. It’s actually pretty tricky to do what the book does here:

So, we’ll start off with simple histogram plots of each set being compared:

hist(firstbabies$prglength)

hist(notfirstbabies$prglength)

I separated those out since hist by default displays the histogram and if you just paste the lines consecutively, you’ll only see the last histogram. What does display is, well, ugly and charts should be beautiful. It will take a bit to explain the details (in another post) but this should get you started:

par(mfrow=c(1,2))
hist(firstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="white", ylim=c(0,3000),xlim=c(17,max(firstbabies$prglength)), breaks="Scott", main="Histogram of first babies", xlab="Weeks")
hist(notfirstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="blue", ylim=c(0,3000),xlim=c(17,max(notfirstbabies$prglength)), breaks="Scott", main="Histogram of other babies", xlab="Weeks")
par(mfrow=c(1,1))

In the above code, we’re telling R to set up a canvas that will have one row and two plot areas. This makes it very easy to have many graphs on one canvas.

Next, the first hist sets up some label proportions (the cex parameters), tells R to make Y labels horizontal (las=1), makes the bars white, sets up sane values for the X & Y axes, instructs R to use the “Scott” algorithm for calculating sane bins (we’ll cover this in more detail in the next post) and sets up sane titles and X axis labels. Finally, we reset the canvas for the next plot.

There’s quite a bit to play with there and you can use the “help()” command to get information on the hist function and plot function. You can set up your own bin size by substituting a vector of breakpoints for “Scott” (see the sketch below). If you have specific questions, shoot a note in the comments, but I’ll explain more about what’s going on in the next post as we add in probability histograms and start looking at the data in more detail.
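
For example, here's a hedged sketch of supplying explicit one-week-wide bins instead of letting “Scott” pick them:

# hedged sketch: explicit one-week-wide bins spanning the whole data range
hist(firstbabies$prglength,
     breaks=seq(min(firstbabies$prglength), max(firstbabies$prglength), by=1),
     col="white", main="Histogram of first babies", xlab="Weeks")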


Click for larger image

Download R Source of Examples 2.1-2.3