
We had our first real snowfall of the season in Maine today, and that usually means school delays/closings. Our “local” station – @WCSH6 – has a Storm Center Closings page as well as an SMS notification service. I decided this morning that I needed a command line version (and, eventually, a version that sends me a Twitter DM), but I was also tight for time (a lunchtime meeting ending early is responsible for this blog post).

While I’ve consumed my share of Beautiful Soup and can throw down some mechanize with the best of them, it occurred to me that there may be an even easier way – one that may also hold up better if the site eventually starts blocking scrapers.

I set up a Google Drive spreadsheet that uses the importHTML formula to read in the closings table on the page:

=importHTML("http://www.wcsh6.com/weather/severe_weather/cancellations_closings/default.aspx","table",0)

Then I did a File→Publish to the web, set Sheet 1 to “Automatically republish when changes are made” and set the link to point at the CSV version of the data:

[Screenshot: the “Publish to the web” settings (automatic republish, CSV output)]

The raw output looks a bit like:

Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
,,
American Legion Post 155 - Naples,Closed,12/17/2012 12:49:00

The conversion has some “blank” lines, but that’s easy enough to filter out with some quick bash:

curl --silent "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv" | grep -v "^,,"

And, looking for our kids’ specific school(s) is an easy grep as well.
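
For completeness, the same filtering is just as easy from R, which fits the rest of this blog’s toolbox. A minimal sketch that pulls the same published-CSV URL, drops the blank rows and filters on a school name (“Falmouth” is just a placeholder; substitute your own school):

# read the published CSV straight from Google (same URL as the curl example)
closings <- read.csv(
  "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv",
  stringsAsFactors = FALSE
)

# drop the "blank" rows the table-to-CSV conversion introduces
closings <- closings[closings$Name != "", ]

# look for a specific school ("Falmouth" is a placeholder)
closings[grepl("Falmouth", closings$Name, ignore.case = TRUE), ]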

The reason this is interesting is that importHTML is dynamic and will re-convert the HTML table each time the CSV URL is retrieved. Couple that with the fact that Google is far less likely to be blocked than my IP address(es), and this seems to be a pretty nice alternative to traditional parsing.

If I get some time over the break, I’ll do a quick benchmark of this method against some python and perl scraping/parsing methods.

Back in 2011, @joshcorman posited “HD Moore’s Law” which is basically:

Casual Attacker power grows at the rate of Metasploit

I am officially submitting the ‘fing’ corollary to said law:

Fundamental defender efficacy can be ascertained within 10 ‘fings’

The tool ‘fing’ (http://overlooksoft.com/fing) is a very lightweight-yet-wicked-functional network & services scanner that runs on everything from the linux command line to your iPhone. I have a permanent ‘sensor’ always running at home and have it loaded on every device I can. While the fine folks at Overlook Software would love you to death for buying a fingbox subscription, it can be used quite nicely in standalone mode to great effect.

I break out ‘fing’ during tedious meetings, bus/train/plane rides or trips to stores (like Home Depot [hint]) just to see who/what else is on the Wi-Fi network and to also get an idea of how the network itself is configured.

What’s especially fun at—um—*your* workplace is to run it from the WLAN (iOS/Android) to see how many hosts it finds on the broadcast domain, then pick pseudo-random (or just interesting looking) hosts to see what services (ports) are up and then use the one-click-access mechanism to see what’s running behind the port (especially browser-based services).

How does this relate to HD Moore’s Law? What makes my corollary worthy of an extension?

If I can use ‘fing’ to do a broadcast domain discovery, select ten endpoints and find at least one insecure configuration (e.g. telnet on routers, port 80 admin login screens, a highly promiscuous number of open ports), you should not consider yourself a responsible defender. The “you” is a bit of a broad term, but if your multi-million dollar (assuming an enterprise) security program can be subverted internally with just ‘fing’, you really won’t be able to handle metasploit, let alone a real attacker.

While metasploit is pretty straightforward to run even for a non-security professional, ‘fing’ is even easier and is something you should show your network and server admins (and developers) how to use on their own, even if you can’t get it officially sanctioned (yes, I said that). Many (most) non-security IT professionals just don’t believe us when we tell them how easy it is for attackers to find things to exploit, and this is a great, free way to show them. Or, to put it another way: “demos speak louder than risk assessments”.

If ‘fing’ isn’t in your toolbox, get it in there. If you aren’t running it regularly at work/home/out-and-about, do so. If you aren’t giving your non-security colleagues simple tools to help them be responsible defenders: start now.

And, finally, if you’re using ‘fing’ or any other simple tool in a similar capacity, drop a note in the comments (always looking for useful ways to improve security).

Naomi Robbins is running a graph makeover challenge over at her Forbes blog, and this is my entry for the B2B/B2C Traffic Sources one:

And, here’s the R source for how to generate it:

library(ggplot2)
 
df = read.csv("b2bb2c.csv")
 
# horizontal bars, faceted by venue (B2B vs B2C), with % labels on the bars
ggplot(data=df,aes(x=Site,y=Percentage,fill=Site)) + 
  geom_bar(stat="identity") + 
  facet_grid(Venue ~ .) + 
  coord_flip() + 
  opts(legend.position = "none", title="Social Traffic Sources for B2B & B2C Companies") + 
  stat_bin(geom="text", aes(label=sprintf("%d%%",Percentage), vjust=0, hjust=-0.2, size = 10))

And, here’s the data:

Site,Venue,Percentage
Facebook,B2B,72
LinkedIn,B2B,16
Twitter,B2B,12
Facebook,B2C,84
LinkedIn,B2C,1
Twitter,B2C,15

I chose to go with a latticed bar chart as I think it helps show the relative portions within each category (B2B/B2C) and also enables quick comparisons across categories for all three factors.
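
The code above uses the ggplot2 syntax of its day; opts() has since been removed from the package. On a current ggplot2, a roughly equivalent sketch (same b2bb2c.csv data, with theme()/labs() in place of opts() and geom_text() for the percentage labels) would look something like this:

library(ggplot2)

df <- read.csv("b2bb2c.csv")

# same faceted, flipped bar chart as above, in post-opts() ggplot2 syntax
ggplot(df, aes(x = Site, y = Percentage, fill = Site)) +
  geom_bar(stat = "identity") +
  facet_grid(Venue ~ .) +
  coord_flip() +
  geom_text(aes(label = sprintf("%d%%", Percentage)), hjust = -0.2, size = 3) +
  labs(title = "Social Traffic Sources for B2B & B2C Companies", x = NULL, y = NULL) +
  theme(legend.position = "none")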

For those inclined to click, I was interviewed by Fahmida Rashid (@fahmiwrite) over at Sourceforge’s HTML5 center a few weeks ago (right after the elections) due to my tweets on the use of HTML5 tech over Flash. Here’s one of them:

https://twitter.com/hrbrmstr/status/266006111256207361

While a tad inaccurate (one site did use Flash with an HTML fallback and some international sites are still stuck in the 1990s), it is still a good sign of how the modern web is progressing.

I can honestly say I’ve never seen my last name used so many times in one article :-)

Earlier this week, @jayjacobs & I both received our acceptance notice for the talk we submitted to the RSA CFP! [W00t!] Now the hard part: crank out a compelling presentation in the next six weeks! If you’re interested at all in doing more with your security data, this talk is for you. Full track/number & details below:

Session Track: Governance, Risk & Compliance
Session Code: GRC-T18
Scheduled Date: 02/26/2013
Scheduled Time: 2:30 PM – 3:30 PM
Session Length: 1 hr
Session Title: Data Analysis and Visualization for Security Professionals
Session Classification: Intermediate
Session Keywords: metrics, visualization, risk management, research
Short Abstract: You have a deluge of security-related data coming from all directions and may even have a fancy dashboard full of pretty charts. However, unless you know the right questions to ask and how to ask them, all you really have are compliance artifacts. Move beyond the checkbox and learn techniques for collecting, exploring and visualizing the stories within our security data.

UPDATE: As indicated in the code comments, Google took down the cone KML files. I’ll be changing the code to use the NHC archived cone files later tonight.

I will (most likely) not be littering the blog with any more updates to the ‘Sandy’ code unless they are really significant. You can follow along at home with any changes over at github.

Of note (in terms of changes) is the addition of the 5-day forecast cone and some annotations.


NOTE: There is significantly updated code on github for the Sandy ‘R’ dataviz.

This is a follow-up post to the quickly crafted Watch Sandy in “R” post from last night. I noticed that Google provided the KML on their crisis map and wanted to show how easy it is to add it to the previous code.

I added comments inline to make it easier to follow along.

library(maps)
library(maptools)
library(rgdal)
 
# need this for handling "paste" extra spaces
trim.trailing <- function (x) sub("\\s+$", "", x)
 
# get track data from Unisys
wx = read.table(file="http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat", skip=3,fill=TRUE)
 
# join last two columns (type of storm) that read.table didn't parse as one
# and give the data frame real column names
wx$V7 = trim.trailing(paste(wx$V7,wx$V8," "))
wx$V8 = NULL
colnames(wx) = c("Advisory","Latitude","Longitude","Time","WindSpeed","Pressure","Status")
 
# annotate the name with forecast (+12/+24/etc) projections
wx$Advisory = unlist(strsplit(toString(wx$Advisory),", "))
wx$a = ""
wx$a[grep("\\+",wx$Advisory)] = wx$Advisory[grep("\\+",wx$Advisory)]
wx$Status = trim.trailing(paste(wx$Status,wx$a," "))
 
# cheap way to make past plots one color and forecast plots another
wx$color = "red"
wx$color[grep("\\+",wx$Advisory)] = "orange"
 
# only want part of the map
xlim=c(-90,-60)
 
# plot the map itself
map("state", interior = FALSE, xlim=xlim)
map("state", boundary = FALSE, col="gray", add = TRUE,xlim=xlim)
 
# plot the track with triangles and colors
lines(x=wx$Longitude,y=wx$Latitude,col="black",cex=0.75)
points(x=wx$Longitude,y=wx$Latitude,col=wx$color,pch=17,cex=0.75)
 
# annotate it with the current & projected strength/status + forecast
text(x=wx$Longitude,y=wx$Latitude,col='blue',labels=wx$Status,adj=c(-.15),cex=0.33)
 
# get the forecast "cone" from the KML that google's crisis map provides
# NOTE: you need curl & unzip (with funzip) on your system
# NOTE: Google didn't uniquely name the cone they provide, so this will break
#       post-Sandy. I suggest saving the last KML & track files out if you
#       want to save this for posterity
 
# this gets the cone and makes it into a string that getKMLcoordinates can process
coneKML = paste(unlist(system("curl -s -o - http://mw1.google.com/mw-weather/maps_hurricanes/three_day_cone.kmz | funzip", intern = TRUE)),collapse="")
 
# process the coneKML and make a data frame out of it
coords = getKMLcoordinates(textConnection(coneKML))
coords = data.frame(SpatialPoints(coords, CRS('+proj=longlat')))
 
# this plots the cone and leaves the track visible
polygon(coords$X1,coords$X2, pch=16, border="red", col='#FF000033',cex=0.25)
 
# no one puts Sandy in a box (well, except us)
box()

UPDATE:

  • Significantly updated code on github
  • Well, a couple of folks asked how to make it more “centered” on the hurricane and keep the labels from being chopped off, so I modified the previous code a bit to show how to do that.


While this will pale in comparison to the #spiffy Sandy graphics on Weather Underground and other sites (including Google’s “Crisis Center” maps), I thought it might be interesting to show folks how they can keep an eye on “Sandy” using R.

library(maps)
 
# strip trailing whitespace introduced by paste() below
trim.trailing <- function (x) sub("\\s+$", "", x)
 
# read the 'Sandy' track data from Unisys
wx = read.table(file="http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat", skip=3,fill=TRUE)
# re-join the two storm-type columns and give the data frame real column names
wx$V7 = trim.trailing(paste(wx$V7,wx$V8," "))
wx$V8 = NULL
colnames(wx) = c("Advisory","Latitude","Longitude","Time","WindSpeed","Pressure","Status")
 
# tag the forecast (+12/+24/etc) advisories so they can be labeled separately
wx$Advisory = unlist(strsplit(toString(wx$Advisory),", "))
wx$a = ""
wx$a[grep("\\+",wx$Advisory)] = wx$Advisory[grep("\\+",wx$Advisory)]
wx$Status = trim.trailing(paste(wx$Status,wx$a," "))
 
# past positions in red, forecast positions in orange
wx$color = "red"
wx$color[grep("\\+",wx$Advisory)] = "orange"
 
# only want part of the map
xlim=c(-90,-60)
 
# plot the base map, then the track points and status labels
map("state", interior = FALSE, xlim=xlim)
map("state", boundary = FALSE, col="gray", add = TRUE,xlim=xlim)
 
points(x=wx$Longitude,y=wx$Latitude,col=wx$color,pch=17,cex=0.75)
 
text(x=wx$Longitude,y=wx$Latitude,col='blue',labels=wx$Status,adj=c(-.15),cex=0.33)
box()

This short script reads in data from the Unisys Hurricane Weather Center and plots colored symbols on a US map with the forecast projections. I may be extending this, but it’s a good starting point for anyone who wants to play with mapping relatively live data from the internet (or learn some more R).
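
Since the Unisys track file gets updated as new advisories are issued, one low-effort way to keep the plot current is to just re-run the script on a timer. A trivial sketch (it assumes the code above is saved as plot_sandy.R; that filename is just an example, not anything standard):

# re-fetch the track data and redraw the map every 30 minutes,
# writing the result to a PNG each pass
# ("plot_sandy.R" is assumed to contain the script above)
while (TRUE) {
  png("sandy_track.png", width = 800, height = 600)
  source("plot_sandy.R")
  dev.off()
  Sys.sleep(30 * 60)  # seconds
}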