hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

π, Awareness, DataVis, VAST 2013, Moar data! & GReader Machinations

2013-03-14 – 09:44
Posted in Cybersecurity, Data Analysis, Data Visualization, DataVis, Information Security, Security Awareness
Tagged WEIS 2013
Leave a Comment

Far too many interesting bits to spam on Twitter individually but each is worth getting the word out on:

tumblr_m0wabdueuR1qd78lno1_1280

– It’s [π Day](https://www.google.com/search?q=pi+day)^*
– Unless you’re living in a hole, you probably know that [Google Reader is on a death march](http://www.bbc.co.uk/news/technology-21785378). I’m really liking self-hosting [Tiny Tiny RSS](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDEQFjAA&url=https%3A%2F%2Fgithub.com%2Fgothfox%2FTiny-Tiny-RSS&ei=YtlBUfOLJvLe4AOHtoDIAQ&usg=AFQjCNGwtEr8slx-i0vNzhQi4b4evRVXFA&bvm=bv.43287494,d.dmg) so far, and will follow up with a standalone blog post on it.
– @jayjacobs & I are speaking soon at both [SOURCE Boston](http://www.sourceconference.com/boston/) & [Secure360](http://secure360.org/). We hope to have time outside the talk to discuss security visualization & analysis with you, so go on and register!
– Speaking of datavis, [VizSec 2013](http://www.vizsec.org/) is on October 14th in Atlanta, GA this year and the aligned [VAST 2013 Challenge](http://vacommunity.org/VAST+Challenge+2013) (via @ieeevisweek) will have another infosec-themed mini-challenge. Watch that space, grab the data and start analyzing & visualizing!
– If you’re in or around Boston, it’d be great to meet you at @openvisconf on May 16-17.
– Majestic SEO has released a new data set for folks to play with: [a list of the top 100,000 websites broken down by the country in which they are hosted](http://blog.majesticseo.com/research/top-websites-by-country/). It requires a free reg to get the link, but no spam so far. (h/t @teamcymru)
– And, finally, @MarketplaceAPM aired a [good, accessible story](http://www.marketplace.org/topics/tech/who-pays-bill-cyber-war) on “Who pays the bill for a cyber war?”. I always encourage infosec industry folk to listen to shows like Marketplace to see how we sound to non-security folk and to glean ways we can communicate better with the people we are ultimately trying to help.

^*Image via davincismurf

Visualizing Risky Words — Part 4 (D3 Word Trees)

2013-03-12 – 12:55
Posted in d3, Data Analysis, Data Visualization, DataVis, DataViz, Information Security
Comments (1)

This is a fourth post in my [Visualizing Risky Words](http://rud.is/b/2013/03/06/visualizing-risky-words/) series. You’ll need to read starting from that link for context if you’re just jumping in now.

I was going to create a rudimentary version of an interactive word tree for this, but the extremely talented @jasondavies (I marvel especially at his cartographic work) just posted what is probably the best online [word tree generator](https://www.jasondavies.com/wordtree/) ever made…and in D3 no less.

A word tree is a “visual interactive concordance” and was created back in 2007 by Martin Wattenberg and Fernanda Viégas. You can [read more about](http://hint.fm/projects/wordtree/) this technique on your own, but a good summary (from their site) is:

A word tree is a visual search tool for unstructured text, such as a book, article, speech or poem. It lets you pick a word or phrase and shows you all the different contexts in which it appears. The contexts are arranged in a tree-like branching structure to reveal recurrent themes and phrases.

I pasted the VZ RISK INTSUM texts into Jason’s tool so you could investigate the corpus to your heart’s content. I would suggest exploring “patch”, “vulnerability”, “adobe”, “breach” & “malware” (for starters).

Jason’s implementation is nothing short of beautiful. He uses SVG text tspans to make the individual text elements not just selectable but easily scaleable with browser window resize events.

The actual [word tree D3 javascript code](http://www.jasondavies.com/wordtree/wordtree.js?20130312.1) shows just how powerful the combination of the language and @mbostock’s library is. He has, in essence, built a completely cross-platform tokenizer and interactive visualization tool in ~340 lines of javascript. Working your way through that code through to understanding will really help improve your D3 skills.

Visualizing Risky Words — Part 3

2013-03-10 – 23:11
Posted in d3, Data Analysis, Data Visualization, DataVis, DataViz, Information Security
Comments (1)

The DST changeover in the US has made today a fairly strange one, especially when combined with a very busy non-computing day yesterday. That strangeness manifest as a need to take the D3 heatmap idea mentioned in the [previous post](http://rud.is/b/2013/03/09/visualizing-risky-words-part-2/) and actually (mostly) implement it. Folks just coming to this thread may want to start with the [first post](http://rud.is/b/2013/03/06/visualizing-risky-words/) in the series.

I did a quick extraction of the R TermDocumentMatrix with nested for loops and then extracted the original texts of the corpus and put them into some javascript variables along with some D3 code to show how to do a [rudimentary interactive heatmap](http://rud.is/d3/vzwordshm/).

(click for live demo)

As you mouse over each of the tiles they will be highlighted and the word/document will be displayed along with the frequency count. Click on the tile and the text will appear with the highlighted word.

Some caveats:

– The heatmap looks a bit different from the R one in part 2 as the terms/keywords are in alphabetical order.
– There are no column or row headers. I won’t claim anything but laziness, though the result does look a bit cleaner this way.
– I didn’t bother extracting the stemming results (it’s kind of a pain), so not all of the keywords will be highlighted when you select a tile.
– It’s one, big HTML document complete with javascript & data. That’s not a recommended practice in production code but it will make it easier for folks to play around with.
– The HTML is not commented well, but there’s really not much of it. The code that makes the heatmap is pretty straightforward if you even have a rough familiarity with D3. Basically, enumerate the data, generating a colored tile for each row/col entry and then mapping click & mouseover/out events to each generated SVG element.
– I also use jQuery in the code, but that’s not a real dependency. I just like using jQuery selectors for non-D3 graphics work. It bulks up the document, so use them together wisely. NOTE: If you don’t want to use jQuery, you’ll need to change the selector code.

Drop a note in the comments if you do use the base example and improve on it or if you have any questions on the code. For those interested, I’ll be putting all the code from the “Visualizing Risky Words” posts into a github repository at the end of the series.

Visualizing Risky Words — Part 2

2013-03-09 – 07:29
Posted in Charts & Graphs, Data Analysis, Data Visualization, DataVis, DataViz, Information Security, R
Comments (2)

This is a follow-up to my [Visualizing Risky Words](http://rud.is/b/2013/03/06/visualizing-risky-words/) post. You’ll need to read that for context if you’re just jumping in now. Full R code for the generated images (which are pretty large) is at the end.

Aesthetics are the primary reason for using a word cloud, though one can pretty quickly recognize what words were more important on well crafted ones. An interactive bubble chart is a tad better as it lets you explore the corpus elements that contained the terms (a feature I have not added yet).

I would posit that a simple bar chart can be of similar use if one is trying to get a feel for overall word use across a corpus:

(click for larger version)

It’s definitely not as sexy as a word cloud, but it may be a better visualization choice if you’re trying to do analysis vs just make a pretty picture.

If you are trying to analyze a corpus, you might want to see which elements influenced the term frequencies the most, primarily to see if there were any outliers (i.e. strong influencers). With that in mind, I took @bfist’s [corpus](http://securityblog.verizonbusiness.com/2013/03/06/2012-intsum-word-cloud/) and generated a heat map from the top terms/keywords:

(click for larger version)

There are some stronger influencers, but there is a pattern of general, regular usage of the terms across each corpus component. This is to be expected for this particular set as each post is going to be talking about the same types of security threats, vulnerabilities & issues.

The R code below is fully annotated, but it’s important to highlight a few items in it and on the analysis as a whole:

– The extra, corpus-specific stopword list : “week”, “report”, “security”, “weeks”, “tuesday”, “update”, “team” : was designed after manually inspecting the initial frequency breakdowns and inserting my opinion at the efficacy (or lack thereof) of including those terms. I’m sure another iteration would add more (like “released” and “reported”). Your expert view needs to shape the analysis and—in most cases—that analysis is far from a static/one-off exercise.
– Another area of opine was the choice of 0.7 in the removeSparseTerms(tdm, sparse=0.7) call. I started at 0.5 and worked up through 0.8, inspecting the results at each iteration. Playing around with that number and re-generating the heatmap might be an interesting exercise to perform (hint).
– Same as the above for the choice of 10 in subset(tf, tf>=10). Tweak the value and re-do the bar chart vis!
– After the initial “ooh! ahh!” from a word cloud or even the above bar chart (though, bar charts tend to not evoke emotional reactions) is to ask yourself “so what?”. There’s nothing inherently wrong with generating a visualization just to make one, but it’s way cooler to actually have a reason or a question in mind. One possible answer to a “so what?” for the bar chart is to take the high frequency terms and do a bigram/digraph breakdown on them and even do a larger cross-term frequency association analysis (both of which we’ll do in another post)
– The heat map would be far more useful as a D3 visualization where you could select a tile and view the corpus elements with the term highlighted or even select a term on the Y axis and view an extract from all the corpus elements that make it up. That might make it to the TODO list, but no promises.

I deliberately tried to make this as simple as possible for those new to R to show how straightforward and brief text corpus analysis can be (there’s less than 20 lines of code excluding library imports, whitespace, comments and the unnecessary expansion of some of the tm function calls that could have been combined into one). Furthermore, this is really just a basic demonstration of tm package functionality. The post/code is also aimed pretty squarely at the information security crowd as we tend to not like examples that aren’t in our domain. Hopefully it makes a good starting point for folks and, as always, questions/comments are heartily encouraged.

# need this NOAWT setting if you're running it on Mac OS; doesn't hurt on others
Sys.setenv(NOAWT=TRUE)
library(ggplot2)
library(ggthemes)
library(tm)
library(Snowball) 
library(RWeka) 
library(reshape)
 
# input the raw corpus raw text
# you could read directly from @bfist's source : http://l.rud.is/10tUR65
a = readLines("intext.txt")
 
# convert raw text into a Corpus object
# each line will be a different "document"
c = Corpus(VectorSource(a))
 
# clean up the corpus (function calls are obvious)
c = tm_map(c, tolower)
c = tm_map(c, removePunctuation)
c = tm_map(c, removeNumbers)
 
# remove common stopwords
c = tm_map(c, removeWords, stopwords())
 
# remove custom stopwords (I made this list after inspecting the corpus)
c = tm_map(c, removeWords, c("week","report","security","weeks","tuesday","update","team"))
 
# perform basic stemming : background: http://l.rud.is/YiKB9G
# save original corpus
c_orig = c
 
# do the actual stemming
c = tm_map(c, stemDocument)
c = tm_map(c, stemCompletion, dictionary=c_orig)
 
# create term document matrix : http://l.rud.is/10tTbcK : from corpus
tdm = TermDocumentMatrix(c, control = list(minWordLength = 1))
 
# remove the sparse terms (requires trial->inspection cycle to get sparse value "right")
tdm.s = removeSparseTerms(tdm, sparse=0.7)
 
# we'll need the TDM as a matrix
m = as.matrix(tdm.s)
 
# datavis time
 
# convert matri to data frame
m.df = data.frame(m)
 
# quick hack to make keywords - which got stuck in row.names - into a variable
m.df$keywords = rownames(m.df)
 
# "melt" the data frame ; ?melt at R console for info
m.df.melted = melt(m.df)
 
# not necessary, but I like decent column names
colnames(m.df.melted) = c("Keyword","Post","Freq")
 
# generate the heatmap
hm = ggplot(m.df.melted, aes(x=Post, y=Keyword)) + 
  geom_tile(aes(fill=Freq), colour="white") + 
  scale_fill_gradient(low="black", high="darkorange") + 
  labs(title="Major Keyword Use Across VZ RISK INTSUM 202 Corpus") + 
  theme_few() +
  theme(axis.text.x  = element_text(size=6))
ggsave(plot=hm,filename="risk-hm.png",width=11,height=8.5)
 
# not done yet
 
# better? way to view frequencies
# sum rows of the tdm to get term freq count
tf = rowSums(as.matrix(tdm))
# we don't want all the words, so choose ones with 10+ freq
tf.10 = subset(tf, tf>=10)
 
# wimping out and using qplot so I don't have to make another data frame
bf = qplot(names(tf.10), tf.10, geom="bar") + 
  coord_flip() + 
  labs(title="VZ RISK INTSUM Keyword Frequencies", x="Keyword",y="Frequency") + 
  theme_few()
ggsave(plot=bf,filename="freq-bars.png",width=8.5,height=11)

Visualizing Risky Words

2013-03-06 – 23:05
Posted in d3, Data Visualization, DataVis, DataViz, Javascript
Leave a Comment

NOTE: Parts [2], [3] & [4] are also now up.

Inspired by a post by @bfist who created the following word cloud in Ruby from VZ RISK INTSUM posts (visit the link or select the visualization to go to the post):

I ♥ word clouds as much as anyone and usually run Presidential proclamations & SOTU addresses through a word cloud generator just to see what the current year’s foci are.

However, word clouds rarely provide what the creator of the visualization intends. Without performing more strict corpus analysis, one is really just getting a font-based frequency counter. While pretty, it’s not a good idea to derive too much meaning from a simple frequency count since there are phrase & sentence structure components to consider as well as concepts such as stemming (e.g. “risks” & “risk” are most likely the same thing, one is just plural…that’s a simplistic definition/example, though).

I really liked Jim Vallandingham’s Building a Bubble Cloud walk through on how he made a version of @mbostock’s NYTimes convention word counts and decided to both run a rudimentary stem on the VZ RISK INTSUM corpus along with a principal component analysis [PDF] to find the core text influencers and feed the output to a modified version of the bubble cloud:

You can select the graphic to go to the “interactive”/larger version of it. I had intended to make selecting a circle bring up the relevant documents from the post corpus, but that will have to be a task for another day.

It’s noteworthy that both @bfist’s work and this modified version share many of the same core “important” words. With some stemming refinement and further stopword removal (e.g. “week” was in the original run of this visualization and is of no value for this risk-oriented visualization, so I made it part of the stopword list), this could be a really good way to get an overview of what the risky year was all about.

I won’t promise anything, but I’ll try to get the R code cleaned up enough to post. It’s really basic tm & PCA work, so no rocket-science degree is required. Fork @vlandham’s github repo & follow the aforelinked tutorial for the crunchy D3-goodness bits.

Security Hobos

2013-03-03 – 07:35
Posted in Breach, Information Security, Security Awareness
Comments (3)

If you haven’t viewed/read Wendy Nather’s (@451Wendy) insightful [Living Below The Security Poverty Line](https://451research.com/t1r-insight-living-below-the-security-poverty-line) you really need to do that before continuing (we’ll still be here when you get back).

Unfortunately, the catalyst for this post came from two recent, real-world events: my returned exposure to the apparent ever-increasing homeless issue in San Francisco (a side effect of choosing a hotel 10 blocks away from Moscone) and the hacking of a [small, local establishment](http://www.tnhonline.com/works-bakery-customers-targeted-by-cyber-thieves-1.2988390#.UTMuF-tASS0) resulting in exposure of customer credit cards.

If you do any mom-and-pop, brick-and-mortar shopping you’ve seen it: the Windows-based point-of-sale terminal that is the *only* computer for the owners. Your credit card will be scanned on the same machine cat videos will be viewed and e-mail will be read. In many small shops, that machine is also where accounting functions are performed.

These truly small business (TSB) owners aren’t living below the security poverty line, they are security hobos. They *kinda* know they need to care about the safety of their data, but their focus is on their business or creative processes. When they do have time to care about security, that part of their world is so complex that it’s far too easy to make the choice to ignore it than to do something about it. If your immediate reaction was to disagree with my complexity posit, here are just a few tasks a TSB owner must face in a world of modern commerce:

– Updating operating system patches
– Updating browser software
– Updating Flash
– Updating Java
– Maintain web site/Twitter/Facebook securely
– Recognizing phishing e-mails/posts/tweets
– Understanding browser security
– Keeping signature anti-malware up-to-date
– Remember passwords for system, POS vendor, government sites, e-mail, etc.
– Maintain secure Wi-Fi and Internet firewall
– Maintain physical security (e.g. cameras)

Those tasks may be as autonomous as breathing for security folk and technically-savvy users, but they are extraneous tasks that are confusing for most TSBs and may often cause instability issues with the wretched POS software options out in the marketplace. These folks also cannot afford to hire security consultants to do this work for them.

Verizon’s 2012 DBIR & Trustwave’s 2012 report both showed that [these types of businesses](http://www.slate.com/articles/technology/technology/2012/03/verizon_s_data_breach_investigations_report_reveals_that_restaurants_are_the_easiest_target_for_hackers_.single.html#pagebreak_anchor_2) were part of the groups most targeted by criminals, yet the best our industry can do is dress up folks in schoolgirl costumes at @RSAConference whilst telling TSBs to keep their systems up-to-date and not re-use passwords. It’s the security equivalent of walking by a truly desperate person on the street without even making eye contact as your body language exudes the “get a job” vibe.

We have to do better than this.

Until software and hardware vendors start to—or are forced to—actually care about security, it will be up to security professionals to create the digital equivalent of a soup kitchen to make the situation better. What can you do?

– speak at local Chamber of Commerce meetings and provide practical take-aways for those who attend
– discuss security topics with friends or relatives who are TSB owners
– have your [ISSA|ISC²|NAISG] chapter setup a booth at conventions which attract TSBs (y’know…get out of the echo chamber, mebbe?)
– raise awareness through blogging and other media outlets
– produce & distribute awareness materials—a great example would be @Veracode’s non-domain [infographics](http://www.veracode.com/blog/category/infographics/)
– demand better (in general) out of your security vendors
– lobby government for better security standards

It may not seem like much, but we have to start somewhere if we’re going to find a way to help protect those that most vulnerable, especially since it will also mean helping to keep *our own* information safe.

In case you are a truly small business owner who is reading this post, there are some things you can do to help ensure you won’t be a victim:

– Use a dedicated machine for your POS work—an iPad with [Square](https://squareup.com/) is a good option but doesn’t work for everyone
– Do not perform any operations on the Internet on the system that you do accounting tasks on
– Use @1Password to create, store & manage all your passwords on all your systems/devices
– Use [Secunia PSI](http://secunia.com/vulnerability_scanning/personal/) to help keep your Windows systems up-to-date
– Set all operating system and anti-malware software to auto-update
– Do not put your security cameras on the Internet; if you do, password protect them
– Research what your responsibilities are and what actions you’ll need to take in the event you do discover that your business or customer information has been exposed

Third Annual RSA PhöCon : Thursday, Feb 27 @ 1830 : Miss Saigon

Come join us for a PhöCon good time at 18:30 (6:30PM) Thursday! The neighborhood is…interesting…but it’s close to Moscone and has had good food for the past couple years. I’ll be heading up there from the Expo area at ~1815. Hit me up on Twitter if you want to head out together.

[Miss Saigon](http://misssaigonsf.com‎) @ 100 6th Street

View Larger Map

Follow up/Resources :: GRC-T18 – Data Analysis and Visualization for Security Professionals #RSAC

2013-02-27 – 07:30
Posted in Charts & Graphs, d3, Data Analysis, Data Visualization, DataVis, DataViz, infographics, Information Security, Open Source
Leave a Comment

Many thanks to all who attended the talk @jayjacobs & I gave at RSA on Tuesday, February 26, 2013. It was really great to be able to talk to so many of you afterwards as well.

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat ever data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to setup a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [kintr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with it’s knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster then typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scaleable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.