Tag Archives: python

ThinkStats (by Allen B. Downey) is a good book to get you familiar with statistics (and even Python, if you’ve done some scripting in other languages).

I thought it would be interesting to present some of the examples & exercises in the book in R. Why? Well, once you’ve gone through the material in a particular chapter the “hard way”, seeing how you’d do the same thing in a language specifically designed for statistical computing should show when it’s best to use such a domain-specific language and when you might want to consider a hybrid approach. I am also hoping it helps make R a bit more accessible to folks.

You’ll still need the book and should work through the Python examples to get the most out of these posts.

I’ll try to get at least one example/exercise section up a week.

Please submit all errors, omissions or optimizations in the comments section.

The star of the show is going to be the “data frame” in most of the examples (and is in this one). Unlike the Python code in the book, most of the hard work here is figuring out how to use the data frame file reader to parse the ugly fields in the CDC data file. By using some tricks, we can approximate the “field start:length” style of the Python code but still keep the automatic reading/parsing of the R code (including implicit handling of “NA” values).

The power & simplicity of using R’s inherent ability to apply a calculation across a whole column (pregnancies$agepreg <- pregnancies$agepreg / 100) should jump out. Unfortunately, not all elements of the examples in R will be as nice or straightforward.
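For comparison, here is a minimal sketch of roughly what those two column operations look like in plain Python without a data frame. The helper names `scale_column` and `total_ounces` are made up for illustration, and `None` stands in for R's `NA`; the point is only that the looping and missing-value handling R does implicitly has to be spelled out.

```python
def scale_column(values, divisor):
    """Divide every non-missing value by divisor; keep None as None."""
    return [None if v is None else v / divisor for v in values]


def total_ounces(pounds, ounces):
    """Combine lb/oz columns into ounces; missing in either gives None."""
    return [
        None if lb is None or oz is None else lb * 16 + oz
        for lb, oz in zip(pounds, ounces)
    ]


# Hypothetical sample values, not taken from the CDC file.
agepreg = scale_column([3316, None, 2575], 100)
weights = total_ounces([7, None, 6], [8, 3, None])
```

In R, the single expression `pregnancies$agepreg / 100` covers both the loop and the `NA` handling.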

You'll also notice that I cheat and use str() for displaying summary data.

Enough explanation! Here's the code:

    # ThinkStats in R by @hrbrmstr
    # Example 1.2
    # File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm

    # set up a data frame that has the field start/end info
    # (stringsAsFactors = FALSE keeps the names as plain character strings)

    pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb', 'birthwgt_oz', 'prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
                          begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
                          end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440),
                          stringsAsFactors = FALSE
    )

    # calculate widths so we can pass them to read.fwf()

    pFields$width <- pFields$end - pFields$begin + 1

    # we aren't reading every field (for the book exercises), so record the gap
    # after each field as a negative width, which read.fwf() treats as a skip

    pFields$skip <- (-c(pFields$begin[-1] - pFields$end[-nrow(pFields)] - 1, 0))

    # interleave the width/skip pairs and drop the zero-length skips

    widths <- c(t(pFields[, c("width", "skip")]))
    widths <- widths[widths != 0]

    # read in the file

    pregnancies <- read.fwf("2002FemPreg.dat", widths)

    # assign column names

    names(pregnancies) <- pFields$name

    # divide mother's age by 100

    pregnancies$agepreg <- pregnancies$agepreg / 100

    # convert weight at birth from lbs/oz to total ounces

    pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz

    # same approach for the respondents file (only the case id is needed)

    rFields <- data.frame(name  = c('caseid'),
                          begin = c(1),
                          end   = c(12),
                          stringsAsFactors = FALSE
    )

    rFields$width <- rFields$end - rFields$begin + 1
    rFields$skip <- (-c(rFields$begin[-1] - rFields$end[-nrow(rFields)] - 1, 0))

    widths <- c(t(rFields[, c("width", "skip")]))
    widths <- widths[widths != 0]

    respondents <- read.fwf("2002FemResp.dat", widths)
    names(respondents) <- rFields$name

    # display the structure of both data frames

    str(respondents)
    str(pregnancies)
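If the width/skip trick above feels opaque, the arithmetic can be checked outside R. The following is a minimal Python sketch, not the book's code: `fwf_widths` and `parse_line` are hypothetical helpers written for this post, and they assume the same 1-based, inclusive begin/end positions used in the R data frame.

```python
# Field layout copied from the R example: (name, begin, end), 1-based inclusive.
FIELDS = [
    ("caseid", 1, 12),
    ("nbrnaliv", 22, 22),
    ("babysex", 56, 56),
    ("birthwgt_lb", 57, 58),
    ("birthwgt_oz", 59, 60),
    ("prglength", 275, 276),
    ("outcome", 277, 277),
    ("birthord", 278, 279),
    ("agepreg", 284, 287),
    ("finalwgt", 423, 440),
]


def fwf_widths(fields):
    """Return read.fwf-style widths: positive = read, negative = skip."""
    widths = []
    position = 1  # next unread column, 1-based
    for _name, begin, end in fields:
        gap = begin - position
        if gap > 0:
            widths.append(-gap)  # columns between fields are skipped
        widths.append(end - begin + 1)
        position = end + 1
    return widths


def parse_line(line, fields):
    """Slice one record into a dict; blank fields become None (like NA)."""
    record = {}
    for name, begin, end in fields:
        raw = line[begin - 1:end].strip()
        record[name] = int(raw) if raw else None
    return record
```

The absolute values of the widths add up to 440, the end position of the last field, which is a quick sanity check that no column was counted twice or dropped.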

Those were the words that greeted me within five minutes of checking out the Flask microframework for Python web applications. I feel compelled to inline those four short paragraphs:

I’m not joking. Well, maybe a little. If you write a web application, you are probably allowing users to register and leave their data on your server. The users are entrusting you with data. And even if you are the only user that might leave data in your application, you still want that data to be stored securely.

Unfortunately, there are many ways the security of a web application can be compromised. Flask protects you against one of the most common security problems of modern web applications: cross-site scripting (XSS). Unless you deliberately mark insecure HTML as secure, Flask and the underlying Jinja2 template engine have you covered. But there are many more ways to cause security problems.

The documentation will warn you about aspects of web development that require attention to security. Some of these security concerns are far more complex than one might think, and we all sometimes underestimate the likelihood that a vulnerability will be exploited, until a clever attacker figures out a way to exploit our applications. And don’t think that your application is not important enough to attract an attacker. Depending on the kind of attack, chances are that automated bots are probing for ways to fill your database with spam, links to malicious software, and the like.

So always keep security in mind when doing web development.

Let’s look at the key take-away messages…

Data Should Be Stored Securely

Interestingly enough, this is not the default mindset of one of the more popular modern database technologies [mongoDB] (and it has plenty of company [memcached], too).

Even if your app starts out without any real sensitive data, odds are you will be storing credentials, e-mail addresses, social network handles and other bits of information that you should feel some fundamental responsibility to treat with care. There are so many resources to help (for memcached, MySQL, Oracle, CouchDB and SQLite) that you really have no excuse.

And, it will save you time later on when you realize you actually need to have a secure storage foundation.
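As one concrete example of a "secure storage foundation", here is a minimal sketch of salted password hashing using only the Python standard library. The function names and the iteration count are my assumptions, not anything prescribed by Flask or its documentation; treat the work factor as something to tune for your own hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password, iterations=ITERATIONS):
    """Return (salt, iterations, digest) for storage instead of the password."""
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected):
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)
```

Storing the salt and iteration count next to the digest means the work factor can be raised later without invalidating existing records.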

Watch The Input To Your Apps

Flask protects you against one of the most common security problems of modern web applications: cross-site scripting (XSS). There are many others. If you are a programmer and have never even heard of OWASP, then you need to put down your PS3/Xbox controller and do a quick read on at least their take on the top ten web app security risks (btw: there are way more than ten, but you need to start somewhere).
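To make the XSS point concrete, here is a minimal sketch that uses the standard library's `html.escape` as a stand-in for the autoescaping Jinja2 performs in Flask templates. The `render_comment` helper is hypothetical and is not part of Flask; it only shows what escaping does to hostile input.

```python
from html import escape


def render_comment(author, body):
    """Build an HTML fragment with both user-supplied values escaped."""
    return "<p><b>{}</b>: {}</p>".format(escape(author), escape(body))


# Hypothetical hostile input: the markup is neutralized, not executed.
fragment = render_comment("eve", "<script>alert(1)</script>")
```

The escaped fragment displays the script tag as text in the browser instead of running it.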

The thing is, unless the halls of higher education have crumbled completely since I was in school, I distinctly remember having the concept of input validation, bounds checking, etc. being rammed into my thick skull in almost every programming class (and this was way before web apps were even contemplated). You may think you’re innovating by posting a link to your functioning rapid prototype on Hacker News, but what you’re really doing is being sloppy, lazy and irresponsible. Period.
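Input validation and bounds checking do not need to be elaborate. Here is a minimal sketch, assuming a URL-shortener-style app that accepts a slug and a numeric field; the names `validate_slug` and `parse_bounded_int`, and the specific limits, are illustrative assumptions.

```python
import re

# Assumed policy: 1-32 characters, lowercase letters, digits and hyphens only.
SLUG_PATTERN = re.compile(r"[a-z0-9-]{1,32}")


def validate_slug(value):
    """Return the slug unchanged if it matches the policy, else raise."""
    if not isinstance(value, str) or not SLUG_PATTERN.fullmatch(value):
        raise ValueError("slug must be 1-32 characters of a-z, 0-9 or '-'")
    return value


def parse_bounded_int(text, low, high):
    """Parse an integer and reject anything outside [low, high]."""
    try:
        number = int(text)
    except (TypeError, ValueError):
        raise ValueError("not an integer: {!r}".format(text))
    if not low <= number <= high:
        raise ValueError("{} is outside {}..{}".format(number, low, high))
    return number
```

Rejecting anything that does not match an explicit allow-list is usually easier to reason about than trying to enumerate every bad input.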

And, while it’s fine to seek out frameworks like Flask and rely on some of their inherent protections, it does not absolve you from your responsibility to deliberately & consciously build rugged software (which doesn’t just mean “secure”).

“Don’t think that your application is not important enough to attract an attacker”

I’m not sure if any amount of verbiage will convince someone of this fact if they are determined not to believe/accept it. It’s a much larger discussion (and this is already a long post). If you are inclined to have a slightly open mind, I encourage you to read So You Think Your Website Won’t Get Hacked by Joseph Schembr. It’s really slanted towards “script-kiddies,” but should pique your interest enough to keep exploring why your hacked-up personal URL shortener might be a target.

Fin

It’s impressive that the Flask authors cover security in some way, shape or form on 21 pages in the documentation [PDF]. If you’re building or contributing to other frameworks, projects or engines (hint, hint, Node.js devs!), I would strongly encourage you to take as much time and consideration as the Flask team did to ensure you are making it as easy as possible for your users to deploy applications as securely as possible by default.