I’ve been threatening to do a series on “data science community security” for a while and had cause to issue this inaugural post today. It all started with this:
Hey #rstats folks: don't do this. Srsly. Don't do this. Pls. Will blog why. Just don't do this. https://t.co/qkem5ruEBi
— boB Rudis (@hrbrmstr) February 23, 2017
Let me begin with the following: @henrikbengtsson is an awesome member of the #rstats community. He makes great things and I trust his code and intentions. This post is not about him, it’s about raising awareness regarding security in the data science community.
I can totally see why folks would like Henrik’s tool. Package dependency management, including installing packages, is not the most beloved of R tasks, especially for new R users or those who would rather do their science or statistical work than delve deep into the innards of R. The suggestion to use:
source('http://callr.org/install#knitr')
no doubt came from a realization of how cumbersome it can be to deal with said dependency management. You can even ostensibly see what the script does since Henrik provides a link to it right on the page.
So, why the call to not use it?
For starters, if you do want to use this approach, grab his script and make a local copy of it. Read it. Try to grok what it does. Then, use it locally. It will likely be a time-/effort-saver for many R users.
My call was to not source it from the internet.
Why? To answer that I need to talk about trust.
hrbrmstr’s Hierarchy of Package Trust
When you install a package on your system you’re bringing someone else’s code into your personal work space. When you try to use said code with a library() call, R has a few mechanisms to run code on package startup. So, when you just install and load a package you’re executing real code in the context of your local user. Some of that code may be interpreted R code. Some may be calling compiled code. Some of it may be trying to execute binaries (apps) that are already on your system.
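To make that a bit more concrete, here is a minimal sketch of how you might look for startup hooks and shell-outs in a package before installing it. It assumes you have already unpacked the source tarball somewhere (say, via download.packages() with type = "source" followed by untar()). The function name scan_pkg_source() and the regular expressions are my own illustration, not part of any package, and the patterns are a crude heuristic that obfuscated code will slip past.

```r
# Hypothetical helper: flag lines in a package's R/ directory that define
# load/attach hooks, shell out to the OS, or call into compiled code.
# The patterns are illustrative heuristics, not an exhaustive detector.
scan_pkg_source <- function(pkg_dir) {
  patterns <- c(
    hook   = "\\.(onLoad|onAttach)\\b",
    system = "\\b(system2?|shell)\\s*\\(",
    native = "\\.(C|Call|External|Fortran)\\s*\\("
  )
  r_files <- list.files(file.path(pkg_dir, "R"), pattern = "\\.[RrSs]$",
                        full.names = TRUE)
  hits <- lapply(r_files, function(f) {
    lines <- readLines(f, warn = FALSE)
    found <- lapply(names(patterns), function(kind) {
      idx <- grep(patterns[[kind]], lines, perl = TRUE)
      if (length(idx) == 0) return(NULL)
      data.frame(file = basename(f), line = idx, kind = kind,
                 text = trimws(lines[idx]), stringsAsFactors = FALSE)
    })
    do.call(rbind, found)
  })
  list(
    hits = do.call(rbind, hits),                            # NULL if nothing matched
    has_compiled_code = dir.exists(file.path(pkg_dir, "src"))
  )
}
```

An empty result does not mean a package is safe. It only means nothing obvious turned up, so you still need to read the source.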
Stop and think about that for a second.
If you saw a USB stick outside your office with a label “Cool/Useful R Package” would you insert it into your system and install the package? (Please tell me you answered “No!” :-)
With that in mind, I have a personal “HieraRchy of Package Trust” that I try to stick by:
Tier 1
This should be a pretty obvious one, but if it’s my own code/server or my org’s code/server there’s inherent trust.
Tier 2
When you type install.packages() and rely on a known CRAN mirror, MRAN server or Bioconductor download using https you’re getting quite a bit in the exchange.
CRAN GuaRdians at least took some time to review the package. They won’t catch every possible potentially malicious bit and the efficacy of evaluating statistical outcomes is also left to the package user. What you’re getting, at least from the main cran.r-project.org repo and RStudio’s repos, are reviewed packages served from decently secured systems run by organizations with good intentions. Your trust in other mirror servers is up to you but there are no guarantees of security on them. I’ve evaluated the main CRAN and RStudio setups (remotely) and am comfortable with them but I also use my own personal, internal CRAN mirror for many reasons, including security.
Revolution-cum-Microsoft MRAN is also pretty trustworthy, especially since Microsoft has quite a bit to lose if there are security issues there.
Bioconductor also has solid package management practices in place, but I don’t use that ecosystem much (at all, really) so can’t speak too much more about it except that I’m comfortable enough with it to put it with the others at that level.
Tier 3
If I’m part of a known R cabal in private collaboration, I also trust it, but it’s still raw source and I have to scan through code to ensure the efficacy of it, so it’s a bit further down the list.
Tier 4
If I know the contributors to a public source repo, I’ll also consider trusting it, but I will still need to read through the source and doubly-so if there is compiled code involved.
Tiers 5 & 6
If the repo source is a new/out-of-the-blue contributor to the R community or hosted personally, it will be relegated to the “check back later” task list and definitely not installed without a thorough reading of the source.
NOTE
There are caveats to the list above (like CRAN R packages that download pre-compiled Windows libraries from GitHub) that I’ll go into in other posts, along with a demonstration of the perils of trust that I hope doesn’t get Hadley too upset (you’ll see why in said future post).
Also note that there is no place on said hierarchy for the random USB stick of cool/useful R code. #justdontdoit
Watering Holes
The places where folks come together to collaborate have a colloquial security name: a “watering hole”. Attackers use these known places to perform — you guessed it — “watering hole” attacks. They figure out where you go, who/what you trust and use that to do bad things. I personally don’t know of any current source-code attacks, but data scientists are being targeted in other ways by attackers. If attackers sense there is a source code soft-spot it will only be a matter of time before they begin to use that vector for attack. The next section mentions one possible attacker that you’re likely not thinking of as an “attacker”.
This isn’t FUD.
Governments, competitors and criminals know that the keys to the 21st century economy (for a while, anyway) reside in data and with those who can gather, analyze and derive insight from data. Not all of us have to worry about this, but many of us do and you should not dismiss the value of the work you’re doing, especially if you’re not performing open research. Imagine if a tiny bit of data exfiltration code managed to get on your Spark cluster or even your own laptop. This can easily happen with a tampered package (remember the incident a few years ago with usage tracking code in R scripts?).
A Bit More On https
I glossed over the https bit above, but by downloading a package over SSL/TLS you’re ensuring that the bits of code aren’t modified in transit from the server to your system and what you’re downloading is also not shown to prying eyes. That’s important since you really want to be sure you’re getting what you think you’re getting (i.e. no bits are changed) and you may be working in areas your oppressive, authoritarian government doesn’t approve of, such as protecting the environment or tracking global climate change.
The use of https also does show, at least in a limited sense, that the maintainers of the server knew enough to actually set up SSL/TLS and thought, at least for a minute or two, about security. The crazy move to “Let’s Encrypt” everything is a topic for another, non-R post, but you can use that service to get free certificates for your own sites with a pretty easy installation mechanism.
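If you want a quick sanity check that your own session only talks to https repositories, something like the following sketch might do. The helper name insecure_repos() is made up for illustration, and cloud.r-project.org is used here only as an example https CRAN endpoint. Swap in your internal mirror if you run one.

```r
# Hypothetical helper: return the configured repositories that are NOT
# served over https, so they can be fixed before install.packages() is used.
insecure_repos <- function(repos = getOption("repos")) {
  repos[!grepl("^https://", repos, ignore.case = TRUE)]
}

# Pin this session to an https CRAN endpoint (an example choice, not a
# recommendation of one mirror over another), then confirm nothing is left over.
options(repos = c(CRAN = "https://cloud.r-project.org"))
stopifnot(length(insecure_repos()) == 0)
```

This only checks the URL scheme. It says nothing about who runs the server or how well it is secured.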
I re-mention SSL/TLS as a segue back to the original topic…
Back to the topic at hand
So, what’s so bad about:
source('http://callr.org/install#knitr')
On preview: nothing. Henrik’s a good dude and you can ostensibly see what that script is doing.
On review: much.
I won’t go into a great deal of detail, but that server is running RHEL 5 with 15 internet services enabled, ranging from FTP to mail to web serving, along with two database servers (both older versions) directly exposed to the internet. The default serving mode is http and the SSL certificate it does have is not trusted by any common certificate store.
None of that was found by any super-elite security mechanism.
Point your various clients at those services on that system and you’ll get a response. To put it bluntly, that system is 100% vulnerable to attack. (How to set up a defensible system is a topic for another post.)
In other words, if said mechanism becomes a popular “watering hole” for easy installation of R packages, it’s also a pretty easy target for attackers to take surreptitious control of and then inject whatever they want, along with keeping track of what’s being installed, by whom and from which internet locale.
Plain base::source() does nothing on your end to validate the integrity of that script. It’s like using devtools::source_url() or devtools::source_gist() without the sha1 parameter (which uses a hash to validate the integrity of what you’re sourcing). Plus, it seems you cannot do:
devtools::source_url('http://callr.org/install#knitr', sha1="2c1c7fe56ea5b5127f0e709db9169691cc50544a")
since the httr call that lies beneath appears to be stripping away the #… bits. So, there’s no way to run this remotely with any level of integrity or confidentiality.
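The same hash-check idea works fine on a local copy, though. Below is a minimal sketch, assuming you recorded a hash when you last read through the script. The name source_if_unchanged() is hypothetical, and the default hash function assumes the digest package is installed. SHA-256 is my choice here, not something the devtools functions above use.

```r
# Hypothetical helper: source a local script only if its hash still matches
# the value recorded when the script was last reviewed.
# Assumption: the 'digest' package is installed for the default hash_fun.
source_if_unchanged <- function(path, expected_hash,
                                hash_fun = function(p) {
                                  digest::digest(p, algo = "sha256", file = TRUE)
                                },
                                ...) {
  actual <- hash_fun(path)
  if (!identical(tolower(actual), tolower(expected_hash))) {
    stop("Hash mismatch for ", path, ": refusing to source it.")
  }
  # Extra arguments (e.g. local = some_env) are passed on to base::source().
  source(path, ...)
}
```

The hash_fun argument is there so you can plug in whatever hashing tool your environment already trusts.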
TLDR
If you like this script (it’s pretty handy) put it in a local directory and source it from there.
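As a rough sketch of that workflow: fetch the script once, read it, and from then on only ever source the local file. The helper name fetch_once() and the destination file name are made up for illustration. How the script picks up the package name when sourced locally is not something this sketch covers, so read the script to find out.

```r
# Hypothetical helper: download a script once into a local file and never
# overwrite an existing (already reviewed) copy.
fetch_once <- function(url, dest) {
  if (file.exists(dest)) {
    message("Using existing local copy: ", dest)
    return(invisible(dest))
  }
  status <- utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("Download failed for ", url)
  invisible(dest)
}

# Usage (not run here; the URL is the one quoted in this post, minus the
# "#knitr" fragment, and the file name is an arbitrary example):
# script <- fetch_once("http://callr.org/install", "callr-install.R")
# file.show(script)   # read it before you run it
# source(script)      # later runs never touch the network
```

Because an existing copy is never replaced, a later compromise of the server cannot silently change what you run.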
Fin
I can’t promise a frequency for “security in the data science community” posts but will endeavor to crank out a few more before summer. Note also that R is not the only ecosystem with these issues, so Python, Julia, node.js and other communities should not get smug :-)
Our pursuit of open, collaborative work has left us vulnerable to ill-intentioned ne’er-do-wells and it’s important to at least be aware of the vulnerabilities in our processes, workflows and practices. I’m not saying you have to be wary of every devtools::install_github() that you do, but you are now armed with information that might help you think twice about how often you trust calls to do such things.
In the meantime, download @henrikbengtsson’s script, thank him for making a very useful tool and run it locally (provided you’re cool with potentially installing things from non-CRAN repos :-)
8 Comments
Do you trust ROpenSci (http://ropensci.org/)?
Yes. Full disclosure: I’m a pkg reviewer for them.
They have an outstanding pkg review process and I’m helping to add security checks into the process as well.
I asked because it is not in your “HieraRchy of Package Trust”
Good post Bob, thank you. Is there a tool/mechanism that provides anti-malware like protection when packages are downloaded and installed? Is such a thing possible? My question probably reveals my ignorance about how those things work.
Thx. I think the best we could do (for a while, anyway) is (a) provide a mechanism to note when things are being called on attach or on load as that’s the most likely time something is going to try to do anything and (b) provide a mechanism to note when a package has compiled code in it or uses system calls. Point (b) is not as easy as it sounds as it’s super easy to obfuscate system calls the same way attackers try to obfuscate PHP/JavaScript. Keeping data scientists safe is something I’m trying to focus on this year so as I find or build things I’ll keep posting them :-)
Does it help any to, say, install downloaded packages using a USB stick onto a machine that never connects to the Internet (or any network even)?
I guess a package could do something sophisticated by putting confidential data on the USB stick… Hmm… no I can’t really see how this could be attacked actually.
Using something like Little Snitch (it’s a macOS app but Windows & Linux have similar ones) would enable you to have granular control over what R (rsession, r.exe, r.app, etc) can do network-comms-wise. Good, modern behavioural malware detection software (pretty much only avail to enterprise folks) would also likely catch (at some point) rogue packages on workstations.
But…
Folks processing sensitive data or with a more explicit concern about security should really have an internal R package repository set up and a process for validating both the security and efficacy of the packages in said repository. All packages should be hashed and the hash verified on some frequency. When packages need an update the same acquisition process should be followed. Packages with binary components should also be vetted in sandboxed environments.
Thanks for your reply (and everything you do, you’re great!).
Can a person get around Little Snitch by calling another process in R?
Also, I was wondering if you could comment more directly on the idea of a two machine setup. One machine downloads and compiles the package (with an Internet connection). Then I transfer the packages to the other machine using physical media. The other machine (with sensitive data) uses the packages but has no Internet connection. Can you think of a way for an attacker to get the data off of the sensitive data machine?
While I understand that the precautions you list should always be taken, the fact is a lot of R methods are in somewhat complex packages hosted only on github (e.g. Leeper’s margins https://github.com/leeper/margins). A large company might be able to have an analyst go through the source code for every github package used, but a small team won’t have the resources to do this. But really the only security concern for a tiny company that works with sensitive data is to never leak the sensitive data, so I think our never-connect setup is secure.
A setup where you use a single computer, but “shut off the Internet” (or block all ports) when a rsession is running doesn’t seem secure because even if you kill an rsession, child processes could stay running. Then when you open the ports, the child processes could send sensitive data. I guess one could restart the machine instead of killing the rsession? But then this all depends on whether child processes can do something to the system that makes them start on reboot. I don’t like this so I think the two-machine approach is best.
I bet you could get the same result by running a virtual machine on a computer, where the virtual machine has no internet connection. Just have to make sure no sensitive data ever leaves the virtual machine.
This two-computer setup doesn’t seem to work on a large central network since some port e.g. port 22 has to be open. Not sure what to do there.
I very much want to hear your opinion on all this. Is this discussed in detail in your book? Or do you plan to write a book on this?
5 Trackbacks/Pingbacks
[…] article was first published on R – rud.is, and kindly contributed to […]
[…] blathered about trust before 1 2, but said blatherings were in a “what if” context. Unfortunately, the if has turned […]