I’ve blathered about trust before (1, 2), but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident.

The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary:

SK-CSIRT identified malicious software libraries in the official Python package
repository, PyPI, posing as well known libraries. A prominent example is a fake
package urllib-1.21.1.tar.gz, based upon a well known package
urllib3-1.21.1.tar.gz.
Such packages may have been downloaded by unwitting developer or administrator
by various means, including the popular “pip” utility (pip install urllib).
There is evidence that the fake packages have indeed been downloaded and
incorporated into software multiple times between June 2017 and September 2017.

Words are great but, unlike some other FOSS projects (*cough* R *cough*), the PyPI folks have authoritative log data regarding package downloads from PyPI. This means we can begin to quantify the exposure. The Google BigQuery SQL was pretty straightforward:

SELECT timestamp, file.project as package, country_code, file.version AS version
FROM (
  (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-01-22'), TIMESTAMP('2017-09-15')))
)
WHERE file.project IN ('acqusition', 'apidev-coop', 'bzip', 'crypt', 'django-server',
                       'pwd', 'setup-tools', 'telnet', 'urlib3', 'urllib')
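
The query returns one row per download, so getting to daily numbers is just a roll-up. Here's a minimal sketch of that step, assuming the rows have been exported as dictionaries with the `timestamp` and `package` fields from the query above and that the timestamp is a string starting with `YYYY-MM-DD` (both the export shape and the `daily_counts` helper are my assumptions, not part of the BigQuery setup):

```python
from collections import Counter


def daily_counts(rows):
    """Count downloads per (day, package) pair.

    Assumes each row is a dict with a 'timestamp' string that begins
    with 'YYYY-MM-DD' and a 'package' string (hypothetical export shape).
    """
    counts = Counter()
    for row in rows:
        day = row["timestamp"][:10]  # keep only the date part
        counts[(day, row["package"])] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows, purely to show the shape of the input.
    sample = [
        {"timestamp": "2017-06-02 10:00:00 UTC", "package": "urllib"},
        {"timestamp": "2017-06-02 18:30:00 UTC", "package": "urllib"},
        {"timestamp": "2017-06-03 09:15:00 UTC", "package": "bzip"},
    ]
    for (day, package), n in sorted(daily_counts(sample).items()):
        print(day, package, n)
```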

Let’s see what the daily downloads of the malicious packages look like:

Thanks to Curtis Doty (@dotysan on GH) I learned that the BigQuery table can be further filtered to exclude mirror-to-mirror traffic. The data for that is now in the GH repository and the chart in this callout shows that the exposure was very, very (very) limited:

But we need counts of the mal-package doppelgangers (i.e. the good packages) to truly understand the scope of exposure:

Thankfully, the SK-CSIRT folks caught this in time and the exposure was limited. But those are some popular tools that were targeted and it’s super-easy to sneak these into requirements.txt and scripts since the names are similar and the functionality is duplicated.
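
Since the whole trick relies on near-miss names, a crude similarity check can at least flag suspects. This is a minimal sketch, assuming you maintain your own list of the packages you actually mean to use. The `KNOWN_GOOD` list and the `lookalikes` helper are hypothetical (only urllib3 comes from the advisory), and the 0.8 cutoff is an arbitrary starting point:

```python
import difflib

# Hypothetical allow-list: the packages you actually intend to depend on.
KNOWN_GOOD = ["urllib3", "setuptools", "requests"]


def lookalikes(name, known_good, cutoff=0.8):
    """Return known-good names that `name` closely resembles.

    An exact match is not suspicious, so it yields an empty list.
    """
    name = name.lower()
    if name in known_good:
        return []
    return difflib.get_close_matches(name, known_good, n=3, cutoff=cutoff)


if __name__ == "__main__":
    for candidate in ["urllib", "urlib3", "setup-tools", "requests"]:
        print(candidate, "->", lookalikes(candidate, KNOWN_GOOD))
```

A non-empty result only means "look at this one by hand." Plenty of legitimate packages have similar names, so treat it as a prompt for review rather than a verdict.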

I’ll further note that the crypto package was “good” at some point in time then went away and was replaced with the nefarious one. That seems like a pretty big PyPI oversight (vis-a-vis package retirement & name re-use), but I’m not casting stones. R’s devtools::install_github() and wanton source()ing are just as bad, and the non-CRAN ecosystem is an even more varmint-prone “wild west” environment.

Furthermore, this is a potential exposure issue in many FOSS package repository ecosystems. On the one hand, these are open environments with tons of room for experimentation, creativity and collaboration. On the other hand, they’re all-too-easy targets for malicious hackers to prey upon.

I, unfortunately, have no quick-fix solutions to offer. “Review your code and dependencies” is about the best I can suggest until individual ecosystems work on better integrity & authenticity controls or there is a cross-ecosystem effort to establish “best practices” and perhaps even staffed, verified, audited, free services that work like a sheriff+notary to help ensure the safety of projects relying on open source components.

Python folks: double check that you weren’t a victim here (it’s super easy to type some of those package names wrong, and hopefully you’ve noticed builds failing if you had done so).
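
If you'd rather not eyeball it, here's a rough sketch of that double check. It assumes a plain `requirements.txt`-style list of lines and uses the package names from the query earlier in this post. The `flag_requirements` helper is hypothetical, and the name matching is deliberately simple rather than a full requirements parser:

```python
import re

# Package names from the BigQuery query earlier in this post.
MALICIOUS = {
    "acqusition", "apidev-coop", "bzip", "crypt", "django-server",
    "pwd", "setup-tools", "telnet", "urlib3", "urllib",
}


def normalize(name):
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


MALICIOUS_NORMALIZED = {normalize(n) for n in MALICIOUS}


def flag_requirements(lines):
    """Return the requirement lines whose project name is on the list."""
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        # The project name runs up to the first version/extras/marker character.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match and normalize(match.group(0)) in MALICIOUS_NORMALIZED:
            flagged.append(raw.strip())
    return flagged


if __name__ == "__main__":
    with open("requirements.txt") as fh:  # assumed file name and location
        for hit in flag_requirements(fh):
            print("check this one:", hit)
```

This only covers what is written in the file you point it at. Anything installed by hand, or pulled in by a script, needs its own look.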

R folks: don’t be smug, watch your GitHub dependencies and double check your projects.

You can find the data and the scripts used to generate the charts (ironically enough) on GitHub.

Finally: I just want to close with a “thank you!” to PyPI’s Donald Stufft who (quickly!) pointed me to a blog post detailing the BigQuery setup.

4 Comments

  1. I wrote about this a while ago: https://hackernoon.com/building-a-botnet-on-pypi-be1ad280b8d6 . The current download stats for these packages for this year is about 470,000

    • (akismet had this in the spam bucket, apologies for a delayed reply)

      Seriously gd detective work in that post and it has me wondering how many other ecosystems (node, ruby, etc) have similar issues like that. It’s rly disconcerting that the security of these fundamental underpinnings of virtually everything that runs the modern internet are so fragile.

  2. I think you mean SK-CSIRT ?

