In the previous post I tried to explain what Content Security Policies (CSPs) are and how to work with them in R. In case you didn't RTFPost, the TLDR is that CSPs give you control over what can be loaded along with your web content and can optionally be configured to generate a report whenever the policy you create is violated. While you don't need to specify a report URI, you really should, since at the very least you'll know if you errantly missed a given host, wildcard, or path. You'll also know when there's been malicious or just plain skeezy activity going on between third parties and your content (which is part of the whole point of CSPs).
There's an "R" category tag on this post (so it's hitting R-bloggers, et al.) since it's part of an unnumbered series on working with CSPs in R, and the next posts will show how to analyze the JSON-formatted reports that are generated. But, to analyze such reports you kinda need a way to get them first. So, we're going to set up a "serverless" workflow in Amazon AWS to shove CSP reports into a well-organized structure in S3 from which we'll be able to access, ingest, and analyze them.
Sure, there are services out there that will (legitimately for free) let you forward violation reports to them, but if you can do this for "free" on your own, and not hand data to a third party that will make money (or do-gooder reputation) from it, I can't fathom an argument for just giving up control.
Note that all you need is an internet-accessible HTTPS endpoint that can take an HTTP POST request with a JSON payload and then store it somewhere, so if you want to, say, use the plumber package to handle these requests without resorting to AWS, then by all means do so! (And, blog about it!)
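For the curious, here's a minimal plumber sketch of that alternative, assuming a local "csp-reports" directory and an endpoint path of /csp-report (both just illustrative choices, not anything prescribed by the CSP spec):

library(plumber)

# stash incoming reports in a local directory (swap this for S3, a database, etc.)
dir.create("csp-reports", showWarnings = FALSE)

#* Accept CSP violation reports and write each one out as a JSON file
#* @post /csp-report
function(req) {
  out_file <- file.path(
    "csp-reports",
    sprintf("report-%s.json", format(Sys.time(), "%Y%m%d-%H%M%OS3"))
  )
  writeLines(req$postBody, out_file)   # postBody is the raw JSON the browser sent
  "ok"                                 # browsers ignore the response body
}

# run it with something like: plumber::plumb("csp-plumber.R")$run(port = 8000)

You'd still need to put something like that behind an HTTPS-terminating proxy before browsers will report to it, which is part of the tedium the AWS route sidesteps.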
AWS “Serverless” CSP Report Workflow Prerequisites
You're obviously going to need an Amazon AWS account and will also need the AWS Command Line Interface tools installed, plus an IAM user that has permissions to use CloudFormation. AWS has been around a while now, so yet-another-howto on signing up for AWS, installing the CLI tools, and generating an IAM user would be, at best, redundant. Amazon has decent intro resources and, honestly, it's 2019: depending on what part of "tech" you're in, some familiarity with at least one cloud provider is pretty much a necessary skillset at this point. If you're new to AWS, follow the links in this paragraph, run through some basics, and jump back here to enter the four commands you'll need to bootstrap your CSP collection setup.
Bootstrapping an S3 CSP Collector in AWS
We’re going to use this CloudFormation workflow to bootstrap the CSP collection process and you should skim the yaml file to see what’s going on. Said yaml is “infrastructure as code”, meaning it’s a series of configuration directives to generate AWS services for you (i.e. no pointing-and-clicking) and, perhaps more importantly, destroy them for you if you no longer want to keep this active.
The CloudFormation Outputs section will contain the URI you're going to use in the report-uri/report-to CSP directives and is something we'll be querying for at the end of the setup process.
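Just to make the end state concrete, that output value eventually slots into a policy header along these lines (the URL here is a placeholder pattern, not the exact form yours will take):

Content-Security-Policy: default-src 'self'; report-uri https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/<path>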
The first set of resources consists of AWS Glue templates which would enable wiring the CSP report results up to AWS Athena. Glue is a nice ETL framework but it's kinda expensive if set to active mode (Amazon calls it 'crawler' mode), so this CloudFormation recipe only creates the Glue template but does not activate it. This section can (as the repo author notes) be deleted, but it does no harm and costs nothing extra, so leaving it in is fine as well.
The next bit sets up an AWS Firehose configuration, which is a silly-sounding name for a workflow defining where to store "streaming" data. This "firehose" config is just going to set up a path to an S3 bucket and then set up the necessary permissions associated with said bucket. This is where we're going to pull data from in the next post.
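As a preview of how we'll get at those reports, here's a rough sketch using the aws.s3 and jsonlite packages; it assumes your AWS credentials are already configured, that you plug in the bucket name the stack creates, and that the delivery stream writes uncompressed, newline-delimited JSON records:

library(aws.s3)
library(jsonlite)

# the bucket name is whatever `aws s3 ls | grep firehose` shows for your stack
bucket <- "cspreporter-firehose-bucket-example"   # placeholder

# list the objects the delivery stream has written (they're partitioned by date)
objs <- get_bucket_df(bucket)

# fetch one object; Firehose concatenates records, so split on newlines
# (run the raw bytes through memDecompress(, "gzip") first if you enable compression on the stream)
txt <- rawToChar(get_object(objs$Key[1], bucket = bucket))
recs <- strsplit(txt, "\n", fixed = TRUE)[[1]]
reports <- lapply(recs[nzchar(recs)], fromJSON)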
The aforementioned "firehose" can take streaming data from all kinds of input sources. Our data source is going to be a POSTed JSON HTTP interaction from a browser, so we need something that listens for these POST requests and wires them up to the "firehose". For that we need an API gateway, and that's what the penultimate section sets up for us. It instructs AWS to stand up an API endpoint that listens for POST requests, tells it the data type (JSON) it will be handling, and then tells it which AWS Lambda to call, which is in the last section.
Said lambda code is in the repo's index.js and is a short Node.js script that post-processes the CSP report JSON into something slightly more usable in a data analysis context (the folks who made the violation report format clearly did not have data science folks in mind given the liberal use of - in field names).
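If you've never peeked inside one of these reports, here's a hand-made example (not output from this stack) that shows why those dashes get annoying on the R side:

library(jsonlite)

raw_report <- '{
  "csp-report": {
    "document-uri": "https://example.com/page",
    "violated-directive": "script-src",
    "blocked-uri": "https://evil.example.net/x.js"
  }
}'

report <- fromJSON(raw_report)

# every field needs [[ ]] or backticks because of the dashes
report[["csp-report"]][["blocked-uri"]]

# one common cleanup: swap the dashes for underscores
names(report[["csp-report"]]) <- gsub("-", "_", names(report[["csp-report"]]), fixed = TRUE)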
If the above sounds super-complex just to get CSP reports, you're not wrong. We trade the cost and tedium of self-hosting and securing a standalone-yet-simple JSON POST handling server for a moderately complex workflow that involves multiple types of moving parts in AWS. The downside is having to gain a more-than-casual familiarity with AWS components. The plus side is that this is pretty much free unless your site is wildly popular and either constantly under XSS attack or your CSP policy is woefully misconfigured.
“‘Free’, you say?!” Yep. Free. (OK, “mostly” free)
- AWS API Gateway: 1,000,000 HTTP REST API calls (our POST reqs that call the lambda code) per month are free
- AWS Lambda (the index.js runner which sends data to the "firehose"): 1,000,000 free requests per month and 400,000 GB-seconds of compute time per month (the index.js takes ~1s to run)
- AWS Firehose (the bit that shoves data into S3): the first 500 TB/month is $0.029 USD per GB
- AWS S3: the first 50 TB/month is $0.023 USD per GB stored (the CSP JSON POSTs gzip'd are usually <1K each) plus some super-fractional (of a penny) costs for PUTting data into S3 and copying data from S3.
A well-crafted CSP and a typical site should end up costing you way less than $1.00 USD/month and you can monitor it all via the console or with alerts (change your region, if needed). Plus, you can destroy it at any time with one command (we haven’t built it yet so we’ll see this in a bit).
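To put a rough number on that claim, here's a back-of-the-envelope sketch; the 100,000 reports/month and ~1 KB/report figures are assumptions for illustration, and it ignores details like Firehose's per-record billing minimums:

reports_per_month <- 100000        # assumed report volume
report_size_gb    <- 1 / 1e6       # ~1 KB per report, expressed in GB

data_gb <- reports_per_month * report_size_gb   # ~0.1 GB/month

api_gateway <- 0                   # well inside the 1,000,000-call free tier
lambda      <- 0                   # well inside the Lambda free tier
firehose    <- data_gb * 0.029     # USD per GB ingested
s3_storage  <- data_gb * 0.023     # USD per GB-month stored

api_gateway + lambda + firehose + s3_storage   # roughly half a penny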
Launching the Bootstrap
As the repo says, do:
$ git clone git@github.com:michaelbanfield/serverless-csp-report-to.git # get the repo
$ cd serverless-csp-report-to # go to the dir
$ aws s3 mb s3://some-unique-and-decent-bucket-name-to-hold-the-lambda-code/ # pick a good name that you'll recognize
$ aws cloudformation package \
    --template-file template.yaml \
    --s3-bucket <bucket-you-just-created> \
    --output-template-file packaged-template.yaml # generate the build template
$ aws cloudformation deploy \
    --template-file /path/to/packaged-template.yaml \
    --stack-name CSPReporter \
    --capabilities CAPABILITY_IAM # launch the build
It’ll take a minute or two and when it is done just do:
$ aws cloudformation describe-stacks \
--query "Stacks[0].Outputs[0].OutputValue" \
--output text \
--stack-name CSPReporter
To get the URL you’ll use in the reporting directives.
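If you want to sanity-check the endpoint before wiring it into a real policy, something along these lines should do it; the URL is whatever the previous command printed and the report body is a hand-made example:

library(httr)

report_url <- "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/<path>"  # from describe-stacks

fake_report <- '{
  "csp-report": {
    "document-uri": "https://example.com/",
    "violated-directive": "img-src",
    "blocked-uri": "https://tracker.example.net/pixel.gif"
  }
}'

# browsers send Content-Type: application/csp-report; plain JSON is what the gateway was told to expect
resp <- POST(report_url, body = fake_report, content_type_json())

status_code(resp)   # then check the S3 bucket; Firehose buffers, so give it a minute or so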
To get rid of all these created resources you can go into the console and do it or just do:
$ aws cloudformation delete-stack --stack-name CSPReporter
To see the bucket that was created for the CSP reports just do:
$ aws s3 ls | grep firehose
FIN
If you’re experienced with AWS that was likely not a big deal. If you’re new or inexperienced with AWS this is not a bad way to get some experience with a “serverless” API setup since it’s cheap, easy to delete and touches on a number of key components within AWS.
You can browse through the AWS console to see all of what was created and eventually tweak the CF yaml to bend it to your own will.
Next time we'll dive into CSP violation report analysis with R.
REMINDER: regardless of the source (whether it's me, RStudio, spiffy R package authors, or big names like AWS/Microsoft/etc.), always at least spot-check the code you're about to install or execute. Everyone needs to start developing and honing a zero-trust mindset when it comes to even installing apps from app stores on your phones/tablets, let alone allowing random R, C[++], Python, Go, Rust, Haskell, … code to execute on your laptops and servers. This is one reason I went through the sections in the YAML and deliberately linked to the index.js. Not knowing what the code does can lead to unfortunate situations down the line.
NOTE: If you have an alternative Terraform configuration for this, drop a note in the comments, since TF is a somewhat more "modern" and less AWS-centric "infrastructure as code" framework. Also, if you've done this with Azure or other providers, drop a note in the comments since it may be of use to folks who aren't interested in using AWS. Finally, if you do make a plumber server for this, drop a link to a post on how you did it and perhaps discuss the costs & headaches involved.