The Evolution of Data Literacy at the U.S. Department of Energy + Finding Power Grid Cyber Attacks in a Data Haystack

I was chatting with some cyber-mates at a recent event and the topic of cyber attacks on the U.S. power grid came up (as it often does these days). The conversation was brief, but the topic made its way into active memory and resurfaced when I saw today's Data Is Plural newsletter, which noted that "Utility companies are required to report major power outages and other 'electric disturbance events' to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets." (Follow the links to the newsletter to get the URLs for the site, since Jeremy deserves the clicks.)

Many of us data nerds use the Data Is Plural newsletters as fodder for class assignments, blog posts or personal “data katas”. This time, I was after cyber attack data.

When you head to the annual reports URL, you’re greeted with a SharePoint-driven HTML table:

So, our options are PDF or XLS (and I do mean .xls; these are not modern .xlsx files). We'll opt for the latter and cache them locally before working on them. One "gotcha" is that the hrefs look like this: https://www.oe.netl.doe.gov/download.aspx?type=OE417XLS&ID=78, i.e. no filenames. But the filenames do come along for the ride in a content-disposition response header when an HTTP GET or HEAD request is issued, so we'll use that metadata instead of siphoning off the year from the first column of the table.
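
If you're curious, here's a quick peek at what that header looks like for a single file (a minimal sketch; the exact filename returned for any given ID will vary):

library(httr)

res <- HEAD("https://www.oe.netl.doe.gov/download.aspx?type=OE417XLS&ID=78")
res$headers[["content-disposition"]]
## something like: "attachment; filename=<year>_Annual_Summary.xls"

With that, here's the full scrape-and-cache loop: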

library(rvest)
library(readxl)
library(httr) # for HEAD()/GET()/stop_for_status()/write_disk() below
library(tidyverse)

doe <- read_html("https://www.oe.netl.doe.gov/OE417_annual_summary.aspx")

dir.create("~/Data/doe-cache-dir", showWarnings = FALSE)

html_nodes(doe, xpath=".//a[contains(., 'XLS')]") %>%
  html_attr("href") %>%
  { .pb <<- progress_estimated(length(.)) ; . } %>% # we don't really need progress bars, but they're handy for slow downloads
  walk(~{

    .pb$tick()$print()

    dl_url <- sprintf("https://www.oe.netl.doe.gov/%s", .x)

    res <- HEAD(dl_url) # so we can get the filename
    stop_for_status(res) # halt on network errors

    fil <- str_replace(
      string = res$headers[["content-disposition"]],
      pattern = "attachment; filename=",
      replacement = "~/Data/doe-cache-dir/"
    )

    if (!file.exists(fil)) { # the HEAD request above is lightweight, so we can cache & refresh without wasting server (or our own) bandwidth/cpu
      res <- GET(dl_url, httr::write_disk(fil))
      stop_for_status(res)
      Sys.sleep(5) # be kind to the server(s) but only if we're downloading data files since HEAD requests don't really tax services
    }

  })

Let's do a quick check to see how uniform these files are likely to be. Some of them go back to 2002 and I suspect they're more geared for "printing" (the PDF counterparts were a clue) than for programmatic processing:

# check to see if the files are all the same (spoiler alert: they're not)
list.files("~/Data/doe-cache-dir", "xls", full.names=TRUE) %>%
  map_df(~list(
    fil = basename(.x),
    ncols = read_xls(.x, col_names=FALSE, col_types="text") %>% ncol()
  )) -> cols_profile

cols_profile
## # A tibble: 17 x 2
##    fil                     ncols
##    <chr>                   <int>
##  1 2002_Annual_Summary.xls     8
##  2 2003_Annual_Summary.xls     8
##  3 2004_Annual_Summary.xls     8
##  4 2005_Annual_Summary.xls     8
##  5 2006_Annual_Summary.xls     8
##  6 2007_Annual_Summary.xls     8
##  7 2008_Annual_Summary.xls     8
##  8 2009_Annual_Summary.xls     8
##  9 2010_Annual_Summary.xls     8
## 10 2011_Annual_Summary.xls     9
## 11 2012_Annual_Summary.xls     9
## 12 2013_Annual_Summary.xls     9
## 13 2014_Annual_Summary.xls     9
## 14 2015_Annual_Summary.xls    11
## 15 2016_Annual_Summary.xls    11
## 16 2017_Annual_Summary.xls    11
## 17 2018_Annual_Summary.xls    11

O_o

At this point, I paused and wanted to see what was going on in the minds of the DoE staffers charged with releasing this data.

(You can grab the macOS Quick Look preview snaps of all of those here.)

From 2002 to 2010 the Excel documents are clearly designed with print as the target, complete with month break rows and repeated (and heavily formatted) headers. They even left other tabs around (inconsistently).

Things got a little better between 2011 and 2014, but we still have month breaks and occasional repeated headers (someone likely learned how to generate per-page headers in Excel in 2011; then the administration changed hands, new staffers came in and fubar'd 2012 a bit before going back to the slightly better format).

Prior to 2015, the print-as-target trumped programmatic access. Interestingly enough, this is roughly when "data science" was on the upswing (in a major way):



Starting with 2015 we have a "month" column, more uniformity for specifying dates & times and more care given to other value fields, so kudos to the then and current staffers who made our data-machinating lives easier.

This really is, I believe, a byproduct of modern "data literacy". Folks in charge of gathering and publishing data are realizing there are multiple ways others want or need to consume the data. The original purpose for this data was to hand a report to someone after regulations were put in place to mandate notifications. I'm willing to bet nobody did anything with this data for a few years. Staffers either learned to wield Excel better or new staffers came in with that knowledge. Finally, recent years clearly show that the staffers realize folks are as (or more) likely to programmatically consume this information as they are to read a long list of events. More work is needed (and an API or CSV/JSON output would be super cool) but it's great to see data literacy alive and well in the halls of the U.S. gov.

Said modern format changes do not really help us work with the complete data set, though, and the more recent files have issues all their own, including inconsistency in how the date/time columns are represented in Excel cells.
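
That last issue is worth calling out on its own: in the 2015+ files a "date" cell sometimes arrives as m/d/Y text and sometimes as an Excel serial number stored as text, so we have to handle both. Here's the gist as a small helper (coerce_excel_date() is just an illustrative name; the same logic is inlined in the case_when() calls further down):

library(tidyverse) # for dplyr::case_when() & stringr::str_detect()

coerce_excel_date <- function(x) {
  case_when(
    str_detect(x, "/") ~ suppressWarnings(as.Date(x, format = "%m/%d/%Y")),                            # m/d/Y text
    str_detect(x, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(x), origin = "1899-12-30")), # Excel serial number
    TRUE ~ as.Date(NA)                                                                                 # "Unknown" & friends
  )
}

coerce_excel_date(c("1/2/2018", "43102", "Unknown"))
## [1] "2018-01-02" "2018-01-02" NA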

By golly, we're still going to try to read all these files in and work with them (at least for the purpose I originally set out on). We'll have to deal with the differences in columns and come up with a way to remove the non-data rows. I also at least want dates to be actual dates. Here's my stab at an initial clean-up (there's lots more work to do, though):

map2(cols_profile$fil, cols_profile$ncols, ~{

  if (.y == 8) { # 8-column files (2002-2010)

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "region", "time", "area", "event_type", "loss", "customers_affected", "date_restored"),
      col_types = c("date", "text", "text", "text", "text", "text", "text", "date")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 9) { # 9-column files (2011-2014)

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("date_began", "time_began", "date_restored", "time_restored", "area", "region", "event_type", "loss", "customers_affected"),
      col_types = c("date", "guess", "date", "guess", "text", "text", "text", "text", "text")
    )) %>%
      filter(!is.na(date_began)) %>%
      mutate(date_began = as.Date(date_began))

  } else if (.y == 11) { # 11-column files (2015-2018)

    # note that the date columns aren't uniform in the Excel spreadsheets even in these more data-literate files :-(

    suppressWarnings(read_xls(
      path = sprintf("~/Data/doe-cache-dir/%s", .x),
      col_names = c("month", "date_began", "time_began", "date_restored", "time_restored", "area", "region", "alert_criteria", "event_type", "loss", "customers_affected"),
      col_types = c("text", "text", "guess", "text", "guess", "text", "text", "text", "text", "text", "text")
    )) %>%
      mutate(
        date_began = case_when(
          str_detect(date_began, "/") ~ suppressWarnings(as.Date(date_began, format="%m/%d/%Y")),
          str_detect(date_began, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_began), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      mutate(
        date_restored = case_when(
          str_detect(date_restored, "/") ~ suppressWarnings(as.Date(date_restored, format="%m/%d/%Y")),
          str_detect(date_restored, "^[[:digit:]]+$") ~ suppressWarnings(as.Date(as.integer(date_restored), origin = "1899-12-30")),
          TRUE ~ suppressWarnings(as.Date(NA))
        )
      ) %>%
      filter(!is.na(date_began))

  }

}) -> reports

reports[[1]]
## # A tibble: 23 x 8
##    date_began region time                area      event_type      loss  customers_affec… date_restored
##  1 2002-01-30 SPP    0.25                Oklahoma  Ice Storm       500   1881134          2002-02-07 12:00:00
##  2 2002-01-29 SPP    Evening             Metropol… Ice Storm       500-… 270000           NA
##  3 2002-01-30 SPP    0.66666666666666663 Missouri  Ice Storm       210   95000            2002-02-10 21:00:00
##  4 2002-02-27 WSCC   0.45000000000000001 Californ… Interruption o… 300   255000           2002-02-27 11:35:00
##  5 2002-03-09 ECAR   0                   Lower Pe… Severe Weather  190   190000           2002-03-11 12:00:00
##  6 2002-04-08 WSCC   0.625               Arizona   Vandalism/      0     0                2002-04-09 00:00:00
##  7 2002-07-09 WSCC   0.51875000000000004 Californ… Interruption o… 240   1 PG&E           2002-07-09 19:54:00
##  8 2002-07-19 WSCC   0.49375000000000002 Californ… Interruption o… 240   1 PG&E           2002-07-19 16:30:00
##  9 2002-07-20 NPCC   0.52777777777777779 New York  Fire            278   63500            2002-07-20 20:12:00
## 10 2002-08-02 MAIN   0.52986111111111112 Illinois  Interruption o… 232   53565            2002-08-02 18:36:00
## # ... with 13 more rows

reports[[10]]
## # A tibble: 307 x 9
##    date_began time_began  date_restored       time_restored  area    region event_type loss  customers_affec…
##  1 2011-01-11 0.96388888… 2011-01-11 00:00:00 0.96388888888… Athens… NPCC   Electrica… 0     0
##  2 2011-01-12 0.25        2011-01-12 00:00:00 0.58333333333… Massac… NPCC   Winter St… N/A   80000
##  3 2011-01-13 0.30624999… 2011-01-13 00:00:00 0.34236111111… North … FRCC   Firm Syst… 150   20900
##  4 2011-01-18 0.58333333… 2011-01-18 00:00:00 0.58333333333… Whitma… NPCC   Vandalism  0     0
##  5 2011-01-23 0.29166666… 2011-01-23 00:00:00 0.54166666666… Frankl… WECC   Vandalism  0     0
##  6 2011-01-24 0.55555555… 2011-01-24 00:00:00 0.5625         Newman… WECC   Suspiciou… 0     0
##  7 2011-01-25 0.14097222… 2011-01-25 00:00:00 0.45833333333… Newark… RFC    Vandalism  0     0
##  8 2011-01-26 0.39236111… 2011-01-27 00:00:00 0.70833333333… Carson… WECC   Suspected… 0     0
##  9 2011-01-26 0.39791666… 2011-01-27 00:00:00 0.62708333333… Michig… RFC    Vandalism  0     0
## 10 2011-01-26 0.70833333… 2011-01-31 00:00:00 0.33333333333… Montgo… RFC    Winter St… N/A   210000
## # ... with 297 more rows

reports[[17]]
## # A tibble: 120 x 11
##    month  date_began time_began date_restored time_restored area   region alert_criteria     event_type loss
##  1 Janua… 2018-01-01 0.7645833… 2018-01-02    0.7576388888… Tenne… SERC   Public appeal to … Severe We… Unkn…
##  2 Janua… 2018-01-01 0.7381944… NA            Unknown       Texas: TRE    Public appeal to … Severe We… Unkn…
##  3 Janua… 2018-01-01 0.9006944… 2018-01-02    0.4375        Tenne… SERC   Public appeal to … System Op… Unkn…
##  4 Janua… 2018-01-02 0.4166666… 2018-02-12    0.3333333333… New Y… NPCC   Fuel supply emerg… Fuel Supp… 675
##  5 Janua… 2018-01-02 0.3125     NA            Unknown       South… SERC   Public appeal to … Severe We… 0
##  6 Janua… 2018-01-02 0.28125    2018-01-02    0.375         North… SERC   System-wide volta… Severe We… 14998
##  7 Janua… 2018-01-04 0.0756944… 2018-01-04    0.0895833333… Texas… TRE    Physical attack t… Actual Ph… Unkn…
##  8 Janua… 2018-01-12 0.5472222… 2018-01-12    0.6201388888… Michi… RF     Cyber event that … System Op… 41
##  9 Janua… 2018-01-15 0.1805555… 2018-01-18    0.2416666666… Texas: TRE    Public appeal to … Severe We… Unkn…
## 10 Janua… 2018-01-16 0.625      2018-01-18    0.5416666666… Tenne… SERC   Public appeal to … Severe We… Unkn…
## # ... with 110 more rows, and 1 more variable: customers_affected <chr>

If you'd have handled the above differently, it'd be great if you could drop a note in the comments (for both my benefit and that of any other readers who have kindly made it this far into this tome).

At this point, I really just want to finally see if there are any "cyber" events in the data set and when/where they were. To do that, let's whittle down the columns a bit and make one data frame out of all the reports:

map_df(reports, ~{
  select(.x, date_began, region, area, event_type, customers_affected, date_restored) %>%
    mutate(date_restored = as.Date(date_restored)) %>%
    mutate(
      # strip ranges, punctuation & text (e.g. "N/A", "1 PG&E") so the column can be numeric
      customers_affected = suppressWarnings(
        str_replace_all(customers_affected, "\\-.*$|[[:punct:]]+|[[:alpha:]]+", "") %>%
          as.numeric()
      )
    ) %>%
    # if we don't know when service was restored, fall back to the date the event began
    mutate(date_restored = as.Date(ifelse(is.na(date_restored), date_began, date_restored), origin = "1970-01-01"))
}) -> events

events
## # A tibble: 2,243 x 6
##    date_began region area                          event_type                 customers_affect… date_restored
##  1 2002-01-30 SPP    Oklahoma                      Ice Storm                            1881134 2002-02-07
##  2 2002-01-29 SPP    Metropolitan Kansas City Area Ice Storm                             270000 2002-01-29
##  3 2002-01-30 SPP    Missouri                      Ice Storm                              95000 2002-02-10
##  4 2002-02-27 WSCC   California                    Interruption of Firm Load             255000 2002-02-27
##  5 2002-03-09 ECAR   Lower Peninsula of Michigan   Severe Weather                        190000 2002-03-11
##  6 2002-04-08 WSCC   Arizona                       Vandalism/                                 0 2002-04-09
##  7 2002-07-09 WSCC   California                    Interruption of Firm Power                 1 2002-07-09
##  8 2002-07-19 WSCC   California                    Interruption of Firm Powe…                 1 2002-07-19
##  9 2002-07-20 NPCC   New York                      Fire                                   63500 2002-07-20
## 10 2002-08-02 MAIN   Illinois                      Interruption of Firm Power             53565 2002-08-02
## # ... with 2,233 more rows

Now we're cookin' with gas!

Let's do a quick check to make sure things look OK:

count(events, event_type, sort=TRUE)
## # A tibble: 390 x 2
##    event_type                         n
##  1 Severe Weather                   369
##  2 Vandalism                        216
##  3 Severe Weather - Thunderstorms    97
##  4 Suspected Physical Attack         87
##  5 System Operations                 74
##  6 Severe Thunderstorms              70
##  7 Winter Storm                      51
##  8 Ice Storm                         42
##  9 Physical Attack - Vandalism       40
## 10 High Winds                        33
## # ... with 380 more rows

Those events+quantities seem to make sense. Now, for my ultimate goal:

filter(events, grepl("cyber|hack", event_type, ignore.case=TRUE)) # yep, grepl() is still in muscle memory
## # A tibble: 19 x 6
##    date_began region area                               event_type             customers_affec… date_restored
##  1 2003-01-25 ECAR   Cincinnati, Ohio                   Cyber Threat From Int…               NA 2003-01-25
##  2 2011-02-03 RFC    Bowie, Maryland                    Suspected Cyber Attack                0 2011-02-03
##  3 2011-02-17 WECC   Roseville, California              Suspected Cyber Attack                0 2011-02-23
##  4 2011-03-14 RFC    Baltimore, Maryland                Suspected Cyber Attack               NA 2011-03-14
##  5 2011-04-03 SERC   Unknown                            Suspected Cyber Attack                0 2011-04-05
##  6 2011-07-08 RFC    PJM Corporate Office, Pennsylvania Suspected Cyber Attack               NA 2011-07-11
##  7 2011-12-21 WECC   Boise, Idaho                       Suspected Cyber Attack                0 2011-12-21
##  8 2012-01-17 TRE    Austin, Texas                      Suspected Cyber Attack                0 2012-01-17
##  9 2012-02-17 SERC   Little Rock, Arkansas              Suspected Cyber Attack               NA 2012-02-17
## 10 2012-11-15 MRO    Iowa; Michigan                     Suspected Cyber Attack               NA 2012-11-15
## 11 2013-06-21 MRO    Michigan, Iowa                     Suspected Cyber Attack               NA 2013-10-30
## 12 2013-10-16 SERC   Roxboro Plant, North Carolina      Cyber Event with Pote…                0 2013-10-16
## 13 2014-03-20 NPCC   New York                           Suspected Cyber Attack               NA 2014-03-20
## 14 2014-10-21 MRO    Carmel, Indiana                    Suspected Cyber Attack               NA 2014-10-21
## 15 2014-12-30 NPCC   New Hampshire, Massachusetts, Mai… Suspected Cyber Attack               NA 2014-12-31
## 16 2016-02-07 NPCC   New York: Orange County            Cyber Attack                         NA 2016-02-07
## 17 2016-04-12 WECC   Washington: Pend Oreille County    Cyber Attack                          0 2016-04-12
## 18 2016-11-09 WECC   California: Stanislaus County, Sa… Cyber Attack                          0 2016-11-09
## 19 2016-12-13 WECC   California: Riverside County;      Cyber Event                           0 2016-12-13


FIN

There's a great deal of work left to do on this power-outage data cleanup:

  • Turn outage start/end info into POSIXct objects (see the sketch after this list)
  • Normalize area (make it a data_frame column with state and municipality so it can be unnested nicely)
  • Normalize event_type since many of the phrases used are equivalent and some have more than one categorization
  • Normalize loss somehow and do a better job with customers_affected (I did not double-check my work and I suspect there are now errors in that column, but I didn't need it for my goal)
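
For the first item, here's a minimal sketch of one approach, assuming the 2011+ layout where the time columns are (usually) Excel fraction-of-day values stored as text (to_posixct() is just an illustrative name, and anything non-numeric like "Evening" or "Unknown" falls back to midnight):

# hypothetical helper: combine a Date and an Excel fraction-of-day (0.25 == 06:00) into POSIXct
to_posixct <- function(date, frac) {
  frac <- suppressWarnings(as.numeric(frac)) # "Evening", "Unknown" -> NA
  frac <- ifelse(is.na(frac), 0, frac)       # fall back to midnight
  as.POSIXct(as.character(date), tz = "UTC") + round(frac * 86400)
}

to_posixct(as.Date("2011-01-12"), "0.25")
## [1] "2011-01-12 06:00:00 UTC"

# e.g. reports[[10]] %>% mutate(began = to_posixct(date_began, time_began))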

Since GitLab snippets are terrible, awful, slow things, I've grudgingly posted the above code (contiguously) over at GitHub.

If you wrangle the data more and/or come up with other insights drop a note in the comments with a link to your post.

Cover image from Data-Driven Security