

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder.

I’m not linking to the twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic:

using, containing, or denoting expressions that are natural to a native speaker

That comes from idiom:

a form of expression natural to a language, person, or group of people

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” over another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of the post is not really to show you how to do the directory slurping and it is unlikely that I’ll approve comments with code snippets about that task). Here are three: one from the base R tribe, one from the data.table tribe and one from the tidyverse tribe:

# We need some files and we'll use base R to make some
dir.create("readings")
for (i in 1970:2010) write.csv(mtcars, file.path("readings", sprintf("%s.csv", i)), row.names=FALSE)

fils <- list.files("readings", pattern = ".csv$", full.names=TRUE)

do.call(rbind, lapply(fils, read.csv, stringsAsFactors=FALSE))

data.table::rbindlist(lapply(fils, data.table::fread))

purrr::map_df(fils, readr::read_csv)

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:

do.call(rbind, lapply(fils, function(x) {
  f <- read.csv(x, stringsAsFactors=FALSE)
  f$year <- gsub("^readings/|\\.csv$", "", x)
  f
}))

dt <- data.table::rbindlist(lapply(fils, data.table::fread), idcol="year")
dt[, year := gsub("^readings/|\\.csv$", "", fils[year])]

purrr::map_df(fils, readr::read_csv, .id = "year") %>% 
  dplyr::mutate(year = stringr::str_replace_all(fils[as.numeric(year)],
                                                "^readings/|\\.csv$", ""))

All three versions do the same thing, and each tribe understands each idiom.

The data.table and tidyverse versions get you much faster file reading and the ability to “fill” missing columns — another common slurping task. You can hack something together in base R to do column fills (you’ll find a few StackOverflow answers that accomplish such a task) but you will likely decide to choose one of the other idioms for that and become equally comfortable in that new idiom.
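If you’re curious what that “fill” behavior looks like, here’s a minimal sketch. It assumes a hypothetical situation where some of the CSVs lack a column (the files generated above all share the same columns, so this is purely illustrative):

# hypothetical: if some files lacked a column, rbindlist() pads the gaps with NA when fill = TRUE
data.table::rbindlist(lapply(fils, data.table::fread), fill = TRUE)

# the readr + dplyr row-binding inside map_df() fills missing columns with NA automatically
purrr::map_df(fils, readr::read_csv)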

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are idiomatic for each tribe. Despite what any “mil gun feos turrach na latsa” experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.

Adapt when necessary.

UPDATE: I was reminded that I made a more generic version of adobecolor to handle many types of swatch files which you can find on github.

Many of my posts seem to begin with a link to a tweet, and this one falls into that pattern:

I’d seen the Ars Tech post about the named color palette derived from some training data. I could tell at a glance of the resultant palette:

that it would not be ideal for visualizations (use this site to test the final image in this post and verify that on your own), but this was a neat, quick project to take on, especially since it let me dust off an old GH package, adobecolor, and it was likely I could beat Karthik to creating a palette ;-)

The “B+” goal is to get a color palette that “matches” the one in the Tumblr post. The “A” goal is to get a named palette.

These are all the packages we end up using:

library(tesseract)
library(magick)
library(stringi)
library(adobecolor) # hrbrmstr/adobecolor - may not be Windows friendly
library(tidyverse)

Attempt #1 (B+!!)

I’m a macOS user, so I’ve got great tools like xScope at my disposal. I’m really handy with that app and the Loupe tool makes it easy to point at a color, save it to a palette board and export an ACO palette file.

That whole process took ~18 seconds (first try). I’m not saying that to brag, but we often get hung up on both speed and programmatic reproducibility, and ultimately — as we’ll see in a bit — I went for speed over programmatic reproducibility.

It’s dead simple to get the palette into R:

aco_fil <- "ml_cols.aco"
aco_hex <- rev(read_aco(aco_fil))

col2rgb(aco_hex)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## red    112  203   97  191  120  221  169  233  177   216    62   178   199
## green  112  198   92  174  114  196  167  191  138   200    63   184   172
## blue    85  166   73  156  124  199  171  143  109   185    67   196   146
##       [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## red      48   172   177   203   219   162   152   232   197   191
## green    94   152   100   205   210    98   165   177   161   161
## blue     83   145   107   192   179   106   158   135   171   124

IIRC there may still be a byte-order issue in adobecolor on Windows that I need to deal with (PRs welcome), but you likely will never need to use the package again.

A quick eyeball comparison between the Tumblr list and that matrix indicates the colors are off. That could be for many reasons, ranging from the way they were encoded in the PNG by whatever programming language was used to train the neural net and make the image (likely Python), to Tumblr degrading the image, to something on my end. As you’ll see, the colors are close enough for human eyes, which is likely close enough.

There, I’ve got a B+ with a total of about 60s of work! Plenty of time left to try shooting for an A!

Attempt #2 (FAIL)

We’ve got the PNG from the Tumblr post and the tesseract package in R. Perhaps this will be super-quick, too:

pal_img_fil <- "tumblr_inline_opgsh0UI6N1rl9zu7_400.png"

pal_ocr <- ocr(pal_img_fil)
stri_split_lines(pal_ocr)
## [[1]]
##  [1] "-ClaniicFug112113 84"      "-Snowhnn.k 201 199165"    
##  [3] "- Cmbabcl 97 93 68"        "-Bunfluw 190 174 155"      
##  [5] "-an:hing Blue 121 114125"  "Bank Bun 221 196199"      
##  [7] "- Caring Tan 171 166170"   "-Smrguun 233191 141"      
##  [9] "-Sink 176 131; 110"        "Slummy Beige 216 200135"  
## [11] "- Durkwumi 61 63 66"       "Flow/£1178 1114 196"      
## [13] "- Sand Dan 2111 172143"    "- Grade 136: 41; 94 x3"   
## [15] "-Ligh[OfBlasll75150147"    "-Grass 13m 176 99108"     
## [17] "Sindis Poop 204 205 194"   "Dupe 219 2119179"         
## [19] "-'n:sling156101 106"       "-SloncrElu13152165 159"   
## [21] "- Buxblc Simp 226 1x1 132" "-Sl.mky 13m197162171"     
## [23] "-'J\\milyl90164116"        ""                         
## [25] ""

Ugh.

Perhaps if we crop out the colors:

image_read(pal_img_fil) %>%
  image_crop("+57") %>%
  ocr() %>%
  stri_split_lines()
## [[1]]
##  [1] "Clanfic Fug112113 84"       "Snowhunk 201 199 165"     
##  [3] "Cmbabcl 97 93 as"          "Bunfluwl90174155"          
##  [5] "Kunming Blue 121 114 125"  "Bank Bun 221196199"       
##  [7] "Caring Tan 171 ms 170"     "Slarguun 233 191 141"     
##  [9] "Sinkl76135110"             ""                         
## [11] "SIIImmy Beige 216 200 135" "Durkwuud e1 63 66"        
## [13] "Flower 175 154 196"        ""                         
## [15] "Sand Dan 201 172 143"      "Grade 1m AB 94: 53"       
## [17] ""                          "Light 0mm 175 150 147"    
## [19] "Grass Ba! 17a 99 ms"       "sxndis Poop 204 205 194"  
## [21] "Dupe 219 209 179"          ""                         
## [23] "Tesling 156 101 106"       "SloncrEluc 152 165 159"   
## [25] "Buxblc Simp 226 131 132"   "Sumky Bean 197 162 171"   
## [27] "1\\mfly 190 164 11a"        ""                         
## [29] ""

Ugh.

I’m woefully unfamiliar with how to use the plethora of tesseract options to try to get better performance and this is taking too much time for a toy post, so we’ll call this attempt a failure :-(
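(If you do want to tinker: the tesseract package lets you pass engine options when creating an engine object. Here’s a minimal, untested sketch that whitelists the characters we expect (letters, digits and spaces); no guarantee it improves anything.)

# sketch only: restrict the OCR engine to an expected character set
eng <- tesseract::tesseract(
  language = "eng",
  options = list(tessedit_char_whitelist = paste0(c(LETTERS, letters, 0:9, " "), collapse = ""))
)

image_read(pal_img_fil) %>%
  image_crop("+57") %>%
  ocr(engine = eng) %>%
  stri_split_lines()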

Attempt #3 (A-!!)

I’m going to go outside of R again to New OCR and upload the Tumblr palette there and crop out the colors (it lets you do that in-browser). NOTE: Never use any free site for OCR’ing sensitive data as most are run by content thieves.

Now we’re talkin’:

ocr_cols <- "Clardic Fug 112 113 84
Snowbonk 201 199 165
Catbabel 97 93 68
Bunfiow 190 174 155
Ronching Blue 121 114 125
Bank Butt 221 196 199
Caring Tan 171 166 170
Stargoon 233 191 141
Sink 176 138 110
Stummy Beige 216 200 185
Dorkwood 61 63 66
Flower 178 184 196
Sand Dan 201 172 143
Grade Bat 48 94 83
Light Of Blast 175 150 147
Grass Bat 176 99 108
Sindis Poop 204 205 194
Dope 219 209 179
Testing 156 101 106
Stoncr Blue 152 165 159
Burblc Simp 226 181 132
Stanky Bean 197 162 171
Thrdly 190 164 116"

We can get that into a more useful form pretty quickly:

stri_match_all_regex(ocr_cols, "([[:alpha:] ]+) ([[:digit:]]+) ([[:digit:]]+) ([[:digit:]]+)") %>%
  print() %>%
  .[[1]] -> col_mat
## [[1]]
##       [,1]                         [,2]             [,3]  [,4]  [,5] 
##  [1,] "Clardic Fug 112 113 84"     "Clardic Fug"    "112" "113" "84" 
##  [2,] "Snowbonk 201 199 165"       "Snowbonk"       "201" "199" "165"
##  [3,] "Catbabel 97 93 68"          "Catbabel"       "97"  "93"  "68" 
##  [4,] "Bunfiow 190 174 155"        "Bunfiow"        "190" "174" "155"
##  [5,] "Ronching Blue 121 114 125"  "Ronching Blue"  "121" "114" "125"
##  [6,] "Bank Butt 221 196 199"      "Bank Butt"      "221" "196" "199"
##  [7,] "Caring Tan 171 166 170"     "Caring Tan"     "171" "166" "170"
##  [8,] "Stargoon 233 191 141"       "Stargoon"       "233" "191" "141"
##  [9,] "Sink 176 138 110"           "Sink"           "176" "138" "110"
## [10,] "Stummy Beige 216 200 185"   "Stummy Beige"   "216" "200" "185"
## [11,] "Dorkwood 61 63 66"          "Dorkwood"       "61"  "63"  "66" 
## [12,] "Flower 178 184 196"         "Flower"         "178" "184" "196"
## [13,] "Sand Dan 201 172 143"       "Sand Dan"       "201" "172" "143"
## [14,] "Grade Bat 48 94 83"         "Grade Bat"      "48"  "94"  "83" 
## [15,] "Light Of Blast 175 150 147" "Light Of Blast" "175" "150" "147"
## [16,] "Grass Bat 176 99 108"       "Grass Bat"      "176" "99"  "108"
## [17,] "Sindis Poop 204 205 194"    "Sindis Poop"    "204" "205" "194"
## [18,] "Dope 219 209 179"           "Dope"           "219" "209" "179"
## [19,] "Testing 156 101 106"        "Testing"        "156" "101" "106"
## [20,] "Stoncr Blue 152 165 159"    "Stoncr Blue"    "152" "165" "159"
## [21,] "Burblc Simp 226 181 132"    "Burblc Simp"    "226" "181" "132"
## [22,] "Stanky Bean 197 162 171"    "Stanky Bean"    "197" "162" "171"
## [23,] "Thrdly 190 164 116"         "Thrdly"         "190" "164" "116"

The print() is in the pipe as I can never remember where each stringi function sticks its lists (though I usually guess right), plus I wanted to check the output.

Making those into colors is super-simple:

y <- apply(col_mat[,3:5], 2, as.numeric)

ocr_cols <- rgb(y[,1], y[,2], y[,3], names=col_mat[,2], maxColorValue = 255)

If we look at the results of Attempt #3 and Attempt #1 together:

ocr_cols
##    Clardic Fug       Snowbonk       Catbabel        Bunfiow  Ronching Blue 
##      "#707154"      "#C9C7A5"      "#615D44"      "#BEAE9B"      "#79727D" 
##      Bank Butt     Caring Tan       Stargoon           Sink   Stummy Beige 
##      "#DDC4C7"      "#ABA6AA"      "#E9BF8D"      "#B08A6E"      "#D8C8B9" 
##       Dorkwood         Flower       Sand Dan      Grade Bat Light Of Blast 
##      "#3D3F42"      "#B2B8C4"      "#C9AC8F"      "#305E53"      "#AF9693" 
##      Grass Bat    Sindis Poop           Dope        Testing    Stoncr Blue 
##      "#B0636C"      "#CCCDC2"      "#DBD1B3"      "#9C656A"      "#98A59F" 
##    Burblc Simp    Stanky Bean         Thrdly 
##      "#E2B584"      "#C5A2AB"      "#BEA474"

aco_hex
##  [1] "#707055" "#CBC6A6" "#615C49" "#BFAE9C" "#78727C" "#DDC4C7" "#A9A7AB"
##  [8] "#E9BF8F" "#B18A6D" "#D8C8B9" "#3E3F43" "#B2B8C4" "#C7AC92" "#305E53"
## [15] "#AC9891" "#B1646B" "#CBCDC0" "#DBD2B3" "#A2626A" "#98A59E" "#E8B187"
## [22] "#C5A1AB" "#BFA17C"

we can see they’re really close to each other, and I doubt anyone but the most egregiously picky color snobs could tell the difference visually:

par(mfrow=c(1,2))
scales::show_col(ocr_cols)
scales::show_col(aco_hex)
par(mfrow=c(1,1))

(OK, #3D3F43 is definitely hitting my OCD as being annoyingly different than #3D3F42 on my MacBook Pro so count me in as a color snob.)

Here’s the final palette:

structure(c("#707154", "#C9C7A5", "#615D44", "#BEAE9B", "#79727D", 
"#DDC4C7", "#ABA6AA", "#E9BF8D", "#B08A6E", "#D8C8B9", "#3D3F42", 
"#B2B8C4", "#C9AC8F", "#305E53", "#AF9693", "#B0636C", "#CCCDC2", 
"#DBD1B3", "#9C656A", "#98A59F", "#E2B584", "#C5A2AB", "#BEA474"
), .Names = c("Clardic Fug", "Snowbonk", "Catbabel", "Bunfiow", 
"Ronching Blue", "Bank Butt", "Caring Tan", "Stargoon", "Sink", 
"Stummy Beige", "Dorkwood", "Flower", "Sand Dan", "Grade Bat", 
"Light Of Blast", "Grass Bat", "Sindis Poop", "Dope", "Testing", 
"Stoncr Blue", "Burblc Simp", "Stanky Bean", "Thrdly"))

This third attempt took ~5 minutes vs 60s.

FIN

Why “A-”? Well, I didn’t completely verify that the colors and values matched 100% in the final submission. They are likely the same, but the best way to get something corrected by others is to put it on the internet, so there it is :-)

I’d be a better human and coder if I took the time to learn tesseract more, but I don’t have much need for OCR’ing text. It is likely worth the time to brush up on tesseract after you read this post.

Don’t use this palette! I created it mostly to beat Karthik to making the palette (I have no idea if I succeeded), to also show that you should not forego your base R roots (I could have let that be subliminal but I wasn’t trying to socially engineer you in this post) and to bring up the speed/reproducibility topic. I see no issues with manually doing tasks (like uploading an image to a web site) in certain circumstances, but it’d be an interesting topic of debate to see just what “rules” folks use to determine how much effort one should put into 100% programmatic reproducibility.

You can find the ACO file and an earlier, alternate attempt at making the palette in this gist.

Most examples of working with AWS services show basic username & password authentication. That’s all well-and-good, but many shops use the AWS Security Token Service to provide temporary credentials and session tokens to limit exposure and provide more uniform multi-factor authentication. At my workplace, Frank Mitchell created a nice electron app to make it super easy to create and re-up these credentials. The downside of this is that all AWS service usage for work requires using these credentials, and I was having the darndest time trying to get Athena’s JDBC driver working with them (but I wasn’t spending a lot of time on it as I tend to mirror research data to a local, beefy Apache Drill server).

I finally noticed the

com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider

class and decided to give the following a go (you will need to point fil to wherever you have the Athena jar file):

library(RJDBC)
library(tidyverse)

fil <- "~/Drivers/AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

aws <- ini::read.ini("~/.aws/credentials")

Sys.setenv(AWS_ACCESS_KEY_ID = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_access_key_id)
Sys.setenv(AWS_SECRET_ACCESS_KEY = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_secret_access_key)
Sys.setenv(AWS_SESSION_TOKEN = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_session_token)

provider <- "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"

con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir=Sys.getenv("AWS_S3_STAGING_DIR"),
                 schema_name="DEFAULT_DB_SCHEMA_NAME",
                 aws_credentials_provider_class=provider)

dbListTables(con)

dbListFields(con, "SOME_TABLE_IN_THE_DEFAULT_DB")

dbGetQuery(con, "SELECT * FROM DEFAULT_DB_SCHEMA_NAME.SOME_TABLE_IN_THE_DEFAULT_DB limit 10;")

YMMV on Windows (comments about what does and does not work on Windows are welcome).

The provider line was the key element I was missing prior to last night.

The Awsaml utility monitors/maintains the entries under its purview in the credentials file and keeps consistent profile ids, so I keep that AWS_PROFILE setting in my ~/.Renviron.

I keep the default S3 Athena data staging bucket in an environment variable as well.
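For reference, the relevant ~/.Renviron entries might look something like this (both values are placeholders, not real identifiers):

AWS_PROFILE=your-awsaml-profile-name
AWS_S3_STAGING_DIR=s3://your-athena-staging-bucket/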

If you provide a default schema_name then you can list tables and fields, but queries need fully qualified database.table names (Amazon calls databases “schemas”).

Initial attempts to have this setup “just work” with dplyr 0.6.0 (the forthcoming EPIC release) were unsuccessful but I’ll poke at all this when I get time and likely write a small Athena package to help smooth over rougher areas.

Y’all likely figured all this out way before I did, but in the event someone else is looking for the information, it should be google-able now.

If you follow me on Twitter or monitor @Rapid7’s Community Blog you know I’ve been involved a bit in the WannaCry ransomworm triage.

One thing I’ve been doing is making charts of the hourly contribution to the Bitcoin addresses that the current/main attackers are using to accept ransom payments (which you really shouldn’t pay at this point, even if you are impacted, since it’s unlikely the attackers are still giving up keys; the odds of them getting cash out of the wallets without getting caught are pretty slim).

There’s a full-on CRAN-ified Rbitcoin package but I didn’t need the functionality in it (yet) to do the monitoring. I posted a hastily-crafted gist on Friday so folks could play along at home, but the code here is a bit more nuanced (and does more).

In the spirit of these R⁶ posts, the following is presented without further commentary apart from the interwoven code comments, with one exception: this method captures super-micro-payments that do not necessarily translate 1:1 to victim count (it’s well within ball-park estimates but not precise without introspecting each transaction).

library(jsonlite)
library(hrbrthemes)
library(tidyverse)

# the wallets accepting ransom payments

wallets <- c(
  "115p7UMMngoj1pMvkpHijcRdfJNXj6LrLn",
  "12t9YDPgwueZ9NyMgw519p7AA8isjr6SMw",
  "13AM4VW2dhxYgXeQepoHkHSQuy6NgaEb94"
)

# easy way to get each wallet info vs bringing in the Rbitcoin package

sprintf("https://blockchain.info/rawaddr/%s", wallets) %>%
  map(jsonlite::fromJSON) -> chains

# get the current USD conversion (tho the above has this, too)

curr_price <- jsonlite::fromJSON("https://blockchain.info/ticker")

# calculate some basic stats

tot_bc <- sum(map_dbl(chains, "total_received")) / 1e8 # satoshi -> BTC
tot_usd <- tot_bc * curr_price$USD$last
tot_xts <- sum(map_dbl(chains, "n_tx"))

# This needs to be modified once the counters go above 100 and also needs to
# account for rate limits in the blockchain.info API

paged <- which(map_dbl(chains, "n_tx") > 50)
if (length(paged) > 0) {
  sprintf("https://blockchain.info/rawaddr/%s?offset=50", wallets[paged]) %>%
    map(jsonlite::fromJSON) -> chains2
}

# We want hourly data across all transactions

map_df(chains, "txs") %>%
  { if (exists("chains2")) bind_rows(., map_df(chains2, "txs")) else . } %>% # only bind paged results if we fetched any
  mutate(xts = anytime::anytime(time),
         xts = as.POSIXct(format(xts, "%Y-%m-%d %H:00:00", tz = "GMT"), tz = "GMT")) %>%
  count(xts) -> xdf

# Plot it

ggplot(xdf, aes(xts, y = n)) +
  geom_col() +
  scale_y_comma(limits = c(0, max(xdf$n))) +
  labs(x = "Day/Time (GMT)", y = "# Transactions",
       title = "Bitcoin Ransom Payments-per-hour Since #WannaCry Ransomworm Launch",
       subtitle=sprintf("%s transactions to-date; %s total bitcoin; %s USD; Chart generated at: %s EDT",
                        scales::comma(tot_xts), tot_bc, scales::dollar(tot_usd), Sys.time())) +
  theme_ipsum_rc(grid="Y")

I hope all goes well with everyone as you try to ride out this ransomworm storm over the coming weeks. It will likely linger for quite a while, so make sure you patch!

Tagging this as #rstats-related since many R coders use Travis-CI to automate package builds (and other things). Security researcher Ivan Vyshnevskyi did some ++gd responsible disclosure to the Travis-CI folks letting them know they were leaking the contents of “secure” environment variables in the build logs.

The TL;DR on “secure” environment variables is that they let you store secrets — such as OAuth keys or API tokens — ostensibly “securely” (they have to be decrypted to be used, so someone/something has the keys to do that, which means it’s not really “secure”). At a minimum, though, the build system should not leak them in build logs. Except that it did…for a bit.

As mentioned, this flaw was reported and is now fixed. Regen your “secrets” and keep an eye on Travis security announcements moving forward.

Political machinations are a tad insane in the U.S. these days & I regularly hit up @ProPublica & @GovTrack sites (& sub to the GovTrack e-mail updates) as I try to be an informed citizen, especially since I’ve got a Senator and Representative who seem to be in the sway of ?.

I’ve always appreciated the ProPublica and GovTrack cartograms as they present a great deal of information in a compact space (especially the House versions). Something nudged me into starting an R package to let folks create them in R (mainly with ggplot2 but an htmlwidget version is planned), which I’ve dubbed voteogram.

With the voteogram package, you can:

  • pull ProPublica roll call vote data for the 101st Congress up through today (via roll_call())
  • plot ProPublica-esque Senate roll call vote cartograms
  • plot ProPublica-esque House roll call vote cartograms
  • plot GovTrack-esque House roll call vote cartograms

GovTrack uses — what I’ve seen @thosjleeper refer to as — a “parliamentary plot” for their version of the Senate roll call cartogram and sir Leeper already has that type of plot covered in ggparliament, so I’ve just focused on the other ones here.

Roll Call

You need data for these cartogram generation functions and you can specify your own populated data frame (the needed columns are in the manual pages for the cartogram plotters). However, you’ll likely want to plot existing data that others have tallied and ProPublica makes that super simple since each vote is in a standalone JSON file. All you have to do is specify whether you want the roll call vote for the house or senate, the Congress number (current one is 115), the session number (current one is 1) and the roll call vote number.

For example, we can see all the idiots (er, Representatives) who voted, recently, to kill people (er, repeal the ACA) with the following function call:

(h256 <- roll_call("house", 115, 1, 256))
## 115th Congress / Session: 1 / House Roll Call: 256 / May  4, 2017
## 
## American Health Care Act
## 
## Result: Passed

str(h256, max.level = 1)
## List of 29
##  $ vote_id              : chr "H_115_1_256"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 256
##  $ needed_to_pass       : int 216
##  $ date_of_vote         : chr "May  4, 2017"
##  $ time_of_vote         : chr "02:18 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "RECORDED VOTE"
##  $ question             : chr "On Passage"
##  $ description          : chr "American Health Care Act"
##  $ nyt_title            : chr "On Passage"
##  $ total_yes            : int 217
##  $ total_no             : int 213
##  $ total_not_voting     : int 1
##  $ gop_yes              : int 217
##  $ gop_no               : int 20
##  $ gop_not_voting       : int 1
##  $ dem_yes              : int 0
##  $ dem_no               : int 193
##  $ dem_not_voting       : int 0
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "No"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##  - attr(*, "class")= chr [1:2] "pprc" "list"

As you can see, it has a custom print function and the usable data (for cartographic needs) is in $votes. You can go to town with just that information, making bar charts or tracking individual Congress-critter votes.
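For example, a quick party/position tally is one line (assuming the $votes tibble carries party and position columns; check the package docs for the full column list):

dplyr::count(h256$votes, party, position)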

Do your best to cache this data as you retrieve it. ProPublica is a non-profit and the JSON files are on AWS. While there’s a certain amount of free bandwidth-per-month allotted by Amazon’s S3 service, it’s best to make sure you’re not tipping them over on any given month. Plus, the vote data doesn’t change once it’s recorded. Consider donating to them if you decide to always grab fresh copies.
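A minimal caching sketch (the helper and cache directory here are hypothetical, not part of the package):

# fetch a roll call once, then serve it from a local RDS cache thereafter
cached_roll_call <- function(chamber, congress, session, number, cache_dir = "rollcall-cache") {
  if (!dir.exists(cache_dir)) dir.create(cache_dir)
  fil <- file.path(cache_dir, sprintf("%s_%s_%s_%s.rds", chamber, congress, session, number))
  if (file.exists(fil)) return(readRDS(fil))
  res <- voteogram::roll_call(chamber, congress, session, number)
  saveRDS(res, fil)
  res
}

cached_roll_call("house", 115, 1, 256)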

There’s a fortify function for this object (it’s classed pprc) so you can pass it right into ggplot() for use or pipe it into a dplyr chain for aggregation & filtering.

House Rules

With the data in hand, we can make some cartograms (the real purpose of the package). I riffed off the ProPublica colors (and haven’t fully finished copying them yet as I need to search for 2 more categories of Independent voting colors) but you can replace them with anything you want. Just reset the scale and use the names in the exposed color value vectors.

There’s also a theme_voteogram() which is designed to augment any base theme (like hrbrthemes::theme_ipsum_rc()) (it’s much like ggthemes::theme_map()).

Here’s the ProPublica view for that particular vote:

house_carto(h256) +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

The house_carto() function defaults to the ProPublica cartogram, but you can easily change that:

house_carto(h256, "gt") +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

Senate Drools

Again, the senate_carto() function only has the ProPublica-esque cartogram available and works pretty much the same way after getting the Senate vote data:

sen <- roll_call("senate", 115, 1, 110)

senate_carto(sen) +
  labs(title="Senate Vote 110 - Invokes Cloture on Neil Gorsuch Nomination") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

FIN

There’s a bit of work left to do in the package (including an htmlwidget version). You’re invited to file PRs or Issues as you are so moved.

I caught a glimpse of a tweet by @dataandme on Friday:

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics. The code in this post likely worked at the time it was posted and accomplished the desired goal (which means it was ultimately decent code). Said practice will ultimately help both you and others.

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule) which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to figure out how many pages we need to scrape from information in the first set of results.

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately calls (it appears this site has undergone some changes since the blog post and it’s unlikely the code in the post actually works now).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {
  
  if (!is.null(pb)) pb$tick()$print()
  
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php", 
       body = list(action = "star_property_classified_list_change_ajax", 
                   tab = "Most Relevance", 
                   page = as.integer(i), location = "Mont Kiara", 
                   category = "", listing = "For Sale", 
                   price = "", keywords = "Mont Kiara, Kuala Lumpur", 
                   filter_id = "17", filter_type = "Location", 
                   furnishing = "", builtup = "", 
                   tenure = "", view = "list", 
                   map = "on", blurb = "0"), 
       encode = "form") -> res
  
  stop_for_status(res)
  
  res <- content(res, as="parsed") 
  
  Sys.sleep(sample(seq(0,2,0.5), 1))
  
  res
  
}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser Developer Tools view. You can either transcribe it by hand or use the curlconverter package (which is temporarily off CRAN so you’ll need to get it from github) to auto-convert it to an httr::VERB request.

We also add a parameter (default to NULL) to support the use of a progress bar (so we can see what’s going on). If we pass in a populated dplyr progress bar, this will tick it down for us.

Now, we can use that to get the total number of listings.

get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>% 
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>% 
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>% 
  as.numeric() -> classified_ct

total_pages <- 1 + (classified_ct %/% 20)

We’ll setup another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  sel <- "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)"
  data_frame(
    link = html_nodes(pg, sel) %>% html_attr("href"),
    description = html_nodes(pg, sel) %>% html_text(trim = TRUE)
  )
}

Rather than chain calls to html_nodes() we take advantage of well-formed CSS selectors (which ultimately get auto-translated to XPath strings). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity.
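(If you’re curious about that translation, rvest relies on the selectr package for it, so you can peek at the generated XPath directly:)

selectr::css_to_xpath("div#list-content > div.media * h4.media-heading > a:nth-of-type(1)")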

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

  • add a delay when you sequentially scrape pages from a site
  • abstract out common operations into functions
  • take advantage of purrr functions (or built-in *apply functions) to avoid for loops

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for ToS/ToC on the site suggested that scraping, automated grabbing or use of the underlying data in bulk was prohibited. Many sites have such restrictions — like IMDB (I mention that as it’s been used a lot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for such ToS violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided just on those grounds. When I come across sites I need information from that have restrictive ToS, I contact the site owner (when I can find them) and ask them for permission, and I have only been refused a small handful of times. Given those recent legal actions, it’s also better to be safe than sorry.

Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. “minimal expository, keen focus”) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this:

to:

Does pandoc work?
=================

Simple document with **bold** and *italics*.

This is definitely a job that pandoc can handle.

pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.

Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and, would not be possible without — Haskell.

Doing the aforementioned conversion from docx to markdown is super-simple from R:

rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")

Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at pandoc.org to see the power available at your command.
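And, since the whole point is avoiding manual work, here’s a small sketch for batch-converting a folder of docx files (the "docs" directory is hypothetical; each .md lands next to its source file since pandoc_convert() runs in the input’s directory by default):

library(purrr)

fils <- list.files("docs", pattern = "\\.docx$", full.names = TRUE)
walk(fils, ~rmarkdown::pandoc_convert(.x, to = "markdown",
                                      output = sub("\\.docx$", ".md", basename(.x))))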

Just One More Thing

This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist :-)

There’s a neat, not-on-CRAN package by François Keck called subtools (https://github.com/fkeck/subtools) which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.

There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is in Python and I’ve never had time to fully replicate it in R.

Here’s a snippet (sans expository) that can get you started pulling subtitles into R and tidytext. I would have written scraper code but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write the code. So, I grabbed the first season of “The Flash” and used the Bing sentiment lexicon from tidytext to see how the season looked.

The overall scoring for a given episode is naive and can definitely be improved upon (there’s a tiny nudge in that direction after the code below).

Definitely drop a link to anything you create in the comments!

# devtools::install_github("fkeck/subtools")

library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

pb <- progress_estimated(length(fils))

map_df(1:length(fils), ~{

  pb$tick()$print()

  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)

}) %>% as_tibble() -> season_sentiments


count(season_sentiments, ep, sentiment) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1)))
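As a tiny nudge past the naive balance above, here’s a sketch of a mean AFINN score per episode. (Note: the AFINN column was named score in the tidytext release current when this was written; it’s value in later releases, so adjust accordingly.)

# rough per-episode AFINN mean, most negative episodes first
group_by(season_sentiments, ep) %>%
  summarise(mean_afinn = mean(score)) %>%
  arrange(mean_afinn)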