Author Archives: hrbrmstr

(A general reminder about “R⁶” posts: they are heavy on code examples and minimal on exposition. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further explanation if requested in a comment, so don’t hesitate to ask if something is confusing.)

I had to check something on the macOS systems across the abode today and — on a lark — decided to do all the “shell” scripting in R vs bash for a change. After performing the tasks, it occurred to me that not all R users on macOS realize there are hidden gems of information spread across the “boring” parts of the filesystem in SQLite databases. So, I put together a small example that:

  • identifies all the top-level apps in /Applications
  • extracts code signing information from them into a tibble
  • grabs the Gatekeeper whitelist database
  • uses an internal SQLite transformation function inside a dplyr chain
  • merges the info and gives you a basis to explore your apps
  • shows off the wicked cool processx package by @gaborcsardi that you should be using in place of system() / system2()

A quick note about this Gatekeeper database: if an app is not already recognized by macOS and you allow it to run, your local system is updated to “whitelist” the app so you don’t get the notification in the future. Apple maintains a large signature list that gets updated on your system throughout the year, and your own list is merged with it.

A second quick note (I guess I’m doing more exposition than promised :-) about QUOTE(): in certain dplyr / dbplyr contexts, what look like local function calls are actually passed over to the SQL side rather than evaluated locally. Even though quote() is a base R function, the lowercase version would still be sent over to the SQL side rather than eval’d locally. However, to avoid confusing others, I try to uppercase conflicts like this when they occur.
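
If you ever want to confirm what actually gets shipped to the database, dbplyr’s show_query() will print the generated SQL. Here’s a tiny sketch against a throwaway in-memory SQLite table (the table and column names are purely illustrative):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "blobs", data.frame(current = "not really a blob"))

tbl(con, "blobs") %>% 
  mutate(current = QUOTE(current)) %>% 
  show_query() # QUOTE() has no dbplyr translation, so it is passed through to SQLite verbatim

DBI::dbDisconnect(con)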

library(processx)
library(stringi)
library(docxtractr) # install_github("hrbrmstr/docxtractr")
library(tidyverse)

list.files("/Applications", pattern = "app$", full.names = TRUE) %>% 
  map_df(~{
    message(.x)
    res <- run("codesign", args = c("-dvvv", .x), error_on_status = FALSE)
    if (any(grepl("not signed at all", res$stderr))) {
      list(Executable = .x)
    } else {
      stri_split_lines(res$stderr)[[1]] %>% 
        keep(~grepl("=", .)) %>% 
        stri_split_fixed("=", 2, simplify = TRUE) -> res
      as.list(set_names(res[,2], res[,1]))
    }
  }) %>% 
  mcga() %>% 
  mutate(short_name = stri_replace_last_fixed(basename(executable), ".app", "")) -> my_apps

glimpse(my_apps)
## Observations: 102
## Variables: 20
## $ executable                  <chr> "/Applications/1Password 6.app/Con...
## $ identifier                  <chr> "com.agilebits.onepassword4", "com...
## $ format                      <chr> "app bundle with Mach-O thin (x86_...
## $ codedirectory_v             <chr> "20200 size=29594 flags=0x0(none) ...
## $ hash_type                   <chr> "sha256 size=32", "sha256 size=32"...
## $ candidatecdhash_sha1        <chr> "e21ac2a66473feec6276b448fc518678d...
## $ candidatecdhash_sha256      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ hash_choices                <chr> "sha1,sha256", "sha1,sha256", "sha...
## $ cdhash                      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ signature_size              <chr> "8915", "8936", "4610", "8528", "4...
## $ authority                   <chr> "Apple Root CA", "Apple Root CA", ...
## $ timestamp                   <chr> "Jul 17, 2017, 10:18:00 AM", "Jul ...
## $ info_plist_entries          <chr> "32", "31", "30", "17", "30", "18"...
## $ teamidentifier              <chr> "2BUA8C4S2C", "XZZXE9SED4", "94KV3...
## $ sealed_resources_version    <chr> "2 rules=12 files=2440", "2 rules=...
## $ internal_requirements_count <chr> "1 size=220", "1 size=192", "1 siz...
## $ signed_time                 <chr> NA, NA, "Jul 13, 2017, 6:40:13 PM"...
## $ library_validation_warning  <chr> NA, NA, NA, "OS X SDK version befo...
## $ platform_identifier         <chr> NA, NA, NA, NA, "2", NA, NA, "2", ...
## $ short_name                  <chr> "1Password 6", "Alfred 3", "Amazon...

# this is the macOS Gatekeeper whitelist db
db <- src_sqlite("/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db")

db
## src:  sqlite 3.19.3 [/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db]
## tbls: conditions, merged, whitelist
whitelist <- tbl(db, "whitelist")

# they are binary blobs, but we need them as text, so use SQLite's "QUOTE()" function
mutate(whitelist, current = QUOTE(current)) %>% 
  select(current) %>% 
  collect() %>% 
  mutate(
    cdhash = stri_replace_first_fixed(current, "X'", "") %>% 
      stri_replace_last_fixed("'", "") %>% 
      stri_trans_tolower()
  ) -> gatekeeper_whitelist_hashes

my_apps <- left_join(my_apps, gatekeeper_whitelist_hashes)

# Unsigned apps
filter(my_apps, is.na(cdhash)) %>% 
  select(short_name)

# you can see your own output here ;-)

# App organization
select(my_apps, identifier) %>% 
  mutate(
    identifier = stri_split_fixed(identifier, ".", simplify = TRUE)[,2],
    identifier = ifelse(identifier == "", "UNSPECIFIED", identifier)
  ) %>% 
  count(identifier, sort=TRUE)
## # A tibble: 47 x 2
##        identifier     n
##             <chr> <int>
##  1          apple    37
##  2    UNSPECIFIED    10
##  3      microsoft     6
##  4  eclecticlight     3
##  5         google     3
##  6         amazon     2
##  7      agilebits     1
##  8 antonycourtney     1
##  9       appscape     1
## 10   audacityteam     1
## # ... with 37 more rows


# Which apps have code-signing validation warnings
filter(my_apps, !is.na(library_validation_warning)) %>% 
  select(short_name, library_validation_warning)
## # A tibble: 7 x 2
##        short_name
##             <chr>
## 1    Amazon Music
## 2        Audacity
## 3         firefox
## 4         RStudio
## 5             VLC
## 6 Webcam Settings
## 7     WordService
## # ... with 1 more variables: library_validation_warning <chr>

NOTE: One could easily extend this example to look for apps across the filesystem.
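
A minimal sketch of that extension, using Spotlight via mdfind so you don’t have to walk the filesystem yourself (the query string is the standard UTI for application bundles — treat it as an assumption and verify on your system):

library(processx)
library(stringi)
library(purrr)

# ask Spotlight for every application bundle it knows about, not just /Applications
res <- run("mdfind", args = "kMDItemContentType == 'com.apple.application-bundle'")

all_apps <- keep(stri_split_lines(res$stdout)[[1]], nzchar)

# feed all_apps into the same map_df() pipeline used above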

The reticulate package provides a very clean & concise interface bridge between R and Python, which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can natively write parquet files.

Now, you can create parquet files through R with Apache Drill — and I’ll provide another example of that here — but you may need to generate such files without the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")
dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also includes an index column, since Python DataFrames need one (ugh). On the other hand, small-ish data frames end up in a single file, whereas the Drill-created ones land in a directory with an additional CRC file (and are much smaller by default). NOTE: You can pass preserve_index=False to the call to Table.from_pandas to get rid of that icky index.
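
For the curious, dropping that index from reticulate is just a named argument on the call we already made above (re-using the pa / pq / mtcars_df objects from the earlier chunk):

mtcars_tab <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)
pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))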

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>% 
  pd$DataFrame$from_dict() %>% 
  pa$Table$from_pandas() %>% 
  pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed 
##   1.285   0.108   1.398 

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.

UPDATE (2018-01-25)

APIs change, and while the above still works, there’s a slightly simpler way now:

library(reticulate)

pd <- import("pandas", "pd")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)

mtcars_df$to_parquet(path.expand("~/Data/mtcars_python.parquet"), "pyarrow")

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:

If you load up that tweet and follow the thread, you’ll see a really good question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls.

The answer is a bit nuanced as there are some written and unwritten “rules” for those who would seek to scrape web site content. For the sake of brevity in this post, we’ll only focus on “best practices” (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that “Step 0” must be to validate that the site’s terms & conditions or terms of service allow you to scrape & use data from said site.

Robot Roll Call

The absolute first thing you should do before scraping a site is check out its robots.txt file. What’s that? Well, I’ll let you read about it first from the vignette of the package we’re going to use to work with it.

Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:

library(tidyverse)
library(rvest)

robotstxt::robotstxt("seobook.com")$crawl_delay %>% 
  tbl_df()
## # A tibble: 114 x 3
##          field       useragent value
##          <chr>           <chr> <chr>
##  1 Crawl-delay               *    10
##  2 Crawl-delay        asterias    10
##  3 Crawl-delay BackDoorBot/1.0    10
##  4 Crawl-delay       BlackHole    10
##  5 Crawl-delay    BlowFish/1.0    10
##  6 Crawl-delay         BotALot    10
##  7 Crawl-delay   BuiltBotTough    10
##  8 Crawl-delay    Bullseye/1.0    10
##  9 Crawl-delay   BunnySlippers    10
## 10 Crawl-delay       Cegbfeieh    10
## # ... with 104 more rows

I chose that site since it has many entries for the Crawl-delay field, which defines the number of seconds a given site would like your crawler to wait between scrapes. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to optimize the experience a bit (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):

.get_delay <- function(domain) {
  
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  
  if (length(cd_tmp) > 0) {
    star <- dplyr::filter(cd_tmp, useragent=="*")
    if (nrow(star) == 0) star <- cd_tmp[1,]
    as.numeric(star$value[1])
  } else {
    10L
  }

}

get_delay <- memoise::memoise(.get_delay)

The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both .get_delay() and the get_delay() functions, and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache (in-memory, so only in the active R session) the robots.txt file it retrieved (if it retrieved one) we don’t want to do the filter/check/default/return all the time since it just wastes CPU clock cycles. The memoise() operation will check which parameter was sent and return the value that was computed vs going through that logic again. We can validate that on the seobook.com domain:

get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10

get_delay("seobook.com")
## [1] 10

You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
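
Putting that together, a minimal polite-scraping pattern looks something like this (the URLs are placeholders; swap in whatever you actually need to fetch):

library(httr)
library(rvest)
library(tidyverse)

urls <- c(
  "https://example.com/widgets/page-1",
  "https://example.com/widgets/page-2"
)

pages <- map(urls, ~{
  Sys.sleep(get_delay(parse_url(.x)$hostname)) # be kind between requests
  read_html(.x)
})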

Not So Fast…

Because you’re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.

I’m glad you asked!

I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or set to a ridiculous value. I crunched through the data and made saner bins for the values to produce the following:

Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.
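
If you want to do a rough version of that crunching on your own extract, the binning itself is straightforward. This sketch assumes a hypothetical cc_delays data frame with one value column of raw Crawl-delay strings (it is not the original analysis code):

cc_delays %>% 
  mutate(value = suppressWarnings(as.numeric(value))) %>% # non-numeric "gotcha" entries become NA
  filter(!is.na(value), value >= 0, value <= 86400) %>%   # drop absurd (> 1 day) values
  mutate(bin = case_when(
    value <= 10   ~ "10s or less",
    value <= 60   ~ "11-60s",
    value <= 3600 ~ "1-60 min",
    TRUE          ~ "over an hour"
  )) %>% 
  count(bin, sort = TRUE)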

FIN

So, remember:

  • check site ToS/T&C before scraping
  • check robots.txt before scraping (in general and for Crawl-Delay)
  • contact the site owner if you plan on doing a large amount of scraping
  • introduce some delay between page scrapes (even if the site does not have a specific Crawl-delay entry), using the insights gained from the Common Crawl analysis to inform your decision

I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.

(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. It further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual PCAP files — and entire directories full of them — from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that PCAP is an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system, go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it goes the larger it gets.

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark, and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with it effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:

library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>% 
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>% 
  count(dst_ip, sort=TRUE) %>% 
  collect() -> dst_ips

# drop internal (RFC 1918) destinations, then add AS/org info via Team Cymru lookups
filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and, I setup a directory that defaults to the pcap type. But after that, it just works.

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill proper). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocol & connection information, or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away into a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom algorithm

library(V8)
library(rvest)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

pg <- read_html("http://mattstiles.org/dailygraphics/graphics/attorney-general-tenure-20172517/child.html?initialWidth=840&childId=pym_0&parentTitle=Chart%3A%20If%20Ousted%2C%20Jeff%20Sessions%20Would%20Have%20a%20Historically%20Short%20Tenure%20%7C%20The%20Daily%20Viz&parentUrl=http%3A%2F%2Fthedailyviz.com%2F2017%2F07%2F25%2Fchart-if-ousted-jeff-sessions-would-have-a-historically-short-tenure%2F")

ctx <- v8()
ctx$eval(html_nodes(pg, xpath=".//script[contains(., 'DATA')]") %>% html_text())

ctx$get("DATA") %>% 
  as_tibble() %>% 
  readr::type_convert() %>% 
  mutate(party = ifelse(is.na(party), "Other", party)) %>% 
  mutate(party = fct_lump(party)) %>% 
  mutate(color1 = case_when(
    party == "Democratic" ~ "#313695",
    party == "Republican" ~ "#a50026",
    party == "Other" ~ "#4d4d4d")
  ) %>% 
  mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags

ggplot() + 
  geom_quasirandom(data = ags, aes(party, amt, color = color1)) +
  geom_quasirandom(data = ags, aes(party, amt, color = color2), 
                   fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) +
  geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), 
            nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) +
  scale_color_identity() +
  scale_y_comma(limits = c(0, 4200)) +
  labs(x = "Party", y = "Tenure (days)", 
       title = "U.S. Attorneys General",
       subtitle = "Distribution of tenure in office, by days & party: 1789-2017",
       caption = "Source data/idea: Matt Stiles <bit.ly/2vXAHTM>") +
  theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill, and if you have Drill UDFs that don’t work “out of the box” with sergeant’s dplyr interface, file an issue and I’ll make a special one for it in the package.
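
If you’re new to the package, here’s a minimal taste of the dplyr and query interfaces (this assumes a Drill instance running locally and uses the cp.`employee.json` sample data source that ships with Drill):

library(sergeant)
library(tidyverse)

# dplyr/dbplyr interface
db <- src_drill("localhost")

tbl(db, "cp.`employee.json`") %>% 
  count(position_title, sort = TRUE)

# straight SQL via the query interface
dc <- drill_connection("localhost")
drill_query(dc, "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5")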

I’ve written about Drill on the blog before, so check out those posts for some history and stay tuned for more examples. The README should get you started using sergeant and/or Drill (if you aren’t running Drill now, take a look and you’ll likely get hooked).

I’d like to take a moment to call out special thanks to Edward Visel for bootstrapping the dbplyr update to sergeant when the dplyr/dbplyr interfaces split. It saved me loads of time and really helped the progress of this package move faster towards a CRAN release.

I couldn’t let this stand unchallenged:

Rasmussen makes their Presidential polling data available for both Trump & Obama. Why not compare their ratings from day 1 in office (skipping days that Rasmussen doesn’t poll)?


library(hrbrthemes)
library(rvest)
library(tidyverse)

list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>%
  map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%
  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval Rating",
       title="Presidential approval ratings from day 1 in office",
       subtitle="For fairness, data was taken solely from Trump's favorite polling site (Rasmussen)",
       caption="Data Source: <rasmussenreports.com>\nCode: <https://gist.github.com/hrbrmstr/a7310e1b64d0797401d01d0c6195bd8b>") +
  theme_ipsum_rc(grid="XY", base_size = 16) +
  theme(legend.direction = "horizontal") +
  theme(legend.position=c(0.8, 1.05))

I’ll make a new post occasionally throughout Trump’s term.

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.
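
(For completeness, the ndjson route really is a one-liner; the path below is a placeholder for wherever you unpacked the Yelp archive:)

library(ndjson)
library(tidyverse)

# flatten one of the Yelp NDJSON files into a wide, rectangular structure
reviews <- stream_in("~/Data/yelp/review.json")

glimpse(reviews)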

The main reasons for replicating their Yelp example were to build a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL-to-R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.

The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.