I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad… Continue reading
R⁶ — Exploring macOS Applications with codesign, Gatekeeper & R
(General reminder abt “R⁶” posts in that they are heavy on code-examples, minimal on expository. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further expository if requested in a comment, so don’t hesitate to ask if… Continue reading
R⁶ — Reticulating Parquet Files
The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet files directly from R… Continue reading
Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R
One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats // Ethics in Web Scraping https://t.co/y5YxvzB8Fd — boB Rudis (@hrbrmstr) July 26, 2017 If you load that up that… Continue reading
Reading PCAP Files with Apache Drill and the sergeant R Package
It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hie, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also… Continue reading
R⁶ — General (Attys) Distributions
Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp —… Continue reading
Ten-HUT! The Apache Drill R interface package — sergeant — is now on CRAN
I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon. sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t… Continue reading
R⁶ — Disproving Approval
I couldn’t let this stand unchallenged: The new Rasmussen Poll, one of the most accurate in the 2016 Election, just out with a Trump 50% Approval Rating.That's higher than O's #'s! — Donald J. Trump (@realDonaldTrump) June 18, 2017 Rasmussen makes their Presidential polling data available for both ? & O. Why not compare their… Continue reading