Revisiting Readability With RStudio

I’ve blogged about my in-development R package hgr before, and it’s slowly getting to a CRAN release. There are two new features to it that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order. New S3 print() Method Objects created… Continue reading

Teasing Out Top Daily Topics with GDELT’s Television Explorer

Earlier this year, the GDELT Project released their Television Explorer, which enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should… Continue reading

New CRAN Package Announcement: splashr

I’m pleased to announce that splashr is now on CRAN. (That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")). The package is an R interface to the Splash JavaScript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the… Continue reading

Rpad Domain Repurposed To Deliver Creepy (and potentially malicious) Content

I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney: [Please RT] #RStats folks, @nj_tierney & I need your help for {naniar}! When does R silently drop/omit NA? https://t.co/V5elyGcG8Z pic.twitter.com/VScLXFCl2n — Colin… Continue reading

Reticulating Readability

I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for it, and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document, and it usually does a good job, but there are some pages… Continue reading