I’ve blogged about my in-development R package hgr before and it’s slowly getting to a CRAN release. There are two new features to it that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order. New S3 print() Method Objects created… Continue reading
Increasing Output Buffer Size in Apache Drill UDFs Custom (Simple) Functions
Putting this here to make it easier for others who try to Google this topic to find it w/o having to find and tediously search through other UDFs (user-defined functions). I was/am making a custom UDF for base64 decoding/encoding and ran into: It’s incredibly easy to “fix” (and, if my Java weren’t so rusty I’d… Continue reading
Teasing Out Top Daily Topics with GDELT’s Television Explorer
Earlier this year, the GDELT Project released their Television Explorer that enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should… Continue reading
Readability Redux
I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text. I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt… Continue reading
New CRAN Package Announcement: splashr
I’m pleased to announce that splashr is now on CRAN. (That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")). The package is an R interface to the Splash JavaScript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the… Continue reading
Rpad Domain Repurposed To Deliver Creepy (and potentially malicious) Content
I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney: [Please RT] #RStats folks, @nj_tierney & I need your help for {naniar}! When does R silently drop/omit NA? https://t.co/V5elyGcG8Z pic.twitter.com/VScLXFCl2n — Colin… Continue reading
Unbottling “.msg” Files in R
There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format which helped me throw together… Continue reading
Reticulating Readability
I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for it, and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages… Continue reading