{"id":12804,"date":"2020-08-07T07:42:25","date_gmt":"2020-08-07T12:42:25","guid":{"rendered":"https:\/\/rud.is\/b\/?p=12804"},"modified":"2020-08-08T12:35:24","modified_gmt":"2020-08-08T17:35:24","slug":"quick-hit-comparison-of-whole-file-reading-methods","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/","title":{"rendered":"Quick Hit: Comparison of &#8220;Whole File Reading&#8221; Methods"},"content":{"rendered":"<p>(This is part 1 of <code>n<\/code> posts using this same data; <code>n<\/code> will likely be 2-3, and the posts are more around optimization than anything else.)<\/p>\n<p>I recently had to analyze HTTP response headers (generated by a <code>HEAD<\/code> request) from around 74,000 sites (each response stored in a text file). They look like this:<\/p>\n<pre><code class=\"language-http\">HTTP\/1.1 200 OK\nDate: Mon, 08 Jun 2020 14:40:45 GMT\nServer: Apache\nLast-Modified: Sun, 26 Apr 2020 00:06:47 GMT\nETag: \"ace-ec1a0-5a4265fd413c0\"\nAccept-Ranges: bytes\nContent-Length: 967072\nX-Frame-Options: SAMEORIGIN\nContent-Type: application\/x-msdownload\n<\/code><\/pre>\n<p>I do this quite a bit in R when we create new studies at <a href=\"https:\/\/opendata.rapid7.com\/\">work<\/a>, but I&#8217;m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.<\/p>\n<p>Reading in a bunch of files (each one into a string) is fairly straightforward in R since <code>readChar()<\/code> will do the work of reading and we just wrap that in an iterator:<\/p>\n<pre><code class=\"language-r\">length(fils)\n## [1] 73514 \n\n# check file size distribution\nsummary(\n  vapply(\n    X = fils,\n    FUN = file.size,\n    FUN.VALUE = numeric(1),\n    USE.NAMES = FALSE\n  )\n)\n## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n## 19.0   266.0   297.0   294.8   330.0  1330.0 \n\n# they're all super small\n\nsystem.time(\n  vapply(\n    X = fils, \n    FUN = function(.f) readChar(.f, file.size(.f)), \n    FUN.VALUE = character(1), \n    USE.NAMES = FALSE\n  ) -&gt; tmp \n)\n##  user  system elapsed \n## 2.754   1.716   4.475 \n<\/code><\/pre>\n<p>NOTE: You can use <code>lapply()<\/code> or <code>sapply()<\/code> to equal effect as they all come in around 5 seconds on a modern SSD-backed system.<\/p>\n<p>Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but <em>can we do better<\/em>? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:<\/p>\n<ul>\n<li><code>base::readChar()<\/code><\/li>\n<li><code>readr::read_file()<\/code><\/li>\n<li><code>stringi::stri_read_raw()<\/code> (+ <code>rawToChar()<\/code>)<\/li>\n<\/ul>\n<p>Do any of them beat {base}? Let&#8217;s see (using the largest of the files):<\/p>\n<pre><code class=\"language-r\">library(stringi)\nlibrary(readr)\nlibrary(microbenchmark)\n\nlargest &lt;- fils[which.max(sapply(fils, file.size))]\n\nfile.size(largest)\n## [1] 1330\n\nmicrobenchmark(\n  base = readChar(largest, file.size(largest)),\n  readr = read_file(largest),\n  stringi = rawToChar(stri_read_raw(largest)),\n  times = 1000,\n  control = list(warmup = 100)\n)\n## Unit: microseconds\n##     expr     min       lq      mean   median       uq     max neval\n##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000\n##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000\n##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000\n<\/code><\/pre>\n<p>I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. 
({readr} has quite a bit of functionality baked into it, for good reasons, which doesn&#8217;t help it win any performance contests).<\/p>\n<p>I still felt there had to be an even faster method, especially since I knew that the files all had HTTP response headers and that every one of them could easily be read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like &#8220;sands&#8221; some corners, really). We&#8217;ll do that right in R via {Rcpp} in this function (annotated in C++ code comments):<\/p>\n<pre><code class=\"language-r\">library(Rcpp)\n\ncppFunction(code = '\nString cpp_read_file(std::string fil) {\n\n  \/\/ our input stream\n  std::ifstream in(fil, std::ios::in | std::ios::binary);\n\n  if (in) { \/\/ we can work with the file\n\n  #ifdef _WIN32\n    struct _stati64 st; \/\/ gosh i hate windows\n    _stati64(fil.c_str(), &amp;st); \/\/ this should work but I did not test it\n  #else\n    struct stat st;\n    stat(fil.c_str(), &amp;st);\n  #endif\n\n    std::string out(st.st_size, 0); \/\/ make a string buffer to hold the data\n\n    in.seekg(0, std::ios::beg); \/\/ ensure we are at the beginning\n    in.read(&amp;out[0], st.st_size); \/\/ read in the file\n    in.close();\n\n    return(out);\n\n  } else {\n    return(NA_STRING); \/\/ file missing or other errors returns NA\n  }\n\n}\n', includes = c(\n  \"#include &lt;fstream&gt;\",\n  \"#include &lt;string&gt;\",\n  \"#include &lt;sys\/stat.h&gt;\"\n))\n<\/code><\/pre>\n<p>Is that going to be faster?<\/p>\n<pre><code class=\"language-r\">microbenchmark(\n  base = readChar(largest, file.size(largest)),\n  readr = read_file(largest),\n  stringi = rawToChar(stri_read_raw(largest)),\n  rcpp = cpp_read_file(largest),\n  times = 1000,\n  control = list(warmup = 100)\n)\n## Unit: microseconds\n##     expr     min       lq      mean   median       uq     max neval\n##     base  80.500  91.6910  96.82752  
94.3475 100.6945 295.025  1000\n##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000\n##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000\n##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000\n<\/code><\/pre>\n<p>It sure looks like it, but let&#8217;s put it to the test:<\/p>\n<pre><code class=\"language-r\">system.time(\n  vapply(\n    X = fils, \n    FUN = cpp_read_file, \n    FUN.VALUE = character(1), \n    USE.NAMES = FALSE\n  ) -&gt; tmp \n)\n##  user  system elapsed \n## 0.446   1.244   1.693 \n<\/code><\/pre>\n<p>I&#8217;ll take a two-second wait over a five-second wait any day!<\/p>\n<h3>FIN<\/h3>\n<p>I have a few more cases coming up where there will be 3-5x the number of (similar) files that I&#8217;ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.<\/p>\n<p>The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.<\/p>\n<p>If I missed your favorite file slurping function, drop a note in the comments and I&#8217;ll update the post with new benchmarks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.) I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). 
They look like this: HTTP\/1.1 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_post_was_ever_published":false},"categories":[91],"tags":[],"class_list":["post-12804","post","type-post","status-publish","format-standard","hentry","category-r"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Quick Hit: Comparison of &quot;Whole File Reading&quot; Methods - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Quick Hit: Comparison of &quot;Whole File Reading&quot; Methods - rud.is\" \/>\n<meta property=\"og:description\" content=\"(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.) I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). 
They look like this: HTTP\/1.1 [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2020-08-07T12:42:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-08-08T17:35:24+00:00\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Quick Hit: Comparison of &#8220;Whole File Reading&#8221; 
Methods\",\"datePublished\":\"2020-08-07T12:42:25+00:00\",\"dateModified\":\"2020-08-08T17:35:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/\"},\"wordCount\":489,\"commentCount\":7,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"articleSection\":[\"R\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/\",\"name\":\"Quick Hit: Comparison of \\\"Whole File Reading\\\" Methods - rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"datePublished\":\"2020-08-07T12:42:25+00:00\",\"dateModified\":\"2020-08-08T17:35:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2020\\\/08\\\/07\\\/quick-hit-comparison-of-whole-file-reading-methods\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Quick Hit: Comparison of &#8220;Whole File Reading&#8221; 
Methods\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Quick Hit: Comparison of \"Whole File Reading\" Methods - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/","og_locale":"en_US","og_type":"article","og_title":"Quick Hit: Comparison of \"Whole File Reading\" Methods - rud.is","og_description":"(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.) I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this: HTTP\/1.1 [&hellip;]","og_url":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/","og_site_name":"rud.is","article_published_time":"2020-08-07T12:42:25+00:00","article_modified_time":"2020-08-08T17:35:24+00:00","author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Quick Hit: Comparison of &#8220;Whole File Reading&#8221; Methods","datePublished":"2020-08-07T12:42:25+00:00","dateModified":"2020-08-08T17:35:24+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/"},"wordCount":489,"commentCount":7,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"articleSection":["R"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/","url":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/","name":"Quick Hit: Comparison of \"Whole File Reading\" Methods - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"datePublished":"2020-08-07T12:42:25+00:00","dateModified":"2020-08-08T17:35:24+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2020\/08\/07\/quick-hit-comparison-of-whole-file-reading-methods\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Quick Hit: Comparison of &#8220;Whole File Reading&#8221; Methods"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":
"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p23idr-3kw","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":12811,"url":"https:\/\/rud.is\/b\/2020\/08\/08\/quick-hit-speeding-up-data-frame-creation\/","url_meta":{"origin":12804,"position":0},"title":"Quick Hit: Speeding Up Data Frame Creation","author":"hrbrmstr","date":"2020-08-08","format":false,"excerpt":"(This is part 2 of n \"quick hit\" posts, each walking through some approaches to speeding up components of an iterative operation. Go here for part 1). Thanks to the aforementioned previous post, we now have a super fast way of reading individual text files containing HTTP headers from HEAD\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6226,"url":"https:\/\/rud.is\/b\/2017\/09\/09\/teasing-out-top-daily-topics-with-gdelts-television-explorer\/","url_meta":{"origin":12804,"position":1},"title":"Teasing Out Top Daily Topics with GDELT&#8217;s Television Explorer","author":"hrbrmstr","date":"2017-09-09","format":false,"excerpt":"Earlier this year, the GDELT Project released their Television Explorer that enabled API access to closed-caption tedt from television news broadcasts. 
They've done an incredible job expanding and stabilizing the API and just recently released \"top trending tables\" which summarise what the \"top\" topics and phrases are across news stations\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom.png?fit=1173%2C1200&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom.png?fit=1173%2C1200&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom.png?fit=1173%2C1200&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom.png?fit=1173%2C1200&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom.png?fit=1173%2C1200&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":4611,"url":"https:\/\/rud.is\/b\/2016\/08\/06\/quicklookr-a-macos-quicklook-plugin-for-r-data-files\/","url_meta":{"origin":12804,"position":2},"title":"QuickLookR &#8211; A macOS QuickLook plugin for R Data files","author":"hrbrmstr","date":"2016-08-06","format":false,"excerpt":"I had tried to convert my data-saving workflows to [`feather`](https:\/\/github.com\/wesm\/feather\/tree\/master\/R) but there have been [issues](https:\/\/github.com\/wesm\/feather\/issues\/155) with it supporting large files (that seem to be near resolution), so I've been continuing to use R Data files for local saving of processed\/cleaned data. 
I make _many_ of these files and sometimes I\u2026","rel":"","context":"In &quot;Objective-C&quot;","block_context":{"text":"Objective-C","link":"https:\/\/rud.is\/b\/category\/objective-c\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10848,"url":"https:\/\/rud.is\/b\/2018\/05\/30\/os-secrets-exposed-extracting-extended-file-attributes-and-exploring-hidden-download-urls-with-the-xattrs-package\/","url_meta":{"origin":12804,"position":3},"title":"OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package","author":"hrbrmstr","date":"2018-05-30","format":false,"excerpt":"Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6127,"url":"https:\/\/rud.is\/b\/2017\/07\/27\/reading-pcap-files-with-apache-drill-and-the-sergeant-r-package\/","url_meta":{"origin":12804,"position":4},"title":"Reading PCAP Files with Apache Drill and the sergeant R Package","author":"hrbrmstr","date":"2017-07-27","format":false,"excerpt":"It's no secret that I'm a fan of Apache Drill. 
One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hie, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12790,"url":"https:\/\/rud.is\/b\/2020\/07\/10\/a-look-at-pan-os-versions-with-a-bit-of-r\/","url_meta":{"origin":12804,"position":5},"title":"A Look at PAN-OS Versions with a Bit of R","author":"hrbrmstr","date":"2020-07-10","format":false,"excerpt":"The incredibly talented folks over at Bishop Fox were quite generous this week, providing a scanner for figuring out PAN-OS GlobalProtect versions. I've been using their decoding technique and date-based fingerprint table to keep an eye on patch status (over at $DAYJOB we help customers, organizations, and national cybersecurity centers\u2026","rel":"","context":"In 
&quot;Cybersecurity&quot;","block_context":{"text":"Cybersecurity","link":"https:\/\/rud.is\/b\/category\/cybersecurity\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/12804","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=12804"}],"version-history":[{"count":7,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/12804\/revisions"}],"predecessor-version":[{"id":12817,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/12804\/revisions\/12817"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=12804"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=12804"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=12804"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}