R Archives - Page 4 of 56

Category Archives: R

Rusty HTML Minifying In R with {rextendr}

In my M-F newsletter today I mentioned an awesome Rust-based HTML/JS/CSS minifier library that also include batteries for a few other languages.

There was no R port, so I made one using {rextendr}. The {rextendr} package makes is as easy to use Rust code in R packages as {Rcpp} does C/C++ code.

It was as simple as adding some dependencies to the Rust Cargo.toml file and then adding one Rust function to the main lib.rs file, and writing a thin wrapper function ({rextendr} can do that, too, but I wanted some default function parameters) for the shim. It took almost no time, and now you, too, can use the utility:

library(minifyhtml)

'
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta charset="UTF-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1"/>
  <!-- COMMENT -->
  <style>
    * { color: black; }
  </style>
  <title>TITTLE</title>
  </head>
  <body>
    <p>
       Some text
    </p>
    <script>
      console.log("This is a console log message.");
    </script>
  </body>
</html>
' -> src

cat(minify(src))
## <html xmlns=http://www.w3.org/1999/xhtml><meta charset=UTF-8><meta content=width=device-width,initial-scale=1 name=viewport><style>* { color: black; }</style><title>TITTLE</title><body><p>Some text</p><script>console.log("This is a console log message.");</script>

FIN

I have to work out one kink (due to developing on arm64 macOS) and the utility will also be able to minify CSS and JS embedded in HTML.

You can find {minifyhtml} on GitHub.

Keeping Those SSH Keys Safe

I came across a neat site that uses a Golang wasm function called from javascript on the page to help you see if your GitHub public SSH keys are “safe”. What does “safe” mean? This is what the function checks for (via that site):

Recommended key sizes are as follows:

For the RSA algorithm at least 2048, recommended 4096

The DSA algorithm should not be used

For the ECDSA algorithm, it should be 521

For the ED25519, the key size should be 256 or larger

The site also provides links to standards and guides to support the need for stronger keys.

I threw together a small R package — {pubcheck} — to check local keys, keys in a character vector, and keys residing in GitHub. One function will even check the GitHub keys of all the GitHub users a given account is following:

Local file

library(pubcheck)
library(tidyverse)

check_ssh_pub_key("~/.ssh/id_rsa.pub") |> 
  mutate(key = ifelse(is.na(key), NA_character_, sprintf("%s…", substr(key, 1, 30)))) |> 
  knitr::kable()

key	algo	len	status
ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	2048	✅ Key is safe; For the RSA algorithm at least 2048, recommended 4096

A GitHub user

check_gh_user_keys(c("hrbrmstr", "mikemahoney218")) |> 
  mutate(key = ifelse(is.na(key), NA_character_, sprintf("%s…", substr(key, 1, 30)))) |> 
  knitr::kable()

user	key	algo	len	status
hrbrmstr	ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	2048	✅ Key is safe; For the RSA algorithm at least 2048, recommended 4096
hrbrmstr	ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	2048	✅ Key is safe; For the RSA algorithm at least 2048, recommended 4096
hrbrmstr	ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	2048	✅ Key is safe; For the RSA algorithm at least 2048, recommended 4096
mikemahoney218	ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	4096	✅ Key is safe
mikemahoney218	ssh-ed25519 AAAAC3NzaC1lZDI1NT…	ed25519	256	✅ Key is safe
mikemahoney218	ssh-ed25519 AAAAC3NzaC1lZDI1NT…	ed25519	256	✅ Key is safe
mikemahoney218	ssh-ed25519 AAAAC3NzaC1lZDI1NT…	ed25519	256	✅ Key is safe

Keys of all the users a GitHub account is following

check_gh_following("koenrh") |> 
  mutate(key = ifelse(is.na(key), NA_character_, sprintf("%s…", substr(key, 1, 30)))) |> 
  knitr::kable()

user	key	algo	len	status
framer	NA	NA	NA	NA
jurre	ssh-rsa AAAAB3NzaC1yc2EAAAADAQ…	rsa	2048	✅ Key is safe; For the RSA algorithm at least 2048, recommended 4096

What’s it like out there?

I processed my followers list and had some interesting results:

library(pubcheck)
library(hrbragg)
library(tidyverse)

# this takes a while as the # of users is > 500!
res <- check_gh_following("hrbrmstr")

res |> 
  count(user) |> 
  arrange(n) |> 
  count(n, name = "n_users") |> 
  mutate(csum = cumsum(n_users)) |> 
  ggplot() +
  geom_line(
    aes(n, csum)
  ) +
  geom_point(
    aes(n, csum)
  ) + 
  scale_x_continuous(breaks = 1:21) +
  scale_y_comma() +
  labs(
    x = "# Keys In User GH", y = "# Users",
    title = "Cumulative Plot Of User/Key Counts [n=522 users]",
    subtitle = "A handful of users have >=10 keys configured in GitHub; one has 21!!"
  ) +
  theme_cs(grid="XY")

res |> 
  count(algo, len, status) |> 
  mutate(kind = ifelse(is.na(status), "No SSH keys in account", sprintf("%s:%s\n%s", algo, len, status))) |> 
  mutate(kind = fct_reorder(gsub("[;,]", "\n", kind), n, identity)) |> 
  ggplot() +
  geom_col(
    aes(n, kind),
    width = 0.65, 
    fill = "steelblue", 
    color = NA
  ) +
  scale_x_comma(position = "top") +
  labs(
    x = NULL, y = NULL,
    title = "SSH Key Summary For GH Users hrbrmstr Is Following"
  ) +
  theme_cs(grid="X") +
  theme(plot.title.position = "plot")

FIN

Whether you use the website or the R package, it’d be a good idea to check on your SSH keys at least annually.

Update On ESC/POS Package Work

The morning before work was super productive and there is a nigh-complete DSL for ESC/POS commands along with the ability to just print {ggplot2}/{grid} object.

I changed the package name to {escpos} since it is no longer just plot object focused, and the DSL looks a bit like this:

library(stringi)
library(hrbrthemes)
library(ggplot2)
library(escpos)

ggplot() +
  geom_point(
    data = mtcars,
    aes(wt, mpg),
    color = "red"
  ) +
  labs(
    title = "A good title"
  ) +
  theme_ipsum_es(grid="XY") -> gg

epson_ip = "HOSTNAME_OR_IP_OF_YOUR_PRINTER"

escpos(epson_ip) |>
  pos_bold("on") %>%
  pos_align("center") %>%
  pos_size("2x") %>%
  pos_underline("2dot") %>%
  pos_plaintext("This Is A Title") %>%
  pos_lf(2) |>
  pos_underline("off") %>%
  pos_size("normal") %>%
  pos_align("left") %>%
  pos_bold("off") %>%
  pos_font("b") %>%
  pos_plaintext(
    stringi::stri_rand_lipsum(1)
  ) |>
  pos_lf(2) |>
  pos_font("a") %>%
  pos_plaintext(
    paste0(capture.output(
      str(mtcars, width = 40, strict.width = "cut")
    ), collapse = "\n")
  ) |>
  pos_lf(2L) |>
  pos_plot(gg, color = TRUE) %>%
  pos_lf(2L) |>
  pos_font("c") %>%
  pos_plaintext(
    stringi::stri_rand_lipsum(1, start_lipsum = FALSE)
  ) |>
  pos_lf(3) |>
  pos_cut() %>%
  pos_print()

FIN

I still need to make a more generic options “setter” (i.e. so one can set multiple modes in one function call), and I think supporting some kind of markdown/HTML subset to make it easier just to specify that without using the full DSL would be helpful. More updates over the coming weeks!

Turning {ggplot2} Into a PoS (Point-of-Sale) System

At the end of March, I caught a fleeting tweet that showcased an Epson thermal receipt printer generating a new “ticket” whenever a new GitHub issue was filed on a repository. @aschmelyun documents it well in this blog post. It’s a pretty cool hack, self-contained on a Pi Zero.

Andrew’s project birthed an idea: could I write an R package that will let me plot {ggplot2}/{grid} objects to it? The form factor of the receipt printer is tiny (~280 “pixels” wide), but the near infinite length of the paper means one can play with some data visualizations that cannot be done in other formats (and it would be cool to be able to play with other content to print to it in and outside of R).

One of the features that makes Andrew’s hack extra cool is that he used an Epson receipt printer model that was USB connected. I don’t see the need to dedicate and extra piece of plastic, metal, and silicon to manage the printing experience, especially since I already have a big linux server where I run personal, large scale data science jobs. I ended up getting a used (lots of restaurants close down each week) Epson TM-T88V off of eBay since it has Ethernet and is supports ESC/POS commands.

After unpacking it, I needed to get it on the local network. There are many guides out there for this, but this one sums up the process pretty well:

Plug the printer in and reset it
Hook up a system directly to it (Ethernet to Ethernet)
Configure your system to use the Epson default IP addressing scheme
Access the web setup page
Configure it to work on your network
Disconnect and restart the printer

To make sure everything worked, I grabbed one of the (weirdly) many projects on GitHub that provided a means for formatting graphics files to an ESC/POS compatible raster bitmap and processed a simple R-generated png to it, then used netcat to shunt the binary blob over to the printer on the default port of 9100.

I did some initial experiments with {magick}, pulling the graphics bits out of generated plots and then wrapping some R code around doing the conversion. It was clunky and tedious, and I knew there had to be a better way, so I hunted for some C/C++, Rust, or Go code that already did the conversion and found png2escpos by The Working Group. However, I’ve switched to png2pos by Petr Kutalek as the dithering it does won’t require the R user to produce only black-and-white plots for them to look good.

I thought about implementing a graphics device to support any R graphics output, but there are enough methods to convert a base R plot to a grid/grob object that I decided to mimic the functionality of ggsave() and make a ggpos() function. The comment annotations in the code snippet below walk you through the extremely basic process:

ggpos <- function(plot = ggplot2::last_plot(),
                  host_pos,
                  port = 9100L,
                  scale = 2,
                  width = 280,
                  height = 280,
                  units = "px",
                  dpi = 144,
                  bg = "white",
                  ...) {

  # we generate a png file using ggsave()

  png_file <- tempfile(fileext = ".png")

  ggplot2::ggsave(
    filename = png_file,
    plot = plot,
    scale = scale,
    width = width,
    height = height,
    units = units,
    dpi = dpi,
    bg = bg,
    ...
  )

  # we call an internal C function to convert the generated png file to an ESC/POS raster bitmap file

  res <- png_to_raster(png_file)

  if (res != "") { # if the conversion ended up generating a new file

    # read in the raw bytes

    escpos_raster <- stringi::stri_read_raw(res)

    # open up a binary socket to the printer 

    socketConnection(
      host = host_pos,
      port = port,
      open = "a+b"
    ) -> con

    on.exit(close(con))

    # shunt all the bytes over to it

    writeBin(
      object = escpos_raster,
      con = con,
      useBytes = TRUE
    )

  }

  invisible(res)

}

The only work I needed to do on the original C code was to have it output directly to a file instead of stdout.

Now, plotting to the printer is as straightforward as:

library(ggplot2)
library(escpos)

ggplot(mtcars) +
  geom_point(
    aes(wt, mpg)
  ) +
  labs(
    title = "Test of {ggpos}"
  ) +
  theme_ipsum_es(grid="XY") +
  theme(
    panel.grid.major.x = element_line(color = "black"),
    panel.grid.major.y = element_line(color = "black")
  ) -> gg

ggpos(gg, host_pos = HOSTNAME_OR_IP_ADDRESS_OF_YOUR_PRINTER)

That code produces this output (I’m still getting the hang of ripping the spooled paper off this thing):

This is the whole thing in action:

One of the 2022 #30DayChartChallenge topics was “part-to-whole”, so I rejiggered my treemap entry into a very long plot that would make CVS cashiers feel quite inferior.

You can find {escpos} over on GitHub.

FIN

One big caveat for this is that these printers have a tiny memory buffer, so very long, complex plots aren’t going to work out of the box. I had to break up my faceted heatmaps info individual ones and shunt them over one-by-one.

I’ll be ~~switching over the the new C library soon, and~~ adding a small DSL to handle text formatting and printing from R (the device has 2 fonts and almost no styles). I’ve even threatened to make a ShinyPOS application, but we’ll see how the motivation for that goes.

Kick the tyres and let me know if you end up using the package (+ share your creation to the 🌎).

A zsh Helper Script For Updating macOS RStudio Daily Electron + Quarto CLI Installs

RStudio’s macOS Electron build is coming along quite nicely and is blazing fast on Apple Silicon.

I like to install the dailies, well, daily!; and, of late, RStudio and Quarto are joined at the hip. As a result, I regularly found myself having to manually update Quarto CLI right after updating RStudio, so I made a small zsh script that will use the new RStudio dailies JSON API to grab the latest daily, and the GitHub JSON API to grab the latest Quarto release, then install them both.

Caveats and limitations are in the repo.

Starting 2022 Off With A Fairly Complex {ggplot2} Recreation Plot

The New York Times had a [tragic] story on Covid deaths today and one of their plots really stuck with me for how well it told that part of the story.

NOTE: The red panel highlights are off a bit as I manually typed the data in (I only did the recreation to keep {ggplot2} muscle memory as I hadn’t doe a major customization like this in quite some time).

Only one {grid} hack (for the faceted X axis labels) too!

Hopefully, I’ll have more real-world opportunity to build some detailed, properly-annotated {ggplot2} plots this year.

Shout out to @ClausWilke for {ggtext} and all the folks who’ve made {ggplot2} such a powerful data visualization tool.

library(grid)
library(gtable)
library(hrbrthemes)
library(tidyverse)

gtable_filter_remove <- function (x, name, trim = FALSE) {
  # https://stackoverflow.com/a/36780639
  matches <- !(x$layout$name %in% name)
  x$layout <- x$layout[matches, , drop = FALSE]
  x$grobs <- x$grobs[matches]
  if (trim) 
    x <- gtable_trim(x)
  x
}

read.csv(text="race,age_group,before,after,cause
White,Under 25,1,3,Covid-19 deaths increased as a share of deaths from all cause
White,25-44,3,10,Covid-19 deaths increased as a share of deaths from all cause
White,45-64,8,15,Covid-19 deaths increased as a share of deaths from all cause
White,65-84,13,11,NA
White,85+,14,6,NA
Hispanic,Under 25,3,4,Covid-19 deaths increased as a share of deaths from all cause
Hispanic,25-44,17,21,Covid-19 deaths increased as a share of deaths from all cause
Hispanic,45-64,33,26,NA
Hispanic,65-84,33,17,NA
Hispanic,85+,21,9,NA
Black,Under 25,1,3,Covid-19 deaths increased as a share of deaths from all cause
Black,25-44,7,13,Covid-19 deaths increased as a share of deaths from all cause
Black,45-64,15,17,Covid-19 deaths increased as a share of deaths from all cause
Black,65-84,20,12,Covid-19 deaths increased as a share of deaths from all cause
Black,85+,17,8,NA
Asian,Under 25,2,4,Covid-19 deaths increased as a share of deaths from all cause
Asian,25-44,12,14,Covid-19 deaths increased as a share of deaths from all cause
Asian,45-64,21,13,NA
Asian,65-84,23,8,NA
Asian,85+,17,4,NA") -> xdf

xdf %>% 
  mutate(
    before = before/100,
    after = after/100,
    age_group = fct_inorder(age_group),
    race = factor(race, levels = rev(c("Asian", "Black", "Hispanic", "White")))
  ) -> xdf

{

  ggplot( data = xdf) +
    geom_rect(
      data = xdf,
      aes(
        xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf,
        fill = cause
      ),
      alpha = 1/6, color = NA
    ) +
    geom_rect(
      data = xdf %>% 
        filter(
          (race == "White" & age_group %in% c("65-84", "85+")) |
            (race == "Hispanic" & age_group %in% c("45-64", "65-84", "85+")) |
            (race == "Black" & age_group %in% c("85+")) |
            (race == "Asian" & age_group %in% c("45-64", "65-84", "85+"))
        ),
      aes(
        xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf
      ),
      fill = "#999999", alpha = 1/6, color = NA
    ) +
    geom_segment(
      aes(-Inf, xend = Inf, -Inf, yend= -Inf),
      size = 0.25, color = "black"
    ) +
    geom_segment(
      data = xdf, aes("1", before, xend="2", yend=after),
      size = 0.25
    ) +
    geom_point(
      data = xdf, aes("1", before), 
      fill = "#999999", color = "white", size = 2, stroke = 0.5, shape = 21
    ) +
    geom_point(
      data = xdf, aes("2", after),
      fill = "#bb271a", color = "white", size = 2, stroke = 0.5, shape = 21
    ) +
    geom_text(
      data = xdf,
      aes("1", before+0.05, label = scales::percent(before, 1)),
      color = "#999999", family = font_es_bold, fontface = "bold", size = 3
    ) +
    geom_text(
      data = xdf,
      aes("2", after+0.05, label = scales::percent(after, 1)),
      color = "#bb271a", family = font_es_bold, fontface = "bold", size = 3
    ) +
    scale_x_discrete(
      expand = c(0, 0),
      labels = c("<span style='color:#999999'>BEFORE</span>", "<span style='color:#bb271a'>AFTER</a>")
    ) +
    scale_y_percent(
      limits = c(-0.005, 0.405),
      breaks = c(-0.005, 0.1, 0.2, 0.3, 0.405),
      labels = c("", "", "", "", "40%\nof deaths from\nall causes for\nthis group")
    ) +
    scale_fill_manual(
      name = NULL,
      values = c("#bb271a"),
      na.translate = FALSE
    ) +
    coord_cartesian(clip = "off") +
    facet_wrap(
      facets = race~age_group, 
      scales = "free_x",
      labeller = \(labels, multi_line = TRUE){
        labels <- lapply(labels, as.character)
        labels[["race"]][c(1,2,4,5,6,7,9,10,11,12,14,15,16,17,19,20)] <- ""
        labels[["age_group"]] <- sprintf("<span style='font-style:normal;font-weight:normal;'>%s</span>", labels[["age_group"]])
        labels[["race"]][c(3,8,13,18)] <- sprintf("<span style='font-size:12pt;'>**%s**</span>", labels[["race"]][c(3,8,13,18)])
        labels
      }
    ) +
    labs(
      x = NULL, y = NULL,
      title = "Covid-19 deaths <span style='color:#999999'>before</span> and <span style='color:#bb271a'>after</span> universal adult vaccine eligibility",
      caption = "Source: Provisional weekly death data from the C.D.C. through Nov. 27. Note: Only the four largest racial and ethnic groups are included. Universal vaccine eligibility was April 19, the date when all adults in the United States were eligible for vaccination."
    ) +
    theme_ipsum_es(grid="Y", plot_title_size = 16) +
    theme(
      plot.title.position = "plot",
      plot.title = ggtext::element_markdown(hjust = 0.5),
      plot.caption = ggtext::element_textbox_simple(
        hjust = 0, size = 8.5, family = font_es, color = "#999999",
        margin = margin(t = 14)
      ),
      axis.ticks.x.bottom = ell(size = 0.25) ,
      axis.line.x.bottom = ell(lineend = "square", size = 0.25),
      axis.text.x.bottom = ggtext::element_markdown(size = 8, margin = margin(t = 6)),
      axis.text.y.left = elt(size = 8, vjust = 1, lineheight = 0.875,  color = "#999999"),
      strip.text.x = ggtext::element_markdown(hjust = 0.5, size = 10, family = font_es),
      strip.text = ggtext::element_markdown(hjust = 0.5, size = 10, family = font_es),
      panel.spacing.x = unit("40", "pt"),
      panel.spacing.y = unit(6, "pt"),
      panel.border = elb(),
      legend.position = "top"
    ) -> gg

  grid.newpage()
  grid.draw(
    gtable_filter_remove(
      x = ggplotGrob(gg),
      name = c(sprintf("axis-b-%d-1", 2:5), sprintf("axis-b-%d-2", 2:5), sprintf("axis-b-%d-3", 2:5), sprintf("axis-b-%d-4", 2:5))
    )
  )

}

2021 Advent of Code Day 02 “Don’t Try This At Home” Edition

The Moderna booster level drained me all day on Dec 1 and did what jab two did during the overnight period (achy enough to wake me up and not get back to slumber easily). To try to wear myself down, I decided to practice a bit of R with the 2021 Advent of Code. There are plenty of superb R bloggers chronicling their daily katas that I do not feel compelled to post every one (truth be told, work and fam tasks/priorities will make devoting any time to this year’s daily puzzles a rare event).

Day 01 was very straightforward (even part 2 which I used {RcppRoll} despite hoping to stick to only base R 4.x) so it’s kinda not worth a post (for me), but Day 02 was kinda fun as I don’t have regular opportunities to use scan() and get().

The input is a series of submarine commands:

forward 5
down 5
forward 8
up 3
down 8
forward 2

with a set of rules that change between parts 1 and 2.

We can read in those commands with scan() which lets us specify a pattern for each line (scan() takes care of dealing with whitespace for you):

scan(
  text = "forward 5
down 5
forward 8
up 3
down 8
forward 2",
what = list(character(0), integer(0))
) |>
  setNames(c("command", "value")) -> input

str(input)
## List of 2
##  $ command: chr [1:6] "forward" "down" "forward" "up" ...
##  $ value  : int [1:6] 5 5 8 3 8 2

The rules (link above) were pretty basic, increment/decrement some variables based on the command input, but I wanted to avoid a bunch of if statements. Since R has the get() function that enables searching by name for an object, we can make a series of functions that have the command as the identifier and then use get() to call the function:

up <- \(x) depth <<- depth - x
down <- \(x) depth <<- depth + x
forward <- \(x) position <<- position + x

position <- depth <- 0

for (idx in seq_along(input$command)) {
  get(input$command[idx], mode = "function")(input$value[idx])
}

(the final answer is computed by position X depth).

While I find this to be a “fun” solution, I’d strongly suggest:

avoiding using the new shortcut function declaration in mixed R version shops as it’s very new and likely to be confusing to new R users
being wary of the <<- assignment operator as it’s introducing a side-effect (parent/global environment modification) which will come back to bite you in other projects some day
ditching the $ referencing in favour of [[]] / [] to avoid partial name matching “gotchas”, and
adding explicit documentation to what you’re doing with get() calls (provided you really have a good case for using get() to begin with)

The code only changes slightly for part 2, so I’ll refrain from adding even more unreadable code from this post.

Creating A Custom MaxMind mmdb File For Cloud Provider Ranges

I’ve been wanting to create a custom MaxMind mmdb file for alternate IPv4 range classifications for a while, and finally had an opportunity to do so for the $DAYJOB. MaxMind mmdb files are small and easy to work with and provide lightning fast lookups.

This repo — https://github.com/hrbrmstr/clouds2mmdb — has an R script for pulling cloud provider info (I already had some boilerplate code for it and detest reinventing the wheel in another language just to keep a project in a single language) and a Python script for the custom mmdb writing.

Current providers supported are:

AWS
Azure
DigitalOcean
Google (GCP)
Oracle Cloud
Rackspace (see clouds2csv.r for the details)

and, PRs are welcome for other cloud providers.

The provider info is stored in the isp field:

mmdblookup --file clouds.mmdb  --ip 167.99.224.0 isp 
## 
##   "DigitalOcean" <utf8_string>
##