Homebrew 2.0.0 Released == homebrewanalytics package updated

A major new release of Homebrew has landed and now includes support for Linux as well as Windows (via the Windows Subsystem for Linux)! There are overall stability and speed improvements baked in as well. The aforelinked announcement has all the info you need if you want the minutiae. Unless you’ve been super-lax in updating, brew update will get you the latest release.

There are extra formulae analytics endpoints, and the homebrewanalytics R package has been updated to handle them. A change worth noting in the package is that all the API calls are memoised to avoid hammering the Homebrew servers (though the “API” is really just file endpoints, and they aren’t big files, but bandwidth is bandwidth). Use the facilities in the memoise package to invalidate the cache if you have long-running scripts.
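
For example, if the package’s memoised wrappers are exported directly (an assumption; check the package namespace for the actual names), memoise::forget() will drop the cached results for a given function:

# a minimal sketch: forget() returns TRUE if the function was memoised
# and its cache was cleared; brew_formulae() is used here since it appears
# later in this post, but any memoised endpoint wrapper works the same way
memoise::forget(homebrewanalytics::brew_formulae)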

Use your favorite social coding site to install it (if I don’t maintain a mirror on your open social coding platform of choice, just drop a note in the comments and I’ll start mirroring there as well):

devtools::install_git("https://git.sr.ht/~hrbrmstr/homebrewanalytics")
# or
devtools::install_gitlab("hrbrmstr/homebrewanalytics")
# or
devtools::install_github("hrbrmstr/homebrewanalytics")

The README and in-package manual pages provide basic examples of retrieving data. But we can improve upon those here, such as finding out the dependency distribution of Homebrew formulae:

library(hrbrthemes)
library(homebrewanalytics) # git.sr.ht/~hrbrmstr ; git[la|hu]b/hrbrmstr
library(tidyverse)

f <- brew_formulae()

# count build dependencies per formula, then plot the distribution
mutate(f, n_dependencies = lengths(build_dependencies)) %>% 
  count(n_dependencies) %>% 
  mutate(n_dependencies = factor(n_dependencies)) %>% 
  ggplot() +
  geom_col(aes(n_dependencies, n), fill = ft_cols$slate, width = 0.65) +
  scale_y_comma("# formulae") +
  labs(
    x = "# Dependencies",
    title = "Dependency distribution for Homebrew formulae"
  ) +
  theme_ft_rc(grid="Y")

Given how long it sometimes takes to upgrade my various Homebrew installations, I was surprised to see 0 be so prevalent. But one of the major changes in 2.0.0 is a move to more binary installs (unless you really need custom builds), so that is likely part of my experience, especially with the formulae I need to support cybersecurity and spatial operations.

We can also see which libraries make up the top 50% of formulae dependency usage:

unlist(f$dependencies) %>%    # flatten the per-formula dependency lists
  table(dnn = "library") %>%  # tally how often each library appears
  broom::tidy() %>% 
  arrange(desc(n)) %>% 
  mutate(pct = n/sum(n), cpct = cumsum(pct)) %>% 
  filter(cpct <= 0.5) %>% 
  mutate(pct = scales::percent(pct)) %>% 
  mutate(library = factor(library, levels = rev(library))) %>% 
  ggplot(aes(n, library)) +
  geom_segment(aes(xend=0, yend=library), color = ft_cols$slate, size=3.5) +
  geom_text(
    aes(x = (n+max(n)*0.005), label = sprintf("%s (%s)", n, pct)), 
    hjust = 0, size = 3, family = font_rc, color = ft_cols$gray
  ) +
  scale_x_comma(position = "top", limits=c(0, 500)) +
  labs(
    x = "# package using the library", y = NULL,
    title = "Top 50% of libraries used across Homebrew formulae"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_text(family = "mono"))

It seems openssl is pretty popular (not surprising but always good to see cybersecurity things at the top of good lists for a change)! macOS ships with an even more dreadful (I know that’s hard to imagine) default Python setup than usual so it being number 2 is not unexpected.

And, finally, we can also check on how frequently formulae are installed. Let’s look back on the last 90 days:
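
The installs object used in the plot below holds those install-event counts. A sketch of how to fetch it, assuming a brew_analytics()-style accessor over the analytics file endpoints (that function name is an assumption; the in-package manual pages have the exact call):

# hypothetical accessor; check the package docs for the real signature
installs <- brew_analytics(prefix = "install", days = "90d")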

# density of install-event counts across formulae (log10 x-axis)
ggplot() +
  geom_density(
    aes(x = installs$count, y = stat(count)),
    color = ft_cols$slate, fill = alpha(ft_cols$slate, 1/2)
  ) +
  scale_x_comma("# install events", trans = "log10") +
  scale_y_comma("# formulae") +
  labs(
    title = "Homebrew Formulate 'Install Events' Distribution (Past 90 days)"
  ) +
  theme_ft_rc(grid="XY")

I’ll let you play with the package to find out who the heavy hitters are and explore more about the Homebrew ecosystem.
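
If you want a head start, sorting that same 90-day install data by count will surface them (a quick sketch reusing the installs data frame from above; only its count column is assumed):

# top 10 formulae by install events over the past 90 days
installs %>% 
  arrange(desc(count)) %>% 
  head(10)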

FIN

Kick the tyres, file issues & PRs, and a hearty “Welcome!” to the Homebrew ecosystem for Linux and Windows users. My hope is that the WSL availability will eventually make it easier to develop for Windows systems and avoid the “download the kinda sketchy compiled Windows libraries from GitHub on package install” practice we have today.

If you crank out some analytics using the packages don’t forget to blog about it and drop a link in the comments!

Cover image from Data-Driven Security

Comments

  1. Julia Silge

    Bob, what do you think about Homebrew’s analytics and data collection model? I ask because I’ve been involved in some discussions over the past year or so about why R doesn’t have that kind of information (use, not downloads). Do you have opinions about whether Homebrew implemented it well, fairly, safely, etc?

    1. hrbrmstr

      Definitely a good question. I have some issues with it since it’s both opt-out and uses Google. It’s off on my portable system b/c Google doesn’t need to see where I travel and the only thing “anonymous” about the IP address they store is the last octet (I don’t really believe Google throws that away though). I do leave it on for my always-at-home system since Google knows where I live b/c I have little choice but to use Google search and don’t feel like using Tor or a VPN for that all the time. I also don’t like the fact that brew analytics off provides no feedback.

      Having said that, I do find poking at what’s really used in Homebrew somewhat interesting, and I can imagine that formulae maintainers also appreciate having information that tells them whether they’re wasting their time or not. And, as someone who groks data analysis, all the maths required to make opt-in data even somewhat useful seems like something that might not be readily accessible to them, and, being an unbacked FOSS project, they have no funds for their own analytics collection infrastructure.

      Their analytics setup is akin to PyPI’s and cranlogs, but Homebrew goes out of its way to be more detailed, so it’s higher-fidelity data for their devs. I’m actually not thrilled that cranlogs exists, nor that RStudio force-defaults to their CRAN mirrors, but anyone can stand up a mirror, promote it and try to get users to use it; and, at some point — unless you have CRAN at home like I do — you have to connect to something to install packages.

      But you asked about R (and likely also intimated about installed R packages collecting usage data on load). Baking usage collection into R itself is a non-starter for me, and I’d self-compile it with that commented out. I say that, but I happily (mindlessly) let virtually every macOS app I use “check for updates”, which they do half for my benefit (to give me updates!) and half for theirs (to collect usage data). I don’t have that same bargain with R itself (i.e. it never tells me when there’s an update; that is something I have to keep track of on my own).

      As far as packages directly collecting usage data goes, I’m vehemently opposed to it, though it’s amusing that it’s somewhat allowed. First, “why opposed?”. R itself would have to provide the conduit for the data collection, otherwise any pkg developer would be able to throw together their own call to download.file(), httr::GET() or something similar and shunt up some bits of siphoned-off metadata. I have little faith in the pseudo-anonymous R package dev community that they’d do the collection well or ethically, and have even less faith they’d store the data securely. If R had an opt-in options(ENABLE_PACKAGE_USAGE_ANALYTICS=TRUE) I’d trust that a bit more since I’d have access to the source, know that the underlying code had been argued about, nitpicked, guard-railed, etc. for ages before it got in there (and would still compile that out :-). If they did that, then they’d have to use some funding source to ensure there was a global network of collection agents (which can’t be CRAN mirrors since they’re set up and maintained by defined but random orgs that nobody should trust with their data). I’m not sure R Core or CRAN could get the funding for that. If they did, say from a commercial entity, then I’d most assuredly compile it out and blacklist those domains/IPs in home DNS (yep, I run my own DNS too :-) and firewall, b/c at some point that commercial entity is going to do something stupid/bad/evil since it’s a commercial entity.

      Back to “somewhat allowed”. Ref:

      That’s an example of the sketch behaviour I mentioned in the blog post. I hate this. Rly rly hate this. I hate more that CRAN allows it than the actual practice. I hate it more now that Microsoft owns GitHub. Sure, the folks that use this “winlibs hack” can’t see the data, but CRAN gleefully allows my (ok, not me, since I don’t use Windows) connection information and R version (download.file() sends what’s in options("HTTPUserAgent")) to be sent to GitHub for that download so I can have the privilege of using xml2 on Windows (yes, that only happens if one compiles from source, so it’s not as horrible as it could be). Why? Why do they trust GitHub? I’ve not gotten ’round to seeing how many other packages use that type of sketch hack (b/c I rly don’t want to know).
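
      (For the curious: that string is visible, and suppressible, from a stock R session; this is base R behaviour for the internal/libcurl download methods.)

      # see the User-Agent header download.file() will send
      getOption("HTTPUserAgent")
      # and neuter it for the current session if you'd rather not share it
      options(HTTPUserAgent = "")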

      I also really don’t know what R and/or package usage statistics would be good for. Would it really help, say, Albrecht Gebhardt in some way to know that 3 times a year I need to use akima::interp() (it’s literally abt 3x a year, too :-)? He likely made the package b/c he and some colleagues needed it (literally also why I make packages). If a developer relies on knowledge of the popularity of said package (which is all a decently anonymized set of analytic data would capture) to keep maintaining it, perhaps that developer might want to find a new maintainer? I know we’re all busy (crushed, lately, actually) and our time is valuable, and knowing where to direct said scant time helps prioritize, but there’s likely a better way to intuit that than adding tracking code to R.

      It’s late (or early depending on your definition of 0300/0400) so I may have rambled a bit and the topic is (broadly) human safety so I likely got heated but the question was indeed a good one and if I did ramble to the point of incomprehensibility drop another comment with something like “WTHeck?” and I’ll gladly refine at a more civilized time of day :-)

      1. Julia

        Thank you for these thoughts, Bob! (And thanks for your patience, as it took me a bit to get back to this.) There’s a lot to think through here, and you really bring a lot of it into focus. The usefulness of these kinds of statistics is less compelling for the long tail of R packages/functions, like your example, but just the fact that somebody like you thought the Homebrew analytics were worth digging into does tend to make me (and not just me) think there would be value here, more than our intuition alone about what’s used more, at least. I agree with you 100% about individual R package authors doing this being just a bad, sketchy idea based on what we know already about the overall sophistication of the population. Heck, I’d put myself in that not-sophisticated-enough bucket. That’s why I asked what you thought, naturally! Thanks for taking the time to reflect on this.

