I made a promise to someone that my next blog post would be about stringi vs stringr, and I intend to keep said promise.
stringr and stringi do “string operations”: find, replace, match, extract, convert, transform, etc.
The stringr package is now part of the tidyverse, is 100% focused on string processing, and is pretty much a wrapper package for stringi. The stringi package wraps chunks of the icu4c library, but the stringi API framing was actually based on the patterns in the stringr package API. stringr did not wrap stringi at the time but does now, and stringi strays a bit (on occasion) from string processing since the entire icu4c library is at its disposal. Confused? Good! There’s more!
The impetus for asking me to blog about this is that I’m known to say “just use stringi” in situations where someone has taken a stringr “shortcut”. Let me explain why.
Readers Digest
First, you need to read pages 4-5 of the stringi manual [PDF] and then the stringr vignette. I’m not duplicating the information on those pages. The TL;DR on them is:
- stringr makes some (valid) assumptions about defaults for the stringi calls it wraps
- stringr is much easier to initially grok as it’s very focused and has far fewer functions
- they both use ICU regular expressions
- stringi includes more than string processing and has far more total functions
As noted, stringr wraps stringi calls (for the most part), and some of the stringr functions reference more than one stringi function.
That’s my primary defense for “just use stringi”: stringr “just uses” it, and you are forced to install stringi on every system stringr is on, so why introduce another dependency into your code?
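If you want to see the wrapping for yourself, a rough sketch like the one below scans the source of each exported stringr function for names starting with stri_. It assumes stringr is installed. The scan is a plain text search, so it could also pick up a stri_ name that only appears inside a string, and the results will vary with the stringr version you have.

```r
library(stringr)

# Names of everything stringr exports
exports <- sort(getNamespaceExports("stringr"))

# For each exported function, collect the distinct stri_* names in its source
stri_calls <- lapply(setNames(exports, exports), function(nm) {
  obj <- get(nm, envir = asNamespace("stringr"))
  if (!is.function(obj)) return(character(0))
  src <- paste(deparse(obj), collapse = "\n")
  unique(unlist(regmatches(src, gregexpr("stri_[A-Za-z0-9_]+", src))))
})

# Keep only the functions that mention at least one stri_* name
stri_calls <- Filter(length, stri_calls)

# stringr functions that mention more than one stringi function
multi <- names(stri_calls)[lengths(stri_calls) > 1]
multi
```

Printing stri_calls gives the full per-function breakdown if you want more than the names.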
All Wrapped Up
These are the stringr functions with a 1:~1 correspondence to stringi functions:
stri_c stri_conv stri_count stri_detect stri_dup stri_extract stri_extract_all stri_join stri_length stri_locate stri_locate_all stri_match stri_match_all stri_order stri_pad stri_replace stri_replace_all stri_replace_na stri_sort stri_split stri_split_fixed stri_sub stri_sub<- stri_subset stri_trim stri_wrap
I used 1:~1 since at the heart of the string processing capabilities of both packages lies the concept of granular control of matching behaviour. Specifically, there are four modes (so it’s really 1:4?):
- fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets
- coll: Compare strings respecting standard collation rules
- regex: The default. Uses ICU regular expressions
- boundary: Match boundaries between things
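To make the four modes concrete, here is a small sketch using the explicit stringi function for each one. The sample strings are made up for illustration, and the collator strength setting is just one way to show collation-aware matching.

```r
library(stringi)

x <- "Lorem i.sum dolor sit amet."

# fixed: "." is a literal character, so only the two dots match
n_fixed <- stri_count_fixed(x, ".")

# regex: "." matches any character, so every character in x matches
n_regex <- stri_count_regex(x, ".")

# coll: collation-aware comparison; strength = 1 ignores case differences
found <- stri_detect_coll(
  "LOREM", "lorem",
  opts_collator = stri_opts_collator(strength = 1)
)

# boundary: break text at boundaries (here, words) instead of at a pattern
words <- stri_extract_all_boundaries(
  "Lorem ipsum dolor",
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)[[1]]
```

The fixed and regex counts differ because the same "." means two very different things depending on the mode.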
stringr has modifier functions that wrap the pattern argument to handle those, whereas stringi requires explicit function calls. So, you’d do the following to replace a fixed char/byte sequence in each package:
stri_replace_all_fixed("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", ".", "#")
str_replace_all("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", fixed("."), "#")
In that case there’s not much in the way of keystroke savings. The default mode of stringr is regex replacement, though, so there you do save both an i and a _regex, but you add one more function call between you and your goal. When you work with multi-gigabyte character structures (as I do), those milliseconds often add up. If keystrokes > milliseconds in your workflow, you may want to stick with stringr.
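As a rough sketch of that trade-off, the snippet below runs the fixed and regex forms in both packages and checks that they agree. The timing lines at the end are only illustrative: the vector size is an arbitrary choice, and the numbers depend entirely on your machine and package versions.

```r
library(stringi)
library(stringr)

x <- "Lorem i.sum dolor sit amet, conse.tetur adipisicing elit."

# Fixed (literal) matching: explicit function in stringi, modifier in stringr
a <- stri_replace_all_fixed(x, ".", "#")
b <- str_replace_all(x, fixed("."), "#")

# Regex matching (the stringr default): the dot now has to be escaped
c1 <- stri_replace_all_regex(x, "\\.", "#")
c2 <- str_replace_all(x, "\\.", "#")

# All four calls should give the same string
same <- all(c(a, b, c1, c2) == a)

# Rough timing on a larger vector
big <- rep(x, 1e5)
system.time(stri_replace_all_fixed(big, ".", "#"))
system.time(str_replace_all(big, fixed("."), "#"))
```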
Treasure Hunting in stringi
If you take some time to look at what’s in stringi you’ll find quite a bit (I excluded the fixed/coll/regex/boundary versions for brevity):
That’s an SVG, so zoom in as much as you need to read it.
These are stringi gems:
- stri_stats_general (stats about a character vector)
- stri_trans_totitle (For When You Want Title Case)
- stri_flatten (paste0 but better defaults)
- stri_rand_strings (random strings)
- stri_rand_lipsum (random Lorem Ipsum lines!)
- stri_count_words, stri_extract_all_words, stri_extract_first_words, stri_extract_last_words
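Here is a quick tour of those gems as a minimal sketch. The input strings are invented for the example, and the random functions will of course return different text on each run.

```r
library(stringi)

x <- c("the quick brown fox", "", "jumps over the lazy dog")

# Summary counts for a character vector (lines, non-empty lines, characters)
stats <- stri_stats_general(x)

# Title case
titled <- stri_trans_totitle("for when you want title case")

# Collapse a vector into one string; the separator defaults to ""
flat <- stri_flatten(c("a", "b", "c"))
flat_sep <- stri_flatten(c("a", "b", "c"), collapse = ", ")

# Three random strings of length 8, and two paragraphs of lorem ipsum
rand <- stri_rand_strings(3, 8)
lipsum <- stri_rand_lipsum(2)

# Word counting and extraction
n_words <- stri_count_words(x)
first <- stri_extract_first_words(x)
last <- stri_extract_last_words(x)
```

Note that the empty string counts as zero words and yields NA for the first and last word.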
Plus it has some helpful operators: %s!=%, %s!==%, %s+%, %s<%, %s<=%, %s==%, %s===%, %s>%, %s>=%, %stri!=%, %stri!==%, %stri+%, %stri<%, %stri<=%, %stri==%, %stri===%, %stri>%, %stri>=%
Of those, %s+% is ++handy for string concatenation.
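A few of the operators in action, as a small sketch with made-up strings:

```r
library(stringi)

# Concatenation, vectorized like the rest of stringi
greeting <- "Hello, " %s+% c("stringi", "stringr") %s+% "!"

# Missing values propagate instead of turning into the string "NA"
with_na <- "abc" %s+% NA_character_

# The ordering operators compare using collation rules
is_less <- "apple" %s<% "banana"

# %s===% asks for exact code point equality
same <- "abc" %s===% "abc"
```

The NA behaviour is the main practical difference from paste0, which would happily give you "abcNA".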
Prior to readr, these were my go-to line/raw readers/writers: stri_read_raw, stri_read_lines, and stri_write_lines.
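A minimal round trip with those three functions might look like this. The temporary file and its contents are placeholders for the example.

```r
library(stringi)

tmp <- tempfile(fileext = ".txt")
lines_out <- c("first line", "second line", "third line")

# Write a character vector, one element per line
stri_write_lines(lines_out, tmp)

# Read it back as a character vector of lines
lines_in <- stri_read_lines(tmp, encoding = "UTF-8")

# Or slurp the whole file as a raw vector of bytes
bytes <- stri_read_raw(tmp)

# Clean up the temporary file
unlink(tmp)
```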
It also handles gnarly character encoding operations in a cross-platform, predictable manner.
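As one small illustration of the encoding helpers, the sketch below converts Latin-1 bytes to UTF-8, checks validity, and transliterates to ASCII. The byte sequence is a hand-built example, not data from a real file.

```r
library(stringi)

# The bytes for "café" in ISO-8859-1 (Latin-1), where é is the single byte 0xE9
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))

# Convert from a named source encoding to UTF-8
utf8 <- stri_encode(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")

# Check whether raw bytes or strings are valid UTF-8
raw_ok <- stri_enc_isutf8(latin1_bytes)  # a lone 0xE9 is not valid UTF-8
utf8_ok <- stri_enc_isutf8(utf8)

# Transliterate to plain ASCII for systems that can't cope with accents
ascii <- stri_trans_general(utf8, "Latin-ASCII")
```

Because this all goes through ICU, the same calls should behave the same way regardless of the platform's native encoding.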
FIN
To do a full comparison justice would have required writing a mini-book, which is something I can’t spare cycles on, so my primary goals were to make sure folks knew stringr wrapped stringi and to show that stringi has much more to offer than you probably knew. If you start to get hooked on some of the more “fun” or utilitarian functions in stringi, it’s probably worth switching to it. If string ops are ancillary operations to you and you normally work in regex-land, then you’re not missing out on anything and can save a few keystrokes here and there by using stringr.
Comments are extremely encouraged for this post as I’m curious whether you knew about stringi before and when/where/how you use it vs stringr (or why you don’t).
The ? Resistance
I need to be up-front about something: I’m partially at fault for ? being elected. While I did not vote for him, I could not in any good conscience vote for his Democratic rival. I wrote in a ticket that had one Democrat and one Republican on it. The “who” doesn’t matter, and my district in Maine went abundantly for ?’s opponent, so my direct choice had no real impact, but I did actively point out the massive flaws in his opponent. Said flaws were many, and I believe that with her we’d be in a different bad place, though not as bad a place as we’re in now. But that’s in the past, and we’ve got a new reality to deal with now.
This is a (hopefully) brief post about finding a way out of this mess we’re in. It’s far from comprehensive, but there’s honest-to-goodness evil afoot that needs to be met head on.
Brand Damage
You’ll note I’m not using either of their names. Branding is extremely important to both of them, but is the almost singular focus of ?. His name is his hotel brand, company brand and global identifier. Using it continues to add it to the history books and can only help inflate the power of that brand. First and foremost, do not use his name in public posts, articles, papers, etc. “POTUS”, “The President”, “The Commander in Chief”, “?” (chosen to match his skin/hair color, complexion and that comb-over tuft) are all sufficient references since there is date-context with virtually anything we post these days. Don’t help build up his brand. Don’t populate historical repositories with his name. Don’t give him what he wants most of all: attention.
Document and Defend with Data
Speaking of the historical record, we need to be blogging and publishing regularly the actual facts based on data. We also need to save data as there are signs of a deliberate government purge going on. I’m not sure how successful said purge will be in the long run, but I suspect that data purging and corruption by this administration will have lasting unintended consequences.
Join/support @datarefuge to save data & preserve the historical record.
Install the Wayback Machine plugin and take the 2 seconds per site you visit to click it.
Create blog posts, tweets, news articles and papers that counter bad facts with good/accurate/honest ones. Don’t make stuff up (even a little). Validate your posits before publishing. Write said posts in a respectful tone.
Support the Media
When the POTUS’ Chief Strategist says things like “The media should be embarrassed and humiliated and keep its mouth shut and just listen for a while” it’s a deliberate attempt to curtail the Press and eventually there will be more actions to actually suppress Press freedom.
I’m not a liberal (I probably have no convenient definition) and I think the Press gave Obama a free ride during his eight-year rule. They are definitely making up for that now, mostly because their very livelihoods are at stake.
The problem with them is that they are continuing to let themselves be manipulated by ?. He’s a master at this manipulation. Creating a story about the size of his hands in a picture delegitimizes you as a purveyor of news, especially when — as you’re watching his hands — he’s separating families, normalizing bigotry and undermining the Constitution. Forget about the hands and even forget about the hotels (for now). There was even a recent story trying to compare email servers (the comparison is very flawed). Stop it.
Encourage reporters to focus on things that actually matter and provide pointers to verifiable data they can use to call out the lack of veracity in ?’s policies. Personal blog posts are fleeting things but an NYT, WSJ (etc) story will live on.
Be Kind
I’ve heard and read some terrible language about rural America from what I can only classify as “liberals” in the week this post was written. Intellectual hubris and actual, visceral disdain for those who don’t think a certain way were two major reasons why ? got elected. The actual reasons he got elected are diverse and very nuanced.
Regardless of political leaning, pick your head up from your glowing rectangles and go out of your way to regularly talk to someone who doesn’t look, dress, think, eat, etc like you. Engage everyone with compassion. Regularly challenge your own beliefs.
There is a wedge that I estimate is about 1/8th of the way into the core of America now. Perpetuating this ideological “us vs them” mindset is only going to fuel the fires that created the conditions we’re in now and drive the wedge in further. The only way out is through compassion.
Remember: all life matters. Your degree, profession, bank balance or faith alignment doesn’t give you the right to believe you are better than anyone else.
FIN (for now)
I’ll probably move most of my future opining to a new medium (not uppercase Medium) as you may be getting this drivel when you want recipes or R code (even though there are separate feeds for them).