Hy-phen-ate All The Things! (in R)

`hyphenatr`–what may be my smallest package ever–has just hit [CRAN](https://cloud.r-project.org/web/packages/hyphenatr/). It, well, _hyphenates_ words using [`libhyphen`](https://cloud.r-project.org/web/packages/hyphenatr/index.html) (a.k.a. `libhnj`). There are no external dependencies (i.e. no `brew install`, `apt get`, et. al. required) and it compiles on _everything_ CRAN supports _including_ Windows.

I started coding this to see if it could be a “poor dude’s ‘syllabifier'” (NOTE: _”dude”_ is gender agnostic and I am fully aware of the proper NLP terms to use but it’s way more fun to make up words) to make it easer to turn my [Zellingach project](http://rud.is/projects/zellingenach.html) from earlier in the year into a generalized package. In short, [Tex hyphenation rules](https://en.wikipedia.org/wiki/Hyphenation_algorithm) (which is what `libhyphen` and, hence, `hyphenatr` uses) don’t generalize to separating all syllables since–for instance–you really wouldn’t want to leave some trailing syllables hanging apart from their siblings (mostly for typographic reasons). Rather than let my investigation work be for naught, you get a package!

### What’s in the box?

`hyphenatr` ships with support for _39_ language hyphenation rules. Here’s proof:

> library(hyphenatr)
 
list_dicts()
#>  [1] "af_ZA"  "bg_BG"  "ca"     "cs_CZ"  "da_DK"  "de"     "de_AT"  "de_CH" 
#>  [9] "de_DE"  "el_GR"  "en_GB"  "en_US"  "es_ANY" "et_EE"  "fr"     "gl"    
#> [17] "hr_HR"  "hu_HU"  "is"     "it_IT"  "lt"     "lt_LT"  "lv_LV"  "nb_NO" 
#> [25] "nl_NL"  "nn_NO"  "pl_PL"  "pt_BR"  "pt_PT"  "ro_RO"  "ru_RU"  "sh"    
#> [33] "sk_SK"  "sl_SI"  "sr"     "sv"     "te_IN"  "uk_UA"  "zu_ZA"

Where underscores are present, it’s `locationcode_COUNTRYCODE` otherwise it’s just the location code and you can switch which dictionary is in use with `switch_dict()`. `en_US` is default because I’m a lazy, narcissistic American. You can read about those files [here](https://github.com/LibreOffice/dictionaries), and I followed Dirk’s (Eddelbuettel) model in [`AsioHeaders`](https://cran.rstudio.com/web/packages/AsioHeaders/index.html), keeping all individual copyrights & author credits [intact](https://cloud.r-project.org/web/packages/hyphenatr/COPYRIGHTS) (open source attribution is not as easy as you might think).

By default, `hyphenatr` will stick a `=` where hyphens can be (this is the `libhyphen` default). You can change that to anything else (examples below) _or_ you can ask `hyphenatr` to just return a split vector (i.e. components of the word split at hyphenation points).

### How does it work?

You call `hyphenate` on a vector of words. On my demure 13″ MacBook Pro it takes ~24ms to process 10,000 words. Don’t believe me? Give this a go on your own system:

library(hyphenatr)
library(microbenchmark)
 
dat <- readLines(system.file("extdata/top10000en.txt", package="hyphenatr"))
 
microbenchmark(out1 <- hyphenate(dat))
#> Unit: milliseconds
#>                    expr      min       lq     mean   median       uq      max neval
#>  out1 <- hyphenate(dat) 20.77134 22.16768 23.70809 23.65906 24.73395 30.21601   100

I extracted some of the results of that to give an idea what you get back:

out1[500:550]
#>  [1] "got"            "fam=ily"        "pol=icy"        "in=vestors"     "record"         "loss"          
#>  [7] "re=ceived"      "April"          "Ex=change"      "code"           "graph=ics"      "agency"        
#> [13] "in=creased"     "man=ager"       "keep"           "look"           "of=ten"         "de=signed"     
#> [19] "Euro=pean"      "earn=ings"      "en=vi=ron=ment" "July"           "job"            "third"         
#> [25] "wa=ter"         "net"            "banks"          "an=a=lysts"     "strong"         "party"         
#> [31] "econ=omy"       "away"           "dol=lar"        "taken"          "de=vel=oped"    "con=tinue"     
#> [37] "al=low"         "Mi=crosoft"     "key"            "ei=ther"        "se=cu=rity"     "project"       
#> [43] "agreed"         "though"         "Ja=pan"         "rather"         "coun=tries"     "plant"         
#> [49] "along"          "Ap=ple"         "ac=tion"

It’s a tad slower if you want separated vectors back (~30ms) but I think you’ll find that mode more useful if you do plan on using the package:

microbenchmark(out2 <- hyphenate(dat, simplify=FALSE))
#> Unit: milliseconds
#>                                      expr      min       lq     mean   median       uq      max neval
#>  out2 <- hyphenate(dat, simplify = FALSE) 26.32844 28.27894 29.26569 29.13235 29.80986 33.21204   100
 
jsonlite::toJSON(out2[530:540], pretty=TRUE)
#> [
#>   ["econ", "omy"],
#>   ["away"],
#>   ["dol", "lar"],
#>   ["taken"],
#>   ["de", "vel", "oped"],
#>   ["con", "tinue"],
#>   ["al", "low"],
#>   ["Mi", "crosoft"],
#>   ["key"],
#>   ["ei", "ther"],
#>   ["se", "cu", "rity"]
#> ]

As I stated earlier, you can use whatever separator you want, but you’ll pay the price as that’ll take an _excruciating_ ~31ms for this word list:

microbenchmark(out3 <- hyphenate(dat, simplify="-"))
#> Unit: milliseconds
#>                                    expr      min       lq     mean  median       uq     max neval
#>  out3 <- hyphenate(dat, simplify = "-") 26.22136 28.04543 29.82251 30.0245 31.20909 36.4886   100
 
out3[500:550]
#>  [1] "got"            "fam-ily"        "pol-icy"        "in-vestors"     "record"         "loss"          
#>  [7] "re-ceived"      "April"          "Ex-change"      "code"           "graph-ics"      "agency"        
#> [13] "in-creased"     "man-ager"       "keep"           "look"           "of-ten"         "de-signed"     
#> [19] "Euro-pean"      "earn-ings"      "en-vi-ron-ment" "July"           "job"            "third"         
#> [25] "wa-ter"         "net"            "banks"          "an-a-lysts"     "strong"         "party"         
#> [31] "econ-omy"       "away"           "dol-lar"        "taken"          "de-vel-oped"    "con-tinue"     
#> [37] "al-low"         "Mi-crosoft"     "key"            "ei-ther"        "se-cu-rity"     "project"       
#> [43] "agreed"         "though"         "Ja-pan"         "rather"         "coun-tries"     "plant"         
#> [49] "along"          "Ap-ple"         "ac-tion"

If you’re processing text for use in HTML, you could use this package to add “[soft hyphens](https://en.wikipedia.org/wiki/Soft_hyphen)” (`­`) to the words, but now we’re _dangerously close_ to a nigh intolerable ~40ms for 10,000 words:

microbenchmark(out4 <- hyphenate(dat, simplify="&shy;"))
#> Unit: milliseconds
#>                                        expr      min       lq    mean   median       uq      max neval
#>  out4 <- hyphenate(dat, simplify = "&shy;") 28.57537 29.78537 31.6346 31.31182 33.16067 37.89471   100
 
out4[500:550]
#>  [1] "got"                        "fam&shy;ily"                "pol&shy;icy"                "in&shy;vestors"            
#>  [5] "record"                     "loss"                       "re&shy;ceived"              "April"                     
#>  [9] "Ex&shy;change"              "code"                       "graph&shy;ics"              "agency"                    
#> [13] "in&shy;creased"             "man&shy;ager"               "keep"                       "look"                      
#> [17] "of&shy;ten"                 "de&shy;signed"              "Euro&shy;pean"              "earn&shy;ings"             
#> [21] "en&shy;vi&shy;ron&shy;ment" "July"                       "job"                        "third"                     
#> [25] "wa&shy;ter"                 "net"                        "banks"                      "an&shy;a&shy;lysts"        
#> [29] "strong"                     "party"                      "econ&shy;omy"               "away"                      
#> [33] "dol&shy;lar"                "taken"                      "de&shy;vel&shy;oped"        "con&shy;tinue"             
#> [37] "al&shy;low"                 "Mi&shy;crosoft"             "key"                        "ei&shy;ther"               
#> [41] "se&shy;cu&shy;rity"         "project"                    "agreed"                     "though"                    
#> [45] "Ja&shy;pan"                 "rather"                     "coun&shy;tries"             "plant"                     
#> [49] "along"

As stated, it works with other languages:

switch_dict("de_DE")
 
hyphenate("kommen")
#> [1] "kimn-men"

(I had picked a different word at random [from the internet](http://www.columbia.edu/~fdc/utf8.html), “tägelîch”, but it turned out to be a “Middle High German” words (i.e. not currently in use). I, still, equally randomly place the blame on @sooshie but thank Twitter and blog comments for pointing it out :-)

### Moving right along

If you hit any snags, drop an issue [on GitHub](https://github.com/hrbrmstr/hyphenatr). If you have any hyphenation language rules (in the supported “LibreOffice” format) please submit a PR (both including the file and updating `inst\COPYRIGHTS`).

I cannot conclude w/o giving special thanks to Edwin de Jonge & Gergely Daróczi for language testing.

Well, I really can’t conclude without impersonating a Dalek:

cat(toupper(hyphenate("Exterminate!", simplify=" - ")))

`EX – TER – MI – NATE!`

Cover image from Data-Driven Security
Amazon Author Page

3 Comments Hy-phen-ate All The Things! (in R)

  1. Pingback: Hy-phen-ate All The Things! (in R) – Mubashir Qasim

  2. JaLi

    Thank’s for the package. Nice to have (and nicely written post). For your de_DE example you did select a ‘bad’ word indeed, as it is non-existing. :) There is no i^ character in german and the word it is closest to (‘täglich’ meaning ‘daily’) would be hyphenated on a different position (‘täg-lich’). Cheers.

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.