There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format, which helped me throw together a small R package, msgxtractr, that can read in these ‘.msg’ files and produce a list as a result.
I had previously created a quick version of this by wrapping a Python module, but that’s a path fraught with peril and did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces of C code to provide a singular function, read_msg(), that smashes open bottled-up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).
Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.
Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.
devtools::install_github("hrbrmstr/msgxtractr")
library(msgxtractr)
print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))
## List of 7
## $ headers :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 18 variables:
## ..$ Return-path : chr "<brizhou@gmail.com>"
## ..$ Received :List of 1
## .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n<brianzhou@me.com>; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
## ..$ Original-recipient : chr "rfc822;brianzhou@me.com"
## ..$ Received-SPF : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
## ..$ DKIM-Signature : chr "v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
## ..$ MIME-version : chr "1.0"
## ..$ X-Received : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
## ..$ Date : chr "Mon, 18 Nov 2013 10:26:24 +0200"
## ..$ Message-id : chr "<CADtJ4eNjQSkGcBtVteCiTF+YFG89+AcHxK3QZ=-Mt48xygkvdQ@mail.gmail.com>"
## ..$ Subject : chr "Test for TIF files"
## ..$ From : chr "Brian Zhou <brizhou@gmail.com>"
## ..$ To : chr "brianzhou@me.com"
## ..$ Cc : chr "Brian Zhou <brizhou@gmail.com>"
## ..$ Content-type : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
## ..$ Authentication-results : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__
## ..$ x-icloud-spam-score : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
## ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
## ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
## $ sender :List of 2
## ..$ sender_email: chr "brizhou@gmail.com"
## ..$ sender_name : chr "Brian Zhou"
## $ recipients :List of 2
## ..$ :List of 3
## .. ..$ display_name : NULL
## .. ..$ address_type : chr "SMTP"
## .. ..$ email_address: chr "brianzhou@me.com"
## ..$ :List of 3
## .. ..$ display_name : NULL
## .. ..$ address_type : chr "SMTP"
## .. ..$ email_address: chr "brizhou@gmail.com"
## $ subject : chr "Test for TIF files"
## $ body : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
## $ attachments :List of 2
## ..$ :List of 4
## .. ..$ filename : chr "importOl.tif"
## .. ..$ long_filename: chr "import OleFileIO.tif"
## .. ..$ mime : chr "image/tiff"
## .. ..$ content : raw [1:969674] 49 49 2a 00 ...
## ..$ :List of 4
## .. ..$ filename : chr "raisedva.tif"
## .. ..$ long_filename: chr "raised value error.tif"
## .. ..$ mime : chr "image/tiff"
## .. ..$ content : raw [1:1033142] 49 49 2a 00 ...
## $ display_envelope:List of 2
## ..$ display_cc: chr "Brian Zhou"
## ..$ display_to: chr "brianzhou@me.com"
## NULL
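Since the result is a plain nested list, pulling pieces out of it is ordinary list indexing. Here is a minimal sketch based only on the structure shown in the str() output above. The save_attachments() helper is hypothetical (it is not part of msgxtractr), and the msg object is a hand-built stand-in with the same shape so the snippet runs without the package; swap in a real read_msg() result to try it on an actual file.

```r
# Hypothetical helper (not part of msgxtractr): write every attachment in a
# read_msg()-style list to `dir` and return the paths that were written.
save_attachments <- function(msg, dir = tempdir()) {
  vapply(msg$attachments, function(att) {
    # Prefer the long filename; fall back to the short one if it is missing.
    fname <- if (!is.null(att$long_filename)) att$long_filename else att$filename
    path <- file.path(dir, basename(fname))
    writeBin(att$content, path)  # `content` is a raw vector
    path
  }, character(1))
}

# Stand-in with the same shape as the str() output above (assumption: a real
# read_msg() result is laid out the same way for other files).
msg <- list(
  subject = "Test for TIF files",
  sender = list(sender_email = "brizhou@gmail.com", sender_name = "Brian Zhou"),
  recipients = list(
    list(display_name = NULL, address_type = "SMTP",
         email_address = "brianzhou@me.com"),
    list(display_name = NULL, address_type = "SMTP",
         email_address = "brizhou@gmail.com")
  ),
  attachments = list(
    list(filename = "importOl.tif", long_filename = "import OleFileIO.tif",
         mime = "image/tiff", content = as.raw(c(0x49, 0x49, 0x2a, 0x00)))
  )
)

# Scalar fields are a `$` away.
msg$subject
msg$sender$sender_email

# Recipients are a list of lists, so collect one field from each entry.
recipient_emails <- vapply(msg$recipients,
                           function(r) r$email_address, character(1))

# Dump the attachments to a temporary directory.
paths <- save_attachments(msg)
```

Writing to tempdir() by default keeps the sketch from littering the working directory; pass your own dir to keep the files around.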
NOTE: Don’t try to read those TIFF images with magick or even the tiff package. The content seems to have some strange tags/fields. But, saving it (use writeBin()) and opening with Preview (or your favorite image viewer) should work (it did for me and produces the following image that I’ve converted to png):