Unbottling “.msg” Files in R

There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format which helped me throw together a small R packagemsgxtractr — that can read in these ‘.msg’ files and produce a list as a result.

I had previously creatred a quick version of this by wrapping a Python module, but that’s a path fraught with peril and did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces from the C to provide a singular function read_msg() that smashes open bottled up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).

Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.

Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.

devtools::install_github("hrbrmstr/msgxtractr")

library(msgxtractr)

print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))

## List of 7
##  $ headers         :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  18 variables:
##   ..$ Return-path               : chr "<brizhou@gmail.com>"
##   ..$ Received                  :List of 1
##   .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n<brianzhou@me.com>; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Original-recipient        : chr "rfc822;brianzhou@me.com"
##   ..$ Received-SPF              : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
##   ..$ DKIM-Signature            : chr "v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
##   ..$ MIME-version              : chr "1.0"
##   ..$ X-Received                : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Date                      : chr "Mon, 18 Nov 2013 10:26:24 +0200"
##   ..$ Message-id                : chr "<CADtJ4eNjQSkGcBtVteCiTF+YFG89+AcHxK3QZ=-Mt48xygkvdQ@mail.gmail.com>"
##   ..$ Subject                   : chr "Test for TIF files"
##   ..$ From                      : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ To                        : chr "brianzhou@me.com"
##   ..$ Cc                        : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ Content-type              : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
##   ..$ Authentication-results    : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__
##   ..$ x-icloud-spam-score       : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
##   ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
##   ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
##  $ sender          :List of 2
##   ..$ sender_email: chr "brizhou@gmail.com"
##   ..$ sender_name : chr "Brian Zhou"
##  $ recipients      :List of 2
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brianzhou@me.com"
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brizhou@gmail.com"
##  $ subject         : chr "Test for TIF files"
##  $ body            : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
##  $ attachments     :List of 2
##   ..$ :List of 4
##   .. ..$ filename     : chr "importOl.tif"
##   .. ..$ long_filename: chr "import OleFileIO.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:969674] 49 49 2a 00 ...
##   ..$ :List of 4
##   .. ..$ filename     : chr "raisedva.tif"
##   .. ..$ long_filename: chr "raised value error.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:1033142] 49 49 2a 00 ...
##  $ display_envelope:List of 2
##   ..$ display_cc: chr "Brian Zhou"
##   ..$ display_to: chr "brianzhou@me.com"
## NULL

NOTE: Don’t try to read those TIFF images with magick or even the tiff package. The content seems to have some strange tags/fields. But, saving it (use writeBin()) and opening with Preview (or your favorite image viewer) should work (it did for me and produces the following image that I’ve converted to png):

Cover image from Data-Driven Security
Amazon Author Page

8 Comments Unbottling “.msg” Files in R

  1. Pingback: Unbottling “.msg” Files in R – Cloud Data Architect

  2. Pingback: Unbottling “.msg” Files in R – Mubashir Qasim

  3. Pingback: Unbottling “.msg” Files in R | A bunch of data

  4. Pingback: Unbottling “.msg” Files in R – Cyber Security

  5. cguilarte

    I’m getting the following message when I try to install it:

    Installing package into ‘C:/Users/s5807891/Documents/R/win-library/3.4’
    (as ‘lib’ is unspecified)
    * installing source package ‘msgxtractr’ …
    ** libs

    *** arch – i386
    Warning: running command ‘make -f “Makevars” -f “C:/PROGRA~1/R/R-34~1.3/etc/i386/Makeconf” -f “C:/PROGRA~1/R/R-34~1.3/share/make/winshlib.mk” CXX=’$(CXX11) $(CXX11STD)’ CXXFLAGS=’$(CXX11FLAGS)’ CXXPICFLAGS=’$(CXX11PICFLAGS)’ SHLIB_LDFLAGS=’$(SHLIB_CXX11LDFLAGS)’ SHLIB_LD=’$(SHLIB_CXX11LD)’ SHLIB=”msgxtractr.dll” OBJECTS=”RcppExports.o pole.o r_pole.o”‘ had status 127
    ERROR: compilation failed for package ‘msgxtractr’
    * removing ‘C:/Users/s5807891/Documents/R/win-library/3.4/msgxtractr’
    In R CMD INSTALL
    Warning in install.packages :
    running command ‘”C:/PROGRA~1/R/R-34~1.3/bin/x64/R” CMD INSTALL -l “C:\Users\s5807891\Documents\R\win-library\3.4” “C:/Users/s5807891/Downloads/msgxtractr_0.2.1.tar.gz”‘ had status 1
    Warning in install.packages :
    installation of package ‘C:/Users/s5807891/Downloads/msgxtractr_0.2.1.tar.gz’ had non-zero exit status

    Reply
  6. Din

    This works perfectly when I import an email that’s not been forwarded even once but when I try to import an email that has been forwarded the output is not as expected. For eg if I import an email that has been forwarded and I print the imported msg I get data as unspecified. Is there a solution to my problem? Thanks in advance. I am new to R and not sure if I am doing things incorrectly.

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.