Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).
Attributes are nothing more than a series of key/value pairs. The key must be a unique character string, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.
When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:
$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg
You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?
I didn’t think so. Thus begat the xattrs package.
Exploring Past Downloads
Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).
We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.
First, let’s see what is hidden with the RStudio disk image:
library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later
list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck" "com.apple.diskimages.recentcksum"
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"
There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:
get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
## [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
## [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
## [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89
Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:
get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"
So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).
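If you're curious what is actually inside that blob, here is a minimal base R sketch that decodes it by hand. This is not how xattrs or biplist do it, and it only handles the one shape we see here (a small array of ASCII strings). The helper names be_int() and read_object() are made up for illustration, and the bytes are copied from the get_xattr_raw() output above.

```r
# Bytes copied from the get_xattr_raw() output above (172 in total).
hex <- paste(
  "62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61",
  "7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64",
  "2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36",
  "32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74",
  "75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a",
  "00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00",
  "00 00 00 89"
)
bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))

# Hypothetical helper: read an unsigned big-endian integer from raw bytes.
be_int <- function(x) sum(as.integer(x) * 256^(rev(seq_along(x)) - 1))

# The last 32 bytes are the trailer: it says how wide offsets and object
# references are, how many objects exist, which one is the root, and
# where the offset table starts.
n           <- length(bytes)
trailer     <- bytes[(n - 31):n]
offset_size <- as.integer(trailer[7])
ref_size    <- as.integer(trailer[8])
num_objects <- be_int(trailer[9:16])
top_object  <- be_int(trailer[17:24])
table_start <- be_int(trailer[25:32])

# The offset table maps object number -> (0-based) file offset.
offsets <- vapply(seq_len(num_objects), function(i) {
  start <- table_start + (i - 1) * offset_size
  be_int(bytes[(start + 1):(start + offset_size)])
}, numeric(1))

# Hypothetical helper: decode one object given its 0-based object number.
# Only ASCII strings (type 0x5) and small arrays (type 0xA) are handled.
read_object <- function(idx) {
  pos    <- offsets[idx + 1] + 1        # 1-based position of the marker byte
  marker <- as.integer(bytes[pos])
  type   <- marker %/% 16               # high nibble: object type
  info   <- marker %% 16                # low nibble: length or count
  if (type == 0x5) {
    len <- info
    data_start <- pos + 1
    if (info == 0xF) {
      # Length did not fit in the nibble; it follows as a small int object.
      int_bytes  <- 2^(as.integer(bytes[pos + 1]) %% 16)
      len        <- be_int(bytes[(pos + 2):(pos + 1 + int_bytes)])
      data_start <- pos + 2 + int_bytes
    }
    rawToChar(bytes[data_start:(data_start + len - 1)])
  } else if (type == 0xA && info < 0xF) {
    ref_pos <- pos + 1 + (seq_len(info) - 1) * ref_size
    refs <- vapply(ref_pos, function(p) be_int(bytes[p:(p + ref_size - 1)]), numeric(1))
    vapply(refs, read_object, character(1))
  } else {
    stop("object type not handled in this sketch")
  }
}

urls <- read_object(top_object)
urls
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/"
```

That's fine for poking around, but anything beyond this simple case (dictionaries, dates, Unicode strings, nested containers) needs a real parser.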
There will eventually be a native Rust-backed property list reading package for R, but for now we can work with that binary plist data in two ways: use the read_bplist() function that comes with the xattrs package, which wraps Linux/BSD or macOS system utilities (super expensive, since it also means writing the data out to a file each time), or turn to Python, which already has this capability. We’re going to use the latter.
I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python installs (don’t judge) and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.
biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/"
That's much better.
Let's work with metadata for the whole directory:
list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
## file name size contents
##
## 1 1109.1968.pdf com.apple.lastuseddate#PS 16
## 2 1109.1968.pdf com.apple.metadata:kMDItemWhereFroms 110
## 3 1109.1968.pdf com.apple.quarantine 74
## 4 1109.1968v1.pdf com.apple.lastuseddate#PS 16
## 5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms 116
## 6 1109.1968v1.pdf com.apple.quarantine 74
## 7 1804.09970.pdf com.apple.metadata:kMDItemWhereFroms 86
## 8 1804.09970.pdf com.apple.quarantine 82
## 9 1804.09988.pdf com.apple.lastuseddate#PS 16
## 10 1804.09988.pdf com.apple.metadata:kMDItemWhereFroms 104
## # ... with 14 more rows
count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
## name n
##
## 1 com.apple.metadata:kMDItemWhereFroms 9
## 2 com.apple.quarantine 9
## 3 com.apple.lastuseddate#PS 4
## 4 com.apple.diskimages.fsck 1
## 5 com.apple.diskimages.recentcksum 1
Now we can focus on the task at hand: recovering the URLs:
list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%                                            # only files that have attributes
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") %>%
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>%
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>%  # decode each bplist
  select(file, where_from) %>%
  unnest(where_from) %>%                                          # one row per URL
  filter(where_from != "")                                        # drop empty entries
## # A tibble: 15 x 2
## file where_from
##
## 1 1109.1968.pdf https://arxiv.org/pdf/1109.1968.pdf
## 2 1109.1968.pdf https://www.google.com/
## 3 1109.1968v1.pdf https://128.84.21.199/pdf/1109.1968v1.pdf
## 4 1109.1968v1.pdf https://www.google.com/
## 5 1804.09970.pdf https://arxiv.org/pdf/1804.09970.pdf
## 6 1804.09988.pdf https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf
## 7 1805.01554.pdf https://arxiv.org/pdf/1805.01554.pdf
## 8 athena-ug.pdf http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
## 9 athena-ug.pdf https://www.google.com/
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/
## 12 osquery-3.2.4.pkg https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg https://osquery.io/downloads/official/3.2.4
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/
(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)
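If you only want one URL per file, a minimal sketch along these lines should do it. It assumes (based on the output above, not on any documented guarantee) that the first entry for each file is the direct download URL and that later entries are referring pages. The downloads tibble is just a few rows copied by hand from that output so the example stands alone.

```r
library(dplyr)

# A few untruncated rows copied from the output above.
downloads <- tibble(
  file = c(
    "1109.1968.pdf", "1109.1968.pdf",
    "1804.09970.pdf",
    "athena-ug.pdf", "athena-ug.pdf"
  ),
  where_from = c(
    "https://arxiv.org/pdf/1109.1968.pdf",
    "https://www.google.com/",
    "https://arxiv.org/pdf/1804.09970.pdf",
    "http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf",
    "https://www.google.com/"
  )
)

# Assumption: the first row per file is the direct download URL.
# distinct() keeps the first row it sees for each value of `file`.
direct <- distinct(downloads, file, .keep_all = TRUE)
direct
```

In the real pipeline you would add the distinct() call as a final step after the filter().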
Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.
FIN
Have some fun exploring what other secrets your OS may be hiding from you, and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!
Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.