Mining Your Gmail Data - Part 2
Download all your Gmail
So, step one in analyzing your Gmail data is to pull it all down to your machine. I'm using Gmvault to retrieve my messages. Depending on how many messages you have stored in Gmail, you might want to start with the -t quick option, which by default grabs only the last 10 days of mail. Once you're sure you've got everything set up correctly, you can run the full command. Also, Gmvault compresses the data by default, so when you pull it down for analysis you'll want the --no-compression option. If you've been using Gmail for over a decade like I have, once you start the process you'll have some time to get a cup of coffee.
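For reference, the two runs described above might look like this from a PowerShell prompt. This is only a sketch: it assumes gmvault is on your PATH, and you@gmail.com is a placeholder for your own address.

```powershell
# Placeholder address (an assumption for illustration); use your own account.
$account = "you@gmail.com"

# Trial run: quick sync only, stored uncompressed so the .eml files are readable as-is.
$quickArgs = @("sync", "-t", "quick", "--no-compression", $account)

# Full run: the same command without the quick restriction.
$fullArgs = @("sync", "--no-compression", $account)

if (Get-Command gmvault -ErrorAction SilentlyContinue) {
    & gmvault @quickArgs
    # Once the trial run looks right, swap in the full run:
    # & gmvault @fullArgs
}
else {
    Write-Warning "gmvault was not found on PATH; add its install folder to PATH first."
}
```

Keeping the arguments in arrays makes it easy to switch between the trial run and the full run without retyping the options.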
Pull all the data together in one place
Okay, I've got about 43,000 .eml files (and their accompanying metadata) on disk; now what?
If you're using the Gmvault defaults on a Windows 7 or later machine, the actual data will be stored in C:\Users\[username]\gmvault-db.
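Before writing the real script, it can help to confirm that the backup folder is where you expect and see how many messages it holds. A minimal sketch, assuming $HOME resolves to C:\Users\[username] on your machine; Get-EmlCount is a hypothetical helper name, not part of Gmvault:

```powershell
# Hypothetical helper: counts the .eml files under a Gmvault database folder.
function Get-EmlCount {
    param(
        # Assumed default location, based on the Gmvault defaults described above.
        [string]$Path = (Join-Path $HOME "gmvault-db")
    )

    # Report zero rather than failing when the folder is missing.
    if (-not (Test-Path $Path)) { return 0 }

    # @() guarantees an array, so .Count works even for zero or one file.
    return @(Get-ChildItem -Path $Path -Recurse -Filter *.eml).Count
}

Write-Host "Found $(Get-EmlCount) .eml files"
```

If the count comes back as zero, the database is probably somewhere else, and you can pass that folder with -Path.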
The next step is to write a script to chew through the folders full of .eml files and aggregate their metadata into one place. This can be done with a pretty simple PowerShell script and CDO (Collaboration Data Objects, an older Microsoft API that, as far as I know, is pretty much guaranteed to be available on your machine; I last used it on Windows 2000, and I'm doing this on Windows 10):
$emails = @()
$gmvaultdb = "[path to your gmvault db folder goes here]"

# Count the messages up front so progress can be reported.
$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | Measure-Object).Count

Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i = 0} {
    $i++
    Write-Host "Processing $($_.Name) ($i of $total)"

    # Load the raw file into an ADODB stream and have CDO parse it.
    $adoDbStream = New-Object -ComObject ADODB.Stream
    $adoDbStream.Open()
    $adoDbStream.LoadFromFile($_.FullName)
    $cdoMessage = New-Object -ComObject CDO.Message
    $cdoMessage.DataSource.OpenObject($adoDbStream, "_Stream")

    # Extract the bare address from a header shaped like '"Name" <address>'.
    # $Matches keeps its old value when -match fails, so reset the result first.
    $from = $cdoMessage.Fields.Item("urn:schemas:mailheader:from").Value
    $fromEmail = $null
    if ($from -match '"\s<(.*)>$') { $fromEmail = $Matches[1] }

    $props = @{
        Size = $_.Length
        To = $cdoMessage.Fields.Item("urn:schemas:mailheader:to").Value
        From = $from
        FromEmail = $fromEmail
        Subject = $cdoMessage.Fields.Item("urn:schemas:mailheader:subject").Value
        Received = $cdoMessage.Fields.Item("urn:schemas:mailheader:received").Value
    }
    $emails += (New-Object PSObject -Property $props)

    # Release the file before moving on to the next message.
    $adoDbStream.Close()
}

# Write the collected fields out as CSV.
$emails | Select-Object To, From, Subject, Size, FromEmail | Export-Csv emaildata.csv -NoTypeInformation
The script retrieves every single .eml file in the Gmvault backup folders and loads each one up as a CDO Message object. CDO takes care of all the parsing for us, and we can just extract the fields we care about. We pull those fields into a giant array of PSObjects and then use the magical Export-Csv command to create a comma-separated file with all of our email data.
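One detail worth knowing if you adapt the From parsing: not every From header is a quoted display name followed by an address in angle brackets, and PowerShell leaves $Matches untouched when -match fails, so reading $Matches after a failed match returns whatever the previous successful match captured. The sketch below shows a more forgiving extraction; Get-AddressFromHeader is a hypothetical helper name, and the two patterns are my own assumptions about common header shapes, not a full address parser.

```powershell
# Hypothetical helper: returns the bare address from a From header, or $null.
function Get-AddressFromHeader {
    param([string]$Header)

    # Prefer the text inside trailing angle brackets: 'Name <address>'.
    if ($Header -match '<([^<>]+)>\s*$') { return $Matches[1] }

    # Otherwise accept a header that is just a bare address.
    if ($Header -match '^\s*(\S+@\S+)\s*$') { return $Matches[1] }

    # Nothing recognizable: return $null instead of a stale $Matches value.
    return $null
}

Get-AddressFromHeader '"Jane Doe" <jane@example.com>'
Get-AddressFromHeader 'Jane Doe <jane@example.com>'
Get-AddressFromHeader 'jane@example.com'
```

All three calls return jane@example.com. Because $Matches is only read inside an if that succeeded, a header that matches neither pattern yields $null.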
Depending on how many emails you're dealing with, this may take a while.
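Part of the wait can come from how the results are collected: in PowerShell, += on an array builds a new array and copies every existing element each time, so the cost grows along with the collection. Here is a sketch of one alternative, which lets the pipeline gather the objects in a single assignment. The sample data is made up for illustration and stands in for the parsed messages.

```powershell
# Stand-in for the parsed messages (hypothetical data, not real mail).
$fakeSizes = 1..1000

# Assigning the pipeline's output collects every emitted object in one go,
# instead of re-allocating the array on each iteration.
$emails = $fakeSizes | ForEach-Object {
    New-Object PSObject -Property @{
        Subject = "Message $_"
        Size    = $_ * 10
    }
}

$emails.Count   # 1000
```

The same shape works for the real loop: emit each PSObject from the script block and assign the whole pipeline to $emails.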
By the way, the set of fields you can pull out of a CDO.Message object (the fields defined in "urn:schemas:mailheader") is documented on MSDN.
Now that we've got a massive CSV full of data about our email, we can start to break it down and analyze it. More on that in the next post.