Mining Your Gmail Data - Part 2

Download all your Gmail

So, step one in analyzing your Gmail data is to pull it all down to your machine. I'm using Gmvault to retrieve my messages. Depending on how many messages you have stored in Gmail, you might want to start with the -t quick option, which by default only grabs the last 10 days of mail. Once you're sure you've got everything set up correctly, you can run the full command. Gmvault also compresses the data by default, so when you pull it down for analysis you'll want the --no-compression option. If you've been using Gmail for over a decade like I have, you'll have time to get a cup of coffee once the process starts.
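As a rough sketch, the two runs described above might look like this from a PowerShell prompt. The address is a placeholder, and this assumes Gmvault is installed, on your PATH, and already authorized for your account:

```powershell
# Quick trial run: only the most recent mail (placeholder address).
gmvault sync -t quick your.name@gmail.com

# Full download, stored uncompressed so the .eml files can be read directly.
gmvault sync --no-compression your.name@gmail.com
```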

Pull all the data together in one place

Okay, I've got about 43,000 .eml files (and their accompanying metadata) on disk; now what?

If you're using the Gmvault defaults on a machine running Windows 7 or later, the actual data will be stored in C:\Users\[username]\gmvault-db.
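Before going further, it can help to confirm the folder is where you expect and see how many messages it holds. This is a minimal sketch that assumes the default location under your user profile; adjust the path if you told Gmvault to store its database somewhere else:

```powershell
# Assumes Gmvault's default database location under the current user's profile.
$gmvaultdb = Join-Path $env:USERPROFILE 'gmvault-db'

if (Test-Path $gmvaultdb) {
    # Count every .eml file in the backup, including subfolders.
    $emlCount = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | Measure-Object).Count
    Write-Host "Found $emlCount .eml files under $gmvaultdb"
} else {
    Write-Host "No Gmvault database found at $gmvaultdb"
}
```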

The next step is to write a script to chew through the folders full of .eml files and aggregate their metadata into one place. This can be done with a pretty simple PowerShell script and CDO (Collaboration Data Objects). CDO is an older Microsoft API that, as far as I know, is pretty much guaranteed to be available on your machine; I last used it on Windows 2000, and I'm doing this on Windows 10:

$gmvaultdb = "[path to your gmvault db folder goes here]"

# Find every .eml file once, so the count and the loop use the same list.
$files = @(Get-ChildItem $gmvaultdb -Recurse -Filter *.eml)
$total = $files.Count
$i = 0

$emails = foreach ($file in $files) {
    $i++
    Write-Host "Processing $($file.Name) ($i of $total)"

    # Load the raw message into a stream and let CDO parse it.
    $adoDbStream = New-Object -ComObject ADODB.Stream
    $adoDbStream.Open()
    $adoDbStream.LoadFromFile($file.FullName)
    $cdoMessage = New-Object -ComObject CDO.Message
    $cdoMessage.DataSource.OpenObject($adoDbStream, "_Stream")

    # Pull the bare address out of headers like '"Name" <user@example.com>'.
    # Fall back to the whole header so a failed match never reuses the previous address.
    $from = $cdoMessage.Fields.Item("urn:schemas:mailheader:from").Value
    $fromEmail = if ($from -match '<([^<>]+)>\s*$') { $Matches[1] } else { $from }

    # Emit one object per message; the foreach collects them into $emails.
    New-Object PSObject -Property @{
        Size = $file.Length
        To = $cdoMessage.Fields.Item("urn:schemas:mailheader:to").Value
        From = $from
        FromEmail = $fromEmail
        Subject = $cdoMessage.Fields.Item("urn:schemas:mailheader:subject").Value
        Received = $cdoMessage.Fields.Item("urn:schemas:mailheader:received").Value
    }

    $adoDbStream.Close()
}

$emails | Select-Object To, From, Subject, Size, FromEmail | Export-Csv emaildata.csv -NoTypeInformation

The script retrieves every single .eml file in the Gmvault backup folders and loads each one up as a CDO Message object. CDO takes care of all the parsing for us, and we can just extract the fields we care about. We pull those fields into a giant array of PSObjects and then use the magical Export-Csv command to create a comma-separated file with all of our email data.

Depending on how many emails you're dealing with, this may take a while.
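Once the export finishes, a quick sanity check on the CSV can catch problems before any real analysis. This is only a sketch; it assumes emaildata.csv is in the current directory and has the columns the script exports:

```powershell
# Assumes emaildata.csv is in the current directory, as written by the export step.
$rows = @(Import-Csv emaildata.csv)
Write-Host "Loaded $($rows.Count) rows"

# Show the first few rows to confirm the columns look right.
$rows | Select-Object -First 5 | Format-Table FromEmail, Subject, Size -AutoSize
```

Keep in mind that Import-Csv reads every column back as a string, so Size needs a cast (for example [int]$row.Size) before you do any arithmetic with it.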

By the way, the set of fields you can pull out of a CDO.Message object (the fields defined in "urn:schemas:mailheader") is documented on MSDN.
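If you'd rather see what a particular message actually contains, you can also dump the fields CDO parsed from a single file. This is a sketch under a couple of assumptions: the path is a placeholder for any one .eml file in your backup, and it relies on PowerShell being able to enumerate the COM Fields collection in a pipeline, which may behave differently on your setup:

```powershell
# Placeholder path: point this at any one .eml file from your backup.
$emlPath = "[path to a single .eml file goes here]"

# Load and parse the message the same way the main script does.
$stream = New-Object -ComObject ADODB.Stream
$stream.Open()
$stream.LoadFromFile($emlPath)
$message = New-Object -ComObject CDO.Message
$message.DataSource.OpenObject($stream, "_Stream")

# Print the name and value of each field in the mailheader namespace.
$message.Fields |
    Where-Object { $_.Name -like 'urn:schemas:mailheader:*' } |
    ForEach-Object { "{0} = {1}" -f $_.Name, $_.Value }

$stream.Close()
```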

Now that we've got a massive CSV full of data about our email, we can start to break it down and analyze it. More on that in the next post.