Mining Your Gmail Data - Part 4

I mentioned last time that I needed to look into the email size data to see what part attachments were playing the in total size of email I had from each address. This turned out to be a bit more complicated than I expected, but I did find a new tool to add to the email data mining arsenal.

Getting information about attachments via CDO is difficult. Finding things like the MIME type and the file name isn't terribly difficult, but digging down to the actual contents so I could see how large the attachments are quickly turned into an exercise in frustration with out-of-date COM documentation.

Enter MimeKit, an excellent open-source library for parsing email. It's .NET, it's fast, it's got great documentation, and it's got an object model which makes sense. And it's available as a nuget package, which means using it from PowerShell is a breeze.

I just open PowerShell to the folder where I'm writing my email processing scripts and run the following command:

nuget install MimeKit

(This assumes you've got NuGet in your path.) This will create a MimeKit folder with everything you need to use it. To access it from a PowerShell script, we just need to use the Add-Type command: Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll".

MimeKit can read an .eml file from disk, so loading up a message and gathering data about it is as simple as $mimeMessage = [MimeKit.MimeMessage]::Load([filename]). The attachments are available as a collection, and MimeKit's Attachment object can be accessed as a stream, which means we can easily determine its length. Which is what I'll need for my next data set - I want to gather the total size of all attachments sent by each email address.

Here's the script:

$emails = @()

$gmvaultdb = "C:\Users\[username]\gmvault-db\db"

$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | measure).Count

Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll"

Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i=0} {
    Write-Host "Processing $_ ($i of $total)"

    $mimeMessage = [MimeKit.MimeMessage]::Load($_.FullName)

    $attachments = @()

    $mimeMessage.Attachments | % {

        $attachment = @{
            Id = $_.ContentId
            FileName = $_.ContentDisposition.FileName
            Length = $_.ContentObject.Stream.Length
            ContentType = $_.ContentType
        }

        $attachments += (New-Object PSObject -Property $attachment)
    }    

    $mimeMessage.From.ToString() -match '"\s<(.*)>$' | Out-Null;
    $fromEmail = $Matches[1]

    $props = @{
        Id = $mimeMessage.MessageId
        Size = $_.Length
        To = $mimeMessage.To.ToString()
        From = $mimeMessage.From.ToString()
        FromEmail = $fromEmail
        Subject = $mimeMessage.Subject
        TotalAttachmentSize = ($attachments | measure 'Length' -Sum).Sum
    }

    $emails += (New-Object PSObject -Property $props)

    $i++
}

$emails | Select Id, To, From, FromEmail, Subject, Size, TotalAttachmentSize | Export-Csv attachments1.csv -NoTypeInformation

A couple of things to note:

  • These sizes are the actual Base64-encoded sizes within the emails. The actual files would be smaller on disk. But for our purposes (smaller backups, freeing up Gmail space, etc.) we care about the encoded storage size.
  • The 'From' data here is a little misleading; this script is assuming a single 'From' address for every message, even though multiple addresses are possible. I don't know that I've ever gotten an email with multiple 'From' addresses (I should check that!), but if I have it's so rare that I'm willing to write off that error for now.
  • This accounts for multiple attachments in a single email (thus the measure 'Length' -Sum call); eventually I might break out attachments in the data individually so I can see things like "how many PDFs do I get" or "how many images do I get". But I need to learn some more stuff about data wrangling in pandas first.

Now that I've got the data in CSV format, I can slurp it into pandas to see how much of my email is file attachments. Here's my python script:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../attachments1.csv', header = 0)

# Fill in zero for all the TotalAttachmentSize entries with nothing in them
df = df.fillna(0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[my email]"')

# Determine the top 30 senders by total attachment sizes
fromEmailByAttachmentSize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'TotalAttachmentSize' : np.sum, 'FromEmail' : np.size}) 
    .sort('TotalAttachmentSize',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the TotalAttachmentSize in another column
fromEmailByAttachmentSize['HR'] = map(humanfriendly.format_size, fromEmailByAttachmentSize['TotalAttachmentSize'])

The biggest bit of new pandas knowledge in this script is the fillna() method, which sets TotalAttachmentSize to zero for me in all the rows where there are no attachments.

The rest is simple; I'm merely grouping the 'From' addresses together and summing their TotalAttachmentSize values, then grabbing the top 30. Unsurprisingly, the top of the list is all relatives and friends; I'm guessing it's mostly photos. I'll dive into breaking that data down more in the next email data post.