Mining Your Gmail Data - Part 5

Last time I talked about mining my Gmail data, I figured out that most of my received attachments (by size) were from friends and family. This time I'm going to break down the data a little more so I can figure out what types of data are in those attachments. On a practical level, I could reduce my Gmail backup size by deleting old messages with photos I already have backed up elsewhere. In reality, I just want to screw around with the charting capabilities in pandas.

This time I need all the data broken down by individual attachments. The script for this is similar to the last one, but instead of aggregating the attachments per message, I'm keeping them separate. I'm also grabbing some additional data, like the MIME type of each attachment and the file name. And because I'll need counts of individual attachments, I need to make sure I've got a unique ID for them. I can't use the message ID, because a single message can have multiple attachments. And I can't use the file name, because that may be repeated across messages. So I'm concatenating the two into a new field called AttachmentId. I'm also ignoring all the emails without any attachments. Here's the new script:

$emails = @()
$gmvaultdb = "C:\Users\hartez\gmvault-db\db"
$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | measure).Count
Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll"

Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i=0} {
    Write-Host "Processing $_ ($i of $total)"
    $mimeMessage = [MimeKit.MimeMessage]::Load($_.FullName)
    $attachments = @()

    $mimeMessage.Attachments | % {
        $attachment = @{
            Id = $mimeMessage.MessageId + $_.ContentDisposition.FileName
            FileName = $_.ContentDisposition.FileName
            ContentTypeName = $_.ContentType.Name
            MimeType = $_.ContentType.MimeType
            MediaType = $_.ContentType.MediaType
            MediaSubtype = $_.ContentType.MediaSubtype 
            Length = $_.ContentObject.Stream.Length
            ContentType = $_.ContentType
        }
        $attachments += (New-Object PSObject -Property $attachment)
    }    

    $mimeMessage.From.ToString() -match '"\s<(.*)>$' | Out-Null;
    $fromEmail = $Matches[1]

    if($attachments.Count -gt 0) {
        $attachments | % {
            $props = @{
                Id = $mimeMessage.MessageId
                To = $mimeMessage.To.ToString()
                From = $mimeMessage.From.ToString()
                FromEmail = $fromEmail
                AttachmentId = $_.Id
                FileName = $_.FileName
                ContentTypeName = $_.ContentTypeName
                MimeType = $_.MimeType
                MediaType = $_.MediaType
                MediaSubtype = $_.MediaSubtype 
                Size = $_.Length
                ContentType = $_.ContentType
            }
        }

        $emails += (New-Object PSObject -Property $props)
    } 

    $i++
}

$emails | Select Id, To, From, FromEmail, FileName, AttachmentId, ContentTypeName, MimeType, MediaType, MediaSubtype, Size, ContentType | Export-Csv attachments2.csv -NoTypeInformation

In pandas, I'm doing the usual import of the data and filtering down to messages which aren't from myself. Then I group the data by MIME type:

    types = notFromMe.groupby(['MimeType'])

and create aggregate columns for the total number of attachments of each type and the total size of those attachments:

    types = types.agg({'AttachmentId' : 'count', 'Size' : 'sum'})

I'll also need the total number of attachments and the total size of all attachments so I can calculate percentages:

    totalCount = types['AttachmentId'].sum()
    totalSize = types['Size'].sum()

With that in place, adding the percentage columns is easy:

    types['percentCount'] = types['AttachmentId'] / totalCount
    types['percentSize'] = types['Size'] / totalSize

At the top of the script I've already imported matplotlib:

    import matplotlib.pyplot as plt
    plt.style.use('ggplot')

Now we're all set to plot this data so I can see the relative sizes of each file type. I'll start with 'percentSize':

    types['percentSize'].plot(kind='pie', figsize=(6, 6))
    plt.show()

Attachment Types by %, First Try

Wow. That's ... pretty ugly. Let's see if we can un-clutter that a bit by combining all of the really low-percentage stuff into a slice called "other". While we're at it, we'll create a method that we can re-use when we want to graph the percentages by count:

    def combinedPlot(df, col, cutoff):

        # Just get the mime types which are 1% or more
        overCutoff = df[col]
        overCutoff = df.query(col + '>' + str(cutoff))

        # Fill in the 'other' section
        remaining = 1 - (major[col].sum())
        other = pd.DataFrame({col : pd.Series([remaining], index = ['other'])})

        # Add the 'other' section to our main data
        both = overCutoff.append(other)

        # Plot it
        both[col].plot(kind='pie', figsize=(6, 6))
        plt.show()

To use it, we just drop in our types data frame, the column we want to graph, and where we want to cut off the data:

    combinedPlot(types, 'percentSize', 0.01)

Attachment Types by %, Cleaned Up

That's much better. But it raises a couple of questions:

  1. What's with application/octet-stream? That could be any sort of file; can we dig down into that?
  2. Can we get some more succinct labels? We don't really need the 'application/; and 'video/' prefixes.

This is already long; we'll get to those questions next time.

Mining Your Gmail Data - Part 4

I mentioned last time that I needed to look into the email size data to see what part attachments were playing the in total size of email I had from each address. This turned out to be a bit more complicated than I expected, but I did find a new tool to add to the email data mining arsenal.

Getting information about attachments via CDO is difficult. Finding things like the MIME type and the file name isn't terribly difficult, but digging down to the actual contents so I could see how large the attachments are quickly turned into an exercise in frustration with out-of-date COM documentation.

Enter MimeKit, an excellent open-source library for parsing email. It's .NET, it's fast, it's got great documentation, and it's got an object model which makes sense. And it's available as a nuget package, which means using it from PowerShell is a breeze.

I just open PowerShell to the folder where I'm writing my email processing scripts and run the following command:

nuget install MimeKit

(This assumes you've got NuGet in your path.) This will create a MimeKit folder with everything you need to use it. To access it from a PowerShell script, we just need to use the Add-Type command: Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll".

MimeKit can read an .eml file from disk, so loading up a message and gathering data about it is as simple as $mimeMessage = [MimeKit.MimeMessage]::Load([filename]). The attachments are available as a collection, and MimeKit's Attachment object can be accessed as a stream, which means we can easily determine its length. Which is what I'll need for my next data set - I want to gather the total size of all attachments sent by each email address.

Here's the script:

$emails = @()

$gmvaultdb = "C:\Users\[username]\gmvault-db\db"

$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | measure).Count

Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll"

Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i=0} {
    Write-Host "Processing $_ ($i of $total)"

    $mimeMessage = [MimeKit.MimeMessage]::Load($_.FullName)

    $attachments = @()

    $mimeMessage.Attachments | % {

        $attachment = @{
            Id = $_.ContentId
            FileName = $_.ContentDisposition.FileName
            Length = $_.ContentObject.Stream.Length
            ContentType = $_.ContentType
        }

        $attachments += (New-Object PSObject -Property $attachment)
    }    

    $mimeMessage.From.ToString() -match '"\s<(.*)>$' | Out-Null;
    $fromEmail = $Matches[1]

    $props = @{
        Id = $mimeMessage.MessageId
        Size = $_.Length
        To = $mimeMessage.To.ToString()
        From = $mimeMessage.From.ToString()
        FromEmail = $fromEmail
        Subject = $mimeMessage.Subject
        TotalAttachmentSize = ($attachments | measure 'Length' -Sum).Sum
    }

    $emails += (New-Object PSObject -Property $props)

    $i++
}

$emails | Select Id, To, From, FromEmail, Subject, Size, TotalAttachmentSize | Export-Csv attachments1.csv -NoTypeInformation

A couple of things to note:

  • These sizes are the actual Base64-encoded sizes within the emails. The actual files would be smaller on disk. But for our purposes (smaller backups, freeing up Gmail space, etc.) we care about the encoded storage size.
  • The 'From' data here is a little misleading; this script is assuming a single 'From' address for every message, even though multiple addresses are possible. I don't know that I've ever gotten an email with multiple 'From' addresses (I should check that!), but if I have it's so rare that I'm willing to write off that error for now.
  • This accounts for multiple attachments in a single email (thus the measure 'Length' -Sum call); eventually I might break out attachments in the data individually so I can see things like "how many PDFs do I get" or "how many images do I get". But I need to learn some more stuff about data wrangling in pandas first.

Now that I've got the data in CSV format, I can slurp it into pandas to see how much of my email is file attachments. Here's my python script:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../attachments1.csv', header = 0)

# Fill in zero for all the TotalAttachmentSize entries with nothing in them
df = df.fillna(0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[my email]"')

# Determine the top 30 senders by total attachment sizes
fromEmailByAttachmentSize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'TotalAttachmentSize' : np.sum, 'FromEmail' : np.size}) 
    .sort('TotalAttachmentSize',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the TotalAttachmentSize in another column
fromEmailByAttachmentSize['HR'] = map(humanfriendly.format_size, fromEmailByAttachmentSize['TotalAttachmentSize'])

The biggest bit of new pandas knowledge in this script is the fillna() method, which sets TotalAttachmentSize to zero for me in all the rows where there are no attachments.

The rest is simple; I'm merely grouping the 'From' addresses together and summing their TotalAttachmentSize values, then grabbing the top 30. Unsurprisingly, the top of the list is all relatives and friends; I'm guessing it's mostly photos. I'll dive into breaking that data down more in the next email data post.

Mining Your Gmail Data - Part 3

My foray into Gmvault was originally spurred by a desire to back up my Gmail. It occurs to me that I probably have a lot of stuff in there I don't really care about. I'm not a data-retention purist who needs every single notification email I've ever received from Twitter or Facebook; I'm happy to delete a lot of that junk to make my off-site backup smaller and reduce my bandwidth a bit.

But going through a decade of email to figure out what I can safely delete is painful. It'd be nice to get a report of the largest offenders with regard to junk mail so I can delete them en masse from Gmail before running the backup. At the same time, this gives me a starting point for creating a filter in Gmail which I can use to label emails as "Safe To Delete". This way, I can continue archiving them like normal, and if my backup starts to get ugly or I start running short on Gmail space, I can search for "SafeToDelete" and "older than 30 days" and remove the junk.

So that's the first project - grab a backup of all my Gmail since 2005 and analyze it for the source addresses which have sent me the most mail, both by message size and total number of messages.

Depending on how many emails we're dealing with, we might be able to work out the answers to our questions using Excel. But I'm learning pandas, so I'm going to use that. Also, some of the more complicated analysis I want to do later on should be much, much easier in pandas, which is designed for this kind of data crunching.

(BTW, if you're a Visual Studio fan like me, VS supports the heck out of Python these days. I'm using Python Tools for Visual Studio and it's amazing.)

So here's some quick-and-dirty python to get a list of the top 20 email addresses by count of message received and by total size of messages received:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../emaildata.csv', header = 0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[your email here]"')

# Determine the top 20 senders by total email size
fromEmailBySize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('Size',ascending=False) 
    .head(30) )

# Determine the top 20 senders by email count
fromEmailByCount = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('FromEmail',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the Size in another column
fromEmailBySize['HR'] = map(humanfriendly.format_size, fromEmailBySize['Size'])
fromEmailByCount['HR'] = map(humanfriendly.format_size, fromEmailByCount['Size'])

This is fairly straightforward. First we slurp the CSV file into a data frame. Then we filter out the 'sent' email so it doesn't pollute our results.

To get the top 30 senders by email size, we group all the data by FromEmail and aggregate the sum of all the sizes. Then we sort it by the Size field and grab the top 30. To get the top 30 by count, we do basically the same thing, but we sort it by the count instead.

In the last two lines I'm mapping the format_size function from the humanfriendly library on to the Size columns to do the work of displaying the sizes in nice, human-readable format. Here are a couple of (slightly redacted) screenshots of my results:

My email, sorted by received message count

My email, sorted by the sum of the message sizes

So looking at fromEmailByCount, I see a couple of obvious contenders for deletion. I don't need to back up those notification emails from Facebook, for example, or the back issues of Thrillist, or the topic notifications from Stack Exchange. Eventually I'll go into Gmail, search for anything from those addresses, and delete them.

fromEmailBySize reveals some other interesting data. I tend to want to keep emails from actual people (as opposed to notification robots), but some of these people are probably near the top of the list because of a few attachments. If the attachment is something that isn't important (or that I already have backed up somewhere in another format), it probably makes sense to delete it rather than keeping it in the backup. I think the next phase of this project will be getting some more data on attachments.

Mining Your Gmail Data - Part 2

Download all your Gmail

So, step one in analyzing your Gmail data is to pull it all down to your machine. I'm using Gmvault to retrieve them all. Depending on how many messages you have stored in Gmail, you might want to start with the -t quick option, which only grabs the (by default) last 10 days of mail. Once you're sure you've got everything set up correctly, you can run the full command. Also, Gmvault compresses the data by default, so when you pull it down for analysis you'll want the --no-compression option. If you've been using Gmail for over a decade like I have, once you start the process you'll have some time to get a cup of coffee.

Pull all the data together in one place

Okay, I've got about 43,000 .eml files (and their accompanying metadata) on disk; now what?

If you're using the Gmvault defaults on a Windows 7 or above machine, the actual data will be stored in C:\Users\[username]\gmvault-db.

The next step is to write a script to chew through the folders full of .eml files and aggregate their metadata into one place. This can be done with a pretty simple PowerShell script and CDO (Collaboration Data Objects, which is an older Microsoft API and as far as I know is pretty much guaranteed to be available on your machine; I last used it on Windows 2000, and I'm doing this on Windows 10):

$emails = @()

$gmvaultdb = "[path to your gmvault db folder goes here]"

$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | measure).Count

Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i=0} {
    Write-Host "Processing $_ ($i of $total)"
    $adoDbStream = New-Object -ComObject ADODB.Stream
    $adoDbStream.Open()
    $adoDbStream.LoadFromFile($_.FullName)
    $cdoMessage = New-Object -ComObject CDO.Message
    $cdoMessage.DataSource.OpenObject($adoDbStream, "_Stream")

    $cdoMessage.Fields.Item("urn:schemas:mailheader:from").Value -match '"\s<(.*)>$' | Out-Null;
    $fromEmail = $Matches[1]

    $props = @{
        Size = $_.Length
        To = $cdoMessage.Fields.Item("urn:schemas:mailheader:to").Value
        From = $cdoMessage.Fields.Item("urn:schemas:mailheader:from").Value
        FromEmail = $fromEmail   
        Subject = $cdoMessage.Fields.Item("urn:schemas:mailheader:subject").Value
        Received = $cdoMessage.Fields.Item("urn:schemas:mailheader:received").Value
    }

    $emails += (New-Object PSObject -Property $props)
    $i++
} 

$emails | Select To, From, Subject, Size, FromEmail | Export-Csv emaildata.csv -NoTypeInformation

The script retrieves every single .eml file in the Gmvault backup folders and loads each one up as a CDO Message object. CDO takes care of all the parsing for us, and we can just extract the fields we care about. We pull those fields into a giant array of PSObjects and then use the magical Export-Csv command to create a comma-separated file with all of our email data.

Depending on how many emails you're dealing with, this may take a while.

By the way, the set of fields you can pull out of a CDO.Message object (the fields defined in "urn : schemas : mailheader") are documented on MSDN.

Now that we've got a massive CSV full of data about our email, we can start to break it down and analyze it. More on that in the next post.