Mining Your Gmail Data - Part 3

My foray into Gmvault was originally spurred by a desire to back up my Gmail. It occurs to me that I probably have a lot of stuff in there I don't really care about. I'm not a data-retention purist who needs every single notification email I've ever received from Twitter or Facebook; I'm happy to delete a lot of that junk to make my off-site backup smaller and reduce my bandwidth a bit.

But going through a decade of email to figure out what I can safely delete is painful. It'd be nice to get a report of the largest offenders with regard to junk mail so I can delete them en masse from Gmail before running the backup. At the same time, this gives me a starting point for creating a filter in Gmail which I can use to label emails as "Safe To Delete". This way, I can continue archiving them like normal, and if my backup starts to get ugly or I start running short on Gmail space, I can search for "SafeToDelete" and "older than 30 days" and remove the junk.

So that's the first project - grab a backup of all my Gmail since 2005 and analyze it for the source addresses which have sent me the most mail, both by message size and total number of messages.

Depending on how many emails we're dealing with, we might be able to work out the answers to our questions using Excel. But I'm learning pandas, so I'm going to use that. Also, some of the more complicated analysis I want to do later on should be much, much easier in pandas, which is designed for this kind of data crunching.

(BTW, if you're a Visual Studio fan like me, VS supports the heck out of Python these days. I'm using Python Tools for Visual Studio and it's amazing.)

So here's some quick-and-dirty python to get a list of the top 20 email addresses by count of message received and by total size of messages received:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../emaildata.csv', header = 0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[your email here]"')

# Determine the top 20 senders by total email size
fromEmailBySize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('Size',ascending=False) 
    .head(30) )

# Determine the top 20 senders by email count
fromEmailByCount = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('FromEmail',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the Size in another column
fromEmailBySize['HR'] = map(humanfriendly.format_size, fromEmailBySize['Size'])
fromEmailByCount['HR'] = map(humanfriendly.format_size, fromEmailByCount['Size'])

This is fairly straightforward. First we slurp the CSV file into a data frame. Then we filter out the 'sent' email so it doesn't pollute our results.

To get the top 30 senders by email size, we group all the data by FromEmail and aggregate the sum of all the sizes. Then we sort it by the Size field and grab the top 30. To get the top 30 by count, we do basically the same thing, but we sort it by the count instead.

In the last two lines I'm mapping the format_size function from the humanfriendly library on to the Size columns to do the work of displaying the sizes in nice, human-readable format. Here are a couple of (slightly redacted) screenshots of my results:

My email, sorted by received message count

My email, sorted by the sum of the message sizes

So looking at fromEmailByCount, I see a couple of obvious contenders for deletion. I don't need to back up those notification emails from Facebook, for example, or the back issues of Thrillist, or the topic notifications from Stack Exchange. Eventually I'll go into Gmail, search for anything from those addresses, and delete them.

fromEmailBySize reveals some other interesting data. I tend to want to keep emails from actual people (as opposed to notification robots), but some of these people are probably near the top of the list because of a few attachments. If the attachment is something that isn't important (or that I already have backed up somewhere in another format), it probably makes sense to delete it rather than keeping it in the backup. I think the next phase of this project will be getting some more data on attachments.