When Do I Send Emails?

Work and the holidays have been distracting me from blogging and fun side projects for a couple of months, so I'm easing back into it with a really quick and easy Gmail data-wrangling post.

When I first started playing around with my Gmail data, I mentioned that I wanted to get some of the stats that Xobni used to provide before they were swallowed by the Yahoo! black hole. A couple of the simpler stats to compile are "what days of the week do I send most of my email?" and "what time of day do I send the most emails?".

In order to do any time-based analysis on my emails, I'm going to need the dates and times they were sent. So I've taken the PowerShell script from a while back and made a slight modification; in the $props hash I'm adding a field called SentDate:

$props = @{
    Id = $mimeMessage.MessageId
    To = $mimeMessage.To.ToString()
    From = $mimeMessage.From.ToString()
    FromEmail = $fromEmail
    Subject = $mimeMessage.Subject
    Body = $bodyText
    Format = $actualFormat
    SentDate = $mimeMessage.Date.ToUniversalTime()
}

MimeKit provides the sent date of the email as a DateTimeOffset; to keep things consistent, I'm converting everything to UTC at this stage.

From there, I import the data into pandas as per usual and filter it down to just the emails sent by me:

import pandas as pd
import numpy as np
import humanfriendly
import matplotlib.pyplot as plt
plt.style.use('ggplot')

df = pd.read_csv('../times.csv', header = 0)

fromMe = df.query('FromEmail == "[my email]"')

It turns out that indexing your data by date/time in pandas is pretty easy; you just create a DateTimeIndex:

temp = pd.DatetimeIndex(fromMe['SentDate'], tz = 'UTC').tz_convert('America/Denver')

Here I'm telling pandas to create an index by the SentDate field, and that the field is already in UTC. Then I'm converting all of those dates and times to my local timezone so that the data makes sense from my local perspective. This mostly works, because I mostly live in the Mountain timezone. Some of my data will be a little skewed because of emails sent while traveling and a few months where I lived in the Eastern time zone, but not so much that I care. In a later post I might look at how this data changes over time, which is more interesting (I might even be able to identify when and where I was traveling based on that data).

But for now, let's break down the data in temp and shove it back into the original dataset:

fromMe['DayOfWeek'] = temp.dayofweek
fromMe['Hour'] = temp.hour
fromMe['Year'] = temp.year
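
One pandas gotcha worth flagging: fromMe came out of df.query(), which can hand back a view, so these assignments may trip a SettingWithCopyWarning. A minimal defensive sketch (reusing the frames from above) is to take an explicit copy before adding columns:

# Take an explicit copy so the new columns land on an independent frame
fromMe = df.query('FromEmail == "[my email]"').copy()
fromMe['DayOfWeek'] = temp.dayofweek
fromMe['Hour'] = temp.hour
fromMe['Year'] = temp.year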

Now for each email from me, I've got a column that tells me what hour of the day and day of the week it was sent. From there, aggregating it and charting it are a snap:

# Number of emails sent by day of week
sentDayOfWeek = fromMe.groupby(['DayOfWeek']).agg({'Id' : 'count'})
sentDayOfWeek['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Day Of Week')
plt.show()

# Number of emails sent by hour of day
sentHourOfDay = fromMe.groupby(['Hour']).agg({'Id' : 'count'})
sentHourOfDay['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Hour Of Day')
plt.show()

The data is about what I'd expect; more emails on Monday than any other day (0 == Monday on this chart) and the majority of emails sent during the workday (with a dip around lunch).
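
If you'd rather see day names than numbers on that chart, a small sketch (reusing the sentDayOfWeek frame from above) relabels the index before plotting:

# Map pandas' dayofweek numbering (0 == Monday) to readable labels
dayNames = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sentDayOfWeek.index = [dayNames[d] for d in sentDayOfWeek.index]
sentDayOfWeek['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Day Of Week')
plt.show()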

Aggregating by year provides a bit of mystery, though:

sentYear = fromMe.groupby(['Year']).agg({'Id' : 'count'})
sentYear['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Year')
plt.show()

The numbers vary quite a bit more than I'd expect. 2004 makes sense; I only started using Gmail in July of that year. And the next couple of years show me using Gmail more and more over my old Lycos account. The spike in 2011 also seems reasonable, as that's when I stopped working at an office with an Exchange server, so my day-to-day email load shifted. But the dips in 2012 and 2015? No idea. I'll have to dig further into those.

How Long Would It Take To Read All My Email?

This is part of a series on mining your own Gmail data.

For this post I want to tackle a fun question: how long would it take to read all of my email if that's all I did, 24/7? It's one of those questions that should interest anyone who's concerned about information overload or is looking to pare down their information consumption: "Just how much of my time is theoretically committed to my inbox?"

First, the obvious: nobody actually does this. No one reads every email they receive from start to finish (as anyone who's dealt with email in a corporate environment knows all too well). Most of us have filters (both electronic and mental) set up to glean the info we need and skip the rest.

And I'll bet that a lot of email is written without any expectation that the whole thing will be read; the author may be well aware that different portions of the email are relevant to different recipients, or that the email itself will only be interesting to a subset of the mailing list (e.g., many marketing emails).

So it's not the one super-relevant data point that should make people completely re-think their information consumption habits or anything like that. But it is fun to think about, and as one data point among many others it might prove interesting or useful.

On to the fun part - actually coming up with a number!

Like most people, I'm getting new emails all the time. So technically I should be taking into account all the new emails I receive while I'm still reading through my old ones. But that's hard, so I'm not going to bother. Instead, I'm just going to assume I've stopped getting emails at all while I'm reading. Which means that getting a basic number is easy - I just have to count all the words in all my emails, divide that by the number of words per minute I read, and I've got the number of minutes it would take to read everything.

The first thing I need to do is go back to my PowerShell script and pull in the body of each email. This is where we hit the first snag - HTML emails.

For doing word counts, I really don't want to look at HTML emails, because there's a ton of junk in there which a human won't be reading. Luckily, most email clients which send HTML emails also include a text version; in those cases, we'll just extract that text portion of the email and ignore the HTML. Unfortunately, this isn't always the case; when there's not a text version available, we'll just have to get the HTML and figure out how to deal with it later.

As usual, MimeKit will be doing most of the work. This version of the script is pretty similar to our previous ones, except that we have to loop through the possible body formats for each message to figure out which formats are available. We always check for the 'Text' format first, because that's the one we really want. If that's not available, we run through the others until we find one that works.

The relevant changes are the hash of the possible formats, which we use for iteration and for tracking the number of emails of each type:

$formats = @{
    [MimeKit.Text.TextFormat]::Text = 0; 
    [MimeKit.Text.TextFormat]::Flowed = 0;
    [MimeKit.Text.TextFormat]::Html = 0; 
    [MimeKit.Text.TextFormat]::Enriched = 0; 
    [MimeKit.Text.TextFormat]::RichText = 0; 
    [MimeKit.Text.TextFormat]::CompressedRichText = 0
}

And the section where we determine what the actual format is and store it:

    $bodyText = $null
    $actualFormat = $null

    # Run through all the enumeration values
    # The pipe through sort ensures that we check them in the enum order,
    # which is great because we prefer text over flowed over HTML, etc.
    $formats.Keys | sort | % { 
        # try each Format until we find one that works
        if($actualFormat -eq $null) { 
            # Try to get the body in the current format          
            $bodyText = $mimeMessage.GetTextBody($_)
            if($bodyText) {
                $actualFormat = $_
            } 
        }
    }

    if($actualFormat -eq $null) {
        $unknownFormat += 1;
        $actualFormat = "Unknown"
    } else {
        $formats[$actualFormat] += 1;
    }

You can find the full script here.

A couple of notes:

  1. This isn't perfect; sometimes MimeKit can't really figure out what the format is. For example, I have some Skype notification emails which MimeKit thinks are HTML only, but are in fact text. I'm not sure why MimeKit gets confused (probably incorrect headers in the original emails), but out of about 43,000 emails only a couple dozen seem to have issues, so I'm not going to worry about it.
  2. In all of my emails, the only two formats returned were Text and HTML. This might have something to do with what Gmail supports; I've seen some posts that suggest Gmail doesn't support Flowed, though those may be outdated. In any case, I'm only really dealing with Text and HTML in my word counts.

Once we've got the data, we can load it up in pandas and do some counting. A naive count of the words in the plain-text emails is trivial: we define a function that calls Python's split with None as the delimiter (so it splits on any run of whitespace) and take the length of the resulting list. Here's what textWordCount looks like:

def textWordCount(text):
    if not(isinstance(text, str)):
        return 0

    return len(text.split(None))
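
A quick sanity check, in the spirit of a usage example (hypothetical strings, obviously):

print(textWordCount('Hello  world\nfoo'))  # 3: split(None) collapses runs of whitespace
print(textWordCount(None))                 # 0: non-strings count as zero words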

But the HTML emails are problematic because most of the content is markup that the user will never actually read. So we need to strip all that markup out and just count the words in the text portions of the HTML. To do that, we create another method which parses the HTML email content using the amazing Beautiful Soup library, strips away the style, script, head, and title parts, and extracts the text from what's left using get_text(). Once we've got the actual human-readable text, we can run it through our usual word counting method:

from bs4 import BeautifulSoup as bsoup

def htmlWordCount(text):
    if not(isinstance(text, str)):
        return 0

    soup = bsoup(text, 'html.parser')

    if soup is None:
        return 0

    # Strip out the markup a human reader never sees
    [s.extract() for s in soup(['style', 'script', 'head', 'title'])]

    stripped = soup.get_text(" ", strip=True)

    return textWordCount(stripped)
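
And a similar spot-check for the HTML counter; words in the stripped-out title shouldn't be counted (again, a hypothetical sample):

sample = '<html><head><title>Ignore me</title></head><body><p>Hello world</p></body></html>'
print(htmlWordCount(sample))  # 2: only 'Hello world' survives the stripping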

I took a couple of online tests to get an idea of how fast I read and came up with 350 words per minute. With that bit of data, we can now add some more columns to our data and figure out the total time to read all the emails:

def wordCount(row):
    if(row['Format'] == 'Html'):
        return htmlWordCount(row['Body'])

    return textWordCount(row['Body'])

averageWordsPerMinute = 350

# Count the words in each message body
emails['WordCount'] = emails.apply(wordCount, axis=1)
emails['MinutesToRead'] = emails['WordCount'] / averageWordsPerMinute

# Get total number of minutes required to read all these emails
totalMinutes = emails['MinutesToRead'].sum()

# And convert that to a more human-readable timespan
timeToRead = humanfriendly.format_timespan(totalMinutes * 60)
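
Note that format_timespan expects seconds, hence the multiplication by 60:

# humanfriendly works in seconds, not minutes
print(humanfriendly.format_timespan(3900))  # something like '1 hour and 5 minutes'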

The full script is here, if you're playing at home.

Running that against all of my Gmail gives me:

>>> timeToRead
'2 weeks, 6 days and 18 hours'

So if I sat down and read at my fastest speed 24/7 for three weeks straight with no breaks, no sleep, and never slowing down, I could finish reading every word of every email I've ever received in my Gmail account. If I only read them 8 hours a day, it'd take me about 9 weeks to finish.
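
That conversion is easy to sanity-check:

# 2 weeks, 6 days and 18 hours of round-the-clock reading...
hours = (2 * 7 + 6) * 24 + 18    # 498 hours
# ...spread over 8-hour reading days instead
print(hours / 8.0)      # ~62 days
print(hours / 8.0 / 7)  # ~8.9 weeks, call it 9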

That's actually less than I expected, though "two whole months of your life spent just reading your email" is still a bit sobering.

Sobering enough that I'm not going to try to compute this for my other four email accounts, anyway.

Mining Your Gmail Data - Part 4

I mentioned last time that I needed to look into the email size data to see what part attachments were playing in the total size of email I had from each address. This turned out to be a bit more complicated than I expected, but I did find a new tool to add to the email data mining arsenal.

Getting information about attachments via CDO is a pain. Finding things like the MIME type and the file name isn't too bad, but digging down to the actual contents to see how large the attachments are quickly turned into an exercise in frustration with out-of-date COM documentation.

Enter MimeKit, an excellent open-source library for parsing email. It's .NET, it's fast, it's got great documentation, and it's got an object model that makes sense. And it's available as a NuGet package, which means using it from PowerShell is a breeze.

I just open PowerShell to the folder where I'm writing my email processing scripts and run the following command:

nuget install MimeKit

(This assumes you've got NuGet in your path.) This will create a MimeKit folder with everything you need to use it. To access it from a PowerShell script, we just need to use the Add-Type command: Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll".

MimeKit can read an .eml file from disk, so loading up a message and gathering data about it is as simple as $mimeMessage = [MimeKit.MimeMessage]::Load([filename]). The attachments are available as a collection, and MimeKit's Attachment object can be accessed as a stream, which means we can easily determine its length. Which is what I'll need for my next data set - I want to gather the total size of all attachments sent by each email address.

Here's the script:

$emails = @()

$gmvaultdb = "C:\Users\[username]\gmvault-db\db"

$total = (Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | measure).Count

Add-Type -Path "MimeKit.1.2.10.0\lib\net45\MimeKit.dll"

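# ForEach-Object below takes two script blocks: the first is the -Begin block
# (runs once, seeding the counter), the second is the -Process block run per file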
Get-ChildItem $gmvaultdb -Recurse -Filter *.eml | ForEach-Object {$i=0} {
    Write-Host "Processing $_ ($i of $total)"

    $mimeMessage = [MimeKit.MimeMessage]::Load($_.FullName)

    $attachments = @()

    $mimeMessage.Attachments | % {

        $attachment = @{
            Id = $_.ContentId
            FileName = $_.ContentDisposition.FileName
            Length = $_.ContentObject.Stream.Length
            ContentType = $_.ContentType
        }

        $attachments += (New-Object PSObject -Property $attachment)
    }    

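    # Extract the bare address from a From header shaped like: "Name" <user@example.com>
    # (if the header doesn't match this shape, $Matches will hold stale data from the
    # previous message; $mimeMessage.From.Mailboxes would be a sturdier alternative)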
    $mimeMessage.From.ToString() -match '"\s<(.*)>$' | Out-Null;
    $fromEmail = $Matches[1]

    $props = @{
        Id = $mimeMessage.MessageId
        Size = $_.Length
        To = $mimeMessage.To.ToString()
        From = $mimeMessage.From.ToString()
        FromEmail = $fromEmail
        Subject = $mimeMessage.Subject
        TotalAttachmentSize = ($attachments | measure 'Length' -Sum).Sum
    }

    $emails += (New-Object PSObject -Property $props)

    $i++
}

$emails | Select Id, To, From, FromEmail, Subject, Size, TotalAttachmentSize | Export-Csv attachments1.csv -NoTypeInformation

A few things to note:

  • These sizes are the Base64-encoded sizes as stored in the emails; the decoded files would be smaller on disk. But for our purposes (smaller backups, freeing up Gmail space, etc.) the encoded storage size is the one we care about. (For a rough feel for the overhead, see the sketch after this list.)
  • The 'From' data here is a little misleading; this script is assuming a single 'From' address for every message, even though multiple addresses are possible. I don't know that I've ever gotten an email with multiple 'From' addresses (I should check that!), but if I have it's so rare that I'm willing to write off that error for now.
  • This accounts for multiple attachments in a single email (thus the measure 'Length' -Sum call); eventually I might break out attachments in the data individually so I can see things like "how many PDFs do I get" or "how many images do I get". But I need to learn some more stuff about data wrangling in pandas first.
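
About that Base64 overhead from the first note: the encoding turns every 3 bytes of payload into 4 characters, so an encoded attachment runs roughly a third larger than the file it carries (slightly more once line breaks are added). A quick hypothetical sketch:

# Rough decoded size for a Base64-encoded attachment length
def approxDecodedSize(encodedLength):
    # every 4 encoded characters carry 3 bytes of payload
    return encodedLength * 3 // 4

print(approxDecodedSize(1398104))  # a ~1.4 MB encoded blob is ~1 MB decoded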

Now that I've got the data in CSV format, I can slurp it into pandas to see how much of my email is file attachments. Here's my python script:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../attachments1.csv', header = 0)

# Fill in zero for all the TotalAttachmentSize entries with nothing in them
df = df.fillna(0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[my email]"')

# Determine the top 30 senders by total attachment sizes
fromEmailByAttachmentSize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'TotalAttachmentSize' : np.sum, 'FromEmail' : np.size}) 
    .sort('TotalAttachmentSize',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the TotalAttachmentSize in another column
fromEmailByAttachmentSize['HR'] = map(humanfriendly.format_size, fromEmailByAttachmentSize['TotalAttachmentSize'])
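
A compatibility footnote if you're following along with newer tools: this was written against an older pandas and Python 2. DataFrame.sort has since been removed in favor of sort_values, and Python 3's map returns an iterator rather than a list, so the last line wouldn't build a column as-is. A sketch of the same steps in modern form:

# Modern-pandas take on the aggregation above (sort_values replaced sort)
fromEmailByAttachmentSize = (notFromMe.groupby('FromEmail')
    .agg(TotalAttachmentSize=('TotalAttachmentSize', 'sum'),
         Count=('TotalAttachmentSize', 'size'))
    .sort_values('TotalAttachmentSize', ascending=False)
    .head(30))

# .apply sidesteps the map() difference between Python 2 and 3
fromEmailByAttachmentSize['HR'] = fromEmailByAttachmentSize['TotalAttachmentSize'].apply(humanfriendly.format_size)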

The biggest bit of new pandas knowledge in this script is the fillna() method, which sets TotalAttachmentSize to zero for me in all the rows where there are no attachments.
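
One subtlety: as written, fillna(0) zero-fills every NaN in the entire frame, not just that one column. It's harmless here, but a more surgical sketch targets only the column we care about:

# Zero-fill only the attachment size column
df['TotalAttachmentSize'] = df['TotalAttachmentSize'].fillna(0)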

The rest is simple; I'm merely grouping the 'From' addresses together and summing their TotalAttachmentSize values, then grabbing the top 30. Unsurprisingly, the top of the list is all relatives and friends; I'm guessing it's mostly photos. I'll dive into breaking that data down more in the next email data post.

Mining Your Gmail Data - Part 3

My foray into Gmvault was originally spurred by a desire to back up my Gmail. It occurs to me that I probably have a lot of stuff in there I don't really care about. I'm not a data-retention purist who needs every single notification email I've ever received from Twitter or Facebook; I'm happy to delete a lot of that junk to make my off-site backup smaller and reduce my bandwidth a bit.

But going through a decade of email to figure out what I can safely delete is painful. It'd be nice to get a report of the largest offenders with regard to junk mail so I can delete them en masse from Gmail before running the backup. At the same time, this gives me a starting point for creating a filter in Gmail which I can use to label emails as "Safe To Delete". This way, I can continue archiving them like normal, and if my backup starts to get ugly or I start running short on Gmail space, I can search for "SafeToDelete" and "older than 30 days" and remove the junk.

So that's the first project - grab a backup of all my Gmail since 2004 and analyze it for the source addresses which have sent me the most mail, both by message size and total number of messages.

Depending on how many emails we're dealing with, we might be able to work out the answers to our questions using Excel. But I'm learning pandas, so I'm going to use that. Also, some of the more complicated analysis I want to do later on should be much, much easier in pandas, which is designed for this kind of data crunching.

(BTW, if you're a Visual Studio fan like me, VS supports the heck out of Python these days. I'm using Python Tools for Visual Studio and it's amazing.)

So here's some quick-and-dirty python to get a list of the top 30 email addresses by count of messages received and by total size of messages received:

import pandas as pd
import numpy as np
import humanfriendly

# Read in our email data file
df = pd.read_csv('../emaildata.csv', header = 0)

# Filter out sent mail 
notFromMe = df.query('FromEmail != "[your email here]"')

# Determine the top 30 senders by total email size
fromEmailBySize = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('Size',ascending=False) 
    .head(30) )

# Determine the top 30 senders by email count
fromEmailByCount = ( notFromMe.groupby(by=['FromEmail']) 
    .agg({'Size' : np.sum, 'FromEmail' : np.size}) 
    .sort('FromEmail',ascending=False) 
    .head(30) )

# Add a 'human readable' version of the Size in another column
fromEmailBySize['HR'] = map(humanfriendly.format_size, fromEmailBySize['Size'])
fromEmailByCount['HR'] = map(humanfriendly.format_size, fromEmailByCount['Size'])

This is fairly straightforward. First we slurp the CSV file into a data frame. Then we filter out the 'sent' email so it doesn't pollute our results.

To get the top 30 senders by email size, we group all the data by FromEmail and aggregate the sum of all the sizes. Then we sort it by the Size field and grab the top 30. To get the top 30 by count, we do basically the same thing, but we sort it by the count instead.

In the last two lines I'm mapping the format_size function from the humanfriendly library onto the Size columns to do the work of displaying the sizes in a nice, human-readable format. Here are a couple of (slightly redacted) screenshots of my results:

[Screenshot: my email, sorted by received message count]

[Screenshot: my email, sorted by the sum of the message sizes]

So looking at fromEmailByCount, I see a couple of obvious contenders for deletion. I don't need to back up those notification emails from Facebook, for example, or the back issues of Thrillist, or the topic notifications from Stack Exchange. Eventually I'll go into Gmail, search for anything from those addresses, and delete them.

fromEmailBySize reveals some other interesting data. I tend to want to keep emails from actual people (as opposed to notification robots), but some of these people are probably near the top of the list because of a few attachments. If the attachment is something that isn't important (or that I already have backed up somewhere in another format), it probably makes sense to delete it rather than keeping it in the backup. I think the next phase of this project will be getting some more data on attachments.