When Do I Send Emails?

Work and the holidays have been distracting me from blogging and fun side projects for a couple of months, so I'm easing back into it with a really quick and easy Gmail data-wrangling post.

When I first started playing around with my Gmail data, I mentioned that I wanted to get some of the stats that Xobni used to provide before they were swallowed by the Yahoo! black hole. A couple of the simpler stats to compile are "what days of the week do I send most of my email?" and "what time of day do I send the most emails?".

In order to do any time-based analysis on my emails, I'm going to need the dates and times they were sent. So I've taken the PowerShell script from a while back and made a slight modification; in the $props hash I'm adding a field called SentDate:

$props = @{
    Id = $mimeMessage.MessageId
    To = $mimeMessage.To.ToString()
    From = $mimeMessage.From.ToString()
    FromEmail = $fromEmail
    Subject = $mimeMessage.Subject
    Body = $bodyText
    Format = $actualFormat
    SentDate = $mimeMessage.Date.ToUniversalTime()
}

MimeKit provides the sent date of the email as a DateTimeOffset; to keep things consistent, I'm converting everything to UTC at this stage.

From there, I import the data into pandas as per usual and filter it down to just the emails sent by me:

import pandas as pd
import numpy as np
import humanfriendly
import matplotlib.pyplot as plt
plt.style.use('ggplot')

df = pd.read_csv('../times.csv', header = 0)

fromMe = df.query('FromEmail == "[my email]"')

It turns out that indexing your data by date/time in pandas is pretty easy; you just create a DateTimeIndex:

temp = pd.DatetimeIndex(fromMe['SentDate'], tz = 'UTC').tz_convert('America/Denver')

Here I'm telling pandas to create an index by the SentDate field, and that the field is already in UTC. Then I'm converting all of those dates and times to my local timezone so that the data makes sense from my local perspective. This mostly works, because I mostly live in the Mountain timezone. Some of my data will be a little skewed because of emails sent while traveling and a few months where I lived in the Eastern time zone, but not so much that I care. In a later post I might look at how this data changes over time, which is more interesting (I might even be able to identify when and where I was traveling based on that data).

But for now, let's break down the data in temp and shove it back into the original dataset:

fromMe['DayOfWeek'] = temp.dayofweek
fromMe['Hour'] = temp.hour
fromMe['Year'] = temp.year

Now for each email from me, I've got a column that tells me what hour of the day and day of the week it was sent. From there, aggregating it and charting it are a snap:

# Number of emails sent by day of week
sentDayOfWeek = fromMe.groupby(['DayOfWeek']).agg({'Id' : 'count'})
sentDayOfWeek['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Day Of Week')
plt.show()

# Number of emails sent by hour of day
sentHourOfDay = fromMe.groupby(['Hour']).agg({'Id' : 'count'})
sentHourOfDay['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Hour Of Day')
plt.show()

The data is about what I'd expect; more emails on Monday than any other day (0 == Monday on this chart) and the majority of emails sent during the workday (with a dip around lunch).

Aggregating by year provides a bit of mystery, though:

sentYear = fromMe.groupby(['Year']).agg({'Id' : 'count'})
sentYear['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Year')
plt.show()
The numbers vary quite a bit more than I'd expect. 2004 makes sense; I only started using Gmail in July of that year. And the next couple of years shows me using Gmail more and more over my old Lycos account. The spike in 2011 also seems reasonable, as that's when I stopped working at an office with an Exchange server, so my day-to-day email load shifted. But the dips in 2012 and 2015? No idea. I'll have to dig further into those.