When Do I Send Emails?
/Work and the holidays have been distracting me from blogging and fun side projects for a couple of months, so I'm easing back into it with a really quick and easy Gmail data-wrangling post.
When I first started playing around with my Gmail data, I mentioned that I wanted to get some of the stats that Xobni used to provide before they were swallowed by the Yahoo! black hole. A couple of the simpler stats to compile are "what days of the week do I send most of my email?" and "what time of day do I send the most emails?".
In order to do any time-based analysis on my emails, I'm going to need the dates and times they were sent. So I've taken the PowerShell script from a while back and made a slight modification; in the $props
hash I'm adding a field called SentDate
:
$props = @{
Id = $mimeMessage.MessageId
To = $mimeMessage.To.ToString()
From = $mimeMessage.From.ToString()
FromEmail = $fromEmail
Subject = $mimeMessage.Subject
Body = $bodyText
Format = $actualFormat
SentDate = $mimeMessage.Date.ToUniversalTime()
}
MimeKit provides the sent date of the email as a DateTimeOffset; to keep things consistent, I'm converting everything to UTC at this stage.
From there, I import the data into pandas as per usual and filter it down to just the emails sent by me:
import pandas as pd
import numpy as np
import humanfriendly
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('../times.csv', header = 0)
fromMe = df.query('FromEmail == "[my email]"')
It turns out that indexing your data by date/time in pandas is pretty easy; you just create a DateTimeIndex:
temp = pd.DatetimeIndex(fromMe['SentDate'], tz = 'UTC').tz_convert('America/Denver')
Here I'm telling pandas to create an index by the SentDate
field, and that the field is already in UTC. Then I'm converting all of those dates and times to my local timezone so that the data makes sense from my local perspective. This mostly works, because I mostly live in the Mountain timezone. Some of my data will be a little skewed because of emails sent while traveling and a few months where I lived in the Eastern time zone, but not so much that I care. In a later post I might look at how this data changes over time, which is more interesting (I might even be able to identify when and where I was traveling based on that data).
But for now, let's break down the data in temp and shove it back into the original dataset:
fromMe['DayOfWeek'] = temp.dayofweek
fromMe['Hour'] = temp.hour
fromMe['Year'] = temp.year
Now for each email from me, I've got a column that tells me what hour of the day and day of the week it was sent. From there, aggregating it and charting it are a snap:
# Number of emails sent by day of week
sentDayOfWeek = fromMe.groupby(['DayOfWeek']).agg({'Id' : 'count'})
sentDayOfWeek['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Day Of Week')
plt.show()
# Number of emails sent by hour of day
sentHourOfDay = fromMe.groupby(['Hour']).agg({'Id' : 'count'})
sentHourOfDay['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Hour Of Day')
plt.show()
The data is about what I'd expect; more emails on Monday than any other day (0 == Monday on this chart) and the majority of emails sent during the workday (with a dip around lunch).
Aggregating by year provides a bit of mystery, though:
sentYear = fromMe.groupby(['Year']).agg({'Id' : 'count'})
sentYear['Id'].plot(kind='bar', figsize=(6, 6), title='Emails Sent By Year')
plt.show()