Creating A Fake Me From My Emails

Some Twitter bot or other got me thinking about Markov chains the other day (in the text generator sense), and it occurred to me that it shouldn't be too hard to create one that (poorly) simulates me.

Markov chains are basically a set of states and the probabilities of moving from one state to another. If you build one out of a body of text, you can map the likelihood of a given set of words following another set of words. The upshot is that if you start from a random set of words and follow the map (choosing your next state at each node randomly, in proportion to how often it followed the current state in the original text), you can end up with something that sounds (sort of) like it came from the original body of text. It's a popular way to create Twitter bots. Markov chains have other, much more practical uses, but I'm not concerned with them today.
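
To make that concrete, here is a minimal sketch of a word-level chain. It is not the generator used later in this post; the function names (build_chain, generate), the order of 2, and the sample sentence are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        # Duplicates are kept on purpose: a follower seen twice is twice as likely.
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Start from a random state and follow the map until `length` words or a dead end."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break  # this state only appeared at the very end of the text
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return ' '.join(output)


# Hypothetical sample text; a real run would use a much larger body of text
sample = "the cat sat on the mat and the cat ate the fish and the cat sat on the fish"
chain = build_chain(sample, order=2)
print(generate(chain, length=12, seed=1))
```

A higher order gives more coherent output, but with a small body of text it mostly just replays the original sentences.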

I've got 10 years of my emails already sitting in a .csv file; step one was loading them up and discarding all the ones I didn't send. After that, most of the work was cleaning up the data - most of the stuff in the bodies of my emails is actually pretty useless for this purpose. I had to remove all the quoted parts of other people's messages, all the HTML messages (even when cleaned up, they polluted the Markov chain too much), 'Forwarded Message' sections, URLs, and my own signatures.

After passing all the emails through the removeJunk function below, I globbed all the texts together into one giant string and fed it into this Markov generator from Amanda Pickering. With that done, I could just call generate_words() over and over to see what kind of nonsense fake me would spew out.

So here's my final code for taking my emails and creating a fake me, Black Mirror-style:

import pandas as pd
import re
from markovgenerator import MarkovGenerator

# Read in our email data file
df = pd.read_csv('../bodytext.csv', header = 0)

# Only use mail I sent 
emails = df.query('FromEmail == "[my email]"').copy()

# Blank out any missing body text
# (assign the result back; an inplace fillna on emails.Body may not update the DataFrame)
emails['Body'] = emails['Body'].fillna(' ')

# Regexes for truncating messages
# If any of these are found, the rest of the message is stuff I didn't write
# Raw strings keep backslashes like \s from being treated as string escapes
quoteHeaderRegex = re.compile(r'On.*?wrote:', re.DOTALL)
originalMessageRegex = re.compile(r'^\s?-.*?(Original|Forwarded) Message.*?-\s?$', re.MULTILINE | re.IGNORECASE)
htmlRegex = re.compile(r'^<html>', re.MULTILINE)
googleReaderRegex = re.compile(r'^E\.Z\. Hart - Google Reader', re.MULTILINE)

# Other things in emails that aren't relevant
# If these are found, replace them with empty string
fromAndToRegex = re.compile(r'^(from:|to:|sent:).*?$', re.MULTILINE | re.IGNORECASE)
sigRegex = re.compile(r'^-[-\s]{1,4}E\.Z\.', re.MULTILINE)
dividerRegex = re.compile(r'-{3,}')
urlRegex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')


def markovText(row):
    text = row['Body']

    # HTML messages polluted the chain too much, so skip them entirely
    if row['Format'] == 'Html':
        return ''

    return removeJunk(text)

def removeJunk(text):
    text = stripAfter(text, quoteHeaderRegex)
    text = stripAfter(text, originalMessageRegex)
    text = stripAfter(text, googleReaderRegex)
    text = stripAfter(text, htmlRegex)

    text = re.sub(fromAndToRegex, '', text)
    text = re.sub(sigRegex, '', text)
    text = re.sub(dividerRegex, '', text)
    text = re.sub(urlRegex, '', text)

    return text

def stripAfter(text, regex):
    target = regex.search(text)
    if target:
        return text[:target.start()]
    return text

# Run all the emails through the cleanup function
emails['Markov'] = emails.apply(markovText, axis=1)

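# Concatenate all the emails into one giant input string
# (joined with a space so the last word of one email doesn't fuse with the first word of the next)
corpus = emails['Markov'].str.cat(sep=' ')

markov_gen = MarkovGenerator(corpus, 200, 3)
markov_gen.generate_words()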
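
One thing worth checking if you adapt the cleanup to your own mail: the quote header pattern above matches "On" anywhere in the text, including inside words like "Once", so a message can be cut off much earlier than the quoted reply. The sketch below compares it with a tighter variant. The anchored pattern, the strip_after helper, and the sample message are assumptions for illustration, not part of the code I actually ran.

```python
import re

# The pattern from the script above: "On", then anything (newlines included), then "wrote:"
loose = re.compile(r'On.*?wrote:', re.DOTALL)

# Hypothetical tighter variant: "On" has to start a line and be a whole word
anchored = re.compile(r'^On\b.*?wrote:', re.DOTALL | re.MULTILINE)


def strip_after(text, regex):
    """Return the text before the first match, or all of it if nothing matches."""
    match = regex.search(text)
    return text[:match.start()] if match else text


# Made-up message whose reply header wraps onto a second line
message = (
    "Once the lease is signed I'll send the check.\n"
    "\n"
    "On Tue, Aug 5, 2014, Someone <someone@example.com>\n"
    "wrote:\n"
    "> When can you pay?\n"
)

print(repr(strip_after(message, loose)))     # '' (the match starts at "Once", so everything is dropped)
print(repr(strip_after(message, anchored)))  # keeps the first sentence, drops the quoted reply
```

The anchored version is still a heuristic: a sentence of your own that starts a line with "On" and is later followed by "wrote:" would get truncated too.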

And here are a few of my favorite phrases from the results:

"Use cheap rum. Cheap rum is going to get the crab wontons- otherwise I can't guarantee your safety:)"

"And more important than anything else, has been what has kept me employed and made me successful. Anyway, I'm glad you took your flashlight."

"We will begin working on the changes Tory has asked for, and I'll eventually start going full troll without her around:) Okay."

"You're receiving this email because you're rewriting 10,000 lines of code that solved the first two weeks of August while in between leases."

"At this point I'll need jumping and, ideally, that signs be installed? How would I go about making that request? Again, we completely understand if you don't have any SharePoint development experience, just experience as a user in each role just to make sure it was knitting and not crocheting- I don't know football that well, but any of them, they might actually turn into assets , though."

Have fun creating your own email doppelgängers, but remember - cheap rum is going to get the crab wontons. I can't guarantee your safety.