Backing Up GitHub Code to a Synology

One of the items that's been lurking on my todo list for a long, long time is setting up a nightly backup of all the code I have hosted on GitHub and Bitbucket. I'm getting more and more concerned with the stability, security, and longevity of cloud services every day. While I think both these organizations are doing an excellent job, it's always possible that something catastrophic will happen, especially if attackers become sufficiently determined.

The goal of this little project was to get my Synology (a DS414slim) to retrieve all my source code each day and include it in my nightly offsite backup (using CrashPlan). That way, there are three copies of my work in three different locations at any given time. An event big enough to make me lose all my source code at once is probably an event big enough that my source code no longer matters.

I'll start with my GitHub backup, since it's the simpler of the two: I only host publicly available code there (no private repositories). Because my Bitbucket repos are private, the process of retrieving them is a bit more involved.

I started with Petr Trofimov's GitHub backup script and made some modifications to get it working on my Synology. I'm using the SynoCommunity git package, so step one was to add the full path to git when calling xargs (I was too lazy to muck around with getting the PATH set correctly). Next was adding the per_page=100 parameter to the initial retrieval of the repo list. This ensures that I get all my repos in one go and don't have to deal with multiple pages. It'll break when I hit the 101st repo, but that's likely a long way off.
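
The script below already hard-codes the SynoCommunity path (/volume1/@appstore/git/bin/git), but if git ended up somewhere else on your box, a quick find over the package directory (run over SSH) will track it down:

# Locate the git binary installed by the package
# (SynoCommunity packages typically live under /volume1/@appstore,
# but the volume may differ on your setup)
find /volume1/@appstore -name git -type f 2>/dev/null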

I hard-coded my username and the destination folders into the script to make things simpler, and (after a few tries) modified the tar command to use xz compression to save some space. I also added a clean-up step at the end of the script to remove archives older than 30 days.
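
As a quick sanity check after a run, listing the archive's contents confirms the clones actually made it in; the filename below is just an example for one date:

# List the contents of one night's archive (example filename; adjust the date)
tar -Jtf /volume1/homes/ez/documents/github-backup/github_hartez_20150101.tar.xz | head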

Another big change is the string of grep -v commands through which the list of repositories gets piped. A few of the repos I've forked are pretty large, and I'm pretty sure that Stripe, for example, takes care of their own backups. So I've filtered out the bigger repos that I feel are probably safe from a major GitHub hiccup in order to save a bit of bandwidth and storage.
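
The same API response reports a size field (in kilobytes) for each repository, which makes it easy to spot the big forks before deciding what to add to the grep -v list; a rough way to eyeball it:

# Show each repo's full_name alongside its reported size (in KB)
curl -s "https://api.github.com/users/hartez/repos?type=owner&per_page=100" | grep -E '"(full_name|size)":'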

With this script running nightly in the task scheduler, I'm now comfortable that my hard work won't be lost if something goes terribly, terribly wrong with GitHub. And it was pretty easy to set up; I'm mostly annoyed that I spent so long with this sitting on my todo list.
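
If you'd rather skip the Task Scheduler UI, the same thing can be done straight from cron. This is only a sketch, the script path is hypothetical, and on DSM the /etc/crontab fields must be tab-separated and the cron service restarted (the exact restart command varies by DSM version) before the change takes effect:

# Hypothetical /etc/crontab entry: run the backup at 2:30 AM nightly
# (shown with spaces for readability; DSM expects tabs between fields)
30 2 * * * root /volume1/homes/ez/scripts/github-backup.sh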

Here's my full modified script:

#!/bin/sh

set -ex

USER="hartez"
API_URL="https://api.github.com/users/${USER}/repos?type=owner&per_page=100"
DATE=$(date +"%Y%m%d")
TEMP_FOLDER="backup"
TEMP_PATH="/volume1/homes/ez/documents/github-backup"
TEMP_FULL_PATH="${TEMP_PATH}/${TEMP_FOLDER}"
BACKUP_FILE="github_${USER}_${DATE}.tar.xz"

# Clean up previous backup stuff in case something went wrong (e.g. power outage)
rm -rf "$TEMP_FULL_PATH"

mkdir "$TEMP_FULL_PATH" && cd "$TEMP_FULL_PATH"
curl -s "$API_URL" | grep -Eo '"git_url": "[^"]+"' | grep -v "[Ff]ubu" | grep -v "bottles" | grep -v "stripe\.net" | awk '{print $2}' | xargs -n 1 /volume1/@appstore/git/bin/git clone
cd "$TEMP_PATH"
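# Roll the clones up into a single xz-compressed archive, then drop the working copies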
tar -Jcf "$BACKUP_FILE" --directory="$TEMP_PATH" "$TEMP_FOLDER"
rm -rf "$TEMP_FULL_PATH"

# Clean up backups over 30 days old
find "$TEMP_PATH" -name "*.xz" -type f -mtime +30 -delete
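
For completeness, getting code back out of one of these archives is just an extract; the archive name below is an example, and the clones land under the backup folder as ordinary working copies:

cd /volume1/homes/ez/documents/github-backup
# Extract one night's archive (example filename); repos end up in ./backup/<repo-name>
tar -Jxf github_hartez_20150101.tar.xz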

Next time I'll talk about the Bitbucket process, which took considerably more work.