Backing Up Bitbucket Code to a Synology

In my last post, I talked about getting my Synology to do a nightly backup of all my GitHub repos. Since all of them are public, it's a pretty simple process to clone them all locally. This time, I want to talk about Bitbucket, which is where I keep all my private repos. Because they're private, and because some of them are owned by teams, the process is a bit more involved.

I started with an excellent Bash script from the BibSonomy team which covers most of the bases; it can retrieve private repos along with all of their issues and wiki entries, and it works with both git and mercurial. My Synology runs BusyBox, which means that the default shell is the Almquist shell (ash), not Bash. Rather than install Bash, I just rewrote the script a bit so that it would work in ash - mostly this was just a matter of changing how the functions were defined.
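
For example, a Bash-style function definition like this one won't parse in ash (the function name here is just illustrative, not one from the actual script):

function backup_repo {
    echo "backing up $1"
}

The POSIX form, which ash is perfectly happy with, looks like this:

backup_repo() {
    echo "backing up $1"
}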

The other challenge was that the original script assumed that the box it was running on already had the correct SSL CA certificates installed for curl. This wasn't the case for my Synology, so I grabbed cacert.pem from the curl site and dropped it into the folder with the script. Then I modified the script to call curl with the --cacert cacert.pem option. It's quick and dirty, but it worked for getting this process up and running.
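
The change itself is tiny - each curl call in the script just picks up the extra flag, along the lines of this sketch (API_URL here stands in for whichever Bitbucket endpoint the script happens to be hitting):

# cacert.pem lives in the same folder as the script
curl --cacert cacert.pem -s "$API_URL"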

The script basically has four dependencies: curl, jq, git, and hg (mercurial). curl and jq were already installed on my Synology. Setting up git and mercurial is easy - just use the packages provided by SynoCommunity. The mercurial package added hg to my path; for git you'll need to add /volume1/@appstore/git/bin to your path or set it before you call the script (export PATH="/volume1/@appstore/git/bin:$PATH").
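
If you want to sanity-check all of that before wiring the script into a scheduled task, a quick ash one-off like this will do it (the git path is the one from my setup; yours may differ):

#!/bin/ash
export PATH="/volume1/@appstore/git/bin:$PATH"
for tool in curl jq git hg; do
    if ! command -v "$tool" > /dev/null 2>&1; then
        echo "missing dependency: $tool" >&2
        exit 1
    fi
done
echo "all dependencies found"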

The full version of the modified bitbucket-backup.sh script can be found here.

To retrieve your personal repos, you can set USER_NAME to your Bitbucket user name and API_KEY to your password before calling the script:

export USER_NAME=hartez
export API_KEY=[password]
./bitbucket-backup.sh

For teams, you'll need to generate an API key by going to "Manage Team" and selecting API key under "Access Management". Then you can retrieve everything for your team:

export USER_NAME=[team name]
export API_KEY=[api key]
./bitbucket-backup.sh

I do this in a script which handles all of my teams (4 at the moment), zips everything up, and cleans up older archives:

#!/bin/ash

cd /volume1/homes/ez/documents/bitbucket-backup

# Add git to the path temporarily
export PATH="/volume1/@appstore/git/bin:$PATH"

# Use the current date and time to name the working dir and archive
DEST=$(date "+%Y%m%d%H%M%S")
export BACKUP_LOCATION=$DEST

# My Stuff
export USER_NAME=hartez
export API_KEY=[password]
./bitbucket-backup.sh 

# Teams
export USER_NAME=[team 1]
export API_KEY=[api key]
./bitbucket-backup.sh 

#... 

# Archive everything
BACKUP_FILE="${DEST}.tar.xz"
tar -Jcf "$BACKUP_FILE" "$DEST"

# Clean up the working directory
rm -rf "$DEST"

# Clean up backups over 30 days old
find /volume1/homes/ez/documents/bitbucket-backup/*.xz -type f -mtime +30 -delete

That's all there is to it.

It's actually quite a relief to know that I've got all this code backed up locally and off-site. I don't really expect anything to go tragically wrong with either Bitbucket or GitHub, but until I got these backups running I hadn't realized how much psychic weight there was in that tiny, nagging worry that just maybe all my work would get lost. It's definitely worth a little time to set this sort of thing up, if just for the peace of mind. Hopefully these articles will make it a bit easier for someone else.

Update: I was checking up on my backup the other day and noticed that my personal projects (the ones which weren't under a group) weren't backing up anymore. It turns out that the above solution isn't compatible with personal accounts if you've got 2-factor authentication enabled. Since there's no option to set up an API key for personal accounts at this point, getting this working with 2FA in place would require jumping through a bunch of OAuth hoops. Hopefully application specific passwords will be implemented soon and I can move to using that feature; if not, eventually I'm going to be spending some painful hours dealing with OAuth. Either way, I'll put up another post detailing the process.

Backing Up GitHub Code to a Synology

One of the items that's been lurking on my todo list for a long, long time is setting up a nightly backup of all the code I have hosted on GitHub and Bitbucket. I'm getting more and more concerned with the stability, security, and longevity of cloud services every day. While I think both these organizations are doing an excellent job, it's always possible that something catastrophic will happen, especially if attackers become sufficiently determined.

The goal of this little project was to get my Synology (a DS414slim) to retrieve all my source code each day and include it in my nightly offsite backup (using CrashPlan). That way, there are three copies of my work in three different locations at any given time. An event big enough to make me lose all my source code at once is probably an event big enough that my source code no longer matters.

I'll start with my GitHub backup - it's the simpler of the two, since I only host publicly available code there (no private repositories). Because my Bitbucket repos are private, the process of retrieving them is a bit more involved.

I started with Petr Trofimov's GitHub backup script and made some modifications to get it working on my Synology. I'm using the SynoCommunity git package, so step one was to add the full path to git when calling xargs (I was too lazy to muck around with getting the PATH set correctly). Next was adding the per_page=100 parameter to the initial retrieval of the repo list. This ensures that I get all my repos in one go and don't have to deal with multiple pages. It'll break when I hit the 101st repo, but that's likely a long way off.
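
If you want to double-check that one page really does cover everything, counting the git_url entries in the response is a rough but serviceable test (this is just a one-off check, not part of the backup script):

curl -s "https://api.github.com/users/hartez/repos?type=owner&per_page=100" | grep -c '"git_url"'

If that ever prints 100, it's time to start worrying about pagination.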

I hard-coded my username and the destination folders into the script to make things simpler, and (after a few tries) modified the tar command to use xz compression to save some space. I also added a clean-up step at the end of the script to remove archives older than 30 days.

Another big change is the string of grep -v commands which the list of repositories gets piped through. A few of the repos I've forked are pretty large, and I'm pretty sure that Stripe, for example, takes care of their own backups. So I've filtered out the bigger repos that I feel are probably safe from a major GitHub hiccup in order to reduce my bandwidth and storage a bit.

With this script running nightly in the task scheduler, I'm now comfortable that my hard work won't be lost if something goes terribly, terribly wrong with GitHub. And it was pretty easy to set up; I'm mostly annoyed that I spent so long with this sitting on my todo list.
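
If you'd rather not click through the DSM Task Scheduler UI, a nightly entry in /etc/crontab should do the same job - something like the line below (the script name and time are placeholders; my understanding is that DSM wants the fields tab-separated and cron restarted before it picks up the change):

0	3	*	*	*	root	/volume1/homes/ez/documents/github-backup/github-backup.sh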

Here's my full modified script:

#!/bin/sh

set -ex

USER="hartez"
API_URL="https://api.github.com/users/${USER}/repos?type=owner&per_page=100"
DATE=$(date +"%Y%m%d")
TEMP_FOLDER="backup"
TEMP_PATH="/volume1/homes/ez/documents/github-backup"
TEMP_FULL_PATH="${TEMP_PATH}/${TEMP_FOLDER}"
BACKUP_FILE="github_${USER}_${DATE}.tar.xz"

# Clean up previous backup stuff in case something went wrong (e.g. power outage)
rm -rf "$TEMP_FULL_PATH"

mkdir "$TEMP_FULL_PATH" && cd "$TEMP_FULL_PATH"
curl -s "$API_URL" | grep -Eo '"git_url": "[^"]+"' | grep -v "[Ff]ubu" | grep -v "bottles" | grep -v "stripe\.net" | awk '{print $2}' | xargs -n 1 /volume1/@appstore/git/bin/git clone
cd "$TEMP_PATH"
tar -Jcf "$BACKUP_FILE" --directory="$TEMP_PATH" "$TEMP_FOLDER"
rm -rf "$TEMP_FULL_PATH"

# Clean up backups over 30 days old
find /volume1/homes/ez/documents/github-backup/*.xz -type f -mtime +30 -delete

Next time I'll talk about the Bitbucket process, which took considerably more work.