Rolling My Own Time Machine

by Matt Cholick

I may have missed national backup day by a little (or maybe a lot, as it was nearly half a year ago...), but I've finally put together something more robust than copying whatever I can recall as important onto an external hard drive every few months. I wanted something cheap, secure, and offsite, so I went with Amazon Web Services, rsync, a bit of Python, and an Ubuntu server install with full disk encryption.

This solution doesn't cost much. Storage on AWS is nicely priced at $0.10 per GB-month, and I can provision only what I need. Data transfer in is free, so there's no cost there. The server only runs a few hours at a time, so the virtual machine costs pennies a month. The only significant cost is the storage itself.
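To put a number on that, here is a back-of-the-envelope calculation using the $0.10 per GB-month rate and the roughly 40GB this backup set came to. The function name is made up for illustration, and instance hours and any snapshot storage are deliberately left out.

```python
# Rough monthly storage cost. The rate comes from the post; the function
# name and the size passed in are illustrative assumptions.
RATE_PER_GB_MONTH = 0.10  # dollars per GB-month


def monthly_storage_cost(provisioned_gb, rate=RATE_PER_GB_MONTH):
    """Return the monthly storage cost in dollars for a provisioned volume."""
    return provisioned_gb * rate


cost = monthly_storage_cost(40)
print("$%.2f per month" % cost)
```

For a 40GB volume that works out to about four dollars a month.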

Amazon makes it trivial to copy a snapshot of a volume to S3, so I could even keep a set of revolving versions without additional work if I felt the need (I don't at this time).
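If I ever did want revolving versions, something along these lines would probably do. This is only a sketch, not something I run: it assumes a boto3-style EC2 client is passed in, and the function names, the description text, and the default of keeping three snapshots are all made up for illustration.

```python
# Sketch of "revolving" EBS snapshots. Assumes a boto3-style EC2 client is
# passed in; function names and the keep=3 default are illustrative.


def snapshot_volume(ec2_client, volume_id, description="backup volume"):
    """Start a snapshot of the volume and return the new snapshot's id."""
    response = ec2_client.create_snapshot(VolumeId=volume_id,
                                          Description=description)
    return response["SnapshotId"]


def prune_snapshots(ec2_client, volume_id, keep=3):
    """Delete all but the `keep` most recent snapshots of the volume."""
    response = ec2_client.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )
    # Newest first, based on the time each snapshot was started.
    snapshots = sorted(response["Snapshots"],
                       key=lambda s: s["StartTime"], reverse=True)
    deleted = []
    for snapshot in snapshots[keep:]:
        ec2_client.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
        deleted.append(snapshot["SnapshotId"])
    return deleted
```

With boto3 installed, the client would come from `boto3.client("ec2")`. The sketch reads a single page of results from `describe_snapshots`, which should be enough for a handful of snapshots of one volume.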

Going this route should be pretty secure as well. Servers don't get any more secure than when they're turned off, and with this solution I can keep my backup host turned off 99% of the time. I have also decided to encrypt the set of EBS drives the server mounts. Do I really need to encrypt my mp3 collection? Probably not, but I do have more personal things in my home directory. If you believe Amazon's assurances, this step is completely unnecessary, but I like having that extra layer for peace of mind. It was a pretty painless process anyway. I mostly followed the guide here.
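Keeping the host off except during a backup could be automated with something like the following. It is a sketch under assumptions: a boto3-style EC2 client is passed in, and `running_instance` is a name I invented. It also does nothing about unlocking the encrypted drives after boot, which would still need a passphrase.

```python
import contextlib


@contextlib.contextmanager
def running_instance(ec2_client, instance_id):
    """Start the instance, yield while it runs, and always stop it afterwards."""
    ec2_client.start_instances(InstanceIds=[instance_id])
    try:
        # Block until EC2 reports the instance as running.
        ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        yield instance_id
    finally:
        # Runs even if the backup raised, so the host never stays up by accident.
        ec2_client.stop_instances(InstanceIds=[instance_id])
```

The backup itself would then run inside a `with running_instance(client, instance_id):` block, so the server gets stopped again whether the rsync succeeds or fails.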

The initial upload did take about 24 hours for 40GB. rsync only sends the delta, though, so every future backup will take a fraction of that time. I'm happy; I finally have solid offsite backup of irreplaceable things like photos and source code.
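For a sense of what that first upload means in bandwidth terms, here is the arithmetic as a small sketch. It assumes decimal gigabytes and ignores rsync's compression, so the real rate on the wire may have been somewhat lower; the function name is illustrative.

```python
def average_rate_mbit_per_s(gigabytes, hours):
    """Average transfer rate in megabits per second (decimal units)."""
    bits = gigabytes * 1e9 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6


rate = average_rate_mbit_per_s(40, 24)
print("%.1f Mbit/s" % rate)
```

That comes to roughly 3.7 Mbit/s sustained over the day, which is about what a home connection's upstream can manage.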

And here's the script - or an abridged version anyway, as I've cut out everything but a single rsync; the rest were just more of the same. It's neither elegant nor generic nor pythonic. It was quick though, and it does accomplish the job. I'm tossing it up here on the off chance that it will give someone else a starting point.

#!/usr/bin/env python
import os, sys, getopt, subprocess

#v for verbose, r for recurse, t for preserve time, p for permissions,
#z for compression, h for human readable sizes
baseSwitches = '-vrtpzh --delete'

def usage():
    message = """
Usage: backup.py [OPTION]...

Options
 -h, --help       help
 -n, --dry-run    Dry run
     --aws=host   Specify amazon host
"""
    print(message)

def doCommand(source, dest, excludes, includes, log):
    #arguments are in the same order doSync passes them: excludes, then includes
    opts = {'source': source, 'switches': baseSwitches, 'dest': dest}

    #rsync filter rules: "- pattern" excludes, "+ pattern" includes
    opts['excludes'] = ''
    for exclude in excludes:
        opts['excludes'] += '-f "- %s" ' % exclude

    opts['includes'] = ''
    for include in includes:
        opts['includes'] += '-f "+ %s" ' % include

    #includes go first: rsync acts on the first filter rule that matches
    command = 'rsync %(switches)s %(includes)s %(excludes)s %(source)s %(dest)s' % opts

    #log.write works under both Python 2 and Python 3
    log.write(command + "\n\n\n")
    log.flush()
    p = subprocess.Popen(command, stdout=log, stderr=sys.stderr, shell=True)
    p.wait()

def main(argv):
    global baseSwitches
    user = os.environ['USER']
    if user == 'root':
        sys.stderr.write('This script should not be run as root\n')
        sys.exit(2)

    try:
        opts, args = getopt.getopt(argv, "hn", ["help", "dry-run", "aws="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)

    host = ''
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        if opt in ("-n", "--dry-run"):
            baseSwitches += " --dry-run"
        if opt == "--aws":
            host = arg

    doSync(host)

def doSync(host):
    #rotate logs, requires writable destination - /var/log wouldn't be writable by default
    baseLog = '/var/log/backup.py/backup.log'
    if os.path.exists(baseLog + '.2'):
        os.rename(baseLog + '.2', baseLog + '.3')
    if os.path.exists(baseLog + '.1'):
        os.rename(baseLog + '.1', baseLog + '.2')
    if os.path.exists(baseLog):
        os.rename(baseLog, baseLog + '.1')
    log = open(baseLog, 'w')

    baseTarget = '/media/backups/'
    if host:
        baseTarget = host + ":" + baseTarget

    log.write("\n\nBacking up home\n")
    log.write("------------------------------------------------------\n")
    log.flush()

    #save a listing of source instead of its contents - these are open source projects, no need for me to back them up
    sourcePath = '/home/mattc/source.txt'
    source = open(sourcePath, 'w')
    p = subprocess.Popen("tree -d -L 2 /home/mattc/source", stdout=source, stderr=sys.stderr, shell=True)
    p.wait()
    source.close()

    #save a listing of libs instead of their contents - downloaded libraries, no need to back them up
    libPath = '/home/mattc/lib.txt'
    lib = open(libPath, 'w')
    p = subprocess.Popen("ls -R /home/mattc/lib/", stdout=lib, stderr=sys.stderr, shell=True)
    p.wait()
    lib.close()

    #rsync of home
    excludes = ['/.*', '/temp', '/apps', '/mnt1', '/mnt2', '/mnt3', '/mnt4', '/source', 'lib', '/Dropbox', '/Music', 'lost+found']
    includes = ['/.ssh', '/source.txt', 'lib.txt']
    doCommand('/home/mattc/', baseTarget + 'mattc', excludes, includes, log)

    os.remove(sourcePath)
    os.remove(libPath)

    log.close()

if __name__ == "__main__":
    main(sys.argv[1:])
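
If I were to tidy the script up, the first thing I'd change is how the rsync command gets built. The version above pastes a string together and hands it to the shell, which means any path or pattern containing a space or quote would need escaping. Below is a sketch of the same idea using an argument list instead. The function names and defaults are invented for illustration and this is not what the script above does; `--filter` is the long form of rsync's `-f` switch.

```python
import subprocess


def build_rsync_command(source, dest, includes=(), excludes=(),
                        switches=("-vrtpzh", "--delete"), dry_run=False):
    """Return the rsync invocation as a list of arguments."""
    command = ["rsync"] + list(switches)
    if dry_run:
        command.append("--dry-run")
    # Includes first: rsync acts on the first filter rule that matches.
    for pattern in includes:
        command.append("--filter=+ " + pattern)
    for pattern in excludes:
        command.append("--filter=- " + pattern)
    command += [source, dest]
    return command


def run_rsync(command, log):
    """Run rsync with its output sent to the open log file; return the exit code."""
    log.write(" ".join(command) + "\n\n")
    log.flush()
    # No shell involved, so each list element reaches rsync exactly as written.
    return subprocess.call(command, stdout=log)
```

Returning the exit code also makes it possible to notice a failed sync, which the script above silently ignores.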