Testing Huge PostgreSQL Backups for 37 Cents a Day

I spent some time revisiting and improving Feedbin’s backup strategy recently.

I was prompted by a couple of high-profile database incidents, including those at GitLab and Instapaper.

I already felt pretty good about the backup system I had in place. I had tested backup recovery manually and felt the tools were solid.

However, if for some reason the database backup was failing or my restore procedure had a mistake, I would not know until I actually had to recover the database.

Feedbin uses WAL-E from Heroku to continually back up the database. WAL-E, and the PostgreSQL architecture that enables it, gives you offsite backups that can be recovered to any specified point in time.

WAL-E is set to back up the full database every day via cron.

#!/bin/bash
wal-e backup-push /var/lib/postgresql/9.2/main

This command pushes the database to S3. It serves as a base backup which, combined with the continuously uploaded WAL archives, lets the database be restored to any point after the base backup started.
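
The continuous WAL upload is driven by PostgreSQL's archiving settings. Feedbin's actual configuration is not shown here, so the following is only a sketch based on the settings suggested in the WAL-E README. The helper name write_archive_settings and the config path in the example are assumptions.

```shell
#!/bin/bash

# Append WAL archiving settings to a postgresql.conf file.
# Hypothetical helper: the settings mirror the WAL-E README, not
# necessarily the configuration Feedbin runs in production.
write_archive_settings() {
    local conf="$1"
    cat >> "${conf}" <<'_EOF_'
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push %p'
archive_timeout = 60
_EOF_
}

# Example (assumed path for a Debian/Ubuntu PostgreSQL 9.2 install):
# write_archive_settings /etc/postgresql/9.2/main/postgresql.conf
```

With archive_timeout set to 60, PostgreSQL switches to a new WAL segment at least once a minute, so the archive on S3 should rarely lag far behind the live database.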

WAL-E offers a counterpart command, backup-fetch, to actually restore the data from a backup-push. To test the backups, I needed to build an automated way to restore the database.

Feedbin already uses Digital Ocean for a few things, so my first thought was to use Digital Ocean for this. I wrote a script to provision a Digital Ocean server, restore the backup to it and then delete the server after the restore completed.

This turned out to be too expensive. Sending data to S3 is free, but reading it back out will cost you. For Feedbin’s database this worked out to be about $40 every time I restored the database. I wanted to test backups daily but the data transfer cost would quickly add up to about $1,200/month.

While I was looking at S3 pricing, I found out that reading from S3 is free when it is read by an EC2 instance in the same region as your S3 bucket. It was also possible to save money on the EC2 instance itself by using Spot Instances instead of on-demand. With Spot Instances you bid on your instance and AWS tells you if you can have it for that price or not.

Critically, no matter what your bid is, you never pay more than the spot price, which is “The current market price of a Spot instance per hour.” With this in mind, you don’t have to guess what to bid: if your bid matches the on-demand price, your instance will only be terminated early if the spot price rises above the on-demand price.
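
To see how much headroom a bid has, the AWS CLI can report recent spot prices. This is a minimal sketch, assuming the aws command is on the PATH with credentials configured. The function names current_spot_price and bid_covers_price are made up for illustration, and only the first record returned is printed, which may not be the availability zone you end up in.

```shell
#!/bin/bash

# Print a recent spot price (USD per hour) for an instance type.
# Asking for history starting "now" should return the prices currently
# in effect; only the first record is printed.
current_spot_price() {
    local instance_type="$1"
    local region="${2:-us-east-1}"
    aws --region "${region}" ec2 describe-spot-price-history \
        --instance-types "${instance_type}" \
        --product-descriptions "Linux/UNIX" \
        --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --query 'SpotPriceHistory[0].SpotPrice' \
        --output text
}

# Succeed when the bid is at least the given spot price.
bid_covers_price() {
    local bid="$1" price="$2"
    awk -v bid="${bid}" -v price="${price}" 'BEGIN { exit !(bid >= price) }'
}

# Example:
# price=$(current_spot_price c4.4xlarge)
# bid_covers_price 0.78 "${price}" && echo "bid covers ${price}/hr"
```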

The instance I want costs $0.78/hr so I bid $0.78/hr, but only end up paying the spot price of ~$0.18/hr.

Figure: Price of database restore with On-Demand vs. Spot Instances
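
The arithmetic behind the comparison is simple enough to sketch. The hourly prices are the ones quoted above, but the two-hour restore time is an assumption made for illustration, and instance_cost is a hypothetical helper.

```shell
#!/bin/bash

# Multiply an hourly price by a number of hours and print the
# result in dollars with two decimal places.
instance_cost() {
    local hourly_price="$1" hours="$2"
    awk -v p="${hourly_price}" -v h="${hours}" 'BEGIN { printf "%.2f\n", p * h }'
}

# Assumed two-hour restore at the prices quoted above.
echo "on-demand: \$$(instance_cost 0.78 2)"
echo "spot:      \$$(instance_cost 0.18 2)"
```

Under that assumption the spot run comes to about $0.36, in line with the 37 cents in the title, against $1.56 on-demand.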

I was new to the AWS CLI, but once I figured out the right data to send, it turned out to be a fairly simple script.

#!/bin/bash

export AWS_SHARED_CREDENTIALS_FILE=/etc/aws.d/credentials.ini

user_data=$(base64 --wrap 0 /etc/aws.d/pg_restore)
launch_specification=$(cat <<_EOF_
{
    "UserData": "${user_data}",
    "ImageId": "ami-6edd3078",
    "SecurityGroupIds": ["sg-XYXYXY"],
    "InstanceType": "c4.4xlarge",
    "SubnetId": "subnet-XYXYXY",
    "EbsOptimized": true,
    "BlockDeviceMappings": [
        {
            "DeviceName": "/dev/sda1",
            "Ebs": {
                "DeleteOnTermination": true,
                "SnapshotId": "snap-0a7be01f9503d7bee",
                "VolumeSize": 800,
                "VolumeType": "gp2"
            }
        }
    ],
    "IamInstanceProfile": {
        "Name": "database-restore"
    }
}
_EOF_
)

/usr/local/bin/aws --region us-east-1 \
ec2 request-spot-instances \
--spot-price "0.78" \
--instance-count 1 \
--type "one-time" \
--launch-specification "${launch_specification}"

This puts in a request to launch a c4.4xlarge instance with 800GB of storage. It also specifies UserData, which the instance executes after it boots. That makes it a perfect fit for the script that configures the machine to run postgres and restore the database backup.
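
One thing to watch with UserData is that EC2 limits it to 16 KB before base64 encoding. The following is a sketch of a guard that could wrap the encoding step. The function name encode_user_data is an assumption, and base64 --wrap is the GNU coreutils flag used in the launch script.

```shell
#!/bin/bash

# EC2 limits user data to 16 KB, measured before base64 encoding.
MAX_USER_DATA_BYTES=16384

# Print the base64-encoded script on one line, or fail if the
# file is unreadable or larger than the limit.
encode_user_data() {
    local script="$1"
    [ -r "${script}" ] || { echo "cannot read ${script}" >&2; return 1; }
    local size
    size=$(wc -c < "${script}")
    if [ "${size}" -gt "${MAX_USER_DATA_BYTES}" ]; then
        echo "${script} is ${size} bytes; limit is ${MAX_USER_DATA_BYTES}" >&2
        return 1
    fi
    base64 --wrap 0 "${script}"
}

# Example:
# user_data=$(encode_user_data /etc/aws.d/pg_restore) || exit 1
```

Piping the result through base64 --decode is a quick way to confirm the instance will receive exactly the script you expect.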

#!/bin/bash

# pg_restore

# Add postgresql.org official releases as an apt source
sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
apt-get install -y wget ca-certificates
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

# Install dependencies
apt-get -y update
apt-get install -y postgresql-9.2 postgresql-contrib-9.2 postgresql-server-dev-9.2 \
build-essential python3-dev python3-pip libevent-dev daemontools lzop pv ssl-cert

# Make the postgres user a sudoer so it can shut down the machine later
echo "postgres ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/postgres
chmod 440 /etc/sudoers.d/postgres

# Install and configure WAL-E
python3 -m pip install 'wal-e[aws]'
mkdir -p /etc/wal-e.d/env
echo "secret-key"   > /etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
echo "access-key"   > /etc/wal-e.d/env/AWS_ACCESS_KEY_ID
echo 's3://bucket/' > /etc/wal-e.d/env/WALE_S3_PREFIX
chown -R root:postgres /etc/wal-e.d

# Download the latest backup
service postgresql stop
rm -Rf /var/lib/postgresql/9.2/main
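envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch --pool-size=16 /var/lib/postgresql/9.2/main LATEST
# backup-fetch ran as root, so make sure postgres owns the restored data directory
chown -R postgres:postgres /var/lib/postgresql/9.2/main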

# set the postgres recovery settings
sudo -u postgres bash -c "cat > /var/lib/postgresql/9.2/main/recovery.conf <<- _EOF_
restore_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch --prefetch=16 \"%f\" \"%p\"'
recovery_end_command = 'mail -s \"Database Restore Complete\" email@example.com && sudo shutdown -h now'
_EOF_
"

service postgresql start

This is all that is needed to stand up a fully functioning PostgreSQL server and restore the database. No Chef, Ansible or any other provisioning tools required.

The important part here is that postgres lets you specify a command to run once recovery is complete, the recovery_end_command.

recovery_end_command = 'mail -s "Database Restore Complete" email@example.com && sudo shutdown -h now'

Here I have it send me an email and shut down the server, which terminates the EC2 instance so it’s no longer incurring cost.

If the email goes missing, then I know the restore never completed and I can go figure out what went wrong. AWS helps you out here too. The results of the UserData script are automatically logged to /var/log/cloud-init-output.log, so you can see exactly where the restore went wrong.
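
Since a failed restore never reaches the shutdown in recovery_end_command, the instance should still be running and available to log in to. A small helper like the one below can narrow down where things broke. It is only a sketch: find_restore_errors is a hypothetical name and the keyword pattern is a guess, not an exhaustive list.

```shell
#!/bin/bash

# Print lines (prefixed with their line numbers) that look like
# failures in a cloud-init output log. Defaults to the standard path.
find_restore_errors() {
    local log="${1:-/var/log/cloud-init-output.log}"
    grep -n -i -E 'error|fatal|failed|permission denied' "${log}"
}

# Example, run on the instance:
# find_restore_errors
```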

I would be interested in hearing any questions or comments about this.