I’d been moving more and more personal projects to my ClickHouse database, but I didn’t have a good backup solution. In fact, I didn’t have any backup solution whatsoever.

While my colleague and I were working on some incremental ClickHouse backup scripts at work, he recommended that I look into setting up backups for my personal database as well. I wasn’t eager to deal with the complexity of incremental backups, and I didn’t want elaborate clickhouse-backup scripts either.

While exploring the options, we found that ClickHouse has built-in backup/restore functionality that can target local storage or S3. After a few local tests, I got the queries working the way I wanted and verified the S3 functionality.

Setting up S3

While you can set up S3 backups using the ClickHouse configuration XML files, you can also provide all the parameters in the BACKUP query itself. This is the approach I took, as I didn’t want to mess with the configuration files or have my S3 credentials in them.

I created a new bucket for my backups using the AWS console (but you can use the CLI or the API as well). Afterwards, I grabbed the bucket URL and generated AWS credentials with the permission to write to the bucket.
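For reference, here is roughly what that setup looks like with the AWS CLI. This is a sketch, not the exact steps I took: the bucket name, region, user name, and policy are placeholders, and the policy is a minimal guess at the permissions a backup writer needs.

```shell
# Hypothetical names; adjust to your own setup.
BUCKET=my-ch-backups
REGION=eu-west-1

# Minimal policy allowing backup writes (and later restores) to the bucket.
cat > ch-backup-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::${BUCKET}",
      "arn:aws:s3:::${BUCKET}/*"
    ]
  }]
}
EOF

# Create the bucket and a dedicated IAM user (requires AWS CLI + admin credentials):
#   aws s3 mb "s3://${BUCKET}" --region "${REGION}"
#   aws iam create-user --user-name ch-backup
#   aws iam put-user-policy --user-name ch-backup \
#     --policy-name ch-backup-write --policy-document file://ch-backup-policy.json
#   aws iam create-access-key --user-name ch-backup
```

The last command prints the access key and secret that the BACKUP query will use.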

With the bucket URL and the credentials in hand, I was ready to run a manual backup.

Backup queries with ClickHouse

Here is the query I ended up using. It excludes the system database, because including it caused problems while I was setting up the backups; try including it if it works for you. (Retrying it is also on my personal TODO list.)

backup all
except database system
to S3(
    'https://my-ch-backups.s3.eu-west-1.amazonaws.com/ch01/DT.zip',
    'AWS_KEY',
    'AWS_SECRET'
);

AWS_KEY and AWS_SECRET should be replaced with the credentials you generated. DT is the current date and time, but you can use anything you want. I generated DT with the date -u '+%Y-%m-%dT%H:%M:%SZ' command.

This query returns instantly and runs the backup in the background. You can check the status of the backup using the system.backups and system.backup_log tables.
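For example, a status check might look like this (the column names are from recent ClickHouse versions; adjust if yours differs):

```sql
-- List the most recent backups and their status.
SELECT id, name, status, error, start_time, end_time
FROM system.backups
ORDER BY start_time DESC
LIMIT 5;
```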

After the backup was complete, I was able to see a file called 2023-11-29T00:00:01Z.zip in my S3 bucket.
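Restoring works symmetrically with the RESTORE statement. A sketch, assuming the same bucket and the backup file above (credentials need read access):

```sql
restore all
from S3(
    'https://my-ch-backups.s3.eu-west-1.amazonaws.com/ch01/2023-11-29T00:00:01Z.zip',
    'AWS_KEY',
    'AWS_SECRET'
);
```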

Automating the backup

Being able to make backups with a single query is great, but it’s useless if I need to remember to run it regularly. Fortunately, it’s easy to automate running ClickHouse queries with cron.

I used curl to run the query, but you can also use clickhouse client if you want. The query is the same; I send a POST request to the ClickHouse server with the query as the body and the credentials in the URL. Here’s what my shell script looks like:

#!/bin/sh
# Timestamp used as the backup file name.
dt=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
qry="backup all except database system to S3('https://my-ch-backups.s3.eu-west-1.amazonaws.com/ch01/${dt}.zip', 'AWSKEY', 'AWSSECRET');"

# The URL below is a placeholder; use your own ClickHouse HTTP endpoint
# and credentials.
curl 'https://user:password@ch.example.com:8443/' --data "${qry}"

And here’s the crontab entry.

0 0 * * * /home/clickhouse/full-backup-s3.sh

You can use other scheduling tools as well, like Apache Airflow or systemd timers.
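For the systemd route, a sketch of the equivalent timer/service pair (the unit names and schedule mirror the cron example, but are otherwise placeholders):

```ini
# /etc/systemd/system/full-backup-s3.service
[Unit]
Description=ClickHouse full backup to S3

[Service]
Type=oneshot
ExecStart=/home/clickhouse/full-backup-s3.sh

# /etc/systemd/system/full-backup-s3.timer
[Unit]
Description=Run the ClickHouse backup daily

[Timer]
OnCalendar=*-*-* 00:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

After creating both files, run systemctl daemon-reload and systemctl enable --now full-backup-s3.timer.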

S3 lifecycle rules

Now we have daily backups of our ClickHouse database, which is very useful for restoring data from past days. But we don’t want to accumulate backups forever and end up paying more for the backups than for the database itself.

Fortunately, S3 has a feature called lifecycle rules that can automatically take actions on objects as they age. I set up the following rule for my backups:

  • Day 0: Objects uploaded.
  • Day 1: Objects move to Glacier Instant Retrieval.
  • Day 45: Objects expire.

This means that after a backup is taken, it is stored in the default storage class for the first day. After that, it moves to a cheaper storage tier, and after 45 days it is deleted.
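The same rule can be expressed as an S3 lifecycle configuration and applied with the AWS CLI. A sketch, with the bucket name and key prefix as placeholders:

```shell
# Lifecycle rule matching the schedule above: transition to
# Glacier Instant Retrieval after 1 day, expire after 45 days.
cat > ch-backup-lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "ch-backup-rotation",
    "Status": "Enabled",
    "Filter": { "Prefix": "ch01/" },
    "Transitions": [
      { "Days": 1, "StorageClass": "GLACIER_IR" }
    ],
    "Expiration": { "Days": 45 }
  }]
}
EOF

# Apply it (requires AWS CLI + credentials):
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket my-ch-backups \
#     --lifecycle-configuration file://ch-backup-lifecycle.json
```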

There you have it, a simple and cheap backup solution for your ClickHouse database. I hope you found this useful. Feel free to comment under this post if you have any questions, or send me a message to discuss your ClickHouse needs.