Robust Discourse backups

wchargin · January 20, 2020, 7:25pm

Description

Over the past few months, we’ve seen a notable increase in valuable contributions on the SourceCred Discourse forum. It would be a shame if we were to lose it all!

Discourse admins can create and restore site backups. These backups are hosted on Discourse servers and can also be downloaded by admins. @decentralion and I have been manually triggering and replicating these backups, but our ad hoc backup strategy has some limitations:

Discourse servers appear to only host up to three backups. When a new backup is created and three backups already exist, the oldest is deleted.
Storing Discourse backups on machines that maintainers regularly use increases the risk of accidentally leaking them. Discourse backups contain sensitive information, including private messages and user emails.
Storing Discourse backups on machines that maintainers regularly use increases the risk of accidentally losing them.
Storing Discourse backups independently wastes a lot of space. Most content on the Discourse is rarely changed. With separately stored archives, we do not benefit from deduplication.
The backup process is entirely manual.
The restore process has never been tested, and therefore the backups should be assumed to be entirely unusable.

We can solve all of these problems by using Tarsnap, a backup service run by some of the world’s best cryptographers. Tarsnap is end-to-end encrypted with an extensive threat model, and also has an effective deduplication algorithm that’s perfect for “snapshot backups” like these, reducing total storage by multiple orders of magnitude.

Using a backup system like Tarsnap would also mitigate, but not entirely solve, the problem of insider risk among Discourse admins. Once we are out of the habit of manually triggering and downloading backups, any such events will stand out and can be audited more closely.

Status

Proposal

Champion

@wchargin

Benefits

We will have a solid backup story whose storage efficiency enables us to take backups frequently and securely.

Implementation plan

Write a program to extract Discourse’s multiply-compressed backups into uncompressed forms that can be effectively deduplicated by Tarsnap, and then converted back to an equivalent archive readable by Discourse.
Set up a Tarsnap account, with master keys and write-only keys protected under the appropriate ACLs. This includes figuring out billing.
Develop a workflow to perform a Discourse backup with minimal manual intervention.
Test the backup and restore process on a non-production instance.
Potential stretch goal: Automate the execution of this workflow with a VM running on a cloud provider.
- Would probably require setting up S3 backups for Discourse.
- May be better suited as a follow-up initiative.
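
As a rough illustration of the first step, here is a minimal Python sketch of the extraction direction only. It assumes a backup shaped like an outer .tar.gz whose members may themselves be gzip-compressed (for example, a compressed SQL dump); the function name `unpack_backup` and that layout are illustrative assumptions, not a description of the eventual implementation.

```python
import gzip
import tarfile
from pathlib import Path


def unpack_backup(archive_path, out_dir):
    """Expand an outer .tar.gz and any inner .gz members into plain files.

    Returns the relative paths written, in archive order.
    """
    out_root = Path(out_dir).resolve()
    written = []
    with tarfile.open(archive_path, mode="r:gz") as outer:
        for member in outer:
            if not member.isfile():
                continue  # this sketch ignores directories, links, and devices
            target = (out_root / member.name).resolve()
            # Refuse members that would land outside the output directory.
            if out_root not in target.parents:
                raise ValueError(f"unsafe member path: {member.name}")
            data = outer.extractfile(member).read()
            if member.name.endswith(".gz"):
                # Inner compression layer: store the uncompressed bytes so
                # that a deduplicating backup tool sees stable content.
                data = gzip.decompress(data)
                target = target.with_suffix("")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
            written.append(str(target.relative_to(out_root)))
    return written
```

The reverse direction (rebuilding an archive that Discourse can read) is deliberately left out here, since it depends on how closely the original compressed form needs to be preserved.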

Implementation note: Naïvely decompressing and recompressing files always preserves the contents of the uncompressed files, but does not in general preserve the exact representation of the compressed form. The fact that what we will be backing up is not identical to the backup that Discourse gives us should be somewhat scary. I’ve read the parts of the Discourse code that create and restore from these backups, and they don’t appear to require bit-for-bit preservation (e.g., they’re not cryptographically signed), so as a first pass minimal implementation it will suffice to produce re-compressed Discourse backups that are merely equivalent to the originals, not equal to them.
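
To make the distinction between "equivalent" and "equal" concrete, here is a small Python sketch. The payload, compression levels, and timestamp are made up for illustration; the point is only that two gzip streams can decompress to the same bytes while differing in their compressed form.

```python
import gzip

# Stand-in for a chunk of SQL dump; the content is hypothetical.
payload = b"INSERT INTO posts VALUES (1, 'hello');\n" * 1000

# Same content, but different compression levels and header timestamps.
a = gzip.compress(payload, compresslevel=1, mtime=0)
b = gzip.compress(payload, compresslevel=9, mtime=1579548300)

# Both streams decompress to the original payload...
contents_equal = gzip.decompress(a) == gzip.decompress(b) == payload
# ...but the compressed bytes themselves differ.
bytes_equal = a == b
```

Here `contents_equal` is true and `bytes_equal` is false, which is exactly the situation a naive decompress-and-recompress pipeline ends up in.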

However, I have managed to reverse-engineer the exact compression systems that are used to generate the backup (including two different gzip implementations!) and should be able to quickly replace that first implementation with one that really does preserve every bit. This contract will be somewhat brittle—it could be broken at any time if Discourse or Postgres change their archive formats or compression settings, which they are perfectly within their rights to do. But it’s fairly likely that any such regressions will be easy enough to fix, and in the worst case we can always settle for merely-equivalent archives.
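
One way to check whether a given compressor can reproduce an existing gzip stream exactly is to recompress the payload under candidate settings and compare bytes. The sketch below does this for Python's own `gzip` module only; the helper name `matching_levels` is hypothetical, and it is not the reverse-engineered approach mentioned above.

```python
import gzip
import struct


def matching_levels(original):
    """Return the gzip levels whose output equals `original` byte for byte."""
    # Bytes 4..7 of a gzip header hold the modification time, little-endian.
    (mtime,) = struct.unpack("<I", original[4:8])
    payload = gzip.decompress(original)
    return [
        level
        for level in range(1, 10)
        if gzip.compress(payload, compresslevel=level, mtime=mtime) == original
    ]
```

An empty result suggests that the stream was written by a different implementation or with header fields this module does not emit, in which case the merely-equivalent fallback applies.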

Time estimate: I’ve spent about 20 hours on this so far, and would estimate 30–60 more. The high uncertainty here is because I haven’t used Tarsnap before, and haven’t set up a Discourse instance before, either (probably needed for testing).

Deliverables

A new repository under the @sourcecred GitHub organization with infrastructure as described in the implementation plan.
(Verifiable only by Discourse admins with read access to the backup instance) A Tarsnap bucket with valid backups in it.

Dependencies

the Discourse platform itself
the SourceCred Discourse instance
the Tarsnap service

References

Contributions

CPython PR #18077 fixing gzip metadata (plus associated bug)
CPython PR #18080 fixing tarfile headers
proof of concept for bit-for-bit reversible extraction
…

wchargin · January 20, 2020, 7:25pm

cc @Beanow, who may be interested in this

Beanow · January 21, 2020, 12:40am

I would like to like this again. But I already did when I saw the title.

wchargin · January 21, 2020, 12:43am

Heh; thanks for your support.

Beanow · January 21, 2020, 1:25am

I have a running Minio (S3-compatible) instance for caching. But I would hate to be responsible for one that receives Discourse backups before they are encrypted, being unencrypted and all.

If you’re (deeply) familiar with Docker, consider discourse/discourse_docker on GitHub, a Docker image for Discourse. Their setup is convoluted compared to a normal Docker workflow. But once set up, having a local test Discourse has been really handy.

Agree it’s not a bit-for-bit requirement. And wouldn’t consider it scary. From a disaster recovery point of view, if the unpacked result contains the SQL statements to restore the database, that’s ok. It may take some work, but we can recover from disaster. Everything beyond that is a bonus :]

Beanow · January 21, 2020, 1:28am

@nothingismagick are you familiar with Tarsnap?

wchargin · January 21, 2020, 2:29am

I’m somewhat familiar with Docker. I’ll take a look at this. Thanks for the reference.

Yes, the unpacked result contains the necessary SQL (it’s output from pg_dump), which is excellent. I’m more concerned about secondary effects of not having reproducible output. For instance, reproducible output enables trivially verifiable integrity checking. This affords testing of the backup pipeline itself, easier cross-validated recovery in the case of some outage (“hey, can you confirm that you’re restoring the backup with this hash?”), and better robustness to changes in Discourse archive formats. When you’re modifying the backup system, you should be extremely cautious, and it’s a big help if your commit message can just say, “Test Plan: Output is identical before and after this change.”
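
For the "restoring the backup with this hash" check, something as small as the following Python sketch would do. It assumes SHA-256 as the digest and a hypothetical helper name `sha256_file`; neither is specified anywhere in this thread.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large backup archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With bit-for-bit reproducible output, two people can compare this value for the original archive and the reconstructed one; without it, the hashes differ even when the contents are equivalent.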

I’m not sure what you’re suggesting here. Caching what? The Tarsnap client talks directly to Tarsnap servers, using a custom TLS-like protocol to transmit data encrypted on the client.

Beanow · January 23, 2020, 12:13pm

I thought about the case of using an S3 (compatible) remote storage to “ingest” the backups as Discourse makes them, and automate the encrypted backup. For example using Minio (which I’m running for another use-case: caching) to be that S3-like target, and use something like a filesystem watcher to trigger the encrypted backup.

But the trust requirements for such an environment are very high, because of the direct access to unencrypted backups. Deferring these trust issues to “real” S3 ACLs, and using their vendor lock-in fun like Lambda as an equivalent of the filesystem watcher, doesn’t change the trust issues; it just puts them with AWS.

So just saying it’s a bit hairy to pull this off. But have some experience operating an S3-like.

Beanow · March 6, 2020, 3:28pm

Just checking in: I believe @wchargin got pretty far in working this system out. What’s left to implement, and where does it sit priority-wise?

wchargin · March 6, 2020, 3:31pm

No, I didn’t get very far in implementing this; I wanted to gather feedback about the initiative and the prioritization before working on it, and I don’t think that it’s at the top of our priorities right now. Hence the “Proposal” status—I’m not sure why this is tagged with in-progress; I’ll fix that.
