Description
Over the past few months, we’ve seen a notable increase in valuable contributions on the SourceCred Discourse forum. It would be a shame if we were to lose it all!
Discourse admins can create and restore site backups. These backups are hosted on Discourse servers and can also be downloaded by admins. @decentralion and I have been manually triggering and replicating these backups, but our ad hoc backup strategy has some limitations:
- Discourse servers appear to only host up to three backups. When a new backup is created and three backups already exist, the oldest is deleted.
- Storing Discourse backups on machines that maintainers regularly use increases the risk of accidentally leaking them. Discourse backups contain sensitive information, including private messages and user emails.
- Storing Discourse backups on machines that maintainers regularly use increases the risk of accidentally losing them.
- Storing Discourse backups independently wastes a lot of space. Most content on the Discourse is rarely changed. With separately stored archives, we do not benefit from deduplication.
- The backup process is entirely manual.
- The restore process has never been tested, and therefore the backups should be assumed to be entirely unusable.
We can solve all of these problems by using Tarsnap, a backup service run by some of the world’s best cryptographers. Tarsnap is end-to-end encrypted with an extensive threat model, and also has an effective deduplication algorithm that’s perfect for “snapshot backups” like these, reducing total storage by multiple orders of magnitude.
Using a backup system like Tarsnap would also mitigate, but not entirely solve, the problem of insider risk among Discourse admins. If we get out of the habit of manually triggering and downloading backups, any such events will stand out and can be more closely audited.
Status
Proposal
Champion
Benefits
We will have a solid backup story whose storage efficiency enables us to take backups frequently and securely.
Implementation plan
- Write a program to extract Discourse’s multiply-compressed backups into uncompressed forms that can be effectively deduplicated by Tarsnap, and to convert those forms back into an equivalent archive readable by Discourse.
- Set up a Tarsnap account, with master keys and write-only keys protected under the appropriate ACLs. This includes figuring out billing.
- Develop a workflow to perform a Discourse backup with minimal manual intervention.
- Test the backup and restore process on a non-production instance.
- Potential stretch goal: Automate the execution of this workflow with a VM running on a cloud provider.
  - Would probably require setting up S3 backups for Discourse.
  - May be better suited as a follow-up initiative.
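As a rough illustration of the first step in the plan, the sketch below unpacks a gzip-compressed tar archive and decompresses any inner member whose name ends in ".gz", so that only uncompressed data remains. The archive layout (a single `dump.sql.gz` member) and the function names `unpack_nested` and `make_sample_backup` are assumptions made for this example, not the actual Discourse backup format or the program proposed here.

```python
import gzip
import io
import tarfile


def unpack_nested(archive_bytes: bytes) -> dict:
    """Return {member name: uncompressed bytes} for a gzip-compressed tar.

    Members whose names end in ".gz" are decompressed a second time, so the
    result holds only uncompressed data.
    """
    files = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            data = tar.extractfile(member).read()
            if member.name.endswith(".gz"):
                data = gzip.decompress(data)
            files[member.name] = data
    return files


def make_sample_backup() -> bytes:
    """Build a tiny stand-in archive (hypothetical layout, illustration only)."""
    inner = gzip.compress(b"CREATE TABLE posts (id integer);\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="dump.sql.gz")
        info.size = len(inner)
        tar.addfile(info, io.BytesIO(inner))
    return buf.getvalue()


files = unpack_nested(make_sample_backup())
```

A real tool would also need to record enough metadata to rebuild the archive afterwards; this sketch covers only the extraction direction.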
Implementation note: Naïvely decompressing and recompressing files always preserves the contents of the uncompressed files, but does not in general preserve the exact representation of the compressed form. The fact that what we will be backing up is not identical to the backup that Discourse gives us should be somewhat scary. I’ve read the parts of the Discourse code that create and restore from these backups, and they don’t appear to require bit-for-bit preservation (e.g., they’re not cryptographically signed), so as a first pass minimal implementation it will suffice to produce re-compressed Discourse backups that are merely equivalent to the originals, not equal to them.
However, I have managed to reverse-engineer the exact compression systems that are used to generate the backup (including two different gzip implementations!) and should be able to quickly replace that first implementation with one that really does preserve every bit. This contract will be somewhat brittle: it could be broken at any time if Discourse or Postgres change their archive formats or compression settings, which they are perfectly within their rights to do. But it’s fairly likely that any such regressions will be easy enough to fix, and in the worst case we can always settle for merely-equivalent archives.
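The difference between "equivalent" and "equal" archives can be seen with Python's standard gzip module. This is a minimal sketch with made-up data, not the compression settings that Discourse or Postgres actually use: the same payload is compressed twice with different header timestamps and compression levels, producing different compressed bytes that decompress to identical contents.

```python
import gzip

# Made-up payload standing in for a database dump.
payload = b"SELECT 1;\n" * 1000

# Same input, but different header metadata (mtime) and compression level.
first = gzip.compress(payload, compresslevel=9, mtime=0)
second = gzip.compress(payload, compresslevel=1, mtime=1)

# The uncompressed contents survive the round trip in both cases...
contents_match = gzip.decompress(first) == gzip.decompress(second) == payload

# ...but the compressed representations are not byte-for-byte identical.
bytes_match = first == second
```

Here `contents_match` is true while `bytes_match` is false, which is why a bit-for-bit reproduction has to match the original compressor's settings and header fields rather than just its output contents.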
Time estimate: I’ve spent about 20 hours on this so far, and would estimate 30–60 more. The high uncertainty here is because I haven’t used Tarsnap before, and haven’t set up a Discourse instance before, either (probably needed for testing).
Deliverables
- A new repository under the @sourcecred GitHub organization with infrastructure as described in the implementation plan.
- (Verifiable only by Discourse admins with read access to the backup instance) A Tarsnap bucket with valid backups in it.
Dependencies
- the Discourse platform itself
- the SourceCred Discourse instance
- the Tarsnap service
References
Contributions
- CPython PR #18077 fixing gzip metadata (plus associated bug)
- CPython PR #18080 fixing tarfile headers
- proof of concept for bit-for-bit reversible extraction
- …