How I maintain

I've been meaning to write up some notes on how I manage the instance, and how I've dealt with certain problems in the past – in particular, the several days of issues and downtime in May 2022.

The instance is hosted at DigitalOcean. There's nothing really special about DO (in fact, I often think about moving), but there are a few features that have saved me more than once. First, it's very easy to create a new disk volume, and once you have a volume, it's easy to expand its size. I store the database on a separate volume. Currently the database takes up ~65GB of space; when the volume is close to full, I expand it as needed. Second, it's very easy to take snapshots of volumes. I have a script that takes a nightly snapshot of the database volume, and I also make snapshots before doing upgrades, server maintenance, etc. If something bad happens and I need to restore the database, I can create a new volume from a snapshot, attach it to the server, and switch from the broken database to the restored copy. I've had to do this several times, and knowing I can do it again really helps alleviate the stress of running the server.
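The nightly snapshot job can be sketched like this (the volume ID is a placeholder, and the doctl usage is illustrative, not my actual script):

```shell
#!/bin/sh
# Illustrative nightly snapshot of the database volume via the DigitalOcean
# CLI. The volume ID is a placeholder; the snapshot is named with the date
# so old ones are easy to find and prune.
VOLUME_ID="${DB_VOLUME_ID:-00000000-0000-0000-0000-000000000000}"  # placeholder
SNAP_NAME="mastodon-db-$(date +%Y-%m-%d)"
# Take the snapshot (skipped here if doctl isn't installed).
if command -v doctl >/dev/null 2>&1; then
  doctl compute volume snapshot "$VOLUME_ID" --snapshot-name "$SNAP_NAME"
fi
echo "$SNAP_NAME"
```

A cron entry pointing at a script like this is all the automation it needs.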

I run the instance using docker compose. I know that docker causes some people a lot of suffering (enough that the official mastodon documentation doesn't seem to include using docker as an option anymore), but I like it for a few reasons. First, I have a lot of professional experience using docker, so I'm used to the different ways it can cause you pain. Second, I find that using docker makes it a little easier to run upgrades and rollbacks. Third, it makes it a little easier to maintain the code/scripts I need to run the instance in git without having to fork the entire mastodon codebase. Finally, it also makes the service a lot more portable, since if/when I want to move the instance to a new server, I don't need to reinstall as many required programs.

The code

I have a slightly customized build of mastodon, with a Dockerfile that looks an awful lot like this:

FROM tootsuite/mastodon:v3.5.3

COPY app/views/about/_registration.html.haml /opt/mastodon/app/views/about/
COPY app/views/about/_botsinspace-custom-signup.html.haml /opt/mastodon/app/views/about/

This takes the pre-existing image for mastodon, copies a few customized files in, and that's it!

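A minimal docker-compose.yml skeleton for this kind of setup looks something like the following – the service names follow the upstream mastodon compose file, but the versions, paths, and ports here are illustrative, and env files, networks, and healthchecks are omitted:

```yaml
version: '3'
services:
  db:
    image: postgres:14-alpine
    volumes:
      - /mnt/db-volume/postgres:/var/lib/postgresql/data  # the dedicated DO volume
  redis:
    image: redis:7-alpine
  web:
    build: .                         # the customized Dockerfile above
    image: mastodon-custom:v3.5.3
    command: bundle exec rails s -p 3000
    ports:
      - '127.0.0.1:3000:3000'        # nginx on the host proxies to this
  streaming:
    image: mastodon-custom:v3.5.3
    command: node ./streaming
    ports:
      - '127.0.0.1:4000:4000'
  sidekiq:
    image: mastodon-custom:v3.5.3
    command: bundle exec sidekiq
```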

Upgrades and maintenance

When it's time to run an upgrade, I make a snapshot of the database, update the version numbers in docker-compose.yml, and run something along the lines of docker compose build && docker compose up -d. This builds a new docker image and deploys it, then restarts everything as needed. If something goes wrong, I roll back the version and re-run docker compose up -d. The configuration file itself could be a little more optimized (ideally I'd only specify the version stuff once), but I'm lazy and usually do it via search/replace in my editor.
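The whole flow can be sketched as follows – the version tags are examples, and the sed line stands in for the search/replace I'd normally do in an editor:

```shell
#!/bin/sh
# Illustrative upgrade flow: bump every pinned mastodon tag in
# docker-compose.yml, rebuild, and redeploy. Rollback is the same
# steps with OLD and NEW swapped.
OLD="v3.5.2"
NEW="v3.5.3"
if [ -f docker-compose.yml ]; then
  sed -i "s|tootsuite/mastodon:${OLD}|tootsuite/mastodon:${NEW}|g" docker-compose.yml
fi
# Rebuild and restart (skipped here if docker isn't installed).
if command -v docker >/dev/null 2>&1; then
  docker compose build && docker compose up -d
fi
```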

Other stuff

A few things happen outside of docker:

nginx - nginx runs directly on the server and routes traffic to docker. The configuration is reasonably close to the default mastodon configuration file. There are a couple of rules in there to block some bad actors, and some rate-limiting as well.

Let's Encrypt - I use Let's Encrypt to set up HTTPS certificates. I use DNS validation, since there's a plugin that handles everything via the DigitalOcean API.

Scheduled tasks - There are a few nightly tasks running in cron – making backups, running mastodon maintenance tasks, etc.

File storage - File storage is a huge chunk of the expense of running the instance. Uploads are stored in DigitalOcean's Spaces, which is basically a clone of S3. I kept files on S3 for a while, but I don't like giving Amazon money, and Spaces is a little cheaper. It's also probably better for performance to have file storage closer to the actual server.

Emails - Emails are sent with MailPace and it works well enough that I basically never think about it.
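The rate-limiting in the nginx rules mentioned above can be sketched like this – the zone name, rate, and burst values are illustrative, not the instance's actual settings:

```nginx
# In the http block: one request-rate bucket per client IP.
limit_req_zone $binary_remote_addr zone=mastodon_api:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts, then return errors to clients that exceed the rate.
        limit_req zone=mastodon_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:3000;
    }
}
```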

Server upgrades

The server is running Ubuntu. Server updates aren't too much of a concern, but if I need to upgrade between major versions or do something else large like that, I take advantage of the fact that the database is on its own dedicated volume. I can boot an entirely new server, install any required software (I basically have a script for this), copy over my configuration files, then detach the volume from the old server, attach it to the new one, and update DNS to point to the new server.
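The volume move itself can be sketched with the DigitalOcean CLI – the IDs are placeholders, and the same steps can also be done through the control panel:

```shell
#!/bin/sh
# Illustrative volume move during a server migration: detach the database
# volume from the old droplet, then attach it to the new one.
VOLUME_ID="${VOLUME_ID:-volume-uuid}"          # placeholder
OLD_DROPLET="${OLD_DROPLET:-old-droplet-id}"   # placeholder
NEW_DROPLET="${NEW_DROPLET:-new-droplet-id}"   # placeholder
if command -v doctl >/dev/null 2>&1; then
  doctl compute volume-action detach "$VOLUME_ID" "$OLD_DROPLET" --wait
  doctl compute volume-action attach "$VOLUME_ID" "$NEW_DROPLET" --wait
fi
```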

Moderation and new accounts

At the moment, I handle all moderation issues and new account requests myself. I use a slightly tweaked version of ivory to help with spam signups and things like that. It's certainly possible that this will become enough work that I can't handle it myself, but that hasn't happened yet.

When things go wrong

The upgrade to Mastodon v3.5.0 involved upgrading PostgreSQL from version 9.6 to 14. The instructions for running this upgrade were along the lines of: make a dump of the data in the old version of postgres, upgrade, then import the data into the new version. With a large database, that can take hours or even days, and if it fails partway through, that's a lot of wasted time. So, I took a snapshot, shut the instance down, and started running the dump. Unfortunately, the process failed for me over and over again, and when I eventually got it to work and tried to bring the instance back online, it was clear that there were some data issues. I rolled back to the old snapshot and started running upgrade tests on a separate test server.

Eventually I found a neat little docker image that can be used to upgrade between postgres versions, and that seemed to work.
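One publicly available image for this kind of in-place cross-version upgrade is tianon/postgres-upgrade (it wraps pg_upgrade; it may or may not be the exact image I used). An illustrative 9.6-to-14 run, with placeholder paths, looks like:

```shell
#!/bin/sh
# Illustrative pg_upgrade run: the old and new data directories are mounted
# at the version-specific paths the image expects. Paths are placeholders.
OLD_DATA="/mnt/db-volume/postgres-9.6"
NEW_DATA="/mnt/db-volume/postgres-14"
if command -v docker >/dev/null 2>&1; then
  docker run --rm \
    -v "${OLD_DATA}:/var/lib/postgresql/9.6/data" \
    -v "${NEW_DATA}:/var/lib/postgresql/14/data" \
    tianon/postgres-upgrade:9.6-to-14
fi
```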

However, there was another problem – the instance was experiencing a data corruption issue. When I tried to run mastodon's custom fix-duplicates script, I found a whole new set of issues. That script checks a bunch of tables for duplicate data. Many of those tables have a manageable amount of data in them, but some – particularly the conversations and statuses tables – each have over 50 million rows right now. The script was trying to run fairly complicated queries against those tables, and the server didn't have enough memory to process the results. This meant I needed to write some custom ruby code to do the same thing without causing quite so much server load. I managed to do that (luckily I program in Ruby for a living), let it run for a couple of hours, and when it was done, I was able to bring the instance back online.
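The batched approach can be sketched in plain Ruby – this is a simplified stand-in for the actual ActiveRecord code, with illustrative data and batch size. Instead of asking the database to find duplicates across tens of millions of rows in one query, it walks the rows in order and tracks which keys it has already seen:

```ruby
# Find duplicate rows in batches. Each row is [id, uri]; the later copy of
# any repeated uri is reported as the duplicate. each_slice stands in for
# ActiveRecord's find_in_batches, which keeps each query's result small.
def find_duplicate_ids(rows, batch_size: 1000)
  seen = {}        # uri => first id seen with that uri
  duplicates = []
  rows.each_slice(batch_size) do |batch|
    batch.each do |id, uri|
      if seen.key?(uri)
        duplicates << id
      else
        seen[uri] = id
      end
    end
  end
  duplicates
end

rows = [[1, "a"], [2, "b"], [3, "a"], [4, "c"], [5, "b"]]
find_duplicate_ids(rows, batch_size: 2)  # => [3, 5]
```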

If I hadn't been able to take snapshots and expand the database volume as needed, and if I weren't well-versed in Ruby, there's a good chance that this upgrade would've either failed entirely, involved a lot of data loss, or taken many days or weeks to finish.