Site Rollback to February 3rd

Very sorry about this one, guys.

For reasons that I'm currently unsure of, the database decided to eat itself. I'd put the odds at something like 40/40/20 for "hacker", "postgres bug", or "host glitch".

The site currently has two different backup methods. The manual backup triggers whenever I do a site update; the last site update was on 1/31. The automatic backup is supposed to be daily, and I check it once in a while to make sure it's working. It's been working literally ever since before the main Motte site was launched . . . and it broke on 2/4. Good timing, thanks, system.

In theory, the quality-contribution system captured all reported quality contributions during this time. I'll try to retrieve those. Any opinions on whether I should just go ahead and repost them myself, or whether I should send them to the original authors so they can repost them?

I do have a dump of a bunch of text snippets that are all that was left of the database. If you remember some phrases you used I might be able to retrieve parts of lost posts. That said, someone's tried this with a few posts and got 0/2, so absolutely no promises here. Feel free to ask though! If you want to take data recovery more seriously, make a copy of your browser cache, which can then be pored over to find people's posts. I'm not totally sure how important this is, but I bet at least a few people will be sad to lose effortposts they made.

I've fixed the backup issue and set up better monitoring so it will yell at me if it fails again. I've also temporarily increased backup frequency to hourly, just in case there's some serious stability issue right now that I'm not aware of. The good news is that this shouldn't happen again, at least not with this much lost data. But that doesn't really fix this one.

Apologies again.

This too shall pass.

I've fixed the backup issue and set up better monitoring so it will yell at me if it fails again.

Important backups should also send notifications on success. Notification only on failure risks a scenario where both the backup and the notifications fail.

To be even safer, the script that sends the success notification should pull some independent confirmation the backup actually occurred, like the output of ls -l on the directory the database dumps are going to, and should include this in the notification text. Without this, a 'success' email only technically means that a particular point in a script was reached, not that a backup happened.
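As a rough illustration of that idea, here is a minimal Python sketch that builds the notification text from what is actually on disk rather than from the fact that the script got this far. Everything in it is an assumption for illustration: the function names, the 26-hour freshness window, and the notion that dumps land as plain files in a single directory. It refuses to produce a "success" message if the newest file is missing, empty, or stale.

```python
import time
from pathlib import Path


def list_dumps(dump_dir):
    """Return (name, size_bytes, mtime) for each regular file, newest first."""
    entries = []
    for path in Path(dump_dir).iterdir():
        if path.is_file():
            info = path.stat()
            entries.append((path.name, info.st_size, info.st_mtime))
    return sorted(entries, key=lambda e: e[2], reverse=True)


def build_success_message(dump_dir, max_age_seconds=26 * 3600, now=None):
    """Build notification text, or raise if no fresh, non-empty dump exists.

    max_age_seconds is a hypothetical threshold: a bit over one day,
    on the assumption that backups are supposed to run daily.
    """
    now = time.time() if now is None else now
    dumps = list_dumps(dump_dir)
    if not dumps:
        raise RuntimeError(f"no dump files found in {dump_dir}")
    name, size, mtime = dumps[0]
    if size == 0:
        raise RuntimeError(f"newest dump {name} is empty")
    if now - mtime > max_age_seconds:
        raise RuntimeError(f"newest dump {name} is older than expected")

    # Include the listing itself, similar in spirit to pasting `ls -l` output.
    lines = [
        f"Backup OK: newest dump is {name} ({size} bytes)",
        "",
        "Directory listing:",
    ]
    for entry_name, entry_size, entry_mtime in dumps:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(entry_mtime))
        lines.append(f"{entry_size:>12}  {stamp}  {entry_name}")
    return "\n".join(lines)
```

The returned string would then go into whatever email or chat notification the backup job already sends; a raised error should be routed to the failure path instead.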

I've tried that before and what inevitably happens is I just end up ignoring the success notices.

In this case, however, I'm using healthchecks.io to handle this; it'll start pinging me on its own if it doesn't get regular notifications of success. So unless that service goes down, we're good.
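For anyone curious what that looks like in practice, here is a minimal Python sketch of the pattern, not the site's actual setup. The check UUID in `PING_URL` is a placeholder, and `run_backup_and_report` and `ping` are hypothetical names. The idea is that the job pings the check URL only after the backup command exits cleanly, so silence is what triggers the alert; appending `/fail` to the ping URL signals an explicit failure.

```python
import subprocess
import urllib.request

# Placeholder: substitute the ping URL of your own check.
PING_URL = "https://hc-ping.com/your-check-uuid"


def ping(url, timeout=10):
    """Send a ping; return False instead of raising on network errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return False


def run_backup_and_report(command, ping_url=PING_URL, pinger=ping):
    """Run the backup command, then report success or failure.

    If this script never runs at all, no ping is sent, and the
    monitoring service raises the alarm once the check is overdue.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        pinger(ping_url)
        return True
    pinger(ping_url + "/fail")
    return False
```

Usage would be something like `run_backup_and_report(["/usr/local/bin/backup.sh"])` from cron, where the script path is again just an example.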

To be even safer, the script that sends the success notification should pull some independent confirmation the backup actually occurred, like the output of ls -l on the directory the database dumps are going to, and should include this in the notification text. Without this, a 'success' email only technically means that a particular point in a script was reached, not that a backup happened.

Ideally, yeah. In this case it's worth noting that it's taking full drive images, so it's, uh, kind of hard to do an ls. I guess I could run it as root and do a whole loopback thing to mount the image but I don't think that's likely to be necessary.