
A Writeup On The Reason The Motte Relaunch Was Rocky

I think anyone who's been watching this switchover has noted it hasn't been the smoothest. I'm still kinda decompressing from that and I figured I'd write up why, just so you could all marvel at the ridiculous chain of catastrophes.

So.

We get the site up. People register their accounts. People start almost immediately reporting 429 errors when registering.

429 Too Many Requests is an error that means a user has done too much stuff lately, commonly known as "rate limiting". A lot of the site is rate limited, but it should be rate limited well above what an actual human will do. For example, the account creation is rate-limited at 10 per day per person; if you need more than ten accounts every day then uh maybe you're not behaving quite like we want.
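For the curious, here is a rough sketch of what "10 per day per person" looks like in code. This is not the site's actual limiter; the class name, the `clock` parameter, and the choice of a sliding window are all assumptions made for illustration. The thing to notice is that everything depends on the key: if every visitor shows up under the same IP, they all share one bucket.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` events per `window_seconds` per key."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.events = defaultdict(deque)  # key -> timestamps of recent events

    def allow(self, key):
        now = self.clock()
        recent = self.events[key]
        # Forget events that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # the caller would answer with HTTP 429 here
        recent.append(now)
        return True


# Ten registrations per day, keyed by whatever the server believes the client IP is.
registrations = SlidingWindowLimiter(limit=10, window_seconds=24 * 60 * 60)
```

If the key passed to `allow` is really the address of a proxy rather than the visitor, the eleventh person to register gets the 429, not the eleventh account from one person.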

Of course, people weren't making ten accounts per person; rate limiting was broken.

We looked into the rate limiting code. Rdrama runs on a service called Cloudflare, which relays connections and does a bunch of fancy caching and performance optimization and also doesn't provide service if you're farming kiwis. An annoying thing about this kind of a service is that it makes it a little trickier to figure out "who" someone is; Cloudflare includes that information on requests, but it's not in the normal place. The rate limiting code was using the Cloudflare-specific IP info. Problem: We're not on Cloudflare. So that info was just wrong. I took out the Cloudflare-specific stuff and the problem did not get fixed in any way.

Well, Cloudflare does all this fancy optimization (it's called "reverse proxying", please don't ask why), but actually, so do we. The Motte runs on the same server setup as The Vault, and The Vault is specifically designed to be extremely cacheable. We've got our own little similar frontend server doing something identical, and all connections, including Motte connections, go through it. This means we needed to get the IP from our own reverse proxy, using a different technique, which we did, and which also entirely failed to fix the issue.
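A minimal sketch of the "who is this, really" problem, assuming the proxies in front of the app each append the address they saw to the standard `X-Forwarded-For` header, and that Cloudflare (when present) sets `CF-Connecting-IP`. The function name and the `trusted_hops` and `cloudflare` parameters are made up for this example, and `headers` is assumed to be a plain dict with exactly these key spellings.

```python
def client_ip(headers, remote_addr, trusted_hops=1, cloudflare=False):
    """Guess the visitor's IP given how many proxies we control sit in front of us."""
    if cloudflare:
        # Only meaningful when traffic really arrives through Cloudflare.
        return headers.get("CF-Connecting-IP", remote_addr)

    raw = headers.get("X-Forwarded-For", "")
    forwarded = [part.strip() for part in raw.split(",") if part.strip()]

    # Each trusted proxy appends one entry, so the visitor is the
    # `trusted_hops`-th entry counting from the right.
    if trusted_hops <= 0 or len(forwarded) < trusted_hops:
        return remote_addr  # nothing usable; fall back to the socket peer
    return forwarded[-trusted_hops]


headers = {"X-Forwarded-For": "203.0.113.7, 10.0.0.9"}
one_layer = client_ip(headers, "10.0.0.5", trusted_hops=1)   # the outer proxy's address
two_layers = client_ip(headers, "10.0.0.5", trusted_hops=2)  # the actual visitor
```

Counting one hop too few returns the address of another proxy, which is the same for every visitor, and a per-IP limiter will then treat everyone as one person.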

At this point I tried to disable the rate limiter entirely. The rate limiter refused to disable. We'll get back to this one.

The reason, I guessed, the reverse-proxy IP didn't work is that our reverse proxy is actually behind another reverse proxy. It's reverse proxies all the way down. You may not like it, but this is what peak web development looks like. Anyway, we were getting one layer further up, but we needed to be another layer further up. The hosting service I use does in fact have a switch for enabling this; it's called Proxy Protocol. I turned Proxy Protocol on and the entire site instantly went down. So I flipped it back and the site came back up. Then I did this a few more times just to be sure it wasn't a coincidence. It wasn't.
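Why would flipping one switch take the whole site down? A plausible explanation, sketched below, is that Proxy Protocol puts an extra line in front of every connection, and a server that isn't expecting it sees garbage where the HTTP request should start. This sketch assumes the human-readable version 1 of the protocol; whether the hosting provider here sends version 1 or the binary version 2 isn't stated, and the function name and returned field names are invented for the example.

```python
def parse_proxy_v1(data: bytes):
    """Split a PROXY protocol v1 header off the front of a connection's first bytes.

    Returns (info, rest): `info` is a dict describing the original connection,
    or None if there is no header or the proxy reported UNKNOWN.
    """
    if not data.startswith(b"PROXY "):
        return None, data  # plain connection, nothing to strip

    line, sep, rest = data.partition(b"\r\n")
    if not sep:
        raise ValueError("incomplete PROXY header")

    parts = line.decode("ascii").split(" ")
    if len(parts) >= 2 and parts[1] == "UNKNOWN":
        return None, rest
    if len(parts) != 6 or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("malformed PROXY header")

    _, family, src, dst, src_port, dst_port = parts
    info = {
        "family": family,
        "source_ip": src,          # the address the outermost proxy saw
        "dest_ip": dst,
        "source_port": int(src_port),
        "dest_port": int(dst_port),
    }
    return info, rest


first_bytes = b"PROXY TCP4 203.0.113.7 10.0.0.5 56324 443\r\nGET / HTTP/1.1\r\n"
info, http_bytes = parse_proxy_v1(first_bytes)
```

A backend that does not strip that first line tries to parse `PROXY TCP4 ...` as an HTTP request, and a backend that expects the line but doesn't get it fails the other way, so both ends have to agree before anything works.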

It turns out that the reverse proxy run by me requires some very specific configuration settings to be compatible with the Proxy Protocol setting. The problem is that I'm running this proxy in sort of a weird way. Most people using this server architecture have, like, an entire devops team. I don't! It's just me. And I don't really know what I'm doing. So cue half an hour of occasional outages as I try something new. It is worth noting that some of the changes I made also broke the site, but I was suspicious that the two changes had to be made together to work at all, so sometimes I'd break the site, then I'd break the site in another way, then I'd sit there for a minute hoping it worked, and it wouldn't, and then I'd revert both changes.

Finally I figured out the magic incantation! The site worked, we got IPs, the rate limiting was functional. The 429 error was forever vanquished! I looked at the site, and checked the perf charts, and noted that we were capping the CPU on the absolute-bottom-barrel server I'd chosen, so I figured, hey, I tried moving servers before as part of a test, this should be fine, let's just fork over an extra $12/mo and boost the server a bunch, and I did this, and the site broke entirely.

I spent another thirty minutes trying to fix it; if anyone noticed the site being entirely down for a while, well, that was me trying to untangle what was wrong. I tried connecting directly to the site from its own computer; it didn't work. I spent twenty minutes analyzing this and eventually realized I was just doing it wrong. Worked fine once I did it right. I eventually decided this was a routing issue and had a deep suspicion.

See, Proxy Protocol was set using a switch on the hosting provider's GUI. But that's sketchy as hell - why is it a manual switch? I went back and checked and sure enough it had gotten turned off. So I turned it back on.

Site back up and running.

As near as I can tell, there is a switch on the GUI. But this switch is also overridden by some settings in my configuration. Importantly, it's overridden irregularly; sometimes you'll do something, and it'll say "oh shucks, gotta go check that switch!" Because I hadn't realized this, it went and checked it and dutifully turned it off again.

I think I've fixed that now.

So, what was the deal with rate limiting not turning off?

If you use Kubernetes to run a process, and you tell it you want the latest version of a Docker image, it will download that latest version every time you restart the process.

If you tell it you want a specific labeled version, then it won't. It'll just use whatever it has, even if the label has changed.

So if you changed from "latest" to labels like "dev" and "main" . . . then things just don't update when you think they will, and this change happens silently unless you're aware of what Kubernetes is about to do.
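The rule, as I understand Kubernetes' documented defaults, can be written down in a few lines. This is a model of the default only, for when `imagePullPolicy` isn't set explicitly; the function name and the image names are hypothetical.

```python
def default_pull_policy(image: str) -> str:
    """Model of the pull policy Kubernetes picks when none is specified."""
    if "@" in image:
        return "IfNotPresent"  # pinned by digest, so the content cannot change

    # Look for a tag only in the last path segment, so a registry port
    # such as "registry.example:5000/site" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else "latest"

    return "Always" if tag == "latest" else "IfNotPresent"


policies = {
    name: default_pull_policy(name)
    for name in ("example/site:latest", "example/site", "example/site:dev")
}
```

With "IfNotPresent", a node that already has some image called `dev` keeps using it after a restart, even if a newer `dev` has been pushed. Setting `imagePullPolicy: Always` explicitly is one common way around that; which fix was actually used here isn't stated.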

I think I've fixed that now too.

I bet this new server makes things faster, doesn't it?

Nope.

Turned out the CPU usage wasn't even coming from The Motte. It was an Archive Warrior I was running on that server just to soak up some extra bandwidth. Apparently it's just stupidly CPU-hungry?

I think I've fixed that also.

And that was my day, more or less.

How's your day going?

(Extra thanks to the various people who were helping out on Discord, incidentally, especially Snakes who fixed a whole bunch of not-quite-as-critical-but-still-pretty-dang-important stuff while I was fighting with the servers.)

(Edit: I forgot to mention that I also spent a few hours trying to unclog an HVAC drain line so it wouldn't flood the house. That doesn't even feel like the same day anymore.)


You are close to things and solving problems. Good on you. But maybe too close for perspective; whether the launch was "rocky" won't be judged for a few more weeks. I've been busy with real life stuff, and this feels like the first time I've had the chance to sit down and actually use the new site. I got the registration error too, whatever, hiccups happen. My impression is still excitement.

The site jannies always see the worst. You don't need to share it all with us all the time and make that negative perspective the default perspective.

That's fair, somewhat, but note that I'm also used to video game launch requirements, and "you tried to launch but nobody could play the game for hours" would definitely count as rocky.

Not catastrophic. But rocky.

This will all be forgotten in a few days, you're totally right there.

Yeah, I thought you might be comparing it to video game launches. Definitely not the same. I've been doing web development my whole career. And everything has this very ephemeral feel. With a game someone is focused on it and they paid a decent amount of money for it to work, and dammit they want it to work right now when they have time.

No one pays for web stuff, and it's always a semi-background task for most people, and there are always so many valid reasons why the user's machine might be screwed up.

You end up with this weird situation where major websites might be down for a few hours at a time, impacting literally millions of users, and by the next week everyone has forgotten about it (unless it happens every week, and then you start getting a reputation). It's almost more of a problem getting your users to care about problems. "Hey our web servers are gonna be shut off cuz x company believes in censoring views they don't like". Most users: 'huh? whatever, I'm sure you'll just find a new web server or something, doesn't amazon sell web servers, or like cloud things you can use?'

Yeah, I'd second this. Microsoft has weird partial outages for the sole legal download source for the entire .NET ecosystem for three days, and a half-dozen twitterites and its own GitHub are the only places to care.