
A Writeup On The Reason The Motte Relaunch Was Rocky

I think anyone who's been watching this switchover has noted it hasn't been the smoothest. I'm still kinda decompressing from that and I figured I'd write up why, just so you could all marvel at the ridiculous chain of catastrophes.

So.

We get the site up. People register their accounts. People start almost immediately reporting 429 errors when registering.

429 Too Many Requests is an error that means a user has done too much stuff lately, commonly known as "rate limiting". A lot of the site is rate limited, but it should be rate limited well above what an actual human will do. For example, the account creation is rate-limited at 10 per day per person; if you need more than ten accounts every day then uh maybe you're not behaving quite like we want.
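For the curious, here is a minimal sketch of what "10 per day per person" can look like in code. This is not the site's actual limiter; the class name, the `signup_status` helper, and the fixed-window approach are all assumptions made for illustration. The important part is the `key`: whatever you pass in is what counts as a "person".

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` actions per `window_seconds` for each key."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.counts = defaultdict(int)  # (key, window index) -> hits so far

    def allow(self, key):
        # Every key gets a fresh allowance at the start of each window.
        window = int(self.clock() // self.window_seconds)
        bucket = (key, window)
        if self.counts[bucket] >= self.limit:
            return False  # the caller would answer with HTTP 429
        self.counts[bucket] += 1
        return True


# Hypothetical wiring: ten signups per day, keyed by client address.
signup_limiter = FixedWindowLimiter(limit=10, window_seconds=86400)


def signup_status(client_ip):
    """Return the HTTP status a signup attempt from this address would get."""
    return 200 if signup_limiter.allow(client_ip) else 429
```

A real limiter would also expire old windows instead of keeping counts forever; this one skips that to stay short.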

Of course, people weren't making ten accounts per person; rate limiting was broken.

We looked into the rate limiting code. Rdrama runs on a service called Cloudflare, which relays connections and does a bunch of fancy caching and performance optimization and also doesn't provide service if you're farming kiwis. An annoying thing about this kind of a service is that it makes it a little trickier to figure out "who" someone is; Cloudflare includes that information on requests, but it's not in the normal place. The rate limiting code was using the Cloudflare-specific IP info. Problem: We're not on Cloudflare. So that info was just wrong. I took out the Cloudflare-specific stuff and the problem did not get fixed in any way.

Well, Cloudflare does all this fancy optimization (it's called "reverse proxying", please don't ask why), but actually, so do we. The Motte runs on the same server setup as The Vault, and The Vault is specifically designed to be extremely cacheable. We've got our own little similar frontend server doing something identical, and all connections, including Motte connections, go through it. This means we needed to get the IP from our own reverse proxy, using a different technique, which we did, and which also entirely failed to fix the issue.
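A rough sketch of the "who is this really" problem, assuming the proxies pass the original address along in an `X-Forwarded-For` header (a common convention, but an assumption here; Cloudflare's own equivalent is a header called `CF-Connecting-IP`). The function name and the trusted-network list are made up for the example.

```python
import ipaddress


def client_ip(peer_addr, headers, trusted_proxies):
    """Best guess at the real client address.

    peer_addr: address of whoever opened the TCP connection to us.
    headers: request headers as a dict with lowercase keys.
    trusted_proxies: list of ipaddress networks that we run or trust.
    """
    def trusted(addr):
        ip = ipaddress.ip_address(addr)  # raises ValueError on garbage
        return any(ip in net for net in trusted_proxies)

    # Hops recorded by proxies, oldest first; the TCP peer is the newest hop.
    raw = headers.get("x-forwarded-for", "")
    forwarded = [hop.strip() for hop in raw.split(",") if hop.strip()]
    chain = forwarded + [peer_addr]

    # Walk backwards past our own proxies. The first address we do not
    # trust is the client; anything older could have been forged by it.
    for addr in reversed(chain):
        if not trusted(addr):
            return addr
    return chain[0]  # everything was one of our proxies
```

If you skip all this and just use `peer_addr`, every visitor looks like the reverse proxy, so the whole site ends up sharing a single rate-limit allowance. That would match people hitting 429 on their very first signup.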

At this point I tried to disable the rate limiter entirely. The rate limiter refused to disable. We'll get back to this one.

The reason, I guessed, the reverse-proxy IP didn't work is that our reverse proxy is actually behind another reverse proxy. It's reverse proxies all the way down. You may not like it, but this is what peak web development looks like. Anyway, we were getting one layer further up, but we needed to be another layer further up. The hosting service I use does in fact have a switch for enabling this; it's called Proxy Protocol. I turned Proxy Protocol on and the entire site instantly went down. So I flipped it back and the site came back up. Then I did this a few more times just to be sure it wasn't a coincidence. It wasn't.

It turns out that the reverse proxy run by me requires some very specific configuration settings to be compatible with the Proxy Protocol setting. The problem is that I'm running this proxy in sort of a weird way. Most people using this server architecture have, like, an entire devops team. I don't! It's just me. And I don't really know what I'm doing. So cue half an hour of occasional outages as I try something new. It is worth noting that some of the changes I made also broke the site, but I was suspicious that the two changes had to be made together to work at all, so sometimes I'd break the site, then I'd break the site in another way, then I'd sit there for a minute hoping it worked, and it wouldn't, and then I'd revert both changes.
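Why does flipping that switch take everything down? With Proxy Protocol on, the upstream balancer puts an extra line at the very start of every connection, before any HTTP. A proxy that isn't expecting it sees junk where a request should be. Here is a sketch of reading the text (version 1) form of that line; I'm assuming v1 purely for illustration, since there is also a binary v2 and I haven't said which one my host sends. The function name and sample addresses are made up.

```python
def parse_proxy_v1(data):
    """Split a PROXY protocol v1 preamble off the front of `data`.

    Returns (source_ip, source_port, rest), where `rest` is everything
    after the header line. Raises ValueError if `data` does not begin
    with a complete v1 header.
    """
    if not data.startswith(b"PROXY "):
        raise ValueError("no PROXY protocol v1 header")
    line, sep, rest = data.partition(b"\r\n")
    if not sep:
        raise ValueError("incomplete header")
    fields = line.decode("ascii").split(" ")
    if fields[1] == "UNKNOWN":
        # The sender could not or would not say who the client is.
        return None, None, rest
    if fields[1] not in ("TCP4", "TCP6") or len(fields) != 6:
        raise ValueError("malformed header")
    _, _, src_ip, _dst_ip, src_port, _dst_port = fields
    return src_ip, int(src_port), rest


# What the start of a connection looks like with the switch turned on.
wire = (
    b"PROXY TCP4 203.0.113.7 10.0.0.5 56324 443\r\n"
    b"GET / HTTP/1.1\r\nHost: example.test\r\n\r\n"
)
ip, port, request = parse_proxy_v1(wire)
```

The flip side is just as bad: if the proxy is configured to expect this line and the switch is off, a plain HTTP request fails the very first check. Both ends have to agree, which is why changing only one of them breaks the site.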

Finally I figured out the magic incantation! The site worked, we got IPs, the rate limiting was functional. The 429 error was forever vanquished! I looked at the site, and checked the perf charts, and noted that we were capping the CPU on the absolute-bottom-barrel server I'd chosen, so I figured, hey, I tried moving servers before as part of a test, this should be fine, let's just fork over an extra $12/mo and boost the server a bunch, and I did this, and the site broke entirely.

I spent another thirty minutes trying to fix it; if anyone noticed the site being entirely down for a while, well, that was me trying to untangle what was wrong. I tried connecting directly to the site from its own computer; it didn't work. I spent twenty minutes analyzing this and eventually realized I was just doing it wrong. Worked fine once I did it right. I eventually decided this was a routing issue and had a deep suspicion.

See, Proxy Protocol was set using a switch on the hosting provider's GUI. But that's sketchy as hell - why is it a manual switch? I went back and checked and sure enough it had gotten turned off. So I turned it back on.

Site back up and running.

As near as I can tell, there is a switch on the GUI. But this switch is also overridden by some settings in my configuration. Importantly, it's overridden irregularly; sometimes you'll do something, and it'll say "oh shucks, gotta go check that switch!" Because I hadn't realized this, it went and checked it and dutifully turned it off again.

I think I've fixed that now.

So, what was the deal with rate limiting not turning off?

If you use Kubernetes to run a process, and you tell it you want the latest version of a Docker image, it will download that latest version every time you restart the process.

If you tell it you want a specific labeled version, then it won't. It'll just use whatever it has, even if the label has changed.

So if you changed from "latest" to "dev" and "main" . . . then things just don't update when you think they will, and this change happens silently unless you're aware of what Kubernetes is about to do.
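As I understand the Kubernetes rule, when you don't set `imagePullPolicy` yourself, it picks a default from the image reference. Here is that rule written out as a small function; the function name and the image names are hypothetical, and this models only the defaulting, not anything Kubernetes-specific in my own setup.

```python
def default_pull_policy(image):
    """Pull policy Kubernetes picks when imagePullPolicy is left unset."""
    if "@" in image:
        return "IfNotPresent"  # pinned to an exact digest
    # Look at the last path segment only, so a registry port such as
    # "registry.example:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.partition(":")[2]
    if tag in ("", "latest"):
        return "Always"  # re-pull on every container start
    return "IfNotPresent"  # reuse whatever the node already has


for name in ("motte:latest", "motte", "motte:dev", "motte:main"):
    print(name, "->", default_pull_policy(name))
```

One common way around this is to set `imagePullPolicy: Always` explicitly on the container, or to use a unique tag per build so the name itself changes. The default also appears to be decided when the object is first created, so editing the tag later doesn't necessarily revisit it.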

I think I've fixed that now too.

I bet this new server makes things faster, doesn't it?

Nope.

Turned out the CPU usage wasn't even coming from The Motte. It was an Archive Warrior I was running on that server just to soak up some extra bandwidth. Apparently it's just stupidly CPU-hungry?

I think I've fixed that also.

And that was my day, more or less.

How's your day going?

(Extra thanks to the various people who were helping out on Discord, incidentally, especially Snakes who fixed a whole bunch of not-quite-as-critical-but-still-pretty-dang-important stuff while I was fighting with the servers.)

(Edit: I forgot to mention that I also spent a few hours trying to unclog an HVAC drain line so it wouldn't flood the house. That doesn't even feel like the same day anymore.)


Appreciate all the hard work. This is a heck of a project!

Weird, there was actually someone who tested 2FA and found it worked fine! I was hoping that was a non-issue.

Bug added.

Thank you for all the effort you have put in and continue to put in! You're the best.

I'm glad I chose to sit out the first day or so. I was expecting a hug-of-death, though as I understand, server capacity wasn't the problem; it was everything else. Thank you for your hard work.

I'm not seeing a meta thread anywhere. If one exists, I'd appreciate if someone could point me to it. For now, this seems like as good a place as any to report issues. It is my understanding that you want to purge all the rdrama.net styling, so:

  • The mobile site turns the URL bar pink.

  • The lines next to comments are pink on mobile.

  • I'm not sure if the colourful flairs were left in intentionally, but to me they don't really fit the website.

First two we want to fix, we just haven't yet; higher-priority stuff awaits.

The last one I'm sorta divided on. I do want people to be able to personalize. We'll have to see whether it becomes a problem or not.

All good work, but I have a request: Can we have a general observations and suggestions thread pinned for a while until most of the gripes are ironed out? As I'm sure you'll appreciate, beta is one thing, but even more issues come up in live service.

Y'know what, that's a good idea. Done.

When rDrama was new (and then not so new), we still had constant outages and hilarious glitches all the time. Like Aevann got 0 sleep for the first six months or so. Every time he’d add a feature, a million other things would break. Then as soon as he’d fix one of them, two million more would break. There was one night where we couldn’t comment or view any threads because something innocuous broke when patching something else and we all just communicated via thread titles and publicly visible reports for hours.

Now everything runs incredibly smoothly and Aevann has learned a ton just through endless trial by fire (and sleep deprivation) for those months. We add huge new things all the time and are constantly optimizing early jank with new knowledge and nothing ever really breaks for more than a couple minutes at worst anymore.

I realize this is a No Fun Allowed by design place, but it’s important - for your own sanity as the dev, and for the userbase’s tolerance of early growing pains and learning moments - to take it all in stride and have fun with it. We fostered a culture immediately of “shit’s going to break, we’re learning, deal with it” and people have always taken it in stride and memed about it endlessly because we built that culture up. No one gets mad. No one has to make tedious mea culpas because we broke something. We reward people for breaking things and encourage it because then we can fix an issue we weren’t aware of. This is a good system and lets people have fun and not freak out when things break.

I’d strongly recommend not setting an expectation for lengthy technical explanations of what happened and why when something goes wrong. A sentence or two at most. “Sorry, I was drunk and fell asleep at 4am trying to fix something else and I was too tired to fix it, that’s why you could only communicate via dick pics for 6 hours” is perfectly serviceable.

That’s one of the many nice parts about not being a massive global megacorp like Reddit. You’re just a few dudes doing something for fun. You don’t owe stakeholders receipts for something that broke. There are no stakeholders. The userbase will understand. But if you go about explaining everything that went wrong every time something goes wrong, you’ll breed mounting discontent and you’ll never have time for anything else.

Lighten up nerds.

Unironic thanks, we'd be stuck on reddit without the rdrama code, and reddit put a damper on autistically examining every aspect of our culture in the nerdiest way possible. (Trains. I mean trains) test: 🚂🚃🚃

Honestly a good reason I wrote it out is just because it was a funny clusterfuck. But yeah, the whole place is going to be unstable for a while; that's just the truth of launching a new service, especially when you don't have a full-time dev team.

In general I'm not going to be posting these unless they're specifically interesting :V

Don't be afraid to ask the rdrama devs for help. Behind the troll shell, they are really quite kind and helpful, and they know how to run a website. There's a reason rdrama has succeeded as well as it has.

We've got one of their devs helping us already, and they've been absolutely stellar :)

Rocky was a great movie

Thank you for your work!

I found the occasional but temporary problems charming. Reassuring. Work was obviously getting done.

Thank you for all the work in getting the site up and running!

Thanks for all the hard work, and it's a shame it didn't work out over on leddit.

Can we start our own collection of Marseys like on the parent site?

You are close to things and solving problems. Good on you. But maybe too close for perspective; a "rocky" launch won't be judged for a few more weeks. I've been busy with real life stuff, and this feels like the first time I've had the chance to sit down and actually use the new site. I got the registration error too, whatever, hiccups happen. My impression is still excitement.

The site jannies always see the worst. Don't need to share it all with us all the time and make that negative perspective the default perspective.

That's fair, somewhat, but note that I'm also used to video game launch requirements, and "you tried to launch but nobody could play the game for hours" would definitely count as rocky.

Not catastrophic. But rocky.

This will all be forgotten in a few days, you're totally right there.

Yeah, I thought you might be comparing it to video game launches. Definitely not the same. I've been doing web development my whole career. And everything has this very ephemeral feel. With a game someone is focused on it and they paid a decent amount of money for it to work, and dammit they want it to work right now when they have time.

No one pays for web stuff, and it's always a semi-background task for most people, and there are always so many valid reasons why the user's machine might be screwed up.

You end up with this weird situation where major websites might be down for a few hours at a time, impacting literally millions of users, and by the next week everyone has forgotten about it (unless it happens every week, and then you start getting a reputation). It's almost more of a problem getting your users to care about problems. "Hey our web servers are gonna be shut off cuz x company believes in censoring views they don't like". Most users: 'huh? whatever, im sure you'll just find a new web server or something, doesn't amazon sell web servers, or like cloud things you can use?'

Yeah, I'd second this. Microsoft has weird partial outages for the sole legal download source for the entire .NET ecosystem for three days, and a half-dozen twitterites and their own github were the only ones to care.

Just want to say: You're doing a lot of work on this transition. Thank you for doing this, we really appreciate you doing everything you can to take us out from under the thumb of the reddit admins.

Hey, you're welcome :) Fingers crossed this all works out.