Friday, November 16th, 2007

Yesterday’s downtime

We had a bad outage yesterday on the newly-installed web server. This followed two days of needle-like 5-10 minute outages. Needless to say, we’ve gone back to the old server.

It was a bad one: four hours long and in the middle of the day. Worse, we didn’t have a “down” page up. This wasn’t for lack of trying; our server was completely non-responsive. When we got it back, we had a number of hours of “rolling outages” as the server caches refilled. Add a couple of logistical issues* and it was a nightmare. Although user comment has been kind (so kind that I fear that negative voices are going unheard!**), you have a right to expect more. This was a bad one, and we’re going to learn from it.

I do want to stress that no data was lost. This was all about the “web server” (the part that sends you the page), not the “database servers,” which have all the data. We have five live backups of your data now, and daily offsite backups too. We didn’t have working web server backups. We should.

Details. The last blog post includes a paragraph that is, in retrospect, a bit funny. (Not funny-ha-ha, mind you.)

“If you don’t notice anything, you can congratulate Felius [John], who just moved us to a new, dedicated web server.”

Well, the new server was the problem. And if you can’t congratulate him on that, you can congratulate him on getting things back up quickly once he was brought in. (Initially we thought we could do it without him, and it was the middle of the night in Australia.) He worked like a dog yesterday, and will be doing so today. Fortunately, we now have really excellent monitoring in place. The monitoring didn’t help us in the crisis—we were monitoring a dead man—but it will help John reconstruct what happened.

In the wake of this, he has two jobs: figure out what happened and make sure it never happens again. In system issues, John is the “decider,” but we have a rough idea what needs to happen. First, we need webserver fail-over. Second, we need better tools for getting back on our feet. It makes no sense to have rolling blackouts for users when search engines take up about half our traffic. After that, John will work to get the new webserver working, this time for good.

Casey, Chris and I are going to be doing our part to help on systems today. We can’t do what John does, but we can do something. We’re running on 8/12 memory cache. I don’t expect problems, but I can’t be sure.

Thanks for all your patience or, if you didn’t have any, for your righteous indignation! We need them both.

In other news: (whew!)

*I’m in Cambridge, MA so I couldn’t get into the server room to work on it, although I was about to drive up. Our “colo” guy, who should have been available, was unreachable too, something that’s never happened before—and a good reason not to host out of Portland, ME where there’s only one server guy at the colo. And our “remote reboot” wasn’t installed yet.
**This is an interesting reversal of something I saw with the Second-Life post, where negative voices drowned out positive. I don’t want to criticize members who cut us slack, but I think naysayers can also feel squelched.

Labels: downtime

