Saturday, July 15th, 2006

Server problems abating

Stewie and Seamus, Chris’ dogs

Short version: We now have a three-server cluster. Speed has picked up. We’re hoping to solve lingering character set issues. A new employee, Chris Gann, set everything up.

Long version: Introducing… Chris Gann (LT: stalepez, chrisgannlibrarything.com), LibraryThing’s newest employee. Chris will eventually be coding, but his first task was to set up a “cluster” of powerful new servers to deal with LibraryThing’s current traffic, with a path to handling an order-of-magnitude increase in the future. He designed the architecture, ordered the boxes, set them up and arranged for the transfer from California. Chris is an old hand at this stuff. Back in 1999 he co-founded LinuxBox, a pioneering hosting facility for open-source projects, later acquired by OpenAvenue.*

The new servers have been up for almost two days, long enough to see that, at least as far as speed goes, things are shaping up. We’ve moved from one fairly modest server in California (a second one broke right before the Wall Street Journal article hit!) to three servers, two moderately powerful and one a “monster.” They are set up as a “master” and two “slaves.”** The master handles “writes,” the slaves “reads.” Read load is balanced between the two slaves, and if something goes wrong with one, reads “fail over” to the other. The system also provides increased data security, with complete database copies stored on three computers. It is also possible to take one server offline for backups without causing interruptions. I can also finally run statistics pages, the Zeitgeist particularly, without hiccups.
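
For the curious, here is a minimal sketch in Python of the routing idea described above. It is not LibraryThing's actual code. The class and method names are made up for illustration, the "anything that isn't a SELECT is a write" test is a simplification, and falling back to the master when both slaves are down is an assumption on my part.

```python
import random


class ReplicatedRouter:
    """Route writes to the master and spread reads across the slaves."""

    def __init__(self, master, slaves, rng=None):
        self.master = master
        self.slaves = list(slaves)
        self.down = set()  # slaves currently marked unavailable
        self.rng = rng or random.Random()

    def mark_down(self, name):
        self.down.add(name)

    def mark_up(self, name):
        self.down.discard(name)

    def pick(self, sql):
        """Return the name of the server that should run this statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb != "SELECT":
            return self.master  # writes always go to the master
        healthy = [s for s in self.slaves if s not in self.down]
        if not healthy:
            # Assumption: with no slave available, read from the master.
            return self.master
        return self.rng.choice(healthy)  # uniform choice among healthy slaves


router = ReplicatedRouter("zeus", ["apollo", "athena"])
```

Calling `router.pick("INSERT INTO books VALUES (1)")` returns the master, while a `SELECT` lands on one of the two slaves. After `router.mark_down("apollo")`, every read goes to Athena until Apollo is marked up again.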

We saw an immediate improvement, and speed has picked up further as “caches” grew and as scripts were modified to take advantage of load distribution. (Maybe 1/3 of scripts have been rewritten, but they are the heaviest ones.) Between midnight last night and this morning, the master, “Zeus,” had only three “slow queries” (11, 13 and 13 seconds respectively); everything else took less than 10 seconds. The slaves, Apollo and Athena, had zero and 12 slow queries respectively. That looked odd, so we dug into the code and discovered that the randomizing function was giving Athena too much work. I expect Athena’s number to drop as the load balances better. As for the servers presently in California, one will be charged with the worst queries, recommendations and relatedness, running them on a schedule and caching the results. The second will become a development server, so that when I try to run a “six degrees of Jane Austen” it doesn’t crash the database.
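
To show how a randomizing function can quietly favor one server, here is a small Python sketch. The skewed picker is a hypothetical bug invented for illustration (rounding a scaled random number), not the actual mistake in our code, and the function names are made up. The check at the end simply tallies picks, which is one way to spot this kind of imbalance.

```python
import random
from collections import Counter

SLAVES = ["apollo", "athena"]


def pick_skewed(rng):
    # Hypothetical bug: scaling to 0..2 and rounding sends only values
    # below 0.5 to index 0, so index 1 gets about three quarters of picks.
    return SLAVES[min(round(rng.random() * 2), 1)]


def pick_uniform(rng):
    # Each slave is equally likely.
    return rng.choice(SLAVES)


def share(picker, n=10_000, seed=0):
    """Return the fraction of n picks that landed on each slave."""
    rng = random.Random(seed)
    counts = Counter(picker(rng) for _ in range(n))
    return {name: counts[name] / n for name in SLAVES}
```

Comparing `share(pick_skewed)` with `share(pick_uniform)` shows the first sending roughly three quarters of reads to Athena and the second splitting them about evenly.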

Oh, best of all, because of an order mixup, we have three more servers sitting in the LibraryThing foyer. As far as Dell believes, they don’t exist. They won’t let us send them back, and the charge has been refunded too. I suspect they’ll eventually come to their senses and let us send them back. If not, hey, server error in our favor!

Character-set issues. Users have reported problems with character sets. As the data transfer was binary, this is probably a configuration issue. (I believe the same thing happened before, and was fixed with a configuration change.) Chris is on the problem, and will report back here or on the Google Group as soon as he can.
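
Here is a short Python sketch of why identical bytes can still display wrongly when a configuration setting differs. The choice of UTF-8 and Latin-1 is an assumption for illustration; I have not confirmed which character sets are involved in our case, and the sample title is made up.

```python
title = "Émile Zola"

# The bytes as stored; a binary transfer copies these unchanged.
raw = title.encode("utf-8")

# Reading the same bytes under the wrong character set garbles the text.
garbled = raw.decode("latin-1")

# Reading them under the right character set restores it, with no data change.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The bytes never change here, only the interpretation, which is why a configuration fix alone can be enough.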

We thank you for your patience. I can’t promise problems are over forever, but a significant step has been made. With luck, we won’t be firefighting all the time, and will be able to push the site forward more.

* In a strange twist of fate, the co-founder of OpenAvenue, Jayson Minard, is now the CTO of Abebooks.
** As an American History major, these terms still give me the creeps.
