Monday, February 13th, 2006

Work disambiguation and the “Ship of Theseus”

This blog post is long, and involves both showy mythological allusions and inside-baseball discussions of database structures. In brief, you’ll be seeing some new features, but you may also catch some glitches as I bring them live. Thank you for your support.

In philosophy, the “Ship of Theseus” is a “replacement paradox.” The story is that the Greek hero Theseus (you know, minotaurs and balls of string) rebuilt his ship during the voyage back from Crete—perhaps even while it was moving—such that the ship arrived at Athens with no piece of wood that had left from Crete.* The question is: Was it the same ship?

Anyway, LibraryThing is a true “Ship of Theseus.” I’m rebuilding it as it moves. This week I’ll be putting in a new keel—a whole new structure for thinking about books and works.

The former system was essentially composed of discrete books. If two books had very similar authors and titles (e.g., two editions of Romeo and Juliet), the system guessed that they were the same “work.” These guesses were pretty good, particularly considering they had to be made on the fly, but not good enough. And there was no way to change them. Notably, the whole system operated without a separate “works” database. This was clever and economical, but also limited.
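To make the idea concrete, here is a rough Python sketch of that kind of on-the-fly guess. It is not LibraryThing's actual code: the function names, the normalization rules and the dictionary layout are all assumptions for illustration, and it uses exact matching on a cleaned-up key where the real system presumably tolerated looser similarity.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading article."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    words = text.split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return " ".join(words)


def work_key(author: str, title: str) -> tuple[str, str]:
    """Hypothetical matching key: books with equal keys are guessed to be one work."""
    # Assumption: anything after a colon is a subtitle that varies by edition.
    main_title = title.split(":", 1)[0]
    return (normalize(author), normalize(main_title))


def same_work(book_a: dict, book_b: dict) -> bool:
    """Guess whether two book records belong to the same work."""
    return work_key(book_a["author"], book_a["title"]) == work_key(
        book_b["author"], book_b["title"]
    )
```

Because the key is recomputed every time, nothing is stored and nothing can be corrected by hand, which is exactly the limitation described above.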

The new system introduces a robust concept of “work.” On the database side this means a special “works” database, where each work has a title (the most common title of books belonging to the work). It is how most LibraryThing books can acquire LCCNs, Deweys and other cataloging information. It will allow users to discuss books (on a forum, for example) without worrying that they are talking only to people who have the same edition they do. Techies will like that it opens the door to an external API built on Library of Congress data rather than Amazon data, which is forbidden for that purpose. And, most importantly, it will allow ordinary people to participate in the sacred act of cataloging, combining and splitting books from works as they see fit. This has never been done before. It’s Wikipedia for book cataloging.
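For the database-minded, a minimal in-memory sketch of that structure in Python might look like the following. The class, field and function names are hypothetical, and a real implementation would live in database tables rather than objects; the only detail taken from the description above is that a work's title is the most common title among its books, and that books can be combined into and split out of works.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Work:
    work_id: int
    # Maps book_id -> that book's title (hypothetical layout).
    book_titles: dict[int, str] = field(default_factory=dict)

    @property
    def title(self) -> str:
        """The most common title among the books in this work."""
        # Raises IndexError on an empty work; a real system would guard this.
        counts = Counter(self.book_titles.values())
        return counts.most_common(1)[0][0]


def combine(target: Work, other: Work) -> Work:
    """Move every book from `other` into `target`."""
    target.book_titles.update(other.book_titles)
    other.book_titles.clear()
    return target


def split(work: Work, book_id: int, new_work_id: int) -> Work:
    """Remove one book from `work` and give it a work of its own."""
    title = work.book_titles.pop(book_id)
    return Work(new_work_id, {book_id: title})
```

Storing the grouping explicitly is what makes user corrections possible: a combine or split is just a change to which work a book points at, and the work title follows automatically.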

Anyway, all this is coming this week. The trick is, the system is so complex and involves so much “calculation” that I can’t bring the server down, make the changes and bring it up again without unacceptable downtime. Testing it on my own Mac takes forever and won’t give it the stress test it needs (LibraryThing can average 3,000 “queries” per minute!). So, I’m going to be rebuilding the ship while it moves.

In fact, the new system is already mostly in place, but invisible. It’s going to become more and more visible as the week progresses. Once everything is changed and I’m satisfied it works, I will add the last element, exposing work disambiguation to the masses. Then I’ll take down the old system.

So bear with me as I make these changes. The switch-over is highly planned; I even have stuff on paper. I’m a real programmer now! But the presence of two different systems will lead to inconsistencies in presentation and other hiccups. If you notice that your official book counts disagree by one, let it slide. If something breaks, wait ten seconds and try again. If book recommendations go briefly insane, well, serendipity is a good thing!

Advice corner. I still haven’t quite figured out the user interface for work disambiguation. I think it will mostly take place on the author pages. Users will click checkboxes by books and then click “combine books” to combine them. I’m not certain if “work splitting” will also happen on author pages. Certainly work pages will let you see all the editions of a work, allowing you to remove one or more editions as not belonging to the work at hand. Your suggestions would be appreciated.

Lastly, I want to favor library titles for books. Amazon too frequently puts edition and marketing info into their titles. (This isn’t their fault; they’re not running a cataloging app.) And using library data will allow LibraryThing to offer an external API. The only trick is, libraries don’t capitalize the way most people think is “right.” It’s “Lord of the rings” not “Lord of the Rings.” I think people will go ape if work pages, recommendations and the like start using library-format titles. On the other hand, it’s hard to write a perfect capitalization algorithm, and library purists may resent the use of the vulgar form. What to do?
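To show why a perfect algorithm is hard, here is a deliberately naive Python sketch of converting a library-format title to the “vulgar” form. The function name and the small-word list are assumptions for illustration, not anything LibraryThing uses.

```python
# Words conventionally left lowercase in the middle of a title (assumed list).
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in",
               "nor", "of", "on", "or", "the", "to"}


def headline_case(library_title: str) -> str:
    """Capitalize every word except small words in the middle of the title."""
    words = library_title.split()
    result = []
    for index, word in enumerate(words):
        lower = word.lower()
        is_edge = index == 0 or index == len(words) - 1
        if lower in SMALL_WORDS and not is_edge:
            result.append(lower)
        else:
            # Upper-case only the first character so internal capitals survive.
            result.append(word[0].upper() + word[1:])
    return " ".join(result)
```

This handles “Lord of the rings” but stumbles on plenty of real titles: a small word right after a colon should be capitalized, words that open with a quotation mark or parenthesis are left alone, and it has no idea which lowercase words are proper nouns in other languages.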

* He also stranded his wife on a desert island, but the only philosophical issues there are ethical. The ancient story is actually a little different. According to Plutarch (Theseus 22-23), the Athenians of later days exhibited the ship that the ancient hero Theseus had sailed back from his adventures in Crete. Over time, the Athenians had replaced its planking bit by bit, until no part of the ship was original. Personally, I think the modern paradox should be changed again. Theseus’ voyage was pretty much a straight shot, and, in the story, he gives no time to even changing his sails (although doing so would have averted his father’s death), let alone rebuilding the boat from the inside out. The whole thing would make a lot more sense as Odysseus’ ship, or Jason’s. The latter has the advantage of allowing Medea to fix the ship through magical means, even while it moved. Of course, Jason ditched Medea too. What is it with these guys?
