Archive for February, 2007

Monday, February 12th, 2007

Library of Congress Authority Files, Open!

So begins the PDF announcing and detailing a major new development for the library-data world. Simon Spero, library-geek extraordinaire, has released a nearly complete copy of the Library of Congress Authority Files.

Get them here:

Simon assembled the files, available in MarcXML, by querying the Library of Congress’ Authorities website one record at a time, over a period of months. He’s a patient man.

As I’ve discussed before, Library of Congress data is both free and unfree. As a work of the US government, it cannot be copyrighted.* But the LC has traditionally restricted access, offering small amounts through public interfaces**, and selling larger amounts through its Cataloging Distribution Service. A small industry has developed where the CDS’s buyers resell it commercially. Until now, nobody has decided to just… let it go.

I anticipate that Simon’s action will draw some criticism. If the LC can’t make money selling its cataloging, how will it support this vital work? This sentiment will grow stronger when Casey Bisson releases the full LC MARC data. But whether the data in question is authorities or other cataloging, I think this view is short-sighted.

As I see it, the failure of the LC and other libraries to get their data “out there” on the open web has hurt them far more deeply than their catalog sales could ever recoup. It has made them seem irrelevant, standing silent and apart from the great conversation, which grows more interesting with each passing year.

The first culprits are the online catalogs***, ugly, backward things lamed with session-based URLs. If you want to link to the LC, you can’t. The URL you get will only work for you, for ten minutes. Linking–the very soul of the Web–is impossible.

The second culprit is how libraries have distributed the data itself. Amazon makes its book data accessible to all in a handy, universally-understood XML format. It’s so easy and appealing that over 140,000 developers have signed up to receive it. Libraries, by contrast, generally make their data available (if they make it available at all) over a tricky and obscure protocol known as z39.50. And the data itself is in MARC, a rich but impenetrable spectrum of formats (e.g., DanMARC, the Danish MARC format!) used by, and largely understood only by, librarians.

With wretched web sites and unretrievable, unparseable data, libraries have lost vital ground. If the world worked right, Googling a book would turn up a library within the first few results. But libraries seldom make the top 100, and despite being the largest library on the planet and producing the lion’s share of original cataloging, the Library of Congress is completely absent. In its place are Amazon, its peers and sites that use Amazon data.**** Libraries may know a lot, but simplicity, attractiveness and ubiquitous data have won out.

It’s time to fight back. Libraries and library data can change the book web for the better. Three cheers to Simon for making a critical first step. Viva La Revolución, my brother.*****

*The LC reserves the right to copyright it outside of the United States. It’s unclear if they ever have.
**In LibraryThing’s case, through a z39.50 connection. Although the limits are not clearly specified, we’ve been given to understand that large-scale mining will not be tolerated.
***What library techs call OPACs (Online Public Access Catalogs). The fact that someone still needs to add “Public Access” to “Online” is the problem in miniature. Does Google call itself a Public Access Search Engine?
****Don’t get me wrong; Amazon is a great site, and should be up in the top results too.
*****Insofar as both Simon and I blogged the death of Milton Friedman, I suspect we’re equally uneasy with revolutionary Spanish.

Labels: Uncategorized

Thursday, February 8th, 2007

Web 2.0 Video

Unless you’ve been on Mars, you have seen this. Chris Anderson put it best: “This is why I do what I do.”

Labels: Uncategorized

Monday, February 5th, 2007

Can subjects be relevancy ranked?

I wrote this up on the plane from San Francisco. (I was there on a secret, unbloggable mission!*) It’s a bit involved and it doesn’t “arrive” anywhere, but, if you’re interested in subjects and relevancy ranking, it might be worth thinking about.

There are a couple differences between user tagging (“free tagging,” “social tagging,” etc.) and traditional library classification. “Who does it?” is the most obvious difference, followed by whether or not the labeling action takes place within a predefined ontology, or is made up on the fly.

It’s easy to ignore a third and very critical difference. Subject classifications, like the Library of Congress Subject Headings (LCSH), are essentially binary. They’re non-overlapping buckets. Something either does or does not belong in a subject. There are no gradations of belonging.

The idea is, as Clay Shirky and David Weinberger have reminded us, rooted in the physical world. Subject classification escapes the physicality of shelf-order classification, in which a book must be shelved in a single place, but is still restrained by the physicality of the catalog card. A catalog card can only reference a certain number of subjects. Nobody wants a book to take up twenty cards. And the subject cards can only reference so many books. About 90% of all literature could fall under the LCSH subject Man-woman relationships. But it would make no sense to slot this 90% under that heading in a physical card catalog–the card catalog would instantly grow by 90%! And there seem to be very real differences in relevancy and “what-the-heck”-ness between real-life members of the “Man-woman relationships” LCSH: High Fidelity, Great Expectations, The Fountainhead, I Kissed Dating Goodbye, and The Official Hottie Hunting Guide.

If you’re very selective, you can keep the numbers down. But, apart from the rule that the first subject is generally the primary one, there’s no good way to relevancy rank the books belonging to a subject.

Tags can do it, because tens, hundreds or thousands of users applying tags creates a “statistics of meaning.” So, 1984 is tagged dystopia 549 times, torture six times and Great Britain two times. The numbers can be turned into a ranking, so 1984 shows up high on a list of books about “dystopia,” lower under “torture” and near the end of a list of books about Great Britain.
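To make the “statistics of meaning” concrete, here is a minimal sketch in Python. It is not LibraryThing’s code. The counts for 1984 are the ones quoted above; the other two books and their numbers are made up for illustration, and the names TAG_COUNTS and rank_books_for_tag are hypothetical.

```python
# Hypothetical tag data: book -> {tag: number of times users applied it}.
# Only the 1984 counts come from the post; the rest is invented.
TAG_COUNTS = {
    "1984": {"dystopia": 549, "torture": 6, "great britain": 2},
    "Brave New World": {"dystopia": 410, "great britain": 5},
    "A Clockwork Orange": {"dystopia": 120, "torture": 40, "great britain": 9},
}


def rank_books_for_tag(tag_counts, tag):
    """Return (book, count) pairs for books carrying `tag`, most-tagged first."""
    ranked = [
        (book, counts[tag])
        for book, counts in tag_counts.items()
        if counts.get(tag, 0) > 0
    ]
    # Raw counts are the simplest possible ranking signal.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


for tag in ("dystopia", "torture", "great britain"):
    print(tag, rank_books_for_tag(TAG_COUNTS, tag))
```

With this toy data 1984 comes first for “dystopia” and last for “great britain.” A real system would probably normalize by how heavily each book is tagged overall, which this sketch does not attempt.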

This is all well-worn territory. My question is this: Is there any way to relevancy-rank books within subjects?

I was reminded of the question when checking out OCLC’s new project, FictionFinder. I’ll blog about the whole thing later, but for now know that you can search for an LCSH subject and get back a list of books belonging to it. (I can’t link to the results, which are session based.**) Check out the LCSH “City and Town Life” and the top book is Red Badge of Courage. Lacking a better method, FictionFinder lets popularity (the number of OCLC libraries with a copy) stand in for relevance. LibraryThing does the same, using our popularity numbers instead. The results are not systematically better (in this case Ulysses wins).

I tried two solutions:

The first was to tie into LibraryThing’s tags. So, figure out what tags are most characteristic of books with the subject “Man-Woman Relationships,” and then use the presence and number of these tags to rank the subject results. So, for example, “Man-Woman Relationships” has a global correlation with “relationships,” “dating” and “romance,” none of which are very prominent among the tags applied to Great Expectations, so it can fall low on the list.
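A rough sketch of how that first approach might look, in Python. This is one plausible reading of the idea, not the code I actually ran. All tag counts are invented, Bleak House and Moby-Dick are included only as made-up “books outside the subject,” and BOOK_TAGS, SUBJECT_BOOKS, tag_profile, characteristic_tags and rank_subject are hypothetical names.

```python
from collections import defaultdict

# Invented tag counts: book -> {tag: times applied}.
BOOK_TAGS = {
    "High Fidelity": {"relationships": 40, "romance": 20, "music": 40},
    "I Kissed Dating Goodbye": {"dating": 50, "relationships": 30, "christian": 20},
    "Great Expectations": {"classic": 60, "victorian": 35, "romance": 5},
    "Bleak House": {"classic": 50, "victorian": 50},
    "Moby-Dick": {"classic": 70, "whales": 30},
}
# Invented subject membership: subject heading -> list of books.
SUBJECT_BOOKS = {
    "Man-woman relationships": [
        "High Fidelity",
        "I Kissed Dating Goodbye",
        "Great Expectations",
    ],
}


def tag_profile(books):
    """Average, over `books`, of each tag's share of a book's total tagging."""
    profile = defaultdict(float)
    for book in books:
        counts = BOOK_TAGS[book]
        total = sum(counts.values())
        for tag, n in counts.items():
            profile[tag] += n / total / len(books)
    return profile


def characteristic_tags(subject):
    """Tags that are more prominent inside the subject than across all books."""
    inside = tag_profile(SUBJECT_BOOKS[subject])
    overall = tag_profile(list(BOOK_TAGS))
    return {
        tag: share - overall[tag]
        for tag, share in inside.items()
        if share > overall[tag]
    }


def rank_subject(subject):
    """Order a subject's books by how strongly they carry its characteristic tags."""
    weights = characteristic_tags(subject)
    scores = {}
    for book in SUBJECT_BOOKS[subject]:
        counts = BOOK_TAGS[book]
        total = sum(counts.values())
        scores[book] = sum(
            weights.get(tag, 0.0) * n / total for tag, n in counts.items()
        )
    return sorted(scores, key=scores.get, reverse=True)


print(rank_subject("Man-woman relationships"))
```

On this toy data Great Expectations drops to the bottom of the list, because “classic” and “victorian” are no more common inside the subject than outside it. With so few books, incidental tags like “music” also pick up some weight; real data would need far more books to wash that out.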

I got far enough down this road to know it was going to help.

The second and more interesting approach was to see if books can be ranked within subjects without any other information. This would help OCLC, who are unlikely to pay for LibraryThing data, and any library that employs LCSH, most of which would have no “popularity” data to use either.

I hit upon the idea that subjects “reinforce” each other, and that this must leave a statistical signature. For example, it seems that “Love stories” and “Psychological fiction” are commonly applied to books about “Man-Woman Relationships,” but that “Androgynous robot alone on an island — Stories” is not. (Okay, that’s not real, but the point stands.) Can these “related subjects” relevancy rank the subject itself?

I wish so, but I can’t get it to work well enough. It works for some topics, but falls down for others, laughably.
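For what it’s worth, the “reinforcement” idea can be written down very simply. The Python sketch below is a naive version, not the code I tried: it scores each book in a subject by how commonly its other headings co-occur with that subject. The books, the headings assigned to them, and the names BOOK_SUBJECTS and rank_by_reinforcement are all made up.

```python
from collections import Counter

# Invented catalog: book -> set of subject headings.
BOOK_SUBJECTS = {
    "Book A": {"Man-woman relationships", "Love stories", "Psychological fiction"},
    "Book B": {"Man-woman relationships", "Love stories"},
    "Book C": {"Man-woman relationships", "Love stories", "Psychological fiction"},
    "Book D": {"Man-woman relationships", "Orphans -- Fiction"},
    "Book E": {"Orphans -- Fiction", "England -- Fiction"},
}


def rank_by_reinforcement(book_subjects, subject):
    """Rank the books under `subject` using only co-occurring subject headings."""
    members = [b for b, subs in book_subjects.items() if subject in subs]

    # For every other heading, the share of the subject's books that also carry it.
    co_counts = Counter()
    for book in members:
        co_counts.update(book_subjects[book] - {subject})
    co_share = {other: n / len(members) for other, n in co_counts.items()}

    # A book scores the average co-occurrence share of its other headings.
    scores = {}
    for book in members:
        others = book_subjects[book] - {subject}
        scores[book] = (
            sum(co_share[o] for o in others) / len(others) if others else 0.0
        )
    return sorted(members, key=scores.get, reverse=True)


print(rank_by_reinforcement(BOOK_SUBJECTS, "Man-woman relationships"))
```

Here Book D, whose only other heading is rare among the subject’s books, lands last. The weakness is easy to see even in the toy: a book with a single, very common companion heading outranks books with richer cataloging.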

Some ideas I’ve considered:

  • Treating subjects as links, and running some sort of “page-rank” style connection algorithm against them. Maybe this would bring out coincidences that simple statistics misses.
  • Using other library data, such as LCC and Dewey. This would be reminiscent of how I made LibraryThing’s LCSH/LCC/Dewey recommendations.
  • Doing statistics on other fields, such as the title. So, for example, there’s probably a statistical correlation between “Man-woman relationships” and books with “dating,” “men and women” and “proposal” in the title.
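The first idea in the list could be sketched as a random walk with restart (a personalized, page-rank style iteration) over a graph that links each book to its subject headings. This is only one interpretation of “treating subjects as links,” written in Python; the data is invented and the names BOOK_SUBJECTS and rank_by_random_walk are hypothetical, as are the damping and iteration defaults.

```python
from collections import defaultdict

# Invented catalog: book -> set of subject headings.
BOOK_SUBJECTS = {
    "Book A": {"Man-woman relationships", "Love stories", "Psychological fiction"},
    "Book B": {"Man-woman relationships", "Love stories"},
    "Book C": {"Man-woman relationships", "Love stories", "Psychological fiction"},
    "Book D": {"Man-woman relationships", "Orphans -- Fiction"},
    "Book E": {"Orphans -- Fiction", "England -- Fiction"},
}


def rank_by_random_walk(book_subjects, subject, damping=0.85, iterations=50):
    """Rank the books under `subject` by a random walk that restarts at the subject."""
    # Undirected bipartite graph: book nodes on one side, subject nodes on the other.
    neighbours = defaultdict(set)
    for book, subjects in book_subjects.items():
        for s in subjects:
            neighbours[("book", book)].add(("subject", s))
            neighbours[("subject", s)].add(("book", book))

    start = ("subject", subject)
    if start not in neighbours:
        return []

    # All probability mass begins on the subject being ranked.
    rank = {node: 0.0 for node in neighbours}
    rank[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in neighbours}
        nxt[start] = 1.0 - damping  # restart probability
        for node, value in rank.items():
            share = damping * value / len(neighbours[node])
            for other in neighbours[node]:
                nxt[other] += share
        rank = nxt

    members = [b for b, subs in book_subjects.items() if subject in subs]
    return sorted(members, key=lambda b: rank[("book", b)], reverse=True)


print(rank_by_random_walk(BOOK_SUBJECTS, "Man-woman relationships"))
```

On this toy data the walk favors the books that share several headings with other members of the subject, and the book linked only through a rare heading comes last. It also rewards books simply for having more headings, which may or may not be what one wants.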

None strike me as the silver bullet.

Anyway, my plane has landed–allowing me to do real work again–so I end in aporia. Ideas?

*I’m itching to blog it, but I have to hold off for now. I’ll throw some pictures up soon, however. I’d never been to San Francisco before. What a wonderful wonderful town.
**One can understand why OPACs made in 1996 are session based. How frustrating to see a new product with them.

Labels: Uncategorized