So begins the PDF announcing and detailing a major new development for the library-data world. Simon Spero, library-geek extraordinaire, has released a nearly complete copy of the Library of Congress Authority Files.
Get them here:
Simon assembled the files, available in MarcXML, by querying the Library of Congress’ Authorities website one-by-one over months. He’s a patient man.
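For the curious, here is a rough sketch of what reading one MarcXML authority record might look like in Python, using only the standard library. The sample record is invented for illustration and is not taken from Simon's files; the helper names (`subfields`, `summarize`) are my own. It assumes the usual MARC 21 conventions, where tag 100 holds a personal-name heading and tag 400 holds a "see from" variant.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Library of Congress MARCXML schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nz  a2200000n  4500</leader>
  <controlfield tag="001">n00000000</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author,</subfield>
    <subfield code="d">1900-1980</subfield>
  </datafield>
  <datafield tag="400" ind1="1" ind2=" ">
    <subfield code="a">Example, A.</subfield>
  </datafield>
</record>"""


def subfields(record, tag):
    """Return one list of subfield values per datafield with the given tag."""
    results = []
    for field in record.findall(f"marc:datafield[@tag='{tag}']", MARC_NS):
        results.append([sf.text for sf in field.findall("marc:subfield", MARC_NS)])
    return results


def summarize(xml_text):
    """Pull the main heading (100) and variant names (400) out of one record."""
    record = ET.fromstring(xml_text)
    headings = [" ".join(parts) for parts in subfields(record, "100")]
    variants = [" ".join(parts) for parts in subfields(record, "400")]
    return {"heading": headings[0] if headings else None, "variants": variants}


if __name__ == "__main__":
    print(summarize(SAMPLE_RECORD))
```

Real authority records carry many more fields (corporate names, subjects, notes), so treat this as a starting point rather than a full parser.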
As I’ve discussed before, Library of Congress data is both free and unfree. As a work of the US government, it cannot be copyrighted.* But the LC has traditionally restricted access, offering small amounts through public interfaces**, and selling larger amounts through its Cataloging Distribution Service. A small industry has developed where the CDS’s buyers resell it commercially. Until now, nobody has decided to just… let it go.
I anticipate that Simon’s action will draw some criticism. If the LC can’t make money selling its cataloging, how will it support this vital work? This sentiment will grow stronger when Casey Bisson releases the full LC MARC data. But whether the data in question is authorities or other cataloging, I think this view is short-sighted.
As I see it, the failure of the LC and other libraries to get their data “out there” on the open web has hurt them far more deeply than their catalog sales could ever recoup. It has made them seem irrelevant, standing silent and apart from the great conversation, which grows more interesting with each passing year.
The first culprits are the online catalogs***, ugly, backward things lamed with session-based URLs. If you want to link to the LC, you can’t. The URL you get will only work for you, for ten minutes. Linking–the very soul of the Web–is impossible.
The second culprit is how libraries have distributed the data itself. Amazon makes its book data accessible to all in a handy, universally understood XML format. It’s so easy and appealing that over 140,000 developers have signed up to receive it. Libraries, by contrast, generally make their data available (if they make it available at all) over a tricky and obscure protocol known as z39.50. And the data itself is in MARC, a rich but impenetrable spectrum of formats (e.g., DanMARC, the Danish MARC format!) used by, and largely understood only by, librarians.
With wretched web sites and unretrievable, unparseable data, libraries have lost vital ground. If the world worked right, Googling a book would turn up a library within the first few results. But libraries seldom make the top 100, and despite being the largest library on the planet and producing the lion’s share of original cataloging, the Library of Congress is completely absent. In its place are Amazon, its peers and sites that use Amazon data.**** Libraries may know a lot, but simplicity, attractiveness and ubiquitous data have won out.
It’s time to fight back. Libraries and library data can change the book web for the better. Three cheers to Simon for making a critical first step. Viva La Revolución, my brother.*****
*The LC reserves the right to copyright it outside of the United States. It’s unclear if they ever have.
**In LibraryThing’s case, through a z39.50 connection. Although the limits are not clearly specified, we’ve been given to understand that large-scale mining will not be tolerated.
***What library techs call OPACs: Online Public Access Catalogs. The fact that someone still needs to add “Public Access” to “Online” is the problem in miniature. Does Google call itself a Public Access Search Engine?
****Don’t get me wrong; Amazon is a great site, and should be up in the top results too.
*****Insofar as both Simon and I blogged the death of Milton Friedman, I suspect we’re equally uneasy with revolutionary Spanish.