Archive for the ‘open data’ Category

Friday, October 29th, 2010

Better German cataloging from open data

University of Konstanz (Wikimedia Commons)

Casey has just finished loading 1.38 million library MARC records from Konstanz University into LibraryThing’s search index, Overcat.

While Overcat isn’t the only way to find German items–you can search libraries directly–it has become many members’ first source. At 35.2 million items, it’s now considerably larger than any remote source, as well as faster and more diverse. The Konstanz University records jump it up significantly as a German-language source.

Adding the records was possible because Konstanz chose to release the records as “CC-0,” essentially “public domain.” Inasmuch as OCLC has convinced (or intimidated) much of the library world into acting as if library records were private property, this was a brave move.(1) You can read more about the release on the Open Knowledge Foundation blog. It’s notable they originally opted for a more restricted, non-commercial license, but, under prompting from German librarians, opened it up all the way.

And what will we do with these records? Evil things! Hardly. LibraryThing has never sold library records and we never will. But the records will make a small percentage of members happy, as their German books suddenly got easier to catalog. These records, in turn, will serve as a scaffold to add other cataloging-like data—what we call Common Knowledge (CK)—all of which is released under a Creative Commons Attribution license. In this way open data improves open data, and everyone is the richer.

1. Their action is especially notable in that German governmental agencies aren’t required to disclaim copyright, as US ones are. Locking up free US government or government-funded library data, as OCLC does, is obnoxious and legally dubious, but Germany has different rules–including a true “database copyright” the United States lacks.

Labels: cataloging, open data, openness

Thursday, February 19th, 2009

Seeing parallels

Steve Lawson wrote this wonderful piece for his blog See also…, reprinted here (by permission) in full:

There is a large organization whose main business isn’t producing information, but instead hosting and aggregating information for many thousands of users on the web. Users upload content, and use the service to make that content public worldwide, and, likewise, to find other users’ content. Then one day the large organization decides to change the rules about how that information is shared, giving the organization more rights–to the point where it sounds to some people like the organization is trying to claim ownership of the users’ content, rather than simply hosting it and making it available on the web.

A small but vocal and influential group of users object to the policy change. The organization protests that it isn’t their intent to fundamentally change their relationship with their users and that legal documents tend to sound scarier than they really are. Most customers are either unaware or unconcerned by the change in policy, but the outcry continues until the organization backs down a bit, sticking with the old policy for the time being. The future, though, is up in the air.

Facebook? Or OCLC?

Perfect, just perfect.

Labels: facebook, oclc, open data, steve lawson

Monday, December 22nd, 2008

RIP

Ed Summers’ presentation of Library of Congress Subject Headings data as Linked Data has ended. As Ed explained:

“On December 18th I was asked to shut off by the Library of Congress. As an LC employee I really did not have much choice other than to comply.”

I am not as up on or enthusiastic about Ed’s Semantic-Web intentions, but the open-data implications are clear: the Library of Congress just took down public data. I didn’t think things could get much worse after the recent OCLC moves, but this is worse. The Library of Congress is the good guy.

Jenn Riley put it well:

“I know our library universe is complex. The real world gets in the way of our ideals. … But at some point talk is just talk and action is something else entirely. So where are we with library data? All talk? Or will we take action too? If our leadership seems to be headed in the wrong direction, who is it that will emerge in their place? Does the momentum need to shift, and if so, how will we make this happen? Is this the opportunity for a grass-roots effort? I’m not sure the ones I see out there are really poised to have the effect they really need to have. So what next?”

The time has come to get serious. The library world is headed in the wrong direction. It’s wrong for patrons—and taxpayers. And it’s wrong for libraries.

By the way, Ed, we’re recruiting library programmers. The job description includes wanting to change the world.

See also: Panlibus.

Labels: library of congress, open data

Tuesday, August 5th, 2008

Open Shelves Classification: Welcome Laena and David

Back in July I proposed the Open Shelves Classification (OSC), a new, free, crowdsourced replacement for the Dewey Decimal System. I also created a group to start in on the project.

The proposal included a call for a volunteer to lead the group. I was happy to write the software, and members would create the OSC, but someone with a library degree was needed to shepherd the project and make the occasional tough decision.

I’ve found two: the LIS team of Laena McCarthy and David Conners. It turns out, I already knew them. Abby and I met with Laena and David, back at ALA 2007, when they were MLS students doing a joint LibraryThing-related project called Folksonomies in Action. They impressed us then. It was extraordinary to talk to librarians with a deep understanding and creative take on the ideas LibraryThing was exploring. Since then Laena and David have started promising careers as librarians and professors. So, after receiving word they were interested in the project, we are only too happy to bring them on.

Laena M. McCarthy (user: laena). Laena is currently an Assistant Professor and Image Cataloger at the Pratt Institute in NYC. Her bio contains the priceless bit:

“Previously, she worked in Antarctica as the world’s Southernmost librarian, where she provided a remote research station with access to information. She incorporated into the library the first permanent art gallery in Antarctica.”

Laena’s teaching and research focus on the application of bottom-up, usability-centric design and collaboration. She is currently researching image tagging, FRBR for works of art & architecture, and information architecture. Her work has been published in Library Journal and the forthcoming Magazines for Libraries 2008.

In her free time, among other things, she can be found making jam, competing in food competitions, scuba diving and writing.

David Conners (user: conners). David is the Digital Collections Librarian at Haverford College in Pennsylvania. At Haverford, David works to make the College’s unique materials, such as the first organized protest against slavery in the New World, available online. He also oversees the College’s oral history program and the audio component of Special Collections exhibits such as “A Few Well Selected Books.”

David’s research interests include subject analysis, FRBR, and, occasionally, doped ablators. His work has been published in Library Journal, The Serials Librarian, and Physics of Plasmas.

The torch is passed! From this point on, it’s their project to direct. But we’re in agreement on their role: They aren’t royalty, they’re facilitators. They’re there to listen and to encourage conversation. They’re there to guide things toward consensus. They’re there to see the project stays on track and true to its goals. They’re there to propose forking the project or moving it elsewhere, if that’s what it needs and the community wants it.

Laena and David are doing this for fun and interest. As a fun side-project with no financial component (OSC is by definition public domain in every respect), we can’t pay them. But we’ve promised to help pay their way to LIS conferences, if someone wants them to talk about it. (At least one group already does.) And there’s the hope that, if OSC can accomplish its goals, they will have helped create something highly beneficial for libraries and library patrons everywhere.

If you’re interested in the project, come join the group and find out more.

Labels: DDC, dewey decimal, Dewey Decimal Classification, open data, Open Shelves Classification, OSC

Thursday, February 21st, 2008

Taxation without web presentation

The Library of Congress recently signed a deal to accept 3 million dollars’ worth of “technology, services and funding” from Microsoft towards building a new website powered by Microsoft’s Silverlight plug-in. I (Casey) usually leave the blogging to Tim, but I’ve got to say something about this.

Microsoft, in general, is very good to libraries, and libraries are very good to them. Microsoft gets huge tax breaks for donating software licenses — something that doesn’t really cost them a thing — and libraries get software they couldn’t afford otherwise.

This is a different beast, however. It sounds like Microsoft technologies will be used from the ground-up — if you use Microsoft’s Silverlight to do the front-end, your developers pretty much have to use Visual Studio and Microsoft languages, your database admins have to use MS SQL Server, and your systems admins have to use Windows and IIS. In any case, it seems unlikely that Microsoft would consult on a project and not recommend you use Microsoft as much as possible.

Once you’re locked in to the entire Microsoft stack, you pretty much can’t change a single piece without completely redoing your entire IT operation from top-to-bottom. When the free deal expires or you need new servers, you end up having to buy new Microsoft licenses and software. It’s like giving somebody a kitten for a present — they’ll still be paying for and cleaning up after your gift 10 years from now.

Most disturbingly, users are locked in, too: anybody using an iPhone, an old version of Windows, any version of Linux, or any other operating system or device not supported by Silverlight will be unable to use the Library of Congress’ new website. How is that compatible with the principles of democracy or librarianship? It’s taxation without web presentation. And how exactly is that a quantum leap forward? (If the LOC really wanted to make a quantum leap, it would open up its data.)

Giant package deals are the wrong way to make both technical and business decisions about software; it doesn’t matter who’s doing the packaging, or how. You should be able to use the best operating system for the job, the best database for the job, and the best programming language for the job. You should be able to hire developers and systems administrators, not Microsoft developers and Windows administrators, and should give them the freedom to use the best solution, not the Microsoft solution. Sometimes the Microsoft solution is best, sometimes it isn’t, but that’s something that shouldn’t be dictated unilaterally.

I take comfort when I see one of our competitors looking to hire Microsoft developers instead of software developers, for reasons the hacker/entrepreneur Paul Graham explained well:

“If you ever do find yourself working for a startup, here’s a handy tip for evaluating competitors. Read their job listings. Everything else on their site may be stock photos or the prose equivalent, but the job listings have to be specific about what they want, or they’ll get the wrong candidates.”

“During the years we worked on Viaweb I read a lot of job descriptions. A new competitor seemed to emerge out of the woodwork every month or so. The first thing I would do, after checking to see if they had a live online demo, was look at their job listings. After a couple years of this I could tell which companies to worry about and which not to. The more of an IT flavor the job descriptions had, the less dangerous the company was. The safest kind were the ones that wanted Oracle experience. You never had to worry about those. You were also safe if they said they wanted C++ or Java developers. If they wanted Perl or Python programmers, that would be a bit frightening– that’s starting to sound like a company where the technical side, at least, is run by real hackers. If I had ever seen a job posting looking for Lisp hackers, I would have been really worried.”

But it’s disappointing to see an institution you respect, admire, and fund with your tax dollars going down that same road. It’s even more disappointing because the Library of Congress does make smart decisions about technology. They announced another major project a few months back that took an entirely different approach to selecting the tools they would use. The people behind the World Digital Library sat down and thought about the best tools for the job, and they came up with an interesting and eclectic list: “python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware”. Those tools are free, open-source, designed with developer productivity in mind, aren’t tightly linked to each other, and don’t inherently limit who can access your website. That’s what should matter.

Labels: library of congress, microsoft, open data, open source

Tuesday, December 11th, 2007

Open data and the Future of Bibliographic Control

We’ve got until December 15th to submit comments on the draft report produced by the Working Group on the Future of Bibliographic Control.

No, keep reading! This is important. People in the library profession need to be involved in this stuff. Further, people outside the profession need to be involved too. As the report notes, library data is used by many outside the library world, starting with library patrons, and extending even to It shouldn’t go unnoticed, for example, that the draft report mentions LibraryThing four times. For while LibraryThing uses library data, it was invented by and is mostly used by non-librarians.

Aaron Swartz, the dynamo behind Open Library, sent me a note about one important aspect of the draft report, namely what it’s missing: It doesn’t mention open data. There is serious discussion about sharing, but also the alarming proposal that the LC attempt to recoup more money from the sale of its data. That’s a shame. I’m not alone in believing that open access to library data is the future. A report about the future should confront the future.

The economy of library records is a complex one but not primarily a free one. By and large libraries pay the Dublin, Ohio-based OCLC for their records, even if the records were created at government expense. That model looks increasingly dated. And it is killing innovation.

It hasn’t killed LibraryThing yet, but the specter has always hung over our head. It’s why LibraryThing has—so far—not pitched itself to small libraries. OCLC doesn’t care about personal cataloging, and the libraries we use are—in every conversation I’ve had—enthusiastic about what we do. They want their data out there; they’re libraries for Pete’s sake! But if we offered data to public libraries we’d be cutting into the OCLC profit model. That could be dangerous.

Aaron invited me to sign onto a list of people interested in the issue. I did so. I invite you—any of you—to do so as well. The text says it perfectly:

“Bibliographic records are part of our shared cultural heritage and should be made available to the public for re-use without restriction. This will allow libraries to share records more efficiently, but will also make possible more advanced online sites for book-lovers, easier analysis by social scientists, interesting visualizations and summary statistics by journalists and others, as well as many other possibilities we cannot predict in advance.”

“Government agencies and public institutions are increasingly making data open. We strongly encourage the Library of Congress to join this movement by recommending that more bibliographic data is made available for access, re-use and re-distribution without restriction.”

The petition is here: .

Labels: library of congress, open data, open library, Working Group on the Future of Bibliographic Control

Saturday, September 22nd, 2007

Magical Thinking at Harvard

A Babylonian Demon Bowl (Kelsey Museum)

“Know the secret name of something and you control it,” is an extremely ancient idea, stretching as far back as the Sumerians, and running through subsequent Mesopotamian, Egyptian and Greco-Roman magic. The secrecy of the name was critical to its power, and to the mystique of those who knew it. One suspects it also helped their hourly rates.

Its modern equivalent is the “unique identifier.” Information is available as never before, but its sheer quantity limits discovery. Unique identifiers cut through the clutter. And they can be powerful. Let the wrong person know your Social Security Number and you’ll be in a world of hurt as great as a malevolent spirit caught by a name under a Babylonian demon bowl.

In the legal world the equivalent is the West American Digest System, which numbers court cases for lawyers. Although the cases are invariably in the public domain, the numbers that identify them are not. And controlling “the only recognized legal taxonomy” gives its creator, West Publishing, a valuable monopoly.

In the book world, it’s the ISBN. Know a book’s title and you can find yourself adrift in a sea of editions. Discover its ISBN and you’ve got it for sure. Type the ISBN into BookFinder or and you’ve a panoply of new and used sellers.

Although assigned by private firms, ISBNs will never go the way of the West American Digest System. But their power explains why the Harvard Coop* has taken to ejecting customers who attempt to write down ISBNs. As reported in the Crimson, this is exactly what happened to one Harvard student, Jarret A. Zafra. In another (?) incident, reported by the Herald, the Coop called the police on three more ISBN-scribblers.** When asked about the policy, Coop administration told the Crimson that it “considers that information the Coop’s intellectual property.”

The IP claim is hogwash. ISBNs are facts. Under US law facts can’t be copyrighted. The Coop is probably within its rights to expel whomever it wants, but that won’t stop people from trying. The three students above were volunteers for a site called CrimsonReading, which is compiling a complete list of all books used at Harvard. When a Harvard student types in an ISBN, CrimsonReading connects them to new and used booksellers. Affiliate revenues go to charity. By calling on volunteers and getting Harvard professors involved, CrimsonReading is getting around the Coop’s magical secrecy. Three cheers to them for doing it.

We need more projects like CrimsonReading. Much the same idea was behind my Google Book Search Search bookmarklet, which asked volunteers to collect Google Book Search IDs. In this case, the unique identifier was new and more secret. By giving its scans unique—and effectively secret—numbers, Google is creating a whole new bibliographic identification scheme. And where ISBNs cover only about thirty years of books, Google’s IDs are designed to cover every book printed, including millions in the public domain.

Control the name and you control the thing. It’s what WestLaw is doing. It’s what the Coop is trying to do.

Is it what Google is doing? I’m not sure. And I don’t see any signs of this happening on its own yet. For example, sellers on used book sites are not using Google Book IDs to nail down editions. But the danger is there.

Secret and proprietary numbering systems pose a serious challenge to the benign potential of the internet. When the secrecy or obscurity are used against this potential, people need to act up—and break the spell.

*Always pronounced “coop,” not “coöp.” Full disclosure: My parents belong to the Coop, which is a true “cooperative” in organization. This means they share in the annual dividend according to how much they spend there. So I’m working against them!
**I grew up near Harvard Square, and the Coop was one of my haunts. (It’s a general-purpose bookstore as well.) Quite a few of my friends were expelled from the Coop for shoplifting. If CrimsonReading really wants to get the job done, it should enroll the private-school street urchins of the Square in the ISBN game.

Labels: google book search, harvard coop, isbns, open data, westlaw

Monday, July 16th, 2007

Open Library

The word is finally out about Open Library, the Internet Archive’s open cataloging project:

It’s too late in the evening to get into what it’s about. You can read about it. But I can tell you it’s a big deal. Open Library is going to change book data forever. It’s not clear to me how all the ideas will shake out (the wiki idea will be a particularly hard sell to many in the library world!), but I know this: the genie is out of the bottle. Book data is opening up.

It’s a relief to talk about it. I was one of the people at the first meeting too, and, before that, I had some role in developing one of the central ideas—an open source alternative to OCLC, building from the LC records.* I missed a second meeting, and I ticked off some with my insistence that Open Library be developed openly as well. In retrospect, I was too hard on them.

Well, it’s all out now, and it’s wide open. The developers are eager to find out what you think. You can download the code. Congratulations to Brewster Kahle, Aaron Swartz and the rest for bringing Open Library so far so fast.

I can’t wait to see where it takes us.

*From my email, it looks like Casey Bisson had this idea around the same time as I did. Either way, I never went beyond talking, and Casey pushed it forward. (See this Talis podcast.) I don’t know what his role in the final product was, but he deserves a big share of the praise.

Labels: internet archive, open data, open library

Thursday, April 12th, 2007

WorldCat: Think locally, act globally

OCLC just announced a “pilot” of WorldCat Local. In essence, WorldCat Local is OCLC providing libraries with an OPAC.

That’s the news. Here’s the opinion. Talis’ estimable Richard Wallis writes:

“Yet another clear demonstration that the library world is changing. The traditional boundaries between the ILS/LMS, and library and non-library data services are blurring. Get your circulation from here; your user-interface from there; get your global data from over there; your acquisitions from somewhere else; and blend it with data feeds from here, there and everywhere is becoming more and more a possibility.”

I think this is exactly wrong. OCLC isn’t creating a web service. They’re not contributing to the great data-service conversation. They’re trying to convert a data licensing monopoly into a services monopoly. If the OCLC OPAC plays nice with, say, the Talis Platform, I’ll eat my hat. If it allows outside Z39.50 access I’ll eat two hats.

They will, as the press release states, “break down silos.” They’ll make one big silo and set the rules for access. The pattern is already clear. MIT thought that its bibliographic records were its own, but OCLC shut them down when they tried to act on that. The fact is, libraries with their data in OCLC are subject to OCLC rules. And since OCLC’s business model requires centralizing and restricting access to bibliographic data, the situation will not improve.

As a product, WorldCat Local will probably surpass the OPACs offered by the traditional vendors. It will be cleaner and work better. It may well be cheaper and easier to manage. There are a lot of good things about this. And, lest my revised logo be misunderstood, there are no bad people here. On the contrary, OCLC is full of wonderful people, people who’ve dedicated their lives to some of the highest ideals we can aspire to. But the institution is dependent on a model that, with all the possibilities for sharing available today, must work against these ideals.

Keeping their data hidden, restricted and off the “live” web has hurt libraries more than we can ever know. Fifteen years ago, libraries were where you found out about books. One would have expected that to continue on the web–that searching for a book would turn up libraries alongside bookstores, authors and publishers.

It hasn’t worked out that way. Libraries are all-but-invisible on the web. Search for the “Da Vinci Code” and you won’t get the Library of Congress–the greatest collection of books and book data ever assembled–not even if you click through a hundred pages. You do get WorldCat, seventeen pages in!

The causes are multiple, and have been discussed before. But a major factor is how libraries deal with book data, and that’s largely a function of OCLC’s business model. Somehow institutions dedicated to the idea that knowledge should be freely available to all have come to the conclusion that knowledge about knowledge (book data) should not, and traditional library mottos like Boston’s “Free to All” and Philadelphia’s Liber Libere Omnibus (“Free books for all!”) have given way to:

“No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

We now return you to our regularly-scheduled blogging.

Labels: library of congress, oclc, open data, worldcat local