Archive for the ‘open data’ Category

Tuesday, May 15th, 2012

Harvard University’s 12 million records now in LibraryThing

Short version. Our “Overcat” search now includes 12.3 million records from Harvard University!

Long version. On April 24 the Harvard Library announced that more than 12 million MARC records from across its 73 libraries would be made available under the library’s Open Metadata policy and a Creative Commons 0 public domain license. The announcement stunned the library world, because Harvard went against the wishes of the shared-cataloging company OCLC, which has long sought to prevent libraries from releasing records in this way. (For background on OCLC’s efforts see past blog posts.)

It took a while to process, but we’ve finally completed adding all 12.3 million MARC records (3.1GB of bibliographic goodness!) to LibraryThing. They’ve gone into OverCat, our giant index of library records from around the world—now numbering more than 51 million records! As a result, when searching OverCat under “Add books,” you’ll now see results “from Harvard OpenMetadata.”

This release (“big data for books,” as David Weinberger calls it) is, to put it mildly, a Very Big Deal. Harvard’s collections are both deep and broad, covering a wide variety of languages, fields, and formats. The addition of these 12 million records to OverCat has significantly improved our capacity for the cataloging of scholarly and rare books, and greatly enhanced our coverage generally.

Kudos to Harvard for making this metadata available, and we hope that other libraries will follow suit.

For more on the metadata release, see Quentin Hardy’s New York Times blog post, the Dataset description, or the Open Metadata FAQ. And happy cataloging!

Come discuss here.


Harvard requests and we’re happy to add: The “Harvard University Open Metadata” records in OverCat contain information from the Harvard Library Bibliographic Dataset, which is provided by the Harvard Library under its Bibliographic Dataset Use Terms and includes data made available by, among others, OCLC Online Computer Library Center, Inc. and the Library of Congress.

Labels: cataloging, open data

Friday, October 29th, 2010

Better German cataloging from open data


University of Konstanz (Wikimedia Commons)

Casey has just finished loading 1.38 million library MARC records from Konstanz University into LibraryThing’s search index, Overcat.

While Overcat isn’t the only way to find German items–you can search libraries directly–it has become many members’ first source. At 35.2 million items, it’s now considerably larger than any remote source, as well as faster and more diverse. The Konstanz University records jump it up significantly as a German-language source.

Adding the records was possible because Konstanz chose to release the records as “CC-0,” essentially “public domain.” Inasmuch as OCLC has convinced (or intimidated) much of the library world into acting as if library records were private property, this was a brave move.(1) You can read more about the release on the Open Knowledge Foundation blog. It’s notable that they originally opted for a more restrictive, non-commercial license but, under prompting from German librarians, opened it up all the way.

And what will we do with these records? Evil things! Hardly. LibraryThing has never sold library records and we never will. But the records will make a small percentage of members happy, as their German books suddenly got easier to catalog. These records, in turn, will serve as a scaffold to add other cataloging-like data—what we call Common Knowledge (CK)—all of which is released under a Creative Commons Attribution license. In this way open data improves open data, and everyone is the richer.


1. Their action is especially notable in that German governmental agencies aren’t required to disclaim copyright, as US ones are. Locking up free US government or government-funded library data, as OCLC does, is obnoxious and legally dubious, but Germany has different rules–including a true “database copyright” the United States lacks.

Labels: cataloging, open data, openness

Thursday, September 17th, 2009

The Amazon policy change, and how we’re responding.

“Amazon Cardboard Boxes” by Flickr member Akira Ohgaki (Attribution 2.0 Generic)

Summary: Amazon is requiring us to remove links to other booksellers on work pages. We’re creating a new “Get it Now” page, with links to other booksellers, especially local bookstores and libraries, and a host of new features. Talk about it here.

The challenge. We’re days away from releasing a series of changes to our book pages, both forced and intentional. Amazon is requiring all websites, as a condition of getting any data from them, to have the primary page link to Amazon alone. Links to other booksellers are prohibited. Secondary pages—pages you go to from the primary page—can have non-Amazon links.

Everyone at LibraryThing disagrees with this decision. LibraryThing is not a social cataloging and social networking site for Amazon customers but for book lovers. Most of us are Amazon customers on Tuesday, and buy from a local bookstore or borrow from a library on Wednesday and Thursday! We recognize Amazon’s value, but we certainly value options.

Importantly, the decision is probably not even good for Amazon. Together with a new request-monitoring system, a ban on iPhone applications that use Amazon data, and much of their work on the Kindle, Amazon is retreating from its historic commitment to simplicity, flexibility and openness. They won through openness. Their data is all over the web, and with it millions of links to Amazon. They won’t benefit from a retreat here.

But agree or not, we have to follow their terms. We thought long and hard about giving up Amazon data entirely, converting to library data only, in concert with a commercial provider, like Bowker or Ingram, and with help from publishers and members. Unlike our competitors, who are exclusively based on Amazon and who don’t “catalog” so much as keep track of which Amazon items you have, that option is available to us. But we’d lose a lot, particularly book covers. Ultimately, we’ve decided the disadvantages outweigh the benefits.

The Response. Most of all, we think we’ve found a way to give Amazon what they require, and continue to provide members with options: We’re going to cut our primary-page links back to Amazon alone, and give people the best, most diverse secondary pages we can make. We are allowed to link to other booksellers, like IndieBound and Barnes and Noble, on secondary pages, and we’re going to do it far better than we ever have. We’re going to take something away, but also make something better—something that goes way past what we did before, in features and in diversity of options.

The upcoming “Get it Now” page will go far beyond our current “Buy, borrow, swap” links, with a live new and used price-comparison engine, as well as sections for ebooks, audiobooks and swap sites. The page will be edition-aware, and draw on feeds or live data (so the links work). Many members have wanted live pricing data for the books they already own and these features can be used for that purpose too. We’ll also be doing some stuff with libraries nobody else has, or can, do.

Key to the upcoming Get it Now page is a “Local” module, drawing on LibraryThing Local, showing all the libraries and bookstores near you. Where possible, this list will incorporate holdings data and links to buy—the sort of information you never get from a Google search on a book. If not, we’ll give you their telephone numbers and show you where they are on a map. We’ll make the page customizable, and let members add sources to it.

We think the new page will make a lot of members happy. For one thing, LibraryThing has never been about buying books, so having all these links on a separate page won’t be a great loss. And if the new format doesn’t make members happy, we’ll listen, and together we can plan to take LibraryThing on a truly independent course.

Post your comment here, or come talk about this on Site Talk.

Labels: amazon, apis, google, open data

Thursday, February 19th, 2009

Seeing parallels

Steve Lawson wrote this wonderful piece for his blog See also…, reprinted here (by permission) in full:

There is a large organization whose main business isn’t producing information, but instead hosting and aggregating information for many thousands of users on the web. Users upload content, and use the service to make that content public worldwide, and, likewise, to find other users’ content. Then one day the large organization decides to change the rules about how that information is shared, giving the organization more rights–to the point where it sounds to some people like the organization is trying to claim ownership of the users’ content, rather than simply hosting it and making it available on the web.

A small but vocal and influential group of users object to the policy change. The organization protests that it isn’t their intent to fundamentally change their relationship with their users and that legal documents tend to sound scarier than they really are. Most customers are either unaware or unconcerned by the change in policy, but the outcry continues until the organization backs down a bit, sticking with the old policy for the time being. The future, though, is up in the air.

Facebook? Or OCLC?

Perfect, just perfect.

Labels: facebook, oclc, open data, steve lawson

Monday, December 22nd, 2008

LCSH.info, RIP

LCSH.info, Ed Summers’ presentation of Library of Congress Subject Headings data as Linked Data, has ended. As Ed explained:

“On December 18th I was asked to shut off lcsh.info by the Library of Congress. As an LC employee I really did not have much choice other than to comply.”

I am not as up on, or as enthusiastic about, Ed’s Semantic-Web intentions as some, but the open-data implications are clear: the Library of Congress just took down public data. I didn’t think things could get much worse after the recent OCLC moves, but this is worse. The Library of Congress is the good guy.

Jenn Riley put it well:

“I know our library universe is complex. The real world gets in the way of our ideals. … But at some point talk is just talk and action is something else entirely. So where are we with library data? All talk? Or will we take action too? If our leadership seems to be headed in the wrong direction, who is it that will emerge in their place? Does the momentum need to shift, and if so, how will we make this happen? Is this the opportunity for a grass-roots effort? I’m not sure the ones I see out there are really poised to have the effect they really need to have. So what next?”

The time has come to get serious. The library world is headed in the wrong direction. It’s wrong for patrons—and taxpayers. And it’s wrong for libraries.

By the way, Ed, we’re recruiting library programmers. The job description includes wanting to change the world.

See also: Panlibus.

Labels: library of congress, open data

Thursday, August 7th, 2008

A million free covers from LibraryThing

A few days ago, just before hitting thirty million books, we hit one million user-uploaded covers. So, we’ve decided to give them away—to libraries, to bookstores, to everyone.

The basics. The process, patterned after the Amazon.com cover service, is simplicity itself:

  1. Take an ISBN, like 0545010225
  2. Put your Developer Key and the ISBN into a URL, like so:
    http://covers.librarything.com/devkey/KEY/medium/isbn/0545010225
  3. Put that in an image tag, like so:
    <img src="http://covers.librarything.com/devkey/KEY/medium/isbn/0545010225">
  4. And your website, library catalog or bookstore has a cover.

Easy details. Each cover comes in three sizes. Just replace “medium” with “small” or “large.”

As with Amazon, if we don’t have a cover for the book, we return a transparent 1×1 pixel GIF image. So you can put the cover-image on OPAC pages without knowing if we have the image. If we have it, it shows; if we don’t, it doesn’t.
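For automated use, that placeholder behavior is easy to test for in code. Here’s a minimal sketch in Python; the `is_placeholder` helper, the `fetch_cover` function, and the `MYKEY` value are our own illustrations, not part of any official client library. It relies on the fact that a GIF stores its width and height as little-endian 16-bit integers at byte offsets 6–9, so the “no cover” 1×1 GIF can be recognized without rendering it.

```python
# Sketch: fetch a LibraryThing cover by ISBN and detect the 1x1 GIF
# placeholder. is_placeholder() and fetch_cover() are illustrative names.
import struct
import urllib.request

COVER_URL = "http://covers.librarything.com/devkey/{key}/{size}/isbn/{isbn}"

def is_placeholder(data):
    """True if the bytes are a 1x1 GIF (the 'no cover' response)."""
    if len(data) < 10 or data[:3] != b"GIF":
        return False
    # GIF logical screen width/height: little-endian shorts at offsets 6-9.
    width, height = struct.unpack("<HH", data[6:10])
    return (width, height) == (1, 1)

def fetch_cover(isbn, key="MYKEY", size="medium"):
    """Return cover image bytes, or None if only the placeholder came back."""
    url = COVER_URL.format(key=key, size=size, isbn=isbn)
    data = urllib.request.urlopen(url).read()
    return None if is_placeholder(data) else data
```

With a check like this, an OPAC or store page can fall back to its own “no cover” graphic instead of rendering an invisible pixel.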

The Catch? To get covers, you’ll need a LibraryThing Developer Key—any member can get one. This puts a top limit on the number of covers you can retrieve per day—currently 1,000 covers. In fact, we only count a request when a cover is made from the original, so our actual limit will be much higher. We encourage you to cache the files locally.

You also agree to some very limited terms:

  • You do not make LibraryThing cover images available to others in bulk. But you may cache bulk quantities of covers.
  • Use does not involve or promote a LibraryThing competitor.
  • If covers are fetched through an automatic process (e.g., not by people hitting a web page), you may not fetch more than one cover per second.
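An automated process can satisfy both the caching suggestion and the one-per-second rule with a small wrapper. This is a sketch, not part of the service itself: `ThrottledFetcher` is our own illustrative class, and `download` stands for any callable that actually retrieves cover bytes.

```python
# Sketch of a polite automated fetcher: an in-memory cache plus a
# one-request-per-second throttle between live requests.
import time

class ThrottledFetcher:
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between live requests
        self._last = 0.0                  # time of the last live request
        self._cache = {}                  # isbn -> bytes

    def fetch(self, isbn, download):
        """Serve repeats from cache; throttle calls to download(isbn)."""
        if isbn in self._cache:
            return self._cache[isbn]
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)  # keep to at most one live request per second
        self._last = time.monotonic()
        data = download(isbn)
        self._cache[isbn] = data
        return data
```

In a real deployment you would cache to disk rather than memory, so covers survive restarts and never need re-fetching.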

You will note that unlike the new API to our Common Knowledge data, you are not required to link back to LibraryThing. But we would certainly appreciate it.

Caveats. Some caveats:

  • At present only about 913,000 covers are accessible, the others being non-ISBN covers.
  • Accuracy isn’t guaranteed–this is user data–and coverage varies.
  • Some covers are blurrier than we’d like, particularly at the “large” size. This is sometimes about original files and sometimes about our resizing routines. We’re working on the latter.

Why are you doing this? The goal is half promotional and half humanitarian.

First, some background. This service “competes” with Amazon’s cover service, now part of Amazon Web Services. Amazon’s service is, quite simply, better. They have far more covers, and no limit on the number of requests. By changing the URL you can do amazing things to Amazon covers.

The catch is that Amazon’s Terms of Service require a link-back. If you’re trying to make money from Amazon Affiliates, this is a good thing. But libraries and small bookstores have been understandably wary about linking to Amazon. Recent changes in Amazon’s Terms of Service have deepened this worry.

Meanwhile, there are a number of commercial cover providers. They too are probably, on average, better. But they cost money. Not surprisingly many libraries and bookstores skip covers, or paste them in manually from publisher sites.

That’s too bad. Publishers and authors want libraries and bookstores to show their covers. Under U.S. law showing covers to show off books for sale, rental or commentary falls under Fair Use in most circumstances. (We are not lawyers and make no warrant that your use will be legal.) We’ve felt for years that selling covers was a fading business. Serving the files is cheap and getting cheaper. It was time for someone to step up.*

So we’re stepping up. We’re hoping that by encouraging caching and limiting requests, we can keep our bandwidth charges under control. (If it really spikes, we’ll limit new developer keys for a while; if you submit this to Slashdot, we will be Slashdotted for sure!) And it will be good for LibraryThing—another example of our open approach to data. Although none of our competitors do anything like this—indeed our Facebook competitors don’t even allow export although, of course, they import LibraryThing files!—we think LibraryThing has always grown, in part, because we were the good guys—more “Do occasional good” than “Do no evil.”

If we build it, they will come. If the service really picks up, we’re going to add a way for publishers, bookstores and authors to get in on it. We’d be happy to trade some bandwidth out for what publishers know—high-quality covers, author photos, release dates and so forth. We’ve already worked with some publisher data, but we’d love to do more with it.


*In the past, we had been talking to the Open Library project about a joint effort. We even sent them all our covers and a key to the identifiers that linked them. But nothing came of it. To some extent that was our fault, and to some extent not. (I think they and we would differ on the blame here.) In any case, I was tired of the time and transactional friction, and wanted to try a different approach.

Labels: apis, book covers, covers, open data

Tuesday, August 5th, 2008

Open Shelves Classification: Welcome Laena and David

Back in July I proposed the Open Shelves Classification (OSC), a new, free, crowdsourced replacement for the Dewey Decimal System. I also created a group to start in on the project.

The proposal included a call for a volunteer to lead the group. I was happy to write the software, and members would create the OSC, but someone with a library degree was needed to shepherd the project and make the occasional tough decision.

I’ve found two: the LIS team of Laena McCarthy and David Conners. It turns out, I already knew them. Abby and I met with Laena and David, back at ALA 2007, when they were MLS students doing a joint LibraryThing-related project called Folksonomies in Action. They impressed us then. It was extraordinary to talk to librarians with a deep understanding and creative take on the ideas LibraryThing was exploring. Since then Laena and David have started promising careers as librarians and professors. So, after receiving word they were interested in the project, we are only too happy to bring them on.

Laena M. McCarthy (user: laena). Laena is currently an Assistant Professor and Image Cataloger at the Pratt Institute in NYC. Her bio contains the priceless bit:

“Previously, she worked in Antarctica as the world’s Southernmost librarian, where she provided a remote research station with access to information. She incorporated into the library the first permanent art gallery in Antarctica.”

Laena’s teaching and research focus on the application of bottom-up, usability-centric design and collaboration. She is currently researching image tagging, FRBR for works of art & architecture, and information architecture. Her work has been published in Library Journal and the forthcoming Magazines for Libraries 2008.

In her free time, among other things, she can be found making jam, competing in food competitions, scuba diving and writing.

David Conners (user: conners). David is the Digital Collections Librarian at Haverford College in Pennsylvania. At Haverford, David works to make the College’s unique materials, such as the first organized protest against slavery in the New World, available online. He also oversees the College’s oral history program and the audio component of Special Collections exhibits such as “A Few Well Selected Books.”

David’s research interests include subject analysis, FRBR, and, occasionally, doped ablators. His work has been published in Library Journal, The Serials Librarian, and Physics of Plasmas.

The torch is passed! From this point on, it’s their project to direct. But we’re in agreement on their role: They aren’t royalty, they’re facilitators. They’re there to listen and to encourage conversation. They’re there to guide things toward consensus. They’re there to see the project stays on track and true to its goals. They’re there to propose forking the project or moving it elsewhere, if that’s what it needs and the community wants it.

Laena and David are doing this for fun and interest. As a fun side-project with no financial component—OSC is by definition public domain in every respect—we can’t pay them. But we’ve promised to help pay their way to LIS conferences, if someone wants them to talk about it. (At least one group already does.) And there’s the hope that, if OSC can accomplish its goals, they will have helped create something highly beneficial for libraries and library patrons everywhere.

If you’re interested in the project, come join the group and find out more.

Labels: Dewey Decimal Classification, open data, open shelves classification, osc

Thursday, February 21st, 2008

Taxation without web presentation

The Library of Congress recently signed a deal to accept 3 million dollars worth of “technology, services and funding” from Microsoft towards building a new website powered by Microsoft’s Silverlight plug-in. I (Casey) usually leave the blogging to Tim, but I’ve got to say something about this.

Microsoft, in general, is very good to libraries, and libraries are very good to them. Microsoft gets huge tax breaks for donating software licenses — something that doesn’t really cost them a thing — and libraries get software they couldn’t afford otherwise.

This is a different beast, however. It sounds like Microsoft technologies will be used from the ground up — if you use Microsoft’s Silverlight to do the front-end, your developers pretty much have to use Visual Studio and Microsoft languages, your database admins have to use MS SQL Server, and your systems admins have to use Windows and IIS. In any case, it seems unlikely that Microsoft would consult on a project and not recommend you use Microsoft as much as possible.

Once you’re locked in to the entire Microsoft stack, you pretty much can’t change a single piece without completely redoing your entire IT operation from top-to-bottom. When the free deal expires or you need new servers, you end up having to buy new Microsoft licenses and software. It’s like giving somebody a kitten for a present — they’ll still be paying for and cleaning up after your gift 10 years from now.

Most disturbingly, users are locked in, too: anybody using an iPhone, an old version of Windows, any version of Linux, or any other operating system or device not supported by Silverlight will be unable to use the Library of Congress’ new website. How is that compatible with the principles of democracy or librarianship? It’s taxation without web presentation. And how exactly is that a quantum leap forward? (If the LOC really wanted to make a quantum leap, it would open up its data.)

Giant package deals are the wrong way to make both technical and business decisions about software; it doesn’t matter who’s doing the packaging, or how. You should be able to use the best operating system for the job, the best database for the job, and the best programming language for the job. You should be able to hire developers and systems administrators, not Microsoft developers and Windows administrators, and should give them the freedom to use the best solution, not the Microsoft solution. Sometimes the Microsoft solution is best, sometimes it isn’t, but that’s something that shouldn’t be dictated unilaterally.

I take comfort when I see one of our competitors looking to hire Microsoft developers instead of software developers, for reasons the hacker/entrepreneur Paul Graham explained well:

“If you ever do find yourself working for a startup, here’s a handy tip for evaluating competitors. Read their job listings. Everything else on their site may be stock photos or the prose equivalent, but the job listings have to be specific about what they want, or they’ll get the wrong candidates.”

“During the years we worked on Viaweb I read a lot of job descriptions. A new competitor seemed to emerge out of the woodwork every month or so. The first thing I would do, after checking to see if they had a live online demo, was look at their job listings. After a couple years of this I could tell which companies to worry about and which not to. The more of an IT flavor the job descriptions had, the less dangerous the company was. The safest kind were the ones that wanted Oracle experience. You never had to worry about those. You were also safe if they said they wanted C++ or Java developers. If they wanted Perl or Python programmers, that would be a bit frightening– that’s starting to sound like a company where the technical side, at least, is run by real hackers. If I had ever seen a job posting looking for Lisp hackers, I would have been really worried.”

But it’s disappointing to see an institution you respect, admire, and fund with your tax dollars going down that same road. It’s even more disappointing because the Library of Congress does make smart decisions about technology. They announced another major project a few months back that took an entirely different approach to selecting the tools they would use. The people behind the World Digital Library sat down and thought about the best tools for the job, and they came up with an interesting and eclectic list: “python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware”. Those tools are free, open-source, designed with developer productivity in mind, aren’t tightly linked to each other, and don’t inherently limit who can access your website. That’s what should matter.

Labels: library of congress, microsoft, open data, open source

Friday, February 15th, 2008

Take our files, raw.

Short. Here’s a page of our raw graphics files. If you find that fun, have some. If you make an interesting change, all the better.

Long. We believe in openness. But openness is a process. It’s not so much that openness is difficult or painful;* it’s that openness is non-obvious. You don’t see each successive layer until you remove the one above it.

Since the site started, we’ve enjoyed kibitzing about how it should look. We’d talk about layout and design. We’d throw up an image and sit back for reactions. Occasionally a user would get inspired and post what they thought something should look like. We just concluded a great exchange about the new “Author” and “Legacy” badges. Members helped us refine the wording and the colors enormously.

Open, right? But wait! Why didn’t we post our raw images for members to play with, if they wanted? You can talk about a GIF, but that’s like asking people to have conversations about a prepared speech.

Frankly, until now, I never even thought of the idea. I’ve never heard of a company that did it. And although it happens on open source projects, it’s not universal. The Open Library project, for example, is a model of openness. You can download both code and data; but you won’t find any design files on the site.

So, why not? We don’t lose trademark or copyright by posting a raw Photoshop file, with layers and alternate versions, any more than we lose them by posting GIFs and JPEGs. What is the potential downside? Just in case there’s any confusion, we’ve posted a notice about copyright and trademark, but also granted explicit permission to make changes and blog about them.

So, here’s a wiki page for us to post our raw graphics files, and users to view, edit and remix them. It’s a very selective list so far, mostly because I started with what was lying around on my desktop.**

More, much deeper openness coming next week…


*Although maintaining the “What I did today?” page proved too much work, and it helps that I have very thick skin for most criticism.
**There’s a side-benefit to putting all the files up on the wiki. Last time I lost my hard drive I lost almost no work—it’s all up on the “cloud” these days—except for my Photoshop files.

Labels: love, member input, open data, openness

Tuesday, December 11th, 2007

Open data and the Future of Bibliographic Control

We’ve got until December 15th to submit comments on the draft report produced by the Working Group on the Future of Bibliographic Control.

No—keep reading! This is important. People in the library profession need to be involved in this stuff. Further, people outside the profession need to be involved too. As the report notes, library data is used by many outside the library world, starting with library patrons, and extending even to Amazon.com. It shouldn’t go unnoticed, for example, that the draft report mentions LibraryThing four times. For while LibraryThing uses library data, it was invented by and is mostly used by non-librarians.

Aaron Swartz, the dynamo behind Open Library, sent me a note about one important aspect of the draft report, namely what it’s missing: It doesn’t mention open data. There is serious discussion about sharing, but also the alarming proposal that the LC attempt to recoup more money from the sale of its data. That’s a shame. I’m not alone in believing that open access to library data is the future. A report about the future should confront the future.

The economy of library records is a complex one but not primarily a free one. By and large libraries pay the Dublin, Ohio-based OCLC for their records, even if the records were created at government expense. That model looks increasingly dated. And it is killing innovation.

It hasn’t killed LibraryThing yet, but the specter has always hung over our head. It’s why LibraryThing has—so far—not pitched itself to small libraries. OCLC doesn’t care about personal cataloging, and the libraries we use are—in every conversation I’ve had—enthusiastic about what we do. They want their data out there; they’re libraries for Pete’s sake! But if we offered data to public libraries we’d be cutting into the OCLC profit model. That could be dangerous.

Aaron invited me to sign onto a list of people interested in the issue. I did so. I invite you—any of you—to do so as well. The text says it perfectly:

“Bibliographic records are part of our shared cultural heritage and should be made available to the public for re-use without restriction. This will allow libraries to share records more efficiently, but will also make possible more advanced online sites for book-lovers, easier analysis by social scientists, interesting visualizations and summary statistics by journalists and others, as well as many other possibilities we cannot predict in advance.”

“Government agencies and public institutions are increasingly making data open. We strongly encourage the Library of Congress to join this movement by recommending that more bibliographic data is made available for access, re-use and re-distribution without restriction.”

The petition is here: http://www.okfn.org/wiki/OpenBibliographicData .

Labels: library of congress, open data, open library, Working Group on the Future of Bibliographic Control

Saturday, September 22nd, 2007

Magical Thinking at Harvard

A Babylonian Demon Bowl (Kelsey Museum)

“Know the secret name of something and you control it,” is an extremely ancient idea, stretching as far back as the Sumerians, and running through subsequent Mesopotamian, Egyptian and Greco-Roman magic. The secrecy of the name was critical to its power, and to the mystique of those who knew it. One suspects it also helped their hourly rates.

Its modern equivalent is the “unique identifier.” Information is available as never before, but its sheer quantity limits discovery. Unique identifiers cut through the clutter. And they can be powerful. Let the wrong person know your Social Security Number and you’ll be in a world of hurt as great as a malevolent spirit caught by a name under a Babylonian demon bowl.

In the legal world the equivalent is the West American Digest System, which numbers court cases for lawyers. Although the cases are invariably in the public domain, the numbers that identify them are not. And controlling “the only recognized legal taxonomy” gives its creator, West Publishing, a valuable monopoly.

In the book world, it’s the ISBN. Know a book’s title and you can find yourself adrift in a sea of editions. Discover its ISBN and you’ve got it for sure. Type the ISBN into BookFinder or Abebooks.com and you have a panoply of new and used sellers.

Although assigned by private firms, ISBNs will never go the way of the West American Digest System. But their power explains why the Harvard Coop* has taken to ejecting customers who attempt to write down ISBNs. As reported in the Crimson, this is exactly what happened to one Harvard student, Jarret A. Zafra. In another (?) incident, reported by the Herald, the Coop called the police on three more ISBN-scribblers.** When asked about the policy, Coop administration told the Crimson that it “considers that information the Coop’s intellectual property.”

The IP claim is hogwash. ISBNs are facts. Under US law facts can’t be copyrighted. The Coop is probably within its rights to expel whomever it wants, but that won’t stop people from trying. The three students above were volunteers for a site called CrimsonReading.org, which is compiling a complete list of all books used at Harvard. When a Harvard student types in an ISBN, CrimsonReading connects them to new and used booksellers. Affiliate revenues go to charity. By calling on volunteers and getting Harvard professors involved, CrimsonReading is getting around the Coop’s magical secrecy. Three cheers to them for doing it.

We need more projects like CrimsonReading. Much the same idea was behind my Google Book Search Search bookmarklet, which asked volunteers to collect Google Book Search IDs. In this case, the unique identifier was new and more secret. By giving its scans unique—and effectively secret—numbers, Google is creating a whole new bibliographic identification scheme. And where ISBNs cover only about thirty years of books, Google’s IDs are designed to cover every book printed, including millions in the public domain.

Control the name and you control the thing. It’s what WestLaw is doing. It’s what the Coop is trying to do.

Is it what Google is doing? I’m not sure. And I don’t see any signs of it happening yet. For example, sellers on used book sites are not using Google Book IDs to nail down editions. But the danger is there.

Secret and proprietary numbering systems pose a serious challenge to the benign potential of the internet. When secrecy or obscurity is used against this potential, people need to act up—and break the spell.


*Always pronounced “coop,” not “coöp.” Full disclosure: My parents belong to the Coop, which is a true “cooperative” in organization. This means they share in the annual dividend according to how much they spend there. So I’m working against them!
**I grew up near Harvard Square, and the Coop was one of my haunts. (It’s a general-purpose bookstore as well.) Quite a few of my friends were expelled from the Coop for shoplifting. If CrimsonReading really wants to get the job done, it should enroll the private-school street urchins of the Square in the ISBN game.

Labels: google book search, harvard coop, isbns, open data, westlaw

Monday, July 16th, 2007

Open Library

The word is finally out about Open Library, the Internet Archive’s open cataloging project:

http://demo.openlibrary.org/

It’s too late in the evening to get into what it’s about. You can read about it. But I can tell you it’s a big deal. Open Library is going to change book data forever. It’s not clear to me how all the ideas will shake out—the wiki idea will be a particularly hard sell to many in the library world!—but I know this: the genie is out of the bottle. Book data is opening up.

It’s a relief to talk about it. I was one of the people at the first meeting too, and, before that, I had some role in developing one of the central ideas—an open source alternative to OCLC, building from the LC records.* I missed a second meeting, and I ticked off some people with my insistence that Open Library be developed openly as well. In retrospect, I was too hard on them.

Well, it’s all out now, and it’s wide open. The developers are eager to find out what you think. You can download the code. Congratulations to Brewster Kahle, Aaron Swartz and the rest for bringing Open Library so far so fast.

I can’t wait to see where it takes us.


*From my email, it looks like Casey Bisson had this idea around the same time as I did. Either way, I never went beyond talking, and Casey pushed it forward. (See this Talis podcast.) I don’t know what his role in the final product was, but he deserves a big share of the praise.

Labels: internet archive, open data, open library

Thursday, April 12th, 2007

WorldCat: Think locally, act globally

OCLC just announced a “pilot” of WorldCat Local. In essence, WorldCat Local is OCLC providing libraries with an OPAC.

That’s the news. Here’s the opinion. Talis’s estimable Richard Wallis writes:

“Yet another clear demonstration that the library world is changing. The traditional boundaries between the ILS/LMS, and library and non-library data services are blurring. Get your circulation from here; your user-interface from there; get your global data from over there; your acquisitions from somewhere else; and blend it with data feeds from here, there and everywhere is becoming more and more a possibility.”

I think this is exactly wrong. OCLC isn’t creating a web service. They’re not contributing to the great data-service conversation. They’re trying to convert a data licensing monopoly into a services monopoly. If the OCLC OPAC plays nice with, say, the Talis Platform, I’ll eat my hat. If it allows outside Z39.50 access I’ll eat two hats.

They will, as the press release states, “break down silos.” They’ll make one big silo and set the rules for access. The pattern is already clear. MIT thought that its bibliographic records were its own, but OCLC shut them down when they tried to act on that. The fact is, libraries with their data in OCLC are subject to OCLC rules. And since OCLC’s business model requires centralizing and restricting access to bibliographic data, the situation will not improve.

As a product, WorldCat Local will probably surpass the OPACs offered by the traditional vendors. It will be cleaner and work better. It may well be cheaper and easier to manage. There are a lot of good things about this. And—lest my revised logo be misunderstood—there are no bad people here. On the contrary, OCLC is full of wonderful people—people who’ve dedicated their lives to some of the highest ideals we can aspire to. But the institution is dependent on a model that, with all the possibilities for sharing available today, must work against these ideals.

Keeping their data hidden, restricted and off the “live” web has hurt libraries more than we can ever know. Fifteen years ago, libraries were where you found out about books. One would have expected that to continue on the web–that searching for a book would turn up libraries alongside bookstores, authors and publishers.

It hasn’t worked out that way. Libraries are all-but-invisible on the web. Search for the “Da Vinci Code” and you won’t get the Library of Congress–the greatest collection of books and book data ever assembled–not even if you click through a hundred pages. You do get WorldCat, seventeen pages in!

The causes are multiple, and discussed before. But a major factor is how libraries deal with book data, and that’s largely a function of OCLC’s business model. Somehow institutions dedicated to the idea that knowledge should be freely available to all have come to the conclusion that knowledge about knowledge—book data—should not be, and traditional library mottos like Boston’s “Free to All” and Philadelphia’s Liber Libere Omnibus (“Free books for all!”) have given way to:

“No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

We now return you to our regularly-scheduled blogging.

Labels: library of congress, oclc, open data, worldcat local