Archive for July, 2007

Thursday, July 26th, 2007

Internet Archive wants book-loving systems engineer

In the spirit of fraternal concern, I post that the Internet Archive is looking for a systems engineer with PHP experience for their book-scanning project. (Also they promised to send us their discards. We need one too.)

The help-wanted has some excellent provisions:

  • Love and respect for books; pride and care in your work
  • Not afraid of terabytes

If I had the skills, I’d be tempted to take it. The Internet Archive is a great institution. The people are great, and they have the best office space ever. LibraryThing’s second-story apartment steps from the Portland waterside pales in comparison. They have this adorable jewel-box in San Francisco’s Presidio, with the Golden Gate Bridge right outside the window.

Labels: internet archive, jobs

Wednesday, July 25th, 2007

Is LibraryThing making you fat?

New England Journal of Medicine article–with fancy animation–explains how social networks cluster… by weight. (Hat tip: David Weinberger)

Labels: Uncategorized

Wednesday, July 25th, 2007

Free copies of Everything is Miscellaneous

I’ve flogged David Weinberger’s Everything is Miscellaneous before, in blog posts and in my Library of Congress talk. I think it’s something close to the intellectual justification for LibraryThing.

Anyway, I flogged the ARC for so long that, when it came out, LibraryThing bought a small box of hardcovers direct from the publisher–to give out at conferences, to thank people for inviting me to talk, and so forth. I still have half a box. So I’m going to open it up to the whole LibraryThing community.

We’re going to give out ten copies. We like contests*—we have a Harry Potter book photo and another review contest going—so we’re going to make a contest of it.

I’ve created a thread Contest: What does tagging do to knowledge?

  • If you want the book, come there and say a word or two about tagging.
  • It doesn’t need to be a big deal. A few sentences with some examples would be fine.
  • You can talk about tagging on LibraryThing or other sites. You can do personal tagging, global tag pages, the new tagmash feature, David’s talks, my talk, Clay Shirky’s talk, your talk, or whatever.
  • You can say something positive, something negative or just ask an interesting question.
  • You can post as many messages as you want, but you don’t get more chances, duh.

I’m not going to pick winners. I’m just going to randomly pick ten members who left comments. But you can’t just say “I want a book.”


*No purchase necessary. Void where prohibited. Also void where discouraged, unseemly or tacky. We pay to mail it. You are responsible for taxes. My taxes.

Labels: Uncategorized

Tuesday, July 24th, 2007

Tagmash: Book tagging grows up

Tagmash: alcohol, history gets over the fact that almost nobody tags things history of alcohol

Short version: I’ve just gone live with a new feature called “tagmash,” pages for the intersections of tags. This is a fairly obvious thing to do, but it isn’t trivial in context. In getting past words or short phrases, tagmash closes some of the gap between tagging and professional subject classifications.

For example, there is no good tag for “France during WWII.” Most people just don’t tag that verbosely. Tagmash allows for a page combining the two: France, wwii. If you want to skip the novels, you can do france, wwii, -fiction. The results are remarkably good.

Tagmash pages are created when a user asks for the combination, but unlike a “search” they persist, and show up elsewhere. For example, the tagmash for France, Germany shows France, wwii as a partial overlap, alongside others. Related tagmashes now also show up on select tag and library subject pages, as a third system for browsing the limitless world of books.
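The core of the idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not LibraryThing's code: the `BOOKS` data and the `tagmash` function are invented for illustration, and a minus is treated as a hard exclusion, which is simpler than what the real feature does.

```python
# Hypothetical data: placeholder book id -> set of tags members applied to it.
BOOKS = {
    "occupation-novel": {"france", "wwii", "fiction"},
    "vichy-history": {"france", "wwii", "history"},
    "provence-memoir": {"france", "travel"},
    "pacific-war-history": {"wwii", "history"},
}


def tagmash(query, books=BOOKS):
    """Return ids of books carrying every plain tag and none of the minus tags."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    wanted = {t for t in terms if not t.startswith("-")}
    unwanted = {t.lstrip("-") for t in terms if t.startswith("-")}
    return sorted(
        book_id
        for book_id, tags in books.items()
        if wanted <= tags and not (unwanted & tags)
    )


print(tagmash("France, wwii"))
print(tagmash("france, wwii, -fiction"))
```

The sketch recomputes the result on every call, like a search. A real tagmash page is stored once someone asks for it, which is what lets it show up on other pages.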

Booooring? Go ahead and play a bit:

That’s the short version. But stop here and you’ll never know what Zombie Listmania is!

Long version. LibraryThing has shown some of the things that book tags are good for, such as plain language, genre fiction, capturing identity and perspective, academic schools, staying current and changing over time. (Details and examples in footnote.*)

It also demonstrates some of the weaknesses, including:

  1. Idiots
  2. Bad actors (spammers, racists, anarchists)
  3. “Personal” tags clouding the tagosphere with junk (eg., “at the beach house”)
  4. The lack of a “controlled” vocabulary results in ambiguous terms (eg., classics, leather, magic)
  5. Tags lack the detail and focus available to a hierarchical subject system like the Library of Congress Subject Headings (LCSH), eg.,
    Great Britain — History — Elizabeth, 1558-1603 — Fiction
    , or
    Jews — Italy — Bologna — Conversion to Christianity — History — 19th century**

As I’ve argued elsewhere and in my Library of Congress talk, problems 1, 2 and 3 are mitigated by having LOTS of tags. Idiocy, malice and personal junk fall out statistically. A tag here or there can’t be trusted, but a large body of tags in agreement is different.
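One way to picture "falling out statistically" is a simple agreement threshold. The sketch below is an assumption for illustration only: the `trusted_tags` name, the `(member, tag)` input shape and the cutoff of three members are all invented, and nothing here says how LibraryThing actually filters.

```python
from collections import defaultdict


def trusted_tags(taggings, min_members=3):
    """Keep only tags that at least `min_members` distinct members applied.

    `taggings` is an iterable of (member, tag) pairs for a single book.
    Returns a dict of tag -> number of distinct members who used it.
    """
    members_by_tag = defaultdict(set)
    for member, tag in taggings:
        members_by_tag[tag.strip().lower()].add(member)
    return {
        tag: len(members)
        for tag, members in members_by_tag.items()
        if len(members) >= min_members
    }


taggings = [
    ("ann", "history"),
    ("bob", "history"),
    ("cat", "History"),
    ("dan", "at the beach house"),  # personal tag, used by one member
    ("ann", "history"),             # repeats by one member count once
]
print(trusted_tags(taggings))
```

With only a handful of tags per book, a cutoff like this throws away nearly everything, which is why the argument depends on having lots of tags.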

Problems 4 and 5 are harder to tackle. Flickr has shown the way with one solution, statistical clustering. The screen shot below shows this–clusters of images related to the tag “bow.”

Some day–when I become a better programmer?–I’m going to try this on LibraryThing data. It will help with ambiguity: the secondary tags on the various meanings of “leather” are surely wildly divergent! But I suspect it separates better than it clarifies. Flickr supposes that tags fall into discrete clusters, but subjects interact with books in extremely complex ways. On a more basic level, I am suspicious of the too-quick resort to algorithms against user data.*** After all, if computers are so good at figuring out meaning, why were users necessary in the first place? It smacks of technological revanchism.

So, where Flickr’s clusters are automated, tagmash is a semi-automated process. LibraryThing does the statistics, but users decide what the meaningful clusters are. Some mashes are interesting and useful. Some aren’t. By and large, uninteresting clusters won’t last.****

This certainly helps with ambiguity. Take the problematic tag leather, which divides easily into tagmashes like:

Now let’s take the “focusing” power of hierarchy. As mentioned above, there is no good way to get at “france during wwii.” The tag Vichy covers some of the ground, but not enough. Tagmash provides an answer.

The book list is good, and a simple union gets around an imposed hierarchy. Looking at the related LCSHs, for example, one is left in doubt whether France is part of World War II, or World War II part of France—or what:

Of course, both trees are equally artificial. David Weinberger writes how, in the real world, a leaf can be on many branches. But it’s equally true that what’s trunk and what’s branch are largely about where you start–dirt or pinecone. Either way, branching happens. The order of the branches isn’t necessarily important.

Even as it borrows some of the virtues of subject classification, tagmash keeps the strengths of tagging. Subject systems are pre-built things. Now and then they get larger, but it takes deliberation and effort. What gets “blessed” is often surprising. I would have never predicted the unusually staid LCSH would have embraced:

But tagging has no limits. Think of the tagmash “erotica” and “zombies” and there it is. (Tagmash: erotica, zombies). Want to know what chick lit takes place in Greece? (Tagmash: chick lit, greece.) Young adult books involving horses? (Tagmash: horses, young adult.) Poems from or about San Francisco? (Tagmash: poetry, san francisco). Slavery in Brazil? (Tagmash: brasil, slavery.) Non-fiction books about Narnia? (Tagmash: narnia, -fiction.) The options are endless.

Of course, tagmash only narrows the gap. It doesn’t eliminate it. Tagmash: poetry, San Francisco still can’t distinguish between poetry about and poetry from San Francisco–it involves whatever is tagged “San Francisco” and that’s probably a mixed bag.***** Well-planned and carefully executed subject systems have strengths that no ad hoc, regular-person system can match.

Lastly—let there be no doubt—tagmash needs a very large quantity of tags to work. For tagmash after tagmash, the data is simply insufficient.

You’ve made it to Zombie Listmania! There are some obvious directions this can go:

  • The syntax can improve, for example to allow alternates (eg., humor, cats/dogs)
  • The syntax can include non-tag factors, such as formal subject headings (Tag: zombies, LCSH: love stories), languages, dates, authors and so forth.
  • The syntax can include weights (eg., Zombies 50%, vampires 50%, love stories 90%). Abby and I experimented with just such a system, creating algorithmic proxies for BISAC (bookstore) headings. It isn’t that hard to do.
  • Complex mashes could acquire titles and other metadata.
  • Users could follow a tagmash, and be alerted whenever new material enters the list.

Amazon calls its static, or dead, lists “Listmania.” All these tend to create a “Zombie Listmania,” lists of books that “won’t stay dead.” Instead, they change over time, as the underlying social and non-social data change. There’s no reason you couldn’t create “Zombie” versions of formal subject headings—a series of tags and other markers which approximated the content of a professionally-assigned subject heading.
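A weighted, self-updating list of this kind might look something like the Python sketch below. Everything in it is hypothetical: the function names, the `library` shape (book id mapped to tag counts) and the reading of a percentage as a multiplier on a tag's share of a book's tags are assumptions made for the example, not a description of the system Abby and I experimented with.

```python
def parse_weighted(query):
    """Parse 'zombies 50%, love stories 90%' into {'zombies': 0.5, 'love stories': 0.9}.

    Every term must end in a percentage; this sketch does no error handling.
    """
    weights = {}
    for part in query.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, pct = part.rpartition(" ")
        weights[name.strip().lower()] = float(pct.rstrip("%")) / 100.0
    return weights


def zombie_list(library, query, limit=10):
    """Rank books by weighted tag share, recomputed from current data on each call."""
    weights = parse_weighted(query)
    scored = []
    for book_id, counts in library.items():
        total = sum(counts.values())
        if total == 0:
            continue
        score = sum(w * counts.get(tag, 0) / total for tag, w in weights.items())
        if score > 0:
            scored.append((score, book_id))
    # Highest score first; ties broken alphabetically by book id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [book_id for _, book_id in scored[:limit]]


library = {
    "A": {"zombies": 5, "love stories": 5},
    "B": {"vampires": 10},
    "C": {"cooking": 4},
}
query = "zombies 50%, vampires 50%, love stories 90%"
before = zombie_list(library, query)

# The list "won't stay dead": new tagging data changes the next result.
library["D"] = {"love stories": 10}
after = zombie_list(library, query)
print(before, after)
```

Because the ranking is computed when asked for rather than stored, a newly tagged book joins the list without anyone editing it.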

Pretty cool idea, I think. We’ll see what we can do about it.

Details.

  • Tagmashes can be made from any tagmash or tag page. Just search for a tag or two or more tags with a comma between them. The URLs are the same: /tag/ plus a tag or tags separated by commas.
  • The weighting of tags is wiggly. We’re trying to get at both raw numbers of tags on an item and the relative salience (number divided by total number of tags), and then cross this data tag-by-tag. There is no obvious answer. In an ideal world, some tags would be about salience (eg., humor) and others would be thresholds (eg., fiction)–that is, when you’re looking for humor, fiction you want the funniest fiction, not the most fictional humor.
  • You can enter the tags in any order, but it will reformat your URL in alphabetical order, with the minuses at the end, such that “wwii, france” is the same as “france, wwii.”
  • A single minus (-fiction) “discriminates” against items tagged “fiction.” A double minus (--fiction) disqualifies all books with the fiction tag.
  • Tagmashes don’t get built until someone builds them. The first time can take a while to generate. There is currently no system to expire older or underused tagmashes.
  • UPDATE: I’m seeing a lot of part/whole tagmashes. These rarely work. When you search for “Einstein, science” or “Manet, art” you’re not doing much more than putting a statistical cramp on the smaller of the two tags—a few Manet books won’t have an art tag, and that will be the end of them. Tagmashes work with different things, not a thing and its category.
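To make the ordering and minus rules concrete, here is a small Python sketch. It is an illustration under assumptions, not the production code: `canonical_tagmash` and `score` are made-up names, the score uses salience only (how the real feature blends in raw counts is left out because the weighting is admittedly wiggly), and the single-minus penalty of "one minus the salience" is just one plausible way to "discriminate."

```python
def canonical_tagmash(query):
    """Normalize a query: lowercase, plain tags alphabetically, minus tags last."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    plain = sorted(t for t in terms if not t.startswith("-"))
    minus = sorted(t for t in terms if t.startswith("-"))
    return ",".join(plain + minus)


def score(tag_counts, query):
    """Score one book for a tagmash.

    `tag_counts` maps tag -> how many times members applied it to this book.
    Salience is a tag's count divided by the book's total tag count, and the
    saliences are multiplied together tag by tag.
    """
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    result = 1.0
    for term in canonical_tagmash(query).split(","):
        if term.startswith("--"):
            # Double minus: any use of the tag disqualifies the book.
            if tag_counts.get(term[2:], 0) > 0:
                return 0.0
        elif term.startswith("-"):
            # Single minus: penalize in proportion to the tag's salience.
            result *= 1.0 - tag_counts.get(term[1:], 0) / total
        else:
            result *= tag_counts.get(term, 0) / total
    return result


counts = {"france": 30, "wwii": 30, "fiction": 40}
print("/tag/" + canonical_tagmash("wwii, France"))
print(score(counts, "france, wwii"))
print(score(counts, "france, wwii, -fiction"))
print(score(counts, "france, wwii, --fiction"))
```

Multiplying saliences also shows why part/whole mashes disappoint: nearly every Manet book already has a high "art" share, so the second tag mostly just drops the few books that lack it.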

Footnotes!

*What’s good about tagging:

  • Tags use everyday terms (the tag cooking vs. the subject cookery)
  • Tags are great for genre fiction that subject systems can’t keep up with as fast or as well as their readers (chick lit, cyberpunk, paranormal romance)
  • Tags often encode subtleties that “controlled vocabulary” irons out (lgbt, glbt, queer, gay, homosexuality)
  • Tags capture identity and perspective that subject systems can’t or won’t (queer, glbt, lgbt, christian living)
  • Tags are good for schools of thought (intelligent design, austrian economics)
  • Tags respond quickly to change (hurricane katrina)
  • Tags “keep happening” in a way that systems like LCSH do not, getting added to books where LCSH misses the “first wave” of anything new (memetics, sociobiology)

**I’ve left out one problem, not covered at the LC: how “democratic” weighting can put Angela’s Ashes at the top of the Ireland tag. I want to write a blog post on the topic sometime. I think there are ways around it, and algorithmic solutions that nobody has really tried.

Aside: Much LIS anti-tagging polemic focuses on the most trivial of problems—spelling mistakes and “incorrect” tags. The former underestimates technology, the latter insults our intelligence. LibraryThing has dealt with the spelling problem, and has seen very few “wrong” tags. In fact, there are some serious problems with tagging. But you have to understand tags before you can see the problems, and many refuse to get past the idea that people will spell “white” wrong, or tag white horses as black.
***This is half formed. I have a problem with the reflexive “turn” from people-centered data to algorithms. I see this pattern again and again in software. Something transformative happens–something human. But it’s imperfect, so programmers conclude that programs will fix humans. In a way, it’s a reassertion of importance. More often, humans fix humans. To adapt David Weinberger, the answer to user-generated data is MORE user-generated data.
****Probably there’s got to be some system to expire unused clusters.
*****UPDATE: After turning the feature loose I watched what new tagmashes would be created. One was children, cooking. Should I call the police?

Labels: new feature, tagging, tagmash

Saturday, July 21st, 2007

My Library of Congress talk

The Library of Congress has just posted a talk I did there back in April, part of the Digital Future and You series.

I cover the basics of LibraryThing and some of what LibraryThing “means” to libraries, including a long section on tagging. It has a short section—a sermon, really—on open data, in anticipation of the launch of Open Library, and another on the upcoming Everything is Miscellaneous.*

To my regret, it ends abruptly. They didn’t include the 20+ minute Q&A**, which went a lot deeper on some of the interesting issues (particularly tagging), and with the nation’s top library talent!

Being asked to talk in front of the LC was a great honor. There aren’t many institutions I hold in higher regard. And it was fun. I got to be myself—PowerPoint-less, off-the-cuff and passionate–and was greeted warmly and given the benefit of the doubt when I pushed the limits. Also, I got to have lunch with some of their top people. It was a blast.


*The subtext of that section is that I just had a lunch conversation about open data, and heard more about the whys, wherefores and finances involved.
**Apparently they felt that they needed permission from everyone who appears on tape, and that the questions were not well miked.

Labels: Uncategorized

Saturday, July 21st, 2007

Facebook and the blink tag

Altay’s attempt to insert the CSS version of the old <blink> tag into our upcoming Facebook application produced this excellent reply from Facebook:

He was in fact kidding. Or so he says.

Labels: facebook

Thursday, July 19th, 2007

Yeah, me neither!

New dating site, WeNeither.com—find people through shared dislikes. It’s sort of like LibraryThing’s Unsuggester, except not a joke, and it might get you laid.*

Altay and I have been playing with it. So far, nobody hates books, thank God. Ditto Pandas and life. Altay discovered a soul mate who hates Harry Potter. Then he discovered he was otherwise friendless tonight.

*Unsuggester just had a data refresh. Go wild.

Labels: Uncategorized

Thursday, July 19th, 2007

Keen vs. Weinberger

For those who haven’t seen it yet: The Wall Street Journal published the full text of a debate between Andrew Keen (“Cult of the Amateur”) and David Weinberger (“Everything is Miscellaneous”) on Web 2.0. Available here.

And why I love Weinberger:

“When I say the Web is us, I don’t mean that it’s an aggregation of individuals — a herd of screeching monkeys or a scurry of voiceless cockroaches running from the light. We’re connected, primarily through talk in which we show one another what we find interesting in the world. That’s essential to the Web. The Web is only a web because we’re building links that say “Here’s something worth your time, and here’s why.” It’s a little act of selflessness in which a person who has our attention directs it elsewhere.”

Labels: Uncategorized

Monday, July 16th, 2007

Open Library

The word is finally out about Open Library, the Internet Archive’s open cataloging project:

http://demo.openlibrary.org/

It’s too late in the evening to get into what it’s about. You can read about it. But I can tell you it’s a big deal. Open Library is going to change book data forever. It’s not clear to me how all the ideas will shake out (the wiki idea will be a particularly hard sell to many in the library world!), but I know this: the genie is out of the bottle. Book data is opening up.

It’s a relief to talk about it. I was one of the people at the first meeting too, and, before that, I had some role in developing one of the central ideas—an open source alternative to OCLC, building from the LC records.* I missed a second meeting, and I ticked off some with my insistence that Open Library be developed openly as well. In retrospect, I was too hard on them.

Well, it’s all out now, and it’s wide open. The developers are eager to find out what you think. You can download the code. Congratulations to Brewster Kahle, Aaron Swartz and the rest for bringing Open Library so far so fast.

I can’t wait to see where it takes us.


*From my email, it looks like Casey Bisson had this idea around the same time as I did. Either way, I never went beyond talking, and Casey pushed it forward. (See this Talis podcast.) I don’t know what his role in the final product was, but he deserves a big share of the praise.

Labels: internet archive, open data, open library

Friday, July 13th, 2007

Library 0.1

Discovered while looking for a good source of Armenian-language bibliographic data (from the Russian National Library).

Is your catalog online? Yes, our catalog is online.

Actually, the sad thing is, having scans of all the catalog cards isn’t THAT much worse than today’s online catalog systems (OPACs). Anyway, at least I can link to the page and assume the link will work an hour from now.

Labels: armenian, opacs