Monday, February 26th, 2007

Wikipedia citations, with feed

Update: Changed feed URL.

I’ve added a cool new feature, building on some work by library programmer Lars Aronsson: Wikipedia citations on all work pages. That is, work pages now list all the Wikipedia articles that cite the work. The data is also available in feed form.

Here’s how it goes. At the top of the page for J. F. C. Fuller’s A Military History of the Western World, it lists how many citations there are, with a link:

And, down below, it shows all the articles:

How I did it. Basically, I did a complete run through the Wikipedia dump files (source), parsing out anything that looked like an ISBN and checking whether it really is one. It’s pretty easy. So it sees:

Fuller, J.F.C. A Military History of the Western World. Three Volumes. New York: Da Capo Press, Inc., 1987 and 1988. — v. 1. From the earliest times to the Battle of Lepanto; ISBN 0-306-80304-6: 255, 266, 269, 270, 273 (Trajan, Roman Emperor).

and gets the ISBN. I’ve started in on the harder problem, parsing books without ISBNs, like:

Bowersock, G.W. Roman Arabia, Harvard University Press, 1983.

It’s not actually that hard. But it’s fiddly. And it’s one of those problems where each additional percent of accuracy costs 50% more effort.
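The ISBN step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually run against the dumps: the regular expression and the function names (`extract_isbns`, `is_valid_isbn10`, `is_valid_isbn13`) are assumptions made for the example. It pulls out anything that looks like an ISBN and keeps it only if the standard check digit works out.

```python
import re

# Hypothetical pattern: the word "ISBN", then a run of digits, hyphens and
# spaces ending in a digit or X. It is deliberately loose; validation below
# weeds out the false positives.
ISBN_CANDIDATE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")


def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10 check: weights 10..1, sum divisible by 11, final X counts as 10."""
    if len(digits) != 10 or not digits[:9].isdigit():
        return False
    if not (digits[9].isdigit() or digits[9] == "X"):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits))
    return total % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return the normalized ISBNs in text that pass a check-digit test."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[\- ]", "", match.group(1)).upper()
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            found.append(digits)
    return found


citation = ("Fuller, J.F.C. A Military History of the Western World. "
            "ISBN 0-306-80304-6: 255, 266, 269, 270, 273")
print(extract_isbns(citation))
```

A citation without an ISBN, like the Bowersock one, simply yields an empty list here, which is why the non-ISBN case needs a separate and fiddlier parser. One known weakness of this loose pattern: an ISBN followed directly by a space and more digits can be read as one overlong candidate and rejected.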

What are the most cited books? The most cited book on Wikipedia is… The Official Pokemon Handbook. Surprised? Don’t be. In fact, eighteen of the top twenty most-cited works are Pokemon books. It boggles the mind. Somebody, or a bunch of somebodies, went ISBN-happy on all the Pokemon entries. Fortunately, the existence of so many citations to Pokemon does not impair the quality of the rest. It’s just… Wikipedia. There’s a decidedly quirky character to many of the other winners, testimony to some serious passions. Number 28, with 177 citations, is Richard Grimmett’s Birds of India, Pakistan, Nepal, Bangladesh, Bhutan, Sri Lanka and the Maldives. I think this effect would be diminished a lot if non-ISBN books were added.

Where did this come from? I owe the idea to Lars Aronsson, who came up with a simple script and ran it against the Wikipedia dumps and posted the results on Web4Lib back in September. I wrote him soon after to see if he was going to provide a public data feed, or if he minded if I did. He did not. His results differed a bit from mine. I’ll be in touch with him to square the differences.

Unfortunately, the Wikipedia data is not updated as often as one might like. The most recent dump is from November of last year. I’ll keep an eye on the download page, and reparse the data when a new dump becomes available.

What’s this about a feed? We’re big fans of openness. And it’s Wikipedia data anyway. So we’ve made a feed of it. You can get it here:

http://www.librarything.com/feeds/WikipediaCitations.xml.gz

UPDATE: I changed the URL and gzipped it. Needless to say, I’m not putting any restrictions on this, but if you do something cool, I’d love to hear about it.
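Since the feed is gzipped XML, reading it takes two steps: decompress, then parse. Here is a minimal sketch using only the Python standard library. The post doesn’t describe the feed’s layout, so the element names in the sample (`citations`, `work`, `article`) are made up for illustration, and the helper names `load_feed` and `count_tags` are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def load_feed(gz_bytes: bytes) -> ET.Element:
    """Decompress gzipped XML bytes and return the root element."""
    return ET.fromstring(gzip.decompress(gz_bytes))


def count_tags(root: ET.Element) -> Counter:
    """Count how often each element name occurs, as a first look at an unknown schema."""
    return Counter(el.tag for el in root.iter())


# Stand-in data with an invented layout; the real feed may look quite different.
sample = gzip.compress(
    b"<citations><work isbn='0306803046'>"
    b"<article>Trajan</article></work></citations>"
)
root = load_feed(sample)
print(root.tag, dict(count_tags(root)))
```

Counting element names first is a cheap way to learn the structure of a feed you haven’t seen before, before writing anything that depends on it.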

As usual, tell me what you think.

*We’ve seriously considered open-sourcing LibraryThing. But given the state of the code, it would be, as Nabokov said of rough drafts, like passing around samples of our sputum. We may open-source pieces of the code, namely the pieces we’re happiest about.
**LibraryThing is in the odd position of having almost as much bot traffic as we have person traffic. Google loves us. Guys, you love us too much!

3 Comments:

  1. Tom Morris says:

    This blog post says “Needness (sic) to say, I’m not putting any restrictions on this,” but the page with the feeds (http://www.librarything.com/wiki/index.php/LibraryThing_APIs) says “non-commercial use.” Which is correct?

    Also, the link to here from that page is broken (citation misspelled).

  2. Hello,

    We would like to download a dump of the thingsAPI, to avoid spamming your servers too much, but the feeds links are broken. Could you provide new ones?

    Thanks!
    Arthur
