Friday, December 17th, 2010

Romeo and Juliet, with—Get your mind out the gutter!

Today Google released its Books Ngram Viewer, a remarkable statistical snapshot of the books in Google Books. The New York Times did a nice piece on it.

So I went to work on it. My guess was that, like much else with Google Books, the data was ratty. I didn’t have to look far. At first glance this chart appears to show that “fuck” had a remarkable early history, being more popular in 1725 than even today! (link)

Don’t get too excited. A quick search on the phrase in books between 1700 and 1800 treed the cause:

Yes, Google can’t tell an f from an ſ, the “f without a bar” more properly known as a long, descending or medial s. To the disappointment of many, Shakespeare wrote “suck’d.” The effect pops up all over. Here’s a graph of “crimſon” vs. “crimson.” If nothing else we can now follow the demise of the ſ with precision.
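For the curious, here is a rough Python sketch of how one might flag words that were probably victims of the f/ſ mix-up. It is only an illustration under stated assumptions, not anything Google actually does: the word list `KNOWN_WORDS` and the helpers `normalize_long_s` and `long_s_candidates` are made up for this example, and a real check would need a full dictionary. It relies on one typographic fact, that the long s was not used at the end of a word, so a final f is left alone.

```python
from itertools import combinations

# Tiny stand-in word list (an assumption); a real check would use a full dictionary.
KNOWN_WORDS = {"crimson", "suck", "sucked", "search", "curse", "stall", "first"}

def normalize_long_s(text):
    """Fold a correctly recognized long s (U+017F) into a plain s."""
    return text.replace("\u017f", "s")

def long_s_candidates(word):
    """Return known words obtained by reading some non-final f's as long s's."""
    # The long s did not appear at the end of a word, so skip the last letter.
    positions = [i for i, ch in enumerate(word[:-1]) if ch == "f"]
    found = set()
    for r in range(1, len(positions) + 1):
        for combo in combinations(positions, r):
            chars = list(word)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate in KNOWN_WORDS:
                found.add(candidate)
    return found

print(normalize_long_s("crim\u017fon"))   # a properly read long s
print(long_s_candidates("crimfon"))       # a long s misread as f
print(long_s_candidates("first"))         # a genuine f, nothing to suggest
```

A word like “first” yields no suggestions because “sirst” is not in the list, which is the point: a dictionary lookup is a crude but serviceable way to separate real f's from misread ſ's.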

There’s no question this is a cool tool. But given Google’s grand ambitions and how common s is in English, it’s a pretty startling lapse.

Labels: google, google book search, humor

9 Comments:

  1. Mark says:

    Not the first people to be caught by the long S! Wonder if they have been caught out by any of the ligatures.

  2. I can see the ‘fuck’ graph but not the ‘crimson’ graph.

  3. Caffron says:

    Try librarian versus online. Very informative.

  4. David Starner says:

    One of my fellow volunteers for Project Gutenberg who does the final cleanup on a lot of the long-s books swears that they used suck in the long-s days a lot more than now. Lo and behold, http://ngrams.googlelabs.com/graph?content=fuck%2Csuck%2Csucked&year_start=1500&year_end=2000&corpus=0&smoothing=3 seems to back that up.

  5. Circeus says:

    Given that many of the NLP people at Google worked with him, I think Frederick Jelinek would be disappointed.

  6. Andrew says:

    It’s intriguing to see with these graphs how the long-s variant is only really dominant in the eighteenth century:

    curse, curfe; stall, ftall; search, fearch.

    Is this an early shift in usage, or is there something systematically odd about the pre-1650 corpus? Perhaps Google’s algorithm is better at identifying some ſs than others, in different typefaces?

  7. Andrew says:

    …and a quick followup; it seems much better at parsing ſ as s in the older material. Interesting! I wonder if there’s something about their OCR software that anticipates this in older texts, and just has the thresholds set wrong.

  8. David Starner says:

    It’s interesting that it does recognize the long-s in some situations. But I will note that there is something systematically odd about the pre-1650 corpus; most of it is modern reprints that either don’t use the long-s, or use it in much clearer printing than pre-1800 printing ever was.

  9. Susanna says:

    The curse of the long s strikes again! My mom and I sell some reprints of old chapbooks…we’re always having to explain the change in s, especially since two of the chapbooks are kids’ books…