Our interdisciplinary digital humanities seminar watched the TED talk by Erez Lieberman Aiden and Jean-Baptiste Michel on “What we learned from 5 million books,” and students’ reactions ranged from worry about the future of the humanities to admiration for the technology behind so-called “culturomics.”
Because we had just read Jerome McGann’s Radiant Textuality, some students considered Aiden and Michel’s work the “antithesis of McGann.” But let’s start with the positive reactions. The n-gram tool lets historians play with concepts, especially at the beginning of a research project or even at its end. It takes little time to get results, it maps cultural trends through time, and, most importantly, it puts a tool for humanistic inquiry in the hands of the people. Indeed, it does more than that: the n-gram attempts to bring the principle of verification into the humanities. Beyond these potential advantages, the n-gram also marks an important public moment: here, at long last, appears an obvious way to use computers for the humanities. For this, Aiden, Michel, and their collaborators deserve our thanks.
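To make concrete what the tool is actually computing, here is a minimal sketch in Python of the underlying arithmetic: the count of a phrase in a given year divided by the total number of words published that year. The counts below are invented placeholders, not figures from the Google Books corpus.

```python
# Hypothetical raw counts of a phrase, by year of publication.
phrase_counts = {1860: 1200, 1900: 400, 1960: 950, 2000: 300}

# Hypothetical total word counts for the corpus in each year.
total_words = {1860: 5_000_000, 1900: 9_000_000, 1960: 20_000_000, 2000: 45_000_000}

def relative_frequency(counts, totals):
    """Normalize raw phrase counts by corpus size, year by year."""
    return {year: counts[year] / totals[year] for year in counts}

for year, freq in sorted(relative_frequency(phrase_counts, total_words).items()):
    print(f"{year}: {freq:.2e}")
```

Plotting those normalized frequencies against time is, in essence, the graph that appears on screen during the talk.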
Students, however, raised a host of concerns about this talk and its implications. The most critical of these suggested that Aiden and Michel’s presentation was at best naive and at worst misleading. We do not know what data are included in their “5 million books.” Are citations, endnotes, bibliographies, and tables of contents included? How does the tool handle words with multiple meanings, such as “nature”? Cutting 7 million books from the corpus raises questions about what is left out and how the results might be skewed. With only English-language sources in the corpus, the results are obviously skewed. Indeed, since many of my students had just attended a lecture by Timothy Snyder on “Bloodlands: Europe between Hitler and Stalin,” they were intensely aware of how the choice of language sources can vastly distort historical interpretation. One student summed up the implications of Aiden and Michel’s presentation: “Five million books is our culture.” And this struck these graduate students as ridiculously naive.
Others were more critical of the presentation for its positivism and reductionism. The n-gram sets up a “results-oriented” approach: it had to be awesome, it had to be practical, it had to be computable. Having just read John Seely Brown and Paul Duguid’s The Social Life of Information, students thought that Aiden and Michel suffered from technological “tunnel vision.” The example of “slavery” and its spikes in the 1860s and 1960s seems strikingly obvious. Indeed, so does the example of Marc Chagall. Historians and art historians, after all, know much about censorship and about Chagall. Here too, the n-gram seems like a blunt instrument, a rake, in a field that requires complex and intricate observation with a magnifying glass. Yet here too we see the most promising aspect of Aiden and Michel’s work. In developing the n-gram tool, they have begun to trace the patterns of something like “censorship,” at least as its characteristics appear in such a voluminous record of texts. Now, historians have other sources to verify whether a society practiced censorship, and it seems unlikely that the n-gram will find examples of censorship that we did not already know about. But the idea that a pattern in huge corpora of text might signal a particular social or legal framework seems promising and innovative. In fact, the potential elucidated in the Chagall censorship example may be the most exciting aspect of the TED talk and of the whole concept of “culturomics.”
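In the talk, the Chagall example rests on comparing the artist’s trajectory in the German-language record against his trajectory in English between 1933 and 1945. One way to formalize that kind of comparison, offered here only as a sketch and not as the authors’ actual suppression index, is to take a name’s average frequency in the suspect period relative to a baseline period and see how the ratio differs across corpora. The frequencies below are invented placeholders.

```python
def period_mean(freq_by_year, start, end):
    """Average relative frequency over an inclusive range of years."""
    values = [f for year, f in freq_by_year.items() if start <= year <= end]
    return sum(values) / len(values)

def suppression_ratio(freq_by_year, suspect=(1933, 1945), baseline=(1920, 1932)):
    """Frequency in the suspect period relative to the baseline period."""
    return period_mean(freq_by_year, *suspect) / period_mean(freq_by_year, *baseline)

# Hypothetical relative frequencies of "Marc Chagall" by year (placeholders).
english = {y: 2.0e-8 for y in range(1920, 1946)}                      # roughly flat
german = {y: (2.0e-8 if y < 1933 else 2.0e-9) for y in range(1920, 1946)}  # drops after 1933

print("English:", round(suppression_ratio(english), 2))   # close to 1.0
print("German :", round(suppression_ratio(german), 2))    # well below 1.0
```

A ratio near 1 in one corpus alongside a ratio far below 1 in another is the kind of pattern the talk points to; as the students noted, it still needs corroboration from other sources before it counts as evidence of censorship.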
Brown and Duguid point to the many aspects of documents that cannot be reduced to n-grams: the trace of vinegar on paper, for instance, which can still signal a cholera epidemic to the historian who encounters it years later. We will have to wait and see whether Aiden and Michel can advance their tools to account for other languages, say, or to answer Roy Rosenzweig’s prescient question, “will abundance bring better or more thoughtful history?”