Sunday, 19 June 2011

Culturomics, Big Data, Code Breakers and the Casaubon Delusion

Suddenly it seems as if 'big data' humanities is all the crack, with quantitative biologists and mathematicians diving in where previously only historians, literary critics and linguists dared to swim. Digital humanists have been slowly engineering a new field from history and linguistics (aided and abetted by library science) for over a decade, gradually building new bodies of evidence, and road testing new methodologies. But in just the last year or so, the biologists and mathematicians, with Google's help, have stolen a march on all their puny efforts. In particular, it seems that Science and Nature have fallen head over heels in love with 'culturomics' and the enthusiasms of Erez Lieberman Aiden and Jean-Baptiste Michel, and their Google ngram viewer. To read the most recent issue of Nature is to be confronted with a heady mix of big science and gushing Hello Magazine prose that works to mythologise the new 'science' of culturomics and its creators. It feels like the birth of a myth and of a brand.


This is all rather wonderful, and I am a huge fan of the Google ngram viewer, and the playful way it allows scholars and students to engage with the 'infinite archive' of inherited texts.  I think Aiden and Michel (and Google) have done the humanities a huge service.   But their real achievements do not quite explain the cloud of hyperbole that seems to be rising around them.


And this made me wonder what is really at issue here. What is it about culturomics that turns on the reporters from Nature? At its heart, the use of word frequency with a reasonably sized (if problematic) data set simply provides one more form of evidence to be added to all the rest. Knowing that the term 'electricity' peaks between 1870 and 1900 is useful evidence, but does not provide either an explanation for why, or a description of how it is being used. Historians will no doubt look this particular gift horse in the mouth, and worry at the condition of its teeth; but they will also happily use the ngram viewer as one more component in a complex landscape of evidence. This use may be delayed by the peculiar lack of any guidance on how to cite the results of a search, but it will be normalised in due course.


But simply providing a new body of evidence is not what seems to get Nature going. Instead, it is the claim that the ngram viewer lays the basis for a new 'science', and that the results make other forms of historical analysis redundant. In the words of Aiden and Michel, somehow this data is uniquely available for 'scientific purposes', in contrast to other forms of evidence.

It is not, therefore, the mechanics of the ngram viewer that is at issue. Instead it is the underlying intellectual paradigm that Aiden and Michel bring to its use. They appear to claim to be able to read history from the patterns the ngram viewer exposes - to decipher significant patterns from the data itself. Their great party tricks (and they are particularly impressive in live performance) include reducing the decline of irregular verbs to a describable mathematical pattern, an equation, and measuring the rise of 'celebrity' by the number of times an individual is mentioned in print. These imply that all historical development can, like irregular verbs, be described in mathematical terms, and that 'human nature', like the desire for fame, can be used as a constant to measure the changing technologies of culture.

In some respects, we have been here before. In the demographic and cliometric history so popular through the 1970s and 80s, extensive data sets were used to explore past societies and human behaviour. The aspirations of that generation of historians were just as ambitious as are those of the parents of culturomics. But demography and cliometrics started from a detailed model of how societies work, and sought to test that model against the evidence, revising it in light of each new sample and equation.

The difference with culturomics is that there is no pretence to a model. Instead, its practitioners will simply seek to discover patterns in the entrails of human speech, hoping to find the inherent meanings encoded there. What I think the scientific community finds so compelling is that, as in quantitative biology and DNA analysis, Aiden and Michel are using one of the controlling metaphors of 20th-century science, 'code breaking', and applying it to a field that has hitherto resisted the siren call of analytical positivism.


Since the 1940s the notion that 'codes' can be cracked to reveal a new understanding of 'nature' has formed the main narrative of science. With the re-description of DNA as just one more code in the 1950s, wartime computer science became a peacetime biological frontier (cashing in on big pharma, as military expenditure declined). That Aiden comes from a background in DNA analysis should clue us in to the fact that culturomics is an attempt to apply the same kind of code breaking to human society as a whole.



I strongly suspect that the project will fail, just as naive readings of DNA as a code for life have largely failed to fulfil their promise. But much more importantly, this attempt to repurpose a 'scientific' approach to historical analysis simply misunderstands the function of history itself. These large-scale visualisations of language may be the raw material of history, the basis for an argument, the foundation for a narrative, the evidence put in the appendix in support of a subtle point, but they do not serve as a work of history.

Historians interpret the past to the present.  They marshal evidence and use all the tools of genre writing to allow a modern reader to engage with the past.  And the questions they ask are not driven by the evidence, but by the needs of a modern society.  Gender history, the history of sexuality, and of race, have been created by two generations of historians not because the archives are groaning under the weight of relevant evidence, but because our society needs to understand the role of these forces in the present.  The fundamental flaw with culturomics is that it assumes that history is about the past; that what historians seek to achieve is an ever more accurate description of everything.  Instead, it is about the present.  Ironically, Aiden and Michel have rediscovered the 'Casaubon delusion'; and believe, like George Eliot's tragic figure, that they can create a new 'Key to all Mythologies'.   They need to listen to the Dorotheas of this world.

7 comments:

Ernesto said...

All I can say is "bravo". And thank you.

jbmichel said...

Hey Tim -

Interesting thoughts. It's great to see people diving into the discussion, and especially how much the discussion has changed in only six short months!

A few comments, in three parts.

---
You wrote:
"This use may be delayed by the peculiar lack of any guidance on how to cite the results of a search, but it will be normalised in due course."

Actually, in the "About the Ngram Viewer" (http://ngrams.googlelabs.com/info) section, we write: 

"If you're going to use this data for an academic publication, please cite:

Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)"

jbmichel said...

(Part 2 of 3)

You wrote:
"At its heart, the use of word frequency with a reasonably sized (if problematic) data set simply provides one more form of evidence to be added to all the rest...But simply providing a new body of evidence is not what seems to get Nature going.  Instead, it is the claim that the ngram viewer lays the basis for a new 'science', and that the results make other forms of historical analysis redundant."

We don't understand why this straw-man keeps coming up, as we have been unambiguous about this point in the past: n-gram results are a new form of evidence. They do not make any extant method of historical analysis redundant. Here are a few primary sources. 

From the paper:

"Culturomic results are a new type of evidence in the humanities."

From the very first line of the Culturomics FAQ:

"1. Is this supposed to replace close reading of texts?
Absolutely not. Anyone who has appreciated the work of a great artist - say, Shakespeare - or an insightful scholar - say, Michael Walzer's Exodus and Revolution - couldn't possibly think that quantitative approaches can replace close reading. 

Quite the opposite is true: quantitative methods can be a great source of ideas that can then be explored further by studying primary texts.

2. How does this relate to other methods in the humanities?
Our hope is that the culturomic approach will be able to supplement existing techniques."

From the current Nature piece:

"
A.
And yet [Erez] doesn’t think that the old approaches will ever disappear. “I think you should use the best methods available — and all of them,” he says. “And I think that includes carefully reading texts and trying to get behind what authors think.”

B.
…[Erez] tells the story of Isaac Casaubon, a sixteenth-century Protestant scholar, who undermined the presumed Egyptian provenance of a set of religious texts by identifying a reference to a Greek play on words — something that could only have been written hundreds of years later. “That point is as objective an interpretive remark as any remark a scientist might make,” says Lieberman Aiden. “So the methods of humanists are very, very formidable. And I think the degree of insecurity they have over whether these methods are here to stay is not really befitting.”
"

We think these texts make it unambiguously clear that we have absolutely no intention of replacing existing methods. There is no viable alternative reading of our statements in this area, and the attribution of this attitude to us is simply incorrect - a Casaubon delusion.

Our goal is no more than to enable such data to provide - in your exact words - "one more form of evidence to be added to all the rest".

---

jbmichel said...

(Part 3 of 3)

You wrote:
"Historians interpret the past to the present.  They marshal evidence and use all the tools of genre writing to allow a modern reader to engage with the past.  And the questions they ask are not driven by the evidence, but by the needs of a modern society...The fundamental flaw with culturomics is that it assumes that history is about the past..."

We make absolutely no such assumption. We are agnostic about what motivates a person's questions. What interests us is the process by which scholars 'marshal evidence'. 

Data is evidence. Just as those who read Akkadian can skillfully marshal Akkadian primary sources, those with quantitative skills can skillfully marshal data. Our goal is to contribute data and methods so that this new form of evidence can thrive.

All the best,
Jean-Baptiste Michel
Erez Lieberman Aiden

Elijah Meeks said...

I wish the DH community could be as openly critical of work done by self-identified DH practitioners as they are of Culturomics.

Arno Bosse said...

Elijah: bingo.

Tim Hitchcock said...

Dear Jean-Baptiste, Thanks for your comments on the post. On the issue of citations - the point I was trying to make was that there are no directions for citing a graph generated using the ngram viewer, as opposed to citing your articles. This is a wider problem than just with the ngram viewer, and until humanists figure out how to cite a search and its results in a repeatable and credible manner, we will be practising increasingly poor scholarship.

On the wider issue of your own and Erez's engagement with the humanities and history, the situation is somewhat out of your control, as the science press has emphasised those aspects of your work that imply a new and newly 'scientific' approach. This is an emphasis that has a powerful appeal to those given to a crude factology, and reflects the continuing distance between approaches to knowledge. There is clearly a sizeable chasm between your own practice and how it is represented. But while you remain at the stage of demonstrating a methodology, rather than using it to write history, the issue of how your practice works with other and older forms of scholarship will remain - and is worth interrogating.

Let me also re-iterate how much I admire your work.

What I very much hope will emerge in due course is some great history that uses your techniques and methodologies to evidence society's understanding of the past. And I very much hope that you and Erez will be the ones to write it.

My experience of the Digital Humanities community as a whole is that it tends to enthuse over new tools and ways of visualising data, without being sufficiently concerned to critique the usefulness and purpose of the wider project, or to relate that project to the functions of humanist and social science scholarship. In many respects the criticisms I have made of your project are just as true of the wider Digital Humanities community, and form a topic with which I am continually struggling in my own work.

All the best, Tim