Monday, 30 January 2012

Academic History Writing and the Headache of Big Data

By way of a preface

The post that follows is formed from the text of a presentation I am due to deliver at King's College London on 9 February, but which reflects what I was worrying about in early December 2011 - several months after I wrote the synopsis that was used to advertise the talk, a month before I attended the AHA conference in Chicago with its extensive programme on digital histories, and six weeks before I got around to reading Stephen Ramsay's Reading Machines: Toward an Algorithmic Criticism.  Both listening to the text mining presentations at the AHA, and thinking about Ramsay's observations about computers and literary criticism, have contributed to moving me on from the text below.  In particular, Ramsay's work has encouraged me to remember that history writing has always been more fully conceived by its practitioners as an act of 'creation', and as a craft in its own right, than has literary criticism (which has more fully defined itself against a definable 'other' - a literary object of study).  As a result, I found myself fully in agreement with Ramsay's proposal that digital criticism should '...channel the heightened objectivity made possible by the machine into the cultivation of those heightened subjectivities necessary for critical work' (p.x).  But I was most struck by his conclusion that the 'hacker/scholar' had moved camps from critic to creator (p.85).  It made me remember that even the most politically informed and technically sophisticated piece of digital analysis only becomes 'history' when it is created and consumed as such.  This made me reflect that we have the undeniable choice to create new forms of history that retain the empathetic and humane characteristics found in the old generic forms; and simply need to get on with it.  In the process I have concluded that the conundrums of positivism with which this post is concerned are in many ways a canard that distracts from crafting purposeful history.

 


Academic History Writing 
and the Headache of Big Data 

Titles and synopses for presentations such as this one are, in the nature of things, written before the paper itself, and they reflect what you were thinking about at the time.  My problem is that I keep changing my mind.  I try to dress this up as serious open-mindedness – a constant engagement with a constantly changing field – but in reality it is just a kind of inexcusable intellectual incontinence, which I am afraid I am going to force you all to witness this afternoon. 

I promised to spend the next forty minutes or so discussing research methodologies, historical praxis and the challenge of ‘big data’; and I do promise to get there eventually.  But first I want to do something deeply self-serving and self-indulgent that nevertheless seemed to me a necessary pre-condition for making any serious statement about both the issues raised by recent changes in the technical landscape, and how ‘Big Data’, in particular, will impact on writing history – and whether this is a good thing.

And I am afraid, the place I need to start is with some thirteen years spent developing online historical resources.

Unlike a lot of people working in the digital humanities, in collaboration with Bob Shoemaker, I have pretty much controlled my research agenda and the character of the projects I have worked on from day one.   This has been a huge privilege for which I am hugely grateful, but it means that there has been an underlying trajectory embedded within my work as a historian and digital humanist.  This agenda has been continuously negotiated with Bob Shoemaker, whose own distinct agenda and perspective has also fundamentally shaped the resulting projects, and more recently with Sharon Howard; and has been informed throughout by the work of Jamie McLaughlin who has been primarily responsible for the programming involved.  But, the websites I have helped to create were designed with our historical interests and intellectual commitments as imperatives.  And as such they incorporate a series of explicit assumptions that have worked in dialogue with the changing technology.  In other words, the seven or eight major projects I have co-directed are, from my perspective at least, fragments of a single coherent research agenda and project. 

And that project is about the amalgamation of the Digital Humanities with an absolute commitment to a particular kind of history: ‘History from Below’.  They form an attempt to integrate the British Marxist Historical Tradition, with all the assumptions that implies about the roles of history in popular memory, and community engagement, with digital delivery.  In the language of the moment, they are a fragment of what we might discuss as a peculiar flavour of ‘public history’.  And what I feel I have discovered in the last five or six years, is that there is a fundamental contradiction between the direction of technological development, and that agenda – that ‘big data’ in particular, and history from below don’t mix.

We started with the Old Bailey Proceedings – not because it was a perfect candidate for digitisation (who knew what that looked like in 1999), but because it was the classic source for ‘history from below’ and the social history of eighteenth-century London, used by Edward Thompson and George Rudé. 









  • 125 million words of trial accounts
  • 197,745 trials reflecting the brutal exercise of state power on the relatively powerless.
  • 250,000 defendants, and 220,000 victims.

A constant and ever-changing litmus test of class and social change.

The underlying argument – in 1999 – was that the web represented a new public face for historical analysis, and that by posting the Old Bailey Proceedings we empowered everyone to be their own historian – to discover for themselves that landscape of unequal power.  By 2003, when we posted the first iteration of the site – and more as a result of the creation of the online censuses rather than the Old Bailey itself – the argument had changed somewhat to a simple acceptance of the worth and value of a demonstrable democratisation of access to the stuff of social history.

The site did not have the explicit political content of Raphael Samuel’s work or Edward Thompson’s, but it both created an emphasis on the lived experience of the poor, and gave free public access to the raw materials of history to what are now some 23 million users.

And it is important to remember at this point what most academic projects have looked like for the last decade, and the kinds of agendas that underpin them.  If you wanted to characterise the average academic historical web resource, it would be a digitisation project aimed at the manuscripts of a philosopher or ‘scientist’.  Newton, Bentham, the Philosophes, or founding fathers in the US; most digital projects have replicated the intellectual, and arguably rather intellectually old-fashioned end, of the broader historical project.  Gender history, the radical tradition, even economic and demographic history have been poorly represented online – despite the low technical hurdles involved in posting the evidence for demographic and economic history in particular.

The importance of the Old Bailey therefore was simply to grab an audience for the kind of history that I wanted people to be thinking about – empathetic, aware of social division and class, and focused on non-elite people.  And to do so as a balance to what increasingly seems to me to be the emergence of a very conservative notion of what historical material looked like. 

The next step – the creation of the London Lives website – was essentially driven by the same agenda, with the explicit addition that it should harness that wild community of family historians, and their wild interest in the history of the individual, to create an understanding of individuals in their specific contexts – of building lives, by way of building our understanding of communities, and essentially, of social relationships.




 
  • 3.5 million names,
  • 240,000 pages of transcribed manuscripts reflecting social welfare and crime,
  • and a framework that allowed individual users to create individual lives, that could in turn be built into micro-histories.
This was social history online – the stuff of a digital history from below.

This hasn’t garnered quite the same audience, or had the same impact as the Old Bailey Online (it does not contain the glorious narrative drama inherent in a trial account), and the history it contains is just harder work to make real.
 
But, from my perspective, the character and end of the two projects were absolutely consistent.  Designed around 2004 (and completed in 2010), in some respects London Lives was a naïve attempt to make crowdsourcing an integral part of the process – though not in order to get work done for free (which seems to be the motivation for applying crowdsourcing in a lot of instances), but more as a way of helping to create communities of users, who in turn become both communities of consumers of history, and communities of creators, of their own histories.

Around the same time as London Lives was kicking off, starting in 2005, and in collaboration with Mark Greengrass, we began to experiment with Semantic Web methodologies, Natural Language Processing, and a bunch of Web 2.0 techniques – all of which were driven in part by the engagement of people like Jamie McLaughlin, Sharon Howard, Ed McKenzie and Katherine Rogers at the Humanities Research Institute in Sheffield, and in part by the interest generated by the Old Bailey as a ‘Massive Text Object’ from digital humanists such as Bill Turkel.  In other words, during the middle of the last decade, the balance between the technology and its use as a mode of delivery began to shift.  We became more technically engaged with the Digital Humanities, and this began to create a tension with the historical agenda we were pursuing.

And as a result, it was around this point that the basic coherence of the underlying project became more confused.  Just as the demise of the Arts and Humanities Data Service in 2007 signalled the end of a coherent British digitisation policy (and the end of a particular vision of how history online might work), the rising significance of external technical developments began to impact significantly on our agenda, as we worked to amalgamate rapid technical innovation with the values and expectations of a public, democratic form of history.  In other words the technology began to overtake our initial and underlying purpose.

And the first upshot of that elision was the Connected Histories site:


  • 15 major web resources
  • 10 billion words
  • 150,000 images
All made available through a federated search facility.  Everything from Parliamentary Papers, to collections of ephemera and the British Museum’s collection of prints and drawings, were brought together and made keyword searchable through an abstracted index.  With its distributed API architecture and use of NLP to tag a wide variety of source types, it represented a serious application of what at the time were relatively new methodologies.
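By way of illustration, the abstracted-index idea can be sketched in a few lines of Python – query each source’s index for a keyword and merge the hits, tagged by source.  The sources, documents and identifiers below are invented for the purpose; this is a sketch of the general technique, not Connected Histories’ actual architecture or API:

```python
# Minimal sketch of a federated keyword search over an abstracted index.
# All sources and documents here are invented for illustration.

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical source collections, each with its own documents.
SOURCES = {
    "proceedings": {"t17000115-1": "theft of a silver watch"},
    "newspapers":  {"burney-042":  "a silver spoon stolen at noon"},
}

# One abstracted index per source.
INDEXES = {name: build_index(docs) for name, docs in SOURCES.items()}

def federated_search(term):
    """Query every source index and merge the hits, tagged by source."""
    term = term.lower()
    hits = []
    for source, index in INDEXES.items():
        for doc_id in sorted(index.get(term, set())):
            hits.append((source, doc_id))
    return hits
```

A query such as `federated_search("silver")` then returns hits from every participating collection at once – the essence of searching ‘everything from Parliamentary Papers to ephemera’ through one box.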

And unlike the previous sites, it was effectively driven by a changing national context, and by technology, and included a range of partners far beyond those involved in previous projects - most significantly Jane Winters and the Institute of Historical Research.  In part this project was driven by a critique of data ‘silos’, but more fundamentally, we saw it as an answer to the incoherence of the digitisation project as a whole, following the withdrawal of funding to the AHDS, and the closure of the Arts and Humanities Research Council’s Resource Enhancement Scheme.  It also formed an answer to the firewalls of privilege that were increasingly being thrown up around newspapers and other big digital resources – an important epiphenomenon of Britain’s mixed ecology of web delivery. In other words, while trying desperately to maintain a democratic model of intellectual access, we were forced to respond to a rapidly changing techno-cultural environment. 

In many respects, Connected Histories was an attempt to design an architecture, including an integral role for APIs, RDF indexes, and a comprehensive division between scholarly resources and front-end analytical functionality, that would keep the work of the previous decade safe from complete irrelevance.  At its most powerful, we believed the architecture would allow the underlying data to be curated, logged and preserved, even as the ‘front end’ grew tired and ridiculous. 

Early attempts to make the project automatic and fully self-sustaining through the use of crawlers and hackerish scraping methodologies fell by the wayside, as even the great national memory institutions, and commercial operations like ProQuest and Gale, signed up to the project. 
But we also kept the hope that Connected Histories would effectively allow democratic access (or at least a democratically available map of the online landscape) to every internet user.  There was no real, popular demand for this.  Google has frightened us all into believing there is an infinite body of material out there, so we can’t know its full extent.  But it seemed important to us that what the public has paid for should be knowable by the public.

And here is where the conundrums of ‘Big Data’ begin.   And these conundrums are of two sorts – the first simple and technical; and the second more awkward and philosophical.

By this time, two years ago or so, we had what looked like ‘pretty big data’, and the outline of a robust technical architecture that separated out academic resources from search facilities, both making the data much more sustainable and easily curated, and the analysis much more challenging and interesting. Suddenly, all the joys of data mining, corpus linguistics, text mining, network analysis and interactive visualisations beckoned.

And it is this latter challenging and exciting analytical environment that is so fundamentally problematic.  Because we had ‘pretty big data’, and the architecture to do something serious with it, we suddenly found ourselves very much in danger of excluding precisely the audience for history that we started out to address.   The intellectual politics of the projects (the commitment to a history from below), and the technology actually came into conflict for the first time – though this would only be apparent if you looked under the bonnet, at the underlying architecture and the modes of working it assumed.

One problem is that these new methodologies are and will continue to be reasonably technically challenging.  If you need to be command-line comfortable to do good history – there is no way the web resources created are going to reach a wider democratic audience, or allow them to create histories that can compete for attention with those created within the academy – you end up giving over the creation of history to a top down, technocratic elite.  In other words, you build in ‘history from above’, rather than ‘history from below’, and arguably privilege conservative versions of the past.  One way forward, therefore, lay in attempting to make this new architecture work more effectively for an audience without substantial technical skills. 

In collaboration with Matthew Davies and Mark Merry at the Centre for Metropolitan History and with the Museum of London Archaeological Service, we tried to do just this with Locating London’s Past.

  • Seventeen datasets
  • 4.9 million geo-referenced place names
  • 29,000 individually defined polygons.

But the main point is that it is a shot at creating the most intuitive front end version we could imagine of the sort of ‘mash up’ that the API architecture makes both possible, and effectively encourages.

In other words, this was an attempt to take what a programmer might want to achieve with an API, and put it directly into the hands of a wider non-technical public.  And we chose maps and geography as the exemplar data, and GIS as the best methodology, simply because, while every geographer will tell you maps are profound ideological constructs embedding a complex discourse, they are understood by a wider public in an intuitive and unproblematic way – allowing that public to make use of the statistics derivable from ‘big data’ in a way that intellectually feels like a classic ‘mash up’, but which requires little more expertise than navigating between stations on the London underground.

So arguably, Locating London’s Past is in a direct line from the Old Bailey, and London Lives – seeking to engage and encourage the same basic audience to use the web to make their own history – and to do so from below – to create a humane, individualistic, and empathetic history that contributes to a simple politics of humanism.

But it is not a complete answer, and the next project highlighted the problem even more comprehensively.  At the same time as we were working on Connected Histories and Locating London’s Past – by way of engaging that history from below audience, and making all this stuff safe for a democratic and universal audience – we were also involved with the first round of the Digging Into Data Programme, with a project called Data Mining With Criminal Intent.


The Data Mining with Criminal Intent project brought together three teams of scholars including Dan Cohen and Fred Gibbs from CHNM, and Geoffrey Rockwell and Stefan Sinclair of Voyant Tools, along with Bill Turkel from the University of Western Ontario, and Jamie McLaughlin from the HRI in Sheffield.  It was intended to achieve just a few things.  First, to build on that new distributed architecture to illustrate how tools and data in the humanities might be shared across the net – to embed an API methodology within a more complex network of distributed sites and tools; second, to create an environment in which some ‘big data’ might be made available for use with the innovative tools created by linguists for textual analysis; and finally, to begin to explore what kinds of new questions these new tools and architecture would allow us to ask and answer. 



To achieve these ends, we brought onto a single metaphorical page the Old Bailey material, the browser-based citation management system Zotero, and Voyant Tools – new tools for working with large numbers of words. 

Much of this was a simple working out of the API architecture and the implications inherent in separating data from analysis.  But, it also led me to work with Bill Turkel, using Mathematica to do some macro-analysis of the Old Bailey Proceedings themselves.

One of the interesting things about this is that, simply because we did it so long ago, rekeying the text instead of using an OCR methodology, the Proceedings are now one of the few big resources relating to the period before 1840 or so that are actually of much use for text mining.  Try creating an RDF triple out of the Burney Collection’s OCR and you get nothing that can be used as the basis for a semantic analysis – there is just too much noise.  The exact opposite is true of the Proceedings, because of their semi-structured character, highly tagged content, and precise transcription.  And at 127 million words, they are just about big enough to do something sensible. And where Bill and I ended up was with a basic analysis of trial length and verdict over 240 years, that allowed us to critique and revise the history of the evolution of the criminal justice system, and the rise of plea bargaining.  And we came to this conclusion through a methodology that I can only describe as ‘staring at data’ – looking open-eyed at endless iterations of the material, cut and sliced in different ways.  It is a methodology that is central to much scientific analysis, and it is fun.
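The shape of that analysis can be sketched in miniature – bucket trials by decade and compare mean trial length by verdict.  The records below are invented stand-ins; the real exercise ran over the full tagged Proceedings, in Mathematica rather than Python:

```python
# Hedged sketch of the trial-length-by-verdict exercise: group trials
# by decade and compute the mean word count for each verdict.
# The four records here are invented miniatures, not real trials.
from collections import defaultdict

trials = [
    {"year": 1715, "verdict": "guilty",    "words": 120},
    {"year": 1718, "verdict": "notGuilty", "words": 340},
    {"year": 1825, "verdict": "guilty",    "words": 900},
    {"year": 1829, "verdict": "notGuilty", "words": 2100},
]

def mean_length_by_decade(trials):
    """Mean trial length (in words) for each (decade, verdict) pair."""
    sums = defaultdict(lambda: [0, 0])  # (decade, verdict) -> [total, count]
    for t in trials:
        key = (t["year"] // 10 * 10, t["verdict"])
        sums[key][0] += t["words"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Staring at successive tables of this kind – cut by decade, verdict, offence – is precisely the ‘endless iterations of the material’ described above.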



But it is also where my conundrum comes in.  However compelling the process is, it does not normally result in the kind of history I do.  It is not ‘history from below’, it is not humanistic, nor is it very humane.  It can only rarely be done by someone working part time out of interest, and it does not feed in to ‘public history’ or memory in any obvious way.  The result is powerful, and intellectually engaging – it is the tools of the digital humanities wielded to create a compelling story that changes how we understand the past (which is fun); but it is a contribution to a kind of legal and academic history I do not normally write.

And the point is, that the kind of history created in this instance, is precisely the natural upshot of ‘big data’ analysis.  In other words, what has become self-evident to me, is that ‘big data’, and even ‘pretty big data’ inevitably creates a different and generically distinct form of historical analysis, and fundamentally changes the character of the historical agenda that is otherwise in place.  This may seem obvious – but it needs to be stated explicitly.

To illustrate this in a slightly different way, we need look no further than the doyens of ‘big data’; the creators of the Google Ngram viewer.


I love the Google ngram viewer, and it clearly points the way forward in lots of ways.  But if you look at what Erez Lieberman Aiden and Jean-Baptiste Michel do with it, its impact on the form of historical scholarship begins to look problematic.  Rather like what Bill Turkel and I did with the Old Bailey material, Lieberman Aiden and Michel appear to claim to be able to read history from the patterns the ngram viewer exposes - to decipher significant changes from the data itself.  Their usual examples include the analysis of the decline of irregular verbs to a precise mathematical equation, and the rise of 'celebrity' as measured by the number of times an individual is mentioned in print. 

These imply that all historical development can, like irregular verbs, be described in mathematical terms, and that 'human nature', like the desire for fame, can be used as a constant to measure the changing technologies of culture.  And that like the Old Bailey – we can discover change and effect through exploring the raw data.  And that once we do this, it will become newly available, in the words of Lieberman Aiden and Michel, for 'scientific purposes'.
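For what it is worth, the arithmetic behind an ngram curve is simple enough to sketch: a term's yearly count, divided by the total number of words printed that year.  The counts below are invented, and this is only the bare normalisation step, not Google's actual pipeline:

```python
# What an ngram curve plots, in miniature: a term's yearly count
# normalised by the total words printed that year.
# The yearly counts here are invented for illustration.

counts = {
    1800: {"fame": 2, "the": 500},
    1900: {"fame": 9, "the": 900},
}

def relative_frequency(term, counts):
    """Yearly share of all printed words accounted for by `term`."""
    return {year: words.get(term, 0) / sum(words.values())
            for year, words in counts.items()}
```

Everything else – the rise of 'celebrity', the decay of irregular verbs – is an interpretation laid over curves of exactly this kind, which is where the positivism creeps in.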

In other words, there is a kind of scientific positivism that is actively encouraged by the model of ‘big data’ analysis.  All the ambiguities of theory and structuralism, thick description and post-modernism are simply irrelevant.

In some respects, I have no problem with this whatsoever.  I have never been a fully paid up post-modernist, and put most simply, unlike a thorough-going post-modernist, I think we can know stuff about the past.

I do, however, have two particular issues. First, if I work towards a more big-data-like approach, I am forced to rework and rethink my own ‘public history’ stance.   I am no longer simply making material and empathetic engagement available to a wider audience; and therefore, the purpose of my labours is left open to doubt (by myself at the very least).  But second, I am being drawn into a kind of positivism that assumes that what will come out of the equations (the code breaking, to use the dominant metaphor of the last 60 years) is socially useful or morally valuable.

In a sense, what ‘big data’ encourages is a morality-free engagement with a positivist  understanding of human history.  In contrast, the core of the historical tradition has been focused on the dialogue between the present and the past, and the usefulness of history in creating a working civil society.  The lessons we take from the past are those which we need, rather than those which are most self-evident.  If the project of history I bought in to was politically and morally framed (and it was), the advent of big data challenges the very root of that project.

Of course, this should not really be a problem, if only because history has always been a dialogue between irrefutable evidence, and discursive construction (between what you find in the archive and what you write in a book).  And science and its positivist pretensions have always been framed within a recognised sociology of knowledge and constructed hermeneutic.

But, for me, I remain with a conundrum – how to turn big data into good history?  How do we preserve the democratic and accessible character of the web, while using the tools of a technocratic science model in which popular engagement is generally an afterthought rather than the point?

I really just want to conclude about there – with the conundrum.  For me, and for most of the digital humanities in the UK, the journey of the last fifteen years or so has been about access and audience – issues that are fundamentally unproblematic – which can be politically engaging and beautiful; and for this, one needs look no further than Tim Sherratt’s Invisible Australians project.




Even if you prefer your history in a more elite vein than I do, more people being able to read more sources is an unproblematic good thing, a simple moral good.  And arguably, having the opportunity to stare hard at data, and look for anomalies and weirdness, is also an unproblematic good. 

But, if we are now being led by the technology itself to write different kinds of history – the tools are shaping the product.  If we end up losing the humane and the individual, because the data doesn’t quite work so easily that way, we are in danger of giving up the power of historical narrative (the ability to conjure up a person and emotions with words), without thinking through the nature of what will be gained in exchange.  I am tempted to go back to my structuralist / Marxist roots and start ensuring my questions are sound before the data is assayed, but this seems to deny the joys of an open-eyed search for the weird.  I am caught between audience and public engagement, on the one hand, and the positivist implications of big data, on the other.

And I am left in a conundrum.   In the synopses I wrote back in October or so, I thought I would be arguing:  “that the analysis and exploration of 'big data' provides an opportunity to re-incorporate historical understandings in to a positivist analysis, while challenging historians to engage directly and critically with the tools of computational linguistics.”

The challenge is certainly there, but I am less clear that the re-integration of history and positivism can be pursued without losing history’s fundamental and humanist purpose.  For me, there remain big issues with big data; and a challenge to historians to figure out how to turn big data to real historical account.

Tuesday, 13 December 2011

Playing around with colour on Locating London's Past



Just in a spirit of playing around, and exploring large data sets without any preconceived questions or assumptions, I thought I would throw a few words at Locating London's Past and the Old Bailey dataset, and see if any patterns emerged. And it occurred to me that words for colour, when mapped on to eighteenth-century London, might come up more frequently in some parts of town than others - perhaps 'white' in neo-classical areas, and 'brown' or 'green' at the more rural boundaries. I am not sure that anything actually emerged, but it was fun to think about. The base measure against which you would want to compare these colour distributions would be all crime locations (34,000 or so) mapped by street.
ALL CRIME LOCATIONS, BY STREET
RED
BLUE
GREEN
BROWN
YELLOW
WHITE
BLACK

Is there a pattern there? I have not really got a clue, so I thought I would put together some combinations, just on the off chance, and following a naive assumption about how colour might work in an eighteenth-century urban context (where bright colours were expensive).



RED, BLUE, YELLOW

BLACK, WHITE


GREEN, BROWN


I was still not quite convinced, but thought I should have one last go with the data displayed as 'Large Blocks', and by further combining 'manufactured colours' and 'natural' ones.
RED, BLUE, YELLOW, BLACK, WHITE - LARGE BLOCKS



GREEN, BROWN - LARGE BLOCKS



Or finally, the same sets of results with the sets of colours subtracted from one another.

RED, BLUE, YELLOW, BLACK, WHITE, MINUS GREEN AND BROWN- LARGE BLOCKS

GREEN AND BROWN, MINUS RED, BLUE ETC - LARGE BLOCKS

THE TWO COLOUR SETS MINUS THEIR OPPOSITE OVERLAID ('MANUFACTURED' VS 'NATURAL')


Does this prove that 'manufactured' colours were more common in the West and East End, while 'natural' colours dominated in the northern and north-western suburbs?  No, it does not.  But it made me wonder.
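For anyone who wants to try the same game, the normalisation implied by that 'base measure' can be sketched as follows - colour-word hits per street, divided by the total number of crime locations on that street, so that busy streets do not swamp the map.  All the streets and figures below are invented for illustration:

```python
# Sketch of normalising colour-word hits against the baseline of all
# crime locations, street by street. All figures here are invented.

# Hypothetical baseline: total crime locations per street.
all_locations = {"Cheapside": 400, "Ratcliff Highway": 120}

# Hypothetical colour-word hit counts per street.
colour_hits = {
    "green": {"Ratcliff Highway": 12},
    "white": {"Cheapside": 60},
}

def normalised_rate(colour):
    """Colour mentions per crime location, for every baseline street."""
    hits = colour_hits.get(colour, {})
    return {street: hits.get(street, 0) / total
            for street, total in all_locations.items()}
```

It is the normalised rates, rather than the raw counts, that would let a 'green' cluster at the rural margin stand out against the sheer volume of trials in the centre.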

Monday, 12 December 2011

Playing with Locating London's Past

With colleagues at the University of Sheffield and the IHR, we launched a new web resource this morning that allows you to map some seventeen different large-scale datasets related to 18th-century London on to a GIS-compliant version of John Rocque's 1746 map of the capital - all in a Google Maps environment.  See www.locatinglondon.org.  I think it is very pretty and intuitive, but what I find most interesting about the site is that it allows you to explore a component of these datasets that we have hitherto done very little with - the spatial.  I don't know what is there yet, but I suspect I will have a good time finding out.  

My first thought was to play with a nice dichotomy in the data for the Old Bailey Proceedings - the published trial accounts for London, 1674-1819 (they continue to be printed up till 1913 but only the 18th century elements are currently available for mapping).  

One aspect of the tagging we imposed on the Proceedings was a distinction between 'Crime Location' and 'Defendants' Home'.  This information is pretty consistently given in the text and tagged in the XML, and the 18th century trials include around 34,000 crime locations, and around 12,000 defendants' homes.  
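The kind of extraction this tagging makes possible can be sketched in a few lines - though the element and attribute names below are simplified stand-ins for illustration, not the Proceedings' actual XML markup:

```python
# Sketch of pulling the two tagged place types out of a trial record.
# The <trial>/<place> element names and type values are simplified
# stand-ins, not the Proceedings' real schema.
import xml.etree.ElementTree as ET

record = """
<trial id="t18000101-1">
  <place type="crimeLocation">Cheapside</place>
  <place type="defendantHome">Whitechapel</place>
</trial>
"""

def places_by_type(xml_text):
    """Group tagged place names by their type attribute."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for place in root.iter("place"):
        grouped.setdefault(place.get("type"), []).append(place.text)
    return grouped
```

Run over the full corpus, an extraction of this sort is what yields the 34,000 crime locations and 12,000 defendants' homes mapped below.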
 
A quick search for all 'Crime Locations' (34,427), when mapped on to 'Street' and displayed on to a blank screen, looks like this:

  
And an equally quick mapping of 12,031 Defendants' Homes looks like:


 
When placed over the warped version of John Rocque's 1746 map of London, the result is:


I don't have an argument about this data, or even much of an observation.  The predominance of 'Defendants' Homes' in the eastern part of the city seems pretty compelling, and could form the basis for an analysis of the relative access to justice in eighteenth-century London, or, when mapped against wealth, part of an argument about the nature of crime and its motivation.  But more importantly, the process of 'playing' with this data strikes me as central to a very different kind of research narrative than I am used to.  I am not formulating questions and then using the data to answer them - I am throwing together visualisations in search of contrasts that stand out, and look weird.

I am very much looking forward to using the interactive elements of the Locating London's Past site to find anomalies and confusions that allow me to reformulate the questions I am asking.





Sunday, 23 October 2011

Academic History Writing and its Disconnects


This is the rough text of a short talk I am scheduled to deliver at a symposium on 'Future Directions in Book History' at Cambridge on the 24th of November 2011.


I am on the programme as talking briefly about the ‘Old Bailey Online and other resources’ (by which I assume is meant London Lives, Connected Histories, and Locating London’s Past, and the other websites I have helped to create over the last ten or twelve years).  But I am afraid I have no interest whatsoever in discussing the Old Bailey or the other websites.  The hard intellectual work that went into their creation was done between 1999 and 2010, and for the most part they have found an audience and a user base and will have their own impact, without me having to discuss them any further.  We know how to do this stuff, and anyone can read the technical literature, and I very much encourage you to do so.


Instead, I want to talk about how the evolution of the forms of delivery and analysis of text inherent in the creation of the online, problematizes and historicises the notion of the book as an object, and as a technology; and in the process problematizes the discipline of history itself as we practise it in the digital present. 


The project of putting billions of words of keyword searchable stuff out there is now nearing completion.  We are within sight of that moment when all printed text produced between 1455 and 1923 (the point at which the Disney Corporation determined that the needs of modern corporate capitalism trumped the Enlightenment ideal) will be available online for you to search and read.  The vast majority of that text is currently configured to pretend to be made up of ‘books’ and other print artefacts.  But, of course, it is not.  At some level it is just text – the difference between one book and the next a single line of metadata.  The hard leather covers that used to divide one group of words from another are gone; and every time you choose to sit comfortably in your office reading a screen, instead of going to a library or an archive, while kidding yourself that you are still reading a ‘book’, you are in fact participating in a charade.  We are swimming in deracinated, Google-ised, Wikipedia-ised text.


In other words, and let’s face it: the book as a technology for packaging and delivery, storing and finding text is now redundant.  The underpinning mechanics that determined its shape and form are as antiquated as moveable type.  And in the process of moving beyond the book, we have also abandoned the whole post-enlightenment infrastructure of libraries and card catalogues (or even OPACS), of concordances, and indexes and tables of contents.  They are all built around the book, and the book is dead. 


If this all sounds rather doom laden and apocalyptic – and no doubt we could argue about the rosy future and romantic appeal of the hard copy book – it shouldn’t.  At least as far as the ‘history of the book’ is concerned, these developments have been entirely positive.

First, it has allowed us to begin to escape the intellectual shackles that the book, as a form of delivery, imposed upon us.  If we can escape the self-delusion that we are reading ‘books’, the development of the infinite archive, and the creation of a new technology of distribution, actually allows us to move beyond the linear and episodic structures the book demands, to something different and more complex.  It also allows us to more effectively view the book as an historical artefact and now redundant form of controlling technology.  The 'book' is newly available for analysis.


The absence of books makes their study more important, more innovative, and more interesting.  It also makes their study much more relevant to the present – a present in which we are confronted by a new, but equally controlling and limiting technology for transmitting ideas.  By mentally escaping the ‘book’ as a normal form and format, we can see it more clearly for what it was.  And to this extent, the death of the book is a fantastic and liberating thing – the fascism of the format is beaten.


At the same time, I think we are confronted by a profound intellectual challenge that addresses the very nature of the historical discipline.  This transition from the ‘book’ to something new fundamentally undercuts what we do more generally as ‘historians’.  When you start to unpick the nature of the historical discipline, it is tied up with the technologies of the printed page and the book in ways that are powerful and determining.  Our footnotes, our post-Rankean cross referencing and practices of textual analysis are embedded within the technology of the book, and its library.


Equally, our technology of authority – all the visual and textual clues that separate a CUP monograph from the irresponsible musings of a know-nothing prose merchant – are slipping away.  While our professional identity – the titles, positions and honorifics – built again on the supposedly secure foundations of book publishing – is ever less compelling. So the question then becomes, is history – particularly in its post-Rankean, professional and academic form - dead?  Are we losing that beautiful disciplinary character that allows us to think beyond the surface, and makes possible complex analyses that transcend mere cleverness?
 

And on the face of it, the answer is yes – the renewed role of the popular blockbuster, and an ever growing and insecure emphasis on readership over scholarship, would suggest that it is.  In Britain we shy away from the metrics that would demonstrate ‘impact’ primarily because we fear that we may not have any.


Collectively we have put our heads in the sand, and our arses in the air, and seemingly invited the world to take a shot.  A single, self-evident instance of this deeper malaise is our current failure to bother citing what we read.  We read online journal articles, but cite the hard copy edition; we do keyword searches, while pretending to undertake immersive reading.  We search 'Google Books', and pretend we are not.


But even more importantly, we ignore the critical impact of digitisation on our intellectual praxis.  Only 48% of the significant words in the Burney collection of eighteenth-century newspapers are correctly transcribed as a result of poor OCR.  This makes the other 52% completely un-findable.  And of course, from the perspective of the relationship between scholarship and sources, it is always the same 52%.  My colleague Bill Turkel describes this as the Las Vegas effect – all bright lights, and an invitation to instant scholarly riches, but with no indication of the odds, and no exit signs.  We use the Burney collection regardless – not even bothering to apply the kind of critical approach that historians have built their professional authority upon.  This is roulette dressed up as scholarship.
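The odds in question are easy to make concrete.  The following is a toy simulation: the 48% figure is the one cited above, but the assumption that each word fails independently, and the code itself, are mine, not a model of the Burney collection's actual OCR errors.

```python
import random

random.seed(0)

WORD_ACCURACY = 0.48  # proportion of significant words correctly transcribed

def phrase_findable(phrase_len, accuracy=WORD_ACCURACY):
    """A phrase is findable only if every word in it survived OCR."""
    return all(random.random() < accuracy for _ in range(phrase_len))

def estimated_recall(phrase_len, trials=100_000):
    """Estimate the chance a search for a phrase of this length succeeds."""
    hits = sum(phrase_findable(phrase_len) for _ in range(trials))
    return hits / trials

# A single-word search succeeds a little under half the time; a three-word
# phrase search, which needs all three words intact, succeeds roughly
# one time in nine (0.48 ** 3 is about 0.11).
print(round(estimated_recall(1), 2))
print(round(estimated_recall(3), 2))
```

The point of the sketch is only that the losses compound: the longer and more precise the query, the worse the unadvertised odds become.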

In other words, we have abandoned the rigour of traditional scholarship.  Provenance, edition, transcription, editorial practice, readership, authorship, reception – the issues we interrogate in relation to books – are left unexplored in relation to the online text we actually read.

And as importantly, the way we promulgate our ‘history’ has not kept up either.  I want television programmes with footnotes, and graphs with underlying spreadsheets and sliders.  Yes, I want narrative and analysis, structure, point and purpose.  I want to continue to be able to engage in the grand conversation that is history; but it cannot continue to be produced as a ragged and impotent ghost of a fifteenth century technology; and if we don’t do something about it, we might as well all go off and figure out how to write titillating tales of eighteenth-century sex scandals, because at least they sell.

The book had a wonderful 1200 odd year history, which is certainly worth exploring.  Its form self-evidently controlled and informed significant aspects of cultural and intellectual change in the West (and through the impositions of Empire, the rest of the world as well); but if, as historians, we are to avoid going the way of the book, we need to separate out what we think history is designed to achieve, and to create a scholarly technology that delivers it.


In a rather intemperate attack on the work of Jane Jacobs, published in 1962, Lewis Mumford observed that:


‘… minds unduly fascinated by computers carefully confine themselves to asking only the kind of question that computers can answer and are completely negligent of the human contents or the human results.’

I am afraid that in the last couple of decades, historians who are unduly fascinated by books, have restricted themselves to asking only the kind of questions books can answer.  Fifty years is a long time in computer science.  It is about time we found out if a critical and self-consciously scholarly engagement with computers might not now allow us to more effectively address the ‘human contents’ of the past.

Sunday, 19 June 2011

Culturomics, Big Data, Code Breakers and the Casaubon Delusion

Suddenly it seems as if 'big data' humanities is all the crack; with quantitative biologists and mathematicians diving in where previously only historians, literary critics and linguists dared to swim.  Digital humanists have been slowly engineering a new field from history and linguistics (aided and abetted by library science) for over a decade, gradually building new bodies of evidence, and road testing new methodologies.  But in just the last year or so, the biologists and mathematicians, with Google's help, have stolen a march on all their puny efforts.  In particular, it seems that Science and Nature have fallen head over heels in love with 'culturomics' and the heady enthusiasms of Erez Lieberman Aiden and Jean-Baptiste Michel, and their Google ngram viewer.  To read the most recent issue of Nature is to be confronted with a heady mix of big science and gushing Hello Magazine prose that works to mythologise the new 'science' of culturomics and its creators.  It feels like the birth of a myth and of a brand.


This is all rather wonderful, and I am a huge fan of the Google ngram viewer, and the playful way it allows scholars and students to engage with the 'infinite archive' of inherited texts.  I think Aiden and Michel (and Google) have done the humanities a huge service.   But their real achievements do not quite explain the cloud of hyperbole that seems to be rising around them.


And this made me wonder: what is really at issue here?  What is it about culturomics that so excites the reporters from Nature?  At its heart, the use of word frequency with a reasonably sized (if problematic) data set simply provides one more form of evidence to be added to all the rest.  Knowing that the term 'electricity' peaks between 1870 and 1900 is useful evidence, but does not provide either an explanation for why, or a description of how it is being used.  Historians will no doubt look this particular gift horse in the mouth, and worry at the condition of its teeth; but they will also happily use the ngram viewer as one more component in a complex landscape of evidence.  This use may be delayed by the peculiar lack of any guidance on how to cite the results of a search, but it will be normalised in due course.
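What such a frequency curve amounts to can be sketched in a few lines.  The counts below are invented for illustration (they are not Google's data): an ngram trend is simply a term's count in each decade, divided by the total number of words printed in that decade.

```python
# Invented per-decade counts standing in for a real corpus.
corpus = {
    1860: {"electricity": 12, "_total_words": 100_000},
    1880: {"electricity": 95, "_total_words": 110_000},
    1900: {"electricity": 60, "_total_words": 120_000},
}

def relative_frequency(term, decade):
    """The term's share of all words printed in a given decade."""
    counts = corpus[decade]
    return counts.get(term, 0) / counts["_total_words"]

trend = {d: relative_frequency("electricity", d) for d in sorted(corpus)}
peak = max(trend, key=trend.get)
print(peak)  # 1880 – the rise-then-fall shape the viewer would plot
```

Normalising by decade totals is what makes decades comparable at all; it is also, of course, exactly where the 'problematic data set' problem hides, since the totals depend on what happened to be digitised.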


But simply providing a new body of evidence is not what seems to get Nature going.  Instead, it is the claim that the ngram viewer lays the basis for a new 'science', and that the results make other forms of historical analysis redundant.  In the words of Aiden and Michel, somehow this data is uniquely available for 'scientific purposes', in contrast to other forms of evidence.

It is not, therefore, the mechanics of the ngram viewer that is at issue.  Instead it is the underlying intellectual paradigm that Aiden and Michel bring to its use.  They appear to claim to be able to read history from the patterns the ngram viewer exposes - to decipher significant patterns from the data itself.  Their great party tricks (and they are particularly impressive in live performance) include reducing the decline of irregular verbs to a describable mathematical pattern, an equation; and charting the rise of 'celebrity' as measured by the number of times an individual is mentioned in print.  These imply that all historical development can, like irregular verbs, be described in mathematical terms, and that 'human nature', like the desire for fame, can be used as a constant to measure the changing technologies of culture.

In some respects, we have been here before.  In the demographic and cliometric history so popular through the 1970s and 80s, extensive data sets were used to explore past societies and human behaviour.  The aspirations of that generation of historians were just as ambitious as are those of the parents of culturomics.  But, demography and cliometrics started from a detailed model of how societies work, and sought to test that model against the evidence; revising it in light of each new sample and equation.

The difference with culturomics is that there is no pretence of a model.  Instead, its practitioners simply seek to discover patterns in the entrails of human speech, hoping to find the inherent meanings encoded there.  What I think the scientific community finds so compelling is that, like quantitative biology and DNA analysis, Aiden and Michel are taking one of the controlling metaphors of 20th-century science, 'code breaking', and applying it to a field that has hitherto resisted the siren call of analytical positivism.


Since the 1940s the notion that 'codes' can be cracked to reveal a new understanding of 'nature' has formed the main narrative of science.  With the re-description of DNA as just one more code in the 1950s, wartime computer science became a peacetime biological frontier (cashing in on big-pharma, as military expenditure declined).  That Aiden comes from a background in DNA analysis should clue us to the fact that culturomics is an attempt to apply the same kind of code breaking to human society as a whole.



I strongly suspect that the project will fail, just as naive readings of DNA as a code for life have largely failed to fulfil their promise.  But much more importantly, this attempt to repurpose a 'scientific' approach to historical analysis simply misunderstands the function of history itself.  These large-scale visualisations of language may be the raw material of history, the basis for an argument, the foundation for a narrative, the evidence put in the appendix in support of a subtle point, but they do not serve as a work of history.

Historians interpret the past to the present.  They marshal evidence and use all the tools of genre writing to allow a modern reader to engage with the past.  And the questions they ask are not driven by the evidence, but by the needs of a modern society.  Gender history, the history of sexuality, and of race, have been created by two generations of historians not because the archives are groaning under the weight of relevant evidence, but because our society needs to understand the role of these forces in the present.  The fundamental flaw with culturomics is that it assumes that history is about the past; that what historians seek to achieve is an ever more accurate description of everything.  Instead, it is about the present.  Ironically, Aiden and Michel have rediscovered the 'Casaubon delusion'; and believe, like George Eliot's tragic figure, that they can create a new 'Key to all Mythologies'.   They need to listen to the Dorotheas of this world.

Friday, 1 April 2011

Towards a New History Lab for the Digital Past

The text of a talk delivered at the launch of Connected Histories, held at the Institute of Historical Research, London, on 31 March 2011.


When the Institute of Historical Research was first established in 1921, its purpose and object was described as: 

to become an index to historical knowledge, a focus of historical research, a clearing-house of historical ideas, and a historical laboratory open to students of all universities and all nations.
Institute of Historical Research leaflet, 1921

And those of you who know the IHR, as it has evolved through almost a century of change, will recognise in its seminars, in its unique open shelf library, and in its simple role as the centre of a community of historians, of students, and of the curious and argumentative, the continuing vibrancy of this original spirit and purpose.
In many respects, Connected Histories is a simple attempt to ensure that this spirit and objective continues to thrive online; that immediate access to 2 billion words and 150,000 images  - searchable at the click of a mouse, and sharable across time and space – will enhance that community, and the history it creates.


But Connected Histories is also a recognition that the nature of historical research has changed; that we are drowning in an infinite archive – an ever expanding world of information.  And that the secure sense of a discipline that knew how to judge quality, how to assess evidence, is challenged by the sheer number of sources we can interrogate for words – at least - if not yet for meaning.

Given the privilege of a few minutes with a powerful audience, I want to do a couple of things this afternoon.  First, I want to describe just what Connected Histories does and how it works; and in the process say a bit about why it is designed the way it is, and what issues it is meant to address.  And second, I want to talk a bit about how it fits into a trajectory of changing research and publishing practice – to describe where it sits in a process of frighteningly rapid change, and to locate it alongside the other resource being introduced today – Mapping Crime.
Connected Histories is what is called a ‘federated’ search facility, and currently makes some eleven different web resources available – over two billion words of text, and 150,000 images, some free to access, others supported by a JISC license for use in British Higher Education, and others still, commercial sites designed for a wider audience of family and local historians.  

It includes
  • British History Online

  • British Museum Images

  • British Newspapers, 1600-1800

  • Charles Booth Archive

  • Clergy of the Church of England Database 1540-1835

  • House of Commons Parliamentary Papers

  • John Johnson Collection of Printed Ephemera

  • John Strype's Survey of London Online

  • London Lives 1690-1800

  • Origins Network

  • The Proceedings of the Old Bailey Online, 1674-1913


And we are in the process of adding several more.

Underpinning its searches are indexes of every word in those eleven ‘distributed’ websites – each of which was chosen to represent large bodies of academically credible and relevant material.  As a part of this index, each word is associated with a web address, a URL, that allows you to click through to the original.  This creates a basic facility that, in response to a word or phrase search, will return tens of thousands of results, each associated with a snippet of text, and each linked to the full resource held elsewhere.

In other words, at its fundament and in its water, Connected Histories is simply a comprehensive index of words.  But in the process of creating that index, we also sought to assign meaning to some of them.  Using a methodology called natural language processing, we identified names and dates and places (to an accuracy of around 75%).  So, in addition to an index of all the words in these 11 resources, we have also created indexes of all the names and places and dates mentioned: all the names in the Burney Collection, and all the dates in the Parliamentary Papers (however they are expressed).  

In other words, what we have is not just one, but four indexes, and you are searching each of these 2 billion words, or the millions of names or dates or places, each time you enter a query – allowing you to combine keyword and name, date and place searches to find just what you want.

That it works is a testimony to the hard work of the technical staff at the HRI and IHR, to Kathy Rogers and Bruce Tate in particular.  But also to Sharon Howard, who managed the project, and to the large team of people involved.

The starting point for this project was always an attempt to address what Digital Humanists tend to label the ‘silo effect’ – the idea that one of the problems with small scale websites and resources of the sort so many of us have worked to create over the last fifteen years, is that you tend to go along to one site and do a bit of research, before heading to another.  Just as with traditional forms of research, most of us forget what we knew in the British Library during the short walk to the London Metropolitan Archives.

And in its most basic formulation the silos Connected Histories seeks to blow apart are the boundaries between web sites.  You can now cross search the British Museum image collection against Strype’s history of London, and associate images of specific locations with descriptions and commentary on them.  You can search the Parliamentary Papers in combination with the records of all the sessions papers of the county of Middlesex – bringing onto a single screen precept and practice.  There are sixty thousand settlement examinations that can now be cross referenced against apprenticeship documents, and trial records.

But this blithe image of easy cross searching fundamentally understates the complexity of the issue, and the precise reasons Connected Histories is designed in the way that it is.

One overwhelming, and very real, aspect of the ‘silo effect’ is that while many of the primary sources we need are freely available, many others are not.  The walls of some silos are much more difficult to breach than others.  While frustrating, this is not necessarily a bad thing.  Unless we can convince the state and the taxpayer to pay for universal digitisation, we can’t really complain if the cost of digital resources is being borne by the end users.  Early modern and modern Britain is quite simply the most digitised where and when in existence, because of the combined efforts of the academy, of the great cultural institutions of Britain, of individual scholars, and of private publishing companies motivated by profit.  But, at the same time, as scholars and teachers, we need access to all those resources.  Or at the least we need to know what is in them, in order to make an informed decision about what we want out of them.

The model of a series of indexes to the original material is precisely designed to address this issue.  We don’t need to have direct access to all the products of ProQuest and Gale, of the Origins.net, or JISC funded projects like the John Johnson collection, if we have an index that tells us what they contain.  These materials can sit behind their paywalls, the intellectual property they contain safe from harm; while we can now interrogate them from a distance.

In other words, Connected Histories is designed to build a bridge between the academy and commercial publishers; it is designed to mess up the models of delivery, and the walls of division that keep us apart.  These ‘silos’ are almost philosophical in character, but even more than the technical ones that divide one website from another, they need to be breached; and that is part of what this project has been about.

But, the silo effect goes beyond even this.  It exists between our own ears as well.  At its best and most compelling, history is a community of scholars, sharing knowledge and effort in pursuit of a real and usable understanding of the past – it is a collective project.  At its worst, it is a collection of egomaniacs, desperate to be lauded as the great authority on this or that – however specialised and narrow that might be.  The much lamented lone scholar is as frequently a Casaubon, forever seeking and failing to find the key to all knowledge, as they are a Dorothea, driven by enthusiasm and a desire to share with others.  At some level, the ‘silo effect’ is inherent in the idea of ‘authorship’ (and the ‘authority’ it implies).  It is there when we decide one person’s work is literature, and another’s is art history; when we label by period or methodology; when we decide who to exclude from the conversation, and who to include.

We do not all need to work collaboratively, or to abandon our notions of intellectual property, but in the spirit of a ‘history lab’, we do need to share our work, and remember the common purpose of historical research.  And Connected Histories is again an attempt to address these particular silos.  

By creating individual workspaces that build into a new body of ‘connections’, by allowing users to link documents, and names, and stuff, across billions of words, and then pooling those links and allowing them to be explored by a wider public, Connected Histories is designed to build a new shared body of knowledge grounded in everyday practical scholarship.  It is designed to nudge the lone scholar to become a more sociable animal.

In many respects all these ‘silos’ are part of our inheritance from the Enlightenment.  They are inherent in every library catalogue, and in the practice of individual scholarship leading to named authorship.  They reflect the co-evolution of the academic community, in a symbiotic death grip, with commercial publishing; and they were imported without fanfare or thought into what one might want to describe as Web 1.0 – that first iteration of the internet created in the image of older forms of scholarship and communication – with e-mail, e-spreadsheets, e-footnotes, e-everything – all mimicking an older intellectual technology.

In other words, there is a bigger ‘silo’ out there – a division that is more fundamental to the internet and the cultures of scholarship than the mere distance between the technical implementation of British History Online, on one hand, and The Burney Newspapers or the Old Bailey on the other.  

Yes, we want to consult this material in one go; and yes, we need to overcome the boundaries created by pay walls and subscriptions; and yes, we want historians to work together in a common laboratory of ideas and connections.  But what really needs to break down is the silo that suggests that information itself is something to be consulted and collected; that it is an unchanging object of study, rather than a pool of constantly changing stuff that can be interrogated from any angle, and pursued along any trajectory.

The most fundamental silo Connected Histories is intended to address is between traditional forms of criticism and scholarship that assume we can contain data in an internally structured and divided, ‘library’; and the emerging world of text and data mining, that sees data as a process – something to be played with and analysed on a massive scale, across boundaries of genre and type.

The innovation at the heart of Connected Histories, the one I think is most interesting, is the methodology used to allow us to sit in London this afternoon, and locate the site and its gubbins in the IHR; while the indexes it interrogates sit on a server in Sheffield – which distributes pointers to eleven different servers around the country.

What has been created by the Institute and the Humanities Research Institute in Sheffield is a model that uses an ‘API’ as its core.  An API is an Application Programming Interface (the most widely used version of which is Google Maps), and it is designed to allow you to create a simple query that can address a dataset from a distance (in this case four indexes).  It is not a website, or a ‘front end’; it doesn’t need to exist as a visual or physical thing.  It is essentially a series of agreed conventions that allow anyone to address a web resource and ask it for a bit of the data it contains.  What Connected Histories does is locate the ‘front end’ in London, with information about the sources, with the workspace and connections, etc.; but that front end’s main job is to address the API in Sheffield, to gather the data required from the indexes, to bundle it up into an XML file, and to present it in an attractive way to the end user, who can then navigate to the original sites, create links, and share searches.
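A rough sketch of that division of labour follows.  The endpoint URL and the XML shape are hypothetical stand-ins, not the real Connected Histories API; the point is only the separation between building the query, and unpacking the response for the reader.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def build_query(keyword, source=None):
    """The front end's first job: turn a user query into an API request URL.
    The host and parameter names here are invented for illustration."""
    params = {"q": keyword}
    if source:
        params["source"] = source
    return "https://api.example.org/search?" + urlencode(params)

# A canned response standing in for what the remote API server would return.
response = """<results total="2">
  <hit url="oldbailey/t17800628"><snippet>...the riot at...</snippet></hit>
  <hit url="burney/gazette-1780-06"><snippet>...riotous assembly...</snippet></hit>
</results>"""

def present(xml_text):
    """The front end's other job: unpack the XML into something displayable."""
    root = ET.fromstring(xml_text)
    return [(hit.get("url"), hit.findtext("snippet")) for hit in root]

print(build_query("riot"))
for url, snippet in present(response):
    print(url, "->", snippet)
```

Because the two halves meet only at an agreed request format and an agreed response format, either half can be swapped out: a different front end can address the same indexes, and the same front end can address a different data store.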

In other words, the indexes in Sheffield have been created as a standardised and generic resource, which is then addressed by a specialised and bespoke search and save environment.

For most of us, this is a seamless process of little interest; but what it does is create a space between search and data, that can now be occupied by anyone.   In other words, and unlike most free-standing websites - it is designed to be mash-upable.

With a bit of technical nous you can now generate a bit of code that will automatically select and download the contents of all the indexes, reflecting all the words in Connected Histories.  The text miners who do this will not be gaining access to the original resources – there is no intellectual property issue here (beyond that of this project).  There is no question of them being able to recreate the sites so laboriously constructed by whatever business or academic model was employed to create them in the first instance; but they will have access to what amounts to a detailed description of the contents of it all – the index of every word, and name and place and date.
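That 'bit of code' might look something like the following sketch.  The paginated fetch function is a stand-in of my own, not the actual API's interface; a real harvester would issue HTTP requests to the live service, but the page-until-empty loop is the shape of the job.

```python
# A pretend index standing in for the remote word index.
FAKE_INDEX = [f"word_{i}" for i in range(25)]

def fetch_page(page, page_size=10):
    """Stand-in for one paginated API call; returns one page of index entries."""
    start = page * page_size
    return FAKE_INDEX[start:start + page_size]

def harvest():
    """Page through the index until the API returns nothing more."""
    entries, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        entries.extend(batch)
        page += 1
    return entries

print(len(harvest()))  # 25 – the whole index, gathered page by page
```

The harvester ends up with the index, not the underlying documents, which is exactly the point: a detailed description of the contents, with the original resources still behind their walls.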

Or to put it another way, the API architecture breaks down the structures of online resources into their component parts – separates out data from processing, from delivery - allowing each to be re-used and re-purposed.  At the moment it looks like a traditional website with a single front end and datastore, but that front end can address more than one data store; and the datastore can be addressed by more than one front end.   

The API architecture addresses that final wall, that silo that means that providers are on one side and data consumers, forced to query the data through an ever narrowing front end, are on the other.  Suddenly, we are all mixed up in the infinite archive.
 

To my way of thinking, this comes under the heading of an unalloyed good thing.  An outcome that liberates data, while protecting it; that makes for better history (whoever is writing it), and contributes to the democratisation of scholarship.

But it is only one step in a longer journey; and I want to spend the next few minutes pointing up three or four directions, that I think Connected Histories helps make possible; or which seem to grow naturally from it.

And the first is to do with those academic text miners, suddenly empowered to access ridiculously large bodies of data.  What do you do with a 2 billion word index?

What I want to do is to begin the process of modelling what recorded language since Gutenberg looks like: how does vocabulary change; how do genres evolve; how are ideas passed from medical literature to political science, to novels; how are changing technology and a changing environment reflected in changing texts.  In a sense, half of the last century was taken up in worrying about whether text, words, reflected a knowable universe, or were themselves controlling discourses, leaving humans powerless to imagine something new or describe something real – held captive in words.

I like to describe what we can now access as ‘massive text objects’ – too large to read, too complex to be contained in traditional taxonomies.  But, if we can begin to model them – if we can know both the absolute amount of language recorded, and how it changes from source to source, and decade to decade – we can use it in a more sophisticated way to trace, first, the controlling forms of language, but also to more securely tie description to an underlying and knowable historical past.  If you know the shape and texture of what has survived, you can begin to think through how it might relate to Herbert Butterfield’s ‘...genuine relationship with the actual…’. [Herbert Butterfield, The Whig Interpretation of History (George Bell and Sons, 1950), p. 73.  See also Michael Eamon, ‘A "Genuine Relationship with the Actual": New Perspectives on Primary Sources, History and the Internet in the Classroom’, The History Teacher 39.3 (2006).]
 
It is in text mining massive text objects that the hope of a new empiricism in historical analysis lies.

But for myself, I suspect there is also something else going on.  The urge to create new connections, to escape our inherited taxonomies, can already be seen in projects such as Mapping Crime – being demonstrated later today.  By tying material related to crime available through the John Johnson Collection of Printed Ephemera to other repositories and other genres, a reconstructed set of links begins to emerge that confounds the structures created by librarianship.  The data itself, in its newly digital form, seems to suggest the need for new connections.

And the API model at the heart of Connected Histories is itself an attempt to embed this idea and aspiration at the core of the design process.  It assumes that new connections are there to be made, and that they will inevitably cross boundaries of form and origin to encompass an ever-expanding body of inherited artefacts.

To take just a small example, we will soon be able to geo-reference at least a portion of the place names in Connected Histories, tying all that text to space in new ways.  By modelling maps in the way that we are beginning to model massive text objects, we can relate historical geography to present geography, to secure a further line between representation and a knowable past; and by using an API methodology we are ensuring that it is all mash-upable, with resources from wherever they come.  By September (if we keep to schedule), we will have the ability to mash up eighteenth-century London as found in the Old Bailey, and in London Lives, in Google Maps, with a rectified version of Rocque’s 1746 map of London, in combination with around 3 million artefacts dating from the period, dug up by the Museum of London.
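At its simplest, geo-referencing of this kind means matching place names in transcribed text against a gazetteer of known coordinates.  The sketch below illustrates the shape of the task only; the gazetteer entries and coordinates are my own rough assumptions, not data from Connected Histories or any real gazetteer.

```python
# Hypothetical gazetteer: lower-cased place name -> (latitude, longitude).
# The coordinates here are approximations for illustration only.
gazetteer = {
    "cheapside": (51.5140, -0.0935),
    "smithfield": (51.5190, -0.1020),
}

def georeference(text):
    """Return (place, (lat, lon)) for each gazetteer name found in `text`."""
    lowered = text.lower()
    return [(name, coords) for name, coords in gazetteer.items()
            if name in lowered]

trial = "The prisoner was taken near Cheapside and carried to Smithfield."
print(georeference(trial))
```

A production system would need to disambiguate names (there is more than one Smithfield), and to translate between historical and modern geographies, which is precisely why rectifying Rocque’s 1746 map against the present-day grid matters.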

This JISC funded project is in hand, and is in many ways the natural outcome of what has been described as the spatial turn in historical studies.  But by putting an API at the heart of the system, it will again facilitate the re-use and re-imagination of what we can do with a few billion lines of data.

And, of course, we can take the same approach to that other great inherited body of evidence: objects.  The historians and the museums will work together eventually (the logic is too ridiculously obvious to need reinforcing), and at that point, the ability to cross-reference maps and texts and objects will again begin to change how we can evidence the past.  

And if we could add to the museum collections that other massive online record of surviving historical artefacts – that other massive resource digitised by accident, the auction catalogues – we would have created an entirely new resource, available in a new way.  Auction catalogues have been created as online, digital resources for over a decade, and already contain detailed descriptions and images of millions of objects: the record of what individuals have valued and preserved, on their own behalf, from the past.  And thousands more images and descriptions are added each month.  

Again, these represent both a massive lens through which we can observe the past, and a silo dividing related and cognate materials.  Connecting them to texts and maps and stuff will help us better understand the whole.

It is intended that Connected Histories will grow over time.  In its first update in September, the National Archives’ ‘Documents Online’ will be added, as well as two key nineteenth-century resources: 65,000 digitised British Library books from the JISC Historic Books platform and the JSTOR collection of pamphlets on social and political issues.  Suggestions for additional content are welcome.  

But beyond more text, we are confronted with the challenge of integrating more different kinds of things.  And with each new variety of stuff, we move to a different kind of understanding: more sophisticated, better articulated, more firmly rooted in a clear model of what it is we are looking at – what we can securely see, and what we can’t.   

All in all, I think it is kind of cool.  

But I also think it remains part of that bigger project: 

to become an index to historical knowledge, a focus of historical research, a clearing-house of historical ideas, and a historical laboratory open to students of all universities and all nations.

Connected Histories.




Thursday, 13 January 2011

Urbanism Kolkata Style

I recently spent a week at a workshop in Kolkata organised by the British Library and the National Library of India, designed to lay the foundations for a project to digitise early Bengali books (1778-1914).  In many respects it was a wonderful experience, and the project is incredibly worthwhile (though it will be difficult to implement).  But this was also my first visit to India, and my first visit to what might be called a 'mega-city', and the experience has forced me to revisit some aspects of how I have been conceptualising urban living, and in particular the nature of eighteenth- and nineteenth-century London.

A couple of years ago I visited Marrakesh and was struck by the essentially orderly character of what is a relatively poor and very crowded urban environment.  By contrast, what struck me in Kolkata was the extent to which urban growth seemed to have outstripped the city's ability to discipline behaviour.  At around seven million people at midday, and four million at midnight, Kolkata is both one of the world's most recently created cities, and a city occupied by a rapidly burgeoning population drawn primarily from its own rural hinterland.  The city lived up to many stereotypes - crowded streets, fearsome pollution, and traffic that worked like fairground dodgems.   And while there were fewer beggars (either children or the disabled) than I had expected, and while I saw little evidence of malnutrition, there were ragged shanties and street people around every corner.

But what surprised me was the nature of the rubbish.  It seemed to pile up along every roadside, untended and ignored.  And it seemed to be made up overwhelmingly of small bits of plastic, mixed with dust.  There was little evidence of large amounts of organic matter - no rotting fruit or vegetables, fly-specked bones or industrial by-products.  At one level it seemed the least varied or interesting rubbish I had ever seen - and it seemed to sit entirely unregarded, unmoved and unchanging.

My only explanation for this phenomenon is that, first, everything that could be recycled, re-used, turned to any account whatsoever, had been sifted from the pile and moved on.  And second, that there was no working system in place to remove the last, uneconomic residuum. 

As a historian of the urban environment, this seems to me to reinforce the profound inter-relationship between the economics of city life (its wealth), and the need for cultural controls on the behaviour of each individual urban dweller in order to make a city work.  In essence, what it confirmed for me is the extent to which living in a city requires the kind of detailed social and cultural system that could remove the rubbish; and that such cultural and bureaucratic systems can only be sustained in the face of measured growth - and that they can easily break down in the face of rapid migration.

When cities grow as quickly as Kolkata certainly has; when behaviours and systems of bodily maintenance fitted for small holdings and low-density living are practised in a high-density urban environment, the result in this instance seemed to approach the unliveable; or at least a form of urban dystopia.

In relation to the history of London, this observation seems to me to emphasise the extent to which the city in the eighteenth century was able to maintain a culturally disciplined series of behaviours that both ensured that the rubbish was corralled to the correct place on the street, and that it was not allowed to stay there - poor neighbourhoods seldom tipped irredeemably into chaos.  It also suggests that the evolution of the nineteenth-century rookery formed an outpost of dysfunctional urbanism that can be mapped against rural in-migration.  But, most importantly, this experience has reinforced a belief that 'urban living' can be conceptualised as a distinct cultural phenomenon that takes similar forms across the globe - that being a city dweller is first and foremost about sharing a cultural system built on specifically urban forms of behaviour.