Friday, 29 May 2015

The UK Web Archive, Born Digital Sources and Rethinking the Future of Research

The following post is derived from a short talk I gave at a doctoral training event at the British Library in May 2015, focused on using the UK Web Archive.  It was written with PhD students in mind, but really forms a meditation on the opportunities created when we are working with web sites rather than print.  While lightly edited, the text retains the tics and repetitions of public presentation.

My office c.1984
I normally work on properly dead people of the sort that do not really appear in the UK Web Archive – most of them eighteenth-century beggars and criminals.  And in many respects the object of study for people like me – interlocutors of the long dead – has not changed that much in the last twenty years.  For most of us, the ‘object of study’ remains text.  Of course the ‘digital’ and the online have changed the nature of that text.  How we find things – the conundrums of search – shapes the questions we ask.  And a series of new conundrums has been added to all the old ones – do, for instance, ‘big data’ and new forms of visualisation imply a new ‘open eyed’ interrogation of data?  Are we being subtly encouraged to abandon older social science ‘models’ for something new?  And if we are, should these new approaches take the form of ‘scientific’ interrogation, looking for ‘natural’ patterns – following the lead of the Culturomics movement; or perhaps take the form of a re-engagement with the longue durée – in answer to the pleas of the History Manifesto?  Or perhaps we should be seeking a return to ‘close reading’ combined with a radical contextualisation – looking at the individual word, person, and thing in its wider context, preserving focus across the spectrum.

And of course, the online and the digital also raise issues about history writing as a genre and form of publication.  Open access, linked data, open data, the 'crisis' of the monograph, and the opportunities of multi-modal forms of publication, all challenge us to think again about the kind of writing we do, as a literary form.  Why not do your PhD as a graphic novel? Why not insist on publishing the research data with your literary over-lay?  Why not do something different?  Why not self-publish?

These are conundrums all – but conundrums largely of the ‘textual humanities’.  

Ironically, all these conundrums have not had much effect on the academy and the kind of scholarship the academy values.  The world of academic writing is largely, and boringly, the same as it was thirty years ago.  How we do it has changed, but what it looks like feels very familiar.

But the born digital is different.  Arguably, the sort of thing I do, history writing focused on the properly dead, looks ‘conservative’ because it necessarily engages with the categories of knowing that dominated the nineteenth and twentieth centuries – these were centuries of text, organised into libraries of books, and commented on by cadres of increasingly professional historians.  The born digital – and most importantly the UK Web Archive – is just different.  It sings to a different tune, and demands different questions – and if anywhere is going to change practice, it should be here.

Somewhat to my frustration, I don’t work on the web as an ‘object of study’ –  and therefore feel uncertain about what it can answer and how its form is shaping the conversation; but I did want to suggest that the web itself and more particularly the UK Web Archive provides an opportunity to re-think what is possible, and to rethink what it is we are asking; how we might ask it, and to what purpose.

And I suppose the way I want to frame this is to suggest that the web itself brings on to a single screen, a series of forms of data that can be subject to lots of different forms of analysis.  A few years ago, when APIs were first being advocated as a component of web design, the comment that really struck me, was that the web itself is a form of API, and that by extension the Web Archive is subject to the same kind of ‘re-imagination’ and re-purposing that an API allows for a single site or source.  

As a result, you can – if you want – treat a web page as simple text – and apply all the tools of distant reading of text - that wonderful sense that millions of words can be consumed in a single gulp.   You can apply ‘topic modelling’, and Latent Semantic Analysis; or Term Frequency/Inverse Document Frequency measures.  Or, even more simply, you can count words, and look for outliers – stare hard at the word on the web!
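To make the simplest of these measures concrete, here is a minimal sketch of a term frequency/inverse document frequency calculation, using only the Python standard library.  The three 'pages' are invented for illustration, the function names are my own, and the weighting is one common textbook variant rather than anything the UK Web Archive itself provides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dictionary per document.

    Term frequency is the word's share of the document's tokens;
    inverse document frequency is log(N / number of documents containing it).
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Count how many documents each word appears in at least once.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty page
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Invented stand-ins for the text of three archived pages.
pages = [
    "the archive holds the web",
    "the web is a moving film",
    "the beggar spoke to the court",
]
scores = tf_idf(pages)
```

A word that appears on every page (here, 'the') scores zero, while a word confined to a single page floats to the top; the outliers are what you then stare at.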

But you can also go well beyond this.  In performance art, in geography and archaeology, in music and linguistics, new forms of reading are emerging with each passing year that seem to me to significantly challenge our sense of the ‘object of study’ – both traditional text and web page.  In part, this is simply a reflection of the fact that all our senses and measures are suddenly open to new forms of analysis and representation.  When everything is digital – when all forms of stuff come to us down a single pipeline -  everything can be read in a new way.  

Consider for a moment the ‘LIVE’ project from the Royal Veterinary College in London, and their ‘haptic simulator’.  In this instance they have developed a full scale ‘haptic’ representation of a cow in labour, facing a difficult birth, which allows students to physically engage and experience the process of manipulating a calf in situ.  I haven’t had a chance to try this, but I am told that it is a mind altering experience.  It suggests that reading can be different; and should include the haptic - the feel and heft of a thing in your hand.  This is being coded for millions of objects through 3D scanning; but we do not yet have an effective way of incorporating that 3D text into how we read the past.

 The same could be said of the aural - that weird world of sound on which we continually impose the order of language, music and meaning; but which is in fact a stream of sensations filtered through place and culture.  

Projects like the Virtual St Paul's Cross, which allows you to ‘hear’ John Donne’s sermons from the 1620s from different vantage points around the yard, change how we imagine them, and move from ‘text’ to something much more complex and powerful.  They begin to navigate that normally unbridgeable space between text and the material world.  And if you think about this in relation to music and speech online – you end up with something different on a massive scale.

One of my current projects is to create a soundscape of the courtroom at the Old Bailey - to re-create the aural experience of the defendant - what it felt like to speak to power, and what it felt like to have power spoken at you from the bench. And in turn, to use that knowledge to assess who was more effective in their dealings with the court, and whether having a bit of shirt to you, for instance, affected your experience of transportation or imprisonment.  And the point of the project is simply to add a few more variables to the ones we can securely derive from text.

It is an attempt to add just a couple of more columns to a spreadsheet of almost infinite categories of knowing.  And you could keep going – weather, sunlight, temperature, the presence of the smells and reeks of other bodies.  Ever more layers to the sense of place.  In part, this is what the gaming industries have been doing from the beginning, but it also becomes possible to turn that creativity on its head, and make it serve a different purpose.

In the work of people such as Ian Gregory, we can see the beginnings of new ways of reading both the landscape, and the textual leavings of the dead.  Bob Shoemaker, Matthew Davies and I (with a lot of other people) tried to do something similar with Old Bailey material, and the geography of London in the Locating London’s Past project.

This map is simply colours blue, red and yellow mapped against brown and green.  I have absolutely no idea what this mapping actually means, but it did force me to think differently about the feel and experience of the city.  And I want to be able to do the same for all the text captured in the UK domain name. 

All of which is to state the obvious.  There are lots of new readings that change how we connect with historical evidence – whether that is text, or something more interesting.  In creating new digital forms of inherited culture - the stuff of the dead - we naturally innovate, and naturally enough, discover ever changing readings.  But the Web Archive challenges us to do a lot more; and to begin to unpick what you might start pulling together from this near infinite archive.

In other words, the tools of text are there, and arguably moving in the right direction, but there are several more dimensions we can exploit when the object of study is itself an encoding.

Each web page, for instance, embodies a dozen different forms.  Text is obvious, but it is important to remember that each component of the text – each word and letter, on a web page - is itself a complex composite.  What happens when you divide text by font or font size; weight, colour, kerning, formatting etc.?  By location - in the header, or the body, or wherever the CSS sends it; or more subtly by where it appears to a user’s eye - in the middle of a line – or at the end?
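As a rough illustration of dividing text by location, the sketch below walks a page's HTML with Python's standard html.parser module and files each word under the element that immediately encloses it.  The sample markup and the class and function names are invented for the purpose; font, colour and position on screen live in the CSS and the rendered layout, which this sketch does not attempt to read.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TextByLocation(HTMLParser):
    """Collect the words of a page, grouped by their enclosing element."""

    # Elements that never have a closing tag, so they are not tracked.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open elements
        self.words = defaultdict(list)  # element name -> words found in it

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy markup.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return
        location = self.stack[-1]
        if location in ("script", "style"):
            return  # code and styling are not reading text
        self.words[location].extend(data.split())


def words_by_location(html):
    parser = TextByLocation()
    parser.feed(html)
    parser.close()
    return dict(parser.words)


# An invented page, standing in for an archived one.
sample = (
    "<html><head><title>Old Bailey</title></head>"
    "<body><h1>Trials</h1><p>Words spoken <b>to power</b></p></body></html>"
)
result = words_by_location(sample)
```

Each key in the result (title, h1, p, b) is in effect a new column: the same counting that works on a whole page can now be run separately on headings, body text and emphasised text.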

Suddenly, to all the forms of analysis we have associated with ‘distant reading’ there are five or six further columns in the spreadsheet – five or six new variables to investigate in that ‘big data’ open-eyed sort of way.

And that is just the text.  The page itself is both a single image, and a collection of them – each with their own properties.  And one of the great things that is coming out of image research is that we can begin to automate the process of analysing those screens as ‘images’.  Colour, layout, face recognition etc.  Each page is suddenly ten images in one – all available as a new variable; a new column in the spreadsheet of analysis.  And, of course, the same could be said of embedded audio and video.

And all of that is before we even look under the bonnet.  The code, the links, the metadata for each page – in part we can think of these as just another iteration of the text; but more imaginatively, we can think about them as more variables in the mix.

But, of course, that in itself misunderstands the web and the Web Archive.  The commonplace metaphor I have been using up till now is of a ‘page’ – and is the intellectual equivalent of skeuomorphism - relying on material world metaphors to understand the online.

But these aren’t pages at all, they are collections of code and data that generate an experience in real time.  They do not exist until they are used - if a website in the forest is never accessed, it does not exist.  The web archive therefore is not an archive of ‘objects’ in the traditional sense, but a snapshot from a moving film of possibilities.  At its most abstract, what the UK Web Archive has done is spirit into being the very object it seeks to capture – and of course, we all know that in doing so, the capturing itself changes the object.  Schrödinger's cat may be alive or dead, but its box is definitely open, and we have visited our observations upon its content.

So to add to all the layers of stuff that can fill your spreadsheet, there also needs to be columns for time and use; re-use and republication.  And all this is before we seek to change the metaphor and talk about networks of connections, instead of pages on a website.

Where I end up is seriously jealous of the possibilities; and seriously wondering what the ‘object of study’ might be.  In the nature of an archive, the UK Web Archive imagines itself as an ‘object of study’, created in the service of an imaginary scholar.  The question it raises is how do we turn something we really can’t understand, cannot really capture as an object of study, to serious purpose?  How do we think at one and the same time of the web as alive and dead, as code, text, and image – all in dynamic conversation one with the other?  And even if we can hold all that at once, what is it we are asking?

Monday, 27 April 2015

Voices of Authority: Towards a history from below in patchwork

This post is intended to very briefly describe a project I am about halfway through - that seeks to experiment with the new permeability that digital technologies seem to make possible - to create a more usable 'history from below', made up of lives knowable only through small fragments of information.

This particular project is called ‘Voices of Authority’, and is a small part of a larger AHRC funded project – The Digital Panopticon – that is seeking to digitise and link up the records reflecting everyone tried in London between around 1780 and 1875, and either sent to prison, or else transported to Australia.  This small element of the wider project is bringing together a series of different ways of knowing about a particular place, time and experience – the Old Bailey courtroom from around 1750-1850, and the experience of being tried for your life and for your liberty.  The conceit behind this project is really a suggestion that building something in three dimensions, with space, physical form and performance, along with new forms of analysis of text, can change how we understand the experience of the trial process, and allow a more fully empathetic engagement with defendants, along with a better understanding of how their experience impacted on the exercise of power and authority.

This project is only half completed – so this is very much a report of 'work in progress'.  But, in essence, what it seeks to do is bring together three distinct forms of ‘data’ and to re-organise that data around individual defendants.

First, it takes the text of the Old Bailey Proceedings – the trial accounts of some 197,745 trials held between 1674 and 1913, and recognises them as comprising two different and distinct things – a bureaucratic record of the trials themselves (names, verdicts, punishments); and at the same time, one of the largest corpora of recorded spoken language created prior to the twentieth century – some 40 million words of direct, recorded testimony for the period under analysis.

These understandings of the Proceedings are, of course, built on projects of much longer duration, including the Old Bailey Online, and more particularly, on Magnus Huber’s additional linguistic mark-up of the Proceedings, which allows ‘speech’ to be pulled from the trial text, and the speaker to be identified along the way.  This is available via the Old Bailey Corpus.

The project also builds on text and data mining methodologies – including direct counting of word and phrase distributions, and the application of a form of explicit semantic analysis, that allows us to look at the changing character of language used in witness statements over the course of the eighteenth and nineteenth centuries.
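The 'direct counting' half of this can be sketched in a few lines.  The example below is my own illustration rather than the project's code: it assumes witness statements arrive as (year, text) pairs, which is a hypothetical format, uses three invented statements, and reports how often a phrase occurs per thousand words in each decade.  It does not attempt the explicit semantic analysis.

```python
import re
from collections import defaultdict


def phrase_rate_by_decade(statements, phrase):
    """Occurrences of `phrase` per 1,000 words, grouped by decade.

    `statements` is an iterable of (year, text) pairs; this input
    format is an assumption made for the sake of the sketch.
    """
    target = phrase.lower().split()
    if not target:
        raise ValueError("phrase must contain at least one word")
    n = len(target)

    hits = defaultdict(int)    # decade -> number of times the phrase occurs
    totals = defaultdict(int)  # decade -> total words spoken

    for year, text in statements:
        tokens = re.findall(r"[a-z']+", text.lower())
        decade = (year // 10) * 10
        totals[decade] += len(tokens)
        # Slide a window of n words along the statement.
        hits[decade] += sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )

    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Invented statements, not quotations from the Proceedings.
invented = [
    (1781, "I am innocent my lord"),
    (1785, "I know nothing of it"),
    (1843, "I am innocent I never saw the coat"),
]
rates = phrase_rate_by_decade(invented, "I am innocent")
```

Normalising by the number of words spoken in each decade matters, because the volume of recorded testimony changes a great deal over the period; a raw count would mostly measure how much was written down.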

In other words, the first element of the project is the text and speech, crimes, punishments and dates provided by the Old Bailey Proceedings.

The second element is the body of the criminal - the physical body of the individual men and women involved.  The broader project is creating a dataset of some 66,000 men and women – with substantial and detailed information about their lives, both before and after transportation or imprisonment – reflecting the inter-relationship between the people who became defendants and criminals and the systems of a global empire. And this material provides a huge amount of data about bodies – to add to the words individuals spoke to power.  Height, weight, eye colour, tattoos, among a range of other aspects of a physical self.  Suddenly, we know if a collection of words was spoken by a ten stone, five foot two inch woman with brown hair and black eyes, and a withered left arm; or by a six foot man with an anchor tattoo on his left arm, and a scar above squinting blue eyes.

To think about it another way, this bit of the ‘project’, allows us to worry about the ragged boundary between the ‘physical’ as recorded in a set of numerical and standardised descriptions, and the ‘textual’ – the slippery and ambiguous content of each witness statement. 

In relation to ‘history from below’, this allows us to put together the lives of people like William Curtis, who as a 16 year old, in the summer of 1843, had a perfectly healthy tooth pulled, before stealing the dentist’s coat.  And Sarah Durrant, who was convicted for receiving two bank notes worth 2000 pounds. It all allows us to know their words (their textualities), and at the same time to see them as part of a different kind of truth – of place and body.

The aspiration is essentially to code for the variabilities of body type at scale, to add a further dimension to both the records of the bureaucracy of trials (charge, verdict, punishment), and the measurable content of the textualities of those same trials.

And finally, we are adding one additional dimension – space –  a ‘scene of trial’.

For this we are first of all building on a project called Locating London’s Past, which among other things, maps crime locations on to the historic landscape of London.  And to this we are adding a reconstruction of the courtroom, where all these trials took place. 

Simply using Sketch-up, we have made most progress on George Dance’s building, finished in the late 1770s and providing the main venue for the relevant trials for the next hundred years – basing the models on the architectural plans from which it was built.

In the process of creating this model, a huge amount of information about trial procedure has been revealed, including the changing layout of the court, and the relative position of the different speakers.  The design itself reflects a hitherto unacknowledged transition in the character of how witnesses and defendants were divided in this evolving space, evidencing a new story of the evolution of the criminal trial.

The architecture itself suggests that there was a clear transition from a situation in which witnesses and victims stood in a similar relation to the judge and jury (both facing the judge, relatively close to one another); to one - like a modern Anglo-American courtroom - where the judge and witnesses are on one side, and the defendant on the opposite side of the courtroom.  In other words, the character of the adversarial relationship at the core of the adversarial trial was re-defined, with the witnesses and victim re-located on either side of the argument, and the judge’s role redefined as arbiter between them.  At some level, in the process, community resolution was replaced by court judgement.

If you want to explain why conviction rates at the Old Bailey rose from under 50% in the mid eighteenth century to over 80% at the end of the nineteenth century, starting from precisely the moment when the courtroom was rebuilt, this ‘fact on the ground’ needs to be part of the story.

What has also been revealed is the importance of levels – with lawyers speaking upwards to the judge, jury, witnesses and defendant, from a cock-pit several feet below their eye level.  Like a theatre audience, the judge, jury and defendant looked down on the stage below.  In other words, what was created, at least for a short time (70 years), was the real feel of a ‘theatre’ in which, as a barrister, you were forced to perform to the gods.  


Looking forward to the next stage, this particular sub-project is seeking to move from the ‘art’ of making and performance, through a humanist and historical appreciation of ‘experience’, towards the tools of social science and informatics – seeking to combine the close reading of a single desperate plea, with the empathy that can only come with physical knowledge, with that macroscopic image of all the similar words spoken over a hundred years – how that one plea fits in a universe of words and bodies.  And all of this is, in turn, being undertaken in pursuit of a more nuanced and empathetic engagement with the lives of working people - both for its own sake, and as part of a new analysis of the workings of power 'from below'.

With luck, this will allow us to move beyond a simple analysis of the courtroom, and the ‘adversarial trial’ - to an analysis through which we can see the whole system from the defendants’ perspective.

In other words, the next step is about creating a history of the British criminal justice system, and of transportation from an experiential perspective on a large scale – contributing to a history of common human experience, evidenced from the distributed leavings of the dead, analysed with all the approaches available to hand, from all the perspectives available.