Thursday 29 March 2007

Towards a Folksonomic Solution to Nominal Record Linkage in Distributed Historical Resources

Large bodies of historical evidence have been posted on the web in the last ten years, and a wide variety of new resources will be posted in years to come. They do not share a common structure, nor do they conform to an agreed set of technical standards. These resources include the electronic catalogues of libraries and archives (PROCAT, A2A), comprehensive bibliographies (RHS), full-text versions of printed sources (The Old Bailey Online, Times Online, EEBO and ECCO), biographical dictionaries (New DNB) and maps and images (Motco, Collage). In the next ten years these sources will be joined by large bodies of transcribed manuscript (Plebeian Lives, People in Place). The difficulty of using them arises from their distributed, piecemeal and disconnected nature.

They currently need to be searched independently, normally using a form of keyword searching. The result is a very traditional approach to historical research in which the historian builds up a picture of an event or subject through iterative searches in different documents. Some attempts have been made to circumvent or add to this process through automated nominal record linkage, in which individuals mentioned in different sources are confirmed as the same individual (Westminster Historical Database); but effective, consistent and automated nominal record linkage has proved extremely difficult to implement. It has been made to work with a small number of highly structured sources such as censuses, registers and tax records, but is not suitable for dealing with more variable and qualitative records. At the same time, there are large numbers of family historians whose sole purpose is to locate disparate sources about single individuals. This project is designed to harness the hard and expert work of family historians to the problem of creating and judging links between disparate online resources.

The real experts at nominal record linkage are family historians. While academic writers generally structure their work around themes and questions, family historians work to bring together material about individuals and narrow groups defined by birth and marriage. There have been substantial attempts to harness the expertise of family historians to the task of generating large-scale collections of information about individuals for use in academic history. But these have been predicated on family historians volunteering to work for a project directed and intellectually shaped by an academic (see for instance the Victorian Panel Survey Pilot Project). By creating a package that allows family historians to share information about individuals among themselves, using a social software environment model, large volumes of information about individuals could be generated as a by-product of family historians using a package that, in any case, satisfies a number of their needs. The greatest added value for family historians will come in three forms. First, new knowledge about family trees and individuals will be created as different family historians work backwards towards common individual ancestors (a whole new family ‘branch’ will be created every time two historians find themselves listing a single individual). Second, models of successful research will be promulgated. Knowing, for instance, the life histories of a range of people with similar profiles to that of an individual you are researching will help direct research down more effective paths. And third, a profound, self-generated contextualisation will be created through the process of generating links to disparate sources. As more qualitative sources are posted, this contextualisation will become ever more textured. It is all the more powerful for being self-generated, rather than a product packaged in the generic conventions of academic history writing.

In an environment similar to packages such as del.icio.us, family historians would be able to collect relevant bookmarks related to individuals and to organise these into a family tree, or similar intuitive structure. They would also be able to annotate bookmarks to include information from non-digital sources and to represent family relationships of the sort they are most concerned to establish. A template for authoring individual biographies or family histories would also be provided. In the process of using this package, family historians would create what amounts to a discrete information entity which could be given a unique identity. Because family historians search across the full range of online resources, their collections of bookmarks overcome the difficulties of searching across distributed and technically variable resources. The package would need to be free, open source, and carefully crafted to meet the needs of the widest community of family historians.
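To make the idea of a "discrete information entity" a little more concrete, here is a minimal sketch in Python of how such a record might be shaped. It is only an illustration under stated assumptions: the class names (Bookmark, Identity), the field names, the example people and the example.org URL are all hypothetical, and nothing here describes an existing package.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Bookmark:
    """One link to an online source, with optional tags and an annotation."""
    url: str
    tags: list[str] = field(default_factory=list)
    note: str = ""  # e.g. information taken from a non-digital source


@dataclass
class Identity:
    """A historical individual, defined by the bookmarks collected about them."""
    name: str
    bookmarks: list[Bookmark] = field(default_factory=list)
    # relationship label (hypothetical, e.g. "spouse") -> ids of related identities
    relations: dict[str, list[str]] = field(default_factory=dict)
    # unique identity for the whole bundle of bookmarks and annotations
    id: str = field(default_factory=lambda: uuid4().hex)

    def relate(self, kind: str, other: "Identity") -> None:
        """Record a family relationship to another identity."""
        self.relations.setdefault(kind, []).append(other.id)


# Hypothetical usage: two invented individuals and one invented bookmark.
mary = Identity("Mary Smith")
john = Identity("John Smith")
mary.bookmarks.append(
    Bookmark("https://example.org/trial/1", ["trial"], "named as a witness")
)
mary.relate("spouse", john)
```

The point of the sketch is simply that the bookmarks, annotations and family links travel together under one identifier, which is what would allow the bundle to be treated as a single unit later on.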

The bookmarks themselves could be displayed much as blog entries are displayed in del.icio.us. Tag clouds and other forms of social software representation would add to the usability of the package. The important thing would be that these collections of bookmarks (and the text referred to), along with their associated annotations and connections to family members, could be abstracted as ordered information with a unique individual identity.

In this instance a folksonomic (click here for a good overview of the strengths and weaknesses of folksonomy) and social software approach would work in much the same way as it currently does in relation to blog and wiki sites.

It should be possible, however, to go one step further. In recent years, in order to facilitate the workings of online communities and to represent their activities, a number of search, association and graphical strategies for the analysis of communities and networks have emerged, and these could be harnessed to the problem of understanding the information generated by family historians.

Once a collection of bookmarks, etc., has been identified as an individual, that body of texts could itself be turned into a single individual within an online community. In other words, the unique identity would become an avatar within an artificially created online community. The advantage of this approach is that it allows the tools and strategies developed for linking and searching the texts generated by online communities, and those created to analyse the activities of such communities, to be used. For instance, words shared in the bookmarked texts could be used to associate groups of individual identities or avatars. Or indeed the bookmarks themselves could be used in this fashion. So, if you wanted to develop a detailed profile of the users of St Thomas’ Hospital, you would simply abstract all the identities which bookmark St Thomas’ Hospital (the records of which are being digitised by the Plebeian Lives project) as a part of their avatar. Depending on the nature of the bookmark (for example, one pointing to a specific page with metadata), this could be fine-tuned for date, or even a single ward, or individuals treated by a single doctor within the hospital. For an example of this kind of strategy working in practice see packages such as RefViz, which essentially cross-references all text associated with a single record against all other occurrences of the same text and then relates them to a single entity, a book or article in this instance.
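The hospital example can be sketched in a few lines of Python. This is a toy illustration, not a description of any existing system: the function name identities_bookmarking, the identity ids, the example.org URLs and the metadata keys ("year", "ward") are all assumptions invented for the purpose.

```python
# Hypothetical data: identity id -> list of (bookmarked URL, page metadata).
avatars = {
    "id-001": [
        ("https://example.org/st-thomas/admission/12", {"year": 1780, "ward": "A"}),
        ("https://example.org/trials/77", {"year": 1782}),
    ],
    "id-002": [
        ("https://example.org/st-thomas/admission/40", {"year": 1791, "ward": "B"}),
    ],
    "id-003": [
        ("https://example.org/tax/5", {"year": 1780}),
    ],
}


def identities_bookmarking(avatars, url_prefix, **filters):
    """Return the ids of identities holding a bookmark under url_prefix
    whose metadata matches every keyword filter given."""
    hits = set()
    for identity_id, bookmarks in avatars.items():
        for url, meta in bookmarks:
            matches = all(meta.get(key) == value for key, value in filters.items())
            if url.startswith(url_prefix) and matches:
                hits.add(identity_id)
    return sorted(hits)


# Everyone who bookmarks the (hypothetical) hospital records...
hospital_users = identities_bookmarking(avatars, "https://example.org/st-thomas/")
# ...and the same query narrowed to a single ward.
ward_b_users = identities_bookmarking(
    avatars, "https://example.org/st-thomas/", ward="B"
)
```

How far the query can be narrowed depends entirely on how much metadata the bookmarked page carries; a bookmark to a site's front page would support the first query but not the second.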

A further refinement is in relation to the analysis of the communities created in the process. The relationships between individuals and the sets of groups evidenced by large numbers of historical avatars quickly outpace the ability of flat tables to represent their meaning. In projects such as Vizster, however, a range of tools has been developed that could help. In environments of this sort, each node is an individual sorted into communities according to specific characteristics. Each node is also linked to an individual user (or in the case of historical avatars, the individual identity).

This kind of approach has the advantage of giving the user transparent access to the groups of text which make up an individual identity. By clicking on a single individual, or the cloud that makes up the community, the user could be taken directly to either the profile of the individual, and hence to the text, or directly to the text itself in the way that occurs in contexts such as del.icio.us. In this example, blog entries are organised thematically according to the frequency of texts and phrases, but each entry could as easily be a historical document.

The great advantage of this approach would be that it would build ever-growing levels of cross-reference into the resource itself, which could in turn validate the quality of the information being generated. If large numbers of individual avatars are being linked to a single name, for instance, this would reflect a high level of uncertainty, or the possibility that two or more historical avatars are in fact the same individual. Within this context, different sources could also be rated and weighted for accuracy, quality and usefulness, allowing analyses to be more subtly formulated: for example, a bookmark to a New DNB article could be given greater weight than one to a Wikipedia item, or a tax record more weight than a reference in a trial. The world will not agree on technical standards or the best way of searching historical documents. This approach would co-ordinate disparate forms of information, while leaving the originators of historical websites a degree of authority over the content they create.
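Both checks described above, weighting sources and flagging names that attract many avatars, are simple enough to sketch in Python. Everything specific here is an assumption made for illustration: the numeric weights are invented (only their ordering, New DNB above Wikipedia and tax record above trial reference, comes from the argument above), and the function names, the default weight and the threshold are hypothetical.

```python
from collections import defaultdict

# Assumed, illustrative weights; only the relative ordering reflects the text.
SOURCE_WEIGHTS = {
    "new_dnb": 1.0,
    "tax_record": 0.8,
    "trial": 0.5,
    "wikipedia": 0.3,
}
DEFAULT_WEIGHT = 0.1  # assumed weight for source types not listed above


def evidence_score(source_types):
    """Sum the weights of the source types bookmarked for one avatar."""
    return sum(SOURCE_WEIGHTS.get(s, DEFAULT_WEIGHT) for s in source_types)


def ambiguous_names(avatars, threshold=2):
    """Given (avatar_id, name) pairs, return the names shared by at least
    `threshold` avatars: candidates for being the same individual."""
    by_name = defaultdict(set)
    for avatar_id, name in avatars:
        by_name[name.strip().lower()].add(avatar_id)
    return {
        name: sorted(ids) for name, ids in by_name.items() if len(ids) >= threshold
    }


# Hypothetical usage with invented avatars.
flagged = ambiguous_names(
    [("a1", "John Smith"), ("a2", "john smith "), ("a3", "Mary Jones")]
)
```

Matching on a lower-cased name is deliberately crude; it only marks where a human judgement is needed, which is exactly the judgement family historians are best placed to make.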