Friday, 23 July 2010
I have recently been involved in a couple of projects that allowed me to engage with the eighteenth century in a very different way than I am used to: helping - in a non-musical capacity! - to create recordings of a couple of eighteenth-century ballads.
The first is a ballad, the only copy of which (at least as far as I know) is in the British Library, titled The Workhouse Cruelty. I first came across this piece in the early 1980s, and haven't really done anything with it since. But when I was asked to provide a ballad that would help illustrate a case about a murder in a workhouse for Voices from the Old Bailey on Radio 4, it immediately sprang to mind. The nice thing is that the producer, Elizabeth Burke, then went out and had a recording of it made. The result is here:
This particular recording seems a little sweet to me, and lacking in the political grit of the original rough printed version. I also suspect that it was originally sung by a man - of the sort known as a 'chanting' ballad singer (they normally specialised in dirge-like religious songs). But it nevertheless made me want to think harder about how one interprets the words, and how one reconstructs the soundscape in which it was performed.
This then encouraged me to have a go at commissioning a recording myself. Francis Place's papers contain the words of a dozen or so, primarily bawdy, ballads he recalled from his youth in the 1780s. The really nice thing about Place's notes is that he described where he heard them sung - behind St Clements church, etc. - by whom - two young women - and when - in the evening. The ballad I was particularly interested in was Jack Chance, which Place describes as being sung just after the Gordon Riots. As I was giving a lecture on the Riots, it seemed a natural accompaniment. I was also familiar with a printed version of this particular song, mis-dated to 1795 and retitled Just the Thing, among the digital collection at ECCO. I integrated the two versions to eliminate some of the blanks in Place's version, and asked a friend of my son's, Henry Skewes, to write the music - unlike most eighteenth-century ballads, no tune was mentioned as being used with this one. Henry wrote the music, and asked another friend, Stephanie Waldheim, to sing it. The upshot is:
Jack Chance: Or Just the Thing
Music and arrangement by Henry Skewes;
performed by Stephanie Waldheim and Henry Skewes
copyright: 2010, Henry Skewes, Stephanie Waldheim and Tim Hitchcock.
This version has been converted from its original AAC audio format to MP3, and has developed a few oddities in the process, but you get the point.
One way or another, this experience has taken me back to Bruce Smith's wonderful, but seldom cited, The Acoustic World of Early Modern England: Attending to the O-Factor. It has also reminded me that recreating a neighbourhood soundscape would take a great deal of work, but that it would be worth doing; and that perhaps it is time to have a go.
Friday, 2 July 2010
Can we make a thesaurus of meanings for digital humanities?
In an idle moment I have been re-reading the introduction to Roget's Thesaurus, and have been struck by what a thesaurus actually does. It sorts all words into one of eight categories and assigns a number to each category. "Affections", for instance, is class eight. It then subdivides these categories into further sub-categories, with "cheerfulness" falling under "Pleasure and Pleasurableness" and assigned the number 868 - after discontent (867) and before solemnity (869). If you then look up 868 - cheerfulness - it is divided again into 20 further subcategories, with around 15 words in each. So "gladden" falls under 868.6, and as it is the second word in that category, could be expressed as 868.6.2. The second half of the thesaurus simply lists all these words in alphabetical order to allow the user access to the hierarchy of meaning in the first half.
In other words, a thesaurus assigns to every word it contains a numerical value that equates to one or more categories of meaning.
If you took a large body of text - ten years of the Times, or everything published in 1840 - and broke it up into a set of words, and assigned each word a number from the thesaurus's hierarchy, you would end up with a unique numerical representation of the collected meaning of the words that make up the text.
If you take the sentence 'the sunlight warmed his forehead' and convert it into its thesaurus equivalents, you end up with 334.10.7; 328.17.3; 239.5.
This can then be equated with "The midday sun cooked his brow", which is 334.10.11; 328.17.20; 239.5. The codes differ only in their final position: at the level of category and sub-category, the two sentences are identical.
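To make the mechanics concrete, here is a minimal sketch in Python, reusing the codes from the example above. The LEXICON dictionary and the helper functions are invented for illustration, not drawn from any machine-readable edition of Roget:

```python
# A toy illustration of thesaurus-number encoding, using the codes from
# the example above. A real implementation would build the lexicon from
# a machine-readable edition of Roget.

LEXICON = {
    "sunlight": "334.10.7",
    "sun":      "334.10.11",
    "warmed":   "328.17.3",
    "cooked":   "328.17.20",
    "forehead": "239.5",
    "brow":     "239.5",
}

def encode(sentence):
    """Return the thesaurus codes for the 'major' words of a sentence,
    silently dropping anything not in the lexicon (articles, pronouns)."""
    return [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]

def signature(codes, depth=2):
    """Truncate each code to `depth` levels, so that near-synonyms such as
    'warmed' (328.17.3) and 'cooked' (328.17.20) collapse together."""
    return [".".join(c.split(".")[:depth]) for c in codes]

a = encode("the sunlight warmed his forehead")
b = encode("the midday sun cooked his brow")
print(a)                             # ['334.10.7', '328.17.3', '239.5']
print(b)                             # ['334.10.11', '328.17.20', '239.5']
print(signature(a) == signature(b))  # True: identical at sub-category level
```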
This would allow a kind of semantic search that was not dependent on direct context. It would not be universal, but it would work perfectly well for historical text in English - all the better, since Roget was a late Enlightenment figure and his categories map well onto historical text. There would remain an issue of disambiguation (distinguishing between "clip" as a noun and "clip" as a verb); but this could be mitigated either through mathematical approximations (you could create a third unique number that essentially averaged the two or more meanings assigned to a single word), or by simply living with the errors generated, on the assumption that historians are used to filtering their own reading. You could also apply the chronological data contained in the Oxford Historical Thesaurus to map how close (or distant) an individual text is to standard usage for a particular period; or how different genres relate to a standard, evolving language (how literature versus legal treatises map onto accepted usage in the decade they were written).
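On the disambiguation point, one alternative to averaging is to carry every candidate code for an ambiguous word and let any comparison take the most charitable reading. A sketch, with invented codes (the numbers for 'clip', 'crop' and 'pin' below are not from any real thesaurus):

```python
# Ambiguous words carry all their candidate codes. The numbers here are
# invented purely for illustration.
CODES = {
    "clip": ["45.9.2", "270.6.4"],  # the noun (a fastening) vs. the verb (to cut)
    "crop": ["270.6.9"],            # to cut short
    "pin":  ["45.9.7"],             # a fastening
}

def shared_depth(c1, c2):
    """Count how many leading levels two codes have in common."""
    depth = 0
    for x, y in zip(c1.split("."), c2.split(".")):
        if x != y:
            break
        depth += 1
    return depth

def best_match(word_a, word_b):
    """Score two words by the deepest shared prefix across any pairing of
    their candidate codes -- i.e. take the most charitable reading."""
    return max(
        (shared_depth(c1, c2) for c1 in CODES.get(word_a, [])
                              for c2 in CODES.get(word_b, [])),
        default=0,
    )

print(best_match("clip", "crop"))  # 2: they share class and sub-category 270.6
print(best_match("clip", "pin"))   # 2: via the other sense, 45.9
```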
As we are confronted by massive text objects (I think the linguists' notion of a corpus is less useful for historians, who are seeking to find information rather than to define bodies of text), the ability to locate related or similar text across genres and texts is important. It would also be another way of approaching the measurement of "distance" between texts.
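One way of putting a number on that "distance" - sketched below, assuming each text has already been encoded as a list of thesaurus codes - is to reduce every text to a frequency profile over the top-level classes and compare the profiles. Cosine distance is used here, but any distributional measure would serve, and the codes are again invented:

```python
import math
from collections import Counter

def class_profile(codes):
    """Reduce an encoded text to a frequency distribution over its
    top-level thesaurus classes."""
    counts = Counter(code.split(".")[0] for code in codes)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def cosine_distance(p, q):
    """1 minus the cosine similarity of two class profiles (0 = identical
    emphasis, 1 = no classes in common)."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return 1 - dot / norm

# Two tiny 'texts', already encoded; the codes are invented.
legal   = ["963.2.1", "963.5.4", "965.1.2", "239.5"]
lyrical = ["868.6.2", "334.10.7", "868.4.1", "239.5"]
print(round(cosine_distance(class_profile(legal),
                            class_profile(lyrical)), 3))  # 0.833
```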
Alternatively - and this is closer to Roget's original scheme - you could use this numerical labelling as a basic form of computer-aided reading. You could, for instance, assign a colour to each broad category, and a shade of that colour to each sub-category, allowing you to identify the work that different parts of the text are doing through a simple visual examination. When skimming through a large body of text, at perhaps 40 pages on a single screen, the colour coding would allow you to identify areas in which "affections" are directly discussed, or any of the other thesaurus categories - "Space", or "Physics", or "Matter".
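A sketch of that colour coding, emitting HTML with one colour per top-level class. Both the palette and the 'first digit = class' shortcut are assumptions made for the example; a real implementation would use the thesaurus's own class boundaries:

```python
import html

# An invented palette: one colour per top-level class.
CLASS_COLOURS = {
    "8": "#c0392b",  # Affections -> red
    "3": "#27ae60",  # Physics    -> green
    "2": "#2980b9",  # Space      -> blue
}

def colour_for(code):
    """Crude shortcut for the sketch: take the first digit of the category
    number as the class. Real class boundaries in Roget are not
    decimal-aligned, so a proper lookup table would be needed instead."""
    return CLASS_COLOURS.get(code.split(".")[0][0], "#999999")

def render(tagged_words):
    """Turn (word, code) pairs into colour-coded HTML spans; words without
    a code (articles, pronouns) pass through unstyled."""
    spans = []
    for word, code in tagged_words:
        word = html.escape(word)
        if code is None:
            spans.append(word)
        else:
            spans.append(f'<span style="color:{colour_for(code)}">{word}</span>')
    return " ".join(spans)

text = [("the", None), ("sunlight", "334.10.7"),
        ("gladdened", "868.6.2"), ("his", None), ("brow", "239.5")]
print(render(text))
```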
Have I missed something? Does anyone know why we don't use a thesaurus-based numerical hierarchy to code meaning in large texts? It would give you "word" = "set of numbers" (a unique number for each major word) in a paragraph or sentence or text division, which could then be compared statistically, or colour-coded to reflect the breadth of meaning found. Or you could colour-code for types of words and locate relevant sections of a large text by a particular colour. It just seems dead obvious as a way of moving towards the ideal at the heart of the semantic web, while avoiding the creation of 'triples' and the ever-retreating promise of universality. My guess is either that librarians have been doing this for the last thirty years and not telling me (librarians are cruel that way), or that the rest of the world forgot to read the introduction to Roget's Thesaurus, which is also possible. Of course, the final possibility is that I don't understand the semantic web, and that ontologies in aggregate are already a form of thesaurus.