Friday, 2 July 2010

Can we make a thesaurus of meanings for digital humanities?

In an idle moment I have been re-reading the introduction to Roget's Thesaurus, and have been struck by what a thesaurus actually does. It breaks up all words into one of eight categories and assigns a number to each category. "Affections", for instance, is class eight. It then subdivides these categories into further sub-categories, with "cheerfulness" falling under "Pleasure and Pleasurableness" and being assigned the number 868 - after discontent (867) and before solemnity (869). If you then look up 868 - cheerfulness, it is divided again into twenty further subcategories, with around fifteen words in each. So "gladden" falls under 868.6, and as it is the second word in that category, it could be expressed as 868.6.2. The second half of the thesaurus simply lists all these words in alphabetical order, giving the user a way into the hierarchy of meaning in the first half.

In other words, a thesaurus assigns a numerical value that equates to one or more categories of meaning for every word it contains.

If you took a large body of text - ten years of the Times, or everything published in 1840 - and broke it up into a set of words, and assigned each word a number from the thesaurus's hierarchy, you would end up with a unique numerical representation of the collected meaning of the words that make up the text.
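The encoding step can be sketched in a few lines. Of the codes below, only "gladden" (868.6.2) comes from the thesaurus entry discussed above; the others are invented placeholders standing in for a full word-to-code lookup:

```python
# A minimal sketch of encoding a text as thesaurus codes.
# Only "gladden" = 868.6.2 is from the post; the rest are placeholders.
thesaurus = {
    "gladden": "868.6.2",
    "discontent": "867.1.1",
    "solemnity": "869.1.1",
}

def encode(text):
    """Map each word of a text to its thesaurus code, dropping unknown words."""
    return [thesaurus[w] for w in text.lower().split() if w in thesaurus]

print(encode("Nothing could gladden his solemnity"))
# ['868.6.2', '869.1.1']
```

A real implementation would need the whole of Roget's hierarchy behind the dictionary, but the principle - text in, sequence of numbers out - is just this.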

If you take the sentence 'the sunlight warmed his forehead' and convert it into its thesaurus equivalents, you end up with 334.10.7; 328.17.3; 239.5

It can be equated with: "The midday sun cooked his brow", which is 334.10.11; 328.17.20; 239.5
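The equation between the two sentences can be sketched by truncating each code back to its category level and comparing. The codes are the ones given above; the choice of two levels as the truncation depth is an assumption:

```python
def category(code, depth=2):
    """Truncate a dotted thesaurus code to its first `depth` levels."""
    return ".".join(code.split(".")[:depth])

# Codes for the two sentences, as given above.
warmed = ["334.10.7", "328.17.3", "239.5"]    # "the sunlight warmed his forehead"
cooked = ["334.10.11", "328.17.20", "239.5"]  # "the midday sun cooked his brow"

# The sentences differ word by word, but agree category by category.
print([category(a) == category(b) for a, b in zip(warmed, cooked)])
# [True, True, True]
```

This is the sense in which the search is "semantic": two texts match not because they share words, but because their codes fall in the same branches of the hierarchy.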

This would allow a kind of semantic search that was not dependent on direct context. It would not be universal, but it would work perfectly well for historical text in English (and is all the better suited to historical material, as Roget was a late Enlightenment figure whose categories map well onto historical text). There would remain an issue of disambiguation (making a distinction between "clip" as a noun and "clip" as a verb); but this could be mitigated either through mathematical approximations (you could create a third unique number that essentially averaged the two or more meanings assigned to any single word), or you could simply live with the errors generated, on the assumption that historians are used to filtering their own reading. You could also apply the chronological data contained in the Oxford Historical Thesaurus to map how close (or distant) an individual text is to standard usage for a particular period; or how different genres relate to a standard evolving language (how literature vs law treatises map onto accepted usage in the decade they were written).
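One simple treatment of the disambiguation problem, short of averaging, is just to keep every candidate code for an ambiguous word and let a later pass (or the reader) sort them out. The two codes for "clip" below are invented for illustration:

```python
# Sketch: retain all senses of an ambiguous word rather than collapsing them.
# Both codes for "clip" are invented placeholders, not real Roget numbers.
senses = {
    "clip": ["260.4.1", "42.9.3"],  # e.g. fastener (noun) vs. to cut (verb)
}

def encode_ambiguous(word):
    """Return every candidate code for a word; empty list if unknown."""
    return senses.get(word, [])

print(encode_ambiguous("clip"))
# ['260.4.1', '42.9.3']
```

This matches the "live with the errors" option above: the ambiguity is preserved in the data rather than hidden by an averaged number.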

As we are confronted by massive text objects (I think the linguistic notion of a corpus is less useful for historians, who are seeking to find information rather than define bodies of text), the ability to locate related or similar text across genres and texts is important. It would also be another way of approaching the measurement of "distance" between texts.

Alternatively, and this is closer to Roget's original scheme, you could use this numerical labelling as a basic form of computer-aided reading. You could, for instance, assign a colour to each broad category, and a shade of that colour to each sub-category, allowing you to identify the work that different parts of the text are doing through a simple visual examination. When skimming through a large body of text, at perhaps 40 pages on a single screen, the colour coding would allow you to identify areas in which "affections" are directly discussed, or any of the other thesaurus categories - "Space", "Physics" or "Matter".
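A sketch of the colour-coding idea: map the leading number of each code to the colour of its class. The numeric ranges and colour choices below are illustrative assumptions, not the actual spans of Roget's classes:

```python
# Illustrative class ranges and colours; the real class boundaries in
# Roget's Thesaurus would need to be read off from the edition used.
CLASS_COLOURS = {
    range(1, 180): "grey",     # e.g. Abstract Relations
    range(180, 320): "blue",   # e.g. Space
    range(320, 450): "green",  # e.g. Matter
    range(820, 1000): "red",   # e.g. Affections
}

def colour_for(code):
    """Colour a thesaurus code by the class its leading number falls in."""
    head = int(code.split(".")[0])
    for span, colour in CLASS_COLOURS.items():
        if head in span:
            return colour
    return "black"

print(colour_for("868.6.2"))  # red  (868 falls in the Affections range here)
print(colour_for("239.5"))    # blue (239 falls in the Space range here)
```

Rendering these colours behind the words of a 40-page view would then turn skimming into looking for patches of red, blue or green.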

Have I missed something - does anyone know why we don't use a thesaurus-based numerical hierarchy to code meaning in large texts? It would give you "word" = "a set of numbers" (a unique number for each major word) in a paragraph or sentence or text division, which could then be compared statistically, or colour coded to reflect the breadth of meaning found. Or you could colour code for types of words and locate relevant sections in a large text in a particular colour. It just seems dead obvious as a way of moving towards the ideal at the heart of the semantic web, while avoiding the creation of 'triples' and the ever-retreating promise of universality. My guess is either that librarians have been doing this for the last thirty years and not telling me (librarians are cruel that way), or that the rest of the world forgot to read the introduction to Roget's Thesaurus, which is also possible. Of course, the final possibility is that I don't understand the semantic web, and that ontologies in aggregate are already a form of thesaurus.
