Friday 2 July 2010

Can we make a thesaurus of meanings for digital humanities?

In an idle moment I have been re-reading the introduction to Roget's Thesaurus, and have been struck by what a thesaurus actually does. It sorts every word into one of eight categories and assigns a number to each category. "Affections", for instance, is class eight. It then subdivides these categories into sub-categories, with "cheerfulness" falling under "Pleasure and Pleasurableness" and being assigned the number 868 - after discontent (867) and before solemnity (869). If you then look up 868 - cheerfulness, it is divided again into 20 further subcategories, with around 15 words in each. So "gladden" falls under 868.6 and, as it is the second word in that subcategory, could be expressed as 868.6.2. The second half of the thesaurus simply lists all these words in alphabetical order, giving the user a way into the hierarchy of meaning in the first half.

In other words, a thesaurus assigns every word it contains a numerical value that corresponds to one or more categories of meaning.

If you took a large body of text - ten years of the Times, or everything published in 1840 - and broke it up into a set of words, and assigned each word a number from the thesaurus's hierarchy, you would end up with a unique numerical representation of the collected meaning of the words that make up the text.

If you take the sentence 'the sunlight warmed his forehead' and convert it into its thesaurus equivalents, you end up with 334.10.7; 328.17.3; 239.5

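It can be equated with: "The midday sun cooked his brow", which is 334.10.11; 328.17.20; 239.5. The two strings of numbers differ only in their final, word-level digits; the categories and subcategories are identical.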
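A minimal sketch of that comparison in Python, for anyone who wants to try it. The codes are the ones quoted above, but the toy lexicon is an assumption: a real version would need the full thesaurus index, and I have guessed that "sun" is the word carrying 334.10.11. The function names are mine and purely illustrative.

```python
# Toy lexicon. The codes come from the example above; which word carries
# which code is an assumption made for illustration.
LEXICON = {
    "sunlight": "334.10.7",
    "warmed": "328.17.3",
    "forehead": "239.5",
    "sun": "334.10.11",
    "cooked": "328.17.20",
    "brow": "239.5",
}


def encode(sentence):
    """Return the thesaurus codes for every word found in the lexicon."""
    words = (w.strip(".,;:!?\"'") for w in sentence.lower().split())
    return [LEXICON[w] for w in words if w in LEXICON]


def truncate(code, depth):
    """Keep only the first `depth` levels of a code, e.g. 334.10.7 -> 334.10."""
    return ".".join(code.split(".")[:depth])


def same_meaning(a, b, depth=2):
    """True if two sentences match once codes are cut to `depth` levels."""
    return ([truncate(c, depth) for c in encode(a)]
            == [truncate(c, depth) for c in encode(b)])


first = "the sunlight warmed his forehead"
second = "The midday sun cooked his brow"
print(encode(first))
print(encode(second))
print(same_meaning(first, second))
```

Cutting the codes at two levels treats the sentences as equivalent; keeping all three levels tells them apart again, so the depth acts as a dial for how loose the search is.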

This would allow a kind of semantic search that was not dependent on direct context. It would not be universal, but it would work perfectly well for historical text in English (and is all the better suited to historical material because Roget was a late Enlightenment figure, and his categories map well on to historical text). There would remain an issue of disambiguation (making a distinction between "clip" as a noun and "clip" as a verb); but this could be mitigated through mathematical approximations (you could create a third unique number that essentially averaged the two or more meanings assigned to any single word), or you could simply live with the errors generated, on the assumption that historians are used to filtering their own reading. You could also apply the chronological data contained in the Oxford Historical Thesaurus to map how close (or distant) an individual text is to standard usage for a particular period; or how different genres relate to a standard evolving language (how literature and law treatises each map onto accepted usage in the decade they were written).

As we are confronted by massive text objects (I think the linguists' notion of a corpus is less useful for historians, who are seeking to find information rather than to define bodies of text), the ability to locate related or similar text across genres and texts is important. It would also be another way of approaching the measurement of "distance" between texts.
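One way of putting a number on that "distance", offered as a sketch rather than a settled method: count how often each thesaurus category turns up in each text, and take the cosine distance between the two sets of counts. Cosine distance is my choice here, not something Roget's scheme dictates, and the function names are hypothetical. The input is simply a list of codes such as 334.10.7.

```python
import math
from collections import Counter


def category_profile(codes, depth=1):
    """Count codes after cutting each one to its first `depth` levels."""
    return Counter(".".join(code.split(".")[:depth]) for code in codes)


def cosine_distance(p, q):
    """0.0 for identical profiles, 1.0 for profiles sharing no category."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty text is treated as maximally distant
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_p * norm_q))


text_a = ["334.10.7", "328.17.3", "239.5"]
text_b = ["334.10.11", "328.17.20", "239.5"]

# Compared by top-level number only, the two texts are indistinguishable.
print(cosine_distance(category_profile(text_a), category_profile(text_b)))
# Compared by full code, only the shared 239.5 keeps them close.
print(cosine_distance(category_profile(text_a, 3), category_profile(text_b, 3)))
```

On real texts the profiles would hold thousands of counts rather than three, but the arithmetic is the same.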

Alternatively, and this is closer to Roget's original scheme, you could use this numerical labelling as a basic form of computer-aided reading. You could, for instance, assign a colour to each broad category, and a shade of that colour to each sub-category, allowing you to identify the work that different parts of the text are doing through a simple visual examination. When skimming through a large body of text, at perhaps 40 pages on a single screen, the colour coding would allow you to identify areas in which "affections" are directly discussed, or any of the other thesaurus categories - "Space", or "Physics" or "Matter".
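The colour scheme could be as simple as the following sketch, which gives each of the eight broad categories its own hue and varies the lightness for each sub-category. The function name, the lightness range, and the idea of passing in a category number and sub-category position are all assumptions of mine; a working version would also need a table mapping each thesaurus number to its broad category, which is not reproduced here.

```python
import colorsys

N_CLASSES = 8  # the number of broad categories described in the post


def colour_for(class_no, sub_index, sub_count):
    """Hex colour: hue from the broad category, lightness from the sub-category.

    class_no runs from 1 to N_CLASSES; sub_index from 0 to sub_count - 1.
    """
    hue = (class_no - 1) / N_CLASSES
    step = sub_index / max(sub_count - 1, 1)
    lightness = 0.35 + 0.40 * step  # dark to light, avoiding near-black and near-white
    r, g, b = colorsys.hls_to_rgb(hue, lightness, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Two sub-categories of category eight share a hue but differ in shade.
print(colour_for(8, 0, 20))
print(colour_for(8, 19, 20))
```

The resulting hex strings could be dropped straight into the background colour of each word or sentence when rendering the text.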

Have I missed something? Does anyone know why we don't use a thesaurus-based numerical hierarchy to code meaning in large texts? It would give you a "word" = "a set of numbers" (a unique number for each major word) in a paragraph or sentence or text division, which could then be compared statistically, or colour coded to reflect the breadth of meaning found. Or you could colour code for types of words and locate relevant sections in a large text by their colour. It just seems dead obvious as a way of moving towards the ideal at the heart of the semantic web, while avoiding the creation of 'triples' and the ever-retreating promise of universality. My guess is either that librarians have been doing this for the last thirty years and not telling me (librarians are cruel that way), or that the rest of the world forgot to read the introduction to Roget's Thesaurus, which is also possible. Of course, the final possibility is that I don't understand the semantic web, and that ontologies in aggregate are already a form of thesaurus.
