Friday, 3 January 2014

Judging a book by its URLs

It will sound odd, but I have recently had a great time editing URLs.  Robert Shoemaker and I have just finished a book for CUP, derived from the London Lives project, and called London Lives: Poverty, Crime and the Making of a Modern City, 1690-1800. It is a long book (170,000 words) and each quote and reference in it is linked via a URL to the original document or article, book or web-resource used as evidence or to contextualize the argument.  It will be published as both an ebook and in hard copy, and the links need to be robust, and secure.  My estimate is that there are in the region of 4,000 URLs included in the manuscript (which was written collaboratively in PMWiki).  In the end, I found that I could identify an appropriate link for 98% of all footnote references, but then had to eliminate around 10% of these, as the relevant URL was just not usable.  The book took some nine years, and I am glad it is finished.

One of my final jobs was editing those 4,000 URLs.  It took about three months' work, spread over the last year, and I have just finished spending a week or so confirming what I hope will be their final form.  When I have told people about this work many have looked incredulous and suggested that this is the sort of technical implementation process that should be left to others.  A couple of otherwise nice people have suggested I dump this job on the shoulders of the nearest PhD student.  But to my mind, it is precisely the kind of thing that authors should do for themselves.  And in doing it, two things kept coming to mind.  First was how the role of the scholar in creating a rigorous academic apparatus is a central part of the intellectual journey that academic writing involves - and that we should see the implementation of the online version of this in the light of the precise writing of footnotes and references that mark out good scholarship.  And second, that URLs encode a system of design and intent, online architecture and system of access, that signal the quality and permanence (the academic credibility and perceived audience) of historical materials online.  And that just as we have always sorted and judged scholarship by its form, we should think a bit harder about how the form of a URL can let us interrogate online materials.

On the first point, I do not know of much discussion of the joys of this kind of academic slog.  There is a lot of good writing on research and archives (by Carolyn Steedman and Arlette Farge among many others), on writing and thinking, but no-one talks much about the painstaking labour that goes into turning a rough draft into a final finished piece of scholarship.  And here I am really talking about generating accurate and fully comprehensive footnotes that reflect both the material cited, and the research journey that resulted in the main text.  This has become much easier with online catalogs and citation management packages, but nevertheless remains laborious and a reflection of our collective and individual commitment to a particular kind of evidenced discussion.  But for me it also represents my favourite compromise.  The writing of history is a wonderfully imaginative and creative process.  And in some respects we wish to judge the product of history writing as art.  Is it enjoyable to read? Is it convincing?  Does it do the job of good writing in liberating the readers' imagination?  In making these judgements we tend to appeal to a notion of 'value' that is cultural and that privileges dominant forms of authority.  This aspect of judgement is essentially romantic; with all the implications for western and elite hegemony embedded in that idea.  At the same time history writing is the result of simple hard work of a more technical kind - in the archives, in collating and collecting, re-ordering and interrogating data.  And it is valuable because it encompasses that hard work.  The beauty of the academic apparatus is that it evidences this and in the process generates a different measure of value.  In other words it is where quality is tied to a 'labour theory of value'.  I love the academic slog because it is where un-moored judgement is tied down to hard labour; and where value can be universalized in a common human experience (work).
In other words I really enjoyed editing 4000 URLs precisely because in them and their associated footnotes lies a claim to and evidence of the hard labour that underpins the book itself.

At the same time, the process also taught me to read URLs differently.  Clearly coders and web designers do this as a matter of course.  But I am a historian and want to read URLs as a scholar, rather than as a programmer or designer.  And for me, the important thing is that URLs embed the structure of a site, making it plain to see for anyone willing to look hard; and that they are made up of both the character of a library reference, and a command directed at the new technology of discovery - the Internet.  There are just lots of different types of URL.

There are 'Search URLs' that include all the elements that take the user past a collection to a specific object, but don't let you go directly there without the query.  And there are URLs that encode a cataloging hierarchy.  There are URLs that sift data, or work in your browser to change the data delivered, highlighting phrases or sifting material.  And there are URLs that encode licensing, passwords, and access information.  It is easy enough to find that the whole search journey that took you from a library catalog to an individual item is encoded directly in the URL, and even personalized to you, the machine you are using, or the forms of access you can deploy.  It is easy to find URLs that run on for hundreds of characters, each element divided by a '&' or studded with '%' codes, or such.
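For readers who would like to take such a URL apart, here is a minimal sketch in Python using the standard urllib.parse module.  The address catalogue.example.org and its parameters are invented purely for illustration, and the function name describe_url is my own; this is one way of looking at the pieces, not a definitive method.

```python
from urllib.parse import urlsplit, parse_qsl

def describe_url(url):
    """Split a URL into the parts a reader might want to inspect."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        # Each '&'-separated element of the query becomes a (name, value)
        # pair; '%' codes such as %22 are decoded to the characters they
        # stand for (here, a double quotation mark).
        "query": parse_qsl(parts.query, keep_blank_values=True),
        "fragment": parts.fragment,
    }

# A made-up catalogue URL, used only for illustration.
example = ("http://catalogue.example.org/search"
           "?q=%22petty+crime%22&page=2&session=abc123")
info = describe_url(example)

for name, value in info["query"]:
    print(name, "=", value)
```

Seen this way, the search phrase, the page of results, and the session identifier each show up as a separate element, which makes it easier to judge which of them belong to the item and which belong only to one person's visit.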

But in creating robust reproducible links to credible historical materials most of these URLs are at least problematic if not useless.  If they include details for institutional access, or session information, they cannot be re-used by someone else.  These URLs are friable and fragile things and not fit for scholarly purposes.  And as a result, for the London Lives book we have been forced to eliminate all the links we originally hoped to include to forty or fifty different sites.  To take a single example, most archives structure their online collections with search in mind, making it difficult to link to a single item.  I spent a lot of time finding the catalog entry for every manuscript we cited in the London Metropolitan Archives, and Westminster Archives Centre, only to regretfully strip out the links when confronted by a complex URL that just did not look credible as a long term citation of the item itself.

Even in its simplest form, the one recommended by the site for sharing a link, a London Metropolitan Archives URL looks like this:

http://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+P69~2FBRI~2FB~2F001~2FMS06554~2F004?SESSIONSEARCH
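The warning signs in a URL like this can be checked mechanically.  The following is only a rough sketch in Python: the list of suspect words, the 80-character threshold, and the function name fragility_warnings are all assumptions made for illustration, not an established test of durability.

```python
from urllib.parse import urlsplit

# Hypothetical warning signs, chosen for illustration only.
SUSPECT_WORDS = ("session", "sid=", "token", "login", "proxy")

def fragility_warnings(url, max_length=80):
    """Return a list of reasons a URL may not survive as a citation."""
    # Allow bare addresses such as 'estc.bl.uk/T174945' with no scheme.
    parts = urlsplit(url if "://" in url else "http://" + url)
    warnings = []
    if len(url) > max_length:
        warnings.append("long")
    if any(word in url.lower() for word in SUSPECT_WORDS):
        warnings.append("session or access details")
    # A path that names a program (.dll, .cgi) exposes the software
    # currently running on the server, which may later be replaced.
    path = parts.path.lower()
    if ".dll" in path or ".cgi" in path:
        warnings.append("exposes server software")
    return warnings

lma = ("http://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/"
       "web_detail/REFD+P69~2FBRI~2FB~2F001~2FMS06554~2F004?SESSIONSEARCH")
print(fragility_warnings(lma))
print(fragility_warnings("estc.bl.uk/T174945"))
```

A check of this kind can only flag candidates for a second look; the decision about whether a link is credible as a long term citation still rests with the author.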


Since we had consulted these items in their physical form in any case, it did not seem too problematic to leave out these links, but a shame nevertheless.  And likewise, with paywall material there seemed little point in dangling real access, and the promise of credible evidence, before the eyes of readers who would not be able to go beyond the login screen.  It seemed better to cite a specific item in combination with a general (unlinked) URL and date of consultation as reflecting our own research journey, rather than to promise access when we could not deliver it.

With few exceptions the URLs that have been retained (and there are still 4000 of them) address specific items with a specific ID, and usually run to 20 to 40 characters.  DOIs are not bad once you figure out their structure and reformulate them as they should be, rather than the way they are normally cited on journal web pages.

dx.doi.org/10.1353/sec.2010.0268
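A minimal sketch in Python of this kind of reformulation, assuming the DOI arrives embedded in a citation string of some sort.  The pattern used to recognise a DOI is a loose approximation, and the function name doi_link and the resolver parameter are my own inventions for the example.

```python
import re

# Loose pattern for a DOI: "10.", a registrant code, a slash, then a suffix.
# This is an assumption for illustration, not a full check of DOI syntax.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def doi_link(citation, resolver="dx.doi.org"):
    """Pull a DOI out of a citation string and return a short resolver link."""
    match = DOI_PATTERN.search(citation)
    if match is None:
        raise ValueError("no DOI found in: " + citation)
    # Drop punctuation that belongs to the sentence, not to the DOI.
    doi = match.group(0).rstrip(".,;")
    return resolver + "/" + doi

print(doi_link("DOI: 10.1353/sec.2010.0268."))
```

The resolver is left as a parameter so that the same function can produce links in whatever form a publisher or style guide asks for.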

And Google Books creates a very nice URL once you strip out all the complex formatting instructions that are normally generated as part of a search and inserted after the main ID.  This is what a Google Books URL looks like if you were to use the 'search' version:

 http://books.google.co.uk/books?id=1sMJGt7_rTAC&printsec=frontcover&dq=%22Prosecution+and+Punishment:+Petty+Crime+and+the+Law%22&hl=en&sa=X&ei=rrzGUq_aDsSy7Aa_9YGQCg&redir_esc=y#v=onepage&q=%22Prosecution%20and%20Punishment%3A%20Petty%20Crime%20and%20the%20Law%22&f=false

And this URL will take you to the same book:

 books.google.co.uk/books?id=1sMJGt7_rTAC
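The stripping described above can be sketched in a few lines of Python with the standard urllib.parse module.  The function name short_google_books_url is invented for the example, and the sketch assumes that the 'id' parameter alone is enough to identify the book, as it is in the two URLs shown here.

```python
from urllib.parse import urlsplit, parse_qs

def short_google_books_url(url):
    """Keep only the host, the path, and the 'id' parameter."""
    parts = urlsplit(url)
    ids = parse_qs(parts.query).get("id")
    if not ids:
        raise ValueError("no 'id' parameter in: " + url)
    # Everything else in the query (printsec, dq, hl and so on) and the
    # '#' fragment records the search and display settings, not the book.
    return parts.netloc + parts.path + "?id=" + ids[0]

search_version = (
    "http://books.google.co.uk/books?id=1sMJGt7_rTAC&printsec=frontcover"
    "&dq=%22Prosecution+and+Punishment:+Petty+Crime+and+the+Law%22&hl=en"
    "&sa=X&ei=rrzGUq_aDsSy7Aa_9YGQCg&redir_esc=y#v=onepage"
    "&q=%22Prosecution%20and%20Punishment%3A%20Petty%20Crime%20and%20the%20Law%22"
    "&f=false"
)
print(short_google_books_url(search_version))
```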

 And the Eighteenth-century Short Title Catalog generates some of the most elegant URLs I have found:

estc.bl.uk/T174945

And to a lesser extent, so does the Ethos collection of doctoral theses at the British Library.

ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.354762

And London Lives and the Old Bailey Online do pretty well on this score:

www.londonlives.org/browse.jsp?div=LMSMPS501980014
http://www.oldbaileyonline.org/browse.jsp?ref=t17910413-19


In part, I suspect that these issues would all disappear if I had a better sense of the layer of structure that lies beneath the WWW.  But for the moment I am keen to have a short, human-readable URL that looks like it will last longer than the session I am currently logged on for.   All of which simply takes me back to the joy of academic slogging and the importance of the academic apparatus as something that evidences hard work and opens up scholarship to credible criticism that goes beyond simple romantic appreciation and prejudice.

I know all too well that one of the skills of an academic is the ability to judge a book by its cover and the form of the text it contains.  For online materials we need to embed URLs into precisely this process - and the joy of all that editing was that at the end of it, I feel I have learned to do just that.



21 comments:

Becky said...

I'm an academic librarian and spend a good bit of time explaining to students the value of a book's apparatus--references, index, table of contents--in descending order of importance, for the most part. This piece fits right in, and I appreciate your expressing so well how all the work of creating a book affects its quality.

I'll be searching your blog for any posts on indexes, my favorite scholarly tool.

Becky Kornegay

Tim Hitchcock said...

Hi Becky, Thanks for your comment - very much appreciated. And please keep up teaching students about book structures! I am continually depressed by the extent to which historians largely fail to do this.

Indexes are interesting - though I haven't blogged about them in particular (this blog is a bit random!). But what I have always wanted to try is turning traditional book indexes on their head, and using them to model reader response - essentially assuming that text that attracts human-created index entries comprises text that the reader's eye gravitated towards (even if it is only the eye of the indexer).

Tim Hitchcock

Janice said...

What a smart analysis. (And may I say that I'm excited for the book project which has inspired this post?)

It's difficult to get tech-phobic colleagues and students to understand that just copying the URL from their browser bar isn't going to create a durable URL that others can use. It's also maddening when the software or database choices for a project hamper the URL reusability - this should be a basic concern for anyone building a web project these days!

John Muccigrosso said...

I worry whenever there's a ? in a URL (URI, really). That usually means that the underlying search mechanism, which is liable to change, is being included. So when they switch out the engine in an upgrade, the reference won't work anymore.

For example, your London Lives URI is clearly using a little Java on the server. How long will that last?

I think the Eighteenth-century Short Title Catalog has the best URI: just an ID number after the /.

Sebastian Heath has got a few good posts on this. For example,

http://mediterraneanceramics.blogspot.com/2010/10/change-happens-if-it-can.html

Or his list of Very Clean URIs at

http://wiki.digitalclassicist.org/Very_clean_URIs

(Hmm, can I tag him here with @sebth?)

Kristopher Nelson said...

As a former web developer who is now a PhD student in history, this was fascinating--it's so ingrained in me to see URLs as reflections of what goes on behind the scenes that I often forget that this isn't how most people see them!

Anyway, just one thought: I am always suspicious of the durability of URLs that include a "?," even though this is extremely standard on so many sites. The "?" indicates that the bit after that is being passed on to another layer of the system that parses it as, effectively, a search. In my experience, if the backend technology changes, this part is the hardest to redirect to the proper resource.

I always prefer URLs that have only path information in them ("/"), because even though these still require backend tech to handle they tend to be the most stable into the future.

Of course, this is all very nice in theory! You do what you can, as you have, to reduce the complexity as much as possible.

This is an area where web devs could stand to pay attention to historians/librarians/archivists, instead of just their immediate concerns.

Thanks for all the hard work on the project!

Tim Hitchcock said...

Dear All, Thanks for your comments. I particularly wanted to thank John for the link to Sebastian Heath's blog on this - I loved his line: If a URL looks unstable, it is.

But thanks also to both John and Kristopher for raising the issue of ? and the difficulties it creates. I am looking forward to discussing how to avoid this issue on the Old Bailey and London Lives sites.

One area that continues to interest me is the effect that locating more and more functionality on the browser side of the equation will have. One site I have been largely unable to reference effectively is Locating London's Past, just because the things I want to cite - maps - are generated on the fly, don't exist as 'objects' and don't actually show up in the URL.

Unknown said...

I wonder if you have tried using the Internet Archive's Wayback Machine's "Save Page Now" tool to capture a page as it appeared when you accessed it for use as a trusted citation in the future? https://archive.org/web/web.php

Tim Hitchcock said...

I hadn't seen the 'Save Page Now' function - thanks very much for pointing me in this direction!

Phil said...

Just read this on the LSE Website. It reminded me of this old post; it expresses my own pleasure in reference-hunting, as well as a similar sense of how academic writing can combine the most unmoored and speculative creativity with a Gradgrindian level of groundedness ("Now, what I want is, References!"). Nothing like it, when it works.
