Good points.
The issue of freeing up the OCR'd text has been discussed at the project
board meeting for the BL newspapers project, both in the context of
Richard's comments about opening up data and in terms of
accessibility.
However, for this project it has been a case of walking before
running. Digitising 2m pages of often fragile historic newspapers is a
big task, from ironing the copies prior to microfilming, through OCRing,
DTD development, quality checking and then deploying via a relatively
new type of business model for digital content. Note that the text of
the newspapers is also 'dirty' OCR, which would need some kind of
mechanism for cleaning it up to make it more useful.
However, these organisational issues do not mean that this kind
of approach should not be progressed. In principle, I don't think the BL
or Cengage has anything against the concept of making the OCR'd text
available. JISC would welcome it.
If there is a general feeling amongst the MCG that open data is a
key part of making such content accessible, I'm happy to take these
comments back to the BL's project board for newspapers. And as paying
customers (another interesting issue) it's the kind of thing you might
want to let the BL know about directly.
Alastair
Alastair Dunning
JISC Digitisation Programme Manager
t: 0203 006 6065
JISC Office (1st Floor)
Brettenham House (South Entrance)
5 Lancaster Place
London WC2E 7EN
http://digitisation.jiscinvolve.org/
http://www.jisc.ac.uk/digitisation/
-----Original Message-----
From: Museums Computer Group [mailto:[log in to unmask]] On Behalf Of
Richard Light
Sent: 18 June 2009 11:15
To: [log in to unmask]
Subject: Re: Digital Britain - Final Report
In message
<[log in to unmask]>,
Nick Poole <[log in to unmask]> writes
>3. Tim Berners-Lee's appointment as 'linked data czar' is interesting -
>the mission to liberate publicly-funded data is something of a cause
>celebre in Whitehall and it seems likely to continue in spite of Tom
>Watson's recent - ahem - retirement. I would suggest that we bring our
>much-vaunted 'goldmine of content' to this party and use the
>opportunity to have a serious conversation about opening up cultural
>data in the broadest possible sense (which might include your semantic
>tech Richard!).
Digital Britain makes much of "the importance of news and local
journalism for democracy" (57-66). In that context, it is interesting
to go to the newly-announced British Library site British Newspapers
1800-1900 [1] and see how liberated the publicly-funded data relating to
this primary historical source material is feeling.
The site itself is fine, and works well. Starting as ever with a search
for Burgess Hill, I have entertained myself finding stories about the
Brighton Railway Murder ("the Hat Difficulty") and the Weidhaas Hygienic
Institute's natural asthma cure ("Dum Spiro Spero" - wonderful). Then I
started to wonder about the digitised text that is clearly there, in the
background, driving the search facilities that bring us the scanned page
images. Might that publicly-funded data be available? From the FAQ
[2], the answer is apparently "no":
-----------------------
Is it possible to see the raw text of the article in HTML [sic], as
captured by the Optical Character Recognition (OCR) system?
The British Library and Gale do not currently make this text
available -- the text files are used solely for searching the product
and the user is only able to view the digital image of the page.
-----------------------
As a community, are we happy with this state of affairs?
A source I am increasingly turning to for "raw text", in the absence of
alternatives, is Project Gutenberg:
http://www.gutenberg.org/wiki/Main_Page
This community initiative allows the full text of out-of-copyright works
to be downloaded and used as one wishes. It is just plain text, so you
need to get in there and add some markup, but I have for example been
able to convert Cousin's Biographical Dictionary of English Literature
into a database of 1700 biographical records, containing useful
structured information such as names, birth and death dates, and titles
of works. I am currently considering how to expose this as a Linked
Data resource, ideally using the CIDOC CRM as a base ontology.
So long as we have a model where information is published only as web
pages (especially when those pages contain only images), there is no
scope for this sort of data enhancement.
Richard
[1] http://newspapers.bl.uk/blcs/
[2] http://newspapers.bl.uk/blcs/page.do?page=/researchguide.jspx#sect2
--
Richard Light
****************************************************************
For mcg information visit the mcg website at
http://www.museumscomputergroup.org.uk.
To manage your subscription to this email list visit
http://www.museumscomputergroup.org.uk/email.shtml
****************************************************************