On a general point, we've had access to the newspaper archive for some time now through HE, and it's been a superb resource - one of the landmark achievements in digitisation. I've used it extensively (particularly the Aberdeen papers) and it's been great for research. However, I've been very disappointed in the post-scan operations, in two areas in particular. The first is the segmentation: a search often returns whole pages, which I really would have expected to have been subdivided into articles (particularly when you compare results with e.g. the Scotsman archive).
More serious is the quality of the OCR. You can search for words you know are in articles and it fails to return them. I've often seen something during a browse and then tried to find it again later, searching for words I've seen (sometimes several), and just come up blank. This is very frustrating. It also means that, while you get some of the search results you are looking for, you can't be sure that you've seen everything on a particular topic. I offer this observation as a possible warning about the quality of any OCRd files you might think could be reused. This is actually a serious issue for the Aberdeen papers at least, though this could be an indication of the quality of the particular material being scanned.
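To illustrate why an exact-match search misses words that are visibly on the page, here is a minimal sketch of error-tolerant matching over OCR output. It is not how the archive's search works; the function name `fuzzy_find`, the sample text and the 0.8 similarity cutoff are all assumptions chosen for illustration, using only Python's standard `difflib` module.

```python
import difflib
import re


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return OCR tokens that approximately match the query word.

    Hypothetical helper: the cutoff of 0.8 is an arbitrary illustrative value.
    """
    # Split the OCR text into lower-cased alphanumeric tokens.
    tokens = sorted(set(re.findall(r"[A-Za-z0-9]+", ocr_text.lower())))
    # get_close_matches ranks candidates by difflib's similarity ratio.
    return sorted(
        difflib.get_close_matches(query.lower(), tokens, n=10, cutoff=cutoff)
    )


# Invented sample with two typical OCR slips: "c" for "e" and "0" for "o".
sample = "The Aberdcen Journal reports on the harb0ur works at Aberdeen"
matches = fuzzy_find("aberdeen", sample)
harbour_matches = fuzzy_find("harbour", sample)
```

An exact search for "aberdeen" would find only the correctly recognised occurrence; the approximate match also picks up the misread token, at the cost of possible false positives when the cutoff is lowered.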
The printing could be better too, but that's a different matter.
Alan
Dr Alan Knox
Head of Historic Collections
University of Aberdeen
King's College
Aberdeen AB24 3SW
tel +44 (0)1224 272599
fax +44 (0)1224 273891
[log in to unmask]
www.abdn.ac.uk/historic
-----Original Message-----
From: Museums Computer Group [mailto:[log in to unmask]] On Behalf Of Alastair Dunning
Sent: Thursday 18 June 2009 12:22
To: [log in to unmask]
Subject: BL Newspapers and open content
Good points.
The issue of freeing up the OCRd text has been discussed at the project
board meeting for the BL newspapers project, both in the context of
Richard's comments about opening up data and also in terms of
accessibility.
However, for this project it has been a case of walking before
running. Digitising 2m pages of often fragile historic newspapers is a
big task, from ironing the copies prior to microfilming, through OCRing,
DTD development, quality checking and then deploying via a relatively
new type of business model for digital content. Note that the text of
the newspapers is also 'dirty' OCR, which would also need some kind of
mechanism for cleaning it up to make it more useful.
However, all these organisational issues are not to say that this kind
of approach should not be progressed. In principle, I don't think the BL
or Cengage has anything against the concept of making the OCR'd text
available. JISC would welcome it.
If there is a general feeling amongst the MCG that open data is a
key part of making such content accessible, I'm happy to take these
comments back to the BL's project board for newspapers. And as paying
customers (another interesting issue) it's the kind of thing you might
want to let the BL know about directly.
Alastair
Alastair Dunning
JISC Digitisation Programme Manager
t: 0203 006 6065
JISC Office (1st Floor)
Brettenham House (South Entrance)
5 Lancaster Place
London WC2E 7EN
http://digitisation.jiscinvolve.org/
http://www.jisc.ac.uk/digitisation/
-----Original Message-----
From: Museums Computer Group [mailto:[log in to unmask]] On Behalf Of
Richard Light
Sent: 18 June 2009 11:15
To: [log in to unmask]
Subject: Re: Digital Britain - Final Report
In message
<[log in to unmask]>,
Nick Poole <[log in to unmask]> writes
>3. Tim Berners-Lee's appointment as 'linked data czar' is interesting -
>the mission to liberate publicly-funded data is something of a cause
>celebre in Whitehall and it seems likely to continue in spite of Tom
>Watson's recent - ahem - retirement. I would suggest that we bring our
>much-vaunted 'goldmine of content' to this party and use the
>opportunity to have a serious conversation about opening up cultural
>data in the broadest possible sense (which might include your semantic
>tech Richard!).
Digital Britain makes much of "the importance of news and local
journalism for democracy" (57-66). In that context, it is interesting
to go to the newly-announced British Library site British Newspapers
1800-1900 [1] and see how liberated the publicly-funded data relating to
this primary historical source material is feeling.
The site itself is fine, and works well. Starting as ever with a search
for Burgess Hill, I have entertained myself finding stories about the
Brighton Railway Murder ("the Hat Difficulty") and the Weidhaas Hygienic
Institute's natural asthma cure ("Dum Spiro Spero" - wonderful). Then I
started to wonder about the digitised text that is clearly there, in the
background, driving the search facilities that bring us the scanned page
images. Might that publicly-funded data be available? From the FAQ
[2], the answer is apparently "no":
-----------------------
Is it possible to see the raw text of the article in HTML [sic], as
captured by the Optical Character Recognition (OCR) system?
The British Library and Gale do not currently make this text
available -- the text files are used solely for searching the product
and the user is only able to view the digital image of the page.
-----------------------
As a community, are we happy with this state of affairs?
A source I am increasingly turning to for "raw text", in the absence of
alternatives, is Project Gutenberg:
http://www.gutenberg.org/wiki/Main_Page
This community initiative allows the full text of out-of-copyright works
to be downloaded and used as one wishes. It is just plain text, so you
need to get in there and add some markup, but I have for example been
able to convert Cousin's Biographical Dictionary of English Literature
into a database of 1700 biographical records, containing useful
structured information such as names, birth and death dates, and titles
of works. I am currently considering how to expose this as a Linked
Data resource, ideally using the CIDOC CRM as a base ontology.
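As a rough illustration of the kind of conversion described above, the sketch below turns plain-text dictionary entries into structured records. It is not the method actually used: the entry layout (a "SURNAME, FORENAMES (born-died)" heading line), the `Person` record and the `parse_entries` helper are all assumptions made for the example, and the sample text is invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed heading layout: "SURNAME, FORENAMES (1672-1719).--Description"
ENTRY = re.compile(
    r"^(?P<surname>[A-Z][A-Z' -]+),\s+(?P<forenames>[^(]+?)\s*"
    r"\((?P<born>\d{4})\??-(?P<died>\d{4})\??\)"
)


@dataclass
class Person:
    surname: str
    forenames: str
    born: int
    died: int


def parse_entries(text: str) -> List[Person]:
    """Extract one Person per line that matches the assumed heading layout."""
    people = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:  # lines that are not entry headings are skipped
            people.append(
                Person(
                    surname=m["surname"].title(),
                    forenames=m["forenames"].strip().title(),
                    born=int(m["born"]),
                    died=int(m["died"]),
                )
            )
    return people


# Invented sample in the assumed layout.
sample = """\
ADDISON, JOSEPH (1672-1719).--Essayist and statesman.
Some continuation line that is not a heading.
AUSTEN, JANE (1775-1817).--Novelist.
"""
records = parse_entries(sample)
```

Records in this shape (names plus birth and death dates) could then be mapped onto whatever ontology is chosen; real source text would need a more forgiving pattern for uncertain or missing dates.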
So long as we have a model where information is published only as web
pages (especially when those pages contain only images), there is no
scope for this sort of data enhancement.
Richard
[1] http://newspapers.bl.uk/blcs/
[2] http://newspapers.bl.uk/blcs/page.do?page=/researchguide.jspx#sect2
--
Richard Light
****************************************************************
For mcg information visit the mcg website at
http://www.museumscomputergroup.org.uk.
To manage your subscription to this email list visit
http://www.museumscomputergroup.org.uk/email.shtml
****************************************************************
The University of Aberdeen is a charity registered in Scotland, No SC013683.