From daily experience I can tell you variation is rife. For example, in a data sample of about 250 sources which should have very similar structure (online newspapers), and where 200 of them are produced by the same software platform, we have found over a dozen variations in the date syntax (forget the semantics of the "published" field - that is a different nightmare). And that is a single, conceptually very simple piece of data. You can extrapolate to pagination and the like. After all, the hundreds of pages of AACR2 bear witness to the variation and complexity found in the real world.
Peter
Dr Peter Noerr
CTO, MuseGlobal, Inc.
+1 415 896 6873
www.museglobal.com
From: List for discussion on Resource Description and Access (RDA) [mailto:[log in to unmask]] On Behalf Of Simon Spero
Sent: Wednesday, March 26, 2008 8:46 AM
To: [log in to unmask]
Subject: How much variation is there in legacy data?
> I believe that RDA itself masks some of the complexity in that world
> by using text and allowing human readers to do the necessary
> interpretation. So in a case where you have something like a
> technical manual that paginates things like "A-1 to A-32; B-1 to B-15",
> etc., RDA will say: "approximately 600 pages." Or for a book with
> some text and some pictures it might say: "27 pages, unnumbered
> sequence of leaves". Some of this is rules, but underneath it is the
> reality of the resources being described. It's this reality of the
> resources that we need to be mindful of.
It might be useful to do some experimentation to see how much variety
there is in this kind of field in existing data. The 300 field
(physical description) might be a good place to start.
Maybe we could extract the 300 fields from a small random sample of
records; identify units, values, dimensionality, and syntax;
generate lexicon and transducer rules for GATE or similar; and then
evaluate over a test set from a different source.
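As a rough, purely illustrative sketch of the classification step (not
part of the original proposal): a handful of regex rules standing in for
GATE's lexicon/transducer rules, run over 300 $a extent statements that
have already been exported one per line. The file name, rule names, and
patterns below are all made up for the example.

import re
from collections import Counter

# Each rule is (name, pattern); patterns are applied to the lowercased extent.
RULES = [
    ("simple_pages",       re.compile(r"^\d+ p\.?$")),                  # "327 p."
    ("approx_pages",       re.compile(r"^(ca\.|approximately) \d+ p")), # "ca. 600 p."
    ("volumes",            re.compile(r"^\d+ v\.?")),                   # "3 v."
    ("leaves",             re.compile(r"^\d+ leaves")),                 # "48 leaves"
    ("roman_plus_arabic",  re.compile(r"^[ivxlcdm]+, \d+ p")),          # "xii, 340 p."
]

def classify(extent):
    """Return the name of the first matching rule, or 'unclassified'."""
    s = extent.strip().lower()
    for name, pattern in RULES:
        if pattern.search(s):
            return name
    return "unclassified"

counts = Counter()
with open("extents.txt", encoding="utf-8") as fh:   # one 300 $a string per line
    for line in fh:
        counts[classify(line)] += 1

for name, n in counts.most_common():
    print(f"{name:20s} {n}")

The size of the "unclassified" bucket after a first pass, and how fast it
shrinks as rules are added, would be one crude measure of the variation.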
This approach should quickly give a good indication as to general
feasibility.
Worth doing?
Simon