Hi Folks
I seem to have generated a bit of a storm. Suffice it to say, my comments are
not about turf wars (as someone implied), but about meeting expectations. So,
on that front, I'm going to take just one point (and its consequences) from
Paul's email:
On Tuesday 17 January 2006 11:02, Wheatley, Paul wrote:
> Quoting selectively from Bryan's blog (apologies in advance!):
> > So, yes, I'm very much in favour of institutional repositories, but
> > they need to be established with a very clear understanding of what
> > they will host and they need to reject material that they can't hope
> > to preserve.
>
> Given that the development of preservation technology is at such an
> early stage, I'm not sure anyone can answer that question easily. What
> we will be able to say with confidence that we can preserve in just a
> couple of years time will be far more ambitious than what we can
> realistically preserve now. Is this a good reason to turn away material
> now?
(Note that my response here is about research *data*; I have nothing (useful)
to say about learning materials etc.)
Yes, I'm afraid it is. The reality is that preservation (beyond a trivial
period of time) relies heavily on a relationship between the producer of the
data and the archivist, and as time goes on, a relationship between the
consumer of the data and the archivist.
Experience tells us: if you don't get the producer actively involved,
producing adequate discipline-specific metadata right at the beginning, and
if you don't then actively migrate that metadata ... you'll end up with bits
and bytes that are simply impracticable to deal with, because you need humans
to help ... and that simply doesn't scale. We're already in that position
with some of our early datasets ... it's not that we can't read the format,
it's that the information encoded in the format isn't good enough, and we
need a human to deal with it ... but we've got 35,000 files in that format,
and each could take 1-15 minutes or so to deal with. You do the maths ...
and that's just one format!
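(Doing the maths for you, as a back-of-envelope sketch using only the figures
quoted above, 35,000 files at 1-15 minutes each:)

```python
# Back-of-envelope estimate of the manual effort for one legacy format,
# using the figures from this message: 35,000 files, 1-15 minutes each.
FILES = 35_000

def person_hours(minutes_per_file: float) -> float:
    """Total human effort in hours for the whole collection."""
    return FILES * minutes_per_file / 60

low, high = person_hours(1), person_hours(15)
print(f"{low:.0f} to {high:.0f} person-hours")          # 583 to 8750
print(f"= {low / 8:.0f} to {high / 8:.0f} working days")  # 73 to 1094
```

That is somewhere between a few months and several person-years of effort,
for a single format.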
So, saying to the producer "don't worry, just biff me the data, we'll work
out how to preserve it later" gives them licence to think you've done the
preservation work. But you haven't, and you're going to have to come back to
it ... and by then there might be no one willing to pay for the work!
(For the record: definition of "early stage": the BADC has been preserving
digital data for 11 years, and it grew out of a preceding entity ...
arguably we represent several decades of experience doing this ... and we've
made a lot of mistakes ... some of which I see being repeated).
> > Then they need to hold a hard line against function creep, and only
> > accept material in "well known formats" (whatever that means) with
> > "well
>
> This is a popular strategy to ease the preservation problems the
> repositories are taking on. However, the reality is that when faced with
> a limited number of submission formats, submitters either don't bother
> or they perform migrations themselves. Do we really want to place
> complex preservation actions in the hands of the users? No records are
> kept of what action they take and migrations are performed in an ad hoc
> way. This could well be creating an even bigger preservation challenge
> for us in the future.
The bottom line is the folk you call users are the data producers, and they
know more about what they're doing than we do. In particular, if anyone has
to make ad hoc decisions, rather them than me! When I can do it properly,
then I'll get involved.
(Remember, I'm talking about research data; the arguments Paul makes are
quite tenable for other types of "preservation entity", and I'm only making
these arguments to try and keep IRs - with a sensible definition - practical
and useful.)
Bryan
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
Head of the NCAS/British Atmospheric Data Centre
CCLRC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; Web: home.badc.rl.ac.uk/lawrence