On Fri, 27 Jul 2012 00:01:34 +0100, JISC-REPOSITORIES automatic digest
system wrote:
> There are 15 messages totaling 6797 lines in this issue.
>
> Topics of the day:
>
> 1. Policies on depositing MS Word files (12)
> 2. A survey on cost estimation of cloud storage and computing
> 3. Richard Poynder Interviews Stevan Harnad on RCUK OA Policy
> 4. DC-2012 in Malaysia five weeks away
>
> ----------------------------------------------------------------------
>
> Date: Thu, 26 Jul 2012 18:12:57 +1000
> From: David Groenewegen
> Subject: Re: Policies on depositing MS Word files
>
> I think the other "problem" with Word comes from the word processor wars
> of the early 90s, when it was unclear what would be the most common
> format (remember WordPerfect? WordStar? MacWrite?). It made people
> nervous about its longevity.
>
> But Word has been the default standard for creating documents for a
> longish time now - there can't be many people or companies who don't
> rely on it (and even if you don't I bet you still have the capacity to
> deal with it). If the ability to access the billions of Word documents
> out there disappeared tomorrow through some bizarre circumstance where
> every single one of the hundreds of millions of copies of Word
> (<http://blogs.technet.com/b/office2010/archive/2009/10/07/new-ways-to-try-and-buy-microsoft-office-2010.aspx>)
> and all the various compatible tools (<https://docs.google.com/>)
> stopped working, someone would have to invent a way of overcoming this
> pretty quick smart.
>
> Please note: I'm not saying that Word is perfect, or that I'm thrilled
> with this outcome, or that Word is better than <insert your favourite
> here>, or that it isn't the result of Microsoft exploiting its market
> share.
>
> What I am saying is that a Word document is probably the last format we
> need to worry about for preservation purposes for the foreseeable
> future. Except maybe PDF.

There are several aspects to how useful a format is for digital
preservation:

1) Is the format itself open and well-defined, such that multiple products
   exist that can render the contents accurately?
2) Is the format "lossy" (either in semantics or quality)?
3) Most importantly, how widely used is the format? Big herds have better
   survival rates.

I'm on the mailing list for Poppler, which is the most popular PDF
library on Linux. You may be surprised by how difficult it is to render
PDF reliably; indeed, the developers often have to refer to the
*implementation* in Adobe Reader to decide how a particular PDF feature
should look. But you can say with confidence that PDF can be rendered
(near 100%) accurately on all platforms, in a variety of tools (server and
client).

I disagree with Les' assertion that because "docx" is a standard we are
in a better position than we were with Word "doc". I'm sure it's a common
experience in the community to see someone struggling with a presentation
that has gone haywire due to an OpenOffice/MS Office interchange. Sure,
docx is easier to hack on but, as has already been pointed out, "doc" is
such a widespread format that tools have been created to extract raw data
from it.
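
To illustrate what "easier to hack on" means in practice: a docx file is a
zip archive with the body text in word/document.xml, so the Python standard
library is enough to pull the raw text out. This is only a rough sketch; the
function name docx_paragraphs is made up for this example, and it ignores
tables, footnotes, headers and anything else that needs real layout
knowledge.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the plain text of each paragraph in a .docx file.

    Hypothetical helper for illustration: 'path' may be a filename or a
    file-like object. Formatting and layout are discarded.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each paragraph is split into runs; join the text of every run.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Getting the words out is the easy part; reproducing what the author saw on
screen is where the interchange problems start.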

The biggest problem for MS Word as a preservation format is the lack of
server tools. Not even Microsoft can sell you a tool for accurately
processing Word (PowerPoint etc.) on the server, because the rendering of
Office documents is bound up in the nuances and quirks of the closed-source
Office suite.

Of course the most successful format, available on by far the most
platforms and from the most vendors, is HTML. As the Semantic Web and
schema.org gain traction, the amount of information stored in HTML will
dwarf that in dead-tree formats like Word and PDF (if it doesn't already).

So, as has already been suggested, placing the source (Word, TeX, etc.)
plus a PDF in the repository is the best current strategy. It may be
prudent not to make the source public, since it can leak private data,
e.g. Word embedding the host machine's MAC address. The PDF should
*hopefully* bundle up all of the dependencies of the document (fonts in
particular), meaning readers will see what the writer intended.
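
If a repository did want to check a source file before making it public,
something like the following sketch could be a starting point. Note the
assumptions: it only handles docx (not the older "doc"), it does not look
for a MAC address, and it only reads the author fields stored in
docProps/core.xml. The function name docx_identifying_metadata is invented
for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Fields that commonly carry a person's name (an illustrative, not
# exhaustive, selection)
FIELDS = {"creator": "dc:creator", "last_modified_by": "cp:lastModifiedBy"}


def docx_identifying_metadata(path):
    """Return author-related core properties found in a .docx file.

    Hypothetical helper: returns an empty dict when the package has no
    docProps/core.xml part or none of the fields are set.
    """
    with zipfile.ZipFile(path) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))

    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found
```

An empty result does not mean the file is clean; it only means these two
fields are absent or blank.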
--
All the best,
Tim.