I think Tim's summary is right on the money. There are problems with all
three formats (docx, TeX and PDF) being discussed.
At arXiv approximately 90% of our 75k submissions/year come as TeX
source. We put very significant effort into maintaining a TeX processing
system that can process the ~700k article historical corpus of TeX
documents. The system carries a set of different style/class trees to deal with 20
years of articles. Every time we change the TeX version or revamp the
style/class trees there is an extensive regression test process
involving checking that we can still process the whole corpus, and
visual inspections of a significant number of processed PDFs. (A nice
part of this story is that prior to 1996, before PDF was widely used, we
had the option of collecting TeX source or 300dpi PostScript from users.
Because we took the TeX source, we can now produce high-quality PDF from
these articles, as opposed to being stuck with horrible PDF generated
from low-resolution bitmap PostScript.)
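
To make the regression step above concrete, here is a minimal sketch and not a description of the actual arXiv tooling. It assumes each corpus run is summarised as a dict mapping an article ID to whether processing succeeded; the function names and that representation are hypothetical, chosen only for illustration.

```python
import random


def find_regressions(baseline, candidate):
    """Return sorted IDs that processed under the baseline toolchain
    but fail (or are missing) under the candidate toolchain."""
    return sorted(
        doc_id
        for doc_id, ok in baseline.items()
        if ok and not candidate.get(doc_id, False)
    )


def sample_for_inspection(candidate, k, seed=0):
    """Pick up to k successfully processed IDs for visual inspection.

    A fixed seed keeps the sample reproducible between runs.
    """
    ok_ids = sorted(doc_id for doc_id, ok in candidate.items() if ok)
    rng = random.Random(seed)
    return sorted(rng.sample(ok_ids, min(k, len(ok_ids))))
```

Anything returned by `find_regressions` would block the upgrade until fixed, while the sample from `sample_for_inspection` goes to a person, since a document can compile cleanly and still look wrong.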
For a while we accepted docx submissions and had a back-end system for
processing them to PDF. We were unable to get/maintain this processing
system such that the output had reliable fidelity. We frequently had to
contact submitters out-of-band to get them to generate a PDF on their
system and mail it to us, which was unsustainable. We now request PDF
generated by the user (would be nice if they submitted PDF/A...).
Approximately 10% of arXiv submissions are PDF, generated from Word and
other systems. A small but non-negligible fraction of PDF documents have
rendering problems (e.g. fonts not included, different rendering with
different software, local font substitution, ...). PDF is still probably the
best baseline option, but it is by no means perfect.
While I agree that adoption of a much simpler and more "semantics rather
than format" oriented language such as HTML would likely be good, I
can't see that happening widely in the near future. And if it did, then
users would do everything within their power to use it to get exactly
the format they want, ignoring any semantic conventions and relying on
quirks of the browser they see it rendered in... reducing eventual
fidelity on other environments.
I agree that source (Word, TeX, etc.) plus PDF is the best strategy at
present.
Cheers,
Simeon
On 7/27/12 5:42 AM, Tim Brody wrote:
> There are several aspects to how useful a format is for digital
> preservation:
> 1) Is the format itself open and well-defined, such that multiple products
> exist that can render the contents accurately?
> 2) Is the format "lossy" (either in semantics or quality)?
> 3) Most importantly, how widely used is the format - big herds have better
> survival rates
>
> I'm on the mailing list for poppler PDF, which is the most popular PDF
> library on Linux. You may be surprised by the difficulty of rendering PDF
> reliably, indeed they often have to refer to the *implementation* in Adobe
> Reader to decide how a particular PDF feature should look. But you can say
> with confidence that PDF can be rendered (near 100%) accurately on all
> platforms, in a variety of tools (server and client).
>
> I disagree with Les' assertion that because "docx" is a standard we
> are in a better position than we were with Word "doc". I'm sure it's a
> common experience in the community to see someone struggling with a
> presentation that has gone haywire due to an Open Office/MS Office
> interchange. Sure, docx is easier to hack on but, as has been pointed out,
> "doc" is such a widespread format that tools have been created to extract
> raw data from it.
>
> The biggest problem for MS Word as a preservation format is the lack of
> server tools. Not even Microsoft can sell you a tool for accurately
> processing Word (PowerPoint etc.) on the server - rendering of Office
> documents is bound up in the nuances and quirks of the closed-source Office
> suite.
>
> Of course the most successful format, available on by far the most
> platforms and from the most vendors, is HTML. As the Semantic Web/schema.org gain
> traction the amount of information stored in HTML will dwarf that in
> dead-tree formats like Word and PDF (if it doesn't already).
>
> So, as has been already suggested, placing the source (Word, TeX, etc.)
> plus a PDF on the repository is the best current strategy. It may be
> prudent to not make the source public due to the leaking of private data
> e.g. Word embedding the host machine's MAC address. The PDF should
> *hopefully* bundle up all of the dependencies of the document (fonts in
> particular), meaning readers will see what the writer intended.
>