Oh dear, I knew I would poke my head in eventually.
Nice, and informative, thanks Simeon.
> I agree that source (Word, TeX, etc.) plus PDF is the best strategy at present.
To generalise this, you could have said:
"I agree that source (Word, TeX, etc.) plus image (PDF, jpg, etc.) is the best strategy at present."

I too have had significant problems in the past with PDFs missing fonts, etc., although that seems to have gone away.
But I usually manage to rescue jpg images, whatever has happened as I moved computers.

As far as source is concerned, having a source where there is ASCII text is always a good thing, hence LaTeX, even with all the serious problems of keeping up to date with ever-changing and ever-growing libraries.
And hence Word is good, as people have said, although it is important to specify docx (as you do). Unfortunately, it seems people use Word 2007 or a similar format rather than docx, which seriously compromises forward compatibility of the data for the sake of a transient backwards compatibility of the software.

Best
Hugh

On 27 Jul 2012, at 14:49, Simeon Warner wrote:

> I think Tim's summary is right on the money. There are problems with all three formats (docx, TeX and PDF) being discussed.
> 
> At arXiv approximately 90% of our 75k submissions/year come as TeX source. We put very significant effort into maintaining a TeX processing system that can process the ~700k article historical corpus of TeX documents. This has a set of different style/class trees to deal with 20 years of articles. Every time we change TeX version or revamp the style/class trees there is an extensive regression test process involving checking that we can still process the whole corpus, and visual inspections of a significant number of processed PDFs. (A nice part of this story is that prior to 1996, before PDF was widely used, we had the option of collecting TeX source or 300dpi PostScript from users. Because we took the TeX source we can now produce high-quality PDF from these articles, as opposed to being stuck with horrible PDF generated from low-resolution bitmap PostScript.)
> 
> For a while we accepted docx submissions and had a back-end system for processing them to PDF. We were unable to get/maintain this processing system such that the output had reliable fidelity. We frequently had to contact submitters out-of-band to get them to generate a PDF on their system and mail it to us, which was unsustainable. We now request PDF generated by the user (would be nice if they submitted PDF/A...).
> 
> Approximately 10% of arXiv submissions are PDF, generated from Word and other systems. A small but non-negligible fraction of PDF documents have rendering problems (e.g. fonts not included, different rendering with different software, local font substitution, ...). Still probably the best baseline option but by no means perfect.
> 
> While I agree that adoption of a much simpler and more "semantics rather than format" oriented language such as HTML would likely be good, I can't see that happening widely in the near future. And if it did, then users would do everything within their power to use it to get exactly the format they want, ignoring any semantic conventions and relying on quirks of the browser they see it rendered in... reducing eventual fidelity in other environments.
> 
> I agree that source (Word, TeX, etc.) plus PDF is the best strategy at present.

> 
> Cheers,
> Simeon