RE: PDFs lock data away RE: PLoS business models, global village

There is a "best of both worlds" alternative that the community might want to consider.

This is a combination of an OpenOffice Writer document (which conforms to the new Open Document Format standard; see http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office) and a PDF, packaged as a zip file both for efficiency and also to provide firm binding of the two renderings.
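
As a rough illustration of the packaging idea, here is a minimal sketch in Python. It is not a description of any existing tool: the function name `bundle_renderings` and the file names are hypothetical, and it assumes the .odt and .pdf renderings already exist on disk.

```python
import zipfile
from pathlib import Path


def bundle_renderings(odt_path, pdf_path, bundle_path):
    """Pack an editable .odt and its .pdf rendering into one zip archive.

    Hypothetical helper for illustration only. Returns the sorted list of
    member names so the caller can confirm both renderings are present.
    """
    odt, pdf = Path(odt_path), Path(pdf_path)
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Store each file under its bare name so the two renderings
        # sit side by side at the top level of the archive.
        bundle.write(odt, arcname=odt.name)
        bundle.write(pdf, arcname=pdf.name)
    # Re-open the archive read-only to list what was actually written.
    with zipfile.ZipFile(bundle_path) as bundle:
        return sorted(bundle.namelist())
```

Calling `bundle_renderings("paper.odt", "paper.pdf", "paper.zip")` would leave a single archive holding both renderings, so neither can be circulated without the other.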

Best wishes, Henry
 
H.M. Gladney, Ph.D.   http://home.pacbell.net/hgladney  

-----Original Message-----
From: Repositories discussion list [mailto:[log in to unmask]] On Behalf Of Simeon Warner
Sent: Sunday, December 10, 2006 6:48 AM
To: [log in to unmask]
Subject: Re: PDFs lock data away RE: PLoS business models, global village

I think Falk replied to Leslie somewhat at cross purposes.

PDFs are certainly a bad choice for data which should be made available in raw formats and/or appropriate interchange formats. However, that says nothing against making the PDFs of articles available! We should do both.

I would argue that, at present, PDF is the most openly accessible format for textual documents where layout/presentation is at least somewhat important. Good/free PDF viewers are available and easy to install so by making a PDF openly available you make a document available to the entire internet-connected research community. (For work with alternative formats I think the US NLM/NCBI work with source documents in XML according to NLM DTD and rendering into XHTML or PDF on demand is an exemplar. However, as there aren't good authoring tools, this isn't an option for self-archiving.)

An amusing note on data mining: At arXiv.org we have collected TeX source of research papers for many years. In recent years, a number of projects have done work on automatic citation extraction (including Citebase) and other data mining on arXiv documents -- most have opted to work from the processed PDF of papers (essentially "page scraping") rather than from the raw TeX source. The reason is that in some sense there is more uniformity in the PDF output (which tends to adhere to our presentation norms) than in the TeX source (which supports a great many alternative ways of doing things and, in general, requires a full TeX engine to understand).
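
To make the TeX-source problem concrete, here is a toy sketch in Python. It is not how Citebase or any arXiv project works: the regular expression, the function name `naive_cite_keys`, the `\see` macro, and the citation keys are all made up for illustration. It shows a pattern-based extractor being fooled by an author-defined macro.

```python
import re

# Naive pattern: \cite, \citep, \citet*, ... with optional [..] arguments.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def naive_cite_keys(tex_source):
    """Collect citation keys by pattern matching, without expanding macros."""
    keys = []
    for match in CITE.finditer(tex_source):
        keys.extend(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


# Hypothetical TeX fragment; keys and the \see macro are invented.
sample = r"""
\newcommand{\see}[1]{(see \cite{#1})}
Standard form \cite{smith2001}.
Natbib form \citep[p.~4]{jones2004, lee2005}.
Author macro \see{hidden2006}.
"""

print(naive_cite_keys(sample))
```

The scan picks up the macro placeholder `#1` as if it were a key and misses `hidden2006` entirely, because only a real TeX engine would expand `\see` into a `\cite`.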

Cheers,
Simeon


On Sun, 10 Dec 2006, Falk Huettmann wrote:
> Dear Leslie et al,
>
> sure, I agree in concept, but not in reality.
>
> PDFs are used to lock data away, to make them unusable.
> Many data sets, e.g. text files, exist already in good digital form, and
> become unusable once presented
> as PDFs. So it's truly a step backwards.
>
> PDFs just support the concept of FEAR ("uh, somebody dares to use my
> information...").
> Whereas we all know and support: "The value of data lies in its use"!
>
> So we could, and should, use the raw data and text Files instead.
>
> In addition, we then get these clumsy PDFs that crash all the time.
> PDFs are NOT an option, and should not be used any further.
>
> They just support the notion of 'change for no change' (yeah, it's digital
> and online, but...).
>
> Creating PDFs costs money, too, and the funds should be invested more wisely
> instead.
>
> I am a user of public online data for over 10 years, and we have that
> problem frequently.
> We even re-digitized major PDF documents with raw data tables, into useable
> datasets and put them online.
> And there is software out there that does exactly that.
> Isn't that somewhat silly? Are we re-inventing the wheel here ?
>
> Anyways, let's see where we go with it.
>
> Kind regards
>
>     F.
>
> Falk Huettmann PhD, Assistant Professor
> -EWHALE lab- Biology and Wildlife Dept., Institute of Arctic Biology
> 419 IRVING I, University of Alaska Fairbanks AK 99775-7000 USA
> Email [log in to unmask]  Phone 907 474 7882 Fax 907 474 6716
>
>
>  _____
>
> From: Leslie Carr [mailto:[log in to unmask]]
> Sent: Sunday, December 10, 2006 12:29 PM
> To: Falk Huettmann
> Cc: [log in to unmask]
> Subject: Re: PLoS business models, global village
>
>
> On 10 Dec 2006, at 08:27, Falk Huettmann wrote:
>
>
>
> Am I correct to say that PDFs are not part of true OpenAccess (raw data,
> shared analysis) and should be fully abandoned/replaced ASAP ?
>
> "True Open Access" is a hitherto unidentified specialisation of "Open
> Access". The latter simply requires research outputs to be accessible to
> everyone, without let or hindrance, now or in the future.
>
> Perhaps you are suggesting that PDFs are not an optimal information exchange
> vehicle - and many people (data miners) would agree with you. However, PDF
> files are the majority means of dissemination, and while we await the Next
> Great interoperability format (presumably based on XML) together with the
> easy-to-use tools to go with it, we should continue making PDFs open access
> with all our energy.
> --
> Les Carr
>
>