On the Deep Disanalogy
Between Text and Software and
Between Text and Data
Insofar as Free/Open Access is Concerned
Stevan Harnad
It would be a *great* conceptual and strategic mistake for the movement
dedicated to open access to peer-reviewed research (BOAI)
http://www.soros.org/openaccess/ to conflate its sense of "free vs.
open" with the sense of "free vs. open" as it is used in the
free/open-source software movements. The two senses are not at all the
same, and importing the software-movements' distinction just adds to
the still widespread confusion and misunderstanding that there is in
the research community about toll-free access.
I will try to state it in the simplest and most direct terms possible:
Software is code that you use to *do* things. It may not be enough to
let you use the code for free to do things, because one of the things you
may want to do is to modify the code so it will do *other* things. Hence
you may need not only free use of the code, but the code itself has to
be open, so you can see and modify it.
There is simply *no counterpart* to this in peer-reviewed research
article use. None. Researchers, in using one another's articles, are
using and re-using the *content* (what the articles are reporting), and
not the *code* (i.e., the actual words in the text). Yes, they read the
text. Yes (within limits) they may quote it. Yes, it is helpful to be able
to navigate the code by character-string and boolean searching. But what
researchers are fundamentally *not* doing in writing their own articles
(which build on the articles they have read) is anything faintly analogous
to modifying the code for the original article!
I hope that that is now transparent, having been pointed out and written
in longhand like this. So if it is obvious that what researchers do with
the articles they read is not to modify the text in order to generate a
new text, as programmers may modify a program to generate a new program,
where did this open/free source/access conflation come from?
There is a second conflation inherent in it, namely, a conflation between
research publishing (i.e., peer-reviewed journal articles) and public
data-archiving (scientific and scholarly databases consisting of the
raw and processed data on which the research reports are based).
Digital data archiving (e.g., the various genome databases, astrophysical
databases, etc.) is relatively new, and it is a powerful *supplement*
to peer-reviewed article publishing. In general, the data are not *in*
the published article, they are *associated with* it. In paper days, there
was not the page-allotment or the money to publish all the data. And even
in digital days, there is no standardized practice yet of making the raw
data as public as the research findings themselves; but there is definite
movement in that direction, because of its obvious power and utility.
The point, however, is this: As of today, articles and data are not
the same thing. The 2,000,000 new articles appearing every year in the
planet's 20,000 peer-reviewed journals (the full-text literature that
-- as we cannot keep reminding ourselves often enough, apparently --
the open/free access movement is dedicated to freeing from access-tolls)
consists of articles only, *not* the research data on which the articles
are based.
Hence, today, the access problem concerns toll-access to the article
full-texts of 2,000,000 articles published yearly, not access to the
data on which they are based (most of which are not yet archived online,
let alone published; and, when they *are* archived online, they are often
already publicly accessible toll-free!). No doubt research practices will
evolve toward making all data accessible to would-be users, along with the
articles reporting the research findings. This is quite natural, and in
line with researchers' desire to maximize the use and hence the impact
of their research. What may happen is that journals will eventually include
some or all of the underlying data as part of the peer-reviewed publication
itself (there may even be "peer-reviewed data"), but in an online digital
supplement only, rather than in the paper edition.
(What is *dead-certain* is that, as this happens, authors will not
be idiotic enough to sign over copyright to their research data to their
publishers, the same way they have been signing over copyright to the
texts of their research reports! So let's not even waste time on that
implausible hypothetical contingency. The research community may be slow
off the mark in reaching for the free-access that is already within its
grasp, but they have not altogether taken leave of their senses!)
But that bridge (digital data supplements), if it ever comes, can be
crossed if/when we get to it. Right now, when we are talking about
the peer-reviewed literature to which we are trying to free access, we
are talking about *articles* and not about *data*. Hence, exactly as
in the conflation of text with software in the incorrect and misleading
open/free source analogy, the conflation of open/free full-text access to
the refereed literature with hypothetical questions about data-access
and data re-use and re-analysis capability is simply incorrect and
misleading. The two are different, and it is only the first that is at
issue today.
Open/free access -- in this flurry of definitional fussiness and fancy
one no longer knows which word to use! -- to the refereed research
literature is already vastly overdue, even though it has been 100%
within our practical reach for several years now.
http://cogprints.soton.ac.uk/documents/disk0/00/00/16/85/index.html
Research usage and impact and productivity are still being needlessly
lost daily, in untold quantities, because of access-denial by
toll-barriers. Why on earth do we keep wasting our time, energy
and attention on minor diversions and irrelevancies, while keeping
the solution to the real, pressing problem on hold, as we ponder the
ramifications of incoherent analogies with software and with
data-archiving, when there is a real job to be done: freeing (sic)
full-text access to the planet's yearly 2,000,000 peer-reviewed research
articles, now!
http://www.nature.com/nature/debates/e-access/Articles/harnad.html
I will now quote/comment this latest variant of that Protean microbe
that keeps on causing us Zeno's Paralysis on the road to the optimal
and inevitable. In the past, the source of this persistent virus
and its ever-mutating variants had been the opponents of free
access (toll-access publishers), as well as its over-timorous
potential beneficiaries (researchers, librarians, administrators).
http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm#8 But now the
paralysis-inducing bug is also originating from the ranks of free-access
activists, who risk balkanizing the free-access movement by driving a
conceptual wedge between "free" and "open," despite the fact that nothing
substantive is to be gained, and only more time to be lost thereby. I
will pass to quote/comment mode to illustrate this:
On Thu, 14 Aug 2003, Matthew Cockerill wrote:
> The open source software community [uses] the shorthand 'free, as in beer'
The open/free distinction in software is based on the modifiability of the
code. This is irrelevant to refereed-article full-text. (And the beer
analogy was silly and uninformative in both cases! Lots of laughs, but
little light cast.)
> Sure, if you are given some limited access to something and that access is
> 'free, as in beer', that can be very useful.
> In the world of software, say, that would apply to Windows Media Player,
> which you can download for free from the Microsoft website (even though the
> software itself is highly proprietary, and Microsoft would not take kindly
> to you reverse-engineering it or distributing a modified version).
This is all irrelevant to article-access, except that toll-access
publishers can, like every other product- or service-provider, use partial
or temporary access as a marketing "hook." Temporary access is not free
access (or rather it is free access only while it is free). And partial
access is free only for whatever it is access to, not for what it is
not access to. (We're all "non-smokers" while we are asleep...)
But none of this provides any basis at all for the analogy with
proprietary code, as in software, nor with any need for code
modifiability, whatsoever.
> But free/open source software is more than 'free as in beer', it is 'free as
> in speech', and this offers hugely significant extra freedoms (which is why
> open source software has had such a revolutionary effect on the software
> industry).
This free beer/speech analogy was already dubious in the software case
(not all programmers wish to give away their code [the freedom to produce
non-give-away products/services is a freedom too!], either for use or
for modification, or both; and my speech, whether spoken or written,
is spoken/written for you to hear, not for you to claim to have been
your own words, whether in unaltered or altered form; and we are free
to say or write what we like, as long as it is indeed our own, etc. etc.).
But never mind. We will not try to repair another domain's incoherent
analogy here; but, please, let us not import it where it just sows still
more confusion in an already confused terrain: Refereed-research-article
authors (unlike the authors of most other forms of "written speech")
are not interested in earning access-royalties from the sale or use of
their words. They just want their words *used,* as much as possible. (That's
"research impact.") But to use their words is not to modify their *form*
(the code) and then re-issue them, perhaps as the modifier's own. To use
their words is to use their *content*, by incorporating that content
into the user's own content, in his *own* words, with proper source
attribution, so as to produce another text, another "written speech."
It would be nice if all programmers were willing and motivated to make
all their code free, not just for use, but for modification too. It would
also be nice if the writers of all words were willing and motivated to
make their words free, not just for use, but for modification too. But
alas humans and their egos are monadic, not distributed and diffuse,
and their motivation is usually local, and quid pro quo. So there will
always be programmers who program only if it pays, and they may want the
credit as well as the first-dibs at modification and development. Nolo
contendere there.
But the same is true of writers. Some will always want to be paid for
access to their words, and virtually all will want to keep their own
words as their own.
http://cogprints.ecs.soton.ac.uk/archive/00001700/index.html
Refereed-article writers, however, don't want to be paid for access to
their words, because access-tolls reduce the usage of their work, which
is what they really want to maximize (because that research impact is
what brings them their rewards, both financial and
scholarly/scientific). Because the words are in natural language, there
is no question of researchers concealing their code (if they choose to
publish at all). But what they want you freely using is its *content*
(with proper attribution). There is no question of modifying its form. As
software does not have this form/content duality, the analogy simply
does not apply; it is incoherent.
> The Free Software Foundation defines these freedoms as:
> * The freedom to run the program, for any purpose (freedom 0).
Inapplicable to text: "Running the program" is accessing the text.
> * The freedom to study how the program works, and adapt it to your needs
> (freedom 1). Access to the source code is a precondition for this.
Irrelevant to text. You may study and use the content of my (giveaway,
refereed-article) text (with attribution) in any way you like, and you
may quote it (with attribution). That's all. And there all analogy
between text and software ends.
There are also many new software-based uses (indexing, search,
navigation, digitometric analyses) that one can make of online text,
which refereed-article authors also welcome, but the big hurdle is free
full-text access, and not these perks, which will come with the territory.
But no reprocessing of *my* text code in order to turn it into *your*
text code (other than via its content, as processed by your brain)!
(And remember that data, and data-processing, are not part of
refereed-article text.)
> * The freedom to redistribute copies so you can help your neighbor (freedom
> 2).
Moot for text, when all you need redistribute is the URL of its toll-free
full-text online.
> * The freedom to improve the program, and release your improvements to the
> public, so that the whole community benefits (freedom 3). Access to the
> source code is a precondition for this.
> (see http://www.gnu.org/philosophy/free-sw.html )
Irrelevant to refereed-article text. You may improve on the content, in
text of your own, with proper attribution. (And again, data re-analysis
is an orthogonal matter.)
> This philosophy fits exceptionally well with the needs of the scientific
> community to share and build on each others research, which is why very many
> academic software development projects are developed using an open source
> model.
Scientific *software*. But we were talking about scientific-article
*text*, and this was supposed to be an analogy! There is no counterpart
to collective software development at the article-code level. It is only
content that the scientific community develops collectively, and even
that, while tracking attribution through citation.
Nor did the collective, cumulative use of scientific content require any
cues from the software community! Open-source *content* has been the
rule with scholarship for centuries: That's why scholars *publish*. The
new question is only about access to their content (via their text)
online. Please let's not forget or obscure that fundamental new question
in this welter of free-associative digital analogies of doubtful
relevance and coherence.
> BioMed Central's policy of Open Access is based on giving the scientific
> community a similarly broad freedom to make use of the research articles
> that we publish.
The scientific community already has the freedom to make use of
published articles. What it lacks is toll-free access to their texts!
> This includes giving access to the structured form of the articles,
We're back to XML mark-up again: a perk, a welcome perk, but we first,
and far more urgently, need the basics, namely, toll-free access to the
full-text. Please let us focus on that, rather than getting side-tracked
onto perks, especially those that make it seem as if free access were
somehow not enough, somehow not "truly open." We don't have free access
today. We don't need advice on the shortcomings of free access; we need
help in getting free access, as soon as possible.
> and giving the right to redistribute and create derivative works
> from the articles.
I've already replied to this in an earlier posting: When the full-text
is online and toll-free, the only relevant mode of "redistribution" is
to distribute the URL. Ditto for "derivative works." Quotes, as always,
require attribution. And text without attribution may be neither "re-used"
nor modified. So what is really the point here?
> This isn't just a philosophical issue - it has practical implications:
>
> e.g. in the August 14 issue of Nature (Vol 424 p727), Donat Agosti, from the
> American Museum of Natural History, New York, laments the fact that the
> www.antbase.org database of ant taxonomy is missing much critical
> information because a large fraction of all descriptions of new ant species
> are covered by publisher copyright.
I couldn't follow this. If the database is toll-free, the database is
toll-free. If making the database useful requires toll-free access to
the full-text of refereed-articles, then the full-text of
refereed-articles needs to be made toll-free! We knew that already!
What is the point of all these further free-associations and free-floating
analogies? We are running in circles instead of breaking out of the
circle.
> In a true Open Access environment, not only could Antbase link to the
> articles on the publishers web site, but it could also make use the images
> and the text within those published descriptions to compile a universal and
> authoritative catalog of Ant taxonomy.
Translation: We need free access not only to the database, but to the
full-text. This can be clearly seen without conflating the two. (Please
jettison this "true open access" locution, or save it for when we have
universal false-but-toll-free full-text access, and we have nothing
more urgent left to do than to optimize it further. My guess is that
the rest will already have come with the territory of its own accord. But
please, let's go for the territory, before the "truth" -- see Keats
quote at end).
> Finally, to respond to Sally's point questioning the benefits of
> deposition in a standard repository:
I re-read Sally Morris's point, and I now see that (in agreeing on #5)
I misconstrued it as addressing only the trivial differences between
the types of "databases" -- "archives," "repositories": how we unfailingly
prefer to fuss with and multiply terminological trivia instead of
staying focussed on matters of substance! -- in which a full-text might
be deposited (e.g., Eprints vs Dspace, or central vs. institutional). I
now realize that Sally was referring there to BioMedCentral's (BMC's)
[requirement? recommendation?] that BMC authors archive their BMC
full-texts in an open-access database such as PubMed Central. Hence what
my reply to Sally should have been was this:
>sh> 5) Whether the item and/or its metadata are deposited in certain
>sh> types of databases (this last seems to me supremely irrelevant)
I agree it's irrelevant, if by "certain
type" you mean, say, Eprints vs. Dspace.
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2670.html
But it's certainly not irrelevant whether the item (full-text)
is deposited in *some* type of database *at all*, for if it
is not deposited in a free-access database of *some* type,
it is not free access!
Whether that database type is institutional and distributed,
disciplinary and central, or the toll-free access database of an
open-access or a toll-access publisher is an implementational
and strategic matter. And whether or not that database is
OAI-compliant is a matter of functionality and efficiency
(OAI-compliant databases greatly preferred!).
> Although theoretically it might not matter where something is available, or
> in what format, it should be clear that in practical terms these are
> absolutely vital issues.
Absolutely vital *relative to what*? In practical terms, we do not
have free full-text online access to most of the refereed literature
(2,000,000 annual articles, in 20,000 refereed journals) today. What
is absolutely vital is getting that free access, now, and putting an
end at last to the needless daily impact-loss that continues until that
happens. Whether that free access is via this type of archive or that,
and has or lacks these perks or those, is certainly not the absolutely
vital issue today. On the contrary, foregrounding such minor details
when we still lack the basics, and thereby raising the goal post for
what we should all be aiming for, slows and diverts rather than speeds
progress.
Free access, now! Never mind the rest until we have those long-overdue
basics in hand, at last!
> So for example, theoretically, every DNA sequencing
> lab could put up its own web page and make available the sequences they
> themselves have obtained, using their own choice of format. The scientific
> community would thereby have free access to all those DNA sequences.
Correct. And this has absolutely *nothing* to do with the free-access
movement, which is about toll-free access to the 2M articles in the 20K
toll-access journals, not about data-archiving, which is a parallel but
independent development that proceeds apace, and does not need
free-access's (or publishers') permission! (Data-archiving, on the
other hand, might help accelerate article-archiving!)
http://www.ecs.soton.ac.uk/~harnad/Temp/data-archiving.htm
> But in
> fact, the deposition of all DNA sequences in a standard format with Genbank
> has a truly enormous benefit in practical terms, and has served as a crucial
> foundation for the development of tools to mine the genome. PubMed Central's
> role as a repository for biomedical research articles is very much
> analogous to Genbank's role as a repository for DNA sequence data.
An archive is an archive. There is an analogy (as well as a
complementarity) between data-archives and article-archives, but the
big difference is that both data archiving and data-archives are (1)
new, and (2) do not have a prior tradition and current status quo of
being non-free, whereas articles are (1) old, and (2) do have a prior
tradition and current status quo of being non-free. Publishers' relatively
new toll-based online article-archives are also non-free. So the relevant point
about article archiving is that article-archives should be free.
"that is all ye know on earth, and all ye need to know"
Stevan Harnad
NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):
http://amsci-forum.amsci.org/archives/september98-forum.html
or
http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html