Re: Free Access vs. Open Access

Stevan Harnad 11 Aug 2003 23:10 UTC
On Mon, 11 Aug 2003, Matthew Cockerill wrote:

>sh>       "The use one makes of those full texts is to read them,
>sh>        print them off, quote/comment them, cite them, and use
>sh>        their *contents* in further research, building on them.
>sh>        What is "re-use"? And what is "redistribution" (when
>sh>        everyone on the planet with access to the web has access
>sh>        to the full-text of every such article)?"
>
> Having free access to articles on the publisher's website would certainly
> offer progress compared to the current status quo. But it would not offer
> anything like the benefits of true open access.

Free access to the current 20,000 journals (2 million articles yearly)
would be like the difference between night and day. Compared to that,
the difference between "free" and "true open" access amounts to just a
few degrees of luminosity.

But let me agree at once that if free access were gerrymandered so
all the user could do was to browse the text on-screen, without being
able to download, save, grep, or print-off, then that would indeed
arbitrarily limit free access's usefulness. How many (if any) of the
several million free-access refereed-journal articles currently on the
web, however -- whether BOAI-1, BOAI-2, or otherwise -- are gerrymandered
in that way? If (as I suspect) the answer is "very few" or even "none
that I know of," then this hypothetical constraint is not worth another
moment's thought or energy diverted from the real task at hand, which
is to turn night into day, as soon as possible.

> Here are just some of the
> reasons why re-use and re-distribution rights are vital to open access:
>
> (1) Digital permanence - it is not enough for the publisher to be the only
> body which curates the full archive of published research content. To ensure
> long term digital permanence of the scientific record, it is vital that
> articles should be deposited with multiple archives, and redistributable
> from and between those archives.

It seems to me that this is conflating (arbitrarily) two completely
independent matters. One is toll-free online *access* to the articles
in the 20K journals that are currently only accessible via tolls. The
other is the *preservation* of that toll-based corpus.

Well, preservation of that toll-based corpus was always a concern, in
on-paper days as in on-line days, and the concern has nothing whatsoever
to do with free (or open) access! We could have a failsafe preservation
system without free access, or we could have a failsafe preservation
system with free access; or we could have an uncertain preservation system
without free access (as we do now) or an uncertain preservation system
with free access (bringing the present system out into the light of
day).

The preservation burden has to be (and will be, and is being) faced in
any case. Why on earth should that entirely orthogonal longterm
task be coupled in *any way* to the immediate and urgent problem of free
access today? And why should "open access" be linked with or defined in
terms of the eventual solution to the preservation problem, one way or
the other? (This is not an argument for indifference to preservation: it
is an argument for decoupling two completely independent desiderata.)

> (2) A flexible choice of tools for searching and browsing
> The reason that Google exists is because the web is free for anyone to
> download and index. As a result, there is competition among search engines,
> and Google had the incentive to develop a better system for indexing web
> pages, which has since driven other search engine companies to improve the
> tools they offer.
>
> Compare this with the situation with scientific research. If the research
> resides only on the publisher's site, you don't have a free choice of what
> tools you use to search and browse it - you are stuck with what that
> particular publisher provides you with.

We are quite squarely in the domain of hypotheticals here. (Which
publisher's free-access corpus, inaccessible to google, are we talking
about?) But let us suppose that a publisher provides free access --
not gerrymandered free access, but free access that allows downloading,
saving, grepping and printing:

First, I will bet that such a publisher will want to maximize the
visibility and impact of his contents by allowing at least the indexing
metadata to be harvested, both by google, and by the OAI search engines
specializing in the refereed journal literature.

But even if we get doubly hypothetical here, and suppose the publisher
does *not* disclose the metadata to harvesters, there is
still a super-simple solution: Every author has an online
CV. Their CV will contain the metadata for every one of their
journal publications. (Such CVs can and will be OAI-compliant:
http://paracite.eprints.org/cgi-bin/rae_front.cgi ).
Add the URL for the free-access full-text on the publisher's website to
your CV entry and the circle is closed. (Better still, also self-archive
the full text in your own institutional OAI-compliant repository!)
End of story.
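
To make the CV gambit concrete, here is a minimal sketch in Python of one CV entry rendered as a flat Dublin Core record, the kind of metadata an OAI harvester could pick up. Everything in it is an assumption for illustration: the function name cv_entry_to_dc, the bare "record" wrapper element, and the example titles and example.org URLs are hypothetical and are not taken from any existing CV or repository software. Only the Dublin Core element namespace is real.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Dublin Core fields this sketch copies from a CV entry, in output order.
DC_FIELDS = ("title", "creator", "date", "source", "identifier")


def cv_entry_to_dc(entry):
    """Render one CV entry (a plain dict) as a flat Dublin Core XML record.

    Hypothetical helper for illustration only: a real OAI-compliant CV
    would wrap such a record in the full OAI protocol envelope.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for field in DC_FIELDS:
        values = entry.get(field, [])
        if isinstance(values, str):
            values = [values]  # allow a single string or a list of strings
        for value in values:
            child = ET.SubElement(record, "{%s}%s" % (DC_NS, field))
            child.text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical CV entry: the two identifiers are the free-access copy on
# the publisher's site and the self-archived copy in the author's
# institutional repository.
entry = {
    "title": "An Example Article",
    "creator": ["Author, A.", "Coauthor, B."],
    "date": "2003",
    "source": "Journal of Examples 1(1)",
    "identifier": [
        "http://publisher.example.org/articles/123",
        "http://eprints.example.edu/archive/456",
    ],
}

xml_record = cv_entry_to_dc(entry)
print(xml_record)
```

The point of listing both URLs as identifiers is that the record closes the circle whichever copy a harvester or reader follows.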

> This ties in with developments in Grid computing (e.g.
> http://www.escience-grid.org.uk/ ). With open access, published research
> would be available "on tap" via the grid, and scientists would be able to
> use their preferred choice of grid tools to access the data, rather than
> being stuck with the tools provided by the publisher.

As stated above, the CV/OAI gambit already trivially takes care of
closing the circle.

I agree, though, that for many research purposes, it is beneficial to
have not just the metadata but the full-text inverted and indexed, as
well as agent-harvestable. Again, if the publisher's free-access site
doesn't do this, the author's institutional site certainly can and will.
In fact, authors and their institutions are the ones with the most
direct interest in making sure their own research output is maximally
usable in this way.
http://www.ecs.soton.ac.uk/~harnad/Temp/unto-others.html

Let us not, however, conflate article-text archiving with
data-archiving. Data-archiving is important too, but it is an extra:
an independent new bonus of the online era, having nothing to do with
the question of toll-free access to article-texts. In the paper era, raw
data were not published, just summarized in what was published. Eventually
data will no doubt be incorporated into online publications in some way,
but until then there is certainly no need for authors to wait! They
can publish their article, as before, and, in addition, self-archive
the data on which their article is based in their own OAI-compliant
institutional research repository (the same repository in which
the full-text of their article can and should be self-archived too,
whether it appears in an open-access journal, a toll-access journal, or a
toll-access journal that offers toll-free access too). Again, the online
CV can close the circle, if it is not already closed of its own accord.

And this way, although it is functionally independent, data-archiving
can help speed the progress toward toll-free full-text access too.

> (3) Datamining
>
> With a million or so biomedical research articles being published each year,
> the sheer volume of output is an obstacle to the comprehension and synthesis
> of the results reported in that research. If the XML of the articles can be
> brought together in one place then the tools of datamining can be applied to
> it to extract useful but non-obvious information.

Agreed. See above. But before we get carried away with the potential
perks, let's not forget the still absent basics: Let there be Light
(toll-free full-text access), now! Leave the Solar-Energy and Club-Med
projects for when we already have our daily fill of photons.

> The simplest type of datamining is citation analysis
>
> Currently you need to pay ISI a lot of money to find out what cites what,
> but with true open access, citation analysis becomes trivial.

Perhaps not quite trivial. (There's still the problem of parsing,
identifying and linking the citations for all those articles without the
ultimate mark-up. But we're working on it: http://opcit.eprints.org/ ).

But again, this is an independent perk, because you could have universal
citation linking and analysis even *without* toll-free full-text access!
For an article's reference list, like its indexing metadata (and its
accompanying empirical data), can be self-archived by the author (guess
where?). We are in fact promoting this solution for royalty-based books,
whose authors, unlike journal article-authors, are unlikely to want to
make their full-texts accessible toll-free. Their metadata and reference
lists, however, are another matter, and can (and will) be tucked into
the institutional OAI-compliant repository too, with a new indicator of
global book citation impact as the harvestable reward.
http://www.ariadne.ac.uk/issue35/harnad/
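
As a rough sketch of how a citation count could be derived from self-archived reference lists alone, with no access to the full texts, consider the Python below. It assumes the hard part noted above (parsing, identifying and linking the citations) has already been done, so that every reference is resolved to an identifier. The function name citation_counts and the identifiers such as "book:X" are hypothetical, chosen only for illustration.

```python
from collections import Counter


def citation_counts(reference_lists):
    """Count how many archived works cite each work.

    reference_lists maps the identifier of a citing work to the list of
    identifiers found in its self-archived reference list. Each citing
    work counts at most once per cited work, and a work's reference to
    itself is ignored.
    """
    counts = Counter()
    for citing_id, cited_ids in reference_lists.items():
        for cited_id in set(cited_ids):  # collapse repeated references
            if cited_id != citing_id:
                counts[cited_id] += 1
    return counts


# Hypothetical harvested reference lists (identifiers already resolved).
reference_lists = {
    "article:A": ["book:X", "article:B"],
    "article:B": ["book:X"],
    "book:Y": ["book:X", "article:A", "article:A"],
}

counts = citation_counts(reference_lists)
for work, n in counts.most_common():
    print(work, n)
```

Note that "book:X" gets its count without its own full text ever being accessible: only the reference lists of the works citing it need to be harvestable.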

> So, for example, if you view a PubMed record:
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11667947&dopt=Abstract
> you already get links to all the full text articles in PubMed Central which
> cite that PubMed item
> http://www.pubmedcentral.gov/tocrender.fcgi?action=cited&tool=pubmed&pubmedid=11667947

And if you look at citebase, you will see how this generalizes to the
entire OAI-compliant literature:
http://citebase.eprints.org/cgi-bin/search

> The more true open access research that is published and archived at PubMed
> Central, the more useful this becomes for biomedical researchers. [Sure,
> "screen-scraping" HTML from free articles displayed on publisher sites could
> give some citation information, but with nothing like the ease, accuracy and
> reliability that it can be obtained with the use of XML data, as at PubMed
> Central].

Fine. But I'd rather have toll-free access to all 20K journals right
now, rather than waiting for these XML perks -- wouldn't you?

Again, toll-free access is one thing -- and extremely important,
already reachable, and already overdue -- and potential perks such as
citation-based navigation are another. Let there be light first; then we
can worry about calibrating the photometers on our Yashicas.

> Beyond citation analysis, there are many other forms of datamining that are
> possible:
> For more information see:
> http://www.biomedcentral.com/info/about/datamining/
>
> e.g. Research articles can be mined for details of protein interactions
> http://bioinfo.mshri.on.ca/prebind/

See above. Right now, it is an indisputable fact that open-access
publishing today (BOAI-2) is the solution only for that 5% of the literature
(of 20K journals) that has a suitable open-access journal today. The
immediate solution for all the rest is self-archiving (BOAI-1), rather
than continuing to wait for more open-access journals to spawn and grow.

(If, in the meanwhile, toll-access publishers also want to help hasten
things along by providing free access, they are certainly welcome
to do so! I still regret -- for the sake of open access --
that the BOAI http://www.soros.org/openaccess/sign2.shtml?o was
not ready to count it as publisher support of open access if a
toll-access journal supported author self-archiving of their articles
http://www.ecs.soton.ac.uk/~harnad/Temp/rcoptable.gif : *Of course* that
is publisher support for open access! By the same token, I would certainly
consider it as publisher support for open access if a toll-access journal
made its full-text contents publicly accessible online toll-free. Even if
it was gerrymandered full-text access -- as long as they also supported
self-archiving!)

> And as scientific content is increasingly marked up using richer forms of
> semantically meaningful XML (e.g. CML for chemical structures, MathML for
> equations), the value of datamining will continue to increase.

All true. And it will all prevail eventually. But we need free access
*now*. http://www.ecs.soton.ac.uk/~harnad/Temp/che.htm

> The BioLINK group are using BioMed Central's open access corpus as the raw
> material for a datamining competition, designed to stimulate progress in the
> development of tools for biological datamining.
> http://www.pdg.cnb.uam.es/BioLINK/BioCreative_task2.html

That is commendable and welcome. But it must not be forgotten what
percentage of the annual biological journal literature that sample
actually represents. We must not be held back to that small percentage
because we are informed that mere free access is not good enough -- not
"true open access." Such rarefied fussiness does not serve the cause of
either free or open access at this point.

> (4) Derivative works and compilations
> Say that a scientist performs a meta-analysis on a group of published
> clinical trials, and wants to make available the conclusions of that
> research. Or perhaps a datamining researcher has taken a corpus of 1000
> articles on breast cancer, and established some interesting conclusions.

All very welcome and valuable (indeed, inevitable) developments in the
online age. But I'd rather that progress toward free access for all 20K
did not wait for these perks. Indeed, the sooner we have free access,
the sooner the rest will come too.

> In a true open access environment, each is free to post the results of their
> research, *along with* the actual corpus of data which the research was
> based on (effectively, the raw data for that research).
> But in a non-open access environment, that raw data (i.e. the research
> articles) cannot be redistributed, which makes it far more difficult than it
> needs to be for other scientists to reproduce, critique and follow up the
> work.

I am afraid I have to disagree. As already noted above, authors are as
free to self-archive (in their institutional repositories) the empirical
data underlying their toll-access publications as they are to do so with
the data underlying their open-access publications. Data-archiving is
another thing for which there is no point sitting around awaiting the
era of universal open-access publishing. Data-archiving will encourage
article self-archiving, and both will hasten the era of universal
open-access.

> Similarly, a scientist may wish to make a point by assembling a collection
> of certain articles or article fragments (perhaps they wish to assemble a
> comparison of the methods used for a certain technique).
> In an open access world, as long as they cite the sources, they are
> completely free to create and redistribute that compilation. Such a
> selective compilation may in itself be an extremely useful contribution to
> science.

I can't follow this at all. A compilation is a list of articles, whether
online or on-paper, whether toll-access or open-access. If the
full-texts of the articles are *free* access, all the compilation need list
is their URLs. (Ditto for article "fragments": try section number,
paragraph number, or even [yech!] PDF page number.)

> (5) Print redistribution rights - the National Health Service, for example,
> should be able to redistribute thousands of printed copies of an important
> research article (which it may have funded) to its doctors if it wishes to
> do so. It should not have to pay a hefty copyright fee for the privilege.

I have no views on this, but it has nothing to do with open access,
which even in the strict BOAI definition refers to online access, not
to multiple printing and redistribution rights. Besides, this is all
becoming moot in the online era: Why distribute print copies instead of
URLs, if the texts are publicly accessible online toll-free?

(I think it is a big mistake, and clouds the issue, to try to link online
toll-free access arguments with paper-printing rights. Don't forget that
those worthy paper-based arguments would have been just as worthy in the
paper era. So surely they are *not* what has changed in the online era.)

> Certainly, print redistribution will likely become less significant in the
> future, but there is no logical reason that the scientific community should
> not be free to exchange and distribute the research that it has created in
> print form, as well as online.

The case for multiple printing rights is *much* weaker than the case
for toll-free online access. Please let us not needlessly weaken
the case for free access by handicapping it with such needless extra
burdens. Free access will erode the need to print, even as it erodes
publisher opposition to printing. But now, all fussing about print
"redistribution" rights does is provoke needless opposition, to no
good purpose. Keep it light, till everyone sees the light.

Stevan Harnad

NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):

    http://amsci-forum.amsci.org/archives/september98-forum.html
                            or
    http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html

Discussion can be posted to: september98-forum@amsci-forum.amsci.org