---------- Forwarded message ----------
Date: Sun, 1 Oct 2000 17:10:20 +0100
From: Stevan Harnad <harnad@COGLIT.ECS.SOTON.AC.UK>
Subject: Re: [Ref-Links] economic effects of link-based search engines
    on e-journals

On Sun, 1 Oct 2000, Eric Hellman wrote:

> >it would be more useful and relevant
> >for researchers if a special, google-style search engine were devised
> >that searched only the refereed research literature on keywords, and
> >then returned results on the basis of citation-link-frequency (i.e.,
> >the most cited papers on that keyword first).
>
> I observe that google, AS IT EXISTS TODAY, works quite well in
> returning useful and relevant results in areas (such as nitride
> semiconductor research) where the content is available for spidering.
> The assertion that a special purpose engine would be MORE useful is a
> marketing claim made by Northern Light which I have not tested.

Eric is a technical expert here and I am not. So there may be something
I am not seeing or understanding, but it seems to me that the idea that
google itself, searching web-wide, is any sort of a solution for
researchers who want to search all and only the refereed journal
literature, is erroneous.

There is a huge difference (a world of difference in fact) between
either (1a) a consumer searching the whole web for products, or (1b) a
student or layman searching the whole web for information, on the one
hand, ranked on google's well-linkedness parameter, and (2) a
researcher, searching only that portion of the web that is tagged
"refereed," and ranked on citation-linkedness.

It seems to me that to have the latter, there has to be a reliable way
to (i) "sector" the web into just the refereed portion, and (ii) ensure
that the contents of that sector are fully citation-interlinked.
Nothing like this will fall out, as a side-effect, on the basis of the
larger, web-wide, link-ranking principle.
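
The two requirements above, (i) restricting the search to the refereed
sector and (ii) ranking by citation links within that sector, can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: the Paper record, its "refereed" flag, and the
citation_ranked_search function are hypothetical names, not part of
OAI, eprints.org, or OpCit, and the keyword match is a naive substring
test on titles.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; a real archive would supply richer metadata.
    ident: str
    title: str
    refereed: bool
    cites: list = field(default_factory=list)  # identifiers this item cites


def citation_ranked_search(papers, keyword):
    """Return (ident, citation_count) for refereed keyword hits, most cited first."""
    # (i) Sector: keep only items tagged as refereed.
    sector = {p.ident: p for p in papers if p.refereed}

    # (ii) Count citations that originate inside the sector only.
    counts = {ident: 0 for ident in sector}
    for p in sector.values():
        for target in set(p.cites):
            if target in counts and target != p.ident:
                counts[target] += 1

    # Naive keyword match on the title (an assumption of this sketch).
    hits = [p for p in sector.values() if keyword.lower() in p.title.lower()]
    hits.sort(key=lambda p: (-counts[p.ident], p.ident))
    return [(p.ident, counts[p.ident]) for p in hits]


papers = [
    Paper("a", "Growth of InN films", True),
    Paper("b", "Band gap of InN", True, ["a"]),
    Paper("c", "InN nanowires", True, ["a", "b"]),
    Paper("d", "Riverside Inn bed and breakfast", False, ["a"]),
]

results = citation_ranked_search(papers, "InN")
print(results)  # [('a', 2), ('b', 1), ('c', 0)]
```

Item "d" matches the keyword but is never returned, because it is not
tagged as refereed; its link to "a" is not counted either, since only
citations from within the sector contribute to the ranking.
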
What is needed is a reliable, universal way of tagging all and only the
items in this sector as "refereed," and interlinking them by citation,
and then harvesting those, and only those. The Open Archives Initiative
(OAI) seems to have provided the meta-data tagging protocol; the
OAI-compliant software at www.eprints.org allows institutions worldwide
to create these interoperable archives; authors can then fill them; and
the opcit.eprints.org citation-linking software, currently adapted
specifically for the Los Alamos Physics Archive, can be adapted as an
open archives service applied to all the harvested eprint archives.
Dedicated search engines can then operate on that corpus alone, instead
of the whole web.

> The interesting thing to me is that by virtue of its
> interlinked-ness, scholarly literature tends to rank high in google
> even without prefiltering. In some cases, interference is a problem.
> For example, if you try to look for InN (indium nitride), you get a
> lot of hotels and Bed-and-Breakfasts.

Here the fact that I am not technical does not disqualify me: I can say
with absolute certainty that google as a way of retrieving (what there
is of) the refereed literature on the web, and that literature only, is
completely hopeless.

What a user should be able to do (with the restricted sector and
searcher I am describing) is precisely the same thing he does if he
executes, say, a search in Medline, or Inspec, or Web of Science: He
should be able to retrieve all and only the refereed literature (but
citation-ranked and full-text). No wading through "Bed-and-Breakfasts"
and thousands of other irrelevant items.

> Google is uncanny. For example, it knows to classify "Harnad" in the
> category "Logic and Ontology:Natural Kinds".

It's interesting that it got that, based on linkedness, but that is in
fact far from being the best or most useful first-cut classification of
my work. It is no doubt an artifact of the linkedness-ranking.
If you did it in the refereed sector, using citation linking, you would
get much more accurate and useful categories.

> >For this, the refereed (and pre-refereeing) literature needs to be:
> >
> >(1) identifiable by agreed upon meta-data tagging:
> >    http://www.openarchives.org
>
> Good, but not strictly essential. It is a matter of current
> controversy in the search engine community as to whether metadata is
> useful at all in open, automated environments. Of course meta tagging
> is very useful for other applications.

The OAI-protocol, and registration as an OA data-provider, as I
understand it, makes it possible to selectively harvest the contents of
those archives, and those archives alone. (Web-wide, the meta-data would
be buried in a lot else, and probably not even unique.)

    http://www.openarchives.org/sfc/sfc_archives.htm

> >(2) online (preferably full-text and free):
> >    http://www.eprints.org
>
> Necessary, but not sufficient. Content must also be available to
> robots. The Los Alamos Archive is a prominent example of a site where
> robots are unwelcome.

I agree completely. Not being a technical person, I cannot say how, but
I have a gut feeling that there will be a way to allow the registered,
OAI-compliant eprint archives to be automatically harvested. (In fact, I
bet that such an automatic harvester will be among the first registered
OA service-providers -- and searchers and citation-linkers will not be
far behind).

    http://www.openarchives.org/sfc/sfc_services.htm

> >and
> >
> >(3) fully citation-linked:
> >    http://opcit.eprints.org
> >
>
> Again, necessary, but not sufficient. The links must be
> robot-friendly. Feel free to contact me if you want details; this is
> a technical subject.

I agree that they must be robot-friendly. Once we at Southampton have
released the final version of the (free) OAI-compliant
Eprint-archive-creating software, we will be working on providing
citation-linking services and perhaps harvesters.
We will certainly make sure that, at least for authorized daily
harvesting service-providers (whose selection can then be searched by
real people), the OAI-compliant Eprint Archives permit automatic
harvesting.

CogPrints has no robot restriction -- although, admittedly, at only
1/130th of the size of Los Alamos, it has not yet reached the size where
it might need one. CogPrints, however, like Los Alamos, is a CENTRALIZED
Eprint Archive. Once there are distributed, institutional Eprint
Archives, each holding only their own researchers' refereed papers, the
harvesting and robot problem might not come up.

--------------------------------------------------------------------
Stevan Harnad                     harnad@cogsci.soton.ac.uk
Professor of Cognitive Science    harnad@princeton.edu
Department of Electronics and     phone: +44 23-80 592-582
             Computer Science     fax:   +44 23-80 592-865
University of Southampton         http://www.cogsci.soton.ac.uk/~harnad/
Highfield, Southampton            http://www.princeton.edu/~harnad/
SO17 1BJ UNITED KINGDOM

NOTE: A complete archive of the ongoing discussion of providing free
access to the refereed journal literature online is available at the
American Scientist September Forum (98 & 99 & 00):

    http://amsci-forum.amsci.org/archives/september98-forum.html

You may join the list at the site above.

Discussion can be posted to:

    september98-forum@amsci-forum.amsci.org