As I understand from talking to folks who have been working with Google,
there are several problems with the normal Google approach.
First, as noted and described in the links, you have to let the googlebot
in, and you need to give it a list of links to *all* of the content that you
want to be indexed. You probably don't want a human-readable page with a
million links, so a practical solution is to recognize when the
googlebot is visiting and give it a different view of your site -- the
page with the links.
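For illustration, a minimal sketch of that kind of switch, assuming a
CGI-style environment (the function name and page contents are
hypothetical, not any particular repository's code):

    import os

    def render_for_request(item_urls):
        # Serve a bare page of links to Googlebot; the normal view
        # otherwise. Googlebot identifies itself in the User-Agent.
        ua = os.environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" in ua:
            # Crawler view: one plain link per item, nothing else.
            links = "\n".join('<a href="%s">%s</a>' % (u, u)
                              for u in item_urls)
            return "<html><body>%s</body></html>" % links
        # Human view: whatever the repository normally presents.
        return "<html><body>Welcome to the repository.</body></html>"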
Next you have to make sure that googlebot will harvest all of the links.
The various descriptions indicate that it is not by default an exhaustive
harvest, and the googlebot will revisit the site many times.
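One way to see whether the harvest ever becomes complete is to compare
the link list against the server's access logs. A rough sketch,
assuming a combined-format Apache log (file names and item paths here
are made up):

    import re

    GOOGLEBOT = re.compile(r"Googlebot", re.I)
    REQUEST = re.compile(r'"GET (\S+) HTTP/[\d.]+"')

    def unharvested(log_path, item_paths):
        # Collect every path Googlebot has fetched, then report gaps.
        seen = set()
        for line in open(log_path):
            if GOOGLEBOT.search(line):
                m = REQUEST.search(line)
                if m:
                    seen.add(m.group(1))
        return [p for p in item_paths if p not in seen]

    missing = unharvested("access.log", ["/item/1", "/item/2"])

Anything still in `missing` after several visits has not been picked
up.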
Once Google harvests, it has to index what it found. Again, by default
it doesn't treat learning content in any special way. Does DC:Title
mean anything special to it? How do I get precise search results using
the metadata that is associated with the content?
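The usual workaround is to expose the metadata where a generic crawler
can at least see it, e.g. as <meta> tags on each item's splash page. A
sketch (the record fields are illustrative, and nothing here implies a
crawler actually gives them special weight):

    from xml.sax.saxutils import quoteattr

    def dc_meta_tags(record):
        # Emit one <meta> tag per Dublin Core field, with the name
        # and content attributes properly quoted and escaped.
        tags = []
        for name, value in record.items():
            tags.append("<meta name=%s content=%s>" %
                        (quoteattr("DC." + name), quoteattr(value)))
        return "\n".join(tags)

    print(dc_meta_tags({"Title": "On Repositories",
                        "Creator": "A. Author"}))

Whether the indexer treats DC.Title any differently from body text is
exactly the open question above.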
I also understand that the googlebot makes many ranking decisions --
what to harvest, what to index, what to display -- so the Google view
of your repository, and what the user sees in the Google search
results, may both be different from what you actually have, or from
what you would see in a direct repository search.
There have also been problems with content whose URI is a persistent
ID, e.g., a PURL or a DOI. Google treats the content as "owned" by the
URL owner: the PageRank for http://resolver/id is based on the
PageRank of "resolver", not of the actual content. But I think they
have been working on this problem for some collections, like Crossref.
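You can see the mismatch by following the redirect chain yourself: the
persistent ID lives on the resolver's host, while the content is
served from somewhere else entirely. A sketch (the DOI below is a
made-up example; substitute a real one):

    import urllib.request

    def final_url(persistent_id_url):
        # urlopen follows HTTP redirects; geturl() returns the end
        # of the redirect chain, i.e. where the content really lives.
        with urllib.request.urlopen(persistent_id_url) as resp:
            return resp.geturl()

    print(final_url("https://doi.org/10.1000/xyz123"))

The link-based ranking accrues to the resolver host in the first URL,
not to whatever host the final one points at.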
So while Google Scholar helps, it does not yet solve the problem of getting
precise results from all the content in the repositories.
- Dan
>Hi,
>
>
>> As an aside, the comments about Google are interesting, since these
>> are also issues currently being discussed in the context of the JISC
>> Information Environment. In particular, the comment about Google 'not
>> seeing the deep web' seems to miss the recent things that Google has
>> been doing with Google Scholar. Is it true only where repository
>> owners take the trouble to prevent the Google robots from indexing
>> their content? :-)
>>
>
>On the same note, repository owners are taking a more proactive role
>in allowing Google to promote their content by submitting it for
>indexing; in the DSpace community, for example, see
>http://sourceforge.net/mailarchive/message.php?msg_id=9373110 and
>http://sourceforge.net/mailarchive/message.php?msg_id=11043512 .
>
>The D-Lib article refers to the "appropriate copy" requirement.
>Google Scholar addresses this by identifying the sources of the
>content providers. Recent developments also allow institutional link
>resolvers to be embedded in the search results.
>
>Also in the article:
>
>"On the more technical side, the presence of quite a few implementers
>of federated search mechanisms testified to the fact that building
>interoperability between repositories is not new, and, as Kerry
>Blinco from IMS Australia pointed out, libraries have been dealing
>with issues of scale and repository interoperability for a
>considerable time."
>
>An issue arising from the above would be: in what contexts does
>CORDRA relate to the existing architectures/models in the digital
>library world, such as JISC IE, MODELS, and OCKHAM, which for example
>is working on a metadata caching/harvesting solution, cf. Google for
>the "deep web"?
>
>
>Boon
>
>
>-----
>Boon Low
>System Development, EGEE Training
>National e-Science Centre
>http://homepages.ed.ac.uk/boon/