(resent with From: header faked to be my old Bristol Uni address for
JISCMail's benefit)
> From: Dan Brickley <[log in to unmask]>
> Date: Tue, 19 Aug 2003 15:39:14 -0400
> To: Charles Christacopoulos <[log in to unmask]>
> Cc: [log in to unmask], [log in to unmask]
> Subject: Re: Indexing *.ac.uk
> Message-ID: <[log in to unmask]>
>
> * Charles Christacopoulos <[log in to unmask]> [2003-08-19 20:26+0100]
> > Addressed to: [log in to unmask]
> > [log in to unmask]
> >
> > Hi,
> >
> > Apologies for crossposting.
> >
> > In the next few days (late this week or early next week) we will start testing
> > a robot on *.ac.uk to index whatever it finds. It has been tested at our domain
> > but we need something a touch bigger than *.dundee.ac.uk. It will come under
> > the name of Capek (Egothor's robot). When the test are completed, the robot
> > and its engine may become a more permanent feature ... until Google is
> > defeated. :-)
> >
> > I would appreciate if you could pass the above info. to any of your colleages or
> > IT Directors who might be concerned about a sudden increase in traffic of your
> > webservers.
> >
> > --------------------------------------
> >
> > The way the robot/engine works is that it tries to learn which are the pages
> > which change more often than others so it visits them more frequently.
> > In its current incarnation it visits those pages every 1/2 hour. Other pages are
> > visited less frequently and so on.
>
> Would you consider making use of machine-readable metadata about site
> update policies?
>
> The RSS-DEV group developed an RDF vocabulary for use with RSS1.0 which
> might be relevant. We could encourage folks to link to a robots.rdf from robots.txt
> or with a LINK REL from homepages. See http://web.resource.org/rss/1.0/
> -> http://web.resource.org/rss/1.0/modules/syndication/
>
> This gives markup along lines of:
>
> <sy:updatePeriod>hourly</sy:updatePeriod>
> <sy:updateFrequency>2</sy:updateFrequency>
> <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
>
> Something like this should be useful, though I guess you'd want a way to
> know the schedule for the entire site rather than for each page
> separately.
>
> I'd also strongly encourage you to use RSS1 feeds (or any RSS feeds) to
> gather info about new pages on the site, to allow your index to stay up
> to date with info about recently changed pages on each site. There are
> LINK REL conventions for finding RSS feeds too...
>
> cheers,
>
> Dan
>
> > Any comments as to what you think is an appropriate interval for (re)visiting
> > pages which are likely to change would be appreciated.
> >
> > TIA.
> > Charles
> >
> > PS. The robot/engine are written in Java and they are Open Source.
>
> cool :)
|