JISCMail - JISC-REPOSITORIES Archives

Email discussion lists for the UK Education and Research communities
Subject:
Re: A Protocol for (Content) Statistical Harvesting?
From:
Millington Peter <[log in to unmask]>
Reply-To:
Millington Peter <[log in to unmask]>
Date:
Fri, 25 Jul 2008 11:12:02 +0100
Hi Scott,

Thanks for your comments. I had a look at your prototype, and was very
impressed by the interface and the presentation of your charts. They
compare favourably with Google Analytics. Unfortunately, because Google
Analytics requires a snippet of JavaScript to be included on every web
page to be logged (usually achieved via a template), it cannot properly
log downloads of PDFs and other full texts. Am I right in thinking that
your method does better in this respect?

However, this is all by the bye. We may be at cross purposes. My own
prototype Protocol for Statistical Harvesting is aimed purely at
acquiring statistics regarding the content of repositories (at its
simplest, how many items are there?), whereas yours is primarily
handling usage statistics. Perhaps I should have entitled my prototype
"Protocol for Content Statistical Harvesting" or something similar.

I'm not sure I agree that my proposed approach is too report-dependent.
I would argue that a given repository would choose which categories of
content it wished to expose via the harvester - e.g. OAI sets,
departments, subjects, full-text/metadata-only, publication date,
deposition date, etc. - as few or as many as it wished. Users would
be able to harvest more or less any combination of the available
categories to suit their needs, and they could process and analyse the
resulting XML to generate a variety of reports and charts. Having got my
hands dirty with a bit of prototype coding, I'm sure my proposals would
be quite easy to implement, and the results would vastly improve the
compilation of content statistics.
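
As a rough illustration of the "process and analyse the resulting XML" step, here is a minimal Python sketch. The XML layout (a `statistics` root with `count` elements carrying `category` and `value` attributes) is purely hypothetical and is not taken from the PSH prototype; only the general idea of harvesting per-category counts comes from the message above.

```python
import xml.etree.ElementTree as ET

# Hypothetical harvest response: element and attribute names are invented
# for illustration and do not reflect the actual PSH prototype output.
SAMPLE_RESPONSE = """
<statistics repository="example">
  <count category="fulltext" value="yes">320</count>
  <count category="fulltext" value="no">80</count>
  <count category="year" value="2007">150</count>
  <count category="year" value="2008">250</count>
</statistics>
"""


def counts_by_category(xml_text):
    """Return {category: {value: count}} from the hypothetical response."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for node in root.findall("count"):
        category = node.get("category")
        value = node.get("value")
        table.setdefault(category, {})[value] = int(node.text)
    return table


def proportion(table, category, value):
    """Share of items in `category` that carry the given `value`."""
    total = sum(table[category].values())
    return table[category][value] / total if total else 0.0


stats = counts_by_category(SAMPLE_RESPONSE)
fulltext_share = proportion(stats, "fulltext", "yes")
```

Once the counts are in a plain dictionary like this, any report or chart can be derived on the harvesting side without further requests to the repository.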

Best wishes

Peter


-----Original Message-----
From: Scott Yeadon [mailto:[log in to unmask]] 
Sent: 22 July 2008 02:32
To: [log in to unmask]
Cc: [log in to unmask]
Subject: Re: A Protocol for Statistical Harvesting?

Hi Peter,

Under the Australian Partnership for Sustainable Repositories (APSR)
program the Benchmark Statistics (BEST) project investigated and
developed a prototype event harvester and statistics aggregation service
in the context of providing a national service for repository
statistics. The prototype is accessible at
http://devel.apsr.edu.au/cosi/ by selecting "Benchmark Statistics->Reports"
from the left-hand navigation bar. Note this is a VM running on a fairly
loaded server, so apologies if it is a little slow at times! Some
background information can be found at
http://www.apsr.edu.au/best/index.htm and the latest set of
documentation can be found at http://www.apsr.edu.au/software.htm#best.
Our approach was based on research by LANL and also the JISC IRS
project.

The approach you propose is, I think, coming too much from a
report-dependent viewpoint. Your examples are centred around a specific
report (and variants), and I think a more flexible and maintainable
approach would be to look at a more generic means of gathering the raw
data and then having report generation at another point in the
processing chain. For example, in the BEST scenario, repositories are
required to support a particular metadata format, the Event Interchange
Model (EIM), within their OAI-PMH Data Provider. This format is for
representing raw events, which the OAI-PMH harvester feeds to an
aggregation service (a separate application with both HTTP and user
interfaces). The aggregation service then stores the raw events, and it
is this service which generates the final context-specific reports. In
this way, should new types of reports be needed, changes are not required
on the repository side, since all the repositories do is feed raw data.
The exception to this would be where either new event types or metadata
need to be collected; however, these changes are likely to be minor.
(Note that metadata harvesting is also supported in EIM, either embedded
or via a separate metadata harvest.)
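
To make the separation between raw events and report generation concrete, here is a minimal Python sketch. The event records and field names (`repository`, `type`, `item`) are hypothetical stand-ins and do not reproduce the EIM format; the only point illustrated is that new reports can be produced on the aggregation side without touching the repositories.

```python
from collections import Counter

# Hypothetical raw events as an aggregation service might store them.
# Field names are assumptions for illustration, not the real EIM schema.
EVENTS = [
    {"repository": "repo-a", "type": "download", "item": "oai:a:1"},
    {"repository": "repo-a", "type": "view", "item": "oai:a:1"},
    {"repository": "repo-b", "type": "download", "item": "oai:b:7"},
    {"repository": "repo-a", "type": "download", "item": "oai:a:2"},
]


def report(events, *keys):
    """Count events grouped by any combination of event fields."""
    return Counter(tuple(event[key] for key in keys) for event in events)


# Two different "reports" built from the same stored raw events.
downloads = [e for e in EVENTS if e["type"] == "download"]
downloads_per_repo = report(downloads, "repository")
events_by_type = report(EVENTS, "type")
```

Adding a further report, say events per item, would only need another call to `report` with different keys; the repositories keep feeding the same raw data.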

It is also worth noting that the more reports there are to support, and
the more complex they are, the more likely it is that more than just
simple DC metadata will need to be collected, which immediately has an
impact on all the repositories running a default OAI-PMH Data Provider
implementation. The trick is to show there is value in making the effort
to provide this information, via the availability of more targeted and
useful information.

Hope that helps.

Scott.
> Date:    Fri, 11 Jul 2008 15:18:51 +0100
> From:    Millington Peter <[log in to unmask]>
> Subject: A Protocol for Statistical Harvesting?
>
> Hi,
>
> I originally posted a version of this message to the JISC-CRIG list,
> but I have been asked to cross-post it to JISC-REPOSITORIES for wider
> discussion. Apologies if you've seen it already.
>
> I've recently been thinking about the tribulations of trying to count
> the number of items in a repository, and of gathering similar
> statistical information. We're doing this at OpenDOAR via OAI-PMH, but
> like other people I find the iterative processing can be a minefield.
> In my naïvety I thought that ORE might be able to help, but it seems
> not, because of course its focus is on object reuse and exchange
> (which it does very well) rather than statistics.
>
> There ought to be an easier way, given that most of the information
> would be very quick and easy to obtain using single SQL commands:
>
>     e.g. SELECT COUNT(*) FROM repository;
>
> It struck me that we could do with a Protocol for Statistical
> Harvesting (PSH), along the lines of, or even extending, OAI-PMH -
> effectively implementing a 'Count' verb. Better repository statistics
> would help improve the tracking and assessment of Open Access
> initiatives, and perhaps even assist data harvesting processes.
>
> I've explored the idea of a statistical harvester a bit further, and
> put together an outline for discussion at:
>
>     http://www.opendoar.org/demos/psh_prototype
>
> This outline uses examples from a working prototype harvester that I
> put together for data in the OpenDOAR database. This only took a few
> hours to program in my spare time, and I imagine it would only take a
> day or two to do something similar for EPrints, DSpace, Fedora, etc.
> This therefore could be a quick win.
>
> I would be interested to know what people think about this - ideas,
> feedback, brickbats, etc.
>
> Regards
>
> Peter
>
> Peter Millington
> SHERPA Technical Development Officer
> Greenfield Medical Library, University of Nottingham, Queen's Medical
> Centre, Nottingham, NG7 2UH, England
> Phone: +44 (0)115 84 68481
>
> http://www.opendoar.org/
>
> This message has been checked for viruses but the contents of an
> attachment may still contain software viruses, which could damage your
> computer system: you are advised to perform your own checks. Email
> communications with the University of Nottingham may be monitored as
> permitted by UK legislation.
>
>
> ------------------------------
>
> Date:    Fri, 11 Jul 2008 16:18:52 +0100
> From:    Phil Cross <[log in to unmask]>
> Subject: Re: A Protocol for Statistical Harvesting?
>
> The resumptionToken element, in say a ListIdentifiers response, has an
> optional attribute, 'completeListSize', which would be an easier method
> of solving your problem, Peter, if all repositories implemented this
> (and implemented resumption tokens). This has the benefit of already
> being a part of the standard.
> Cheers,
> Phil
>
> --
> ---------------------------------
> Phil Cross
> Senior Technical Researcher
> Institute for Learning and Research Technology, University of Bristol
> 8 - 10 Berkeley Square
> Bristol, BS8 1HH
> Tel: +44 (0)117 331 4391
> Fax: +44 (0)117 331 4396
> E-mail: [log in to unmask]
> URL: http://www.ilrt.bris.ac.uk/aboutus/staff?search=cmpac
> -----------------------------------
>
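For readers unfamiliar with the attribute Phil mentions, the following Python sketch shows how a harvester might read `completeListSize` from an OAI-PMH response. The sample response is made up (the identifier, token value and size are placeholders), and a repository that omits the attribute, which the standard allows, simply yields `None`.

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up ListIdentifiers response fragment for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2008-07-11</datestamp>
    </header>
    <resumptionToken completeListSize="1234" cursor="0">token-abc</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def complete_list_size(xml_text):
    """Return completeListSize as an int, or None if it is not provided."""
    root = ET.fromstring(xml_text)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None:
        return None  # repository returned the whole list in one response
    size = token.get("completeListSize")
    return int(size) if size is not None else None


size = complete_list_size(SAMPLE)
```

Because the attribute is optional, a harvester relying on it still needs a fallback such as paging through the full list and counting.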
> ------------------------------
>
> Date:    Fri, 11 Jul 2008 18:39:22 +0100
> From:    Millington Peter <[log in to unmask]>
> Subject: Re: A Protocol for Statistical Harvesting?
>
> Thanks Phil.
>
> The 'completeListSize' attribute is darned useful, where people have
> bothered to implement it, as at Bepress and LoC. No doubt this has
> also helped reduce the load on their servers from robots. However, as
> you imply, implementation is patchy.
>
> As you'll see if you follow the link, total repository size is only
> one aspect of my proposal/argument. A Protocol for Statistical
> Harvesting could also yield a lot of other interesting information
> with minimal effort - for instance the proportion of items that are
> full-text.
>
> Cheers
>
> Peter
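
As a small illustration of the "single SQL command" idea applied to the full-text proportion, here is a Python sketch using an in-memory SQLite database. The `items` table and its `has_fulltext` column are hypothetical; real repository platforms such as EPrints, DSpace or Fedora have their own schemas, which are not shown here.

```python
import sqlite3

# Hypothetical, minimal schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, has_fulltext INTEGER)"
)
conn.executemany(
    "INSERT INTO items (has_fulltext) VALUES (?)",
    [(1,), (1,), (0,), (1,)],
)

# One query per statistic, as the proposal suggests.
(total,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
(with_fulltext,) = conn.execute(
    "SELECT COUNT(*) FROM items WHERE has_fulltext = 1"
).fetchone()
fulltext_proportion = with_fulltext / total if total else 0.0
conn.close()
```

A statistical harvesting endpoint could expose results like these directly, sparing the harvester from paging through every record.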
>
>
> ------------------------------
>
> End of JISC-REPOSITORIES Digest - 10 Jul 2008 to 11 Jul 2008 (#2008-141)
> ************************************************************************