I'd love to hear how this goes! The interesting thing about
duplicates, as far as my experience with EPrints QA goes, is not so
much RECOGNISING them as making authoritative decisions about WHAT TO
DO with them.
Say you find out that 6 records all seem to be about the same
publication - or at least, seem to be representative of different
stages of a publication's lifecycle. They may all have subtly
different metadata, and some attached documents appear to be
substantially longer than others. Perhaps some are for a journal
article, some for a workshop paper that preceded it, and the others
are unpublished drafts. Each was deposited by a different author at a
different time.
The detective work involved in unravelling these cases makes the
technical problem of finding them in the first place look like child's
play! Thankfully these complex examples are fairly rare. But even so
it can still be quite daunting to see a list of hundreds of sets of
potentially duplicate records and then ask oneself the question "now
what?"
Does anyone have any helpful experience to share?
--
Les
On 23 Jun 2008, at 15:06, Razum, Matthias wrote:
> The German eSciDoc project (http://www.escidoc.org) is currently working
> on a duplicate checker. It compares a given document with all existing
> documents in its index (respectively in a Fedora repository) and returns
> a probability that the new document is a duplicate of an existing one.
> Technically, it calculates similarities between documents based on
> 7-word groups.
>
> The similarity algorithm is a Java port of the original C code from
> "Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke, Simeon
> Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012].
>
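
For anyone wondering what comparing documents on 7-word groups might
look like in practice, here is a minimal Python sketch. It is not the
eSciDoc code or the arXiv algorithm, whose details are not given in this
thread: it simply assumes a Jaccard overlap of word 7-grams, and the
function names (shingles, similarity) are made up for illustration. It
returns a similarity score between 0 and 1, not a calibrated probability.

```python
import re


def shingles(text, n=7):
    """Return the set of n-word groups (as tuples) found in text."""
    words = re.findall(r"\w+", text.lower())
    # A text shorter than n words yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b, n=7):
    """Jaccard overlap of the two texts' n-word groups, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0  # too short to form even one n-word group
    return len(sa & sb) / len(sa | sb)
```

Calling similarity(text_a, text_b) on the extracted full text of two
deposits gives 1.0 when they contain the same set of word groups and
0.0 when they share none. Where to set the threshold in between is a
policy decision, not something the code can settle.
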
> The Java source code is available here (open-source licensed under the
> CDDL):
> http://www.escidoc-project.de/software/docsim.zip
>
> Some (rudimentary) documentation can be found here:
> http://www.escidoc-project.de/JSPWiki/en/DuplicationDetectionService
>
> The code is still pre-beta. We would like to get some feedback from
> others if it works for them and how we could further improve the code.
> One thing we have on our list is the inclusion of at least some
> descriptive metadata into the similarity algorithm.
>
> Matthias.
>
>
>> -----Original Message-----
>> From: Repositories discussion list
>> [mailto:[log in to unmask]] On Behalf Of Leslie Carr
>> Sent: Monday, June 23, 2008 3:36 PM
>> To: [log in to unmask]
>> Subject: Re: Dealing with duplicate papers
>>
>> I'm sure that the real gotcha is that they won't be exact
>> duplicates. That would be too easy :-)
>> --
>> Les
>>
>>
>> On 23 Jun 2008, at 14:20, David Kane wrote:
>>
>>
>> Hi Rachel,
>>
>> RSS is good. In EPrints you can turn any search into
>> an RSS feed. It's easy then to have a PHP script or similar
>> consume that feed, cache it, and display it in the web page.
>>
>> Combining multiple feeds is a bit more of a problem.
>> This could be achieved by using FeedBurner or Yahoo Pipes. I
>> am not sure if these de-dupe the feeds, but that is an
>> enhancement that could be made to the PHP code that consumes
>> the feed and renders the HTML.
>>
>> I say PHP because that is what I used. You are welcome
>> to use my code, if you like.
>>
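
The merge-and-de-dupe step described above could be sketched roughly as
follows, in Python rather than PHP. This is not David's code. It assumes
plain RSS 2.0 feeds (un-namespaced item, title and link elements) and
assumes that two items are the same paper when their titles match after
normalisation, since a co-authored paper will have a different link in
each repository. The names normalise and merge_feeds are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def normalise(title):
    """Lower-case a title and collapse punctuation and spacing."""
    return " ".join(re.findall(r"\w+", title.lower()))


def merge_feeds(feed_xml_strings):
    """Merge RSS 2.0 feeds, keeping the first item seen for each title."""
    seen = set()
    merged = []
    for xml_text in feed_xml_strings:
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            title = item.findtext("title", default="").strip()
            link = item.findtext("link", default="").strip()
            key = normalise(title)
            if not key or key in seen:
                continue  # skip untitled items and repeats
            seen.add(key)
            merged.append({"title": title, "link": link})
    return merged
```

Matching on title alone will miss records whose titles differ between
repositories and may wrongly merge distinct papers with identical
titles, so a real list would probably want a second check (year,
authors, or a DOI where one is present).
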
>> Best,
>>
>> David.
>>
>>
>> 2008/6/23 Rachel Hill <[log in to unmask]>:
>>
>>
>> Hi all,
>> I'm trying to figure out the best way to solve
>> the following problem of duplicate papers:
>>
>> A new research centre is affiliated with 2
>> universities. Some papers from the centre will be co-authored
>> by both universities, some will be authored by just the one.
>> Each university will have a copy of papers written by its
>> authors in its institutional repository. Now the research
>> centre will create a website containing a record of all its
>> papers, and wants to pull all its papers from each IR and
>> join them together into one publications list.
>> Does anyone have advice on the best (and
>> easiest) means of doing this, while automatically removing
>> duplicates (where papers have been co-authored)? What method
>> to use: OAI-PMH? other?? Has anyone had experience with this before?
>>
>> Any suggestions appreciated!
>>
>> Many thanks,
>> Rachel Hill
>>
>>
>>
>>
>>
>> --
>> David Kane
>> Systems Librarian
>> Waterford Institute of Technology
>> http://library.wit.ie/
>> T: ++353.51302838
>> M: ++353.876693212
>>
>>
>>
>
>
> -------------------------------------------------------
>
> Fachinformationszentrum Karlsruhe, Gesellschaft für
> wissenschaftlich-technische Information mbH.
> Registered office: Eggenstein-Leopoldshafen; Mannheim District Court,
> HRB 101892.
> Managing Director: Sabine Brünger-Weilandt.
> Chairman of the Supervisory Board: MinR Hermann Riehl.
>