Hmm, providing just probabilities and leaving the decision process to humans was actually a requirement for our service. But I see your point, especially when trying to compare documents from different repositories.
Matthias.
> -----Original Message-----
> From: Leslie Carr [mailto:[log in to unmask]]
> Sent: Monday, June 23, 2008 4:28 PM
> To: Razum, Matthias
> Cc: [log in to unmask]
> Subject: Re: Dealing with duplicate papers
>
> I'd love to hear how this goes! The interesting thing about
> duplicates, as far as my experience with EPrints QA goes, is not so
> much RECOGNISING them as making authoritative decisions about WHAT
> TO DO with them.
>
> Say you find out that 6 records all seem to be about the same
> publication - or at least, seem to be representative of different
> stages of a publication's lifecycle. They may all have subtly
> different metadata, and some attached documents appear to be
> substantially longer than others. Perhaps some are for a journal
> article and some for a workshop paper that preceded it and
> the others
> are unpublished drafts. Each is deposited by different authors at
> different times.
>
> The detective work involved in unravelling these cases makes the
> technical problem of finding them in the first place look like
> child's play! Thankfully these complex examples are fairly rare. But
> even so
> it can still be quite daunting to see a list of hundreds of sets of
> potentially duplicate records and then ask oneself the question "now
> what?"
>
> Does anyone have any helpful experience to share?
> --
> Les
>
>
>
> On 23 Jun 2008, at 15:06, Razum, Matthias wrote:
>
> > The German eSciDoc project (http://www.escidoc.org) is currently
> > working
> > on a duplicate checker. It compares a given document with all
> > existing documents in its index (i.e., in a Fedora repository)
> > and returns a probability that the new document is a duplicate
> > of an existing one.
> > Technically, it calculates similarities between documents based on
> > 7-word groups.
> >
> > The similarity algorithm is a Java port of the original C code from
> > "Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke,
> > Simeon
> > Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012].
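[The 7-word-group comparison described above can be sketched roughly as follows. This is a minimal illustration only, assuming plain Jaccard similarity over overlapping word shingles; the actual docsim/arXiv code uses its own scoring, so treat this as the general idea rather than the implementation:]

```python
def shingles(text, k=7):
    """Split text into overlapping k-word groups ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=7):
    """Jaccard similarity of two documents' shingle sets, in [0.0, 1.0].

    A high score suggests the documents share long verbatim passages,
    which is the signal used for duplicate/plagiarism detection.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```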
> >
> > The Java source code is available here (open-source licensed under
> > the CDDL):
> > http://www.escidoc-project.de/software/docsim.zip
> >
> > Some (rudimentary) documentation can be found here:
> > http://www.escidoc-project.de/JSPWiki/en/DuplicationDetectionService
> >
> > The code is still pre-beta. We would like to get some feedback from
> > others on whether it works for them and how we could further
> > improve the code.
> > One thing we have on our list is the inclusion of at least some
> > descriptive metadata into the similarity algorithm.
> >
> > Matthias.
> >
> >
> >> -----Original Message-----
> >> From: Repositories discussion list
> >> [mailto:[log in to unmask]] On Behalf Of Leslie Carr
> >> Sent: Monday, June 23, 2008 3:36 PM
> >> To: [log in to unmask]
> >> Subject: Re: Dealing with duplicate papers
> >>
> >> I'm sure that the real gotcha is that they won't be exact
> >> duplicates. That would be too easy :-)
> >> --
> >> Les
> >>
> >>
> >> On 23 Jun 2008, at 14:20, David Kane wrote:
> >>
> >>
> >> Hi Rachel,
> >>
> >> RSS is good. In EPrints you can turn any search into
> >> an RSS feed. It's easy then to have a PHP script or similar
> >> consume that feed, cache it, and display it in the web page.
> >>
> >> Combining multiple feeds is a bit more of a problem.
> >> This could be achieved by using FeedBurner or Yahoo Pipes. I
> >> am not sure if these de-dupe the feeds, but that is an
> >> enhancement that could be made to the PHP code that consumes
> >> the feed and renders the HTML.
> >>
> >> I say PHP because that is what I used. You are welcome
> >> to use my code, if you like.
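[The de-duplication enhancement David describes can be sketched minimally as below, in Python rather than PHP purely for illustration. Keying on a normalized title is an assumption; as Les notes, real copies won't be exact duplicates, so a production version would need fuzzier matching:]

```python
def merge_feeds(*feeds):
    """Merge item lists from several feeds, keeping the first copy of
    each paper.

    Items are dicts with at least a 'title' key. De-duplication keys
    on the title with whitespace and case normalized, which will only
    catch near-exact matches.
    """
    seen = set()
    merged = []
    for feed in feeds:
        for item in feed:
            key = " ".join(item["title"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```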
> >>
> >> Best,
> >>
> >> David.
> >>
> >>
> >> 2008/6/23 Rachel Hill <[log in to unmask]>:
> >>
> >>
> >> Hi all,
> >> I'm trying to figure out the best way to solve
> >> the following problem of duplicate papers:
> >>
> >> A new research centre is affiliated with two
> >> universities. Some papers from the centre will be co-authored
> >> by both universities, some will be authored by just the one.
> >> Each university will have a copy of papers written by its
> >> authors in its institutional repository. Now the research
> >> centre will create a website containing a record of all its
> >> papers, and wants to pull all its papers from each IR and
> >> join them together into one publications list.
> >> Does anyone have advice on the best (and
> >> easiest) means of doing this, while automatically removing
> >> duplicates (where papers have been co-authored)? What method
> >> to use: OAI-PMH? Something else? Has anyone had experience with
> >> this before?
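[One possible shape for this, sketched as an assumption rather than a recommendation from the thread: harvest oai_dc records from both repositories over OAI-PMH, merge them, and collapse duplicates on a persistent identifier (e.g. a DOI) when both copies carry one, falling back to a normalized title. The field names and exact-match keying here are illustrative only:]

```python
def dedupe_records(records):
    """Collapse harvested records from two repositories into one list.

    Each record is a dict following oai_dc naming ('identifier',
    'title'). Records sharing an identifier, or failing that a
    normalized title, are treated as the same paper and only the
    first copy is kept.
    """
    seen = {}
    for rec in records:
        key = rec.get("identifier") or " ".join(
            rec.get("title", "").lower().split())
        if key and key not in seen:
            seen[key] = rec
    return list(seen.values())
```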
> >>
> >> Any suggestions appreciated!
> >>
> >> Many thanks,
> >> Rachel Hill
> >>
> >>
> >>
> >>
> >>
> >> --
> >> David Kane
> >> Systems Librarian
> >> Waterford Institute of Technology
> >> http://library.wit.ie/
> >> T: ++353.51302838
> >> M: ++353.876693212
> >>
> >>
> >>
> >
> >
> > -------------------------------------------------------
> >
> > Fachinformationszentrum Karlsruhe, Gesellschaft für
> > wissenschaftlich-technische Information mbH.
> > Sitz der Gesellschaft: Eggenstein-Leopoldshafen, Amtsgericht
> > Mannheim HRB 101892.
> > Geschäftsführerin: Sabine Brünger-Weilandt.
> > Vorsitzender des Aufsichtsrats: MinR Hermann Riehl.
> >
>
>