The German eSciDoc project (http://www.escidoc.org) is currently working
on a duplicate checker. It compares a given document with all existing
documents in its index (i.e., in a Fedora repository) and returns
a probability that the new document is a duplicate of an existing one.
Technically, it calculates similarities between documents based on
7-word groups.
The similarity algorithm is a Java port of the original C code from
"Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke, Simeon
Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012].
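Roughly, the idea looks like this (a simplified sketch only, not the
actual docsim code; class and method names are made up):

import java.util.HashSet;
import java.util.Set;

// Simplified sketch of similarity over 7-word groups ("shingles").
// NOT the actual docsim code; names are illustrative.
public class ShingleSketch {

    private static final int GROUP_SIZE = 7;

    // Collect every consecutive 7-word group of a text.
    static Set<String> wordGroups(String text) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        Set<String> groups = new HashSet<String>();
        for (int i = 0; i + GROUP_SIZE <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < GROUP_SIZE; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words[i + j]);
            }
            groups.add(sb.toString());
        }
        return groups;
    }

    // Jaccard-style overlap: shared groups / all groups.
    static double similarity(String a, String b) {
        Set<String> ga = wordGroups(a);
        Set<String> gb = wordGroups(b);
        Set<String> union = new HashSet<String>(ga);
        union.addAll(gb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<String>(ga);
        shared.retainAll(gb);
        return (double) shared.size() / union.size();
    }
}

The real implementation works against the whole index rather than
pairwise strings, and returns a duplicate probability rather than a
raw overlap score.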
The Java source code is available here (open-source licensed under the
CDDL):
http://www.escidoc-project.de/software/docsim.zip
Some (rudimentary) documentation can be found here:
http://www.escidoc-project.de/JSPWiki/en/DuplicationDetectionService
The code is still pre-beta. We would like to hear from others
whether it works for them and how we could further improve the code.
One thing on our list is to include at least some descriptive
metadata in the similarity algorithm.
Matthias.
> -----Original Message-----
> From: Repositories discussion list
> [mailto:[log in to unmask]] On Behalf Of Leslie Carr
> Sent: Monday, June 23, 2008 3:36 PM
> To: [log in to unmask]
> Subject: Re: Dealing with duplicate papers
>
> I'm sure that the real gotcha is that they won't be exact
> duplicates. That would be too easy :-)
> --
> Les
>
>
> On 23 Jun 2008, at 14:20, David Kane wrote:
>
>
> Hi Rachel,
>
> RSS is good. In EPrints you can turn any search into
> an RSS feed. It's easy then to have a PHP script or similar
> consume that feed, cache it, and display it in the web page.
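>
> Something like the following would do it (a rough sketch in
> Java rather than the PHP I actually used, with the caching
> step left out; the feed URL is made up):
>
> import java.net.URL;
> import javax.xml.parsers.DocumentBuilderFactory;
> import org.w3c.dom.Document;
> import org.w3c.dom.Element;
> import org.w3c.dom.NodeList;
>
> // Rough sketch: fetch an RSS feed and render its items as an
> // HTML list. Caching is omitted; the URL is hypothetical.
> public class FeedRenderer {
>     public static void main(String[] args) throws Exception {
>         String feedUrl =
>                 "http://eprints.example.org/cgi/search?output=RSS";
>         Document doc = DocumentBuilderFactory.newInstance()
>                 .newDocumentBuilder()
>                 .parse(new URL(feedUrl).openStream());
>         NodeList items = doc.getElementsByTagName("item");
>         System.out.println("<ul>");
>         for (int i = 0; i < items.getLength(); i++) {
>             Element item = (Element) items.item(i);
>             String title = item.getElementsByTagName("title")
>                     .item(0).getTextContent();
>             String link = item.getElementsByTagName("link")
>                     .item(0).getTextContent();
>             System.out.printf("<li><a href=\"%s\">%s</a></li>%n",
>                     link, title);
>         }
>         System.out.println("</ul>");
>     }
> }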
>
> Combining multiple feeds is a bit more of a problem.
> This could be achieved by using FeedBurner or Yahoo! Pipes. I
> am not sure whether these de-dupe the feeds, but that is an
> enhancement that could be made to the PHP code that consumes
> the feed and renders the HTML.
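>
> As a first cut, the de-duping could key each item on a
> normalised title, something like this (a sketch only:
> exact-title matching will of course miss near-duplicates,
> and the FeedItem class is just a stand-in for a parsed entry):
>
> import java.util.ArrayList;
> import java.util.LinkedHashMap;
> import java.util.List;
> import java.util.Map;
>
> // Sketch: merge items from several feeds, keeping the first
> // item seen for each normalised title.
> public class FeedMerger {
>
>     // Minimal stand-in for a parsed feed entry (hypothetical).
>     static class FeedItem {
>         final String title;
>         final String link;
>         FeedItem(String title, String link) {
>             this.title = title;
>             this.link = link;
>         }
>     }
>
>     static List<FeedItem> merge(List<List<FeedItem>> feeds) {
>         Map<String, FeedItem> unique =
>                 new LinkedHashMap<String, FeedItem>();
>         for (List<FeedItem> feed : feeds) {
>             for (FeedItem item : feed) {
>                 // Lower-case and collapse whitespace so trivial
>                 // formatting differences don't look like new titles.
>                 String key = item.title.toLowerCase()
>                         .replaceAll("\\s+", " ").trim();
>                 if (!unique.containsKey(key)) {
>                     unique.put(key, item);
>                 }
>             }
>         }
>         return new ArrayList<FeedItem>(unique.values());
>     }
> }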
>
> I say PHP because that is what I used. You are welcome
> to use my code, if you like.
>
> Best,
>
> David.
>
>
> 2008/6/23 Rachel Hill <[log in to unmask]>:
>
>
> Hi all,
> I'm trying to figure out the best way to solve
> the following problem of duplicate papers:
>
> A new research centre is affiliated with 2
> universities. Some papers from the centre will be co-authored
> by both universities, some will be authored by just the one.
> Each university will have a copy of papers written by its
> authors in its institutional repository. Now the research
> centre will create a website containing a record of all its
> papers, and wants to pull all its papers from each IR and
> join them together into one publications list.
> Does anyone have advice on the best (and
> easiest) means of doing this, while automatically removing
> duplicates (where papers have been co-authored)? What method
> should we use: OAI-PMH? Something else? Has anyone had
> experience with this before?
>
> Any suggestions appreciated!
>
> Many thanks,
> Rachel Hill
>
>
>
>
>
> --
> David Kane
> Systems Librarian
> Waterford Institute of Technology
> http://library.wit.ie/
> T: ++353.51302838
> M: ++353.876693212
>
>
>
-------------------------------------------------------
Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH.
Registered office: Eggenstein-Leopoldshafen; Amtsgericht Mannheim, HRB 101892.
Managing Director: Sabine Brünger-Weilandt.
Chairman of the Supervisory Board: MinR Hermann Riehl.