Hi,
Great practical suggestions in this thread already.
I just wanted to add a link to a relevant paper (with Ann Green and Libbie Stephenson) on data quality review (DQR) (http://www.ijdc.net/index.php/ijdc/article/view/9.1.263). The paper describes a framework for quality-related curation activities and explains how DQR is practiced in three domain-specific data archives and in a selection of other data repositories. The paper argues that the research community as a whole needs to take this issue seriously; otherwise we'll have repositories full of non-usable "stuff."
I believe that real engagement from researchers - in devising and promoting data policies, tools, and guidelines; not just in complying - is necessary in order to strike the balance between, as Robin says below, "the depositor feeling hassled and our being satisfied with the submission."
-Limor


_____________

Limor Peer, PhD
Associate Director for Research | Institution for Social and Policy Studies | Yale University
77 Prospect Street | P.O.Box 208209 | New Haven, Connecticut 06520-8209
203-432-0054
[log in to unmask]
www.isps.yale.edu
@l_peer


From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Robin Rice
Sent: Thursday, June 04, 2015 12:52 PM
To: [log in to unmask]
Subject: Re: [RESEARCH-DATAMAN] Quality control of research data

Hi,

We try to strike a happy medium between the kind of in-depth quality control a domain repository can do, where the data types are well understood, and a no-questions-asked upload policy that would lead to completely unusable data and frustrated end-users.

We don't allow submissions that don't have at least one HUMAN-readable documentation file (this could even be a readme.txt explaining the content and structure of the files, or a copy of the paper if that does the trick). Our definition of a dataset is one or more data files plus documentation, so that is the minimum necessary and is simple to explain. Of course the users will be multidisciplinary as well as the depositors, so one can't count on shared understandings. This is in addition to the description field in the metadata.
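As a rough illustration (purely hypothetical code, not DataShare's actual workflow, and the documentation filenames are assumptions), that minimum rule, one or more data files plus at least one human-readable documentation file, could be sketched as a simple deposit check:

```python
# Hypothetical sketch of the minimum-deposit rule: a dataset is one or
# more data files plus at least one human-readable documentation file.
# The recognised documentation names below are illustrative assumptions.
DOC_NAMES = {"readme.txt", "readme.md", "documentation.txt"}

def is_valid_deposit(filenames):
    """True if the file list contains documentation and at least one data file."""
    lowered = [name.lower() for name in filenames]
    has_docs = any(name in DOC_NAMES for name in lowered)
    has_data = any(name not in DOC_NAMES for name in lowered)
    return has_docs and has_data
```

A moderation step could run a check like this on upload and bounce the deposit back with a pointer to the Depositor's Checklist when it fails.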

I agree the only time to have leverage with the depositor about these things is at the point of deposit, so it's quite handy to have a moderation step in the workflow rather than accept it automatically.

On the DataShare wiki (public but internally facing) there are guidelines we've written for ourselves as repository administrators to follow when checking over a new submission (with thanks to my colleague Pauline Ward for maintaining this well). It has links to public-facing documentation that we can point them to as well, such as the Depositor's Checklist: http://edin.ac/1EYhiOT

There is a balance to be struck in terms of the depositor feeling hassled and our being satisfied with the submission. We have certainly noticed some of these last minute EPSRC submitters have quite a low tolerance for moderation. We even had a head of department "test" us to make sure the process was easy enough for her to recommend to the other academic staff.

We often have to correct ourselves when calling the depositors 'users', because they are not the ultimate users of the data in the repository, though they are the ones we're in contact with at present. To steal from the Simpsons, "Will somebody please think of the users?" ;-)

Cheers,
Robin

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Khokhar, Masud
Sent: 04 June 2015 17:28
To: [log in to unmask]
Subject: Re: Quality control of research data

Not sure it is as simple as that, John. We do ask them to provide a description for datasets and chase them if they don't (at least at a basic level). The issues emerge with more difficult datasets, where there are huge numbers of files, hundreds of different file formats, and proprietary formats where we can't really tell much. Researchers will also talk about self-describing datasets, e.g. a piece of code with lots of embedded comments, and therefore not provide much descriptive detail for the full dataset. I do take your point as well.

Best wishes,
Masud


From: John Milner <[log in to unmask]>
Reply-To: "[log in to unmask]" <[log in to unmask]>
Date: Thursday, 4 June 2015 17:05
To: "[log in to unmask]" <[log in to unmask]>
Subject: Re: Quality control of research data

I wouldn't care whether the researchers are annoyed or not! What kind of cretin submits data with no sensible context?

John

Sent from my iPad

On 4 Jun 2015, at 16:53, Khokhar, Masud <[log in to unmask]> wrote:
Hi all,

Now that most of us have started receiving datasets, what are you doing in terms of quality checking of data? We have just received a large zip file of data which consists of about 200MB+ of .log files. Without knowing the research behind the data (data is often submitted before publication), we can't really say whether these log files are a by-product of the data or the actual data itself.
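For what it's worth, one low-effort way to triage a deposit like this, assuming it arrives as a standard zip archive, is to inventory its file extensions without extracting anything, so you can see at a glance whether it is all logs or a mix of formats. A hypothetical sketch:

```python
import zipfile
from collections import Counter

def extension_inventory(zip_path):
    """Count file extensions inside a zip without extracting it,
    to get a quick picture of what a large deposit contains."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            name = info.filename
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
            counts[ext] += 1
    return counts
```

An inventory dominated by one extension (say, 2,000 .log files and no documentation) is at least a concrete, specific prompt to take back to the depositor, rather than an open-ended "what is this?"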

We also don't want to go back to the researchers with all these questions; they are already annoyed enough.

Any ideas/tips/tricks/sharing of experience highly appreciated.

Best wishes,
Masud

--
Masud Khokhar
Head of Digital Innovation
Lancaster University Library
Lancaster University,
LA1 4YH, Lancaster, United Kingdom

Tel: +44 (0)1524 5-94236 | Email: [log in to unmask] | Twitter: @mkhokhar