The EuSpRIG discussion list drew my attention to this paper, whose
abstract suggests that advanced scientific thinking and resources are
being wasted because of poor data handling.
Keith Baggerly and Kevin Coombes, "Deriving Chemosensitivity from cell
lines: Forensic bioinformatics and reproducible research in
high-throughput biology"
http://www.ndns.nl/static/files/sls/presentations/Baggerly-AnnalsAppliedStats.pdf
Abstract starts
High-throughput biological assays such as microarrays let us ask very
detailed questions about how diseases operate, and promise to let us
personalize therapy. Data processing, however, is often not described
well enough to allow for exact reproduction of the results, leading to
exercises in "forensic bioinformatics" where aspects of raw data and
reported results are used to infer what methods must have been employed.
Unfortunately, poor documentation can shift from an inconvenience to an
active danger when it obscures not just methods but errors.
[5 case studies of published work]
Discussion starts
On the nature of common errors.
In all of the case studies examined above, forensic reconstruction
identifies errors that are hidden by poor documentation. Unfortunately,
these case studies are illustrative, not exhaustive; further problems
similar to the ones detailed above are described in the supplementary
reports. The case studies also share other commonalities. In particular,
they illustrate that the most common problems are simple: e.g.,
confounding in the experimental design (all TET before all FEC), mixing
up the gene labels (off-by-one errors), mixing up the group labels
(sensitive/resistant); most of these mixups involve simple switches or
offsets. These mistakes are easy to make, particularly if working with
Excel or if working with 0/1 labels instead of names (as with binreg).
We have encountered these and like problems before. As part of the 2002
Competitive Analysis of Microarray Data (CAMDA) competition, Stivers et
al. (2003) identified and corrected a mixup in annotation affecting
roughly a third of the data which was driven by a simple one-cell
deletion from an Excel file coupled with an inappropriate shifting up of
all values in the affected column only. [RAR's italic emphasis]
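
To make the mechanism in that last sentence concrete, here is a minimal Python sketch of a one-cell deletion that shifts a single column up, together with a simple check against a trusted mapping. The probe IDs, gene names and helper functions are all hypothetical, invented for illustration; they are not taken from the paper or from the CAMDA data.

```python
def delete_cell_and_shift_up(column, index):
    """Simulate deleting one cell and shifting the rest of that column up."""
    shifted = column[:index] + column[index + 1:]
    return shifted + [None]  # the last row of the column is left empty


def count_mismatches(ids, labels, reference):
    """Count rows whose label disagrees with a trusted id -> label mapping."""
    return sum(1 for key, label in zip(ids, labels) if reference[key] != label)


# Hypothetical annotation table: one ID column and one label column.
probe_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
gene_labels = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]

# Trusted mapping, kept separately from the spreadsheet.
reference = dict(zip(probe_ids, gene_labels))

# Delete the third cell of the label column only; the ID column is untouched.
corrupted = delete_cell_and_shift_up(gene_labels, 2)

# Every row from the deletion point onwards now carries the wrong label.
mismatches = count_mismatches(probe_ids, corrupted, reference)
print(f"{mismatches} of {len(probe_ids)} rows are mislabelled")
```

The point of the sketch is only that one deleted cell mislabels every row below it, and that joining on an identifier rather than on row position makes the shift detectable.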