I have been hesitant to add my opinion so far because I'm
more used to listening to this forum than to telling others
what I think.
"Why" and "what" to deposit are absolutely
interconnected. Once you decide why you want to do it, then
you will probably know what will be the best format and vice
versa.
I'm not sure whether depositing raw images will help us
understand the biology better in the future.
But storing those difficult datasets to help future software
development sounds really far-fetched. It assumes that future
crystallographers will never grow crystals that deliver
difficult datasets. If that is the case, and in 10, 20 or 30
years the next generation is growing much better crystals,
then they won't need such software development.
If that is not the case, and once in a while (or more often)
they get something out of the ordinary, then software
developers will take those datasets and develop whatever they
need to handle such cases.
Am I missing the point of the discussion here?
Regards,
Vaheh
-----Original Message-----
From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of Robert Esnouf
Sent: Monday, October 31, 2011 10:31 AM
To: [log in to unmask]
Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
Dear All,
As someone who recently left crystallography for sequencing,
I should modify Tassos's point...
"A full data-set is a few terabytes, but post-processing
reduces it to sub-Gb size."
My experience from HiSeqs is that "full" here means the base
calls - equivalent to the unmerged HKLs - which are hardly raw
data. NGS (short-read) sequencing is an imaging technique, and
the images are more like >100TB for a 15-day run on a single
flow cell. The raw base calls are about 5TB. The compressed,
mapped data (BAM file, for a human genome, 30x coverage) is
about 120GB. It is only a variant call file (VCF, differences
from a stated human reference genome) that is sub-Gb, and these
files are - unsurprisingly - unsuited to detailed statistical
analysis. Also, $1k is not yet an economic cost...
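
To put the figures above side by side, here is a minimal sketch that works out how much each processing stage shrinks the data. The sizes are the rough values quoted above; the images are taken as exactly 100TB (a lower bound) and the "sub-Gb" VCF as 1GB (an upper bound), both of which are assumptions. The names `STAGES_GB` and `reduction_factors` are made up for illustration.

```python
# Approximate sizes in GB for one 15-day HiSeq run, as quoted in the text.
# Assumptions: ">100TB" of images is taken as 100 TB (lower bound) and the
# "sub-Gb" VCF as 1 GB (upper bound). Decimal units: 1 TB = 1000 GB.
STAGES_GB = [
    ("raw images", 100_000.0),
    ("raw base calls", 5_000.0),
    ("mapped BAM (human, 30x)", 120.0),
    ("variant calls (VCF)", 1.0),
]


def reduction_factors(stages):
    """Return (from_stage, to_stage, factor) for each consecutive pair."""
    return [
        (a_name, b_name, a_size / b_size)
        for (a_name, a_size), (b_name, b_size) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for src, dst, factor in reduction_factors(STAGES_GB):
        print(f"{src} -> {dst}: about {factor:.0f}x smaller")
    # With a lower bound on the images and an upper bound on the VCF,
    # this overall ratio is itself a lower bound.
    overall = STAGES_GB[0][1] / STAGES_GB[-1][1]
    print(f"images -> VCF overall: at least about {overall:.0f}x smaller")
```

Under these assumptions the base calls are already about 20 times smaller than the images, which is why they are hardly raw data.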
The DNA information capacity in a single human body dwarfs the
entire world's disk capacity, so storing DNA is a no-brainer
here. Sequencing groups are making very hard-nosed economic
decisions about what to store - indeed it is a research topic
in itself - but the scale of the problem is very much bigger.
My tuppence ha'penny is that depositing "raw" images along
with everything else in the PDB is a nice idea but would have
little impact on science (human/animal/plant health or
understanding of biology).
1) If confined to structures in the PDB, the images would just
be the ones giving the final best data - hence the ones least
likely to have been problematic. I'd be more interested in
SFs/maps for looking at ligand-binding etc...
2) Unless this were done before paper acceptance they would be
of little use to referees seeking to review important
structural papers. I'd like to see PDB validation reports
(which could include automated data processing, perhaps culled
from synchrotron sites, SFs and/or maps) made available to
referees in advance of publication. This would be enabled by
deposition, but could be achieved in other ways.
3) The datasets of interest to methods developers are unlikely
to be the ones deposited. They should be in contact with
synchrotron archives directly. Processing multiple lattices is
a case in point here.
4) Remember the "average consumer" of a PDB file is not a
crystallographer. More likely to be a graduate student in a
clinical lab. For him/her things like occupancies and
B-factors are far more serious concerns... I'm not trivializing
the issue, but importance is always relative. Are there
"outsiders" on the panel to keep perspective?
Robert
--
Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK
---- Original message ----
>Date: Mon, 31 Oct 2011 11:37:47 +0100
>Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>
> Dear all,
> The discussion about keeping primary data, and what level of
> data can be considered 'primary', has - rather unsurprisingly -
> come up also in areas other than structural biology.
> An example is next generation sequencing. A full data-set is a
> few terabytes, but post-processing reduces it to sub-Gb size.
> However, the post-processed data, as in our case, have suffered
> from the inadequacy of computational "reduction" ... At least our
> institute has decided to create a double back-up of the primary
> data in triplicate. For that reason our facility bought three
> -80 freezers, one on site in the basement, one on the top floor,
> and one off-site, and they keep the DNA to be sequenced. A
> sequencing run is already sub-1k$ and it will not become more
> expensive. So, if it's important, do it again. It's cheaper and
> it's better.
> At first sight, that does not apply to MX. Or does it?
> So, maybe the question is not "To archive or not to archive"
> but "What to archive".
> (similarly, it never crossed my mind if I should "be or not
> be" - I always wondered "what to be")
> A.
> On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
>
> At 20:59, Jrh wrote:
> ...
>
> So: universities are now establishing their own institutional
> repositories, driven largely by Open Access demands of funders.
> For these to host raw datasets that underpin publications is a
> reasonable role in my view and indeed they already have this
> category in the University of Manchester eScholar system, for
> example. I am set to explore locally here whether they would
> accommodate all our Lab's raw X-ray image datasets per annum
> that underpin our published crystal structures.
>
> It would be helpful if readers of this CCP4bb could kindly also
> explore with their own universities if they have such an
> institutional repository and if raw data sets could be
> accommodated.
>
> Please do email me off list with this information if you prefer
> but within the CCP4bb is also good.
>
> Dear John,
>
> I'm pretty sure that there exists no consistent policy to
> provide an "institutional repository" for deposition of
> scientific data at German universities or Max-Planck institutes
> or Helmholtz institutions; at least I have never heard of
> something like this. More specifically, our University of
> Konstanz certainly does not have the infrastructure to provide
> this.
>
> I don't think that Germany is the only country which is the
> exception to any rule of availability of "institutional
> repositories". Rather, I'm almost amazed that British and
> American institutions seem to support this.
>
> Thus I suggest not focusing exclusively on official
> institutional repositories, but exploring alternatives:
> distributed filestores like Google's BigTable, BitTorrent or
> others might be just as suitable - check out
> I guess that any crystallographic lab could easily
> sacrifice/donate a TB of storage for the purposes of this
> project in 2011 (and maybe 2 TB in 2012, 3 in 2013, ...), but
> clearly the level of work to set this up should be kept as low
> as possible (a BitTorrent daemon seems simple enough).
>
> Just my 2 cents,
>
> Kay
>
> Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
> Department of Biochemistry (B8)
> Netherlands Cancer Institute,
> Dept. B8, 1066 CX Amsterdam, The Netherlands
> Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791