I think having each lab deal with archiving their own data and making
them available to the public is much less practical than having a
centralized repository for the following reasons:
1. The overhead would be many times that of a centralized repository,
because of multiplication of efforts.
2. Every lab would need a dedicated and trained person to maintain
the archive. It is very likely that this person will be some graduate
student or post-doc. When these people leave, another person needs to
be identified. Looking at how software, chemical inventories, etc.
are maintained in places that operate like this, I have little
confidence that a reliable repository would ever be established and
properly maintained.
3. Cost. A central repository benefits from economies of scale; it
would be much more expensive to distribute a repository of this size
over hundreds of labs, each one of them needing to provide the hardware
for its portion.
With respect to funding, if the community identifies a central
repository as an indispensable, must-have item, dedicated to
maintaining the highest standards in a scientific field, and aimed at
avoiding false interpretation of data as well as at reducing the
occurrence of fabricated data, I can't imagine funding agencies would
object too much.
Best - MM
On Aug 17, 2007, at 10:45 AM, Winter, G (Graeme) wrote:
> On the question of what is "trivial" I would argue that deposition of
> the raw diffraction images is not - for a few simple reasons:
>
> No I think I have to agree with Kim on this one - it is not trivial.
> Setting up even a modest RAID array costs real money and takes real
> time. Setting one up with a guaranteed quality of service (uptime,
> bandwidth, disaster recovery) capable of storing the images which
> directly contributed to every deposition would be very expensive.
>
> Now, if the people who collected that data could find a "place on the
> web" to store the compressed images, and deposit a link to where they
> can be found, that would be ace. People who are interested in the
> results can go fetch the images - since probably only ~ 5 people would
> actually download them this would not be too bandwidth intensive. If
> that place on the web dies - well hopefully they still have them on
> firewire or on DVD ...
>
> Now the main difference with this would be to move the images from
> being something inconveniently large which usually we don't share to
> something inconveniently large we usually make available to those who
> are interested. From the replies to the list in this discussion you
> could probably figure out who *is* interested, and it is not
> world+dog. This shift of burden would turn it from something which
> would require a huge grant proposal which would almost certainly not
> get funded to a 1% increase in the cost of the structure solution for
> the lab in question and peace of mind for the community at large.
>
> I for one would be happy to write a few scripts which will compress
> batches of images, write the index pages, compute the md5sums so we
> know that the data are ok and generally put together a toolbox for
> curating images on the web.
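
As a rough illustration of the compress-and-checksum part of such a toolbox, here is a minimal Python sketch. It is not Graeme's script: the function names (md5_of_file, archive_batch), the "*.img" file pattern, the MD5SUMS manifest name and the choice of gzip are all assumptions made for the example, and index-page generation is left out. Checksums are taken over the uncompressed frames, so anyone who downloads and unpacks a batch can check it against the manifest.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_batch(image_dir, out_dir, pattern="*.img"):
    """Gzip every matching image into out_dir and write an MD5SUMS manifest.

    The pattern and the manifest name are hypothetical defaults, not a
    community standard. Returns a list of (md5, file name) pairs.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for image in sorted(image_dir.glob(pattern)):
        target = out_dir / (image.name + ".gz")
        # Stream the frame into a gzip file without loading it whole.
        with open(image, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Checksum the original, uncompressed frame.
        entries.append((md5_of_file(image), image.name))
    # One "digest, two spaces, name" line per frame, as md5sum prints it.
    manifest = out_dir / "MD5SUMS"
    manifest.write_text("".join(f"{md5}  {name}\n" for md5, name in entries))
    return entries
```

Calling archive_batch("frames", "web") would leave one .gz file per frame plus the manifest in web/, ready to be placed behind whatever link gets deposited.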
>
> So we end up with...
>
> You want to argue with my structure - well, here are the frames, *you*
> solve it.
>
> Can't really argue with that.
>
> Again, just MHO.
>
> Cheers,
>
> Graeme
>
>
>
>
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of
> Mischa Machius
> Sent: 17 August 2007 15:07
> To: [log in to unmask]
> Subject: [ccp4bb] Depositing Raw Data
>
> Since there are several sub-plots in that mammoth thread, I thought
> branching out would be a good idea.
>
> I think working out the technicalities of how to publicly archive raw
> data is fairly simple compared to the bigger picture.
>
> 1. Indeed, all the required meta-data will need to be captured just
> like for refined coordinates. This will be an additional burden for
> the depositor, but it's clearly necessary, and I do consider it
> trivial. Trivial, as in the sense of "straightforward", i.e., there is
> no fundamental problem blocking progress. As mentioned, current data
> processing software captures most of the pertinent information
> already, although that could be improved. I am sure that the
> beamlines, diffraction-system manufacturers and authors of
> data-processing software can be convinced to cooperate appropriately,
> if the community needs these features.
>
> 2. More tricky is the issue of a unified format for the images, which
> would be very helpful. There have been attempts at creating unified
> image formats, but - to my knowledge - they haven't gotten anywhere.
> However, I am also convinced that such formats can be designed, and
> that detector manufacturers will have no problems implementing them,
> considering that their detectors may not be purchased if they don't
> comply with requirements defined by the community.
>
> 3. The hardware required to store all those data, even in a highly
> redundant way, is clearly trivial.
>
> 4. The biggest problem I can see in the short run is the burden on the
> databank when thousands of investigators start transferring gigabytes
> of images, all at the same time.
>
> 5. I think the NSA might go bonkers over that traffic, although it
> certainly has enough storage space. Imagine, they let their decoders
> go wild on all those images. They might actually find interesting
> things in them...
>
> So, what's the hold-up?
>
> Best - MM
>
>
>
> On Aug 17, 2007, at 3:23 AM, Winter, G (Graeme) wrote:
>
>> Storing all the images *is* expensive but it can be done - the JCSG
>> do this and make available a good chunk of their raw diffraction
>> data. The cost is, however, in preparing this to make the data useful
>> for the person who downloads it.
>>
>> If we are going to store and publish the raw experimental
>> measurements (e.g. the images) which I think would be spectacular, we
>> will also need to define a minimum amount of metadata which should be
>> supplied with this to allow a reasonable chance of reproduction of
>> the results. This is clearly not trivial, but there is probably
>> enough information in the harvest and log files from e.g. CCP4,
>> HKL2000, Phenix to allow this.
>>
>> The real problem will be in getting people to dig out that tape / dvd
>> with the images on, prepare the required metadata and "deposit" this
>> information somewhere. Actually storing it is a smaller challenge,
>> though this is a long way from being trivial.
>>
>> On an aside - firewire disks are indeed a very cheap way of storing
>> the data. There is a good reason why they are much cheaper than the
>> equivalent RAID array. They fail. Ever lost 500GB of data in one go?
>> Ouch. ;o)
>>
>> Just MHO.
>>
>> Cheers,
>>
>> Graeme
>>
>> -----Original Message-----
>> From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of
>> Phil Evans
>> Sent: 16 August 2007 15:13
>> To: [log in to unmask]
>> Subject: Re: [ccp4bb] The importance of USING our validation tools
>>
>> What do you count as raw data? Rawest are the images - everything
>> beyond that is modelling - but archiving images is _expensive_!
>> Unmerged intensities are probably more manageable
>>
>> Phil
>>
>>
>> On 16 Aug 2007, at 15:05, Ashley Buckle wrote:
>>
>>> Dear Randy
>>>
>>> These are very valid points, and I'm so glad you've taken the
>>> important step of initiating this. For now I'd like to respond to
>>> one of them, as it concerns something I and colleagues in Australia
>>> are doing:
>>>>
>>>> The more information that is available, the easier it will be to
>>>> detect fabrication (because it is harder to make up more
>>>> information convincingly). For instance, if the diffraction data
>>>> are deposited, we can check for consistency with the known
>>>> properties of real macromolecular crystals, e.g. that they contain
>>>> disordered solvent and not vacuum. As Tassos Perrakis has
>>>> discovered, there are characteristic ways in which the standard
>>>> deviations depend on the intensities and the resolution. If
>>>> unmerged data are deposited, there will probably be evidence of
>>>> radiation damage, weak effects from intrinsic anomalous scatterers,
>>>> etc. Raw images are probably even harder to simulate convincingly.
>>>
>>> After the recent Science retractions we realised that it's about
>>> time raw data was made available. So, we have set about creating the
>>> necessary IT and software to do this for our diffraction data, and
>>> are encouraging Australian colleagues to do the same. We are about a
>>> week away from launching a web-accessible repository for our
>>> recently published (e.g. deposited in PDB) data, and this should
>>> coincide with an upcoming publication describing a new structure
>>> from our labs. The aim is that publication occurs simultaneously
>>> with release in PDB as well as raw diffraction data on our website.
>>> We hope to house as much of our data as possible, as well as data
>>> from other Australian labs, but obviously the potential dataset will
>>> be huge, so we are trying to develop, and make available freely to
>>> the community, software tools that allow others to easily set up
>>> their own repositories. After brief discussion with PDB the plan is
>>> that PDB include links from coordinates/SF's to the raw data using a
>>> simple handle that can be incorporated into a URL. We would hope
>>> that we can convince the journals that raw data must be made
>>> available at the time of publication, in the same way as coordinates
>>> and structure factors. Of course, we realise that there will be many
>>> hurdles along the way but we are convinced that simply making the
>>> raw data available ASAP is a 'good thing'.
>>>
>>> We are happy to share more details of our IT plans with the CCP4BB,
>>> such that they can be improved, and look forward to hearing feedback
>>>
>>> cheers
>
>
> ----------------------------------------------------------------------
> Mischa Machius, PhD
> Associate Professor
> UT Southwestern Medical Center at Dallas
> 5323 Harry Hines Blvd.; ND10.214A
> Dallas, TX 75390-8816; U.S.A.
> Tel: +1 214 645 6381
> Fax: +1 214 645 6353