I think the average structure is much less than 20 GB, since most data
seem to be collected as SAD. I took a quick look at my own data: ~20
structures (3 MAD, 9 SAD, 3 MIR, 4 MR), with 150 - 9600 amino acids per
asu. The average was closer to 3 GB (compressed). The largest dataset was
24 GB (compressed), the smallest 300 MB (compressed).
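
A rough sketch of the arithmetic discussed in this thread, for 5,000
structures at the quoted 20 GB average and at the 3 GB average above. It
assumes decimal units (1 TB = 1000 GB) and takes the quoted figure of
about $400 per 1 TB drive as a flat price, ignoring redundancy, servers
and staff. The function name archive_estimate is only for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units, as drive vendors quote them


def archive_estimate(n_structures, avg_gb, usd_per_tb=400):
    """Return (total size in TB, raw disk cost in USD).

    usd_per_tb defaults to the ~$400 per 1 TB drive quoted in the thread;
    treating it as a flat per-TB price is an assumption.
    """
    total_tb = n_structures * avg_gb / GB_PER_TB
    return total_tb, total_tb * usd_per_tb


if __name__ == "__main__":
    for label, avg_gb in [("20 GB average", 20), ("3 GB average", 3)]:
        tb, usd = archive_estimate(5000, avg_gb)
        print(f"{label}: {tb:.0f} TB, about ${usd:,.0f} in raw disk")
```

Under these assumptions, 5,000 structures at 20 GB each come to 100 TB
rather than 1 TB, which matches the factor of 100 mentioned below. At
3 GB each the total is about 15 TB.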
Juergen
Mischa Machius wrote:
> Hmm - I think I miscalculated, by a factor of 100 even!... need more
> coffee. In any case, I still think it would be doable. Best - MM
>
>
> On Aug 16, 2007, at 9:30 AM, Mischa Machius wrote:
>
>> I don't think archiving images would be that expensive. For one, I
>> have found that most formats can be compressed quite substantially
>> using simple, standard procedures like bzip2. If optimized, raw
>> images won't take up that much space. Also, initially, only those
>> images that have been used to obtain phases and to refine finally
>> deposited structures could be archived. If the average structure
>> takes up 20GB of space, 5,000 structures would be 1TB, which fits on
>> a single hard drive for less than $400. If the community thinks this
>> is a worthwhile endeavor, money should be available from granting
>> agencies to establish a central repository (e.g., at the RCSB).
>> Imagine what could be done with as little as $50,000. For large
>> detectors, binning could be used, but given current hard drive
>> prices and future developments, that won't be necessary. Best - MM
>>
>>
>> On Aug 16, 2007, at 9:13 AM, Phil Evans wrote:
>>
>>> What do you count as raw data? Rawest are the images - everything
>>> beyond that is modelling - but archiving images is _expensive_!
>>> Unmerged intensities are probably more manageable.
>>>
>>> Phil
>>>
>>>
>>> On 16 Aug 2007, at 15:05, Ashley Buckle wrote:
>>>
>>>> Dear Randy
>>>>
>>>> These are very valid points, and I'm so glad you've taken the
>>>> important step of initiating this. For now I'd like to respond to
>>>> one of them, as it concerns something I and colleagues in
>>>> Australia are doing:
>>>>
>>>>>
>>>>> The more information that is available, the easier it will be to
>>>>> detect fabrication (because it is harder to make up more
>>>>> information convincingly). For instance, if the diffraction data
>>>>> are deposited, we can check for consistency with the known
>>>>> properties of real macromolecular crystals, e.g. that they
>>>>> contain disordered solvent and not vacuum. As Tassos Perrakis has
>>>>> discovered, there are characteristic ways in which the standard
>>>>> deviations depend on the intensities and the resolution. If
>>>>> unmerged data are deposited, there will probably be evidence of
>>>>> radiation damage, weak effects from intrinsic anomalous
>>>>> scatterers, etc. Raw images are probably even harder to simulate
>>>>> convincingly.
>>>>
>>>>
>>>> After the recent Science retractions we realised that it's about
>>>> time raw data was made available. So, we have set about creating
>>>> the necessary IT and software to do this for our diffraction data,
>>>> and are encouraging Australian colleagues to do the same. We are
>>>> about a week away from launching a web-accessible repository for
>>>> our recently published (e.g. deposited in PDB) data, and this should
>>>> coincide with an upcoming publication describing a new structure
>>>> from our labs. The aim is that publication occurs simultaneously
>>>> with release in PDB as well as raw diffraction data on our
>>>> website. We hope to house as much of our data as possible, as well
>>>> as data from other Australian labs, but obviously the potential
>>>> dataset will be huge, so we are trying to develop, and make
>>>> available freely to the community, software tools that allow
>>>> others to easily set up their own repositories. After brief
>>>> discussion with PDB the plan is that PDB include links from
>>>> coordinates/SF's to the raw data using a simple handle that can be
>>>> incorporated into a URL. We would hope that we can convince the
>>>> journals that raw data must be made available at the time of
>>>> publication, in the same way as coordinates and structure
>>>> factors. Of course, we realise that there will be many hurdles
>>>> along the way but we are convinced that simply making the raw data
>>>> available ASAP is a 'good thing'.
>>>>
>>>> We are happy to share more details of our IT plans with the
>>>> CCP4BB, such that they can be improved, and look forward to
>>>> hearing feedback.
>>>>
>>>> cheers
>>>
>>
>>
>> --------------------------------------------------------------------------------
>> Mischa Machius, PhD
>> Associate Professor
>> UT Southwestern Medical Center at Dallas
>> 5323 Harry Hines Blvd.; ND10.214A
>> Dallas, TX 75390-8816; U.S.A.
>> Tel: +1 214 645 6381
>> Fax: +1 214 645 6353
>
>
--
Jürgen Bosch
University of Washington
Dept. of Biochemistry, K-426
1705 NE Pacific Street
Seattle, WA 98195
Box 357742
Phone: +1-206-616-4510
FAX: +1-206-685-7002