Dear Frank and Tim,
When I first joined Global Phasing, we tried very hard to obtain images from
data collections that we thought would be useful to the work that I am doing
here. Gerard exploited his considerable knowledge and contacts to the utmost
to do this. To the best of my recollection, we failed to obtain a single
image set.
I am not going to name names, and in fact there was no shortage of goodwill
and cooperation from the people that we asked. The problem was that those
people did not have the time for the technical challenges involved in
dealing with old tape formats that had been written using utilities and
platforms that were no longer current.
I understand the arguments against long-term storage of _all_ raw data, and
I do know that it takes time and effort to maintain long-term storage. The
problem I have is that it is easy to slide from this position to one of
saying simply: "It isn't worth keeping raw data". This I take serious issue
with: raw images are an essential tool for software and methods development
in this area. We are massively indebted to the JCSG for their Repository of
Crystallographic Datasets for example: our job would be a lot harder without
it.
Graeme said:
> Like with crystallisation conditions, you don't usually know a priori
> which you need.
This is of course true, but I believe that it is possible nevertheless to
have some guidelines that will capture datasets that contain value.
Regards,
Peter.
On 23/10/15 11:00, Frank von Delft wrote:
> Bravo, that needed saying. Here's one observation:
>
> For the last decade I've been tasked with the crystallography aspects
> of a major generator of human structures (>500), all supposedly
> medically important. Raw data is all dutifully backed up (we have
> very good IT), we even capture a tonne of metadata and annotation, a
> luxury not available to most labs (excellent Research Informatics
> groups are rare!)
>
> We get lots of questions about protocols - but I have yet to receive
> a single request for raw data.
>
> The cost is not storage, it is human time. Dataset speleology is
> seriously hard work... Hell, I struggle to read even a fraction of
> the /papers/!!
>
> phx.
>
>
>
>
> On 23/10/2015 10:16, Tim Gruene wrote:
>> Dear all,
>>
>> I have wondered if it is really worth the effort (and disk space)
>> for central long-term storage of diffraction images. What fraction
>> of such data will ever be looked at in the future after the
>> respective project has been published? Even if some revolutionary
>> new technology would be developed, I guess this would mostly be
>> applied to current rather than old projects. Given the substantial
>> energy consumption of long term storage (including DVDs and tape as
>> these have to be produced), the gross benefit might be greater from
>> deleting old data at some point, saving energy and effort for more
>> current things.
>>
>> I have been through a few disk crashes. Often I was annoyed because
>> I had to reinstall a new computer, and sometimes I could not
>> recover some data which I would have liked to. But in fact it often
>> cleaned my computer and life went on even without access to
>> whatever got lost.
>>
>> So what is the scientific argument behind long-term storage of
>> diffraction images other than academic interest in re-processing
>> the data? As mentioned above, I guess that the benefit of
>> re-processing the data may only be minor and effort might be better
>> spent on concurrent projects.
>>
>> Best wishes, Tim
>>
>> On Wednesday, October 21, 2015 06:03:21 PM Allister Crow wrote:
>>> On the last point about storing diffraction images, I wonder what
>>> the community's opinion is of uploading images to the Zenodo
>>> archive for safe-keeping and sharing?
>>>
>>> The Zenodo project is being run by the folks at CERN, and is EU
>>> funded to support scientific data sharing. (Zenodo.org)
>>>
>>> Until the PDB does this, perhaps this is one of the better ways
>>> through which we can ensure preservation (or at least another
>>> backup) of our most important diffraction images?
>>>
>>> - Ally
>>>
>>> ps I should also say that I originally learned of Zenodo from
>>> Graeme Winter at Diamond.
>>>
>>> ----------------- Allister Crow Department of Pathology
>>> University of Cambridge Google Scholar
>>> Profile<http://bit.ly/11ga7Sq> Research Gate
>>> Profile<http://bit.ly/137Ytt4> Departmental
>>> Page<http://www.path.cam.ac.uk/directory/allister-crow>
>>>
>>>> On 21 Oct 2015, at 17:03, William G. Scott wrote:
>>>>
>>>> Dear CCP4 Citizenry:
>>>>
>>>> I'm worried about medium to long-term data storage and
>>>> integrity. At the moment, our lab uses mostly HFS+ formatted
>>>> filesystems on our disks, which is the OS X default. HFS+
>>>> always struck me as somewhat fragile, and resource forks at
>>>> best are a (seemingly needless) headache, at least as far as
>>>> crystallography datasets go. (True, you can do
>>>> HFS-compression and losslessly shrink your images by a factor
>>>> of 2, or shrink your ccp4 installation, but these are fairly
>>>> minor advantages).
>>>>
>>>> I read the CCP4 wiki page
>>>> http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Filesystems
>>>> that summarizes some of the other options. From what I have read, there
>>>> and elsewhere, it seems like zfs and btrfs might be
>>>> significantly better alternatives to HFS+, but I really would
>>>> like to get a sense of what others have experienced with these,
>>>> or other, equally or more robust options. I don't feel like I
>>>> know enough to critically evaluate the information.
>>>>
>>>> Anyone know what the NSA uses?
>>>>
>>>> I recently created a de novo backup of some personal data on an
>>>> external HFS+ drive (photos, movies, music, etc). I was very
>>>> unpleasantly surprised to find several files had been silently
>>>> corrupted. (In the case of a movie file, for example, the file
>>>> would play but could not be copied. In another case, a music
>>>> file would not copy, yet it had identical md5sum and sha1
>>>> checksums when compared to an uncorrupted redundant backup I
>>>> had. I'm still puzzled by this, but it suggests the resource
>>>> fork might be the source of the corruption, and, more
>>>> worrisome still, that conventional checksums aren't detecting
>>>> some silently corrupted data, so I am not even sure if zfs
>>>> self-healing would be the answer.)
>>>>
>>>> Since we as a community are now encouraging primary X-ray
>>>> diffraction images to be stored, I can only imagine the problem
>>>> could be ubiquitous, and a discussion might be worth having.
>>>> (I apologize if this has been addressed previously; I did
>>>> search the archive.)
>>>>
>>>> All the best,
>>>>
>>>> Bill
>>>>
>>>>
>>>>
>>>> William G. Scott Director, Program in Biochemistry and
>>>> Molecular Biology Professor, Department of Chemistry and
>>>> Biochemistry and The Center for the Molecular Biology of RNA
>>>> University of California at Santa Cruz Santa Cruz, California
>>>> 95064 USA
>