Hi James,
I see no real need for lossy compression datasets. They may be useful
for demonstration purposes, and to follow synchrotron data collection
remotely. But for processing I need the real data. It is my experience
that structure solution, at least in the difficult cases, depends on
squeezing out every bit of scattering information from the data, as much
as is possible with the given software. Using a lossy-compression
dataset in this situation would give me the feeling "if structure
solution does not work out, I'll have to re-do everything with the
original data" - and that would be double work. Better not start going
down that route.
The CBF byte compression puts even a 20-bit detector pixel into a single
byte, on average. In the case of Pilatus fine-slicing frames, these can
be further compressed using bzip2, almost down to the entropy of the
data (since there are so many zero pixels) - and that would still be
lossless.
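For anyone unfamiliar with how a 20-bit pixel fits into one byte, the idea behind CBF-style byte-offset packing can be sketched as follows. This is a rough illustration, not the exact CBF BYTE_OFFSET specification, and the frame values are made up:

```python
import bz2

def byte_offset_encode(pixels):
    """Rough sketch of CBF-style byte-offset packing: store the delta to
    the previous pixel in 1 byte when it fits, escaping to 16- or 32-bit
    deltas otherwise. Mostly-flat frames then cost ~1 byte per pixel."""
    out = bytearray()
    prev = 0
    for p in pixels:
        delta = p - prev
        if -127 <= delta <= 127:
            out += delta.to_bytes(1, "little", signed=True)
        elif -32767 <= delta <= 32767:
            out += (-128).to_bytes(1, "little", signed=True)    # 1-byte escape
            out += delta.to_bytes(2, "little", signed=True)
        else:
            out += (-128).to_bytes(1, "little", signed=True)
            out += (-32768).to_bytes(2, "little", signed=True)  # 2-byte escape
            out += delta.to_bytes(4, "little", signed=True)
        prev = p
    return bytes(out)

# A made-up 1000-pixel line: mostly zeros, a few strong peaks up to ~17 bits.
frame = [0] * 990 + [40000, 41000, 40500, 0, 120000, 119000, 0, 0, 7, 0]
packed = byte_offset_encode(frame)
print(len(packed), "bytes for", len(frame), "pixels")   # ~1 byte per pixel
print(len(bz2.compress(packed)), "bytes after bzip2")   # zero runs shrink further
```

The long runs of zero deltas are exactly what bzip2 then squeezes out, which is why the combination approaches the entropy of the data while staying lossless.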
Storing lossily-compressed datasets alongside the originals would of
course not double the disk space needed, but it would significantly
raise the administrative burden.
Just to point out my standpoint in this whole discussion about storage
of raw data:
I've been storing our synchrotron datasets on disks, since 1999. The
amount of money we spend per year for this purpose is constant (less
than 1000€). This is possible because the price of a GB of disk space drops
faster than the amount of data per synchrotron trip rises. So if the
current storage is full (about every 3 years), we set up a bigger RAID
(plus a backup RAID); the old data, after copying over, always consumes
only a fraction of the space on the new RAID.
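As a sanity check, the arithmetic behind the constant budget can be sketched with illustrative numbers (the rates and starting values below are assumptions for the sake of the example, not measurements):

```python
# Back-of-envelope check of the "constant budget" claim, with made-up
# but plausible rates: price per GB falls ~30% per year, while the data
# volume collected per year grows ~20%.
price_per_gb = 1.0      # € per GB in year 0 (hypothetical)
data_per_year = 500.0   # GB collected per year in year 0 (hypothetical)

costs = []
for year in range(10):
    costs.append(price_per_gb * data_per_year)  # € spent that year
    price_per_gb *= 0.70                        # hardware gets cheaper
    data_per_year *= 1.20                       # detectors get faster

# Each year costs 0.70 * 1.20 = 0.84 of the previous year: the annual
# spend shrinks as long as prices fall faster than data volumes grow.
print([round(c) for c in costs])
```

The same geometric argument is why old data always ends up occupying only a fraction of each new RAID.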
So I think the storage cost is actually not the real issue - rather, the
real issue has a strong psychological component. People a) may not
realize that the software they use is constantly being improved, and
that needs data which cover all the corner cases; b) often do not wish
to give away something because they feel it might help their
competitors, or expose their faults.
best,
Kay (XDS co-developer)
-------- Original Message --------
Date: Mon, 7 Nov 2011 09:30:11 -0800
From: James Holton <[log in to unmask]>
Subject: image compression
At the risk of sounding like another "poll", I have a pragmatic question
for the methods development community:
Hypothetically, assume that there was a website where you could download
the original diffraction images corresponding to any given PDB file,
including "early" datasets that were from the same project, but because
of smeary spots or whatever, couldn't be solved. There might even be
datasets with "unknown" PDB IDs because that particular project never
did work out, or because the relevant protein sequence has been lost.
Remember, few of these datasets will be less than 5 years old if we try
to allow enough time for the original data collector to either solve it
or graduate (and then cease to care). Even for the "final" dataset,
there will be a delay, since the half-life between data collection and
coordinate deposition in the PDB is still ~20 months. Plenty of time to
forget. So, although the images were archived (probably named "test"
and in a directory called "john") it may be that the only way to figure
out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs. Assume this was done. But there will
always be some datasets that don't match any PDB. Are those
interesting? What about ones that can't be processed? What about ones
that can't even be indexed? There may be a lot of those!
(hypothetically, of course).
Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting,
and annotated them as much as possible. There will be about 20 datasets
for any given PDB ID.
Now assume that for each of these datasets this hypothetical website has
two links, one for the "raw data", which will average ~2 GB per wedge
(after gzip compression, taking at least ~45 min to download), and a
second link for a "lossy compressed" version, which is only ~100
MB/wedge (2 min download). When decompressed, the images will visually
look pretty much like the originals, and generally give you very similar
Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other
statistics when processed with contemporary software. Perhaps a bit
worse. Essentially, lossy compression is equivalent to adding noise to
the images.
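To illustrate that equivalence: one hypothetical lossy scheme quantizes each pixel with a step proportional to its Poisson error, so the reconstruction error behaves like a small extra noise term. Everything below (the quantization rule, the fraction, the simulated intensities) is an assumption for the sketch, not the actual compression scheme:

```python
import math
import random

random.seed(0)

def lossy(I, frac=0.3):
    """Hypothetical lossy step: quantize a pixel value with a step that is
    a fixed fraction of its Poisson error sqrt(I)."""
    step = max(1.0, frac * math.sqrt(max(I, 1)))
    return round(I / step) * step

# Fake intensities with (Gaussian-approximated) counting noise.
truth = [random.expovariate(1 / 2000) for _ in range(10000)]
counts = [random.gauss(t, math.sqrt(t)) for t in truth]
recon = [lossy(c) for c in counts]

# RMS error introduced by the quantization vs. the counting noise itself.
added = math.sqrt(sum((r - c) ** 2 for r, c in zip(recon, counts)) / len(counts))
poisson = math.sqrt(sum(truth) / len(truth))
print(f"added noise / counting noise = {added / poisson:.3f}")  # a small fraction
```

With a step of 0.3·sqrt(I), the quantization error has an RMS of roughly 0.3/sqrt(12) ≈ 9% of the Poisson noise per pixel, which is why the processed statistics come out "very similar, perhaps a bit worse".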
Which one would you try first? Does lossy compression make it easier to
hunt for "interesting" datasets? Or is it just too repugnant to have
"modified" the data in any way shape or form ... after the detector
manufacturer's software has "corrected" it? Would it suffice to simply
supply a couple of "example" images for download instead?
-James Holton
MAD Scientist