Hi James,

Regarding the suggestion of lossy compression, it is really hard to
comment without having a good idea of the real cost of doing this. So,
I have a suggestion:

 - grab a bag of JCSG data sets, which we know should all be essentially OK.
 - squash then unsquash them with your macguffin, perhaps
randomizing whether copy A or copy B of each set is the squashed one,
to keep the comparison blind (a rough sketch of such a harness
follows this list).
 - process them with Elves / xia2 / autoPROC (something which is reproducible)
 - pop the results into pdb_redo
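
Something like the following is what I have in mind for the
round-trip step - a rough Python sketch only, where "squash" and
"unsquash" stand in for whatever codec is actually under test, and
all the paths are made up:

    import random
    import shutil
    import subprocess
    from pathlib import Path

    datasets = sorted(Path("jcsg_data").iterdir())
    Path("roundtrip").mkdir(exist_ok=True)

    with open("which_was_squashed.log", "w") as log:
        for ds in datasets:
            arm = random.choice(["A", "B"])        # blind assignment
            out = Path("roundtrip") / ds.name
            if arm == "A":
                # A: lossy round trip (squash, then unsquash)
                subprocess.run(["squash", str(ds), "tmp.sqz"], check=True)
                subprocess.run(["unsquash", "tmp.sqz", str(out)], check=True)
            else:
                # B: untouched copy, as the control
                shutil.copytree(ds, out)
            log.write(f"{ds.name}\t{arm}\n")       # for unblinding later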

Then compare what comes out. Ultimately, adding "noise" may (or may
not) make a measurable difference to the final refinement - this
would be a way of telling whether it does. Why would I have any
reason to worry? Because the noise being added is not really random -
it will be compression artifacts. These could have a subtle effect on
how the errors are estimated and so on. However, you can hum and haw
about this for a decade without reaching a conclusion.
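
One cheap sanity check on that worry, short of full refinement, is to
look at the residual (original minus round-tripped frame) and see
whether it behaves like independent noise. A minimal sketch, assuming
the fabio library (or any CBF reader) is available - filenames again
made up:

    import fabio          # assumed available; any image reader would do
    import numpy as np

    original = fabio.open("frame_0001_original.cbf").data.astype(float)
    roundtrip = fabio.open("frame_0001_roundtrip.cbf").data.astype(float)

    residual = original - roundtrip
    residual -= residual.mean()

    # For genuinely random noise, neighbouring pixels of the residual
    # should be uncorrelated; compression artifacts tend to be
    # spatially correlated, which would show up as a clearly non-zero
    # value here.
    corr = np.corrcoef(residual[:, :-1].ravel(),
                       residual[:, 1:].ravel())[0, 1]
    print(f"RMS error: {residual.std():.2f} ADU, "
          f"neighbour correlation: {corr:.3f}")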

Here, though, is something which in all honesty we can actually
evaluate, so is it worth giving it a go? If the results are
persuasive (i.e. a "report on the use of lossy compression in
transmission and storage of X-ray diffraction data" is actually read
and endorsed by the community) that would make the method a much
stronger candidate for inclusion in e.g. cbflib.

I would, however, always encourage (where possible) that the original
raw data be kept somewhere on disk in unmodified form - I am not a
fan of one-way computational processes applied to unique data.
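
(And if the originals are to sit untouched on disk, recording
checksums up front makes "unmodified" something you can verify years
later - a trivial sketch, with made-up paths:)

    import hashlib
    from pathlib import Path

    # Record SHA-256 digests of the pristine images so that
    # "unmodified" can be checked at any later date.
    with open("originals.sha256", "w") as manifest:
        for img in sorted(Path("raw_data").rglob("*.cbf")):
            digest = hashlib.sha256(img.read_bytes()).hexdigest()
            manifest.write(f"{digest}  {img}\n")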

Thoughts anyone?

Cheerio,

Graeme

On 7 November 2011 17:30, James Holton wrote:
> At the risk of sounding like another "poll", I have a pragmatic question for
> the methods development community:
>
> Hypothetically, assume that there was a website where you could download the
> original diffraction images corresponding to any given PDB file, including
> "early" datasets that were from the same project, but because of smeary
> spots or whatever, couldn't be solved.  There might even be datasets with
> "unknown" PDB IDs because that particular project never did work out, or
> because the relevant protein sequence has been lost.  Remember, few of these
> datasets will be less than 5 years old if we try to allow enough time for
> the original data collector to either solve it or graduate (and then cease
> to care).  Even for the "final" dataset, there will be a delay, since the
> half-life between data collection and coordinate deposition in the PDB is
> still ~20 months.  Plenty of time to forget.  So, although the images were
> archived (probably named "test" and in a directory called "john") it may be
> that the only way to figure out which PDB ID is the "right answer" is by
> processing them and comparing to all deposited Fs.  Assume this was done.
>  But there will always be some datasets that don't match any PDB.  Are those
> interesting?  What about ones that can't be processed?  What about ones that
> can't even be indexed?  There may be a lot of those!  (hypothetically, of
> course).
>
> Anyway, assume that someone did go through all the trouble to make these
> datasets "available" for download, just in case they are interesting, and
> annotated them as much as possible.  There will be about 20 datasets for any
> given PDB ID.
>
> Now assume that for each of these datasets this hypothetical website has two
> links, one for the "raw data", which will average ~2 GB per wedge (after
> gzip compression, taking at least ~45 min to download), and a second link
> for a "lossy compressed" version, which is only ~100 MB/wedge (2 min
> download).  When decompressed, the images will visually look pretty much
> like the originals, and generally give you very similar Rmerge, Rcryst,
> Rfree, I/sigma, anomalous differences, and all other statistics when
> processed with contemporary software.  Perhaps a bit worse.  Essentially,
> lossy compression is equivalent to adding noise to the images.
>
> Which one would you try first?  Does lossy compression make it easier to
> hunt for "interesting" datasets?  Or is it just too repugnant to have
> "modified" the data in any way shape or form ... after the detector
> manufacturer's software has "corrected" it?  Would it suffice to simply
> supply a couple of "example" images for download instead?
>
> -James Holton
> MAD Scientist
>