At the risk of sounding like another "poll", I have a pragmatic question 
for the methods development community:

Hypothetically, assume that there were a website where you could download 
the original diffraction images corresponding to any given PDB file, 
including "early" datasets that were from the same project, but because 
of smeary spots or whatever, couldn't be solved.  There might even be 
datasets with "unknown" PDB IDs because that particular project never 
did work out, or because the relevant protein sequence has been lost.  
Remember, few of these datasets will be less than 5 years old if we 
allow enough time for the original data collector to either solve the 
structure or graduate (and then cease to care).  Even for the "final" dataset, 
there will be a delay, since the half-life between data collection and 
coordinate deposition in the PDB is still ~20 months.  Plenty of time to 
forget.  So, although the images were archived (probably named "test" 
and in a directory called "john"), it may be that the only way to figure 
out which PDB ID is the "right answer" is by processing them and 
comparing the results to all deposited Fs.  Assume this was done.  But there will 
always be some datasets that don't match any PDB.  Are those 
interesting?  What about ones that can't be processed?  What about ones 
that can't even be indexed?  There may be a lot of those!  
(hypothetically, of course).
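
(Purely for illustration, the "compare to all deposited Fs" step might 
look something like the sketch below, assuming everything has already 
been reduced to amplitudes on a common indexing convention.  Real 
matching would also have to cope with indexing ambiguities, alternative 
space-group settings, and scaling, presumably with a toolkit like cctbx 
or gemmi; the dict-of-amplitudes data structures here are hypothetical.)

import numpy as np

def amplitude_correlation(f_new, f_dep):
    """Pearson correlation of |F| over reflections common to both sets."""
    common = sorted(set(f_new) & set(f_dep))
    if len(common) < 100:          # too few common reflections to judge
        return 0.0
    a = np.array([f_new[hkl] for hkl in common])
    b = np.array([f_dep[hkl] for hkl in common])
    return float(np.corrcoef(a, b)[0, 1])

def best_pdb_match(f_new, deposited):
    """Return (pdb_id, correlation) of the closest deposited entry.

    f_new:     dict mapping (h, k, l) -> |F| from the freshly processed data
    deposited: dict mapping pdb_id -> {(h, k, l): |F|} for each PDB entry
    """
    scores = {pdb_id: amplitude_correlation(f_new, f_dep)
              for pdb_id, f_dep in deposited.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

Anything that never rises above, say, CC ~0.3 against any entry would 
land in the "doesn't match any PDB" pile.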

Anyway, assume that someone did go through all the trouble to make these 
datasets "available" for download, just in case they are interesting, 
and annotated them as much as possible.  There will be about 20 datasets 
for any given PDB ID.

Now assume that for each of these datasets this hypothetical website has 
two links, one for the "raw data", which will average ~2 GB per wedge 
(after gzip compression, taking at least ~45 min to download), and a 
second link for a "lossy compressed" version, which is only ~100 
MB/wedge (2 min download).  When decompressed, the images will visually 
look pretty much like the originals, and generally give you very similar 
Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other 
statistics when processed with contemporary software.  Perhaps a bit 
worse.  Essentially, lossy compression is equivalent to adding noise to 
the images.
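
(To put a number on "equivalent to adding noise": one rough, purely 
illustrative check is to compare the decompressed pixels to the 
originals and express the difference relative to the photon-counting 
noise already in the image.  This is not how any particular compressor 
works; the array names below are made up, and real images would be read 
with something like fabio or dxtbx.)

import numpy as np

def added_noise_fraction(original, decompressed):
    """RMS(decompressed - original) relative to the RMS Poisson noise."""
    orig = original.astype(float)
    diff = decompressed.astype(float) - orig
    rms_added = np.sqrt(np.mean(diff ** 2))            # compression error
    poisson_rms = np.sqrt(max(np.mean(orig), 1.0))     # photon-counting noise
    return rms_added / poisson_rms

A ratio well below 1 means the compression error is buried under the 
Poisson noise already in the data, which is the sense in which the lossy 
images should give you about the same statistics, or perhaps a bit worse.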

Which one would you try first?  Does lossy compression make it easier to 
hunt for "interesting" datasets?  Or is it just too repugnant to have 
"modified" the data in any way, shape, or form ... after the detector 
manufacturer's software has "corrected" it?  Would it suffice to simply 
supply a couple of "example" images for download instead?

-James Holton
MAD Scientist