I'll second that... I can't remember anybody manning the barricades over
"corrected" CCD images, and they've just been so much more practical.
Different kind of problem, I know, but equivalent situation: the people
to ask are not the purists, but the ones struggling with the huge
volumes of data. I'll take the lossy version any day if it speeds up
real-time evaluation of data quality, helps me browse my datasets, and
allows me to do remote but intelligent data collection.
phx.
On 08/11/2011 02:22, Herbert J. Bernstein wrote:
> Dear James,
>
> You are _not_ wasting your time. Even if the lossy compression ends
> up only being used to stage preliminary images forward on the net while
> full images slowly work their way forward, having such a compression
> that preserves the crystallography in the image will be an important
> contribution to efficient workflows. Personally I suspect that
> such images will have more important uses, e.g. facilitating
> real-time monitoring of experiments using detectors providing
> full images at data rates that simply cannot be handled without
> major compression. We are already in that world. The reason that
> the Dectris images use Andy Hammersley's byte-offset compression,
> rather than going uncompressed or using CCP4 compression, is that
> in January 2007 we were sitting right on the edge of a nasty
> CPU-performance/disk bandwidth tradeoff, and the byte-offset
> compression won the competition. In that round a lossless
> compression was sufficient, but just barely. In the future,
> I am certain some amount of lossy compression will be
> needed to sample the dataflow while the losslessly compressed
> images work their way through a very back-logged queue to the disk.
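>
> (To make the byte-offset idea concrete: it is essentially a delta coder,
> storing each pixel as its difference from the previous pixel in a single
> byte when that difference is small and escaping to wider integers
> otherwise. A simplified sketch follows; the real CBFlib byte_offset codec
> adds a 64-bit escape level and pins down the byte order.)
>
>     import struct
>
>     def byte_offset_compress(pixels):
>         # Simplified delta coder: one byte per pixel when the change from
>         # the previous pixel fits in a signed byte, otherwise an escape
>         # marker followed by a 16-bit or 32-bit signed integer.
>         out, last = bytearray(), 0
>         for v in pixels:
>             d = v - last
>             if -127 <= d <= 127:
>                 out += struct.pack("<b", d)
>             elif -32767 <= d <= 32767:
>                 out += struct.pack("<bh", -128, d)           # 0x80 escape
>             else:
>                 out += struct.pack("<bhi", -128, -32768, d)  # 0x8000 escape
>             last = v
>         return bytes(out)
>
> On quiet pixel-array frames most deltas fit in one byte, which is why this
> compresses well and still decodes quickly enough to keep up with the
> detector.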
>
> In the longer term, I can see people working with lossy-compressed
> images when sifting through massive volumes of images to select the
> 1% to 10% that will be useful in the final analysis, and which may then
> need to be handled in lossless form. If you can reject 90% of the images
> with a fraction of the effort needed to work with the remaining
> 10% of good images, you have made a good decision.
>
> And then there is the inevitable need to work with images on
> portable devices with limited storage over cell and WIFI networks. ...
>
> I would not worry about upturned noses. I would worry about
> the engineering needed to manage experiments. Lossy compression
> can be an important part of that engineering.
>
> Regards,
> Herbert
>
>
> At 4:09 PM -0800 11/7/11, James Holton wrote:
>> So far, all I really have is a "proof of concept" compression algorithm here:
>> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
>>
>> Not exactly "portable" since you need ffmpeg and the x264 libraries
>> set up properly. The latter seems to be constantly changing things
>> and breaking the former, so I'm not sure how "future proof" my
>> "algorithm" is.
>>
>> Something that caught my eye recently was fractal compression,
>> particularly since FIASCO has been part of the NetPBM package for
>> about 10 years now. It seems to give compression-versus-quality comparable
>> to x264 (to my eye), but I'm presently wondering whether I'd be wasting my
>> time developing this further? Will the crystallographic world simply
>> turn up its collective nose at lossy images? Even if it means waiting
>> 6 years for "Nielsen's Law" to make up the difference in network
>> bandwidth?
>>
>> -James Holton
>> MAD Scientist
>>
>> On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
>> <[log in to unmask]> wrote:
>>> This is a very good question. I would suggest that both versions
>>> of the old data are useful. If what is being done is simple validation
>>> and regeneration of what was done before, then the lossy compression
>>> should be fine in most instances. However, when what is being
>>> done hinges on the really fine details -- looking for lost faint
>>> spots just peeking out from the background, looking at detailed
>>> peak profiles -- then the lossless compression version is the
>>> better choice. The annotation for both sets should be the same.
>>> The difference is in storage and network bandwidth.
>>>
>>> Hopefully the fraud issue will never again rear its ugly head,
>>> but if it should, then having saved the losslessly compressed
>>> images might prove to have been a good idea.
>>>
>>> To facilitate experimentation with the idea, if there is agreement
>>> on the particular lossy compression to be used, I would be happy
>>> to add it as an option in CBFlib. Right now all the compressions
>>> we have are lossless.
>>>
>>> Regards,
>>> Herbert
>>>
>>>
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> [log in to unmask]
>>> =====================================================
>>>
>>> On Mon, 7 Nov 2011, James Holton wrote:
>>>
>>>> At the risk of sounding like another "poll", I have a pragmatic question
>>>> for the methods development community:
>>>>
>>>> Hypothetically, assume that there was a website where you could download
>>>> the original diffraction images corresponding to any given PDB file,
>>>> including "early" datasets that were from the same project, but because of
>>>> smeary spots or whatever, couldn't be solved. There might even be datasets
>>>> with "unknown" PDB IDs because that particular project never did work out,
>>>> or because the relevant protein sequence has been lost. Remember, few of
>>>> these datasets will be less than 5 years old if we try to allow enough time
>>>> for the original data collector to either solve it or graduate (and then
>>>> cease to care). Even for the "final" dataset, there will be a delay, since
>>>> the half-life between data collection and coordinate deposition in the PDB
>>>> is still ~20 months. Plenty of time to forget. So, although the images
>>>> were archived (probably named "test" and in a directory called "john"),
>>>> it may be that the only way to figure out which PDB ID is the "right
>>>> answer" is by processing them and comparing to all deposited Fs. Assume
>>>> this was done.
>>>> But there will always be some datasets that don't match any PDB. Are
>>>> those interesting? What about ones that can't be processed? What about
>>>> ones that can't even be indexed? There may be a lot of those!
>>>> (hypothetically, of course).
>>>>
>>>> Anyway, assume that someone did go through all the trouble to make these
>>>> datasets "available" for download, just in case they are interesting, and
>>>> annotated them as much as possible. There will be about 20 datasets for
>>>> any given PDB ID.
>>>>
>>>> Now assume that for each of these datasets this hypothetical website has
>>>> two links, one for the "raw data", which will average ~2 GB per wedge
>>>> (after gzip compression, taking at least ~45 min to download), and a
>>>> second link for a "lossy compressed" version, which is only ~100 MB/wedge
>>>> (2 min download). When decompressed, the images will visually look pretty
>>>> much like the originals, and generally give you very similar Rmerge,
>>>> Rcryst, Rfree, I/sigma, anomalous differences, and all other statistics
>>>> when processed with contemporary software. Perhaps a bit worse.
>>>> Essentially, lossy compression is equivalent to adding noise to the
>>>> images.
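>>>>
>>>> (One way to put a number on that added noise is to compare it with the
>>>> counting noise already in the data; the toy sketch below, with made-up
>>>> values for the signal level and the compression error, shows the kind of
>>>> comparison I mean.)
>>>>
>>>>     import numpy as np
>>>>
>>>>     # Toy illustration: treat lossy compression as one more noise source
>>>>     # and compare it to photon-counting noise.  The 50 counts/pixel and
>>>>     # the 2-count compression error are made-up numbers.
>>>>     rng = np.random.default_rng(0)
>>>>     truth = rng.poisson(lam=50.0, size=(512, 512)).astype(float)
>>>>     lossy = truth + rng.normal(scale=2.0, size=truth.shape)
>>>>
>>>>     rms_added    = np.sqrt(np.mean((lossy - truth) ** 2))  # compression error
>>>>     rms_counting = np.sqrt(truth.mean())                   # Poisson ~ sqrt(counts)
>>>>     print("added / counting noise = %.2f" % (rms_added / rms_counting))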
>>>>
>>>> Which one would you try first? Does lossy compression make it easier to
>>>> hunt for "interesting" datasets? Or is it just too repugnant to have
>>>> "modified" the data in any way shape or form ... after the detector
>>>> manufacturer's software has "corrected" it? Would it suffice to simply
>>>> supply a couple of "example" images for download instead?
>>>>
>>>> -James Holton
>>>> MAD Scientist
>>>>
>