I think that truly universal deposition of raw images will not take off without a new type of compression that makes things faster and easier.
The compression discussion is therefore highly relevant - I would even suggest going to mathematicians and software engineers to provide
a highly efficient compression format for our type of data. Our data sets have some very typical repetitive features, so a whole series can very likely be compressed without losing information (differential compression across the series) - but this needs experts.
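
To make the idea concrete, here is a minimal sketch of what "differential compression in the series" could look like: keep the first frame losslessly, store only pixel-wise differences between consecutive frames, and run an ordinary lossless compressor over the result. The frame layout, data types and the choice of zlib are illustrative assumptions only, not a proposed format.

    import zlib
    import numpy as np

    def compress_series(frames):
        """frames: list of 2-D numpy arrays of equal shape/dtype (e.g. 16-bit counts)."""
        chunks = [zlib.compress(frames[0].tobytes())]             # reference frame, lossless
        for prev, cur in zip(frames, frames[1:]):
            delta = cur.astype(np.int32) - prev.astype(np.int32)  # frame-to-frame differences
            chunks.append(zlib.compress(delta.tobytes()))         # deltas are mostly small
        return chunks

    def decompress_series(chunks, shape, dtype=np.uint16):
        first = np.frombuffer(zlib.decompress(chunks[0]), dtype=dtype).reshape(shape)
        frames = [first]
        for chunk in chunks[1:]:
            delta = np.frombuffer(zlib.decompress(chunk), dtype=np.int32).reshape(shape)
            frames.append((frames[-1].astype(np.int32) + delta).astype(dtype))
        return frames

Whether such a scheme actually beats compressing each frame on its own depends on how stable the background really is from image to image - which is exactly the kind of question that needs the experts.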


Jan Dohnalek


On Tue, Nov 8, 2011 at 8:19 AM, Miguel Ortiz Lombardia <[log in to unmask]> wrote:
So the purists of speed seem to be more relevant than the purists of images.

We complain all the time about how many errors we have out there in our
experiments that we seemingly cannot account for. Yet, would we add
another source?

Sorry if I'm missing something serious here, but I cannot understand
this artificial debate. You can do useful remote data collection without
having to look at *each* image.


Miguel


On 08/11/2011 06:27, Frank von Delft wrote:
> I'll second that...  can't remember anybody on the barricades about
> "corrected" CCD images, but they've been just so much more practical.
>
> Different kind of problem, I know, but equivalent situation:  the people
> to ask are not the purists, but the ones struggling with the huge
> volumes of data.  I'll take the lossy version any day if it speeds up
> real-time evaluation of data quality, helps me browse my datasets, and
> allows me to do remote but intelligent data collection.
>
> phx.
>
>
>
> On 08/11/2011 02:22, Herbert J. Bernstein wrote:
>> Dear James,
>>
>>     You are _not_ wasting your time.  Even if the lossy compression ends
>> up only being used to stage preliminary images forward on the net while
>> full images slowly work their way forward, having such a compression
>> that preserves the crystallography in the image will be an important
>> contribution to efficient workflows.  Personally I suspect that
>> such images will have more important uses, e.g. facilitating
>> real-time monitoring of experiments using detectors providing
>> full images at data rates that simply cannot be handled without
>> major compression.  We are already in that world.  The reason that
>> the Dectris images use Andy Hammersley's byte-offset compression,
>> rather than going uncompressed or using CCP4 compression, is that
>> in January 2007 we were sitting right on the edge of a nasty
>> CPU-performance/disk bandwidth tradeoff, and the byte-offset
>> compression won the competition.   In that round a lossless
>> compression was sufficient, but just barely.  In the future,
>> I am certain some amount of lossy compression will be
>> needed to sample the dataflow while the losslessly compressed
>> images work their way through a very back-logged queue to the disk.
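
For concreteness, the byte-offset idea works roughly as follows: each pixel is stored as its difference from the previous pixel, in a single byte when the difference is small and behind an escape code otherwise. The sketch below is a simplified paraphrase in Python, not the CBFlib implementation, which handles further escape levels and other details.

    import struct

    def byte_offset_encode(pixels):
        """pixels: flat sequence of integer pixel values (raster order)."""
        out = bytearray()
        prev = 0
        for p in pixels:
            delta = p - prev
            if -127 <= delta <= 127:
                out += struct.pack('<b', delta)                  # common case: 1 byte
            elif -32767 <= delta <= 32767:
                out += b'\x80' + struct.pack('<h', delta)        # escape, then 2 bytes
            else:
                out += b'\x80' + struct.pack('<h', -32768) + struct.pack('<i', delta)
            prev = p
        return bytes(out)

Because neighbouring pixels in a diffraction image usually differ by only a few counts, most pixels cost a single byte, which is what made the scheme fast enough to keep up with the data rate.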
>>
>>     In the longer term, I can see people working with lossy compressed
>> images for analysis of massive volumes of images to select the
>> 1% to 10% that will be useful in a final analysis and may need
>> to be used in a lossless mode.  If you can reject 90% of the images
>> with a fraction of the effort needed to work with the resulting
>> 10% of good images, you have made a good decision.
>>
>> And then there is the inevitable need to work with images on
>> portable devices with limited storage over cellular and Wi-Fi networks ...
>>
>>     I would not worry about upturned noses.  I would worry about
>> the engineering needed to manage experiments.  Lossy compression
>> can be an important part of that engineering.
>>
>>     Regards,
>>       Herbert
>>
>>
>> At 4:09 PM -0800 11/7/11, James Holton wrote:
>>> So far, all I really have is a "proof of concept" compression
>>> algorithm here:
>>> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
>>>
>>> Not exactly "portable" since you need ffmpeg and the x264 libraries
>>> set up properly.  The latter seems to be constantly changing things
>>> and breaking the former, so I'm not sure how "future proof" my
>>> "algorithm" is.
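
For anyone who wants to try something along the same general lines, the x264 route can be driven from the ffmpeg command line. The sketch below is an illustration only - file names, the 8-bit conversion step and the quality setting are my own assumptions, not the script behind the link above.

    import subprocess

    # Encode a stack of numbered 8-bit greyscale PGM frames with x264.
    # Real diffraction images are 16-bit or deeper, so they would first have to be
    # rescaled to 8 bits, which is itself part of the lossy trade-off.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "frame_%04d.pgm",   # numbered input frames
        "-c:v", "libx264",        # H.264 encoder
        "-preset", "slow",
        "-crf", "20",             # lower CRF = less loss, larger file
        "lossy_stack.mkv",
    ], check=True)

    # Decode back to individual frames for processing.
    subprocess.run(["ffmpeg", "-y", "-i", "lossy_stack.mkv", "decoded_%04d.pgm"],
                   check=True)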
>>>
>>> Something that caught my eye recently was fractal compression,
>>> particularly since FIASCO has been part of the NetPBM package for
>>> about 10 years now.  It seems to give compression-versus-quality performance
>>> comparable to x264's (to my eye), but I'm presently wondering if I'd be wasting my
>>> time developing this further?  Will the crystallographic world simply
>>> turn up its collective nose at lossy images?  Even if it means waiting
>>> 6 years for "Nielsen's Law" to make up the difference in network
>>> bandwidth?
>>>
>>> -James Holton
>>> MAD Scientist
>>>
>>> On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
>>> <[log in to unmask]>  wrote:
>>>>   This is a very good question.  I would suggest that both versions
>>>>   of the old data are useful.  If what is being done is simple
>>>>   validation and regeneration of what was done before, then the lossy
>>>>   compression should be fine in most instances.  However, when what is
>>>>   being done hinges on the really fine details -- looking for lost faint
>>>>   spots just peeking out from the background, looking at detailed
>>>>   peak profiles -- then the lossless compression version is the
>>>>   better choice.  The annotation for both sets should be the same.
>>>>   The difference is in storage and network bandwidth.
>>>>
>>>>   Hopefully the fraud issue will never again rear its ugly head,
>>>>   but if it should, then having saved the losslessly compressed
>>>>   images might prove to have been a good idea.
>>>>
>>>>   To facilitate experimentation with the idea, if there is agreement
>>>>   on the particular lossy compression to be used, I would be happy
>>>>   to add it as an option in CBFlib.  Right now all the compressions
>>>>   we have are lossless.
>>>>   Regards,
>>>>    Herbert
>>>>
>>>>
>>>>   =====================================================
>>>>    Herbert J. Bernstein, Professor of Computer Science
>>>>     Dowling College, Kramer Science Center, KSC 121
>>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>>                   +1-631-244-3035
>>>>                   [log in to unmask]
>>>>   =====================================================
>>>>
>>>>   On Mon, 7 Nov 2011, James Holton wrote:
>>>>
>>>>>   At the risk of sounding like another "poll", I have a pragmatic
>>>>>   question for the methods development community:
>>>>>
>>>>>   Hypothetically, assume that there was a website where you could
>>>>>   download the original diffraction images corresponding to any given
>>>>>   PDB file, including "early" datasets that were from the same project,
>>>>>   but because of smeary spots or whatever, couldn't be solved.  There
>>>>>   might even be datasets with "unknown" PDB IDs because that particular
>>>>>   project never did work out, or because the relevant protein sequence
>>>>>   has been lost.  Remember, few of these datasets will be less than 5
>>>>>   years old if we try to allow enough time for the original data
>>>>>   collector to either solve it or graduate (and then cease to care).
>>>>>   Even for the "final" dataset, there will be a delay, since the
>>>>>   half-life between data collection and coordinate deposition in the
>>>>>   PDB is still ~20 months.  Plenty of time to forget.  So, although the
>>>>>   images were archived (probably named "test" and in a directory
>>>>>   called "john"), it may be that the only way to figure out which PDB
>>>>>   ID is the "right answer" is by processing them and comparing to all
>>>>>   deposited Fs.  Assume this was done.  But there will always be some
>>>>>   datasets that don't match any PDB.  Are those interesting?  What
>>>>>   about ones that can't be processed?  What about ones that can't even
>>>>>   be indexed?  There may be a lot of those!  (hypothetically, of
>>>>>   course).
>>>>>
>>>>>   Anyway, assume that someone did go through all the trouble to make
>>>>>   these datasets "available" for download, just in case they are
>>>>>   interesting, and annotated them as much as possible.  There will be
>>>>>   about 20 datasets for any given PDB ID.
>>>>>
>>>>>   Now assume that for each of these datasets this hypothetical website
>>>>>   has two links, one for the "raw data", which will average ~2 GB per
>>>>>   wedge (after gzip compression, taking at least ~45 min to download),
>>>>>   and a second link for a "lossy compressed" version, which is only
>>>>>   ~100 MB/wedge (2 min download).  When decompressed, the images will
>>>>>   visually look pretty much like the originals, and generally give you
>>>>>   very similar Rmerge, Rcryst, Rfree, I/sigma, anomalous differences,
>>>>>   and all other statistics when processed with contemporary software.
>>>>>   Perhaps a bit worse.  Essentially, lossy compression is equivalent
>>>>>   to adding noise to the images.
>>>>>
>>>>>   Which one would you try first?  Does lossy compression make it
>>>>>   easier to hunt for "interesting" datasets?  Or is it just too
>>>>>   repugnant to have "modified" the data in any way, shape or form ...
>>>>>   after the detector manufacturer's software has "corrected" it?
>>>>>   Would it suffice to simply supply a couple of "example" images for
>>>>>   download instead?
>>>>>
>>>>>   -James Holton
>>>>>   MAD Scientist
>>>>>
>>
>


--
Miguel

Architecture et Fonction des Macromolécules Biologiques (UMR6098)
CNRS, Universités d'Aix-Marseille I & II
Case 932, 163 Avenue de Luminy, 13288 Marseille cedex 9, France
Tel: +33(0) 491 82 55 93
Fax: +33(0) 491 26 67 20
mailto:[log in to unmask]
http://www.afmb.univ-mrs.fr/Miguel-Ortiz-Lombardia



--
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic

Tel: +420 296 809 390
Fax: +420 296 809 410