Something I have been playing with recently that might address your
problem in a way you like is SquashFS:
http://squashfs.sourceforge.net/
SquashFS is a read-only compressed file system. It compresses with gzip
--best, which in my experience gives compression comparable to bzip2 for
diffraction images.
Basically, it works a lot like burning to a CD. You run "mksquashfs" to
create the compressed image and then "mount -o loop" it. Then voila!
You can access everything in the archive as if it were uncompressed.
Disk I/O then consists of compressed data (decompression is done
by the kernel), and so does network traffic if you play a clever trick:
share the compressed file over NFS and "mount -o loop" it locally. This
has much bigger advantages than you might realize, because much of the
NFS traffic that brings a file server to its knees consists of the tiny
little "writes" done to update access times. NFS writes (and RAID
writes) are really expensive, and you can get a considerable performance
boost just by mounting your "data" disks read-only (or by adding
"noatime" as a mount option).
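In case it helps, the basic recipe looks roughly like this (the paths,
mount points and NFS export here are made up for illustration, so adjust
them to your own setup):

  # build the compressed image from a directory of images
  mksquashfs /data/collect1 collect1.sqsh
  # loop-mount it and browse it like any other directory
  mount -o loop -t squashfs collect1.sqsh /mnt/collect1

  # the NFS trick: export the *compressed* file from the server, then
  # loop-mount it on each client so only compressed bytes cross the wire
  mount fileserver:/archives /mnt/archives
  mount -o loop -t squashfs /mnt/archives/collect1.sqsh /mnt/collect1

  # for ordinary read/write data disks, read-only or noatime mounts
  # get rid of the access-time writes mentioned above
  mount -o remount,ro /data    # or: mount -o remount,noatime /data
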
Anyway, SquashFS is not as slick as the transparent compression you can
get with HFS or NTFS, but I personally like the fact that it is
read-only (good for data). For real-time backup, mksquashfs does
support "appending" to an existing archive, so you can probably build
your squashfs file on the USB disk at the beamline (even if the beamline
computer kernels can't mount it). However, if you MUST have your
processing files mixed amongst your images, you can use "unionfs" to
overlay a writable file system on top of the read-only one. Depends on how
cooperative your IT guys are...
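Roughly, the append and overlay steps look like this (mount points are
again invented, and the unionfs syntax differs a bit between the kernel
module and the FUSE version, so treat this as a sketch):

  # append a second run to an existing archive; mksquashfs appends by
  # default when the destination file already exists
  mksquashfs /data/run2 /mnt/usbdisk/collect1.sqsh

  # overlay a writable scratch directory on top of the read-only image
  # ("classic" unionfs kernel-module syntax)
  mount -o loop -t squashfs /mnt/usbdisk/collect1.sqsh /mnt/images
  mount -t unionfs -o dirs=/scratch=rw:/mnt/images=ro none /work
  # processing files end up in /scratch; the images stay untouched
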
-James Holton
MAD Scientist
Ian Tickle wrote:
> All -
>
> No doubt this topic has come up before on the BB: I'd like to ask
> about the current capabilities of the various integration programs (in
> practice we use only MOSFLM & XDS) for reading compressed diffraction
> images from synchrotrons. AFAICS XDS has limited support for reading
> compressed images (TIFF format from the MARCCD detector and CCP4
> compressed format from the Oxford Diffraction CCD); MOSFLM doesn't
> seem to support reading compressed images at all (I'm sure Harry will
> correct me if I'm wrong about this!). I'm really thinking about
> gzipped files here: bzip2 no doubt gives marginally smaller files but
> is very slow. Currently we bring back uncompressed images but it
> seems to me that this is not the most efficient way of doing things -
> or is my expectation that it's more efficient to read compressed
> images and uncompress them in memory simply not realised in practice?
> For example the AstexViewer molecular viewer software currently reads
> gzipped CCP4 maps directly and gunzips them in memory; this improves
> the response time by a modest factor of ~ 1.5, but this is because
> electron density maps are 'dense' from a compression point of view;
> X-ray diffraction images tend to have much more 'empty space' and the
> compression factor is usually considerably higher (as much as
> 10-fold).
>
> On a recent trip we collected more data than we anticipated & the
> uncompressed data no longer fitted on our USB disk (the data is backed
> up to the USB disk as it's collected), so we would have definitely
> benefited from compression! However file size is *not* the issue:
> disk space is cheap after all. My point is that compressed images
> surely require much less disk I/O to read. In this respect bringing
> back compressed images and then uncompressing them to a local disk
> completely defeats the object of compression - you actually more than
> double the I/O instead of reducing it! We see this when we try to
> process the ~150 datasets that we bring back on our PC cluster and the
> disk I/O completely cripples the disk server machine (and everyone
> who's trying to use it at the same time!) unless we're careful to
> limit the number of simultaneous jobs. When we routinely start to use
> the Pilatus detector on the beamlines this is going to be even more of
> an issue. Basically we have plenty of processing power from the
> cluster: the disk I/O is the bottleneck. Now you could argue that we
> should spread the load over more disks or maybe spend more on faster
> disk controllers, but the whole point about disks is they're cheap, we
> don't need the extra I/O bandwidth for anything else, and you
> shouldn't need to spend a fortune, particularly if there are ways of
> making the software more efficient, which after all will benefit
> everyone.
>
> Cheers
>
> -- Ian
>