On general scientific principles the reasons for archiving "raw data"
all boil down to one thing: there was a systematic error, and you hope
to one day account for it. After all, a "systematic error" is just
something you haven't modeled yet. Is it worth modelling? That depends...
There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
Given that the reproducibility of Fobs is typically < 3%, but
typical R/Rfree values are in the 20%s, it is safe to say that this is a
rather whopping systematic error. What causes it? Dunno. Would
structural biologists benefit from being able to model it? Oh yes!
Imagine being able to reliably see a ligand that has an occupancy of
only 0.05, or to unambiguously distinguish between two
proposed reaction mechanisms and back up your claims with hard-core
statistics (derived from SIGF). Perhaps even teasing apart all the
different minor conformers occupied by the molecule in its functional
cycle? I think this is the main reason why we all decided to archive
Fobs: 20% error is a lot.
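As a toy illustration of what that 20% means (made-up numbers, not from any real structure), the conventional R-factor is just the summed absolute Fobs-vs-Fcalc disagreement, normalized by the total Fobs:

```python
# Toy numbers, purely illustrative -- not from any real dataset.
# Conventional crystallographic R-factor:
#   R = sum(|Fobs - Fcalc|) / sum(Fobs)
f_obs  = [120.0, 85.0, 40.0, 210.0, 15.0]
f_calc = [100.0, 95.0, 33.0, 180.0, 20.0]

r = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_calc)) / sum(f_obs)
print(f"R = {100 * r:.1f}%")  # R = 15.3%
```

Against a ~3% measurement reproducibility, an R in the teens-to-twenties is almost entirely model error, not noise.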
2) scale factors
We throw a lot of things into "scale factors", including sample
absorption, shutter timing errors, radiation damage, flicker in the
incident beam, vibrating crystals, phosphor thickness, point-spread
variations, and many other phenomena. Do we understand the physics
behind them? Yes (mostly). Is there "new biology" to be had by
modelling them more accurately? No. Unless, of course, you count all
the structures we have not solved yet.
Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and
other "native" elements actually worked? You wouldn't have to grow
SeMet protein anymore, and you could go after systems that don't express
well in E. coli. Perhaps even going to the native source! I think
there is plenty of "new biology" to be had there. Wouldn't it be nice
if you could do S-SAD even though your spots were all smeary and
overlapped and mosaic and radiation damaged?
Why don't we do this now? Simple: it doesn't work. Why doesn't it
work? Because we don't know all the "scale factors" accurately enough.
In most cases, the "% error" from all the scale factors adds up
to ~3% (aka Rmerge, Rpim etc.), but the change in spot intensities due
to native element anomalous scattering is usually less than 1%.
Currently, the world record for smallest Bijvoet ratio is ~0.5% (Wang et
al. 2006), but if photon-counting were the only source of error, we
should be able to get Rmerge of ~0.1% or less, particularly in the
low-angle resolution bins. If we can do that, then there will be little
need for SeMet anymore.
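To put a number on that photon-counting limit (a back-of-the-envelope sketch, assuming pure Poisson statistics and no other error sources), the fractional error on a merged intensity built from N detected photons is 1/sqrt(N), so the required photon budget scales as 1/error^2:

```python
# Back-of-the-envelope Poisson budget: sigma(I)/I = 1/sqrt(N) for
# N detected photons, so N needed for a target fractional error
# scales as 1/error^2.  Assumes shot noise is the ONLY error source.
def photons_for_fractional_error(frac):
    """Photons needed in a merged intensity for sigma(I)/I = frac."""
    return 1.0 / frac ** 2

print(photons_for_fractional_error(0.03))   # ~3% (typical Rmerge): ~1.1e3 photons
print(photons_for_fractional_error(0.001))  # 0.1% target: 1e6 photons
```

So reaching 0.1% is not photon-starved at a modern source; the ~30x gap between where we are and where counting statistics says we could be is all "scale factors."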
But, we need the "raw" images if we are to have any hope of figuring out
how to get the errors down to the 0.1% level. There is no one magic
dataset that will tell us how to do this; we need to "average over" lots
of them. Yes, this is further "upstream" of the "new biology" than
deposited Fs, and yes the cost of archiving images is higher, but I
think the potential benefits to the structural biology community if we
can crack the 0.1% S-SAD barrier are nothing short of revolutionary.
-James Holton
MAD Scientist
On 11/1/2011 8:32 AM, Anastassis Perrakis wrote:
> Dear Gerard
>
> Isolating your main points:
>
>> but there would have been no PDB-REDO because the
>> data for running it would simply not have been available! ;-) . Or do
>> you
>> think the parallel does not apply?
> ...
>> have thought, some value. From the perspective of your message, then,
>> why
>> are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
>> chance of measuring up to them?
>
> I was thinking of the inconsistency while sending my previous email
> ... ;-)
>
> Basically, the parallel does apply. PDB-REPROCESS in a few years would
> be really fantastic - speaking as a crystallographer and methods
> developer.
>
> Speaking as a structural biologist though, I did think long and hard
> about
> the usefulness of PDB_REDO. I obviously decided it's useful, since I am now
> heavily involved in it for a few reasons, like uniformity of final
> model treatment,
> improving refinement software, better statistics on structure quality
> metrics,
> and of course seeing if the new models will change our understanding of
> the biology of the system.
>
> An experiment that I would like to do as a structural biologist - is
> the following:
> What about adding an "increasing noise" model to the Fobs's of a few
> datasets and re-refining?
> How much would that noise change the final model and its quality
> metrics, in absolute terms?
>
> (for the changes that PDB_RE(BUILD) does have a preview at
> http://www.ncbi.nlm.nih.gov/pubmed/22034521
> ....I tried to avoid the shamelessly self-promoting plug, but couldn't
> resist in the end!)
>
> That experiment - or a better-designed variant of it - would maybe
> tell us if we should be advocating the archive of all images,
> and being scientifically convinced of the importance of that beyond
> methods development, we would all argue a strong case
> to the funding and hosting agencies.
>
> Tassos
>
> PS Of course, that does not negate the all-important argument that,
> when struggling with marginal data, better processing software is
> essential. There is a clear need for better software to process
> images, especially for low-resolution and low signal/noise cases.
> Since that is dependent on having test data I am all for supporting an
> initiative to collect such data,
> and I would gladly spend a day digging our archives to contribute.
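The "increasing noise" experiment described in the quoted message could be sketched as follows (a minimal illustration with made-up numbers: Gaussian noise scaled to each reflection's SIGF is one plausible noise model, and the re-refinement step itself, which would use a real refinement program, is not shown):

```python
import random

def add_noise_to_fobs(f_obs, sig_f, scale, seed=0):
    """Return Fobs perturbed by Gaussian noise of width scale * SIGF.

    scale = 0 reproduces the input unchanged; increasing scale degrades
    the data progressively.  Negative amplitudes are clamped to zero,
    since |F| cannot be negative.  Purely a sketch of one noise model.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [max(0.0, f + scale * s * rng.gauss(0.0, 1.0))
            for f, s in zip(f_obs, sig_f)]

# Hypothetical (Fobs, SIGF) pairs -- made-up numbers.
f_obs = [120.0, 85.0, 40.0, 210.0, 15.0]
sig_f = [2.0, 1.5, 1.0, 3.0, 0.8]

for scale in (0.0, 0.5, 1.0, 2.0):
    noisy = add_noise_to_fobs(f_obs, sig_f, scale)
    print(scale, [round(f, 1) for f in noisy])
# Each noisy set would then be written back out and re-refined,
# tracking how R/Rfree and geometry metrics degrade with added noise.
```

The interesting readout is the slope: how fast model-quality metrics respond to known, injected error in Fobs.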