James,
I'm not sure you're completely right here - it's reasonably
straightforward to show that
Rmerge ~ 0.7979 / (<I/sigma>)
(Weiss & Hilgenfeld, J. Appl. Cryst 1997) which can be verified from
e.g. the Scala log file, provided that the *unmerged* I/sigma is
considered:
http://www.ccp4.ac.uk/xia/rmerge.jpg
This example did not exhibit much radiation damage so it does
represent a best case.
For (unmerged) I/sigma < 1 the statistics do tend to become
unreliable, which I found was best demonstrated by inspection of the
E^4 plot - up to I/sigma ~ 1 it was ~ 2, but increased substantially
thereafter. This I had assumed represented the fact that the
"intensities" were drawn from a gaussian distribution with low I/sigma
rather than the exponential (WIlson) distribution which would be
expected for intensities.
By repeatedly selecting small random subsets* of unique reflections in
the example data set and merging them separately, I found that the
"error" on the Rmerge above for the weakest reflections was about
0.05. Since this retains the same multiplicity and the mean value
converges on the complete data set statistics, I believe that the
comparisons are valid.
I guess I "don't believe you" :o)
Best,
Graeme
* CCTBX is awesome for this kind of thing!
2009/7/15 James Holton <[log in to unmask]>:
> Actually, if I/sd < 3, Rmerge, Rpim, Rrim, etc. are all infinity. Doesn't
> matter what your redundancy is.
>
> Don't believe me? Try it.
> The extreme case is I/sd = 0, and as long as there is some background (and,
> let's face it, there always is), the "observed" spot intensity will be
> equally likely to be positive or negative, with a (basically) Gaussian
> distribution.
> So, if you generate say, ten Gaussian-random numbers (centered on zero),
> take their average value <I>, compute the average deviation from that
> average <|I-<I>|>, and then divide <|I-<I>|>/<I>, you will get the "Rmerge"
> expected for I/sd = 0 at a redundancy of 10. Problem is, if you do this
> again with a different random number seed, you will get a very different
> Rmerge. Even if you do it with a million different random number seeds and
> compute the "average Rmerge", you will always get wildly different values.
> Some positive, some negative. And it doesn't matter how many "data points"
> you use to compute the Rmerge: averaging a million Rmerge values will give a
> different answer than averaging a million and one.
>
> The reason for this numerical instability is because both <I> and <|I-<I>|>
> follow a Gaussian distribution that is centered at zero, and the ratio of
> two numbers like this has a Lorentzian distribution. The Lorentzian looks a
> lot like a Gaussian, but has much fatter tails. Fat enough so that the
> Lorentzian distribution has NO MEAN VALUE. Seriously. It is hard to
> believe that the average value of something that is equally likely to be
> positive or negative could be anything but zero, but for all practical
> purposes you can never arrive at the average value of something with a
> Lorentzian distribution. At least not by taking finite samples. So, no
> matter what the redundancy, you will always get a different Rmerge.
>
> However, if <I> is not centered on zero (I/sd > 0), then the ratio of the
> two Gaussian-random numbers starts to look like a Gaussian itself, and this
> distribution does have a mean value (Rmerge will be "reproducible").
> However, this does not happen all at once. The tails start to shrink as
> I/sd = 1, they are even smaller at I/sd = 2, and the distribution finally
> looses all "Lorentzian character" when I/sd >= 3. Only then is Rmerge a
> meaningful quantity.
>
> So, perhaps our "forefathers" who first instituted the practice of a 3-sigma
> cutoff for all intensities actually DID know what they were doing! All R-
> statistics (including Rcryst and Rfree) are unstable in this way for weak
> data, but sometime in the early 1990s the practice of computing R-factors on
> "all data" crept into the field. I'm not saying we should not use all data,
> maximum likelihood refinement uses sigmas properly and "weak" data are
> powerful restraints. However, I will go on record as suggesting that a
> 3-sigma cutoff should be used for all R statistics. There is still a place
> in your PDB file to put the sigma cutoff you used for your R factors.
>
> -James Holton
> MAD Scientist
>
>
> Lijun Liu wrote:
>>
>> Hi Frank,
>>
>> Off from the original topic but important to clarify. If I misled the
>> concepts, I apologize.
>>
>>> Outer shell Rmerge will always be very high:
>>
>> ----------
>> True! Especially when I/Sig ~ 1 or less.
>>
>>> Only I/sigI (and completeness, although it's related) is really relevant
>>> for deciding high resolution cutoff.
>>
>> ---------
>> Normally I use I/Sig = 2.0 for res-cut-off. For this "accuracy"---please
>> do not ask me the exact meaning of Sig(too many contributed this including
>> hardware, software, protocol, strategies,...), the average measuring error
>> for reflections could be expected to the inversion of this number, 1/2.0,
>> i.e. 50%, which in general suggests that the Rmerge should not pass much
>> this value to make the inclusion of the data meaningful. (Please read this
>> carefully since I do not want to confuse two different concepts). Or you
>> are merging data with merging error much larger than the data measuring
>> error. Although the estimation of Sig(I) is difficult and Sig(I) itself may
>> be of large error, when I/sig ~ 3, 70% seems still to be too high to accept.
>>
>> Rmerge is well known to be a weak indicator, but it is not just a
>> mathematical issue, and never a crap. It should be used with others (I/S,
>> red, ...). I agree with Ian that all data should be included, if the
>> quality is guaranteed.
>>
>> I did not comb the history of refinement softwares and their philosophy,
>> but today it seems all the prevailing ref-packages use resolution bins for
>> shelling (I know there has been enough theoretical ground to to so), which
>> is the source of RESOLUTION CUTOFF and some problems arisen from RESOLUTION
>> CUTOFF for example the Rmerge issue. I appreciate to be told if some
>> softwares had ever used I, I/SigI, F, F/SigF or something else for binning,
>> especially in the early time for refinement package development. RESOLUTION
>> BINNING might not be a have-to? :D
>>
>> Best regards.
>>
>> Lijun Liu, PhD
>> http://www.uoregon.edu/~liulj/ <http://www.uoregon.edu/%7Eliulj/>
>>
>>
>>
>>
>>
>>
>
|