James,
Graeme is right. While <I> does indeed (approximately) follow a
Gaussian, <|I-<I>|> cannot. The absolute value operator keeps it
positive (it reflects the negative half across the origin), so it
follows a half-Gaussian, whose mean cannot be zero unless the
variance is zero. For standard normals (variance = 1), the mean of
<|I-<I>|> is 0.798, just as Graeme said; you can do the integration
yourself. So the fact that <|I-<I>|>/<I> is unstable at low I/sigma
is *not* a consequence of the peculiar divergent properties of a
Cauchy (Lorentzian). Rather, it is a consequence of the expected
value of the denominator, E(<I>), being zero. And, as your
calculator knows, division by zero is undefined (or infinite,
depending on your proclivities).
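
If you would rather let a computer do the integral, a quick numpy
check gives the same number (just a sketch, assuming numpy is at
hand):

import numpy as np

# E(|X|) for X ~ N(0,1) is sqrt(2/pi)
print(np.sqrt(2.0 / np.pi))                          # 0.7979...

# a Monte Carlo estimate agrees
x = np.random.default_rng(0).standard_normal(10**7)
print(np.abs(x).mean())                              # ~0.798
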
Cheers,
Douglas
On Jul 15, 2009, at 5:03 PM, James Holton wrote:
> I tried plugging I/sigma = 0 into your formula below, but my
> calculator returned "EEEEEEEE"
>
> -James Holton
> MAD Scientist
>
> Graeme Winter wrote:
>> James,
>>
>> I'm not sure you're completely right here - it's reasonably
>> straightforward to show that
>>
>> Rmerge ~ 0.7979 / (<I/sigma>)
>>
>> (Weiss & Hilgenfeld, J. Appl. Cryst., 1997), which can be verified
>> from e.g. the Scala log file, provided that the *unmerged* I/sigma
>> is considered:
>>
>> http://www.ccp4.ac.uk/xia/rmerge.jpg
>>
>> This example did not exhibit much radiation damage so it does
>> represent a best case.
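>>
>> If you want to see where the 0.7979 comes from, a toy numpy
>> simulation reproduces it (a sketch only, with made-up multiplicity
>> and reflection counts and a constant true intensity - not what
>> Scala actually does):
>>
>> import numpy as np
>>
>> rng = np.random.default_rng(0)
>> m, n = 20, 10000              # multiplicity, unique reflections
>> i_over_sig = 10.0             # unmerged <I/sigma>, with sigma = 1
>> obs = i_over_sig + rng.standard_normal((n, m))
>>
>> dev = np.abs(obs - obs.mean(axis=1, keepdims=True))
>> print(dev.sum() / obs.sum())  # Rmerge, ~0.078
>> print(0.7979 / i_over_sig)    # ~0.080; the small gap is the
>>                               # finite-multiplicity sqrt((m-1)/m)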
>>
>> For (unmerged) I/sigma < 1 the statistics do tend to become
>> unreliable, which I found was best demonstrated by inspection of
>> the E^4 plot - for I/sigma down to ~ 1 it was ~ 2, but it increased
>> substantially for weaker data. This I had assumed represented the
>> fact that the "intensities" were drawn from a Gaussian distribution
>> with low I/sigma rather than the exponential (Wilson) distribution
>> which would be expected for intensities.
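>>
>> That inflation is easy to reproduce numerically (a numpy sketch,
>> with an invented I/sigma of 0.5 standing in for real weak data):
>>
>> import numpy as np
>>
>> rng = np.random.default_rng(0)
>>
>> # acentric Wilson intensities are exponential, so
>> # <E^4> = <I^2>/<I>^2 = 2
>> i_wilson = rng.exponential(size=10**6)
>> print(np.mean(i_wilson**2) / np.mean(i_wilson)**2)  # ~2.0
>>
>> # weak "intensities": Gaussian noise about a small mean
>> i_noise = 0.5 + rng.standard_normal(10**6)
>> print(np.mean(i_noise**2) / np.mean(i_noise)**2)    # ~5, inflated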
>>
>> By repeatedly selecting small random subsets* of unique reflections
>> in the example data set and merging them separately, I found that
>> the "error" on the Rmerge above for the weakest reflections was
>> about 0.05. Since this retains the same multiplicity and the mean
>> value converges on the complete data set statistics, I believe that
>> the comparisons are valid.
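>>
>> In case it helps, this is roughly the shape of that experiment in
>> plain numpy (invented numbers, standing in for the cctbx version):
>>
>> import numpy as np
>>
>> rng = np.random.default_rng(0)
>> m = 10                                       # multiplicity, fixed
>> obs = 0.5 + rng.standard_normal((20000, m))  # weak, I/sigma ~ 0.5
>>
>> def rmerge(block):
>>     dev = np.abs(block - block.mean(axis=1, keepdims=True))
>>     return dev.sum() / block.sum()
>>
>> # merge random subsets of 100 unique reflections, then compare
>> subs = [rmerge(obs[rng.choice(len(obs), 100, replace=False)])
>>         for _ in range(500)]
>> print(rmerge(obs), np.mean(subs), np.std(subs))
>>
>> The mean over subsets converges on the full-set value, and the
>> standard deviation is the kind of "error" bar quoted above.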
>>
>> I guess I "don't believe you" :o)
>>
>> Best,
>>
>> Graeme
>>
>>
>>
>> * CCTBX is awesome for this kind of thing!
>>
>> 2009/7/15 James Holton <[log in to unmask]>:
>>
>>> Actually, if I/sd < 3, Rmerge, Rpim, Rrim, etc. are all infinity.
>>> Doesn't matter what your redundancy is.
>>>
>>> Don't believe me? Try it.
>>> The extreme case is I/sd = 0, and as long as there is some
>>> background (and, let's face it, there always is), the "observed"
>>> spot intensity will be equally likely to be positive or negative,
>>> with a (basically) Gaussian distribution.
>>> So, if you generate say, ten Gaussian-random numbers (centered on
>>> zero), take their average value <I>, compute the average deviation
>>> from that average <|I-<I>|>, and then divide <|I-<I>|>/<I>, you
>>> will get the "Rmerge" expected for I/sd = 0 at a redundancy of 10.
>>> Problem is, if you do this again with a different random number
>>> seed, you will get a very different Rmerge. Even if you do it with
>>> a million different random number seeds and compute the "average
>>> Rmerge", you will always get wildly different values. Some
>>> positive, some negative. And it doesn't matter how many "data
>>> points" you use to compute the Rmerge: averaging a million Rmerge
>>> values will give a different answer than averaging a million and
>>> one.
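>>>
>>> In numpy terms, that experiment is just (one possible sketch; the
>>> trial counts are arbitrary):
>>>
>>> import numpy as np
>>>
>>> rng = np.random.default_rng()
>>>
>>> def rmerge_once(multiplicity=10):
>>>     i = rng.standard_normal(multiplicity)  # I/sd = 0: pure noise
>>>     return np.abs(i - i.mean()).mean() / i.mean()
>>>
>>> # the "average Rmerge" never settles down, however many trials
>>> for n in (10**4, 10**5, 10**6):
>>>     print(n, np.mean([rmerge_once() for _ in range(n)]))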
>>>
>>> The reason for this numerical instability is that both <I> and
>>> <|I-<I>|> follow a Gaussian distribution that is centered at zero,
>>> and the ratio of two numbers like this has a Lorentzian
>>> distribution. The Lorentzian looks a lot like a Gaussian, but has
>>> much fatter tails. Fat enough that the Lorentzian distribution has
>>> NO MEAN VALUE. Seriously. It is hard to believe that the average
>>> value of something that is equally likely to be positive or
>>> negative could be anything but zero, but for all practical
>>> purposes you can never arrive at the average value of something
>>> with a Lorentzian distribution. At least not by taking finite
>>> samples. So, no matter what the redundancy, you will always get a
>>> different Rmerge.
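>>>
>>> You can watch the failure to converge directly (a sketch; the
>>> ratio of two independent zero-centered unit Gaussians is a
>>> standard Cauchy, i.e. Lorentzian):
>>>
>>> import numpy as np
>>>
>>> rng = np.random.default_rng()
>>> r = rng.standard_normal(10**7) / rng.standard_normal(10**7)
>>> running = np.cumsum(r) / np.arange(1, r.size + 1)
>>> # the running mean lurches every time an extreme value lands
>>> print(running[[10**3 - 1, 10**5 - 1, 10**7 - 1]])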
>>>
>>> However, if <I> is not centered on zero (I/sd > 0), then the ratio
>>> of the two Gaussian-random numbers starts to look like a Gaussian
>>> itself, and this distribution does have a mean value (Rmerge will
>>> be "reproducible"). However, this does not happen all at once.
>>> The tails start to shrink by I/sd = 1, they are even smaller at
>>> I/sd = 2, and the distribution finally loses all "Lorentzian
>>> character" when I/sd >= 3. Only then is Rmerge a meaningful
>>> quantity.
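>>>
>>> To see the tails shrink, you can tabulate the single-reflection
>>> ratio at a few values of I/sd (again just a sketch: multiplicity
>>> 10 and 100000 trials are my own arbitrary choices):
>>>
>>> import numpy as np
>>>
>>> rng = np.random.default_rng(0)
>>>
>>> def one_rmerge(i_over_sd, multiplicity=10):
>>>     i = i_over_sd + rng.standard_normal(multiplicity)
>>>     return np.abs(i - i.mean()).mean() / i.mean()
>>>
>>> for i_over_sd in (0, 1, 2, 3):
>>>     r = np.array([one_rmerge(i_over_sd) for _ in range(100000)])
>>>     # fraction of wild outliers = leftover "Lorentzian character"
>>>     print(i_over_sd, np.median(r), (np.abs(r) > 10).mean())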
>>>
>>> So, perhaps our "forefathers" who first instituted the practice of
>>> a 3-sigma cutoff for all intensities actually DID know what they
>>> were doing! All R-statistics (including Rcryst and Rfree) are
>>> unstable in this way for weak data, but sometime in the early
>>> 1990s the practice of computing R-factors on "all data" crept into
>>> the field. I'm not saying we should not use all data; maximum
>>> likelihood refinement uses sigmas properly, and "weak" data are
>>> powerful restraints. However, I will go on record as suggesting
>>> that a 3-sigma cutoff should be used for all R statistics. There
>>> is still a place in your PDB file to put the sigma cutoff you used
>>> for your R factors.
>>>
>>> -James Holton
>>> MAD Scientist
>>>
>>>
>>> Lijun Liu wrote:
>>>
>>>> Hi Frank,
>>>>
>>>> This is off the original topic, but important to clarify. If I
>>>> muddled the concepts, I apologize.
>>>>
>>>>
>>>>> Outer shell Rmerge will always be very high:
>>>>>
>>>> ----------
>>>> True! Especially when I/Sig ~ 1 or less.
>>>>
>>>>
>>>>> Only I/sigI (and completeness, although it's related) is really
>>>>> relevant
>>>>> for deciding high resolution cutoff.
>>>>>
>>>> ---------
>>>> Normally I use I/Sig = 2.0 for the resolution cutoff. For this
>>>> "accuracy" (please do not ask me the exact meaning of Sig; too
>>>> many things contribute to it, including hardware, software,
>>>> protocol, strategies, ...), the average measurement error of the
>>>> reflections can be expected to be the inverse of this number,
>>>> 1/2.0, i.e. 50%, which in general suggests that Rmerge should not
>>>> exceed this value by much for inclusion of the data to be
>>>> meaningful. (Please read this carefully, since I do not want to
>>>> confuse two different concepts.) Otherwise you are merging data
>>>> with a merging error much larger than the measurement error of
>>>> the data. Although the estimation of Sig(I) is difficult, and
>>>> Sig(I) itself may carry a large error, when I/Sig ~ 3 an Rmerge
>>>> of 70% still seems too high to accept.
>>>>
>>>> Rmerge is well known to be a weak indicator, but that is not just
>>>> a mathematical issue, and it is by no means useless. It should be
>>>> used together with other indicators (I/Sig, redundancy, ...). I
>>>> agree with Ian that all data should be included, provided the
>>>> quality is guaranteed.
>>>>
>>>> I have not combed through the history of refinement software and
>>>> its philosophy, but today it seems all the prevailing refinement
>>>> packages use resolution bins for shelling (I know there has been
>>>> enough theoretical ground to do so), which is the source of the
>>>> RESOLUTION CUTOFF and of some problems arising from it, for
>>>> example the Rmerge issue. I would appreciate being told if any
>>>> software has ever used I, I/SigI, F, F/SigF or something else for
>>>> binning, especially in the early days of refinement package
>>>> development. RESOLUTION BINNING might not be a must? :D
>>>>
>>>> Best regards.
>>>>
>>>> Lijun Liu, PhD
>>>> http://www.uoregon.edu/~liulj/
>>>>