You do also always have to consider why you are doing this calculation -
usually to satisfy a sceptical and possibly ill-informed referee. A major
reason for doing this is to justify including an outer resolution shell of
data (see this BB passim), and for this I have come to prefer the random
half-dataset correlation coefficient in shells. A CC has a more
straightforward distribution than an R-factor (though not entirely without
problems). It is independent of the SD estimates, and easy to understand.
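
In case it helps, here is a minimal sketch of that half-dataset CC for one
resolution shell in Python (the input layout and function name are purely
illustrative, not taken from any particular program):

    import random
    import statistics

    def half_dataset_cc(shell_obs, seed=0):
        # shell_obs: dict mapping each unique hkl in the shell to its list
        # of unmerged intensity observations (hypothetical input layout)
        rng = random.Random(seed)
        half1, half2 = [], []
        for obs in shell_obs.values():
            if len(obs) < 2:        # need at least one observation per half
                continue
            obs = list(obs)
            rng.shuffle(obs)
            mid = len(obs) // 2
            half1.append(statistics.fmean(obs[:mid]))
            half2.append(statistics.fmean(obs[mid:]))
        # Pearson correlation of the two sets of half-dataset means
        # (statistics.correlation needs Python 3.10+)
        return statistics.correlation(half1, half2)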

Phil

>
> The fundamental mathematical problem of using an R statistic on data
> with I/sigma(I) < 3 is that the assumption that the fractional deviates
> (I-<I>)/<I> obey a Gaussian distribution breaks down.  And when that
> happens, the R calculation itself becomes unstable, giving essentially
> random R values.  Therefore, including weak data in R calculations is
> equivalent to calculating R with a 3-sigma cutoff, and then adding a
> random number to the R value.  Now, "random" data is one thing, but if
> the statistic used to evaluate the data quality is itself random, then
> it is not what I would call "useful".
>
> Since I am not very good at math, I always find myself approaching
> statistics by generating long lists of random numbers, manipulating them
> in some way, and then graphing the results.  For graphing Rmerge vs
> I/sigma(I), one does find that "Bernhard's rule" of
> Rmerge = 0.8/(I/sigma(I)) generally applies, but only for I/sigma(I) >= 3.
> It gets better with high multiplicity, but even with m=100, the Rmerge
> values for the I/sigma(I) < 1 points are all over the place.  This is
> true even if you average the value of Rmerge over a million random
> number "seeds".  In fact, one must do so much averaging, that I start to
> worry about the low-order bits of common random number generators.  I
> have attached images of these "Rmerge vs I/sigma" graphs.  The error
> bars reflect the rms deviation from the average of a large number of
> "Rmerge" values (different random number seeds).  The "missing" values
> are actually points where the average Rmerge in 600000 trials (m=3) was
> still negative.
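>
> For concreteness, here is a minimal Python sketch of that kind of
> simulation (the bin size and seed below are arbitrary choices, and this
> is not the script behind the attached graphs):
>
>   import random
>
>   def toy_rmerge(true_i_over_sigma, multiplicity, n_refl=1000, seed=0):
>       # One resolution "bin" of n_refl reflections, all with the same
>       # true intensity (in units of sigma) and unit Gaussian noise.
>       rng = random.Random(seed)
>       num = den = 0.0
>       for _ in range(n_refl):
>           obs = [rng.gauss(true_i_over_sigma, 1.0)
>                  for _ in range(multiplicity)]
>           mean_i = sum(obs) / len(obs)
>           num += sum(abs(i - mean_i) for i in obs)
>           den += sum(obs)
>       return num / den
>
>   for ios in (0.5, 1, 2, 3, 5, 10):
>       # compare to Bernhard's rule, Rmerge ~ 0.8/(I/sigma)
>       print(ios, toy_rmerge(ios, multiplicity=3), 0.8 / ios)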
>
> The reason for this "noisy R factor problem" becomes clear if you
> consider the limiting case where the "true" intensity is zero, and make
> a histogram of ( I - <I> )/<I>.  It is not a Gaussian.  Rather, it is
> the Gaussian's evil stepsister: the Lorentzian (or "Cauchy
> distribution").  This distribution may look a lot like a Gaussian, but
> it has longer tails, and these tails give it the weird statistical
> property of having an undefined mean value.  This is counterintuitive,
> because you can clearly just look at the histogram and see that it has a
> central peak (at zero), but if you generate a million
> Lorentzian-distributed random numbers and take the average value, you
> will not get anything close to zero.  Try it!  You can generate a
> Lorentzian deviate from a uniform deviate like this:
> tan(pi*(rand()-0.5)), where "rand()" makes a random number from 0 to 1.
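>
> In Python, that "try it" looks like this (just a quick sketch; the
> sample size and seed are arbitrary):
>
>   import math
>   import random
>
>   rng = random.Random(1)
>   x = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(1000000)]
>   # the average of a million Cauchy deviates is typically nowhere near
>   # zero, and it changes wildly from one seed to the next
>   print(sum(x) / len(x))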
>
> Now, it is not too hard to understand how R could "blow up" when the
> "true" spot intensities are all zero.  After all, as <I> approaches
> zero, the ratio ( I - <I> ) / <I> approaches a divide-by-zero problem.
> But what about when I/sigma(I) = 1?  Or 2?  If you look at these
> histograms, you find that they are a "cross" between a Gaussian and a
> Lorentzian (the so-called "Voigt function"), and the histogram does not
> become truly "Gaussian-looking" until I/sigma(I) = 3.  At this point,
> the R factor follows "Bernhard's rule" quite well, even with
> multiplicities as low as 2 or 3.  This was the moment when I realized
> that the "early crystallographers" who first decided to use this 3-sigma
> cutoff were smarter than I am.
>
> Now, you can make a Voigt function (or even a Lorentzian) look more like
> a Gaussian by doing something called "outlier rejection", but it is hard
> to rationalize why the "outliers" are being rejected.  Especially in a
> simulation!  Then again, the silly part of all this is that all we really
> want is the "middle" of the histogram of ( I - <I> )/<I>.  In fact, if
> you just pick the "most common" Rmerge, you would get a much better
> estimate of the "true Rmerge" in a given resolution bin than you would
> by averaging a hundred times more data.  Such procedures are called
> "robust estimators" in statistics, and the "robust estimator"
> equivalents to the average and the "rms deviation from the average" are
> the median and the "median absolute deviation from the median".  If you
> make a list of Lorentzian-random numbers as above, and compute the
> median, you will get a value very close to zero, even with modest
> multiplicity!  And the "median absolute deviation from the median"
> rapidly converges to 1, which matches the "half width at half maximum"
> of the histogram quite nicely.
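>
> A sketch of that in Python (the sample sizes below are arbitrary):
>
>   import math
>   import random
>   import statistics
>
>   def cauchy_sample(n, seed=0):
>       rng = random.Random(seed)
>       return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
>
>   for n in (3, 10, 100, 10000):
>       x = cauchy_sample(n)
>       med = statistics.median(x)
>       mad = statistics.median(abs(v - med) for v in x)
>       # the median stays near 0 and the MAD settles near 1 (the half
>       # width at half maximum of the Lorentzian)
>       print(n, med, mad)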
>
> So, what are the practical implications of this?  Perhaps instead of the
> average Rmerge in each bin we should be looking at the median Rmerge?
> This will be the same as the average for the cases where I/sigma(I) > 3,
> but still be "well behaved" for the bins that contain only weak data.
> The last two graphs attached compare the average (red) Rmerge vs the
> median (blue).  These were computed for the same data, and clearly the
> median Rmerge is a lot more stable.  The blue error bars are the median
> absolute deviation from the median.  True, the median Rmerge curves down
> a bit as I/sigma approaches zero at m=3, but at least it doesn't go
> negative!
>
> Just a suggestion.
>
> -James Holton
> MAD Scientist
>
> On 3/9/2011 8:33 AM, Graeme Winter wrote:
>> Hi James,
>>
>> May I just offer a short counter-argument to your case for not
>> including weak reflections in the merging residuals?
>>
>> Unlike many people I rather like Rmerge, not because it tells you how
>> good the data are, but because it gives you a clue as to how well the
>> unmerged measurements agree with one another. It's already been
>> mentioned on this thread that Rmerge is ~ 0.8 / <I/sigma>, which means
>> that the inverse is also true - an Rmerge of 0.8 indicates that the
>> average measurement in the shell has an I/sigma of ~ 1 (presuming
>> there are sufficient multiple measurements - if the multiplicity is <
>> 3 or so this can be nonsense).
>>
>> This does not depend on the error model or the multiplicity. It just
>> talks about the average. Now, if we exclude all measurements with an
>> I/sigma of less than three we have no idea of how strong the
>> reflections in the shell are on average. We're just top-slicing the
>> good reflections and asking how well they agree. Well, with an I/sigma
>> > 3 I would hope they agree rather well if your error model is
>> reasonable. It would suddenly become rare to see an Rmerge > 0.3 in
>> the outer shell.
>>
>> I like Rpim. It tells you how good the average measurement should be
>> provided you do not have too much radiation damage. However, without
>> Rmerge I can't get a real handle on how well the measurements agree.
>>
>> Personally, what I would like to see is the full contents of the Scala
>> log file available as graphs along with Rd from xdsstat and some other
>> choice statistics so you can get a relatively complete picture;
>> however, I appreciate that this is unrealistic :o)
>>
>> Just my 2c.
>>
>> Cheerio,
>>
>> Graeme
>>
>> On 8 March 2011 20:07, James Holton <[log in to unmask]> wrote:
>>
>>     Although George does not mention anything about data reduction
>>     programs, I take from his description that common small-molecule
>>     data processing packages (SAINT, others?) have also been
>>     modernized to record all data (no I/sigmaI > 2 or 3 cutoff).  I
>>     agree with him that this is a good thing!  And it is also a good
>>     thing that small-molecule refinement programs use all data.  I
>>     just don't think it is a good idea to use all data in R factor
>>     calculations.
>>
>>     Like Ron, I will probably be dating myself when I say that when I
>>     first got into the macromolecular crystallography business, it was
>>     still commonplace to use a 2-3 sigma spot intensity cutoff.  In
>>     fact, this is the reason why the PDB wants to know your
>>     "completeness" in the outermost resolution shell (in those days,
>>     the outer resolution was defined by where completeness drops to
>>     ~80% after the 3 sigma spot cutoff).  My experience with this,
>>     however, was brief, as the maximum-likelihood revolution was just
>>     starting to take hold, and the denzo manual specifically stated
>>     that only bad people use sigma cutoffs > -3.0.  Nevertheless, like
>>     many crystallographers from this era, I have fond memories of the
>>     REALLY low R factors you can get by using this arcane and now
>>     reviled practice.  Rsym values of 1-2% were common.
>>
>>     It was only recently that I learned enough about statistics to
>>     understand the wisdom of my ancestors and that a 3-sigma cutoff is
>>     actually the "right thing to do" if you want to measure a
>>     fractional error (like an R factor).  That is all I'm saying.
>>
>>     -James Holton
>>     MAD Scientist
>>
>>
>>     On 3/6/2011 2:50 PM, Ronald E Stenkamp wrote:
>>
>>         My small molecule experience is old enough (maybe 20 years)
>>         that I doubt if it's even close to representing current
>>         practices (best or otherwise).  Given George's comments, I
>>         suspect (and hope) that less-than cutoffs are historical
>>         artifacts at this point, kept around in software for making
>>         comparisons with older structure determinations.  But a bit of
>>         scanning of Acta papers and others might be necessary to
>>         confirm that.  Ron
>>
>>
>>         On Sun, 6 Mar 2011, James Holton wrote:
>>
>>
>>             Yes, I would classify anything with I/sigmaI < 3 as
>>             "weak".  And yes, of course it is possible to get "weak"
>>             spots from small molecule crystals. After all, there is no
>>             spot so "strong" that it cannot be defeated by a
>>             sufficient amount of background!  I just meant that,
>>             relatively speaking, the intensities diffracted from a
>>             small molecule crystal are orders of magnitude brighter
>>             than those from a macromolecular crystal of the same size,
>>             and even the same quality (the 1/Vcell^2 term in Darwin's
>>             formula).
>>
>>             I find it interesting that you point out the use of a 2
>>             sigma(I) intensity cutoff for small molecule data sets!
>>              Is this still common practice?  I am not a card-carrying
>>             "small molecule crystallographer", so I'm not sure.
>>             However, if that is the case, then by definition there are
>>             no "weak" intensities in the data set.  And this is
>>             exactly the kind of data you want for least-squares
>>             refinement targets and computing "% error" quality metrics
>>             like R factors.  For likelihood targets, however, the
>>             "weak" data are actually a powerful restraint.
>>
>>             -James Holton
>>             MAD Scientist
>>
>>             On 3/6/2011 11:22 AM, Ronald E Stenkamp wrote:
>>
>>                 Could you please expand on your statement that
>>                 "small-molecule data has essentially no weak spots."?
>>                  The small molecule data sets I've worked with have
>>                 had large numbers of "unobserved" reflections where I
>>                 used 2 sigma(I) cutoffs (maybe 15-30% of the
>>                 reflections).  Would you consider those "weak" spots
>>                 or not?  Ron
>>
>>                 On Sun, 6 Mar 2011, James Holton wrote:
>>
>>                     I should probably admit that I might be indirectly
>>                     responsible for the resurgence of this I/sigma > 3
>>                     idea, but I never intended this in the way
>>                     described by the original poster's reviewer!
>>
>>                     What I have been trying to encourage people to do
>>                     is calculate R factors using only hkls for which
>>                     the signal-to-noise ratio is > 3.  Not refinement!
>>                     Refinement should be done against all data.  I
>>                     merely propose that weak data be excluded from
>>                     R-factor calculations after the
>>                     refinement/scaling/merging/etc. is done.
>>
>>                     This is because R factors are a metric of the
>>                     FRACTIONAL error in something (aka a "%
>>                     difference"), but a "% error" is only meaningful
>>                     when the thing being measured is not zero.
>>                      However, in macromolecular crystallography, we
>>                     tend to measure a lot of "zeroes".  There is
>>                     nothing wrong with measuring zero!  An excellent
>>                     example of this is confirming that a systematic
>>                     absence is in fact "absent".  The "sigma" on the
>>                     intensity assigned to an absent spot is still a
>>                     useful quantity, because it reflects how confident
>>                     you are in the measurement.  I.e., a sigma of "10"
>>                     vs "100" means you are more sure that the
>>                     intensity is zero. However, there is no "R factor"
>>                     for systematic absences. How could there be?  This
>>                     is because the definition of "% error" starts to
>>                     break down as the "true" spot intensity gets
>>                     weaker, and it becomes completely meaningless when
>>                     the "true" intensity reaches zero.
>>
>>                     Historically, I believe the widespread use of R
>>                     factors came about because small-molecule data has
>>                     essentially no weak spots.  With the exception of
>>                     absences (which are not used in refinement), spots
>>                     from "salt crystals" are strong all the way out to
>>                     the edge of the detector (even out to the "limiting
>>                     sphere", which is defined by the x-ray
>>                     wavelength).  So, when all the data are strong, a
>>                     "% error" is an easy-to-calculate quantity that
>>                     actually describes the "sigma"s of the data very
>>                     well.  That is, sigma(I) of strong spots tends to
>>                     be dominated by things like beam flicker, spindle
>>                     stability, shutter accuracy, etc.  All these
>>                     usually add up to ~5% error, and indeed even the
>>                     Braggs could typically get +/-5% for the intensity
>>                     of the diffracted rays they were measuring.
>>                      Things like Rsym were therefore created to check
>>                     that nothing "funny" happened in the measurement.
>>
>>                     For similar reasons, the quality of a model
>>                     refined against all-strong data is described very
>>                     well by a "% error", and this is why the
>>                     refinement R factors rapidly became popular.  Most
>>                     people intuitively know what you mean if you say
>>                     that your model fits the data to "within 5%".  In
>>                     fact, a widely used criterion for the correctness
>>                     of a "small molecule" structure is that the
>>                     refinement R factor must be LOWER than Rsym.  This
>>                     is equivalent to saying that your curve (model)
>>                     fit your data "to within experimental error".
>>                     Unfortunately, this has never been the case for
>>                     macromolecular structures!
>>
>>                     The problem with protein crystals, of course, is
>>                     that we have lots of "weak" data.  And by "weak",
>>                     I don't mean "bad"!  Yes, it is always nicer to
>>                     have more intense spots, but there is nothing
>>                     shameful about knowing that certain intensities
>>                     are actually very close to zero.  In fact, from
>>                     the point of view of the refinement program, isn't
>>                     describing some high-angle spot as: "zero, plus or
>>                     minus 10", better than "I have no idea"?   Indeed,
>>                     several works mentioned already as well as the
>>                     "free lunch algorithm" have demonstrated that
>>                     these "zero" data can actually be useful, even if
>>                     they are well beyond the "resolution limit".
>>
>>                     So, what do we do?  I see no reason to abandon R
>>                     factors, since they have such a long history and
>>                     give us continuity of criteria going back almost a
>>                     century.  However, I also see no reason to punish
>>                     ourselves by including lots of zeroes in the
>>                     denominator.  In fact, using weak data in an R
>>                     factor calculation defeats their best feature.  R
>>                     factors are a very good estimate of the fractional
>>                     component of the total error, provided they are
>>                     calculated with strong data only.
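>>
>>                     A bare-bones sketch of what that looks like (the
>>                     input layout and names here are illustrative only):
>>
>>                       def r_factor_strong(refls, cutoff=3.0):
>>                           # refls: list of (observations, sigma_of_mean)
>>                           # pairs for one resolution bin (hypothetical
>>                           # layout).  Everything gets merged, but only
>>                           # reflections with <I>/sigma(<I>) > cutoff
>>                           # enter the R-factor sums.
>>                           num = den = 0.0
>>                           for obs, sig_mean in refls:
>>                               mean_i = sum(obs) / len(obs)
>>                               if mean_i / sig_mean <= cutoff:
>>                                   continue    # weak: skip for R only
>>                               num += sum(abs(i - mean_i) for i in obs)
>>                               den += sum(obs)
>>                           return num / den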
>>
>>                     Of course, with strong and weak data, the best
>>                     thing to do is compare the model-data disagreement
>>                     with the magnitude of the error.  That is, compare
>>                     |Fobs-Fcalc| to sigma(Fobs), not Fobs itself.
>>                      Modern refinement programs do this!  And I say
>>                     the more data the merrier.
>>
>>
>>                     -James Holton
>>                     MAD Scientist
>>
>>
>>                     On 3/4/2011 5:15 AM, Marjolein Thunnissen wrote:
>>
>>                         hi
>>
>>                         Recently on a paper I submitted, it was the
>>                         editor of the journal who wanted exactly the
>>                         same thing. I never argued with the editor
>>                         about this (should have maybe), but it could
>>                         be one cause of the epidemic that Bart Hazes
>>                         saw....
>>
>>
>>                         best regards
>>
>>                         Marjolein
>>
>>                         On Mar 3, 2011, at 12:29 PM, Roberto
>>                         Battistutta wrote:
>>
>>                             Dear all,
>>                             I got a reviewer comment that indicates the
>>                             "need to refine the structures at an
>>                             appropriate resolution (I/sigmaI of >3.0),
>>                             and re-submit the revised coordinate files
>>                             to the PDB for validation.". In the
>>                             manuscript I present some crystal
>>                             structures determined by molecular
>>                             replacement using the same protein in a
>>                             different space group as search model.
>>                             Does anyone know the origin or the
>>                             theoretical basis of this "I/sigmaI>3.0"
>>                             rule for an appropriate resolution?
>>                             Thanks,
>>                             Bye,
>>                             Roberto.
>>
>>
>>                             Roberto Battistutta
>>                             Associate Professor
>>                             Department of Chemistry
>>                             University of Padua
>>                             via Marzolo 1, 35131 Padova - ITALY
>>                             tel. +39.049.8275265/67
>>                             fax. +39.049.8275239
>>                             [log in to unmask]
>>                             www.chimica.unipd.it/roberto.battistutta/
>>                             VIMM (Venetian Institute of Molecular
>>                             Medicine)
>>                             via Orus 2, 35129 Padova - ITALY
>>                             tel. +39.049.7923236
>>                             fax +39.049.7923250
>>                             www.vimm.it
>>
>>
>>
>>
>>
>>
>>
>
>