You also always have to consider why you are doing this calculation -
usually to satisfy a sceptical and possibly ill-informed referee. A major
reason for doing this is to justify including an outer resolution shell of
data (see this BB passim), and for this I have come to prefer the random
half-dataset correlation coefficient in shells. A CC has a more
straightforward distribution than an R-factor (though not entirely without
problems). It is independent of the SD estimates, and easy to understand.
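
If you want to try it, a minimal sketch of the half-dataset CC in shells
could look like the following (Python with numpy here; the hkl ->
(d-spacing, intensities) layout is just an assumption for the example, not
how any particular program stores its data): split each reflection's
observations at random into two halves, average each half, and correlate
the two sets of averages shell by shell.

import numpy as np

def half_dataset_cc(obs, nshells=10, seed=0):
    # obs: dict mapping hkl -> (d_spacing, list of unmerged intensities)
    rng = np.random.default_rng(seed)
    d, half1, half2 = [], [], []
    for hkl, (d_spacing, intensities) in obs.items():
        I = np.asarray(intensities, dtype=float)
        if len(I) < 2:
            continue                      # need at least two observations to split
        pick = rng.permutation(len(I))
        n = len(I) // 2
        half1.append(I[pick[:n]].mean())  # mean of one random half
        half2.append(I[pick[n:]].mean())  # mean of the other half
        d.append(d_spacing)
    d, half1, half2 = map(np.asarray, (d, half1, half2))
    order = np.argsort(1.0 / d**2)        # order from low to high resolution
    for shell in np.array_split(order, nshells):  # equal-count shells
        cc = np.corrcoef(half1[shell], half2[shell])[0, 1]
        print(f"{d[shell].max():5.2f} - {d[shell].min():5.2f} A   CC(1/2) = {cc:5.3f}")
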
Phil
>
> The fundamental mathematical problem of using an R statistic on data
> with I/sigma(I) < 3 is that the assumption that the fractional deviates
> (I-<I>)/<I> obey a Gaussian distribution breaks down. And when that
> happens, the R calculation itself becomes unstable, giving essentially
> random R values. Therefore, including weak data in R calculations is
> equivalent to calculating R with a 3-sigma cutoff, and then adding a
> random number to the R value. Now, "random" data is one thing, but if
> the statistic used to evaluate the data quality is itself random, then
> it is not what I would call "useful".
>
> Since I am not very good at math, I always find myself approaching
> statistics by generating long lists of random numbers, manipulating them
> in some way, and then graphing the results. For graphing Rmerge vs
> I/sigma(I), one does find that "Bernhard's rule" of Rmerge = 0.8/( I/sigma(I) )
> generally applies, but only for I/sigma(I) >= 3.
> It gets better with high multiplicity, but even with m=100, the Rmerge
> values for the I/sigma(I) < 1 points are all over the place. This is
> true even if you average the value of Rmerge over a million random
> number "seeds". In fact, one must do so much averaging that I start to
> worry about the low-order bits of common random number generators. I
> have attached images of these "Rmerge vs I/sigma" graphs. The error
> bars reflect the rms deviation from the average of a large number of
> "Rmerge" values (different random number seeds). The "missing" values
> are actually points where the average Rmerge in 600000 trials (m=3) was
> still negative.
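>
> The simulation behind this kind of graph is only a few lines. Something
> along these lines (a sketch in Python with numpy, not the exact script
> behind the attached figures; the multiplicities and trial counts are just
> knobs to turn) shows the same behaviour:
>
> import numpy as np
>
> rng = np.random.default_rng(1)
>
> def rmerge(i_over_sigma, m, ntrials=100_000):
>     # per-trial Rmerge for one hkl measured m times with the given true I/sigma
>     sigma = 1.0
>     Itrue = i_over_sigma * sigma
>     I = rng.normal(Itrue, sigma, size=(ntrials, m))   # m observations per trial
>     Imean = I.mean(axis=1)
>     return np.abs(I - Imean[:, None]).sum(axis=1) / (m * Imean)
>
> for m in (3, 100):
>     for ios in (0.5, 1, 2, 3, 5, 10):
>         r = rmerge(ios, m)
>         print(f"m={m:3d}  I/sig={ios:4.1f}  mean Rmerge={r.mean():8.3f}  "
>               f"median={np.median(r):6.3f}  0.8/(I/sig)={0.8/ios:5.3f}")
>
> The mean column tracks 0.8/(I/sigma) reasonably well once I/sigma(I) is 3
> or more (and better still at high multiplicity), but below that it is all
> over the place from one run to the next, while the median is at least
> stable from run to run.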
>
> The reason for this "noisy R factor problem" becomes clear if you
> consider the limiting case where the "true" intensity is zero, and make
> a histogram of ( I - <I> )/<I>. It is not a Gaussian. Rather, it is
> the Gaussian's evil stepsister: the Lorentzian (or "Cauchy
> distribution"). This distribution may look a lot like a Gaussian, but
> it has longer tails, and these tails give it the weird statistical
> property of having an undefined mean value. This is counterintuitive,
> because you can clearly just look at the histogram and see that it has a
> central peak (at zero), but if you generate a million
> Lorentzian-distributed random numbers and take the average value, you
> will not get anything close to zero. Try it! You can generate a
> Lorentzian deviate from a uniform deviate like this:
> tan(pi*(rand()-0.5)), where "rand()" makes a random number from 0 to 1.
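>
> In numpy that experiment takes only a few lines (a sketch; the seed and
> sample sizes are arbitrary):
>
> import numpy as np
>
> rng = np.random.default_rng(2)
> for n in (10_000, 1_000_000):
>     u = rng.random(n)                       # uniform deviates on [0, 1)
>     lorentzian = np.tan(np.pi * (u - 0.5))  # Lorentzian (Cauchy) deviates
>     print(f"{n:9d} deviates   average = {lorentzian.mean():8.3f}")
>
> The average is typically of order one, it does not shrink as n grows (the
> mean of n Cauchy deviates is just as scattered as a single one), and a
> different seed lands it somewhere else entirely.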
>
> Now, it is not too hard to understand how R could "blow up" when the
> "true" spot intensities are all zero. After all, as <I> approaches
> zero, the ratio ( I - <I> ) / <I> approaches a divide-by-zero problem.
> But what about when I/sigma(I) = 1? Or 2? If you look at these
> histograms, you find that they are a "cross" between a Gaussian and a
> Lorentzian (the so-called "Voigt function"), and the histogram does not
> become truly "Gaussian-looking" until I/sigma(I) = 3. At this point,
> the R factor obeys "Bernhard's rule" quite well, even with
> multiplicities as low as 2 or 3. This was the moment when I realized
> that the "early crystallographers" who first decided to use this 3-sigma
> cutoff were smarter than I am.
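>
> You can watch the tails fade away in the same sort of simulation (again a
> sketch; I am using the fraction of deviates beyond five times the median
> absolute deviation as a crude measure of tail weight):
>
> import numpy as np
>
> rng = np.random.default_rng(3)
> m, ntrials = 3, 200_000
> for ios in (0.0, 1.0, 2.0, 3.0, 5.0):
>     I = rng.normal(ios, 1.0, size=(ntrials, m))            # sigma = 1
>     Imean = I.mean(axis=1)
>     dev = ((I - Imean[:, None]) / Imean[:, None]).ravel()  # ( I - <I> )/<I>
>     width = np.median(np.abs(dev))                         # robust width of the peak
>     tail = np.mean(np.abs(dev) > 5 * width)                # how heavy are the tails?
>     print(f"I/sigma = {ios:3.1f}   fraction beyond 5 widths = {tail:.4f}")
>
> For a pure Gaussian that fraction is of order 0.1%, for a pure Lorentzian
> it is more like 13%, and in this simulation it falls steeply as I/sigma(I)
> climbs towards 3 and beyond.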
>
> Now, you can make a Voigt function (or even a Lorentzian) look more like
> a Gaussian by doing something called "outlier rejection", but it is hard
> to rationalize why the "outliers" are being rejected. Especially in a
> simulation! Then again, the silly part of all this is that all we really
> want is the "middle" of the histogram of ( I - <I> )/<I>. In fact, if
> you just pick the "most common" Rmerge, you would get a much better
> estimate of the "true Rmerge" in a given resolution bin than you would
> by averaging a hundred times more data. Such procedures are called
> "robust estimators" in statistics, and the "robust estimator"
> equivalents to the average and the "rms deviation from the average" are
> the median and the "median absolute deviation from the median". If you
> make a list of Lorentzian-random numbers as above, and compute the
> median, you will get a value very close to zero, even with modest
> multiplicity! And the "median absolute deviation from the median"
> rapidly converges to 1, which matches the half-width at half-maximum
> of the histogram quite nicely.
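>
> In numpy terms (again just a sketch) the robust versions are one-liners:
>
> import numpy as np
>
> rng = np.random.default_rng(4)
> x = np.tan(np.pi * (rng.random(100_000) - 0.5))  # Lorentzian deviates, as above
>
> med = np.median(x)
> mad = np.median(np.abs(x - med))                 # median absolute deviation
>
> print(f"median = {med:7.4f}")   # sits very close to zero
> print(f"MAD    = {mad:7.4f}")   # close to 1, the half-width at half-maximum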
>
> So, what are the practical implications of this? Perhaps instead of the
> average Rmerge in each bin we should be looking at the median Rmerge?
> This will be the same as the average for the cases where I/sigma(I) > 3,
> but still be "well behaved" for the bins that contain only weak data.
> The last two graphs attached compare the average (red) Rmerge vs the
> median (blue). These were computed for the same data, and clearly the
> median Rmerge is a lot more stable. The blue error bars are the median
> absolute deviation from the median. True, the median Rmerge curves down
> a bit as I/sigma approaches zero at m=3, but at least it doesn't go
> negative!
>
> Just a suggestion.
>
> -James Holton
> MAD Scientist
>
> On 3/9/2011 8:33 AM, Graeme Winter wrote:
>> Hi James,
>>
>> May I just offer a short counter-argument to your case for not
>> including weak reflections in the merging residuals?
>>
>> Unlike many people I rather like Rmerge, not because it tells you how
>> good the data are, but because it gives you a clue as to how well the
>> unmerged measurements agree with one another. It's already been
>> mentioned on this thread that Rmerge is ~ 0.8 / <I/sigma>, which means
>> that the inverse is also true - an Rmerge of 0.8 indicates that the
>> average measurement in the shell has an I/sigma of ~ 1 (presuming
>> there are sufficient multiple measurements - if the multiplicity is
>> less than 3 or so this can be nonsense).
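>>
>> As back-of-the-envelope arithmetic (my own illustration, not taken from
>> Scala or any other program), the rule inverts trivially:
>>
>> def implied_i_over_sigma(rmerge):
>>     # invert Bernhard's rule, Rmerge ~ 0.8 / <I/sigma>
>>     return 0.8 / rmerge
>>
>> for r in (0.05, 0.10, 0.40, 0.80):
>>     print(f"Rmerge = {r:4.2f}  ->  <I/sigma> ~ {implied_i_over_sigma(r):4.1f}")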
>>
>> This does not depend on the error model or the multiplicity. It just
>> talks about the average. Now, if we exclude all measurements with an
>> I/sigma of less than three we have no idea of how strong the
>> reflections in the shell are on average. We're just top-slicing the
>> good reflections and asking how well they agree. Well, with an I/sigma > 3
>> I would hope they agree rather well if your error model is
>> reasonable. It would suddenly become rare to see an Rmerge > 0.3 in
>> the outer shell.
>>
>> I like Rpim. It tells you how good the average measurement should be,
>> provided you have not too much radiation damage. However, without
>> Rmerge I can't get a real handle on how well the measurements agree.
>>
>> Personally, what I would like to see is the full contents of the Scala
>> log file available as graphs along with Rd from xdsstat and some other
>> choice statistics so you can get a relatively complete picture;
>> however, I appreciate that this is unrealistic :o)
>>
>> Just my 2c.
>>
>> Cheerio,
>>
>> Graeme
>>
>> On 8 March 2011 20:07, James Holton <[log in to unmask]> wrote:
>>
>> Although George does not mention anything about data reduction
>> programs, I take from his description that common small-molecule
>> data processing packages (SAINT, others?) have also been
>> modernized to record all data (no I/sigmaI > 2 or 3 cutoff). I
>> agree with him that this is a good thing! And it is also a good
>> thing that small-molecule refinement programs use all data. I
>> just don't think it is a good idea to use all data in R factor
>> calculations.
>>
>> Like Ron, I will probably be dating myself when I say that when I
>> first got into the macromolecular crystallography business, it was
>> still commonplace to use a 2-3 sigma spot intensity cutoff. In
>> fact, this is the reason why the PDB wants to know your
>> "completeness" in the outermost resolution shell (in those days,
>> the outer resolution was defined by where completeness drops to
>> ~80% after the 3 sigma spot cutoff). My experience with this,
>> however, was brief, as the maximum-likelihood revolution was just
>> starting to take hold, and the denzo manual specifically stated
>> that only bad people use sigma cutoffs > -3.0. Nevertheless, like
>> many crystallographers from this era, I have fond memories of the
>> REALLY low R factors you can get by using this arcane and now
>> reviled practice. Rsym values of 1-2% were common.
>>
>> It was only recently that I learned enough about statistics to
>> understand the wisdom of my ancestors and that a 3-sigma cutoff is
>> actually the "right thing to do" if you want to measure a
>> fractional error (like an R factor). That is all I'm saying.
>>
>> -James Holton
>> MAD Scientist
>>
>>
>> On 3/6/2011 2:50 PM, Ronald E Stenkamp wrote:
>>
>> My small molecule experience is old enough (maybe 20 years)
>> that I doubt if it's even close to representing current
>> practices (best or otherwise). Given George's comments, I
>> suspect (and hope) that less-than cutoffs are historical
>> artifacts at this point, kept around in software for making
>> comparisons with older structure determinations. But a bit of
>> scanning of Acta papers and others might be necessary to
>> confirm that. Ron
>>
>>
>> On Sun, 6 Mar 2011, James Holton wrote:
>>
>>
>> Yes, I would classify anything with I/sigmaI < 3 as
>> "weak". And yes, of course it is possible to get "weak"
>> spots from small molecule crystals. After all, there is no
>> spot so "strong" that it cannot be defeated by a
>> sufficient amount of background! I just meant that,
>> relatively speaking, the intensities diffracted from a
>> small molecule crystal are orders of magnitude brighter
>> than those from a macromolecular crystal of the same size,
>> and even the same quality (the 1/Vcell^2 term in Darwin's
>> formula).
>>
>> I find it interesting that you point out the use of a 2
>> sigma(I) intensity cutoff for small molecule data sets!
>> Is this still common practice? I am not a card-carrying
>> "small molecule crystallographer", so I'm not sure.
>> However, if that is the case, then by definition there are
>> no "weak" intensities in the data set. And this is
>> exactly the kind of data you want for least-squares
>> refinement targets and computing "% error" quality metrics
>> like R factors. For likelihood targets, however, the
>> "weak" data are actually a powerful restraint.
>>
>> -James Holton
>> MAD Scientist
>>
>> On 3/6/2011 11:22 AM, Ronald E Stenkamp wrote:
>>
>> Could you please expand on your statement that
>> "small-molecule data has essentially no weak spots."?
>> The small molecule data sets I've worked with have
>> had large numbers of "unobserved" reflections where I
>> used 2 sigma(I) cutoffs (maybe 15-30% of the
>> reflections). Would you consider those "weak" spots
>> or not? Ron
>>
>> On Sun, 6 Mar 2011, James Holton wrote:
>>
>> I should probably admit that I might be indirectly
>> responsible for the resurgence of this I/sigma > 3
>> idea, but I never intended this in the way
>> described by the original poster's reviewer!
>>
>> What I have been trying to encourage people to do
>> is calculate R factors using only hkls for which
>> the signal-to-noise ratio is > 3. Not refinement!
>> Refinement should be done against all data. I
>> merely propose that weak data be excluded from
>> R-factor calculations after the
>> refinement/scaling/merging/etc. is done.
>>
>> This is because R factors are a metric of the
>> FRACTIONAL error in something (aka a "%
>> difference"), but a "% error" is only meaningful
>> when the thing being measured is not zero.
>> However, in macromolecular crystallography, we
>> tend to measure a lot of "zeroes". There is
>> nothing wrong with measuring zero! An excellent
>> example of this is confirming that a systematic
>> absence is in fact "absent". The "sigma" on the
>> intensity assigned to an absent spot is still a
>> useful quantity, because it reflects how confident
>> you are in the measurement. That is, a sigma of 10 rather than 100
>> means you are more sure that the intensity really is zero. However,
>> there is no "R factor" for systematic absences. How could there be?
>> This
>> is because the definition of "% error" starts to
>> break down as the "true" spot intensity gets
>> weaker, and it becomes completely meaningless when
>> the "true" intensity reaches zero.
>>
>> Historically, I believe the widespread use of R
>> factors came about because small-molecule data has
>> essentially no weak spots. With the exception of
>> absences (which are not used in refinement), spots
>> from "salt crystals" are strong all the way out to the
>> edge of the detector (even out to the "limiting
>> sphere", which is defined by the x-ray
>> wavelength). So, when all the data are strong, a
>> "% error" is an easy-to-calculate quantity that
>> actually describes the "sigma"s of the data very
>> well. That is, sigma(I) of strong spots tends to
>> be dominated by things like beam flicker, spindle
>> stability, shutter accuracy, etc. All these
>> usually add up to ~5% error, and indeed even the
>> Braggs could typically get +/-5% for the intensity
>> of the diffracted rays they were measuring.
>> Things like Rsym were therefore created to check
>> that nothing "funny" happened in the measurement.
>>
>> For similar reasons, the quality of a model
>> refined against all-strong data is described very
>> well by a "% error", and this is why the
>> refinement R factors rapidly became popular. Most
>> people intuitively know what you mean if you say
>> that your model fits the data to "within 5%". In
>> fact, a widely used criterion for the correctness
>> of a "small molecule" structure is that the
>> refinement R factor must be LOWER than Rsym. This
>> is equivalent to saying that your curve (model)
>> fit your data "to within experimental error".
>> Unfortunately, this has never been the case for
>> macromolecular structures!
>>
>> The problem with protein crystals, of course, is
>> that we have lots of "weak" data. And by "weak",
>> I don't mean "bad"! Yes, it is always nicer to
>> have more intense spots, but there is nothing
>> shameful about knowing that certain intensities
>> are actually very close to zero. In fact, from
>> the point of view of the refinement program, isn't
>> describing some high-angle spot as: "zero, plus or
>> minus 10", better than "I have no idea"? Indeed,
>> several works mentioned already as well as the
>> "free lunch algorithm" have demonstrated that
>> these "zero" data can actually be useful, even if
>> they are well beyond the "resolution limit".
>>
>> So, what do we do? I see no reason to abandon R
>> factors, since they have such a long history and
>> give us continuity of criteria going back almost a
>> century. However, I also see no reason to punish
>> ourselves by including lots of zeroes in the
>> denominator. In fact, using weak data in an R
>> factor calculation defeats their best feature. R
>> factors are a very good estimate of the fractional
>> component of the total error, provided they are
>> calculated with strong data only.
>>
>> Of course, with strong and weak data, the best
>> thing to do is compare the model-data disagreement
>> with the magnitude of the error. That is, compare
>> |Fobs-Fcalc| to sigma(Fobs), not Fobs itself.
>> Modern refinement programs do this! And I say
>> the more data the merrier.
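>>
>> As a toy illustration of the difference (my own numbers, not what any
>> refinement program actually reports):
>>
>> import numpy as np
>>
>> def r_factor(fobs, fcalc):
>>     # classic fractional disagreement - dominated by the strong reflections
>>     return np.sum(np.abs(fobs - fcalc)) / np.sum(np.abs(fobs))
>>
>> def sigma_weighted_misfit(fobs, fcalc, sig):
>>     # mean |Fobs - Fcalc| / sigma(Fobs): each reflection judged against its error
>>     return np.mean(np.abs(fobs - fcalc) / sig)
>>
>> rng = np.random.default_rng(5)
>> ftrue = np.concatenate([rng.uniform(50, 500, 900), np.zeros(100)])  # strong + "zero" spots
>> sig = np.full_like(ftrue, 10.0)
>> fobs = ftrue + rng.normal(0.0, sig)     # weak "observations" may even come out negative
>> fcalc = ftrue                           # a model that is perfect by construction
>>
>> print("R factor        =", round(r_factor(fobs, fcalc), 3))
>> print("mean |dF|/sigma =", round(sigma_weighted_misfit(fobs, fcalc, sig), 3))
>>
>> The sigma-weighted number comes out around 0.8, the expected |deviate|/sigma
>> for Gaussian noise - i.e. the model "fits to within experimental error",
>> weak reflections included - and it would stay meaningful even if every
>> reflection were weak, which is more than can be said for an R factor.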
>>
>>
>> -James Holton
>> MAD Scientist
>>
>>
>> On 3/4/2011 5:15 AM, Marjolein Thunnissen wrote:
>>
>> hi
>>
>> Recently on a paper I submitted, it was the
>> editor of the journal who wanted exactly the
>> same thing. I never argued with the editor
>> about this (should have maybe), but it could
>> be one cause of the epidemic that Bart Hazes
>> saw....
>>
>>
>> best regards
>>
>> Marjolein
>>
>> On Mar 3, 2011, at 12:29 PM, Roberto
>> Battistutta wrote:
>>
>> Dear all,
>> I got a reviewer comment that indicates the
>> "need to refine the structures at an
>> appropriate resolution (I/sigmaI of > 3.0),
>> and re-submit the revised coordinate files
>> to the PDB for validation.". In the
>> manuscript I present some crystal
>> structures determined by molecular
>> replacement using the same protein in a
>> different space group as search model.
>> Does anyone know the origin or the
>> theoretical basis of this "I/sigmaI>3.0"
>> rule for an appropriate resolution?
>> Thanks,
>> Bye,
>> Roberto.
>>
>>
>> Roberto Battistutta
>> Associate Professor
>> Department of Chemistry
>> University of Padua
>> via Marzolo 1, 35131 Padova - ITALY
>> tel. +39.049.8275265/67
>> fax. +39.049.8275239
>> [log in to unmask]
>> www.chimica.unipd.it/roberto.battistutta/
>> VIMM (Venetian Institute of Molecular
>> Medicine)
>> via Orus 2, 35129 Padova - ITALY
>> tel. +39.049.7923236
>> fax +39.049.7923250
>> www.vimm.it
>>
>