I generally cut off integration at the shell where I/sigI < 0.5 and
then cut off merged data where MnI/sd(I) ~ 1.5. It is always easier
to cut off data later than to re-integrate it. I never look at the
Rmerge, Rsym, Rpim or Rwhatever in the highest resolution shell. This
is because R-statistics are inappropriate for weak data.
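If it helps to make that concrete, here is a throwaway Python sketch of the
kind of bookkeeping I mean (the shell statistics are made-up numbers; parsing
your own integration and merging logs is up to you):

shells = [  # (high-res edge in A, unmerged <I/sigI>, merged Mn(I)/sd(I)) -- made-up
    (2.5, 8.2, 14.0),
    (2.2, 4.1,  7.3),
    (2.0, 1.9,  3.5),
    (1.9, 0.9,  1.8),
    (1.8, 0.6,  1.2),
    (1.7, 0.3,  0.7),
]

# integrate out to the highest-resolution shell where unmerged <I/sigI> >= 0.5 ...
integration_limit = min(d for d, i_sig, mn_sd in shells if i_sig >= 0.5)

# ... but only merge/quote data out to where Mn(I)/sd(I) is still ~1.5
merging_limit = min(d for d, i_sig, mn_sd in shells if mn_sd >= 1.5)

print(integration_limit, merging_limit)   # 1.8 1.9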
Don't believe me? If anybody out there doesn't think that spots with no
intensity are important, then why are you looking so carefully at your
systematic absences? ;) The "Rabsent" statistic (if it existed) would
always be dividing by zero, and giving wildly varying numbers > 100%
(unless your "absences" really do have intensity, but then they are not
absent, are they?).
There is information in the intensity of an "absent" spot (a
systematic absence, or any spot beyond your "true resolution limit").
Unfortunately, measuring zero is "hard" because the "signal to noise
ratio" will always be ... zero. Statistics as we know it seems to fear
this noise>signal domain. For example, the error propagation Ulrich
pointed out (F/sigF = 2 I/sigI) breaks down as I approaches zero. If
you take F=0, and add random noise to it and then square it, you will
get an average value for <I>=<F^2> that always equals the square of the
rms noise you added (sigma^2). It will never be zero, no matter how much
averaging you do. Going the other way is problematic because if <I> really
is zero, then half of your measurements of it will be negative (and sqrt(I)
will be "imaginary" (ha ha)). This is the problem TRUNCATE tries to solve.
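To see this numerically, here is a throwaway Python sketch (made-up noise
level, nothing from a real data set):

import random

sigma = 3.0          # rms noise on F (arbitrary units, made-up)
n = 1_000_000

# a reflection whose true F is exactly zero, measured n times and squared
measurements = [random.gauss(0.0, sigma) ** 2 for _ in range(n)]
print(sum(measurements) / n)                 # ~9.0 == sigma**2, however large n gets

# and if the true intensity is zero, noisy measurements of I go negative
# about half the time, so sqrt(I) is "imaginary" for half the data
intensities = [random.gauss(0.0, sigma) for _ in range(n)]
print(sum(i < 0 for i in intensities) / n)   # ~0.5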
Despite these difficulties, IMHO, cutting out weak data from a ML
refinement is a really bad idea. This is because there is a big
difference between "1 +/- 10" and "I don't know, could be anything" when
you are fitting a model to data. ESPECIALLY when your data/parameters
ratio is already ~1.0 or less. This is because the DIFFERENCE between
Fobs and Fcalc relative to the uncertainty of Fobs is what determines
whether or not your model is correct "to within experimental error". If
weak, high-res data are left out, then they can become a dumping ground
for model bias. Indeed, there are some entries in the PDB (particularly
those pre-dating when we knew how to restrain B factors properly) that
show an up-turn in "intensity" beyond the quoted resolution cutoff (if
you look at the Wilson plot of Fcalc). This is because the refinement
program was allowed to make Fcalc beyond the resolution cutoff anything
it wanted (and it did).
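To put some (hypothetical) numbers on the "1 +/- 10" point, in plain
least-squares terms (real ML targets are more elaborate, but the idea is
the same):

# A weak observation still contributes ((Fobs - Fcalc)/sigFobs)**2 to the
# refinement target; a missing one contributes nothing at all.
# (Hypothetical numbers, just for illustration.)
fobs, sigfobs = 1.0, 10.0

def penalty(fcalc):
    return ((fobs - fcalc) / sigfobs) ** 2

print(penalty(2.0))    # 0.01  -- consistent with "1 +/- 10"
print(penalty(50.0))   # 24.01 -- strongly disfavoured
# Throw the reflection out and both models score zero: Fcalc beyond the
# cutoff is free to become a dumping ground for model bias.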
The only time I think it is appropriate to cut out data because it is weak
is for map calculations. Leaving out an HKL from the map is the same as
assigning it to zero (unless it is a sigma-a map that "fills in" with
Fcalcs). In maps, weak data (I/sd < 1) will (by definition) add more
noise than signal. In fact, calculating an anomalous difference
Patterson with DANO/SIGDANO as the coefficients instead of DANO can
often lead to "better" maps.
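All I mean by the DANO/SIGDANO trick is weighting each coefficient by its
own uncertainty before the FFT, so weak differences are pushed toward zero
instead of contributing their full noise. A sketch (hypothetical reflections,
the usual CCP4 column names):

import math

# (h, k, l, F+, sigF+, F-, sigF-)
reflections = [
    (1, 2, 3, 105.0, 4.0, 98.0, 4.2),
    (2, 0, 5,  12.0, 6.0, 11.0, 5.8),   # weak: its DANO is mostly noise
]

coeffs = []
for h, k, l, fp, sfp, fm, sfm in reflections:
    dano = fp - fm                        # anomalous difference
    sigdano = math.sqrt(sfp**2 + sfm**2)  # its propagated uncertainty
    coeffs.append((h, k, l, dano / sigdano))  # feed this to the FFT instead of dano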
Yes, your Rmerge, Rcryst and Rfree will all go up if you include weak
data in your scaling and refinement, but the accuracy of your model will
improve. If you (or your reviewer) are worried about this, I suggest
using the old, traditional 3-sigma cutoff for data used to calculate R.
Keep the anachronisms together. Yes, the PDB allows this. In fact,
(last time I checked) you are asked to enter what sigma cutoff you used
for your R factors.
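If you do want to report a 3-sigma-cutoff R alongside refining against
everything, the bookkeeping is trivial (hypothetical numbers again):

# (Fobs, sigFobs, Fcalc): made-up values standing in for refinement output
data = [
    (120.0, 3.0, 118.0),
    (  2.0, 5.0,   6.0),   # weak: kept in refinement, dropped from the reported R
    ( 45.0, 2.0,  40.0),
]

def rfactor(refl):
    return sum(abs(fo - fc) for fo, s, fc in refl) / sum(fo for fo, s, fc in refl)

r_all  = rfactor(data)                                    # refined/reported against everything
r_3sig = rfactor([d for d in data if d[0] / d[1] > 3.0])  # old-fashioned 3-sigma-cutoff R
print(r_all, r_3sig)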
In the last 100 days (3750 PDB depositions), the "REMARK 3 DATA
CUTOFF" stats are thus:
sigma-cutoff     popularity
NULL               13.84%
NONE               13.65%
-2.5 to -1.5        0.37%
-0.5 to  0.5       62.48%
 0.5 to  1.5        2.03%
 1.5 to  2.5        6.51%
 2.5 to  3.5        0.61%
 3.5 to  4.5        0.24%
>4.5                0.27%
So it would appear mine is not a popular attitude.
-James Holton
MAD Scientist
Shane Atwell wrote:
> Could someone point me to some standards for data quality, especially
> for publishing structures? I'm wondering in particular about highest
> shell completeness, multiplicity, sigma and Rmerge.
>
> A co-worker pointed me to a '97 article by Kleywegt and Jones:
>
> http://xray.bmc.uu.se/gerard/gmrp/gmrp.html
>
> "To decide at which shell to cut off the resolution, we nowadays tend
> to use the following criteria for the highest shell: completeness > 80
> %, multiplicity > 2, more than 60 % of the reflections with I > 3
> sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a
> good 1.8 Å structure, than a poor 1.637 Å structure."
>
> Are these recommendations still valid with maximum likelihood methods?
> We tend to use more data, especially in terms of the Rmerge and sigma
> cutoff.
>
> Thanks in advance,
>
> Shane Atwell
>