CCP4BB Archives
CCP4BB@JISCMAIL.AC.UK

Subject: Re: Reasoning for Rmeas or Rpim as Cutoff
From: Phoebe Rice <[log in to unmask]>
Reply-To: [log in to unmask]
Date: Tue, 31 Jan 2012 14:02:45 -0600
Content-Type: text/plain
Parts/Attachments: text/plain (513 lines)

I'm enjoying this discussion.
It also seems like a good spot to inject my standard plea for better treatment of anisotropy in things like "table 1" of papers and PDB deposition forms.  When you have Ewald's football (American football), like many nucleic acid-ophiles do, one number simply isn't enough to describe the data.

=====================================
Phoebe A. Rice
Dept. of Biochemistry & Molecular Biology
The University of Chicago
phone 773 834 1723
http://bmb.bsd.uchicago.edu/Faculty_and_Research/01_Faculty/01_Faculty_Alphabetically.php?faculty_id=123
http://www.rsc.org/shop/books/2008/9780854042722.asp


---- Original message ----
>Date: Tue, 31 Jan 2012 09:28:49 +0000
>From: CCP4 bulletin board <[log in to unmask]> (on behalf of Randy Read <[log in to unmask]>)
>Subject: Re: [ccp4bb] Reasoning for Rmeas or Rpim as Cutoff  
>To: [log in to unmask]
>
>   Hi Frank,
>   Now that I've been forced to recalibrate my measure
>   of "old literature" (apparently it's not just
>   literature dating from before you started your
>   PhD)...
>   This idea has been used occasionally, but I think it
>   might be becoming more relevant as more structures
>   are done at low resolution.  To be honest, my
>   criteria for resolution tend to be flexible -- if a
>   crystal diffracts to 2.2A resolution by the
>   I/sig(I)>2 criterion, there's less incentive to
>   worry about it -- pushing it to 2A probably won't
>   make much difference to the biological questions
>   that can be answered.  But if the 2-sigma criterion
>   says that the crystal diffracts to 3.3A, then I'm
>   much more likely to see how much more can
>   justifiably be squeezed out of it.  Actually, I'm
>   more inclined to start from a 1-sigma cutoff in the
>   first instance, trusting the maximum likelihood
>   methods to deal appropriately with the uncertainty.
>   It's not too hard to do the higher-resolution
>   cross-validation, but there are a number of things
>   to worry about.  First, I remember being told that
>   data processing programs will do a better job of
>   learning the profiles if you only give them data to
>   a resolution where there are real spots, so you
>   probably don't always want to integrate to an
>   arbitrarily high resolution, unless you're willing
>   to go back to the integration after reassessing the
>   resolution cutoff.  Maybe, as a standard protocol,
>   one could integrate to a conservative resolution and
>   a much higher resolution, then use the conservative
>   data set for initial work and the higher resolution
>   data set for evaluating the optimal resolution
>   cutoff -- and then reintegrate one more time later
>   at that resolution, using those data for the rest of
>   the structure determination.
>   The idea of using the SigmaA curve from Refmac has
>   come up, but SigmaA curves from cross-validation
>   data will have a problem.  In order to get these to
>   behave (with a small number of reflections per
>   resolution bin), you need to smooth the curve in
>   some way.  Refmac does this by fitting a functional
>   form, so the high-resolution SigmaA values are bound
>   to drop off smoothly regardless of the real
>   structure factor agreement.  If you're evaluating
>   resolution immediately after molecular replacement
>   with a good model, then you could use my old SIGMAA
>   program to get independent SigmaA values for
>   individual resolution bins, using all the data
>   (because there's no danger of over-fitting).
>    However, if you start out with a poor model or
>   solve the structure by experimental phasing, you'll
>   have to do some building and refinement before you
>   have a model good enough to compare with the
>   higher-resolution data.  Then you want to compare
>   the fit of the cross-validation data up to the
>   resolution cutoff used in refinement to the
>   resolution-dependent fit of all the higher
>   resolution data not used in refinement.  I'd
>   probably do that, at the moment, by using sftools to
>   select all the data that haven't been used in
>   refinement then calculate correlation coefficients
>   in resolution bins (which are probably as good for
>   this purpose as SigmaA values).  (For
>   non-aficionados of sftools, the selection could be
>   done by selecting the reflections with d-spacing
>   less than dmin for your refinement, selecting the
>   subset of those that are in the working set, then
>   inverting the selection to get everything not used
>   in refinement.)
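
A minimal sketch of that selection-and-binning step, assuming the d-spacings, free-set flags, |Fobs| and |Fcalc| have already been exported to numpy arrays (array names are illustrative and this is not sftools syntax -- just one way of realising the selection described above):

    import numpy as np

    def cc_in_bins(d, free, fobs, fcalc, dmin_refine, nbins=20):
        """CC(Fobs, Fcalc) in resolution bins, using only reflections that were
        not used in refinement: the free set inside the refinement limit plus
        everything at higher resolution than dmin_refine."""
        unused = (d < dmin_refine) | free            # beyond the refinement dmin, or free
        s2 = 1.0 / d[unused] ** 2                    # bin on 1/d^2
        edges = np.linspace(s2.min(), s2.max(), nbins + 1)
        idx = np.digitize(s2, edges[1:-1])           # bin index 0 .. nbins-1
        out = []
        for b in range(nbins):
            sel = idx == b
            if sel.sum() > 10:                       # skip nearly empty bins
                cc = np.corrcoef(fobs[unused][sel], fcalc[unused][sel])[0, 1]
                out.append((1.0 / np.sqrt(edges[b + 1]), cc))
        return out                                   # (high-resolution edge of bin in A, CC)

The returned (bin edge, CC) pairs can then be inspected for the point where the correlation falls to insignificance.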
>   Regards,
>   Randy
>   On 30 Jan 2012, at 10:03, Frank von Delft wrote:
>
>     Hi Randy - thank you for a very interesting
>     reminder of old literature.
>
>     I'm intrigued:  how come this apparently excellent
>     idea has not become standard best practice in the
>     14 years since it was published? 
>
>     phx
>
>     On 30/01/2012 09:40, Randy Read wrote:
>
>       Hi,
>       Here are a couple of links on the idea of
>       judging resolution by a type of cross-validation
>       with data not used in refinement:
>       Ling et al,
>       1998: http://pubs.acs.org/doi/full/10.1021/bi971806n
>       Brunger et al,
>       2008: http://journals.iucr.org/d/issues/2009/02/00/ba5131/index.html
>         (cites earlier relevant papers from Brunger's
>       group)
>       Best wishes,
>       Randy Read
>       On 30 Jan 2012, at 07:09, arka chakraborty
>       wrote:
>
>         Hi all,
>
>         In the context of the ongoing discussion above,
>         can anybody post links to a few relevant
>         articles?
>
>         Thanks in advance,
>
>         ARKO
>
>         On Mon, Jan 30, 2012 at 3:05 AM, Randy Read
>         <[log in to unmask]> wrote:
>
>           Just one thing to add to that very detailed
>           response from Ian.
>
>           We've tended to use a slightly different
>           approach to determining a sensible
>           resolution cutoff, where we judge whether
>           there's useful information in the highest
>           resolution data by whether it agrees with
>           calculated structure factors computed from a
>           model that hasn't been refined against those
>           data.  We first did this with the complex of
>           the Shiga-like toxin B-subunit pentamer with
>           the Gb3 trisaccharide (Ling et al, 1998).
>            From memory, the point where the average
>           I/sig(I) drops below 2 was around 3.3A.
>            However, we had a good molecular
>           replacement model to solve this structure
>           and, after just carrying out rigid-body
>           refinement, we computed a SigmaA plot using
>           data to the edge of the detector (somewhere
>           around 2.7A, again from memory).  The SigmaA
>           plot dropped off smoothly to 2.8A
>           resolution, with values well above zero
>           (indicating significantly better than random
>           agreement), then dropped suddenly.  So we
>           chose 2.8A as the cutoff.  Because there
>           were four pentamers in the asymmetric unit,
>           we could then use 20-fold NCS averaging,
>           which gave a fantastic map.  In this case,
>           the averaging certainly helped to pull out
>           something very useful from a very weak
>           signal, because the maps weren't nearly as
>           clear at lower resolution.
>
>           Since then, a number of other people have
>           applied similar tests.  Notably, Axel
>           Brunger has done some careful analysis to
>           show that it can indeed be useful to take
>           data beyond the conventional limits.
>
>           When you don't have a great MR model, you
>           can do something similar by limiting the
>           resolution for the initial refinement and
>           rebuilding, then assessing whether there's
>           useful information at higher resolution by
>           using the improved model (which hasn't seen
>           the higher resolution data) to compute
>           Fcalcs.  By the way, it's not necessary to
>           use a SigmaA plot -- the correlation between
>           Fo and Fc probably works just as well.  Note
>           that, when the model has been refined
>           against the lower resolution data, you'll
>           expect a drop in correlation at the
>           resolution cutoff you used for refinement,
>           unless you only use the cross-validation
>           data for the resolution range used in
>           refinement.
>
>           -----
>           Randy J. Read
>           Department of Haematology, University of Cambridge
>           Cambridge Institute for Medical Research     Tel: +44 1223 336500
>           Wellcome Trust/MRC Building                  Fax: +44 1223 336827
>           Hills Road                                   E-mail: [log in to unmask]
>           Cambridge CB2 0XY, U.K.                      www-structmed.cimr.cam.ac.uk
>           On 29 Jan 2012, at 17:25, Ian Tickle wrote:
>
>           > Jacob, here's my (personal) take on this:
>           >
>           > The data quality metrics that everyone
>           uses clearly fall into 2
>           > classes: 'consistency' metrics, i.e.
>           Rmerge/meas/pim and CC(1/2) which
>           > measure how well redundant observations
>           agree, and signal/noise ratio
>           > metrics, i.e. mean(I/sigma) and
>           completeness, which relate to the
>           > information content of the data.
>           >
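
For reference, the standard definitions of the consistency metrics being discussed, with n_hkl redundant observations I_i(hkl) of each unique reflection and <I(hkl)> their mean, are:

    R_{merge} = \frac{\sum_{hkl} \sum_i |I_i(hkl) - \langle I(hkl)\rangle|}{\sum_{hkl} \sum_i I_i(hkl)}

    R_{meas} = \frac{\sum_{hkl} \sqrt{n_{hkl}/(n_{hkl}-1)} \sum_i |I_i(hkl) - \langle I(hkl)\rangle|}{\sum_{hkl} \sum_i I_i(hkl)}

    R_{pim} = \frac{\sum_{hkl} \sqrt{1/(n_{hkl}-1)} \sum_i |I_i(hkl) - \langle I(hkl)\rangle|}{\sum_{hkl} \sum_i I_i(hkl)}

The \sqrt{n_{hkl}/(n_{hkl}-1)} factor makes R_meas approximately independent of redundancy, and the \sqrt{1/(n_{hkl}-1)} factor makes R_pim report the precision of the merged mean rather than the spread of the individual observations.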
>           > IMO the basic problem with all the
>           consistency metrics is that they
>           > are not measuring the quantity that is
>           relevant to refinement and
>           > electron density maps, namely the
>           information content of the data, at
>           > least not in a direct and meaningful way.
>            This is because there are 2
>           > contributors to any consistency metric:
>           the systematic errors (e.g.
>           > differences in illuminated volume and
>           absorption) and the random
>           > errors (from counting statistics, detector
>           noise etc.).  If the data
>           > are collected with sufficient redundancy
>           the systematic errors should
>           > hopefully largely cancel, and therefore
>           only the random errors will
>           > determine the information content.
>            Therefore the systematic error
>           > component of the consistency measure
>           (which I suspect is the biggest
>           > component, at least for the strong
>           reflections) is not relevant to
>           > measuring the information content.  If the
>           consistency measure only
>           > took into account the random error
>           component (which it can't), then it
>           > would essentially be a measure of
>           information content, if only
>           > indirectly (but then why not simply use a
>           direct measure such as the
>           > signal/noise ratio?).
>           >
>           > There are clearly at least 2 distinct
>           problems with Rmerge, first it's
>           > including systematic errors in its measure
>           of consistency, second it's
>           > not invariant with respect to the
>           redundancy (and third it's useless
>           > as a statistic anyway because you can't do
>           any significance tests on
>           > it!).  The redundancy problem is fixed to
>           some extent with Rpim etc,
>           > but that still leaves the other problems.
>            It's not clear to me that
>           > CC(1/2) is any better in this respect,
>           since (as far as I understand
>           > how it's implemented), one cannot be sure
>           that the systematic errors
>           > will cancel for each half-dataset Imean,
>           so it's still likely to
>           > contain a large contribution from the
>           irrelevant systematic error
>           > component and so mislead in respect of the
>           real data quality exactly
>           > in the same way that Rmerge/meas/pim do.
>            One may as well use the
>           > Rmerge between the half dataset Imeans,
>           since there would be no
>           > redundancy effect (i.e. the redundancy
>           would be 2 for all included
>           > reflections).
>           >
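
As a rough illustration of the half-dataset statistics under discussion, a minimal sketch, assuming the redundant observations of each unique reflection are already gathered into a dict (a made-up container for illustration, not any particular program's data model): CC(1/2) from a random split into two halves, and the Rmerge between the two half-dataset means suggested above as an alternative:

    import numpy as np

    def half_dataset_stats(obs, seed=0):
        """obs: dict mapping each unique hkl to a list of its redundant intensities."""
        rng = np.random.default_rng(seed)
        a, b = [], []
        for intensities in obs.values():
            if len(intensities) < 2:
                continue                              # need at least two observations to split
            perm = rng.permutation(intensities)
            half = len(perm) // 2
            a.append(perm[:half].mean())
            b.append(perm[half:].mean())
        a, b = np.array(a), np.array(b)
        cc_half = np.corrcoef(a, b)[0, 1]             # CC(1/2)
        r_half = np.abs(a - b).sum() / (a + b).sum()  # Rmerge over the two half-dataset means
        return cc_half, r_half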
>           > I did some significance tests on CC(1/2)
>           and I got silly results, for
>           > example it says that the significance
>           level for the CC is ~ 0.1, but
>           > this corresponded to a huge Rmerge (200%)
>           and a tiny mean(I/sigma)
>           > (0.4).  It seems that (without any basis
>           in statistics whatsoever) the
>           > rule-of-thumb CC > 0.5 is what is
>           generally used, but I would be
>           > worried that the statistics are so far
>           divorced from the reality - it
>           > suggests that something is seriously wrong
>           with the assumptions!
>           >
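
For readers who want the arithmetic behind such a test: the usual significance test for a Pearson correlation r over n pairs (applying it to half-dataset means is itself an approximation) uses

    t = r\sqrt{\frac{n-2}{1-r^{2}}} \sim t_{n-2},

so for large n the critical correlation at significance level \alpha is roughly r_{crit} \approx z_{1-\alpha}/\sqrt{n}; with a few hundred unique reflections in a shell that is indeed of the order of 0.1.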
>           > Having said all that, the mean(I/sigma)
>           metric, which on the face of
>           > it is much more closely related to the
>           information content and
>           > therefore should be a more relevant metric
>           than Rmerge/meas/pim &
>           > CC(1/2), is not without its own problems
>           (which probably explains the
>           > continuing popularity of the other
>           metrics!).  First and most obvious,
>           > it's a hostage to the estimate of sigma(I)
>           used.  I've never been
>           > happy with inflating the counting sigmas
>           to include effects of
>           > systematic error based on the consistency
>           of redundant measurements,
>           > since as I indicated above if the data are
>           collected redundantly in
>           > such a way that the systematic errors
>           largely cancel, it implies that
>           > the systematic errors should not be
>           included in the estimate of sigma.
>           > The fact that then the sigma(I)'s would
>           generally be smaller (at
>           > least for the large I's), so the sample
>           variances would be much larger
>           > than the counting variances, is
>           irrelevant, because the former
>           > includes the systematic errors.  Also the
>           I/sigma cut-off used would
>           > probably not need to be changed since it
>           affects only the weakest
>           > reflections which are largely unaffected
>           by the systematic error
>           > correction.
>           >
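
The kind of inflation being objected to is typically an error model of the general form (the exact parameterisation differs between scaling programs, so take this only as an illustration):

    \sigma'(I)^{2} = k^{2}\left(\sigma_{count}(I)^{2} + s_{B}\,I + (s_{add}\,I)^{2}\right)

where k, s_B and s_add are adjusted until the inflated sigmas match the observed scatter of redundant measurements -- which is precisely how the systematic-error component gets folded into sigma(I).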
>           > The second problem with mean(I/sigma) is
>           also obvious: i.e. it's a
>           > mean, and as such it's rather insensitive
>           to the actual distribution
>           > of I/sigma(I).  For example if a shell
>           contained a few highly
>           > significant intensities these could be
>           overwhelmed by a large number
>           > of weak data and give an insignificant
>           mean(I/sigma).  It seems to me
>           > that one should be considering the
>           significance of individual
>           > reflections, not the shell averages.  Also
>           the average will depend on
>           > the width of the resolution bin, so one
>           will get the strange effect
>           > that the apparent resolution will depend
>           > on how one bins the data!
>           > The assumption being made in taking the
>           bin average is that I/sigma(I)
>           > falls off smoothly with d* but that's
>           unlikely to be the reality.
>           >
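
A toy numerical example of that point, with made-up numbers:

    import numpy as np

    # 10 strong reflections buried among 490 essentially noise-level ones
    i_over_sig = np.concatenate([np.full(10, 8.0), np.full(490, 0.5)])
    print(i_over_sig.mean())        # 0.65 -- the shell average looks insignificant
    print((i_over_sig > 3).sum())   # 10   -- yet ten reflections are individually strong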
>           > It seems to me that a chi-square statistic
>           which takes into account
>           > the actual distribution of I/sigma(I)
>           would be a better bet than the
>           > bin average, though it's not entirely
>           clear how one would formulate
>           > such a metric.  One would have to consider
>           subsets of the data as a
>           > whole sorted by increasing d* (i.e. not in
>           resolution bins to avoid
>           > the 'bin averaging effect' described
>           above), and apply the resolution
>           > cut-off where the chi-square statistic has
>           maximum probability.  This
>           > would automatically take care of
>           incompleteness effects since all
>           > unmeasured reflections would be included
>           with I/sigma = 0 just for the
>           > purposes of working out the cut-off point.
>            I've skipped the details
>           > of implementation and I've no idea how it
>           would work in practice!
>           >
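
One possible (and certainly not the only) reading of that proposal, as a minimal, untested sketch: sort the reflections by d*, walk a trial cutoff outwards, and ask whether everything beyond it is consistent with pure noise, i.e. whether the sum of (I/sigma)^2 over the excluded tail behaves like a chi-square variable with that many degrees of freedom. Unmeasured reflections would be appended to the input with I/sigma = 0 before calling; array names are illustrative.

    import numpy as np
    from scipy.stats import chi2

    def suggest_cutoff(d_star, i_over_sig, alpha=0.05):
        order = np.argsort(d_star)                  # low to high resolution
        z2 = i_over_sig[order] ** 2
        for k in range(len(z2)):                    # trial cutoff after the first k reflections
            tail = z2[k:]
            p = chi2.sf(tail.sum(), df=len(tail))   # chance that noise alone gives this much
            if p > alpha:                           # tail indistinguishable from noise
                return d_star[order][k]             # suggested cutoff, as a d* value
        return d_star[order][-1]                    # apparent signal all the way out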
>           > An obvious question is: do we really need
>           to worry about the exact
>           > cut-off anyway, won't our sophisticated
>           maximum likelihood refinement
>           > programs handle the weak data correctly?
>            Note that in theory weak
>           > intensities should be handled correctly,
>           however the problem may
>           > instead lie with incorrectly estimated
>           sigmas: these are obviously
>           > much more of an issue for any software
>           which depends critically on
>           > accurate estimates of uncertainty!  I did
>           some tests where I refined
>           > data for a known protein-ligand complex
>           using the original apo model,
>           > and looked at the difference density for
>           the ligand, using data cut at
>           > 2.5, 2 and 1.5 Ang where the standard
>           metrics strongly suggested there
>           > was only data to 2.5 Ang.
>           >
>           > I have to say that the differences were
>           tiny, well below what I would
>           > deem significant (i.e. not only the map
>           resolutions but all the map
>           > details were essentially the same), and
>           certainly I would question
>           > whether it was worth all the
>           soul-searching on this topic over the
>           > years!  So it seems that the refinement
>           programs do indeed handle weak
>           > data correctly, but I guess this should
>           hardly come as a surprise (but
>           > well done to the software developers
>           anyway!).  This was actually
>           > using Buster: Refmac seems to have more of
>           a problem with scaling &
>           > TLS if you include a load of high
>           resolution junk data.  However,
>           > before anyone acts on this information I
>           would _very_ strongly advise
>           > them to repeat the experiment and verify
>           the results for themselves!
>           > The bottom line may be that the actual
>           cut-off used only matters for
>           > the purpose of quoting the true resolution
>           of the map, but it doesn't
>           > significantly affect the appearance of the
>           map itself.
>           >
>           > Finally an effect which confounds all the
>           quality metrics is data
>           > anisotropy: ideally the cut-off surface of
>           significance in reciprocal
>           > space should perhaps be an ellipsoid, not
>           a sphere.  I know there are
>           > several programs for anisotropic scaling,
>           but I'm not aware of any
>           > that apply anisotropic resolution cutoffs
>           (or even whether this would
>           > be advisable).
>           >
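
For what it's worth, the geometric test for an ellipsoidal rather than spherical cutoff is simple; a minimal sketch, assuming the three principal directions and a direction-specific resolution limit along each have already been estimated somehow (which is the hard part, and which the anisotropic scaling programs mentioned above address):

    import numpy as np

    def inside_ellipsoidal_cutoff(s, axes, d_limits):
        """s: Cartesian reciprocal-space vector of a reflection (A^-1, |s| = 1/d);
        axes: 3x3 matrix whose rows are orthonormal principal directions;
        d_limits: resolution limit in A along each principal direction."""
        proj = axes @ np.asarray(s)                  # components along the principal axes
        s_max = 1.0 / np.asarray(d_limits)           # |s| limit along each axis
        return float(np.sum((proj / s_max) ** 2)) <= 1.0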
>           > Cheers
>           >
>           > -- Ian
>           >
>           > On 27 January 2012 17:47, Jacob Keller
>           <[log in to unmask]> wrote:
>           >> Dear Crystallographers,
>           >>
>           >> I cannot think why any of the various
>           flavors of Rmerge/meas/pim
>           >> should be used as a data cutoff and not
>           simply I/sigma--can somebody
>           >> make a good argument or point me to a
>           good reference? My thinking is
>           >> that signal:noise of >2 is definitely
>           still signal, no matter what the
>           >> R values are. Am I wrong? I was thinking
>           also possibly the R value
>           >> cutoff was a historical
>           accident/expedient from when one tried to
>           >> limit the amount of data in the face of
>           limited computational
>           >> power--true? So perhaps now, when the
>           computers are so much more
>           >> powerful, we have the luxury of including
>           more weak data?
>           >>
>           >> JPK
>           >>
>           >>
>           >> --
>           >>
>           *******************************************
>           >> Jacob Pearson Keller
>           >> Northwestern University
>           >> Medical Scientist Training Program
>           >> email: [log in to unmask]
>           >>
>           *******************************************
>
>         --
>
>         ARKA CHAKRABORTY
>         CAS in Crystallography and Biophysics
>         University of Madras
>         Chennai,India
>
>       ------
>       Randy J. Read
>       Department of Haematology, University of Cambridge
>       Cambridge Institute for Medical Research     Tel: +44 1223 336500
>       Wellcome Trust/MRC Building                  Fax: +44 1223 336827
>       Hills Road                                   E-mail: [log in to unmask]
>       Cambridge CB2 0XY, U.K.                      www-structmed.cimr.cam.ac.uk
>
>   ------
>   Randy J. Read
>   Department of Haematology, University of Cambridge
>   Cambridge Institute for Medical Research     Tel: +44 1223 336500
>   Wellcome Trust/MRC Building                  Fax: +44 1223 336827
>   Hills Road                                   E-mail: [log in to unmask]
>   Cambridge CB2 0XY, U.K.                      www-structmed.cimr.cam.ac.uk
