Email discussion lists for the UK Education and Research communities

## CCP4BB@JISCMAIL.AC.UK


Subject: Re: Fwd: calculate the real space R factor using OVERLAPMAP
Date: Mon, 24 May 2010 12:55:18 +0100
Content-Type: text/plain
```
Hi Pavel

Phew! Lots of questions - this could take a while:

> - where this formula come from and what are the grounds for this?

It's just the RMSD of the density divided by its standard uncertainty (sigma),
which we're assuming is the same for all grid points. This isn't quite true
(sigma is higher on or near rotation axes), but the effect is sufficiently
small that we can ignore it. It comes from taking the negative log of the
likelihood function of the density values assuming a normal error
distribution, which gives you chi-squared, i.e. sum(delta_rho^2)/sigma(rho)^2.
The likelihood is the standard measure of the consistency of a model (in this
case an atomic model, which gives you rho_calc) with the data (rho_obs), and
delta_rho = rho_obs - rho_calc.

> - how to make sense of the numbers. Say I used this formula and I got a
> number X; how can I tell if it is good or not good?

There's a standard procedure for significance testing which is explained in
all statistics textbooks (or go here:
http://en.wikipedia.org/wiki/Significance_test). You decide on a level of
significance ('critical p-value'), which represents the probability of getting
the observed or a more extreme result purely by chance, assuming that the
'null hypothesis' is true, i.e. that the difference density in question
doesn't represent any real signal, only random error ('noise'). You can use
p = 1%, or even p = 0.1% if you want to be even more confident that what you
see isn't just random error: you are trying to avoid the situation where you
reject the null hypothesis and conclude that there's real difference density
present when there really isn't (a 'Type 1 error'). Of course, making the
p-value too small might mean that you miss real difference density (a
'Type 2 error').

Then you look up your chi-squared value and the 'number of degrees of freedom'
in the relevant published statistical table of upper critical values of the
chi-square distribution (e.g. go here:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm). This is
just the cumulative distribution function of the chi-square distribution, so
it can be readily computed using the appropriate continued fraction expansion
(see 'upper incomplete gamma function':
http://en.wikipedia.org/wiki/Incomplete_Gamma_function). This means that
non-tabulated values can be used; for example, for the normal distribution the
p-value corresponding to the usual '3 sigma' threshold is 0.27%, so it makes
sense to use the same p-value here (p = 0.1% corresponds to 3.3 sigma for the
normal distribution).

One very important point that I glossed over in my previous e-mail is the role
of Npoints: this is the number of *independent* density values in the sum
above. This is slightly tricky because of course we normally over-sample the
maps, which means the density values are no longer independent. However, we
can get round this because we know that at the Shannon limit, where the grid
spacing in the map = Dmin/2, the density values become statistically
independent. So if we over-sample with a grid spacing of, say, Dmin/4 (which
is what I always use for this), the over-sampling factor is 2 in each
direction, and Ndof = Npoints/8.

Let's say Npoints = 400, so Ndof = 50; then for p = 0.1%, chi-squared =
86.661, so RMS-Z-score = sqrt(86.661/50) = 1.32, and that's your threshold:
a value bigger than this probably means that the difference density is real.
However, note that a smaller value *doesn't* prove there's nothing there; it
just means that the data aren't good enough to come to a firm conclusion, and
you might find stronger evidence if you were to obtain better data. Remember
always that absence of evidence isn't evidence of absence!

> - do you think it is better than looking at three values {map CC, 2mFo-DFc,
> mFo-DFc} and why?

Yes, because all the information you need is encapsulated in one number per
region of interest! But I don't understand what you mean by 2mFo-DFc and
mFo-DFc being counted each as one number. Surely you have one value of each of
these at every grid point, or at least one value per maximum in the case of,
say, an extended ligand?

Note that I'm not proposing anything new; this is all explained in standard
statistics textbooks (Kendall's Advanced Theory of Statistics by Stuart & Ord
is probably the best). In fact this is exactly my point: why re-invent the
wheel (and likely end up with a square one!) when the appropriate statistics
are all there in the textbooks, and have been for ~80 years?

> - why 2(mFo-DFc)?

Randy Read (Acta Cryst. 1986, A42, 140-149) showed that, for a partial
structure with errors, the expected values of the true Fs for the complete
structure (FN), for which Fo's have been obtained experimentally, and for the
partial structure model (FP), are respectively:

    FN = (2mFo - DFc) exp(i phi_calc)   for acentric reflections,
    FN = mFo exp(i phi_calc)            for centric reflections,
    FP = DFc exp(i phi_calc)            for both acentric and centric.

Hence the difference map coefficients DF = FN - FP are respectively:

    DF = 2(mFo - DFc) exp(i phi_calc)   for acentrics,
    DF = (mFo - DFc) exp(i phi_calc)    for centrics.

This is consistent with the observation that difference map peaks in
non-centrosymmetric structures appear at half the theoretical height (assuming
the phase errors are small), so you need to multiply the coefficients by 2 to
get the right value, whereas peaks in centrosymmetric structures appear at the
true height, so they don't need to be corrected. This has been known for a
long time (e.g. see Blundell & Johnson, 1976, p. 408).

> - how the "region of interest" is defined?

You define it! Exactly the same as you do for RSCC & RSR. Note that although
there is a significance test for the CC, none exists for the R-factor. The
basic problem with the R-factor is that it conflates two effects: because of
the sum over the data in the denominator, R is a function both of the absolute
values of the errors and of the values of the data relative to the errors, so
weak data always have a high R-factor. Hence a high value of the R-factor for
weak data really tells you nothing about the errors. This is apparent when you
look at Rmerge values for intensity data: the appropriate statistic which
quantifies the data quality in that case is not Rmerge at all, but the average
I/sigma(I).

> - how you compute sigma(rho)?

See my reply to George Sheldrick's post.

> By suggesting to use {map CC, 2mFo-DFc, mFo-DFc} I was assuming that:
> - map CC will tell you about similarities of shapes and it will not tell you
> about how strong the density is, indeed. So, using map CC alone is clearly
> insufficient. Also, we more or less have a feeling about the values, which
> is helpful.
> - 2mFo-DFc will tell you about the strength of the density. I mean, if you
> get 2.5 sigma at the center of atom A it's good (provided that map CC is
> good), and if it is 0.3 sigma you should get puzzled.
> - Having an excess of +/- mFo-DFc density will tell you something too.

The problem is how all this information can be quantified in an objective and
statistically justifiable way in order to arrive at a firm conclusion.

Cheers

-- Ian
```