Hi Ian,

> This probably doesn't affect the MTZ file that comes out of it
> but it will affect the statistics of the twinning tests.

I'd be interested to see if this solves the cases where ctruncate goes into an infinite loop during the twinning tests.

Cheers,
Robbie

Date: Sat, 30 Nov 2013 15:37:44 +0000
Subject: [ccp4bb] TRUNCATE & CTRUNCATE issues.

Hello All, I'd like to get your views on some changes I'm proposing to make to the TRUNCATE source code. IMO there are some issues with the way TRUNCATE does its statistical analyses which need to be fixed. This probably doesn't affect the MTZ file that comes out of it but it will affect the statistics of the twinning tests. I've been meaning to do this for some time but it involves some fairly radical changes; now I've finally decided to bite the bullet.

One problem is that there seems to be some confusion in the source code comments concerning the meaning of what I call the "symmetry enhancement factor" (aka "epsilon" or "e" below). This is the point-group dependent factor by which the mean intensities of special rows and zones are enhanced by symmetry; for example in PG121 the mean I of the 0k0 reflections is enhanced by a factor of 2; in PG6 & PG622, it's a factor of 6 for the 00l reflections, and so on.

For space groups with screw axes (or glide planes in enantiomorphic SGs, i.e. the crystal contains both the asymmetric unit and its mirror image), the mean is multiplied by e only if the systematic absences are omitted from the sums; if you include them then there is no overall enhancement. So rotation and screw axes (or mirrors and glide planes) need to be treated differently, yet the CCP4 library code for this (s/rs EPSLN & EPSLON) treats them identically (more on this below). TRUNCATE calls epsilon alternately a "multiplicity" or a "weight", which adds further to the confusion since multiplicity (m) usually means something completely different: it's the number of times a symmetry-equivalent reflection occurs in a full hemisphere of data, so for PG222 axial reflections (h00 etc.) m=1, for zero layers (hk0 etc.) m=2, and for general hkl m=4; in contrast e = 2, 1 and 1 resp.
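To make the e versus m distinction concrete, here is a minimal Python sketch for PG222 only. It assumes that e can be counted as the number of point-group operations that map hkl onto itself, and that m is half the number of distinct equivalents once Friedel mates are included. The operator table and function names are hypothetical, not taken from the CCP4 library.

```python
# Point group 222 written as diagonal sign matrices acting on (h, k, l).
# Hypothetical hard-coded table, for illustration only.
PG222_OPS = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def apply_op(op, hkl):
    """Apply one diagonal symmetry operation to the indices."""
    return tuple(s * x for s, x in zip(op, hkl))


def epsilon(hkl, ops=PG222_OPS):
    """Symmetry enhancement factor: operations that leave hkl unchanged."""
    return sum(1 for op in ops if apply_op(op, hkl) == tuple(hkl))


def multiplicity(hkl, ops=PG222_OPS):
    """Distinct equivalents (Friedel mates included) in one hemisphere."""
    equivalents = set()
    for op in ops:
        image = apply_op(op, hkl)
        equivalents.add(image)
        equivalents.add(tuple(-x for x in image))  # Friedel mate
    return len(equivalents) // 2


for hkl in [(3, 0, 0), (3, 5, 0), (3, 5, 7)]:
    print(hkl, "e =", epsilon(hkl), "m =", multiplicity(hkl))
```

For the axial, zero-layer and general reflections in the loop this gives the PG222 values quoted above (e = 2, 1, 1 and m = 1, 2, 4).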

This snippet of code from TRUNCATE which accumulates sums in bins according to d*^2 illustrates the problem:

      CALL EPSLON(INHKL,WEIGHT,ISYSAB)
C
      FF(NT) = FF(NT) + F/WEIGHT
      SD(NT) = SD(NT) + SS/WEIGHT
      N(NT) = N(NT) + 1
      AMULT = NSYM/WEIGHT
C
...
C     Accumulate sums for Wilson plot.
C
      SN(NT) = SN(NT) + AMULT
      SW(NT) = SW(NT) + AMULT*FFSCAT
      SR(NT) = SR(NT) + AMULT*Q
      SI(NT) = SI(NT) + AMULT*F

Here WEIGHT = epsilon (the ISYSAB flag is ignored throughout), so all sums are being accumulated with the terms multiplied by 1/e (NSYM is the no. of asymmetric units so is constant and doesn't affect the results). However IMO this factor should be applied only to individual intensities or their SDs (note that in the above code and that below, F is actually the uncorrected intensity just to further confuse you!). The problem here is that for pure rotation axes the law of conservation of energy requires that the overall mean I is unchanged. In any interference phenomenon energy can neither be created nor destroyed, merely transferred from one place to another. So what happens is that the enhanced intensity of the axial reflections has been transferred from neighbouring reflections which have their intensities diminished in total by the same amount. In fact an oscillating Bessel function centred on the axis is superimposed on the intensities so you get cylindrical zones of alternately enhanced and diminished intensities, with the magnitude of the oscillations dying away as you go further from the axis. However the energy conservation law requires that the net overall average I must be unchanged by the presence of the axis.

This implies that the 1/e correction factor SHOULD NOT be included for pure rotation axes (and mirror planes) when summing for the mean I. However the sums should be performed over a complete hemisphere given only one symmetry equivalent per reflection, which means that the multiplicity (m) factor SHOULD be included. This is the direct opposite of what the code above is doing (i.e. it includes e but not m!). The Fortran code would look like:

C     Total energy is conserved!
      SI(NT) = SI(NT) + M*FI
C     Count reflections in hemisphere.
      SN(NT) = SN(NT) + M

Note that the statistically valid procedure will be different if one is say summing Is for a likelihood function, since this requires that the terms are statistically independent so one would then add only one term per equivalent, not a complete hemisphere, i.e. in that case the multiplicity factors should be omitted.

For screw axes (and glide planes) the situation is different: there is no Bessel function and only the axial reflections are affected, so it requires different handling in the code (as I said above, this is not happening!). Now, since systematic absences are normally not present in the data, the mean intensity of the remaining reflections is enhanced by the e factor, purely by the action of omitting the systematic absences of zero I. This implies that we need to simulate the presence of the systematic absences when taking the mean. So for example in the PG6 case we would have to sum the intensities of the 00l, l=6n reflections WITHOUT correction, but then count each 00l reflection as though it were 6 reflections (i.e. also counting the omitted sys. abs. in the average), so the code would now look like:

C     Total energy is still conserved!
      SI(NT) = SI(NT) + M*FI
C     but also count the sys. abs. that were omitted.
      SN(NT) = SN(NT) + M*E

This implies that at least for the reflection counts, the e correction factor SHOULD be included for screw axes (and glide planes), and for the same reason as above the m factor SHOULD also be included.
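The two accumulation rules for the mean I can be put side by side in a short Python sketch. The tuple layout, the function name and the toy numbers are all hypothetical. The `absences_omitted` flag marks reflections on a screw axis (or glide zone) whose systematic absences are missing from the data.

```python
def mean_intensity(reflections):
    """Mean I over a hemisphere following the two rules above.

    reflections: iterable of (intensity, m, e, absences_omitted), where
    m is the multiplicity and e the symmetry enhancement factor.
    """
    sum_i = 0.0
    sum_n = 0.0
    for intensity, m, e, absences_omitted in reflections:
        # Intensities are summed uncorrected, weighted by multiplicity.
        sum_i += m * intensity
        # Rotation axis / mirror: count m reflections.
        # Screw axis / glide: also count the omitted absences, i.e. m*e.
        sum_n += m * e if absences_omitted else m
    return sum_i / sum_n


# Toy PG6 data (made-up numbers): two 00l, l=6n reflections enhanced
# six-fold, plus ten general reflections of unit intensity.
toy = [(6.0, 1, 6, True)] * 2 + [(1.0, 6, 1, False)] * 10
print(mean_intensity(toy))
```

With these toy numbers the axial reflections no longer bias the average: both sums come to 72, so the mean is 1.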

The differences between rotation and screw axes (or between mirrors and glides) arise because Wilson's assumption of uniform random distribution of atoms breaks down in the former case: an atom cannot approach a rotation axis or mirror plane closer than its VDW radius, so this excluded zone along the axis or plane causes an interference effect. In both cases the main effect is actually that in projection along the axis the atomic positions are not random: they are correlated by an apparent inversion centre (it's actually another interference effect this time from pairs of atoms lying in the same plane of reflection and related by symmetry).

All the above is relevant ONLY to taking the average intensity. For other purposes the correct procedure is likely to differ. For example if one is interested in the individual normalised structure amplitudes for direct methods (i.e. not just the mean), then the sqrt(1/e) factor clearly SHOULD be applied to the individual amplitudes. Also if one is calculating higher moments of Z (= normalised I) then the deviations will not cancel (there's nothing in the energy conservation law that says that energy^n is conserved if n is not 0 or 1). In the rotation axis case each large on-axis positive deviation tends to be offset by several small off-axis deviations, so in that case the optimal procedure would appear to be to multiply the on-axis Is by m/e before summing for the higher moments (say n >= 2), e.g.:

      SM(N) = SM(N) + M*(FI/E)**N
      NM(N) = NM(N) + M

For the screw-axis case with the sys. abs. omitted the situation is again different and requires different code; IMO we should distribute an amount I/e from each axial reflection equally among the omitted sys. abs. and then include them as though they were all present:

      SM(N) = SM(N) + M*E*(FI/E)**N
      NM(N) = NM(N) + M*E
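As a rough illustration, the two higher-moment rules can be folded into one Python function. This is a sketch under the assumptions stated in the text (n >= 2, one symmetry equivalent per reflection). The tuple layout and names are hypothetical.

```python
def higher_moment(reflections, n):
    """n-th moment of I (n >= 2) with the weighting proposed above.

    reflections: iterable of (intensity, m, e, absences_omitted).
    Each intensity is divided by e; the weight is m for rotation axes
    and mirrors, and m*e where systematic absences were omitted.
    """
    total = 0.0
    count = 0.0
    for intensity, m, e, absences_omitted in reflections:
        weight = m * e if absences_omitted else m
        total += weight * (intensity / e) ** n
        count += weight
    return total / count


# Made-up example: one six-fold enhanced axial reflection and one
# general reflection, first treated as a rotation axis, then as a screw.
rotation = [(6.0, 1, 6, False), (1.0, 6, 1, False)]
screw = [(6.0, 1, 6, True), (1.0, 6, 1, False)]
print(higher_moment(rotation, 2), higher_moment(screw, 2))
```

The only difference between the two cases is the weight: the corrected value I/e is counted m times for a rotation axis and m*e times for a screw axis.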

One further issue needs to be addressed (TRUNCATE is not short of issues - the ones I mention here are only a fraction!). The code for calculating the moments in TRUNCATE is:

C---- Sums for moments of I
         F = FFA(1)/WEIGHT
         IF(F.GT.0.0) SMMEMM(1,NT,ICEN) = SMMEMM(1,NT,ICEN) + sqrt(F)
         IF(F.GT.0.0) SMMEMM(3,NT,ICEN) = SMMEMM(3,NT,ICEN) + F*sqrt(F)
         SMMIMM(1,NT,ICEN) = SMMIMM(1,NT,ICEN) + F
         SMMIMM(2,NT,ICEN) = SMMIMM(2,NT,ICEN) + F**2
         SMMIMM(3,NT,ICEN) = SMMIMM(3,NT,ICEN) + F**3
         SMMIMM(4,NT,ICEN) = SMMIMM(4,NT,ICEN) + F**4
         NMMNUM(NT,ICEN) = NMMNUM(NT,ICEN) + 1

Notwithstanding that the WEIGHT factor (aka epsilon) probably should not be applied to the lower moments, and the multiplicity factor probably should be applied everywhere, there is still a problem here. F is the uncorrected intensity but IMO we should be using the corrected intensity (i.e. by French & Wilson's (F & W's) Bayesian procedure): why else are we calculating the correction if not to apply it? The uncorrected intensities may be negative, which adds spurious noise to the moments (a negative intensity makes the same contribution to an even moment as an equal positive intensity: this cannot possibly be correct). However using the corrected Is brings another problem: when calculating the moments we should be using the expected values based on the posterior probability distribution of the Is. This means integrating the moments over all expected Is > 0, multiplied by the probability density. This is non-trivial; however, there is a way using the parabolic cylinder functions U(a,x) (http://en.wikipedia.org/wiki/Parabolic_cylinder_functions). The nice thing about this method is that one can generalise it to calculate the Bayesian-corrected F and I as well as arbitrary moments, and get rid of F & W's cubic spline interpolation code with the slightly worrying warning (see NEGIAS & NEGICS s/rs in TRUNCATE):

C---- Accuracy - better than 5 percent (i think).

With the PCFs the code becomes both much simpler and much more accurate (to REAL*4 precision, so ~ 0.0001% accuracy!).
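A minimal Python sketch of the PCF idea follows. It is not the code proposed for TRUNCATE, and it makes several assumptions: an acentric Wilson prior exp(-J/S)/S on the true intensity J (S being the mean intensity in the resolution bin), a Gaussian error model with standard deviation sigma, and integer moment orders only. Under those assumptions the posterior is a normal distribution truncated at J = 0, and E[J^n] = sigma^n n! U(n+1/2, x)/U(1/2, x) with x = (sigma^2/S - I)/sigma. The ratios are built up from U(-1/2, x) = exp(-x^2/4) and U(1/2, x) = sqrt(pi/2) exp(x^2/4) erfc(x/sqrt(2)) using the recurrence (a + 1/2) U(a+1, x) = U(a-1, x) - x U(a, x). All names are hypothetical.

```python
import math


def posterior_moments(i_obs, sigma, mean_i, max_order=4):
    """E[J^n] for n = 0..max_order under the assumptions listed above."""
    x = (sigma * sigma / mean_i - i_obs) / sigma
    # Work with ratios U(a, x) / U(1/2, x) so exp(x^2/4) factors cancel.
    u_prev = math.exp(-0.5 * x * x) / (
        math.sqrt(math.pi / 2.0) * math.erfc(x / math.sqrt(2.0))
    )  # U(-1/2, x) / U(1/2, x)
    u_curr = 1.0  # U(1/2, x) / U(1/2, x)
    moments = [1.0]
    a = 0.5
    for n in range(1, max_order + 1):
        # Recurrence: (a + 1/2) U(a+1, x) = U(a-1, x) - x U(a, x)
        u_next = (u_prev - x * u_curr) / (a + 0.5)
        moments.append(sigma ** n * math.factorial(n) * u_next)
        u_prev, u_curr = u_curr, u_next
        a += 1.0
    return moments


# A weak reflection with a negative measurement still gets a positive
# expected intensity (made-up numbers).
print(posterior_moments(-1.0, 1.0, 10.0, max_order=2))
```

This forward recurrence is only intended for moderate x. For large positive x erfc underflows and the recurrence loses accuracy, so production code would need a more careful U(a,x) routine. The expected amplitude E[sqrt(J)] corresponds to the half-integer order n = 1/2, which needs U(1, x) and therefore a general PCF implementation that this sketch does not provide.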

I should also say that I believe that CTRUNCATE is not immune from these issues: it seems to be a straight translation at least in part from Fortran to C++ (though I'm far from being a C++ expert so I'll leave fixing CTRUNCATE to those who know what they're doing!).

Sorry about the length of this: at least you can't say I didn't consult you!

Cheers

-- Ian