I agree absolutely with James - be as succinct as you like in a table
but include the verbose definition for each entry in the log file - or
at the very least in the manual. It should be easy to search for with
the table tag.
People will not go and read a reference.
Eleanor
James Holton wrote:
> I think the best way to deal with issues like this can be found in
> Strunk & White "The Elements of Style" (1918). Among other things,
> these authors put forward a rather simple yet often overlooked rule to
> writing in general, which I think applies equally well to computer
> programs:
>
> "Be clear."
>
> The sentence itself is an example of how brevity need not sacrifice
> clarity. Yes, you need labels in the table itself to be short, but
> there is space immediately below (and above) every table that (IMHO)
> ought to contain the definitions of each and every
> variable/abbreviation used in the table, spelled out no matter how
> obvious it may seem to the author. I can tell you many long and
> painful stories about me trying to figure out what some variable in
> some equation in some paper actually meant! Context is everything.
> If you are tight for space, cite a reference (such as the manual).
>
> That, and scientists talking about such quantities in email, papers,
> etc. (such as myself) should also heed Strunk and White and also not
> just assume that everyone knows exactly what "structure factor" means
> as opposed to a "structure amplitude", let alone I/sigma. Indeed, the
> word "intensity" is an incredibly ill-defined quantity all by itself, to
> the point of being useless. It can have units of photons, photons/s,
> photons/area/s, photons/area, energy/volume, and many many more.
> Often even in the same equation!
>
> I would strongly advise against changing the "variable names" printed
> out in log files by SCALA and other programs, especially when a given
> name has persisted for a decade or more. Adding an "inline
> definition" is fine, but changing names not only breaks programs that
> were written to read these logs (and sometimes even humans reading the
> log), but it also confines the meaning of "I/SIGMA from SCALA" to a
> particular period in history.
>
> So, what statistic do we want to look at? That depends on what you
> are trying to do with the data. There is no way for Phil to know
> this, so it is good that he prints out lots of different statistics.
> That said, when talking about the data quality requirements for
> structure solution by MAD/SAD, I suggest looking at I/sigma(I) where:
> I - merged intensity (proportional to photons) assigned to a
> reciprocal lattice point (hkl index)
> sigma(I) - the error assigned to I
>
> Exactly what I/sigma(I) is required to solve a structure, or make some
> conclusion about a solved structure is a topic for another day.
>
> -James Holton
> MAD Scientist
>
>
> Phil Evans wrote:
>> “I/sigma” statistics seem to be contentious & confusing (see recent
>> discussions on CCP4BB), particularly in what the various measures
>> should be called (and how they should be labelled in a table, where
>> there is only room for a very short name). I thought it worth
>> commenting on this issue at a little more length.
>>
>> There are several interacting issues:
>>
>> 1) Statistics can be calculated either for individual observations
>> Ihl or for intensities averaged over multiple (symmetry-related or
>> replicate) measurements Ih(avg): both are useful, but they need to be
>> distinguished.
>>
>> 2) The statistic can be (a) the ratio of means <I>/<sigma> or (b) the
>> mean of ratios <I/sigma>. These are not the same.
>>
>> 3) The “sigma” used in 2(a) can be either (a) the estimated corrected
>> SD or (b) the RMS scatter of observations, i.e. the RMS deviation (which
>> is itself generally used to estimate a “correction” to the SD). The
>> RMS scatter cannot be used for 2(b) of course, since that needs
>> individual sigmas for each reflection.
>>
>> 4) Values will depend on how many outliers have been rejected.
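
The distinction in point 2 can be illustrated with a minimal sketch. The intensities and sigmas below are made-up numbers, and the function names are hypothetical, chosen only for this illustration:

```python
from statistics import mean

# Made-up (I, sigma) values for four reflections, for illustration only.
intensities = [100.0, 50.0, 10.0, 2.0]
sigmas = [5.0, 5.0, 4.0, 2.0]


def ratio_of_means(i_values, sig_values):
    """<I>/<sigma>: average each quantity first, then divide."""
    return mean(i_values) / mean(sig_values)


def mean_of_ratios(i_values, sig_values):
    """<I/sigma>: divide reflection by reflection, then average."""
    return mean([i / s for i, s in zip(i_values, sig_values)])


print("<I>/<sigma> =", ratio_of_means(intensities, sigmas))
print("<I/sigma>   =", mean_of_ratios(intensities, sigmas))
```

With these numbers the two statistics differ noticeably, because the ratio of means is dominated by the strong reflections while the mean of ratios gives every reflection equal weight.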
>>
>> For what it’s worth, Scala outputs two such statistics:-
>>
>> (i) “I/sigma”: this is calculated for individual observations Ihl and
>> is the (mean intensity <Ihl>)/(RMS scatter of Ihl). RMS scatter = RMS
>> [Ihl – Ih(avg)]. This is some measure of the average significance of
>> individual observations, but does not take into account multiplicity.
>> In my new program under development (a Scala replacement) I have
>> relabelled this column “I/RMS” but I don’t really know what best to
>> call it. This value is a ratio of means (see 2(a) above).
>>
>> (ii) “Mn(I/sd)”: this is the mean value of (Ih(avg)/sd(Ih(avg))),
>> where Ih(avg) is the (weighted) average over all observations for
>> reflection h, and sd(Ih(avg)) is the estimated SD of this average,
>> after any “corrections” have been applied. This is, I think, the best
>> estimate of “signal-to-noise ratio”, but does depend on realistic
>> estimates of sd(Ih(avg)), which is not entirely straightforward (and
>> certainly doesn’t allow for systematic errors!). This value is a mean
>> of ratios (see 2(b) above).
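
A rough sketch of how statistics (i) and (ii) could be computed from grouped observations follows. It is not Scala's code. It assumes inverse-variance weights (1/sd**2) for Ih(avg), sd(Ih(avg)) = 1/sqrt(sum of weights), and an RMS scatter that divides by the number of observations; the data and function names are hypothetical.

```python
import math

# Hypothetical data: reflection index -> list of (Ihl, sd(Ihl)) observations.
observations = {
    (1, 0, 0): [(100.0, 10.0), (110.0, 10.0), (90.0, 10.0), (100.0, 10.0)],
    (2, 0, 0): [(20.0, 5.0), (30.0, 5.0)],
}


def weighted_mean_and_sd(obs):
    """Return Ih(avg) and sd(Ih(avg)), assuming weights of 1/sd**2."""
    weights = [1.0 / sd ** 2 for _, sd in obs]
    total = sum(weights)
    i_avg = sum(w * i for w, (i, _) in zip(weights, obs)) / total
    return i_avg, 1.0 / math.sqrt(total)


def i_over_rms(data):
    """Statistic (i): <Ihl> / RMS[Ihl - Ih(avg)] over all observations."""
    all_i, sq_dev = [], []
    for obs in data.values():
        i_avg, _ = weighted_mean_and_sd(obs)
        for i_hl, _ in obs:
            all_i.append(i_hl)
            sq_dev.append((i_hl - i_avg) ** 2)
    rms_scatter = math.sqrt(sum(sq_dev) / len(sq_dev))
    return (sum(all_i) / len(all_i)) / rms_scatter


def mn_i_over_sd(data):
    """Statistic (ii): mean over reflections of Ih(avg)/sd(Ih(avg))."""
    ratios = []
    for obs in data.values():
        i_avg, sd_avg = weighted_mean_and_sd(obs)
        ratios.append(i_avg / sd_avg)
    return sum(ratios) / len(ratios)


print("I/RMS    =", round(i_over_rms(observations), 2))
print("Mn(I/sd) =", round(mn_i_over_sd(observations), 2))
```

Under these assumptions, adding further observations of a reflection shrinks sd(Ih(avg)) and so raises Mn(I/sd), whereas I/RMS stays roughly the same. This is the sense in which (i) does not take multiplicity into account.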
>>
>>
>>
>> The “corrected” sd(Ihl) is calculated in Scala for each observation as
>> sd(Ihl)corrected = SdFac * sqrt{sd(I)**2 + SdB*Ihl*LP +
>> (SdAdd*Ihl)**2}
>> with the parameters SdFac, SdB & SdAdd determined by trying to make
>> the RMS of the normalised deviation Delta(hl) = (Ihl -
>> Ih(avg))/sd(Ihl)corrected equal to 1.0 for all intensity ranges (different
>> parameters for each run). If the sd estimates are correct, then the
>> distribution of Delta(hl) should have SD = 1.0, and this “correction”
>> tries to enforce this. This is more or less equivalent to making the
>> RMS scatter == average SD. However the uncertainties in how best to
>> estimate the real error do then influence the reliability of the
>> Mn(I/sd) statistic (see (ii) above).
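
The correction formula and the normalised deviation can be written out as a short sketch. The formula is transcribed from the text above; the function names, argument names and example numbers are hypothetical, and the procedure Scala uses to refine SdFac, SdB and SdAdd is not shown.

```python
import math


def corrected_sd(sd_i, i_hl, lp, sd_fac, sd_b, sd_add):
    """SdFac * sqrt(sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2)."""
    variance = sd_i ** 2 + sd_b * i_hl * lp + (sd_add * i_hl) ** 2
    return sd_fac * math.sqrt(variance)


def rms_normalised_deviation(obs, i_avg, sd_fac, sd_b, sd_add):
    """RMS of Delta(hl) = (Ihl - Ih(avg)) / sd(Ihl)corrected.

    obs is a list of (Ihl, sd(I), LP) tuples for one reflection.
    """
    deltas = [
        (i_hl - i_avg) / corrected_sd(sd_i, i_hl, lp, sd_fac, sd_b, sd_add)
        for i_hl, sd_i, lp in obs
    ]
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))


# Made-up observations whose scatter (+/-4) exceeds the raw sd (3).
obs = [(104.0, 3.0, 1.0), (96.0, 3.0, 1.0)]
print(rms_normalised_deviation(obs, 100.0, sd_fac=1.0, sd_b=0.0, sd_add=0.0))
print(rms_normalised_deviation(obs, 100.0, sd_fac=4.0 / 3.0, sd_b=0.0, sd_add=0.0))
```

In this toy case the uncorrected sds give an RMS Delta above 1, meaning the errors are underestimated, and inflating them with SdFac = 4/3 brings the RMS Delta back to 1.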
>>
>> So what statistics do we want to look at? Probably the main reason
>> for looking at signal/noise statistics is to choose a “real
>> resolution” cutoff, from some sort of signal/noise ratio. It isn’t
>> clear (to me) what is the best way of doing this, and it is
>> particularly difficult if the data are significantly anisotropic. The
>> multiplicity needs to be taken into account, so the individual
>> “I/sigma” (see (i) above) isn’t the best guide. Personally, I
>> generally cut data at around the point where Mn(I/sd) =~ 2, but I
>> would cut off at <2 for anisotropic data. I also find a useful guide
>> from the correlation coefficient between Ih(avg) (Imean) pairs in
>> half-datasets (plotted by Scala): the CC should be >0.5 at least, I
>> think.
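
The rule of thumb above (Mn(I/sd) of about 2, half-dataset CC above 0.5) could be applied to a per-shell table roughly as follows. This is only a sketch: the shell values are invented, the function names are hypothetical, and the thresholds are the personal guideline from the message rather than any standard.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def resolution_cutoff(shells, min_mn_i_sd=2.0, min_cc=0.5):
    """Return d_min of the last shell passing both tests, or None.

    shells: list of (d_min, Mn(I/sd), half-dataset CC), ordered from
    low to high resolution. The scan stops at the first failing shell.
    """
    cutoff = None
    for d_min, mn_i_sd, half_cc in shells:
        if mn_i_sd < min_mn_i_sd or half_cc < min_cc:
            break
        cutoff = d_min
    return cutoff


# Invented per-shell statistics: (d_min in Angstrom, Mn(I/sd), CC).
shells = [(4.0, 25.0, 0.99), (3.0, 10.0, 0.95), (2.5, 3.1, 0.80), (2.2, 1.4, 0.45)]
print("suggested cutoff:", resolution_cutoff(shells))
```

Here pearson_cc would be applied to the paired Imean values from the two half-datasets within each shell to produce the CC column. For anisotropic data a lower Mn(I/sd) threshold can be passed in, in line with the comment above.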
>>
>> Note that the overall value of any of these statistics over all
>> resolution ranges is not very useful and can be confusing, depending
>> on the distribution of intensities, since it mixes up strong low
>> resolution data (high signal/noise) with weak high resolution data
>> (low signal/noise).
>>
>> That leaves the question of how to label these statistics in a
>> consistent, clear and concise way: suggestions?
>>
>> Phil Evans
>
>
>