
On 1 September 2013 11:31, Frank von Delft <[log in to unmask]> wrote:

2.
I'm struck by how small the improvements in R/Rfree are in Diederichs & Karplus (Acta D 2013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/); the authors don't discuss it, but what's current thinking on how to estimate the expected variation in R/Rfree - does the Tickle formalism (1998) still apply for ML with very weak data?

Frank, our paper is still relevant, unfortunately just not to the question you're trying to answer!  We were trying to answer two questions: 1) what value of Rfree you would expect to get if the structure were free of systematic error and only random errors were present, so that it could be used as a baseline (assuming a fixed cross-validation test set) for identifying models with gross (e.g. chain-tracing) errors; and 2) how much you would expect Rfree to vary for a fixed starting model but with a different random sampling of the test set (i.e. the "sampling standard deviation").  The latter is relevant if, say, you want to compare the same structure (at the same resolution, obviously) solved independently in two labs, since it tells you how big the difference in Rfree for an arbitrary choice of test set needs to be before you can claim that it is statistically significant.
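
For concreteness, here is a minimal Python sketch (mine, not from the paper) of that second question: hold the model, i.e. Fcalc, fixed and redraw the test set at random many times to estimate the sampling spread of Rfree.  The array names, the 5% test fraction and the synthetic data are all assumptions for illustration; in a real refinement the model would of course also be re-refined against each new working set.

# Hypothetical sketch: estimate how much Rfree varies purely from the random
# choice of test set, for a fixed model.  'fobs' and 'fcalc' are assumed to be
# matched, already-scaled arrays of observed and calculated amplitudes; these
# names and the 5% test fraction are illustrative only.
import numpy as np

def r_factor(fo, fc):
    """Conventional R = sum|Fo - Fc| / sum Fo over the given reflections."""
    return np.abs(fo - fc).sum() / fo.sum()

def rfree_sampling_sd(fobs, fcalc, test_fraction=0.05, n_trials=1000, seed=0):
    """Redraw the test set many times; return mean and SD of Rfree."""
    rng = np.random.default_rng(seed)
    n = len(fobs)
    n_test = max(1, int(round(test_fraction * n)))
    rfree_values = []
    for _ in range(n_trials):
        test = rng.choice(n, size=n_test, replace=False)
        rfree_values.append(r_factor(fobs[test], fcalc[test]))
    rfree_values = np.asarray(rfree_values)
    return rfree_values.mean(), rfree_values.std(ddof=1)

if __name__ == "__main__":
    # Synthetic example: 20000 'reflections' with ~10% random error.
    rng = np.random.default_rng(1)
    fcalc = rng.gamma(2.0, 100.0, size=20000)
    fobs = fcalc * (1.0 + 0.10 * rng.standard_normal(20000))
    mean_rfree, sd_rfree = rfree_sampling_sd(fobs, fcalc)
    print(f"Rfree = {mean_rfree:.4f} +/- {sd_rfree:.4f} (sampling SD)")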

In this case the questions are different because you're certainly not comparing different models using the same test set, nor, I suspect, are you comparing the same model with different randomly selected test sets.  I assume that the test sets for the different resolution cut-offs are highly correlated, which I suspect makes it quite difficult to say what is a significant difference in Rfree (I have not attempted to do the algebra!).
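
To illustrate the correlation point, a toy sketch (assumptions only, not from any refinement program): with one fixed set of free-R flags, the test sets obtained at two different high-resolution cut-offs share every reflection in the common resolution range, so they are anything but independent samples.

# Hypothetical sketch: same free-R flags, two resolution cut-offs, and the
# resulting overlap between the two test sets.  Names, the 5% flag fraction
# and the resolution limits are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_refl = 50000
d_spacing = rng.uniform(1.5, 20.0, size=n_refl)   # resolution of each reflection (Angstrom)
test_flag = rng.random(n_refl) < 0.05             # one fixed set of free-R flags

def test_set(d_min):
    """Indices of test reflections surviving a high-resolution cut-off d_min."""
    return set(np.flatnonzero(test_flag & (d_spacing >= d_min)))

set_a = test_set(2.0)   # e.g. conservative cut-off
set_b = test_set(1.7)   # e.g. extended cut-off
shared = len(set_a & set_b)
print(f"test reflections at 2.0 A: {len(set_a)}")
print(f"test reflections at 1.7 A: {len(set_b)}")
print(f"shared: {shared} ({shared / len(set_b):.1%} of the larger set)")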

Rfree is one of a number of "model selection criteria" (see http://en.wikipedia.org/wiki/Model_selection#Criteria_for_model_selection) whose purpose is to provide a metric for comparison of different models given specific data, i.e. as with the likelihood function they all take the form f(model | data), so in all cases you're varying the model with fixed data.  Its use in the form f(data | model), i.e. where you're varying the data with a fixed model, is I would say somewhat questionable, and certainly requires careful analysis to determine whether the results are statistically significant.  Even assuming we can argue our way around the inappropriate application of model selection methodology to a different problem, unfortunately Rfree is far from an ideal criterion in this respect; a better one would surely be the free log-likelihood as originally proposed by Gerard Bricogne.
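
As a rough illustration of the contrast (a deliberately simplified sketch of mine, not Bricogne's actual proposal): Rfree is a ratio of absolute residuals over the test set, whereas a free log-likelihood is a sum of per-reflection log-probabilities under an explicit error model.  A constant-sigma Gaussian error model is used below purely for brevity; real macromolecular ML targets use Rice/Wilson distributions with refined error parameters such as sigmaA, and all names and numbers here are assumptions.

# Hypothetical sketch contrasting Rfree with a 'free log-likelihood' evaluated
# over the same test reflections, under a toy Gaussian error model.
import numpy as np

def rfree(fo_test, fc_test):
    """Rfree = sum|Fo - Fc| / sum Fo over the cross-validation set."""
    return np.abs(fo_test - fc_test).sum() / fo_test.sum()

def free_log_likelihood(fo_test, fc_test, sigma):
    """Sum of per-reflection Gaussian log-probabilities over the test set."""
    resid = fo_test - fc_test
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                        - 0.5 * (resid / sigma) ** 2))

# Toy comparison of two 'models' against the same synthetic test set:
rng = np.random.default_rng(2)
fo = rng.gamma(2.0, 100.0, size=2000)
model_good = fo * (1.0 + 0.08 * rng.standard_normal(2000))
model_poor = fo * (1.0 + 0.15 * rng.standard_normal(2000))
sigma_est = 0.1 * fo.mean()   # crude, constant error estimate
for name, fc in [("good", model_good), ("poor", model_poor)]:
    print(name, f"Rfree={rfree(fo, fc):.3f}",
          f"freeLL={free_log_likelihood(fo, fc, sigma_est):.1f}")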

Cheers

-- Ian