I would mostly agree with what Dale said, and point out that it applies as
well to the SigmaA estimation that is a necessary part of ML refinement.
When we were developing the ML targets that went into CNS, we did a number
of tests to see how many cross-validation reflections were needed. The
fewest we could get away with, for relatively poor starting models, was
about 500-1000. I would only recommend as few as 500 if you have a small
cell or low resolution (where you might only have 5000 reflections in
total) and can't afford to give up more. If I had a large cell or high
resolution, I would probably prefer to take up to 2000 reflections for
cross-validation, because the precision of the SigmaA estimates would be
improved at little cost. But there's certainly no need to take a fixed
proportion of the data regardless of the total number of reflections.
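
To make that concrete, here is a minimal Python sketch of how one might turn
that advice into a helper. The function name and the 10%-of-total
interpolation between the floor and the ceiling are assumptions added purely
for illustration; they are not a prescription from any refinement program.

def suggest_free_set_size(n_total, floor=500, ceiling=2000):
    # Illustrative heuristic only: pick an absolute number of
    # cross-validation reflections rather than a fixed percentage.
    # The 10%-of-total scaling between floor and ceiling is an
    # assumption for this example.
    return int(min(ceiling, max(floor, 0.10 * n_total)))

# e.g. suggest_free_set_size(5000)  -> 500   (small cell / low resolution)
#      suggest_free_set_size(50000) -> 2000  (large cell / high resolution)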
The precision of the likelihood-based estimates for SigmaA depends not only
on the number of reflections but also on the quality of the model. As the
model gets better and the true SigmaA values increase, the estimates of
SigmaA become more precise. So one could probably afford to reduce the size
of the cross-validation set towards the end of refinement.
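
A small numerical experiment illustrates both effects. The sketch below
assumes the standard acentric distribution of a normalized observed amplitude
given the calculated one (the Srinivasan/Read form), with a single overall
SigmaA rather than the per-resolution-bin estimates a real program would use.
It simulates cross-validation sets and looks at how the spread of the
maximum-likelihood SigmaA estimate changes with the size of the set and with
the true SigmaA; the function names and the 50 repeat trials are just for
this illustration.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import i0e

def neg_log_likelihood(sigma_a, e_obs, e_calc):
    # Acentric distribution of a normalized observed amplitude E_obs
    # given E_calc and a single sigmaA value:
    #   P(E_obs|E_calc) = (2 E_obs / (1 - sigmaA^2))
    #       * exp(-(E_obs^2 + sigmaA^2 E_calc^2) / (1 - sigmaA^2))
    #       * I0(2 sigmaA E_obs E_calc / (1 - sigmaA^2))
    eps = 1.0 - sigma_a ** 2
    arg = 2.0 * sigma_a * e_obs * e_calc / eps
    log_i0 = np.log(i0e(arg)) + arg          # numerically stable log I0
    log_p = (np.log(2.0 * e_obs / eps)
             - (e_obs ** 2 + sigma_a ** 2 * e_calc ** 2) / eps
             + log_i0)
    return -np.sum(log_p)

def estimate_sigma_a(e_obs, e_calc):
    # Maximum-likelihood sigmaA from one cross-validation set
    result = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 0.999),
                             args=(e_obs, e_calc), method="bounded")
    return result.x

def simulate_free_set(n_free, true_sigma_a, rng):
    # Simulate normalized acentric E_calc (Wilson/Rayleigh distributed)
    # and the corresponding E_obs for a model of quality true_sigma_a
    e_calc = rng.rayleigh(scale=np.sqrt(0.5), size=n_free)
    sd = np.sqrt((1.0 - true_sigma_a ** 2) / 2.0)
    re = true_sigma_a * e_calc + rng.normal(0.0, sd, n_free)
    im = rng.normal(0.0, sd, n_free)
    return np.hypot(re, im), e_calc

rng = np.random.default_rng(0)
for true_sa in (0.5, 0.9):            # poor model vs good model
    for n_free in (500, 2000):        # small vs generous free set
        est = [estimate_sigma_a(*simulate_free_set(n_free, true_sa, rng))
               for _ in range(50)]
        print(f"true sigmaA {true_sa}, {n_free} free reflections: "
              f"sd of estimate {np.std(est):.4f}")

Running it with a larger free set or with a higher true SigmaA should give a
tighter spread of estimates, which is the point made above. I would stress
that this is a toy, single-bin calculation, not how CNS or any other program
actually parameterizes SigmaA.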
That brings me to one of the things Dale said, namely that his tests showed
no correlation between the precision of Rfree and the true Rfree. He
qualified this by saying that his models only ranged from 35% to 55%; if he
had looked at a wider range, I think he would have found a strong
correlation. I believe that Ian Tickle showed this in a paper he published
a few years ago.
Randy Read