Tim,

Overfitting has nothing to do with whether or not the refinement is at the
global (or local) minimum.  You can predict how much a model will be
overfitted before you even start the refinement, because it depends
entirely on choices you made, or that were forced on you, right at the
beginning, i.e. the observation/effective-parameter ratio and the values
of the weighting parameters.  There will always be some degree of
overfitting, as evidenced by the ratio Rfree/Rwork at convergence being
> 1, and in simple cases (no NCS), provided the refinement has converged,
this degree of overfitting is quite predictable (with some statistical
sampling error, of course) knowing only the number of observations,
variable parameters and restraints, and the weights (it does assume,
however, that the weights used are a true measure of the errors).
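To make that concrete, here is a minimal sketch of the kind of prediction I mean, using the classical unrestrained least-squares approximation that the expected ratio at convergence is roughly sqrt((n + p) / (n - p)) for n observations and p effective parameters.  This ignores restraints and weighting entirely, so treat it as an illustrative toy, not the full analysis:

```python
import math

def expected_rfree_rwork_ratio(n_obs, n_params):
    """Rough expected Rfree/Rwork at convergence under the classical
    unrestrained least-squares approximation sqrt((n + p) / (n - p)).
    Restraints effectively reduce p, so this is an upper-bound toy."""
    if n_obs <= n_params:
        raise ValueError("need more observations than effective parameters")
    return math.sqrt((n_obs + n_params) / (n_obs - n_params))

# Illustrative numbers only: 40000 reflections, 10000 effective parameters.
ratio = expected_rfree_rwork_ratio(40000, 10000)
print(round(ratio, 3))  # -> 1.291, i.e. Rfree expected to exceed Rwork
```

Note the ratio approaches 1 only as n_obs becomes very large relative to n_params, which is the "infinite or error-free data" limit mentioned below.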

The only situation in which there is truly zero overfitting is where you
have either infinite or error-free data, neither of which is, of course,
experimentally realisable: only in that case are the expectations of Rwork
and Rfree exactly equal.  Also, overfitting is obviously relative, not
absolute: since _all_ models are overfitted by definition (i.e. based on
finite data with errors), all you can say is that one model is more or
less overfitted than another, depending on its obs/param ratio.  This
implies there is absolutely no guarantee that someone won't come along
after you have deposited your data, re-refine your model using your data
but with a different choice of variable parameters and/or weights, and
come up with a new model that both agrees better with your data and is
less overfitted (ask Robbie Joosten!).

Just because a refinement is not at the global (or local) minimum doesn't
mean that it's any less overfitted than one that is at the minimum.  It
just means that the model is in poorer agreement with the data.  The clue
is in the name 'maximum likelihood refinement'.

Let's assume for a moment that you are right and optimisations should be
stopped before the convergence-based stopping rules are satisfied (by the
way, you won't find that idea implemented in any of the standard
optimisation packages, for good reason!).  In that situation, what form
would the new stopping rule take?  It can't be based on Rfree, because a
fundamental principle of cross-validation is that the test set should be
'locked away' and not used in any way to guide the optimisation;
otherwise it effectively becomes part of the working set and you defeat
the whole point of having an independent test set.
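The 'locked away' discipline is easy to state in code: the free set is chosen once, the refinement target is built only from the working set, and the free set is consulted only for reporting.  A minimal sketch (names and the 5% fraction are illustrative, not any particular program's convention):

```python
import random

def assign_free_flags(n_refl, free_fraction=0.05, seed=42):
    """Partition reflection indices into a working set and a 'locked
    away' free (test) set.  The free set must never contribute to the
    refinement target, only to Rfree reporting afterwards."""
    rng = random.Random(seed)
    free = {i for i in range(n_refl) if rng.random() < free_fraction}
    work = set(range(n_refl)) - free
    return work, free

def r_factor(f_obs, f_calc, subset):
    """Crystallographic R = sum|Fo - Fc| / sum|Fo| over one subset;
    called with `work` it gives Rwork, with `free` it gives Rfree."""
    num = sum(abs(f_obs[i] - f_calc[i]) for i in subset)
    den = sum(abs(f_obs[i]) for i in subset)
    return num / den

work, free = assign_free_flags(1000)
# The refinement target would be built from `work` only; `free` is
# consulted only when reporting Rfree, never to steer the optimisation.
assert work.isdisjoint(free)
```

The moment a stopping rule peeks at the free-set residual, the assertion above is still true of the index sets, but the statistical independence it is supposed to guarantee is gone.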

Cheers

-- Ian


On 28 November 2014 at 08:24, Tim Gruene <[log in to unmask]> wrote:

>
> Dear Jacob,
>
> you don't necessarily want to find the global minimum - the global
> minimum of the target function might be an overfitted set of parameters.
> At low resolution the (local) minimum you do want to reach may not be
> very sharp. This is best illustrated by shelxl that prints the maximum
> shift each cycle, and with poor data to parameter ratios and poor data
> quality you often find individual atoms moving randomly around (by
> e.g. 0.02A).
>
> Cheers,
> Tim
>
> On 11/27/2014 11:14 PM, Keller, Jacob wrote:
> >> We just had a chance to read this most interesting discussion. We
> >> would agree
> > with Ian that jiggling or SA refinement may not be needed if
> > refinement can in fact be run to convergence. However, this will be
> > difficult to achieve for large structures, especially when only
> > moderate to low resolution data are available.
> >
> > I find this interesting—is refinement convergence related to
> > resolution? Is this because the structure-landscape is not
> > sufficiently defined to find the real global minimum? I wonder what
> > would happen if Ian Tickle’s test were done on many structures, and
> > results examined as a function of resolution? Predictions? I guess
> > generally there are more ways to fit fewer data points than many,
> > but then perhaps refinement convergence would be more dependent on
> > parameter:observation ratios than resolution per se, although the
> > two quantities are closely related, all things being equal.
> >
> > JPK
> >
>
> --
> Dr Tim Gruene
> Institut fuer anorganische Chemie
> Tammannstr. 4
> D-37077 Goettingen
>
> GPG Key ID = A46BEE1A
>
>