Hi Robbie,

Thank you for the explanation. Heinz Gut and Michael Hadders pointed me to Axel
Brunger's publication, Methods Enzymol. 1997;277:366-96
(http://www.ncbi.nlm.nih.gov/pubmed/18488318), which is where I got the notion of
500-1000 from. That article mentions a decrease of the error margin of Rfree
with n^(1/2) (p. 384), but only as an observation. Is your statement "inverse proportional
with the number of reflections" based on some statistical treatment, or also
just on observation?
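
To put rough numbers on the n^(1/2) behaviour, here is a minimal sketch. It assumes the error margin has the form Rfree / sqrt(n), with n the number of test-set reflections. That exact form, the function name and the example Rfree of 0.25 are illustrative assumptions, not something taken from the article or from a statistical derivation.

```python
import math


def rfree_error_margin(rfree, n_test):
    """Assumed model only: error margin ~ Rfree / sqrt(n_test)."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return rfree / math.sqrt(n_test)


if __name__ == "__main__":
    # Compare a few test-set sizes for a hypothetical Rfree of 0.25.
    for n in (500, 1000, 2000):
        print(n, rfree_error_margin(0.25, n))
```

Under this assumption, going from 500 to 2000 test reflections only halves the margin, whereas a strict 1/n dependence would reduce it fourfold.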

It is a pity that k-fold cross-validation is not standard routine, because it seems so
easy and quick to do with today's computers and a simple script. But
that's probably like reminding people not to use R_int anymore in favour of
R_meas...
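
As an illustration of how simple such a script could be, here is a minimal sketch of the bookkeeping only. It picks the test-set fraction following the rule Robbie describes in the quoted message (5%, raised to at most 10% if 5% gives fewer than 1000 reflections) and assigns each reflection to one of k test sets. All function names are hypothetical, reflections are represented by plain indices, and the refinement run for each fold is left out.

```python
import random


def choose_test_fraction(n_reflections, min_test=1000, base=0.05, cap=0.10):
    """5% by default; raised towards 10% if 5% gives fewer than min_test."""
    if n_reflections * base >= min_test:
        return base
    return min(cap, min_test / n_reflections)


def assign_free_flags(n_reflections, k, seed=0):
    """Give every reflection a flag 0..k-1; set sizes differ by at most one."""
    flags = [i % k for i in range(n_reflections)]
    random.Random(seed).shuffle(flags)
    return flags


def k_fold_splits(flags, k):
    """Yield (work, test) index lists, one pair per fold."""
    for fold in range(k):
        test = [i for i, f in enumerate(flags) if f == fold]
        work = [i for i, f in enumerate(flags) if f != fold]
        yield work, test


if __name__ == "__main__":
    n = 12000  # hypothetical number of unique reflections
    fraction = choose_test_fraction(n)
    k = round(1 / fraction)
    flags = assign_free_flags(n, k)
    for fold, (work, test) in enumerate(k_fold_splits(flags, k)):
        # A real script would run one refinement per fold here.
        print(fold, len(work), len(test))
```

Each reflection lands in exactly one test set, so every reflection is used for validation once across the k refinements.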

Cheers,
Tim

On Tue, Mar 26, 2013 at 10:24:51AM +0100, Robbie Joosten wrote:
> Hi Tim,
> 
> I don't think the 5-10% or 500-1000 reflections are real rules, but rather
> practical choices. The error margin in R-free is inverse proportional with
> the number of reflections in your test set and also proportional with R-free
> itself. So for R-free to be 'significant' you need some absolute number of
> reflections to reach your cut-off of significance. This is where the 1000
> comes from (500 is really pushing the limit). 
> You want to make sure the error margin in R and R-free are not too far apart
> and you probably also want to keep the test set representative of the whole
> data set (this is particularly important because we use hold-out validation,
> you only get one shot at validating). This is where the 5%-10% comes from.  
> Another consideration in favour of the 5%-10% choice is that it makes
> 'full' (i.e. k-fold) cross-validation feasible: you only have to do
> 20 (at 5%) to 10 (at 10%) refinements. If you went for 1000 reflections
> you would have to do 48 refinements for the average dataset.
> 
> Personally, I take 5% and increase this percentage to a maximum of 10% if
> using 5% gives me a test set smaller than 1000 reflections.
> 
> HTH,
> Robbie
> 
> > -----Original Message-----
> > From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of
> > Tim Gruene
> > Sent: Tuesday, March 26, 2013 09:33
> > To: [log in to unmask]
> > Subject: [ccp4bb] Rfree reflections
> > 
> > Dear all,
> > 
> > I recall that the set of Rfree reflections should be 500-1000, rather than
> > 5-10%, but I cannot find the reference for it (maybe Ian Tickle?).
> > 
> > I would therefore like to be confirmed or corrected:
> > 
> > Is there an absolute number required for Rfree to be significant, i.e.
> > 500-1000 irrespective of the total number of unique reflections in the data
> > set, or is it 5-10% (as a compromise)?
> > 
> > Thanks and regards,
> > Tim
> > 
> > --
> > --
> > Dr Tim Gruene
> > Institut fuer anorganische Chemie
> > Tammannstr. 4
> > D-37077 Goettingen
> > 
> > GPG Key ID = A46BEE1A
> 

-- 
--
Dr Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen

GPG Key ID = A46BEE1A