This is a nice paper and an interestingly different approach to
avoiding bias and/or quantifying errors - and indeed there are all
kinds of possibilities if you have a particular structure on which you
are prepared to spend unlimited time and resources.
The specific context in which Graeme's initial question led me to
query instead "who should set the FreeR flags, at what stage and on
what basis?" was that of the data analysis linked to high-throughput
fragment screening, in which speed is of the essence at every step.
Creating FreeR flags afresh for each target-fragment complex
dataset without any reference to those used in the refinement of the
apo structure is by no means an irrecoverable error, but it will take
extra computing time to let the refinement of the complex adjust to a
new free set, starting from a model refined with the ignored one. It
is in order to avoid the need for that extra time, or for a recourse
to various debiasing methods, that the "book-keeping faff" described
yesterday has been introduced. Operating without it is perfectly
feasible; it is just likely to be a less direct route.
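
To make the book-keeping concrete, here is a minimal sketch in Python of
carrying a free set over from the apo dataset to a complex dataset. It is
an illustration only, not the procedure used in any particular pipeline:
the function name inherit_free_flags, the dictionary representation and
the free_fraction default are all assumptions, and both datasets are
assumed to be indexed consistently and reduced to the same asymmetric unit.

```python
import random


def inherit_free_flags(reference_flags, new_indices, free_fraction=0.05, seed=0):
    """Return {hkl: is_free} for a new dataset, reusing reference flags.

    reference_flags: dict mapping (h, k, l) -> bool from the apo dataset.
    new_indices:     iterable of (h, k, l) present in the complex dataset.
    Reflections absent from the reference (e.g. beyond its resolution
    limit) get a fresh random flag at the assumed free_fraction.
    """
    rng = random.Random(seed)  # fixed seed keeps the extension reproducible
    flags = {}
    for hkl in new_indices:
        if hkl in reference_flags:
            # Same reflection as in the apo data: keep its free/work status.
            flags[hkl] = reference_flags[hkl]
        else:
            # New reflection: assign to the free set with the given probability.
            flags[hkl] = rng.random() < free_fraction
    return flags
```

Reflections shared with the apo data keep their status, so the complex
refinement never sees as "working" a reflection that the starting model
was kept blind to.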
I will probably bow out here, before someone asks "How many
[e-mails from me] is too many?" :-) .
With best wishes,
On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote:
> One more suggestion: you can avoid all the recipes by using all data for the WORK set and 0 reflections for the TEST set, regardless of the amount of data, by using the FREE KICK ML target. For an explanation see our recent paper: Praznikar, J. & Turk, D. (2014) Free kick instead of cross-validation in maximum-likelihood refinement of macromolecular crystal structures. Acta Cryst. D70, 3124-3134.
> You can find a link to the paper at “http://www-bmb.ijs.si/doc/references.HTML”
> > On Jun 5, 2015, at 1:03 AM, CCP4BB automatic digest system <[log in to unmask]> wrote:
> > Date: Thu, 4 Jun 2015 08:30:57 +0000
> > From: Graeme Winter <[log in to unmask]>
> > Subject: Re: How many is too many free reflections?
> > Hi Folks,
> > Many thanks for all of your comments - in keeping with the spirit of the BB
> > I have digested the responses below. Interestingly I suspect that the
> > responses to this question indicate the very wide range of resolution
> > limits of the data people work with!
> > Best wishes Graeme
> > ===================================
> > Proposal 1:
> > 10% reflections, max 2000
> > Proposal 2: from wiki:
> > http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set
> > including Randy Read "recipe":
> > So here's the recipe I would use, for what it's worth:
> > <10000 reflections: set aside 10%
> > 10000-20000 reflections: set aside 1000 reflections
> > 20000-40000 reflections: set aside 5%
> > >40000 reflections: set aside 2000 reflections
> > Proposal 3:
> > 5% maximum 2-5k
> > Proposal 4:
> > 3% minimum 1000
> > Proposal 5:
> > 5-10% of reflections, minimum 1000
> > Proposal 6:
> > > 50 reflections per "bin" in order to get reliable ML parameter
> > estimation, ideally around 150 / bin.
> > Proposal 7:
> > If lots of reflections (i.e. 800K unique) around 1% selected - 5% would be
> > 40k i.e. rather a lot. Referees question use of > 5k reflections as test
> > set.
> > Comment 1 in response to this:
> > Surely absolute # of test reflections is not relevant, percentage is.
> > ============================
> > Approximate consensus (i.e. what I will look at doing in xia2) - probably
> > follow Randy Read recipe from ccp4wiki as this seems to (probably) satisfy
> > most of the criteria raised by everyone else.
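
The Randy Read recipe quoted in Proposal 2 can be written as a small
function. This is a sketch in Python under stated assumptions, not code
from xia2 or any other package: the name free_set_size is made up for
illustration, and integer division is an arbitrary rounding choice.

```python
def free_set_size(n_reflections):
    """Number of test-set reflections under the recipe quoted above.

    The four ranges join continuously (10% of 10000 = 1000,
    5% of 20000 = 1000, 5% of 40000 = 2000), so it does not matter
    which side of each boundary is treated as inclusive.
    """
    if n_reflections < 10000:
        return n_reflections // 10   # set aside 10%
    if n_reflections <= 20000:
        return 1000                  # fixed 1000 reflections
    if n_reflections <= 40000:
        return n_reflections // 20   # set aside 5%
    return 2000                      # cap at 2000 reflections
```

For the 800K-reflection case mentioned in Proposal 7 this gives 2000 test
reflections, i.e. 0.25% of the data.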
> > On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter <[log in to unmask]>
> > wrote:
> >> Hi Folks
> >> Had a vague comment handed my way that "xia2 assigns too many free
> >> reflections" - I have a feeling that by default it makes a free set of 5%
> >> which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems
> >> excessive now.
> >> This was particularly in the case of high resolution data where you have a
> >> lot of reflections, so 5% could be several thousand which would be more
> >> than you need to just check Rfree seems OK.
> >> Since I really don't know what is the right # reflections to assign to a
> >> free set thought I would ask here - what do you think? Essentially I need
> >> to assign a minimum %age or minimum # - the lower of the two presumably?
> >> Any comments welcome!
> >> Thanks & best wishes Graeme
> Dr. Dusan Turk, Prof.
> Head of Structural Biology Group http://bio.ijs.si/sbl/
> Head of Centre for Protein and Structure Production
> Centre of excellence for Integrated Approaches in Chemistry and Biology of Proteins, Scientific Director
> Professor of Structural Biology at IPS "Jozef Stefan"
> e-mail: [log in to unmask]
> phone: +386 1 477 3857 Dept. of Biochem.& Mol.& Struct. Biol.
> fax: +386 1 477 3984 Jozef Stefan Institute
> Jamova 39, 1 000 Ljubljana,Slovenia
> Skype: dusan.turk (voice over internet: www.skype.com)