Hi Graeme,


We have had a similar discussion within PDB_REDO, which is frequently forced to assign a new R-free set when the input data doesn’t have one (this still happens with new PDB entries!). The ‘500/1000/1500/2000 reflections is enough’ school seems to look only at the variance of R-free across different choices of test set, which depends on the absolute number of reflections. You also want a representative sample of reciprocal space, which depends on the fraction of reflections. In PDB_REDO we make a new test set if:

- The test set is smaller than 1% of the reflections, or

- The test set has fewer than 500 reflections AND is smaller than 10% of the reflections.
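
As a rough illustration, the two criteria above can be written as a small check. This is only a sketch of the rule as stated in this message, not the actual PDB_REDO code; the function name and arguments are made up for the example.

```python
def needs_new_test_set(n_test: int, n_total: int) -> bool:
    """Return True if the existing test set should be replaced.

    n_test  -- number of reflections flagged as free (0 if there is no set)
    n_total -- total number of reflections in the data set
    Both names are hypothetical; they are not PDB_REDO identifiers.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")

    fraction = n_test / n_total

    # Criterion 1: the set is too small a fraction to sample reciprocal space.
    if fraction < 0.01:
        return True

    # Criterion 2: too few reflections in absolute terms, unless the set
    # already covers at least 10% of the data.
    if n_test < 500 and fraction < 0.10:
        return True

    return False
```

For example, a set of 400 free reflections out of 5000 (8%) would be replaced, while 400 out of 3000 (about 13%) would be kept.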


The new set is chosen as at least 5% of the possible reflections given the cell parameters and the resolution. If there are between 10000 and 20000 reflections, the percentage is increased to get at least 1000 reflections in the test set, so the maximum percentage is 10%.
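
In other words, the fraction works out roughly as in the sketch below. Again this is an illustration of the rule as described here, not the PDB_REDO implementation; the function name is invented, and capping at 10% for fewer than 10000 possible reflections is my reading of "the maximum percentage is 10%", so treat that branch as an assumption.

```python
def free_fraction(n_possible: int) -> float:
    """Fraction of reflections to put in a newly generated test set.

    n_possible -- number of possible reflections for the given cell
                  parameters and resolution (hypothetical argument name)
    """
    if n_possible <= 0:
        raise ValueError("n_possible must be positive")

    # Fraction that would give exactly 1000 test reflections.
    target = 1000 / n_possible

    # Never below 5%; never above 10% (assumed cap for small data sets).
    return min(0.10, max(0.05, target))
```

With 50000 possible reflections this gives 5% (2500 reflections), with 12500 it gives 8% (1000 reflections), and with 5000 it stays at the 10% cap (500 reflections).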


Funny side note: The random number generator in freerflag was set up to always pick the same test set for a given resolution and cell parameters, which is useful if you misplace your test set. Unfortunately, we also had data sets from the PDB where the newly generated test set had no observed reflections. Most of these data sets were close to 95% complete ;)





From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of Graeme Winter
Sent: Tuesday, June 2, 2015 12:27
To: [log in to unmask]
Subject: [ccp4bb] How many is too many free reflections?


Hi Folks


Had a vague comment handed my way that "xia2 assigns too many free reflections" - I have a feeling that by default it makes a free set of 5% which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems excessive now.


This was particularly in the case of high resolution data where you have a lot of reflections, so 5% could be several thousand which would be more than you need to just check Rfree seems OK.


Since I really don't know what is the right # reflections to assign to a free set thought I would ask here - what do you think? Essentially I need to assign a minimum %age or minimum # - the lower of the two presumably?


Any comments welcome!


Thanks & best wishes Graeme