We have had a similar discussion in PDB_REDO, which is frequently forced to assign a new R-free set because the input data doesn’t have one (this still happens with new PDB entries!). The ‘500/1000/1500/2000 reflections is enough’ school seems to look only at the variance of R-free across different choices of test set, which depends on the absolute number of reflections in the test set. But you also want a representative sample of reciprocal space, which depends on the fraction of reflections. In PDB_REDO we make a new test set if:
- The test set is smaller than 1% of the reflections
- The test set has fewer than 500 reflections AND is smaller than 10% of the reflections
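The two criteria above can be sketched as a small check. This is only an illustration, not the actual PDB_REDO code: the function name and arguments are made up here, and it assumes that either criterion on its own is enough to trigger a new test set.

```python
def needs_new_test_set(n_test: int, n_total: int) -> bool:
    """Decide whether an existing R-free test set should be replaced.

    n_test  -- number of reflections flagged as test set
    n_total -- total number of reflections in the data set

    Assumption: the two criteria are alternatives, so meeting
    either one is sufficient.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")

    fraction = n_test / n_total

    # Criterion 1: the test set is smaller than 1% of the reflections.
    below_one_percent = fraction < 0.01

    # Criterion 2: fewer than 500 test reflections AND less than 10%.
    few_and_small = n_test < 500 and fraction < 0.10

    return below_one_percent or few_and_small
```

For example, a 450-reflection test set is kept when it covers more than 10% of a small data set (`needs_new_test_set(450, 4000)` is false) but replaced when it covers only 5% (`needs_new_test_set(450, 9000)` is true).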
The new set is chosen as at least 5% of the possible reflections given the cell parameters and the resolution. If there are between 10000 and 20000 reflections, the percentage is increased so that the test set contains at least 1000 reflections. The maximum percentage is therefore 10%.
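As a rough sketch of how the percentage scales with data set size (again not the PDB_REDO implementation; the function name is hypothetical, and the behaviour below 10000 reflections is an assumption, namely that the percentage simply stays at the 10% maximum):

```python
def free_set_fraction(n_possible: int) -> float:
    """Fraction of the possible reflections to put in a new test set.

    n_possible -- number of possible reflections for the given cell
                  parameters and resolution

    Rule: at least 5%, raised so the set holds about 1000 reflections,
    capped at 10%. Assumption: below 10000 reflections the fraction
    stays at the 10% cap.
    """
    if n_possible <= 0:
        raise ValueError("n_possible must be positive")

    # Fraction needed to reach 1000 test reflections.
    fraction_for_1000 = 1000 / n_possible

    # Never below 5%, never above 10%.
    return min(0.10, max(0.05, fraction_for_1000))
```

With this rule, 40000 possible reflections give 5% (2000 test reflections), 12500 give 8% (1000 test reflections), and 10000 or fewer give the 10% maximum.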
Funny side note: the random number generator in freerflag was set up to always pick the same test set for a given resolution and set of cell parameters, which is useful if you misplace your test set. Unfortunately, we also had data sets from the PDB where the newly generated test set contained no observed reflections. Most of these data sets were close to 95% complete ;)