In evaluating the prognostic properties of statistical models, there is considerable scope for variation in approach to splitting the dataset. The simplest, and most overoptimistic, method is resubstitution, whereby the same data are used for design and testing of models. The next simplest is to split the data into design (training) and test (hold-out) sets.
An immediate question for the latter set-up is: what is the "optimal" ratio of design to test set? Clearly, the larger the design set, the more accurate the estimates of the model's parameters; while the larger the test set, the more accurate the estimate of the model's accuracy. Thus, we have to compromise between these two opposing criteria. Arguments from symmetry would suggest a 50:50 split, but several papers I have seen split design:test as 80:20, without giving any (theoretical) justification for this choice.
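To make the trade-off concrete, here is a minimal sketch (not from the original post; the simulated data, seed, and function name are all hypothetical) that fits ordinary least squares on the design set and reports the mean squared error on the hold-out set for two split ratios:

```python
# Hypothetical illustration of the design/test trade-off:
# a larger design set gives better-estimated parameters,
# a larger test set gives a steadier error estimate.
import numpy as np

rng = np.random.default_rng(0)

def split_fit_test(X, y, design_frac):
    """Randomly split the data, fit ordinary least squares on the
    design (training) set, and return the mean squared prediction
    error on the hold-out (test) set."""
    n = len(y)
    idx = rng.permutation(n)
    n_design = int(design_frac * n)
    d, t = idx[:n_design], idx[n_design:]
    # Fit on the design set only.
    beta, *_ = np.linalg.lstsq(X[d], y[d], rcond=None)
    # Evaluate on the held-out test set.
    resid = y[t] - X[t] @ beta
    return float(np.mean(resid ** 2))

# Simulated data: y = 1 + 2*x + Gaussian noise (sd 0.5).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

for frac in (0.5, 0.8):   # 50:50 versus 80:20 design:test
    mse = split_fit_test(X, y, frac)
    print(f"{int(frac * 100)}:{int((1 - frac) * 100)} split -> test MSE {mse:.3f}")
```

Repeating such an experiment over many random splits (and split ratios) is one empirical way to see how the variance of the parameter estimates and the variance of the accuracy estimate pull in opposite directions, though it does not by itself supply the theoretical justification asked about above.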
Any references or comments would be most appreciated.
Stephan Rudolfer
--
Dr Stephan M Rudolfer
Honorary Research Fellow
in Biostatistics & Mathematical Statistics
Chairman, Manchester Group, Royal Statistical Society
Biostatistics Group, University of Manchester
Stopford Building, Oxford Road
MANCHESTER M13 9PT
Tel: +44 161 275 5054 Fax: +44 161 275 5205
Email: [log in to unmask]