The question of oversampling the less frequent class in a regression with a
two-class response has arisen here recently. I'd like to know what people think
about this and if there is useful literature on it. I will send a summary of
responses to the list.
Let me explain a bit better what I mean:
We want to build a model for prediction. We have many cases (>10,000) and
many predictor variables (to start with at least). The proportion of cases
in the less frequent class of the response variable is relatively low,
usually less than 20% and sometimes less than 10%.
Two alternative approaches to selecting the data for modelling have been
suggested:
1. Take all cases or a large sample where the proportions in the two
response classes are the same as in the population.
2. Take a sample where the proportions in the two response classes are
50/50. This would be used only for model estimation; testing would be done
using a representative sample.
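
To make the second alternative concrete, here is a minimal sketch of one way the two samples could be drawn. It is an illustration under assumptions, not a recommended procedure: the function name split_and_balance, the 0/1 coding with 1 as the rare class, the 30% test fraction, and the example data are all hypothetical. It also reaches 50/50 by downsampling the common class rather than by duplicating rare cases, which is only one way to obtain a balanced estimation sample.

```python
import random

def split_and_balance(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx): a 50/50 estimation sample and a
    representative test sample, given a list of 0/1 labels (1 = rare class)."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # Hold out a simple random (hence representative) test sample first.
    n_test = int(len(indices) * test_fraction)
    test_idx = indices[:n_test]
    pool = indices[n_test:]

    # Split the remaining cases by response class.
    rare = [i for i in pool if labels[i] == 1]
    common = [i for i in pool if labels[i] == 0]

    # Take equal numbers from each class, so the estimation sample is 50/50.
    n = min(len(rare), len(common))
    train_idx = rng.sample(rare, n) + rng.sample(common, n)
    rng.shuffle(train_idx)
    return train_idx, test_idx

# Hypothetical data: 10,000 cases with 10% in the rare class.
labels = [1] * 1000 + [0] * 9000
train_idx, test_idx = split_and_balance(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Here train_rate is the rare-class proportion in the estimation sample and test_rate is the proportion in the held-out sample, which should stay close to the population value.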
So the questions are:
What is the effect on the model parameters and predictive ability of
choosing the second alternative if we use a regression technique like
logistic regression?
What is the effect if we use an Artificial Neural Net algorithm?
What is the effect if we use a Decision Tree?
Regards
Willo Roe