This is more a problem of an "imbalanced training set" (search for this
term to find scientific articles) than of a small sample.
Here are some possible solutions:
1) Subsample from the more abundant class. Easiest to implement.
2) Weight samples from the rare class more highly. I believe most
software has a weighting option for logistic regression.
3) Use a cost function that penalizes incorrect classification of
samples from the rare class more.
They all involve some arbitrary choice (proportion to subsample, weight,
cost function).
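
A minimal sketch of options 1 and 2, using the class counts from the
question (128 responses, 5,525 non-responses). The function names, the
10% target rate and the "balanced" weighting rule are illustrative
assumptions, not a recommendation; they are exactly the arbitrary
choices mentioned above.

```python
import random


def undersample(labels, target_rate, seed=0):
    """Return sorted row indices keeping every rare-class row (label 1)
    and a random subset of the abundant class (label 0), so that the
    rare class makes up roughly `target_rate` of the kept rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of abundant-class rows needed to hit the target rate.
    n_neg = round(len(pos) * (1 - target_rate) / target_rate)
    n_neg = min(n_neg, len(neg))  # cannot keep more rows than exist
    return sorted(pos + rng.sample(neg, n_neg))


def balanced_weights(labels):
    """Per-class weights inversely proportional to class frequency,
    scaled so that both classes carry the same total weight."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


# Class counts taken from the question below.
labels = [1] * 128 + [0] * 5525
kept = undersample(labels, target_rate=0.10)
weights = balanced_weights(labels)
print(len(kept), sum(labels[i] for i in kept))
print(weights)
```

The weights would be passed to whatever case-weight argument the
fitting routine offers; option 3 amounts to the same idea expressed in
the loss function rather than in the data.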
Once you have the prediction equation, you may need to incorporate the
prior information that responses are rarer than non-responses before
you apply it to the test set; a model fitted to subsampled or
reweighted data will otherwise overstate the response probability.
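
One common way to do this for a logistic model fitted on subsampled
data is to shift the predicted log-odds by the difference between the
log-odds of the response rate in the training sample and in the
population. The sketch below assumes only the abundant class was
subsampled at random; the function name and the 10% training rate are
hypothetical, and the population rate is the 128/5,653 from the
question.

```python
import math


def correct_probability(p_sub, sample_rate, true_rate):
    """Map a probability predicted by a model trained at `sample_rate`
    back to the scale of a population with response rate `true_rate`."""
    # How much the subsampling inflated the log-odds of a response.
    offset = (math.log(sample_rate / (1 - sample_rate))
              - math.log(true_rate / (1 - true_rate)))
    logit = math.log(p_sub / (1 - p_sub)) - offset
    return 1 / (1 + math.exp(-logit))


true_rate = 128 / 5653  # response rate reported in the question
# A case scored at the training base rate maps back to the true base rate.
base = correct_probability(0.10, sample_rate=0.10, true_rate=true_rate)
# A case scored at 0.5 on the subsampled scale is much less likely in reality.
half = correct_probability(0.50, sample_rate=0.10, true_rate=true_rate)
print(base, half)
```

The correction only rescales the probabilities; the ranking of cases
is unchanged, so it matters when you need calibrated response rates
rather than just a sorted mailing list.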
Regards, Adai
Lars Chi wrote:
> Hi,
>
> I'm working on a classical classification problem for a marketing
> application (predicting response to a mailing campaign). The problem is not
> only that the response rate is low (2.26%) which is not rare to find in this
> type of applications, but the sample size is small (only 5,653 instances –
> including both responses (128) and non-responses (5,525)). The number of
> predictors in my data set is ~ 50.
>
> If possible, I'd like to have your opinion regarding how to approach this
> problem. In particular, I believe this data set is small to split it into
> train and validation. So I considered, performing under-sampling to obtain a
> 10% response and used Logistic Regression with a Stepwise selection and
> cross-validation error as the selection criterion. Is there any alternative
> approach you would think may work better?
>
> Many thanks in advance for your help.
>
> Kind Regards,
>
> Lars.