Lars,
I'd suggest a couple more interventions.
1. Of the suggestions below, I'd avoid subsampling in your case.
The reason is that if you remove observations representing a
particular class, you will bias the intercept unless you also weight
the observations back to the population propensities; and even if you
do that, you will still be reducing the statistical evidence you have
about the probability of your response (or lack of it) under different
circumstances.
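A hedged sketch of that reweighting in Python with scikit-learn (the counts
mirror Lars's figures, but the data themselves are simulated stand-ins and
the effect sizes are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated stand-in for the mailing data: 5,525 non-responses, 128 responses.
n_neg, n_pos = 5525, 128
X = np.vstack([rng.normal(0.0, 1.0, size=(n_neg, 3)),
               rng.normal(0.5, 1.0, size=(n_pos, 3))])
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

# Under-sample the abundant class so responses make up ~10% of the data.
keep_neg = rng.choice(n_neg, size=9 * n_pos, replace=False)
idx = np.concatenate([keep_neg, np.arange(n_neg, n_neg + n_pos)])
Xs, ys = X[idx], y[idx]

# Weight each retained non-response back up to the population propensity;
# without this the fitted intercept reflects the artificial 10% rate.
w = np.where(ys == 0, n_neg / len(keep_neg), 1.0)

model = LogisticRegression().fit(Xs, ys, sample_weight=w)

# The weighted fit's average predicted rate on the full data should sit
# near the true 2.26%, not the 10% of the subsample.
p_hat = model.predict_proba(X)[:, 1].mean()
```

Even with the weights restoring the intercept, note the point above still
applies: the discarded non-responses carry information you no longer have.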
2. To mitigate the risk of overfitting due to the small number of
responses, I'd eliminate predictors that are likely to be unduly
influenced by chance, for instance an indicator variable with fewer
than 5% or 10% positives. It would be easy for two unlikely
categories (your response and the low-propensity indicator) to end up
without a representative intersection. The same can be said for many
variable interactions. Even with two "continuous" variables, if the
"continuity" gets sparse at the high or low ends, those ends can be
unduly influenced by chance coincidences with your infrequent
response.
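As a rough illustration of that screening rule, one might flag indicator
columns whose minority level falls below a cutoff before fitting (the
column names and the toy data below are invented, not from the thread):

```python
import numpy as np

def sparse_indicators(X, names, min_rate=0.05):
    """Return names of 0/1 columns whose rarer level occurs below min_rate."""
    flagged = []
    for j, name in enumerate(names):
        col = X[:, j]
        if set(np.unique(col)) <= {0, 1}:        # treat as an indicator
            rate = col.mean()
            if min(rate, 1 - rate) < min_rate:
                flagged.append(name)
    return flagged

# Toy matrix: the second indicator is positive in only ~2% of rows.
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 2, 1000),
                     (rng.random(1000) < 0.02).astype(int),
                     rng.normal(size=1000)])
flagged = sparse_indicators(X, ["promo_flag", "rare_segment", "age_scaled"])
```

With 128 responses, a predictor flagged this way would intersect the
response class only a handful of times, which is the coincidence risk
described above.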
Just my two cents. Hope it helps.
--
Best regards,
David Young
Marketing and Statistical Consultant
Madrid, Spain
+34 913 540 381
http://www.linkedin.com/in/europedavidyoung
http://www.telefonica.net/web2/davidyoung
Sunday, September 14, 2008, 5:14:28 PM, you wrote:
AR> This is more a problem of an "imbalanced training set" (google this
AR> term for scientific articles) than of small samples.
AR> Here are some possible solutions:
AR> 1) Subsample from the more abundant class. Easiest to implement.
AR> 2) Weight samples from the rare class more highly. I believe most
AR> software has a weighting option for logistic regression.
AR> 3) Use a cost function that penalizes incorrect classification of
AR> samples from the rare class more.
AR> They all involve some arbitrary choice (proportion to subsample, weight,
AR> cost function).
AR> Once you have the prediction equations, you may need to incorporate
AR> the prior information that responses are rarer than non-responses,
AR> before you apply it on the test set.
AR> Regards, Adai
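For logistic regression, the prior correction Adai mentions has a closed
form: if the model was fit at a sample response rate ybar but the
population rate is tau, subtract ln[((1 - tau)/tau) * (ybar/(1 - ybar))]
from the fitted intercept. The two rates below come from Lars's message;
the intercept value is hypothetical:

```python
import numpy as np

def corrected_intercept(b0, sample_rate, population_rate):
    """Shift a logit intercept fit at sample_rate so its baseline matches population_rate."""
    return b0 - np.log(((1 - population_rate) / population_rate)
                       * (sample_rate / (1 - sample_rate)))

tau = 128 / 5653          # population response rate (~2.26%)
ybar = 0.10               # response rate in the under-sampled training set
b0 = -2.2                 # hypothetical intercept from the subsampled fit
                          # (close to logit(0.10), the subsample base rate)

b0_star = corrected_intercept(b0, ybar, tau)

# Sanity check: a baseline prediction at the corrected intercept should
# recover roughly the true 2.26% response rate.
p = 1 / (1 + np.exp(-b0_star))
```

Only the intercept moves; the slope coefficients are left as fitted.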
AR> Lars Chi wrote:
>> Hi,
>>
>> I'm working on a classical classification problem for a marketing
>> application (predicting response to a mailing campaign). The problem is not
>> only that the response rate is low (2.26%), which is not rare in this
>> type of application, but also that the sample size is small (only 5,653 instances –
>> including both responses (128) and non-responses (5,525)). The number of
>> predictors in my data set is ~ 50.
>>
>> If possible, I'd like to have your opinion regarding how to approach this
>> problem. In particular, I believe this data set is too small to split into
>> training and validation sets. So I considered performing under-sampling to
>> obtain a 10% response rate and used Logistic Regression with stepwise
>> selection and cross-validation error as the selection criterion. Is there
>> an alternative approach you think may work better?
>>
>> Many thanks in advance for your help.
>>
>> Kind Regards,
>>
>> Lars.