Dear Jorge,
Depending on your classification method, you can either resample your dataset or bias the model itself:
a.) oversample: replicate instances of the minority class until a 50%-50% distribution is achieved, possibly adding artificial noise
b.) undersample the majority class, as suggested in an earlier email, although we found this doesn't work well in data mining
c.) bias predictions by prior probabilities (what software do you use?)
d.) incorporate asymmetric misclassification costs to bias the parameterisation process of learning-machine classifiers (neural nets, support vector machines, etc.)
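As a rough illustration of option a.), here is a minimal sketch of random oversampling in plain Python. The function name `oversample_minority`, its parameters and the toy data are made up for this example and do not come from any particular package; it only duplicates minority rows and adds no artificial noise.

```python
import random

def oversample_minority(rows, labels, minority=1, target_share=0.5, seed=0):
    """Duplicate randomly chosen minority rows until they make up target_share."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority instances to resample")
    # Minority count n such that n / (n + n_majority) equals target_share.
    n_needed = round(target_share * len(majority_idx) / (1.0 - target_share))
    extra = max(0, n_needed - len(minority_idx))
    # Sample with replacement from the existing minority rows.
    drawn = [rng.choice(minority_idx) for _ in range(extra)]
    keep = majority_idx + minority_idx + drawn
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data (assumption): 1000 customers, 3% churners, one dummy feature each.
labels = [1] * 30 + [0] * 970
rows = [[float(i)] for i in range(1000)]

new_rows, new_labels = oversample_minority(rows, labels, target_share=0.5)
share = sum(new_labels) / len(new_labels)
print(len(new_labels), share)
```

Setting `target_share=0.3` would give roughly the 30% mix instead. Only the training data should be resampled this way; any hold-out data used for evaluation should keep the original 3% distribution.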
However, these choices may depend on what you want to do and on whether you need to stick to a logistic regression model. Resampling schemes change the distribution of your dataset, thereby altering the regression coefficients. I am unsure how the coefficients can then be interpreted as effects in the population when they were estimated on a balanced dataset ...
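One commonly used adjustment, related to option c.), is to shift the predicted log-odds by the difference between the population prior and the share used in the resampled training data. The sketch below assumes the only change was the class mix and that the model is otherwise adequate; the function name `correct_probability` and the numbers are illustrative, not taken from our paper.

```python
import math

def correct_probability(p_balanced, sample_share, population_share):
    """Map a probability from a model fitted on resampled data back to the population prior.

    p_balanced must lie strictly between 0 and 1.
    """
    # Log of (population odds / sample odds) for the minority class.
    offset = (math.log(population_share / (1.0 - population_share))
              - math.log(sample_share / (1.0 - sample_share)))
    # Shift the log-odds, then transform back to a probability.
    logit = math.log(p_balanced / (1.0 - p_balanced)) + offset
    return 1.0 / (1.0 + math.exp(-logit))

# A score of 0.5 on 50/50 training data, with 3% churners in the population.
p = correct_probability(0.5, sample_share=0.5, population_share=0.03)
print(round(p, 4))
```

For a logistic regression this is equivalent to adding the same offset to the intercept. It addresses the predicted probabilities only and does not settle the question of how to interpret the individual coefficients.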
We recently faced a similar problem; please see (working paper status):
http://www.sven-crone.com/documents/papers/DMIN05_CroSoo.pdf
S. Crone, D. Soopramanien: Predicting Customer Online Shopping Adoption for Market Modelling using Artificial Neural Networks, in: H. Arabnia; R. Joshua; Y. Mun (eds.): Proceedings of the International Conference on Data Mining, DMIN'05, Las Vegas, CSREA -
In addition, I would myself appreciate feedback regarding other approaches, contrary positions and additional literature on this interesting problem domain.
Thanks,
Sven
______________________________________________________
Sven F. Crone
Lancaster University Management School
Department of Management Science, Room A53a
Lancaster LA1 4YX
United Kingdom
Tel +44 (0)1524 592991 direct
Fax +44 (0)1524 844885
Tel +44 (0)1524 593867 department
Internet www.lums.lancs.ac.uk
eMail [log in to unmask]
_______________________________________________
-----Original Message-----
From: A UK-based worldwide e-mail broadcast system mailing list [mailto:[log in to unmask]] On Behalf Of Jorge Caballero Rodríguez
Sent: 16 May 2005 13:34
To: [log in to unmask]
Subject: Logistic regression. How can I balance my data???
Hi all!
I have a problem with a logistic model. I want to score my data; I am modelling churn in a bank. The problem is that the customers who churn are only 3% of all the data. I need to balance my data to obtain 30% or 50%. How can I do it?
Thanks!