Dear Leo
Neural networks are "error hungry" when trained to minimise misclassification error. With imbalanced classes, the majority class contributes far more "learning" information for adjusting the weights than the minority classes do. The network therefore specialises in predicting the class that contributes most of the error - but frequently this is not the class you are actually interested in!
There is ample evidence on how to overcome imbalanced class problems (which are often also associated with asymmetric costs: the minority class is usually the one of most interest and more costly to misclassify, e.g. fraudulent use of credit cards vs. normal use of credit cards). Ideally, use the cost information during training (so-called "cost-sensitive learning") rather than just adjusting your decision thresholds afterwards.
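As a small illustration of the idea (not tied to any particular package): one common heuristic is to weight each class's errors inversely to its frequency, so mistakes on rare classes count more during training. The weight formula below, n_total / (n_classes * n_c), is one such heuristic, not the only choice:

```python
from collections import Counter

def class_weights(labels):
    """Heuristic cost weights: rarer classes get larger weights,
    so their misclassification errors count more during training."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}

# Your distribution: A 88%, B 10%, C 2% (here out of 100 instances)
labels = ["A"] * 88 + ["B"] * 10 + ["C"] * 2
w = class_weights(labels)
# w["C"] is far larger than w["A"], so errors on C now dominate the loss
```

Many tools accept such weights directly as per-class or per-instance error weights; check whether STATISTICA exposes this.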
To balance the sampling, try the following:
- split the dataset with stratified sampling (equal class distributions across all subsamples of training, validation and test set)
- oversample the minority classes in the training and validation datasets by randomly replicating instances of the minority classes - simply duplicate them until all classes have equal numbers of instances
- do NOT oversample the test dataset, which should retain the original class imbalance
- run your algorithms, select a candidate model on the validation set, and evaluate accuracy on the test set (and compare against other benchmark methods!)
Oversampling frequently works better than undersampling (where you throw away instances of the majority class until you have equal class distributions). By the way: SAS calls it oversampling but actually undersamples, so you would have to oversample manually if you use SAS.
I know it is not your question, but please allow some suggestions: when modelling a neural net for multiclass classification, please use a softmax output function (as an instance can only be either A, B or C - not independently all or none, which a linear or sigmoid output activation would allow). Code nominal and ordinal inputs as binary variables. Experiment with different variable codings (binnings) of the interval variables if you have domain knowledge.
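For illustration, a softmax output turns the network's raw class scores into mutually exclusive probabilities that sum to one, and nominal inputs become binary indicator variables - a minimal sketch (the scores here are made up):

```python
import math

def softmax(scores):
    """Convert raw output scores into class probabilities summing to 1.
    Subtracting the max first is a standard numerical-stability trick."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(value, categories):
    """Code a nominal input as binary indicator variables."""
    return [1 if value == c else 0 for c in categories]

probs = softmax([2.0, 1.0, 0.1])        # hypothetical scores for A, B, C
indicators = one_hot("B", ["A", "B", "C"])  # -> [0, 1, 0]
```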
I would suggest that you always compare neural nets (or any other given method) to established benchmarks: at minimum the baseline class distribution of the dataset (i.e. the accuracy of naively guessing), but also simple logistic regression or decision trees. You can then evaluate the trade-off between increased model complexity (and limited understanding of the variable / feature interactions) and gains in predictive accuracy.
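To see why the baseline matters in your case: a classifier that always predicts A already achieves 88% overall accuracy on your data, so any model must beat that before it has learned anything useful:

```python
# Your class distribution
dist = {"A": 0.88, "B": 0.10, "C": 0.02}

# Naive baseline 1: always predict the majority class
majority_accuracy = max(dist.values())               # 0.88

# Naive baseline 2: guess classes at random according to their priors
random_accuracy = sum(p * p for p in dist.values())  # 0.88^2 + 0.10^2 + 0.02^2
```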
Also, when evaluating imbalanced class problems, please make sure not to use simple error metrics such as the overall misclassification rate, as they do not reflect the accuracy per class. Use per-class metrics, ROC curves etc. instead.
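As a simple illustration of how overall accuracy misleads, per-class recall can be read off a confusion matrix - the counts below are hypothetical:

```python
# Hypothetical confusion matrix for 100 test instances:
# rows = true class, columns = predicted class
confusion = {
    "A": {"A": 87, "B": 1, "C": 0},
    "B": {"A": 8,  "B": 2, "C": 0},
    "C": {"A": 2,  "B": 0, "C": 0},
}

correct = sum(confusion[c][c] for c in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total   # 0.89 overall - looks fine!

recall = {c: confusion[c][c] / sum(confusion[c].values()) for c in confusion}
# recall per class: A ~0.99, B 0.20, C 0.00 -
# the minority classes are hardly recognised at all
```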
Some References on sampling:
Provost, F. and Fawcett, T. (2001) Robust classification for imprecise environments, Machine Learning, Vol. 42, pp. 203-231.
Weiss, G. M. and Provost, F. (2003) Learning when training data are costly: The effect of class distribution on tree induction, Journal of Artificial Intelligence Research, Vol. 19, pp. 315-354.
Crone, S. F., Lessmann, S. and Stahlbock, R. (2006) The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing, European Journal of Operational Research, Vol. 173, pp. 781-800.
Saar-Tsechansky, M. and Provost, F. (2004) Active sampling for class probability estimation and ranking, Machine Learning, Vol. 54, pp. 153-178.
Grzymala-Busse, J. W., Stefanowski, J. and Wilk, S. (2004) A comparison of two approaches to data mining from imbalanced data, Knowledge-Based Intelligent Information and Engineering Systems, Pt 1, Proceedings, Vol. 3213, pp. 757-763.
Grzymala-Busse, J. W., Stefanowski, J. and Wilk, S. (2005) A comparison of two approaches to data mining from imbalanced data, Journal of Intelligent Manufacturing, Vol. 16, pp. 565-573.
Han, H., Wang, W. Y. and Mao, B. H. (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, Advances in Intelligent Computing, Pt 1, Proceedings, Vol. 3644, pp. 878-887.
Kind regards
Sven
______________________________________________________
Sven F. Crone
Deputy Director, Lancaster Centre for Forecasting
Assistant Professor in Management Science (Lecturer)
Lancaster University Management School
Department of Management Science
Lancaster LA1 4YX
United Kingdom
Tel +44 (0)1524 592991 direct
Tel +44 (0)1524 593867 department
Fax +44 (0)1524 844885
Internet http://www.lums.lancs.ac.uk
eMail [log in to unmask]
_______________________________________________
Programme Committee Chair, Conference Co-Chair, DMIN'06 www.dmin-2007.com
International Conference on Data Mining, June 25-28, 2007, Las Vegas, NV, USA
Co-organiser of the 2007 NN3 Neural Network Forecasting Competition,
ISF'07, IJCNN'07, DMIN'07, www.neural-forecasting-competition.com
-----Original Message-----
From: A UK-based worldwide e-mail broadcast system mailing list [mailto:[log in to unmask]] On Behalf Of Leo Guelman
Sent: Thursday, September 06, 2007 1:43 AM
To: [log in to unmask]
Subject: Neural Networks Query
Hi,
I'm using a Neural Net model to approach a classification problem with 3
possible outcomes. The distribution of outcomes is far from even: A (88%), B
(10%) and C (2%). I am using a random sample and thus it reflects the
proportions in the population. Because the algorithm minimizes the overall
error function, I'm getting good performance on A's but not on B and C.
Should I use a sample with 1/3 of the outcomes in each category and do some
proper weighting afterwards?
By the way, I am using STATISTICA Data Miner.
Thanks in advance for your response.
Regards,
Leo.