Statistical learning for class imbalanced data: a case study of Malaria indicator survey data

Ongera, Maangi Daniel
Strathmore University
Class imbalance problems are common in real-life applications, and in most cases the minority class is the more important one. Standard statistical learning algorithms tend to produce poor results for the minority class and very good results for the majority class. One of the most widely used mechanisms for addressing this problem is re-sampling the training data. The objective of this study is to examine the performance of statistical learning algorithms under different re-sampling approaches for handling class imbalance.

Methods: Two classical and two ensemble statistical learning techniques were trained on an imbalanced Malaria Indicator Survey data set, with the majority-minority problem handled through re-sampling. These were logistic regression, support vector machines, random forest, and extreme gradient boosting. The algorithms were first trained without handling class imbalance. Secondly, the algorithms were trained using six re-sampling procedures to handle class imbalance: random under-sampling, random over-sampling, Synthetic Minority Oversampling Technique (SMOTE), Random Over Sampling Examples (ROSE) and Adaptive Synthetic Sampling Approach (ADASYN). We further investigated whether combining randomly under-sampled and over-sampled data can result in improved performance. Eighty percent of the data was used for model training with 5-fold cross-validation.

Results: All methods considered for handling class imbalance had strengths and weaknesses, depending on the performance metric. For instance, random under-sampling (RU) resulted in models with higher sensitivity than random over-sampling (RO). To obtain a trade-off between sensitivity and specificity, these two methods can be combined (RURO). This approach resulted in 99.5% sensitivity, 88.1% specificity, 89.6% precision, a 94.3% F1 score and 93.9% accuracy on the test set using the extreme gradient boosting machine.
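To make the re-sampling idea concrete, the following is a minimal Python sketch of random under-sampling (RU), random over-sampling (RO) and one possible way of combining them (RURO), together with the five metrics reported above. It is an illustration only, not the study's implementation: the abstract does not say how the two samples were combined, so resampling both classes to the midpoint of the two class sizes is an assumption. The helper names `resample_to` and `binary_metrics` and the toy data are hypothetical.

```python
import random
from collections import Counter


def resample_to(X, y, n_per_class, seed=0):
    """Randomly resample every class to exactly n_per_class rows (hypothetical helper)."""
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        if len(idx) >= n_per_class:
            # Under-sampling: draw rows without replacement.
            chosen = rng.sample(idx, n_per_class)
        else:
            # Over-sampling: keep every original row, then add random duplicates.
            chosen = idx + rng.choices(idx, k=n_per_class - len(idx))
        X_out.extend(X[i] for i in chosen)
        y_out.extend(y[i] for i in chosen)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision, F1 and accuracy from labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Toy imbalanced training set (made up): 10 minority rows (1), 90 majority rows (0).
X = [[i] for i in range(100)]
y = [1] * 10 + [0] * 90

counts = Counter(y)
n_min, n_maj = min(counts.values()), max(counts.values())

X_ru, y_ru = resample_to(X, y, n_min)                      # RU: shrink the majority class
X_ro, y_ro = resample_to(X, y, n_maj)                      # RO: grow the minority class
X_ruro, y_ruro = resample_to(X, y, (n_min + n_maj) // 2)   # RURO: meet at the midpoint (assumption)

# Metrics on a small made-up set of true and predicted labels.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(Counter(y_ru), Counter(y_ro), Counter(y_ruro))
print(metrics)
```

In a workflow like the one described, re-sampling of this kind would be applied to the training portion only, so that the held-out test set keeps the original class distribution.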
Submitted in total fulfilment of the requirements for the degree of Master of Science in Statistical Sciences of Strathmore University