Statistical learning for class imbalanced data: a case study of Malaria indicator survey data

dc.contributor.authorOngera,Maangi Daniel
dc.date.accessioned2023-05-25T07:43:28Z
dc.date.available2023-05-25T07:43:28Z
dc.date.issued2022
dc.descriptionSubmitted in total fulfilment of the requirements for the degree of Master of Science in Statistical Sciences of Strathmore University
dc.description.abstractClass imbalanced problems are predominant in real-life applications. In most cases, the minority class is the most important. Standard statistical learning algorithms tend to produce poor results for the minority class and very good results for the majority class. One of the widely used mechanism to address this problem is by re-sampling the training data. The objective of this study is to examine the performance of statistical learning algorithms by using different re-sampling approaches for handling class imbalance. Methods Two classical and ensemble statistical learning techniques were trained on an imbalanced Malaria Indicator Survey data set while handling the majority-minority problem through re-sampling. These included: Logistic regression, support vector machines, random forest, and extreme gradient boosting. The algorithms were trained without handling class imbalance first. Secondly, the algorithms were trained using six re-sampling procedures to handle class imbalance: random under-sampling, random over-sampling, Synthetic Minority Oversampling technique (SMOTE), Random Over Sampling Examples (ROSE) techniques and Adaptive Synthetic Sampling Approach (ADASYN). We further investigated whether combining randomly under-sampled and over-sampled data can result in improved performance. Eighty percent of the data was used for model training using 5 fold cross validation. Results All methods that were considered for handling class imbalance had strengths and weaknesses depending on the performance metric. For instance, random under-sampling (RU) resulted in models with higher sensitivity than random over-sampling (RO). To get a trade-off between sensitivity and specificity, these two methods can be combined (RURO). This approach resulted in 99.5% sensitivity, 88.1 % specificity, 89.6 % precision, 94.3 % F1 score and a 93.9 % accuracy on the test set using the Extreme Gradient Boosting machine.
dc.identifier.urihttp://hdl.handle.net/11071/13207
dc.language.isoen
dc.publisherStrathmore University
dc.titleStatistical learning for class imbalanced data: a case study of Malaria indicator survey data
dc.typeThesis
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Maangi Daniel.pdf
Size:
1.16 MB
Format:
Adobe Portable Document Format
Description:
License bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description: