Robust statistical learning for optimal classification of imbalanced data

dc.contributor.author: Juma, Samuel Wanyonyi
dc.date.accessioned: 2022-06-13T06:17:38Z
dc.date.available: 2022-06-13T06:17:38Z
dc.date.issued: 2021
dc.description: A Research Thesis Submitted to the Graduate School in partial fulfillment of the requirements for the Award of Master of Science Degree in Statistical Sciences at Strathmore University
dc.description.abstract: Neurobiological disorders such as Learning Disabilities (LD) are increasingly becoming a major concern in the education and health sectors; hence, precise identification of these disorders is critical. While neuropsychological assessments play an important role in diagnosis, there are limited conventional methodologies for test administration, scoring and interpretation of results. Consequently, children are frequently misclassified due to the imprecise distinction between children with learning disabilities and those with learning difficulties. This research sought to apply statistical and Machine Learning (ML) approaches to strengthen the LD diagnostic process. It addresses the challenges of learning from imbalanced data, a characteristic often associated with LD data due to the low prevalence of the disorder. Imbalanced data poses a challenge in designing efficient ML solutions since standard classification models assume fairly distributed classes. The study used an experimental design to identify a suitable base learner and corrective technique to tackle the challenge of imbalanced data. The statistical experiments were based on secondary data obtained from a Baseline Survey on Learning Disabilities conducted by the Kenya Institute of Special Education in 2019. It was found that the Support Vector Machine (SVM) is the best base learner for imbalanced data, with the highest classification efficiency compared to other classification models. For data with high dimensionality, the classification power of the Artificial Neural Network (ANN) was better than that of the SVM, despite the need for significantly higher computational effort. When data dimensionality was reduced, the classification power of the ANN decreased significantly. The SVM was also found to be a more flexible model whose classification power is least affected by changes in data dimensionality. Both Adaptive Boosting (AdaBoost) and Adaptive Synthetic Sampling (ADASYN) were found to perform well in tackling imbalanced data, with AdaBoost performing slightly better, although the difference was not statistically significant. The study concludes that SVM and ANN can be used to model highly imbalanced data to achieve the highest classification accuracy with respect to the minority class, and that ADASYN and AdaBoost can be used jointly to build a more robust corrective algorithm for highly imbalanced data. (An illustrative sketch of the corrective-technique-plus-base-learner approach appears after the metadata fields below.)
dc.identifier.uri: http://hdl.handle.net/11071/12814
dc.language.iso: en
dc.publisher: Strathmore University
dc.subject: Statistical learning
dc.subject: Optimal classification
dc.subject: Imbalanced data
dc.title: Robust statistical learning for optimal classification of imbalanced data
dc.type: Thesis
</dc_fields_end>
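The abstract describes pairing a corrective technique for class imbalance (ADASYN oversampling or AdaBoost) with a base learner such as an SVM. The following is a minimal illustrative sketch of that general approach using scikit-learn and imbalanced-learn on synthetic imbalanced data; the dataset, parameters, and metrics are assumptions for illustration only and do not reproduce the thesis experiments or the KISE survey data.

```python
# Illustrative sketch only: ADASYN + SVM versus AdaBoost on synthetic imbalanced data.
# All settings are assumptions for demonstration, not the thesis configuration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import ADASYN

# Simulate a highly imbalanced binary problem (roughly 5% minority class).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# Corrective technique 1: ADASYN synthesises extra minority-class samples,
# then an SVM base learner is trained on the rebalanced data.
X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_res, y_res)
print("ADASYN + SVM")
print(classification_report(y_test, svm.predict(X_test), digits=3))

# Corrective technique 2: AdaBoost reweights misclassified (often minority) examples
# at each boosting round, without resampling the training data.
ada = AdaBoostClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("AdaBoost")
print(classification_report(y_test, ada.predict(X_test), digits=3))
```

Minority-class recall and F1 in the classification report are the figures to compare, since overall accuracy is dominated by the majority class in imbalanced problems.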
Files
Original bundle
Name: Robust statistical learning for optimal classification of imbalanced data.pdf
Size: 2.67 MB
Format: Adobe Portable Document Format
Description: Full-text thesis
License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon at submission