Enhancing credit scoring in emerging markets: overcoming data scarcity with advanced machine learning and data augmentation techniques

dc.contributor.author: Gathimba, R. W.
dc.date.accessioned: 2026-04-13T10:09:07Z
dc.date.issued: 2025
dc.description: Full-text thesis
dc.description.abstract: Credit risk assessment is essential for lending institutions, especially in data-scarce environments where limited borrower information complicates accurate risk evaluation. This study presents a machine learning pipeline that integrates real demographic data with synthetic financial records generated by a Conditional Tabular GAN (CTGAN) model to augment the training dataset. Exploratory Data Analysis (EDA) revealed that the debt-to-income and debt-to-savings ratios were among the most predictive features; both were log-transformed to address skewness and improve model learning. Four classification models were trained and evaluated: Logistic Regression, Random Forest, Gradient Boosting, and a Neural Network. The Random Forest model consistently outperformed the others when trained on a mixed dataset of 75% real and 25% synthetic records, achieving an accuracy of 75%, a macro F1-score of 0.69, and an AUC-ROC of 68.6%. To improve statistical reliability, bootstrapped confidence intervals were computed, confirming model robustness. A fairness analysis was also conducted by excluding sensitive attributes such as sex and marital status, resulting in an ethically aligned model without significant performance loss. The final Random Forest model was deployed as a Streamlit web application, enabling real-time credit scoring through a lightweight, user-friendly interface. This research demonstrates that synthetic data augmentation, combined with advanced machine learning, can enhance credit scoring in emerging markets, particularly for microfinance institutions. Future work will focus on fairness auditing, model calibration, and integration into financial infrastructure to maximize operational impact.
Keywords: credit scoring, Random Forest, CTGAN, synthetic data, emerging markets, fairness, microfinance
dc.identifier.citation: Gathimba, R. W. (2025). Enhancing credit scoring in emerging markets: Overcoming data scarcity with advanced machine learning and data augmentation techniques [Strathmore University]. https://hdl.handle.net/11071/16380
dc.identifier.uri: https://hdl.handle.net/11071/16380
dc.language.iso: en
dc.publisher: Strathmore University
dc.title: Enhancing credit scoring in emerging markets: overcoming data scarcity with advanced machine learning and data augmentation techniques
dc.type: Thesis
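
The abstract mentions bootstrapped confidence intervals but does not say how they were computed. As an illustration only, the sketch below shows one common approach, a percentile bootstrap over a held-out test set. The function names (`accuracy`, `bootstrap_ci`), the resample count, and the toy labels are assumptions made for this example and are not taken from the thesis.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric=accuracy, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric.

    Resamples (label, prediction) pairs with replacement, recomputes the
    metric on each resample, and returns the alpha/2 and 1 - alpha/2
    percentiles of the resulting scores.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical toy labels (1 = default, 0 = repaid); not data from the thesis.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Any metric that takes labels and predictions, such as a macro F1-score, could be passed as `metric` in place of `accuracy`. With a test set this small the interval is wide, which is the kind of uncertainty such an analysis is meant to expose.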

Files

Original bundle

Name: Enhancing credit scoring in emerging markets - overcoming data scarcity with advanced machine learning and data augmentation techniques.pdf
Size: 14.24 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission