Enhancing credit scoring in emerging markets: overcoming data scarcity with advanced machine learning and data augmentation techniques

Abstract

Credit risk assessment is essential for lending institutions, especially in data-scarce environments where limited borrower information complicates accurate risk evaluation. This study presents a robust machine learning pipeline that integrates real demographic data with synthetic financial records generated by a Conditional Tabular GAN (CTGAN) model to augment the training dataset. Exploratory Data Analysis (EDA) revealed that debt-to-income and debt-to-savings ratios were among the most predictive features; these were log-transformed to address skewness and improve model learning. Four classification models (Logistic Regression, Random Forest, Gradient Boosting, and a Neural Network) were trained and evaluated. The Random Forest model consistently outperformed the others when trained on a mixed dataset of 75% real and 25% synthetic records, achieving an accuracy of 75%, a macro F1-score of 0.69, and an AUC-ROC of 68.6%. To improve statistical reliability, bootstrapped confidence intervals were computed, confirming model robustness. A fairness analysis was also conducted by excluding sensitive attributes such as sex and marital status, resulting in an ethically aligned model without significant performance loss. The final Random Forest model was deployed as a Streamlit web application, enabling real-time credit scoring through a lightweight, user-friendly interface. This research demonstrates that synthetic data augmentation, combined with advanced machine learning, can enhance credit scoring in emerging markets, particularly for microfinance institutions. Future work will focus on fairness auditing, model calibration, and integration into financial infrastructure to maximize operational impact.

Key Words: credit scoring, Random Forest, CTGAN, synthetic data, emerging markets, fairness, microfinance
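The abstract mentions bootstrapped confidence intervals without describing how they were computed. One common approach is the percentile bootstrap over test-set predictions, sketched below using only the Python standard library. The function name, the resample count, and the toy labels are assumptions made for illustration and are not taken from the thesis.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy.

    Hypothetical helper for illustration; not the thesis's own procedure.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(y_true)

    # 1 where the prediction matches the label, 0 otherwise.
    hits = [int(t == p) for t, p in zip(y_true, y_pred)]

    scores = []
    for _ in range(n_resamples):
        # Resample test cases with replacement and recompute accuracy.
        sample = rng.choices(hits, k=n)
        scores.append(sum(sample) / n)
    scores.sort()

    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled scores.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))

    point = sum(hits) / n
    return point, scores[lo_idx], scores[hi_idx]


# Toy data (assumed): 200 cases, of which 150 are predicted correctly.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 75 + [0] * 25 + [0] * 75 + [1] * 25

point, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The same resampling loop could be applied to macro F1 or AUC-ROC by replacing the accuracy calculation with the metric of interest, at the cost of recomputing that metric on each resample.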

Description

Full-text thesis

Keywords

Citation

Gathimba, R. W. (2025). Enhancing credit scoring in emerging markets: Overcoming data scarcity with advanced machine learning and data augmentation techniques [Strathmore University]. https://hdl.handle.net/11071/16380
