Assessing Predictive Performance of Supervised Machine Learning Algorithms: An Alternative Model for Diamond Pricing

Samuel Njoroge Kigo

Submitted in total fulfilment of the requirements for the degree of Master of Science in Statistical Sciences of Strathmore University

Institute of Mathematical Sciences
Strathmore University
Nairobi, Kenya
May 2022

This thesis is available for Library use through open access on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Declaration

I declare that this work has not been previously submitted and approved for award of a degree by this or any other University. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

© No part of this thesis may be reproduced without the permission of the author and Strathmore University.

Name: Samuel Njoroge Kigo
Signature:
Date: August 20, 2022

Approval

The thesis of Samuel Njoroge Kigo was reviewed and approved by the following:

Professor Bernard Omolo, Supervisor, Institute of Mathematical Sciences, Strathmore University.
Dr. Evans Omondi, Supervisor, Institute of Mathematical Sciences, Strathmore University.
Dr. Godfrey Madigu, Dean, Institute of Mathematical Sciences, Strathmore University.
Dr.
Bernard Shibwabo, Director, Office of Graduate Studies, Strathmore University.

Abstract

The diamond is the world's hardest mineral, 58 times harder than any other, and its beauty as a jewel has long been appreciated. The diamond is popular due to its optical properties as well as other factors such as its durability, custom, fashion, and strong marketing by diamond producers. Diamond demand, on the other hand, is not directly related to such inherent characteristics, but rather to diamonds' perceived value as rare and expensive objects. Forecasting diamond prices is challenging due to non-linearity in important features such as carat, cut, clarity, table, and depth. Given this, we conducted a comparative analysis and implementation of multiple supervised machine learning models for predicting diamond price in both classification and regression approaches. We evaluated eight supervised algorithms: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron, and identified the most suitable model given the selected evaluation metrics. The analysis in this work is based on data preprocessing, exploratory data analysis, training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the optimal algorithm in both classification and regression, with an R² score of 97.45% and an Accuracy value of 74.28%. As a result, the eXtreme Gradient Boosting method was recommended for forecasting the price of a diamond specimen.

Table of Contents

List of Figures
List of Tables
List of Abbreviations
Acknowledgement
Dedication
1 Introduction
  1.1 Background of the Study
  1.2 Problem Statement
  1.3 Objectives of the Study
    1.3.1 General Objective
    1.3.2 Specific Objectives
  1.4 Research Questions
  1.5 Significance of the Study
  1.6 Dissemination and Utilisation of the Study Results
  1.7 Limitations of the Study
2 Literature Review
  2.1 Introduction
  2.2 Supervised Machine Learning Algorithms
  2.3 Application of ML in Classification and Regression
  2.4 Application of ML in Diamond Pricing
3 Methodology
  3.1 Introduction
  3.2 Multiple Linear Regression (MLR)
  3.3 Boosted Classification and Regression Trees (BCARTs)
  3.4 eXtreme Gradient Boosting (XGBoost)
  3.5 Support Vector Machine (SVM)
  3.6 K-Nearest Neighbors (KNN)
  3.7 Random Forests (RFs)
  3.8 Multi-Layer Perceptron (MLP)
  3.9 Linear Discriminant Analysis (LDA)
  3.10 Regression Evaluation Metrics
  3.11 Classification Evaluation Metrics
  3.12 Overall Modeling Process
  3.13 Data Type and Source
  3.14 Simulated Data Analysis
  3.15 Simulation Analysis Results
  3.16 iris Data Analysis Results
4 Data Analysis
  4.1 Introduction
  4.2 Data Type and Source
  4.3 Exploratory Data Analysis
5 Discussion, Conclusion and Recommendations
  5.1 Introduction
  5.2 Discussion
    5.2.1 Regression Evaluation Metrics
    5.2.2 Classification Evaluation Metrics
    5.2.3 Performance of Ensembles
    5.2.4 Algorithms' Overall Performance
  5.3 Conclusion
  5.4 Recommendations
    5.4.1 Recommendations for Further Studies
    5.4.2 Policy Recommendations
References
Appendix A
  A.1 Ethical Review Committee Report
  A.2 Similarity Report

List of Figures

Figure 3.1: Regression Techniques
Figure 3.2: Classification Techniques
Figure 3.3: Neural Network Architecture
Figure 3.4: Multi-Layer Perceptron Architecture
Figure 3.5: Overall Modeling Process
Figure 3.6: The Simulated Group Proportions
Figure 3.7: Multiple Linear Regression Assumptions
Figure 3.8: XGBoost Predicted Vs Actual
Figure 3.9: The Models
Figure 3.10: The Metrics
Figure 3.11: R squared Vs Accuracy
Figure 3.12: RMSE Vs Misclassification Error
Figure 3.13: Multi-Layer Perceptron Architecture
Figure 3.14: R squared Vs Accuracy
Figure 3.15: RMSE Vs Misclassification Error
Figure 4.1: The Diamond's Key Measurements
Figure 4.2: The Data Structure
Figure 4.3: The Bulls-eye Chart
Figure 4.4: The Diamond Dataset Correlation Chart
Figure 4.5: The Logarithmic Transformation of Price and Carat
Figure 4.6: The Normality Test
Figure 4.7: The Scatter Plot
Figure 4.8: The Heatmap of cut and color
Figure 4.9: The 4Cs Visualizations
Figure 4.10: R squared Vs Accuracy
Figure 4.11: RMSE Vs Misclassification Error

List of Tables

Table 3.1: Confusion Matrix Table
Table 3.2: Glimpse of the simulated data for 8 random observations
Table 3.3: Regression Evaluation Metrics
Table 3.4: Classification Evaluation Metrics
Table 3.5: The Lead Table
Table 3.6: Overall Algorithm's Performance
Table 3.7: Regression Evaluation Metrics
Table 3.8: Classification Evaluation Metrics
Table 3.9: Algorithm's Lead Table
Table 3.10: Classification Evaluation Metrics
Table 4.1: The Study Variables
Table 4.2: Regression Evaluation Metrics
Table 4.3: Classification Evaluation Metrics
Table 4.4: Algorithm's Lead Table
Table 4.5: Overall Algorithms' Performance
List of Abbreviations

ROC: Receiver Operating Characteristic
RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
AUC: Area Under Curve
GIA: Gemological Institute of America
SMLAs: Supervised Machine Learning Algorithms
SML: Supervised Machine Learning
MLAs: Machine Learning Algorithms
ML: Machine Learning
BCART: Boosted Classification and Regression Tree
SVM: Support Vector Machine
MLP: Multi-Layer Perceptron
RF: Random Forest
kNN: K-nearest neighbors
XGBoost: eXtreme Gradient Boosting
MLR: Multiple Linear Regression
LDA: Linear Discriminant Analysis
ANN: Artificial Neural Network
ReLU: Rectified Linear Unit
ME: Misclassification Error

Acknowledgement

First and above all, I respectfully and gratefully recognize the almighty God for the gift of grace and capacity, as well as the gift of wisdom and knowledge, that He has bestowed upon me in order to conduct this research. Prof. Bernard Omolo and Dr. Evans Omondi, my supervisors, deserve special thanks for their unwavering support and advice throughout the research period. Finally, I want to express my gratitude to my parents, Mr. and Mrs. Kigo, my brother Robert Kigo, my nephew Kennedy Mwangi, and all of my fellow classmates (Kroneckers) for their essential moral and technical assistance throughout this thesis.

Dedication

This thesis is dedicated to Mr. Paul Gitau, my mentor and sponsor. Over the last decade he has been my mentor, and in that time I have gained a deeper sense of intellectual worth in my life. His guidance has aided me in defining and even exceeding my personal boundaries. He has always stretched my academic abilities, knowing full well that I would always do better. This has been my compass for reaching my potential, and it has enabled me to achieve more than I could have dreamed, particularly in my academic career. May the Almighty bestow His blessings on him.
Chapter 1

Introduction

1.1 Background of the Study

A diamond is the hardest substance on the planet, and its beauty as a gemstone has long been recognized. By 2019, an estimated 142 million carats of diamonds had been mined around the world. Australia, Canada, the Democratic Republic of Congo, Botswana, South Africa, and Russia are all major producers. The world's reserves are estimated at 1.2 billion carats, the largest of which are in Russia, believed to hold 650 million carats (M.Garside, 2022). In 2020, global diamond jewelry sales were 68 billion US dollars (M.Garside, 2021b), with the United States accounting for 35 billion US dollars of that amount (M.Garside, 2021a). In 2019, the United States had the biggest demand for polished diamonds, totaling 12.8 billion US dollars (M.Garside, 2020). The United States, China, India, Japan, and the Persian Gulf region are the top five markets for diamonds (Mamonov and Triantoro, 2018). The 4Cs, namely Cut, Carat, Color, and Clarity, were introduced by the Gemological Institute of America (GIA) in the 1950s and are the most well-known attributes of diamonds. The 4Cs describe each diamond's distinct characteristics and have a significant impact on diamond prices. Three of the four Cs have a lengthy history: carat weight, color, and clarity were all used in the original diamond grading system over 2,000 years ago in India (Mamonov and Triantoro, 2018). The dimensions of a diamond's cut determine how efficiently it reflects light. Cut quality is graded on a scale from fair to ideal. One of the main components of the cut variable is the degree of perfection achieved by the cutting and polishing process.
This is a complicated variable that includes, among other things, the stone's symmetry and adherence to local market-specific standards regarding the stone's proportions and the presence or absence of specific features such as an ID number engraved in the stone's girdle, the girdle's faceting, and so on (Cardoso and Chambel, 2005). The cut of a diamond also has three other characteristics: brilliance, the amount of light reflected from it; fire, the dispersion of light into the colors of the spectrum; and scintillation, the flashes of glitter that occur when a diamond is moved around (Mamonov and Triantoro, 2018). According to the International Gem Society, the cut is the most important of the 4Cs (Clark, 2022). Chu (2001) asserts that an optimal cut should be neither too deep nor too shallow, since either extreme impedes the trajectory of light and thereby the brilliance or "fire" of a diamond stone. Blue Nile, one of the largest online diamond retailers, asserts that cut has the biggest effect on sparkle, and that even with perfect color and clarity, a poorly cut diamond will look dull (Nile, 2022). In addition to the 4Cs, diamonds have many other attributes such as length, width, height, and table. To better understand how such complex features influence diamond prices, we propose the application of supervised machine learning algorithms (SMLAs), which afford the advantage of capturing non-linear relationships in a given dataset. Buyers and investors in the diamond trade industry encounter a number of challenges in estimating the price of diamond stones. Because of the differences in the shapes, sizes, and clarity of the stones, this is a challenging task (Alsuraihi et al., 2020). Diamonds are a very unique consumer product. The diamond is the hardest mineral, 58 times harder than any other (Mihir et al., 2021), and is thus used in various machines and other types of equipment for cutting and slicing.
Diamond demand, on the other hand, is not directly tied to such inherent features, but rather to diamonds' perceived value as rare and expensive items (Mamonov and Triantoro, 2018). More money is spent on diamonds than on all other gemstones combined. The diamond owes its popularity to its optical properties (Mihir et al., 2021). Other factors include its durability, custom, fashion, and aggressive marketing by diamond producers (Sharma et al., 2021). Due to their non-linearity and fluctuating time-series behavior, forecasting the prices of precious commodities such as gold and diamonds is a difficult task (Mamonov and Triantoro, 2018). Because of their unique features and great market demand, banks prefer to invest in precious metals. As a client, one is constantly unsure when it is the best moment to invest in, buy, or sell valuable items like gold and diamonds (Pandey et al., 2019). When it comes to generating the greatest profit out of an investment, and the least expense out of a purchase, of the above-mentioned products, pricing is extremely important to buyers and investors. The Rapaport list, a price list quoting Rapaport's opinion of high cash asking prices generally accepted as high enough to serve as the initial starting point for negotiations, has been used in classical diamond evaluation. The Rapaport list prices are almost always higher than actual dealer transaction prices, which tend to trade at discounts to the list. Final transaction prices are the result of negotiations between buyer and seller and are thus difficult to predict based only on the Rapaport price list (Cardoso and Chambel, 2005). There are numerous models and applications currently on the market for predicting the future price of diamond stones, including machine learning algorithms (Sharma et al., 2021), as illustrated in Chapter 2 of this thesis.
Contemporary statistical analysis is characterized by the evolution of machine learning algorithms. Ahmed et al. (2010) and Kampichler et al. (2010) observe that these algorithms have been empirically proven to be serious contenders to classical statistical models in dealing with high-dimensional data that are often non-linear and do not meet the assumptions of conventional statistical procedures. This affirms the decision to employ them in this work for diamond price prediction and classification. The aforementioned assertions are based on various predictive performance metrics, including Precision, Accuracy, Kappa, Recall, F-Measure, RMSE, MAE and R squared, and on computational aspects such as speed and build time, among others. Knowledge of the best-performing model(s) is imperative in refocusing the modeler's time, effort and other resources on only the potential candidate models in machine learning and pattern recognition. In classical statistics, modeling complex non-linear relationships was the biggest drawback until the 1980s, when advances in computing technology permitted non-linear modeling (James et al., 2013). The explosion of 'Big Data' has seen an estimated 95% or more of the world's current data generated in roughly the past five years. There exists a pressing need by businesses, governments and researchers to draw meaningful insights out of these overwhelming amounts of data in making smarter decisions. In fact, businesses now view data as a cocktail of new opportunities, a crude oil that requires state-of-the-art skills and expertise to refine: gaining insight into the engineering process behind data and thus discovering the hidden patterns that will inform valuable investment decisions. It is evident that data has become the new front of competition among businesses and other commercial undertakings. The new epoch of big data is redefining statistical learning applications in supervised and unsupervised modeling and prediction.
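As an illustration of the evaluation metrics named above, the regression metrics (RMSE, MAE, R squared) and the classification metric Accuracy can be computed directly from predicted and actual values. The following minimal Python sketch uses hypothetical diamond prices and cut labels, not the study's data:

```python
import math

def rmse(actual, predicted):
    # Root Mean Squared Error: penalizes large errors more heavily
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    # Mean Absolute Error: average magnitude of the errors
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    # Proportion of variance in the response explained by the model
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def accuracy(actual, predicted):
    # Share of correctly classified observations
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices (US dollars) and model predictions
y_true = [500, 1500, 3000, 4500]
y_pred = [600, 1400, 3100, 4400]
print(rmse(y_true, y_pred))                 # 100.0
print(mae(y_true, y_pred))                  # 100.0
print(round(r_squared(y_true, y_pred), 4))  # 0.9956

# Hypothetical cut labels
cut_true = ["Ideal", "Fair", "Good", "Ideal"]
cut_pred = ["Ideal", "Good", "Good", "Ideal"]
print(accuracy(cut_true, cut_pred))         # 0.75
```

An R² near 1 and a low RMSE/MAE indicate a good regression fit, while Accuracy is simply one minus the misclassification error (ME) used later in the thesis.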
Osisanwo et al. (2017) postulate that this tendency can be traced back to the advancement of smart and nano technologies, which has sparked interest in uncovering hidden patterns in data, both structured and unstructured, to derive value. Further, the increase in freely available and user-friendly statistical software such as R and Python has provided an upthrust to machine learning innovations in modeling. In this thesis, we explore the use of supervised machine learning algorithms to investigate the relationship between diamond physical qualities and diamond prices, in order to establish the extent to which the latter are determined by the former.

1.2 Problem Statement

According to the literature reviewed, cut is the most important variable influencing diamond prices. The cut itself contains three key aspects: brilliance, dispersion, and scintillation, all of which attract the attention of the diamond market's major players. Thus, in addition to predicting diamond prices, we capture the aspect of diamond classification based on cut in this study, which has received little to no attention in previous research. As a result, the predictive power of SMLAs in predicting diamond prices and in diamond classification based on cut will be evaluated. An algorithm's overall performance will be judged on how well it performs in both regression and classification scenarios. To address the said gap, we employ the following supervised machine learning algorithms: boosted classification and regression trees (BCART), support vector machines (SVM), Multi-Layer Perceptron (MLP), random forest (RF), K-nearest neighbors (KNN) and eXtreme Gradient Boosting (XGBoost). Multiple Linear Regression (MLR) and Linear Discriminant Analysis (LDA) from classical statistics will be used as baseline models.
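To illustrate one of the algorithms listed above, the following is a minimal pure-Python sketch of a k-nearest neighbors classifier with a plurality vote over the k closest training points. The (carat, depth) feature values and cut labels are hypothetical, not drawn from the study's dataset:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query point to every training point
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Labels of the k nearest neighbors
    nearest = [label for _, label in sorted(dists)[:k]]
    # Plurality vote among the k nearest neighbors
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical (carat, depth) features with cut-class labels
train_X = [(0.3, 61.0), (0.4, 62.0), (1.0, 58.0), (1.2, 59.0), (0.35, 61.5)]
train_y = ["Ideal", "Ideal", "Fair", "Fair", "Ideal"]
print(knn_predict(train_X, train_y, (0.32, 61.2), k=3))  # "Ideal"
```

In practice the features would be standardized first, since kNN is distance-based and the raw scales of carat and depth differ greatly.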
Though not the main subject, the study will delve into ensemble methods, defined as collections of models whose predictions are combined by weighted averaging (continuous response variable) or a plurality voting scheme (categorical response variable) (Moisen, 2008). On this objective, a boosted or bagged version of an algorithm is expected to perform better than its baseline counterpart. Vafeiadis et al. (2015) postulate that boosting ameliorates the performance of a classifier as measured by the respective F-measure score. Despite the fact that ensemble approaches have outstanding empirical performance (Bucilua et al., 2006), most model comparison studies have not applied them.

1.3 Objectives of the Study

1.3.1 General Objective

The general objective was to assess the predictive performance of supervised machine learning algorithms on diamond pricing.

1.3.2 Specific Objectives

• To assess the predictive performance of Supervised Machine Learning Models in diamond price prediction.
• To assess the predictive performance of Supervised Machine Learning Models in diamond classification.
• To compare the performance of boosting and bagging (bootstrapped aggregation) in ensemble methods.

1.4 Research Questions

This study seeks to answer the following questions:
• What is the predictive performance of Supervised Machine Learning Models in diamond price prediction?
• What is the predictive performance of Supervised Machine Learning Models in diamond classification?
• How does the performance of boosting compare with that of bagging (bootstrapped aggregation) in ensemble methods?

1.5 Significance of the Study

The findings will have larger ramifications for how online commerce affects the pricing of diamonds and other luxury products. Furthermore, the results will have an impact on future strategies for the diamond industry's major players, such as addressing the pricing pressure imposed by e-commerce.
We propose an alternative approach, as demonstrated by the 'overall modeling process' in chapter three, based on the diamond cut, a feature that can be easily tweaked by dealers to meet market demand while also maximizing earnings and customer satisfaction.

1.6 Dissemination and Utilisation of the Study Results

The findings will be disseminated through publications and the Strathmore University Library Catalogue. The important actors in the diamond market, i.e. suppliers and purchasers for whom pricing is critical, will be the consumers of the research outputs. This is intended to create a near-ideal market in which both buyers and sellers have access to the same information. The optimal model will be deployed in an interactive application, such as R-Shiny, into which diamond attributes are fed and from which the model generates the most accurate price estimate.

1.7 Limitations of the Study

This analysis is predicated on the premise that diamond unit prices will not fluctuate much over time, as large fluctuations would make the model unstable. Diamond prices, however, appear to be relatively stable over time, at least in the short to medium term.

Chapter 2

Literature Review

2.1 Introduction

This chapter provides a summary of the numerous sources that were studied for this work, including books, scholarly articles, and other materials. To lay the groundwork for the research topic, a critical examination of the materials was carried out.

2.2 Supervised Machine Learning Algorithms

The urge to analyze data with the most efficient machine learning models has spurred a proliferation of model comparison studies in the past decade. However, the application of Machine Learning Algorithms (MLAs) demands a wide array of skills, most of which are not within the scope of many practitioners (Vafeiadis et al., 2015). This fact has motivated many scholars in the statistics domain, including this research, to contribute to the body of knowledge on MLAs through publications.
Supervised machine learning (SML) refers to the quest for algorithms that reason from externally supplied instances to develop general hypotheses, which then make predictions about future instances, based on certain intelligent systems (Osisanwo et al., 2017). According to James et al. (2013), SML consists of building mathematical models for predicting the outcome of future observations. Predictive models can be classified into two main groups: regression analysis, for predicting a continuous variable, and classification, for predicting the class or group of individuals.

2.3 Application of ML in Classification and Regression

Caruana et al. (2008) assert that most ML model comparison studies have focused exclusively and extensively on classification problems. Vafeiadis et al. (2015) evaluate the performance of the five most widely employed classification algorithms on customer churn prediction. Song et al. (2004) compare ML classifiers against classical statistical classification models. A few recent studies, however, have addressed regression problems, for example Phaladisailoed and Numnonda (2018), which predicts bitcoin prices, and Salazar et al. (2015), which employs machine learning techniques to study dam behavior.

2.4 Application of ML in Diamond Pricing

Alsuraihi et al. (2020) seek to develop the best algorithm for diamond dealers to employ in order to accurately estimate prices. Due to the wide diversity in diamond stone sizes and other important factors, the prediction procedure is much more challenging in the case of diamonds. Several machine learning methods, including linear regression, random forest regression, polynomial regression, gradient descent, and neural networks, were utilized in that study to aid in the prediction of diamond prices.
After training numerous models, verifying their accuracy, and analyzing the findings, it was discovered that random forest regression performed best, with MAE and RMSE values of 112.93 and 241.97, respectively. Although RF performed exceptionally well in this study, this algorithm is not appropriate for settings with significant class imbalance, which was the case for the dataset under analysis (Wu et al., 2014). Importantly, the study should have taken classification into account, which is a significant factor in diamond pricing. The diamond cut, which comes in five different types (classes), has a significant impact on the price. As a result, the study should have chosen the optimal model based on both classification and regression findings to capture this notable attribute. Finally, the study might have used a combination of ensemble models to improve the results. Mamonov and Triantoro (2018) establish that e-commerce has made it easier for buyers to compare diamond pricing (price dispersion) against diamond physical qualities across different sellers in order to make informed purchasing decisions. The purpose of their study is to investigate the relationship between diamond physical features and diamond prices in order to identify the extent to which physical attributes influence diamond pricing. The primary variables that determine diamond prices are found to be diamond weight, color, and clarity. The data mining findings also point to a significant level of subjectivity in diamond pricing, which could be due to diamond dealers' price obfuscation methods. The newly discovered information contributes to our knowledge of the relationship between consumer search costs and price volatility. Because diamond price is a continuous interval target variable with a ratio scale, decision forest, boosted decision tree, and artificial neural network prediction data mining approaches were used.
When the complete dataset is analyzed, Decision Forest yields the lowest MAE, 5.8 percent. When the carat range in the diamonds dataset is narrowed to 0.2-2.5, ANN achieves an MAE of 8.2 percent, beating the other techniques. Other novel prediction data mining approaches, such as XGBoost, SVR and kNN, which have been empirically demonstrated to produce good outcomes in model comparison studies, were left out of this work for no obvious reason. Given the non-linearity of the variable relationships, SVR's polynomial option (svmPoly) might have produced superior results. A stacking ensemble with a one-time model run could also have been applied. R², RMSE, and other evaluation criteria were not employed. Despite the fact that the research suggests that the cut is the most important of the 4Cs, it is not included in the analysis, for example when trying to classify diamonds based on this attribute; the cut of a diamond determines its market value in this case. Pandey et al. (2019) state that diamond and precious metal values fluctuate on a regular basis, making it difficult to forecast future value. Their study uses ensemble approaches to forecast future prices of precious metals such as gold and precious stones such as diamonds, with the goal of obtaining the most accurate result possible. Feature selection approaches are also employed, and the outcomes are compared. Over-fitting and under-fitting are common problems with supervised models, and they perform poorly on imbalanced datasets. To solve these problems, the research proposes a hybrid model that combines the strengths of random forest and principal component analysis (PCA). The random forest model outperforms the linear regression model in the analysis, with a mean accuracy of 0.9730 versus 0.8695. With the 5 best features, Random Forest Regression with Chi-Square feature selection had the best accuracy (0.9754 vs. 0.8663 for Linear Regression).
However, this study chose random forest as its evaluation method without providing any reason for not investigating alternative, empirically proven high-performing ensemble methods such as bagging, boosting, Bayesian averaging, and stacking. The research might have used powerful algorithms like MLP and XGBoost, which can address the problem of over-fitting as well as identify relevant features through variable importance operations. Again, when it comes to choosing the cutoff points, PCA implementation allows for the modeler's subjectivity, undermining statistical objectivity and independence. Furthermore, other evaluation metrics such as R2 and RMSE were not used to verify the claims/results of the study.

Sharma et al. (2021) state that the main goal of their research is to present supervised machine learning algorithms for predicting diamond prices (in US dollars). Due to their monetary value, precious stones such as diamonds are always in high demand. The cost of such stones varies depending on their characteristics. As a result, the study conducted a comparative analysis and application of multiple supervised models in predicting the diamond price. Because heavier stones are more expensive than lighter stones, the relationship will not be linear. The study compares and contrasts eight alternative supervised models, including linear regression, lasso regression, ridge regression, decision tree, random forest, ElasticNet, AdaBoost Regressor, and Gradient-Boosting Regressor, to find the best model for the job. According to the research presented in the publication, the random forest method outperforms the other supervised learning algorithms, reaching an R2 score of 97.93% when the dataset is split 80% for training and 20% for testing.
When compared to MLR, which does not use the shrinkage idea, coefficient shrinkage models (lasso, ridge, elastic net) may produce overly optimistic findings. Random forest ought to have been compared to other novel machine learning algorithms such as XGBoost, MLP, BRT, and SVR, since the focus was on supervised model evaluation. The study appears to dismiss the importance of classification in pricing (similar studies have found that cut is an important variable in diamond pricing, so classifying stones by this variable would have been more informative). The models were also not assessed using multiple regression metrics such as the MAE and R squared to further substantiate the claims/results.

Mihir et al. (2021) observe that, unlike gold and silver, establishing the price of a diamond is extremely difficult since numerous factors must be taken into account, such as clarity, carat weight, cut, breadth, length, color, percentage of depth, and table width. The goal of their project is to develop the most efficient algorithm for predicting diamond prices. Linear regression, Support Vector regression, Decision trees, Random Forest regression, k-Neighbors regression, CatBoost regression, Huber regression, Extra tree regression, Passive Aggressive regression, Bayesian Regression, and XGBoost Regression are some of the algorithms used to train machine learning models on the diamond dataset for predicting diamond prices based on various attributes. CatBoost Regression was found to be the most suitable algorithm for diamond price prediction, with the greatest R2 score of 0.9872 and comparatively lower RMSE and MAE values, based on the performance parameter values and analysis. To acquire more accurate findings, one of the future prospects of the article is to introduce a variety of additional factors such as shape, table value, polish, and symmetry.

Chu (1996) suggests that the costs of diamond rings be related to the weights of their diamond stones using basic linear regression.
Simple linear regression was employed to conduct the analysis in this study. The generated regression line has a negative intercept. The postulated pricing mechanism therefore implies a negative relationship between diamond ring costs and the weights of their diamond stones, raising doubts about the method's validity. There is plenty of evidence that diamonds' essential characteristics have a non-linear connection, implying that the model utilized was incorrect. Furthermore, the technique relied solely on carat weight for price, with no explanation as to why other characteristics such as color, clarity, and cut were overlooked.

Chu (2001) attempts to build a diamond stone pricing model. The paper also teaches us about the different degrees of clarity and color, as well as the relative cost of caratage. The 4 C's, Carat, Clarity, Color, and Cut, are the elements that determine the price of a diamond stone, according to this research. Carat units are used to measure the weight of a diamond stone; a carat is equal to 0.2 grams. All other factors held constant, bigger diamond stones attract greater prices due to their scarcity. The pricing model is built using multiple linear regression (MLR), which is believed to provide flexibility and clarity when dealing with exogenous elements. Carat, color, clarity, and GIA and IGI certificates were the criteria utilized to predict the price, and the value of R-squared was 97.2%. According to the study, there is a non-linear relationship between caratage and price, with heavier stones being more valuable than lighter ones. A scatter plot of Price against Carats confirms this, with the trend appearing to fan out. As a result, instead of using MLR to construct a diamond pricing model, it would be more prudent to use machine learning models. Furthermore, other significant variables were kept out of the analysis.

Cardoso and Chambel (2005) propose new pricing models for cut diamonds.
The derived models may have some advantages over the traditional Rapaport, an industry-wide adopted price indicator, in that they are based on published final selling prices, which already include corrections not captured by the Rapaport, and so can estimate prices closer to the market. This is accomplished through the use of regression trees, Chi-Square Automatic Interaction Detection, and neural networks (with backpropagation). Neural networks outperform the other methods in terms of prediction, accounting for almost 96 percent of the variation in cut diamond unit pricing. The research did not take into account novel ensemble techniques used in machine learning, such as Random Forest and XGBoost. The researchers ought to have used a Multi-layer perceptron to compare the results against the single hidden layer perceptron. The study also does not rigorously assess the proposed prediction model's errors using measures such as the RMSE, neglecting a crucial statistical decision-making tool.

Scott and Yelowitz (2010) take diamonds into account when examining the market for commodities that are consumed not just for their intrinsic utility but also for the impact their usage has on others. Diamonds are in high demand because they create a market for social status in addition to the inherent usefulness that comes with wearing beautiful things. Data was gathered from online diamond sellers in order to investigate the determinants of diamond prices empirically. Carat weight and cut are established during the production process, whereas color and clarity are dictated by nature. The first specification considers carat weight, color, cut, and clarity when determining the log of price. For Blue Nile, Union Diamond, and Amazon listed diamonds, this results in adjusted R squared values of 88.9%, 89.8%, and 93.7%, respectively. All round diamonds between 0.4 and 0.6 carats are included in the sample.
Given the non-linear nature of the relationship between diamond attributes and price, this research should have looked into using machine learning to solve the problem. Furthermore, the study does not use error measurements such as the RMSE to corroborate the findings.

Chapter 3: Methodology

3.1 Introduction

This chapter gives a detailed description of the research methodology to be used. We propose the SML models outlined in Figure 3.1 and Figure 3.2.

Figure 3.1: Regression Techniques
Figure 3.2: Classification Techniques

3.2 Multiple Linear Regression (MLR)

We consider this model when the study involves more than one predictor variable. Here, the relationship is important in that it allows the mean function E(y) to depend on more than one predictor variable and to assume shapes other than a straight line (Montgomery and Runger, 2010). The model is given as

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon  (3.1)

Now, given that the n tuples of observations follow the same model, the following is satisfied:

y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_k X_{1k} + \varepsilon_1
y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_k X_{2k} + \varepsilon_2
\vdots
y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_k X_{nk} + \varepsilon_n  (3.2)

The above n equations can be expressed in matrix form as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \underbrace{\begin{pmatrix} 1 & X_{11} & X_{12} & \dots & X_{1k} \\ 1 & X_{21} & X_{22} & \dots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \dots & X_{nk} \end{pmatrix}}_{\text{Design Matrix}} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}  (3.3)

Through algebraic manipulation, the OLS estimator of \beta is given as:

\hat{\beta} = (X'X)^{-1}X'y  (3.4)

3.3 Boosted Classification and Regression Trees (BCARTs)

Tree boosting is a method of combining many weak learners (trees) into a strong classifier, where each tree is created iteratively and the tree's output h(x) is given a weight w relative to its accuracy. The ensemble output is the weighted sum:

\hat{y}(x) = \sum_t w_t h_t(x)  (3.5)

After each iteration, each data sample is given a weight based on its misclassification, i.e.
the more often a data sample is misclassified, the more important it becomes. Here, the goal is to minimize an objective function:

O(x) = \sum_i l(\hat{y}_i, y_i) + \sum_t \Omega(f_t)  (3.6)

where:
• l(\hat{y}_i, y_i) is the loss function, i.e. the distance between the truth and the prediction of the i-th sample.
• \Omega(f_t) is the regularization function, i.e. it penalizes the complexity of the t-th tree.

Lampa et al. (2014) observe that CARTs are incredibly straightforward yet effective. They divide the data into a number of isolated zones and, within each of these parts, approximate the result with a constant value. A sequence of binary splits in the input variables is used to achieve this. A statistical criterion, such as the residual sum of squares, is optimized by first identifying the variable and split point that best fit the CART. Utilizing the subset of observations that passed through the preceding split, the optimal split is then found within each generated subset. This is done repeatedly until there are normally no more than 10 observations left that can be split. Single CARTs are referred to as weak learners in statistical learning terminology because of their inferior predictive abilities. The concept behind stochastic gradient boosting (boosting), a numerical approach, is that a strong learner with improved prediction performance can be generated by combining several weak learners. Using a function F(x), commonly referred to as the target function and approximated via an additive expansion, the objective is to accurately map a set of explanatory variables x to an outcome variable y. A downside of CARTs is selection bias in favor of variables with a large number of potential split points. CARTs' high degree of variability is another problem; even a little change in the outcome data can result in a different CART.

\hat{F}(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)

where M is the number of weak learners, \beta_m are the expansion coefficients, and b(x; \gamma_m) are individual weak learners characterized by the parameters \gamma_m.
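This additive expansion is fitted greedily, one weak learner at a time. Below is a minimal, illustrative sketch of stochastic gradient boosting with squared-error loss, where each weak learner is a two-leaf regression stump fit to the current residuals (the negative gradient). The data and all parameter values are hypothetical; the thesis itself relied on library implementations.

```python
# Minimal sketch of stochastic gradient boosting: each iteration fits a
# two-leaf stump to the residuals of the current ensemble, then adds it
# with shrinkage.  Illustrative only.
import random

def fit_stump(x, z):
    """Find the split point minimising the squared error of a two-leaf stump."""
    best = None
    for s in sorted(set(x))[1:]:
        left = [z[i] for i in range(len(x)) if x[i] < s]
        right = [z[i] for i in range(len(x)) if x[i] >= s]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda v: lm if v < s else rm

def boost(x, y, M=50, lam=0.1, frac=0.8, seed=1):
    random.seed(seed)
    n = len(x)
    f0 = sum(y) / n                                  # constant initialisation
    stumps = []
    pred = [f0] * n
    for _ in range(M):
        idx = random.sample(range(n), int(frac * n))  # subsample without replacement
        z = [y[i] - pred[i] for i in idx]             # residuals = negative gradient
        g = fit_stump([x[i] for i in idx], z)
        stumps.append(g)
        pred = [pred[i] + lam * g(x[i]) for i in range(n)]  # shrunken update
    return lambda v: f0 + lam * sum(g(v) for g in stumps)

# hypothetical 1-D data with two plateaus near 1.0 and 3.0
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
model = boost(x, y)
```

With enough iterations, the shrunken stumps drive the predictions towards the group means on each side of the split, mirroring the residual-fitting logic of the algorithm described next.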
Accuracy is defined by a loss function L(y, F), which represents the loss in predicting y with F(x). The algorithm works as follows:

1. Initialize \hat{F}_0(x) to a constant \alpha.
2. Randomly sample a fraction \eta from the data without replacement.
3. Using \eta, compute the negative gradient of the loss function, z_m = -\nabla L, and fit a depth-d CART, g(x), predicting z_m.
4. Update \hat{F}_m(x) \leftarrow \hat{F}_{m-1}(x) + \lambda \rho g(x).
5. Iterate steps 2 through 4 M times.

In step 4, \rho is the step size along the gradient and \lambda is a shrinkage parameter which slows down the learning to reduce overfitting. The parameters M, d and \lambda can be tuned using the bootstrap or cross-validation. Further details on BCARTs can be obtained from Friedman (2002) and Breiman et al. (1984).

3.4 eXtreme Gradient Boosting (XGBoost)

The XGBoost algorithm tries to minimize the following objective function (loss function plus regularization) J at step t:

J^{(t)} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{t-1} + f_t(x_i)\right) + \sum_{i=1}^{t} \Omega(f_i)  (3.7)

where the first term contains the training loss function L (e.g. mean squared error) between the real class y and the output \hat{y} for the n samples, and the second term is the regularization term, which controls the complexity of the model and helps to avoid overfitting (Dimitrakopoulos et al., 2018). It is observable that the XGBoost objective is a function of functions (i.e. L is a function of CART learners, a sum of the current and previous additive trees). To solve the above objective function, a Taylor approximation is applied to transform the original objective function into a function in the Euclidean domain, in order to be able to use traditional optimization techniques. In XGBoost, the complexity is defined as:

\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2  (3.8)

where T is the number of leaves, \gamma is the pseudo-regularization hyperparameter, depending on each dataset, and \lambda is the L2 penalty coefficient for the leaf weights.
Using gradients for a second-order Taylor approximation of the loss function and finding the optimal weights w, the optimal value of the objective function is:

J^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T  (3.9)

where g_i = \partial_{\hat{y}^{t-1}} L(y, \hat{y}^{t-1}) and h_i = \partial^2_{\hat{y}^{t-1}} L(y, \hat{y}^{t-1}) are the gradient statistics on the loss function, and I_j is the set of samples assigned to leaf j. XGBoost benefits from the shrinkage strategy, in which newly added weights are scaled after every step of boosting (a greedy algorithm) by a learning rate factor. This helps to diminish the influence of each new tree on the existing ensemble, thereby reducing the risk of overfitting (Mohammadi et al., 2021). XGBoost is composed of three main elements:

• Weak Learners: simple decision trees that are constructed based on purity scores (e.g., Gini).
• Loss Function: a differentiable function to be minimized. In regression, this could be the mean squared error; in classification, it could be the log loss.
• Additive Model: additional trees are added where needed, and a functional gradient descent procedure is used to minimize the loss when adding trees.

3.5 Support Vector Machine (SVM)

SVM is a machine learning technique that works by identifying the optimal decision boundary that separates data points from different classes, and then predicts the class of new observations based on that boundary. Kassambara (2017) observes that SVM can be used for two-class as well as multi-class classification problems. James et al. (2013) assert that there is an extension of the SVM for regression (i.e. for a quantitative rather than a qualitative response), called support vector regression. Support vector regression seeks coefficients (\beta_0, \beta_1, \ldots, \beta_p) that minimize a different type of loss, where only residuals larger in absolute value than some positive constant contribute to the loss function.
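Before moving to the formal SVM setup, the XGBoost leaf quantities from equations (3.8)-(3.9) can be illustrated numerically. The sketch below computes the optimal leaf weight and objective value from per-sample gradients and hessians already grouped by leaf; the gradient values and hyperparameters are hypothetical, and this is a sketch of the formulas only, not of the tree-construction algorithm.

```python
# Illustrative computation of XGBoost's optimal leaf weight and the
# objective value of equation (3.9), given gradients/hessians per leaf.

def leaf_weight(grads, hess, lam):
    # w*_j = -sum(g_i) / (sum(h_i) + lambda)
    return -sum(grads) / (sum(hess) + lam)

def objective(leaves, lam, gamma):
    # J = -1/2 * sum_j (sum g)^2 / (sum h + lambda) + gamma * T
    T = len(leaves)
    val = -0.5 * sum(sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return val + gamma * T

# For squared-error loss, g_i = yhat_i - y_i and h_i = 1.
# Two hypothetical leaves: one holding two samples, one holding one.
leaves = [([-2.0, -1.0], [1.0, 1.0]), ([3.0], [1.0])]
w0 = leaf_weight(*leaves[0], lam=1.0)      # -(-3) / (2 + 1) = 1.0
J = objective(leaves, lam=1.0, gamma=0.5)  # -0.5*(3 + 4.5) + 0.5*2 = -2.75
```

Lower (more negative) values of J indicate a better tree structure, which is how XGBoost scores candidate splits.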
Suppose that we have an n \times p data matrix, where samples belong to two linearly separable classes represented by +1 or -1, and suppose g_i is the feature vector. Then (g_i, y_i) \in G \times Y, i = 1, 2, \ldots, n, where y_i \in \{+1, -1\} is the dichotomous target variable in the p-dimensional space. The aim is to classify each sample into one of the two classes and, by extension, to find an SVM classifier that generalizes to a multi-class problem, achieved by finding an optimal separating hyperplane (Mohammed et al., 2021). A separating hyperplane for the two classes satisfies:

w \cdot g_i + b \geq +1 when y_i = +1
w \cdot g_i + b \leq -1 when y_i = -1

where w is the weight vector, b is the bias, and |b|/\|w\| is the perpendicular distance from the origin to the hyperplane. After rescaling, the distance from the nearest point in each class to the hyperplane becomes 1/\|w\|, and the margin between the two classes becomes 2/\|w\|. The solution is obtained by maximizing the margin, i.e. solving:

\min_{w,b} \|w\|^2 subject to y_i(w \cdot g_i + b) \geq 1, i = 1, 2, \ldots, n.

In this study, we will employ one-vs-one multi-class classification, in which the SVM classifier produces all possible pairs of binary classifications. Here, given that we have k classes where k > 2, it follows that k(k-1)/2 binary classifiers are produced in the training step of the algorithm. Consequently, a sample in the test dataset is assigned the class label that receives the most votes from the binary classifiers of the trained one-vs-one SVM.

3.6 K-Nearest Neighbors (KNN)

K-nearest neighbors (kNN) is a non-parametric method used for classification and regression (Yao and Ruzzo, 2006). Given a positive integer K and a test observation x_0, the KNN classifier first identifies the K points in the training data that are closest to x_0, represented by \psi_0. It then estimates the conditional probability for class j as the fraction of points in \psi_0 whose response values equal j:

\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \psi_0} I(y_i = j)
(3.10)

Lastly, KNN applies the Bayes rule and classifies the test observation x_0 to the class with the largest probability (James et al., 2013). The regression variant seeks to estimate f(x_0) using the average of all the training responses in \psi_0, mathematically expressed as:

\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \psi_0} y_i  (3.11)

3.7 Random Forests (RFs)

Random Forest is an unpruned classification or regression tree ensemble produced by employing bootstrap samples of the training data and random feature selection in tree induction. The ensemble's forecasts are aggregated (majority vote or averaging) to make a prediction. When creating these decision trees, a random sample of m predictors is picked as split candidates from the full set of p predictors each time a split in the tree is evaluated. Only one of the m predictors can be used in the split. A fresh sample of m predictors is taken at each split, and typically we choose m \approx \sqrt{p}, i.e. the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (James et al., 2013). In the classification setting, the random forest prediction is the most prevalent class among the individual tree predictions. If there are T trees in the forest, the number of votes a class m receives is:

v_m = \sum_{t=1}^{T} I(\hat{y}_t = m)  (3.12)

where \hat{y}_t is the prediction of the t-th tree on a particular instance. The indicator function I(\hat{y}_t = m) takes the value 1 if the condition is met, and 0 otherwise. In the regression setting, the random forest's forecast is the average of the individual trees' predictions. If there are T trees in the forest, each making a prediction \hat{y}_t, the final prediction \hat{y} is:

\hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t  (3.13)

3.8 Multi-Layer Perceptron (MLP)

MLP is an algorithm inspired by the structure and function of the brain; such models are usually called Artificial Neural Networks (ANN). To store information, the brain changes the connections between neurons.
The neuron does not store information; instead, it enables signal transmission between neurons. The human brain is made up of a gigantic network of such neurons, and the neural network mimics the brain's mechanism. While the human brain employs neuronal association, the neural network employs neuronal connection weights (Ghatak, 2019). The information of the neural network is stored in the form of weights and biases, as demonstrated in Figure 3.3.

Figure 3.3: Neural Network Architecture

The input signals are multiplied by the weights before entering the node, as shown below:

v = (w_1 \times x_1) + (w_2 \times x_2) + (w_3 \times x_3) + b = wx + b  (3.14)

The weighted sum can be expressed in matrix form:

v = \begin{pmatrix} w_1 & w_2 & \dots & w_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} + b  (3.15)

The output of the node, \hat{y}, is obtained by applying an activation function g, as shown below:

\hat{y} = g(v) = g(w \cdot x + b)  (3.16)

It is important to note that an MLP is defined by two or more hidden layers. According to Cardoso and Chambel (2005), the more hidden units there are in a network, the less likely it is to encounter a local minimum during training. Figure 3.4 shows a typical MLP network.

Figure 3.4: Multi-Layer Perceptron Architecture

Input nodes merely relay the input signal; they do not compute the weighted sum or apply the activation function. Because the intermediate layers are not visible from outside the neural network, they are called hidden layers. In supervised learning, the learning rule trains the neural network to produce the proper output, which has already been determined. The weights are initialized and the error is calculated accordingly. Then, the weights are adjusted to reduce the error. This procedure is repeated until the minimum error is attained.
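The forward pass of equations (3.14)-(3.16) and one error-correcting weight adjustment of the kind just described can be sketched as follows, using the sigmoid activation. All weights, inputs, and the target value are hypothetical.

```python
# A single neuron's forward pass (weighted sum + sigmoid activation) and
# one weight update that reduces the error e = d - y.  Illustrative only.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(w, x, b):
    # weighted sum v = w.x + b, passed through the activation g
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(v)

def update_step(w, x, b, d, alpha):
    """Adjust each weight by alpha * phi'(v) * e * x_j, where e = d - y."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = sigmoid(v)
    e = d - y                      # error = correct output - observed output
    delta = y * (1.0 - y) * e      # phi'(v) * e for the sigmoid
    return [wi + alpha * delta * xj for wi, xj in zip(w, x)], y

w = [0.1, -0.2, 0.3]               # hypothetical initial weights
x = [1.0, 2.0, -1.0]               # hypothetical input signal
w_new, y_before = update_step(w, x, b=0.0, d=1.0, alpha=0.5)
y_after = forward(w_new, x, b=0.0)  # output moves towards the target d = 1
```

Repeating the update drives the output towards the desired value, which is exactly the iterative error-reduction loop described above.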
The systematic way of modifying the weights is known as the Learning Rule, as demonstrated by the generalized delta rule below:

\Delta w_{ij} \leftarrow w_{ij} + \alpha \delta_i x_j

where:
• \delta_i = \varphi'_i(v_i) e_i
• e_i is the error of node i, i.e. e_i = d_i - y_i, the correct output minus the observed output
• \varphi' is the derivative of the activation function
• \Delta w_{ij} is the updated weight
• w_{ij} is the previous weight
• x_j is the output from node j (j = 1, 2, 3, \ldots)
• \alpha is the learning rate (0 \leq \alpha \leq 1)

The learning rate determines the extent to which the weights are changed at every epoch. A high value of \alpha causes the output to oscillate around the expected solution, while a very low value of \alpha causes the output to converge too slowly to an acceptable solution. The sigmoid function, given below, will be used as the activation function:

\varphi(x) = \frac{1}{1 + e^{-x}}  (3.17)

Taking the first derivative:

\varphi'(x) = \varphi(x)(1 - \varphi(x))  (3.18)

The strengths of the Multi-Layer Perceptron (MLP) include:
• The difficulty of training multiple layers is solved by the back-propagation algorithm.
• Poor performance due to the vanishing gradient is addressed by using the Rectified Linear Unit (ReLU) as the activation function.
• The vulnerability to overfitting resulting from model complexity with additional hidden layers is solved by 'Dropout', i.e. training some randomly selected nodes rather than the entire network. Regularization is also used to prevent overfitting by simplifying the architecture of the MLP.
• The softmax activation function in the output layer keeps the outputs in the range between 0 and 1, so that they can be used as probabilities.

The ReLU function gives the maximum value between zero and a given input.
φ(x) = x,x≥ 0 0,x≤ 0 =max(0,x) (3.19) The derivative of ReLU function; φ ′(x) = 1,x≥ 0 0,x≤ 0 (3.20) 26 The softmax activation function; φ(z)i = ezi ∑ K j=1 ez j (3.21) Where; • φ = so f tmax • z = imput vector • ezi = standard exponential f unction f or input vector • K = number o f classes in the multi− class classi f ier • ez j = standard exponential f unction f or out put vector 3.9 Linear Discriminant Analysis (LDA) This extends the LDA classifier to the case of multiple predictors. Here, the assumption is that X =(X1,X2, . . . ,Xp is drawn from a multivariate normal or multivariate Gaussian distribution N(µk,Σ), with a class-specific multivariate mean vector and a common covariance matrix (James et al., 2013). Chris (2021) postulates that LDA uses Bayes Theorem for classification which we can explain by noting that if we have K classes and we want to classify the qualitative response variable Y where there are K possible distinct and unordered values derived as follows: Let πkbe the prior probability that a given randomly chosen observation comes from the kth class. Let fk(x)≡ Pr(X = x|Y = k) be the density function of X for an observation from the kth class. fk(x) is relatively large if there is a high probability that an observation in the kth class has X ≈ x and fk(x) is relatively small if it is very unlikely that an observation in the kth class has X ≈ x. 27 Bayes Theorem states that: Pr(Y = k|X = x) = πk fk(x) ∑ K l=1 πl f l(x) (3.22) Letting pk(x) = Pr(Y = k|X), we can simply plug in estimates of πk and fk(X) into the formula which can be generated with the software that then takes care of the rest. We refer to pk(x) as the posterior probability that an observation X = x belongs to the kth class given the predictor value for that observation. Estimating k is easy if we have a random sample of Y ′s from the population but estimating fk(X) is more difficult. 
However, if we have an estimate for f_k(x), then we can build a classifier that approximates the Bayes classifier. We assume that X = (X_1, X_2, \ldots, X_p) is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix, written X \sim N(\mu, \Sigma) to indicate that the p-dimensional vector X has a multivariate Gaussian distribution. E(X) = \mu is the mean of the X vector (with p components) and Cov(X) = \Sigma is the p \times p covariance matrix of X. Formally, the multivariate Gaussian density is given as:

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)  (3.23)

Plugging the density function for the k-th class, f_k(X = x), into equation (3.22) above and applying some algebra, we see that the Bayes classifier assigns X = x to the class for which:

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k  (3.24)

is largest. The Bayes decision boundaries represent the set of values x for which \delta_k(x) = \delta_l(x); in other words, for which

x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k = x^T \Sigma^{-1} \mu_l - \frac{1}{2} \mu_l^T \Sigma^{-1} \mu_l, for k \neq l.

The \log \pi_k term disappears when each class has the same number of training observations, since \pi_k is then the same for each class. To estimate \mu_1, \ldots, \mu_K, \pi_1, \ldots, \pi_K and \Sigma, we use conventions similar to those for the case where p = 1.

3.10 Regression Evaluation Metrics

Root Mean Squared Error (RMSE)

The root-mean-square error measures the model's prediction error. It is the average difference between the observed known values of the outcome and the values predicted by the model (Kassambara, 2018). Low values of RMSE signify better predictions from the model. Barnston (1992) gives the following mathematical expression for the RMSE:

\text{RMSE}_{fo} = \left[\frac{1}{N}\sum_{i=1}^{N} (z_{fi} - z_{oi})^2\right]^{\frac{1}{2}}  (3.25)

where:
• f denotes the forecasts (predicted values)
• o denotes the observed values (known results)
• (z_{fi} - z_{oi})^2 are the squared differences
• N is the sample size
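The RMSE in equation (3.25) can be computed directly from its definition; the observed and predicted values below are hypothetical.

```python
# Direct computation of equation (3.25) on a small toy sample.
import math

def rmse(forecast, observed):
    n = len(forecast)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)

obs = [3.0, 5.0, 2.0, 7.0]
pred = [2.5, 5.0, 4.0, 8.0]
err = rmse(pred, obs)   # sqrt((0.25 + 0 + 4 + 1) / 4)
```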
Mean Absolute Error (MAE)

Willmott and Matsuura (2005) assert that the MAE, as a measure of a model's accuracy, is an unambiguous, stable, and more natural measure of average error, unlike the RMSE, which varies with the variability within the distribution of error magnitudes. The MAE is calculated by summing the magnitudes (absolute values) of the errors to obtain the 'total error' and then dividing the total error by n, as shown below:

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_i|  (3.26)

where, for i = 1, 2, \ldots, n:
• e_i = P_i - O_i
• n is the sample size
• P_i are the predicted values
• O_i are the observed values

Low values of MAE signify better model performance in terms of the accuracy of predictions.

R Squared

The R-squared (R^2) is defined through a series of related quantities, including the residual sum of squares (RSS), expressed as:

\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2  (3.27)

the residual standard error (RSE), given as:

\text{RSE} = \sqrt{\frac{1}{n-2} \text{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}  (3.28)

where y_i and \hat{y}_i are the actual and predicted values of the observations, respectively; and the total sum of squares (TSS), which is the total variance in the response Y, i.e. the variability inherent in the response before the regression is performed (James et al., 2013). It is given by the formula:

\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2  (3.29)

where \bar{y} is the mean of the observed response values. The R-squared (R^2) statistic, commonly referred to as the coefficient of determination, is thus given as:

R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}  (3.30)

While \text{TSS} - \text{RSS} measures the amount of variability in the response that is explained (or removed) by performing the regression, R^2 is the proportion of explained variability in the response variable Y that is associated with the predictor variable X.
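The MAE (3.26) and R-squared (3.30) follow directly from their definitions; the values below are hypothetical.

```python
# MAE and R^2 computed from first principles on a toy sample.

def mae(pred, obs):
    n = len(pred)
    return sum(abs(p - o) for p, o in zip(pred, obs)) / n

def r_squared(pred, obs):
    ybar = sum(obs) / len(obs)
    rss = sum((o - p) ** 2 for o, p in zip(obs, pred))   # residual sum of squares
    tss = sum((o - ybar) ** 2 for o in obs)              # total sum of squares
    return 1 - rss / tss

obs = [3.0, 5.0, 2.0, 7.0]
pred = [2.5, 5.0, 4.0, 8.0]
m = mae(pred, obs)        # (0.5 + 0 + 2 + 1) / 4 = 0.875
r2 = r_squared(pred, obs)
```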
Wooldridge (1991) proposes the following formula for the adjusted R^2:

\bar{R}^2 \equiv 1 - \frac{\text{RSS}/(T - K - 1)}{\text{TSS}/(T - 1)}  (3.31)

where:
• T is the sample size
• K is the number of predictors

The adjusted R^2 is modified (adjusted) so as to accommodate the changes in degrees of freedom that result from the addition or removal of independent variables in a regression model.

3.11 Classification Evaluation Metrics

Confusion Matrix

Table 3.1 is a tabular representation of actual vs predicted values. It helps us find the accuracy of the model and thus avoid overfitting. Accuracy refers to the proportion of predictions that were correct. The table below summarizes the elements of the confusion matrix.

Actual \ Predicted    Positive              Negative
Positive (Y=1)        True Positive (A)     False Negative (B)
Negative (Y=0)        False Positive (C)    True Negative (D)

Table 3.1: Confusion Matrix Table

\text{Accuracy} = \frac{A + D}{A + B + C + D}  (3.32)

In essence, this is given as (TP + TN)/(TP + FP + TN + FN). The proportion of the predicted positive cases that were correct, or Precision, is given by:

\text{Precision} = \frac{TP}{TP + FP}  (3.33)

The proportion of the positive cases that were correctly identified, or Recall, is given by:

\text{Recall} = \frac{TP}{TP + FN}  (3.34)

The harmonic mean of Precision and Recall, or the F-measure (Vafeiadis et al., 2015), is given by:

F_1\text{-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}  (3.35)

Since good performance of a classifier cannot be exclusively measured by either precision or recall, the F1-Score, which combines the two, is used as a single metric for evaluating a classifier's performance (Vafeiadis et al., 2015). An F-measure value closer to one indicates better classifier performance.

Cohen's kappa measure

Cohen's kappa is a measure of the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of \kappa is:

\kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}  (3.36)

where p_o is the relative observed agreement among raters, and p_e is the hypothetical probability of chance agreement.
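The confusion-matrix metrics of equations (3.32)-(3.35) can be computed directly from the four cell counts; the counts used below are hypothetical.

```python
# Accuracy, precision, recall, and F1 from binary confusion-matrix counts.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # predicted positives that were correct
    recall = tp / (tp + fn)             # actual positives correctly identified
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
# acc = 70/100, prec = 40/50, rec = 40/60
```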
Using the cell counts of Table 3.1:

p_o = \frac{A + D}{A + B + C + D}  (3.37)

p_{pos} = \frac{A + B}{A + B + C + D} \times \frac{A + C}{A + B + C + D}  (3.38)

p_{neg} = \frac{B + D}{A + B + C + D} \times \frac{C + D}{A + B + C + D}  (3.39)

p_e = p_{pos} + p_{neg}  (3.40)

3.12 Overall Modeling Process

Figure 3.5: Overall Modeling Process

Figure 3.5 depicts a flow chart showing the various options for achieving the final goal, which is diamond price forecasting. While the blue coloring represents an existing approach to diamond pricing that uses machine learning, the orange shading represents the study's recommended solution.

3.13 Data Type and Source

To understand and validate the case under inquiry, we looked at two sets of data. The first data set was created using simulation in order to mimic the real data in Chapter 4. There were 3000 observations and 7 variables in this dataset. The second is the renowned iris flower data set, often known as Fisher's Iris data set, a multivariate data set created by the British statistician and biologist Ronald Fisher (Fisher, 1936).

3.14 Simulated Data Analysis

Simulating a categorical variable. Suppose we have a categorical variable (Group) which can take the values A, B, C and D. The aim is to generate 3000 random data points and control the frequency of each level as follows: A = 20%, B = 25%, C = 25%, D = 30%.

Figure 3.6: The Simulated Group Proportions

Figure 3.6 visualizes the simulated categorical variable proportions.
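The categorical simulation just described can be sketched as follows; the thesis analysis itself appears to have been carried out in R, so this Python version is purely illustrative, and the seed is arbitrary.

```python
# Draw 3000 categorical observations with controlled target proportions
# A = 20%, B = 25%, C = 25%, D = 30%.
import random

random.seed(42)
groups = random.choices(["A", "B", "C", "D"],
                        weights=[0.20, 0.25, 0.25, 0.30], k=3000)
share = {g: groups.count(g) / 3000 for g in "ABCD"}
# empirical shares fluctuate around the targets, as visualized in Figure 3.6
```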
The graphical features of the simulated numeric variables are examined in Figure 3.7, which shows the standard regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

Figure 3.7: Multiple Linear Regression Assumptions

Figure 3.7 confirms that the simulated multivariate data set satisfies the following assumptions:
• Fixed location (measure of central tendency)
• Fixed scale (measure of spread)
• Fixed distribution (multivariate normality)
• Randomness

Table 3.2: Glimpse of the simulated data for 8 random observations

Group       Y       X1      X2      X3      X4      X5
D       1604.96   66.30  112.65  328.96  168.79  205.08
D       1303.96   49.11  229.80  165.10  107.73  105.87
B       2360.40   46.40  345.55  354.48  113.03  300.47
C       1267.44   46.40  146.64  223.07   67.73  186.52
A       1207.57   39.49  149.98  212.38   63.95  108.43
D       1943.86   54.20  187.04  409.62   84.00  188.72
C       1181.76   48.47  240.18  102.82  162.11  106.20
A        386.79   68.52  122.98   35.90   61.63   22.97

Figure 3.8: XGBoost Predicted Vs Actual

Figure 3.8 visually depicts the strong convergence of the predicted and actual data points for the XGBoost model's test set.
Figure 3.9: The Models

Figure 3.10: The Metrics

3.15 Simulation Analysis Results

Table 3.3: Regression Evaluation Metrics

Algorithms     R2    RMSE     MAE
MLR         99.89   16.11   12.95
SVR         99.86   18.29   14.77
BRT         99.56   30.62   23.96
XGBoost     99.33   37.95   29.70
Rf          98.40   63.45   46.10
kNN         93.15  141.67  103.64

Table 3.4: Classification Evaluation Metrics

Algorithms  Precision  Recall     F1  Accuracy
XGBoost         26.84   24.91  18.57     31.17
SVM             13.33   25.06  12.54     29.60
kNN             26.56   26.96  25.85     29.10
LDA             14.24   24.54  15.79     28.60
Rf              23.62   23.86  23.15     25.08
BCART           22.19   22.10  20.96     24.25

The regression results from the simulated dataset are shown in Table 3.3, while the classification results are presented in Table 3.4. MLR and LDA are not explored further, based on the analyses in Figures 3.9 and 3.10, because neither applies to both scenarios under consideration, i.e. regression and classification.

Overall Performance = (X / 7) × 100, where X is the number of evaluation metrics under investigation on which the algorithm ranks first.

Table 3.5: The Lead Table

Algorithms  R2  RMSE  MAE  Precision  Recall  F1  Accuracy
SVM          1     2    2          5       2   5         2
BCART        3     3    3          4       5   3         5
XGBoost      4     4    4          1       3   4         1
Rf           4     4    4          3       4   2         4
kNN          5     5    5          2       1   1         3

Table 3.6: Overall Algorithms' Performance

Algorithms  Rating (%)
SVM              14.29
BCART             0.00
XGBoost          28.57
Rf                0.00
kNN              14.29

Table 3.5 shows the algorithms' ranking on each evaluation metric in both regression and classification. Table 3.6 presents the overall performance rating in percentage, where the best ML model across regression and classification is XGBoost at 28.57%.
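The overall rating can be reproduced by counting first-place finishes in the lead table; a stdlib Python sketch (ranks copied from Table 3.5) that recovers, for example, the XGBoost and SVM ratings:

```python
# Rank per metric from Table 3.5 (1 = best), in the order
# R2, RMSE, MAE, Precision, Recall, F1, Accuracy.
lead_table = {
    "SVM":     [1, 2, 2, 5, 2, 5, 2],
    "BCART":   [3, 3, 3, 4, 5, 3, 5],
    "XGBoost": [4, 4, 4, 1, 3, 4, 1],
    "Rf":      [4, 4, 4, 3, 4, 2, 4],
    "kNN":     [5, 5, 5, 2, 1, 1, 3],
}

def overall_rating(ranks, n_metrics=7):
    """Rating (%) = (number of metrics led / total metrics) * 100."""
    x = sum(1 for r in ranks if r == 1)
    return round(x / n_metrics * 100, 2)

ratings = {algo: overall_rating(r) for algo, r in lead_table.items()}
```

XGBoost leads two of the seven metrics (Precision and Accuracy), giving 2/7 × 100 ≈ 28.57%, matching Table 3.6.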
Figure 3.11: R squared Vs Accuracy (simulated data)

Figure 3.12: RMSE Vs Misclassification Error (simulated data)

The results in Figure 3.11 rate the SVR (svmLinear) and BRT models highest for regression, with R2 values of 99.86% and 99.56% respectively, while XGBoost is the best for classification at an accuracy of 31.17%. kNN is the weakest in regression, with an R2 of 93.15%. Overall, XGBoost shows better results for predicting both categorical and numeric response variables.

3.16 iris Data Analysis Results

Table 3.7: Regression Evaluation Metrics

Algorithms     R2  RMSE   MAE
MLP         93.30  0.24  0.14
MLR         88.77  0.32  0.26
Rf          86.62  0.28  0.23
BRT         86.04  0.30  0.24
XGBoost     81.28  0.40  0.31
SVR         80.96  0.34  0.28
kNN         80.44  0.36  0.32

Table 3.8: Classification Evaluation Metrics

Algorithms  Precision  Recall      F1  Accuracy
LDA            100.00  100.00  100.00    100.00
kNN             96.97   96.67   96.66     96.67
Rf              94.44   93.33   93.27     93.33
SVM             93.33   93.33   93.33     93.33
XGBoost         93.33   93.33   93.33     93.33
MLP             88.89   94.87   90.56     92.31
BCT             92.31   90.00   89.77     90.00

Figure 3.13: Multi-Layer Perceptron Architecture

Table 3.9: Algorithm's Lead Table

Algorithms  R2  RMSE  MAE  Precision  Recall  F1  Accuracy
SVM          4     3    3          3       3   3         3
BCART        2     2    2          5       5   5         5
XGBoost      3     5    4          4       4   4         4
Rf           2     2    2          2       2   3         2
kNN          5     4    5          1       1   1         1
MLP          1     1    1          6       2   5         5

Table 3.10: Overall Algorithms' Performance

Algorithms  Rating (%)
SVM               0.00
BCART             0.00
XGBoost           0.00
Rf                0.00
kNN              57.14
MLP              42.86

Overall Performance = (X / 7) × 100

Table 3.10 shows that MLP is the top ML model for regression at 42.86%, while kNN leads in the classification metrics at 57.14%.
The MLP model is made up of two hidden layers, with 10 and 5 neurons respectively, as shown in Figure 3.13.

Figure 3.14: R squared Vs Accuracy (iris dataset)

Figure 3.15: RMSE Vs Misclassification Error (iris dataset)

Figure 3.14 demonstrates that kNN has the best classification accuracy of 96.67%, while MLP has the best regression results with an R2 value of 93.30%. Although MLP produces strong results on the regression metrics, the study will not use it, owing to the high computing cost of parameter tuning and its long training time.

Chapter 4 Data Analysis

4.1 Introduction

The focus of this chapter is data analysis and the interpretation of results. It entails exploratory analysis, descriptive statistics, and correlation analysis to estimate the relationships between variables. The entire analysis was carried out in R.

4.2 Data Type and Source

Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish datasets, and, in a web-based data-science environment, they can study datasets and construct models. The study's primary features are provided by the Kaggle Diamond Dataset, which has approximately 53,000 observations (Agrawal, 2017).
4.3 Exploratory Data Analysis

This is a way of evaluating data sets in order to summarize their key properties, frequently using statistical graphics and other data visualization techniques. With the aid of summary statistics and graphical representations, it aims to find patterns, spot anomalies, test hypotheses, and verify assumptions.

Table 4.1: The Study Variables

Variable  Description
price     Price in US dollars (326–18,823)
carat     Weight of the diamond (0.2–5.02)
cut       Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color     Diamond color, from D (best) to J (worst)
clarity   A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
table     Width of the top of the diamond relative to its widest point (43–95)
depth     Total depth percentage = z / mean(x, y) = 2*z/(x + y) (43–75)
x         Length in mm (0–10.74)
y         Width in mm (0–58.9)
z         Depth in mm (0–31.8)

Figure 4.1: The Diamond's Key Measurements

Figure 4.2: The Data Structure

Figure 4.3: The Bulls-eye Chart

The Ideal cut has the highest count, while Fair has the least, as shown in Figure 4.3.

Figure 4.4: The Diamond Dataset Correlation Chart

Figure 4.4 indicates that carat and price appear to be skewed to the right, so a logarithmic transformation is necessary to achieve normality. The variables depth and z are normally distributed but more peaked than normal, i.e. leptokurtic.

Figure 4.5: The Logarithmic Transformation of Price and Carat

Figure 4.5 indicates that the logarithmic transformation of price and carat achieves normality.
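The effect of the log transformation on right-skewed variables such as price can be illustrated with moment skewness; a stdlib Python sketch on simulated log-normal "prices" (illustrative data, not the diamond dataset):

```python
import math
import random

random.seed(0)  # illustrative seed

def skewness(xs):
    """Sample moment skewness: E[(x - mean)^3] / sd^3."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

# Right-skewed values, mimicking a price-like variable.
prices = [random.lognormvariate(8, 0.9) for _ in range(5000)]

skew_raw = skewness(prices)                          # typically strongly positive
skew_log = skewness([math.log(p) for p in prices])   # typically near zero
```

A skewness near zero after taking logs is exactly the behaviour Figure 4.5 shows for price and carat.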
Figure 4.6: The Normality Test (normal Q-Q plots of log(carat), depth, table, x, y, z and log(price))

Figure 4.6 shows that the normality assumption is generally satisfied, save for slight dispersion in some variables caused by outliers, which tend to draw the line of best fit towards themselves.

Figure 4.7: The Scatter Plot (log(price) vs log(carats))

Based on Figure 4.7, there is evidence of outliers in the data, visible where the log of carat weight exceeds one and the confidence band begins to widen. Outliers can produce spurious and nonsensical results and interpretations, so their removal might seem attractive; however, removal should follow a judicious process, and observations should only be discarded if it can be shown that they result from error rather than natural causes. This study therefore leaves the outliers in place.

Figure 4.8: The Heatmap of cut and color

From Figure 4.8, we can conclude that:
• Most Ideal and Premium cut diamonds are of colour G.
• Most Very Good and Good cut diamonds are of colour E.
• Fair cut diamonds are usually of colours F, G and H.
• Overall, diamonds of all cut groups are rare in colour J.
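One common way to flag outliers like those visible in Figure 4.7 (although, as noted above, this study retains them) is the interquartile-range rule; a stdlib Python sketch with illustrative carat-like values and a deliberately simplified quartile computation:

```python
def iqr_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (crude index-based quartiles)."""
    s = sorted(xs)
    n = len(s)
    q1 = s[n // 4]            # simplified lower quartile
    q3 = s[(3 * n) // 4]      # simplified upper quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

# Illustrative carat-like values with one extreme point.
carats = [0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2, 5.0]
flagged = iqr_outliers(carats)  # the 5.0-carat point is flagged
```

Flagging is only the first step; as the text argues, whether a flagged point is removed should depend on whether it reflects a recording error or a genuine, rare diamond.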
Figure 4.9: The 4Cs Visualizations (price against weight in carats by cut, in raw and logarithmic scales, and log(price) against log(carats) by clarity)

Figure 4.9 indicates that:
• The relationship between diamond price and weight (carats) is relatively non-linear, with heavy diamonds exhibiting higher price volatility, which calls for a logarithmic transformation.
• Diamond price appears to be an increasing function of the 4Cs.
• There is a positive correlation between price and carat across the different diamond cuts. This is confirmed by the correlation chart, where price and carat exhibit a correlation of 0.92.

Table 4.2: Regression Evaluation Metrics

Algorithms     R2     RMSE     MAE
XGBoost     97.45   646.69  347.94
Rf          97.13   704.61  347.13
BRT         96.88   707.22  369.36
SVR         91.84  1164.68  662.91
kNN         85.35  1504.37  793.12

Table 4.3: Classification Evaluation Metrics

Algorithms  Precision  Recall     F1  Kappa  Accuracy
XGBoost         71.89   77.26  71.63  63.06     74.28
BCT             73.37   71.44  72.04  63.12     74.09
Rf              74.51   71.49  72.10  61.00     72.61
SVM             75.19   55.62  55.45  52.03     66.95
kNN             53.30   41.80  43.95  36.96     55.80

Table 4.4: Algorithm's Lead Table

Algorithms  R2  RMSE  MAE  Precision  Recall  F1  Kappa  Accuracy
SVM          4     4    4          4       3   3      4         4
BCART        3     3    3          2       2   2      1         2
XGBoost      1     1    2          4       1   3      2         1
Rf           2     2    1          2       1   1      3         3
kNN          5     5    5          5       5   5      5         5

Table 4.5: Overall Algorithms' Performance

Algorithms  Rating (%)
SVM               0.00
BCART            12.50
XGBoost          50.00
Rf               37.50
kNN               0.00

Overall Performance = (X / 8) × 100, where X is the number of metrics on which the algorithm ranks first.

The results in Table 4.5 indicate that the best ML model across regression and classification is XGBoost, at 50%.
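The classification metrics in Table 4.3, including Cohen's kappa built from the observed and expected agreement quantities of Equations 3.37–3.40, can be computed from a confusion matrix; a stdlib Python sketch for the binary case, with illustrative counts only:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy and Cohen's kappa from binary counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n          # observed agreement p_o
    # Expected (chance) agreement p_e = p_pos + p_neg.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_pos + p_neg
    kappa = (accuracy - p_e) / (1 - p_e)
    return precision, recall, f1, accuracy, kappa

# Illustrative confusion-matrix counts.
precision, recall, f1, accuracy, kappa = classification_metrics(40, 10, 5, 45)
```

Kappa discounts the agreement expected by chance, which is why an 85%-accurate classifier in this illustration scores a kappa of only 0.70, and why Table 4.3 ranks the algorithms differently on Kappa than on Accuracy.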
Figure 4.10: R squared Vs Accuracy (diamond data)

Figure 4.11: RMSE Vs Misclassification Error (diamond data)

With an accuracy of 74.28% and an R2 value of 97.45%, Figure 4.10 shows that XGBoost is the overall best model in both the classification and regression scenarios. BCART and Rf also perform well, with classification accuracies of 74.09% and 72.61% and regression R2 values of 96.88% and 97.13%, respectively. With RMSE values of 1504.37 and 1164.68, respectively, kNN and SVM (Polynomial) have the greatest error values, making them poor regression predictors, as indicated by Figure 4.11.

Chapter 5 Discussion, Conclusion and Recommendations

5.1 Introduction

The objective of this chapter is to interpret and discuss the importance of the study findings in connection with the research problem under investigation, and to explain any new knowledge or insights gained from the research. The diamonds dataset is used to train and validate all of the models discussed in Chapter 3. Here, the goal is to predict the price of a diamond using its key features. The evaluation begins by dividing the dataset into two parts: the train set (80%) and the validation set (20%). The validation set allows each model to be assessed on observations it has never seen before. All of the models under consideration were subjected to k-fold cross-validation, with k set to 5. The dataset was scaled and centered for feature comparability.
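The 80/20 split and 5-fold partitioning described above can be sketched with stdlib Python (the thesis used R's caret machinery; the names and the 1,000-row size here are illustrative assumptions):

```python
import random

random.seed(1)  # illustrative seed

def train_validation_split(n, train_frac=0.8):
    """Shuffle row indices and split them into train/validation sets."""
    idx = list(range(n))
    random.shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

def kfold_indices(indices, k=5):
    """Partition indices into k roughly equal, disjoint folds for cross-validation."""
    return [indices[i::k] for i in range(k)]

train_idx, valid_idx = train_validation_split(1000)
folds = kfold_indices(train_idx, k=5)
```

Each of the 5 folds serves once as a held-out set during tuning, while the separate 20% validation set is never touched until the final evaluation.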
5.2 Discussion

5.2.1 Regression Evaluation Metrics

In regression, price was regressed against the nine other variables, including the categorical ones (color, cut and clarity). One-Hot Encoding (OHE) was used to convert these factor variables. OHE is a crucial step in translating categorical data variables for machine and deep learning algorithms, improving model predictions and classification accuracy (Seger, 2018).

The XGBoost model outperformed all other algorithms on the regression metrics tested, with scores of 97.45%, 646.69, and 347.94 for R2, RMSE, and MAE, respectively. These results improve on the XGBoost model of Alsuraihi et al. (2020), which achieved RMSE and MAE values of 1406 and 938, respectively. In this study, the optimal model was achieved after tuning the critical architecture parameters as follows: (max.depth = 6, epochs = 46, eta = 0.3, gamma = 5, nfold = 5, booster = gbtree). It is worth noting that XGBoost seems to perform worse on small data sets, such as the simulated and iris datasets, which comprised 3,000 and 150 observations, respectively.

With an R2 of 97.13% and cost-function (RMSE and MAE) values of 704.61 and 347.13, Rf was the second-best performing model. Cross-validation (k-fold) was performed with k held at 5. This result, although based on a set of 9 features, equaled that of Pandey et al. (2019), where the features were reduced to only 5. The Rf R2 score was also in close agreement with the 97.93% reported by Sharma et al. (2021). Random Forest followed the same pattern as XGBoost in that performance was poorer on the small datasets (iris and simulated).

In the regression scenarios, kNN performed the poorest, with an R2 of 85.35% and RMSE and MAE values of 1504.37 and 793.12, respectively, while k (neighbors) was held at 2. kNN regression followed the same pattern on all of the other datasets.
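One-Hot Encoding as applied to the factor variables can be sketched in a few lines of stdlib Python (the cut labels are from Table 4.1; everything else is illustrative):

```python
def one_hot_encode(values, categories=None):
    """Map each categorical value to a 0/1 indicator vector, one column per category."""
    if categories is None:
        categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Illustrative cut labels, not the full diamond data.
cuts = ["Fair", "Ideal", "Premium", "Ideal"]
encoded = one_hot_encode(
    cuts, categories=["Fair", "Good", "Very Good", "Premium", "Ideal"]
)
# Each encoded row has exactly one 1, in the column of its category.
```

Fixing the category list up front, as above, keeps the column layout identical between the train and validation sets even when a level is absent from one of them.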
5.2.2 Classification Evaluation Metrics

The response variable in classification was cut, which contains five classes (Fair, Good, Very Good, Premium, Ideal). XGBoost exhibits the highest Accuracy and Recall in classification, at 74.28% and 77.26%, respectively. Here, the optimal model was achieved after tuning the critical architecture parameters as follows: (max.depth = 6, epochs = 40, eta = 0.001, gamma = 5, nfold = 5, booster = gbtree).

The accuracy of the BCT model is 74.09% at (k-fold = 5, booster = xgbTree). While SVM (k-fold = 5, version = svmLinear) is not the best at Accuracy (66.95%), it is the best at Precision (75.19%). kNN performs the worst on all of the classification evaluation measures at (k-fold = 5, neighbors = 17). However, it produces good results on the small datasets (iris and simulated), suggesting that kNN may not be the best model for evaluating large datasets. BCT and Rf perform best on F1-score (72.04% and 72.10%, respectively). In terms of Kappa, BCT, XGBoost, and Rf are at the top, with 63.12%, 63.06%, and 61%, respectively.

5.2.3 Performance of Ensembles

We began by classifying the modeling techniques into three categories: linear models, ensemble models, and others. XGBoost, which falls under the ensemble category, and more precisely under the boosting techniques, was the best model for classification and regression, as confirmed by the key metrics, i.e. an R2 of 97.45% and an Accuracy of 74.28%. The findings therefore show that Boosting outperforms Bootstrap Aggregation (Bagging) in terms of prediction.

5.2.4 Algorithms' Overall Performance

On a small dataset (the iris dataset), kNN has the best overall classification and regression prediction performance, at 57.14%. On the larger datasets (the simulated and diamond datasets), XGBoost takes the lead, with 28.57% and 50%, respectively.
This outstanding performance can be credited to XGBoost's cutting-edge architecture, which navigates large and complicated data structures and feature interactions.

5.3 Conclusion

The eXtreme Gradient Boosting (XGBoost) algorithm outperformed the other algorithms in diamond classification based on cut, the alternative price prediction approach proposed in this paper. Furthermore, XGBoost produced the best results in diamond price prediction using regression, demonstrating that it is the best tool for diamond price prediction under both methodologies.

5.4 Recommendations

5.4.1 Recommendations for Further Studies

Increase processing power to handle complex algorithms such as MLP by developing and deploying end-to-end GPU-accelerated data science workflows that allow for rapid exploration, iteration, and deployment of work. Using the RAPIDS-accelerated data science libraries, it would be feasible to execute data analysis at scale with a wide range of GPU-accelerated machine learning methods, such as XGBoost, cuGraph's single-source shortest path, and cuML's kNN, DBSCAN, and others. This study, for example, had to abandon the MLP approach, despite its promise of high output, because its training time proved computationally expensive.

5.4.2 Policy Recommendations

This study recommends creating an online interactive space, such as an R-Shiny application, where diamond attributes are fed in and the model generates the most accurate cut category (a key price determinant), and thus a justifiable price estimate, in order to eliminate the information asymmetry that propagates price obfuscation by various diamond retailers.

References

Shivam Agrawal. Analyze diamonds by their cut, color, clarity, price, and other attributes. Diamond Competition, 2017. https://www.kaggle.com/shivam2503/diamonds, Accessed on May 24, 2017.

Nesreen K Ahmed, Amir F Atiya, Neamat El Gayar, and Hisham El-Shishiny. An empirical comparison of machine learning models for time series forecasting.
Econometric Reviews, 29(5-6):594–621, 2010.

Waad Alsuraihi, Ekram Al-hazmi, Kholoud Bawazeer, and Hanan AlGhamdi. Machine learning algorithms for diamond price prediction. In Proceedings of the 2020 2nd International Conference on Image, Video and Signal Processing, pages 150–154, 2020.

Anthony G Barnston. Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. Weather and Forecasting, 7(4):699–709, 1992.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006.

Margarida GMS Cardoso and Luis Chambel. A valuation model for cut diamonds. International Transactions in Operational Research, 12(4):417–436, 2005.

Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 96–103, 2008.

Schmidt Chris. Analysis of LR, LDA, QDA, GAM models with K-CV. RPubs, 2021. https://rpubs.com/ChrisSchmidt/777478, Accessed on June 14, 2021.

Singfat Chu. Diamond ring pricing using linear regression. Journal of Statistics Education, 4(3), 1996.

Singfat Chu. Pricing the c's of diamond stones. Journal of Statistics Education, 9(2), 2001.

Donald Clark. How to choose a diamond. Expert Buying Guide, 2022. https://www.gemsociety.org/article/choosing-a-diamond/, Accessed on March 8, 2022.

Georgios N Dimitrakopoulos, Aristidis G Vrahatis, Vassilis Plagianakos, and Kyriakos Sgarbas. Pathway analysis using XGBoost classification in biomedical data. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, pages 1–6, 2018.

Ronald A Fisher. The use of multiple measurements in taxonomic problems.
Annals of Eugenics, 7(2):179–188, 1936.

Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.

Abhijit Ghatak. Deep Learning with R. Springer, 2019.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013.

Christian Kampichler, Ralf Wieland, Sophie Calmé, Holger Weissenberger, and Stefan Arriaga-Weiss. Classification in conservation biology: a comparison of five machine-learning methods. Ecological Informatics, 5(6):441–450, 2010.

A Kassambara. Linear regression essentials in R. http://www.sthda.com/english/articles/40-regression-analysis/165-linear-regression-essentials-in-r, 2018.

Alboukadel Kassambara. Machine Learning Essentials, volume 1. Sthda, 2017.

Erik Lampa, Lars Lind, P Monica Lind, and Anna Bornefalk-Hermansson. The identification of complex interactions in epidemiology and toxicology: a simulation study of boosted regression trees. Environmental Health, 13(1):1–17, 2014.

Stanislav Mamonov and Tamilla Triantoro. Subjectivity of diamond prices in online retail: insights from a data mining study. Journal of Theoretical and Applied Electronic Commerce Research, 13(2):15–28, 2018.

M. Garside. Global demand value for polished diamonds by country 2019. Diamond Industry, 2020. https://www.statista.com/statistics/894919/global-polished-diamond-demand-value-by-country/, Accessed on November 11, 2020.

M. Garside. Global diamond jewelry market value by country 2020. Diamond Industry, 2021a. https://www.statista.com/statistics/585103/diamond-jewelry-market-value-worldwide-by-region/, Accessed on November 15, 2021.

M. Garside. Global diamond jewelry market value 2010-2020. Diamond Industry, 2021b.
https://www.statista.com/statistics/585267/diamond-jewelry-market-value-worldwide/, Accessed on November 15, 2021.

M. Garside. Diamond industry statistics and facts. Diamond Industry, 2022. https://www.statista.com/topics/1704/diamond-industry/#dossierContents__outerWrapper, Accessed on February 15, 2022.

Harshvadan Mihir, Manish I Patel, Soham Jani, and Ruchi Gajjar. Diamond price prediction using machine learning. In 2021 2nd International Conference on Communication, Computing and Industry 4.0 (C2I4), pages 1–5. IEEE, 2021.

Mohammad-Reza Mohammadi, Fahime Hadavimoghaddam, Maryam Pourmahdi, Saeid Atashrouz, Muhammad Tajammal Munir, Abdolhossein Hemmati-Sarapardeh, Amir H Mosavi, and Ahmad Mohaddespour. Modeling hydrogen solubility in hydrocarbons using extreme gradient boosting and equations of state. Scientific Reports, 11(1):1–20, 2021.

Mohanad Mohammed, Henry Mwambi, Innocent B Mboya, Murtada K Elbashir, and Bernard Omolo. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Scientific Reports, 11(1):1–22, 2021.

Gretchen G Moisen. Classification and regression trees. In: Jørgensen, Sven Erik; Fath, Brian D. (Editor-in-Chief). Encyclopedia of Ecology, volume 1. Oxford, UK: Elsevier, pages 582–588, 2008.

Douglas C Montgomery and George C Runger. Multiple linear regression.
Applied Statistics and Probability for Engineers, pages 410–467, 2010.

Blue Nile. Choose your diamond. Blue Nile Education, 2022. https://www.bluenile.com/education/diamonds, Accessed on March 8, 2022.

FY Osisanwo, JET Akinsola, O Awodele, JO Hinmikaiye, O Olakanmi, and J Akinjobi. Supervised machine learning algorithms: classification and comparison. International Journal of Computer Trends and Technology (IJCTT), 48(3):128–138, 2017.

Avinash Chandra Pandey, Shubhangi Misra, and Mridul Saxena. Gold and diamond price prediction using enhanced ensemble learning. In 2019 Twelfth International Conference on Contemporary Computing (IC3), pages 1–4. IEEE, 2019.

Thearasak Phaladisailoed and Thanisa Numnonda. Machine learning models comparison for bitcoin price prediction. In 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE), pages 506–511. IEEE, 2018.

Fernando Salazar, MA Toledo, E Oñate, and R Morán. An empirical comparison of machine learning techniques for dam behaviour modelling. Structural Safety, 56:9–17, 2015.

Frank Scott and Aaron Yelowitz. Pricing anomalies in the market for diamonds: evidence of conformist behavior. Economic Inquiry, 48(2):353–368, 2010.

Cedric Seger. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing, 2018.

Garima Sharma, Vikas Tripathi, Manish Mahajan, and Awadhesh Kumar Srivastava. Comparative analysis of supervised models for diamond price prediction. In 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pages 1019–1022. IEEE, 2021.

Xiaowei Song, Arnold Mitnitski, Jafna Cox, and Kenneth Rockwood. Comparison of machine learning techniques with classical statistical models in predicting health outcomes. In MEDINFO 2004, pages 736–740. IOS Press, 2004.
Thanasis Vafeiadis, Konstantinos I Diamantaras, George Sarigiannidis, and K Ch Chatzisavvas. A comparison of machine learning techniques for customer churn prediction. Simulation Modelling Practice and Theory, 55:1–9, 2015.

Cort J Willmott and Kenji Matsuura. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1):79–82, 2005.

Jeffrey M Wooldridge. A note on computing R-squared and adjusted R-squared for trending and seasonal data. Economics Letters, 36(1):49–54, 1991.

Qingyao Wu, Yunming Ye, Haijun Zhang, Michael K Ng, and Shen-Shyang Ho. ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowledge-Based Systems, 67:105–116, 2014.

Zizhen Yao and Walter L Ruzzo. A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. In BMC Bioinformatics, volume 7, pages 1–11. BioMed Central, 2006.

Ole Sangale Rd, Madaraka Estate. PO Box 59857-00200, Nairobi, Kenya. Tel +254 (0)703 034000. Email admissions@strathmore.edu. www.strathmore.edu

30th May 2022

Mr Kigo Samuel, samuel.kigo@strathmore.edu

Dear Mr Kigo,

RE: Assessing Predictive Performance of Supervised Machine Learning Algorithms

This is to inform you that SU-IERC has reviewed and approved your above SU Masters' research proposal. Your application reference number is SU-IERC1352/22. The approval period is 30th May 2022 to 29th May 2023. This approval is subject to compliance with the following requirements:

i.
Only approved documents, including informed consents, study instruments and MTAs, will be used.
ii. All changes, including amendments, deviations, and violations, are to be submitted for review and approval by SU-IERC.
iii. Death and life-threatening problems and serious adverse events or unexpected adverse events, whether related or unrelated to the study, must be reported to SU-IERC within 48 hours of notification.
iv. Any changes, anticipated or otherwise, that may increase the risks to or affect the safety or welfare of study participants and others, or affect the integrity of the research, must be reported to SU-IERC within 48 hours.
v. Clearance for export of biological specimens must be obtained from the relevant institutions.
vi. Submission of a request for renewal of approval at least 60 days prior to expiry of the approval period. Attach a comprehensive progress report to support the renewal.
vii. Submission of an executive summary report within 90 days upon completion of the study to SU-IERC.

Prior to commencing your study, you will be expected to obtain a research license from the National Commission for Science, Technology, and Innovation (NACOSTI), https://research-portal.nacosti.go.ke/, and to obtain any other clearances needed.
Yours sincerely,

Dr Ben Ngoye, Secretary; SU-IERC

Cc: Prof Fred Were, Chairperson; SU-IERC

Appendix A

A.1 Ethical Review Committee Report

A.2 Similarity Report

Document Information
Analyzed document: Assessing Predictive Performance of Supervised Machine Learning Algorithms_136851.pdf (D139242561)
Submitted: 2022-06-03T20:10:00
Submitter email: Samuel.Kigo@strathmore.edu
Similarity: 1%
Analysis address: library.strath@analysis.urkund.com

Sources included in the report:
MLReport_Group116.pdf (D132354410): 2
5735819.pdf (D29041156): 1
QRM_Thesis_Niels_Nijdam_2573944.pdf (D109696012): 3