Temporal-difference Comparison of Learning Methods for

Stock Market Prediction

Stephen Gakuo Maina

Thesis presented in fulfillment of the academic requirement for the

degree of Masters in Statistical Science of Strathmore University

September 2021


Declaration and recommendation

Declaration

I declare that this work has not been previously submitted and approved for the award

of degree by this or any other university. To the best of my knowledge and belief, the

thesis contains no material previously published or written by another person except

where due reference is made in the thesis itself.

©No part of this thesis may be reproduced without the permission of the author and

Strathmore University.

Stephen Gakuo Maina

Signature . . . . . . . . . . . . . . . . . . . . . .

Date: 23-May-2021

Approval

The thesis of Stephen Gakuo Maina was reviewed and approved by the following:

Phd Elphas Luchemo Okango

Signature . . . . . . . . . . . . . . . . . . . . . .

Date: 23-May-2021

Strathmore University

Dr. Bernard Shibwabo

Director of Graduate Studies

Strathmore University

ii


Abstract
Background: A Stock/Securities exchange is considered to be among the primary

indicators’ of a country’s economic strength and development. Stock market prices

are volatile in nature and are affected by factors like inflation, economic growth, etc.

Prices depend heavily on demand and supply dynamics. Stock market price determi-

nation using ANNs has gained a lot of traction lately due to the obvious advantages

this would represent to traders. Most methods in use today have largely been based

on the feed forward algorithms, however, evolutionary techniques remain largely unex-

plored for this process despite their obvious robustness.

Method: Using data from the Nairobi Securities Exchange, and specifically the NSE20

Share Index, the project will seek to apply and compare traditional ANN techniques

for stock market prediction against the relatively new evolution algorithms. The Mean

Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE) and a confu-

sion matrix will be calculated for performance evaluation.

Results: The empirical results showed that the proposed evolutionary techniques out

performed classic artificial neural networks methods-feed forward backpropagation.

iii


Contents

List of Figures vi

List of Tables vii

Acknowledgment viii

Publications ix

Conference x

Abbreviations x

1 Introduction 1

1.1 Background to the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Artificial Neural Networks Models . . . . . . . . . . . . . . . . . . 2

1.2 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.1 Main Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.2 Specific Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature review 4

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Traditional Forecasting Methods-ARIMA . . . . . . . . . . . . . . 5

2.1.2 Modern Forecasting-ANN . . . . . . . . . . . . . . . . . . . . . . 6

2.1.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.4 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 8

3 Methodology 12

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

iv


3.2 Artificial Neural Networks Architecture . . . . . . . . . . . . . . . . . . . 13

3.2.1 Activation Functions Choice . . . . . . . . . . . . . . . . . . . . . 13

3.2.2 Overfitting Mitigation . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2.4 Model hyperparameters to consider . . . . . . . . . . . . . . . . . 17

3.2.5 Choice of layers of neural network . . . . . . . . . . . . . . . . . . 18

3.3 Algorithms under consideration . . . . . . . . . . . . . . . . . . . . . . . 19

3.3.1 Backward Propagation of Errors . . . . . . . . . . . . . . . . . . . 19

3.3.2 Genetic Algorithm - Differential Evolution . . . . . . . . . . . . . 23

3.3.3 Survivor Selection-Elitism . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5.1 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5.2 Missing observations and imputations . . . . . . . . . . . . . . . . 29

3.5.3 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5.4 Data Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Results 32

4.1 Results from Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.1 Backpropagation model results . . . . . . . . . . . . . . . . . . . 32

4.1.2 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2 Models compared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Discussions, Conclusions and Recommendations 36

Bibliography 37

v


List of Figures

Figure 2.1: Optimization algorithms compared . . . . . . . . . . . . . . . . . . 11

Figure 3.1: A Perceptron: Wiki-Images . . . . . . . . . . . . . . . . . . . . . . 13

Figure 3.2: Early Stopping-Wiki Image . . . . . . . . . . . . . . . . . . . . . . 16

Figure 3.3: A simple BPN Architecture . . . . . . . . . . . . . . . . . . . . . . 20

Figure 3.4: One point crossover: a random point in red followed by recombi-

nation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Figure 3.5: Multi point Crossover . . . . . . . . . . . . . . . . . . . . . . . . . 27

Figure 3.6: Graphical representation of missing observations . . . . . . . . . . 30

Figure 4.1: BPN Neural Network Configuration . . . . . . . . . . . . . . . . . 33

Figure 4.2: Real and Predicted EAB Share price values BPN Algorithm . . . . 33

Figure 4.3: Algorithm Progression of the GA Object . . . . . . . . . . . . . . . 34

vi


List of Tables

Table 4.1: Performance Metrics BPN GA-DE . . . . . . . . . . . . . . . . . . 35

vii


Acknowledgment
The success and final outcome of this project required undivided attention and guid-

ance from a whole host of people.

I would like first to thank my supervisor, Dr. Elphas Okango, who took interest in

this research and made available his expertise that was invaluable in formulating this

research project. Your insightful feed-back pushed me to sharpen my thinking and

brought my work to a higher level.

I would like to acknowledge my course mates from the university for their wonderful

collaboration. In addition, I would like to thank my wife for her understanding and

patience as I took many hours huddled in a corner of our house working late hours

on this project and my 3 Kids who always provided a timely call for me each time

I wanted to put my feet up by telling me ”daddy go and study”- I shall return the

favour one day.

viii


List of publications

ix


Abbreviations
ANNs - Artificial Neural Networks

AR - Autoregressive Model

AP - Autoregressive Process

Backpropagation - Backward Propagation of Errors

BPN- Backward Propagation of Errors

DE - Differential Evolution

DL - Deep Learning

GA - Genetic Algorithm

GA-DE - Genetic Algorithm-Differential Evolution

I - Intergrated

IDE - Integrated Development Environment

LMA - Levenberg-Marquardt Algorithm

MA - Moving Average

MAE - Mean Absolute Error

ML - Machine Learning

MSE - Mean Squared Error

NN- Neural Network

PMM - predictive mean matching

ReLU - Rectified Linear Unit

RMSE - Root Mean Squared Error

SGD - Stochastic Gradient Descent

SVM - Support Vector Machines

x


Chapter 1

Introduction

1.1 Background to the study

According to (26) stock market prediction is the deliberate act of trying to determine

the future value of a company’s stock or other financial instrument traded on an ex-

change. Here an exchange or stock market refers to a regulated platform for trading a

company’s derivatives e.g shares at an agreed price (24).

A lot of effort has been made in the field of stock market prediction, and a raft of

ideas and methods have been proposed and can be categorized into three broad schools

of thought. The first category, believes that no investor can achieve beyond average

trading advantages based on historical and present information. The second category,is

that of rudimentary analysis, whereby industry experts undertake an in-depth study of

macro-economic factors and prevailing financial conditions, this allows them to figure

out the extent of correlation that may exist with changing stock prices. A third view is

that of recurring patterns in the markets, these patterns can be identified and used for

predictions (16).

However, these three techniques yield unreliable results mainly caused by their over re-

liance to human judgement, which fails to address the inherent nature of stock market

indices that are by nature dynamic, non-linear, and non-parametric. (16).

To address these drawbacks, the recent trend has been to develop statistical models to

forecast financial data, with various technologies such as ANNs, SVMs taking center

stage. Forecasting is a major application area of ANNs algorithms. There has been an

increasing interest in forecasting using artificial intelligent systems in recent years. The

importance of forecasting as a subject is underscored by the diversity of its applications

in a wide range of disciplines ranging from business to engineering. Having the ability to

accurately predict the future (within a margin of error) is fundamental to many decision

processes in planning, scheduling, purchasing, strategy formulation, policy making, and

1


stock price (2). Forecasting has been dominated by linear methods for many decades.

Linear methods are easy to develop and implement and they are also relatively simple to

understand and interpret. However, linear models have serious limitation in that they

are not able to capture any nonlinear relationships in data (2).

The approximation of linear models to complicated nonlinear relationships is not al-

ways satisfactory. In the early days of forecasting, large-scale competitions (known as

M-competitions) were organized to test the commonly used linear methods with real

time series data. The mixed results obtained from these competitions, showed that no

single linear model is universally the best, which we can now inter-prate as the failure

of linear models in accounting for a varying degree of non-linearity that is common in

real world data (17).

Stock market prediction is one field that is gaining a lot of traction lately from re-

searchers due to its obvious commercial applications, the high stakes and the kinds of

attractive benefits that it has to offer. (16).

1.1.1 Artificial Neural Networks Models

Artificial neural networks are self learning mechanisms that are able to work in tandem

with input variables, a feature that allows them to handle large sets of data (1). They

mimic the neural structure of the brain, by basically learning from experience, brains

store information as patterns.

The principal advantage of artificial neural networks is their ability to find patterns

and approximate any nonlinear function to a very high degree of accuracy(16), this in

turn makes them ideal models compared to fore asters, this is because they poses an

inherently nonlinear structure useful for capturing the complex underlying relationship

in many real world problems (1). They are not only capable of finding nonlinear struc-

tures in a task, but can also be used to model linear processes. A number of researchers

have shown the capability of neural networks in modelling linear time series (11).

2


1.2 Statement of the Problem

Backward Propagation of Errors algorithm requires that in all instances the gradient

to be defined, in other words the task should be differentiable, but not all data in the

context of the real world are, most practical problems have objective functions that are

non-differentiable, noisy, are non-linear in nature and have many local minima. Faced

with such problems it is impossible to solve them analytically. Differential Evolution

algorithm can be applied to find approximate solutions to such problems.

1.3 Objectives of the Study

1.3.1 Main Objective

The main objective of this project is to review and compare existing Artificial Neural

Network Algorithms for stock market prediction and gauge their efficacy.

1.3.2 Specific Objective

1. Evaluate the feasibility of Machine Learning Algorithms to predict stock market

prices in order to make more informed and accurate investment decisions.

2. Assess the accuracy and performance of ANNs learning algorithms in stock market

price prediction.

3


Chapter 2

Literature review

2.1 Introduction

A significant challenge facing prospectors is to accurately predict the index of a stock

market in a limited amount of time in order to make credible investment decisions that

will yield profits on invested monies. Historical data points to the fact that history

repeats itself and thus an investor can use this proxy as a prediction equivalent, but this

association is feeble and rarely yields tangible results.

To this effect various techniques have been formulated to investigate and come up with

more scientific models to explain the relationship between historical and present data to

determine tomorrows prices. These techniques can classified into two broad arenas: Fun-

damental Analysis- mainly focuses on external economic factors and prevailing events,

it’s basis being the law of supply and demand which in turn makes prices go up, low

or stay the same. Technical Analysis- has it’s focus set on market movements, through

the use of data analysis to find patterns that can be used to forecast future trends on

prices. the two main techniques are different in that the former studies the cause of

market action, while the latter looks at effects(12). They are however disadvantaged by

the nonlinear nature of most data (1).

More recently, artificial intelligence techniques and hybrid intelligent systems have been

mooted to tackle limitations of traditional statistical methods in nonlinear and time

variant problems of finance data. One of the artificial intelligence technologies is artifi-

cial neural networks (ANN), which is a supervised, self-learning technique, that can find

the smooth approximation between input and output data and recognize new patterns

not present in current data(12).

BPN neural network is the widely used method for stock market price forecasting, it

applies a multilayer model whose main principle lies in spreading the internal network

error by applying the gradient descent technique to lower the empirical network error.

4


The performance of this method is affected by the choice of hidden layers-how many neu-

rons will be selected, the transfer function between layers and the choice of parameters

(12).

2.1.1 Traditional Forecasting Methods-ARIMA

The ARIMA model was introduced by Box and Jenkins(1976) and has shown efficiency

in generating short- term forecasts. In this model, the future value of a given variable

is a linear combination of past values and past errors (3). Its an univariate method as

it uses previous values of a variable to predict future values of the same variable.

The model is comprised of three component functions: AR (p), the number of lag ob-

servations or autoregressive terms in the model; I (d), the difference in the nonseasonal

observations; and MA (q), the size of the moving average window. The model order is

depicted as (p,d,q) with values for the order or number of times the function occurs in

running the model, a zero value is admissible.

The model uses differenced data to make the data stationary, that results in consistency

of the data over time. This function mitigates the effect of trends or seasonality, such

as market or economic data.

However, the model suffers from major draw backs in prediction in that it makes an as-

sumption of constant variance, most financial data are however volatile and this feature

of the data can’t be fulfilled under this assumption, in addition to this, the symmetric

joint distribution requirement of the stationary models does not fit data with strong

asymmetry a feature all to common in financial data.Thus the model is more suitable

for use with data that exhibit a negligible amount of sudden burst of large amplitude

at irregular time epochs. These limitations of the integrated ARMA models in fore-

casting time series data lead us to models where we can retain and allow data to be

non-Gaussian or abandon the linearity assumption.

5


2.1.2 Modern Forecasting-ANN

The first step toward artificial neural networks came in 1943 when (18), a neuro-

physiologist, and a young mathematician, Walter Pitts, wrote a paper on how neurons

might work. They modeled a simple neural network with electrical circuits.

Donald Hebb, in his 1949 book Organization of Behavior, pointed out that neural path-

ways are strengthened out each they are used.

Artificial Neural Networks use and application has gained a lot of interest in the recent

past, their applicability in the arena of prediction and classification has endeared them

even further. These networks are practical attempts aimed at mimicking the human

brain by design and implementation, normally, the brain is capable of recognizing pat-

terns even in the presence of statistical noise (unexplained variability in a data sample),

form abstract ideas, have fault tolerance and be able to recall memories that allow it to

make judgments on situations never before encountered (27). These networks are now

being co-opted in fields where traditional statistical methods have been widely used be-

fore, for instance (8) used them in a classification problem to identify underwater sonar

contacts, while (5) used them to predict heart problems in subjects by visual interpreta-

tion of nuclear images. Neural networks have also been used in speech recognition (14),

in application of time series, neural networks have been used in the prediction of stock

market performance as outlined by (10).

These problems are usually solved by the application of basic statistical models such

as logistic regression, discriminant analysis and whole host of other classifiers such as

the Bayes. A handful of researchers have carried out studies to compare these statis-

tical methods and neural networks with the main aims being to gauge performance on

application to specific problems.

2.1.3 Related Works

The main focus of predictive systems currently is to develop machine learning methods

with higher accuracy and efficiency for financial markets.

(6) used an ensemble system anchored on genetic algorithm. The system consisted of

10 SVM classifiers, each classifier had it’s own set of inputs(technical indicators) and

6


parameters, their outputs were combined by majority voting. The genetic algorithm

was used to select best features(variables) and parameter optimization for each SVM

classifier in stock market price prediction for the Sao Paulo stock exchange. Their ex-

perimental results showed that their proposed model was more accurate as opposed

to methods such as bagging, SVM, Random Forest with the only drawback being the

length of time it took the model to run.

(21) proposed the use of hybrid forecasting models to predict (t+1) day closing price,

they used technical indicators as inputs and applied support vector regression in conver-

sion of technical indicator data, to predict the closing day price of the (t+1) day they

applied the following models support vector regression, ANN and random forest and

found out that the accuracy of two stage models was higher than that of single models

when the number of predicted points increased.

(28) proposed a forecasting model by applying the genetic algorithm in sequential steps,

first to get optimal weights for backpropagation network, then they used back prop-

agation neural network, after that they trained and tested the network using optimal

weights of the best fit chromosome, to predict close price of Shanghai index along ten

days. GA was then used in order to improve the speed of convergence and to overcome

overfitting problem because, as known BPN has two drawbacks, it falls easily into local

minimum and has slow convergence. Also they used BP single model to compare it with

the GA-BP proposed model results. These team proved that experimental results of the

GA-BP had achieved the satisfactory accuracy of Shanghai composite index prediction.

(Jena and Padhy) proposed a hybrid forecasting model which entailed the application of

a genetic algorithm to choose the best parameters for Support Vector Machine classifier,

and proved that the proposed hybrid model GA-SVM is better than the single model

of support vector machine (SVM) in predicting the prices of the Indian stock market

indexes. The GA-SVM model improved the prediction accuracy as measured through

DS, MAE and MSE.

We intend to add to this literature by way of contrasting a simple statistical regres-

sion model against a neural network tool, linear models have been used extensively

for forecasting due to their ease in developing, implementing and at best require basic

knowledge of statistics to interpret and understand.

7


2.1.4 Optimization Algorithms

Most deep learning algorithms and methods involve some sort of optimization, optimiza-

tion refers the process of minimizing or maximising some function f(x) by altering the

constituent variable x (7). The function we want to minimize or maximize is referred

to as the objective function, when we are specifically minimizing it we refer to it as the

cost, loss, or error function.

To train neural networks for these tasks we apply optimization algorithms which are

basically learning procedures which look for a set of parameters for which the loss in-

dex/function for the network will be at a minimum. There exists a large number of

optimization algorithms each with it’s own characteristics and numerical precision;

• Learning problem algorithm: measures the performance of a network on a data

set and is formulated in terms of a minimization of a loss function, f, this loss

function is composed of a regularization term that is used to control over fitting

by controlling the complexity of the network and an error term that evaluates how

well the network fits the data set.

• Gradient descent algorithm: it is a first order method that requires information

from the gradient vector. For this method we start at a point x(0) until a stopping

criteria is meet. That is, we move from x(i) to x(i+1) in the training direction

w(i)=-y(i). All we are doing in this method is iterating x(i+1) = x(i) − g(i)n(i) for

i=0,1,..., the parameter n is the learning rate whose value can either be set to a

fixed value or calculated by one-dimensional optimization at each training step.

This method is best applied when we have a huge neural network with thousands

of parameters, since it stores the gradient vector as opposed to the hessian matrix.

• Newton method: considered to be a second order algorithm as it makes use of

the Hessian matrix, it’s purpose is to use the second order derivatives of the loss

function to find more efficient training direction, mathematically we have this

algorithm expressed as follows,if we let;

f(W (i)) = f (i),∀f(W (i)) = g(i)

8


and

Hf(W (i)) = H(i)

H(0) is the hessian matrix of f evaluated at a point W (0), if we set g to be equal

to zero as the minimum, we get the equation

g = g(0) +H(0).(W −W (0)) = 0

with an initial vector W (0), this method iterates

W (i+1) = W (i) − (H(i)−1.g(i))n

, the value n ensures that the newton step moves towards a minimum if the hessian

matrix is not positive definite.

• Conjugate gradient: is an intermediate method between gradient descent and

newton’s method that was formulated to deal with slow convergence issue that

ails the gradient descent optimization algorithm, and also avoids some of the

information needs associated in the evaluation, storage and inversion of the Hessian

matrix a prerequisite of the Newton method.

In this training algorithm, the search occurs along conjugated directions with

respect to the Hessian matrix, assume starting with an initial parameter vector

W (0) and an initial training direction vector t(0) = −g(0).

The method proceeds to construct successive training directions, expressed as

follows

t(i+1) = g(i+1) + d(i).γ(i),

for i=0,1,... and γ is referred to as the conjugate parameter. NB:the training

direction is reset to negative of the gradient at regular intervals for all conjugate

gradient algorithms and the parameters improved by applying

W (i+1) = W (i+1) + d(i).η(i),

η is the training rate calculated by line minimization.

• Quasi-Newton Method: was formulated to deal with the high computational re-

9


quirements of Newton method in evaluating the Hessian matrix and its inverse.

This method comes up with an approximation of the inverse Hessian at each it-

eration of the algorithm, by using information derived by the first derivative of

the loss function. Thus, the idea behind this method is that of approximating the

Hessian by use of another matrix by using only the first partial derivatives of the

loss index. It is expressed as;

W (i+1) = W (i) − (G(i).g(i).η(i))

The training rate η can either be set to a fixed value or found by line minimization.

• Levenberg-Marquardt algorithm: was made to work with loss functions that take

the form of a sum of squared errors. it utilizes the Jacobian matrix and the

gradient vector. We can express the sum of squared errors of a loss fuction as

follows

f =
n∑

i=1

e2i

n is the instances in data.

The Jacobian matrix of the loss index contains the derivatives of the errors con-

cerning the parameters.

Ji,j =
δei
δWj

for i=1,...,k and j=1,...,l, with k being the number of instances in the data set

and l the number of parameters in the neural network. The jacobian matrix is of

dimension k.l. To compute the gradient vector for this method we:

∇f = 2JT .ε

with ε being a vector of the error terms. The Hessian matrix under this method

is derived by the following expression

Hf ≈ 2JT .J + λI

, here lambda is known as the dampening factor ensuring the Hessian is positive

and I is the identity matrix.

10


Performance comparison of the above algorithms is depicted in the chart below, the slow-

est method is the gradient descent method requiring less memory but a slow convergence

speed, LMA is fast but requires a substantial amount of computational memory.

Figure 2.1: Optimization algorithms compared

In conclusion the choice of optimization algorithm in our neural network is dependent

on the number of parameters, if in thousands gradient descent or the conjugate gradient

algorithms are preferred, for a few hundreds of parameters the best choice will be the

LMA.

For this research the gradient descent optimization will be applied to the NSE20 share

index during BPN and GA-DE optimization due to the fact that it utilizes a gradient

vector.

Stock prices can be regarded as a random time sequence characterized by unexplained

variability/noise, artificial neural networks are large-scale nonlinear parallel processing

systems that depend on their own intrinsic link data, they provide methods and tech-

niques that can be used to approximate any nonlinear continuous function, without

making any assumptions about the nature of the generating process (22).

11


Chapter 3

Methodology

3.1 Introduction

Machine learning algorithms can be viewed as statistical computer programs that auto-

matically improve in performance as they acquire more and more experience, they work

by building mathematical models based on a given set of data, and use this models

synthesized from data to make predictions or decisions about scenarios not experienced

before hand. For instance an algorithm intending to perform classification, might be

taught to identify the features of an object and assigns a weight to that attribute based

on how valuable it is in correctly identifying the object, this allows it, to within a certain

degree of accuracy identify a similar object in the future.

We have two main problems that we aim to resolve under the category of supervised

machine Learning;

i. Classification: algorithms in use this field aim to estimate the mapping function

f(x) from some input variable(s) x to a discrete or categorical output variable(s)

y.

ii. Regression: while in regression we employ algorithms to learn and estimate the

mapping function f(x) from the input variable(s) x to a numerical or continuous

output variable(s) y.

In this research our focus will be on regression using artificial neural networks. A simple

regression equation is of the form:

Ŷ = w1X1 + w2X2 + ...+ wnXn + b

where: X ′ns are a set of inputs, W ′
nS correspond to weights, b is the bias, Ŷ is the the

output.

12


3.2 Artificial Neural Networks Architecture

The basic computational unit of the human brain is the neuron, these neurons number

1011 in the human brain, they work parallel to one another but are highly interconnected

for information sharing. For ANNs the single neuron network or the perceptron is the

fundamental building block. In essence, we have in a neural network a network of inter-

connected processing units, such networks allow us to extract meaningful information

as well as detect hidden patterns from complex data sets.

The single neuron network is the basic unit of an artificial neural network, it is referred

to as the perceptron.

Figure 3.1: A Perceptron: Wiki-Images

• Where the xi’s are a set of inputs that are multiplied by a corresponding weight

wi and the result summed up.

• a bias term w0 is then added to the weighted sum and passed through a non linear

activation function to yield the desired output Ŷ .

Therefore, to compute the output of a perceptron: we first compute the dot product, we

then add a bias term then take a non-linearity (Activation Functions in Deep Learning).

3.2.1 Activation Functions Choice

The purpose of activation functions is to introduce non-linearity’s that allow us to

approximate arbitrarily complex functions in a network, this is because in real world

13

https://xzz201920.medium.com/activation-functions-linear-non-linear-in-deep-learning-relu-sigmoid-softmax-swish-leaky-relu-a6333be712ea


applications of ANNs, data is almost always non-linear.

Activation functions are in essence mathematical equations that determine the output of

a neural network, this functions are attached to neurons and determine if each neuron

will be fired or activated, their main attribute is that they must be computationally

efficient as they are calculated across a large of neurons in a given data sample. Their

are awash of activation functions to choose from in the application of Neural Networks.

This choice in the hidden layer controls how well the the network learns from the training

data set and consequently, the choice in the output layer is determined by the type of

predictions the model will make.

A differentiable nonlinear activation function is typically applied in the hidden layers of

a neural network as it allows the model to learn more complex functions (4).

3.2.1.1 Sigmoid Function

A mathematical function that produces a sigmoidal curve, makes a model logistic in

nature for predicting probabilities since it squashes the input to values between 0 and

1.

f(x) =
1

1 + exp(−x)

However, this method suffers from the following disadvantages;

1. since it uses a logistic model, the computations are time consuming and complex,

2. it causes gradients to vanish and no signals pass through the neurons after some

point of time

3. It is slow in convergence

4. It is not zero centered Ciaburro and Venkateswaran

3.2.1.2 Hyperbolic Tangent

Is a scaled sigmoid function with a stronger gradient, nonlinear in nature, defined in the

range of values (-1,1). It is calculated as follows;

f(x) = tanh(x)

14


(4)

3.2.1.3 Rectified Linear Unit-ReLu

The most widely used activation function as it is relatively easy to compute, has non-

linearity at k=0. It is defined as follows;

f(k) = 0, k < 0, and k when k is greater or equal to 0

. The range of output is between zero and infinity.

The choice of activation has to be robust enough to satisfy the following, it should be

differentiable, that is, it should cause gradients to disappear, it should be fast processing

and simple, and should not be zero centered.

In this research we will use the ReLu activation funtion, justification: it is simple

and fast to process, it does not suffer from the vanishing gradient problem exhibited by

sigmoid and tanh functions. it is widely used for DL algorithms(4).

3.2.1.4 linear function

Is the simplest activation function, and is commonly used for the output layer activation

in most deep learning neural network problems, is characterised by the output being the

same as the input. Derived as follows y = f(x) = x and is defined in the range of −∞

to ∞ (4). For this research the linear activation function will be applied at the output

layer for the BPN algorithm.

3.2.2 Overfitting Mitigation

A major consideration when fitting an ANN is overfitting mitigation, basically, overfit-

ting is a modelling error that occurs when a model is too closely aligned to a limited

set of data points, which causes such a model/network to perform poorly on an unseen

data set. Several mitigation techniques for this error exist as detailed below;

• Model simplification- main cause of overfitting is a complex model with too many

parameters, this means a model complexity can be reduced by reducing the number

15


of neurons or removing layers that in turn makes the model small, this ensures

variables that have maximum variability on the outcome are feed to the network.

• Early stopping- simply stop as soon as the validation loss rises above a maximum

threshold, these methods work to improve a model fitness with each iteration by

providing rules that guide on how many iterations can be done prior to the model

overfitting.

Figure 3.2: Early Stopping-Wiki Image

• Regularization- a techniques that is also used to reduce a model complexity by

adding a penalty term to the loss function. Most common regularization methods

are L1 and L2.

– L1 Aims to reduce the absolute value of weights, it’s mathematically repre-

sented as;

L(x, y) =
n∑

i=1

(yi − h0(xi))2 + λ
n∑

i=1

|θi|

We finally end up with a model that is simple, interpretable and robust to

outliers.

– L2 In turn aims at minimizing the squared magnitude of the weights, it’s

mathematically represented as;

L(x, y) =
n∑

i=1

(yi − h0(xi))2 + λ
n∑

i=1

θ2i

Allows us to learn complex data patterns.

16


• Dropouts Technique- is a regularization technique that modifies the neural network

by randomly dropping neurons during model training after each iteration. The

overarching idea here being that when a set of neurons are dropped, the subsequent

run is like training a new network all together, thus the net effect of dropouts will

reduce overfitting by ensuring the network is not over reliant on any one path, and

therefore develops a whole ensemble of different paths.

• Batching- Computing the gradient is computationally expensive, more so with

modern data sets that are ever lengthy, thus it is not feasible to use all data points

to compute the gradient, to overcome this we batch our data into mini batches

of say K data points, from this we can get a true estimate of the gradient by

averaging the gradients from each of these sets of data points (4).

For this research the early stopping overfitting method was applied for both the BPN

algorithm and Genetic Algorithm method.

3.2.3 Loss Function

The loss function aspect is applied during model training, we wish to find the neural

network weights W, that by which we achieve the lowest empirical loss over the training

data set. Loss optimization formula will be as below;

W ∗ = argmin
1

n

n∑
i=1

L(f(xi;W ), yi)

In summary, the loss function of the gradient descent method tells us how good the

model is at making predictions for a given set of parameters while, it’s slope tells us

how to update our parameters to make the model more accurate (23).

3.2.4 Model hyperparameters to consider

As alluded to earlier hyperparameters in ANNs are crucial variables that the statistician

has to decide on before execution of the network, such as;

Batch - which defines the number of samples to work through before updating the

17


internal model parameters. At the end of each batch, a comparison is made between

the predictions and the expected output variables, an error metric is then calculated,

from this a decision is made to update the algorithm and consequently the model.

Epoch- defines the number of times a learning algorithm will go through the entire

training data set, it gives each sample in the training set an almost equal chance to

update the internal model parameters. An epoch is made up of one or more batches but

it should be sufficiently large to allow a learning algorithm run until the model error is

minimized.

3.2.5 Choice of layers of neural network

By design most neural networks have three types of layers;

1. Input layer - every neural network has exactly one input layer, the number of

neurons is completely and uniquely determined by the structure of the training

data. Specifically, the number of neurons/nodes comprising this layer is equal to

the number of features in your data, in some configurations though we can add

one additional node for a bias term.

2. Hidden Layer - is the most dynamic layer in a neural network, it is referred to

as hidden since it is not observable and has to be learned, if the data under

investigation is linearly separable then a hidden layer is not needed, the optimal

size of the hidden layer is usually between the size of the input and size of the

output layers, (i) number of hidden layers equals one; and (ii) the number of

neurons in that layer is the mean of the neurons in the input and output layers

(20).

For this research the number of neurons was calculated as follows;

Numneurons =
2(N + 1)

3

, N is the number of neurons in the input layer, applying this equation gave me 3

neurons in the hidden layer for the BPN algorithm in this research.

3. Output layer - similar to the input layer every neural network has exactly one

18


output layer, its size is determined by the model configuration, that is, if the NN

is a regressor, then a single node will suffice for this layer. In case the NN is a

classifier, it will also use a single node, if softmax is used then the output layer

will have one node per class label in your model.

The choice of layers in a ANN determines it’s accuracy and convergence speed.

3.3 Algorithms under consideration

The algorithms under review in this research try to improve this Ŷ by adjusting the

weights iteratively and calculating a loss function, given by

MSE = 1
2
∗ [(Yactual − Ypred)]2.

I propose to use the Backpropagation and the Genetic algorithm-Differential Evolution

method to minimize this loss function.

3.3.1 Backward Propagation of Errors

Backpropagation is a supervised learning algorithm for ANNs that employ the gradient

descent method to reduce some error function called the loss function, done by repeatedly

adjusting the weights of the connections in a network. (23).

19


Figure 3.3: A simple BPN Architecture

3.3.1.1 Gradient Descent

Gradient descent, is an optimization method that is used to update model paramaters of

a feed-forward back-propagation algorithm, it minimizes the loss function by iteratively

moving in the direction of the steepest descent as defined by the negative of the gradient.

Our main aim being to achieve a predicted value that is close as possible to the original

value. This method assumes a global minimum value that must be found in order to

find the minimum loss function (23). This is only achievable if our loss function graph

assumes a parabolic shape. It is important to note that predicted values are dependent

on the weights, at the initial stage, the model assigns arbitrary weights to features until

a minimum loss function is achieved.

So, how does the the algorithm know; it needs to optimize weights to minimize the error,

in essence what we need to check is how the error varies with weight.

To do this we find a derivative of error with respect to weight, this is what we refer to

as a gradient;

Gradient =
δE

δW

20


The negative of the gradient will always show the direction along which the weights

should be moved in order to optimize the loss function, once the movement direction

is established, we then have the model determine the rate of movement. This rate is

referred to as the learning rate, denoted by α. This in essence is the distance to be

moved form point a to a new location b.

dx = α ∗ | δE
δW
|

A high learning rate α implies that we cover more ground per step but we risk overshoot-

ing our lowest point, whereas a low learning rate is more precise but computationally

expensive. Our new weight is thus given by;

Wj(new) = Wj(old) + α ∗ e(i) ∗ δY

δWj

(19)

3.3.1.2 Choice of learning rate, η

The learning rate eta (η) is the most important hyper-parameter (a hyper-parameter is

a parameter that needs to be chosen by the programmer before executing a ML algo-

rithm), several methods exist that aid in the determination of this rate;

Learning rate annealing- presupposes starting with a high learning rate and then

gradually reducing the learning rate linearly during training, with the rate decreasing

to a value close to 0. The intuition behind this approach is that we’d like to move

quickly from the initial parameters to a range of optimal values but then we’d like a

learning rate small enough that we can explore the ”deeper, but narrower parts of the

loss function”. Step decay is the most popular form of learning rate annealing, here the

learning rate is reduced by some margin, usually a percentage after a fixed number of

training epochs.

Cyclical learning rates- (25) in her paper proposed that instead of monotonically de-

creasing the learning rate it should be varied cyclically between two reasonable boundary

21


values. The general schedule can be written as

ηt = ηmin + (ηmax − ηmin)(max(0, 1− x))

where x is expressed as

x =

∣∣∣∣iterationsstepsize
− 2(cycle) + 1

∣∣∣∣
the cycle is calculated as follows

cycle = floor

(
1 +

iterations

2(stepsize)

)
,

where ηmin and ηmax define the bounds of our learning rate, iterations represents the

number of completed mini-batches, step size defines one half of a cycle length, 1x should

always be positive.

Stochastic Gradient Descent with Warm Restarts (SGDR)- is where an anneal-

ing schedule is combined with periodic ”restarts” to the initial starting learning rate η

the restart is warm as the optimization does not start from scratch but from the pa-

rameters to which the model converged during the last step (15). This method relies on

an aggressive cosine annealing function to lower the learning rate, the function varies

between -1 and 1, mathematically it can be expressed as;

ηk = ηimin +
1

2
(ηimax − ηimin)

(
1 + cos

(
Tcurrentπ

Ti

))

, where ηk is the learning rate at time-step k , ηimin and ηimax define the range of desired

learning rates, Tcurrent represents the number of epochs since the last restart, calculated

at every iteration and can only take on fractional values, and Ti is number of epochs

in an cycle. Empirically it has been found that SGD with warm restarts requires fewer

epochs up to 50% fewer than learning rate annealing method and achieves comparable

or better performance.

To select a good learning rate might mean starting off with different values and seeing

which one gives you the best loss without sacrificing the speed of training. We shall

implement the BPN algorithm in R by the use of the package neuralnet.

22


3.3.2 Genetic Algorithm - Differential Evolution

Genetic algorithms are based on the biological theory of evolution by natural selection

formulated by Charles Darwin, he postulated that organisms evolve over generations,

and that they do this by inheriting physical or behavioral traits from a previous gener-

ation through the process of natural selection. This is the process through which living

organisms change and adapt. Therefore, individuals with better suited traits are more

likely to survive and reproduce, and through this favorable traits are thus transmitted

from one generation to the next.

GA in ANN are an implementation of this evolutionary technique observed in nature,

they start with a set of candidate solutions referred to as a population, solutions in a

population are then used to form a new population, the idea being that this new popu-

lation will be superior to it’s predecessor, the selection is based on a fitness criteria, the

process is repeated until some user defined condition is meet.

The basic parameters of a genetic algorithm are as follows;

• Initialization: a random population of chromosomes of size n is generated, these

are the candidate solutions to a problem.

• Fitness: each chromosome in the populous, is evaluated for fitness

• Birth of a new population: a new population is created by following the steps

below

– selection: an encoding is chosen

– crossover: uses genes from parent chromosome to create a new offsprig

– Mutation: prevent solutions from falling into a local minima prematurely

– Acceptance: places an offspring into a new population

• Replace: Use new generated population for a further run of algorithm

• Testing: stop if criteria is meet, and return the best solution in current population,

else loop to fitness step.

Problem solving in this aspect can be of regarded as looking for extreme of a function.

Some function will be given and GA will try to find a minimum of the function.

23


In this research I am proposing to use Differential Evolution, which is a type of

genetic algorithm, that use biologically inspired techniques to solve optimization prob-

lems that seek to find the best solution from all feasible solutions over a continuous

search space, it does this by improving a problem by iteratively trying to improve a can-

didate solution based on an evolutionary process of crossover, mutation and selection

over the course of successive generations (9). This algorithm makes few or no assump-

tions about the underlying optimization problem. It can find the true global minimum

despite the initial parameter values, it does not use the gradient of the problem being

optimized which means it does not require the problem at hand to be differentiable.

3.3.2.1 How does it work?

A. Initialization: it sets a population vector of candidate solutions of size NP; candidate

solutions or agents refer to a set of all possible solutions available for an optimization

problem that satisfy the problem constraints.

Xi,G = [X1
i ,G ................X

D
i ,G ]

where i=1,...,NP represents the total population. The parameter values are selected

randomly and uniformly between Xmin and Xmax. To generate the initial value of the

jth parameter for an individual i at a generation Y=0, we sample;

Xj
i 0 = Xj

min + rand(0, 1) ∗ (Xj
max −Xj

min)

B. Mutation: By applying the mutation operator these agents are then moved around

in a search space by mathematical formulae to combine the positions of existing agents

from the population, one mutant vector Vi,G is produced, that is, for each target vector

Xi,G for generation G,it’s mutant vector is

Vi,G = [V 1
i ,G ...V

D
i , G]

. Mutations are small random tweaks in the chromosome aimed at getting new solutions,

types of mutation include;

24


• Bit flip mutation- one or more random bits are selected then flipped, mainly ap-

plied in binary encoded genetic algorithms.

• Random resetting- mainly applied in integer representations, a random value from

the set of permissible values is assigned to a randomly chosen gene.

• Swap Mutation- 2 positions on the chromosome are chosen at random, the values

at this positions are then interchanged, commonly applied in permutations based

encoding.

• scramble mutation- from the entire chromosome, a subset of genes is chosen and

their values are scrambled or shuffled randomly.

• Inverse Mutation- a subset of genes are chosen, then the entire string in that subset

is inverted.

C. Selection: We then compare the target vector Xi,G with the trial vector Ui,G +1,

the one with the lowest function value is assimilated into the next generation, that

is, we accept an agent if the new position is an improvement, it then forms part of

the population, else we discard this agent(s). DE does these repeatedly until we get a

satisfactory solution which is however not guaranteed. Lets assume that f : Rn −→ R is

a function that we plan to reduce/minimize. This function, takes a candidate solution

as an argument in the form of a vector of R, it the outputs a R which is now a measure

of fitness of the given candidate solution. The gradient of this solution is not known.

Thus, the aim is to find a solution m for which f(m) ≥ f(p) ∀ P in the search-space,

this implies that m is the global minimum.

Their several ways to implement selection in genetic algorithms;

• Tournament selection- several tournaments are played among a few individuals

whose selection is done randomly prior, the winner of each tournament is selected

to be in the generation.

• Roulette wheel and proportionate selection- every individual can become a parent

with a probability which is proportional to its fitness. Thus, fitter individuals have

a higher chance of mating and propagating their features to the next generation,

25


the selection strategy applies a selection pressure to the more fit individuals in the

population, evolving better individuals over time. It’s implemented as follows;

– Calculate S = sum of a finesses.

– Generate a random number between 0 and S.

– Starting from the top of the population, keep adding the finesses to the partial

sum P, till P < S.

– The individual for which P exceeds S is the chosen individual.

• Rank Selection- works with negative fitness values and is mainly used when the in-

dividuals in the population have very close fitness values, whereby, each individual

no matter how fit relative to each other has an approximately same probability

of getting selected as a parent, consequently, it leads to a loss in the selection

pressure towards fitter individuals, making the GA to make poor parent selection.

To overcome this, the concept of fitness is abandoned when selecting a parent and

rank every individual in the populous according to their fitness, the higher ranked

individuals are then preferred.

• Steady state selection- we use a few good chromosomes to create new off springs

after every iteration, the bad chromosomes are discarded and replaced with the

new offspring, the rest of the population migrates over to the next generation

without the need of going through the selection process.

D. Recombination/Crossover: A trial vector Ui,G will be generated from the elements

of the target vector pair and its corresponding mutant vector this after going through

the recombination process.

Ui,G = [u1i ,G , u
2
i ,G , ......, u

D
i ,G ]

In GA crossover is applied with high probability and is used to create new solutions

from existing ones after the application of the selection operator. Crossover mechanisms

include

• One point crossover- we select randomly a crossover point and swap the tails of

the two parents to form new off-springs.

26


Figure 3.4: One point crossover: a random point in red followed by recombination

• Multi point crossover-is a generalization of the one-point crossover wherein alter-

nating segments are swapped to get new off-springs.

Figure 3.5: Multi point Crossover

• uniform crossover- the chromosome in this method are not divided up into seg-

ments, rather we treat each gene as a separate entity, essentially a probability is

used to decide whether or not a chromosome will be included in the off-spring.

• Whole Arithmetic Recombination- commonly used for integer representation, takes

the weighted average of the two parents by using the following formulae

Child1 = α.x+ (1− α).y

Child2 = α.x+ (1− α).y

if α = 0.5, both children will be identical.

3.3.3 Survivor Selection-Elitism

This GA policy determines which individuals are to be kicked out and which are to be

kept in the next generation. It ensures the fittest individuals are not kicked out of the

population that is is they do not undergo mutation and are propagated as they are, and

at the same time diversity should be maintained in the population.

GA’s employ Elitism as the survivor selection policy, whose main principle is that

current fittest member of the population is always propagated to the next generation.

Elitism in our ANN is defined as a percentage or a number.

27


I shall use the R package GA to implement the differential evolution algorithm in this

research.

3.4 Performance Evaluation

Algorithm evaluation will be in terms of accuracy assessment that will be quantified by

calculating the metrics below;

• Root Mean Squared Error (RMSE)- is an absolute measure of fit and is defined as

the square root of the variance of the residuals, when applied it will indicate the

absolute fit of the model to the data; how close are the observed data points to to

the model predicted values, it is always in the same units as the response variable.

The lower the values of RMSE the better the model fits data.

RMSE =

√√√√ 1

n

n∑
x=1

(
At − Pt

At

)

• Mean Absolute Error (MAE)- gives the average magnitude of the errors in a set of

predictions, does this without considering their directions. That is, average over

the test set of the absolute differences between predicted and actual observations

with all individual differences having equal weight.

MAE =
1

n

n∑
x=1

|At − Pt|
|At|

• Mean Squared Error (MSE)- the MSE of these estimator models will measure the

average squared difference between predicted values and the actual values.

MSE =
1

n

n∑
x=1

(At − Pt)
2

Where n is the number of predicted points, At is the actual recorded stock price and Pt

is the predicted stock price value.

Prediction performance was evaluated through these performance measures; MAE, MSE,

and RMSE with the interpretation being that the smaller the values of these measures

28


the closer the predicted time series values are to the actual values, and thus the better

the model.

3.5 Data Description

The Data Set in use in this research is from the Nairobi Stock Exchange, NSE20 Share

Index to be precise, the data is observed for a period of 10 years from January 2011

to January 2021, collected in 5 day weekly intervals with an exception of weekends and

public holidays in the republic of Kenya.

From NSE20 share index data, I choose to work with the East African Breweries mar-

ket, justification being that this company’s derivatives have consistently been present

on the stock exchange, and it’s share prices over the 10 year period under review have

been fairly consistent. The research data is comprised of 2610 observations of 5 vari-

ables, the response variable is the eabshare price, while eabmarket value, eab-

shareprice high, eabshareprice low are the explanatory variables.

3.5.1 Data Pre-Processing

3.5.2 Missing observations and imputations

The mice package from the R statistical software was used to data clean this data

set by finding and imputing missing observations, approximately 226 observations were

missing for 2 variables, 113 cases each.

29


Figure 3.6: Graphical representation of missing observations

Using this package I imputed this values by applying the predictive mean

matching method(PMM), which is a semi-parametric approach in that for each miss-

ing value it fills in a value randomly from among observed donor values from an obser-

vation who regression-predicted values are closest to the regression-predicted values for

the missing value from the simulated regression model. PMM ensures imputed values

are plausible. It does this by;

• forming a small set of candidate donors for each missing entry

• a donor value is then randomly selected/observed

• this ”observed” value is then taken to replace the missing entry

This technique assumes the distribution of the missing values to be the same as that

of the observed candidate donors with the main advantage of applying it being it’s

robustness against misspecification.

30


3.5.3 Data Normalization

A key step in ML algorithms involves data normalization, which bounds our ANN inputs

to values between [−1, 1] for this research the range is [0, 1], subsequently making them

comparable. Failure to normalize data will mean that the machine learning algorithm

will be dominated by variables that use a larger scale which biases the model to inac-

curacy, in addition, scaled data brings about the desired effect of faster convergence.

Feature scaling normalization technique was applied to this data set through the R

Caret package, as indicated in the formula below

xscaled =
x− xmin

xmax − xmin

It was chosen as the observed data is not Gaussian. On application of the technique,

data was scaled to a fixed range of [0, 1], and I ended up with data with a smaller

standard deviation(S.D) which in-turn suppressed the effect of outliers.

3.5.4 Data Split

I used the createDataPartition() function from Caret package in R to split the data set

into a training data set and a test data set at 70:30 percentage ration respectively.

31


Chapter 4

Results
The algorithms put forward were applied using R statistical software version 4.0.5

through it’s Rstudio IDE version 1.4.1103, on a PC running Windows 10 Enterprise-

64bit operating system(Build 19042), with a core-i5 processor and an 8GB RAM.

4.1 Results from Algorithms

4.1.1 Backpropagation model results

The neural network model was applied as follows, a configuration of one input layer was

chosen with 3 nodes guided by the fact that we have 3 features, a hidden layer configu-

ration of 3 nodes was chosen as it yielded a higher prediction accuracy as opposed to any

other configuration, the linear.output argument was set to True with the assumption

that their is a linear association between the independent and dependent variable, the

threshold is set to 0.01% meaning if the change in error is less than 1% per iteration,

then no further optimization was to be carried out.

I used 100 epochs and a Relu transfer function of the neurons of the hidden layer,

whereas the linear transfer function was used for the single output layer since this is a

regression problem.

The plot depicts result of the backpropagation of errors algorithm , a network with a

single hidden layer and 3 neurons was the optimized network, it also highlight one of its

drawbacks, in that, it took this model with a one network of a hidden layer of 3 neurons

1900 iteration steps to converge.

32


Figure 4.1: BPN Neural Network Configuration

Predicted prices versus actual observed prices for the market under study.

Figure 4.2: Real and Predicted EAB Share price values BPN Algorithm

The difference between actual and predicted share price values was too negli-

gible, testament to this model’s performance.

33


4.1.2 Genetic Algorithm

The genetic algorithm simultaneously performed feature selection and parameter opti-

mization, for features, GA settled on all 3 features(number of features are negligible in

this paper), 1830 samples were run from the training data set which took a considerable

amount of time and gave me more or equal results compared to the BPN algorithm.

The main parameters were set as follows: maximum generations 30; population per

generation 50; crossover probability is set to 0.8 with k-tournament selection method

applied and chose the individual with the best fitness value after each iteration ; muta-

tion probability at 0.1. GA was equipped with a random forest to investigate whether

feature selection results in a marked performance difference, it did.

The best or optimal solutions are obtained at least at the 20th generation coinciding

with the 50th iteration no improvement was noted after, so the algorithm stopped after

the 30th generation at 75th iteration. See 4.3

Figure 4.3: Algorithm Progression of the GA Object

34


4.2 Models compared

Prediction Model MSE RMSE MAE Accuracy %
BPN 0.0020 0.003 0.0016 99.1

GA-DE 0.0006 0.0025 0.0004 99.9

Table 4.1: Performance Metrics BPN GA-DE

The two algorithms were based on all indicators, figures in table 4.1 above, give an

account of the differences between BPN and GA-DE time series forecasting models,

the values are convergent. It is evident that machine learning algorithms are potent

substitutes to traditional statistical methods, BPN method despite it’s slight under

performance versus the GA-DE algorithm has a higher convergence rate and is compu-

tationally inexpensive in comparison.

35


Chapter 5

Discussions, Conclusions and

Recommendations
From literature a key variable in stock market prediction is the technical indicator, whose

lack from the NSE20 share index may have greatly contributed to the small difference

in performance between the algorithms with my hypothesis being that the GA-DE al-

gorithm would have been superior for such a task.

Faced with a more complex data set, the advantages of the GA-DE would have been

more apparent especially due it’s ability to evolve a candidate solution by learning and

subsequently picking out the most important features for regression.

Model accuracy and prediction power for BPN is affected by the selection in the number

of layers in use in the hidden environment, a high numbered network has the ability of

error reduction but with an increased convergence time, this in turn increases the com-

plexity of the model and can result in a higher probability of model overfitting, special

care should be taken while deciding on this, in addition the number of batches is cru-

cial in regulating the speed and stability of the learning process which directly impacts

accuracy, a high number of batches can lead to overfitting, rule of the thumb is to settle

for model defined defaults.

In conclusion ANNs are excellent candidates for stock market prediction and their ap-

plication by investors in the Nairobi Stocks Exchange is recommended and feasible with

a high return on investment envisaged.

36


Bibliography
[1] Ahangar, R. G., Yahyazadehfar, M., and Pournaghshband, H. (2010). The compari-

son of methods artificial neural network with linear regression using specific variables

for prediction stock price in tehran stock exchange. arXiv preprint arXiv:1003.1457.

[2] Anderson, D. and McNeill, G. (1992). Artificial neural networks technology a dacs

state-ofthe-art report. New York: Kaman Sciences.

[3] Ariyo, A. A., Adewumi, A. O., and Ayo, C. K. (2014). Stock price prediction using

the arima model. In 2014 UKSim-AMSS 16th International Conference on Computer

Modelling and Simulation, pages 106–112. IEEE.

[4] Ciaburro, G. and Venkateswaran, B. (2017). Neural Networks with R: Smart models

using CNN, RNN, deep learning, and artificial intelligence principles. Packt Publish-

ing Ltd.

[5] Fujita, H., Katafuchi, T., Uehara, T., and Nishimura, T. (1992). Application of

artificial neural network to computer-aided diagnosis of coronary artery disease in

myocardial spect bull’s-eye images. Journal of Nuclear Medicine, 33(2):272–276.

[6] Gonzalez, R. T., Padilha, C. A., and Barone, D. A. C. (2015). Ensemble system

based on genetic algorithm for stock market forecasting. In 2015 IEEE Congress on

Evolutionary Computation (CEC), pages 3102–3108. IEEE.

[7] Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning,

volume 1. MIT press Cambridge.

[8] Gorman, R. P. and Sejnowski, T. J. (1988). Learned classification of sonar targets

using a massively parallel network. IEEE Transactions on Acoustics, Speech, and

Signal Processing, 36(7):1135–1140.

[9] Holland, J. H. et al. (1992). Adaptation in natural and artificial systems: an intro-

ductory analysis with applications to biology, control, and artificial intelligence. MIT

press.

37


[10] Hutchinson, J. M., Lo, A. W., and Poggio, T. (1994). A nonparametric approach

to pricing and hedging derivative securities via learning networks. The Journal of

Finance, 49(3):851–889.

[11] Hwarng, H. B. (2001). Insights into neural-network forecasting of time series cor-

responding to arma (p, q) structures. Omega, 29(3):273–289.

[12] Inthachot, M., Boonjing, V., and Intakosum, S. (2016). Artificial neural network

and genetic algorithm hybrid intelligence for predicting thai stock price index trend.

Computational intelligence and neuroscience, 2016.

[Jena and Padhy] Jena, O. P. and Padhy, S. Application of ga with svm for stock price

prediction in financial market.

[14] Lippmann, R. P. (1989). Review of neural networks for speech recognition. Neural

computation, 1(1):1–38.

[15] Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv

preprint arXiv:1711.05101.

[16] Majhi, R., Panda, G., Sahoo, G., Dash, P. K., and Das, D. P. (2007). Stock market

prediction of s&p 500 and djia using bacterial foraging optimization technique. In

2007 IEEE congress on evolutionary computation, pages 2569–2575. IEEE.

[17] Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski,

R., Newton, J., Parzen, E., and Winkler, R. (1982). The accuracy of extrapolation

(time series) methods: Results of a forecasting competition. Journal of forecasting,

1(2):111–153.

[18] McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent

in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133.

[19] Mohapatra, P., Raj, A., and Patra, T. K. (2012). Indian stock market predic-

tion using differential evolutionary neural network model. International Journal of

Electronics Communication and Computer Technology (IJECCT), 2(4):159–166.

[20] Nadia, M. (2019). Stock market price prediction system using neural networks and

genetic algorithm. Theoretical and Applied Information Technology.

38


[21] Patel, J., Shah, S., Thakkar, P., and Kotecha, K. (2015). Predicting stock market

index using fusion of machine learning techniques. Expert Systems with Applications,

42(4):2162–2172.

[22] Pino, R., Parreno, J., Gomez, A., and Priore, P. (2008). Forecasting next-day price

of electricity in the spanish energy market using artificial neural networks. Engineering

Applications of Artificial Intelligence, 21(1):53–62.

[23] Rumelhart, D. E., Durbin, R., Golden, R., and Chauvin, Y. (1995). Backpropa-

gation: The basic theory. Backpropagation: Theory, architectures and applications,

pages 1–34.

[24] Selvamuthu, D., Kumar, V., and Mishra, A. (2019). Indian stock market prediction

using artificial neural networks on tick data. Financial Innovation, 5(1):1–12.

[25] Smith, L. N. (2017). Cyclical learning rates for training neural networks. In 2017

IEEE winter conference on applications of computer vision (WACV), pages 464–472.

IEEE.

[26] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences.

Machine learning, 3(1):9–44.

[27] Warner, B. and Misra, M. (1996). Understanding neural networks as statistical

tools. The american statistician, 50(4):284–293.

[28] Yizhen, L., Wenhua, Z., Ling, L., Jun, W., and Gang, L. (2011). The forecasting

of shanghai index trend based on genetic algorithm and back propagation artificial

neural network algorithm. In 2011 6th International Conference on Computer Science

& Education (ICCSE), pages 420–424. IEEE.

39


	List of Figures
	List of Tables
	 Acknowledgment
	Publications
	Conference
	Abbreviations
	Introduction
	Background to the study
	Artificial Neural Networks Models

	Statement of the Problem
	Objectives of the Study
	Main Objective
	Specific Objective


	Literature review 
	Introduction
	Traditional Forecasting Methods-ARIMA
	Modern Forecasting-ANN
	Related Works
	Optimization Algorithms


	Methodology
	Introduction
	Artificial Neural Networks Architecture
	Activation Functions Choice
	Overfitting Mitigation
	Loss Function
	Model hyperparameters to consider
	Choice of layers of neural network

	Algorithms under consideration
	Backward Propagation of Errors
	Genetic Algorithm - Differential Evolution
	Survivor Selection-Elitism

	Performance Evaluation
	Data Description
	Data Pre-Processing
	Missing observations and imputations
	Data Normalization
	Data Split


	Results
	Results from Algorithms
	Backpropagation model results
	Genetic Algorithm

	Models compared

	Discussions, Conclusions and Recommendations
	Bibliography