Strathmore University 
SU+ @ Strathmore 
University Library  
  
 
Electronic Theses and Dissertations 
 
 
2018 
Stock market price prediction using sentiment 
analysis: a case study of Nairobi stock exchange 
market 
 
Victor K. Lwanga 
Faculty of Information Technology (FIT) 
Strathmore University 
 
 
 
Follow this and additional works at https://su-plus.strathmore.edu/handle/11071/5996 
 
 
 
Recommended Citation 
Lwanga, V. K. (2018). Stock market price prediction using sentiment analysis: a case study of 
Nairobi stock exchange market (Thesis). Strathmore University. Retrieved from https://su-
plus.strathmore.edu/handle/11071/5996 
 
 
This Thesis - Open Access is brought to you for free and open access by DSpace @Strathmore  University. It has been accepted for 
inclusion in Electronic Theses and Dissertations by an authorized administrator of DSpace @Strathmore University. For more 
information, please contact librarian@strathmore.edu 
 
  
Stock Market Price Prediction Using Sentiment Analysis: A Case Study of 
Nairobi Stock Exchange Market 
 
 
Lwanga Victor Kwome 
 
Submitted in partial fulfillment of the requirements for the Degree of Master of 
Science in Information Technology at Strathmore University 
 
  
Faculty of Information Technology  
Strathmore University  
April 2018 
 i 
 
Declaration 
I declare that this work has never been submitted for examination in any university. 
 
Student’s Name: Lwanga Victor Kwome. 
Signature: ___________________________________________________ 
Date: ___________________________________________________ 
Approval 
The thesis of Lwanga Victor Kwome was reviewed and approved by the following: 
Dr. Joseph Orero (PhD) 
Senior Lecturer, Faculty of Information Technology 
Strathmore University 
Signature: ___________________________________________________ 
Date: ___________________________________________________ 
 
Professor Ruth Kiraka 
Dean, School of Graduate Studies 
Strathmore University   
 ii 
 
Abstract 
Stock market price prediction has become an area of research and interest for several years 
now due to the many challenges in making accurate price predictions due to the volatility of the 
data. However, the stock market is not easily predicted. Movement in the stock market is 
influenced by various factors such as personal fortunes, political events, individual tastes, 
preferences and natural disasters. People can express all these through their sentiments and 
opinions on the social media platforms, financial news, and blogs. The stock price does not only 
rely on the law of demand and supply. People’s opinions and moods also have a substantial impact 
on the movement of the stock prices of a company.  
Recently, efforts to increase the accuracy of stock market predictions by including data 
from social media such as Facebook and Twitter has received a lot of attention. Social media can 
be regarded as an indicator of sentiments, and these are known to influence the stock market. 
Current models lack a clear interpretation, and it is also difficult to determine, which data is 
relevant for stock market prediction since there is an abundance of the same on social media. 
This study proposed the use of machine learning algorithms that will be utilized in Natural 
Language Processing (NLP) to get opinions and sentiments from social media on a particular 
company's stock to predict the stock market prices. Previous studies show that public mood, 
opinion, and stock market price have some relation to an extent. The research used Support Vector 
Machine with bigram feature to perform sentiment analysis which exhibited and accuracy of 83 
percent and Artificial Neural Network in Stock price prediction which had a mean squared error 
of 5.6. This research has proven that sentiment analysis can be incorporated in stock price 
prediction. 
 
 iii 
 
Acknowledgment 
I would like give my special thanks to my supervisor, Dr. Joseph Orero, for his guidance and 
support throughout the project and Dr. Bernard Shibwabo for his assistance during the Research. 
I also would like to express my gratitude to my family for their continued support. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 iv 
 
List of Abbreviations 
ANN – Artificial Neural Network 
API – Application Programming Interface 
ARIMA – Auto-Regressive Integrated Moving Average 
CDSC – Central Depository and Settlement Corporation 
DJIA – Dow Jones Industrial Average 
EMH – Efficient Market Hypothesis 
NLP – Natural Language Processing 
NN – Neural Network 
NSE – Nairobi Stock Exchange 
SVM – Support Vector Machine 
S & P 500 index – Standard and Poor 500 Index. 
 
 
 
 
 
 
 v 
 
Table of Contents 
Declaration ..............................................................................................................................i 
Abstract ................................................................................................................................. ii 
Acknowledgment .................................................................................................................. iii 
List of Abbreviations ............................................................................................................. iv 
Table of Contents ................................................................................................................... v 
List of Tables ......................................................................................................................... ix 
List of Equations .................................................................................................................... x 
Chapter 1 : Introduction ........................................................................................................ 1 
1.1 Background ................................................................................................................... 1 
1.2 Problem Statement........................................................................................................ 3 
1.3 Objectives ...................................................................................................................... 4 
1.4 Research Questions ....................................................................................................... 4 
1.5 Scope and Limitation .................................................................................................... 4 
1.6 Justification ................................................................................................................... 5 
Chapter 2 : Literature Review ............................................................................................... 6 
2.1 Introduction .................................................................................................................. 6 
2.2 Stock Market in Kenya ................................................................................................. 6 
2.2.1 Nairobi Stock Exchange (NSE) .............................................................................. 6 
2.2.2 Capital Markets Authority (CMA) of Kenya ......................................................... 7 
2.2.3 Central Depository and Settlement Corporation (CDSC) ..................................... 8 
2.2.4 Stock Market Trading ............................................................................................ 9 
2.3 Stock Market Prediction Theories ................................................................................ 9 
2.3.1 Random Walk Theory .......................................................................................... 10 
2.3.2 Efficient Market Hypothesis ................................................................................ 10 
2.4 Stock Market Prediction ............................................................................................. 11 
2.4.1 Technical Analysis ................................................................................................ 11 
2.4.2 Fundamental Analysis .......................................................................................... 12 
2.4.3 Time Series Method.............................................................................................. 13 
2.4.4 Machine Learning ................................................................................................ 15 
2.5 Sentiment Analysis ...................................................................................................... 16 
2.5.1 Unsupervised Classification ................................................................................. 17 
 vi 
 
2.5.2 Supervised Learning ............................................................................................ 17 
2.5.3 Lexicon Based Approach ...................................................................................... 19 
2.5 Twitter Analysis .......................................................................................................... 20 
2.6 Empirical Review ........................................................................................................ 21 
2.6.1 Prediction of movies success using sentiment analysis of tweets .......................... 21 
2.6.2 Other Related Works ........................................................................................... 22 
2.7 Conceptual Model ....................................................................................................... 24 
Chapter 3 : Research Methodology ...................................................................................... 26 
3.1 Introduction ................................................................................................................ 26 
3.2 Research Design .......................................................................................................... 26 
3.3 Target Population and Sampling ................................................................................ 26 
3.4 Data Collection ............................................................................................................ 26 
3.4.1 Twitter Data ......................................................................................................... 27 
3.4.2 Python .................................................................................................................. 27 
3.4.3 Stock Data Collection ........................................................................................... 27 
3.5 System Development Methodology ............................................................................. 28 
3.6 Research Quality ......................................................................................................... 29 
Chapter 4 : System Design and Architecture ....................................................................... 32 
4.1 Introduction ................................................................................................................ 32 
4.2 Requirements Analysis ............................................................................................... 32 
4.3 System Architecture .................................................................................................... 33 
4.4 System Analysis........................................................................................................... 35 
4.4.1 Use Case Diagram ................................................................................................ 35 
4.4.2 Use Case Scenarios ............................................................................................... 36 
4.4.3 Sequence Diagram ................................................................................................ 40 
4.5 System Design ............................................................................................................. 42 
4.5.2 Context Diagram .................................................................................................. 42 
4.5.3. Level 1 DFD Diagram .......................................................................................... 43 
Chapter 5 : System Implementation and Testing ................................................................. 44 
5.1 Introduction ................................................................................................................ 44 
5.2 Sentiment Analysis ...................................................................................................... 44 
5.2.1 Building Sentiment Corpus .................................................................................. 44 
 vii 
 
5.2.2 Preprocessing ....................................................................................................... 45 
5.2.3 Labeling ................................................................................................................ 47 
5.2.4 Training the model ............................................................................................... 48 
5.2.5 Testing .................................................................................................................. 48 
5.3 Share Price Prediction ................................................................................................ 50 
5.3.1 Building corpus .................................................................................................... 51 
5.3.2 Preprocessing ....................................................................................................... 51 
5.3.3 Training the model ............................................................................................... 52 
5.3.4 Testing the model ................................................................................................. 53 
Chapter 6 : Discussions ........................................................................................................ 55 
6.1 Introduction ................................................................................................................ 55 
6.2 Experiments on Sentiment Analysis ........................................................................... 55 
6.2.1 Using Different Classifiers.................................................................................... 55 
6.2.2 Experiment 1: SVM with Different Feature Types .............................................. 56 
6.2.3 Experiment 2: Naïve Bayes with Different Feature Types ................................... 56 
6.2.4 Experiment 3: Random Forest with Different Feature Types ............................. 57 
6.2.5 Experiment 4: KNN with Different Feature Types .............................................. 57 
6.3 Stock Prediction .......................................................................................................... 58 
6.3.1 Different Classifiers .............................................................................................. 58 
Chapter 7 : Conclusion and Recommendations ................................................................... 60 
7.1 Conclusion................................................................................................................... 60 
7.2 Recommendation ........................................................................................................ 60 
References ............................................................................................................................ 61 
Appendix .............................................................................................................................. 67 
Appendix A: Originality Report ....................................................................................... 67 
Appendix B: Python Source Code .................................................................................... 68 
 
 
 viii 
 
List of Figures 
Figure 2:1: Artificial Neural Network ..................................................................................... 16 
Figure 2:2: Conceptual Model................................................................................................. 24 
Figure 3:1: RAD Development Cycle...................................................................................... 28 
Figure 4:1: System Architecture .............................................................................................. 34 
Figure 4:2: Use Case Diagram ................................................................................................ 35 
Figure 4:3: Sequence Diagram ................................................................................................ 41 
Figure 4:4: Context Diagram .................................................................................................. 42 
Figure 4:5: Level 1 DFD Diagram........................................................................................... 43 
Figure 5:1: Crawler Code ....................................................................................................... 44 
Figure 5:2: Downloading Tweets ............................................................................................ 45 
Figure 5:3: Text Cleaning Code .............................................................................................. 46 
Figure 5:4: Data Retrieved From Twitter ................................................................................. 46 
Figure 5:5: Cleaned Twitter Data ............................................................................................ 47 
Figure 5:6: Labeled Dataset .................................................................................................... 47 
Figure 5:7: SVM Model Training Code ................................................................................... 48 
Figure 5:8: Share Price Dataset ............................................................................................... 51 
Figure 5:9: Training Dataset ................................................................................................... 52 
Figure 5:10: ANN Training Code ............................................................................................ 52 
Figure 5:11: ANN Predicted Prices ......................................................................................... 53 
Figure 5:12: Mean Squared Error Code ................................................................................... 53 
Figure 5:13: ANN Predicted Price and Actual Price Graph ...................................................... 54 
Figure 6:1: Different Classifiers Performance Graph ............................................................... 59 
 
 
 
 
 
 
 
 
 ix 
 
List of Tables 
Table 3:1: NSE Data............................................................................................................... 28 
Table 3:2: Confusion Matrix ................................................................................................... 30 
Table 5:1: Confusion Matrix ................................................................................................... 49 
Table 5:2: Confusion Matrix Values ....................................................................................... 49 
Table 5:3: SVM Performance ................................................................................................. 50 
Table 5:4: SVM Mean Squared Error ...................................................................................... 53 
Table 6:1: Classifiers Performance .......................................................................................... 55 
Table 6:2: SVM Performance ................................................................................................. 56 
Table 6:3: Naive Bayes Performance ...................................................................................... 57 
Table 6:4: Random Forest Performance .................................................................................. 57 
Table 6:5: KNN Performance ................................................................................................. 58 
Table 6:6: Classifiers Performance in Stock Price Prediction ................................................... 59 
 
 
 
 
 
 
 
 
 
 
 
 x 
 
List of Equations 
Equation 2:1: Random Walk Equation .................................................................................... 10 
Equation 2:2: ARIMA Equation .............................................................................................. 14 
Equation 3:1: Accuracy .......................................................................................................... 30 
Equation 3:2: Error Rates ....................................................................................................... 31 
Equation 3:3: Recall ............................................................................................................... 31 
Equation 3:4: False Positive Rate ............................................................................................ 31 
Equation 3:5: Precision ........................................................................................................... 31 
 1 
 
Chapter 1 : Introduction 
1.1 Background 
The prediction of stock market prices and trends is a problem of interest. The pricing of 
shares on the stock exchange has dynamic behavior often driven by the law of supply and demand 
for action. This dynamism attracts the attention of investors because it provides huge profits when 
investments are made in the best way at the right time. According to Rocha and Macedo (2011), 
investment in the capital market, the objective is always to buy shares when its price is low as 
possible and sell them when the price is much higher. In this way, expect the behavior of the stock 
market means generating profits and reduce risk and losses. This anticipation can also be referred 
to as prediction and can enhance the profitability of investment (Rocha & Macedo, 2011). 
Uncertainty is the common characteristic that most stock markets have hence the long-term 
and short-term future states. Uncertainty in this section is undesirable for existing investors but 
remains so as this is unavoidable when using stock markets as investment tools. The solution in 
such scenarios us is the ability to reduce uncertainty levels. The process of reducing uncertainty 
entails the application of stock market forecasts or predictions. Past studies conducted in this 
subject relied on historical prices collected form companies following their fluctuation 
characteristics. The theories of efficient market hypothesis state that the movements in financial 
markets depend on current events, news and releases of products and their impact of the stock 
values of different companies. For this reason, the prices in the stock markets follow the random 
walk pattern resulting in inaccurate predictions of more than 50% (Pagolu et al., 2016). 
Business and financial news brings us the latest facts about the industry’s stock market. 
Previous Studies show that both financial and business news have a robust relationship with future 
stock performance. Therefore, extracting sentiments and opinions from business and financial 
 2 
 
news is useful as it may assist in stock-market price prediction. It has been proven that the financial 
market is "informationally efficient" (Fama, 1965), stock prices reflect information and the price 
movement is in response to news or events. 
As it is well known, emotional state can influence our decisions and no doubt such choices 
include stock market investment decision (Gilbert & Karahalios, 2010). When people are 
pessimistic or uncertain about the future, they will be more cautious to invest and trade. So 
capturing the collective mind of the peoples' mood becomes one possible way to predict the stock 
movement. 
Antweiler and Frank (2004) in their study determined the association between activities in 
Internet message boards related to stock volatility and relevant trading volume. Gilbert and 
Karahalios (2010) used over 20 million posts from the Live Journal website to create an index of 
the US national mood, which they call the Anxiety Index. They found out that when the index rose 
sharply, the S&P 500 index needed the day marginally lower than expected. It shows there a 
correlation between how public mood or sentiments, financial news and people’s opinion can 
affect a company's stock price. 
Nowadays social media is representing the opinions and sentiments of the public about 
current events. Sentiment analysis entails having a complete understanding of an author's opinion 
expressed in a text. Elaborately, optimistic news and opinions in existing social media about a 
corporation would inspire people to capitalize in the stock of the specific company resulting in the 
stock prices of the corporation increasing. Behavioral finances hypothesis states that a public mood 
and any market performance are always correlated. The idea here is that when individuals are 
happy, in good moods or optimistic, they are most likely to surge investment, which in turn 
advances stock market price performance. (Makrehchi, Shah, & Liao, 2013) 
 3 
 
This study is involves taking existing non-quantifiable statistics on financial news and 
public sentiments on companies. The collected data is used in predicting the trends on future 
stocks. The assumption here is that the opinions and news have a significant impact on the changes 
in the stock markets with an attempt to establish the correlations between public sentiments, 
opinions, stock trends and company news.  
1.2 Problem Statement 
Stock market prediction relies on factors such as interest rates, economic activities and 
related markets that influence the demand and supply of the trading volume. Currently, 
Stockbrokers who execute trades and advice clients, rely on their experience, technical analysis 
(price trends) or fundamental analysis in picking their stocks. These current methods are subjective 
and are usually short-sighted due to their limited capacity to crunch raw numbers. With the value 
of trade money involved, the improper investment could easily mean great losses for investors, 
especially if they keep making wrong decisions. Lack of guaranteed returns has also led to the 
reluctance by potential investors to participate in the market. It is therefore desirable to have a 
model that can guide on the most likely next day prices (prediction) as a basis for making any 
investment decision.  
This study proposes text mining of financial news and public sentiments and opinions from 
social media such as twitter. The combination of market data and news features together helps 
improve the accuracy of predictions. Regardless, already existing systems have failed to 
effectively integrate news features together with market data. With this, the results obtained are 
converted into numeric forms that feed the prediction process.  
 
 4 
 
1.3 Objectives 
General Objective 
This project aims at predicting future price movements of the stock market using financial 
news and peoples' opinions posted on the social media platforms hence getting sentiments that will 
aid in stock price forecasting. 
Specific Objectives 
i. To investigate how the stock market NSE operates. 
ii. To analyze the current methods used in the stock market prediction 
iii. To evaluate current methods, use in text mining and processing from social media. 
iv. To develop a model for stock market price prediction based on sentiment on social 
media. 
v. To test and validate the stock market price prediction model. 
1.4 Research Questions 
i. How does the NSE stock market operate? 
ii. What are the current methods used in prediction of the stock market price? 
iii. What are the current methods used in text mining and processing of data from social 
media? 
iv. How to design and develop a model for stock market price prediction? 
v. How will the prediction stock market price prediction model be validated? 
1.5 Scope and Limitation 
The project is limited to only the company's shares listed on the NSE. Additionally, the 
company should have traded for at least five years to ensure there is data consistency. The 
 5 
 
languages to be used are English and Swahili in the sentiment analysis process. Use of slang in 
this case sheng’ and vernacular language will not be considered. The assumption in this study is 
that there should be no form of manipulation that could have a bigger effect on the prices of stock 
movements by either the stockbrokers or any other affected parties.  
1.6 Justification 
Currently, in the Kenyan context, the stockbrokers use methods based on trend patterns 
which may not be effective. These methods do not have the predictive ability, and they are based 
on demand and supply. In forecasting the stock market prices, other critical factors will influence 
the price of the stock and also the demand and supply. Predicting the behavior of shares in the 
stock exchange is not a simple task as it involves variables not always known and can undergo 
various influences from the collective emotion to high profile news. By Incorporating the news 
and sentiments of people who in this case are considered as key external factors that are not 
presented in a quantifiable format, can aid in providing accurate results in the stock market price 
prediction. 
  
 6 
 
Chapter 2 : Literature Review 
2.1 Introduction 
This section of the document reviews existing related literature and previous research and 
studies on predicting the stock exchange market using sentiment analysis and other methods. 
2.2 Stock Market in Kenya 
2.2.1 Nairobi Stock Exchange (NSE) 
The NSE is an African exchange that offers trading facilities to investors looking for 
exposure to the Kenyan economies; it also lists equity and debt securities. It was founded in 1954, 
demutualized and self-enlisted in 2014. It operates under the Capital Markets Authority of Kenya 
(Nairobi Securities Exchange, 2017). The NSE is a member of the (WFE) World Federation of 
Exchange, which is the founder member of the ASEA (African Securities Exchanges Association) 
and the (EASEA) East African Securities Exchanges Association. It is also belongs to the 
Association of Futures Market and is a partner exchange in the United Nations-led SSE initiative 
(Nairobi Securities Exchange, 2017). 
The NSE trades in both shares and bonds. The identified shares in NSE are classified into 
sectors that are commercial, agricultural, financial sectors, allied sectors and services. The sectors 
are displayed alphabetically for ease of location by the interested investors who can view daily 
trading from a public gallery. The bonds traded by NSE are treasury bonds which are issued by 
the Kenyan government and the corporate bonds which are issued by trading companies. Shares 
are equities while bonds are referred to debt instruments (Nairobi Securities Exchange, 2017). The 
NSE also provides five types of live (real-time) data: Real Time Listed Equity Securities Data, 
Real-Time Listed Debt Securities Data, and NSE Live Ticker on corporation websites, FTSE NSE 
Equity directories and FTSE NSE Bond Index.  
 7 
 
The NSE enables the money markets segment to be productive by providing meetups between 
lenders of money and borrowers but at a lower cost. The lenders or savers end up being the 
investors. Their role is to invest in the market and expect financial rewards through profits. The 
borrowers or issuers borrow money from the market lenders and pay them in form of profits after 
a set duration.  It also provides education to the public and its users regarding the highest profits 
available on shares and bonds; including the buying and selling process. They also teach the 
community on how to do investments as a group. It also provides financial answers to the most 
known problems. The shares and bonds are recognized as guarantees for loans in co-operative 
societies as well as bank loans. The two can be prearranged, with the assistance of money 
managers, to pay for children’s tutoring fees, hospital bills, car services and other assurance 
schemes such as pension and retirement plans (Nairobi Securities Exchange, 2017). Through the 
two, the local government, cooperatives societies, small and big companies, and similar 
establishments can raise funds to increase their commercial activities, make profits, create 
employment opportunities and support the growth of the economy. 
2.2.2 Capital Markets Authority (CMA) of Kenya 
The self-governing public organization which was established through the Act of 
Parliament, under the finance ministry. The established authority took power on December 15th, 
1989, after it was passed and established in office in 1990 (Capital Markets Authority, 2017). 
The CMA and associated industry operates in a set regulatory framework guides the actions 
of the industry players in all the activities. Following its inception; the act has worked towards 
broadening and deepening the CMA by developing new regulatory frameworks that also facilitate 
development of new products and services as well as institutions. This is achieved through research 
and fairness coupled with orderliness in the industry (Capital Markets Authority, 2017). 
 8 
 
2.2.3 Central Depository and Settlement Corporation (CDSC) 
CDSC provides quality services in settlement and clearing services in the Kenya Capital 
Markets. It offers a secure central custody that is simplified, safe and swift transfer of investors' 
value to the right place. One way of boosting investor confidence the in CMA market is through 
creation of customized solutions that ensure the investor is made aware of all transactions taking 
place in an individual’s central depository system (CDS) account (Central Depository & 
Settlement Corporation, 2017) 
The services offered are; online account access, Investor services- SMS services, email 
statement services and statements of the accounts. Depository services such as securities accounts, 
deposits, transfers, pledges, and releases. Clearing and settlement services include guarantee fund, 
trade reporting, and clearing. Issuer’s services; trading rights, AD HOC reports, dividends, bonus 
issues and initial public offers (Central Depository & Settlement Corporation, 2017). 
The Central Depositories Act helps by laying down a regulatory and legal agenda through 
which the founding and processes of the CDSC are anchored. The organization also functions 
under the governing oversight and supervision of the CMA. It is a limited liability company 
permitted by the CMA and authorized to warrant the efficiency of the delivery process, clearing 
procedures, and settlement of purchases securities in all capital markets of Kenya. In this regard, 
CDSC is also managed by the Capital Markets Act as well as the rules and regulations (Central 
Depository & Settlement Corporation, 2017). 
CDSC being an integrated financial market infrastructure plays an important role in the 
competent functioning of both domestic and regional monetary markets. The CDS Rules guide its 
day-to-day management. The CDSC policies and procedures describe the processes and 
descriptions of how all the stated functions should be performed by participating parties. Any 
 9 
 
changes made in the policies and procedures must always be approved by the Capital Markets 
authority and the government where necessary (Central Depository & Settlement Corporation, 
2017).   
CDSC also has the function of entering enters into new contractual relations and 
agreements with interested stakeholders for the delivery of selected services based on individual 
preference. The essential elements are the contracts signed by CDSC and CDAs. This is 
accompanied by new agreements btween settlement bankls and CDSC.  The most important 
agrreement is that one between CDSC and the central bank of Kenya such that all securities 
transactions are settled here (Central Depository & Settlement Corporation, 2017). 
2.2.4 Stock Market Trading 
Stock trading is classified as either day trading, medium-term, short-term and long-term 
based on the duration of the stock holding process. In day trading, the buying and selling of 
financial instruments usually done on the same selected day with all the trading closed before the 
end of day. The traders that are hired to trade in the day trading are referred to as day traders or 
active traders. Short-term trade is that which involves trading of one to few days; maybe a week. 
Medium-trading is that which takes place in few weeks to months. Long-term trading on the other 
hand goes on from months to years depending on the need (Zhang, 2013). 
2.3 Stock Market Prediction Theories 
In any financial derivation, two main principles are considered according to Hellstrom 
(1998) and Lawrence (2002).  One principle is that profit is not generated from anything and the 
arbitrage principle which states that “no opportunities for arbitrage that is there is no possibility of 
generating profit without any associated risks”.  
 10 
 
2.3.1 Random Walk Theory 
This is a theory that works with the conclusion that changes in stock price have the same level of 
distribution and are always independent of each other. In this case, past movements and trends in 
stock prices or any markets cannot be used to forecast any future movement Hellstrom, (1998). 
The theory’s formula is:  
 
 
Equation 2:1: Random Walk Equation 
 Where v (t): the price of stock at the time t 
v (t – 1): the price of stock at the time t -1 
Δ v (t): change in the price of the stock at time 
d (t): dividend at the time t 
c (t): adjustment term at the time t 
Since c (t) is the actual impact of all the privately and publicly available information on the stocks, 
which predicts Δ v (t) before a difficult task. 
2.3.2 Efficient Market Hypothesis 
This theory states that all market price mirrors the assimilation of all the information 
available. When the generated information enters a market, the system enters the unstable state 
and forecasts the new price eliminates correct change. From the available information, it is 
 11 
 
impossible to predict any future prices of the stocks (Burton,2003). The principle is that all 
investors react immediately to any form of informational advantages available thus eliminating 
any possible profit opportunities. The prices at all times reflect the available information with the 
conclusion that no profits are generated from the information-based trading (Lo & MacKinley, 
1999). 
Fama (1970) mentioned that there exists three forms of competent marketplaces based on 
the information used to predict a future price. The first one is the weak form; only the historical 
price or past information is used. The second form is the semi-strong formula which includes past 
prices as well as the publicly accessible information. Lastly, the strong form which includes both 
the public and privately available information, it also includes insider information. One should 
note that efficient market hypothesis and the random walks never amount to the equivalent thing. 
A random walk in the stock-prices fails to suggest that the stock market prices are resourceful with 
normal investors. A random-walk is always defined by the fact that prices and their relevenat 
changes are independent of each other always (Brealey et al., 2005). 
2.4 Stock Market Prediction 
There exists four stock market prediction methods: Fundamental analysis, Technical 
analysis, Machine learning and Time series analysis and (NeuroAI, 2013). 
2.4.1 Technical Analysis 
Technical analysis is the numerical time series methodology used in the prediction of stock 
markets founded on historical data with charts being introduced as the primary tools (Pring, 
1991).It is a technique used in the evaluation of securities by interpreting the figures produced by 
activities in the stock market such as previous prices and volume traded. The aim of using technical 
 12 
 
indicators in prediction is to get the trends and patterns, which should then inform a direction of 
future prices. 
This method has three major assumptions that are taken into account. The first assumption 
is that the market discounts everything. Technical analysis has been heavily criticized for not 
considering the fundamental factors as it only takes the price movement into account. Technical 
analysts have confidence in that everything from a corporation’s fundamentals to the broad market 
is already priced into the stock hence there is no need to consider other influencing factors. The 
other assumption is the trend in price movement. In technical analysis, the price movement is 
believed to follow certain trends. It implies that once the trend is established, the future price 
movements are likely to follow the same trend. The last assumption is that the past has a habit of 
repeating itself regarding price movement. The repetitive nature of all the movements in market 
prices is attributed to all market psychologies, which tend to be very foreseeable based on the 
human emotions such as fear and excitement. Technical analysis also uses chart and graph patterns 
to analyze market movements and understand trends (Huang et al., 2011). Technical analysis 
process deals with past price movements to forecast a pattern that guides future investment 
decisions. All these indicators must be calculated and their values used for guiding the prediction. 
2.4.2 Fundamental Analysis 
The fundamental analysis process is referred to as a study of different factors affecting the 
supply and demand (Thomsett, 1998). The theory works with the idea that data assembly and its 
consequent interpretation is the foremost process involved in the prediction of the stock prices. 
The trading opportunities of the analysis uses the gaps between the existence of a new event and 
the consequence market responses towards the event. The central data used in the process of 
fundamental analysis is company data including annual reports, quarterly reports, balance sheets, 
 13 
 
income statements and auditor’s reports. News in the industry plays an important part in the 
analysis process as such news also reflect the existing supply and demand chains in the 
marketplace. Fundamental analysis tends to have an overview of the company from a top-level 
view and considers issues such as political, economic and the business environment of the 
company. The general requirement of fundamental analysis is to understand the company and 
decide on its prospects (Thomsett, 1998). 
2.4.3 Time Series Method 
Time series method utilizes past performances to forecast on a time-series measure. The 
time series is referred to as a system of experimented quantities from a selected observations, 
whereby discoveries such as a periodic dissemination can be established (Zhang, 2003). Other 
important methods in the time series prediction are auto-regression, linear regression, as well as 
ARIMA. A significant characteristic of time series data is the fact that it is  dependent on time. For 
this reason; current observations have to depend on a past explanations in time. A typical prediction 
model, in this case, requires information external to the particular stock, which can be used to 
extrapolate the performance of the stock in question. Such information should be having a bearing 
on the stock of study. In time series forecast mathematical data series are placed successively, they 
take place in equivalent periods. In the method, there exists chains of numbers consisting of normal 
periods for a fixed time duration (Pang et al., 2002). 
2.4.3.1 Linear Regression 
Linear regression is a model that attempts to establish a connection between any two 
variables by accurately fitting a new linear equation to any collected or observed data (Pang et al., 
2002). It can also be fitted with a quadratic equation, and still, it will be called linear regression.  
 14 
 
In this context, one of the variables is the explanatory variable while the other is the dependent 
variable. A linear regression line with a linear equation is of the form: y = a + bx, 
Where x;  is the identified explanatory variable and y is the identified dependent variable. The 
slope of the line is b, and a is the intercept.  
2.4.3.2 Auto-Regressive Integrated Moving Average 
Zhang (2003), states that in an ARIMA model, the next future value of the variable is 
assumed to be the linear function of several historical observations as well as random errors. It can 
be represented in the form  
 
Equation 2:2: ARIMA Equation 
Where yt is the differenced time series value, ϕ and θ are unknown parameters and e are 
independent identically distributed error terms with zero mean. Here, yt is expressed as its past 
values and the current and past values of error terms. 
The ARIMA model tends to combine three basic methods that are the autoregression (AR), 
Differencing (I-for Integrated) and Moving Average (MA). For auto-regression algorithm, the 
digits of the given time-series facts are regressed to their own lagged values, that is indicated by 
the “p” values in the model. Differencing this comprises of differencing the time series data to 
eliminate the identified trend and converting a non-stationary time series to a new stationary one. 
The “t” value indicates this in the model the moving average nature of the model is represented by 
the “q” value which is the number of lagged values of the error term.  
 15 
 
2.4.4 Machine Learning 
Machine learning is the last method and is extraordinary for AI solutions considering it is 
based on the principles of learning from continuous training and practices. The association models 
such as artificial neural networks are well fit for machine learning where new association weights 
are adjusted to progress the competence of a formed network. 
2.4.4.1 Artificial Neural Network 
The bio-inspired ML model has shown incredible success in the application and fields of 
artificial intelligence. Many scholars have shown that using the bio-inspired algorithm has 
improved the results of the research domain. The algorithms include artificial neural networks 
(ANN), artificial immune systems, evolutionary computation, fuzzy systems, and swarm 
intelligence (Andries, 2007).  
An artificial neuron network takes the model of a biological neuron. The artificial neuron 
accepts signals or inputs from the other neurons or the surrounding environment. The signal will 
be fired given certain conditions, thus, transmitting the signal to all other connected neurons 
(Uhrig, 1995). Figure. 2:1 below is a representation of an artificial neuron.  Here, there is an 
association between the numerical positive and the negative value which is associated with each 
neuron such that they either inhibit or excite inputs with each connection made to the artificial 
neuron. The activation functions in ANN are used to regulate the firing taking place in the artificial 
neuron. The neuron then collects all incoming signals by computing their net input signals as a 
function with the associated or given weights. These net input signals then serve as input to the 
activation function which calculates the output signal of the artificial neurons (Zupan, 1994). An 
ANN is a layered system containing of one or many artificial neurons. ANN components include 
the input layer, hidden layer and the output layer. Based on the interconnection of the components; 
 16 
 
the ANN has been modelled with the ability to perform learning, generalize and map abilities to 
process information in parallel.  
 
Figure 2:1: Artificial Neural Network 
Several ANN architectures have been developed such as feedforward neural network, 
recurrent neural network, and spiking neural network. Also, there are also different types of neural 
network such as single-layer neural network, multi-layer perceptron (MLP), temporal neural 
network, radial basis neural function network, self-organizing neural network (Peterson & 
Rögnvaldsson, 1992). 
2.5 Sentiment Analysis 
A sentiment is a feeling, opinion or emotion that is formed by a person towards something, 
an idea or someone. Sentiment analysis in the studies is referred to as attitude mining, opinion 
mining studies the sentiments of people towards certain ideologies (Fang & Zhan, 2015). There 
exists different sentiment classification techniques machine learning approach, hybrid approach 
and lexicon-based approach and (Maynard & Funk, 2011). The ML approach employs the sue of 
 17 
 
ML algorithms and simple linguistic features. In lexicon-based approach, the algorithm relies on 
different sentiment lexicons which are a collection of precompiled and known terms. This is 
classified into dictionary-based approaches and corpus-based approaches that utilize statistical or 
semantic techniques to find sentiment polarity. 
2.5.1 Unsupervised Classification 
Unsupervised learning has no definite explicit target outputs associated with the input, and 
consequent learning through manual observation. The purpose is to have the existing machine 
learn without giving any obvious instructions. The eminent approach to most unsupervised 
learning techniques is clustering, whereby similarities of features in the training data is discovered. 
Cluster resemblance parameter is well-defined upon common metrics such as the Euclidean 
distance. K-means, Gaussian mixture models, Hierarchical, Hidden Markov models and Self-
organizing maps are examples of clustering algorithms (Batool et al., 2013). 
2.5.2 Supervised Learning 
2.5.2.1 Rule-based Learning 
The rule-based learning classifier is based on the rule of incidences of sentiments in a text. 
If any selected word contains positive emotions, then the conclusion is that it is positive. If the 
word contains any negative emotions, the conclusion is that it is negative. The rule-based classifier 
has similarities with the fuzzy logic system that allows intermediate value to be well-defined 
between conventional evaluations like yes or no, true or false and others. (Bhardwaj et al., 2015) 
2.5.2.2 Support Vector Machines 
Support Vector Machines (SVM) or Support Vector Networks (SVN) are classification and 
regression examination techniques. Support vector machines are categorized as supervised 
 18 
 
learning models for information analysis as well as pattern recognition. The common application 
areas for the SVM algorithms include image processing, bionformatucs and text analysis. The 
support vector machine constructs a hyperplane or a set of hyperplanes in a high or infinite-
dimensional space. In many cases, the data is not linearly separable. With the use of an SVM 
learning algorithm, it is probable to create a room that is transformable. The model signifies the 
examples as points in space, maps separate categories and divides them as much as possible. The 
goal is to design a hyperplane that classifies all training vectors into two distinct classes, where 
the best choice is the hyperplane that leaves the maximum margin for both classes. (Platt, 1999) 
Recent research and state of the art approaches of Support Vector Machines show that using 
ensemble approaches can drastically reduce the training complexity while maintaining high 
predictive accuracy. This has been done by implementing the SVM without duplicate storage and 
evaluation of support vectors, which has been shared between consistent models (Marc et al. 2014). 
2.5.2.3 Naive Bayes methods 
The Naive Bayes methods are a set of supervised learning algorithms that is used for 
clustering and classification (Lowd & Domingos 2005). Methods employed are based on the 
application of  Thomas Bayes’ theorem with a simple assumption of there being an independence 
between every pair of selected features. The Naive Bayes classifiers are known as linear classifiers 
and are able to perform well, simply and are very efficiently (Zhang, 2004). For small sample 
sizes, naive Bayes classifiers can outperform more powerful alternatives. However, non-linear 
classification problems can lead to poor performances of naive Bayes classifiers. These methods 
are used in a several of different fields such as diagnosis of diseases, classification of RNA 
sequences in taxonomic studies and spam filtering in e-mail clients (Raschka, 2014). Research of 
Naive Bayes has previously been proved to be an optimal method of clustering and classification, 
 19 
 
no matter how strong the dependencies among the attributes are. If the dependencies distribute 
evenly in classes or if they cancel each other out, Naive Bayes performs optimally (Zhang 2004). 
Recently, Naive Bayes theorem has been applied to image classification algorithms, where the 
Local Naive Bayes Nearest Neighbor algorithm increases classification accuracy and improves its 
ability to scale to bigger numbers of object classes. (Lowe, 2012) 
2.5.2.4 Decision tree classifiers. 
Decision trees employ a hierarchical decomposition of training data in which certain 
conditions on an attribute value are used to classify and divide data (Quinlan, 1986). The predicate 
and conditions used implies the absence and presence of more than one word. In decision trees; 
the process of dividing data takes place recursively until all the leaf nodes have a minimum number 
of records that show a detailed classification. 
2.5.3 Lexicon Based Approach 
The lexicon-based approach aims at attaining an effective cross-domain performance. The 
methodology works with the assumption that the total of the sentiment orientations of all words 
make contextual sentiment orientations. Here, words that are opinionated are used on the 
classification tasks. The positive opinions are employed in describing desired states while the 
negative opinions express the undesired states. There exists several opinion idioms and phrases 
that are known as lexicons.  There exists different approaches when it comes to compiling and 
collecting the opinions used in the word list. Since the manual approach is time consuming; it is 
used together with other faster approaches that are automatic with the aim of checking out for any 
errors and mistakes. The common approaches are discussed in the section below (Bhardwaj et al., 
2015): 
 20 
 
2.5.3.1 Dictionary-Based Approach 
In a dictionary based approach; sets of words or opinions are collected manually based on 
set subjects. The same grows through searching for more words in a corpora WordNet or thesaurus 
for simple synonyms as well as antonyms. The words obtained are often added to a seed-list 
resulting in the formation of new iterations. The process of iteration stops when the system fails to 
find new words. At the end of the process; there is manual inspection done with the purpose of 
eliminating any existing errors. The key disadvantage of this approach is the inability to find any 
opinions or words that are context or domain specific orientations (Mohammad et al., 2009).   
2.5.3.2 Corpus-based Approach 
The Corpus-based method helps to resolve the difficulty of finding opinion words given 
context-specific orientations. Its methods depend on syntactic patterns or patterns that occur 
together along with a seed list of opinion words to find other opinion words in a large corpus 
(Medhat et al., 2013). The Corpus-based approach tries to find co-occurrence patterns of words to 
determine their sentiments. This approach is based on seeding list of opinion words and then find 
another opinion words which have a similar context. This method is used to assign happiness factor 
of words depending on the frequency of their occurrences in “happy” or “sad” blog post (Bhardwaj 
et al., 2015). 
2.5 Twitter Analysis 
For any platform to be feasible as a predictor of stocks, the platform itself must be 
appropriate for the gatheoring of data. Twitter offers a comprehensive search API, up to seven 
days back in time, but also offers the opportunity to query against tweets in real-time, through its 
streaming API (Arafat et al., 2013). The Twitter API is convenient since it removes the need to 
 21 
 
batch data gathering and management, and offers a whole new aspect to Stock Prediction due to 
the high accessibility of data. A major drawback using the Twitter Search API is the limitation on 
complexity where overly complex queries are restricted, and the limitation on the availability of 
data older than a set number of days, seven days to be precise. This is because the Search API 
makes use of indices that only contains the most recent or popular tweets, according to the 
developer's page on the Twitter website (Twitter, 2017). Furthermore, it is explained that the 
Twitter Search API should be used for relevance and not completeness and that some tweets and 
users might be missing in the query results. 
The Twitter Search API Developers Page propose that the Streaming API is more suitable 
for completeness-oriented queries which would be the case of gathering data for the Sentiment 
Analysis where high completeness is required to analyze the whole picture rather than specific 
chunks of data (Twitter, 2017). The Streaming API is also favored by existing research on the 
subject (Choi & Varian 2012). 
2.6 Empirical Review 
2.6.1 Prediction of movies success using sentiment analysis of tweets 
The researcher tried to predict the popularity of movies from twitter sentiment analysis on 
the movies. He manually labels tweets to create a training set and train a classifier to classify the 
tweets into positive, negative, neutral, and irrelevant. He further developed a metric to capture the 
relationship between sentiment analysis and the box office results of movies. He finally predicted 
the Box Office results by classifying the movie as three categories: hit, flop, or average (Jain, 
2013). The prediction was of eight movies which had just been released.The prediction outcome 
were five movies to be a hit and one to be a super hit, one to be average and he could not determine 
the success rate for one due to it data unavailability. Comparing his prediction results with box 
 22 
 
office results he found his prediction model to be exact for in four cases; a case was on the 
borderline between hit and average and for another one he could not find data to check the 
prediction confidence. (Jain, 2013). 
2.6.2 Other Related Works 
Kihoro and Okango (2014) used an artificial neural network (ANN) model in predicting 
stock market prices of Equity Bank in Kenya. They used the company’s historical data then fitted 
it in an ARIMA model to identify the best input lags into the ANN model. The best combination 
of the lags was taken as input lags. The historical data used was obtained from Nairobi’s Security 
Exchange financial and investment segment, comprising 487 daily share price for the bank. They 
observed that ANN could effectively model the stock market prices. The model was able to 
discover non-linear relationship in the data which was evident in the fact that the mean-squared 
misclassification between the predicted share price and the desired share price was very minimal. 
The ANN architecture gave the best results in terms mean squared error. 
The proliferation of documents online and user-generated texts led to the recent growth in 
exploration in the field of sentiment analysis and their relationship with financial markets. The 
authors discuss the application of Twitter as a corpus for the achievement of sentiment analysis. 
The discussion is on the methods used in the gathering and processing of tweets from twitter. The 
writers use emoticons to formulate a training set used in sentiment classification; a machine 
learning technique that significantly reduces physical tweet tagging. The training set in the study 
was split into two sets of positive and negative samples that were based on sad and happy 
emoticons. Additionally, they analyze a few accuracy improvement methods. Similarly, Albert 
and Eibe (2011) presents an interesting discussion on streaming the Tweet mining process and the 
process of sentiment extraction as well as opinion mining. Bollen and Mao (2010) offered the 
 23 
 
primary indications that there may be an existing correlation between stock market prices and 
Twitter sentiments. In the study, a sentiment result is connected with the DJIA and fed into an 
ANN algorithm to predict future market movements. The study uses a mood-tracking instrument; 
Opinion-Finder to find the mood in six dimensions (Alert, Calm, Sure, Kind, Vital, and Happy). 
Thereafter, they relate the mood-time-series with DJIA final values by means of a Self-Organizing 
Fuzzy NN. Using the techniques, the researchers measured a possible improvement in DJIA’s 
prediction accuracy. After successful publication, the paper launched abundant of the research in 
the determining the relationship between Twitter and existing market sentiments. 
 24 
 
2.7 Conceptual Model 
 
Figure 2:2: Conceptual Model 
 
 
 25 
 
Data mining of the mood, opinion, and financial news will be retrieved from Twitter. Then 
the tweets will be processed for them to be classified to be as either positive, neutral or negative. 
Lastly, the classified tweets will generate the overall mood and use the previous prices the model 
will be able to predict the future prices. The historical data or NSE will be collected then processed 
to extract features for prediction. The association between the values of stock and the sentiment 
value will be generated which will later be used in our model for stock prediction. 
  
 26 
 
Chapter 3 : Research Methodology 
3.1 Introduction 
According to Bhatnagar and Singh, research methodology can be defined as the process of 
systematically solving problems. This project will use experimental research to create an artificial 
intelligence tool based on a model and test its performance on a practical problem. The tool will 
have a model that will learn from previous shares prices, sentiments from the public and financial 
news found on social media, in this case, Twitter, that will be used to predict the daily prices for 
future of a particular stock. The project's objectives stated in chapter one will guide the research. 
3.2 Research Design  
The research design is the arrangement of conditions for collection and analysis of data in 
a manner that aims to combine relevance to the research purpose with economy in procedure. It 
constitutes of the conceptual structure within the research conducted. This research will take an 
experimental design approach to develop the different models for predicting the stock prices with 
different data samples which can give the best performance and thus the best result. For purposes 
of designing and evaluating the model, the research will need data from a typical stock exchange 
market NSE, and sentiment and opinion data from Twitter 
3.3 Target Population and Sampling 
The target population is defined as the total number of units in a study environment from 
which a sample may be selected. In this study, Twitter posts and comments related to stock market 
prices and companies will be used.   
3.4 Data Collection 
The process of data collection in sentiment analysis entails collecting of sentiment related 
data. Different statistical learning methods, adequate data sets for tweets and stocks are necessary.  
 27 
 
3.4.1 Twitter Data 
For tweet collection, Twitter provides a rather robust API. There are two possible ways to 
gather Tweets, using the Streaming API or the Search API. The Streaming API lets users obtain 
real-time access to tweets from an input query. The user first requests a connection to a stream of 
tweets from the server. Then, the server opens a streaming connection and tweets are streamed in 
as they occur, to the user. A limitation of the streaming API is that one cannot specify the language, 
i.e., English or Swahili. 
3.4.2 Python 
The programming language that will be used for collecting the data through the Twitter 
Streaming API is python. Python is a programming language commonly used for statistical 
computing and computer graphics. Data miners and statisticians extensively use it for data 
analysis. The reason why python is chosen for computing the data is primarily its powerful tools 
and large community. Python programming also has vast libraries for performing statistics and 
machine learning. 
3.4.3 Stock Data Collection 
The stock data will be collected using web scraping, which is the act of extracting 
information from the web. The data will table the format shown below. 
 
 
 
 
 28 
 
Date Company 
Lowest 
Price of the 
Day 
Highest 
Price of 
the Day 
Closing 
Price 
Previous 
Day 
Closing 
Price 
Volume 
Traded 
1/3/2017 Eaagads Ltd 25.5 25.5 25.5 25.5 2500 
1/3/2017 Nation Media 
Group 
83 88.5 85 87 15500 
1/3/2017 Standard 
Group Ltd 
18.75 18.75 18.75 18.75 --- 
1/3/2017 Centum 
Investment 
Company Ltd 
34.5 35.75 34.5 35 13600 
Table 3:1: NSE Data 
3.5 System Development Methodology 
According to Berman (2006), with Rapid Prototyping, also known as Rapid E-learning, 
learners or subject matter experts interact with prototypes and instructional designers in a 
continuous review and revision process. 
 
Figure 3:1: RAD Development Cycle 
 29 
 
Phases of RAD Development 
(i) Planning Phase- It involves getting the system requirements and doing a quick analysis. In 
this phase, all the tools and necessary materials will be gathered, and the plan for the whole 
process will be created. 
(ii) Prototyping phase- In this phase, the designs and models for the prototype are created. The 
development follows after.  
(iii) Testing- This phase involves the validation of the models created. Unit, integration and 
system testing are done. 
(iv) Cutover- This is the final phase which involves the including data conversion and 
deployment of the system. The development of a prototype is the first step and analysis is 
continuous throughout the process. This strategy has many potential benefits including 
reduction in production time and the cost of late development revisions (Jones & Richey, 
2000). 
3.6 Research Quality 
Validity is the degree to which a concept is accurately measured in a quantitative study. 
The second element in measuring the research quality is reliability or accuracy. This is the extent 
to which a research instrument consistently has the same results if it is used in the same situation 
on repeated occasions (Heale & Twycross, 2015).The reliability of the sources of information of 
the data that will be used in the study, the research instruments, and any other concerned research 
aspect will be guaranteed and accredited. In all datasets, there will be no missing values because 
all companies' stock prices are posted daily. In the validation of the model, a confusion matrix will 
be used. A confusion matrix contains information about actual and predicted classifications done 
 30 
 
by a classification system. Performance of such systems is commonly evaluated using the data in 
the matrix (Kohavi and Provost, 1998). 
 Actual class 
 
Positive sentiment Negative sentiment 
Predicted class  Positive sentiment TP FP 
Negative sentiment FN TN 
Table 3:2: Confusion Matrix 
Evaluation Metrics of the confusion matrix according to Kohavi and Provost (1998) 
True positives (TP): is the number of correct predictions that an instance is positive. 
True negatives (TN): is the number of incorrect predictions that an instance is negative. 
False positives (FP): is the number of incorrect of predictions that an instance positive. 
False negatives (FN): is the number of correct predictions that an instance is negative. 
Accuracy is the percentage of the total number of predictions that were found to be correct. It is 
determined using the equation:  
Accuracy =  
TP + TN
TP + TN + FP + FN
  
Equation 3:1: Accuracy 
Misclassification Rate: Overall, how often is the model wrong? 
 31 
 
Misclassification Rate = 
FP+FN
Total
  
Equation 3:2: Error Rates 
This is also known as the error rate 
True Positive Rate: is the proportion of positive cases that were correctly identified, as 
calculated using the equation: 
True Positive Rate = 
TP
Actual Yes Values
  
Equation 3:3: Recall 
This is also known as sensitivity or recall. 
False Positive Rate: is the proportion of negatives cases that were incorrectly classified as 
positive, as calculated using the equation: 
False Positive Rate =  
FP
Actual No Values
 
Equation 3:4: False Positive Rate 
  
Precision: is the proportion of the predicted positive cases that were correct, as calculated using 
the equation:  
Precision =  
TP
Predicted Yes Values
  
Equation 3:5: Precision 
  
 32 
 
Chapter 4 : System Design and Architecture 
4.1 Introduction 
This chapter reviews the proposed architecture, analysis and design of the stock market 
price prediction model. The system design and architecture was achieved through UML diagrams; 
use case diagram, sequence diagrams and the data flow diagrams. The diagrams provide detailed 
descriptions of the components of the proposed system and their interaction at each level.  
4.2 Requirements Analysis 
Based on the objectives as well as the user requirements, this section outlines the various 
requirements to be met in the research. 
a. Functional requirements 
These are functions or processes the proposed system and its components must perform. They 
are a definition of what users of the system expect form it. For the system, the functional 
requirements include: 
a. The system should allow a user to select the stock to be predicted. 
b. The system should crawl the historical and current price on a company's stock price 
c. The system should retrieve financial news and sentiments pertaining the company's stock 
from twitter 
d. The system should perform preprocessing of the tweets to clean then and store them in a 
comma separated values (csv) file. 
e. The system should be able to generate an approximate share price for the next trading day 
 
 
 33 
 
b. Non-functional requirements 
Unlike the functional requirements, non-functional requirements place constraints or limits in 
how the proposed system will achieve its functional requirements. They describe how well the 
system does its functions and are classified based on the needs of the users. The non-functional 
requirements of the system include: 
a. Usability- The intended users of the proposed system are the stock brokers from different 
accredited trading firms. The interaction with the system will be simple to allow stock price 
prediction. 
b. Reliability - The reliability of the model will highly depend on the accuracy of the data 
collected (stock). As this data will be used to train the model which will be used in 
prediction. 
c. Interoperability – This is the degree to which the developed system will be able to facilitate 
of couple the different interfaces with other systems.  
d. Response time – this is defined as the time between the end of a request by a user and start 
of the response. For the proposed system, the response time should be fast.  
e. Scalability – This describes the degree in which the system is able to expand its processing 
or functional capabilities outward or upward with the aim of supporting business growth 
and user requirements. 
f. Persistent storage- the proposed system components and devices should be able to retain 
data or information after device’s power have been shut down or eliminated.  
4.3 System Architecture 
System architecture outlines the structure of a system and its behavioral components. The 
proposed stock prediction system comprises of the classifier, the machine learning predictor, pre-
 34 
 
processor component and historical stock prices data. The only users of the system will be the 
stockbrokers for the different companies. The stock price prediction begins with the user or 
stockbroker entering keywords to retrieve tweets related to stock prices of a particular company. 
Once the tweets have been obtained, the twitter search API matches the keywords and sends them 
to the pre-processor for cleanup. The processed tweets are transformed into a document-term 
matrix that is suitable for machine learning algorithms and the tweet classifier. Based on the 
Machine Learning algorithm employed in the study; the tweets are classified as either positive (1), 
negative (-1) or neutral (0). Based on the classification, the tweets are matched against historical 
stock prices in the database hence the prediction of the stock price for the period under review. 
The result is then presented to the stockbroker for analysis or decision making. The figure 4:1 
below illustrates the architecture of the proposed system: 
 
Figure 4:1: System Architecture 
 35 
 
4.4 System Analysis 
4.4.1 Use Case Diagram 
The use case diagram in the analysis phase is used to describe the interactions between the 
system users and system itself. The most common relationships captured in a use case diagram are 
those between the actors, use cases and system. In the stock prediction model using sentiment 
analysis, the actors in the system are the stockbrokers, twitter search API, Historical data Module 
and the prediction model as illustrated below I figure 4:2: 
 
Figure 4:2: Use Case Diagram 
 36 
 
4.4.2 Use Case Scenarios 
Also known as use case narratives, is a detailed text-based and step-by-step dialogues and 
interactions between the actors and the system. In system analysis, the use case narrative is used 
to explain a complete business transaction successful or unsuccessful. The use case narrative for 
the proposed system is as below: 
a. Scenario 1: Crawl Company’s Share Prices 
Use Case: Crawl Company’s Share Prices 
Primary Actor: User/Stock Broker 
Precondition: The Company selected is listed in the NSE 
Post Condition: System fetches the company’s daily share price and saves them in a CSV file 
Main Success Scenario:  
Actor System Responsibility 
1. User enters the company 
whose share price is to be 
crawled 
 
 2. The system takes the company’s name and 
crawls the share/stock price 
 3. The system retrieves the share prices of the 
company 
 4. The share prices are saved in a CSV file 
 37 
 
5. Views the CSV file with the 
collected tweets 
 
Extensions: 
At any time the system fails to retrieve tweets: the user must confirm that there is internet access 
and restart the system 
b. Scenario 2: Predicting Stock Prices 
Use Case: Predict company stock prices 
Primary Actor: User/Stock Broker 
Precondition:  
1. Company’s financial news tweets are stored successfully in a CSV file 
2. Company’s share price data are stored successfully in a CSV file 
Post Condition:  
1. The system fetches the financial news tweets and the company’s and uses them to predict 
next day and future share/stock prices of the specific company. 
Main Success Scenario: 
Actor System Responsibility 
1. The user selects the 
company share price to be 
predicted 
 
 38 
 
 2. The system fetches the financial news tweets and 
performs the classification of each day’s financial 
news to either positive( 1), negative (-1) or neutral 
(0) 
 3. The system combines the aggregated sentiment and 
historical share price and uses the model to predict 
the future share/stock prices. 
 4. The company saves the predicted stock/share price 
5. The user views the 
predicted share price 
 
Extensions: 
At any time the system fails to provide predictions user should repeat the process until successful 
prediction of stock prices. 
c. Scenario 2: Search for Financial Sentiment News 
Use Case: Search for Financial Sentiment News 
Primary Actor:  
1. User/Stock Broker 
2. Twitter Search API 
Precondition:  
1. The company whose financial sentiment news is being retrieved is set or determined and 
listed on the NSE 
 39 
 
Post Condition:  
1. System fetches financial sentiment tweets of the related company using the twitter search 
API  
Main Success Scenario 
Actor System 
1. The user enters the keywords to be used 
to find financial news about the 
company 
 
 2. Passes the keyword entered to the 
twitter search API  
 3. Retrieves tweets from twitter search 
API based on the keyword 
 4. Saves tweets in a CSV file 
5. Views CSV file with the collected 
tweets 
 
Extensions: 
At any time the system fails to load or provide information regarding stock price and company 
sentiments; the user should cancel and restart the process or repeat the same for clarification 
 
 
 
 40 
 
4.4.3 Sequence Diagram 
Sequence diagrams depict the chronological flow of events in the system. In essence they 
describe communication and relationships between objects together with messages that trigger the 
communications. The user or stockbroker enters keywords that are used as search parameters for 
Twitter through the web platform. When the keywords are obtained, the system passes them 
through a Twitter Search API which returns results that are later saved into a CSV file. The 
stockbroker initiates the classification of the sentiments or tweets related to the company’s stock 
or share price.  
The web platform passes a message cleanup_tweets() to the preprocessor which processes 
the retrieved tweets and returns tweets that match that of the company or stock prices. The 
obtain_tweet_features() message is sent to the feature extractor component. The result is passed to 
classify_tweets() message that classifies the tweets as either positive (1), negative (-1) or neutral 
(0) about stock market pricing of a Company. The system loops through the clean_tweets(), 
obtain_tweet_features() and classify_tweets() based on user requests. The process ends when the 
stockbroker requests for the results of the classification for the tweets received.  
 The sequence flow of events in the proposed system is as in figure 4:3: 
 41 
 
 
Figure 4:3: Sequence Diagram 
 42 
 
4.5 System Design 
4.5.2 Context Diagram 
These are graphical representations of the flow of data through the information system. A 
DFD shows the flow of data from external entities into the system as well as movement of data 
from one process to another. Context diagrams is a DFD that represents the scope of a system 
consisting of external entities and system boundaries together with information flows between the 
system and the external entities. The context diagram shows participants or entities that will 
interact with the proposed system. The figure 4:4 below shows the context diagram of the stock 
market price prediction using sentiment analysis. 
 
 
Figure 4:4: Context Diagram 
 43 
 
4.5.3. Level 1 DFD Diagram 
The context diagram is then expanded into several inter-related processes or levels. A level 
1 DFD diagram represents the system’s main processes, data stores and data processes with high-
level details. With DFDs, it is easier for system users and non-users to understand how data flows 
through the system. The level 1 DFD is different from the context diagram as they illustrate the 
first level processes of the system. The level 1 DFD of the stock prediction model is as below in 
figure 4:5: 
 
Figure 4:5: Level 1 DFD Diagram 
. 
 
 
 44 
 
Chapter 5 : System Implementation and Testing 
5.1 Introduction 
This chapter explains how the model of the prototype was developed and tested. It explains 
the whole process starting with the building of the sentiment analysis for financial news corpus 
which entails the way the data was obtained and the format. The next step is the preprocessing of 
the corpus then followed by the training of the model. The model is tested against 30% of the 
dataset to get the accuracy. The sentiment value is then correlated with the day’s share price to 
build a forecasting model for 30 days in the future. Various experiments with different algorithms 
with different features were tested to pick the best model. 
5.2 Sentiment Analysis 
5.2.1 Building Sentiment Corpus 
Tweets with financial news were crawled from twitter using the Twitter API with the help 
with tweepy which is a python library. The tweets were crawled and saved in a csv. Figure 5:1 and 
figure 5:2 illustrates a sample code used in crawling and downloading messages respectively.  
 
Figure 5:1: Crawler Code 
 45 
 
 
Figure 5:2: Downloading Tweets 
5.2.2 Preprocessing  
The text data obtained was in unstructured data format and this cannot be used in building 
a machine learning model. The data contains the tweet id, created at which is a timestamp and the 
text field. Text data contains more noisy words and symbols which are not contributing towards 
classification. The text data contains numbers, white spaces, tabs, punctuation characters, stop 
words, URL links, retweet symbols and others. All this needed to be cleaned by removing all those. 
In preprocessing the first start was to harmonize the text by converting it to lowercase. The 
text may have different cases, and this may affect the classification. Then we remove the URL 
links, hashtags, usernames, punctuations and Twitter symbols. Figure 5:3 show the sample code 
used in the preprocessing of the data crawled. Figure 5:4. Shows the sample dataset that was 
crawled in its raw state. Figure 5:5 shows how the dataset looks after being cleaned. 
 46 
 
 
Figure 5:3: Text Cleaning Code 
 
Figure 5:4: Data Retrieved From Twitter 
 47 
 
 
Figure 5:5: Cleaned Twitter Data 
5.2.3 Labeling 
After the preprocessing, we need labels that will serve as a training and testing dataset. The 
labels are considered the most important to the model development. Each tweet of this dataset is 
tagged as either 1 if positive or -1 if negative or 0 if neutral. To classify these tweets. This process 
was done manually. Figure 5:6 below illustrates a sample dataset: 
 
Figure 5:6: Labeled Dataset 
 48 
 
5.2.4 Training the model 
The next crucial step is the creation of the model by training using our label dataset as 
shown in figure 5:7. SVM with bigram feature performed the best during the experiments, so the 
model was trained using it. The corpus was first shuffled then split into two; the training dataset 
which is 70% and the rest 30% is used as a testing dataset to measure the model’s performance. 
Of the experiments carried out, SVM exhibited the best performance and using bigram feature 
increased its accuracy. 
 
Figure 5:7: SVM Model Training Code 
5.2.5 Testing  
The testing dataset constituted 30% of the original dataset which was used to validate the 
model. The model was validated using a confusion matrix in table 5:1. 
 Actual -1 (negative) Actual 1 (positive) 
Predicted -1 (negative) 207 241 
Predicted 1 (positive) 59 1290 
 
 49 
 
Table 5:1: Confusion Matrix 
From the confusion matrix, we are able to get the values for true positive, true negative, false 
positive and false negative as illustrated in Table 5:2 
True Negative 207 
False Negative 241 
True Positive 1290 
False Positive 59 
 
Table 5:2: Confusion Matrix Values 
The metrics: accuracy, recall, precision, and f-score can then be calculated from the values in Table 
5:3 or the confusion matrix on table 5:1.  The accuracy of the model was computed to be 83%. 
 Precision Recall F-Score Support 
Positive (1) 0.84 0.96 0.90 1349 
Negative (-1) 0.78 0.46 0.58 448 
Total Average 0.83 0.83 0.82 1749 
 
 50 
 
Table 5:3: SVM Performance 
Also a receiver operating characteristics (ROC) curve for the SVM model was drawn. This helps 
to visualize the performance of the classifier. 
 
Figure 5. 1 ROC Curve 
5.3 Share Price Prediction 
We require the existing share prices which include the historical and current share prices 
and incorporate the day’s financial news sentiment to help to predict future share price value. The 
sentiment values serve as input into the new model. 
 51 
 
5.3.1 Building corpus 
This involved crawling NSE historical data for the chosen company which is Equity bank. 
The data was daily data for the last one year and was saved in a csv file shown in figure 5:8. 
 
Figure 5:8: Share Price Dataset 
5.3.2 Preprocessing 
This involved converting the daily prices for all the company for that day to monthly or 
yearly share prices data format. Also, it involves incorporating the day’s sentiments which 
comprise of the sentiment value, number of positive tweets and the number of negatives tweets. 
The day’s sentiment values are derived by performing a K-Nearest Neighbor on the classified day’ 
tweets, and a number of tweets that have the highest number of sentiment classification takes the 
final sentiment value i.e. if the negative values are ore than positive sentiments then the final 
sentiment for the day is negative. 
 52 
 
 
Figure 5:9: Training Dataset 
5.3.3 Training the model 
Artificial neural network proved the best algorithm that yielded the best result for 30 days 
share price prediction. The dataset is split into two; the training dataset which is 70% and the rest 
30% is used as a testing dataset to measure the model’s performance. 
 
Figure 5:10: ANN Training Code 
 53 
 
5.3.4 Testing the model 
For testing, 30% of the dataset was used for testing. Then a measured error was used to 
compute the error between the predicted and the actual. The figure 5:11 shows the 30 days 
predicted by the model, and we have the current values for the 30days which were predicted. 
 
Figure 5:11: ANN Predicted Prices 
The figure 5:12 shows a sample computation of the mean squared error 
 
Figure 5:12: Mean Squared Error Code 
Table 5.4 illustrates the performance of the predicted value Mean squared error was performed. 
Mean Squared Error 5.77 
 
Table 5:4: SVM Mean Squared Error 
 54 
 
 
Figure 5:13: ANN Predicted Price and Actual Price Graph 
 
 
 
  
 55 
 
Chapter 6 : Discussions 
6.1 Introduction 
This chapter discusses the experiments carried during this research to achieve the specified 
results in line with the objectives of the project. The main objective was to build a model that can 
use existing historical data and sentiment analysis to predict the share price for a company. For 
sentiment, analysis SVM seemed to perform better than the other machine learning method while 
artificial neural network performed better in the prediction of the share price. 
6.2 Experiments on Sentiment Analysis 
6.2.1 Using Different Classifiers 
The purpose of this experimentation was to compare the performance of the different 
classifiers for sentiment analysis. Four machine learning methods were used namely SVM, Naive 
Bayes, Random Forest and K-nearest neighbor. The results are shown in table 6:1 
Classifier Accuracy Precision  Recall F-score 
SVM 0.83 0.82 0.83 0.82 
Naive bayes 0.77 0.81 0.78 0.70 
Random Forest 0.813 0.80 0.81 0.79 
KNN 0.801 0.79 0.80 0.77 
 
Table 6:1: Classifiers Performance 
 56 
 
6.2.2 Experiment 1: SVM with Different Feature Types 
The main purpose of this experimentation was to determine the effect of using different 
feature types on the SVM and check how the accuracy of the model is affected. The features used 
were unigram, bigrams, and trigrams. The results in Table 6:2 shows that the best performance of 
the SVM classifier is obtained when using bigram feature. 
Feature Accuracy Precision Recall F-Score 
Unigram 0.833 0.83 0.83 0.82 
Bigram 0.839 0.84 0.84 0.82 
Trigram 0.838 0.83 0.84 0.82 
 
Table 6:2: SVM Performance 
6.2.3 Experiment 2: Naïve Bayes with Different Feature Types 
This experiment is similar to experiment 2, using unigram, bigram and trigrams on Naive Bayes 
checking its performance. The results in Table 6:3 shows that the best performance of the Naïve 
Bayes classifier is obtained when using unigram feature. 
Feature Accuracy Precision Recall F-Score 
Unigram 0.768 0.81 0.77 0.69 
 57 
 
Bigram 0.762 0.82 0.76 0.67 
Trigram 0.759 0.82 0.76 0.66 
Table 6:3: Naive Bayes Performance 
6.2.4 Experiment 3: Random Forest with Different Feature Types 
This experiment is similar to experiment 2, using unigram, bigram and trigrams on Random Forest 
checking its performance. The results in Table 6:3 shows that the best performance of the Random 
Forest classifier is obtained when using bigram feature. 
Feature Accuracy Precision Recall F-Score 
Unigram 0.802 0.80 0.79 0.79 
Bigram 0.798 0.79 0.79 0.78 
Trigram 0.783 0.78 0.78 0.77 
Table 6:4: Random Forest Performance 
6.2.5 Experiment 4: KNN with Different Feature Types 
This experiment is similar to experiment 2, using unigram, bigram, and trigrams on K-Nearest 
Neighbor checking its performance. The results in Table 6:2 shows that the best performance of 
the SVM classifier is obtained when using bigram feature. 
 
 58 
 
Feature Accuracy Precision Recall F-Score 
Unigram 0.800 0.79 0.80 0.77 
Bigram 0.797 0.79 0.80 0.76 
Trigram 0.796 0.79 0080 0.76 
Table 6:5: KNN Performance 
6.3 Stock Prediction 
6.3.1 Different Classifiers  
The goal of this experiment was to compare the performance of the different classifiers for 
stock market price incorporating the sentiment value. Four machine learning methods were used 
namely support vector machine, Naive Bayes, Random Forest and artificial neural network. From 
the Table 6:6 it shows Artificial neural network was better regarding the values predicted as its 
mean squared error was the least while the SVM model was the worst as it had its mean squared 
value the highest. 
Classifier Mean Squared Error 
SVM 129.40 
Naive Bayes 43.59 
Random Forest 74.96 
 59 
 
Artificial Neural Network 5.77 
 
Table 6:6: Classifiers Performance in Stock Price Prediction 
Figure 6:1 shows the different 30 days prices predicted by the different models and also the actual 
prices that occurred in the 30 days. From the graph, ANN price curves are the closest to the actual 
prices in terms of the graph trend. 
 
Figure 6:1: Different Classifiers Performance Graph 
 
 60 
 
Chapter 7 : Conclusion and Recommendations 
7.1 Conclusion 
In this research, we have tried to predict equity’s 30 days share price movement on Nairobi 
Stock Exchange by performing sentiment analysis on financial news tagged tweets on twitter about 
the company.  
7.2 Recommendation 
To increase the accuracy of the model and for it to be more reliable, the financial news 
should be collected from multiple sources such as digital newspaper headlines to avoid bias and to 
provide diversity in the news. Also, the data should be collected continuously for a duration greater 
than one year to avoid one been limited. On Twitter, one should be able to subscribe to the 
enterprise package though it’s costly. This will help one retrieve historical tweets that may have a 
financial impact on that days share price that will help the building and increasing the accuracy of 
the model. Currently, Twitter API is limited to only 7 days tweets. 
A method needs to be devised on how we can uniquely identify any company listed on 
Nairobi Stock Exchange market on the internet. An example is Stock listed on US stock exchange 
like Facebook and Google which are listed on NASDAQ. To identify them anywhere is the internet 
is easy as theyy have the symbol $FB and $GOOGL respectively, this makes one easy to crawl 
any news that has the initial provide for them. If this can be implement to NSE listed companies 
to that unique way of identifying them on the internet or social media, this will make it easy to 
crawl their data. 
 
  
 61 
 
References 
Albert, B., & Eibe, F. (2011). Sentiment knowledge discovery in twitter streaming data. 
Alejandro Mosquera, Lamine Aouad, Slawomir Grzonkowski, & Dylan Morss. (2014). On Detecting 
Messaging Abuse in Short Text Messages using Linguistic and Behavioral patterns. 
Andries, P. E. (2007). Computational Intelligence: An Introduction (2 Edition). Wiley Publishing. 
Antweiler, W., & Frank, M. Z. (2004). Is All That Talk Just Noise? The Information Content of Internet 
Stock Message Boards. Journal of Finance, 59(3), 1259–1294.  
Arafat, J., Habib, A., & Hossain, R. (2013). Analyzing Public Emotion and Predicting Stock Market Using 
Social Media. American Journal of Engineering Research, 265–275. 
Batool, R., Khattak, A., Maqbool, J., & Lee, S. (2013). Precise tweet classification and sentiment analysis, 
461–466. 
Berman, P. (2006). E-Learning Concepts and Techniques. 
Bhardwaj, A., Narayan, Y., Dutta, M., Vanraj, & Pawan. (2015). Sentiment Analysis for Indian Stock 
Market Prediction Using Sensex and Nifty. 
Bollen, J., & Mao, H. (2010). Twitter mood as a stock market predictor. IEEE, 91– 94. 
Brealey, R. A., Myers, S. C., & Allen, F. (2005). Corporate Finance (8 Edition). New York: McGraw-Hill 
Irwin. 
Brownlee, J. (2016a, November 18). What is a Confusion Matrix in Machine Learning? Retrieved 28 
August 2017, from https://machinelearningmastery.com/confusion-matrix-machine-learning/ 
 62 
 
Burton, M. (2003). The Efficient Market Hypothesis and its Critics. The Journal of the Economic 
Perspectives, 17(1), 59–82. 
Capital Markets Authority. (2017). Retrieved 26 August 2017, from 
https://www.cma.or.ke/index.php/about-us/who-we-are 
Central Depository & Settlement Corporation (CDSC). (2017, August 7). Retrieved from 
http://fib.co.ke/deal/cds/ 
Choi, H., & Varian, H. (2012). Predicting the Present with Google Trends.The Economic Record, 2–9.  
Choudhury, M. D., Sundaram, H., John, A., & Seligmann, D. D. (2010). Can Blog Communication 
Dynamics be correlated with Stock Market Activity? Proceedings of the Nineteenth ACM 
Conference on Hypertext and Hypermedia. 
Chu, T., Jue, K., & Wang, M. (2017). Comment Abuse Classification with Deep Learning. 
Dawei Yin, Zhenzhen Xue, & Liangjie Hong. (2009). Detection of Harassment on Web 2.0. 
Dinesh Sonachalam. (2015). Using Twitter to predict Stock Market Returns.International Journal of 
Scientific & Engineering Research, 6(10). 
Ding, X., Zhang, Y., Liu, T., & Duan, J. (2015). Deep Learning for Event-Driven Stock Prediction. 
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 
2015). 
Dumais, S. (2001, November). Support Vector Machines. Retrieved 19 August 2017, from 
https://www.microsoft.com/en-us/research/project/support-vector-machines/ 
Fama, E. (1965). The behavior of stock-market prices. The Journal of Business. 
 63 
 
Fama, E. (1970). Efficient Capital Markets: A Review of Theory and Empirical Work. Journal of Finance, 
25(2), 383–417. 
Fang, X., & Zhan, J. (2015). Sentiment Analysis Using Product Review Data. Journal of Big Data, 2. 
Gilbert, E., & Karahalios, K. (2010). Widespread Worry and the Stock Market. 4th International AAAI 
Conference on Weblogs and Social Media (ICWSM). 
Hellstrom, T. (1998). A Random Walk through the stock Market Licentiate. Umea Univeristy. 
History of NSE - Nairobi Securities Exchange (NSE). (2017). Retrieved 16 August 2017, from 
https://www.nse.co.ke/nse/history-of-nse.html 
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, & Radha Poovendran. (2017). Deceiving Google’s 
Perspective API Built for Detecting Toxic Comments. 
Huang, C., Chen, P., & Pan, W. (2011). Using Multi-Stage Data Mining Technique to Build Forecast 
Model for Taiwan Stocks. Neural Computing and Applications. 
Jain, V. (2013). Prediction of Movie Success using Sentiment Analysis of Tweets. The International 
Journal of Soft Computing and Software Engineering. 
Jones, T. S., & Richey, R. C. (2000). Rapid prototyping methodology in action: A developmental study, 
63–80. 
Khatri, S. K., Singhal, H., & Johri, P. (2014). Sentimental analysis to Predict Bombay Stock Exchange 
Using Artificial Neural Network, 380–384.  
Khatri, S. K., & Srivastava, A. (2016). Using Sentimental Analysis in Prediction of Stock Market 
Investment. 5th International Conference on Reliability, Infocom Technologies and Optimization 
(ICRITO). 
 64 
 
Kihoro, J., & Okango, E. (2014). The stock market price prediction using artificial neural networks: An 
application to the Kenyan Equity Bank share prices, 16. 
Kohavi, R., & Provost, F. (1998). On Applied Research in Machine Learning. In Editorial for the Special 
Issue on Applications of Machine Learning and the Knowledge Discovery Process, 30. 
Lawrence, S. (2002). A Model for Stock price Fluctuations Based on Information, 48. 
Lo, A. W., & MacKinley, A. C. (1999). A Non-Random Walk Down Wall Street. Princeton: Princeton 
University Press. 
Lowd, D., & Domingos, P. (2005). Naive Bayes Models for Probability Estimation. 
Lowe, D. (2012). Local Naive Bayes nearest Neighbor for Image Classification. Proceedings of the 2012 
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3650–3656. 
Makrehchi, M., Shah, S., & Liao, W. (2013). Stock Prediction Using Event-based Sentiment Analysis. 
International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT). 
Makrehchi, M., Shah, S., & Wenhui Liao. (2013). Stock Prediction Using Event-based Sentiment 
Analysis. International Conferences on Web Intelligence (WI) and Intelligent Agent Technology 
(IAT). 
Marc Claesen, Smet, F. D., Moor, B. D., & Suykens, J. A. K. (2014). EnsembleSVM: A Library for 
Ensemble Learning Using Support Vector Machines. Journal of Machine Learning Research, 141–
145. 
Maynard, D., & Funk, A. (2011). Automatic detection of political opinions in tweets. Proceedings of the 
8th International Conference on the Semantic Web, 88–99. 
 65 
 
Medhat, W., Hassan, A., & Korashy, H. (2013). Sentiment analysis algorithms and applications: A survey. 
Ain Shams Engineering Journal, 1093–1113. 
Minging, H., & Bing, L. (2004). Mining and summarizing customer reviews. Proceedings of ACM 
SIGKDD International Con- Ference on Knowledge Discovery and Data Mining. 
Mohammad, S., Dunne, C., & Dorr, B. (2009). Generating high-coverage semantic orientation lexicons 
from overly marked words and a thesaurus. 
Nairobi Securities Exchange. (2017). Retrieved 16 August 2017, from https://www.nse.co.ke/nse/about-
nse.html  
NeuroAI. (2013). Stock Market Prediction | Neuro AI. Retrieved 7 August 2017, from 
http://www.learnartificialneuralnetworks.com /stockmarketprediction.html 
Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. 
Proceedings of LREC. 
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in 
Information, 2, 1–135. 
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Sentiment classification using machine learning technique, 
79–86. 
Peterson, C., & Rögnvaldsson, T. (1992). An introduction to artificial neural networks. Proc. 1991 CERN 
Summer School of Computing, 113–170. 
Platt, J. (1999). Probabilities for SV Machines. Advances in Large Margin Classifiers. MIT Press, 61–74. 
Pagolu, S. & Nayan R, Panda, G., Majhi, B. (2016). Sentiment analysis of Twitter data for predicting 
stock market movements. 1345-1350. 10.1109/SCOPES.2016.7955659. 
 66 
 
Pring, M. J. (1991). Technical Analysis Explained. 
Quinlan, J. (1986). Machine Learning 
Raschka, S. (2014, October 4). Naive Bayes and Text Classification. Retrieved from 
sebastianraschka.com/Articles/2014_naive_bayes_1.html 
Rocha, M., & Macedo, M. (2011). Previsão do preço de ações usando redes neurais. Congresso USP de 
Iniciação Científica Em Contabilidade. 
Thomsett, M. C. (1998). Mastering Fundamental Analysis. Chicago: Dearborn Publishing. 
Uhrig, R. (1995). Introduction to artificial neural networks. Proceedings of the 1995 IEEE IECON 21st 
International Conference, 1, 33–37. 
Wanjawa, B. W. (2014, May). A Neural Network Model for Predicting Stock Market Prices at the Nairobi 
Securities Exchange. University of Nairobi. 
Zhang, H. (2004). The Optimality of Naive Bayes. 
Zhang, L. (2013). Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation. 
The University of Texas at Austin. 
Zhang, Peter. (2003). Zhang, G.P.: Time Series Forecasting Using a Hybrid ARIMA and Neural Network 
Model. Neurocomputing 50, 159-175. Neurocomputing. 50. 159-175. 10.1016/S0925-
2312(01)00702-0. 
Zhang, X., Fuehres, H., & Gloor, P. A. (2011). Predicting Stock Market Indicators through Twitter. 
Zupan, J. (1994). Introduction to Artificial Neural Network (ANN) Methods: What They Are and How to 
Use Them, 41–327.  
 67 
 
Appendix 
Appendix A: Originality Report 
 
  
 68 
 
Appendix B: Python Source Code 
import tweepy 
import xml.etree.ElementTree as ET 
class Credentials: 
    def __init__(self): 
        self.credential_xml = 'twitter-credentials.xml' 
    def get_twitter_credentials(self): 
        credential_xml_data = ET.parse(self.credential_xml).getroot() 
        return (credential_xml_data[0].text, 
                credential_xml_data[1].text, 
                credential_xml_data[2].text, 
                credential_xml_data[3].text) 
    def authentinticate_twitter(self): 
        twitter_credentials = self.get_twitter_credentials() 
        auth = tweepy.OAuthHandler(twitter_credentials[0], twitter_credentials[1]) 
        auth.set_access_token(twitter_credentials[2], twitter_credentials[3]) 
        api = tweepy.API(auth) 
        return api 
 
 
 
 
 
 69 
 
The code below was used to clean the text and save it in a csv file. 
import csv 
import re 
import string 
import html 
class Cleaner: 
    def __init__(self): 
        self.remove_punctuations = str.maketrans('', '', string.punctuation) 
    def read_csv(self,csv_name): 
        cleaned_text = [] 
        with open('../data/twitter_data/raw_data/'+csv_name+'.csv', newline='', encoding='utf-
8') as csvfile: 
            reader = csv.DictReader(csvfile) 
            for row in reader: 
                text = row['text'] 
                clean_text = self.clean_tweets(text) 
                cleaned_text.append(clean_text) 
        self.save_cleaned_csv('cleaned_'+csv_name,cleaned_text) 
    def clean_tweets(self,tweet): 
        # harmonize the cases 
        lower_case_text = tweet.lower() 
        # remove urls 
        removed_url = re.sub(r'http\S+', '', lower_case_text) 
 70 
 
        # remove hashtags 
        removed_hash_tag = re.sub(r'#\w*', '', removed_url)  # hastag 
        # remove usernames from tweets 
        removed_username = re.sub(r'@\w*\s?','',removed_hash_tag) 
        # removed retweets 
        removed_retweet = removed_username.replace("rt", "", True)  # remove to retweet 
        # removing punctuations 
        removed_punctuation = removed_retweet.translate(self.remove_punctuations) 
        # remove spaces 
        remove_g_t = removed_punctuation.replace("&gt", "", True) 
        remove_a_m_p = remove_g_t.replace("&amp", "", True) 
        final_text = remove_a_m_p 
        return final_text 
    def pre_cleaning(self,text): 
        html_escaped = html.unescape(text) 
        final_text = html_escaped.replace(';','') 
        return final_text 
    def pre_labeling(self,text): 
        lower_case_text = text.lower() 
        removed_url = re.sub(r'http\S+', '', lower_case_text) 
        return removed_url 
    def save_cleaned_csv(self,name,tweets_list): 
        with open('../data/twitter_data/cleaned_data/' + name + '.csv', 'w') as f: 
 71 
 
            writer = csv.writer(f) 
            writer.writerow(["text"]) 
            for tweet in tweets_list: 
                writer.writerow([tweet,]) 
        pass 
    def save_pre_labled_csv(self,csv_name): 
        cleaned_text = [] 
        with open('../data/twitter_data/raw_data/' + csv_name + '.csv', newline='', 
encoding='utf-8') as csvfile: 
            reader = csv.DictReader(csvfile) 
            for row in reader: 
                text = row['text'] 
                clean_text = self.pre_labeling(text) 
                cleaned_text.append(clean_text) 
        self.save_pre_labeled_csv('unlabeled_' + csv_name, cleaned_text) 
    def save_pre_labeled_csv(self,name,tweets_list): 
        with open('../data/twitter_data/pre_labeled/' + name + '.csv', 'w') as f: 
            writer = csv.writer(f) 
            writer.writerow(["text","label"]) 
            for tweet in tweets_list: 
                writer.writerow([tweet,]) 
        pass 
 
 72 
 
The code below was used in SVM model training. This was used in sentiment analysis. 
 
def svm_accuracy(X, y): 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 
    svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), 
                    ('svm', SVC(kernel="linear", C=1))]) 
    svm = svm.fit(X_train, y_train) 
    ypred = svm.predict(X_test) 
    print("SVM metrics") 
    print(metrics.accuracy_score(y_test, ypred)) 
    print(metrics.classification_report(y_test, ypred)) 
     
This is a sample artificial neural network code that was used in forecasting the store price. 
import pandas as pd 
import numpy as np 
from sklearn import preprocessing, cross_validation 
from sklearn.neural_network import MLPRegressor 
df = pd.read_csv('../equity.csv') 
df_close = df[[3]] 
forecast_out = int(30) # predicting 30 days into future 
df['Prediction'] = df_close.shift(-forecast_out) #  label column with data shifted 30 units up 
# print(df.tail()) 
X = np.array(df.drop(['Prediction'], 1)) 
 73 
 
X = preprocessing.scale(X) 
X_forecast = X[-forecast_out:] # set X_forecast equal to last 30 
X = X[:-forecast_out] # remove last 30 from X 
y = np.array(df['Prediction']) 
y = y[:-forecast_out] 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.3) 
# Training 
clf = MLPRegressor() 
clf.fit(X_train,y_train) 
# Testing 
confidence = clf.score(X_test, y_test) 
print("confidence: ", confidence) 
forecast_prediction = clf.predict(X_forecast) 
print('30 Days prediction') 
print(forecast_prediction)