A Sound Classification and Display Tool for Assisting the Deaf and Hard-of-Hearing: A Case of Kenya

Wanjiru Rosemary Wangari
138771

Master of Science in Information Technology

2023

A Sound Classification and Display Tool for Assisting the Deaf and Hard-of-Hearing: A Case of Kenya

Wanjiru Rosemary Wangari
138771

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Information Technology at Strathmore University

School of Computing & Engineering Sciences
Strathmore University
Nairobi, Kenya

July, 2023

This thesis is available for Library use on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Declaration and Approval

Declaration

I declare that this work has not been previously submitted and approved for the award of a degree by this or any other University. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

© No part of this thesis may be reproduced without the permission of the author and Strathmore University

Student's Name: Wanjiru Rosemary Wangari
Sign: Date: 09/06/2023

Approval

The thesis of Wanjiru Rosemary Wangari was reviewed and approved for examination by the following:

Dr. Victor Rop, Lecturer, School of Computing & Engineering Sciences, Strathmore University

Dr. Julius Butime, Dean, School of Computing & Engineering Sciences, Strathmore University

Dr. Bernard Shibwabo, Director of Graduate Studies, Strathmore University

Abstract

Sound is an essential component of existence in all aspects of life. It is a crucial element in building automated systems for domains such as personal safety and surveillance. Hearing people continually absorb information from the sounds and spoken language around them. Deaf and hard-of-hearing people, on the other hand, do not have this channel of awareness and may face serious problems as a result. Various studies have shown a mismatch between the need for assistive technologies and their demand and supply: the need is high, but demand and supply remain low, which makes it difficult to improve access to assistive devices. There is also a gap between the number of people who require assistive technologies to meet their needs and the number who are willing and able to purchase and use these technologies.
This mismatch could be due to factors such as the cost of the technologies, lack of awareness or knowledge about them, or cultural barriers to their use. Only a small percentage of people have access to assistive devices. This study reviewed the existing assistive technologies for the deaf and hard of hearing. Prior studies on assistive technologies for the deaf revealed that sound classification systems have been developed worldwide, but none has been implemented for use in Kenya. The research employed a machine learning approach, specifically convolutional neural networks, to design a sound classification model. The process involved transforming detected sound events into spectrogram images, which were then processed by the convolutional neural network to extract relevant features. The extracted features were subsequently used to classify environmental sounds such as car horns and dog barking. Once a sound had been classified, a mobile application displayed a notification indicating the type of sound detected. The machine learning model was evaluated for its effectiveness in assisting deaf and hard-of-hearing individuals, with the ability to accurately classify a wide range of urban sounds relevant to the study and display corresponding notifications on the user interface. The development of this model stems from a strong motivation to empower deaf individuals, enabling them to experience greater independence without relying on others, with the aim of bridging the gap between auditory awareness and the needs of the deaf and hard-of-hearing community.

Keywords: Deaf and Hard of Hearing, Convolutional Neural Networks, Sound Classification, Spectrogram.

Table of Contents

Declaration and Approval .......... ii
Abstract .......... iii
List of Figures .......... ix
List of Tables .......... xi
Abbreviations/Acronyms .......... xii
Definition of Terms .......... xiii
Acknowledgements .......... xiv
Dedication .......... xv
Chapter 1: Introduction .......... 1
1.1. Background of the study .......... 1
1.2. Problem Statement .......... 2
1.3. Research Objectives .......... 3
1.3.1. General Objective .......... 3
1.3.2. Specific Objectives .......... 3
1.4. Research Questions .......... 3
1.5. Justification .......... 3
1.6. Scope .......... 4
1.7. Limitations .......... 4
Chapter 2: Literature Review .......... 5
2.1. Introduction .......... 5
2.2. Challenges Facing the Deaf and Hard-of-Hearing .......... 5
2.2.1. Overall understanding and experience of the world for deaf individuals .......... 5
2.2.2. Implications for safety and awareness of potential dangers in the environment .......... 6
2.3. Empirical Literature .......... 6
2.3.1. Assistive Technologies Globally .......... 6
2.3.2. Assistive Technologies in Kenya .......... 9
2.4. Theoretical Framework .......... 10
2.4.1. Computational Theory .......... 10
2.4.2. Cognitive Theory .......... 11
2.4.3. Psychoacoustics Theory .......... 14
2.5. Models .......... 15
2.5.1. Hidden Markov Model (HMM) .......... 15
2.5.2. Gaussian Mixture Model (GMM) .......... 16
2.6. Frameworks .......... 17
2.6.1. TensorFlow .......... 17
2.6.2. Keras .......... 17
2.6.3. Caffe .......... 18
2.6.4. Deeplearning4j .......... 18
2.7. Architectural Designs .......... 19
2.7.1. Assistive Technology Architecture .......... 19
2.7.2. Smart 311 Architecture .......... 20
2.8. Algorithms .......... 20
2.8.1. Decision Trees .......... 20
2.8.2. K-Nearest Neighbor (KNN) .......... 21
2.8.3. Naïve Bayes .......... 21
2.8.4. Bayesian Network .......... 21
2.9. Research Gaps .......... 22
2.10. Conceptual Framework .......... 22
Chapter 3: Research Methodology .......... 23
3.1. Introduction .......... 23
3.2. Variables and Research Design .......... 23
3.2.1. Variables .......... 23
3.2.2. Research Design .......... 23
3.3. Population and Sampling .......... 24
3.3.1. Target Population .......... 24
3.3.2. Sampling .......... 24
3.4. Data Collection Methods and Analysis .......... 25
3.4.1. Data Collection Methods .......... 25
3.4.2. Data Analysis .......... 26
3.5. Research Quality and Reliability .......... 26
3.5.1. Research Quality .......... 26
3.5.2. Reliability .......... 26
3.6. System Development Methodology .......... 27
3.7. Utilization and Dissemination of Research Results .......... 27
3.8. Ethical Considerations/Issues .......... 28
Chapter 4: System Design and Architecture .......... 29
4.1. Introduction .......... 29
4.2.1. Functional Requirements .......... 29
4.2.2. Non-functional Requirements .......... 30
4.3. System Architecture .......... 30
4.4. System Design .......... 31
4.4.1. Use case Model .......... 32
4.4.1.1. Use case diagram and descriptions .......... 32
4.4.2. System Sequence Diagram .......... 36
4.4.3. Entity Relation Diagram .......... 36
4.4.4. Class Diagram .......... 38
4.5. Wireframes .......... 38
4.5.1. User Login .......... 38
4.5.2. Record and Predict Sound .......... 39
4.5.3. Sound Classification Results .......... 40
Chapter 5: Model Implementation and Testing .......... 41
5.1. Introduction .......... 41
5.2. Development Environment and Language .......... 41
5.2.1. Software Requirements and Hardware Requirements .......... 41
5.3. Model Components .......... 42
5.3.1. Input Layer .......... 43
5.3.2. Hidden Layer .......... 43
5.3.3. Output Layer .......... 43
5.4. Model Development .......... 43
5.4.1. Sound Data Collection .......... 43
5.4.2. Import of necessary libraries .......... 44
5.4.3. Audio Extraction .......... 45
5.4.4. Filtering the Metadata file and the audio files .......... 48
5.4.5. Preprocessing Audio Files .......... 50
5.5. Model Training .......... 51
5.5.1. Training the model from scratch .......... 52
5.6. Android Mobile Application Development .......... 55
5.6.1. Authentication .......... 55
5.6.2. Main Activity .......... 56
5.6.3. HomeViewModel Class .......... 57
5.6.4. Recording View Model .......... 58
5.7. Model Testing .......... 58
Chapter 6: Discussion of Results .......... 60
6.1. Introduction .......... 60
6.2. Results of the study .......... 60
6.3. System Validation .......... 61
6.4. System Evaluation .......... 61
6.5. Accomplishment of the objectives .......... 61
6.6. Research Limitations .......... 62
Chapter 7: Conclusion, Recommendations, and Future Works .......... 64
7.1. Conclusion .......... 64
7.2. Recommendations .......... 64
7.3. Future Works .......... 65
References .......... 67
Appendices .......... 74
Appendix A: Similarity Report .......... 74
Appendix B: Ethical Clearance Confirmation .......... 75
Appendix C: Urban 8K Dataset License .......... 76

List of Figures

Figure 2.1: Sound Event Detection Processing (Miyazaki et al., 2019) .......... 11
Figure 2.2: Audio Feature Extraction (Hershey et al., 2017) .......... 13
Figure 2.3: The Structure of Audio Classification System (Jasim et al., 2022) .......... 14
Figure 2.4: Graphical Representation of a Gaussian Mixture Model (Carrasco, 2020) .......... 17
Figure 2.5: Assistive Technology Architecture (Mielke et al., 2013) .......... 19
Figure 2.6: Smart 311 Noise Sound Classification Architecture (Tariq et al., 2018) .......... 20
Figure 2.7: Conceptual Framework .......... 22
Figure 3.1: Agile Development Cycle (Concas et al., 2008) .......... 27
Figure 4.1: System Architecture .......... 31
Figure 4.2: Use Case Diagram .......... 32
Figure 4.3: System Sequence Diagram .......... 36
Figure 4.4: Entity Relationship Diagram .......... 37
Figure 4.5: Class Diagram .......... 38
Figure 4.6: User Login Page .......... 39
Figure 4.7: Record and Predict Sound Wireframe .......... 39
Figure 4.8: Sound Classification .......... 40
Figure 5.1: Importing Libraries .......... 45
Figure 5.2: Spectrograms Transform .......... 46
Figure 5.3: Mel Spectrograms .......... 47
Figure 5.4: Mel-Frequency Cepstral Coefficients (MFCC) .......... 48
Figure 5.5: Filtering the Metadata .......... 49
Figure 5.6: Dataset Samples .......... 49
Figure 5.7: Filtering Audio Files .......... 50
Figure 5.8: Training and Testing Dataset .......... 51
Figure 5.9: PyTorch DataLoader .......... 52
Figure 5.10: Building convolutional and linear neural network layers .......... 53
Figure 5.11: Urban8KNet Model .......... 53
Figure 5.12: Augmentation Class .......... 54
Figure 5.13: Pretrained Model .......... 55
Figure 5.14: Authentication Class .......... 56
Figure 5.15: Main Activity Class .......... 57
Figure 5.16: HomeViewModel Class .......... 57
Figure 5.17: Recording View Model Class .......... 58

List of Tables

Table 4.1: Use case description of Record Sound .......... 33
Table 4.2: Use case description for sound preprocessing .......... 34
Table 4.3: Use case description of Classify Sound .......... 35
Table 4.4: Use case description of Display Sound Classification (predicted) Results .......... 35
Table 5.1: Software and Hardware Requirements .......... 41
Table 5.2: Test Case Results .......... 59

Abbreviations/Acronyms

AT - Assistive Technology
ATD - Assistive Technology Device
ALDs - Assistive Listening Devices
AAC - Augmentative and Alternative Communication devices
CNN - Convolutional Neural Networks
DHH - Deaf and Hard-of-Hearing
ESR - Environmental Sound Recognition
FM - Frequency Modulation
HLAA - Hearing Loss Association of America
KNN - K-Nearest Neighbor
RNN - Recurrent Neural Networks
SED - Sound Event Detection
SOM - Self-Organizing Maps
T-Coil - Telecoil
WHO - World Health Organization

Definition of Terms

Assistive Technology Device: Any device, tool, software, or system that helps to enhance, preserve, or improve the functional abilities of individuals with hearing disabilities.

Cochlear implants: A small, advanced electronic device that aids individuals who are either completely deaf or have severe hearing loss in perceiving sound.

Deafness: The condition in which someone has trouble understanding speech even when sound is amplified.

Environmental Sound Recognition: The processing of environmental sounds (ES), such as alarms, in order to recognize when a device is not functioning correctly, locate an event in space, monitor a change in status, or communicate an emotional or physical condition.

Hard of hearing/hearing loss: A diminished capacity to hear sounds in the way that other individuals can.

Hearing aids: An electronic device small enough to be worn in or behind the ear that helps individuals with hearing loss participate more fully in daily activities and conversations by amplifying sounds. This can improve their hearing ability in both quiet and noisy surroundings.

Profound Hearing Loss: Complete deafness. A profoundly deaf person is utterly unable to hear.

Sound Event Detection: The process of identifying sound events in a recording and assigning them temporal start and end times.

Telecoil: A tiny copper wire coiled discreetly inside hearing aids that can detect electromagnetic signals from various sources and can be readily activated by pressing a button.

Acknowledgements

This thesis would not have been possible without the support of many people. First and foremost, I would like to express my gratitude to God for His goodness and for giving me the strength to undertake this research. I extend my most sincere gratitude to my supervisor, Dr. Victor Rop, for allowing me to undertake this work and for his continuous guidance and invaluable suggestions throughout the research process. I would also like to offer special thanks to Professor Ismail Ateya for his guidance and insightful contributions. I am sincerely grateful to all my family members and friends for their unwavering support and love. In particular, I am immensely thankful to my Mum and Dad for their unconditional love, encouragement, and support throughout my studies.
I also extend my heartfelt appreciation to Joseph Mwaniki, Jimmie Munyi, and David Mwangi for their continuous support and assistance throughout this project. May God bless you all.

Dedication

I dedicate this remarkable accomplishment to my parents, who have consistently served as the pillars of strength and support in my life. Their love has been the guiding force that has shaped my path. I am eternally grateful for the sacrifices, support, and unwavering confidence I have received, which have made this achievement possible. I also dedicate this thesis to my brother, Joseph Mwaniki Ngatia, for his unwavering support throughout this journey, as well as to all my dear deaf friends who have been a great source of inspiration.

Chapter 1: Introduction

1.1. Background of the study

According to the most recent United Nations report, the world population as of October 2022 is 7.98 billion (Worldometer, 2022). The World Health Organization (2021) states that more than 1.5 billion people worldwide live with hearing loss and that by 2050, an estimated 2.5 billion individuals may have some degree of hearing loss, with at least 700 million requiring hearing rehabilitation. The World Health Organization (2021) also reports that over 1 billion young adults are at risk of permanent, preventable hearing loss due to dangerous listening habits.

According to Felman (2018), deafness refers to the condition where individuals are unable to comprehend speech through hearing, even when sound is amplified. The condition is characterized by severe hearing loss, where individuals can hear either very little or nothing at all. Deaf individuals are unable to hear anything or only very little, and they often communicate through sign language (World Health Organization, 2021). Hearing loss is categorized as disabling if it exceeds 40 decibels in an adult's better ear and 30 decibels in a child's better ear. People with mild to severe hearing loss, referred to as hard of hearing, usually communicate through speech and can benefit from devices such as hearing aids and cochlear implants, as well as assistive technology like captioning, m-health, and loop systems (Garg et al., 2021; World Health Organization, 2021).

There are various ways of becoming deaf, such as being born with hearing loss (congenital hearing loss) or developing it later in life (acquired hearing loss). Better Health Channel (2017) states that noise is the most common cause of acquired hearing loss. Other causes of acquired hearing loss include accidents, genetic defects, life-altering experiences, and aging. Recently, the COVID-19 pandemic further exacerbated the difficulties faced by deaf and hard-of-hearing individuals as they struggled to adjust in a world designed for the hearing. This led to a lack of inclusiveness and affected their mental, physical, and social well-being (Garg et al., 2021). The inability to hear the sounds around you causes social, emotional, and behavioral issues.

According to numerous studies, there is a mismatch between the demand for and the need for assistive technologies, with supply falling short of need. Since so few people have access to assistive technologies, enhancing access to these devices remains a challenge. In situations where non-auditory cues are not available, providing information about sounds can be beneficial for individuals who are deaf or hard-of-hearing (Bragg et al., 2016).
This project proposes a machine learning tool based on convolutional neural networks. The model was trained to detect sound, extract sound attributes, and classify the sound in order to assist the deaf in distinguishing between various environmental sound types. A mobile app is used to display a pop-up notification showing the type of sound that has been identified.

1.2. Problem Statement

Sounds convey information about the world around us. When a deaf person is oblivious to the sounds around them, something terrible that could have been prevented may happen to them. When non-auditory cues are not present, it is crucial to alert deaf and hard-of-hearing individuals about sounds. Mielke and Bruck (2016) stated that there are commercially available devices for accessing environmental sounds, but they are primarily designed for indoor environments like homes or workplaces, and only support specific events like doorbells or telephones. Communicating in the dark or in dimly lit places is a huge problem for people with hearing difficulties (Kumar, 2019). This inability to detect any sounds in the surroundings leads to social, emotional, and behavioral problems.

In Kenya, the available assistive technologies are currently limited to assistive listening devices like hearing aids. However, these devices lack the important feature of a phone's display, which plays a crucial role in promoting inclusivity for individuals who are deaf or hard of hearing. Jain et al. (2015) state that while hearing aids and cochlear implants can improve a person's ability to recognize sounds, they typically do not enhance their capacity to determine the specific type of sound. This limitation can negatively impact their ability to utilize visual cues to understand the auditory information they receive. In addition, cultural factors play a significant role in shaping the perception and classification of sounds, even within the same geographical area, as between urban and rural environments. Urban areas tend to prioritize sounds related to transportation and industrial activities, while rural areas place emphasis on sounds associated with nature and wildlife. These contextual and environmental variations greatly impact how sounds are perceived and categorized.

1.3. Research Objectives

1.3.1. General Objective

The main objective of this study is to create a system for classifying sounds that could aid individuals who are deaf and hard of hearing in distinguishing between different types of sounds. The system functions by presenting a pop-up notification on a mobile application that displays the detected sound type, accompanied by a vibration option to provide tactile feedback.

1.3.2. Specific Objectives

i. To investigate challenges facing the deaf in identifying sounds.
ii. To review existing techniques and tools on assistive technologies for the deaf and hard-of-hearing.
iii. To design and develop a mobile application to classify different sounds and display pop-up notifications to deaf and hard-of-hearing users.
iv. To validate the developed model.

1.4. Research Questions

i. What are the challenges faced by the deaf and hard of hearing in identifying sounds?
ii. What are the existing techniques and tools on assistive technologies?
iii. How can we design and develop the mobile application to display notifications?
iv. How will the developed model be validated?
1.5. Justification

Assistive listening devices (ALDs) magnify the sound that a deaf person would like to hear so that they can participate in non-face-to-face communications. ALDs can give the deaf audio awareness of their environment and can be used in conjunction with a hearing aid or cochlear implant. Even though the deaf community has benefited greatly from ALDs over the past two decades, there are still certain gaps. One of the most important discoveries is text telephony, which enables deaf people to communicate with others via text messages. The majority of the equipment created specifically for the deaf consists of communication tools that enable interaction with hearing persons. One significant issue remains, however, especially in everyday interaction with the environment.

Think of a situation where something audible occurs in a public setting: only hearing persons can truly understand what is happening, and a deaf individual will not understand unless someone interprets for them. A deaf person may be hit by a car if they are walking and cannot hear the honking of a car coming up behind them at high speed. This project, therefore, attempted to raise awareness of such situations. A system that can differentiate between various sounds was developed, initially concentrating on particular categories of sounds. The outcome was reduced risk levels for the deaf as a result of their increased awareness of their surroundings and ability to foresee potential risks. Since the majority of the difficulties that deaf individuals encounter when engaging with their surroundings are connected, the study ultimately benefited people of all ages. Most importantly, the research concentrated on raising awareness for deaf people when interacting with their surroundings.

1.6. Scope

The system classified sounds and sent notifications: a pop-up notification indicating the type of sound captured was displayed on a mobile application, assisting deaf people in participating in various events. It did not, however, classify all sounds. The study concentrated on a small number of distinct sounds in order to categorize and predict each sound's category. The focus was on public settings, as there was currently no assistive technology to assist the deaf while in public. The model was trained on only five categories of sound, namely siren, street music, children playing, dog barking, and car horn. However, the system provided room for the future addition of other types of sound.

1.7. Limitations

The focus of this research was only on a few sounds; not every sound was considered. The system was to assist the deaf with sound classification only, not speech interpretation.

Chapter 2: Literature Review

2.1. Introduction

The use of non-technical solutions, commercial goods, and research endeavors are examples of sound awareness strategies. The variety of sound awareness techniques emphasizes how important sound awareness is. However, there has not been much research on how well a trainable sound detector works for individuals who are hard of hearing or deaf (Bragg et al., 2016).

2.2. Challenges Facing the Deaf and Hard-of-Hearing

Deaf individuals experience a lack of auditory input, which significantly impacts their perception of, and interaction with, the world.
This absence of sound affects various aspects of daily life, including communication, education, safety, and the understanding of social interactions. Without the ability to perceive sound, deaf individuals face challenges in understanding spoken language, hearing warning signals, and grasping the nuances and emotional cues that accompany sound. These limitations can make it difficult for them to understand and interpret sounds in their environment.

2.2.1. Overall understanding and experience of the world for deaf individuals

Deaf individuals have reduced access to environmental sounds, which has significant implications for their overall understanding and experience of the world. Ambient sounds, including traffic, nature, and other background noises, play a crucial role in providing context and information about the environment. For example, the sound of car horns can alert individuals to potential dangers on the road, and the chirping of birds can indicate the presence of wildlife in a serene park. Without the ability to perceive these sounds, deaf individuals may miss out on important cues and information that can enhance their understanding of their surroundings. This can lead to potential safety hazards, as they may not hear emergency sirens or approaching vehicles. Additionally, the absence of environmental sounds can impact their overall sensory experience and appreciation of various environments. The soothing sound of rainfall, the rustling of leaves in the wind, or the crashing of waves at the beach all contribute to a richness of sensory experience that deaf individuals may not fully access.

2.2.2. Implications for safety and awareness of potential dangers in the environment

Deaf individuals have limited access to auditory cues, which can have significant implications for their safety and awareness of potential dangers in their environment. Sound serves as a crucial source of information, alerting individuals to various situations and events. For example, alarms and sirens provide warnings in emergency situations, doorbells signal the arrival of guests or deliveries, and honking horns indicate potential hazards on the road. Without the ability to perceive these auditory cues, deaf individuals may not be alerted to these important signals, potentially compromising their safety and ability to respond appropriately. This can result in situations where they are unaware of emergencies, miss important notifications, or fail to recognize hazardous situations. The absence of auditory cues can also impact their independence and everyday routines, as they may rely on others to notify them or adapt their living spaces to include visual or tactile alternatives. This highlights the need for alternative communication methods and assistive technologies that ensure deaf individuals can access and interpret important auditory cues for their safety and well-being.

Jain et al. (2019) discovered that participants relied on conventional methods to recognize sounds in their homes. They would ask for assistance from others, move around the house to locate the source of any audible sounds they could not recognize, and use dogs as guides. Some participants preferred visual or vibrational alternatives over auditory devices, such as doorbells that flash or vibrate the bed, a vibratory alarm clock, and a wall-mounted light that indicates the ambient sound level. Participants mentioned voices as adequate adaptations for sounds they did not possess, as well as sounds of activity.
On the other hand, mechanical sounds, outdoor sounds, and animal sounds were some of the sounds that several participants had no means of coping with.

2.3. Empirical Literature

2.3.1. Assistive Technologies Globally

Bhutkar et al. (2020) proposed a prototype alert device for hard-of-hearing users. When developing the prototype, they collected nine sound datasets and took the home environment of hearing-impaired users into consideration. The deaf and hard of hearing would benefit greatly from this device's ability to detect both commonplace sounds and some extremely important non-speech sounds, such as a door closing, a fire alarm, an intruder alert, and movement detection, all of which are required for home safety and security. A few features in the prototype design can help people with mild to severe impairments in their home office settings. It has the ability to recognize different noises and self-train sounds. The proposed Alert Device was consequently developed largely with the intention of helping hearing-impaired individuals recognize sounds only in the home environment.

Bragg et al. (2016) conducted a web-based survey with 87 deaf and hard-of-hearing people to find out their preferences for sound awareness as well as the sounds they believe they most urgently need to be made aware of. The survey revealed that the most requested sounds included emergency alarms, appliance information, door knocks, and doorbells. A prototype of a personalizable mobile sound detector app was created as part of the project, and participants in an alpha test were asked how they felt about the capabilities being investigated. They conducted a survey to learn what sounds deaf and hard-of-hearing people value, what methods they already use for sound awareness, and what design specifications they would like for their app. To build a model of those sounds, the application employed training examples of the user's recorded, personally meaningful sounds. Deaf and hard-of-hearing users were able to independently train the app. The incoming audio stream from the phone's microphone is then checked for those sounds, and the phone vibrates to alert the user when it hears one. However, they did not add any Environmental Sound Recognition features to the app; they conducted tests simulating real-time recognition through manually transmitted notifications to the user's device. In their research, the authors used a Gaussian Mixture Model-based approach, which classified only two sounds with limited accuracy and was unlikely to represent the varied use cases, sounds, and environmental noise in the daily life of DHH users.

Jain et al. (2015) conducted a study to examine the experiences and perceptions of deaf and hard-of-hearing (DHH) individuals regarding sounds in the home environment, gather their feedback on early domestic sound awareness systems, and identify any potential issues. The research was qualitative in nature and involved 12 DHH participants who shared their thoughts on how they perceive and manage sounds in their homes and provided feedback on early prototypes of sound awareness systems. The results of the study were based on these participants' experiences and insights. In light of this study, they developed three prototypes for tablet-based sound awareness systems, which they evaluated using a Wizard of Oz methodology with 10 DHH participants.
The results of the study indicated a widespread interest in sound awareness systems for smart homes, especially those that provide contextually aware, personalized, and easily digestible visual representations. However, significant concerns were raised regarding privacy, activity tracking, mental workload, and trust during the testing process.

Sicong et al. (2017) created a mobile app prototype that implements sound recognition through the use of deep learning models. A dataset with nine sound classes was used to validate high sound recognition accuracy. The proposed system boasts efficient performance in terms of sound recognition speed and battery usage. Although the sound recognition process takes place entirely on the mobile device, the classifier training is performed in the cloud due to the high computational demands of deep neural network training. Additionally, the authors put forth a preliminary solution for handling overlapping sounds through the use of unsupervised Non-negative Matrix Factorization (NMF); however, this solution is only applicable when multiple microphones are available.

Mielke and Bruck (2016) created a prototype for an environmental sound detector that runs on a smartwatch and was tested in a controlled environment. The design of the application was evaluated by deaf and hard-of-hearing users who were asked to use a simulated sound recognition feature. The study found that participants had preferences for the user interface, such as customizable vibrating patterns for sound detection notifications. However, it was difficult for participants who were deaf from birth to understand the concept of a sound, which made it challenging for them to comprehend what frequencies make up a unique sound.

Akbal (2020) proposed a three-stage process for classifying environmental sounds that includes feature generation, selection, and classification. The study used various techniques for feature extraction, including one-dimensional native binary models, one-dimensional quarterly models, and statistical characterization production methods. The main objective of the study was to introduce a new Environmental Sound Classification (ESC) approach based on highly precise static feature extraction. The ESC method utilized Environmental Component Exploration to select distinguishing features and employed a cubic support vector machine for classification. This research yielded a novel, highly accurate, and lightweight ESC technique.

Koh et al. (2019) investigated the use of Convolutional Neural Networks (CNN) for sound classification, specifically the classification of bird species based on their sounds. The study utilized the ResNet and Inception model architectures and preprocessed the data using the mel-scale log-amplitude spectrogram approach. The study results were obtained after several iterations and showed that the validation set accuracy was improved before adding Gaussian noise. The authors concluded that CNN is the most accurate method for bird sound classification, though the precision of the findings may be limited by the quality of the sound recordings.

Nanni et al. (2020) created a set of classifiers for animal audio datasets that produced comparable results through the use of taxonomy and varied parameter settings. They experimented with multiple fine-tuned Convolutional Neural Networks (CNNs) that were trained for various audio classification tasks, and evaluated, compared, and combined six different CNNs.
The study also tested a CNN trained from scratch and combined it with an already high-performing CNN. The results showed that multiple correctly tuned CNNs can be linked for efficient and dependable audio classification. Lastly, they improved the ensemble performance of the CNNs by mixing custom textures derived from spectrograms.

2.3.2. Assistive Technologies in Kenya

In the realm of assistive technologies for the deaf in Kenya, Sign-iO, an innovative wearable technology, emerged as a promising solution. Developed by Kenyan engineer Roy Allela, Sign-iO aimed to bridge the communication gap between sign language users and those unfamiliar with sign language (Otieno, 2020). This wearable technology consisted of a pair of smart gloves wirelessly connected to a mobile application via Bluetooth. Through its intricate design, the Sign-iO system captured the intricate gestures of sign language performed by the user. The companion mobile application then utilized this data to convert the sign language gestures into spoken words in real time. This seamless conversion process facilitated effective communication between individuals fluent in sign language and those who lacked proficiency in it. However, Sign-iO's functionality primarily focused on capturing sign language gestures and converting them into spoken words; it lacked the ability to accurately distinguish and reproduce a wide range of sounds.

According to Femmehub (2022), an assistive technology called "Echonoma" was developed to facilitate communication between the hearing community and individuals with hearing impairment. The innovation aimed to break the communication barrier between these two groups and ensure access and communication within their immediate environment. While "Echonoma" focused on promoting confidentiality and inclusivity, it lacked an option to assist the deaf in distinguishing between various sounds, which could have affected their overall auditory experience. The sound classification and display tool developed in this study, incorporating machine learning algorithms and customizable user interfaces, is intended to significantly improve the accuracy of sound classification, enhance user satisfaction, and demonstrate superior usability compared to existing assistive technologies for individuals with hearing impairments.

2.4. Theoretical Framework

2.4.1. Computational Theory

Computational theory deals with the design and analysis of algorithms and systems that perform specific computational tasks, and is concerned with the development of algorithms and systems that can automatically detect and categorize sounds. The theory provides the framework and techniques for designing and implementing sound event detection systems. These systems can use a combination of signal processing techniques, machine learning algorithms, and knowledge from other related fields, such as psychoacoustics and cognitive theory, to perform sound event detection.

2.4.1.1. Sound Event Detection (SED)

Sound event detection refers to the task of automatically detecting specific sounds, such as speech, music, or environmental sounds, in an audio signal. This involves analyzing the audio signal and recognizing patterns that correspond to specific sound events. SED aims to identify specific sound events and determine their start and end times, not just assign a label to each sound event (Miyazaki et al., 2019). Figure 2.1 depicts a high-level overview of SED processing.
Figure 2.1: Sound Event Detection Processing (Miyazaki et al., 2019)

It is commonly assumed that the observed sound signal can contain many sound events, that multiple occurrences of the same sound event are possible, and that multiple sound events frequently overlap. For Sound Event Detection (SED), separating out individual sound events, not just classifying them, is a crucial aspect. A standard approach for SED is to utilize multiple classifiers for supervised learning, using mixed sound signals with time-stamped labels for the separate sound events as training data. SVM and random forests are two simple classifier systems that have been proposed (Phan et al., 2015). A system for detecting target sound events based on an exemplar-based approach and NMF has also been proposed (Bisot et al., 2016; Komatsu et al., 2017).
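To make concrete the difference between clip-level classification and SED, the following is a minimal sketch of the post-processing step that converts framewise detector outputs into time-stamped events. The probability values, threshold, and frame hop below are hypothetical placeholders rather than figures from any of the systems cited above, and the detector producing the probabilities is assumed to exist.

import numpy as np

def extract_events(frame_probs, threshold=0.5, hop_seconds=0.02):
    """Convert per-frame probabilities for one sound class into
    (onset, offset) pairs expressed in seconds."""
    active = frame_probs >= threshold              # binarize each frame
    events = []
    onset = None
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i                              # an event starts here
        elif not is_active and onset is not None:  # the event just ended
            events.append((round(onset * hop_seconds, 2),
                           round(i * hop_seconds, 2)))
            onset = None
    if onset is not None:                          # event runs to the end
        events.append((round(onset * hop_seconds, 2),
                       round(len(active) * hop_seconds, 2)))
    return events

# Hypothetical framewise output of a "car horn" detector over ten frames.
probs = np.array([0.1, 0.2, 0.8, 0.9, 0.85, 0.3, 0.1, 0.7, 0.75, 0.2])
print(extract_events(probs))  # [(0.04, 0.1), (0.14, 0.18)]

In practice, SED systems usually smooth the framewise probabilities (for example with a median filter) before thresholding, so that brief dips or spikes do not fragment an event into spurious onsets and offsets.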
2.4.2. Cognitive Theory

Veenstra (2010) states that cognitive theory can be used to develop a system that mimics the way the human auditory system processes sound. The system can be designed to recognize and categorize sounds based on their properties, such as pitch, frequency, and duration, much like how humans process auditory information. By understanding the cognitive mechanisms involved in sound perception, a sound classification system can be optimized to accurately identify and categorize different types of sounds. The theory is related to temporal-frequency attention in that it provides a framework for understanding how humans allocate their attention to different aspects of incoming sensory information. It helps to explain why and how humans can selectively attend to specific temporal and frequency aspects of sounds. Temporal-frequency attention is rooted in the idea that the perception of sound is based not only on its amplitude or loudness but also on its frequency content and the way these change over time (dsa2gamba & abbottds, n.d.).

In sound classification, the goal is to automatically categorize sounds into predefined categories based on their acoustic features. The traditional approach for sound classification is to use hand-engineered features, such as Mel-Frequency Cepstral Coefficients (MFCCs), that capture the spectral characteristics of the sound. However, this approach can be limited, as it does not capture the temporal dynamics of the sound, which can be crucial in differentiating between sound categories. Temporal-frequency attention addresses this limitation by allowing a model to learn to attend to different parts of an audio signal based on both its temporal and spectral characteristics. The steps involved in implementing temporal-frequency attention in sound classification are:

2.4.2.1. Pre-processing

The audio signal is first transformed into a spectrogram representation that captures both the temporal and spectral information.

2.4.2.2. Audio Feature Extraction

In research conducted by Jasim et al. (2022), different techniques of audio feature extraction were employed in order to classify sound. Features represent values that can be expressed numerically and quantified using the appropriate methodologies. A sound wave, for example, is made up of two components: sample rate and sample data. The sample rate and sample data can be transformed in a variety of ways in order to extract important, valuable features from them (Zhang, 2021). The accuracy of the system is determined by its features and classification techniques. Extraction of effective features is a critical step in developing a reliable classification system's front-end module. The sound signal of one class may, however, change over time, and this change may occur in any of the sound variables, such as amplitude or frequency. Each type of sound has distinguishing characteristics that set it apart from the others (Jasim et al., 2022). There are different methods for extracting features from sound files. Some concentrate on extracting features from the frequency space, while others concentrate on the time space.

2.4.2.2.1. The Zero Crossing Rate (ZCR)

The Zero Crossing Rate (ZCR) is a measure of how rapidly a signal alternates from positive to negative or vice versa. This feature is commonly utilized in speech recognition and music processing systems. It is particularly effective in detecting percussive sounds, such as those produced by minerals and rocks, where the ZCR has a high value (Giannakopoulos & Pikrakis, 2014).

2.4.2.2.2. Linear Predictive Coding (LPC)

In audio and speech processing, LPC (Linear Predictive Coding) is a method used to describe the spectral envelope of a speech signal in a compressed form through a linear predictive model (Dave, 2013).

2.4.2.2.3. Perceptual Linear Prediction (PLP)

PLP extracts features from audio data, which are then used to describe it. The definition of PLP involves an estimation of three phenomena related to perception: critical band resolution curves, equal loudness curves, and the power law relationship between intensity and loudness (Hershey et al., 2017). LPC and PLP are frequently used in feature extraction algorithms in the disciplines of voice recognition and speaker verification (Grama & Rusu, 2017).

Figure 2.2 depicts an audio wave file representing a sound event that has been transformed into a spectrogram image and is being processed by a CNN. The image features are used to classify various environmental sounds and occurrences such as car horn, dog barking, drill, etc.

Figure 2.2: Audio Feature Extraction (Hershey et al., 2017)
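As a concrete illustration of this feature-extraction step, the sketch below converts a waveform into a log-amplitude mel spectrogram (the image-like input a CNN consumes) and into MFCCs. It uses torchaudio; the library choice, the file name, and all parameter values are assumptions made for illustration, and Chapter 5 (Figures 5.2-5.4) shows the preprocessing actually used in this study.

import torchaudio
import torchaudio.transforms as T

# Hypothetical audio file; torchaudio returns the waveform and sample rate.
waveform, sample_rate = torchaudio.load("dog_bark.wav")

# Mel spectrogram: a time-frequency representation on a perceptual (mel) scale.
mel_transform = T.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64
)
mel_spec = T.AmplitudeToDB()(mel_transform(waveform))  # log-amplitude "image"

# MFCCs: a compact summary of the spectral envelope of each frame.
mfcc_transform = T.MFCC(
    sample_rate=sample_rate,
    n_mfcc=40,
    melkwargs={"n_fft": 1024, "hop_length": 512, "n_mels": 64},
)
mfcc = mfcc_transform(waveform)

print(mel_spec.shape)  # (channels, n_mels, frames) -- the CNN's input
print(mfcc.shape)      # (channels, n_mfcc, frames)

The mel spectrogram retains the full time-frequency structure, which is why it is the usual input to CNN-based classifiers, while MFCCs compress each frame into a few dozen coefficients.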
Mu et al. (2021) state that the importance of temporal-frequency attention in sound classification lies in its ability to improve the performance of the model by allowing it to focus on the most relevant parts of the audio signal. This is particularly important in cases where the audio signal is cluttered with background noise or where there is significant variability in the temporal and spectral characteristics of the sound within the same category. By allowing the model to attend to different parts of the signal based on both its temporal and spectral characteristics, temporal-frequency attention can significantly improve the performance of sound classification systems.

2.4.3. Psychoacoustics Theory
Psychoacoustics is the scientific study of human perception of sound. It provides a theoretical framework and practical insights into the way in which the human auditory system processes sound, which can be used to guide the design of sound classification systems (Psychoacoustics | ScienceDirect Topics). The University of Salford states that the goal of psychoacoustics in sound classification is to understand the properties of sounds that are relevant for human perception and categorization, and to use this knowledge to design algorithms that can accurately mimic human perception. This is achieved by studying the physiological and neural responses to sounds, as well as the psychological processes involved in sound perception and categorization.

2.5. Models
2.5.1. Hidden Markov Model (HMM)
The HMM is a statistical model utilized in machine learning that explains the relationship between the evolution of observable occurrences and underlying, indirectly observable factors. Instead of determining the step-by-step conditions of a random process, it models the probabilistic characteristics of the process using probability distributions. HMMs are probabilistic models used in a variety of applications, including speech recognition, speech synthesis, and sound classification. In the context of sound classification, an HMM can be used to model the probability distribution of different sounds or audio classes, for example categorizing audio samples into speech, music, or environmental sound. It consists of two components: (1) a set of hidden states that represent the underlying sound class, and (2) a set of observation symbols that represent the audio features extracted from a sound clip. The HMM also defines the transition probabilities between hidden states and the observation probabilities given a hidden state. During the training phase, the parameters of the HMM are estimated from a large labeled dataset of sound clips, where each sound clip is associated with a sound class. In the testing stage, an HMM is fed a new audio clip and identifies the sound class with the highest likelihood based on the observed symbols. This is accomplished using the Viterbi algorithm, which computes the maximum-likelihood path through the hidden states and observation symbols. HMMs are important for sound classification because they are able to model the temporal dependencies between audio features, which is crucial for capturing the dynamics of different sounds. In addition, HMMs are flexible and can be used to model a wide variety of audio classes, including speech, music, and environmental sounds. Additionally, the hidden states in an HMM can be used to model different levels of abstraction, such as phonemes, words, and sentences in speech recognition, making it a powerful tool for many different sound classification tasks.
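As a minimal sketch of the Viterbi decoding step described above, the NumPy function below recovers the most likely hidden-state path from a sequence of observed symbol indices. The parameters at the bottom are illustrative values for a two-state, three-symbol HMM, not figures from any cited system.

import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for observed symbol indices obs.
    pi: initial state probabilities; A[i, j]: transition probability from
    state i to state j; B[i, k]: probability of emitting symbol k in state i."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))           # best path probability per state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # (from-state, to-state) scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):             # backtrack along the pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Illustrative two-state HMM with three observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))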
2.5.2. Gaussian Mixture Model (GMM)
Carrasco (2020) describes a Gaussian mixture as a combination of several Gaussian functions, each identified by k ranging from 1 to K, where K represents the number of clusters in the dataset. Each Gaussian, denoted by k, consists of:
i. A mean, μ, that determines its center.
ii. A covariance, Σ, that specifies its width, which would be the equivalent of an ellipsoid's dimensions in a multivariate situation.
iii. A mixture probability, π, that determines the size of the Gaussian function.
The GMM is trained on a labeled dataset of sound clips, where each sound clip is associated with a sound class. During training, the parameters of the GMM, such as the mean, covariance, and mixing coefficients, are estimated for each sound class. Once the GMM is trained, it can be used to determine the most likely sound class for a new sound clip by computing the likelihood of the audio features given each sound class and selecting the class with the highest likelihood. The importance of the GMM in sound classification is that it is a flexible and powerful model that can capture the underlying distributions of different sound classes. It can handle complex distributions that cannot be modeled by a single Gaussian distribution and is able to model multi-modal distributions, which are common in many sound classes. Additionally, GMMs can be used in conjunction with other models, such as Hidden Markov Models (HMMs), to create more sophisticated sound classification systems. Figure 2.4 shows a graphical representation of a Gaussian Mixture Model with three Gaussian functions, hence K = 3. Each Gaussian explains the data in one of the three available clusters. The curves are plotted on a graph with the x-axis being the data values and the y-axis being the probability density function (pdf) of the Gaussian distribution.

Figure 2.4: Graphical Representation of a Gaussian Mixture Model (Carrasco, 2020)
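The per-class training and highest-likelihood selection described above can be sketched as follows, assuming scikit-learn; the feature arrays are hypothetical stand-ins for real per-frame audio features, not data from this study.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = {  # hypothetical training features, one array per sound class
    "car_horn": rng.normal(0.0, 1.0, size=(200, 13)),
    "dog_bark": rng.normal(3.0, 1.0, size=(200, 13)),
}

# One K=3 mixture per class; each learns means, covariances, and weights.
gmms = {c: GaussianMixture(n_components=3, random_state=0).fit(X)
        for c, X in features.items()}

new_clip = rng.normal(2.8, 1.0, size=(40, 13))  # frames from a new clip
# Pick the class whose mixture assigns the highest average log-likelihood.
best = max(gmms, key=lambda c: gmms[c].score(new_clip))
print(best)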
2.6. Frameworks
There are different types of frameworks that are used for deep learning, including Deeplearning4j, Caffe, Theano, PyTorch, Keras, and TensorFlow. However, TensorFlow, Keras, and PyTorch are three frameworks that have gained popularity in recent years because of their usability, widespread use in academic research and commercial code, and extensibility.

2.6.1. TensorFlow
According to Madhavan et al. (2021), "tensor" is a term used to describe the multi-dimensional arrays used in mathematical models for neural networks in the context of machine learning. A tensor is a generalization of a vector or matrix to higher dimensions. The TensorFlow framework can be run on various platforms and operating systems, including CPUs, desktops, and mobile devices, and it can be deployed both locally and in the cloud. It is considered to offer better support for distributed processing, as well as improved flexibility and performance for commercial use. Python is the main programming language used with TensorFlow. Although there are no stability guarantees for other languages, such as C++, Java, and Go, there are third-party bindings available for many languages, including C#, Haskell, Julia, Rust, Ruby, Scala, R, and PHP. For executing TensorFlow applications on Android, Google has developed the mobile-optimized TensorFlow Lite library.

2.6.2. Keras
According to Madhavan et al. (2021), Keras is a Python deep learning library that distinguishes itself from other deep learning frameworks. Keras serves as a high-level application programming interface (API) for constructing neural networks and offers a means of enhancing the capabilities of the deep learning framework backends that it employs. In version 2.4.0, Keras stopped supporting multiple backends and now focuses only on TensorFlow. Essentially, it is part of TensorFlow, with the Keras API for TensorFlow being implemented in the tf.keras submodule or package.

2.6.3. Caffe
According to Madhavan et al. (2021), Caffe is a deep learning platform that offers support for a diverse range of architectures, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. However, it does not provide compatibility with Restricted Boltzmann Machines (RBMs) or Deep Boltzmann Machines (DBMs). Caffe takes advantage of GPU acceleration using the NVIDIA CUDA Deep Neural Network library and has been utilized for image classification and other visual tasks. To facilitate parallel processing, Caffe supports Open Multi-Processing (OpenMP). In order to optimize performance, Caffe and Caffe2 are coded in C++ and offer deep learning training and implementation options through Python and MATLAB interfaces.

2.6.4. Deeplearning4j
Madhavan et al. (2021) describe Deeplearning4j as a widely recognized deep learning framework that utilizes Java technology. However, it also provides APIs for other programming languages such as Python, Scala, and Clojure. This framework, which is licensed under Apache, is equipped to handle Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). Furthermore, it features distributed parallel variants that are tailored for compatibility with big data processing platforms such as Apache Hadoop and Spark.

2.7. Architectural Designs
2.7.1. Assistive Technology Architecture
A smartphone is the key component in the design. With the increasing popularity of smartphones with powerful processors, they have become a vital part of the market. In order to effectively differentiate between various sounds, the classifier needs to be highly adaptable and flexible. To implement real-time pattern recognition algorithms, a processing device with sufficient computing power is required, which is present in a smartphone. Its ability to connect to the internet can be utilized to access an online service containing the training data for the classifiers. If the system fails to detect a sound, or the user feels that an event should have been recognized but wasn't, the sound attributes can be uploaded to the database either automatically or manually. This allows other users to train their devices for improved recognition. The content of the training sample and the location where it was recorded can be identified through the use of tags. An architecture diagram is shown in Figure 2.5. A microphone or microphone array is used to record sound, which is then processed by the smartphone. When a sound of interest is identified, the system alerts the user and provides them with the option to transmit the acoustic imprint to a central server.

Figure 2.5: Assistive Technology Architecture (Mielke et al., 2013)
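The smartphone-centered design above implies on-device inference. A minimal sketch of packaging a trained Keras model for such deployment with the TensorFlow Lite tooling mentioned in Section 2.6.1 follows; the tiny stand-in model is a hypothetical placeholder, not the actual network of any cited system.

import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in model; in practice this would be a trained spectrogram CNN.
model = models.Sequential([
    layers.Input(shape=(64, 128, 1)),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),
])

# Convert the Keras model to the TensorFlow Lite format used on Android.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional optimization
tflite_model = converter.convert()

with open("sound_classifier.tflite", "wb") as f:
    f.write(tflite_model)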
2.7.2. Smart 311 Architecture
Tariq et al. (2018) conducted classification using machine learning algorithms for noise detection, using both shallow learning and deep learning models. Figure 2.6 depicts the architecture of the Smart 311 system. The Smart 311 system is capable of operating in smart city environments such as indoor spaces, shopping malls, and even public streets. A mobile application plays a crucial role in the smart city environment by recognizing sounds that are categorized as noise, such as air conditioner noise, gunshots, dog barking, and jackhammer noise. The mobile application can transfer the sounds it detects in a smart city environment to a server via the client. After extracting the features from the input audio data, the machine learning component identifies a particular type of sound. When the sound classification system identifies any of the categories mentioned above, it sends a 311 request to the server based on the severity of the incident.

Figure 2.6: Smart 311 Noise Sound Classification Architecture (Tariq et al., 2018)

2.8. Algorithms
2.8.1. Decision Trees
In this type of supervised machine learning, the training data is continually divided based on a particular criterion, and both the input and corresponding output are provided. The tree consists of decision nodes and leaves, which are utilized to describe the connection between inputs and outputs. Decision trees can be used to compare different classifiers. When tested against an ensemble of random decision forests, decision trees outperformed them in terms of classification speed but not accuracy. By using a series of decision forest iterations, the ensemble seeks to compensate for the method's lower accuracy.

2.8.2. K-Nearest Neighbor (KNN)
KNN algorithms display three traits that set them apart from other learning algorithms and lead to performance advancement over time: they delay processing their instances until an information request is received; they simply save their instances in storage for later use; and they respond to information requests by combining their stored training instances with the query data, discarding any intermediary results. With this algorithm, the class produced is the class of the stored instance that is most similar to the tested example (Harrison, 2019).

2.8.3. Naïve Bayes
This algorithm is designed to solve binary (two-class) and multi-class classification problems. It has proven to be not only simple, but also quick, accurate, and dependable, and it works particularly well with natural language processing (NLP) problems (Gaurav, 2018). It can be used to categorize an object by independently mapping each characteristic to the classifier. The algorithm determines the membership probabilities for each class, that is, the probability that a particular record belongs to a specific class; the class with the highest probability is considered the most probable one. In a research study conducted by Fanzeres et al. (2018), it was seen that despite having the lowest accuracy among the compared methods, naïve Bayes classification turned out to be a good solution for their mobile sound application for the DHH, with an average accuracy of 89%. The training phase is the part of its processing with the greatest scope for accuracy improvement. In addition, compared to decision trees and neural networks, naïve Bayes training was far faster.

2.8.4. Bayesian Network
Probabilistic graphical models known as Bayesian networks utilize Bayesian inference to calculate probabilities. They are represented as directed graphs with edges indicating conditional dependencies, and they aim to model the relationships and causality among variables. Through these connections, one can effectively employ factors to draw conclusions about the graph's random variables (Soni, 2019). They are particularly adept at studying a previously occurring event and determining the likelihood that any of a number of known causes contributed to it.
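The classifiers discussed in this section can be compared side by side, as in the minimal scikit-learn sketch below; the feature matrix X (one row of audio features per clip) and labels y are hypothetical random stand-ins, so the printed scores are meaningless placeholders rather than results from this study.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))    # hypothetical per-clip audio features
y = rng.integers(0, 5, size=500)  # hypothetical labels for five classes

for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")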
2.9. Research Gaps
According to Mielke & Bruck (2016), their smart watch focused only on an office setting and did not collect sound data from other locations. The smart watch did not display a pop-up notification, making it difficult for deaf users to understand the captured sound. Bhutkar et al. (2020) created a prototype alert device that was focused solely on the home environment. The data used in their study was only existing sound data, and their prototype lacked any pop-up notification to alert deaf users. Bragg et al. (2016) created a mobile sound detector app to help deaf and hard of hearing people. However, their prototype lacked an environmental sound recognition function, and deaf users had to be notified manually.

2.10. Conceptual Framework
Figure 2.7 shows a conceptual framework of the solution. The mobile phone's microphone was used to detect sound. The obtained data sets were then used to train and test a machine learning model. After that, the model classified the sounds, such as car honking or sirens, and a notification was automatically displayed on the mobile application.

Figure 2.7: Conceptual Framework

Chapter 3: Research Methodology

3.1. Introduction
Methodology and techniques are two closely related and interdependent words that are frequently used interchangeably. Neuman (2014) defines methodology as the large structure that houses methods. Cohen et al. (2000) state that methodology refers to a methodical approach to data collection from a particular population in order to comprehend a phenomenon and generalize knowledge obtained from the target population. According to Jansen (2020), research methodology pertains to the practical implementation of a research project. It encompasses the systematic planning of a study by the researcher to ensure reliable and valid results that effectively address the research's aims and objectives. It primarily focuses on what data should be collected, from whom it should be collected (sample design), and the methods for data collection and analysis.

3.2. Variables and Research Design
3.2.1 Variables
3.2.1.1 Independent Variable
Implementation of the sound classification and display tool.
3.2.1.2 Dependent Variables
Measurements that are influenced by the independent variable. These dependent variables are:
i. Accuracy: measures the tool's ability to accurately classify and categorize different types of sounds.
ii. Processing speed: quantifies the time taken by the tool to process and classify incoming sounds.
iii. Effectiveness of the sound display: how well the sound classification and display tool presents the classified sounds to the user in a clear, understandable, and user-friendly manner.

3.2.2 Research Design
Experimental design involves conducting research in an objective and controlled manner to maximize precision and draw specific conclusions regarding a hypothesis statement. The main goal is to determine the impact of an independent variable on a dependent variable.
The objective of this research was to create a sound classification and display tool by building a model and developing a mobile app. An experimental design was employed to determine the study's methodology, data collection, and analysis procedures. An open dataset (the UrbanSound8K dataset), comprising various sounds carefully annotated with labels indicating sound type and characteristics, was collected. The experimental design encompassed several phases. Firstly, a machine learning-based sound classification model was developed. The collected sound samples underwent preprocessing to extract relevant audio features. These features were then utilized to train a machine learning model, employing a deep learning technique (a Convolutional Neural Network). The trained model underwent rigorous testing and validation to ascertain its accuracy and generalization capabilities. Concurrently, a user interface was designed and implemented to visually display the classified sounds using a text-based display. The effectiveness of the sound display was evaluated by testing the mobile app on the trained model with various sounds. The findings provided insights into the potential effectiveness and usability of the tool in real-world scenarios, highlighting its capacity to enhance the auditory experience of individuals with hearing impairments.

3.3. Population and Sampling
3.3.1. Target Population
The study focused on various types of sounds present in the environment. Data on the sound types found in the environment were obtained from the UrbanSound8K dataset, from which five categories were derived. Following that, the model was trained using the five categories to assist the deaf in differentiating them.

3.3.2. Sampling
The study employed both probability and non-probability sampling methods.
3.3.2.1. Cluster Sampling
The researcher sampled the total sound-type data into groups or clusters that reflected certain categories. Based on parameters such as sound class, clusters were identified and included in a sample. The UrbanSound8K dataset was used, which originally contained 10 classes of environmental sounds: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. However, the model was trained using only 5 classes, namely siren, street music, children playing, dog barking, and car horn. To filter the metadata file and audio files, which originally covered 10 classes, the researcher created a list of the 5 classes and used pandas to filter the main metadata file into a processed one that contained only the classes of interest. The class IDs were remapped so that they ranged from 0 to 4, one for each class. Once the processed metadata file contained data from the five required classes, the researcher did the same to the audio files, so that only audio files from the five classes of interest to this research remained. A sketch of this filtering step is shown below.
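The following is a minimal sketch of the filtering and remapping step just described, assuming pandas is installed and the UrbanSound8K metadata file is available locally as UrbanSound8K.csv with its standard "class" and "classID" columns; the output filename is a hypothetical choice.

import pandas as pd

keep = ["siren", "street_music", "children_playing", "dog_bark", "car_horn"]
meta = pd.read_csv("UrbanSound8K.csv")

# Keep only the rows belonging to the five classes of interest.
processed = meta[meta["class"].isin(keep)].copy()

# Remap the class IDs so they range from 0 to 4, one per retained class.
id_map = {name: i for i, name in enumerate(keep)}
processed["classID"] = processed["class"].map(id_map)

processed.to_csv("UrbanSound8K_processed.csv", index=False)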
3.3.2.2. Consecutive Sampling
Using this sampling technique, the researcher selected one sound category from a sample of sound data, examined the data, and moved on to the next sound category. By gathering data with crucial insights, this strategy enabled the researcher to work with different sound types and fine-tune the research.

3.4. Data Collection Methods and Analysis
3.4.1. Data Collection Methods
The sole purpose of research tools is to collect data from research subjects on a specific topic of interest. An ideal instrument is one that yields objective, accurate, sensitive, efficient, and relevant results. The following tools were used in this study.
3.4.1.1. Existing Sound-Data
The researcher utilized the UrbanSound8K dataset in the present investigation, which featured ten categories, namely air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. To streamline the analysis, the dataset was filtered to include only the five classes relevant to the research questions. Specifically, the metadata file UrbanSound8K.csv was used to provide classification information for each sound file and to select the audio files that belonged to the chosen five classes. These audio files were then used to train the model, allowing the researcher to accurately classify new audio files based on their sound characteristics.
3.4.1.2. Prototyping
A prototype was developed to facilitate the testing and refinement of ideas that could be conveyed to deaf users more effectively. The mobile application was tested in a test-bed environment, which allowed for comprehensive analysis of its functionality and effectiveness. This approach provided valuable insights into the application's strengths and weaknesses, allowing for necessary adjustments to be made.

3.4.2. Data Analysis
In this study, inferential analysis was employed to analyze the collected data and draw meaningful insights. By applying inferential analysis techniques to the data, insights were gained into the effectiveness and performance of the sound classification and display tool. The analysis focused on evaluating the accuracy of the tool in classifying sounds based on the data from the UrbanSound8K dataset. One aspect of the analysis involved evaluating the effectiveness of existing models used in sound classification and display. By conducting inferential analysis on the performance and outcomes of these models, insights were gained into their strengths and limitations. This information played a crucial role in shaping the design of the sound classification and display tool to effectively address the research problem.

3.5. Research Quality and Reliability
3.5.1. Research Quality
The chosen research methodology can greatly impact the quality and success of a research project (Thattamparambil, 2020). To ensure the collection of relevant data and the use of the most appropriate data analysis method, the researcher carefully selected an appropriate research methodology. Bouchrika (2022) states that effective research requires reviewing previous studies on the topic and generating new knowledge. By exploring the literature and other materials related to the topic, the researcher was able to gain a better understanding of prior research and how the current study fits into the field. In this study, the researcher reviewed previous related work, discovered the models and frameworks used in sound classification, and identified gaps in previous research.

3.5.2. Reliability
Reliability is defined as "the accuracy and precision of the measurement, as well as the absence of differences in the results if the research was repeated" (Collins & Hussey, 2014). To avoid any possible bias in the research findings, the researcher was mindful of their own position throughout the study. The goal was to eliminate or minimize any potential impact that could compromise the reliability of the results.
The researcher aimed to prevent confirmation bias by treating all data impartially, analyzing it sincerely, and resisting the temptation to falsify it.

3.6. System Development Methodology
Object-Oriented Analysis and Design (OOAD) is an organized process for conducting analysis, developing a system using object-oriented principles, and producing a number of graphical system models during the software development life cycle (Elgabry, 2021). The aim of the analysis phase is to build a model of the system regardless of implementation constraints such as the chosen technology. Typically, use cases and conceptual models are used to define the most crucial things in an abstract manner. The analytical model is then refined during the design process, which also applies the required technology and other implementation constraints. The Unified Modeling Language (UML) was used to represent the system's various views and functionalities. This approach was used within the agile methodology, which is iterative and incremental and is performed in a highly collaborative manner to produce high-quality software (Concas et al., 2008). Figure 3.1 shows the different cycles of an agile methodology.

Figure 3.1: Agile Development Cycle (Concas et al., 2008)

3.7. Utilization and Dissemination of Research Results
The results of this study aided individuals who are deaf or have difficulty hearing in identifying different sounds in their surroundings. The findings also helped future researchers who were interested in solving problems that can be addressed using sound classification to help the deaf and hard of hearing. These findings were disseminated through online publications.

3.8. Ethical Considerations/Issues
Some of the ethical considerations that were put in place during this research were:
i. Institutional approval was required to certify the study and the results obtained.
ii. The research was designed and executed in accordance with the strictest standards of excellence, integrity, moral propriety, and legality.

Chapter 4: System Design and Architecture

4.1. Introduction
System design and architecture are fundamental to software engineering as they define the foundation of a software system. System design involves identifying the components, interfaces, and data that make up a system, while architecture determines how these components and modules interact and are supported by the infrastructure. The success rate of a project is significantly influenced by how well the project requirements are defined; failure to properly gather and analyze requirements and manage resources can lead to project failure. To mitigate this risk, the use of computer-aided software engineering (CASE) tools has been proposed. The Unified Modeling Language (UML) is one such tool that can assist with system design and architecture.

4.2. Requirement Analysis
The UrbanSound8K dataset was downloaded from Kaggle and served as the primary source of audio data for the study. To facilitate the sound classification process, the audio data was first extracted from the dataset and converted into mel-spectrogram images. This conversion was necessary as it allowed the data to be easily visualized and analyzed using machine learning algorithms. Additionally, any irrelevant audio files were removed from the dataset to ensure that the resulting model was accurate and reliable.
The use of the UrbanSound8K dataset and mel-spectrogram extraction techniques proved to be an effective approach in developing a sound classification tool for assisting the deaf and hard-of-hearing. Chung & do Prado Leite (2009) highlighted that the functional and non-functional aspects together define a system's utility. They pointed out that system quality is essential and ought to be considered when creating high-quality software. The requirements and study goals for this research have been broken down into functional and non-functional requirements as shown below.

4.2.1. Functional Requirements
The functional requirements were developed based on the desired behaviors to be accomplished in the system. They encompass the system functions and capabilities that were found to be compatible with the study's goals, as highlighted below:
i. The system should accept sound input from a microphone or other audio sources.
ii. The system should process the audio input to extract relevant features.
iii. The system should analyze the input sound and classify it based on predefined categories of environmental sounds.
iv. The system should display the classification results on a user interface, such as a textual description.
v. The system should provide a user-friendly interface that is easy to navigate and use.
vi. The system should incorporate a vibration alert feature to notify individuals who are deaf or hard of hearing of incoming notifications.

4.2.2. Non-functional Requirements
Non-functional requirements are a set of criteria that describe the characteristics or qualities of a software system rather than its specific functional capabilities. These requirements focus on how the system operates rather than what it does. They do not relate to functionality, but to attributes such as reliability, efficiency, usability, maintainability, and portability. The following are the non-functional requirements:
i. Reliability: the system should be highly reliable, with accurate sound classification and minimal errors or false positives. Users should be able to depend on the tool to correctly identify the different types of sounds.
ii. Performance: the system should have good performance and speed, with minimal latency in sound classification. The tool should be able to handle multiple sounds simultaneously without slowing down or crashing.
iii. Security: the system should be secure, with appropriate measures in place to protect users' personal data and privacy.
iv. Usability: the system should be easy to use and understand, with a clear and intuitive user interface.
v. Compatibility: the system should be compatible with a wide range of devices and platforms, including different operating systems and screen sizes.
vi. Maintainability: the system should be easy to maintain and update. It should be designed with modularity and scalability in mind, to facilitate future updates and improvements.

4.3. System Architecture
The system architecture is a conceptual framework that encompasses the various views, structure, and behaviors of the system. It provides a description and representation of how the different system components operate and interact with one another. Essentially, the system architecture captures how the system functions as a whole by coordinating its components and subsystems to achieve its intended purpose.
The components used in the system are users, a sound recording module, a sound processing module, a sound classification module, a user interface module, a database module, and a system communication module. The sound recording module will be responsible for capturing the sound that needs to be classified and displayed. To improve the accuracy of sound classification, a sound processing module will be included to filter and pre-process the recorded sound by removing noise or enhancing certain frequencies. The sound classification module will then analyze the pre-processed sound and classify it based on its type, including dog barking, car horns, and sirens. A user interface module will provide a graphical user interface for the user to interact with the system and view the sound classification results. To store the sound classification results for future reference or analysis, a database module will be included. Lastly, a system communication module will handle the communication between the different modules of the system to ensure that they are integrated effectively and efficiently. By integrating these different modules, the "Sound Classification and Display Tool for Assisting the Deaf and Hard-of-Hearing" system, as shown in Figure 4.1, will be able to provide the required functionality for deaf and hard-of-hearing users.

Figure 4.1: System Architecture

4.4. System Design
Odhiambo (2019) describes system design as the process of designing a system's components, including the architecture, modules, interfaces, and data flow. The objective of engaging in system design is to gather and present detailed information about the system and its components, in order to support an implementation process that aligns with the system architecture models and views. The design process involves utilizing various diagrams such as system sequence diagrams, use case diagrams, partial domain models, context and data flow diagrams, entity diagrams, and class diagrams. These diagrams are utilized at different stages of the design process to depict and document the functionality of the system.

4.4.1. Use Case Model
A use case model shows how a system interacts with its users, other systems, or external entities through a set of actions called use cases. It describes the various use cases, their relationships, and the actors involved in each use case. Use case diagrams are well suited for illustrating the objectives of interactions between a system and its users, structuring and clarifying the functional requirements of a system, defining the prerequisites and demands of a system, and describing the fundamental sequence of actions in a use case.

4.4.1.1. Use Case Diagram and Descriptions

Figure 4.2: Use Case Diagram

In Figure 4.2, the Deaf or Hard-of-Hearing User is the main actor in the use case diagram. The user interacts with the system to record sound using a microphone. The system then classifies the sound, using machine learning algorithms (a Convolutional Neural Network) to identify the type of sound, whether it is a dog bark, car horn, siren, street music, or children playing. The system then displays the classification results to the user through a visual interface in the form of a text message pop-up. Table 4.1 focuses on "Record Sound," detailing the steps and requirements for capturing audio input within the system and providing a comprehensive understanding of the recording functionality.
Table 4.2, on the other hand, pertains to "Sound Preprocessing," outlining the tasks and processes involved in preparing the recorded sound data for further analysis, including noise reduction and filtering. Table 4.3 introduces the use case description of "Classify Sound," elaborating on the procedure for analyzing and categorizing the preprocessed sound data using classification algorithms or techniques, thereby enabling the identification of different sound classes. Table 4.4 addresses the use case description of "Display Sound Classification (Predicted) Results," illustrating how the classified sound data is presented to the user, presenting the predicted results in a clear and user-friendly manner, and ultimately facilitating the interpretation and understanding of the sound classification outcomes.

Table 4.1: Use case description of Record Sound
Use Case: Record Sound
Description: The system records sound data using the Sound Recording Module (the phone's microphone) and saves it to the database.
Source: Ambient environment
Inputs needed: Sound data
Preconditions:
1. The Sound Recording Module is active.
2. The audio input device (the phone's microphone) is functioning correctly.
3. A database is available to store the recorded sound data.
Post Condition:
1. Sound data is recorded and saved to the database.
Flow of Events:
1. The system's Sound Recording Module begins capturing sound data from the user's audio input device.
2. The system's Sound Processing Module analyzes the recorded sound data, and the system automatically saves the sound data to the system's database.
3. The user may view and manage the recorded sound data using the system's user interface.

Table 4.2: Use case description for Sound Preprocessing
Use Case: Sound Pre-Processing
Description: The system automatically processes the recorded sound data to enhance its quality and extract relevant features for classification.
Source: System
Inputs needed: Recorded sound data
Preconditions:
1. Sound data has been recorded and stored in the system.
2. The sound processing module is operational.
Post Condition:
1. Filtered and feature-extracted sound data is stored in the system's database.
Flow of Events:
1. The system receives the recorded sound data.
2. The system applies a noise reduction filter to the sound data to remove any unwanted noise and artifacts.
3. The sound processing module extracts relevant features from the preprocessed data.
4. The preprocessed sound data is saved in the database.
5. The preprocessed data is used as input for the Sound Classification module to accurately classify the sound.

Table 4.3: Use case description of Classify Sound
Use Case: Classify Sound
Description: The system classifies the recorded sound data using a trained machine learning model (CNN) and displays the results on the system's user interface for the user.
Source: System
Inputs needed: Recorded sound data from the system's database