Cross-Lingual model for hate speech detection on Twitter: a case of Swahili and Swahili-English slang

Date
2023
Authors
Kariuki, A. O.
Publisher
Strathmore University
Abstract
The prevalence and entrenchment of online hate, hate crimes and hate speech in contemporary society concern organisations and governments. Detecting online hate, especially on social media, has proven daunting because offensive language takes many forms and most training data are topic-specific. In addition, available solutions and research are geared towards the English language, so detecting online hate in low-resource languages such as Swahili and other indigenous African languages is much more difficult. The problem has worsened because social platforms such as Twitter, Facebook, Instagram, Rumble and YouTube enable users to converse and participate in their native dialects. This research proposed using cross-lingual transfer learning for hate detection to overcome these challenges. As part of the research's experimental methodology, a cross-lingual model built on a pre-trained BERT model was developed, and its performance was compared with that of more established text classifiers: support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR). Through the Twitter API, more than 300K tweets with a Kenyan focus were collected. These tweets focused on Kenya's most divisive moments in history, namely the 2013, 2017 and 2022 general elections. A set of predetermined criteria, including user location, tweet location, hashtags, pro-hate accounts, hate patterns and racial epithets, was used to collect the data. For model development, training and validation, a random sample of over 20K tweets was annotated as hate or non-hate. The developed cross-lingual model achieved an area under the ROC curve (AUC) of 0.77 and an accuracy of 77 per cent. This study makes the following contributions. First, the research established an empirical framework and methodology for using transfer learning to identify offensive language in low-resource languages. Second, this strategy was crucial in creating a text classification framework that could be broadly applied to different types of abusive language on online platforms. The model's results may thus be used to inform data-driven legislation regarding the detection of online hate as well as evidence-based decisions by pertinent intelligence agencies.
Keywords: Deep learning, free speech, freedom of speech, hate detection, hate speech, machine learning, natural language processing, social media, Twitter.
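The two evaluation figures reported in the abstract, accuracy and area under the ROC curve, can be made concrete with a small sketch. The Python below is not the thesis code: the function names, the 0.5 decision threshold and the toy labels and scores are all assumptions chosen purely for illustration, and the numbers it prints say nothing about the thesis results.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of tweets whose thresholded score matches the annotated label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random hate tweet outscores a random non-hate tweet.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up annotations (1 = hate, 0 = non-hate) and classifier scores,
# for illustration only; they are not taken from the thesis data.
toy_labels = [1, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.7]

print(f"accuracy = {accuracy(toy_labels, toy_scores):.2f}")
print(f"ROC AUC  = {roc_auc(toy_labels, toy_scores):.2f}")
```

Accuracy depends on the chosen threshold, whereas the ROC AUC summarises ranking quality across all thresholds, which is one reason the two are often reported together for an annotated hate/non-hate sample.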
Description
Full-text thesis
Citation
Kariuki, A. O. (2023). Cross-Lingual model for hate speech detection on Twitter: A case of Swahili and Swahili-English slang [Strathmore University]. http://hdl.handle.net/11071/13529