Sentiment analysis for hate speech detection on social media: TF-IDF weighted N-Grams based approach

Date
2017
Authors
Mugambi, Sharon Kaari
Publisher
Strathmore University
Abstract
Hate speech on social media has become common in the Kenyan online community, largely due to advances in mobile computing and the internet. Incidents of hate speech on social media can spread quickly among online users and, through incitement, escalate into acts of violence and hate crimes, as was the case during the 2007-2008 Post Election Violence. With the upcoming, highly contested 2017 general elections, monitoring social media platforms is critically important so that hate speech is detected as early as possible and further escalation into violence is prevented. Current efforts by the National Cohesion and Integration Commission to monitor hate speech on social media use web crawlers to collect possible instances of hate speech based on specific keywords. Human monitors then have to analyze the collected data to determine which instances are actually hate speech. This human analysis is not only time-consuming and overwhelming but also introduces subjective notions of what constitutes hate speech. This research proposed the application of machine learning techniques to build a binary text classifier to detect hate speech on Twitter. Hate speech data was collected and labelled to build the corpora. A Support Vector Machine model was trained and validated on the labelled text data using unigram features and term frequency-inverse document frequency weighting. The research employed an experimental approach to determine which combination of features, weighting schemes and classifiers gives the best performance on the collected hate speech data. Bigram features weighted using term frequency-inverse document frequency and fed into a Support Vector Machine classifier gave the best classification performance, with an accuracy of 76.22 percent and an area under the Receiver Operating Characteristic curve of 0.76.
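As an illustration of the feature weighting named in the abstract, the following is a minimal sketch of TF-IDF weighting over bigram features. It is not the thesis implementation: the function names `bigrams` and `tfidf_bigrams`, the whitespace tokenisation, the unsmoothed `log(N / df)` form of the inverse document frequency, and the example texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def tfidf_bigrams(docs):
    """Return one {bigram: weight} dict per document.

    Assumed variant: raw term frequency times idf = log(N / df).
    The thesis may have used a smoothed or normalised form.
    """
    counts_per_doc = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each bigram occurs.
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())

    return [
        {g: tf * math.log(n_docs / doc_freq[g]) for g, tf in counts.items()}
        for counts in counts_per_doc
    ]


# Made-up, neutral example texts (not from the thesis corpus).
docs = [
    "the match was great",
    "the match was boring",
    "great weather today",
]
weights = tfidf_bigrams(docs)
```

Bigrams shared by several documents, such as ("the", "match") here, receive a lower weight than bigrams unique to one document. In a full pipeline these weights would be arranged into a feature matrix and passed to a Support Vector Machine classifier; libraries such as scikit-learn offer this through `TfidfVectorizer(ngram_range=(2, 2))` and `LinearSVC`, although the abstract does not state which toolkit was used.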
Description
Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science in Information Technology (MSIT) at Strathmore University
Keywords
Hate Speech -- Social Media, Machine Learning, Support Vector Machine, TF-IDF, Bigram