Use of regular expressions for multi-lingual detection of hate speech in Kenya
Hate speech has become a sensitive issue in Kenya, given that it helped trigger the post-election violence of 2007/2008. At the same time, the percentage of the populace with internet access has continued to grow, giving rise to an active online community whose activity is scarcely monitored. Detection of hateful messages is currently manual, as it relies mostly on what is captured in the media or on text that an online user happens to flag. That bloggers have come under investigation for the content they post online shows intent on the part of regulatory bodies to clean up online communication; however, a widespread and automated means of achieving this cleanup is yet to formally materialize. The main objective of this research was to establish an automated means of detecting textual hate speech in Kenya for the Sheng and Swahili languages. This was achieved using regular expressions, whose power and flexibility were found to be best suited to the unstructured nature of the Sheng language in particular. In this study, a corpus was created using data collected from correspondents through questionnaires and direct interviews. Hate speech in Kenya is tribal, and the annotators submitted texts that they determined to be hateful towards certain tribes; texts that were non-hateful in nature were submitted as well. The constituted corpus was divided into two segments: one portion was used to formulate the rules used by the system to detect hate speech, while the second portion was used to test the functionality of the system. The test data was pre-labelled by the human correspondents, and accuracy was measured as a direct comparison between the system's classification of the test data and that of the human annotators. The system was able to distinguish between hate speech and non-hate speech with an accuracy of 71.4%.
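The rule-based approach and the accuracy measure described above can be illustrated with a minimal sketch. It is not the system built in this study: the rules, the function names (`is_hate_speech`, `accuracy`) and the tiny test set are hypothetical, and the tokens `groupa` and `badword` are neutral placeholders for lexicon entries that the abstract does not list. The patterns tolerate stretched spellings, which is one way regular expressions might cope with informal Sheng text.

```python
import re

# Hypothetical placeholder rules. "groupa" stands in for a group name and
# "badword" for a derogatory term; neither comes from the study's rule set.
# Each letter may repeat (e.g. "baaadword") to tolerate informal spelling.
RULES = [
    re.compile(r"\bg+r+o+u+p+a+\b.{0,40}\bb+a+d+w+o+r+d+\b", re.IGNORECASE),
    re.compile(r"\bb+a+d+w+o+r+d+\b.{0,40}\bg+r+o+u+p+a+\b", re.IGNORECASE),
]


def is_hate_speech(text):
    """Return True if any rule matches the text."""
    return any(rule.search(text) for rule in RULES)


def accuracy(labelled):
    """Fraction of texts where the rule-based label equals the human label."""
    correct = sum(1 for text, label in labelled if is_hate_speech(text) == label)
    return correct / len(labelled)


# Made-up test set of (text, human label) pairs, for illustration only.
test_set = [
    ("groupa ni badword", True),
    ("Groupaaa are all baaadword", True),
    ("habari ya asubuhi", False),
    ("badword weather today", False),
    ("grp-a ni badword", True),  # abbreviation the rules do not cover
]

print(accuracy(test_set))
```

In this sketch the last text is missed because no rule covers the abbreviated spelling, so four of the five labels agree with the human annotation. Rules of this kind would be written from one portion of the corpus and scored on the held-out portion in the same way.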