
1 Introduction

At present, the active social media population is reported as more than 4.5 billion worldwide.Footnote 1 As the amount of social media content produced by users increases, the need for better moderation techniques for unwanted content emerges. Automated detection of offensive text has therefore gained a lot of traction, focusing on concepts such as aggression, hate speech, trolling, misogyny and cyberbullying. Offensive language is considered degrading language that has a negative impact. Examples of offensive (OFF) and not offensive (NOT) texts from the Semi-Supervised Offensive Language Identification Dataset (SOLID) [54] are given in Table 1.

There is a substantial body of research and a number of previous reviews in the field of offensive language detection [28, 37, 64]. Advancements in natural language processing have also led to improvements and an increase in the variety of research in this field. The use of machine learning and deep learning algorithms for accurate classification of offensive language, and further classification of fine-grained types of offense, is widely researched. Moreover, creating high-quality datasets to train and test the models, as well as methods for evaluating dataset annotation, have been studied.

Table 1. Example texts from SOLID dataset

In this paper, we present an overview of the background and current state of offensive language detection on social media. In Sect. 2, we describe our methodology for article search and selection. In Sect. 3, we provide background on terminology, variations and definitions, application areas, shared tasks organized on the topic, existing datasets together with their differences in classes and in creation steps such as annotation agreement, and, finally, model evolution over time. In Sect. 4, we discuss challenges, gaps and potential opportunities in the area.

2 Methodology of the Literature Review

While forming the methodology, the guidelines of Kitchenham and Templier for writing literature reviews were followed for best practice [32, 61]. The following resources were considered for the search on the topic:

Conference Proceedings: According to conference rankings (by Google Scholar, in Computational Linguistics), the following top three conferences were examined for the last two years: the Meeting of the Association for Computational Linguistics (ACL), the Conference on Empirical Methods in Natural Language Processing (EMNLP) and the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL).

Digital Libraries: Among digital libraries, Web of Science was chosen for the key term search, since it is considered more reliable and provides citation information. Keyword groups were searched on the platform and the results of the searches were collected. A time interval between 2017-01-01 and 2023-05-31 was considered for the recent papers, while no time constraint was imposed for the background. Furthermore, additional resources were included for relevant sections, including areas other than computer science.

As search terms, firstly, keywords that can describe offensiveness in various ways were selected. The European Commission against Racism and Intolerance (ECRI) Glossary was also consulted for the search keyword selection.Footnote 2 These keywords are: “offensive”, “hate speech”, “racism”, “sexism”, “cyberbullying”. For the conference papers, these keywords were used directly in the search within the titles of the accepted papers. For the digital library search, these keywords were used along with complementary terms to help discriminate articles from fields such as social science. For this purpose, the following search term list was applied on Web of Science: 1. “offensive” “text classification”, 2. “hate speech” “text classification”, 3. “cyberbullying” “detection”, 4. “racism” “text classification”, 5. “sexism” “text classification” for the recent papers (2017–2022); 6. “offensive” “language detection”, 7. “hate speech” “detection” and 8. “cyberbullying” “detection” for a fundamental background. Papers were sorted by number of citations in descending order.

After obtaining the search results, overlapping papers were excluded, only publications written in English were taken into account, and publications from other fields, such as the social sciences, were excluded.

3 Background

3.1 Definition and Variations

Offensive language is defined as a term applied to hurtful, derogatory or obscene comments.Footnote 3 The United Nations, meanwhile, indicates that hate speech is used loosely in common language to refer to “offensive discourse targeting a group or an individual based on inherent characteristics - such as race, religion or gender - and that may threaten social peace”.Footnote 4

On the other hand, different terms are used in the automatic text detection literature to refer to the same concept as offensive language, such as aggressive [56, 58], toxic [38, 40, 62], abusive [66] or threatening [35] language. Another specific term that is commonly used in the field is cyberbullying. Cyberbullying is a generic term defined as “bullying that takes place over digital devices and includes sending, posting or sharing negative, harmful, false, or mean content about someone else”.Footnote 5

Furthermore, more specific concepts under the umbrella of “hate speech” have been considered. These concepts are usually based on the target group. They include, but are not limited to, particular concepts such as racism, sexism and homophobia. Additionally, there are only a few examples of solely ideological hate speech identification, such as hate speech towards the right wing in Germany [29]. Figure 1 shows a hierarchical schema of our attempt to clarify the relations of the terms around the concept.

Fig. 1. Hierarchical terminology schema

Moreover, Wiegand et al. [65] drew attention to the lack of good performance in detecting implicit abusive language (i.e. abuse not conveyed by explicit offensive words) and presented a list of sub-types of implicit abusive language with a divide-and-conquer idea behind it. The sub-types they recommended are ‘stereotypes’, ‘perpetrators’ (meaning a person committing an illegal, criminal or evil act), ‘comparisons’ (e.g. “You sing like a dying bird”), ‘dehumanization’ (the act of perceiving people as less than human, e.g. “I own my wife and her money.”), ‘euphemistic constructions’ (e.g. “You inspire my inner serial killer.” actually being an equivalent of “I want to kill you.”), ‘call for action’ (the author asking for something, typically some form of punishment), ‘multimodal abuse’ (i.e. the harmful content of a micropost is hidden in the non-textual components or results from an interplay of text and image/video), ‘phenomena requiring world knowledge and inferences’ (with sub-types jokes, sarcasm and rhetorical questions) and, finally, other implicit abuse to cover further cases.

3.2 Motivation and Application Areas

The increasing amount of social media input makes human moderation impossible, while traditional rule-based systems (e.g. word blacklists) are insufficient to provide good coverage. Therefore, the need for efficient automated detection mechanisms has gained a lot of traction.

Previous research has shown a strong negative relation between cyberbullying and young people’s mental health [37]. Earlier studies claim that derogatory language aimed at minority groups leads to political radicalization and worsens intergroup interactions [4].

Considering the ethical, sociological and psychological impacts, the demand for an efficient mechanism is quite high in various application areas. In the private sector, tech companies and platforms built on user input want to increase audience engagement and protect their brand by removing unwanted content as efficiently and quickly as possible. The 2018 Content Moderation report indicated that 27% of respondents to the Digital Trust Survey stated that they would stop using a social platform if it continued to allow harmful content.Footnote 6 As stated in the Business Journal from the Wharton School of the University of Pennsylvania (Jan. 2022), Facebook alone has committed to allocating 5% of the firm’s revenue, $3.7 billion, to content moderation (note that this covers overall content moderation, including text, image, video, etc.).Footnote 7

As Klonick [34] summarized the development of online speech moderation, major social media platforms such as Facebook and YouTube did not even have clear public policies and community standards until the late 2000s; since then, they have been developing and improving the scope and definition of their policies, their user feedback mechanisms and the internationality of their moderation. Moreover, the increase in the usage of streaming platforms (e.g. Twitch) has emphasized the need for real-time content moderation, which requires speeds that are not always possible with manual moderation.

All in all, due to its scalability and speed, AI-based content moderation is in increasing demand, with accuracy remaining the biggest challenge at the moment.

3.3 Shared Tasks

Shared tasks are challenges or competitions organized by the research community that enable teams of researchers to submit systems that solve specific tasks. Escartin et al. conducted a survey and reported that, in the NLP community, shared tasks are generally celebrated as an important factor in the advancement of the field [17]. Among the various specific tasks in the field of NLP, from news article similarityFootnote 8 to patronizing language detectionFootnote 9, the identification of various forms of offensive language is quite popular.

As shown in Fig. 2, the main data source of the datasets used in previous shared tasks is Twitter. In terms of the language of the datasets, the most common is English with 11 datasets, followed by Spanish with 7, German and Hindi with 4, Arabic with 3, Italian with 2 and then the others, including Bengali, Danish, Greek, Marathi, Turkish, Urdu and Vietnamese, with only 1 each.

Fig. 2. Source of the datasets used in the shared tasks

In terms of the participant and winner models, it can be seen that, over the years, submissions trained on neural network models have increased compared to non-neural ones. Support Vector Machines (SVM) [12], Logistic Regression, Random Forest, Naive Bayes and Decision Trees were the popular non-neural approaches, while recurrent neural networks (RNN) [55], convolutional neural networks (CNN) [21], long short-term memory (LSTM) [25], bi-LSTM and GRU were the popular deep learning architectures. Ensemble classification systems are also, in general, highly preferred among participants. For tasks on more than one language, some participants also submitted results with a multilingual approach, in which they trained their model on multiple languages.

For example, in earlier tasks such as GermEval, TRAC and EVALITA in 2018, the ratio of total participant submission models was around 48% non-neural (mostly SVM and Logistic Regression) to 52% neural networks [6, 36, 67]. In contrast, for a recent task, the latest EXIST [52], it is reported that all participants except one team used some kind of transformer-based system; more specifically, the majority used Bidirectional Encoder Representations from Transformers (BERT) [16] or versions of BERT, including multilingual BERT (mBERT), the Spanish version of BERT called BETO, RoBERTa, DeBERTa, the multilingual version of RoBERTa called XLM-R, or other transformer variants. Also in OSACT (2022), it is reported that the participating teams used different fine-tuned transformer versions such as AraBERT, mBERT, XLM-RoBERTa, etc., where the highest-ranking submissions used an ensemble of different transformers [24].
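
To make the dominant approach concrete, the sketch below shows a minimal fine-tuning setup for a BERT-style binary offensive/not-offensive classifier using the Hugging Face transformers library. It is not any particular team's submission; the model name, example texts and hyper-parameters are illustrative assumptions.

    # A minimal sketch (not any team's actual system) of fine-tuning a BERT-style
    # model for binary OFF/NOT classification with Hugging Face transformers;
    # model name, data and hyper-parameters are illustrative assumptions.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    texts = ["@USER you are a disgrace", "have a great day everyone"]
    labels = [1, 0]  # 1 = offensive (OFF), 0 = not offensive (NOT)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    class ToyDataset(torch.utils.data.Dataset):
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ToyDataset(texts, labels),
    )
    trainer.train()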

3.4 Datasets

Reviewing the recent datasets, it can be seen that, in addition to differences in labeling and classification schemas, annotation mechanisms also differ, for instance in the number of annotators and in the evaluation of agreement between annotators. Usually, three or more annotators annotated each instance in the datasets. For annotation agreement calculations, Fleiss’ kappa was considered for some datasets [19, 71], while Krippendorff’s measure was used in another [38]. Moreover, the annotators’ profiles were also often taken into consideration. For some datasets it is stated that variety has been maintained among annotators, but the details have been kept private. For others it is revealed; for example, for the Levantine hate speech dataset by Mulki et al., the genders of the three annotators were chosen as one male and two females [46].
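
As an illustration of the agreement measures mentioned above, the following sketch computes Fleiss’ kappa for a toy annotation table; the counts are invented and the implementation follows the standard formula rather than any specific dataset’s procedure.

    # Toy Fleiss' kappa computation for a table of per-item label counts;
    # the annotation counts below are invented for illustration only.
    import numpy as np

    def fleiss_kappa(counts):
        """counts: (n_items x n_categories) matrix; each row sums to the
        number of annotators who labeled that item."""
        counts = np.asarray(counts, dtype=float)
        n_raters = counts.sum(axis=1)[0]
        # per-item observed agreement
        p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        p_bar = p_item.mean()
        # chance agreement from overall category proportions
        p_cat = counts.sum(axis=0) / counts.sum()
        p_e = np.square(p_cat).sum()
        return (p_bar - p_e) / (1 - p_e)

    # 3 annotators per instance, categories: [OFF, NOT]
    table = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]]
    print(round(fleiss_kappa(table), 3))  # ~0.444 for this toy table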

In terms of data sources, the most common appears to be social media platforms such as Twitter, due to their short text structure. However, sources such as newspapers and platforms known to be more liberal, such as ‘gab.com’, have also been considered. Data collection is usually based on certain keywords and hashtags; however, at times data is collected based on searches following important events that have gone viral. A different data collection strategy is seen in the hate speech dataset by Mubarak et al. [45], for which tweets were collected using emojis that existed in offensive texts extracted from the previous datasets of Zampieri et al. [72] and Chowdhury et al. [10].

Rosa et al. [53] examined earlier cyberbullying datasets (from 2011 to 2018) and reported that the majority of the datasets are in English, mainly labelled by three annotators, with sizes varying from 2K to 85K instances and with data sources including not only Twitter, YouTube and Instagram but also Formspring, AskFM and MySpace.

Regardless of which classes are considered, the majority of the datasets we are aware of take a binary approach to labeling; however, Hada et al. created the first dataset based on the degree of offensiveness, where each instance has a score between -1 (maximally positive) and 1 (maximally offensive) [23].

More recently, counter-narrative generation has started to be researched as an alternative solution [11, 60, 73]. Counter-narratives are texts that push back against hate speech with fact-bound arguments or alternative viewpoints. In dataset creation, examples of hate speech/counter-narrative datasets have also started to emerge [18]. In a recent study, GPT-2 was utilized to generate synthetic training data for the model [47, 50].

Further examples of dataset creation have concentrated on aspects of racial bias. Sap et al. examined annotators’ insensitivity to differences in dialect and showed that when annotators are made explicitly aware of an African-American English tweet’s dialect, they are significantly less likely to label the tweet as offensive [57]. Davidson et al. also examined racial bias by training models on different datasets and concluded that racial bias exists in the datasets, as classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates [13].

Research in various languages also keeps expanding with new datasets, as in the most recent Korean, Chinese and Turkish studies [15, 30, 31].

3.5 Methodology and Model Evolution

In general, a detection system starts with pre-processing of the data, followed by splitting the data into training and test sets (typically 80% to 20%), feature extraction, model training and, finally, production of the classified output.
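
The sketch below illustrates this generic pipeline with scikit-learn, using an invented toy dataset; TF-IDF features and a linear SVM stand in for the feature-extraction and model-training steps and are only one of many possible choices.

    # A minimal sketch of the generic pipeline described above (toy data).
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    texts = ["you are an idiot", "lovely weather today",
             "nobody wants you here", "congrats on the new job",
             "shut up you fool", "thanks for sharing this",
             "what a pathetic loser", "see you at the meetup"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = OFF, 0 = NOT

    # 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)

    # feature extraction: TF-IDF over word uni- and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # model training and classification
    clf = LinearSVC().fit(X_train_vec, y_train)
    print(classification_report(y_test, clf.predict(X_test_vec)))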

Pre-processing depends on the data format and typically involves punctuation removal, stopword removal, tokenization, stemming, lemmatization, and smiley and slang conversion.
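
A hedged sketch of such pre-processing is given below; the slang map, stopword list and regular expressions are illustrative assumptions rather than a recommended configuration.

    # Illustrative pre-processing: lowercasing, URL/mention removal,
    # punctuation stripping, slang normalisation and stopword removal.
    import re

    SLANG = {"u": "you", "gr8": "great"}         # toy slang/smiley map
    STOPWORDS = {"the", "a", "an", "is", "are"}  # toy stopword list

    def preprocess(text):
        text = text.lower()
        text = re.sub(r"https?://\S+|@\w+", " ", text)  # URLs and @mentions
        text = re.sub(r"[^\w\s]", " ", text)            # punctuation
        tokens = [SLANG.get(t, t) for t in text.split()]
        return [t for t in tokens if t not in STOPWORDS]

    print(preprocess("@USER u are gr8 ... NOT! https://t.co/xyz"))
    # ['you', 'great', 'not']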

The features used vary from lexical features such as Bag of Words (BoW), n-grams and offensive word dictionaries to syntactic features such as the use of identifiers (i.e. second person pronouns) and user-level features [9]. Common techniques for feature extraction include TF-IDF, BoW, sentiment, Part-of-Speech (PoS) tags and word embeddings (GloVe [48], Word2Vec [42], FastText [5], ELMo [49]). Jahan et al. reported in a systematic review that the most used features were word embeddings and TF-IDF; additionally, the most commonly used word embeddings were Word2Vec and FastText [28].
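
As a complement to the TF-IDF example above, the following sketch derives document features by averaging pre-trained word embeddings; the use of gensim's downloader and the "glove-twitter-25" vectors are assumptions for illustration.

    # Averaged word-embedding features for a post (illustrative setup).
    import numpy as np
    import gensim.downloader as api

    wv = api.load("glove-twitter-25")  # pre-trained 25-dim GloVe Twitter vectors

    def embed(tokens):
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    features = embed(["you", "are", "pathetic"])
    print(features.shape)  # (25,)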

Additionally, further features have been explored by researchers. Galan et al. proposed an approach for cyberbullying detection based on the idea that behind every trolling profile there is a real profile of the user, and generated the hypothesis that it is possible to link a fake profile to the real profile and analyse different features of the profile, including text [22]. McGillivray et al. recently drew attention to the fact that the meaning of a word may change over time and proposed a time-dependent lexical feature approach, meaning that they applied an algorithm for detecting semantic change over a two-year period to find words whose semantics had changed and which had either acquired or lost offensiveness [41]. Casavantes et al. used metadata features such as the tweet creation time of day or account age and reported that utilizing metadata gave better results. In doing so, they utilized three different learning models, which they considered classical (BoW), advanced (GloVe) and state-of-the-art (BERT) text representations, and reported a statistically significant difference with the use of metadata [8].
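
A minimal sketch of combining text features with simple metadata features is shown below; it is loosely inspired by the metadata idea of Casavantes et al., but the chosen metadata fields, data and classifier are illustrative assumptions, not their actual setup.

    # Concatenating TF-IDF text features with toy metadata features
    # (posting hour, account age); the data and feature choices are invented.
    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["you are trash", "great game last night",
             "nobody likes you", "happy birthday!"]
    hour_of_day = [2, 20, 3, 14]            # hour the post was created
    account_age_days = [12, 900, 30, 1500]  # age of the posting account
    labels = [1, 0, 1, 0]                   # 1 = OFF, 0 = NOT

    X_text = TfidfVectorizer().fit_transform(texts)
    X_meta = csr_matrix(np.column_stack([hour_of_day, account_age_days]))
    X = hstack([X_text, X_meta])

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(X))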

In terms of models, some of the earlier research utilized WEKA (standard machine learning software developed at the University of Waikato) [26] and applied traditional algorithms, as in Razavi et al. [51]. They selected a classifier, created an abusive language dictionary, assigned a weight (1–5) to its entries and then applied multi-level classifiers boosted by the dictionary. Akhter et al. performed a series of experiments and reported that character n-grams outperformed word n-grams [2].
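
The contrast between character and word n-grams can be sketched as follows; the obfuscated example texts are invented and the vectorizer settings are arbitrary, so this only illustrates the kind of comparison reported by Akhter et al. rather than reproducing their experiments.

    # Character n-grams (bounded to word edges) still share sub-sequences with
    # obfuscated spellings such as "l0ser", whereas word n-grams treat "l0ser"
    # and "loser" as unrelated tokens.
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["u r such a l0ser", "you are such a loser"]

    char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)).fit(texts)
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2)).fit(texts)

    print(len(char_vec.vocabulary_), len(word_vec.vocabulary_))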

As Rosa et al. noted for earlier research on cyberbullying detection, dating from 2011 to 2017, detection mechanisms were mainly based on traditional machine learning algorithms such as SVM [53]. Van Hee et al. also showed that a Support Vector Machine achieved better results than baseline systems based on keywords and word unigrams [63]. Similarly, for hate-speech-specific detection, SVM has commonly been experimented with [39]. Fortuna et al. also stated in their review back in 2018 that the algorithms most frequently chosen for hate speech detection were traditional ones, SVM being the most frequent, followed by Random Forest, Decision Tree, Logistic Regression and Naive Bayes [20].

More recently, deep neural network models have gained great attention [1]. In 2017, Badjatiya et al. reported that deep learning methods utilizing CNN, LSTM and FastText significantly outperformed state-of-the-art char/word n-gram methods on hate speech detection [3]. Mozafari et al. obtained high F1 scores on hate speech detection with their transfer learning approach combining BERT [16] and CNN [44]. In their systematic review on hate speech, Jahan et al. identified 96 documents with deep learning approaches and noted that, among deep learning algorithms, BERT is the most commonly used one (38%) despite having been released quite recently, followed by LSTM, CNN, bi-LSTM, GRU and combinations of these, respectively, by percentage of usage [28].
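
For comparison with the transformer sketch in Sect. 3.3, a minimal bi-LSTM classifier of the kind surveyed here could look as follows; the Keras framework choice, vocabulary size, dimensions and random toy data are all illustrative assumptions.

    # A toy bidirectional-LSTM classifier for OFF/NOT prediction (Keras);
    # the integer sequences stand in for tokenized, index-encoded posts.
    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, max_len = 10000, 50
    model = models.Sequential([
        layers.Embedding(vocab_size, 100),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(1, activation="sigmoid"),  # P(offensive)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    X = np.random.randint(1, vocab_size, size=(32, max_len))
    y = np.random.randint(0, 2, size=(32,))
    model.fit(X, y, epochs=1, batch_size=8)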

Lately, language generation has also been involved in hate speech detection. Chung et al. proposed methods for improving counter-narrative generation for hate speech detection [11]. As another example, Wullach et al. used GPT for generating synthetic hate speech data from available labeled examples [68]. Depending on the dataset size and class distribution, data augmentation is commonly utilized on imbalanced datasets to improve performance and prevent issues such as overfitting. Ilan et al. reported that they improved performance with an augmentation method whose input is real unlabelled data, unlike real labeled or synthetic data (produced using a generative model); their approach made use of online platforms in which people specifically ask to be insulted (such as the subreddit r/RoastMe) [27]. Another recent data-oriented approach was reported by Yang et al., in which people were asked to deliberately generate offensive arguments, aiming to make models less sensitive to lexical overlap [70]. Drawing attention to the scarcity of labelled data, Sarracén et al. presented a study in which their model was composed of a Convolutional Graph Neural Network (GNN) and reported that it performed better than state-of-the-art models on small datasets [14]. Besides, Tanvir et al. reported that, with their GAN-BERT approach, they obtained promising results on a small dataset in the Bengali language [59]. Breazzano et al. experimented with a transformer-based architecture combining BERT with multi-task and generative adversarial learning (MT-GAN-BERT) for six different abusive language classification tasks, enabling semi-supervised learning, and concluded that it decreases computational costs without a considerable decrease in prediction quality [7].
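
As a rough illustration of the synthetic-data direction (not the actual setup of Wullach et al. or the other studies above), an off-the-shelf GPT-2 can be prompted through the Hugging Face pipeline to produce candidate training texts, which would then still need labeling or filtering.

    # Generating candidate synthetic training texts with GPT-2 (illustrative
    # only; the prompt and generation settings are arbitrary assumptions).
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    prompt = "@USER people like you always"
    candidates = generator(prompt, max_length=30, num_return_sequences=3,
                           do_sample=True)
    for c in candidates:
        print(c["generated_text"])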

From a more generic perspective, Minaee et al. reviewed deep learning approaches for various text classification tasks across more than 150 deep learning models and 40 datasets. They emphasized the fast progress in text classification over recent years thanks to contributions such as neural embeddings, attention mechanisms, self-attention and transformers, again showing that deep learning models have resulted in significant improvements compared to non-deep-learning models. They also discussed that there is no single best neural network for a classification task; the choice varies depending on factors such as the nature of the domain, the application area and the availability of labels [43].

4 Discussion and Conclusions

In this paper, we presented an overview of the detection of offensive language in social media text by natural language processing. We tried to shed light on the ambiguity in terminology and classification, along with the different data classes given in the datasets provided through shared tasks, each of which accepts many experiment submissions.

For this review we have not taken into account research on multi-modal approaches, which consider image and video along with text and author metadata.

4.1 Challenges

As in the ‘garbage in, garbage out’ principle, issues around data constitute an important part of the challenges in the field [64]. The complexity of the definition, as mentioned in previous sections, sometimes causes ambiguity in dataset classes, in annotation and in combining similar datasets. Moreover, due to the nature of social media, most of the text contains slang, a variety of smileys and grammatically incorrect sentences, and therefore consists of structures that are hard to predict. In addition, context switching, the use of different dialects and the lack of sufficient data sources for all languages are further language-related challenges.

Furthermore, as social media is our main consideration, social media entries are subject to relatively quick change over time. For example, a new term or acronym that did not exist in a language before, or that previously had a neutral or positive meaning, might emerge from a new incident such as a newly released television advertisement, a political scandal or a viral video, after which people might start using it with a secondary negative meaning.

4.2 Gaps in the Research

In spite of the traction in the field, some of the gaps in the research can be identified as follows:

  • Although there are numerous datasets and experiments on different languages, the majority of the research is on the English language. However, we are not aware of any research comparing perception from a ‘native speaker’ and a ‘non-native speaker’ point of view.

  • The amount of research on more particular topics such as migration or refugee status, disability, etc. is relatively low in comparison to more generic classification such as offensive/non-offensive. There is an opportunity to create datasets and work on classification in more specialized areas of hate speech.

  • Even though there has recently been more research on multilingual classification, the research so far is still limited, and there is an opportunity to study languages other than English and to research multilingual models.

  • Dataset sources are mostly social media user input, sometimes along with user metadata, and there are very few examples of other sources such as song lyrics or movie and TV show dialogue (subtitles) (informative documents such as Wikipedia are not under consideration here).

  • Images and videos are important parts of social media. For instance, Instagram users share over one million memes daily.Footnote 10 However, research that combines text with image or video is quite limited. Although there has been some research, such as Yang et al.’s (2019) multi-modal approach [69] and Kiela et al.’s (2020) ‘Hateful Memes Challenge’ [33], there is still opportunity in this aspect.