The Unseen Targets of Hate -- A Systematic Review of Hateful Communication Datasets

Machine learning (ML)-based content moderation tools are essential to keep online spaces free from hateful communication. Yet, ML tools can only be as capable as the quality of the data they are trained on allows them. While there is increasing evidence that they underperform in detecting hateful communications directed towards specific identities and may discriminate against them, we know surprisingly little about the provenance of such bias. To fill this gap, we present a systematic review of the datasets for the automated detection of hateful communication introduced over the past decade, and unpack the quality of the datasets in terms of the identities that they embody: those of the targets of hateful communication that the data curators focused on, as well as those unintentionally included in the datasets. We find, overall, a skewed representation of selected target identities and mismatches between the targets that research conceptualizes and ultimately includes in datasets. Yet, by contextualizing these findings in the language and location of origin of the datasets, we highlight a positive trend towards the broadening and diversification of this research space.


Introduction
Hateful communication is a vehicle of conflict between individuals and groups, and exposure to hateful communication online is not a rare phenomenon. In a cross-national survey of internet users, 53% of American respondents report being exposed to hate material online, while 48% of Finns, 39% of Brits, and 31% of Germans report exposure (James Hawdon and Räsänen 2017). In a more recent study in Germany, about 76% of respondents said they had been confronted with hateful communication online and 39% had to deal with online hate very often (Landesanstalt für Medien NRW 2023). Online platforms increasingly appear to be media of de-civilization since even a low prevalence of hateful content can lead to high exposure rates if uncivil content becomes popular.
Researchers, legal scholars, and practitioners have not agreed upon a single definition of hateful online communication, and definitions range from very specific to extremely broad (Siegel 2020). Our definition of hateful online communication builds upon the definitions of hate speech by the Encyclopedia of the American Constitution (Nockleby 2000) and Britannica (Curtis 2023). Consequently, we define hateful communication as "any form of communication or expression (e.g., speech, images, text) that denigrates a person or persons on the basis of (alleged) membership in a social group identified by attributes such as race, ethnicity, gender, sexual orientation, religion, age, physical or mental disability, and others."
To tackle the problem of hateful online communication, we need to detect and address it. To this end, practitioners and researchers have devoted significant effort to developing automated methods to detect hateful online communication based on Machine Learning (ML). Since it is well known in Computer Science that the performance of an ML model is upper-bounded by the quality of the training data, the topic of data quality has gained more attention recently (Jain et al. 2020; Geiger et al. 2021; Liang et al. 2022). The famous "garbage in, garbage out" principle applies not only to supervised ML approaches (where the data quality depends on the quality of data annotations that guide the ML model, among other factors) but also to semi-supervised and fully unsupervised ML methods (where the quality of data depends, among other factors, on the data selection and preprocessing decisions). Dimensions of data quality that are typically discussed by ML scholars include noisy labels/annotations, class imbalance, data coverage, data homogeneity, and data valuation (Jain et al. 2020; Liang et al. 2022). In the social sciences, scholars differentiate between intrinsic and extrinsic data quality dimensions (Daikeler et al. 2023). In the context of hateful communication datasets, intrinsic quality dimensions refer to the extent to which the dataset covers the phenomenon of interest in its full diversity; extrinsic quality dimensions relate to the accessibility and reusability of datasets. How those identities that shape and are covered by the datasets impact the intrinsic and extrinsic quality of datasets has received little attention so far.
While it is well known that curating datasets requires crucial design decisions that impact their quality, little attention has been paid to the identities of those who curate the datasets. The identity of an individual refers to their community, socio-demographics, position, or self-representation, including but not limited to political affiliation, age, body image, and institutional or organizational membership. Especially when curating data for hateful online communication, we expect that the identities of data curators may impact the quality of the final data in at least two ways: on the one hand, the scientific environment and background of the researchers may affect their definitions of the construct and their practices as data curators; on the other hand, their awareness, interests, and sensibilities towards different targets of hateful communication intersect with their own beliefs, attitudes, and experiences, which in turn may affect their choices on which targets and phenomena to include in the datasets.
Furthermore, issues of data quality related to the identities of researchers compound with those related to annotators. Previous research suggests that identities and beliefs of data annotators impact their perceptions and consequently their annotation of hateful online communication (Sap et al. 2022; Pei and Jurgens 2023). Sap et al. (2022) find that more conservative annotators and those who scored highly on their scale for racist beliefs were less likely to rate anti-Black language as toxic, but more likely to rate African American English dialect as toxic. More recently, Pei and Jurgens (2023) re-annotated 1500 comments sampled from a dataset consisting of Reddit comments (Hada et al. 2021) using 262 annotators from a representative sample from Prolific. Their results show that people from other cultures may perceive the same comment with a lower or higher degree of offensiveness.
For ML models to have a real-world, positive impact on the targets of hateful communication, it is necessary to unpack the relationships between the identities included in the datasets and identities involved in the curation of datasets.
Recent systematic reviews summarized how the literature in hateful communication advanced methodologically and theoretically (Paz et al. 2020; Vidgen and Derczynski 2020; Pamungkas et al. 2023). Our research adds to this body of literature by addressing the practices around curating hateful communication datasets, with a focus on how they represent the targets of hateful communication. In particular, we offer a positionality outlook on hateful communication research. We perform a systematic review of the past decade of datasets meant for training ML models for detecting hateful language, focusing on the identities that are included and that shape the production of hateful communication datasets.
First, we focus on the producers of hateful communication datasets and their practices. We find that in the past five years, the field of hateful communication research broadened its geographic borders, became shaped by international collaboration, and increased its coverage of different languages and platforms in the datasets. Yet, the production is still dominated by researchers with U.S.-based affiliations and the majority of datasets are in English.
Next, we focus on the targets of hateful communication that are explicitly included in their design. Leveraging frameworks for assessing the quality of datasets (Sen et al. 2021), we distinguish between explicitly conceptualized targets (those who are included in the explicit definition for hateful communication as a construct) and operationalized targets (those who are operationalized in the sampling, annotation and/or analysis of the dataset). We find that hand-in-hand with the broadening of the production of hateful communication research, conceptualized and operationalized targets came to include more identities in recent years. However, some target identities, such as age and body image, are rarely covered in any of the datasets, which raises concerns about the ability of ML systems to detect hate towards those target identities.
Focusing on a sample of 15 highly-used datasets, we analyze the discrepancy between the targets included in writing (conceptualized and operationalized targets) and the detected targets, which are actually present in the datasets independently of whether they were included in writing. We find that among the instances for which we detect targets, up to 16% fall in single target categories that were not conceptualized and/or operationalized first. This may make a hate classifier trained on such data perform unpredictably on these targets.
In summary, the paper addresses the following questions:

Q1: In what countries are producers of hateful communication datasets located, what languages are they studying, and how are the datasets' qualities evolving?
Q2: Which identities are discussed as targets of hateful communication in the scientific literature?
Q3: Which identities are included as targets in hateful communication datasets, even if not explicitly mentioned in the literature?
Overall, this work highlights a diversification and broadening of the research space around the curation of hateful communication datasets, in terms of both the participants in the scientific field and their attention toward the targets of hateful communication. Within this overall positive trend, the review identifies shortcomings in how research reflects local contexts and identities. Addressing this gap may help the next decade of research in addressing the needs of the targets of hateful communication more accurately and equitably. Thus, this work suggests practical steps for developing standards and practices that ensure the quality of hateful communication datasets.

Related Work
Creating an ideal dataset for training and evaluating hate speech detection systems is challenging (Sodhi et al. 2021). Frequently observed limitations include datasets that are too narrow in their linguistic diversity, with standard English vernaculars being the most studied (Ghosh et al. 2022); datasets that are limited to ad-hoc definitions of hateful content, restricting the validity of the resulting machine learning models (Hardaker and McGlashan 2016; Albadi et al. 2018); and datasets that skew towards frequently studied targets, disregarding less-frequent but equally consequential ones (Gao and Huang 2017; Vigna et al. 2017). For example, Moy et al. (2021) analyzed the language discrepancy between English and non-English hate speech datasets and highlighted the importance of non-English datasets for hate speech detection, especially in multi-lingual countries. The study of Swamy et al. (2019) highlights the redundancy and non-generalisability between datasets for abusive language detection through experiments on cross-dataset training and testing.
Despite the large number of datasets that cover instances of hateful content, only a few studies have focused on their quality. Waqas et al. (2019) released a literature review on hate speech research, focusing on research papers from the Web of Science core database published through March 2019. Specifically, they concentrated on mapping broad research indices, prevalent research topics, research hotspots, and significant stakeholders such as organizations and contributing areas. Fortuna and Nunes (2018) provided an analysis of the status of hate speech by presenting a summary of approaches, covering algorithms, methodologies, and main features. They also focus on categorizing the different works that aim to detect hate speech for different targets (i.e., Racism, Sexism, Prejudice toward refugees, Homophobia, and General hate speech). Vidgen and Derczynski (2020) reviewed 63 publicly available abusive language datasets, also using the PRISMA review methodology. They described the information that the datasets contain (and exclude), how they have been annotated, and how tasks have been constructed. Lastly, they gave a comprehensive examination of methods for making training datasets more accessible and helpful. Poletto et al. (2021) systematically assessed the characteristics of hate speech datasets, including their creation methods, thematic focus, and language coverage. While they do not cover targets specifically, they analysed topical focus, i.e., the specific topics and abusive phenomena addressed. For example, topical focuses can be aggressiveness, homophobia, toxicity, or misogyny. While topical focus can also consider targets, this is more related to the task addressed.
Some reviews on hateful communication paid particular attention to Natural Language Processing (NLP) methods rather than datasets. Schmidt and Wiegand (2017) presented a survey on the automatic detection of hate speech, mainly focusing on the NLP approaches. A survey conducted by Torregrosa et al. (2021) focused on the existing NLP techniques for extremism detection and their application, and mentioned datasets with regard to their availability. Similarly, Jahan and Oussalah (2021) systematically reviewed literature of the last 10 years from a technological perspective, with a special focus on NLP and deep learning technologies applied for automatic hate speech detection. Ayo et al. (2020) focused on the Machine Learning techniques for hate speech classification of Twitter data and provided their current status and future directions.
Unlike previous work, our study focuses on the identities included in the datasets and identities involved in the curation of datasets and how they impact the quality of hateful communication datasets. Target identities play a central role in a dataset from its production to its analysis, and this role varies considerably across contexts and languages. Depending on the research topic of the original study, target identities can influence the type of content being included and the results of the analysis. Prior studies have focused more on explicit characteristics of the dataset being analyzed, like language and the approaches used for analysis, while less attention has been paid to the mediator that distinguishes the dataset, namely the target identities of hateful communication.

Methods and Data
In this section, we outline how we survey the literature and analyze papers that introduce novel datasets of hateful communication. Then, we clarify the methodology used to assess the identities that are included and that shape the hateful communication datasets and their quality.

Paper Analysis: Systematic Literature Review
We follow the PRISMA guidelines for surveying the literature systematically (Page et al. 2021). Here, we clarify how we search, select, and annotate papers. In particular, to gain an encompassing view of the quality of the datasets introduced by this body of work, we annotate the papers introducing the datasets, the datasets themselves, and the targets of hateful communication explicitly mentioned in the papers. We conducted this literature review in early 2022; therefore our sample is restricted to papers published until March 2022. To prevent misunderstandings related to the partial availability of data for the year 2022, we de-emphasize the corresponding results by reducing their opacity in plots.

Search and Selection Procedures
We based our literature search on the following academic databases and search engines: Scopus, ACM Digital Library, and ACL Anthology, which we chose for their topical relevance and interdisciplinary nature.
To achieve a broad inclusion of datasets concerned with different dimensions of hateful content, we first constructed a set of queries, composed of four parts. These queries systematically combine four different dimensions: topic, content, dataset, and data source. Our broad understanding of hateful content is reflected particularly in the wide range of keywords in the topic dimension. For each dimension, we defined a set of relevant keywords: topics (e.g., "hate", "troll", "dehumanize"), content (e.g., "message", "speech", "language"), datasets (e.g., "corpora", "dataset", "corpus") and data source (e.g., "web", "internet", "online"). While these queries help to include a large number of datasets on hateful content, they also lead to the inclusion of related, but not relevant constructs and datasets. These publications are screened out in the second stage of our extensive manual paper selection procedure. An overview of the literature screening process is given in Figure 1. Further details on the search and selection procedures can be found in the appendix.
Annotation Procedure. We constructed a concept matrix (Appendix Figure 14) that included details about the publications, corresponding datasets, and targets to assist in guiding our data annotation process. For the final round of reviews, three annotators evaluated the full text of each paper and completed the annotation matrix for all papers included in our sample.
Annotation: Paper Metadata. The paper section provides the publication's general metadata, including its title, journal, country of author affiliation, summary, citation number, publication date, and accessibility. Within the dataset section, we differentiate between construct definition, metadata about the dataset (e.g., its collection procedure, time span, topical focus) and its annotation (e.g., number of annotators, guidelines, incentives).
Annotation: Dataset Metadata. The dataset metadata block contains detailed information about the dataset, including its availability, way to access, format, reference name, number of newly created or adapted datasets, language, data source, topical focus, relevance to social events, data production and collection time, country of origin, and any measures taken for anonymity protection of the data source. We explored the annotation process that was used during the dataset creation by identifying information about the annotation type, procedure, selection strategy, overall data size and the size used for training and testing, information about annotators, guidelines, and incentives provided.
Annotation: Conceptualized Targets. The target section provides information about the authors' definitions of targets of hateful communication that are explicitly addressed by the authors of the datasets. We differentiate between papers that aim to measure hate towards selected targets and papers that do not discuss specific targets. We further differentiate between individual targets (e.g., a specific politician), and targets corresponding to collective identities (e.g., one's political affiliation). We categorize collective identity targets according to the taxonomy introduced in ElSherief et al. (2018a) and iteratively added four other major collective identities we found in the publications in our literature review, namely political affiliation, age, body image, and institutional or organizational membership.
Figure 2: We introduce a two-tier categorization of targets. First, we distinguish between conceptualized targets (i.e., those who are included in the explicit definition of hateful communication as a construct chosen by the researcher) and operationalized targets (i.e., those who are operationalized in the sampling, annotation and/or analysis of the dataset). Moreover, while conceptualized and operationalized targets are explicitly accounted for and typically described in the paper, the corresponding dataset may include other targets that are not: we call the latter detected targets. The figure depicts a mismatch between these three types of targets: the researcher has chosen a very broad conceptualization of hateful online communication encompassing race, gender, and religion, but a narrow operationalization, which only aims to capture hate towards gender identities in the dataset; yet, ultimately, the final dataset may also include targets that were part of neither the conceptualization nor the operationalization, such as identities based on political ideology.
For the construct, we collected its definitions in the text of the papers, including all potential sub-categories of the main construct. From the definitions, we extract both the topical focus (the communicative phenomenon under study, such as male chauvinism) and the targets of the hateful communication (such as women). In this paper, we call conceptualized targets those that are explicitly mentioned in the construct definition.
As an example, Taradhita and Darma Putra (2021) define the hate speech construct as "an act of communication by a particular person or group that aims to insult a person or a group based on their ethnicity, race, religion, gender, sexual orientation, or class", the latter being the targets of the particular kind of hateful communication. On the contrary, some definitions in the literature do not explicitly identify specific groups of targets as part of the construct definition. For example, according to Zhang et al. (2018), "we identify that hate speech 1) targets individual or groups on the basis of their characteristics (targeting characteristics); 2) demonstrates a clear intention to incite harm, or to promote hatred; 3) may or may not use offensive or profane words".
Following frameworks for data quality (Sen et al. 2021), we aim to assess the inclusion of targets in all phases of dataset creation. To this end, we annotate targets beyond those mentioned in the construct definition at two crucial steps: those included intentionally and explicitly in the operationalization (i.e., data creation process), and those included in the final dataset itself even if unintentionally. We clarify the former, before discussing the latter in the next section.
Annotation: Operationalized Targets. We annotate as operationalized targets those targets for which the authors define concrete measures to ensure that their presence in the data is visible. For example, authors may mention targets in the annotation codebook and/or use them as labels, or may use certain methods that are specifically designed to detect certain targets (e.g., a dictionary to detect gender words or an antisemitism classifier). Additionally, some authors also define measures that increase the presence of certain targets in the data (i.e., define data collection strategies that potentially boost the presence of certain targets).

Dataset Analysis: Detected Targets
We next turn to the targets that are included in the datasets themselves, irrespective of whether the authors of the datasets mentioned them in the conceptualization or operationalization of the construct under study. We call these detected targets.
Therefore, we complement our literature review with an in-depth analysis of a convenience sub-sample of targeted datasets made available by Risch et al. (2021), which combines datasets already included in our systematic review with datasets added and annotated exclusively for this analysis. For this sub-sample, we analyse the prevalence of different targets. This analysis allows quantifying the potential mismatch between conceptualized, operationalized and included targets in datasets that are or can be used to train hate speech detection systems. A potential mismatch across these levels, as depicted in Figure 2, may lead to surprising failures of hate speech detection systems.
For target detection, we use a dictionary-based approach, which aims to detect different identity terms that are mentioned in a hateful or non-hateful context.
Dictionary Creation. Our methodology for the dictionary-based target detection builds on related work with a strong focus on target identities (Silva et al. 2016; ElSherief et al. 2018b; Vidgen and Derczynski 2020). Starting with a list of more than 750 keywords from the website hatebase.org, ElSherief et al. (2018b) identify the 51 terms most indicative of hate speech, removing phrases that were deemed context-sensitive or that would frequently be used in contexts other than hate speech, for example, the term "pancake". hatebase.org, one of the biggest repositories of multilingual hate speech, compiled this list by asking users to contribute through the addition of new hate speech terminology and classifying it into different categories. Both the compressed list of keywords and the categorization scheme were subsequently adopted as a basis for the analysis by ElSherief et al. (2018b).
After reviewing several works focusing on hate speech targets (Davidson et al. 2017; de Gibert et al. 2018; Qian et al. 2019; Kennedy et al. 2020; Pamungkas et al. 2020; Vidgen et al. 2021; Kennedy et al. 2022), we decided to further extend the target dictionary proposed by ElSherief et al. (2018b) to improve the coverage of target categories discussed in existing literature. Specifically, we incorporated the categories Age and Body, along with corresponding keywords found in Vishwamitra et al. (2020) and Baheti et al. (2021). Additionally, we included the category Political, as well as the category Organisations/Institutions and corresponding keywords from Zampieri et al. (2019). We excluded the category Archaic as it did not represent a single, homogeneous target group. The final set of target categories is Age, Body, Class, Disability, Gender, Nationality, Organisations/Institutions (Org./Inst.), Political, Race, Religion, and Sexuality. We report in Table 3 of the appendix the full list of keywords associated with each category.
Dictionary Application. For each instance of a dataset, we check whether one or multiple keywords defined in the dictionary occur in the instance's text. To improve the accuracy of these matches and increase the efficiency of the matching process, minor preprocessing steps such as the deletion of stopwords, digits, and punctuation and a transformation into lowercase text were implemented. As the keywords in our dictionary included bigrams, we tokenized the instances' texts into both uni- and bigrams before applying the dictionaries to them. The results were aggregated on the dataset level, resulting in a distribution of the instances over the available target categories for each dataset. To identify the hateful instances in the datasets, we rely on the annotations provided by the dataset creators.
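The matching steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the tiny keyword dictionary, the stopword list, and all function names are hypothetical placeholders (the actual keywords are those reported in Table 3 of the appendix), and the exact preprocessing order is an assumption.

```python
import re
import string
from collections import Counter

# Hypothetical mini-dictionary for illustration only; not the paper's keyword list.
TARGET_DICTIONARY = {
    "Gender": {"women", "men"},
    "Religion": {"muslims"},
    "Age": {"old people"},  # example of a bigram keyword
}
# Hypothetical stopword list; any standard list could be substituted.
STOPWORDS = {"the", "a", "an", "are", "is", "of", "and", "to", "all"}

def preprocess(text):
    """Lowercase, drop digits and punctuation, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    return [token for token in text.split() if token not in STOPWORDS]

def uni_and_bigrams(tokens):
    """Return the set of unigrams and space-joined bigrams of a token list."""
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams

def detect_targets(text):
    """Return every target category with at least one keyword in the text."""
    grams = uni_and_bigrams(preprocess(text))
    return {
        category
        for category, keywords in TARGET_DICTIONARY.items()
        if grams & keywords
    }

def aggregate(instances):
    """Count, per target category, the instances in which it was detected."""
    counts = Counter()
    for text in instances:
        counts.update(detect_targets(text))
    return counts
```

In this sketch an instance can be counted under several categories at once, which mirrors the statement that one or multiple keywords may occur in a single text. Because stopwords are removed before bigrams are formed, a bigram keyword can match two words that were originally separated by a stopword; whether the original pipeline behaves the same way is not stated in the text.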
[Figure caption fragment] Untargeted refers to all other datasets that do not meet these two criteria. Note that the data for 2022 is only partially available (as described in our Methods and Data Section).
Dictionary Validation. While we cannot assess the recall of our dictionary approach for detecting targets, we can estimate its precision: on a stratified random sample of 33 instances where at least one target was detected (3 instances for each of the 11 target categories), the correct target is present in 21 instances, corresponding to a macro-average precision of 68%. While the agreement between our annotations and the dictionary results is high for the frequently operationalized target categories, it seems to be inherently difficult to operationalize the categories Age and Body using a theory-based lexicon. Even though we observe low agreement for these categories, we decided to retain them to surface this difficulty and highlight the lack of datasets covering them. For the categories Organisations/Institutions and Nationality, the round of manual annotations showed their proximity to the categories Political and Race, respectively. This observation is in line with the difficulties that Bretschneider and Peters (2017) and Ousidhoum et al. (2019) report in distinguishing those category pairs during their annotations.
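For readers unfamiliar with the metric, the following sketch shows how a macro-average precision is computed from per-category manual judgments: precision is calculated within each target category and then averaged with equal weight per category. The judgments below are invented for illustration and are not the validation data of this study; the function name is likewise hypothetical.

```python
from statistics import mean

def macro_average_precision(judgments):
    """Average the per-category precision, weighting each category equally.

    `judgments` maps a category to a list of booleans, one per sampled
    instance, that is True when the detected target was judged correct.
    """
    per_category = {
        category: sum(flags) / len(flags)
        for category, flags in judgments.items()
        if flags  # skip categories without sampled instances
    }
    return mean(per_category.values()), per_category

# Hypothetical judgments for three categories with three samples each.
hypothetical_judgments = {
    "Gender": [True, True, True],
    "Race": [True, True, False],
    "Age": [False, False, True],
}
macro, per_category = macro_average_precision(hypothetical_judgments)
```

Reporting the per-category values alongside the macro average makes visible which categories, such as Age and Body in the validation above, pull the overall figure down.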

Results
We provide an overview of the state-of-the-art in hateful communication datasets, before presenting our findings on the identities that shape them. First, we analyze the location of the authors' institutional affiliations, the languages of the datasets they contribute, and the topics they focus on. Next, we study which targets of hateful communication are explicitly mentioned in the papers that introduce new hateful communication datasets, to map out how research has distributed its efforts across different target groups. Finally, we analyze the targets that are empirically included in the datasets to assess the possible mismatches between the conceptualization of hateful communication and the resources intended to address it.

Summary of Collected Papers & Datasets
Out of 2533 papers initially matching our search queries, we identified 341 papers introducing novel datasets about hateful online communication suitable for training machine learning models. Figure 3 shows the increase in the number of datasets shared over time, spanning publication years 2012-2022.
We now turn to the stakeholders of the hateful communication datasets. We start by unpacking the context of the production of the datasets. We investigate the diversity of researchers contributing to this body of work, using the location of their institutional affiliations to situate the researchers. We show how researchers' locations are correlated with differences in the choice of languages and topics covered by the datasets they produce. We analyze the quality of hateful communication datasets with respect to both intrinsic data quality indicators (such as the diversity and coverage of languages, platforms, and targets) and extrinsic factors (which include the accessibility and interoperability of datasets).
Dataset Availability. 51% (251) of all datasets in our sample are publicly available (i.e., authors provide links to the dataset or specify that the data is available upon request; we also included datasets that are only available upon request, since the terms of use of social media platforms often hold authors back from sharing their data in public data repositories). Among those that are available, the most common way to distribute datasets is GitHub, with 62% of datasets shared via GitHub, followed by open repositories such as Zenodo (5%) and Google Drive (2%); the rest are available through a provided link to websites. Only 7% of datasets are specifically available upon request via email or any given contact. Although all 180 publications corresponding to the 251 publicly available datasets mention ways to access the data, we found that it was not possible to access 21 datasets of 17 papers due to wrong or expired links. The remaining 163 out of 180 studies provided valid access to 230 novel datasets.

Diversity of Researchers' Locations
The United States is the country leading the production of hateful communication datasets. Overall, U.S.-based researchers were involved in the creation of 135 new datasets, 27% of the total number of datasets (492). In decreasing order, the remaining locations with the most contributed datasets are India (12%, 57 datasets), the United Kingdom (12%, 57 datasets), Germany (7%, 32 datasets), and Italy (7%, 32 datasets) (Figure 5).
Yet, the field appears to be changing over time. Not only is the field growing in size (as noted in the previous section and depicted in Figure 3), but it is concurrently expanding its geographic borders. As of 2022, researchers with affiliations in 59 countries contributed new datasets, compared to 9 countries before 2017. In particular, since 2017, each of the top-5 dataset-producing countries contributed fewer datasets than the non-top-5 countries taken together (see Figure 6a). Moreover, since 2017, transnational collaborations increased: 23% of all datasets in our sample were published by transnational teams since 2017, while only 1% of datasets were published by transnational teams before 2017.
Even when accounting for the low number of datasets before 2017, we see a difference in the frequency of transnational collaborations: before 2017, only 4 out of 23 datasets (17%) were created by transnational teams, while since 2017, 25% of datasets fall into that category.
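The period-wise shares above can be sketched as follows. This is a minimal illustration under stated assumptions: each dataset is represented as a (year, set of affiliation countries) pair, a dataset counts as transnational when its authors are affiliated with more than one country, and the function name `transnational_share` and the toy records are hypothetical rather than taken from our annotation files.

```python
from typing import Iterable, Set, Tuple


def transnational_share(records: Iterable[Tuple[int, Set[str]]],
                        start_year: int, end_year: int) -> float:
    """Fraction of datasets published in [start_year, end_year] whose
    authors are affiliated with institutions in more than one country."""
    in_period = [countries for year, countries in records
                 if start_year <= year <= end_year]
    if not in_period:
        raise ValueError("no datasets in the requested period")
    transnational = sum(1 for countries in in_period if len(countries) > 1)
    return transnational / len(in_period)


# Toy records (year, affiliation countries); illustrative only.
toy_records = [
    (2015, {"US"}),
    (2016, {"US", "UK"}),
    (2018, {"IN"}),
    (2019, {"DE", "IT"}),
    (2020, {"US", "IN"}),
    (2021, {"BD"}),
]
share_before = transnational_share(toy_records, 2012, 2016)  # 1 of 2 datasets
share_since = transnational_share(toy_records, 2017, 2022)   # 2 of 4 datasets
```

Normalizing within each period, rather than by the size of the whole sample, is what makes the two periods comparable despite the low number of early datasets.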
This highlights the increasing diversity of teams that contribute hateful communication datasets but also shows that large parts of the world are not involved despite experiencing hate online.
Figure 6b shows that the number of datasets in English has steadily risen since 2012. However, as with the locations of the researchers, we have witnessed a diversification of the languages since 2017/2018. Before 2017, almost all datasets were in English. After 2018, there are almost as many datasets in non-top-5-most-common languages as there are in English each year.
Next, we unpack how the language of the datasets relates to the location of the researchers authoring them. Figure 8 shows the distribution of datasets by their language and the location of their authors. Researchers in selected locations, such as the U.S. and U.K., contribute to datasets in a range of languages. The converse also appears true: English is the most common dataset language even for researchers in countries where English is not an official language. For all other languages, the majority of datasets are contributed by countries where the language is official, e.g., Bengali datasets mainly originate from Bangladesh and Russian datasets from Russia. Arguably, this may reflect the contextual nature of hateful communication, which requires not only linguistic proficiency but also deep cultural situatedness: researchers' own experiences (historical, cultural, familial, and personal) shape the way they act in the world around them, in this case through their focus on different languages, and possibly different phenomena. For example, what hateful communication looks like in Germany may be shaped by its current sociopolitical condition as well as its history, and therefore may significantly differ from hateful communication in neighbouring countries such as France. This opens questions on the ability of hateful communication research to be effective in social contexts that are currently not represented in datasets.
Given the relevance of situatedness for hateful communication research, we further unpack the relationship between how researchers define the construct under study and the language of the datasets. We find that how researchers define a construct varies significantly depending on the language of the dataset. To gain qualitative insights into those differences, we use word clouds that surface the most discriminative terms used in construct definitions across various languages.
Figure 9 shows the differences between the top 6 languages in our sample. To surface the most discriminative terms used in construct definitions across various languages, we use an approach based on Term Frequency Inverse Document Frequency (TF-IDF). Specifically, we combine the construct definitions of all datasets in a language into a single document, compute the TF-IDF scores of each keyword in all documents, and then highlight keywords with a TF-IDF above the threshold of 0.01. This approach highlights the words that are especially salient for certain languages, instead of words that are common across all languages.
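The TF-IDF procedure described above can be sketched as follows. This is a minimal illustration, not the exact implementation: the tokenizer, the TF-IDF variant (relative term frequency multiplied by log(N / df)), the function name `salient_terms`, and the toy definitions are all assumptions; only the per-language merging of definitions and the 0.01 threshold come from the description above.

```python
import math
import re
from collections import Counter


def salient_terms(definitions_by_language, threshold=0.01):
    """Return, per language, the terms whose TF-IDF score exceeds `threshold`.

    All definitions of one language are merged into a single document.
    TF is the relative term frequency within that document; IDF is
    log(N / df), so a term used in every language scores zero.
    """
    docs = {
        lang: Counter(re.findall(r"[a-z]+", " ".join(defs).lower()))
        for lang, defs in definitions_by_language.items()
    }
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts
    # the number of language documents that contain each term.
    doc_freq = Counter(term for counts in docs.values() for term in counts)

    salient = {}
    for lang, counts in docs.items():
        total = sum(counts.values())
        scores = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        }
        salient[lang] = {t: s for t, s in scores.items() if s > threshold}
    return salient


# Toy construct definitions; illustrative only.
toy_definitions = {
    "French": ["hate against women", "abuse targeting women and gender"],
    "Arabic": ["hate against religion", "abuse targeting religion and race"],
}
result = salient_terms(toy_definitions)
```

With this IDF variant, terms shared by all languages (such as "hate" in the toy input) receive a score of zero and are never highlighted, which matches the stated goal of surfacing language-specific rather than common vocabulary.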
Our results show that gender is emphasized in French and Spanish datasets, while race and religion are more pertinent in Arabic datasets. This highlights the diversity of conceptualizations of hate across different languages. Considering those differences is especially important when researchers merge and translate datasets to train hateful content detection systems, which is a promising approach, especially for underresourced languages (Röttger et al. 2022).
various backgrounds involved, the overall volume and the diversity in languages and targets of hateful communication have increased, which also contributes to an improved representation of hateful communication in the collection of datasets in the field as a whole.

Summary
Q2: Which identities are discussed as targets of hateful communication in the scientific literature?
The previous section analyzed the temporal evolution of the production of hateful communication datasets and unpacked how those identities relate to different practices that affect the qualities of the datasets. Next, we analyze the identities that this body of research focuses on, in particular the identities of the targets of hateful communication. Following frameworks for assessing the quality of datasets (Sen et al. 2021), we track how targets are included in the conceptualization phase of the work (i.e., in the definitions of the constructs under study) and in the operationalization phase (i.e., in the choice and design of automated or manual labeling procedures, as well as sampling and data collection procedures).
Diversity of Targets in the Literature
Here we show how the literature divides its effort among targets that are mentioned in the construct definition and/or are considered in the data creation process. In this analysis, we combine conceptualized and operationalized targets for the large sample of 341 papers.
Figure 11 shows the distribution of papers that describe targets explicitly over time. Overall, 64% (314) of all datasets mention at least one specific target group, contrasted to the other 36% (178) of datasets that explore hateful content as a general phenomenon without covering any identifiable target groups.
The most common target identities are race (38% (188) datasets), gender (35% (172) datasets), religion (32% (158) datasets), sexuality (22% (110) datasets), and political affiliation (19% (92) datasets). In addition, 21% (103) of all datasets contain targets outside our predefined categories, which we label "other"; examples of other targets include celebrities (Lu et al. 2020).
When looking at the diversity of targets that are represented in different languages (see Figure 10), we see differences between languages that are also in line with the differences in the conceptualizations that we discussed before (see Figure 9). For example, religious hatred, particularly Islamophobia, is a major focus in Muslim-majority countries where Arabic, Turkish, Bengali, and Urdu are spoken (Turkey, Bangladesh, Pakistan, etc.), while gender is the most frequent target in Spanish, French, and Hindi-English datasets. While those variations in the prominence of targets may be a reflection of specific cultural and political factors in countries where these languages are spoken, it is important to consider those differences when datasets are used for training multilingual hate detection systems.
Summary
Datasets are increasingly specific about which targets they aim to include by applying more refined and targeted sampling strategies, and diverse in the range of targets they cover. However, datasets rarely cover certain target identities such as age, body image, and organizational/institutional affiliation. Differences in which targets are represented in different languages are very pronounced and can hinder the interoperability of datasets.
Q3: Which identities are included as targets in hateful communication datasets, even if not explicitly mentioned in the literature?
Next, we assess how the targets that are described in the literature are empirically included in the datasets themselves. First, we report on the diversity of dataset creation strategies observed in the literature, briefly discussing their potential impact on the composition of resulting datasets. Second, we aim to identify, if any, the discrepancy between conceptualized, operationalized, and detected targets. For this in-depth analysis, we focus on a convenience sample of 15 widely-cited and easily accessible English-language datasets. Three computational social science (CSS) researchers independently annotated conceptualized and operationalized targets in the 15 publications, discussing differences until consensus. Then, we computationally analyze the datasets to find detected targets, which may be present regardless of whether they were conceptualized or operationalized in the accompanying publications.
We first give an overview of the different dataset creation strategies found in the literature, then compare conceptualized and operationalized targets, and finally compare them in aggregate against detected targets.
Diversity and Impact of Dataset Creation Strategies
For 34.6% of datasets (170 datasets from 131 publications), the procedure to collect the dataset is explicitly mentioned. The most common strategy is full or partial random sampling (34% (57) datasets), followed by the use of a specifically developed lexicon, corpus, dictionary, or otherwise assembled list of topically relevant terms and keywords (31% (52) datasets). Other than that, the focus on specific languages (8% (14) datasets) and the use of classifiers (5% (9) datasets) are additional dataset creation strategies in active use.
There is a direct link between the dataset creation strategy used and the composition of the resulting dataset. Datasets created using fully or partially random sampling are expected to cover a broad range of targets (e.g., the dataset by Wulczyn et al. 2017), while datasets that result from term- and keyword-based lists naturally tend to more precisely capture the specific group of targets operationalized via the underlying list (e.g., the dataset by Waseem and Hovy 2016).
Conceptualized vs. Operationalized Targets
Table 1 reports the overall discrepancy between conceptualized and operationalized targets in the convenience sample datasets. The largest group of datasets (7 out of 15) conceptualize targets without explicitly labeling which targets are present in their datasets. Upon close inspection, we find that targets are often included in the annotation instructions, but the annotators' task is ultimately to label whether a message contains hateful communication or not in an untargeted, binary way. We speculate that this reflects a cost/benefit trade-off. Firstly, annotating targets can be expensive since it requires additional time and effort. Secondly, the downstream applications are often formulated as binary problems: for many benchmarks and shared tasks, models are expected to identify hateful communication and not necessarily its targets.
Two out of 15 datasets operationalize more targets than they conceptualize. We find this is due to post hoc analyses where the authors of the datasets decided to label data characteristics they found interesting. As an example, Bretschneider and Peters (2017) set out to study anti-immigrant hate, but after finding several discussions about politicians in the data, they included "politicians" as one of their operationalized targets. Three datasets neither conceptualize nor operationalize targets because they study phenomena like general abuse or toxicity. Only three datasets operationalize the exact targets they conceptualize. This finding is surprising and stresses the need for standardized reporting practices in the field.
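The dataset-level comparison behind these counts can be sketched as a comparison of two sets of target labels. This is a minimal illustration: the function name `classify_target_match`, the four category labels, and the toy datasets are hypothetical and do not reproduce our annotation scheme; dataset "C" is loosely modeled on the kind of post hoc addition described above.

```python
from collections import Counter


def classify_target_match(conceptualized, operationalized):
    """Compare the targets named in a construct definition with those
    used in sampling or annotation, and return a (hypothetical) label."""
    c, o = set(conceptualized), set(operationalized)
    if not c and not o:
        return "neither"
    if c == o:
        return "exact match"
    if o - c:
        # At least one operationalized target was never conceptualized.
        return "operationalized beyond conceptualization"
    # Remaining case: o is a proper subset of c (possibly empty).
    return "conceptualized beyond operationalization"


# Toy datasets: (conceptualized targets, operationalized targets).
toy_datasets = {
    "A": ({"gender"}, {"gender"}),
    "B": ({"race", "religion"}, set()),
    "C": ({"immigrants"}, {"immigrants", "politicians"}),
    "D": (set(), set()),
}
labels = {name: classify_target_match(c, o)
          for name, (c, o) in toy_datasets.items()}
summary = Counter(labels.values())
```

When a dataset both omits some conceptualized targets and adds unconceptualized ones, this sketch assigns it to the "operationalized beyond conceptualization" group; a finer-grained scheme could report both mismatches separately.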
Conceptualized/Operationalized vs. Detected Targets
Considering the inconsistencies around conceptualized and operationalized targets observed in our convenience sample, we combine them before comparing them against detected targets. In the following analyses, we focus on the proportion of the datasets that are labeled as containing hateful communication because those instances supposedly include conceptualized and operationalized targets.
Figure 13 shows the distribution of detected targets. Black frames denote conceptualized/operationalized targets. Each cell displays the proportion of instances containing each detected target category. Instances for which we could not detect any target are reported in the "untargeted" column.
Figure 13. Distribution of instances labeled as "hate" over the target categories per dataset. To the right of the heatmap, the share of instances labeled as "hate" is indicated for each dataset. The last row shows the distribution over the target categories for all datasets aggregated. Darker shades of green correspond to a higher share of targets in the respective target category. A red frame around a cell indicates that the target category has been either explicitly conceptualized or operationalized by the dataset creators in the corresponding publication. If the publication does not explicitly conceptualize or operationalize any targets, "untargeted" is highlighted.
Among those publications that explicitly conceptualize or operationalize specific targets, we find that the fraction for single categories of detected targets that were not part of these definitions ranges between 1% and 16%. Although relatively small, the presence of non-conceptualized targets is consequential, as it may impact the performance of classifiers trained on the data in terms of divergent validity. For example, a classifier claiming to detect hate toward women (conceptualized target: gender) may in practice detect opposition to left-wing ideology (detected target: political affiliation), which may be empirically correlated in the dataset; the application of such a classifier would have unintended consequences, such as censoring political views rather than preventing harm.
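The per-dataset shares of detected targets, and the shares belonging to targets that a publication never declared, can be sketched with a simple keyword dictionary. This is a minimal illustration under stated assumptions: the toy dictionary, the toy instances, and the function names `detect_targets` and `target_shares` are hypothetical, and whole-token matching is a simplification that stands in for the actual dictionary and matching rules.

```python
import re
from collections import Counter


def detect_targets(text, target_dictionary):
    """Return the target categories whose keywords occur as whole tokens."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {category for category, keywords in target_dictionary.items()
            if tokens & set(keywords)}


def target_shares(hateful_instances, target_dictionary):
    """Share of instances per detected category. Instances without any
    detected target count as 'untargeted'. Shares may sum to more than 1
    because one instance can mention several targets."""
    if not hateful_instances:
        raise ValueError("no instances given")
    counts = Counter()
    for text in hateful_instances:
        detected = detect_targets(text, target_dictionary)
        if detected:
            counts.update(detected)
        else:
            counts["untargeted"] += 1
    n = len(hateful_instances)
    return {category: count / n for category, count in counts.items()}


# Toy dictionary and instances; illustrative only.
toy_dictionary = {
    "gender": ["women", "woman"],
    "religion": ["muslims", "christians"],
    "political affiliation": ["liberals", "conservatives"],
}
toy_instances = [
    "those women are awful",
    "liberals are awful",
    "women and liberals are awful",
    "you are awful",
]
shares = target_shares(toy_instances, toy_dictionary)

# Targets declared (conceptualized or operationalized) by the toy publication.
declared = {"gender"}
undeclared = {category: share for category, share in shares.items()
              if category not in declared and category != "untargeted"}
```

In the toy input, half of the instances mention a political target that the publication never declared, which is the kind of mismatch that the cells outside the framed columns of Figure 13 make visible.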
While three papers in our sample did not declare targets in their conceptualization or operationalization, we find that empirically, the corresponding datasets cover a wide range of detected targets in varying proportions (1% to 33% of the instances). This unequal distribution may negatively impact reliability in detecting hateful communication toward underrepresented targets. Note that the issue of distribution among different targets is also present in papers that conceptualize and operationalize targets. We provide suggestions for researchers on how to critically address this issue in the discussion section.
Finally, we identify an overarching pattern. Works that explicitly operationalize the targets they conceptualized (Jha and Mamidi 2017; Waseem et al. 2017; Ousidhoum et al. 2019) also include a higher fraction of matching detected targets; in other words, the datasets are empirically good fits with what was documented in the paper and annotated by labelers. In contrast, targets that are conceptualized but not operationalized are also frequently underrepresented as targets in the datasets, or missing altogether (Gao and Huang 2017; Mandl et al. 2019; Qian et al. 2019; Founta et al. 2018; Basile et al. 2019; Zampieri et al. 2019). We speculate that including targets in the operationalization phase prompts authors and labelers to be aware of them throughout the whole dataset creation process. Hence, what benefits from such a practice is not just the quality of documentation, but also the quality of the dataset itself.
Summary
Conceptualized and operationalized targets match in only 20% of datasets in our sample. For datasets that conceptualize or operationalize targets, up to 16% of their instances contain targets that were neither conceptualized nor operationalized, which may make classifiers perform unpredictably on such targets. Datasets that do not declare targets at all cover targets unequally, which may impact the accuracy of classifiers on underrepresented targets. It is crucial to underscore that our findings are derived from a convenience subset of 15 datasets. While these datasets hold significant prominence within the domain of hateful communication, being frequently referenced sources, future research is needed to assess if the observed patterns persist in larger samples.

The State of Hateful Communication Datasets
This review proposed a broader outlook on the quality of hateful communication datasets, the identities of the targets of hate, as well as the linguistic diversity and backgrounds of the researchers involved in the processes of data collection, annotation, and curation. In the context of the computational study of hateful communication, we unpacked how the identity and diversity of the targets included in the research not only depend on the identities of the data annotators but also on the wider research context in which datasets are created. We found that the production of hateful communication datasets is concentrated in selected locations: researchers with affiliations in the U.S. contribute over one-fourth of the datasets. Similarly, almost two-thirds of the datasets are in English. This is in contrast with the situatedness of hateful communication research: languages other than English are mainly covered by researchers located in native-speaking countries, and the constructs that they study differ depending on such languages. This arguably reflects the need for deep knowledge about the context where hateful communication takes place.
Furthermore, it is important to discuss the potential causes and effects of this apparent conflict between an established U.S.- and English-centric mainstream, and the need for contextualization in the specific domain of study. On the one hand, the relative homogeneity of datasets casts doubt on their suitability to train machine learning models that capture context-specific aspects of hateful communication. Promisingly, we find that the production of hateful communication datasets is not only growing in output but also diversifying in who is represented in it, in terms of both the researchers that produce it and the identities that are included in the data.
On the other hand, multiple factors may slow down the diversification of this field. We argue that researchers may be incentivized to produce datasets in English both explicitly and implicitly. Producing datasets is costly: obtaining reliable annotations for a wide range of targets and at scale requires substantial investment. Moreover, curating datasets in new languages may require developing specialized resources, contrasted with highly available tools and techniques to sample and process data in English. Therefore, it is unsurprising that new datasets rarely introduce new languages, and that the Global North is among the largest producers of datasets, which raises concerns about the ability of this field to avoid reproducing inequality. Furthermore, curating datasets in English may widen the user base for the datasets and, consequently, increase the visibility of the research.
In this light, institutionalized incentives for promoting local impact are essential to sustain the diversification of the field, especially through the inclusion of a broader range of researchers.

Implications for the Users of Hateful Communication Datasets
Although research is broadening its attention towards a diverse range of targets of hateful communication, we found that some identities (such as class, disability, and age) receive less attention. Moreover, a sizable fraction of the publications to date do not specify which targets are included in their datasets, and for the publications that do, there is a discrepancy between the targets documented in the publications and those effectively present in the datasets. These findings can be problematic for the users of hateful communication datasets, especially those who train machine learning models on such datasets to detect hateful communication in different application domains.
Lack of representation, under-representation, and undocumented representation of targets all make machine learning models unable to perform accurately and predictably. Determining an ideal ratio of targets in a dataset may not be practical or possible. However, for research to have a positive real-world impact, the dataset should precisely represent the targets it aims to capture. Future research could explore participatory approaches (Maronikolakis et al. 2022), data documentation (Miceli et al. 2021), and theory-informed and target-aware data collection procedures (Li and Caragea 2021; Samory et al. 2021; Uyheng et al. 2022) as promising avenues to overcome and document the mismatch between the conceptualized, operationalized, and detected targets.

Implications for the Creators of Future Hateful Communication Datasets
We highlighted a gap in how present-day datasets cover the variety of hateful communication targets. In particular, the field is moving towards including multiple targets in each dataset, which enables sophisticated computational modelling and robust evaluation. Thus, there are ample opportunities for the creation of novel datasets that better serve minority identities. Furthermore, we find a shortage of studies focusing on intersectionality: while some papers do explicitly operationalize intersectional targets (Waseem et al. 2017; Vidgen and Derczynski 2020), they are a small portion of datasets studied in this work.
Yet, we also identified avenues for improving data practices, to sustain high-quality standards in this socially relevant field. We stress the importance of clearly including targets in all phases of the dataset creation process, starting with the conceptualization of the construct that the dataset aims to capture (whereas several papers to date omit this crucial information). Targets should also be included in the operationalization phase of the data curation process, e.g., in sampling strategies, annotation instructions, and annotation labels if possible (whereas almost half of the conceptualized targets were not operationalized in the literature). Finally, authors should take steps to identify targets in their collected dataset that they did not explicitly intend to include, and be upfront about their treatment of such cases (whereas we found that almost all papers in our sample had targets included in their dataset that were not described in the paper). Overall, we surfaced the need for better conceptualization, operationalization, and documentation practices around targets. To this end, we believe it is a fruitful avenue of research to develop tools and standard procedures to aid the fine-grained documentation of targets.

Future Work and Limitations
This review proposes ways forward to improve the quality of datasets in future studies. First, instituting benchmarks, measures, and shared tasks to empirically evaluate dataset generalizability across contexts and targets may promote critical approaches to data quality. Second, research on the science of computational social science should arguably aim for improved measures of data quality, to establish causal links between the quality of datasets and the practices, views, and characteristics of those involved in the creation. Finally, we argue that higher reflexivity in this research field may be beneficial, such as via positionality statements in papers and datasets.
The study presented has some limitations that must be considered when interpreting its results. Although we attempted to cover most of the existing literature and datasets on hateful content, we made some restrictive assumptions that may have resulted in missing works. Firstly, we only examined three academic databases, which may not have included all relevant publications. Secondly, we only included English publications, meaning that language-specific conferences were excluded, potentially leading to the exclusion of relevant research. Lastly, we excluded all papers that did not mention whether a newly created dataset was introduced based only on the information provided in the abstract. This may have led to an increased number of false-negative decisions.
Furthermore, we discovered during the annotation process that while it is relatively easy to decide if a paper talks about targets or not, it is often non-trivial to identify the conceptualization of targets and differentiate it from the operationalization. For example, in Warner and Hirschberg (2012), hate speech and potential targets are mentioned in several sections of the paper, such as the introduction and related work. Therefore, it is not easy for annotators of the literature review to determine the target conceptualization employed in the context of the respective work. On the other hand, based on the annotator instructions, it is clear that targets are explicitly annotated as anti-semitic, anti-black, anti-Asian, anti-woman, anti-muslim, anti-immigrant, or other-hate (Warner and Hirschberg 2012). To avoid building our analysis on annotations where we know that annotators had a hard time and also in part disagreed, we decided to merge the conceptualized and operationalized target annotations for the large sample. For the small sample, we provide an in-depth discussion of the differences together with more detailed and time-consuming annotations.
Also, our dictionary-based approach to identifying targets is limited, and prior work suggests that the Hatebase lexicon includes terms that are generally not used in a hateful context (Davidson et al. 2017). Our small validation study does not assess the recall of the dictionary but shows that the precision is acceptable (0.68). We apply our dictionary only to the small sample and acknowledge that it would have been ideal to conduct this analysis on the complete body of datasets. However, we are confident that the subset provided by Risch et al. (2021) is an adequate representation.
Finally, our study of targets is abstracted at a demographic identity level, focusing on broad categories of race, gender, religion, etc. We do not distinguish between finer-grained identities within these, e.g., we do not differentiate Islamophobia from anti-Christian rhetoric. Future research can build on our work to specifically measure the representation of persecuted minority groups within these categories.

Further Details on the Search and Selection Procedures
Search Procedure
The queries used for retrieving publications from the different literature search engines are provided in Table 2.
After removing duplicates from the list of downloaded publications from the three search engines, we compared the resulting papers with dataset papers included in hatespeechdata.com, which is a dedicated website for cataloguing datasets annotated for hate speech, online abuse, and offensive language, with a list of datasets and keywords provided (Vidgen and Derczynski 2020). Almost all publications from hatespeechdata.com were included in our sample. We manually added the three missing publications to our sample (see Figure 1). Consequently, our final sample is a superset of hatespeechdata.com and is 259% larger than the hatespeechdata.com repository.
Selection Procedure
All publications in our final sample were manually screened by three experienced scholars with backgrounds in CSS and processed according to the inclusion and exclusion criteria described below. The reviewer training was separated into two parts, one for the title and abstract screening procedure, and the other for the full-text annotation procedure. Paper screening procedures were piloted among the experts before each round (see Figure 1), and results were reviewed by the other authors independently for a sufficient inter-screener agreement level (Cohen's kappa > 0.8).
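For reference, Cohen's kappa for two screeners can be computed as sketched below. This is a minimal illustration using the standard definition (observed agreement corrected for chance agreement); the function name `cohens_kappa` and the toy screening decisions are hypothetical and do not reproduce our screening data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy screening decisions for ten papers; illustrative only.
screener_1 = ["include"] * 5 + ["exclude"] * 5
screener_2 = ["include"] * 4 + ["exclude"] * 6
kappa = cohens_kappa(screener_1, screener_2)
```

In the toy input the screeners agree on nine of ten papers, yet kappa is lower than the raw agreement because part of that agreement is expected by chance.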
Discrepancies were discussed among the experts, and the conflicts left were discussed with other authors altogether to reach a mutual agreement.
The first round of the screening was conducted by reviewing the titles of the publications to determine whether they met the following three inclusion and exclusion criteria:
(i) Topical Relevance: The study should cover research on hateful online communication (e.g., explicit hate speech towards religion, as well as subtle and implicit forms of harm such as benevolent sexism, among others). Studies on misinformation, spam, fake news, or deception should be excluded.
(ii) Dataset Novelty: At least one dataset should be newly developed or adapted for the study, consisting of text data or multimodal data where text is included, and the data needs to be collected from online sources. Papers using public datasets without any adaptations (e.g., new annotations or other expansions) should be excluded.
(iii) Computational Application: Since the review focuses on datasets that can be used to develop computational methods to detect hateful communication online, we included only papers that mention computational methods such as specific machine learning or natural language processing methods. We exclude articles that focus only on manual data analysis of small samples of online data or data from other social science methods such as survey data, experiments, content analysis, or qualitative analysis.

Figure 1. Details of literature search and screening process using the PRISMA flow diagram. Notes: DL stands for Digital Library; I&E stands for Inclusion and Exclusion.

Figure 2. The three types of targets studied in this work and the potential mismatches between them. We introduce a two-tier categorization of targets. First, we distinguish between conceptualized targets (i.e., those who are included in the explicit definition of hateful communication as a construct chosen by the researcher) and operationalized targets (i.e., those who are operationalized in the sampling, annotation and/or analysis of the dataset). Moreover, while conceptualized and operationalized targets are explicitly accounted for and typically described in the paper, the corresponding dataset may include other targets that are not: we call the latter detected targets. The figure depicts a mismatch between these three types of targets: the researcher has chosen a very broad conceptualization of hateful online communication encompassing race, gender, and religion, but a narrow operationalization, which only aims to capture hate towards gender identities in the dataset; yet, ultimately, the final dataset may also include targets that were part of neither the conceptualization nor the operationalization, such as identities based on political ideology.

Figure 3. Conceptualized and operationalized targets by year along with the distribution of datasets. Targeted refers to datasets that have explicitly mentioned at least one target in their construct definition (i.e., in the conceptualization phase) and/or publications in which the authors define concrete measures to ensure and validate the presence of at least one target group in the data (i.e., in the operationalization phase). Untargeted refers to all other datasets that do not meet these two criteria. Note that the data for 2022 is only partially available (as described in our Methods and Data Section).

Figure 4. Single vs. multiple platforms as data sources over time. While most of the datasets are collected from a single source, since around 2018 researchers have increasingly collected data from multiple sources (i.e., two or more).

Figure 5. Geographic distribution of the affiliations of researchers who contributed datasets. Researchers affiliated with institutions located in the U.S. published the most datasets, followed by researchers from institutions in India and the United Kingdom.
(a) Distribution of datasets over time for the five most frequent researcher locations and all other locations combined. Researchers from the US publish the most consistently, while other countries began producing more datasets since 2016.
(b) Distribution of datasets over time for the five most studied and all other languages combined. English has received most of the attention, which has risen over time, while research on other languages became prominent after 2017.

Figure 6. Linguistic and geographic trends over time in harmful language research. Notably, the number of datasets for the non-top five languages continues to be lower than all English ones, while since 2017 we see more and more datasets in languages other than English.

Figure 8. Hateful datasets by language and the researchers' location. We show the distribution of the top 16 language datasets and the 40 most common locations of the researchers who created these datasets. The frequency of datasets is log-scaled to reduce the dominant effect of English language datasets. We see that research in English is widespread across many geographic regions, while researchers from the US and UK contribute to research in a variety of languages. Spanish and Arabic are also researched in multiple countries, reflecting the spread of their worldwide speakers. On the other hand, research in other languages is concentrated in countries or locations where they are most widely spoken, e.g., publications with Indonesian and Russian datasets originate from Indonesia and Russia, respectively.

[Figure 9 word clouds; panel titles: English, Arabic, Spanish]

Figure 14. The activity diagram of literature survey and annotation with an example.

Distribution of conceptualized or operationalized targets across languages.
The most popular target categories are race, gender, and religion. Race is the most frequent target studied in English, German, Portuguese, and Italian datasets. In contrast, gender-based abuse is widely studied in Spanish, French, and Hindi-English code-mixed datasets, and religion is the most frequent target for Arabic, Turkish, and Indonesian, and one of the main targets in Bengali and Urdu. Other target attributes like class, disability, and age are rare.
researchers conceptualize the construct. However, we see an increase in multilingual datasets over time, which is a positive indicator for the dataset quality, since it signals increases in the diversity and coverage of the dataset. Before 2018, U.S.-based researchers led the production of hateful communication datasets, which are predominantly in English. Since, with a growing number of researchers from

and students (Del Bosque and Garza 2014). Although certain

Figure 12. Yearly proportion of targets in datasets over time based on our literature review. There has been an increase in attention towards targets beyond race, gender, and religion, especially after 2017.

Table 1. Confusion matrix, showing the (mis-)match between conceptualizations and operationalizations of targets. The (mis-)matches are analyzed on a dataset level based on our convenience sample.

Table 3. Target dictionary. Target categories are presented with their respective keywords.