The growing popularity of social networks has raised concern that these platforms encourage violence or incite offense toward other people. Over the past several years, researchers have worked on detecting offensive language in social media posts, but most studies focus on English. As artificial intelligence and machine learning tools become more widely used, dataset labeling has also become a fundamental step in training high-quality models. Crowdsourcing platforms offer an efficient way to label data, since they draw on people who are sufficiently knowledgeable about the topic. In this paper, we introduce PerGOLD, a new Persian General Offensive Language Dataset, built with an event-based data collection methodology for detecting offensive language on Persian Twitter. To obtain labeled training data, we built a crowdsourcing platform that collects human annotations. We labeled 13,716 tweets, of which 34% were labeled as offensive. Finally, we evaluated the usefulness of the dataset by applying classic machine learning models (LR, SVM) and transformer-based language models (RoBERTa, ParsBERT). The best model (ParsBERT) obtained an F1-score of 85.4%.
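For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score can be computed for a binary offensive/non-offensive task. The labels are toy values invented for illustration, not PerGOLD data, and the abstract does not state whether the reported 85.4% is a per-class, macro, or weighted F1; the helper `f1_score` below is a hypothetical function that computes F1 for a single positive class.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (assumed encoding): 1 = offensive, 0 = not offensive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = f1_score(y_true, y_pred)
print(f"F1 (offensive class): {score:.3f}")
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75.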
Jafarinejad, F., Rahimi, M., Khodabakhsh, M. and Karimi, S. (2025). PerGOLD: Identification of offensive language in Persian tweets: leveraging crowdsourcing. Computer and Knowledge Engineering, 8(1), 35-42. doi: 10.22067/cke.2025.90088.1132