Microsoft AI researchers present (De)ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection

Although artificial intelligence offers many advantages, the technology also has drawbacks. One example is the generation of toxic language by language models: because these models are trained on massive amounts of data, they can pick up inappropriate language that is present in the training data.

Content moderation tools can be used to flag or filter such language. However, the datasets used to train these tools often fail to capture the complexity of potentially inappropriate and toxic language, especially hate speech. Moreover, the neutral examples in these datasets rarely mention minority groups, so the resulting tools may flag even neutral language that refers to a minority identity group as hate speech. Inspired by the ability of large language models to mimic the tone, style, and vocabulary of the prompts they receive, whether toxic or benign, the researchers set out to create a dataset for training content moderation algorithms that better detect implicitly harmful material.


Microsoft researchers collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups, then used a large-scale language model to scale up and guide the generation process. Their paper, "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection," presents the result: the largest publicly available implicit hate speech dataset, with 274,000 neutral and toxic statements. Although large Transformer-based language models do not directly encode semantic information, they capture the statistical relationships between words in different contexts. By experimenting with text generation using one of these huge language models, the researchers learned to use careful prompt engineering tactics to build ToxiGen's implicit hate speech dataset.
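The demonstration-based generation described above can be sketched as a simple few-shot prompt assembled from seed statements. This is an illustrative assumption, not the authors' actual prompt format: the function name, list layout, and example statements are invented for this sketch.

```python
# Sketch of demonstration-based prompting (assumption: the exact prompt
# format used by the ToxiGen authors may differ).
def build_prompt(demonstrations, group):
    """Assemble a few-shot prompt from seed statements about a group.

    A large language model completing this prompt tends to mimic the
    tone, style, and vocabulary of the demonstrations, whether toxic
    or benign, yielding new statements of the same kind at scale.
    """
    lines = [f"- {s}" for s in demonstrations]
    lines.append("- ")  # the model continues generating from here
    return f"Statements about {group}:\n" + "\n".join(lines)

# Hypothetical neutral seed statements with a group mention.
neutral_demos = [
    "many people in this community are bilingual",
    "the community has a rich culinary tradition",
]
prompt = build_prompt(neutral_demos, "immigrants")
# `prompt` would then be sent to a large language model for completion.
```

Seeding the prompt with either neutral or implicitly toxic demonstrations is what lets the same mechanism produce both halves of the dataset.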

While demonstration-based prompting helps with large-scale data generation, it does not produce data designed explicitly to challenge a specific content moderation tool or content classifier. The ToxiGen dataset therefore includes data from both demonstration-based prompting and the authors' proposed adversarial decoding technique.
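The core idea behind adversarial decoding can be illustrated with a toy sketch: at each decoding step, combine the language model's next-token scores with a target classifier's toxicity score, rewarding continuations the classifier rates as benign, so generation drifts toward text that evades detection. The real ALICE method uses constrained beam search with a full language model and classifier; the stub functions below are invented stand-ins for illustration only.

```python
import math

def toy_lm_logprobs(prefix, vocab):
    """Stub language model: uniform log-probabilities over the vocabulary."""
    return {tok: math.log(1.0 / len(vocab)) for tok in vocab}

def toy_classifier_toxicity(text):
    """Stub content classifier that only keys on an explicit keyword,
    missing implicit phrasings -- the weakness adversarial decoding probes."""
    return 0.9 if "hate" in text else 0.1

def adversarial_step(prefix, vocab, weight=2.0):
    """Pick the next token by combining LM score and classifier score.

    Rewarding a LOW toxicity score steers generation toward wordings
    the classifier judges benign, surfacing its blind spots.
    """
    lm = toy_lm_logprobs(prefix, vocab)
    scores = {}
    for tok in vocab:
        candidate = prefix + " " + tok
        tox = toy_classifier_toxicity(candidate)
        scores[tok] = lm[tok] + weight * math.log(1.0 - tox)
    return max(scores, key=scores.get)

vocab = ["hate", "dislike", "people"]
next_tok = adversarial_step("they really", vocab)
# The explicitly flagged token is avoided in favor of an implicit one.
```

Generations that the classifier mislabels under this pressure become exactly the hard examples worth adding back into its training data.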

The repository provides all the components needed to create the ToxiGen dataset, which contains implicitly toxic and neutral sentences about 13 minority groups. It includes ALICE (or (De)ToxiGen), a tool for stress testing and iteratively improving a given off-the-shelf content moderation system across different minority groups.
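One way such stress testing reveals the failure mode discussed earlier is by measuring a moderation system's false-positive rate on neutral sentences that mention minority groups. The sketch below is a hypothetical illustration: `naive_moderator` is an invented stand-in for a weak classifier, not part of the ToxiGen tooling.

```python
def naive_moderator(text, flagged_groups=("immigrants", "muslims")):
    """Toy moderation system that wrongly keys on group mentions alone."""
    return any(g in text.lower() for g in flagged_groups)

# Hypothetical neutral statements, the first two with group mentions.
neutral_samples = [
    "Immigrants often run successful small businesses.",
    "Many Muslims observe Ramadan with their families.",
    "The weather was pleasant all week.",
]

# Both neutral group mentions get flagged: a 100% false-positive rate
# on exactly the kind of sentence ToxiGen's neutral split contains.
false_positives = sum(naive_moderator(s) for s in neutral_samples[:2])
fp_rate = false_positives / 2
```

A high false-positive rate on such samples signals that the classifier has learned to associate group mentions, rather than actual toxicity, with hate speech.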

As with many technologies, the solutions we create to make them stronger, safer, and less biased can be used in unintended ways. Although the methods described here can generate inappropriate or harmful language, they are far more useful for combating such language. They can improve content moderation tools that, with human guidance, support fairer, safer, more reliable, and more inclusive AI systems.

This article is based on the research paper "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection". All credit for this research goes to the researchers of this project. Check out the paper, Microsoft blog, and GitHub code.


James G. Williams