
Illustration: Aïda Amer & Eniola Odetunde/Axios
Scientists have long tried to use AI to automatically detect hate speech, which is a huge problem for social network users. And they're getting better at it, despite the difficulty of the task.
What's new: A project from UC Santa Barbara and Intel takes a big step further — it proposes a way to automate responses to online vitriol.
- The researchers cite a widely held belief that counterspeech is a better antidote to hate than censorship.
- Their ultimate vision is a bot that steps in when someone has crossed the line, reining them in and potentially sparing the target.
The big picture: Automated text generation is a buzzy frontier of the science of speech and language. In recent years, huge advances have elevated these programs from error-prone autocomplete tools to super-convincing — though sometimes still transparently robotic — authors.
- I wrote earlier this year about the potential for harm from convincing bot-generated text. It would be easy to train an AI writer to mimic hate speech, for example.
- This project shows how the technology could instead be used for good.
How it works: To build a good hate speech detector, you need some actual hate speech. So the researchers turned to Reddit and Gab, two social networks with little to no policing and a reputation for rancor.
- For maximum bile, they went straight for the "whiniest, most low-key toxic subreddits," as curated by Vice. They grabbed about 5,000 conversations from those forums, plus 12,000 from Gab.
- They passed the threads to workers on Amazon Mechanical Turk, a crowdsourcing platform, who were asked to identify hate speech in the conversations and write short interventions to defuse the hateful messages.
- The researchers trained several kinds of AI text generators on these conversations and responses, priming them to write their own responses to toxic comments; a rough sketch of that training step follows.
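The paper's exact training code isn't reproduced here, but the basic recipe it describes (pair each flagged comment with a crowd-written intervention, then fine-tune a text generator on those pairs) can be sketched in a few lines of Python. The model choice (t5-small), the file name and the field names below are illustrative assumptions, not the researchers' actual setup.

```python
# Hypothetical sketch: fine-tuning a small seq2seq model to map toxic comments
# to de-escalating responses. Model, hyperparameters and data format are
# illustrative assumptions, not the paper's actual configuration.
import json

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Assumed input: one JSON object per line, e.g.
# {"comment": "<toxic message>", "intervention": "<crowd-written response>"}
def load_pairs(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def collate(batch):
    inputs = tokenizer(
        ["respond: " + ex["comment"] for ex in batch],
        padding=True, truncation=True, max_length=256, return_tensors="pt",
    )
    targets = tokenizer(
        [ex["intervention"] for ex in batch],
        padding=True, truncation=True, max_length=64, return_tensors="pt",
    )
    labels = targets["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return inputs, labels

pairs = load_pairs("interventions.jsonl")  # hypothetical file name
loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(3):
    for inputs, labels in loader:
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After training, the model can draft a response to a new toxic comment.
model.eval()
prompt = tokenizer("respond: <some toxic comment>", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```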
The results: Some of the computer-generated responses could easily pass as human written — like, "Use of the c-word is unacceptable in our discourse as it demeans and insults women" or "Please do not use derogatory language for intellectual disabilities."
- But the replies were inconsistent, and some were incomprehensible: "If you don't agree with you, there's no need to resort to name calling."
- When Mechanical Turk workers were asked to evaluate the output, they preferred human-written responses more than two-thirds of the time.
Our take: This project didn't test how effective the responses were in stemming hate speech — just how successful other people thought they might be.
- Even the most rational, empathetic response, not to mention the somewhat robotic computer-generated ones above, could flop or even backfire — especially if Reddit trolls knew they were being policed by bots.
"We believe that bots will need to declare their identities to humans at the beginning," says William Wang, a UCSB computer scientist and paper co-author. "However, there is more research needed how exactly the intervention will happen in human-computer interaction."