Oct 19, 2019 - Technology

Fighting hate with AI-powered retorts

Illustration: Aïda Amer & Eniola Odetunde/Axios

Scientists have long tried to use AI to automatically detect hate speech, which is a huge problem for social network users. And they're getting better at it, despite the difficulty of the task.

What's new: A project from UC Santa Barbara and Intel takes a big step further — it proposes a way to automate responses to online vitriol.

  • The researchers cite a widely held belief that counterspeech is a better antidote to hate than censorship.
  • Their ultimate vision is a bot that steps in when someone has crossed the line, reining them in and potentially sparing the target.

The big picture: Automated text generation is a buzzy frontier of the science of speech and language. In recent years, huge advances have elevated these programs from error-prone autocomplete tools to super-convincing — though sometimes still transparently robotic — authors.

How it works: To build a good hate speech detector, you need some actual hate speech. So the researchers turned to Reddit and Gab, two social networks with little to no policing and a reputation for rancor.

  • For maximum bile, they went straight for the "whiniest most low-key toxic subreddits," as curated by Vice. They grabbed about 5,000 conversations from those forums, plus 12,000 from Gab.
  • They passed the threads to workers on Amazon Mechanical Turk, a crowdsourcing platform, who were asked to identify hate speech in the conversations and write short interventions to defuse the hateful messages.
  • The researchers trained several kinds of AI text generators on these conversations and responses, priming them to write responses to toxic comments.
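
For readers who want a concrete picture, the data preparation described above might look roughly like the sketch below. It is an illustration only, not the researchers' code: the `Conversation` fields and the `build_training_pairs` helper are hypothetical names, and pairing each flagged post with each crowd-written reply is an assumption about how such a dataset could be fed to a text generator.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Conversation:
    """One annotated thread (hypothetical layout, not the paper's schema)."""
    posts: List[str]                                         # posts in thread order
    hate_indices: List[int] = field(default_factory=list)    # positions workers flagged
    interventions: List[str] = field(default_factory=list)   # replies workers wrote


def build_training_pairs(conversations: List[Conversation]) -> List[Tuple[str, str]]:
    """Turn annotated threads into (flagged text, human reply) training pairs."""
    pairs = []
    for conv in conversations:
        if not conv.hate_indices:
            continue  # nothing was flagged, so there is nothing to respond to
        flagged = " ".join(conv.posts[i] for i in conv.hate_indices)
        for reply in conv.interventions:
            pairs.append((flagged, reply))
    return pairs


# Placeholder text stands in for real posts.
threads = [
    Conversation(
        posts=["first post", "[flagged post]", "third post"],
        hate_indices=[1],
        interventions=["Please avoid name calling.", "Let's keep this civil."],
    ),
    Conversation(posts=["an unflagged thread"]),
]
pairs = build_training_pairs(threads)
```

Here the first thread yields two pairs, one per human-written reply, and the unflagged thread yields none. A generator would then be trained to produce the reply given the flagged text.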

The results: Some of the computer-generated responses could easily pass as human-written, such as "Use of the c-word is unacceptable in our discourse as it demeans and insults women" or "Please do not use derogatory language for intellectual disabilities."

  • But the replies were inconsistent, and some were incomprehensible: "If you don't agree with you, there's no need to resort to name calling."
  • When Mechanical Turk workers were asked to evaluate the output, they preferred human-written responses more than two-thirds of the time.

Our take: This project didn't test how effective the responses were at stemming hate speech, only how effective other people thought they might be.

  • Even the most rational, empathetic response, not to mention the somewhat robotic computer-generated ones above, could flop or even backfire — especially if Reddit trolls knew they were being policed by bots.

"We believe that bots will need to declare their identities to humans at the beginning," says William Wang, a UCSB computer scientist and paper co-author. "However, there is more research needed how exactly the intervention will happen in human-computer interaction."
