Feb 22, 2024 - Technology

New report: 60% of OpenAI model's responses contain plagiarism

[Illustration: Robot hands typing on a typewriter. Credit: Allie Carl/Axios]

A new report from plagiarism detector Copyleaks found that 60% of OpenAI's GPT-3.5 outputs contained some form of plagiarism.

Why it matters: Content creators from authors and songwriters to The New York Times are arguing in court that generative AI trained on copyrighted material ends up spitting out exact copies.

  • Copyleaks is an AI-based text analysis company that began selling plagiarism-detection tools to businesses and schools long before ChatGPT's arrival.
  • GPT-3.5 was the model powering ChatGPT when it debuted, but OpenAI has since moved on to the larger and more capable GPT-4.

Between the lines: Plagiarism takes many forms beyond simply cutting and pasting full sentences and paragraphs.

  • Copyleaks attempts to turn detecting plagiarism from "I know it when I see it" into an exact science.
  • The company uses a proprietary scoring method that aggregates the rate of identical text, minor changes, paraphrased text, and other factors and then assigns content a "similarity score."
  • Per the report, for GPT-3.5, "45.7% of all outputs contained identical text, 27.4% contained minor changes, and 46.5% had paraphrased text."
  • "A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original," per the report.
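To illustrate the scale: Copyleaks' scoring method is proprietary, so the sketch below is not the company's formula. It is a hypothetical simplification that assumes each word in an output is counted in at most one category (identical, minor changes, or paraphrased) and that the score is simply the share of words falling into any of them. The function name and parameters are invented for illustration.

```python
def similarity_score(total_words: int, identical: int,
                     minor_changes: int, paraphrased: int) -> float:
    """Hypothetical similarity score: percent of words that are not original.

    Assumption (not from the report): the three categories do not overlap,
    and every category carries equal weight.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if min(identical, minor_changes, paraphrased) < 0:
        raise ValueError("word counts must be non-negative")

    matched = identical + minor_changes + paraphrased
    if matched > total_words:
        raise ValueError("matched words cannot exceed total_words")

    # 0.0 means fully original; 100.0 means nothing is original.
    return 100.0 * matched / total_words


# A 400-word output with 60 identical, 20 lightly edited and 20 paraphrased words.
print(similarity_score(400, identical=60, minor_changes=20, paraphrased=20))
```

Under these assumptions, an output with no matched words scores 0% and one made up entirely of matched words scores 100%, mirroring the endpoints the report describes.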

Zoom in: Copyleaks asked GPT-3.5 for roughly a thousand outputs of about 400 words each, across 26 subjects.

  • The individual GPT-3.5 output with the highest similarity score was in computer science (100%), followed by physics (92%) and psychology (88%).
  • The lowest similarity scores appeared in theater (0.9%), humanities (2.8%) and English language (5.4%).

Yes, but: "Our models were designed and trained to learn concepts in order to help them solve new problems," OpenAI spokesperson Lindsey Held wrote in a statement to Axios. "We have measures in place to limit inadvertent memorization, and our terms of use prohibit the intentional use of our models to regurgitate content."

The intrigue: The New York Times lawsuit against Microsoft and OpenAI claims that their AI systems' "widescale copying" constitutes copyright infringement.

  • OpenAI responded to the lawsuit by arguing that "regurgitation" is a "rare bug" and accusing The New York Times of "manipulating prompts."

Editor's note: This story has been updated with comment from OpenAI.
