Feb 22, 2024 - Technology

New report: 60% of OpenAI model's responses contain plagiarism

Megan Morrone

Illustration of robot hands typing on a typewriter. Credit: Allie Carl/Axios

A new report from plagiarism detector Copyleaks found that 60% of OpenAI's GPT-3.5 outputs contained some form of plagiarism.

Why it matters: Content creators from authors and songwriters to The New York Times are arguing in court that generative AI trained on copyrighted material ends up spitting out exact copies.

Copyleaks is an AI-based text analysis company that began selling plagiarism-detection tools to businesses and schools long before ChatGPT's arrival.
GPT-3.5 was the model powering ChatGPT when it debuted, but OpenAI has since moved on to the bigger and more capable GPT-4.

Between the lines: Plagiarism takes many forms beyond simply cutting and pasting full sentences and paragraphs.

Copyleaks attempts to turn detecting plagiarism from "I know it when I see it" into an exact science.
The company uses a proprietary scoring method that aggregates the rate of identical text, minor changes, paraphrased text, and other factors and then assigns content a "similarity score."
Per the report, for GPT-3.5, "45.7% of all outputs contained identical text, 27.4% contained minor changes, and 46.5% had paraphrased text."
"A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original," per the report.
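How the numbers fit together: The three category rates add up to more than 100% because a single output can fall into several categories at once, which is why they sit alongside the 60% headline figure rather than summing to it. The Python sketch below illustrates that overlap with made-up data. It is not Copyleaks' proprietary scoring method; the `OutputFlags` class, the `category_rates` function and the five sample outputs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutputFlags:
    """Hypothetical per-output flags; not Copyleaks' actual data model."""
    identical: bool
    minor_changes: bool
    paraphrased: bool


def category_rates(outputs):
    """Return the percentage of outputs flagged in each category, plus any category."""
    n = len(outputs)
    if n == 0:
        raise ValueError("need at least one output")
    return {
        "identical": 100 * sum(o.identical for o in outputs) / n,
        "minor_changes": 100 * sum(o.minor_changes for o in outputs) / n,
        "paraphrased": 100 * sum(o.paraphrased for o in outputs) / n,
        # An output counts once here no matter how many categories it hits.
        "any": 100 * sum(
            o.identical or o.minor_changes or o.paraphrased for o in outputs
        ) / n,
    }


# Made-up sample of five outputs, for illustration only.
sample = [
    OutputFlags(identical=True, minor_changes=True, paraphrased=False),
    OutputFlags(identical=True, minor_changes=False, paraphrased=True),
    OutputFlags(identical=False, minor_changes=False, paraphrased=True),
    OutputFlags(identical=False, minor_changes=False, paraphrased=False),
    OutputFlags(identical=False, minor_changes=False, paraphrased=False),
]

rates = category_rates(sample)
print(rates)
```

In this toy sample the per-category rates are 40%, 20% and 40%, yet only 60% of outputs are flagged at all, because two outputs are counted in more than one category.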

Zoom in: Copyleaks asked GPT-3.5 for around a thousand outputs, each around 400 words, across 26 subjects.

The individual GPT-3.5 output with the highest similarity score was in computer science (100%), followed by physics (92%) and psychology (88%).

The lowest similarity scores appeared in theater (0.9%), humanities (2.8%) and English language (5.4%).

Yes, but: "Our models were designed and trained to learn concepts in order to help them solve new problems," OpenAI spokesperson Lindsey Held wrote in a statement to Axios. "We have measures in place to limit inadvertent memorization, and our terms of use prohibit the intentional use of our models to regurgitate content."

The intrigue: The New York Times lawsuit against Microsoft and OpenAI claims that their AI systems' "widescale copying" constitutes copyright infringement.

OpenAI responded to the lawsuit by arguing that "regurgitation" is a "rare bug" and accusing The New York Times of "manipulating prompts."

Editor's note: This story has been updated with comment from OpenAI.
