Ethical Cracks in the Generative AI Edifice

April 8, 2024

As the Center discussed in its report, Safeguarding AI, the technology behind attention-grabbing apps like OpenAI’s ChatGPT also brings with it troubling ethical and practical questions: How do developers “train” their systems so that they provide eerily human-like responses to simple natural-language prompts? Why do those systems show signs of bias? What causes generative AI to make up falsehoods, an unexplained tendency known in the industry as “hallucination”?

Some answers are beginning to come into focus. A startling in-depth dispatch in The New York Times has revealed that developers like OpenAI, Google, and Meta cut ethical corners in gathering mountainous supplies of internet data to train their AI models. These measures included Google’s transcribing of more than a million hours of YouTube videos to generate an additional supply of training data, a move the company made secretly and, the Times reported, in violation of its own rules.

The Times elaborated:

Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its A.I. products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I. industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.

These revelations undercut the companies’ contention that they are developing generative AI products in the public interest, but they reinforce declarations by industry leaders like OpenAI CEO Sam Altman that government must step in to regulate his company and its rivals.

I’ve viewed these calls for regulation skeptically, as a public-relations tactic rooted in industry confidence that Congress will not, in fact, pull itself together, overcome partisan differences, and extend systematic oversight to Silicon Valley. But whatever the motives behind the industry’s embrace of regulation, the revelations in the Times underscore the urgent need for transparency and oversight that, it is becoming clear, only government can provide.
