OpenAI’s Mira Murati is “not sure” where Sora’s training data comes from

Mira Murati, OpenAI’s chief technology officer, says she is unsure where the training data for Sora, the company’s upcoming video-generating artificial intelligence model, comes from.

During an interview with The Wall Street Journal published on March 13, Murati offered vague responses when asked about the source of data for the company’s Sora model, which is capable of generating videos from text instructions.

“We used publicly available data and licensed data,” Murati said when asked how the company, valued at $80 billion, trained its upcoming model.

Joanna Stern, from the Journal, then asked whether Sora was trained with data from social media platforms, such as YouTube, Instagram, or Facebook. “I’m actually not sure about that,” Murati replied, adding:

“You know, if they were publicly available — publicly available to use. But I’m not sure. I’m not confident about it.”

Before moving to another topic, Stern mentioned OpenAI’s partnership with stock image company Shutterstock, asking if its data could be used to train Sora. “I’m just not going to go into detail about the data that was used. But it was publicly available or licensed data,” Murati added. Later, she confirmed to the Journal that Shutterstock data was used for Sora.

AI models are trained using large sets of data, known as training data sets, which help the model learn to recognize patterns, make predictions, or understand language.

OpenAI CTO Mira Murati during an interview with The Wall Street Journal. Source: WSJ

Murati has been at OpenAI since 2018, leading some of the company’s most popular projects, including the image-generator model DALL-E 3, the speech-recognition tool Whisper and GPT-4, the latest model behind the company’s chatbot, ChatGPT. In November 2023, she briefly took over as interim CEO after OpenAI’s board ousted Sam Altman.

OpenAI has been targeted by several legal actions involving its AI models’ training data. In July 2023, authors Sarah Silverman, Richard Kadrey, and Christopher Golden filed a lawsuit against the company, alleging that ChatGPT generates summaries of the authors’ works based on copyrighted content.

In December 2023, The New York Times sued Microsoft and OpenAI in a similar copyright infringement complaint, alleging that the companies used the newspaper’s content to train AI chatbots. A separate class-action lawsuit filed in California alleges that OpenAI scraped private user information from the internet to train ChatGPT without user consent.
