OpenAI’s recent unveiling of Sora has raised expectations, even though the company does not plan to make the model public for now. Instead, OpenAI has begun sharing it with independent safety testers. A key concern for the organization is the potential for misuse of realistic but fabricated videos. OpenAI aims to strengthen its safety measures before releasing the model to the general public, says Aditya Ramesh, the OpenAI scientist who created DALL-E, the company’s text-to-image model.
OpenAI does, however, plan an eventual product launch, and it is sharing the model with selected video creators and artists for feedback, with the aim of making Sora as versatile and useful as possible for creative professionals. According to Ramesh, another goal is to give a glimpse of what these models will be capable of in the future.
Sora was developed by adapting the technology behind DALL-E 3, the latest version of OpenAI’s popular text-to-image model. Like DALL-E 3, Sora uses a diffusion model, but it applies the approach to video rather than static images. Sora also adds a technique not found in DALL-E or most other generative video models: it combines the diffusion model with a kind of neural network called a transformer.
Transformers excel at processing long sequences of data, especially words, which is why they are the key ingredient inside large language models like OpenAI’s GPT-4 and Google DeepMind’s Gemini. Videos, however, are not made of words, so the researchers had to devise a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up across both space and time. “Basically, it’s like taking a stack of all video frames and extracting little cubes from it,” says Brooks.
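To make the “little cubes” idea concrete, here is a minimal sketch in PyTorch of how a stack of video frames might be diced into spacetime patches. The tensor shape and patch sizes are illustrative assumptions for demonstration, not Sora’s published configuration, and real systems typically work on compressed latent representations rather than raw pixels.

```python
import torch

# Illustrative sketch only: shapes and patch sizes are assumptions,
# not Sora's actual configuration.
video = torch.randn(16, 3, 256, 256)     # 16 RGB frames, each 256x256 pixels
t_patch, s_patch = 4, 32                 # temporal and spatial patch sizes

T, C, H, W = video.shape
# Slice the stack of frames into spacetime "cubes", then flatten each cube
# into a single token-like vector that a transformer can consume.
cubes = (
    video
    .reshape(T // t_patch, t_patch, C, H // s_patch, s_patch, W // s_patch, s_patch)
    .permute(0, 3, 5, 1, 2, 4, 6)        # group dimensions by cube position
    .reshape(-1, t_patch * C * s_patch * s_patch)
)
print(cubes.shape)                       # (256, 12288): 256 cubes, 12288 values each
```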
The transformer inside Sora can then process these chunks of video data in much the same way that the transformer inside a language model processes words in a block of text. This is what allowed the team to train Sora on many more kinds of video than existing models can handle, spanning different resolutions, durations, aspect ratios, and orientations, says Brooks.
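As a rough sketch of what that processing might look like, continuing the illustrative assumptions above, the flattened cubes can be projected into embeddings and fed to a generic transformer encoder. Because a transformer simply consumes a sequence of tokens, a longer or higher-resolution video just produces more cubes, which is one way to see why training on mixed resolutions and durations is possible.

```python
import torch
import torch.nn as nn

# Continuing the illustrative sketch: embed each spacetime cube and run a
# generic transformer encoder over the token sequence. All sizes are
# assumptions for demonstration, not Sora's actual design.
num_cubes, cube_features, width = 256, 12288, 512
cubes = torch.randn(1, num_cubes, cube_features)   # one video's worth of cubes

embed = nn.Linear(cube_features, width)            # cube vector -> model width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True),
    num_layers=2,
)

out = encoder(embed(cubes))                        # one output vector per cube
print(out.shape)                                   # torch.Size([1, 256, 512])
# A longer or higher-resolution clip would simply yield more cubes; the same
# transformer handles the longer sequence without architectural changes.
```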
Among the sample clips OpenAI has shared are videos of woolly mammoths crossing a snowy meadow and the hustle and bustle of a snow-covered Tokyo street (credit: OpenAI).
Sam Gregory, executive director of Witness, an organization that specializes in the use and misuse of video technology, praises the technical achievement. The model could be an artistic boon for people who tell stories through video, he notes, but also a potential avenue for misuse.