How fake or how real is the growing stream of artificial intelligence (AI)-produced video?
Turns out, there’s a quantitative measure for that — or, almost. Humans still need to decide, based on their human perception, if a video is good or not.
Mark Zuckerberg, chief executive of Meta Platforms, announced on Friday a new AI model called Movie Gen that can generate HD videos (1080p resolution) from a text prompt. The firm says these videos are more “realistic” on average than videos generated by competing technology (such as OpenAI’s Sora text-to-video model).
Movie Gen can also generate synced audio, tailor the video to show a person’s face, and edit the video automatically from just a text prompt, such as “dress the penguins in Victorian outfits,” which puts the on-screen penguins in period costume.
In the accompanying paper, “Movie Gen: A Cast of Media Foundation Models,” Meta AI researchers describe how they had humans rate the realism of the AI-generated videos:
Realness: This measures which of the videos being compared most closely resembles a real video. For fantastical prompts that are out of the training set distribution (e.g., depicting fantasy creatures or surreal scenes), we define realness as mimicking a clip from a movie following a realistic art-style. We additionally ask the evaluators to select a reason behind their choice i.e., “subject appearance being more realistic” or “motion being more realistic”.
There is also a companion blog post.
The human tests identify a win/loss score for Movie Gen versus Sora and three other prominent text-to-video AI models, Runway Gen3, Lumalabs, and Kling1.5.
The authors note that it’s not yet possible to get good comparisons in an automated fashion. Furthermore, “assessing realness and aesthetics heavily depends on human perception and preference,” they write.
Realism is not the only thing that can't be automated, they state. The same goes for judging how good the motion in a video is, whether it skips or fumbles parts of an action, and how faithful the video is to the text prompt entered.
“We find that existing automated metrics struggle to provide reliable results, reinforcing the need for human evaluation.”
The benchmark measures the ways “humans prefer the results of our model against competing industry models,” the paper relates, resulting in a “net win rate” in percentage terms.
The average net win rate against Sora, they relate, is 11.62%. The net win rate against the others is substantially higher.
“These significant net wins demonstrate Movie Gen Video’s ability to simulate the real world with generated videos that respect physics, with motion that is both reasonable in magnitude but consistent and without distortion.”
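To make the "net win rate" concrete, here is a minimal sketch of how such a score could be tallied from pairwise human votes. It assumes the score is the share of comparisons the model wins minus the share it loses, with ties counting toward the total only. The function name `net_win_rate` and the vote tally are hypothetical illustrations, not Meta's evaluation code or data.

```python
from collections import Counter


def net_win_rate(votes):
    """Return win% minus loss%, in percentage points.

    `votes` is an iterable of "win", "loss", or "tie" labels, each one
    recording a single rater's verdict for model A versus model B.
    Assumption: ties count toward the total but toward neither side.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes supplied")
    return 100.0 * (counts["win"] - counts["loss"]) / total


# Hypothetical tally for illustration only, not figures from the paper.
votes = ["win"] * 45 + ["loss"] * 35 + ["tie"] * 20
print(net_win_rate(votes))  # 10.0
```

Under this definition the score ranges from -100 to 100, so a positive number means raters preferred the model more often than they preferred its competitor.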
They offer some sample screen grabs of their videos set directly against Sora's. As the authors see it, “OpenAI Sora can tend to generate less realistic videos (e.g., the cartoonish kangaroo in the second row) that can be missing the motion details described in the text prompt (e.g., the non-walking robot in the bottom row).”
The authors built the AI model for Movie Gen from what they call a “cast of foundation models.”
In the training phase, images and videos from a mixture of public and licensed data sets are compressed until the model learns to efficiently reproduce pixels of the data, the authors relate. As they term it, “We encode the RGB pixel-space videos and images into a learned spatiotemporal compressed latent space using a Temporal Autoencoder (TAE), and learn to generate videos in this latent space.”
That video generation is then “conditioned” on text inputs to get the model to be able to produce video in alignment with the text prompts.
The parts add up to a model with 30 billion parameters — not huge by today’s training standards.
A second neural net, called “Movie Gen Audio,” produces high-fidelity audio — but for sound effects and music, not for speech. That is built on an existing approach called a “diffusion transformer,” with 13 billion parameters.
All that takes a lot of computing horsepower: “6,144 H100 GPUs, each running at 700W TDP and with 80GB HBM3, using Meta’s Grand Teton AI server platform.”
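For a rough sense of scale, the quoted figures can be multiplied out. The sketch below uses only the numbers in the quote; the function name `cluster_totals` is a hypothetical helper. Note the assumptions: TDP is a rated ceiling for the GPUs alone, so the power total is an upper bound on GPU draw and leaves out CPUs, networking, and cooling.

```python
NUM_GPUS = 6144   # from the paper's quoted hardware description
TDP_WATTS = 700   # rated thermal design power per GPU
HBM_GB = 80       # HBM3 memory per GPU


def cluster_totals(num_gpus=NUM_GPUS, tdp_watts=TDP_WATTS, hbm_gb=HBM_GB):
    """Return (total GPU power in watts, total GPU memory in GB)."""
    return num_gpus * tdp_watts, num_gpus * hbm_gb


total_watts, total_gb = cluster_totals()
print(f"{total_watts / 1e6:.2f} MW of GPU power at TDP")  # 4.30 MW
print(f"{total_gb / 1000:.2f} TB of GPU memory")          # 491.52 TB
```

That works out to roughly 4.3 megawatts of rated GPU power and just under half a petabyte of high-bandwidth memory across the cluster.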
Generating videos is not all Movie Gen does. In a subsequent step, the authors also subject the model to additional training to create “personalized” videos, where an individual’s face can be forced to show up in the movie.
They also add a final component, the ability to edit the videos with just a text prompt. The problem the authors faced is that “video editing models are hindered by the scarcity of supervised video editing data,” so there aren’t enough examples to give the AI model to train it.
To get around that, the team went back to the Movie Gen AI model and modified it in several steps. First, they use data from image editing to simulate what is involved in editing frames of video. They put that into the training of the model at the same time as the original text-to-video training so that the AI model develops a capacity to coordinate individual frame editing with multiple frames of video.
In the next portion, the authors feed the model a video, a text caption, such as “a person walking down the street,” and an edited video, and train the model to produce the instruction that would lead to the change from original video to edited video. In other words, they force the AI model to associate instructions with changed videos.
To test the video editing capability, the authors compile a new benchmark test based on 51,000 videos collected by Meta’s researchers. They also hired crowd workers to come up with editing instructions.
To evaluate the editing of the videos, the Meta team asked human reviewers to rate which video was better: one created with their AI model or with the existing state-of-the-art. They also used automated measures to compare the before and after videos in the task.
“Human raters prefer Movie Gen Edit over all baselines by a significant margin,” write the authors.
In all these steps, the authors break ground in coordinating the size of the AI models, the amount of data, and the amount of computing used. “We find that scaling the training data, compute, and model parameters of a simple Transformer-based model trained with Flow Matching yields high-quality generative models for video or audio.”
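For readers unfamiliar with Flow Matching, the toy sketch below shows the general idea in one common form: blend random noise with a data sample along a straight line, and train a network to predict the velocity that carries the noise toward the data. This is an assumption-laden illustration, not the paper's formulation. The straight-line path, the function names `flow_matching_pair` and `flow_matching_loss`, and the plain Python lists standing in for video latents are all choices made here for clarity.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build one training example for a straight-line flow matching objective.

    x1  : a data sample, as a list of floats (stand-in for a video latent)
    t   : a time in [0, 1]; t=0 is pure noise, t=1 is the data sample
    rng : a random.Random instance used to draw the Gaussian noise x0
    Returns (x_t, target_velocity).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Point on the straight line between noise and data at time t.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # Along a straight line the velocity is constant: data minus noise.
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


rng = random.Random(0)
x_t, target = flow_matching_pair([1.0, 2.0, 3.0], t=0.5, rng=rng)
# A real model would predict the velocity from (x_t, t, text prompt);
# an all-zero guess stands in for an untrained network here.
print(flow_matching_loss([0.0, 0.0, 0.0], target))
```

In a full system the all-zero guess would be replaced by the Transformer's output, and generation would start from noise and follow the predicted velocities step by step toward a finished sample.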
However, the authors concede that the human evaluations have their pitfalls. “Defining objective criteria evaluating model generations using human evaluations remains challenging and thus human evaluations can be influenced by a number of other factors such as personal biases, backgrounds, etc.”
The paper doesn’t have any suggestions as to how to deal with those human biases. But Meta notes that it hopes to release a benchmark test for use by others, without disclosing a time frame:
In order to thoroughly evaluate video generations, we propose and hope to release a benchmark, Movie Gen Video Bench, which consists of 1000 prompts that cover all the different testing aspects summarized above. Our benchmark is more than 3× larger than the prompt sets used in prior work.
The company also said it hopes to offer its videos for public inspection at some point: “To enable fair and easy comparison to Movie Gen Video for future works, we hope to publicly release our non-cherry picked generated videos for the Movie Gen Video Bench prompt set.”
According to Meta, the Movie Gen model has not yet been deployed. In the conclusion of their paper, the authors write that the AI models all “need multiple improvements before deploying them.” For example, the videos generated by the model “still suffer from issues, such as artifacts in generated or edited videos around complex geometry, manipulation of objects, object physics, state transformations, etc.” The audio “is sometimes out of synchronization when motions are dense” such as a video of tap dancing.
Despite those limitations, Movie Gen points toward an eventual full video creation and editing suite, and even toward tailoring a video podcast with one’s own likeness.