From an early age, humans can identify objects and understand how each one can interact with its environment. For example, when watching videos of sports like tennis and football, spectators and sportscasters can understand and anticipate plays despite never being given a list of possible actions. We develop this skill as we watch events unfold live and on the screen. Furthermore, we can reason about what would happen if a player took a different action and how this might change the video.
In an effort to create an AI system that can develop some of these same reasoning skills, researchers at the University of Trento, the Institut Polytechnique de Paris, and Snap, Inc. propose in a new paper the task of playable video generation, where the goal is to learn a set of actions from real-world video clips and offer users the ability to generate new videos. The idea is that users provide an “action label” at every time step and can see its impact on the generated video, like a video game. The researchers believe this framework might pave the way for methods that can simulate real-world environments and provide a gaming-like experience.
In an experiment, the researchers built a framework called Clustering for Action Decomposition and DiscoverY (CADDY) that discovers a set of actions after watching multiple videos and outputs “playable” videos. CADDY uses the aforementioned action labels to encode the semantics of a given action, along with a continuous component that captures how the action is performed.
The researchers claim that CADDY can generate “high-quality” videos while offering users the chance to choose which actions occur in those videos — akin to Facebook’s AI that extracts playable characters from real-world videos. For example, with CADDY, given a real-life video of a tennis player, users can select Left, Right, Forward, Backward, Hit the ball, or Stay to prompt the system to create videos capturing that action.
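To make the interaction concrete, here is a minimal Python sketch of the control loop described above: a user supplies one action label per time step and the system advances its state accordingly. This is not CADDY's implementation. The learned video generator is replaced by a hypothetical `toy_step` function that moves a point on a 2D court, and the names `ToyState`, `STEP`, `toy_step`, `play`, and the `variability` offset (standing in for the continuous component that captures how an action is performed) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The six action labels mentioned for the tennis example.
ACTIONS = ["Left", "Right", "Forward", "Backward", "Hit the ball", "Stay"]

# Hypothetical displacement (dx, dy) per action; a real system would
# learn what each discovered action does instead of using a lookup table.
STEP = {
    "Left": (-1.0, 0.0),
    "Right": (1.0, 0.0),
    "Forward": (0.0, 1.0),
    "Backward": (0.0, -1.0),
    "Hit the ball": (0.0, 0.0),
    "Stay": (0.0, 0.0),
}


def one_hot(action: str) -> List[float]:
    """Encode a discrete action label as a one-hot vector."""
    index = ACTIONS.index(action)  # raises ValueError for unknown labels
    return [1.0 if i == index else 0.0 for i in range(len(ACTIONS))]


@dataclass
class ToyState:
    """Stand-in for a generated frame: just the player's court position."""
    x: float = 0.0
    y: float = 0.0


def toy_step(state: ToyState, action: str,
             variability: Tuple[float, float] = (0.0, 0.0)) -> ToyState:
    """Advance one time step given a discrete action and a continuous offset."""
    dx, dy = STEP[action]
    return ToyState(state.x + dx + variability[0],
                    state.y + dy + variability[1])


def play(actions: List[str], start: Optional[ToyState] = None) -> List[ToyState]:
    """Apply one user-chosen action per time step and record every state."""
    state = start if start is not None else ToyState()
    trajectory = [state]
    for action in actions:
        state = toy_step(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for frame in play(["Left", "Left", "Forward", "Hit the ball"]):
        print(frame)
```

In this toy version the same action sequence always produces the same trajectory; passing a nonzero `variability` to `toy_step` is one simple way to picture how two executions of the same action label could differ.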
“Our experiments show that we can learn a rich set of actions that offer the user a gaming-like experience to control the generated video. As future work, we plan to extend our method to multi-agent environments,” the researchers wrote. “CADDY automatically discovers the most significant actions to condition video generation and can produce playable video generation models in a variety of settings, from video games to real videos.”
In the near term, the researchers’ work could lower the cost of corporate video production. Filming a short commercial runs $1,500 to $3,500 on the low end, a hefty expense for small and medium-sized businesses. This leads some companies to pursue in-house solutions, but not all have the expertise required to execute their vision. A tool like CADDY could eliminate the need for reshoots while opening up new creative possibilities.