Google's latest video generation AI model, Lumiere, employs a novel diffusion architecture called Space-Time U-Net (STUNet), which determines where objects are located in a video and how they move and change over time. Lumiere starts by creating a base frame from a given prompt. Using the STUNet framework, it then predicts how objects within that frame will move, generating additional frames that flow seamlessly into one another to create the impression of smooth motion. Lumiere produces 80 frames, a notable increase over Stable Video Diffusion's 25.
Unlike other models that stitch videos together from generated keyframes, where the movement between frames has already happened, STUNet lets Lumiere concentrate on the movement itself, based on where generated content should appear at specific points in the video.
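To make that contrast concrete, here is a minimal, hedged sketch of the space-time idea: a tiny 3D U-Net-style network that downsamples the whole clip in both space and time, so motion is reasoned about jointly across frames rather than interpolated between keyframes. The layer sizes, depths, and names (such as TinySpaceTimeUNet) are illustrative assumptions, not Lumiere's actual architecture.

```python
# Illustrative sketch of joint space-time processing, NOT Lumiere's real model:
# the entire clip is compressed and re-expanded along time AND space, so the
# network models motion directly instead of interpolating between keyframes.
import torch
import torch.nn as nn


class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        # Encoder: 3D convolutions downsample the clip in time and space.
        self.down = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden * 2, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
        )
        # Bottleneck works on a short, low-resolution summary of the video,
        # which is where joint space-time (motion) reasoning happens.
        self.mid = nn.Conv3d(hidden * 2, hidden * 2, kernel_size=3, padding=1)
        # Decoder: upsample back to the full frame count and resolution.
        self.up = nn.Sequential(
            nn.ConvTranspose3d(hidden * 2, hidden, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(hidden, channels, kernel_size=4, stride=(2, 2, 2), padding=1),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width) -- the whole video
        # passes through the network at once rather than frame by frame.
        return self.up(self.mid(self.down(clip)))


if __name__ == "__main__":
    model = TinySpaceTimeUNet()
    video = torch.randn(1, 3, 80, 64, 64)  # 80 frames, matching Lumiere's output length
    print(model(video).shape)  # torch.Size([1, 3, 80, 64, 64])
```

The point of the sketch is the tensor shape: because time is a full axis of the network's input and is downsampled alongside height and width, every intermediate feature already encodes how content moves across the whole clip.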
Lumiere's introduction has drawn considerable interest and praise from video editors, along with worries about how the technology might affect job security in the field. It reflects a broader conversation about how AI is changing content creation and what that means for human professionals in the industry.
While Google has not been a prominent player in the text-to-video category, it has been steadily releasing more advanced AI models with an increasingly multimodal focus. The upcoming Gemini large language model is expected to bring image generation to Bard. Lumiere, though not yet available for testing, shows that Google can build an AI video platform comparable to, and arguably surpassing, existing generators like Runway and Pika. It also underscores how far Google has come in the AI video domain over the past two years.
Beyond text-to-video generation, Lumiere also supports image-to-video generation, stylized generation that lets users create videos in a specific style, animation of selected portions of a video, and inpainting, which masks out areas to change their color or pattern. Google acknowledges the risk that Lumiere could be misused to create fake or harmful content, and emphasizes the need for tools to detect biases and malicious use cases so the technology can be used safely and fairly. However, the paper does not elaborate on how this would be achieved.
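As a rough illustration of the mask-based inpainting workflow mentioned above, the sketch below composites generated content into a masked region of a clip while leaving everything else untouched. The generate_patch placeholder, the function names, and the tensor shapes are hypothetical stand-ins for illustration only, not Lumiere's actual interface.

```python
# Hedged illustration of masked video inpainting: a binary mask marks the
# region to regenerate, and the model's output replaces only that region.
import torch


def generate_patch(clip: torch.Tensor, prompt: str) -> torch.Tensor:
    """Placeholder for a text-conditioned video model; here it just returns noise."""
    return torch.rand_like(clip)


def inpaint_video(clip: torch.Tensor, mask: torch.Tensor, prompt: str) -> torch.Tensor:
    # clip: (frames, channels, height, width); mask: (frames, 1, height, width),
    # with 1 where new content (e.g. a new color or pattern) should appear.
    generated = generate_patch(clip, prompt)
    return mask * generated + (1 - mask) * clip  # unmasked pixels stay as they were


if __name__ == "__main__":
    clip = torch.rand(80, 3, 64, 64)
    mask = torch.zeros(80, 1, 64, 64)
    mask[:, :, 16:48, 16:48] = 1.0  # mask a square region to restyle
    result = inpaint_video(clip, mask, "change the jacket to red plaid")
    print(result.shape)  # torch.Size([80, 3, 64, 64])
```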