Introducing Emu Video and Emu Edit, our latest generative AI research milestones
The field of generative AI is rapidly evolving, showing remarkable potential to augment human creativity and self-expression. In 2022, we made the leap from image generation to video generation in the span of a few months. And at this year’s Meta Connect, we announced several new developments, including Emu, our first foundational model for image generation. Technology from Emu underpins many of our generative AI experiences, some AI image editing tools for Instagram that let you take a photo and change its visual style or background, and the Imagine feature within Meta AI that lets you generate photorealistic images directly in messages with that assistant or in group chats across our family of apps. Our work in this exciting field is ongoing, and today, we’re announcing new research into controlled image editing based solely on text instructions and a method for text-to-video generation based on diffusion models.
Emu Video: A simple factorized method for high-quality video generation
Whether or not you’ve personally used an AI image generation tool, you’ve likely seen the results: Visually distinct, often highly stylized and detailed, these images on their own can be quite striking—and the impact increases when you bring them to life by adding movement.
With Emu Video, which leverages our Emu model, we present a simple method for text-to-video generation based on diffusion models. This is a unified architecture for video generation tasks that can respond to a variety of inputs: text only, image only, and both text and image. We’ve split the process into two steps: first, generating images conditioned on a text prompt, and then generating video conditioned on both the text and the generated image. This “factorized” or split approach to video generation lets us train video generation models efficiently. We show that factorized video generation can be implemented via a single diffusion model. We present critical design decisions, like adjusting noise schedules for video diffusion, and multi-stage training that allows us to directly generate higher-resolution videos.
도큐멘토에서는 일부 내용만을 보여드리고 있습니다.
세부적인 내용은 바로가기로 확인하시면 됩니다.