Mentatcurated
▸ Concept also: text-to-video, video synthesis, video diffusion

Video generation

A model that produces video clips from a text prompt or image by learning the statistical structure of large video datasets — extending image generation across time.

In a nutshell

A video generation model takes a text description, a still image, or both, and produces a sequence of frames that match it. The core problem is temporal coherence: each frame must follow plausibly from the last, objects must persist, and motion must be physically believable — constraints that don't exist for single images. Current systems extend diffusion or autoregressive approaches to the time axis. The hard part is that errors compound: a small geometry mistake in frame two drifts into an incoherent scene by frame twenty.

Where it came from

Year2022
SourceHo et al. — Video Diffusion Models (NeurIPS 2022)
Why it matteredExtended the image diffusion framework to jointly model frames, establishing the template most subsequent video generators follow.

How this connects

Tap a node to open it