Gemini Omni
Google's new model takes any mix of text, image, audio and video in and hands back an edited video with sound — the first time arbitrary-input fusion and video generation live in one consumer engine.
Earlier "omni" models could read anything you threw at them — text, a photo, a voice clip — but couldn't make video. Video tools could make a clip from a prompt but couldn't take a sketch plus a hummed melody plus a written note and fuse all three into one edited scene. Gemini Omni does both at once: you hand it any combination of inputs and it returns video with synced audio, and each edit builds on the last so a single scene holds together across a conversation.
Google's own model card concedes the limits: keeping a scene consistent across edits, rendering complex motion, and producing accurate text "remains a challenge."
Gemini Omni shipped the day it was announced, free inside YouTube Shorts and the YouTube Create app and available to paying Gemini subscribers in Google's apps, and it replaces the company's previous video engine by default. That points it at a creator base in the billions.
Google and Demis Hassabis pitched it not as a video tool but as a "world model" that simulates reality. That is the part worth holding at arm's length. Google published no test scores at launch — every benchmark was deferred to a later developer release — so the claim rests on stage demos and one keynote. The only independent hands-on test so far found the opposite of simulation: drop two marbles and the second veers off course, keep editing and the scene falls apart after about four turns, and ask for Japanese text and it renders eleven of forty-six characters. The genuinely new thing here — fusing any input into generated video, in a tool a billion people can open today — is real, and oddly bigger than the unfalsifiable claim Google chose to lead with.