Artificial Intelligence medium · first-party

Gemini Omni

Google's new model takes any mix of text, image, audio and video in and hands back an edited video with sound — the first time arbitrary-input fusion and video generation live in one consumer engine.

tool Google DeepMind · 2 min read

Earlier "omni" models could read anything you threw at them — text, a photo, a voice clip — but couldn't make video. Video tools could make a clip from a prompt but couldn't take a sketch plus a hummed melody plus a written note and fuse all three into one edited scene. Gemini Omni does both at once: you hand it any combination of inputs and it returns video with synced audio, and each edit builds on the last so a single scene holds together across a conversation.

Google's own model card concedes the limits: keeping a scene consistent across edits, rendering complex motion, and producing accurate text each "remains a challenge."

It shipped the day it was announced — free inside YouTube Shorts and the YouTube Create app, and to paying Gemini subscribers in Google's apps — replacing the company's previous video engine by default. That points it at a creator base in the billions.

Google and Demis Hassabis pitched it not as a video tool but as a "world model" that simulates reality. That is the part worth holding at arm's length. Google published no test scores at launch — every benchmark was deferred to a later developer release — so the claim rests on stage demos and one keynote. The only independent hands-on test so far found the opposite of simulation: drop two marbles and the second veers off course, keep editing and the scene falls apart after about four turns, and ask for Japanese text and it renders eleven of forty-six characters. The genuinely new thing here — fusing any input into generated video, in a tool a billion people can open today — is real, and oddly bigger than the unfalsifiable claim Google chose to lead with.

The lenses

Novelty 4

Impact · breadth 5

Impact · depth 3

Actionable 3

Substance 2

Hype 4

The facts

Where to use it: Free in YouTube Shorts and YouTube Create; in Google's Gemini app and Flow for paid subscribers

What it makes: Up to 10-second video with sound, edited across a back-and-forth conversation

Proof to wait for: No benchmarks published yet; an API and test scores are promised later

How well it holds up: Independent testing found edits drift after ~4 turns and non-Latin text fails

Concepts

World model

Open deepmind.google →
