Topic Overview

World Models: The Next Paradigm? · World Models


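Topic page (living document) · Last updated 2026-06-11 · Drawn from 6 core interviews (checked against the full verbatim transcripts)

Changelog

Mainstream consensus

Point one: the core of a world model is "learning an interactive representation that can predict the next state"

Implementations differ widely across teams, but this bottom-level definition is shared. Pim of General Intuition puts it most plainly, and also pins down how it differs from a "video model":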

"In a video model, you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and based on the action that you take, generates the next state, the next frame."
Pim De Witte (General Intuition) · World Models & General Intuition
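
The distinction Pim draws can be sketched in a few lines. This is a toy illustration, not any team's implementation: `video_model_next` and `world_model_next` are hypothetical names, and the 1-D grid state is an assumption chosen only to make the contrast visible. The video-style predictor extrapolates from past frames alone, while the world-model-style predictor takes an action and returns the next state that action produces.

```python
from typing import List

GRID_SIZE = 5  # hypothetical 1-D world with positions 0..4


def video_model_next(frames: List[int]) -> int:
    """Video-style prediction: extrapolate a likely next frame
    from history alone. No action is involved."""
    if len(frames) < 2:
        return frames[-1]
    velocity = frames[-1] - frames[-2]
    return max(0, min(GRID_SIZE - 1, frames[-1] + velocity))


def world_model_next(state: int, action: str) -> int:
    """World-model-style prediction: the next state depends on the
    current state AND the action taken."""
    delta = {"left": -1, "stay": 0, "right": 1}[action]
    return max(0, min(GRID_SIZE - 1, state + delta))


history = [1, 2]
# A single continuation from the video-style predictor...
print(video_model_next(history))
# ...versus one next state per possible action from the world model.
print({a: world_model_next(history[-1], a) for a in ("left", "stay", "right")})
```

The point of the sketch is the signature: the world-model function cannot be called without an action, so it has to cover every outcome reachable from the current state rather than a single most likely continuation.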

Gemini co-lead Oriol Vinyals gives the most abstract version, a form of representation learning that compresses video into concepts:

"A pure aspect of world model would be representation learning. So you could imagine we take these modalities like the videos … and then compressing that into sort of a set of concepts — what the movements, the objects are within those. … And it models the world in a very compact way that compresses away what's probably not relevant."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

Point two: real-time interactivity is the source of the "wow moment"

The Genie team repeatedly stresses the leap in experience that immediate response brings:

"There is something when it responds immediately that is really magical. I think that's kind of sparked the imagination of many people when the DOOM simulation came out."
"It's at the point where a human who is not an expert will watch it and think it looks real. And I think that's pretty incredible."

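Point three: everyone is currently anchored in the game/video domain, and robotics and the physical world are the shared "promised land"

GI trains on game clips, Genie evolved out of game engines, and xAI comes in through video generation. All three enter through games or video, and all three treat "transfer to robots and the physical world" as the next stop. Vinyals frames this as the very reason world models exist: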

"It could also meaningfully add a dimension of simulation that could make us use, for example, things like prediction before acting in the world. And of course, obvious applications for these kind of 3D or video world models would be clearly self-driving cars or robotics."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

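Where they disagree

The bottom-level definition is shared, but one level up the disagreements are large: what a world model actually is, and which layer the intelligence lives in. On top of that, Andon Labs pours cold water on the whole story.

Camp A · "A world model is a concept-level representation / simulation engine" (Vinyals's research camp)

Vinyals's definition is the most abstract and the most ambitious. The world model is a "renderer" that you can change using language alone: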

"The world model itself is acting as a renderer of the world that you can really just change by a language."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

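But the transcript contains a point the summary layer missed entirely: Vinyals himself is not sure whether a physical concept like "gravity" is actually in the world model, and language "gets in the way", so you cannot cleanly test whether the model understands physics: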

"As soon as you add language, all of a sudden that knowledge is there in the way. So if you ask basic questions about gravity, of course you would answer them by just having read explanations of them online. So you would need to somehow connect the concept of gravity — which could be present or not in a world model — to then decode that into an explanation."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

He also hedges on robot transfer. Getting simulation-trained robots to work well is, in his words, still an open problem:

"For the latter to work better, it's still a very open problem. There's also all sorts of issues with transfer."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

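Camp B · "The intelligence is in the LLM, and the video model is a dumb executor" (xAI's product camp)

xAI's Ethan He (who previously worked on the Cosmos world model at NVIDIA) makes a "big claim" that is diametrically opposed to Vinyals: visual intelligence actually comes from the language model, and the video model itself is dumb: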

"I have a pretty big claim. The visual intelligence is actually mostly coming from language. … Every time you see some improvement on these models, I would say mostly comes from the language model, not coming from the video distribution models themselves."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

He makes "dumb" concrete with a cat example:

"The video distribution models, I would describe they're kind of dumb because they take the input instruction literally. … If you put a cat in, they would literally show a cat in maybe a white background because you didn't describe the background. The cat is not moving because you didn't describe it. It takes the instruction quite literally. It's kind of dumb."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

On that basis, Ethan He bets the future on "video agents", where the LLM is the brain and the generative model is one tool among many:

"Video agents, mostly language models, will call these generative models … as a tool. So this model can iteratively refine the results or even generate longer content through a very long chain of thought. It's actually very similar to how humans create art. We don't generate the pixels directly. … It can also use image editing tools from Photoshop."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next
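
The agent pattern described in the quote, a language model that calls a generative model as a tool and refines the result over several rounds, can be sketched as a loop. This is an illustration under heavy assumptions: `generate_video`, `critique`, and `plan_prompt` are hypothetical stubs standing in for a video model, a reviewer, and an LLM planner, and no real model is called.

```python
from typing import List, Tuple


def generate_video(prompt: str) -> str:
    """Stub 'literal' generator: renders exactly what the prompt says."""
    return f"video({prompt})"


def critique(result: str) -> List[str]:
    """Stub reviewer: list details the generated result does not specify."""
    required = ["background", "motion"]
    return [r for r in required if r not in result]


def plan_prompt(prompt: str, missing: List[str]) -> str:
    """Stub planner (the LLM's role): fill in one missing detail per round."""
    defaults = {
        "background": "background: sunny garden",
        "motion": "motion: cat walks",
    }
    return f"{prompt}; {defaults[missing[0]]}"


def video_agent(user_prompt: str, max_rounds: int = 5) -> Tuple[str, int]:
    """Call the generator as a tool, then refine until the critique is empty."""
    prompt = user_prompt
    result = generate_video(prompt)
    rounds = 0
    while rounds < max_rounds:
        missing = critique(result)
        if not missing:
            break
        prompt = plan_prompt(prompt, missing)
        result = generate_video(prompt)
        rounds += 1
    return result, rounds


print(generate_video("a cat"))  # the generator alone follows the prompt literally
print(video_agent("a cat"))     # the agent fills in background and motion first
```

In this sketch all the decisions about what is missing live in the planner and reviewer stubs, while the generator only executes, which mirrors the division of labor the quote argues for.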

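This is in direct opposition to Vinyals's view that the world model itself is the representation, and therefore the intelligence. One side says the intelligence is in the world model; the other says the world model is only the hands and the brain is in the LLM. Ethan He also pushes this logic up to the interface layer, as generative UI: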

"The generative UI will be user intention to the pixels directly. … I want the email to show to me like a TikTok so I can swipe left and right. … It's going to be a revolutionary replacement of the interface."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

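Camp C · "World models are learned by imitating human behavior" (General Intuition's data camp)

GI's path is neither Vinyals's concept representation nor xAI's LLM-driven approach. It does pure imitation learning on a massive corpus of game clips, predicting actions directly from frames: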

"What I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would. … These are pure imitation learning."
Pim De Witte (General Intuition) · World Models & General Intuition

Pim likens this path to "building a common crawl for interactivity", which is the core metaphor behind GI's whole bet:

"LLMs were trained on predicting text tokens, words on the internet. What if we predict action tokens on essentially what is the equivalent of the common crawl data set, but for interactivity?"
Pim De Witte (General Intuition) · World Models & General Intuition

In product terms, GI wants to directly replace the "player controller" inside a game engine:

"Really what we're doing at the moment is replacing essentially the player controller inside of a game engine. Anything that you're currently … deterministically coding, we hope to replace with a single API, which is just, you stream us frames and we predict actions — and that can be inside an engine or it can be eventually even inside the real world."
Pim De Witte (General Intuition) · World Models & General Intuition
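
The interface Pim describes, frames in and actions out, can be sketched as a drop-in replacement for a scripted controller. Everything here is hypothetical: `Controller`, `ScriptedController`, `LearnedController`, and `brightest_side` are illustrative names, not GI's API, and the "model" is a trivial function so the example runs on its own.

```python
from typing import Callable, Iterable, List, Protocol

Frame = List[List[int]]  # hypothetical frame: a 2-D grid of pixel values
Action = str             # hypothetical action token, e.g. "left" or "right"


class Controller(Protocol):
    def act(self, frame: Frame) -> Action: ...


class ScriptedController:
    """Deterministic, hand-coded behavior (what the quote proposes replacing)."""

    def act(self, frame: Frame) -> Action:
        return "right"


class LearnedController:
    """Wraps any frame-to-action policy behind the same interface."""

    def __init__(self, policy: Callable[[Frame], Action]) -> None:
        self.policy = policy

    def act(self, frame: Frame) -> Action:
        return self.policy(frame)


def brightest_side(frame: Frame) -> Action:
    """Stand-in for a trained model: steer toward the brighter half."""
    mid = len(frame[0]) // 2
    left = sum(sum(row[:mid]) for row in frame)
    right = sum(sum(row[mid:]) for row in frame)
    return "left" if left > right else "right"


def run(controller: Controller, frames: Iterable[Frame]) -> List[Action]:
    """Stream frames to the controller and collect one action per frame."""
    return [controller.act(f) for f in frames]


stream = [[[9, 0], [9, 0]], [[0, 9], [0, 9]]]
print(run(ScriptedController(), stream))
print(run(LearnedController(brightest_side), stream))
```

Because both controllers satisfy the same `act(frame)` contract, the caller does not change when the scripted logic is swapped for a learned policy, which is the substitution the quote has in mind.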

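GI's confidence in this path was strong enough that it turned down a $500 million offer from OpenAI for Medal's game-clip data and set up an independent lab instead (Khosla's largest single seed investment since OpenAI). Pim's reasoning is that the data is large enough to support a parallel bet on imitation learning: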

"We essentially realized that we could get so far on just imitation learning. … We think we can essentially leap every single company that's forced to either be consumers of world models or build world models, and take this foundation model bet for spatial-temporal agents."
Pim De Witte (General Intuition) · World Models & General Intuition

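But he adds an honest qualifier on robot transfer. The bet is not on transferring to high-degree-of-freedom robots; it holds only if "the robot has to have gaming inputs":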

"Our bet is not that we can transfer over to like higher DOF robots with the keyboard and mouse. It's really just that we can move the hard work of pre-training hopefully to post-training. … But the key is that the robot has to have gaming inputs."
Pim De Witte (General Intuition) · World Models & General Intuition

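Cold water · "Current models have no spatial intelligence at all" (Andon Labs's empirical evidence)

The three camps above all assume world models are on the road to the physical world. Andon Labs put models into real spatial reasoning tasks and ran into the opposite: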

"We took models and gave them 20 images of interior photographs of apartments. And then we asked them to redesign the floor plan from that. … For this you need to reason about 3D space. And it turns out the models are absolutely horrible at this. No one scores statistically better than random chance."

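This goes straight at the soft spot of the "world models lead to robots / physical simulation" story: if models cannot even handle 2D→3D reconstruction, the promise of a "physical-world simulation engine" is still far off. It also corroborates Vinyals's own admission that he is not sure the concept of gravity is in the model.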
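
To make "statistically better than random chance" concrete, here is a sketch of a one-sided exact binomial test. The numbers are invented for illustration: the source does not give Andon Labs's item counts, scores, or chance baseline, so `n_tasks`, `chance`, and the two example scores are assumptions.

```python
from math import comb


def p_value_above_chance(correct: int, n_tasks: int, chance: float) -> float:
    """One-sided exact binomial test: probability of scoring at least
    `correct` out of `n_tasks` if every answer were a random guess."""
    return sum(
        comb(n_tasks, k) * chance**k * (1 - chance) ** (n_tasks - k)
        for k in range(correct, n_tasks + 1)
    )


n_tasks = 20   # hypothetical number of graded items
chance = 0.25  # hypothetical chance rate, e.g. four-way multiple choice

for correct in (6, 15):
    p = p_value_above_chance(correct, n_tasks, chance)
    verdict = "better than chance" if p < 0.05 else "not distinguishable from chance"
    print(f"{correct}/{n_tasks}: p = {p:.4f} -> {verdict}")
```

A score slightly above the chance rate, such as 6 of 20 under these assumed settings, does not clear a 0.05 threshold. That is the sense in which a model can score above chance numerically without being statistically better than guessing.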

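What nobody has spelled out

My take

Judgment (not fact): what is currently called a "world model" is at least three different things absorbed under one term: Vinyals's concept representation (research bet), xAI's real-time video interaction (product bet), and GI's behavior imitation (data bet). In the short term (12–18 months), the one that delivers commercial value will be xAI's line (video / generative UI, closest to existing consumer and creator markets, and pragmatic in admitting that "the intelligence is in the LLM"). In the medium term, the one with the most academic weight is the Vinyals / GI line (if concept representations or imitated behavior can really be fed to robots). But Andon's "no spatial intelligence" finding is the signal in this whole narrative that most deserves to be taken seriously. Reading the full transcripts made me more sure of this, because even Vinyals himself will not claim that "gravity" is in the model. My current bet: world models will be a huge tool for creation, games, and UI, but the step "toward physical-world AGI" lacks evidence.

Confidence: medium-low. The strongest support is that the three definitions really are inconsistent (the quotes can be compared directly), plus the mutually corroborating doubts from Vinyals and Andon about structural understanding. The weakest part is the claim that world models "will stop at being tools and not reach physical AGI". It still rests mainly on a single source (Andon) plus one self-report from Vinyals, and more empirical evidence is needed to confirm that this is an architectural bottleneck rather than a data problem.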

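What I still want to know

Sources

The 5 core pieces were reread in full from the verbatim transcripts this round, with every quote checked against the transcript:

Other pieces that matched by alias, were weakly related, or were not quoted this round: Project Genie, Fal.ai (history of generative media), 4D Creation, Best-of-2025 roundtable, John Schulman, Khosla/Rabois Uncapped. A later headless full re-synthesis will automatically pull in their full text.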