Topic Review

World Models: The Next Paradigm? · World Models

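Changelog

2026-06-11: Sourcing upgraded to full verbatim transcripts. The third-hand paraphrases from the podwise summary layer have been replaced with the speakers' own words: Vinyals on "representation learning", Ethan He on "video models are dumb executors", and GI on "predicting actions from frames" were all third-person summaries before. Added several key details that the summary layer had buried: Vinyals himself is unsure whether the concept of "gravity" is present in a world model; Ethan He's "the cat doesn't move because you didn't describe it"; and GI's "keyboard-and-mouse action tokens as a common crawl". Andon's cold-water finding is now quoted from the experiment itself (floor plan / random chance). Removed one cross-topic citation that could not be verified in the transcripts (Vinyals on non-parametric memory).
2026-06-10: First review (split out of long-horizon-agents as a standalone topic). Based on 7 interviews.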

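Mainstream consensus

Point 1: the core of a world model is "learning an interactive representation that can predict the next state"

Implementations differ widely across teams, but this lowest-level definition is shared. Pim of General Intuition puts it most plainly and also pins down how it differs from a "video model":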

"In a video model, you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and based on the action that you take, generates the next state, the next frame."
Pim De Witte (General Intuition) · World Models & General Intuition
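
The distinction Pim draws can be sketched as a difference in interface. This is a toy illustration, not anyone's actual system: the integer grid position standing in for a frame, and the names `video_next_frame`, `GridWorldModel`, and `MOVES`, are all hypothetical.

```python
from dataclasses import dataclass

# Toy stand-ins (hypothetical): an integer grid position plays the role of a frame/state.
MOVES = {"left": -1, "right": 1, "stay": 0}


def video_next_frame(history: list[int]) -> int:
    """Video-model-style prediction: continue the most recent motion. No action input."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


@dataclass
class GridWorldModel:
    """World-model-style prediction: (state, action) -> next state."""
    width: int = 10

    def step(self, state: int, action: str) -> int:
        # Clamp to the grid so the walls are part of the modelled dynamics.
        return min(max(state + MOVES[action], 0), self.width - 1)


history = [3, 4]
single_guess = video_next_frame(history)  # one "most likely" continuation
outcomes = {a: GridWorldModel().step(history[-1], a) for a in MOVES}  # one outcome per action
print(single_guess, outcomes)
```

The video-style function returns a single continuation, while the world-model-style `step` has to answer for every action available from the current state.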

Gemini co-lead Oriol Vinyals gives the most abstract version, a form of representation learning that compresses video into concepts:

"A pure aspect of world model would be representation learning. … you could imagine we take these modalities like the videos … And then compressing that into sort of a set of concepts and what those … the movements, the objects, et cetera, are within those … And it models the world in a very compact way that compresses away what's probably not relevant."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

Point 2: real-time interactivity is where the "wow moment" comes from

The Genie team repeatedly stresses the leap in experience that immediate response brings:

"There is something when it responds immediately that is really magical. I think that's kind of sparked the imagination of many people when the DOOM simulation came out …"
DeepMind Genie 3 team · Google DeepMind Lead Researchers on Genie 3

"It's at the point where like a human who is not an expert will watch it and think it looks real, right? And I think that's pretty incredible."
DeepMind Genie 3 team · Google DeepMind Lead Researchers on Genie 3

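Point 3: everyone is currently anchored in the game/video domain, with robotics and the physical world as the shared "promised land"

GI trains on game clips, Genie evolved out of game engines, and xAI comes in through video generation. All three enter through games or video, and all three treat "transfer to robots and the physical world" as the next stop. Vinyals frames this as the very reason world models exist: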

"It could also meaningfully adds maybe a dimension of simulation that could make us … use, for example, things like prediction before acting in the world. And of course, obvious applications for these kind of 3D or video world models would be clearly … self-driving cars or robotics."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

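Where they disagree

The base definition is shared, but above it the disagreements are large: what a world model actually is, and which layer the intelligence lives in. On top of that comes a bucket of cold water from Andon Labs.

Camp A · "A world model is a concept-level representation / simulation engine": Vinyals's research camp

Vinyals's definition is the most abstract and the most ambitious: a world model is a "renderer" that you can change with language alone: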

"The world model itself is acting as a renderer of the world that you can really just change by a language."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

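But the transcript contains a point the summary layer missed entirely: Vinyals himself is unsure whether a physical concept like "gravity" is actually present in a world model, and language "gets in the way", so you cannot cleanly test whether the model understands physics: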

"As soon as you add language, all of a sudden that knowledge is there in the way. So if you ask basic questions about gravity, of course, you would answer them by just having read … explanations of them online and so on. So you would need to somehow connect the concept of gravity — which could be present or not in a world model — to then decode that into an explanation …"
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

He also left room for doubt on robot transfer (training robots in simulation), which he calls still an open problem:

"For the latter to work better, I mean, it's still a very open problem. There's also all sorts of issues with transfer."
Oriol Vinyals (Gemini) · Ep 87: Gemini Co-Lead on World Models

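Camp B · "The intelligence is in the LLM; the video model is a dumb executor": xAI's product camp

xAI's Ethan He (who previously worked on the Cosmos world model at NVIDIA) makes a "big claim" diametrically opposed to Vinyals: visual intelligence actually comes from the language model, and the video model itself is dumb: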

"I have a pretty big claim. The visual intelligence are actually mostly coming from language. … every time you see there, there's some improvement on these models. I would say mostly, again, comes from language model, not coming from the video model itself, like the video distribution models themselves."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

He makes "dumb" concrete with a cat example:

"The video distribution models, I would describe they're kind of dumb because they, they take the input instruction literally … If you put a cat in … they would literally show a cat in maybe a white background because you didn't describe the background. The cat is not moving because you didn't describe it. It takes the instruction quite literally. It's kind of dumb."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

On that basis He bets the future on "video agents": the LLM is the brain, and the generative model is one tool among many:

"Video agents, mostly language models, will call these generative models … as a tool. So this model can iteratively refine the results or even generate longer content through a very long chain of thought. It's actually very similar to how humans create art. So we don't generate the pixels directly. … It can also use image editing tools from Photoshop."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next
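
The agent loop He describes (a language model calling a literal-minded generator as a tool and refining over several rounds) can be sketched roughly as follows. Everything here is a hypothetical stand-in: `literal_generator` mimics a generator that renders only what it is told, and `planner_fill` stands in for the language model with hard-coded defaults. No real model or API is being called.

```python
# Toy stand-ins only; every name and default below is hypothetical.
REQUIRED = ("subject", "background", "motion")


def literal_generator(spec: dict) -> dict:
    """Stub 'video model': renders exactly what the spec states and nothing more."""
    return {key: spec.get(key, "unspecified") for key in REQUIRED}


def planner_fill(spec: dict, missing: str) -> dict:
    """Stub 'language model': fills one missing field with a plausible default."""
    defaults = {"background": "sunlit living room", "motion": "walks toward camera"}
    return {**spec, missing: defaults[missing]}


def run_agent(subject: str, max_rounds: int = 5) -> tuple[dict, int]:
    """Call the generator as a tool, inspect the result, refine the spec, repeat."""
    spec = {"subject": subject}
    for round_no in range(1, max_rounds + 1):
        clip = literal_generator(spec)
        missing = [key for key, value in clip.items() if value == "unspecified"]
        if not missing:
            return clip, round_no
        spec = planner_fill(spec, missing[0])
    return literal_generator(spec), max_rounds


clip, rounds = run_agent("cat")
print(rounds, clip)
```

Calling `literal_generator({"subject": "cat"})` directly reproduces the cat example: no background and no motion, because neither was described. In this sketch all of the improvement between rounds comes from the planner, which is the shape of He's claim.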

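This is directly opposed to Vinyals's view that the world model itself is the representation and the intelligence: one says the intelligence lives in the world model, the other says the world model is only the hands and the brain is the LLM. He also pushes this logic up to the interface layer, as generative UI: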

"The generative UI will be user intention to the pixels directly. … I want the email to show to me like a TikTok so I can swipe left and right for the emails. … It's going to be a revolutionary replacement of the interface."
Ethan He (xAI Grok Imagine) · Why Video Agent models are next

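Camp C · "A world model is learned by imitating human behavior": General Intuition's data camp

GI's path is neither Vinyals's concept representation nor xAI's LLM-driven approach. It is pure imitation learning on a massive corpus of game clips, predicting actions directly from frames: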

"What I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would. … These are pure imitation learning."
Pim De Witte (General Intuition) · World Models & General Intuition

He likens this path to "building a common crawl for interactivity", the core metaphor behind GI's entire bet:

"LLMs were trained on predicting like text tokens on words on the internet. What if we predict action tokens on essentially what is the equivalent of the common crawl data set, but for interactivity?"
Pim De Witte (General Intuition) · World Models & General Intuition

In product terms, GI wants to directly replace the "player controller" inside a game engine:

"Really what we're doing at the moment is replacing essentially the player controller inside of a game engine. Anything that you're currently … deterministically coding, we hope to replace with a single API, which is just, you stream us frames and we predict actions — and that can be inside an engine or it can be eventually even inside the real world."
Pim De Witte (General Intuition) · World Models & General Intuition
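
The "stream us frames and we predict actions" shape, together with pure imitation learning, can be sketched with a lookup-table behavior cloner. This is an assumption-heavy toy and not GI's API: the tuple-of-ints `Frame`, the `ClonedPolicy` class, the `stream_actions` function, and the fallback action are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, Iterator

Frame = tuple[int, ...]  # toy stand-in for pixels (hypothetical)


class ClonedPolicy:
    """Lookup-table behavior cloning: replay the action humans took most often per frame."""

    def __init__(self, fallback: str = "W") -> None:
        self._counts: dict[Frame, Counter] = defaultdict(Counter)
        self._fallback = fallback  # arbitrary choice for frames never seen in the demos

    def fit(self, demos: Iterable[tuple[Frame, str]]) -> None:
        for frame, action in demos:
            self._counts[frame][action] += 1

    def predict(self, frame: Frame) -> str:
        seen = self._counts.get(frame)
        if not seen:
            return self._fallback
        return seen.most_common(1)[0][0]


def stream_actions(frames: Iterable[Frame], policy: ClonedPolicy) -> Iterator[str]:
    """Frames in, actions out: one predicted action token per incoming frame."""
    for frame in frames:
        yield policy.predict(frame)


policy = ClonedPolicy()
policy.fit([((0, 0), "W"), ((0, 0), "W"), ((0, 0), "A"), ((1, 0), "CLICK")])
actions = list(stream_actions([(0, 0), (1, 0), (9, 9)], policy))
print(actions)
```

A lookup table cannot generalize to unseen frames, which is why the third frame falls back to a default. A learned model would replace the table, but the streaming interface would stay the same.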

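GI's confidence in this path was strong enough that it turned down OpenAI's $500 million offer for Medal's game-clip data and built an independent lab instead (Khosla's largest single seed investment since OpenAI). Pim's logic is that the data volume is large enough to bet on imitation learning in parallel: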

"We essentially realized that we could get so far on just imitation learning. … We think we can essentially leap every single company that's forced to either be consumers of world models or build world models and take this foundation model bet for spatial-temporal agents …"
Pim De Witte (General Intuition) · World Models & General Intuition

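But he gives an honest qualifier on robot transfer: the bet is not transfer to high-degree-of-freedom robots, but that "the robot has to have gaming inputs":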

"… the key is that the robot has to have gaming inputs. So our bet is not that we can transfer over to like higher DOF robots and the keyboard and mouse. It's really just that we can move the hard work of pre-training hopefully to post-training."
Pim De Witte (General Intuition) · World Models & General Intuition

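Cold water · "Current models have no spatial intelligence at all": Andon Labs's empirical evidence

All three camps assume world models are on the way to the physical world. Andon Labs put models into a real spatial-reasoning task and ran into the opposite: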

"We took models and then we gave them 20 images of interior photographs of apartments. And then we asked them to like redesign the floor plan from that. … you need to reason about 3D space. And it turns out the models are absolutely horrible at this. No one scores statistically better than random chance."
Axel Backlund (Andon Labs) · Reality: The Final Eval
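
One way to make "statistically better than random chance" concrete is an exact one-sided binomial test. The sketch below assumes a multiple-choice-style score with a known guessing rate. The quote does not describe Andon's actual scoring format, so the function names, the 0.25 chance rate, and the 0.05 threshold are illustrative assumptions.

```python
from math import comb


def p_value_vs_chance(correct: int, trials: int, chance: float) -> float:
    """Exact one-sided binomial tail: P(X >= correct) if every answer were a guess."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def beats_chance(correct: int, trials: int, chance: float, alpha: float = 0.05) -> bool:
    """True when a score this high would be unlikely under pure guessing."""
    return p_value_vs_chance(correct, trials, chance) < alpha


# Hypothetical numbers: 20 questions, four options each, so the guessing rate is 0.25.
print(beats_chance(7, 20, 0.25))   # slightly above the expected 5 correct
print(beats_chance(12, 20, 0.25))  # well above the expected 5 correct
```

A score can sit above the chance average and still fail this test, which is the sense in which a model can look as if it is doing something while being indistinguishable from guessing.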

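This hits the weak spot of "world models lead to robotics / physical simulation" head-on: if models cannot even manage 2D→3D reconstruction, the promise of a "physical-world simulation engine" is still far off. It also corroborates Vinyals's own admission that he is unsure whether the concept of gravity is in the model.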

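What nobody spelled out

Nobody squarely reconciles the three camps' disagreement over "where the intelligence lives". Vinyals says the world model itself is the representation and the intelligence; Ethan He says the video model is a "dumb executor" and the intelligence is in the LLM; GI says the intelligence is in the behavior learned by imitation. This is a fundamental disagreement about what a world model is, but the three come at it from different entry points (research / product / data), so their arguments pass each other without ever colliding.
"Generating a world that looks right" ≠ "understanding the spatial structure of that world". This distinction is the crux of the whole topic, yet nobody states it outright. The worlds Genie generates look real to a non-expert, but Andon reports that models cannot reconstruct 3D, and even Vinyals is unsure whether the concept of gravity is in the model. If "generating correctly" and "understanding structure" really are decoupled, then "world model = the road to physical AGI" has to be downgraded to "a very strong video/game tool".
Genie's "60-second cap plus dynamism that decays over time" is a hard constraint that gets played down. A world that degrades after one minute is still orders of magnitude short of a "persistent interactive world". Nobody discusses whether this decay is an engineering problem or an architectural one; if it is the latter, the timeline for the whole approach needs to be re-estimated.
"The next paradigm" lacks a falsifiable milestone. Everyone says it leads to robotics and AGI, but nobody states what specific capability would count as having arrived: 10 minutes of continuous consistency? Zero-shot transfer to a real robot? Human-level 2D→3D reconstruction? Without a milestone, "the next paradigm" is only a narrative.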

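My take

Judgment (not fact): what is currently called a "world model" is at least three different things gathered under one word: Vinyals's concept representation (research bet), xAI's real-time video interaction (product bet), and GI's behavior imitation (data bet). In the short term (12–18 months), the one that produces commercial value will be xAI's line (video / generative UI, closest to existing consumer and creator markets, and pragmatic in admitting that "the intelligence is in the LLM"). In the medium term, the line with the most academic weight is Vinyals / GI (if concept representations or imitated behavior can really feed robots). But Andon's "no spatial intelligence" is the signal in this whole narrative that most deserves to be taken seriously. Reading the full transcripts made me more sure of this, because even Vinyals does not dare say that "gravity" is in the model. My current bet: world models will be a huge creative / gaming / UI tool, but the step to "physical-world AGI" lacks evidence.

Confidence: medium-low. The strongest support is that the three camps' definitions really are inconsistent (the quotes can be compared directly), plus the mutually corroborating doubts about structural understanding from Vinyals and Andon. The weakest part is "it will stop at being a tool and never reach physical AGI": that still rests mainly on a single source (Andon) plus one self-report from Vinyals, and more empirical evidence is needed to confirm that this is an architectural bottleneck rather than a data problem.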

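What I still want to know

A separation experiment for "generating a world that looks right" vs "understanding the world's structure". Take one world model and measure the realism of its generated video on one side, and on the other its answers to spatial-reasoning questions about the world it generated (which of these two objects is closer to the door?). High on the former and low on the latter would confirm Andon's finding and cool down the "physical AGI" narrative.
The root cause of Genie's 60-second decay. Is it an engineering constraint of attention/memory, or an architectural inevitability of lacking an explicit 3D representation? This determines the timeline for "persistent worlds".
Real transfer data on the robotics side. Vinyals bets on world model → robotics/planning, but the corpus contains no concrete case of "something a world model learned actually transferring to a physical robot task". Even a single early result would change the assessment substantially.
A falsifiable test of xAI's "intelligence is in the LLM, the video model is a dumb executor". Pair the same video model with a stronger vs a weaker LLM / prompt rewriter: how much does output quality differ? This would check whether Ethan He's "intelligence layering" actually holds.
Whether GI's imitation learning can transfer out of games. GI turned down $500 million and bet that game behavior transfers to the real world, yet it also admits "the robot has to have gaming inputs". A transfer result outside the game domain is needed to validate this $500M-scale bet.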
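
The first experiment in this list reduces to a two-score readout per model. A minimal sketch follows. The `EvalResult` fields, the 0.8 realism bar, and the 0.1 margin over chance are placeholder assumptions chosen for illustration, not values from any of the interviews.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    realism: float           # 0..1, e.g. share of raters who judged the clip real
    spatial_accuracy: float  # 0..1, accuracy on spatial questions about the same clip
    chance: float            # accuracy expected from guessing on those questions


def verdict(result: EvalResult, realism_bar: float = 0.8, margin: float = 0.1) -> str:
    """Classify one model by whether 'looks right' and 'understands structure' agree."""
    looks_right = result.realism >= realism_bar
    understands = result.spatial_accuracy >= result.chance + margin
    if looks_right and not understands:
        return "decoupled"      # convincing video, near-chance spatial answers
    if looks_right and understands:
        return "coupled"        # convincing video and above-chance spatial answers
    return "inconclusive"       # video not convincing enough to test the claim


print(verdict(EvalResult(realism=0.9, spatial_accuracy=0.27, chance=0.25)))
```

A "decoupled" verdict on a strong model would be the outcome that supports Andon's finding. A "coupled" verdict would count against the downgrade argued above.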

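Sources

The 5 core pieces were re-read in full from the verbatim transcripts this round, and every quote was checked against the transcript: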

Oriol Vinyals (Gemini co-lead, DeepMind) · 2026-05-25 · 36bea6160e71814d817ef4b66284a6cf
Pim De Witte (General Intuition) · 2025-12-09 · 2c4ea6160e71810a98e4cd35178e1269
Ethan He (xAI Grok Imagine) · 2026-06-06 · 377ea6160e7181afbc95fc1e523f979c
Andon Labs (Petersson & Backlund) · 2026-06-06 · 377ea6160e71814d8526e19e5bd5313e
DeepMind Genie 3 team · 2025-08-16 · 251ea6160e7181ff8ddbe5a20272b560

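Others matched by alias, weakly related, or not cited this round: Project Genie, Fal.ai (history of generative media), 4D Creation, Best-of-2025 roundtable, John Schulman, Khosla/Rabois Uncapped. A later headless full re-synthesis will automatically incorporate the full texts.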