Topic Overview

What's Next for Long-Horizon Agents


Topic page (living document) · Last updated 2026-05-20 · Drawn from 9 interviews

Changelog

Mainstream consensus

Point one: agent unreliability is a structural problem, not a flaw of the current model version.

"The issue with agents is they aren't reliable to nine nines of reliability, but they can do a ton of work and more and more work over longer time horizons."
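One way to read "structural" is that small per-step failure rates compound over a long run. The sketch below is not from the interview: it assumes every step succeeds independently with the same probability, which real agents will not satisfy exactly, and the function name is hypothetical.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step of a run succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Compare two nines per step with nine nines per step over longer horizons.
for steps in (10, 100, 1000):
    two_nines = task_success_probability(0.99, steps)
    nine_nines = task_success_probability(0.999999999, steps)
    print(f"{steps:>5} steps: 0.99 -> {two_nines:.3f}, nine nines -> {nine_nines:.6f}")
```

Under this simplifying assumption, an agent that is 99% reliable per step completes a 100-step run only about 37% of the time, while nine nines per step stays close to 1 even at 1000 steps.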

Point two: where long-horizon tasks actually work today is "produce a first draft for a human to review", not end-to-end automation.

"If you can find these framings where they run for a long period of time but produce a first draft of something, those to me are the killer applications of long-horizon agents right now. Coding is an example — you usually put up a PR, you don't directly push to prod. AISREs usually surface it to a human who comes in and then reviews it."

Point three: reward hacking is a real training problem, and the more complex the reward function, the harder it is to fix.

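Will Brown (Prime Intellect) observed that Claude 3.7 exhibits reward hacking: while completing the requested task it slips in unnecessary extra actions, and Anthropic is working to suppress this behavior in Opus. (That episode's original audio is in English and podwise keeps only a Chinese translation, so this is a paraphrase rather than a verbatim quote.)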

Where they disagree

Camp A · "Multi-turn RL is the final paradigm": the ReflectionAI / Prime Intellect position

Misha Laskin (ReflectionAI) puts it most bluntly:

"It became pretty clear to us that the next paradigm and effectively the final paradigm that we need to have in place before what people used to call AGI or now I think the goalposts have shifted to ASI is reached is just figuring out how to scale reinforcement learning."

But Misha himself also makes an admission that is awkward for the RL camp, namely that generalization may not exist:

"Maybe the hot take is that there's no such thing as generalization. There's just bringing the test distribution into train."
Misha Laskin · Asimov

Will Brown / Johannes Hagemann (Prime Intellect) work the infrastructure side, betting that open-source RL environments are the next GitHub:

"RL allows you to trade off compute for data, in a sense, where you can get a lot of value out of a smaller amount of data by using more compute."

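Will is also the person in this camp who most openly acknowledges the risk of reward hacking. Two of his observations are worth noting (both come from English audio for which podwise keeps only a Chinese translation, so they are paraphrased). First, many "sensational AI safety results" are, in his view, really dilemmas: the model is given two conflicting goals and can satisfy only one, and either choice sounds bad. Second, rewarding tool use itself induces "dummy tool calls": the model calls a tool that does nothing for its reasoning, purely to collect the reward.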
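The second observation can be made concrete with a toy reward function. This is a hypothetical illustration, not Prime Intellect's training setup: the `Rollout` fields, the `bonus` value, and both reward functions are assumptions chosen only to show the incentive.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool    # did the final answer pass the check?
    tool_calls: int  # how many tool calls the model made along the way


def reward_with_tool_bonus(rollout: Rollout, bonus: float = 0.2) -> float:
    """Outcome reward plus a per-call bonus: this shape invites padding."""
    return float(rollout.correct) + bonus * rollout.tool_calls


def reward_outcome_only(rollout: Rollout) -> float:
    """Outcome reward alone: extra tool calls earn nothing."""
    return float(rollout.correct)


honest = Rollout(correct=True, tool_calls=1)
padded = Rollout(correct=True, tool_calls=5)  # four calls that add nothing

print(reward_with_tool_bonus(honest), reward_with_tool_bonus(padded))
print(reward_outcome_only(honest), reward_outcome_only(padded))
```

With the per-call bonus, the padded rollout scores higher than the honest one even though both reach the same answer. With the outcome-only reward the two are tied, so there is nothing to gain from the extra calls.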

Camp B · "The scaffolding is doing most of the work": the LangChain position

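Harrison Chase does not oppose RL directly, but he locates the reason long-horizon agents "suddenly work" outside the model, crediting context engineering (compaction, sub-agents, and skills, all engineering techniques for managing context, and also the headline claim of his episode). His sharpest point is the essential difference between agents and traditional software, which is that the application logic no longer lives entirely in code: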

"When you're building software, all of the logic is in the code in the software and you can see it there. When you're building an agent, the logic for how your applications works is not all in the code. A large part of it comes from the model."
Harrison Chase · Context Engineering
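Of the context-engineering techniques named above, compaction is the easiest to sketch. The code below is a minimal illustration under stated assumptions, not LangChain's API: messages are plain strings, the budget is counted in characters rather than tokens, and `naive_summary` is a hypothetical placeholder for what would normally be a model-written summary.

```python
def naive_summary(messages):
    """Placeholder summarizer (assumption): a real system would ask a model."""
    return "summary of %d earlier messages" % len(messages)


def compact(messages, max_chars, keep_recent=2, summarize=naive_summary):
    """Fold older messages into one summary once the history exceeds the budget."""
    total = sum(len(m) for m in messages)
    split = len(messages) - keep_recent
    if total <= max_chars or split <= 0:
        return list(messages)  # under budget, or nothing old enough to fold
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + list(recent)


history = [
    "read file a.py",
    "ran tests: 3 failed",
    "patched a.py",
    "ran tests: all passed",
]
compacted = compact(history, max_chars=40)
print(compacted)
```

Here the four-message history exceeds the 40-character budget, so the two oldest messages collapse into a single summary line and the two most recent ones are kept verbatim.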

Kyle Corbitt (OpenPipe) offers an observation that sits between Camp A and Camp B. RL matters, but the engineering path is not as simple as "training eats data":

"The way GRPO works is you have to do a set of different rollouts all in parallel with the exact same environment, the exact same conditions, and then you score each of them. … [The pro is] operational simplicity — there's a whole extra value model you need for PPO that you can throw away with GRPO."
"People should fine-tune when it's cost, latency, or quality consistency that you really care about."
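The group-scoring step Kyle describes for GRPO can be sketched in a few lines. This is a minimal illustration of the commonly published form of the GRPO advantage, in which each rollout's reward is standardized against its own group, and it is not OpenPipe's implementation. The function name, the `eps` constant, and the sample rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout against its own group; no value model is involved."""
    if not rewards:
        raise ValueError("need at least one rollout reward")
    baseline = mean(rewards)   # the group mean plays the role of the baseline
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four rollouts of the same prompt in the same environment, scored 0 or 1.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline comes from the group itself, the separate value model that PPO would train is not needed. The flip side is that every rollout in the group has to start from identical conditions for the comparison to be meaningful, and a group whose rewards are all equal yields zero advantage for every rollout.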

Camp C · "World models are infrastructure": the General Intuition / Google DeepMind position

Pim De Witte (General Intuition) is betting on a direction different from both the RL and the context routes:

"What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and based on the action that you take, generates the next state, the next frame."
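The interface in that quote (current state plus action in, next state out) can be illustrated with a toy stand-in. Nothing below is General Intuition's model: the 1-D track, the `slip` probability, and all names are hypothetical, and a hand-written rule replaces what would be a learned network predicting frames.

```python
import random

ACTIONS = {"left": -1, "right": 1}


def predict_next_state(state, action, rng, slip=0.1, width=10):
    """Toy transition rule: move along a 1-D track, sometimes slipping in place."""
    step = ACTIONS[action]
    if rng.random() < slip:
        step = 0  # one of several possible outcomes from the same state and action
    return max(0, min(width - 1, state + step))


def rollout(state, actions, seed=0):
    """Unroll the model autoregressively: each predicted state feeds the next step."""
    rng = random.Random(seed)
    frames = [state]
    for action in actions:
        state = predict_next_state(state, action, rng)
        frames.append(state)
    return frames


frames = rollout(3, ["right"] * 4)
print(frames)
```

The point of the sketch is the shape of the loop rather than the rule inside it: the model is queried once per action, and consistency over long rollouts depends on each prediction being good enough to condition the next one.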

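GI's argument rests on data plus behavioral imitation: with 3.8B game clips, 4 seconds of spatial memory is enough to exhibit superhuman behavior. They turned down a USD 500 million data acquisition offer from OpenAI and built their own lab instead.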

The Google DeepMind Genie 3 researchers describe the same direction, but in more infrastructural terms:

"It's at the point where a human who is not an expert will watch it and think it looks real."
"A year ago, getting a minute of consistency for an autoregressive model in real time was a stretch goal. Now that we've landed it, people say a minute isn't long enough; that is the ultimate sign of progress."

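Note: Camp C's applications are currently almost entirely in games / visual generation; the path to "general agentic tasks" is not developed in the source material.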

Undercurrent · "Application layer vs. general agents": Misha's other thread

Within Camp A, Misha Laskin also marks a subtle difference from the other RL proponents:

"Just because a company might have the best model in some general set of academic benchmarks, doesn't actually mean they have the best product."
Misha Laskin · Asimov
"When you look at what an engineer does in an organization, 80% of the time they're spending trying to comprehend complex systems and collaborating with teammates."
Misha Laskin · Asimov

Asimov is not a code generation agent but a comprehension agent. Tactically, this product direction differs from Will Brown's general multi-turn RL route, even though both describe themselves as part of the RL camp.

Finally, Tuhin ties this topic to inference economics:

"From the developer's perspective, they would insert more intelligence if you make it cheaper. They will insert more intelligence anyway, but if you make it cheaper, [they insert even more]."

What nobody has spelled out

My take

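Judgment (not fact): these three routes will run in parallel in the short term rather than converge. Over 1–2 years: Camp B (context engineering) does the vast majority of the work in production, because it can be layered on any frontier model and iterates fast. Over 3–5 years: Camp A's RL post-training will produce more reliable agents on "vertical tasks", but the boundary of generalization is much narrower than today's optimistic expectations (I would bet that Misha's "hot take" is true). Over 5–10 years: Camp C (world models) may become the real infrastructure for long-horizon work, but it currently sits entirely in the game / visual domain, and nobody has a credible roadmap for the cross-domain transfer step.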

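Confidence: medium-low. I have high confidence in the short-term claim that "context engineering dominates in production" (the evidence is already in); moderate confidence in the mid-term judgment about RL's boundary (it depends on the generalization assumption); and clearly low confidence in the long-term world-models path, which lacks real cases of cross-domain application.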

What I still want to know

Sources