Topic Review

What's Next for Long-Horizon Agents

Changelog

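2026-07-16: Citation-fidelity fix (site-wide audit), 2 places, both quotes re-sourced. Kyle's three-factor fine-tuning framework was actually host Swyx's paraphrase; the entry now cites Kyle's own reply, with a note in the body text. The Genie team's "one minute of consistency" line was restored to the verbatim wording.
2026-06-11: Sourcing upgraded to full verbatim transcripts. Third-hand paraphrases from the podwise summary layer were replaced with the speakers' own words: Harrison's "first-draft killer application", Kyle's GRPO remark (not the "dead end" the summary claimed, but the specific complexity that parallel rollouts must run in exactly identical environments), and Tuhin's Jevons argument are all now verbatim. Will Brown's Multi-Turn episode was recorded in English, but podwise stored only a full Chinese translation, so there is no English text to quote verbatim; his points on reward hacking and "virtual tool calls" are therefore paraphrased.
2026-05-20: First review. Based on 9 interviews (Will Brown / Prime Intellect, Misha Laskin / ReflectionAI, Harrison Chase / LangChain, Kyle Corbitt / OpenPipe, Pim De Witte / General Intuition, the Google DeepMind Genie 3 team, Project Genie, Fal.ai, Tuhin Srivastava / Baseten).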

Mainstream consensus

Point one: agent unreliability is a structural problem, not a flaw of the current model generation.

"The issue with agents is they aren't reliable to nine nines of reliability, but they can do a ton of work and more and more work over longer time horizons."
Harrison Chase · Context Engineering Our Way to Long-Horizon Agents
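A rough illustration of why reliability bites harder as the horizon grows. It rests on a simplifying assumption that does not come from the interviews: every step of a task succeeds independently with the same probability. Under that assumption whole-task success is the per-step reliability raised to the number of steps. The function names are hypothetical.

```python
def task_success(step_reliability: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` over `steps` steps."""
    return target ** (1.0 / steps)


# How quickly does a 100-step task degrade at different per-step reliabilities?
for p in (0.9, 0.99, 0.999):
    print(f"per-step {p}: 100-step success = {task_success(p, 100):.4f}")

# What per-step reliability would a 99% reliable 100-step task need?
print(f"needed per step: {required_step_reliability(0.99, 100):.6f}")
```

Even 99% per-step reliability leaves a 100-step task succeeding only about 37% of the time under this model, which is one way to read why first-draft-plus-review framings are attractive.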

Point two: where long-horizon tasks actually work today is "produce a first draft for a human to review", not end-to-end automation.

"If you can find these framings where they run for a long period of time but produce a first draft of something, those to me are the killer applications of long-horizon agents right now. Coding is an example of that. Coding, you usually put up a PR. You don't directly push to prod… AISREs usually surface it to a human who comes in and then reviews it."
Harrison Chase · Context Engineering Our Way to Long-Horizon Agents

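Point three: reward hacking is a real training problem, and the more complex the reward function, the harder it is to fix.

Will Brown (Prime Intellect) observed that Claude 3.7 exhibits reward hacking: while completing the requested task it slips in unnecessary extra actions, and Anthropic is working to reduce this behavior in Opus. (The episode was recorded in English but podwise stored only a Chinese translation, so this is a paraphrase rather than a verbatim quote.)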

Where they disagree

Camp A · "Multi-turn RL is the final paradigm": the ReflectionAI / Prime Intellect position

Misha Laskin (ReflectionAI) puts it most bluntly:

"It became pretty clear to us that the next paradigm and effectively the final paradigm that we need to have in place before what people used to call AGI or now I think the goalposts have shifted to ASI is reached is just figuring out how to scale reinforcement learning on top of large language models."
Misha Laskin · Asimov: Building An Omniscient RL Oracle

But Misha himself also makes an admission that is awkward for the RL camp: generalization may not exist.

"Maybe the hot take is that there's no such thing as generalization. There's just bringing the test distribution into train."
Misha Laskin · Asimov

Will Brown and Johannes Hagemann (Prime Intellect) come at it from the infrastructure side: they are betting that open-source RL environments will be the next GitHub.

"RL allows you to trade off compute for data, in a sense, where you can get a lot of value out of a smaller amount of data by using more compute."
Will Brown · Building the GitHub for RL Environments

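Will is also the person in this camp who most openly acknowledges the risk of reward hacking. Two of his observations are worth noting (both come from the English-language episode for which podwise stored only Chinese, hence paraphrased). First, many "sensational AI safety results" are, in his view, really dilemmas: the model is given two conflicting goals and can only pick one, and either choice sounds bad. Second, rewarding tool use as such induces "virtual tool calls": the model calls a tool that does nothing for its reasoning, purely to collect the reward.

Camp B · "The scaffolding is doing most of the work": the LangChain position

Harrison Chase does not oppose RL directly, but he locates the reason long-horizon agents "suddenly work" outside the model, crediting context engineering (compaction, sub-agents, skills, and similar engineering techniques for managing context, which is also the title claim of his episode). His sharpest point is the essential difference between agents and traditional software: the application logic no longer lives entirely in the code.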

"When you're building software, all of the logic is in the code in the software and you can see it there. When you're building an agent, the logic for how your applications works is not all in the code. A large part of it comes from the model."
Harrison Chase · Context Engineering

Kyle Corbitt (OpenPipe) offers a middle-ground observation between Camp A and Camp B: RL matters, but the engineering path is not as simple as "training just eats data".

"One pro is just sort of like operational simplicity. Like there's a whole extra model you need for this value model you need for PPO that you can throw away with GRPO. … So the way GRPO works is you have to do a set of different trajectories or a set of different rollouts all in parallel with the exact same environment, the exact same conditions, and then you score each of them."
Kyle Corbitt · Why Fine-Tuning Lost and RL Won
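A minimal sketch of the mechanics Kyle describes: several rollouts from exactly the same environment state, each scored, with the group itself acting as the baseline that PPO would otherwise get from a value model. The normalization used here (reward minus group mean, divided by group standard deviation) is the commonly described GRPO advantage; the toy environment, the reward, and all function names are hypothetical and are not OpenPipe's implementation. The policy update, clipping, and KL terms are omitted.

```python
import random
import statistics


def rollout(env_seed: int, policy_seed: int) -> float:
    """Toy episode: the environment hides a digit, the policy guesses it."""
    env = random.Random(env_seed)        # same seed => identical conditions
    target = env.randint(0, 9)
    policy = random.Random(policy_seed)  # only the policy's sampling varies
    guess = policy.randint(0, 9)
    # Score in [0, 1]: an exact guess earns 1.0, farther guesses earn less.
    return 1.0 - abs(guess - target) / 9.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group; no value model needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


ENV_SEED = 0  # every rollout in the group shares this environment
rewards = [rollout(ENV_SEED, policy_seed) for policy_seed in range(8)]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

The comparison is only meaningful because every rollout faced the same hidden digit. If `ENV_SEED` differed per rollout, a high reward could reflect an easier environment rather than a better trajectory, which is the operational constraint the quote points at.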

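On "when should you fine-tune", the widely circulated three-factor framework of "cost, latency, quality consistency" actually comes from host Swyx, who on the show restates advice Kyle gave in an earlier AI Engineer World's Fair talk (Kyle himself never said it in those words). Kyle's response: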

"Yeah, I mostly stand by that. I don't think it's changed. … But the main one I see that really drives fine-tuning is if you have to move to a smaller model, and it's typically for latency reasons, and this is usually like real-time voice. … I would say for 90% of use cases where you aren't forced to a smaller model, then it's still not a good ROI and you probably shouldn't invest in it today."
Kyle Corbitt · Why Fine-Tuning Lost and RL Won

Camp C · "World models are the infrastructure": the General Intuition / Google DeepMind position

Pim De Witte (General Intuition) is betting on a direction distinct from both the RL and the context routes:

"What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and based on the action that you take, generates the next state, the next frame."
Pim De Witte · World Models & General Intuition

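GI's argument rests on data plus behavior imitation: 3.8B game clips, and superhuman behavior emerging with just 4 seconds of spatial memory. They turned down a $500 million data acquisition offer from OpenAI to build their own lab.

The Google DeepMind Genie 3 researchers describe the same direction, but framed more as infrastructure: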

"It's at the point where a human who is not an expert will watch it and think it looks real."
Genie 3 team · Google DeepMind Lead Researchers on Genie 3

"A year ago when we were thinking we'll get a minute of consistency for an autoregressive model like this in real time, people thought that was like our stretch goal kind of thing. … and people say, but a minute actually isn't long enough yet. I think that's a sign of the progress, right?"
Project Genie team · Project Genie: Create and Explore Worlds

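One caveat: Camp C's applications are currently almost entirely in games and visual generation, and the corpus does not spell out a path from there to general agentic tasks.

Undercurrent · "Application layer vs. general agents": Misha's other thread

Within Camp A, Misha Laskin also marks a subtle difference from the other RL proponents: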

"Just because a company might have the best model in some general set of academic benchmarks, doesn't actually mean they have the best product."
Misha Laskin · Asimov

"When you look at what an engineer does in an organization, 80% of the time they're spending trying to comprehend complex systems and collaborating with teammates."
Misha Laskin · Asimov

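Asimov is a comprehension agent, not a code generation agent. As a product direction this differs tactically from Will Brown's general multi-turn RL route, even though both describe themselves as RL proponents.

Finally, Tuhin ties the topic to inference economics: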

"From the developer's perspective, they would insert more intelligence if you make it cheaper. They will insert more intelligence anyway, but if you make it more cheaper, they'll insert a hell of a lot more intelligence."
Tuhin Srivastava · Baseten CEO on the AI Inference Crunch

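What no one fully addressed

When does "nine nines of reliability" come within reach? Harrison Chase says plainly that it is not achievable today. But no one offers a view on the timescale for going from 90% to 99% to 99.9%. This is the key variable for commercialization, and nobody answers it head-on.
Is there a convergence point between the RL route and the world-models route? Both talk about "environments", but the RL camp's environments are designed around task reward, while the world-models camp's are designed around state representation. Neither side says whether the two lines of work can merge, or which would absorb the other first.
If generalization really does not exist (Misha's "hot take"), how do you deploy an agent trained with multi-turn RL? Misha raises the question but does not follow through. The RL camp's whole argument depends heavily on training transferring to the production distribution; if the hot take is right, that argument is shaken.
How severe is reward hacking, really? Will Brown gives a qualitative description (present in Claude 3.7, being reduced by Anthropic). There are no numbers: in what share of tasks does the model take shortcuts? How many environment iterations does one fix take? This is an engineering unknown unknown.
Who gets the credit, the context-engineering camp or the RL camp, and where would the engineering evidence come from? Harrison Chase credits harness improvements for long-horizon agents becoming usable; the RL camp credits the training paradigm. No experiment separates the two: same model, same harness, with vs. without RL post-training, how big is the gap?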
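The missing controlled comparison can be framed as a 2×2 factorial design. The sketch below shows how the credit would be split once the four cells are measured. The success rates are made-up placeholders, not results from any interview or benchmark, and the function name is hypothetical.

```python
def factorial_effects(success):
    """Split a 2x2 ablation into main effects and an interaction term.

    `success` maps (rl_post_training, context_engineering) booleans
    to a measured task success rate for that configuration.
    """
    base = success[(False, False)]
    rl_only = success[(True, False)]
    ce_only = success[(False, True)]
    both = success[(True, True)]
    return {
        # Average gain from adding RL, with and without context engineering.
        "rl_effect": ((rl_only - base) + (both - ce_only)) / 2,
        # Average gain from adding context engineering, with and without RL.
        "ce_effect": ((ce_only - base) + (both - rl_only)) / 2,
        # Extra gain from RL when context engineering is present vs. absent.
        "interaction": (both - ce_only) - (rl_only - base),
    }


# PLACEHOLDER numbers purely to exercise the arithmetic; not real measurements.
placeholder = {
    (False, False): 0.40,
    (True, False): 0.50,
    (False, True): 0.55,
    (True, True): 0.70,
}
print(factorial_effects(placeholder))
```

A non-zero interaction term would mean the two camps' contributions are not simply additive, in which case "who gets the credit" has no clean answer even with the data in hand.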

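My take

Judgment (not fact): the three routes will run in parallel in the near term rather than converge. Over 1 to 2 years, Camp B (context engineering) does the vast majority of the work in production, because it can be layered onto any frontier model and iterates quickly. Over 3 to 5 years, Camp A's RL post-training will yield more reliable agents on vertical tasks, but the boundary of generalization will be much narrower than today's optimistic expectations (I would bet Misha's "hot take" is true). Over 5 to 10 years, Camp C (world models) could become the real infrastructure for long-horizon work, but today it lives entirely in the game and visual domains, and no one has a credible roadmap for the cross-domain transfer step.

Confidence: medium-low. I have high confidence in the near-term claim that context engineering dominates in production (the evidence is already in); moderate confidence in the mid-term judgment about RL's limits (it depends on the generalization assumption); and markedly low confidence in the long-term world-models path, given the lack of real cross-domain applications.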

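What I still want to know

An ablation study: same task, same model, with vs. without RL post-training and with vs. without context engineering. Right now every interview argues for its own side's contribution, but no one shows data on how much is lost when the other side is absent.
A quantitative baseline for reward hacking: on tasks like SWE-bench and Tau-bench, what share of the time the model takes shortcuts, and how many rounds of reward-function iteration a fix takes on average. Will Brown raised the issue, but no one has given numbers yet.
Real cross-domain attempts for world models: early experiments where Genie or GI apply the spatial memory and action prediction learned in games to non-game domains (robotics excluded). If none appears within three years, Camp C's relevance to the "general agent" discussion drops sharply.
An empirical test of Misha's "no generalization" hot take: pick a concrete task (for example, train on JavaScript code review, then run on Python) and measure how well RL post-training transfers. This is the crux of whether Camp A's argument holds.
The actual progress curve toward "nine nines" reliability: which tasks went from 50% to 80% to 95% to 99%, and how long each step took. The shape of the reliability curve itself (exponential approach vs. step-wise) determines the time window for commercializing agents.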

Sources

Misha Laskin (ReflectionAI / Asimov) · 2025-07-25 · 23bea6160e71810e914efc5645d97690
Will Brown (Prime Intellect, Multi-Turn RL) · 2025-06-18 · 216ea6160e7181d68d72ede27715b1e1
Will Brown / Johannes Hagemann (Prime Intellect, RL Environments) · 2026-02-26 · 313ea6160e71817c84d2f1cdcd003ce2
Harrison Chase (LangChain) · 2026-01-31 · 2f9ea6160e7181979ea0facede77a105
Kyle Corbitt (OpenPipe) · 2025-10-17 · 28fea6160e718186bfe4e90dde1d3a7d
Pim De Witte (General Intuition) · 2025-12-09 · 2c4ea6160e71810a98e4cd35178e1269
Google DeepMind Genie 3 team · 2025-08-16 · 251ea6160e7181ff8ddbe5a20272b560
Project Genie team · 2026-02-02 · 2fbea6160e71811390e5f85bea25a498
Tuhin Srivastava (Baseten) · 2026-05-11 · 35dea6160e718145a7a3c5263827a3bb