Cross-topic thread

Reading thread: agents fall short of nine nines of reliability, so should they be sold to enterprises or to consumers?

Changelog

2026-07-16: Quote-fidelity fixes (site-wide audit), 5 in total: 4 quotes re-quoted (Chase / Taylor / Spiegel / Genie) and 1 attribution corrected (Diana Hu, not Garry Tan).

Cross-topic thread · connects long-horizon-agents × ai-moat-2026 · 2026-05-20

This thread

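The long-horizon-agents topic contains one fact that every camp agrees on: agents today are far less reliable than "nine nines", and nobody has figured out how to get there. The discussion is almost always framed from the technical side (RL, world models, or context engineering). Crossing over to ai-moat-2026 surfaces a question on an entirely different axis: what type of customer is a 90%-reliable agent fit to be sold to? This thread pulls on the triad of product form × customer type × moat, and finds that long-horizon unreliability is more than a technical problem: it directly determines that different AI-native products end up with differently shaped moats.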

The specific moments that connect the two topics

1. Harrison Chase's call on product form in the 90% era

In long-horizon-agents, Harrison Chase puts it most directly:

"The issue with agents is they aren't reliable to nine nines of reliability, but they can do a ton of work and more and more work over longer time horizons."
Harrison Chase · Context Engineering Our Way to Long-Horizon Agents

"If you can find these framings where they run for a long period of time but produce a first draft of something, those to me are the killer applications of long-horizon agents right now."
Harrison Chase · Context Engineering Our Way to Long-Horizon Agents

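The key bridging insight: Harrison Chase's "first draft + human review" pattern presupposes that when an error does occur, its cost is one the human can absorb. But that failure cost differs by orders of magnitude across customer scenarios, which turns the ai-moat-2026 moat debate from "what is a moat" into "under different failure costs, the moat grows in different places".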
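To put rough numbers on the reliability gap, here is a small Python sketch. It takes the 90% figure from the discussion above and reads "nine nines" as 99.9999999% reliability. The function names and the 1,000-task batch size are illustrative assumptions, not anything Chase specified.

```python
import math

def failure_rate_for_nines(n: int) -> float:
    """Failure probability at 'n nines' of reliability (n=2 means 99%, i.e. 0.01)."""
    return 10.0 ** (-n)

def expected_failures(failure_rate: float, tasks: int) -> float:
    """Expected number of failed tasks, assuming tasks fail independently."""
    return failure_rate * tasks

# 90% reliable is "one nine"; nine nines is the bar quoted above.
current_rate = failure_rate_for_nines(1)
target_rate = failure_rate_for_nines(9)

# How many orders of magnitude separate the two failure rates.
gap_orders = math.log10(current_rate / target_rate)

TASKS = 1_000  # hypothetical batch size, chosen only for illustration
failures_now = expected_failures(current_rate, TASKS)
failures_target = expected_failures(target_rate, TASKS)

print(f"gap: ~{gap_orders:.0f} orders of magnitude")
print(f"expected failures per {TASKS} tasks: "
      f"now={failures_now:g}, at nine nines={failures_target:g}")
```

Under these assumptions the two failure rates sit about eight orders of magnitude apart, which is why the practical question shifts from closing the gap to choosing settings where each failure is cheap.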

2. Camp B (replacing labor) vs Camp D (consumer brand): the gap in failure cost decides what is viable

In ai-moat-2026, two very different moat camps contrast sharply on this point:

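The B2B / labor-replacement camp (Bret Taylor, Jesse Zhang) bets on outcome-based pricing, which presupposes that customers are willing to pay by results. But on long-horizon tasks, what does a 10% failure rate mean in an enterprise setting?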

"And so my view is to the degree agents have a measurable outcome, outcome-based pricing feels like the secular business model for agents. … The reason why I don't think token-based makes sense is It's charging for an input that is uncorrelated with the output that your clients actually care about."
Bret Taylor · Uncapped #42 Bret Taylor from Sierra

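If a legal agent is right 90% of the time and hallucinates clauses the other 10%, no law firm would dare pay by outcome: a failed outcome 10% of the time does not mean "10% less money", it means the firm losing its license. This is how the "nine nines not yet in reach" finding from long-horizon-agents directly rules out outcome-based pricing for some B2B verticals in ai-moat.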

The B2C / brand camp (Cannon-Brookes, Spiegel) is in a completely different position:

"One of the things that's very special about Snapchat is it actually changed people's relationship with photography. So I'm dating myself, but back in the day, people used to take photos to save a moment. … Snapchat changed people's relationship with photos and made them about communicating."
Evan Spiegel · Snap CEO Evan Spiegel

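In consumer settings, the cost of a 10% failure is usually "try again" or "stop using it", so the failure cost is close to zero. This means the current reliability of long-horizon agents places almost no constraint on B2C product forms, but is a hard ceiling for high-liability B2B verticals.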
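The asymmetry between the two camps can be sketched as an expected-value calculation. This is a minimal illustration, not a pricing model from any of the speakers: the `Scenario` type, the three example scenarios, and every dollar figure are hypothetical, chosen only to show how the same 90% success rate flips sign as the cost of a failure grows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    value_per_success: float  # hypothetical value of one good outcome
    cost_per_failure: float   # hypothetical damage from one bad outcome

def expected_value(s: Scenario, success_rate: float) -> float:
    """Expected value per task: gains on success minus losses on failure."""
    return success_rate * s.value_per_success - (1.0 - success_rate) * s.cost_per_failure

def break_even_success_rate(s: Scenario) -> float:
    """Success rate p at which p*value - (1-p)*cost equals zero."""
    return s.cost_per_failure / (s.value_per_success + s.cost_per_failure)

# All figures below are made up for illustration.
consumer = Scenario("consumer app", value_per_success=1.0, cost_per_failure=0.1)
support = Scenario("customer support", value_per_success=10.0, cost_per_failure=20.0)
legal = Scenario("legal drafting", value_per_success=1_000.0, cost_per_failure=1_000_000.0)

SUCCESS_RATE = 0.90  # the reliability level discussed in the text

for s in (consumer, support, legal):
    ev = expected_value(s, SUCCESS_RATE)
    needed = break_even_success_rate(s)
    print(f"{s.name}: EV per task = {ev:,.2f}, break-even success rate = {needed:.4f}")
```

With these made-up numbers, the consumer and support scenarios stay positive at 90% while the legal scenario is deeply negative and would need a success rate above 99.9% just to break even. The exact thresholds depend entirely on the assumed costs; the point is only that failure cost, not reliability alone, decides which side of zero a vertical lands on.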

3. Lightcone's "startup-shaped holes" are the empirical anchor for this thread

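In ai-moat-2026, the point made by Diana Hu / Lightcone that enterprises are willing to accept startups lines up with the failure-tolerance strategy that long-horizon agents need today: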

"But when you do succeed and plug into the systems of record, the pot of gold is actually quite big, but it does take a long time."
Diana Hu (The Lightcone) · Inside The MIT AI Study

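Bridge: what the successful AI startups Lightcone mentions (Tactile, Greenlight, Castle AI, Reducto) have in common is not that their agents are more reliable than anyone else's. It is that in the verticals they chose, the cost of a 10% failure is manageable. Reducto processes PDFs: if it gets something wrong, roll back and run it again. Tactile / Greenlight are vertical enterprise AI: if they get something wrong, human review can catch it. This point is absent from the technical debate in long-horizon-agents, but it explains why these companies have commercial value before nine nines has been reached.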

4. The Genie team in the World Models camp inadvertently makes the same point

In long-horizon-agents, every demo from the World Models camp is set in games or creative work, a pattern that nobody in the English-language source material explains head-on:

"A year ago when we were thinking we'll get a minute of consistency for an autoregressive model like this in real time, people thought that was like our stretch goal kind of thing. … and people say, but a minute actually isn't long enough yet. I think that's a sign of the progress, right?"
Project Genie team · Project Genie: Create and Explore Worlds

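Bridge: Genie chose the games domain not only because data is plentiful, but also because failure cost there is close to zero. A bad frame gets skipped; an inconsistency means resetting the world. On the surface this looks nothing like Lightcone's "startup-shaped holes" (one is technical research, the other enterprise sales), but both give the same answer to "why is this vertical viable now": when reliability falls short, go first to the settings where failure is cheap.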

If you keep pulling

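Once the reliability constraint from long-horizon is connected to the moat forms from ai-moat, a two-dimensional table emerges that nobody states outright but that explains a lot: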

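| High failure cost | Low failure cost |
|---|---|
| High-liability B2B verticals (legal, medical, financial approvals): agents cannot take the lead yet and can only produce a "draft for human review"; the moat has to be built on *audit trail + human in the loop* | B2C consumer products (in the style of Spotify, TikTok, Roblox): agent failure costs next to nothing, so the moat can be brand / design / network effects (Camp D) |
| Medium-liability B2B (contract summaries, email triage, first drafts): Harrison Chase's "first draft for human review" pattern fits exactly; the moat is in workflow embedding (Tuhin's Camp C) | Creative work / games / simulation (Genie / World Models / Wanaka): near-zero failure cost; the moat is in *creative tools + distribution structure* (the observation by Zhang Yang (张阳) on distribution through personal networks) |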

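What this means concretely for builders: until long-horizon agents reach nine nines (by Harrison Chase's estimate, at least another 3–5 years away), B2C and "B2B verticals with manageable failure cost" are the markets where you can actually ship. Bret Taylor's Sierra works in the customer-support vertical (a mistake leaves the customer unhappy, not sued) not because its agent is more reliable than Harvey's legal agent, but because customer support sits in the right failure-cost band. The thread also points to the product form least worth building: betting on "agents replacing doctors / lawyers / financial approvers". The technical ceiling there is a hard wall, and nobody in the long-horizon-agents source material has said in what year that wall comes down.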

Sources

Harrison Chase (LangChain) · 2026-01-31 · 2f9ea6160e7181979ea0facede77a105
Bret Taylor (Sierra) · 2026-02-26 · 313ea6160e7181b985fcd3c0984d0d5a
Evan Spiegel (Snap) · 2026-04-14 · 342ea6160e7181198cf6ec5286a01ae3
The Lightcone / Diana Hu · 2025-11-02 · 29fea6160e71816cb845c95fc6f01108
Project Genie / DeepMind team · 2026-02-02 · 2fbea6160e71811390e5f85bea25a498
Tuhin Srivastava (Baseten) · 2026-05-11 · 35dea6160e718145a7a3c5263827a3bb