Topic overview

Inference Economics

Changelog

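2026-07-27: Added 6 episodes (20VC Open vs Frontier / Modal CTO Akshat Bubna / Accel / 20VC App Layer Glean / Lab of the Future / AI Memory Dan Biderman). Almost everything in this round reinforces existing judgments rather than rewriting them. Jevons (mainstream consensus, point three) picked up three independent front-line billing testimonies at once: Glean's Arvind Jain (annual AI budgets are overrun within a month or two), Accel (seven times more companies are being told by their CFO to "let it rip", plus a bullish call of $500M per month in token spend), and Sierra's Clay Bavor (top engineers burn $100k a year on tokens, and token spend will rise from 3.8% to 20% of engineering salary). Model routing / quality-per-dollar (point four) got further support from Glean (which sells "pick the cheaper model" as a core value proposition: cost control) and Sierra (mixing models by task), bringing the count of independent sources to seven or eight. Two things genuinely complicate the picture. First, Clay Bavor and Accel offer the strongest rebuttal so far to the "price converges to compute cost" endgame: demand for frontier intelligence is unbounded, and the cost floor is energy and compute (new Camp J). Second, Modal open-sourced and upstreamed its own inference performance edge (DFlash), flattening with its own hands the "software-layer moat" that Camp A bets on. Dan Biderman, on AI Memory, reinforced Camp H with a systems-level number: an 80 GB KV cache. Lab of the Future (Lila Science) is only weakly related to this topic and is not cited.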
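2026-07-16: Quote-fidelity fixes (site-wide audit), 9 items: 2 re-quoted verbatim (Noam Brown, Dan Biderman); 2 attributions corrected ("zero marginal cost" is Anish Acharya's, not Josh Elman's; the Opus 4.7 price comparison is Gabe Pereyra's, not Ev Randle's); the other 5 were checked against the transcripts and confirmed to match the original wording. Second-pass addendum: 1 item (1 re-quoted, 0 downgraded to paraphrase). The four-line exchange in Camp H (Engram) had Shaun Maguire's omitted line restored ("Sonya works at Fireworks. She really loves it."), removing an unmarked internal cut.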
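2026-07-06: Added 8 episodes (Harvey/Gabe Pereyra, Benchmark AI Bets, Mark Chen, Kevin Weil, Engram, Josh Elman, Self-Driving Lab, Noam Brown). The material changed in substance. Harvey and Benchmark lifted "model routing / quality-per-dollar" from a fringe topic into a new point of mainstream consensus (see point four under "Mainstream consensus") and supplied concrete billing numbers the corpus previously lacked ($20 for a single query, $20,000 for a single review, a "$10M bill"). Benchmark's "high gross margin is a red flag" and "distillation brings open source to within 95%" directly reinforce McGrew's "price = compute cost" endgame (Camp D) and Stephanie's "the open/closed gap decides the fate of inference providers" (Camp B). Engram proposes a third path, "move cost from inference to training", while Noam Brown's test-time compute turns "inference cost" itself into a variable that slides with budget; together these two complicate the Jevons discussion under "What nobody has nailed down". Mark Chen (Cooking with OpenAI) and Self-Driving Lab are only weakly related to this topic and are not cited.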
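2026-06-11: Sourcing upgraded to full verbatim transcripts. Third-hand paraphrases from the podwise summary layer were replaced with original wording: Tri Dao's "maybe 100x", his point that workloads are converging on Transformer/MoE and opening the door to chip rivals, and Tuhin's Jevons remarks are now all verbatim. The Redpoint, Stephanie, and McGrew episodes were recorded in English but podwise holds only a Chinese translation, so there is no English text to quote verbatim; these are given as paraphrase (McGrew's "lawyers are no longer scarce" analogy is kept as a paraphrased point).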
2026-05-20: First overview. Based on 9 interviews (Baseten Tuhin Srivastava, Together.AI Tri Dao, Fal.ai, vLLM/Inferact, Nvidia Brev/Dynamo, Bob McGrew, Redpoint, Stephanie Palazzolo, a16z Nathan Labenz).

Mainstream consensus

Point one: inference cost is falling by roughly an order of magnitude per year, and almost nobody disputes that magnitude.

"In the last couple years inference cost has probably come down, maybe 100x …"
「过去这几年，推理成本大概降了，也许 100 倍……」
Tri Dao · Ep 74: Chief Scientist of Together.AI

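This round, Sierra's Clay Bavor gave the most striking concrete multiple for this point. Using GPT-4-level intelligence as the anchor, he measured how far the unit price of an equal-quality token has collapsed: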

"It's now 1,300th the cost for an intelligence equivalent token."
「现在智能等价令牌的成本降低到了 1,300 分之一。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

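Redpoint's investors gave a judgment of the same magnitude (English audio, but podwise holds only a Chinese translation, hence paraphrased): LLM inference and training costs fall about 10x per year, which directly improves the margin structure of application companies built on top.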
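The three figures above (Tri Dao's "maybe 100x" over "the last couple years", Clay Bavor's 1/1300, and Redpoint's roughly 10x per year) can be put on one scale with a little compounding arithmetic. The following is a minimal sketch, not something any of the speakers presented. It assumes that "a couple years" means exactly two years and that the decline is a steady geometric rate; the function names are illustrative.

```python
import math


def annualized_factor(total_factor: float, years: float) -> float:
    """Constant per-year decline factor that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years)


def years_to_reach(total_factor: float, annual_factor: float = 10.0) -> float:
    """Years of steady decline at `annual_factor` per year needed to reach `total_factor`."""
    return math.log(total_factor) / math.log(annual_factor)


# Assumption: "the last couple years" is read as exactly 2 years.
print(f"100x over 2 years is {annualized_factor(100, 2):.1f}x per year")

# Assumption: a steady 10x-per-year pace; the source gives no time span for the 1/1300 figure.
print(f"1300x at 10x per year takes about {years_to_reach(1300):.2f} years")
```

Under these assumptions the three claims are mutually consistent: 100x in two years is the same pace as 10x per year, and a 1,300x drop corresponds to a little over three years at that pace.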

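Point two: the inference workload itself is changing in kind, from static, uniformly sized requests to dynamic, agentic, long-context ones, which raises entirely new scheduling and memory problems.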

"What if the hardest problem in artificial intelligence isn't training smarter models, but simply keeping them running?"
「如果说人工智能领域最难的问题不是训练更聪明的模型，而是简单地保持它们的运行呢？」
Matt Bornstein (a16z; host's opening narration, not a guest quote) · Inferact: Building the Infrastructure That Runs Modern AI

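This round, Modal's Akshat Bubna confirmed this from the provider front line: their largest use case is already elastic inference, traffic is diurnal and bursty, and "zero to one" scale-up has given way to going from 1,000 GPUs to 1,500 within an hour, which is exactly the new scheduling problem that the changed workload creates. Dan Biderman, on AI Memory, gave a startling magnitude from the memory side: the KV cache consumed by long-context inference is on the same order of magnitude as the entire set of model weights (see Camp H).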

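Point three: the Jevons paradox is at work. Lower cost has not compressed total spend; it has pulled developers toward more complex, more compute-hungry agentic workflows. Harvey's Gabe Pereyra gave the bluntest front-line testimony on this: the agentic shift has pushed per-unit bills to absurd levels: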

"We have simple like assistant type queries where you say, you know, draft me a document that a single query can cost $20. We have like a review product where you can upload 100,000 contracts and ask the models to review them. And some of those can cost $20,000. And so it is just getting incredibly expensive."
「我们有一些简单的助手类型查询，比如说，给我起草一份文件，而单个查询可以花费 20 美元。我们还有一个审查产品，你可以上传 100,000 份合同并要求模型进行审查。其中一些可能会花费 20,000 美元。所以这变得非常昂贵。」
Gabe Pereyra · Harvey Co-Founder on the Token Pricing Reckoning

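This round, the Jevons point picked up three unrelated front-line billing testimonies at once. Glean's Arvind Jain (enterprise buyer's view) says budgets simply cannot contain the spend: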

"given that AI has become so expensive, we hear stories all the time about companies coming up with a annual budget for AI and they run past that within a month or two."
「考虑到 AI 变得如此昂贵，我们经常听到公司制定 AI 的年度预算，并且在一个月或两个月内就超支。」
Arvind Jain · 20VC: Why OpenAI and Anthropic Won't Win the App Layer

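The Accel growth team (investor's view) surveyed developers about what their CFOs were telling them. "Spend more" beat "spend less" overwhelmingly, and the team's own outlook is bullish to an absurd degree (the transcript labels the speaker only as Speaker, hence the attribution to the team):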

"we did this survey of developers and we asked them how many of them had a CFO who was telling them to spend less on consumption versus more. And we actually found that seven times more companies are being told to like let it rip and spend more."
「我们对开发者进行了一项调查，询问他们中有多少人有 CFO 告诉他们在消费上要减少开支，而不是增加。我们实际上发现，七倍的公司被告知要尽情花费，增加开支。」
Accel growth team · Accel: The Quiet Firm Behind Facebook, Cursor, Nebius…

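Sierra's Clay Bavor (inside-the-company view) turns the Jevons point into a checkable ratio forecast: token spend is moving from a rounding error relative to engineering salary toward a large share of it: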

"top engineers who are really leaning in to Claude Code, Codex, and so on, are spending more than $100,000 on a run rate basis on tokens per year. That's a meaningful fraction of an engineering salary."
「那些真正投入 Claude Code、Codex 等的顶尖工程师，年均在代币上的支出超过 100,000 美元。这在工程师薪水中是一个相当可观的比例。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

"I would not bet on 3.8%. I would bet on much closer to 20%."
「我不会押注于 3.8%。我会赌更接近 20%。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

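Point four: "which model is best" is giving way to "which model solves this task at the lowest price"; model routing / quality-per-dollar has become a new layer of the economics. This used to be just one facet of Baseten's argument. It has now been pushed into consensus independently by Harvey, Benchmark, Kevin Weil, Josh Elman, Glean, and Sierra, from unrelated positions across enterprise, investing, research, and consumer. Gabe Pereyra describes it as a phase change from "capability constrained" to "cost constrained":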

"Up until recently, we've always been capability constrained. And so we always wanted to use the largest model … But that has changed in the past six months where now we are consuming like a huge number of tokens."
「直到最近，我们一直受到能力的限制。所以我们总是希望使用最大的模型……但在过去六个月中，这种情况发生了变化，现在我们正在消耗大量的令牌。」
Gabe Pereyra · Harvey Co-Founder on the Token Pricing Reckoning

"Increasingly, it's not just which model is the best; it's which model can solve the task at the lowest price point."
「越来越多的人关心的不是哪种模型最好，而是哪个模型能以最低的价格解决任务。」
Gabe Pereyra · Harvey Co-Founder on the Token Pricing Reckoning

This round, Glean's Arvind Jain raised this from "engineering practice" to "business model": routing itself is the core value they sell to enterprises (cost control):

"90% or greater of use cases can now be fully handled by many, many different models, including open source models. So there is definitely commoditization from that perspective. In fact, like, you know, we at [Glean], that's actually one of our core value adds to our customers, you know, which is cost control. … we actually pick the right model for you. And if you're okay with using open source models, we'll use them, you know, when we think it's appropriate, when it's going to generate a high quality answer."
「90% 或更高的用例现在可以通过许多不同的模型，包括开源模型，完全处理。因此，从这个角度来看，显然是商品化。实际上，像你知道，我们在 Glean，这实际上是我们为客户提供的核心价值之一，你知道，成本控制。……我们实际上会为你选择合适的模型。如果你愿意使用开源模型，我们会在认为合适的时候使用它们，以生成高质量的答案。」
Arvind Jain · 20VC: Why OpenAI and Anthropic Won't Win the App Layer

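Benchmark's Ev Randle gave the mirror image of the same thing from the investor's seat. He goes as far as recasting "high gross margin" as a red flag in the AI era, because it means nobody is really using your AI: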

"AI inference costs a lot of money. And if you have an AI product with high gross margins, that means that no one's using your AI features."
「AI 推理成本很高。如果你有一个毛利率高的 AI 产品，这意味着没有人在使用你的 AI 功能。」
Ev Randle · Benchmark's AI Bets

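OpenAI's Kevin Weil points to the root of this layer of economics: unlike traditional software, inference does not have near-zero marginal cost, and every call burns real money. That is both why B2B took off first and why routing has to exist: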

"Models also cost money to use. In a relative sense, much more than like traditional, you know, get a database and pay network costs and stuff like that. You start having costs right away as a business …"
「使用模型也需要花费。在相对意义上，比传统的……获取数据库和支付网络费用要贵得多。作为一家公司，你一开始就会产生成本……」
Kevin Weil · AI Is Crossing the Frontier of Human Knowledge
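The routing idea behind point four ("which model can solve the task at the lowest price point") can be sketched as a constrained minimization: among the models judged good enough for a task, take the cheapest. The following is a minimal illustration only. The model names, prices, and quality scores are hypothetical, and no interviewee described their router in these terms; in particular, estimating `quality` per task is the hard part and is simply assumed here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str            # hypothetical model label
    usd_per_task: float  # hypothetical expected cost of one task on this model
    quality: float       # hypothetical estimated task quality, from 0.0 to 1.0


def route(options: Sequence[ModelOption], min_quality: float) -> Optional[ModelOption]:
    """Return the cheapest option meeting the quality bar, or None if none qualifies."""
    eligible = [m for m in options if m.quality >= min_quality]
    return min(eligible, key=lambda m: m.usd_per_task, default=None)


# Hypothetical catalog; none of these numbers come from the interviews.
CATALOG = [
    ModelOption("open-small", usd_per_task=0.02, quality=0.80),
    ModelOption("mid-tier", usd_per_task=0.30, quality=0.90),
    ModelOption("frontier", usd_per_task=1.00, quality=0.97),
]

for bar in (0.75, 0.95, 0.99):
    choice = route(CATALOG, bar)
    print(bar, choice.name if choice else "no model meets the bar")
```

The sketch makes the dependency explicit: the savings exist only to the extent that the quality estimate is trustworthy, which is the part the corpus leaves unquantified.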

Where the disagreements are

Camp A starts with Tuhin Srivastava (Baseten), who splits inference into two layers and explicitly bets on the upper one:

"GPUs as a service is not sticky. … Inference with the software layer included is incredibly sticky."
「GPU 即服务并不具备粘性。……包含软件层的推理则具有极强的粘性。」
Tuhin Srivastava · Baseten CEO on the AI Inference Crunch

"what is valuable to a company is the user signal that they can gather, that only they can gather. And to the extent that that is encoded in a model, I think a lot of their business will be at risk, but to the extent that it is encoded in workflows, that is where they will be able to develop [moats]."
「我认为对于企业来说，唯一有价值的是他们能收集到的独家用户信号。如果这些信号只编码在模型里，我认为他们的业务将面临很大风险；但如果信号编码在工作流中，他们就能建立起护城河。」
Tuhin Srivastava · Baseten CEO on the AI Inference Crunch

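This round, Modal's Akshat Bubna is the newest and most concrete vote for "the software layer wins". He says outright that finding GPUs and being able to run a model stopped being differentiators long ago; what matters is elasticity, true scale-to-zero, and a reliability layer spanning 17 clouds: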

"with these endpoints, we are way more elastic, as you said, than anyone else. And you have true scaling to zero. You have true burstiness. And in practice, that matters a lot more to people than just finding the GPU and running model code."
「正如你所说，通过这些端点，我们比其他任何人都更加灵活。你可以实现真实的缩放到零。你可以实现真正的突发性。在实践中，这对人们来说比仅仅找到 GPU 并运行模型代码更重要得多。」
Akshat Bubna · Why AI Infrastructure Must Evolve for Agent Experience

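But Modal also planted a self-destruct charge under Camp A: it open-sourced its own inference performance edge (the block-based speculator DFlash, which brings a 2–4x speedup) and upstreamed it into SGLang, in effect flattening the hardest piece of technology in the "software-layer moat" with its own hands: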

"by using open source DFlash, you can get the same performance as you would with one of the proprietary providers."
「通过使用开源的 DFlash，你可以获得与任何一个专有提供者相同的性能。」
Akshat Bubna · Why AI Infrastructure Must Evolve for Agent Experience

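In other words, Baseten's bet on "software stickiness" and Modal's bet on "elasticity/reliability" are the same kind of argument within Camp A, yet Modal is demonstrating that the part of the software layer that can be measured and copied (kernels, speculators) is being open-sourced by practitioners themselves. What can be defended long term is only the operations that cannot be copied (multi-cloud capacity, scheduling, reliability). This internal tension was absent from the earlier corpus.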

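Camp B · "Open source is closing in on the frontier and the model layer is being commoditized": skepticism from the journalist and buyer side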

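Stephanie Palazzolo (The Information) gives the direct rebuttal to Camp A (English audio, only a Chinese translation survives, hence paraphrased): inference providers are suspected of being little more than GPU resellers, and despite aggressive fundraising they struggle to justify their high valuations. Their fate is also tied to a single line, the performance gap between open-source and closed-source models; inference providers are essentially a proxy for that gap.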

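Redpoint offers evidence in the same direction: switching costs are collapsing (likewise paraphrased from English audio). Many of their portfolio companies switched from Anthropic to DeepSeek within days and cut inference cost by 80–90%; switching between LLMs costs very little, so companies can change providers at will on cost or performance.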

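Benchmark's Ev Randle grounds Stephanie's abstract "the gap decides their fate" in a concrete mechanism, distillation: as long as open source keeps getting to within 95% of the frontier, the frontier labs' ability to charge a premium on tokens is structurally squeezed.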

"if at any point it seems like capabilities are actually hitting an absolute ceiling, And distillation continues as it has historically. And the open source actually gets, you know, 95% as good as wherever the ceiling of capabilities tops out. That's a really scary situation for the Frontier Labs. … you'd have a much, much greater impact from open source depressing their ability to have pricing power and charge a premium for the tokens that they're producing."
「如果在任何时候似乎能力实际上达到了绝对上限，并且蒸馏仍然像历史上那样进行。而开源实际上可以达到您知道的能力上限的 95%。这对于 Frontier Labs 来说是一个非常可怕的情况。……开源会大大影响他们的定价能力，降低对他们所生产的代币收取溢价的能力。」
Ev Randle · Benchmark's AI Bets

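This round, Glean's Arvind Jain moved this skepticism from "hypothesis" to "already happening". He offered three hard claims the corpus previously lacked: open source is already an order of magnitude cheaper, OpenAI is rumored to be planning deep price cuts, and the model business on its own is not that lucrative: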

"with open source, like, you know, it actually is an order of magnitude cheaper prices. I actually heard rumors that OpenAI was going to drastically reduce their model prices in response to these developments like competition as an open source."
「随着开源的出现，实际上价格要便宜一个数量级。我听说 OpenAI 可能会大幅降低他们的模型价格，以应对这样的竞争。」
Arvind Jain · 20VC: Why OpenAI and Anthropic Won't Win the App Layer

"The model business on its own is actually probably not as lucrative as everybody believes, but these companies now have a lot more things. They're no longer model companies only."
「模型业务本身实际上可能没有大家想象中那么有利可图，但这些公司现在有了更多的业务。他们不再只是模型公司。」
Arvind Jain · 20VC: Why OpenAI and Anthropic Won't Win the App Layer

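Arvind even gave a timeline: within three years most enterprise workloads will run on open-source models, and the inflection point is the moment GLM 5.2 got to within three months of the frontier. Sierra's Clay Bavor explains the same thing from the supply side: the strength of Chinese open source comes down to distilling US frontier models, while US labs, out of self-preservation, will not release open weights of the same capability themselves: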

"if you can't build frontier models yourself, okay, maybe the next best approach is to distill them and offer them up."
「所以如果你自己不能建立前沿模型，好的，也许下一个最佳选择就是提炼它们并提供出来。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

Camp C · "Hardware portability is a myth, but winners also rotate": Tri Dao's two-sided position

Tri Dao both backs Nvidia and leaves the door open for challengers:

"I would say hardware portability is kind of a myth. Simply, you know, even for our Nvidia chips, like generation to generation, they change a lot … Even for Nvidia, every generation, they essentially have to rewrite all the code … So even hardware portability, even between generations, isn't quite there."
「我认为硬件可移植性有点像神话。简单地说，你知道，即使是我们的 Nvidia 芯片，从一代到一代，它们也发生了很大的变化……即使对 Nvidia 来说，每一代他们基本上都要重写所有代码……所以即使是硬件的可移植性，即使在几代产品之间，也还不够好。」
Tri Dao · Ep 74: Chief Scientist of Together.AI

"Obviously, Nvidia is dominant for a couple of reasons. They design very good chips and they build very good software. And that creates this ecosystem where people build on that. … But I think certainly we'll see a lot of competitors entering the space as the workload starts to be a little bit more sort of coalescing around on the architecture side, you know, transformer and MOE and so on. It becomes a little bit easier to design the chips for that kind of workload, right?"
「显然，Nvidia 占据主导地位有两个原因。他们设计出非常好的芯片，并且构建非常好的软件。这就创造了一个人们在此基础上构建的生态系统。……但我认为，随着工作负载开始更多地围绕架构方面（例如 Transformer 和 MOE 等）整合，我们肯定会看到很多竞争者进入这个领域。针对这种工作负载设计芯片变得容易一些，对吧？」
Tri Dao · Ep 74: Chief Scientist of Together.AI

This position moves the battlefield to hardware-software co-design.

Camp D · "Agent prices will be pushed down to compute cost": McGrew's endgame

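Bob McGrew (former head of research at OpenAI) gives the most pessimistic version, from an inference provider's point of view (the episode is in English but podwise holds only a Chinese translation, so what follows is paraphrase): competition may push agent prices down to bare compute cost, eroding every economic model built on "scarce expertise". His analogy is blunt: lawyers are expensive because their time is scarce, but once you turn a lawyer into an AI model you have an unlimited supply of lawyers, so it is no longer scarce at all. Where does the value "escape" to? McGrew's bet is that the most valuable opportunities lie in applications with a deep understanding of the domain outside the model, for example enterprise solutions that wire AI into existing business processes.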

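Two earlier testimonies hit this endgame from opposite directions. Reinforcing it: Harvey's Gabe Pereyra observes that Opus 4.7 is three times more expensive than 5.5 for only 10–20% better performance, and that such a narrow performance gap will get competed down on price as multiple providers and open source catch up: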

"Opus 4.7 is three times more expensive than 5.5, but it's 10 or 20% more performant. But these are going to cause pressure on each other. And then if open source catches up, you're also going to have pricing pressure."
「Opus 4.7 的价格是 5.5 的三倍，但其性能仅提高了 10% 或 20%。但这些会相互施加压力。如果开源跟上了，你也会面临定价压力。」
Gabe Pereyra · Harvey Co-Founder on the Token Pricing Reckoning

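Complicating it: OpenAI's Noam Brown points out that in the test-time compute era, "what an agent is worth" is not a fixed number at all; model capability is a function of the budget you put in: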

"The capability of the model is a function of how much money you put into it, basically. If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. Give it a budget of $10 million, it can do even more."
「模型能力基本上是你投入多少钱的函数。如果你给它一个 10000 美元的预算，它能做的比拥有 10 美元预算时多得多。给它 1000 万美元的预算，它能做的更多。」
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

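But Noam also hands the McGrew camp ammunition in the other direction: the cost of solving a given hard problem is itself plunging with each model generation: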

"the cost of Disproving the Erdos unit distance conjecture drops by like 10 or 100x with every model release cycle, probably in some cases more. … Yeah, just go on vacation and come back two months later and then it's a thousand times cheaper."
「反驳 Erdős 单位距离猜想的成本在每次模型发布周期中下降了大约 10 或 100 倍，在某些情况下可能更多。……是的，放假去度假，过两个月再回来，然后便宜了千倍。」
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

Camp E · "Pick a narrow market in the gaps between the giants' fights": Fal.ai's strategy

Fal.ai's Gorkem and Batuhan choose not to fight LLM inference players head-on at all; their argument is that the reality of inference economics is a set of separate lanes:

"We chose to be a leader or play to be a leader in this fast-growing niche market rather than trying to go against Google or OpenAI or Anthropic."
「我们选择成为这个快速增长的利基市场的领导者，而不是试图与 Google 或 OpenAI 或 Anthropic 竞争。」
Gorkem / Batuhan · History of Generative Media with Fal.ai

"Whenever something new comes up, we are the first one to optimize it. First one to adapt our inference engine to it."
「每当有新的东西出现，我们总是第一个优化它，第一个调整我们的推理引擎来适应它。」
Gorkem / Batuhan · History of Generative Media with Fal.ai

Camp F · "Diversity wins": the open-source infrastructure view

Simon Mo and Woosuk Kwon of vLLM / Inferact offer a different structural prediction:

"What we believe is diversity will triumph that sort of single of anything at all."
「我们相信多样性会战胜任何单一的事物。」
Simon Mo · Inferact

"Having this community all work together for this open-source, we have the execution beyond any single entity can have."
「拥有这个社区一起为开源项目努力，我们的执行力超越了任何单一实体所能达到的水平。」
Simon Mo · Inferact

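Modal's move this round to upstream DFlash as open source is really a new footnote to Camp F: even closed inference platforms are feeding their performance edge back to the open-source community, which endorses "diversity / open source wins" while directly hollowing out Camp A's software moat.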

Camp G · "Vertically integrate down to SOL": the contrarian bet of Nvidia / Nebius

Nvidia's Kyle Kranen and Nader Khalil accept neither "diversity wins" nor "the software layer wins"; their logic is that only hardware-model co-design can approach the physical limit:

"Before trying to layer reality back in of like, why can't this be delivered at some date? Let's just understand the physics. What is the theoretical limit to like, how fast this can go?"
「在试图重新加入现实之前——比如，为什么不能在某个日期交付？——让我们先理解物理。理论上它能跑多快？」
Nader Khalil · NVIDIA's AI Engineers: Agent Inference at Planetary Scale

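This round, the Accel growth team gave the same logic a corporate embodiment from the investor's seat: Nebius-style "vertically integrated next-generation hyperscalers" (own data centers, in-house racks, self-controlled software stack) are betting precisely that controlling capacity lets them capture exponentially growing inference demand. At the level of underlying architecture this is almost the opposite of Camp F's "diversity wins". (Only a Chinese transcript is available, so this paraphrases their judgment that Nebius's vertical integration means owning everything from data center to software and thereby delivering inference efficiently.)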

Camp H · "Move cost from inference back to training": Engram's third path

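The route proposed by Engram's Dan Biderman and Jessy Lin differs from every camp above: rather than optimize on the inference side, internalize context in the weights, trading a one-off lightweight training run for the giant system prompt that otherwise has to be re-read on every call, and cutting token consumption by two orders of magnitude: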

"The fact that you don't have to research things and re-read things and the fact that you don't have to write monstrous system prompts. … That can give you, you know, two orders of magnitude reduction in token inference consumption. … answer, you know, within 100 tokens what the best frontier models would consume 100,000 tokens."
「你不必研究事物和重新阅读事物的事实，以及你不必写出庞大的系统提示。……这可以让你，知道，减少两个数量级的令牌推断消耗。……在 100 个令牌内回答，而最佳前沿模型可能需要消耗 10 万个令牌。」
Dan Biderman · Memory and Continual Learning: Engram

This round, Dan Biderman's second interview (AI Memory) supplied the systems-level number this route previously lacked, turning "long context is expensive" from a slogan into a measurable physical fact: have LLaMA-70B read one Wikipedia article, and the KV cache on the GPU swells to the same order of magnitude as the entire model weights:

"if you take a [Llama 70B] model and you load one article from Wikipedia … The brain state of the model when reading this few tens of kilobytes is like 80 gigabytes. 80 gigabytes on the HBM of the GPU. … And the entire set of parameters of this model would be like 140 or so gigabytes … this one article about Taylor Swift is like same order of magnitude memory consumption on the GPU. So it's highly memory inefficient. That's a systems problem."
「如果你拿一个 Llama 70B 模型，并加载一篇来自维基百科的文章……当模型阅读这几十千字节的内容时，其大脑状态就像是 80GB 一样。在 GPU 的 HBM 上有 80 GB。……而这个模型的所有参数总共大约是 140 GB……而这篇关于泰勒·斯威夫特的文章在 GPU 上的内存消耗也是同样数量级。所以它的内存使用效率非常低。这是一个系统问题。」
Dan Biderman · The AI Memory Problem: Why Long Context Isn't Enough
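The relationship between the two numbers in the quote (about 140 GB of weights, about 80 GB of KV cache) follows from standard sizing arithmetic, sketched below. This is a back-of-envelope illustration, not the speaker's calculation. The transcript does not state the model version, numeric precision, or token count, so every architecture value here is an assumption: 80 layers and a head dimension of 128 are the commonly cited shape for a 70B Llama-family model, 2 bytes per value assumes 16-bit precision, and the two KV-head counts contrast grouped-query attention with a full multi-head layout.

```python
def weight_bytes(params: int, bytes_per_param: int = 2) -> int:
    """Memory for the model weights at the given precision."""
    return params * bytes_per_param


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GB = 10 ** 9

# Assumed: 70 billion parameters stored in 16-bit precision.
print(f"weights: {weight_bytes(70_000_000_000) / GB:.0f} GB")

# Assumed shapes; kv_heads=8 models grouped-query attention, kv_heads=64 full multi-head.
for label, kv_heads in (("grouped-query (8 KV heads)", 8), ("multi-head (64 KV heads)", 64)):
    per_token = kv_cache_bytes(1, layers=80, kv_heads=kv_heads, head_dim=128)
    tokens_for_80gb = 80 * GB / per_token
    print(f"{label}: {per_token / 1024:.0f} KiB per token, "
          f"about {tokens_for_80gb:,.0f} tokens to fill 80 GB")
```

The weight figure reproduces the "140 or so gigabytes" in the quote. The KV cache figure depends heavily on the attention layout and context length, which is why the sketch reports bytes per token rather than asserting a single total.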

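He also concedes that the urgency of making token efficiency a hard metric arose only recently, which in turn corroborates the real Jevons pressure described in mainstream consensus point three: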

"One is token efficiency and cost, which is a major issue that wasn't actually an issue when we started the company late last year and became more urgent now."
「首先是令牌效率和成本，这是一个重大问题，当我们去年年底创办公司时并不是一个问题，现在变得更加紧迫。」
Dan Biderman · The AI Memory Problem: Why Long Context Isn't Enough

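Worth noting: sitting in on the first Engram interview was Sonya Huang, whom Shaun Maguire describes on air as working at Fireworks. On the huge inference cost of agents that run for days, she voiced the inference provider's position at its most naked, namely that high inference cost is a good thing: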

— Jessy Lin: "… I think a lot of what people are worried about these days is the huge inference costs of running these agents like for days on end."
— Sonya Huang: "High inference costs a good thing."
— Jessy Lin: "I mean, consuming tokens for what?"
— Shaun Maguire: "Sonya works at Fireworks. She really loves it."
— Sonya Huang: "I love inference."
「——Jessy：人们现在最担心的是持续几天运行这些代理的巨额推理成本。——Sonya：高推理成本是件好事。——Jessy：我是说，消耗代币是为了什么呢？——Shaun：Sonya 在 Fireworks 工作。她非常喜欢这份工作。——Sonya：我喜欢推理。」
— Memory and Continual Learning: Engram

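This exchange compresses the tension of the whole topic into three positions: providers love inference cost, customers want to eliminate it, and Engram wants to move it somewhere else.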

Camp I · "Push cheap inference onto the device": Josh Elman's consumer-side view

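In the same conversation, Josh Elman and a16z's Anish Acharya pick up the same thread from the consumer-app angle: Acharya points out that inference has a real marginal cost, which forces consumer products from free toward subscription, and Elman follows with the way out, pushing simple inference down to the device: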

"Because there's no zero marginal cost of distribution, inference costs money, you can't really build a mass market free product without a massive balance sheet."
「因为分发的边际成本不再为零，推理要花钱，没有庞大的资产负债表，你其实做不出一个面向大众市场的免费产品。」
Anish Acharya (a16z, host) · What's Next for Consumer AI

"We're going to be figuring out with the models even that you can run on iPhones or on laptops, how to push more and more of the … less complex inference that needs to be done onto device to lower the cost … And then you only send the more complex stuff up."
「我们将会发现如何在 iPhone 或笔记本电脑上运行的模型中，通过将越来越多需要在设备上进行的……不复杂的推理转移到设备上来降低成本……然后你只发送更复杂的东西上去。」
Josh Elman · What's Next for Consumer AI

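This round, Sierra's Clay Bavor put a clear boundary on "cheap on device": on-device inference can improve parts of the consumer experience but cannot relieve the fundamental server-side bottleneck. Frontier workloads are limited by a phone's thermal and power envelope and have to go back to the rows of TPUs/GPUs in the data center. Elman's "simple on device, complex to the cloud" and the "model routing" of Kevin Weil, Harvey, and Glean are really the same economics repeated at several levels (device/cloud, model orchestration, provider selection), all of them backing mainstream consensus point four.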

Camp J · "Demand for frontier intelligence is unbounded; the cost floor is energy and compute": the anti-doom case from Clay Bavor / Accel

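Sierra's Clay Bavor and the Accel growth team give the most direct rebuttal to Camp D's "price converges to compute cost, so the middle layer gets crushed". They do not deny that prices are falling, but they rewrite the conclusion from "margins go to zero and the business dies" to "the floor is real, demand is unbounded, and everyone is growing". Clay Bavor's argument has two parts. First, demand is nowhere near its ceiling: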

"So, I think we have not yet appreciated the unbounded demand for, call it frontier levels of intelligence."
「所以，我认为我们尚未充分认识到对，称其为，前沿智能水平的无边需求。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

Second, the cost side has a physical floor: even if open source sidesteps the margin stack of hosted frontier models, the bottom-level inputs are still constrained electricity and GPUs:

"if you have unbounded demand for frontier level intelligence or GPUs to run open [weight] models and the rate limiter is the number of Blackwells and H100s you have, you end up with kind of a floor on the cost of tokens because you've got to pay for the energy, you've got to pay for the compute."
「如果你对前沿水平的智能或用于运行开放权重模型的 GPU 有无限的需求，而限制因素是你拥有的 Blackwells 和 H100 的数量，那么你最终会发现代币的成本有一个底线，因为你必须支付能源费用，必须支付计算费用。」
Clay Bavor · 20VC: Open Models vs Frontier Models, Who Actually Wins

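The Accel growth team gives the macro version of the same position from the investing side: the trajectory of token consumption is determined entirely by the capability frontier, and as long as capability keeps rising, demand keeps getting revised upward. They also name one braking condition for Jevons: if the value does not come back, there will be a governor: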

"it's only going as far as the value It receives back, right? And so there is going to be a governor against if we don't see value back from it."
「但它只会走到它所得到的价值，对吗？因此，如果我们没有看到价值回报，就会有一个制约因素。」
Accel growth team · Accel: The Quiet Firm Behind Facebook, Cursor, Nebius…

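Camp J and Camp D are not in hard opposition. Both agree that price converges toward cost; the disagreement is over what comes next. D says the floor is low enough, there is no differentiation, and the middle layer dies. J says the floor is pinned by energy and compute and demand is unbounded, so the real constraint shifts from "margin competition" to whether electricity and GPU supply can keep up.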
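Camp J's "floor on the cost of tokens" can be written as a one-line cost model: the hourly cost of the accelerator plus the energy it draws, divided by how many tokens it produces in that hour. The sketch below is illustrative only. All four inputs are hypothetical placeholders chosen for round arithmetic, not figures from any interview, and the model ignores networking, staffing, utilization gaps, and margin.

```python
def token_cost_floor_usd_per_million(gpu_hour_usd: float, power_kw: float,
                                     usd_per_kwh: float, tokens_per_second: float) -> float:
    """Lower bound on cost per million tokens from hardware amortization plus energy."""
    hourly_cost = gpu_hour_usd + power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs: $2.00 per GPU-hour amortized, 1 kW draw, $0.10 per kWh, 1,000 tokens/s.
floor = token_cost_floor_usd_per_million(
    gpu_hour_usd=2.00, power_kw=1.0, usd_per_kwh=0.10, tokens_per_second=1000.0
)
print(f"floor: ${floor:.2f} per million tokens")
```

In this framing, Camp D and Camp J agree on the formula and differ on what happens at the floor: whether prices resting on it leave any room for the middle layer, and whether the numerator (GPU and power supply) becomes the binding constraint.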

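What nobody has nailed down

Can "software-layer stickiness" really withstand a one-off migration cost? Tuhin's Baseten argument and Redpoint's evidence of companies switching to DeepSeek in days and saving 80–90% directly contradict each other, and this round Modal added a layer of irony by voluntarily open-sourcing the copyable performance edge in its own software layer (DFlash). Neither side has produced concrete long-term customer retention data: Baseten's 30x growth is not broken down into new versus renewal revenue, and there is no number for how much NDR Modal's "elasticity/reliability" actually delivers. If the measurable part of the software layer is being open-sourced by practitioners themselves, nobody has worked out what the remaining defensible "operations moat" is worth.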
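How long before the Jevons paradox is cut off by budget ceilings on the buying side? Front-line Jevons testimony surged this round (Glean's budgets overrun in a month or two; Accel's seven times more companies told to "let it rip", its bullish $500M of tokens per month, and its forecasts revised upward every quarter; Clay's "3.8% → 20%"), yet still nobody has given the real elasticity curve of total enterprise AI budgets. It is all directional judgment and survey data, not billing data. The counterforces were also named for the first time: Accel says Jevons has a "value governor" (it brakes if the value does not come back), and Glean has turned "routing saves money" directly into a product. The corpus still has no data on which of these two forces moves faster.
"Inference cost": which number does it actually mean? Noam Brown showed that test-time compute turns "inference cost" into a curve that slides with budget; this round Clay Bavor pinned down a physical floor of energy and compute from the other end. Is what is falling the unit price at equal quality, and is that decline being eaten by a simultaneous rise in tokens per task? No episode overlays the two curves, "equal-quality unit price" and "tokens per task". Clay's "1/1300" is the former and Harvey's "$20,000 per review" is the latter, but nobody has put them on the same chart.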
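Model routing is treated as a free silver bullet, yet it is itself an unsolved problem. Under mainstream consensus point four, seven or eight independent sources treat routing as the default answer of "do the same work with the cheapest model you can", but only Dan Biderman, who has actually built routing, threw cold water on it: "It's an unsolved thing. There's been progress on it, but a lot more work is required to actually route things to the right model, the right time, the right cost." No episode gives numbers for routing accuracy, the cost of misrouting, or the maintenance cost when model versions change weekly.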
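If "agent price = compute cost" holds, which business model dies first? McGrew raised it, Benchmark's distillation point reinforced it, and Glean's "rumored OpenAI price cuts plus a model business that is not that lucrative" pushed it further, but still nobody takes up the "who dies first" question. The most direct losers are the per-token / per-call middle layer, yet nobody in the corpus who runs such a business admits this; all of them claim "we have the software layer", "we have routing", "we move cost into training", or "we control capacity". Glean also mentioned in passing another layer, that once you move to consumption pricing there is no bundling advantage and you have to do ten times the work for the same revenue, which extends the question from the middle layer to the revenue model of the entire application layer. Nobody develops that either.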
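Do new architectures (MoE, MLA, SSM, and so on) really arrive faster than Nvidia can rewrite its code? Tri Dao hints at this but does not quantify it. It is the key variable for judging whether Nvidia's moat will loosen.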
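The open question above about which "inference cost" is meant can be made concrete by decomposing spend into its factors. The following is a minimal sketch under the assumption that total spend is simply unit price times tokens per task times number of tasks; the growth figures passed in are hypothetical, and only the 1,300x price drop comes from the corpus.

```python
def spend_multiplier(price_drop_factor: float, tokens_per_task_growth: float,
                     task_count_growth: float = 1.0) -> float:
    """Change in total spend when unit price falls and usage grows.

    Assumes spend = unit_price * tokens_per_task * task_count.
    """
    return tokens_per_task_growth * task_count_growth / price_drop_factor


# Clay Bavor's 1/1300 unit-price drop, with hypothetical usage growth alongside it.
print(spend_multiplier(1300, tokens_per_task_growth=1300))  # usage growth exactly offsets the price drop
print(spend_multiplier(1300, tokens_per_task_growth=100, task_count_growth=100))  # usage growth outruns it
```

Whether total spend rises or falls is decided entirely by the usage terms, which is why the unit-price curve and the tokens-per-task curve need to be on the same chart before the net Jevons effect can be read off.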

My take

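A judgment (not a fact): in this debate, "inference is the new bottleneck" is the consensus and "where value lands" is the disagreement, and the disagreement is more a bet on time windows than on the endgame. In the short term (1–2 years) software-layer stickiness is real (multi-cloud coordination, kernel optimization, long-tail customization), so Camp A is right. In the medium term (3–5 years) the open-source stack will eat more than half of that stickiness, so Camp F and Camp B are right. In the long term McGrew's "price = compute cost" has a high probability of holding, provided model differentiation keeps converging; if some lab does ship a model with a generational lead, the whole thesis reverses. So the strength of this argument depends heavily on where the performance-plateau topic goes (see the sister topic).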

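My confidence in this judgment: medium. The strongest link is "the open-source stack eats the middle layer over the long run". It is already borne out by vLLM's actual market share, and this round Glean's claims (most enterprise workloads on open source within three years, open source an order of magnitude cheaper, rumored OpenAI price cuts) add buyer-side evidence; even Modal, a closed inference platform, is open-sourcing its own performance edge, which is almost self-fulfilling. The weakest link is still the net effect of the Jevons paradox: testimony surged this round, but it remains entirely directional, with no billing elasticity data.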

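I adjusted one thing this round. I used to treat "routing / quality-per-dollar" as the precursor to McGrew's endgame. The direction stands, but Dan Biderman reminded me not to treat routing as a free silver bullet: it is an unsolved research problem, and nobody has measured the cost of misrouting. Meanwhile, Clay Bavor and Accel's Camp J is the strongest rebuttal I have seen this round to "price → zero margin → the middle layer dies": it does not deny that prices are falling but swaps the constraint for a physical floor (energy, GPU supply). So I now lean toward thinking that the deciding variable over the next two years is not whether margins go to zero but which of two hard constraints binds first, "electricity / advanced packaging / GPU supply" or "enterprise budget elasticity". That is more likely than pure price competition to decide who survives.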

What I still want to know

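Real customer retention and expansion data from Baseten / Together / Fireworks / Modal: NDR stratified by ACV, covering at least three customers that have been running for 18 months or more. Without it, "software-layer stickiness" is narrative, not fact; now that Modal has open-sourced its performance edge, retention numbers are all the more necessary to show the moat is still there.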
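The real growth curve of total enterprise AI spend: not TAM estimates, not survey figures like "seven times more companies told to let it rip", but the actual inference bills of Fortune 500 companies for 2024–2026. The net effect of the Jevons paradox can only be answered with billing data.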
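A real cost/benefit breakdown of model routing: after a company adopted routing, how much did the token bill fall, what was the misrouting rate, and what did maintenance cost? Dan Biderman says routing is unsolved; it takes a quantified case to judge how much it saves and how much hidden cost it adds.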
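A concrete case of cross-architecture migration cost: how many engineer-months a company spent moving from H100 to B200 / TPU / Trainium, and how much performance it recovered. Tri Dao's "hardware portability is a myth" needs this number to become concrete.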
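Data on the open/closed gap after DeepSeek / GLM: Stephanie says the fate of inference providers is tied to this gap, and Glean says GLM 5.2 has pushed it to within three months. It takes month-by-month comparisons on at least 2–3 independent benchmarks (not just MMLU) to judge whether the gap is narrowing or locking in.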
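A counterexample to "the app layer eats inference profit": a company focused on inference optimization that has sustained 40%+ gross margin with high customer retention. If not one can be found within three years, McGrew's "price = compute cost" more or less proves itself.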

Sources

Tuhin Srivastava (Baseten) · 2026-05-11 · 35dea6160e718145a7a3c5263827a3bb
Tri Dao (Together.AI) · 2025-09-11 · 26bea6160e7181f39b48fd0fba93d842
Gorkem / Batuhan (Fal.ai) · 2025-09-06 · 266ea6160e7181d5a2bed15472fcbfad
Simon Mo / Woosuk Kwon (vLLM / Inferact) · 2026-01-31 · 2f9ea6160e718111815bf871f04ddd3b
Bob McGrew (former OpenAI Research) · 2025-06-18 · 216ea6160e7181a4abe1e0c576c28050
Stephanie Palazzolo (The Information) · 2025-08-07 · 248ea6160e71818fb7a4cf78a04a75ba
Redpoint AI Investors · 2025-04-28 · 1e3ea6160e7181dc895ff3e77430e7ae
Kyle Kranen / Nader Khalil (NVIDIA Brev / Dynamo) · 2026-03-12 · 320ea6160e71812c9d7ed4b2a8e30e83
Nathan Labenz on AI slowdown · 2025-10-15 · 28dea6160e718189ae7bcf6cbddefe24
Gabe Pereyra (Harvey) · 2026-05-18 · 387ea6160e71816d928afff83182d03f
Ev Randle (Benchmark) · 2026-05-29 · 390ea6160e7181848860ca758a56d9ab
Noam Brown (OpenAI) · 2026-06 · 38cea6160e71819db9edf75b0e6eb94d
Dan Biderman / Jessy Lin (Engram) · 2026-06 · 38cea6160e7181a498ccefad30236956
Josh Elman × Anish Acharya (a16z) · 2026-06 · 38cea6160e718164ba74d098db7cc325
Kevin Weil (OpenAI) · 2026-06 · 38fea6160e7181d9862cd6a455019494
Clay Bavor (Sierra) · 2026-07-06 · 395ea6160e718112a7a1dba860d7381f
Akshat Bubna (Modal) · 2026-07-13 · 39cea6160e718139a154e87365fd53e7
Accel growth team · 2026-07-13 · 39cea6160e71817c8437f09acf9e2013
Arvind Jain (Glean) · 2026-07-13 · 39cea6160e718145b093d79633bd3280
Rafa Gómez-Bombarelli et al. (Lila Science) · 2026-07-17 · 3a0ea6160e7181309f08d5135f750f56
Dan Biderman (Engram / AI Memory) · 2026-07-27 · 3aaea6160e7181cfbac0e557c83ca7b6