Topic Overview

Has Model Performance Plateaued? · Performance Plateau

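Changelog

2026-07-16: Quote-fidelity fix (site-wide audit): 1 change. One item was re-quoted (Jack Morris on the memorization-capacity plateau: the podwise summary sentence was replaced with his verbatim words).
2026-07-16: Added 2 interviews (Noam Brown and Mark Chen, both OpenAI research insiders). Opened a new Camp F, "the plateau is a problem with the ruler". Noam Brown supplies the mechanism: capability is now a function of inference budget, while the benchmark grid does not control for test-time compute, so on-paper gains systematically shrink (AISI sees models still improving at 100 million tokens, and nobody knows the capability ceiling). Mark Chen echoes this from the eval-production side with an "evals crisis" in which benchmarks are fully saturated. The first point of the mainstream consensus (slowing returns to pre-training) gets its first direct dissent in the corpus (Mark Chen: the law has held for 10 orders of magnitude, and pre-training is underrated). Camp B gains Mark Chen, set against Joelle's redefinition-style bullishness. The first item under "What nobody has fully spelled out" is marked as partly answered by Noam, and two gaps are added: the measurement camp's incentive structure and small-budget extrapolation. "What I still want to know" adds tracking items, including whether the grid equilibrium gets broken and two timestamped, checkable predictions. "My take" is updated: it absorbs the measurement-artifact mechanism while discounting for both speakers' OpenAI affiliation.
2026-06-11: Sourcing upgraded to full verbatim transcripts. Third-hand paraphrases from the podwise summary layer were replaced with original wording: Christina Kim on front-end being "totally next level" and RL being "very data efficient", Joelle on being "still super bullish on RL" while doubting that "just RL out of the box is going to give us AGI", and Ari on the Kaplan and Chinchilla work assuming IID data, "which is insane", are now verbatim. The Vibe Check items (English audio, but only a Chinese text survives) and several of Zelikman's EQ points were changed to paraphrase.
2026-05-20: First overview. Based on 9 interviews (OpenAI GPT-5 team Isa Fulford / Christina Kim, AI Vibe Check panel, Harvey's Winston Weinberg, Brad Lightcap, Cohere's Joelle Pineau, Jack Morris, Ari Morcos / Datology, Eric Zelikman / humans&, a16z's Casado & Wang).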

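Mainstream consensus

Point one: returns along the pre-training axis are slowing. Almost everyone, including the "bulls", accepts this.

Several researchers on AI Vibe Check argue that simply feeding a model more data yields diminishing returns, and that rather than pouring in unlimited capital, the focus should shift to sample efficiency. (The episode's audio is in English but podwise retains only a Chinese text, so this is a paraphrase.)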

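Diminishing returns are inherent, because a model's intelligence is log-linear in the compute used to train it, which means you have to increase compute exponentially to get each increment of intelligence.
Bob McGrew (former OpenAI) · see sister topic inference-economics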
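The log-linear claim can be made concrete with a small sketch. It assumes a toy law, score = a + b * log10(compute); the coefficients `a` and `b` and the function names are illustrative assumptions, not measurements or anything stated in the interview. Under that assumption, every fixed gain in score requires multiplying compute by the same constant factor.

```python
import math


def score(compute: float, a: float = 0.0, b: float = 1.0) -> float:
    """Hypothetical log-linear law: score = a + b * log10(compute)."""
    return a + b * math.log10(compute)


def compute_needed(target: float, a: float = 0.0, b: float = 1.0) -> float:
    """Invert the toy law: compute required to reach a target score."""
    return 10 ** ((target - a) / b)


if __name__ == "__main__":
    # Each +1 step in score costs a constant multiple of the previous compute.
    for target in range(1, 6):
        print(target, compute_needed(target))
```

With the default `b = 1.0`, each additional point of score costs ten times the compute of the previous one; a smaller `b` makes the multiplier larger.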

2026-07 update: this near-factual consensus now has its first direct dissent. Mark Chen (OpenAI Chief Research Officer) rejects the "returns are slowing" framing itself:

"I think pre-training is definitely not dead. It's underrated."
Mark Chen · Cooking with OpenAI’s Research Chief

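Note that what he disputes is not the log-linear math (he does not address it head-on in the interview) but the inference that bottlenecks will actually stop scaling. The full argument is under Camp B.

Point two: the capability bottleneck is moving away from the model itself and toward deployment and usage. Even the most bullish model companies concede this.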

"99% of people get to use bad tools or don't have any tools at all."
Brad Lightcap (OpenAI) · Uncapped #46 Brad Lightcap from OpenAI

"We're so far from just the ability of the models right now being integrated into daily life. People do not know how to use these systems."
Winston Weinberg (Harvey) · 20VC: How Model Performance is Plateauing

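Point three: new gains are appearing on axes other than "parameters + compute", such as reasoning, post-training, and data quality, but each lab is betting on a different sub-axis.

Where they disagree

Camp A · "The pre-training plateau is real": the Harvey / Vibe Check panel position

Winston Weinberg (Harvey) put "plateauing" right in the podcast title, but his argument is subtler. His point is not that capability has stopped rising, but that deployment can no longer keep up, so the binding constraint for application-layer companies is no longer model upgrades: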

"We're so far from just the ability of the models right now being integrated into daily life."
Winston Weinberg · 20VC: How Model Performance is Plateauing

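The Vibe Check panel offers a more technical version of "plateau", but they also add an important qualifier:

They also make a more technical observation (likewise paraphrased): whether RL succeeds depends on landing in a "moderate zone", where the model already knows enough to make reasonable guesses but is not yet strong enough to solve the task outright. If the task is too hard or too easy, RL gets no signal.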

"We're just starting to scratch the surface in terms of economic value creation from the model."
Vibe Check panel · AI Vibe Check

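Note that this panel is the most bullish on economic value within the "plateauing" camp, and their distinction between a capability plateau and an application-value plateau is exceptionally sharp.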
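The panel's "moderate zone" observation about RL can be illustrated with elementary probability. This is a sketch under assumptions the source does not state: the reward is binary (pass/fail), attempts are independent, and the model succeeds with probability `p`. The function names are hypothetical. Under those assumptions the reward carries the most information when `p` is in the middle, and none at all when the task is always failed or always solved.

```python
def reward_variance(p: float) -> float:
    """Variance of a pass/fail reward when the success probability is p."""
    return p * (1.0 - p)


def prob_mixed_group(p: float, k: int) -> float:
    """Probability that k independent attempts are not all passes or all fails.

    Only a mixed group lets a learner compare better and worse attempts.
    """
    return 1.0 - p**k - (1.0 - p) ** k


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.5, 0.99, 1.0):
        print(p, reward_variance(p), prob_mixed_group(p, k=8))
```

Both quantities vanish at `p = 0` (too hard) and `p = 1` (too easy), which is one way to read the claim that RL only gets signal in between.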

Jack Morris, working at a more foundational research level, offers a concrete argument that a plateau is measurable:

"So we have this result that's maybe the third discovery I was alluding to, which is a way to measure the exact capacity of a language model. And we get this number, if you train a language model on a ton of random data, and you measure its rate of memorization … No matter how you scale the training size, you hit this perfect perfect-ish plateau in model memorization, which we call the model capacity."
Jack Morris · Information Theory for Language Models

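Camp B · "Scaling laws will continue", but with a broadened definition

Joelle Pineau (Cohere Chief Scientist) is this camp's most public representative. She is nominally bullish on scaling, but her argument has already redefined "scaling" to include algorithmic innovation: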

"I tend to decompose different ingredients that lead to progress. You know, people often talk about like the algorithms, the data, the compute. I think in general, compute and data have a more linear effect on progress. You build more compute, you run bigger models, you can typically get better performance, you feed in more data."
Joelle Pineau · 20VC: Cohere's Chief Scientist on Why Scaling Laws Will Continue

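The implication is that algorithmic innovation is the nonlinear axis, one whose effects simply take a long time to show. This is where she quietly redefines "scaling will continue" as "the algorithmic axis will continue".

On RL she is bullish on the fundamentals while adding a qualifier: do not expect out-of-the-box RL to deliver AGI. This is a nuance within Camp B worth noting: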

"I'm still super bullish on RL in that the concept itself is so fundamental. This idea of training through a system of rewards of indicating what's valuable and what's not valuable through numerical values, that is so fundamental. It's not going away. Now, you know, where we're maybe getting a little bit ahead is thinking that just RL out of the box is going to give us AGI. That part, a lot less so."
Joelle Pineau · 20VC: Cohere's Chief Scientist

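Mark Chen (OpenAI Chief Research Officer; added 2026-07) is currently this camp's most "fundamentalist" voice and a direct contrast to Joelle. Where Joelle redefines "scaling will continue" as "the algorithmic axis will continue", Mark Chen defends scaling laws in their narrowest sense, citing ten orders of magnitude of history: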

"So I mean, it's held for almost 10 orders of magnitude. There's no reason it should not keep holding."
Mark Chen · Cooking with OpenAI’s Research Chief

He handles the "pre-training is dead" narrative by historical induction: every generation of bottleneck was declared the end, and every one was broken:

"There have always been some bottlenecks that people, well, you can't scale past this because of this bottleneck. And we've always found some kind of technique, whether it be better engineering or some new research insight that helps you break past the boundary."
Mark Chen · Cooking with OpenAI’s Research Chief

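Worth comparing: Joelle argues from factor decomposition, Casado from capital traceability, and Mark Chen from historical induction. These are three types of argument within Camp B, and they differ in strength and falsifiability. Mark Chen also draws a boundary around the new RL axis himself (a second echo within Camp B, alongside Joelle's caution about out-of-the-box RL):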

"So it's these fields where things are hard to grade, where RL has the least amount of ability to go and directly apply there."
Mark Chen · Cooking with OpenAI’s Research Chief

Sarah Wang / Martin Casado (a16z) tie "scaling continues" to patterns of capital flow:

"This is probably also a unique time in that for the first time you can actually trace dollars to outcomes … provided that scaling laws are holding and capabilities are actually moving forward."
Martin Casado / Sarah Wang · Inside AI's $10B+ Capital Flywheel

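Note the words "provided that": this is an escape clause Camp B has written for itself. If scaling really stops, the logic of the whole capital flywheel breaks.

Camp C · "Pre-training is slowing, but the new gains are in reasoning / post-training": the OpenAI position

OpenAI's Isa Fulford / Christina Kim attribute GPT-5's improvements directly to data quality plus RL, rather than to parameters or compute alone: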

"If you compare it to O3's front-end coding capability, this is just totally next level. It feels very different. … The team just really cared about like nailing front-end. And that means like getting the best data, like thinking about the aesthetics of the model and all of these things."
Christina Kim · GPT-5 and Agents Breakdown

"One thing that's interesting is with reinforcement learning, training a model to be good at a specific capability is very data efficient. You don't need that many examples to teach it something new."
Isa Fulford · GPT-5 and Agents Breakdown

Camp D · "Data is the severely underrated axis": the Datology / data-camp position

Ari Morcos (Datology) is one of the few who reframes the whole question:

"Data is the most under-invested in area of research relative to its impact, and I don't think it's even close."
Ari Morcos · Better Data is All You Need

"Even if you go and you look at the scaling laws work from Kaplan and Chinchilla and all these other things, they all assume IID data. Which is insane. We know that all data are not created equal, that garbaging garbage out is like the oldest adage in computer science."
Ari Morcos · Better Data is All You Need

"Making the data better can be a massive compute multiplier. It can change the performance per dollar by orders of magnitude."
Ari Morcos · Better Data is All You Need

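This line of argument turns the "plateau" question into "we have been spending compute on the wrong resource all along".

Camp E · "Capability ≠ Utility": the Brad Lightcap / Zelikman angle

Brad Lightcap states the strongest version of the "capability already far exceeds deployment" argument: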

"You could stop progress right now. And I still think there's kind of a 10 or 20 year diffusion and innovation cycle that just to get it into the economy."
Brad Lightcap · Uncapped #46 Brad Lightcap from OpenAI

"When you reduce the cost of something to zero, the demand for it goes up significantly."
Brad Lightcap · Uncapped #46

Eric Zelikman (humans&) reaches the same judgment from the other side:

"We have these incredibly smart models that are capable of so much, but they're not used for anywhere near what they're capable of."
Eric Zelikman · Humans&: Bridging IQ and EQ

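Zelikman also names a neglected bottleneck that no other camp raises: emotional intelligence. The core thesis behind his founding of humans& is that when today's models "fail", the bottleneck is often not IQ but a lack of emotional intelligence and of understanding of human values, which blocks their ability to actually help people. (This item paraphrases key points from his interview.)

Camp F · "Before asking whether models have plateaued, ask whether the ruler still works": Noam Brown's measurement-camp reframing (added 2026-07)

Noam Brown (OpenAI, one of the pioneers of the reasoning paradigm) takes the whole debate a step back. He does not answer "have models plateaued?" directly; he argues that the instrument used to judge a plateau, the benchmark grid, no longer works. The mechanism has two steps.

Step one: capability is no longer a static property of the model but a function of inference budget: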

"The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically. If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. If you give it a budget of $10 million, it can do even more. And so at what budget should you evaluate these models? The policies that exist today don't really address that question."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

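Step two: the "grid" the industry publishes at model launch does not control for this budget variable, so on-paper gains systematically shrink. This is exactly why 5.5 was criticized at launch for showing "no progress":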

"The benchmark results are being presented in the wrong way. They're not controlling for the amount of test-time compute that is being used on that benchmark question. It turned out that 5.5 is just much more efficient with its thinking."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models
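The two-step mechanism can be sketched numerically. Everything below is hypothetical: the log-shaped accuracy curves, the `OLD` and `NEW` parameters, and the token counts are made-up values chosen only to show the shape of the argument, not data from any benchmark. The sketch compares a launch-grid style gap, where each model is scored at its own default thinking budget, with the gap at a matched budget.

```python
import math


def accuracy(tokens: float, base: float, slope: float) -> float:
    """Hypothetical curve: accuracy grows with log10 of thinking tokens, capped at 1.0."""
    return min(1.0, base + slope * math.log10(tokens))


# Made-up parameters for an older and a newer model; not real measurements.
OLD = {"base": 0.10, "slope": 0.10}
NEW = {"base": 0.20, "slope": 0.10}


def grid_gap(old_tokens: float, new_tokens: float) -> float:
    """Gap when each model is scored at its own (uncontrolled) token budget."""
    return accuracy(new_tokens, **NEW) - accuracy(old_tokens, **OLD)


def matched_gap(tokens: float) -> float:
    """Gap when both models are scored at the same token budget."""
    return accuracy(tokens, **NEW) - accuracy(tokens, **OLD)


if __name__ == "__main__":
    # The newer model is assumed to think with far fewer tokens by default.
    print("uncontrolled gap:", grid_gap(old_tokens=100_000, new_tokens=20_000))
    print("matched gap:     ", matched_gap(tokens=100_000))
```

In this toy setup the newer model looks only slightly better on the uncontrolled grid because it spends fewer tokens, while the matched-budget gap is several times larger. Whether real models behave this way is exactly what the quote asserts and what third-party data would need to confirm.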

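This mechanism directly undercuts every judgment in the corpus that infers a plateau from "benchmarks only rose a few percentage points" (the Vibe Check panel's technical observation in Camp A falls in this category). One contrast is worth noting: Jack Morris's memorization-capacity plateau is an information-theoretic result on the training side and does not depend on inference budget, so it is the one item in Camp A that this criticism does not touch.

Going further, when asked to run the model until performance plateaus and then evaluate, Noam's answer amounts to declaring the plateau point unreachable in practice: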

"The thing is, the point at which it plateaus is actually really far out these days. … 5.5 and other models … if you scaffold them reasonably well, can think for weeks even before having performance plateau on some of these benchmarks. And so the point at which they plateau is simply too far out to reasonably test."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

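He offers one checkable hard number and one more radical inference: nobody knows where the ceiling is, because nobody has run the models long enough: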

"The AISI in their evaluations has shown that the models continue to improve at 100 million tokens, if you run them for 100 million tokens, they're still improving at beyond that point."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

"And so nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

So why doesn't the industry change how it measures? Noam's explanation is not technical difficulty but a game-theoretic equilibrium:

"You kind of end up in this bad equilibrium where everybody kind of knows that it's a bad equilibrium, but like nobody wants to break out."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

Mark Chen confirms the same thing from the eval-production side: the models have not stopped, the ruler has saturated:

"And we really are kind of in an evals crisis, right? Where all the really great evals that we all know, like growing up, like taking the SAT, those are all fully saturated. We really need to find good new ways to benchmark the models."
Mark Chen · Cooking with OpenAI’s Research Chief

"There's this philosophy of once an eval is out in the world, then it's just already not a good eval."
Mark Chen · Cooking with OpenAI’s Research Chief

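Camp F's internal qualifiers are just as sharp; this is not unlimited bullishness. Noam himself concedes that some capability axes are insensitive to budget (factual-retrieval questions do not improve even given a week), and that research taste is not there yet: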

"There are some benchmarks where the models will just not improve if they have more inference budget."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

"One thing I see for research in particular is they don't have very good research taste right now."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

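Moreover, in his hands this "capability = function of budget" worldview turns into an argument against an overnight intelligence explosion. On-paper underestimation of progress and the speed limit imposed by physical time are two sides of the same coin: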

"If it requires so much test-time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time. Things can only go so fast because the models need to run for long enough to actually do something really, really powerful. Time itself becomes a bottleneck to what we can do."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

This camp leaves the corpus's most checkable set of predictions and numbers. First, Noam's poker solver (his PhD thesis topic):

"And I wouldn't be surprised if, you know, six months or a year from now, the model is able to do zero shot an entire poker solver, basically my entire PhD thesis in one go."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models

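Second, the cost of disproving the Erdős unit distance conjecture (the price of the same task collapsing from one generation to the next, which is the most direct quantification of "capability = function of budget"):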

"The cost of disproving the Erdos unit distance conjecture drops by like 10 or 100x with every model release cycle, probably in some cases more."
Noam Brown · Why Traditional Benchmarks Fail Modern AI Models
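As plain arithmetic, a 10x or 100x drop per release cycle compounds quickly. In the sketch below only the two factors come from the quote; the starting cost of 1,000,000 (in arbitrary currency units) and the function name are placeholders invented for illustration.

```python
def cost_after(initial_cost: float, factor: float, cycles: int) -> float:
    """Cost of a fixed task after a number of release cycles, each cutting cost by `factor`."""
    return initial_cost / factor**cycles


if __name__ == "__main__":
    start = 1_000_000.0  # placeholder starting cost, not a figure from the source
    for factor in (10, 100):
        for cycles in range(4):
            print(f"factor={factor} cycles={cycles} cost={cost_after(start, factor, cycles)}")
```

At 10x per cycle the placeholder cost falls by three orders of magnitude in three cycles; at 100x per cycle it falls by six.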

On Mark Chen's side, the corresponding timeline is the endpoint of the three-year roadmap:

"When we look at our kind of three-year roadmap, the end goal that we want to reach is one where the models are just doing end-to-end research."
Mark Chen · Cooking with OpenAI’s Research Chief

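What nobody has fully spelled out

People use "plateau" to describe different things. Is it a plateau in pre-training loss, in benchmarks, in capability, or in deployment? 2026-07 update: Noam Brown made the most important cut here. He formally separates "gains on benchmarks" from "capability" and supplies a mechanism (a grid that does not control for test-time compute systematically underestimates progress). But he separates only this one pair; nobody has fully disentangled the four kinds of plateau, and most of the talking-past-each-other between the "plateauing" and "continues" camps remains.
Nobody spells out the measurement camp's incentive structure. The claim that benchmarks underestimate the models comes from the party releasing them: Noam and Mark Chen are both insiders in OpenAI research, and the most recently released model is the very subject of the "small on-paper gains" narrative. The corpus contains no independent confirmation from evaluation bodies, academia, or rival labs that generational gaps reopen once budget is controlled, and the hosts did not press on this.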
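A gap Noam names himself: whether small-budget results can be extrapolated to large-budget performance. He says outright that there is not much research on this yet, and that using $10/$100 budgets to predict performance at a $10,000 budget would make a very good paper. This is the crux of whether the measurement camp's claim can be operationalized: if it cannot be done, "the capability ceiling is unknowable" becomes the long-term state. It connects directly to another mismatch he points out: release cycles turn over every 2–3 months, while pushing a model to its limit takes weeks to months, and safety evaluations (preparedness frameworks) likewise do not account for test-time compute.
How long can the new reasoning / RL axis keep running? Camp C implicitly assumes returns on this axis will continue, but Joelle Pineau's point that RL is inefficient and Misha Laskin's that there may be no generalization (see sister topic long-horizon-agents) both undermine that premise. Nobody systematically discusses this axis's own diminishing-returns curve. Noam's "still improving at 100 million tokens" provides one data point, but it sits on the test-time axis, not the RL training axis, and the corpus still frequently conflates the two curves.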
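If "capability already far exceeds deployment" is true, what is the marginal value of pushing capability further? Brad Lightcap himself floated a 10–20 year diffusion period, yet he will not stop training bigger models. Nobody spells out the tension between this behavior and the argument.
Scalability of the data-quality camp. Datology's argument is powerful but unquantified: can an improvement in performance per dollar of several orders of magnitude be replicated empirically at frontier scale? The interview offers only qualitative statements on this.
"Chinese innovation is driven by constraints" is a separate potential counterexample raised by the Vibe Check panel. If Western labs with unlimited compute start hitting diminishing returns while constrained Chinese labs pull ahead on efficiency, the map of "plateau" judgments gets redrawn. This angle appears only once in the English-language corpus.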

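My take

Judgment (not fact): the previous version's core judgment stands. The consensus that returns on the single pre-training axis are slowing remains the strongest (Mark Chen's dissent targets the inference that bottlenecks can stop scaling, not the log-linear relationship itself); new gains are splitting across three axes (post-training/RL, data quality, and deployment); and "has it plateaued?" is wrongly framed as a single dimension. This version adds one point: a substantial share of the "plateau feeling" may be a measurement artifact. Noam Brown's mechanism (capability as a function of inference budget, plus a release grid that does not control for budget) is the first concrete mechanism in the corpus that explains both shrinking on-paper gains and clearly felt progress, and it comes with a testable implication: once curves are plotted against budget, generational gaps should reopen. But I discount this new judgment: both new voices are OpenAI research insiders, stakeholders in the "small on-paper gains" narrative, and there is no third-party verification yet. The combined update: as of mid-2026, the most honest answer to "have models plateaued?" is "with current public measurement methods, it cannot be determined". That is not "no plateau"; it is "the ruler cannot tell". The judgment that application value will keep growing for at least 5–10 years is unchanged: the deployment lag is independently corroborated across Camps A, C, and E, and Noam's remark that nobody has explored what 5.5 could do with $100,000 of compute effectively adds another vote.

Confidence: continued growth in application value, medium-high (unchanged). "A substantial share of the plateau feeling is a measurement artifact", medium: the mechanism is self-consistent and has one external data point (AISI's 100 million tokens), but the testimony is concentrated among stakeholders. Falsifiability: if budget-controlled third-party comparisons appear within the next 12 months and the frontier generational gap is still only a few percentage points, the measurement-artifact thesis should be down-weighted; conversely, if labs start plotting results Noam's way and the gap widens, it should be up-weighted.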

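What I still want to know

Benchmark time series broken down by sub-capability: reasoning (e.g. AIME / IMO), coding (SWE-bench), agentic (Tau-bench), knowledge (MMLU-Pro). Plot their month-by-month improvement curves to see which are accelerating and which are plateauing. Everyone in the corpus uses a single "capability" to cover four different things; they need to be separated. And per Noam Brown's critique, every curve must control for test-time compute, since time-series comparisons without budget control are themselves distorted.
Who breaks the "grid equilibrium" first: whether, within the next 12 months, any frontier lab publishes performance curves at launch with tokens, cost, or time on the x-axis (the equilibrium shift Noam has publicly called for). This is the evidence for the measurement camp's claim that is easiest to observe from outside, and it can be tracked directly.
Small-budget to large-budget performance extrapolation: the open paper topic Noam names, using $10/$100 inference budgets to predict performance at a $10,000 budget. If extrapolation works, the practical cost of the evals crisis drops sharply; if not, "the capability ceiling is unknowable" becomes the long-term state. Also find the original AISI report behind "still improving at 100 million tokens" and independent replications across model families.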
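Two timestamped, checkable predictions: Noam Brown, that within 6–12 months a model will zero-shot a complete poker solver ("basically my entire PhD thesis in one go"), checkable by mid-2027; Mark Chen, that the endpoint of the three-year roadmap is models doing end-to-end research, checkable around 2029. Backfill when due.
Empirical compute scaling for post-training / RL: Camp C is betting on this axis, but public data is currently very scarce. Data from Anthropic or OpenAI relating actual RL post-training compute to capability gains would decide the whole judgment.
Frontier-scale replication of the Datology argument: find an independent team (not Datology) showing the magnitude of performance-per-dollar gains from data-quality interventions at 70B+ scale. If Ari Morcos is the only voice saying it, this line of judgment is unstable.
A quantified timeline for enterprise deployment ROI: actual ROI data from customers of Harvey, Cursor, and Anthropic, revisiting penetration, renewals, and expansion after 18 months. Lightcap cites a 10–20 year diffusion period; at least 12 months of real trend data are needed to judge whether he is being aggressive or conservative.
Real data on Chinese labs' efficiency advantage: the "constraint-driven innovation" raised by the Vibe Check panel, meaning concrete cost-per-capability numbers from DeepSeek, Moonshot AI (月之暗面), and Qwen. If this line is confirmed, the plateau of the Western "unlimited compute" paradigm would arrive faster.
Whether Eric Zelikman's EQ bottleneck is real: can an evaluation be constructed that separates "IQ capability" from "EQ capability" and tracks models' progress curve on the latter? There is currently no data at all on this axis.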
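The small-budget to large-budget extrapolation item in the list above can be framed as a simple test. The sketch below assumes, purely for illustration, that score is linear in log10 of budget; whether that assumption holds is precisely the open question, so this shows what a check would look like rather than a known result. The data points and function names are invented.

```python
import math


def fit_log_linear(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares fit of score = a + b * log10(budget) to (budget, score) pairs."""
    xs = [math.log10(budget) for budget, _ in points]
    ys = [s for _, s in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a: float, b: float, budget: float) -> float:
    """Score predicted by the fitted line at a given budget."""
    return a + b * math.log10(budget)


def extrapolation_error(small_points: list[tuple[float, float]],
                        large_budget: float, observed: float) -> float:
    """Observed score at the large budget minus the small-budget extrapolation."""
    a, b = fit_log_linear(small_points)
    return observed - predict(a, b, large_budget)


if __name__ == "__main__":
    # Invented scores at $10 and $100; the $10,000 score is also invented.
    small = [(10.0, 0.30), (100.0, 0.40)]
    print(extrapolation_error(small, large_budget=10_000.0, observed=0.75))
```

A consistently large error across tasks would suggest small budgets do not predict large-budget behavior; a small one would suggest cheap evaluations could stand in for expensive ones.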

Sources

Why Traditional Benchmarks Fail Modern AI Models — Noam Brown (OpenAI) · 2026-06-27 · 38cea6160e71819db9edf75b0e6eb94d
Cooking with OpenAI’s Research Chief — Mark Chen (OpenAI) · 2026-06-27 · 38cea6160e7181c0a479f668acdeafe6
Isa Fulford / Christina Kim (OpenAI) · 2025-08-10 · 24bea6160e7181a6ab96f16429901f68
AI Vibe Check panel (Ari Morcos, Jacob, Rob Toews) · 2025-12-22 · 2d1ea6160e718151b03de5b615a3a9fa
Winston Weinberg (Harvey) · 2026-01-20 · 2eeea6160e71814b8e75d2fcef01bdec
Jack Morris (Cornell Tech) · 2025-07-25 · 23bea6160e7181d7bc8fe8b7ee98a4ec
Brad Lightcap (OpenAI) · 2026-04-02 · 336ea6160e71814ebc52d163cb0974eb
Martin Casado / Sarah Wang (a16z) · 2026-02-26 · 313ea6160e7181e49982c635043e8277
Joelle Pineau (Cohere) · 2025-11-04 · 2a1ea6160e718141a7f7ebeb3040b1f3
Eric Zelikman (humans&) · 2025-10-12 · 28aea6160e718126a929f75ad9fe5c4a
Ari Morcos (Datology) · 2025-08-30 · 25eea6160e71810caa61e1a8a2eb8200