Evals as Moat
Topic Synthesis
Topic page (living document) · Last updated 2026-06-11 · Based on 11 interviews (core interviews checked against the full verbatim transcripts)
Changelog
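- 2026-06-11: Sourcing upgraded to full verbatim transcripts. This pass replaced third-hand paraphrases from the podwise summary layer with first-person quotes. The worst-affected area was Camp F (private eval = moat): all three of its citations had been third-person summaries and are now verbatim quotes from Foody, Satya, and Nebius. It also confirmed that the speaker in the "Full-Stack Builder" episode is Satya Nadella. Several paraphrases in the mainstream consensus and in Camps B/C were likewise replaced with verbatim quotes (Mia's SWE-Bench "saturated and contaminated", Foody's "eval = PRD = sales collateral", median pay of $95 an hour flexing up to $500). Added material that the summary layer had buried: enterprises being afraid to run evals because doing so exposes that they are being automated (Foody), concrete evidence of SWE-Bench contamination (Mia), and "2026 is the year of getting the model to clone Slack end-to-end" (Foody).
- 2026-06-10: Refresh. Added 3 interviews (a new Brendan Foody/Mercor interview, Satya Nadella, and Roman Chernin of Nebius), which converge on a new view, "private eval set = the enterprise's system of record". Added Camp F.
- 2026-05-20: First synthesis, based on 8 interviews.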
Mainstream consensus
Point one: general-purpose public benchmarks are rapidly saturating and becoming contaminated. Everyone agrees on this, and OpenAI's own people put it most bluntly.
"SWE-Bench Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently, we've seen that progress has kind of stalled. And we realize that this is because the eval is effectively saturated and also highly contaminated. So at this point, we think that it's not really measuring coding performance improvements well anymore." Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead
And "contamination" is not a vague claim: they used an auditing agent to catch concrete evidence:
"In SWE-Bench Verified, we found many instances of contamination across OpenAI models, across Quad [Claude], Opus 4.5, Gemini, Flash. … We saw things like regurgitating the ground truth solutions, things like in some cases, giving the task IDs." Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead
Pratik Bhavsar of Galileo confirms the same thing from another angle: rankings on general leaderboards do not transfer to agentic tasks:
"It's not necessary that what you see as top models in let's say LLM arena or other general evaluations, they might not be also the same ranking for other tasks like agentic tasks." Pratik Bhavsar (Galileo) · Ranking Agentic LLMs
Point two: the LLM labs' own north star is already eval quality rather than model capability, which pushes evals to infrastructure-level status. Foody compresses it into one line:
"If the model is the product, then the eval is the product requirement document. … In many ways, the barrier to applying agents to the entire economy to automate every workflow is how do we measure success? How do we eval it?" Brendan Foody (Mercor) · Why experts writing AI evals
Point three: the real signal in evals comes from the "look at the traces" step, something several practitioners observed independently.
"The right answer is keep looking at traces until you feel like you're not learning anything new." Hamel Husain & Shreya Shankar · Why AI evals are the hottest new skill
Where they disagree
Camp A · "Evals are the PM's core skill": practitioners / popularizers
Aman Khan (Head of Product at Arize) gives the most PM-flavored version:
"The PM's job is to have judgment on what that end product experience should be. And so being in the details on that when it comes to the human evals is really what determines whether or not your product is successful or fails." Aman Khan · Complete Beginner's Course on AI Evaluations
"Because if the golden data set sucks, then the rest of your evals will be terrible." Aman Khan · Complete Beginner's Course on AI Evaluations
Hamel Husain pushes the same position further:
"Looking at traces and annotating them to identify problems is the most valuable part of the evaluation process, even more so than building judges or doing formal evaluations." Hamel Husain · AI Evaluations Crash Course in 50 Minutes
"I've found that when you try to hide behind a score, you're not really making a decision." Hamel Husain · AI Evaluations Crash Course in 50 Minutes
Hamel Husain & Shreya Shankar add a subtle variation within Camp A by putting the eval judge in the position of the PRD:
"This is the purest sense of what a product requirements document should be. Eval, judge, that's telling you exactly what it should be and it's automatic and running constantly." Hamel Husain & Shreya Shankar · Why AI evals are the hottest new skill
Note that this is the same Hamel who says elsewhere that hiding behind a score means not making a decision. The two statements look contradictory, but he is distinguishing a judge you build after understanding the problem yourself (PRD-grade) from the evasion of outsourcing the decision to a generic metric (a hallucination score, a coherence score, and the like).
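The advice to keep reviewing traces until nothing new turns up can be made a little more concrete. The sketch below is one hypothetical way to operationalize it, not a method described in any of the interviews. It assumes each reviewed trace has been hand-annotated with a list of failure-mode labels, and it stops once a chosen number of consecutive traces (`patience`, an assumed parameter) adds no new label.

```python
from typing import Iterable, Sequence


def review_until_saturated(
    annotations: Sequence[Iterable[str]],
    patience: int = 20,
) -> tuple[int, set[str]]:
    """Walk annotated traces in order and stop once `patience` consecutive
    traces add no failure-mode label that has not been seen before.

    Returns (number of traces reviewed, set of distinct labels found).
    """
    seen: set[str] = set()
    quiet_streak = 0
    for reviewed, labels in enumerate(annotations, start=1):
        new_labels = set(labels) - seen
        seen |= new_labels
        quiet_streak = 0 if new_labels else quiet_streak + 1
        if quiet_streak >= patience:
            return reviewed, seen
    # Ran out of traces before the stopping rule fired.
    return len(annotations), seen


# Hypothetical annotations: one list of failure-mode labels per reviewed trace.
example = [["wrong_date"], ["tone"], [], ["wrong_date"], [], ["handoff"]]
reviewed, labels = review_until_saturated(example, patience=2)
```

With `patience=2` the example stops after the fourth trace and never reaches the `handoff` failure in the sixth, which is why a small `patience` is risky. The rule is only a proxy for the reviewer's own judgment that nothing new is being learned.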
Camp B · "Evals are an economy-scale moat / a new profession": the Mercor position
Brendan Foody (Mercor) reframes evals as a new job category and a supply-chain business:
"We grew from one to 400 million in revenue run rate in 16 months, fastest ascent in history." Brendan Foody · Why experts writing AI evals
"If you really think about it, we were put on earth to create reinforcement learning training data for labs." Brendan Foody · Why experts writing AI evals
An eval is both the PRD and the sales collateral, a point he makes in full:
"Evals are the PRD, but also subsequently the sales collateral, right? Because evals are what you give to researchers to show them what they should be building … but they're also the way that you demonstrate the efficacy of capabilities." Brendan Foody · Why experts writing AI evals
Mercor's own economics are evidence for the argument: expert hourly rates are far above crowdsourcing rates:
"Our median pay rate in the marketplace is $95 an hour, but it can flex up well up into $500 an hour. … If you look at the economics of the crowdsourcing companies, oftentimes they would pay like $30 an hour." Brendan Foody · Why experts writing AI evals
Within Camp B, Foody adds a separate judgment: the ceiling of the eval business depends on what humans can still do that models cannot:
"The market is bound by the amount of things where humans can do something that models can't." Brendan Foody · Why experts writing AI evals
The transcript also surfaces something the summary layer missed: Foody observes that some enterprises fear running evals at all, because an eval would confirm that they are being automated:
"There are certain enterprises we talk to that are almost fearful, not wanting to engage, not wanting to eval their businesses, because that'll provide the evidence that their value chain is being automated." Brendan Foody · Why experts writing AI evals
Camp C · "General benchmarks are collapsing, and the bets on what comes next differ": the OpenAI / Galileo position
OpenAI's Mia Glaese / Olivia Watkins bet on "harder public benchmarks plus dimensions beyond correctness". Checking correctness alone is no longer enough:
"Like Olivia talked about, does it have design taste? Like does it solve the problem the way that my team likes to solve problems? Is the code nice? Is it maintainable in the future? People care about a lot of these less tangible, harder to measure things that are still super meaningful." Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead
Pratik Bhavsar (Galileo) bets that domain-specific leaderboards are the next stop:
"People in the industry come from, let's say, I'm from healthcare. I want to see if this models work for healthcare or not. People are coming from investment, finance, insurance industry, and they want to know are the models ready for my domain or not?" Pratik Bhavsar (Galileo) · Ranking Agentic LLMs
He also gives a concrete saturation number: even his own new benchmark has already been pushed to 0.95:
"We saw with V1 that the scores are saturating. We see that the best score has already reached 0.95. … So we want the benchmarks to be harder." Pratik Bhavsar (Galileo) · Ranking Agentic LLMs
Camp D · "LLM-as-judge must separate relative ranking from absolute scoring": OpenPipe's inflection point
Kyle Corbitt (OpenPipe) points to an empirical inflection point in the evals toolchain that many people overlook:
"RULER, a library using LLMs as judges for relative ranking, significantly improved RL performance, increasing Kyle's confidence in RL's potential from 25% to 50%." Kyle Corbitt · Why Fine-Tuning Lost and RL Won
"If you tell a human, choose which of these is better, it's easier for them to do than say, is this one good or bad in absolute terms?" Kyle Corbitt · Why Fine-Tuning Lost and RL Won
Set beside Hamel's "be wary when you see an agreement score", this exposes a distinction no one states outright: LLM-as-judge works on relative-ranking tasks (RULER) and invites self-deception on absolute-scoring tasks (generic hallucination / coherence scores).
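The relative-versus-absolute distinction can be illustrated with a minimal sketch. It is not RULER's API. `prefer` is a hypothetical stand-in for an LLM judge that is only ever asked which of two outputs is better, never for an absolute score, and the ranking is a plain round-robin win count.

```python
from itertools import combinations
from typing import Callable, Sequence


def rank_by_pairwise_wins(
    candidates: Sequence[str],
    prefer: Callable[[str, str], str],
) -> list[str]:
    """Rank distinct candidates by how many head-to-head comparisons they win.

    `prefer(a, b)` must return whichever of its two arguments is better.
    """
    wins = {candidate: 0 for candidate in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1
    # sorted() is stable, so tied candidates keep their input order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


def toy_judge(a: str, b: str) -> str:
    """Placeholder judge (assumption): prefers the longer answer."""
    return a if len(a) >= len(b) else b


ranking = rank_by_pairwise_wins(["ok", "a detailed answer", "short"], toy_judge)
```

A round-robin needs n(n-1)/2 judge calls, so this form suits small candidate groups. A real judge can also be inconsistent (A beats B, B beats C, C beats A), which the win count tolerates but does not resolve.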
Camp E · "A good evaluator is the precondition for reflection / self-correction": the research-side view
Shunyu Yao and Harrison Chase, discussing reflection / self-correction, add a qualification worth pulling out:
"The principle of self-reflection is more applicable if there is a good evaluator, such as in coding where errors can be easily identified and used for reflection." Shunyu Yao · Language Agents: From Reasoning to Acting
This upgrades evals from a "product tool" to a "precondition for agentic intelligence": without a good evaluator, most agentic self-improvement fails.
Camp F · "Private eval set = the enterprise's system of record / the real moat": a new convergence of Mercor / Satya / Nebius
This is the biggest new idea in the June 2026 batch of interviews, and it comes from three independent sources. Read in the full transcripts, this line is firmer and more falsifiable than the summary layer made it look.
In his new interview, Foody defines the eval set outright as the enterprise's new system of record and names its strategic use, commoditizing the model layer:
"Over time, this is going to develop to look very similar across every Fortune 500, where they'll need to have this system of record for evaluating and specifying agent behavior across every workflow in their business. And they're going to use that to commoditize the model layer because they want to enable perfect competition for the models having zero switching costs." Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility
The moat is not in the software layer but in forward-deployed work that encodes an organization's tacit knowledge into agents:
"If you have a great forward deployed motion where you're going deep with a customer, you're training the agents based on all of this tacit knowledge within the company so that it understands how to perform effectively, that feels incredibly differentiated and hard to recreate." Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility
As for why the software layer has no moat, he gives a startling timeline:
"We're building out an eval set that measures how effectively agents can build end-to-end SaaS applications. 2025 was the year of how do you get a model to make a PR in a code base, and 2026 is the year of how do you get the model to clone Slack end-to-end. Those capabilities are going to exist in the models in the next 12 months." Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility
Satya Nadella makes almost the same call but adds an actionable "control test": can you move a private eval from model A to model B and keep climbing, without leaking your traces?
"Every company having private evals may be the biggest IP. … You have an eval that's private, you're using a model A, can you switch it to model B and climb up? If you can, then you're in control. If you can't, you're not in control." Satya Nadella (Microsoft) · The Rise of the Full-Stack Builder
He also spells out why only private evals count, given that every public eval can be maxed out:
"You'll have private evals because we know all the evals out there are good, interesting, but they're not really that critical at this point because they all can be maxed. And so the point is each company will have its own private eval." Satya Nadella (Microsoft) · The Rise of the Full-Stack Builder
Roman Chernin of Nebius provides a third piece of corroboration from the infrastructure side: evals are the cold-start threshold that determines whether an enterprise can reach its exponential-growth phase:
"First of all, they were focusing on evaluations. … You need to have metrics. You need to have this CI/CD process established for AI development. … They have this, you can call it foundational investments, a cold start problem. When they solve it, they start to grow exponentially." Roman Chernin (Nebius) · 20VC: Nebius Co-Founder on AI Infrastructure Bubbles
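This line relates to Camps A and B in a subtle way: Camp A says evals are a PM skill, Camp B says evals are a high-paid expert business, and Camp F says the accumulated eval set is itself the moat. The three are really three links in the eval value chain: the people who write evals (A), the business of outsourcing them (B), and the asset they accumulate into (F). But no one explicitly connects the three. Camp F also directly challenges the sister topics ai-moat-2026 / ai-native-products: if "the software layer has no defensibility and private evals are the moat" (the claim in the title of Foody's new interview), the definition of a moat has been redrawn once more.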
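Satya Nadella's control test can be sketched as a model-agnostic harness. Everything here is an assumption made for illustration: a model is any callable from prompt to completion, a private eval case is a prompt plus a pass/fail checker, and `tolerance` is an arbitrary allowance. The sketch covers only the portability half of the test (does model B land near model A on the same private eval). "Climbing" would mean re-running it after each iteration on model B.

```python
from typing import Callable, Sequence

# Hypothetical types: neither comes from any vendor SDK.
Model = Callable[[str], str]                # prompt in, completion out
Case = tuple[str, Callable[[str], bool]]    # (prompt, pass/fail checker)


def pass_rate(model: Model, eval_set: Sequence[Case]) -> float:
    """Share of private eval cases whose checker accepts the model's output."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one case")
    passed = sum(1 for prompt, check in eval_set if check(model(prompt)))
    return passed / len(eval_set)


def portable_to(
    model_a: Model,
    model_b: Model,
    eval_set: Sequence[Case],
    tolerance: float = 0.02,
) -> bool:
    """True if model B scores within `tolerance` of model A, or above it,
    on the same private eval set."""
    return pass_rate(model_b, eval_set) >= pass_rate(model_a, eval_set) - tolerance
```

The design point is that the eval set holds no reference to a particular model or vendor: the cases and checkers belong to the company, and any model that can be wrapped as a callable can be scored against them.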
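What no one fully addresses
- How do "evals are a skill for everyone" (Camp A) and "evals are a high-paid expert business" (Camp B) coexist? Hamel says PMs should do it themselves; Foody says "we hire experts at $500 an hour". No one explains whether these two economic logics can both hold. The most likely answer is that different evals (product-level vs. frontier-model-level) suit different people, but no one in the corpus spells this out.
- LLM-as-judge performs very differently for relative ranking and for absolute scoring, yet almost no one draws the distinction. Kyle got results with RULER (relative); what Hamel is wary of is agreement / coherence scores (absolute). No one has drawn up a decision table for which kind of judge fits which task.
- What comes after public benchmarks die? There are three different bets and no head-to-head comparison. OpenAI bets on SWE-Bench Pro (a harder public benchmark), Galileo bets on domain-specific leaderboards (vertical but public), and Hamel / Aman / Satya bet on fully private internal evals. No one compares the relative strengths of the three paths directly.
- "Private eval = moat" (Camp F) collides head-on with "data moats only hold at mega-scale" (a16z, see ai-moat-2026). Foody and Satya say an accumulated private eval set is enough to defend a business; a16z says data network effects only show up at billions of users. How large does a private eval set have to be, and how long does it take to build, before it is truly hard to replicate? Satya offers the qualitative test of whether you can switch models and keep climbing, but no one offers a quantitative threshold. This is the most critical and the emptiest part of the Camp F argument.
- The same Foody changes framing between two interviews. The older interview sells "the labor business of experts writing evals" (Mercor's GMV, $500 an hour); the new one sells "private eval = moat, and the application layer has no defensibility". The two framings serve different narratives (one raises the value of labor supply, the other raises Mercor's value as a moat intermediary), and no one presses him on how they fit together. In particular: if the moat lies in the private evals that customers accumulate themselves, where does Mercor, the intermediary that helps them accumulate those evals, capture value in the long run?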
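My view
Judgment (not fact): in 2026, evals are two things at once. For application-layer PMs, they are a core craft that is manual and judgment-intensive (Camp A is right). For frontier labs, they are a new kind of supply-chain business (Camp B / Mercor is right). Conflating the two creates a lot of illusions: many companies neither do serious manual trace analysis nor can afford Mercor-style expert fees, and end up with neither. The real moat is not in tools or methods (which commoditize quickly) but in the ability to accumulate proprietary traces and turn them into internal judges. Satya's "can you switch models and keep climbing" is the cleanest criterion I have seen so far. After reading the full transcripts my confidence in Camp F rose a little: Satya supplied an actionable test and Foody supplied the "clone Slack" timeline, so this line is no longer just a vendor slogan.
Confidence: medium-high. The strongest support is Hamel's repeated, independent observation that looking at traces is the most valuable part, with no dissent inside Camp A. The weakest link remains "how big must a private eval set be to count as a moat": it is in tension with the ai-moat-2026 claim that data moats only hold at mega-scale, and two of the three Camp F sources (Foody, Satya) are parties selling this very narrative, so independent customer-side evidence is needed to settle it.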
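What I still want to know
- A real case of one company running the full loop of manual trace annotation → internal eval judge → model iteration: how long the cycle was, how many people it took, and how much quality improved in the end. Hamel pushes this loop repeatedly, but no interview gives complete data 12 months after the fact.
- A quantitative threshold for how large a private eval set must be and how long it takes to build. Satya gives a qualitative test (can you keep climbing after switching models), but Camp F lacks a number of the form "after this many annotated samples / this many months of forward-deployed work, a competitor's replication cost becomes truly high". This is the rebuttal evidence Camp F most needs against a16z's "data moats only hold at mega-scale".
- Data on how Mercor's TAM evolves: Foody says the market is bounded by the total amount of things humans can do and models cannot. If models absorb a large share of expert-verifiable tasks in the second half of 2026, Mercor's ARR growth curve will signal it immediately.
- A head-to-head of the three paths, SWE-Bench Pro / domain leaderboards / private evals: does the same agent rank consistently under all three eval regimes? If not, where does the variance come from?
- A first-person, customer-side account of "we beat a competitor in some vertical because of our evals". Almost every "evals are the moat" argument so far comes from vendors or educators (Mercor, Arize, Galileo, Hamel's course, Microsoft). A competitive post-mortem from the buyer's perspective is missing.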
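The head-to-head question in the list above (do the same agents rank consistently across eval regimes?) has a standard first measurement: a rank correlation between two leaderboards. The sketch below computes Spearman's rho for two tie-free orderings of the same agents. The agent names are placeholders, and no such comparison appears in the source interviews.

```python
from typing import Sequence


def spearman_rho(ranking_x: Sequence[str], ranking_y: Sequence[str]) -> float:
    """Spearman rank correlation between two orderings of the same agents.

    Each ranking lists agents from best to worst with no ties.
    Returns 1.0 for identical orderings and -1.0 for exactly reversed ones.
    """
    n = len(ranking_x)
    if n < 2:
        raise ValueError("need at least two agents")
    if len(set(ranking_x)) != n or len(ranking_y) != n or set(ranking_x) != set(ranking_y):
        raise ValueError("rankings must contain the same distinct agents")
    position_in_y = {agent: i for i, agent in enumerate(ranking_y)}
    squared_diffs = sum((i - position_in_y[agent]) ** 2 for i, agent in enumerate(ranking_x))
    # Closed form for tie-free ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Placeholder leaderboards (hypothetical): the top two agents swap places.
public_benchmark = ["agent_a", "agent_b", "agent_c", "agent_d"]
private_eval = ["agent_b", "agent_a", "agent_c", "agent_d"]
rho = spearman_rho(public_benchmark, private_eval)
```

A rho near 1 would mean the public benchmark is a usable proxy for the private eval. A low or negative rho would be the quantitative form of the claim that general leaderboard rankings do not transfer. With ties, or with only a handful of agents, a library routine such as `scipy.stats.spearmanr` and a significance check would be the safer choice.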
Sources
The 6 core interviews were reread in full from the verbatim transcripts this round, and every citation was checked against the transcript:
- Mia Glaese & Olivia Watkins (OpenAI Frontier Evals: SWE-Bench-Dead) · 2026-02-26 · 313ea6160e7181cfa590f64d2c2162af
- Brendan Foody (Mercor: evals as a labor business) · 2025-09-22 · 276ea6160e7181edb819cd82fd44b5ac
- Brendan Foody (Mercor: private eval = moat / application layer has no defensibility) · 2026-06-06 · 377ea6160e7181e38e78ea1048f812d1
- Satya Nadella (Microsoft: private eval = biggest IP / Full-Stack Builder) · 2026-06-06 · 377ea6160e718196bed4e563447eaa46
- Roman Chernin (Nebius: evals as the cold-start prerequisite) · 2026-06-10 · 37bea6160e71818187b7d7eaa67763b6
- Pratik Bhavsar (Galileo: Ranking Agentic LLMs) · 2025-07-25 · 23bea6160e71810bb58fc297ea2443c4
Practitioner citations carried over from the previous round, already checked verbatim (Camps A/D/E; all originals are first-person verbatim transcripts):
- Aman Khan (Arize) · 2025-08-25 · 25aea6160e7181aaacb6e251289d1800
- Hamel Husain & Shreya Shankar · 2025-09-26 · 27aea6160e7181bbb856ff899d12c8e4
- Hamel Husain (Crash Course, NurtureBoss) · 2025-10-04 · 282ea6160e71813883fbee731f4d43e6
- Kyle Corbitt (OpenPipe: RULER) · 2025-10-17 · 28fea6160e718186bfe4e90dde1d3a7d
- Shunyu Yao & Harrison Chase (Language Agents) · 2025-07-25 · 23bea6160e7181ff8d84f86195b392f6
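Note: the three Camp F interviews (Foody-20VC, Satya, Nebius) are not yet in the alias member table in topics.json; they were brought in by hand in the previous round. On a later topics.py update, add their ids to interview_ids so that a headless full re-synthesis picks them up automatically.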