Topic Synthesis

Evals as Moat

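Changelog

2026-07-16: Citation-fidelity fixes (site-wide audit), 5 in total. Three quotes re-cited (Hamel's "annotation and counting", Kyle's RULER / 25→50, Shunyu Yao on the evaluator), each restored from a podwise summary sentence to the verbatim original. Two attributions corrected: "golden dataset sucks" was said by an unnamed co-host, so it is replaced with Aman's verbatim response; "purest sense of PRD" was said by the host, Lenny, and Shreya's verbatim agreement and qualification have been added.
2026-06-11: Sourcing upgraded to full verbatim transcripts. Third-hand paraphrases from the podwise summary layer were replaced with first-person quotes. The worst-affected area was Camp F (private eval = moat): all three of its citations were third-person summaries and are now verbatim quotes from Foody / Satya / Nebius. Also confirmed that the speaker in the "Full-Stack Builder" piece is Satya Nadella. Several paraphrases in the mainstream consensus and in Camps B/C were likewise replaced with verbatim quotes (Mia on SWE-Bench being "saturated and contaminated", Foody's "eval = PRD = sales collateral", median pay $95→$500). New material that the summary layer had buried: enterprises afraid to run evals because doing so would expose that they are being automated (Foody), concrete evidence of SWE-Bench contamination (Mia), and 2026 as the year of getting a model to clone Slack end-to-end (Foody).
2026-06-10: Refresh. Added 3 interviews (the new Brendan Foody/Mercor interview, Satya Nadella, Nebius's Roman Chernin), which converge on a new view, "private eval set = the enterprise system of record". Added Camp F.
2026-05-20: First synthesis, based on 8 interviews.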

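Mainstream consensus

Point one: general-purpose public benchmarks are rapidly saturating and becoming contaminated. Everyone agrees on this, and OpenAI's own people put it most bluntly.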

"SWE-Bench Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently, we've seen that progress has kind of stalled. And this we realize that this is because the eval is effectively saturated and also highly contaminated. So at this point, we think that it's not really measuring coding performance improvements well anymore."
Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead

And "contamination" is not a vague charge here: they used an auditing agent to catch concrete evidence.

"In SWE-Bench Verified, we found many instances of contamination across OpenAI models, across like Quad, Opus 4.5, Gemini, Flash. … We saw things like regurgitating the ground truth solutions, things like in some cases, giving like the task IDs …"
[Transcript note: "Quad" appears to be a mis-transcription of "Claude".]
Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead
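
One crude way to make "regurgitating the ground truth solutions" concrete is to measure how much of a reference patch reappears verbatim in a model's output. The sketch below only illustrates that idea. It is not the auditing agent described in the interview, and the function names, the whitespace tokenization, and the n-gram size of 8 are hypothetical choices.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of n-token windows in `text` (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def regurgitation_ratio(model_output: str, ground_truth: str, n: int = 8) -> float:
    """Fraction of the ground truth's n-grams that appear verbatim in the output.

    Hypothetical heuristic: a value near 1.0 suggests the output reproduces
    the reference solution almost word for word. It cannot prove contamination.
    """
    reference = ngrams(ground_truth, n)
    if not reference:  # ground truth is shorter than n tokens
        return 0.0
    return len(reference & ngrams(model_output, n)) / len(reference)


# Made-up reference patch and two made-up model outputs.
patch = "def add ( a , b ) : return a + b  # fixes the issue in the utils module"
print(regurgitation_ratio("Here is my fix: " + patch, patch))
print(regurgitation_ratio("I would rewrite the parser instead of patching it", patch))
```

A high ratio is only a flag for manual review: a short or idiomatic fix can legitimately match the reference.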

Pratik Bhavsar of Galileo confirms the same thing from another angle, namely that rankings on general leaderboards do not transfer to agentic tasks:

"It's not necessary that what you see as top models in let's say LLM arena or other specific evaluation or general evaluations, they might not be also the same ranking for other tasks like agentic tasks."
Pratik Bhavsar (Galileo) · Ranking Agentic LLMs

Point two: the LLM labs' own north star is now eval quality rather than model capability, which pushes evals to infrastructure-level status. Foody compresses it into one line:

"If the model is the product, then the eval is the product requirement document. … In many ways, the barrier to applying agents to the entire economy to automate every workflow is how do we measure success? How do we eval it?"
Brendan Foody (Mercor) · Why experts writing AI evals

Point three: the real signal in evals comes at the "look at the traces" step, something several practitioners observed independently.

"The right answer is keep looking at traces until you feel like you're not learning anything new."
Hamel Husain & Shreya Shankar · Why AI evals are the hottest new skill
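
The stopping rule in that quote can be approximated mechanically. The following is a minimal sketch, assuming each reviewed trace has been hand-labeled with a set of failure-mode tags and that "not learning anything new" means a run of consecutive traces that add no unseen tag. The function name and the `patience` parameter are hypothetical, not something the speakers specify.

```python
from typing import Iterable, Set


def traces_until_saturation(labels_per_trace: Iterable[Set[str]], patience: int = 5) -> int:
    """Count traces reviewed before `patience` consecutive traces add no new tag.

    If the stream ends first, the total number of traces is returned.
    """
    seen: Set[str] = set()
    streak = 0      # consecutive traces that taught us nothing new
    reviewed = 0
    for labels in labels_per_trace:
        reviewed += 1
        new_tags = labels - seen
        if new_tags:
            seen |= new_tags
            streak = 0
        else:
            streak += 1
            if streak >= patience:
                break
    return reviewed


# Hypothetical review session: tags a reviewer assigned, trace by trace.
session = [{"wrong_date"}, {"tone"}, set(), {"wrong_date"}, {"handoff"},
           set(), {"tone"}, {"wrong_date"}, {"new_tag_never_reached"}]
print(traces_until_saturation(session, patience=3))
```

The `patience` value is a judgment call: a small one stops early and risks missing rare failure modes, a large one costs more reviewer time.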

Where they disagree

Camp A · "Evals are the PM's core skill" (practitioners / popularizers)

Aman Khan (Head of Product at Arize) gives the most PM-flavored version:

"The PM's job is to have judgment on what that end product experience should be. And so being in the details on that when it comes to the human evals is really what determines whether or not your product is successful or fails."
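Aman Khan · Complete Beginner's Course on AI Evaluations

The point that the golden dataset determines everything downstream was in fact first voiced in the transcript by a co-host (labeled only as Speaker 3 and never named): "if the golden data set sucks, then the rest of your evals will be terrible". Aman agreed in full on the spot and pinned it as the first step of the whole process: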

"Yeah, totally. … it's the most important step before you start trying to build anything complicated on top, like LLM as a judge or code-based evals. Like, just look at the data and debate, are these the right metrics for us to look at? Do we have the eval criteria in place? Do we know how to evaluate this or not? … we just start with five rows of data, right?"
Aman Khan · Complete Beginner's Course on AI Evaluations

Hamel Husain pushes the same position to a more absolute form:

"The most valuable process of evals is the annotation and counting. Even if that's all you do, you don't build any judge, you don't do any eval, you don't do whatever, you can get insane value by just doing that. That's the one part that everyone skips."
Hamel Husain · AI Evaluations Crash Course in 50 Minutes
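
"Annotation and counting" needs very little tooling. As a sketch, assuming annotations are stored as dictionaries with a free-text note and a `failure_mode` label assigned by a human reviewer (both field names are made up for illustration), the counting step is a few lines:

```python
from collections import Counter


def count_failure_modes(annotations):
    """Tally human-assigned failure labels, most frequent first.

    Traces whose `failure_mode` is missing or None (judged fine) are skipped.
    """
    counts = Counter(
        a["failure_mode"] for a in annotations if a.get("failure_mode")
    )
    return counts.most_common()


# Hypothetical annotations written while reading traces.
annotations = [
    {"trace_id": 1, "note": "ignored requested date", "failure_mode": "missed_constraint"},
    {"trace_id": 2, "note": "fine", "failure_mode": None},
    {"trace_id": 3, "note": "did not hand off to a human", "failure_mode": "no_handoff"},
    {"trace_id": 4, "note": "ignored budget", "failure_mode": "missed_constraint"},
    {"trace_id": 5, "note": "ignored pet policy", "failure_mode": "missed_constraint"},
]
for mode, n in count_failure_modes(annotations):
    print(f"{mode}: {n}")
```

The resulting table is what tells a team which failure mode to fix first. No judge or metric is involved at this stage.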

"I've found that when you try to hide behind a score, you're not really making a decision."
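Hamel Husain · AI Evaluations Crash Course in 50 Minutes

The "eval-judge as PRD" framing actually comes from host Lenny's on-air summary in the transcript. He said that several recent guests had been saying "evals are the new PRDs", then pointed at the LLM-as-judge prompt on screen and said: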

"This is the purest sense of what a product requirements document should be. Eval, judge, that's telling you exactly what it should be and it's automatic and running constantly."
Lenny (host, synthesizing several guests' "evals are the new PRD" line) · Why AI evals are the hottest new skill

Shreya agreed on the spot but added a Camp A-style qualification: this "PRD" has to grow out of your own data rather than be written down up front from gut feel.

"Yeah, absolutely. And it's kind of derived from our own data. So of course, it's a product manager's expectations. What I find a lot of people miss is they just put in what their expectations are before looking at their data. But as we look at our data, we uncover more expectations that we couldn't have dreamed up in the first place. And that ends up going into this prompt."
Shreya Shankar · Why AI evals are the hottest new skill

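Note that Hamel himself says in another interview that hiding behind a score means not making a decision. Set next to "eval-judge as PRD" this looks contradictory, but the distinction is between a judge you build after understanding the problem yourself (PRD-grade) and the evasion of outsourcing to a generic metric (a hallucination score, a coherence score, and the like). Shreya's qualification (the judge must grow out of your own data) lands exactly on that dividing line.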

Camp B · "Evals are an economy-scale moat / a new profession" (the Mercor position)

Brendan Foody (Mercor) reframes evals as a new job category and a supply-chain business:

"We grew from one to 400 million in revenue run rate in 16 months, fastest ascent in history."
Brendan Foody · Why experts writing AI evals

"If you really think about it, we were put on earth to create reinforcement learning training data for labs."
Brendan Foody · Why experts writing AI evals

Evals are simultaneously the PRD and the sales collateral, a point he lays out in full:

"Evals are the PRD, but also subsequently the sales collateral, right? Because like evals are what you give to researchers to show them what they should be building … but they're also the way that you demonstrate the efficacy of capabilities."
Brendan Foody · Why experts writing AI evals

Mercor's own economics serve as evidence for this argument, with expert hourly pay far above crowdsourcing rates:

"Our median pay rate in the marketplace is $95 an hour, but it can flex up well up into, like, $500 an hour. … If you look at the economics of the crowdsourcing companies, oftentimes they would pay, like, $30 an hour to talent as sort of the average."
Brendan Foody · Why experts writing AI evals

Within Camp B, Foody also offers a separate judgment: the ceiling of the eval business depends on what humans can still do that models cannot.

"The market is bound by the amount of things where humans can do something that models can't."
Brendan Foody · Why experts writing AI evals

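The transcript surfaces something the summary layer lacked: Foody observes that some enterprises fear running evals at all, because an eval would confirm that they are being automated.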

"There are certain enterprises we talk to that are almost fearful, not wanting to engage, not wanting to eval their businesses, because that'll provide the evidence that their value chain is being automated."
Brendan Foody · Why experts writing AI evals

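Camp C · "General benchmarks are collapsing; everyone bets differently on what comes next" (the OpenAI / Galileo position)

OpenAI's Mia Glaese / Olivia Watkins bet on "harder public benchmarks plus dimensions beyond correctness". Checking only whether the answer is right is no longer enough: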

"Like Olivia talked about sort of like, does it have like design taste, right? Like does it solve the problem the way that, you know, my team likes to solve problems. Is the code nice, right? Like, is it, is it well written? Is it sort of like clean code, right? Like people care about this. Is it maintainable in the future? People care about a lot of these maybe less tangible, less tangible and like harder to measure, frankly, things that are still like super meaningful for people that are working with coding agents."
Mia Glaese / Olivia Watkins (OpenAI) · SWE-Bench-Dead

Pratik Bhavsar (Galileo) bets that domain-specific leaderboards are the next stop:

"People in the industry come from, let's say, I'm from healthcare. I want to see if this models work for healthcare or not. People are coming from investment, finance, insurance industry, and they want to know that is, are the models ready for my domain or not?"
Pratik Bhavsar (Galileo) · Ranking Agentic LLMs

He also gives a concrete "saturation" number: even V1 of his own benchmark has already seen a best score of 0.95.

"We saw with V1 that the scores are saturating. We see that the best score has already reached 0.95. … So we want the benchmarks to be harder."
Pratik Bhavsar (Galileo) · Ranking Agentic LLMs

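Camp D · "LLM-as-judge must be split into relative ranking and absolute scoring" (OpenPipe's inflection point)

Kyle Corbitt (OpenPipe) points to an empirical inflection point in the evals toolchain that many people overlook: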

"And so simplifying a lot, it's basically just LLM as judge on a whole group. So you say, okay, this is the task I'm trying to achieve. Here's four different runs of an agent trying to achieve it. Which of these did best? And it stack ranks them. And it turns out that works phenomenally well with gRPO, like way better than I expected …"
Kyle Corbitt (on RULER) · Why Fine-Tuning Lost and RL Won

This "relative ranking" inflection directly rewrote the odds he himself placed on the RL direction:

"So I mentioned my initial opinion of how likely this direction was to work was maybe 25%. We're up to 55% or so. And RULER is actually a big update that got me from the 25 to the 50."
Kyle Corbitt · Why Fine-Tuning Lost and RL Won

"If you tell a human, choose which of these is better, it's easier for them to do than say, is this one good or bad in absolute terms?"
Kyle Corbitt · Why Fine-Tuning Lost and RL Won

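Set next to Hamel's wariness about agreement scores, this exposes a distinction nobody states head-on: LLM-as-judge works for relative-ranking tasks (RULER) and easily turns into self-deception for absolute-scoring tasks (generic hallucination/coherence scores).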
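
The relative-versus-absolute distinction can be sketched in code. The example below assumes a judge that returns only an ordering of a group of rollouts (best first), converts the ordering into rewards, and then normalizes them within the group, which is the group-relative step GRPO uses for advantages. It is a simplified illustration, not RULER's actual implementation. The linear rank-to-reward mapping and all function names are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rank_to_rewards(ranking: List[str]) -> Dict[str, float]:
    """Map a best-first ranking to rewards spaced evenly from 1.0 down to 0.0."""
    n = len(ranking)
    if n == 1:
        return {ranking[0]: 1.0}
    return {rid: (n - 1 - pos) / (n - 1) for pos, rid in enumerate(ranking)}


def group_advantages(rewards: Dict[str, float]) -> Dict[str, float]:
    """Normalize rewards within the group: (reward - mean) / std."""
    values = list(rewards.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all rollouts tied, so there is no learning signal
        return {rid: 0.0 for rid in rewards}
    return {rid: (r - mu) / sigma for rid, r in rewards.items()}


# Hypothetical judge verdict over four rollouts of the same task, best first.
judge_ranking = ["run_3", "run_1", "run_4", "run_2"]
advantages = group_advantages(rank_to_rewards(judge_ranking))
for run_id, adv in advantages.items():
    print(run_id, round(adv, 3))
```

Only the order within the group matters here, so the judge never has to commit to what an absolute "7 out of 10" means. That is the property the comparison with absolute scoring turns on.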

Camp E · "A good evaluator is the precondition for reflection / self-correction" (the research-side view)

A qualification that Shunyu Yao and Harrison Chase gave while discussing reflection / self-correction is worth pulling out:

"I think a key bottleneck is the evaluator … in order to reflect upon your thoughts, you have to have a very good evaluator to judge whether your thought is good or not. But that might be as hard as solving the problem itself or even harder. The principle of self-reflection is probably more applicable if you have a good evaluator, for example, in the case of coding. If you have those arrows [errors], then you can just reflect on that and how to solve the bug and stuff."
Shunyu Yao · Language Agents: From Reasoning to Acting

This upgrades evals from "product tool" to "precondition for agentic intelligence": without a good evaluator, most agentic self-improvement fails.

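Camp F · "Private eval set = the enterprise system of record / the real moat" (a new convergence of Mercor / Satya / Nebius)

This is the biggest new idea in the June 2026 batch of interviews, and it comes from three independent sources. Read in the full transcripts, this line is harder-edged and more falsifiable than the summary layer made it look.

In the new interview Foody defines the eval set outright as the enterprise's new system of record and names its strategic use, which is commoditizing the model layer: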

"Over time, this is going to develop to look very similar across every Fortune 500, where they'll need to have this system of record for evaluating and specifying agent behavior across every workflow in their business. And they're going to use that to commoditize the model layer because they want to enable perfect competition for the models having zero switching costs."
Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility

The moat is not in the software layer but in forward-deployed work that encodes an organization's tacit knowledge into agents:

"If you have a great forward deployed motion where you're going deep with a customer, you're training the agents based on all of this tacit knowledge within the company so that it understands how to perform effectively, that feels incredibly differentiated and hard to recreate."
Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility

As for why the software layer has no moat, he gives an alarming timeline:

"We're building out an eval set that measures how effectively agents can build end-to-end SaaS applications, where 2025 was the year of, how do you get a model to make a PR and a code base? And 2026 is the year of, how do you get the model to clone Slack end-to-end? Those capabilities are going to exist in the models in the next 12 months."
Brendan Foody (Mercor) · 20VC: Mercor CEO on Why Application Layer Companies Have No Defensibility

Satya Nadella reaches almost the same judgment but adds an actionable "control test": can you move your private eval from model A to model B, without leaking traces, and keep climbing?

"Every company having private evals may be the biggest IP. … You have an eval that's private, you're using a model A, can you switch it to model B and climb up? If you can, then you're in control. If you can't, you're not in control."
Satya Nadella (Microsoft) · The Rise of the Full-Stack Builder
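
Satya's control test can be phrased as a small harness. This is a sketch under stated assumptions: models are plain callables from prompt to answer, the private eval is a list of (prompt, expected) pairs graded by exact match, and "climbing" is reduced to comparing pass rates after a switch. None of these names or simplifications come from the interview.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # hypothetical: prompt in, answer out
EvalCase = Tuple[str, str]     # (prompt, expected answer)


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Share of private eval cases the model answers exactly right."""
    if not cases:
        raise ValueError("the eval set is empty")
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def switch_report(model_a: Model, model_b: Model, cases: List[EvalCase]) -> Dict[str, float]:
    """Run the same private eval against both models and report the change."""
    a, b = pass_rate(model_a, cases), pass_rate(model_b, cases)
    return {"model_a": a, "model_b": b, "delta": b - a}


# Toy stand-ins for two vendors' models and a two-case private eval.
private_eval = [("2+2", "4"), ("capital of France", "Paris")]
answers_b = {"2+2": "4", "capital of France": "Paris"}
report = switch_report(lambda p: "4", lambda p: answers_b.get(p, ""), private_eval)
print(report)
```

The design point is that `private_eval` and the grading logic live outside either model's stack. If the eval can only run inside one vendor's tooling, the comparison cannot be made at all, which is the "not in control" case.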

He also spells out why, since every public eval can be maxed out, only private evals count:

"You'll have private evals because we know all the evals out there are good, interesting, but they're not really that critical at this point because they all can be maxed. And so the point is each company will have its own private eval."
Satya Nadella (Microsoft) · The Rise of the Full-Stack Builder

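Roman Chernin of Nebius supplies a third piece of corroboration from the infrastructure side: evals are the cold-start threshold that decides whether an enterprise can enter its exponential-growth phase.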

"First of all, they were focusing on evaluations. … You need to have like metrics. … You need to have this CI, CD process established for AI development. … They have this, you can call it foundational investments, a cold start problem, how to start shipping. When they solve it, they start to grow exponentially."
Roman Chernin (Nebius) · 20VC: Nebius Co-Founder on AI Infrastructure Bubbles

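The relationship between this line and Camps A/B is subtle: Camp A says evals are a PM skill, Camp B says evals are a high-paying expert business, and Camp F says the accumulated eval set is itself the moat. The three are really three links in the eval value chain: the people who write evals (A), the business of outsourcing them (B), and the asset they accumulate into (F). Yet nobody explicitly strings the three links together. Camp F also directly challenges the sister topics ai-moat-2026 / ai-native-products: if the software layer has no defensibility and private evals are the moat (the claim in the title of Foody's new interview), then the definition of a moat has been redrawn once more.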

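What nobody fully works out

How do "evals are a mass skill" (Camp A) and "evals are a high-paying expert business" (Camp B) coexist? Hamel says PMs should do it themselves; Foody says Mercor hires experts at up to $500 an hour. Nobody explains head-on whether both economic logics can hold at once. The likeliest answer is that different evals (product-level vs. frontier-model-level) suit different people, but nobody in the corpus spells that out.
LLM-as-judge performs very differently as "relative ranking" versus "absolute scoring", yet almost nobody draws the distinction explicitly. Kyle gets results with RULER (relative); what Hamel is wary of is agreement/coherence scores (absolute). Nobody has drawn up a decision table for which kind of judge fits which task.
What replaces public benchmarks once they are dead? There are three different bets and no head-to-head. OpenAI bets on SWE-Bench Pro (a harder public benchmark), Galileo on domain-specific leaderboards (vertical and public), and Hamel/Aman/Satya on fully private internal evals. Nobody compares the relative strengths of the three paths directly.
"Private eval = moat" (Camp F) collides head-on with "data moats only hold at mega-scale" (a16z, see ai-moat-2026). Foody/Satya say an accumulated set of private evals is enough to defend a business; a16z says data network effects only show up at billions of users. How large, and how long in the making, must a private eval set be before it is truly hard to replicate? Satya offers the qualitative test of whether you can switch models and keep climbing, but nobody offers a quantitative threshold. This is the most critical and the emptiest part of the Camp F argument.
Same person, two interviews, and Foody's framing changed. The earlier interview sells the labor business of experts writing evals (Mercor's GMV, $500 an hour); the new one sells "private eval = moat, and the application layer has no defensibility". The two framings serve different narratives (one raises the value of labor supply, the other raises Mercor's value as the intermediary to the moat), and nobody presses him on how the two stories fit together. In particular: if the moat lives in the private evals that customers accumulate themselves, where does Mercor, the intermediary that helps them accumulate those evals, capture value in the long run?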

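My take

A judgment (not a fact): in 2026 evals are two things at once. For application-layer PMs they are a hands-on, judgment-intensive core craft (Camp A is right); for frontier labs they are a new kind of supply-chain business (Camp B / Mercor is right). Conflating the two creates a lot of illusions: many companies neither do serious manual trace analysis nor can afford Mercor-style expert fees, and end up with neither. The real moat is not in tools or methods (which commoditize quickly) but in the ability to accumulate proprietary traces and turn them into internal judges. Satya's test of whether you can switch models and keep climbing is the cleanest criterion I have seen so far. After reading the full transcripts my confidence in Camp F rose a little: with Satya supplying an actionable test and Foody supplying the "clone Slack" timeline, this line is no longer just a vendor slogan.

Confidence: medium-high. The strongest support is that Hamel repeatedly and independently observes that looking at traces is the most valuable part, and there is no dissent within Camp A. The weakest link remains how large a private eval set must be to count as a moat: it is in tension with the ai-moat-2026 claim that data moats only hold at mega-scale, and two of the three Camp F sources (Foody, Satya) are parties selling this narrative, so independent customer-side evidence is needed to lock it down.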

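What I still want to know

A real case of a company running the full loop "manual trace annotation → internal eval-judge → model iteration": how long the cycle took, how many people it needed, and how much quality ultimately improved. Hamel keeps promoting this loop, but no interview gives complete 12-month after-the-fact data.
A quantitative threshold for how large a private eval set must be and how long it takes to build. Satya gives a qualitative test (can you switch models and keep climbing), but Camp F lacks a number of the form "after this many annotated samples / this many months of forward-deployed work, a competitor's replication cost becomes genuinely high". This is the rebuttal evidence Camp F most needs against a16z's claim that data moats only hold at mega-scale.
Data on how Mercor's TAM evolves: Foody says the market is bounded by the amount of things humans can do that models cannot. If models absorb a large share of expert-verifiable tasks in the second half of 2026, Mercor's ARR growth curve should show it immediately.
A head-to-head of the three paths, SWE-Bench Pro / domain leaderboards / private evals: does the same agent rank consistently under all three eval regimes? If not, where does the variance come from?
A first-person, customer-side account of "we beat a competitor in some vertical because of evals". So far almost every "evals are a moat" argument comes from vendors / educators (Mercor, Arize, Galileo, Hamel's course, Microsoft). A buyer-side competitive postmortem is missing.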
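
The ranking-consistency question in the list above is straightforward to quantify once such data exists. A minimal sketch, assuming each eval regime yields a strict ranking of the same agents (no ties), uses Kendall's tau: +1 means identical order, -1 means fully reversed. The agent names and rankings below are invented placeholders, not results.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Kendall's tau between two strict rankings of the same agents (1 = best)."""
    agents = list(rank_a)
    if set(agents) != set(rank_b) or len(agents) < 2:
        raise ValueError("both rankings must cover the same two or more agents")
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1   # both regimes order this pair the same way
        elif sign < 0:
            discordant += 1   # the regimes disagree on this pair
    total_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / total_pairs


# Placeholder rankings of four agents under two eval regimes.
public_benchmark = {"agent_a": 1, "agent_b": 2, "agent_c": 3, "agent_d": 4}
private_eval = {"agent_a": 2, "agent_b": 1, "agent_c": 3, "agent_d": 4}
print(kendall_tau(public_benchmark, private_eval))
```

Running the same function over each pair of regimes would show which two disagree most, which is where the variance question starts.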

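Sources

The 6 core interviews were re-read this round from the full verbatim transcripts, and every quote was checked against them: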

Mia Glaese & Olivia Watkins (OpenAI Frontier Evals: SWE-Bench-Dead) · 2026-02-26 · 313ea6160e7181cfa590f64d2c2162af
Brendan Foody (Mercor: the evals labor business) · 2025-09-22 · 276ea6160e7181edb819cd82fd44b5ac
Brendan Foody (Mercor: private eval = moat / application layer has no defensibility) · 2026-06-06 · 377ea6160e7181e38e78ea1048f812d1
Satya Nadella (Microsoft: private eval = biggest IP / Full-Stack Builder) · 2026-06-06 · 377ea6160e718196bed4e563447eaa46
Roman Chernin (Nebius: evals as the cold-start prerequisite) · 2026-06-10 · 37bea6160e71818187b7d7eaa67763b6
Pratik Bhavsar (Galileo: Ranking Agentic LLMs) · 2025-07-25 · 23bea6160e71810bb58fc297ea2443c4

Practitioner quotes carried over from the previous round, already verified verbatim (Camps A/D/E; all sources are first-person verbatim transcripts):

Aman Khan (Arize) · 2025-08-25 · 25aea6160e7181aaacb6e251289d1800
Hamel Husain & Shreya Shankar · 2025-09-26 · 27aea6160e7181bbb856ff899d12c8e4
Hamel Husain (Crash Course, NurtureBoss) · 2025-10-04 · 282ea6160e71813883fbee731f4d43e6
Kyle Corbitt (OpenPipe: RULER) · 2025-10-17 · 28fea6160e718186bfe4e90dde1d3a7d
Shunyu Yao & Harrison Chase (Language Agents) · 2025-07-25 · 23bea6160e7181ff8d84f86195b392f6

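Note: the three Camp F interviews (Foody-20VC, Satya, Nebius) are not currently in the alias member table of topics.json; they were brought in manually in the previous round. When topics.py update is next run, their ids should be added to interview_ids so that a headless full re-synthesis picks them up automatically.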