Topic Overview

Is High-Quality Human Data the Last Bottleneck? Can Synthetic Data Replace Expert Annotation · The Human-Data Ceiling

Topic page (living document) · Last updated 2026-06-12 · Drawn from 6 interviews

Changelog

Mainstream consensus

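First, public internet text has been drained, the gains have shifted to post-training, and post-training consumes human data. Garrett Lord of Handshake states this industry logic most bluntly, and it is the premise on which this whole cohort of companies exists: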

"There's an insatiable appetite for this data as models … most of the gains as of recent, the last 12, 18 months have all come from post-training as they've already absorbed the entire internet on the pre-training side and there aren't as many gains they made there. So specifically what that looks like is like thousands of professionals and experts providing human data to these labs."

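Second, synthetic data "supplements but does not replace": everyone uses it, and everyone says it cannot substitute for the top tier of data. Edwin Chen supplies the most-quoted line in this whole topic, backed by numbers from customers' own experiments: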

"A lot of times our customers will come to us and be like, yeah, for the past six months, I've been experimenting with synthetic data. I've gathered 10 to 20 million pieces of synthetic data. Actually, we finally realized that 99% of it just wasn't useful. … even a thousand pieces of high quality human data, highly curated, really, really high quality human data is actually more valuable than those 10 million points."

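Asked directly whether synthetic data will replace human data, Foody lands in exactly the same place: synthetic data handles only review and augmentation, and pushing the frontier still requires humans as the benchmark: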

"There's going to be synthetic reviews and there's going to be synthetic augmentation to make it more efficient to engage with humans. But ultimately, if you want to push the frontier to get the model to do something that the human knows how to do that the model doesn't know how to do, then you need some human stasis point to measure that."

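Third, the technical root cause of synthetic data's unreliability is that the model has no external signal and drifts off course while reinforcing its own output. Edwin Chen uses a concrete failure to explain why models need a human as a corrective anchor: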

"One of the frontier models … If you go use it today, like maybe 10% of the time when I use it, it will just output random Hindi characters and random Russian characters in the model responses. … It's almost like you need an external human to tell the model that, yeah, this is wrong. … you almost need like an external like quality signal in order to tell it what the right objective should be. And if you don't have that, then the model just go in all these crazy directions."

Fourth, this has spawned a new market that recruits domain experts at high hourly rates, and the price tiers have pulled apart. Mercor openly states its pay gap against the older generation of crowdsourcing vendors:

"Our average marketplace pay rate is $95 an hour to put that in frame of reference, whereas Scale and [Surge] generally pay about $30 an hour."

Handshake gives the same price tier from the supply side and packages it as the draw for experts:

"Instead of teaching a chemistry class, making $22 an hour, they can make $150 an hour working on a project … Model companies weren't actually getting the data that they needed at the level of throughput or speed, volume, and quality that they needed. That's what they care about, speed, volume, and quality."

Where they disagree

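On the surface, all six (in fact four companies and five founder-level speakers) are saying that human data is irreplaceable. Put their actual words side by side, though, and on the question of what the ceiling really is and which direction it moves, two worldviews contradict each other. That question is precisely the core variable in what this business is worth.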

1. Where the ceiling is drawn: quality/richness (unlimited) vs. TAM (finite and narrowing)

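This is the deepest crack in the topic. Edwin Chen repeatedly uses "no ceiling" as a catchphrase. He is talking about the quality dimension and the richness of environments, where the direction is upward and nearly unlimited: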

"There's almost an unlimited ceiling in this gen AI world on the type of quality that you can build."

Asked how realistic RL environments need to be, he gives the same answer:

"I think there's no ceiling. At the end of the day, you just want as much diversity and richness as you can get, because the more richness that you have, the more that models can learn from. … So I think there's almost an unlimited ceiling here."

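Foody draws a different axis, the demand ceiling (TAM), and says explicitly that it is finite. Pressed by Harry Stebbings ("the stronger the model, the fewer smart people you can use, so isn't supply getting narrower?"), he does not dodge the narrowing. Instead he pins the ceiling down as a fixed equation: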

"The total addressable market is limited by the amount of things that humans are better at than models. … so long as there's things that the human is able to do, the model is not able to do, and we want those capabilities in the model … we need humans that help to create those verifiers and help to measure that frontier to ultimately improve model capabilities."

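Both are talking about a "ceiling", yet they mean opposite things: Edwin's ceiling is depth of quality (unlimited upward), while Foody's is the stock of tasks (shrinking as models improve). Edwin sells "there is always a finer taste to teach". Foody sells "the area where humans still lead the model", and he himself concedes that this area is shrinking; he simply believes it shrinks slowly.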

2. Is that area expanding or shrinking: Foody's two self-contradictory anchors

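One thing deserves to be laid out separately: across two interviews, Foody offers two pieces of evidence that point in opposite directions. He does not reconcile them, and I will not reconcile them for him.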

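On one hand, he tells a story in which the TAM expands rather than shrinks (add complexity and the annotators who had been squeezed out can be pulled back in):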

"We started out with 100 people … over time, only 20 people could contribute to it, the exact dynamic that you're describing. But then we started adding other degrees of complexity … How do we get it to do the trajectories that a human might spend 10 hours, 100 hours on? And all of a sudden, everyone else could contribute to the project again because they could stump the model."

On the other hand, in another interview he is sober about where this road ends: you bring experts in essentially to have them build the thing that replaces them, and this is only "time until superintelligence":

"There's always a frontier." (Elad Gil: "Unless it becomes superhuman, right?") "Yeah, unless it becomes superhuman."

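The same person says, on one side, "adding complexity extends the runway indefinitely" and, on the other, "there is always a frontier, unless the model becomes superhuman, at which point it goes to zero". The two statements are consistent if and only if you believe superhuman is still far off, which is exactly the bet he shares with Edwin (see below).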

3. When human data really stops working: not until AGI

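Asked directly by Lenny whether a day will come when these people are no longer needed, Edwin ties the failure point firmly to AGI, almost as a matter of definition: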

"I think that will not happen until we reach AGI. Like, it's almost like by definition, if we haven't reached AGI yet, then there's more for the models to learn from. And so I don't think that's gonna happen anytime soon."

His AGI timeline is a decade to decades, which amounts to betting that the point at which this business expires lies far in the future:

"Within the next one or two years, the models are going to automate 80% of the average L6 software engineer's job. It's going to take another few years to move to 90% and another few years to 99% … I think we're closer to a decade or decades away than that."

Foody uses exactly the same logic but adds a harsher circular argument: you need human data even to know whether you have reached superintelligence:

"You don't even know that you have superintelligence without having [evals] for everything. Because you need to understand what is the human baseline and what is good."

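Note that this consensus is itself fragile: it stakes the survival of the whole industry on the single judgment that AGI is still far off. If that judgment is wrong, both companies' business logic collapses at the same time. They simply do not believe it will be wrong.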

4. Is "quality" an objectively measurable quantity: algorithmically measurable vs. irreducible taste

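There is a finer crack inside the camp. Edwin describes quality as something that can be measured through engineering (by analogy with Google Search's signal system):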

"We gather so many signals. You gather page-dependent signals, user-dependent signals, activity-based signals. And all of these feed into a giant ML algorithm at the end of the day. … We basically have an ML team in Toronto that builds a lot of these algorithms to measure all of this."

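Yet the same person, in another interview, stresses that the core of quality is tacit taste that cannot be reduced to a checklist, which sits in tension with "measurable":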

"Certain companies, if you ask them what is good poem, they will simply robotically check off all of these instructions on our list. But again, I don't think that makes for good poetry. So certain Frontier Labs, the ones with more taste and sophistication, they will realize that it doesn't reduce to this thick set of checkboxes."

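Foody translates "quality" into a different question altogether: the power law of talent. In his framing, the issue is not the abstract quantity "data quality" but which 10-20% of people contribute most of the model improvement: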

"The outcomes of data and the people that contribute to it are extremely power law … if you have 100 people on a project, oftentimes majority of the model improvement is coming from the top 10 to 20% of people. … When we're able to find those people that are the 10x contributors, it's very difficult to recreate."

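So "moat" means two different things to the two companies. Surge says the moat is its algorithms and technology for measuring quality, and fires directly at competitors for lacking them ("they don't have any technology"). Mercor, when this accusation is relayed to its face, counters that it does "use all sorts of models and algorithms to assess the quality", but locates the real barrier in finding and retaining that cohort of 10x experts. Both sides claim to have technology, and both imply the other is a "body shop".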

5. Where Reducto sits on this line (marginal, but worth a sentence)

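Reducto touches this topic only at the edges. It converts complex documents into "LLM-ready" structured data, which is data-ingestion infrastructure and not part of the synthetic-vs.-human substitution debate. But it inadvertently adds a brick to the mainstream consensus: even for document-extraction datasets, people cannot be bothered to build them properly.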

"We were expecting there to be a lot of really good quality datasets for us to measure up against. But we often saw that in the space of documents, people just weren't often willing to put in the effort that was required to generate the really high quality data. … we reached out to teams of PhDs … built really, really extensive data pipelines and kind of a data engine in-house."
Adit / Ronak (Reducto) · Reducto: Making Human Data LLM-Ready

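What nobody spelled out

My view

What follows is judgment, not paraphrase, held with medium-to-low confidence (the sample skews heavily toward the sell side and lacks opposing voices).

What I still want to know

Sources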