Algorithm Researcher @ ByteDance SEED · Foundation-Evaluation
I am an Algorithm Researcher at ByteDance SEED on the Foundation-Evaluation team. I build automated evaluation systems across the pre-training and post-training stages of LLMs, focusing on expert-level capability assessment, model behavior analysis, and multimodal understanding. My work has contributed to the Seed 1.6 / 1.8 / 2.0 releases. The benchmarks I co-led (SuperGPQA, FinSearchComp, DiscoX, XpertBench) have appeared at NeurIPS / ICLR / ACL, been cited by Nature, and been adopted as standard agentic tests in leading model releases, including MiniMax-M2 and Moonshot's Kimi K2.5; FinSearchComp was also quote-tweeted by Elon Musk, reaching tens of millions of impressions.
My thesis: Evals are the new PRD — what we measure is what we build.
As LLMs' capability boundaries expand while compute and human resources remain limited, evaluation must answer two fundamental questions: (1) Where to invest: defining evaluation directions anchored in realistic, economically valuable tasks that reflect how models are actually used; and (2) How to iterate efficiently: building evaluation systems with high data quality, strong discriminative resolution, and comprehensive coverage, so that good training strategies are never missed. I work on these questions on ByteDance SEED's Foundation-Evaluation team (some of these insights are reflected in the Seed 1.8 model card and in Seed 2.0).
Specific topics include:
A high-fidelity benchmark of 1,346 expert-curated tasks across 7 professional domains, evaluated with 15–40 weighted rubric checkpoints per task and our ShotJudge paradigm — exposing a pronounced "expert-gap" where even frontier models peak at only ~66% success.
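As an illustration of the weighted-checkpoint idea, here is a minimal sketch of rubric aggregation; the checkpoint fields, weights, and example content are assumptions for exposition, not the actual ShotJudge implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str  # what the judge verifies in the model's response
    weight: float     # relative importance within the task's rubric
    passed: bool      # verdict from an LLM judge or expert reviewer

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of rubric checkpoints satisfied, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total if total > 0 else 0.0

# Three of a task's 15-40 checkpoints, shown with invented content
task = [
    Checkpoint("cites the governing regulation", 3.0, True),
    Checkpoint("derives the correct liability figure", 5.0, False),
    Checkpoint("flags the disclosure deadline", 2.0, True),
]
print(f"task score: {rubric_score(task):.2f}")  # -> task score: 0.50
```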
Extends our distillation-quantification line from knowledge to agentic tool-use behaviors — introducing Response Pattern Similarity (RPS) and Action Graph Similarity (AGS) to reveal structural homogenization among 18 models across 8 providers on τ-Bench / τ²-Bench.
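The exact RPS/AGS definitions live in the paper; as a rough intuition, an action-graph similarity can be sketched as Jaccard overlap over directed tool-call transitions. Everything below (the edge construction, the example traces) is an illustrative assumption, not the paper's formulation.

```python
def action_graph_similarity(trace_a: list[str], trace_b: list[str]) -> float:
    """Jaccard overlap of directed tool-call transitions; an illustrative
    stand-in for AGS, not the paper's exact definition."""
    edges_a = set(zip(trace_a, trace_a[1:]))  # consecutive tool-call pairs
    edges_b = set(zip(trace_b, trace_b[1:]))
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Two hypothetical models on the same tau-Bench-style airline task
m1 = ["search_flights", "get_user", "book_flight", "send_receipt"]
m2 = ["search_flights", "get_user", "book_flight", "notify_user"]
print(action_graph_similarity(m1, m2))  # -> 0.5; high values suggest homogenization
```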
Introduces the Model-in-Model evaluation paradigm: leveraging LLM agents to automatically analyze model capabilities, identify behavioral patterns, and generate structured evaluation reports, enabling scalable, human-out-of-the-loop diagnostics.
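A minimal sketch of the loop, assuming two hypothetical callables: `target(prompt)` for the model under test and `analyst(prompt)` for the evaluator agent.

```python
def model_in_model_report(target, analyst, probes: list[str]) -> str:
    """Run probe prompts through the target model, then ask an analyst
    agent to distill the transcripts into a structured capability report."""
    transcripts = [(p, target(p)) for p in probes]
    evidence = "\n\n".join(f"PROMPT: {p}\nRESPONSE: {r}" for p, r in transcripts)
    return analyst(
        "You are an evaluation analyst. From the transcripts below, identify "
        "recurring behavioral patterns (refusal style, reasoning shortcuts, "
        "systematic errors) and write a structured report, one finding per line.\n\n"
        + evidence
    )
```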
Examines models' operations-research abilities in everyday life as capability boundaries expand, testing whether they can handle tightly coupled constraints that mirror real-world travel planning.
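To make "tightly coupled constraints" concrete, here is a toy validator over a hypothetical itinerary schema; the real benchmark's task format (opening hours, inter-city transit, ticket inventories) is far richer.

```python
def constraint_violations(legs: list[dict], budget: float) -> list[str]:
    """Check a day plan for overlapping legs and budget overrun.
    Times are "HH:MM" strings, which compare correctly lexicographically."""
    errors = []
    for prev, cur in zip(legs, legs[1:]):
        if cur["depart"] < prev["arrive"]:  # legs must not overlap in time
            errors.append(f"{cur['name']} starts before {prev['name']} ends")
    if sum(leg["cost"] for leg in legs) > budget:
        errors.append("total cost exceeds budget")
    return errors

legs = [
    {"name": "Great Wall tour", "depart": "09:00", "arrive": "13:00", "cost": 60},
    {"name": "hutong walk",     "depart": "12:30", "arrive": "14:00", "cost": 0},
]
print(constraint_violations(legs, budget=50.0))
# -> ['hutong walk starts before Great Wall tour ends', 'total cost exceeds budget']
```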
Features a comprehensive Evaluation System covering foundational LLM, multimodal VLM, and agentic capabilities — combining public benchmarks with internal assessments aligned to high-value application patterns (e.g., FinSearchComp, WorldTravel, GUI Agent).
Introduces Adaptive Chain-of-Thought (AdaCoT) and a sparse MoE architecture (230B total / 23B active parameters) with a 256K context window, dynamically allocating reasoning depth based on query complexity.
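In the spirit of AdaCoT, dynamic depth allocation can be sketched as a complexity-conditioned token budget; the estimator and budget tiers below are invented for illustration, not Seed 1.6's actual router.

```python
def reasoning_budget(query: str, estimate_complexity) -> int:
    """Map an estimated complexity score in [0, 1] to a chain-of-thought
    token budget; trivial queries skip extended reasoning entirely."""
    c = estimate_complexity(query)  # e.g., a small learned classifier
    if c < 0.2:
        return 0        # answer directly, no visible reasoning
    if c < 0.6:
        return 1_024    # short scratchpad
    return 16_384       # full deliberate reasoning within the 256K context
```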
Targets the classic task of translation, revealing that even the most advanced LLMs still fall significantly short of human experts at discourse-level translation in professional domains: a long road ahead.
Evaluates models' ability to complete real-world financial-analyst workflows through 635 expert-crafted questions spanning global and Greater China markets. It has been adopted as a standard capability test in leading model releases: a core agentic benchmark in the MiniMax-M2 model card and one of six core agentic benchmarks in the Moonshot Kimi K2.5 tech report. Grok 4 tops the global leaderboard at 68.9% (vs. a 75% human-expert baseline), and the result was quote-tweeted by Elon Musk, reaching tens of millions of impressions.
Explores rollout efficiency in RL — investigating how to scale reinforcement learning for LLMs by bridging the gap between policy optimization quality and inference-time compute.
A closed-loop experiment: discovering problems from practice, distilling them into a benchmark, and using the benchmark to guide model iteration — focusing on multi-turn dialogue robustness under motivation transfer and cross-turn dependency.
ByteDance's first reasoning-focused model — achieves 86.7 on AIME 2024 and 55.0 on Codeforces via reinforcement learning, with a compact MoE design (20B active / 200B total).
The largest expert-level knowledge benchmark to date — spanning 285 graduate disciplines with 80+ expert annotators and a Human-LLM collaborative filtering mechanism, revealing that even the best models only reach ~62% accuracy.
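As a rough picture of human-LLM collaborative filtering, one plausible escalation rule is sketched below; the solver pool, thresholds, and escalation logic are assumptions, not SuperGPQA's exact pipeline.

```python
def needs_expert_review(question: dict, solvers: list, attempts: int = 4) -> bool:
    """Flag candidate questions whose LLM solve rate looks suspicious:
    near-perfect (likely trivial or leaked) or zero (possibly flawed)."""
    answers = [s(question["prompt"]) for s in solvers for _ in range(attempts)]
    hit_rate = sum(a == question["gold"] for a in answers) / len(answers)
    return hit_rate > 0.9 or hit_rate == 0.0  # escalate to expert annotators
```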
Uncovers the gap between fluid intelligence (novel reasoning) and crystallized intelligence (memorized knowledge) in LLMs through cryptographic compositional tasks, with mechanistic interpretability experiments explaining why.
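A toy generator in this spirit composes two primitive transforms so that success requires executing novel rules rather than retrieving memorized facts; the rule set and prompt format are illustrative, not the benchmark's.

```python
def make_cipher_task(plaintext: str, shift: int) -> tuple[str, str]:
    """Compose a Caesar shift with a string reversal into one decoding task."""
    caesar = "".join(
        chr((ord(ch) - 97 + shift) % 26 + 97) if ch.islower() else ch
        for ch in plaintext
    )
    ciphertext = caesar[::-1]  # second transform: reverse the shifted string
    prompt = (f"Rule 1: shift each lowercase letter forward by {shift}. "
              f"Rule 2: reverse the result. Decode: {ciphertext}")
    return prompt, plaintext  # (question, gold answer)

print(make_cipher_task("fluid intelligence", shift=3))
```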
A pioneering work that opens up the study of LLM distillation quantification — proposing the first systematic framework to measure distillation degree across models, revealing widespread model homogenization in the field.
Isolates pure reasoning ability from memorized knowledge by designing knowledge-orthogonal rules — revealing that LLMs struggle significantly with out-of-distribution reasoning independent of pretrained knowledge.