Zhoufutu Wen 温周伏土

Algorithm Researcher @ ByteDance SEED · Foundation-Evaluation


Biography

I am an Algorithm Researcher at ByteDance SEED, working on the Foundation-Evaluation team. I build automated evaluation systems across pre-training and post-training stages of LLMs, with a focus on expert-level capability assessment, model behavior analysis, and multimodal understanding.

Research Interests

LLM Evaluation · Capability Assessment · Model Behavior Analysis · Benchmarking

Research

As LLMs' capability boundaries expand while compute and human resources remain limited, evaluation must answer two fundamental questions: (1) Where to invest — defining evaluation directions anchored in realistic, economically valuable tasks that reflect how models are actually used; and (2) How to iterate efficiently — building evaluation systems with high data quality, strong discriminative resolution, and comprehensive coverage, so that good training strategies are never missed. I work on these problems on ByteDance Seed's Foundation-Evaluation team (some of these insights are reflected in the Seed1.8 Model Card and Seed 2.0).

Specific topics include:

  1. Capability Definition
    Systematically define core capability structures for LLMs in real-world domains (finance, translation, legal, education, etc.), decomposing evaluable units from industry workflows covering search, tool use, and multi-stage decision-making. Representative works include FinSearchComp (ICLR 2026), DiscoX (cited by Nature), and WorldTravel.
  2. Quantitative Evaluation
    Build evaluation metrics and systems that provide objective, high-resolution feedback on model capabilities, serving as the basis for training strategy iteration. Focus on evaluation quality and resolution to prevent good strategies from being missed. Representative works include SuperGPQA (NeurIPS 2025), KOR-Bench (ICLR 2025), and MARS-Bench (EMNLP 2025 Findings).
  3. Analysis & Attribution
    Go beyond scores to analyze model capability patterns and behavioral signals — competitor analysis, internal iteration analysis, base vs. instruct comparisons — turning evaluation into actionable diagnostic insights. Representative works include Quantification of LLM Distillation (ACL 2025 Main) and CryptoX.
  4. Verification & Optimization
    Verify analysis hypotheses through controlled experiments and sanity checks, then scale validated insights across teams. We design TreePO to bridge the gap between policy optimization quality and inference-time compute in LLM reinforcement learning.
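
To make the "discriminative resolution" idea in topic 2 concrete: a benchmark has high resolution when resampling its items still reliably ranks one model above another. The sketch below is a generic paired-bootstrap check, not taken from any Seed system; the function name and toy per-item scores are hypothetical.

```python
import random

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which model A's total score
    exceeds model B's -- a rough proxy for whether the benchmark can
    discriminate between the two models at this sample size."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample benchmark items with replacement (paired across models).
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples

# Toy per-item scores (1 = correct, 0 = wrong) for two hypothetical models.
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
win_rate = paired_bootstrap_win_rate(a, b)
```

A win rate hovering near 0.5 means the benchmark cannot separate the two models at this sample size; a high-resolution benchmark drives it toward 0 or 1, which is what keeps a good training strategy from being mistaken for noise.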

News

2026-02
"Seed 2.0" officially released, featuring Pro / Lite / Mini / Code variants.
2026-02
Our paper "WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints" is published on arXiv.
2025-12
"Seed1.8 Model Card: Towards Generalized Real-World Agency" is officially released.
2025-12
"Introduction to Techniques Used in Seed1.6" technical report is released.
2025-11
Our paper "DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains" is published on arXiv.
2025-09
Our paper "FinSearchComp" is published on arXiv and accepted at ICLR 2026; reposted three times by Elon Musk, reaching tens of millions of impressions.
2025-09
"SuperGPQA" accepted at NeurIPS 2025; "Quantification of LLM Distillation" accepted at ACL 2025 Main.
2025-08
Our paper "TreePO: Bridging the Gap of Policy Optimization and Inference Efficiency" is published on arXiv.
2025-06
Our paper "MARS-Bench" is published on arXiv; later accepted at EMNLP 2025 Findings.
2025-04
"Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning" is published on arXiv.
2025-02
"SuperGPQA" and "CryptoX" published on arXiv.
2025-01
"Quantification of Large Language Model Distillation" published on arXiv; "KOR-Bench" accepted at ICLR 2025.
2024-10
Our paper "KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks" is published on arXiv.
2023-09
Joined ByteDance SEED as Algorithm Researcher on the Foundation-Evaluation team.

Selected Publications

* equal contribution   † corresponding author

ByteDance Seed Team (Core Contributor: Evaluation & Model Analysis System)

Introduces Model-in-Model evaluation paradigm: leveraging LLM agents to automatically analyze model capabilities, identify behavioral patterns, and generate structured evaluation reports — enabling scalable, human-out-of-the-loop diagnostics.

Tech Report 2026 Homepage
Z. Wang, C. Yang, Y. Que, Z. Yang, H. Yuan, Y. Wang, Z. Jiang, S. Fang, Z. Wu, Z. Wang, Z. Yao, J. Liu, J. Ren, Y. Li, Y. Yang, J. Liu, J. Yang, Z. Wang, G. Zhang, Zhoufutu Wen, W. Huang

As LLM capability boundaries expand, we examine models' operations-research abilities in everyday life — testing whether they can handle tightly coupled constraints that mirror real-world travel planning.

arXiv 2026 Paper
ByteDance Seed Team (Core Contributor: Evaluation System)

Features a comprehensive Evaluation System covering foundational LLM, multimodal VLM, and agentic capabilities — combining public benchmarks with internal assessments aligned to high-value application patterns (e.g., FinSearchComp, WorldTravel, GUI Agent).

Tech Report 2025 Blog GitHub
ByteDance Seed Team

Introduces Adaptive Chain-of-Thought (AdaCoT) and sparse MoE architecture (230B total / 23B active) with 256K context — dynamically allocating reasoning depth based on query complexity.

Tech Report 2025 Blog
X. Zhao, Zhoufutu Wen, Z. Chen, J. Ding, J. Jiao, S. Li, X. Li, D. Liang, S. Long, Q. Liu, X. Wu, et al.

Revisits the classic task of translation, revealing that even the most advanced LLMs still fall significantly short of human experts at discourse-level translation in professional domains, suggesting a long road ahead.

L. Hu, J. Jiao, J. Liu, Y. Ren, Zhoufutu Wen, K. Zhang, X. Zhang, X. Gao, T. He, F. Hu, Y. Liao, Z. Wang, C. Yang, Q. Yang, et al.

Evaluates models' ability to complete real-world financial analyst workflows through 635 expert-crafted questions spanning global and Greater China markets — widely adopted as a standard capability assessment by leading foundation model teams.

Y. Li*, Q. Gu*, Zhoufutu Wen*, Z. Li, T. Xing, S. Guo, T. Zheng, X. Zhou, X. Qu, W. Zhou, et al.

Explores rollout efficiency in RL — investigating how to scale reinforcement learning for LLMs by bridging the gap between policy optimization quality and inference-time compute.

arXiv 2025 Paper
C. Yang*, Y. Luo*, Zhoufutu Wen, Q. Chu, T. Gong, L. Liu, K. Zhang, J. Jiao, G. Zhang, W. Huang, N. Yu

A closed-loop experiment: discovering problems from practice, distilling them into a benchmark, and using the benchmark to guide model iteration — focusing on multi-turn dialogue robustness under motivation transfer and cross-turn dependency.

EMNLP 2025 Findings Paper Code Project
ByteDance Seed Team

ByteDance's first reasoning-focused model — achieves 86.7 on AIME 2024 and 55.0 on Codeforces via reinforcement learning, with a compact MoE design (20B active / 200B total).

arXiv 2025 Paper
X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, Zhoufutu Wen, et al.

The largest expert-level knowledge benchmark to date — spanning 285 graduate disciplines with 80+ expert annotators and a Human-LLM collaborative filtering mechanism, revealing that even the best models only reach ~62% accuracy.

NeurIPS 2025 Paper Code
J. Shi*, C. Wei*, L. Yang*, Z.M. Wang, C. Yang, G. Zhang, S. Huang, T. Peng, J. Yang, Zhoufutu Wen

Uncovers the gap between fluid intelligence (novel reasoning) and crystallized intelligence (memorized knowledge) in LLMs through cryptographic compositional tasks, with mechanistic interpretability experiments explaining why.

S. Lee, J. Zhou, C. Ao, K. Li, X. Du, S. He, H. Wu, T. Liu, J. Liu, Zhoufutu Wen, et al.

A pioneering work that opens up the study of LLM distillation quantification — proposing the first systematic framework to measure distillation degree across models, revealing widespread model homogenization in the field.

K. Ma, X. Du, Y. Wang, H. Zhang, Zhoufutu Wen, X. Qu, J. Yang, J. Liu, M. Liu, X. Yue, et al.

Isolates pure reasoning ability from memorized knowledge by designing knowledge-orthogonal rules — revealing that LLMs struggle significantly with out-of-distribution reasoning independent of pretrained knowledge.

ICLR 2025 Paper Project

Experience

ByteDance SEED

Algorithm Researcher · Foundation-Evaluation
Sep 2023 – Present · Beijing, China
Led the design and development of the SEED evaluation system covering pre-training and post-training stages. Focus areas include expert-level capability assessment, evaluation data production, and model behavior analysis. Open-sourced benchmarks such as SuperGPQA, KOR-Bench, FinSearchComp, DiscoX, and more.
LLM Evaluation Benchmark Design Model Analysis Expert-level Assessment

Baidu · Ecom

Algorithm Engineer
Jul 2020 – Sep 2023 · Beijing, China
Built multimodal content understanding solutions for commercial monetization. Trained and optimized VLP models (200M–10B params) applied across ad trigger, ranking, creative generation, and user experience. Completed 30+ A/B tests over 3 years.
Multimodal Models VLP Ad Systems Baidu Highest Award Nominee ×2

Education

University of Electronic Science and Technology of China (UESTC)

M.S. in Computer Science · School of Computer Science
Aug 2017 – Jun 2020
Top 10%

University of Electronic Science and Technology of China (UESTC)

B.E. in Electronic Engineering · School of Physics
Aug 2013 – Jun 2017
Top 5% · National Scholarship