Zhoufutu Wen 温周伏土

Algorithm Researcher @ ByteDance SEED · Foundation-Evaluation

"Two things fill the mind with ever new and increasing admiration and awe: the starry heavens above me and the moral law within me." — Kant

Biography

I am an Algorithm Researcher at ByteDance SEED, working on the Foundation-Evaluation team. I lead the construction and development of automated evaluation systems for both pre-training and post-training stages of large language models. My current focus lies in expert-level capability assessment, high-quality evaluation data production, and post-evaluation model behavior analysis.

As LLMs' capability boundaries continue to expand while compute and human resources remain limited, I believe the core challenge of evaluation is determining where to invest resources and how to efficiently support model iteration once a direction is chosen.

Research Interests

LLM Evaluation & Benchmarking
Expert-level Capability Assessment
Model Behavior Analysis
Multimodal Understanding

News

2026-02
"Seed 2.0" officially released, featuring Pro / Lite / Mini / Code variants.
2026-02
Our paper "WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints" is published on arXiv.
2025-12
"Seed1.8 Model Card: Towards Generalized Real-World Agency" is officially released.
2025-12
"Introduction to Techniques Used in Seed1.6" technical report is released.
2025-11
Our paper "DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains" is published on arXiv.
2025-09
Our paper "FinSearchComp" is published on arXiv and accepted at ICLR 2026; reposted three times by Elon Musk, drawing tens of millions of impressions.
2025-09
"SuperGPQA" accepted at NeurIPS 2025; "Quantification of LLM Distillation" accepted at ACL 2025 Main.
2025-08
Our paper "TreePO: Bridging the Gap of Policy Optimization and Inference Efficiency" is published on arXiv.
2025-06
Our paper "MARS-Bench" is published on arXiv; later accepted at EMNLP 2025 Findings.
2025-04
"Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning" is published on arXiv.
2025-02
"SuperGPQA" and "CryptoX" are published on arXiv.
2025-01
"Quantification of Large Language Model Distillation" is published on arXiv; "KOR-Bench" accepted at ICLR 2025.
2024-10
Our paper "KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks" is published on arXiv.
2023-09
Joined ByteDance SEED as Algorithm Researcher on the Foundation-Evaluation team.

Selected Publications

* equal contribution · corresponding author

ByteDance Seed Team

Introduces Model-in-Model evaluation paradigm: leveraging LLM agents to automatically analyze model capabilities, identify behavioral patterns, and generate structured evaluation reports — enabling scalable, human-out-of-the-loop diagnostics.

Tech Report 2026 Homepage
Z. Wang, C. Yang, Y. Que, Z. Yang, H. Yuan, Y. Wang, Z. Jiang, S. Fang, Z. Wu, Z. Wang, Z. Yao, J. Liu, J. Ren, Y. Li, Y. Yang, J. Liu, J. Yang, Z. Wang, G. Zhang, Zhoufutu Wen, W. Huang

As LLM capability boundaries expand, we examine models' operations-research abilities in everyday life — testing whether they can handle tightly coupled constraints that mirror real-world travel planning.

arXiv 2026 Paper
ByteDance Seed Team

Features a comprehensive Evaluation System covering foundational LLM, multimodal VLM, and agentic capabilities — combining public benchmarks with internal assessments aligned to high-value application patterns (e.g., FinSearchComp, WorldTravel, GUI Agent).

Tech Report 2025 Blog GitHub
ByteDance Seed Team

Introduces Adaptive Chain-of-Thought (AdaCoT) and sparse MoE architecture (230B total / 23B active) with 256K context — dynamically allocating reasoning depth based on query complexity.

Tech Report 2025 Blog
X. Zhao, Zhoufutu Wen, Z. Chen, J. Ding, J. Jiao, S. Li, X. Li, D. Liang, S. Long, Q. Liu, X. Wu, et al.

Revisits the classic task of translation, revealing that even the most advanced LLMs still fall significantly short of human experts at discourse-level translation in professional domains; a long road remains ahead.

arXiv 2025 Paper Project
L. Hu, J. Jiao, J. Liu, Y. Ren, Zhoufutu Wen, K. Zhang, X. Zhang, X. Gao, T. He, F. Hu, Y. Liao, Z. Wang, C. Yang, Q. Yang, et al.

Evaluates models' ability to complete real-world financial analyst workflows; reposted three times by Elon Musk, drawing tens of millions of impressions, and adopted as a capability showcase by multiple leading foundation-model teams in China.

Y. Li*, Q. Gu*, Zhoufutu Wen*, Z. Li, T. Xing, S. Guo, T. Zheng, X. Zhou, X. Qu, W. Zhou, et al.

Explores rollout efficiency in RL — investigating how to scale reinforcement learning for LLMs by bridging the gap between policy optimization quality and inference-time compute.

arXiv 2025 Paper
C. Yang*, Y. Luo*, Zhoufutu Wen, Q. Chu, T. Gong, L. Liu, K. Zhang, J. Jiao, G. Zhang, W. Huang, N. Yu

A closed-loop experiment: discovering problems from practice, distilling them into a benchmark, and using the benchmark to guide model iteration — focusing on multi-turn dialogue robustness under motivation transfer and cross-turn dependency.

EMNLP 2025 Findings Paper Code Project
ByteDance Seed Team

ByteDance's first reasoning-focused model — achieves 86.7 on AIME 2024 and 55.0 on Codeforces via reinforcement learning, with a compact MoE design (20B active / 200B total).

arXiv 2025 Paper
X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, Zhoufutu Wen, et al.

The largest expert-level knowledge benchmark to date — spanning 285 graduate disciplines with 80+ expert annotators and a Human-LLM collaborative filtering mechanism, revealing that even the best models only reach ~62% accuracy.

NeurIPS 2025 Paper Code
J. Shi*, C. Wei*, L. Yang*, Z.M. Wang, C. Yang, G. Zhang, S. Huang, T. Peng, J. Yang, Zhoufutu Wen

Uncovers the gap between fluid intelligence (novel reasoning) and crystallized intelligence (memorized knowledge) in LLMs through cryptographic compositional tasks, with mechanistic interpretability experiments explaining why.

S. Lee, J. Zhou, C. Ao, K. Li, X. Du, S. He, H. Wu, T. Liu, J. Liu, Zhoufutu Wen, et al.

A pioneering work on quantifying LLM distillation: proposes the first systematic framework to measure the degree of distillation across models, revealing widespread model homogenization in the field.

ACL 2025 Main Paper ACL Anthology
K. Ma, X. Du, Y. Wang, H. Zhang, Zhoufutu Wen, X. Qu, J. Yang, J. Liu, M. Liu, X. Yue, et al.

Isolates pure reasoning ability from memorized knowledge by designing knowledge-orthogonal rules — revealing that LLMs struggle significantly with out-of-distribution reasoning independent of pretrained knowledge.

ICLR 2025 Paper Project

Experience

ByteDance SEED

Algorithm Researcher · Foundation-Evaluation
Sep 2023 – Present · Beijing, China
Led the construction and development of the SEED evaluation system covering pre-training and post-training stages. Focus areas include expert-level capability assessment, evaluation data production, and model behavior analysis. Open-sourced benchmarks such as SuperGPQA, KOR-Bench, FinSearchComp, DiscoX, and more.
LLM Evaluation Benchmark Design Model Analysis Expert-level Assessment

Baidu · Ecom

Algorithm Engineer
Jul 2020 – Sep 2023 · Beijing, China
Built multimodal content understanding solutions for commercial monetization. Trained and optimized VLP models (200M–10B parameters) deployed across ad triggering, ranking, creative generation, and user experience. Ran 30+ A/B tests over three years.
Multimodal Models VLP Ad Systems Baidu Highest Award Nominee ×2

Education

University of Electronic Science and Technology of China (UESTC)

M.S. in Computer Science · School of Computer Science
Aug 2017 – Jun 2020
Top 10%

University of Electronic Science and Technology of China (UESTC)

B.E. in Electronic Engineering · School of Physics
Aug 2013 – Jun 2017
Top 5% · National Scholarship