Algorithm Researcher @ ByteDance SEED · Foundation-Evaluation
I am an Algorithm Researcher at ByteDance SEED on the Foundation-Evaluation team. I build automated evaluation systems spanning the pre-training and post-training stages of LLMs, with a focus on expert-level capability assessment, model behavior analysis, and multimodal understanding.
As LLMs' capability boundaries expand while compute and human resources remain limited, evaluation must answer two fundamental questions: (1) Where to invest — defining evaluation directions anchored in realistic, economically valuable tasks that reflect how models are actually used; and (2) How to iterate efficiently — building evaluation systems with high data quality, strong discriminative resolution, and comprehensive coverage, so that promising training strategies are never missed. I pursue both on ByteDance SEED's Foundation-Evaluation team (some of this work is reflected in the Seed1.8 Model Card and Seed 2.0).
Specific topics include:
Introduces the Model-in-Model evaluation paradigm: leveraging LLM agents to automatically analyze model capabilities, identify behavioral patterns, and generate structured evaluation reports — enabling scalable, human-out-of-the-loop diagnostics.
As LLM capability boundaries expand, we examine models' operations-research abilities in everyday scenarios — testing whether they can handle tightly coupled constraints that mirror real-world travel planning.
Features a comprehensive Evaluation System covering foundational LLM, multimodal VLM, and agentic capabilities — combining public benchmarks with internal assessments aligned to high-value application patterns (e.g., FinSearchComp, WorldTravel, GUI Agent).
Introduces Adaptive Chain-of-Thought (AdaCoT) and a sparse MoE architecture (230B total / 23B active) with a 256K context window — dynamically allocating reasoning depth based on query complexity.
Targets the classic task of translation, revealing that even the most advanced LLMs still fall significantly short of human experts at discourse-level translation in professional domains — there is a long road ahead.
Evaluates models' ability to complete real-world financial analyst workflows through 635 expert-crafted questions spanning global and Greater China markets — widely adopted as a standard capability assessment by leading foundation model teams.
Explores rollout efficiency in RL — investigating how to scale reinforcement learning for LLMs by bridging the gap between policy optimization quality and inference-time compute.
A closed-loop experiment: discovering problems in practice, distilling them into a benchmark, and using that benchmark to guide model iteration — focusing on multi-turn dialogue robustness under motivation shifts and cross-turn dependencies.
ByteDance's first reasoning-focused model — achieves 86.7 on AIME 2024 and 55.0 on Codeforces via reinforcement learning, with a compact MoE design (20B active / 200B total).
The largest expert-level knowledge benchmark to date — spanning 285 graduate disciplines with 80+ expert annotators and a Human-LLM collaborative filtering mechanism, revealing that even the best models only reach ~62% accuracy.
Uncovers the gap between fluid intelligence (novel reasoning) and crystallized intelligence (memorized knowledge) in LLMs through cryptographic compositional tasks, with mechanistic interpretability experiments explaining why.
A pioneering work that opens up the quantification of LLM distillation — proposing the first systematic framework to measure the degree of distillation across models and revealing widespread model homogenization in the field.
Isolates pure reasoning ability from memorized knowledge by designing knowledge-orthogonal rules — revealing that LLMs struggle significantly with out-of-distribution reasoning independent of pretrained knowledge.