Algorithm Researcher @ ByteDance SEED · Foundation-Evaluation
I am an Algorithm Researcher at ByteDance SEED on the Foundation-Evaluation team. I build automated evaluation systems across the pre-training and post-training stages of LLMs, focusing on expert-level capability assessment, model behavior analysis, and multimodal understanding. My work has contributed to the Seed 1.6 / 1.8 / 2.0 releases. The benchmarks I co-led (SuperGPQA, FinSearchComp, DiscoX, XpertBench) have appeared at NeurIPS / ICLR / ACL, been cited by Nature, and been adopted as standard agentic tests in leading model releases, including MiniMax-M2 and Moonshot's Kimi K2.5; FinSearchComp was also quote-tweeted by Elon Musk, reaching tens of millions of impressions.
My thesis: Evals are the new PRD — what we measure is what we build.
As LLMs' capability boundaries expand while compute and human resources remain limited, evaluation must answer two fundamental questions: (1) Where to invest: defining evaluation directions anchored in realistic, economically valuable tasks that reflect how models are actually used; and (2) How to iterate efficiently: building evaluation systems with high data quality, strong discriminative resolution, and comprehensive coverage, so that good training strategies are never missed. I work on these questions on ByteDance SEED's Foundation-Evaluation team (some of these insights are reflected in the Seed 1.8 model card and in Seed 2.0).
Specific topics include:
A high-fidelity benchmark of 1,346 expert-curated tasks across 7 professional domains, evaluated with 15–40 weighted rubric checkpoints per task and our ShotJudge paradigm — exposing a pronounced "expert-gap" where even frontier models peak at only ~66% success.
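As an illustration of the weighted-checkpoint idea, here is a minimal sketch of rubric aggregation; the checkpoint fields, weights, and example content are assumptions for exposition, not the actual ShotJudge implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str  # what the judge verifies in the model's response
    weight: float     # relative importance within the task's rubric
    passed: bool      # verdict from an LLM judge or expert reviewer

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of rubric checkpoints satisfied, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total if total > 0 else 0.0

# Three of a task's 15-40 checkpoints, shown with invented content
task = [
    Checkpoint("cites the governing regulation", 3.0, True),
    Checkpoint("derives the correct liability figure", 5.0, False),
    Checkpoint("flags the disclosure deadline", 2.0, True),
]
print(f"task score: {rubric_score(task):.2f}")  # -> task score: 0.50
```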
Extends our distillation-quantification line from knowledge to agentic tool-use behaviors — introducing Response Pattern Similarity (RPS) and Action Graph Similarity (AGS) to reveal structural homogenization among 18 models across 8 providers on τ-Bench / τ²-Bench.
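The exact RPS/AGS definitions live in the paper; as a rough intuition, an action-graph similarity can be sketched as Jaccard overlap over directed tool-call transitions. Everything below (the edge construction, the example traces) is an illustrative assumption, not the paper's formulation.

```python
def action_graph_similarity(trace_a: list[str], trace_b: list[str]) -> float:
    """Jaccard overlap of directed tool-call transitions; an illustrative
    stand-in for AGS, not the paper's exact definition."""
    edges_a = set(zip(trace_a, trace_a[1:]))  # consecutive tool-call pairs
    edges_b = set(zip(trace_b, trace_b[1:]))
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Two hypothetical models on the same tau-Bench-style airline task
m1 = ["search_flights", "get_user", "book_flight", "send_receipt"]
m2 = ["search_flights", "get_user", "book_flight", "notify_user"]
print(action_graph_similarity(m1, m2))  # -> 0.5; high values suggest homogenization
```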
Introduces the Model-in-Model evaluation paradigm: leveraging LLM agents to automatically analyze model capabilities, identify behavioral patterns, and generate structured evaluation reports, enabling scalable, human-out-of-the-loop diagnostics.
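A minimal sketch of the loop, assuming two hypothetical callables: `target(prompt)` for the model under test and `analyst(prompt)` for the evaluator agent.

```python
def model_in_model_report(target, analyst, probes: list[str]) -> str:
    """Run probe prompts through the target model, then ask an analyst
    agent to distill the transcripts into a structured capability report."""
    transcripts = [(p, target(p)) for p in probes]
    evidence = "\n\n".join(f"PROMPT: {p}\nRESPONSE: {r}" for p, r in transcripts)
    return analyst(
        "You are an evaluation analyst. From the transcripts below, identify "
        "recurring behavioral patterns (refusal style, reasoning shortcuts, "
        "systematic errors) and write a structured report, one finding per line.\n\n"
        + evidence
    )
```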
Examines models' operations-research abilities in everyday life as capability boundaries expand, testing whether they can handle tightly coupled constraints that mirror real-world travel planning.
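To make "tightly coupled constraints" concrete, here is a toy validator over a hypothetical itinerary schema; the real benchmark's task format (opening hours, inter-city transit, ticket inventories) is far richer.

```python
def constraint_violations(legs: list[dict], budget: float) -> list[str]:
    """Check a day plan for overlapping legs and budget overrun.
    Times are "HH:MM" strings, which compare correctly lexicographically."""
    errors = []
    for prev, cur in zip(legs, legs[1:]):
        if cur["depart"] < prev["arrive"]:  # legs must not overlap in time
            errors.append(f"{cur['name']} starts before {prev['name']} ends")
    if sum(leg["cost"] for leg in legs) > budget:
        errors.append("total cost exceeds budget")
    return errors

legs = [
    {"name": "Great Wall tour", "depart": "09:00", "arrive": "13:00", "cost": 60},
    {"name": "hutong walk",     "depart": "12:30", "arrive": "14:00", "cost": 0},
]
print(constraint_violations(legs, budget=50.0))
# -> ['hutong walk starts before Great Wall tour ends', 'total cost exceeds budget']
```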
Features a comprehensive Evaluation System covering foundational LLM, multimodal VLM, and agentic capabilities — combining public benchmarks with internal assessments aligned to high-value application patterns (e.g., FinSearchComp, WorldTravel, GUI Agent).
Introduces Adaptive Chain-of-Thought (AdaCoT) and a sparse MoE architecture (230B total / 23B active parameters) with a 256K context window, dynamically allocating reasoning depth based on query complexity.
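In the spirit of AdaCoT, dynamic depth allocation can be sketched as a complexity-conditioned token budget; the estimator and budget tiers below are invented for illustration, not Seed 1.6's actual router.

```python
def reasoning_budget(query: str, estimate_complexity) -> int:
    """Map an estimated complexity score in [0, 1] to a chain-of-thought
    token budget; trivial queries skip extended reasoning entirely."""
    c = estimate_complexity(query)  # e.g., a small learned classifier
    if c < 0.2:
        return 0        # answer directly, no visible reasoning
    if c < 0.6:
        return 1_024    # short scratchpad
    return 16_384       # full deliberate reasoning within the 256K context
```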
Targets the classic task of translation, revealing that even the most advanced LLMs still fall significantly short of human experts at discourse-level translation in professional domains: a long road ahead.
Evaluates models' ability to complete real-world financial-analyst workflows through 635 expert-crafted questions spanning global and Greater China markets. It has been adopted as a standard capability test in leading model releases: a core agentic benchmark in the MiniMax-M2 model card and one of six core agentic benchmarks in the Moonshot Kimi K2.5 tech report. Grok 4 tops the global leaderboard at 68.9% (vs. a 75% human-expert baseline), and the result was quote-tweeted by Elon Musk, reaching tens of millions of impressions.
Explores rollout efficiency in RL — investigating how to scale reinforcement learning for LLMs by bridging the gap between policy optimization quality and inference-time compute.
A closed-loop experiment: discovering problems from practice, distilling them into a benchmark, and using the benchmark to guide model iteration — focusing on multi-turn dialogue robustness under motivation transfer and cross-turn dependency.
ByteDance's first reasoning-focused model — achieves 86.7 on AIME 2024 and 55.0 on Codeforces via reinforcement learning, with a compact MoE design (20B active / 200B total).
The largest expert-level knowledge benchmark to date — spanning 285 graduate disciplines with 80+ expert annotators and a Human-LLM collaborative filtering mechanism, revealing that even the best models only reach ~62% accuracy.
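As a rough picture of human-LLM collaborative filtering, one plausible escalation rule is sketched below; the solver pool, thresholds, and escalation logic are assumptions, not SuperGPQA's exact pipeline.

```python
def needs_expert_review(question: dict, solvers: list, attempts: int = 4) -> bool:
    """Flag candidate questions whose LLM solve rate looks suspicious:
    near-perfect (likely trivial or leaked) or zero (possibly flawed)."""
    answers = [s(question["prompt"]) for s in solvers for _ in range(attempts)]
    hit_rate = sum(a == question["gold"] for a in answers) / len(answers)
    return hit_rate > 0.9 or hit_rate == 0.0  # escalate to expert annotators
```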
Uncovers the gap between fluid intelligence (novel reasoning) and crystallized intelligence (memorized knowledge) in LLMs through cryptographic compositional tasks, with mechanistic interpretability experiments explaining why.
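A toy generator in this spirit composes two primitive transforms so that success requires executing novel rules rather than retrieving memorized facts; the rule set and prompt format are illustrative, not the benchmark's.

```python
def make_cipher_task(plaintext: str, shift: int) -> tuple[str, str]:
    """Compose a Caesar shift with a string reversal into one decoding task."""
    caesar = "".join(
        chr((ord(ch) - 97 + shift) % 26 + 97) if ch.islower() else ch
        for ch in plaintext
    )
    ciphertext = caesar[::-1]  # second transform: reverse the shifted string
    prompt = (f"Rule 1: shift each lowercase letter forward by {shift}. "
              f"Rule 2: reverse the result. Decode: {ciphertext}")
    return prompt, plaintext  # (question, gold answer)

print(make_cipher_task("fluid intelligence", shift=3))
```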
A pioneering work that opens up the study of LLM distillation quantification — proposing the first systematic framework to measure distillation degree across models, revealing widespread model homogenization in the field.
Isolates pure reasoning ability from memorized knowledge by designing knowledge-orthogonal rules — revealing that LLMs struggle significantly with out-of-distribution reasoning independent of pretrained knowledge.