DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Performance Leaderboard

Overall Performance (Total Score + Three Dimensions)

Rank	Models	Overall	Accuracy	Fluency	Appropriateness	Open-source
👑 Expert	Human Expert	80.16	49.80	15.96	14.40	-
🥇 1	GPT-5-high	76.66	48.65	15.21	12.80	×
🥈 2	Gemini-2.5-Pro	71.25	46.68	13.14	11.43	×
🥉 3	Qwen-3-235B	59.66	33.15	14.96	11.55	×
4	Kimi-K2	55.80	27.63	16.44	11.73	×
5	o3-high	55.57	28.78	15.79	11.00	×
6	o4-mini-high	55.09	29.55	14.29	11.25	×
7	Claude-4	54.04	39.38	5.98	8.68	×
8	Claude-4-T	53.53	38.98	5.47	9.08	×
9	Qwen-3-235B-T	49.97	23.20	15.54	11.23	×
10	GPT-4.1	49.65	29.25	11.05	9.35	×
11	DeepSeek-V3	49.60	22.80	16.20	10.60	✓
12	Doubao-1.6-T	49.51	29.30	10.11	10.10	×
13	DeepSeek-R1	46.06	19.75	16.11	10.20	✓
14	Gemini-2.5-Flash-Lite	44.01	26.70	7.91	9.40	×
15	Grok-4	43.82	31.38	4.71	7.73	×
16	GPT-4o	39.93	20.35	11.28	8.30	×
17	Qwen-3-14B	39.36	22.40	7.73	9.23	✓
18	Qwen-3-8B	28.37	15.13	5.84	7.40	✓
-	Youdao-14B	46.37	28.50	9.82	8.05	✓
-	Google-NMT	37.10	18.96	10.12	8.02	-
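
For the rows checked here, the Overall column equals the sum of the three dimension scores. The sketch below makes that arithmetic explicit using three rows copied from the table; the function name is illustrative and not part of any DiscoX tooling.

```python
# Rows copied from the leaderboard: (accuracy, fluency, appropriateness, overall).
ROWS = {
    "GPT-5-high": (48.65, 15.21, 12.80, 76.66),
    "Gemini-2.5-Pro": (46.68, 13.14, 11.43, 71.25),
    "Claude-4": (39.38, 5.98, 8.68, 54.04),
}


def overall_from_dimensions(accuracy, fluency, appropriateness):
    """Sum the three dimension scores, rounded to two decimals like the table."""
    return round(accuracy + fluency + appropriateness, 2)


for model, (acc, flu, app, reported) in ROWS.items():
    computed = overall_from_dimensions(acc, flu, app)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```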

Performance by Translation Direction (zh→en vs en→zh)

Models	zh→en				en→zh				Diff
	Score	Accuracy	Fluency	Appropriateness	Score	Accuracy	Fluency	Appropriateness
GPT-5-high	84.49	52.35	17.24	14.90	68.83	44.95	13.18	10.70	15.66
Gemini-2.5-Pro	80.22	50.25	15.82	14.15	62.26	43.10	10.46	8.70	17.96
Qwen-3-235B	66.15	36.35	16.80	13.00	53.17	29.95	13.12	10.10	12.98
Kimi-K2	64.12	32.90	18.32	12.90	47.46	22.35	14.56	10.55	16.66
o3-high	67.18	36.10	17.98	13.10	43.95	21.45	13.60	8.90	23.23
o4-mini-high	70.34	40.10	15.94	14.30	39.84	19.00	12.64	8.20	30.50
Claude-4	62.44	43.45	7.84	11.15	52.62	35.30	4.12	6.20	9.82
Claude-4-T	62.34	44.15	6.94	11.25	44.70	33.80	4.00	6.90	17.64
Qwen-3-235B-T	58.12	28.45	15.92	13.75	41.81	17.95	15.16	8.70	16.31
GPT-4.1	65.82	39.65	13.62	12.55	33.48	18.85	8.48	6.15	32.34
DeepSeek-V3	66.97	36.55	17.62	12.80	32.23	9.05	14.78	8.40	34.74
Doubao-1.6-T	53.13	33.65	9.18	10.30	45.89	24.95	11.04	9.90	7.24
DeepSeek-R1	58.12	28.60	16.72	12.80	34.00	10.90	15.50	7.60	24.12
Gemini-2.5-Flash-Lite	62.51	38.75	11.26	12.50	25.51	14.65	4.56	6.30	37.00
Grok-4	59.29	40.70	7.04	11.55	28.33	22.05	2.38	3.90	30.96
GPT-4o	58.13	30.95	15.88	11.30	21.73	9.75	6.68	5.30	36.40
Qwen-3-14B	47.20	26.80	9.00	11.40	31.51	18.00	6.46	7.05	15.69
Google-NMT	46.49	25.51	11.90	9.08	27.80	12.47	8.36	6.97	18.69
Qwen-3-8B	32.95	18.70	6.20	8.05	23.78	11.55	5.48	6.75	9.17
Average	61.37	36.00	13.22	12.15	39.94	22.11	9.71	7.75	21.43
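
The Diff column appears to be the zh→en score minus the en→zh score, so a larger value means a stronger bias toward Chinese-to-English. A minimal sketch of that subtraction, using rows copied from the table above (the helper name is an assumption for illustration):

```python
# (zh->en score, en->zh score, reported Diff), copied from the table above.
DIRECTION_ROWS = {
    "GPT-5-high": (84.49, 68.83, 15.66),
    "DeepSeek-V3": (66.97, 32.23, 34.74),
    "Average": (61.37, 39.94, 21.43),
}


def direction_gap(zh_en_score, en_zh_score):
    """Gap between the two directions, rounded to two decimals."""
    return round(zh_en_score - en_zh_score, 2)


for name, (zh_en, en_zh, reported) in DIRECTION_ROWS.items():
    print(f"{name}: gap {direction_gap(zh_en, en_zh):.2f} (reported {reported:.2f})")
```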

Performance by Domain (Academic vs Non-Academic)

Model Name	Academic Papers						Non-Academic Tasks
	Overall		Humanities	Social Sciences	Applied Disciplines	Natural Sciences	Overall		Domain-Specific	Literature & Arts	News & Information
	Rank	Score	Score	Score	Score	Score	Rank	Score	Score	Score	Score
GPT-5-high	1	77.07	71.93	78.79	84.25	75.23	1	76.03	76.00	68.29	78.97
Gemini-2.5-Pro	2	72.32	67.39	75.63	76.35	70.37	2	69.58	69.32	64.07	71.86
Qwen-3-235B	3	63.58	58.86	66.18	69.70	61.03	5	53.66	51.00	43.00	59.70
Kimi-K2	4	57.88	54.86	61.11	63.15	53.80	7	52.58	47.82	55.57	55.05
o3-high	5	56.02	52.11	64.29	63.70	45.80	3	54.86	50.00	49.00	60.76
o4-mini-high	6	55.36	53.50	58.45	64.25	48.40	4	54.68	47.43	40.86	65.41
Claude-4	8	54.57	49.14	55.61	63.80	52.51	6	53.20	49.71	35.00	62.73
Claude-4-T	7	55.30	47.50	57.68	62.95	54.57	8	50.80	48.43	33.36	59.19
Qwen-3-235B-T	11	51.34	49.18	49.74	49.10	56.09	10	47.86	44.86	41.29	52.62
GPT-4.1	9	52.07	52.18	52.84	57.40	48.11	12	45.94	37.82	33.50	56.78
DeepSeek-V3	12	50.58	43.89	51.97	58.80	49.71	9	48.10	42.32	39.14	55.86
Doubao-1.6-T	10	51.67	50.43	54.39	51.40	49.86	11	46.20	46.93	32.21	50.95
Youdao-14B	13	49.98	45.12	47.00	62.50	46.79	15	40.72	41.58	28.50	44.79
DeepSeek-R1	14	48.31	50.04	50.68	55.95	39.97	14	42.62	42.71	28.43	47.92
Gemini-2.5-Flash-Lite	15	46.24	39.18	49.58	58.50	41.26	16	40.59	31.43	21.93	54.59
Grok-4	16	44.12	34.57	47.21	54.20	42.66	13	43.33	35.89	34.29	52.38
GPT-4o	18	39.60	33.25	43.97	52.60	32.51	17	40.43	32.11	33.07	49.51
Qwen-3-14B	17	42.68	32.68	43.97	54.70	42.40	19	34.27	28.86	22.64	42.76
Google-NMT	19	38.86	33.18	41.35	47.00	36.03	18	34.42	30.50	15.50	44.83
Qwen-3-8B	20	32.79	22.96	31.66	46.65	33.94	20	21.59	18.64	10.57	28.00

Introduction

In the early AI era, driven by the rise of NLP, translation became a widely adopted application. With the advent of large models, translating short texts in conventional domains is no longer a challenge, and their performance has surpassed that of traditional machine translation.

But does this capability remain robust when dealing with long texts of 1,500 words or more? DiscoX Bench focuses on 200 Chinese and English texts, each over 1,500 words, covering genres such as news, academic papers, and literary works.

It evaluates the performance of large models in discourse-level translation from three dimensions: Accuracy, Fluency, and Appropriateness, providing guidance for model iteration.

DiscoX Benchmark

Comprises 200 discourse-level and expert-level texts in Chinese and English
Features texts that are each over 1,500 words or characters
Covers diverse genres, including news, academic papers, and literary works

Metric-S

Provides evaluation across three dimensions: Accuracy, Fluency, and Appropriateness
Calculates scores from the number and severity of errors
Offers strong explainability

Data Description & Distribution

DiscoX Bench invited more than 130 vertical-domain experts from various fields (including professionals with more than 3 years of experience and master's/doctoral students from world-class universities) to construct the evaluation dataset. The texts in the dataset are required to be over 1,500 words and are sourced from real-world industry and academic scenarios. They are logically coherent and of high quality, designed to challenge the upper limits of current large models' translation capabilities.

Primary Category	Secondary Category	Count
Academic Papers	Social Science Papers	38
	Natural Science Papers	35
	Humanities Papers	28
	Applied Science Papers	20
Non-academic Tasks	News and Information	37
	Domain-Specific Scenarios	28
	Literature and Arts	14
Total		200

Data Construction Process

A "one round of design, two rounds of quality assurance" process is adopted: vertical-domain experts designed tasks around terminology or words prone to mistranslation, and a case was validated only if two models randomly selected from a pool of five both failed the translation. This rigorous process filtered 665 initial tasks down to 200 finalized cases.
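
The validation rule can be sketched as follows. This is a toy illustration under stated assumptions: the model names, the `fails` callback (standing in for the judgment of whether a translation failed), and the failure table are all hypothetical, and the real process relied on expert review rather than a lookup.

```python
import random


def is_valid_case(task, model_pool, fails, rng, sample_size=2):
    """Return True only if every randomly sampled model fails the task."""
    sampled = rng.sample(model_pool, sample_size)
    return all(fails(model, task) for model in sampled)


# Toy data: hypothetical model names and a hard-coded failure table.
MODEL_POOL = ["model_a", "model_b", "model_c", "model_d", "model_e"]
FAILURES = {
    "task_1": set(MODEL_POOL),  # every model fails this task
    "task_2": set(),            # no model fails this task
}


def toy_fails(model, task):
    """Hypothetical stand-in for the expert judgment of a failed translation."""
    return model in FAILURES[task]


rng = random.Random(0)
kept = [task for task in FAILURES if is_valid_case(task, MODEL_POOL, toy_fails, rng)]
print(kept)  # ['task_1'] whichever two models are sampled
```

Sampling two of five models instead of testing all five keeps the check cheap, at the cost of occasionally admitting a task that some unsampled model could have handled.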

Evaluation Process by Metric-S

To precisely evaluate the capabilities of AI translation, we have designed an automated scoring framework: Metric-S. It systematically assesses a translation through the following three core steps:

Setup

This initial stage acts as a pre-filter. An "instruction-following" judge first checks whether the candidate model has complied with the basic translation instructions. Only outputs that pass this check proceed to the detailed quality evaluation; non-compliant outputs receive a score of 0.

Step 1

Metric-S comprehensively assesses the translated text on accuracy (e.g., mistranslations, omissions), fluency (e.g., grammar), and appropriateness (e.g., style). This step aims to generate a complete "diagnostic output" that identifies all potential issues.

Step 2

The process moves to a De-duplication phase to refine the scoring mechanism. Its core task is to identify and merge duplicate error records, ensuring that each unique translation error is counted only once. This de-duplication process lays the groundwork for fair and accurate scoring.
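
One simple way to picture de-duplication is to merge records that point at the same span within the same dimension. The sketch below does this with an exact key match and keeps the most severe record. The record fields, the severity labels, and the matching rule are all assumptions made for illustration; the source does not specify how Metric-S detects duplicates.

```python
from dataclasses import dataclass

# Hypothetical severity labels, ordered from least to most severe.
SEVERITY_ORDER = {"minor": 0, "major": 1, "critical": 2}


@dataclass(frozen=True)
class ErrorRecord:
    dimension: str  # "accuracy", "fluency", or "appropriateness"
    span: str       # the text span the error refers to
    severity: str   # one of the keys in SEVERITY_ORDER


def deduplicate(records):
    """Merge records sharing (dimension, normalized span); keep the most severe."""
    merged = {}
    for record in records:
        key = (record.dimension, record.span.strip().lower())
        kept = merged.get(key)
        if kept is None or SEVERITY_ORDER[record.severity] > SEVERITY_ORDER[kept.severity]:
            merged[key] = record
    return list(merged.values())


raw = [
    ErrorRecord("accuracy", "tomato genome", "minor"),
    ErrorRecord("accuracy", "Tomato genome ", "major"),  # same span, reported twice
    ErrorRecord("fluency", "sentence 3", "minor"),
]
unique = deduplicate(raw)
print(len(unique))  # 2
```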

Step 3

The system applies precise scoring to convert the verified error list into a final score. Points are deducted from a perfect score of 100 based on the type and severity of each error. The result is a single, comprehensive score that quantifies the model's overall translation quality.
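
A deduction-based score of this kind can be sketched as below. The penalty values, the severity labels, and the clamp at zero are assumptions chosen for illustration; the source states only that points are deducted from 100 by error type and severity, and that outputs failing the Setup check receive 0.

```python
# Hypothetical penalty table: (dimension, severity) -> points deducted.
PENALTIES = {
    ("accuracy", "minor"): 2.0,
    ("accuracy", "major"): 10.0,
    ("fluency", "minor"): 1.0,
    ("fluency", "major"): 5.0,
    ("appropriateness", "minor"): 1.0,
    ("appropriateness", "major"): 5.0,
}


def score_translation(errors, followed_instructions=True):
    """Deduct penalties from 100; outputs that fail the pre-filter score 0."""
    if not followed_instructions:
        return 0.0
    total_penalty = sum(PENALTIES[(dimension, severity)] for dimension, severity in errors)
    # Clamping at zero is an assumption, not something the source specifies.
    return max(0.0, 100.0 - total_penalty)


errors = [("accuracy", "major"), ("fluency", "minor")]
print(score_translation(errors))                               # 89.0
print(score_translation(errors, followed_instructions=False))  # 0.0
```

Because every deduction traces back to a listed error, the final number can be explained by reading off the error list.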

Key Findings

1 Beyond a Single Score: Analyzing LLM Performance Across Dimensions

The models' performance is not balanced across the three dimensions (Accuracy, Fluency, and Appropriateness). For example, Claude-4 performs well on Accuracy but poorly on Fluency: its translations are semantically correct but not smooth or natural. Conversely, DeepSeek-V3 performs well on Fluency but poorly on Accuracy.

Prompt

News and Information zh-en

"全世界一共研究出13个番茄种的基因组，我们现在掌握11个。"新疆农业科学院副院长...

2 Asymmetry in Translation Directionality: LLMs Are Better at Chinese-to-English than English-to-Chinese

The results reveal a significant performance disparity: models perform considerably worse when translating from English to Chinese than in the reverse direction (an average score of 39.94 for en→zh versus 61.37 for zh→en). For instance, even the top-ranked GPT-5-high shows a noticeable quality gap between its English-to-Chinese output and the translations produced by professional human experts.

Prompt

Humanities en-zh

Machiavelli generally distrusted citizens, stating that "...in time of adversity, when the state is in need of it's citizens...

3 The Performance Gap: Thinking LLMs Lag Behind Non-thinking LLMs

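The study finds that thinking models generally underperform their non-thinking counterparts on translation tasks. For example, on the overall leaderboard Qwen-3-235B-T scores 49.97 against 59.66 for Qwen-3-235B. Thinking models are more prone to omitting information or producing summarized translations, which leads to information loss.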

Prompt

Natural Science en-zh

Incidents may be diagnosed and resolved by people in many different groups, depending on the complexity of the issue or the incident type. All of these groups need to understand the incident management process, and how their contribution to this helps to manage...

Metric-S Fidelity Validation

Our proposed evaluation metric, Metric-S, achieves 70.30% average consistency with human expert judgments, compared with 34.70% for XCOMET-QE. Ablation studies further indicate that this performance stems from the multi-dimensional, multi-judge collaborative framework: every simplification tested lowers system-level consistency from 90% to between 20% and 70%.

Metric	Avg.	System Level		Segment Level
		zh→en	en→zh	zh→en	en→zh
Metric-S	70.30%	80.00%	90.00%	54.80%	56.40%
XCOMET-QE	34.70%	10.00%	70.00%	26.40%	32.40%
ChrF	–	–	–	–	–

Pairwise consistency of Metric-S and XCOMET-QE with human judgments. ChrF is excluded because it requires a reference.
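
Pairwise consistency can be read as the share of item pairs on which the metric and the human judges prefer the same item. The sketch below implements that reading on toy data; the exact definition used for the table, including how ties are handled, is not given in the source, so this is an assumption. The final lines check that the Avg. value for Metric-S is the mean of its four reported cells.

```python
from itertools import combinations


def pairwise_consistency(metric_scores, human_scores):
    """Fraction of pairs ranked the same way by the metric and by humans."""
    agree = 0
    total = 0
    for a, b in combinations(sorted(metric_scores), 2):
        human_diff = human_scores[a] - human_scores[b]
        metric_diff = metric_scores[a] - metric_scores[b]
        if human_diff == 0:
            continue  # assumption: pairs tied in the human judgment are skipped
        total += 1
        if metric_diff * human_diff > 0:
            agree += 1
    return agree / total if total else 0.0


# Toy scores for three hypothetical systems.
metric = {"A": 80.0, "B": 70.0, "C": 60.0}
human = {"A": 3.0, "B": 1.0, "C": 2.0}
consistency = pairwise_consistency(metric, human)
print(f"{consistency:.2%}")

# Mean of the four Metric-S cells reported in the table.
metric_s_cells = [80.00, 90.00, 54.80, 56.40]
average = round(sum(metric_s_cells) / len(metric_s_cells), 2)
print(average)
```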

Settings	System Level Consistency
Metric-S (Original)	90%
Metric-S (DS-R1 as judge)	70%
Single Dimension (accuracy)	70%
Single LLM (Detailed prompt)	60%
Single LLM (simple prompt)	20%

Alignment with human judgments in different settings of ablation studies.

Thinking vs. Non-thinking Case Scores (Academic Papers, en-zh)

Response	Score	Accuracy	Fluency	Appropriateness
Non-thinking	71	45	6	20
Thinking	14	0	4	10