DiscoX: Benchmarking Discourse-Level Translation Tasks in Expert Domains

DiscoX Bench focuses on 200 Chinese and English texts, each over 1,500 words, covering genres such as news, academic papers, and literary works. It evaluates large models on discourse-level translation across three dimensions: Accuracy, Fluency, and Appropriateness, providing guidance for model iteration.

Model Performance Overview

Performance Leaderboard

Overall Performance (Total Score + Three Dimensions)

Rank       Models                  Overall   Accuracy   Fluency   Appropriateness   Open-source
👑 Expert  Human Expert            80.16     49.80      15.96     14.40             -
🥇 1       GPT-5-high              76.66     48.65      15.21     12.80             ×
🥈 2       Gemini-2.5-Pro          71.25     46.68      13.14     11.43             ×
🥉 3       Qwen-3-235B             59.66     33.15      14.96     11.55             ×
4          Kimi-K2                 55.80     27.63      16.44     11.73             ×
5          o3-high                 55.57     28.78      15.79     11.00             ×
6          o4-mini-high            55.09     29.55      14.29     11.25             ×
7          Claude-4                54.04     39.38      5.98      8.68              ×
8          Claude-4-T              53.53     38.98      5.47      9.08              ×
9          Qwen-3-235B-T           49.97     23.20      15.54     11.23             ×
10         GPT-4.1                 49.65     29.25      11.05     9.35              ×
11         DeepSeek-V3             49.60     22.80      16.20     10.60
12         Doubao-1.6-T            49.51     29.30      10.11     10.10             ×
13         DeepSeek-R1             46.06     19.75      16.11     10.20
14         Gemini-2.5-Flash-Lite   44.01     26.70      7.91      9.40              ×
15         Grok-4                  43.82     31.38      4.71      7.73              ×
16         GPT-4o                  39.93     20.35      11.28     8.30              ×
17         Qwen-3-14B              39.36     22.40      7.73      9.23
18         Qwen-3-8B               28.37     15.13      5.84      7.40
-          Youdao-14B              46.37     28.50      9.82      8.05
-          Google-NMT              37.10     18.96      10.12     8.02              -

Performance by Translation Direction (zh→en vs en→zh)

Models                   zh→en                                         en→zh                                         Diff
                         Score   Accuracy  Fluency  Appropriateness    Score   Accuracy  Fluency  Appropriateness
GPT-5-high               84.49   52.35     17.24    14.90              68.83   44.95     13.18    10.70              15.66
Gemini-2.5-Pro           80.22   50.25     15.82    14.15              62.26   43.10     10.46    8.70               17.96
Qwen-3-235B              66.15   36.35     16.80    13.00              53.17   29.95     13.12    10.10              12.98
Kimi-K2                  64.12   32.90     18.32    12.90              47.46   22.35     14.56    10.55              16.66
o3-high                  67.18   36.10     17.98    13.10              43.95   21.45     13.60    8.90               23.23
o4-mini-high             70.34   40.10     15.94    14.30              39.84   19.00     12.64    8.20               30.50
Claude-4                 62.44   43.45     7.84     11.15              52.62   35.30     4.12     6.20               9.82
Claude-4-T               62.34   44.15     6.94     11.25              44.70   33.80     4.00     6.90               17.64
Qwen-3-235B-T            58.12   28.45     15.92    13.75              41.81   17.95     15.16    8.70               16.31
GPT-4.1                  65.82   39.65     13.62    12.55              33.48   18.85     8.48     6.15               32.34
DeepSeek-V3              66.97   36.55     17.62    12.80              32.23   9.05      14.78    8.40               34.74
Doubao-1.6-T             53.13   33.65     9.18     10.30              45.89   24.95     11.04    9.90               7.24
DeepSeek-R1              58.12   28.60     16.72    12.80              34.00   10.90     15.50    7.60               24.12
Gemini-2.5-Flash-Lite    62.51   38.75     11.26    12.50              25.51   14.65     4.56     6.30               37.00
Grok-4                   59.29   40.70     7.04     11.55              28.33   22.05     2.38     3.90               30.96
GPT-4o                   58.13   30.95     15.88    11.30              21.73   9.75      6.68     5.30               36.40
Qwen-3-14B               47.20   26.80     9.00     11.40              31.51   18.00     6.46     7.05               15.69
Google-NMT               46.49   25.51     11.90    9.08               27.80   12.47     8.36     6.97               18.69
Qwen-3-8B                32.95   18.70     6.20     8.05               23.78   11.55     5.48     6.75               9.17
Average                  61.37   36.00     13.22    12.15              39.94   22.11     9.71     7.75               21.43

Performance by Domain (Academic vs Non-Academic)

Model Name               Academic Papers                                                                      Non-Academic Tasks
                         Rank  Overall  Humanities  Social Sciences  Applied Disciplines  Natural Sciences    Rank  Overall  Domain-Specific  Literature & Arts  News & Information
GPT-5-high               1     77.07    71.93       78.79            84.25                75.23               1     76.03    76.00            68.29              78.97
Gemini-2.5-Pro           2     72.32    67.39       75.63            76.35                70.37               2     69.58    69.32            64.07              71.86
Qwen-3-235B              3     63.58    58.86       66.18            69.70                61.03               5     53.66    51.00            43.00              59.70
Kimi-K2                  4     57.88    54.86       61.11            63.15                53.80               7     52.58    47.82            55.57              55.05
o3-high                  5     56.02    52.11       64.29            63.70                45.80               3     54.86    50.00            49.00              60.76
o4-mini-high             6     55.36    53.50       58.45            64.25                48.40               4     54.68    47.43            40.86              65.41
Claude-4                 8     54.57    49.14       55.61            63.80                52.51               6     53.20    49.71            35.00              62.73
Claude-4-T               7     55.30    47.50       57.68            62.95                54.57               8     50.80    48.43            33.36              59.19
Qwen-3-235B-T            11    51.34    49.18       49.74            49.10                56.09               10    47.86    44.86            41.29              52.62
GPT-4.1                  9     52.07    52.18       52.84            57.40                48.11               12    45.94    37.82            33.50              56.78
DeepSeek-V3              12    50.58    43.89       51.97            58.80                49.71               9     48.10    42.32            39.14              55.86
Doubao-1.6-T             10    51.67    50.43       54.39            51.40                49.86               11    46.20    46.93            32.21              50.95
Youdao-14B               13    49.98    45.12       47.00            62.50                46.79               15    40.72    41.58            28.50              44.79
DeepSeek-R1              14    48.31    50.04       50.68            55.95                39.97               14    42.62    42.71            28.43              47.92
Gemini-2.5-Flash-Lite    15    46.24    39.18       49.58            58.50                41.26               16    40.59    31.43            21.93              54.59
Grok-4                   16    44.12    34.57       47.21            54.20                42.66               13    43.33    35.89            34.29              52.38
GPT-4o                   18    39.60    33.25       43.97            52.60                32.51               17    40.43    32.11            33.07              49.51
Qwen-3-14B               17    42.68    32.68       43.97            54.70                42.40               19    34.27    28.86            22.64              42.76
Google-NMT               19    38.86    33.18       41.35            47.00                36.03               18    34.42    30.50            15.50              44.83
Qwen-3-8B                20    32.79    22.96       31.66            46.65                33.94               20    21.59    18.64            10.57              28.00

Introduction

In the early AI era, driven by the rise of NLP, translation became a widely adopted application. With the advent of large models, translating short texts in conventional domains is no longer a challenge for them, and their performance has surpassed that of traditional machine translation.

But does this capability remain robust when dealing with long texts of 1,500 words or more? DiscoX Bench focuses on 200 Chinese and English texts, each over 1,500 words, covering genres such as news, academic papers, and literary works.

It evaluates the performance of large models in discourse-level translation across three dimensions: Accuracy, Fluency, and Appropriateness, providing guidance for model iteration.

Overview

DiscoX Benchmark

  • Comprises 200 discourse-level and expert-level texts in Chinese and English
  • Features texts that are each over 1,500 words or characters
  • Covers diverse genres, including news, academic papers, and literary works

Metric-S

  • Provides evaluation across three dimensions: Accuracy, Fluency, and Appropriateness
  • Scores translations based on the number and severity of errors
  • Offers strong explainability

Data Description & Distribution

DiscoX Bench invited over 130 vertical-domain experts from various fields, including practitioners with more than 3 years of professional experience and master's/doctoral students from world-class universities, to construct the evaluation dataset. The texts in the dataset are required to be over 1,500 words and are sourced from real-world industry and academic scenarios. They are logically coherent and of high quality, designed to challenge the upper limits of current large models' translation capabilities.

Primary Category Secondary Category Count
Academic Papers Social Science Papers 38
Natural Science Papers 35
Humanities Papers 28
Applied Science Papers 20
Non-academic Tasks News and Information 37
Domain-Specific Scenarios 28
Literature and Arts 14
Total 200
Data Distribution Chart

Data Construction Process

A "one round of Design, two rounds of Quality Assurance" is adopted: Vertical Domain experts designed tasks based on terminology or mistranslated words, and a case was validated only if two randomly selected models (from a pool of five) both failed the translation. This rigorous process filtered 665 initial tasks down to 200 finalized cases.

Data Construction Pipeline

Evaluation Process by Metric-S

To precisely evaluate the capabilities of AI translation, we have designed an automated scoring framework: Metric-S. It systematically assesses a translation through the following three core steps:

Metric-S Evaluation Pipeline

Setup

This initial stage acts as a pre-filter. An "instruction-following" judge first checks whether the candidate model has complied with the basic translation instructions. Only outputs that pass this fundamental check proceed to the detailed quality evaluation; non-compliant outputs receive a score of 0.
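
A minimal sketch of this gate is shown below. In Metric-S the instruction-following judge is itself an LLM; the heuristic here is a hypothetical stand-in used purely for illustration.

    import re

    def passes_prefilter(candidate: str, target_lang: str) -> bool:
        """Reject obviously non-compliant outputs: empty text, or text that is
        clearly not in the requested target language (crude heuristic)."""
        if not candidate.strip():
            return False
        cjk = len(re.findall(r"[\u4e00-\u9fff]", candidate)) / len(candidate)
        return cjk > 0.3 if target_lang == "zh" else cjk < 0.3

    # Non-compliant outputs are assigned a score of 0 and skip Steps 1-3.
    score = None if passes_prefilter("An English draft ...", "zh") else 0.0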

Step 1

Metric-S comprehensively assesses the translated text for accuracy (e.g., mistranslations, omissions), fluency (e.g., grammar), and appropriateness (e.g., style). This initial step aims to generate a complete "diagnostic output" that identifies all potential issues.
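
The diagnostic output can be pictured as a list of structured error records. The shape below is illustrative only; the exact schema and severity levels used by Metric-S are assumptions.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ErrorRecord:
        dimension: str     # "accuracy", "fluency", or "appropriateness"
        severity: str      # e.g. "minor", "major", "critical" (illustrative levels)
        span: str          # the offending fragment of the candidate translation
        explanation: str   # why the judge flagged it

    diagnostic_output = [
        ErrorRecord("accuracy", "major", "...", "mistranslation of a domain term"),
        ErrorRecord("fluency", "minor", "...", "grammatical but unnatural phrasing"),
    ]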

Step 2

The process then moves to a de-duplication phase to refine the scoring mechanism. Its core task is to identify and merge duplicate error records, ensuring that each unique translation error is counted only once. This de-duplication lays the groundwork for fair and accurate scoring.
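
A minimal sketch of the merging step, assuming duplicates are detected by matching the flagged span and dimension and that the highest severity is kept; the real criterion may be more sophisticated.

    SEVERITY_RANK = {"minor": 1, "major": 2, "critical": 3}

    def deduplicate(errors: list[dict]) -> list[dict]:
        """Merge error records that point at the same span and dimension,
        keeping the most severe one, so each unique error is counted once."""
        merged = {}
        for err in errors:
            key = (err["span"], err["dimension"])
            kept = merged.get(key)
            if kept is None or SEVERITY_RANK[err["severity"]] > SEVERITY_RANK[kept["severity"]]:
                merged[key] = err
        return list(merged.values())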

Step 3

The system applies precise scoring to convert the verified error list into a final score. Points are deducted from a perfect score of 100 based on the type and severity of each error. The result is a single, comprehensive score that quantifies the model's overall translation quality.
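
A minimal scoring sketch consistent with this description; the per-severity deduction weights below are invented for illustration and are not the values used by Metric-S.

    DEDUCTION = {"minor": 1.0, "major": 3.0, "critical": 5.0}   # illustrative weights

    def metric_s_score(verified_errors: list[dict]) -> float:
        """Deduct points from a perfect score of 100 according to the severity
        of each verified (de-duplicated) error; floor the result at 0."""
        total = sum(DEDUCTION[e["severity"]] for e in verified_errors)
        return max(100.0 - total, 0.0)

    # Two major errors and one minor error -> 100 - (3 + 3 + 1) = 93
    print(metric_s_score([{"severity": "major"}, {"severity": "major"}, {"severity": "minor"}]))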

Key Findings

1 Beyond a Single Score: Analyzing LLM Performance Across Dimensions

The models' performance is not balanced across the three dimensions (Accuracy, Fluency, and Appropriateness). For example, Claude-4 performs well on Accuracy but poorly on Fluency: its translations are semantically correct but not smooth or natural. Conversely, DeepSeek-V3 performs well on Fluency but poorly on Accuracy.

Prompt (News and Information, zh→en)

"全世界一共研究出13个番茄种的基因组,我们现在掌握11个。"新疆农业科学院副院长...
(English gloss: "Worldwide, the genomes of 13 tomato species have been worked out in total; we now have 11 of them." ... the vice president of the Xinjiang Academy of Agricultural Sciences ...)

Compared responses: GPT-5-high, Claude-4.0, DeepSeek-V3

2 Asymmetry in Translation Directionality: LLMs Are Better at Chinese-to-English than English-to-Chinese

The results reveal a significant performance disparity: models perform considerably worse when translating from English to Chinese than in the reverse direction. For instance, even the top-ranked GPT-5-high shows a noticeable quality gap between its English-to-Chinese output and the translations produced by professional human experts.

Prompt (Humanities, en→zh)

Machiavelli generally distrusted citizens, stating that "...in time of adversity, when the state is in need of its citizens...

Candidate response: GPT-5-high

Judge output: excerpt from the evaluation analysis identifying translation issues (accuracy and fluency problems).

Reference (human translator): professional human translation serving as the reference standard.

3 The Performance Gap: Thinking LLMs Lag Behind Non-thinking LLMs

The study finds that thinking models generally underperform their non-thinking counterparts on translation tasks. Thinking models are more prone to omitting information or producing summarized translations, which may lead to information loss.

Prompt (Natural Science, en→zh)

Incidents may be diagnosed and resolved by people in many different groups, depending on the complexity of the issue or the incident type. All of these groups need to understand the incident management process, and how their contribution to this helps to manage...

Compared responses: non-thinking model output vs. thinking model output

Metric-S Fidelity Validation

Our proposed new evaluation metric, Metric-S, achieves 70.30% consistency with human expert judgments, significantly outperforming existing methods. Further ablation studies demonstrate that its superior performance stems from our meticulously designed multi-dimensional, multi-judge collaborative framework, as any simplification of it leads to a drop in evaluation accuracy.

Metric       Avg.      System Level         Segment Level
                       zh→en     en→zh      zh→en     en→zh
Metric-S     70.30%    80.00%    90.00%     54.80%    56.40%
XCOMET-QE    34.70%    10.00%    70.00%     26.40%    32.40%
Pairwise consistency of Metric-S and XCOMET-QE with human judgments. ChrF is excluded because it requires a reference.
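
Pairwise consistency can be read as the fraction of pairs whose ordering under the metric matches their ordering under human judgment. Below is a sketch of the system-level version under that assumed definition; the exact protocol in the paper may differ.

    from itertools import combinations

    def pairwise_consistency(metric_scores: dict, human_scores: dict) -> float:
        """Fraction of system pairs ranked in the same order by the metric
        and by human judges (assumed definition, for illustration)."""
        pairs = list(combinations(metric_scores, 2))
        agree = sum(
            (metric_scores[a] - metric_scores[b]) * (human_scores[a] - human_scores[b]) > 0
            for a, b in pairs
        )
        return agree / len(pairs)

    # Hypothetical example with three systems:
    print(pairwise_consistency(
        {"sys1": 76.7, "sys2": 71.3, "sys3": 55.8},
        {"sys1": 80.0, "sys2": 70.0, "sys3": 60.0},
    ))  # 1.0 -> all three pairs ordered consistently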
Settings                       System Level Consistency
Metric-S (Original)            90%
Metric-S (DS-R1 as judge)      70%
Single Dimension (accuracy)    70%
Single LLM (detailed prompt)   60%
Single LLM (simple prompt)     20%
Alignment with human judgments in different settings of ablation studies.