MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

MME-CC is a language-independent benchmark of vision-based cognitive tasks that systematically evaluates MLLMs across spatial, geometric, and visual-knowledge reasoning with a clear taxonomy and strictly curated, high-quality questions.

MME-CC Task Taxonomy

Performance Leaderboard

Overall Performance

Rank    Models                                     Reasoning Overall   SR      GR      VKR
👑      Human                                      95.86               95.83   95.83   95.92
🥇 1    Gemini-2.5-Pro                             42.66               23.80   29.56   74.63
🥈 2    GPT-5 (high)                               40.25               30.63   23.64   66.47
🥉 3    Doubao-Seed-1.6-vision-0815 (Think)        40.08               22.03   31.50   66.70
4       Gemini-2.5-Flash                           37.57               18.60   21.21   72.90
5       o4-mini (high)                             35.00               25.00   21.96   58.03
6       GPT-4.1                                    32.14               27.90   12.22   56.30
7       GLM-4.5V                                   30.45               13.27   13.34   64.73
8       GPT-4o-1120                                26.88               22.60   10.12   47.93
9       Doubao-Seed-1.6-vision-0815 (Nonthink)     25.96               23.63   23.82   30.43
10      Qwen2.5-VL-72B-Instruct                    23.59               12.47   8.96    49.33
11      MiMo-VL-7B-RL                              20.90               9.30    11.10   42.30
12      GLM-4.1V-9B-Thinking                       19.30               8.73    9.22    39.93
13      Qwen2.5-VL-32B-Instruct                    14.39               9.03    8.56    25.57
14      InternVL3-8B                               11.36               7.30    2.38    24.40
15      Keye-VL-8B-Preview                         9.65                7.70    2.04    19.20
16      Qwen2.5-VL-7B-Instruct                     7.50                4.70    3.22    14.57

SR = Spatial Reasoning, GR = Geometric Reasoning, VKR = Visual Knowledge Reasoning.
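
A detail worth noting when reading the leaderboard: the Reasoning Overall column is consistent with an unweighted mean of the three dimension scores. The short sketch below reproduces that aggregation for the top three models; the averaging rule is inferred from the published numbers rather than taken from any official evaluation code.

```python
# Minimal sketch: reproduce "Reasoning Overall" as the unweighted mean of the
# three dimension scores (SR, GR, VKR). The averaging rule is inferred from the
# leaderboard numbers above, not taken from official benchmark code.
scores = {
    "Gemini-2.5-Pro":                      {"SR": 23.80, "GR": 29.56, "VKR": 74.63},
    "GPT-5 (high)":                        {"SR": 30.63, "GR": 23.64, "VKR": 66.47},
    "Doubao-Seed-1.6-vision-0815 (Think)": {"SR": 22.03, "GR": 31.50, "VKR": 66.70},
}

for model, dims in scores.items():
    overall = sum(dims.values()) / len(dims)  # simple mean across the three dimensions
    print(f"{model}: {overall:.2f}")          # 42.66, 40.25, 40.08 — matches the table
```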

Abstract

As multi-modal AI systems become increasingly sophisticated, evaluating their cognitive capabilities has become crucial for understanding their true potential and limitations. Traditional benchmarks often focus on narrow tasks, but real-world applications require complex reasoning across multiple modalities.

MME-CC addresses this gap by providing a challenging multi-modal evaluation benchmark that tests cognitive capacity across 11 diverse tasks with 1,173 samples, spanning three critical dimensions: Spatial Reasoning (SR), Geometric Reasoning (GR), and Visual Knowledge Reasoning (VKR).

This benchmark evaluates how well multi-modal models can perform complex cognitive tasks that require deep understanding and reasoning across visual and textual information, providing insights into their true cognitive capabilities.

Data Description & Distribution

MME-CC is a comprehensive multi-modal evaluation benchmark designed to assess cognitive capacity across diverse reasoning tasks. The benchmark comprises 11 carefully designed tasks with 1,173 samples, spanning three critical dimensions: Spatial Reasoning (SR), Geometric Reasoning (GR), and Visual Knowledge Reasoning (VKR). Each task is crafted to challenge multi-modal models' ability to understand and reason across visual and textual information, providing insights into their cognitive capabilities. A short programmatic summary of these splits follows the per-dimension breakdown below.

Spatial Reasoning

3 Tasks · 319 Samples

Avg Input Tokens: 8,198

Avg Output Tokens: 4,076

Geometric Reasoning

5 Tasks · 605 Samples

Avg Input Tokens: 1,549

Avg Output Tokens: 6,204

Visual Knowledge Reasoning

3 Tasks · 249 Samples

Avg Input Tokens: 2,751

Avg Output Tokens: 1,329
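
The split statistics above can be summarized programmatically. The sketch below is illustrative only: the dictionary structure and field names are assumptions, while the task and sample counts are those reported above and sum to the benchmark totals of 11 tasks and 1,173 samples.

```python
# Illustrative summary of the MME-CC split statistics reported above.
# The dictionary structure and field names are assumptions for this sketch;
# the counts themselves come from the benchmark description.
splits = {
    "Spatial Reasoning (SR)":           {"tasks": 3, "samples": 319},
    "Geometric Reasoning (GR)":         {"tasks": 5, "samples": 605},
    "Visual Knowledge Reasoning (VKR)": {"tasks": 3, "samples": 249},
}

total_tasks = sum(s["tasks"] for s in splits.values())      # 11
total_samples = sum(s["samples"] for s in splits.values())  # 1,173
print(f"{total_tasks} tasks, {total_samples:,} samples")
```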

Data Pipeline

MME-CC follows a systematic data construction pipeline to ensure high-quality multi-modal cognitive evaluation tasks. Each task is carefully designed to assess specific cognitive abilities across visual and textual modalities, with rigorous validation processes to maintain benchmark quality and reliability.

MME-CC Data Pipeline

Key Findings

1 Spatial Reasoning Challenges Multi-Modal Models Most

Notably, GPT-5 (high) achieves the highest Spatial Reasoning score (30.63), which appears attributable to its strength on sub-tasks that require complex spatial orientation and object counting (e.g., Indoor Directional Reasoning and Indoor Deduplication Counting). Gemini-2.5-Pro, in contrast, shows a clear advantage in Visual Knowledge Reasoning, with a leading score of 74.63. In Geometric Reasoning, the advantage of Doubao-Seed-1.6-vision-0815 (Think) appears localized to the Jigsaw Puzzle task, whereas Gemini-2.5-Pro is more robust across the remaining geometric reasoning sub-tasks.

Model                                   Reasoning Overall   SR      GR      VKR
Human* (n=99, sampled)                  95.86               95.83   95.83   95.92
Closed-Source Models
Gemini-2.5-Pro                          42.66               23.80   29.56   74.63
GPT-5 (high)                            40.25               30.63   23.64   66.47
Doubao-Seed-1.6-vision-0815 (Think)     40.08               22.03   31.50   66.70

* Human scores were obtained from 5 students who did not participate in question setting; all held advanced degrees.

2 MLLMs remain far from "thinking like humans"

Analysis of chain-of-thought behavior reveals that MLLMs follow a three-stage reasoning process: problem understanding, core analysis, and conclusion formation. While models repeatedly revisit visual information throughout reasoning, excessive verification often reduces efficiency. On spatial and geometric tasks, most models achieve only 20%-30% accuracy, with maze tasks showing < 2% success rates. Adding textual guidance ("describe the image first") yields consistent improvements, suggesting that initial visual descriptions stabilize subsequent reasoning by anchoring perception.
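
As a concrete illustration of the "describe the image first" guidance, the sketch below prepends an explicit description pass before the reasoning pass. The helper call_mllm is a hypothetical stand-in for whatever multi-modal model client is in use; the benchmark does not prescribe this implementation.

```python
# Minimal sketch of the "describe the image first" prompting strategy discussed
# above: the model is asked for a grounding description before answering.
# `call_mllm` is a hypothetical stand-in for your own multi-modal model API.

def call_mllm(image_path: str, prompt: str) -> str:
    """Hypothetical helper: send one image plus a text prompt to an MLLM."""
    raise NotImplementedError("wire this to your own model client")

def answer_with_description_first(image_path: str, question: str) -> str:
    # Stage 1: anchor perception with an explicit description of the image.
    description = call_mllm(
        image_path,
        "Describe the image in detail before doing anything else.",
    )
    # Stage 2: reason over the question with the description as added context.
    return call_mllm(
        image_path,
        f"Image description (from a first pass):\n{description}\n\n"
        f"Now answer the question step by step:\n{question}",
    )
```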

Chain-of-Thought Analysis

3 MLLMs exhibit recurring failures in orientation judgment, entity consistency, and instruction following

Analysis of failure cases reveals three recurring error patterns: (1) Orientation judgment errors where models fail to preserve object orientation across views and struggle with reference-frame alignment, particularly in indoor layout tasks; (2) Entity identity inconsistency in multi-view settings, leading to double counting or omission of the same entities; and (3) Over-reliance on literal descriptions where models prioritize visual content over task-specific counterfactual constraints in instructions, producing answers that conflict with required reasoning.

Orientation Judgment Error
Entity Identity Inconsistency

Task Descriptions

Satellite Image Matching

This task evaluates the model's ability to match satellite images based on spatial features and geographical landmarks. Models need to identify corresponding locations across different satellite views.

Key Challenges:

  • Scale and perspective variations
  • Seasonal and temporal changes
  • Identifying unique geographical features

Indoor Directional Reasoning

This task tests spatial reasoning abilities in indoor environments, requiring models to understand directional relationships and navigate through interior spaces.

Key Challenges:

  • Understanding relative positioning
  • Interpreting floor plans and layouts
  • Maintaining spatial orientation

Indoor Deduplication Counting

This task evaluates the model's ability to accurately count objects in indoor scenes while avoiding double-counting the same items viewed from different angles.

Key Challenges:

  • Multi-view object recognition
  • Avoiding duplicate counting
  • Understanding occlusion and partial visibility

Gomoku Variation

A strategic board game task that tests logical reasoning and pattern recognition abilities in a modified Gomoku setting; a minimal line-detection sketch follows the list of challenges below.

Key Challenges:

  • Strategic planning and foresight
  • Pattern recognition on game boards
  • Understanding game rules and constraints
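
For concreteness, the snippet below checks the classic five-in-a-row winning condition on a small board. The rule changes introduced by the "variation" are not specified here, so the classic rule and a simple list-of-strings board encoding are assumptions made for this sketch.

```python
# Illustrative five-in-a-row check (classic Gomoku rule). The benchmark's
# "variation" rules are not specified here; the board encoding ('.', 'X', 'O')
# is an assumption for this sketch.
def has_five_in_a_row(board: list[str], player: str) -> bool:
    rows, cols = len(board), len(board[0])
    directions = ((0, 1), (1, 0), (1, 1), (1, -1))  # horizontal, vertical, two diagonals
    for r in range(rows):
        for c in range(cols):
            if board[r][c] != player:
                continue
            for dr, dc in directions:
                cells = [(r + i * dr, c + i * dc) for i in range(5)]
                if all(
                    0 <= rr < rows and 0 <= cc < cols and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False

board = [
    "X....",
    ".X...",
    "..X..",
    "...X.",
    "....X",
]
print(has_five_in_a_row(board, "X"))  # True: five on the main diagonal
```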

Unblock Me

A puzzle-solving task where models must find the optimal sequence of moves to free a target block from a constrained grid.

Key Challenges:

  • Spatial reasoning and planning
  • Understanding movement constraints
  • Finding optimal solution paths

Maze

Navigation tasks requiring models to find paths through complex maze structures, testing spatial reasoning and pathfinding abilities; a minimal reference pathfinder is sketched after the list of challenges below.

Key Challenges:

  • Path planning and navigation
  • Understanding spatial constraints
  • Avoiding dead ends and loops
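
As a point of reference for the pathfinding this task demands, the sketch below solves a grid maze with breadth-first search. The grid encoding ('#' for walls, '.' for open cells) is an assumption for illustration and is not the benchmark's actual input format.

```python
# Reference breadth-first search over a grid maze, for illustration only.
# Encoding assumption: '#' = wall, '.' = open cell; start and goal are (row, col).
from collections import deque

def shortest_path(grid: list[str], start: tuple[int, int], goal: tuple[int, int]):
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path  # list of cells from start to goal
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # no path: the maze is unsolvable from start to goal

maze = [
    "..#.",
    ".##.",
    "....",
]
print(shortest_path(maze, (0, 0), (0, 3)))
```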

Jigsaw Puzzle

Visual puzzle tasks that require models to understand how pieces fit together to form complete images.

Key Challenges:

  • Shape and edge matching
  • Visual pattern completion
  • Spatial relationship understanding

Chart Modification

Tasks involving understanding and modifying various types of charts and graphs, testing data interpretation skills.

Key Challenges:

  • Data visualization comprehension
  • Chart type recognition
  • Quantitative reasoning

Sandbagging

Tasks designed to detect when models intentionally underperform or hide their true capabilities.

Key Challenges:

  • Detecting deceptive behavior
  • Understanding strategic underperformance
  • Evaluating genuine vs. artificial limitations

Counterfactual Instruction

Tasks that test models' ability to reason about hypothetical scenarios and alternative outcomes.

Key Challenges:

  • Hypothetical reasoning
  • Understanding alternative scenarios
  • Logical consistency in counterfactuals

Finding Wrong Answer

Tasks that evaluate models' ability to identify incorrect information and reasoning errors.

Key Challenges:

  • Error detection and analysis
  • Critical thinking skills
  • Logical reasoning validation