MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

MME-CC is a language-independent benchmark of vision-based cognitive tasks that systematically evaluates MLLMs across spatial, geometric, and visual-knowledge reasoning with a clear taxonomy and strictly curated, high-quality questions.

MME-CC Task Taxonomy

Performance Leaderboard

Overall Performance

Rank    Models                                     Reasoning Overall   SR      GR      VKR
👑      Human                                      95.86               95.83   95.83   95.92
🥇 1    Gemini-2.5-Pro                             42.66               23.80   29.56   74.63
🥈 2    GPT-5 (high)                               40.25               30.63   23.64   66.47
🥉 3    Doubao-Seed-1.6-vision-0815 (Think)        40.08               22.03   31.50   66.70
4       Gemini-2.5-Flash                           37.57               18.60   21.21   72.90
5       o4-mini (high)                             35.00               25.00   21.96   58.03
6       GPT-4.1                                    32.14               27.90   12.22   56.30
7       GLM-4.5V                                   30.45               13.27   13.34   64.73
8       GPT-4o-1120                                26.88               22.60   10.12   47.93
9       Doubao-Seed-1.6-vision-0815 (Nonthink)     25.96               23.63   23.82   30.43
10      Qwen2.5-VL-72B-Instruct                    23.59               12.47   8.96    49.33
11      MiMo-VL-7B-RL                              20.90               9.30    11.10   42.30
12      GLM-4.1V-9B-Thinking                       19.30               8.73    9.22    39.93
13      Qwen2.5-VL-32B-Instruct                    14.39               9.03    8.56    25.57
14      InternVL3-8B                               11.36               7.30    2.38    24.40
15      Keye-VL-8B-Preview                         9.65                7.70    2.04    19.20
16      Qwen2.5-VL-7B-Instruct                     7.50                4.70    3.22    14.57

SR = Spatial Reasoning, GR = Geometric Reasoning, VKR = Visual Knowledge Reasoning.
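
A detail worth noting when reading the leaderboard: the Reasoning Overall column is consistent with an unweighted mean of the three dimension scores. The short sketch below reproduces that aggregation for the top three models; the averaging rule is inferred from the published numbers rather than taken from any official evaluation code.

```python
# Minimal sketch: reproduce "Reasoning Overall" as the unweighted mean of the
# three dimension scores (SR, GR, VKR). The averaging rule is inferred from the
# leaderboard numbers above, not taken from official benchmark code.
scores = {
    "Gemini-2.5-Pro":                      {"SR": 23.80, "GR": 29.56, "VKR": 74.63},
    "GPT-5 (high)":                        {"SR": 30.63, "GR": 23.64, "VKR": 66.47},
    "Doubao-Seed-1.6-vision-0815 (Think)": {"SR": 22.03, "GR": 31.50, "VKR": 66.70},
}

for model, dims in scores.items():
    overall = sum(dims.values()) / len(dims)  # simple mean across the three dimensions
    print(f"{model}: {overall:.2f}")          # 42.66, 40.25, 40.08 — matches the table
```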

Abstract

As multi-modal AI systems become increasingly sophisticated, evaluating their cognitive capabilities has become crucial for understanding their true potential and limitations. Traditional benchmarks often focus on narrow tasks, but real-world applications require complex reasoning across multiple modalities.

MME-CC addresses this gap by providing a challenging multi-modal evaluation benchmark that tests cognitive capacity across 11 diverse tasks with 1,173 samples, spanning three critical dimensions: Spatial Reasoning (SR), Geometric Reasoning (GR), and Visual Knowledge Reasoning (VKR).

This benchmark evaluates how well multi-modal models can perform complex cognitive tasks that require deep understanding and reasoning across visual and textual information, providing insights into their true cognitive capabilities.

Data Description & Distribution

MME-CC is a comprehensive multi-modal evaluation benchmark designed to assess cognitive capacity across diverse reasoning tasks. The benchmark comprises 11 carefully designed tasks with 1,173 samples, spanning three critical dimensions: Spatial Reasoning (SR), Geometric Reasoning (GR), and Visual Knowledge Reasoning (VKR). Each task is crafted to challenge multi-modal models' ability to understand and reason across visual and textual information, providing insights into their cognitive capabilities. A short programmatic summary of these splits follows the per-dimension breakdown below.

Spatial Reasoning

3 Tasks · 319 Samples

Avg Input Tokens: 8,198

Avg Output Tokens: 4,076

Geometric Reasoning

5 Tasks · 605 Samples

Avg Input Tokens: 1,549

Avg Output Tokens: 6,204

Visual Knowledge Reasoning

3 Tasks · 249 Samples

Avg Input Tokens: 2,751

Avg Output Tokens: 1,329
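
The split statistics above can be summarized programmatically. The sketch below is illustrative only: the dictionary structure and field names are assumptions, while the task and sample counts are those reported above and sum to the benchmark totals of 11 tasks and 1,173 samples.

```python
# Illustrative summary of the MME-CC split statistics reported above.
# The dictionary structure and field names are assumptions for this sketch;
# the counts themselves come from the benchmark description.
splits = {
    "Spatial Reasoning (SR)":           {"tasks": 3, "samples": 319},
    "Geometric Reasoning (GR)":         {"tasks": 5, "samples": 605},
    "Visual Knowledge Reasoning (VKR)": {"tasks": 3, "samples": 249},
}

total_tasks = sum(s["tasks"] for s in splits.values())      # 11
total_samples = sum(s["samples"] for s in splits.values())  # 1,173
print(f"{total_tasks} tasks, {total_samples:,} samples")
```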

Data Pipeline

MME-CC follows a systematic data construction pipeline to ensure high-quality multi-modal cognitive evaluation tasks. Each task is carefully designed to assess specific cognitive abilities across visual and textual modalities, with rigorous validation processes to maintain benchmark quality and reliability.

MME-CC Data Pipeline

Key Findings

1 Spatial Reasoning Challenges Multi-Modal Models Most

Notably, GPT-5 (high) achieves the highest Spatial Reasoning score (30.63), which appears attributable to its strength on sub-tasks that require complex spatial orientation and object counting (e.g., Indoor Directional Reasoning and Indoor Deduplication Counting). Gemini-2.5-Pro, in contrast, shows a clear advantage in Visual Knowledge Reasoning, with a leading score of 74.63. In Geometric Reasoning, the advantage of Doubao-Seed-1.6-vision-0815 (Think) appears localized to the Jigsaw Puzzle task, whereas Gemini-2.5-Pro is more robust across the remaining geometric reasoning sub-tasks.

Model                                   Reasoning Overall   SR      GR      VKR
Human* (n=99, sampled)                  95.86               95.83   95.83   95.92
Closed-Source Models
Gemini-2.5-Pro                          42.66               23.80   29.56   74.63
GPT-5 (high)                            40.25               30.63   23.64   66.47
Doubao-Seed-1.6-vision-0815 (Think)     40.08               22.03   31.50   66.70

* Human scores were obtained from 5 students who did not participate in question setting; all held advanced degrees.

2 MLLMs remain far from "thinking like humans"

Analysis of chain-of-thought behavior reveals that MLLMs follow a three-stage reasoning process: problem understanding, core analysis, and conclusion formation. While models repeatedly revisit visual information throughout reasoning, excessive verification often reduces efficiency. On spatial and geometric tasks, most models achieve only 20%-30% accuracy, with maze tasks showing < 2% success rates. Adding textual guidance ("describe the image first") yields consistent improvements, suggesting that initial visual descriptions stabilize subsequent reasoning by anchoring perception.
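
As a concrete illustration of the "describe the image first" guidance, the sketch below prepends an explicit description pass before the reasoning pass. The helper call_mllm is a hypothetical stand-in for whatever multi-modal model client is in use; the benchmark does not prescribe this implementation.

```python
# Minimal sketch of the "describe the image first" prompting strategy discussed
# above: the model is asked for a grounding description before answering.
# `call_mllm` is a hypothetical stand-in for your own multi-modal model API.

def call_mllm(image_path: str, prompt: str) -> str:
    """Hypothetical helper: send one image plus a text prompt to an MLLM."""
    raise NotImplementedError("wire this to your own model client")

def answer_with_description_first(image_path: str, question: str) -> str:
    # Stage 1: anchor perception with an explicit description of the image.
    description = call_mllm(
        image_path,
        "Describe the image in detail before doing anything else.",
    )
    # Stage 2: reason over the question with the description as added context.
    return call_mllm(
        image_path,
        f"Image description (from a first pass):\n{description}\n\n"
        f"Now answer the question step by step:\n{question}",
    )
```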

Chain-of-Thought Analysis

3 MLLMs exhibit recurring failures in orientation judgment, entity consistency, and instruction following

Analysis of failure cases reveals three recurring error patterns: (1) Orientation judgment errors where models fail to preserve object orientation across views and struggle with reference-frame alignment, particularly in indoor layout tasks; (2) Entity identity inconsistency in multi-view settings, leading to double counting or omission of the same entities; and (3) Over-reliance on literal descriptions where models prioritize visual content over task-specific counterfactual constraints in instructions, producing answers that conflict with required reasoning.

Orientation Judgment Error
Entity Identity Inconsistency

Task Descriptions

Satellite Image Matching

This task evaluates the model's ability to match satellite images based on spatial features and geographical landmarks. Models need to identify corresponding locations across different satellite views.

Key Challenges:

  • Scale and perspective variations
  • Seasonal and temporal changes
  • Identifying unique geographical features

Indoor Directional Reasoning

This task tests spatial reasoning abilities in indoor environments, requiring models to understand directional relationships and navigate through interior spaces.

Key Challenges:

  • Understanding relative positioning
  • Interpreting floor plans and layouts
  • Maintaining spatial orientation

Indoor Deduplication Counting

This task evaluates the model's ability to accurately count objects in indoor scenes while avoiding double-counting the same items viewed from different angles.

Key Challenges:

  • Multi-view object recognition
  • Avoiding duplicate counting
  • Understanding occlusion and partial visibility

Gomoku Variation

A strategic board game task that tests logical reasoning and pattern recognition abilities in a modified Gomoku setting; a minimal line-detection sketch follows the list of challenges below.

Key Challenges:

  • Strategic planning and foresight
  • Pattern recognition on game boards
  • Understanding game rules and constraints
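
For concreteness, the snippet below checks the classic five-in-a-row winning condition on a small board. The rule changes introduced by the "variation" are not specified here, so the classic rule and a simple list-of-strings board encoding are assumptions made for this sketch.

```python
# Illustrative five-in-a-row check (classic Gomoku rule). The benchmark's
# "variation" rules are not specified here; the board encoding ('.', 'X', 'O')
# is an assumption for this sketch.
def has_five_in_a_row(board: list[str], player: str) -> bool:
    rows, cols = len(board), len(board[0])
    directions = ((0, 1), (1, 0), (1, 1), (1, -1))  # horizontal, vertical, two diagonals
    for r in range(rows):
        for c in range(cols):
            if board[r][c] != player:
                continue
            for dr, dc in directions:
                cells = [(r + i * dr, c + i * dc) for i in range(5)]
                if all(
                    0 <= rr < rows and 0 <= cc < cols and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False

board = [
    "X....",
    ".X...",
    "..X..",
    "...X.",
    "....X",
]
print(has_five_in_a_row(board, "X"))  # True: five on the main diagonal
```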

Unblock Me

A puzzle-solving task where models must find the optimal sequence of moves to free a target block from a constrained grid.

Key Challenges:

  • Spatial reasoning and planning
  • Understanding movement constraints
  • Finding optimal solution paths

Maze

Navigation tasks requiring models to find paths through complex maze structures, testing spatial reasoning and pathfinding abilities; a minimal reference pathfinder is sketched after the list of challenges below.

Key Challenges:

  • Path planning and navigation
  • Understanding spatial constraints
  • Avoiding dead ends and loops
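
As a point of reference for the pathfinding this task demands, the sketch below solves a grid maze with breadth-first search. The grid encoding ('#' for walls, '.' for open cells) is an assumption for illustration and is not the benchmark's actual input format.

```python
# Reference breadth-first search over a grid maze, for illustration only.
# Encoding assumption: '#' = wall, '.' = open cell; start and goal are (row, col).
from collections import deque

def shortest_path(grid: list[str], start: tuple[int, int], goal: tuple[int, int]):
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path  # list of cells from start to goal
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # no path: the maze is unsolvable from start to goal

maze = [
    "..#.",
    ".##.",
    "....",
]
print(shortest_path(maze, (0, 0), (0, 3)))
```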

Jigsaw Puzzle

Visual puzzle tasks that require models to understand how pieces fit together to form complete images.

Key Challenges:

  • Shape and edge matching
  • Visual pattern completion
  • Spatial relationship understanding

Chart Modification

Tasks involving understanding and modifying various types of charts and graphs, testing data interpretation skills.

Key Challenges:

  • Data visualization comprehension
  • Chart type recognition
  • Quantitative reasoning

Sandbagging

Tasks designed to detect when models intentionally underperform or hide their true capabilities.

Key Challenges:

  • Detecting deceptive behavior
  • Understanding strategic underperformance
  • Evaluating genuine vs. artificial limitations

Counterfactual Instruction

Tasks that test models' ability to reason about hypothetical scenarios and alternative outcomes.

Key Challenges:

  • Hypothetical reasoning
  • Understanding alternative scenarios
  • Logical consistency in counterfactuals

Finding Wrong Answer

Tasks that evaluate models' ability to identify incorrect information and reasoning errors.

Key Challenges:

  • Error detection and analysis
  • Critical thinking skills
  • Logical reasoning validation