FinSearchComp Benchmark

Realistic decision-making tasks require three core skills: finding the right signals, checking and reconciling sources, and turning them into grounded judgments under time pressure. We provide foundational, end-to-end evaluation infrastructure: an open finance benchmark with tasks for time-sensitive fetching, historical lookup, and multi-source investigation that measures these skills directly.

Model Performance Overview

Performance Leaderboard

Global Data Performance

| Rank | Model | Avg. | Time-Sensitive | Simple Historical | Complex Historical |
|------|-------|------|----------------|-------------------|--------------------|
| 🥇 1 | Grok 4 (web) | 68.9% | 87.3% | 68.1% | 51.2% |
| 🥈 2 | GPT-5-Thinking (web) | 63.9% | 76.9% | 67.2% | 47.6% |
| 🥉 3 | Gemini 2.5 pro (web) | 42.6% | 56.0% | 44.5% | 27.4% |
| 4 | DouBao (web) | 39.1% | 61.2% | 33.6% | 22.6% |
| 5 | Qwen3-235B-A22B-2507 (web) | 37.4% | 60.2% | 37.8% | 14.3% |
| 6 | YuanBao (DeepSeek V3) (web) | 30.5% | 53.0% | 24.4% | 14.3% |
| 7 | YuanBao (HunYuan-T1-Thinking) (web) | 29.8% | 59.0% | 18.5% | 11.9% |
| 7 | YuanBao (DeepSeek R1) (web) | 29.8% | 53.7% | 22.7% | 13.1% |
| 7 | DouBao-Thinking (web) | 29.8% | 34.3% | 33.6% | 21.4% |
| 10 | Kimi k2 (web) | 29.5% | 30.6% | 47.1% | 10.7% |
| 11 | DeepSeek R1 (web) | 17.2% | 17.9% | 19.3% | 14.3% |
| 12 | ERNIE X1 (web) | 16.6% | 23.9% | 15.1% | 10.7% |

Greater China Data Performance

| Rank | Model | Avg. | Time-Sensitive | Simple Historical | Complex Historical |
|------|-------|------|----------------|-------------------|--------------------|
| 🥇 1 | DouBao (web) | 54.2% | 88.3% | 63.0% | 11.4% |
| 🥈 2 | YuanBao (DeepSeek R1) (web) | 52.5% | 84.7% | 58.0% | 14.8% |
| 🥉 3 | Grok 4 (web) | 51.9% | 64.9% | 67.0% | 23.9% |
| 4 | YuanBao (HunYuan-T1-Thinking) (web) | 50.5% | 82.0% | 58.0% | 11.5% |
| 5 | DouBao-Thinking (web) | 49.0% | 62.2% | 61.0% | 23.9% |
| 6 | YuanBao (DeepSeek V3) (web) | 48.8% | 81.1% | 55.0% | 10.2% |
| 7 | GPT-5-Thinking (web) | 46.4% | 60.4% | 63.0% | 15.9% |
| 8 | ERNIE X1 (web) | 40.8% | 62.2% | 49.0% | 11.4% |
| 9 | DeepSeek R1 (web) | 40.5% | 56.8% | 51.0% | 13.6% |
| 10 | Kimi k2 (web) | 38.3% | 35.1% | 73.0% | 6.8% |
| 11 | Gemini 2.5 pro (web) | 36.8% | 51.9% | 46.0% | 12.5% |
| 12 | Qwen3-235B-A22B-2507 (web) | 21.9% | 18.1% | 42.0% | 5.7% |

Introduction

Open-domain financial search is a foundational skill for financial professionals, serving as the critical first step in creating analytical reports, building valuation models, and making investment decisions.

This complex task demands a range of skills, including: 1) retrieving time-sensitive data, 2) finding figures buried in unstructured reports, and 3) performing multi-step retrievals with calculation or comparison.

All of this must be done while correctly navigating complex financial conventions (e.g., distinguishing nominal GDP from real GDP).

Before an LLM-based agent can perform sophisticated analysis, it must first prove that it can find and process data with high precision and reliability, as that data is the raw material for further analysis.

To address this, we introduce FinSearchComp, the first open-source benchmark specifically designed for open-domain financial search.

Overview

  • 635 expert-crafted questions to ensure real-world relevance and accuracy.
  • Includes 3 sub-tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation) for both Global and Greater China subsets.
  • Challenging: most state-of-the-art agents score below 50% accuracy.
  • The entire dataset and evaluation suite are publicly available (see the loading sketch below).
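
If the release follows the common Hugging Face datasets convention, loading it would look like the sketch below. This is an assumption: this page does not specify the distribution format, and the repository id shown is a placeholder, not the official name.

```python
# Hypothetical loading sketch -- the repository id is a placeholder and
# the field names are assumptions; substitute the official release.
from datasets import load_dataset

ds = load_dataset("ORG/FinSearchComp", split="test")  # placeholder repo id

for example in ds.select(range(3)):
    print(example)  # expected fields: question, sub-task label, answer key
```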

Data Description & Distribution

Our benchmark consists of three distinct sub-tasks, each designed to evaluate different aspects of financial data retrieval and analysis capabilities.

[Figure: Data Distribution]

T1: Time-Sensitive Data Fetching

Tests the ability to retrieve time-sensitive, frequently updated data.

Example: IBM's latest closing price.

Answer: changes over time; obtained from a real-time API query.
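
Grading such items means comparing the model's answer against a live quote fetched at evaluation time. A minimal sketch follows, using the third-party yfinance package purely as a stand-in; the benchmark does not specify which real-time API its graders use.

```python
# Sketch: fetch a reference value for a time-sensitive question.
# yfinance is a stand-in here, not the benchmark's prescribed data source.
import yfinance as yf

def latest_close(ticker: str) -> float:
    """Return the most recent daily closing price for the given ticker."""
    history = yf.Ticker(ticker).history(period="5d")  # last few trading days
    return float(history["Close"].iloc[-1])

print(latest_close("IBM"))  # the reference answer changes from day to day
```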

T2: Simple Historical Lookup

Tests the ability to retrieve specific, non-time-sensitive historical facts.

Example: What were the total assets of Starbucks as of 2020.09.27?

Answer: $29,374.5 million (rounding errors allowed).
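
Because rounding errors are allowed, grading a numeric answer like this reduces to a comparison within a tolerance. A minimal sketch; the 0.1% relative tolerance is illustrative, not a value taken from the benchmark:

```python
def matches(predicted: float, reference: float, rel_tol: float = 1e-3) -> bool:
    """Accept a prediction within a relative tolerance of the reference,
    which absorbs ordinary rounding differences."""
    return abs(predicted - reference) <= rel_tol * abs(reference)

# Starbucks total assets as of 2020.09.27, in millions of USD.
print(matches(29374.5, 29374.5))  # True: exact match
print(matches(29375.0, 29374.5))  # True: within rounding tolerance
print(matches(29500.0, 29374.5))  # False at the illustrative 0.1% tolerance
```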

T3: Complex Historical Investigation

Tests the ability to perform multi-step reasoning, aggregation, and calculation.

Example: From January 2010 to April 2025, in which month did the S&P 500 index experience the largest single-month increase?

Answer: April 2020, +12.68% (an error of ±0.1% is allowed).
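
An answer like this can be derived mechanically from a monthly price series. A minimal pandas sketch, assuming `prices` is a daily close series with a DatetimeIndex; the benchmark does not specify the underlying data source:

```python
import pandas as pd

def largest_monthly_gain(prices: pd.Series) -> tuple[pd.Period, float]:
    """Return the calendar month with the largest single-month percentage
    increase, given a daily close series indexed by date."""
    monthly_close = prices.groupby(prices.index.to_period("M")).last()
    monthly_return = monthly_close.pct_change() * 100  # percent per month
    best_month = monthly_return.idxmax()
    return best_month, float(monthly_return.loc[best_month])
```

Applied to S&P 500 daily closes from January 2010 through April 2025, this should recover April 2020 within the stated tolerance.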

Data Construction Process

To accommodate the unique characteristics of different tasks, we employ a variety of question-and-answer construction strategies to ensure both diversity and quality.

[Figure: Data Construction Pipeline]

Time-Sensitive Data Fetching

Financial experts design questions for real-time data (stock prices, exchange rates, metal prices) that can be verified through APIs...

Simple Historical Lookup

Covers historical market data, corporate financials, and macro statistics obtained directly from sources...

Complex Historical Investigation

Involves financial data requiring calculation and reasoning from multiple historical data points...

Answer Verification

Answer verification uses a blind-review module. After a question and its answer key are produced, one or two other financial experts solve the question independently, without access to the key. If their results disagree, or if an expert deems a question ambiguous, a senior expert arbitrates. Based on the final judgment, the question or answer is revised, or the question is discarded entirely.
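
The control flow of that process can be summarized in a few lines of code. This is our own paraphrase of the description above, not released tooling; the names and structure are illustrative:

```python
# Illustrative paraphrase of the blind-review flow; not official tooling.
from dataclasses import dataclass

@dataclass
class Review:
    answer: str                    # the reviewer's independent answer
    flags_ambiguous: bool = False  # reviewer found the question ambiguous

def adjudicate(answer_key: str, reviews: list[Review]) -> str:
    """Accept an item when every blind review reproduces the key;
    otherwise escalate to a senior expert, who revises or discards it."""
    clean = all(r.answer == answer_key and not r.flags_ambiguous
                for r in reviews)
    return "accept" if clean else "escalate_to_senior"

print(adjudicate("April 2020", [Review("April 2020"), Review("March 2020")]))
# -> escalate_to_senior
```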

Key Findings

1 Top Models Struggle Significantly

  • Human experts generally perform better than models.
  • Even on the more straightforward tasks, Time-Sensitive Data Fetching and Simple Historical Lookup, the average accuracy of top models is below 60% and 40%, respectively.
  • Performance plummets to below 20% on Complex Historical Investigation tasks, which require complex, multi-step reasoning. This highlights a major gap in advanced reasoning capabilities.
[Figure: Key Finding 1, charts A and B]

2 Live Web Search is Essential, Not Optional

Without a search tool, models generally score zero on time-sensitive tasks and only marginally on historical search tasks.

  • Catastrophic Failure on Time-Sensitive Tasks: Models without live search consistently score zero on real-time data queries. Their static, internal knowledge is fundamentally incapable of providing up-to-date information like current stock prices.
  • Performance Halved on Historical Tasks: Relying on memory for historical data is also highly unreliable and leads to significant score degradation. Without search, a model's accuracy can be cut in half or more. For example, Gemini's score on simple historical queries plummets from 44.5% to 22.7%, and on complex queries from 27.4% to 13.1%.
[Figure: Key Finding 2, charts A and B]

3 Strong Regional Bias Impacts Performance

Models from China excel at queries related to the Greater China market but lag behind their international counterparts on Global market questions.

  • Performance Divide: U.S.-developed models (e.g., Grok-4) lead on the Global dataset, while Chinese models (e.g., DouBao) excel on the Greater China dataset.
  • Root Cause: This gap is driven by regional differences in training data, language conventions, and search tool optimization.
  • Key Challenge: This highlights a major hurdle for creating universally effective AI, as current models lack the cross-regional knowledge required for global generalization.
[Figure: Key Finding 3 chart]