Our benchmark consists of three sub-tasks, each designed to evaluate a different aspect of financial data retrieval and analysis.

T1: Time-Sensitive Data Fetching
Tests the ability to retrieve time-sensitive, frequently updated data.
Example: What is IBM's latest closing price?
Answer: Changes over time; obtained from a real-time API query.
T2: Simple Historical Lookup
Tests the ability to retrieve specific, non-time-sensitive historical facts.
Example: What were Starbucks' total assets as of September 27, 2020?
Answer: $29,374.5 million (rounding errors allowed)
T3: Complex Historical Investigation
Tests the ability to perform multi-step reasoning, aggregation, and calculation.
Example: From January 2010 to April 2025, in which month did the S&P 500 index experience the largest single-month increase?
Answer: April 2020, 12.68% (error of ±0.1% allowed)
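
The aggregation and calculation steps behind a T3 question like the one above can be sketched as follows. This is a minimal illustration, not the benchmark's own scoring code. It assumes that "single-month increase" means the percent change between consecutive month-end closes and that the ±0.1% tolerance is measured in percentage points. The function names largest_monthly_increase and within_tolerance are hypothetical, and the input values are synthetic, not real index data.

```python
from datetime import date


def largest_monthly_increase(month_end_closes):
    """Return (month label, percent change) for the largest month-over-month rise.

    month_end_closes: list of (date, close) pairs, one per month, sorted by date.
    Assumption: the increase is measured between consecutive month-end closes.
    """
    best_month, best_change = None, float("-inf")
    pairs = zip(month_end_closes, month_end_closes[1:])
    for (_, prev_close), (cur_date, cur_close) in pairs:
        change = (cur_close - prev_close) / prev_close * 100.0
        if change > best_change:
            best_month, best_change = cur_date.strftime("%B %Y"), change
    return best_month, best_change


def within_tolerance(predicted, reference, tol=0.1):
    """Assumed check: absolute difference in percentage points is at most tol."""
    return abs(predicted - reference) <= tol


# Synthetic closes for illustration only (NOT real S&P 500 values).
closes = [
    (date(2020, 1, 31), 100.0),
    (date(2020, 2, 29), 95.0),
    (date(2020, 3, 31), 80.0),
    (date(2020, 4, 30), 90.0),
    (date(2020, 5, 31), 94.5),
]

month, change = largest_monthly_increase(closes)
print(month, round(change, 2))
```

With real data, the same two steps apply: reduce daily prices to one close per month, then scan the consecutive pairs. A model's numeric answer could then be compared against the reference value with a check like within_tolerance.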