Methodology

FinanceBenchmark v1 evaluates models on 45 tasks using a Python harness and publishes results to this leaderboard.

Verifiable ground truth

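Every task has an objectively checkable answer, graded in one of three ways: an MCQ letter match, a numeric comparison within a tolerance, or execution of the submitted code with its output compared against a reference implementation.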
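The three grading modes can be illustrated with a minimal sketch. The function names and tolerance defaults below are assumptions made for illustration, not the harness's actual API or settings.

```python
import math
from typing import Callable, Iterable, Tuple


def grade_mcq(response: str, correct_letter: str) -> bool:
    """Letter match after trimming whitespace and normalizing case."""
    return response.strip().upper() == correct_letter.strip().upper()


def grade_numeric(response: float, reference: float,
                  rel_tol: float = 1e-4, abs_tol: float = 1e-8) -> bool:
    """Accept the answer if it is within tolerance of the reference.

    The tolerance values are illustrative assumptions.
    """
    return math.isclose(response, reference, rel_tol=rel_tol, abs_tol=abs_tol)


def grade_code(candidate: Callable[..., float],
               reference: Callable[..., float],
               inputs: Iterable[Tuple[float, ...]],
               rel_tol: float = 1e-6) -> bool:
    """Run candidate and reference on the same inputs and compare outputs."""
    return all(
        math.isclose(candidate(*args), reference(*args), rel_tol=rel_tol)
        for args in inputs
    )
```

For example, grade_code could compare a model-written simple-interest function with a reference one over a handful of input tuples; a real harness would also need sandboxing and timeouts, which this sketch leaves out.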

Quant tasks use seeded parameters so answers are reproducible but not easily memorized from public training data.
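As a rough illustration of seeded parameters, the sketch below derives a bond-pricing task from an integer seed. The task type, parameter ranges, and the names bond_price and make_bond_task are hypothetical and not taken from the benchmark's task set.

```python
import random


def bond_price(face: float, coupon_rate: float, yield_rate: float, years: int) -> float:
    """Present value of an annual-coupon bond discounted at a flat yield."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + yield_rate) ** t for t in range(1, years + 1))
    pv_face = face / (1 + yield_rate) ** years
    return pv_coupons + pv_face


def make_bond_task(seed: int) -> dict:
    """Build one task instance; the same seed always yields the same task."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    params = {
        "face": 1000.0,
        "coupon_rate": round(rng.uniform(0.01, 0.08), 4),  # assumed range
        "yield_rate": round(rng.uniform(0.01, 0.08), 4),   # assumed range
        "years": rng.randint(2, 30),                       # assumed range
    }
    return {"params": params, "answer": bond_price(**params)}
```

Because the ground truth is computed from the drawn parameters, the answer can be regenerated exactly from the seed, while the specific numbers are unlikely to appear verbatim in public text.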

Results are reproducible: every model is run at temperature 0 with three runs per task, using a versioned harness (0.1.0), a versioned task set (v1), and pinned prompts.

The task set includes Greeks precision and multi-step pricing tasks, areas where frontier models are known to score lower than they do on conceptual finance questions.
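To make "Greeks precision" concrete, here is a minimal sketch of the kind of calculation such a task might involve: the Black-Scholes delta of a European call, cross-checked with a central finite difference. The choice of Black-Scholes and all function names are assumptions for illustration; the benchmark's actual tasks and reference implementations are not shown here.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1(spot: float, strike: float, rate: float, vol: float, years: float) -> float:
    return (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * math.sqrt(years))


def bs_call_price(spot: float, strike: float, rate: float, vol: float, years: float) -> float:
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = _d1(spot, strike, rate, vol, years)
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def bs_call_delta(spot: float, strike: float, rate: float, vol: float, years: float) -> float:
    """Closed-form delta: the derivative of the call price with respect to spot."""
    return norm_cdf(_d1(spot, strike, rate, vol, years))


def finite_difference_delta(spot: float, strike: float, rate: float, vol: float,
                            years: float, bump: float = 0.01) -> float:
    """Central-difference estimate of delta, used here only as a sanity check."""
    up = bs_call_price(spot + bump, strike, rate, vol, years)
    down = bs_call_price(spot - bump, strike, rate, vol, years)
    return (up - down) / (2.0 * bump)
```

A numeric-tolerance grader would accept a model's delta only if it matches the closed-form value to several decimal places, which is where small slips in d1 or in the normal CDF show up.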

pip install -e .
finbench run --model anthropic/claude-opus-4-20250514 --tasks all --runs 3
finbench publish results/<model>_<timestamp>.json