FinanceBenchmark

Methodology

FinanceBenchmark v1 evaluates models on 45 tasks using a Python harness and publishes results to this leaderboard.

Verifiable ground truth

Every task has an objectively checkable answer: a multiple-choice letter match, a numeric value within a stated tolerance, or executed code whose output is compared against a reference implementation.

Contamination resistance

Quant tasks use seeded parameters so answers are reproducible but not easily memorized from public training data.
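A minimal sketch of how seeded parameters might work, assuming one integer seed per task. The function name `make_task_params`, the parameter names, and the value ranges are all hypothetical and are not taken from the harness; the point is only that the same seed always yields the same task instance, while the specific numbers are unlikely to appear verbatim in public text.

```python
import random


def make_task_params(seed: int) -> dict:
    """Derive option-pricing inputs deterministically from a seed.

    Hypothetical illustration: names and ranges are assumptions,
    not the benchmark's actual task generator.
    """
    rng = random.Random(seed)  # local generator, so global state is untouched
    return {
        "spot": round(rng.uniform(50.0, 150.0), 2),
        "strike": round(rng.uniform(50.0, 150.0), 2),
        "vol": round(rng.uniform(0.10, 0.50), 4),
        "rate": round(rng.uniform(0.00, 0.06), 4),
        "maturity_years": round(rng.uniform(0.25, 2.0), 2),
    }


if __name__ == "__main__":
    # The same seed reproduces the same parameters on every run.
    print(make_task_params(7) == make_task_params(7))
```

Using a local `random.Random` instance rather than the module-level functions keeps each task's parameters independent of whatever else has drawn random numbers in the same process.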

Reproducibility

All evaluations use temperature 0 and three runs per task, with a versioned harness (0.1.0), a versioned task set (v1), and pinned prompts.

Headroom

Includes Greeks precision and multi-step pricing tasks, where frontier models are known to perform worse than on conceptual finance questions.
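To illustrate what a Greeks precision task could look like, here is a sketch that computes the Black-Scholes delta of a European call on a non-dividend-paying asset. The choice of Black-Scholes and the function names are assumptions made for illustration; the benchmark's actual tasks and reference implementations may differ.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_delta(spot: float, strike: float, rate: float,
                  vol: float, maturity: float) -> float:
    """Black-Scholes delta of a European call with no dividends.

    Illustrative only: not the benchmark's reference implementation.
    """
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    return norm_cdf(d1)


if __name__ == "__main__":
    # At-the-money call, zero rate, 20% volatility, one year to expiry.
    print(bs_call_delta(spot=100.0, strike=100.0, rate=0.0, vol=0.2, maturity=1.0))
```

With a tight relative tolerance, small slips such as dropping the `0.5 * vol ** 2` term in `d1` are enough to fail a task, which is one way such questions can separate models that otherwise answer conceptual questions well.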

Scoring rules

Category    Metric                         Tolerance
Knowledge   Letter accuracy                Exact match
Analysis    Numeric accuracy               1% relative
Quant       Code execution vs reference    0.1% (0.1–1% for MC)
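The scoring rules above can be sketched as two small checks. This is an illustration under stated assumptions, not the harness's scoring code: the function names are hypothetical, whitespace and case normalization of letters is assumed, and error is assumed to be measured relative to the reference value.

```python
def score_letter(predicted: str, expected: str) -> bool:
    """Exact letter match after trimming whitespace and ignoring case.

    The normalization is an assumption; the harness may be stricter.
    """
    return predicted.strip().upper() == expected.strip().upper()


def within_relative_tolerance(predicted: float, expected: float,
                              rel_tol: float, abs_floor: float = 1e-12) -> bool:
    """True if |predicted - expected| <= rel_tol * |expected|.

    abs_floor (an assumption) avoids a zero-width band when expected is 0.
    """
    scale = max(abs(expected), abs_floor)
    return abs(predicted - expected) <= rel_tol * scale


# Tolerances from the table: 1% for Analysis, 0.1% for Quant.
ANALYSIS_TOL = 0.01
QUANT_TOL = 0.001

if __name__ == "__main__":
    # An answer 0.5% away from the reference passes Analysis but fails Quant.
    print(within_relative_tolerance(100.5, 100.0, ANALYSIS_TOL))
    print(within_relative_tolerance(100.5, 100.0, QUANT_TOL))
```

The table gives a range of 0.1–1% for MC tasks without saying how the value is chosen per task, so the sketch leaves that case out rather than guess.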

How to reproduce

pip install -e .
finbench run --model anthropic/claude-opus-4-20250514 --tasks all --runs 3
finbench publish results/<model>_<timestamp>.json