Ranks #1 under the Balanced profile because it leads on quality by 6.25 points over #2, producing a 0.31-point composite margin. It concedes speed but retains the strongest weighted tradeoff overall.
Live evaluator aggregate from 2 benchmark items in this lane.
Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the active weighting profile.
Score Framework
Score Breakdown
Quality 100.00 | Speed 71.13 | Cost Efficiency 99.46 | Reliability 100.00
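As a rough illustration of how a composite like the Overall Score can be formed, the sketch below takes a weighted average of the four component scores. The weights are an assumption chosen for illustration; the actual Balanced profile weights are not stated here. These particular weights happen to reproduce the displayed 94.12, but that does not confirm them, since other weightings give the same result.

```python
# Component scores copied from the Score Breakdown line.
scores = {
    "quality": 100.00,
    "speed": 71.13,
    "cost_efficiency": 99.46,
    "reliability": 100.00,
}

# ASSUMPTION: illustrative weights only. The real Balanced profile
# weights are not given in this report.
balanced_weights = {
    "quality": 0.40,
    "speed": 0.20,
    "cost_efficiency": 0.20,
    "reliability": 0.20,
}


def composite_score(component_scores, weights):
    """Weighted average of component scores, normalized by total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(component_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


overall = composite_score(scores, balanced_weights)
print(f"{overall:.2f}")
```

Normalizing by the total weight keeps the result on the same 0 to 100 scale even if a profile's weights do not sum to exactly 1.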
Based on 3 evaluation runs per benchmark item.
3/3 runs successful in the latest evaluation cycle.
Reliability measures repeated-run stability across evaluation scenarios, favoring consistency over one-off peaks.
A model with lower variation and more consistent outcomes scores higher.
It reflects stability under current evaluation conditions, not raw capability.
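One way a stability-oriented score of this kind could be computed is sketched below. It is not the evaluator's actual formula: the function name `reliability_score` and the combination of success rate with the coefficient of variation are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def reliability_score(run_scores, successes):
    """Hypothetical stability score in [0, 100].

    ASSUMPTION: combines the fraction of successful runs with how little
    the successful runs vary (1 - coefficient of variation).
    """
    if not run_scores or len(run_scores) != len(successes):
        raise ValueError("need one success flag per run score")
    success_rate = sum(successes) / len(successes)
    ok_scores = [s for s, good in zip(run_scores, successes) if good]
    if not ok_scores:
        return 0.0
    average = mean(ok_scores)
    # Coefficient of variation: spread relative to the mean.
    variation = pstdev(ok_scores) / average if average > 0 else 1.0
    stability = max(0.0, 1.0 - variation)
    return 100.0 * success_rate * stability


# Three identical successful runs: no variation, so the maximum score.
steady = reliability_score([80.0, 80.0, 80.0], [True, True, True])
# Same mean but with a one-off peak and a dip: lower score.
peaky = reliability_score([60.0, 100.0, 80.0], [True, True, True])
print(steady, peaky)
```

Under this sketch, two models with the same average result are separated by how much their runs vary, which matches the idea that consistency is rewarded over isolated peaks.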
- Profile: Balanced
- Overall Score: 94.12
- Runner-Up Gap: +0.31
- Decision Area: longform_summarization
- Estimated Cost: $0.000213
- Method: Estimated tokens
- Cost Confidence: medium
- Input Tokens: 238
- Output Tokens: 207
Estimated Cost is calculated using normalized token-based pricing across providers.
This allows fair comparison regardless of provider-specific billing differences.
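A minimal sketch of token-based cost estimation follows. The per-million-token rates are placeholders, not the normalized prices used by the evaluator, so the result is not expected to match the Estimated Cost reported here. Only the token counts (238 input, 207 output) come from this report.

```python
def estimated_cost(input_tokens, output_tokens,
                   input_price_per_million, output_price_per_million):
    """Estimate cost in dollars from token counts and per-million-token rates."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# ASSUMPTION: placeholder rates in dollars per million tokens.
cost = estimated_cost(
    input_tokens=238,
    output_tokens=207,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"${cost:.6f}")
```

Applying the same formula to every provider, with each provider's rates expressed in the same unit, is what makes the resulting figures comparable.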
Cost Efficiency reflects how cheap a model is relative to other models on the same benchmark item. A higher score means a lower estimated cost.
Cost Efficiency reflects affordability, not capability. A lower-cost model may outperform a higher-cost model for some tasks, but not all.
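To make "relative cheapness" concrete, the sketch below scores each model against the cheapest one on the same benchmark item. The ratio-to-cheapest formula, the model names, and the second model's cost are all assumptions for illustration; the evaluator's actual normalization is not described here.

```python
def cost_efficiency(costs_by_model):
    """Hypothetical relative-cheapness score: the cheapest model gets 100."""
    if any(cost <= 0 for cost in costs_by_model.values()):
        raise ValueError("costs must be positive")
    cheapest = min(costs_by_model.values())
    return {
        model: 100.0 * cheapest / cost
        for model, cost in costs_by_model.items()
    }


# ASSUMPTION: hypothetical models and costs for one benchmark item.
efficiency = cost_efficiency({
    "model_a": 0.000213,
    "model_b": 0.000426,  # made-up cost, twice that of model_a
})
print(efficiency)
```

The score depends only on cost, so a model can rank high here while scoring lower on quality.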