OPHAELIS INDEX

Make AI model decisions you can defend.

Ophaelis Index evaluates real-world performance across quality, speed, cost efficiency, and reliability — so you can choose the right model for your constraints, not guess.

Continuously evaluated. Fully explainable. Built for real decisions.

Free Decision Surface | Latest Update: Apr 2, 2026, 1:46 PM | Live Lanes: 4 | Limited Coverage: 6

Why this exists

Most AI comparisons tell you what scores highest.

They don't tell you:

  • what you're trading off
  • what changes under different priorities
  • or why one model is better for your situation

Ophaelis Index was built to make those tradeoffs visible — so decisions aren't guesswork.

What actually changes when you evaluate correctly

  • The “best” model changes depending on cost vs quality priorities
  • Reliability differences expose unstable models under repeated runs
  • Provider updates can shift rankings within days

How to read this

Best Overall

Best Overall identifies the model with the strongest weighted tradeoff across quality, speed, cost efficiency, and reliability under the active ranking profile. It does not mean the model is the most capable in every scenario.

Overall Score

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the active weighting profile. It is designed to compare real-world tradeoffs, not measure raw intelligence.

Estimated Cost

Estimated Cost is calculated using normalized token-based pricing across providers. This allows fair comparison regardless of provider-specific billing differences.
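One way to picture normalized token-based pricing is to convert every provider's price list to a common unit before costing a request. The sketch below is illustrative only: the `Pricing` type, the per-million-token unit, and both providers' prices are assumptions for the example, not figures or formulas published by Ophaelis Index.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical provider price list (not real prices)."""
    input_price: float   # USD per `unit_tokens` input tokens
    output_price: float  # USD per `unit_tokens` output tokens
    unit_tokens: int     # billing unit, e.g. 1_000 or 1_000_000


def per_million(p: Pricing) -> tuple[float, float]:
    """Rescale both prices to USD per 1M tokens (assumed common unit)."""
    scale = 1_000_000 / p.unit_tokens
    return p.input_price * scale, p.output_price * scale


def estimated_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, using the normalized rates."""
    in_rate, out_rate = per_million(p)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Two made-up providers that bill in different units.
provider_a = Pricing(input_price=0.0005, output_price=0.0015, unit_tokens=1_000)
provider_b = Pricing(input_price=0.40, output_price=1.60, unit_tokens=1_000_000)

for name, p in (("A", provider_a), ("B", provider_b)):
    print(name, per_million(p), estimated_cost(p, 2_000, 500))
```

Once both price lists are in the same unit, the same request (2,000 input tokens and 500 output tokens here) can be costed side by side regardless of how each provider bills.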

Reliability

Reliability measures repeated-run stability across evaluation scenarios. A model with lower variation and more consistent outcomes scores higher. It reflects stability under current evaluation conditions, not raw capability.
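The exact Reliability formula is not given here. As a rough sketch of "lower variation scores higher", the example below maps the coefficient of variation of repeated-run scores onto a 0-100 scale. The function name, the choice of coefficient of variation, and the sample run scores are all assumptions for illustration.

```python
from statistics import mean, pstdev


def reliability_score(run_scores: list[float]) -> float:
    """Assumed mapping: identical runs score 100, noisier runs score lower."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to measure variation")
    avg = mean(run_scores)
    if avg == 0:
        return 0.0
    # Coefficient of variation: spread relative to the average score.
    cv = pstdev(run_scores) / abs(avg)
    return max(0.0, 100.0 * (1.0 - cv))


# Hypothetical quality scores from three runs of the same benchmark item.
stable = [88.0, 88.0, 88.0]
noisy = [95.0, 70.0, 99.0]

print(mean(stable), reliability_score(stable))
print(mean(noisy), reliability_score(noisy))
```

Both sample models average 88.0 across their runs, yet the noisy one receives a lower reliability score, which is the distinction between stability and raw capability that the definition draws.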

Current System Verdict (Balanced Profile)

Best Overall reflects the strongest tradeoff across quality, speed, cost efficiency, and reliability — not the highest raw capability.

GPT 5.4 Mini

OpenAI

90.54

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Balanced weighting profile.

GPT 5.4 Mini wins 7 decision areas and sustains a typical 1.18-point margin over the runner-up in those lanes.

Decision Areas Evaluated

10

Limited Coverage Areas

6

Last Evaluator Run

Apr 2, 2026, 1:46 PM

Top Models by Decision Priority

Public quick views for different operating goals

Best Overall under Balanced

GPT 5.4 Mini

OpenAI

90.54

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Balanced weighting profile.

Strongest overall tradeoff across quality, speed, cost efficiency, and reliability under balanced weighting.

Highest Output Quality

Gemini 3.1 Pro Preview

Google

93.00

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Quality First weighting profile.

Strongest accuracy and task completion performance across evaluation scenarios.

Fastest Response

Gemini 3 Flash

Google

90.95

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Speed First weighting profile.

Lowest response latency while maintaining usable quality for time-sensitive workflows.

Most Cost Efficient

Gemini 2.5 Flash Lite

Google

93.90

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Cost First weighting profile.

Best affordability-to-performance tradeoff when spend efficiency is prioritized.

Most Consistent

Gemini 2.5 Flash Lite

Google

93.99

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Reliability First weighting profile.

Most stable repeated-run outcomes across changing evaluation conditions.

Performance Trend

Track how model performance shifts over time as new versions ship and evaluation conditions change.

Trend history is building. Run additional evaluator cycles to unlock charted movement.

Benchmark Status

Last run: Apr 2, 2026, 1:46 PM

Method: 3-run averaged evaluation

Dimensions: Quality • Speed • Cost Efficiency • Reliability

Coverage: 10 evaluation areas

Continuously updated as models and providers change

Compact Leaderboard

Sortable free view for quick model comparison

Last updated: Apr 2, 2026, 1:46 PM

Ranking Profile changes how Overall Score is calculated.

Table Sort changes how rows are ordered for inspection.

Active weighting: Quality 50% | Speed 20% | Cost Efficiency 15% | Reliability 15%
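Read as a plain weighted sum, the active weighting can be sketched as follows. The formula is an assumption: the page states the weights but not the exact calculation. The input values are the Gemini 3.1 Pro Preview dimension scores from the leaderboard, and the function and variable names are illustrative.

```python
# Active (Balanced) weights as stated on the page, written as fractions.
BALANCED = {
    "quality": 0.50,
    "speed": 0.20,
    "cost_efficiency": 0.15,
    "reliability": 0.15,
}


def overall_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    """Assumed formula: weighted sum of 0-100 dimension scores."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * dimensions[name] for name in weights)


# Dimension scores for Gemini 3.1 Pro Preview, taken from the leaderboard.
gemini_31_pro = {
    "quality": 98.0,
    "speed": 84.8,
    "cost_efficiency": 64.2,
    "reliability": 95.0,
}

score = overall_score(gemini_31_pro, BALANCED)
print(round(score, 2))  # 89.84
```

For this row the weighted sum agrees with the published Overall Score of 89.84. Not every leaderboard row reproduces exactly from its rounded dimension scores, so the index may apply further steps that are not described on this page.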

Top Finishes: number of times the model ranked #1 across individual evaluation dimensions.

A model can have fewer Top Finishes and still rank highly overall when it stays consistently strong across all weighted dimensions.

Example: if you care most about speed, use Speed First. Then use table sort to inspect which fast models also retain strong quality.
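The difference between choosing a ranking profile and choosing a table sort can be sketched as two separate steps. In the example below the Speed First weights are hypothetical, since the page does not publish them; the dimension scores are three rows from the leaderboard, and all names are illustrative.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    quality: float
    speed: float
    cost_efficiency: float
    reliability: float


# Hypothetical Speed First weights (assumption; not published on the page).
SPEED_FIRST = {"quality": 0.25, "speed": 0.50, "cost_efficiency": 0.10, "reliability": 0.15}

# Dimension scores taken from three leaderboard rows.
ROWS = [
    Row("GPT 5.4 Mini", 90.5, 86.9, 85.4, 97.8),
    Row("Gemini 3 Flash", 88.5, 95.8, 83.2, 91.7),
    Row("Mistral Fast", 82.5, 96.8, 91.2, 86.7),
]


def overall(row: Row, weights: dict[str, float]) -> float:
    """Assumed weighted-sum Overall Score under the given profile."""
    return sum(w * getattr(row, name) for name, w in weights.items())


# Step 1: the ranking profile determines each row's Overall Score.
scored = [(row, overall(row, SPEED_FIRST)) for row in ROWS]

# Step 2: the table sort only reorders rows; the scores stay the same.
by_score = sorted(scored, key=lambda pair: pair[1], reverse=True)
by_quality = sorted(scored, key=lambda pair: pair[0].quality, reverse=True)

print([row.model for row, _ in by_score])
print([row.model for row, _ in by_quality])
```

Under these assumed weights the speed-oriented ordering and the quality-sorted ordering differ, while each model's Overall Score is identical in both views.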

Provenance: leaderboard uses the active weighting profile; overall score is a weighted tradeoff score; live results are based on 3 evaluation runs per benchmark item; evaluated decision areas: 10 (4 live, 6 limited coverage).

Viewing: Balanced

Sorted by: Overall Score

Overall Score reflects: Balanced weighting across Quality, Speed, Cost Efficiency, and Reliability.

Model                      | Provider       | Overall Score | Quality | Speed | Cost Efficiency | Reliability | Top Finishes
1. GPT 5.4 Mini            | OpenAI         | 90.54         | 90.5    | 86.9  | 85.4            | 97.8        | 7
2. Gemini 3.1 Pro Preview  | Google         | 89.84         | 98.0    | 84.8  | 64.2            | 95.0        | 0
3. Gemini 2.5 Flash Lite   | Google         | 89.64         | 83.3    | 81.5  | 100.0           | 100.0       | 0
4. Gemini 3 Flash          | Google         | 89.64         | 88.5    | 95.8  | 83.2            | 91.7        | 0
5. Claude Haiku 4 5        | Anthropic      | 88.14         | 84.3    | 87.6  | 91.7            | 94.4        | 0
6. Mistral Medium          | Other frontier | 87.96         | 84.1    | 73.8  | 97.8            | 100.0       | 2
7. Mistral Large           | Other frontier | 87.94         | 89.5    | 88.8  | 79.2            | 90.3        | 0
8. GPT 5.4 Nano            | OpenAI         | 87.61         | 82.2    | 86.8  | 96.9            | 93.8        | 1
9. Mistral Fast            | Other frontier | 87.29         | 82.5    | 96.8  | 91.2            | 86.7        | 0
10. GPT 5.4                | OpenAI         | 86.72         | 93.8    | 73.6  | 67.1            | 99.4        | 0

Why teams trust this ranking

  • Real task-based evaluation across decision areas.
  • Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the active weighting profile. It is designed to compare real-world tradeoffs, not measure raw intelligence.
  • Based on 3 evaluation runs per benchmark item in live evaluator lanes.
  • Reliability favors repeated-run stability over one-off peaks.
  • Estimated Cost is calculated using normalized token-based pricing across providers. This allows fair comparison regardless of provider-specific billing differences.
  • Prompt set last refreshed: Q2 2026.
  • Prompt sets are periodically rotated to reduce contamination risk.
  • Transparent methodology and scoring definitions.
  • Cost Efficiency reflects affordability, not capability. A lower-cost model may outperform a higher-cost model for some tasks, but not all.
  • Trend confidence improves as evaluator run history accumulates.

Deeper Access

Member Data Layer

Open the deeper view for full tables, decision-area breakdowns, per-model details, and cost transparency diagnostics.

Explore Full Benchmark Data | Methodology