OPHAELIS INDEX

Make AI model decisions you can defend.

Ophaelis Index evaluates real-world performance across quality, speed, cost efficiency, and reliability — so you can choose the right model for your constraints, not guess.

Continuously evaluated. Fully explainable. Built for real decisions.

Free Decision Surface | Latest Update: Apr 2, 2026, 1:46 PM | Live Lanes: 4 | Limited Coverage: 6

Explore Full Benchmark Data | Methodology

Why this exists

Most AI comparisons tell you what scores highest.

They don't tell you:

what you're trading off
what changes under different priorities
or why one model is better for your situation

Ophaelis Index was built to make those tradeoffs visible — so decisions aren't guesswork.

What actually changes when you evaluate correctly

The “best” model changes depending on cost vs quality priorities
Repeated runs expose unstable models through differences in reliability
Provider updates can shift rankings within days

How to read this

Best Overall

Best Overall identifies the model with the strongest weighted tradeoff across quality, speed, cost efficiency, and reliability under the active ranking profile. It does not mean the model is the most capable in every scenario.

Overall Score

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the active weighting profile. It is designed to compare real-world tradeoffs, not measure raw intelligence.

Estimated Cost

Estimated Cost is calculated using normalized token-based pricing across providers. This allows fair comparison regardless of provider-specific billing differences.
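As a rough illustration of token-based cost normalization, the sketch below converts per-million-token prices into a cost for one fixed workload, so two models are compared on the same footing. The `TokenPricing` class, the function name, the prices, and the workload size are all hypothetical placeholders; the source does not publish its pricing inputs or its exact normalization formula.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical pricing record, in USD per one million tokens."""
    input_usd_per_million: float
    output_usd_per_million: float


def estimated_cost_usd(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under simple per-token billing (an assumed model)."""
    input_cost = input_tokens / TOKENS_PER_MILLION * pricing.input_usd_per_million
    output_cost = output_tokens / TOKENS_PER_MILLION * pricing.output_usd_per_million
    return input_cost + output_cost


# Placeholder prices, not real provider rates.
model_a = TokenPricing(input_usd_per_million=0.50, output_usd_per_million=1.50)
model_b = TokenPricing(input_usd_per_million=3.00, output_usd_per_million=15.00)

# The same workload is applied to both models so the costs are comparable.
workload = {"input_tokens": 2_000, "output_tokens": 500}
cost_a = estimated_cost_usd(model_a, **workload)
cost_b = estimated_cost_usd(model_b, **workload)
print(f"model_a: ${cost_a:.5f}  model_b: ${cost_b:.5f}")
```

Holding the workload constant is the design choice that matters here: it removes differences in how providers quote or bundle prices from the comparison.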

Reliability

Reliability measures repeated-run stability across evaluation scenarios. A model with lower variation and more consistent outcomes scores higher. It reflects stability under current evaluation conditions, not raw capability.
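One way to picture repeated-run stability is to score the spread of a model's results across runs. The sketch below is only an assumption about how such a score could work: the linear mapping, the `max_spread` parameter, and the sample run scores are hypothetical and are not taken from the Ophaelis methodology.

```python
from statistics import mean, pstdev


def reliability_score(run_scores: list[float], max_spread: float = 10.0) -> float:
    """Hypothetical stability score on a 0-100 scale.

    Returns 100 when every run agrees and falls linearly to 0 as the
    population standard deviation of the runs approaches max_spread.
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to measure stability")
    spread = pstdev(run_scores)
    return max(0.0, 100.0 * (1.0 - spread / max_spread))


# Two made-up models with the same average quality over three runs.
stable_runs = [88.0, 88.0, 88.0]
unstable_runs = [95.0, 80.0, 89.0]

print(mean(stable_runs), reliability_score(stable_runs))
print(mean(unstable_runs), round(reliability_score(unstable_runs), 2))
```

Both sample models average 88.0, yet the second scores far lower on stability. This is why reliability is reported as its own dimension, separate from quality.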

Current System Verdict (Balanced Profile)

Best Overall reflects the strongest tradeoff across quality, speed, cost efficiency, and reliability — not the highest raw capability.

GPT 5.4 Mini

OpenAI

90.54

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Balanced weighting profile.

GPT 5.4 Mini wins 7 decision areas and sustains a typical 1.18-point margin over the runner-up in those lanes.

Decision Areas Evaluated: 10

Limited Coverage Areas: 6

Last Evaluator Run: Apr 2, 2026, 1:46 PM

Top Models by Decision Priority

Public quick views for different operating goals

Best Overall under Balanced

GPT 5.4 Mini

OpenAI

90.54

Strongest overall tradeoff across quality, speed, cost efficiency, and reliability under balanced weighting.

Highest Output Quality

Gemini 3.1 Pro Preview

Google

93.00

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Quality First weighting profile.

Strongest accuracy and task completion performance across evaluation scenarios.

Fastest Response

Gemini 3 Flash

Google

90.95

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Speed First weighting profile.

Lowest response latency while maintaining usable quality for time-sensitive workflows.

Most Cost Efficient

Gemini 2.5 Flash Lite

Google

93.90

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Cost First weighting profile.

Best affordability-to-performance tradeoff when spend efficiency is prioritized.

Most Consistent

Gemini 2.5 Flash Lite

Google

93.99

Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the Reliability First weighting profile.

Most stable repeated-run outcomes across changing evaluation conditions.

Performance Trend

Track how model performance shifts over time as new versions are released and evaluation conditions change.

Trend history is building. Run additional evaluator cycles to unlock charted movement.

Benchmark Status

Last run: Apr 2, 2026, 1:46 PM

Method: 3-run averaged evaluation

Dimensions: Quality • Speed • Cost Efficiency • Reliability

Coverage: 10 evaluation areas

Continuously updated as models and providers change

Compact Leaderboard

Sortable free view for quick model comparison

Last updated: Apr 2, 2026, 1:46 PM

Ranking Profile changes how Overall Score is calculated.

Table Sort changes how rows are ordered for inspection.

Active weighting: Quality 50% | Speed 20% | Cost Efficiency 15% | Reliability 15%
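The Balanced weighting can be read as a plain weighted sum of the four dimension scores. The sketch below assumes exactly that; the function and variable names are illustrative, and the source does not confirm that the published Overall Score is computed directly from the rounded dimension values shown in the leaderboard.

```python
# Balanced profile weights as stated on the page.
BALANCED_WEIGHTS = {
    "quality": 0.50,
    "speed": 0.20,
    "cost_efficiency": 0.15,
    "reliability": 0.15,
}


def overall_score(dimension_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of dimension scores (assumed form of the composite)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * dimension_scores[name] for name in weights)


# Dimension values for Gemini 3.1 Pro Preview, copied from the leaderboard.
gemini_31_pro_preview = {
    "quality": 98.0,
    "speed": 84.8,
    "cost_efficiency": 64.2,
    "reliability": 95.0,
}

score = overall_score(gemini_31_pro_preview, BALANCED_WEIGHTS)
print(round(score, 2))  # prints 89.84
```

For this row the weighted sum matches the published 89.84. It does not match every row: the same calculation on GPT 5.4 Mini's displayed values gives about 90.11 against a published 90.54. The published score may therefore be aggregated at a finer level, such as per decision area, before it is displayed. That explanation is an assumption, not something the page states.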

Top Finishes: the number of times the model ranked #1 across individual evaluation dimensions.

A model can have fewer Top Finishes and still rank highly overall when it stays consistently strong across all weighted dimensions.

Example: if you care most about speed, use Speed First. Then use table sort to inspect which fast models also retain strong quality.
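The difference between the two controls can be sketched as follows: a ranking profile recomputes each model's score, while a table sort only reorders rows that are already scored. The dimension values come from the leaderboard, but the `SPEED_FIRST` weights are hypothetical, since the page lists only the Balanced weights, and the weighted-sum form is an assumption.

```python
# Three leaderboard rows, using the dimension values shown in the table.
ROWS = [
    {"model": "GPT 5.4 Mini", "quality": 90.5, "speed": 86.9,
     "cost_efficiency": 85.4, "reliability": 97.8},
    {"model": "Gemini 3 Flash", "quality": 88.5, "speed": 95.8,
     "cost_efficiency": 83.2, "reliability": 91.7},
    {"model": "Mistral Fast", "quality": 82.5, "speed": 96.8,
     "cost_efficiency": 91.2, "reliability": 86.7},
]

# Hypothetical Speed First weights; the real profile weights are not published here.
SPEED_FIRST = {"quality": 0.25, "speed": 0.50,
               "cost_efficiency": 0.125, "reliability": 0.125}


def apply_profile(rows, weights):
    """Ranking profile: recompute each row's overall score under the weights."""
    return [
        {**row, "overall": sum(w * row[dim] for dim, w in weights.items())}
        for row in rows
    ]


def sort_table(rows, column):
    """Table sort: reorder rows by one column without changing any score."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


scored = apply_profile(ROWS, SPEED_FIRST)
by_overall = sort_table(scored, "overall")
by_quality = sort_table(scored, "quality")

print([row["model"] for row in by_overall])
print([row["model"] for row in by_quality])
```

Under these assumed weights the fast models lead on overall score, and sorting the same scored rows by quality shows which of them gives up the least quality.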

Provenance: leaderboard uses the active weighting profile; overall score is a weighted tradeoff score; live results are based on 3 evaluation runs per benchmark item; evaluated decision areas: 10 (4 live, 6 limited coverage).

Viewing: Balanced

Sorted by: Overall Score

Overall Score reflects: Balanced weighting across Quality, Speed, Cost Efficiency, and Reliability.

Model	Provider	Overall Score	Quality	Speed	Cost Efficiency	Reliability	Top Finishes
1. GPT 5.4 Mini	OpenAI	90.54	90.5	86.9	85.4	97.8	7
2. Gemini 3.1 Pro Preview	Google	89.84	98.0	84.8	64.2	95.0	0
3. Gemini 2.5 Flash Lite	Google	89.64	83.3	81.5	100.0	100.0	0
4. Gemini 3 Flash	Google	89.64	88.5	95.8	83.2	91.7	0
5. Claude Haiku 4.5	Anthropic	88.14	84.3	87.6	91.7	94.4	0
6. Mistral Medium	Other frontier	87.96	84.1	73.8	97.8	100.0	2
7. Mistral Large	Other frontier	87.94	89.5	88.8	79.2	90.3	0
8. GPT 5.4 Nano	OpenAI	87.61	82.2	86.8	96.9	93.8	1
9. Mistral Fast	Other frontier	87.29	82.5	96.8	91.2	86.7	0
10. GPT 5.4	OpenAI	86.72	93.8	73.6	67.1	99.4	0

Why teams trust this ranking

Real task-based evaluation across decision areas.
Based on 3 evaluation runs per benchmark item in live evaluator lanes.
Reliability favors repeated-run stability over one-off peaks.
Prompt set last refreshed: Q2 2026.
Prompt sets are periodically rotated to reduce contamination risk.
Transparent methodology and scoring definitions.
Cost Efficiency reflects affordability, not capability. A lower-cost model may outperform a higher-cost model for some tasks, but not all.
Trend confidence improves as evaluator run history accumulates.

Deeper Access

Member Data Layer

Open the deeper view for full tables, decision-area breakdowns, per-model details, and cost transparency diagnostics.
