What this measures
Ophaelis Index evaluates AI models on real-world tasks and ranks them on four measurable performance dimensions:
- Quality measures how well the model completes the task it was given.
Depending on the task, this includes accuracy, completeness, instruction adherence, factual fidelity, structure, and correctness.
- Speed reflects total response time from request to completion, relative to other models in the same evaluation.
- Cost Efficiency reflects the estimated cost required to produce the output, normalized across providers.
Higher scores indicate lower relative cost for the same task.
- Reliability measures repeated-run stability across evaluation scenarios.
A model with lower variation and more consistent outcomes scores higher.
It reflects stability under current evaluation conditions, not raw capability.
How models are evaluated
- Each model receives the same prompts for each benchmark item.
- Live evaluator results are based on 3 evaluation runs per benchmark item.
- The same scoring logic is applied to every model output.
- Outputs are structured and scored programmatically for consistency.
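The steps above can be sketched as a small evaluation loop. This is a minimal illustration, not the Ophaelis Index evaluator: `evaluate_model`, `echo_model`, and `length_scorer` are hypothetical names, and averaging the three runs is an assumption made only for the example.

```python
from statistics import mean
from typing import Callable

RUNS_PER_ITEM = 3  # the page states 3 evaluation runs per benchmark item


def evaluate_model(
    call_model: Callable[[str], str],
    score_output: Callable[[str, str], float],
    items: dict[str, str],
) -> dict[str, list[float]]:
    """Send every prompt RUNS_PER_ITEM times and score each output the same way."""
    results: dict[str, list[float]] = {}
    for item_id, prompt in items.items():
        results[item_id] = [
            score_output(item_id, call_model(prompt)) for _ in range(RUNS_PER_ITEM)
        ]
    return results


# Hypothetical stand-ins so the sketch runs without any provider API.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def length_scorer(item_id: str, output: str) -> float:
    # Toy scoring rule: longer outputs score higher, capped at 1.0.
    return min(len(output) / 10, 1.0)


items = {"SUMM-001": "short", "EXTR-001": "a longer prompt"}
scores = evaluate_model(echo_model, length_scorer, items)
# Averaging the runs is an assumption; the page does not say how runs are combined.
per_item_mean = {item_id: mean(runs) for item_id, runs in scores.items()}
```

Passing the model call and the scorer in as functions keeps the loop identical for every model, which is the property the list above describes.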
Benchmark structure
Models are evaluated across task categories that map to real operational work, including:
- Structured extraction
- Classification
- Summarization
- Coding
Each benchmark item represents a concrete scenario rather than an abstract toy problem.
Benchmark Prompt Structure
Each benchmark lane uses a standardized prompt pattern with lane-specific objectives.
Representative structure is shown for transparency. Active prompts are partially withheld to protect benchmark integrity.
Longform Summarization (SUMM-001)
Tests whether a model can compress long source material into clear, faithful summaries.
Prompt structure
- - Provide a long source document with mixed signal and noise.
- - Set explicit summary constraints (length, audience, key points).
- - Require coverage of major claims and caveats.
Expected output shape
Structured narrative summary with key points, risk notes, and no fabricated facts.
What is scored
- - Factual fidelity
- - Coverage completeness
- - Instruction adherence
Common penalized failure modes
- - Hallucinated details
- - Missed critical caveats
- - Overly generic or verbose output
Structured Extraction (EXTR-001)
Tests schema-following precision when extracting fields from noisy unstructured text.
Prompt structure
- - Provide a document containing target and distractor values.
- - Specify strict field schema and formatting rules.
- - Require null handling for missing fields.
Expected output shape
Machine-parseable structured object matching required keys and data types.
What is scored
- - Field accuracy
- - Format adherence
- - Schema completeness
Common penalized failure modes
- - Wrong field mapping
- - Type mismatches
- - Invalid JSON or shape drift
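As an illustration of how the three scored dimensions for this lane could be checked programmatically, here is a minimal sketch. The schema, field names, and scoring rule are all hypothetical; the actual EXTR-001 schema and scorer are not published on this page.

```python
import json

# Hypothetical schema: field name -> required Python type after JSON parsing.
SCHEMA = {"invoice_id": str, "total": float, "due_date": str}


def score_extraction(raw_output: str, expected: dict) -> dict:
    """Check format adherence, schema completeness, and field accuracy."""
    failed = {"valid_json": False, "schema_complete": False, "field_accuracy": 0.0}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return failed  # invalid JSON
    if not isinstance(parsed, dict):
        return failed  # shape drift: valid JSON, but not an object
    correct = 0
    for key, required_type in SCHEMA.items():
        value = parsed.get(key)
        # null passes the type check; the equality test below then requires
        # the expected value to be null as well (missing-field handling).
        type_ok = value is None or isinstance(value, required_type)
        if key in parsed and type_ok and value == expected[key]:
            correct += 1
    return {
        "valid_json": True,
        "schema_complete": set(parsed) == set(SCHEMA),
        "field_accuracy": correct / len(SCHEMA),
    }


expected = {"invoice_id": "INV-7", "total": 120.5, "due_date": None}
good = score_extraction('{"invoice_id": "INV-7", "total": 120.5, "due_date": null}', expected)
bad = score_extraction('{"invoice_id": "INV-7", "total": "120.5", "due_date": null}', expected)
broken = score_extraction('{"invoice_id": ', expected)
```

In this sketch the second output loses credit for `total` because the value arrives as a string (a type mismatch), and the third fails outright because it is not parseable JSON.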
Classification & Routing (CLSF-001)
Tests high-volume decision routing where classes are similar and mistakes are costly.
Prompt structure
- - Provide task or ticket payload with class taxonomy.
- - Define class boundaries and confidence expectations.
- - Require final class and concise rationale.
Expected output shape
Single class decision plus short evidence-based rationale and confidence marker.
What is scored
- - Class correctness
- - Boundary handling
- - Decision consistency
Common penalized failure modes
- - Confusing adjacent classes
- - Low-evidence rationales
- - Inconsistent outputs on similar inputs
Constraint-Based Planning (PLAN-001)
Tests planning quality under multiple constraints, tradeoffs, and feasibility checks.
Prompt structure
- - Provide objective, constraints, and resource limits.
- - Require explicit plan steps and dependency logic.
- - Require explanation of tradeoffs and constraint checks.
Expected output shape
Stepwise plan with constraint validation, dependencies, and tradeoff notes.
What is scored
- - Constraint adherence
- - Plan coherence
- - Feasibility
Common penalized failure modes
- - Constraint violations
- - Missing dependencies
- - Plans that look coherent but are not executable
Professional Response (RESP-001)
Tests business-ready communication quality for professional and stakeholder-facing responses.
Prompt structure
- - Provide scenario, audience role, and communication objective.
- - Set tone and policy constraints.
- - Require concise, actionable response with clear next steps.
Expected output shape
Professional response with clear structure, tone control, and actionable guidance.
What is scored
- - Clarity
- - Instruction adherence
- - Policy-safe communication
Common penalized failure modes
- - Tone mismatch
- - Vague recommendations
- - Missing critical context
Coding & Refactoring (CODE-001)
Tests code quality under modification pressure without breaking behavior expectations.
Prompt structure
- - Provide existing code and target refactor objective.
- - Define constraints (no regressions, preserve interfaces).
- - Require explanation of key changes and risks.
Expected output shape
Refactored code with consistent behavior and a concise implementation rationale.
What is scored
- - Correctness
- - Constraint adherence
- - Code quality and maintainability
Common penalized failure modes
- - Behavior regressions
- - Incomplete refactor coverage
- - Syntactic or logical defects
Debugging & Root-Cause Analysis (DEBUG-001)
Tests ability to identify real root causes from logs, symptoms, and partial evidence.
Prompt structure
- - Provide failing behavior, logs, and environment context.
- - Require root-cause hypothesis with evidence mapping.
- - Require remediation plan and verification steps.
Expected output shape
Root-cause diagnosis with evidence chain, fix proposal, and validation plan.
What is scored
- - Causal accuracy
- - Evidence grounding
- - Fix quality
Common penalized failure modes
- - Symptom-level guesses
- - Ignoring conflicting evidence
- - Weak or untestable remediation steps
Multi-Step Tool Reasoning (AGNT-001)
Tests tool-use planning and execution quality across multi-step reasoning workflows.
Prompt structure
- - Provide objective requiring multiple steps and tool calls.
- - Define tool constraints and expected sequencing.
- - Require explicit intermediate reasoning checkpoints.
Expected output shape
Ordered action plan with intermediate checks and final consolidated answer.
What is scored
- - Tool sequencing
- - Step correctness
- - Recovery from intermediate errors
Common penalized failure modes
- - Invalid step order
- - Skipping required tool steps
- - Failure to recover after intermediate mistakes
Policy / Governance Judgment (POL-001)
Tests judgment quality on policy-sensitive edge cases with competing considerations.
Prompt structure
- - Provide scenario with policy constraints and gray-area conditions.
- - Require decision plus policy-grounded rationale.
- - Require explicit uncertainty handling where applicable.
Expected output shape
Decision recommendation with policy reasoning, risk framing, and confidence bounds.
What is scored
- - Policy alignment
- - Reasoning consistency
- - Risk-aware judgment
Common penalized failure modes
- - Overconfident recommendations in ambiguous cases
- - Policy misapplication
- - Inconsistent decisions across similar scenarios
Executive Synthesis & Decision Memo (EXEC-001)
Tests executive synthesis quality for high-stakes decisions under information overload.
Prompt structure
- - Provide multi-source context with conflicting signals.
- - Require decision memo format with recommendation and alternatives.
- - Require explicit tradeoffs, risks, and next-step actions.
Expected output shape
Executive memo with clear recommendation, options, risk section, and action plan.
What is scored
- - Synthesis quality
- - Decision clarity
- - Tradeoff articulation
Common penalized failure modes
- - Weak recommendation clarity
- - Missing key risks
- - Unbalanced or unsupported tradeoff claims
Prompt Integrity and Rotation
Active prompts are partially withheld to preserve benchmark integrity. Ophaelis Index publishes representative structures for transparency, and retired prompts may be released on a delayed basis.
Benchmark prompts rotate over time to reduce contamination and leakage risk. Publishing every active prompt makes long-term benchmarks easier to game, so Ophaelis Index balances transparency with integrity safeguards.
Prompt set last refreshed: Q2 2026.
Trust requires transparency, but integrity requires restraint. Ophaelis Index is explicit about structure and scoring intent while protecting active prompt details.
Retired Prompt Policy
Retired prompts can be released on a delayed basis. Representative examples may be published earlier, but exact expected outputs are not published as live answer keys.
Scoring system
Each model receives four component scores on each benchmark item:
- Quality
- Speed
- Cost Efficiency
- Reliability
Overall Score is a composite ranking score calculated from quality, speed, cost efficiency, and reliability using the active weighting profile.
It is designed to compare real-world tradeoffs, not measure raw intelligence.
Best Overall identifies the model with the strongest weighted tradeoff across quality, speed, cost efficiency, and reliability under the active ranking profile.
It does not mean the model is the most capable in every scenario.
Changing the active profile changes which tradeoffs are rewarded and can change which model ranks first.
Ranking Profiles and Weights
Each profile applies the same scoring framework with different priorities.
| Profile | Priority | Quality | Speed | Cost Efficiency | Reliability |
|---|---|---|---|---|---|
| Balanced | General-purpose weighting for day-to-day decision making. | 50% | 20% | 15% | 15% |
| Quality First | Prioritizes model output quality above all else. | 70% | 10% | 10% | 10% |
| Speed First | Optimized for low latency and throughput-sensitive workloads. | 35% | 40% | 15% | 10% |
| Cost First | Favors lower operating cost while preserving baseline quality. | 35% | 15% | 40% | 10% |
| Reliability First | Weighted toward consistency and dependable outcomes. | 40% | 10% | 10% | 40% |
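To make the effect of a profile concrete, the sketch below applies the weights from the table to two made-up models. It assumes the Overall Score is a plain weighted sum of component scores on a 0-100 scale; the page does not state the exact formula, and the component scores for `model_a` and `model_b` are invented for illustration.

```python
# Weights copied from the profile table above.
PROFILES = {
    "Balanced": {"quality": 0.50, "speed": 0.20, "cost_efficiency": 0.15, "reliability": 0.15},
    "Quality First": {"quality": 0.70, "speed": 0.10, "cost_efficiency": 0.10, "reliability": 0.10},
    "Speed First": {"quality": 0.35, "speed": 0.40, "cost_efficiency": 0.15, "reliability": 0.10},
    "Cost First": {"quality": 0.35, "speed": 0.15, "cost_efficiency": 0.40, "reliability": 0.10},
    "Reliability First": {"quality": 0.40, "speed": 0.10, "cost_efficiency": 0.10, "reliability": 0.40},
}


def overall_score(components: dict[str, float], profile: str) -> float:
    """Weighted sum of component scores (assumed formula, see lead-in)."""
    weights = PROFILES[profile]
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component scores on a 0-100 scale.
model_a = {"quality": 90, "speed": 50, "cost_efficiency": 40, "reliability": 85}
model_b = {"quality": 75, "speed": 95, "cost_efficiency": 90, "reliability": 80}

balanced_a = overall_score(model_a, "Balanced")       # about 73.75
balanced_b = overall_score(model_b, "Balanced")       # about 82.0
quality_a = overall_score(model_a, "Quality First")   # about 80.5
quality_b = overall_score(model_b, "Quality First")   # about 79.0
```

With these invented numbers, model B ranks first under Balanced while model A ranks first under Quality First, which shows how switching the active profile can change the top-ranked model.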
How Reliability works
- Reliability reflects repeated-run stability, not raw capability.
- Models are evaluated across 3 runs per benchmark item in live lanes.
- A strong model should perform well consistently, not just produce one standout run.
- Lower variation in outcomes means higher reliability under current evaluation conditions.
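One way to turn "lower variation means higher reliability" into a number is sketched below. The formula is an assumption for illustration only: it uses the population standard deviation of the per-run scores and rescales it to 0-100. The page does not publish the actual Reliability formula, and `reliability_score` is a hypothetical name.

```python
from statistics import pstdev


def reliability_score(run_scores: list[float], max_score: float = 100.0) -> float:
    """Map the spread of repeated-run scores to 0-100; less spread scores higher."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to measure variation")
    spread = pstdev(run_scores)
    # For scores bounded by [0, max_score], the population standard
    # deviation cannot exceed max_score / 2, so this ratio stays in [0, 1].
    normalized = min(spread / (max_score / 2), 1.0)
    return 100.0 * (1.0 - normalized)


steady = reliability_score([82, 82, 82])    # identical runs, no variation
erratic = reliability_score([95, 60, 88])   # one standout run, one weak run
```

Here the erratic model has the single best run (95) but a lower reliability score than the steady model, matching the point that one standout run is not enough.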
Human Review and Inter-Rater Reliability
Some lanes are more objectively scored than others. Judgment-heavy lanes use structured rubric design today and are candidates for multi-rater governance.
Inter-rater reliability means qualified reviewers should generally agree when applying the same rubric to the same output.
Judgment-Sensitive Lanes
- - Professional Response
- - Executive Synthesis & Decision Memo
- - Policy / Governance Judgment
- - Additional lanes may use spot-check review for calibration.
How disagreement is handled
- - Reviewers score against a published rubric.
- - Meaningful disagreement triggers adjudication or rubric refinement.
- - Repeated disagreement signals rubric ambiguity, not just reviewer error.
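Reviewer agreement can be quantified in several ways. The sketch below computes Cohen's kappa for two reviewers, a widely used statistic that corrects raw agreement for agreement expected by chance. It is shown only as an example; the page does not say which agreement metric Ophaelis Index uses or will publish, and the ratings are invented.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same outputs."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of outputs")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Probability that both raters pick the same label by chance.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric verdicts from two reviewers on six outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this example the reviewers agree on 4 of 6 outputs, but half of that agreement would be expected by chance, so kappa comes out near 0.33. A persistently low value on one rubric item would be the kind of signal that points to rubric ambiguity.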
Why this matters
Some tasks do not have a single binary right answer. Trust requires constrained judgment, not hidden opinion. Ophaelis Index is designed to reduce reviewer subjectivity over time, not hide it.
Current state vs future state
- - Current: structured evaluator scoring and explicit rubric design.
- - Future: multi-rater validation, disagreement tracking, and published reviewer-agreement metrics for judgment-sensitive lanes.
Cost model (Level 2.5)
- Cost is estimated using token-normalized pricing.
- Tokens are approximated from text length.
- Pricing uses published provider rates.
Estimated Cost is calculated using normalized token-based pricing across providers.
This allows fair comparison regardless of provider-specific billing differences.
Cost Efficiency reflects affordability, not capability. A lower-cost model may outperform a higher-cost model for some tasks, but not all.
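The cost model above can be sketched as follows. The figures are assumptions for illustration: the 4-characters-per-token ratio is a common rough heuristic for English text, not a value stated on this page, and the model names and per-million-token prices are invented rather than real provider rates.

```python
CHARS_PER_TOKEN = 4  # assumed heuristic; the page does not state its ratio

# Hypothetical USD prices per 1,000,000 tokens (not real provider rates).
PRICES = {
    "model-a": {"input": 3.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}


def estimate_tokens(text: str) -> int:
    """Approximate the token count from text length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def estimate_cost(model: str, prompt: str, output: str) -> float:
    """Estimated USD cost of one request under token-normalized pricing."""
    rates = PRICES[model]
    input_cost = estimate_tokens(prompt) * rates["input"]
    output_cost = estimate_tokens(output) * rates["output"]
    return (input_cost + output_cost) / 1_000_000


prompt = "x" * 4000   # about 1,000 estimated tokens
output = "y" * 2000   # about 500 estimated tokens
cost_a = estimate_cost("model-a", prompt, output)
cost_b = estimate_cost("model-b", prompt, output)
```

Because every model's text goes through the same `estimate_tokens` function, the comparison does not depend on each provider's own tokenizer or billing units, at the price of being an estimate rather than a billed amount.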
Benchmark Robustness and Adversarial Coverage
Production inputs are not always clean. Benchmark expansion includes messy, conflicting, malformed, and adversarial variants to test robustness, not just best-case performance.
Real users are messy, real documents are messy, and real instructions can conflict. A production benchmark should measure this over time.
Structured Extraction
Messy field order, conflicting values, and malformed source text test whether extraction remains schema-accurate.
Constraint-Based Planning
Conflicting constraints and incomplete context test whether plans stay feasible instead of sounding plausible.
Multi-Step Tool Reasoning
Noisy tool outputs and partial failures test step ordering, recovery behavior, and final answer consistency.
Policy / Governance Judgment
Ambiguous edge cases and competing obligations test whether policy reasoning remains consistent and grounded.
Debugging & Root-Cause Analysis
Misleading logs and mixed symptoms test whether diagnosis identifies causes rather than surface-level errors.
Current limitations
- Reliability currently rests on 3 runs per benchmark item, a small sample, and will strengthen with deeper history.
- Token counts are estimated from text length, not measured, so the same method applies to every provider.
- Result confidence improves as historical runs accumulate.
What's next
- Repeated-run evaluation (x3 sampling).
- Historical performance tracking.
- Consistency and variance metrics.