Evaluations
Compare prompt quality side by side.
Evaluations let you compare two prompt drafts — or the same draft with different model configurations — by running them multiple times with shared variables and scoring each response with a judge LLM.
Use Evaluations when you want data-driven answers to questions like:
- "Is this rewrite actually better?"
- "Does GPT-5 produce better results than GPT-5-mini for this prompt?"
- "How consistent are the scores across multiple runs?"
Modes
Prompt Comparison
Select two different drafts (Prompt A and Prompt B) from the same prompt. Both run with the same model configuration and the same variables, so the only difference between the two sides is the prompt content itself.
Model Comparison
Select the same draft for both sides and assign different model configurations. This makes the model the only difference between the two sides — useful for deciding which model gives the best results for a given prompt.
How It Works
- Select prompts — Pick two drafts from the same prompt. For model comparison, use the same draft on both sides.
- Configure models — Choose the execution model, reasoning effort, and text verbosity for each side independently.
- Generate variables — Raison auto-generates sample variable values from the prompt template so both prompts run with identical inputs. You can also enter values manually.
- Set output schema (optional) — Define a JSON schema for structured output. Raison can auto-generate one from the prompt content.
- Preview — Run a single preview to verify everything looks right before committing to a full evaluation.
- Run an iteration — Specify how many runs to execute (1–20). Each run sends both prompts to the configured LLM, then the judge scores each response on a 1–10 scale.
- Review results — Aggregated statistics (average, median, min, max, standard deviation) are computed per prompt. Charts visualize scores, cost, tokens, and duration across runs.
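The per-prompt statistics in the last step can be sketched as follows. This is a minimal illustration, not the actual implementation; `summarize` and the sample score list are hypothetical.

```python
from statistics import mean, median, pstdev

def summarize(scores):
    """Aggregate judge scores (1-10) collected across one prompt's runs."""
    return {
        "average": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        # Population std dev shown here; whether Raison uses population or
        # sample std dev is not specified.
        "std_dev": pstdev(scores),
    }

# Hypothetical scores from a 5-run iteration
stats = summarize([7, 8, 8, 9, 6])
```

Charts then plot these aggregates alongside cost, tokens, and duration per run.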
Scoring
Each run is scored by a judge LLM on a 1–10 scale across five criteria:
- Clarity and coherence — Is the output well-structured and easy to follow?
- Completeness — Does it address the full scope of the task?
- Usefulness and relevance — Is the response practical and on-topic?
- Professional tone — Is the language appropriate and polished?
- Correctness and defensibility — Is the reasoning sound and accurate?
The judge evaluates only the output quality. Internal instructions (flowcharts, procedural markers) and schema compliance are not scored.
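The docs do not state how the five criteria combine into the single 1–10 score, so the equal weighting below is purely an assumption for illustration; the criterion names are shorthand, not API identifiers.

```python
CRITERIA = ["clarity", "completeness", "usefulness", "tone", "correctness"]

def overall_score(criterion_scores):
    """Combine per-criterion judge scores (each 1-10) into one 1-10 score.

    Equal weighting is an assumption; the actual judge may weight
    criteria differently or score holistically.
    """
    assert set(criterion_scores) == set(CRITERIA)
    return sum(criterion_scores.values()) / len(CRITERIA)

score = overall_score({
    "clarity": 8, "completeness": 7, "usefulness": 9,
    "tone": 8, "correctness": 8,
})
```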
Managing Iterations
After an iteration completes, you can:
- Extend — Add more runs to the same iteration for a larger sample size.
- Retry failed — Re-run only the runs that failed, preserving successful results.
- Trigger new iteration — Start a fresh iteration with different variables or run count.
- Pause / Resume — Pause a running iteration and resume it later.
Each evaluation can have multiple iterations. Navigate between them using the sidebar.
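The "retry failed" behavior — re-running only failures while keeping successful results — can be sketched like this. The run record shape (`status`, `index`) and `run_fn` are hypothetical, not Raison's actual data model.

```python
def retry_failed(runs, run_fn):
    """Re-run only the failed runs, preserving successful results.

    `runs` is a hypothetical list of run records with a 'status' and an
    'index'; `run_fn(i)` performs run `i` again and returns a new record.
    """
    return [
        run if run["status"] == "succeeded" else run_fn(run["index"])
        for run in runs
    ]
```

Extending an iteration would be the complementary operation: appending new runs rather than replacing failed ones.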
Configuration
Execution Models
| Model | Description |
|---|---|
| gpt-5-nano | Fastest, lowest cost |
| gpt-5-mini | Balanced speed and quality |
| gpt-5 | High quality |
| gpt-5.2 | Highest quality |
Reasoning Effort
Controls how much the model reasons before responding. Higher effort generally produces better results but takes longer and costs more.
Text Verbosity
Controls response length: Low, Medium, or High.
Iteration Settings
| Setting | Range | Description |
|---|---|---|
| Run count | 1–20 | Number of runs per iteration |
| Execution mode | Parallel / Serial | Run prompts concurrently or one at a time |
| Parallel run count | 1–50 | How many runs execute simultaneously (parallel mode only) |
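The parallel/serial distinction in the table maps naturally onto a worker pool. A minimal sketch, assuming a hypothetical `run_fn(i)` that performs one evaluation run:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_runs(run_fn, run_count, mode="parallel", parallel_run_count=5):
    """Execute `run_count` runs one at a time or concurrently.

    `parallel_run_count` caps how many runs are in flight at once,
    mirroring the "Parallel run count" setting (only used in parallel mode).
    """
    if mode == "serial":
        return [run_fn(i) for i in range(run_count)]
    with ThreadPoolExecutor(max_workers=parallel_run_count) as pool:
        return list(pool.map(run_fn, range(run_count)))
```

Serial mode trades speed for gentler rate-limit pressure on the underlying LLM API.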
Iteration Lifecycle
| Status | Description |
|---|---|
| Pending | Iteration created, waiting to start |
| Running | Runs are being executed |
| Paused | Manually paused; can be resumed |
| Completed | All runs finished |
| Failed | One or more runs failed (can retry failed runs) |
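The status table implies a small state machine. The transition map below is inferred from the descriptions (pause/resume, retry failed) and is an assumption — the service may permit other transitions.

```python
# Allowed status transitions, inferred from the lifecycle table above.
TRANSITIONS = {
    "Pending": {"Running"},
    "Running": {"Paused", "Completed", "Failed"},
    "Paused": {"Running"},   # resume
    "Failed": {"Running"},   # retry failed runs
    "Completed": set(),      # terminal
}

def can_transition(current, target):
    """Check whether an iteration may move from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())
```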
Billing
Evaluation runs are metered per seat per billing period.
| Plan | Evaluation runs / seat |
|---|---|
| Free | — |
| Team | 50 |
| Team Plus | 250 |
| Enterprise | Unlimited |
Each individual run within an iteration counts toward the limit. You can check your current usage in the evaluation creation view — it shows "X of Y used this period".
When the limit is reached you will see a "Run limit reached" message. Upgrade your plan or wait for the next billing period to continue.
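The quota arithmetic behind "X of Y used this period" and the "Run limit reached" message can be sketched as follows; the function name and `None`-for-unlimited convention are illustrative assumptions.

```python
def check_run_budget(used, limit, requested):
    """Return how many of `requested` runs fit in the remaining quota.

    `limit` is the per-seat allowance for the billing period
    (None models an unlimited plan, as on Enterprise).
    Raises if the limit is already reached.
    """
    if limit is None:
        return requested
    remaining = max(limit - used, 0)
    if remaining == 0:
        raise RuntimeError("Run limit reached")
    return min(requested, remaining)
```

For example, a Team seat that has used 40 of its 50 runs could start at most a 10-run iteration before the period resets.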
Access
Evaluations are available on Team, Team Plus, and Enterprise plans.