Evaluations

Compare prompt quality side by side.

Evaluations let you compare two prompt drafts — or the same draft with different model configurations — by running them multiple times with shared variables and scoring each response with a judge LLM.
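
Conceptually, each evaluation run renders both prompts with the same variable values, sends them to the configured models, and asks a judge LLM to score each response. The sketch below illustrates that flow; the types and function names are placeholders, not Raison's API.

```ts
// Illustrative sketch only: these names are placeholders, not Raison's API.
type Variables = Record<string, string>;

interface RunResult {
  prompt: "A" | "B";
  response: string;
  score: number; // judge score on the 1-10 scale
}

// One run: both prompts receive identical variable values,
// and a judge LLM scores each response.
async function runOnce(
  drafts: { A: string; B: string },
  variables: Variables,
  callModel: (prompt: string, vars: Variables) => Promise<string>,
  judge: (response: string) => Promise<number>,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const label of ["A", "B"] as const) {
    const response = await callModel(drafts[label], variables);
    const score = await judge(response);
    results.push({ prompt: label, response, score });
  }
  return results;
}
```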

Use Evaluations when you want data-driven answers to questions like:

  • "Is this rewrite actually better?"
  • "Does GPT-5 produce better results than GPT-5-mini for this prompt?"
  • "How consistent are the scores across multiple runs?"

Modes

Prompt Comparison

Select two different drafts (Prompt A and Prompt B) from the same prompt. Both run with the same model configuration and the same variables, so the only difference between the two sides is the prompt content itself.

Model Comparison

Select the same draft for both sides and assign a different model configuration to each. This isolates the model as the only difference, which is useful for deciding which model gives you the best results for a given prompt.
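
Either mode can be pictured as two side-by-side configurations that differ in exactly one place. The field names and effort levels below are illustrative placeholders, not Raison's actual settings schema.

```ts
// Hypothetical configuration shapes; field names and values are placeholders.
interface SideConfig {
  draftId: string;
  model: "gpt-5-nano" | "gpt-5-mini" | "gpt-5" | "gpt-5.2";
  reasoningEffort: "low" | "medium" | "high";
  verbosity: "low" | "medium" | "high";
}

// Prompt comparison: different drafts, identical model configuration.
const promptComparison = {
  a: { draftId: "draft-v1", model: "gpt-5", reasoningEffort: "medium", verbosity: "medium" },
  b: { draftId: "draft-v2", model: "gpt-5", reasoningEffort: "medium", verbosity: "medium" },
} satisfies Record<"a" | "b", SideConfig>;

// Model comparison: the same draft on both sides, different models.
const modelComparison = {
  a: { draftId: "draft-v2", model: "gpt-5-mini", reasoningEffort: "medium", verbosity: "medium" },
  b: { draftId: "draft-v2", model: "gpt-5", reasoningEffort: "medium", verbosity: "medium" },
} satisfies Record<"a" | "b", SideConfig>;
```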

How It Works

  1. Select prompts — Pick two drafts from the same prompt. For model comparison, use the same draft on both sides.
  2. Configure models — Choose the execution model, reasoning effort, and text verbosity for each side independently.
  3. Generate variables — Raison auto-generates sample variable values from the prompt template so both prompts run with identical inputs. You can also enter values manually.
  4. Set output schema (optional) — Define a JSON schema for structured output. Raison can auto-generate one from the prompt content.
  5. Preview — Run a single preview to verify everything looks right before committing to a full evaluation.
  6. Run an iteration — Specify how many runs to execute (1–20). Each run sends both prompts to their configured models, then the judge scores each response on a 1–10 scale.
  7. Review results — Aggregated statistics (average, median, min, max, standard deviation) are computed per prompt. Charts visualize scores, cost, tokens, and duration across runs.
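
The aggregate statistics in step 7 are standard descriptive statistics over the judge scores for each prompt. A minimal sketch follows (it uses the population standard deviation; Raison may compute it differently):

```ts
// Compute per-prompt aggregates from a list of judge scores (1-10).
function aggregate(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const average = sorted.reduce((sum, s) => sum + s, 0) / n;
  const median =
    n % 2 === 1 ? sorted[(n - 1) / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
  const variance = sorted.reduce((sum, s) => sum + (s - average) ** 2, 0) / n;
  return { average, median, min: sorted[0], max: sorted[n - 1], stdDev: Math.sqrt(variance) };
}

// Example: five runs of Prompt A scored 7, 8, 6, 9, 8.
console.log(aggregate([7, 8, 6, 9, 8]));
// { average: 7.6, median: 8, min: 6, max: 9, stdDev: ~1.02 }
```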

Scoring

Each run is scored by a judge LLM on a 1–10 scale across five criteria:

  • Clarity and coherence — Is the output well-structured and easy to follow?
  • Completeness — Does it address the full scope of the task?
  • Usefulness and relevance — Is the response practical and on-topic?
  • Professional tone — Is the language appropriate and polished?
  • Correctness and defensibility — Is the reasoning sound and accurate?

The judge evaluates only the output quality. Internal instructions (flowcharts, procedural markers) and schema compliance are not scored.
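
One way to picture the judge's output is a structured verdict with one score per criterion plus the overall 1–10 score reported in the results. This shape is an assumption for illustration; the actual judge output format is not documented here.

```ts
// Hypothetical verdict shape; Raison's real judge output may differ.
interface JudgeVerdict {
  clarityAndCoherence: number;         // 1-10
  completeness: number;                // 1-10
  usefulnessAndRelevance: number;      // 1-10
  professionalTone: number;            // 1-10
  correctnessAndDefensibility: number; // 1-10
  overall: number;                     // 1-10, the score reported per run
  rationale: string;                   // short justification from the judge
}
```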

Managing Iterations

After an iteration completes, you can:

  • Extend — Add more runs to the same iteration for a larger sample size.
  • Retry failed — Re-run only the runs that failed, preserving successful results.
  • Trigger new iteration — Start a fresh iteration with different variables or run count.
  • Pause / Resume — Pause a running iteration and resume it later.

Each evaluation can have multiple iterations. Navigate between them using the sidebar.
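
For example, "Retry failed" can be thought of as re-executing only the failed runs while keeping successful results untouched. A minimal sketch, with placeholder types:

```ts
// Placeholder types; not Raison's API.
type RunStatus = "succeeded" | "failed";

interface Run {
  id: string;
  status: RunStatus;
  score?: number;
}

// Re-run failed runs only; successful runs pass through unchanged.
async function retryFailed(
  runs: Run[],
  execute: (run: Run) => Promise<Run>,
): Promise<Run[]> {
  return Promise.all(
    runs.map((run) => (run.status === "failed" ? execute(run) : Promise.resolve(run))),
  );
}
```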

Configuration

Execution Models

Model       Description
gpt-5-nano  Fastest, lowest cost
gpt-5-mini  Balanced speed and quality
gpt-5       High quality
gpt-5.2     Highest quality

Reasoning Effort

Controls how much the model reasons before responding. Higher effort generally produces better results but takes longer and costs more.

Text Verbosity

Controls response length: Low, Medium, or High.
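
If you have called GPT-5 models through the OpenAI Responses API directly, these two options roughly correspond to the request's reasoning effort and text verbosity parameters. The mapping below is an assumption for orientation, not a description of how Raison issues its requests.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Assumed mapping of the per-side options onto a direct Responses API call.
const response = await client.responses.create({
  model: "gpt-5",
  reasoning: { effort: "high" },  // reasoning effort
  text: { verbosity: "low" },     // text verbosity
  input: "Summarize this quarter's incident reports for an executive audience.",
});

console.log(response.output_text);
```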

Iteration Settings

Setting             Range              Description
Run count           1–20               Number of runs per iteration
Execution mode      Parallel / Serial  Run prompts concurrently or one at a time
Parallel run count  1–50               How many runs execute simultaneously (parallel mode only)
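
In parallel mode, the parallel run count acts as a concurrency cap: at most that many runs are in flight at once, and serial mode behaves like a cap of 1. A minimal sketch of that scheduling, with placeholder names:

```ts
// Illustrative concurrency control: at most `parallelRunCount` runs in flight.
async function executeIteration<T>(
  runCount: number,
  parallelRunCount: number,
  runOnce: (index: number) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array(runCount);
  let next = 0;

  // Each worker pulls the next run index until all runs are done.
  async function worker() {
    while (next < runCount) {
      const index = next++;
      results[index] = await runOnce(index);
    }
  }

  const workers = Array.from(
    { length: Math.min(parallelRunCount, runCount) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```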

Iteration Lifecycle

Status     Description
Pending    Iteration created, waiting to start
Running    Runs are being executed
Paused     Manually paused; can be resumed
Completed  All runs finished
Failed     One or more runs failed (can retry failed runs)
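
The lifecycle can be read as a small state machine. The transition map below is inferred from the table and the iteration actions above (resume, retry, extend), so treat it as an illustration rather than the authoritative set of transitions.

```ts
type IterationStatus = "pending" | "running" | "paused" | "completed" | "failed";

// Inferred transitions; the product may allow more or fewer.
const transitions: Record<IterationStatus, IterationStatus[]> = {
  pending: ["running"],
  running: ["paused", "completed", "failed"],
  paused: ["running"],     // Resume
  completed: ["running"],  // Extend adds more runs to the same iteration
  failed: ["running"],     // Retry failed re-executes only the failed runs
};
```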

Billing

Evaluation runs are metered per seat per billing period.

Plan        Evaluation runs / seat
Free        Not available
Team        50
Team Plus   250
Enterprise  Unlimited

Each individual run within an iteration counts toward the limit. You can check your current usage in the evaluation creation view — it shows "X of Y used this period".
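
A simple way to read the limit: an iteration of N runs consumes N of your per-seat allowance, so the "X of Y" counter gates whether a new iteration (or an extension) can start. A sketch of that check, with placeholder names:

```ts
// Placeholder gating logic for the per-seat run limit.
function canStartRuns(
  usedThisPeriod: number,
  limit: number | "unlimited",
  requestedRuns: number,
): boolean {
  if (limit === "unlimited") return true;
  return usedThisPeriod + requestedRuns <= limit;
}

// Example: 180 of 250 runs used on Team Plus, requesting a 20-run iteration.
canStartRuns(180, 250, 20); // true
```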

When the limit is reached you will see a "Run limit reached" message. Upgrade your plan or wait for the next billing period to continue.

Access

Evaluations are available on Team, Team Plus, and Enterprise plans.