Overview

A metric is a single evaluation rule that judges one or more aspects of agent behavior. Quraite supports LLM-based evaluation metrics with full control over:
  • System and user prompts
  • Model selection and parameters
  • Evaluation rubric
  • Context passed to the LLM — the parts of the trace included as metric input: agent responses, full conversation trace, tool calls, or any combination
Metrics are defined at the project level and selected when running experiments.

Creating Metrics

Metrics are defined in the Quraite UI. When creating a metric, the following can be configured:
  • Name
  • Description (optional)
  • Tag (optional) - a label for grouping related metrics, such as “accuracy”, “bias”, or “branding”
  • Evaluation prompts - system and user prompts, pre-populated with templates as a starting point
  • Model configuration - the model used for judgment; larger models suit nuanced metrics, smaller models suit simpler ones
  • Message context - filters the conversation trace to only the messages relevant to the metric: agent responses, full conversation trace, tool calls, or any combination
  • Rubrics - severity levels for the metric (e.g. “major”, “minor”), each with defined criteria. This lets metric results be categorized by severity so the most serious issues can be triaged first.
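Although metrics are configured in the Quraite UI, the fields above can be pictured as a simple data structure. The following is a minimal sketch; all field names, types, and the model identifier are illustrative, not Quraite's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """One severity level with its criteria (e.g. "major", "minor")."""
    severity: str
    criteria: str

@dataclass
class MetricDefinition:
    """Illustrative shape of a metric configuration (not Quraite's real schema)."""
    name: str
    system_prompt: str
    user_prompt: str
    model: str                                     # model used for judgment
    description: str = ""
    tag: str = ""                                  # label for grouping, e.g. "accuracy"
    message_context: tuple = ("agent_responses",)  # trace filter for metric input
    rubrics: list = field(default_factory=list)

metric = MetricDefinition(
    name="Non-Blaming Tone",
    system_prompt="You are an impartial evaluator of support conversations...",
    user_prompt="Judge the agent response below: {response}",
    model="gpt-4o",
    tag="tone",
    rubrics=[
        Rubric("major", "Agent directly blames the user"),
        Rubric("minor", "Agent is dismissive but not accusatory"),
    ],
)
```

The optional fields mirror the optional UI settings: a metric is valid with only a name, prompts, and a model, while tag, message context, and rubrics refine how results are grouped and triaged.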

Metric Outputs

Likert-scale or numerical scoring (e.g. 1-5 or 1-10) is generally unreliable with LLM-based metrics, so Quraite enforces best practice by supporting only boolean pass/fail metrics. Each metric result contains:
  • Pass / Fail - a binary pass/fail signal (1 for pass, 0 for fail)
  • Reason - a concise explanation justifying the result
  • Rubric (optional) - the severity category the result falls into, if rubrics are configured for the metric
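The three output fields above map naturally onto a small result record. This sketch assumes hypothetical field names, not Quraite's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    """Illustrative pass/fail result record (not Quraite's actual API)."""
    passed: bool                  # the binary judgment
    reason: str                   # concise justification from the judge
    rubric: Optional[str] = None  # severity category, if rubrics are configured

    @property
    def score(self) -> int:
        """1 for pass, 0 for fail, matching the binary signal."""
        return 1 if self.passed else 0

result = MetricResult(
    passed=False,
    reason="The response implies the user misconfigured the device.",
    rubric="major",
)
print(result.score)  # 0
```

Because the score is strictly 0 or 1, aggregating across a testcase set reduces to a simple pass rate, with the rubric field available for slicing failures by severity.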

Examples

  • No Urgency Bias - checks that the agent never implies urgency or certainty (e.g., financial advice)
  • Non-Blaming Tone - checks that the agent never blames or dismisses the user (e.g., support interactions)
  • Diagnostic Neutrality - checks that the agent does not downplay symptoms or imply a diagnosis (e.g., healthcare)
  • Brand Alignment - checks that the agent consistently speaks in the brand’s voice (e.g., sales/marketing)
  • Proactive Guidance - checks that the agent surfaces relevant information ahead of direct asks (e.g., travel advisor)
Each of these can be implemented by crafting prompts tailored to the desired judgment logic.
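As an illustration of that prompt-crafting step, here is one way a “No Urgency Bias” judge could be framed. The prompt wording, the PASS/FAIL answer convention, and the parsing logic are all assumptions for this sketch; the actual call to the judge model is left out:

```python
# Hypothetical system prompt for a "No Urgency Bias" metric (illustrative only).
SYSTEM_PROMPT = """You are a strict evaluator of financial-advice agents.
Decide whether the agent response implies urgency or certainty about outcomes.
Answer on the first line with exactly PASS or FAIL, then a brief reason."""

def build_judge_messages(agent_response: str) -> list:
    """Assemble the chat messages sent to the judge model (sketch only)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Agent response to evaluate:\n{agent_response}"},
    ]

def parse_verdict(judge_reply: str) -> tuple:
    """Turn the judge's free-text reply into (passed, reason)."""
    first_line, _, rest = judge_reply.partition("\n")
    passed = first_line.strip().upper().startswith("PASS")
    reason = rest.strip() or first_line.strip()
    return passed, reason

passed, reason = parse_verdict("FAIL\nThe response says the user must act today.")
print(passed)  # False
```

Constraining the judge to a binary first-line verdict keeps parsing trivial and matches the pass/fail output format described above; the reason line carries the justification.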

Best Practices

Curate metrics after observing failures, not before. Write testcases, run experiments, and identify where the agent consistently falls short, then define metrics targeting those failure patterns. Generic metrics like “faithfulness,” “helpfulness,” or “tone” rarely correlate with the problems that matter most to the product.

Involve product and business stakeholders. They are best placed to define what good agent behavior looks like for the business.