Overview

A metric is a single evaluation rule that judges one or more aspects of agent behavior. Quraite supports LLM-based evaluation metrics with full control over:
  • System and user prompts
  • Model selection and parameters
  • Evaluation rubric
  • Context passed to the LLM — the parts of the trace included as metric input: agent responses, full conversation trace, tool calls, or any combination
Metrics are defined at the project level and selected when running experiments.

Creating Metrics

Metrics are defined in the Quraite UI. When creating a metric, the following can be configured:
  • Name
  • Description (optional)
  • Tag (optional) - a label for grouping related metrics, such as “accuracy”, “bias”, or “branding”
  • Evaluation prompts - system and user prompts, pre-populated with templates as a starting point
  • Model configuration - the model used for judgment; larger models suit nuanced metrics, smaller models suit simpler ones
  • Message context - filters the conversation trace to only the messages relevant to the metric: agent responses, full conversation trace, tool calls, or any combination
  • Rubrics - severity levels for the metric (e.g. “major”, “minor”), each with defined criteria. This lets metric results be categorized by severity so the most serious issues can be triaged first.
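Although metrics are configured in the Quraite UI, the fields above can be pictured as a simple data structure. The following is a minimal sketch; all field names, types, and the model identifier are illustrative, not Quraite's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """One severity level with its criteria (e.g. "major", "minor")."""
    severity: str
    criteria: str

@dataclass
class MetricDefinition:
    """Illustrative shape of a metric configuration (not Quraite's real schema)."""
    name: str
    system_prompt: str
    user_prompt: str
    model: str                                     # model used for judgment
    description: str = ""
    tag: str = ""                                  # label for grouping, e.g. "accuracy"
    message_context: tuple = ("agent_responses",)  # trace filter for metric input
    rubrics: list = field(default_factory=list)

metric = MetricDefinition(
    name="Non-Blaming Tone",
    system_prompt="You are an impartial evaluator of support conversations...",
    user_prompt="Judge the agent response below: {response}",
    model="gpt-4o",
    tag="tone",
    rubrics=[
        Rubric("major", "Agent directly blames the user"),
        Rubric("minor", "Agent is dismissive but not accusatory"),
    ],
)
```

The optional fields mirror the optional UI settings: a metric is valid with only a name, prompts, and a model, while tag, message context, and rubrics refine how results are grouped and triaged.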

Metric Outputs

Likert-scale or numerical scoring (e.g. 1-5 or 1-10) is generally unreliable with LLM-based metrics, so Quraite enforces best practice by supporting only boolean pass/fail metrics. Each metric result contains:
  • Pass / Fail - a binary pass/fail signal (1 for pass, 0 for fail)
  • Reason - a concise explanation justifying the result
  • Rubric (optional) - the severity category the result falls into, if rubrics are configured for the metric
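The three output fields above map naturally onto a small result record. This sketch assumes hypothetical field names, not Quraite's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    """Illustrative pass/fail result record (not Quraite's actual API)."""
    passed: bool                  # the binary judgment
    reason: str                   # concise justification from the judge
    rubric: Optional[str] = None  # severity category, if rubrics are configured

    @property
    def score(self) -> int:
        """1 for pass, 0 for fail, matching the binary signal."""
        return 1 if self.passed else 0

result = MetricResult(
    passed=False,
    reason="The response implies the user misconfigured the device.",
    rubric="major",
)
print(result.score)  # 0
```

Because the score is strictly 0 or 1, aggregating across a testcase set reduces to a simple pass rate, with the rubric field available for slicing failures by severity.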

Examples

  • No Urgency Bias - checks that the agent never implies urgency or certainty (e.g., financial advice)
  • Non-Blaming Tone - checks that the agent never blames or dismisses the user (e.g., support interactions)
  • Diagnostic Neutrality - checks that the agent does not downplay symptoms or imply a diagnosis (e.g., healthcare)
  • Brand Alignment - checks that the agent consistently speaks in the brand’s voice (e.g., sales/marketing)
  • Proactive Guidance - checks that the agent surfaces relevant information ahead of direct asks (e.g., travel advisor)
Each of these can be implemented by crafting prompts tailored to the desired judgment logic.
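As an illustration of that prompt-crafting step, here is one way a “No Urgency Bias” judge could be framed. The prompt wording, the PASS/FAIL answer convention, and the parsing logic are all assumptions for this sketch; the actual call to the judge model is left out:

```python
# Hypothetical system prompt for a "No Urgency Bias" metric (illustrative only).
SYSTEM_PROMPT = """You are a strict evaluator of financial-advice agents.
Decide whether the agent response implies urgency or certainty about outcomes.
Answer on the first line with exactly PASS or FAIL, then a brief reason."""

def build_judge_messages(agent_response: str) -> list:
    """Assemble the chat messages sent to the judge model (sketch only)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Agent response to evaluate:\n{agent_response}"},
    ]

def parse_verdict(judge_reply: str) -> tuple:
    """Turn the judge's free-text reply into (passed, reason)."""
    first_line, _, rest = judge_reply.partition("\n")
    passed = first_line.strip().upper().startswith("PASS")
    reason = rest.strip() or first_line.strip()
    return passed, reason

passed, reason = parse_verdict("FAIL\nThe response says the user must act today.")
print(passed)  # False
```

Constraining the judge to a binary first-line verdict keeps parsing trivial and matches the pass/fail output format described above; the reason line carries the justification.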

Best Practices

Curate metrics after observing failures, not before. Write testcases, run experiments, and identify where the agent consistently falls short, then define metrics targeting those failure patterns. Generic metrics like “faithfulness,” “helpfulness,” or “tone” rarely correlate with the problems that matter most to the product.

Involve product and business stakeholders. They are best placed to define what good agent behavior looks like for the business.