Overview

An Experiment is a single execution of a dataset against a specific agent. It binds together an agent, a dataset (or subset of it), and a set of metrics to produce a structured evaluation run. Each experiment run produces:
  • Trace-level metrics — latency, cost, and token usage
  • Turn-by-turn conversation logs — including all tool calls
  • Per-testcase pass/fail signals — with reasoning
  • Metric evaluations
An experiment answers a simple question:
“How does this agent perform on this dataset, with this model or prompt and these metrics?”
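As a rough illustration, an experiment can be thought of as a value binding those three inputs together. The sketch below is hypothetical and not the Quraite API; the Experiment dataclass and its field names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical shapes, not the Quraite API: an experiment binds an agent,
# a dataset (or a subset of it), and a set of metrics into one evaluation run.
@dataclass
class Experiment:
    agent: str                                          # agent endpoint to evaluate
    dataset: str                                        # dataset (or subset) of testcases
    metrics: list[str] = field(default_factory=list)    # metrics scored per testcase

experiment = Experiment(
    agent="support-bot-v2",
    dataset="billing-questions",
    metrics=["answer_correctness", "tool_call_accuracy"],
)
# Running it yields trace-level metrics, turn-by-turn logs,
# per-testcase pass/fail signals, and metric evaluations.
```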

How Experiments Are Executed

When an experiment runs, Quraite loads the dataset and selected agent, then invokes the agent against every testcase. How each testcase is executed depends on its type:
  • Script-based testcases replay the predefined turns in sequence
  • Scenario-based testcases dynamically generate user messages and continue turns until the scenario concludes
In both cases, the agent endpoint is called once per turn.
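A minimal sketch of this per-turn loop is below. The function names (call_agent, generate_user_message, scenario_concluded) are placeholders standing in for Quraite's internal execution engine, not real APIs.

```python
# Hypothetical sketch of per-turn execution; all callables are placeholders.
def run_testcase(testcase, call_agent, generate_user_message, scenario_concluded):
    transcript = []
    if testcase["type"] == "script":
        # Script-based: replay the predefined user turns in sequence.
        for user_message in testcase["turns"]:
            reply = call_agent(user_message, transcript)     # one endpoint call per turn
            transcript.append({"user": user_message, "agent": reply})
    else:
        # Scenario-based: generate user messages until the scenario concludes.
        while not scenario_concluded(testcase, transcript):
            user_message = generate_user_message(testcase, transcript)
            reply = call_agent(user_message, transcript)     # one endpoint call per turn
            transcript.append({"user": user_message, "agent": reply})
    return transcript
```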

Configurability

The following can be configured when running an experiment:
  • Concurrent testcases — how many testcases run in parallel
  • Concurrent metrics per testcase — how many metrics are evaluated simultaneously
These settings control the load placed on the agent and help manage rate limits for LLM calls.
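The effect of the two limits can be sketched with ordinary asyncio semaphores. The constants and callables below are illustrative assumptions, not Quraite configuration keys.

```python
import asyncio

# Illustrative only: two semaphores model the two configurable limits.
CONCURRENT_TESTCASES = 4           # testcases evaluated in parallel
CONCURRENT_METRICS_PER_CASE = 2    # metrics evaluated simultaneously per testcase

testcase_slots = asyncio.Semaphore(CONCURRENT_TESTCASES)

async def evaluate_testcase(testcase, metrics, run_agent, score_metric):
    async with testcase_slots:                        # bounds load on the agent endpoint
        transcript = await run_agent(testcase)
        metric_slots = asyncio.Semaphore(CONCURRENT_METRICS_PER_CASE)

        async def scored(metric):
            async with metric_slots:                  # bounds concurrent LLM judge calls
                return await score_metric(metric, transcript)

        return await asyncio.gather(*(scored(m) for m in metrics))
```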

Consistency Testing

Conversational agents should produce reliable, repeatable responses — not just correct ones. This makes pass^k (all k runs must succeed) more important than pass@k (at least one of k runs must succeed). When running an experiment, the number of times each testcase is executed can be specified to measure consistency across repeated runs.
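To make the distinction concrete, the small sketch below (illustrative, not part of Quraite) computes both statistics from the pass/fail results of repeated runs of the same testcase.

```python
# Given k repeated runs of one testcase (True = passed):
# pass^k requires every run to succeed; pass@k requires at least one.
def pass_all_k(results: list[bool]) -> bool:
    return all(results)    # pass^k: consistent success across all k runs

def pass_at_k(results: list[bool]) -> bool:
    return any(results)    # pass@k: at least one of the k runs succeeds

runs = [True, True, False, True]    # 4 repeated runs of the same testcase
print(pass_at_k(runs))              # True: at least one run passed
print(pass_all_k(runs))             # False: one run failed, so the agent is inconsistent
```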