Overview

An Experiment is a single execution of a dataset against a specific agent. It binds together an agent, a dataset (or subset of it), and a set of metrics to produce a structured evaluation run. Each experiment run produces:
  • Trace-level metrics — latency, cost, and token usage
  • Turn-by-turn conversation logs — including all tool calls
  • Per-testcase pass/fail signals — with reasoning
  • Metric evaluations
An experiment answers a simple question:
“How does this agent perform on this dataset, with this model or prompt and these metrics?”
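As a rough illustration, an experiment can be thought of as a value binding those three inputs together. The sketch below is hypothetical and not the Quraite API; the Experiment dataclass and its field names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical shapes, not the Quraite API: an experiment binds an agent,
# a dataset (or a subset of it), and a set of metrics into one evaluation run.
@dataclass
class Experiment:
    agent: str                                          # agent endpoint to evaluate
    dataset: str                                        # dataset (or subset) of testcases
    metrics: list[str] = field(default_factory=list)    # metrics scored per testcase

experiment = Experiment(
    agent="support-bot-v2",
    dataset="billing-questions",
    metrics=["answer_correctness", "tool_call_accuracy"],
)
# Running it yields trace-level metrics, turn-by-turn logs,
# per-testcase pass/fail signals, and metric evaluations.
```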

How Experiments Are Executed

When an experiment runs, Quraite loads the dataset and selected agent, then invokes the agent against every testcase. How each testcase is executed depends on its type:
  • Script-based testcases replay the predefined turns in sequence
  • Scenario-based testcases dynamically generate user messages and continue turns until the scenario concludes
In both cases, the agent endpoint is called once per turn.
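A minimal sketch of this per-turn loop is below. The function names (call_agent, generate_user_message, scenario_concluded) are placeholders standing in for Quraite's internal execution engine, not real APIs.

```python
# Hypothetical sketch of per-turn execution; all callables are placeholders.
def run_testcase(testcase, call_agent, generate_user_message, scenario_concluded):
    transcript = []
    if testcase["type"] == "script":
        # Script-based: replay the predefined user turns in sequence.
        for user_message in testcase["turns"]:
            reply = call_agent(user_message, transcript)     # one endpoint call per turn
            transcript.append({"user": user_message, "agent": reply})
    else:
        # Scenario-based: generate user messages until the scenario concludes.
        while not scenario_concluded(testcase, transcript):
            user_message = generate_user_message(testcase, transcript)
            reply = call_agent(user_message, transcript)     # one endpoint call per turn
            transcript.append({"user": user_message, "agent": reply})
    return transcript
```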

Configurability

The following can be configured when running an experiment:
  • Concurrent testcases — how many testcases run in parallel
  • Concurrent metrics per testcase — how many metrics are evaluated simultaneously
These settings control the load placed on the agent and help manage rate limits for LLM calls.
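The effect of the two limits can be sketched with ordinary asyncio semaphores. The constants and callables below are illustrative assumptions, not Quraite configuration keys.

```python
import asyncio

# Illustrative only: two semaphores model the two configurable limits.
CONCURRENT_TESTCASES = 4           # testcases evaluated in parallel
CONCURRENT_METRICS_PER_CASE = 2    # metrics evaluated simultaneously per testcase

testcase_slots = asyncio.Semaphore(CONCURRENT_TESTCASES)

async def evaluate_testcase(testcase, metrics, run_agent, score_metric):
    async with testcase_slots:                        # bounds load on the agent endpoint
        transcript = await run_agent(testcase)
        metric_slots = asyncio.Semaphore(CONCURRENT_METRICS_PER_CASE)

        async def scored(metric):
            async with metric_slots:                  # bounds concurrent LLM judge calls
                return await score_metric(metric, transcript)

        return await asyncio.gather(*(scored(m) for m in metrics))
```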

Consistency Testing

Conversational agents should produce reliable, repeatable responses — not just correct ones. This makes pass^k (all k runs must succeed) more important than pass@k (at least one of k runs must succeed). When running an experiment, the number of times each testcase is executed can be specified to measure consistency across repeated runs.
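To make the distinction concrete, the small sketch below (illustrative, not part of Quraite) computes both statistics from the pass/fail results of repeated runs of the same testcase.

```python
# Given k repeated runs of one testcase (True = passed):
# pass^k requires every run to succeed; pass@k requires at least one.
def pass_all_k(results: list[bool]) -> bool:
    return all(results)    # pass^k: consistent success across all k runs

def pass_at_k(results: list[bool]) -> bool:
    return any(results)    # pass@k: at least one of the k runs succeeds

runs = [True, True, False, True]    # 4 repeated runs of the same testcase
print(pass_at_k(runs))              # True: at least one run passed
print(pass_all_k(runs))             # False: one run failed, so the agent is inconsistent
```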