Create an evaluation dataset and evaluators (preview)

Important

Agent Optimizer is currently in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

The agent optimizer evaluates your agent against a dataset, which is a collection of tasks with evaluation criteria. Evaluators score the agent's response to each task against those criteria. You can generate both the dataset and the evaluators automatically from the CLI, or create a dataset manually for full control.

Prerequisites

A Foundry project with a deployed hosted agent
The azure.ai.agents CLI extension installed (see Quickstart: Optimize a hosted agent)

Generate a dataset and evaluators (recommended)

The fastest way to create evaluation assets is with azd ai agent eval generate. The command auto-detects your agent, generates a dataset and adaptive evaluators tuned to your agent's domain, and writes a runnable eval.yaml:

azd ai agent eval generate

For the interactive wizard, non-interactive flags, generated artifacts, and the full evaluation workflow, see Initialize evaluation assets.

After eval generate completes, azd ai agent optimize auto-detects the generated eval.yaml:

azd ai agent optimize

Or pass it explicitly:

azd ai agent optimize --config eval.yaml

Create a custom dataset manually (advanced)

For full control over evaluation tasks and criteria, create a JSONL dataset by hand. This is useful when you need precise control over test scenarios or have production data to use directly.

By default, azd ai agent optimize uses a built-in dataset with 3 general coding tasks and 25 criteria. For meaningful optimization of your specific agent, create a custom dataset that reflects your agent's real-world use cases.

Dataset format

Datasets use JSONL (JSON Lines) format. Each line is one JSON object that represents a single evaluation task. A task is an individual scenario in the dataset. It contains a prompt and evaluation criteria.

{"name": "task_1", "query": "Your prompt here", "criteria": [{"name": "criterion_name", "instruction": "What the evaluator checks for"}]}
{"name": "task_2", "query": "Another prompt", "criteria": [{"name": "check_1", "instruction": "..."}, {"name": "check_2", "instruction": "..."}]}

Field reference

Field	Required	Description
`name`	Yes	Unique task identifier (for example, `"greeting"`, `"math_test"`)
`query`	Yes	The message sent to the agent
`criteria`	Yes	Array of evaluation criteria — rules that define what "good" looks like for the task
`criteria[].name`	Yes	Short name for the criterion (for example, `"is_polite"`)
`criteria[].instruction`	Yes	What the evaluator checks. Be specific and testable. The built-in evaluator (`builtin.task_adherence`) scores each criterion independently as a binary value (0 or 1).
`ground_truth`	No	Expected answer (used by some evaluators for reference)
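
The field reference above can be checked mechanically before a run. The following is a minimal sketch and not part of the azd tooling: check_tasks and its message wording are hypothetical, and it assumes that task names must be unique strings and that every line must hold one JSON object, as the table and format description state.

```python
import json

# Required fields, taken from the field reference table.
REQUIRED_TASK_FIELDS = ("name", "query", "criteria")
REQUIRED_CRITERION_FIELDS = ("name", "instruction")


def check_tasks(lines):
    """Return a list of human-readable problems found in JSONL lines.

    Hypothetical helper for illustration; not part of the azd CLI.
    """
    problems = []
    seen_names = set()
    for number, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        for field in REQUIRED_TASK_FIELDS:
            if field not in task:
                problems.append(f"line {number}: missing '{field}'")
        # Task names are described as unique identifiers.
        name = task.get("name")
        if isinstance(name, str):
            if name in seen_names:
                problems.append(f"line {number}: duplicate task name '{name}'")
            seen_names.add(name)
        criteria = task.get("criteria", [])
        if not isinstance(criteria, list):
            problems.append(f"line {number}: 'criteria' must be an array")
            continue
        for index, criterion in enumerate(criteria):
            if not isinstance(criterion, dict):
                problems.append(f"line {number}: criteria[{index}] must be an object")
                continue
            for field in REQUIRED_CRITERION_FIELDS:
                if field not in criterion:
                    problems.append(f"line {number}: criteria[{index}] missing '{field}'")
    return problems
```

To check a file, pass an open file object, for example check_tasks(open("my_eval_dataset.jsonl", encoding="utf-8")). An empty list means the sketch found no problems; it doesn't guarantee that the optimizer accepts the file.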

Example: Customer support agent

{"name": "refund_policy", "query": "What is your refund policy?", "criteria": [{"name": "mentions_30_days", "instruction": "Response must mention the 30-day refund window"}, {"name": "polite_tone", "instruction": "Response must be professional and empathetic"}]}
{"name": "order_status", "query": "Where is my order #12345?", "criteria": [{"name": "asks_for_details", "instruction": "Agent should ask for email or order details to look up the order"}, {"name": "no_hallucination", "instruction": "Agent must NOT make up a fake order status"}]}
{"name": "out_of_scope", "query": "Can you help me fix my car?", "criteria": [{"name": "polite_decline", "instruction": "Agent should politely explain this is outside its scope"}, {"name": "redirect", "instruction": "Agent should suggest contacting an appropriate service"}]}

Example: Coding assistant

{"name": "python_function", "query": "Write a Python function to reverse a linked list", "criteria": [{"name": "correct_algorithm", "instruction": "The function must correctly reverse a singly linked list"}, {"name": "handles_empty", "instruction": "The function must handle an empty list without errors"}, {"name": "includes_docstring", "instruction": "The function should include a descriptive docstring"}]}
{"name": "explain_concept", "query": "Explain what a closure is in JavaScript", "criteria": [{"name": "accurate_definition", "instruction": "Must correctly define a closure as a function that captures variables from its enclosing scope"}, {"name": "includes_example", "instruction": "Must include at least one working code example"}]}
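
Long JSONL lines are easy to break when edited by hand, for example with an unescaped quote or a trailing comma. One option is to keep tasks as Python dictionaries and serialize them with the standard json module, which emits one valid JSON object per line. A minimal sketch, where to_jsonl and write_dataset are hypothetical helper names and the task is copied from the coding assistant example:

```python
import json

# Task copied from the coding assistant example above.
TASKS = [
    {
        "name": "explain_concept",
        "query": "Explain what a closure is in JavaScript",
        "criteria": [
            {
                "name": "accurate_definition",
                "instruction": "Must correctly define a closure as a function "
                               "that captures variables from its enclosing scope",
            },
            {
                "name": "includes_example",
                "instruction": "Must include at least one working code example",
            },
        ],
    },
]


def to_jsonl(tasks):
    """Serialize task dicts to JSONL text: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each task stays on one line.
    return "".join(json.dumps(task) + "\n" for task in tasks)


def write_dataset(path, tasks):
    """Write the tasks to a JSONL file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_jsonl(tasks))
```

Calling write_dataset("my_eval_dataset.jsonl", TASKS) produces a file in the format described under Dataset format.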

Use a custom dataset

Reference your dataset in a YAML config file:

# eval.yaml
agent:
  name: my-agent

dataset_file: ./my_eval_dataset.jsonl

evaluators:
  - builtin.task_adherence

options:
  eval_model: gpt-4.1-mini
  optimization_model: gpt-5.1
  max_iterations: 5

Before you run the optimizer, validate the JSONL syntax. The following command exits with an error if any line isn't valid JSON:

python -c "import json; [json.loads(l) for l in open('my_eval_dataset.jsonl', encoding='utf-8')]"

Then run:

azd ai agent optimize --config eval.yaml

Tips for writing good datasets

Be specific in criteria

Bad:

{"name": "good_answer", "instruction": "The response should be good"}

Good:

{"name": "mentions_30_days", "instruction": "Response must explicitly mention the 30-day refund window"}

Specific criteria give the evaluator a clear, binary signal. Vague criteria lead to inconsistent scoring.

Include edge cases

Test beyond the happy path. Include:

Out-of-scope requests — Inputs your agent should decline or redirect
Ambiguous queries — Tasks where the agent should ask for clarification
Adversarial inputs — Attempts to trick the agent into bad behavior
Multi-step tasks — Complex requests that require structured reasoning

Size guidelines

Dataset size	Trade-off
3–5 tasks	Quick iteration, limited signal
5–10 tasks	Good balance of speed and coverage
10–20 tasks	Comprehensive evaluation, longer runs
20+ tasks	Thorough but slow — consider for final validation

Each task can have multiple criteria, and each criterion produces one evaluation signal. For example, a dataset with 5 tasks of 4 criteria each produces 20 evaluation signals.

Write prompts like real users

Use actual messages from your users if possible. Real prompts capture the vocabulary and context that your agent faces in production.

Criteria are scored independently

Each criterion gets a binary score (0 or 1). The task score is the average of its criteria scores. The overall score is the average across all tasks. This means:

A task with 4 criteria where 3 pass scores 0.75
An agent that passes all criteria on 2 of 3 tasks and no criteria on the third scores 0.67 overall
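
The arithmetic above can be reproduced in a few lines. This is a sketch of the averaging rule as described here, not the optimizer's actual implementation, and the function names are hypothetical. The 0.67 case assumes that the third task passes none of its criteria.

```python
def task_score(criterion_results):
    """Average of the binary criterion results (each 0 or 1) for one task."""
    return sum(criterion_results) / len(criterion_results)


def overall_score(results_by_task):
    """Average of the per-task scores across all tasks."""
    scores = [task_score(results) for results in results_by_task.values()]
    return sum(scores) / len(scores)


# One task with 4 criteria, 3 of which pass.
print(task_score([1, 1, 1, 0]))  # 0.75

# Three tasks: two pass every criterion, one passes none.
runs = {"task_1": [1, 1], "task_2": [1, 1, 1], "task_3": [0, 0]}
print(round(overall_score(runs), 2))  # 0.67
```

Because tasks are averaged after their criteria, a task with two criteria carries the same weight in the overall score as a task with ten.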

Ground truth is optional

The ground_truth field provides a reference answer for evaluators that support it. This field isn't required. The builtin.task_adherence evaluator works entirely from criteria instructions.

{"name": "geography_fact", "query": "What is the largest city in France by population?", "ground_truth": "Paris", "criteria": [{"name": "correct_answer", "instruction": "Response must state that Paris is the largest city in France by population"}]}

Troubleshooting

Problem	Cause	Fix
`dataset_file not found`	Wrong path in `eval.yaml`	Use a path relative to the config file location
`invalid JSON on line N`	Malformed JSONL	Validate that each line is valid JSON. Check for trailing commas.
Scores are inconsistent between runs	Vague criteria	Make criteria specific and binary-testable
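
For the dataset_file not found case, it can help to print the path that a relative dataset_file value resolves to. The sketch below assumes, as the table suggests, that relative paths are interpreted against the directory that contains eval.yaml. resolve_dataset_path is a hypothetical helper, not a CLI function.

```python
from pathlib import Path


def resolve_dataset_path(config_path, dataset_file):
    """Resolve dataset_file against the directory that holds the config file.

    Assumption: relative paths are interpreted relative to eval.yaml.
    An absolute dataset_file is returned unchanged (after normalization).
    """
    config_dir = Path(config_path).resolve().parent
    return (config_dir / dataset_file).resolve()


if __name__ == "__main__":
    path = resolve_dataset_path("eval.yaml", "./my_eval_dataset.jsonl")
    print(path, "exists" if path.exists() else "NOT FOUND")
```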

Last updated on 2026-06-15