Note
Hosted agents and the Azure Developer CLI evaluation experience are currently in preview.
In this quickstart, you evaluate the hosted agent you deployed in Deploy your first hosted agent. You provide a test dataset, choose evaluators, run an evaluation against the deployed agent, and review the scores. Each step shows two ways to do the same task: the Azure Developer CLI (azd) and the Microsoft Foundry portal.
Evaluation establishes a quality baseline for your agent and lets you set acceptance thresholds, such as a task adherence passing rate, before you release changes to users.
Prerequisites
Before you begin, you need:
- A deployed, invokable hosted agent from Deploy your first hosted agent, and the azd project directory you created in that quickstart.
- The Foundry User role on the Foundry resource.
- To use the UI path, access to the Foundry portal. For the azd path, see the next requirements.
- The azd ai agent extension (azure.ai.agents), version 0.1.40-preview or later, which provides the azd ai agent eval commands. This extension is included in the microsoft.foundry extension you installed in the previous quickstart. Verify the installed version with azd ext list. If you need to install or upgrade it, run azd ext install microsoft.foundry or azd ext upgrade microsoft.foundry.
- An authenticated azd session. Check your status with azd auth status, and run azd auth login if you're not signed in.
- A chat-completion model deployment in the same Foundry project to use as the judge model that scores responses. You can reuse the model deployment your agent already uses, including the one from the previous quickstart, so you don't need a separate deployment.
Important
The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.
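The minimum extension version in the prerequisites is compared numerically, which is easy to misread by eye (0.1.9-preview is older than 0.1.40-preview). As a minimal sketch, assuming extension versions follow a MAJOR.MINOR.PATCH pattern with an optional suffix such as -preview, the following Python helper checks a version string that you copy from azd ext list. The names version_key and meets_minimum are hypothetical and aren't part of azd, and the sketch doesn't parse the azd ext list output itself.

```python
import re

# Minimum version stated in the prerequisites.
MINIMUM = "0.1.40-preview"


def version_key(version: str) -> tuple[int, int, int]:
    """Return (major, minor, patch) as integers, ignoring any suffix.

    Assumption: versions look like MAJOR.MINOR.PATCH[-suffix].
    """
    match = re.match(r"^\s*v?(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """True if the installed version is the minimum version or later."""
    return version_key(installed) >= version_key(minimum)


if __name__ == "__main__":
    # Replace with the version that azd ext list reports for the extension.
    print(meets_minimum("0.1.41-preview"))
```

Comparing integer tuples instead of raw strings avoids the lexical trap where "0.1.9" sorts after "0.1.40".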
Step 1: Confirm your deployed agent
Evaluation runs against a deployed, invokable agent. Confirm your agent responds before you set up the evaluation.
From your azd project directory, verify the agent is deployed and invokable:
azd ai agent show
Send a test prompt:
azd ai agent invoke "Write a haiku about deploying cloud applications."
You should see a response within a few seconds.
Step 2: Generate the evaluation suite
Provide a test dataset and choose the built-in evaluators that define what to measure.
First, create a JSONL file of test queries for your agent. Each line is a JSON object with a query field. Save it as tests/queries.jsonl in your agent project folder:
{"query": "What's the weather in Seattle?"}
{"query": "Book a flight to Paris"}
{"query": "Tell me a joke"}
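If you build the dataset from a list of prompts rather than by hand, a small script can write the file and catch malformed lines before you generate the suite. The following is a minimal sketch in Python that relies only on the format described above (one JSON object per line, with a query field). The helper names are hypothetical, and the checks might be stricter than the service requires; for example, treating blank lines as problems is an assumption.

```python
import json
from pathlib import Path


def write_queries(path, queries):
    """Write one {"query": ...} JSON object per line; return the line count."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for query in queries:
            handle.write(json.dumps({"query": query}) + "\n")
    return len(queries)


def validate_queries(path):
    """Return a list of problems; an empty list means every line looks valid."""
    problems = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are reported rather than skipped.
            problems.append(f"line {number}: blank line")
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        query = row.get("query") if isinstance(row, dict) else None
        if not isinstance(query, str) or not query.strip():
            problems.append(f"line {number}: missing or empty query string")
    return problems


if __name__ == "__main__":
    target = "tests/queries.jsonl"
    write_queries(target, [
        "What's the weather in Seattle?",
        "Book a flight to Paris",
        "Tell me a joke",
    ])
    for problem in validate_queries(target):
        print(problem)
```

Run the script from your agent project folder so that the relative path resolves to tests/queries.jsonl.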
Generate the evaluation suite from your dataset and a set of built-in evaluators:
azd ai agent eval generate \
--dataset ./tests/queries.jsonl \
--evaluator builtin.intent_resolution \
--evaluator builtin.task_adherence \
--eval-model <your-chat-completion-deployment>
Replace <your-chat-completion-deployment> with a chat-completion deployment in your project; you can reuse the one your agent already uses. The command creates eval.yaml in the agent project root and registers the dataset and evaluators in your project. The --eval-model value is the judge model that scores responses.
Open eval.yaml to review the agent target, dataset reference, and evaluators that the run uses. It looks similar to this:
name: <eval-suite-name>
agent:
  name: <your-agent-name>
  kind: hosted
dataset:
  local_uri: tests/queries.jsonl
evaluators:
  - builtin.intent_resolution
  - builtin.task_adherence
options:
  eval_model: <your-chat-completion-deployment>
  max_samples: 15
The suite name and some values are generated. Your file might also include a generated evaluator in addition to the built-in ones you selected.
Step 3: Run the evaluation
Run the suite against your deployed agent. The service sends each test query to the agent, captures the response, and scores it with your selected evaluators.
Note
Target-based evaluation invokes your hosted agent directly. It works with agents that use the responses or invocations protocol with synchronous, non-streaming execution. To evaluate agents that use the A2A or Activity protocol, or other execution patterns such as long-running or streaming, evaluate the traces your agent emits instead. See Trace evaluation.
From the agent project folder, run the evaluation:
azd ai agent eval run
The command runs eval.yaml from the agent project root, sends each query to your agent, scores the responses, and prints a summary when it finishes:
Eval run started
Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27
Run: evalrun_5f72ef189ad24790a32128e6f230b131
(✓) Done Eval run (4m 9s)
Results: 8 total, 8 passed, 0 failed, 0 errored
Per-criteria results:
intent_resolution: 8 passed, 0 failed, 0 errored
task_adherence: 8 passed, 0 failed, 0 errored
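To apply an acceptance threshold, such as a minimum task adherence passing rate, you can compute passing rates from the per-criteria lines of this summary. The following Python sketch assumes the summary keeps the text format shown in the sample output, which might change while the feature is in preview. The function names are hypothetical, and counting errored rows as not passed is a design choice of this sketch, not documented service behavior.

```python
import re

# Matches per-criteria lines such as "task_adherence: 8 passed, 0 failed, 0 errored".
# Assumption: the summary keeps the format shown in the sample output.
CRITERION_LINE = re.compile(
    r"^\s*(\w+):\s+(\d+) passed, (\d+) failed, (\d+) errored\s*$"
)


def pass_rates(summary: str) -> dict[str, float]:
    """Map each criterion name to passed / (passed + failed + errored)."""
    rates = {}
    for line in summary.splitlines():
        match = CRITERION_LINE.match(line)
        if not match:
            continue  # Skips headers and the overall "Results:" line.
        name = match.group(1)
        passed, failed, errored = (int(value) for value in match.groups()[1:])
        total = passed + failed + errored
        rates[name] = passed / total if total else 0.0
    return rates


def below_threshold(rates: dict[str, float], threshold: float) -> list[str]:
    """Return the criteria whose passing rate is under the threshold."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


if __name__ == "__main__":
    sample = """\
Results: 8 total, 8 passed, 0 failed, 0 errored
Per-criteria results:
intent_resolution: 8 passed, 0 failed, 0 errored
task_adherence: 8 passed, 0 failed, 0 errored
"""
    rates = pass_rates(sample)
    failing = below_threshold(rates, threshold=0.9)
    print(rates, failing)
```

A script like this can gate a release step by exiting with a nonzero status when below_threshold returns any criteria.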
Step 4: Review the results
Evaluations typically complete in a few minutes, depending on the number of queries.
List recent evaluations:
azd ai agent eval list
  Eval ID                                 Name        Status of last run  Runs
  -------                                 ----        ------------------  ----
* eval_b36748dede424e4ba3f8e6c99ca2cf27   agent-core  Completed           1
* = active eval in current environment
Show the most recent evaluation and its runs:
azd ai agent eval show
Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27
Name: agent-core
Agent: <your-agent-name>
Runs: 1
Recent runs:
Run ID                                     Status     Passed  Failed  Created
------                                     ------     ------  ------  -------
evalrun_5f72ef189ad24790a32128e6f230b131   Completed  8/8     0       2026-06-17 14:52 UTC
Use the results to confirm which agent version was evaluated and which evaluator scores were produced. To see per-evaluator details and a link to the report in the Foundry portal, run azd ai agent eval show <eval-id> --eval-run-id <run-id>.
Clean up resources
The evaluation registers a dataset, evaluators, and run history in your Foundry project. These assets incur little or no ongoing cost. To remove everything you created across this and the previous quickstart, run azd down from your agent project directory.
Warning
azd down permanently deletes every resource in the resource group, including the Foundry project, model deployments, and the hosted agent.
Troubleshooting
| Issue | Solution |
|---|---|
| azd ai agent eval command not found or fails | Run azd ext list and verify the azd ai agent extension is 0.1.40-preview or later. Upgrade with azd ext upgrade microsoft.foundry. |
| Evaluation target not found or agent not invokable | Confirm the agent is deployed and invokable with azd ai agent show. Redeploy with azd deploy if needed. |
| Many errored rows or unexpectedly low scores | Open the run report and check whether rows failed with agent response or evaluator errors. Fix the underlying errors, then rerun the evaluation. |
| AuthenticationError or DefaultAzureCredential failure | Refresh credentials with azd auth logout and then azd auth login. |
| Eval model deployment not found | Verify the chat-completion deployment name exists in your project under Build > Deployments. |
What you learned
In this quickstart, you:
- Created a test dataset and chose evaluators for your hosted agent.
- Ran an evaluation against the deployed agent.
- Reviewed aggregated and row-level results.
- Completed each task with both the Azure Developer CLI and the Foundry portal.