Owner: Agent Builder (with eval support from CoE).
Input: Build complete in dev.
Sub-steps:
- Define / load the golden dataset for this workflow (framework.md§27). For a new agent, build the golden set during this step from real historical examples (anonymized where needed).
- Run the agent against the golden set. Capture:
  - Accuracy / correctness (decision quality vs. human benchmark).
  - Coverage (does it handle the long tail?).
  - Latency.
  - Cost per execution.
- Run adversarial / red-team scenarios designed in Step 6.
- Run prompt-injection probes, especially for any agent that processes external content (emails, documents, RAG-retrieved data).
- Run bias / fairness probes if the agent makes decisions about people.
- Compare results against the eval criteria in the Agent Card (§14 item 12).
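The golden-set run in the sub-steps above could be sketched roughly as follows. This is a minimal illustration, not the CoE's actual harness: `GoldenCase`, `AgentResult`, and `evaluate` are hypothetical names, the agent is assumed to be a plain callable that reports its own cost, exact string match stands in for "decision quality vs. human benchmark", and a `None` decision is assumed to mean the agent could not handle the case.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set example (hypothetical schema)."""
    case_id: str
    input_text: str
    expected: str  # decision made by the human benchmark


@dataclass(frozen=True)
class AgentResult:
    """What the agent returns for one execution (hypothetical schema)."""
    decision: Optional[str]  # None = agent declined or could not handle the case
    cost_usd: float          # cost reported for this single execution


def evaluate(
    agent: Callable[[str], AgentResult],
    cases: List[GoldenCase],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run the agent over the golden set and collect the four metrics."""
    if not cases:
        raise ValueError("golden set is empty")

    correct = 0
    handled = 0
    total_cost = 0.0
    latencies: List[float] = []

    for case in cases:
        start = clock()
        result = agent(case.input_text)
        latencies.append(clock() - start)
        total_cost += result.cost_usd

        if result.decision is not None:
            handled += 1
            # Assumption: exact match against the human benchmark decision.
            if result.decision == case.expected:
                correct += 1

    n = len(cases)
    latencies.sort()
    # Nearest-rank 95th percentile of per-case wall-clock latency.
    p95 = latencies[math.ceil(0.95 * n) - 1]

    return {
        # Unhandled cases count as incorrect, so accuracy is over all cases.
        "accuracy": correct / n,
        "coverage": handled / n,
        "latency_p95_s": p95,
        "cost_per_execution_usd": total_cost / n,
    }
```

Counting unhandled cases against accuracy is a design choice made for this sketch; a team could instead report accuracy over handled cases only, as long as coverage is reported alongside it. The injectable `clock` exists only so latency can be tested deterministically.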
Output / gate criteria: A signed evaluation report attached to the registry. Pass on accuracy, latency, cost, and red-team scenarios.
Decision branches:
- Below threshold → fix the agent / revise the spec / reduce scope. Don't promote.
- Pass → go to Step 11.
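As a rough illustration of the gate and its two branches, the check below compares a metrics report against thresholds and returns a promote / don't-promote decision. Everything here is an assumption for the sake of the sketch: the `Thresholds` fields, the metric key names, the numeric values in the usage note, and the rule that any single red-team failure blocks promotion. The real thresholds are whatever the Agent Card specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Thresholds:
    """Gate criteria (hypothetical fields; real values come from the Agent Card)."""
    min_accuracy: float
    max_latency_p95_s: float
    max_cost_per_execution_usd: float


def gate(
    report: Dict[str, float],
    thresholds: Thresholds,
    red_team_failures: int,
) -> Tuple[str, List[str]]:
    """Return the decision branch and the list of criteria that failed."""
    failed: List[str] = []

    if report["accuracy"] < thresholds.min_accuracy:
        failed.append("accuracy")
    if report["latency_p95_s"] > thresholds.max_latency_p95_s:
        failed.append("latency")
    if report["cost_per_execution_usd"] > thresholds.max_cost_per_execution_usd:
        failed.append("cost")
    # Assumption: a single failed red-team scenario is enough to block.
    if red_team_failures > 0:
        failed.append("red_team")

    if failed:
        # Below threshold: fix the agent, revise the spec, or reduce scope.
        return ("do_not_promote", failed)
    return ("promote_to_step_11", [])
```

For example, `gate(report, Thresholds(0.90, 2.0, 0.05), red_team_failures=0)` would return the promote branch only if all three metric checks pass; the returned list of failed criteria is meant to feed the evaluation report so the fix / revise / reduce-scope choice is made against named failures.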
Skip-this-step risk: Quality issues found in production instead of pre-production. Customers / employees become the test set.