Part B · Step 10

Evaluation

Owner: Agent Builder (with eval support from CoE).
Input: Build complete in dev.

Sub-steps:

  1. Define / load the golden dataset for this workflow (framework.md §27). For a new agent, build the golden set during this step from real historical examples (anonymized where needed).
  2. Run the agent against the golden set. Capture:
    • Accuracy / correctness (decision quality vs. human benchmark).
    • Coverage (does it handle the long tail?).
    • Latency.
    • Cost per execution.
  3. Run adversarial / red-team scenarios designed in Step 6.
  4. Run prompt-injection probes, especially for any agent that processes external content (emails, documents, RAG-retrieved data).
  5. Run bias / fairness probes if the agent makes decisions about people.
  6. Compare results against the eval criteria in the Agent Card (§14 item 12).
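Sub-step 2 can be sketched as a small harness. This is only an illustration under stated assumptions: `GoldenCase`, `AgentResult`, `evaluate`, and `stub_agent` are hypothetical names, not part of the framework; the agent is assumed to be a callable that returns a decision (or `None` when it cannot handle the case) plus a cost; and accuracy is computed over all cases, so an unhandled case counts as incorrect. The real metric definitions should come from the Agent Card.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    expected: str  # human-benchmark decision for this case


@dataclass
class AgentResult:
    decision: Optional[str]  # None = agent declined or could not handle the case
    cost_usd: float          # assumed to be reported by the agent runtime


def evaluate(agent: Callable[[str], AgentResult],
             golden_set: list[GoldenCase]) -> dict[str, float]:
    """Run the agent over the golden set and capture the four metrics."""
    if not golden_set:
        raise ValueError("golden set is empty")
    handled = 0
    correct = 0
    total_cost = 0.0
    latencies: list[float] = []
    for case in golden_set:
        start = time.perf_counter()
        result = agent(case.input_text)
        latencies.append(time.perf_counter() - start)
        total_cost += result.cost_usd
        if result.decision is not None:
            handled += 1
            if result.decision == case.expected:
                correct += 1
    n = len(golden_set)
    latencies.sort()
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    return {
        "accuracy": correct / n,
        "coverage": handled / n,
        "latency_p95_s": latencies[p95_index],
        "cost_per_execution_usd": total_cost / n,
    }


def stub_agent(text: str) -> AgentResult:
    # Hypothetical stand-in for the real agent call.
    if "refund" in text:
        return AgentResult("approve", 0.002)
    if "chargeback" in text:
        return AgentResult(None, 0.001)
    return AgentResult("reject", 0.002)


# Illustrative golden set; real cases come from anonymized historical examples.
golden = [
    GoldenCase("g1", "refund request under policy", "approve"),
    GoldenCase("g2", "refund request over limit", "reject"),
    GoldenCase("g3", "chargeback dispute", "escalate"),
    GoldenCase("g4", "duplicate invoice", "reject"),
]

metrics = evaluate(stub_agent, golden)
print(metrics)
```

Reporting coverage separately from accuracy makes it visible when an agent scores well only because it declines the hard, long-tail cases.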

Output / gate criteria: A signed evaluation report attached to the registry. To clear the gate, the agent must pass on accuracy, latency, cost, and red-team scenarios.

Decision branches:

  • Below threshold → fix the agent / revise the spec / reduce scope. Don't promote.
  • Pass → go to Step 11.
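The gate and its two branches could be expressed roughly as follows. This is a minimal sketch, not the framework's own tooling: `Criterion` and `gate` are hypothetical names, and the thresholds and metric values are made-up placeholders; the real criteria are the ones recorded in the Agent Card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool


def gate(metrics: dict[str, float],
         criteria: list[Criterion]) -> tuple[str, list[str]]:
    """Compare measured metrics to thresholds; return a decision and the failures."""
    failures: list[str] = []
    for c in criteria:
        if c.metric not in metrics:
            # A criterion that was never measured cannot count as a pass.
            failures.append(f"{c.metric}: not measured")
            continue
        value = metrics[c.metric]
        passed = value >= c.threshold if c.higher_is_better else value <= c.threshold
        if not passed:
            failures.append(f"{c.metric}: {value} vs threshold {c.threshold}")
    decision = "do_not_promote" if failures else "promote_to_step_11"
    return decision, failures


# Placeholder thresholds for illustration only.
criteria = [
    Criterion("accuracy", 0.90, higher_is_better=True),
    Criterion("latency_p95_s", 2.0, higher_is_better=False),
    Criterion("cost_per_execution_usd", 0.01, higher_is_better=False),
    Criterion("red_team_pass_rate", 1.0, higher_is_better=True),
]

# Placeholder results: everything passes except one red-team scenario.
measured = {
    "accuracy": 0.93,
    "latency_p95_s": 1.4,
    "cost_per_execution_usd": 0.004,
    "red_team_pass_rate": 0.95,
}

decision, failures = gate(measured, criteria)
print(decision, failures)
```

Returning the list of failed criteria alongside the decision gives the Agent Builder a starting point for choosing among fixing the agent, revising the spec, or reducing scope.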

Skip-this-step risk: Quality issues surface in production instead of pre-production, and customers / employees become the test set.