Rulix AI — AI Governance Assessment Platform

Owner: Agent Builder (with eval support from CoE).

Input: Build complete in dev.

Sub-steps:

Define / load the golden dataset for this workflow (framework.md §27). For a new agent, build the golden set during this step from real historical examples (anonymized where needed).
Run the agent against the golden set. Capture:
- Accuracy / correctness (decision quality vs. human benchmark).
- Coverage (whether it handles the long tail).
- Latency.
- Cost per execution.
Run adversarial / red-team scenarios designed in Step 6.
Run prompt-injection probes, especially for any agent that processes external content (emails, documents, RAG-retrieved data).
Run bias / fairness probes if the agent makes decisions about people.
Compare results against the eval criteria in the Agent Card (§14 item 12).
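
The golden-set run and the four captured metrics above can be sketched roughly as follows. This is a minimal illustration, not the framework's prescribed harness: `GoldenCase`, `AgentResult`, `evaluate`, the `long_tail` flag, and the choice of a nearest-rank p95 for latency are all assumptions made for the example. Coverage is interpreted here as the share of long-tail cases for which the agent returned any decision at all; the Agent Card may define it differently.

```python
import math
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    expected: str            # human-benchmark decision for this case
    long_tail: bool = False  # hypothetical flag marking rare / edge cases


@dataclass
class AgentResult:
    decision: Optional[str]  # None means the agent did not handle the case
    cost_usd: float          # cost of this single execution


def evaluate(agent: Callable[[str], AgentResult], golden: list[GoldenCase]) -> dict:
    """Run the agent over the golden set and return summary metrics."""
    if not golden:
        raise ValueError("golden set is empty")

    correct, tail_total, tail_handled = 0, 0, 0
    latencies, costs = [], []

    for case in golden:
        start = time.perf_counter()
        result = agent(case.input_text)
        latencies.append(time.perf_counter() - start)
        costs.append(result.cost_usd)

        if result.decision == case.expected:
            correct += 1
        if case.long_tail:
            tail_total += 1
            if result.decision is not None:
                tail_handled += 1

    # Nearest-rank 95th percentile of wall-clock latency.
    ordered = sorted(latencies)
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

    return {
        "accuracy": correct / len(golden),
        "long_tail_coverage": tail_handled / tail_total if tail_total else None,
        "latency_p95_s": p95,
        "mean_cost_usd": statistics.mean(costs),
    }


if __name__ == "__main__":
    # Stand-in agent for demonstration only; a real run would call the agent built in dev.
    def stub_agent(text: str) -> AgentResult:
        if text == "rare":
            return AgentResult(decision=None, cost_usd=0.01)
        return AgentResult(decision="approve", cost_usd=0.03)

    demo_set = [
        GoldenCase("c1", "routine request", "approve"),
        GoldenCase("c2", "rare", "escalate", long_tail=True),
    ]
    print(evaluate(stub_agent, demo_set))
```

The resulting dictionary is what would then be compared against the eval criteria recorded in the Agent Card.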

Output / gate criteria: A signed evaluation report attached to the registry, showing a pass on accuracy, latency, cost, and red-team scenarios.
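
One way to make the gate mechanical is sketched below. The `Thresholds` fields, the metric key names, and the rule that a red-team pass means zero failed scenarios are assumptions for illustration; the real thresholds come from the Agent Card. The SHA-256 digest is only a tamper-evidence fingerprint of the report contents and does not replace whatever sign-off process the registry requires.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical gate values; in practice these are read from the Agent Card.
    min_accuracy: float
    max_latency_p95_s: float
    max_mean_cost_usd: float


def gate(metrics: dict, red_team_failures: int, t: Thresholds) -> dict:
    """Check each gate criterion and build a report with a content digest."""
    checks = {
        "accuracy": metrics["accuracy"] >= t.min_accuracy,
        "latency": metrics["latency_p95_s"] <= t.max_latency_p95_s,
        "cost": metrics["mean_cost_usd"] <= t.max_mean_cost_usd,
        # Assumption: any failed red-team scenario fails the gate.
        "red_team": red_team_failures == 0,
    }
    passed = all(checks.values())
    report = {
        "metrics": metrics,
        "red_team_failures": red_team_failures,
        "checks": checks,
        "passed": passed,
        "next_action": (
            "go to Step 11"
            if passed
            else "fix the agent / revise the spec / reduce scope; do not promote"
        ),
    }
    # Digest is computed over the report before the digest field is added.
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    report["sha256"] = hashlib.sha256(payload).hexdigest()
    return report


if __name__ == "__main__":
    example_metrics = {"accuracy": 0.95, "latency_p95_s": 1.2, "mean_cost_usd": 0.02}
    example_thresholds = Thresholds(
        min_accuracy=0.90, max_latency_p95_s=2.0, max_mean_cost_usd=0.05
    )
    print(json.dumps(gate(example_metrics, 0, example_thresholds), indent=2))
```

Keeping the per-criterion results in `checks` lets the report show which criterion failed, which helps when deciding whether to fix the agent, revise the spec, or reduce scope.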

Decision branches:

Below threshold → fix the agent / revise the spec / reduce scope. Don't promote.
Pass → go to Step 11.

Skip-this-step risk: Quality issues found in production instead of pre-production. Customers / employees become the test set.