Purpose
The company-wide observability playbook for AI agents. Published once at A6 (standards documents) by the AI CoE. Built and maintained by the Platform Team. Every production agent inherits from this standard — no agent invents its own observability.
This is the document that answers, for every agent:
- What gets logged on every execution (per-execution fields)
- Where logs live (storage + retention by tier)
- What dashboards must exist (visible by who)
- What triggers an alert (thresholds + pager routing)
- How drift is detected (signal + tools + cadence)
- How adoption is measured (the metrics taxonomy)
- How decision audit trails capture business logic (not just telemetry)
Without this standard, every agent's observability is bespoke: cross-agent dashboards become impossible, auditors get inconsistent records, and EU AI Act Article 12 (event logging) compliance becomes per-agent improvisation.
- When you use it: Published once at program start (A6 in the In Action roadmap) and referenced from every Agent Card §11. Updated annually or when a material new signal is added.
- Who owns: Platform Team builds. AI CoE Lead approves. Security signs.
- Format: Living document in templates/17-observability-standard.md. Customized for the company's specific stack on adoption.
Worked example (ACME Corp Observability Standard v1.0)
ACME Corp — AI Agent Observability Standard v1.0
Effective: 2026-03-01
Owner: Platform Team (operational) · AI CoE Lead (policy)
Approval: CISO + CoE Lead + Platform Lead
Next review: 2027-03-01
§1. Required per-execution log fields
Every AI agent in production — Low, Medium, or High tier — emits the following fields on every execution. No exceptions.
| # | Field | Type | Notes |
|---|---|---|---|
| 1 | timestamp_start | ISO-8601 UTC | When invocation began |
| 2 | timestamp_end | ISO-8601 UTC | When invocation completed |
| 3 | agent_id | string | From Agent Card §1, e.g., finance-invoice-recon |
| 4 | agent_version | string | semver, e.g., 1.0.5 |
| 5 | correlation_id | UUID | Unique per execution; threads across all sub-calls |
| 6 | initiator | string | User ID OR system trigger ID — never anonymous |
| 7 | prompt | string | Sanitized (PII redacted per §4 below) |
| 8 | response | string | Sanitized |
| 9 | tool_calls | array | Each tool invocation: name, parameters (sanitized), result-status, latency, cost |
| 10 | model | string | LLM provider + model name + version, e.g., anthropic/claude-sonnet-4-6 |
| 11 | tokens_in | integer | |
| 12 | tokens_out | integer | |
| 13 | cost_usd | decimal | Computed from tokens × per-token cost |
| 14 | policy_checks | array | Each guardrail that fired: name, result (pass/block), reason |
| 15 | hitl_events | array | Each HITL gate: trigger, decision (approve/reject/override), human ID, timestamp |
| 16 | latency_per_step_ms | array | Per-step latency breakdown |
| 17 | outcome | enum | success / failure / human_overridden / exception_routed |
| 18 | error | object | If present: type, stack trace (sanitized), recoverable (yes/no) |
Implementation rule: these fields are emitted automatically by the orchestrator layer (n8n / LangGraph / etc.) and forwarded to the observability platform. Builders do NOT have to manually instrument them — the framework / approved stack from A4 provides them.
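As an illustration of what the orchestrator hook might emit, here is a minimal sketch of the 18-field record as a Python dataclass. The class name `ExecutionLog`, the `to_record` helper, and the per-token prices are hypothetical and not part of the standard; real prices come from the provider's rate card, and a real orchestrator (n8n, LangGraph) would populate the fields through its own hooks.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import Optional
import uuid

# Hypothetical per-token prices in USD (assumption, not from the standard).
PRICE_IN = Decimal("0.000003")
PRICE_OUT = Decimal("0.000015")
OUTCOMES = {"success", "failure", "human_overridden", "exception_routed"}


@dataclass
class ExecutionLog:
    timestamp_start: str   # ISO-8601 UTC
    timestamp_end: str     # ISO-8601 UTC
    agent_id: str
    agent_version: str
    initiator: str         # user ID or system trigger ID, never anonymous
    prompt: str            # assumed already sanitized per §4
    response: str          # assumed already sanitized per §4
    model: str
    tokens_in: int
    tokens_out: int
    outcome: str
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    tool_calls: list = field(default_factory=list)
    policy_checks: list = field(default_factory=list)
    hitl_events: list = field(default_factory=list)
    latency_per_step_ms: list = field(default_factory=list)
    error: Optional[dict] = None

    def __post_init__(self):
        # Field 6 must always name a user or a system trigger.
        if not self.initiator.strip():
            raise ValueError("initiator must not be empty or anonymous")
        # Field 17 is an enum with exactly four values.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")

    def to_record(self) -> dict:
        """Return all 18 fields, with cost_usd derived from token counts."""
        record = asdict(self)
        record["cost_usd"] = self.tokens_in * PRICE_IN + self.tokens_out * PRICE_OUT
        return record
```

Deriving `cost_usd` inside the record builder, rather than asking builders to supply it, keeps field 13 consistent with fields 11 and 12 across agents.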
§2. Dashboards by tier
Required for ALL tiers (Low / Medium / High)
| Dashboard | Update cadence | Audience |
|---|---|---|
| Cost per agent per day | Real-time | Department Champion + CoE Lead |
| Execution count + failure rate (24h, 7d, 30d) | Real-time | Champion |
| Latency p50 / p95 / p99 (24h, 7d) | Real-time | Champion + on-call |
| Tokens consumed (24h, 7d, 30d) | Real-time | Champion |
| HITL events count + acceptance rate (24h, 7d, 30d) | Real-time | Champion |
Additional for Medium tier
| Dashboard | Update cadence | Audience |
|---|---|---|
| Output distribution shift (PSI vs eval baseline) | Hourly | CoE Lead + Builder |
| Exception routing patterns by category | Real-time | Champion + Builder |
| Per-step cost breakdown | Real-time | Builder + Finance |
| Tool-call frequency + success rate | Real-time | Builder |
| Adoption: unique users (DAU / WAU / MAU) | Daily | Champion + CoE Lead |
Additional for High tier
| Dashboard | Update cadence | Audience |
|---|---|---|
| Decision audit trail with business-logic context | Real-time | CoE Lead + Compliance + Legal |
| Demographic outcome distribution (if decisions affect people) | Daily | Compliance + Legal |
| Drift detector (PSI + KL-divergence + alarm history) | Hourly | Builder + CoE Lead |
| Geographic data residency map | Daily | Compliance + Privacy Officer |
§3. Log retention by tier and jurisdiction
| Scope | Active hot retention | Cold archive | Total |
|---|---|---|---|
| Low tier | 30 days | 6 months | 7 months |
| Medium tier | 90 days | 6 months minimum | 9+ months |
| High tier (default) | 90 days | 12 months minimum | 15+ months |
| High tier + EU AI Act Annex III | 90 days | 6 years | 6+ years |
| Any SOX-relevant tier | per tier hot | 7 years cold (§802) | 7+ years |
| Any HIPAA-relevant tier | per tier hot | 6 years cold | 6+ years |
| Any PCI-DSS-relevant tier | per tier hot | 1 year hot + 1 year cold | 2+ years |
Implementation: logs flow from observability platform (LangSmith / Helicone / Langfuse) → company log lake (Datadog / Splunk) → cold archive (S3 Glacier / Azure Archive / Google Coldline). Retention lifecycle configured at each layer.
Sector overlays apply — apply the longest retention from any overlapping rule.
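The "longest rule wins" logic can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `retention_policy` and the overlay keys are hypothetical, durations are expressed in months for simplicity, and the PCI-DSS row is left out because it also changes hot retention.

```python
# Cold-archive and hot-retention durations taken from the §3 table.
TIER_HOT_DAYS = {"low": 30, "medium": 90, "high": 90}
TIER_COLD_MONTHS = {"low": 6, "medium": 6, "high": 12}

# Overlay keys are hypothetical names for the regulatory rows in §3.
OVERLAY_COLD_MONTHS = {
    "eu_ai_act_annex_iii": 72,  # 6 years
    "sox": 84,                  # 7 years
    "hipaa": 72,                # 6 years
}


def retention_policy(tier, overlays=()):
    """Return hot days and the longest applicable cold-archive duration."""
    tier = tier.lower()
    candidates = [TIER_COLD_MONTHS[tier]]
    candidates += [OVERLAY_COLD_MONTHS[name] for name in overlays]
    return {"hot_days": TIER_HOT_DAYS[tier], "cold_months": max(candidates)}
```

For example, `retention_policy("medium", ["sox"])` yields 90 hot days and 84 months of cold archive, because the SOX overlay outlasts the Medium-tier default.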
§4. PII / data-class redaction in logs
| Data class | In raw prompt? | In raw response? | In logs? |
|---|---|---|---|
| Public | OK | OK | OK |
| Internal | OK | OK | OK |
| Confidential | OK if Agent Card §5 authorizes | OK | Mask if specifically flagged (e.g., contract dollar amounts) |
| PII (regulated) | OK if Agent Card §5 + RAI §2 authorize + DLP active | Mask | Mask before log emit |
| PHI | OK ONLY if HIPAA overlay applied + BAA in place | Mask | Mask before log emit |
| Financial account numbers | OK if Agent Card authorizes | Mask | Mask before log emit |
| Government identifiers (SSN, passport) | NEVER unless explicit High-tier risk-appetite approval | NEVER in raw form | NEVER in raw form |
Implementation: runtime layer (Lakera / Aporia / custom) detects sensitive patterns + redacts BEFORE the observability platform receives the log. The orchestrator's "log emit" hook runs the redaction filter.
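To make the "redact before log emit" ordering concrete, here is a minimal sketch of a custom filter wired into an emit hook. The regular expressions, the `redact` and `emit_log` names, and the `sink` callback are all assumptions for illustration; simple patterns like these miss many real-world formats and are not a substitute for a dedicated runtime layer.

```python
import re

# Illustrative patterns only; a production filter needs far broader coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "account_number": re.compile(r"\b\d{13,16}\b"),
}

# Log fields that §1 marks as sanitized.
SANITIZED_FIELDS = ("prompt", "response")


def redact(text):
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


def emit_log(record, sink):
    """Redact a copy of the record, then hand it to the observability sink."""
    clean = dict(record)  # shallow copy so the caller's record is untouched
    for name in SANITIZED_FIELDS:
        if isinstance(clean.get(name), str):
            clean[name] = redact(clean[name])
    sink(clean)
```

Because `sink` only ever receives the redacted copy, the observability platform never sees the raw values, which is the ordering this section requires.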
§5. The 5 monitoring signals — concrete implementation
(Aligned with framework.md §21.)
Signal 1 — Output distribution shift (drift)
What: Compare production output distribution against eval baseline (captured in template 08 §6).
How:
- For numeric outputs (confidence scores, match scores, amounts) → Population Stability Index (PSI) vs baseline
- For categorical outputs (decisions, classifications) → KL-divergence vs baseline
- For LLM-text outputs → semantic similarity (embedding distance) vs sample of baseline
Thresholds: PSI 0.0–0.1 stable / 0.1–0.2 watch / >0.2 alert. KL-divergence: similar bands.
Cadence: Hourly recompute against rolling 24-hour window.
Alarm: PSI > 0.2 for 2 consecutive hours → Sev-2 alert to CoE Lead + Builder.
Tools: Arize AI, Langfuse Premium, Fiddler AI, Aporia. Custom on top of LangSmith if needed.
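For teams building the custom option, the PSI and KL-divergence calculations can be sketched with the standard library alone. The function names, the `eps` floor for empty bins, and the exact handling of band boundaries (0.1 counted as "watch", 0.2 counted as "watch") are assumptions; inputs are per-bin counts over the same bins for baseline and production.

```python
import math


def _proportions(counts, eps):
    """Convert bin counts to proportions, flooring at eps to avoid log(0)."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def psi(baseline_counts, production_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    base = _proportions(baseline_counts, eps)
    prod = _proportions(production_counts, eps)
    return sum((p - b) * math.log(p / b) for b, p in zip(base, prod))


def kl_divergence(production_counts, baseline_counts, eps=1e-6):
    """KL(production || baseline) for categorical outputs."""
    prod = _proportions(production_counts, eps)
    base = _proportions(baseline_counts, eps)
    return sum(p * math.log(p / b) for p, b in zip(prod, base))


def band(psi_value):
    """Map a PSI value to the stable / watch / alert bands."""
    if psi_value < 0.1:
        return "stable"
    if psi_value <= 0.2:
        return "watch"
    return "alert"


def should_alarm(hourly_psi, threshold=0.2, consecutive=2):
    """True when the most recent `consecutive` hourly values all exceed threshold."""
    recent = hourly_psi[-consecutive:]
    return len(recent) == consecutive and all(v > threshold for v in recent)
```

Passing the list of hourly PSI values to `should_alarm` implements the "2 consecutive hours" rule, so a single noisy hour does not page anyone.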
Signal 2 — HITL escalation rate
What: Rate at which the agent flags items to humans for approval/review.
How: count hitl_events / count executions per rolling 24h + 7d + 30d windows.
Thresholds:
- Healthy band: expected baseline ± 20% (e.g., 7–11% if baseline is 9%)
- Falling toward 0 = RED FLAG (agent overreaching or HITL gate disabled)
- Rising sharply = scope problem or quality problem
Alarm: drops below 70% of baseline OR rises above 150% of baseline → Sev-2 alert.
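The alarm rule above can be sketched as a simple band check. The function names and return labels are hypothetical, and the sketch assumes the rate is defined as executions with at least one HITL event divided by total executions in the window.

```python
def hitl_rate(executions_with_hitl, total_executions):
    """Share of executions in the window that escalated to a human."""
    if total_executions == 0:
        return 0.0
    return executions_with_hitl / total_executions


def hitl_alarm(rate, baseline):
    """Return 'below', 'above', or None per the 70% / 150% alarm bounds."""
    if rate < 0.7 * baseline:
        return "below"   # possible overreach or disabled HITL gate
    if rate > 1.5 * baseline:
        return "above"   # possible scope or quality problem
    return None
```

With a 9% baseline, a window at 5% returns `"below"` and a window at 14% returns `"above"`; both would raise a Sev-2 alert under this section.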
Signal 3 — Decision audit trail (business-logic capture)
What: for each execution, what data was considered, what policy governed it, was the policy enforced, who reviewed.
How: every execution writes a decision record with:
- Input context (sanitized)
- Reasoning trace (LLM chain-of-thought if available, otherwise structured rationale)
- Policy checks fired (from §1 field 14)
- Final action + actor (agent or human)
- Tool calls + results
Storage: retained per §3 retention rules. Auditor-queryable.
Why this matters: EU AI Act Article 12 + 19 — required for high-risk AI. Tokens + latency alone don't satisfy.
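One possible shape for the decision record, sketched as a plain dictionary built from the §1 log fields. The function name `build_decision_record` and the key names are assumptions for illustration; the standard lists what the record must contain, not how it is serialized.

```python
ACTORS = {"agent", "human"}


def build_decision_record(log, reasoning, final_action, actor):
    """Assemble a decision record from an execution log plus business context.

    `log` is assumed to be a dict carrying the §1 fields, already sanitized.
    """
    if actor not in ACTORS:
        raise ValueError(f"actor must be one of {sorted(ACTORS)}")
    return {
        "correlation_id": log["correlation_id"],  # links back to the raw log
        "input_context": log["prompt"],           # sanitized input
        "reasoning": reasoning,                   # trace or structured rationale
        "policy_checks": log["policy_checks"],    # §1 field 14
        "final_action": final_action,
        "actor": actor,
        "tool_calls": log["tool_calls"],          # calls and their results
    }
```

Carrying `correlation_id` into the record lets an auditor move from the business-level decision to the underlying telemetry for the same execution.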
Signal 4 — Cost per execution at step level
What: cost is tracked NOT just per-execution-total but per-step (parser cost, LLM cost, tool-call cost, etc.).
Why: step-level cost reveals retry loops, unexpected tool invocations, prompt bloat — patterns that get hidden in total cost.
Thresholds:
- Per-execution cost > 2× baseline → Sev-3 alert
- Daily total > daily cap → Sev-2 alert
- Any single step > 5× baseline → Sev-3 alert
Tools: LangSmith cost dashboards (per-step), Helicone, custom Grafana on Anthropic/OpenAI usage APIs.
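The three thresholds above can be evaluated together in one pass. This is a sketch under assumptions: the function name, argument names, and alert codes are hypothetical, and baselines are assumed to be precomputed per step and per execution.

```python
def cost_alerts(step_costs, step_baselines, exec_baseline, daily_total, daily_cap):
    """Return (severity, code) tuples for every cost threshold that is breached."""
    alerts = []

    # Per-execution total against 2x the execution baseline.
    if sum(step_costs.values()) > 2 * exec_baseline:
        alerts.append(("Sev-3", "execution_total"))

    # Running daily total against the daily cap.
    if daily_total > daily_cap:
        alerts.append(("Sev-2", "daily_cap"))

    # Each step against 5x its own baseline; steps without a baseline are skipped.
    for step, cost in step_costs.items():
        baseline = step_baselines.get(step)
        if baseline and cost > 5 * baseline:
            alerts.append(("Sev-3", f"step:{step}"))

    return alerts
```

Checking steps individually is what surfaces a retry loop in one step even when the execution total still looks plausible.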
Signal 5 — Exception routing patterns
What: which inputs end up in the agent's exception path (instead of being handled inline)?
Why: Rising exception rate = data drift or scope drift. Falling exception rate (without scope expansion) = exception gate being bypassed.
How:
- Count exceptions per cause category (from outcome + error fields)
- Track per-category trend over 7d / 30d
Alarm: new exception category appearing > 5% of volume → Sev-3 review trigger.
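A sketch of the new-category check follows. It assumes each log record is a dict with the §1 `outcome` and `error` fields, that the cause category is carried in `error["type"]`, and that "volume" means total executions in the window; the function name and the `known_categories` argument are hypothetical.

```python
from collections import Counter


def new_exception_categories(records, known_categories, threshold=0.05):
    """Return {category: share_of_volume} for unseen categories above threshold."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(
        r["error"]["type"]
        for r in records
        if r.get("outcome") == "exception_routed" and r.get("error")
    )
    return {
        category: n / total
        for category, n in counts.items()
        if category not in known_categories and n / total > threshold
    }
```

Any category returned here would open a Sev-3 review; categories already in `known_categories` are tracked through the 7d / 30d trend instead.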
§6. Adoption metrics — the standard taxonomy
For every production agent, the following adoption metrics MUST be tracked. (These often hide in dashboards as afterthoughts — promote them.)
| Metric | Definition | How |
|---|---|---|
| Unique users (active) | Distinct user IDs that triggered ≥1 execution in window | Count distinct initiator field per window |
| DAU / WAU / MAU | Unique users in last 1d / 7d / 30d | Standard time-windowed distinct count |
| Stickiness | DAU / MAU ratio | (Indicates daily-use depth vs occasional use) |
| Executions per user-week | Total executions / unique users / weeks | (Engagement depth indicator) |
| Time-to-first-use for new user | Days between identity provisioning + first execution | (Onboarding friction indicator) |
| Workflow coverage | % of in-scope inputs handled by agent (vs going to manual workflow) | Per agent — must be defined per use case |
| Reach by department | Unique users by department | (Cross-department adoption indicator) |
| Self-reported satisfaction | Quarterly user survey, 1–5 scale | Manual collection at quarterly review |
| Churn / abandonment | Users active in prev period but not this period | Per period |
Cadence: Daily updates for usage metrics. Quarterly satisfaction surveys. Reported in template 14 (30-day review) §4 and template 16 (per-agent quarterly) §4.
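The usage metrics in the table reduce to windowed distinct counts over the `initiator` field. The sketch below assumes events arrive as `(initiator, date)` pairs with system trigger IDs already filtered out, and that windows are inclusive of the as-of date; the function names are hypothetical.

```python
from datetime import date, timedelta


def active_users(events, as_of, window_days):
    """Distinct initiators with at least one execution in the window."""
    start = as_of - timedelta(days=window_days - 1)
    return {user for user, day in events if start <= day <= as_of}


def adoption_snapshot(events, as_of):
    """DAU / WAU / MAU and stickiness (DAU divided by MAU) as of a date."""
    dau = len(active_users(events, as_of, 1))
    wau = len(active_users(events, as_of, 7))
    mau = len(active_users(events, as_of, 30))
    stickiness = dau / mau if mau else 0.0
    return {"dau": dau, "wau": wau, "mau": mau, "stickiness": stickiness}


def churned_users(events, as_of, period_days=30):
    """Users active in the previous period but not in the current one."""
    current = active_users(events, as_of, period_days)
    previous_end = as_of - timedelta(days=period_days)
    previous = active_users(events, previous_end, period_days)
    return previous - current
```

Running `adoption_snapshot` daily and `churned_users` per period covers the daily-cadence rows; satisfaction and workflow coverage still need their own collection paths.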
§7. Alert thresholds + escalation matrix
| Severity | Trigger examples | Page within | Inform within |
|---|---|---|---|
| Sev-1 | Out-of-scope tool call attempted · Identity revoked · Confirmed data exfiltration · Customer-facing harmful output · Article 73 serious incident | 5 min | Executive Sponsor + General Counsel within 1 hr |
| Sev-2 | HITL acceptance < 70% of baseline for 2 days · Drift PSI > 0.2 sustained 2h · Sustained error rate > 5% per hour · Cost > 2× daily cap | 15 min | Department Head within 4 hr |
| Sev-3 | Cost spike > 2× baseline (single day) · Rate-limit hit · Transient outage < 30 min · New exception category > 5% | 30 min | (none — handled by on-call) |
| Sev-4 | Cosmetic issue · Single-case quirk · Performance degradation < threshold | Next business day | (none) |
Page routing: PagerDuty / Opsgenie / Splunk On-Call schedules per agent in Agent Card §11. Each agent has primary + backup on-call named.
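The matrix can also be held as data so that alerting code and the published table stay in step. This is a sketch with hypothetical names; it does not call any paging vendor's API, and Sev-4 returns no page deadline because "next business day" depends on the company calendar.

```python
from datetime import datetime, timedelta

# Values mirror the §7 table; None means no page or no inform step.
ESCALATION = {
    "Sev-1": {"page_within": timedelta(minutes=5),
              "inform": "Executive Sponsor + General Counsel",
              "inform_within": timedelta(hours=1)},
    "Sev-2": {"page_within": timedelta(minutes=15),
              "inform": "Department Head",
              "inform_within": timedelta(hours=4)},
    "Sev-3": {"page_within": timedelta(minutes=30),
              "inform": None, "inform_within": None},
    "Sev-4": {"page_within": None,  # next business day, handled off-pager
              "inform": None, "inform_within": None},
}


def page_deadline(severity, raised_at):
    """Latest time by which on-call must be paged, or None if no page applies."""
    window = ESCALATION[severity]["page_within"]
    return raised_at + window if window is not None else None


def inform_deadline(severity, raised_at):
    """(recipient, deadline) for the inform step, or None if nobody is informed."""
    rule = ESCALATION[severity]
    if rule["inform"] is None:
        return None
    return rule["inform"], raised_at + rule["inform_within"]
```

Keeping the deadlines in one structure means a change to the matrix is a single edit that both the document and the alert wiring can be checked against.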
§8. Recommended tooling matrix
| Layer | Recommended options | Notes |
|---|---|---|
| AI-native observability (primary) | LangSmith, Helicone, Langfuse, Arize AI, Braintrust, Maxim AI, AgentOps, Phoenix by Arize | Pick one as primary. Wire it from agent's first execution in dev. |
| General observability (pipe-through) | Datadog, Grafana + Loki, Microsoft Sentinel, Splunk, Elastic, New Relic, Honeycomb | Pipe AI events here so they sit alongside the rest of the company's monitoring. |
| Drift detection | Arize AI, Aporia, Fiddler AI, Langfuse Premium, custom | Use what already exists where possible. |
| Cost monitoring | LangSmith cost (LLM-specific), Helicone, Portkey (gateway-side), AWS Cost Explorer, Cloudability | Step-level cost is key — not just total. |
| Alerting / paging | PagerDuty, Opsgenie, Splunk On-Call, Microsoft Teams alerts | Use existing company tools. |
| Long-term cold archive | S3 Glacier, Azure Archive Storage, Google Coldline | Set lifecycle on §3 retention rules. |
| AI gateway (centralizes routing + control) | Portkey, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, Helicone Gateway | Adds policy/cost enforcement layer; worth it past ~10 agents. |
| Prompt management + versioning | LangSmith, Humanloop, PromptHub, Promptfoo, PromptLayer | Tied to source control. |
| Evaluation harness | Promptfoo, Ragas, DeepEval, OpenAI Evals, LangSmith Evals, Braintrust | Run at pre-prod (template 08) + quarterly re-eval. |
ACME's choice (worked example): LangSmith (primary) → Datadog (pipe-through) → S3 Glacier (cold archive). Arize for drift on Medium+ agents. PagerDuty for alerts. Portkey for AI gateway (added in Q3 2026).
§9. Implementation steps — Platform Team rollout
When this standard is first published, the Platform Team implements as follows:
| # | Step | Owner | Effort |
|---|---|---|---|
| 1 | Select observability platform (LangSmith / Helicone / Langfuse / Arize) | Platform Lead + CoE Lead | 1 week to evaluate |
| 2 | Set up shared org-level account with per-agent projects | Platform Team | 1 day |
| 3 | Define the per-execution log emit hook in orchestrator (n8n, LangGraph) — auto-emit §1 fields | Platform Engineer | 2–3 days |
| 4 | Build the per-agent dashboard template (covering all §2 tier requirements) | Platform Engineer | 3–5 days |
| 5 | Configure log forwarding to Datadog (or equivalent) | Platform Engineer | 1 day |
| 6 | Configure cold archive lifecycle (S3 Glacier or equivalent) per §3 retention rules | Platform Engineer | 1 day |
| 7 | Wire alert thresholds (§7) into PagerDuty | Platform Engineer | 2 days |
| 8 | Build drift-detection pipeline for Medium+ agents | Platform Engineer + Data | 1 week |
| 9 | Document onboarding checklist for new agents | Platform Lead + CoE Lead | 1 day |
| 10 | Run end-to-end test with one pilot agent (e.g., AP recon) | Platform Team | 1 day |
Total Platform Team effort: ~3 weeks one-time + ~0.5 day per new agent onboarding.
§10. When this standard changes
Triggers for revision:
- New monitoring signal proven valuable (added to §5)
- New sector overlay applies (e.g., expanding to healthcare → add PHI redaction to §4)
- New observability platform adopted (update §8)
- Annual review (always)
- After any Sev-1 or repeat Sev-2 that reveals an observability gap (lesson fed back via template 10 post-mortem)
Version control: this document lives at templates/17-observability-standard.md. Update the template_version and add a changelog entry below.
§11. Changelog
| Version | Date | Change | Author |
|---|---|---|---|
| 1.0 | 2026-03-01 | Initial standard published | CoE Lead + Platform Lead + CISO |
Sign-off
| Role | Name | Date |
|---|---|---|
| AI CoE Lead | Morteza Moradi | 2026-03-01 |
| Platform Lead | Jess (IT) | 2026-03-01 |
| CISO | Pat Lee | 2026-03-01 |
Blank template (copy below for your company)
# [Company] — AI Agent Observability Standard v[X.X]
**Effective:** [YYYY-MM-DD]
**Owner:** Platform Team · AI CoE Lead
**Approval:** [signers]
**Next review:** [YYYY-MM-DD]
## §1. Required per-execution log fields
| # | Field | Type | Notes |
|---|---|---|---|
| 1 | timestamp_start | ISO-8601 UTC | |
| 2 | timestamp_end | ISO-8601 UTC | |
| 3 | agent_id | string | From Agent Card §1 |
| 4 | agent_version | string | semver |
| 5 | correlation_id | UUID | |
| 6 | initiator | string | User ID or system trigger ID |
| 7 | prompt | string (sanitized) | |
| 8 | response | string (sanitized) | |
| 9 | tool_calls | array | name, parameters, status, latency, cost |
| 10 | model | string | provider + model + version |
| 11 | tokens_in | integer | |
| 12 | tokens_out | integer | |
| 13 | cost_usd | decimal | |
| 14 | policy_checks | array | guardrails fired + results |
| 15 | hitl_events | array | gate triggers + decisions |
| 16 | latency_per_step_ms | array | |
| 17 | outcome | enum | success / failure / human_overridden / exception_routed |
| 18 | error | object | if present |
## §2. Dashboards by tier
[Required for all tiers table]
[Additional for Medium tier table]
[Additional for High tier table]
## §3. Log retention by tier and jurisdiction
| Scope | Active hot | Cold archive | Total |
|---|---|---|---|
| Low | [duration] | [duration] | |
| Medium | [duration] | [duration] | |
| High (default) | [duration] | [duration] | |
| High + EU AI Act Annex III | [duration] | [duration] | |
| Any SOX-relevant | per tier | [duration] | |
| Any HIPAA-relevant | per tier | [duration] | |
## §4. PII / data-class redaction in logs
[Table mapping each data class to logging treatment]
## §5. The 5 monitoring signals — concrete implementation
### Signal 1 — Output distribution shift
- What:
- How:
- Thresholds:
- Cadence:
- Alarm:
- Tools:
### Signal 2 — HITL escalation rate
[same shape]
### Signal 3 — Decision audit trail
[same shape]
### Signal 4 — Cost per execution at step level
[same shape]
### Signal 5 — Exception routing patterns
[same shape]
## §6. Adoption metrics — the standard taxonomy
| Metric | Definition | How |
|---|---|---|
| Unique users (active) | | |
| DAU / WAU / MAU | | |
| Stickiness | | |
| Executions per user-week | | |
| Time-to-first-use for new user | | |
| Workflow coverage | | |
| Reach by department | | |
| Self-reported satisfaction | | |
| Churn / abandonment | | |
## §7. Alert thresholds + escalation matrix
| Severity | Trigger | Page within | Inform within |
|---|---|---|---|
| Sev-1 | | | |
| Sev-2 | | | |
| Sev-3 | | | |
| Sev-4 | | | |
## §8. Recommended tooling matrix
[Per-layer recommendations + company's chosen tools]
## §9. Implementation steps — Platform Team rollout
[Steps + owners + effort]
## §10. When this standard changes
[Revision triggers]
## §11. Changelog
| Version | Date | Change | Author |
|---|---|---|---|
## Sign-off
| Role | Name | Date |
|---|---|---|
| AI CoE Lead | | |
| Platform Lead | | |
| CISO | | |
Usage notes
- This is published ONCE, then referenced from every Agent Card §11. Don't restate it per agent.
- Platform Team owns the infrastructure behind this standard. AI CoE Lead owns the policy.
- The five monitoring signals from framework.md §21 are non-negotiable for Medium+ agents. Don't trim. Some can be deferred for Low (e.g., drift detector unnecessary if agent rarely runs).
- Adoption metrics are often the first to get cut when teams are under deadline pressure. Don't let it happen — adoption is the proof of value at quarterly review (template 16).
- Update this document after any incident that reveals a gap. Post-mortem (template 10) action items often update this standard.
- The retention rules in §3 are the LONGEST of any applicable rule. If an agent is Medium-tier AND SOX-adjacent, the SOX 7-year rule wins.
- AI gateway adoption (Portkey / LiteLLM / etc.) becomes worth it past ~10 production agents. Don't add complexity early; add it when needed.
Common pitfalls
| Pitfall | What it looks like | Fix |
|---|---|---|
| Each agent invents its own logging | 12 agents, 12 dashboards, no cross-agent visibility | Standardize §1 fields; auto-emit from orchestrator |
| "We'll add logging later" | Pilot runs blind for 30 days | Wire observability from day 0 of dev |
| Cost only tracked at total level | Step-level retry loops hide for months | §5 Signal 4 — step-level required |
| Drift detection skipped for Medium | "It's not high-risk, we don't need drift" | Medium tier needs drift too — quality degradation is invisible without it |
| Adoption metrics missing | Agent works technically but nobody uses it | §6 — track DAU/WAU/MAU + churn explicitly |
| PII appears in logs | Redaction filter not wired or has gaps | §4 — runtime layer must redact BEFORE log emit |
| Retention shorter than regulation | EU AI Act Annex III agent with 30-day default retention | §3 — apply longest applicable rule, configure cold archive |
| Alerts not paged to humans | Email alerts only, nobody reads | §7 — page via PagerDuty/Opsgenie; phone number, not inbox |
| Drift alarm fires, no investigation | Signal exists, response process doesn't | §7 escalation matrix + on-call rotation per Agent Card §11 |
Framework cross-references
- framework.md §21 (5 monitoring signals, operationalized in §5)
- framework.md §24 (observability, fields list in §1)
- framework.md §17 (privileged identities: audit attribution requires named initiator in §1 field 6)
- framework.md §10 (risk tier: drives tier-specific dashboard + retention rules in §2 + §3)
- framework.md §22.1 EU AI Act Article 12 + 19 (record-keeping + retention)
- framework.md §22.1 EU AI Act Article 72 (post-market monitoring)
- framework.md §22.2 NIST AI RMF MEASURE function
- framework.md §22.2.1 NIST AI 600-1 GenAI Profile (monitoring patterns)
- framework.md §22.3 ISO/IEC 42001 Clause 9.1
- workflows.md Step A6 (standards documents: this is one of them)
- workflows.html → In Action view → node M14 Pilot (must be live BEFORE pilot starts)
- Companion templates: 09-runbook.md (per-agent operational), 14-30-day-review.md (first formal results check), 16-per-agent-quarterly-review.md (quarterly results)