Purpose

The company-wide observability playbook for AI agents. Published once at A6 (standards documents) by the AI CoE. Built and maintained by the Platform Team. Every production agent inherits from this standard — no agent invents its own observability.

This is the document that answers, for every agent:

What gets logged on every execution (per-execution fields)
Where logs live (storage + retention by tier)
What dashboards must exist (and who can see them)
What triggers an alert (thresholds + pager routing)
How drift is detected (signal + tools + cadence)
How adoption is measured (the metrics taxonomy)
How decision audit trails capture business logic (not just telemetry)

Without a standardized document, every agent's observability is bespoke: cross-agent dashboards become impossible, auditors get inconsistent records, and EU AI Act Article 12 (event logging) compliance becomes per-agent improvisation.

When you use it: Published once at program start (A6 in In Action roadmap), referenced from every Agent Card §11. Updated annually, or whenever a material new signal is added.
Who owns: Platform Team builds. AI CoE Lead approves. Security signs.
Format: Living document in templates/17-observability-standard.md. Customized for the company's specific stack on adoption.

Worked example (ACME Corp Observability Standard v1.0)

ACME Corp — AI Agent Observability Standard v1.0

Effective: 2026-03-01
Owner: Platform Team (operational) · AI CoE Lead (policy)
Approval: CISO + CoE Lead + Platform Lead
Next review: 2027-03-01

§1. Required per-execution log fields

Every AI agent in production — Low, Medium, or High tier — emits the following fields on every execution. No exceptions.

#	Field	Type	Notes
1	`timestamp_start`	ISO-8601 UTC	When invocation began
2	`timestamp_end`	ISO-8601 UTC	When invocation completed
3	`agent_id`	string	From Agent Card §1, e.g., `finance-invoice-recon`
4	`agent_version`	string	semver, e.g., `1.0.5`
5	`correlation_id`	UUID	Unique per execution; threads across all sub-calls
6	`initiator`	string	User ID OR system trigger ID — never anonymous
7	`prompt`	string	Sanitized (PII redacted per §4 below)
8	`response`	string	Sanitized
9	`tool_calls`	array	Each tool invocation: name, parameters (sanitized), result-status, latency, cost
10	`model`	string	LLM provider + model name + version, e.g., `anthropic/claude-sonnet-4-6`
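11	`tokens_in`	integer	Input (prompt) tokens sent to the model
12	`tokens_out`	integer	Output (completion) tokens returned by the model
13	`cost_usd`	decimal	Computed as `tokens_in` × input rate + `tokens_out` × output rate (providers typically price the two differently)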
14	`policy_checks`	array	Each guardrail that fired: name, result (pass/block), reason
15	`hitl_events`	array	Each HITL gate: trigger, decision (approve/reject/override), human ID, timestamp
16	`latency_per_step_ms`	array	Per-step latency breakdown
17	`outcome`	enum	success / failure / human_overridden / exception_routed
18	`error`	object	If present: type, stack trace (sanitized), recoverable (yes/no)

Implementation rule: these fields are emitted automatically by the orchestrator layer (n8n / LangGraph / etc.) and forwarded to the observability platform. Builders do NOT have to manually instrument them — the framework / approved stack from A4 provides them.
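As an illustration of what the orchestrator's emit hook might produce, here is a minimal sketch of the 18 fields as a Python dataclass with a few sanity checks. The class name `ExecutionRecord`, the `validate` rules, and the sample values (such as `user-1042`) are assumptions made for this example, not part of the standard or of any orchestrator's API.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional

OUTCOMES = {"success", "failure", "human_overridden", "exception_routed"}


@dataclass
class ExecutionRecord:
    """Hypothetical container mirroring the 18 per-execution fields in §1."""
    timestamp_start: str
    timestamp_end: str
    agent_id: str
    agent_version: str
    correlation_id: str
    initiator: str
    prompt: str          # already sanitized per §4
    response: str        # already sanitized per §4
    model: str
    tokens_in: int
    tokens_out: int
    cost_usd: Decimal
    outcome: str
    tool_calls: list = field(default_factory=list)
    policy_checks: list = field(default_factory=list)
    hitl_events: list = field(default_factory=list)
    latency_per_step_ms: list = field(default_factory=list)
    error: Optional[dict] = None

    def validate(self) -> None:
        # Field 6: the initiator is never anonymous.
        if not self.initiator.strip():
            raise ValueError("initiator must be a user ID or system trigger ID")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        uuid.UUID(self.correlation_id)  # raises ValueError if not a valid UUID

    def to_json(self) -> str:
        self.validate()
        # default=str serializes the Decimal cost as a string.
        return json.dumps(asdict(self), default=str)


now = datetime.now(timezone.utc).isoformat()
record = ExecutionRecord(
    timestamp_start=now, timestamp_end=now,
    agent_id="finance-invoice-recon", agent_version="1.0.5",
    correlation_id=str(uuid.uuid4()), initiator="user-1042",
    prompt="[sanitized prompt]", response="[sanitized response]",
    model="anthropic/claude-sonnet-4-6",
    tokens_in=1200, tokens_out=300, cost_usd=Decimal("0.0081"),
    outcome="success",
)
print(record.to_json())
```

Validating at emit time keeps malformed records out of the log lake, which matters more here than in ordinary telemetry because auditors query these records directly.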

§2. Dashboards by tier

Required for ALL tiers (Low / Medium / High)

Dashboard	Update cadence	Audience
Cost per agent per day	Real-time	Department Champion + CoE Lead
Execution count + failure rate (24h, 7d, 30d)	Real-time	Champion
Latency p50 / p95 / p99 (24h, 7d)	Real-time	Champion + on-call
Tokens consumed (24h, 7d, 30d)	Real-time	Champion
HITL events count + acceptance rate (24h, 7d, 30d)	Real-time	Champion

Additional for Medium tier

Dashboard	Update cadence	Audience
Output distribution shift (PSI vs eval baseline)	Hourly	CoE Lead + Builder
Exception routing patterns by category	Real-time	Champion + Builder
Per-step cost breakdown	Real-time	Builder + Finance
Tool-call frequency + success rate	Real-time	Builder
Adoption: unique users (DAU / WAU / MAU)	Daily	Champion + CoE Lead

Additional for High tier

Dashboard	Update cadence	Audience
Decision audit trail with business-logic context	Real-time	CoE Lead + Compliance + Legal
Demographic outcome distribution (if decisions affect people)	Daily	Compliance + Legal
Drift detector (PSI + KL-divergence + alarm history)	Hourly	Builder + CoE Lead
Geographic data residency map	Daily	Compliance + Privacy Officer

§3. Log retention by tier and jurisdiction

Scope	Active hot retention	Cold archive	Total
Low tier	30 days	6 months	7 months
Medium tier	90 days	6 months minimum	9+ months
High tier (default)	90 days	12 months minimum	15+ months
High tier + EU AI Act Annex III	90 days	6 years	6+ years
Any SOX-relevant tier	per tier hot	7 years cold (§802)	7+ years
Any HIPAA-relevant tier	per tier hot	6 years cold	6+ years
Any PCI-DSS-relevant tier	per tier hot	1 year hot + 1 year cold	2+ years

Implementation: logs flow from observability platform (LangSmith / Helicone / Langfuse) → company log lake (Datadog / Splunk) → cold archive (S3 Glacier / Azure Archive / Google Coldline). Retention lifecycle configured at each layer.

Sector overlays stack on top of the tier defaults. Where rules overlap, the longest retention period wins.
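The longest-rule-wins logic can be sketched as a small lookup. This is an illustration only: the dictionary names, the overlay keys, the conversion of 30 and 90 days to 1 and 3 months, and the choice to take the maximum separately for the hot and cold layers are all assumptions of this example.

```python
# (hot_months, cold_months) per tier; 30 days is treated as 1 month, 90 days as 3.
TIER_RULES = {"low": (1, 6), "medium": (3, 6), "high": (3, 12)}

# Overlay rules from the table above. hot=None means "per tier hot".
OVERLAY_RULES = {
    "eu_ai_act_annex_iii": (None, 72),  # 6 years cold
    "sox": (None, 84),                  # 7 years cold
    "hipaa": (None, 72),                # 6 years cold
    "pci_dss": (12, 12),                # 1 year hot + 1 year cold
}


def retention_for(tier, overlays=()):
    """Return the retention that satisfies the tier default and every overlay."""
    hot, cold = TIER_RULES[tier]
    for name in overlays:
        overlay_hot, overlay_cold = OVERLAY_RULES[name]
        if overlay_hot is not None:
            hot = max(hot, overlay_hot)
        cold = max(cold, overlay_cold)
    return {"hot_months": hot, "cold_months": cold, "total_months": hot + cold}


# A Medium-tier agent that is also SOX-relevant keeps the 7-year cold archive.
print(retention_for("medium", ["sox"]))
```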

§4. PII / data-class redaction in logs

Data class	In raw prompt?	In raw response?	In logs?
Public	OK	OK	OK
Internal	OK	OK	OK
Confidential	OK if Agent Card §5 authorizes	OK	Mask if specifically flagged (e.g., contract dollar amounts)
PII (regulated)	OK if Agent Card §5 + RAI §2 authorize + DLP active	Mask	Mask before log emit
PHI	OK ONLY if HIPAA overlay applied + BAA in place	Mask	Mask before log emit
Financial account numbers	OK if Agent Card authorizes	Mask	Mask before log emit
Government identifiers (SSN, passport)	NEVER unless explicit High-tier risk-appetite approval	NEVER in raw form	NEVER in raw form

Implementation: runtime layer (Lakera / Aporia / custom) detects sensitive patterns + redacts BEFORE the observability platform receives the log. The orchestrator's "log emit" hook runs the redaction filter.
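To make the "redact before emit" ordering concrete, the sketch below shows a regex-based filter wrapped around a log sink. It is a toy stand-in for a real detection layer: the three patterns, the `[REDACTED:...]` placeholder format, and the `emit_log` function name are assumptions for illustration, and simple regexes like these will miss many real-world PII formats.

```python
import re

# Illustrative patterns only; a production filter needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # 12 to 19 digits, optionally separated by single spaces or hyphens.
    "CARD_OR_ACCOUNT": re.compile(r"\b\d(?:[ -]?\d){11,18}\b"),
}


def redact(text):
    """Replace every match of every pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


def emit_log(record, sink):
    """Redact free-text fields, then hand the cleaned copy to the sink."""
    clean = dict(record)
    for key in ("prompt", "response"):
        if key in clean:
            clean[key] = redact(clean[key])
    sink(clean)  # the observability platform only ever sees the cleaned copy


received = []
emit_log({"prompt": "Refund card 4111 1111 1111 1111", "response": "Done"}, received.append)
print(received[0]["prompt"])
```

The point of the wrapper is structural: the sink is never called with the raw record, so a gap in the patterns is the only way sensitive text can reach the logs.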

§5. The 5 monitoring signals — concrete implementation

(Aligned with framework.md §21.)

Signal 1 — Output distribution shift (drift)

What: Compare production output distribution against eval baseline (captured in template 08 §6).

How:

For numeric outputs (confidence scores, match scores, amounts) → Population Stability Index (PSI) vs baseline
For categorical outputs (decisions, classifications) → KL-divergence vs baseline
For LLM-text outputs → semantic similarity (embedding distance) vs sample of baseline

Thresholds: PSI 0.0–0.1 stable / 0.1–0.2 watch / >0.2 alert. KL-divergence: similar bands.

Cadence: Hourly recompute against rolling 24-hour window.

Alarm: PSI > 0.2 for 2 consecutive hours → Sev-2 alert to CoE Lead + Builder.

Tools: Arize AI, Langfuse Premium, Fiddler AI, Aporia. Custom on top of LangSmith if needed.
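For teams building the custom option, the two statistics can be sketched in plain Python. PSI over binned proportions is the sum of (actual - expected) × ln(actual / expected), and KL-divergence is the sum of p × ln(p / q). The function names, the epsilon smoothing for empty bins, and the decision to treat a PSI of exactly 0.1 as "watch" are assumptions of this example.

```python
import bisect
import math


def bin_proportions(values, edges):
    """Share of values falling in each bin defined by sorted edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        counts[min(max(idx, 0), len(counts) - 1)] += 1  # clamp out-of-range values
    return [c / len(values) for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) for categorical distributions given as proportions."""
    return sum(max(pi, eps) * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))


def psi_band(value):
    if value > 0.2:
        return "alert"
    return "watch" if value >= 0.1 else "stable"


def sustained_alert(hourly_psi, threshold=0.2, hours=2):
    """True when the most recent `hours` readings all exceed the threshold."""
    recent = hourly_psi[-hours:]
    return len(recent) == hours and all(v > threshold for v in recent)


baseline = [0.25, 0.25, 0.25, 0.25]
production = [0.10, 0.20, 0.30, 0.40]
score = psi(baseline, production)
print(round(score, 4), psi_band(score))
```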

Signal 2 — HITL escalation rate

What: Rate at which the agent flags items to humans for approval/review.

How: count of `hitl_events` ÷ count of executions, computed over rolling 24h, 7d, and 30d windows.

Thresholds:

Healthy band: expected baseline ± 20% (e.g., 7–11% if baseline is 9%)
Falling toward 0 = RED FLAG (agent overreaching or HITL gate disabled)
Rising sharply = scope problem or quality problem

Alarm: drops below 70% of baseline OR rises above 150% of baseline → Sev-2 alert.
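The band and alarm checks above reduce to a few comparisons against the baseline. In this sketch the function names and the intermediate "watch" label (outside the healthy band but not yet alarming) are assumptions, and the rate follows the literal definition of total HITL events divided by total executions.

```python
def hitl_rate(executions):
    """Total HITL events divided by total executions in the window."""
    if not executions:
        return 0.0
    events = sum(len(e.get("hitl_events", [])) for e in executions)
    return events / len(executions)


def classify_hitl_rate(rate, baseline):
    """Map a windowed rate to healthy / watch / sev2 relative to baseline."""
    if rate < 0.70 * baseline or rate > 1.50 * baseline:
        return "sev2"      # alarm thresholds from the standard
    if abs(rate - baseline) > 0.20 * baseline:
        return "watch"     # outside the healthy band, below alarm level
    return "healthy"


window = [{"hitl_events": [{"decision": "approve"}]}] + [{"hitl_events": []}] * 9
print(hitl_rate(window), classify_hitl_rate(hitl_rate(window), baseline=0.09))
```

Note that a rate falling to zero lands in the "sev2" branch, which matches the red-flag reading of a disabled or bypassed HITL gate.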

Signal 3 — Decision audit trail (business-logic capture)

What: for each execution, what data was considered, what policy governed it, whether the policy was enforced, and who reviewed it.

How: every execution writes a decision record with:

Input context (sanitized)
Reasoning trace (LLM chain-of-thought if available, otherwise structured rationale)
Policy checks fired (from §1 field 14)
Final action + actor (agent or human)
Tool calls + results

Storage: retained per §3 retention rules. Auditor-queryable.

Why this matters: EU AI Act Articles 12 and 19 require record-keeping for high-risk AI systems. Token counts and latency alone do not satisfy that requirement.
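A decision record with the five elements listed above could be assembled from the §1 execution log roughly as follows. The function name, the key names, and the choice to reuse the sanitized `prompt` as the input context are assumptions of this sketch.

```python
def build_decision_record(execution, reasoning_trace, final_action, actor):
    """Assemble a decision record from an already-sanitized execution log."""
    if actor not in ("agent", "human"):
        raise ValueError("actor must be 'agent' or 'human'")
    return {
        "correlation_id": execution["correlation_id"],  # links back to the §1 log
        "input_context": execution["prompt"],           # sanitized per §4
        "reasoning_trace": reasoning_trace,
        "policy_checks": execution.get("policy_checks", []),
        "final_action": final_action,
        "actor": actor,
        "tool_calls": execution.get("tool_calls", []),
    }


execution = {
    "correlation_id": "3f2b6c1e-8a4d-4e2f-9b1a-5c7d9e0f1a2b",
    "prompt": "[sanitized invoice context]",
    "policy_checks": [{"name": "amount_limit", "result": "pass", "reason": "under cap"}],
    "tool_calls": [{"name": "erp_lookup", "result_status": "ok"}],
}
decision = build_decision_record(
    execution,
    reasoning_trace="Invoice total matches PO within tolerance.",
    final_action="auto_approve",
    actor="agent",
)
print(sorted(decision))
```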

Signal 4 — Cost per execution at step level

What: cost is tracked NOT just per-execution-total but per-step (parser cost, LLM cost, tool-call cost, etc.).

Why: step-level cost reveals retry loops, unexpected tool invocations, prompt bloat — patterns that get hidden in total cost.

Thresholds:

Per-execution cost > 2× baseline → Sev-3 alert
Daily total > daily cap → Sev-2 alert
Any single step > 5× baseline → Sev-3 alert

Tools: LangSmith cost dashboards (per-step), Helicone, custom Grafana on Anthropic/OpenAI usage APIs.
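The three cost thresholds can be evaluated together once step-level costs are available. This is a minimal sketch: the function signature, the alert codes, and the sample baselines are hypothetical, and it assumes baselines are maintained per step and per execution elsewhere.

```python
def cost_alerts(step_costs, step_baselines, execution_baseline, daily_total, daily_cap):
    """Return (severity, code) tuples for every cost threshold that is breached."""
    alerts = []
    if sum(step_costs.values()) > 2 * execution_baseline:
        alerts.append(("sev3", "execution"))
    if daily_total > daily_cap:
        alerts.append(("sev2", "daily_cap"))
    for step, cost in step_costs.items():
        baseline = step_baselines.get(step)
        if baseline and cost > 5 * baseline:
            alerts.append(("sev3", f"step:{step}"))
    return alerts


# An LLM step at 6x its baseline, typical of a retry loop or prompt bloat.
alerts = cost_alerts(
    step_costs={"parser": 0.01, "llm": 0.30, "tool": 0.02},
    step_baselines={"parser": 0.01, "llm": 0.05, "tool": 0.02},
    execution_baseline=0.08,
    daily_total=40.0,
    daily_cap=50.0,
)
print(alerts)
```

In this example the execution total alone would raise an alert, but only the step-level check identifies the LLM step as the cause.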

Signal 5 — Exception routing patterns

What: which inputs end up in the agent's exception path (instead of being handled inline)?

Why: Rising exception rate = data drift or scope drift. Falling exception rate (without scope expansion) = exception gate being bypassed.

How:

Count exceptions per cause category (from outcome + error fields)
Track per-category trend over 7d / 30d

Alarm: a new exception category exceeding 5% of volume → Sev-3 review trigger.
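The new-category alarm can be sketched with a counter over the exception categories seen in the current window. The function name is hypothetical, and the sketch assumes "volume" means total executions in the window; if the intent is share of exceptions instead, pass the exception count as the denominator.

```python
from collections import Counter


def new_category_alerts(categories, known_categories, total_executions, threshold=0.05):
    """Return previously unseen categories whose share of volume exceeds threshold."""
    counts = Counter(categories)
    return sorted(
        category for category, n in counts.items()
        if category not in known_categories and n / total_executions > threshold
    )


window = ["timeout"] * 3 + ["schema_mismatch"] * 8 + ["vendor_unknown"] * 2
print(new_category_alerts(window, known_categories={"timeout"}, total_executions=100))
```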

§6. Adoption metrics — the standard taxonomy

For every production agent, the following adoption metrics MUST be tracked. (These often hide in dashboards as afterthoughts — promote them.)

Metric	Definition	How
Unique users (active)	Distinct user IDs that triggered ≥1 execution in window	Count distinct `initiator` field per window
DAU / WAU / MAU	Unique users in last 1d / 7d / 30d	Standard time-windowed distinct count
Stickiness	DAU / MAU ratio	(Indicates daily-use depth vs occasional use)
Executions per user-week	Total executions / unique users / weeks	(Engagement depth indicator)
Time-to-first-use for new user	Days between identity provisioning + first execution	(Onboarding friction indicator)
Workflow coverage	% of in-scope inputs handled by agent (vs going to manual workflow)	Per agent — must be defined per use case
Reach by department	Unique users by department	(Cross-department adoption indicator)
Self-reported satisfaction	Quarterly user survey, 1–5 scale	Manual collection at quarterly review
Churn / abandonment	Users active in prev period but not this period	Per period

Cadence: Daily updates for usage metrics. Quarterly satisfaction surveys. Reported in template 14 (30-day review) §4 and template 16 (per-agent quarterly) §4.
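Most of the usage metrics above fall out of distinct counts on the `initiator` field. The sketch below assumes each execution has been reduced to an `(initiator, date)` pair; the function names and the choice of two consecutive 30-day periods for churn are assumptions of this example.

```python
from datetime import date, timedelta


def active_users(events, end, days):
    """Distinct initiators with at least one execution in the window ending at `end`."""
    start = end - timedelta(days=days - 1)
    return {user for user, day in events if start <= day <= end}


def adoption_snapshot(events, end):
    dau = active_users(events, end, 1)
    wau = active_users(events, end, 7)
    mau = active_users(events, end, 30)
    previous = active_users(events, end - timedelta(days=30), 30)
    return {
        "dau": len(dau),
        "wau": len(wau),
        "mau": len(mau),
        "stickiness": len(dau) / len(mau) if mau else 0.0,
        "churned": len(previous - mau),  # active last period, absent this period
    }


events = [
    ("u1", date(2026, 3, 31)), ("u2", date(2026, 3, 28)), ("u3", date(2026, 3, 5)),
    ("u4", date(2026, 2, 10)), ("u1", date(2026, 2, 20)),
]
print(adoption_snapshot(events, end=date(2026, 3, 31)))
```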

§7. Alert thresholds + escalation matrix

Severity	Trigger examples	Page within	Inform within
Sev-1	Out-of-scope tool call attempted · Identity revoked · Confirmed data exfiltration · Customer-facing harmful output · Article 73 serious incident	5 min	Executive Sponsor + General Counsel within 1 hr
Sev-2	HITL acceptance < 70% of baseline for 2 days · Drift PSI > 0.2 sustained 2h · Sustained error rate > 5% per hour · Cost > 2× daily cap	15 min	Department Head within 4 hr
Sev-3	Cost spike > 2× baseline (single day) · Rate-limit hit · Transient outage < 30 min · New exception category > 5%	30 min	(none — handled by on-call)
Sev-4	Cosmetic issue · Single-case quirk · Performance degradation < threshold	Next business day	(none)

Page routing: PagerDuty / Opsgenie / Splunk On-Call schedules per agent in Agent Card §11. Each agent has primary + backup on-call named.
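The matrix above can be held as data so that paging and notification deadlines are computed rather than remembered. This sketch is not tied to any paging tool's API: the `ESCALATION` structure and `deadlines` function are hypothetical, and Sev-4's "next business day" is represented as no fixed deadline.

```python
from datetime import datetime, timedelta

ESCALATION = {
    "sev1": {"page_within": timedelta(minutes=5),
             "inform": ["Executive Sponsor", "General Counsel"],
             "inform_within": timedelta(hours=1)},
    "sev2": {"page_within": timedelta(minutes=15),
             "inform": ["Department Head"],
             "inform_within": timedelta(hours=4)},
    "sev3": {"page_within": timedelta(minutes=30), "inform": [], "inform_within": None},
    # Next business day: no fixed offset, so no computed deadline.
    "sev4": {"page_within": None, "inform": [], "inform_within": None},
}


def deadlines(severity, raised_at):
    """Compute page-by and inform-by times for an alert raised at `raised_at`."""
    rule = ESCALATION[severity]
    page = rule["page_within"]
    inform = rule["inform_within"]
    return {
        "page_by": raised_at + page if page is not None else None,
        "inform": rule["inform"],
        "inform_by": raised_at + inform if inform is not None else None,
    }


print(deadlines("sev2", datetime(2026, 3, 1, 12, 0)))
```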

§8. Recommended tooling matrix

Layer	Recommended options	Notes
AI-native observability (primary)	LangSmith, Helicone, Langfuse, Arize AI, Braintrust, Maxim AI, AgentOps, Phoenix by Arize	Pick one as primary. Wire it from agent's first execution in dev.
General observability (pipe-through)	Datadog, Grafana + Loki, Microsoft Sentinel, Splunk, Elastic, New Relic, Honeycomb	Pipe AI events here so they sit alongside the rest of the company's monitoring.
Drift detection	Arize AI, Aporia, Fiddler AI, Langfuse Premium, custom	Use what already exists where possible.
Cost monitoring	LangSmith cost (LLM-specific), Helicone, Portkey (gateway-side), AWS Cost Explorer, Cloudability	Step-level cost is key — not just total.
Alerting / paging	PagerDuty, Opsgenie, Splunk On-Call, Microsoft Teams alerts	Use existing company tools.
Long-term cold archive	S3 Glacier, Azure Archive Storage, Google Coldline	Set lifecycle on §3 retention rules.
AI gateway (centralizes routing + control)	Portkey, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, Helicone Gateway	Adds policy/cost enforcement layer; worth it past ~10 agents.
Prompt management + versioning	LangSmith, Humanloop, PromptHub, Promptfoo, PromptLayer	Tied to source control.
Evaluation harness	Promptfoo, Ragas, DeepEval, OpenAI Evals, LangSmith Evals, Braintrust	Run at pre-prod (template 08) + quarterly re-eval.

ACME's choice (worked example): LangSmith (primary) → Datadog (pipe-through) → S3 Glacier (cold archive). Arize for drift on Medium+ agents. PagerDuty for alerts. Portkey for AI gateway (added in Q3 2026).

§9. Implementation steps — Platform Team rollout

When this standard is first published, the Platform Team implements as follows:

#	Step	Owner	Effort
1	Select observability platform (LangSmith / Helicone / Langfuse / Arize)	Platform Lead + CoE Lead	1 week to evaluate
2	Set up shared org-level account with per-agent projects	Platform Team	1 day
3	Define the per-execution log emit hook in orchestrator (n8n, LangGraph) — auto-emit §1 fields	Platform Engineer	2–3 days
4	Build the per-agent dashboard template (covering all §2 tier requirements)	Platform Engineer	3–5 days
5	Configure log forwarding to Datadog (or equivalent)	Platform Engineer	1 day
6	Configure cold archive lifecycle (S3 Glacier or equivalent) per §3 retention rules	Platform Engineer	1 day
7	Wire alert thresholds (§7) into PagerDuty	Platform Engineer	2 days
8	Build drift-detection pipeline for Medium+ agents	Platform Engineer + Data	1 week
9	Document onboarding checklist for new agents	Platform Lead + CoE Lead	1 day
10	Run end-to-end test with one pilot agent (e.g., AP recon)	Platform Team	1 day

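Total Platform Team effort: ~3 weeks elapsed one-time + ~0.5 day per new agent onboarding. The step estimates above sum to 22–25 working days, so the ~3-week figure assumes some steps run in parallel.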

§10. When this standard changes

Triggers for revision:

New monitoring signal proven valuable (added to §5)
New sector overlay applies (e.g., expanding to healthcare → add PHI redaction to §4)
New observability platform adopted (update §8)
Annual review (always)
After any Sev-1 or repeat Sev-2 that reveals an observability gap (lesson fed back via template 10 post-mortem)

Version control: this document lives at templates/17-observability-standard.md. Update the template_version and add a changelog entry below.

§11. Changelog

Version	Date	Change	Author
1.0	2026-03-01	Initial standard published	CoE Lead + Platform Lead + CISO

Sign-off

Role	Name	Date
AI CoE Lead	Morteza Moradi	2026-03-01
Platform Lead	Jess (IT)	2026-03-01
CISO	Pat Lee	2026-03-01

Blank template (copy below for your company)

# [Company] — AI Agent Observability Standard v[X.X]

**Effective:** [YYYY-MM-DD]
**Owner:** Platform Team · AI CoE Lead
**Approval:** [signers]
**Next review:** [YYYY-MM-DD]

## §1. Required per-execution log fields

| # | Field | Type | Notes |
|---|---|---|---|
| 1 | timestamp_start | ISO-8601 UTC | |
| 2 | timestamp_end | ISO-8601 UTC | |
| 3 | agent_id | string | From Agent Card §1 |
| 4 | agent_version | string | semver |
| 5 | correlation_id | UUID | |
| 6 | initiator | string | User ID or system trigger ID |
| 7 | prompt | string (sanitized) | |
| 8 | response | string (sanitized) | |
| 9 | tool_calls | array | name, parameters, status, latency, cost |
| 10 | model | string | provider + model + version |
| 11 | tokens_in | integer | |
| 12 | tokens_out | integer | |
| 13 | cost_usd | decimal | |
| 14 | policy_checks | array | guardrails fired + results |
| 15 | hitl_events | array | gate triggers + decisions |
| 16 | latency_per_step_ms | array | |
| 17 | outcome | enum | success / failure / human_overridden / exception_routed |
| 18 | error | object | if present |

## §2. Dashboards by tier

[Required for all tiers table]

[Additional for Medium tier table]

[Additional for High tier table]

## §3. Log retention by tier and jurisdiction

| Scope | Active hot | Cold archive | Total |
|---|---|---|---|
| Low | [duration] | [duration] | |
| Medium | [duration] | [duration] | |
| High (default) | [duration] | [duration] | |
| High + EU AI Act Annex III | [duration] | [duration] | |
| Any SOX-relevant | per tier | [duration] | |
| Any HIPAA-relevant | per tier | [duration] | |
| Any PCI-DSS-relevant | [duration] | [duration] | |

## §4. PII / data-class redaction in logs

[Table mapping each data class to logging treatment]

## §5. The 5 monitoring signals — concrete implementation

### Signal 1 — Output distribution shift
- What:
- How:
- Thresholds:
- Cadence:
- Alarm:
- Tools:

### Signal 2 — HITL escalation rate
[same shape]

### Signal 3 — Decision audit trail
[same shape]

### Signal 4 — Cost per execution at step level
[same shape]

### Signal 5 — Exception routing patterns
[same shape]

## §6. Adoption metrics — the standard taxonomy

| Metric | Definition | How |
|---|---|---|
| Unique users (active) | | |
| DAU / WAU / MAU | | |
| Stickiness | | |
| Executions per user-week | | |
| Time-to-first-use for new user | | |
| Workflow coverage | | |
| Reach by department | | |
| Self-reported satisfaction | | |
| Churn / abandonment | | |

## §7. Alert thresholds + escalation matrix

| Severity | Trigger | Page within | Inform within |
|---|---|---|---|
| Sev-1 | | | |
| Sev-2 | | | |
| Sev-3 | | | |
| Sev-4 | | | |

## §8. Recommended tooling matrix

[Per-layer recommendations + company's chosen tools]

## §9. Implementation steps — Platform Team rollout

[Steps + owners + effort]

## §10. When this standard changes

[Revision triggers]

## §11. Changelog

| Version | Date | Change | Author |
|---|---|---|---|

## Sign-off

| Role | Name | Date |
|---|---|---|
| AI CoE Lead | | |
| Platform Lead | | |
| CISO | | |

Usage notes

This is published ONCE, then referenced from every Agent Card §11. Don't restate it per agent.
Platform Team owns the infrastructure behind this standard. AI CoE Lead owns the policy.
The five monitoring signals from framework.md §21 are non-negotiable for Medium+ agents. Don't trim. Some can be deferred for Low-tier agents (e.g., a drift detector is unnecessary if the agent rarely runs).
Adoption metrics are often the first to get cut when teams are under deadline pressure. Don't let it happen — adoption is the proof of value at quarterly review (template 16).
Update this document after any incident that reveals a gap. Post-mortem (template 10) action items often update this standard.
Where several retention rules in §3 apply, the LONGEST applicable rule wins. If an agent is Medium-tier AND SOX-adjacent, the SOX 7-year rule applies.
AI gateway adoption (Portkey / LiteLLM / etc.) becomes worth it past ~10 production agents. Don't add complexity early; add it when needed.

Common pitfalls

Pitfall	What it looks like	Fix
Each agent invents its own logging	12 agents, 12 dashboards, no cross-agent visibility	Standardize §1 fields; auto-emit from orchestrator
"We'll add logging later"	Pilot runs blind for 30 days	Wire observability from day 0 of dev
Cost only tracked at total level	Step-level retry loops hide for months	§5 Signal 4 — step-level required
Drift detection skipped for Medium	"It's not high-risk, we don't need drift"	Medium tier needs drift too — quality degradation is invisible without it
Adoption metrics missing	Agent works technically but nobody uses it	§6 — track DAU/WAU/MAU + churn explicitly
PII appears in logs	Redaction filter not wired or has gaps	§4 — runtime layer must redact BEFORE log emit
Retention shorter than regulation	EU AI Act Annex III agent with 30-day default retention	§3 — apply longest applicable rule, configure cold archive
Alerts not paged to humans	Email alerts only, nobody reads	§7 — page via PagerDuty/Opsgenie; phone number, not inbox
Drift alarm fires, no investigation	Signal exists, response process doesn't	§7 escalation matrix + on-call rotation per Agent Card §11

Framework cross-references

framework.md §21 (5 monitoring signals — operationalized in §5)
framework.md §24 (observability — fields list in §1)
framework.md §17 (privileged identities — audit attribution requires named initiator in §1 field 6)
framework.md §10 (risk tier — drives tier-specific dashboard + retention rules in §2 + §3)
framework.md §22.1 EU AI Act Article 12 + 19 (record-keeping + retention)
framework.md §22.1 EU AI Act Article 72 (post-market monitoring)
framework.md §22.2 NIST AI RMF MEASURE function
framework.md §22.2.1 NIST AI 600-1 GenAI Profile (monitoring patterns)
framework.md §22.3 ISO/IEC 42001 Clause 9.1
workflows.md Step A6 (standards documents — this is one of them)
workflows.html → In Action view → node M14 Pilot (must be live BEFORE pilot starts)
Companion templates: 09-runbook.md (per-agent operational), 14-30-day-review.md (first formal results check), 16-per-agent-quarterly-review.md (quarterly results)