
Template 17 — Observability Standard

ID: 17-observability-standard
Version: 1
Last revised: 2026-05-14
Owner: Platform Team (implements + maintains) · AI CoE Lead (sets policy) · Security (reviews)

Purpose

The company-wide observability playbook for AI agents. Published once at A6 (standards documents) by the AI CoE. Built and maintained by the Platform Team. Every production agent inherits from this standard — no agent invents its own observability.

This is the document that answers, for every agent:

  • What gets logged on every execution (per-execution fields)
  • Where logs live (storage + retention by tier)
  • What dashboards must exist (and who can see them)
  • What triggers an alert (thresholds + pager routing)
  • How drift is detected (signal + tools + cadence)
  • How adoption is measured (the metrics taxonomy)
  • How decision audit trails capture business logic (not just telemetry)

Without a single standard, every agent's observability is bespoke: cross-agent dashboards become impossible, auditors get inconsistent records, and EU AI Act Article 12 (event logging) compliance becomes per-agent improvisation.

  • When you use it: Published once at program start (A6 in the In Action roadmap) and referenced from every Agent Card §11. Updated annually or when a material new signal is added.
  • Who owns: Platform Team builds. AI CoE Lead approves. Security signs.
  • Format: Living document in templates/17-observability-standard.md. Customized for the company's specific stack on adoption.

Worked example (ACME Corp Observability Standard v1.0)

ACME Corp — AI Agent Observability Standard v1.0

Effective: 2026-03-01
Owner: Platform Team (operational) · AI CoE Lead (policy)
Approval: CISO + CoE Lead + Platform Lead
Next review: 2027-03-01


§1. Required per-execution log fields

Every AI agent in production — Low, Medium, or High tier — emits the following fields on every execution. No exceptions.

| # | Field | Type | Notes |
|---|---|---|---|
| 1 | timestamp_start | ISO-8601 UTC | When invocation began |
| 2 | timestamp_end | ISO-8601 UTC | When invocation completed |
| 3 | agent_id | string | From Agent Card §1, e.g., finance-invoice-recon |
| 4 | agent_version | string | semver, e.g., 1.0.5 |
| 5 | correlation_id | UUID | Unique per execution; threads across all sub-calls |
| 6 | initiator | string | User ID OR system trigger ID; never anonymous |
| 7 | prompt | string | Sanitized (PII redacted per §4 below) |
| 8 | response | string | Sanitized |
| 9 | tool_calls | array | Each tool invocation: name, parameters (sanitized), result-status, latency, cost |
| 10 | model | string | LLM provider + model name + version, e.g., anthropic/claude-sonnet-4-6 |
| 11 | tokens_in | integer | |
| 12 | tokens_out | integer | |
| 13 | cost_usd | decimal | Computed from tokens × per-token cost |
| 14 | policy_checks | array | Each guardrail that fired: name, result (pass/block), reason |
| 15 | hitl_events | array | Each HITL gate: trigger, decision (approve/reject/override), human ID, timestamp |
| 16 | latency_per_step_ms | array | Per-step latency breakdown |
| 17 | outcome | enum | success / failure / human_overridden / exception_routed |
| 18 | error | object | If present: type, stack trace (sanitized), recoverable (yes/no) |

Implementation rule: these fields are emitted automatically by the orchestrator layer (n8n / LangGraph / etc.) and forwarded to the observability platform. Builders do NOT have to manually instrument them — the framework / approved stack from A4 provides them.
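As an illustration of the §1 record shape, here is a minimal Python sketch of what an orchestrator emit hook might build. The class name `ExecutionLog`, the helper `compute_cost_usd`, and the per-million-token price unit are assumptions made for this example; they are not part of any orchestrator's API.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import Optional
import json
import uuid

# Allowed values for field 17 (outcome), taken from the §1 table.
OUTCOMES = {"success", "failure", "human_overridden", "exception_routed"}


def compute_cost_usd(tokens_in: int, tokens_out: int,
                     usd_per_mtok_in: Decimal, usd_per_mtok_out: Decimal) -> Decimal:
    """Field 13: tokens x per-token cost. Prices per million tokens are an assumed unit."""
    million = Decimal(1_000_000)
    return (tokens_in * usd_per_mtok_in + tokens_out * usd_per_mtok_out) / million


@dataclass
class ExecutionLog:
    """Hypothetical container for the 18 required per-execution fields."""
    timestamp_start: str   # ISO-8601 UTC
    timestamp_end: str     # ISO-8601 UTC
    agent_id: str
    agent_version: str
    initiator: str         # user ID or system trigger ID, never anonymous
    prompt: str            # expected to be sanitized already (see §4)
    response: str          # expected to be sanitized already (see §4)
    model: str
    tokens_in: int
    tokens_out: int
    cost_usd: Decimal
    outcome: str
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    tool_calls: list = field(default_factory=list)
    policy_checks: list = field(default_factory=list)
    hitl_events: list = field(default_factory=list)
    latency_per_step_ms: list = field(default_factory=list)
    error: Optional[dict] = None

    def __post_init__(self) -> None:
        # Reject records that would break audit attribution or dashboards.
        if not self.initiator.strip():
            raise ValueError("initiator must not be empty")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    def to_json(self) -> str:
        # Decimal is not JSON-serializable by default, so emit it as a string.
        return json.dumps(asdict(self), default=str)
```

Validating at construction time means a malformed record fails in the orchestrator rather than surfacing later as a gap in a dashboard or audit query.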

§2. Dashboards by tier

Required for ALL tiers (Low / Medium / High)

| Dashboard | Update cadence | Audience |
|---|---|---|
| Cost per agent per day | Real-time | Department Champion + CoE Lead |
| Execution count + failure rate (24h, 7d, 30d) | Real-time | Champion |
| Latency p50 / p95 / p99 (24h, 7d) | Real-time | Champion + on-call |
| Tokens consumed (24h, 7d, 30d) | Real-time | Champion |
| HITL events count + acceptance rate (24h, 7d, 30d) | Real-time | Champion |

Additional for Medium tier

| Dashboard | Update cadence | Audience |
|---|---|---|
| Output distribution shift (PSI vs eval baseline) | Hourly | CoE Lead + Builder |
| Exception routing patterns by category | Real-time | Champion + Builder |
| Per-step cost breakdown | Real-time | Builder + Finance |
| Tool-call frequency + success rate | Real-time | Builder |
| Adoption: unique users (DAU / WAU / MAU) | Daily | Champion + CoE Lead |

Additional for High tier

| Dashboard | Update cadence | Audience |
|---|---|---|
| Decision audit trail with business-logic context | Real-time | CoE Lead + Compliance + Legal |
| Demographic outcome distribution (if decisions affect people) | Daily | Compliance + Legal |
| Drift detector (PSI + KL-divergence + alarm history) | Hourly | Builder + CoE Lead |
| Geographic data residency map | Daily | Compliance + Privacy Officer |

§3. Log retention by tier and jurisdiction

| Scope | Active hot retention | Cold archive | Total |
|---|---|---|---|
| Low tier | 30 days | 6 months | 7 months |
| Medium tier | 90 days | 6 months minimum | 9+ months |
| High tier (default) | 90 days | 12 months minimum | 15+ months |
| High tier + EU AI Act Annex III | 90 days | 6 years | 6+ years |
| Any SOX-relevant tier | per tier hot | 7 years cold (§802) | 7+ years |
| Any HIPAA-relevant tier | per tier hot | 6 years cold | 6+ years |
| Any PCI-DSS-relevant tier | per tier hot | 1 year hot + 1 year cold | 2+ years |

Implementation: logs flow from observability platform (LangSmith / Helicone / Langfuse) → company log lake (Datadog / Splunk) → cold archive (S3 Glacier / Azure Archive / Google Coldline). Retention lifecycle configured at each layer.

Sector overlays apply — apply the longest retention from any overlapping rule.
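The "longest rule wins" logic can be sketched in a few lines of Python. The dictionary names and overlay keys below are illustrative assumptions. The month values restate the §3 table, and the PCI-DSS row is left out of this sketch because it mixes hot and cold durations in one cell.

```python
# (hot, cold) retention in months per tier, restating the §3 table.
# 30 days is rounded to 1 month and 90 days to 3 months for this sketch.
TIER_RETENTION_MONTHS = {
    "low": (1, 6),
    "medium": (3, 6),
    "high": (3, 12),
}

# Cold-archive minimums for overlays, in months (hypothetical key names).
OVERLAY_COLD_MONTHS = {
    "eu_ai_act_annex_iii": 6 * 12,
    "sox": 7 * 12,
    "hipaa": 6 * 12,
}


def cold_archive_months(tier, overlays=()):
    """Return the longest cold-archive duration among the tier and its overlays."""
    _, tier_cold = TIER_RETENTION_MONTHS[tier]
    return max([tier_cold, *(OVERLAY_COLD_MONTHS[name] for name in overlays)])
```

For example, `cold_archive_months("medium", ["sox"])` returns 84 months, so a Medium-tier agent with SOX-relevant data inherits the 7-year rule.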

§4. PII / data-class redaction in logs

| Data class | In raw prompt? | In raw response? | In logs? |
|---|---|---|---|
| Public | OK | OK | OK |
| Internal | OK | OK | OK |
| Confidential | OK if Agent Card §5 authorizes | OK | Mask if specifically flagged (e.g., contract dollar amounts) |
| PII (regulated) | OK if Agent Card §5 + RAI §2 authorize + DLP active | Mask | Mask before log emit |
| PHI | OK ONLY if HIPAA overlay applied + BAA in place | Mask | Mask before log emit |
| Financial account numbers | OK if Agent Card authorizes | Mask | Mask before log emit |
| Government identifiers (SSN, passport) | NEVER unless explicit High-tier risk-appetite approval | NEVER in raw form | NEVER in raw form |

Implementation: runtime layer (Lakera / Aporia / custom) detects sensitive patterns + redacts BEFORE the observability platform receives the log. The orchestrator's "log emit" hook runs the redaction filter.
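To make the "redact before log emit" ordering concrete, here is a minimal Python sketch of a custom filter. The two regular expressions, the `[REDACTED:...]` placeholder format, and the `emit_log` hook are assumptions for illustration. A production deployment would more likely rely on a dedicated detection layer, and this sketch only covers top-level string fields.

```python
import re

# Illustrative patterns only; real detectors cover far more formats.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text):
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


def emit_log(record, sink):
    """Hypothetical emit hook: sanitize string fields, then hand the record to the sink.

    `sink` stands in for whatever forwards logs to the observability platform.
    Nested structures such as tool_calls would need a recursive walk, omitted here.
    """
    sanitized = {key: redact(value) if isinstance(value, str) else value
                 for key, value in record.items()}
    sink(sanitized)
```

Because the sink only ever receives the sanitized copy, the raw values never reach the observability platform even if a later stage misbehaves.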

§5. The 5 monitoring signals — concrete implementation

(Aligned with framework.md §21.)

Signal 1 — Output distribution shift (drift)

What: Compare production output distribution against eval baseline (captured in template 08 §6).

How:

  • For numeric outputs (confidence scores, match scores, amounts) → Population Stability Index (PSI) vs baseline
  • For categorical outputs (decisions, classifications) → KL-divergence vs baseline
  • For LLM-text outputs → semantic similarity (embedding distance) vs sample of baseline

Thresholds: PSI 0.0–0.1 stable / 0.1–0.2 watch / >0.2 alert. KL-divergence: similar bands.

Cadence: Hourly recompute against rolling 24-hour window.

Alarm: PSI > 0.2 for 2 consecutive hours → Sev-2 alert to CoE Lead + Builder.

Tools: Arize AI, Langfuse Premium, Fiddler AI, Aporia. Custom on top of LangSmith if needed.
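For teams building the custom option, the drift math is small. The sketch below computes PSI over fixed bin edges as the sum of (actual − expected) × ln(actual / expected), computes KL-divergence for categorical outputs, and applies the bands and the two-hour alarm rule stated above. Function names, the epsilon smoothing for empty bins, and treating exactly 0.1 as "watch" are assumptions of this example.

```python
import math


def _proportions(values, edges, eps=1e-6):
    """Share of values per bin; eps keeps empty bins out of log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # number of edges at or below v
        counts[idx] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(baseline, production, edges):
    """Population Stability Index of production against baseline."""
    expected = _proportions(baseline, edges)
    actual = _proportions(production, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) for categorical distributions given as label -> probability."""
    total = 0.0
    for label in set(p) | set(q):
        pi = max(p.get(label, 0.0), eps)
        qi = max(q.get(label, 0.0), eps)
        total += pi * math.log(pi / qi)
    return total


def psi_band(value):
    """Map a PSI value to the stable / watch / alert bands."""
    if value > 0.2:
        return "alert"
    if value >= 0.1:
        return "watch"
    return "stable"


def sev2_drift_alarm(hourly_psi):
    """True when the two most recent hourly PSI values both exceed 0.2."""
    return len(hourly_psi) >= 2 and all(v > 0.2 for v in hourly_psi[-2:])
```

Bin edges should come from the eval baseline and stay fixed between recomputes; changing them alters the PSI value even when the data has not moved.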

Signal 2 — HITL escalation rate

What: Rate at which the agent flags items to humans for approval/review.

How: count hitl_events / count executions per rolling 24h + 7d + 30d windows.

Thresholds:

  • Healthy band: expected baseline ± 20% (e.g., 7–11% if baseline is 9%)
  • Falling toward 0 = RED FLAG (agent overreaching or HITL gate disabled)
  • Rising sharply = scope problem or quality problem

Alarm: drops below 70% of baseline OR rises above 150% of baseline → Sev-2 alert.
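The band and alarm rules translate directly into code. In this Python sketch the function names and the intermediate "watch" label (outside the healthy band but not yet alarming) are assumptions added for illustration.

```python
def hitl_rate(hitl_event_count, execution_count):
    """HITL events divided by executions for one window."""
    if execution_count == 0:
        return 0.0  # no traffic in the window; treated as zero in this sketch
    return hitl_event_count / execution_count


def hitl_status(rate, baseline):
    """Classify a window's HITL rate against the expected baseline."""
    # Alarm: below 70% of baseline or above 150% of baseline.
    if rate < 0.70 * baseline or rate > 1.50 * baseline:
        return "sev2"
    # Healthy band: baseline plus or minus 20%.
    if abs(rate - baseline) <= 0.20 * baseline:
        return "healthy"
    return "watch"
```

With a 9% baseline, a rate of 6% or 14% returns "sev2", 9% returns "healthy", and 12% falls in between and returns "watch".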

Signal 3 — Decision audit trail (business-logic capture)

What: for each execution, what data was considered, what policy governed it, was the policy enforced, who reviewed.

How: every execution writes a decision record with:

  • Input context (sanitized)
  • Reasoning trace (LLM chain-of-thought if available, otherwise structured rationale)
  • Policy checks fired (from §1 field 14)
  • Final action + actor (agent or human)
  • Tool calls + results

Storage: retained per §3 retention rules. Auditor-queryable.

Why this matters: EU AI Act Article 12 + 19 — required for high-risk AI. Tokens + latency alone don't satisfy.
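One possible shape for the decision record is sketched below in Python. The class and field names are hypothetical; they mirror the five bullets above plus a `correlation_id` to join the record back to the §1 execution log.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical per-execution decision record (immutable once written)."""
    correlation_id: str    # joins to the §1 execution log
    input_context: str     # sanitized
    reasoning: str         # reasoning trace or structured rationale
    policy_checks: tuple   # guardrails fired, as in §1 field 14
    final_action: str
    actor: str             # "agent" or a human ID
    tool_calls: tuple      # tool calls with their results


def to_audit_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON line per decision keeps the trail appendable and easy to query from a log lake; the frozen dataclass is a small guard against records being edited after the fact.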

Signal 4 — Cost per execution at step level

What: cost is tracked NOT just per-execution-total but per-step (parser cost, LLM cost, tool-call cost, etc.).

Why: step-level cost reveals retry loops, unexpected tool invocations, prompt bloat — patterns that get hidden in total cost.

Thresholds:

  • Per-execution cost > 2× baseline → Sev-3 alert
  • Daily total > daily cap → Sev-2 alert
  • Any single step > 5× baseline → Sev-3 alert

Tools: LangSmith cost dashboards (per-step), Helicone, custom Grafana on Anthropic/OpenAI usage APIs.
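The three thresholds can be evaluated together once step-level costs are available. This Python sketch assumes costs arrive as a mapping of step name to USD; the function name and the alert tuple format are illustrative.

```python
def cost_alerts(step_costs, step_baselines, execution_baseline, daily_total, daily_cap):
    """Return (severity, reason) tuples for the Signal 4 thresholds."""
    alerts = []

    # Per-execution cost > 2x baseline -> Sev-3.
    if sum(step_costs.values()) > 2 * execution_baseline:
        alerts.append(("sev3", "execution cost > 2x baseline"))

    # Any single step > 5x its own baseline -> Sev-3.
    for step, cost in step_costs.items():
        baseline = step_baselines.get(step)
        if baseline and cost > 5 * baseline:
            alerts.append(("sev3", f"step {step!r} > 5x baseline"))

    # Daily total > daily cap -> Sev-2.
    if daily_total > daily_cap:
        alerts.append(("sev2", "daily total > daily cap"))

    return alerts
```

Steps with no recorded baseline are skipped rather than alerted on, which is a choice made for this sketch; a stricter version could flag them as unexpected steps.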

Signal 5 — Exception routing patterns

What: which inputs end up in the agent's exception path (instead of being handled inline)?

Why: Rising exception rate = data drift or scope drift. Falling exception rate (without scope expansion) = exception gate being bypassed.

How:

  • Count exceptions per cause category (from outcome + error fields)
  • Track per-category trend over 7d / 30d

Alarm: new exception category appearing > 5% of volume → Sev-3 review trigger.
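A minimal Python sketch of the new-category check follows. It assumes "volume" means total executions in the window and that the set of previously seen categories is maintained elsewhere; both are assumptions, as are the names used.

```python
from collections import Counter


def new_category_alerts(exception_categories, known_categories,
                        total_executions, threshold=0.05):
    """Return unseen exception categories whose share of executions exceeds the threshold."""
    counts = Counter(exception_categories)
    return sorted(
        category
        for category, n in counts.items()
        if category not in known_categories and n / total_executions > threshold
    )
```

For example, 8 `schema_mismatch` exceptions in 100 executions (8%) would be returned for Sev-3 review if that category had not been seen before, while the same 8 in 200 executions (4%) would not.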

§6. Adoption metrics — the standard taxonomy

For every production agent, the following adoption metrics MUST be tracked. (These often hide in dashboards as afterthoughts — promote them.)

| Metric | Definition | How |
|---|---|---|
| Unique users (active) | Distinct user IDs that triggered ≥1 execution in window | Count distinct initiator field per window |
| DAU / WAU / MAU | Unique users in last 1d / 7d / 30d | Standard time-windowed distinct count |
| Stickiness | DAU / MAU ratio | (Indicates daily-use depth vs occasional use) |
| Executions per user-week | Total executions / unique users / weeks | (Engagement depth indicator) |
| Time-to-first-use for new user | Days between identity provisioning + first execution | (Onboarding friction indicator) |
| Workflow coverage | % of in-scope inputs handled by agent (vs going to manual workflow) | Per agent; must be defined per use case |
| Reach by department | Unique users by department | (Cross-department adoption indicator) |
| Self-reported satisfaction | Quarterly user survey, 1–5 scale | Manual collection at quarterly review |
| Churn / abandonment | Users active in prev period but not this period | Per period |

Cadence: Daily updates for usage metrics. Quarterly satisfaction surveys. Reported in template 14 (30-day review) §4 and template 16 (per-agent quarterly) §4.
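The usage metrics in the table reduce to distinct counts over the `initiator` field. The Python sketch below assumes events are available as `(initiator, date)` pairs, with system-trigger initiators already filtered out. Function names and the 30-day churn period are illustrative.

```python
from datetime import date, timedelta


def active_users(events, as_of, window_days):
    """Distinct initiators with at least one execution in the window ending on as_of."""
    start = as_of - timedelta(days=window_days - 1)
    return {user for user, day in events if start <= day <= as_of}


def adoption_snapshot(events, as_of):
    """DAU / WAU / MAU and stickiness (DAU / MAU) for one day."""
    dau = len(active_users(events, as_of, 1))
    wau = len(active_users(events, as_of, 7))
    mau = len(active_users(events, as_of, 30))
    return {"dau": dau, "wau": wau, "mau": mau,
            "stickiness": dau / mau if mau else 0.0}


def churned(events, as_of, period_days=30):
    """Users active in the previous period but not in the current one."""
    current = active_users(events, as_of, period_days)
    previous = active_users(events, as_of - timedelta(days=period_days), period_days)
    return previous - current
```

Keeping the window logic in one helper means DAU, WAU, MAU, and churn all share the same definition of "active", which avoids dashboards that disagree with each other.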

§7. Alert thresholds + escalation matrix

| Severity | Trigger examples | Page within | Inform within |
|---|---|---|---|
| Sev-1 | Out-of-scope tool call attempted · Identity revoked · Confirmed data exfiltration · Customer-facing harmful output · Article 73 serious incident | 5 min | Executive Sponsor + General Counsel within 1 hr |
| Sev-2 | HITL acceptance < 70% of baseline for 2 days · Drift PSI > 0.2 sustained 2h · Sustained error rate > 5% per hour · Cost > 2× daily cap | 15 min | Department Head within 4 hr |
| Sev-3 | Cost spike > 2× baseline (single day) · Rate-limit hit · Transient outage < 30 min · New exception category > 5% | 30 min | (none; handled by on-call) |
| Sev-4 | Cosmetic issue · Single-case quirk · Performance degradation < threshold | Next business day | (none) |

Page routing: PagerDuty / Opsgenie / Splunk On-Call schedules per agent in Agent Card §11. Each agent has primary + backup on-call named.
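The matrix can also be held as data so alert routing and SLA checks read from one place. This Python sketch is illustrative: the dictionary layout and the `page_deadline` helper are assumptions, and Sev-4's "next business day" is represented as `None` because it depends on a business calendar not modeled here.

```python
from datetime import datetime, timedelta

# Escalation matrix as data, restating the table above.
ESCALATION = {
    "sev1": {"page_within_min": 5,
             "inform": "Executive Sponsor + General Counsel", "inform_within_hr": 1},
    "sev2": {"page_within_min": 15,
             "inform": "Department Head", "inform_within_hr": 4},
    "sev3": {"page_within_min": 30, "inform": None, "inform_within_hr": None},
    "sev4": {"page_within_min": None, "inform": None, "inform_within_hr": None},
}


def page_deadline(severity, detected_at):
    """Latest time by which on-call must be paged, or None for next-business-day handling."""
    minutes = ESCALATION[severity]["page_within_min"]
    if minutes is None:
        return None
    return detected_at + timedelta(minutes=minutes)
```

A Sev-2 detected at 12:00 yields a paging deadline of 12:15, which a periodic job could compare against the actual page timestamp to report SLA misses.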

§8. Recommended tooling matrix

| Layer | Recommended options | Notes |
|---|---|---|
| AI-native observability (primary) | LangSmith, Helicone, Langfuse, Arize AI, Braintrust, Maxim AI, AgentOps, Phoenix by Arize | Pick one as primary. Wire it from agent's first execution in dev. |
| General observability (pipe-through) | Datadog, Grafana + Loki, Microsoft Sentinel, Splunk, Elastic, New Relic, Honeycomb | Pipe AI events here so they sit alongside the rest of the company's monitoring. |
| Drift detection | Arize AI, Aporia, Fiddler AI, Langfuse Premium, custom | Use what already exists where possible. |
| Cost monitoring | LangSmith cost (LLM-specific), Helicone, Portkey (gateway-side), AWS Cost Explorer, Cloudability | Step-level cost is key, not just total. |
| Alerting / paging | PagerDuty, Opsgenie, Splunk On-Call, Microsoft Teams alerts | Use existing company tools. |
| Long-term cold archive | S3 Glacier, Azure Archive Storage, Google Coldline | Set lifecycle on §3 retention rules. |
| AI gateway (centralizes routing + control) | Portkey, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, Helicone Gateway | Adds policy/cost enforcement layer; worth it past ~10 agents. |
| Prompt management + versioning | LangSmith, Humanloop, PromptHub, Promptfoo, PromptLayer | Tied to source control. |
| Evaluation harness | Promptfoo, Ragas, DeepEval, OpenAI Evals, LangSmith Evals, Braintrust | Run at pre-prod (template 08) + quarterly re-eval. |

ACME's choice (worked example): LangSmith (primary) → Datadog (pipe-through) → S3 Glacier (cold archive). Arize for drift on Medium+ agents. PagerDuty for alerts. Portkey for AI gateway (added in Q3 2026).

§9. Implementation steps — Platform Team rollout

When this standard is first published, the Platform Team implements as follows:

| # | Step | Owner | Effort |
|---|---|---|---|
| 1 | Select observability platform (LangSmith / Helicone / Langfuse / Arize) | Platform Lead + CoE Lead | 1 week to evaluate |
| 2 | Set up shared org-level account with per-agent projects | Platform Team | 1 day |
| 3 | Define the per-execution log emit hook in orchestrator (n8n, LangGraph) to auto-emit §1 fields | Platform Engineer | 2–3 days |
| 4 | Build the per-agent dashboard template (covering all §2 tier requirements) | Platform Engineer | 3–5 days |
| 5 | Configure log forwarding to Datadog (or equivalent) | Platform Engineer | 1 day |
| 6 | Configure cold archive lifecycle (S3 Glacier or equivalent) per §3 retention rules | Platform Engineer | 1 day |
| 7 | Wire alert thresholds (§7) into PagerDuty | Platform Engineer | 2 days |
| 8 | Build drift-detection pipeline for Medium+ agents | Platform Engineer + Data | 1 week |
| 9 | Document onboarding checklist for new agents | Platform Lead + CoE Lead | 1 day |
| 10 | Run end-to-end test with one pilot agent (e.g., AP recon) | Platform Team | 1 day |

Total Platform Team effort: ~3 weeks one-time + ~0.5 day per new agent onboarding.

§10. When this standard changes

Triggers for revision:

  • New monitoring signal proven valuable (added to §5)
  • New sector overlay applies (e.g., expanding to healthcare → add PHI redaction to §4)
  • New observability platform adopted (update §8)
  • Annual review (always)
  • After any Sev-1 or repeat Sev-2 that reveals an observability gap (lesson fed back via template 10 post-mortem)

Version control: this document lives at templates/17-observability-standard.md. Update the template_version and add a changelog entry below.

§11. Changelog

| Version | Date | Change | Author |
|---|---|---|---|
| 1.0 | 2026-03-01 | Initial standard published | CoE Lead + Platform Lead + CISO |

Sign-off

| Role | Name | Date |
|---|---|---|
| AI CoE Lead | Morteza Moradi | 2026-03-01 |
| Platform Lead | Jess (IT) | 2026-03-01 |
| CISO | Pat Lee | 2026-03-01 |

Blank template (copy below for your company)

# [Company] — AI Agent Observability Standard v[X.X]

**Effective:** [YYYY-MM-DD]
**Owner:** Platform Team · AI CoE Lead
**Approval:** [signers]
**Next review:** [YYYY-MM-DD]

## §1. Required per-execution log fields

| # | Field | Type | Notes |
|---|---|---|---|
| 1 | timestamp_start | ISO-8601 UTC | |
| 2 | timestamp_end | ISO-8601 UTC | |
| 3 | agent_id | string | From Agent Card §1 |
| 4 | agent_version | string | semver |
| 5 | correlation_id | UUID | |
| 6 | initiator | string | User ID or system trigger ID |
| 7 | prompt | string (sanitized) | |
| 8 | response | string (sanitized) | |
| 9 | tool_calls | array | name, parameters, status, latency, cost |
| 10 | model | string | provider + model + version |
| 11 | tokens_in | integer | |
| 12 | tokens_out | integer | |
| 13 | cost_usd | decimal | |
| 14 | policy_checks | array | guardrails fired + results |
| 15 | hitl_events | array | gate triggers + decisions |
| 16 | latency_per_step_ms | array | |
| 17 | outcome | enum | success / failure / human_overridden / exception_routed |
| 18 | error | object | if present |

## §2. Dashboards by tier

[Required for all tiers table]

[Additional for Medium tier table]

[Additional for High tier table]

## §3. Log retention by tier and jurisdiction

| Scope | Active hot | Cold archive | Total |
|---|---|---|---|
| Low | [duration] | [duration] | |
| Medium | [duration] | [duration] | |
| High (default) | [duration] | [duration] | |
| High + EU AI Act Annex III | [duration] | [duration] | |
| Any SOX-relevant | per tier | [duration] | |
| Any HIPAA-relevant | per tier | [duration] | |

## §4. PII / data-class redaction in logs

[Table mapping each data class to logging treatment]

## §5. The 5 monitoring signals — concrete implementation

### Signal 1 — Output distribution shift
- What:
- How:
- Thresholds:
- Cadence:
- Alarm:
- Tools:

### Signal 2 — HITL escalation rate
[same shape]

### Signal 3 — Decision audit trail
[same shape]

### Signal 4 — Cost per execution at step level
[same shape]

### Signal 5 — Exception routing patterns
[same shape]

## §6. Adoption metrics — the standard taxonomy

| Metric | Definition | How |
|---|---|---|
| Unique users (active) | | |
| DAU / WAU / MAU | | |
| Stickiness | | |
| Executions per user-week | | |
| Time-to-first-use for new user | | |
| Workflow coverage | | |
| Reach by department | | |
| Self-reported satisfaction | | |
| Churn / abandonment | | |

## §7. Alert thresholds + escalation matrix

| Severity | Trigger | Page within | Inform within |
|---|---|---|---|
| Sev-1 | | | |
| Sev-2 | | | |
| Sev-3 | | | |
| Sev-4 | | | |

## §8. Recommended tooling matrix

[Per-layer recommendations + company's chosen tools]

## §9. Implementation steps — Platform Team rollout

[Steps + owners + effort]

## §10. When this standard changes

[Revision triggers]

## §11. Changelog

| Version | Date | Change | Author |
|---|---|---|---|

## Sign-off

| Role | Name | Date |
|---|---|---|
| AI CoE Lead | | |
| Platform Lead | | |
| CISO | | |

Usage notes

  • This is published ONCE, then referenced from every Agent Card §11. Don't restate it per agent.
  • Platform Team owns the infrastructure behind this standard. AI CoE Lead owns the policy.
  • The five monitoring signals from framework.md §21 are non-negotiable for Medium+ agents. Don't trim. Some can be deferred for Low-tier agents (e.g., a drift detector is unnecessary if the agent rarely runs).
  • Adoption metrics are often the first to get cut when teams are under deadline pressure. Don't let it happen — adoption is the proof of value at quarterly review (template 16).
  • Update this document after any incident that reveals a gap. Post-mortem (template 10) action items often update this standard.
  • Where several retention rules in §3 apply, the longest one wins. If an agent is Medium-tier AND SOX-adjacent, the SOX 7-year rule applies.
  • AI gateway adoption (Portkey / LiteLLM / etc.) becomes worth it past ~10 production agents. Don't add complexity early; add it when needed.

Common pitfalls

| Pitfall | What it looks like | Fix |
|---|---|---|
| Each agent invents its own logging | 12 agents, 12 dashboards, no cross-agent visibility | Standardize §1 fields; auto-emit from orchestrator |
| "We'll add logging later" | Pilot runs blind for 30 days | Wire observability from day 0 of dev |
| Cost only tracked at total level | Step-level retry loops hide for months | §5 Signal 4: step-level required |
| Drift detection skipped for Medium | "It's not high-risk, we don't need drift" | Medium tier needs drift too; quality degradation is invisible without it |
| Adoption metrics missing | Agent works technically but nobody uses it | §6: track DAU/WAU/MAU + churn explicitly |
| PII appears in logs | Redaction filter not wired or has gaps | §4: runtime layer must redact BEFORE log emit |
| Retention shorter than regulation | EU AI Act Annex III agent with 30-day default retention | §3: apply longest applicable rule, configure cold archive |
| Alerts not paged to humans | Email alerts only, nobody reads | §7: page via PagerDuty/Opsgenie; phone number, not inbox |
| Drift alarm fires, no investigation | Signal exists, response process doesn't | §7 escalation matrix + on-call rotation per Agent Card §11 |

Framework cross-references

  • framework.md §21 (5 monitoring signals — operationalized in §5)
  • framework.md §24 (observability — fields list in §1)
  • framework.md §17 (privileged identities — audit attribution requires named initiator in §1 field 6)
  • framework.md §10 (risk tier — drives tier-specific dashboard + retention rules in §2 + §3)
  • framework.md §22.1 EU AI Act Article 12 + 19 (record-keeping + retention)
  • framework.md §22.1 EU AI Act Article 72 (post-market monitoring)
  • framework.md §22.2 NIST AI RMF MEASURE function
  • framework.md §22.2.1 NIST AI 600-1 GenAI Profile (monitoring patterns)
  • framework.md §22.3 ISO/IEC 42001 Clause 9.1
  • workflows.md Step A6 (standards documents — this is one of them)
  • workflows.html → In Action view → node M14 Pilot (must be live BEFORE pilot starts)
  • Companion templates: 09-runbook.md (per-agent operational), 14-30-day-review.md (first formal results check), 16-per-agent-quarterly-review.md (quarterly results)