Purpose

The runbook is what the on-call person reads at 2am when something goes wrong. It's not a manual; it's a fast-reference cheat sheet covering the agent's specific failure modes, where its dashboards are, how to kill it, and how to escalate.

Every production agent has one. It lives next to the agent's code in source control, and the named on-call reads it before pilot-to-prod sign-off (template 07 item 7).

When you write it: During Build (M12), before Evaluate (M13). Update it whenever a Sev-2+ incident reveals a gap (template 10 action items often include runbook updates).
Who writes it: Agent Builder, in collaboration with the named on-call.
Length target: 2–4 pages. Anything longer means it won't be read at 2am.
Format: Markdown in the agent's repo at runbook.md (or RUNBOOK.md).

Worked example (AP Accountant invoice reconciliation)

Runbook — `finance-invoice-recon` v1.0

Owner: Mike Chen (AP Manager)
CoE Lead: Morteza Moradi
Last updated: 2026-06-12 (pre-prod launch)
On-call rotation: PagerDuty schedule finance-ai-primary (primary) + finance-ai-backup (CoE Lead backup)

1. What this agent does (one paragraph)

Reads vendor invoice PDFs from the AP accountant's Gmail mailbox. Extracts vendor name, total amount, line items, and PO reference using a deterministic parser. Looks up the matching open PO in NetSuite. Scores match confidence (0.0–1.0). Drafts a reconciliation proposal email into the accountant's Gmail drafts folder. The accountant reviews and approves. The agent never writes to NetSuite, never sends email externally, never approves payment.

2. Kill switch (60-second target)

If you need to stop the agent NOW:

1. LaunchDarkly flag → flip finance-invoice-recon-enabled to off
   URL: https://app.launchdarkly.com/projects/acme/flags/finance-invoice-recon-enabled
2. Verify: open LangSmith project finance-invoice-recon → confirm no new traces in the last 60s
3. If LaunchDarkly is down: disable the user agent-finance-invoice-recon in Microsoft Entra ID
   URL: https://entra.microsoft.com/users/agent-finance-invoice-recon
4. As a last resort: stop the n8n workflow finance-invoice-recon-prod in the n8n admin console

Drill: tested 2026-06-08, kill propagated in 23 seconds.
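
The "no new traces in the last 60s" verification can be scripted once trace start times are exported. A minimal sketch, assuming the timestamps are already available as timezone-aware `datetime` values; `no_recent_traces` is a hypothetical helper, not a LangSmith or LaunchDarkly API:

```python
from datetime import datetime, timedelta, timezone


def no_recent_traces(trace_times, now, quiet_window=timedelta(seconds=60)):
    """Return True if no trace started inside the quiet window ending at `now`."""
    cutoff = now - quiet_window
    return all(t < cutoff for t in trace_times)


# Illustrative timestamps only; real values would come from your tracing tool's export.
now = datetime(2026, 6, 8, 2, 1, 30, tzinfo=timezone.utc)
old_trace = now - timedelta(seconds=80)   # started before the quiet window
new_trace = now - timedelta(seconds=30)   # started inside the quiet window

print(no_recent_traces([old_trace], now))             # True: agent looks stopped
print(no_recent_traces([old_trace, new_trace], now))  # False: agent is still running
```

Passing `now` in explicitly keeps the check deterministic, which makes it easy to reuse in a kill-switch drill.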

3. Dashboards

LangSmith (primary): https://smith.langchain.com/o/acme/projects/finance-invoice-recon (shows per-execution traces, cost, latency, errors, HITL events)
Datadog (overlays): dashboard Finance AI — cost-per-day + latency p95 + error rate widgets
AWS Secrets Manager: confirm credentials current → /agents/finance-invoice-recon/*
NetSuite admin: role AI-Recon-Reader usage logs (every 100 reads logged)

4. Top failure modes + diagnostics

4.1 Cost spike (Sev-3)

Symptom: Datadog "Finance AI" cost widget shows daily total > $20 (baseline: $5/day)
Diagnose:
1. Open LangSmith, filter to last 24h, sort by cost descending
2. Look for retry loops (same execution ID repeated) or unusually long prompts
3. Check Anthropic dashboard for token usage spikes
Common causes: malformed PDF causing retries; vendor sending a new format that confuses the parser
Mitigation: if a new vendor format is the cause, hotfix the parser; if the fix is not urgent, route that vendor to the manual queue for the day and reach out to the Builder
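
Diagnosis steps 1 and 2 (sort by cost, look for repeated execution IDs) can be approximated offline on exported trace records. A rough sketch, assuming each record is a dict with `execution_id` and `cost_usd` keys; both field names and the `min_repeats` default are illustrative assumptions, not the LangSmith export schema:

```python
from collections import defaultdict


def find_retry_loops(traces, min_repeats=3):
    """Return (execution_id, count, total_cost) for IDs seen at least
    min_repeats times, most expensive first."""
    totals = defaultdict(lambda: {"count": 0, "cost_usd": 0.0})
    for t in traces:
        entry = totals[t["execution_id"]]
        entry["count"] += 1
        entry["cost_usd"] += t["cost_usd"]
    suspects = [
        (exec_id, v["count"], round(v["cost_usd"], 2))
        for exec_id, v in totals.items()
        if v["count"] >= min_repeats
    ]
    return sorted(suspects, key=lambda s: s[2], reverse=True)


# Made-up sample data: execution b2 ran three times, a1 once.
sample = [
    {"execution_id": "a1", "cost_usd": 0.04},
    {"execution_id": "b2", "cost_usd": 0.90},
    {"execution_id": "b2", "cost_usd": 0.95},
    {"execution_id": "b2", "cost_usd": 1.10},
]
print(find_retry_loops(sample))  # [('b2', 3, 2.95)]
```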

4.2 HITL acceptance rate drops below 70% (Sev-2)

Symptom: LangSmith dashboard HITL panel shows acceptance < 70% over rolling 7-day window
Diagnose:
1. Open LangSmith → filter to last 7 days → group by confidence_score
2. Identify whether drop is across all confidence levels or concentrated in mid-confidence
3. Pull 10 rejected proposals; categorize root cause
Common causes: prompt drift after a model version update; NetSuite added a new field confusing the matcher; new vendor pattern not in golden set
Mitigation: revert to previous prompt version via Git tag prompt-v1.0; pause if cause unclear (use kill switch); ping Builder to investigate
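
Diagnosis step 2 (is the drop spread across all confidence levels or concentrated in mid-confidence?) comes down to an acceptance rate per confidence bucket. A minimal sketch, assuming review records carry the `confidence_score` from the traces plus a boolean `accepted` field; the `accepted` name and the bucket edges are assumptions chosen for illustration:

```python
def acceptance_by_bucket(reviews, edges=(0.0, 0.5, 0.8, 1.0)):
    """Return {bucket_label: acceptance_rate} for each bucket that has reviews."""
    buckets = {}
    for lo, hi in zip(edges, edges[1:]):
        # The top bucket is closed on the right so a score of exactly 1.0 counts.
        in_bucket = [
            r for r in reviews
            if lo <= r["confidence_score"] < hi
            or (hi == edges[-1] and r["confidence_score"] == hi)
        ]
        if in_bucket:
            accepted = sum(1 for r in in_bucket if r["accepted"])
            buckets[f"{lo:.1f}-{hi:.1f}"] = accepted / len(in_bucket)
    return buckets


# Made-up reviews: high-confidence proposals accepted, mid-confidence mostly rejected.
reviews = [
    {"confidence_score": 0.95, "accepted": True},
    {"confidence_score": 0.90, "accepted": True},
    {"confidence_score": 1.00, "accepted": True},
    {"confidence_score": 0.60, "accepted": False},
    {"confidence_score": 0.70, "accepted": True},
    {"confidence_score": 0.65, "accepted": False},
]
rates = acceptance_by_bucket(reviews)
for label, rate in rates.items():
    print(label, f"{rate:.0%}")
```

A low rate in only the middle bucket points at the matcher's borderline cases; a drop in every bucket points at something global such as a prompt or model change.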

4.3 NetSuite API rate limit (Sev-3)

Symptom: Errors logged with code NS_RATE_LIMIT_429
Diagnose: Check NetSuite admin dashboard for current rate-limit usage
Common causes: month-end invoice spike + parallel job
Mitigation: agent has built-in retry-with-backoff. If errors persist > 10 minutes, reduce parallel execution in n8n workflow settings (queue depth)
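
For readers unfamiliar with retry-with-backoff, here is a generic sketch of exponential backoff with full jitter. It is not the agent's actual implementation, which this runbook does not describe; `RateLimitError`, the attempt count, and the delays are all hypothetical:

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error behind NS_RATE_LIMIT_429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait (exponential backoff, full jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Demo: a call that is rate-limited twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

The jitter spreads retries from parallel executions so they do not hit the API in lockstep, which matters during a month-end spike.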

4.4 Out-of-scope tool call attempted (Sev-1 — ALWAYS PAGE)

Symptom: PagerDuty incident ai-out-of-scope fires; LangSmith shows a tool call to NetSuite.write or Gmail.send that was blocked by the tool allowlist
Diagnose:
1. Pull the full trace from LangSmith
2. Check the input that caused it (PDF source? user request?)
3. Look for prompt-injection pattern in the input
Immediate action:
1. Activate kill switch (Section 2)
2. Preserve trace ID + input data
3. Page CoE Lead + Security (security-oncall PagerDuty schedule)
4. Do NOT re-enable until Security clears
Why Sev-1: this means the agent attempted an action it was specifically forbidden from doing. Either prompt injection succeeded (partially), or the prompt itself drifted. Both require investigation before resuming.
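
The allowlist behaviour behind this alert can be pictured as a guard that runs before every tool call. A simplified sketch, not the production guard: the allowed tool names are hypothetical, chosen to match the agent's read-and-draft scope, while `NetSuite.write` and `Gmail.send` are the blocked calls named in the symptom:

```python
# Hypothetical allowlist: read access plus drafting, nothing that writes or sends.
ALLOWED_TOOLS = frozenset({"Gmail.read", "Gmail.create_draft", "NetSuite.read"})


class OutOfScopeToolCall(Exception):
    """Raised when the agent requests a tool that is not on the allowlist."""


def guard_tool_call(tool_name, trace_id):
    """Return tool_name if allowed; otherwise raise so the caller can block and page."""
    if tool_name not in ALLOWED_TOOLS:
        raise OutOfScopeToolCall(f"blocked {tool_name} (trace {trace_id})")
    return tool_name


print(guard_tool_call("NetSuite.read", "trace-001"))  # NetSuite.read
try:
    guard_tool_call("NetSuite.write", "trace-002")
except OutOfScopeToolCall as err:
    print(err)  # blocked NetSuite.write (trace trace-002)
```

Including the trace ID in the error keeps the evidence needed for "Preserve trace ID + input data" attached to the alert itself.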

4.5 Identity disabled / credential revoked (Sev-1)

Symptom: Workflow runs fail with 401 Unauthorized or agent identity disabled
Diagnose: Check Entra ID for agent-finance-invoice-recon status; check Secrets Manager for credential validity
Common causes: scheduled rotation failed; security-driven revocation
Action: if scheduled rotation failure, re-run rotation; if security revocation, contact security-oncall before re-enabling

4.6 LLM provider outage (Sev-3)

Symptom: Anthropic API returns 503; n8n logs LLM_PROVIDER_UNAVAILABLE
Diagnose: Check status.anthropic.com
Mitigation: agent halts gracefully; new invoices queue in Gmail (no action lost); post in #ai-pilot-finance channel that agent is paused; resume when provider recovers (no manual restart needed — auto-resume on next trigger)

4.7 Drift alarm fires (Sev-2)

Symptom: Arize drift detector pages — output distribution shift > threshold
Diagnose:
1. Compare current vs baseline distribution in LangSmith
2. Check for upstream changes: NetSuite schema, new vendor pattern, model version update
Action: pause via kill switch if drift is severe; otherwise investigate within 24h; update prompt or eval set as needed
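
To make "output distribution shift" concrete, one common measure is the population stability index (PSI) over binned proportions. This is an illustrative choice only: the runbook does not say which metric or threshold the drift detector uses, and the bins below are made up.

```python
import math


def psi(baseline, current, eps=1e-6):
    """Population stability index between two binned distributions (lists of proportions)."""
    if len(baseline) != len(current):
        raise ValueError("distributions must use the same bins")
    total = 0.0
    for b, c in zip(baseline, current):
        b, c = max(b, eps), max(c, eps)  # avoid log(0) for empty bins
        total += (c - b) * math.log(c / b)
    return total


# Made-up share of proposals in low / mid / high confidence bins.
baseline = [0.1, 0.2, 0.7]
shifted = [0.3, 0.3, 0.4]

print(psi(baseline, baseline))           # 0.0: identical distributions
print(round(psi(baseline, shifted), 3))  # well above zero: a clear shift
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold that pages should be whatever the detector is actually configured with.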

5. Escalation path

| Severity | Page | Within | Inform within |
|---|---|---|---|
| Sev-1 (out-of-scope action, identity revoked, data exfil) | PagerDuty primary on-call + CoE Lead + Security | 5 min | Executive sponsor + General Counsel within 1 hour |
| Sev-2 (HITL collapse, drift, sustained error rate) | PagerDuty primary on-call + CoE Lead | 15 min | Department Head within 4 hours |
| Sev-3 (cost spike, rate-limit, transient outage) | PagerDuty primary on-call | 30 min | (none) |
| Sev-4 (cosmetic, single-invoice quirk) | Slack #ai-pilot-finance | Next business day | (none) |

6. Common operations

Promote a prompt change

1. Edit prompts/match-prompt.md on a feature branch
2. Run eval (template 08) against the current golden set; it must pass all thresholds
3. Open a PR with eval results attached → 2-reviewer approval
4. Merge to main → CI auto-deploys to dev → soak 24h → manual approval to prod
5. Tag the commit prompt-vX.Y for rollback reference

Rotate credentials

NetSuite OAuth: every 90 days via NetSuite admin → update Secrets Manager value → no agent restart needed (next execution picks up new value)
Anthropic API key: rotate via Anthropic console → update Secrets Manager → no restart needed
Verify: trigger one test execution → confirm success in LangSmith
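
The 90-day NetSuite OAuth cadence is easy to miss without a reminder. A small sketch of an age check, assuming the last rotation date is recorded somewhere you can read it; `rotation_status` and the 14-day warning window are hypothetical, only the 90-day limit comes from this runbook:

```python
from datetime import date


def rotation_status(last_rotated, today, max_age_days=90, warn_days=14):
    """Classify a credential as 'ok', 'due soon', or 'overdue' from its age in days."""
    age = (today - last_rotated).days
    if age >= max_age_days:
        return "overdue"
    if age >= max_age_days - warn_days:
        return "due soon"
    return "ok"


print(rotation_status(date(2026, 6, 1), date(2026, 7, 1)))   # ok (30 days old)
print(rotation_status(date(2026, 6, 1), date(2026, 8, 20)))  # due soon (80 days old)
print(rotation_status(date(2026, 6, 1), date(2026, 9, 1)))   # overdue (92 days old)
```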

Update the golden set

1. Pull 10–20 recent edge-case invoices (anonymized)
2. Add them to golden-set/ in the repo with human-labeled correct matches
3. Bump the golden-set version
4. Re-run eval (template 08); it should still pass thresholds with the expanded set
5. Open a PR with rationale

Daily check (first 30 days post-launch)

5-minute morning routine:

LangSmith dashboard → any red traces in last 24h?
Cost widget → within $5–10/day band?
HITL acceptance → above 90%?
Any pending PagerDuty incidents?

If all green, you're done. If any red, follow Section 4.
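
The four morning checks can be expressed as one function that returns whatever is red. A minimal sketch, assuming you read the four numbers off the dashboards by hand or from an export; `daily_check` and its parameter names are hypothetical, and the thresholds are the ones listed above:

```python
def daily_check(red_traces, cost_usd, hitl_acceptance, open_incidents):
    """Return the failed checks as a list of strings; an empty list means all green."""
    failures = []
    if red_traces > 0:
        failures.append("red traces in last 24h")
    if not 5.0 <= cost_usd <= 10.0:
        failures.append("cost outside $5-10/day band")
    if hitl_acceptance <= 0.90:
        failures.append("HITL acceptance not above 90%")
    if open_incidents > 0:
        failures.append("pending PagerDuty incidents")
    return failures


print(daily_check(red_traces=0, cost_usd=7.5, hitl_acceptance=0.95, open_incidents=0))  # []
print(daily_check(red_traces=2, cost_usd=22.0, hitl_acceptance=0.65, open_incidents=1))
```

Cost below the band is flagged as well as cost above it, on the assumption that an unusually cheap day may mean the agent is not processing invoices.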

7. Contacts

| Role | Name | Channel |
|---|---|---|
| Primary on-call (Finance Champion) | Mike Chen | PagerDuty `finance-ai-primary`, Slack `@mike.chen` |
| Backup on-call (CoE Lead) | Morteza Moradi | PagerDuty `finance-ai-backup`, Slack `@morteza` |
| Security on-call | (current rotation) | PagerDuty `security-oncall` |
| Builder (for code questions) | Same as CoE Lead | (see CoE Lead row) |
| Anthropic support | enterprise@anthropic.com | (For account-level issues) |
| NetSuite admin | Jane Roe | Slack `@jane.roe` |

8. Reference

Agent Card: github.com/acme/agents/finance-invoice-recon/AGENT_CARD.md
Source code: github.com/acme/agents/finance-invoice-recon
Registry entry: [registry URL]
Threat model: templates/05-threat-model--finance-invoice-recon.md
Most recent eval report: eval-reports/eval-2026-05-09.md
Most recent post-mortem (if any): post-mortems/

9. Changelog

| Date | Change | Author |
|---|---|---|
| 2026-06-12 | Initial runbook v1.0 for production launch | Morteza Moradi |

Blank template (copy below for your agent)

# Runbook — [Agent ID] v[X.X]

**Owner:** [Department Champion]
**CoE Lead:** [Name]
**Last updated:** [YYYY-MM-DD]
**On-call rotation:** [PagerDuty / Opsgenie schedule names]

## 1. What this agent does (one paragraph)

[Plain-English description of the agent's job. Specific about scope: what it does AND what it never does.]

## 2. Kill switch (target: 60 seconds)

If you need to stop the agent NOW:

1. [Step 1 — primary mechanism: feature flag / orchestrator disable]
2. [Verification step]
3. [Backup mechanism]
4. [Last-resort mechanism]

Drill: [last tested date + measured time]

## 3. Dashboards

- **[Primary observability]**: [URL]
- **[Secondary observability]**: [URL]
- **[Other ops dashboards]**: [URL]

## 4. Top failure modes + diagnostics

### 4.1 [Failure mode 1] ([Sev-X])

- **Symptom**:
- **Diagnose**:
- **Common causes**:
- **Mitigation**:

### 4.2 [Failure mode 2]

[repeat structure]

### 4.X [Out-of-scope action attempted] (Sev-1 — ALWAYS PAGE)

- **Symptom**:
- **Immediate action**:
  1. Activate kill switch
  2. Preserve trace
  3. Page CoE Lead + Security
  4. Do NOT re-enable without clearance

## 5. Escalation path

| Severity | Page | Within | Inform within |
|---|---|---|---|
| Sev-1 | | | |
| Sev-2 | | | |
| Sev-3 | | | |
| Sev-4 | | | |

## 6. Common operations

### Promote a [code/prompt/config] change
[Step-by-step]

### Rotate credentials
[Step-by-step]

### Update the golden set / retrain / etc.
[Step-by-step]

### Daily check (first 30 days post-launch)
[5-minute routine]

## 7. Contacts

| Role | Name | Channel |
|---|---|---|
| | | |

## 8. Reference

- Agent Card: [link]
- Source code: [link]
- Registry entry: [link]
- Threat model: [link]
- Most recent eval report: [link]
- Most recent post-mortem (if any): [link]

## 9. Changelog

| Date | Change | Author |
|---|---|---|
| | | |

Usage notes

Keep it short. 2–4 pages. Longer = unread at 2am.
Section 4 is the heart of the document. Real failure modes specific to this agent. Generic "agent fails" doesn't help.
Test the kill switch and record the measured time. "We have a kill switch" without timing isn't credible.
Update after every post-mortem. Template 10 (post-mortem) explicitly creates runbook updates as action items.
The named on-call reads it before pilot-to-prod sign-off — quiz them on 3 questions (Section 4 #1, Section 2 step 1, Section 5 Sev-1 escalation).

Common pitfalls

| Pitfall | What it looks like | Fix |
|---|---|---|
| Runbook too long | 12 pages of theory | Cut to the diagnostics and procedures. Theory lives in framework.md. |
| Generic failure modes | "Agent fails: check logs" | Specific to THIS agent: which symptoms, which root causes, which mitigations |
| Kill switch untested | "Procedure documented" with no drill | Run the drill; record the time |
| Out-of-scope handling missing | Only happy-path failures listed | Section 4.X: what to do when the agent tries something forbidden |
| Contacts not updated | Original owner left the company 6 months ago | Quarterly contact review |
| No changelog | Updates lost in git history | Section 9 is the at-a-glance record |

Framework cross-references

framework.md §25 (Security: Discover / Protect / Detect — Section 4 of runbook)
framework.md §24 (observability fields used in Section 3 dashboards)
framework.md §21 (5 monitoring signals — Sections 4 + 6)
framework.md §17 (privileged identities — Section 4.5)
framework.md §11.2 (per-agent lifecycle — runbook lives across Pilot → Production → Quarterly)
workflows.md Step 9 (Build → "Write runbook")
workflows.html → In Action view → node M12 (Build) + M16 (Continuous monitoring)