Purpose
The runbook is what the on-call person reads at 2am when something goes wrong. It is not a manual; it is a fast-reference cheat sheet covering the agent's specific failure modes, its dashboards, its kill switch, and its escalation path.
Every production agent has one. It lives next to the agent's code in source control and is read by the named on-call before pilot-to-prod sign-off (template 07 item 7).
- When you write it: During Build (M12), before Evaluate (M13). Updated whenever a Sev-2+ incident reveals a gap (template 10 action items often update the runbook).
- Who writes it: Agent Builder, in collaboration with the named on-call.
- Length target: 2–4 pages. Anything longer means it won't be read at 2am.
- Format: Markdown in the agent's repo at `runbook.md` (or `RUNBOOK.md`).
Worked example (AP Accountant invoice reconciliation)
Runbook — finance-invoice-recon v1.0
Owner: Mike Chen (AP Manager) · CoE Lead: Morteza Moradi
Last updated: 2026-06-12 (pre-prod launch)
On-call rotation: PagerDuty schedule finance-ai-primary (primary) + finance-ai-backup (CoE Lead backup)
1. What this agent does (one paragraph)
Reads vendor invoice PDFs from the AP accountant's Gmail mailbox. Extracts vendor name, total amount, line items, and PO reference using a deterministic parser. Looks up the matching open PO in NetSuite. Scores match confidence (0.0–1.0). Drafts a reconciliation proposal email into the accountant's Gmail drafts folder. The accountant reviews and approves. The agent never writes to NetSuite, never sends email externally, never approves payment.
2. Kill switch (60-second target)
If you need to stop the agent NOW:
- LaunchDarkly flag → flip `finance-invoice-recon-enabled` to `off`. URL: `https://app.launchdarkly.com/projects/acme/flags/finance-invoice-recon-enabled`
- Verify: open LangSmith project `finance-invoice-recon` → confirm no new traces in the last 60s
- If LaunchDarkly is down: disable the user `agent-finance-invoice-recon` in Microsoft Entra ID. URL: `https://entra.microsoft.com/users/agent-finance-invoice-recon`
- As last resort: stop the n8n workflow `finance-invoice-recon-prod` in the n8n admin console
Drill: tested 2026-06-08, kill propagated in 23 seconds.
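The verification step ("no new traces in the last 60s") can be expressed as a small check. This is a minimal sketch, assuming you have already exported recent trace start times from LangSmith as timezone-aware datetimes; the function name `kill_switch_verified` and the sample data are hypothetical, and no LangSmith API is called here.

```python
from datetime import datetime, timedelta, timezone


def kill_switch_verified(trace_times, now, quiet_seconds=60):
    """Return True if no trace started within the last `quiet_seconds`."""
    cutoff = now - timedelta(seconds=quiet_seconds)
    return all(t < cutoff for t in trace_times)


# Hypothetical drill data: the two most recent trace start times.
now = datetime(2026, 6, 8, 2, 0, 0, tzinfo=timezone.utc)
trace_times = [now - timedelta(seconds=95), now - timedelta(seconds=200)]
print(kill_switch_verified(trace_times, now))  # True
```

Run the check at least 60 seconds after flipping the flag; a trace that started before the flip but inside the window would otherwise read as a failed kill.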
3. Dashboards
- LangSmith (primary): `https://smith.langchain.com/o/acme/projects/finance-invoice-recon`. Shows: per-execution traces, cost, latency, errors, HITL events
- Datadog (overlays): dashboard `Finance AI`, with cost-per-day + latency p95 + error rate widgets
- AWS Secrets Manager: confirm credentials current → `/agents/finance-invoice-recon/*`
- NetSuite admin: role `AI-Recon-Reader` usage logs (every 100 reads logged)
4. Top failure modes + diagnostics
4.1 Cost spike (Sev-3)
- Symptom: Datadog "Finance AI" cost widget shows daily total > $20 (baseline: $5/day)
- Diagnose:
  - Open LangSmith, filter to last 24h, sort by cost descending
  - Look for retry loops (same execution ID repeated) or unusually long prompts
  - Check Anthropic dashboard for token usage spikes
- Common causes: malformed PDF causing retries; vendor sending a new format that confuses the parser
- Mitigation: if a new vendor format is the cause, hotfix the parser; if the fix is not urgent, route that vendor to the manual queue for the day and contact the Builder
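The "sort by cost, look for repeated execution IDs" diagnosis can also be run offline against exported traces. The sketch below assumes each trace record is a dict with `execution_id` and `cost_usd` keys; those field names, the `summarize_cost` helper, and the threshold of 3 repeats are illustrative assumptions, not the LangSmith export schema.

```python
from collections import defaultdict


def summarize_cost(traces, retry_threshold=3):
    """Rank executions by total cost and flag likely retry loops.

    Returns (ranked, loops): ranked is a list of
    (execution_id, run_count, total_cost) sorted by cost descending;
    loops lists execution IDs seen at least `retry_threshold` times.
    """
    runs = defaultdict(int)
    cost = defaultdict(float)
    for t in traces:
        runs[t["execution_id"]] += 1
        cost[t["execution_id"]] += t["cost_usd"]
    ranked = sorted(
        ((eid, runs[eid], round(cost[eid], 2)) for eid in runs),
        key=lambda row: row[2],
        reverse=True,
    )
    loops = [eid for eid, n, _ in ranked if n >= retry_threshold]
    return ranked, loops


# Hypothetical export: one execution retried four times, one normal run.
traces = [{"execution_id": "ex-1", "cost_usd": 0.5}] * 4 + [
    {"execution_id": "ex-2", "cost_usd": 0.1}
]
ranked, loops = summarize_cost(traces)
print(ranked)  # [('ex-1', 4, 2.0), ('ex-2', 1, 0.1)]
print(loops)   # ['ex-1']
```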
4.2 HITL acceptance rate drops below 70% (Sev-2)
- Symptom: LangSmith dashboard HITL panel shows acceptance < 70% over rolling 7-day window
- Diagnose:
  - Open LangSmith → filter to last 7 days → group by `confidence_score`
  - Identify whether drop is across all confidence levels or concentrated in mid-confidence
  - Pull 10 rejected proposals; categorize root cause
- Common causes: prompt drift after a model version update; NetSuite added a new field confusing the matcher; new vendor pattern not in golden set
- Mitigation: revert to previous prompt version via Git tag `prompt-v1.0`; pause if cause unclear (use kill switch); ping Builder to investigate
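To tell whether the acceptance drop is spread across all confidence levels or concentrated in the middle, acceptance can be computed per confidence band. This is a sketch under stated assumptions: the band cut-offs (0.5 and 0.8) and the `accepted` field are hypothetical; only `confidence_score` and its 0.0–1.0 range come from this runbook.

```python
from collections import defaultdict


def band(score):
    """Map a 0.0-1.0 confidence score to a coarse band (assumed cut-offs)."""
    if score < 0.5:
        return "low"
    if score < 0.8:
        return "mid"
    return "high"


def acceptance_by_band(proposals):
    """Return {band: acceptance_rate} over reviewed proposals."""
    totals = defaultdict(lambda: [0, 0])  # band -> [accepted, reviewed]
    for p in proposals:
        b = band(p["confidence_score"])
        totals[b][0] += int(p["accepted"])
        totals[b][1] += 1
    return {b: acc / n for b, (acc, n) in totals.items()}


# Hypothetical 7-day sample of reviewed proposals.
proposals = [
    {"confidence_score": 0.90, "accepted": True},
    {"confidence_score": 0.95, "accepted": True},
    {"confidence_score": 0.60, "accepted": False},
    {"confidence_score": 0.70, "accepted": False},
    {"confidence_score": 0.65, "accepted": True},
    {"confidence_score": 0.30, "accepted": False},
]
rates = acceptance_by_band(proposals)
print(rates["high"], rates["low"])  # 1.0 0.0
```

In this sample the high band is intact and the mid band sits at one in three, which would point at the matcher's borderline cases rather than a global regression.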
4.3 NetSuite API rate limit (Sev-3)
- Symptom: Errors logged with code `NS_RATE_LIMIT_429`
- Diagnose: Check NetSuite admin dashboard for current rate-limit usage
- Common causes: month-end invoice spike + parallel job
- Mitigation: agent has built-in retry-with-backoff. If errors persist > 10 minutes, reduce parallel execution in n8n workflow settings (queue depth)
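For readers unfamiliar with retry-with-backoff, the pattern looks roughly like the sketch below. It is not the agent's actual implementation: `RateLimitError`, `call_with_backoff`, and the delay parameters are hypothetical, and the sketch uses exponential backoff with full jitter as one common variant.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an NS_RATE_LIMIT_429 error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt; jitter spreads out retries.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Demo: a lookup that is rate-limited twice, then succeeds.
attempts = []


def flaky_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("NS_RATE_LIMIT_429")
    return "PO-found"


waits = []  # record delays instead of actually sleeping
print(call_with_backoff(flaky_lookup, sleep=waits.append))  # PO-found
print(len(waits))  # 2
```

Backoff only smooths short bursts. If errors persist past the 10-minute mark, retries are adding load rather than relieving it, which is why the mitigation switches to reducing parallelism.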
4.4 Out-of-scope tool call attempted (Sev-1 — ALWAYS PAGE)
- Symptom: PagerDuty incident `ai-out-of-scope` fires; LangSmith shows a tool call to `NetSuite.write` or `Gmail.send` that was blocked by the tool allowlist
- Diagnose:
  - Pull the full trace from LangSmith
  - Check the input that caused it (PDF source? user request?)
  - Look for prompt-injection pattern in the input
- Immediate action:
  - Activate kill switch (Section 2)
  - Preserve trace ID + input data
  - Page CoE Lead + Security (`security-oncall` PagerDuty schedule)
  - Do NOT re-enable until Security clears
- Why Sev-1: this means the agent attempted an action it was specifically forbidden from doing. Either prompt injection succeeded (partially), or the prompt itself drifted. Both require investigation before resuming.
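A tool allowlist of the kind that blocks these calls can be as simple as a set-membership check in front of every tool invocation. The sketch below is illustrative only: the allowed tool names, `guard_tool_call`, and `OutOfScopeToolCall` are assumptions; only `NetSuite.write` and `Gmail.send` are named in this runbook as blocked calls.

```python
# Hypothetical allowlist: read-only NetSuite access plus Gmail read/draft.
ALLOWED_TOOLS = frozenset({"Gmail.read", "Gmail.create_draft", "NetSuite.read"})


class OutOfScopeToolCall(Exception):
    """Raised when the agent requests a tool outside its allowlist."""


def guard_tool_call(tool_name, trace_id):
    """Return tool_name if allowed; otherwise raise with the trace ID attached."""
    if tool_name not in ALLOWED_TOOLS:
        # Deny by default: anything not explicitly listed is blocked.
        raise OutOfScopeToolCall(f"blocked {tool_name!r} (trace {trace_id})")
    return tool_name


print(guard_tool_call("NetSuite.read", "trace-001"))  # NetSuite.read
try:
    guard_tool_call("NetSuite.write", "trace-002")
except OutOfScopeToolCall as exc:
    print(exc)  # blocked 'NetSuite.write' (trace trace-002)
```

Carrying the trace ID in the error makes the "preserve trace ID + input data" step easier, because the alert itself points at the evidence.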
4.5 Identity disabled / credential revoked (Sev-1)
- Symptom: Workflow runs fail with `401 Unauthorized` or `agent identity disabled`
- Diagnose: Check Entra ID for `agent-finance-invoice-recon` status; check Secrets Manager for credential validity
- Common causes: scheduled rotation failed; security-driven revocation
- Action: if scheduled rotation failure, re-run rotation; if security revocation, contact `security-oncall` before re-enabling
4.6 LLM provider outage (Sev-3)
- Symptom: Anthropic API returns 503; n8n logs `LLM_PROVIDER_UNAVAILABLE`
- Diagnose: Check status.anthropic.com
- Mitigation: agent halts gracefully; new invoices queue in Gmail (no action lost); post in `#ai-pilot-finance` channel that agent is paused; resume when provider recovers (no manual restart needed; auto-resume on next trigger)
4.7 Drift alarm fires (Sev-2)
- Symptom: Arize drift detector pages — output distribution shift > threshold
- Diagnose:
  - Compare current vs baseline distribution in LangSmith
  - Check for upstream changes: NetSuite schema, new vendor pattern, model version update
- Action: pause via kill switch if drift is severe; otherwise investigate within 24h; update prompt or eval set as needed
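When comparing current and baseline distributions by hand, one simple measure of shift is total variation distance over output categories. This is a sketch of that comparison, not the metric the drift detector uses (the runbook does not specify it); the category names and counts are made up for illustration.

```python
def total_variation(baseline_counts, current_counts):
    """Total variation distance between two categorical distributions.

    Inputs are {category: count} dicts with at least one observation each.
    Returns a value in [0, 1]: 0 means identical, 1 means disjoint.
    """
    labels = set(baseline_counts) | set(current_counts)
    b_total = sum(baseline_counts.values())
    c_total = sum(current_counts.values())
    return 0.5 * sum(
        abs(baseline_counts.get(k, 0) / b_total - current_counts.get(k, 0) / c_total)
        for k in labels
    )


# Hypothetical proposal outcomes: baseline week vs. current week.
baseline = {"match": 80, "no_match": 15, "needs_review": 5}
current = {"match": 60, "no_match": 20, "needs_review": 20}
print(round(total_variation(baseline, current), 3))  # 0.2
```

A result of 0.2 means one fifth of the output mass has moved between categories; whether that counts as severe depends on the threshold agreed for this agent.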
5. Escalation path
| Severity | Page | Within | Inform within |
|---|---|---|---|
| Sev-1 (out-of-scope action, identity revoked, data exfil) | PagerDuty primary on-call + CoE Lead + Security | 5 min | Executive sponsor + General Counsel within 1 hour |
| Sev-2 (HITL collapse, drift, sustained error rate) | PagerDuty primary on-call + CoE Lead | 15 min | Department Head within 4 hours |
| Sev-3 (cost spike, rate-limit, transient outage) | PagerDuty primary on-call | 30 min | (none) |
| Sev-4 (cosmetic, single-invoice quirk) | Slack #ai-pilot-finance | Next business day | (none) |
6. Common operations
Promote a prompt change
- Edit `prompts/match-prompt.md` on a feature branch
- Run eval (template 08) against current golden set; must pass all thresholds
- PR with eval results attached → 2-reviewer approval
- Merge to `main` → CI auto-deploys to dev → soak 24h → manual approval to prod
- Tag the commit `prompt-vX.Y` for rollback reference
Rotate credentials
- NetSuite OAuth: every 90 days via NetSuite admin → update Secrets Manager value → no agent restart needed (next execution picks up new value)
- Anthropic API key: rotate via Anthropic console → update Secrets Manager → no restart needed
- Verify: trigger one test execution → confirm success in LangSmith
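A quick way to spot an overdue rotation is to compare the last rotation date against the interval. The sketch below uses the 90-day NetSuite OAuth interval stated above; the helper name and the dates are hypothetical, and no interval is stated here for the Anthropic API key, so pass whatever your policy requires as `max_age_days`.

```python
from datetime import date


def days_until_rotation(last_rotated, today, max_age_days=90):
    """Days left before a credential exceeds its interval (negative = overdue)."""
    return max_age_days - (today - last_rotated).days


print(days_until_rotation(date(2026, 6, 1), date(2026, 6, 12)))  # 79
print(days_until_rotation(date(2026, 1, 1), date(2026, 6, 12)))  # -72
```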
Update the golden set
- Pull 10–20 recent edge-case invoices (anonymized)
- Add to `golden-set/` in repo with human-labeled correct matches
- Bump golden-set version
- Re-run eval (template 08) — should still pass thresholds with expanded set
- PR with rationale
Daily check (first 30 days post-launch)
5-minute morning routine:
- LangSmith dashboard → any red traces in last 24h?
- Cost widget → within $5–10/day band?
- HITL acceptance → above 90%?
- Any pending PagerDuty incidents?
If all green, you're done. If any red, follow Section 4.
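The four checks can be captured as one function that returns the list of red items. This is a minimal sketch, assuming the four numbers are read off the dashboards by hand or by a script you already have; `daily_check` and its parameter names are hypothetical, and the thresholds are the ones listed above.

```python
def daily_check(red_traces, cost_usd, hitl_acceptance, open_incidents):
    """Return a list of findings; an empty list means all green."""
    findings = []
    if red_traces > 0:
        findings.append(f"{red_traces} red trace(s) in last 24h")
    if not 5 <= cost_usd <= 10:
        # Below-band cost is flagged too: it may mean the agent is not running.
        findings.append(f"cost ${cost_usd:.2f} outside $5-10/day band")
    if hitl_acceptance <= 0.90:
        findings.append(f"HITL acceptance {hitl_acceptance:.0%} not above 90%")
    if open_incidents > 0:
        findings.append(f"{open_incidents} pending PagerDuty incident(s)")
    return findings


print(daily_check(0, 6.40, 0.94, 0))  # []
print(daily_check(2, 21.00, 0.94, 0))
```

The second call returns two findings, red traces and an out-of-band cost, which would send you to Section 4.1.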
7. Contacts
| Role | Name | Channel |
|---|---|---|
| Primary on-call (Finance Champion) | Mike Chen | PagerDuty finance-ai-primary, Slack @mike.chen |
| Backup on-call (CoE Lead) | Morteza Moradi | PagerDuty finance-ai-backup, Slack @morteza |
| Security on-call | (current rotation) | PagerDuty security-oncall |
| Builder (for code questions) | Same as CoE Lead | — |
| Anthropic support | (vendor support) | enterprise@anthropic.com (for account-level issues) |
| NetSuite admin | Jane Roe | Slack @jane.roe |
8. Reference
- Agent Card: `github.com/acme/agents/finance-invoice-recon/AGENT_CARD.md`
- Source code: `github.com/acme/agents/finance-invoice-recon`
- Registry entry: [registry URL]
- Threat model: `templates/05-threat-model--finance-invoice-recon.md`
- Most recent eval report: `eval-reports/eval-2026-05-09.md`
- Most recent post-mortem (if any): `post-mortems/`
9. Changelog
| Date | Change | Author |
|---|---|---|
| 2026-06-12 | Initial runbook v1.0 for production launch | Morteza Moradi |
Blank template (copy below for your agent)
# Runbook — [Agent ID] v[X.X]
**Owner:** [Department Champion]
**CoE Lead:** [Name]
**Last updated:** [YYYY-MM-DD]
**On-call rotation:** [PagerDuty / Opsgenie schedule names]
## 1. What this agent does (one paragraph)
[Plain-English description of the agent's job. Specific about scope: what it does AND what it never does.]
## 2. Kill switch (target: 60 seconds)
If you need to stop the agent NOW:
1. [Step 1 — primary mechanism: feature flag / orchestrator disable]
2. [Verification step]
3. [Backup mechanism]
4. [Last-resort mechanism]
Drill: [last tested date + measured time]
## 3. Dashboards
- **[Primary observability]**: [URL]
- **[Secondary observability]**: [URL]
- **[Other ops dashboards]**: [URL]
## 4. Top failure modes + diagnostics
### 4.1 [Failure mode 1] ([Sev-X])
- **Symptom**:
- **Diagnose**:
- **Common causes**:
- **Mitigation**:
### 4.2 [Failure mode 2]
[repeat structure]
### 4.X [Out-of-scope action attempted] (Sev-1 — ALWAYS PAGE)
- **Symptom**:
- **Immediate action**:
1. Activate kill switch
2. Preserve trace
3. Page CoE Lead + Security
4. Do NOT re-enable without clearance
## 5. Escalation path
| Severity | Page | Within | Inform within |
|---|---|---|---|
| Sev-1 | | | |
| Sev-2 | | | |
| Sev-3 | | | |
| Sev-4 | | | |
## 6. Common operations
### Promote a [code/prompt/config] change
[Step-by-step]
### Rotate credentials
[Step-by-step]
### Update the golden set / retrain / etc.
[Step-by-step]
### Daily check (first 30 days post-launch)
[5-minute routine]
## 7. Contacts
| Role | Name | Channel |
|---|---|---|
| | | |
## 8. Reference
- Agent Card: [link]
- Source code: [link]
- Registry entry: [link]
- Threat model: [link]
- Most recent eval report: [link]
- Most recent post-mortem (if any): [link]
## 9. Changelog
| Date | Change | Author |
|---|---|---|
| | | |
Usage notes
- Keep it short. 2–4 pages. Longer = unread at 2am.
- Section 4 is the heart of the document. Real failure modes specific to this agent. Generic "agent fails" doesn't help.
- Test the kill switch and record the measured time. "We have a kill switch" without timing isn't credible.
- Update after every post-mortem. Template 10 (post-mortem) explicitly creates runbook updates as action items.
- The named on-call reads it before pilot-to-prod sign-off — quiz them on 3 questions (Section 4 #1, Section 2 step 1, Section 5 Sev-1 escalation).
Common pitfalls
| Pitfall | What it looks like | Fix |
|---|---|---|
| Runbook too long | 12 pages of theory | Cut to the diagnostics and procedures. Theory lives in framework.md. |
| Generic failure modes | "Agent fails: check logs" | Specific to THIS agent: which symptoms, which root causes, which mitigations |
| Kill switch untested | "Procedure documented" with no drill | Run the drill; record the time |
| Out-of-scope handling missing | Only happy-path failures listed | Section 4.X — what to do when the agent tries something forbidden |
| Contacts not updated | Original owner left the company 6 months ago | Quarterly contact review |
| No changelog | Updates lost in git history | Section 9 is the at-a-glance record |
Framework cross-references
- `framework.md` §25 (Security: Discover / Protect / Detect; Section 4 of runbook)
- `framework.md` §24 (observability fields used in Section 3 dashboards)
- `framework.md` §21 (5 monitoring signals; Sections 4 + 6)
- `framework.md` §17 (privileged identities; Section 4.5)
- `framework.md` §11.2 (per-agent lifecycle; runbook lives across Pilot → Production → Quarterly)
- `workflows.md` Step 9 (Build → "Write runbook")
- `workflows.html` → In Action view → node M12 (Build) + M16 (Continuous monitoring)