Purpose
A blameless retrospective of any Severity-2 or Severity-1 incident involving a production AI agent. Captures what happened, why, what worked, what didn't, and what changes prevent recurrence. The output is action items with named owners and due dates — those changes feed back into the framework, the runbook, the Agent Card, or the standards.
Severity definitions (use the company's existing severity scale if one exists):
- Sev-1: Customer impact OR data exposure OR regulatory exposure OR agent took an out-of-scope action OR financial loss. Mandatory post-mortem within 5 business days.
- Sev-2: Sustained quality degradation, sustained outage, broad user disruption, near-miss on Sev-1. Mandatory post-mortem within 10 business days.
- Sev-3: Brief outage, single-instance quirk, fixed before user impact. Post-mortem optional (CoE Lead decides). Note in the registry.
- Sev-4: Minor cosmetic. No post-mortem. Note in agent's issue tracker.
- When you use it: After any Sev-2 or higher incident in production.
- Who fills it: CoE Lead facilitates. Builder + on-call + affected stakeholders contribute.
- Tone: Blameless. Focus on systemic causes, not individual mistakes.
- Output: Post-mortem doc + action items committed to the registry + runbook updates as needed.
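The business-day deadlines above can be turned into a concrete due date. This is a minimal sketch, not part of the framework: it assumes a Saturday/Sunday weekend, ignores public holidays, and starts counting on the first business day after the incident. The function and constant names are illustrative.

```python
from datetime import date, timedelta
from typing import Optional

# Business-day deadlines taken from the severity definitions above.
POST_MORTEM_DEADLINE_BUSINESS_DAYS = {"Sev-1": 5, "Sev-2": 10}


def post_mortem_due(incident_day: date, severity: str) -> Optional[date]:
    """Return the post-mortem due date, or None when no post-mortem is mandatory."""
    remaining = POST_MORTEM_DEADLINE_BUSINESS_DAYS.get(severity)
    if remaining is None:
        return None  # Sev-3 is optional, Sev-4 needs none
    day = incident_day
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4; holidays are ignored (assumption)
            remaining -= 1
    return day


print(post_mortem_due(date(2026, 8, 14), "Sev-2"))  # 2026-08-28
```

If your organization keeps a holiday calendar, the weekday check is the place to consult it.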
Worked example (AP Accountant invoice reconciliation)
Post-Mortem — finance-invoice-recon Sev-2 incident, 2026-08-14
Author: Morteza Moradi (CoE Lead)
Date: 2026-08-19 (5 days post-incident)
Severity: Sev-2
Status: Closed (all action items completed)
1. Incident summary
On 2026-08-14 at 09:12 UTC, the finance-invoice-recon agent began producing incorrect PO matches at an elevated rate (~30% wrong vs. a 4% baseline). The issue went undetected until 10:38, when AP accountant Sarah Patel noticed it and reported it in #ai-pilot-finance. The agent was paused via kill switch at 10:51. Root cause: a NetSuite schema change shipped on 2026-08-13 introduced a new field for line-item discounts (lineItemDiscountType), which the agent's deterministic parser misinterpreted. Total wrong matches generated: 47. All were caught by Sarah at HITL review: no incorrect data reached NetSuite and no payments were authorized. The agent was restored to service on 2026-08-15 at 14:30 with a parser hotfix.
2. Timeline
| Time (UTC) | Event |
|---|---|
| 2026-08-13 23:00 | NetSuite ships schema change to production tenant (their scheduled maintenance window). No notification received because the change was classified by NetSuite as "additive" |
| 2026-08-14 06:00 | The day's first invoices begin arriving; agent processes them |
| 2026-08-14 09:12 | Wrong-match rate begins climbing — observable in LangSmith but no alert configured at this threshold |
| 2026-08-14 10:38 | Sarah Patel notices wrong matches; posts in #ai-pilot-finance |
| 2026-08-14 10:42 | Mike Chen (Finance Champion) acknowledges in Slack |
| 2026-08-14 10:51 | Kill switch activated by Mike Chen — agent paused |
| 2026-08-14 11:00 | CoE Lead notified; investigation begins |
| 2026-08-14 11:45 | Root cause identified: new NetSuite field lineItemDiscountType causing parser to misalign columns |
| 2026-08-14 16:00 | Parser hotfix branch created |
| 2026-08-15 09:00 | Hotfix PR reviewed |
| 2026-08-15 11:00 | Re-evaluated against golden set (96% accuracy, matching pre-incident baseline) |
| 2026-08-15 12:30 | Deployed to prod via standard CI/CD |
| 2026-08-15 14:30 | Agent re-enabled; closely monitored for next 24h |
| 2026-08-16 14:30 | Confirmed stable; incident closed operationally |
| 2026-08-19 | Post-mortem completed; action items filed |
3. Impact
| Dimension | Impact |
|---|---|
| Wrong matches generated | 47 (over ~1.5 hours of degraded operation, 09:12 to 10:51) |
| Wrong matches that reached NetSuite | 0 — all caught by HITL review |
| Payments incorrectly authorized | 0 |
| Customer impact | None (internal agent) |
| Regulatory exposure | None |
| Time the agent was paused | ~28 hours (10:51 Aug 14 → 14:30 Aug 15) |
| Sarah's time spent triaging wrong matches | ~3 hours over those 2 days |
| Agent downstream user impact | AP team reverted to manual processing for 1.5 days |
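
The duration figures in the timeline and impact tables can be derived directly from the timestamps rather than estimated by hand. The sketch below uses only timestamps that appear in the timeline; the metric names (time to detect, time to mitigate) are labels chosen for this example, not terms defined by the framework.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline above.
timeline = {
    "degradation_start": "2026-08-14 09:12",
    "human_report": "2026-08-14 10:38",
    "kill_switch": "2026-08-14 10:51",
    "re_enabled": "2026-08-15 14:30",
}
t = {name: datetime.strptime(stamp, FMT) for name, stamp in timeline.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((t[end] - t[start]).total_seconds() // 60)


time_to_detect = minutes_between("degradation_start", "human_report")  # 86
time_to_mitigate = minutes_between("human_report", "kill_switch")      # 13
paused_minutes = minutes_between("kill_switch", "re_enabled")          # 1659

hours, minutes = divmod(paused_minutes, 60)
print(f"Time to detect:   {time_to_detect} min")
print(f"Time to mitigate: {time_to_mitigate} min")
print(f"Agent paused:     {hours} h {minutes} min")
```

Computing these from the timeline keeps the summary, impact table, and "what went well" figures consistent with each other.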
4. What went well
- HITL gate worked as designed — every wrong match was caught before any NetSuite action. The agent never wrote a wrong payment.
- Kill switch worked end-to-end. Mike paused the agent in under 60 seconds once the decision to pause was made.
- Communication was prompt — Sarah's Slack message reached the right people fast.
- Hotfix turnaround was fast: about 27 hours from incident start (09:12 on Aug 14) to the fix deployed in production (12:30 on Aug 15).
- The audit trail in LangSmith was complete — root cause analysis was straightforward.
5. What went wrong
- Alert threshold gap. Wrong-match rate climbing from 4% baseline to 30% over 1.5 hours triggered no alarm. Current alerts only fire on Sev-1 indicators (out-of-scope tool calls) and broad metrics (cost spike, latency p95). Quality degradation at this magnitude should have paged.
- No vendor change-notification process. NetSuite shipped a schema change without explicit notice to the AP team. Our procurement process (template 13) requires change notification from AI vendors, but NetSuite is an upstream operational system, not an AI vendor, so that requirement did not cover this change.
- Detection depended on a human noticing. Had the reviewer been in meetings all morning, the degradation could easily have run for hours longer.
- No pre-prod canary against an updated schema. Our eval (template 08) runs against the golden set, which was captured before this NetSuite change. There's no automatic re-eval when an upstream system changes.
6. Root cause (5 whys)
- Q: Why did wrong matches start? A: Parser misaligned columns because of an unexpected new field in NetSuite responses.
- Q: Why did the parser misalign? A: It used positional column ordering, not column names.
- Q: Why didn't the deterministic parser catch this? A: The schema change was additive (new optional field), so parsing did not fail; it silently produced wrong values.
- Q: Why didn't we detect this within minutes? A: We monitor for errors and Sev-1 indicators, not for quality-rate degradation.
- Q: Why don't we monitor for quality-rate degradation? A: The 5 monitoring signals (framework §21) include HITL acceptance rate, but our alert threshold is set at < 70% over a 7-day window. This incident produced 1.5 hours of severe degradation that didn't move the 7-day rolling average enough to trigger.
Underlying systemic cause: Our monitoring is tuned for sustained problems, not sharp incidents.
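The difference between positional and name-based extraction can be shown in a few lines. This is an illustrative sketch only: the comma-separated row format, the column names other than lineItemDiscountType, and both function names are hypothetical, and the real NetSuite payload and agent parser may look quite different.

```python
# Hypothetical export rows: a header line followed by one invoice line item.
OLD_EXPORT = ["po_number,quantity,unit_price", "PO-1001,3,20.00"]
NEW_EXPORT = [
    "po_number,quantity,lineItemDiscountType,unit_price",
    "PO-1001,3,PERCENT,20.00",
]


def parse_positional(rows):
    """Fragile: assumes unit_price is always the third column."""
    values = rows[1].split(",")
    return {"po_number": values[0], "quantity": values[1], "unit_price": values[2]}


def parse_by_name(rows):
    """Robust to additive changes: looks each column up by its header name."""
    header = rows[0].split(",")
    record = dict(zip(header, rows[1].split(",")))
    return {key: record[key] for key in ("po_number", "quantity", "unit_price")}


print(parse_positional(NEW_EXPORT)["unit_price"])  # PERCENT (silently wrong)
print(parse_by_name(NEW_EXPORT)["unit_price"])     # 20.00
```

Note that the positional version raises no error on the new schema, which is why the failure showed up as a quality drop rather than as a parsing exception. The name-based version also fails loudly with a KeyError if a required column disappears.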
7. Action items
| # | Action | Severity | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | Switch parser to column-name-based extraction instead of positional | Hotfix | Builder | 2026-08-15 | ✅ Done (shipped 2026-08-15) |
| 2 | Add short-window HITL acceptance alert: fire if acceptance < 70% over rolling 2-hour window | Sev-2 prevention | Builder + Platform | 2026-09-01 | ✅ Done (2026-08-28) |
| 3 | Add short-window wrong-match-rate alert: fire if wrong-match rate > 15% over rolling 1-hour window | Sev-2 prevention | Builder | 2026-09-01 | ✅ Done (2026-08-28) |
| 4 | Update runbook (template 09) Section 4 with this failure mode: "upstream schema change" — diagnostics + mitigation | Runbook hygiene | CoE Lead | 2026-08-26 | ✅ Done |
| 5 | Add automated weekly re-eval against golden set + a synthetic test invoice that exercises every NetSuite field | Drift detection | Builder | 2026-10-01 | ✅ Done (2026-09-22) |
| 6 | Reach out to NetSuite to request notification of schema changes (this is a process gap, not just an AI gap) | Vendor relations | Mike Chen + IT | 2026-09-15 | ✅ Done — added to NetSuite ITSM watchlist |
| 7 | Document this as a framework lesson: drift detection should include short-window quality bands, not just rolling 7-day averages | Framework changelog | CoE Lead | 2026-08-30 | ✅ Done (framework changelog entry 2026-08-29) |
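
Action item 3 can be sketched as a sliding-window rate check. The 15% threshold and 1-hour window come from the table above; everything else is an assumption made for illustration: the class name, the min_samples guard, and the synthetic feed (one HITL review per minute, 4% wrong for seven days, then 30% wrong for the 99 minutes between 09:12 and 10:51). The 7-day instance reuses the same 15% threshold purely for comparison; it is not the production 7-day alert, which is defined on HITL acceptance.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingRateAlert:
    """Fires when the wrong-match rate inside a sliding time window exceeds a threshold."""

    def __init__(self, window: timedelta, threshold: float, min_samples: int = 20):
        self.window = window
        self.threshold = threshold
        self.min_samples = min_samples  # hypothetical guard against tiny samples
        self.events = deque()           # (timestamp, is_wrong) pairs, oldest first
        self.wrong = 0                  # running count of wrong matches in the window

    def record(self, timestamp: datetime, is_wrong: bool) -> bool:
        """Add one HITL review outcome; return True if the alert should fire."""
        self.events.append((timestamp, is_wrong))
        self.wrong += is_wrong
        cutoff = timestamp - self.window
        while self.events[0][0] <= cutoff:  # drop reviews that left the window
            _, dropped = self.events.popleft()
            self.wrong -= dropped
        total = len(self.events)
        return total >= self.min_samples and self.wrong / total > self.threshold


def synthetic_reviews():
    """Synthetic feed: 7 days at 1-in-25 wrong, then 99 minutes at 3-in-10 wrong."""
    start = datetime(2026, 8, 7, 9, 12)
    healthy_minutes = 7 * 24 * 60
    for minute in range(healthy_minutes + 99):
        if minute >= healthy_minutes:
            is_wrong = minute % 10 < 3
        else:
            is_wrong = minute % 25 == 0
        yield start + timedelta(minutes=minute), is_wrong


short = RollingRateAlert(timedelta(hours=1), threshold=0.15)
weekly = RollingRateAlert(timedelta(days=7), threshold=0.15)
short_fired_at = weekly_fired_at = None
for stamp, is_wrong in synthetic_reviews():
    if short.record(stamp, is_wrong) and short_fired_at is None:
        short_fired_at = stamp
    if weekly.record(stamp, is_wrong) and weekly_fired_at is None:
        weekly_fired_at = stamp

print("1-hour window first fired at:", short_fired_at)
print("7-day window first fired at: ", weekly_fired_at)
```

On this synthetic feed the 1-hour window fires within the first hour of degradation, while the 7-day window never fires because 99 bad minutes barely move a week-long average. Real review volume is burstier, so the minimum sample size needs tuning against actual traffic.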
8. Communications
- Sarah Patel was informed in real time (Slack #ai-pilot-finance)
- Mike Chen briefed Head of Finance on 2026-08-14
- CoE Lead briefed Executive Sponsor at the regular monthly 1:1 (2026-08-25)
- No external communication required (no customer impact, no regulatory exposure)
9. Regulatory considerations
- EU AI Act Article 73 (serious incident reporting) — N/A. This was not a serious incident under the Act's definition (no harm to persons, no breach of fundamental rights, no major property damage). For reference: if this had occurred in a Stage 3 autonomous configuration with NetSuite write enabled, AND wrong matches had caused actual payment errors, Article 73 reporting could apply. Our HITL gate prevented this.
- SOX — Not material. No financial reporting was affected. AP records remain accurate.
- Internal audit — Notification sent 2026-08-19; no further action requested.
10. Sign-off
| Role | Name | Date |
|---|---|---|
| Author | Morteza Moradi | 2026-08-19 |
| Reviewer (Department) | Mike Chen | 2026-08-20 |
| Reviewer (Security) | Pat Lee | 2026-08-20 |
| Executive sponsor (informed) | Jane Doe | 2026-08-25 |
Blank template (copy below for your incident)
# Post-Mortem — [Agent ID] [Severity] incident, [YYYY-MM-DD]
**Author:** [CoE Lead or delegate]
**Date:** [Date of write-up, target ≤ 5 business days for Sev-1, ≤ 10 business days for Sev-2]
**Severity:** [Sev-1 / Sev-2]
**Status:** [Open / Closed — all action items completed]
---
## 1. Incident summary
[3–5 sentences. What happened, when, how long, who was impacted, root cause in one sentence, current state.]
## 2. Timeline
| Time (UTC) | Event |
|---|---|
| | |
Include events from before the incident if relevant (e.g., upstream change a day earlier).
## 3. Impact
| Dimension | Impact |
|---|---|
| [Specific impact metric 1] | |
| [Specific impact metric 2] | |
| Customer impact | [None / describe] |
| Regulatory exposure | [None / describe] |
| Duration of impact / outage | |
| Downstream user impact | |
## 4. What went well
- [Item 1]
- [Item 2]
(Blameless — celebrate the parts of the response that worked.)
## 5. What went wrong
- [Item 1 — focus on systemic causes, not individuals]
- [Item 2]
## 6. Root cause (5 whys)
- Q: [Why did X happen?]
A: [Because Y]
- Q: [Why did Y happen?]
A: [Because Z]
- Q: [Why did Z happen?]
A: [...]
- Q: [...]
A: [...]
- Q: [...]
A: [Root systemic cause]
**Underlying systemic cause:** [One sentence]
## 7. Action items
| # | Action | Severity | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | | | | | |
Each action item must have:
- A specific deliverable (not "investigate")
- A named owner (one person)
- A due date
- A status track
## 8. Communications
- [Who was informed when]
- [Whether external communication was required]
## 9. Regulatory considerations
- **EU AI Act Article 73 serious incident reporting:** [Applicable? / N/A — reasoning]
- **Sector regulations:** [HIPAA breach? SOX material weakness? Other? / N/A — reasoning]
- **Internal audit:** [Notified / not applicable]
## 10. Sign-off
| Role | Name | Date |
|---|---|---|
| Author | | |
| Reviewer (Department) | | |
| Reviewer (Security) | | |
| Executive sponsor (informed) | | |
Usage notes
- Blameless. Names of individuals appear only in the timeline (as actors) and contacts. Section 5 ("what went wrong") focuses on systems and processes, not people. "Sarah noticed late" is not a finding — "we have no automated quality-degradation alert" is.
- Action items must close. A post-mortem with open action items 6 months later is a failed post-mortem. CoE Lead tracks closure at every quarterly review (template 15).
- Update upstream documents. When action items affect the runbook (template 09), Agent Card (template 03), or framework — update those documents. Don't let the post-mortem be the only record.
- EU AI Act Article 73 reporting. For serious incidents involving high-risk AI in scope of the Act, written notification to the relevant market surveillance authority is required no later than 15 days after becoming aware of the incident (2 days for a widespread infringement or a serious disruption of critical infrastructure, 10 days for a death). Section 9 forces this consideration.
- Sev-3 incidents may also produce post-mortems at CoE Lead discretion — especially if they reveal a systemic pattern. Don't artificially cap.
Common pitfalls
| Pitfall | What it looks like | Fix |
|---|---|---|
| Blameful | "Sarah didn't notice fast enough" | Re-frame: "We have no automated quality-degradation alert at short window" |
| Single why | Stops at first cause | Force 5 whys; the root cause is rarely the proximate cause |
| Vague action items | "Improve monitoring" | Specific: "Add HITL acceptance alert at <70% over 2-hour window. Owner X. Due YYYY-MM-DD" |
| Action items never close | Post-mortem filed, items tracked for a week, then forgotten | CoE Lead reviews open items at every quarterly review (template 15) |
| Regulatory not considered | Section 9 missing or N/A without reasoning | Force the question; "N/A" requires a sentence of why |
| Communications missed | Post-mortem written but stakeholders not briefed | Section 8 forces the comms check |
| Framework not updated | Lesson stays in this one document | Action item to update framework / standards explicitly |
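
Several of these pitfalls can be caught mechanically before sign-off. The following is a minimal sketch of an action-item check based on the four requirements listed in the blank template (specific deliverable, one named owner, due date, status). The ActionItem fields, the list of vague verbs, and the lint function are all hypothetical and would need adapting to however your registry stores action items.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical list of openers that signal an open-ended action rather than a deliverable.
VAGUE_VERBS = ("investigate", "improve", "look into", "consider", "explore")


@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    status: str


def lint(item: ActionItem) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    action = item.action.strip().lower()
    if not action:
        problems.append("missing action")
    elif action.startswith(VAGUE_VERBS):
        problems.append("vague action: name a specific deliverable")
    if not item.owner.strip():
        problems.append("missing owner")
    elif re.search(r"\+|,|&|\band\b", item.owner):
        problems.append("multiple owners: name one accountable person")
    if item.due is None:
        problems.append("missing due date")
    if not item.status.strip():
        problems.append("missing status")
    return problems


vague = ActionItem("Improve monitoring", "", None, "")
specific = ActionItem(
    "Add HITL acceptance alert at <70% over 2-hour window", "X", date(2026, 9, 1), "Open"
)
print(lint(vague))
print(lint(specific))  # []
```

A check like this would also flag shared ownership such as "Builder + Platform", which is a prompt to name the single accountable person even when several teams contribute.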
Framework cross-references
- framework.md §11.2 (continuous monitoring: post-mortems for Sev-2+)
- framework.md §21 (5 monitoring signals: gaps surfaced in post-mortems)
- framework.md §25.3 (Security: Detect, incident response)
- framework.md §22.1 EU AI Act Article 72 (post-market monitoring) + 73 (serious incident reporting)
- framework.md §22.2 NIST AI RMF MANAGE
- workflows.md Step 14 (Continuous monitoring: post-mortems for Sev-2+)
- workflows.html → In Action view → node M16 (Continuous monitoring) → loop back to M13 (re-evaluate)