Purpose

A blameless retrospective of any Severity-2 or Severity-1 incident involving a production AI agent. Captures what happened, why, what worked, what didn't, and what changes prevent recurrence. The output is action items with named owners and due dates — those changes feed back into the framework, the runbook, the Agent Card, or the standards.

Severity definitions (use the company's existing severity scale if one exists):

Sev-1: Customer impact OR data exposure OR regulatory exposure OR agent took an out-of-scope action OR financial loss. Mandatory post-mortem within 5 business days.
Sev-2: Sustained quality degradation, sustained outage, broad user disruption, near-miss on Sev-1. Mandatory post-mortem within 10 business days.
Sev-3: Brief outage or single-instance quirk, fixed before user impact. Post-mortem optional (CoE Lead decides). Note it in the registry.
Sev-4: Minor cosmetic issue. No post-mortem. Note it in the agent's issue tracker.

When you use it: After any Sev-2 or higher incident in production.
Who fills it: CoE Lead facilitates. Builder + on-call + affected stakeholders contribute.
Tone: Blameless. Focus on systemic causes, not individual mistakes.
Output: Post-mortem doc + action items committed to the registry + runbook updates as needed.
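
The post-mortem deadlines above are counted in business days, so it can help to compute them rather than estimate. A minimal sketch, assuming weekends are the only non-working days (public holidays are ignored) and that counting starts on the first business day after the incident; the function and table names are illustrative, not part of the framework:

```python
from datetime import date, timedelta
from typing import Optional

# Business-day deadlines taken from the severity definitions above.
# Sev-3 and Sev-4 have no mandatory post-mortem, so they map to None.
POSTMORTEM_DEADLINE_DAYS = {"Sev-1": 5, "Sev-2": 10, "Sev-3": None, "Sev-4": None}


def postmortem_due(incident_date: date, severity: str) -> Optional[date]:
    """Return the latest post-mortem date, or None if none is mandatory."""
    days = POSTMORTEM_DEADLINE_DAYS[severity]
    if days is None:
        return None
    current = incident_date
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4; skip Sat/Sun
            remaining -= 1
    return current
```

For a Sev-2 incident on 2026-08-14 (a Friday), `postmortem_due(date(2026, 8, 14), "Sev-2")` returns 2026-08-28.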

Worked example (AP Accountant invoice reconciliation)

Post-Mortem — `finance-invoice-recon` Sev-2 incident, 2026-08-14

Author: Morteza Moradi (CoE Lead)
Date: 2026-08-19 (5 days post-incident)
Severity: Sev-2
Status: Closed (all action items completed)

1. Incident summary

On 2026-08-14 at 09:12 UTC, the finance-invoice-recon agent began producing incorrect PO matches at an elevated rate (~30% wrong matches against a 4% baseline). The issue went undetected until 10:38, when AP accountant Sarah Patel noticed it and reported it in #ai-pilot-finance. The agent was paused via kill switch at 10:51. Root cause: NetSuite released a schema change on 2026-08-13 that introduced a new field type for line-item discounts, which the agent's deterministic parser misinterpreted. Total wrong matches generated: 47. All were caught by Sarah at HITL review, so no incorrect data reached NetSuite and no payments were authorized. The agent was restored to service on 2026-08-15 at 14:30 with a parser hotfix.

2. Timeline

Time (UTC)	Event
2026-08-13 23:00	NetSuite ships schema change to production tenant (their scheduled maintenance window). No notification received because the change was classified by NetSuite as "additive"
2026-08-14 06:00	The day's first invoices begin arriving; agent processes them
2026-08-14 09:12	Wrong-match rate begins climbing — observable in LangSmith but no alert configured at this threshold
2026-08-14 10:38	Sarah Patel notices wrong matches; posts in #ai-pilot-finance
2026-08-14 10:42	Mike Chen (Finance Champion) acknowledges in Slack
2026-08-14 10:51	Kill switch activated by Mike Chen — agent paused
2026-08-14 11:00	CoE Lead notified; investigation begins
2026-08-14 11:45	Root cause identified: new NetSuite field `lineItemDiscountType` causing parser to misalign columns
2026-08-14 16:00	Parser hotfix branch created
2026-08-15 09:00	Hotfix PR reviewed
2026-08-15 11:00	Re-evaluated against golden set (96% accuracy, matching pre-incident baseline)
2026-08-15 12:30	Deployed to prod via standard CI/CD
2026-08-15 14:30	Agent re-enabled; closely monitored for next 24h
2026-08-16 14:30	Confirmed stable; incident closed operationally
2026-08-19	Post-mortem completed; action items filed

3. Impact

Dimension	Impact
Wrong matches generated	47 (over ~1.5 hours of degraded operation, 09:12 to 10:51)
Wrong matches that reached NetSuite	0 — all caught by HITL review
Payments incorrectly authorized	0
Customer impact	None (internal agent)
Regulatory exposure	None
Time the agent was paused	~28 hours (10:51 Aug 14 → 14:30 Aug 15)
Sarah's time spent triaging wrong matches	~3 hours over those 2 days
Agent downstream user impact	AP team reverted to manual processing for 1.5 days

4. What went well

HITL gate worked as designed — every wrong match was caught before any NetSuite action. The agent never wrote a wrong payment.
Kill switch worked end-to-end — Mike paused the agent in under 60 seconds when he decided to.
Communication was prompt — Sarah's Slack message reached the right people fast.
Hotfix turnaround was fast — 28 hours from incident to fix in production.
The audit trail in LangSmith was complete — root cause analysis was straightforward.

5. What went wrong

Alert threshold gap. Wrong-match rate climbing from 4% baseline to 30% over 1.5 hours triggered no alarm. Current alerts only fire on Sev-1 indicators (out-of-scope tool calls) and broad metrics (cost spike, latency p95). Quality degradation at this magnitude should have paged.
No vendor change-notification process. NetSuite shipped a schema change without explicit notice to the AP team. Our procurement process (template 13) requires change notification from AI vendors, but NetSuite is an upstream system, not an AI vendor, so its operational changes fall outside that requirement.
Detection depended on a human noticing. Sarah is conscientious; this could just as easily have gone hours longer if she'd been in a meeting all morning.
No pre-prod canary against an updated schema. Our eval (template 08) runs against the golden set, which was captured before this NetSuite change. There's no automatic re-eval when an upstream system changes.
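
The last gap above (no check against an updated upstream schema) can be narrowed with a cheap guard that compares the fields in a live response with the fields the parser was built against. A minimal sketch, assuming each line item arrives as a dictionary; apart from `lineItemDiscountType`, which appears in the timeline, the field names and function names are hypothetical:

```python
# Fields the parser was written against (hypothetical names for illustration).
EXPECTED_LINE_ITEM_FIELDS = frozenset({"itemId", "quantity", "unitPrice", "amount"})


def schema_drift(record: dict, expected: frozenset = EXPECTED_LINE_ITEM_FIELDS) -> dict:
    """Report fields that appeared or disappeared relative to the expected set."""
    observed = set(record)
    return {
        "added": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }


def has_drift(record: dict) -> bool:
    """True if the record's field set differs from the expected one in any way."""
    report = schema_drift(record)
    return bool(report["added"] or report["missing"])
```

An additive change like the one in this incident would show up under `added`, so a pipeline could pause or route to review when `has_drift` returns true instead of continuing to parse.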

6. Root cause (5 whys)

Q: Why did wrong matches start? A: Parser misaligned columns because of an unexpected new field in NetSuite responses.
Q: Why did the parser misalign? A: It used positional column ordering, not column names.
Q: Why didn't the deterministic parser catch this? A: The schema change was additive (new optional field), which didn't break parsing — it just made it produce wrong values.
Q: Why didn't we detect this within minutes? A: We monitor for errors and Sev-1 indicators, not for quality-rate degradation.
Q: Why don't we monitor for quality-rate degradation? A: The 5 monitoring signals (framework §21) include HITL acceptance rate, but our alert threshold is set at < 70% over a 7-day window. This incident produced 1.5 hours of severe degradation that didn't move the 7-day rolling average enough to trigger.

Underlying systemic cause: Our monitoring is tuned for sustained problems, not sharp incidents.
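
The dilution effect is easy to see with numbers. A rough sketch, assuming (for illustration only) a constant invoice volume per hour and using wrong-match rate as a stand-in for the quality signal; the production alert is defined on HITL acceptance rate, and `windowed_rate` is a hypothetical helper:

```python
def windowed_rate(bad_hours: float, bad_rate: float, baseline_rate: float,
                  window_hours: float) -> float:
    """Average wrong-match rate over a window ending when the bad period ends.

    Assumes constant volume per hour, so the average is weighted by time only.
    """
    bad = min(bad_hours, window_hours)
    good = window_hours - bad
    return (bad * bad_rate + good * baseline_rate) / window_hours


# Figures from the incident: 1.5 hours at ~30% against a 4% baseline.
seven_day = windowed_rate(1.5, 0.30, 0.04, window_hours=7 * 24)
one_hour = windowed_rate(1.5, 0.30, 0.04, window_hours=1)

print(f"7-day window: {seven_day:.2%}")   # 7-day window: 4.23%
print(f"1-hour window: {one_hour:.2%}")   # 1-hour window: 30.00%
```

Under these assumptions the 7-day figure barely moves off the 4% baseline, while a 1-hour window sees the full 30%, which is why a short-window alert catches what the rolling average hides.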

7. Action items

#	Action	Severity	Owner	Due	Status
1	Switch parser to column-name-based extraction instead of positional	Hotfix	Builder	2026-08-15	✅ Done (shipped 2026-08-15)
2	Add short-window HITL acceptance alert: fire if acceptance < 70% over rolling 2-hour window	Sev-2 prevention	Builder + Platform	2026-09-01	✅ Done (2026-08-28)
3	Add short-window wrong-match-rate alert: fire if wrong-match rate > 15% over rolling 1-hour window	Sev-2 prevention	Builder	2026-09-01	✅ Done (2026-08-28)
4	Update runbook (template 09) Section 4 with this failure mode: "upstream schema change" — diagnostics + mitigation	Runbook hygiene	CoE Lead	2026-08-26	✅ Done
5	Add automated weekly re-eval against golden set + a synthetic test invoice that exercises every NetSuite field	Drift detection	Builder	2026-10-01	✅ Done (2026-09-22)
6	Reach out to NetSuite to request notification of schema changes (this is a process gap, not just an AI gap)	Vendor relations	Mike Chen + IT	2026-09-15	✅ Done — added to NetSuite ITSM watchlist
7	Document this as a framework lesson: drift detection should include short-window quality bands, not just rolling 7-day averages	Framework changelog	CoE Lead	2026-08-30	✅ Done (framework changelog entry 2026-08-29)
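
Action item 1 targets the mechanism behind the incident: a positional parser keeps running when a column is inserted, it just reads the wrong values. A minimal sketch of the difference, assuming tabular rows with a header; the column names other than `lineItemDiscountType` and the position of the new column are hypothetical, and this is not the agent's actual parser:

```python
OLD_HEADER = ["itemId", "quantity", "unitPrice"]
# Hypothetical placement of the new field; any position before the last
# column shifts the values that follow it.
NEW_HEADER = ["itemId", "quantity", "lineItemDiscountType", "unitPrice"]
NEW_ROW = ["A1", "2", "PERCENT", "10.00"]


def parse_positional(row: list) -> dict:
    """Reads values by index, as fixed when the parser was written for OLD_HEADER."""
    return {"itemId": row[0], "quantity": row[1], "unitPrice": row[2]}


def parse_by_name(header: list, row: list) -> dict:
    """Looks up each required column by name, so added columns are ignored."""
    record = dict(zip(header, row))
    missing = [name for name in OLD_HEADER if name not in record]
    if missing:
        raise KeyError(f"missing required columns: {missing}")
    return {name: record[name] for name in OLD_HEADER}


print(parse_positional(NEW_ROW))           # unitPrice is read as 'PERCENT'
print(parse_by_name(NEW_HEADER, NEW_ROW))  # unitPrice is read as '10.00'
```

The positional version returns `PERCENT` as the unit price without raising, which is the silent failure described in the 5 whys. The name-based version ignores the new column and fails loudly if a required one disappears.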

8. Communications

Sarah Patel was informed in real time (Slack #ai-pilot-finance)
Mike Chen briefed Head of Finance on 2026-08-14
CoE Lead briefed Executive Sponsor at the regular monthly 1:1 (2026-08-25)
No external communication required (no customer impact, no regulatory exposure)

9. Regulatory considerations

EU AI Act Article 73 (serious incident reporting) — N/A. This was not a serious incident under the Act's definition (no harm to persons, no breach of fundamental rights, no major property damage). For reference: if this had occurred in a Stage 3 autonomous configuration with NetSuite write enabled, AND wrong matches had caused actual payment errors, Article 73 reporting could apply. Our HITL gate prevented this.
SOX — Not material. No financial reporting was affected. AP records remain accurate.
Internal audit — Notification sent 2026-08-19; no further action requested.

10. Sign-off

Role	Name	Date
Author	Morteza Moradi	2026-08-19
Reviewer (Department)	Mike Chen	2026-08-20
Reviewer (Security)	Pat Lee	2026-08-20
Executive sponsor (informed)	Jane Doe	2026-08-25

Blank template (copy below for your incident)

# Post-Mortem — [Agent ID] [Severity] incident, [YYYY-MM-DD]

**Author:** [CoE Lead or delegate]
**Date:** [Date of write-up, target ≤ 5 business days for Sev-1, ≤ 10 business days for Sev-2]
**Severity:** [Sev-1 / Sev-2]
**Status:** [Open / Closed — all action items completed]

---

## 1. Incident summary

[3–5 sentences. What happened, when, how long, who was impacted, root cause in one sentence, current state.]

## 2. Timeline

| Time (UTC) | Event |
|---|---|
| | |

Include events from before the incident if relevant (e.g., upstream change a day earlier).

## 3. Impact

| Dimension | Impact |
|---|---|
| [Specific impact metric 1] | |
| [Specific impact metric 2] | |
| Customer impact | [None / describe] |
| Regulatory exposure | [None / describe] |
| Duration of impact / outage | |
| Downstream user impact | |

## 4. What went well

- [Item 1]
- [Item 2]

(Blameless — celebrate the parts of the response that worked.)

## 5. What went wrong

- [Item 1 — focus on systemic causes, not individuals]
- [Item 2]

## 6. Root cause (5 whys)

- Q: [Why did X happen?]
  A: [Because Y]
- Q: [Why did Y happen?]
  A: [Because Z]
- Q: [Why did Z happen?]
  A: [...]
- Q: [...]
  A: [...]
- Q: [...]
  A: [Root systemic cause]

**Underlying systemic cause:** [One sentence]

## 7. Action items

| # | Action | Severity | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | | | | | |

Each action item must have:
- A specific deliverable (not "investigate")
- A named owner (one person)
- A due date
- A tracked status

## 8. Communications

- [Who was informed when]
- [Whether external communication was required]

## 9. Regulatory considerations

- **EU AI Act Article 73 serious incident reporting:** [Applicable? / N/A — reasoning]
- **Sector regulations:** [HIPAA breach? SOX material weakness? Other? / N/A — reasoning]
- **Internal audit:** [Notified / not applicable]

## 10. Sign-off

| Role | Name | Date |
|---|---|---|
| Author | | |
| Reviewer (Department) | | |
| Reviewer (Security) | | |
| Executive sponsor (informed) | | |

Usage notes

Blameless. Names of individuals appear only in the timeline (as actors) and contacts. Section 5 ("what went wrong") focuses on systems and processes, not people. "Sarah noticed late" is not a finding — "we have no automated quality-degradation alert" is.
Action items must close. A post-mortem with open action items 6 months later is a failed post-mortem. CoE Lead tracks closure at every quarterly review (template 15).
Update upstream documents. When action items affect the runbook (template 09), Agent Card (template 03), or framework — update those documents. Don't let the post-mortem be the only record.
EU AI Act Article 73 reporting. For serious incidents involving high-risk AI in scope of the Act, a report to the relevant market surveillance authority is required no later than 15 days after becoming aware of the incident (shorter deadlines apply to certain categories). Section 9 forces this consideration.
Sev-3 incidents may also produce post-mortems at CoE Lead discretion, especially if they reveal a systemic pattern. Don't treat Sev-2 as a hard floor.

Common pitfalls

Pitfall	What it looks like	Fix
Blameful	"Sarah didn't notice fast enough"	Re-frame: "We have no automated quality-degradation alert over a short window"
Single why	Stops at first cause	Force 5 whys; the root cause is rarely the proximate cause
Vague action items	"Improve monitoring"	Specific: "Add HITL acceptance alert at <70% over 2-hour window. Owner X. Due YYYY-MM-DD"
Action items never close	Post-mortem filed, items tracked for a week, then forgotten	CoE Lead reviews open items at every quarterly review (template 15)
Regulatory not considered	Section 9 missing or `N/A` without reasoning	Force the question; "N/A" requires a sentence of why
Communications missed	Post-mortem written but stakeholders not briefed	Section 8 forces the comms check
Framework not updated	Lesson stays in this one document	Action item to update framework / standards explicitly
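
The "vague action items" and "action items never close" pitfalls lend themselves to a mechanical check before sign-off and at each review. A minimal sketch, assuming action items are kept as simple records; the field names, the list of vague openers, and the `lint` function are illustrative and not part of the registry or template 15:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Openers that usually signal a non-deliverable. This list is an assumption;
# tune it to your own team's habits.
VAGUE_OPENERS = ("investigate", "improve", "look into", "consider")


@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    done: bool = False


def lint(item: ActionItem, today: date) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.action.strip().lower().startswith(VAGUE_OPENERS):
        problems.append("action is not a specific deliverable")
    if not item.owner.strip():
        problems.append("needs a named owner")
    if item.due is None:
        problems.append("missing due date")
    elif not item.done and item.due < today:
        problems.append("overdue")
    return problems
```

A check like this only catches the obvious cases; whether a deliverable is specific enough is still a judgment for the reviewer.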

Framework cross-references

framework.md §11.2 (continuous monitoring — post-mortems for Sev-2+)
framework.md §21 (5 monitoring signals — gaps surfaced in post-mortems)
framework.md §25.3 (Security: Detect — incident response)
framework.md §22.1 EU AI Act Article 72 (post-market monitoring) + 73 (serious incident reporting)
framework.md §22.2 NIST AI RMF MANAGE
workflows.md Step 14 (Continuous monitoring — post-mortems for Sev-2+)
workflows.html → In Action view → node M16 (Continuous monitoring) → loop back to M13 (re-evaluate)