Template 10 — Post-Mortem

ID: 10-post-mortem
Version: 1
Last revised: 2026-05-14
Owner: AI CoE Lead (drives) · Agent Builder + On-call (contribute) · Affected department (informed)

Purpose

A blameless retrospective of any Severity-1 or Severity-2 incident involving a production AI agent. It captures what happened, why, what worked, what didn't, and which changes would prevent recurrence. The output is a set of action items with named owners and due dates; those changes feed back into the framework, the runbook, the Agent Card, or the standards.

Severity definitions (use the company's existing severity scale if one exists):

  • Sev-1: Customer impact OR data exposure OR regulatory exposure OR agent took an out-of-scope action OR financial loss. Mandatory post-mortem within 5 business days.

  • Sev-2: Sustained quality degradation, sustained outage, broad user disruption, near-miss on Sev-1. Mandatory post-mortem within 10 business days.

  • Sev-3: Brief outage, single-instance quirk, fixed before user impact. Post-mortem optional (CoE Lead decides). Note in the registry.

  • Sev-4: Minor cosmetic. No post-mortem. Note in agent's issue tracker.

  • When you use it: After any Sev-2 or higher incident in production.

  • Who fills it: CoE Lead facilitates. Builder + on-call + affected stakeholders contribute.

  • Tone: Blameless. Focus on systemic causes, not individual mistakes.

  • Output: Post-mortem doc + action items committed to the registry + runbook updates as needed.
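
The deadlines in the severity definitions can be turned into a due date mechanically. The sketch below shows one way to do that in Python. It assumes the clock starts on the incident date and that only weekends are skipped (public holidays are ignored). The names `POST_MORTEM_DEADLINE_DAYS`, `add_business_days`, and `post_mortem_due` are illustrative and not part of the framework.

```python
from datetime import date, timedelta
from typing import Optional

# Business-day deadlines taken from the severity definitions.
# None means a post-mortem is not mandatory at that severity.
POST_MORTEM_DEADLINE_DAYS = {"Sev-1": 5, "Sev-2": 10, "Sev-3": None, "Sev-4": None}


def add_business_days(start: date, days: int) -> date:
    """Step forward `days` weekdays, skipping Saturdays and Sundays.

    Public holidays are not considered (a simplifying assumption).
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def post_mortem_due(severity: str, incident_date: date) -> Optional[date]:
    """Return the latest post-mortem date, or None if none is mandatory."""
    days = POST_MORTEM_DEADLINE_DAYS[severity]
    if days is None:
        return None
    return add_business_days(incident_date, days)


if __name__ == "__main__":
    print(post_mortem_due("Sev-2", date(2026, 8, 14)))
```

Under these assumptions, a Sev-2 incident on 2026-08-14 gives a due date of 2026-08-28. If your organization counts from the detection date or observes holidays, adjust the helper accordingly.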


Worked example (AP Accountant invoice reconciliation)

Post-Mortem — finance-invoice-recon Sev-2 incident, 2026-08-14

Author: Morteza Moradi (CoE Lead)
Date: 2026-08-19 (5 days post-incident)
Severity: Sev-2
Status: Closed (all action items completed)


1. Incident summary

On 2026-08-14 at 09:12 UTC, the finance-invoice-recon agent began producing incorrect PO matches at an elevated rate (~30% wrong matches versus a 4% baseline). The issue went undetected until 10:38, when AP accountant Sarah Patel noticed it and reported it in #ai-pilot-finance. The agent was paused via kill switch at 10:51. Root cause: NetSuite released a schema change on 2026-08-13 that introduced a new field type for line-item discounts, which the agent's deterministic parser misinterpreted. Total wrong matches generated: 47. All were caught by Sarah at HITL review: no incorrect data reached NetSuite and no payments were authorized. The agent was restored to service on 2026-08-15 at 14:30 with a parser hotfix.

2. Timeline

| Time (UTC) | Event |
|---|---|
| 2026-08-13 23:00 | NetSuite ships schema change to production tenant (their scheduled maintenance window). No notification received because the change was classified by NetSuite as "additive" |
| 2026-08-14 06:00 | The day's first invoices begin arriving; agent processes them |
| 2026-08-14 09:12 | Wrong-match rate begins climbing; observable in LangSmith, but no alert is configured at this threshold |
| 2026-08-14 10:38 | Sarah Patel notices wrong matches; posts in #ai-pilot-finance |
| 2026-08-14 10:42 | Mike Chen (Finance Champion) acknowledges in Slack |
| 2026-08-14 10:51 | Kill switch activated by Mike Chen; agent paused |
| 2026-08-14 11:00 | CoE Lead notified; investigation begins |
| 2026-08-14 11:45 | Root cause identified: new NetSuite field lineItemDiscountType causing parser to misalign columns |
| 2026-08-14 16:00 | Parser hotfix branch created |
| 2026-08-15 09:00 | Hotfix PR reviewed |
| 2026-08-15 11:00 | Re-evaluated against golden set (96% accuracy, matching pre-incident baseline) |
| 2026-08-15 12:30 | Deployed to prod via standard CI/CD |
| 2026-08-15 14:30 | Agent re-enabled; closely monitored for next 24h |
| 2026-08-16 14:30 | Confirmed stable; incident closed operationally |
| 2026-08-19 | Post-mortem completed; action items filed |
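
The durations quoted in this post-mortem follow directly from the timeline. A minimal Python sketch of that arithmetic is shown here, using timestamps copied from the timeline. The dictionary keys and the helper name `minutes_between` are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline; the keys are illustrative labels.
timeline = {
    "degradation_start": "2026-08-14 09:12",
    "reported": "2026-08-14 10:38",
    "paused": "2026-08-14 10:51",
    "fix_deployed": "2026-08-15 12:30",
    "re_enabled": "2026-08-15 14:30",
}
events = {name: datetime.strptime(ts, FMT) for name, ts in timeline.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((events[end] - events[start]).total_seconds() // 60)


time_to_detect = minutes_between("degradation_start", "reported")   # 86 minutes
time_to_pause = minutes_between("reported", "paused")               # 13 minutes
degraded_window = minutes_between("degradation_start", "paused")    # 99 minutes
paused_window = minutes_between("paused", "re_enabled")             # 1659 minutes
time_to_fix = minutes_between("degradation_start", "fix_deployed")  # 1638 minutes

if __name__ == "__main__":
    print(f"time to detect: {time_to_detect} min")
    print(f"agent paused:   {paused_window / 60:.1f} h")
    print(f"time to fix:    {time_to_fix / 60:.1f} h")
```

Recording these intervals explicitly makes it easier to compare incidents over time. Here detection took 86 minutes, while the pause followed the report within 13 minutes.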

3. Impact

| Dimension | Impact |
|---|---|
| Wrong matches generated | 47 (over ~1.5 hours of operation, 09:12 to 10:51) |
| Wrong matches that reached NetSuite | 0 (all caught by HITL review) |
| Payments incorrectly authorized | 0 |
| Customer impact | None (internal agent) |
| Regulatory exposure | None |
| Time the agent was paused | ~28 hours (10:51 Aug 14 → 14:30 Aug 15) |
| Sarah's time spent triaging wrong matches | ~3 hours over those 2 days |
| Downstream user impact | AP team reverted to manual processing for 1.5 days |

4. What went well

  • HITL gate worked as designed — every wrong match was caught before any NetSuite action. The agent never wrote a wrong payment.
  • Kill switch worked end-to-end — Mike paused the agent in under 60 seconds when he decided to.
  • Communication was prompt — Sarah's Slack message reached the right people fast.
  • Hotfix turnaround was fast — 28 hours from incident to fix in production.
  • The audit trail in LangSmith was complete — root cause analysis was straightforward.

5. What went wrong

  • Alert threshold gap. Wrong-match rate climbing from 4% baseline to 30% over 1.5 hours triggered no alarm. Current alerts only fire on Sev-1 indicators (out-of-scope tool calls) and broad metrics (cost spike, latency p95). Quality degradation at this magnitude should have paged.
  • No vendor change-notification process. NetSuite shipped a schema change without explicit notice to the AP team. Our procurement process (template 13) requires AI vendor notification, but this was an upstream system (NetSuite is not an AI vendor — the change was operational).
  • Detection depended on a human noticing. Sarah is conscientious; this could just as easily have gone hours longer if she'd been in a meeting all morning.
  • No pre-prod canary against an updated schema. Our eval (template 08) runs against the golden set, which was captured before this NetSuite change. There's no automatic re-eval when an upstream system changes.

6. Root cause (5 whys)

  • Q: Why did wrong matches start? A: Parser misaligned columns because of an unexpected new field in NetSuite responses.
  • Q: Why did the parser misalign? A: It used positional column ordering, not column names.
  • Q: Why didn't the deterministic parser catch this? A: The schema change was additive (new optional field), which didn't break parsing — it just made it produce wrong values.
  • Q: Why didn't we detect this within minutes? A: We monitor for errors and Sev-1 indicators, not for quality-rate degradation.
  • Q: Why don't we monitor for quality-rate degradation? A: The 5 monitoring signals (framework §21) include HITL acceptance rate, but our alert threshold is set at < 70% over a 7-day window. This incident produced 1.5 hours of severe degradation that didn't move the 7-day rolling average enough to trigger.

Underlying systemic cause: Our monitoring is tuned for sustained problems, not sharp incidents.
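
The first three whys describe a failure that is easy to reproduce in miniature. The Python sketch below is illustrative only, not the agent's actual parser. Apart from `lineItemDiscountType`, every field name and value is hypothetical. It shows how an additive column shifts positional extraction without raising an error, while name-based extraction is unaffected.

```python
# Hypothetical response layouts before and after the upstream change.
# Only lineItemDiscountType comes from the incident; the rest is invented.
OLD_HEADER = ["poNumber", "quantity", "amount"]
NEW_HEADER = ["poNumber", "quantity", "lineItemDiscountType", "amount"]


def parse_positional(row):
    """Fragile: assumes the amount is always the third column."""
    return {"poNumber": row[0], "quantity": row[1], "amount": row[2]}


def parse_by_name(header, row):
    """Looks fields up by name, so additional columns are simply ignored.

    Raises KeyError if a required field disappears, which is a loud
    failure rather than a silent wrong value.
    """
    record = dict(zip(header, row))
    return {key: record[key] for key in ("poNumber", "quantity", "amount")}


old_row = ["PO-1001", "3", "149.97"]
new_row = ["PO-1001", "3", "PERCENT", "149.97"]

before = parse_positional(old_row)
after_positional = parse_positional(new_row)         # no exception, wrong amount
after_by_name = parse_by_name(NEW_HEADER, new_row)   # same result as before

if __name__ == "__main__":
    print(before)
    print(after_positional)
    print(after_by_name)
```

The positional parser returns `"PERCENT"` as the amount and reports no error, which matches the "didn't break parsing" observation. The name-based version returns the same record as before the change.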

7. Action items

| # | Action | Severity | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | Switch parser to column-name-based extraction instead of positional | Hotfix | Builder | 2026-08-15 | ✅ Done (shipped 2026-08-15) |
| 2 | Add short-window HITL acceptance alert: fire if acceptance < 70% over rolling 2-hour window | Sev-2 prevention | Builder + Platform | 2026-09-01 | ✅ Done (2026-08-28) |
| 3 | Add short-window wrong-match-rate alert: fire if wrong-match rate > 15% over rolling 1-hour window | Sev-2 prevention | Builder | 2026-09-01 | ✅ Done (2026-08-28) |
| 4 | Update runbook (template 09) Section 4 with this failure mode: "upstream schema change" (diagnostics + mitigation) | Runbook hygiene | CoE Lead | 2026-08-26 | ✅ Done |
| 5 | Add automated weekly re-eval against golden set + a synthetic test invoice that exercises every NetSuite field | Drift detection | Builder | 2026-10-01 | ✅ Done (2026-09-22) |
| 6 | Reach out to NetSuite to request notification of schema changes (this is a process gap, not just an AI gap) | Vendor relations | Mike Chen + IT | 2026-09-15 | ✅ Done (added to NetSuite ITSM watchlist) |
| 7 | Document this as a framework lesson: drift detection should include short-window quality bands, not just rolling 7-day averages | Framework changelog | CoE Lead | 2026-08-30 | ✅ Done (framework changelog entry 2026-08-29) |
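
To make the short-window alerts concrete, here is a hedged Python sketch of a rolling-window rate check like the 1-hour wrong-match alert, alongside the arithmetic showing why a 7-day average barely moves. It is not the production alerting code. `RollingRateAlert`, `min_events`, `blended_rate`, and the synthetic event stream are all assumptions for illustration. The dilution calculation further assumes uniform invoice volume and that HITL acceptance is roughly one minus the wrong-match rate.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RollingRateAlert:
    """Fires when the share of bad events in a trailing window exceeds a threshold."""
    window: timedelta
    threshold: float
    min_events: int = 20  # hypothetical guard against alerting on tiny samples

    def __post_init__(self):
        self._events = deque()  # (timestamp, is_bad) pairs, oldest first

    def record(self, ts: datetime, is_bad: bool) -> bool:
        """Add one reviewed match; return True if the alert should fire."""
        self._events.append((ts, is_bad))
        cutoff = ts - self.window
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_events:
            return False
        bad = sum(1 for _, flag in self._events if flag)
        return bad / total > self.threshold


def blended_rate(baseline, degraded, degraded_hours, window_hours):
    """Average rate over a window, assuming uniform volume throughout."""
    healthy_hours = window_hours - degraded_hours
    return (degraded * degraded_hours + baseline * healthy_hours) / window_hours


def simulate(alert):
    """Synthetic stream: one match per minute, ~4% wrong, then 30% wrong from 09:12."""
    start = datetime(2026, 8, 14, 8, 0)
    incident = datetime(2026, 8, 14, 9, 12)
    for i in range(180):
        ts = start + timedelta(minutes=i)
        is_bad = (i % 25 == 0) if ts < incident else (i % 10 < 3)
        if alert.record(ts, is_bad):
            return ts
    return None


# 1.5 hours at ~70% acceptance inside a 168-hour window of ~96% acceptance.
seven_day_acceptance = blended_rate(0.96, 0.70, 1.5, 168)
fired_at = simulate(RollingRateAlert(window=timedelta(hours=1), threshold=0.15))

if __name__ == "__main__":
    print(f"7-day acceptance: {seven_day_acceptance:.3f}")
    print(f"1-hour alert fired at: {fired_at}")
```

Under these assumptions the 7-day acceptance figure stays near 0.958, far above the 70% threshold. On the synthetic stream, the 1-hour check fires at 09:41, about half an hour after degradation begins. Real firing times depend on actual volume and on how quickly HITL reviews arrive.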

8. Communications

  • Sarah Patel was informed in real time (Slack #ai-pilot-finance)
  • Mike Chen briefed Head of Finance on 2026-08-14
  • CoE Lead briefed Executive Sponsor at the regular monthly 1:1 (2026-08-25)
  • No external communication required (no customer impact, no regulatory exposure)

9. Regulatory considerations

  • EU AI Act Article 73 (serious incident reporting) — N/A. This was not a serious incident under the Act's definition (no harm to persons, no breach of fundamental rights, no major property damage). For reference: if this had occurred in a Stage 3 autonomous configuration with NetSuite write enabled, AND wrong matches had caused actual payment errors, Article 73 reporting could apply. Our HITL gate prevented this.
  • SOX — Not material. No financial reporting was affected. AP records remain accurate.
  • Internal audit — Notification sent 2026-08-19; no further action requested.

10. Sign-off

| Role | Name | Date |
|---|---|---|
| Author | Morteza Moradi | 2026-08-19 |
| Reviewer (Department) | Mike Chen | 2026-08-20 |
| Reviewer (Security) | Pat Lee | 2026-08-20 |
| Executive sponsor (informed) | Jane Doe | 2026-08-25 |

Blank template (copy below for your incident)

# Post-Mortem — [Agent ID] [Severity] incident, [YYYY-MM-DD]

**Author:** [CoE Lead or delegate]
**Date:** [Date of write-up, target ≤ 5 business days for Sev-1, ≤ 10 business days for Sev-2]
**Severity:** [Sev-1 / Sev-2]
**Status:** [Open / Closed — all action items completed]

---

## 1. Incident summary

[3–5 sentences. What happened, when, how long, who was impacted, root cause in one sentence, current state.]

## 2. Timeline

| Time (UTC) | Event |
|---|---|
| | |

Include events from before the incident if relevant (e.g., upstream change a day earlier).

## 3. Impact

| Dimension | Impact |
|---|---|
| [Specific impact metric 1] | |
| [Specific impact metric 2] | |
| Customer impact | [None / describe] |
| Regulatory exposure | [None / describe] |
| Duration of impact / outage | |
| Downstream user impact | |

## 4. What went well

- [Item 1]
- [Item 2]

(Blameless — celebrate the parts of the response that worked.)

## 5. What went wrong

- [Item 1 — focus on systemic causes, not individuals]
- [Item 2]

## 6. Root cause (5 whys)

- Q: [Why did X happen?]
  A: [Because Y]
- Q: [Why did Y happen?]
  A: [Because Z]
- Q: [Why did Z happen?]
  A: [...]
- Q: [...]
  A: [...]
- Q: [...]
  A: [Root systemic cause]

**Underlying systemic cause:** [One sentence]

## 7. Action items

| # | Action | Severity | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | | | | | |

Each action item must have:
- A specific deliverable (not "investigate")
- A named owner (one person)
- A due date
- A status track

## 8. Communications

- [Who was informed when]
- [Whether external communication was required]

## 9. Regulatory considerations

- **EU AI Act Article 73 serious incident reporting:** [Applicable? / N/A — reasoning]
- **Sector regulations:** [HIPAA breach? SOX material weakness? Other? / N/A — reasoning]
- **Internal audit:** [Notified / not applicable]

## 10. Sign-off

| Role | Name | Date |
|---|---|---|
| Author | | |
| Reviewer (Department) | | |
| Reviewer (Security) | | |
| Executive sponsor (informed) | | |

Usage notes

  • Blameless. Names of individuals appear only in the timeline (as actors) and contacts. Section 5 ("what went wrong") focuses on systems and processes, not people. "Sarah noticed late" is not a finding — "we have no automated quality-degradation alert" is.
  • Action items must close. A post-mortem with open action items 6 months later is a failed post-mortem. CoE Lead tracks closure at every quarterly review (template 15).
  • Update upstream documents. When action items affect the runbook (template 09), Agent Card (template 03), or framework — update those documents. Don't let the post-mortem be the only record.
  • EU AI Act Article 73 reporting. For serious incidents involving high-risk AI in scope of the Act, written notification to the relevant market surveillance authority is required within 15 days (faster for certain categories). Section 9 forces this consideration.
  • Sev-3 incidents may also produce post-mortems at CoE Lead discretion, especially if they reveal a systemic pattern. Don't artificially restrict post-mortems to Sev-2 and above.

Common pitfalls

| Pitfall | What it looks like | Fix |
|---|---|---|
| Blameful | "Sarah didn't notice fast enough" | Re-frame: "We have no automated quality-degradation alert at short window" |
| Single why | Stops at first cause | Force 5 whys; the root cause is rarely the proximate cause |
| Vague action items | "Improve monitoring" | Specific: "Add HITL acceptance alert at <70% over 2-hour window. Owner X. Due YYYY-MM-DD" |
| Action items never close | Post-mortem filed, items tracked for a week, then forgotten | CoE Lead reviews open items at every quarterly review (template 15) |
| Regulatory not considered | Section 9 missing or N/A without reasoning | Force the question; "N/A" requires a sentence of why |
| Communications missed | Post-mortem written but stakeholders not briefed | Section 8 forces the comms check |
| Framework not updated | Lesson stays in this one document | Action item to update framework / standards explicitly |
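
The "vague action items" pitfall and the action-item requirements in the blank template are partly checkable by machine. The Python sketch below is one possible lint, offered as an assumption rather than an existing tool. The `ActionItem` fields mirror the table columns, and the `VAGUE_PREFIXES` list is a hypothetical starting point that a team would tune.

```python
import re
from dataclasses import dataclass

# Hypothetical openers that usually signal a non-specific deliverable.
VAGUE_PREFIXES = ("investigate", "improve", "look into")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str
    status: str


def lint_action_item(item: ActionItem) -> list:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.action.strip().lower().startswith(VAGUE_PREFIXES):
        problems.append("action is not a specific deliverable")
    if not item.owner.strip():
        problems.append("missing owner")
    if not DATE_RE.match(item.due.strip()):
        problems.append("due date is not YYYY-MM-DD")
    if not item.status.strip():
        problems.append("missing status")
    return problems


vague = ActionItem("Improve monitoring", "", "next month", "")
specific = ActionItem(
    "Add HITL acceptance alert at <70% over 2-hour window", "X", "2026-09-01", "Open"
)

if __name__ == "__main__":
    print(lint_action_item(vague))
    print(lint_action_item(specific))
```

A check like this only catches the mechanical gaps. Whether an action actually addresses the systemic cause still needs human review.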

Framework cross-references

  • framework.md §11.2 (continuous monitoring — post-mortems for Sev-2+)
  • framework.md §21 (5 monitoring signals — gaps surfaced in post-mortems)
  • framework.md §25.3 (Security: Detect — incident response)
  • framework.md §22.1 EU AI Act Article 72 (post-market monitoring) + 73 (serious incident reporting)
  • framework.md §22.2 NIST AI RMF MANAGE
  • workflows.md Step 14 (Continuous monitoring — post-mortems for Sev-2+)
  • workflows.html → In Action view → node M16 (Continuous monitoring) → loop back to M13 (re-evaluate)