Fatskills
Practice. Master. Repeat.
Study Guide: Principles of Product Management: Incident Management, Post-Mortems, Blameless Culture
Source: https://www.fatskills.com/product-management/chapter/product-management-incident-management-postmortems-blameless-culture

Principles of Product Management: Incident Management, Post-Mortems, Blameless Culture

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.



What This Is

Incident management is the process of detecting, responding to, and resolving unplanned disruptions (e.g., outages, bugs, security breaches) in a product. Post-mortems are structured retrospectives to analyze why an incident happened and how to prevent it. A blameless culture ensures teams focus on systemic fixes rather than punishing individuals. This matters because incidents erode trust, hurt revenue, and slow innovation—e.g., when Stripe’s API went down for 2 hours in 2020, merchants lost sales, but their transparent post-mortem and automated failover improvements restored confidence.


Key Terms & Frameworks

  • Incident: An unplanned disruption to a service (e.g., a payment gateway failing during Black Friday).
  • Severity Levels (SEV 1–4):
      • SEV 1: Critical (e.g., entire site down, >$100K/hour revenue loss).
      • SEV 2: High (e.g., checkout broken for 20% of users).
      • SEV 3: Medium (e.g., slow load times on a non-critical page).
      • SEV 4: Low (e.g., a typo in a blog post).
  • MTTR (Mean Time to Recovery): Total downtime / # of incidents. Measures response efficiency.
  • MTBF (Mean Time Between Failures): Total uptime / # of incidents. Measures reliability.
  • Blameless Post-Mortem: A retrospective focusing on systems (e.g., "Why did our monitoring miss this?") not people (e.g., "Who deployed the bad code?").
  • 5 Whys: Iteratively ask "why?" to trace an incident’s root cause (e.g., "Why did the DB fail?" → "Because it ran out of memory" → "Why?" → "Because queries weren’t optimized").
  • DRI (Directly Responsible Individual): The single person accountable for resolving an incident (e.g., the on-call engineer).
  • SLOs (Service Level Objectives): Targets for reliability (e.g., "99.9% uptime for the checkout flow").
  • Error Budget: 100% – SLO (e.g., a 99.9% SLO leaves 0.1% allowed downtime). Used to balance innovation against reliability (e.g., "We can ship this risky feature because we still have budget left").
  • Incident Command System (ICS): A framework for coordinating large-scale incidents (roles: Incident Commander, Communications Lead, etc.).
  • Fishbone Diagram (Ishikawa): Visual tool to categorize root causes (e.g., People, Process, Technology, Environment).
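The MTTR, MTBF, and error budget definitions above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 30-day reporting window and a list of per-incident downtimes in minutes; the function names and the sample numbers are illustrative and not taken from any particular tool.

```python
from typing import List


def mean_time_to_recovery(downtimes_min: List[float]) -> float:
    """MTTR = total downtime / number of incidents."""
    if not downtimes_min:
        raise ValueError("need at least one incident")
    return sum(downtimes_min) / len(downtimes_min)


def mean_time_between_failures(period_min: float, downtimes_min: List[float]) -> float:
    """MTBF = total uptime / number of incidents."""
    if not downtimes_min:
        raise ValueError("need at least one incident")
    uptime = period_min - sum(downtimes_min)
    return uptime / len(downtimes_min)


def error_budget_minutes(slo_percent: float, period_min: float) -> float:
    """Error budget = (100% - SLO) applied to the reporting period."""
    return period_min * (100.0 - slo_percent) / 100.0


def budget_consumed(slo_percent: float, period_min: float, downtimes_min: List[float]) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    return sum(downtimes_min) / error_budget_minutes(slo_percent, period_min)


if __name__ == "__main__":
    period = 30 * 24 * 60          # assumed 30-day window, in minutes
    downtimes = [20.0, 10.0, 9.0]  # hypothetical incidents this window
    print("MTTR (min):", mean_time_to_recovery(downtimes))
    print("MTBF (min):", mean_time_between_failures(period, downtimes))
    print("Budget (min):", round(error_budget_minutes(99.9, period), 1))
    print("Consumed:", round(budget_consumed(99.9, period, downtimes), 2))
```

With a 99.9% SLO, a 30-day window allows roughly 43 minutes of downtime, so three short incidents totalling 39 minutes already use about 90% of the budget.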

Step-by-Step / Process Flow

1. Detect & Declare the Incident

  • Action: Set up automated alerts (e.g., PagerDuty, Datadog) for SEV 1–2 incidents. Manually declare SEV 3–4 if users report issues.
  • Example: A monitoring tool pings the on-call engineer when checkout latency spikes to 10s (SEV 2).
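The declare step can be sketched as a small triage rule. This is a hypothetical illustration, not how PagerDuty or Datadog classify alerts: the three boolean inputs and the function names are assumptions chosen to mirror the SEV examples in this guide, and real teams usually define their own criteria.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical, e.g. entire site down
    SEV2 = 2  # high, e.g. a critical flow such as checkout is broken or very slow
    SEV3 = 3  # medium, e.g. slow load times on a non-critical page
    SEV4 = 4  # low, e.g. a cosmetic issue


def classify(site_down: bool, critical_flow_affected: bool, user_visible_degradation: bool) -> Severity:
    """Map coarse signals to a severity level (illustrative rules only)."""
    if site_down:
        return Severity.SEV1
    if critical_flow_affected:
        return Severity.SEV2
    if user_visible_degradation:
        return Severity.SEV3
    return Severity.SEV4


def should_page_on_call(severity: Severity) -> bool:
    """Page automatically for SEV 1-2; SEV 3-4 are declared manually."""
    return severity <= Severity.SEV2


if __name__ == "__main__":
    # Checkout latency spike: a critical flow is affected, but the site is up.
    sev = classify(site_down=False, critical_flow_affected=True, user_visible_degradation=True)
    print(sev.name, "page on-call:", should_page_on_call(sev))
```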

2. Assemble the Response Team

  • Action: Assign a DRI (e.g., on-call engineer) and Incident Commander (usually a PM or tech lead) to coordinate. Use a war room (Slack channel, Zoom) for SEV 1–2.
  • Example: The PM joins the #incident-checkout channel to track progress and update stakeholders.

3. Mitigate & Communicate

  • Action:
      • Short-term: Roll back the bad deploy, fail over to a backup system, or disable the broken feature.
      • Stakeholders: Post updates on a public status page (e.g., status.company.com) every 15–30 mins. Use templates, such as: "We’re investigating elevated error rates in the checkout flow. Next update in 15 mins."
  • Example: The team rolls back a recent payment gateway update, reducing errors from 30% to 2%.
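The update cadence above can be supported by a tiny helper that formats a status message and flags when the next one is overdue. This is a sketch under assumptions: the message layout, the UTC label, and the function names are hypothetical, and it does not call any real status page API.

```python
from datetime import datetime, timedelta


def status_update(summary: str, now: datetime, interval_min: int = 15) -> str:
    """Format a status-page message that commits to a next update time."""
    next_update = now + timedelta(minutes=interval_min)
    return f"[{now:%H:%M} UTC] {summary} Next update by {next_update:%H:%M} UTC."


def update_overdue(last_update: datetime, now: datetime, max_gap_min: int = 30) -> bool:
    """True when more than max_gap_min minutes have passed since the last update."""
    return now - last_update > timedelta(minutes=max_gap_min)


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 14, 5)  # hypothetical incident time
    print(status_update("We're investigating elevated error rates in the checkout flow.", started))
    print("Overdue:", update_overdue(started, started + timedelta(minutes=45)))
```

Committing to a specific next-update time in every message makes silence visible: if that time passes with no post, the Communications Lead knows an update is due even when there is no progress to report.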

4. Conduct a Blameless Post-Mortem

  • Action:
      • Gather data: Logs, metrics, screenshots, user reports.
      • Timeline: Reconstruct the incident minute-by-minute (e.g., "14:03: DB query timeout → 14:05: Alert triggered").
      • Root cause: Use 5 Whys or a Fishbone Diagram.
      • Action items: Assign owners and deadlines (e.g., "Add DB query timeouts to monitoring by EOD Friday").
  • Example: The post-mortem reveals the DB failed because a new feature’s queries weren’t indexed. The fix: add automated query performance checks to CI/CD.
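One way to keep post-mortem action items from going vague is to check each one for an owner, a deadline, and a reasonably specific description before the document is published. The sketch below is an assumption-laden illustration: the `ActionItem` fields, the `review` function, and the five-word minimum are hypothetical heuristics, not part of any standard post-mortem template.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def review(item: ActionItem, min_words: int = 5) -> List[str]:
    """Return a list of problems; an empty list means the item looks actionable."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no deadline")
    # Crude heuristic (assumption): very short descriptions tend to be vague.
    if len(item.description.split()) < min_words:
        problems.append("description may be too vague")
    return problems


if __name__ == "__main__":
    vague = ActionItem("Improve monitoring")
    specific = ActionItem(
        "Add DB query timeouts to monitoring",
        owner="on-call engineer",
        due=date(2024, 5, 17),  # hypothetical Friday deadline
    )
    print(review(vague))
    print(review(specific))
```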

5. Follow Up & Improve

  • Action:
      • Retro: Share the post-mortem with the team and leadership. Celebrate fixes (e.g., "Thanks to this, we caught a similar issue in staging!").
      • Metrics: Track MTTR and MTBF over time. Aim to reduce MTTR by 20% quarterly.
      • Error Budget: If you’re over budget, pause feature work to focus on reliability.
  • Example: The team adds a "chaos engineering" sprint to proactively test failure scenarios.
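The follow-up targets above reduce to two small calculations: a compounding MTTR goal and a gate on feature work. This is a rough sketch; the function names are hypothetical, and the 45-minute starting MTTR is borrowed from the interview example later in this guide purely for illustration.

```python
def mttr_target(current_mttr_min: float, quarters: int = 1, reduction: float = 0.20) -> float:
    """MTTR goal after compounding a fixed percentage reduction each quarter."""
    return current_mttr_min * (1.0 - reduction) ** quarters


def should_pause_feature_work(budget_min: float, downtime_min: float) -> bool:
    """True when downtime in the period exceeds the error budget."""
    return downtime_min > budget_min


if __name__ == "__main__":
    for q in range(1, 5):
        print(f"Quarter {q}: target MTTR of about {mttr_target(45.0, q):.1f} min")
    # Hypothetical period: 43.2 minutes of budget, 50 minutes of downtime.
    print("Pause feature work:", should_pause_feature_work(43.2, 50.0))
```

Because the reduction compounds, a 20% quarterly goal takes a 45-minute MTTR to under 20 minutes in about a year, which is a more demanding target than it first sounds.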

Common Mistakes

  • Mistake: Blaming individuals (e.g., "The engineer who wrote this code is terrible"). Correction: Focus on systems (e.g., "Why didn’t our code review catch this?"). Blame kills psychological safety and hides deeper issues.

  • Mistake: Skipping post-mortems for "minor" incidents (SEV 3–4). Correction: Even small incidents can reveal systemic risks (e.g., a typo in an email template might hint at poor QA processes).

  • Mistake: Writing vague action items (e.g., "Improve monitoring"). Correction: Make them SMART (e.g., "Add latency alerts for the checkout flow with a 500ms threshold by Friday").

  • Mistake: Not communicating during an incident. Correction: Silence erodes trust. Even a "We’re investigating" update is better than radio silence.

  • Mistake: Treating post-mortems as a checkbox exercise. Correction: Tie action items to OKRs (e.g., "Reduce SEV 1 incidents by 30% this quarter").


PM Interview / Practical Insights

What Interviewers Probe

  1. "How would you handle a SEV 1 incident during a major product launch?"
      • Trap: Saying "I’d panic" or "I’d call the CTO immediately."
      • Answer: Outline the Incident Command System (ICS): assign roles, mitigate, communicate, and follow up with a post-mortem.

  2. "How do you balance speed and reliability?"
      • Trap: Saying "We should never ship risky features."
      • Answer: Use error budgets (e.g., "We can ship this feature because we’re under our 0.1% downtime budget").

  3. "How do you foster a blameless culture?"
      • Trap: Saying "I’d fire the person who caused the incident."
      • Answer: Focus on psychological safety (e.g., "We assume everyone acted in good faith. Let’s fix the process, not the person.").

  4. "How do you measure the success of incident management?"
      • Trap: Only mentioning uptime.
      • Answer: Track MTTR, MTBF, and # of repeat incidents. Example: "Our MTTR dropped from 45 mins to 20 mins after adding automated rollbacks."

Quick Check Questions

  1. Scenario: Your team’s post-mortem reveals that a SEV 1 incident was caused by a junior engineer’s mistake. The CTO wants to fire them. What do you do?
      • Answer: Push back and refocus on systemic fixes (e.g., "Let’s add automated testing to catch this in the future"). Why: Blame discourages transparency and hides deeper issues.

  2. Scenario: Your error budget is 90% consumed (only 10% left), but the CEO wants to launch a high-risk feature. How do you respond?
      • Answer: Delay the launch or negotiate a smaller scope (e.g., "Let’s ship to 10% of users first"). Why: Exceeding the error budget risks reliability and user trust.

  3. Scenario: A SEV 2 incident lasts 2 hours, but your status page only updates once. Users are tweeting complaints. What’s the issue, and how do you fix it?
      • Answer: The issue is poor communication. Fix: Update the status page every 15–30 mins, even if there’s no progress. Why: Transparency builds trust, even during outages.

Last-Minute Cram Sheet

  1. SEV 1–4: Critical (site down) → High (major feature broken) → Medium (slow load) → Low (cosmetic).
  2. MTTR = Total downtime / # of incidents (lower = better).
  3. MTBF = Total uptime / # of incidents (higher = better).
  4. Blameless post-mortem: Focus on systems, not people.
  5. 5 Whys: Ask "why?" 5 times to find root cause.
  6. Error budget = 100% – SLO (e.g., 0.1% downtime allowed).
  7. DRI: One person accountable for resolving an incident.
  8. ICS: Incident Commander, Communications Lead, etc.
  9. Post-mortems aren’t just for SEV 1s—even small incidents can reveal risks.
  10. Silence during an incident erodes trust—always communicate, even if there’s no update.