Study Guide: Forward Deployed Engineer 101: Technical Writing for the Field (Runbooks, Incident Reports, User Guides)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-technical-writing-for-the-field-runbooks-incident-reports-user-guides

Forward Deployed Engineer 101: Technical Writing for the Field (Runbooks, Incident Reports, User Guides)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.




What This Is

Technical writing in the field isn’t about polished docs; it’s about saving lives, missions, or millions in downtime when things go wrong. As an FDE, you’ll write:
- Runbooks that spell out recovery steps (e.g., "How to restart the classified ML pipeline when the GPU node crashes at 3 AM").
- Incident reports to explain why a satellite feed dropped during a live op (and how to prevent it).
- User guides for operators who’ve never touched a CLI (e.g., "How to upload drone footage to the secure enclave without exposing PII").

Field example: You’re on-site at a disaster response HQ. The data pipeline feeding real-time flood maps to first responders fails. The customer’s ops team is panicking. Your runbook (written last week) has the exact kubectl commands to roll back the deployment, but it’s buried in a 50-page doc. Your incident report later reveals the root cause: a misconfigured firewall rule that blocked the API gateway. Now you’re rewriting the user guide to include a 3-step "Firewall Check" section—because the next FDE won’t have time to debug this again.


Key Terms & Concepts

  • Runbook: A step-by-step playbook for recovering from failures (e.g., "How to restore the PostgreSQL database when the primary node dies"). Often includes:
    • Triggers (e.g., "Alert: CPU > 95% for 5 mins").
    • Commands (e.g., `systemctl restart nginx`).
    • Decision trees (e.g., "If `df -h` shows /var > 90% full, run `logrotate`").
    • Tools: Markdown (for version control), Jupyter Notebooks (for interactive runbooks), Terraform (for IaC-based recovery).

  • Incident Report (IR): A blameless post-mortem that answers:
    • What happened? (Timeline: "14:32 UTC: API latency spiked to 5s.")
    • Why? (Root cause: "Misconfigured Redis cache eviction policy.")
    • How to prevent it? (Action items: "Add Redis monitoring, update runbook.")
    • Tools: GitHub/GitLab Issues, Google Docs (only where cloud services are permitted; classified networks typically need an approved internal alternative), Jira (for tracking follow-ups).

  • User Guide: Zero-to-one documentation for end users (e.g., "How to use the new threat-detection dashboard"). Must include:
    • Prerequisites (e.g., "You need kubectl v1.25+ and VPN access").
    • Screenshots (for GUI tools) or CLI snippets (for engineers).
    • Troubleshooting (e.g., "If the map doesn’t load, check the browser console for CORS errors").
    • Tools: Sphinx (for Python docs), MkDocs (for Markdown), Confluence (for enterprise).

  • Ask vs. Infer:
    • Ask: What the customer says they need (e.g., "We need a dashboard to track cyber threats").
    • Infer: What they actually need (e.g., "They need a way to correlate IPs with historical attack patterns; the dashboard is just the UI").
    • Why it matters: Customers often describe solutions, not problems. Your docs should reflect the inferred need.

  • Air-Gapped Docs:
    • Writing for environments with no internet access (e.g., classified networks, ships at sea).
    • Constraints:
      • No external images (host them locally).
      • No dynamic content (e.g., no embedded YouTube videos).
      • Must work offline (e.g., PDFs, static HTML).
    • Tools: Pandoc (convert Markdown → PDF), Docker (for offline doc builds).

  • Living vs. Static Docs:
    • Living docs: Updated continuously as the system changes (e.g., a runbook synced with Git).
    • Static docs: Frozen at a point in time (e.g., a PDF for an ATO submission).
    • Field rule: If it’s critical (e.g., disaster recovery), make it living. If it’s compliance (e.g., ATO paperwork), make it static.

  • Blame-Free Language:
    • Bad: "The ops team misconfigured the firewall."
    • Good: "The firewall rule `DENY 0.0.0.0/0` was applied, blocking all traffic."
    • Why: Incident reports are about systems, not people.

  • Command-Line Snippets:
    • Always include:
      • Full command (e.g., `kubectl get pods -n production --selector=app=backend`).
      • Expected output (e.g., "Should return 3 pods in Running state").
      • Error handling (e.g., "If no pods appear, check `kubectl describe ns production`").
    • Tools: asciinema (record terminal sessions), carbon.now.sh (for pretty code snippets).

  • Decision Logs:
    • A timestamped record of key choices (e.g., "2024-05-20: Decided to use SQLite instead of PostgreSQL due to air-gap constraints").
    • Why: Future FDEs (or you, in 6 months) will ask, "Why the hell did we do this?"

  • ATO (Authorization to Operate):
    • The formal approval, with its supporting compliance package, showing that your system meets security standards (e.g., NIST SP 800-53).
    • Field trap: ATO docs are static. If you change the system, you may need to re-ATO.
    • Tools: OSCAL (for machine-readable ATO packages), Word/PDF (for manual submissions).

  • ACO (Authority to Connect; often abbreviated ATC in DoD settings):
    • Permission to plug into a customer’s network (e.g., "Your laptop can’t connect to the classified enclave without an ACO").
    • Field rule: Always ask for the ACO before you arrive on-site.
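
The Command-Line Snippets checklist above (full command, expected output, error handling) can be enforced with a small structure. This is only a minimal sketch: the `RunbookStep` class and the Markdown layout it emits are illustrative assumptions, not part of any tool named in this guide.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    """One runbook step; all three fields are required (hypothetical structure)."""
    command: str
    expected: str
    on_error: str

    def to_markdown(self, number: int) -> str:
        # Refuse to render a step that is missing any of the three parts.
        for name in ("command", "expected", "on_error"):
            if not getattr(self, name).strip():
                raise ValueError(f"step {number}: '{name}' must not be empty")
        return (
            f"{number}. Run: `{self.command}`\n"
            f"   - Expected: {self.expected}\n"
            f"   - If it fails: {self.on_error}\n"
        )


step = RunbookStep(
    command="kubectl get pods -n production --selector=app=backend",
    expected="3 pods in Running state",
    on_error="check `kubectl describe ns production`",
)
print(step.to_markdown(1))
```

Raising on an empty field means a step without an expected output or a fallback never reaches the published runbook.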


Step-by-Step / Field Process

1. Write a Runbook (Before the Fire Starts)

Scenario: You’re deploying a real-time fraud detection model to a bank’s on-premise cluster. The model crashes every 3 days due to OOM errors. You need a runbook for the bank’s ops team.

Steps:
1. Reproduce the failure (in a staging env):
```bash
# Trigger OOM by running the model with too much data
python train.py --batch-size 1000000
```
2. Document the fix (in Markdown):
````markdown
# Fraud Model OOM Recovery

Trigger: Alert: container_memory_usage_bytes > 90% of the memory limit for 5 mins.

## Steps
1. SSH into the bastion host:
```bash
# Address is redacted in the source; substitute your bastion user and host.
ssh [email protected] -i ~/.ssh/bank_key
```

2. Check pod status:
```bash
kubectl get pods -n fraud-detection
```

3. If pod is OOMKilled, scale down and restart:
```bash
kubectl scale deployment fraud-model --replicas=0 -n fraud-detection
kubectl scale deployment fraud-model --replicas=1 -n fraud-detection
```

4. Validate:
```bash
kubectl logs -n fraud-detection <pod-name> | grep "Model loaded"
```
````
3. Test the runbook (have a teammate follow it blindly; if they fail, rewrite).
4. Store it where ops can find it (e.g., Git repo, Confluence, or a printed copy in the ops center).
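
The "If pod is OOMKilled" decision in the runbook can also be checked programmatically. The sketch below assumes you have saved the output of `kubectl get pods -n fraud-detection -o json` and parsed it into a dict. The function name and the sample data are hypothetical; the field paths (`status.containerStatuses[].state` and `.lastState`) follow the Kubernetes Pod status schema.

```python
import json


def oom_killed_pods(pod_list: dict) -> list[str]:
    """Return names of pods with a container whose current or last state is OOMKilled."""
    names = []
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            reasons = {
                cs.get("state", {}).get("terminated", {}).get("reason"),
                cs.get("lastState", {}).get("terminated", {}).get("reason"),
            }
            if "OOMKilled" in reasons:
                names.append(pod["metadata"]["name"])
                break  # one OOMKilled container is enough to flag the pod
    return names


# Hand-written sample shaped like `kubectl get pods -o json` output (not real cluster data).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "fraud-model-abc"},
   "status": {"containerStatuses": [
     {"state": {"running": {}},
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"name": "fraud-model-def"},
   "status": {"containerStatuses": [{"state": {"running": {}}, "lastState": {}}]}}
]}
""")
print(oom_killed_pods(sample))  # ['fraud-model-abc']
```

Checking `lastState` as well as `state` matters because a pod that has already restarted shows as Running while its previous container was OOMKilled.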



2. Write an Incident Report (After the Fire is Out)

Scenario: The bank’s fraud model went down during a live demo to the CFO. The ops team fixed it, but the CFO wants a post-mortem.

Steps:
1. Gather data (timeline, logs, screenshots):
```bash
# Show the 20 most recent events in the namespace
kubectl get events -n fraud-detection --sort-by='.metadata.creationTimestamp' | tail -n 20
```
2. Write the timeline (in a Google Doc or GitHub Issue):
## Timeline
- 14:00 UTC: CFO demo begins.
- 14:02 UTC: Model pod crashes (`OOMKilled`).
- 14:03 UTC: Ops team receives PagerDuty alert.
- 14:05 UTC: Ops runs `kubectl scale` (from runbook).
- 14:07 UTC: Model back online.
3. Root cause analysis (RCA):
- What happened? The model’s batch size was increased for the demo, causing OOM.
- Why? No pre-demo load testing.
- How to prevent? Add a pre-demo checklist (e.g., "Test with 2x expected load").
4. Action items (assign owners and deadlines):

| Action Item | Owner | Deadline |
|-------------|-------|----------|
| Add pre-demo load test to runbook | Alice | 2024-06-01 |
| Set up OOM alerts for demo env | Bob | 2024-05-25 |

5. Share with stakeholders (CFO, ops team, your manager).
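
Executives usually want two numbers from a timeline: how long it took to notice and how long the outage lasted. Here is a small worked example using the timestamps from the timeline above; the dictionary keys and the helper name are illustrative assumptions.

```python
from datetime import datetime

# Timeline entries copied from the example incident above (all times UTC).
timeline = {
    "crash": "14:02",
    "alert": "14:03",
    "mitigation_started": "14:05",
    "recovered": "14:07",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(timeline["crash"], timeline["alert"])
time_to_recover = minutes_between(timeline["crash"], timeline["recovered"])
print(f"Time to detect: {time_to_detect} min")    # 1 min
print(f"Time to recover: {time_to_recover} min")  # 5 min
```

Putting both figures next to the root cause in the first paragraph gives readers the impact without making them scan the timeline.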



3. Write a User Guide (For Non-Engineers)

Scenario: The bank’s fraud analysts need to use your new dashboard, but they’ve never used a CLI.

Steps:
1. Interview a user (ask: "What’s the first thing you do when you open the dashboard?").
2. Write for their skill level (avoid jargon like "API endpoint"):
```markdown
# Fraud Dashboard User Guide

## How to Filter Transactions
1. Click the Date Range picker (top-right).
2. Select "Last 7 Days".
3. Click Apply.

## Troubleshooting
- Problem: "No data appears."
- Fix: Check your VPN connection (see VPN Guide).
```
3. Add screenshots (use `carbon.now.sh` for code snippets, `draw.io` for diagrams).
4. Test with a non-technical user (e.g., have the CFO’s assistant try it).
5. Publish in their preferred format (e.g., PDF for air-gapped networks, Confluence for enterprise).
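
Step 2 ("avoid jargon") can be spot-checked before the draft reaches a real user. The sketch below is a rough heuristic, not a substitute for the user test in step 4. The `JARGON` list, the function name, and the sample sentence are all assumptions made for illustration.

```python
import re

# Illustrative jargon list; in practice, build it from your own glossary.
JARGON = ["API endpoint", "pod", "enclave", "namespace"]


def find_jargon(text: str, glossary: set[str]) -> list[str]:
    """Return jargon terms used in `text` that the glossary does not define."""
    hits = []
    for term in JARGON:
        # Whole-word match, allowing a simple plural "s".
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE) and term.lower() not in glossary:
            hits.append(term)
    return hits


draft = "If no data appears, the API endpoint may be down or the pod restarted."
print(find_jargon(draft, glossary={"pod"}))  # ['API endpoint']
```

Terms that appear in the glossary are allowed, so the check pushes you either to define a term or to reword the sentence.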


Common Mistakes

| Mistake | Correction | Why |
|---------|------------|-----|
| Writing docs after the fact | Write the runbook before deploying to prod. | If the system crashes at 3 AM, you won’t remember the fix. |
| Assuming users know your jargon | Define terms like "pod" or "enclave" in a glossary. | The bank’s fraud analysts don’t know Kubernetes. |
| Not testing runbooks | Have a teammate follow the runbook without your help. | If they can’t fix it, the runbook is useless. |
| Burying the lead in incident reports | Put the root cause in the first paragraph. | Executives won’t read past page 1. |
| Using "we" or "they" in incident reports | Use blameless language (e.g., "The firewall rule was misconfigured"). | Incident reports are about systems, not people. |


FDE Interview / War Story Insights

1. The "Scope Creep" Trap

Interviewer: "You’re on-site, and the customer demands a feature that wasn’t in the original scope. How do you respond?"

Field answer:
1. Clarify the ask: "Can you walk me through the mission impact? What happens if we don’t add this?"
2. Assess risk: "This change would require re-ATOing the system, which takes 30 days. Is this a blocker for go-live?"
3. Propose a workaround: "We can add this as a Phase 2 item, or we can deliver a manual process for now."
4. Document the decision: Update the decision log and user guide to reflect the change.

Why: Customers often don’t understand the cost of scope changes (e.g., re-ATOing a system). Your job is to protect the mission while keeping the customer happy.
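
Step 4 ("Document the decision") is easier to keep up when every entry has the same shape. The sketch below follows the timestamped format from the Decision Logs definition. The helper name is an assumption, and the second entry (including its date) is invented for this scenario.

```python
from datetime import date


def decision_entry(on: date, decision: str, reason: str) -> str:
    """Format one decision-log line: 'YYYY-MM-DD: Decided to <decision> due to <reason>.'"""
    return f"{on.isoformat()}: Decided to {decision} due to {reason}."


decision_log = [
    # Entry taken from the Decision Logs example earlier in this guide.
    decision_entry(date(2024, 5, 20), "use SQLite instead of PostgreSQL", "air-gap constraints"),
    # Hypothetical entry for the scope-change scenario; the date is made up.
    decision_entry(date(2024, 6, 3), "defer the requested feature to Phase 2", "the 30-day re-ATO timeline"),
]
print("\n".join(decision_log))
```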



2. The "Air-Gapped Docs" Nightmare

War story: You’re deploying to a classified network with no internet. You bring a USB drive with your docs, but the customer’s security team blocks all external media. Now you have no runbooks, no user guides, and no incident reports.

Field lesson:
- Always bring a printed copy of critical docs (e.g., runbooks, ATO paperwork).
- Host docs on the customer’s internal wiki (e.g., Confluence, SharePoint) before you arrive.
- Test offline access (e.g., download a PDF of the runbook and open it on a classified machine).


Quick Check Questions

1. You’re writing a runbook for a data pipeline that fails every time the input CSV has a new column. What’s the first step in your runbook?

Answer: "Check the input schema for unexpected columns with `pandas.read_csv(..., nrows=5)` and log the mismatch."

Explanation: Always validate inputs before debugging the pipeline.
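
A first runbook step like this could look roughly as follows. To stay dependency-free, this sketch reads only the header row with the standard-library `csv` module instead of `pandas.read_csv`. The expected column names and the sample data are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["transaction_id", "amount", "timestamp"]  # hypothetical schema


def schema_mismatch(csv_file, expected: list[str]) -> dict[str, list[str]]:
    """Compare the CSV header row against the expected columns."""
    header = next(csv.reader(csv_file), [])
    return {
        "unexpected": [c for c in header if c not in expected],
        "missing": [c for c in expected if c not in header],
    }


# In-memory stand-in for the real input file.
sample = io.StringIO(
    "transaction_id,amount,timestamp,merchant_code\n"
    "1,9.99,2024-05-20T14:00:00Z,A1\n"
)
print(schema_mismatch(sample, EXPECTED_COLUMNS))
# {'unexpected': ['merchant_code'], 'missing': []}
```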



2. You’re on-site, and the customer’s ops team says, "The dashboard is broken." They won’t give you logs or error messages. What do you do?

Answer: "Ask: ‘What were you doing when it broke? Can you show me a screenshot?’ Then check the browser console (F12 → Console) for errors."

Explanation: Non-technical users often describe symptoms, not root causes. Reproduce the issue first.



3. You’re writing an incident report for a system outage. The root cause was a misconfigured firewall rule. How do you phrase this in the report?

Answer: "The firewall rule `DENY 10.0.0.0/8` was applied, blocking traffic between the API and database."

Explanation: Blameless language focuses on the system, not the person who made the mistake.
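
A draft report can be screened for people-focused phrasing before review. The patterns below are a deliberately small, illustrative heuristic (pronouns plus "the … team") and will miss many cases, so treat a clean result as a prompt for human review rather than proof the text is blameless.

```python
import re

# Heuristic patterns for people-focused phrasing; the list is illustrative only.
BLAME_PATTERNS = [
    r"\b(?:we|they|he|she)\b",
    r"\bthe \w+ team\b",
]


def flag_blame(sentence: str) -> list[str]:
    """Return the people-focused phrases found in a sentence (case-insensitive)."""
    hits = []
    for pattern in BLAME_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, sentence, re.IGNORECASE))
    return hits


print(flag_blame("The ops team misconfigured the firewall."))
# ['The ops team']
print(flag_blame("The firewall rule DENY 10.0.0.0/8 was applied, blocking traffic."))
# []
```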


Last-Minute Cram Sheet

  • Runbook must-haves:
    • Triggers (e.g., "Alert: 5xx_errors > 10/min").
    • Commands (e.g., `kubectl rollout restart deployment/backend`).
    • Expected output (e.g., "Should return `deployment.apps/backend restarted`").

  • Incident report template:
    • Timeline (what happened, when).
    • Root cause (why it happened).
    • Action items (how to prevent it).

  • User guide must-haves:
    • Prerequisites (e.g., "You need VPN access").
    • Screenshots (for GUI tools) or CLI snippets (for engineers).
    • Troubleshooting (e.g., "If the map doesn’t load, check the browser console").

  • Air-gapped docs:
    • No external images (host them locally).
    • No dynamic content (e.g., no embedded YouTube videos).
    • Must work offline (e.g., PDF, static HTML).

  • Blame-free language:
    • ❌ "The ops team misconfigured the firewall."
    • ✅ "The firewall rule `DENY 0.0.0.0/0` was applied, blocking all traffic."

  • Common ports to know:
    • 22 (SSH), 80 (HTTP), 443 (HTTPS), 5432 (PostgreSQL), 6379 (Redis).

  • Field traps:
    • ⚠️ Always test in the exact customer environment: what works in your lab can break behind their firewall.
    • ⚠️ ATO docs are static: if you change the system, you may need to re-ATO.
    • ⚠️ Printed runbooks save lives: USB drives can be blocked in classified networks.

  • Tools to know:
    • Markdown (for runbooks), Jupyter Notebooks (for interactive docs), Pandoc (convert Markdown → PDF).
    • asciinema (record terminal sessions), carbon.now.sh (pretty code snippets).

  • Acronyms:
    • ATO (Authorization to Operate), ACO (Authority to Connect), IAM (Identity and Access Management).

  • Golden rule: "If it’s not documented, it didn’t happen." Write it down.


