By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Technical writing in the field isn’t about polished docs; it’s about saving lives, missions, or millions in downtime when things go wrong. As an FDE, you’ll write:
- Runbooks to automate recovery steps (e.g., "How to restart the classified ML pipeline when the GPU node crashes at 3 AM").
- Incident reports to explain why a satellite feed dropped during a live op (and how to prevent it).
- User guides for operators who’ve never touched a CLI (e.g., "How to upload drone footage to the secure enclave without exposing PII").
Field example: You’re on-site at a disaster response HQ. The data pipeline feeding real-time flood maps to first responders fails. The customer’s ops team is panicking. Your runbook (written last week) has the exact kubectl commands to roll back the deployment, but it’s buried in a 50-page doc. Your incident report later reveals the root cause: a misconfigured firewall rule that blocked the API gateway. Now you’re rewriting the user guide to include a 3-step "Firewall Check" section—because the next FDE won’t have time to debug this again.
Example trigger: `CPU > 95% for 5 mins`. Example commands and paths: `systemctl restart nginx`, `df -h`, `/var`, `logrotate`.
Tools: Markdown (for version control), Jupyter Notebooks (for interactive runbooks), Terraform (for IaC-based recovery).
Incident Report (IR): A blameless post-mortem that answers what happened, why it happened, and how to prevent it from happening again.
Tools: GitHub/GitLab Issues, Google Docs (for unclassified networks only), Jira (for tracking follow-ups).
User Guide: Zero-to-one documentation for end users (e.g., "How to use the new threat-detection dashboard"). Must include:
Tools: Sphinx (for Python docs), MkDocs (for Markdown), Confluence (for enterprise).
Ask vs. Infer:
Why it matters: Customers often describe solutions, not problems. Your docs should reflect the inferred need.
Air-Gapped Docs:
Tools: Pandoc (convert Markdown → PDF), Docker (for offline doc builds).
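One possible way to prepare air-gapped docs is to script the Markdown → PDF conversion so that every doc is exported before you travel. The sketch below is a minimal example, not a definitive build script: the function names and the flat directory layout are hypothetical, and it assumes Pandoc (plus a PDF engine such as LaTeX, which Pandoc needs for PDF output) is installed on the build machine.

```python
from pathlib import Path
import shutil
import subprocess


def pandoc_command(source: Path, out_dir: Path) -> list[str]:
    """Build the argv that converts one Markdown file to a PDF in out_dir."""
    target = out_dir / (source.stem + ".pdf")
    return ["pandoc", str(source), "-o", str(target)]


def build_offline_docs(docs_dir: Path, out_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Return the Pandoc commands for every .md file; execute them unless dry_run."""
    commands = [pandoc_command(p, out_dir) for p in sorted(docs_dir.glob("*.md"))]
    if not dry_run:
        if shutil.which("pandoc") is None:
            raise RuntimeError("pandoc not found on PATH")
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raise if any conversion fails
    return commands
```

Running it with `dry_run=True` first lets you review the exact commands before anything is written, which is useful when the build machine is not the one you will deploy from.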
Living vs. Static Docs:
Field rule: If it’s critical (e.g., disaster recovery), make it living. If it’s compliance (e.g., ATO paperwork), make it static.
Blame-Free Language:
Example: "The firewall rule `DENY 0.0.0.0/0` was applied, blocking all traffic."
Why: Incident reports are about systems, not people.
Command-Line Snippets:
- Command: `kubectl get pods -n production --selector=app=backend`
- Expected status: `Running`
- Follow-up check: `kubectl describe ns production`
Tools: asciinema (record terminal sessions), carbon.now.sh (for pretty code snippets).
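One way to keep command-line snippets honest is to pair each command with the output you expect, so the snippet can be checked rather than just read. The following is a minimal sketch under that assumption; `RunbookStep`, `run_shell`, and `check_step` are hypothetical names, and the substring match is a deliberately simple stand-in for whatever validation your environment needs.

```python
from dataclasses import dataclass
from typing import Callable
import subprocess


@dataclass
class RunbookStep:
    description: str
    command: list[str]
    expected: str  # substring that must appear in the command's output


def run_shell(command: list[str]) -> str:
    """Default runner: execute the command and return its stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def check_step(step: RunbookStep, runner: Callable[[list[str]], str] = run_shell) -> bool:
    """Run one step and report whether the expected text appeared."""
    return step.expected in runner(step.command)


# Defining the step does not execute anything; call check_step(step) to run it.
step = RunbookStep(
    description="Backend pods are healthy",
    command=["kubectl", "get", "pods", "-n", "production", "--selector=app=backend"],
    expected="Running",
)
```

Passing the runner as a parameter means the same step definitions can be exercised against canned output in a test, then against the real cluster on-site.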
Decision Logs:
Why: Future FDEs (or you, in 6 months) will ask, "Why the hell did we do this?"
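A decision log can be as simple as one structured entry per decision, rendered to Markdown and committed next to the runbooks. The sketch below is one possible shape, not a standard format: the `Decision` class and its fields are hypothetical, and the example entry in the usage note is illustrative only.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Decision:
    when: date
    title: str
    reason: str
    alternatives: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the entry as a Markdown section for the decision log."""
        lines = [f"## {self.when.isoformat()}: {self.title}", "", f"Why: {self.reason}"]
        if self.alternatives:
            lines.append("")
            lines.append("Alternatives considered:")
            lines.extend(f"- {alt}" for alt in self.alternatives)
        return "\n".join(lines)
```

For example, `Decision(date(2024, 5, 20), "Defer feature to Phase 2", "Change would require re-ATO (30 days)", ["Manual process for now"]).to_markdown()` yields a short section that answers the "why" directly.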
ATO (Authorization to Operate):
Tools: OSCAL (for machine-readable ATOs), Word/PDF (for manual submissions).
ACO (Authority to Connect):
Scenario: You’re deploying a real-time fraud detection model to a bank’s on-premises cluster. The model crashes every 3 days due to OOM errors. You need a runbook for the bank’s ops team.
Steps:
1. Reproduce the failure (in a staging env):
```bash
# Trigger OOM by running the model with too much data
python train.py --batch-size 1000000
```
2. Document the fix (in Markdown):
````markdown
# Fraud Model OOM Recovery

Trigger: Alert when `container_memory_usage_bytes` exceeds 90% of the container memory limit for 5 mins.

## Steps
1. SSH into the bastion host:
   ```bash
   ssh -i ~/.ssh/bank_key [email protected]
   ```
2. Check pod status:
   ```bash
   kubectl get pods -n fraud-detection
   ```
3. If pod is `OOMKilled`, scale down and restart:
   ```bash
   kubectl scale deployment fraud-model --replicas=0 -n fraud-detection
   kubectl scale deployment fraud-model --replicas=1 -n fraud-detection
   ```
4. Validate:
   ```bash
   kubectl logs -n fraud-detection <pod-name> | grep "Model loaded"
   ```
````
3. Test the runbook (have a teammate follow it blindly; if they fail, rewrite).
4. Store it where ops can find it (e.g., Git repo, Confluence, or a printed copy in the ops center).
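The "check pod status" step can also be done programmatically, which helps when the runbook is later turned into an automated check. The sketch below assumes you have the JSON from `kubectl get pod <pod-name> -n fraud-detection -o json`; the function name is hypothetical, and the sample document is a trimmed, illustrative stand-in for real output (the container name and timestamp are made up).

```python
import json


def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or last termination was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = status.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # count each container once
    return names


# Trimmed, illustrative pod JSON (not real cluster output)
sample = json.loads("""
{"status": {"containerStatuses": [
  {"name": "fraud-model",
   "state": {"running": {"startedAt": "2024-05-20T14:07:00Z"}},
   "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
]}}
""")
print(oom_killed_containers(sample))  # ['fraud-model']
```

Checking `lastState` as well as `state` matters because a pod that has already restarted reports `Running` while still carrying the evidence of the earlier OOM kill.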
Scenario: The bank’s fraud model went down during a live demo to the CFO. The ops team fixed it, but the CFO wants a post-mortem.
Steps:
1. Gather data (timeline, logs, screenshots):
```bash
# Get the 20 most recent events in the namespace
kubectl get events -n fraud-detection --sort-by='.metadata.creationTimestamp' | tail -n 20
```
2. Write the timeline (in a Google Doc or GitHub Issue):
```markdown
## Timeline
- 14:00 UTC: CFO demo begins.
- 14:02 UTC: Model pod crashes (`OOMKilled`).
- 14:03 UTC: Ops team receives PagerDuty alert.
- 14:05 UTC: Ops runs `kubectl scale` (from runbook).
- 14:07 UTC: Model back online.
```
3. Root cause analysis (RCA):
- What happened? The model’s batch size was increased for the demo, causing OOM.
- Why? No pre-demo load testing.
- How to prevent? Add a pre-demo checklist (e.g., "Test with 2x expected load").
4. Action items (assign owners and deadlines):

| Action Item | Owner | Deadline |
|-------------|-------|----------|
| Add pre-demo load test to runbook | Alice | 2024-06-01 |
| Set up OOM alerts for demo env | Bob | 2024-05-25 |

5. Share with stakeholders (CFO, ops team, your manager).
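If the timeline is kept as data, the Markdown section and the headline numbers (time to detect, time to recover) can be derived from the same source instead of being typed twice. This is a minimal sketch using the timestamps from the example timeline; the helper names are hypothetical, and it assumes all events fall on the same day.

```python
from datetime import datetime

TIMELINE = [
    ("14:00", "CFO demo begins."),
    ("14:02", "Model pod crashes (`OOMKilled`)."),
    ("14:03", "Ops team receives PagerDuty alert."),
    ("14:05", "Ops runs `kubectl scale` (from runbook)."),
    ("14:07", "Model back online."),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def render_timeline(events: list[tuple[str, str]]) -> str:
    """Render the events as the Markdown timeline section."""
    lines = ["## Timeline"]
    lines += [f"- {t} UTC: {text}" for t, text in events]
    return "\n".join(lines)


time_to_detect = minutes_between("14:02", "14:03")   # crash -> alert
time_to_recover = minutes_between("14:02", "14:07")  # crash -> back online
print(render_timeline(TIMELINE))
print(f"Detected in {time_to_detect} min, recovered in {time_to_recover} min")
```

For this timeline the last line reports detection in 1 minute and recovery in 5 minutes, which are the figures a CFO-level summary usually leads with.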
Scenario: The bank’s fraud analysts need to use your new dashboard, but they’ve never used a CLI.
Steps:
1. Interview a user (ask: "What’s the first thing you do when you open the dashboard?").
2. Write for their skill level (avoid jargon like "API endpoint"):
```markdown
# Fraud Dashboard User Guide

## How to Filter Transactions
1. Click the Date Range picker (top-right).
2. Select "Last 7 Days".
3. Click Apply.

## Troubleshooting
- Problem: "No data appears."
- Fix: Check your VPN connection (see VPN Guide).
```
3. Add screenshots (use `carbon.now.sh` for code snippets, `draw.io` for diagrams).
4. Test with a non-technical user (e.g., have the CFO’s assistant try it).
5. Publish in their preferred format (e.g., PDF for air-gapped networks, Confluence for enterprise).
Interviewer: "You’re on-site, and the customer demands a feature that wasn’t in the original scope. How do you respond?"
Field answer:
1. Clarify the ask: "Can you walk me through the mission impact? What happens if we don’t add this?"
2. Assess risk: "This change would require re-ATOing the system, which takes 30 days. Is this a blocker for go-live?"
3. Propose a workaround: "We can add this as a Phase 2 item, or we can deliver a manual process for now."
4. Document the decision: Update the decision log and user guide to reflect the change.
Why: Customers often don’t understand the cost of scope changes (e.g., re-ATOing a system). Your job is to protect the mission while keeping the customer happy.
War story: You’re deploying to a classified network with no internet. You bring a USB drive with your docs, but the customer’s security team blocks all external media. Now you have no runbooks, no user guides, and no incident reports.
Field lesson:
- Always bring a printed copy of critical docs (e.g., runbooks, ATO paperwork).
- Host docs on the customer’s internal wiki (e.g., Confluence, SharePoint) before you arrive.
- Test offline access (e.g., download a PDF of the runbook and open it on a classified machine).
Answer: "Check the input schema for unexpected columns with `pandas.read_csv(..., nrows=5)` and log the mismatch."
Explanation: Always validate inputs before debugging the pipeline.
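The schema check in this answer can be sketched in a few lines. To keep the example dependency-free it uses the standard-library `csv` module as a stand-in for the `pandas.read_csv(..., nrows=5)` call: the idea is the same (read only the top of the file, compare column names, log any mismatch). The expected column set, logger name, and sample data are all hypothetical.

```python
import csv
import io
import logging

logger = logging.getLogger("pipeline.input_check")

EXPECTED_COLUMNS = {"transaction_id", "amount", "timestamp"}  # hypothetical schema


def check_schema(handle, expected: set[str]) -> tuple[set[str], set[str]]:
    """Read only the header row and return (unexpected, missing) column names."""
    header = next(csv.reader(handle), [])
    actual = {name.strip() for name in header}
    unexpected = actual - expected
    missing = expected - actual
    if unexpected or missing:
        logger.warning(
            "Schema mismatch: unexpected=%s missing=%s",
            sorted(unexpected), sorted(missing),
        )
    return unexpected, missing


# Illustrative input with one extra column and one missing column
sample = io.StringIO("transaction_id,amount,merchant_notes\n1,9.99,ok\n")
unexpected, missing = check_schema(sample, EXPECTED_COLUMNS)
```

Reporting both unexpected and missing columns in one log line tends to shorten the debugging loop, since upstream schema changes often rename a column rather than simply adding one.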
Answer: "Ask: ‘What were you doing when it broke? Can you show me a screenshot?’ Then check the browser console (`F12` → Console) for errors."
Explanation: Non-technical users often describe symptoms, not root causes. Reproduce the issue first.
Answer: "The firewall rule `DENY 10.0.0.0/8` was applied, blocking traffic between the API and database."
Explanation: Blameless language focuses on the system, not the person who made the mistake.
Trigger (e.g., `5xx_errors > 10/min`).
Command (e.g., `kubectl rollout restart deployment/backend`).
Expected output (e.g., "Should return `deployment.apps/backend restarted`").
Incident report template:
Action items (how to prevent it).
User guide must-haves:
Troubleshooting (e.g., "If the map doesn’t load, check the browser console").
Air-gapped docs:
Must work offline (e.g., PDF, static HTML).
Blame-free language:
✅ "The firewall rule DENY 0.0.0.0/0 was applied, blocking all traffic."
Common ports to know:
22 (SSH), 80 (HTTP), 443 (HTTPS), 5432 (PostgreSQL), 6379 (Redis).
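Knowing the ports is most useful when you can quickly test them, for instance as part of a firewall check. The sketch below is one simple way to probe TCP reachability with the standard library; the function names are hypothetical, and the host you pass in is up to you. Note that a `False` result does not distinguish a closed port from one silently dropped by a firewall.

```python
import socket

COMMON_PORTS = {22: "SSH", 80: "HTTP", 443: "HTTPS", 5432: "PostgreSQL", 6379: "Redis"}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def check_ports(host: str, ports=COMMON_PORTS) -> dict[int, bool]:
    """Probe each port on the host and map port number to reachability."""
    return {port: port_open(host, port) for port in ports}
```

A short timeout keeps the whole sweep fast even when a firewall drops packets instead of rejecting them.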
Field traps:
⚠️ Printed runbooks save lives—USB drives can be blocked in classified networks.
Tools to know:
asciinema (record terminal sessions), carbon.now.sh (pretty code snippets).
Acronyms:
ATO (Authorization to Operate), ACO (Authority to Connect), IAM (Identity and Access Management).
Golden rule: "If it’s not documented, it didn’t happen." Write it down.