By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
(Prometheus, Grafana, ELK Stack, Alerting)
Monitoring and observability are the "eyes and ears" of a Forward Deployed Engineer. In the field, you’re often debugging systems you didn’t build, in environments you don’t control (e.g., a classified DoD network, a hospital’s air-gapped EHR system, or a disaster-response pipeline running on spotty satellite links). Unlike a clean dev environment, you’ll deal with:
- No internet access (air-gapped deployments)
- Strict security policies (no root, no outbound traffic, no third-party SaaS)
- Unreliable infrastructure (power outages, flaky VPNs, misconfigured firewalls)
- High-stakes failures (e.g., a model serving critical intel goes down during an operation, or a data pipeline feeding a live disaster map stops updating)
Field Example: You’re deployed to a military base to debug a real-time drone feed processing pipeline. The customer reports "the system is slow," but their logs are empty. You:
1. SSH into the bastion host (the only machine with external access).
2. Check Prometheus metrics to see CPU/memory spikes during drone sorties.
3. Tail the application logs (ELK Stack) to find a misconfigured Kafka consumer causing backpressure.
4. Write a quick Python script to reprocess the backlog while you patch the consumer.
5. Set up a Grafana dashboard + Slack alerts for the customer to monitor future issues.
Without observability, you’re flying blind—and in the field, that means mission failure.
Observability: Exploring unknown issues by asking arbitrary questions of your system (logs, traces, metrics). FDEs need both—monitoring for known failure modes, observability for the "WTF is happening?!" moments.
Prometheus
Key tools: prometheus.yml (config), PromQL (query language), node_exporter (system metrics), blackbox_exporter (probe endpoints).
Grafana
Field tip: Always export dashboards as JSON for version control and quick redeployment.
ELK Stack (Elasticsearch, Logstash, Kibana)
FDE use case: Debugging a failed data pipeline in a classified environment where you can’t kubectl logs (because the cluster is locked down). Instead, you query Elasticsearch for the pod’s logs.
OpenTelemetry (OTel)
Key tools: otel-collector (agent), auto-instrumentation (for Python/Java/Go apps).
Alertmanager (Prometheus)
Field trap: Always test alerts in staging—customers will ignore them if they’re noisy.
Service Level Objectives (SLOs) / Error Budgets
FDE use case: A customer demands a new feature, but their system is already violating SLOs. You use the error budget to push back: "We can’t add this until we fix the latency spikes."
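To make the error-budget argument concrete, here is a minimal sketch of the arithmetic. The 99.9% target and the request counts are made-up numbers, and `error_budget_remaining` is a hypothetical helper, not part of any library.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent).

    slo_target is the required success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total  # the budget, expressed in requests
    return 1 - failed / allowed_failures


# Hypothetical month: 1,000,000 requests at a 99.9% SLO allows about 1,000 failures.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=1_500)
print(f"Error budget remaining: {remaining:.0%}")
```

A negative result means the budget is already overspent, which is the number to bring to the "fix latency before new features" conversation.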
Blackbox Monitoring
Example: curl -v http://drone-feed-api:8080/health → if this cannot connect at all, suspect the network/firewall before the app.
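The same outside-in check can be scripted when you need to repeat it from several hosts. This is a sketch using only the standard library; `probe` and its return labels are hypothetical names chosen for illustration, and the `/health` path is taken from the curl example.

```python
import http.client
import socket


def probe(host: str, port: int, path: str = "/health", timeout: float = 3.0) -> str:
    """Classify a health check as 'dns', 'connect', 'bad_response', 'http_<code>', or 'ok'."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "dns"  # name does not resolve: check DNS or hosts files first
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except OSError:
        return "connect"  # refused, timed out, or reset: firewall, routing, or nothing listening
    except http.client.HTTPException:
        return "bad_response"  # something answered, but not with valid HTTP
    finally:
        conn.close()
    # Any HTTP status means the network path works, so the problem is in the app.
    return "ok" if status == 200 else f"http_{status}"
```

For example, `probe("drone-feed-api", 8080)` returning `"connect"` points at the network path, while `"http_500"` points at the application.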
Whitebox Monitoring
Monitoring internal metrics (e.g., "How many Kafka messages are in the queue?"). FDEs instrument apps with Prometheus client libraries (e.g., prometheus-client for Python) to expose custom metrics.
Distributed Tracing
Field tip: Always add a traceparent header to HTTP requests for end-to-end tracing.
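For reference, a `traceparent` value in the W3C Trace Context format has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character span ID, and flags. The sketch below builds one by hand; in practice an OpenTelemetry SDK would normally do this for you, and the helper names here are hypothetical.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value for a brand-new trace."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the incoming trace ID but mint a new span ID for the outgoing call."""
    if not TRACEPARENT_RE.match(parent):
        return new_traceparent()  # malformed header: start a fresh trace
    _version, trace_id, _parent_span, flags = parent.split("-")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


headers = {"traceparent": new_traceparent()}
print(headers)
```

Passing `headers` along with each outgoing HTTP request lets every hop share one trace ID, which is what makes the end-to-end view possible.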
Air-Gapped Observability
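Deploying monitoring tools in a network with no internet access. FDEs must stage every dependency in advance, typically in an internal artifact repo (e.g., nexus) or object store (e.g., minio), because pip install and docker pull cannot reach public registries.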
Security Constraints
FDE workaround: Use pushgateway (Prometheus) to push metrics from restricted hosts, or sidecar containers to collect logs.
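A push to the Pushgateway is just an HTTP PUT or POST of text-format metrics to `/metrics/job/<job>` (plus optional label pairs), so it can be done without installing anything on the restricted host. The sketch below only builds the request; the gateway address, job name, and metric names are assumptions for illustration.

```python
import urllib.request


def render_metrics(metrics: dict[str, float]) -> bytes:
    """Render name -> value pairs in the Prometheus text exposition format."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return ("\n".join(lines) + "\n").encode()  # the body must end with a newline


def build_push_request(gateway: str, job: str, instance: str,
                       metrics: dict[str, float]) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces this job/instance group's metrics."""
    url = f"{gateway.rstrip('/')}/metrics/job/{job}/instance/{instance}"
    return urllib.request.Request(url, data=render_metrics(metrics), method="PUT")


# Hypothetical gateway address and metric names.
req = build_push_request(
    "http://pushgateway:9091", "drone_pipeline", "host-01",
    {"pipeline_backlog_messages": 1250, "pipeline_last_success_timestamp_seconds": 1700000000},
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req, timeout=5)  # uncomment to actually push
```

Job or label values containing slashes would need extra encoding, which this sketch does not handle.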
Customer-Facing Dashboards
Scenario: You’re deploying a data pipeline in an air-gapped DoD network. You need monitoring, but:
- No internet access.
- No root on customer machines.
- No outbound traffic allowed.
Steps:
1. Pre-download dependencies:
```bash
# Download Prometheus, Grafana, and node_exporter binaries (or Docker images)
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
wget https://dl.grafana.com/oss/release/grafana-10.2.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
```
- Store these on a USB drive or internal artifact repo (e.g., nexus), then copy them to the target host (e.g., with scp).
Start Prometheus:
```bash
./prometheus --config.file=prometheus.yml --storage.tsdb.path=/data/prometheus
```
Deploy Grafana:
```bash
./bin/grafana-server --homepath=./
```
Configure Prometheus as a data source (via Grafana UI or API):
```bash
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'
```
Instrument your app:
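Add Prometheus metrics to your Python app:
```python
from flask import Flask  # assumes a Flask app, as the @app.route decorator implies
from prometheus_client import start_http_server, Counter

app = Flask(__name__)
REQUEST_COUNT = Counter('app_requests_total', 'Total HTTP Requests')
start_http_server(8000)  # serve /metrics on port 8000

@app.route('/process')
def process():
    REQUEST_COUNT.inc()  # count each request to this endpoint
    # ... your logic
    return 'ok'
```
- Expose metrics on `/metrics` (port `8000` in this example; `start_http_server` takes the port as an argument).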
Set up alerts:
Define alert rules in alert.rules.
Configure Alertmanager to route alerts to Slack/email (if allowed) or a local file.
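Prometheus alerting rules support a `for` clause that keeps an alert pending until its condition has held continuously for a set duration, which is the main defense against paging on short blips. The sketch below models that behavior in plain Python to show the logic; `PendingAlert` is a hypothetical class and the 10-minute hold is an example value.

```python
from typing import Optional


class PendingAlert:
    """Fire only after a condition has been true continuously for `hold_seconds`."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self.active_since: Optional[float] = None  # when the condition first became true

    def observe(self, condition: bool, now: float) -> str:
        """Feed one evaluation; returns 'inactive', 'pending', or 'firing'."""
        if not condition:
            self.active_since = None  # any recovery resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        held = now - self.active_since
        return "firing" if held >= self.hold_seconds else "pending"


# Hypothetical evaluations every 5 minutes with a 10-minute hold.
alert = PendingAlert(hold_seconds=600)
samples = [(0, True), (300, True), (600, True), (900, False), (1200, True)]
states = [alert.observe(down, t) for t, down in samples]
print(states)
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

Note how the recovery at t=900 resets the timer, so the failure at t=1200 starts pending again instead of firing immediately.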
Test in staging:
Scenario: A customer reports "the drone feed processing is slow," but they don’t know why. You’re on-site with no prior access to the system.
Steps:
1. Check the basics:
```bash
# SSH into the bastion host (the only machine with external access)
ssh bastion@customer-gateway

# Check DNS resolution (common issue in classified networks)
nslookup drone-feed-api

# Check if the service is reachable
curl -v http://drone-feed-api:8080/health
```
Open Grafana (or query Prometheus directly):
```promql
# Check CPU usage (percent busy)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Check memory usage (percent used)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Check app latency (95th percentile)
histogram_quantile(0.95, sum(rate(app_latency_seconds_bucket[5m])) by (le))
```
- Look for spikes during the reported slowdown.
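It helps to know what `histogram_quantile` is doing when you read a p95 panel: it finds the bucket containing the requested rank and interpolates linearly inside it, so the result is an estimate bounded by your bucket layout. The sketch below is a simplified Python model of that calculation for classic cumulative buckets; the bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: report the highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented latency buckets (seconds): 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

With these numbers the 95th request falls in the 0.5s to 1.0s bucket, so the estimate lands between those bounds regardless of where the real latencies sat. If the p95 looks suspiciously flat, the buckets are probably too coarse.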
Check logs (ELK Stack):
```bash
curl -X GET "http://elasticsearch:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}'
```
Filter for the time window of the slowdown.
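When the customer can name the time of the slowdown, an explicit window is more useful than `now-1h`. The sketch below builds the same kind of query body with fixed start and end times; the `level` and `@timestamp` field names come from the curl example, while `error_query` and the incident time are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def error_query(start: datetime, end: datetime, level: str = "ERROR", size: int = 100) -> dict:
    """Build a bool query for one log level inside an explicit time window."""
    window = {"gte": start.isoformat(), "lte": end.isoformat()}
    return {
        "size": size,
        "sort": [{"@timestamp": "asc"}],  # oldest first, to read events in order
        "query": {
            "bool": {
                "must": [{"match": {"level": level}}],
                "filter": [{"range": {"@timestamp": window}}],
            }
        },
    }


# Hypothetical incident: 14:30 UTC, searched with 15 minutes on either side.
incident = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
body = error_query(incident - timedelta(minutes=15), incident + timedelta(minutes=15))
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to curl with `-d @file`, which avoids shell-quoting mistakes on a locked-down bastion.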
Check traces (Jaeger/OpenTelemetry):
```bash
# Find traces with high duration
curl -X GET "http://jaeger-query:16686/api/traces?service=drone-feed-processor&minDuration=500ms"
```
Look for bottlenecks (e.g., a slow database query, Kafka consumer lag).
Reproduce the issue:
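Write a quick Python script to simulate load:
```python
import time

import requests

while True:
    start = time.perf_counter()
    try:
        # A timeout keeps the probe from hanging on a dead endpoint
        r = requests.get("http://drone-feed-api:8080/process", timeout=10)
        status = r.status_code
    except requests.RequestException as exc:
        status = type(exc).__name__  # e.g. ConnectionError, Timeout
    print(f"Status: {status}  Latency: {time.perf_counter() - start:.2f}s")
    time.sleep(1)
```
- Run it from the customer’s network to rule out local issues.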
Mitigate and fix:
Scenario: The customer (a military command center) wants a "Mission Readiness" dashboard to monitor drone feeds, model accuracy, and system health.
Steps:
1. Define the audience:
   - Non-technical users (e.g., officers) need high-level status (Green/Yellow/Red).
   - Technical users (e.g., IT staff) need detailed metrics.
Annotations: Add context (e.g., "Latency spike at 14:30 during sortie #42").
Build in Grafana:
```promql
# Drone Feed Status (1 = healthy, 0 = down)
up{job="drone-feed-api"}
```
```promql
# Frames processed per minute
sum(rate(drone_frames_processed_total[1m]))
```
Add thresholds (e.g., "Accuracy < 90% → Yellow").
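Agreeing on the threshold logic with the customer before configuring panels avoids arguments about what "Yellow" means. The sketch below writes the mapping down as a plain function; the 90% yellow line comes from the example above, while the 80% red line and the `status_color` name are assumptions.

```python
def status_color(value: float, yellow_below: float, red_below: float) -> str:
    """Map a higher-is-better metric (e.g. accuracy %) to Green/Yellow/Red."""
    if red_below > yellow_below:
        raise ValueError("red_below must not exceed yellow_below")
    if value < red_below:
        return "Red"
    if value < yellow_below:
        return "Yellow"
    return "Green"


# 90% yellow threshold from the text; the 80% red threshold is an assumed value.
for accuracy in (95.0, 88.5, 72.0):
    print(accuracy, status_color(accuracy, yellow_below=90.0, red_below=80.0))
```

The same cut-offs then go into the Grafana panel's threshold settings so the dashboard and the written definition stay in sync.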
Deploy to the customer:
```bash
curl -X GET http://admin:admin@localhost:3000/api/dashboards/uid/your-dashboard-uid > mission-readiness.json
```
Import it into the customer’s Grafana instance:
```bash
curl -X POST http://admin:admin@customer-grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @mission-readiness.json
```
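One thing to watch when moving a dashboard between instances: the export endpoint wraps the dashboard in a `meta`/`dashboard` envelope and includes a numeric `id` that only means something on the source instance, so the file may need a small rewrite before the import call accepts it. The sketch below shows one way to do that; the sample export is heavily trimmed and hypothetical, and `to_import_payload` is not a Grafana function.

```python
import json


def to_import_payload(exported: dict, overwrite: bool = True) -> dict:
    """Turn a GET /api/dashboards/uid/<uid> response into a POST /api/dashboards/db body."""
    dashboard = dict(exported["dashboard"])  # copy, and drop the "meta" wrapper
    dashboard["id"] = None  # numeric IDs are instance-local; let the target assign one
    return {"dashboard": dashboard, "overwrite": overwrite}


# Hypothetical, heavily trimmed export for illustration.
exported = {
    "meta": {"slug": "mission-readiness"},
    "dashboard": {"id": 42, "uid": "your-dashboard-uid", "title": "Mission Readiness", "panels": []},
}
payload = to_import_payload(exported)
print(json.dumps(payload, indent=2))
```

Writing `payload` back to mission-readiness.json before the POST keeps the `uid` stable across instances, which is what lets later re-imports update the same dashboard.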
Train the customer:
What they’re probing:
- Can you quickly orient yourself in an unfamiliar environment?
- Do you know how to use observability tools to ask the right questions?

How to answer:
1. Check the basics: DNS, network connectivity, service health.
```bash
nslookup service-name
curl -v http://service-name:port/health
```
2. Look at metrics: Prometheus/Grafana for CPU, memory, latency, error rates.
3. Check logs: ELK Stack for errors, stack traces, or unusual patterns.
4. Trace requests: Jaeger/OpenTelemetry to follow a request through the system.
5. Reproduce the issue: Write a quick script to simulate load or trigger the error.
Field story: "I was debugging a classified system where the customer said, ‘The model is slow.’ There were no logs, no metrics, and no documentation. I started by checking if the service was even running (systemctl status), then used tcpdump to see if requests were reaching it. Turns out, the customer’s firewall was blocking traffic to the model’s port. I worked with their IT team to whitelist the port, and the issue was resolved in 30 minutes."
What they’re probing:
- Can you push back on scope creep without damaging the relationship?
- Do you understand the trade-offs between features and reliability?

How to answer:
1. Acknowledge the request: "I understand why this is important to you."
2. Clarify the impact: "Adding this feature now would violate our SLOs and risk downtime during the mission."
3. Propose alternatives:
   - "We can add this to the next sprint after we stabilize the system."
   - "Here’s a workaround that achieves the same goal without code changes."
4. Escalate if needed: "Let me check with my team to see if we can reprioritize."
Field story: "A customer demanded a last-minute feature to add real-time alerts to a drone feed dashboard. The system was already violating SLOs, and adding this would have required a major refactor. I explained that we’d need to delay the go-live to implement it safely. Instead, I built a quick Python script that polled the API and sent Slack alerts, which satisfied their needs without risking the deployment."
What they’re probing:
- Can you adapt to extreme constraints?
- Do you know alternative approaches?

How to answer:
1. Ask for constraints upfront: "What are the security policies we need to follow?"
2. Use lightweight alternatives:
   - Metrics: pushgateway (Prometheus) to push metrics from restricted hosts.
   - Logs: Sidecar containers (e.g., fluent-bit) to collect logs and forward them to Elasticsearch.
   - Traces: OpenTelemetry SDK with manual instrumentation.
3. Leverage existing tools: If the customer already has Splunk or Nagios, use those instead of deploying new tools.
4. Document everything: "Here’s what we can monitor, and here’s what we can’t due to security restrictions."
Field story: "I was deploying to a hospital’s air-gapped network where we couldn’t install node_exporter or run Docker. Instead, I used pushgateway to collect metrics from the app, and I wrote a cron job to tail logs and send them to Elasticsearch via a sidecar. It wasn’t perfect, but it gave us enough visibility to debug issues."
Answer: Pre-download all dependencies (Prometheus binaries, Docker images, Helm charts) and store them on a USB drive or internal artifact repo.
Why: Air-gapped environments block external access, so you can’t docker pull or wget in production.

Answer: Check Prometheus metrics for CPU, memory, latency, and error rates.
Why: Logs may be empty due to misconfiguration, but metrics will show if the system is under load or failing.

Answer: Configure Alertmanager to deduplicate, group, and silence alerts (e.g., "Only page me if the pipeline is down for >10 minutes").
Why: Noisy alerts will be ignored, and critical issues will be missed.
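9090 (Prometheus)
9100 (node_exporter)
9093 (Alertmanager)
3000 (Grafana)
9200 (Elasticsearch)
5601 (Kibana)
5044 (Logstash, Beats input)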