Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: Monitoring and Observability (Prometheus, Grafana, ELK Stack, Alerting)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-monitoring-and-observability-prometheus-grafana-elk-stack-alerting

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Forward Deployed Engineer (FDE) Study Guide: Monitoring & Observability

(Prometheus, Grafana, ELK Stack, Alerting)


What This Is

Monitoring and observability are the "eyes and ears" of a Forward Deployed Engineer. In the field, you’re often debugging systems you didn’t build, in environments you don’t control (e.g., a classified DoD network, a hospital’s air-gapped EHR system, or a disaster-response pipeline running on spotty satellite links). Unlike a clean dev environment, you’ll deal with:
- No internet access (air-gapped deployments)
- Strict security policies (no root, no outbound traffic, no third-party SaaS)
- Unreliable infrastructure (power outages, flaky VPNs, misconfigured firewalls)
- High-stakes failures (e.g., a model serving critical intel goes down during an operation, or a data pipeline feeding a live disaster map stops updating)

Field Example:
You’re deployed to a military base to debug a real-time drone feed processing pipeline. The customer reports "the system is slow," but their logs are empty. You:
1. SSH into the bastion host (the only machine with external access).
2. Check Prometheus metrics to see CPU/memory spikes during drone sorties.
3. Tail the application logs (ELK Stack) to find a misconfigured Kafka consumer causing backpressure.
4. Write a quick Python script to reprocess the backlog while you patch the consumer.
5. Set up a Grafana dashboard + Slack alerts for the customer to monitor future issues.

Without observability, you’re flying blind—and in the field, that means mission failure.


Key Terms & Concepts

  • Observability vs Monitoring
  • Monitoring: Tracking known metrics (e.g., CPU, error rates) to detect problems.
  • Observability: Exploring unknown issues by asking arbitrary questions of your system (logs, traces, metrics). FDEs need both—monitoring for known failure modes, observability for the "WTF is happening?!" moments.

  • Prometheus

  • Open-source time-series database for metrics. FDEs use it to scrape custom app metrics (e.g., "How many drone feeds processed per minute?") and set alerts (e.g., "Alert if latency > 500ms for 5 minutes").
  • Key tools: prometheus.yml (config), PromQL (query language), node_exporter (system metrics), blackbox_exporter (probe endpoints).

  • Grafana

  • Visualization layer for Prometheus (and other data sources). FDEs build dashboards for customers (e.g., a "Mission Readiness" dashboard for a command center) and for debugging (e.g., "Why is this model’s latency spiking?").
  • Field tip: Always export dashboards as JSON for version control and quick redeployment.

  • ELK Stack (Elasticsearch, Logstash, Kibana)

  • Elasticsearch: Distributed search engine for logs.
  • Logstash: Log pipeline (ingest, parse, enrich).
  • Kibana: Visualization (like Grafana for logs).
  • FDE use case: Debugging a failed data pipeline in a classified environment where you can’t kubectl logs (because the cluster is locked down). Instead, you query Elasticsearch for the pod’s logs.

  • OpenTelemetry (OTel)

  • Vendor-agnostic framework for traces, metrics, and logs. FDEs use it to instrument apps without vendor lock-in (critical for air-gapped or multi-cloud deployments).
  • Key tools: otel-collector (agent), auto-instrumentation (for Python/Java/Go apps).

  • Alertmanager (Prometheus)

  • Handles alerts from Prometheus (deduplication, routing, silencing). FDEs configure it to avoid alert fatigue (e.g., "Only page me if the drone feed pipeline is down for >10 minutes").
  • Field trap: Always test alerts in staging—customers will ignore them if they’re noisy.

  • Service Level Objectives (SLOs) / Error Budgets

  • SLO: A target for reliability (e.g., "99.9% of drone feed frames processed within 200ms").
  • Error Budget: How much downtime is acceptable before you stop shipping features to fix reliability.
  • FDE use case: A customer demands a new feature, but their system is already violating SLOs. You use the error budget to push back: "We can’t add this until we fix the latency spikes."
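
The error-budget arithmetic is easy to get wrong under pressure, so here is a minimal sketch of it. The function names are hypothetical, and the 30-day window and the one million frames are illustrative assumptions, not values from a real deployment.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of SLO violation allowed in the window (time-based budget)."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def allowed_bad_events(slo: float, total_events: int) -> float:
    """Events allowed to miss the target (request-based budget)."""
    return (1.0 - slo) * total_events


# 99.9% over 30 days leaves 0.1% of 43,200 minutes.
print(round(error_budget_minutes(0.999), 1))       # 43.2
# 99.9% of 1,000,000 frames within 200ms allows about 1,000 slow frames.
print(round(allowed_bad_events(0.999, 1_000_000)))  # 1000
```

If the budget for the current window is already spent, that number is the concrete evidence behind "we can’t add this until we fix the latency spikes."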

  • Blackbox Monitoring

  • Testing a system from the outside (e.g., "Can the drone feed API be reached from the command center?"). FDEs use blackbox_exporter to probe endpoints (HTTP, TCP, ICMP) and alert if they’re unreachable.
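  • Example: curl -v http://drone-feed-api:8080/health → if the connection times out or is refused, suspect the network, a firewall, or a stopped service; if you get an HTTP error response, the request reached the app and the problem is inside it.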
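
When curl is not available on a locked-down host, the same outside-in check can be sketched with the Python standard library. This is an illustration only: `probe` and `diagnose` are hypothetical helper names, and the status thresholds are a simple assumption about how to split "network" from "app" problems.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 3.0):
    """Return (reached, http_status); reached is False if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return True, exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return False, None


def diagnose(reached: bool, status) -> str:
    """Turn a probe result into a first guess about where to look."""
    if not reached:
        return "unreachable: check DNS, firewall, routing, or whether the service is running"
    if status >= 500:
        return "reachable, app error: check application logs"
    if status >= 400:
        return "reachable, request rejected: check path, auth, or proxy rules"
    return "healthy"


# Usage on the customer network (URL taken from the example above):
# print(diagnose(*probe("http://drone-feed-api:8080/health")))
```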

  • Whitebox Monitoring

  • Monitoring internal metrics (e.g., "How many Kafka messages are in the queue?"). FDEs instrument apps with Prometheus client libraries (e.g., prometheus-client for Python) to expose custom metrics.

  • Distributed Tracing

  • Tracking a request across microservices (e.g., "Why is the drone feed processing slow?"). FDEs use Jaeger or OpenTelemetry to trace requests through Kafka → ML model → database.
  • Field tip: Always add a traceparent header to HTTP requests for end-to-end tracing.
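
For reference, a traceparent value in the W3C Trace Context format has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags. The sketch below builds one by hand; `new_traceparent` is a hypothetical helper, and in practice an OpenTelemetry SDK would normally generate and propagate this header for you.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header value with random IDs."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # 01 marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"


# Attach it to an outgoing request so downstream services join the same trace.
headers = {"traceparent": new_traceparent()}
print(headers["traceparent"])
```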

  • Air-Gapped Observability

  • Deploying monitoring tools in a network with no internet access. FDEs must:

    • Pre-download all dependencies (Docker images, Helm charts, binaries).
    • Use offline mirrors (e.g., nexus for artifacts, minio as S3-compatible object storage if you add long-term metric storage such as Thanos; Prometheus itself writes its TSDB to local disk).
    • ⚠️ Never assume you can pip install or docker pull in production.

  • Security Constraints

  • No outbound traffic: Prometheus can’t scrape external endpoints.
  • No root: You often can’t install node_exporter as a system service on customer machines (it can run unprivileged, but only if policy allows new binaries at all).
  • No third-party SaaS: No Datadog, New Relic, or AWS CloudWatch.
  • FDE workaround: Use pushgateway (Prometheus) to push metrics from restricted hosts, or sidecar containers to collect logs. Pushed metrics persist until they are deleted, so clean up stale groups.
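
Pushgateway accepts metrics in the Prometheus text exposition format over HTTP PUT or POST to /metrics/job/<job>, so a push needs nothing beyond the standard library. The sketch below assumes a gateway at http://pushgateway:9091 (9091 is the Pushgateway default port); the metric name, job name, and helper functions are hypothetical.

```python
import urllib.request


def render_gauge(name: str, help_text: str, value: float) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    text = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"  # the body must end with a newline
    )
    return text.encode("utf-8")


def build_push(gateway: str, job: str, body: bytes) -> urllib.request.Request:
    """Build the push request; PUT replaces everything pushed earlier for this job."""
    return urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


body = render_gauge("backlog_frames_remaining", "Frames left to reprocess", 1250.0)
req = build_push("http://pushgateway:9091", "backlog_reprocess", body)
# Send it from a host that can reach the gateway:
# urllib.request.urlopen(req, timeout=5)
```

Prometheus still has to scrape the Pushgateway itself, so the gateway must sit somewhere the Prometheus server is allowed to reach.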

  • Customer-Facing Dashboards

  • FDEs build dashboards for non-technical users (e.g., a "Mission Status" dashboard for a colonel). Rules:
    • No raw metrics (e.g., "CPU usage" → "System Health: Good/Warning/Danger").
    • Annotate with context (e.g., "Latency spike at 14:30 during drone sortie #42").
    • ⚠️ Never expose internal IPs or sensitive data.
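
The "no raw metrics" rule amounts to a small mapping from numbers to a status word. A minimal sketch follows; the function name and every threshold in it are hypothetical and would need to be agreed with the customer. In Grafana the same idea is usually expressed with panel thresholds and value mappings.

```python
def health_status(cpu_pct: float, p95_latency_s: float, feed_up: bool) -> str:
    """Collapse raw metrics into the Good/Warning/Danger wording used on the dashboard."""
    # Assumed thresholds: Danger if the feed is down or p95 latency exceeds 1s.
    if not feed_up or p95_latency_s > 1.0:
        return "Danger"
    # Assumed thresholds: Warning above 85% CPU or 500ms p95 latency.
    if cpu_pct > 85 or p95_latency_s > 0.5:
        return "Warning"
    return "Good"


print(health_status(cpu_pct=40, p95_latency_s=0.2, feed_up=True))   # Good
print(health_status(cpu_pct=90, p95_latency_s=0.2, feed_up=True))   # Warning
print(health_status(cpu_pct=40, p95_latency_s=0.2, feed_up=False))  # Danger
```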


Step-by-Step / Field Process


1. Deploy Observability in a Restricted Environment

Scenario: You’re deploying a data pipeline in an air-gapped DoD network. You need monitoring, but:
- No internet access.
- No root on customer machines.
- No outbound traffic allowed.

Steps:

  1. Pre-download dependencies:
    ```bash
    # Download Prometheus, Grafana, and node_exporter binaries (or Docker images)
    wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
    wget https://dl.grafana.com/oss/release/grafana-10.2.0.linux-amd64.tar.gz
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
    ```
    - Store these on a USB drive or internal artifact repo (e.g., nexus).


  2. Deploy Prometheus:
    - Copy binaries to the target machine (e.g., via scp).
    - Configure prometheus.yml to scrape local targets:
    ```yaml
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']  # node_exporter
      - job_name: 'app'
        static_configs:
          - targets: ['localhost:8000']  # your app's /metrics endpoint
    ```
    - Start Prometheus:
    ```bash
    ./prometheus --config.file=prometheus.yml --storage.tsdb.path=/data/prometheus
    ```

  3. Deploy Grafana:
    - Start Grafana:
    ```bash
    ./bin/grafana-server --homepath=./
    ```
    - Configure Prometheus as a data source (via Grafana UI or API):
    ```bash
    # admin:admin is Grafana's default login; change it before handing the system over
    curl -X POST http://admin:admin@localhost:3000/api/datasources \
      -H "Content-Type: application/json" \
      -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'
    ```

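  4. Instrument your app:
    - Add Prometheus metrics to your Python app (the example assumes Flask, which the @app.route decorator implies):
    ```python
    from flask import Flask
    from prometheus_client import Counter, start_http_server

    app = Flask(__name__)
    REQUEST_COUNT = Counter('app_requests_total', 'Total HTTP Requests')

    @app.route('/process')
    def process():
        REQUEST_COUNT.inc()
        # ... your logic
        return "ok"

    start_http_server(8000)  # serves /metrics on port 8000
    ```
    - Expose metrics on /metrics (port 8000 here, matching the 'app' scrape target in prometheus.yml).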
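
If the client library cannot be installed (no pip install in an air-gapped environment), a /metrics endpoint can be approximated with the standard library alone, because the text exposition format is plain text. This is a rough sketch under that assumption, not a replacement for prometheus_client: the `Counter` class here is a hypothetical stand-in that supports only an unlabelled counter.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Tiny thread-safe counter (a stand-in for prometheus_client.Counter)."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text = name, help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Return the counter in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


REQUEST_COUNT = Counter("app_requests_total", "Total HTTP Requests")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUEST_COUNT.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


# In the app, run the exporter in the background and count requests:
# threading.Thread(target=serve_metrics, daemon=True).start()
# REQUEST_COUNT.inc()
```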

  5. Set up alerts:
    - Define the rule in an alert.rules file and list that file under rule_files in prometheus.yml:
    ```yaml
    groups:
      - name: example
        rules:
          - alert: HighLatency
            # Keep "instance" in the aggregation so the annotation below can reference it
            expr: histogram_quantile(0.95, sum(rate(app_latency_seconds_bucket[5m])) by (le, instance)) > 0.5
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High latency detected (instance {{ $labels.instance }})"
    ```
    - Configure Alertmanager to route alerts to Slack/email (if allowed) or to a local webhook receiver that writes them to a file.
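
To make the HighLatency expression less of a black box, the sketch below reproduces the core idea of histogram_quantile in Python: find the bucket that contains the requested rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for buckets le=0.1, 0.25, 0.5, 1.0, +Inf.
sample = [(0.1, 60.0), (0.25, 90.0), (0.5, 98.0), (1.0, 100.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, sample)
print(round(p95, 3))  # about 0.406
print(p95 > 0.5)      # False: the alert condition is not met
```

Because the result is interpolated between bucket bounds, its accuracy depends on how the app's histogram buckets are chosen around the 0.5s threshold.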

  6. Test in staging:
    - Simulate failures (e.g., kill a pod, throttle network) and verify alerts fire.
    - ⚠️ Always test in the exact customer environment: firewalls, SELinux, and network policies will break things.

2. Debug a Live Incident (e.g., "The System is Slow")

Scenario: A customer reports "the drone feed processing is slow," but they don’t know why. You’re on-site with no prior access to the system.

Steps:

  1. Check the basics:
```bash
# SSH into the bastion host (the only machine with external access)
ssh bastion@customer-gateway

# Check DNS resolution (common issue in classified networks)
nslookup drone-feed-api

# Check if the service is reachable
curl -v http://drone-feed-api:8080/health
```


  2. Check Prometheus metrics:
    - Open Grafana (or query Prometheus directly):
    ```promql
    # CPU usage (%)
    100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Memory usage (%): 100 minus the share that is still available
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

    # App latency (p95, in seconds)
    histogram_quantile(0.95, sum(rate(app_latency_seconds_bucket[5m])) by (le))
    ```
    - Look for spikes during the reported slowdown.

  3. Check logs (ELK Stack):
    - Query Elasticsearch for errors:
    ```bash
    curl -X GET "http://elasticsearch:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "level": "ERROR" } },
            { "range": { "@timestamp": { "gte": "now-1h" } } }
          ]
        }
      }
    }'
    ```
    - Filter for the time window of the slowdown.
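
Narrowing the search to the slowdown window means swapping the relative "now-1h" bound for explicit timestamps. The sketch below builds such a query body; `error_query` is a hypothetical helper, the date is a placeholder, and the field names (level, @timestamp) are the ones used in the curl example and may differ in the customer's index mapping.

```python
import json
from datetime import datetime, timedelta, timezone


def error_query(start: datetime, end: datetime, size: int = 100) -> dict:
    """Build an Elasticsearch query for ERROR logs between start and end."""
    return {
        "size": size,
        "sort": [{"@timestamp": "asc"}],
        "query": {
            "bool": {
                "must": [{"match": {"level": "ERROR"}}],
                "filter": [
                    {
                        "range": {
                            "@timestamp": {
                                "gte": start.isoformat(),
                                "lte": end.isoformat(),
                            }
                        }
                    }
                ],
            }
        },
    }


# Placeholder window: 30 minutes starting at 14:15 UTC.
start = datetime(2024, 1, 1, 14, 15, tzinfo=timezone.utc)
body = error_query(start, start + timedelta(minutes=30))
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to curl with -d @file in place of the inline body.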

  4. Check traces (Jaeger/OpenTelemetry):
    - Open Jaeger UI and search for slow traces:
    ```bash
    # Find traces with high duration (this HTTP API backs the Jaeger UI and may change between versions)
    curl -X GET "http://jaeger-query:16686/api/traces?service=drone-feed-processor&minDuration=500ms"
    ```
    - Look for bottlenecks (e.g., a slow database query, Kafka consumer lag).

  5. Reproduce the issue:
    - Write a quick Python script to simulate load:
    ```python
    import time

    import requests

    while True:
        start = time.time()
        # A timeout keeps the loop from hanging if the service stops answering
        r = requests.get("http://drone-feed-api:8080/process", timeout=10)
        print(f"Status: {r.status_code}, latency: {time.time() - start:.2f}s")
        time.sleep(1)
    ```
    - Run it from the customer’s network to rule out local issues.
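
Individual latency prints are hard to compare with the p95 figure from Prometheus, so it can help to collect the samples and summarize them. A minimal sketch using nearest-rank percentiles follows; the helper names and the sample values are hypothetical.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    # ceil(pct / 100 * n), computed with integers to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples) -> dict:
    """Summarize latencies (in seconds) collected by a load script."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "max": max(samples),
    }


# Made-up latencies in seconds, as a load loop might record them.
latencies = [0.21, 0.19, 0.25, 1.40, 0.22, 0.20, 0.23, 0.95, 0.24, 0.18]
print(summarize(latencies))
```

A p50 that looks healthy next to a much higher p95 or max suggests intermittent stalls rather than a uniformly slow service.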

  6. Mitigate and fix:
    - If it’s a resource issue (CPU/memory), scale the service or optimize the code.
    - If it’s a dependency issue (e.g., slow database), add caching or retry logic.
    - If it’s a network issue, work with the customer’s IT team to adjust firewalls.

3. Set Up Customer-Facing Dashboards

Scenario: The customer (a military command center) wants a "Mission Readiness" dashboard to monitor drone feeds, model accuracy, and system health.

Steps:

  1. Define the audience:
- Non-technical users (e.g., officers) need high-level status (Green/Yellow/Red).
- Technical users (e.g., IT staff) need detailed metrics.


  2. Design the dashboard:
    - Top row: Big status panels (e.g., "Drone Feed Status: OPERATIONAL").
    - Middle row: Time-series trends (e.g., "Frames Processed per Minute").
    - Bottom row: Anomaly detection (e.g., "Model Accuracy Drop Detected").
    - Annotations: Add context (e.g., "Latency spike at 14:30 during sortie #42").

  3. Build in Grafana:
    - Use the "Stat" panel for status indicators:
    ```promql
    # Drone Feed Status (1 = healthy, 0 = down)
    up{job="drone-feed-api"}
    ```
    - Use the "Time series" panel for trends:
    ```promql
    # Frames processed per minute (rate() is per second, so multiply by 60)
    sum(rate(drone_frames_processed_total[1m])) * 60
    ```
    - Add thresholds (e.g., "Accuracy < 90% → Yellow").

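  4. Deploy to the customer:
    - Export the dashboard as JSON. The API wraps it in a meta envelope, so keep only the dashboard and clear its numeric id so the target instance can assign its own:
    ```bash
    curl -s http://admin:admin@localhost:3000/api/dashboards/uid/your-dashboard-uid \
      | jq '{dashboard: (.dashboard | .id = null), overwrite: true}' > mission-readiness.json
    ```
    - Import it into the customer’s Grafana instance:
    ```bash
    curl -X POST http://admin:admin@customer-grafana:3000/api/dashboards/db \
      -H "Content-Type: application/json" \
      -d @mission-readiness.json
    ```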
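
Grafana's GET /api/dashboards/uid/... response wraps the dashboard in a "meta" plus "dashboard" envelope, while POST /api/dashboards/db expects a "dashboard" object whose id is null for a new dashboard. On hosts where only Python is available, the conversion can be sketched as below; `to_import_payload` and the file names in the usage comment are hypothetical.

```python
import json


def to_import_payload(exported: dict, overwrite: bool = True) -> dict:
    """Convert a Grafana dashboard export into a body for POST /api/dashboards/db."""
    dashboard = dict(exported["dashboard"])  # copy so the input is left untouched
    dashboard["id"] = None  # let the target Grafana assign its own numeric id
    return {"dashboard": dashboard, "overwrite": overwrite}


# Usage with hypothetical file names:
# with open("export.json") as src, open("mission-readiness.json", "w") as dst:
#     json.dump(to_import_payload(json.load(src)), dst, indent=2)

example = {"meta": {"slug": "mission-readiness"},
           "dashboard": {"id": 7, "uid": "abc", "title": "Mission Readiness"}}
print(json.dumps(to_import_payload(example)))
```

Keeping the uid while clearing the id lets a later re-import update the same dashboard instead of creating a duplicate.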

  5. Train the customer:
    - Walk them through the dashboard (e.g., "If the 'Model Accuracy' panel turns red, call us").
    - Leave a one-pager with troubleshooting steps.

Common Mistakes

| Mistake | Correction | Why |
|---|---|---|
| Assuming you can docker pull in production | Pre-download all images and store them in an internal registry (e.g., nexus). | Air-gapped environments block external access. Always test deployments in a staging environment that mirrors production. |
| Not testing alerts in staging | Simulate failures (e.g., kill a pod, throttle network) and verify alerts fire. | Customers will ignore alerts if they’re noisy or false positives. |
| Exposing raw metrics to non-technical users | Build customer-facing dashboards with clear status indicators (e.g., "System Health: Good/Warning/Danger"). | Raw metrics (e.g., "CPU usage: 85%") are meaningless to non-engineers. |
| Not instrumenting custom app metrics | Add Prometheus client libraries to your app to expose custom metrics (e.g., drone_frames_processed_total). | Default metrics (CPU, memory) won’t tell you if your app is working correctly. |
| Ignoring security constraints | Use pushgateway for restricted hosts, sidecar containers for logs, and avoid third-party SaaS. | Customers may block outbound traffic, root access, or external services. Always ask about constraints upfront. |
| Not annotating dashboards | Add context to dashboards (e.g., "Latency spike at 14:30 during sortie #42"). | Without annotations, users won’t know what caused an issue. |


FDE Interview / War Story Insights


1. "How would you debug a system you’ve never seen before?"

What they’re probing:
- Can you quickly orient yourself in an unfamiliar environment?
- Do you know how to use observability tools to ask the right questions?

How to answer:
1. Check the basics: DNS, network connectivity, service health.
    ```bash
    nslookup service-name
    curl -v http://service-name:port/health
    ```
2. Look at metrics: Prometheus/Grafana for CPU, memory, latency, error rates.
3. Check logs: ELK Stack for errors, stack traces, or unusual patterns.
4. Trace requests: Jaeger/OpenTelemetry to follow a request through the system.
5. Reproduce the issue: Write a quick script to simulate load or trigger the error.

Field story: "I was debugging a classified system where the customer said, ‘The model is slow.’ There were no logs, no metrics, and no documentation. I started by checking if the service was even running (systemctl status), then used tcpdump to see if requests were reaching it. Turns out, the customer’s firewall was blocking traffic to the model’s port. I worked with their IT team to whitelist the port, and the issue was resolved in 30 minutes."


2. "The customer demands a feature that violates the original scope. How do you respond?"

What they’re probing:
- Can you push back on scope creep without damaging the relationship?
- Do you understand the trade-offs between features and reliability?

How to answer:
1. Acknowledge the request: "I understand why this is important to you."
2. Clarify the impact: "Adding this feature now would violate our SLOs and risk downtime during the mission."
3. Propose alternatives:
    - "We can add this to the next sprint after we stabilize the system."
    - "Here’s a workaround that achieves the same goal without code changes."
4. Escalate if needed: "Let me check with my team to see if we can reprioritize."

Field story: "A customer demanded a last-minute feature to add real-time alerts to a drone feed dashboard. The system was already violating SLOs, and adding this would have required a major refactor. I explained that we’d need to delay the go-live to implement it safely. Instead, I built a quick Python script that polled the API and sent Slack alerts, which satisfied their needs without risking the deployment."


3. "How do you handle a situation where the customer’s environment is so locked down that you can’t deploy standard monitoring tools?"

What they’re probing:
- Can you adapt to extreme constraints?
- Do you know alternative approaches?

How to answer:
1. Ask for constraints upfront: "What are the security policies we need to follow?"
2. Use lightweight alternatives:
    - Metrics: pushgateway (Prometheus) to push metrics from restricted hosts.
    - Logs: Sidecar containers (e.g., fluent-bit) to collect logs and forward them to Elasticsearch.
    - Traces: OpenTelemetry SDK with manual instrumentation.
3. Leverage existing tools: If the customer already has Splunk or Nagios, use those instead of deploying new tools.
4. Document everything: "Here’s what we can monitor, and here’s what we can’t due to security restrictions."

Field story: "I was deploying to a hospital’s air-gapped network where we couldn’t install node_exporter or run Docker. Instead, I used pushgateway to collect metrics from the app, and I wrote a cron job to tail logs and send them to Elasticsearch via a sidecar. It wasn’t perfect, but it gave us enough visibility to debug issues."


Quick Check Questions


1. You’re deploying Prometheus in an air-gapped environment. What’s your first step?

Answer: Pre-download all dependencies (Prometheus binaries, Docker images, Helm charts) and store them on a USB drive or internal artifact repo.
Why: Air-gapped environments block external access, so you can’t docker pull or wget in production.


2. A customer reports "the system is slow," but their logs are empty. What do you check first?

Answer: Check Prometheus metrics for CPU, memory, latency, and error rates.
Why: Logs may be empty due to misconfiguration, but metrics will show if the system is under load or failing.


3. You’re setting up alerts for a critical pipeline. What’s one thing you must do to avoid alert fatigue?

Answer: Configure Alertmanager to deduplicate, group, and silence alerts (e.g., "Only page me if the pipeline is down for >10 minutes").
Why: Noisy alerts will be ignored, and critical issues will be missed.


Last-Minute Cram Sheet

  1. Prometheus ports: 9090 (server), 9100 (node_exporter), 9093 (Alertmanager).
  2. Grafana port: 3000 (default).
  3. ELK ports: 9200 (Elasticsearch), 5601 (Kibana), 5044 (Logstash).
  4. Prometheus config file: prometheus.yml (scrape targets, plus rule_files entries that point to alert rule files).
  5. Grafana data source: Configure Prometheus/Elasticsearch via UI or API.
  6. Air-gapped deployments: Pre-download all dependencies (binaries, images, charts).
  7. Security constraints: Use pushgateway, sidecars, and avoid third-party SaaS.
  8. Customer dashboards: No raw metrics—use status indicators (Green/Yellow/Red).
  9. ⚠️ Always test alerts in staging—customers will ignore noisy alerts.
  10. ⚠️ Never assume you can kubectl logs—use ELK or OpenTelemetry instead.

