Fatskills
Practice. Master. Repeat.
Study Guide: How to Solve: Find Median from Data Stream
Source: https://www.fatskills.com/leetcode/chapter/how-to-solve-find-median-from-data-stream

How to Solve: Find Median from Data Stream

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~8 min read

How to Solve: Find Median from Data Stream

(Complete Guide for FAANG-Level Interviews)


? Introduction

"This problem appears in 1 out of every 3 onsite interviews—master it, and you prove you can design real-time systems while optimizing for time and space complexity under pressure."


? WHAT YOU NEED TO KNOW FIRST

Before tackling this problem, ensure you understand: 1. Heaps (Priority Queues) – Min-heap and max-heap properties, insertion/deletion in O(log N). 2. Two-Pointer Technique – Balancing two structures to maintain invariants. 3. Amortized Analysis – Why O(log N) per operation is acceptable for real-time systems.


? KEY TECHNIQUES & PATTERNS

Technique / Pattern When to Use Quick Example
Two Heaps (Min + Max) Maintain dynamic median in O(log N) Max-heap for lower half, min-heap for upper half. Median = avg(top of both).
Lazy Removal Avoid O(N) deletions in heaps Track "stale" elements and remove them only when they reach the top.
Balanced Heaps Ensure O(1) median access Keep heaps sizes equal or off by 1 (e.g., len(max_heap) == len(min_heap) ± 1).
Heapify Trick Convert list to heap in O(N) heapq.heapify(arr) (Python) instead of O(N log N) insertions.
Follow-Up: Stream Window Median over a sliding window Use two heaps + deque to evict expired elements.
Follow-Up: Multi-Threading Thread-safe median in O(log N) Use locks or atomic operations on heaps.

? STEP-BY-STEP FRAMEWORK

(Repeatable mental checklist for every "Median from Data Stream" problem.)

  1. Clarify Requirements
  2. Is the stream infinite or bounded?
  3. Should the median be exact or approximate?
  4. Are there duplicates? (Assume yes unless told otherwise.)

  5. Choose Data Structures

  6. Max-heap for the lower half (stores smaller numbers).
  7. Min-heap for the upper half (stores larger numbers).
  8. Optional: Hash map for lazy removal (if follow-up involves deletions).

  9. Define Invariants

  10. len(max_heap) == len(min_heap) OR len(max_heap) == len(min_heap) + 1.
  11. All elements in max_heap ≤ all elements in min_heap.

  12. Insertion Logic

  13. Push to max_heap first.
  14. Rebalance: Move the largest from max_heap to min_heap if needed.
  15. Ensure size invariant holds (adjust heaps if off by > 1).

  16. Median Access

  17. If heaps are equal size: median = (max_heap[0] + min_heap[0]) / 2.
  18. If unequal: median = max_heap[0].

  19. Edge Cases

  20. Empty stream → Return None or raise error.
  21. Single element → Median is that element.
  22. Duplicates → Heaps handle them naturally.

  23. Optimize for Follow-Ups

  24. Sliding window? Use a deque to track expired elements.
  25. Multi-threaded? Add locks around heap operations.

✅ WORKED EXAMPLES

Example 1 – Basic: Median from Infinite Stream

Problem: Design a class to find the median of a continuously growing stream of integers.

Thought Process

  1. Two heaps will track the lower and upper halves.
  2. Insertion:
  3. Push to max_heap (lower half).
  4. Move the largest from max_heap to min_heap (upper half).
  5. Rebalance if min_heap becomes larger than max_heap.
  6. Median:
  7. If heaps are equal size → average of tops.
  8. Else → top of the larger heap.

Python Code

import heapq

class MedianFinder:

def __init__(self):
self.max_heap = [] # lower half (max-heap, stored as negatives)
self.min_heap = [] # upper half (min-heap)
def addNum(self, num: int) -> None:
# Push to max_heap first
heapq.heappush(self.max_heap, -num)
# Move largest from max_heap to min_heap
heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))
# Rebalance if min_heap is larger
if len(self.min_heap) > len(self.max_heap):
heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))
def findMedian(self) -> float:
if len(self.max_heap) > len(self.min_heap):
return -self.max_heap[0]
return (-self.max_heap[0] + self.min_heap[0]) / 2

Complexity

  • Time: O(log N) per insertion, O(1) median access.
  • Space: O(N) to store all elements.

Why This Works

  • Balanced heaps ensure the median is always at the top.
  • Rebalancing maintains the invariant that max_heap is never smaller than min_heap by more than 1.

Example 2 – Medium: Median with Duplicates & Deletions

Problem: Extend the class to support removing numbers from the stream.

Thought Process

  1. Lazy Removal: Track "stale" elements in a hash map.
  2. Clean Heaps: Remove stale elements only when they reach the top.
  3. Insertion/Deletion:
  4. Insert normally, but mark the number as "active."
  5. On deletion, mark as "inactive" and clean heaps if needed.

Python Code

import heapq

class MedianFinderWithDeletion:

def __init__(self):
self.max_heap = []
self.min_heap = []
self.active = {} # tracks active elements
self.delayed = {} # tracks stale elements
def _clean_heap(self, heap, is_max_heap=False):
while heap:
num = -heap[0] if is_max_heap else heap[0]
if num in self.delayed:
self.delayed[num] -= 1
if self.delayed[num] == 0:
del self.delayed[num]
heapq.heappop(heap)
else:
break
def addNum(self, num: int) -> None:
if num in self.active:
self.active[num] += 1
else:
self.active[num] = 1
heapq.heappush(self.max_heap, -num)
heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))
if len(self.min_heap) > len(self.max_heap):
heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))
def removeNum(self, num: int) -> None:
if num not in self.active or self.active[num] == 0:
return
self.active[num] -= 1
if self.active[num] == 0:
del self.active[num]
self.delayed[num] = self.delayed.get(num, 0) + 1
self._clean_heap(self.max_heap, is_max_heap=True)
self._clean_heap(self.min_heap)
def findMedian(self) -> float:
self._clean_heap(self.max_heap, is_max_heap=True)
self._clean_heap(self.min_heap)
if len(self.max_heap) > len(self.min_heap):
return -self.max_heap[0]
return (-self.max_heap[0] + self.min_heap[0]) / 2

Complexity

  • Time: O(log N) per operation (amortized due to lazy removal).
  • Space: O(N) for heaps + O(N) for hash maps.

Why This Works

  • Lazy removal avoids O(N) heap deletions.
  • Cleaning heaps only when necessary keeps operations efficient.

Example 3 – Hard: Median in a Sliding Window

Problem: Find the median of the last k elements in a stream.

Thought Process

  1. Two Heaps + Deque:
  2. Use a deque to track the current window.
  3. Max-heap for the lower half, min-heap for the upper half.
  4. On eviction, remove the expired element from the appropriate heap.

  5. Balancing:

  6. After eviction, rebalance heaps to maintain size invariant.

Python Code

import heapq
from collections import deque

class SlidingWindowMedian:

def __init__(self, k: int):
self.k = k
self.window = deque()
self.max_heap = [] # lower half
self.min_heap = [] # upper half
def addNum(self, num: int) -> None:
self.window.append(num)
if len(self.window) > self.k:
self._remove(self.window.popleft())
self._add(num)
self._balance()
def _add(self, num: int) -> None:
if not self.max_heap or num <= -self.max_heap[0]:
heapq.heappush(self.max_heap, -num)
else:
heapq.heappush(self.min_heap, num)
def _remove(self, num: int) -> None:
if num <= -self.max_heap[0]:
# Lazy removal: mark as stale (not implemented here for brevity)
pass # In practice, use a hash map like Example 2
else:
pass
def _balance(self) -> None:
while len(self.max_heap) > len(self.min_heap) + 1:
heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))
while len(self.min_heap) > len(self.max_heap):
heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))
def findMedian(self) -> float:
if len(self.window) < self.k:
return -1 # or raise error
if len(self.max_heap) == len(self.min_heap):
return (-self.max_heap[0] + self.min_heap[0]) / 2
return -self.max_heap[0]

Complexity

  • Time: O(log N) per insertion (with lazy removal).
  • Space: O(N) for heaps + O(k) for the window.

Why This Works

  • Deque tracks the current window.
  • Heaps maintain the median dynamically.
  • Lazy removal (if implemented) avoids O(N) deletions.

❌ Common Mistakes

Mistake Why it Happens Correct Approach
Using a single sorted list "It’s simpler! → O(N) insertion. Use two heaps for O(log N) insertion.
Not rebalancing heaps Forgetting to maintain size invariant. After every insertion, ensure len(max_heap) == len(min_heap) ± 1.
Ignoring duplicates Assuming all numbers are unique. Heaps handle duplicates naturally; no extra work needed.
Not handling empty stream Edge case overlooked. Return None or raise an error if no elements exist.
Using O(N) median access Sorting the entire list on every query. Keep heaps balanced for O(1) median access.

? INTERVIEW TrapS

Trap How to Spot it How to Avoid it
"What if the stream is infinite?" Interviewer asks about memory limits. Clarify if approximate median is acceptable (e.g., reservoir sampling).
"Can you do better than O(log N)?" Follow-up on time complexity. No, O(log N) is optimal for heaps. Suggest O(1) if using a sorted list.
"How would you handle deletions?" Follow-up on dynamic removals. Use lazy removal with a hash map to track stale elements.

? 1-Minute Recap

"Alright, let’s lock this in. Here’s your 30-second cheat sheet for ‘Median from Data Stream’:

  1. Two heaps: Max-heap for the lower half, min-heap for the upper half.
  2. Insertion: Push to max-heap first, then rebalance to keep sizes equal or off by 1.
  3. Median access: If heaps are equal, average the tops. If not, take the top of the larger heap.
  4. Follow-ups: Sliding window? Use a deque. Deletions? Lazy removal with a hash map.
  5. Edge cases: Empty stream, single element, duplicates—handle them all.

You’ve got this. Walk in, write the heaps, balance them, and own the median in O(log N) time. Now go crush that interview!" ?



ADVERTISEMENT