Solution: Find Median from Data Stream

Let's solve the Find Median from Data Stream problem using the Two Heaps pattern.

Statement

Create a data structure that can store a list of integers that can change in size over time and find the median from this dynamically growing list in constant time, O(1)O(1).

Implement a class, MedianOfStream, which should support the following operations:

  1. Constructor(): This initializes the object of this class, which in turn creates the max and the min heap.

  2. Insert Num(num): This adds an integer, num, to the data structure.

  3. Find Median(): This finds the median of all elements seen so far. If there are an even number of elements, return the average of the two middle values.

Constraints:

  • 105-10^5 \leq num 105\leq 10^5, where num is an integer received from the data stream.

  • There will be at least one element in the data structure before the median is computed.

  • At most, 500500 calls will be made to the function that calculates the median.

Solution

So far, you’ve probably brainstormed some approaches and have an idea of how to solve this problem. Let’s explore some of these approaches and figure out which one to follow based on considerations such as time complexity and any implementation constraints.

Naive approach

The naive solution is to first sort the data and then find the median. Insertion sort is an algorithm that can be used to sort the data as it appears. This way, every time a new number is added to the stream, the numbers before that new number are already sorted, allowing insertion at the correct index. This, however, also requires us to shift the elements greater than the inserted number one place forward.

The overall time complexity of the algorithm becomes O(n2)O(n^2), where nn is the number of elements in the data stream. The amortized time complexity of each insertion is therefore O(n2n)O(\frac{n^2}{n}), that is, O(n)O(n). The time complexity of the function that calculates the median would be O(1)O(1), assuming we are storing the data in an array. The space complexity is O(1)O(1).

Optimized approach using two heaps

The most efficient data structure for repeatedly finding the smallest or largest number in a changing list is a heap. We can store the first half of the numbers in a max-heap and the second half in a min-heap. That’s why the two heaps pattern is a perfect fit.

This approach’s essence lies in the median dividing the list into two halves: one with numbers smaller than or equal to itself and the other with numbers larger. To manage these halves and repeatedly calculate the median, the approach uses two heaps: a max-heap to hold the smaller half, allowing quick access to the largest of the smaller numbers, and a min-heap for the larger half, providing quick access to the smallest of the larger numbers. When the data stream starts, the first element is the tentative median, xx. As the stream grows, the process keeps populating the max-heap with numbers smaller than the median, xx. It ensures the top of this heap represents the largest number in the lower half of the list. Conversely, the min-heap is filled with numbers larger than xx, making its top element the smallest in the upper half. The median is one of the top elements of the two heaps if the total number of elements is odd. Otherwise, it is the average of the top elements.

Now, let’s look at an illustration for a better understanding of the algorithm:

Let’s look at the code for this solution below:

class MedianOfStream {
PriorityQueue<Integer> maxHeapForSmallNum;
PriorityQueue<Integer> minHeapForLargeNum;
public MedianOfStream() {
maxHeapForSmallNum = new PriorityQueue<>((a, b) -> b - a);
minHeapForLargeNum = new PriorityQueue<>((a, b) -> a - b);
}
public void insertNum(int num) {
if (maxHeapForSmallNum.isEmpty() || maxHeapForSmallNum.peek() >= num)
maxHeapForSmallNum.add(num);
else
minHeapForLargeNum.add(num);
if (maxHeapForSmallNum.size() > minHeapForLargeNum.size() + 1)
minHeapForLargeNum.add(maxHeapForSmallNum.poll());
else if (maxHeapForSmallNum.size() < minHeapForLargeNum.size())
maxHeapForSmallNum.add(minHeapForLargeNum.poll());
}
public double findMedian() {
if (maxHeapForSmallNum.size() == minHeapForLargeNum.size()) {
return maxHeapForSmallNum.peek() / 2.0 + minHeapForLargeNum.peek() / 2.0;
}
return maxHeapForSmallNum.peek();
}
public static void main(String[] args) {
// Driver code
int[] nums = {35, 22, 30, 25, 1};
MedianOfStream medianOfAges = null;
for(int i=0; i< nums.length; i++){
System.out.print(i+1);
System.out.print(".\tData stream: [");
medianOfAges = new MedianOfStream();
for(int j=0; j<=i; j++){
System.out.print(nums[j]);
if(j != i)
System.out.print(", ");
medianOfAges.insertNum(nums[j]);
}
System.out.println("]");
System.out.println("\tThe median for the given numbers is: " + medianOfAges.findMedian());
System.out.println(PrintHyphens.repeat("-", 100));
}
}
}
Find Median from Data Stream

Solution summary

To recap, the solution to this problem can be divided into the following parts:

  • Use a min-heap to store the larger 50% of the numbers seen so far and a max-heap for the smaller 50% of the numbers.
  • Add the incoming elements to the appropriate heaps.
  • Calculate the median using the top elements of the two heaps.

Time complexity

The time complexity of the Insert Num method will change depending on how many numbers have already been received from the stream. So, the time complexity is amortized over the number of insert operations.

Each insert operation will trigger a heapify process that runs in O(logn)O(\log n) times, where nn is the count of numbers received so far from the stream. Because of this, the cumulative complexity of a sequence of nn insert operations is described by the expression log1+log2+log3++logn\log 1 + \log 2 + \log 3 + … + \log n.

This expression simplifies to logn!\log n!, which, as per Stirling’s approximation, is O(nlogn)O(n \log n). As we have performed nn insert operations, the amortized time complexity of one insert operation is O(nlognn)O(\frac{n \log n}{n}), that is, O(logn)O(\log n).

The time complexity of the Find Median method will be O(1)O(1) since retrieving the top element of a heap is a constant-time operation, and we need to do at most two such retrievals.

Space complexity

The space complexity will be O(n)O(n) since two heaps are used to store the numbers received from the data stream.

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.