Solution: Repeated DNA Sequences

Let's solve the Repeated DNA Sequences problem using the Sliding Window pattern.

We'll cover the following

Statement
Solution
- Naive approach
- Optimized approach using sliding window

Statement

A DNA sequence consists of nucleotides represented by the letters ‘A’, ‘C’, ‘G’, and ‘T’ only. For example, “ACGAATTCCG” is a valid DNA sequence.

Given a string, s, that represents a DNA sequence, return all the 10-letter-long sequences (continuous substrings of exactly 10 characters) that appear more than once in s. You can return the output in any order.

Constraints:

$1 \leq$ s.length $\leq 10^{3}$
s[i] is either 'A', 'C', 'G', or 'T'.

Solution

So far, you’ve probably brainstormed some approaches to solving this problem. Considering time complexity and implementation constraints, let’s explore some of these approaches and determine which to follow.

Naive approach

The naive approach to solving this problem is to examine every possible 10-letter-long substring in the given DNA sequence. We extract each substring of length 10 and, using a set, compare it with all previously seen substrings. If a substring appears more than once, we add it to the result.

Specifically, we iterate through the string and extract every substring of length 10. For each substring, we check whether it has been seen before. If it has, we store it in a separate set that tracks repeated sequences. If not, we add it to the set of seen substrings. Finally, we return all repeated sequences as a list. This method is simple to understand but inefficient: every window is copied into a new 10-character string that must then be hashed and compared, which makes it slow for large inputs.
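The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `find_repeated_sequences_naive` and the parameter `k` are hypothetical names chosen for illustration.

```python
def find_repeated_sequences_naive(s, k=10):
    """Return every k-letter substring that appears more than once in s."""
    seen = set()      # substrings encountered at least once
    repeated = set()  # substrings encountered more than once

    # There are len(s) - k + 1 windows; the range is empty when len(s) < k.
    for i in range(len(s) - k + 1):
        window = s[i:i + k]  # copies k characters into a new string
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)

    return list(repeated)


# Output order is not guaranteed, so sort before printing.
print(sorted(find_repeated_sequences_naive("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```

Keeping `repeated` as a set means a sequence that occurs three or more times is still reported only once.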

We extract all $k$-length substrings (where $k = 10$) from the given string s, which has length $n$. This means we extract $(n - k + 1)$ substrings. Each substring extraction takes $O(k)$ time. Checking whether a substring is in a set takes $O(1)$ on average, but in the worst case (hash collisions), it takes $O(n - k + 1)$ comparisons. Inserting a new substring into the set also takes $O(1)$ on average and $O(n - k + 1)$ in the worst case. Therefore, the overall average-case time complexity is $O((n - k) \times k)$.

The space complexity of this approach is $O((n - k) \times k)$ because, in the worst case, our set can contain $(n - k + 1)$ elements, and at each iteration of the traversal, we are allocating memory to generate a new $k$-length substring.

Optimized approach using sliding window

Because consecutive 10-letter substrings overlap in nine characters, we can slide a window over the string instead of creating new substrings every time. Rather than computing a hash from scratch for each window, we update the previous hash value as the window moves forward on the string. This technique is commonly known as a rolling hash: a way to efficiently calculate hash values for overlapping subarrays or substrings of a larger sequence without recalculating the entire hash from scratch.
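To make the update concrete, here is one possible formulation. It is an assumption for illustration, since the text does not fix a particular hash function: map each nucleotide to a base-4 digit, $d(A) = 0$, $d(C) = 1$, $d(G) = 2$, $d(T) = 3$, and treat the window starting at index $i$ as a $k$-digit base-4 number $H_i$.

$$
H_i = \sum_{j=0}^{k-1} d(s_{i+j}) \cdot 4^{\,k-1-j}
$$

$$
H_{i+1} = 4 \left( H_i - d(s_i) \cdot 4^{\,k-1} \right) + d(s_{i+k})
$$

The subtraction removes the outgoing character's contribution, the multiplication by 4 shifts the remaining digits one place to the left, and the final term adds the incoming character. Each update costs $O(1)$ regardless of $k$.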

The rolling hash can be divided into three main steps:

Initial hash calculation: Calculate the hash for the main string’s first window (substring).
Slide the window: Move the window one character forward.
Update the hash: Use the previous hash value to calculate the new hash without rescanning the whole substring.
- Remove the hash contribution of the outgoing character.
- Add the hash contribution of the incoming character.
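The three steps above might look like this in Python. This is a sketch under stated assumptions rather than the course's own solution: the base-4 digit mapping in `digit`, the function name `find_repeated_sequences`, and the parameter `k` are all illustrative choices not taken from the text.

```python
def find_repeated_sequences(s, k=10):
    """Return every k-letter substring that appears more than once in s."""
    n = len(s)
    if n < k:
        return []

    # Assumed encoding: each nucleotide becomes one base-4 digit.
    digit = {"A": 0, "C": 1, "G": 2, "T": 3}
    base = 4
    high = base ** (k - 1)  # place value of the leftmost digit in a window

    # Step 1: initial hash of the first window, computed in O(k).
    h = 0
    for ch in s[:k]:
        h = h * base + digit[ch]

    seen = {h}        # hashes of windows seen so far
    repeated = set()  # substrings seen more than once

    # Steps 2 and 3: slide one character at a time and update the hash in O(1).
    for i in range(1, n - k + 1):
        h -= digit[s[i - 1]] * high         # remove the outgoing character
        h = h * base + digit[s[i + k - 1]]  # shift, then add the incoming character
        if h in seen:
            repeated.add(s[i:i + k])        # build the substring only on a repeat
        else:
            seen.add(h)

    return list(repeated)


# Output order is not guaranteed, so sort before printing.
print(sorted(find_repeated_sequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```

With four symbols and base 4, every distinct window of length `k` maps to a distinct integer below $4^k$, so two windows share a hash only if they are the same string. A hash with a modulus would need an extra equality check on a match.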

Let’s look at the illustration below to better understand the concept of rolling hash.
