Solution: Repeated DNA Sequences
Let's solve the Repeated DNA Sequences problem using the Sliding Window pattern.
Statement
Given a string, dna
, that represents a DNA subsequence, and a number , return all the contiguous subsequences (substrings) of length that occur more than once in the string. The order of the returned subsequences does not matter. If no repeated substring is found, the function should return an empty set.
The DNA sequence is composed of a series of nucleotides abbreviated as , , , and . For example, is a DNA sequence. When studying DNA, it is useful to identify repeated subsequences in it.
Constraints:
-
dna.length
dna[i]
is either A, C, G, or T.
Solution
So far, you’ve probably brainstormed some approaches on how to solve this problem. Let’s explore some of these approaches and figure out which one to follow while considering time complexity and any implementation constraints.
Naive approach
A naive approach would be to iterate through the input DNA sequence and add all the unique substrings of length to a set. If a substring is already present in a set, it is a repeated substring.
Here’s how the algorithm works:
-
We iterate the string using a pointer , ranging from to . This is the number of -length substrings present in the sequence.
-
At each iteration, we generate the current -length substring, i.e., dna[i]…dna[i + k - 1].
-
Next, we check if this substring is already present in the set.
-
If it is, the current substring is a repeated sequence, so we add it to our output.
-
Otherwise, the current substring has not yet been repeated, so we just add it to the set.
-
-
We repeat the above process for all -length substrings.
-
Once all -length substrings have been evaluated, we return the output.
The time complexity of this approach is , where is the length of the input sequence and is the size of each contiguous subsequence we consider. This is because we iterate over substrings of length , and at each iteration, the time taken to generate a -length substring is .
The space complexity of this approach is , since in the worst case, our set can contain elements, and at each iteration of the traversal, we are allocating memory to generate a new -length substring.
Optimized approach using sliding window
The problem can be optimized using a sliding window approach. We use the Rabin-Karp algorithm that utilizes a sliding window with
Here’s the basic idea of the algorithm:
-
We traverse the string by using a sliding window of length , which slides one character forward on each iteration.
-
On each iteration, we compute the hash of the current -length substring in the window.
-
We check if the hash is already present in the set.
-
If it is, the substring is repeated, so we add it to the output.
-
Otherwise, the substring has not yet been repeated, so we add the computed hash to the set.
-
-
We repeat the above process for all -length substrings by sliding the window one character forward on each iteration.
-
After all -length substrings have been evaluated, we return the output.
There are multiple approaches for computing hash values, and the choice of the hash function can impact the algorithm’s time complexity. Let’s look at some approaches below.
Hashing and comparison in linear time
Let’s use a simple hashing method that sums the ASCII code of characters present in a window.
Consider the sequence with .
-
Initially, the sequence in the window is and its hash value is:
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and the ASCII of will then be added:
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and then again added:
Here, we have the same hash value but different sequences— and . This means that if a hash value is already present in the set, we need to compare the corresponding sequences as well to confirm if they are identical. In this case, they are not, so we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and then again added:
Here we have the same hash value, so we compare the two sequences. Since they are identical, the sequence has been repeated and is therefore added to the output.
Computing the hash value and then comparing the strings if the hashes are equal will take linear time, . In the worst case, the comparisons will occur after each slide, which will make the running time the same as that of the naive approach, which is .
Hashing and comparison in constant time
We need a hash function that helps us achieve constant-time hashing. For this purpose, we use the polynomial rolling hash technique:
Here, is a constant, are the characters in a sequence, and is the substring length. Since we only have possible nucleotides, our would be . We also assign numeric values to the nucleotides, as shown in the table below:
Note: We use a value of for the constant since this ensures that each nucleotide is assigned a unique value in the polynomial hash function. This reduces the number of hash collisions and helps reduce the randomness in the behaviour of the hash function.
It is not necessary to set the constant equal to . In fact, this choice is somewhat arbitrary and can be any number that is greater than or equal to the number of nucleotides. The idea is to choose a value that is large enough to avoid hash collisions while still being relatively small, so as to minimize the risk of arithmetic overflow.
Example: The polynomial hash value for the sequence will be
Consider the same sequence with .
-
Initially, the sequence in the window is , and its hash value is:
Since the above hash value has not been repeated yet, we add this hash value in the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , we first need to remove the contribution of from the previous hash value:
and then add the contribution of :
This can’t be right, since both and cannot yield the same hash value. Examining our work carefully, we notice that we are not correctly accounting for place values. So, we need to shift the remaining bases to the left by one position so that the hash corresponds to the new sliding window. We do this by multiplying the current hash value by the base value . This means that becomes and becomes . After that, we add the contribution of the incoming character to get the new hash value for the current sliding window:
Since this hash value has not been seen yet, we add it to the set and slide the window one character forward.
-
The sequence in the window is now . We compute the hash value of in the same way as explained above:
Since this hash value has not been seen yet, we add it to the set and slide the window one character forward. As you can see, even though and contain the same characters, since their relative order is different, the respective hash values of the two strings are also different.
-
The sequence in the window is now . We compute the hash value of in the same way as explained above:
This hash value is present in the set. Therefore, we add it to the output without checking if the sequences are identical.
From the above two approaches, it is clear that we will be using the polynomial rolling hash function in our solution to compute the hash values of the -length substrings since this method skips the need to compare two sequences if their hash values are the same and, therefore, helps us achieve constant-time hashing and comparison.
The slides below illustrate how the algorithm runs:
Note: In the following section, we will gradually build the solution. Alternatively, you can skip straight to just the code.
Step-by-step solution construction
We will first define the following variables to set ourselves up to implement the polynomial rolling hash function:
mapping
: This is a hash map defining the numeric mapping of nucleotides. It will be of the form: .baseValue
: This is the base value used in the polynomial hash function. We will use a base value of since there are nucleotides in the sequence.numbers
: This is an array storing the integer form of the stringdna
based on the mapping defined above. For example, for the sequence ,numbers
will consist of . This array will make it easier to access the numeric value of the nucleotides when calculating the hash value.
class RepeatedDNA {public static void findRepeatedSequences(String dna, int k) {int stringLength = dna.length();// Mapping of charactersMap<Character, Integer> mapping = new HashMap<>();mapping.put('A', 1);mapping.put('C', 2);mapping.put('G', 3);mapping.put('T', 4);// Base valueint baseValue = 4;// Numeric representation of the sequenceList<Integer> numbers = new ArrayList<>(Arrays.asList(new Integer[stringLength]));Arrays.fill(numbers.toArray(), 0);for (int i = 0; i < stringLength; i++) {numbers.set(i, mapping.get(dna.charAt(i)));}System.out.println("\tConverted sequence: " + numbers.toString());}// Driver codepublic static void main(String[] args) {List<String> inputsString = Arrays.asList("ACGT", "AGACCTAGAC", "AAAAACCCCCAAAAACCCCCC","GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTCCCCCCCTTTTTTCCCCCCCTTTTTTT", "TTTTTGGGTTTTCCA","AAAAAACCCCCCCAAAAAAAACCCCCCCTG", "ATATATATATATATAT");List<Integer> inputsK = Arrays.asList(3, 3, 8, 12, 10, 14, 10, 6);for (int i = 0; i < inputsK.size(); i++) {System.out.println((i + 1) + ".\tInput sequence: " + inputsString.get(i) +"\n\tk: " + inputsK.get(i) + "\n");findRepeatedSequences(inputsString.get(i), inputsK.get(i));System.out.println(Print.repeat("-", 100));}}}
We declare a variable, hashValue
, to store the hash value of the current -length sequence in the window. It is initialized to .
Next, we slide the window along the string, dna
, using a pointer, i
, ranging from to :
-
When
i
is , the window is at its starting position, i.e., the first -length substring. For this sequence, we calculate the hash value from scratch using the above-mentioned polynomial hash function.for (int j = 0; j < k; j++) { hashValue += numbers.get(j) * (int) Math.pow(baseValue, k - j - 1); }
-
Otherwise, the window is not at its starting position. So, we calculate the hash value of the current -length substring by utilizing the hash value of the previous -length substring:
int previousHashValue = hashValue; hashValue = ((previousHashValue - numbers.get(i - 1) * (int) Math.pow(baseValue, k - 1)) * baseValue) + numbers.get(i + k - 1);
-
The above process is repeated by sliding the window one character forward.
class RepeatedDNA {public static void findRepeatedSequences(String dna, int k) {int stringLength = dna.length();// Mapping of charactersMap<Character, Integer> mapping = new HashMap<>();mapping.put('A', 1);mapping.put('C', 2);mapping.put('G', 3);mapping.put('T', 4);// Base valueint baseValue = 4;// Numeric representation of the sequenceList<Integer> numbers = new ArrayList<>(Arrays.asList(new Integer[stringLength]));Arrays.fill(numbers.toArray(), 0);for (int i = 0; i < stringLength; i++) {numbers.set(i, mapping.get(dna.charAt(i)));}// Current hash valueint hashValue = 0;for (int i = 0; i < stringLength - k + 1; i++) {// If the window is at its initial position, calculate the hash value from scratchif (i == 0) {for (int j = 0; j < k; j++) {hashValue += numbers.get(j) * (int) Math.pow(baseValue, k - j - 1);}} else {int previousHashValue = hashValue;hashValue = ((previousHashValue - numbers.get(i - 1) * (int) Math.pow(baseValue, k - 1)) * baseValue) + numbers.get(i + k - 1);}System.out.println("\tHash value of " + dna.substring(i, i + k) + ": " + hashValue);}}// Driver codepublic static void main(String[] args) {List<String> inputsString = Arrays.asList("ACGT", "AGACCTAGAC", "AAAAACCCCCAAAAACCCCCC","GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTCCCCCCCTTTTTTCCCCCCCTTTTTTT", "TTTTTGGGTTTTCCA","AAAAAACCCCCCCAAAAAAAACCCCCCCTG", "ATATATATATATATAT");List<Integer> inputsK = Arrays.asList(3, 3, 8, 12, 10, 14, 10, 6);for (int i = 0; i < inputsK.size(); i++) {System.out.println((i + 1) + ".\tInput sequence: " + inputsString.get(i) +"\n\tk: " + inputsK.get(i) + "\n");findRepeatedSequences(inputsString.get(i), inputsK.get(i));System.out.println(Print.repeat("-", 100));}}}
We declare the following variables to keep track of the hash values and store the repeated substrings:
-
hashSet
: This is a set that stores all the unique hash values of the -length substrings. It is initialized to empty. -
output
: This is a set that stores the repeated substrings.
We check if the calculated hash value of the current -length substring is present in hashSet
:
-
If it is, the substring is repeated, so it is added to
output
:if (hashSet.contains(hashValue)) { output.add(dna.substring(i, i + k)); }
-
Otherwise, we will just add the hash value of the substring to
hashSet
:hashSet.add(hashValue)
When the hash values of all -length substrings have been evaluated, i.e., the sliding window can not move forward, we return output
.
class RepeatedDNA {public static Set<String> findRepeatedSequences(String dna, int k) {int stringLength = dna.length();// Mapping of charactersMap<Character, Integer> mapping = new HashMap<>();mapping.put('A', 1);mapping.put('C', 2);mapping.put('G', 3);mapping.put('T', 4);// Base valueint baseValue = 4;// Numeric representation of the sequenceList<Integer> numbers = new ArrayList<>(Arrays.asList(new Integer[stringLength]));Arrays.fill(numbers.toArray(), 0);for (int i = 0; i < stringLength; i++) {numbers.set(i, mapping.get(dna.charAt(i)));}// Current hash valueint hashValue = 0;// 1. Hash set to store the unique hash values// 2. Output set to store the repeated substringsSet<Integer> hashSet = new HashSet<>();Set<String> output = new HashSet<>();for (int i = 0; i < stringLength - k + 1; i++) {// If the window is at its initial position, calculate the hash value from scratchif (i == 0) {for (int j = 0; j < k; j++) {hashValue += numbers.get(j) * (int) Math.pow(baseValue, k - j - 1);}} else {int previousHashValue = hashValue;hashValue = ((previousHashValue - numbers.get(i - 1) * (int) Math.pow(baseValue, k - 1)) * baseValue) + numbers.get(i + k - 1);}// If the current hash value is present in the hash set, the current substring has been repeated, so we add it to the outputif (hashSet.contains(hashValue)) {output.add(dna.substring(i, i + k));}// We add the evaluated hash value to the hash sethashSet.add(hashValue);System.out.println("\n\tHash value of " + dna.substring(i, i + k) + ": " + hashValue +"\n\tHash Set: " + Print.printSetInt(hashSet) +"\n\tOutput: " + Print.printSetString(output));}return output;}// Driver codepublic static void main(String[] args) {List<String> inputsString = Arrays.asList("ACGT", "AGACCTAGAC", "AAAAACCCCCAAAAACCCCCC","GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTCCCCCCCTTTTTTCCCCCCCTTTTTTT", "TTTTTGGGTTTTCCA","AAAAAACCCCCCCAAAAAAAACCCCCCCTG", "ATATATATATATATAT");List<Integer> inputsK = Arrays.asList(3, 3, 8, 12, 10, 14, 10, 6);for (int i = 0; i < inputsK.size(); i++) {System.out.println((i + 1) + ".\tInput sequence: " + inputsString.get(i) +"\n\tk: " + inputsK.get(i));findRepeatedSequences(inputsString.get(i), inputsK.get(i));System.out.println(Print.repeat("-", 100));}}}
The above solution works for most inputs. However, it will not work if the length of the string, dna
, is less than the window size, k
. So, we need to handle this case by returning an empty set.
class RepeatedDNA {public static Set<String> findRepeatedSequences(String dna, int k) {int stringLength = dna.length();if (stringLength < k) {return new HashSet<String>();}// Mapping of charactersMap<Character, Integer> mapping = new HashMap<>();mapping.put('A', 1);mapping.put('C', 2);mapping.put('G', 3);mapping.put('T', 4);// Base valueint baseValue = 4;// Numeric representation of the sequenceList<Integer> numbers = new ArrayList<>(Arrays.asList(new Integer[stringLength]));Arrays.fill(numbers.toArray(), 0);for (int i = 0; i < stringLength; i++) {numbers.set(i, mapping.get(dna.charAt(i)));}// Current hash valueint hashValue = 0;// 1. Hash set to store the unique hash values// 2. Output set to store the repeated substringsSet<Integer> hashSet = new HashSet<>();Set<String> output = new HashSet<>();for (int i = 0; i < stringLength - k + 1; i++) {// If the window is at its initial position, calculate the hash value from scratchif (i == 0) {for (int j = 0; j < k; j++) {hashValue += numbers.get(j) * (int) Math.pow(baseValue, k - j - 1);}} else {int previousHashValue = hashValue;hashValue = ((previousHashValue - numbers.get(i - 1) * (int) Math.pow(baseValue, k - 1)) * baseValue) + numbers.get(i + k - 1);}// If the current hash value is present in the hash set, the current substring has been repeated, so we add it to the outputif (hashSet.contains(hashValue)) {output.add(dna.substring(i, i + k));}// We add the evaluated hash value to the hash sethashSet.add(hashValue);}return output;}// Driver codepublic static void main(String[] args) {List<String> inputsString = Arrays.asList("ACGT", "AGACCTAGAC", "AAAAACCCCCAAAAACCCCCC","GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTCCCCCCCTTTTTTCCCCCCCTTTTTTT", "TTTTTGGGTTTTCCA","AAAAAACCCCCCCAAAAAAAACCCCCCCTG", "ATATATATATATATAT");List<Integer> inputsK = Arrays.asList(3, 3, 8, 12, 10, 14, 10, 6);for (int i = 0; i < inputsK.size(); i++) {System.out.println((i + 1) + ".\tInput sequence: " + inputsString.get(i) +"\n\tk: " + inputsK.get(i) +"\n\n\tRepeated sequences: " + Print.printSetString(findRepeatedSequences(inputsString.get(i), inputsK.get(i))));System.out.println(Print.repeat("-", 100));}}}
Just the code
Here’s the complete solution to this problem:
class RepeatedDNA {public static Set<String> findRepeatedSequences(String dna, int k) {int stringLength = dna.length();if (stringLength < k) {return new HashSet<String>();}Map<Character, Integer> mapping = new HashMap<>();mapping.put('A', 1);mapping.put('C', 2);mapping.put('G', 3);mapping.put('T', 4);int baseValue = 4;List<Integer> numbers = new ArrayList<>(Arrays.asList(new Integer[stringLength]));Arrays.fill(numbers.toArray(), 0);for (int i = 0; i < stringLength; i++) {numbers.set(i, mapping.get(dna.charAt(i)));}int hashValue = 0;Set<Integer> hashSet = new HashSet<>();Set<String> output = new HashSet<>();for (int i = 0; i < stringLength - k + 1; i++) {if (i == 0) {for (int j = 0; j < k; j++) {hashValue += numbers.get(j) * (int) Math.pow(baseValue, k - j - 1);}} else {int previousHashValue = hashValue;hashValue = ((previousHashValue - numbers.get(i - 1) * (int) Math.pow(baseValue, k - 1)) * baseValue) + numbers.get(i + k - 1);}if (hashSet.contains(hashValue)) {output.add(dna.substring(i, i + k));}hashSet.add(hashValue);}return output;}// Driver codepublic static void main(String[] args) {List<String> inputsString = Arrays.asList("ACGT", "AGACCTAGAC", "AAAAACCCCCAAAAACCCCCC","GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTCCCCCCCTTTTTTCCCCCCCTTTTTTT", "TTTTTGGGTTTTCCA","AAAAAACCCCCCCAAAAAAAACCCCCCCTG", "ATATATATATATATAT");List<Integer> inputsK = Arrays.asList(3, 3, 8, 12, 10, 14, 10, 6);for (int i = 0; i < inputsK.size(); i++) {System.out.println((i + 1) + ".\tInput sequence: " + inputsString.get(i) +"\n\tk: " + inputsK.get(i) +"\n\n\tRepeated sequences: " + Print.printSetString(findRepeatedSequences(inputsString.get(i), inputsK.get(i))));System.out.println(Print.repeat("-", 100));}}}
Solution summary
To recap, the solution to this problem can be divided into the following six main parts:
-
Iterate over all -length substrings.
-
Compute the hash value for the contents of the window.
-
Add this hash value to the set that keeps track of the hashes of all substrings of the given length.
-
Move the window one step forward and compute the hash of the new window using the rolling hash method.
-
If the hash value of a window has already been seen, the sequence in this window is repeated, so we add it to the output set.
-
Once all substrings have been evaluated, return the output set.
Time complexity
Average case
The average case time complexity of this solution is , where is the length of the input string. It is calculated as follows:
-
Time taken to populate the
numbers
array: . -
Time taken to traverse all the -length substrings: .
-
Time taken to calculate the hash value of a -length substring: .
So, the dominating time complexity becomes .
Worst case
To understand the worst case time complexity of this solution, consider the input string “AAAAAAAA” with . This combination of inputs ensures that a repeated sequence “AA” is detected and added to the output each time the window slides forward. Therefore, we must generate a -length substring on each iteration of the loop. The time to generate a -length substring is . Therefore, the overall time complexity becomes .
Space complexity
The space complexity of this solution is . It is calculated as follows:
-
Space occupied by the
mapping
hash map: . -
Space occupied by the
numbers
array: . -
Space occupied by the
hashSet
set: .
So, the dominating space complexity becomes .
Level up your interview prep. Join Educative to access 70+ hands-on prep courses.