Charging Station: Solving the Clump Finding Problem
Let’s have a look at the solutions to the clump finding problem.
We'll cover the following
This lesson assumes that you’ve read Charging Station: The Frequency Array.
The pseudocode below slides a window of length L down Genome. After computing the frequency array for the current window, it identifies (L, t)-clumps simply by finding which k-mers occur at least t times within the window. To keep track of these clumps, our algorithm uses an array Clump of length 4 whose values are all initialized to zero. For each value of i between 0 and 4 − 1, we’ll set Clump(i) equal to 1 if NumberToPattern(i, k) forms an (L, t)-clump in Genome.
ClumpFinding(Genome, k, L, t)
FrequentPatterns ← an empty set
for i ← 0 to 4^k − 1
CLUMP(i) ← 0
for i ← 0 to |Genome|−L
Text ← the string of length L starting at position i in Genome
FrequencyArray ← ComputingFrequencies(Text, k)
for index ← 0 to 4^k − 1
if FREQUENCYARRAY(index) ≥ t
CLUMP(index) ← 1
for i ← 0 to 4^k − 1
if CLUMP(i) = 1
Pattern ← NUMBERTOPATTERN(i, k)
add Pattern to the set FrequentPatterns
return FrequentPatterns
Get hands-on with 1200+ tech skills courses.