Charging Station: Generating the Neighborhood of a String
Let’s look at how we can generate the neighborhood of a string by employing different algorithms.
We'll cover the following
The d-neighborhood
Our goal is to generate the d-neighborhood Neighbors(Pattern, d), the set of all k-mers whose Hamming distance from Pattern doesn’t exceed d. We’ll first generate the 1-neighborhood of Pattern using the following pseudocode:
ImmediateNeighbors(Pattern)
Neighborhood ← the set consisting of the single string Pattern
for i = 1 to |Pattern|
symbol ← i-th nucleotide of Pattern
for each nucleotide x different from symbol
Neighbor ← Pattern with the i-th nucleotide substituted by x
add Neighbor to Neighborhood
return Neighborhood
Our idea for generating Neighbors(Pattern, d) is as follows. If we remove the first symbol of Pattern (denoted FirstSymbol(Pattern)) then we’ll obtain a (k − 1)-mer that we denote by Suffix(Pattern).
STOP and Think: If we know Neighbors(Suffix(Pattern), d), how does it help us construct Neighbors(Pattern, d)?
Now, consider a (k − 1)-mer belonging to Neighbors(Suffix(Pattern), d).
By the definition of the d-neighborhood Neighbors(Suffix(Pattern), d), we know that HammingDistance(Pattern, Suffix(Pattern)) is either equal to d or less than d.
In the first case, we can add FirstSymbol(Pattern) to the beginning of Pattern to obtain a k-mer belonging to Neighbors(Pattern, d).
In the second case, we can add any symbol to the beginning of Pattern and obtain a k-mer belonging to Neighbors(Pattern, d).
For example, to generate Neighbors(CAA, 1), first form Neighbors(AA, 1) = {AA, CA, GA, TA, AC, AG, AT}. The Hamming distance between AA and each of six of these neighbors is 1. Firstly, concatenating C with each of these patterns results in six patterns (CAA, CCA, CGA, CTA, CAC, CAG, CAT) that belong to Neighbors(CAA, 1). Secondly, concatenating any nucleotide with AA results in four patterns (AAA, CAA, GAA, and TAA) that belong to Neighbors(CAA, 1). Thus, Neighbors(CAA, 1) comprises ten patterns.
In the following pseudocode, we use the notation symbol • Text to denote the concatenation of a character symbol and a string Text, e.g.,
A • GCATG = AGCATG. |
---|
You can test this out using the widget given below:
Get hands-on with 1200+ tech skills courses.