Charging Station: Generating the Neighborhood of a String

Let’s look at how we can generate the neighborhood of a string by employing different algorithms.

The d-neighborhood

Our goal is to generate the d-neighborhood Neighbors(Pattern, d), the set of all k-mers whose Hamming distance from Pattern doesn’t exceed d. We’ll first generate the 1-neighborhood of Pattern using the following pseudocode:

  ImmediateNeighbors(Pattern)
    Neighborhood ← the set consisting of the single string Pattern 
    for i = 1 to |Pattern|
        symbol ← i-th nucleotide of Pattern
        for each nucleotide x different from symbol
            Neighbor ← Pattern with the i-th nucleotide substituted by x 
            add Neighbor to Neighborhood
    return Neighborhood

Our idea for generating Neighbors(Pattern, d) is as follows. If we remove the first symbol of Pattern (denoted FirstSymbol(Pattern)) then we’ll obtain a (k − 1)-mer that we denote by Suffix(Pattern).

STOP and Think: If we know Neighbors(Suffix(Pattern), d), how does it help us construct Neighbors(Pattern, d)?

Now, consider a (k − 1)-mer PatternPattern^{\prime} belonging to Neighbors(Suffix(Pattern), d).
By the definition of the d-neighborhood Neighbors(Suffix(Pattern), d), we know that HammingDistance(Pattern', Suffix(Pattern)) is either equal to d or less than d. In the first case, we can add FirstSymbol(Pattern) to the beginning of Pattern' to obtain a k-mer belonging to Neighbors(Pattern, d). In the second case, we can add any symbol to the beginning of Pattern' and obtain a k-mer belonging to Neighbors(Pattern, d).

For example, to generate Neighbors(CAA, 1), first form Neighbors(AA, 1) = {AA, CA, GA, TA, AC, AG, AT}. The Hamming distance between AA and each of six of these neighbors is 1. Firstly, concatenating C with each of these patterns results in six patterns (CAA, CCA, CGA, CTA, CAC, CAG, CAT) that belong to Neighbors(CAA, 1). Secondly, concatenating any nucleotide with AA results in four patterns (AAA, CAA, GAA, and TAA) that belong to Neighbors(CAA, 1). Thus, Neighbors(CAA, 1) comprises ten patterns.

In the following pseudocode, we use the notation symbolText to denote the concatenation of a character symbol and a string Text, e.g.,

A • GCATG = AGCATG.

You can test this out using the widget given below:

Get hands-on with 1200+ tech skills courses.