Charging Station: Find Frequent Words with Mismatching by Sorting

Learn an algorithm that finds frequent words with mismatching by sorting.

We'll cover the following

This lesson uses some notation from Charging Station: Finding Frequent Words by Sorting.

The following pseudocode reduces the Frequent Words with Mismatches Problem to sorting. It first generates all neighbors (with up to d mismatches) for all k-mers in Text and combines them all into an array NeighborhoodArray. Note that a k-mer Pattern appears Countd_{d}(Text, Pattern) times in this array. The only thing left is to sort this array, count how many times each k-mer appears in the sorted array, and then return the k-mers that occur the maximum number of times.

Algorithm

The algorithm for the Frequent Words with Mismatches Problem is provided below:

 FindingFrequentWordsWithMismatchesBySorting(Text, k, d)
    FrequentPatterns ← an empty set
    Neighborhoods ← an empty list
    for i ← 0 to |Text|−k
        add Neighbors(Text(i, k), d) to Neighborhoods
    form an array NeighborhoodArray holding all strings in Neighborhoods 
    for i ← 0 to |Neighborhoods| − 1
        Pattern ← NeighborhoodArray(i)
        Index(i) ← PatternToNumber(Pattern)
        Count(i) ← 1
    SortedIndex ← Sort(Index) 
    for i ← 0 to |Neighborhoods| − 2
        if SortedIndex(i) = SortedIndex(i + 1) 
            Count(i + 1) ← Count(i) + 1
    maxCount ← maximum value in array Count
    for i ← 0 to |Neighborhoods| − 1
        if Count(i) = maxCount
            Pattern ← NumberToPattern(SortedIndex(i), k)
            add Pattern to FrequentPatterns
    return FrequentPatterns

Get hands-on with 1200+ tech skills courses.