K-Nearest Neighbors
In this chapter, we will learn how to compute the K-nearest neighbors of a word based on cosine similarity.
Chapter Goals:
- Learn about K-nearest neighbors in terms of word similarity
- Create a function that computes the K-nearest neighbors for a given word
A. Similar words
When comparing cosine similarities for word embeddings, a common procedure is to find the K-nearest neighbors for a given word. This means that for a given word w and an integer K, we can find the K vocabulary words whose embedding vectors have the highest cosine similarity to the embedding vector for w.
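As a rough illustration of the idea, here is a small standalone sketch with a made-up vocabulary and hypothetical embedding vectors (the words, numbers, and dimensions below are invented for the example, not taken from a trained model):

```python
import tensorflow as tf

# Toy example: 4 "vocabulary words" with hypothetical 3-dimensional embeddings
words = ['computer', 'software', 'ocean', 'keyboard']
embedding_matrix = tf.constant([
    [0.9, 0.1, 0.0],   # computer
    [0.8, 0.2, 0.1],   # software
    [0.0, 0.9, 0.8],   # ocean
    [0.7, 0.0, 0.2]])  # keyboard

# Cosine similarity = dot product of L2-normalized vectors
normalized = tf.math.l2_normalize(embedding_matrix, axis=1)
word_vec = normalized[0:1]  # embedding for 'computer', shape (1, 3)
cos_sims = tf.linalg.matmul(word_vec, normalized, transpose_b=True)  # shape (1, 4)

# Take the K most similar words (k=3 here, then skip the word itself at index 0)
top_k = tf.math.top_k(tf.squeeze(cos_sims), k=3)
print([words[i] for i in top_k.indices.numpy()][1:])  # e.g. ['software', 'keyboard']
```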
Using K-nearest neighbors can help us evaluate our embedding model and make sure it was trained properly. For example, if we notice that the K-nearest neighbors for every word are always the same K words, then something may have gone wrong with our embedding training. Likewise, if the K-nearest neighbors are completely different from what we would expect to see (e.g. if the 3 nearest neighbors for the word "computer" are "waterfall", "ocean", and "soda"), then we may want to take a closer look at our model's embedding matrix.
When the embedding model is trained well, the K-nearest neighbors metric can provide some useful insights about the text corpus, particularly for specialized text corpora. For example, we normally expect the word "code" to be related to terms about computer programming or software. However, if our training corpus were related to military operations, the K-nearest neighbors for "code" might include words like "signal", "transmission", or "decode".
Time to Code!
In this chapter, you'll be completing the `k_nearest_neighbors` function, which computes the K-nearest neighbors for an input word using the TensorFlow utility function `tf.math.top_k`.
To find the K-nearest neighbors for `word`, we need to compute the cosine similarities between the embedding vector for `word` and the embedding vectors of every other vocabulary word.
Set `cos_sims` equal to `self.compute_cos_sims` applied with `word` and `training_texts` as the arguments.
The returned `cos_sims` has shape `(1, self.vocab_size)`. However, when calculating the K-nearest neighbors, the extra dimension of size 1 is unnecessary. We can remove it using `tf.squeeze`.
Set `squeezed_cos_sims` equal to `tf.squeeze` applied with `cos_sims` as the only argument.
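If it helps to see the shape change, here is a tiny standalone example of `tf.squeeze` (the similarity values below are arbitrary placeholders):

```python
import tensorflow as tf

cos_sims = tf.constant([[0.9, 0.2, 0.5]])  # shape (1, 3)
squeezed = tf.squeeze(cos_sims)            # shape (3,)
print(cos_sims.shape, squeezed.shape)      # (1, 3) (3,)
```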
Now we can retrieve the K-nearest neighbors for `word`. The specific number of neighbors we retrieve is given by the integer argument, `k`.
Set `top_k_output` equal to `tf.math.top_k` applied with `squeezed_cos_sims` and `k` as the two arguments. Then return `top_k_output`.
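For reference, `tf.math.top_k` used on its own returns both the largest values and their indices (the numbers below are toy values chosen just to show the output):

```python
import tensorflow as tf

sims = tf.constant([0.1, 0.9, 0.4, 0.7])
values, indices = tf.math.top_k(sims, k=2)
print(values.numpy())   # [0.9 0.7]
print(indices.numpy())  # [1 3]
```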
```python
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Forward run of the embedding model to retrieve embeddings
    def forward(self, target_ids):
        initial_bounds = 0.5 / self.embedding_dim
        initializer = tf.random.uniform(
            [self.vocab_size, self.embedding_dim],
            minval=-initial_bounds,
            maxval=initial_bounds)
        self.embedding_matrix = tf.compat.v1.get_variable(
            'embedding_matrix',
            initializer=initializer)
        embeddings = tf.compat.v1.nn.embedding_lookup(self.embedding_matrix, target_ids)
        return embeddings

    # Compute cosine similarities between the word's embedding
    # and all other embeddings for each vocabulary word
    def compute_cos_sims(self, word, training_texts):
        self.tokenizer.fit_on_texts(training_texts)
        word_id = self.tokenizer.word_index[word]
        word_embedding = self.forward([word_id])
        normalized_embedding = tf.math.l2_normalize(word_embedding)
        normalized_matrix = tf.math.l2_normalize(self.embedding_matrix, axis=1)
        cos_sims = tf.linalg.matmul(normalized_embedding, normalized_matrix,
                                    transpose_b=True)
        return cos_sims

    # Compute K-nearest neighbors for input word
    def k_nearest_neighbors(self, word, k, training_texts):
        # CODE HERE
        pass
```
Note that the output `top_k_output` is a tuple. The first element is the top K cosine similarities, while the second element is the word IDs corresponding to the K nearest neighbors.
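Putting the steps together, one possible completion of the function (it simply follows the instructions above and is not the only way to write it) could look like this:

```python
# Compute K-nearest neighbors for input word
def k_nearest_neighbors(self, word, k, training_texts):
    # Cosine similarities between word's embedding and all vocabulary embeddings,
    # shape (1, vocab_size)
    cos_sims = self.compute_cos_sims(word, training_texts)
    # Drop the extra dimension of size 1 -> shape (vocab_size,)
    squeezed_cos_sims = tf.squeeze(cos_sims)
    # Tuple of (top K cosine similarities, word IDs of the K nearest neighbors)
    top_k_output = tf.math.top_k(squeezed_cos_sims, k)
    return top_k_output
```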