Cosine Similarity
Implement normalized cosine similarity to evaluate the embedding model.
Chapter Goals:
- Learn about cosine similarity and how it's used to compare embedding vectors
- Create a function that computes cosine similarities for a given word
A. Vector comparison
In mathematics, the standard way to compare vector similarity is with cosine similarity. Since word embeddings are just vectors of real numbers, we can also use cosine similarity to compare the embeddings for different words.
For two vectors, u and v, the equation for cosine similarity is

$$\text{cos\_sim}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where $\|u\|_2$ represents the L2-norm of vector $u$, and $u \cdot v$ represents the dot product operation.
We refer to the quantity $\frac{u}{\|u\|_2}$ as the L2-normalization of vector $u$.
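To make the formula concrete, here is a minimal sketch in plain NumPy (the vectors u and v below are arbitrary values chosen only for illustration). It also shows that dividing the dot product by the two norms gives the same result as taking the dot product of the two L2-normalized vectors:

import numpy as np

# Two illustrative vectors (arbitrary values)
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 3.0, 4.0])

# Cosine similarity computed directly from the definition
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Equivalently, the dot product of the two L2-normalized vectors
cos_sim_alt = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

print(round(cos_sim, 3), round(cos_sim_alt, 3))  # 0.99 0.99

This equivalence is why, later in this chapter, we L2-normalize the embeddings before taking dot products.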
B. Correlation
The cosine similarity measures the correlation between two vectors, i.e. how closely related the two vectors are. The range of values for cosine similarity is [-1, 1]. A value of 1 means the vectors point in exactly the same direction, a value of -1 means the vectors point in exactly opposite directions, and a value of 0 means the vectors are orthogonal (i.e. completely uncorrelated).
Note that cosine similarity values fall on a spectrum, so we can measure correlation by how close the cosine similarity is to 1, 0, or -1. For example, we would expect the word embeddings for "orange" and "juice" to have a cosine similarity close to 1, since they are often used in the same context in conjunction with one another. On the other hand, we would expect "good" and "bad" to have a negative cosine similarity, since they are antonyms. And in most text corpora, "chocolate" and "fence" would have a cosine similarity near 0, since they tend to be unrelated.
For example, imagine two vectors u and v. We compute the L2 norm of each vector, then divide their dot product by the product of those norms. If the resulting number is very close to one, it means that u and v are very similar vectors.
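As a concrete illustration (the vectors here are arbitrary values chosen only for the arithmetic), take $u = (1, 2)$ and $v = (2, 3)$:

$$\|u\|_2 = \sqrt{1^2 + 2^2} = \sqrt{5} \approx 2.236, \qquad \|v\|_2 = \sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.606$$

$$u \cdot v = 1 \cdot 2 + 2 \cdot 3 = 8, \qquad \text{cos\_sim}(u, v) = \frac{8}{2.236 \times 3.606} \approx 0.99$$

This value is very close to one, so u and v point in nearly the same direction.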
Time to Code!
In this chapter, you'll be completing the compute_cos_sims function, which computes cosine similarities between vocabulary words. Specifically, you'll fill in the function where it currently has the placeholder "CODE HERE".
So that the dot products we compute correspond to cosine similarities in the range [-1, 1], we first need to L2-normalize both the retrieved embedding vector and the embedding matrix.
Set normalized_embedding equal to tf.math.l2_normalize applied with word_embedding as the only argument.
Set normalized_matrix equal to tf.math.l2_normalize applied with self.embedding_matrix as the required argument and axis=1 as the keyword argument.
We can now calculate the embedding vector cosine similarities by matrix multiplying normalized_embedding and normalized_matrix. The matrix multiplication returns a vector with shape (1, vocab_size), where each index contains the cosine similarity between the embedding vector for word and the embedding vector for the vocabulary word whose ID matches that index.
Set cos_sims equal to tf.linalg.matmul applied with normalized_embedding and normalized_matrix as the required arguments, and transpose_b=True as the keyword argument. Then return cos_sims.
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Forward run of the embedding model to retrieve embeddings
    def forward(self, target_ids):
        initial_bounds = 0.5 / self.embedding_dim
        initializer = tf.random.uniform(
            [self.vocab_size, self.embedding_dim],
            minval=-initial_bounds,
            maxval=initial_bounds)
        self.embedding_matrix = tf.compat.v1.get_variable(
            'embedding_matrix',
            initializer=initializer)
        embeddings = tf.compat.v1.nn.embedding_lookup(self.embedding_matrix, target_ids)
        return embeddings

    # Compute cosine similarities between the word's embedding
    # and all other embeddings for each vocabulary word
    def compute_cos_sims(self, word, training_texts):
        self.tokenizer.fit_on_texts(training_texts)
        word_id = self.tokenizer.word_index[word]
        word_embedding = self.forward([word_id])
        # CODE HERE
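For reference, here is a sketch of one possible completion of compute_cos_sims, following the steps described above (it assumes the class and import shown in the code block and is not necessarily the course's official solution):

    # Compute cosine similarities between the word's embedding
    # and all other embeddings for each vocabulary word
    def compute_cos_sims(self, word, training_texts):
        self.tokenizer.fit_on_texts(training_texts)
        word_id = self.tokenizer.word_index[word]
        word_embedding = self.forward([word_id])
        # L2-normalize the word's embedding vector, shape (1, embedding_dim)
        normalized_embedding = tf.math.l2_normalize(word_embedding)
        # L2-normalize each row of the embedding matrix, shape (vocab_size, embedding_dim)
        normalized_matrix = tf.math.l2_normalize(self.embedding_matrix, axis=1)
        # Dot products of normalized vectors give cosine similarities, shape (1, vocab_size)
        cos_sims = tf.linalg.matmul(normalized_embedding, normalized_matrix, transpose_b=True)
        return cos_sims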