Why does Word2Vec generate two distinct embedding vectors, "input" and "output", per word?

37 viewed last edited 2 years ago
Mahesh Godavarti

This question has always bothered me.

If the intuition shared in the literature - Word2Vec finds vectors in the embedding space such that vectors of similar words are close to each other whereas those of dissimilar words are spread apart - is correct, then why does the Word2Vec training algorithm generate two distinct embedding vectors for each word - namely, input and output vectors? Indeed, this question has occurred to other researchers and engineers considering the questions asked on this very topic on Quora and StackExchange.




People have opined that input embeddings and output embeddings capture different aspects of word semantics. In some cases, "output" embeddings are simply been discarded, in some cases, they are combined with "input" embeddings to possibly create more powerful embeddings, but no one has clearly explained why we need two separate embeddings. Only that we get better performance out of the Word2Vec algorithm when the 'input" and "output" embeddings are distinct rather than when they are tied together.

In this blog, I will explain why Word2Vec generates two distinct embeddings, and why this makes Word2Vec such a simple, but powerful word embedding algorithm.

We start with recognizing that Word2Vec is, essentially, a neural-network language modeling algorithm (as all current word embedding algorithms are). The input to these algorithms is raw everyday text (e.g. newspaper articles, books, and Wikipedia). Word2Vec solves the problem of predicting the most likely word that follows in a sentence given the context (just like in a language model) and it maximizes the likelihood function composed of words and the neighboring ones forming the context.

A neural network at its root is a mapping function from an input vector x, to an output vector y,  y = f(x). In a neural network-based language model we want to essentially maximize a likelihood function L(x_N, f(x)) where x_N denotes the set of neighboring words in the context of the target word. The training algorithm learns the function f, that maximizes this likelihood.  

So, here is the trick that Word2Vec employs. First, note that the number of words in the training data is a finite number. I.e. x (input vector) and y (output vector) are essentially discrete elements in an n-dimensional space. Therefore, instead of learning the full continuous valued mapping f: R^n -> R^n, we only have to learn what f is over a discrete set of points in the n-dimensional space. This is essentially what Word2Vec does. In lieu of learning (using Deep Neural Networks) the function f: R^n -> R^n  that happens to evaluate to y_i = f(x_i) for i = 1, .., M, it simplifies the problem to learning M ordered pairs (x_i, y_i) where M is the vocabulary size. And this is why input and output vector embeddings are necessary for obtaining good performance from Word2Vec.

But what about the original question? Do we really need to input and vector embeddings given the intuition stated at the start of this article? The answer is NO!

I demonstrated that in the Toy Program that directly implements the intuition (https://www.qalaxia.com/blog/A-Toy-Program-for-Unsupervised-Generation-of-Word-Embeddings).

The direct link to the Toy program is: https://bitbucket.org/acarrot/toyword2vec/src/master

So, you see that the debate is far from settled and I welcome your thoughts!