# Paper Reading: Embedding Words in Non-Vector Space with Unsupervised Graph Learning

venue: EMNLP 2020

This paper proposes GraphGlove, a graph-based method that trains word embeddings as nodes of a learned weighted graph rather than points in a vector space.

The graph method is PRODIGE, which learns a representation of data in the form of a weighted graph $G(V, E, w, p)$. Each edge $e_i$ has a weight $w(e_i)$ and a Bernoulli random variable $b_i \sim Bern(p(e_i))$ indicating whether the edge is present or not. The distance between two nodes is formulated as the expected shortest path distance:

$$d(v_i, v_j) = \mathbb{E}_{G \sim p}\; d_G(v_i, v_j) \tag{2}$$

where $d_G(v_i, v_j)$ is the weighted shortest path distance between $v_i$ and $v_j$ in a graph sampled by keeping each edge $e_i$ with probability $p(e_i)$.
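In practice the expectation over sampled graphs can be estimated by Monte Carlo: sample each edge's Bernoulli mask, run Dijkstra on the resulting subgraph, and average. A minimal sketch (the edge-list format and sample count are illustrative, not the paper's implementation):

```python
import heapq
import random

def dijkstra(adj, src, dst):
    """Weighted shortest path length from src to dst; adj maps node -> [(neighbor, weight)]."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # dst unreachable in this sample

def expected_shortest_path(edges, src, dst, n_samples=100, seed=0):
    """Monte Carlo estimate of E_{G~p}[d_G(src, dst)] for a stochastic graph.

    edges: list of (u, v, weight, prob); each edge is kept with its Bernoulli prob.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        adj = {}
        for u, v, w, p in edges:
            if rng.random() < p:  # sample the Bernoulli edge mask
                adj.setdefault(u, []).append((v, w))
                adj.setdefault(v, []).append((u, w))  # undirected graph
        total += dijkstra(adj, src, dst)
    return total / n_samples
```

With all edge probabilities at 1.0 this reduces to plain Dijkstra; the stochastic case simply averages path lengths over sampled subgraphs.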

The edge probabilities are used only during training; at test time, edges with probability below 0.5 are removed, so the graph is deterministic during inference.

Edge probabilities and weights are learned by minimizing:

$$\min_{w,\,p}\; \mathbb{E}_{G \sim p}\, L_{task}(w, G) \;+\; \lambda \sum_{e_i \in E} p(e_i) \tag{3}$$

The 1st term is a task-specific loss; in this paper it is a modified loss function of GloVe. The 2nd term is an $L_0$ regularizer on the expected number of edges: it encourages sparsity, and together with the task loss it pushes each learned edge probability toward 0 or 1.
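The regularizer is differentiable because the expected number of edges is just the sum of the Bernoulli keep-probabilities. A tiny sketch, assuming the probabilities are parameterized by logits through a sigmoid (a common choice, not stated in the excerpt):

```python
import math

def expected_edge_count(edge_logits):
    """Differentiable L0 surrogate: the expected number of edges equals the sum
    of Bernoulli keep-probabilities, here parameterized by logits via a sigmoid."""
    return sum(1.0 / (1.0 + math.exp(-l)) for l in edge_logits)
```

Scaling this term by $\lambda$ and adding it to the task loss penalizes every retained edge, driving unneeded probabilities toward 0.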

Let’s revisit the loss function of GloVe:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, and $f$ is a weighting function that downweights rare and very frequent pairs.

To integrate it into the 1st term of equation (3), the dot product of the 2 word embeddings is replaced with the graph distance (equation 2), negated since a distance measures dissimilarity, and only one bias term is used for each word:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left( -d(v_i, v_j) + b_i + b_j - \log X_{ij} \right)^2$$
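The substitution can be sketched directly: the only changes relative to standard GloVe are the negated distance in place of the dot product and a single bias per word (the weighting-function constants below are GloVe's usual defaults, an assumption here):

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij); x_max and alpha are standard defaults."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def graph_glove_loss(pairs, dist, bias):
    """Modified GloVe objective with the graph distance replacing the dot product.

    pairs: list of (i, j, X_ij) co-occurrence counts
    dist:  dist(i, j) -> graph distance between the two words' nodes
    bias:  per-word bias terms (one per word)
    """
    total = 0.0
    for i, j, x in pairs:
        err = -dist(i, j) + bias[i] + bias[j] - math.log(x)
        total += glove_weight(x) * err ** 2
    return total
```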

Two types of distance are experimented with:

1. the shortest path distance (equation 2) between the two words' nodes;
2. a dot-product-like similarity computed from such distances.

Note that, for the 2nd type, an extra “zero” node is added to the graph to serve as the origin.
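One natural way to recover a dot product from distances is the polarization identity, with the extra zero node playing the role of the origin; the exact formula is not in this excerpt, so the following is a hedged reconstruction:

```python
def pseudo_dot(dist, i, j, zero=0):
    """Dot-product-like similarity from graph distances, using the extra 'zero'
    node as the origin (polarization identity); a sketch, not necessarily the
    paper's exact formula."""
    return 0.5 * (dist(i, zero) ** 2 + dist(j, zero) ** 2 - dist(i, j) ** 2)
```

For Euclidean distances this recovers the exact dot product; on a graph it gives a distance-derived analogue.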

To construct an initial set of edges for training, each word is connected to its K nearest neighbors and M randomly sampled words. The edge weights are initialized as the cosine similarity between the words' GloVe embeddings, with edge probabilities set to 0.9. The biases are initialized from a normal distribution $N(0, 0.01)$. A graph-aware batching strategy is used, where a batch (of size $b \times n$) contains a small number of rows with potentially thousands of columns per row, so each batch requires only $O(b)$ runs of Dijkstra's algorithm.
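The initialization step can be sketched as follows; the function name, K/M defaults, and edge-dictionary format are illustrative assumptions, but the logic (K nearest neighbors by cosine similarity plus M random edges, probabilities at 0.9) follows the description above:

```python
import numpy as np

def init_edges(emb, k=8, m=2, init_prob=0.9, seed=0):
    """Build the initial edge set: connect each word to its K nearest neighbors
    (by cosine similarity of pretrained GloVe vectors) plus M random words.
    Returns {(i, j): (weight, prob)} with i < j."""
    rng = np.random.default_rng(seed)
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarities
    n = len(emb)
    edges = {}
    for i in range(n):
        # k most similar words, excluding the word itself
        neighbors = [j for j in np.argsort(-sim[i]) if j != i][:k]
        # plus m random words for long-range connectivity
        randoms = rng.choice([j for j in range(n) if j != i],
                             size=min(m, n - 1), replace=False)
        for j in list(neighbors) + list(randoms):
            key = (min(i, j), max(i, j))
            edges[key] = (float(sim[i, j]), init_prob)  # weight = cosine similarity
    return edges
```

The random edges act as shortcuts that keep the initial graph connected and shorten paths between distant neighborhoods.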

The experimental results on word similarity and word analogy tasks show that the proposed GraphGlove outperforms both Euclidean and Poincaré GloVe embeddings.