Paper Reading: Modeling Relational Data with Graph Convolutional Networks

This paper introduce Relational Graph Convolutional Networks (R-GCNs) which deal with relational data. In addition, techniques for parameter sharing and to enforce sparsity constraints are introduced to apply R-GCNs to multigraphs with large numbers of relations.

1. Relational Graph Convolutional Networks

Given directed and labeled multi-graphs as G=(V,\varepsilon,R) with nodes v_i \in V and labeled edges (v_i,r,v_j)\in \varepsilon, where r\in R is a relation type.

In a relational multi-graph, R-GCNs calculates the forward-pass update of a node v_i as:

h_i^{l+1}=\sigma (\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)}h_j^{(l)} + W_0^{(l)}h_0^{(l)}) (1)

where h_i^{l} \in \mathbb{R} ^{d^{(l)}} is the hidden state of node v_i in the l-th layer with d^{(l)} being the dimensionality of this layer’s representations. N_i^r denotes the neighbors of v_i under relation r \in R. c_{i,r} is a normalization constant which can be learned or chosen in advance(e.g. c_{i,r}=|N_i^r| ). \sigma is an element-wise activation function(e.g. ReLU).

From equation (1) we can see that the number of parameters grows with the number of relations in the graph, which can lead to overfitting on are relations and models of very large size. To overcome this issue, two regularization methods are proposed:

  • basis decomposition: each W_r^{l} is defined as W_r^{l}= \sum_{b=1}^{B} a_{rb}^{(l)}V_b^{(l)}. It can be seen as a form of weight sharing between different relation types.
  • block-diagonal decomposition: each W_r^{l} is defined as through the direct sum over a set of low-dimensional matrices: W_r^{l} = \oplus_{b=1}^B Q_{br}^{l}. Thereby, W_r^{l} are block-diagonal matrices W_r^{l}=diag(Q_{1r}^{l},...,Q_{1B}^{l}) with Q_{br}^{l} \in \mathbb{R} ^{(d^{l+1}/B) \times  (d^{l}/B)}. It can be seen as a sparsity constraint on the weight matrices for each relation type.

2. Applications

  • Entity classification (node classification)

Simply stacking R-GCN layers in equation (1), with a softmax activation(per node) on the output of the last layer, then minimizing the following cross-entropy loss on all labeled nodes(ignoring unlabeled nodes):

\mathcal{L}  = - \sum_{i \in \mathcal{Y}} \sum_{k=1}^K t_{ik}ln h_{ik}^{(L)}

where \mathcal{Y} is the set of node indices that have labels and h_{ik}^{(L)} is the k-th entry of the network output for the i-th labeled node. t_{ik} denotes ground truth label. In practice, full-batch gradient descent is used to train the model.

  • Link prediction

To solve this problem, a graph auto-encoder model is introduced. The encoder maps an entity to a real-valued vector and the decoder reconstruct edges by a scoring function. In the paper, R-GCN is used as encoder and DistMult factorization is used as scoring function(other scoring functions can also be used).

As in previous work(e.g. DistMult), negative sampling is used to train the model with following cross-entropy loss which pushes the model to score observable triples higher than negative ones:

\mathcal{L}=- \frac{1}{(1+w|\hat{\varepsilon}|)} \sum_{(s,r,o,y) \in \mathcal{T}} y log l(f(s,r,o)) + (1-y)log(1-l(f(s,r,o)))

where \hat{\varepsilon} is the given incomplete subset edges, \mathcal{T} is the total set of real and corrupted triples, l is the logistic sigmoid function, y is an indicator set to y=1 for positive triples and y=0 for negative ones.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s