This paper introduce Relational Graph Convolutional Networks (R-GCNs) which deal with relational data. In addition, techniques for parameter sharing and to enforce sparsity constraints are introduced to apply R-GCNs to multigraphs with large numbers of relations.
1. Relational Graph Convolutional Networks
Given directed and labeled multi-graphs as with nodes and labeled edges , where is a relation type.
where is the hidden state of node in the l-th layer with being the dimensionality of this layer’s representations. denotes the neighbors of under relation . is a normalization constant which can be learned or chosen in advance(e.g. ). is an element-wise activation function(e.g. ReLU).
From equation (1) we can see that the number of parameters grows with the number of relations in the graph, which can lead to overfitting on are relations and models of very large size. To overcome this issue, two regularization methods are proposed:
- basis decomposition: each is defined as . It can be seen as a form of weight sharing between different relation types.
- block-diagonal decomposition: each is defined as through the direct sum over a set of low-dimensional matrices: . Thereby, are block-diagonal matrices with . It can be seen as a sparsity constraint on the weight matrices for each relation type.
- Entity classification (node classification)
Simply stacking R-GCN layers in equation (1), with a softmax activation(per node) on the output of the last layer, then minimizing the following cross-entropy loss on all labeled nodes(ignoring unlabeled nodes):
where is the set of node indices that have labels and is the k-th entry of the network output for the i-th labeled node. denotes ground truth label. In practice, full-batch gradient descent is used to train the model.
- Link prediction
To solve this problem, a graph auto-encoder model is introduced. The encoder maps an entity to a real-valued vector and the decoder reconstruct edges by a scoring function. In the paper, R-GCN is used as encoder and DistMult factorization is used as scoring function(other scoring functions can also be used).
As in previous work(e.g. DistMult), negative sampling is used to train the model with following cross-entropy loss which pushes the model to score observable triples higher than negative ones:
where is the given incomplete subset edges, is the total set of real and corrupted triples, is the logistic sigmoid function, is an indicator set to for positive triples and for negative ones.