# Paper Reading: Modeling Relational Data with Graph Convolutional Networks

This paper introduce Relational Graph Convolutional Networks (R-GCNs) which deal with relational data. In addition, techniques for parameter sharing and to enforce sparsity constraints are introduced to apply R-GCNs to multigraphs with large numbers of relations.

## 1. Relational Graph Convolutional Networks

Given directed and labeled multi-graphs as $G=(V,\varepsilon,R)$ with nodes $v_i \in V$ and labeled edges $(v_i,r,v_j)\in \varepsilon$, where $r\in R$ is a relation type.

In a relational multi-graph, R-GCNs calculates the forward-pass update of a node $v_i$ as: $h_i^{l+1}=\sigma (\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)}h_j^{(l)} + W_0^{(l)}h_0^{(l)})$ (1)

where $h_i^{l} \in \mathbb{R} ^{d^{(l)}}$ is the hidden state of node $v_i$ in the l-th layer with $d^{(l)}$ being the dimensionality of this layer’s representations. $N_i^r$ denotes the neighbors of $v_i$ under relation $r \in R$. $c_{i,r}$ is a normalization constant which can be learned or chosen in advance(e.g. $c_{i,r}=|N_i^r|$). $\sigma$ is an element-wise activation function(e.g. ReLU).

From equation (1) we can see that the number of parameters grows with the number of relations in the graph, which can lead to overfitting on are relations and models of very large size. To overcome this issue, two regularization methods are proposed:

• basis decomposition: each $W_r^{l}$ is defined as $W_r^{l}= \sum_{b=1}^{B} a_{rb}^{(l)}V_b^{(l)}$. It can be seen as a form of weight sharing between different relation types.
• block-diagonal decomposition: each $W_r^{l}$ is defined as through the direct sum over a set of low-dimensional matrices: $W_r^{l} = \oplus_{b=1}^B Q_{br}^{l}$. Thereby, $W_r^{l}$ are block-diagonal matrices $W_r^{l}=diag(Q_{1r}^{l},...,Q_{1B}^{l})$ with $Q_{br}^{l} \in \mathbb{R} ^{(d^{l+1}/B) \times (d^{l}/B)}$. It can be seen as a sparsity constraint on the weight matrices for each relation type.

## 2. Applications

• Entity classification (node classification)

Simply stacking R-GCN layers in equation (1), with a softmax activation(per node) on the output of the last layer, then minimizing the following cross-entropy loss on all labeled nodes(ignoring unlabeled nodes): $\mathcal{L} = - \sum_{i \in \mathcal{Y}} \sum_{k=1}^K t_{ik}ln h_{ik}^{(L)}$

where $\mathcal{Y}$ is the set of node indices that have labels and $h_{ik}^{(L)}$ is the k-th entry of the network output for the i-th labeled node. $t_{ik}$ denotes ground truth label. In practice, full-batch gradient descent is used to train the model. $\mathcal{L}=- \frac{1}{(1+w|\hat{\varepsilon}|)} \sum_{(s,r,o,y) \in \mathcal{T}} y log l(f(s,r,o)) + (1-y)log(1-l(f(s,r,o)))$
where $\hat{\varepsilon}$ is the given incomplete subset edges, $\mathcal{T}$ is the total set of real and corrupted triples, $l$ is the logistic sigmoid function, $y$ is an indicator set to $y=1$ for positive triples and $y=0$ for negative ones.