# Paper Reading: Tensor Graph Convolutional Networks for Text Classification

The basic GCN notation is the same as in this post.

## 1. Graph Tensor Definition

We first give the formal definition of a graph tensor, which consists of a series of graphs.

$\mathcal{G} = (G_1,...,G_r)$ is a graph tensor, where $G_i = (V_i, E_i, A_i)$, if all graphs share the same node set ($V_i = V_j$ for $i,j=1,...,r$) but have different adjacency matrices ($A_i \neq A_j$ when $i \neq j$).

Here $G_i$ is the i-th graph in the graph tensor, $V_i$ ($|V_i| = n$) is its node set, $E_i$ is its edge set, and $A_i$ is its adjacency matrix. For convenience, all adjacency matrices are packed into a graph adjacency tensor $\mathcal{A} = (A_1,...,A_i,...,A_r) \in R^{r \times n \times n}$. Similarly, the graph feature matrices are packed into a graph feature tensor $\mathcal{H}^{(l)}=(H_1^{(l)},..,H_r^{(l)}) \in R^{r \times n \times d_l}$, where $H_i^{(l)} \in R^{n \times d_l}$ is the feature matrix of the i-th graph.
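As a concrete illustration, the packing step can be sketched with NumPy (the sizes here are made up for the example):

```python
import numpy as np

# Illustrative sizes: r = 3 graphs sharing n = 4 nodes, feature dim d_l = 2.
r, n, d_l = 3, 4, 2

# Each graph shares the node set but has its own adjacency matrix.
adj_matrices = [np.random.rand(n, n) for _ in range(r)]
feat_matrices = [np.random.rand(n, d_l) for _ in range(r)]

# Pack into the graph adjacency tensor A in R^{r x n x n}
# and the graph feature tensor H in R^{r x n x d_l}.
A = np.stack(adj_matrices)   # shape (r, n, n)
H = np.stack(feat_matrices)  # shape (r, n, d_l)
```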

## 2. Text Graph Tensor Construction

To construct a graph for a set of documents, words and documents are treated as nodes. Two types of edges are considered: word-document edges and word-word edges. For word-document edges, the TF-IDF method is used to calculate the edge weights. Three types of word-word edges (semantic/syntactic/sequential) are used to construct three text graphs, described in the following:

• Semantic-based graph

An LSTM is trained on the training data. For every sentence/document, word embeddings from the LSTM are used to calculate the cosine similarity between words. If the similarity exceeds a predefined threshold, the two words are said to have a semantic relationship in the current sentence/document. The edge weight between two words is defined as:

$d_{semantic}(w_i,w_j) = \frac{\#N_{semantic}(w_i,w_j)}{\#N_{total}(w_i,w_j)}$

where $\#N_{semantic}(w_i,w_j)$ denotes the number of times the two words have a semantic relationship over all sentences/documents, and $\#N_{total}(w_i,w_j)$ denotes the number of times the two words appear in the same sentence/document.
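A minimal sketch of this counting scheme (function names and the static word vectors are illustrative; the paper uses contextual embeddings from the trained LSTM). The same ratio-of-counts scheme also applies to the syntactic graph below, with the similarity test replaced by a dependency test:

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_weights(sentences, embeddings, threshold=0.9):
    """Edge weight = #N_semantic / #N_total for each word pair."""
    n_total = {}     # times a pair appears in the same sentence
    n_semantic = {}  # times that pair is also semantically related
    for sent in sentences:
        for pair in combinations(sorted(set(sent)), 2):
            n_total[pair] = n_total.get(pair, 0) + 1
            if cosine(embeddings[pair[0]], embeddings[pair[1]]) > threshold:
                n_semantic[pair] = n_semantic.get(pair, 0) + 1
    return {p: n_semantic.get(p, 0) / c for p, c in n_total.items()}
```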

• Syntactic-based graph

For every sentence/document, Stanford CoreNLP is used to extract the dependencies between words. The edge weight between two words in the syntactic-based graph is defined as:

$d_{syntactic}(w_i,w_j) = \frac{\#N_{syntactic}(w_i,w_j)}{\#N_{total}(w_i,w_j)}$

where $\#N_{syntactic}(w_i,w_j)$ denotes the number of times the two words have a syntactic dependency over all sentences/documents.

• Sequential-based graph

This graph is also used in Graph Convolutional Networks for Text Classification; it depicts the local co-occurrence (PMI) between words. The edge weight between two words in the sequential-based graph is defined as:

$d_{sequential}(w_i,w_j) = \log \frac{p(w_i,w_j)}{p(w_i)p(w_j)}$

where $p(w_i,w_j)$ is the probability of the two words co-occurring in the same sliding window, estimated by $\frac{\#N_{co\text{-}occurrence}(w_i,w_j)}{\#N_{windows}}$. $\#N_{co\text{-}occurrence}(w_i,w_j)$ is the number of times the two words co-occur in the same sliding window over the corpus, and $\#N_{windows}$ is the total number of sliding windows. $p(w_i) = \frac{\#N_{occurrence}(w_i)}{\#N_{windows}}$ is the probability of word $w_i$ occurring in a given window over the corpus.
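A small sketch of the sliding-window PMI estimate (the window size and function name are illustrative; following the original Text GCN, only positive PMI values are kept as edges):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_weights(tokens, window=2):
    """Estimate PMI edge weights from sliding windows over a token stream."""
    windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    n_windows = len(windows)
    word_count, pair_count = Counter(), Counter()
    for w in windows:
        uniq = sorted(set(w))
        word_count.update(uniq)
        pair_count.update(combinations(uniq, 2))
    weights = {}
    for (wi, wj), c in pair_count.items():
        # PMI = log( p(wi, wj) / (p(wi) p(wj)) ), all estimated over windows.
        pmi = math.log((c / n_windows) /
                       ((word_count[wi] / n_windows) * (word_count[wj] / n_windows)))
        if pmi > 0:  # keep only positively associated pairs as edges
            weights[(wi, wj)] = pmi
    return weights
```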

## 3. Graph Tensor Learning

After constructing the graph tensor, we can exploit learning frameworks to perform GCN on it.

• merge edges + GCN

Since a graph tensor consists of multiple graphs, a simple strategy is to merge the edges into one graph by pooling the adjacency tensor, $pooling(\mathcal{A}) = pooling(A_1,...,A_r)$, e.g. max pooling or mean pooling. However, experiments show this does not work well. Another way is to merge the graph adjacency matrices by edge-wise attention: $A_{merge} = pooling(\mathcal{A}) = \sum_{i=1}^r W_{edge}^i \odot A_i$, where $W_{edge}^i$ is an attention matrix with the same size as the adjacency matrix and $\odot$ denotes the element-wise (Hadamard) product.
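The edge-wise attention merge is just an elementwise weighted sum over the graph dimension; a sketch (the attention weights would be learned, here they are random placeholders):

```python
import numpy as np

def merge_edges(A, W_edge):
    """A_merge = sum_i W_edge^i (Hadamard) A_i, for tensors of shape (r, n, n)."""
    return np.sum(W_edge * A, axis=0)

r, n = 3, 4
A = np.random.rand(r, n, n)       # graph adjacency tensor
W_edge = np.random.rand(r, n, n)  # learned edge-wise attention; random here
A_merge = merge_edges(A, W_edge)  # shape (n, n)
```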

• TensorGCN

Instead of merging all graphs, TensorGCN performs two types of propagation: intra-graph propagation and inter-graph propagation. The overall procedure is shown in Figure 1. The l-th layer of TensorGCN can be described as:

$\mathcal{H}^{(l)} \rightarrow \mathcal{H}_{intra}^{(l)} \rightarrow \mathcal{H}^{(l+1)}$

Intra-graph propagation aggregates information from the neighbors of each node within a graph, which is essentially the vanilla GCN applied slice-wise as a tensor version. The feature matrix of the i-th graph in the l-th layer is updated as:

$\mathcal{H}_{intra}^{(l)}(i,:,:) = \sigma (\hat{\mathcal{A}}(i,:,:)\mathcal{H}^{(l)}(i,:,:)W_{intra}^{(l,i)})$

where $\hat{\mathcal{A}}$ is the symmetrically normalized graph adjacency tensor with $\hat{\mathcal{A}}(i,:,:) = \tilde{\mathcal{D}}(i,:,:)^{-\frac{1}{2}}\tilde{\mathcal{A}}(i,:,:)\tilde{\mathcal{D}}(i,:,:)^{-\frac{1}{2}}$, $\tilde{\mathcal{A}} = \mathcal{A} + \mathcal{I}$, and $\tilde{\mathcal{D}}$ is the node degree tensor of $\tilde{\mathcal{A}}$.
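A slice-wise NumPy sketch of intra-graph propagation (ReLU stands in for $\sigma$; the weight matrices are random placeholders rather than learned parameters):

```python
import numpy as np

def normalize(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def intra_propagation(A, H, W_intra):
    """A: (r, n, n), H: (r, n, d_l), W_intra[i]: (d_l, d_next).
    Runs a vanilla GCN layer independently on each graph slice."""
    return np.stack([
        np.maximum(normalize(A[i]) @ H[i] @ W_intra[i], 0.0)  # ReLU as sigma
        for i in range(A.shape[0])
    ])
```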

Inter-graph propagation propagates/exchanges information between the different graphs in the tensor. To achieve this, virtual graphs are constructed by connecting corresponding nodes across the graphs, as shown in Figure 1. Since every graph has the same set of nodes, the n nodes yield n virtual graphs and a new graph adjacency tensor $\mathcal{A}^+ \in R^{r \times r \times n}$. Inter-graph information propagation is defined as:

$\mathcal{H}^{(l+1)}(:,j,:) = \sigma (\hat{\mathcal{A}}^+(:,:,j)\mathcal{H}_{intra}^{(l)}(:,j,:)W_{inter}^{(l,j)})$

where $\mathcal{H}^{(l+1)} \in R^{r \times n \times d_{l+1}}$ is the output of inter-graph propagation and the input feature tensor of the (l+1)-th layer of TensorGCN. Neither symmetric normalization nor self-connections are applied here.
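Inter-graph propagation can be sketched the same way, with one small GCN layer per node's virtual graph (again using ReLU and random placeholder weights):

```python
import numpy as np

def inter_propagation(A_plus, H_intra, W_inter):
    """A_plus: (r, r, n) virtual-graph adjacency (one r x r slice per node j),
    H_intra: (r, n, d), W_inter[j]: (d, d_next).
    Exchanges information between the r graphs at every node."""
    r, n, d = H_intra.shape
    return np.stack([
        np.maximum(A_plus[:, :, j] @ H_intra[:, j, :] @ W_inter[j], 0.0)
        for j in range(n)
    ], axis=1)  # restack the node slices into shape (r, n, d_next)
```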

In the last layer of TensorGCN, after the inter-graph propagation is complete, the representations of document nodes are obtained by mean pooling over the graphs for classification.
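The final readout is then just a mean over the graph dimension; a one-line sketch:

```python
import numpy as np

def readout(H_last):
    """Mean-pool the last-layer feature tensor (r, n, d) over the r graphs,
    giving one d-dimensional representation per node; the rows belonging to
    document nodes are then fed to the classifier."""
    return H_last.mean(axis=0)
```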