Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?

The paper studies whether the attention heads of BERT-like models can recover syntactic dependency relations.

Method 1: Maximum Attention Weights (MAX)

For a given token A, the token B that receives the highest attention weight from A is taken to be related to A. Formally, a relation is assigned to the pair (w_i, w_j) such that j = \arg\max_j W[i, j] for each row i, where W \in (0,1)^{T \times T} is the attention weight matrix of one head at one layer. Dependency relations are extracted in this way from all heads at all layers, and the maximum undirected unlabeled attachment score (UUAS) is taken over all relation types.
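The MAX extraction step can be sketched in plain Python. This is an illustrative assumption of the mechanics, not the paper's code; the self-attention exclusion and the `uuas` helper are conventions chosen for this sketch.

```python
def max_method_edges(W):
    """MAX method sketch: for each row i of a T x T attention matrix W,
    predict an undirected edge (i, argmax_j W[i][j])."""
    edges = set()
    for i, row in enumerate(W):
        # exclude attention to the token itself when picking its partner
        j = max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        edges.add(frozenset((i, j)))  # undirected edge
    return edges

def uuas(predicted, gold):
    """Undirected unlabeled attachment score: fraction of gold edges recovered."""
    return len(predicted & gold) / len(gold)
```

For a toy 3-token matrix, `max_method_edges([[0.1, 0.7, 0.2], [0.6, 0.1, 0.3], [0.2, 0.7, 0.1]])` yields the undirected edges {0, 1} and {1, 2}.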


Method 2: Maximum Spanning Tree (MST)

In the first method, edge directions are not considered, since the resulting graphs are not valid trees. To extract complete, valid dependency trees from an attention weight matrix, a maximum-spanning-tree method is used: the attention matrix is treated as a complete weighted directed graph, with edges pointing from the output token to each attended token. The root of the gold dependency tree is used as the starting node, and the Chu-Liu-Edmonds algorithm computes the maximum spanning tree. Evaluation follows "A structural probe for finding syntax in word representations".
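The actual method runs Chu-Liu-Edmonds; as a hedged stand-in, the brute-force search below finds the same maximum spanning arborescence for toy-sized sentences. The edge-direction convention `W[parent][child]` is an assumption of this sketch.

```python
from itertools import product

def max_spanning_tree(W, root):
    """Maximum spanning arborescence over the attention graph, by brute force
    (exponential in sentence length; only for illustration)."""
    T = len(W)
    nodes = [i for i in range(T) if i != root]
    best_score, best_tree = float("-inf"), None
    # every non-root token independently picks a parent token
    for parents in product(range(T), repeat=len(nodes)):
        assign = dict(zip(nodes, parents))
        if any(p == c for c, p in assign.items()):
            continue  # skip self-loops
        if not reaches_root(assign, root):
            continue  # skip graphs containing cycles (not valid trees)
        score = sum(W[p][c] for c, p in assign.items())
        if score > best_score:
            best_score, best_tree = score, assign
    return best_tree

def reaches_root(assign, root):
    """Check that following parent links from every token reaches the root."""
    for c in assign:
        seen, cur = set(), c
        while cur != root:
            if cur in seen:
                return False
            seen.add(cur)
            cur = assign[cur]
    return True
```

For example, with `W = [[0.0, 0.9, 0.1], [0.2, 0.0, 0.8], [0.5, 0.3, 0.0]]` and root 0, the best tree attaches token 1 to the root and token 2 to token 1 (total weight 1.7), while the cyclic assignment 1↔2 is rejected.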

Experimental Setup

Special tokens such as [CLS], [SEP], <s>, and </s> are excluded in order to focus on inter-word attention. The English Parallel Universal Dependencies (PUD) treebank from the CoNLL 2017 shared task is used. The tokenization of the parsed corpus may not match the model's tokenization; in such cases, non-matching tokens and their corresponding attention weights are merged until the two tokenizations are compatible.
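One way to merge subword attention back to word level is sketched below. The sum-received / average-sent convention (which keeps each row summing to 1) is an assumption of this sketch, not necessarily the paper's exact merging rule.

```python
def merge_subword_attention(W, groups):
    """Collapse a subword-level attention matrix W to word level.

    groups: list of lists; groups[k] holds the subword indices of word k.
    Convention (assumed here): sum the attention a word *receives* over
    its subwords, average the attention it *sends*, so rows stay normalized.
    """
    n = len(groups)
    merged = [[0.0] * n for _ in range(n)]
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            total = sum(W[i][j] for i in ga for j in gb)
            merged[a][b] = total / len(ga)  # average over source subwords
    return merged
```

If subwords 1 and 2 form one word, `merge_subword_attention(W, [[0], [1, 2]])` returns a 2x2 matrix whose rows still sum to 1.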


Since many dependency relations tend to occur at specific positions relative to the parent word, the most common positional offset between a parent and child word for a given dependency relation is used as a baseline. For MST, a right-branching dependency tree is used as a baseline. A BERT-large model with randomly initialized weights serves as an additional baseline.
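The positional baseline can be sketched as follows; the (parent, child) index pairs in the example are made up for illustration.

```python
from collections import Counter

def positional_baseline(gold_pairs):
    """Accuracy of always predicting the most common parent-to-child offset.

    gold_pairs: list of (parent_index, child_index) pairs for one
    dependency relation type.
    """
    offsets = Counter(child - parent for parent, child in gold_pairs)
    best_offset = offsets.most_common(1)[0][0]
    correct = sum(1 for p, c in gold_pairs if c - p == best_offset)
    return best_offset, correct / len(gold_pairs)
```

For instance, if three of four gold children sit one position to the left of their parent, the baseline predicts offset -1 and scores 75% on that relation.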



Figure 2 and Table 1 report the accuracy for the most frequent relation types under the MAX method. BERT models fine-tuned on CoLA and MNLI are also tested. MNLI-BERT tends to track clausal dependencies and outperforms the other models on the long-distance advcl and csubj dependency types. The non-random models substantially outperform random BERT on all dependency types.



Figure 3 shows the accuracy for the nsubj, obj, advmod, and amod relations extracted with the MST method.


Figure 4 shows the maximum undirected unlabeled attachment score (UUAS) across layers. Although the trained models perform better than the right-branching baseline in most cases, the performance gap is not substantial. Given that the MST method uses the root of the gold trees, whereas the right-branching baseline does not, this suggests that the attention weights in the different layers/heads of the BERT models do not correspond to complete, syntactically informative parse trees.

