Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention

Code for this paper: link

High-level Summary

This paper studies the attention maps of the pre-trained BERT-base model. More specifically, it:

  • explores how BERT’s attention heads behave in general.
    • e.g., attending to fixed positional offsets.
    • e.g., attending broadly over the whole sentence.
    • a large amount of attention goes to [SEP].
    • attention heads in the same layer behave similarly.
  • probes each attention head for linguistic phenomena.
    • given a word as input, output the most-attended-to other word. Does the word pair correspond to any syntactic relation?
      • particular heads correspond remarkably well to particular relations (direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns, with >75% accuracy).
    • the same method can be applied to coreference resolution, where it also performs well.
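The per-head probing idea above can be sketched with a toy example. The attention matrix and the `most_attended` helper below are hypothetical illustrations, not the paper's code:

```python
import numpy as np

# Hypothetical attention map for one head: attn[i, j] is the attention
# token i pays to token j, so each row sums to 1.
tokens = ["the", "dog", "chased", "the", "ball"]
attn = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.05, 0.05, 0.10, 0.10, 0.70],
    [0.10, 0.10, 0.60, 0.10, 0.10],
])

def most_attended(attn, i):
    """Index of the word that token i attends to most."""
    return int(np.argmax(attn[i]))

# "ball" attends most to "chased": a head behaving like the
# direct-object relation would be scored as correct on this pair.
print(tokens[most_attended(attn, 4)])  # chased
```

The evaluation then simply checks how often this argmax word matches the gold relation partner.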

1. Surface-Level Patterns in Attention

Attention maps analyzed in this section are computed over 1000 random Wikipedia segments. Some example attention maps are shown in Figure 1.

1.1 Relative Position

  • most heads put little attention on the current token.
  • there are heads that specialize to attending heavily on the next or previous token, especially in earlier layers of the network.
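Measuring this positional behavior can be sketched in a few lines of numpy; the `offset_attention` helper is an assumption for illustration:

```python
import numpy as np

def offset_attention(attn, offset):
    """Average attention each token puts on the token at a fixed
    relative offset (-1 = previous token, +1 = next token)."""
    n = attn.shape[0]
    vals = [attn[i, i + offset] for i in range(n) if 0 <= i + offset < n]
    return float(np.mean(vals))

# Toy head that mostly attends to the previous token.
attn = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
])
print(offset_attention(attn, -1))  # 0.8
```

A head is "positional" when this average is high for some fixed offset across many segments.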

1.2 Attending to Separator Tokens

In Figure 2 and Figure 3, each point corresponds to the average attention a head puts toward a token type.

Figure 2

Figure 2 shows that

  • Heads often attend to “special” tokens.
  • Early heads attend to [CLS].
  • Middle heads attend to [SEP].
  • Deep heads attend to periods and commas.

One possible reason that heads attend so much to “special” tokens is that [SEP] and [CLS] appear in every input, and punctuation tokens are also very common.
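Measuring the average attention a head puts on a token type (as plotted in Figure 2) might look like the sketch below; the helper name and toy data are assumptions:

```python
import numpy as np

def avg_attention_to(attn, tokens, target):
    """Average attention (over all source positions) that one head's
    map puts on positions whose token equals `target`."""
    mask = np.array([t == target for t in tokens])
    if not mask.any():
        return 0.0
    # Total attention each source token sends to target positions,
    # averaged over all source tokens.
    return float(attn[:, mask].sum(axis=1).mean())

tokens = ["[CLS]", "a", "b", "[SEP]"]
attn = np.full((4, 4), 0.25)  # a perfectly uniform toy head
print(avg_attention_to(attn, tokens, "[SEP]"))  # 0.25
```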

Another explanation is that [SEP] aggregates segment-level information that other heads can then read. The authors cast doubt on this: if it were true, one would expect the attention heads processing [SEP] to attend broadly over the whole segment, but Figure 3 shows that they attend almost entirely to themselves and the other [SEP] token.

Figure 3: heads attend to [SEP] tokens even more when the current token is [SEP] itself.

Examples in Section 2 indicate that attention to special tokens might be used as a “no-op” when the attention head’s function is not applicable. An example is shown in Figure 4. Head 8-10 appears to specialize in direct objects attending to their verbs, so non-nouns mostly attend to [SEP] instead.

Figure 4

Gradient-based measures of feature importance are used to test this hypothesis. Intuitively, they measure how much changing the attention to a token changes BERT’s outputs. Results are shown in Figure 5. Starting from layer 5, the gradients of attention to [SEP] become very small, indicating that changing the attention to [SEP] barely changes BERT’s outputs.

Figure 5
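The paper backpropagates actual gradients through BERT; as a rough illustration of the idea, the sketch below approximates the same quantity with a finite difference on a toy attention head (all names and data here are hypothetical):

```python
import numpy as np

def head_output(attn, values):
    """Toy attention head: output is the attention-weighted sum of values."""
    return attn @ values

def attention_importance(attn, values, j, eps=1e-4):
    """Finite-difference proxy for gradient-based importance of the
    attention paid to position j: nudge the attention weights to j
    and see how much the head's output moves."""
    bumped = attn.copy()
    bumped[:, j] += eps
    delta = head_output(bumped, values) - head_output(attn, values)
    return float(np.abs(delta).sum() / eps)

values = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
attn = np.full((3, 3), 1.0 / 3.0)

# Attention to position 2 matters more because its value vector is larger.
imp = attention_importance(attn, values, 2)
```

A small importance score for [SEP] in the real model is what supports the "no-op" reading.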

1.3 Focused vs Broad Attention

Average entropy of each head’s attention distribution is shown in Figure 6.

Attention in the lower layers is broad, with high entropy; the output of those heads is roughly a bag-of-vectors representation of the sentence.

Entropies for all heads are also calculated using only the attention from the [CLS] token. The results are close to Figure 6, except that [CLS] in the last layer has high entropy, indicating broad attention.

Figure 6
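The entropy measurement can be sketched as follows; `attention_entropy` is a hypothetical helper, with entropy measured in nats:

```python
import numpy as np

def attention_entropy(attn):
    """Mean entropy (in nats) of each token's attention distribution
    for one head; high entropy means broad attention."""
    eps = 1e-12  # avoid log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=1)
    return float(ent.mean())

n = 8
broad = np.full((n, n), 1.0 / n)  # uniform attention: maximal entropy
focused = np.eye(n)               # one-hot attention: near-zero entropy

print(attention_entropy(broad))   # ≈ ln(8) ≈ 2.079
print(attention_entropy(focused)) # ≈ 0
```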

2. Probing Individual Attention Heads

I skip writing this part since the methods and results in this section are similar to those in the paper “Do Attention Heads in BERT Track Syntactic Dependencies?”. The main difference is how multiple word pieces corresponding to a single word are preprocessed.

For dependency syntax, certain heads specialize to specific dependency relations with high accuracy.

For coreference resolution, one head achieves decent performance and is particularly good with nominal mentions.

Figure 7

3. Probing Attention Head Combinations

This section proposes attention-based probing classifiers and applies them to dependency parsing. Given an input word, the classifier produces a probability distribution over the words in the sentence, indicating how likely each other word is to be the syntactic head of the current one.

  • Attention-Only Probe: a simple linear combination of attention weights from all heads (in both attention directions). Only the combination weights are trained.
  • Attention-and-Words Probe: sets the weights of the attention heads based on the GloVe embeddings of the input words and assigns the probability of word i being word j’s syntactic head as

p(i | j) ∝ exp( Σ_{k=1}^{n} W_k · (v_i ⊕ v_j) · α_k(i, j) + U_k · (v_i ⊕ v_j) · α_k(j, i) )

where v denotes GloVe embeddings, ⊕ denotes concatenation, and α_k(i, j) is head k’s attention from word i to word j. Only W and U are trained.
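A minimal sketch of the simpler Attention-Only Probe, assuming attention maps of shape (num_heads, n, n) and learned per-head weights w and u for the two attention directions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_only_probe(attns, w, u, j):
    """For word j, score each candidate syntactic head i as a learned
    linear combination of every head's attention in both directions.
    attns: (num_heads, n, n) with attns[k, a, b] = head k's attention
    from word a to word b; w, u: (num_heads,) learned weights."""
    scores = (w[:, None] * attns[:, j, :] + u[:, None] * attns[:, :, j]).sum(axis=0)
    return softmax(scores)

num_heads, n = 2, 4
rng = np.random.default_rng(0)
attns = rng.dirichlet(np.ones(n), size=(num_heads, n))  # rows sum to 1
w = np.ones(num_heads)
u = np.zeros(num_heads)
p = attention_only_probe(attns, w, u, j=1)  # distribution over candidates
```

In the full probe, w and u would be trained with a supervised parsing objective; the Attention-and-Words variant makes the per-head weights functions of the GloVe embeddings instead of constants.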

Results are shown in Table 1. Attn + GloVe achieves decent results, which suggests that BERT learns some aspects of syntax purely as a by-product of self-supervised training.

Table 1

4. Clustering Attention Heads

The distance between two heads H_i and H_j is calculated by:

d(H_i, H_j) = Σ_{token ∈ data} JS(H_i(token), H_j(token))

where JS is the Jensen-Shannon Divergence between the attention distributions of the two heads on the same token.
With these distances, the heads are embedded in 2-D space by applying multidimensional scaling. Results are shown in Figure 8.

Figure 8

We can observe that

  • There are several clusters of heads that behave similarly.
  • Heads within the same layer are often fairly close to each other, meaning that heads in the same layer have similar attention distributions.
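The head-distance computation can be sketched in plain numpy (the helpers below are illustrative, not the paper's code); the resulting pairwise distance matrix could then be passed to, e.g., scikit-learn's MDS with a precomputed dissimilarity for the 2-D embedding:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log((a + eps) / (b + eps))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def head_distance(head_a, head_b):
    """Distance between two heads: sum of JS divergences between their
    attention distributions (rows) over the same tokens."""
    return sum(js_divergence(a, b) for a, b in zip(head_a, head_b))

n = 4
uniform = np.full((n, n), 1.0 / n)  # broad head
diagonal = np.eye(n)                # head attending to the current token

print(head_distance(uniform, uniform))       # 0.0 for identical heads
print(head_distance(uniform, diagonal) > 0)  # True
```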
