This paper studies layer-wise BERT activations for sentence-level and passage-level tasks.
1. BERT Sentence Embedding
The SentEval toolkit is used to evaluate the quality of sentence representations built from BERT activations. It covers a variety of downstream sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval
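A minimal SentEval harness looks roughly like the sketch below, assuming the HuggingFace transformers implementation of BERT; the data path, task list, and the choice of last-layer [CLS] in the batcher are illustrative placeholders, not the paper's exact setup.

```python
# Minimal SentEval harness sketch (not the paper's exact configuration).
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
import senteval

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()

def prepare(params, samples):
    return  # no vocabulary to build for BERT

def batcher(params, batch):
    # SentEval passes a list of tokenized sentences; re-join them for the tokenizer
    sentences = [' '.join(tokens) if tokens else '.' for tokens in batch]
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = bert(**enc)
    # placeholder choice: [CLS] of the top layer; the subsections below vary
    # the layer and the pooling strategy
    return out.last_hidden_state[:, 0].numpy()

params = {'task_path': 'data/senteval', 'usepytorch': True, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
# a few downstream tasks plus probing tasks (task list is illustrative)
results = se.eval(['MR', 'CR', 'SST2', 'TREC', 'Tense', 'SubjNumber', 'ObjNumber'])
```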
1.1 [CLS] from different layers
[CLS] token embeddings from different layers are used for classification, with only a logistic regression layer on top of [CLS]. Figure 1 shows the results.
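A sketch of extracting layer-wise [CLS] features and probing each layer with a frozen-feature logistic regression, assuming HuggingFace transformers and scikit-learn (the paper does not specify its implementation; the toy sentences and labels are placeholders):

```python
# Layer-wise [CLS] probing sketch; not the paper's exact code.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()

def cls_by_layer(sentences, layer):
    """[CLS] hidden state from the given layer (0 = embedding output, 12 = top)."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden_states = bert(**enc).hidden_states  # tuple of 13 tensors
    return hidden_states[layer][:, 0].numpy()      # (batch, 768)

# toy usage: train one logistic regression probe per layer on frozen features
train_sents, train_labels = ['a great movie', 'a dull movie'], [1, 0]
for layer in range(1, 13):
    X = cls_by_layer(train_sents, layer)
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```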

We can see that the top layers generally perform better than the lower layers. However, for certain probing tasks, such as tense classification and subject/object number classification, middle-layer embeddings perform best. Also, the performance of bottom-layer embeddings correlates more strongly with that of GloVe embeddings than the performance of other layers does.
1.2 Different pooling methods
The following pooling methods are tested (a sketch of all four follows the list):
- CLS-pooling: the hidden state corresponding to the [CLS] token
- SEP-pooling: the hidden state corresponding to the [SEP] token
- Mean-pooling: the average of the encoding layer's hidden states along the time (token) axis
- Max-pooling: the element-wise maximum of the encoding layer's hidden states along the time (token) axis
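The four pooling strategies can be sketched as below over one layer's hidden states; the padding-mask handling is my own assumption, since the paper does not describe it.

```python
# Pooling sketch over one layer's hidden states (hidden: [batch, seq_len, dim]).
import torch

def pool(hidden, attention_mask, method, sep_index=None):
    mask = attention_mask.unsqueeze(-1).float()            # [batch, seq_len, 1]
    if method == 'cls':
        return hidden[:, 0]                                # [CLS] is token 0
    if method == 'sep':
        # sep_index: position of [SEP] for each sequence in the batch
        return hidden[torch.arange(hidden.size(0)), sep_index]
    if method == 'mean':
        return (hidden * mask).sum(1) / mask.sum(1)        # average over real tokens
    if method == 'max':
        return hidden.masked_fill(mask == 0, -1e9).max(1).values
    raise ValueError(method)
```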
The performance of each pooling method is averaged over the different layers. The results are summarized in Table 1, where the score for each task category is the average score of all tasks belonging to that category. The results show that Mean-pooling performs the best.

1.3 Pre-trained vs. Fine-tuned BERT
The performance of embeddings from pre-trained BERT and fine-tuned BERT is compared. MNLI and SNLI are used in this experiment. Concatenating embeddings from multiple layers is also tested (a short concatenation sketch follows). Results are presented in Table 2.
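A minimal sketch of multi-layer concatenation, assuming mean-pooling over the bottom and top encoder layers; the exact layer combination is illustrative, not prescribed by the paper.

```python
# Concatenate mean-pooled embeddings from several layers (layer choice illustrative).
import torch

def concat_layers(hidden_states, attention_mask, layers=(1, 12)):
    mask = attention_mask.unsqueeze(-1).float()
    pooled = [(hidden_states[l] * mask).sum(1) / mask.sum(1) for l in layers]
    return torch.cat(pooled, dim=-1)   # [batch, 768 * len(layers)]
```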

From the table, we can see that pre-trained BERT is good at capturing sentence-level syntactic and semantic information, but poor at semantic similarity tasks and surface-information tasks.
2. BERT Passage Embedding
In this section, BERT embeddings are used to solve passage-level QA problems under a learning-to-rank setting.
The same pooling methods as in the sentence embedding experiments are used to extract passage embeddings. The following methods are used to combine the query embedding with the answer passage embedding (see the sketch after this list):
- cosine similarity
- bilinear function
- concatenation
- (u, v, u*v, |u-v|), where u and v are the query embedding and the answer passage embedding, respectively
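The four interaction schemes can be sketched as follows, where u and v are the query and passage embeddings; the function and parameter names are illustrative, not from the paper.

```python
# Query-passage interaction sketch; u, v: [batch, dim].
import torch
import torch.nn as nn

dim = 768
bilinear = nn.Bilinear(dim, dim, 1)        # learned bilinear score

def interact(u, v, method):
    if method == 'cosine':
        return nn.functional.cosine_similarity(u, v, dim=-1).unsqueeze(-1)
    if method == 'bilinear':
        return bilinear(u, v)
    if method == 'concat':
        return torch.cat([u, v], dim=-1)
    if method == 'uv_features':            # (u, v, u*v, |u-v|)
        return torch.cat([u, v, u * v, (u - v).abs()], dim=-1)
    raise ValueError(method)
```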
A logistic layer or an MLP is added on top of the combined embeddings to output a ranking score, and a pairwise rank hinge loss is used for training. BERT passage embeddings are compared with BM25, other SOTA methods, and BERT fine-tuned on in-domain supervised data. The results are shown in Table 3.
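A sketch of the scoring head and pairwise training objective, using PyTorch's MarginRankingLoss as the pairwise rank hinge loss; the MLP sizes and margin are my own assumptions.

```python
# Ranking head + pairwise hinge loss sketch (dimensions and margin illustrative).
import torch
import torch.nn as nn

feat_dim = 4 * 768                              # (u, v, u*v, |u-v|) features
scorer = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
loss_fn = nn.MarginRankingLoss(margin=1.0)      # pairwise rank hinge loss

def pairwise_step(pos_features, neg_features):
    # pos/neg_features: interaction features for a relevant / irrelevant passage
    pos_score = scorer(pos_features).squeeze(-1)
    neg_score = scorer(neg_features).squeeze(-1)
    target = torch.ones_like(pos_score)          # the positive should rank higher
    return loss_fn(pos_score, neg_score, target)
```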

Overall, in-domain fine-tuned BERT delivers the best performance. Similar to the sentence embedding results, mean-pooling and combining the top- and bottom-layer embeddings lead to better performance, and (u, v, u*v, |u-v|) shows the strongest results among the interaction schemes.