Paper Reading: Universal Text Representation from BERT: An Empirical Study

This paper studies the layer-wise BERT activations for sentence-level tasks and passage-level tasks.

1. BERT Sentence Embedding

SentEval toolkit is used to evaluate the quality of sentence representations from BERT activations. It has a variety of downstreaming sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval
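As a rough illustration, a SentEval evaluation run looks like the sketch below (adapted from the SentEval README; the batcher here returns random vectors as a placeholder for the actual BERT activations):

    import numpy as np
    import senteval

    def prepare(params, samples):
        # Hook for building task-level state (e.g., a vocabulary); unused here.
        return

    def batcher(params, batch):
        # batch is a list of tokenized sentences; return one vector per sentence.
        # Placeholder encoder -- swap in BERT activations here.
        return np.vstack([np.random.randn(768) for _ in batch])

    params = {'task_path': 'PATH_TO_SENTEVAL_DATA', 'usepytorch': True, 'kfold': 10}
    se = senteval.engine.SE(params, batcher, prepare)
    # Mix of downstream tasks (MR, SST2) and probing tasks (Tense, SubjNumber)
    results = se.eval(['MR', 'SST2', 'Tense', 'SubjNumber'])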

1.1 [CLS] from different layers

[CLS] token embeddings from different layers are used for classification, with only a logistic regression layer on top of [CLS]. Figure 1 shows the results.

We can see that the top layers generally perform better than the lower layers. However, for certain probing tasks such as tense classification and subject/object number classification, middle-layer embeddings perform best. Also, bottom-layer embeddings correlate more strongly in performance with GloVe embeddings than embeddings from the other layers do.
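For reference, here is a minimal sketch of pulling [CLS] embeddings out of every layer with the HuggingFace transformers library (the library and model name are my assumptions; the paper does not prescribe an implementation):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()

    inputs = tokenizer("An example sentence.", return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape
    # (batch, seq_len, hidden); index 0 is the embedding-layer output.
    cls_per_layer = [h[:, 0, :] for h in outputs.hidden_states[1:]]  # [CLS] = token 0

Each vector in cls_per_layer would then be fed to its own logistic regression classifier to reproduce the layer-wise comparison.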

1.2 Different pooling methods

The following pooling methods are evaluated:

  • CLS-pooling: the hidden state corresponding to the [CLS] token
  • SEP-pooling: the hidden state corresponding to the [SEP] token
  • Mean-pooling: the average of the hidden state of the encoding layer on the time axis
  • Max-pooling: the maximum of the hidden state of the encoding layer on the time axis

The performance of each pooling method is averaged over different layers. The results are summarized in Table 1, where the score of each task category is the average score of all tasks belonging to that category. The results show that Mean-pooling performs best.
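A sketch of the four pooling methods, assuming a hidden-state tensor of shape (batch, seq_len, hidden) and an attention mask (the mask handling is my addition, to keep padding tokens out of the mean and max):

    import torch

    def pool(hidden, mask, input_ids, method, sep_id=102):  # 102 = [SEP] in bert-base-uncased
        # hidden: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens
        if method == 'cls':
            return hidden[:, 0, :]                               # [CLS] is token 0
        if method == 'sep':
            sep_pos = (input_ids == sep_id).int().argmax(dim=1)  # first [SEP] position
            return hidden[torch.arange(hidden.size(0)), sep_pos, :]
        if method == 'mean':
            m = mask.unsqueeze(-1).float()
            return (hidden * m).sum(1) / m.sum(1).clamp(min=1e-9)  # mean over time axis
        if method == 'max':
            h = hidden.masked_fill(mask.unsqueeze(-1) == 0, float('-inf'))
            return h.max(dim=1).values                           # max over time axis
        raise ValueError(method)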

1.3 Pre-trained vs. Fine-tuned BERT

The performance of embeddings from pre-trained BERT and fine-tuned BERT is compared; MNLI and SNLI are used for fine-tuning in this experiment. Concatenating embeddings from multiple layers is also tested. Results are presented in Table 2.

From the table, we can see that pre-trained BERT is good at capturing sentence-level syntactic and semantic information, but poor at semantic similarity tasks and surface information tasks.
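Concatenating layers is a one-liner given the per-layer vectors from the earlier sketch; e.g., joining the bottom and top transformer layers:

    # cls_per_layer comes from the layer-wise [CLS] sketch above
    combined = torch.cat([cls_per_layer[0], cls_per_layer[-1]], dim=-1)  # (batch, 2 * hidden)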

2. BERT Passage Embedding

In this section, BERT embeddings are used to solve QA problems (passage level) under a learning-to-rank setting.

The same pooling methods as in the sentence embedding experiments are used to extract passage embeddings. The following methods are used to combine query embeddings with answer passage embeddings:

  • cosine similarity
  • bilinear function
  • concatenation
  • (u, v, u*v, |u−v|), where u and v are the query embedding and the answer embedding, respectively.

A logistic layer or an MLP is added on top of the embeddings to output a ranking score. Pairwise rank hinge loss is used for training. BERT passage embeddings are compared with BM25, other SOTA methods, and BERT fine-tuned on in-domain supervised data. The results are shown in Table 3.
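Below is a hedged sketch of this ranking head, assuming query and passage embeddings u and v have already been pooled; the (u, v, u*v, |u−v|) interaction is fed to an MLP, and torch.nn.MarginRankingLoss stands in for the pairwise rank hinge loss:

    import torch
    import torch.nn as nn

    class RankingHead(nn.Module):
        def __init__(self, dim, hidden=256):
            super().__init__()
            # MLP over the (u, v, u*v, |u-v|) interaction features
            self.mlp = nn.Sequential(nn.Linear(4 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, u, v):
            feats = torch.cat([u, v, u * v, (u - v).abs()], dim=-1)
            return self.mlp(feats).squeeze(-1)  # one score per (query, passage) pair

    head = RankingHead(dim=768)
    loss_fn = nn.MarginRankingLoss(margin=1.0)  # pairwise hinge loss

    # q, pos, neg: dummy (batch, dim) embeddings of a query, a relevant passage,
    # and an irrelevant passage; the loss pushes the relevant score above the other.
    q, pos, neg = torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768)
    loss = loss_fn(head(q, pos), head(q, neg), torch.ones(8))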

Overall, in-domain fine-tuned BERT delivers the best performance. As with the sentence embeddings, mean-pooling and combining the top- and bottom-layer embeddings lead to better performance, and (u, v, u*v, |u−v|) is the strongest of the interaction schemes.

