Paper Reading: Beyond [CLS] through Ranking by Generation

venue: EMNLP 2020 (link) Previous work that uses a pretrained language model (PLM) such as BERT for information retrieval takes the [CLS] embedding of the concatenation of query and document as features for discriminative learning. In other words, the relevance label for a given (query, document) pair is modeled as $p(\text{rel} \mid q, d) = f(h_{[\mathrm{CLS}]})$, where $h_{[\mathrm{CLS}]}$ is the [CLS] embedding from the last layer of BERT and $f$ is usually a classification … Continue reading Paper Reading: Beyond [CLS] through Ranking by Generation
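As a rough illustration of this discriminative setup, the sketch below scores a (query, document) pair from the last-layer [CLS] embedding. It assumes the HuggingFace Transformers API; the checkpoint name, the example texts, and the binary classification head are my own illustrative choices, not details taken from the paper.

```python
# Minimal sketch of the discriminative [CLS]-based ranking setup described above.
# Assumes HuggingFace Transformers; checkpoint and head are illustrative, not from the paper.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # assumed binary relevance head

query = "what is information retrieval"
document = "Information retrieval is the task of finding relevant documents for a query."

# Concatenate query and document: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = bert(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]   # [CLS] from the last layer
relevance_logits = classifier(cls_embedding)      # relevance scores for the (query, document) pair
print(relevance_logits.softmax(dim=-1))
```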

Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention

code of this paper: link High-level Summary This paper studies the attention maps of the pre-trained BERT-base model. More specifically, it: (1) explores generally how BERT’s attention heads behave, e.g., attending to fixed positional offsets, attending broadly over the whole sentence, a large amount of attention attending to [SEP], and attention heads in the same layer behaving similarly; and (2) probes each attention head for linguistic phenomena. … Continue reading Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention
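To make this kind of analysis concrete, here is a minimal sketch of extracting BERT’s attention maps and measuring how much attention mass lands on [SEP]. It assumes the HuggingFace Transformers API; the example sentence and the aggregation are my own illustrative choices, not the paper’s code (see the linked repository for that).

```python
# Sketch: pull out BERT's attention maps and check attention to [SEP].
# Assumes HuggingFace Transformers; sentence and aggregation are illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 layers, each of shape (batch, heads, seq_len, seq_len)
attentions = torch.stack(outputs.attentions).squeeze(1)   # (layers, heads, seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_index = tokens.index("[SEP]")

# Average fraction of each head's attention that goes to [SEP]
attention_to_sep = attentions[:, :, :, sep_index].mean(dim=-1)  # (layers, heads)
print(attention_to_sep)
```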

Paper Reading: Universal Text Representation from BERT: An Empirical Study

This paper studies the layer-wise BERT activations for sentence-level tasks and passage-level tasks. 1. BERT Sentence Embedding The SentEval toolkit is used to evaluate the quality of sentence representations from BERT activations. It has a variety of downstream sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval 1.1 [CLS] from different layers [CLS] token embeddings from different layers are used for classification. Only … Continue reading Paper Reading: Universal Text Representation from BERT: An Empirical Study
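A minimal sketch of the layer-wise [CLS] extraction, assuming the HuggingFace Transformers API; feeding these vectors into SentEval’s classifiers is omitted, and the example sentence is illustrative.

```python
# Sketch: take the [CLS] embedding from each layer as a candidate sentence representation.
# Assumes HuggingFace Transformers; the SentEval evaluation step is not shown.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentence = "A man is playing a guitar."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states: embedding layer output plus the 12 transformer layers
for layer_idx, hidden in enumerate(outputs.hidden_states):
    cls_vec = hidden[:, 0]              # [CLS] embedding at this layer
    print(layer_idx, cls_vec.shape)     # each vector is one candidate sentence embedding
```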

Paper Reading: Revealing the Dark Secrets of BERT

The paper tries to answer the following questions: What are the common attention patterns, how do they change during fine-tuning, and how does that impact the performance on a given task?  What linguistic knowledge is encoded in self-attention weights of the fine-tuned models and what portion of it comes from the pre-trained BERT? How different are the self-attention patterns of different heads, and how important … Continue reading Paper Reading: Revealing the Dark Secrets of BERT

Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?

The paper specifically studies the ability of attention heads (of BERT-like models) to recover syntactic dependency relations. Method 1: Maximum Attention Weights (MAX) For a given token A, the token B with the highest attention weight with respect to token A should be related to token A. A relation $(w_i, w_j)$ is assigned such that $j = \arg\max_j W_{ij}$ for each row $i$, where $W$ is the attention weights … Continue reading Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?
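A small sketch of the MAX idea, assuming the HuggingFace Transformers API; the layer and head indices and the sentence are arbitrary illustrative choices, and the word-piece-to-word alignment needed for a proper dependency evaluation is not handled here.

```python
# Sketch of Maximum Attention Weights (MAX): for each token, take the token with the
# highest attention weight as its related token. Assumes HuggingFace Transformers;
# layer/head choice and sentence are illustrative only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 7, 10                            # arbitrary head, for illustration
attn = outputs.attentions[layer][0, head]      # (seq_len, seq_len) attention matrix W
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# For each row i, pick j = argmax_j W_ij as the predicted related token
predicted = attn.argmax(dim=-1)
for i, j in enumerate(predicted.tolist()):
    print(f"{tokens[i]} -> {tokens[j]}")
```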