Venue: EMNLP 2020 (link). Previous work that uses a pretrained language model (PLM) such as BERT for information retrieval takes the [CLS] embedding of the concatenated query and document as the feature for discriminative learning. In other words, the relevance label for a given (query, document) pair is modeled as a function of the [CLS] embedding from the last layer of BERT, where that function is usually a classification … Continue reading Paper Reading: Beyond [CLS] through Ranking by Generation
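To make the discriminative setup in the excerpt above concrete, here is a minimal sketch of scoring one (query, document) pair from its [CLS] embedding. It assumes the classifier is a single linear layer followed by a sigmoid, which the excerpt does not confirm; the function name, the toy embedding, and the weights are all hypothetical, and a real system would take the embedding from BERT rather than a hand-written list.

```python
import math

def relevance_probability(cls_embedding, weights, bias=0.0):
    """Map a [CLS] embedding to a relevance probability.

    Assumption: a single linear layer plus sigmoid stands in for the
    classification layer; the original post may use a different head.
    """
    logit = sum(w * h for w, h in zip(weights, cls_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 3-dimensional "embedding" and weights, purely illustrative.
cls_embedding = [0.5, -1.0, 2.0]
weights = [1.0, 0.5, 0.25]

prob = relevance_probability(cls_embedding, weights)
print(f"P(relevant | query, document) = {prob:.4f}")
```

Documents would then be ranked for a query by sorting on this probability.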
Code of this paper: link. High-level Summary: This paper studies the attention maps of the pre-trained BERT-base model. More specifically, it: explores how BERT’s attention heads behave in general (e.g. some attend to fixed positional offsets, some attend broadly over the whole sentence, a large amount of attention goes to [SEP], and attention heads in the same layer behave similarly); probes each attention head for linguistic phenomena. … Continue reading Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention
This paper studies the layer-wise BERT activations for sentence-level tasks and passage-level tasks. 1. BERT Sentence Embedding: The SentEval toolkit is used to evaluate the quality of sentence representations from BERT activations. It has a variety of downstream sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval 1.1 [CLS] from different layers: [CLS] token embeddings from different layers are used for classification. Only … Continue reading Paper Reading: Universal Text Representation from BERT: An Empirical Study
The paper tries to answer the following questions: What are the common attention patterns, how do they change during fine-tuning, and how does that impact the performance on a given task? What linguistic knowledge is encoded in self-attention weights of the fine-tuned models and what portion of it comes from the pre-trained BERT? How different are the self-attention patterns of different heads, and how important … Continue reading Paper Reading: Revealing the Dark Secrets of BERT
The paper specifically studies the ability of attention heads (of BERT-like models) to recover syntactic dependency relations. Method 1: Maximum Attention Weights (MAX). For a given token A, the token B that has the highest attention weight with respect to token A should be related to token A. A relation (w_i, w_j) is assigned such that j = argmax W[i] for each row i, where W is the attention weights … Continue reading Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?
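The MAX method in the excerpt above can be sketched in a few lines: for each row of one head's attention matrix, pick the column with the highest weight and pair the two tokens. This is an illustrative sketch, not the paper's code; the function name and the toy 3x3 matrix are hypothetical, and rows are assumed to hold the attention that token i pays to every token j.

```python
def max_attention_relations(attention):
    """Return (i, j) pairs where j is the most-attended token for row i.

    `attention` is a square list of lists; attention[i][j] is assumed to
    be the weight token i assigns to token j in a single head.
    """
    relations = []
    for i, row in enumerate(attention):
        # argmax over the row; ties resolve to the lowest index.
        j = max(range(len(row)), key=row.__getitem__)
        relations.append((i, j))
    return relations

# Toy attention matrix for three tokens (values are made up).
tokens = ["the", "cat", "sat"]
attention = [
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.3, 0.3, 0.4],
]

relations = max_attention_relations(attention)
for i, j in relations:
    print(f"{tokens[i]} -> {tokens[j]}")
```

The predicted pairs can then be compared against gold dependency arcs to measure how well a given head tracks a relation type.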