venlue: EMNLP 2020 (link) Previous work that uses pretrained language model (PLM) such as BERT for information retrieval takes the [CLS] embedding of the concatenation of query and document as features for discriminative learning. In other words, the relevance label for a given (query, document) pair is modeled as: where is the [CLS] embedding from the last layer of BERT and is usually a classification … Continue reading Paper Reading: Beyond [CLS] through Ranking by Generation
Venue: ACL 2020 This paper presents the empirical results of how the performance gap between pretraining models (RoBERTa) and vanilla LSTM changes in terms of the size of training samples for text classification tasks. They experimented on 3 text classification datasets with 3 models: RoBERTa, LSTM, LSTM initialized with pretrained RoBERTa embeddings. They used different portion of training samples(1%, 10%, 30%, 50%, 70%, 90%) to … Continue reading Paper Reading: To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks
code of this paper: link High-level Summary This paper studies the attention maps of the pre-trained BERT-base model. More specically, it : explore generally how BERT’s attention heads behave. eg. attending to fixed positional offsets. eg. attending broadly over the whole sentence. a large amount of attention attends to [SEP]. attention heads in the same layer behave similarly. probe each attention head for linguistic phenomena. … Continue reading Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention
This paper studies the layer-wise BERT activations for sentence-level tasks and passage-level tasks. 1. BERT Sentence Embedding SentEval toolkit is used to evaluate the quality of sentence representations from BERT activations. It has a variety of downstreaming sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval 1.1 [CLS] from different layers [CLS] token embeddings from different layers are used for classification. Only … Continue reading Paper Reading: Universal Text Representation from BERT: An Empirical Study
The paper tries to answer the following questions: What are the common attention patterns, how do they change during fine-tuning, and how does that impact the performance on a given task? What linguistic knowledge is encoded in self-attention weights of the fine-tuned models and what portion of it comes from the pre-trained BERT? How different are the self-attention patterns of different heads, and how important … Continue reading Paper Reading: Revealing the Dark Secrets of BERT
The paper specifically studies the ability of attention heads(of BERT-like models) that can recover syntactic dependency relations. Method 1: Maximum Attention Weights (MAX) For a given token A, a token B that has the highest attention weight with respect to the token A should be related to token A. A relation is assigned to such that for each row i where is the attention weights … Continue reading Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?