code of this paper: link High-level Summary This paper studies the attention maps of the pre-trained BERT-base model. More specifically, it: (1) explores how BERT’s attention heads behave in general (e.g. some heads attend to fixed positional offsets, some attend broadly over the whole sentence, a large amount of attention goes to [SEP], and attention heads in the same layer behave similarly); (2) probes each attention head for linguistic phenomena. … Continue reading Paper Reading: What Does BERT Look At? An Analysis of BERT’s Attention
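The observation that a large amount of attention goes to [SEP] can be illustrated with a small sketch. This is not the paper's code: the function name `attention_share_to_token` and the toy attention matrix are assumptions made for illustration, and a real analysis would use attention maps extracted from BERT.

```python
def attention_share_to_token(attention, tokens, target="[SEP]"):
    """Return the fraction of total attention mass that lands on `target`.

    `attention[i][j]` is the (hypothetical) weight with which token i
    attends to token j; each row is assumed to sum to 1.
    """
    target_idx = [k for k, tok in enumerate(tokens) if tok == target]
    if not target_idx:
        return 0.0
    total = sum(sum(row) for row in attention)
    to_target = sum(row[k] for row in attention for k in target_idx)
    return to_target / total


# Toy example: values are made up, not measured from BERT.
tokens = ["[CLS]", "hello", "[SEP]"]
attention = [
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.3, 0.1, 0.6],
]
sep_share = attention_share_to_token(attention, tokens)
```

Computing this share per head and per layer is one simple way to see which heads concentrate their attention on a single special token.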
This paper studies the layer-wise BERT activations for sentence-level tasks and passage-level tasks. 1. BERT Sentence Embedding The SentEval toolkit is used to evaluate the quality of sentence representations from BERT activations. It includes a variety of downstream sentence-level tasks and probing tasks. More details about SentEval are at: https://github.com/facebookresearch/SentEval 1.1 [CLS] from different layers [CLS] token embeddings from different layers are used for classification. Only … Continue reading Paper Reading: Universal Text Representation from BERT: An Empirical Study
The paper tries to answer the following questions: What are the common attention patterns, how do they change during fine-tuning, and how does that impact the performance on a given task? What linguistic knowledge is encoded in self-attention weights of the fine-tuned models and what portion of it comes from the pre-trained BERT? How different are the self-attention patterns of different heads, and how important … Continue reading Paper Reading: Revealing the Dark Secrets of BERT
Introduction Previous methods for evaluating word embeddings intrinsically (e.g. WordSim-353, SimLex-999, the word analogy task) ignore context and treat words in isolation. This paper proposes a dataset, CoSimLex, to evaluate how well word embeddings reflect similarity judgements in context, and to answer the following question: How well do word embeddings model the effects that context has on word meaning? CoSimLex is used as the … Continue reading Paper Reading: CoSimLex: A Resource for Evaluating Graded Word Similarity in Context
The paper specifically studies the ability of attention heads (of BERT-like models) to recover syntactic dependency relations. Method 1: Maximum Attention Weights (MAX) For a given token A, the token B that has the highest attention weight with respect to token A should be related to token A. A relation is assigned to such that for each row i where is the attention weights … Continue reading Paper Reading: Do Attention Heads in BERT Track Syntactic Dependencies?
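The idea behind the MAX method in the excerpt above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function name `max_attention_relations` and the toy attention matrix are assumptions, and the sketch covers only the row-wise argmax step described in the excerpt.

```python
def max_attention_relations(attention, tokens):
    """Pair each token with the token it attends to most strongly.

    `attention[i][j]` is the (hypothetical) attention weight from token i
    to token j for a single head. Returns a list of (token_i, token_j).
    """
    relations = []
    for i, row in enumerate(attention):
        # Index of the largest weight in row i (first one wins on ties).
        j = max(range(len(row)), key=lambda k: row[k])
        relations.append((tokens[i], tokens[j]))
    return relations


# Toy example: values are made up, not taken from a real BERT head.
tokens = ["the", "cat", "sleeps"]
attention = [
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.6, 0.3],
]
relations = max_attention_relations(attention, tokens)
```

The predicted pairs can then be compared against gold dependency arcs to score how well a given head tracks a given relation.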