# Paper Reading: Beyond [CLS] through Ranking by Generation

Previous work that uses a pretrained language model (PLM) such as BERT for information retrieval takes the [CLS] embedding of the concatenated query and document as the feature for discriminative learning. In other words, the relevance label for a given (query, document) pair is modeled as:

$P(relevance|q,d) = f([CLS]_{q,d})$

where $[CLS]_{q,d}$ is the [CLS] embedding from the last layer of BERT and $f$ is usually a classification layer.
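To make the discriminative setup concrete, here is a minimal sketch that assumes $f$ is a single linear layer followed by a sigmoid. The names `relevance_probability`, `weights` and `bias` are made up for illustration, and the embedding is a plain list of floats rather than a real BERT output.

```python
import math

def relevance_probability(cls_embedding, weights, bias):
    """Hypothetical f: linear layer + sigmoid on top of the [CLS] embedding."""
    # Dot product between the classifier weights and the [CLS] vector.
    logit = sum(w * x for w, x in zip(weights, cls_embedding)) + bias
    # Sigmoid maps the logit to a relevance probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 3-dimensional "embedding" just to exercise the function.
p = relevance_probability([0.2, -0.1, 0.4], [1.0, 0.5, -0.3], 0.0)
```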

Language models are also used in traditional IR methods (link) in a generative way, where the conditional likelihood of the query given the document is used as the relevance score. This paper experiments with modern PLMs to model $P(q|d)$.

To finetune the PLM, the input is formatted as:

<bos> document <boq> query <eoq>

At training time, $-\log P(query|document)$ is minimized, while at inference time the conditional log-likelihood of the query is calculated for every candidate document.
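The scoring step can be sketched as follows, assuming the model returns one vector of vocabulary logits per query position (already conditioned on the document and the preceding query tokens). The function names and the toy two-word vocabulary are my own, not from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def query_log_likelihood(step_logits, query_token_ids):
    """Sum of log P(q_i | q_<i, d) over the query tokens."""
    assert len(step_logits) == len(query_token_ids)
    total = 0.0
    for logits, token_id in zip(step_logits, query_token_ids):
        total += log_softmax(logits)[token_id]
    return total

def rank_documents(per_doc_step_logits, query_token_ids):
    """Score each document by log P(q|d) and sort from best to worst."""
    scores = [query_log_likelihood(s, query_token_ids) for s in per_doc_step_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy example: vocabulary of size 2, query tokens [0, 0], two documents.
docs = [
    [[0.0, 5.0], [0.0, 5.0]],  # document 0 makes token 0 unlikely
    [[5.0, 0.0], [5.0, 0.0]],  # document 1 makes token 0 likely
]
order, scores = rank_documents(docs, [0, 0])
```

In this toy case document 1 is ranked first, because it assigns a higher likelihood to the query tokens.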

In practice, there could be different loss functions for finetuning. The following loss functions are tested:

LUL: loss function for likelihood and unlikelihood estimation. This is an extension of the regular cross-entropy loss that adds an unlikelihood training objective (the 2nd term). The 2nd term can be considered a regularizer that makes the model less overconfident about query likelihoods.

RLL: pairwise ranking loss on the likelihood of positive and negative examples

MLE: maximum likelihood estimation (only positive examples are used)
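To compare the three losses side by side, here is a rough sketch that works on per-token log-probabilities of the query given a positive or a negative document. It assumes the unlikelihood term has the usual form $-\log(1 - p)$ and that RLL is a hinge loss with a margin; the function names, the `margin` default and the `eps` clamp are my assumptions, so check the paper's equations for the exact definitions.

```python
import math

def mle_loss(pos_token_logprobs):
    """MLE: negative log-likelihood of the query given a positive document."""
    return -sum(pos_token_logprobs)

def lul_loss(pos_token_logprobs, neg_token_logprobs, eps=1e-12):
    """LUL: likelihood term on positives plus unlikelihood term on negatives."""
    likelihood = -sum(pos_token_logprobs)
    # Unlikelihood: penalize high token probabilities under a negative document.
    # eps (assumed) avoids log(0) when a probability is numerically 1.
    unlikelihood = -sum(
        math.log(max(1.0 - math.exp(lp), eps)) for lp in neg_token_logprobs
    )
    return likelihood + unlikelihood

def rll_loss(pos_token_logprobs, neg_token_logprobs, margin=1.0):
    """RLL: hinge loss pushing log P(q|d+) above log P(q|d-) by a margin."""
    pos_score = sum(pos_token_logprobs)
    neg_score = sum(neg_token_logprobs)
    return max(0.0, margin - pos_score + neg_score)

# Toy per-token log-probabilities for a two-token query.
pos = [math.log(0.5), math.log(0.5)]
neg = [math.log(0.5), math.log(0.5)]
losses = (mle_loss(pos), lul_loss(pos, neg), rll_loss(pos, neg))
```

Note that MLE never sees the negative document, while LUL and RLL both do: LUL penalizes it token by token, and RLL only cares about the gap between the two sequence-level scores.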

Results with different PLMs and on different datasets are shown in the following table:

Takeaway conclusions/tricks:

1. The unlikelihood term in equation 4 is effective, as shown by comparing MLE with LUL (rows 6 and 7).
2. Among all generative methods, BART-large (RLL) performs the best.
3. The RLL loss, which was widely used in pre-BERT IR models, seems to be the most effective for the IR task. Though the paper emphasizes the effectiveness of the new generative approach, I am more attracted by the experiments on different loss functions (especially RLL) 🙂
4. When finetuning with RLL, 15 negative passages are sampled for each question, but only the one with the highest score is used to update the model.
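The negative selection trick in the last point can be sketched like this. It assumes negatives are sampled uniformly from the non-positive passages and that `score_fn` returns the current model's $\log P(q|d)$ for a passage; both the sampling scheme and all names here are illustrative assumptions, not the paper's code.

```python
import random

def pick_hardest_negative(candidate_ids, positive_ids, score_fn,
                          num_sampled=15, rng=None):
    """Sample negatives and keep only the one the model scores highest."""
    rng = rng or random.Random()
    # Exclude positives so that only true negatives can be sampled.
    pool = [c for c in candidate_ids if c not in positive_ids]
    sampled = rng.sample(pool, min(num_sampled, len(pool)))
    # The highest-scoring negative is the "hardest" one for the ranking loss.
    hardest = max(sampled, key=score_fn)
    return hardest, sampled

# Toy usage: passage ids 0..99, passage 99 is the positive,
# and the (hypothetical) score of a passage is just its id.
hardest, sampled = pick_hardest_negative(
    list(range(100)), {99}, score_fn=lambda pid: float(pid),
    rng=random.Random(0),
)
```

Only `hardest` would then be paired with the positive passage in the RLL update.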