Summary of Information Retrieval Techniques
doc2query
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 [cs.IR] (paper)
- This paper proposes training a sequence-to-sequence transformer model to predict relevant queries for a given document, in order to address the vocabulary mismatch problem.
- The model is trained on relevant (query, document) pairs from the MS MARCO and TREC CAR datasets.
- For each document in the corpus, a fixed number of queries is predicted and appended to the end of the document text (see the sketch after this list). The authors report that top-k random sampling performs slightly better than beam search.
- It improves MRR and MAP both for first-stage retrieval and for subsequent re-ranking (its gains are orthogonal to, and combine with, those of neural re-rankers), with negligible query-time overhead since expansion happens at indexing time.
- The improvement in first-stage retrieval with BM25 comes not only from injecting new terms (~31%) but also from re-weighting existing terms through copying (~69%).
- Consequently, this improves Recall@1000, giving re-rankers more relevant documents to consider and increasing overall MRR.
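A minimal sketch of the expansion step, assuming a Hugging Face seq2seq checkpoint fine-tuned to predict queries from documents (the model and tokenizer are passed in, so no specific checkpoint is assumed; the function name and defaults are illustrative):

```python
from transformers import PreTrainedModel, PreTrainedTokenizer


def expand_document(
    doc_text: str,
    model: PreTrainedModel,            # seq2seq model fine-tuned on (document, relevant query) pairs
    tokenizer: PreTrainedTokenizer,
    num_queries: int = 10,
) -> str:
    """Predict queries for a document and append them to its text before BM25 indexing."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    # Top-k random sampling, which the paper reports works slightly better than beam search.
    output_ids = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # Copied terms re-weight existing term frequencies; genuinely new terms address vocabulary mismatch.
    return doc_text + " " + " ".join(queries)
```

The expanded text is then indexed as usual, so the expansion cost is paid once at indexing time and retrieval latency is essentially unchanged.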
DeepCT
Zhuyun Dai and Jamie Callan. 2019. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. arXiv:1910.10687 [cs.IR] (paper)
- Instead of relying on term frequency, this paper proposes a Deep Contextualized Term Weighting framework (DeepCT) that learns to map BERT's contextualized token representations to term weights.
- It can be used to weight terms in both documents and queries, and it integrates directly with first-stage retrieval algorithms because the learned term weights can be stored in the inverted index (a sketch follows this list).
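A minimal sketch of the DeepCT idea, assuming a BERT encoder with a per-token linear regression head; the head here is untrained (the paper fits it to supervised term-importance targets), and the class and variable names are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer


class DeepCTSketch(torch.nn.Module):
    """Maps contextualized BERT token representations to scalar term weights."""

    def __init__(self, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        # Per-token regression head; in the paper it is trained to predict term importance.
        self.regressor = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # One importance score per token in the passage (or query).
        return self.regressor(hidden).squeeze(-1)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DeepCTSketch()
enc = tokenizer("effects of global warming on polar bears", return_tensors="pt")
weights = model(enc["input_ids"], enc["attention_mask"])
# After scaling and rounding, such weights can replace raw term frequencies in an inverted index.
```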
docT5query (docTTTTTquery)
Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery (paper)
- Same as doc2query, but uses T5 as the query-prediction model instead of the original vanilla transformer (see the sketch below).
- Interestingly, doc2query and docT5query produce similar proportions of copied words (~67%) and new words (~33%).
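A usage sketch under the assumption that a T5 expansion checkpoint fine-tuned on MS MARCO is available on the Hugging Face Hub (the checkpoint name below is an assumption; substitute whichever docT5query model you use); generation uses the same top-k sampling as doc2query:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name for a T5 model fine-tuned on MS MARCO (document, query) pairs.
NAME = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(NAME)
model = T5ForConditionalGeneration.from_pretrained(NAME)

doc = "A solar eclipse occurs when the Moon passes between the Sun and the Earth, blocking sunlight."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
# Sample several candidate queries; each would be appended to the document before indexing.
output_ids = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                            num_return_sequences=5)
for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```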