Summary of Information Retrieval Techniques
doc2query
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 [cs.IR] (paper)
- This paper proposes training a sequence-to-sequence transformer model to predict relevant queries for a given document, in order to address the vocabulary mismatch problem.
- The model is trained on relevant (query, document) pairs from the MS MARCO and TREC CAR datasets.
- For each document in the corpus, a fixed number of queries is predicted and appended to the end of the document text (see the sketch after this list). The authors report that top-k random sampling performs slightly better than beam search.
- It improves MRR and MAP both for first-stage retrieval and for subsequent re-ranking (its gains are orthogonal to, and combine with, those of neural re-rankers), with negligible query-time overhead since expansion happens at indexing time.
- The improvement in first-stage retrieval with BM25 comes not only from injecting new terms (~31%) but also from re-weighting existing terms through copying (~69%).
- Consequently, this improves Recall@1000, giving re-rankers more relevant documents to consider and increasing overall MRR.
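A minimal sketch of the expansion step, assuming a Hugging Face seq2seq checkpoint fine-tuned to predict queries from documents (the model and tokenizer are passed in, so no specific checkpoint is assumed; the function name and defaults are illustrative):

```python
from transformers import PreTrainedModel, PreTrainedTokenizer


def expand_document(
    doc_text: str,
    model: PreTrainedModel,            # seq2seq model fine-tuned on (document, relevant query) pairs
    tokenizer: PreTrainedTokenizer,
    num_queries: int = 10,
) -> str:
    """Predict queries for a document and append them to its text before BM25 indexing."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    # Top-k random sampling, which the paper reports works slightly better than beam search.
    output_ids = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # Copied terms re-weight existing term frequencies; genuinely new terms address vocabulary mismatch.
    return doc_text + " " + " ".join(queries)
```

The expanded text is then indexed as usual, so the expansion cost is paid once at indexing time and retrieval latency is essentially unchanged.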
DeepCT
Zhuyun Dai and Jamie Callan. 2019. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. arXiv:1910.10687 [cs.IR] (paper)
- Instead of relying on term frequency, this paper proposes a Deep Contextualized Term Weighting framework (DeepCT) that learns to map BERT's contextualized token representations to term weights.
- It can be used to weight terms in both documents and queries, and it integrates directly with first-stage retrieval algorithms because the learned term weights can be stored in the inverted index (a sketch follows this list).
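A minimal sketch of the DeepCT idea, assuming a BERT encoder with a per-token linear regression head; the head here is untrained (the paper fits it to supervised term-importance targets), and the class and variable names are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer


class DeepCTSketch(torch.nn.Module):
    """Maps contextualized BERT token representations to scalar term weights."""

    def __init__(self, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        # Per-token regression head; in the paper it is trained to predict term importance.
        self.regressor = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # One importance score per token in the passage (or query).
        return self.regressor(hidden).squeeze(-1)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DeepCTSketch()
enc = tokenizer("effects of global warming on polar bears", return_tensors="pt")
weights = model(enc["input_ids"], enc["attention_mask"])
# After scaling and rounding, such weights can replace raw term frequencies in an inverted index.
```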
docT5query (docTTTTTquery)
Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery (paper)
- Same as doc2query, but uses T5 as the query-prediction model instead of the original vanilla transformer (see the sketch below).
- Interestingly, doc2query and docT5query produce similar proportions of copied words (~67%) and new words (~33%).
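A usage sketch under the assumption that a T5 expansion checkpoint fine-tuned on MS MARCO is available on the Hugging Face Hub (the checkpoint name below is an assumption; substitute whichever docT5query model you use); generation uses the same top-k sampling as doc2query:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name for a T5 model fine-tuned on MS MARCO (document, query) pairs.
NAME = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(NAME)
model = T5ForConditionalGeneration.from_pretrained(NAME)

doc = "A solar eclipse occurs when the Moon passes between the Sun and the Earth, blocking sunlight."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
# Sample several candidate queries; each would be appended to the document before indexing.
output_ids = model.generate(**inputs, max_length=64, do_sample=True, top_k=10,
                            num_return_sequences=5)
for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```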