An empirical study of the design choices for local citation recommendation systems

Bibliographic Details
Published in:	Expert systems with applications 2022-08, Vol.200, p.116852, Article 116852
Main Authors:	Medić, Zoran, Šnajder, Jan
Format:	Article
Language:	English
Subjects:	BM25; Citation recommendation; Datasets; Information retrieval; Mathematical models; Natural language processing; Negative sampling; Parameters; Recommender systems; Scientists; SPECTER; Training

Description
Summary:	As the number of published research articles grows on a daily basis, it is becoming increasingly difficult for scientists to keep up with the published work. Local citation recommendation (LCR) systems, which produce a list of relevant articles to be cited in a given text passage, could help alleviate the burden on scientists and facilitate research. While research on LCR is gaining popularity, building such systems involves a number of important design choices that are often overlooked. We present an empirical study of the impact of three design choices in two-stage LCR systems consisting of a prefiltering and a reranking phase. In particular, we investigate (1) the impact of the prefiltering models’ parameters on the model’s performance, as well as the impact of (2) the training regime and (3) negative sampling strategy on the performance of the reranking model. We evaluate various combinations of these parameters on two datasets commonly used for LCR and demonstrate that specific combinations improve the model’s performance over the widely used standard approaches. Specifically, we demonstrate that (1) optimizing prefiltering models’ parameters improves R@1000 in the range of 3% to 12% in absolute value, (2) using the strict training regime improves both R@10 and MRR (up to a maximum of 3.4% and 2.6%, respectively) in all combinations of dataset and prefiltering model, and (3) a careful choice of negative examples can further improve both R@10 and MRR (up to a maximum of 11.9% and 8%, respectively) on both datasets used. Our results show that the design choices we considered are important and should be given greater consideration when building LCR systems.
• Certain design choices affect the accuracy of local citation recommendation systems.
• Training reranker model with a strict regime improves the model’s performance.
• Triplet-based reranking models benefit from non-random negative sampling strategies.
• The best negative sampling strategy for triplet construction depends on the dataset.
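The summary reports results as R@k (recall at cutoff k, e.g. R@10 and R@1000) and MRR (mean reciprocal rank). As a reading aid, the following is a minimal sketch of how these two metrics are commonly computed for a ranked list of candidate articles. It is not the authors' evaluation code: the function names and the toy paper IDs are hypothetical, and the article's exact evaluation protocol may differ.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant articles that appear in the top-k results."""
    if not relevant:
        return 0.0
    # Use a set so duplicate IDs in the ranking are not counted twice.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant article, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the reciprocal ranks over all (ranking, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical paper IDs: each citation context has one
# correct (cited) article, and the system returns a ranked candidate list.
ranking_a = ["p3", "p1", "p7"]   # correct article "p1" is at rank 2
ranking_b = ["p1", "p4", "p9"]   # correct article "p1" is at rank 1
queries = [(ranking_a, {"p1"}), (ranking_b, {"p1"})]

r_at_1 = recall_at_k(ranking_a, {"p1"}, k=1)
r_at_2 = recall_at_k(ranking_a, {"p1"}, k=2)
mrr = mean_reciprocal_rank(queries)
```

In a two-stage setup like the one the summary describes, a recall metric with a large cutoff (R@1000) is a natural fit for the prefiltering phase, because the reranker can only promote articles that the prefilter retrieved, while R@10 and MRR describe the quality of the final reranked list.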
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2022.116852