Loading…

Improved sqrt-cosine similarity measurement

Text similarity measurement aims to find the commonality existing among text documents, which is fundamental to most information extraction, information retrieval, and text mining problems. Cosine similarity based on Euclidean distance is currently one of the most widely used similarity measurements...

Full description

Saved in:
Bibliographic Details
Published in:Journal of big data 2017-07, Vol.4 (1), p.1-13, Article 25
Main Authors: Sohangir, Sahar, Wang, Dingding
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Text similarity measurement aims to find the commonality existing among text documents, which is fundamental to most information extraction, information retrieval, and text mining problems. Cosine similarity based on Euclidean distance is currently one of the most widely used similarity measurements. However, Euclidean distance is generally not an effective metric for dealing with probabilities, which are often used in text analytics. In this paper, we propose a new similarity measure based on sqrt-cosine similarity. We apply the proposed improved sqrt-cosine similarity to a variety of document-understanding tasks, such as text classification, clustering, and query search. Comprehensive experiments are then conducted to evaluate our new similarity measurement in comparison to existing methods. These experimental results show that our proposed method is indeed effective.
ISSN:2196-1115
2196-1115
DOI:10.1186/s40537-017-0083-6