Loading…

Nonlinear transformation of term frequencies for term weighting in text categorization

In automatic text categorization, the influence of features on the decision is set by the term weights which are conventionally computed as the product of term frequency and collection frequency factors. The raw form of term frequencies or their logarithmic forms are generally used as the term frequ...

Full description

Saved in:
Bibliographic Details
Published in:Engineering applications of artificial intelligence 2012-10, Vol.25 (7), p.1505-1514
Main Authors: Erenel, Zafer, Altınçay, Hakan
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:In automatic text categorization, the influence of features on the decision is set by the term weights which are conventionally computed as the product of term frequency and collection frequency factors. The raw form of term frequencies or their logarithmic forms are generally used as the term frequency factor whereas the leading collection frequency factors take into account the document frequency of each term. In this study, it is firstly shown that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under concern. Taking this observation into account, a novel collection frequency factor is proposed which considers term frequencies. Five datasets are firstly tested to show that the distribution of term frequency values is task dependent. The proposed method is then proven to provide better F1 scores compared to two recent approaches on majority of the datasets considered. It is confirmed that the use of term frequencies in the collection frequency factor is beneficial on tasks which does not involve highly repeated terms. It is also shown that the best F1 scores are achieved on majority of the datasets when smaller number of features are considered.
ISSN:0952-1976
1873-6769
DOI:10.1016/j.engappai.2012.06.013