
Using clustering and dynamic mutual information for topic feature selection

Bibliographic Details
Published in:Journal of the Society for Information Display 2014-11, Vol.22 (11), p.572-580
Main Authors: Xu, Jian-min, Wu, Shu fang, Zhu, Jie
Format: Article
Language:English
Description
Summary:A good feature selection method should take both category information and high-frequency information into account in order to select useful features that effectively represent the information of a target. Because basic mutual information (BMI) favors low-frequency features and ignores high-frequency ones, clustering mutual information is proposed: it is based on clustering and makes effective high-frequency features distinctive, thereby better integrating category information with useful high-frequency information. Time is an important factor in topic detection and tracking (TDT). To improve TDT performance, the time difference is integrated into clustering mutual information to adjust the mutual information dynamically, yielding a second algorithm called dynamic clustering mutual information (DCMI). To obtain the optimal subsets for representing topic information, an objective function is proposed, based on the idea that a good feature subset should have the smallest within-class distance and the largest across-class distance. Experiments on the TDT4 corpus are performed using this objective function, and the performances of BMI, DCMI, and the only existing topic feature selection algorithm, Incremental Term Frequency-Inverted Document Frequency (ITF-IDF), are compared in four figures. The computation time of DCMI is lower than that of BMI and ITF-IDF. The optimal normalized detection performance (Cdet)norm of DCMI is decreased by 0.3044 and 0.0970 compared with those of BMI and ITF-IDF, respectively. The paper includes six tables: Tables 1 and 2 demonstrate the difference in mutual information (MI) before and after clustering. Table 3 gives the evaluation cost parameters for the different topic detection and tracking tasks. Table 4 shows the optimal feature subset size for the TDT4 topics, obtained by the coordinate descent method. Table 5 describes the feature subset sizes in the different loops. Table 6 demonstrates the optimal detection performance of MI, dynamic clustering MI, and incremental term frequency-inverted document frequency.
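
The abstract describes the approach only at a high level: rank terms by a clustering- and time-adjusted mutual information score, then pick the subset size that keeps within-class distance small and across-class distance large. The paper's exact DCMI formulas are not reproduced in this record, so the Python sketch below only illustrates that general shape; the MI estimator, the exponential time-decay weight, the scatter-ratio objective, and all function names are assumptions, not the authors' implementation.

# Illustrative sketch only: assumed MI estimator, time-decay weight, and
# across-/within-class scatter objective; not the paper's DCMI formulas.
import numpy as np

def mutual_information(X, y):
    """MI between each binary term-presence feature and the topic labels."""
    n_docs, n_feats = X.shape
    mi = np.zeros(n_feats)
    for j in range(n_feats):
        for c in np.unique(y):
            for v in (0, 1):
                p_xy = np.mean((X[:, j] == v) & (y == c))
                if p_xy == 0:
                    continue
                p_x = np.mean(X[:, j] == v)
                p_y = np.mean(y == c)
                mi[j] += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def time_weight(doc_times, topic_time, decay=0.05):
    """Assumed dynamic adjustment: documents far in time from the topic's
    reference time contribute less to a term's score."""
    return np.exp(-decay * np.abs(doc_times - topic_time))

def subset_objective(X, y, selected):
    """Across-class scatter divided by within-class scatter on the selected
    features; larger means better class separation."""
    Xs = X[:, selected].astype(float)
    overall = Xs.mean(axis=0)
    within = across = 0.0
    for c in np.unique(y):
        Xc = Xs[y == c]
        centroid = Xc.mean(axis=0)
        within += np.sum((Xc - centroid) ** 2)
        across += len(Xc) * np.sum((centroid - overall) ** 2)
    return across / (within + 1e-12)

def select_features(X, y, doc_times, topic_time, max_k=200):
    """Rank terms by time-weighted MI, then keep the subset size that
    maximizes the across-/within-class objective."""
    mi = mutual_information(X, y)
    w = time_weight(doc_times, topic_time)
    support = X.sum(axis=0) + 1e-12          # documents containing each term
    recency = (X * w[:, None]).sum(axis=0) / support
    order = np.argsort(mi * recency)[::-1]
    best_k, best_val = 1, -np.inf
    for k in range(1, min(max_k, X.shape[1]) + 1):
        val = subset_objective(X, y, order[:k])
        if val > best_val:
            best_val, best_k = val, k
    return order[:best_k]

# Toy usage on random binary document-term data (purely illustrative).
rng = np.random.default_rng(0)
X = (rng.random((60, 40)) < 0.3).astype(int)
y = rng.integers(0, 3, size=60)
t = rng.random(60) * 30.0                    # document timestamps, e.g. in days
print("selected term indices:", select_features(X, y, t, topic_time=15.0, max_k=20))

The ratio form of the objective is one convenient way to combine the two distance criteria into a single number to maximize; the paper may instead fix the subset size by coordinate descent, as Table 4 of the abstract suggests.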
ISSN:1071-0922
1938-3657
DOI:10.1002/jsid.289