A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections

Bibliographic Details
Published in: Journal of Intelligent Information Systems, 2002-03, Vol. 18 (2-3), p. 153
Main Authors: Vinokourov, Alexei; Girolami, Mark
Format: Article
Language: English
Description
Summary: This paper presents a probabilistic mixture modeling framework for the hierarchic organization of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organization of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.
ISSN: 0925-9902, 1573-7675
DOI: 10.1023/A:1013677411002
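
The summary describes deriving a kernel from a probabilistic mixture model of the corpus, but it does not say how the kernel is constructed. As a rough illustration only, the sketch below assumes a flat mixture of multinomials with hand-picked toy parameters and scores two documents by the inner product of their component posteriors, K(a, b) = sum_k P(k | a) P(k | b). The function names, the parameters, and this particular kernel are assumptions made for illustration, not the authors' method; the hierarchic model in the paper would use a different corpus model.

```python
import math

def posteriors(counts, priors, word_probs):
    """Return P(k | d) for one document under a mixture of multinomials.

    counts     -- word counts n(d, w) for the document
    priors     -- component priors P(k)
    word_probs -- per-component word distributions P(w | k), all > 0
    """
    # Log joint per component: log P(k) + sum_w n(d, w) * log P(w | k)
    log_joint = [
        math.log(pk) + sum(n * math.log(pw) for n, pw in zip(counts, probs))
        for pk, probs in zip(priors, word_probs)
    ]
    # Normalise with the log-sum-exp trick for numerical stability
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def mixture_kernel(counts_a, counts_b, priors, word_probs):
    """K(a, b) = sum_k P(k | a) * P(k | b), an inner product of posteriors."""
    pa = posteriors(counts_a, priors, word_probs)
    pb = posteriors(counts_b, priors, word_probs)
    return sum(x * y for x, y in zip(pa, pb))

# Hypothetical toy model: 2 components over a 3-word vocabulary
PRIORS = [0.5, 0.5]
WORD_PROBS = [[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]]

doc_a = [5, 1, 0]  # dominated by word 0
doc_b = [4, 2, 1]  # also dominated by word 0
doc_c = [0, 1, 5]  # dominated by word 2

k_ab = mixture_kernel(doc_a, doc_b, PRIORS, WORD_PROBS)
k_ac = mixture_kernel(doc_a, doc_c, PRIORS, WORD_PROBS)
```

Because the kernel is an inner product of posterior vectors, it is symmetric and positive semi-definite, so a Gram matrix built from it could in principle be passed to a support vector machine that accepts precomputed kernels. In practice the mixture parameters would be estimated from the unlabeled corpus rather than fixed by hand as they are here.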