
Multisubject Analysis and Classification of Books and Book Collections, Based on a Subject Term Vocabulary and the Latent Dirichlet Allocation

Bibliographic Details
Published in: IEEE Access, 2023, Vol. 11, p. 120881-120898
Main Authors: Makris, Nikolaos; Mitrou, Nikolaos
Format: Article
Language: English
Description
Summary: This paper presents a new method for automatically analyzing and classifying books and book collections according to the subjects they cover. It combines the LDA method for discovering latent topics in the collection with the description of subjects by means of a subject term vocabulary. Books, topics and subjects are all modelled as bags of words, with specific distributions over the underlying word vocabulary. The Table of Contents (ToC) was used to describe the books, instead of their entire body, while subject (or standard) documents are produced from a subject term hierarchy of the respective disciplines. Frequency-of-terms in the documents and word-generative probabilistic models (such as those postulated by LDA) were integrated into a consistent statistical framework. Using Bayesian statistics and simple marginalization equations, we were able to transform the expressions of the books from distributions over unlabeled topics (derived by LDA) to distributions over labeled subjects representing the respective disciplines (Physical sciences, Health sciences, Mathematics, etc.). More specifically, the necessary theoretical basis is first established, with each subject formally defined by the respective branch of a subject term hierarchy (much like a ToC) or by the respective bag of words (single words and biwords) produced by flattening the hierarchy branch; flattening is realized by taking all the terms of the nodes and leaves of the branch, with repetitions allowed. Being confined within a closed set of subjects, we are able to invert the frequency-of-terms in each subject [also interpreted as the probability of generating a term \( w_{n} \) when sampling the subject \( s_{i} \), and denoted by \( \Pr\{ w_{n} \vert s_{i} \} \)] and express each term as a weighted mixture (or probability distribution) of subjects, denoted by \( \Pr\{ s_{i} \vert w_{n} \} \). This is the key idea of the proposed method.
Then, any document \( d_{m} \) can be expressed as a weighted mixture of subjects (or the respective distr
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3326722
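
The inversion step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a uniform prior over the closed set of subjects, uses single words only (no biwords), and assumes that a document's subject mixture is obtained by marginalizing over its in-vocabulary terms, \( \Pr\{ s_{i} \vert d_{m} \} = \sum_{n} \Pr\{ s_{i} \vert w_{n} \} \Pr\{ w_{n} \vert d_{m} \} \). The function names and the toy subject bags are hypothetical.

```python
from collections import Counter


def term_given_subject(subject_bags):
    """Pr{w | s}: relative term frequency inside each subject's bag of words."""
    probs = {}
    for subject, bag in subject_bags.items():
        counts = Counter(bag)
        total = sum(counts.values())
        probs[subject] = {w: c / total for w, c in counts.items()}
    return probs


def subject_given_term(p_w_given_s, prior=None):
    """Pr{s | w} via Bayes' rule over a closed set of subjects."""
    subjects = list(p_w_given_s)
    if prior is None:
        # Assumption: uniform prior over subjects (not stated in the summary).
        prior = {s: 1.0 / len(subjects) for s in subjects}
    vocab = {w for dist in p_w_given_s.values() for w in dist}
    posterior = {}
    for w in vocab:
        joint = {s: p_w_given_s[s].get(w, 0.0) * prior[s] for s in subjects}
        evidence = sum(joint.values())  # Pr{w}, positive for every vocabulary term
        posterior[w] = {s: joint[s] / evidence for s in subjects}
    return posterior


def document_subject_mixture(doc_terms, p_s_given_w):
    """Assumed marginalization: Pr{s | d} = sum_w Pr{s | w} * Pr{w | d}.

    Terms outside the subject vocabulary are ignored.
    """
    known = [w for w in doc_terms if w in p_s_given_w]
    if not known:
        return {}
    counts = Counter(known)
    total = sum(counts.values())
    mixture = {}
    for w, c in counts.items():
        for s, p in p_s_given_w[w].items():
            mixture[s] = mixture.get(s, 0.0) + p * c / total
    return mixture


# Hypothetical flattened hierarchy branches (repetitions allowed).
subject_bags = {
    "Mathematics": ["algebra", "geometry", "algebra", "statistics"],
    "Physical sciences": ["mechanics", "optics", "statistics", "mechanics"],
}
p_w_s = term_given_subject(subject_bags)
p_s_w = subject_given_term(p_w_s)

# Hypothetical ToC terms of one book; "preface" is out of vocabulary.
toc_terms = ["algebra", "statistics", "optics", "preface"]
mixture = document_subject_mixture(toc_terms, p_s_w)
```

In this toy case "algebra" occurs only in the Mathematics bag, "optics" only in Physical sciences, and "statistics" equally in both, so the book comes out as an even mixture of the two subjects. A non-uniform prior can be passed as the `prior` argument.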