
An optimal text compression algorithm based on frequent pattern mining

Bibliographic Details
Published in:Journal of Ambient Intelligence and Humanized Computing, 2018-06, Vol. 9(3), pp. 803-822
Main Authors: Oswald, C., Sivaselvan, B.
Format: Article
Language:English
Description
Summary:Data Compression as a research area has been explored in depth over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of this research has focused on conventional character/word-based mechanisms without looking at the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent pattern mining based Huffman Encoding algorithm for text data and employs a hash table in the process of frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, since it reduces the code table size. The optimal (pruned) set of patterns is employed in the encoding process instead of the character-based approach of conventional Huffman coding. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves a pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora.
ISSN:1868-5137
1868-5145
DOI:10.1007/s12652-017-0540-2
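
The abstract describes Huffman Encoding applied to frequently occurring patterns rather than individual characters, with a hash table used for pattern counting and support-based pruning to keep the code table small. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' algorithm: the names (frequent_patterns, greedy_tokenize, huffman_codes), the substring-based candidate generation, the greedy longest-match tokenization, and the parameters (max_len, min_support) are all assumptions introduced here for demonstration.

```python
# Hypothetical sketch of pattern-based Huffman coding: count candidate
# patterns with a hash table (dict), prune infrequent ones, then build a
# Huffman code over the surviving patterns plus single-character fallbacks.
# Illustrative only; not the algorithm proposed in the paper.

import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len=3, min_support=2):
    """Count all substrings of length 2..max_len in a hash table and keep
    only those meeting the minimum support threshold (pruning step)."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for l in range(2, max_len + 1):
            if i + l <= n:
                counts[text[i:i + l]] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def greedy_tokenize(text, patterns):
    """Greedily segment the text, preferring the longest frequent pattern
    at each position and falling back to single characters."""
    tokens, i = [], 0
    max_len = max((len(p) for p in patterns), default=1)
    while i < len(text):
        for l in range(max_len, 1, -1):
            if text[i:i + l] in patterns:
                tokens.append(text[i:i + l])
                i += l
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def huffman_codes(freqs):
    """Standard Huffman tree construction over (symbol, frequency) pairs."""
    tiebreak = count()  # unique counter keeps heap entries comparable
    heap = [(f, next(tiebreak), [(sym, "")]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = [(s, "0" + code) for s, code in c1] + \
                 [(s, "1" + code) for s, code in c2]
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return dict(heap[0][2]) if heap else {}


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    patterns = frequent_patterns(text)
    tokens = greedy_tokenize(text, patterns)
    codes = huffman_codes(Counter(tokens))
    encoded = "".join(codes[t] for t in tokens)
    print(f"{len(text) * 8} bits raw -> {len(encoded)} bits encoded "
          f"(code table of {len(codes)} entries)")
```

In this toy form, substrings that fail the support threshold fall back to single-character codes, which mirrors the abstract's point that pruning the pattern set bounds the size of the code table while still letting dense, repetitive text benefit from multi-character codewords.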