
An optimal text compression algorithm based on frequent pattern mining

Bibliographic Details
Published in:Journal of Ambient Intelligence and Humanized Computing, 2018-06, Vol. 9(3), pp. 803-822
Main Authors: Oswald, C., Sivaselvan, B.
Format: Article
Language:English
Description
Summary:Data Compression as a research area has been explored in depth over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of this research has focused on conventional character/word-based mechanisms without looking at the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent pattern mining based Huffman Encoding algorithm for text data and employs a hash table in the process of frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, since it reduces the code table size. The optimal (pruned) set of patterns is employed in the encoding process instead of the character-based approach of conventional Huffman coding. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves a pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora.
ISSN:1868-5137
1868-5145
DOI:10.1007/s12652-017-0540-2
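
The abstract describes Huffman Encoding applied to frequently occurring patterns rather than individual characters, with a hash table used for pattern counting and support-based pruning to keep the code table small. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' algorithm: the names (frequent_patterns, greedy_tokenize, huffman_codes), the substring-based candidate generation, the greedy longest-match tokenization, and the parameters (max_len, min_support) are all assumptions introduced here for demonstration.

```python
# Hypothetical sketch of pattern-based Huffman coding: count candidate
# patterns with a hash table (dict), prune infrequent ones, then build a
# Huffman code over the surviving patterns plus single-character fallbacks.
# Illustrative only; not the algorithm proposed in the paper.

import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len=3, min_support=2):
    """Count all substrings of length 2..max_len in a hash table and keep
    only those meeting the minimum support threshold (pruning step)."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for l in range(2, max_len + 1):
            if i + l <= n:
                counts[text[i:i + l]] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def greedy_tokenize(text, patterns):
    """Greedily segment the text, preferring the longest frequent pattern
    at each position and falling back to single characters."""
    tokens, i = [], 0
    max_len = max((len(p) for p in patterns), default=1)
    while i < len(text):
        for l in range(max_len, 1, -1):
            if text[i:i + l] in patterns:
                tokens.append(text[i:i + l])
                i += l
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def huffman_codes(freqs):
    """Standard Huffman tree construction over (symbol, frequency) pairs."""
    tiebreak = count()  # unique counter keeps heap entries comparable
    heap = [(f, next(tiebreak), [(sym, "")]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = [(s, "0" + code) for s, code in c1] + \
                 [(s, "1" + code) for s, code in c2]
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return dict(heap[0][2]) if heap else {}


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    patterns = frequent_patterns(text)
    tokens = greedy_tokenize(text, patterns)
    codes = huffman_codes(Counter(tokens))
    encoded = "".join(codes[t] for t in tokens)
    print(f"{len(text) * 8} bits raw -> {len(encoded)} bits encoded "
          f"(code table of {len(codes)} entries)")
```

In this toy form, substrings that fail the support threshold fall back to single-character codes, which mirrors the abstract's point that pruning the pattern set bounds the size of the code table while still letting dense, repetitive text benefit from multi-character codewords.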