Map reduce in document clustering using big data
Main Authors: | , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | Big data is now a central concept in data handling. As data volumes grow exponentially, conventional database management systems run into a number of constraints and restrictions in dealing with them, and the complexity of storing, retrieving and processing the data increases. Search engines struggle to perform their operations in the presence of such huge volumes of data. Documents of different sizes are hyperlinked at various levels and form the web resources of the Internet. As the size and number of documents increase, the web becomes overwhelmed with a dense volume of documents. At this point, document clustering plays an important role in addressing the situation. This technique groups documents by relevance and forms clusters of documents. The clusters are of manageable size, which helps in handling big data as reduced-size clusters with reduced overheads. For clustering and handling documents, we use the MapReduce paradigm. This model processes the documents in a massively parallel way, in two steps that use two functions, the Map function and the Reduce function. The Map function is applied to the documents as key-value pairs and produces intermediate values. The Reduce function used in this work accepts the intermediate key-value pairs and produces a reduced number of data items. Hence the relevant documents are grouped at reduced size and clusters are formed. Document clusters are easier to handle than the big data processed directly. The results are compared for serial execution and MapReduce execution. The comparison shows that MapReduce performs document clustering with reduced computational complexity for documents with a large number of terms. |
---|---|
ISSN: | 0094-243X 1551-7616 |
DOI: | 10.1063/5.0153833 |
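
The Map and Reduce steps described in the summary can be illustrated with a small single-process sketch. The record does not say which clustering algorithm, similarity measure, or MapReduce framework the authors used, so everything below is an assumption made for illustration: the Map step assigns each document to the most similar of a set of seed centroids using cosine similarity over term counts (one k-means-style assignment pass), and the Reduce step collapses each group into a member list and a summed term vector. The function names (`term_vector`, `map_document`, `reduce_cluster`, `run_iteration`) and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_document(doc_id, text, centroids):
    """Map: emit (cluster_id, (doc_id, vector)) for the most similar centroid."""
    vec = term_vector(text)
    best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
    yield best, (doc_id, vec)


def reduce_cluster(cluster_id, values):
    """Reduce: collapse one cluster's documents into member ids and a summed vector.

    The summed vector points in the same direction as the mean, which is all
    cosine similarity needs if it is reused as the next centroid.
    """
    members, centroid = [], Counter()
    for doc_id, vec in values:
        members.append(doc_id)
        centroid.update(vec)
    return cluster_id, sorted(members), centroid


def run_iteration(docs, centroids):
    """Run map, shuffle (group by key), and reduce serially in one process."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_document(doc_id, text, centroids):
            groups[key].append(value)
    return [reduce_cluster(k, v) for k, v in sorted(groups.items())]


# Hypothetical sample corpus, not taken from the paper.
docs = {
    "d1": "hadoop map reduce cluster",
    "d2": "map reduce parallel hadoop",
    "d3": "search engine web documents",
    "d4": "web documents hyperlinked search",
}
# Assumed seeds: the first and third documents act as initial centroids.
centroids = [term_vector(docs["d1"]), term_vector(docs["d3"])]
clusters = {cid: members for cid, members, _ in run_iteration(docs, centroids)}
print(clusters)
```

In a real deployment the map calls would run in parallel across workers and the grouping by key would be done by the framework's shuffle phase; here both are simulated serially so the data flow is easy to follow.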