
Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2015-08, Vol. 26 (8), p. 2300-2315
Main Authors: Shi, Xuanhua; Chen, Ming; He, Ligang; Xie, Xu; Lu, Lu; Jin, Hai; Chen, Yong; Wu, Song
Format: Article
Language:English
description The MapReduce platform has recently been widely used for large-scale data processing and analysis. It works well if the hardware of a cluster is well configured. However, our survey has indicated that common hardware configurations in small- and medium-sized enterprises may not be suitable for such tasks. This situation is more challenging for memory-constrained systems, in which memory is a bottleneck resource compared with CPU power and thus does not meet the needs of large-scale data processing. The traditional high performance computing (HPC) system is an example of a memory-constrained system according to our survey. In this paper, we have developed Mammoth, a new MapReduce system that aims to improve MapReduce performance through global memory management. In Mammoth, we design a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefit of the Map/Reduce job when scheduling each memory unit. We have also developed a multi-threaded execution engine, which is based on Hadoop but runs in a single JVM on a node. In the execution engine, we have implemented the memory-scheduling algorithm to realize global memory management, based on which we further developed techniques such as sequential disk accessing, multi-cache, and shuffling from memory, and solved the problem of full garbage collection in the JVM. We have conducted extensive experiments to compare Mammoth against the native Hadoop platform. The results show that Mammoth can reduce job execution time by more than 40 percent in typical cases, without requiring any modifications to the Hadoop programs. When a system is short of memory, Mammoth can improve performance by up to 5.19 times, as observed for I/O-intensive applications such as PageRank. We also compared Mammoth with Spark. Although Spark can achieve better performance than Mammoth for interactive and iterative applications when memory is sufficient, our experimental results show that for batch-processing applications, Mammoth adapts better to various memory environments, outperforming Spark when memory is insufficient and obtaining similar performance when memory is sufficient. Given the growing importance of large-scale data processing and analysis and the proven success of the MapReduce platform, the Mammoth system has promising potential and impact.
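The abstract describes a rule-based heuristic that prioritizes memory allocation and revocation among execution units (mapper, shuffler, reducer). As a rough illustration only (this is not the paper's code; the class name, method names, and the particular priority ordering are invented assumptions for the sketch), a global memory manager might grant each request from a fixed budget and, when the budget is exhausted, revoke memory from strictly lower-priority units first:

```java
import java.util.*;

// Hypothetical sketch of a Mammoth-style rule-based memory scheduler.
// All names and the priority ordering are illustrative assumptions.
public class MemoryScheduler {
    // Execution units in assumed priority order, highest first.
    enum Unit { MAPPER, SHUFFLER, REDUCER }

    private final long capacity;                         // total budget, in MB
    private final Map<Unit, Long> held = new EnumMap<>(Unit.class);

    MemoryScheduler(long capacityMB) {
        this.capacity = capacityMB;
        for (Unit u : Unit.values()) held.put(u, 0L);
    }

    long used() {
        return held.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Try to grant `mb` to `unit`, revoking from lower-priority units
     *  when the free budget is insufficient. Returns the MB granted. */
    long request(Unit unit, long mb) {
        long free = capacity - used();
        // Revoke from strictly lower-priority units (higher ordinal) first.
        for (int i = Unit.values().length - 1;
             i > unit.ordinal() && free < mb; i--) {
            Unit victim = Unit.values()[i];
            long take = Math.min(held.get(victim), mb - free);
            held.put(victim, held.get(victim) - take);
            free += take;
        }
        long grant = Math.min(mb, free);
        held.put(unit, held.get(unit) + grant);
        return grant;
    }

    long heldBy(Unit u) { return held.get(u); }

    public static void main(String[] args) {
        MemoryScheduler s = new MemoryScheduler(100);
        System.out.println(s.request(Unit.REDUCER, 80)); // granted 80, budget free
        System.out.println(s.request(Unit.MAPPER, 50));  // granted 50 after revoking 30
        System.out.println(s.heldBy(Unit.REDUCER));      // reducer now holds 50
    }
}
```

The point of the sketch is the revocation rule: a higher-priority unit never blocks on memory held by a lower-priority one, which mirrors the abstract's goal of maximizing the holistic benefit of the job rather than serving requests first-come, first-served.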
doi 10.1109/TPDS.2014.2345068
issn 1045-9219
eissn 1558-2183
source IEEE Xplore (Online service)
subjects Batch processing
Data processing
Data structures
Educational institutions
Engines
Memory management
Receivers
Runtime