
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Bibliographic Details
Published in: arXiv.org, 2024-06
Main Authors: Liu, Zechun; Zhao, Changsheng; Iandola, Forrest; Lai, Chen; Tian, Yuandong; Fedorov, Igor; Xiong, Yunyang; Chang, Ernie; Shi, Yangyang; Krishnamoorthi, Raghuraman; Lai, Liangzhen; Chandra, Vikas
Format: Article
Language: English
Subjects: Accuracy; Large language models; Mathematical models; Network latency; Parameters
container_title arXiv.org
creator Liu, Zechun
Zhao, Changsheng
Iandola, Forrest
Lai, Chen
Tian, Yuandong
Fedorov, Igor
Xiong, Yunyang
Chang, Ernie
Shi, Yangyang
Krishnamoorthi, Raghuraman
Lai, Liangzhen
Chandra, Vikas
description This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to the prevailing belief that data and parameter quantity are the pivotal determinants of model quality, our investigation underscores the significance of model architecture for sub-billion-scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over the preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted MobileLLM-LS, demonstrate a further accuracy gain of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements over previous sub-billion models on chat benchmarks, and achieves correctness close to that of LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
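Two of the structural ideas in the abstract — embedding sharing (tying the input token table to the output head) and immediate block-wise weight sharing (executing each transformer block twice in a row, adding depth without adding parameters) — can be sketched in a few lines of PyTorch. This is a minimal illustrative toy, not the authors' implementation: the `TinyLM` and `Block` names, the dimensions, and the block internals are placeholders, and the grouped-query attention the paper actually uses is omitted for brevity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (MLP-only); the real model would also
    contain grouped-query attention inside each block."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, dim: int = 512,
                 n_unique_blocks: int = 15):
        super().__init__()
        # Embedding sharing: one table serves as both input embedding
        # and (transposed) output projection, saving vocab_size * dim params.
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(
            [Block(dim) for _ in range(n_unique_blocks)])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        for block in self.blocks:
            # Immediate block-wise weight sharing: the same block runs
            # twice back-to-back, so effective depth doubles while the
            # parameter count stays unchanged.
            x = block(block(x))
        return x @ self.embed.weight.T  # tied output head

model = TinyLM()
tokens = torch.randint(0, 32000, (1, 16))
logits = model(tokens)  # shape (1, 16, 32000)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M params, "
      f"effective depth {2 * len(model.blocks)}")
```

The appeal of executing a block twice immediately, as the abstract describes, is that its weights are already resident in fast memory for the second pass — which is why the paper can claim more depth at no size cost and only marginal latency overhead on memory-bound mobile hardware.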
format article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2931843429
source Publicly Available Content Database
subjects Accuracy
Large language models
Mathematical models
Network latency
Parameters
title MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-25T12%3A27%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=MobileLLM:%20Optimizing%20Sub-billion%20Parameter%20Language%20Models%20for%20On-Device%20Use%20Cases&rft.jtitle=arXiv.org&rft.au=Liu,%20Zechun&rft.date=2024-06-27&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2931843429%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29318434293%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2931843429&rft_id=info:pmid/&rfr_iscdi=true