D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Bibliographic Details
Published in:	arXiv.org 2024-06
Main Authors:	Que, Haoran; Liu, Jiaheng; Zhang, Ge; Zhang, Chenchen; Qu, Xingwei; Ma, Yinghao; Duan, Feiyu; Bai, Zhiqi; Wang, Jiakai; Zhang, Yuanxing; Xu, Tan; Fu, Jie; Su, Wenbo; Wang, Jiamang; Qu, Lin; Zheng, Bo
Format:	Article
Language:	English
Subjects:	Large language models; Mixtures; Performance prediction; Scaling laws

container_title	arXiv.org
creator	Que, Haoran; Liu, Jiaheng; Zhang, Ge; Zhang, Chenchen; Qu, Xingwei; Ma, Yinghao; Duan, Feiyu; Bai, Zhiqi; Wang, Jiakai; Zhang, Yuanxing; Xu, Tan; Fu, Jie; Su, Wenbo; Wang, Jiamang; Qu, Lin; Zheng, Bo
description	Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
format	article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3064394653</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3064394653</sourcerecordid><originalsourceid>FETCH-proquest_journals_30643946533</originalsourceid><addsrcrecordid>eNqNi00KwjAYRIMgWLR3CLgOxHxt_dm2iguFggGXJdS0pMSkJg1e3wgewM28gTczQwkD2JBdxtgCpd4PlFJWbFmeQ4LuFSlrji_ifcCVfQpliB9lqzrV4tKaSZkgNK6dJNxFqUyPb63QX8YP7qyLdL2MafogYrnah9R-head0F6mPy7R-nTk5ZmMzr6C9FMz2OBMVA3QIoN9VuQA_60-RwA_zw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3064394653</pqid></control><display><type>article</type><title>D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models</title><source>Publicly Available Content Database</source><creator>Que, Haoran ; Liu, Jiaheng ; Zhang, Ge ; Zhang, Chenchen ; Qu, Xingwei ; Ma, Yinghao ; Duan, Feiyu ; Bai, Zhiqi ; Wang, Jiakai ; Zhang, Yuanxing ; Xu, Tan ; Fu, Jie ; Su, Wenbo ; Wang, Jiamang ; Qu, Lin ; Zheng, Bo</creator><creatorcontrib>Que, Haoran ; Liu, Jiaheng ; Zhang, Ge ; Zhang, Chenchen ; Qu, Xingwei ; Ma, Yinghao ; Duan, Feiyu ; Bai, Zhiqi ; Wang, Jiakai ; Zhang, Yuanxing ; Xu, Tan ; Fu, Jie ; Su, Wenbo ; Wang, Jiamang ; Qu, Lin ; Zheng, Bo</creatorcontrib><description>Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. 
To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Large language models ; Mixtures ; Performance prediction ; Scaling laws</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://creativecommons.org/publicdomain/zero/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/3064394653?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Que, Haoran</creatorcontrib><creatorcontrib>Liu, Jiaheng</creatorcontrib><creatorcontrib>Zhang, Ge</creatorcontrib><creatorcontrib>Zhang, Chenchen</creatorcontrib><creatorcontrib>Qu, Xingwei</creatorcontrib><creatorcontrib>Ma, Yinghao</creatorcontrib><creatorcontrib>Duan, Feiyu</creatorcontrib><creatorcontrib>Bai, Zhiqi</creatorcontrib><creatorcontrib>Wang, Jiakai</creatorcontrib><creatorcontrib>Zhang, Yuanxing</creatorcontrib><creatorcontrib>Xu, Tan</creatorcontrib><creatorcontrib>Fu, Jie</creatorcontrib><creatorcontrib>Su, Wenbo</creatorcontrib><creatorcontrib>Wang, Jiamang</creatorcontrib><creatorcontrib>Qu, Lin</creatorcontrib><creatorcontrib>Zheng, Bo</creatorcontrib><title>D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models</title><title>arXiv.org</title><description>Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. 
Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.</description><subject>Large language models</subject><subject>Mixtures</subject><subject>Performance prediction</subject><subject>Scaling laws</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNi00KwjAYRIMgWLR3CLgOxHxt_dm2iguFggGXJdS0pMSkJg1e3wgewM28gTczQwkD2JBdxtgCpd4PlFJWbFmeQ4LuFSlrji_ifcCVfQpliB9lqzrV4tKaSZkgNK6dJNxFqUyPb63QX8YP7qyLdL2MafogYrnah9R-head0F6mPy7R-nTk5ZmMzr6C9FMz2OBMVA3QIoN9VuQA_60-RwA_zw</recordid><startdate>20240603</startdate><enddate>20240603</enddate><creator>Que, Haoran</creator><creator>Liu, Jiaheng</creator><creator>Zhang, Ge</creator><creator>Zhang, Chenchen</creator><creator>Qu, Xingwei</creator><creator>Ma, Yinghao</creator><creator>Duan, Feiyu</creator><creator>Bai, Zhiqi</creator><creator>Wang, Jiakai</creator><creator>Zhang, Yuanxing</creator><creator>Xu, Tan</creator><creator>Fu, 
Jie</creator><creator>Su, Wenbo</creator><creator>Wang, Jiamang</creator><creator>Qu, Lin</creator><creator>Zheng, Bo</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240603</creationdate><title>D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models</title><author>Que, Haoran ; Liu, Jiaheng ; Zhang, Ge ; Zhang, Chenchen ; Qu, Xingwei ; Ma, Yinghao ; Duan, Feiyu ; Bai, Zhiqi ; Wang, Jiakai ; Zhang, Yuanxing ; Xu, Tan ; Fu, Jie ; Su, Wenbo ; Wang, Jiamang ; Qu, Lin ; Zheng, Bo</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30643946533</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Large language models</topic><topic>Mixtures</topic><topic>Performance prediction</topic><topic>Scaling laws</topic><toplevel>online_resources</toplevel><creatorcontrib>Que, Haoran</creatorcontrib><creatorcontrib>Liu, Jiaheng</creatorcontrib><creatorcontrib>Zhang, Ge</creatorcontrib><creatorcontrib>Zhang, Chenchen</creatorcontrib><creatorcontrib>Qu, Xingwei</creatorcontrib><creatorcontrib>Ma, Yinghao</creatorcontrib><creatorcontrib>Duan, Feiyu</creatorcontrib><creatorcontrib>Bai, Zhiqi</creatorcontrib><creatorcontrib>Wang, Jiakai</creatorcontrib><creatorcontrib>Zhang, Yuanxing</creatorcontrib><creatorcontrib>Xu, Tan</creatorcontrib><creatorcontrib>Fu, Jie</creatorcontrib><creatorcontrib>Su, Wenbo</creatorcontrib><creatorcontrib>Wang, Jiamang</creatorcontrib><creatorcontrib>Qu, Lin</creatorcontrib><creatorcontrib>Zheng, 
Bo</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Que, Haoran</au><au>Liu, Jiaheng</au><au>Zhang, Ge</au><au>Zhang, Chenchen</au><au>Qu, Xingwei</au><au>Ma, Yinghao</au><au>Duan, Feiyu</au><au>Bai, Zhiqi</au><au>Wang, Jiakai</au><au>Zhang, Yuanxing</au><au>Xu, Tan</au><au>Fu, Jie</au><au>Su, Wenbo</au><au>Wang, Jiamang</au><au>Qu, Lin</au><au>Zheng, Bo</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models</atitle><jtitle>arXiv.org</jtitle><date>2024-06-03</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). 
For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-06
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3064394653
source	Publicly Available Content Database
subjects	Large language models; Mixtures; Performance prediction; Scaling laws
title	D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T11%3A30%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=D-CPT%20Law:%20Domain-specific%20Continual%20Pre-Training%20Scaling%20Law%20for%20Large%20Language%20Models&rft.jtitle=arXiv.org&rft.au=Que,%20Haoran&rft.date=2024-06-03&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3064394653%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_30643946533%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3064394653&rft_id=info:pmid/&rfr_iscdi=true