Scaling Laws for Precision
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
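The quantities in the abstract lend themselves to a small numerical sketch. The code below is an illustrative, hypothetical rendering of the idea only: it plugs an assumed precision-dependent "effective parameter count" and an assumed post-training-quantization penalty into a Chinchilla-style loss. The function names (`effective_params`, `ptq_penalty`, `predicted_loss`), the functional forms, and the constants `A`, `B`, `E`, `ALPHA`, `BETA`, `GAMMA_*`, and `C_PTQ` are placeholders chosen for illustration, not the fitted forms or values reported in the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's fitted law): a Chinchilla-style loss
#   L(N, D) = A / N**ALPHA + B / D**BETA + E
# where the raw parameter count N is replaced by an assumed precision-dependent
# "effective parameter count" N_eff, and an assumed post-training-quantization
# penalty grows with the data-to-parameter ratio and shrinks with inference
# precision. All constants below are placeholders.

A, B, E = 406.4, 410.7, 1.69   # placeholder Chinchilla-style constants
ALPHA, BETA = 0.34, 0.28
GAMMA_TRAIN = 2.0              # assumed decay scale for training precision (bits)
GAMMA_POST = 2.0               # assumed decay scale for inference precision (bits)
C_PTQ = 1e-3                   # assumed magnitude of the quantization penalty


def effective_params(n_params: float, train_bits: float) -> float:
    """Assumed form: fewer effective parameters at lower training precision,
    saturating at the full count N as precision grows."""
    return n_params * (1.0 - np.exp(-train_bits / GAMMA_TRAIN))


def ptq_penalty(n_params: float, n_tokens: float, post_bits: float) -> float:
    """Assumed form: degradation from post-training quantization grows with the
    data-to-parameter ratio D/N and decays exponentially in inference bits."""
    return C_PTQ * (n_tokens / n_params) * np.exp(-post_bits / GAMMA_POST)


def predicted_loss(n_params: float, n_tokens: float,
                   train_bits: float, post_bits: float) -> float:
    """Unified form: pretraining loss evaluated at N_eff plus the PTQ penalty."""
    n_eff = effective_params(n_params, train_bits)
    return A / n_eff**ALPHA + B / n_tokens**BETA + E + ptq_penalty(n_params, n_tokens, post_bits)


if __name__ == "__main__":
    # Example: a 1B-parameter model on 20B tokens, trained in 16 vs 4 bits,
    # then served with 8-bit vs 4-bit post-training quantization.
    for train_bits in (16, 4):
        for post_bits in (8, 4):
            loss = predicted_loss(1e9, 20e9, train_bits, post_bits)
            print(f"train={train_bits}b, ptq={post_bits}b -> loss {loss:.4f}")
```

With placeholder constants the sketch only reproduces the qualitative trends described above (lower training precision raises loss through a smaller N_eff; more tokens per parameter make post-training quantization costlier); reproducing the paper's quantitative predictions would require its fitted coefficients.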
Published in: | arXiv.org 2024-11 |
---|---|
Main Authors: | Kumar, Tanishq; Ankner, Zachary; Spector, Benjamin F; Bordelon, Blake; Muennighoff, Niklas; Paul, Mansheej; Pehlevan, Cengiz; Ré, Christopher; Raghunathan, Aditi |
Format: | Article |
Language: | English |
Subjects: | Degradation; Inference; Parameters; Scaling laws |
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Kumar, Tanishq; Ankner, Zachary; Spector, Benjamin F; Bordelon, Blake; Muennighoff, Niklas; Paul, Mansheej; Pehlevan, Cengiz; Ré, Christopher; Raghunathan, Aditi |
description | Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3126159854 |
source | Publicly Available Content Database |
subjects | Degradation Inference Parameters Scaling laws |
title | Scaling Laws for Precision |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T20%3A41%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Scaling%20Laws%20for%20Precision&rft.jtitle=arXiv.org&rft.au=Kumar,%20Tanishq&rft.date=2024-11-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3126159854%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31261598543%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3126159854&rft_id=info:pmid/&rfr_iscdi=true |