
LUNA: A Model-Based Universal Analysis Framework for Large Language Models

Over the past decade, Artificial Intelligence (AI) has achieved great success and is now used in a wide range of academic and industrial fields. More recently, Large Language Models (LLMs) have made rapid advancements that have propelled AI to a new level, enabling even more diverse applications and industrial domains, particularly in areas like software engineering and natural language processing. Nevertheless, a number of emerging trustworthiness concerns exhibited by LLMs, e.g., robustness and hallucination, have recently received much attention; without properly addressing them, the widespread adoption of LLMs could be greatly hindered in practice. The distinctive characteristics of LLMs, such as the self-attention mechanism, the extremely large neural network scale, and autoregressive generation usage contexts, differ from classic AI software based on Convolutional Neural Networks and Recurrent Neural Networks and present new challenges for quality analysis. Despite urgent industrial demand across diverse domains, universal and systematic analysis techniques for LLMs are still lacking.

Towards bridging this gap, the paper initiates an early exploratory study and proposes LUNA, a universal analysis framework for LLMs designed to be general and extensible, enabling versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner. In particular, LUNA first leverages data from the desired trustworthiness perspective to construct an abstract model as an auxiliary analysis asset and proxy, supported by various built-in abstract model construction methods. To assess the quality of the abstract model, a number of evaluation metrics are collected and defined at both the abstract-model level and the semantics level. The semantics, i.e., the degree to which the LLM satisfies the trustworthiness perspective, is then bound to and enriches the abstract model, enabling more detailed analysis applications for diverse purposes, e.g., abnormal behavior detection.

A large-scale evaluation of LUNA demonstrates that 1) the abstract model has the potential to distinguish normal from abnormal behavior in LLMs, 2) LUNA is effective for real-world analysis of LLMs in practice, with hyperparameter settings influencing its performance, and 3) different evaluation metrics correlate differently with analysis performance. To encourage further studies on the quality assurance of LLMs, all code and more detailed experimental results are available on the supplementary website of the paper: https://sites.google.com/view/llm-luna.
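
The abstract describes LUNA's workflow at a high level: collect internal data of the LLM under a chosen trustworthiness perspective, abstract it into a compact model, bind semantics (the degree of satisfaction of that perspective) to the model, and use the enriched model for analyses such as abnormal behavior detection. The paper itself specifies the concrete construction methods and metrics; the snippet below is only a minimal, hypothetical sketch of such a pipeline, assuming k-means state abstraction over simulated hidden-state vectors, a transition-count abstract model, and a simple per-state semantics score. All function names, parameters, and data are invented for illustration and are not LUNA's actual API.

```python
# Hypothetical sketch of an abstract-model-based analysis pipeline in the
# spirit of the abstract above. Hidden states are simulated with random
# vectors instead of real LLM activations; names are illustrative only.
import numpy as np
from sklearn.cluster import KMeans


def build_abstract_model(traces, n_states=8, seed=0):
    """Cluster per-token hidden states into abstract states and count transitions."""
    all_vecs = np.vstack(traces)                      # (total_tokens, hidden_dim)
    km = KMeans(n_clusters=n_states, random_state=seed, n_init=10).fit(all_vecs)
    transitions = np.zeros((n_states, n_states))
    abstract_traces = []
    for trace in traces:
        states = km.predict(trace)                    # abstract state per token
        abstract_traces.append(states)
        for s, t in zip(states[:-1], states[1:]):
            transitions[s, t] += 1
    return km, transitions, abstract_traces


def bind_semantics(abstract_traces, labels, n_states):
    """Attach a semantics score to each abstract state: fraction of visits
    coming from traces labeled as satisfying the trustworthiness perspective."""
    good = np.zeros(n_states)
    total = np.zeros(n_states)
    for states, label in zip(abstract_traces, labels):
        for s in states:
            total[s] += 1
            good[s] += label                          # label: 1 = normal, 0 = abnormal
    return good / np.maximum(total, 1)


def score_trace(km, semantics, trace):
    """Score a new trace by the mean semantics of the abstract states it visits;
    a low score suggests abnormal (e.g., hallucination-prone) behavior."""
    states = km.predict(trace)
    return float(semantics[states].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 16
    # Simulated hidden-state traces: normal traces near one region, abnormal near another.
    normal = [rng.normal(0.0, 1.0, size=(20, hidden_dim)) for _ in range(30)]
    abnormal = [rng.normal(3.0, 1.0, size=(20, hidden_dim)) for _ in range(30)]
    traces = normal + abnormal
    labels = [1] * 30 + [0] * 30

    km, transitions, abstract_traces = build_abstract_model(traces, n_states=8)
    semantics = bind_semantics(abstract_traces, labels, n_states=8)

    print("normal trace score  :", score_trace(km, semantics, normal[0]))
    print("abnormal trace score:", score_trace(km, semantics, abnormal[0]))
```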

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2024-07, Vol. 50 (7), pp. 1921-1948
Main Authors: Song, Da; Xie, Xuan; Song, Jiayang; Zhu, Derui; Huang, Yuheng; Juefei-Xu, Felix; Ma, Lei
Format: Article
Language: English
Subjects: Analytical models; Artificial intelligence; Artificial neural networks; Codes; Data analysis; deep neural networks; Demand analysis; Hidden Markov models; Large language models; Measurement; model-based analysis; Natural language processing; Neural networks; Performance evaluation; Quality assurance; Recurrent neural networks; Semantics; Software; Software engineering; Task analysis; Transformers; Trustworthiness
Publisher: IEEE, New York
DOI: 10.1109/TSE.2024.3411928
ISSN: 0098-5589
EISSN: 1939-3520
Online Access: https://ieeexplore.ieee.org/document/10562221
Supplementary Website: https://sites.google.com/view/llm-luna