
LUNA: A Model-Based Universal Analysis Framework for Large Language Models

Over the past decade, Artificial Intelligence (AI) has achieved great success and is now used in a wide range of academic and industrial fields. More recently, Large Language Models (LLMs) have made rapid advancements that have propelled AI to a new level, enabling even more diverse applications and industrial domains, particularly in areas like software engineering and natural language processing. Nevertheless, a number of emerging trustworthiness concerns exhibited by LLMs, e.g., robustness and hallucination, have recently received much attention; without properly addressing them, the widespread adoption of LLMs could be greatly hindered in practice. The distinctive characteristics of LLMs, such as the self-attention mechanism, the extremely large neural network scale, and autoregressive generation usage contexts, differ from classic AI software based on Convolutional Neural Networks and Recurrent Neural Networks and present new challenges for quality analysis. Despite urgent industrial demand across diverse domains, universal and systematic analysis techniques for LLMs are still lacking.

Towards bridging this gap, the paper initiates an early exploratory study and proposes LUNA, a universal analysis framework for LLMs designed to be general and extensible, enabling versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner. In particular, LUNA first leverages data from the desired trustworthiness perspective to construct an abstract model as an auxiliary analysis asset and proxy, supported by various built-in abstract model construction methods. To assess the quality of the abstract model, a number of evaluation metrics are collected and defined at both the abstract-model level and the semantics level. The semantics, i.e., the degree to which the LLM satisfies the trustworthiness perspective, is then bound to and enriches the abstract model, enabling more detailed analysis applications for diverse purposes, e.g., abnormal behavior detection.

A large-scale evaluation of LUNA demonstrates that 1) the abstract model has the potential to distinguish normal from abnormal behavior in LLMs, 2) LUNA is effective for real-world analysis of LLMs in practice, with hyperparameter settings influencing its performance, and 3) different evaluation metrics correlate differently with analysis performance. To encourage further studies on the quality assurance of LLMs, all code and more detailed experimental results are available on the supplementary website of the paper: https://sites.google.com/view/llm-luna.
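
The abstract describes LUNA's workflow at a high level: collect internal data of the LLM under a chosen trustworthiness perspective, abstract it into a compact model, bind semantics (the degree of satisfaction of that perspective) to the model, and use the enriched model for analyses such as abnormal behavior detection. The paper itself specifies the concrete construction methods and metrics; the snippet below is only a minimal, hypothetical sketch of such a pipeline, assuming k-means state abstraction over simulated hidden-state vectors, a transition-count abstract model, and a simple per-state semantics score. All function names, parameters, and data are invented for illustration and are not LUNA's actual API.

```python
# Hypothetical sketch of an abstract-model-based analysis pipeline in the
# spirit of the abstract above. Hidden states are simulated with random
# vectors instead of real LLM activations; names are illustrative only.
import numpy as np
from sklearn.cluster import KMeans


def build_abstract_model(traces, n_states=8, seed=0):
    """Cluster per-token hidden states into abstract states and count transitions."""
    all_vecs = np.vstack(traces)                      # (total_tokens, hidden_dim)
    km = KMeans(n_clusters=n_states, random_state=seed, n_init=10).fit(all_vecs)
    transitions = np.zeros((n_states, n_states))
    abstract_traces = []
    for trace in traces:
        states = km.predict(trace)                    # abstract state per token
        abstract_traces.append(states)
        for s, t in zip(states[:-1], states[1:]):
            transitions[s, t] += 1
    return km, transitions, abstract_traces


def bind_semantics(abstract_traces, labels, n_states):
    """Attach a semantics score to each abstract state: fraction of visits
    coming from traces labeled as satisfying the trustworthiness perspective."""
    good = np.zeros(n_states)
    total = np.zeros(n_states)
    for states, label in zip(abstract_traces, labels):
        for s in states:
            total[s] += 1
            good[s] += label                          # label: 1 = normal, 0 = abnormal
    return good / np.maximum(total, 1)


def score_trace(km, semantics, trace):
    """Score a new trace by the mean semantics of the abstract states it visits;
    a low score suggests abnormal (e.g., hallucination-prone) behavior."""
    states = km.predict(trace)
    return float(semantics[states].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 16
    # Simulated hidden-state traces: normal traces near one region, abnormal near another.
    normal = [rng.normal(0.0, 1.0, size=(20, hidden_dim)) for _ in range(30)]
    abnormal = [rng.normal(3.0, 1.0, size=(20, hidden_dim)) for _ in range(30)]
    traces = normal + abnormal
    labels = [1] * 30 + [0] * 30

    km, transitions, abstract_traces = build_abstract_model(traces, n_states=8)
    semantics = bind_semantics(abstract_traces, labels, n_states=8)

    print("normal trace score  :", score_trace(km, semantics, normal[0]))
    print("abnormal trace score:", score_trace(km, semantics, abnormal[0]))
```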

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2024-07, Vol. 50 (7), pp. 1921-1948
Main Authors: Song, Da; Xie, Xuan; Song, Jiayang; Zhu, Derui; Huang, Yuheng; Juefei-Xu, Felix; Ma, Lei
Format: Article
Language: English
Subjects: Analytical models; Artificial intelligence; Artificial neural networks; Codes; Data analysis; deep neural networks; Demand analysis; Hidden Markov models; Large language models; Measurement; model-based analysis; Natural language processing; Neural networks; Performance evaluation; Quality assurance; Recurrent neural networks; Semantics; Software; Software engineering; Task analysis; Transformers; Trustworthiness
Publisher: IEEE, New York
DOI: 10.1109/TSE.2024.3411928
ISSN: 0098-5589
EISSN: 1939-3520
Online Access: https://ieeexplore.ieee.org/document/10562221
Supplementary Website: https://sites.google.com/view/llm-luna