
Automated Testing Linguistic Capabilities of NLP Models

Bibliographic Details
Published in: ACM Transactions on Software Engineering and Methodology, 2024-09, Vol. 33 (7), p. 1-33, Article 176
Main Authors: Lee, Jaeseong; Chen, Simin; Mordahl, Austin; Liu, Cong; Yang, Wei; Wei, Shiyi
Format: Article
Language:English
Subjects: Computing methodologies; Natural language processing; Software and its engineering; Software verification and validation
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3672455
Publisher: ACM, New York, NY

Description:
Natural language processing (NLP) has gained widespread adoption in the development of real-world applications. However, the black-box nature of neural networks in NLP applications poses a challenge when evaluating their performance, let alone ensuring it. Recent research has proposed testing techniques to enhance the trustworthiness of NLP-based applications. However, most existing works rely on a single, aggregated metric (i.e., accuracy), which makes it difficult for users to assess NLP model performance on fine-grained aspects such as linguistic capabilities (LCs). To address this limitation, we present ALiCT, an automated testing technique for validating NLP applications based on their LCs. ALiCT takes user-specified LCs as inputs and produces a diverse test suite with test oracles for each given LC. We evaluate ALiCT on two widely adopted NLP tasks, sentiment analysis and hate speech detection, in terms of diversity, effectiveness, and consistency. Using Self-BLEU and syntactic diversity metrics, our findings reveal that ALiCT generates test cases that are 190% and 2213% more diverse in semantics and syntax, respectively, compared to those produced by state-of-the-art techniques. In addition, ALiCT produces a larger number of NLP model failures in 22 out of 25 LCs across the two NLP applications.
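
The abstract measures semantic diversity with Self-BLEU: every generated test case is scored with BLEU against all of the other generated cases, and the scores are averaged, so a lower value indicates a more diverse test suite. The sketch below is an illustrative computation under that definition, not code from the ALiCT implementation; it assumes NLTK is installed, and the function name and sample sentences are hypothetical.

# Minimal sketch of the Self-BLEU diversity metric mentioned in the abstract:
# each generated test case is scored with BLEU against all remaining cases,
# and the scores are averaged; lower Self-BLEU indicates a more diverse suite.
# Assumes NLTK is installed; names and sample sentences are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def self_bleu(sentences: list[str]) -> float:
    """Average BLEU of every sentence scored against all other sentences."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    smooth = SmoothingFunction().method1
    tokens = [s.split() for s in sentences]
    scores = []
    for i, hypothesis in enumerate(tokens):
        # All other sentences serve as the reference set for this hypothesis.
        references = tokens[:i] + tokens[i + 1:]
        scores.append(sentence_bleu(references, hypothesis, smoothing_function=smooth))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical sentiment-analysis test cases; a diverse suite scores low.
    suite = [
        "the movie was surprisingly good",
        "i did not enjoy the film at all",
        "an uneven but occasionally moving drama",
    ]
    print(f"Self-BLEU: {self_bleu(suite):.3f}")

Smoothing is applied because short test sentences frequently share no higher-order n-grams, which would otherwise drive individual BLEU scores to zero.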