Classification performance bias between training and test sets in a limited mammography dataset
Published in: | PloS one, 2024-02, Vol.19 (2), p.e0282402 |
---|---|
Main Authors: | Hou, Rui; Lo, Joseph Y; Marks, Jeffrey R; Hwang, E Shelley; Grimm, Lars J |
Contributor: | Rosen-Zvi, Michal |
Format: | Article |
Language: | English |
Subjects: | Analysis; Biology and Life Sciences; Carcinoma, Ductal; Care and treatment; Computer and Information Sciences; Diagnosis; Machine learning; Mammography; Medical imaging equipment; Medicine and Health Sciences; Methods; Research and Analysis Methods |
Identifiers: | DOI: 10.1371/journal.pone.0282402; ISSN: 1932-6203; PMID: 38324545 |
Online Access: | https://doi.org/10.1371/journal.pone.0282402 (open access) |
Abstract:

To assess the performance bias caused by sampling data into training and test sets in a mammography radiomics study.

Mammograms from 700 women were used to study upstaging of ductal carcinoma in situ. The dataset was repeatedly shuffled and split into training (n = 400) and test (n = 300) cases forty times. For each split, cross-validation was used for training, followed by an assessment on the held-out test set. Logistic regression with regularization and a support vector machine were used as the machine learning classifiers. For each split and classifier type, multiple models were created based on radiomics and/or clinical features.
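The repeated shuffle-split protocol described above can be sketched as follows. This is a minimal illustration assuming scikit-learn, not the authors' code: the feature matrix is synthetic placeholder data (the radiomics and clinical features are not part of this record), and only the regularized logistic regression arm is shown.

```python
# Minimal sketch of the repeated shuffle-split protocol (assumes scikit-learn).
# X and y are synthetic placeholders for the radiomics/clinical features and
# upstaging labels, which this record does not include.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data with the study's overall size (700 cases).
X, y = make_classification(n_samples=700, n_features=30, random_state=0)

train_aucs, test_aucs = [], []
for seed in range(40):  # forty random splits, as in the study
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=400, test_size=300, stratify=y, random_state=seed
    )
    # Regularized logistic regression tuned by cross-validation on the training set.
    model = GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000)),
        param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=5,
    ).fit(X_tr, y_tr)
    train_aucs.append(roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))
    test_aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print(f"train AUC range: {min(train_aucs):.2f}-{max(train_aucs):.2f}")
print(f"test AUC range:  {min(test_aucs):.2f}-{max(test_aucs):.2f}")
```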
Area under the curve (AUC) performances varied considerably across the different data splits (e.g., radiomics regression model: train 0.58-0.70, test 0.59-0.73). Performances for regression models showed a tradeoff where better training led to worse testing and vice versa. Cross-validation over all cases reduced this variability, but required samples of 500+ cases to yield representative estimates of performance.
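The pooled alternative reported in these results, cross-validation over all cases, might look like the following sketch (same assumptions as above: scikit-learn and placeholder data).

```python
# Sketch of cross-validation over all cases (assumes scikit-learn; placeholder data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=700, n_features=30, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))

# Pooling all 700 cases into one cross-validated estimate avoids committing to a
# single train/test split; per the study, this reduces split-to-split variability
# but still requires roughly 500+ cases for representative performance estimates.
scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=10)
print(f"cross-validated AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```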
In medical imaging, clinical datasets are often relatively small. Models built from different training sets may not be representative of the whole dataset. Depending on the selected data split and model, performance bias could lead to inappropriate conclusions that might influence the clinical significance of the findings.
Performance bias can result from model testing when using limited datasets. Optimal strategies for test set selection should be developed to ensure study conclusions are appropriate.