Natural Language Generation Model for Mammography Reports Simulation

Extending the size of labeled corpora of medical reports is a major step towards a successful training of machine learning algorithms. Simulating new text reports is a key solution for reports augmentation, which extends the cohort size. However, text generation in the medical domain is challenging...

Bibliographic Details
Published in: IEEE journal of biomedical and health informatics 2020-09, Vol.24 (9), p.2711-2717
Main Authors: Hoogi, Assaf, Mishra, Arjun, Gimenez, Francisco, Dong, Jeffrey, Rubin, Daniel
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3
cites cdi_FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3
container_end_page 2717
container_issue 9
container_start_page 2711
container_title IEEE journal of biomedical and health informatics
container_volume 24
creator Hoogi, Assaf
Mishra, Arjun
Gimenez, Francisco
Dong, Jeffrey
Rubin, Daniel
description Extending the size of labeled corpora of medical reports is a major step towards a successful training of machine learning algorithms. Simulating new text reports is a key solution for reports augmentation, which extends the cohort size. However, text generation in the medical domain is challenging because it needs to preserve both content and style that are typical for real reports, without risking the patients' privacy. In this paper, we present a conditioned LSTM-RNN architecture for simulating realistic mammography reports. We evaluated the performance by analyzing the characteristics of the simulated reports and classifying them into benign and malignant classes. An average classification AUC was calculated over two distinct test sets. A qualitative analysis was also performed in which a masked radiologist classified 0.75 of the simulated reports as real reports, showing that both the style and content of the simulated reports were similar to real reports. Finally, we compared our RNN-LSTM generative model with Markov Random Fields. The RNN-LSTM provided significantly better and more stable performance than MRFs (p < 0.01, Wilcoxon).
doi_str_mv 10.1109/JBHI.2020.2980118
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1109_JBHI_2020_2980118</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9072639</ieee_id><sourcerecordid>2441009814</sourcerecordid><originalsourceid>FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3</originalsourceid><addsrcrecordid>eNpdkEtPwzAMgCMEYtPYD0BIqBIXLh1x0kdyhAHb0AYSj3OUtu7o1DYjaQ_793TsccAXW_Zny_oIuQQ6AqDy7uVhOhsxyuiISUEBxAnpM4iEzxgVp4caZNAjQ-dWtAvRtWR0TnqccRaEcdwnj6-6aa0uvbmul61eojfBGq1uClN7C5Nh6eXGegtdVWZp9fp7473j2tjGeR9F1ZZ_4AU5y3XpcLjPA_L1_PQ5nvrzt8lsfD_3Ux7Ixk8yiCQLNZeQhizCUCRhnIHmKc0hyxKuAfIkwUxgREUeUez-zUWMgBkPRM4H5HZ3d23NT4uuUVXhUixLXaNpnWJcBkJIHvIOvfmHrkxr6-47xYIAKJUCgo6CHZVa45zFXK1tUWm7UUDV1rLaWlZby2pvudu53l9ukwqz48bBaQdc7YACEY9jSWMWccl_ARgGf4Q</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2441009814</pqid></control><display><type>article</type><title>Natural Language Generation Model for Mammography Reports Simulation</title><source>IEEE Electronic Library (IEL) Journals</source><creator>Hoogi, Assaf ; Mishra, Arjun ; Gimenez, Francisco ; Dong, Jeffrey ; Rubin, Daniel</creator><creatorcontrib>Hoogi, Assaf ; Mishra, Arjun ; Gimenez, Francisco ; Dong, Jeffrey ; Rubin, Daniel</creatorcontrib><description>Extending the size of labeled corpora of medical reports is a major step towards a successful training of machine learning algorithms. Simulating new text reports is a key solution for reports augmentation, which extends the cohort size. However, text generation in the medical domain is challenging because it needs to preserve both content and style that are typical for real reports, without risking the patients' privacy. In this paper, we present a conditioned LSTM-RNN architecture for simulating realistic mammography reports. 
We evaluated the performance by analyzing the characteristics of the simulated reports and classifying them into benign and malignant classes. An average classification AUC was calculated over two distinct test sets. A qualitative analysis was also performed in which a masked radiologist classified 0.75 of the simulated reports as real reports, showing that both the style and content of the simulated reports were similar to real reports. Finally, we compared our RNN-LSTM generative model with Markov Random Fields. The RNN-LSTM provided significantly better and more stable performance than MRFs (&lt;inline-formula&gt;&lt;tex-math notation="LaTeX"&gt;p&lt; 0.01&lt;/tex-math&gt;&lt;/inline-formula&gt;, Wilcoxon).</description><identifier>ISSN: 2168-2194</identifier><identifier>EISSN: 2168-2208</identifier><identifier>DOI: 10.1109/JBHI.2020.2980118</identifier><identifier>PMID: 32324577</identifier><identifier>CODEN: IJBHA9</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Algorithms ; Cancer ; Classification ; Computer simulation ; Fields (mathematics) ; Informatics ; Learning algorithms ; Machine learning ; mammo-graphy reports ; Mammography ; Medical diagnosis ; Medical diagnostic imaging ; Natural language generation ; Natural languages ; Performance evaluation ; Qualitative analysis ; RNN-LSTM ; simulation ; Test sets ; Training</subject><ispartof>IEEE journal of biomedical and health informatics, 2020-09, Vol.24 (9), p.2711-2717</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3</citedby><cites>FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3</cites><orcidid>0000-0001-5057-4369 ; 0000-0002-4542-6254</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9072639$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,27901,27902,54771</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32324577$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Hoogi, Assaf</creatorcontrib><creatorcontrib>Mishra, Arjun</creatorcontrib><creatorcontrib>Gimenez, Francisco</creatorcontrib><creatorcontrib>Dong, Jeffrey</creatorcontrib><creatorcontrib>Rubin, Daniel</creatorcontrib><title>Natural Language Generation Model for Mammography Reports Simulation</title><title>IEEE journal of biomedical and health informatics</title><addtitle>JBHI</addtitle><addtitle>IEEE J Biomed Health Inform</addtitle><description>Extending the size of labeled corpora of medical reports is a major step towards a successful training of machine learning algorithms. Simulating new text reports is a key solution for reports augmentation, which extends the cohort size. However, text generation in the medical domain is challenging because it needs to preserve both content and style that are typical for real reports, without risking the patients' privacy. In this paper, we present a conditioned LSTM-RNN architecture for simulating realistic mammography reports. We evaluated the performance by analyzing the characteristics of the simulated reports and classifying them into benign and malignant classes. 
An average classification AUC was calculated over two distinct test sets. A qualitative analysis was also performed in which a masked radiologist classified 0.75 of the simulated reports as real reports, showing that both the style and content of the simulated reports were similar to real reports. Finally, we compared our RNN-LSTM generative model with Markov Random Fields. The RNN-LSTM provided significantly better and more stable performance than MRFs (&lt;inline-formula&gt;&lt;tex-math notation="LaTeX"&gt;p&lt; 0.01&lt;/tex-math&gt;&lt;/inline-formula&gt;, Wilcoxon).</description><subject>Algorithms</subject><subject>Cancer</subject><subject>Classification</subject><subject>Computer simulation</subject><subject>Fields (mathematics)</subject><subject>Informatics</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>mammo-graphy reports</subject><subject>Mammography</subject><subject>Medical diagnosis</subject><subject>Medical diagnostic imaging</subject><subject>Natural language generation</subject><subject>Natural languages</subject><subject>Performance evaluation</subject><subject>Qualitative analysis</subject><subject>RNN-LSTM</subject><subject>simulation</subject><subject>Test sets</subject><subject>Training</subject><issn>2168-2194</issn><issn>2168-2208</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNpdkEtPwzAMgCMEYtPYD0BIqBIXLh1x0kdyhAHb0AYSj3OUtu7o1DYjaQ_793TsccAXW_Zny_oIuQQ6AqDy7uVhOhsxyuiISUEBxAnpM4iEzxgVp4caZNAjQ-dWtAvRtWR0TnqccRaEcdwnj6-6aa0uvbmul61eojfBGq1uClN7C5Nh6eXGegtdVWZp9fp7473j2tjGeR9F1ZZ_4AU5y3XpcLjPA_L1_PQ5nvrzt8lsfD_3Ux7Ixk8yiCQLNZeQhizCUCRhnIHmKc0hyxKuAfIkwUxgREUeUez-zUWMgBkPRM4H5HZ3d23NT4uuUVXhUixLXaNpnWJcBkJIHvIOvfmHrkxr6-47xYIAKJUCgo6CHZVa45zFXK1tUWm7UUDV1rLaWlZby2pvudu53l9ukwqz48bBaQdc7YACEY9jSWMWccl_ARgGf4Q</recordid><startdate>20200901</startdate><enddate>20200901</enddate><creator>Hoogi, 
Assaf</creator><creator>Mishra, Arjun</creator><creator>Gimenez, Francisco</creator><creator>Dong, Jeffrey</creator><creator>Rubin, Daniel</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>K9.</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>NAPCQ</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-5057-4369</orcidid><orcidid>https://orcid.org/0000-0002-4542-6254</orcidid></search><sort><creationdate>20200901</creationdate><title>Natural Language Generation Model for Mammography Reports Simulation</title><author>Hoogi, Assaf ; Mishra, Arjun ; Gimenez, Francisco ; Dong, Jeffrey ; Rubin, Daniel</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Cancer</topic><topic>Classification</topic><topic>Computer simulation</topic><topic>Fields (mathematics)</topic><topic>Informatics</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>mammo-graphy reports</topic><topic>Mammography</topic><topic>Medical diagnosis</topic><topic>Medical diagnostic imaging</topic><topic>Natural language generation</topic><topic>Natural languages</topic><topic>Performance evaluation</topic><topic>Qualitative 
analysis</topic><topic>RNN-LSTM</topic><topic>simulation</topic><topic>Test sets</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Hoogi, Assaf</creatorcontrib><creatorcontrib>Mishra, Arjun</creatorcontrib><creatorcontrib>Gimenez, Francisco</creatorcontrib><creatorcontrib>Dong, Jeffrey</creatorcontrib><creatorcontrib>Rubin, Daniel</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts 
Professional</collection><collection>Nursing &amp; Allied Health Premium</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>IEEE journal of biomedical and health informatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Hoogi, Assaf</au><au>Mishra, Arjun</au><au>Gimenez, Francisco</au><au>Dong, Jeffrey</au><au>Rubin, Daniel</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Natural Language Generation Model for Mammography Reports Simulation</atitle><jtitle>IEEE journal of biomedical and health informatics</jtitle><stitle>JBHI</stitle><addtitle>IEEE J Biomed Health Inform</addtitle><date>2020-09-01</date><risdate>2020</risdate><volume>24</volume><issue>9</issue><spage>2711</spage><epage>2717</epage><pages>2711-2717</pages><issn>2168-2194</issn><eissn>2168-2208</eissn><coden>IJBHA9</coden><abstract>Extending the size of labeled corpora of medical reports is a major step towards a successful training of machine learning algorithms. Simulating new text reports is a key solution for reports augmentation, which extends the cohort size. However, text generation in the medical domain is challenging because it needs to preserve both content and style that are typical for real reports, without risking the patients' privacy. In this paper, we present a conditioned LSTM-RNN architecture for simulating realistic mammography reports. We evaluated the performance by analyzing the characteristics of the simulated reports and classifying them into benign and malignant classes. An average classification AUC was calculated over two distinct test sets. A qualitative analysis was also performed in which a masked radiologist classified 0.75 of the simulated reports as real reports, showing that both the style and content of the simulated reports were similar to real reports. 
Finally, we compared our RNN-LSTM generative model with Markov Random Fields. The RNN-LSTM provided significantly better and more stable performance than MRFs (&lt;inline-formula&gt;&lt;tex-math notation="LaTeX"&gt;p&lt; 0.01&lt;/tex-math&gt;&lt;/inline-formula&gt;, Wilcoxon).</abstract><cop>United States</cop><pub>IEEE</pub><pmid>32324577</pmid><doi>10.1109/JBHI.2020.2980118</doi><tpages>7</tpages><orcidid>https://orcid.org/0000-0001-5057-4369</orcidid><orcidid>https://orcid.org/0000-0002-4542-6254</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 2168-2194
ispartof IEEE journal of biomedical and health informatics, 2020-09, Vol.24 (9), p.2711-2717
issn 2168-2194
2168-2208
language eng
recordid cdi_crossref_primary_10_1109_JBHI_2020_2980118
source IEEE Electronic Library (IEL) Journals
subjects Algorithms
Cancer
Classification
Computer simulation
Fields (mathematics)
Informatics
Learning algorithms
Machine learning
mammography reports
Mammography
Medical diagnosis
Medical diagnostic imaging
Natural language generation
Natural languages
Performance evaluation
Qualitative analysis
RNN-LSTM
simulation
Test sets
Training
title Natural Language Generation Model for Mammography Reports Simulation
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T19%3A51%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Natural%20Language%20Generation%20Model%20for%20Mammography%20Reports%20Simulation&rft.jtitle=IEEE%20journal%20of%20biomedical%20and%20health%20informatics&rft.au=Hoogi,%20Assaf&rft.date=2020-09-01&rft.volume=24&rft.issue=9&rft.spage=2711&rft.epage=2717&rft.pages=2711-2717&rft.issn=2168-2194&rft.eissn=2168-2208&rft.coden=IJBHA9&rft_id=info:doi/10.1109/JBHI.2020.2980118&rft_dat=%3Cproquest_cross%3E2441009814%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c349t-bd16925a391c526e58b57d1a3c0f1ddb3a11fbbed8e608f60e168f87e1ed348f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2441009814&rft_id=info:pmid/32324577&rft_ieee_id=9072639&rfr_iscdi=true