A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA is a task that aims to predict accurate and convincing answers based on given medical images and associated natural-language questions. This task requires extracting knowledge-rich medical feature content and developing a fine-grained understanding of it, so constructing an effective feature extraction and understanding scheme is key to modeling. Existing MVQA question-extraction schemes mainly focus on word information and ignore medical information in the text, such as medical concepts and domain-specific terms. Meanwhile, some visual and textual feature-understanding schemes cannot effectively capture the correlation between regions and keywords that reasonable visual reasoning requires. In this study, a dual-attention learning network with word and sentence embedding (DALNet-WSE) is proposed. We design a module, transformer with sentence embedding (TSE), to extract a double embedding representation of questions containing keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. With multiple DAL modules (DALs), learning visual and textual co-attention increases the granularity of understanding and improves visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that our proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, DALNet-WSE can extract rich textual information and has strong visual reasoning ability.
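The abstract's dual-attention learning (DAL) module pairs intramodal self-attention with guided attention and stacks several such modules to deepen visual-textual co-attention. As a rough illustration of that general pattern only, the sketch below assumes a PyTorch implementation built from standard multi-head attention; the class name, dimensions, head count, and stacking depth are hypothetical and are not taken from the authors' code.

```python
# Minimal sketch (PyTorch) of a stacked self-attention + guided-attention
# block of the kind the abstract describes. Names, dimensions, and the
# framework choice are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class DualAttentionBlock(nn.Module):
    """One co-attention layer: text self-attention, image self-attention,
    then image features guided (cross-attended) by the text features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # Intramodal interactions: each modality attends to itself.
        t, _ = self.text_self(text, text, text)
        text = self.norm_t(text + t)
        v, _ = self.img_self(image, image, image)
        image = self.norm_v(image + v)
        # Intermodal interaction: image regions attend to question tokens.
        g, _ = self.guided(image, text, text)
        image = self.norm_g(image + g)
        return text, image


# Stacking several blocks deepens the visual-textual co-attention.
blocks = nn.ModuleList(DualAttentionBlock() for _ in range(4))
question = torch.randn(2, 20, 512)  # (batch, question tokens, dim)
regions = torch.randn(2, 49, 512)   # (batch, image regions, dim)
for blk in blocks:
    question, regions = blk(question, regions)
```

Guided attention here lets image-region features attend to question tokens, which is one common way to model the region-keyword correlations the abstract emphasizes.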
Published in: | IEEE Transactions on Medical Imaging, 2024-02, Vol. 43 (2), p. 1-1 |
---|---|
Main Authors: | Huang, Xiaofei; Gong, Hongfang |
Format: | Article |
Language: | English |
Subjects: | Ablation; Attention; Cognition; Data mining; Diagnosis, Computer-Assisted; double embedding; Embedding; Feature extraction; guided attention; Information processing; Language; Learning; Medical diagnostic imaging; Medical imaging; medical information; Medical research; Medical visual question answering; Modules; Natural language processing; Question answering (information retrieval); Questions; Reasoning; Sentences; Task analysis; Visual discrimination learning; Visual perception; visual reasoning; Visualization; Words (language) |
DOI: | 10.1109/TMI.2023.3322868 |
ISSN: | 0278-0062 |
EISSN: | 1558-254X |
PMID: | 37812550 |
Publisher: | IEEE (United States) |
Source: | IEEE Electronic Library (IEL) Journals |