Memory Guided Transformer With Spatio-Semantic Visual Extractor for Medical Report Generation

Bibliographic Details
Published in:	IEEE journal of biomedical and health informatics 2024-05, Vol.28 (5), p.3079-3089
Main Authors:	Divya, Peketi; Sravani, Yenduri; Vishnu, Chalavadi; Mohan, C. Krishna; Chen, Yen Wei
Format:	Article
Language:	English
Subjects:	Algorithms; Backbone; Computer architecture; Decoding; Deformable network; Deformation; Extractors; Feature extraction; Formability; Humans; Information processing; Medical diagnostic imaging; Medical imaging; Medical report generation; Neural Networks, Computer; Radiology; Radiology Information Systems; Report writing; Semantic network; Semantics; Spatio-semantic visual extractor; Transformers; Visualization

Description
Summary:	Medical imaging-based report writing for effective diagnosis in radiology is time-consuming and can be error-prone for inexperienced radiologists. Automatic reporting helps radiologists avoid missed diagnoses and saves valuable time. Recently, transformer-based medical report generation has become prominent because its attention mechanism captures long-term dependencies in sequential data. Nevertheless, the input features obtained from the traditional visual extractor of conventional transformers do not capture the spatial and semantic information of an image. As a result, the transformer is unable to capture fine-grained details and may not produce detailed descriptive reports of radiology images. Therefore, we propose a spatio-semantic visual extractor (SSVE) to capture multi-scale spatial and semantic information from radiology images. Here, we incorporate two types of networks into the ResNet 101 backbone architecture: (i) a deformable network at the intermediate layer of ResNet 101 that uses deformable convolutions to obtain spatially invariant features, and (ii) a semantic network at the final layer of the backbone architecture that uses dilated convolutions to extract rich multi-scale semantic information. Further, these network representations are fused to encode fine-grained details of radiology images. Our proposed model outperforms existing works on two radiology report datasets, i.e., IU X-ray and MIMIC-CXR.
ISSN:	2168-2194, 2168-2208
DOI:	10.1109/JBHI.2024.3371894
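
The summary names two convolution variants, dilated and deformable, whose outputs are fused. As a rough illustration of what those operations do, the following is a minimal one-dimensional sketch in plain Python. It is not the authors' SSVE: the function names, the 1D setting, the hand-supplied offsets (a real deformable layer learns them), and fusion by per-position concatenation are all assumptions made for this example, since the record does not specify those details.

```python
import math
from typing import List, Sequence, Tuple


def dilated_conv1d(x: Sequence[float], w: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded, same-length 1D convolution whose taps are `dilation` apart.

    A larger dilation widens the receptive field without adding weights.
    """
    center = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(w):
            pos = i + (j - center) * dilation
            if 0 <= pos < len(x):
                acc += weight * x[pos]
        out.append(acc)
    return out


def sample_linear(x: Sequence[float], pos: float) -> float:
    """Linearly interpolate x at a fractional position; zero outside the signal."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i: int) -> float:
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d(
    x: Sequence[float],
    w: Sequence[float],
    offsets: Sequence[Sequence[float]],
) -> List[float]:
    """1D convolution where tap j at output i is shifted by offsets[i][j].

    Offsets are passed in by hand here (assumption); in a deformable
    convolution layer they would be predicted from the input.
    """
    center = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(w):
            pos = i + (j - center) + offsets[i][j]
            acc += weight * sample_linear(x, pos)
        out.append(acc)
    return out


def fuse(branches: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Concatenate branch outputs per position (one assumed fusion choice)."""
    return list(zip(*branches))


if __name__ == "__main__":
    signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    kernel = [1.0, 1.0, 1.0]
    shifted = [[0.5] * len(kernel) for _ in signal]  # hypothetical offsets
    spatial = deformable_conv1d(signal, kernel, shifted)
    semantic = dilated_conv1d(signal, kernel, dilation=2)
    print(fuse([spatial, semantic]))
```

With all offsets set to zero, `deformable_conv1d` reduces to an ordinary convolution, which is a convenient sanity check on the sampling logic.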