Auxiliary signal-guided knowledge encoder-decoder for medical report generation (pdf)

World Wide Web
https://doi.org/10.1007/s11280-022-01013-6

Mingjie Li1 · Rui Liu2 · Fuyu Wang3 · Xiaojun Chang1 · Xiaodan Liang4

Received: 18 July 2021 / Revised: 17 December 2021 / Accepted: 17 January 2022
© The Author(s) 2022

Abstract

Medical reports have significant clinical value to radiologists and specialists, especially during a pandemic like COVID-19. However, beyond the common difficulties faced in natural image captioning, medical report generation specifically requires the model to describe a medical image with a fine-grained and semantically coherent paragraph that satisfies both medical common sense and logic. Previous works generally extract global image features and attempt to generate a paragraph that is similar to the reference reports; however, this approach has two limitations. Firstly, the regions of primary interest to radiologists are usually located in a small area of the global image, meaning that the remaining parts of the image could be considered irrelevant noise in the training procedure. Secondly, many similar sentences are used in each medical report to describe the normal regions of the image, which causes serious data bias. This deviation is likely to teach models to generate these inessential sentences on a regular basis. To address these problems, we propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists’ working patterns. Specifically, auxiliary patches are explored to expand the widely used visual patch features before they are fed to the Transformer encoder, while external linguistic signals help the decoder better master prior knowledge during the pre-training process.
Our approach performs well on common benchmarks, including CX-CHR, IU X-Ray, and the COVID-19 CT Report dataset (COV-CTR), demonstrating that combining auxiliary signals with a Transformer architecture can bring a significant improvement in medical report generation. The experimental results confirm that auxiliary-signal-driven Transformer-based models can outperform previous approaches on both medical terminology classification and paragraph generation metrics.

Keywords Medical report generation · Auxiliary signals · Transformer · Generative pretraining

Guest Editors: Jianxin Li, Chengfei Liu, Ziyu Guan, and Yinghui Wu

* Xiaojun Chang
Extended author information available on the last page of the article

1 Introduction

When you have a medical image taken at a hospital, you receive a medical report. This report describes both normal and abnormal terminologies, and can assist radiologists and specialists in diagnosing and reviewing. However, writing medical reports is error-prone and time-consuming, especially during a pandemic like COVID-19, because radiologists may have to diagnose hundreds of images per day. Therefore, the topic of automatically generating medical reports has attracted research attention from both the artificial intelligence and clinical medicine fields.

The task most similar to medical report generation in the computer vision field is image captioning. Beyond the common difficulties in natural image captioning, there are three more bottlenecks for medical report generation. Firstly, the number of image-report pairs in existing datasets is small compared to captioning datasets and is insufficient for learning visual representations. Secondly, it is hard to acquire from medical images the object features that are widely used in natural image captioning tasks [1].
Only a few medical images can provide well-annotated segmentation or location information for lesions. Thirdly, severe data deviation exists in these datasets. Some diseases are rare in nature, and their positive samples are hard to collect. Moreover, many similar sentences are used in each report to describe routine observations, which leads to overfitting and limits the generalization of neural approaches [18, 21, 33, 34].

Recently, many approaches have been designed to address these problems and have achieved promising performance in automatically generating medical reports [3, 12, 17, 21]. For example, Xue et al. [40] encode multiple image modalities to generate multiple sentences. Li et al. [21] manually designed several templates, and Zhang et al. [45] modeled the relationships among visual contents by incorporating a graph module to generate fine-grained reports. With the success of the Transformer [36] in image captioning tasks, Chen et al. [3] were the first to propose a memory-driven Transformer that updates its memory during the generation process. Although it achieves promising performance, R2Gen [3] focuses on designing extra modules and does not activate the characteristic learning ability of the Transformer. More generally, existing approaches have not fully activated the potential of neural models, especially the Transformer.

Inspired by radiologists’ working patterns, in this paper we explore the power of auxiliary signals to facilitate generating medical reports. Generally, when a radiologist describes a medical image, he or she will carefully inspect the suspicious regions after quickly browsing the global image. Then, he or she will write a report that draws on knowledge learned from the external medical domain and on working experience. As shown in Fig. 1, the suspicious region takes up only a tiny portion of the global image but has been treated equally to other regions in previous works.
Therefore, other regions could be considered irrelevant noise that distracts the model. Although these regions may get more attention based on the self-attention mechanism in the Transformer, Dosovitskiy et al. [6] pointed out that a Transformer can learn a better visual representation when fed with original image patches instead of encoded visual features. Using large extra corpora to pre-train the Transformer is an effective way to alleviate the corpus deviation in the training datasets [5, 31]. However, there is a considerable textual semantic gap between the medical and common domains.

Fig. 1 Two samples from CX-CHR and our COV-CTR datasets. Red bounding boxes annotated by a radiologist indicate the regions that he pays more attention to when describing the image. The red text describes the abnormalities. Underlined text indicates alignment between ground truth reports and generated reports

Accordingly, to mimic the behavior of medical experts and address the abovementioned learning difficulties, we propose an Auxiliary Signal-Guided Knowledge (ASGK) approach that includes two kinds of auxiliary signals to improve a Transformer to generate medical reports. Firstly, we automatically find a suspicious region where the pre-trained neural visual extractor paid the most attention. After resizing and cutting, the au (...truncated)