Auxiliary signal-guided knowledge encoder-decoder for medical report generation
World Wide Web
https://doi.org/10.1007/s11280-022-01013-6
Mingjie Li1 · Rui Liu2 · Fuyu Wang3 · Xiaojun Chang1 · Xiaodan Liang4
Received: 18 July 2021 / Revised: 17 December 2021 / Accepted: 17 January 2022
© The Author(s) 2022
Abstract
Medical reports have significant clinical value to radiologists and specialists, especially
during a pandemic like COVID-19. However, beyond the common difficulties faced in natural image
captioning, medical report generation specifically requires the model to describe
a medical image with a fine-grained and semantically coherent paragraph that satisfies
both medical commonsense and logic. Previous works generally extract global image
features and attempt to generate a paragraph that is similar to the reference reports; however,
this approach has two limitations. Firstly, the regions of primary interest to radiologists are
usually located in a small area of the global image, meaning that the remaining parts of the
image could be considered irrelevant noise in the training procedure. Secondly, many
similar sentences are used in each medical report to describe the normal regions of
the image, which causes serious data bias. This deviation is likely to teach models to generate
these inessential sentences on a regular basis. To address these problems, we propose an
Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists’ working
patterns. Specifically, auxiliary patches are explored to expand the
widely used visual patch features before they are fed to the Transformer encoder, while external
linguistic signals help the decoder better master prior knowledge during the pre-training
process. Our approach performs well on common benchmarks, including CX-CHR, IU
X-Ray, and the COVID-19 CT Report dataset (COV-CTR), demonstrating that combining auxiliary
signals with the Transformer architecture can bring a significant improvement in
medical report generation. The experimental results confirm that auxiliary-signal-driven
Transformer-based models can outperform previous approaches
on both medical terminology classification and paragraph generation metrics.
Keywords Medical report generation · Auxiliary signals · Transformer · Generative pretraining
Guest Editor: Jianxin Li, Chengfei Liu, Ziyu Guan, and Yinghui Wu
* Xiaojun Chang
Extended author information available on the last page of the article
1 Introduction
When you have a medical image taken in a hospital, you receive a medical report. This
medical report describes both normal and abnormal terminologies, and can assist radiologists
and specialists in diagnosing and reviewing. However, writing medical reports is
error-prone and time-consuming, especially during a pandemic like COVID-19, because
radiologists may have to diagnose hundreds of images per day. Therefore, the topic of
automatically generating medical reports has attracted research attention from both the artificial
intelligence and clinical medicine fields.
The task most similar to medical report generation in the computer vision field is image
captioning. Beyond the common difficulties in natural image captioning, there are three
more bottlenecks for medical report generation. Firstly, the number of image-report pairs in
existing datasets is small compared with captioning datasets, and is insufficient
for learning visual representations. Secondly, it is hard to acquire from medical images the
object features that are widely used in natural image captioning tasks [1]. Only
a few medical images provide well-annotated segmentation or location information
for lesions. Thirdly, severe data deviation exists in these datasets. Some diseases
are rare in nature, and their positive samples are hard to collect. Moreover, many
similar sentences are used in each report to describe routine observations, which leads to
overfitting and limits the generalization of neural approaches [18, 21, 33, 34].
Recently, many approaches have been designed to address these problems and have achieved
promising performance in automatically generating medical reports [3, 12, 17, 21]. For
example, Xue et al. [40] encoded multiple image modalities to generate multiple sentences. Li
et al. [21] manually proposed several templates, and Zhang et al. [45] encoded and modeled
the relationships among visual contents by incorporating a graph module to generate fine-grained
reports. With the success of the Transformer [36] in image captioning tasks, Chen et al. [3]
first proposed a memory-driven Transformer, R2Gen, that can update its memory during the
generation process. Although it achieves promising performance, R2Gen [3] focuses on designing
extra modules rather than on activating the Transformer's inherent learning ability.
More generally, existing approaches have not fully exploited the potential of
neural models, especially the Transformer.
Inspired by radiologists’ working patterns, in this paper we explore the power of auxiliary
signals to facilitate medical report generation. Generally, when radiologists describe
a medical image, they carefully inspect the suspicious regions after quickly browsing
the global image. Then, they write a report that draws on the knowledge they have
learned from the external medical domain and on their working experience. As shown
in Fig. 1, the suspicious region takes up only a tiny portion of the global image but has
been treated equally to other regions in previous works. Therefore, other regions could
be considered irrelevant noise that distracts the model. Although the suspicious regions may get
more attention through the self-attention mechanism in the Transformer, Dosovitskiy et al. [6]
pointed out that the Transformer can learn a better visual representation when fed with original
image patches instead of encoded visual features. Using large extra corpora to pre-train
the Transformer is an effective way to alleviate the corpus deviation in the training
datasets [5, 31]. However, there is a considerable textual semantic gap between the medical and
common domains.
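To make the notion of feeding "original image patches" to a Transformer concrete, the following is a minimal sketch of splitting an image into non-overlapping, flattened patches, in the spirit of Dosovitskiy et al. [6]. It is an illustration under stated assumptions and not the implementation used in this paper: the function name split_into_patches, the grayscale list-of-rows image format, and the toy 4 x 4 input are all hypothetical.

```python
from typing import List

Image = List[List[float]]  # assumed format: grayscale image as rows of pixel values


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping, flattened square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical toy 4 x 4 "image" whose pixel value encodes its position.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
```

Each flattened patch would typically be linearly projected to a token embedding before entering the encoder; that projection is omitted here for brevity.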
Fig. 1 Two samples from the CX-CHR and our COV-CTR datasets. Red bounding boxes annotated by a
radiologist indicate the regions to which the radiologist pays more attention when describing the
image. The red text describes the abnormalities. Underlined text indicates alignment between
ground truth reports and generated reports
Accordingly, to mimic the behavior of medical experts and address the abovementioned
learning difficulties, we propose an Auxiliary Signal-Guided Knowledge
(ASGK) approach that includes two kinds of auxiliary signals to improve a Transformer for
generating medical reports. Firstly, we automatically find a suspicious region where the
pre-trained neural visual extractor paid the most attention. After resizing and cutting,
the au (...truncated)