Development and Validation of a Pathology Image Analysis-based Predictive Model for Lung Adenocarcinoma Prognosis - A Multi-cohort Study
Xin Luo 0
Shen Yin 0 2
Lin Yang 0 3
Junya Fujimoto 4
Cesar Moran 6
Neda Kalhor 6
Annikka Weissferdt 6
Yang Xie 0 1 7
Adi Gazdar 7 8 9
John Minna 7 8
Ignacio Ivan Wistuba 4
Yousheng Mao 5
Guanghua Xiao 0 1 7
0 Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA
1 Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA
2 Department of Statistics, Southern Methodist University, Dallas, Texas, USA
3 Department of Pathology, National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
4 Department of Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
5 Department of Thoracic Surgery, National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences (CAMS), Beijing, China
6 Department of Pathology, Division of Pathology/Lab Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
7 Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
8 Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
9 Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
Prediction of disease prognosis is essential for improving cancer patient care. Previously, we demonstrated the feasibility of using quantitative morphological features of tumor pathology images to predict the prognosis of lung cancer patients in a single cohort. In this study, we developed and validated a pathology image-based predictive model for the prognosis of lung adenocarcinoma (ADC) patients across multiple independent cohorts. Using quantitative pathology image analysis, we extracted morphological features from H&E stained sections of formalin fixed paraffin embedded (FFPE) tumor tissues. A prediction model for patient prognosis was developed using tumor tissue pathology images from a cohort of 91 stage I lung ADC patients from the Chinese Academy of Medical Sciences (CAMS), and validated in ADC patients from the National Lung Screening Trial (NLST) and the UT Special Program of Research Excellence (SPORE) cohort. The morphological features that were associated with patient survival in the training dataset from the CAMS cohort were used to develop a prognostic model, which was independently validated in both the NLST (n = 185) and the SPORE (n = 111) cohorts. After adjusting for key clinical variables, the association between predicted risk and overall survival was significant for both the NLST (hazard ratio (HR) = 2.20, pv = 0.01) and the SPORE (HR = 2.15, pv = 0.044) cohorts. Furthermore, the model also predicted the prognosis of patients with stage I ADC in both the NLST (n = 123, pv = 0.0089) and SPORE (n = 68, pv = 0.032) cohorts. The results indicate that the pathology image-based model predicts the prognosis of ADC patients across independent cohorts.
Table 1. Characteristics of patients in the different cohorts. Rows: Number of Patients; Number of Slides (Tumor); Age at Diagnosis (Years), Median [LQ-HQ]; Follow-up (Years), Median [LQ-HQ]; Vital Status (%); Cancer Stage (%); Smoking Status (%).
attributed to the highly heterogeneous nature of tumor cells and their close interaction with the diverse tumor
microenvironment2,3. Recently, different technologies and methods have been developed to stratify cancer
patients based on their molecular profiles4,5 or histopathological factors6,7, in order to facilitate personalized
treatment of individual patients. Formalin fixed paraffin embedded (FFPE) tumor tissue slides provide a vast
amount of information about the tumor and its surrounding microenvironment8; however, their potential for
cancer diagnosis and treatment planning is still far from being fully explored. Currently, H&E stained tumor
tissue slide scanning is becoming a routine clinical procedure. Recently, we9 and Yu et al.10 have demonstrated that
pathology image analysis could be a promising tool to assist pathologists in lung cancer diagnosis and prognosis.
However, both studies trained and validated the model using The Cancer Genome Atlas (TCGA) cohort alone.
Because pathology images and patients from different cohorts may display different characteristics, testing the
generalizability of a predictive model requires evaluating its performance across multiple independent cohorts.
In this study, we developed a pathology image-based prognostic model for lung adenocarcinoma (ADC) patients
and validated the model in two independent lung ADC patient cohorts. This study established a generalized
model that could be applied across different lung ADC patient cohorts.
Materials and Methods
Ethics approval and consent to participate. The University of Texas Southwestern Institutional Review
Board granted approval for this research (IRB#: STU 072016-028). Data were collected under informed consent,
which was obtained from all study participants. All methods were performed in accordance with the relevant
guidelines and regulations.
Datasets. We acquired H&E-stained histological images and the corresponding clinical information for 91
stage I ADC patients from the Chinese Academy of Medical Sciences, China (CAMS), 185 ADC patients from
the National Lung Screening Trial (NLST), and 111 ADC patients from the University of Texas Special Program
of Research Excellence (SPORE) in Lung Cancer project. The CAMS, NLST and SPORE cohorts contributed 91, 433,
and 130 tissue slides, respectively. When a patient had multiple tissue slides, the morphological feature
values were averaged across slides to give a single value per patient for further statistical analyses.
All tumor tissue slides were FFPE and were scanned at ×20 or ×40 magnification. Our pathologists,
Drs. Lin Yang and Junya Fujimoto, manually inspected the tissue slide images, and images of low quality
were removed from further analysis. The images captured at ×40 were normalized to ×20 using the method
described in our previous study9. The characteristics of patients from different cohorts are summarized in Table 1.
Extract Morphological Features. Using the method described in our previous study9, morphological
features for each image slide were extracted with the CellProfiler11,12 software by choosing different analysis
modules. These features include global features such as tissue texture and granularity, as well as cell
nucleus-based features such as the size, shape, distribution, texture and neighboring architecture of nuclei.
Together, these features cover the morphological information provided by the histological images. For patients
with multiple image slides, the feature values were averaged.
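The per-patient averaging described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' own code (their analyses were run in R). The function name, the feature names and the slide values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_features_per_patient(slide_rows):
    """Average each morphological feature over all slides of a patient.

    slide_rows: iterable of (patient_id, {feature_name: value}) pairs,
    one pair per tissue slide.
    Returns {patient_id: {feature_name: mean over that patient's slides}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, features in slide_rows:
        for name, value in features.items():
            grouped[patient_id][name].append(value)
    return {
        patient_id: {name: mean(values) for name, values in per_feature.items()}
        for patient_id, per_feature in grouped.items()
    }


# Hypothetical slide-level measurements (not taken from the study).
slides = [
    ("P1", {"nuclei_area": 10.0, "granularity": 2.0}),
    ("P1", {"nuclei_area": 14.0, "granularity": 4.0}),
    ("P2", {"nuclei_area": 8.0, "granularity": 1.0}),
]
patient_features = average_features_per_patient(slides)
```

A patient with a single slide simply keeps that slide's values, so the same function covers cohorts with one slide per patient and cohorts with several.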
Prognostic Model Development and Validation. Since all the pathology images and clinical
information from the CAMS cohort had been strictly reviewed and assessed by a pathologist, Dr. Lin Yang, this cohort
was used as the training set to develop a pathology image-based prognostic model for lung ADC patients. The
morphological features were first screened by their association with patients' survival using a univariate Cox
proportional hazards regression model. Morphological features that were significantly associated (Z score < −2
or > 2) with patients' overall survival were selected to build a prognostic model using the random survival forest
method12. The model was then validated separately in ADC patients from the NLST and SPORE cohorts. Using
the risk scores assigned by the model, the patients in each of the two testing sets were separated into high-
and low-risk groups at the median risk score of that set.
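The screening and grouping steps can be illustrated with a short Python sketch. It is not the authors' implementation (their analyses were run in R) and it leaves out the random survival forest step. It fits a one-covariate Cox model by Newton-Raphson, assuming the Breslow treatment of tied event times, reports the Wald Z score, keeps features with |Z| > 2, and splits patients at the median risk score. All function names and data are hypothetical. Assigning patients whose score equals the median to the low-risk group is an assumption, since the text does not say how such ties were handled.

```python
import math
from statistics import median


def univariate_cox_z(times, events, x, max_iter=50, tol=1e-9):
    """One-covariate Cox model fitted by Newton-Raphson (Breslow ties).

    Returns (beta, z), where z = beta / SE(beta) is the Wald Z score.
    """
    n = len(times)
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored subjects enter only through risk sets
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # subject j is at risk at times[i]
                    w = math.exp(beta * x[j])
                    s0 += w
                    s1 += w * x[j]
                    s2 += w * x[j] * x[j]
            mean_x = s1 / s0
            score += x[i] - mean_x             # first derivative of log-likelihood
            info += s2 / s0 - mean_x * mean_x  # minus the second derivative
        if info <= 0.0:
            raise ValueError("feature carries no information in these data")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, beta * math.sqrt(info)


def screen_features(z_scores, threshold=2.0):
    """Keep the features whose Z score is below -threshold or above +threshold."""
    return [name for name, z in z_scores.items() if abs(z) > threshold]


def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' risk relative to the cohort median."""
    cutoff = median(risk_scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in risk_scores.items()}


# Hypothetical toy data (not from the study): six patients, all with an event.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_feature = [3, 2, 3, 1, 2, 1]
toy_beta, toy_z = univariate_cox_z(toy_times, toy_events, toy_feature)
```

The fit does not converge when a feature perfectly orders the event times, because the estimate then grows without bound. A production analysis would normally rely on an established survival library rather than this hand-written loop.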
Statistical Analysis. The survival curve for each group was estimated by the Kaplan-Meier method. The
overall survival outcomes of the high- and low-risk groups were compared using the log-rank test.
Multivariate Cox proportional hazards models were used to determine the association between predicted risk
groups and overall survival after adjusting for key clinical variables, including age, sex, smoking status, grade, race
and stage. All the analyses were performed with R10 version 3.4.1.
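For readers unfamiliar with these two procedures, the following Python sketch shows a Kaplan-Meier estimate and a two-group log-rank test. It is an illustration only: the authors ran their analyses in R, and the function names and toy data here are hypothetical. The p-value comes from the chi-square distribution with one degree of freedom, computed through the complementary error function.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate; returns [(t, S(t))] at each distinct event time."""
    surv = 1.0
    curve = []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    in_a = [True] * len(times_a) + [False] * len(times_b)
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                         # at risk, both groups
        n_a = sum(1 for u, a in zip(times, in_a) if a and u >= t)   # at risk, group A
        d = sum(1 for u, e in zip(times, events) if u == t and e)   # events at t
        d_a = sum(1 for u, e, a in zip(times, events, in_a) if a and u == t and e)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:  # hypergeometric variance of d_a given the margins
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("groups cannot be compared: zero variance")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 d.f.
    return chi2, p_value


# Hypothetical toy data (not from the study); event = 1 means death observed.
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
chi2_stat, p_val = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

The multivariate Cox adjustment described above is not shown here, because it requires fitting several covariates jointly.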
Results
Extracted Morphological Features are Associated with Patients' Survival Outcome. In total,
943 morphological features were extracted from H&E stained tumor tissue images. Among these morphological
features, the top 15 features were significantly associated with patients' survival outcome in the CAMS cohort
(Table 2). The features with the most significant Z scores were enriched in the categories "Tissue Granularity",
"Nuclei Texture" and "Nuclei Size Shape". Some of the features showed elevated measurements in the
high-risk group, whereas others showed the opposite pattern.
Predictive Model is Robust in Different Cohorts. A prognostic model was developed from the CAMS
patient cohort using the 15 top features as predictors. The model was then validated in both the NLST and SPORE
patient cohorts, where it separated the patients in each test cohort into high- and low-risk groups. Patients
in the predicted high-risk group showed significantly worse survival than those in the predicted low-risk group
in both the NLST dataset (pv = 0.0406) and the SPORE dataset (pv = 0.0288). In the NLST dataset, the 5-year
survival rate was 81% (95% confidence interval (CI) = [73–89%]) for the group with low risk scores
versus 73% (95% CI = [64–83%]) for the group with high risk scores. In the SPORE dataset, the 5-year survival
rate was 73% (95% CI = [60–87%]) for the group with low risk scores versus 58% (95% CI = [44–76%]) for the
group with high risk scores (Fig. 1a,b). In multivariate analysis adjusting for clinical variables, including age, sex,
smoking status, grade, race and stage (Tables 3 and 4), the association between predicted risk group and overall
survival was significant for both the NLST cohort (HR = 2.20 for the predicted high-risk vs. low-risk group,
pv = 0.01) and the SPORE cohort (HR = 2.15, pv = 0.044). Furthermore, the model predicted the prognosis
of patients with stage I ADC in both the NLST (n = 123, pv = 0.0089) and the SPORE (n = 68, pv = 0.032) cohorts.
Discussion
Because of the lack of standard guidelines for pathology images, images from different cohorts may vary
substantially in slide thickness, sectioning, staining quality and scanning magnification. Patients in different
cohorts may also display different demographic and clinical characteristics. It is therefore essential to test the
generalizability of such prognostic models by evaluating their prediction performance across multiple independent
test cohorts. In this study, we validated the H&E stained tumor pathology image-based prognostic model in
two independent cohorts, demonstrating the feasibility of integrating such analysis into real medical practice to
assist pathologists in cancer diagnosis. Obtaining good-quality and highly representative image data from patients
may further improve the prediction accuracy, which underscores the need for standard guidelines for pathology
image acquisition and processing in the field.
The data that support the findings of this study are available from the University of Texas Special Program of
Research Excellence (SPORE) in Lung Cancer, National Cancer Center/Cancer Hospital and Peking Union Medical
College, China, but restrictions apply to the availability of these data. Data are available from the authors upon
reasonable request. Pathology images of the NLST cohort are available online at the NLST website.
Competing Interests: The authors declare no competing interests.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The images or other third party material in this
article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the article's Creative Commons license and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
This work was partially supported by the National Institutes of Health [5R01CA152301, P50CA70907, 5P30CA142543, 1R01GM115473, and 1R01CA172211], and the Cancer Prevention and Research Institute of Texas [RP120732].
© The Author(s) 2019