Interpretation of convolutional neural networks reveals crucial sequence features involving in transcription during fiber development.
(2022) 23:91
Liu et al. BMC Bioinformatics
https://doi.org/10.1186/s12859-022-04619-9
RESEARCH
BMC Bioinformatics
Open Access
Shang Liu1,2, Hailiang Cheng1,2, Javaria Ashraf1,3, Youping Zhang1,2, Qiaolian Wang1,2, Limin Lv1,2, Man He1,
Guoli Song1,2* and Dongyun Zuo1,2*
*Correspondence:
1 Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang 455000, China
Full list of author information is available at the end of the article
Abstract
Background: Upland cotton is the largest source of natural fiber in the world. During fiber
development, the quality and yield of fiber are influenced by gene transcription.
Revealing sequence features related to transcription therefore has a profound impact on cotton
molecular breeding. We applied convolutional neural networks to predict gene expression status from the sequences of gene transcription start regions. A
gradient-based interpretation and an N-adjusted kernel transformation were then implemented to extract sequence features contributing to transcription.
Results: Our models reached accuracies of approximately 80%, and the area under the
receiver operating characteristic curve exceeded 0.85. Gradient-based interpretation revealed that the 5’ untranslated region contributed to gene transcription. Furthermore,
6 DOF binding motifs and 4 transcription activator binding motifs were obtained by
N-adjusted kernel-motif transformation from models in three developmental stages.
Apart from 10 general motifs, 3 DOF5.1 genes were also detected. In silico analysis
of the proteins binding these motifs implied their potential functions in fiber formation. We also found some motifs that are novel in plants to be important sequence
features for transcription.
Conclusions: The N-adjusted kernel transformation method can interpret convolutional neural networks and reveal important sequence features related to
transcription during fiber development. The potential functions of motifs interpreted from
convolutional neural networks could be validated by further wet-lab experiments and
applied in cotton molecular breeding.
Keywords: Cotton fiber, Transcription, Convolutional neural network, Model
interpretation, Motif detection
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Background
Upland cotton (Gossypium hirsutum L.) accounts for about 90% of the cotton cultivated
worldwide and is the main crop contributing to renewable textile fibers [1]. Fiber development of upland cotton can be divided into four stages: initiation, elongation, secondary
cell wall thickening (SCW), and maturity. Agronomic traits of fiber are mainly formed in
the first three stages, and the genes related to fiber formation are also transcribed in these stages [2–4]. The genome assembly of upland cotton enables researchers
to perform high-throughput transcriptome analysis and retrieve gene sequences efficiently
[2–5]. Given that gene transcription is the basis of phenotype formation and that genome
sequences are the basis of heredity, identifying transcription-related sequence features is of great significance for molecular breeding. In maize, convolutional neural networks (CNNs) were applied to predict relative transcriptional abundance, and
the roles of untranslated regions (UTRs) in transcription were revealed [6]. The application
of CNNs in maize inspired us to use CNNs in upland cotton to detect sequence features related to transcription.
CNNs have been applied to predict the binding sites of transcription factors and RNA-binding proteins [7–11]. In these binding site prediction tasks, CNNs
achieved high accuracy. DeepBind used chromatin
immunoprecipitation sequencing (ChIP-seq) and crosslinking-immunoprecipitation
sequencing (CLIP) to predict binding sites of transcription factors and RNA-binding
proteins, respectively [11]. DeepSEA used ChIP-seq datasets, DNase I hypersensitive sites, and histone-mark profiles to identify binding sites of transcription factors and
chromatin accessibility [10]. In these models, the input sequence was one-hot encoded
as a 1-D sequence with 4 channels (A, T, C, G). The encoded sequences were processed by
the models to obtain a binding score that represented the binding ability of the transcription
factor. The successful application of these models to protein-sequence binding prediction
indicates that CNNs are suitable for processing genome sequences. Apart from their high
accuracy, another advantage of CNNs is their capacity for motif
detection, in which a model's parameters are interpreted as sequence features with biological meaning [8].
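The one-hot encoding described above can be illustrated with a minimal sketch. The function names, the toy filter, and the handling of unknown bases (all-zero rows) are assumptions made for illustration; they are not taken from DeepBind, DeepSEA, or the models in this study. The second function shows how a single first-layer filter of a 1-D CNN slides along the encoded sequence to produce position-wise scores.

```python
import numpy as np

# Channel order follows the text (A, T, C, G); other tools may use a different order.
CHANNELS = "ATCG"

def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) array; unknown bases such as N become all-zero rows
    (an assumption made for this sketch)."""
    index = {base: i for i, base in enumerate(CHANNELS)}
    encoded = np.zeros((len(seq), len(CHANNELS)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def scan_filter(encoded: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one (width, 4) filter along the sequence without padding and
    return the score of every window."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array([np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)])

x = one_hot_encode("ACGTN")       # hypothetical 5-bp input
toy_kernel = one_hot_encode("CG")  # hypothetical filter that matches "CG" exactly
scores = scan_filter(x, toy_kernel)
toy_score = scores.max()           # max over positions, as in global max pooling
```

Here the filter scores highest at the window that contains "CG". In a trained network the filter weights are learned real values rather than a one-hot pattern, and the scores pass through further layers before a prediction is made.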
Motif detection through the interpretation of CNNs has been attempted in several
studies on the prediction of protein-sequence binding and chromatin accessibility [8, 10–12]. In these previous studies, filters in the first convolutional layer were
assumed to act as motif scanners and were selected for interpretation. The kernel transformation strategies in these studies are similar. First, the sequences are searched for activated regions.
Subsequently, activated regions selected by several criteria are pooled
together. Finally, the sequences within the activated regions are used to calculate a position
weight matrix (PWM) through the cross-entropy method. PWMs are then aligned against the
JASPAR database to detect known motifs, while unaligned PWMs are reported
as novel motifs [12]. In these model interpretation strategies, PWMs were generated
from activated regions within sequences. Compared with other interpretation methods, such as DeepLIFT, SHAP, and saliency analysis, which calculate significance scores
for single nucleotides, model interpretation performed by generating PWMs could provide sequence features in the form of biologica (...truncated)