Interpretation of convolutional neural networks reveals crucial sequence features involving in transcription during fiber development

Liu et al. BMC Bioinformatics (2022) 23:91
https://doi.org/10.1186/s12859-022-04619-9

RESEARCH, Open Access

Shang Liu1,2, Hailiang Cheng1,2, Javaria Ashraf1,3, Youping Zhang1,2, Qiaolian Wang1,2, Limin Lv1,2, Man He1, Guoli Song1,2* and Dongyun Zuo1,2*

*Corresponding authors
1 Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang 455000, China
Full list of author information is available at the end of the article

Abstract

Background: Upland cotton is the world's largest source of natural fiber. During fiber development, the quality and yield of fiber are influenced by gene transcription. Revealing sequence features related to transcription can therefore have a profound impact on cotton molecular breeding. We applied convolutional neural networks to predict gene expression status from the sequences of gene transcription start regions. We then used a gradient-based interpretation and an N-adjusted kernel transformation to extract sequence features contributing to transcription.

Results: Our models reached approximately 80% accuracy, and the area under the receiver operating characteristic curve exceeded 0.85. Gradient-based interpretation revealed that the 5' untranslated region contributed to gene transcription. Furthermore, 6 DOF binding motifs and 4 transcription activator binding motifs were obtained by N-adjusted kernel-motif transformation from models in three developmental stages. Apart from 10 general motifs, 3 DOF5.1 genes were also detected. In silico analysis of the proteins that bind these motifs suggested their potential functions in fiber formation. We also identified several motifs that are novel in plants as important sequence features for transcription.
Conclusions: The N-adjusted kernel transformation method can interpret convolutional neural networks and reveal important sequence features related to transcription during fiber development. The potential functions of motifs interpreted from convolutional neural networks could be validated by further wet-lab experiments and applied in cotton molecular breeding.

Keywords: Cotton fiber, Transcription, Convolutional neural network, Model interpretation, Motif detection

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Background

Upland cotton (Gossypium hirsutum L.) accounts for about 90% of the cotton cultivated worldwide and is the main crop contributing to renewable textile fibers [1]. Fiber development in upland cotton can be divided into four stages: initiation, elongation, secondary cell wall (SCW) thickening, and maturity.
Agronomic traits of fiber are mainly formed in the first three stages, and the genes related to fiber formation are also transcribed in these stages [2–4]. The genome assembly of upland cotton enables researchers to perform high-throughput transcriptome analysis and retrieve gene sequences efficiently [2–5]. Given that gene transcription is the basis of phenotype formation and genome sequences are the basis of heredity, identifying transcription-related sequence features is of great significance for molecular breeding. In maize, convolutional neural networks (CNNs) were applied to predict relative transcriptional abundance, and the roles of untranslated regions (UTRs) in transcription were revealed [6]. This application in maize inspired us to use CNNs in upland cotton to detect sequence features related to transcription.

CNNs have been applied to predict the binding sites of transcription factors and RNA-binding proteins [7–11], and they showed good accuracy in these binding-site prediction tasks. DeepBind used chromatin immunoprecipitation sequencing (ChIP-seq) and crosslinking-immunoprecipitation sequencing (CLIP) data to predict the binding sites of transcription factors and RNA-binding proteins, respectively [11]. DeepSEA used ChIP-seq datasets, DNase I-hypersensitive sites, and histone-mark profiles to identify transcription factor binding sites and chromatin accessibility [10]. In these models, the input sequence was one-hot encoded as a 1-D sequence with 4 channels (A, T, C, G). The encoded sequences were processed by the models to obtain a binding score representing the binding ability of the transcription factor. The successful application of these models to protein-sequence binding prediction indicates that CNNs are well suited to genome sequences.
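The one-hot encoding described above can be illustrated with a minimal Python sketch. It is not the code of DeepBind, DeepSEA, or the present study. The function names `one_hot_encode` and `scan_with_filter`, the all-zero treatment of unknown bases such as N, and the toy filter are assumptions made for illustration only. The channel order (A, T, C, G) follows the text. The second function shows the basic operation a convolutional filter applies to this encoding: sliding a fixed-width weight matrix along the sequence and producing one score per window.

```python
import numpy as np

# Channel order follows the order given in the text: A, T, C, G.
CHANNELS = "ATCG"


def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as an array of shape (len(seq), 4).

    Assumption for this sketch: bases outside A/T/C/G (e.g. N)
    are encoded as an all-zero row.
    """
    index = {base: i for i, base in enumerate(CHANNELS)}
    encoded = np.zeros((len(seq), len(CHANNELS)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        channel = index.get(base)
        if channel is not None:
            encoded[pos, channel] = 1.0
    return encoded


def scan_with_filter(encoded: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (width, 4) kernel along the sequence, one score per window.

    This is a plain cross-correlation without bias or activation,
    i.e. a toy stand-in for one first-layer convolutional filter.
    """
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array(
        [float(np.sum(encoded[i:i + width] * kernel)) for i in range(n_windows)]
    )


if __name__ == "__main__":
    x = one_hot_encode("ATCGN")
    print(x.shape)
    # Hypothetical filter that matches the dinucleotide "TC" exactly.
    toy_kernel = one_hot_encode("TC")
    print(scan_with_filter(one_hot_encode("ATCG"), toy_kernel))
```

In a trained network the kernel weights are learned real values rather than a 0/1 pattern, and many filters are scanned in parallel; the sketch only shows why a 4-channel encoding lets a filter respond to a specific run of bases.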
Apart from high accuracy, another advantage of CNNs is their capacity for motif detection, in which model parameters are interpreted as sequence features with biological meaning [8]. Motif detection through interpretation of CNNs has been attempted in several studies on predicting protein-sequence binding and chromatin accessibility [8, 10–12]. In these studies, filters in the first convolutional layer were treated as motif scanners and selected for interpretation. The strategies for kernel transformation in these studies are similar. First, the sequences are searched for activated regions. Next, activated regions that pass several selection criteria are pooled together. Finally, the sequences within the activated regions are used to calculate a position weight matrix (PWM) through the cross-entropy method. PWMs are aligned against the JASPAR database to detect known motifs, while unaligned PWMs are reported as novel motifs [12]. In these model interpretation strategies, PWMs were generated from activated regions within sequences. Compared with other interpretation methods, such as DeepLIFT, SHAP, and saliency analysis, which calculate importance scores for single nucleotides, model interpretation performed by generating PWMs could provide sequence features in the form of biologica (...truncated)