Addendum: Detection of colinear blocks and synteny and evolutionary analyses based on utilization of MCScanX

Corrections & amendments

Addendum to: Nature Protocols https://doi.org/10.1038/s41596-024-00968-2, published online 15 March 2024.

Xi Zhang, Yupeng Wang, Paule V. Joseph, Andrew H. Paterson & David Roy Smith

https://doi.org/10.1038/s41596-026-01380-8

The bioinformatics protocol by Wang et al. (ref. 1) outlines the steps for efficiently identifying colinear blocks in intra- and inter-species BLASTP outputs using the Multiple Colinearity Scan Toolkit Version X (MCScanX; ref. 2). Part 2 of the protocol provides steps for downloading data directly from NCBI and preparing the necessary .gff and .blast files from all-vs-all BLAST analyses. While using MCScanX, we (X.Z. and D.R.S.) discovered that Part 2 lacks an essential pre-processing step: determining whether the annotation contains multiple isoforms per gene derived from alternative splicing. When analyzing data downloaded directly from NCBI, it can be crucial to apply transcript filtering, which retains the longest transcript of each gene as its primary protein sequence. Similar issues can arise when researchers do not use the primary assembly (one haploid set) of a diploid genome assembly, because alleles can then be mistaken for gene duplicates.

To avoid misprediction of gene duplicates, especially for genome data from NCBI or other online resources, it can be helpful to include a transcript filtering step (refs. 3,4). This is particularly true when analyzing datasets from species with large numbers of duplicated genes. Without this step, the number of duplicate genes will be overestimated. In Fig. 1, we show how a transcript filtering step can dramatically change the results of the analysis carried out in the protocol by Wang et al.
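The transcript filtering idea described above can be illustrated with a minimal Python sketch. It is not the implementation used by MCScanX_Assistant or by the protocol; it only shows the principle of keeping the longest protein isoform per gene. The function names (`parse_fasta`, `tair_style_gene_id`, `longest_isoform_per_gene`) are hypothetical, and the default mapping from transcript to gene assumes TAIR-style identifiers in which the isoform is a numeric suffix after the final dot (AT1G01010.1 belongs to AT1G01010). For NCBI data, the mapping would instead have to be taken from the annotation file.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tair_style_gene_id(transcript_id: str) -> str:
    """Assumed convention: 'AT1G01010.1' -> 'AT1G01010'."""
    return transcript_id.rsplit(".", 1)[0]


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]],
    gene_of: Callable[[str], str] = tair_style_gene_id,
) -> Dict[str, Tuple[str, str]]:
    """Return {gene_id: (transcript_id, sequence)} for the longest isoform.

    Length ignores a trailing '*' stop symbol; on ties the first isoform
    encountered is kept.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for transcript_id, seq in records:
        gene = gene_of(transcript_id)
        length = len(seq.rstrip("*"))
        if gene not in best or length > len(best[gene][1].rstrip("*")):
            best[gene] = (transcript_id, seq)
    return best


# Toy input (made-up sequences, for illustration only).
toy_fasta = [
    ">AT1G01010.1 first isoform",
    "MKT",
    "AA",
    ">AT1G01010.2 second isoform",
    "MKTAAGG",
    ">AT1G01020.1",
    "MV*",
]
primary = longest_isoform_per_gene(parse_fasta(toy_fasta))
```

The filtered set would then be written back to a protein FASTA file and used as input for the all-vs-all BLASTP, so that each gene contributes a single sequence.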
Following the MCScanX protocol for Arabidopsis thaliana with and without transcript filtering, we found as many as 25,776 protein-coding genes categorized as singleton duplications after transcript filtering, compared to only 3,086 without filtering. Similarly, 11,948 protein-coding genes are categorized as dispersed duplications with filtering versus 7,297 without. Finally, only 2,216 genes are categorized as tandem duplications after transcript filtering, compared to 30,101 without filtering. This suggests that, in the absence of transcript filtering, multiple isoforms of the same gene can be misinterpreted as tandem duplication events. Figure 2 shows the synonymous substitution rate (Ks) distributions of colinear genes in A. thaliana and Medicago truncatula before and after transcript filtering.

We are not questioning the reliability of the MCScanX algorithms but want to highlight potential issues when using the protocol, particularly when preparing input files in Part 2 (Steps 7–20). Overall, MCScanX is a useful tool for efficiently identifying colinear blocks and for downstream evolutionary analysis, but additional work is needed to prepare the input data and run the tool. We provide the following tips for increasing the utility of the protocol.

As noted by the authors in Steps 11–13 and the Troubleshooting section, generating the correct .gff file benefits from alternatives to the ‘mkGFF3.pl’ program in the MCScanX_protocol package, because downloaded .gff files come in different formats and must be converted to the one MCScanX can read. We have found that the ‘gff2bed’ script from BEDOPS v2.4.41 (ref. 5), AGAT v1.6.1 (ref. 6) and a custom processing script applied to the ‘XX_feature_table.txt’ file can help produce the .gff file for MCScanX.
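As a rough illustration of the format conversion discussed above, the following Python sketch reduces a standard nine-column GFF3 annotation to one line per gene. It assumes that MCScanX expects a tab-delimited file with four columns (chromosome, gene identifier, start, end), as we understand the MCScanX README; it is not a substitute for ‘mkGFF3.pl’, ‘gff2bed’ or AGAT. The function names and the choice of the `ID` attribute are assumptions, and NCBI files may need a different attribute key or feature type.

```python
from typing import Iterable, Iterator, Optional


def gff3_attribute(attributes: str, key: str) -> Optional[str]:
    """Return the value of `key` from a GFF3 column-9 string, or None."""
    for field in attributes.strip().split(";"):
        if "=" in field:
            k, v = field.split("=", 1)
            if k.strip() == key:
                return v.strip()
    return None


def gff3_to_mcscanx(
    lines: Iterable[str],
    feature: str = "gene",   # assumed feature type carrying gene coordinates
    id_key: str = "ID",      # assumed attribute holding the gene identifier
) -> Iterator[str]:
    """Yield 'chromosome<TAB>gene<TAB>start<TAB>end' for each gene feature."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue  # keep gene-level rows only, not mRNA/exon/CDS rows
        gene_id = gff3_attribute(cols[8], id_key)
        if gene_id is None:
            continue
        yield "\t".join([cols[0], gene_id, cols[3], cols[4]])


# Toy GFF3 input (illustrative rows only).
toy_gff3 = [
    "##gff-version 3\n",
    "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=AT1G01010\n",
    "Chr1\tTAIR10\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010\n",
]
rows = list(gff3_to_mcscanx(toy_gff3))
```

Because only gene-level rows are kept, the identifiers in the resulting file are gene names rather than transcript names. They must match the sequence identifiers used in the .blast file, which is one reason the transcript filtering and the coordinate file need to be prepared together.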
For generating the .blast file in Steps 14–20, we found that preparing ‘runBLASTP.sh’ by hand is inefficient, especially when an all-against-all BLASTP is needed for each reciprocal genome pair. We provide custom scripts with the MCScanX_Assistant tool on GitHub (https://github.com/zx0223winner/MCScanX_Assistant) that iterate the all-against-all BLASTP runs across genomes, which greatly simplifies the preparation step (see also Supplementary Text S1–S4 in ref. 7).

Fig. 1 | Comparison of gene duplication modes among closely related Arabidopsis taxa, with and without transcript filtering. This figure was adapted from Fig. 6 of the protocol (ref. 1), where transcript filtering was not used (A. arenosa, A. suecica and A. thaliana without filtering are shown in grey, orange and light blue, respectively, as in Fig. 6 of the protocol). A. thaliana after transcript filtering is shown in dark blue. Strikingly, when transcript filtering is incorporated into the workflow, tandem duplications appear less prevalent in A. thaliana than in A. arenosa, and singleton duplications represent an overwhelming proportion of gene duplications in A. thaliana compared to the other two species.

[Fig. 1 plot. y axis: No. of genes (0 to 40,000). x axis: Gene duplication mode among paralogous genes (Singleton, Dispersed, Proximal, Tandem, WGD or segmental). Series: Arabidopsis thaliana (after transcript filtering); Arabidopsis thaliana (Wang et al. 2024); Arabidopsis suecica (Wang et al. 2024); Arabidopsis arenosa (Wang et al. 2024).]

These comments were well received by the MCScanX team (Y.W., P.V.J. and A.H.P.), and a notice has been added to the external link of the protocol (http://bdx-consulting.com/mcscanx-protocol/) stating the following: “…the current stage lacks a transcript filtering step for handling multiple alternative splice isoforms per locus, which may lead to confusion
To address this limitation, users are encouraged to utilize MCScanX_Assistant, which provides the necessary functionality”.

The MCScanX team further addresses the protocol’s lack of a transcript filtering step as follows. During the development of the original software, the MCScanX team recognized that alternative splicing could influence MCScanX results. To address this, the accompanying README file (https://github.com/wyp1125/MCScanX) explicitly states that “The xyz.bed file holds gene positions,” and the included example uses Arabidopsis thaliana gene symbols (e.g., AT1G01010) rather than transcript identifiers (e.g., AT1G01010.1). This guidance clearly indicates that users should supply gene-level names and coordinates, not transcript-level entries, in the .bed file. Furthermore, the original publication (ref. 2) noted that “If a gene had more than one transcript, only the first transcript in the annotation was used.” Although the MCScanX toolkit did not include a dedicated

[Fig. 2, panels a and b. y axis: Density. Legend entries include Arabidopsis thaliana Araport11 (Wang et al. 2024) and Arabidopsis thaliana TAIR10 (after transcript filtering).]

(...truncated)