Addendum: Detection of colinear blocks and synteny and evolutionary analyses based on utilization of MCScanX
Corrections & amendments
Addendum to: Nature Protocols https://doi.org/10.1038/s41596-024-00968-2,
published online 15 March 2024.
Xi Zhang, Yupeng Wang, Paule V. Joseph, Andrew H. Paterson
& David Roy Smith
https://doi.org/10.1038/s41596-026-01380-8
The bioinformatics protocol by Wang et al.1 outlines the steps for efficiently identifying colinear
blocks in intra- and inter-species BLASTP outputs using the Multiple Colinearity Scan Toolkit
Version X (MCScanX)2. Part 2 of the protocol provides steps for downloading data directly from
NCBI and preparing the necessary .gff files and the .blast files yielded from all-vs-all BLASTP
analyses. While using MCScanX, we (X.Z. and D.R.S.) discovered that Part 2 lacks an essential
pre-processing step: determining whether the data contain multiple isoforms derived from
alternative splicing. Indeed, when analyzing data downloaded directly from NCBI, it can be crucial
to apply transcript filtering, a process that identifies the longest transcript of each gene as
its primary protein sequence. Similar issues can arise when researchers do not use the primary
assembly (one haploid set) of a diploid genome assembly, because alleles can then be mistakenly
treated as gene duplicates.
To avoid misprediction of gene duplicates, especially for genome data from NCBI or other
online resources, it can be helpful to include a transcript filtering step3,4. This is particularly
true when analyzing datasets of species with large numbers of duplicated genes. Without this
step, the number of duplicated genes can be overestimated. In Fig. 1, we show how a
transcript filtering step can dramatically change the results of the analysis carried out in the
protocol by Wang et al. By following the MCScanX protocol and comparing Arabidopsis
thaliana with and without transcript filtering, we found 25,776 protein-coding
genes categorized as singleton duplications after transcript filtering, compared
to only 3,086 protein-coding genes without filtering. Similarly, 11,948 protein-coding
genes were categorized as dispersed duplications with filtering vs. 7,297 protein-coding
genes without. Finally, only 2,216 genes were categorized as tandem duplications after
transcript filtering, compared to 30,101 genes without filtering. This suggests that in the
absence of transcript filtering, multiple isoforms from the same gene can be misinterpreted
as tandem duplication events.
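
To make the idea of transcript filtering concrete, the following is a minimal Python sketch that keeps the longest protein per gene. It is an illustration only, not the implementation used in MCScanX_Assistant. The function name, the toy sequences and the rule for breaking ties are all assumptions. The gene identifier is derived here from TAIR-style 'gene.isoform' names; for NCBI downloads, the protein-to-gene mapping would instead have to be taken from the annotation, because the suffix of a RefSeq accession is a version number rather than an isoform index.

```python
from typing import Dict


def longest_isoform_per_gene(
    protein_seqs: Dict[str, str],
    protein_to_gene: Dict[str, str],
) -> Dict[str, str]:
    """Return {gene_id: protein_id}, keeping the longest protein per gene.

    Ties are broken by keeping the isoform seen first (an arbitrary,
    hypothetical choice made for this sketch).
    """
    best: Dict[str, str] = {}
    for protein_id, seq in protein_seqs.items():
        gene_id = protein_to_gene[protein_id]
        current = best.get(gene_id)
        # Replace the stored isoform only if this one is strictly longer.
        if current is None or len(seq) > len(protein_seqs[current]):
            best[gene_id] = protein_id
    return best


# Toy, made-up sequences; identifiers follow the gene.isoform pattern.
toy_seqs = {
    "AT1G01010.1": "MEDQVGFGFRPNDEEL",
    "AT1G01020.1": "MAASEHRCVG",
    "AT1G01020.2": "MAASEHRCVGCGFRVKSL",
}
# Hypothetical mapping: strip the isoform suffix to obtain the gene ID.
toy_map = {pid: pid.rsplit(".", 1)[0] for pid in toy_seqs}

primary = longest_isoform_per_gene(toy_seqs, toy_map)
print(primary)
# {'AT1G01010': 'AT1G01010.1', 'AT1G01020': 'AT1G01020.2'}
```

Only the proteins retained in such a mapping would then be passed to the all-against-all BLASTP step, so that each gene contributes a single sequence.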
Figure 2 shows the synonymous substitution rate (Ks) distributions of
colinear genes in A. thaliana and Medicago truncatula before and after transcript filtering. We
are not questioning the reliability of the MCScanX algorithms but want to highlight
potential issues when using the protocol, particularly the challenges of preparing
input files in Part 2 (Steps 7–20). Overall, MCScanX is a useful tool for efficiently identifying
colinear blocks and for downstream evolutionary analysis, but additional work is needed for
preparing the input data and running the tool.
We offer the following tips for increasing the utility of the protocol. First, as noted by
the authors in Steps 11–13 and the Troubleshooting section, the downloaded .gff can have
different formats and must be converted to the one MCScanX can read, so it is better to offer
alternative options similar to the ‘mkGFF3.pl’ program in the MCScanX_protocol package. We have
found that the ‘gff2bed’ script from BEDOPS v2.4.41 (ref. 5), AGAT v1.6.1 (ref. 6) and a custom
processing script applied to the ‘XX_feature_table.txt’ file can help yield the .gff file for
MCScanX. Second, for generating the .blast file at Steps 14–20, we found that preparing
‘runBLASTP.sh’ is inefficient, especially when an all-against-all BLASTP is needed for each
reciprocal genome pair. We have provided custom scripts with the MCScanX_Assistant tool on GitHub
(https://github.com/zx0223winner/MCScanX_Assistant) to iterate the all-against-all BLASTP
processing across genomes, which greatly improves the preparation step (see also the
Supplementary Text S1–S4 in ref. 7).
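
As an illustration of iterating the all-against-all BLASTP step over genome pairs, the sketch below builds one BLAST+ 'blastp' command per ordered (query, database) pair, including each species against itself. It is not the script shipped with MCScanX_Assistant or the protocol's ‘runBLASTP.sh’. The species prefixes, the file naming scheme and the E-value and hit-count settings are assumptions chosen for the example and should be replaced with the values from the protocol. The commands are only printed, not executed.

```python
from itertools import product
from typing import List


def blastp_commands(species: List[str], evalue: str = "1e-10",
                    max_hits: int = 5) -> List[str]:
    """Build one BLASTP command per ordered (query, database) pair.

    Assumes (hypothetically) that each species has a protein FASTA named
    <species>.faa and a BLAST database with the same prefix that was
    built beforehand with makeblastdb.
    """
    commands = []
    # product(..., repeat=2) yields every ordered pair, self-pairs included.
    for query, db in product(species, repeat=2):
        commands.append(
            f"blastp -query {query}.faa -db {db} "
            f"-evalue {evalue} -max_target_seqs {max_hits} "
            f"-outfmt 6 -out {query}_vs_{db}.blast"
        )
    return commands


# Hypothetical prefixes for A. thaliana and M. truncatula.
for cmd in blastp_commands(["Ath", "Mtr"]):
    print(cmd)
```

For n genomes this produces n × n commands; writing them to a shell script or passing them to a job scheduler is left to the user.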
These comments were well received by the MCScanX team (Y.W., P.V.J. and A.H.P.), and
a notice has been added to the external link of the protocol
(http://bdx-consulting.com/mcscanx-protocol/) stating the following: “…the current stage lacks a transcript filtering
step for handling multiple alternative splice isoforms per locus, which may lead to confusion
among paralogous genes. To address this limitation, users are encouraged to utilize MCScanX_Assistant, which provides the necessary
functionality”.
Fig. 1 | Comparison of gene duplication modes among closely related Arabidopsis taxa, with and
without transcript filtering. This figure was adapted from Fig. 6 of the protocol1, where transcript
filtering was not used (A. arenosa, A. suecica and A. thaliana without filtering are shown in grey,
in orange and in light blue, respectively, as in Fig. 6 of the protocol1). A. thaliana after
transcript filtering is shown in dark blue. Strikingly, it appears that tandem duplications are
less prevalent in A. thaliana than A. arenosa, and singleton duplications represent an overwhelming
proportion of gene duplications in A. thaliana compared to the other two species, when transcript
filtering is incorporated into the workflow.
[Figure: x-axis, Gene duplication mode (Singleton, Dispersed, Proximal, Tandem, WGD or segmental);
y-axis, No. of genes (0 to 40,000). Legend: Arabidopsis thaliana (after transcript filtering);
Arabidopsis thaliana (Wang et al. 2024); Arabidopsis suecica (Wang et al. 2024);
Arabidopsis arenosa (Wang et al. 2024).]
The MCScanX team further addresses the protocol’s lack of a
transcript filtering step as follows:
During the development of the original software, the MCScanX
team recognized that alternative splicing could influence MCScanX
results. To address this, the accompanying README file (https://github.com/wyp1125/MCScanX)
explicitly states that “The xyz.bed file holds
gene positions,” and the included example uses Arabidopsis thaliana
gene symbols (e.g., AT1G01010) rather than transcript identifiers (e.g.,
AT1G01010.1). This guidance clearly indicates that users should supply
gene-level names and coordinates—not transcript-level entries—in the
.bed file. Furthermore, the original publication2 noted that “If a gene
had more than one transcript, only the first transcript in the annotation
was used.” Although the MCScanX toolkit did not include a dedicated
[Fig. 2, panels a and b. y-axis: Density. Panel titles: “Arabidopsis thaliana Araport11 (Wang et al. 2024)” and “Arabidopsis thaliana TAIR10 (after transcript filtering)”.]
(...truncated)