Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data
Nucleic Acids Research
Silvia Liu 1 2
Wei-Hsiang Tsai 0
Ying Ding 1 2
Rui Chen 2
Zhou Fang 2
Zhiguang Huo 2
SungHwan Kim 2
Tianzhou Ma 2
Ting-Yu Chang 6
Nolan Michael Priedigkeit 5
Adrian V. Lee 4
Jianhua Luo 3
Hsei-Wei Wang 0 6 7
I-Fang Chung 0 7
George C. Tseng 1 2
0 Institute of Biomedical Informatics, National Yang-Ming University, No. 155, Sec. 2, Linong Street, Beitou District, Taipei 112, Taiwan
1 Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Biomedical Science Tower 3, 3501 Fifth Avenue, Pittsburgh, PA 15213, USA
2 Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA 15261, USA
3 Department of Pathology, School of Medicine, University of Pittsburgh, 3550 Terrace Street, Pittsburgh, PA 15261, USA
4 Magee-Women's Research Institute, 204 Craft Avenue, Pittsburgh, PA 15213, USA
5 Molecular Pharmacology, School of Medicine, University of Pittsburgh, 3550 Terrace Street, Pittsburgh, PA 15261, USA
6 Institute of Microbiology and Immunology, National Yang-Ming University, No. 155, Sec. 2, Linong Street, Beitou District, Taipei 112, Taiwan
7 Center for Systems and Synthetic Biology, National Yang-Ming University, No. 155, Sec. 2, Linong Street, Beitou District, Taipei 112, Taiwan
Background: Fusion transcripts are formed by either fusion genes (DNA level) or trans-splicing events (RNA level). They have been recognized as a promising tool for diagnosing, subtyping and treating cancers. RNA-seq has become a precise and efficient standard for genome-wide screening of such aberration events. Many fusion transcript detection algorithms have been developed for paired-end RNA-seq data, but their performance has not been comprehensively evaluated to guide practitioners. In this paper, we evaluated 15 popular algorithms by their precision and recall trade-off, accuracy of supporting reads and computational cost. We further combined top-performing methods for improved ensemble detection.
Results: Fifteen fusion transcript detection tools were compared using three synthetic data sets under different coverage, read length, insert size and background noise, and three real data sets with selected experimental validations. No single method was dominantly the best, but SOAPfuse generally performed well, followed by FusionCatcher and JAFFA. We further demonstrated the potential of a meta-caller algorithm that combines top-performing methods to re-prioritize candidate fusion transcripts with high confidence, which can then be followed by experimental validation.
Conclusion: Our results provide insightful recommendations for applying individual tools or combining top performers to identify fusion transcript candidates.
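As an illustration of the two ideas named in the abstract, the precision and recall trade-off and the combination of several callers, the following is a minimal sketch and not the meta-caller developed in this paper. It assumes each tool reports fusions as ordered (5' gene, 3' gene) pairs, and it ranks candidates by a simple vote count across tools. The tool names, the call sets and the placeholder gene pair are hypothetical toy data, and the functions `precision_recall` and `vote_rank` are introduced here for illustration only.

```python
from collections import Counter


def precision_recall(calls, truth):
    """Return (precision, recall) of a set of called fusions against a truth set."""
    calls, truth = set(calls), set(truth)
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def vote_rank(calls_by_tool):
    """Rank candidate fusions by how many tools report them (simple voting).

    This is an assumed, simplified combination rule, not the paper's algorithm.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    votes = Counter()
    for calls in calls_by_tool.values():
        votes.update(set(calls))  # each tool votes at most once per fusion
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy truth set built from the well-known fusions named in the Introduction.
truth = {("TMPRSS2", "ERG"), ("BCR", "ABL1"), ("EML4", "ALK")}

# Hypothetical call sets; "tool_a" etc. are placeholders, not real tool output.
calls_by_tool = {
    "tool_a": [("TMPRSS2", "ERG"), ("BCR", "ABL1"), ("GENE1", "GENE2")],
    "tool_b": [("TMPRSS2", "ERG"), ("EML4", "ALK")],
    "tool_c": [("TMPRSS2", "ERG"), ("BCR", "ABL1")],
}

for tool, calls in calls_by_tool.items():
    p, r = precision_recall(calls, truth)
    print(f"{tool}: precision={p:.2f} recall={r:.2f}")

ranked = vote_rank(calls_by_tool)
for fusion, n_votes in ranked:
    print("-".join(fusion), n_votes)
```

In this toy setting a candidate supported by more tools is ranked higher, which is one plausible way to re-prioritize candidates before experimental validation; a real meta-caller could instead weight tools by their measured performance.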
INTRODUCTION
A fusion gene is the result of a chromosomal insertion, deletion, translocation or inversion that joins two otherwise separate genes. Fusion genes are often oncogenes that play an important role in the development of many cancers. Trans-splicing is an event in which two different primary RNA transcripts are ligated together. Both fusion genes (DNA level) and trans-splicing events (RNA level) can form fusion transcripts. These events usually come from different types of post-transcriptional aberrations and chromosomal rearrangements: large segment deletion (e.g. the well-known fusion TMPRSS2-ERG in prostate cancer (1)), chromosome translocation (e.g. the well-known fusion BCR-ABL1 in chronic myeloid leukemia (2) and EML4-ALK in non-small-cell lung cancer (3)), trans-splicing (4) or readthrough (two adjacent genes) (5). To date, many fusion transcripts have been found and collected in public databases. For example, there are 10 890 fusions in COSMIC (release 72) (6), 1374 fusion sequences found in human tumors (involving 431 different genes) in TICdb (release 3.3) (7), 2327 gene fusions in the Mitelman database (updated in Feb 2015) (8) and 29 159 chimeric transcripts in ChiTaRS (version 2.1) (9,10). Some databases (such as COSMIC, TICdb and ChiTaRS) collect fusion gene sequences and some (e.g. COSMIC and ChiTaRS) offer further summaries of the original tissue types.
Advances in massively parallel sequencing (MPS) have enabled the sequencing of hundreds of millions of short reads, and MPS is now routinely applied to genomic and transcriptomic studies. Its per-base sequencing resolution has provided a precise and efficient standard for fusion transcript detection, especially using paired-end RNA-seq platforms (11). For example, Berger et al. detected and verified 11 fusion transcripts in melanoma samples, and also identified 12 novel chimeric readthrough transcripts (12). McPherson et al. verified 45 out of 268 detected fusion transcripts in ovarian and sarcoma samples (13). (...truncated)