TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees

BMC Research Notes, Mar 2018

Thomas Sauvage, Sophie Plouviez, William E. Schmidt, Suzanne Fredericq


Affiliations: Department of Biology, University of Louisiana at Lafayette, 410 E. Saint Mary Boulevard, Lafayette, LA 70503, USA (all authors); Smithsonian Marine Station, 701 Seaway Drive, Fort Pierce, FL 34949, USA (Thomas Sauvage)

Objective: The body of DNA sequence data lacking taxonomically informative sequence headers is rapidly growing in user and public databases (e.g. sequences lacking identification and contaminants). In the context of systematics studies, sorting such sequence data for taxonomic curation and/or molecular diversity characterization (e.g. crypticism) often requires the building of exploratory phylogenetic trees with reference taxa. The subsequent step of segregating DNA sequences of interest based on observed topological relationships can represent a challenging task, especially for large datasets.

Results: We have written TREE2FASTA, a Perl script that enables and expedites the sorting of FASTA-formatted sequence data from exploratory phylogenetic trees. TREE2FASTA takes advantage of the interactive, rapid point-and-click color selection and/or annotations of tree leaves in the popular Java tree-viewer FigTree to segregate groups of FASTA sequences of interest to separate files. TREE2FASTA allows for both simple and nested segregation designs to facilitate the simultaneous preparation of multiple data sets that may overlap in sequence content.

Keywords: Barcoding; Biodiversity; Clone; Contaminant; Cryptic; Environmental; FigTree; Forensic; Metabarcoding; OTU; Phylogeny; Systematics

Introduction

A classic workflow in DNA-based systematics studies [1] consists in building exploratory trees to visualize topological relationships of novel sequences within a larger framework of reference taxa.
This allows for the molecular identification of uncurated sequences, the discovery of molecular crypticism [2], and the choice of relevant ingroup/outgroup taxa [3] (i.e. those to be segregated among the pool of available FASTA sequences for focused systematics studies). Systematists may also need to segregate groups of FASTA sequences to examine sequence attributes across different clades, such as GC content, sequence motifs, or divergence.

Currently, efficiently mining FASTA sequences of interest from tree topologies can be difficult, since tree viewing relies on a Newick string [4] that does not contain DNA information; the latter is held in the original FASTA file used for tree building. Thus, to relate DNA strings to tip labels (i.e. sequence names), one usually needs to script in a programming language such as R, e.g. relying on the package Ape [5] with the function ‘drop.tip’ or ‘extract.clade’ to create lists of sequence names to match to DNA sequences. While this may facilitate part of the process, rapidly selecting numerous clades or tips interactively in the R interface may not be as fluid as in a dedicated tree-viewer such as the popular Java program FigTree [6]. For researchers with limited scripting skills, the process requires manually editing FASTA files via copy/paste (or delete) in a text editor for wanted (or unwanted) sequences. Others may type extensive lists of observed tip labels (i.e. sequence names) that can be used to parse FASTA files with dedicated scripts available from the community, with matching functions of the Galaxy tool shed [7], or with command line tools such as samtools [8] or blastdbcmd from the NCBI Blast+ package [9].

Fig. 1 Simulated phylogeny displaying taxa named ‘A’ to ‘T’. (a) Basic workflow for FASTA sequence extraction with TREE2FASTA. An exploratory tree is built following multiple alignment of FASTA data. The Newick tree string (NWK) is visualized and edited in the tree-viewer FigTree and saved as a NEXUS file (NEX). TREE2FASTA uses the FASTA alignment and the NEXUS file (NEX) to produce subsetted FASTA files according to the user's selection scheme (here color). (b) Example of possible color and/or annotation selection schemes in FigTree for TREE2FASTA sequence extraction. The FASTA icon marked with an asterisk ‘*’ contains FASTA sequences for taxa H and I lacking color selection (i.e. achromatic) or lacking annotation. For figure clarity, annotations ‘Group1’ to ‘Group4’ are reported as G1 to G4 within FASTA file icons. FASTA files output to different folders are delimited by dashed boxes.

Overall, although some of the above practices may be feasible for small datasets (e.g. typing lists), they may rapidly become impractical for researchers who are faced with large data sets (100 to 1000+ sequences to be sorted). Here, to offer a rapid and interactive solution to sequence selection from exploratory phylogenies, we devised a Perl script named TREE2FASTA that allows the batch ex (...truncated)
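
The idea behind the workflow of Fig. 1a (match colored tip labels from a FigTree NEXUS file to FASTA records, then split the records by color) can be illustrated with a minimal Python sketch. This is not the TREE2FASTA Perl code. It assumes that FigTree stores a tip color as a `[&!color=#rrggbb]` comment after the label in the `taxlabels` block and that FASTA headers equal the tip labels; the function names, the toy data, and the `achromatic` group key are hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Assumed taxlabel shape: a label, optionally followed by a [&...] comment.
TAXLABEL = re.compile(r"^\s*('(?:[^']|'')*'|[^\s\[;]+)(?:\[&(.*?)\])?\s*$")
# Assumed color annotation inside that comment, e.g. !color=#ff0000.
COLOR = re.compile(r"!color=(#[0-9A-Fa-f-]+)")


def read_tip_colors(nexus_text):
    """Map each tip label in the taxlabels block to its color (None if uncolored)."""
    colors = {}
    in_taxlabels = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.lower() == "taxlabels":
            in_taxlabels = True
            continue
        if in_taxlabels:
            if stripped.startswith(";"):  # end of the taxlabels list
                break
            match = TAXLABEL.match(line)
            if match:
                name = match.group(1).strip("'")
                found = COLOR.search(match.group(2) or "")
                colors[name] = found.group(1) if found else None
    return colors


def read_fasta(fasta_text):
    """Return {header: sequence}; the header is the full line after '>'."""
    parts, name = {}, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def group_by_color(tip_colors, records):
    """Split FASTA records into one group per tip color."""
    groups = defaultdict(dict)
    for name, seq in records.items():
        key = tip_colors.get(name) or "achromatic"  # uncolored tips
        groups[key][name] = seq
    return dict(groups)


def to_fasta(group):
    """Serialize one group back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in group.items())


# Toy data (hypothetical): three colored tips and one uncolored tip.
NEXUS = """#NEXUS
begin taxa;
    dimensions ntax=4;
    taxlabels
    A[&!color=#ff0000]
    B[&!color=#ff0000]
    C[&!color=#0000ff]
    H
;
end;
"""
FASTA = ">A\nACGT\n>B\nACGA\n>C\nTTGA\n>H\nGGCA\n"

groups = group_by_color(read_tip_colors(NEXUS), read_fasta(FASTA))
for color, group in groups.items():
    print(color, sorted(group))
```

Each group could then be written to its own file with `to_fasta`. Keeping uncolored tips in a separate group mirrors the asterisk-marked FASTA icon described in the caption of Fig. 1b, but how TREE2FASTA itself names or arranges its output files is not stated in this excerpt.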


Thomas Sauvage, Sophie Plouviez, William E. Schmidt, Suzanne Fredericq. TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees, BMC Research Notes, 2018, pp. 164, Volume 11, Issue 1, DOI: 10.1186/s13104-018-3268-y