EvIL- Research

TOAST | BUSCO & ALIGNMENTS

What is in this section

TOAST was originally designed to work with BUSCO; however, many of the functions in earlier versions of TOAST are now also available in BUSCO itself. As such, TOAST has been updated to automate the assembly of loci following a series of BUSCO runs.

This section focuses on creating alignments of individual loci from a series of BUSCO runs.

Note that the alignment step can be deployed on any set of fasta sequences you have assembled. It requires installing MAFFT and adding it to your global path (e.g., by modifying your ~/.bashrc file).

Please ensure MAFFT is working before proceeding.
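One way to confirm this from within R is to ask where the MAFFT executable lives on your path. The sketch below uses only base R and assumes the executable is named mafft, which may differ on some installs.

```r
# Look up the MAFFT executable on the search path (base R only).
# Assumption: the executable is called "mafft" on your system.
mafft_path <- Sys.which("mafft")

if (nzchar(mafft_path)) {
  message("MAFFT found at: ", mafft_path)
} else {
  message("MAFFT was not found on your path; check your install and your ~/.bashrc")
}
```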

3. Parsing BUSCO

This part of the tutorial assumes you have already run BUSCO and need to parse the output.

For users looking to align and assemble other gene/locus fasta files: start from the section on alignment below.

For those using BUSCO, let's begin by reviewing what a series of BUSCO runs looks like, using some crayfish worms as an example.

BUSCO Runs
Get filepaths
This tutorial assumes you have just completed a batch of BUSCO runs across a series of genome or transcriptome fasta files. How do you extract this information? The first thing you need to do is provide the path to the full_table.tsv from EACH species or sample. In this example, the first file, from Ankyrodrilus_legaeus, sits in the run_arthropoda_odb10 subdirectory of that sample's BUSCO results folder. In R, specify the file paths as in the example code below, followed by the locations of the original fasta files you ran BUSCO on.

tsvLocations<-c("~/Documents/Branchiops/Busco_results/Ankyrodrilus_legaeus.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Xironogiton_victoriensis.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Bdellodrilus_illuminatus.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Branchiobdella_kobayashi.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Branchiobdella_parasita.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Cambarincola_gracilis.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Cambarincola_holti.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Cirrodrilus_suzukii.fa/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Triannulata_magna.fa/run_arthropoda_odb10/full_table.tsv")

In this example we are linking to all of the full_table.tsv locations in one object called tsvLocations.
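Typing each path by hand is error prone. Because the paths above share one pattern, they can also be assembled with file.path(). This is only a sketch: it assumes every sample follows the same directory layout as the example, and busco_root, species, and tsvBuilt are names made up for illustration.

```r
# Hypothetical object names; only the directory layout comes from the example above.
busco_root <- "~/Documents/Branchiops/Busco_results"
species <- c("Ankyrodrilus_legaeus", "Xironogiton_victoriensis", "Bdellodrilus_illuminatus")

# Pattern: <root>/<species>.fa/run_arthropoda_odb10/full_table.tsv
tsvBuilt <- file.path(busco_root, paste0(species, ".fa"),
                      "run_arthropoda_odb10", "full_table.tsv")

# Report any path that does not exist before handing the vector to TOAST
missing_tsv <- tsvBuilt[!file.exists(path.expand(tsvBuilt))]
if (length(missing_tsv) > 0) {
  message("Not found:\n", paste(missing_tsv, collapse = "\n"))
}
```

The same approach works for the fasta locations, provided the file names follow an equally regular pattern.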

Next we need to provide the corresponding locations of the original fasta files in one object, in this case called fastaLocations.

fastaLocations<-c("~/Documents/Branchiops/Transcriptomes/Ankyrodrilus_legaeus.fa",
"~/Documents/Branchiops/Transcriptomes/Xironogiton_victoriensis.fa",
"~/Documents/Branchiops/Transcriptomes/Bdellodrilus_illuminatus.fa",
"~/Documents/Branchiops/Transcriptomes/Branchiobdella_kobayashi.fa",
"~/Documents/Branchiops/Transcriptomes/Branchiobdella_parasita.fa",
"~/Documents/Branchiops/Transcriptomes/Cambarincola_gracilis.fa",
"~/Documents/Branchiops/Transcriptomes/Cambarincola_holti.fa",
"~/Documents/Branchiops/Transcriptomes/Cirrodrilus_suzukii.fa",
"~/Documents/Branchiops/Transcriptomes/Triannulata_magna.fa")

In a moment we will use these locations to pull each BUSCO from the original fasta using the output from the full_table.tsv file.
First there are just two more steps: we need to say where we want these files to be stored, and what we want each species to be called in each file.
The second point is important, as your fasta may simply carry a DNA code like "sample_1234" or an abbreviation like Mmus for a genus and species.
We can add an object that supplies the names we want now, which makes later phylogenetic analyses easier.
Note that the names must be in the same order as the table and fasta locations.

# users can adjust below to name the fasta sequences per species, default is "Genus_species"
SampleIDs<-c("Ankyrodrilus_legaeus","Xironogiton_victoriensis","Bdellodrilus_illuminatus",
"Branchiobdella_kobayashi","Branchiobdella_parasita","Cambarincola_gracilis",
"Cambarincola_holti","Cirrodrilus_suzukii","Triannulata_magna")

We now have the corresponding species names for each fasta in one object, in this case called SampleIDs.
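Since the three objects are matched by position, a quick check that they line up can save a confusing result later. The function below is a hypothetical helper, not part of TOAST, and it assumes each sample name appears somewhere in its own file paths, as it does in this example.

```r
# Hypothetical helper (not a TOAST function): checks that the three input
# vectors have equal length and that each sample ID occurs in both of its paths.
checkInputOrder <- function(tsv, fasta, ids) {
  stopifnot(length(tsv) == length(ids), length(fasta) == length(ids))
  in_tsv   <- mapply(grepl, ids, tsv,   MoreArgs = list(fixed = TRUE))
  in_fasta <- mapply(grepl, ids, fasta, MoreArgs = list(fixed = TRUE))
  ok <- in_tsv & in_fasta
  names(ok) <- ids
  ok
}

# Small demonstration with two samples; the second fasta is deliberately mismatched
demo <- checkInputOrder(
  tsv   = c("Busco_results/Cambarincola_holti.fa/run_arthropoda_odb10/full_table.tsv",
            "Busco_results/Triannulata_magna.fa/run_arthropoda_odb10/full_table.tsv"),
  fasta = c("Transcriptomes/Cambarincola_holti.fa",
            "Transcriptomes/Cirrodrilus_suzukii.fa"),
  ids   = c("Cambarincola_holti", "Triannulata_magna")
)
print(demo)
```

With the objects from this tutorial the call would be checkInputOrder(tsvLocations, fastaLocations, SampleIDs); any FALSE in the result points to a sample worth double checking.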
Next we just need to say which directory we want to extract all these loci into, as follows:

# Location to write the extracted sequences into
ed<-"~/Documents/Branchiops/Transcriptomes/Extracted_Buscos"

Now we are ready to get our sequences! Before we do, let's quickly consider what is in the full_table.tsv

BUSCO Output
Complete, duplicated, or fragmented?
Looking at the full_table.tsv in a text editor reveals how BUSCO is cataloging the completeness of loci. This leaves you with the choice of which loci are appropriate for your analysis.
There is no immediate right or wrong answer since this depends entirely on what you are doing!
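To help decide, it can be useful to count how many loci fall into each category for a sample. The sketch below reads a full_table.tsv with base R. It assumes the usual BUSCO layout, in which comment lines start with # and the second tab-separated column holds the status. countBuscoStatus is a hypothetical helper, and the small demo table is made up purely for illustration.

```r
# Hypothetical helper (not a TOAST function): tabulate the status column
# of one full_table.tsv, skipping comment lines that start with "#".
countBuscoStatus <- function(tsv_path) {
  lines <- readLines(path.expand(tsv_path))
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Assumption: the status is the second tab-separated field
  status <- vapply(fields, function(f) f[2], character(1))
  table(status)
}

# Made-up table for demonstration only (not real BUSCO output)
demo_tsv <- tempfile(fileext = ".tsv")
writeLines(c(
  "# Busco id\tStatus\tSequence\tScore\tLength",
  "1at0\tComplete\tcontig_1\t500.1\t400",
  "2at0\tFragmented\tcontig_2\t120.3\t150",
  "3at0\tMissing",
  "4at0\tComplete\tcontig_9\t610.0\t520"
), demo_tsv)

counts <- countBuscoStatus(demo_tsv)
print(counts)
```

Running lapply(tsvLocations, countBuscoStatus) would give one such tally per sample.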
The function to extract BUSCOs has parameters that allow you to select any of the BUSCO types, so you can mix and match complete, fragmented, or duplicated loci as necessary. You simply need to toggle the arguments between TRUE and FALSE. The general case is below:

extractBuscos(tsvLocations, fastaLocations, ed, SampleIDs, complete=TRUE, fragmented=TRUE, duplicated=TRUE, threshold=300)

Note that the threshold parameter represents the minimum number of base pairs required for a fragmented sequence to be extracted.
It is not used when fragmented = FALSE.
Running the above will generate a directory full of FASTA files for each BUSCO.
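A quick way to see what was written is to count the sequences in each file. This sketch uses only base R and counts header lines (those starting with >). countSeqsPerFile is a hypothetical helper; no assumption is made about the file extension TOAST uses, so every file in the directory is read.

```r
# Hypothetical helper (not a TOAST function): number of FASTA records per file
countSeqsPerFile <- function(dir_path) {
  files <- list.files(path.expand(dir_path), full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip subdirectories
  n <- vapply(files, function(f) sum(startsWith(readLines(f), ">")), integer(1))
  names(n) <- basename(files)
  n
}

# Demonstration on a temporary directory with two toy loci
demo_dir <- file.path(tempdir(), "extract_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(">Sp_a", "ATGC", ">Sp_b", "ATGA"), file.path(demo_dir, "locus1.fasta"))
writeLines(c(">Sp_a", "TTGC"), file.path(demo_dir, "locus2.fasta"))

seq_counts <- countSeqsPerFile(demo_dir)
print(seq_counts)
```

For the tutorial data the call would be countSeqsPerFile(ed), giving one count per BUSCO file.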

What if you want to add another sample?

Fear not, simply repeat the above with the same directory!
The files will be appended to include any additional samples. In this example, say we forgot three outgroup taxa.

#Location of tsv files
tsvLocations<-c("~/Documents/Branchiops/Busco_results/Antarctodrilus_proboscidea_TRI_1_15_NORM.fasta/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Haenopis_sanguisuga.fasta/run_arthropoda_odb10/full_table.tsv",
"~/Documents/Branchiops/Busco_results/Theromyzon_tessulatum.fasta/run_arthropoda_odb10/full_table.tsv")
#location of fasta files
fastaLocations<-c("~/Documents/Branchiops/Transcriptomes/Real_Data/Branch_fastas-1/Antarctodrilus_proboscidea_TRI_1_15_NORM.fasta",
"~/Documents/Branchiops/Transcriptomes/Real_Data/Branch_fastas-1/Haenopis_sanguisuga.fasta",
"~/Documents/Branchiops/Transcriptomes/Real_Data/Branch_fastas-1/Theromyzon_tessulatum.fasta")
# get the additional "Genus_species" names
SampleIDs<-c("Antarctodrilus_proboscidea","Haenopis_sanguisuga","Theromyzon_tessulatum")

As you can see, this part is the same. Now just run the extractBuscos function again using the same directory:

extractBuscos(tsvLocations, fastaLocations, ed, SampleIDs,complete=TRUE, threshold=300)
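To confirm the additional samples were appended, one option is to look for each sample name in the FASTA headers of the output files. The sketch assumes the headers contain the names given in SampleIDs; samplePresence is a hypothetical helper, not a TOAST function.

```r
# Hypothetical helper (not a TOAST function): for each sample ID, count the
# files in a directory whose FASTA headers mention that ID.
samplePresence <- function(dir_path, ids) {
  files <- list.files(path.expand(dir_path), full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip subdirectories
  stopifnot(length(files) > 0)
  present <- vapply(files, function(f) {
    headers <- grep("^>", readLines(f), value = TRUE)
    vapply(ids, function(id) any(grepl(id, headers, fixed = TRUE)), logical(1))
  }, logical(length(ids)))
  # Rows are samples, columns are files
  present <- matrix(present, nrow = length(ids),
                    dimnames = list(ids, basename(files)))
  rowSums(present)
}

# Demonstration on a temporary directory with two toy loci
demo_dir <- file.path(tempdir(), "append_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(">Cambarincola_holti", "ATGC", ">Theromyzon_tessulatum", "ATGA"),
           file.path(demo_dir, "locus1.fasta"))
writeLines(c(">Cambarincola_holti", "TTGC"), file.path(demo_dir, "locus2.fasta"))

loci_per_sample <- samplePresence(demo_dir, c("Cambarincola_holti", "Theromyzon_tessulatum"))
print(loci_per_sample)
```

In the tutorial setting, samplePresence(ed, SampleIDs) run after the second extraction would show how many loci each outgroup taxon contributed.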

4. Alignment

TOAST can call MAFFT to quickly align all of these BUSCO sequences (or any folder full of fasta files!).

All you need to do is point TOAST to the directory with the unaligned fasta files and a new directory you would like to write alignments into.

# Note that MAFFT is multithreaded, so you can speed things up by changing the thread count depending on your machine
MafftOrientAlign(extract_dir = "~/Documents/Branchiops/Transcriptomes/Extracted_Buscos",
                 mafft_dir = "~/Documents/Branchiops/Transcriptomes/Mafft_aligned",
                 threads = 12)

That's it! Now you are ready to concatenate some data or start filtering based on missing data patterns or gene trees!
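Before moving on, a simple sanity check is that every sequence within an alignment has the same length, since aligned sequences are padded with gaps. The sketch below parses FASTA with base R only; isAligned is a hypothetical helper, not part of TOAST.

```r
# Hypothetical helper (not a TOAST function): TRUE when all sequences in a
# FASTA file have the same total length, as expected for an alignment.
isAligned <- function(fasta_path) {
  lines <- readLines(path.expand(fasta_path))
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # Each sequence line belongs to the most recent header above it
  record <- cumsum(is_header)
  len_by_record <- tapply(nchar(lines[!is_header]), record[!is_header], sum)
  length(unique(as.vector(len_by_record))) == 1
}

# Toy files: one with equal-length sequences, one without
aligned <- tempfile(fileext = ".fasta")
writeLines(c(">Sp_a", "ATG-C", "AA", ">Sp_b", "ATGGC", "AA"), aligned)
unaligned <- tempfile(fileext = ".fasta")
writeLines(c(">Sp_a", "ATGC", ">Sp_b", "ATGGCA"), unaligned)

print(isAligned(aligned))
print(isAligned(unaligned))
```

Applied to every file in the Mafft_aligned directory (for example with vapply over list.files(..., full.names = TRUE)), the check should return TRUE throughout; a FALSE would suggest that file is worth opening by hand.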
