
Walking Through TOAST, Part 2: Data Harvesting

Use Linux and BUSCO to harvest public data

This is a three-part tutorial. Part 1 focuses on installation; make sure to read it before proceeding.

This part focuses on using BUSCO to harvest orthologs from online FASTA files, locally stored FASTA files, or a combination of the two. NOTE: you need to be using Mac or Linux to use BUSCO!

Part 3 focuses on visualizing missing data and assembling concatenated alignments. You can use any platform for these functions.

Quick Overview - Suggested Project Structure


TOAST will move between directories on your hard drive, generating files as it goes. We recommend the following folder architecture to keep your project organized. Within a main project folder:
1) Create a separate folder in which you store FASTA files. TOAST will also place downloaded FASTA files here.
2) A busco_results folder for storing and retrieving BUSCO IDs
3) An extracted folder for your raw ortholog sequences
4) A mafft_aligned folder for your alignments of each ortholog
5) A threshold folder for your alignments of each ortholog after removing species that do not meet the representation requirements
TOAST will write concatenated alignments, custom concatenated alignments of user-selected loci, and partition files into the main directory for easy retrieval.
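
The layout above can be created from R before running anything else. The following is a minimal sketch using only base R; make_project_dirs is a hypothetical helper (not a TOAST function), and the temporary location stands in for your own project folder.

```r
# Sketch: create the suggested project layout under one main project folder.
# make_project_dirs() is a hypothetical helper, not part of TOAST.
make_project_dirs <- function(project_dir,
                              subdirs = c("fasta", "busco_results", "extracted",
                                          "mafft_aligned", "threshold")) {
  paths <- file.path(project_dir, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates project_dir itself if it is missing;
    # showWarnings = FALSE keeps reruns quiet when folders already exist
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  names(paths) <- subdirs
  paths
}

# Example: build the layout in a temporary location (replace with your own path)
project_paths <- make_project_dirs(file.path(tempdir(), "trial1"))
print(project_paths)
```

The returned named vector can be reused when you define the directory variables later in this tutorial.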

Function Overview


Here is an overview of the core functions and their purpose.
TOAST functions can be divided into 4 groups:
1) Sequence gathering functions and 2) BUSCO-related functions (covered on this page)
3) Missing data pattern visualization and 4) Alignment assembly (covered in the next section)

Getting a Taxonomic ID


For a given focal clade (mammals, cetaceans, squirrels, etc.), TOAST will use BUSCO and BLAST to download, find, and extract all orthologs from the public data available on NCBI.
This method requires a taxonomic ID for your focal clade.
The taxonomic ID can be easily obtained from NCBI as follows:

1) Navigate to www.ncbi.nlm.nih.gov/taxonomy/
2) In the Taxonomy search field enter your desired clade
3) Click on the result. This will give you a full breakdown of subclade taxonomy. Click on the link for the clade you want to focus on.
4) This will send you to a page that includes the taxonomic ID. In the case of Cetacea this is 9721.
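
If you prefer to look up the ID from R rather than the browser, one possible route is NCBI's E-utilities esearch endpoint. The sketch below only builds the query URL; taxonomy_search_url is a hypothetical helper (not part of TOAST), and fetching the URL requires network access.

```r
# Sketch: build an NCBI E-utilities "esearch" URL that looks up a clade name
# in the taxonomy database. taxonomy_search_url() is a hypothetical helper.
taxonomy_search_url <- function(clade) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # URLencode() escapes spaces and other reserved characters in the clade name
  paste0(base, "?db=taxonomy&term=", utils::URLencode(clade, reserved = TRUE))
}

url <- taxonomy_search_url("Cetacea")
print(url)

# With network access, the XML response can be read with, for example:
# response <- readLines(url, warn = FALSE)
# The taxonomic ID is reported inside the <Id> element of the response.
```

This is a convenience only; the web page route described above gives the same ID.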

Harvesting New Ortholog Datasets

To use BUSCO and harvest orthologs, you need to be using Mac or Linux and have installed the dependencies mentioned in Part 1. If both of these are true, we are ready to begin.

Looking at the example_script.R file from within the example folder, you will see that you need to define some locations as well as the number of cores to use.

#universal variables
td <- "/home/carolinafishes/temp/trial1" #toast_directory
fd <- "/home/carolinafishes/temp/trial1/fasta" #fasta_dir
bs <- "/home/carolinafishes/software/busco/scripts/run_BUSCO.py" #path to busco_script
bd <- "/home/carolinafishes/temp/trial1/busco_results" #path to busco results directory
ed <- "/home/carolinafishes/temp/trial1/extracted" #extracted_dir
md <- "/home/carolinafishes/temp/trial1/mafft_aligned" #mafft_dir
od <- "/home/carolinafishes/temp/trial1/350_laurasiatheria_odb9" #path to orthoDB directory
ad <- "/home/carolinafishes/temp/trial1/mafft_aligned" #mafft_dir, which is a directory of aligned fastas
cpu <- 12 #number of threads to use at various steps


To explain, you need to provide the location of TOAST (td), where to store and/or find your FASTA files (fd), the location of the BUSCO script (bs), where to store your BUSCO results (bd), where to store extracted orthologs (ed), where to store alignments (md), where the ortholog database is (od), and where to store/find aligned FASTA files (ad). You also set the number of threads to use (cpu). We recommend working in one directory per project as above, but leave that choice up to you.
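
Before moving on, it can help to confirm that the paths you defined actually exist, since a typo here will only surface later. The following is a minimal base R sketch; check_paths is a hypothetical helper (not a TOAST function) and the example paths are placeholders.

```r
# Sketch: report which of the configured paths exist on disk.
# check_paths() is a hypothetical helper, not part of TOAST.
check_paths <- function(paths) {
  data.frame(
    name   = names(paths),
    path   = unname(paths),
    exists = unname(file.exists(paths)),  # TRUE for existing files or directories
    stringsAsFactors = FALSE
  )
}

# Example with placeholder paths; in practice pass
# c(td = td, fd = fd, bs = bs, bd = bd, ed = ed, md = md, od = od, ad = ad)
example_td <- file.path(tempdir(), "toast_check")
dir.create(example_td, showWarnings = FALSE)
status <- check_paths(c(td = example_td,
                        fd = file.path(example_td, "fasta")))
print(status)
```

Any row with exists equal to FALSE points to a path that needs to be created or corrected.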

Step 1 Download Data

Once you have set up your paths, we can download sequences as follows:

EntrezDownload(txid = 9721, fasta_dir = fd, minimumSeq = 350, maximumSeq = NULL)

Using a taxonomy ID and a specified FASTA directory, this will download the available sequence data for each species that has at least the number of sequences specified by minimumSeq. By default, minimumSeq is set to 350. In addition, if you simply wish to test that things are working, you can specify the maximum number of sequences to download with maximumSeq. This is only to be used for testing and is set to NULL by default.
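
After the download finishes, you may want to see how many sequences ended up in each FASTA file. The sketch below counts header lines (those starting with ">") using base R; count_fasta_seqs is a hypothetical helper, and the file extensions it matches are an assumption about how your files are named.

```r
# Sketch: count sequences per FASTA file by counting ">" header lines.
# count_fasta_seqs() is a hypothetical helper; the extension pattern is an assumption.
count_fasta_seqs <- function(fasta_dir, pattern = "\\.(fa|fasta|fna)$") {
  files <- list.files(fasta_dir, pattern = pattern, full.names = TRUE)
  counts <- vapply(files, function(f) {
    sum(startsWith(readLines(f, warn = FALSE), ">"))
  }, integer(1))
  names(counts) <- basename(files)
  counts
}

# Example on a toy file written to a temporary folder
demo_dir <- file.path(tempdir(), "fasta_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(">seq1", "ACGT", ">seq2", "GGCC"), file.path(demo_dir, "toy.fasta"))
seq_counts <- count_fasta_seqs(demo_dir)
print(seq_counts)
```

In a real project you would call count_fasta_seqs(fd) on your own FASTA directory.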

Step 2 Run BUSCO

Once you have downloaded all the sequence data, you are ready to run BUSCO.

RunBusco(fasta_dir = fd, toast_dir = td, path_to_run_busco.py = bs, path_to_orthoDB = od, threads = cpu)

This function uses the paths you set up earlier to run BUSCO with a specified number of cores. Note that this step may take some time depending on the number of sequences and taxa you are searching.

Steps 3 & 4 Parse and Extract Results


BUSCO generates a lot of results, so next we will parse out BUSCO IDs and extract the sequences of interest.

parsed_busco_results <- ParseBuscoResults(busco_dir = bd)
write.table(parsed_busco_results, file = paste0(td, "/parsed_busco_results.tsv"), sep = "\t", row.names = FALSE) 
ExtractBuscoSeqs(busco_table = parsed_busco_results, fasta_dir = fd, extract_dir = ed) #parsed_busco_results from previous step

These lines go through the BUSCO directory and parse the IDs into a table. The table is stored locally for your records using write.table and is then used to extract all orthologs.
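
Because the table is saved as a tab-separated file, it can be reloaded in a later session without rerunning the parsing step. The sketch below shows the round trip with a toy table; the column names and values are invented for illustration and do not reflect the actual columns returned by ParseBuscoResults.

```r
# Sketch: write a table as TSV and read it back with read.delim().
# The toy columns below are made up; the real table may look different.
toy_results <- data.frame(busco_id = c("EOG0001", "EOG0002"),
                          status   = c("Complete", "Missing"),
                          stringsAsFactors = FALSE)

tsv_path <- file.path(tempdir(), "parsed_busco_results.tsv")
write.table(toy_results, file = tsv_path, sep = "\t", row.names = FALSE)

# read.delim() defaults to tab-separated input with a header row
reloaded <- read.delim(tsv_path, stringsAsFactors = FALSE)
print(reloaded)
```

In a real project, point read.delim at the file written to your TOAST directory (td).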

Step 5 Align

All that is left to do is to align as follows:

MafftOrientAlign(extract_dir = ed, mafft_dir = md, threads = cpu)

This will now generate alignments of each ortholog! Now let's see how local data could be added to this process.
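
One quick sanity check on an alignment is that every sequence in the file has the same length once gaps are included. The following base R sketch tests this for a single FASTA file; read_fasta_lengths and is_aligned are hypothetical helpers (not TOAST functions) and assume a well-formed FASTA file that starts with a header line.

```r
# Sketch: check that all sequences in an aligned FASTA file have equal length.
# Both helpers are hypothetical and assume the file starts with a ">" header.
read_fasta_lengths <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # cumsum over headers assigns each sequence line to its record number
  record <- cumsum(is_header)
  lens <- tapply(nchar(lines[!is_header]), record[!is_header], sum)
  headers <- sub("^>", "", lines[is_header])
  stats::setNames(as.vector(lens), headers[as.integer(names(lens))])
}

is_aligned <- function(path) {
  length(unique(read_fasta_lengths(path))) == 1
}

# Example: one toy alignment (with a gap and a wrapped sequence) and one unaligned file
aln_path <- tempfile(fileext = ".fasta")
bad_path <- tempfile(fileext = ".fasta")
writeLines(c(">a", "ACG-T", ">b", "AC", "GTT"), aln_path)
writeLines(c(">a", "ACGT", ">b", "AC"), bad_path)
print(is_aligned(aln_path))
print(is_aligned(bad_path))
```

To check a whole project, the same function could be applied to each file listed in your mafft_aligned directory.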

Harvesting Orthologs from local FASTA files

TOAST gives you the option to use BUSCO either to assemble new alignments from online data (previous section) or to use a combination of local and public files.
This means you can use TOAST to assemble orthologs from any local set of FASTA files in a directory, using the same steps covered above. Just throw the files into the mix!

You may already have a target set of FASTA files from downloading as above, or you may have generated novel data (assembled transcriptomes, genomes, etc.). In either case, you can use TOAST to harvest BUSCO orthologs locally and either skip the download step or add public data to your own.

To do this, all you have to do is:
1) Place your FASTA file/files into the specified FASTA directory as above

2) Go through the code above, remembering to first set all of your directories and also your number of cores for BUSCO.

That’s it! Use any combination of local and public data to extract orthologs.
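
Placing local files into the FASTA directory can also be done from R. The sketch below copies files with base R's file.copy; add_local_fastas is a hypothetical helper (not a TOAST function), and the file name is a placeholder for your own data.

```r
# Sketch: copy local FASTA files into the project's FASTA directory.
# add_local_fastas() is a hypothetical helper, not part of TOAST.
add_local_fastas <- function(source_files, fasta_dir) {
  dir.create(fasta_dir, recursive = TRUE, showWarnings = FALSE)
  # overwrite = FALSE protects files that are already in fasta_dir
  copied <- file.copy(source_files, fasta_dir, overwrite = FALSE)
  stats::setNames(copied, basename(source_files))
}

# Example with a toy file; in practice fasta_dir would be fd
source_dir <- tempfile("my_data_")
dir.create(source_dir)
local_file <- file.path(source_dir, "my_transcriptome.fasta")
writeLines(c(">contig1", "ACGTACGT"), local_file)

local_fd <- tempfile("fasta_")
copy_status <- add_local_fastas(local_file, local_fd)
print(copy_status)
```

A FALSE entry in the result means the file was not copied, for example because a file with the same name already exists in the target directory.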

If you do not wish to download data, or have already downloaded it, simply omit ‘Step 1 Download Data’ above.

At this point you should be able to use BUSCO to harvest orthologs from any FASTA files, provided you are using Mac or Linux. We encourage integrating public and novel data, but caution that this can result in high levels of missing data.

In the next section we will cover how to visualize missing data patterns and use this information to assemble custom concatenated alignments that meet user-defined acceptable levels of data representation for a given problem.
