PhyInformR

Overview

This section covers the basics of installation and how to obtain site rates using IQTREE2. It also demonstrates how to import this external data and lists the dependencies PhyInformR uses.

Once you have imported your site rate file and your guide tree, this section shows how to generate PI profiles.


Installation

PhyInformR is easy to install. Note that it is currently not on CRAN but will be back following a major update. In the meantime, install it via devtools, or download the compressed R script from GitHub and install it manually.

## CRAN install is currently not available
# install.packages("PhyInformR")
# library(PhyInformR)

To install from GitHub:
library(devtools)
install_github("carolinafishes/PhyInformR")
library(PhyInformR)
							

FOR WINDOWS USERS: devtools will not install dependencies on certain versions of Windows. This is being addressed and should be fixed in the next major release. If your install fails, please first install the dependencies through CRAN and then use devtools as above for the final install.
install.packages("doParallel")
install.packages("phytools")
# splines ships with base R and does not need a separate install
install.packages("gplots")
install.packages("RColorBrewer")
install.packages("foreach")
install.packages("iterators")
install.packages("geiger")
install.packages("gridExtra")
install.packages("hexbin")
install.packages("PBSmodelling")
install.packages("ggplot2")
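
If you would rather install only what is missing, a minimal sketch along the following lines may help. The helper names missing_packages and install_missing are hypothetical and not part of PhyInformR, and splines is left out of the list because it ships with base R.

```r
# Hypothetical helper (not part of PhyInformR): return the packages in
# `pkgs` that are not yet installed in the current R library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the missing packages from CRAN.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Dependencies named in this section (splines omitted, it is a base package).
deps <- c("doParallel", "phytools", "gplots", "RColorBrewer", "foreach",
          "iterators", "geiger", "gridExtra", "hexbin", "PBSmodelling",
          "ggplot2")

# Uncomment to run the install step:
# install_missing(deps)
```

After this step, install PhyInformR itself with devtools as shown above.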
							
Once you load PhyInformR, set the number of cores at the start of your session to enable parallel processing later if desired.
We will also be hosting more sample data through Zenodo archives and GitHub to explore new features as we develop them, so check back often!

Dependencies

PhyInformR is built upon the efforts of several other R packages including:
phytools
splines
gplots
RColorBrewer
foreach
iterators
geiger
doParallel
gridExtra
hexbin
ggplot2
PBSmodelling
						
Several functions in PhyInformR use parallel processing. Enable this via
library(doParallel)
# set the number of cores if you are working in parallel
registerDoParallel(cores=8)
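
The value 8 above suits a machine with at least eight cores. As a hedged alternative, the sketch below derives the core count from the machine itself using detectCores from the base parallel package. The helper name pick_cores and the choice to leave one core free are assumptions, not PhyInformR conventions.

```r
library(parallel)  # ships with base R

# Hypothetical helper: leave `reserve` cores free for the rest of the system.
# detectCores() can return NA on some platforms, so fall back to one core.
pick_cores <- function(reserve = 1) {
  available <- detectCores()
  if (is.na(available)) {
    return(1L)
  }
  max(1L, as.integer(available) - as.integer(reserve))
}

n_cores <- pick_cores()

# Register the parallel backend only if doParallel is installed.
if (requireNamespace("doParallel", quietly = TRUE)) {
  doParallel::registerDoParallel(cores = n_cores)
}
```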
					
Now set your working directory so that output files are saved there:
setwd("~/Documents/phyinformR")
					
Informativeness Profiles
Townsend's phylogenetic informativeness profiles are a visual tool for assessing the predicted utility of a given sequence for phylogenetic inference across a timescale of interest. Use of this method requires two inputs: site rates and a guide tree.
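
For intuition about what a profile measures, Townsend (2007) gives the informativeness of a single site evolving at rate λ at time T as 16 λ^2 T exp(-4 λ T), and a profile sums this quantity over sites. The sketch below evaluates that published formula directly. It is an illustration only, not PhyInformR's internal code; the function name site_pi_profile and the example rates are made up.

```r
# Hypothetical helper: evaluate Townsend's (2007) per-site formula
# 16 * rate^2 * T * exp(-4 * rate * T), summed over all sites, at each time.
site_pi_profile <- function(rates, times) {
  sapply(times, function(t) sum(16 * rates^2 * t * exp(-4 * rates * t)))
}

# Made-up rates and a made-up time axis, for illustration only.
example_rates <- c(0.001, 0.003, 0.01)
times <- seq(0, 100, by = 1)

profile <- site_pi_profile(example_rates, times)

# Time at which the summed profile is highest.
peak_time <- times[which.max(profile)]
```

For a single site the curve peaks at T = 1/(4λ), so faster sites peak closer to the present and slower sites peak deeper in the tree.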

Site rates can be obtained through a variety of software applications such as HyPhy, Rate4Site, or DNArates. Currently the easiest way to obtain maximum likelihood estimates of site rates is through IQTREE2, using the following command in a terminal:
./iqtree2 -s your_data_file -p your_partition_file -te your_tree_file -m chosen_substitution_model -blfix --mlrate 
					
Here you are providing your alignment (-s), an optional file of partitions such as genes (-p), your ultrametric guide tree (-te), and a substitution model (-m). The remaining flags keep the branch lengths of that tree fixed (-blfix) and request maximum likelihood site rate estimates (--mlrate). The resulting output file is a table that can be read directly into R. It lists each site in one column, the partition in the next (if partitions were supplied), and the estimated rate of each site.
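
As a rough sketch of that import step, the function below reads such a table with read.table and returns the rates as a one-column matrix. It assumes the file has a header row, that any comment lines begin with #, and that the rate is the last column. The function name read_site_rates and the file name in the usage comment are hypothetical, so check both against your own IQTREE2 output.

```r
# Hypothetical helper: read a site rate table and return the rates
# as a one-column numeric matrix.
# Assumptions: header row present, comment lines start with "#",
# and the rate is stored in the last column.
read_site_rates <- function(path) {
  tab <- read.table(path, header = TRUE, comment.char = "#",
                    stringsAsFactors = FALSE)
  rates <- tab[[ncol(tab)]]
  stopifnot(is.numeric(rates), !anyNA(rates))
  as.matrix(rates)
}

# Example call (the file name is an assumption):
# rr <- read_site_rates("your_data_file.mlrate")
```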

For this walkthrough, we will be using the avian tree and site rates from Prum et al. [3] that are distributed with PhyInformR:
tree<-read.tree(system.file("extdata","Prumetal_timetree.phy",package="PhyInformR"))
rr<-as.matrix(prumetalrates)
					

Now you can plot phylogenetic informativeness profiles (Townsend 2007) using only a single line of code:
informativeness.profile(rr,tree, codon="FALSE", values="off")
					

Exploring Data with PI Profiles

Let's do something different and partition the data by site rates. First we will view the rates:
hist(rr)

We can see a bit of a tail in the distribution. Let's see what happens when we partition the data into rates below and above 0.003. We'll start by defining rate-based breaks in our data so that we can compare the PI of "fast" versus "slow" sites:
lower<-c(0,0.003)
upper<-c(0.003000001,10)
breaks<-cbind(lower,upper)
					
PhyInformR has a function that allows profiles to be broken at any point along the rate vector, so you can assess how phylogenetic informativeness changes when the dataset is thresholded at that rate:
multi.profile(rr,tree, breaks)

Partition 1 represents the slower site rates. As expected, the decay in phylogenetic informativeness for partition 1 is much slower across the tree than for partition 2. Conversely, the faster sites in partition 2 are informative for recent divergences, yet show a rapid decline in informative site patterns as we move to deeper portions of the tree.
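
When comparing partitions like these, it can help to check how many sites fall on each side of the threshold, since a partition with very few sites will contribute little if profiles are summed over sites (an assumption about how the curves are scaled). The sketch below counts sites per class. The helper name count_by_threshold and the example rates are made up; with real data you would pass rr instead.

```r
# Hypothetical helper: count sites at or below, and above, a rate threshold.
count_by_threshold <- function(rates, threshold) {
  rates <- as.numeric(rates)
  c(slow = sum(rates <= threshold), fast = sum(rates > threshold))
}

# Made-up rates for illustration only.
example_rates <- c(0.0005, 0.001, 0.0029, 0.003, 0.0045, 0.02)
site_counts <- count_by_threshold(example_rates, 0.003)
print(site_counts)
```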

The above examples serve to illustrate what PhyInformR does, but this approach is not common practice. Instead, it is more common to work with character sets partitioned by the loci you wish to evaluate. In this case, simply use the same approach as above to define your loci and use defined.multi.profile.

In this example we will compare locus 1, which spans sites 1-1594 in the alignment, and locus 2, which spans sites 1595-2787:
Lower<-c(1,1594)
Upper<-c(1595,2787)
Breaks<-cbind(Lower,Upper)
defined.multi.profile(rr,tree,Breaks, values="off")
				

In this example the two loci are very similar.
Using this logic we can break datasets into partitions of any size we wish to evaluate. Feel free to give this a whirl with the other trees included in the GitHub repo or with data from some of our other studies (Dornburg et al. 2015; Dornburg et al. 2014) to get comfortable.
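
If your loci are described by their lengths rather than by alignment coordinates, the start and end sites can be derived with cumsum. The sketch below is one way to do this. The helper name locus_bounds is hypothetical, and the result mirrors the layout of the Breaks matrix above, with one column per locus holding its first and last site. Whether defined.multi.profile accepts more than two columns is not confirmed here.

```r
# Hypothetical helper: turn locus lengths into first/last alignment sites.
# Returns a matrix with one column per locus (row 1 = start, row 2 = end).
locus_bounds <- function(lengths) {
  ends <- cumsum(lengths)
  starts <- ends - lengths + 1
  rbind(start = starts, end = ends)
}

# Two loci of 1594 and 1193 sites reproduce the coordinates used above:
# locus 1 spans sites 1-1594 and locus 2 spans sites 1595-2787.
Breaks <- locus_bounds(c(1594, 1193))
print(Breaks)
```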

Next Steps
Now, how about visualizing signal or noise probabilities across a tree?