
Walking Through PhyInformR, Part 1: Installation & PI Profiles

Learn how to conduct phylogenetic informativeness analyses in R, Part 1


This is a three-part tutorial. Part 1 focuses on the generation of PI profiles.

Part 2 focuses on advanced features, visualizations, and calculations of quartet resolution probabilities.

Part 3 focuses on visualization.

1. Installation

PhyInformR is easy to install. Install it from CRAN, or download the compressed R script from GitHub and install it manually.

##cran install
install.packages("PhyInformR")
library(PhyInformR)


To install from GitHub:

library(devtools) 
install_github("carolinafishes/PhyInformR")
library(PhyInformR)


FOR WINDOWS USERS - devtools will not install dependencies on certain versions of Windows. This is being addressed and should be fixed in the next major release. If your install fails, first install the dependencies from CRAN and then use devtools as above for the final install.

install.packages("doParallel")
install.packages("phytools")
# splines ships with base R, so it does not need to be installed from CRAN
install.packages("gplots")
install.packages("RColorBrewer")
install.packages("foreach")
install.packages("iterators")
install.packages("geiger")
install.packages("gridExtra")
install.packages("hexbin")
install.packages("PBSmodelling")
install.packages("ggplot2")
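
If you would rather not reinstall packages you already have, the dependency step can be sketched as below. This is a minimal sketch rather than part of PhyInformR: the helper name missing_packages is hypothetical, and splines is left out of the list because it ships with base R.

```r
# Dependencies listed above, minus splines (part of base R)
deps <- c("doParallel", "phytools", "gplots", "RColorBrewer", "foreach",
          "iterators", "geiger", "gridExtra", "hexbin", "PBSmodelling",
          "ggplot2")

# Hypothetical helper: return the packages in pkgs that are not installed
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

to_install <- missing_packages(deps)

# Only contact CRAN when something is missing and the session is interactive
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

After this runs, the devtools install shown earlier should only need to fetch PhyInformR itself.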

Once you load PhyInformR, set the number of cores at the start of your session to enable parallel processing later, if desired.

We will also be hosting more sample data through Zenodo archives and GitHub to explore new features as we develop them, so check back often!

2. Dependencies

PhyInformR is built upon the efforts of several other R packages including:

phytools
splines
gplots
RColorBrewer
foreach
iterators
geiger
doParallel
gridExtra
hexbin
ggplot2
PBSmodelling

Several functions in PhyInformR use parallel processing. Enable this via:

library(doParallel) 
#set the number of cores if you are working in parallel 
registerDoParallel(cores=8)
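
The value 8 above is only an example. If you are unsure how many cores your machine has, one option is to ask R and leave one core free, as in this minimal sketch. The variable names n_detected and n_cores are illustrative, and leaving one core free is a common habit rather than a PhyInformR requirement.

```r
library(parallel)  # ships with base R

# detectCores() can return NA on some platforms, so fall back to 1
n_detected <- detectCores()
n_cores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

# Register the backend only if doParallel is installed
if (requireNamespace("doParallel", quietly = TRUE)) {
  doParallel::registerDoParallel(cores = n_cores)
  # Report how many workers foreach will use
  print(foreach::getDoParWorkers())
}
```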

Now set your working directory to save files:

setwd("~/Documents/phyinformR")

3. Phylogenetic Informativeness Profiles


Townsend’s phylogenetic informativeness profiles are a visual tool that enables assessment of the predicted utility of a given sequence for phylogenetic inference across a timescale of interest. Use of this method requires two inputs: site rates and a guide tree.
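
To make the idea of a profile concrete, the per-site informativeness curve from Townsend (2007), 16 * lambda^2 * T * exp(-4 * lambda * T) for a site evolving at rate lambda evaluated at time T, can be sketched in base R. This illustrates the published formula with hypothetical rates; it is not a description of how informativeness.profile is implemented internally, and site_pi and example_rates are names invented for this sketch.

```r
# Per-site informativeness at time t for a site with rate lambda
# (formula from Townsend 2007)
site_pi <- function(lambda, t) {
  16 * lambda^2 * t * exp(-4 * lambda * t)
}

# Hypothetical site rates, in substitutions per site per unit time
example_rates <- c(0.0005, 0.001, 0.003, 0.01)

# Evaluate the summed profile over a hypothetical timescale
times <- seq(0, 500, by = 1)
profile <- vapply(times,
                  function(t) sum(site_pi(example_rates, t)),
                  numeric(1))

plot(times, profile, type = "l",
     xlab = "Time", ylab = "Phylogenetic informativeness")
```

Each site's curve peaks at T = 1/(4 * lambda), which is why fast sites tend to be most informative for recent divergences and slow sites for deeper ones.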

Site rates can be obtained through a variety of software applications such as HyPhy, Rate4Site, or DNArates. The PhyDesign web interface [2] makes quantifying site rates easy:

1) Navigate to http://phydesign.townsend.yale.edu/
2) Upload an alignment and ultrametric tree
3) Choose your program for estimating rates from a dropdown
4) Wait for the email that your results are ready
Once you have site rates, use the “c” function in R to format them as a numeric vector. You are then ready to explore your data:

 mysiterates<-c(0.00034, 0.005678, 0.0,..., 0.008967)
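
Typing hundreds of rates by hand is error-prone, so you may prefer to read them from a file. The sketch below assumes a plain-text file with whitespace-separated rates; your rate program's actual output format may differ. The file created here is a temporary stand-in holding a few placeholder values so the example runs on its own.

```r
# Write a small hypothetical rate file so the example is self-contained
rate_file <- tempfile(fileext = ".txt")
writeLines(c("0.00034", "0.005678", "0.0", "0.008967"), rate_file)

# scan() reads whitespace-separated numbers into a numeric vector
mysiterates <- scan(rate_file, what = numeric(), quiet = TRUE)

# Basic sanity checks before plotting profiles
stopifnot(is.numeric(mysiterates), !anyNA(mysiterates), all(mysiterates >= 0))

# Number of sites read
length(mysiterates)
```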

Getting Started

For this walkthrough, we will be using the avian tree and site rates from Prum et al. [3] that are distributed with PhyInformR:

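# Load the avian time tree shipped with the package
read.tree(system.file("extdata","Prumetal_timetree.phy",package="PhyInformR"))->tree
# Convert the bundled site rates to a matrix
as.matrix(prumetalrates)->rr
# Plot the PI profile for the full alignment
informativeness.profile(rr,tree, codon="FALSE", values="off")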


Easy! You have now made a phylogenetic informativeness profile (Townsend 2007). To obtain PI profiles for each codon position, set codon="TRUE" if your alignment is in reading frame:

informativeness.profile(rr,tree, codon="TRUE")

If you would like PhyInformR to output branching times and PI values, simply set values="on".

Exploring Data with PI Profiles


Let’s do something different and partition the data by site rates. First we will view the rates:

 hist(rr) 

The histogram shows a bit of a tail, so let’s see what happens when we partition the data into rates above and below 0.003. By defining rate-based breaks in our data, we can see the PI of “fast” versus “slow” sites. We’ll start by creating some partitions:

lower<-c(0,0.003) 
upper<-c(0.003000001,10) 
cbind(lower,upper)->breaks
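
Before plotting, it can help to check how many sites land in each partition. The sketch below uses simulated rates so that it runs without the package, and it assumes each column of the breaks matrix gives the minimum and maximum rate of one partition, as in the code above. The names sim_rates and sites_per_partition are illustrative only.

```r
# Simulated, right-skewed site rates standing in for real data
set.seed(1)
sim_rates <- rexp(1000, rate = 1 / 0.002)

# Same break definition as above: one column per partition
lower <- c(0, 0.003)
upper <- c(0.003000001, 10)
breaks <- cbind(lower, upper)

# Count the sites whose rate falls inside each partition's range
sites_per_partition <- apply(breaks, 2, function(b) {
  sum(sim_rates >= b[1] & sim_rates <= b[2])
})
sites_per_partition
```

A partition with very few sites will produce a flat, uninformative profile, so a quick count like this is a cheap way to choose a sensible threshold.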


PhyInformR has a function, multi.profile, that splits profiles at any point in the rate vector so you can assess how phylogenetic informativeness changes when the dataset is thresholded at that rate:

multi.profile(rr,tree, breaks)

Partition 1 represents the slower site rates. As expected, phylogenetic informativeness decays much more slowly across the tree for partition 1 than for partition 2. Conversely, the faster sites in partition 2 are informative for recent divergences, yet show a rapid decline in informative site patterns as we move to deeper portions of the tree.
The examples above illustrate what PhyInformR does, but partitioning by rate is not common practice. It is more common to work with character sets partitioned by the loci you wish to evaluate. In that case, use the same approach as above to define your loci and call defined.multi.profile.
In this example we will compare locus 1, which spans sites 1-1594 in the alignment, and locus 2, which spans sites 1595-2787:

Lower<-c(1,1594)
Upper<-c(1595,2787) 
Breaks<-cbind(Lower,Upper) 
defined.multi.profile(rr,tree,Breaks, values="off")
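
If you know locus lengths rather than start and end positions, the break matrix can be built with cumsum instead of being typed by hand. This is a minimal base R sketch: the length of locus 2 (1193) is derived from the site range 1595-2787 above, and the names locus_lengths, starts, and ends are illustrative. It assumes defined.multi.profile only reads the values of the matrix, which has not been checked here.

```r
# Locus lengths in alignment order (derived from the site ranges above)
locus_lengths <- c(locus1 = 1594, locus2 = 1193)

# Last and first site of each locus
ends <- cumsum(locus_lengths)
starts <- c(1, head(unname(ends), -1) + 1)

# One column per locus: row 1 = first site, row 2 = last site
Breaks <- rbind(starts, ends)
colnames(Breaks) <- names(locus_lengths)
Breaks
```

The resulting columns hold the same values as Lower and Upper in the hand-written version.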

In this example the two loci are very similar.
Using this logic, we can break datasets into partitions of any size we wish to evaluate. Feel free to give this a whirl with the other trees included in the GitHub repo from our recent studies (Dornburg et al. 2015; Dornburg et al. 2014) to get comfortable. Now, how about visualizing signal or noise probabilities across a tree?

Continue to Part 2
