Whale breach in Antarctica

Considering phylogenetic information in legacy data

While cruising around the Antarctic I assisted Yale Ph.D. student Elyse Parker in analyzing the results of her M.S. work. The resulting paper was recently published and concerns the need to hold inferences based on legacy markers to the same level of scrutiny as next-gen datasets. I thought I would revisit the topic here.

Opening cold cases to end incongruence

As we move further into the genomics era, a recurring research theme has been to contrast inferences based on hundreds or thousands of markers with historical inferences based on a handful of legacy markers. This contrast is of course necessary to reveal both conflict and congruence in relationships.

However, sources of conflict in legacy data are rarely scrutinized while newer datasets are often subjected to a high level of scrutiny through increasingly sophisticated methods.

Our manuscript points out that there is a missed opportunity here. Why not also scrutinize available legacy data to see what relationships we have confidence in?

Using a series of markers commonly used in fish phylogenetics we did just that for the radiation of goodeines, a really cool group of freshwater fishes.

We find that there simply is not much information in the markers used for resolving the recent rapid divergences of this clade.

What do we know about goodeine relationships? About this much. We know the major clades, but the legacy data offer little justifiable evidence that strongly supports their inter-relationships.

Looking closely within each marker reveals codon positions that either lack variable sites, and thereby contribute little to inference, or show high levels of GC3 bias. Scrutinizing the same markers for lack of resolution and uncertainty in triggerfish and squirrelfish relationships revealed the same pathologies.
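For readers who want a quick first look at their own legacy alignments, here is a minimal Python sketch of the two checks mentioned above: the fraction of variable sites at each codon position, and GC content at third codon positions (GC3) for each taxon. This is not the pipeline from our manuscript. The function names and the toy alignment are hypothetical, and the sketch assumes an in-frame, gap-aligned coding sequence stored as a plain dictionary.

```python
def codon_position_summary(alignment):
    """Summarize variability and GC content per codon position.

    `alignment` is a hypothetical dict mapping taxon name to an aligned,
    in-frame coding sequence. Gaps and ambiguity codes are ignored.
    """
    seqs = [s.upper() for s in alignment.values()]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    summary = {}
    for pos in range(3):
        columns = range(pos, length, 3)
        variable = gc = total = 0
        for i in columns:
            bases = [s[i] for s in seqs if s[i] in "ACGT"]
            if len(set(bases)) > 1:  # more than one state observed
                variable += 1
            gc += sum(b in "GC" for b in bases)
            total += len(bases)
        n_sites = len(columns)
        summary[pos + 1] = {
            "sites": n_sites,
            "variable_fraction": variable / n_sites if n_sites else 0.0,
            "gc_content": gc / total if total else 0.0,
        }
    return summary


def gc3_by_taxon(alignment):
    """GC content at third codon positions for each taxon."""
    result = {}
    for name, seq in alignment.items():
        third = [b for b in seq.upper()[2::3] if b in "ACGT"]
        result[name] = sum(b in "GC" for b in third) / len(third) if third else 0.0
    return result


if __name__ == "__main__":
    # Toy alignment, made up for illustration only.
    toy = {
        "taxon_a": "ATGGCCAAG",
        "taxon_b": "ATGGCGAAA",
        "taxon_c": "ATGGCTAAG",
    }
    for position, stats in codon_position_summary(toy).items():
        print(position, stats)
    print(gc3_by_taxon(toy))
```

A codon position with a variable fraction near zero has little to say about relationships, and a wide spread of GC3 values across taxa is a hint that compositional bias deserves a closer look. Neither number is a formal test, just a cheap way to flag markers worth scrutinizing further.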

Figure from our manuscript summarizing the extent of phylogenetic information in these markers.

These results are not terribly surprising. The perceived utility of legacy markers was not based on experimental design, but largely on their ability to be readily sequenced and thereby supply data for investigators. Nearly 20 years ago Nick Goldman published a paper referring to this practice as selecting markers based on “folklore”.

At the time, this was the best we had as phylogeneticists. But now, as we move toward resolving the Tree of Life, assessing legacy data is a computationally trivial way (by comparison with genomic analyses) to determine whether a conflicting historical topology is based on strong evidence, noise, or bias.

This is particularly important for legacy datasets spanning deeper timescales.

Informativeness profiles and heatmaps of the probability of correct resolution for a quartet (QIRP) for legacy markers used in leech phylogenetics.

I recently aided Bronwyn Williams in an analysis of leech inter-relationships and found that the “missing link” that has been used to explain the evolution of leeches is a highly derived marine leech. The legacy markers were simply noisy and likely driving erroneous placement of this taxon outside of “true leeches”.

I would like to add that this is not a condemnation of historical data at all. In fact, I would argue that huge portions of the Tree of Life that were previously resolved have remained virtually unchanged as we added more data. However, there is an opportunity here to assess where we are and are not confident in our inferences.
