Bayesian Estimation of Species Divergence Times Integrating Fossil and Molecular Information


Because genes accumulate changes over time at a constant rate, the genetic distance between two species, measured by the number of changes accumulated, will be proportional to the time of species divergence. Thus molecules can serve as a clock, keeping time of species divergence by the accumulated changes. If fossil records or geological events can be used to assign an absolute geological time to a species divergence event on the phylogenetic tree, one can convert all calculated genetic distances into absolute geological times. This rationale for molecular clock dating has recently been extended to deal with local variation in evolutionary rate. Critical to molecular dating is the use of fossil information to calibrate the clock. In this project, we are developing statistical models and computer programs to analyze fossil and molecular data to accurately represent and incorporate the information in the fossil record in molecular dating analysis. The new methods will be applied, for example, to analyze datasets to estimate divergence times among mammals and to date viral transmission events. The project is funded by the BBSRC, UK, grant number BB/J009709/1, and it will run between March 2012 and February 2015.
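The strict-clock rationale described above can be sketched in a few lines. This is only an illustration: the function names and all numbers are hypothetical, and it assumes a perfectly constant rate, so that the distance between two species is twice the rate times their divergence time (both lineages accumulate changes).

```python
def clock_rate(pairwise_distance, divergence_time):
    """Rate implied by one calibrated divergence (substitutions/site/My).

    Two lineages each evolve for `divergence_time`, so under a strict
    clock the pairwise distance equals 2 * rate * time.
    """
    return pairwise_distance / (2.0 * divergence_time)


def divergence_time(pairwise_distance, rate):
    """Convert a pairwise genetic distance into a divergence time (My)."""
    return pairwise_distance / (2.0 * rate)


# Hypothetical calibration: a pair at distance 0.20 that a fossil dates to 100 My.
rate = clock_rate(0.20, 100.0)

# Hypothetical uncalibrated pair at distance 0.05, dated with the same clock.
print("rate:", rate)
print("estimated divergence time (My):", divergence_time(0.05, rate))
```

In this sketch the calibration is treated as exactly known. In practice both the fossil age and the distances are uncertain, which is what the Bayesian methods developed in this project account for.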

The grant holder is Prof. Ziheng Yang (University College London), who is working in collaboration with Prof. Philip CJ Donoghue (Bristol University). The named postdoctoral fellow in the grant is Mario dos Reis (University College London).

The project is now finished and this website will not be updated any more. These are a few examples of the work that was carried out during the grant, as well as work done during a previous related grant:

The uncertainty of divergence time estimates under relaxed clock models

In a paper published in Systematic Biology (early 2015, read it here) we study how uncertainty in divergence time estimates decreases with the amount of molecular sequence data under relaxed clock models. This study is an extension of our previous work in the Journal of Systematics and Evolution (here) where we study similar issues under the strict clock. With the relaxed clock, uncertainty in divergence time estimates is due to uncertainty in the fossil calibrations, uncertainty due to limited sequence data (sampling errors in the data), and uncertainty in rate variation among lineages and among loci. As the number of loci in the analysis is increased, the total uncertainty decreases until it reaches a limiting value determined by the fossil calibrations. Here the number of loci (each of which has its own relaxed clock) is the crucial factor in reducing uncertainty. Thus, analysing many genes (or loci) as separate partitions is preferable to analysing a super-gene concatenation.

Effect of the rate prior on time estimates with multiple loci

Estimating divergence times using multiple loci (multiple genes or partitions) requires constructing a joint prior on the substitution rates across loci. For example, if we are analysing 10 genes, then we would estimate 10 substitution rates, one for each gene. Virtually all Bayesian molecular dating programs use i.i.d. priors for the locus rates, that is, they assume that the substitution rates at loci are independent, identically distributed random variables from some distribution. In a paper in Systematic Biology (early 2014, read it here) we show that the innocent-looking i.i.d. rate prior is problematic, leading to overly precise estimates of divergence times, and if the rate prior is misspecified (say, if it assumes that genes evolve too fast or too slowly compared to the real rate), estimated times will converge with absolute precision to wrong values. The more loci we analyse, the worse the results will be. In our paper we implement a new rate prior based on the Dirichlet distribution that is robust to prior misspecification and that does not lead to overly precise time estimates. The new prior is implemented in MCMCTree v4.8. Users of MCMCTree will not notice any difference in usage with respect to older versions, other than that posterior time estimates will be more robust to rate prior misspecification. In our paper we also give a strategy to construct a safe i.i.d. rate prior for users of other Bayesian programs (BEAST, MrBayes and so on implement the problematic i.i.d. prior).
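One way to see why the i.i.d. prior becomes so informative is to look at what it implies for the average rate across loci. The sketch below is a simulation under stated assumptions, not the MCMCTree implementation: the gamma shape and scale, the Dirichlet concentration and the function names are all hypothetical. It compares an i.i.d. gamma prior with a Dirichlet-based construction in which a single gamma prior is placed on the mean rate and a Dirichlet distribution partitions the total rate among loci.

```python
import random
import statistics


def iid_gamma_rates(n_loci, shape, scale, rng):
    """Locus rates drawn independently from the same gamma distribution."""
    return [rng.gammavariate(shape, scale) for _ in range(n_loci)]


def gamma_dirichlet_rates(n_loci, shape, scale, conc, rng):
    """Gamma prior on the mean rate; Dirichlet split of the total among loci."""
    mean_rate = rng.gammavariate(shape, scale)
    # A symmetric Dirichlet draw obtained by normalising gamma variates.
    weights = [rng.gammavariate(conc, 1.0) for _ in range(n_loci)]
    total = sum(weights)
    # Scaling by n_loci makes the locus rates average exactly to mean_rate.
    return [n_loci * mean_rate * w / total for w in weights]


def prior_sd_of_mean_rate(sampler, n_draws=2000):
    """Prior standard deviation of the across-locus average rate."""
    means = [statistics.fmean(sampler()) for _ in range(n_draws)]
    return statistics.stdev(means)


rng = random.Random(1)
for n_loci in (1, 10, 100):
    sd_iid = prior_sd_of_mean_rate(lambda: iid_gamma_rates(n_loci, 2.0, 0.5, rng))
    sd_dir = prior_sd_of_mean_rate(
        lambda: gamma_dirichlet_rates(n_loci, 2.0, 0.5, 1.0, rng))
    print(n_loci, "loci: sd of mean rate, i.i.d. =", round(sd_iid, 3),
          " Dirichlet-based =", round(sd_dir, 3))
```

Under the i.i.d. prior the standard deviation of the mean rate shrinks roughly as one over the square root of the number of loci, so with many loci the prior pins the mean rate down almost exactly, whether or not it is right. Under the Dirichlet-based construction the spread of the mean rate does not depend on the number of loci.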

Virus dating

The MCMCTree program for dating now implements models to obtain divergence times among molecular sequences that have been serially sampled in time. The most common case is virus dating. The program implements a new time prior based on the birth-death-sequential-sampling (BDSS) process. An important finding is that estimation of deep divergence times (say the root of the phylogeny) can be very sensitive to the parameters of the BDSS prior, even with well serially sampled sequences. You can read the paper by Tanja Stadler and Ziheng Yang published in Systematic Biology in April 2013 here.

The unbearable uncertainty of Bayesian divergence time estimation

Bayesian estimation of divergence times of extant species using molecular data is an unconventional statistical problem. Molecular data provide information only about the distances among species on a phylogeny, but not about the geological ages of groups of species nor the molecular evolutionary rate. Information from the fossil record is needed to convert molecular distances into divergence time estimates. This means that as the amount of molecular data is increased, the uncertainty in estimated divergence times does not approach zero, but approaches a limiting value determined by the uncertainty in the fossil calibrations. In this project we extend the mathematical theory of the uncertainty in divergence time estimation. In particular, we show that the uncertainty in time estimates approaches its theoretical limit at the rate 1/n, where n is the sample size (the number of sites in the molecular sequence alignment). We also studied the effect of conflicting fossils and sequence data on the precision of time estimates. You can read the paper here. The paper was featured on the cover of the January 2014 issue of the Journal of Systematics and Evolution.
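The limiting case can be illustrated with a toy calculation. The sketch below assumes a strict clock and an infinitely long alignment, so that the height of every node in substitutions per site is known without error, and it takes the remaining uncertainty in the root age as a given interval. The node names, heights and interval are hypothetical.

```python
def node_age_intervals(node_heights, root_height, root_age_interval):
    """Scale the root-age interval to every other node.

    With node heights known exactly, each node age is a fixed fraction
    (height / root_height) of the root age, so the only uncertainty left
    is the one-dimensional uncertainty in the root age itself.
    """
    low, high = root_age_interval
    intervals = {}
    for name, height in node_heights.items():
        fraction = height / root_height
        intervals[name] = (low * fraction, high * fraction)
    return intervals


# Hypothetical node heights (substitutions per site) and root-age interval (My).
heights = {"root": 0.10, "node_A": 0.05, "node_B": 0.02}
for name, (low, high) in node_age_intervals(heights, 0.10, (80.0, 120.0)).items():
    midpoint = 0.5 * (low + high)
    print(name, "interval:", (low, high),
          "width / midpoint:", (high - low) / midpoint)
```

Every node ends up with the same relative interval width. More sequence data cannot narrow these intervals any further in this idealised setting; only better fossil calibrations can.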

The timescale of mammalian phylogeny

The fossil record shows the sudden appearance of placental mammals 65 My ago, soon after the Cretaceous-Paleogene mass extinction event when 76% of species, including non-avian dinosaurs, died out. Some molecular studies have placed the origin and diversification of placental mammals deep in the Cretaceous, before the mass extinction event. On the other hand, palaeontological analyses of the fossil record indicate a diversification peak of placental orders in a short 16 My post-Cretaceous window. The discrepancies between estimates of mammalian divergence between molecular and palaeontological studies have been considered unacceptably large. We revised the time-line of mammalian diversification by Bayesian analysis of a large alignment (21 million sites) from 36 mammalian genomes, together with mitochondrial data from 274 mammal species. We used 26 soft, non-limiting fossil constraints to calibrate the molecular clock. Our analysis indicates that intra-ordinal diversification of placental mammals occurred in a 20 My post-Cretaceous window, in accordance with palaeontological estimates. On the other hand, we find that the last common ancestor of placentals originated in the Cretaceous (90-88 My ago). We find that genomic data reduce uncertainty in divergence time estimates towards the theoretical limit of precision and allow us to confidently reject a pre-extinction diversification of placentals. Other famous controversies of mismatch between palaeontological and molecular time estimates, such as the origin of animal phyla or of flowering plants, are likely to be resolved with a genomic-scale approach.

You can read the full paper published in May 2012 in Proceedings of the Royal Society B here, and a comment in EvoDevo here. You can also download a large dated mammal phylogeny (274 species) here. The alignment and trees with fossil calibrations are here. Below is a simplified figure that summarises the finding of the project: nearly all modern crown placental orders postdate the K-Pg event, in accordance with a rapid diversification of mammals after the mass extinction.
[Figure: Timescale of mammal phylogeny]

In early 2013, O'Leary and colleagues (Science, 339: 662) published an analysis of a large morphological matrix of extant and fossil mammal species to reconstruct and date the placental mammal ancestor. They incorrectly treated fossil ages as divergence time estimates for crown groups. For example, they recognise Protungulatum donnae (giving it an age of 64.85 My) as the oldest placental mammal fossil. They then estimate the origin of placentals to be the same as the age of P. donnae, therefore concluding, wrongly, that Placentalia originated 64.85 My ago after the K-Pg extinction. The molecular data are not compatible with such a young age for Placentalia. Such an overly simplistic interpretation of the fossil record is naive and misleading, and ignores the integrative work that palaeontologists and molecular biologists have carried out on the theory of divergence time estimation during the past two decades. We published a rebuttal in Biology Letters in January 2013 here. Our critique was discussed by Nature here, by The Scientist here and by Discovery News here.

Approximate likelihood calculation for Bayesian estimation of species divergence times

Bayesian estimation of divergence times from molecular data is computationally expensive. The posterior distribution of divergence times cannot be calculated analytically, so we must rely on Markov chain Monte Carlo (MCMC) methods to obtain an approximation to the posterior distribution. Calculation of the log-likelihood of a molecular sequence alignment during MCMC is the most expensive part of the posterior computation. Thorne and colleagues (1998, MBE 15:1647) proposed a normal approximation to the log-likelihood function to speed up computation during the MCMC. However, the accuracy of this approximation was never fully tested. In this project we performed a detailed mathematical analysis of the approximation and suggested using variable transforms to increase accuracy. Extensive testing using real molecular data indicates that the approximation is highly reliable. Our new arcsine-based transform is now the default approximate likelihood method in the program MCMCTREE. You can read the paper, published in MBE in February 2011, about the approximate likelihood method here. With the traditional exact method, Bayesian estimation of the divergence times of a few species may take from a few days to months, while with the approximate method analyses taking one to two days are possible. The exact method becomes prohibitively expensive for genome-scale data. The approximate method in MCMCTREE allowed us to estimate the divergence times of mammal species using an alignment of over 21 million sites, and to solve a long-standing controversy in molecular evolution (see above and here).
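The idea behind the normal approximation can be shown on the smallest possible case. The sketch below is a toy under stated assumptions, not the MCMCTREE code: it uses a single pairwise distance under the JC69 model, applies a second-order Taylor expansion of the log-likelihood about the maximum likelihood estimate, and does not apply any variable transform. The function names and counts are hypothetical.

```python
import math


def jc69_loglik(d, n_diff, n_sites):
    """Exact JC69 log-likelihood of distance d, up to an additive constant."""
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # probability that a site differs
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)


def jc69_mle(n_diff, n_sites):
    """Maximum likelihood estimate of the distance (needs n_diff/n_sites < 0.75)."""
    p_hat = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def jc69_curvature_at_mle(n_diff, n_sites):
    """Second derivative of the log-likelihood with respect to d at the MLE."""
    p_hat = n_diff / n_sites
    dp_dd = 1.0 - 4.0 * p_hat / 3.0  # derivative of p with respect to d at the MLE
    return -n_sites / (p_hat * (1.0 - p_hat)) * dp_dd ** 2


def approx_loglik(d, n_diff, n_sites):
    """Quadratic (normal) approximation; the gradient vanishes at the MLE."""
    d_hat = jc69_mle(n_diff, n_sites)
    curvature = jc69_curvature_at_mle(n_diff, n_sites)
    return jc69_loglik(d_hat, n_diff, n_sites) + 0.5 * curvature * (d - d_hat) ** 2


# Hypothetical data: 100 differences observed in 1000 aligned sites.
d_hat = jc69_mle(100, 1000)
for d in (0.9 * d_hat, d_hat, 1.1 * d_hat):
    print(round(d, 4), jc69_loglik(d, 100, 1000), approx_loglik(d, 100, 1000))
```

The MLE and curvature are computed once, before the MCMC starts, and each later evaluation is simple arithmetic that does not touch the alignment. This is where the speed-up comes from when the same idea is applied to all branch lengths of a tree.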

The impact of fossil calibrations on divergence time estimation

The construction of a multidimensional prior in Bayesian statistics is a complicated problem. Bayesian estimation of species divergence times involves the construction of the joint prior for the divergence times. For a set of s species, there are s-1 times to be estimated, and an (s-1)-dimensional joint time prior must be constructed. Different Bayesian divergence time estimation programs (such as MULTIDIVTIME, BEAST or MCMCTREE) use different strategies to integrate fossil information into a phylogeny in order to construct the time prior. These different strategies may lead to different priors from seemingly similar fossil calibration densities. In turn, the different priors may lead to surprisingly different posterior time estimates for the different programs, despite the use of the same molecular data and fossil information. In this project a detailed mathematical analysis of the construction of the time prior in MCMCTREE and MULTIDIVTIME was made. This work highlights the critical importance of calculating the time prior explicitly and comparing it to the fossil calibration densities. You can read the full paper in Syst. Biol. (first published in Nov 2009) by Jun Inoue et al. here. (Also, you may want to look at fig. 3 in our mammal paper for an example comparing the marginal time prior and the fossil calibration densities.)
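A small simulation shows why the marginal time prior can differ from the calibration densities the user specified. The sketch below is illustrative only and does not reproduce the strategy of MCMCTREE, MULTIDIVTIME or BEAST: it takes two hypothetical uniform calibrations, one on the root and one on a descendant node, and simply discards draws in which the descendant is older than its ancestor.

```python
import random
import statistics


def sample_joint_time_prior(root_bounds, child_bounds, n_samples, rng):
    """Rejection-sample (root, child) ages subject to root older than child."""
    kept = []
    while len(kept) < n_samples:
        root = rng.uniform(*root_bounds)
        child = rng.uniform(*child_bounds)
        if root > child:  # an ancestor must be older than its descendant
            kept.append((root, child))
    return kept


rng = random.Random(7)
# Hypothetical calibrations in My: root ~ U(60, 100), child ~ U(50, 90).
draws = sample_joint_time_prior((60.0, 100.0), (50.0, 90.0), 20000, rng)
root_mean = statistics.fmean(r for r, _ in draws)
child_mean = statistics.fmean(c for _, c in draws)
print("calibration means:     root 80.0, child 70.0")
print("marginal prior means:  root", round(root_mean, 1),
      " child", round(child_mean, 1))
```

The constraint alone pushes the marginal prior of the root towards older ages and that of the descendant towards younger ages, even though no data have been used. Comparing the marginal prior with the calibration densities in this way, by running the analysis without sequence data, reveals such shifts before they affect the posterior.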

For any questions about this web-page contact Mario dos Reis or Ziheng Yang.

Last updated February 2015.