Jessica Vamathevan

Department of Biology
Statistical Genetics

Phylogenomics Glossary

Bootstrap
A statistical method by which distributions that are difficult to calculate exactly can be estimated by the repeated creation and analysis of artificial datasets.
In the non-parametric bootstrap, these datasets are generated by resampling (with replacement) from the original data, whereas in the parametric bootstrap, the data are simulated according to the hypothesis being tested. The name derives from the near-miraculous way in which the method can ‘pull itself up by its bootstraps’ and generate statistical distributions from almost nothing.
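As a minimal sketch of the non-parametric bootstrap in a phylogenetic setting, the example below resamples the columns of a toy alignment with replacement and recomputes a simple statistic on each pseudo-replicate. The alignment, the function names, and the choice of statistic (the proportion of differing sites between the first two sequences) are all assumptions made for illustration.

```python
import random

def resample_columns(alignment, rng):
    """Build one pseudo-replicate by drawing alignment columns with replacement."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def bootstrap_p_distances(alignment, n_replicates=200, seed=1):
    """Recompute the p-distance of the first two sequences on each replicate."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_replicates):
        replicate = resample_columns(alignment, rng)
        distances.append(p_distance(replicate[0], replicate[1]))
    return distances

# Hypothetical toy alignment of two sequences (made-up data).
toy_alignment = ["ACGTACGTAC", "ACGTTCGTAA"]
replicates = bootstrap_p_distances(toy_alignment)
print("observed:", p_distance(*toy_alignment))
print("bootstrap range:", min(replicates), max(replicates))
```

The spread of the replicate values gives a rough picture of the sampling distribution of the statistic without any formula for that distribution.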

Consistency
A statistical estimation method has the property of consistency when the estimate of a quantity is certain to converge to its true value as more and more data are accumulated.
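The idea can be illustrated with a small simulation, offered here only as a sketch: the sample proportion is a consistent estimator of a true proportion, so its error tends to shrink as the number of observations grows. The true value, the sample sizes, and the function name are assumptions chosen for the example.

```python
import random

def estimate_proportion(true_p, n_observations, rng):
    """Estimate a proportion from n simulated yes/no observations."""
    hits = sum(rng.random() < true_p for _ in range(n_observations))
    return hits / n_observations

rng = random.Random(42)
true_p = 0.3  # hypothetical true value
for n in (10, 1000, 100000):
    estimate = estimate_proportion(true_p, n, rng)
    print(n, "observations, absolute error:", abs(estimate - true_p))
```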

Empirical and parametric models (parameterization)
Most mathematical models of sequence evolution include variables that represent features of the process of evolution, but the numerical values of which are not known a priori.
These variables are termed parameters of the models. Empirically constructed models take the values of their parameters from pre-computed analyses of large quantities of data, with the particular data under analysis having limited or no influence. By contrast, parametric models do not have pre-specified parameter values. Maximum likelihood can be used to estimate such values from the data under analysis.

Homoplasy
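Character states that arise independently more than once in different parts of the tree (for example by convergence or reversal), rather than being inherited from a common ancestor.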

Likelihood ratio test (LRT)
A powerful form of statistical test in which competing hypotheses (H0 and H1) are compared using a statistic based on the ratio of the maximum likelihoods (l₀, l₁) under each hypothesis; for example,

2δ = 2ln(l₁ / l₀) = 2(ln l₁ – ln l₀)

Results can be expressed in terms of P-values, the probability of the statistic being at least as extreme as observed when H0 is true:
low P-values (e.g. <0.05) suggest rejection of H0 in favour of H1.
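A minimal sketch of the calculation follows. The glossary does not say how the P-value is obtained; the example assumes the common case of nested hypotheses that differ by one free parameter, where 2δ is compared against a chi-square distribution with 1 degree of freedom. The log-likelihood values are made up.

```python
import math

def lrt_statistic(ln_l0, ln_l1):
    """2*delta = 2 * (ln l1 - ln l0), from the two maximized log-likelihoods."""
    return 2.0 * (ln_l1 - ln_l0)

def chi2_sf_1df(x):
    """P(X >= x) for a chi-square variable with 1 degree of freedom.

    Uses the identity X = Z**2 for a standard normal Z, so the tail
    probability equals erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical maximized log-likelihoods under H0 and H1.
ln_l0, ln_l1 = -1234.5, -1230.2
two_delta = lrt_statistic(ln_l0, ln_l1)
p_value = chi2_sf_1df(two_delta)
print("2*delta:", two_delta, "P-value:", p_value)
```

For hypotheses that differ by more than one parameter, or that are not nested, a different reference distribution (or a parametric bootstrap) would be needed.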

Maximum likelihood (ML)
The likelihood (lH) of a hypothesis (H) is equal to the probability of observing the data if that hypothesis were correct. The statistical method of maximum likelihood (ML) chooses amongst hypotheses by selecting the one which maximizes the likelihood; that is, which renders the data the most plausible.
In the context of molecular phylogenetics, a model of nucleotide or amino acid replacement permits the calculation of the likelihood for any possible combination of tree topology and branch lengths. The topology and branch lengths that maximize this likelihood (or, equivalently, its natural logarithm, ln lH, which is almost invariably used to give a more manageable number) are the ML estimates.
Any parameters with values not explicitly specified by the replacement model can be simultaneously estimated, again by selecting the values that maximize the likelihood.
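As a small worked sketch, consider the simplest possible case: two aligned sequences and a single parameter, the evolutionary distance d between them. The example assumes the Jukes-Cantor (JC69) replacement model, which the glossary does not name, and uses made-up counts. It finds the distance that maximizes the log-likelihood by a grid search and compares it with the known closed-form JC69 estimate.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """ln-likelihood of distance d (substitutions per site) under JC69.

    Constant terms that do not depend on d are omitted.
    """
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # probability a site differs
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)

def ml_distance_grid(n_sites, n_diff, step=0.0001, max_d=2.0):
    """Pick the grid value of d with the highest log-likelihood."""
    best_d, best_lnl = None, -math.inf
    for i in range(1, int(max_d / step) + 1):
        d = i * step
        lnl = jc69_log_likelihood(d, n_sites, n_diff)
        if lnl > best_lnl:
            best_d, best_lnl = d, lnl
    return best_d, best_lnl

def ml_distance_closed_form(n_sites, n_diff):
    """Analytical JC69 maximum likelihood distance."""
    return -0.75 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))

# Hypothetical data: 100 aligned sites, 10 of which differ.
d_grid, lnl = ml_distance_grid(100, 10)
print("grid ML estimate:", d_grid, "ln l:", lnl)
print("closed form:", ml_distance_closed_form(100, 10))
```

Real phylogenetic software optimizes topology, branch lengths, and model parameters together with numerical optimizers rather than a grid, but the principle of choosing the values that maximize ln lH is the same.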

Molecular clock
By comparing the structures of similar biomolecules in different species, one can estimate how divergent the species are from one another and then estimate the rate of molecular change; together these give a clock with which to time evolutionary divergence (a concept developed by Emile Zuckerkandl and Linus Pauling, 1965).
Definition: the rate of evolution in a given protein or DNA sequence is approximately constant over time and across evolutionary lineages.
A statistical proportionality follows: the time elapsed since the last common ancestor of two homologous proteins is directly proportional to the number of amino acid differences between their sequences.
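The proportionality can be turned into a back-of-the-envelope calculation. The sketch below assumes a known, constant rate per site per year along each lineage and divides by two because differences accumulate along both lineages since the common ancestor; the function name and the numbers are hypothetical, and no correction for multiple changes at the same site is applied.

```python
def divergence_time(differences_per_site, rate_per_site_per_year):
    """Years since the last common ancestor under a strict molecular clock.

    Both lineages accumulate changes, hence the factor of two.
    """
    return differences_per_site / (2.0 * rate_per_site_per_year)

# Hypothetical values: 2% of sites differ, 1e-9 changes per site per year.
years = divergence_time(0.02, 1e-9)
print("estimated divergence time (years):", years)
```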

Molecular phylogenetics
The study of phylogenies and processes of evolution by the analysis of DNA or amino acid sequence data.

Orthologues
Homologous genes for which there is an exact one-to-one correspondence [ortho=exact] between the ancestral relationship of the genes and the ancestral relationship of the species.

OTU
Operational taxonomic unit, e.g. a population or a phylum.

Paralogues
Homologous genes that arose by gene duplication.

Phylogenetic tree
The hierarchical relationships among organisms arising through evolution. The only figure in Darwin’s Origin of Species uses a sketch of a tree-like structure to describe evolution: from ancestors at the limbs and branches of the tree, through more recent ancestors at its twigs, to contemporary organisms at its buds. Today, these relationships are usually represented by a schematic ‘tree’ comprising a set of nodes linked together by branches.
Terminal nodes (tips or leaves) typically represent known sequences from extant organisms.
Internal nodes represent ancestral divergences into two (or more) genetically isolated groups; each internal node is attached to one branch representing evolution from its ancestor, and two (or more) branches representing its descendants.
The lengths of the branches in the tree can represent the evolutionary distances that separate the nodes; the tree topology is the information on the order of relationships, without consideration of the branch lengths.
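The description above maps naturally onto a small recursive data structure. The following is only a sketch: the class name, the field names, and the toy tree are assumptions, and the Newick text format is used simply as a convenient way to print the result. Printing with and without branch lengths shows the difference between the full tree and its topology.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""                          # label (usually set for tips only)
    branch_length: Optional[float] = None   # length of the branch to the ancestor
    children: List["Node"] = field(default_factory=list)

    def is_tip(self):
        """Terminal nodes have no descendants."""
        return not self.children

    def tips(self):
        """Names of all terminal nodes below (or at) this node."""
        if self.is_tip():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.tips())
        return names

    def newick(self, with_lengths=True):
        """Newick string for this subtree, without the trailing semicolon."""
        if self.children:
            inner = ",".join(c.newick(with_lengths) for c in self.children)
            label = "(" + inner + ")" + self.name
        else:
            label = self.name
        if with_lengths and self.branch_length is not None:
            label += f":{self.branch_length}"
        return label

# Hypothetical three-taxon tree with made-up branch lengths.
tree = Node(children=[
    Node(children=[Node("human", 0.1), Node("chimp", 0.12)], branch_length=0.3),
    Node("mouse", 0.9),
])
print(tree.newick() + ";")                    # tree with branch lengths
print(tree.newick(with_lengths=False) + ";")  # topology only
```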

Population
The whole set of measurements or counts about which we want to draw a conclusion. If we are interested in only one variable, we call the population univariate. A population is a set of measurements, not the individuals or objects on which the measurements or counts are made.
A sample is a subset of a population, a set of some of the measurements or counts which comprise the population.

Rate heterogeneity and gamma distribution
Mutation rates vary considerably amongst sites of DNA and amino acid sequences, because of biochemical factors, constraints of the genetic code, selection for gene function, etc. This variation is often modeled using a gamma distribution of rates across sequence sites. The shape of the gamma distribution is controlled by a parameter α, and the distribution’s mean and variance are 1 and 1/α, respectively.
Large values of α (particularly α>1) give a bell-shaped distribution, suggesting little or no rate heterogeneity; small values of α give a reverse-J-shaped distribution, suggesting higher levels of rate heterogeneity along with many sites with low rates of evolution.
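The effect of α can be checked numerically. In the sketch below, site rates are drawn from a gamma distribution with shape α and scale 1/α, which gives mean 1 and variance 1/α as stated above. The function name, the α values, the sample size, and the threshold used to call a site "slow" are assumptions chosen for illustration.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from a gamma with shape alpha and mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale 1/alpha fixes the mean at 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_slow(rates, threshold=0.1):
    """Share of sites whose rate falls below an arbitrary 'slow' threshold."""
    return sum(r < threshold for r in rates) / len(rates)

for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 50000)
    print("alpha:", alpha,
          "mean:", statistics.fmean(rates),
          "variance:", statistics.pvariance(rates),
          "fraction slow:", fraction_slow(rates))
```

With the small α most of the sampled sites evolve slowly while a few evolve much faster than average; with the large α the rates cluster around 1.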

Speciation
The evolutionary formation of new biological species, usually by the division of a single species into two or more genetically distinct ones.

Stochastic
Involving or containing a random variable or variables (as in stochastic calculus); involving chance or probability (as in a stochastic simulation).

Last modified 15 July, 2006