Bootstrap A statistical method by which distributions that are difficult to
calculate exactly can be estimated by the repeated creation and analysis of
artificial datasets. In the
non-parametric bootstrap, these datasets are generated by resampling from the
original data, whereas in the parametric bootstrap, the data are simulated
according to the hypothesis being tested. The name derives from the
near-miraculous way in which the method can ‘pull itself up by its bootstraps’
and generate statistical distributions from almost nothing.
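As a minimal sketch of the non-parametric bootstrap, the following Python resamples a small dataset with replacement and uses the resulting distribution of means to form an approximate interval. The data values, the function names and the choice of the mean as the statistic are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_means(data, n_replicates=2000, seed=1):
    """Resample the data with replacement and recompute the mean
    for each artificial dataset (non-parametric bootstrap)."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_replicates):
        resample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return means


def percentile_interval(values, level=0.95):
    """Simple percentile interval taken from the bootstrap distribution."""
    ordered = sorted(values)
    lo_index = int((1 - level) / 2 * len(ordered))
    hi_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_index], ordered[hi_index]


# Hypothetical measurements, invented for illustration only.
observed = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]
replicates = bootstrap_means(observed)
low, high = percentile_interval(replicates)
print(f"observed mean = {statistics.fmean(observed):.2f}")
print(f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

A parametric bootstrap would replace the resampling step with simulation from a fitted model; that variant is not shown here.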
Consistency A statistical estimation method has the property of
consistency when the estimate of a quantity is certain to converge to its true
value as more and more data are accumulated.
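Consistency can be illustrated with a small simulation. The sketch below assumes normally distributed data with a known true mean (5.0, chosen arbitrarily) and prints the sample mean for growing sample sizes; the function name and all numbers are hypothetical.

```python
import random
import statistics


def mean_estimate(sample_size, true_mean=5.0, seed=0):
    """Estimate the mean from a simulated sample of the given size."""
    rng = random.Random(seed)
    sample = [rng.gauss(true_mean, 2.0) for _ in range(sample_size)]
    return statistics.fmean(sample)


# The estimate should tend towards the true value (5.0) as n grows,
# although the error need not shrink at every single step.
for n in (10, 100, 1000, 10000, 100000):
    estimate = mean_estimate(n)
    print(f"n = {n:>6}: estimate = {estimate:.4f}, "
          f"error = {abs(estimate - 5.0):.4f}")
```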
Empirical and parametric models (parameterization) Most
mathematical models of sequence evolution include variables that represent
features of the process of evolution but whose numerical values are not
known a priori. These variables are termed parameters of the models.
Empirically constructed models
take the values of their parameters from pre-computed analyses of large
quantities of data, with the particular data under analysis having limited or
no influence. By contrast, parametric models do not have pre-specified
parameter values. Maximum likelihood can be used to estimate such values from
the data under analysis.
Homoplasy Character states that arise independently more than once in
different parts of the tree, rather than being inherited from a common ancestor.
Likelihood ratio test (LRT) A powerful form of statistical test in which
competing hypotheses (H0 and H1) are compared using a statistic based on the
ratio of the maximum likelihoods (l0, l1) under each hypothesis; for example,
2δ = 2 ln(l1 / l0) = 2(ln l1 – ln l0)
Results can be expressed in terms of P-values, the probability of the statistic
being at least as extreme as observed when H0 is true: low P-values (e.g. <0.05)
suggest rejection of H0 in favour of H1.
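The statistic is simple to compute once the two maximized log-likelihoods are known. The sketch below uses invented log-likelihood values, and it obtains the P-value from a chi-squared distribution with one degree of freedom. That approximation is an assumption not stated in this glossary: it is commonly applied when H0 is nested within H1 and H1 has one extra free parameter, and it does not hold in every case.

```python
import math


def lrt_statistic(ln_l0, ln_l1):
    """Return 2δ = 2(ln l1 - ln l0)."""
    return 2.0 * (ln_l1 - ln_l0)


def chi2_p_value_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom (assumed null distribution; see the caveats in the text)."""
    if statistic <= 0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical maximized log-likelihoods, for illustration only.
ln_l0 = -1234.5
ln_l1 = -1230.2

stat = lrt_statistic(ln_l0, ln_l1)
p = chi2_p_value_1df(stat)
print(f"2δ = {stat:.2f}, P = {p:.4f}")
if p < 0.05:
    print("Reject H0 in favour of H1 at the 0.05 level.")
```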
Maximum likelihood (ML) The likelihood (lH) of a hypothesis (H) is equal to the
probability of observing
the data if that hypothesis were correct. The statistical method of maximum
likelihood (ML) chooses amongst hypotheses by selecting the one which maximizes
the likelihood; that is, which renders the data the most plausible. In the context
of molecular phylogenetics, a model of nucleotide or amino acid replacement
permits the calculation of the likelihood for any possible combination of tree
topology and branch lengths. The topology and branch lengths that maximize this
likelihood (or, equivalently, its natural logarithm, ln lH, which is almost
invariably used to give a more manageable number) are the ML estimates. Any
parameters with values not explicitly specified by the replacement model can be
simultaneously estimated, again by selecting the values that maximize the
likelihood.
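To make the idea concrete, here is a minimal sketch for the simplest possible case: two aligned nucleotide sequences and a single branch length. It assumes the Jukes-Cantor replacement model, which is not named in this glossary and is used only because its likelihood is easy to write down. The site counts are invented, and the function names are hypothetical.

```python
import math


def jc69_log_likelihood(distance, n_sites, n_differences):
    """ln-likelihood of two aligned sequences under Jukes-Cantor, with the
    distance measured in expected substitutions per site."""
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay            # site shows the same base
    p_specific_diff = 0.25 - 0.25 * decay   # site shows one given other base
    n_same = n_sites - n_differences
    return (n_sites * math.log(0.25)        # equal base frequencies
            + n_same * math.log(p_same)
            + n_differences * math.log(p_specific_diff))


def ml_distance(n_sites, n_differences, lower=1e-6, upper=5.0, tol=1e-8):
    """Golden-section search for the distance that maximizes ln l."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if (jc69_log_likelihood(c, n_sites, n_differences)
                > jc69_log_likelihood(d, n_sites, n_differences)):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return (a + b) / 2.0


# Hypothetical alignment: 1000 sites, 100 of which differ.
estimate = ml_distance(1000, 100)
closed_form = -0.75 * math.log(1.0 - 4.0 * (100 / 1000) / 3.0)
print(f"numerical ML distance = {estimate:.5f}")
print(f"closed-form distance  = {closed_form:.5f}")
```

For this model the maximum also has a closed form, which the last lines use as a check on the numerical search. With a full tree, the same principle applies, but the search runs over topologies and all branch lengths at once.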
Molecular clock The idea that comparing the structures of similar biomolecules
in different species gives an estimate of how far those species have diverged
from one another; combined with an estimate of the rate of molecular change,
this yields a ‘clock’ for timing evolutionary divergence (a concept developed by
Linus Pauling and Emile Zuckerkandl, 1965). Definition: the rate of evolution in
a given protein or DNA sequence is approximately constant over time and across
evolutionary lineages. There is also a statistical proportionality: the time
elapsed since the last common ancestor of two homologous proteins is directly
proportional to the number of amino acid differences between their sequences.
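The proportionality can be turned into a small worked calculation. The sketch below calibrates the clock from one pair of proteins with a known divergence time and then dates a second pair. All numbers are invented, and the calculation takes the simple proportionality at face value, ignoring complications such as repeated changes at the same site.

```python
def calibrate_clock(n_differences, divergence_time_myr):
    """Differences accumulated per million years between a calibration pair."""
    return n_differences / divergence_time_myr


def estimate_divergence_time(n_differences, differences_per_myr):
    """Time since the last common ancestor, assuming a constant rate."""
    return n_differences / differences_per_myr


# Hypothetical calibration: 20 differences after 100 million years.
rate = calibrate_clock(20, 100.0)

# Hypothetical query pair with 35 differences in the same protein.
time_myr = estimate_divergence_time(35, rate)
print(f"rate = {rate:.2f} differences per Myr")
print(f"estimated divergence time = {time_myr:.0f} Myr")
```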
Molecular phylogenetics The study of phylogenies and processes of evolution by
the analysis of DNA or amino acid sequence data.
Orthologues Homologous genes for which there is an exact one-to-one
correspondence [ortho = exact] between the ancestral relationship of the
genes and the ancestral relationship of the species.
OTU Operational taxonomic unit, e.g. a population or a phylum.
Paralogues Homologous genes that arose by gene duplication.
Phylogenetic tree The hierarchical relationships among organisms arising through
evolution. The only figure in Darwin’s Origin of Species uses a sketch of a
tree-like structure
to describe evolution: from ancestors at the limbs and branches of the tree,
through more recent ancestors at its twigs, to contemporary organisms at its
buds. Today, these relationships are usually represented by a schematic ‘tree’ comprising
a set of nodes linked together by branches. Terminal nodes (tips or leaves)
typically represent known sequences from extant organisms. Internal nodes represent
ancestral divergences into two (or more) genetically isolated groups; each internal
node is attached to one branch representing evolution from its ancestor, and
two (or more) branches representing its descendants. The lengths of the
branches in the tree can represent the evolutionary distances that separate the
nodes; the tree topology is the information on the order of relationships,
without consideration of the branch lengths.
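The distinction between a tree with branch lengths and its bare topology can be sketched as a small data structure. The class below and the taxon names A, B and C are hypothetical. The output uses the Newick parenthesis notation, a common text format for trees that is chosen here for convenience rather than taken from this glossary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""                          # label, used for terminal nodes
    branch_length: Optional[float] = None   # length of branch to the ancestor
    children: List["Node"] = field(default_factory=list)

    def is_terminal(self):
        """Terminal nodes (tips or leaves) have no descendants."""
        return not self.children


def to_newick(node, with_lengths=True):
    """Write the subtree below `node` in Newick notation."""
    if node.is_terminal():
        text = node.name
    else:
        inner = ",".join(to_newick(c, with_lengths) for c in node.children)
        text = f"({inner}){node.name}"
    if with_lengths and node.branch_length is not None:
        text += f":{node.branch_length}"
    return text


# Hypothetical three-taxon tree: A and B share an internal node.
tree = Node(children=[
    Node(children=[Node("A", 0.1), Node("B", 0.2)], branch_length=0.05),
    Node("C", 0.3),
])

print(to_newick(tree) + ";")                      # tree with branch lengths
print(to_newick(tree, with_lengths=False) + ";")  # topology only
```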
Population The whole set of measurements or counts about which we want to draw
a conclusion. If we are interested in only one variable, we call the population
univariate. A population is a set of measurements, not the individuals or
objects on which the measurements or counts are made. A sample is a subset of a
population: a set of some of the measurements or counts that comprise the
population.
Rate heterogeneity and gamma distribution Mutation rates vary considerably
amongst sites of DNA and amino acid sequences, because of biochemical factors,
constraints of the genetic code, selection for gene function, etc. This
variation is often modeled using a gamma distribution of rates across sequence
sites. The shape of the gamma distribution is controlled by a parameter α, and
the distribution’s mean and variance are 1 and 1/α, respectively. Large values
of α (particularly α>1) give a bell-shaped distribution, suggesting little or
no rate heterogeneity; small values of α give a reverse-J-shaped distribution,
suggesting higher levels of rate heterogeneity along with many sites with low
rates of evolution.
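The effect of α can be checked by simulation. The sketch below draws per-site rates from a gamma distribution with shape α and scale 1/α, which gives the mean of 1 and variance of 1/α described above. The α values, the number of sites and the 0.1 threshold used to count slow sites are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites=100000, seed=0):
    """Draw relative rates for n_sites from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def slow_fraction(rates, threshold=0.1):
    """Fraction of sites whose relative rate falls below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)


for alpha in (0.5, 1.0, 5.0):
    rates = sample_site_rates(alpha)
    print(f"alpha = {alpha}: mean = {statistics.fmean(rates):.3f}, "
          f"variance = {statistics.pvariance(rates):.3f} "
          f"(expected {1.0 / alpha:.3f}), "
          f"slow sites = {slow_fraction(rates):.3f}")
```

Smaller values of α should show a larger share of slowly evolving sites, in line with the reverse-J shape.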
Speciation The evolutionary
formation of new biological species, usually by the division of a single
species into two or more genetically distinct ones.
Stochastic In statistics, involving or containing a random variable or
variables (as in stochastic calculus); involving chance or probability (as in a
stochastic stimulation).
Last modified 15 July, 2006