Motivation for the INDELible project

Why are simulation programs useful?

Advances in technology have led to the rapid accumulation of vast amounts of molecular sequence data, which can be used to reconstruct evolutionary histories or to investigate the forces and mechanisms behind the evolutionary process. A large variety of methods, and computer programs implementing them, aid in this goal by reconstructing sequence alignments and by estimating evolutionary parameters and phylogenetic relationships from extant data.

Because the true phylogenetic relationships between homologous molecular sequences are not known with certainty, except in a few rare cases, simulated data are used to check the accuracy and efficiency of phylogenetic reconstruction methods, ancestral sequence reconstruction methods, and methods of sequence alignment. They can also be used in parametric bootstrap approaches to assess the adequacy of models or assumptions, or to calculate confidence intervals for estimates of evolutionary parameters.

Many methods assume particular evolutionary models, and data simulated under these models can be used to test the accuracy of phylogeny reconstruction or parameter estimation under ideal circumstances. Conversely, simulated data can be analysed under incorrect models to assess the robustness of such procedures to model misspecification.

Why are insertions and deletions (indels) important?

When simulated data do not include indels, there is no need to align the sequences. This eliminates an important step, one that contributes significantly to topological errors in inferred phylogenies. In addition, molecular evolution is known to result from the combined processes of substitution, insertion and deletion, so any proper simulation of molecular evolution should incorporate all of these phenomena.

Why is INDELible different?

Most existing computer programs for simulating molecular sequence evolution could be considered lacking in one respect or another. For example:

  • Some do not include indels at all or have unrealistic models of indel formation.
  • Others are inflexible and applicable in only very specific cases.
  • Most can only simulate with either nucleotide or amino-acid sequences but not both.
  • Only one program can simulate using codon models, but it does not include indels.
  • Only one program can simulate non-stationary, non-homogeneous processes, but it does not include amino-acid or codon models.

INDELible was designed to fill these gaps: it combines features that were previously found only separately, in disparate programs.