Molecular phylogenetics methods

Overview

Molecular phylogenetics reconstructs evolutionary trees from DNA, protein, and genomic sequences using computational methods that range from fast distance-based algorithms like neighbor-joining to statistically rigorous approaches including maximum likelihood and Bayesian inference, each with distinct strengths and assumptions about how sequences evolve.
The accuracy of any phylogenetic analysis depends critically on the substitution model chosen to describe molecular evolution, with model selection criteria such as AIC and BIC guiding researchers toward the best-fitting model from a hierarchy that spans the single-parameter Jukes-Cantor model to the fully parameterized general time-reversible (GTR) model.
Modern phylogenomics uses thousands of loci from whole genomes, ultraconserved elements, or RADseq markers and increasingly relies on coalescent-based species tree methods like ASTRAL and *BEAST that account for gene tree discordance caused by incomplete lineage sorting, rather than simply concatenating genes into a single supermatrix.

Molecular phylogenetics is the branch of evolutionary biology that uses data from DNA, RNA, and protein sequences to infer the evolutionary relationships among organisms. Since its emergence in the 1960s and 1970s, the field has developed a sophisticated toolkit of computational and statistical methods for reconstructing phylogenetic trees — branching diagrams that depict ancestor-descendant relationships and the relative or absolute times at which lineages diverged. These methods differ in their underlying philosophies, their computational demands, and their sensitivity to violations of assumptions, but all share a common goal: to extract the historical signal embedded in molecular sequences and to express it as a well-supported hypothesis of evolutionary history.²³ The choice of method, substitution model, and data type can profoundly influence the resulting tree, making an understanding of the strengths and limitations of each approach essential for any modern evolutionary analysis.

The development of molecular phylogenetics has been inseparable from advances in DNA sequencing technology and computational power. Early studies relied on single genes or small protein datasets, but the genomics revolution has made it routine to analyse thousands of loci simultaneously, giving rise to the discipline of phylogenomics.²² This explosion of data has brought new challenges — including pervasive gene tree discordance and the need for species tree methods that account for the stochastic processes governing how genetic lineages sort across speciation events — but it has also delivered unprecedented resolution of evolutionary relationships at every taxonomic level, from populations to the deepest branches of the tree of life.^{16, 22}

Types of molecular data

The raw material of molecular phylogenetics is the nucleotide or amino acid sequence. DNA sequences are the most widely used data type because they are straightforward to obtain through PCR amplification and sequencing, and because the four-state nucleotide alphabet (A, C, G, T) provides a natural framework for modelling substitution processes. Protein-coding genes can be analysed either as nucleotide sequences or as the amino acid sequences they encode; the latter approach is often preferred for comparisons among distantly related organisms because amino acid sequences evolve more slowly than the underlying DNA, reducing the problem of substitution saturation — the progressive loss of phylogenetic signal as multiple substitutions accumulate at the same site over deep time.²³

Ribosomal RNA genes, particularly the small subunit (16S in prokaryotes, 18S in eukaryotes), were among the earliest molecular markers used in phylogenetics and remain foundational for microbial systematics. Their universal presence across all cellular life, a mixture of conserved and variable regions, and the availability of enormous reference databases make them ideal for broad-scale comparisons.²² Mitochondrial genes such as cytochrome b and cytochrome oxidase I (COI) are widely used in animal systematics and in DNA barcoding initiatives. For studies requiring finer resolution, nuclear protein-coding genes, introns, and non-coding regulatory regions provide additional phylogenetic information, though they are more challenging to amplify across divergent taxa.

The advent of high-throughput sequencing has made it feasible to gather data from hundreds to thousands of genomic loci simultaneously. Whole-genome sequences, transcriptomes, reduced-representation libraries generated by restriction-site associated DNA sequencing (RADseq), and targeted capture of ultraconserved elements (UCEs) all provide dense sampling of the genome and have become the foundation of modern phylogenomics.^{19, 20, 22} This abundance of data has transformed the field from one limited by the quantity of available sequences to one whose central challenge is developing methods capable of extracting accurate evolutionary signal from vast, heterogeneous datasets.

Sequence alignment

A rooted phylogenetic tree of the three domains of life (Bacteria, Archaea, and Eukarya), derived from molecular sequence data and rooted at a universal common ancestor. Sequence alignment of conserved genes such as 16S ribosomal RNA provides the raw data from which such trees are inferred; the quality of the alignment directly determines the accuracy of the resulting topology. Eric Gaba, Wikimedia Commons, Public domain

Before any phylogenetic analysis can begin, homologous sequences must be aligned so that nucleotide or amino acid positions descended from a common ancestral position are placed in the same column of a data matrix. This step is foundational: errors in alignment propagate directly into errors in the inferred tree. The mathematical framework for pairwise sequence alignment was established by Needleman and Wunsch in 1970, who introduced a dynamic programming algorithm that finds the optimal global alignment of two sequences by maximising a similarity score while penalising gaps that represent insertions or deletions.¹
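The dynamic programming idea can be illustrated with a minimal Python sketch of global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, not values taken from the original paper, and real aligners use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "ACT")
```

With these assumed scores, aligning ACGT against ACT places a single gap opposite the G. Time and memory both grow with the product of the two sequence lengths.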

In practice, phylogenetic datasets contain many sequences that must be aligned simultaneously, a problem known as multiple sequence alignment (MSA). Because the exact solution to MSA is computationally intractable for more than a handful of sequences, all widely used programs employ heuristic strategies. The most common approach is progressive alignment, in which sequences are first compared pairwise, a guide tree is constructed from the pairwise distances, and sequences are then progressively merged along the guide tree. ClustalW, introduced in 1994, popularised this strategy, and the paper describing it became one of the most cited in biology.² However, progressive alignment is sensitive to errors in the guide tree and to the order in which sequences are added, because early mistakes are propagated through subsequent steps.

Two programs that addressed these limitations became the workhorses of modern alignment. MUSCLE, published by Edgar in 2004, introduced iterative refinement: after an initial progressive alignment, the program repeatedly partitions the alignment and re-optimises it, improving accuracy without a dramatic increase in computation time.³ MAFFT, developed by Katoh and colleagues, employs fast Fourier transforms to identify homologous regions and offers multiple alignment strategies ranging from rapid progressive methods to highly accurate iterative approaches suitable for large datasets.⁴ Benchmarking studies consistently rank MAFFT and MUSCLE among the most accurate alignment programs, though no single method is universally superior across all dataset types and divergence levels.^{3, 4}

A persistent challenge in molecular phylogenetics is alignment ambiguity: regions of a sequence alignment where the correct placement of gaps and residues is uncertain, particularly in non-coding regions or among highly divergent sequences. Because different plausible alignments can yield different phylogenetic trees, some researchers exclude ambiguously aligned regions before analysis, while others explore the sensitivity of their results to alignment variation. The recognition that alignment and tree estimation are logically interdependent has motivated the development of methods that co-estimate alignment and phylogeny simultaneously, though these remain computationally demanding.²³

Distance-based methods

Distance-based methods represent the computationally simplest approach to phylogenetic inference. They begin by calculating a matrix of pairwise evolutionary distances between all sequences in the dataset, then use a clustering algorithm to construct a tree from that matrix. The evolutionary distance between two sequences is not simply the proportion of sites at which they differ (the p-distance), because observed differences underestimate the true number of substitutions due to multiple hits — sites where more than one substitution has occurred since the sequences diverged. To correct for this, distance methods apply a substitution model to convert observed differences into estimated numbers of substitutions per site.²³
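As an illustration of the multiple-hits correction, the following Python sketch computes the p-distance for two aligned sequences and converts it with the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3). The choice of Jukes-Cantor is an assumption made for simplicity, as are the function names and the decision to skip any column containing a gap or ambiguity code.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns with gaps or ambiguity codes."""
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    if p >= 0.75:
        # Saturated: the correction is undefined at or beyond p = 3/4
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # one difference in ten sites
d = jc69_distance(p)
```

The corrected distance always exceeds the p-distance for p > 0, and the gap between the two widens as sequences approach saturation.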

A neighbor-joining tree of *Antispila* moths reconstructed from cytochrome oxidase I (COI) barcode sequences, with 10,000-replicate bootstrap support values shown on branches. Vitaceae-feeding clusters are colour-coded, illustrating how NJ trees built from pairwise molecular distances can reveal both species relationships and ecologically meaningful groupings, while the bootstrap values quantify confidence in each branch. van Nieukerken E et al., Wikimedia Commons, CC BY 3.0

The most widely used distance-based algorithm is the neighbor-joining (NJ) method, introduced by Saitou and Nei in 1987.⁵ Neighbor-joining constructs an unrooted tree by iteratively identifying the pair of operational taxonomic units whose joining minimises the total branch length of the tree. Unlike the earlier UPGMA algorithm (unweighted pair group method with arithmetic mean), which assumes a constant rate of evolution across all lineages and produces an ultrametric tree, neighbor-joining does not require a molecular clock assumption and allows branches to have unequal lengths.⁵ This makes NJ substantially more realistic for most biological datasets, where rates of molecular evolution vary among lineages. Neighbor-joining is extremely fast, capable of handling thousands of sequences in seconds, and for this reason remains widely used as a starting point for exploratory analyses and as a method for constructing guide trees in other applications, even though it has been largely superseded by model-based methods for formal phylogenetic inference.^{5, 23}
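The iterative joining procedure can be sketched in plain Python as follows. This is a minimal illustration of the standard formulation (the Q criterion, which is equivalent to minimising total branch length at each step), not an optimised implementation. The input format, the internal node names (node0, node1, ...) and the five-taxon distance matrix are hypothetical choices made for the example.

```python
def neighbor_joining(names, dist):
    """Return an unrooted tree as a list of (node, parent, branch_length) edges.

    dist is a symmetric dict of dicts; at least three taxa are required.
    """
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 3:
        n = len(nodes)
        # r[a]: summed distance from a to every other active node
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]  # Q criterion
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        u = f"node{counter}"
        counter += 1
        d[u] = {}
        for c in nodes:
            if c != a and c != b:
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        edges += [(a, u, len_a), (b, u, len_b)]
        nodes = [c for c in nodes if c != a and c != b] + [u]
    # Resolve the last three nodes around a central node
    a, b, c = nodes
    centre = f"node{counter}"
    edges.append((a, centre, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((b, centre, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((c, centre, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges


# Hypothetical additive distances among five taxa
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
         ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
taxa = ["a", "b", "c", "d", "e"]
matrix = {x: {} for x in taxa}
for (x, y), value in pairs.items():
    matrix[x][y] = matrix[y][x] = value
edges = neighbor_joining(taxa, matrix)
```

Because the example distances are additive (they fit a tree exactly), the recovered branch lengths reproduce the input matrix. The terminal branches have unequal lengths, which UPGMA's clock assumption would not permit.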

Maximum parsimony

Maximum parsimony is the oldest character-based method for phylogenetic inference, rooted in the philosophical principle that the simplest explanation consistent with the data is to be preferred. Applied to molecular sequences, parsimony seeks the tree (or set of trees) that requires the fewest evolutionary changes — the minimum number of nucleotide or amino acid substitutions — to explain the observed differences among sequences. The algorithm for counting the minimum number of changes on a given tree was formalised by Fitch in 1971, who presented an efficient method for reconstructing hypothetical ancestral character states that minimise the total number of substitutions along a predetermined tree topology.⁶
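A minimal sketch of the bottom-up pass of Fitch's algorithm for a single site is shown below. The tree encoding (nested two-element tuples of leaf names) and the function name are assumptions made for illustration. The second, top-down pass that assigns specific ancestral states is omitted.

```python
def fitch(tree, states):
    """Return (possible_root_states, minimum_changes) for one character.

    tree   : a leaf name (str) or a 2-tuple of subtrees (rooted binary tree)
    states : dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children can agree on a state: no extra change needed here
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one substitution
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "A", "B": "A", "C": "G", "D": "G"}
_, changes_ab_cd = fitch((("A", "B"), ("C", "D")), site)
_, changes_ac_bd = fitch((("A", "C"), ("B", "D")), site)
```

For this site the tree grouping A with B needs one change and the alternative needs two, so parsimony prefers the former. Summing this count over all sites gives the parsimony score of a tree.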

Parsimony has the appeal of simplicity and makes minimal assumptions about the process of molecular evolution: it does not require specification of a substitution model, nor does it assume any particular distribution of rates across sites or lineages. However, this apparent advantage is also its principal weakness. Because parsimony does not model the substitution process, it cannot account for the possibility of multiple substitutions at the same site. When rates of evolution are high or branches are long, observed similarity between two sequences may reflect convergent substitutions rather than shared ancestry, and parsimony will be positively misled — it will converge on the wrong tree with increasing confidence as more data are added.¹⁵ This phenomenon, known as long-branch attraction, was first demonstrated formally by Felsenstein in 1978 and remains the most important known source of systematic error in parsimony analysis.¹⁵

Despite these limitations, parsimony was the dominant method in phylogenetics from the 1970s through the 1990s, particularly among systematists influenced by the Willi Hennig school of cladistics, and it continues to be used in morphological phylogenetics and as a point of comparison with model-based methods. The maximum parsimony search problem is NP-hard, meaning that an exhaustive search of all possible tree topologies is feasible only for small numbers of taxa; for larger datasets, heuristic search strategies involving tree bisection and reconnection (TBR) or subtree pruning and regrafting (SPR) are employed to explore tree space.²³

Maximum likelihood

The maximum likelihood (ML) approach to phylogenetics, introduced by Felsenstein in 1981, represented a fundamental shift from counting changes to modelling the stochastic process of sequence evolution.⁷ Under ML, the data are the observed sequence alignment and the model specifies the probabilities of different nucleotide substitutions occurring over time. For each candidate tree topology and set of branch lengths, the likelihood — the probability of observing the data given the tree and model — is calculated, and the tree with the highest likelihood is selected as the best estimate of the phylogeny. Felsenstein showed that this approach is statistically consistent: given a correct model and sufficient data, ML will converge on the true tree, even in the Felsenstein zone where parsimony fails.^{7, 15}
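To make the likelihood calculation concrete, the sketch below evaluates the log-likelihood of a fixed tree under the Jukes-Cantor model using the recursive (pruning) scheme usually attributed to Felsenstein. It is an illustration under several assumptions: the tree encoding is hypothetical, branch lengths are fixed rather than optimised, no tree search is performed, and branch lengths are in expected substitutions per site.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """Probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods of the data below node, for each state at node.

    node : a leaf name (str) or a tuple of (child, branch_length) pairs
    site : dict mapping leaf name -> observed base at this alignment column
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in BASES)
    return result


def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, assuming sites evolve independently."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(length):
        site = {name: seq[k] for name, seq in alignment.items()}
        p = partial(tree, site)
        total += math.log(sum(0.25 * p[x] for x in BASES))  # equal root frequencies
    return total


aln = {"A": "ACGTA", "B": "ACGTA", "C": "ACGTG", "D": "ACGTG"}
t = 0.1
tree_ab = (((("A", t), ("B", t)), t), ((("C", t), ("D", t)), t))
tree_ac = (((("A", t), ("C", t)), t), ((("B", t), ("D", t)), t))
lnl_ab = log_likelihood(tree_ab, aln)
lnl_ac = log_likelihood(tree_ac, aln)
```

In this toy alignment the last column supports grouping A with B, so the first tree receives the higher likelihood. A full ML analysis would also optimise branch lengths and model parameters for every candidate topology.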

Central to ML phylogenetics is the choice of substitution model, which describes how nucleotides change over evolutionary time. The simplest model is the Jukes-Cantor (JC69) model, which assumes that all four nucleotides are equally frequent and that all types of substitutions occur at the same rate. The Kimura two-parameter (K80) model introduces a distinction between transitions (purine-purine or pyrimidine-pyrimidine changes, such as A↔G) and transversions (purine-pyrimidine changes, such as A↔T), recognising that transitions typically occur more frequently than transversions in real sequences.⁸ At the other end of the complexity spectrum, the general time-reversible (GTR) model, described by Tavaré in 1986, allows all six pairwise substitution rates and all four nucleotide frequencies to be free parameters, providing the most general reversible model of nucleotide evolution.⁹ These models can be further extended by incorporating rate variation across sites, typically modelled with a discrete gamma distribution (denoted +Γ), and a proportion of invariable sites (+I).²³

Hierarchy of common nucleotide substitution models^{8, 9, 23}

Model	Free rate parameters	Base frequencies	Key assumption
JC69 (Jukes & Cantor)	0	Equal (0.25 each)	All substitutions equally probable
K80 (Kimura 2-parameter)	1 (κ)	Equal (0.25 each)	Transitions ≠ transversions
HKY85	1 (κ)	Unequal (estimated)	Transitions ≠ transversions; unequal base frequencies
GTR (Tavaré)	5	Unequal (estimated)	All six pairwise rates independent
GTR+Γ	5 + α	Unequal (estimated)	GTR with gamma-distributed rate variation
GTR+Γ+I	5 + α + p_inv	Unequal (estimated)	GTR+Γ with a proportion of invariable sites
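The relationship between these models can be sketched by constructing a GTR instantaneous rate matrix in plain Python. The parameterisation used here (six exchangeabilities multiplied by target base frequencies, rescaled to one expected substitution per unit time) is one common convention. The function name and the numerical values are assumptions for illustration only.

```python
from itertools import combinations

BASES = "ACGT"
PAIRS = list(combinations(BASES, 2))  # the six unordered nucleotide pairs


def gtr_rate_matrix(exch, freqs):
    """Return the GTR rate matrix as a dict of dicts.

    exch  : dict mapping an unordered pair such as ("A", "G") to its exchangeability
    freqs : dict mapping each base to its equilibrium frequency (summing to 1)
    """
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                key = (x, y) if (x, y) in exch else (y, x)
                q[x][y] = exch[key] * freqs[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)  # rows sum to zero
    # Rescale so that the expected substitution rate at equilibrium is 1
    scale = -sum(freqs[x] * q[x][x] for x in BASES)
    for x in BASES:
        for y in BASES:
            q[x][y] /= scale
    return q


# JC69 as a special case: equal exchangeabilities and equal frequencies
jc = gtr_rate_matrix({p: 1.0 for p in PAIRS}, {b: 0.25 for b in BASES})

# HKY85-like case: transitions (A-G, C-T) twice as exchangeable, unequal frequencies
hky_exch = {p: (2.0 if p in (("A", "G"), ("C", "T")) else 1.0) for p in PAIRS}
hky_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
hky = gtr_rate_matrix(hky_exch, hky_freqs)
```

Setting all exchangeabilities and frequencies equal recovers JC69, and tying the exchangeabilities into two classes gives an HKY85-like matrix. This is the sense in which the simpler models are nested within GTR.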

Choosing the appropriate model is not merely a technical detail; using an overly simple model can lead to systematic bias in the inferred tree, while an overly complex model wastes parameters and reduces statistical power. Model selection is typically performed using information-theoretic criteria: the Akaike information criterion (AIC) balances model fit (measured by the likelihood) against model complexity (measured by the number of parameters), while the Bayesian information criterion (BIC) applies a stronger penalty for additional parameters, especially with large datasets. Both AIC and BIC have been shown to outperform the older hierarchical likelihood ratio test approach for phylogenetic model selection.¹⁰
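The two criteria are simple to compute once each model's maximised log-likelihood is known: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters and n the number of sites. In the sketch below the log-likelihood values and alignment length are invented for illustration. Branch-length parameters are left out of k because they are shared by all three models on a fixed topology.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion (lower is better); n is the sample size."""
    return k * math.log(n) - 2 * log_lik


# model name -> (hypothetical maximised log-likelihood, free substitution parameters)
candidates = {
    "JC69": (-5250.0, 0),
    "HKY85": (-5120.0, 4),  # kappa + 3 free base frequencies
    "GTR+G": (-5110.0, 9),  # 5 rates + 3 free frequencies + gamma shape
}
n_sites = 1000  # hypothetical alignment length

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
```

With these made-up numbers AIC selects GTR+G while BIC selects HKY85, showing how the heavier BIC penalty can favour a simpler model on the same data.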

Modern ML software packages such as RAxML and IQ-TREE can analyse alignments containing thousands of taxa and hundreds of thousands of sites. RAxML, developed by Stamatakis, introduced efficient parallelisation strategies and rapid bootstrap analysis that made large-scale ML phylogenetics practical.²⁴ IQ-TREE, introduced by Nguyen and colleagues in 2015, employs a stochastic perturbation algorithm that efficiently explores tree space and has been shown to find higher-likelihood trees than competing programs on many empirical datasets.²⁵ Both programs implement automatic model selection, partitioned analysis for multi-gene datasets, and various measures of branch support.

Bayesian phylogenetic inference

Bayesian inference applies Bayes' theorem to phylogenetics: the posterior probability of a tree given the data is proportional to the product of the likelihood (the probability of the data given the tree and model) and the prior probability of the tree, with the likelihood integrated over all possible values of the branch lengths and model parameters. Unlike ML, which seeks a single point estimate of the best tree, Bayesian inference produces an entire posterior distribution of trees, allowing researchers to quantify uncertainty in every aspect of the phylogeny (topology, branch lengths, and model parameters) in a unified probabilistic framework.^{11, 23}
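Written out, the relationship takes the following form. The notation is an assumption introduced here: τ denotes a tree topology, D the alignment, and θ the branch lengths and substitution-model parameters collected together.

```latex
\[
P(\tau \mid D) \;=\;
\frac{P(\tau)\displaystyle\int P(D \mid \tau, \theta)\, P(\theta \mid \tau)\, d\theta}
     {\displaystyle\sum_{\tau'} P(\tau') \int P(D \mid \tau', \theta)\, P(\theta \mid \tau')\, d\theta}
\]
```

The denominator sums over every possible topology, which is why the posterior cannot be evaluated directly for realistic numbers of taxa.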

Because the posterior distribution cannot be calculated analytically for real phylogenetic problems, Bayesian methods rely on Markov chain Monte Carlo (MCMC) sampling. MCMC algorithms construct a chain that wanders through the space of possible trees and parameter values, spending time in each region of parameter space in proportion to its posterior probability. After a sufficient burn-in period during which the chain converges from its starting point, the sampled trees and parameters provide an approximation of the posterior distribution. The proportion of sampled trees in which a particular clade (a group consisting of a common ancestor and all of its descendants) appears estimates the posterior probability of that clade, which is directly interpretable as the probability that the clade is real given the data, model, and priors.^{11, 12}
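The final summarising step can be sketched as follows: discard the burn-in, then count how often each clade appears among the remaining sampled trees. The tree encoding (nested tuples of leaf names, treated as rooted) and the ten-tree sample are hypothetical. The MCMC sampler that would produce such a sample is not shown.

```python
from collections import Counter


def clades(tree):
    """Return (leaf_set, list_of_clades) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)  # the clade defined by this internal node
    return leaves, found


def clade_posteriors(sampled_trees, burn_in):
    """Estimate clade posterior probabilities as frequencies after burn-in."""
    kept = sampled_trees[burn_in:]
    counts = Counter()
    for tree in kept:
        counts.update(set(clades(tree)[1]))
    return {clade: count / len(kept) for clade, count in counts.items()}


tree_1 = ((("A", "B"), "C"), "D")
tree_2 = ((("A", "C"), "B"), "D")
# Hypothetical chain: two burn-in samples followed by ten retained samples
chain = [tree_2, tree_2] + [tree_1] * 8 + [tree_2] * 2
posteriors = clade_posteriors(chain, burn_in=2)
```

Here the clade containing A and B appears in eight of the ten retained trees, giving an estimated posterior probability of 0.8.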

The first widely adopted Bayesian phylogenetics program was MrBayes, released by Huelsenbeck and Ronquist in 2001 and substantially expanded in version 3 in 2003.^{11, 12} MrBayes implements a wide range of substitution models, allows different partitions of the data to evolve under different models, and uses Metropolis-coupled MCMC (MC3) to improve mixing across the posterior landscape. BEAST (Bayesian Evolutionary Analysis by Sampling Trees), introduced by Drummond and Rambaut in 2007, extended the Bayesian framework to co-estimate phylogeny, divergence times, population sizes, and substitution rates in a single analysis, making it the standard tool for molecular dating and phylodynamic studies.¹³

Bayesian methods have the advantage of providing a natural, coherent measure of support for clades through posterior probabilities, integrating over uncertainty in nuisance parameters rather than conditioning on point estimates. However, they are computationally demanding, sometimes requiring days to weeks of MCMC sampling for large datasets, and their results can be sensitive to the choice of prior distributions and to issues of MCMC convergence and mixing. Studies have also shown that posterior probabilities can be inflated relative to the true probability of a clade being correct, particularly when the substitution model is misspecified.²³

Measures of confidence: bootstrap and posterior probabilities

No phylogenetic tree should be presented without an assessment of the confidence in its individual branches. The two most widely used measures of support are the nonparametric bootstrap and the Bayesian posterior probability. The bootstrap, introduced to phylogenetics by Felsenstein in 1985, works by resampling columns from the original sequence alignment with replacement to generate many pseudoreplicate datasets, each of the same size as the original.¹⁴ A phylogenetic tree is inferred from each pseudoreplicate, and the proportion of pseudoreplicate trees in which a given clade appears is the bootstrap support value for that clade. A bootstrap value of 70% or higher is conventionally regarded as indicative of moderate to strong support, and values above 95% are generally considered robust, though the precise correspondence between bootstrap values and the probability that a clade is correct depends on the dataset and method.¹⁴
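The resampling procedure can be sketched in a few lines of Python. The tree-inference step is represented by a caller-supplied function, `infer_clades`, which is a hypothetical placeholder for any method that returns the clades of the tree estimated from an alignment. The function names and the toy alignment are likewise assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, replicates=100, seed=0):
    """Proportion of pseudoreplicates in which each clade is recovered.

    infer_clades is a placeholder: any callable that takes an alignment and
    returns an iterable of clades (frozensets of taxon names).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        pseudoreplicate = bootstrap_alignment(alignment, rng)
        for clade in set(infer_clades(pseudoreplicate)):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: count / replicates for clade, count in counts.items()}


toy = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "AGGTTC"}
replicate = bootstrap_alignment(toy, random.Random(1))
```

Each column is resampled as a unit, so every taxon keeps its own state at a given site. Only the weighting of sites changes between pseudoreplicates.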

Bayesian posterior probabilities, as described above, represent the probability of a clade given the data and model. There is no simple one-to-one correspondence between bootstrap values and posterior probabilities: empirical comparisons have consistently found that posterior probabilities tend to be higher than bootstrap values for the same clades, a pattern that may reflect the different sources of uncertainty each measure captures. Bootstrap values are often considered more conservative, while posterior probabilities may overestimate support when the model is inadequate.^{14, 23} For this reason, many phylogenetic studies report both measures, and researchers are encouraged to interpret high support values in the context of the specific analysis rather than treating any threshold as an absolute guarantee of accuracy.

The problem of long-branch attraction

Long-branch attraction (LBA) is a systematic error in phylogenetic inference in which two or more lineages that have accumulated many substitutions are incorrectly grouped together as sister taxa, not because they share a recent common ancestor but because their long branches have independently evolved similar character states by chance. The phenomenon was first formalised by Felsenstein in 1978, who demonstrated that parsimony will converge on the wrong tree with increasing certainty as more data are added when the true tree contains two long branches separated by a short internal branch. The region of branch-length space in which this occurs is now known as the Felsenstein zone.¹⁵

Long-branch attraction is not limited to parsimony. Any phylogenetic method can be affected if its underlying model fails to adequately describe the substitution process on the long branches. However, model-based methods such as ML and Bayesian inference are less susceptible because they explicitly account for the probability of multiple substitutions at the same site, reducing the tendency to mistake convergent substitutions for shared derived states. Strategies for detecting and mitigating LBA include increasing taxon sampling to break up long branches, using more complex and realistic substitution models, removing fast-evolving sites or third codon positions, and analysing amino acid sequences instead of nucleotides for deep divergences.^{15, 23} The recognition that LBA represents a failure of the model rather than a failure of the data has been one of the most important insights in molecular phylogenetics and has driven the development of increasingly sophisticated models of sequence evolution.

Gene trees, species trees, and the multispecies coalescent

A fundamental insight of modern molecular phylogenetics is that the evolutionary history of a gene is not necessarily the same as the evolutionary history of the species that carry it. Gene trees can differ from the species tree due to several biological processes, the most pervasive of which is incomplete lineage sorting (ILS) — the failure of ancestral allelic lineages to coalesce within the species in which they originated, so that they persist through one or more speciation events and sort randomly into descendant species.¹⁶ ILS is expected to be most prevalent when speciation events are closely spaced in time relative to the ancestral population sizes, creating conditions where many gene lineages fail to coalesce between successive speciation events. Other sources of gene tree discordance include hybridisation and introgression, horizontal gene transfer, gene duplication and loss, and recombination within loci.¹⁶

The multispecies coalescent model provides a probabilistic framework that explicitly accounts for ILS by modelling the stochastic process by which gene lineages trace back through ancestral populations and merge (coalesce) in common ancestors. Under this model, each gene tree is treated as a random draw from a distribution of possible gene trees given the species tree topology and branch lengths (measured in coalescent units), allowing the species tree to be inferred while accommodating gene tree heterogeneity.^{16, 18}

Two broad strategies have emerged for species tree estimation. Summary methods first estimate individual gene trees independently and then summarise them into a species tree. The most widely used summary method is ASTRAL, which estimates the species tree that is consistent with the largest number of quartet topologies (four-taxon subtrees) induced by the input gene trees. ASTRAL is statistically consistent under the multispecies coalescent, computationally efficient enough to handle thousands of gene trees, and has become the standard approach for coalescent-based species tree estimation in phylogenomic studies.¹⁷ Full coalescent methods such as *BEAST (StarBEAST) jointly estimate gene trees, the species tree, divergence times, and population sizes in a single Bayesian MCMC analysis, using the multispecies coalescent as the prior distribution on gene trees given the species tree.¹⁸ This joint estimation approach is theoretically superior because it accounts for uncertainty in gene tree estimation, but it is computationally intensive and currently practical only for datasets with relatively few species and loci.
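The quartet criterion used by summary methods can be illustrated with a brute-force sketch that scores candidate species trees by how many gene-tree quartets they agree with. This is not ASTRAL's algorithm, which searches tree space with dynamic programming rather than enumerating candidates. The tree encoding (nested tuples) and the gene-tree counts are hypothetical.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf_set, list_of_clades) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def quartet_topology(tree, quartet):
    """Return the split (e.g. {A,B} | {C,D}) the tree induces on four taxa, or None."""
    q = frozenset(quartet)
    for clade in clades(tree)[1]:
        inside = clade & q
        if len(inside) == 2:
            # A clade holding exactly two of the four taxa fixes the quartet
            return frozenset([inside, q - inside])
    return None  # unresolved for this quartet


def quartet_score(species_tree, gene_trees):
    """Count (gene tree, quartet) pairs whose topology matches the species tree."""
    taxa = sorted(clades(species_tree)[0])
    score = 0
    for quartet in combinations(taxa, 4):
        target = quartet_topology(species_tree, quartet)
        if target is None:
            continue
        score += sum(quartet_topology(g, quartet) == target for g in gene_trees)
    return score


ab_cd = (("A", "B"), ("C", "D"))
ac_bd = (("A", "C"), ("B", "D"))
ad_bc = (("A", "D"), ("B", "C"))
gene_trees = [ab_cd] * 6 + [ac_bd] * 2 + [ad_bc] * 2  # hypothetical gene trees
scores = {name: quartet_score(tree, gene_trees)
          for name, tree in [("AB|CD", ab_cd), ("AC|BD", ac_bd), ("AD|BC", ad_bc)]}
best = max(scores, key=scores.get)
```

The candidate matching the majority quartet topology receives the highest score, in line with the expectation under the multispecies coalescent that, for four taxa, the most frequent gene-tree topology matches the species tree.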

The alternative to coalescent methods is concatenation (also called supermatrix analysis), in which sequences from multiple genes are joined end-to-end into a single large alignment and analysed as though they share a common tree. Concatenation has the advantage of simplicity and statistical power, and it performs well when gene tree discordance is modest. However, it has been shown to be statistically inconsistent in the so-called anomaly zone — regions of species tree space where the most common gene tree differs from the species tree — and can yield strongly supported but incorrect results when ILS is severe.¹⁶ For this reason, current best practice in phylogenomics recommends using both concatenation and coalescent-based approaches and evaluating whether they converge on the same topology.^{22, 23}

Phylogenomics and whole-genome approaches

The term phylogenomics refers broadly to the use of genomic-scale data for phylogenetic inference. Rather than relying on one or a few genes, phylogenomic studies typically analyse hundreds to thousands of loci, providing a far more comprehensive sample of the genome's evolutionary history.²² Several strategies exist for obtaining phylogenomic datasets, each with distinct advantages and trade-offs.

Whole-genome sequencing provides the most complete representation of a species' genetic material and is increasingly feasible for large numbers of taxa thanks to declining sequencing costs. However, the computational burden of assembling, annotating, and aligning whole genomes across divergent species remains formidable, and issues of orthology assignment — ensuring that the compared sequences are truly derived from the same ancestral gene rather than from gene duplication events — require careful bioinformatic analysis.²²

Ultraconserved elements (UCEs) are highly conserved genomic regions, typically several hundred base pairs in length, that are shared across distantly related organisms. Flanking sequences around UCE cores evolve at progressively higher rates with increasing distance from the core, providing phylogenetic information at a range of evolutionary timescales. Faircloth and colleagues demonstrated in 2012 that targeted capture and sequencing of UCE loci can generate thousands of orthologous markers spanning hundreds of millions of years of divergence, making them a versatile marker system for phylogenomics across vertebrates and other groups.¹⁹

Restriction-site associated DNA sequencing (RADseq) generates large numbers of single nucleotide polymorphism (SNP) markers by sequencing DNA adjacent to restriction enzyme cut sites throughout the genome. Introduced by Baird and colleagues in 2008, RADseq is particularly powerful for phylogenetic studies at shallow evolutionary timescales — among populations, closely related species, and recent radiations — where shared restriction sites ensure that orthologous loci are sampled across taxa.²⁰ At deeper timescales, however, mutations at restriction sites cause locus dropout, and the number of shared loci decreases rapidly with increasing divergence.

Phylogenomic analyses have resolved numerous long-standing controversies in systematics, from the placement of turtles within the reptile tree to the root of the placental mammal radiation to the relationships among early-diverging animal phyla. At the same time, the sheer volume of data in phylogenomic studies means that even subtle model misspecification or systematic biases can produce highly supported but incorrect results, making careful model selection, assessment of gene tree discordance, and comparison of analytical approaches more important than ever.^{22, 23}

Molecular clock models and divergence time estimation

Molecular phylogenetics is concerned not only with the topology of the tree — which species are more closely related to which — but also with the timing of divergence events. Estimating when lineages split requires a molecular clock, the concept that molecular sequences accumulate substitutions at a roughly constant rate over time, allowing genetic distance to be converted into absolute time with appropriate calibration. The original strict clock model, proposed by Zuckerkandl and Pauling in the 1960s, assumed a single global rate of evolution across all lineages, but decades of empirical work have demonstrated that substitution rates vary substantially among lineages, genes, and genomic regions.²¹

To accommodate this rate variation, relaxed clock models allow each branch of the tree to evolve at its own rate, drawn from a specified distribution. The most widely used relaxed clock is the uncorrelated lognormal model implemented in BEAST, in which each branch's rate is an independent draw from a lognormal distribution whose mean and variance are estimated from the data.²¹ This model makes no assumption about the correlation of rates between parent and daughter branches, allowing it to accommodate both gradual and abrupt rate changes. Alternative relaxed clock models include autocorrelated models, in which the rate on a branch is drawn from a distribution centred on the rate of its parent branch, reflecting the expectation that closely related lineages may have similar rates due to shared biology.^{21, 23}
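The uncorrelated lognormal idea can be sketched by drawing an independent rate for each branch and multiplying it by the branch duration to obtain a branch length in substitutions per site. The parameterisation below (a mean rate and a log-scale standard deviation `sigma`) is an assumption made for illustration and is not meant to mirror BEAST's configuration. In a real analysis these quantities are estimated rather than fixed.

```python
import math
import random


def draw_branch_rates(n_branches, mean_rate, sigma, rng):
    """Draw one independent lognormal rate per branch with the given mean."""
    # For a lognormal, mean = exp(mu + sigma^2 / 2), so solve for mu
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]


def branch_lengths(durations, rates):
    """Expected substitutions per site on each branch: rate x elapsed time."""
    return [rate * time for rate, time in zip(rates, durations)]


rng = random.Random(42)
durations = [5.0, 5.0, 12.0, 7.0]  # hypothetical branch durations (million years)
rates = draw_branch_rates(len(durations), mean_rate=0.01, sigma=0.5, rng=rng)
lengths = branch_lengths(durations, rates)
```

Because each rate is drawn independently, neighbouring branches can differ sharply. An autocorrelated model would instead centre each draw on the rate of the parent branch.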

All molecular clock analyses require external calibration to anchor the relative timescale to absolute time. Fossil calibrations are the most common source of temporal information: the oldest known fossil assignable to a particular clade provides a minimum age for the node representing the common ancestor of that clade and its sister group. These calibrations are typically implemented as probability distributions (uniform, exponential, or lognormal priors on node ages) rather than as fixed point estimates, reflecting the uncertainty inherent in the fossil record. Other calibration sources include biogeographic events of known age (such as the formation of an island or the closure of a seaway), rates estimated from ancient DNA, and secondary calibrations derived from previous molecular clock studies, though the latter must be used with caution to avoid propagating errors.^{21, 23}

Current best practices

The field of molecular phylogenetics has matured to the point where a broad consensus on best practices has emerged, even as methods continue to evolve. For tree inference, model-based methods — maximum likelihood and Bayesian inference — are strongly preferred over parsimony for molecular data, because they explicitly model the substitution process, are statistically consistent under the conditions in which parsimony may fail, and provide well-defined measures of branch support.^{7, 23} ML analysis is typically the method of choice for rapid, large-scale inference, while Bayesian analysis is preferred when co-estimation of divergence times, population sizes, or other parameters is required, or when a full posterior distribution of trees is desired.

Model selection using AIC or BIC should be performed for every dataset, and the most commonly justified starting point for nucleotide data is the GTR+Γ model, which can then be simplified if warranted by model comparison.^{9, 10} For phylogenomic datasets, partitioned models that assign different substitution parameters to different genes or codon positions are standard, and programs such as IQ-TREE and RAxML automate the partition-finding process.^{24, 25}

When multiple loci are available, as in any modern phylogenomic study, researchers should estimate individual gene trees and assess the level of gene tree discordance before choosing between concatenation and coalescent-based species tree methods. If discordance is extensive — as it often is for rapid radiations, recent divergences with large ancestral population sizes, or taxa with histories of hybridisation — coalescent-based methods such as ASTRAL should be employed.^{16, 17} For divergence time estimation, relaxed clock models with carefully justified fossil calibrations represent the current standard, implemented within Bayesian frameworks such as BEAST.^{13, 21}

Perhaps the most important best practice is the recognition that no single analysis is definitive. Robust phylogenetic conclusions are those that are supported across different data types, analytical methods, substitution models, and sampling strategies. The concordance of results from maximum likelihood and Bayesian analyses, from concatenation and coalescent approaches, and from different genomic markers provides the strongest evidence for the accuracy of an inferred evolutionary relationship.^{22, 23}

References

1. Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443–453, 1970.