Phylogenetics and molecular evolution

Overview

Phylogenetics reconstructs the evolutionary relationships among organisms by comparing DNA, RNA, and protein sequences, producing branching diagrams (phylogenetic trees) that depict the history of life—a revolution that began when Zuckerkandl and Pauling proposed in 1965 that molecules themselves record evolutionary time.
Tree-building methods—maximum parsimony, maximum likelihood, and Bayesian inference—each infer evolutionary relationships from sequence data using different optimality criteria, while molecular clocks calibrated by fossils allow biologists to estimate when lineages diverged, dating the human–chimpanzee split to approximately 5–7 million years ago.
Kimura's neutral theory of molecular evolution provided the theoretical foundation for molecular clocks by proposing that most substitutions at the DNA level are selectively neutral and accumulate at roughly constant rates. Carl Woese's comparison of 16S ribosomal RNA sequences revealed the three-domain tree of life (Bacteria, Archaea, and Eukarya), fundamentally redrawing the deepest branches of evolutionary history.

Phylogenetics is the science of reconstructing the evolutionary relationships among organisms, producing branching diagrams (phylogenetic trees) that depict patterns of descent from common ancestors. For most of the history of biology, these relationships were inferred from comparative anatomy, embryology, and the fossil record. The field was transformed in the second half of the twentieth century by the realization that DNA, RNA, and protein sequences carry a detailed record of evolutionary history, one that can be analyzed quantitatively. In 1965, Emile Zuckerkandl and Linus Pauling proposed that the number of amino acid differences between homologous proteins in two species is roughly proportional to the time since their last common ancestor, a concept they termed the molecular clock.¹ This insight launched the era of molecular evolution, in which sequence comparisons became the primary tool for reconstructing the tree of life.^{7, 12}

Figure: Phylogenetic tree of Ceratosauria, after Oliver Rauhut and Matthew Carrano (2016). Image: XxKingsman13, Wikimedia Commons, CC BY-SA 4.0.

The molecular revolution did more than refine existing phylogenies. It resolved long-standing debates that morphology alone could not settle, revealed deep relationships invisible to anatomy, and provided an independent timescale for evolutionary events. Molecular data confirmed the close kinship of humans and African great apes, dated the human–chimpanzee divergence to approximately 5–7 million years ago, and—most dramatically—revealed that the fundamental division of cellular life is not between plants and animals but among three domains: Bacteria, Archaea, and Eukarya.^{3, 8, 10}

The molecular revolution in systematics

Before the advent of molecular techniques, systematists classified organisms by comparing their physical features: bone structure, organ arrangement, and developmental patterns. This approach was powerful but limited: convergent evolution could make distantly related organisms appear similar, while rapid morphological change could obscure close relationships. Molecular sequences offered a solution. Every organism's genome accumulates mutations over time. According to Motoo Kimura's neutral theory of molecular evolution, the vast majority of the substitutions that become fixed are selectively neutral, so they accumulate at approximately constant rates and provide a molecular record of evolutionary divergence.^{2, 12}

Among the earliest and most influential molecular comparisons were those involving cytochrome c, a protein essential to cellular respiration that is found in virtually all aerobic organisms. Richard Dickerson's 1972 analysis showed that the amino acid sequence of cytochrome c is extraordinarily conserved: human and chimpanzee cytochrome c are identical, human and rhesus monkey sequences differ by one residue, and even human and yeast sequences, separated by over a billion years of evolution, share approximately 60% of their amino acids.¹¹ The pattern of differences across species closely mirrors the phylogenetic tree inferred from morphology and the fossil record, providing powerful independent confirmation of common descent.^{11, 12}
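The comparisons described above come down to counting differing positions between aligned sequences. The following is a minimal sketch in Python, assuming the sequences are already aligned to equal length. The function names and the ten-residue toy sequences are invented for illustration and are not real cytochrome c data.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that carry the same residue."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    same = len(seq_a) - count_differences(seq_a, seq_b)
    return 100.0 * same / len(seq_a)


# Hypothetical toy fragments (assumption: made up, not taken from any database).
reference = "MKTAYIAKQR"
close_relative = "MKTAYIAKQK"    # differs from the reference at one position
distant_relative = "MRSAYLGKHR"  # differs from the reference at five positions

print(count_differences(reference, close_relative))
print(percent_identity(reference, distant_relative))
```

Raw counts like these underestimate divergence over long timescales, because repeated substitutions at the same site are invisible; model-based corrections are normally applied before distances are used for tree building or dating.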

The ratio of nonsynonymous to synonymous substitutions (dN/dS) in protein-coding genes has become a standard measure for detecting natural selection at the molecular level. A dN/dS ratio significantly less than one indicates purifying selection preserving protein function; a ratio near one indicates neutral evolution; and a ratio greater than one signals positive selection driving adaptive change. This metric has revealed, for example, that genes involved in immunity, reproduction, and sensory perception are frequently subject to positive selection across primate lineages, while housekeeping genes remain under strong purifying constraint.^{8, 12}
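The interpretation rule above can be written down directly. This is only a sketch: it assumes dN and dS have already been estimated by some other method, and the `tolerance` band around one is a hypothetical stand-in for a proper statistical test (in practice, significance is usually assessed with likelihood-based tests rather than a fixed cutoff).

```python
def omega(dn: float, ds: float) -> float:
    """Return the dN/dS ratio; dS must be positive for the ratio to be defined."""
    if ds <= 0:
        raise ValueError("dS must be positive")
    return dn / ds


def selection_regime(dn: float, ds: float, tolerance: float = 0.1) -> str:
    """Classify a gene by its dN/dS ratio.

    `tolerance` is an illustrative assumption: ratios within
    1 +/- tolerance are treated as consistent with neutrality.
    """
    w = omega(dn, ds)
    if w < 1.0 - tolerance:
        return "purifying"
    if w > 1.0 + tolerance:
        return "positive"
    return "neutral"


# Hypothetical per-gene estimates, for illustration only.
print(selection_regime(dn=0.01, ds=0.10))
print(selection_regime(dn=0.30, ds=0.10))
```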

Methods for reconstructing phylogenies

Modern phylogenetics employs several computational methods to infer evolutionary trees from sequence data, each grounded in a different optimality criterion. Maximum parsimony, the simplest approach, selects the tree that requires the fewest evolutionary changes to explain the observed data. While intuitive, parsimony can be inconsistent when rates of evolution vary across lineages—a problem known as long-branch attraction, in which rapidly evolving lineages are erroneously grouped together.⁴
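To make the parsimony criterion concrete, the sketch below scores a fixed tree with Fitch's algorithm, one standard way of counting the minimum number of changes a tree requires. It is an illustration rather than a full parsimony search: trees are written as nested tuples of taxon names, and the four-taxon alignment is invented for the example.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a taxon name (leaf) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no change is needed here.
        return shared, left_cost + right_cost
    # The children disagree: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the Fitch cost over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch(tree, site)[1]
    return total


# Hypothetical toy alignment (assumption: invented for illustration).
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCAA", "D": "TCAT"}

# The three possible groupings of four taxa.
candidates = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, alignment) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With only four taxa every tree can be scored exhaustively; for realistic data sets the number of possible trees grows too quickly for that, and heuristic searches are used instead.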

Maximum likelihood, introduced to phylogenetics by Joseph Felsenstein in 1981, evaluates trees by calculating the probability of observing the data under an explicit model of sequence evolution. By incorporating parameters for substitution rates, base composition, and rate variation across sites, likelihood methods are statistically rigorous and generally more accurate than parsimony, particularly when evolutionary rates are heterogeneous.⁵ Bayesian inference, implemented in widely used software such as MrBayes, extends the likelihood framework by incorporating prior probability distributions over tree topologies and model parameters, producing posterior probability distributions that quantify the uncertainty in every aspect of the inferred phylogeny.⁶ Both likelihood and Bayesian methods require substantial computational resources, but advances in algorithms and hardware have made them the standard approaches in molecular phylogenetics.^{4, 7}
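As an illustration of what "calculating the probability of observing the data" involves, the sketch below evaluates the likelihood of a small alignment on a fixed tree using the pruning recursion under the Jukes-Cantor (JC69) model. The choice of JC69, the fixed branch lengths, the tree encoding, and the toy alignment are all assumptions made for the example; real analyses use richer models and optimize branch lengths and topology.

```python
import math

BASES = "ACGT"


def jc69_prob(i: str, j: str, t: float) -> float:
    """P(base j after branch length t | base i) under JC69.

    t is measured in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partials(node, site):
    """Conditional likelihood of the data below `node` for each possible base.

    A node is a taxon name (leaf) or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_partials = partials(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * child_partials[c] for c in BASES)
    return result


def site_likelihood(tree, site) -> float:
    """Average the root partials over the uniform JC69 base frequencies."""
    root = partials(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def log_likelihood(tree, alignment) -> float:
    """Sum log site likelihoods, treating sites as independent."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += math.log(site_likelihood(tree, site))
    return total


def quartet(a, b, c, d, t=0.1):
    """Build the tree ((a,b),(c,d)) with hypothetical fixed branch lengths."""
    left = ((a, t), (b, t))
    right = ((c, t), (d, t))
    return ((left, t / 2), (right, t / 2))


# Hypothetical toy alignment (assumption: invented for illustration).
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCAA", "D": "TCAT"}
for name, tree in [("AB|CD", quartet("A", "B", "C", "D")),
                   ("AC|BD", quartet("A", "C", "B", "D")),
                   ("AD|BC", quartet("A", "D", "B", "C"))]:
    print(name, log_likelihood(tree, alignment))
```

The tree with the highest log-likelihood is preferred. A Bayesian analysis would combine the same likelihood with priors on topology and parameters and sample from the resulting posterior, which is not shown here.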

All tree-building methods depend on accurate sequence alignment—the identification of homologous positions across sequences from different species. Multiple sequence alignment algorithms such as MUSCLE and MAFFT arrange sequences so that nucleotides or amino acids descended from the same ancestral position are placed in the same column. Errors in alignment propagate directly into errors in tree inference, making alignment quality a critical determinant of phylogenetic accuracy.^{4, 7}
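The idea of placing homologous positions in the same column can be illustrated with pairwise global alignment by dynamic programming (the Needleman-Wunsch algorithm). This is a deliberately simplified stand-in: it is not how MUSCLE or MAFFT work internally, it aligns only two sequences, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for the example.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

Each column of the output is a hypothesis of homology. A misplaced gap shifts residues into the wrong columns, which is how alignment errors feed into tree inference.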

Molecular clocks and the timing of evolution

The concept of the molecular clock—that molecular sequences evolve at roughly constant rates over time—allows biologists to convert genetic distances between species into estimates of divergence time. Zuckerkandl and Pauling's original proposal was based on the observation that the number of amino acid differences in hemoglobin between any two vertebrate lineages was approximately proportional to the time since their last common ancestor, as estimated from the fossil record.¹ Kimura's neutral theory provided the theoretical foundation: if most substitutions are selectively neutral, the rate of molecular evolution equals the rate of neutral mutation, which is expected to be roughly constant over time and across lineages.²
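The equality between substitution rate and neutral mutation rate follows from a short standard argument, sketched here. The symbols are introduced for illustration and do not appear in the text above: N is the size of a diploid population, \mu the neutral mutation rate per site per generation, k the substitution rate, d the expected number of substitutions per site separating two lineages, and t the time since their common ancestor.

```latex
\begin{align}
  \text{new neutral mutations per site per generation} &= 2N\mu \\
  \text{probability that any one of them reaches fixation} &= \frac{1}{2N} \\
  k &= 2N\mu \cdot \frac{1}{2N} = \mu \\
  d = 2kt \quad &\Longrightarrow \quad t = \frac{d}{2k}
\end{align}
```

Population size cancels, which is why the neutral substitution rate is expected to depend only on the mutation rate. The factor of two in the last line reflects that substitutions accumulate independently along both lineages after they split.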

In practice, molecular clocks are not perfectly constant. Rates vary across genes, across lineages, and across time periods: organisms with shorter generation times tend to evolve faster, and different proteins are subject to different levels of functional constraint. Modern methods address this variation using relaxed clock models that allow rates to vary across branches of the tree, calibrated by fossil dates at specific nodes. These approaches have been applied to date major events in the history of life, including the divergence of humans and chimpanzees approximately 5–7 million years ago, the radiation of placental mammals in the Late Cretaceous, and the origin of eukaryotic cells over 1.5 billion years ago.^{8, 9, 12}
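The role of a fossil calibration can be shown with the simplest possible case, a strict clock with a single rate. This sketch is not a relaxed clock model: it assumes one constant rate on every branch, and the distances and the calibration date are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def substitution_rate(distance: float, calibration_time_myr: float) -> float:
    """Rate per site per million years along a single lineage.

    `distance` is the number of substitutions per site separating two species
    whose split is dated by a fossil. It accumulates along both lineages,
    hence the factor of two.
    """
    if calibration_time_myr <= 0:
        raise ValueError("calibration time must be positive")
    return distance / (2.0 * calibration_time_myr)


def divergence_time(distance: float, rate: float) -> float:
    """Estimated time since divergence (million years) under a strict clock."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical inputs (assumptions, not measured values):
# a calibration pair 0.06 substitutions/site apart with a 30-Myr-old fossil split,
# and a target pair 0.012 substitutions/site apart.
rate = substitution_rate(distance=0.06, calibration_time_myr=30.0)
estimate = divergence_time(distance=0.012, rate=rate)
print(rate, estimate)
```

Relaxed clock methods replace the single `rate` with branch-specific rates drawn from a distribution and typically use several calibrations at once, which yields date estimates with credible intervals rather than a single number.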

The three-domain tree of life

Perhaps the most consequential discovery of molecular phylogenetics was Carl Woese's demonstration that the deepest division in the tree of life is not between prokaryotes and eukaryotes, as previously assumed, but among three distinct domains. In the late 1970s, Woese and George Fox compared sequences of 16S ribosomal RNA—a molecule present in all cellular organisms and evolving slowly enough to retain phylogenetic signal across billions of years—and discovered that a group of organisms then classified as bacteria were in fact as genetically distinct from true bacteria as both groups were from eukaryotes.¹⁰ These organisms, initially called archaebacteria and later renamed Archaea, include extremophiles inhabiting hot springs, salt lakes, and anaerobic environments, but subsequent research revealed them to be far more widespread.

Woese, Otto Kandler, and Mark Wheelis formally proposed the three-domain system in 1990, dividing all cellular life into Bacteria, Archaea, and Eukarya.¹⁰ This reclassification was initially controversial among microbiologists trained in the traditional prokaryote–eukaryote dichotomy, but it has been overwhelmingly confirmed by subsequent genomic analyses. The three-domain framework transformed biology's understanding of early evolution and demonstrated the power of molecular phylogenetics to resolve relationships that morphological comparison could not even detect.^{3, 7} Whole-genome phylogenomics—the inference of evolutionary relationships from hundreds or thousands of genes simultaneously—has further refined the tree of life, resolving nodes that individual gene trees left ambiguous and revealing the pervasive role of horizontal gene transfer in microbial evolution.⁷

Together, the tools of molecular phylogenetics—sequence comparison, statistical tree inference, molecular clocks, and cladistic classification—have produced a detailed, testable, and continuously refined picture of how all living organisms are related. The field continues to grow as genomic sequencing becomes cheaper and more comprehensive, extending the reach of phylogenetic analysis to uncultured microorganisms, ancient DNA, and the deepest nodes of the tree of life.^{4, 7, 12}

Gene trees, species trees, and incomplete lineage sorting

One important development in modern phylogenetics is the recognition that individual gene trees do not always match the species tree. When ancestral populations are large and speciation events follow one another in rapid succession, different genes may be sorted into descendant lineages by genetic drift in patterns that differ from the branching order of species, a phenomenon known as incomplete lineage sorting (ILS). For instance, across approximately 30% of the human genome, the gorilla sequence is closer to either the human or the chimpanzee sequence than those two are to each other, even though the species tree unambiguously places humans and chimpanzees as sister taxa.^{8, 13}
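The dependence of ILS on population size and on the time between speciation events can be made quantitative for the three-species case. Under the standard neutral coalescent, the probability that a gene tree disagrees with the species tree is (2/3)·exp(-T), where T is the length of the internal branch in coalescent units (generations divided by twice the effective population size, for diploids). The sketch below applies that formula; the function names are invented for the example, and the discordant fraction of 0.30 is used only as an illustrative input.

```python
import math


def discordance_probability(generations: float, effective_size: float) -> float:
    """P(gene tree differs from species tree) for three species.

    `generations` is the time between the two speciation events and
    `effective_size` the diploid effective size of the ancestral population.
    """
    if generations < 0 or effective_size <= 0:
        raise ValueError("need generations >= 0 and effective_size > 0")
    t_coalescent = generations / (2.0 * effective_size)
    return (2.0 / 3.0) * math.exp(-t_coalescent)


def internal_branch_length(discordant_fraction: float) -> float:
    """Invert the formula: branch length in coalescent units implied by a
    given fraction of discordant gene trees (must lie in (0, 2/3])."""
    if not 0.0 < discordant_fraction <= 2.0 / 3.0:
        raise ValueError("discordant fraction must be in (0, 2/3]")
    return -math.log(1.5 * discordant_fraction)


# Short internal branch relative to population size: discordance is common.
print(discordance_probability(generations=10_000, effective_size=50_000))
# Long internal branch: lineages almost always sort as the species tree does.
print(discordance_probability(generations=1_000_000, effective_size=50_000))
# Branch length implied by an illustrative discordant fraction of 0.30.
print(internal_branch_length(0.30))
```

The two discordant topologies are expected to be equally frequent under pure ILS, so a marked asymmetry between them is one signal that something else, such as gene flow, is involved.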

Coalescent-based methods have been developed specifically to account for ILS by modeling the gene-tree-within-species-tree process explicitly, estimating species relationships from the distribution of gene tree topologies rather than from concatenated sequences. These approaches have resolved several contentious nodes in the tree of life, including the branching order of early placental mammal diversification, where rapid radiation made concatenation methods unreliable.^{13, 15}

Ancient DNA and paleogenomics

The recovery and sequencing of ancient DNA (aDNA) from fossil specimens has opened an entirely new dimension in molecular phylogenetics. Advances in extraction techniques and next-generation sequencing have made it possible to obtain genome-wide data from specimens tens of thousands to hundreds of thousands of years old, including Neanderthals, Denisovans, and extinct megafauna such as woolly mammoths and cave bears. These paleogenomic data provide direct snapshots of past genetic diversity and allow phylogenetic placement of extinct taxa with a precision that morphology alone could never achieve.¹⁴

Ancient DNA has also revealed past episodes of hybridization that left no morphological trace. The discovery that modern non-African human populations carry 1–4% Neanderthal DNA, and that some Melanesian and Australian populations carry an additional 3–6% Denisovan DNA, demonstrated that the tree of human evolution includes reticulate branches produced by interbreeding between lineages that had been separated for hundreds of thousands of years.¹⁴ These findings have expanded the conceptual framework of phylogenetics beyond strictly bifurcating trees to include phylogenetic networks that accommodate both divergence and gene flow among lineages.^{7, 14}


References

Molecules as Documents of Evolutionary History

Zuckerkandl, E. & Pauling, L. · Journal of Theoretical Biology 8: 357–366, 1965