DNA and the genetic code – Open Research Encyclopedia

Overview

DNA is a double-helical polymer of nucleotides whose two strands are held together by complementary base pairing — adenine with thymine, guanine with cytosine — a structure that immediately suggested its mechanism of replication and that encodes genetic information in the linear sequence of its bases.
The genetic code is a near-universal mapping of three-nucleotide codons to amino acids, shared with only minor variations across all domains of life — from bacteria to humans — constituting one of the strongest molecular lines of evidence for the common ancestry of all living organisms.
DNA replication achieves extraordinary fidelity through polymerase proofreading and mismatch repair, yet the residual error rate generates the heritable variation upon which natural selection and genetic drift act, making DNA both the conservator of biological information and the ultimate source of evolutionary novelty.

Deoxyribonucleic acid — DNA — is the molecule that stores, copies, and transmits genetic information in nearly all living organisms. Its discovery as the carrier of heredity, its structural elucidation by James Watson and Francis Crick in 1953, and the subsequent deciphering of the genetic code that relates DNA sequences to proteins rank among the most consequential achievements in the history of science. Together, these advances unified Mendelian genetics with biochemistry, revealed the molecular basis of evolution, and demonstrated that all known life shares a common molecular language — a universality that constitutes one of the most powerful lines of evidence for common descent.^{1, 11}

This article examines the structure of DNA, the genetic code that maps nucleotide sequences to amino acid sequences, the mechanisms by which DNA is replicated and expressed, and the ways in which these molecular processes connect to the broader framework of evolutionary biology.

Structure of DNA

The structure of DNA was determined by Watson and Crick in April 1953, building on X-ray diffraction data collected by Rosalind Franklin and Maurice Wilkins and on Erwin Chargaff's observation that the amount of adenine in any DNA sample equals the amount of thymine, and the amount of guanine equals the amount of cytosine. Watson and Crick proposed that DNA consists of two polynucleotide strands wound around each other in a right-handed double helix, with the sugar-phosphate backbones on the outside and the nitrogenous bases projecting inward, where they form specific hydrogen-bonded pairs: adenine (A) with thymine (T), and guanine (G) with cytosine (C).¹ The two strands run in opposite directions (antiparallel), with one oriented 5′ to 3′ and its complement 3′ to 5′. Each complete turn of the helix spans approximately 3.4 nanometres and encompasses about ten base pairs.^{1, 13}

Figure: The original DNA molecular model built by Francis Crick and James Watson in 1953, on display at the National Science Museum in London. This physical model demonstrated the double-helical structure with complementary base pairing that immediately suggested a mechanism for genetic replication. Image credit: Alkivar, Wikimedia Commons, public domain.

Each nucleotide — the monomeric unit of DNA — consists of three components: a five-carbon deoxyribose sugar, a phosphate group, and one of four nitrogenous bases. Adenine and guanine are purines (double-ringed structures), while cytosine and thymine are pyrimidines (single-ringed). The pairing of a purine with a pyrimidine at every position keeps the width of the helix constant, a constraint that Watson and Crick recognised as essential to explaining Chargaff's ratios and the regularity of the X-ray diffraction pattern.¹ Nucleotides are linked to one another by phosphodiester bonds between the 3′ carbon of one sugar and the 5′ carbon of the next, forming the continuous sugar-phosphate backbone along which genetic information is read.

The complementary base pairing between the two strands was immediately recognised as the key to understanding how DNA could serve as the hereditary material. Watson and Crick noted in their landmark paper that "it has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."¹ In a companion paper published the following month, they elaborated this insight, proposing that the two strands could separate and each serve as a template for the synthesis of a new complementary strand, producing two identical copies of the original molecule — the semiconservative model of replication.²
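The pairing rules and the template idea described above can be sketched in a few lines. This is only an illustration under simple assumptions: strands are plain strings written 5′ to 3′, the function names `complement_strand` and `replicate` are hypothetical, and the example sequence is arbitrary.

```python
# Watson-Crick pairing rules as stated in the text: A with T, G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(template):
    """Return the partner strand, written 5'->3'.

    The template is given 5'->3'. Because the two strands are antiparallel,
    the partner is obtained by pairing each base and reversing the order.
    """
    return "".join(PAIR[base] for base in reversed(template))

def replicate(duplex):
    """Semiconservative copy: each parental strand templates one new strand."""
    top, bottom = duplex
    return [(top, complement_strand(top)), (complement_strand(bottom), bottom)]

parent_top = "ATGCGTTA"  # arbitrary illustrative sequence
parent = (parent_top, complement_strand(parent_top))
daughters = replicate(parent)

print(parent)
print(daughters)
```

Each daughter duplex keeps one parental strand and gains one newly built strand, and both daughters carry the same sequence as the parent.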

The genetic code

The genetic code is the set of rules by which the nucleotide sequence of a gene is translated into the amino acid sequence of a protein. The code is read in non-overlapping triplets called codons, each consisting of three consecutive nucleotides. With four possible bases at each of three positions, there are 4³ = 64 possible codons. Of these, 61 specify amino acids and three (UAA, UAG, UGA) serve as stop signals that terminate translation. Because only 20 standard amino acids are used in protein synthesis, the code is degenerate (or redundant): most amino acids are specified by more than one codon, with variation typically occurring at the third (wobble) position of the codon.^{3, 4}
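The counting argument above can be checked by enumerating every triplet. The sketch below assumes the RNA alphabet (U, C, A, G) and uses the three stop codons named in the text; the variable names are illustrative.

```python
from itertools import product

BASES = "UCAG"  # RNA alphabet, matching the stop codons quoted in the text
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered choice of three bases is one codon: 4 * 4 * 4 possibilities.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons))              # 64
print(len(sense_codons))        # 61
print(len(sense_codons) / 20)   # 3.05 sense codons per amino acid on average
```

With 61 sense codons shared among 20 amino acids, at least some amino acids must be specified by more than one codon, which is the degeneracy described above.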

The experimental decipherment of the genetic code was achieved through a series of elegant experiments in the early 1960s. The foundational breakthrough came in 1961, when Marshall Nirenberg and Heinrich Matthaei demonstrated that a synthetic messenger RNA composed entirely of uracil (poly-U) directed the cell-free synthesis of a polypeptide composed entirely of phenylalanine, establishing that UUU is the codon for phenylalanine and providing the first concrete entry in the genetic code table.⁴ Subsequent work by Nirenberg, Har Gobind Khorana, and their colleagues, using synthetic polynucleotides of defined sequence and a ribosome-binding assay for individual codons, completed the full assignment of all 64 codons by 1966. Nirenberg and Khorana shared the 1968 Nobel Prize in Physiology or Medicine with Robert W. Holley for their work on the genetic code and its function in protein synthesis.^{4, 13}

That the code is read in triplets was established experimentally by Crick, Brenner, Barnett, and Watts-Tobin in 1961 through a genetic analysis of frameshift mutations in the rII region of bacteriophage T4. They showed that insertions or deletions of one or two nucleotides abolished gene function, but insertions or deletions of three nucleotides restored it, demonstrating that the reading frame is based on groups of three and that the code is read sequentially from a fixed starting point without gaps or overlaps.³
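The logic of the frameshift experiment can be illustrated with a toy sequence. The sketch below is not a model of the rII system: the mRNA string, the insertion positions, and the helper names are all hypothetical, chosen only to show why one or two single-base insertions scramble the downstream triplets while a third insertion restores them.

```python
def codons_in_frame(seq):
    """Split a sequence into consecutive, non-overlapping triplets from position 0."""
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def insert_base(seq, position, base):
    """Return seq with one extra base inserted before the given index."""
    return seq[:position] + base + seq[position:]

original = "AUGGCUCAUGAAUGCUUAGGA"      # arbitrary 7-codon sequence
plus_one = insert_base(original, 3, "C")
plus_two = insert_base(plus_one, 8, "C")
plus_three = insert_base(plus_two, 13, "C")

for label, seq in [("original", original), ("+1", plus_one),
                   ("+2", plus_two), ("+3", plus_three)]:
    print(f"{label:>8}: {' '.join(codons_in_frame(seq))}")
```

In the printed output, the last three codons of the "+1" and "+2" sequences differ from the original, whereas the "+3" sequence ends in the same three codons as the original: only the short stretch between the insertions is altered.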

The standard genetic code (within each cell, the four entries correspond to third position U, C, A, G, in that order)^{4, 11}

First position	Second position: U	Second position: C	Second position: A	Second position: G
U	Phe, Phe, Leu, Leu	Ser, Ser, Ser, Ser	Tyr, Tyr, Stop, Stop	Cys, Cys, Stop, Trp
C	Leu, Leu, Leu, Leu	Pro, Pro, Pro, Pro	His, His, Gln, Gln	Arg, Arg, Arg, Arg
A	Ile, Ile, Ile, Met	Thr, Thr, Thr, Thr	Asn, Asn, Lys, Lys	Ser, Ser, Arg, Arg
G	Val, Val, Val, Val	Ala, Ala, Ala, Ala	Asp, Asp, Glu, Glu	Gly, Gly, Gly, Gly
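The table can be turned into a lookup structure and used to count how many codons specify each amino acid. The sketch below assumes that the four entries in each cell are ordered by third base U, C, A, G (the usual convention, which reproduces the assignments UUU = Phe, AUG = Met and the three stop codons named earlier); `ROWS`, `CODON_TABLE` and `degeneracy` are illustrative names.

```python
from collections import Counter

BASES = "UCAG"
# One string per table row (first base). Within a row the entries run over
# second base U, C, A, G, and within each cell over third base U, C, A, G.
ROWS = {
    "U": "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp",
    "C": "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg",
    "A": "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg",
    "G": "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly",
}

CODON_TABLE = {}
for first, row in ROWS.items():
    entries = row.split()
    for i, second in enumerate(BASES):
        for j, third in enumerate(BASES):
            CODON_TABLE[first + second + third] = entries[4 * i + j]

# Number of codons assigned to each amino acid (and to Stop).
degeneracy = Counter(CODON_TABLE.values())
for meaning, count in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(f"{meaning}: {count}")
```

Leucine, serine and arginine each have six codons, while methionine and tryptophan have one each, which is the degeneracy visible in the table.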

The near-universality of the genetic code is one of its most striking features and one of the strongest pieces of molecular evidence for the common ancestry of all life. With only minor variations, most notably in mitochondrial genomes and in a handful of organisms such as Mycoplasma and certain ciliated protists, the same codons carry the same meanings (amino acid or stop signal) in bacteria, archaea, plants, fungi, and animals.^{11, 15} This universality is most parsimoniously explained by inheritance from a single ancestral population in which the code was already established, rather than by independent invention of the same arbitrary mapping in multiple lineages. The code appears to be optimised to minimise the impact of point mutations and translation errors: amino acids with similar chemical properties tend to be assigned to codons that differ by only a single nucleotide, reducing the probability that a random change will produce a radically different protein.¹¹

DNA replication and its fidelity

DNA replication is the process by which a cell duplicates its entire genome prior to cell division, producing two identical copies of each chromosome. The semiconservative mechanism proposed by Watson and Crick — in which each strand of the parental double helix serves as a template for the synthesis of a new complementary strand — was confirmed experimentally by Matthew Meselson and Franklin Stahl in 1958 using density-gradient centrifugation of Escherichia coli DNA labelled with heavy nitrogen (¹⁵N).¹³

Replication begins at specific chromosomal sites called origins of replication, where the double helix is unwound by helicase enzymes to create a replication fork. DNA polymerase III (in bacteria) or the replicative polymerases delta and epsilon (in eukaryotes) then synthesise new DNA by adding nucleotides complementary to each template strand, always extending in the 5′-to-3′ direction. Because the two template strands are antiparallel, one new strand (the leading strand) is synthesised continuously toward the replication fork, while the other (the lagging strand) is synthesised discontinuously as short Okazaki fragments that are later joined by DNA ligase.¹³

The fidelity of DNA replication is extraordinary. Replicative DNA polymerases insert incorrect nucleotides at a rate of roughly one per 10⁴ to 10⁵ base pairs, but the intrinsic 3′-to-5′ exonuclease proofreading activity immediately excises most of these errors, improving accuracy roughly 100-fold. Post-replicative mismatch repair then corrects most of the remaining errors, yielding an overall error rate of approximately 10⁻⁹ to 10⁻¹⁰ per base pair per cell division.⁷ For the diploid human genome of about 6.4 billion base pairs, this corresponds to roughly 0.6 to 6 uncorrected errors per cell division, a remarkable achievement considering that the replication machinery must copy the entire genome in a matter of hours. The residual errors that escape all layers of correction become mutations, the raw material of evolutionary change.^{7, 14}
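The arithmetic behind these figures is a simple product of error rate and genome size. The sketch below uses only the numbers quoted in the paragraph; the stage labels and the function name `expected_errors` are illustrative, and the calculation assumes a uniform error rate across the genome.

```python
GENOME_BP = 6.4e9          # genome size quoted in the text (base pairs)
PROOFREADING_GAIN = 100    # "roughly 100-fold" improvement from proofreading

def expected_errors(rate_per_bp, genome_bp=GENOME_BP):
    """Expected number of errors per genome copy at a given per-base-pair rate."""
    return rate_per_bp * genome_bp

# (higher rate, lower rate) per base pair at each stage, from the text.
stages = {
    "polymerase insertion": (1e-4, 1e-5),
    "after proofreading": (1e-4 / PROOFREADING_GAIN, 1e-5 / PROOFREADING_GAIN),
    "after mismatch repair": (1e-9, 1e-10),
}

for stage, (high, low) in stages.items():
    print(f"{stage}: {expected_errors(low):,.2f} to "
          f"{expected_errors(high):,.2f} expected errors per division")
```

With these inputs the final stage comes out at about 0.64 to 6.4 expected errors per division, compared with tens or hundreds of thousands for an uncorrected polymerase.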

From DNA to protein: the central dogma

The flow of genetic information from DNA to RNA to protein was formalised by Francis Crick in 1958 as the central dogma of molecular biology. In its original formulation, the central dogma states that sequence information in nucleic acids can be transferred to other nucleic acids or to protein, but information in protein cannot be transferred back to nucleic acid.⁵ Crick restated the principle more precisely in 1970, clarifying that the dogma concerns information transfer (the sequential order of residues) rather than material transfer, and that while unusual transfers such as reverse transcription (RNA to DNA) are possible, the transfer of sequence information from protein to nucleic acid has never been observed.⁶

Transcription is the first step in gene expression, in which the enzyme RNA polymerase synthesises a single-stranded messenger RNA (mRNA) molecule complementary to one strand of the DNA template. RNA polymerase binds to a promoter sequence upstream of the gene, unwinds a short segment of the double helix, and synthesises the mRNA in the 5′-to-3′ direction by adding ribonucleotides (using uracil in place of thymine) complementary to the template strand. In eukaryotes, the initial transcript (pre-mRNA) undergoes extensive processing — including 5′ capping, 3′ polyadenylation, and the removal of introns by splicing — before the mature mRNA is exported to the cytoplasm for translation.¹³
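The base-pairing step of transcription can be sketched as follows. This is a simplification that ignores promoters, capping, polyadenylation and splicing; the function name `transcribe` and the example sequences are hypothetical, and both strands are written 5′ to 3′.

```python
# DNA template base -> RNA base added opposite it (uracil replaces thymine).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand):
    """Return the mRNA (5'->3') made from a DNA template strand given 5'->3'.

    The polymerase reads the template 3'->5' while building the RNA 5'->3',
    so the template is traversed in reverse order.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_strand))

coding_strand = "ATGCGTTA"     # arbitrary example, 5'->3'
template_strand = "TAACGCAT"   # its complement, also written 5'->3'

mrna = transcribe(template_strand)
print(mrna)  # AUGCGUUA
```

The resulting mRNA matches the coding (non-template) strand with uracil in place of thymine, which is why gene sequences are conventionally reported as the coding strand.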

Translation is the process by which the ribosome decodes the mRNA sequence into a polypeptide chain. The ribosome reads the mRNA in successive three-nucleotide codons, beginning at the start codon (AUG, which specifies methionine) and continuing until a stop codon is encountered. At each codon, a transfer RNA (tRNA) molecule bearing the complementary anticodon delivers the appropriate amino acid, which is joined to the growing polypeptide by a peptide bond; the reaction is catalysed by the ribosomal RNA of the large subunit. The discovery that the peptidyl transferase centre of the ribosome is composed entirely of RNA, with no protein within 18 angstroms of the catalytic site, is widely taken as strong support for the hypothesis that translation originated in an RNA world before proteins existed.¹³
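The decoding rule described above (start at AUG, read triplets, halt at a stop codon) can be sketched with the standard code written as a 64-character string. The sketch assumes the first AUG in the string is the start codon, which is a simplification of real initiation; one-letter amino acid codes are used, "*" marks a stop, and `translate` is an illustrative name.

```python
from itertools import product

BASES = "UCAG"
# Standard code in the order UUU, UUC, UUA, UUG, UCU, ... GGG; "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)

print(translate("GGAUGUUUGGCUAAGG"))  # MFG (Met-Phe-Gly, then UAA stops)
```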

Mutations and genetic variation

Despite the extraordinary fidelity of DNA replication, errors do occur. These heritable changes in the nucleotide sequence — mutations — are the ultimate source of all genetic variation and therefore of all evolutionary change. Point mutations substitute one base pair for another; insertions and deletions add or remove nucleotides; and larger-scale rearrangements can duplicate, invert, or translocate entire chromosomal segments. In protein-coding regions, point mutations may be synonymous (changing a codon to another that specifies the same amino acid, thanks to the degeneracy of the genetic code) or non-synonymous (altering the encoded amino acid), with non-synonymous mutations more likely to affect protein function and therefore more likely to be subject to natural selection.^{7, 14}
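The distinction between synonymous and non-synonymous point mutations follows directly from the codon table. The sketch below classifies single-base substitutions and tallies the synonymous ones by codon position; it treats a change to a stop codon as non-synonymous, and the function and variable names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard code in the order UUU, UUC, UUA, UUG, UCU, ... GGG; "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(codon, position, new_base):
    """Label a single-base change (position 0, 1 or 2) in a codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    unchanged = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if unchanged else "non-synonymous"

# Count synonymous substitutions at each codon position over all sense codons.
synonymous_by_position = [0, 0, 0]
for codon, residue in CODON_TABLE.items():
    if residue == "*":
        continue  # only sense codons are considered
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            if classify_substitution(codon, position, base) == "synonymous":
                synonymous_by_position[position] += 1

print(classify_substitution("GAA", 2, "G"))  # synonymous: GAA and GAG both give Glu
print(classify_substitution("GAA", 1, "U"))  # non-synonymous: Glu -> Val
print(synonymous_by_position)                # [8, 0, 126]
```

Nearly all synonymous substitutions fall at the third codon position, consistent with the degeneracy of the code being concentrated there.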

The human germline mutation rate has been estimated at approximately 1.2 × 10⁻⁸ per nucleotide per generation, corresponding to roughly 70 new single-nucleotide mutations per individual per generation.¹⁴ Most of these mutations occur in non-coding DNA and are selectively neutral, having no effect on the organism's fitness. A smaller fraction are deleterious and are removed from the population by purifying selection. An even smaller fraction are beneficial and may spread through the population by positive selection, driving adaptive evolution. This process — the generation of random variation by mutation, followed by the non-random sorting of that variation by natural selection and genetic drift — is the molecular foundation of evolution.^{14, 13}
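The per-generation figure can be reproduced from the quoted rate and genome size. The sketch below assumes a diploid genome of twice the 3.2 billion base pair haploid size and a mutation rate that is uniform across the genome; the coding fraction of about 1.5 percent is the value given in the article's discussion of genome organisation.

```python
MUTATION_RATE = 1.2e-8       # per nucleotide per generation (from the text)
HAPLOID_GENOME_BP = 3.2e9    # haploid human genome size (from the text)
CODING_FRACTION = 0.015      # about 1.5 percent protein-coding (from the text)

# Each individual inherits two haploid genomes, one from each parent.
diploid_sites = 2 * HAPLOID_GENOME_BP
new_mutations = MUTATION_RATE * diploid_sites
coding_mutations = new_mutations * CODING_FRACTION

print(f"{new_mutations:.1f} new mutations per individual per generation")
print(f"{coding_mutations:.2f} of them expected in protein-coding sequence")
```

The product is about 77, in line with the rounded figure of roughly 70, and on the uniform-rate assumption only about one of these new mutations is expected to fall in protein-coding sequence.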

Gene duplication provides another major source of evolutionary novelty. When a segment of DNA is duplicated, one copy can continue to perform its original function while the other is free to accumulate mutations and potentially acquire a new function (neofunctionalisation) or divide the ancestral function between the two copies (subfunctionalisation). Whole-genome duplications have occurred at key points in evolutionary history, including two rounds early in the vertebrate lineage that are thought to have provided the raw material for the elaboration of the vertebrate body plan.¹³

DNA as evidence for evolution

The molecular details of DNA provide some of the most compelling evidence for the theory of evolution and the common ancestry of all life. Several lines of molecular evidence converge on the same conclusion.

The near-universality of the genetic code, discussed above, is most economically explained by inheritance from a common ancestor. The assignment of codons to amino acids is chemically arbitrary — there is no intrinsic reason why UUU should code for phenylalanine rather than any other amino acid — so the fact that the same code is used by organisms as diverse as E. coli, oak trees, and humans points to a single origin. The handful of known exceptions involve only minor reassignments, most commonly in mitochondrial genomes, and are best understood as evolutionary modifications of the ancestral code rather than independent inventions.^{11, 15}

Phylogenetic comparisons of DNA sequences allow evolutionary relationships to be reconstructed with quantitative precision. Because mutations accumulate at roughly constant rates in selectively neutral DNA sequences, the degree of sequence divergence between two species reflects the time since they shared a common ancestor — the principle underlying the molecular clock. These molecular phylogenies consistently agree with phylogenies derived from anatomical, embryological, and fossil evidence, providing independent confirmation of evolutionary relationships. For example, comparisons of the cytochrome c gene across dozens of species produce a branching tree that mirrors the tree derived from the fossil record, with closely related species showing few sequence differences and distantly related species showing many.¹³
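The molecular clock idea can be expressed as a short calculation. The sketch below is a deliberately minimal version: it uses the uncorrected proportion of differing sites (the p-distance) and the relation T = d / (2r), with no correction for multiple substitutions at the same site. The sequences and the substitution rate are hypothetical values chosen for illustration, not measurements.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

def divergence_time(distance, rate_per_site_per_year):
    """Estimate time since the common ancestor as T = d / (2r).

    The factor of 2 reflects that both lineages accumulate substitutions
    independently after the split.
    """
    return distance / (2 * rate_per_site_per_year)

species_a = "ACGTACGTAC"   # hypothetical aligned sequences
species_b = "ACGTACGTAA"
ASSUMED_RATE = 1e-9        # hypothetical substitutions per site per year

d = p_distance(species_a, species_b)
print(d)                                   # 0.1
print(divergence_time(d, ASSUMED_RATE))    # about 5e7 years
```

Real analyses use much longer alignments, substitution models that correct for repeated changes at a site, and calibration of the rate against dated fossils or other independent evidence.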

Shared molecular errors provide particularly powerful evidence for common descent. When two species share the same non-functional DNA sequence (a pseudogene, a transposon insertion, or an endogenous retrovirus) at the same chromosomal location, the most parsimonious explanation is that the element was present in a common ancestor and was inherited by both descendant lineages. The human and chimpanzee genomes share thousands of such elements at identical positions, including the same inactivating mutation in the vitamin C synthesis gene (GULO). Human chromosome 2 provides a related line of evidence: it corresponds to two ancestral ape chromosomes, still separate in chimpanzees, that fused end-to-end in the human lineage.^{9, 12}

DNA sequence comparisons also reveal that different genes and different genomic regions evolve at different rates, a pattern that reflects the varying intensity of natural selection acting on different parts of the genome. Functionally critical sequences, such as the active sites of essential enzymes or the ribosomal RNA genes, are highly conserved across billions of years of evolution because most mutations in these regions are deleterious and are eliminated by purifying selection. Pseudogenes and intergenic sequences, which are under little or no selective constraint, accumulate mutations at rates close to the underlying mutation rate and are therefore useful as neutral molecular clocks for dating divergence events.^{13, 14}

Genome organisation

The sequencing of the human genome, completed in draft form in 2001, revealed that the organisation of a complex eukaryotic genome is far more intricate than the simple picture of genes arrayed along chromosomes might suggest. Of the approximately 3.2 billion base pairs in the haploid human genome, only about 1.5 percent encodes protein — roughly 20,000 to 25,000 protein-coding genes.⁹ The remaining 98.5 percent consists of non-coding DNA, including regulatory sequences, introns, repetitive elements, transposable elements, pseudogenes, and large stretches of DNA whose function, if any, remains under investigation.^{9, 10}

Protein-coding genes in eukaryotes are typically split into exons (sequences that are retained in the mature mRNA and translated into protein) and introns (intervening sequences that are transcribed but removed by RNA splicing before translation). The average human gene contains about eight introns, and introns often account for the vast majority of a gene's total length. The dystrophin gene, for example, spans 2.4 million base pairs but produces a mature mRNA of only about 14,000 nucleotides.^{9, 13} The evolutionary origin and significance of introns have been debated since their discovery in 1977. Under the "introns-early" hypothesis, introns are remnants of the ancient RNA world and facilitated the recombination of exon-encoded protein modules; under the "introns-late" hypothesis, introns are mobile genetic elements that invaded genes after the split between prokaryotes and eukaryotes.⁸

Nearly half of the human genome is composed of transposable elements — repetitive sequences derived from mobile genetic elements that have inserted copies of themselves throughout the genome over evolutionary time. The most abundant class, long interspersed nuclear elements (LINEs), accounts for about 20 percent of the genome, while short interspersed nuclear elements (SINEs), including the primate-specific Alu elements, account for about 13 percent.⁹ Most of these elements are now inactive, their sequences degraded by accumulated mutations, but their presence provides a detailed record of genomic evolution. The distribution of shared transposable element insertions across species is a powerful tool for reconstructing phylogenetic relationships, because the probability of two independent insertions occurring at precisely the same genomic location is vanishingly small.^{9, 13}

The ENCODE project, which aimed to identify all functional elements in the human genome, reported in 2012 that approximately 80 percent of the genome shows some evidence of biochemical activity, including transcription, transcription factor binding, or chromatin modification.¹⁰ However, the interpretation of this finding has been debated: biochemical activity does not necessarily imply biological function, and evolutionary analyses suggest that the fraction of the genome under purifying selection — and therefore functionally important — is substantially smaller, perhaps 5 to 15 percent.^{10, 14} The tension between biochemical activity and evolutionary conservation remains an active area of research in genomics.

Evolutionary significance of the molecular framework

The discovery of DNA's structure and the deciphering of the genetic code transformed evolutionary biology from a discipline based primarily on comparative anatomy and the fossil record into one grounded in molecular mechanisms. The population genetics framework established by Fisher, Haldane, and Wright in the early twentieth century had shown mathematically how allele frequencies change under selection and drift, but the physical nature of alleles and mutations remained abstract until the molecular revolution of the 1950s and 1960s. With the Watson-Crick model, a mutation could be understood as a specific chemical change in a specific nucleotide at a specific position in the genome, and the phenotypic consequences of that change could be traced through transcription and translation to an alteration in protein structure and function.^{1, 5, 13}

The universality of the DNA-RNA-protein system across all life — the same four-letter nucleotide alphabet, the same genetic code with only minor variations, the same ribosomal mechanism of translation, the same 20 amino acids — provides evidence for common descent that is independent of, and complementary to, the anatomical, embryological, and palaeontological evidence that Darwin and his successors assembled. Every newly sequenced genome adds to this evidence, confirming that the molecular machinery of life was established once and inherited by all subsequent lineages. The molecular framework also makes precise, quantitative predictions: if two species share a common ancestor, their DNA sequences should be more similar than those of more distantly related species, and the pattern of similarities and differences should form a nested hierarchy consistent with a branching tree of descent. These predictions have been confirmed across every domain of life examined.^{11, 13}

The molecular understanding of DNA has also illuminated the mechanism by which endosymbiosis shaped eukaryotic evolution. Mitochondria and chloroplasts retain their own small circular DNA genomes, which bear unmistakable similarities to bacterial genomes in gene content, codon usage, and sequence — confirming that these organelles descend from free-living bacteria that were engulfed by ancestral eukaryotic cells. Over evolutionary time, most of the endosymbiont's genes were transferred to the host nuclear genome, a process recorded in the DNA sequences of modern organisms and traceable through comparative genomics.¹³ The story of DNA is, in this sense, not merely the story of a molecule but a window into the entire history of life on Earth.

References

1. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171: 737–738, 1953.