Information theory and evolution

Overview

The creationist claim that ‘evolution cannot create new genetic information’ rests on a fundamental equivocation: the word ‘information’ is used in a colloquial, undefined sense that has no basis in either Shannon information theory or Kolmogorov complexity. Both formal frameworks are fully compatible with increases in biological information through mutation and selection, and Shannon’s framework actively predicts them.
Multiple well-documented mechanisms increase genome size and functional content: gene duplication, polyploidy, horizontal gene transfer, de novo gene origination from noncoding sequences, and transposable element co-option all add new coding capacity that natural selection can then refine into functional genes.
Documented empirical cases of entirely new genes arising include nylonase enzymes in bacteria (appearing within decades of nylon’s invention), antifreeze proteins that evolved independently in Antarctic notothenioid fish and Arctic codfishes, and dozens of de novo protein-coding genes in the Drosophila and human lineages that originated from previously noncoding intergenic DNA.

Among the most persistent arguments advanced by creationist and intelligent design literature is the claim that evolutionary processes cannot generate new genetic information — and that, by implication, the functional complexity of living genomes requires a designing intelligence. The argument borrows its rhetorical weight from information theory, a rigorous mathematical discipline developed by Claude Shannon in the late 1940s, and from related concepts in algorithmic complexity. Yet the scientific consensus, grounded in decades of molecular biology, genomics, and population genetics, is that this claim misrepresents what information theory actually says, misidentifies what “information” means in a biological context, and overlooks well-documented mechanisms through which genomes routinely increase in functional content.^{3, 22}

The evidentiary record now encompasses dozens of documented cases of entirely new genes arising within observed or reconstructable evolutionary timescales: enzymes that digest a synthetic industrial compound that did not exist before 1938, antifreeze proteins that evolved independently in fish lineages separated by entire ocean basins, and scores of protein-coding genes in Drosophila and humans that demonstrably originated from sequences that were noncoding in ancestral species.^{9, 12, 16} Each represents a measurable increase in the functional information content of a genome. Understanding why the creationist argument fails requires examining what information theory actually asserts, what mechanisms produce new genes, and where the equivocation in the original claim lies.

What Shannon information actually says

Claude Shannon’s 1948 paper “A Mathematical Theory of Communication” established information theory as a formal discipline.¹ Shannon defined the information content of a message in terms of its entropy — the average unpredictability of a symbol in a message, measured in bits. A message in which every symbol is equally probable carries the maximum entropy and therefore the maximum Shannon information. A message that is entirely repetitive or predictable carries very low entropy and very little information. Crucially, in this framework, a completely random DNA sequence has higher Shannon information than a highly ordered, repetitive one — the reverse of what the creationist argument implies.^{1, 3}
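For reference, the standard entropy definition can be written out with its two limiting cases for a four-letter DNA alphabet. The symbols used here (H, p_i) are notation chosen for illustration rather than taken from the cited sources, and the per-symbol figure ignores correlations between neighboring positions.

```latex
H = -\sum_{i \in \{A, C, G, T\}} p_i \log_2 p_i
\quad \text{bits per symbol}

% All four bases equally likely (p_i = 1/4): maximum entropy
H_{\max} = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits}

% A single repeated base (p_A = 1, all others 0, taking 0 \log 0 = 0): minimum entropy
H_{\min} = -1 \cdot \log_2 1 = 0 \text{ bits}
```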

The immediate consequence is that point mutations, which introduce novel bases at positions that were previously fixed in a population, increase Shannon entropy at those positions across the population, thereby increasing Shannon information. Duplications, insertions, and transposable element insertions all increase the total length of the genome and — unless perfectly repetitive — increase total Shannon entropy as well.³ Shannon’s formalism therefore actively predicts that mutation increases information by the standard technical definition. When creationists assert that “mutations cannot create new information,” they are not using Shannon’s definition; they are using a colloquial, undefined substitute that is never rigorously specified in their literature.^{3, 22}
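The effect of a single point mutation on per-site entropy can be sketched in a few lines of Python. The eight-sequence population, the sequences themselves, and the function names are all hypothetical, chosen only to make the arithmetic visible. Entropy is estimated from observed base frequencies at each aligned position.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (bits) of the bases observed at one aligned site."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) summed over observed bases; a fixed site gives exactly 0.0
    return sum((n / total) * log2(total / n) for n in counts.values())


def per_site_entropies(population):
    """Entropy of every column in a list of equal-length sequences."""
    return [site_entropy(column) for column in zip(*population)]


# Hypothetical population: eight identical sequences, so every site is fixed.
before = ["ACGTAC"] * 8
# One individual acquires a point mutation (C -> G at index 1).
after = ["ACGTAC"] * 7 + ["AGGTAC"]

print(per_site_entropies(before))  # every site is fixed, so every entry is 0.0
print(per_site_entropies(after))   # only the mutated site is now above zero
```

In this toy population the mutated site moves from 0 bits to roughly 0.54 bits, while the untouched sites stay at 0. This is the sense in which a new variant raises population-level Shannon entropy.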

Christoph Adami, applying information theory to biological systems, has argued that the biologically relevant measure is not raw Shannon entropy but what he terms physical information: the degree to which a genome’s sequence is correlated with the environment that determines fitness.³ By this definition, natural selection is precisely the mechanism that increases genomic information, because selection differentially preserves genomes whose sequences are better correlated with survival and reproduction in a given environment. Computational experiments using digital organisms evolving under selection have confirmed that populations accumulate genomic information over time, with richer environments producing organisms with higher informational complexity.³ There is, in short, no theoretical barrier within Shannon’s framework to the generation of biological complexity through mutation and selection.
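One common way to make this measure concrete is to subtract the entropy actually observed across a population adapted to an environment from the maximum possible entropy. The version sketched here rests on the simplifying assumption that sites contribute independently. The symbols (I, L, H_i, p_i(b)) are illustrative notation rather than a quotation of Adami's formulas.

```latex
I \;=\; H_{\max} - H_{\text{obs}}
  \;\approx\; \sum_{i=1}^{L} \bigl( \log_2 4 - H_i \bigr),
\qquad
H_i = -\sum_{b \in \{A, C, G, T\}} p_i(b) \log_2 p_i(b)
```

Under this reading, a site where selection tolerates any base contributes 0 bits, and a site where selection permits only one base contributes the full 2 bits. Mutation supplies variation at each site; selection removes the variants that fit the environment poorly, and the resulting drop in H_i is what registers as information about the environment.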

Kolmogorov complexity and the limits of the argument

A second branch of information theory, independent of Shannon’s communication-theoretic approach, is built on Kolmogorov complexity, the central quantity of algorithmic information theory. Developed in the 1960s by Andrey Kolmogorov, Ray Solomonoff, and Gregory Chaitin, it defines the complexity of a string as the length of the shortest computer program that can produce that string.² A highly repetitive string has low Kolmogorov complexity because it can be described by a short program (“repeat XYZ one thousand times”). A random string has high Kolmogorov complexity because no program substantially shorter than the string itself will reproduce it.²

Creationist arguments sometimes invoke Kolmogorov complexity, or language loosely derived from it, to claim that natural processes cannot produce highly complex (low-compressibility) sequences. But this inverts the logic of the framework. Gene duplication, for instance, initially adds almost nothing to Kolmogorov complexity: the genome grows longer, but the added sequence is highly compressible (“copy gene A twice”). Subsequent divergence of the duplicate through mutation then increases Kolmogorov complexity, because the two copies can no longer be jointly described by a short program.^{2, 5} The interplay of duplication and divergence, the dominant mechanism of new gene creation, therefore produces increasingly complex genomic sequences over time by the Kolmogorov definition. Once again, the formal mathematical framework contradicts, rather than supports, the creationist conclusion.
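The duplication-then-divergence pattern can be illustrated with a general-purpose compressor. Kolmogorov complexity itself is uncomputable, so the sketch below uses zlib-compressed size as a rough upper-bound proxy; that substitution is an assumption of this example, not a method from the cited sources. The gene length, the 10 percent divergence, and all function names are likewise hypothetical.

```python
import random
import zlib

BASES = "ACGT"


def random_gene(length, rng):
    """Generate a random DNA string to stand in for an ancestral gene."""
    return "".join(rng.choice(BASES) for _ in range(length))


def diverge(seq, fraction, rng):
    """Return a copy of seq with the given fraction of positions substituted."""
    chars = list(seq)
    n_sites = int(len(chars) * fraction)
    for pos in rng.sample(range(len(chars)), n_sites):
        # Always substitute a different base so every chosen site changes.
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def compressed_size(seq):
    """Bytes after zlib compression: a proxy, not true Kolmogorov complexity."""
    return len(zlib.compress(seq.encode("ascii"), 9))


rng = random.Random(0)  # fixed seed so the run is repeatable
gene = random_gene(2000, rng)
duplicated = gene + gene                    # exact tandem copy
diverged = gene + diverge(gene, 0.10, rng)  # copy after 10% substitutions

size_gene = compressed_size(gene)
size_duplicated = compressed_size(duplicated)
size_diverged = compressed_size(diverged)

print("single gene:      ", size_gene)
print("exact duplicate:  ", size_duplicated)
print("diverged duplicate:", size_diverged)
```

The exact duplicate doubles the sequence length but compresses to only slightly more than the single gene, while the diverged pair compresses noticeably worse. That matches the qualitative pattern described above: the copy itself is nearly free to describe, and it is the later divergence that adds description length.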

Mechanisms that increase genetic information

Evolutionary biology identifies several well-characterized molecular mechanisms by which genomes accumulate new functional sequences. These mechanisms are not hypothetical; they are documented in thousands of comparative genomic studies and have been observed directly in laboratory and natural populations.^{4, 5, 7}

Gene duplication is the most thoroughly studied source of new genes. When a portion of a chromosome is copied — through unequal crossing over, retrotransposition, or segmental duplication — the organism carries two copies of the affected gene or genes. One copy continues to perform the ancestral function, freeing the duplicate from the constraints of purifying selection. Over time, the duplicate may accumulate mutations that confer a new function (neofunctionalization), subdivide the ancestral function between the two copies (subfunctionalization), or degrade into a nonfunctional pseudogene.^{4, 5} Susumu Ohno, in his landmark 1970 monograph Evolution by Gene Duplication, argued that duplication is the principal engine of evolutionary novelty, and subsequent comparative genomics has confirmed this view across all domains of life.⁴ The vertebrate globin gene superfamily, the Hox gene clusters, and the immunoglobulin superfamily all trace to serial duplication events followed by functional divergence.¹⁹

Polyploidy, the duplication of the entire genome, is an extreme form of gene duplication that simultaneously doubles every gene and regulatory element in an organism. Polyploidy is especially prevalent in plant evolution: estimates of the share of flowering plant species with polyploid ancestry range from 15 to 70 percent, and many major crop species, including wheat, cotton, and coffee, are polyploids.⁶ Two rounds of whole-genome duplication early in vertebrate evolution produced the raw material for the four Hox gene clusters that pattern the vertebrate body axis, compared with the single cluster found in most invertebrates.²⁰ Each round of polyploidy doubles the amount of sequence in the genome. The new copies are redundant at first, but they provide an enormous substrate for subsequent specialization and innovation.⁶

Horizontal gene transfer (HGT) — the movement of genetic material between organisms outside of normal vertical inheritance from parent to offspring — is another route by which genomes acquire entirely new sequences. In prokaryotes, HGT via conjugation, transformation, and transduction is a dominant force of evolution, with comparative analyses indicating that the majority of genes in any sequenced bacterial genome have been laterally transferred at some point in their evolutionary history.⁷ A pathogenic bacterium that acquires an antibiotic resistance gene, a toxin gene, or an entire metabolic pathway from a distantly related donor has unambiguously gained new functional information that was not present in its lineage before the transfer event.⁷

Transposable elements (TEs), sometimes called jumping genes, are mobile DNA sequences that replicate and insert themselves throughout the genome. Initially characterized as selfish genetic elements that spread at the expense of their hosts, TEs have been increasingly recognized as sources of genomic novelty.^{8, 21} TE insertions can create new regulatory sequences, new exons, new promoters, and occasionally entirely new protein-coding genes. In mammals, sequences derived from TEs make up roughly half of the genome, and many have been co-opted into functional roles in gene regulation, chromosome biology, and even immunity.²¹ The RAG recombinase that drives the diversity of the vertebrate adaptive immune system was itself domesticated from a transposable element of the Transib superfamily, representing perhaps the most consequential example of TE co-option in animal evolution.²¹

De novo gene origination — the emergence of a new protein-coding gene from previously noncoding intergenic or intronic DNA — was long considered too improbable to be a significant evolutionary force. Comparative genomic studies over the past two decades have overturned this view. The recognition that the genome contains vast quantities of pervasively transcribed noncoding RNA, combined with systematic identification of open reading frames that are protein-coding in one species but noncoding in closely related outgroups, has revealed de novo gene birth as a reproducible, documented process.¹⁶

Documented cases of new functional genes

Among the most compelling evidence against the “no new information” claim are empirical cases in which the evolutionary origin of a functional gene can be traced with molecular precision. These cases span timescales from decades to tens of millions of years and involve organisms ranging from bacteria to vertebrates.

Scanning electron micrograph of Pseudomonas aeruginosa, the bacterium used in experimental evolution studies demonstrating the emergence of new nylon-degrading enzyme activity, a documented case of new functional genetic information arising through mutation. Image: Janice Haney Carr / CDC, Wikimedia Commons, public domain.

The nylonase case is perhaps the most striking because the selective agent — nylon-6 and its breakdown products — is a synthetic industrial compound that did not exist on Earth before its invention in 1938. In 1975, Japanese researchers discovered a strain of Flavobacterium living in wastewater ponds adjacent to a nylon factory that could metabolize 6-aminohexanoic acid linear oligomers, byproducts of nylon-6 manufacture.¹⁰ Three novel enzymes responsible for this activity were characterized; none bore significant sequence similarity to any previously known enzyme. Subsequent molecular analysis by Negoro and colleagues demonstrated that at least one of the nylonase enzymes arose through a frameshift mutation in a region of DNA that was previously noncoding, generating a new open reading frame that happened to encode a functional protein.¹¹ Experimental evolution studies later showed that analogous nylon-degrading activity could be induced in laboratory populations of Pseudomonas aeruginosa within a tractable number of generations.⁹ Because the substrate is synthetic, the enzymes that degrade it represent new biological functions that arose after the substrate was introduced — new information by any reasonable definition of the term.^{10, 11}

The evolution of antifreeze proteins in polar fish provides a case of new gene origination that has been reconstructed in molecular detail and independently verified in two separate lineages. Antarctic notothenioid fish produce antifreeze glycoproteins (AFGPs) that prevent ice crystal growth in their blood and tissues — a critical adaptation to the −1.9°C waters of the Southern Ocean. Chi-Hing Christina Cheng and colleagues demonstrated in 1997 that the notothenioid AFGP gene evolved from a copy of a pancreatic trypsinogen gene: a nine-nucleotide element encoding a Thr-Ala-Ala tripeptide repeat within the trypsinogen ancestor was amplified to produce the repetitive AFGP coding sequence, with loss of most of the original trypsinogen coding sequence except the signal peptide and the propeptide regions.¹² The ancestral trypsinogen gene retained its original function; the duplicate gave rise to a structurally and functionally novel protein that did not previously exist.

Strikingly, AFGP genes in Arctic codfishes evolved independently and by a different molecular mechanism. Baalsrud and colleagues showed in 2018 using whole-genome sequence data that codfish AFGPs arose from a noncoding genomic region with no homology to the notothenioid trypsinogen ancestor, with the repetitive Thr-Ala-Ala coding unit supplied by a different ancestral nine-nucleotide element.¹³ Two entirely new genes, each encoding a protein of the same functional class but arising from different ancestral sequences in different lineages separated by a complete ocean — this is convergent molecular innovation documented at the sequence level, and constitutes a direct refutation of the claim that natural processes cannot generate genuinely new protein-coding information.^{12, 13, 14}

De novo gene origination in Drosophila has been documented in a series of studies that phylogenetically bracket the origin of new protein-coding genes. David Begun and colleagues identified several genes in Drosophila melanogaster that are absent from closely related species, translated into protein as confirmed by ribosome profiling, and demonstrably originate from sequences that are noncoding in the outgroup species.¹⁵ A comprehensive survey by Zhao and colleagues in 2014 identified 248 candidate de novo genes in D. melanogaster, of which dozens show strong evidence of functional constraint consistent with ongoing purifying selection — indicating that they have been integrated into the organism’s biology rather than merely tolerated as genomic noise.²³ The jingwei gene, first characterized by Manyuan Long and Charles Langley, represents an earlier documented case: a chimeric gene in Drosophila teissieri and D. yakuba formed by retrotransposition of the Adh (alcohol dehydrogenase) transcript into an unrelated pre-existing gene, creating a novel hybrid gene with a new expression pattern.¹⁷

In the human lineage, Vakirlis and colleagues reported in 2022 a systematic genomic survey identifying thousands of candidate de novo genes that originated from noncoding sequences after the divergence of humans from other primates, with a subset showing population-level variation consistent with ongoing functional evolution.¹⁸ These de novo genes contribute to human-specific traits in transcription, cellular processes, and development, and some have been linked to complex disease phenotypes — demonstrating that newly originated genes are not merely evolutionary curiosities but functional participants in contemporary human biology.¹⁸

The equivocation at the heart of the argument

The scientific failure of the “evolution cannot create new information” argument is not primarily empirical but logical. The argument relies on a systematic equivocation — the use of the word “information” with two incompatible meanings that are quietly substituted for one another at key steps in the reasoning.^{3, 22}

In the technical sense used by information theorists, “information” refers to Shannon entropy, Kolmogorov complexity, or Adami’s mutual information between genome and environment. As detailed above, none of these measures supports the conclusion that evolutionary processes cannot increase biological information. Shannon entropy increases with mutation and sequence diversification. Kolmogorov complexity increases as duplicated sequences diverge from one another. Adami’s environmental mutual information increases under natural selection, because selection preferentially retains sequences that are correlated with the environment.^{1, 2, 3} The technical definitions actively predict information increase through evolutionary mechanisms.

In the colloquial sense used in creationist literature, “information” typically means something like “specified functional complexity” or “meaning that requires a mind to generate.” This colloquial sense is never formally defined, is not derived from any published mathematical framework, and functions primarily as a rhetorical device: it imports the technical credibility of information theory while substituting a concept that has been pre-loaded with the desired conclusion.^{3, 22} When the argument reaches its conclusion — that natural processes cannot create new “information” — it is this undefined colloquial sense that is meant, not the Shannon or Kolmogorov sense that was invoked to establish the scientific-sounding premise. The National Academy of Sciences noted in its 2008 report that intelligent design arguments about information do not constitute testable scientific claims and have not been supported by any peer-reviewed experimental evidence.²²

A related equivocation concerns the distinction between a sequence being novel and a sequence being functional. New mutations generate sequences not previously present in a genome; gene duplication creates new sequence capacity; horizontal gene transfer introduces sequences from entirely different organisms. Whether any of these newly present sequences acquire or already possess biological function is a separate question answered by natural selection: functional sequences that improve fitness are retained and refined; nonfunctional sequences are either eliminated or drift to fixation as neutral variants. The creationist argument collapses the two questions, implying that any increase in sequence novelty must simultaneously produce functional complexity. But this is not what information theory says, and it is not how evolutionary biology works.^{3, 5, 16}

Genome size, content, and information accumulation

A broader perspective on the information content of genomes comes from comparing genome sizes and gene counts across the tree of life. Genome size (the total amount of DNA in a haploid nucleus, called the C-value) varies by nearly five orders of magnitude across eukaryotes, from about 2.3 megabases in the microsporidian Encephalitozoon intestinalis to more than 130 gigabases in some salamanders and flowering plants.²⁰ This variation does not track organismal complexity in a simple way, a paradox known as the C-value enigma, but it does demonstrate that genome size is dynamic and that mechanisms increasing genome size are real and pervasive.²⁴

Gene count, a more direct proxy for functional information content, also varies substantially and tracks some aspects of biological complexity, though only loosely. Vertebrate genomes typically contain 20,000 to 25,000 protein-coding genes, compared with roughly 6,000 in budding yeast and around 19,000 in Caenorhabditis elegans. The vertebrate gene complement expanded substantially through the two rounds of whole-genome duplication early in vertebrate history, through subsequent lineage-specific gene duplications, and through the de novo origination of new genes in each lineage.²⁰ Regulatory complexity, encoded in the noncoding portions of the genome, has expanded in parallel: the human genome encodes far more regulatory sequences, enhancers, and noncoding RNAs than the genomes of simpler organisms, representing a vast increase in the total functional information content of the genome relative to ancestral states.²¹

The genomic record therefore reflects exactly what evolutionary mechanisms predict: a historical accumulation of new genes, new regulatory sequences, and new functional elements through duplication, transfer, transposable element co-option, and de novo origination — all mediated by mutation and refined by natural selection. The picture is one of continuous, well-documented informational enrichment of biological genomes over evolutionary time, not the stasis that the “no new information” argument implies.^{4, 8, 21}

Scientific consensus and the status of the claim

The claim that evolution cannot create new genetic information is not a live controversy within the scientific community. Molecular biologists, geneticists, evolutionary biologists, and information theorists who have examined the argument have uniformly found it to misrepresent the relevant mathematics and to be contradicted by the empirical evidence.^{3, 22} The National Academy of Sciences has explicitly addressed the argument, noting that it does not constitute a testable scientific hypothesis and has not been supported by peer-reviewed research.²²

The argument persists for reasons that are rhetorical rather than scientific. The vocabulary of information theory carries a patina of mathematical rigor that is persuasive to non-specialists, and the intuition that “complexity requires a designer” is deeply rooted in everyday human cognition. But intuition is not evidence, and the persuasiveness of an argument is not a measure of its validity. The empirical record constitutes direct, positive evidence that mutations, natural selection, and the associated mechanisms of genetic change produce new biological functions: nylonase enzymes that digest a material invented within living memory, antifreeze proteins that arose independently in two fish lineages from different ancestral sequences, and hundreds of de novo genes in Drosophila and humans whose counterparts in outgroup species are noncoding.^{9, 12, 16, 23}

The information-theoretic argument against evolution, rigorously examined, reduces to a definitional maneuver: define “information” so that the conclusion is built into the premise, then present the conclusion as if it followed from a mathematical theorem. Information theory, properly understood, does not support this maneuver. Evolution, as documented in the genomes of living organisms and reconstructed through the methods of comparative genomics and molecular phylogenetics, routinely and verifiably produces new genetic information in both the technical and the functional sense of the term.^{1, 3, 16}

References

A mathematical theory of communication

Shannon, C. E. · Bell System Technical Journal 27: 379–423, 1948