One day in 2006, while a postdoc in the Rockefeller University laboratory of Nathaniel Heintz, I had an unexpected eye-opener. Heintz showed me some electron microscopy images of Purkinje neuron nuclei in the murine cerebellum. They stunned me—the heterochromatin localization in the nucleus was different from anything I’d ever seen before. Rather than the dispersed, irregular patches with enrichment near the nuclear membrane typical of many cells, nearly all the heterochromatin was in the center of the nucleus, adhered to the single large nucleolus. Not only did heterochromatin organization look different, the volume of it in Purkinje neurons seemed much lower, too. Because links between DNA methylation and heterochromatin proteins were suggested in the literature, we thought that DNA methylation might be depleted in Purkinje neurons.

After nearly a year of work, I was able to isolate enough Purkinje nuclei to start quantifying DNA methylation using thin-layer chromatography. This technique usually yields a single spot for each base in the DNA that has a neighboring G. Normally, five intense spots representing the bases A, G, T, C, and methylated C (known as 5-methylcytosine, or 5mC) migrate to expected locations. In our experiments with Purkinje neuron DNA, however, we consistently noticed the presence of a sixth spot that had not been previously described. Could the spot represent a novel DNA base variant, which had gone unrecognized due to the low abundance of Purkinje neurons in the brain? (In the cerebellum, they constitute just 0.3 percent of all cells.) After several long months of research, we identified the suspect: a cytosine base that had not only a methyl group added, but also a hydroxyl. We termed this new mark 5-hydroxymethylcytosine (5hmC).1

The diversity of all life on Earth is largely encoded by a relatively simple alphabet: the standard set of four DNA bases, A, G, C, and T. But in many organisms, this alphabet can be expanded by modifications to these bases. Bacteriophages are known to incorporate modified bases during DNA replication, for example. More commonly, organisms make modifications to the DNA bases after replication by adding chemical extensions to nucleic acids. Some postreplication modifications can be stably propagated during cell division, thus adding another layer of information onto DNA, a phenomenon that serves as the founding and principal example of the field of epigenetics. While extending the DNA alphabet typically does not affect the encoding of proteins, it can influence the expression or maintenance of phenotypic traits, and thus play a role in organisms’ survival and evolution.

The existence of modified bases varies throughout the tree of life, and the distribution of these variant bases may shed light on how and why these modifications evolved. Some organisms, such as yeasts and members of the order Diptera (flies, mosquitos), contain no modified bases, while others, including viruses, bacteria, plants, some fungi, nematodes, ants, honeybees, and all examined vertebrates, have modified DNA. Modifications reside typically, but not exclusively, on DNA bases. The most common way of modifying bases is the addition of a methyl mark, and across species, methylation has been found on cytosines and adenines, resulting in 5mC, N4-methylcytosine (N4mC), or 6-methyladenine (6mA). N4mC is present in bacteria, while 6mA, also once thought to be exclusively prokaryotic, was recently reported in the DNA of metazoan species, where its function still remains elusive.

In vertebrate genomes, 5mC is the most common modified base, found predominantly on cytosines that are followed by guanines (the so-called CpG context), with 70 percent to 80 percent of all CpGs in the genome containing such methylation. This epigenetic mark has been investigated for nearly 60 years, but in 2009, our work—and work done simultaneously by Anjana Rao’s group, then at Harvard Medical School2—revealed the existence of 5hmC. The abundance of 5hmC is quite variable, ranging from less than 1 percent of 5mC in some cancer cell lines to nearly 30 percent of 5mC in Purkinje neurons. Research is now underway to understand how this DNA modification is regulated and how it differs functionally from 5mC.

Just two years after the discovery of 5hmC, Yi Zhang, then at the University of North Carolina at Chapel Hill, and Thomas Carell of Ludwig-Maximilians-Universität in Germany identified two other types of marks that can be added to cytosine, resulting in 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC).3,4 These modifications are even rarer than 5hmC, occurring at levels nearly two orders of magnitude lower. But their discovery, along with continued research into 5mC and 5hmC, has scientists rethinking the prevalence and functions of cytosine modifications, and how they alter the basic function of DNA.

Arms race

It is widely accepted that one of the main purposes of modified DNA bases in bacteria is to defend the genome against bacteriophages. The defense strategy is based on the activity of two bacterial enzymes—a restriction enzyme that cuts the DNA at defined sequences and a second enzyme that modifies the DNA in that same sequence context to protect it from the cutter enzyme. When the genes for both of these enzymes are present in a bacterium—often found in close proximity in the genome, in what’s referred to as a restriction-modification (R-M) operon—the two gene products cause no harm to its genome, as the modifying enzyme provides the necessary shield before the cutter can do its work. But when a bacteriophage injects unmodified DNA, it is cleaved by restriction enzymes, disabling viral replication.
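The logic of a restriction-modification pair can be sketched as a toy model. The Python below is only an illustration: the recognition sequence GAATTC, the function names, the cut position, and the idea of representing methylation as a set of protected site positions are all assumptions made for the example rather than details of any particular bacterial system.

```python
SITE = "GAATTC"  # illustrative recognition sequence (an assumption for this sketch)

def find_sites(dna, site=SITE):
    """Return the start index of every occurrence of the recognition site."""
    return [i for i in range(len(dna) - len(site) + 1) if dna.startswith(site, i)]

def modify(dna, site=SITE):
    """Modification enzyme: mark every recognition site in the DNA as protected."""
    return set(find_sites(dna, site))

def restrict(dna, protected=frozenset(), site=SITE):
    """Restriction enzyme: cut at every unprotected site and return the fragments."""
    fragments, start = [], 0
    for i in find_sites(dna, site):
        if i not in protected:
            cut = i + 1  # position of the cut inside the site is arbitrary here
            fragments.append(dna[start:cut])
            start = cut
    fragments.append(dna[start:])
    return fragments

host = "ATGAATTCGGCCGAATTCTA"   # hypothetical host sequence with two sites
phage = "TTGAATTCAA"            # hypothetical unmodified phage sequence

# The host genome is modified before the cutter acts, so it stays in one piece.
host_fragments = restrict(host, protected=modify(host))
# The incoming phage DNA carries no modification and is cleaved.
phage_fragments = restrict(phage)
```

In this sketch the host genome survives intact only because `modify` has shielded both sites, while the same enzyme cleaves the unprotected phage DNA.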

Viruses have evolved counter-defense systems of their own. One simple way a bacteriophage evolves to avoid DNA-cutting activity is by elimination of the restriction enzyme recognition sites from its genome. In response, bacteria have evolved R-M operons that target different DNA sequences. Alternatively, some phages have evolved to protect their DNA using base modifications. In response to this tactic, bacteria have evolved enzymes that recognize and cut the modified foreign DNA, simultaneously losing their own modifications at matching sites in the genome. The consequence of this evolutionary arms race is an extensive list of R-M enzymes in bacteria with different DNA sequence preferences and a panoply of DNA modifications in both bacteria and bacteriophages.

One of the most common base modifications is methylation. The methyl group is small in size and the most neutral modification in terms of reactivity, bond participation, and influence on electron configuration of the base to which it binds. This means that methylation is able to protect bacteria against DNA cutting by restriction enzymes while having minimal consequences for the main functions of the DNA, such as transcription, replication, and mutability.

Bacteriophages also have a variety of methylation modifications, but compared with bacteria, they possess a much more extensive DNA modification profile, with around 20 known base modifications. These include less common marks such as glucosylated 5hmC, 5-dihydroxypentyluracil, and hexosylated 5-hydroxycytosine. Rather than modifying DNA after synthesis, bacteriophages often produce enzymes capable of modifying the building blocks of DNA (nucleotide triphosphates), which are then incorporated randomly into the DNA during replication. Although it’s not known why viruses have a more diverse repertoire of DNA modifications than bacteria, it may be because bacteria are more complex and may suffer adverse consequences from more-elaborate modifications, such as an increased mutation rate during replication or affinity changes for DNA-binding proteins.

What can we learn about the evolution of DNA modifications in higher organisms from these bacterial and viral systems? In mammals, there is no known antiviral defense mechanism comparable to the bacterial R-M system, but intracellular strategies for combating viruses do exist. Instead of digesting foreign nucleic acids, mammalian cells have enzymes capable of mutating them. Cytosine deaminases convert cytosine to uracil (the RNA base that corresponds to thymine), eventually leading to C-to-T mutations in the viral genome. The best example of such antiviral restriction is the deaminase APOBEC3G, which is capable of inhibiting HIV infection. However, HIV evolved a protein called Vif capable of degrading the deaminase, thus maintaining viral infectivity.5

It is unlikely that modified bases in mammals provide substantial viral defenses in a way that is analogous to the bacterial R-M system. For one, modified bases are rare in the human genome, with just 4 percent of all cytosines being modified. Moreover, there is little overlap between methylated DNA sequences and the target sequences of some deaminases. Rather, substrate selectivity for single-stranded DNA and the fact that deaminases are usually restricted to the cytoplasm are the most likely mechanisms of preventing adverse effects of deaminases on the host DNA, which is securely locked away in the nucleus. The protection is clearly not flawless, however, as sites targeted by known deaminases are frequently mutated in cancer, suggesting that in some circumstances the enzyme can gain access to DNA and damage the genome.

Rather than participating in direct destruction of foreign DNA, DNA methylation in mammals is involved in suppressing the activity of viruses and parasites that have invaded our genomes, which are littered with remnants of these pathogens. If unleashed, such incorporated sequences could be detrimental to genome stability, but methylation is one of the mechanisms that prevents such activity.

Brakes on or off

What, then, is the function of epigenetic modifications in the genomes of eukaryotes? One hypothesis is that modified bases play a role in gene regulation. The presence of 5mC modification in promoters strongly correlates with a lack of expression of those genes. During embryonic development, for example, DNA methylation is often associated with the silencing of a gene, such as during X chromosome inactivation in females. Another group of genes regulated by DNA methylation consists of imprinted genes whose expression is dependent on the parent of origin. These genes contain differentially methylated regions, which promote allele-specific expression.

DNA methylation may also regulate gene expression in a more dynamic way, possibly with environmental factors influencing the addition or removal of methyl marks to control gene activity in response to external conditions. In these cases, however, it is not known whether DNA methylation actually regulates expression. Often there is just correlation between DNA methylation and expression, which does not prove causality.

In terms of exactly how DNA methylation can prevent transcription initiation, two main mechanisms of gene silencing have been proposed: the methyl group could occlude binding of transcription activators, or it could attract transcriptional repressors. Some transcriptional repressors are known to bind 5mC and often act on genes by recruiting histone deacetylases, resulting in a chromatin state that is less compatible with transcription.

Employing DNA modifications for transcription regulation does not come “free of charge,” however. The hefty price of having 5mC in the DNA is elevated mutability, with the cytosine spontaneously mutating to thymine. Because 5mC is predominantly found in CpG dinucleotides, this has resulted in the depletion of CpGs across the methylated parts of vertebrate genomes. Thus, instead of one CpG every 16 dinucleotides (which one would expect given randomness), methylated regions in typical vertebrate genomes contain just one CpG per 100 bp (with the exception of “CpG islands,” where one CpG is observed every 30 bp). CpG dinucleotides are present in four out of six codons coding for arginine, resulting in an enrichment of mutations affecting this particular amino acid in proteins.
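The arithmetic behind these figures can be checked with a short sketch. The Python below is illustrative only: the function and variable names are assumptions, and the observed-over-expected ratio uses a commonly applied normalization (CpG count × length / (C count × G count)) that is not taken from this article.

```python
BASES = "ACGT"

# With four equally likely, independent bases there are 4 * 4 = 16 possible
# dinucleotides, so CG is expected at one in every 16 dinucleotide positions.
expected_cpg_fraction = 1 / len(BASES) ** 2

# One CpG per 100 bp instead of one per 16: roughly a sixfold depletion.
depletion_fold = 100 / 16

def cpg_observed_over_expected(seq):
    """Observed CpG count divided by the count expected from the
    sequence's own C and G content (assumed normalization)."""
    seq = seq.upper()
    observed = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    expected = seq.count("C") * seq.count("G") / len(seq)
    return observed / expected if expected else 0.0

# The six arginine codons of the standard genetic code, in DNA letters.
ARGININE_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]
arg_codons_with_cpg = [codon for codon in ARGININE_CODONS if "CG" in codon]
```

Here `arg_codons_with_cpg` holds the four CGN codons, which is why CpG mutability weighs so heavily on arginine.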

Cytosine is the most commonly altered base, with methylation being the most common addition. In vertebrates, this modified base, called 5-methylcytosine (5mC), is found primarily in the CpG context, on cytosines followed by guanines. Recent research has revealed that this base can be further modified into a number of variants, including 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC), though these modifications are generally rare. Researchers are still hunting for the functions of such DNA bases, but evidence points to their roles in gene regulation and DNA integrity, affecting learning and memory.


In addition to the footprint of DNA methylation on vertebrate genomes, researchers have identified frequent C-to-T mutations at methylated sites in genetic diseases and cancer. Last year, for example, my colleagues and I discovered that mutations at methylated CpGs are observed nearly twice as frequently as at nonmethylated ones in most cancers.6 Interestingly, mutation frequency at 5hmC-containing sites is nearly twofold lower than at 5mC sites, making mutability of 5hmC equivalent to that of unmodified cytosine.

The role of 5hmC in gene regulation appears to be opposite to that of 5mC, as deduced from its location in actively transcribed regions. Several proteins of the MBD family of transcriptional repressors (e.g., Mbd1 and Mbd2) are unable to bind to 5hmC-decorated DNA, providing a possible mechanism for facilitating chromatin structure compatible with expression. But this remains an area of active investigation. Additional antisilencing mechanisms may involve 5hmC’s ability to attract specific binding proteins.

Beyond its transcriptional effects, 5hmC was demonstrated to act as an intermediate for demethylation. During demethylation, enzymes known as TETs further oxidize 5hmC to 5fC and 5caC, which are subsequently removed by base excision–repair primarily triggered by thymine DNA glycosylase (TDG). (See illustration here.) Demethylation can happen by a different route as well; replication of 5hmC-containing DNA results in this modification on one strand of the daughter DNA molecule. This asymmetric 5hmC site turns out to be a poor substrate for DNA methyltransferase 1, leading to the generation of unmodified DNA during subsequent rounds of replication.

Small mark, many jobs

In bacteria, modified bases influence DNA damage as well, but instead of increasing mutation rates, bacteria use such DNA modifications to enhance DNA repair. Adenine N6-methylation, for example, has been shown to direct mismatch repair after replication. The DNA methyltransferase Dam methylates adenine bases at palindromic GATC sequences, resulting in the symmetric modification on both strands of DNA. The key utility of methylation here is the ability to make parental and daughter DNA strands distinguishable after DNA replication, as only parental strands will have the modification before the symmetrical state is re-established. During base mismatch repair, MutH endonuclease confers strand specificity by cutting the unmethylated strand, which initiates repair using the parental (methylated) DNA strand as a template.
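The strand-discrimination logic described above can be sketched in a few lines. This is a toy model under stated assumptions: the function names and the use of booleans to stand for the methylation state of each strand at a GATC site are hypothetical, and the code does not attempt to model the real enzymes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def gatc_positions(strand):
    """Start indices of every GATC site on a strand written 5'->3'."""
    return [i for i in range(len(strand) - 3) if strand[i:i + 4] == "GATC"]

def strand_to_cut(top_methylated, bottom_methylated):
    """Which strand a MutH-like nuclease would nick at one GATC site.

    Only a hemimethylated site carries strand information: the unmethylated
    strand is taken to be the newly made daughter strand. Returns None when
    both or neither strand is methylated.
    """
    if top_methylated and not bottom_methylated:
        return "bottom"
    if bottom_methylated and not top_methylated:
        return "top"
    return None

# GATC is palindromic, so the same site reads identically on both strands,
# which is what allows symmetric methylation once Dam has acted on both.
site_is_palindromic = reverse_complement("GATC") == "GATC"
```

For example, `strand_to_cut(True, False)` returns `"bottom"`, marking the unmethylated strand for repair, while a fully methylated site returns `None` because the two strands can no longer be told apart.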

Whether DNA modifications play a role in mismatch repair in eukaryotes is less clear. Despite the fact that the majority of DNA methylation in replicating cells is found in the symmetrical CpG sequence and could indicate parental origin of newly replicated DNA, strong evidence to support the idea that DNA methylation guides mismatch repair is lacking. Some reports were able to observe methylation-guided repair in mammals, but others were not.7

Methylation also appears to play a role in DNA replication in bacteria. Once again, the mechanism is based on the appearance of asymmetrically modified DNA (in this case, Dam-deposited adenine methylation at the origin of replication after DNA synthesis), with the parental strand containing the modification while the daughter strand does not. This asymmetrical methylation is recognized and bound by the SeqA protein, which suppresses reinitiation at the replication origin before one round of replication is finished. This provides a time window for the complete replication of bacterial chromosomes once per cell cycle, until Dam outcompetes SeqA to re-establish symmetrical modification, which re-enables the replication origin for the subsequent division.

In contrast to bacteria, the majority of eukaryotic species do not have clearly definable or strictly sequence-dependent replication origins. Instead, replication initiates at regions coinciding with a number of features such as promoters, DNase I accessible regions, and CpG islands. Methylated CpG islands replicate later than unmethylated ones, suggesting that DNA modifications could have a function in replication, though the significance of this is still unclear. And the fact that mouse embryonic stem cells do not display major replication defects after genetic elimination of all DNA methyltransferases argues against a major role of DNA modifications in replication.

| DNA modification | Found in which species/type of organism | Found in what genomic context/cell type | Frequency in human or mouse genome | Molecular roles |
|---|---|---|---|---|
| 5-methylcytosine (5mC) | Ubiquitous, some exceptions | Primarily CG but also found in other contexts, ubiquitous | 2 percent to 4 percent of C | Represses gene expression |
| 5-hydroxymethylcytosine (5hmC) | Vertebrates, some fungi, protozoans | Primarily CG, enriched in brain and other differentiated tissues | 0.1 percent to 0.8 percent of C | Intermediate for demethylation, other roles debated |
| 5-formylcytosine (5fC) | Vertebrates, some fungi, protozoans | Primarily CG, enriched in mouse embryonic stem cells | <0.002 percent of C | Intermediate for demethylation, other roles debated |
| 5-carboxylcytosine (5caC) | Vertebrates, some fungi, protozoans | Primarily CG, enriched in mouse embryonic stem cells | <0.0003 percent of C | Intermediate for demethylation, other roles debated |

Touch of the mind

Observations in human cells and in mice suggest that modified DNA bases may be more important to the normal function of the nervous system than of any other tissue. A number of intriguing publications have documented that neuronal cells have unusual profiles of DNA modifications. For starters, 5hmC is nearly threefold more abundant in the brain than in any other organ. The extreme example is in Purkinje neurons, where nearly a third of modified cytosines are in the 5hmC state, which is tenfold higher than in any non-neuronal cell type. Moreover, neuronal cells have the most abundant non-CpG methylation, which is close to the level of methylated Cs in the CpG context. Is it possible that evolution invented yet another function for DNA modifications in neuronal cells? Perhaps the best starting point would be to think about how unusual a neuronal cell is, compared with all the other cell types in a multicellular organism.

Neuronal cells connect in networks, enabling learning and memory. The stability as well as plasticity of neural networks is therefore critical for behavior, and the longevity of some neuronal cells (e.g., those involved in the coordination of movement) could therefore be under strong selection. Neuronal cells are also metabolically active and quite large—human motor neurons of the spinal cord have axons that extend to 1 meter in length, and the majority of neurons have other neuronal projections that measure on millimeter and centimeter scales. Combining enhanced metabolism with longevity is not easy, as oxidative phosphorylation in the mitochondria can generate DNA-damaging reactive oxygen species. Thus, the unusual DNA modification landscape of neurons may favor chromatin with elevated resilience to mutations. Alternatively, the cells’ high metabolism and associated requirement for enhanced gene expression, without any need to replicate DNA (differentiated neurons do not divide), may have selected for the use of DNA modifications for more efficient transcription.

A third possibility is that neurons benefit from a more-accurate regulation of transcription, enabling “transcriptional memory.” A number of reports indicate that, in addition to the synaptic mechanism of memory, transcription plays important roles in an organism’s ability to consolidate and store memories. In animal models, deletion or overexpression of DNA methyltransferases (DNMTs) and TET oxygenases in post-mitotic neurons results in defects in neural plasticity and memory consolidation. In addition, neuronal stimulation induces changes in DNA modifications. These results indicate that DNA modifications regulate the expression of some genes in neuronal cells that are critically important for normal nervous system function.8

When all goes wrong

Defects causing stark disruption of DNA modification dynamics lead to extreme phenotypes. In mice, deletion of DNA methyltransferases Dnmt1 or Dnmt3b, or of all three TET family dioxygenases, results in lethal developmental defects. In humans, mutations in DNA modification–related proteins are also known to cause disease. Germline mutation of DNMT3B, for example, causes immunodeficiency–centromeric instability–facial anomalies (ICF) syndrome, a rare genetic disorder characterized by immunodeficiency and facial deformities. Mutations in the 5mC-binding protein MECP2, on the other hand, cause a neurological disorder known as Rett syndrome, which presents as numerous verbal and physical disabilities. Somatic TET2 and DNMT3A mutations are observed in a number of blood cancers, including acute myelogenous leukemia (AML) and chronic myelomonocytic leukemia (CMML).

Altogether, these loss-of-function observations do not demonstrate a particular trend that could link one phenotypic trait to DNA modifications. Instead, they reflect the idea that DNA modifications add to the toolkit of critical gene-regulatory mechanisms. This is well supported by numerous studies demonstrating the importance of DNA methylation in a wide variety of processes, ranging from the activation of T cells in the immune system to memory formation in the brain.

It is thus clear that DNA modifications are key to proper development and function of those organisms in which they exist. These epigenetic factors offer additional options for genome management. In bacteria, DNA modifications are a critical part of immune defense. In mammals, modifications play a key role in gene regulation. Finally, there is some evidence to suggest that DNA modifications affect the mutability of DNA, as well as its repair in certain species.

The diversity of cellular functions relating to DNA modifications is perhaps not surprising, considering that modified bases have a broad genomic presence across various genes. Such an expanded alphabet has presumably undergone positive selection to drive the evolution of organisms to survive and pass on their genomes through the millennia. 

Skirmantas Kriaucionis is an associate professor at the Ludwig Institute for Cancer Research, University of Oxford, U.K.


  1. S. Kriaucionis, N. Heintz, “The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain,” Science, 324:929-30, 2009.
  2. M. Tahiliani et al., “Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1,” Science, 324:930-35, 2009.
  3. S. Ito et al., “Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine,” Science, 333:1300-03, 2011.
  4. T. Pfaffeneder et al., “The discovery of 5-formylcytosine in embryonic stem cell DNA,” Angew Chem Int Ed, 50:7008-12, 2011.
  5. A.M. Sheehy et al., “Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein,” Nature, 418:646-50, 2002.
  6. M. Tomkova et al., “5-hydroxymethylcytosine marks regions with reduced mutation frequency in human DNA,” eLife, 5:e17082, 2016.
  7. R.R. Iyer et al., “DNA mismatch repair: Functions and mechanisms,” Chem Rev, 106:302-23, 2006.
  8. J. Shin et al., “DNA modifications in the mammalian brain,” Philos Trans R Soc Lond B Biol Sci, 369:20130512, 2014.