To encode text alone, one approach is to convert each letter of the alphabet into a three-letter code. Using three bases, such as A, C, and T, gives 3 × 3 × 3 = 27 combinations, enough for the English alphabet plus a space, with a code such as AAA = A, AAC = B, and so on (1 in graphic below). However, researchers often want to encode more than just text, so most current methods instead first translate data into binary code, the language of 1s and 0s used in electronic media. Because DNA has four bases, each nucleotide could theoretically store up to two bits of information, with a code such as A = 00, C = 01, and so on (2).
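As a rough illustration, the two schemes above can be sketched in a few lines of Python. The text fixes only AAA = A, AAC = B, A = 00, and C = 01. The rest of each table (alphabetical assignment of codewords, the last codeword standing for the space, G = 10, T = 11) is an assumption made here for the sake of a runnable example, and the function names are hypothetical.

```python
from itertools import product

# Scheme 1: three-letter codes over three bases (3**3 = 27 codewords).
# Assumption: codewords are handed out in alphabetical order
# (AAA = A, AAC = B, ...) and the 27th codeword encodes the space.
SYMBOLS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
CODEWORDS = ["".join(p) for p in product("ACT", repeat=3)]
TEXT_TO_DNA = dict(zip(SYMBOLS, CODEWORDS))
DNA_TO_TEXT = {codeword: symbol for symbol, codeword in TEXT_TO_DNA.items()}


def encode_text(text):
    """Map each letter or space to its three-base codeword."""
    return "".join(TEXT_TO_DNA[ch] for ch in text.upper())


def decode_text(dna):
    """Read the strand three bases at a time and look up each codeword."""
    return "".join(DNA_TO_TEXT[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Scheme 2: two bits per nucleotide.
# A = 00 and C = 01 come from the text; G = 10 and T = 11 are assumed.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bytes(data):
    """Write arbitrary bytes as DNA, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bytes(dna):
    """Turn bases back into bits, then regroup the bits into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With these tables, `encode_text("AB")` gives `AAAAAC`, and the single byte `b"A"` (binary 01000001) becomes `CAAC`.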

In reality, though, biochemical features of nucleic acids make some combinations of bases more desirable than others. Particularly problematic are homopolymers, long runs of the same nucleotide, which are difficult to write and read using current methods. One way to avoid homopolymers is to allocate two bases to each binary digit; long runs of the same digit can then be encoded by alternating between the two bases assigned to that digit (3). A more efficient method is to convert text or other data into a ternary code, which uses three digits (0, 1, and 2) rather than two, and then write bases so that no base is used twice in a row, for example by encoding 0, 1, and 2 as C, G, and T after an A, but as G, T, and A after a C (4). With three choices at every position, such a code can carry up to about 1.58 bits per base, compared with one bit per base for the two-bases-per-digit approach. Newer methods include more complex codes, as well as error-correcting techniques, to pack as much information as possible into DNA while maximizing the accuracy of information retrieval.
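A minimal sketch of both homopolymer-avoiding ideas follows. Several details are assumptions rather than the published schemes: the one-bit-per-base version assumes 0 is written as A or C and 1 as G or T; the rotating code extends the two rows given above (after A and after C) to G and T by following the same cyclic order; the strand is assumed to start as if the previous base were A; and bytes are turned into ternary digits with a plain base-3 conversion (six digits per byte), a simplification chosen for illustration.

```python
# One bit per base. Assumption: 0 -> A or C, 1 -> G or T.
BIT_CHOICES = {"0": "AC", "1": "GT"}


def encode_bits_no_repeat(bits):
    """Encode a bit string, switching to the alternate base to avoid repeats."""
    out = []
    for bit in bits:
        first, second = BIT_CHOICES[bit]
        # Use the second candidate only when the first would repeat the last base.
        out.append(second if out and out[-1] == first else first)
    return "".join(out)


def decode_bits(dna):
    """Either base of a pair decodes to the same bit."""
    return "".join("0" if base in "AC" else "1" for base in dna)


# Rotating ternary code: the base written depends on the previous base.
ORDER = "ACGT"


def next_bases(prev):
    """Bases that encode 0, 1, 2 after `prev` (after A: C, G, T; after C: G, T, A)."""
    i = ORDER.index(prev)
    return [ORDER[(i + k) % 4] for k in (1, 2, 3)]


def encode_trits(trits, start="A"):
    """Encode ternary digits; `start` is a virtual previous base (assumed)."""
    prev, out = start, []
    for trit in trits:
        prev = next_bases(prev)[trit]
        out.append(prev)
    return "".join(out)


def decode_trits(dna, start="A"):
    """Recover each digit from the position of a base among its allowed options."""
    prev, out = start, []
    for base in dna:
        out.append(next_bases(prev).index(base))
        prev = base
    return out


def bytes_to_trits(data):
    """Plain base-3 conversion, six digits per byte (3**6 = 729 >= 256)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            byte, remainder = divmod(byte, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits
```

Under these assumptions, four zero bits come out as `ACAC` rather than `AAAA`, and four ternary zeros come out as `CGTA`. Because the rotating code never offers the previous base as an option, no two adjacent bases can match, whatever the input.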

Sources for methods depicted: 1. Bancroft et al., 2001; 3. Church et al., 2012; 4. Goldman et al., 2013.

Storage Cycle

After an encoding method is chosen, researchers write the DNA message into a series of long oligonucleotides. In earlier methods, these fragments were each tagged with a unique address sequence to aid reassembly, as well as common flanking sequences that allow amplification by PCR (1). Newer methods incorporate selective retrieval of specific sections of stored data, known as random access, by combining the address and PCR sequences into unique codes on either side of every oligonucleotide. Appropriate primers allow researchers to select and amplify only a sequence of interest (2).
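The addressing and selective-retrieval ideas can be imitated in software. The sketch below is a simplification under stated assumptions, not a model of any published protocol: each stored file gets its own hypothetical pair of flanking primer sequences, each fragment carries a short base-4 address, and "amplification" is reduced to keeping the fragments whose ends match the chosen primers (real PCR involves reverse complements and imperfect binding, which are ignored here). All function names and sequences are invented for illustration.

```python
def index_to_address(n, length):
    """Write a fragment number as a fixed-length base-4 address (A=0, C=1, G=2, T=3)."""
    digits = []
    for _ in range(length):
        n, remainder = divmod(n, 4)
        digits.append("ACGT"[remainder])
    return "".join(reversed(digits))


def address_to_index(address):
    """Inverse of index_to_address."""
    n = 0
    for base in address:
        n = n * 4 + "ACGT".index(base)
    return n


def make_oligos(payload, chunk_len, fwd, rev, addr_len=4):
    """Split a DNA payload into fragments laid out as: fwd + address + chunk + rev."""
    oligos = []
    for n, start in enumerate(range(0, len(payload), chunk_len)):
        chunk = payload[start:start + chunk_len]
        oligos.append(fwd + index_to_address(n, addr_len) + chunk + rev)
    return oligos


def select(pool, fwd, rev):
    """Stand-in for PCR: keep only fragments flanked by the given primer pair."""
    return [o for o in pool if o.startswith(fwd) and o.endswith(rev)]


def reassemble(oligos, fwd, rev, addr_len=4):
    """Strip the primers, order the fragments by address, and join the chunks."""
    chunks = {}
    for oligo in oligos:
        core = oligo[len(fwd):len(oligo) - len(rev)]
        chunks[address_to_index(core[:addr_len])] = core[addr_len:]
    return "".join(chunks[i] for i in sorted(chunks))
```

For example, if two payloads are split with different primer pairs and their fragments are mixed in one pool, `reassemble(select(pool, fwd, rev), fwd, rev)` returns only the payload tagged with that pair, regardless of the order in which the fragments sit in the pool.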

These oligonucleotides are synthesized and placed in tiny test tubes or printed onto DNA microchips, which are stored in a cold, dry, dark place. When the message needs to be read, researchers rehydrate the sample and add primers corresponding to the addresses of the sequences of interest. The amplified product is then sequenced and decoded to retrieve the original message.

