Next-Generation Sequencing
Ian S. Hagemann
DNA SEQUENCING IN HISTORICAL PERSPECTIVE
Maxam–Gilbert Sequencing
Early DNA sequencing methods took advantage of chemical reactions that selectively degrade DNA at specific nucleotides. For example, treatment with dimethyl sulfate followed by hot aqueous piperidine causes cleavage at guanine nucleotides, whereas 60% to 80% aqueous formic acid followed by hot aqueous piperidine cleaves at both adenine and guanine.1 One reaction tube is used for each chemical agent. If a homogeneous population of molecules is end-labeled, treated in individual tubes with nucleotide-specific cleavage agents, and separated by gel electrophoresis, the sequence can be read off the resulting gel. Maxam and Gilbert2 described a standard set of reactions cleaving preferentially at guanine, preferentially at adenine, at cytosine alone, and equally at cytosine and thymine. Though colorful (sometimes literally), these methods were cumbersome owing to the variety of reagents and conditions required, including radiolabeling. They had a limited read length of ∼100 nt and were poorly amenable to automation.2 They are now of historical interest only.
Sanger Sequencing
Sanger3 described a method to sequence DNA by synthesis, rather than by degradation. In the dideoxynucleotide chain-terminating method, as initially described, DNA was polymerized on a homogeneous population of template molecules, using four parallel reaction mixtures. Each mixture contained a 5′-radiolabeled sequencing primer and all four deoxynucleotide triphosphates (dNTPs) along with a small admixed percentage of a single dideoxynucleotide triphosphate (ddNTP). At each step of elongation, there is a chance that a ddNTP will be added in place of a dNTP, in which case, in the absence of a 3′ hydroxyl group, the growing DNA chain will be terminated. The reaction produces a collection of DNA molecules of varying lengths, each terminated by a dideoxynucleotide.4 These molecules can be separated by gel electrophoresis, and the sequence can be read off the gel.
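The logic of reading a sequence from chain-terminated fragments can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any instrument software: the function names `chain_terminated_fragments` and `read_ladder` and the example sequence are hypothetical, and the model enumerates every possible termination product instead of sampling terminations at random.

```python
def chain_terminated_fragments(synthesized: str) -> dict[str, list[int]]:
    """Return, for each of the four ddNTP reactions, the lengths of the
    terminated products (counted in nucleotides beyond the primer).

    `synthesized` is the newly made strand, 5'->3'. In the reaction that
    contains ddATP, a chain can terminate wherever an A is incorporated,
    and likewise for the other three reactions.
    """
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Read the sequence off the 'gel': sort all bands by fragment length
    (shortest first) and report which lane each band came from."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = chain_terminated_fragments("GATTACA")
recovered = read_ladder(lanes)
```

Here `lanes["A"]` holds the fragment lengths seen in the ddATP reaction, and `recovered` reproduces the synthesized strand because each fragment length occurs in exactly one lane.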
Sanger sequencing originally required radiolabeling of the sequencing primer at the 5′ end, but the ddNTPs today are fluorescently labeled, allowing the product to be detected by fluorimetry. A different fluorophore is attached to each ddNTP so that the four reactions can be multiplexed together and the sequencing reaction can take place in a single tube. Sanger sequencing reactions are now read by capillary electrophoresis, yielding four-color electropherograms with single-nucleotide resolution from which the sequence can be “called” by a computer program. The time required for a typical sequencing reaction is now less than 15 minutes, with a further 60 minutes required for capillary electrophoresis.5 Reads of up to 1,000 nt can be obtained.
These two innovations—capillary electrophoresis and fluorescent detection—have made Sanger sequencing a practical and versatile technique that is still used today. One major application is in sequencing of individual genes, where higher-throughput methods (described in the next section) are not needed. Another application is as an adjunct to next-generation sequencing (NGS), for example, when NGS results are unclear, to “patch” regions poorly covered by NGS, or when laboratory standard operating procedures require NGS-detected variants to be confirmed by an orthogonal method.
Limitations of Sanger sequencing include relatively low sensitivity: variants must be present at ∼20% allele frequency within the sample in order to be detected. In addition, Sanger sequencing cannot determine whether two variants are present in cis (i.e., on the same DNA strand) or in trans, a problem referred to as “phasing.” Despite these limitations, Sanger sequencing is still generally considered a gold standard and remains in common use.
Emergence of Next-Generation (Second-Generation) Methods
So-called next-generation or second-generation methods increase the throughput of DNA sequencing by making it possible to sequence many molecules in parallel. There are several platforms in current use, and the technology is changing rapidly.
Most current NGS platforms can be described as “cyclic array sequencing” platforms, as they involve fragmenting the target DNA of interest, converting these fragments into a sequencing library by adding primer binding sites or other necessary sequences to the ends of each fragment, distributing these target sequences across a two-dimensional array, and then sequencing the targets by iterative cycles of sequencing chemistry.6 The resulting reads can be reassembled de novo or, much more commonly in clinical applications, aligned to a reference genome.
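The idea of cutting a target into short reads and placing them back on a reference can be illustrated with a deliberately naive sketch. It assumes error-free reads and exact string matching. Production aligners use indexed, mismatch-tolerant algorithms, which this example does not attempt to reproduce. The function names and the reference string are hypothetical.

```python
def fragment(sequence: str, read_length: int, step: int) -> list[str]:
    """Cut `sequence` into reads of `read_length`, starting every `step` bases."""
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length] for i in range(0, last_start + 1, step)]


def align_exact(read: str, reference: str) -> list[int]:
    """Return every 0-based reference position where `read` matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTTGCAAGGCTTAC"
reads = fragment(reference, read_length=6, step=5)
placements = [align_exact(read, reference) for read in reads]
```

A read that matches at more than one position would return several placements, which is the ambiguity that longer or paired reads help to resolve.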
The sequence reads are on the order of 50 to 150 nt, which is considerably shorter than those obtained in Sanger sequencing. If the read length is shorter than the DNA insert (fragment) length, it is often possible to sequence from both ends of the insert to obtain “paired-end” reads; the paired-end reads must map close to one another, which serves to increase the confidence with which any given read can be mapped. To permit multiplexing of multiple samples on the same substrate, the DNA fragments derived from a single specimen are often tagged with index sequences (barcodes).
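As a rough illustration of paired-end reads and index-based multiplexing, the following Python sketch derives two reads from the ends of one insert and then sorts indexed reads back to their specimens. All names (`paired_end_reads`, `demultiplex`, the barcode table, the sample labels) are assumptions made for this example. Exact barcode matching is also a simplification, since real demultiplexing software typically tolerates a limited number of index mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(insert: str, read_length: int) -> tuple[str, str]:
    """Read 1 comes from one end of the insert; read 2 comes from the
    opposite end and is reported on the complementary strand."""
    read1 = insert[:read_length]
    read2 = reverse_complement(insert)[:read_length]
    return read1, read2


def demultiplex(tagged_reads, barcode_to_sample: dict[str, str]) -> dict[str, list[str]]:
    """Group (index_sequence, read) pairs by specimen. Reads whose index is
    not in the table are collected under 'undetermined'."""
    by_sample: dict[str, list[str]] = {}
    for index_seq, read in tagged_reads:
        sample = barcode_to_sample.get(index_seq, "undetermined")
        by_sample.setdefault(sample, []).append(read)
    return by_sample


read1, read2 = paired_end_reads("AACCGGTTAC", read_length=4)
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
pooled = [("ACGT", "AACC"), ("TTAG", "GGTT"), ("CCCC", "ACAC")]
sorted_reads = demultiplex(pooled, barcodes)
```

Because both reads originate from the same insert, an aligner can require them to map within roughly one insert length of each other, which is the constraint described above.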
Clinical NGS-based assays can target regions ranging from a few individual genes up to the entire genome. When the region to be sequenced is less than the entire genome, target enrichment is performed to avoid wasting resources in obtaining unnecessary sequence. Two methods are in common use. In target capture methods, a library is first prepared from fragmented genomic DNA to add indexes and any adapters necessary for the sequencing chemistry. Biotinylated oligonucleotide baits corresponding to the region of interest are hybridized to this library DNA and then captured using magnetic streptavidin beads. Typically, a small number of PCR cycles are then performed to increase the mass of DNA available for sequencing.
Amplicon sequencing methods, in contrast, perform a larger number of PCR cycles on the input DNA to amplify the regions of interest and simultaneously add an index and/or other necessary adapters. Because so many cycles of PCR are performed, the quantity of input DNA required for amplicon sequencing is lower (typically on the 10-ng scale) than that required for target capture (closer to 100-ng scale). The amplification-based approach is also faster, because it avoids the hybridization step of target capture, which is typically an overnight process. However, the larger number of PCR cycles in amplicon sequencing has the potential to introduce polymerase errors and PCR bias, which can obscure copy number variants and distort the assessment of variant allele fractions. For large capture spaces consisting of many exons, it can also be unwieldy to multiplex the large number of PCR reactions needed for amplicon sequencing. Thus, although both methods of target enrichment have a role, amplicon sequencing is better suited to smaller assays on smaller amounts of DNA, whereas target capture is better suited to larger capture spaces and to situations in which more input DNA is available.
Illumina Sequencing
The major platforms in current clinical use are those provided by Illumina, Inc. (San Diego, CA) and Ion Torrent (Thermo Fisher Scientific, Waltham, MA). In Illumina sequencing, libraries are hybridized to the two-dimensional surface of a flow cell, and each molecule is subjected to “bridge amplification” to create a cluster of about 2,000 identical fragments within a diameter of ∼1 µm. A single lane of a flow cell can hold >37 million individual amplified clusters. These fragments are “sequenced by synthesis” by successively incorporating fluorescently labeled, reversibly terminated nucleotides. After each elongation step, the surface of the flow cell is imaged by a charge-coupled device (CCD) to query each position for the identity of the most recently incorporated nucleotide. Successive cycles of deprotection, elongation, and imaging result in a series of large image files that are then processed to determine the sequence of each cluster on the flow cell.
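The final step, turning per-cycle images into a sequence for each cluster, can be sketched in highly simplified form. The example assumes that image processing has already reduced each cycle to four intensity values per cluster, one per nucleotide, in the order A, C, G, T. That layout and the function name `call_cluster` are assumptions for illustration. Actual base-calling software also corrects for effects such as signal crosstalk and loss of synchrony within a cluster, which are ignored here.

```python
BASES = "ACGT"  # assumed channel order for this sketch


def call_cluster(intensities_per_cycle) -> str:
    """Call one base per cycle for a single cluster by taking the
    brightest of the four channels.

    `intensities_per_cycle` is a sequence of (A, C, G, T) intensity tuples,
    one tuple per sequencing cycle.
    """
    read = []
    for cycle in intensities_per_cycle:
        brightest = max(range(len(BASES)), key=lambda channel: cycle[channel])
        read.append(BASES[brightest])
    return "".join(read)


cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),  # cycle 1: A channel is brightest
    (0.1, 0.2, 0.8, 0.0),  # cycle 2: G channel is brightest
    (0.0, 0.1, 0.1, 0.7),  # cycle 3: T channel is brightest
]
read = call_cluster(cluster_signal)
```

The read length equals the number of cycles, since the reversible terminator allows only one nucleotide to be incorporated per cycle.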
Illumina sequencers in use today are the HiSeq, NextSeq, and MiSeq systems. The HiSeq is a larger-scale instrument producing more total reads and more total data per run, with longer run times, at a lower price per sample. The MiSeq is framed as a smaller-scale benchtop solution and is the first NGS system cleared by the U.S. Food and Drug Administration (FDA) as an in vitro diagnostic device. The NextSeq has properties intermediate between those of the HiSeq and MiSeq.
Ion Torrent Sequencing
Whereas Sanger and Illumina sequencing are based on detecting fluorescence, the Ion Torrent approach is to measure very small changes in pH that occur as a result of H+ release when nucleotides are incorporated into an elongating sequence. Molecules of the library to be sequenced are bound to the surface of microscopic beads and amplified by emulsion PCR so that each bead becomes coated with a population of identical molecules.
The Ion Torrent sequencers in current use include the Ion Personal Genome Machine (PGM), a benchtop instrument, and the Ion Proton, a higher-throughput device.
The beads are distributed into wells on a chip constructed by complementary metal-oxide semiconductor (CMOS) technology.7 The sequencing reaction performed on the chip is analogous to pyrosequencing, except that pH is detected instead of light. The Ion Torrent chip has the property that each well, containing an embedded field effect transistor, functions as an extremely sensitive pH meter. Sequencing reactions are accomplished by successively flooding the chip with each deoxyribonucleotide (dATP, dCTP, dGTP, dTTP). Incorporation of a nucleotide releases H+, which changes the voltage at the gate of the transistor and allows current to flow across the transistor, resulting in a signal.7 Homopolymers cause incorporation of multiple nucleotides in a single flow, with a correspondingly larger pH change.
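The flow-based readout can be modeled with a short Python sketch. It assumes an idealized well in which the signal for each flow equals the number of nucleotides incorporated, and it assumes a repeating flow order of T, A, C, G chosen only for illustration. The function names are hypothetical, and real signals are noisy, not integer-valued.

```python
def simulate_flowgram(synthesized: str, flow_order: str = "TACG"):
    """Return a list of (nucleotide, signal) pairs, one per flow.

    `synthesized` is the strand being built, 5'->3'. In each flow, every
    consecutive matching base is incorporated at once, so a homopolymer
    run of length n gives a signal of n and a non-matching flow gives 0.
    """
    if any(base not in flow_order for base in synthesized):
        raise ValueError("sequence contains a base absent from the flow order")
    signals = []
    position = 0
    flow = 0
    while position < len(synthesized):
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            run += 1
            position += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def decode_flowgram(signals) -> str:
    """Convert per-flow signals back to sequence by rounding each signal
    to the nearest whole number of incorporated nucleotides."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in signals)


flows = simulate_flowgram("TTAGGGC")
decoded = decode_flowgram(flows)
```

Decoding depends on rounding each signal to an integer, so the relative difference between a run of n and a run of n + 1 shrinks as n grows. That makes long homopolymers the hardest case for this kind of readout.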
Each method has a unique mix of advantages and disadvantages that may make it suitable for specific applications. Ion Torrent sequencing uses natural deoxyribonucleotides rather than synthetic derivatives, which can reduce sequencing biases related to incorporation of unnatural nucleotides. Reads are relatively long, and reaction times are relatively short (3 hours for 300 bases). Disadvantages of the Ion Torrent platform include a relatively higher error rate, often attributed to difficulties in discriminating the multiplicity of longer homopolymers.8 Several paired-end modes are available, but require off-instrument repriming.7
Third-Generation Methods
Third-generation NGS platforms are those that obtain sequence from specimens closely resembling native DNA, that is, with minimal need for target amplification and library preparation. Third-generation NGS is predominantly a single-molecule approach with extremely long read lengths (>10,000 nt), reducing the need for assembly of short reads and improving the ability to discern the phase of detected variants. The major vendors in this area today are Pacific Biosciences (Menlo Park, CA) and Oxford Nanopore Technologies (Oxford, UK).
The Pacific Biosciences approach is to perform sequencing on the surface of a single-molecule real-time (SMRT) cell patterned with 150,000 wells, each containing a zero-mode waveguide (ZMW). These ZMWs serve to immobilize a single template molecule and make it possible to interrogate light emission from that one molecule. As extension is carried out, fluorescent signals from each ZMW indicate the nucleotide that has been added at that position.
Oxford Nanopore sequencers contain engineered bacterial nanopores through which a molecule of DNA is fed. The electrical resistance of the pore varies according to the identity of the nucleotide passing through the pore, allowing the sequence to be read off. Multiple formats have been developed; one of them is the MinION, a portable, self-contained, disposable device that plugs into the universal serial bus (USB) port of a personal computer.
Third-generation technologies are at an early stage of development and have not been validated for clinical use, but will undoubtedly play an important role in future diagnostics.
Analytes Other Than DNA
The methods presented earlier have focused on DNA sequencing. Variants of these methods have been developed to allow the detection of other analytes related to nucleic acids, including RNA sequence, epigenetic changes such as DNA methylation, and DNA–protein interactions. These methods have not yet found widespread clinical applications.