General Principles of Molecular Biology
Immunohistochemistry (IHC) is a common technique used for the detection of protein expression in various tissue samples. In modern pathology practice, this methodology is expanded and complemented by molecular techniques that test for changes in nucleic acids (that is, DNA and RNA) to assist the immunohistologic diagnosis.
Many of the chapters in this book refer to theranostic and genomic principles that can be investigated with immunohistology and used directly for patient care. The underpinning of these immunohistologic tests requires an understanding of the molecular abnormalities of these disease states and how molecular methods apply to their study. In addition, the molecular methods discussed here may be valuable in diagnosis when immunohistologic results are nonspecific.
Genetic information in human cells is encoded in deoxyribonucleic acid (DNA), which is primarily located in the nucleus of each cell. DNA is a double-stranded molecule that consists of two complementary strands of linearly arranged nucleotides, each composed of a phosphorylated sugar and one of four nitrogen-containing bases: adenine (A), guanine (G), thymine (T), or cytosine (C). The order of these four bases encodes genetic information. The two strands of DNA run in opposite directions and are held together through pairing between specific bases, namely adenine with thymine (A:T pairing) and guanine with cytosine (G:C pairing), which forms a double-stranded helix. As a result, the nucleotide sequence of one DNA strand is complementary to the nucleotide sequence of the other DNA strand.
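The pairing rules described above can be illustrated with a minimal Python sketch. The function name `reverse_complement` and the example sequences are illustrative only and do not come from the text.

```python
# Base-pairing rules from the text: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in its own 5'-to-3' direction."""
    # The strands run in opposite directions, so the input is read backward
    # while each base is swapped for its pairing partner.
    return "".join(PAIRING[base] for base in reversed(strand.upper()))


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is one way to see that each strand fully determines the other.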
The human genome contains approximately 3 billion base pairs (bp) of DNA. The DNA is folded to fit within the nucleus. It is divided among chromosomes and is efficiently packed into chromatin by histones and other accessory proteins. Each normal somatic cell contains two copies of 22 different autosomes and two sex chromosomes, either XX or XY. Less than 5% of DNA actually encodes protein and other functional products, such as transfer RNA (tRNA), ribosomal RNA (rRNA), micro-RNA (miRNA), and small nuclear RNAs (snRNAs). Most human DNA (>95%) consists of noncoding sequences, typically repetitive sequences such as minisatellites, microsatellites, short interspersed elements, and long interspersed elements. Microsatellites are short tandem repeats (STRs), and each repeat is from 1 to 13 bp long. Minisatellites are tandemly repeated DNA sequences with a repeat unit of 14 to 500 bp and are also known as variable number tandem repeats (VNTRs). Highly repetitive sequences that contain thousands of repeated units are also found at the telomeric ends of the chromosomes and near the centromere; they play a role in establishing and maintaining chromosome structure and stability.
For the genetic information to be decoded, the DNA is copied, or transcribed, into messenger RNA (mRNA), which is then transported into the cytoplasm, where it governs translation into protein (Fig. 2.1). Genes are segments of genomic DNA that encode proteins and other functional products. Each gene is typically present in a cell in two copies, one on a maternal and another on a paternal chromosome. Current estimates suggest that about 25,000 distinct genes are present in the human genome. Each gene typically consists of exons, which are protein-coding sequences, and introns, which are noncoding sequences located between the coding regions (see Fig. 2.1). Translation initiation and termination codons flank the portion of a gene that codes for a protein. Gene transcription and silencing are facilitated by promoters and enhancers, which are DNA regions typically located nearby and "upstream" from the gene they regulate, although they may also be located at a great distance.
Ribonucleic acid (RNA) is a single-stranded molecule that consists of a chain of nucleotides on a sugar-phosphate backbone. However, the sugar in RNA is ribose, rather than deoxyribose, and thymine is replaced by uracil. RNA is more susceptible to chemical and enzymatic hydrolysis and is less stable than DNA.
Several types of RNA exist, and each is different in its structure, function, and location. The most abundant types of RNA are rRNA and tRNA, which comprise up to 90% of the total cellular RNA. They are predominantly located in the cytoplasm and have important functions in protein synthesis: rRNA, in a complex with specific proteins, forms ribosomes on which proteins are synthesized, and tRNA carries amino acids and adds them to the growing polypeptide chain during protein synthesis. mRNA comprises 1% to 5% of total RNA. Each mRNA molecule is a copy of a specific gene and transfers genetic information from the nucleus to the cytoplasm, where it serves as a "blueprint" for protein synthesis. The gene sequence is first transcribed into the primary RNA transcript by RNA polymerase. This transcript is an exact complementary copy of the gene and includes all exons and introns. Next, intron portions are spliced out from the primary RNA transcript while it is processed into mature mRNA (see Fig. 2.1). Other types of RNA include heterogeneous nuclear RNA (hnRNA) and snRNA. Several classes of short RNAs have also been discovered, including miRNAs, which are short (19 to 22 nt) single-stranded molecules that function as negative regulators of the expression of protein-coding genes.
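As a rough illustration of transcription followed by splicing, the sketch below builds a primary transcript and a mature mRNA from a toy gene. The gene segments, their sequences, and the function names are hypothetical, and the DNA is written as the coding strand so that the RNA differs only by uracil replacing thymine.

```python
# Hypothetical toy gene: exon and intron segments in order (coding-strand DNA).
GENE = [("exon", "ATGGCC"), ("intron", "GTAAGT"), ("exon", "TTTTAA")]


def transcribe(dna: str) -> str:
    """Write a coding-strand DNA sequence as RNA (uracil replaces thymine)."""
    return dna.replace("T", "U")


def primary_transcript(gene) -> str:
    """The primary transcript includes every exon and intron."""
    return "".join(transcribe(seq) for _, seq in gene)


def mature_mrna(gene) -> str:
    """Splicing removes the introns and joins the exons."""
    return "".join(transcribe(seq) for kind, seq in gene if kind == "exon")


print(primary_transcript(GENE))  # AUGGCCGUAAGUUUUUAA
print(mature_mrna(GENE))         # AUGGCCUUUUAA
```

This ignores capping, polyadenylation, and alternative splicing; it shows only that the mature mRNA is shorter than the primary transcript because the intron is gone.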
The abundance of a protein within each cell depends on the expression levels of the gene (i.e., how many mRNA copies are transcribed from DNA) and the stability of the protein. Proteins are synthesized on ribosomes in the cell cytoplasm, and mRNA carries genetic information to the ribosomes, which then direct the assembly of polypeptide chains by reading a three-letter genetic code on the mRNA and pairing it with a complementary tRNA linked to an amino acid. The three-base code, called the codon, defines which specific amino acid is added by the tRNA to the growing polypeptide chain. After synthesis, the protein undergoes posttranslational modification, such as chain cleavage, chain joining, addition of nonprotein groups, and folding into a complex, three-dimensional structure.
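Codon-by-codon reading of an mRNA can be sketched in a few lines of Python. The table below is a small excerpt of the standard genetic code, and the function name and example sequence are illustrative assumptions rather than anything taken from the text.

```python
# Small excerpt of the standard genetic code (RNA codons).
# None marks a stop codon, at which translation terminates.
CODON_TABLE = {
    "AUG": "Met", "GCC": "Ala", "UUU": "Phe", "GGA": "Gly",
    "UAA": None, "UAG": None, "UGA": None,
}


def translate(mrna: str) -> list:
    """Read an mRNA three bases at a time and return the amino acid chain."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]  # KeyError if codon is outside the excerpt
        if amino_acid is None:           # stop codon: release the chain
            break
        peptide.append(amino_acid)
    return peptide


print(translate("AUGGCCUUUUAA"))  # ['Met', 'Ala', 'Phe']
```

A full implementation would use all 64 codons and would first locate the start codon; this sketch assumes the sequence already begins in frame.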
Genetic Polymorphism and Mutations
Variations in DNA sequence are common among individuals. Genetic polymorphism is an alteration in DNA sequence found in the general population at a frequency greater than 1%. Polymorphism may be associated with a single nucleotide change, known as a single-nucleotide polymorphism (SNP), or with variation in a number of repetitive DNA sequences, such as minisatellites or microsatellites, called length polymorphism. Usually, genetic polymorphism does not directly cause a disease, but, rather, it may serve as a predisposing factor.
Mutation is a permanent alteration of the DNA sequence of a gene that is found in less than 1% of the population and most likely causes disease. Mutations can be either germline, present in all cells of the body, or somatic, found in tumor cells only. Somatic mutations may provide a selective advantage for cell growth and may initiate cancer development, but they are not transmitted to offspring. In contrast, germline mutations are passed on to the next generation.
Mutations located in a coding sequence, in the regulatory elements, or at the intron-exon boundaries of a gene may affect transcription and/or translation and may result in alteration of the protein structure and function. The sequencing of cancer genomes has revealed that most mutations occur in genes whose products affect signaling pathways that control important cell functions. It is estimated that most mutations (90%) result in activation of a gene, typically forming an oncogene such as RAS or BRAF; smaller proportions of mutations (10%) lead to loss of function of a tumor suppressor gene, such as TP53.
A current list of somatic mutations in cancer can be viewed at the Catalogue of Somatic Mutations in Cancer (COSMIC) database, which documents somatic cancer mutations reported in the literature and identified during the Cancer Genome Project (www.cancer.sanger.ac.uk/cosmic). Not all somatic mutations have a clear biological effect. Mutations that increase cell growth and survival and are positively selected for tumor development are called driver mutations. Conversely, genetic alterations that do not confer a selective growth advantage to the cell and do not have functional consequences are known as passenger mutations. They may be coincidentally present in a cell that acquires a driver mutation and are carried along during clonal expansion, or they occur during clonal expansion of a tumor. It is generally believed that only a small fraction of mutations in a given tumor are driver mutations; it has been estimated that a typical human tumor carries approximately 80 mutations that change the amino acid sequences of proteins, of which fewer than 15 are driver mutations.
Mutations can be classified according to size and structure into small-scale mutations (sequence mutations) and large-scale mutations (chromosomal alterations). Small-scale mutations include point mutations, which are single-nucleotide substitutions, and small deletions and insertions. Point mutations can be further classified as missense mutations, which lead to an amino acid change and result in production of abnormal protein; silent mutations, which do not lead to a change in amino acids; and nonsense mutations, in which substitution of a single nucleotide results in formation of a stop codon and a truncated protein. Deletions and insertions can involve a number of nucleotides divisible by 3, which changes the number of amino acids and yields a shorter or longer protein, or a number of nucleotides not divisible by 3, which shifts the open reading frame of the gene; such a frameshift affects multiple amino acids and typically produces a stop codon and protein truncation. Large-scale mutations can be due to (1) numerical chromosomal change, that is, loss or duplication of the entire chromosome; (2) chromosomal rearrangement, that is, translocations or inversions that result in an exchange of chromosomal segments between two nonhomologous chromosomes, or within the same chromosome, and typically lead to activation of specific genes located at the fusion point; (3) amplification, when a particular chromosomal region is repeated multiple times on the same chromosome or different chromosomes, resulting in the increased copy number of the gene located within this region; and (4) chromosomal deletion or loss of heterozygosity (LOH), when deletion of a discrete chromosomal region leads to loss of a tumor suppressor gene residing in this area. Functional consequences of each mutation type vary.
In general, mutations result in either activation of the gene, typically forming an oncogene such as KRAS or RET, or loss of function of a tumor suppressor gene (TP53, PTEN, CDKN1A).
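The classes of small-scale mutation described above can be expressed as two simple rules, sketched here in Python. The codon table is a small excerpt of the standard genetic code written as DNA codons, and the function names and example codons are illustrative assumptions.

```python
# Small excerpt of the standard genetic code (DNA codons).
CODONS = {
    "GGT": "Gly", "GGC": "Gly", "GAT": "Asp",
    "GTT": "Val", "TGG": "Trp", "TGA": "Stop",
}


def classify_point_mutation(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-nucleotide substitution by its effect on the codon."""
    ref_aa, alt_aa = CODONS[ref_codon], CODONS[alt_codon]
    if alt_aa == ref_aa:
        return "silent"      # same amino acid
    if alt_aa == "Stop":
        return "nonsense"    # new stop codon, truncated protein
    return "missense"        # different amino acid


def classify_indel(n_bases: int) -> str:
    """An insertion or deletion keeps the reading frame only if divisible by 3."""
    return "in-frame" if n_bases % 3 == 0 else "frameshift"


print(classify_point_mutation("GGT", "GAT"))  # missense
print(classify_point_mutation("GGT", "GGC"))  # silent
print(classify_point_mutation("TGG", "TGA"))  # nonsense
print(classify_indel(3), classify_indel(4))   # in-frame frameshift
```

The sketch assumes the reference codon is not itself a stop codon and that both codons appear in the excerpted table.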
Specimen Requirements for Molecular Testing
Molecular testing in surgical pathology can be performed on a variety of clinical samples, including fresh- or snap-frozen tissue, formalin-fixed paraffin-embedded (FFPE) tissue, cytology specimens (fresh and fixed fine needle aspiration [FNA] samples), blood, bone marrow, and buccal swabs. Specimen requirements depend on the type of disease and on the molecular techniques used for the analysis. Peripheral blood lymphocytes or cells from buccal swabs are typically used for detection of germline mutations responsible for a given inherited disease, such as RET mutations in familial medullary thyroid carcinoma. Blood and bone marrow biopsy materials are frequently used for detection of chromosomal rearrangements in hematologic malignancies (e.g., BCR/ABL1 in acute lymphocytic leukemia). Tumor tissue samples are required to detect somatic mutations such as KRAS point mutation in colorectal cancer, SS18/SSX1 rearrangement in synovial sarcomas, or EGFR mutation in lung adenocarcinomas.
Fresh- or snap-frozen tissue is the best sample for testing because freezing minimizes degradation and provides excellent quality of DNA, RNA, and protein. Such specimens can be successfully used for any type of molecular analysis, including detection of somatic mutations, chromosomal rearrangements, gene-expression arrays, and miRNA profiling. FFPE tissue samples or fixed cytology specimens do not provide such highly preserved nucleic acids; however, these specimens can be successfully used for molecular testing in many situations, particularly for tests that require DNA. Tissue is most commonly fixed in 10% neutral-buffered formalin (NBF). However, formalin fixation leads to fragmentation of DNA; therefore, when FFPE tissue samples are used, molecular assays need to be optimized to amplify shorter DNA fragments (250 to 300 bp in length). Prolonged (>24 to 48 hours) fixation in 10% NBF adversely affects the quality of nucleic acids; therefore specimens should preferably not be fixed for prolonged times. Tissue specimens that were processed with bone decalcifying solution cannot be used for molecular analysis because of extensive DNA fragmentation. Similarly, it is not recommended to perform molecular testing on specimens exposed to fixatives that contain heavy metals (e.g., Zenker, B5, acetic acid-zinc-formalin) because of inhibition of DNA polymerases and other enzymes that are essential for molecular assays.
RNA molecules are less stable than DNA and are easily degraded by a variety of ribonuclease enzymes present in abundance in the cell and environment. Therefore only freshly collected or frozen samples are universally acceptable for RNA-based testing. RNA isolated from FFPE tissue is of poor quality and can be used for some but not all applications, particularly in a setting of clinical diagnostic testing.
The amount of tissue required for molecular testing depends on the sensitivity of a technique and on the purity of the tumor sample. When selecting a sample for molecular testing, a pathologist must review a representative hematoxylin and eosin (H&E) slide of the tissue to identify a target and determine the purity of the tumor; that is, the proportion of tumor cells and benign stromal and inflammatory cells in the area selected for testing must be evaluated. Manual or laser-capture microdissection can be performed with unstained tissue sections under the guidance of an H&E slide to enrich the tumor cell population. The minimum percentage of tumor cells required for molecular testing depends on the methodology being used for analysis. In general, a minimum tumor cellularity of 50% and at least 300 to 500 tumor cells are required for Sanger sequencing.
For molecular testing of hematologic specimens, blood and bone marrow should be collected in the presence of the anticoagulants ethylenediaminetetraacetic acid (EDTA) or acid-citrate-dextrose (ACD), but not heparin, because even a small residual concentration of heparin inhibits polymerase chain reaction (PCR) amplification.
Conventional cytogenetic analysis requires fresh tissue. Fluorescence in situ hybridization (FISH) can be performed on a variety of specimens including frozen tissue sections, touch preparations, paraffin-embedded tissue sections, and cytology slides.
Common Techniques for Molecular Analysis
Polymerase Chain Reaction
PCR is the amplification technique most frequently used in molecular laboratories. The introduction of PCR has dramatically increased the speed and accuracy of DNA and RNA analysis. The technique is based on exponential and bidirectional amplification of DNA sequences with a set of oligonucleotide primers.
Every PCR run must include the DNA template, two primers complementary to the target sequence, four deoxynucleotide triphosphates (dATP, dCTP, dGTP, and dTTP), DNA polymerase, and magnesium chloride (MgCl₂) mixed in the reaction buffer. Three steps occur in the PCR cycle (Fig. 2.2). First, the reaction mixture is heated to a high temperature (95°C), which leads to DNA denaturing, or separation of the double-stranded DNA into two single strands. The second step involves annealing of primers, in which the reaction is cooled to 55°C to 65°C to allow primers to attach to their complementary sequences. The third step is DNA extension, in which the reaction is heated to 72°C to allow the enzyme DNA polymerase to build a new DNA strand by adding specific nucleotides to the attached primers. These three steps are repeated 35 to 40 times, and during each cycle, the newly synthesized DNA strands serve as a template for further DNA synthesis. This approach results in the exponential increase in the amount of a targeted DNA sequence and production of 10⁷ to 10¹¹ copies from a single DNA molecule.
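The exponential character of the reaction can be made concrete with a short calculation. The sketch below uses a common simplified model in which every cycle multiplies the template by (1 + efficiency); the `efficiency` parameter and the function name are assumptions introduced for illustration and are not part of the text.

```python
def copies_after(cycles: int, efficiency: float = 1.0, start: float = 1.0) -> float:
    """Estimate copy number after a number of PCR cycles.

    efficiency = 1.0 means every template is copied in every cycle (perfect
    doubling); lower values model a less efficient reaction. This simplified
    model ignores the plateau caused by reagent depletion.
    """
    return start * (1.0 + efficiency) ** cycles


print(copies_after(10))       # 1024.0 copies from one molecule at perfect doubling
print(copies_after(35, 0.7))  # on the order of 10**8 at 70% efficiency
```

Under this model, perfect doubling for 35 cycles would give about 3 × 10¹⁰ copies, so yields in the 10⁷ to 10¹¹ range are consistent with per-cycle efficiencies somewhat below 100%.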
The efficiency of PCR amplification depends on many factors, which include the quality of the isolated DNA template, size of the PCR product, optimal primer design, and optimal conditions of the reaction. Quality DNA allows amplification of long products (as long as 3 to 5 kb). However, when dealing with DNA of suboptimal quality, that is, DNA isolated from fixed tissue or cytology preparations, reliable amplification can be achieved only for relatively short DNA sequences (400 to 500 bp or shorter).
Once the PCR procedure is complete, the products of amplification should be visualized for analysis and interpretation. A simple way to achieve this is to use agarose gel electrophoresis and ethidium bromide staining. However, this method cannot separate amplification products that differ in size by only a few nucleotides, and finer separation can be achieved with polyacrylamide gel or capillary gel electrophoresis (Fig. 2.3). PCR amplification followed by gel electrophoresis is frequently used for detection of small deletions or insertions, microsatellite instability (MSI), and LOH. For detection of point mutations, the PCR products should be interrogated by other molecular techniques.
Reverse Transcription Polymerase Chain Reaction
Reverse transcription PCR (RT-PCR) is a modification of the standard PCR technique that can be used to amplify mRNA. As a first step, isolated mRNA is converted to a complementary DNA (cDNA) molecule with an RNA-dependent DNA polymerase, also known as reverse transcriptase, during a process called reverse transcription. The cDNA can be used as any other DNA molecule for PCR amplification. The primers used for cDNA synthesis can be non–sequence specific (a mixture of random hexamers or oligo-dT primers) or sequence specific (Fig. 2.4). Random hexamers are a mixture of all possible six-nucleotide sequences that can attach randomly to RNA and initiate reverse transcription of the entire RNA pool. The oligo-dT primers are complementary to the poly-A tails of the mRNA molecules and allow synthesis of cDNA only from mRNA molecules. Sequence-specific primers are the most restricted because they are designed to bind selectively to the mRNA molecules of interest, making the entire process of reverse transcription target specific.
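The difference between the two non-sequence-specific priming strategies can be sketched as follows. The pool of random hexamers is simply every six-letter combination of the four bases. The oligo-dT check is a deliberately crude stand-in: the function name and the `tail_length` threshold are hypothetical and chosen only for illustration.

```python
from itertools import product

# Random hexamers: every possible six-nucleotide sequence of A, C, G, and T.
hexamers = ["".join(bases) for bases in product("ACGT", repeat=6)]
print(len(hexamers))  # 4096, that is, 4**6


def primes_with_oligo_dt(rna: str, tail_length: int = 12) -> bool:
    """Return True if the RNA ends in a poly-A tail that oligo-dT could bind.

    tail_length is an arbitrary illustrative threshold, not a laboratory value.
    """
    return rna.endswith("A" * tail_length)


print(primes_with_oligo_dt("AUGGCC" + "A" * 20))  # True: polyadenylated mRNA
print(primes_with_oligo_dt("AUGGCCGUAC"))         # False: no poly-A tail
```

Because the hexamer pool covers every six-base sequence, some primer can anneal to any RNA, whereas the oligo-dT check passes only for molecules carrying a poly-A tail.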
Reverse transcription and PCR amplification can be performed as a two-step process in a single tube or as two separate reactions. RT-PCR performed on fresh-frozen tissue provides quality amplification and reliable results. However, when FFPE tissue is used for RT-PCR analysis, the results vary and depend on the level of RNA degradation and length of PCR amplicon. For more stable RT-PCR amplification to be achieved from FFPE tissues, the target is typically chosen to be less than 150 to 200 nt long.
RT-PCR analysis is used in molecular laboratories for the detection of gene rearrangements and gene expression. RT-PCR may also be used to amplify several exonic sequences in one reaction because it can take advantage of the fact that all introns are spliced out in mRNA, leaving the coding sequences intact and significantly shortening the potential product of amplification. However, it is important to recognize that RNA is easily degradable and has to be handled with great care during the entire process of reverse transcription to avoid degradation. Amplification of a housekeeping gene has to accompany each RT-PCR reaction as an internal control to monitor the quality and quantity of RNA in a given sample.
Real-Time Polymerase Chain Reaction
Real-time PCR uses the main principles of conventional PCR but detects and quantifies the PCR product in real time as the reaction progresses. In addition to all components of conventional PCR, real-time PCR uses fluorescently labeled molecules for the visualization of PCR amplicons. It can be performed in two main formats: with incorporation of DNA dyes such as SYBR Green I, SYTO9, EvaGreen, or LC Green into the PCR product, or with fluorescently labeled probes (e.g., fluorescence resonance energy transfer [FRET] hybridization probes and TaqMan probes) that anneal to the PCR product. During PCR, the increasing amounts of fluorescence that result from exponential increase in the amount of amplified DNA sequence are detected by a PCR instrument. The instrument software allows construction of an amplification plot of fluorescence intensity versus cycle number. During the early cycles, the amount of PCR product is low, and fluorescence is not sufficient to exceed the baseline. As the PCR product accumulates, the fluorescence signal crosses the baseline and increases exponentially (Fig. 2.5). At the end of the reaction, the fluorescence reaches a plateau because most of the reagents have been consumed.
Real-time detection of amplification removes the need for subsequent gel electrophoresis. Another advantage is that post-PCR melting curve analysis can be used to detect sequence variations at a specific locus. For example, in the LightCycler probe format, binding of hybridization probes to the PCR product in a head-to-tail fashion initiates FRET from one probe to the other, and detected fluorescence is proportional to the amount of amplified product. During post-PCR melting curve analysis, the PCR product is gradually heated, and fluorescence is measured at each temperature point. During this process, even a single mismatch between the labeled probe and the amplified sequence significantly reduces the melting temperature (Tm), defined as the temperature at which 50% of the double-stranded DNA becomes single stranded. Therefore the presence of a point mutation or SNP in the region covered by a fluorescent probe is detected as an additional Tm peak on melting curve analysis (see Fig. 2.5).
Quantitative PCR (qPCR) is a variation of real-time PCR that can be used for evaluation of gene expression levels or gene copy numbers. The quantitative assessment of the initial template used for PCR amplification can be done by comparison of the amount of PCR product of the target sequence with the PCR products generated by amplification of the known quantities of DNA or cDNA.
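One common way to make this comparison is a standard curve, sketched below under stated assumptions. The threshold cycle (Ct), meaning the cycle at which fluorescence crosses a fixed threshold, is standard qPCR terminology that the text does not use, and the standard quantities and Ct values are invented for illustration.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_quantity(ct, slope, intercept):
    """Invert the standard curve to estimate the starting template quantity."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical dilution series of a known template (copies) and measured Ct.
standards = [1e5, 1e4, 1e3, 1e2]
cts = [15.0, 18.32, 21.64, 24.96]

slope, intercept = fit_standard_curve(standards, cts)
unknown = estimate_quantity(19.98, slope, intercept)
print(round(slope, 2), round(unknown))
```

A slope near -3.32 cycles per tenfold dilution corresponds to near-perfect doubling in each cycle. An unknown sample whose Ct falls between two standards is assigned a starting quantity between them.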
Real-time PCR is frequently used in molecular laboratories because it is rapid and less laborious than other techniques, and it does not require processing of samples after PCR amplification; this minimizes the time of the procedure and the risk of contamination by previous PCR products.
Polymerase Chain Reaction–Restriction Fragment Length Polymorphism Analysis
Restriction enzymes, or restriction endonucleases, are enzymes that cut DNA at specific nucleotide sequences known as restriction sites. The restriction sites are usually 4 to 8 nt long and are frequently palindromic (the sequence reads the same in the 5' to 3' direction on both strands). The restriction fragment length polymorphism (RFLP) analysis exploits the ability of restriction enzymes to cut DNA at these specific sites. If a given DNA sequence variation, such as a point mutation, alters the restriction site for a specific enzyme, either creating or destroying it, this changes the sizes of the fragments produced by digestion of the PCR product, which can be detected by gel electrophoresis.
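The principle can be illustrated with a simulated digestion. The sketch below uses the EcoRI recognition site GAATTC, which is cut between G and A, as an example enzyme; the amplicon sequences are invented, and only the top strand is modeled.

```python
def digest(amplicon: str, site: str = "GAATTC", cut_offset: int = 1) -> list:
    """Return fragment lengths after cutting at every occurrence of the site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts between G and A).
    """
    fragments = []
    start = 0
    pos = amplicon.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = amplicon.find(site, pos + 1)
    fragments.append(len(amplicon) - start)
    return fragments


# Hypothetical 26-bp amplicons: the mutant carries an A>G point mutation
# inside the recognition site, which destroys it.
wild_type = "ACGTACGTAC" + "GAATTC" + "TTGGCCAAGG"
mutant = "ACGTACGTAC" + "GAGTTC" + "TTGGCCAAGG"

print(digest(wild_type))  # [11, 15]
print(digest(mutant))     # [26]
```

On a gel, the wild-type product would appear as two shorter bands and the mutant as a single uncut band, which is the size difference the assay relies on.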
This method is frequently used for detection of known point mutations or SNPs. In addition, it can be used for separation between two amplified sequences that have high similarity in their nucleotide composition. Fig. 2.6 illustrates the use of PCR-RFLP to differentiate between SS18/SSX1 and SS18/SSX2 rearrangements, which are common in synovial sarcomas.