Genomes and Nucleic Acid Alterations

Chapter 38





Molecular diagnostics focuses on medically important sequence variations within a background of complex genomic structure. This chapter reviews the organization of human, bacterial, viral, and fungal genomes and the spectrum of variations in nucleic acids that are of medical concern.



Human Genome


Each diploid human cell contains two copies of a sequence of approximately 3 billion nucleotides, carried on 46 chromosomes [4,5,12]. Box 38-1 lists statistics for the human genome and the types of variations that are important in clinical diagnostics.



Three quarters of human DNA is intergenic, or between genes. More than 60% of this intergenic sequence consists of “parasitic” DNA regions of transposable elements 100 to 11,000 bases in length. Between 2 million and 3 million of these elements are present in each copy of the genome. They contribute to genetic recombination and chromosome structure and provide an evolutionary record of sequence variation and selection.


Segmental duplications constitute 5.3% of the human genome. They are longer than 1 kilobase (kb, or a thousand bases), share a sequence identity of at least 90%, and are not transposable. They are prone to deletion and/or rearrangement, often with medical consequences.


Intergenic DNA carries most of the simple sequence repeats (SSRs) present in the genome. These repeats are known as microsatellites or short tandem repeats (STRs) when the repeat unit is 1 to 13 bases, and minisatellites or variable number of tandem repeats (VNTRs) when the repeat unit is 14 to 500 bases. SSRs are critical markers in genetic linkage studies and in forensic or medical identity testing. They are formed by slippage during replication and are highly polymorphic between individuals. The most common SSRs are dinucleotide repeats, such as ACACAC and ATATAT. On average, approximately one SSR occurs every 2000 bases.
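The size classes above can be illustrated with a minimal Python sketch. The function names and the test sequence are hypothetical and chosen only for illustration; the unit-length ranges (1 to 13 bases and 14 to 500 bases) are the ones given in the text. The scan handles only dinucleotide repeats such as ACACAC, and requiring at least three copies is an arbitrary assumption.

```python
import re


def classify_repeat_unit(unit_length):
    """Name the SSR class for a repeat unit, using the ranges quoted in the text."""
    if 1 <= unit_length <= 13:
        return "microsatellite (STR)"
    if 14 <= unit_length <= 500:
        return "minisatellite (VNTR)"
    return "outside SSR ranges"


def find_dinucleotide_repeats(sequence, min_copies=3):
    """Return (start, unit, copies) for each tandem run of a 2-base unit.

    min_copies is an illustrative threshold, not a value from the text.
    """
    # ([ACGT]{2}) captures the unit; \1{n,} demands n or more further copies.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if unit[0] == unit[1]:
            continue  # AA, CC, GG, TT runs are mononucleotide repeats
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Made-up sequence containing one AC run and one AT run.
example = "GGT" + "AC" * 4 + "TTG" + "AT" * 4 + "CC"
for start, unit, copies in find_dinucleotide_repeats(example):
    print(start, unit, copies, classify_repeat_unit(len(unit)))
```

Because the number of copies at a given locus differs among individuals, the `copies` value is the quantity that identity testing would compare between samples.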


Approximately 2% of DNA is required to maintain the structure of chromosomes and is located at chromosome centers (centromeres) and ends (telomeres). Centromeric DNA consists of many tandem copies of nearly identical 171-base-pair (bp) repeats spanning 0.24 to 5.0 Mb per chromosome. Each chromosome end is capped with several kb of the 6-base telomeric repeat TTAGGG.


Although intergenic DNA does not code for protein and was originally considered “junk,” much of this DNA is transcribed to RNA, producing a complex “transcriptome” network of RNA control elements whose function and mechanics are active areas of investigation [1].


One quarter of the human genome consists of genes. A total of 20,000 to 25,000 genes are found in the human genome. The average gene covers 27 kb, but only about 1300 of these bases code for amino acid sequences. The primary RNA transcript is processed by splicing to retain exons that are interspersed throughout the gene and have a higher GC content than noncoding regions. On average, 95% of a gene is spliced out as introns, retaining a mean of 10.4 exons, of which 9.1 are translated into proteins. Exons make up only 1.9% of the total genome, with 1.1% of the genome coding for proteins. Some important genes are present in many copies, so that overall protein expression is not affected if a chance variation occurs in one copy. If extra copies of genes lose their function, they are known as pseudogenes. At least as many pseudogenes as functional genes are present in the human genome. It is important to distinguish pseudogenes from functional genes because sequence variations in pseudogenes are seldom of clinical importance.
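The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below is only a rough consistency check: it uses the gene count, mean gene length, and mean coding length quoted above, assumes a haploid genome of 3 billion bases, and uses hypothetical variable and function names.

```python
# Values quoted in the text; the names are illustrative.
GENOME_BASES = 3_000_000_000          # haploid genome size
GENE_COUNT_RANGE = (20_000, 25_000)   # estimated number of genes
MEAN_GENE_BASES = 27_000              # average gene covers 27 kb
MEAN_CODING_BASES = 1_300             # bases per gene that code for amino acids


def fraction_of_genome(gene_count, bases_per_gene, genome_bases=GENOME_BASES):
    """Fraction of the genome covered by gene_count segments of the given size."""
    return gene_count * bases_per_gene / genome_bases


# Share of an average gene that codes for protein (the rest is noncoding).
coding_share_of_gene = MEAN_CODING_BASES / MEAN_GENE_BASES
print(f"coding share of an average gene: {coding_share_of_gene:.1%}")

for gene_count in GENE_COUNT_RANGE:
    genic = fraction_of_genome(gene_count, MEAN_GENE_BASES)
    coding = fraction_of_genome(gene_count, MEAN_CODING_BASES)
    print(f"{gene_count} genes: genic {genic:.1%}, coding {coding:.2%}")
```

With these inputs the genic fraction comes out at roughly 18% to 22.5% and the coding fraction at roughly 0.9% to 1.1%, in line with the rounded statements that about one quarter of the genome consists of genes and that 1.1% codes for protein. About 5% of an average gene is coding, consistent with 95% being spliced out as introns.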


Even though 99% of the genome does not code for protein, most of it is transcribed into noncoding RNA. At least 93% of the genome is transcribed [1], producing more than 10 times the amount of RNA that is produced from the coding segments of genes [2]. Both strands of DNA may be transcribed, and long noncoding transcripts may overlap coding regions, producing a complex transcriptome of functional RNA molecules that may variably regulate transcription of coding regions, RNA processing, mRNA stability, translation, protein stability, and secretion. In addition to long noncoding RNA, ribosomal RNA (rRNA), and transfer RNA, specific classes of noncoding RNAs include small nuclear RNAs critical for splicing, small nucleolar RNAs that modify rRNA, telomerase RNAs for maintenance of telomeres, small interfering RNAs, and microRNAs that regulate gene expression [8,9].


MicroRNAs (miRNAs) are noncoding but functional single-stranded RNAs that are about 22 bases long and are expressed in a tissue-specific manner. They are initially transcribed as longer precursors that undergo two rounds of truncation as they are transported from the nucleus to the cytoplasm. The mature miRNA is then integrated into a protein complex called the RNA-induced silencing complex, which regulates translation of mRNA. MicroRNAs hybridize to a 6 to 8 base sequence in the 3′ untranslated region of a target mRNA and inhibit its expression: the mRNA is degraded if the remaining bases are perfectly complementary, and translation is blocked if they are imperfectly complementary. More than 700 different miRNAs have been reported [3], and sequences encoding miRNAs have been found on every chromosome except the Y chromosome.
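The hybridization step can be sketched as a search for short complementary sites in a 3′ untranslated region. This is a simplified illustration, not a target-prediction method. It assumes that the pairing segment is a 7-base stretch starting at miRNA position 2 (a commonly used convention that the text does not specify), that pairing is strict Watson-Crick, and that the sequences and function names are purely illustrative.

```python
# RNA base pairing used for the complement: A-U and C-G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def seed_sites(mirna, utr, seed_start=2, seed_length=7):
    """Return 0-based UTR positions complementary to the miRNA pairing segment.

    seed_start (1-based) and seed_length are assumptions for illustration;
    the text states only that the paired sequence is 6 to 8 bases long.
    """
    begin = seed_start - 1
    seed = mirna.upper()[begin:begin + seed_length]
    site = reverse_complement(seed)  # sequence the mRNA must carry to pair
    utr = utr.upper()
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)  # step by 1 to allow overlapping sites
    return positions


# Illustrative 22-base miRNA and a made-up UTR carrying two matching sites.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAG" + "CUACCUC" + "AGGG" + "CUACCUC" + "UU"
print(seed_sites(mirna, utr))
```

Whether a matched site leads to mRNA degradation or to blocked translation depends on how well the remaining miRNA bases pair with the target, which this sketch does not evaluate.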



Variation Within the Human Genome


Consider the genome as a book. Nucleotides are the individual letters, and three letters make up each word, an amino acid codon. The words are organized into sentences or exons that are separated by periods or introns. Each sentence is further organized into paragraphs or genes. Many paragraphs constitute a chapter or chromosome, and several chapters make up a book or genome. If the DNA of any two individuals is compared, on average one spelling difference is noted every 1250 bases (i.e., approximately 99.9% of the sequence is identical between randomly chosen copies of the genome). However, different individuals (copies of the same book) also vary in a subtler way. Some of the pages are copied more than once and may be scattered throughout the book. Such copy number variants involve a greater amount of text than the spelling differences, with 0.5% of the genome differing on average between two individuals when 50 kb pages are considered [7]. That is, between individuals, at least five times as many bases are affected by copy number changes as by small sequence differences.
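The comparison between spelling differences and copy number differences follows directly from the two figures quoted above. A short worked calculation, with hypothetical variable names and an assumed haploid genome of 3 billion bases:

```python
# Figures quoted in the text above.
BASES_PER_DIFFERENCE = 1250            # one small sequence difference per 1250 bases
CNV_FRACTION = 0.005                   # 0.5% of the genome differs by copy number
HAPLOID_GENOME_BASES = 3_000_000_000   # assumed genome size

small_variant_fraction = 1 / BASES_PER_DIFFERENCE
sequence_identity = 1 - small_variant_fraction
cnv_to_small_variant_ratio = CNV_FRACTION / small_variant_fraction
differing_bases = HAPLOID_GENOME_BASES // BASES_PER_DIFFERENCE

print(f"sequence identity: {sequence_identity:.2%}")
print(f"small differences per genome copy: {differing_bases:,}")
print(f"copy number bases / small-variant bases: {cnv_to_small_variant_ratio:.2f}")
```

One difference per 1250 bases corresponds to 0.08% of positions, or about 2.4 million bases per genome copy, which rounds to the 99.9% identity stated above. Dividing 0.5% by 0.08% gives a ratio of 6.25, which is the basis for "at least five times as many bases."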


Any sequence change (compared with a reference sequence) is called a sequence variant or alteration. If a sequence variant is present in at least 1% of a population, it is a polymorphism. Many sequence variants and polymorphisms in the genome do not affect human health and are benign or silent. For example, most copy number variations do not cause disease. Furthermore, single-base changes [also known as single-nucleotide polymorphisms (SNPs)] and SSRs found between genes are seldom associated with disease. Similarly, most of the SNPs within introns, except for splicing and regulatory variants, are not known to affect gene function. In addition, some of the SNPs within exons are silent alterations that do not change the amino acid sequence because of the redundancy in the genetic code. Still other SNPs in exons code for amino acid changes that do not affect protein function. Even silent SNPs may be of considerable interest as genetic markers.
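Two of the definitions above lend themselves to a small sketch: the 1% frequency threshold for a polymorphism, and the distinction between a silent exonic SNP and one that changes the encoded amino acid. The code below builds the standard genetic code table and compares the translation of a reference codon with that of a variant codon. The function names and example codons are hypothetical, and stop codons (shown as `*`) are treated like any other translation result.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering of codon bases.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_polymorphism(population_frequency, threshold=0.01):
    """A variant present in at least 1% of a population is a polymorphism."""
    return population_frequency >= threshold


def coding_snp_effect(reference_codon, variant_codon):
    """Return 'silent' if both codons translate identically, else 'amino acid change'."""
    reference_aa = CODON_TABLE[reference_codon.upper()]
    variant_aa = CODON_TABLE[variant_codon.upper()]
    return "silent" if reference_aa == variant_aa else "amino acid change"


print(coding_snp_effect("GGA", "GGC"))  # both codons encode glycine
print(coding_snp_effect("GAG", "GTG"))  # glutamate codon changed to a valine codon
print(is_polymorphism(0.005))           # below the 1% threshold
```

Note that the sketch says nothing about clinical significance: as the text points out, an amino acid change may leave protein function intact, and a silent SNP may still be useful as a genetic marker.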


The most commonly observed sequence variations are SNPs. Millions of SNPs have been described, and many new SNPs continue to be reported. Some SNPs are common in the population, with allele frequencies of 0.1 to 0.5 (i.e., present in 10 to 50 of every 100 haploid copies studied), although other single-base changes are very rare. Although single-base variants have been identified every 100 to 300 bases, many of these are not found frequently in the population. A vast majority of SNPs (97%) occur in noncoding regions; only 3% of SNPs are within coding sequences.


Although SNPs are the most common sequence variant, copy number variants (CNVs) cover more of the genome than SNPs do. CNVs occur in stretches of DNA that range from 100 bases up to several Mb (megabases, or million bases) in size. CNVs may be duplicated in tandem or may involve complex gains or losses of homologous sequences at multiple sites in the genome. CNV regions exist in every chromosome and involve 5% to 12% of the human genome [7,10]. Most CNVs are inherited and biallelic, similar to SNPs [7]. More than 6000 CNV loci have been reported, and many of them overlap with genes. Individuals differ on average at more than 200 CNV loci, and these overlap the transcribed regions of more than 100 genes [7].


Nov 27, 2016 | Posted in GENERAL & FAMILY MEDICINE
