Advances in DNA sequencing technology have led to improved accuracy as well as decreased costs and have resulted in the recent introduction of individual whole genome, whole exome, and large gene panel sequencing from certified reference laboratories for clinical molecular diagnosis.
Bioinformatic filtering of the over 3 billion nucleotide base pairs found in any genome is required to identify causal mutations. Currently, most analysis focuses on the estimated 4000 genes (out of a total of ~20,000 genes) in the human genome implicated in human disease.
Prior to testing, patients should be informed of (a) potential disclosure of nonpaternity, consanguinity, and/or unrelatedness to one or more family members, (b) the possibility of finding either a diagnosis or predisposition to a disease unrelated to the reason that sequencing was ordered, (c) limitations of current sequencing approaches (even when genetic etiology is strongly suspected), and (d) potential risk of genetic discrimination affecting insurability and/or employment. They should also be informed of the US federal law, the Genetic Information Nondiscrimination Act (GINA), which seeks to protect against genetic discrimination.
Diagnostic use of this technology is made more complex by the uncovering of numerous incidental findings (some with unknown significance and others with an established pathogenic effect) present in any individual genome. Recent ACMG recommendations on release of incidental findings advise the return of “known pathogenic” and “expected pathogenic” variants identified across 56 well-curated genes.
Appropriate genetic counseling to address incidental findings and informed consent issues is necessary both before and after medical genomic testing in a patient.
The rapidly evolving knowledge base allows providers to discuss with patients the expectation that future reanalysis of the same genomic dataset will yield additional valuable insights into the health implications of the data for the patient. Many predict that annual reanalysis will become the norm.
Allowing patients to “opt out” of disclosure of abnormal results (even when findings are incidental or secondary) is generally not the norm in medicine. While controversial, there may be special instances in medical genomics where providing an “opt out” option for abnormal results is appropriate, such as when incidental findings for an adult-onset disease are discovered in a pediatric patient. Even in these cases, however, disclosure to the parents/guardians is generally favored over suppression of these results.
Traditional DNA sequencing approaches to genetic diagnosis (ie, Sanger sequencing) continue to have a clinical role in the diagnosis of many well-studied conditions. Currently, pathogenic mutations identified by next-generation sequencing (NGS) are usually confirmed with an alternative method of clinical sequencing before being medically acted upon, and Sanger sequencing is often considered the “gold standard” for confirmation.
Limitations of NGS:
Current technology does not perfectly represent the complete genome or exome due to technical limitations. At this time, clinical exome sequencing on average covers 92-99% or more of targeted coding areas of genes and provides a diagnostic yield in around one-quarter of cases when applied to highly selected scenarios of “diagnostic unknowns.” Some causative mutations, such as trinucleotide repeat expansions in disorders like Huntington Disease, may be better identified by other methodologies like Southern blots.
Genome Types and Sources
Genomes can be nuclear or mitochondrial. Nuclear genomes are sequenced from DNA in a cell’s nucleus. Mitochondrial genomes are sequenced from circular DNA found in mitochondria (organelles in the cytoplasm) and are maternally inherited. At approximately 16.6 kb (encoding 37 genes), mitochondrial genomes are much smaller than the approximately 3 Gb haploid nuclear genome. Mitochondrial genomes can also have large sequence variation (a) between individuals, (b) within the same individual in different tissues, and/or (c) within the same tissue (a phenomenon termed heteroplasmy). In this chapter we refer to nuclear genomes unless otherwise specified.
Comparison of germline mutations (those that are transmitted to offspring) against somatic mutations (those accumulated over one’s lifetime) has been shown to have clinical impact specifically in oncology. Studying these changes to the normal germline genome is expected to refine molecular characterization of cancers and lead to more targeted and individualized treatments. An important challenge is distinguishing “driver” mutations (those mutations causing tumors and enabling disease progression) from “passenger” mutations (mutations arising from tumor progression with neutral effect that are not present in the germline genome). Identifying somatic mutations that are pharmacogenetically informative may also lead to safer and better targeted cancer treatment regimens.
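The tumor-versus-germline comparison described above can be sketched as a simple set difference. This is a deliberately simplified illustration rather than a clinical somatic caller: the `Variant` schema, the function name `split_tumor_variants`, and the example coordinates are all hypothetical, and real pipelines also weigh read depth, allele fraction, and tumor purity.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Minimal variant key (hypothetical schema for illustration)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def split_tumor_variants(tumor, germline):
    """Partition tumor calls into inherited (germline) and tumor-only (candidate somatic)."""
    germline_set = set(germline)
    inherited = [v for v in tumor if v in germline_set]
    candidate_somatic = [v for v in tumor if v not in germline_set]
    return inherited, candidate_somatic


# Made-up calls from a matched normal sample (eg, blood) and a tumor sample.
normal_calls = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 2000, "C", "T")]
tumor_calls = [
    Variant("chr1", 1000, "A", "G"),  # also in the normal sample: germline
    Variant("chr7", 5000, "G", "A"),  # tumor only: candidate somatic
]

inherited, candidate_somatic = split_tumor_variants(tumor_calls, normal_calls)
print(len(inherited), len(candidate_somatic))
```

Tumor-only variants are only candidates. Deciding whether each one is a driver or a passenger requires further functional and clinical evidence.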
DNA can be extracted from any tissue in the body. Traditionally, leukocytes in whole blood have been the sterile source of DNA for most clinical testing. When DNA is derived from nonsterile sources, clinicians should be aware that DNA sequence from normal flora may be mistaken for host sequence because of either sequencing or bioinformatic error. Obtaining DNA from a nonsterile site, such as with a buccal swab or saliva sample, however, has the advantage of being less invasive than a blood draw and has been used for large-scale genomic research studies.
Basic Individual Genome Statistics
Average basic genomic statistics have been shown to be generally uniform across individuals and diverse ethnic populations. The difference between any two unrelated individuals on the single nucleotide level is less than 0.1%. Diversity between individuals is greatly increased, however, when factoring in structural chromosomal rearrangements, insertions and deletions, and RNA transcript variation. Because of this similarity at the single nucleotide level, analysis to detect pathogenic variants can be prioritized on the proportionately smaller number of variants that are novel (ie, not seen in standard reference genomic databases) and/or have low minor allele frequencies.
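The prioritization of novel and low-frequency variants can be illustrated with a minimal filter. The record layout, the function name `is_rare_or_novel`, and the 1% minor allele frequency (MAF) cutoff are assumptions chosen for illustration. The text does not specify a threshold, and laboratories choose cutoffs according to the disease model.

```python
def is_rare_or_novel(variant, maf_threshold=0.01):
    """Keep variants absent from the reference database or below the MAF cutoff.

    `maf` is None when the variant is not in the database (novel).
    The 1% default cutoff is an illustrative assumption, not a value from the text.
    """
    maf = variant["maf"]
    return maf is None or maf < maf_threshold


# Made-up variant records with hypothetical identifiers.
variants = [
    {"id": "var1", "maf": 0.32},   # common: deprioritized
    {"id": "var2", "maf": 0.004},  # rare: kept
    {"id": "var3", "maf": None},   # novel: kept
]

prioritized = [v["id"] for v in variants if is_rare_or_novel(v)]
print(prioritized)
```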
A human haploid genome has roughly 3.2 billion bases. The number of single-nucleotide variants relative to the reference genome is 3.5 to 4 million per genome. The number of single-nucleotide variants causing nonsynonymous changes (ie, those causing a change in amino acid and potentially deleterious) averages more than 9500 per genome.
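To put these figures in proportion, the short calculation below uses only the numbers quoted above. The helper name `snv_summary` is hypothetical. Under these figures, an individual genome differs from the reference at roughly one base in every 800 to 900, and nonsynonymous changes make up about a quarter of one percent of all single-nucleotide variants.

```python
def snv_summary(genome_bases, snvs, nonsynonymous):
    """Return (bases per single-nucleotide variant, nonsynonymous share of variants)."""
    return genome_bases / snvs, nonsynonymous / snvs


GENOME_BASES = 3.2e9        # haploid genome size quoted in the text
SNV_RANGE = (3.5e6, 4.0e6)  # single-nucleotide variants per genome
NONSYNONYMOUS = 9500        # text gives this as a lower bound on the average

for snvs in SNV_RANGE:
    bases_per_snv, nonsyn_share = snv_summary(GENOME_BASES, snvs, NONSYNONYMOUS)
    print(f"{snvs:,.0f} SNVs: one per ~{bases_per_snv:,.0f} bases; "
          f"nonsynonymous share at least {nonsyn_share:.2%}")
```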
Clinical interpretation of genomes always requires some degree of bioinformatic filtering to identify those changes most likely to cause disease or influence disease severity. Often described as trying to “find a needle in a haystack,” distinguishing causal variants from a potentially large number of false-positive variants is a major challenge in clinical genome analyses.
Genomic Disease Architecture
Focusing genomic sequencing and/or data filtering on coding regions, the approximately 1.5% to 2% of the genome that codes for protein, is a reasonable first pass for most clinical situations, as over 85% of known disease-causing mutations are found here. Sequencing exomes (ie, the 180,000 exons in the human genome) instead of whole genomes reduces cost and storage requirements. At the time of this writing, exome sequencing is more common in clinical use than whole genome sequencing (WGS). The trade-off for this approach, however, is potentially missing pathogenic variants in noncoding regions and losing potentially important gene-regulatory information.
As pricing becomes more competitive, it is likely that clinical WGS will become preferable to exome sequencing. In selected cases, WGS may be a good follow-up test when exome sequencing is not diagnostic. Further, the disease contribution of noncoding regions is likely to be generally underestimated because of the difficulty traditional Sanger sequencing has historically had in comprehensively capturing these large noncoding sequence segments. Prioritizing which regions of a genome are clinically analyzed (ie, on the basis of coding region, known causal genes, and/or mutation type) should be guided by disease-specific information when available.
Determining which genetic changes in noncoding regions are pathogenic is also more challenging as the presence and/or type of protein change produced may be less obvious. In addition, presumably lower selection pressure in these noncoding areas can potentially make analysis of evolutionary conservation less informative.
Currently, clinical molecular genetics laboratories offering exome sequencing are generally reporting 92% to 97% coverage of targeted exome regions at a sequencing depth sufficient to allow confident zygosity calls. The sequence accuracy within these regions is typically reported at 99.9%. False-positive rates are relatively high at 5% to 10%. Diagnostic yield from exome sequencing across cases appears to be around 25% at the present time (and as high as 50% in carefully selected cases).
A gene is a contiguous portion of DNA that codes for protein and is considered a functional part of the genome. While the existence of pathogenic sequence variations outside of genes is well documented, the majority of known disease mutations are either within genes or within close enough proximity (eg, within 50 kb) of a nearby gene to affect function. Recently published data from ENCODE highlight the progress being made toward better categorization of noncoding regions affecting gene function and expression. The data suggest that functional elements in noncoding sequence throughout the genome, including those residing >50 kb from the gene they regulate, may have been underestimated. With respect to clinical analyses of genomic data, however, the majority of known genetic disease is still overwhelmingly in coding regions at the present time, and these regions are therefore still typically prioritized in most clinically focused analyses. More clinical outcomes data are needed to study the clinical impact of these regulatory regions.
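A coverage figure of this kind is the fraction of targeted bases sequenced to at least a minimum read depth. The sketch below shows the calculation on made-up data. The function name `fraction_covered` and the 20-read default cutoff are assumptions for illustration only, since the text does not state what depth laboratories consider sufficient.

```python
def fraction_covered(depths, min_depth=20):
    """Fraction of targeted bases sequenced to at least `min_depth` reads.

    The 20-read default is an illustrative assumption; laboratories set their own cutoff.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Made-up per-base read depths across a tiny targeted region.
example_depths = [35, 42, 18, 50, 27, 0, 31, 22, 45, 29]
print(f"{fraction_covered(example_depths):.0%} of targeted bases meet the depth cutoff")
```

Bases that fall below the cutoff are where a causal variant could be missed or a zygosity call could be unreliable, which is one reason a nondiagnostic exome does not exclude a genetic etiology.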
Of the roughly 20,000 genes in the human genome, only around 3000 to 4000 genes are robustly associated with highly penetrant inherited diseases at present. Prioritizing analyses in genes specifically known to cause the patient’s phenotype before looking genome-wide to discover a gene without previous disease association is reasonable, especially when the number of cases analyzed is low and/or familial genomic references are not available.
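This tiered approach can be sketched as a simple ranking: genes known to cause the patient’s phenotype first, other known disease genes second, and the rest of the genome last. The three-tier scheme, the function name `assign_tier`, and the placeholder gene symbols are hypothetical. A real analysis would draw on curated, disease-specific gene lists.

```python
def assign_tier(gene, phenotype_genes, disease_genes):
    """Return 1 for phenotype-matched genes, 2 for other disease genes, 3 otherwise."""
    if gene in phenotype_genes:
        return 1
    if gene in disease_genes:
        return 2
    return 3


# Placeholder gene symbols; a real analysis would use curated gene lists.
phenotype_genes = {"GENE_A", "GENE_B"}
disease_genes = {"GENE_A", "GENE_B", "GENE_C"}

# Genes in which the patient carries candidate variants (made-up example).
variant_genes = ["GENE_X", "GENE_C", "GENE_A"]
ranked = sorted(variant_genes, key=lambda g: assign_tier(g, phenotype_genes, disease_genes))
print(ranked)
```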