Chapter 7 The role of information, bioinformatics and genomics

The pharmaceutical industry as an information industry

As outlined in earlier chapters, making drugs that can affect the symptoms or causes of diseases in safe and beneficial ways has been a substantial challenge for the pharmaceutical industry (Munos, 2009). However, there are countless examples where safe and effective biologically active molecules have been generated. The bad news is that the vast majority of these are works of nature, and to produce the examples we see today across all biological organisms, nature has run a ‘trial-and-error’ genetic algorithm for some 4 billion years. By contrast, during the last 100 years or so, the pharmaceutical industry has been heroically developing handfuls of successful new drugs in 10–15 year timeframes. Even then, the overwhelming majority of drug discovery and development projects fail and, recently, the productivity of the pharmaceutical industry has been conjectured to be too low to sustain its current business model (Cockburn, 2007; Garnier, 2008).

A great deal of analysis and thought is now focused on ways to improve the productivity of the drug discovery and development process (see, for example, Paul et al., 2010). While streamlining the process and rebalancing the effort and expenditures across the various phases of drug discovery and development will likely increase productivity to some extent, the fact remains that what we need to know to optimize productivity in drug discovery and development far exceeds what we do know right now. To improve productivity substantially, the pharmaceutical industry must increasingly become what in a looser sense it always was, an information industry (Robson and Baek, 2009).

Initially, the present authors leaned towards extending this idea by highlighting the pharmaceutical process as an information flow, a flow seen as a probabilistic and information theoretic network, from the computations of probabilities from genetics and other sources, to the probability that a drug will play a useful role in the marketplace. As may easily be imagined, such a description is rich, complex, and evolving (and rather formal), and the chapter was rapidly reaching the size of a book while barely scratching the surface of such a description. It must suffice to concentrate on the nodes (old and new) of that network, and largely on those nodes that represent the sources and pools of data and information that are brought together in multiple ways.

This chapter begins with general concepts about information then focuses on information about biological molecules and on its relationship to drug discovery and development. Since most drugs act by binding to and modulating the function of proteins, it is reasonable for pharmaceutical industry scientists to want to have as much information as possible about the nature and behaviour of proteins in health and disease. Proteins are vast in numbers (compared to genes), have an enormous dynamic range of abundances in tissues and body fluids, and have complex variations that underlie specific functions. At present, the available information about proteins is limited by lack of technologies capable of cost-efficiently identifying, characterizing and quantifying the protein content of body fluids and tissues. So, this chapter deals primarily with information about genes and its role in drug discovery and development.

Innovation depends on information from multiple sources

Information from three diverse domains sparks innovation in drug discovery and development (http://dschool.stanford.edu/big_picture/multidisciplinary_approach.php):

• Feasibility (Is the product technically/scientifically possible?)

• Viability (Can such a product be brought cost-effectively to the marketplace?)

• Desirability (Is there a real need for such a product?).

The viability and desirability domains include information concerning public health and medical need, physician judgment, patient viewpoints, market research, healthcare strategies, healthcare economics and intellectual property. The feasibility domain includes information about biology, chemistry, physics, statistics and mathematics. Currently, the totality of available information can be obtained from diverse, unconnected (‘siloed’) public or private sources, such as the brains of human beings, books, journals, patents, general literature, databases available via the Internet (see, for example, Fig. 7.1) and results (proprietary or otherwise) of studies undertaken as part of drug discovery and development. However, due to its siloed nature, discrete sets of information from only one domain are often applied, suboptimally, in isolation to particular aspects of drug discovery and development. One promising approach for optimizing the utility of available, but diverse, information across all aspects of drug discovery and development is to connect apparently disparate information sets using a common format and a collection of rules (a language) that relate elements of the content to each other. Such connections make it possible for users to access the information in any domain or set and then traverse information in multiple sets or domains in a fashion that sparks innovation. This promising approach is embodied in the Semantic Web, a vision for the connection of not just web pages but of data and information (see http://en.wikipedia.org/wiki/Semantic_Web and http://www.w3.org/2001/sw/), which is already beginning to impact the life sciences, generally, and drug discovery and development, in particular (Neumann and Quan, 2006; Stephens et al., 2006; Ruttenberg et al., 2009).

Fig. 7.1 Flow of public sequence data between major sequence repositories. Shown in blue are the components of the International Nucleotide Sequence Database Collaboration (INSDC) comprising GenBank (USA), the European Molecular Biology Laboratory (EMBL) Database (Europe) and the DNA Data Bank of Japan (DDBJ).

Bioinformatics

The term bioinformatics for the creation, analysis and management of information about living organisms, and particularly nucleotide and protein sequences, was probably first coined in 1984, in the announcement of a funding programme by the European Economic Community (EC COM84 Final). The programme was in response to a memo from the White House that Europe was falling behind America and Japan in biotechnology.

The overall task of bioinformatics as it was defined in that announcement is to generate information from biological data (bioinformation) and to make that information accessible to humans and machines that need it to advance towards an objective. Handling all the data and information about life forms, even the currently available information on nucleotide and protein sequences, is not trivial. Overviews of established basic procedures and databases in bioinformatics, and the ability to try one’s hand at them, are provided by a number of high-quality sources, many of which have links to other sources, for example:

• The European Bioinformatics Institute (EMBL-EBI), which is part of the European Molecular Biology Laboratory (EMBL) (http://www.ebi.ac.uk/2can/home.html)

• The National Center for Biotechnology Information of the National Institutes of Health (http://www.ncbi.nlm.nih.gov/Tools/)

• The Biology Workbench at the San Diego Supercomputer Center, University of California, San Diego (http://workbench.sdsc.edu/).

Rather than give a detailed account of specific tools for bioinformatics, our focus will be on the general ways by which bioinformation is generated by bioinformatics, and the principles used to analyse and interpret information.

Bioinformatics is the management and data analytics of bioinformation. In genetics and molecular biology applications, that includes everything from classical statistics and specialized applications of statistics or probability theory, such as the Hardy–Weinberg equilibrium law and linkage analysis, to modelling interactions between the products of gene expression and subsequent biological interpretation (for interpretation tool examples see www.ingenuity.com and www.genego.com). According to taste, one may either say that a new part of data analytics is bioinformatics, or that bioinformatics embraces most if not all of data analytics as applied to genes and proteins and their consequences. Here, we treat much of this as bioinformatics, but also refer to the more general field of data analytics for techniques on which bioinformatics draws and from which it will continue to borrow.

The term bioinformatics does have the merit of addressing not only the analysis but also the management of data using information technology. Interestingly, it is not always easy to distinguish data management from the processing and analysis of data. This is not least because computers often handle both. Sometimes, efficiency and insight can be gained by not segregating the two aspects of bioinformatics.

Bioinformatics as data mining and inference

Data mining also includes analysis of market, business, communications, medical, meteorological, ecological, astronomical, military and security data, but its tools have been implicit and ubiquitous in bioinformatics from the outset, even if the term ‘data mining’ has only fairly recently been used in that context. All the principles described below are relevant to a major portion of traditional bioinformatics. The same data mining programme used by one of the authors has been applied to both joint market trends in South America and the relationship of protein sequences to their secondary structure and immunological properties. In the broader language of data analytics, bioinformatics can be seen as having two major modes of application in the way it obtains information from data – query or data mining. In the query mode (directed analysis), for example, a nucleotide sequence might be used to find its occurrence in other gene sequences, thus ‘pulling them from the file’. That is similar to a Google search and also somewhat analogous to testing a hypothesis in classical statistics, in that one specific question is asked and tested: the hypothesis that the thing sought exists in the data.
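The query mode can be illustrated with a minimal sketch. The function name `find_occurrences` and the toy records are hypothetical, and exact substring matching stands in for the similarity searches that real sequence databases provide.

```python
def find_occurrences(query, sequences):
    """Return {record name: [0-based start positions]} for records containing query."""
    hits = {}
    for name, seq in sequences.items():
        positions = []
        start = seq.find(query)
        while start != -1:
            positions.append(start)
            # Restart one base further on so overlapping matches are also found.
            start = seq.find(query, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Hypothetical toy records, not real gene sequences.
toy_sequences = {
    "gene_a": "ATGGCGTACGTTAG",
    "gene_b": "TTTTCCCCGGGG",
    "gene_c": "ACGTACGT",
}

hits = find_occurrences("ACGT", toy_sequences)
```

Here one specific question is asked (does the query occur, and where?), and only the matching records are ‘pulled from the file’.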

In the data mining mode (undirected or unsupervised analysis), one is seeking to discover anything interesting in the data, such as a hidden pattern. Ultimately, finding likely (and effectively testing) hypotheses for combinations of N symbols or factors (states, events, measurements, etc.) is equivalent to making 2^N − 1 queries or classical statistical tests of hypotheses, one for each non-empty combination of the factors. For N = 100, this is approximately 1.3 × 10³⁰. If such an activity involves continuous variables of interest to an error of e percent (and e is usually much less than 50%) then this escalates to (100/e)^N. Clearly, therefore, data mining is a strategy, not a guaranteed solution, but, equally clearly, it delivers a lot more than one query. Insomuch as issuing one query is simply the limiting case of highly constrained data mining, both modes can be referred to as data mining.
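The scale of these numbers is easy to check. The sketch below simply evaluates the two expressions quoted above; the function names are illustrative only.

```python
def n_hypotheses(n_factors):
    """Number of non-empty combinations of n_factors symbols: 2^N - 1."""
    return 2 ** n_factors - 1


def n_hypotheses_continuous(n_factors, error_percent):
    """Count when each factor is resolved to error_percent: (100/e)^N."""
    return (100.0 / error_percent) ** n_factors


discrete_100 = n_hypotheses(100)                   # exact integer
continuous_100 = n_hypotheses_continuous(100, 10)  # e = 10% gives 10^100
print(f"{discrete_100:.3e}")
```

Even at a coarse resolution of e = 10%, the continuous case for N = 100 is 10^100 hypotheses, which is why exhaustive testing is out of the question and data mining has to be treated as a search strategy.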

Both querying and data mining seem a far cry from predicting what regions of DNA are likely to be a gene, or the role a pattern of gene variants might play in disease or drug response, or the structure of a protein, and so on. As it happens, however, prediction of what regions of DNA are genes or control points, what a gene’s function is, of protein secondary and tertiary structure, of segments in protein primary structure that will serve as a basis for a synthetic diagnostic or vaccine, are all longstanding examples of what we might today call data mining followed by its application to prediction. In those activities, the mining of data, as a training set, is used to generate probabilistic parameters often called ‘rules’. These rules are preferably validated in an independent test set. The further step required is some process for using the validated rules to draw a conclusion based on new data or data sets, i.e. formally the process of inference, as a prediction.

An Expert System also uses rules and inference, except that the rules and their probabilities are drawn from human experts at the rate of 2–5 a day (and, by definition, are essentially anecdotal and likely biased). Computer-based data mining can generate hundreds of thousands of unbiased probabilistic rules in the order of minutes to hours (which is essentially in the spirit of evidence-based medicine’s best evidence). In the early days of bioinformatics, pursuits like predicting protein sequence, signal polypeptide sequences, immunological epitopes, DNA consensus sequences with special meaning, and so forth, were often basically like Expert Systems using rules and recipes devised by experts in the field. Most of those pursuits have now given way to the use of rules derived from computer-based mining of increasingly large amounts of data, and those rules bear little resemblance to the original expert rules.

General principles for data mining

Data mining is usually undertaken on a sample dataset. Several difficulties have dogged the field. At one end of the spectrum is the counterintuitive concern of ‘too much (relevant) information’. It quickly became clear that, to make use of the maximum data available for testing a method and the quality of its rules, one should ideally use the jackknife method. For example, when predicting something about each and every accessible gene or protein in order to test a method, that gene or protein is removed from the sample dataset used to generate the rules for its prediction. So for a comprehensive test of predictive power, the rules are regenerated afresh for every gene or protein in the database, or more correctly put, for the absence of each from the database. The reason is that probabilistic rules are really terms in a probabilistic or information theoretic expansion that, when brought together correctly, can predict something with close to 100% accuracy if the gene or protein was in the set used to derive the rules. That has practical applications, but for most purposes would be ‘cheating’ and certainly misleading. Once the accuracy is established as acceptable, the rules are generated from all genes or proteins available, because they will typically be applied to new genes or proteins that emerge and which were not present in the data. In turn, once these become ‘old news’, they are added to the source data, and at intervals the rules are updated from it.
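The jackknife (leave-one-out) test can be sketched as follows. Everything here is a toy stand-in: the ‘rules’ are just per-class mean values of a single hypothetical score, and the data are invented to show how testing on items that were used to derive the rules can overstate accuracy.

```python
from collections import defaultdict


def derive_rules(training):
    """Toy 'rules': the mean score of each class in the training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, label in training:
        sums[label] += score
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(rules, score):
    """Assign the class whose mean score is nearest."""
    return min(rules, key=lambda label: abs(rules[label] - score))


def resubstitution_accuracy(data):
    """'Cheating' test: every item is also in the set used to derive the rules."""
    rules = derive_rules(data)
    return sum(predict(rules, s) == lab for s, lab in data) / len(data)


def jackknife_accuracy(data):
    """Leave-one-out test: rules are regenerated without the item being predicted."""
    correct = 0
    for i, (score, label) in enumerate(data):
        training = data[:i] + data[i + 1:]
        if predict(derive_rules(training), score) == label:
            correct += 1
    return correct / len(data)


# Invented scores for six hypothetical proteins in two classes.
toy_data = [(0.1, "a"), (0.2, "a"), (0.3, "a"),
            (0.52, "b"), (0.9, "b"), (1.0, "b")]

resub = resubstitution_accuracy(toy_data)
jack = jackknife_accuracy(toy_data)
```

On these invented data the resubstitution test scores every item correctly, whereas the jackknife misclassifies the borderline item (0.52), because its own value no longer pulls the class mean towards it.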

At the other end of the scale are the concerns of too little relevant information. For example, data may be too sparse for rules with many parameters or factors (the so-called ‘curse of high dimensionality’), and this includes the case where no observations are available at all. Insight or predictions may then be incorrect because many relevant rules may need to come together to make a final pronouncement. It is rare that a single probabilistic rule will say all that needs to be said. Perhaps the greatest current concern, related to the above, is that the information obtained will only be of general utility if the sample dataset is a sufficient representation of the entire population. This is a key consideration in any data mining activity and likely underlies many disputes in the literature and elsewhere about the validity of the results of various studies involving data mining. To generate useful information from any such study, it is essential to pay particular attention to the design of the study and to replicate and validate the results using other sample datasets from the entire population. The term data dredging is often used in reference to preliminary data mining activities on sample datasets that are too small to generate results of sufficient statistical power to likely be valid but can generate hypotheses for exploration using large sample datasets.

A further concern is that sparse events in data can be particularly important precisely because they are sparse. What matters is not the support for a rule in terms of the amount and quality of data concerning it, but (in many approaches at least) whether the event occurred with an abundance much more, or much less, than expected, say on a chance basis. Negative associations are of great importance in medicine when we want to prevent something, and we want a negative relationship between a therapy and disease. So-called unicorn events, which concern observations never seen at all, are hard to handle. A simple pedagogic example is the absence of pregnant males in a medical database. Whilst this particular example may only be of interest to a Martian, most complex rules that are not deducible from simpler ones (that are a kind of subset to them) might be of this type. Of particular concern is that drugs A, B, and C might each be 100% effective when used alone, and so might AB, BC, and AC, but ABC might be a useless (or, perhaps, lethal) combination. Yet, traditionally, unicorn events are not even seen to justify consideration, and in computing terms they are not even ‘existentially qualified’: no variables may even be created to consider them. If we do force their creation, then the number of things to allow for, simply because they might in principle occur, can be astronomical.

Data mining algorithms can yield information about:

• the presence of subgroups of samples within a sample dataset that are similar on the basis of patterns of characteristics in the variables for each sample (clustering samples into classes)

• the variables that can be used (with weightings reflecting the importance of each variable) to classify a new sample into a specified class (classification)

• mathematical or logical functions that model the data (for example, regression analysis) or

• relationships between variables that can be used to explore the behaviour of the system of variables as a whole and to reveal which variables provide either novel (unique) or redundant information (association or correlation).

Some general principles for data mining in medical applications are exemplified in the mining of 667 000 patient records in Virginia by Mullins et al. (2006). In that study, which did not include any genomic data, three main types of data mining were used (pattern discovery, predictive analysis and association analysis). A brief description of each method follows.

(Robson, 2003, 2004, 2005, 2008; Robson and Mushlin, 2004; Robson and Vaithiligam, 2010). Clearly it does take account of the occurrence of A. This opens up the full power of information theory, of instinctive interest to the pharmaceutical industry as an information industry, and of course to other mutual information measures such as:

(3)

and

(4)

The last atomic form is of interest since other measures can be calculated from it (see below). Clearly it can also be calculated from the conditional probabilities of predictive analysis. To show the relationship with other fields such as evidence-based medicine and epidemiology,

(5)

where ~A is a negative or complementary state or event such that

(6)

is familiar as log predictive odds, while

(7)

is the log odds ratio.
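For readers who want to experiment with these quantities, the following sketch computes the log predictive odds and the log odds ratio from a 2 × 2 table of counts. It uses the standard textbook definitions, log_e[P(A|B)/P(~A|B)] and the difference between that quantity for B and for ~B, rather than the chapter's own notation; the function names and the counts are invented for illustration.

```python
import math


def log_predictive_odds(n_a, n_not_a):
    """log_e[P(A|B) / P(~A|B)] from the counts of A and ~A among cases with B."""
    return math.log(n_a / n_not_a)


def log_odds_ratio(n_a_b, n_not_a_b, n_a_not_b, n_not_a_not_b):
    """Log odds of A given B minus log odds of A given ~B."""
    return (log_predictive_odds(n_a_b, n_not_a_b)
            - log_predictive_odds(n_a_not_b, n_not_a_not_b))


# Invented counts: A = responded, B = received the drug.
#               A    ~A
#   B          30    10
#   ~B         15    45
odds_given_drug = log_predictive_odds(30, 10)   # log_e(3)
ratio = log_odds_ratio(30, 10, 15, 45)          # log_e(9)
```

A positive log odds ratio indicates that A is more likely with B than without it, a negative value the reverse, and zero no association. With a zero in any cell the logarithm is undefined, which is one motivation for measures that degrade gracefully on sparse data.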

The association analysis approach handles positive, zero, and negative associations, including the treatment of sparse joint events. To that end, it may use the more general definition of information in terms of zeta functions, ζ. Unlike predictive analysis, the approach used in this way returns expected information, basically building into the final value the idea of support. In the Virginia study, using the above ‘zeta approach’, one could detect patterns of 2–7 symbols or factors, the limit being the sparseness of data for many such patterns. As data increase, I(males; tall) = ζ(1, observed[males, tall]) − ζ(1, expected[males, tall]) will rapidly approach log_e (observed[males, tall]) − log_e (expected[males, tall]), but unlike log ratios, ζ(1, observed[males, pregnant]) − ζ(1, expected[males, pregnant]) works appropriately with the data for the terms that are very small or zero. To handle unicorn events still requires variables to be created in the programme, but the overall approach is more natural.
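The behaviour described above can be explored numerically. The sketch below assumes that ζ(1, n) denotes the partial sum 1 + 1/2 + … + 1/n for an integer count n, taken as zero when n = 0; that reading, the restriction to integer expected counts, and the function names are assumptions made for illustration and are not taken from the cited papers.

```python
import math


def zeta1(n):
    """Assumed partial zeta sum at s = 1: 1 + 1/2 + ... + 1/n (0.0 when n == 0)."""
    return sum(1.0 / k for k in range(1, n + 1))


def zeta_information(observed, expected):
    """Information in natural units as zeta1(observed) - zeta1(expected)."""
    return zeta1(observed) - zeta1(expected)


def log_ratio_information(observed, expected):
    """Large-count limit: log_e(observed) - log_e(expected)."""
    return math.log(observed) - math.log(expected)


# Plentiful data: the two measures nearly coincide.
plentiful_zeta = zeta_information(2000, 1000)
plentiful_log = log_ratio_information(2000, 1000)

# A joint event never observed although 5 were expected by chance:
# the zeta form is finite, whereas math.log(0) would raise ValueError.
never_seen = zeta_information(0, 5)
```

With 2000 observed against 1000 expected, both measures are close to log_e 2, while for the event that was never observed the zeta form returns a finite negative value (about −2.28 natural units), reflecting a negative association supported by only a modest amount of data.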

The above seem to miss out various forms of data mining such as time series analysis and clustering analysis, although ultimately these can be expressed in the above terms. What seems to require an additional comment is correlation. While biostatistics courses often use association and correlation synonymously, data miners do not. Association relates to the extent to which things are observed together more, or less, often than expected on a chance basis, in a ‘presence or absence’ fashion (such as the association between a categorical SNP genotype and a categorical phenotype). This is reminiscent of the classical chi-square test, but reveals the individual contributions to non-randomness within the data grid (as well as the positive or negative nature of the association). In contrast, correlation relates to trends in values of potentially continuous variables (independence between the variances), classically exemplified by use of Pearson’s correlation. Correlation is important in gene expression analysis, in proteomics and in metabolomics, since a gene transcript (mRNA) or a protein or a small molecule metabolite, in general, has a level of abundance in any sample rather than a quantized presence/absence. Despite the apparent differences, however, the implied comparison of covariance with what is expected on an independent, i.e. chance, basis is essentially the same general idea as for association. Hence results can be expressed in mutual information format, based on a kind of fuzzy logic reasoning (Robson and Mushlin, 2004).
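The distinction can be made concrete with a small sketch: signed per-cell chi-square contributions for a categorical ‘presence or absence’ table (association), and Pearson’s coefficient for paired continuous values (correlation). The function names and all numbers are invented for illustration.

```python
import math


def signed_cell_contributions(table):
    """Per-cell (observed - expected)^2 / expected, signed by the direction of the excess."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # chance expectation
            contribution = (observed - expected) ** 2 / expected
            out_row.append(math.copysign(contribution, observed - expected))
        result.append(out_row)
    return result


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired continuous values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Association: rows = genotype (carrier, non-carrier),
# columns = phenotype (affected, unaffected); invented counts.
contributions = signed_cell_contributions([[30, 10], [20, 40]])

# Correlation: invented abundance levels of two transcripts across four samples.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

The absolute values of the contributions sum to the usual chi-square statistic, while their signs show which cells are over- or under-represented relative to chance. In both calculations the data are being compared with what independence would predict, which is the common idea noted above.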

Much of the above may not seem like bioinformatics, but only because the jargon is different. That this ‘barrier’ is progressively coming down is important, as each discipline has valuable techniques less well known in the other. Where they do seem to be bioinformatics it is essentially due to the fact that they come packaged in distinct suites of applications targeted at bioinformatics users, and where they do not seem to be bioinformatics, they do not come simply packaged for bioinformatics users.

Genomics

The genome and its offspring ‘-omes’

In contrast to the term bioinformatics, the term genome is much older, believed to have been first used in 1920 by Professor Hans Winkler at the University of Hamburg to describe the world or system within the discipline of biology and within each cell of an organism that addresses the inherited executable information. The word genome (Gk: γίνομαι) means I become, I am born, to come into being, and the Oxford English Dictionary gives its etymology as being from gene and chromosome. This etymology may not be entirely correct.

In this chapter, genomes of organisms are in computer science jargon the ‘primary objects’ on which bioinformatics ‘acts’. Their daughter molecular objects, such as the corresponding transcriptomes, proteomes and metabolomes, should indeed be considered in their own right but may also be seen as subsets or derivatives of the genome concept. The remainder of this chapter is largely devoted to genomic information and its use in drug discovery and development.

While the term genome has recently spawned many offspring ‘-omes’ relating to the disciplines that address various matters downstream from inherited information in DNA, e.g. the proteome, these popular -ome words have an even earlier origin in the 20th century (e.g. biome and rhizome). Adding the plural ‘-ics’ suffix seems recent. The use of ‘omics’ as a suffix is more like an analogue of the earlier ‘-netics’ and ‘-onics’ in engineering. The current rising hierarchy of ‘-omes’ is shown in Table 7.1, and these and others are discussed by Robson and Baek (2009). There are constant additions to the ‘-omes’.

Table 7.1 Gene to function is paved with ‘-omes’

Term	Definition	Static or dynamic
Commonly used terms
Genome	Full complement of genetic information (i.e. DNA sequence, including coding and non-coding regions)	Static
Transcriptome	Population of mRNA molecules in a cell under defined conditions at a given time	Dynamic
Proteome	Either: the complement of proteins (including post-translational modifications) encoded by the genome	Static
	or: the set of proteins and their post-translational modifications expressed in a cell or tissue under defined conditions at a specific time (also sometimes referred to as the translatome)	Dynamic
Terms occasionally encountered (to be interpreted with caution)
Secretome	Population of secreted proteins produced by a cell	Dynamic
Metabolome	Small molecule content of a cell	Dynamic
Interactome	Grouping of interactions between proteins in a cell	Dynamic
Glycome	Population of carbohydrate molecules in a cell	Dynamic
Foldome	Population of gene products classified by tertiary structure	Dynamic
Phenome	Population of observable phenotypes describing variations of form and function in a given species	Dynamic