1.1 The Future Is So Bright...
1.2 Functional Genomics
1.2.1 Informatics and advances in enabling technology
1.2.2 Why do we need new techniques?
1.3 Missing the Forest for the Dendrograms
1.3.1 Sociology of a functional genomics pipeline
1.4 Functional Genomics, Not Genetics
1.4.1 In silico analysis will never substitute for in vitro and in vivo Basic Biology
1.5.1 Biological caveats in mRNA measurements
1.5.2 Sequence-level genomics
1.5.3 Proteomics
2.1 The Safe Conception of a Functional Genomic Experiment
2.1.1 Experiment design space
2.1.2 Expression space
2.1.3 Exercising the expression space
2.1.4 Discarding data and low-hanging fruit
2.2 Gene-Clustering Dogma
2.2.1 Supervised versus unsupervised learning
2.2.2 Figure of merit: The elusive gold standard in functional genomics
3.1 Generic Features of Microarray Technologies
3.1.1 Robotically spotted microarrays
3.1.2 Oligonucleotide microarrays
3.2 Replicate Experiments, Reproducibility, and Noise
3.2.1 What is a replicate experiment? A reproducible experimental outcome?
3.2.2 Reproducibility across repeated microarray experiments: Absolute expression level and fold difference
3.2.3 Cross-platform (technology) reproducibility
3.2.4 Pooling sample probes and PCR for replicate experiments
3.2.5 What is noise?
3.2.6 Sources and examples of noise in the generic microarray experiment
3.2.7 Biological variation as noise: The Human Genome Project and irreproducibility of expression measurements
3.2.8 Managing noise
3.3 Prototypical Objectives and Questions
3.3.1 Two examples: Inter-array and intra-array
3.4 Preprocessing: Filters and Normalization
3.4.1 Normalization
3.5 Background on Fold
3.5.1 Fold calculation and significance
3.5.2 Fold change may not mean the same thing in different expression measurement technologies
3.6 Dissimilarity and Similarity Measures
3.6.1 Linear correlation
3.6.2 Entropy and mutual information
3.6.3 Dynamics
4.1 Introduction
4.2 What Can Be Clustered in Functional Genomics?
4.3 What Does it Mean to Cluster?
4.4 Hierarchy of Bioinformatics Algorithms
4.5 Data Reduction and Filtering
4.5.1 Variation filter
4.5.2 Low entropy filter
4.5.3 Minimum expression level filter
4.5.4 Target ambiguity filter
4.6 Self-Organizing Maps
4.6.1 K-means clustering
4.7 Finding Genes That Split Sets
4.8 Phylogenetic-Type Trees
4.8.1 Two-dimensional dendrograms
4.9 Relevance Networks
4.10 Other Methods
4.11 Which Technique Should I Use?
4.12 Determining the Significance of Findings
4.12.1 Permutation testing
4.12.2 Testing and training sets
4.12.3 Performance metrics
4.12.4 Receiver operating characteristic curves
4.13 Genetic Networks
4.13.1 What is a genetic network
4.13.2 Reverse-engineering and modeling a genetic network using limited data
4.13.3 Bayesian networks for functional genomics
5.1 Ontologies
5.1.1 Bio-ontology projects
5.1.2 Advanced knowledge representation systems for bio-ontology
5.2 Expressivity versus Computability
5.3 Ontology versus Data Model versus Nomenclature
5.3.1 Exploiting the explicit and implicit ontologies of the biomedical literature
5.4 Data Model Introduction
5.5.1 The unique gene identifier
5.6 Postanalysis Challenges
5.6.1 Linking to downstream biological validation
5.6.2 Problems in determining the results
6.1 Electronic Medical Records
6.2 Standardized Vocabularies for Clinical Phenotypes
6.3 Privacy of Clinical Data
6.3.1 Anonymization
6.3.2 Privacy rules
6.4 Costs of Clinical Data Acquisition
7.1 New Methods for Gene Expression Profiling
7.1.1 Electronic positioning of molecules: Nanogen
7.1.2 Ink-jet spotting of arrays: Agilent
7.1.3 Coded microbeads bound to oligonucleotides:
7.1.4 Serial Analysis of Gene Expression (SAGE)
7.1.5 Parallel signature sequencing on microbead arrays: Lynx
7.1.6 Gel pad technology: Motorola
7.2 Respecting the Older Generation
7.2.1 The generation gap
7.2.2 Separating the wheat from the chaff
7.2.3 A persistent problem
7.3 Selecting Software
7.4 Investing in the Future of the Genomic Enterprise
The impact of microarray measurements on biology and bioinformatics has been astounding. Starting from virtually no literature a few years ago, this field has come to dominate many conferences and journals. As an example, the Intelligent Systems for Molecular Biology conference, the annual meeting of the International Society for Computational Biology held in Copenhagen in 2002, had almost 50% of its papers in the areas addressed directly or indirectly within this book. Four years ago, there were none. Bioinformatics has always been driven by the availability of data--sequential, structural, and most recently functional. The availability of sequence data brought into biology a cadre of computer scientists with special skills in string processing. The availability of structural data brought in technical experts in visualization and computational geometry. This most recent development--the availability of relatively large data sets measuring the expression of genes within cells--has helped attract yet another group of scientists--the data miners, machine learners and statisticians.
In many ways, the impact of this data on biology and informatics can be summarized in figure 1.4 of this book--the world has not quite been turned upside down, but it certainly has been turned on its side! A decade ago, if confronted with the data matrix shown on the right of this figure, a well-trained information scientist would say "This is ridiculous. Why would you ask me to analyze a data set where you clearly have a profoundly under-determined problem? There's not enough data here to distinguish between any of the zillion hypotheses that could be consistent with this data set. And who designed these experiments, anyway? How can you measure so many features of so few examples?" Yet, these experiments are proceeding and are making major contributions to our understanding of how gene systems interact, how to distinguish different types of cancer, and how to measure the impact of the environment on a cell. Our information scientist friend is, in some sense, correct about the relative paucity of data. (Have you ever tried to convince a biologist holding a microarray with 45,000 spots that this is a relatively data-poor exercise? It's not fun.) However, the information scientist has missed the point about the design and analysis of these experiments. These data sets do indeed contain gold, but the experiments (as for all experiments) must be considered carefully in both the design and implementation phases in order to maximize value. This is where the authors of this book have made a contribution. They start from the premise that these experiments offer great potential, but must be performed and analyzed carefully. They set the context of traditional reductionist biology, and then go on to discuss the design, analysis, storage and interpretation of this first generation of functional genomics experiments. The writing is lively and candid, and the examples are taken from an array of applications.
The authors' practical experience in dealing with this data comes through, and they intersperse practical advice with philosophical reverie. Sometimes, these two merge into important discussions such as on the role of ontologies in making sense of these data sets, or on the challenges of linking microarray results with phenotypic data pertinent to human disease.
The functional genomics revolution is here. We do not know how it will change our view of biology and medicine. They are both much more likely to become quantitative and systematic (as opposed to qualitative and reductionist). The informatics techniques required to address this revolution are not entirely clear, but this text gets us started in the right direction.
Russ Altman Stanford University March 2002
Three years ago, when a colleague would approach us with questions about functional genomics and the informatics techniques required to leverage the data obtained from measurement techniques such as DNA microarrays, we had a standard response: "Come listen to a 1-hour presentation by one of us and you'll have a foundation for further discussions." Since then, this response has become inadequate. First, the field can hardly be summarized in even eight 1-hour lectures, and second, the growth in the number of potential collaborators has far outstripped the time available to us to make the necessary presentations.
In early 2000, Ben Reis, one of the graduates of the Children's Hospital Informatics Program, had the inspired idea of formalizing our introductory presentations into a book. We immediately agreed that this was a timely suggestion and, to its credit, so did The MIT Press. In our teaching duties in several courses within the Division of Health Sciences and Technology (HST) at Harvard/MIT, the range of topics within functional genomics that we were covering in formal presentations grew rapidly. Subsequently, with the inception of the Bioinformatics and Integrative Genomics training program at HST and the development of a Genomic Medicine course at HST, we felt the need for this book all the more acutely.
We recognize that the readership of this book will be varied due to the intrinsically multidisciplinary nature of the functional genomics enterprise (as will be emphasized in the introductory chapter). Accordingly we outline the content of the following chapters so that readers may choose for themselves the path that suits them. Nonetheless, our intent and contention is that the current ordering of the chapters provides the most efficient way of acquiring the content of this book.
Introduction. Here we establish the motivation and the scope of this book and touch upon substantial obstacles to the successful application of bioinformatics to integrative genomics. The notion of an interdisciplinary functional genomics pipeline is introduced. We also review which kinds of readers might find this book worthwhile. The promise and limitations of functional genomics techniques, the nature of various kinds of genomic data, and the central role played by the discipline of bioinformatics are outlined. For those who have a limited background in biological sciences, there is a subsection on the basic minimum of molecular biology concepts that will be needed to grasp the following chapters.
Chapter 2. Experimental Design. This chapter develops a framework for approaching the design of microarray-driven functional genomics experiments. Very little here is quantitative or mathematical. Rather, the emphasis is on ways of thinking about the design of experiments and how it might impact the yield of these experiments. We address challenges that are particular to computer scientists (e.g., defining a figure of merit for the performance of the bioinformatics algorithms) and to biologists (e.g., discarding potentially valuable data using formal decision theory because of the scale issues in massively parallel data acquisition using noisy measurement devices), respectively. In exploring the design issues we introduce the functional genomics clustering dogma, the broad machine-learning categories of supervised and unsupervised learning, and the nature of the analyses developed using these techniques.
Chapter 3. Microarray Measurements to Analyses. We lay the foundations for performing analyses of microarray data sets. This is the first of the more quantitative and mathematical chapters. We start with a discussion of the acquisition of digital data from the two most widely employed classes of microarrays. Then we consider the two most generic problems: comparing gene expression within a single microarray, i.e., intra-array analyses, and comparing expression across microarrays, i.e., inter-array analyses. In so doing, we introduce the fundamental concept of dissimilarity and similarity measures and the several kinds of such measures. These measures become the building blocks for the genomic data-mining techniques described in the following chapter.
Chapter 4. Genomic Data-Mining Techniques. When gene expression is measured in more than two samples, gene expression patterns have to be analyzed using methods that consider the coordinated interactions of genes across multiple conditions. This chapter assesses the components of biomedical experiments that can be included in a data-mining investigation. We then cover the most commonly used analytic techniques, discussing the advantages and disadvantages of each technique, as well as the postanalysis process. Where appropriate, we provide pseudocode that will allow readers with some training in computer science to understand the details of the most often used and cited data-mining algorithms. The emerging field of genetic network reverse engineering is also introduced here.
Chapter 5. Bio-Ontologies, Data Models, Nomenclature. This chapter addresses possibly the least exciting but the most pressing bioinformatics need for genomic research: creating and using comprehensive annotations of gene function, storing and organizing microarray expression data, and ensuring standardized access to these data. We review current efforts to create formalized systems of description of gene function and the various kinds of "ontologies" that support these descriptions. The challenge of designing "standardized data models" for the storage of microarray data is addressed and the principal contenders claiming to be this standard are reviewed. Naming schemes--nomenclatures--most applicable to gene expression studies are described. Nomenclatures, data models, and ontologies are placed in the perspective of the general problem of analyzing the results of functional genomics experiments. Tools that leverage these standardization efforts and the on-line published literature are also described.
Chapter 6. From Functional Genomics to Clinical Relevance: Getting the Phenotype Right. Here we address the process of translating the functional genomics research agenda into one of clinical relevance. We place in this perspective the value and deficiencies of electronic medical records and standardized clinical vocabularies. Although by no means comprehensive, we provide the highlights of the privacy issues (e.g., the implications of the Health Insurance Portability and Accountability Act, anonymization, cryptographic identifiers, etc.) that are most likely to have an impact on the clinical application of genomic technologies.
Chapter 7. The Near Future. As the techniques and goals of functional genomics are in rapid flux, we engage in some short-term forecasting to guide readers planning in this time window. Microarray technologies being developed and recently released are previewed. In this context, the problem of comparing expression measurements across generations of microarray measurement platforms is appraised. More broadly, the different kinds of software required for the successful functional genomics enterprise are described. Finally, a model to meet the training needs of this new discipline is outlined.