Developing Bioinformatics Computer Skills
Cynthia Gibas
Per Jambeck
Publisher: O'Reilly

First Edition April 2001
ISBN: 1-56592-664-1, 446 pages




Developing Bioinformatics Computer Skills
Copyright © 2001 O'Reilly & Associates, Inc. All rights reserved.

Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safari.oreilly.com). For more information
contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

The O'Reilly logo is a registered trademark of O'Reilly & Associates, Inc. Many of the designations
used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those
designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the
designations have been printed in caps or initial caps. The association between the image of a
Caenorhabditis elegans and the topic of bioinformatics is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher assumes no
responsibility for errors or omissions, or for damages resulting from the use of the information
contained herein.




Preface
Audience for This Book
Structure of This Book
Our Approach to Bioinformatics
URLs Referenced in This Book
Conventions Used in This Book
Comments and Questions
Acknowledgments
Chapter 1. Biology in the Computer Age
1.1 How Is Computing Changing Biology?
1.2 Isn't Bioinformatics Just About Building Databases?
1.3 What Does Informatics Mean to Biologists?
1.4 What Challenges Does Biology Offer Computer Scientists?
1.5 What Skills Should a Bioinformatician Have?
1.6 Why Should Biologists Use Computers?
1.7 How Can I Configure a PC to Do Bioinformatics Research?
1.8 What Information and Software Are Available?
1.9 Can I Learn a Programming Language Without Classes?
1.10 How Can I Use Web Information?
1.11 How Do I Understand Sequence Alignment Data?
1.12 How Do I Write a Program to Align Two Biological Sequences?
1.13 How Do I Predict Protein Structure from Sequence?
1.14 What Questions Can Bioinformatics Answer?
Chapter 2. Computational Approaches to Biological Questions
2.1 Molecular Biology's Central Dogma
2.2 What Biologists Model
2.3 Why Biologists Model
2.4 Computational Methods Covered in This Book
2.5 A Computational Biology Experiment
Chapter 3. Setting Up Your Workstation
3.1 Working on a Unix System
3.2 Setting Up a Linux Workstation
3.3 How to Get Software Working
3.4 What Software Is Needed?
Chapter 4. Files and Directories in Unix
4.1 Filesystem Basics
4.2 Commands for Working with Directories and Files
4.3 Working in a Multiuser Environment
Chapter 5. Working on a Unix System
5.1 The Unix Shell
5.2 Issuing Commands on a Unix System
5.3 Viewing and Editing Files
5.4 Transformations and Filters
5.5 File Statistics and Comparisons
5.6 The Language of Regular Expressions
5.7 Unix Shell Scripts
5.8 Communicating with Other Computers
5.9 Playing Nicely with Others in a Shared Environment
Chapter 6. Biological Research on the Web
6.1 Using Search Engines
6.2 Finding Scientific Articles
6.3 The Public Biological Databases
6.4 Searching Biological Databases
6.5 Depositing Data into the Public Databases
6.6 Finding Software
6.7 Judging the Quality of Information
Chapter 7. Sequence Analysis, Pairwise Alignment, and Database Searching
7.1 Chemical Composition of Biomolecules
7.2 Composition of DNA and RNA
7.3 Watson and Crick Solve the Structure of DNA
7.4 Development of DNA Sequencing Methods
7.5 Genefinders and Feature Detection in DNA
7.6 DNA Translation
7.7 Pairwise Sequence Comparison
7.8 Sequence Queries Against Biological Databases
7.9 Multifunctional Tools for Sequence Analysis
Chapter 8. Multiple Sequence Alignments, Trees, and Profiles
8.1 The Morphological to the Molecular
8.2 Multiple Sequence Alignment
8.3 Phylogenetic Analysis
8.4 Profiles and Motifs
Chapter 9. Visualizing Protein Structures and Computing Structural Properties
9.1 A Word About Protein Structure Data
9.2 The Chemistry of Proteins
9.3 Web-Based Protein Structure Tools
9.4 Structure Visualization
9.5 Structure Classification
9.6 Structural Alignment
9.7 Structure Analysis
9.8 Solvent Accessibility and Interactions
9.9 Computing Physicochemical Properties
9.10 Structure Optimization
9.11 Protein Resource Databases
9.12 Putting It All Together
Chapter 10. Predicting Protein Structure and Function from Sequence
10.1 Determining the Structures of Proteins
10.2 Predicting the Structures of Proteins
10.3 From 3D to 1D
10.4 Feature Detection in Protein Sequences
10.5 Secondary Structure Prediction
10.6 Predicting 3D Structure
10.7 Putting It All Together: A Protein Modeling Project
10.8 Summary
Chapter 11. Tools for Genomics and Proteomics
11.1 From Sequencing Genes to Sequencing Genomes
11.2 Sequence Assembly
11.3 Accessing Genome Information on the Web
11.4 Annotating and Analyzing Whole Genome Sequences
11.5 Functional Genomics: New Data Analysis Challenges
11.6 Proteomics
11.7 Biochemical Pathway Databases
11.8 Modeling Kinetics and Physiology
11.9 Summary
Chapter 12. Automating Data Analysis with Perl
12.1 Why Perl?
12.2 Perl Basics
12.3 Pattern Matching and Regular Expressions
12.4 Parsing BLAST Output Using Perl
12.5 Applying Perl to Bioinformatics
Chapter 13. Building Biological Databases
13.1 Types of Databases
13.2 Database Software
13.3 Introduction to SQL
13.4 Installing the MySQL DBMS
13.5 Database Design
13.6 Developing Web-Based Software That Interacts with Databases
Chapter 14. Visualization and Data Mining
14.1 Preparing Your Data
14.2 Viewing Graphics
14.3 Sequence Data Visualization
14.4 Networks and Pathway Visualization
14.5 Working with Numerical Data
14.6 Visualization: Summary
14.7 Data Mining and Biological Information
Biblio.1 Unix
Biblio.2 SysAdmin
Biblio.3 Perl
Biblio.4 General Reference
Biblio.5 Bioinformatics Reference
Biblio.6 Molecular Biology/Biology Reference
Biblio.7 Protein Structure and Biophysics
Biblio.8 Genomics
Biblio.9 Biotechnology
Biblio.10 Databases
Biblio.11 Visualization
Biblio.12 Data Mining
Colophon




Preface
Computers and the World Wide Web are rapidly and dramatically changing the face of biological
research. These days, the term "paradigm shift" is used to describe everything from new business
trends to new flavors of cola, but biological science is in the midst of a paradigm shift in the classical
sense. Theoretical and computational biology have existed for decades on the "fringe" of biological
science. But within just a few short years, the flood of new biological data produced by genomics
efforts and, by necessity, the application of computers to the analysis of this genomic data, has begun to
affect every aspect of the biological sciences. Research that used to start in the laboratory now starts at
the computer, as scientists search databases for information that might suggest new hypotheses.

In the last two decades, both personal computers and supercomputers have become accessible to
scientists across all disciplines. Personal computers have developed from expensive novelties with little
real computing power into machines that are as powerful as the supercomputers of 10 years ago. Just as
they've replaced the author's typewriter and the accountant's ledger, computers have taken their place in
controlling and collecting data from lab equipment. They have the potential to completely replace
laboratory notebooks and files as a means of storing data. The power of computer databases allows
much easier access to stored data than nonelectronic forms of recording. Beyond their usefulness for
the storage, analysis, and visualization of data, however, computers are powerful devices for
understanding any system that can be described in a mathematical way, giving rise to the disciplines of
computational biology and, more recently, bioinformatics.

Bioinformatics is the application of information technology to the management of biological data. It's a
rapidly evolving scientific discipline. In the last two decades, storage of biological data in public
databases has become increasingly common, and these databases have grown exponentially. The
biological literature is growing exponentially as well. It's impossible for even the most zealous
researcher to stay on top of necessary information in the field without the aid of computer-based tools,
and the Web has made it possible for users at any location to interact with programs and databases at
any other site, provided they know how to build the right tools.

Bioinformatics is first and foremost a biological science. It's often less about developing perfectly
elegant algorithms than it is about answering practical questions. Bioinformaticians (or
bioinformaticists, if you prefer) are the tool-builders, and it's critical that they understand biological
problems as well as computational solutions in order to produce useful tools. Bioinformatics algorithms
need to encompass complex scientific assumptions that can complicate programming and data
modeling in unique ways.

Research in bioinformatics and computational biology can encompass anything from the abstraction of
the properties of a biological system into a mathematical or physical model, to the implementation of
new algorithms for data analysis, to the development of databases and web tools to access them. To
engage in computational research, a biologist must be comfortable using software tools that run on a
variety of operating systems. This book introduces and explains many of the most popular tools used in
bioinformatics research. We've included lots of additional information and background material to help
you understand how the tools are best used and why they are important. We hope that it will help you
through the first steps of using computers productively in your research.

Audience for This Book

Most biological science students and researchers are starting to use computers as more than word-
processing or data-collection and plotting devices. Many don't have backgrounds in computer science
or computational theory, and to them, the fields of computational biology and bioinformatics may seem
hopelessly large and complex. This book, motivated by our interactions with our students and
colleagues, is by no means a comprehensive bible on all aspects of bioinformatics. It is, however, a
thoughtful introduction to some of the most important topics in bioinformatics. We introduce standard
computational techniques for finding information in biological sequence, genome, and molecular
structure databases; we talk about how to identify genes and detect characteristic patterns that identify
gene families; and we discuss the modeling of phylogenetic relationships, molecular structures, and
biochemical properties. We also discuss ways you can use your computer as a tool to organize data, to
think systematically about data-analysis processes, and to begin thinking about automation of data
handling.

Bioinformatics is a fairly advanced topic, so even an introductory book like this one assumes certain
levels of background knowledge. To get the most out of this book you should have some coursework or
experience in molecular biology, chemistry, and mathematics. An undergraduate course or two in
computer programming would also be helpful.

Structure of This Book
We've arranged the material in this book to allow you to read it from start to finish or to skip around,
digesting later sections before previous ones. It's divided into four parts:

Part I

Chapter 1 defines bioinformatics as a discipline, delves into a bit of history, and provides a brief tour of
what the book covers and why.

Chapter 2 introduces the core concepts of bioinformatics and molecular biology and the technologies
and research initiatives that have made increasing amounts of biological data available. It also covers
the ever-growing list of basic computer procedures every biologist should know.

Part II

Chapter 3 introduces Unix, then moves on to the basics of installing Linux on a PC and getting
software up and running.

Chapter 4 covers the ins and outs of moving around a Unix filesystem, including file hierarchies,
naming schemes, commonly used directory commands, and working in a multiuser environment.

Chapter 5 explains many Unix commands users will encounter on a daily basis, including commands
for viewing, editing, and extracting information from files; regular expressions; shell scripts; and
communicating with other computers.

Part III




Chapter 6 is about the art of finding biological information on the Web. The chapter covers search
engines and searching, where to find scientific articles and software, how to use the online information
sources, and the public biological databases.

Chapter 7 begins with a review of molecular evolution and then moves on to cover the basics of
pairwise sequence-analysis techniques such as predicting gene location, global and local alignment, and
local alignment-based searching against databases using BLAST and FASTA. The chapter concludes
with coverage of multifunctional tools for sequence analysis.

Chapter 8 moves on to study groups of related genes or proteins. It covers strategies for multiple
sequence alignment with tools such as ClustalW and Jalview, then discusses tools for phylogenetic
analysis and for constructing profiles and motifs.

Chapter 9 covers 3D analysis of proteins and the tools used to compute their structural properties. The
chapter begins with a review of protein chemistry and quickly moves to a discussion of web-based
protein structure tools; structure classification, alignment, and analysis; solvent accessibility and
solvent interactions; and computing physicochemical properties of proteins. The chapter concludes
with structure optimization and a tour through protein resource databases.

Chapter 10 covers the tools that predict the structures of proteins from their sequences. The chapter
discusses feature detection in protein sequences, secondary structure prediction, and 3D structure
prediction. It concludes with an example project in protein modeling.

Chapter 11 puts it all together. Up to now we've covered tools and techniques for analyzing single
sequences or structures, and for comparing multiple sequences of single-gene length. This chapter
discusses some of the datatypes and tools that are becoming available for studying the integrated
function of all the genes in a genome, including sequencing an entire genome, accessing genome
information on the Web, annotating and analyzing whole genome sequences, and emerging
technologies and proteomics.

Part IV

Chapter 12 shows you how a programming language such as Perl can help you sift through mountains
of data to extract just the information you require. It won't teach you to program in Perl, but the chapter
gives you a brief introduction to the language and includes examples to start you on your way toward
learning to program.

Chapter 13 is an introduction to database concepts. It covers the types of databases used in biological
research, the database software that builds them, database languages (in particular, the SQL language),
and developing web-based software that interacts with databases.

Chapter 14 covers the computational tools and techniques that allow you to make sense of your results.
The first part of the chapter introduces programs that are used to visualize data arising from
bioinformatics research. They range from general-purpose plotting and statistical packages for
numerical data, such as Grace and gnuplot, to programs such as TEXshade that are dedicated to
presenting sequence and structural information in an interpretable form. The second part of the chapter
presents tools for data mining (the process of finding, interpreting, and evaluating patterns in large sets
of data) in the context of applications in bioinformatics.

Our Approach to Bioinformatics
We confess, we're structural biologists (biophysicists, actually). We have a hard time thinking about
genes without thinking about their protein products. DNA sequences, to us, aren't just sequences. To a
structural biologist, genes (with a few exceptions) imply 3D structures, molecular shapes and
conformational changes, active sites, chemical reactions, and detailed intermolecular interactions. Our
focus in this book is on using sequence information as structural biologists and biochemists tend to use
it: to understand the chemical basis of biological function. We've probably neglected some
applications of sequence analysis that are dear to the hearts of molecular biologists and geneticists, so
feel free to send us your comments.

URLs Referenced in This Book
For more information on the URLs we reference in this book and for additional material about
bioinformatics, see the web page for this book, which is listed in Section P.6.

Conventions Used in This Book
The following conventions are used in this book:

Italic

Used for commands, filenames, directory names, variables, URLs, and for the first use of a term

Constant width

Used in code examples and to show the output of commands

Constant width italic

Used in "Usage" phrases to denote variables.

This icon designates a note, which is an important aside to the nearby text.



This icon designates a warning relating to the nearby text.



Comments and Questions

Please address comments and questions concerning this book to the publisher:

O'Reilly & Associates, Inc.
101 Morris Street
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

We have a web page for this book, where we list errata, examples, or any additional information. You
can access this page at:

http://www.oreilly.com/catalog/bioskills/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about our books, conferences, software, Resource Centers, and the O'Reilly
Network, see our web site at:

http://www.oreilly.com

Acknowledgments
From Cynthia: I'd like to thank all of the people who have restrained themselves from laughing when
they heard me say, for the thousandth time during the last year, "We're almost finished with the book."
Thanks to my family and friends, for putting up with extremely infrequent phone calls and updates
during the last few months; the students in my Fall 2000 Bioinformatics course, for acting as guinea
pigs in my first bioinformatics teaching experiment and helping me identify topics that needed to be
explained more thoroughly; my colleagues at Virginia Tech, for a year's worth of interesting
discussions of what bioinformatics means and what bioinformatics students need to know; our
friend and colleague Jim Fenton, for his contributions early in the development of the book; and my
thesis advisor, Shankar Subramaniam. I'd also like to thank our technical reviewers, Sean Eddy, Peter
Leopold, Andrew Odewahn, Clay Shirky, and Jim Tisdall, for their helpful comments and excellent
advice. And finally, thanks go to the staff of O'Reilly, and our editor, Lorrie LeJeune, for infinite
patience and moral support during the writing process.

From Per: First, I am deeply grateful to my advisor, Professor Shankar Subramaniam, who has been a
continuous source of inspiration and a mainstay of our lab's congenial working environment at UCSD.
My thanks also go to two of my mentors, Professor Charles Elkan of the University of California, San
Diego, and Professor Michael R. Brent, now of Washington University, whose wise guidance has
shaped my understanding of computational problems. Sanna Herrgard and Markus Herrgard read early
versions of this book and provided valuable comments and moral support. The book has also benefited
from feedback and helpful conversations with Ewan Birney, Phil Bourne, Jim Fenton, Mike Farnum,
Brian Saunders, and Winny Tan. Thanks to Joe Johnston of O'Reilly for providing Perl advice and code
in Chapter 12. Our technical reviewers made indispensable suggestions and contributions, and I owe
special thanks to Sean Eddy, Peter Leopold, Andrew Odewahn, Clay Shirky, and Jim Tisdall for their
careful attention to detail. It has been a pleasure to work with the staff at O'Reilly, and in particular
with our editor Lorrie LeJeune, who patiently and cheerfully guided us through the project. Finally, my
part of this book would not have been possible without the support and encouragement of my family.




Chapter 1. Biology in the Computer Age
From the interaction of species and populations, to the function of tissues and cells within an individual
organism, biology is defined as the study of living things. In the course of that study, biologists collect
and interpret data. Now, at the beginning of the 21st century, we use sophisticated laboratory
technology that allows us to collect data faster than we can interpret it. We have vast volumes of DNA
sequence data at our fingertips. But how do we figure out which parts of that DNA control the various
chemical processes of life? We know the function and structure of some proteins, but how do we
determine the function of new proteins? And how do we predict what a protein will look like, based on
knowledge of its sequence? We understand the relatively simple code that translates DNA into protein.
But how do we find meaningful new words in the code and add them to the DNA-protein dictionary?

Bioinformatics is the science of using information to understand biology; it's the tool we can use to help
us answer these questions and many others like them. Unfortunately, with all the hype about mapping
the human genome, bioinformatics has achieved buzzword status; the term is being used in a number of
ways, depending on who is using it. Strictly speaking, bioinformatics is a subset of the larger field of
computational biology, the application of quantitative analytical techniques in modeling biological
systems. In this book, we stray from bioinformatics into computational biology and back again. The
distinctions between the two aren't important for our purpose here, which is to cover a range of tools
and techniques we believe are critical for molecular biologists who want to understand and apply the
basic computational tools that are available today.

The field of bioinformatics relies heavily on work by experts in statistical methods and pattern
recognition. Researchers come to bioinformatics from many fields, including mathematics, computer
science, and linguistics. Unfortunately, biology is a science of the specific as well as the general.
Bioinformatics is full of pitfalls for those who look for patterns and make predictions without a
complete understanding of where biological data comes from and what it means. By providing
algorithms, databases, user interfaces, and statistical tools, bioinformatics makes it possible to do
exciting things such as compare DNA sequences and generate results that are potentially significant.
"Potentially significant" is perhaps the most important phrase. These new tools also give you the
opportunity to overinterpret data and assign meaning where none really exists. We can't overstate the
importance of understanding the limitations of these tools. But once you gain that understanding and
become an intelligent consumer of bioinformatics methods, the speed at which your research
progresses can be truly amazing.

1.1 How Is Computing Changing Biology?
An organism's hereditary and functional information is stored as DNA, RNA, and proteins, all of which
are linear chains composed of smaller molecules. These macromolecules are assembled from a fixed
alphabet of well-understood chemicals: DNA is made up of four deoxyribonucleotides (carrying the bases
adenine, thymine, cytosine, and guanine), RNA is made up of four ribonucleotides (carrying adenine,
uracil, cytosine, and guanine), and proteins are made from 20 amino acids. Because these macromolecules
are linear chains of defined components, they can be represented as sequences of symbols. These
sequences can then be compared to find similarities that suggest the molecules are related by form or
function.
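
As a minimal sketch of what "sequences of symbols" means in practice, the short Python fragment below stores two DNA sequences as ordinary strings, checks them against the four-letter alphabet, writes out the RNA version of one, and counts matching positions. The sequences are arbitrary toy strings and the function names are our own illustrative choices, not part of any package discussed in this book.

```python
DNA_ALPHABET = set("ACGT")

def is_dna(seq):
    """Return True if every symbol in seq is one of the four DNA letters."""
    return set(seq.upper()) <= DNA_ALPHABET

def transcribe(dna):
    """Write the RNA version of a DNA sequence by replacing T with U."""
    return dna.upper().replace("T", "U")

def count_matches(a, b):
    """Count positions at which two equal-length sequences carry the same symbol."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x == y for x, y in zip(a, b))

# Two made-up ten-letter sequences that differ at two positions.
seq1 = "ATGGCCATTG"
seq2 = "ATGGCGATTC"

print(is_dna(seq1))               # True
print(transcribe(seq1))           # AUGGCCAUUG
print(count_matches(seq1, seq2))  # 8
```

Position-by-position counting like this works only when the two strings are already lined up; finding the best way to line them up is the job of the alignment programs introduced next.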

Sequence comparison is possibly the most useful computational tool to emerge for molecular
biologists. The World Wide Web has made it possible for a single public database of genome sequence
data to provide services through a uniform interface to a worldwide community of users. With a
commonly used computer program called BLAST, a molecular biologist can compare an
uncharacterized DNA sequence to the entire publicly held collection of DNA sequences. In the next
section, we present an example of how sequence comparison using the BLAST program can help you
gain insight into a real disease.

1.1.1 The Eye of the Fly

Fruit flies (Drosophila melanogaster) are a popular model system for the study of development of
animals from embryo to adult. Fruit flies have a gene called eyeless, which, if it's "knocked out" (i.e.,
eliminated from the genome using molecular biology methods), results in fruit flies with no eyes. It's
obvious that the eyeless gene plays a role in eye development.

Researchers have identified a human gene responsible for a condition called aniridia. In humans who
are missing this gene (or in whom the gene has mutated just enough for its protein product to stop
functioning properly), the eyes develop without irises.

If the gene for aniridia is inserted into an eyeless Drosophila "knockout," it causes the production of
normal Drosophila eyes. It's an interesting coincidence. Could there be some similarity in how eyeless
and aniridia function, even though flies and humans are vastly different organisms? Possibly. To gain
insight into how eyeless and aniridia are related, we can compare their sequences. Always bear in
mind, however, that genes have complex effects on one another. Careful experimentation is required to
get a more definitive answer.

As little as 15 years ago, looking for similarities between eyeless and aniridia DNA sequences would
have been like looking for a needle in a haystack. Most scientists compared the respective gene
sequences by hand-aligning them one under the other in a word processor and looking for matches
character by character. This was time-consuming, not to mention hard on the eyes.

In the late 1980s, fast computer programs for comparing sequences changed molecular biology forever.
Pairwise comparison of biological sequences is the foundation of most widely used bioinformatics
techniques. Many tools that are widely available to the biology community (including everything from
multiple alignment, phylogenetic analysis, motif identification, and homology-modeling software to
web-based database search services) rely on pairwise sequence-comparison algorithms as a core
element of their function.

These days, a biologist can find dozens of sequence matches in seconds using sequence-alignment
programs such as BLAST and FASTA. These programs are so commonly used that the first encounter
you have with bioinformatics tools and biological databases will probably be through the National
Center for Biotechnology Information's (NCBI) BLAST web interface. Figure 1-1 shows a standard
form for submitting data to NCBI for a BLAST search.

Figure 1-1. Form for submitting a BLAST search against nucleotide databases at NCBI




1.1.2 Labels in Gene Sequences

Before you rush off to compare the sequences of eyeless and aniridia with BLAST, let us tell you a
little bit about how sequence alignment works.

It's important to remember that a biological sequence (DNA or protein) has a chemical function, but
when it's reduced to a single-letter code, it also functions as a unique label, almost like a bar code.
From the information technology point of view, sequence information is priceless. The sequence label
can be applied to a gene, its product, its function, its role in cellular metabolism, and so on. The user
searching for information related to a particular gene can then use rapid pairwise sequence comparison
to access any information that's been linked to that sequence label.

The most important thing about these sequence labels, though, is that they don't just uniquely identify a
particular gene; they also contain biologically meaningful patterns that allow users to compare different
labels, connect information, and make inferences. So not only can the labels connect all the information
about one gene, they can help users connect information about genes that are slightly or even
dramatically different in sequence.

If simple labels were all that was needed to make sense of biological data, you could just slap a unique
number (e.g., a GenBank ID) onto every DNA sequence and be done with it. But biological sequences
are related by evolution, so a partial pattern match between two sequence labels is a significant find.
BLAST differs from simple keyword searching in its ability to detect partial matches along the entire
length of a protein sequence.


1.1.3 Comparing eyeless and aniridia with BLAST

When the two sequences are compared using BLAST, you'll find that eyeless is a partial match for
aniridia. The text that follows is the raw data that's returned from this BLAST search:

pir||A41644 homeotic protein aniridia - human
Length = 447

Score = 256 bits (647), Expect = 5e-67
Identities = 128/146 (87%), Positives = 134/146 (91%), Gaps = 1/146 (0%)

Query: 24  IERLPSLEDMAHKGHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 83
           I R P+   M +  HSGVNQLGGVFV GRPLPDSTRQKIVELAHSGARPCDISRILQVSN
Sbjct: 17  IPRPPARASMQNS-HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 75

Query: 84  GCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQEN 143
           GCVSKILGRYYETGSIRPRAIGGSKPRVAT EVVSKI+QYKRECPSIFAWEIRDRLL E
Sbjct: 76  GCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEG 135

Query: 144 VCTNDNIPSVSSINRVLRNLAAQKEQ 169
           VCTNDNIPSVSSINRVLRNLA++K+Q
Sbjct: 136 VCTNDNIPSVSSINRVLRNLASEKQQ 161


Score = 142 bits (354), Expect = 1e-32
Identities = 68/80 (85%), Positives = 74/80 (92%)

Query: 398 TEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQV 457
           +++ Q RL LKRKLQRNRTSFT +QI++LEKEFERTHYPDVFARERLA KI LPEARIQV
Sbjct: 222 SDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQV 281

Query: 458 WFSNRRAKWRREEKLRNQRR 477
           WFSNRRAKWRREEKLRNQRR
Sbjct: 282 WFSNRRAKWRREEKLRNQRR 301

The output shows local alignments of two high-scoring matching regions in the protein sequences of
the eyeless and aniridia genes. In each set of three lines, the query sequence (the eyeless sequence that
was submitted to the BLAST server) is on the top line, and the aniridia sequence is on the bottom line.
The middle line shows where the two sequences match. If there is a letter on the middle line, the
sequences match exactly at that position. If there is a plus sign on the middle line, the two sequences
are different at that position, but there is some chemical similarity between the amino acids (e.g., D and
E, aspartic and glutamic acid). If there is nothing on the middle line, the two sequences don't match at
that position.
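
To make the reading of the middle line concrete, here is a small Python sketch that tallies identities and positives for one block of the alignment. The three strings are copied from the third block of the first match in the output above; the function name and the consistency check are our own illustrative assumptions, not part of BLAST.

```python
def summarize_alignment(query, midline, subject):
    """Count identities and positives in one block of a BLAST-style protein alignment."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("all three lines must have the same length")
    # A letter on the middle line marks an exact match (an identity).
    identities = sum(1 for m in midline if m.isalpha())
    # A plus sign marks a chemically similar pair; positives include identities.
    positives = identities + midline.count("+")
    # Sanity check: a letter on the middle line must agree with both sequences.
    for q, m, s in zip(query, midline, subject):
        if m.isalpha() and not (q == m == s):
            raise ValueError("middle line disagrees with the sequences")
    return identities, positives, len(query)

query   = "VCTNDNIPSVSSINRVLRNLAAQKEQ"   # eyeless, residues 144-169
midline = "VCTNDNIPSVSSINRVLRNLA++K+Q"
subject = "VCTNDNIPSVSSINRVLRNLASEKQQ"   # aniridia, residues 136-161

print(summarize_alignment(query, midline, subject))  # (23, 26, 26)
```

For this block, 23 of 26 positions are identical and the remaining 3 are similar. Summing such counts over all the blocks of a match gives the Identities and Positives figures that BLAST reports in the header of each match.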

In this example, you can see that, if you submit the whole eyeless gene sequence and look (as standard
keyword searches do) for an exact match, you won't find anything. The local sequence regions make up
only part of the complete proteins: the region from 24-169 in eyeless matches the region from 17-161
in the human aniridia gene, and the region from 398-477 in eyeless matches the region from 222-301 in
aniridia. The rest of the sequence doesn't match! Even the two regions shown, which match closely,
don't match 100%, as they would have to, in order to be found in a keyword search.

However, this partial match is significant. It tells us that the human aniridia gene, which we don't know
much about, is substantially related in sequence to the fruit fly's eyeless gene. And we do know a lot
about the eyeless gene, from its structure and function (it's a DNA binding protein that promotes the
activity of other genes) to its effects on the phenotype (the form of the grown fruit fly).

BLAST finds local regions that match even in pairs of sequences that aren't exactly the same overall. It
extends matches beyond a single-character difference in the sequence, and it keeps trying to extend
them in both directions until the score of the match drops too far below its best value. As a result, BLAST
can detect patterns that are imperfectly replicated from sequence to sequence, and hence distant
relationships that are inexact but still biologically meaningful.
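
The idea of extending a match until its score drops off can be sketched in a few lines of Python. This is a deliberately simplified illustration, not BLAST's actual implementation: it extends in one direction only, allows no gaps, scores +1 for a match and -1 for a mismatch rather than using an amino acid substitution matrix, and the x_drop cutoff and toy sequences are assumptions made for the example.

```python
def extend_right(a, b, i, j, match=1, mismatch=-1, x_drop=3):
    """Extend an ungapped alignment rightward from a[i] and b[j].

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, best_length).
    """
    score = best = best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            # Remember the longest extension that achieved the best score.
            best, best_len = score, k
        elif best - score > x_drop:
            # The match has degraded too far; give up extending.
            break
    return best, best_len

a = "ACGTACGTTTTT"
b = "ACGAACGTGGGG"
best, length = extend_right(a, b, 0, 0)
print(best, length, a[:length])  # 6 8 ACGTACGT
```

The extension survives the single mismatch at the fourth position because the score recovers afterward, but it stops once the run of mismatches at the end pulls the score too far below its peak. That tolerance for isolated differences is what lets a search of this kind report imperfect matches.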

Depending on the quality of the match between two labels, you can transfer the information attached to
one label to the other. A high-quality sequence match between two full-length sequences may suggest
the hypothesis that their functions are similar, although it's important to remember that the
identification is only tentative until it's been experimentally verified. In the case of the eyeless and
aniridia genes, scientists hope that studying the role of the eyeless gene in Drosophila eye development
will help us understand how aniridia works in human eye development.

1.2 Isn't Bioinformatics Just About Building Databases?
Much of what we currently think of as part of bioinformatics (sequence comparison, sequence
database searching, sequence analysis) is more complicated than just designing and populating
databases. Bioinformaticians (or computational biologists) go beyond just capturing, managing, and
presenting data, drawing inspiration from a wide variety of quantitative fields, including statistics,
physics, computer science, and engineering. Figure 1-2 shows how quantitative science intersects with
biology at every level, from analysis of sequence data and protein structure, to metabolic modeling, to
quantitative analysis of populations and ecology.

Figure 1-2. How technology intersects with biology




Bioinformatics is first and foremost a component of the biological sciences. The main goal of
bioinformatics isn't developing the most elegant algorithms or the most arcane analyses; the goal is
finding out how living things work. Like the molecular biology methods that greatly expanded what
biologists were capable of studying, bioinformatics is a tool and not an end in itself. Bioinformaticians
are the tool-builders, and it's critical that they understand biological problems as well as computational
solutions in order to produce useful tools.

Research in bioinformatics and computational biology can encompass anything from abstraction of the
properties of a biological system into a mathematical or physical model, to implementation of new
algorithms for data analysis, to the development of databases and web tools to access them.

1.2.1 The First Information Age in Biology

Because biology is a science of the specific, biologists need to remember a lot of details as well as
general principles. Biologists have been dealing with problems of information management since the
17th century.

The roots of the concept of evolution lie in the work of early biologists who catalogued and compared
species of living things. The cataloguing of species was the preoccupation of biologists for nearly three
centuries, beginning with animals and plants and continuing with microscopic life upon the invention
of the compound microscope. New forms of life and fossils of previously unknown, extinct life forms
are still being discovered even today.

All this cataloguing of plants and animals resulted in what seemed a vast amount of information at the
time. In the 1530s, Otto Brunfels published the first major modern work describing plant
species, the Herbarum vivae eicones. As Europeans traveled more widely around the world, the
number of catalogued species increased, and botanical gardens and herbaria were established. The
number of catalogued plant types was 500 at the time of Theophrastus, a student of Aristotle. By 1623,
Caspar Bauhin had observed 6,000 types of plants. Not long after, John Ray introduced the concept of
distinct species of animals and plants, and developed guidelines based on anatomical features for
distinguishing conclusively between species. In the 1730s, Carolus Linnaeus catalogued 18,000 plant
species and over 4,000 species of animals, and established the basis for the modern taxonomic naming
system of kingdoms, classes, genera, and species. By the end of the 18th century, Baron Cuvier had
listed over 50,000 species of plants.

It was no coincidence that a concurrent preoccupation of biologists, at this time of exploration and
cataloguing, was classification of species into an orderly taxonomy. A botany text might encompass
several volumes of data, in the form of painstaking illustrations and descriptions of each species
encountered. Biologists were faced with the problem of how to organize, access, and sensibly add to
this information. It was apparent to the casual observer that some living things were more closely
related than others. A rat and a mouse were clearly more similar to each other than a mouse and a dog.
But how would a biologist know that a rat was like a mouse (but that rat was not just another name for
mouse) without carrying around his several volumes of drawings? A nomenclature that uniquely
identified each living thing and summed up its presumed relationship with other living things, all in a
few words, needed to be invented.

The solution was relatively simple, but at the time, a great innovation. Species were to be named with a
series of one-word names of increasing specificity. First a very general division was specified: animal
or plant? This was the kingdom to which the organism belonged. Then, with increasing specificity,
came the names for class, genus, and species. This schematic way of classifying species, as illustrated
in Figure 1-3, is now known as the "Tree of Life."
Figure 1-3. The "Tree of Life" represents the nomenclature system that classifies species




A modern taxonomy of the earth's millions of species is too complicated for even the most zealous
biologist to memorize, and fortunately computers now provide a way to maintain and access the
taxonomy of species. The University of Arizona's Tree of Life project and NCBI's Taxonomy database
are two examples of online taxonomy projects.

Taxonomy was the first informatics problem in biology. Now, biologists have reached a similar point
of information overload by collecting and cataloguing information about individual genes. The problem
of organizing this information and sharing knowledge with the scientific community at the gene level
isn't being tackled by developing a nomenclature. It's being attacked directly with computers and
databases from the start.

The evolution of computers over the last half-century has fortuitously paralleled the developments in
the physical sciences that allow us to see biological systems in increasingly fine detail. Figure 1-4
illustrates the astonishing rate at which biological knowledge has expanded in the last 20 years.

Figure 1-4. The growth of GenBank and the Protein Data Bank has been astronomical




Simply finding the right needles in the haystack of information that is now available can be a research
problem in itself. Even in the late 1980s, finding a match in a sequence database was worth a five-page
publication. Now this procedure is routine, but there are many other questions that follow on our ability
to search sequence and structure databases. These questions are the impetus for the field of
bioinformatics.

1.3 What Does Informatics Mean to Biologists?
The science of informatics is concerned with the representation, organization, manipulation,
distribution, maintenance, and use of information, particularly in digital form. There is more than one
interpretation of what bioinformatics (the intersection of informatics and biology) actually means,
and it's quite possible to go out and apply for a job doing bioinformatics and find that the expectations
of the job are entirely different from what you thought.

The functional aspect of bioinformatics is the representation, storage, and distribution of data.
Intelligent design of data formats and databases, creation of tools to query those databases, and
development of user interfaces that bring together different tools to allow the user to ask complex
questions about the data are all aspects of the development of bioinformatics infrastructure.

Developing analytical tools to discover knowledge in data is the second, and more scientific, aspect of
bioinformatics. There are many levels at which we use biological information, whether we are
comparing sequences to develop a hypothesis about the function of a newly discovered gene, breaking
down known 3D protein structures into bits to find patterns that can help predict how the protein folds,
or modeling how proteins and metabolites in a cell work together to make the cell function. The
ultimate goal of analytical bioinformaticians is to develop predictive methods that allow scientists to
model the function and phenotype of an organism based only on its genome sequence. This is a grand
goal, and one that will be approached only in small steps, by many scientists working together.

1.4 What Challenges Does Biology Offer Computer Scientists?
The goal of biology, in the era of the genome projects, is to develop a quantitative understanding of
how living things are built from the genome that encodes them.


Cracking the genome code is complex. At the very simplest level, we still have difficulty identifying
unknown genes by computer analysis of genomic sequence. We still have not managed to predict or
model how a chain of amino acids folds into the specific structure of a functional protein.

Beyond the single-molecule level, the challenges are immense. The amount of data in GenBank
is growing at an exponential rate, and as datatypes beyond DNA, RNA, and protein sequence
begin to undergo the same kind of explosion, simply managing, accessing, and presenting this data to
users in an intelligible form is a critical task. Human-computer interaction specialists need to work
closely with academic and clinical researchers in the biological sciences to manage such staggering
amounts of data.

Biological data is very complex and interlinked. A spot on a DNA array, for instance, is connected not
only to immediate information about its intensity, but to layers of information about genomic location,
DNA sequence, structure, function, and more. Creating information systems that allow biologists to
seamlessly follow these links without getting lost in a sea of information is also a huge opportunity for
computer scientists.

Finally, each gene in the genome isn't an independent entity. Multiple genes interact to form
biochemical pathways, which in turn feed into other pathways. Biochemistry is influenced by the
external environment, by interaction with pathogens, and by other stimuli. Putting genomic and
biochemical data together into quantitative and predictive models of biochemistry and physiology will
be the work of a generation of computational biologists. Computer scientists, mathematicians, and
statisticians will be a vital part of this effort.

1.5 What Skills Should a Bioinformatician Have?
There's a wide range of topics that are useful if you're interested in pursuing bioinformatics, and it's not
possible to learn them all. However, in our conversations with scientists working at companies such as
Celera Genomics and Eli Lilly, we've picked up on the following "core requirements" for
bioinformaticians:

• You should have a fairly deep background in some aspect of molecular biology. It can be
biochemistry, molecular biology, molecular biophysics, or even molecular modeling, but
without a core of knowledge of molecular biology you will, as one person told us, "run into
brick walls too often."
• You must absolutely understand the central dogma of molecular biology. Understanding how
and why DNA sequence is transcribed into RNA and translated into protein is vital. (In Chapter
2, we define the central dogma, as well as review the processes of transcription and translation.)
• You should have substantial experience with at least one or two major molecular biology
software packages, either for sequence analysis or molecular modeling. The experience of
learning one of these packages makes it much easier to learn to use other software quickly.
• You should be comfortable working in a command-line computing environment. Working in
Linux or Unix will provide this experience.
• You should have experience with programming in a computer language such as C/C++, as well
as in a scripting language such as Perl or Python.

There are a variety of other advanced skill sets that can add value to this background: molecular
evolution and systematics; physical chemistry (kinetics, thermodynamics, and statistical mechanics);
statistics and probabilistic methods; database design and implementation; algorithm development;
molecular biology laboratory methods; and others.

1.6 Why Should Biologists Use Computers?
Computers are powerful devices for understanding any system that can be described in a mathematical
way. As our understanding of biological processes has grown and deepened, it isn't surprising, then,
that the disciplines of computational biology and, more recently, bioinformatics, have evolved from the
intersection of classical biology, mathematics, and computer science.

1.6.1 A New Approach to Data Collection

Biochemistry is often an anecdotal science. If you notice a disease or trait of interest, the imperative to
understand it may drive the progress of research in that direction. Based on their interest in a particular
biochemical process, biochemists have determined the sequence or structure or analyzed the expression
characteristics of a single gene product at a time. Often this leads to a detailed understanding of one
biochemical pathway or even one protein. How a pathway or protein interacts with other biological
components can easily remain a mystery, due to lack of hands to do the work, or even because the need
to do a particular experiment isn't communicated to other scientists effectively.

The Internet has changed how scientists share data and made it possible for one central warehouse of
information to serve an entire research community. But more importantly, experimental technologies
are rapidly advancing to the point at which it's possible to imagine systematically collecting all the data
of a particular type in a central "factory" and then distributing it to researchers to be interpreted.

In the 1990s, the biology community embarked on an unprecedented project: sequencing all the DNA
in the human genome. Even though a first draft of the human genome sequence has been completed,
automated sequencers are still running around the clock, determining the entire sequences of genomes
from various life forms that are commonly used for biological research. And we're still fine-tuning the
data we've gathered about the human genome over the last 10 years. Immense strings of data, in which
the locations of only a relatively few important genes are known, have been and still are being
generated. Using image-processing techniques, maps of entire genomes can now be generated much
more quickly than they could with chemical mapping techniques, but even with this technology,
complete and detailed mapping of the genomic data that is now being produced may take years.

Recently, the techniques of x-ray crystallography have been refined to a degree that allows a complete
set of crystallographic reflections for a protein to be obtained in minutes instead of hours or days.
Automated analysis software allows structure determination to be completed in days or weeks, rather
than in months. It has suddenly become possible to conceive of the same type of high-throughput
approach to structure determination that the Human Genome Project takes to sequence determination.
While crystallization of proteins is still the limiting step, it's likely that the number of protein structures
available for study will increase by an order of magnitude within the next 5 to 10 years.

Parallel computing is a concept that has been around for a long time. Break a problem down into
computationally tractable components, and instead of solving them one at a time, employ multiple
processors to solve each subproblem simultaneously. The parallel approach is now making its way into
experimental molecular biology with technologies such as the DNA microarray. Microarray technology
allows researchers to conduct thousands of gene expression experiments simultaneously on a tiny chip.

Miniaturized parallel experiments absolutely require computer support for data collection and analysis.
They also require the electronic publication of data, because information in large datasets that may be
tangential to the purpose of the data collector can be extremely interesting to someone else. Finding
information by searching such databases can save scientists literally years of work at the lab bench.

The output of all these high-throughput experimental efforts can be shared only because of the
development of the World Wide Web and the advances in communication and information transfer that
the Web has made possible.

The increasing automation of experimental molecular biology and the application of information
technology in the biological sciences have led to a fundamental change in the way biological research
is done. In addition to anecdotal research (locating and studying in detail a single gene at a time), we
are now cataloguing all the data that is available, making complete maps to which we can later return
and mark the points of interest. This is happening in the domains of sequence and structure, and has
begun to be the approach to other types of data as well. The trend is toward storage of raw biological
data of all types in public databases, with open access by the research community. Instead of doing
preliminary research in the lab, scientists are going to the databases first to save time and resources.

1.7 How Can I Configure a PC to Do Bioinformatics Research?
Up to now you've probably gotten by using word-processing software and other canned programs that
run under user-friendly operating systems such as Windows or Mac OS. In order to make the most of
bioinformatics, you need to learn Unix, the classic operating system of powerful computers known as
servers and workstations. Most scientific software is developed on Unix machines, and serious
researchers will want access to programs that can be run only under Unix. Unix comes in a number of
flavors, the two most popular being BSD and SunOS. Recently, however, a third choice has entered the
marketplace: Linux. Linux is an open source Unix-like operating system. In Chapter 3, Chapter 4, and
Chapter 5, we discuss how to set up a workstation for bioinformatics running under Linux. We cover
the operating system and how it works: how files are organized, how programs are run, how processes
are managed, and most importantly, what to type at the command prompt to get the computer to do
what you want.

1.7.1 Why Use Unix or Linux?

Setting up your computer with a Linux operating system allows you to take advantage of cutting-edge
scientific research tools developed for Unix systems. As it has grown popular in the mass market,
Linux has retained the power of Unix systems for developing, compiling, and running programs,
networking, and managing jobs started by multiple users, while also providing the standard trimmings
of a desktop PC, including word processors, graphics programs, and even visual programming tools.
This book operates on the assumption that you're willing to learn how to work on a Unix system and
that you'll be working on a machine that has Linux or another flavor of Unix installed. For many of the
specific bioinformatics tools we discuss, Unix is the most practical choice.

On the other hand, Unix isn't necessarily the most practical choice for office productivity in a
predominantly Mac or PC environment. The selection of available word processing and desktop
publishing software and peripheral devices for Linux is improving as the popularity of the operating
system increases. However, it can't (yet) go head-to-head with the consumer operating systems in these
areas. Linux is no more difficult to maintain than a normal PC operating system, once you know how,
but the skills needed and the problems you'll encounter will be new at first.

As of this writing, my desktop computer has been reliably up and running Linux
for nearly five months, with the exception of a few days' time out for a hardware
failure. No software crashes, no little bombs or unhappy faces, no missing *.dll
files or mysterious error messages. Installation of Linux took about two days and
some help from tech support the first time I did it, and about one hour the second
time (on a laptop, no less). Realistically, the main problem I have encountered
being the only Linux user in a Mac/PC environment is opening email attachments
from Mac users. (CJG)

Fortunately, some of the companies selling packaged Linux distributions have substantially automated
the installation procedure, and also offer 90 days of phone and web technical support for your
installation. Companies such as Red Hat and SuSE and organizations such as Debian provide Linux
distributions for PCs, while Yellow Dog (and others) provide Linux distributions for Macintosh
computers.

There are a couple of ways to phase Linux in gradually. Of course, if you have more than one computer
workstation, you can experiment with converting one of your machines to Linux while leaving your
familiar operating system on the rest. The other choice is to do a dual boot installation. In a dual boot
installation, you create two sections (called partitions) on your hard drive, and install Linux in one of
them, with your old operating system in the other. Then, when you turn on your computer, you have a
choice of whether to start up Linux or your other operating system. You can leave all your old files and
programs where they are and start with new work in your Linux partition. Newer versions of Linux,
such as Yellow Dog Linux for the PowerPC, allow users to emulate a MacOS environment within
Linux and access software and files for both platforms simultaneously.

1.8 What Information and Software Are Available?
In Chapter 6, we cover information literacy. Only a few years ago, biologists had to know how to do
literature searches using printed indexes that led them to references in the appropriate technical
journals. Modern biologists search web-based databases for the same information and have access to
dozens of other information types as well. Knowing how to navigate these resources is a vital skill for
every biologist, computational or not.

We then introduce the basic tools you'll need to locate databases, computer programs, and other
resources on the Web, to transfer these resources to your computer, and to make them work once you
get them there. In Chapter 7 through Chapter 11 we turn to particular types of scientific questions and
the tools you will need to answer them. In some cases, there are computer programs that are becoming
the standard for solving a particular type of problem (e.g., BLAST and FASTA for amino acid and
nucleic acid sequence alignment). In other areas, where the method for solving a problem is still an
open research question, there may be a number of competing tools, or there may be no tool that
completely solves the problem.

1.8.1 Why Do I Need to Install a Program from the Web?



Handling large volumes of complex data requires a systematic and automated approach. If you're
searching a database for matches to one query, a web form will do the trick. But what if you want to
search for matches to 10,000 queries, and then sort through the information you get back to find
relationships in the results? You certainly don't want to type 10,000 queries into a web form, and you
probably don't want your results to come back formatted to look nice on a web page. Shared public web
servers are often slow, and using them to process large batches of data is impractical. Chapter 12
contains examples of how to use Perl as a driver to make your favorite program process large volumes
of data using your own computer.
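
As a rough illustration of the batch processing described above, the following sketch runs one analysis over many queries in a single pass. It is written in Python rather than the Perl used in Chapter 12, and it assumes the queries arrive as FASTA-formatted text. The names parse_fasta, run_batch, and gc_fraction are hypothetical; gc_fraction merely stands in for whatever program you would actually run on each query.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(sequence):
    """Placeholder analysis: fraction of G and C characters in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def run_batch(fasta_text, analyze):
    """Apply `analyze` to every query and collect (identifier, result) pairs."""
    return [(name, analyze(seq)) for name, seq in parse_fasta(fasta_text)]


# Two toy queries; a real batch could hold thousands of records.
queries = ">q1 first query\nATGC\nGGCC\n>q2\nATAT\n"
results = run_batch(queries, gc_fraction)
for name, value in results:
    print(f"{name}\t{value:.2f}")  # tab-separated output is easy to sort and filter
```

The point of the design is that results come back as plain data rather than as a formatted web page, so they can be sorted and compared by another program.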

1.9 Can I Learn a Programming Language Without Classes?
Anyone who has experience with designing and carrying out an experiment to answer a question has
the basic skills needed to program a computer. A laboratory experiment begins with a question, which
evolves into a testable hypothesis, that is, a statement that can be tested for truth based on the results of
an experiment or experiments. The processes developed to test the hypotheses are analogous to
computer programs. The essence of an experiment is: if you take system X, and do something to it,
what happens? The experiment that is done must be designed to have results that can be clearly
interpreted. Computer programs must also be carefully designed so that the values that are passed from
one part of a program to the next can be clearly interpreted. The human programmer must set up
unambiguous instructions to the computer and must think through, in advance, what different types of
results mean and what the computer should do with them. A large part of practical computer
programming is the ability to think critically, to design a process to answer a question, and to
understand what is required to answer the question unambiguously.

Even if you have these skills, learning a computer language isn't a trivial undertaking, but it has been
made a lot easier in recent years by the development of the Perl language. Perl, referred to by its creator
as "the duct tape of the Internet, and of everything else," began its evolution as a scripting language
optimized for data processing. It continues to evolve into a full-featured programming language, and
it's practical to use Perl to develop prototypes for virtually any kind of computer program. Perl is a very
flexible language; you can learn just enough to write a simple script to solve a one-off problem, and
after you've done that once or twice, you have a core of knowledge to build on. The key to learning
Perl is to use it and to use it right away. Just as no amount of reading the textbook can make you speak
Spanish fluently, no amount of reading O'Reilly's Learning Perl is going to be as helpful as getting out
there and trying to "speak" it. In Chapter 12, we provide example Perl code for parsing common
biological datatypes, driving and processing output from programs written in other languages, and even
a couple of Perl implementations that solve common computational biology problems. We hope these
examples inspire you to try a little programming of your own.

1.10 How Can I Use Web Information?
Chapter 6 also introduces the public databases where biological data is archived to be shared by
researchers worldwide.

While you can quickly find a single protein structure file or DNA sequence file by filling in a web form
and searching a public database, it's likely that eventually you will want to work with more than one
piece of data. You may even be collecting and archiving your own data; you may want to make a new
type of data available to a broader research community. To do these things efficiently, you need to
store data on your own computer. If you want to process your stored data using a computer program,
you need to structure your data. Understanding the difference between structured and unstructured data
and designing a data format that suits your data storage and access needs is the key to making your data
useful and accessible.

There are many ways to organize data. While most biological data is still stored in flat file databases,
this type of database becomes inefficient when the quantity of data being stored becomes extremely
large. Chapter 13 covers the basic database concepts you need to talk to database experts and to build
your own databases. We discuss the differences between flat file and relational databases, introduce the
best public-domain tools for managing databases, and show you how to use them to store and access
your data.
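
To make the contrast between a flat file and a relational database a little more concrete, here is a minimal sketch that loads a few delimited records into a table and then queries them. It uses Python's standard sqlite3 module purely as a convenient stand-in for a relational engine; the records, the table name, and the column names are all invented for illustration and are not the tools or schemas covered in Chapter 13.

```python
import sqlite3

# A toy flat file: one record per line, fields separated by "|".
flat_file = """\
P1|kinase|human|350
P2|kinase|mouse|348
P3|protease|human|212
"""


def load_flat_file(flat_text):
    """Copy flat-file records into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "id TEXT PRIMARY KEY, family TEXT, organism TEXT, length INTEGER)"
    )
    rows = []
    for line in flat_text.splitlines():
        if line:
            ident, family, organism, length = line.split("|")
            rows.append((ident, family, organism, int(length)))
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", rows)
    return conn


conn = load_flat_file(flat_file)

# With a flat file you would scan every line; here the database does the selection.
human_kinases = conn.execute(
    "SELECT id FROM protein WHERE family = ? AND organism = ? ORDER BY id",
    ("kinase", "human"),
).fetchall()
```

Once the data is in a table, questions that combine several fields become a single query instead of a custom parsing script.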

1.11 How Do I Understand Sequence Alignment Data?
It's hard to make sense of your data, or make a point, without visualization tools. The extraction of
cross sections or subsets of complex multivariate data sets is often required to make sense of biological
data. Storing your data in structured databases, which are discussed in Chapter 13, creates the
infrastructure for analysis of complex data.

Once you've stored data in an accessible, flexible format, the next step is to extract what is important to
you and visualize it. Whether you need to make a histogram of your data or display a molecular
structure in three dimensions and watch it move in real time, there are visualization tools that can do
what you want. Chapter 14 covers data-analysis and data-visualization tools, from generic plotting
packages to domain-specific programs for marking up biological sequence alignments, displaying
molecular structures, creating phylogenetic trees, and a host of other purposes.

1.12 How Do I Write a Program to Align Two Biological Sequences?
An important component of any kind of computational science is knowing when you need to write a
program yourself and when you can use code someone else has written. The efficient programmer is a
lazy programmer; she never wastes effort writing a program if someone else has already made a
perfectly good program available. If you are looking to do something fairly routine, such as aligning
two protein sequences, you can be sure that someone else has already written the program you need and
that by searching you can probably even find some source code to look at. Similarly, many
mathematical and statistical problems can be solved using standard code that is freely available in code
libraries. Perl programmers make code that simplifies standard operations available in modules; there
are many freely available modules that manage web-related processes, and there are projects underway
to create standard modules for handling biological-sequence data.

1.13 How Do I Predict Protein Structure from Sequence?
There are some questions we can't answer for you, and that's one of them; in fact, it's one of the biggest
open research questions in computational biology. What we can and do give you are the tools to find
information about such problems and others who are working on them, and even, with the proper
inspiration, to develop approaches to answering them yourself. Bioinformatics, like any other science,
doesn't always provide quick and easy answers to problems.

1.14 What Questions Can Bioinformatics Answer?

The questions that drive (and fund) bioinformatics research are the same questions humans have been
working away at in applied biology for the last few hundred years. How can we cure disease? How can
we prevent infection? How can we produce enough food to feed all of humanity? Companies in the
business of developing drugs, agricultural chemicals, hybrid plants, plastics and other petroleum
derivatives, and biological approaches to environmental remediation, among others, are developing
bioinformatics divisions and looking to bioinformatics to provide new targets and to help replace scarce
natural resources.

The existence of genome projects implies our intention to use the data they generate. The implicit goals
of modern molecular biology are, simply stated, to read the entire genomes of living things, to identify
every gene, to match each gene with the protein it encodes, and to determine the structure and function
of each protein. Detailed knowledge of gene sequence, protein structure and function, and gene
expression patterns is expected to give us the ability to understand how life works at the highest
possible resolution. Implicit in this is the ability to manipulate living things with precision and
accuracy.




Chapter 2. Computational Approaches to Biological Questions
There is a standard range of techniques that are taught in bioinformatics courses. Currently, most of the
important techniques are based on one key principle: that sequence and structural homology (or
similarity) between molecules can be used to infer structural and functional similarity. In this chapter,
we'll give you an overview of the standard computer techniques available to biologists; later in the
book, we'll discuss how specific software packages implement these techniques and how you should
use them.

2.1 Molecular Biology's Central Dogma
Before we go any further, it's essential that you understand some basics of cell and molecular biology.
If you're already familiar with DNA and protein structure, genes, and the processes of transcription and
translation, feel free to skip ahead to the next section.

The central dogma of molecular biology states that:

DNA acts as a template to replicate itself, DNA is also transcribed into RNA, and RNA is translated
into protein.

As you can see, the central dogma sums up the function of the genome in terms of information. Genetic
information is conserved and passed on to progeny through the process of replication. Genetic
information is also used by the individual organism through the processes of transcription and
translation. There are many layers of function, at the structural, biochemical, and cellular levels, built
on top of genomic information. But in the end, all of life's functions come back to the information
content of the genome.

Put another way, genomic DNA contains the master plan for a living thing. Without DNA, organisms
wouldn't be able to replicate themselves. The raw "one-dimensional" sequence of DNA, however,
doesn't actually do anything biochemically; it's only information, a blueprint if you will, that's read by
the cell's protein synthesizing machinery. DNA sequences are the punch cards; cells are the computers.

DNA is a linear polymer made up of individual chemical units called nucleotides or bases. The four
nucleotides that make up the DNA sequences of living things (on Earth, at least) are adenine, guanine,
cytosine, and thymine, designated A, G, C, and T, respectively. The order of the nucleotides in the
linear DNA sequence contains the instructions that build an organism. Those instructions are read in
processes called replication, transcription, and translation.

2.1.1 Replication of DNA

The unusual structure of DNA molecules gives DNA special properties. These properties allow the
information stored in DNA to be preserved and passed from one cell to another, and thus from parents
to their offspring. Two strands of DNA form a double-helical structure, twining around each other in
a regular pattern along their full length, which can be millions of nucleotides. The halves of the
double helix are held together by bonds between the nucleotides on each strand. The nucleotides also
bond in particular ways: A can pair only with T, and G can pair only with C. Each of these pairs is
referred to as a base pair, and the length of a DNA sequence is often described in base pairs (or bp),
kilobases (1,000 bp), megabases (1 million bp), etc.

Each strand in the DNA double helix is a chemical "mirror image" of the other. If there is an A on one
strand, there will always be a T opposite it on the other. If there is a C on one strand, its partner will
always be a G.
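
The pairing rules above are simple enough to express in a few lines. The sketch below is an illustration in Python with hypothetical function names; it computes the partner strand for a given strand. The reverse_complement helper reflects the additional fact, not discussed in the text above, that the two strands of the helix run in opposite directions, so the partner strand is conventionally written reversed.

```python
# Translation table implementing the pairing rules: A<->T and G<->C.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the base that pairs with each position of `strand`."""
    return strand.upper().translate(PAIRING)


def reverse_complement(strand):
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


template = "ATGC"
partner = complement(template)
```

Applying complement twice returns the original strand, which is the property that lets each strand serve as a template for rebuilding the other.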

When a cell divides to form two new daughter cells, DNA is replicated by untwisting the two strands
of the double helix and using each strand as a template to build its chemical mirror image, or
complementary strand. This process is illustrated in Figure 2-1.

Figure 2-1. Schematic replication of one strand of the DNA helix




2.1.2 Genomes and Genes

The entire DNA sequence that codes for a living thing is called its genome. The genome doesn't
function as one long sequence, however. It's divided into individual genes. A gene is a small, defined
section of the entire genomic sequence, and each gene has a specific, unique purpose.

There are three classes of genes. Protein-coding genes are templates for generating molecules called
proteins. Each protein encoded by the genome is a chemical machine with a distinct purpose in the
organism. RNA-specifying genes are also templates for chemical machines, but the building blocks of
RNA machines are different from those that make up proteins. Finally, untranscribed genes are regions
of genomic DNA that have some functional purpose but don't achieve that purpose by being transcribed
or translated to create another molecule.

2.1.3 Transcription of DNA

DNA can act not only as a template for making copies of itself but also as a blueprint for a molecule
called ribonucleic acid (RNA). The process by which DNA is transcribed into RNA is called
transcription and is illustrated in Figure 2-2. RNA is structurally similar to DNA. It's a polymeric
molecule made up of individual chemical units, but the chemical backbone that holds these units
together is slightly different from the backbone of DNA, allowing RNA to exist in a single-stranded
form as well as in a double helix. These single-stranded molecules still form base pairs between
different parts of the chain, causing RNA to fold into 3D structures. The individual chemical units of
RNA are designated A, C, G, and U (uracil, which takes the place of thymine).

Figure 2-2. Schematic of DNA being transcribed into RNA




The genome provides a template for the synthesis of a variety of RNA molecules: the three main types
of RNA are messenger RNA, transfer RNA, and ribosomal RNA. Messenger RNA (mRNA) molecules
are RNA transcripts of genes. They carry information from the genome to the ribosome, the cell's
protein synthesis apparatus. Transfer RNA (tRNA) molecules are untranslated RNA molecules that
transport amino acids, the building blocks of proteins, to the ribosome. Finally, ribosomal RNA
(rRNA) molecules are the untranslated RNA components of ribosomes, which are complexes of protein
and RNA. rRNAs are involved in anchoring the mRNA molecule and catalyzing some steps in the
translation process. Some viruses also use RNA instead of DNA as their genetic material.

2.1.4 Translation of mRNA

Translation of mRNA into protein is the final major step in putting the information in the genome to
work in the cell.

Like DNA, proteins are linear polymers built from an alphabet of chemically variable units. The
protein alphabet is a set of small molecules called amino acids.

Unlike DNA, the chemical sequence of a protein has physicochemical "content" as well as information
content. Each of the 20 amino acids commonly found in proteins has a different chemical nature,
determined by its side chain, a chemical group that varies from amino acid to amino acid. The
chemical sequence of the protein is called its primary structure, but the way the sequence folds up to
form a compact molecule is as important to the function of the protein as is its primary structure. The
secondary and tertiary structure elements that make up the protein's final fold can bring distant parts of
the chemical sequence of the protein together to form functional sites.

As shown in Figure 2-3, the genetic code is the code that translates DNA into protein. It takes three
bases of DNA (called a codon) to code for each amino acid in a protein sequence. Simple
combinatorics tells us that there are 4 x 4 x 4 = 64 ordered triplets that can be formed from a set of 4
nucleotides, so there are 64 possible codons but only 20 amino acids. Some codons are redundant
(several codons specify the same amino acid); others have the special function
of telling the cell's translation machinery to stop translating an mRNA molecule. Figure 2-4 shows how
RNA is translated into protein.
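
The codon arithmetic and the two information-transfer steps can be sketched in code. The following Python example is only an illustration with hypothetical function names. It builds the standard genetic code from a compact 64-letter string (amino acids in one-letter notation, with * marking stop codons), assumes the input DNA is the coding strand so that the mRNA has the same sequence with U in place of T, and ignores details such as locating the start codon.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
protein = translate(mrna)
```

Counting the entries of CODON_TABLE confirms the 64 codons, and counting its distinct values (other than the stop marker) confirms the 20 amino acids.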

Figure 2-3. The genetic code


Figure 2-4. Synthesis of protein with standard base pairing




2.1.5 Molecular Evolution

Errors in replication and transcription of DNA are relatively common. If these errors occur in the
reproductive cells of an organism, they can be passed to its progeny. Alterations in the sequence of
DNA are known as mutations. Mutations can have harmful results, that is, results that make the progeny less
likely to survive to adulthood. They can also have beneficial results, or they can be neutral. If a
mutation doesn't kill the organism before it reproduces, the mutation can become fixed in the
population over many generations. The slow accumulation of such changes is responsible for the
process known as evolution. Access to DNA sequences gives us access to a more precise understanding
of evolution. Our understanding of the molecular mechanism of evolution as a gradual process of
accumulating DNA sequence mutations is the justification for developing hypotheses based on DNA
and protein sequence comparison.


2.2 What Biologists Model
Now that we've completed our ultra-short course in cell biology, let's look at how to apply it to
problems in molecular biology. One of the most important exercises in biology and bioinformatics is
modeling. A model is an abstract way of describing a complicated system. Turning something as
complex (and confusing) as a chromosome, or the cycle of cell division, into a simplified
representation that captures all the features you are trying to study can be extremely difficult. A model
helps us see the larger picture. One feature of a good model is that it makes systems that are otherwise
difficult to study easier to analyze using quantitative approaches. Bioinformatics tools rely on our
ability to extract relevant parameters from a biological system (be it a single molecule or something as
complicated as a cell), describe them quantitatively, and then develop computational methods that use
those parameters to compute the properties of a system or predict its behavior.

To help you understand what a model is and what kind of analysis a good model makes possible, let's
look at three examples on which bioinformatics methods are based.

2.2.1 Accessing 3D Molecules Through a 1D Representation

In reality, DNA and proteins are complicated 3D molecules, composed of thousands or even millions
of atoms bonded together. However, DNA and proteins are both polymers, chains of repeating
chemical units (monomers) with a common backbone holding them together. Each chemical unit in the
polymer has two subsets of atoms: a subset of atoms that doesn't vary from monomer to monomer and
that makes up the backbone of the polymer, and a subset of atoms that does vary from monomer to
monomer.

In DNA, four nucleotide monomers (A, T, C, and G) are commonly used to build the polymer chain.
In proteins, 20 amino acid monomers are used. In a DNA chain, the four nucleotides can occur in any
order, and the order they occur in determines what the DNA does. In a protein, amino acids can occur
in any order, and their order determines the protein's fold and function.

Not too long after the chemical natures of DNA and proteins were understood, researchers recognized
that it was convenient to represent them by strings of single letters. Instead of representing each
nucleotide in a DNA sequence as a detailed chemical entity, researchers could represent it simply as
A, T, C, or G. Thus, a short piece of DNA that contains thousands of individual atoms can be represented
by a sequence of a few hundred letters. Figure 2-5 illustrates the simplified way to represent a polymer chain.

Figure 2-5. Simplifying the representation of a polymer chain




Not only does this abstraction save storage space and provide a convenient form for sharing sequence
information, it represents the nature of a molecule uniquely and correctly and ignores levels of detail
(such as atomic structure of DNA and many proteins) that are experimentally inaccessible. Many
computational biology methods exploit this 1D abstraction of 3D biological macromolecules.

The abstraction of nucleic acid and protein sequences into 1D strings has been one of the most fruitful
modeling strategies in computational molecular biology, and analysis of character strings is a
long-standing area of research in computer science. One of the elementary questions you can ask about
strings [1] is, "Do they match?" There are well-established algorithms in computer science for finding
exact and inexact matches in pairs of strings. These algorithms are applied to find pairwise matches
between biological sequences and to search sequence databases using a sequence query.
[1]
A string is simply an unbroken sequence of characters. A character is a single letter chosen from a set of defined letters, whether that be binary code (strings of
zeros and ones) or the more complicated alphabetic and numerical alphabet that can be typed on a computer keyboard.
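
As a small illustration of exact and inexact matching, the Python sketch below finds every exact occurrence of a pattern and computes the edit distance between two strings (the minimum number of single-character substitutions, insertions, and deletions needed to turn one into the other). Edit distance is chosen here as one well-known measure of inexact matching; it is not necessarily the measure used by any particular sequence-search program, and the function names are hypothetical.

```python
def find_exact(text, pattern):
    """Return the start index of every exact occurrence of `pattern` in `text`."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)  # allow overlapping matches
    return positions


def edit_distance(a, b):
    """Dynamic-programming edit distance, keeping one row of the table at a time."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


hits = find_exact("GATTACAGATTACA", "ATTA")
distance = edit_distance("GATTACA", "GACTATA")
```

An edit distance of zero corresponds to an exact match; small nonzero values indicate strings that differ in only a few positions.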


In addition to matching individual sequences, string-based methods from computer science have been
successfully applied to a number of other problems in molecular biology. For example, algorithms for
reconstructing a string from a set of shorter substrings can assemble DNA sequences from overlapping
sequence fragments. Techniques for recognizing repeated patterns in single sequences or conserved
patterns across multiple sequences allow researchers to identify signatures associated with biological
structures or functions. Finally, multiple sequence-alignment techniques allow the simultaneous
comparison of several molecules, from which evolutionary relationships between sequences can be inferred.
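
To give a feel for reconstructing a string from overlapping substrings, here is a deliberately simplified Python sketch. It uses a greedy strategy (repeatedly merge the two fragments with the longest overlap), which is only one of several possible approaches and is not claimed to be what production assemblers do. It also assumes error-free fragments read in the same direction. The function names and the min_length parameter are hypothetical.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_length=3):
    """Merge the best-overlapping pair until no overlaps of min_length remain."""
    contigs = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
assembled = greedy_assemble(fragments)
```

Real sequence data contains read errors, repeats, and fragments from both strands, which is why practical assembly is much harder than this sketch suggests.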

This simplifying abstraction of DNA and protein sequence seems to ignore a lot of biology. The
cellular context in which biomolecules exist is completely ignored, as are their interactions with other
molecules and their molecular structure. And yet it has been shown over and over that matches between
biological sequences (for example, in the detection of similarity in eye-development genes in humans
and flies, as we discussed in Chapter 1) can be biologically meaningful.


2.2.2 Abstractions for Modeling Protein Structure

There is more to biology than sequences. Proteins and nucleic acids also have complex 3D structures
that provide clues to their functions in the living organism. Molecular structures are usually represented
as collections of atoms, each of which has a defined position in 3D space. Structure analysis can be
performed on static structures, or movements and interactions in the molecules can be studied with
molecular simulation methods.

Standard molecular simulation approaches model proteins as a collection of point masses (atoms)
connected by bonds. The bond between two atoms has a standard length, derived from experimental
chemistry, and an associated applied force that constrains the bond at that length. The angle between
three adjacent atoms has a standard value and an applied force that constrains the bond angle around
that value. The same is true of the dihedral angle described by four adjacent atoms. In a molecular
dynamics simulation, energy is added to the molecular system by simulated "heating." Following
standard Newtonian laws, the atoms in the molecule move. The energy added to the system provides an
opposing force that moves atoms in the molecule out of their standard conformations. The actions and
reactions of hundreds of atoms in a molecular system can be simulated using this abstraction.
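
The idea of a standard bond length plus a restraining force can be illustrated with a single bond. The Python sketch below assumes a harmonic (spring-like) energy term, E = 0.5 * k * (r - r0)^2, which is a common choice but is an assumption here, since the text does not specify a functional form. The coordinates, rest length, and force constant are made-up values.

```python
import math


def bond_energy(atom_a, atom_b, rest_length, force_constant):
    """Harmonic energy of a bond stretched or compressed away from its rest length."""
    stretch = math.dist(atom_a, atom_b) - rest_length
    return 0.5 * force_constant * stretch ** 2


# Two atoms treated as point masses at 3D positions (arbitrary units).
relaxed = bond_energy((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), rest_length=1.5, force_constant=300.0)
stretched = bond_energy((0.0, 0.0, 0.0), (1.6, 0.0, 0.0), rest_length=1.5, force_constant=300.0)
```

The energy is zero at the standard length and rises as the bond is distorted, which is what pulls the atoms back toward their standard geometry during a simulation. Angle and dihedral terms can be written in the same spirit.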

However, the computational demands of molecular simulations are huge, and there is some uncertainty
both in the force field (the collection of standard forces that model the molecule) and in the modeling
of nonbonded interactions (interactions between nonadjacent atoms). So it has not proven possible to
predict protein structure using the all-atom modeling approach.

Some researchers have recently had moderate success in predicting protein topology for simple
proteins using an intermediate level of abstraction: more than linear sequence, but less than an
all-atom model. In this case, the protein is treated as a series of beads (representing the individual amino
acids) on a string (representing the backbone). Beads may have different characters to represent the
differences in the amino acid sidechains. They may be positively or negatively charged, polar or
nonpolar, small or large. There are rules governing which beads will attract each other. Like charges
repel; unlike charges attract. Polar groups cluster with other polar groups, and nonpolar with nonpolar.
There are also rules governing the string; mainly that it can't pass through itself in the course of the
simulation. The folding simulation itself is conducted through sequential or simultaneous perturbation
of the position of each bead.
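
A toy version of the beads-on-a-string idea is sketched below in Python. It makes several simplifying assumptions that are not taken from the text: beads sit on a 2D square lattice, each bead carries a charge of +1, -1, or 0, and only beads on neighboring lattice sites interact. The two functions, whose names are hypothetical, check the rule that the string cannot pass through itself and score a conformation so that like charges raise the energy and unlike charges lower it.

```python
def is_self_avoiding(positions):
    """True if no two beads occupy the same lattice site."""
    return len(set(positions)) == len(positions)


def contact_energy(positions, charges):
    """Sum charge products over bead pairs in contact but not adjacent in the chain."""
    energy = 0
    for i in range(len(positions)):
        for j in range(i + 2, len(positions)):  # skip neighbours along the string
            (x1, y1), (x2, y2) = positions[i], positions[j]
            if abs(x1 - x2) + abs(y1 - y2) == 1:  # adjacent lattice sites
                energy += charges[i] * charges[j]
    return energy


# A four-bead chain folded into a U shape, bringing its two ends into contact.
u_shape = [(0, 0), (1, 0), (1, 1), (0, 1)]
attractive = contact_energy(u_shape, [+1, 0, 0, -1])  # unlike charges at the ends
repulsive = contact_energy(u_shape, [+1, 0, 0, +1])   # like charges at the ends
```

A folding simulation of this kind would repeatedly perturb bead positions, reject moves that violate self-avoidance, and favor conformations with lower energy.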

2.2.3 Mathematical Modeling of Biochemical Systems

Using theoretical models in biology goes far beyond the single molecule level. For years, ecologists
have been using mathematical models to help them understand the dynamics of changes in
interdependent populations. What effect does a decrease in the population of a predator species have on
the population of its prey? What effect do changes in the environment have on population? The
answers to those questions are theoretically predictable, given an appropriate mathematical model and a
knowledge of the sizes of populations and their standard rates of change due to various factors.
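
One classic example of such a model is the Lotka-Volterra predator-prey system, used here as an illustrative choice rather than as the specific model the text has in mind. In the Python sketch below, prey grow at rate alpha and are eaten at rate beta per predator, while predators die at rate gamma and reproduce at rate delta per prey. The parameter values are invented, and the simple fixed-step (Euler) integration is chosen for brevity, not accuracy.

```python
def lotka_volterra(prey, predators, alpha, beta, gamma, delta, dt=0.01, steps=100):
    """Advance both populations through `steps` Euler steps of size `dt`."""
    for _ in range(steps):
        d_prey = prey * (alpha - beta * predators)       # growth minus predation
        d_pred = predators * (delta * prey - gamma)      # reproduction minus death
        prey += d_prey * dt
        predators += d_pred * dt
    return prey, predators


params = dict(alpha=1.0, beta=0.1, gamma=1.5, delta=0.075)

# At prey = gamma/delta and predators = alpha/beta the populations are in balance.
balanced = lotka_volterra(20.0, 10.0, **params)

# Halving the predator population lets the prey population grow.
fewer_predators = lotka_volterra(20.0, 5.0, **params)
```

Running the model with different starting populations is the computational analogue of asking what happens to the prey when the predator population drops.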

In molecular biology, a similar approach, called metabolic control analysis, is applied to biochemical
reactions that involve many molecules and chemical species. While cells contain hundreds or thousands
of interacting proteins, small molecules, and ions, it's possible to create a model that describes and
predicts a small corner of that complicated metabolism. For instance, if you are interested in the
biological processes that maintain different concentrations of hydrogen ions on either side of the
mitochondrial inner membrane in eukaryotic cells, it's probably not necessary for your model to include
the distant group of metabolic pathways that are closely involved in biosynthesis of the heme structure.

Metabolic models describe a biochemical process in terms of the concentrations of chemical species
involved in a pathway, and the reactions and fluxes that affect those concentrations. Reactions and
fluxes can be described by differential equations; they are essentially rates of change in concentration.
What makes metabolic simulation interesting is the possibility of modeling dozens of reactions
simultaneously to see what effect they have on the concentration of particular chemical species. Using
a properly constructed metabolic model, you can test different assumptions about cellular conditions
and fine-tune the model to simulate experimental observations. That, in turn, can suggest testable
hypotheses to drive further research.
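
A minimal metabolic model of this kind can be sketched for a hypothetical two-step pathway, S -> I -> P, in which a substrate is converted to an intermediate and then to a product. The Python example below assumes simple mass-action kinetics (each flux is a rate constant times a concentration) and uses fixed-step Euler integration; the rate constants and concentrations are invented, and real pathway models typically use more realistic enzyme kinetics.

```python
def simulate_pathway(substrate, k1, k2, dt=0.01, steps=200):
    """Integrate dS/dt = -k1*S, dI/dt = k1*S - k2*I, dP/dt = k2*I."""
    intermediate, product = 0.0, 0.0
    for _ in range(steps):
        flux1 = k1 * substrate      # rate of S -> I
        flux2 = k2 * intermediate   # rate of I -> P
        substrate += -flux1 * dt
        intermediate += (flux1 - flux2) * dt
        product += flux2 * dt
    return substrate, intermediate, product


# Test two assumptions about the second step: a slow enzyme versus a fast one.
slow_second_step = simulate_pathway(1.0, k1=1.0, k2=0.5)
fast_second_step = simulate_pathway(1.0, k1=1.0, k2=2.0)
```

Comparing the two runs shows the intermediate accumulating when the second step is slow, the kind of prediction that can then be checked against experimental measurements.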

2.3 Why Biologists Model
We've mentioned more than once that theoretical modeling provides testable hypotheses, not definitive
answers. It sometimes isn't so easy to maintain this distinction, especially with pairwise sequence
comparison, which seems to provide such ready answers. Even identification of genes based on
sequence similarity ultimately needs to be validated experimentally. It's not sufficient to say that an
unknown DNA sequence is similar to the sequence of a gene that has been subject to detailed
characterization, so therefore it must have an identical function. The two sequences could be distantly
related but have evolved to have different functions. However, it's altogether reasonable to use
sequence similarity as the starting point for verification; if sequence homology suggests that an
unknown gene is similar to citrate synthases, your first experimental approach might be to test the
unknown gene product for citrate synthase activity.

One of the main benefits of using computational tools in biology is that it becomes easier to preselect
targets for experimentation in molecular biology and biochemistry. Using everything from sequence
profiling methods to geometric and physicochemical analysis of protein structures, researchers can
focus narrowly on the parts of a sequence or structure that appear to have some functional significance.
Only a decade ago, this focusing might have been done using "shotgun" approaches to site-directed
mutagenesis, in which random single-residue mutants of a protein were created and characterized in
order to select possible targets. Functional genomics and metabolic reconstruction efforts are beginning
to provide biochemists with a framework for narrowing their research focuses as well.

For the researcher focused on developing bioinformatics methods, the discovery of general rules and
properties in data is by far the most interesting category of problems that can be addressed using a
computer. It's also a diverse category and one we can't give you many rules for. Researchers have
found interesting and useful properties in everything from sequence patterns to the separation of atoms
in molecular structures and have applied these findings to produce such tools as genefinders, secondary
structure prediction tools, profile methods, and homology modeling tools.

Bioinformatics researchers are still tackling problems that currently have reasonably successful
solutions, from basecalling to sequence alignment to genome comparison to protein structure modeling,
attempting to improve the accuracy and range of these procedures. Information-technology experts are
currently developing database structures and query tools for everything from gene-expression data to
intermolecular interactions. As in any other field of research, there are many niches of inquiry available,
and the only way to find them is to delve into the current literature.


2.4 Computational Methods Covered in This Book
Molecular biology research is a fast-growing area. The amount and type of data that can be gathered is
exploding, and the trend of storing this data in public databases is spilling over from genome sequence
to all sorts of other biological datatypes. The information landscape for biologists is changing so
rapidly that anything we say in this book is likely to be somewhat behind the times before it even hits
the shelves.

Yet, since the inception of the Human Genome Project, a core set of computational approaches has
emerged for dealing with the types of data that are currently shared in public databases: DNA, protein
sequence, and protein structure. Although databases containing results from new high-throughput
molecular biology methods have not yet grown to the extent the sequence databases have, standard
methods for analyzing these data have begun to emerge.

While not exhaustive, the following list gives you an overview of the computational methods we
address in this book:

Using public databases and data formats

The first key skill for biologists is to learn to use online search tools to find information.
Literature searching is no longer a matter of looking up references in a printed index. You can
