

in the model that's being built, based on the distance between the two homologous residues in the
template structure. Restraints can also be applied to bond angles, dihedral angles, and pairs of
dihedrals. By applying enough of these spatial restraints, Modeller effectively limits the number of
conformations the model can assume.

The exact form of the restraints is based on a statistical analysis of differences between pairs of
homologous structures. What those statistics contribute is a quantitative description of how much
various properties are likely to vary among homologous structures. The amount of allowed variation
between, for instance, equivalent alpha-carbon-to-alpha-carbon distances is expressed as a PDF, or
probability density function.

What the use of PDF-based restraints allows you to do, in homology modeling, is to build a structure
that isn't exactly like the template structure. Instead, the structure of the model is allowed to deviate
from the template but only in a way consistent with differences found between homologous proteins of
known structure. For instance, if a particular dihedral angle in your template structure has a value of
-60°, the PDF-based restraint you apply should allow that dihedral angle to assume a value of -60° plus or
minus some value. That value is determined by what is observed in known pairs of homologous
structures, and it's assigned probabilistically, according to the form of the probability density function.
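As a rough illustration of how a PDF-based restraint behaves, the sketch below scores deviations from a template value using a simple Gaussian density. The Gaussian form and the 15° spread are illustrative assumptions for this example, not Modeller's actual statistical potentials:

```python
import math

def restraint_density(x, template_value, sigma):
    """Gaussian probability density for a model feature (a distance or a
    dihedral angle) given the template value and a spread (sigma) derived
    from variation observed between pairs of homologous structures."""
    z = (x - template_value) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# A template dihedral of -60 degrees; sigma = 15 degrees is illustrative.
# Small deviations stay probable; large ones are strongly penalized.
for angle in (-60.0, -45.0, -15.0):
    print(f"{angle:6.1f} deg -> density {restraint_density(angle, -60.0, 15.0):.5f}")
```

In practice the optimizer works with the negative logarithm of such densities, so improbable conformations contribute large penalties to the objective function.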

Homology-based spatial restraints aren't the only restraints applied to the model. A force field
controlling proper stereochemistry is also applied, so that the model structure can't violate the rules of
chemistry to satisfy the spatial restraints derived from the template structures. All chemical restraints
and spatial restraints applied to the model are combined in a function (called an objective function)
that's optimized in the course of the model building process.

ModBase: a database of automatically generated models

The developers of Modeller have made available a queryable online database of annotated homology
models. The models are prepared using an automated prediction pipeline. The first step in the pipeline
is to compare each unknown protein sequence with a database of existing protein structures. Proteins
that have significant sequence homology to domains of known structures are then modeled using those
structures as templates. Unknown sequences are aligned to known structures using the ALIGN2D
command in Modeller, and 3D structures are built using the Modeller program. The final step in the
pipeline is to evaluate the models; results of the evaluation step are presented to the ModBase user as
part of the search results. Since this is all standard procedure for homology-model development that's
managed by a group of expert homology modelers, checking ModBase before you set off to build a
homology model on your own is highly recommended.

The general procedure for building a model with Modeller is to identify homologies between the
unknown sequence and proteins of known structures, build a multiple alignment of known structures
for use as a template, and apply the Modeller algorithm to the unknown sequence. Models can
subsequently be evaluated using standard structure-evaluation methods.

The SWISS-MODEL server

SWISS-MODEL is an automated homology modeling web server based at the Swiss Institute of
Bioinformatics. SWISS-MODEL allows you to submit a sequence and get back a structure
automatically. The automated procedure that's used by SWISS-MODEL mimics the standard steps in a
homology modeling project:

1. Uses BLAST to search the protein structure database for sequences of known structure
2. Selects templates and looks for domains that can be modeled based on non-homologous
3. Uses a model-building program to generate a model
4. Uses a molecular mechanics force field to optimize the model

You must supply an unknown sequence to initiate a SWISS-MODEL run in their First Approach mode;
however, you can also select the template chains that are used in the model building process. This
information is entered via a simple web form. You can have the results sent to you as a plain PDB file,
or as a project file that can be opened using the SWISS-PDBViewer, a companion tool for the SWISS-
MODEL server you can download and install on your own machine.

Although that sounds simple, such an automatic procedure is error-prone. In a nonautomated molecular
modeling project, there is plenty of room for user intervention. SWISS-MODEL actually allows you to
intervene in the process using their Project Mode. In Project Mode, you can use the SWISS-
PDBViewer to align your template and target sequences manually, then write out a project file, and
upload it to the SWISS-MODEL server.

10.6.2 Tools for Ab-Initio Prediction

Since ab-initio structure prediction from sequence has not been done with any great degree of success
so far, we can't recommend software for doing this routinely. If you are interested in the ab-initio
structure-prediction problem and want to familiarize yourself with current research in the field, we
suggest you start with any of these tools: the software package RAMP developed by Ram Samudrala,
the I-Sites/ ROSETTA prediction server developed by David Baker and Chris Bystroff, and the
ongoing work of John Moult. Recent journal articles describing these methods can serve as adequate
entry points into the ab-initio protein structure prediction literature.

10.7 Putting It All Together: A Protein Modeling Project
So how do all of these tools work to produce a protein structure model from sequence? We haven't
described every single point and click in this chapter, because most of the software is web-based and
quite self-explanatory in that respect. However, you may still be wondering how you'd put it all
together to manage a tough modeling project.

As an example, we built a model of a target sequence from CASP 4, the most recent CASP
competition. We've deliberately chosen a difficult sequence to model. There are no unambiguously
homologous structures in the PDB, though there are clues that can be brought together to align the
target with a possible template and build a model. We make no claims that the model is correct; its
purpose is to illustrate the kind of process you might go through to build a partial 3D model of a protein
based on a distant similarity.

The process for building an initial homology model when you do have an obvious, strong homology to
a known structure is much more straightforward: simply align the template and target along their full
length, edit the alignment if necessary, write it out in a format that Modeller can read, and submit; or
submit the sequence of your unknown to SWISS-MODEL in First Approach mode.

10.7.1 Finding Homologous Structures

The first step in any protein modeling project is to find a template structure (if possible) to base a
homology model on.

Using the target sequence T0101 from CASP 4, identified as a "400 amino acid pectate lyase L" from a
bacterium called Erwinia chrysanthemi, we searched the PDB for homologs. We started by using the
PDB SearchFields form to initiate a FASTA search.

The results returned were disheartening at first glance. As the CASP target list indicated, there were no
strong sequence homologies to the target to be found in the PDB. None of the matches had E-values
less than 1, though there were several in the less-than-10 range. None of the matches spanned the full
length of the protein; the longest match was a 330 amino acid overlap with a chondroitinase, with
an E-value of 3.9.

Each of the top scoring proteins looked different, too, as you can see in Figure 10-5. The top match was
an alpha-beta barrel type structure, while the second match (the chondroitinase) was a mainly beta
structure with a few decorative alpha helices, and the third match was an entirely different,
multidomain all-beta structure.

Figure 10-5. Pictures of top three sequence matches of a target sequence from CASP 4

Out of curiosity, we also did a simple keyword search for pectate lyase in the PDB. There were eight
pectate lyase structures listed, but none, apparently, were close enough in sequence to the T0101 target
to be recognized as related by sequence information alone. None of these structures was classified as
pectate lyase L; they included types A, E, and C. However, we observed that each of the pectate lyase
molecules in the PDB had a common structural feature: a large, quasihelical arrangement of short beta
strands known as a beta solenoid, or, less picturesquely, as a single-stranded right-handed beta helix
(Figure 10-6).

Figure 10-6. The beta-solenoid domain

10.7.2 Looking for Distant Homologies

We used CE to examine the structural neighbors of the known pectate lyases. Interestingly, one of the
three proteins (1DBG, a chondroitinase from Flavobacterium heparinium) we first identified with
FASTA as a sequence match for our target sequence showed up as a high-scoring structural neighbor of
many of the known pectate lyases.

Although the homology between T0101 and these other pectate lyases wasn't significant, the sequence
similarity between T0101 and their close structural neighbor 1DBG seemed to suggest that the structure
of our target protein might be distantly related to that of the other pectate lyases (Figure 10-7). Note
that the alignment in the figure shows a strongly conserved structural domain: the ladderlike structure
at the right of the molecule where the chain traces coincide.

Figure 10-7. A structural alignment of known pectate lyase structures; the beta solenoid domain is
visible as a ladderlike structure in the center of the molecule

However, in order to do any actual homology modeling, we need to somehow align the T0101
sequence to potential template structures. And since none of the pectate lyase sequences were similar
enough to the unknown sequence to be aligned to it using standard pairwise alignment, we would need
to get a little bit crafty with our alignment strategy.

10.7.3 Predicting Secondary Structure from Sequence

We applied several secondary structure prediction algorithms to the T0101 target sequence using the
JPred structure prediction server. While the predictions from each method aren't exactly the same, we
can see from Figure 10-8 that the consensus from JPred is clear: the T0101 sequence is predicted to
form many short stretches of beta structure, exactly the pattern that is required to form the beta-
solenoid domain.

Figure 10-8. Partial secondary structure predictions for T0101, from JPred

10.7.4 Using Threading Methods to Find Potential Folds

We also sent the sequence to the 3D-PSSM and 123D+ threading servers to analyze its fitness for
various protein folds. The top-scoring results from the 3D-PSSM threading server, with E-values in the
95% certainty range, included the proteins 1AIR (a pectate lyase), 1DBG (the chondroitinase that was
identified as a homolog of our unknown by sequence-based searching), 1IDK, and 1PCL, all pectate
lyases identified by CE as structural neighbors of 1DBG. These proteins were also found in the top
results from 123D+.

10.7.5 Using Profile Methods to Align Distantly Related Sequences

We now had evidence from multiple sources that suggested the structures 1AIR, 1DBG, and 1IDK
would be reasonable candidates to use as templates to construct a model of the T0101 unknown
sequence. However, the remaining challenge was to align the unknown sequence to the template
structures. We had many different alignments to work with: the initial FASTA alignment of the
unknown with 1DBG; the CE structural alignment of 1DBG and its structural neighbors 1AIR
and 1IDK; and the individual alignments of predicted secondary structure in the unknown to known
secondary structure for each of the database hits from 3D-PSSM. Each alignment was different, of
course, because they were generated by different methods. We chose to combine the information in the
individual 3D-PSSM sequence-to-structure alignments of the unknown sequence with 1AIR and 1IDK
into a single alignment file. We did this by aligning those two alignments to each other using Clustal's
Profile Alignment mode. Finally, we wrote out the alignment to a single file in a format appropriate for
Modeller and used this as input for our first approach.

10.7.6 Building a Homology Model

We created the following input for Modeller:

The input script, peclyase.top:

# Homology modelling by the MODELLER TOP routine 'model'.

INCLUDE # Include the predefined TOP routines
SET ALNFILE = 'peclyase.ali' # alignment filename
SET KNOWNS = '1air','1idk' # codes of the templates
SET SEQUENCE = 't0101' # code of the target
SET ATOM_FILES_DIRECTORY = './templates' # directories for input atom files
SET STARTING_MODEL= 1 # index of the first model
SET ENDING_MODEL = 3 # index of the last model
# (determines how many models to calculate)
CALL ROUTINE = 'model' # do homology modeling

We created a sequence alignment file, peclyase.ali, in PIR format, built as described in the example and
modified to indicate to Modeller whether each sequence was a template or a target.

We also placed PDB files, containing structural information for the template chains of 1AIR and 1IDK,
in a templates subdirectory of our working directory. The files were named 1air.atm and 1idk.atm, as
Modeller requires, and we then ran Modeller to create structural models. The models looked similar to
their templates, especially in the beta solenoid domain, and evaluated reasonably well by standard
methods of structure verification, including 3D/1D profiles and geometric evaluation methods.
However, just like the actual CASP 4 competitors, we await the publication of the actual structure of
the T0101 target for validation of our structural model.

10.8 Summary
Solving protein structure is complicated at best, but as you've seen, there are a number of software tools
to make it easier. Table 1 provides a summary of the most popular structure prediction tools and
techniques available to you.

Table 1. Structure Prediction Tools and Techniques

What you do                  Why you do it                                                    What you use to do it
Secondary structure          As a starting point for classification and structural            JPred, Predict-Protein
prediction                   modeling
Threading                    To check the fitness of a protein sequence to assume a known     3D-PSSM, PhD, 123D
                             fold; to identify distantly related structural homologs
Homology modeling            To build a model from a sequence, based on homologies to         Modeller, SWISS-MODEL
                             known structures
Model verification           To check the fitness of a modeled structure for its protein
                             sequence
Ab-initio structure          To predict a 3D structure from sequence in the absence of        ROSETTA, RAMP
prediction                   homology

Chapter 11. Tools for Genomics and Proteomics
The methods we have discussed so far can be used to analyze a single sequence or structure and
compare multiple sequences of single-gene length. These methods can help you understand the
function of a particular gene or the mechanism of a particular protein. What you're more likely to be
interested in, though, is how gene functions manifest in the observable characteristics of an organism:
its phenotype. In this chapter we discuss some datatypes and tools that are beginning to be available for
studying the integrated function of all the genes in a genome.

What sets genomic science apart from the traditional experimental biological sciences is the emphasis
on automated data gathering and integration of large volumes of information. Experimental strategies
for examining one gene or one protein are gradually being replaced by parallel strategies in which
many genes are examined simultaneously. Bioinformatics is absolutely required to support these
parallel strategies and make the resulting data useful to the biology community at large. While
bioinformatics algorithms may be complicated, the ultimate goals of bioinformatics and genomics are
straightforward. Information from multiple sources is being integrated to form a complete picture of
genomic function and its expression as the phenotype of an organism, as well as to allow comparison
between the genomes of different organisms. Figure 11-1 shows the sort of flowchart you might create
when moving from genetic function to phenotypic expression.

Figure 11-1. A flowchart moving from genome to phenotype

From the molecular level to the cellular level and beyond, biologists have been collecting information
about pieces of this picture for decades. As in the story of the blind men and the elephant, focusing on
specific pieces of the biological picture has made it difficult to step back and see the functions of the
genome as a whole. The process of automating and scaling up biochemical experimentation, and
treating biochemical data as a public resource, is just beginning.

The Human Genome Project has not only made gigabytes of biological sequence information available;
it has begun to change the entire landscape of biological research by its example. Protein structure
determination has not yet been automated at the same level as sequence determination, but several pilot
projects in structural genomics are underway, with the goal of creating a high-speed structure
determination pipeline. The concept behind the DNA microarray experiment (thousands of
microscopic experiments arrayed on a chip and running in parallel) doesn't translate trivially to other
types of biochemical and molecular biology experiments. Nonetheless, the trend is toward efficiency,
miniaturization, and automation in all fields of biological experimentation.

A long string of genomic sequence data is inherently about as useful as a reference book with no
subject headings, page numbers, or index. One of the major tasks of bioinformatics is creating software
systems for information management that can effectively annotate each part of a genome sequence with
information about everything from its function, to the structure of its protein product (if it has one), to
the rate at which the gene is expressed at different life stages of an organism. Currently, the only way
to get the functional information that is needed to fully annotate and understand the workings of the
genome is traditional biochemical experimentation, one gene or protein at a time. The growing genome
databases are the motivation for further parallelization and automation of biological research.

Another task of genome information management systems is to allow users to make intuitive, visual
comparisons between large data sets. Many new data integration projects, from visual comparison of
multiple genomes to visual integration of expression data with genome map data, are under development.

Bioinformatics methods for genome-level analysis are obviously not as advanced in their development
as sequence and structure analysis methods. Sequence and structure data have been publicly available
since the 1970s; a significant number of whole genomes have become available only very recently. In
this chapter, we focus on some data presentation and search strategies the research community has
identified as particularly useful in genomic science.

11.1 From Sequencing Genes to Sequencing Genomes
In Chapter 7, we discussed the chemistry that allows us to sequence DNA by somehow producing a
ladder of sequence fragments, each differing in size by one base, which can then be separated by
electrophoresis. One of the first computational challenges in the process of sequencing a gene (or a
genome) is the interpretation of the pattern of fragments on a sequencing gel.

11.1.1 Analysis of Raw Sequence Data: Basecalling

The process of assigning a sequence to raw data from DNA sequencing is called basecalling. As an end
user of genome sequence data, you don't have access to the raw data directly from the sequencer; you
have to rely on a sequence that has been assigned to this data by some kind of processing software.
While it's not likely you will actually need basecalling software, it is helpful to remember what the
software does and that it can give rise to errors.

If this step doesn't produce a correct DNA sequence, any subsequent analysis of the sequence is
affected. All sequences deposited in public databases are affected by basecalling errors due to
ambiguities in sequencer output or to equipment malfunctions. EST and genome survey sequences have
the highest error rates (1/10 - 1/100 errors per base), followed by finished sequences from small
laboratories (1/100 - 1/1,000 errors per base) and finished sequences from large genome sequencing
centers (1/10,000 - 1/100,000 errors per base). Any sequence in GenBank is likely to have at least one error.
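To see why, assume (for illustration) that errors strike independently at each base; the probability of at least one error in a sequence then follows directly from the per-base rate:

```python
def p_at_least_one_error(length, per_base_error_rate):
    """Probability that a sequence of the given length contains at least one
    basecalling error, assuming errors occur independently at each base."""
    return 1.0 - (1.0 - per_base_error_rate) ** length

# Even at the best large-center rate (1 error in 10,000 bases), a 10 kb
# sequence has a roughly 63% chance of containing at least one error.
print(round(p_at_least_one_error(10_000, 1e-4), 3))  # → 0.632
```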

Improving sequencing technology, and especially the signal detection and processing involved in DNA
sequencing, is still the subject of active research.
These values were provided by Dr. Sean Eddy of Washington University.

There are two popular high-throughput protocols for DNA sequencing. As discussed earlier, DNA
sequencing as it is done today relies on the ability to create a ladder of fragments of DNA at single-
base resolution and separate the DNA fragments by gel electrophoresis. The popular Applied
Biosystems sequencers label the fragmented DNA with four different fluorescent labels, one for each
base-specific fragmentation, and run a mixture of the four samples in one gel lane. Another commonly
used automated sequencer, the Pharmacia ALF instrument, runs each sample in a separate, closely
spaced lane. In both cases, the gel is scanned with a laser, which excites each fluorescent band on the
gel in sequence. In the four-color protocol, the fluorescence signal is elicited by a laser perpendicular to
the gel, one lane at a time, and is then filtered using four colored filters to obtain differential signals
from each fluorescent label. In the single-color protocol, a laser parallel to the gel excites all four lanes
from a single sequencing experiment at once, and the fluorescent emissions are recorded by an array of
detectors. Each of these protocols has its advantages in different types of experiments, so both are in
common use. Obviously, the differences in hardware result in differences in the format of the data
collected, and the use of proprietary file formats for data storage doesn't resolve this problem.

There are a variety of commercial and noncommercial tools for automated basecalling. Some of them
are fully integrated with particular sequencing hardware and input datatypes. Most of them allow, and
in fact require, curation by an expert user as sequence is determined.

The raw result of sequencing is a record of fluorescence intensities at each position in a sequencing gel.
Figure 11-2 shows detector output from a modern sequencing experiment. The challenge for automated
basecalling software is to resolve the sequence of apparent fluorescence peaks into four-letter DNA
sequence code. While this seems straightforward, there are fairly hard limits on how much sequence
can be read in a single experiment. Because separation of bands on a sequencing gel isn't perfect, the
quality of the separation and the shape of the bands deteriorates over the length of the gel. Peaks
broaden and intermingle, and at some point in the sequencing run (usually 400-500 bases), the peaks
become impossible to resolve. Various properties of DNA result in nonuniform reactions with the
sequencing gel, so that fragment mobility is slightly dependent on the identity of the last base in a
fragment; overall signal intensities can depend on local sequence and on the reagents used in the
experiment. Unreadable regions can occur when DNA fragments fold back upon themselves or when a
sequencing primer reacts with more than one site in a DNA sequence, leading to sample heterogeneity.
Because these are fairly well-understood, systematic errors, computer algorithms can be developed to
compensate for them. The ultimate goal of basecalling software development is to improve the
accuracy of each sequence read, as well as to extend the range of sequencing runs, by providing means
to deconvolute the more ambiguous fluorescence peaks at the end of the run.

Figure 11-2. Detector output from a modern sequencing experiment

Most sequencing projects deal with the inherent errors in the sequencing process by simply sequencing
each region of a genome multiple times and by sequencing both DNA strands (which results in high-
quality reads of both ends of a sequence). If you read that a genome has been sequenced with 4X
coverage or 10X coverage, that means that portion of the genome has been sequenced multiple times,
and the results merged to produce the final sequence.

Modern sequencing technologies replace gels with microscopic capillary systems, but the core concepts
of the process are the same as in gel-based sequencing: fragmentation of the DNA and separation of
individual fragments by electrophoresis.

At this point, the major genome databases don't provide raw sequence data to users, and for most
applications, access to raw sequence data isn't really necessary. However, it is likely that, with the
constant growth of computing power, this will change in the future, and that you may want to know
how to reanalyze the raw sequence data underlying the sequences available in the public databases.

One noncommercial software package for basecalling is Phred, which is available from the University
of Washington Genome Center. Phred runs on either Unix or Windows NT workstations. It uses
Fourier analysis to resolve fluorescence traces to predict an evenly spaced set of peak locations, then
uses dynamic programming to match the actual peak locations with the predicted results. It then
annotates output from basecalling with the probability that the call is an error. Phred scores represent
scaled negative log probability that a base call is in error; hence, the higher the Phred score, the lower
the probability that an error has been made. These values can be helpful in determining whether a
region of the genome may need to be resequenced. Phred can also feed data to a sequence-assembly
program called Phrap, which then uses both the sequence information and quality scores to aid in
sequence assembly.
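The Phred score is defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A quick sketch of the conversion in both directions:

```python
import math

def phred_to_error_prob(q):
    """Error probability corresponding to a Phred quality score."""
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability."""
    return -10.0 * math.log10(p)

print(phred_to_error_prob(20))           # → 0.01 (1 error in 100 calls)
print(phred_to_error_prob(30))           # → 0.001 (1 error in 1,000 calls)
print(round(error_prob_to_phred(1e-4)))  # → 40
```

A Phred score of 20 (often used as a quality cutoff) thus corresponds to 99% confidence in a call.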

11.1.2 Sequencing an Entire Genome

Genome sequencing isn't simply a scaled-up version of a gene-sequencing run. As noted earlier, the
sequence length limit of a sequencing run is something like 500 base pairs. And the length of a genome
can range from tens of thousands to billions of base pairs. So in order to sequence an entire genome,
the genome has to be broken into fragments, and then the sequenced fragments need to be reassembled
into a continuous sequence.

There are two popular strategies for sequencing genomes: the shotgun approach and the clone contig
approach. Combinations of these strategies are often used to sequence larger genomes.

The shotgun approach

Shotgun DNA sequencing is the ultimate automated approach to DNA sequencing. In shotgun
sequencing, a length of DNA, either a whole genome or a defined subset of the genome, is broken into
random fragments. Fragments of manageable length (around 2,000 bp) are cloned into plasmids
(together, all the clones are called a clone library). Plasmids are simple biological vectors that can
incorporate any random piece of DNA and reproduce it quickly to provide sufficient material for
sequencing.

If a sufficiently large amount of genomic DNA is fragmented, the set of clones spans every base pair of
the genome many times. The end of each cloned DNA fragment is then sequenced, or in some cases,
both ends are sequenced, which puts extra constraints on the way the sequences can be assembled.
Although only 400-500 bases at the end or ends of the fragment are sequenced, if enough clones are
randomly selected from the library and sequenced, the amount of sequenced DNA still spans every
base pair of the genome several times. In a shotgun sequencing experiment, enough DNA sequencing
to span the entire genome 6-10 times is usually required.
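The reason 6-10X coverage is needed can be sketched with the standard Lander-Waterman model, which treats fragment placement as random; under that assumption, the expected uncovered fraction of the genome at coverage c is e^-c. The 3 Mb genome size below is an illustrative value:

```python
import math

def uncovered_fraction(coverage):
    """Expected fraction of the genome with no sequence coverage, under the
    Lander-Waterman model (fragments placed uniformly at random)."""
    return math.exp(-coverage)

genome_size = 3_000_000  # illustrative 3 Mb bacterial genome
for c in (1, 4, 6, 10):
    missing = uncovered_fraction(c) * genome_size
    print(f"{c:2d}X coverage: ~{missing:,.0f} bases left uncovered")
```

At 1X coverage more than a third of the genome is expected to be missing, which is why shotgun projects sequence far more DNA than the genome itself contains.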

The final step in shotgun sequencing is sequence assembly, which we discuss in more detail in the next
section. Assembly of all the short sequences from the shotgun sequencing experiment usually doesn't
result in one single complete sequence. Rather, it results in multiple contigs: unambiguously
assembled lengths of sequence that don't overlap each other. In the assembly process, contigs start and
end because a region of the genome is encountered from which there isn't enough information (i.e., not
enough clones representing that region) to continue assembling fragments. The final steps in
sequencing a complete genome by shotgun sequencing are either to find clones that can fill in these
missing regions, or, if there are no clones in the original library that can fill in the gaps, to use PCR or
other techniques to amplify DNA sequence that spans the gaps.

Recently, Celera Genomics has shown that shotgun DNA sequencing (sequencing without a map)
can work at the whole genome level even in organisms with very large genomes. The largely
completed Drosophila genome sequence is evidence of their success.

The clone contig approach

The clone contig approach relies on shotgun sequencing as well, but on a smaller scale. Instead of
starting by breaking down the entire genome into random fragments, the clone contig approach starts
by breaking it down into restriction fragments, which can then be cloned into artificial chromosome
vectors and amplified. Restriction enzymes are enzymes that cut DNA. These enzymes are sequence-
specific; that is, they recognize only one specific sequence of DNA, anywhere from 6-10 base pairs in
length. By pure statistics, any given base has a 1 in 4 chance of occurring at a random position in a DNA
sequence; a specific N-residue sequence has a 1 in 4^N chance of occurring. Enzymes that cut at a specific 6-10 base pair
sequence of DNA end up cutting genomic DNA relatively rarely, but because DNA sequence isn't
random, the restriction enzyme cuts result in a specific pattern of fragment lengths that is characteristic
of a genome.
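The 1-in-4^N statistic translates directly into an expected cut frequency. A short sketch, assuming a uniform random sequence:

```python
def expected_cut_interval(site_length):
    """Average number of base pairs between occurrences of one specific
    recognition site in uniform random sequence (each of the 4 bases
    equally likely at every position)."""
    return 4 ** site_length

for n in (6, 8, 10):
    print(f"{n}-base site: one cut every {expected_cut_interval(n):,} bp on average")
```

A 6-cutter is thus expected to cut about every 4 kb in random sequence, while an 8-cutter cuts roughly every 65 kb, which is why longer recognition sites yield fewer, larger fragments.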

Each of the cloned restriction fragments can be sequenced and assembled by a standard shotgun
approach. But assembly of the restriction fragments into a whole genome is a different sort of problem.
When the genome is digested into restriction fragments, it is only partially digested. The amount of
restriction enzyme applied to the DNA sample is sufficient to cut at only approximately 50% of the
available restriction sites in the sample. This means that some fragments will span a particular
restriction site, while other fragments will be cut at that particular site and will span other restriction
sites. So the clone library that is made up of these restriction fragments will contain overlapping fragments.

Chromosome walking is the process of starting with a specific clone, then finding the next clone that
overlaps it, and then the next, etc. Methods such as probe hybridization or PCR are used to help
identify the restriction fragment that has been inserted into each clone, and there are a number of
experimental strategies that can make building up the genome map this way less time-consuming. A
genome map is a record of the location of known features in the genome, which makes it relatively
simple to associate particular clone sequences with a specific location in the genome by probe
hybridization or other methods.

Genomes can be mapped at various levels of detail. Geneticists are used to thinking in terms of genetic
linkage maps, which roughly assign the genes that give rise to particular traits to specific loci on the
chromosome. However, genetic linkage maps don't provide enough detail to support the assembly of a
full genome worth of sequence, nor do they point to the actual DNA sequence that corresponds to a
particular trait. What genetic linkage maps do provide, though, is a set of ordered markers, sometimes
very detailed depending on the organism, which can help researchers understand genome function (and
provide a framework for assembling a full genome map).

Physical maps can be created in several ways: by digesting the DNA with restriction enzymes that cut
at particular sites, by developing ordered clone libraries, and recently, by fluorescence microscopy of
single, restriction enzyme-cleaved DNA molecules fixed to a glass substrate. The key to each method is
that, using a combination of labeled probes and known genetic markers (in restriction mapping) or by
identifying overlapping regions (in library creation), the fragments of a genome can be ordered
correctly into a highly specific map (see Figure 11-2).

LIMS: Tracking all those minisequences

In carrying out a sequencing project, tracking the millions of unique DNA samples that may be isolated
from the genome is one of the biggest information technology challenges. It's also probably one of the
least scientifically exciting, because it involves keeping track of where in the genome each sample
came from, which sample goes into each container, where each container goes in what may be a huge
sample storage system, and finally, which data came from which sample. The systems that manage
output from high-throughput sequencing are called Laboratory Information Management Systems
(LIMS), and while we will not discuss them in the context of this book, LIMS development and
maintenance makes up the lion's share of bioinformatics work in industrial settings. Other high-throughput technologies, such as microarrays and cheminformatics, also require complicated LIMS support.

11.2 Sequence Assembly
Basecalling is only the first step in putting together a complete genome sequence. Once the short
fragments of sequence are obtained, they must be assembled into a complete sequence that may be
many thousands of times their length. The next step is sequence assembly.

Sequence assembly isn't something you're likely to be doing on your own on a large scale, unless you
happen to be working for a genome project. However, even small labs may need to solve sequence-
assembly problems that require some computer assistance.

DNA sequencing using a shotgun approach provides thousands or millions of minisequences, each 400-500 bases in length. The fragments are random and can partially or completely overlap each other.
Because of these overlaps, every fragment in the set can be identified by sequence identity as adjacent
to some number of other fragments. Each of those fragments overlaps yet another set of fragments, and
so on. It's standard procedure for the sequences of both ends of some fragments to be known, and the
sequences of only one end of other fragments to be known. Figure 11-3 illustrates the shotgun
sequencing approach.

Figure 11-3. The shotgun DNA sequencing approach

Ultimately, all the fragments need to be optimally tiled together into one continuous sequence.
Identifying sequence overlaps between fragments puts some constraints on how the sequences can be
assembled. For some fragments, the length of the sequence and the sequences of both its ends are
known, which puts even more constraints on how the sequences can be assembled. The assembly
algorithm attempts to satisfy all the constraints and produce an optimal ordering of all the fragments
that make up the genome.
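
A toy version of this constraint-satisfaction process is the classic greedy overlap-merge: repeatedly join the two fragments with the longest suffix-prefix overlap. Real assemblers such as Phrap are far more sophisticated (they weigh base quality and handle errors and repeats), but the sketch below, with invented short "reads," shows the core idea:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def greedy_assemble(frags, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:        # no overlap left: remaining fragments stay separate contigs
            break
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags

reads = ["ATGGCGT", "GCGTACCT", "CCTTAGA"]
print(greedy_assemble(reads))  # → ['ATGGCGTACCTTAGA']
```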

Repetitive sequence features can complicate the assembly process. Some fragments will be
uncloneable, and the sequencing process will fail in other cases, leaving gaps in the DNA sequence that
must be resolved by resequencing. These gaps complicate automated assembly. If there isn't sufficient
information at some point in the sequence for assembly to continue, the sequence contig that is being
created comes to an end, and a new contig starts when there is sufficient sequence information for
assembly to resume.

The Phrap program, available from the University of Washington Genome Center, does an effective job
assembling sequence fragments, although a large amount of computer time is required. The
accompanying program Consed is a Unix-based editor for Phrap sequence assembly output. TIGR
Assembler is another genome assembly package that works well for small genomes, BACs, or ESTs.

11.3 Accessing Genome Information on the Web
Partial or complete DNA sequences from hundreds of genomes are available in GenBank. Putting those
sequence records together into an intelligible representation of genome structure isn't so easy. There are
several efforts underway to integrate DNA sequence with higher-level maps of genomes in a user-
friendly format. So far, these efforts are focused on the human genome and genomes of important plant
and animal model systems. They aren't collected into one uniform resource at present, although NCBI
does aim to be comprehensive in its coverage of available genome data eventually.

Looking at genome data is analogous to looking at a map of the world. You may want to look at the
map from a distance, to see the overall layout of continents and countries, or you may want to find a
shortcut from point A to point B. Each choice requires a different sort of map. However, the maps need
to be linked, because you may want to know the general directions from San Diego to Blacksburg, VA,
but may also want to know exactly how to get to a specific building on the Virginia Tech campus when
you get there. The approach that web-based genome analysis tools are taking is similar to the approach
taken by online map databases such as MapQuest. Place names and zip codes are analogous to gene
names and GenBank IDs. You can search as specifically as you wish, or you can begin with a top view
of the genome and zoom in.

The genome map resources have the same limitations as online map resources, as well. You can search
MapQuest and see every street in Blacksburg, but ask MapQuest for a back-road shortcut between
Cochin and Mangalore on the southwest coast of India, and it can't help you. Currently, NCBI and
EMBL provide detailed maps and tools for looking at the human genome, but if your major interest is
the cat genome, you're (at least this year) out of luck.

Genome resources are also limited by the capabilities of bioinformatics analysis methods. The available
analysis tools at the genome sites are usually limited to sequence comparison tools and whatever
single-sequence feature-detection tools are available for that genome, along with any information about
the genome that can be seamlessly integrated from other databases. If you are hoping to do something
with tools at a genome site you can't do with existing sequence or structure analysis tools, you will still
be disappointed. What genome sites do provide is a highly automated user experience and expertly
curated links between related concepts and entities. This is a valuable contribution, but there are still
questions that can't be answered.

11.3.1 NCBI Genome Resources

NCBI offers access to a wide selection of web-based genome analysis tools from the Genomic Biology
section of its main web site. These tools are designed for the biologist seeking answers to specific
questions. Nothing beyond basic web skills and biological knowledge is required to apply these tools to
a question of interest. Their interfaces are entirely point-and-click, and NCBI supplies ample
documentation to help you learn how to use their tools and databases.

Here's a list of the available tools:

Genome Information

Genome project information is available from the Entrez Genomes page at NCBI. Database
listings are available for the full database or for related groups of organisms such as
microorganisms, archaea, bacteria, eukaryotes, and viruses. Each entry in the database is linked
to a taxonomy browser entry or a home page with further links to available information about
the organism. If a genome map of the organism is available, a "See the Genome" link shows up
on the organism's home page. From the home page, you can also download genome sequences
and references.

Map Viewer

If a genome map is available for the organism, you can click on parts of the cartoon map that is
first displayed and access several different viewing options. Depending on the genome, you can
access links to overview maps, maps showing known protein-coding regions, listings of coding
regions for protein and RNA, and other information. Information is generally downloadable in
text format. Map Viewer distinguishes between four levels of information: the organism's home
page, the graphical view of the genome, the detailed map for each chromosome (aligned to a
master map from which the user can select where to zoom in), and the sequence view, which
graphically displays annotations for regions of the genome sequence. Full Map Viewer
functionality is available only for the human and Drosophila genomes at the time of this writing;
however, for any complete genome, clickable genome maps and views of the annotated genome
at the sequence level are available.

ORF Finder

The Open Reading Frame (ORF) Finder is a tool for locating open reading frames in a DNA
sequence. ORF finders translate the sequence using a standard or user-specified genetic code. In
noncoding DNA, stop codons are frequently found. Only long uninterrupted stretches without
stop codons are taken to be coding regions. Information from the ORF finder can provide clues
about the correct reading frame for a DNA sequence and about where coding regions start and
stop. For many genomes found in the Entrez Genomes database, ORF Finder is available as an
integrated tool from the map view of the genome.
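
The core of an ORF scan is simple enough to sketch. The fragment below (an illustrative simplification, not ORF Finder's actual algorithm) walks the three forward reading frames and reports ATG-to-stop stretches above a length cutoff; a full implementation would also scan the reverse complement and support alternate genetic codes:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Scan the three forward reading frames for ATG...stop stretches of
    at least min_codons codons; returns (frame, start, end) tuples."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

# Toy sequence: an ATG, two codons, then a stop, all in frame 0.
print(find_orfs("ATGAAACCCTAA", min_codons=2))  # → [(0, 0, 12)]
```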


LocusLink

LocusLink is a curated database of genetic loci in several eukaryotic organisms that give rise to
known phenotypes. LocusLink provides an alphabetical listing of traits as well as links to
HomoloGene and Map Viewer.


HomoloGene

HomoloGene is a database of pairwise orthologs (homologous genes from different organisms
that have diverged by speciation, as opposed to paralogs that have diverged by gene
duplication) across four major eukaryotic genomes: human, mouse, rat, and zebrafish. The
ortholog pairs are identified either by curation of literature reports or calculation of similarity.
The HomoloGene database can be searched using gene symbols, gene names, GenBank
accession numbers, and other features.

Clusters of Orthologous Groups (COG)

COG is a database of orthologous protein groups. The database was developed by comparing
protein sequences across 21 complete genomes. The entries in COG represent genome functions
that are conserved throughout much of evolutionary history: functions that were developed
early and retained in all of the known complete genomes. The authors' assumption is that these
ancient conserved sequences comprise the minimal core of functionality that a modern species
(i.e., one that has survived into the era of genome sequencing) requires. The COG database can
be searched by functional category, phylogenetic pattern, and a number of other properties.

NCBI also provides detailed genome-specific resources for several important eukaryotic genomes,
including human, fruit fly, mouse, rat, and zebrafish.

11.3.2 TIGR Genome Resources

The Institute for Genomic Research (TIGR, http://www.tigr.org) is one of the main producers of new
genome sequence data, along with the other major human genome sequencing centers and commercial
enterprises such as Celera. TIGR's main sequencing projects have been in microbial and crop genomes,
and human chromosome 16. TIGR recently announced the Comprehensive Microbial Resource, a
complete microbial genome resource for all of the genomes they have sequenced. At the present time,
each microbial genome has its own web page from which various views of the genome are available.
There are also tools within the resource that allow you to search the omniome, as TIGR designates the
combined genomic information in its database. The TIGR tools aren't as visual as the NCBI genome
analysis tools. Selection of regions to examine requires you to enter specific information into a form
rather than just pointing and clicking on a genome map. However, the TIGR resources are a useful
supplement to the NCBI tools, providing a different view on the same genetic information.

TIGR maintains many genome-specific databases focused on expressed sequence tags (ESTs) rather
than complete genomic data. ESTs are partial sequences from either end of a cDNA clone. Despite
their incompleteness, ESTs are useful for experimental molecular biologists. Since cDNA libraries are
prepared by producing the DNA complement to cellular mRNA (messenger RNA), a cDNA library
gives clues as to what genes are actually expressed in a particular cell or tissue. Therefore, a sequence
match with an EST can be an initial step in helping to identify the function of a new gene. TIGR's EST
databases can be searched by sequence, EST identifier, cDNA library name, tissue, or gene product
name, using a simple forms-based web interface.

11.3.3 EnsEMBL

EnsEMBL is a collaborative project of EMBL, EBI, and the Sanger Centre (http://www.sanger.ac.uk)
to automatically track sequenced fragments of the human genome and assemble them into longer
stretches. Automated analysis methods, such as genefinding and feature-finding tools and sequence-
comparison tools, are then applied to the assembled sequence and made available to users through a
web interface.

In June 2000, the Human Genome consortium announced the completion of the first map of the human
genome. It's important to stress that such maps, and indeed much of the genomic information now
available, are only drafts of a final picture that may take years to assemble. To remain useful, the draft
maps must be constantly updated to stay in sync with the constantly updated sequence databases. The
EnsEMBL project expects to apply its automated data analysis pipeline to many different genomes,
beginning with human and mouse.

There are three ways to search EnsEMBL: a BLAST search of a query sequence against the database; a
search using a known gene, transcript, or map marker ID; or a chromosome map browser that allows
you to pick a chromosome and zoom in to ever-more specific regions. All these tools are relatively self-
explanatory and available from the EnsEMBL web site. In order to use them, however, you should
know something of what you are looking for or on which chromosome to look.

11.3.4 Other Sequencing Centers

TIGR isn't the only genome center to provide software and online tools for analyzing genomic data.
Genome sequencing programs generally incorporate a bioinformatics component and attract
researchers with interests in developing bioinformatics methods; their web sites are good points of
entry to the bioinformatics world. The University of Washington Genome Center is known for the
development of sequence assembly tools; its Phred and Phrap software are widely used. Other genome
centers include, but aren't limited to, the Sanger Centre, the DOE Joint Genome Institute, Washington
University in St. Louis, and many corporate centers.
A complete list of genome sequencing projects in progress and active genome centers can be found
online in the Genomes OnLine Database (GOLD), a public site maintained by Integrated Genomics,
Inc. (http://wit.integratedgenomics.com/GOLD/).

11.3.5 Organism-Specific Resources

The Arabidopsis Information Resource (TAIR) is an excellent example of an organism-specific
genome resource, this one focusing on the widely used plant model system Arabidopsis thaliana. In
addition to the standard features offered at EnsEMBL and NCBI, such as clickable and zoomable
chromosome maps and sequence analysis tools, TAIR offers a variety of expert-curated links to other
information for users of the resource. TAIR is limited in its scope to Arabidopsis, but it is a much
deeper resource than the general public databases. Similar resources are available for many organisms,
from maize to zebrafish. Listings of online genome resources can be located at several sites.

11.4 Annotating and Analyzing Whole Genome Sequences

Genome data presents completely new issues in data storage and analysis:

• Genome sequences are extremely large.
• Users need to access genome data both as a whole and as meaningful pieces.
• The majority of the sequence in a genome doesn't correspond to known functionality.

Annotation of the genome with functional information can be accomplished by several means:
comparison with existing information for the organism in the sequence databases, comparison with
published information in the primary literature, computational methods such as ORF detection and
genefinding, or propagation of information from one genome to another by evolutionary inference
based on sequence comparison. Due to the sheer amount of data available, automatic annotation is
desirable, but it must be automatic without propagating errors. The use of computational methods is
fallible; sequence similarity searches can result in hits that aren't biologically meaningful, and
genefinders often have difficulty detecting the exact start and end of a gene. Sometimes experimental
information is incorrect or is recorded incorrectly in the database. Using this information to annotate
genomes leaves a residue of error in the database, which can then be propagated further by use of
comparative methods.

11.4.1 Genome Annotation

Genome annotation is a difficult business. This is in part because there are a huge number of different
pieces of information attached to every gene in a genome. Not every piece of information is interesting
to every user, and not every piece of this information can (or should) be crammed into a single file of
information about the gene. Genome annotation relies on relational databases to integrate genome
sequence information with other data. [2]

The term relational database should give you a clue that the function of the database is to maintain relationships or connections among entries. We discuss this in
more detail in Chapter 13.

The primary sources of information about what genes do are laboratory experiments. It may take many
experiments to figure out what a gene does. Ideally, all that diverse experimental data should somehow
be associated with the gene annotation. What this means in practice is hyperlinking of content between multiple databases: sequence, structure, and functional genomics fully linked together in a queryable
system. This strategy is beginning to be implemented in most of the major public databases, although
the goal of "one world database" (in the user's perception) has not yet been reached and perhaps never will.

MAGPIE

MAGPIE is an environment for annotation of genomes based on sequence similarity. It can maintain
status information for a genome project and make information about the genome available on the Web,
as well as provide an interface for automated sequence similarity-based and manual annotation. Even if
you're not maintaining a genome database for public use, a look at the features of MAGPIE may help
clarify some of the information technology issues in genome annotation. The Sulfolobus solfataricus P2
Genome Project and many other smaller genome projects have implemented MAGPIE; the S.
solfataricus project provides information on its web site about the role MAGPIE plays in the genome
annotation process.

11.4.2 Genome Comparison

Pairwise or multiple comparison of genomes is an idea that will be useful for many studies, ranging
from basic questions of evolutionary biology to very specific clinical questions, such as the
identification of genetic polymorphisms, which give rise to disease states or significant variations in phenotype.

Why compare whole genomes rather than just comparing genes one at a time? As the Human Genome
Project reaches completion, researchers are just beginning to explore in detail how genome structure
affects genome function. Is junk DNA really junk? Are there structural features in DNA that control
expression? Are there promoters and regulatory regions we haven't yet figured out? Genome
comparison can help answer such questions by pointing to regions of similarity within uncharacterized
or even supposedly redundant DNA. Genome comparison will also aid in genomic annotation.
Prototype genome comparisons have helped to justify the sequencing of additional genomes; the
comparison of mouse and human genomes is one such example. Genome comparison is useful both at
the map level and directly at the sequence level.

PipMaker

PipMaker is a tool that compares two DNA sequences of up to 2 MB each (longer sequences will be
handled by the new Version 2.0, to be released soon) and produces a percent identity plot as output.
This is useful in identifying large-scale patterns of similarity in longer sequences, although obviously
not entire larger genomes. The process of using PipMaker is relatively simple. Starting with two
FASTA-format sequence files, you first generate a set of instructions for masking sequence repeats
(generated using the RepeatMasker server). This reduces the number of uninformative hits in the
sequence comparison. The resulting information, plus a simple file containing a numerical list of
known gene positions, is submitted to the PipMaker web server at Penn State University and the results
are emailed to you. A detailed PipMaker tutorial is available at the web site
(http://bio.cse.psu.edu/pipmaker/). PipMaker relies on BLASTZ to align sequences. BLASTZ is an
experimental version of BLAST designed for extremely long sequences and developed at NCBI.

MUMmer

Another program for pairwise genome comparison is TIGR's MUMmer. MUMmer was designed to
meet the needs of the sequencing projects at TIGR and is optimized for comparing microbial genome
sequences that are assumed to be relatively similar. Its first application was a detailed comparison of
genomes of two strains of M. tuberculosis. MUMmer can compare sequences millions of base pairs in
length and produce colorful visualizations of regions of similarity. MUMmer is based on a computer
algorithm called a suffix tree, which essentially makes it easy for the system to rapidly handle a large
number of pairwise comparisons. The dynamic programming algorithm used in standard BLAST
comparison doesn't scale well with sequence length. For genome-length sequences, dynamic
programming methods are unacceptably slow. MUMmer is an example of a novel method developed to
get around the problems involved in using standard pairwise sequence comparison to compare full
genome sequences. MUMmer is designed for Unix systems and is freely available for use in nonprofit
institutions. A new public web interface to MUMmer has recently become available on the TIGR web site.
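
The flavor of what MUMmer computes, exact matches between two genomes extended to maximal length, can be approximated with a k-mer index instead of a suffix tree. This stand-in (toy sequences, and a fixed seed length k that a suffix tree would not need) illustrates the idea:

```python
from collections import defaultdict

def exact_matches(ref, qry, k=8):
    """Find exact matches of length >= k between two sequences: index
    every k-mer of ref, seed on shared k-mers in qry, then extend each
    seed left and right as far as the match holds. Returns sorted
    (ref_pos, qry_pos, length) tuples; duplicate seeds inside one
    maximal match collapse to a single tuple."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    found = set()
    for j in range(len(qry) - k + 1):
        for i in index.get(qry[j:j + k], ()):
            li, lj = i, j
            while li > 0 and lj > 0 and ref[li - 1] == qry[lj - 1]:
                li, lj = li - 1, lj - 1
            ri, rj = i + k, j + k
            while ri < len(ref) and rj < len(qry) and ref[ri] == qry[rj]:
                ri, rj = ri + 1, rj + 1
            found.add((li, lj, ri - li))
    return sorted(found)

# The two toy "genomes" share the 8-base stretch ACGTACGT.
m = exact_matches("GGGACGTACGTGGG", "TTTACGTACGTTTT", k=4)
print(m)  # the longest match, (3, 3, 8), is ACGTACGT
```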

11.5 Functional Genomics: New Data Analysis Challenges
The advent of high-speed sequencing methods has changed the way we study the DNA sequences that
code for proteins. Once, finding these bits of DNA in the genome of an organism was the goal, without
much concern for the context. It is now becoming possible to view the whole DNA sequence of a
chromosome as a single entity and to examine how the parts of it work together to produce the
complexity of the organism as a whole.

The functions of the genome break down loosely into a few obvious categories: metabolism, regulation,
signaling, and construction. Metabolic pathways convert chemical energy derived from environmental
sources (i.e., food) into useful work in the cell. Regulatory pathways are biochemical mechanisms that
control what genomic DNA does: when it is expressed and when it isn't. Genomic regulation involves
not only expressed genes but structural and sequence signals in the DNA where regulatory proteins
may bind. Signaling pathways control, among other things, the movement of chemicals from one
compartment in a cell to another. Teasing out the complex networks of interactions that make up these
pathways is the work of biochemists and molecular biologists. Many regulatory systems for the control
of DNA transcription have been studied. Mapping these metabolic, regulatory, and signaling systems to
the genome sequence is the goal of the field of functional genomics.

11.5.1 Sequence-Based Approaches for Analyzing Gene Expression

In addition to genome sequence, GenBank contains many other kinds of DNA sequence. Expressed
sequence tag (EST) data for an organism can be an extremely useful starting point for discovery-
oriented exploration of gene expression. To understand why this is, you need to recall what ESTs
represent. ESTs are partial sequences of cDNA clones; cDNA clones are DNA strands built using
cellular mRNA as templates. In the cell, mRNA is RNA with a mission: to be converted into protein, and soon. mRNA levels respond to changes in the cell or its environment; mRNA levels are tissue-dependent, and they change during the life cycle of the organism as well. Quantitation of mRNA or
cDNA provides a pretty good measure of what a genome is doing under particular conditions.
The term transcriptome has been used to describe the collection of sequenced transcripts from a given organism.
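
As a sketch of expression measurement by counting: if each EST in a library has been assigned to a gene (say, by a similarity search), tallying the assignments gives a crude expression profile. The gene names and counts below are invented:

```python
from collections import Counter

def est_expression(est_hits):
    """Crude expression profile: count how many ESTs in a library map
    to each gene. EST-to-gene assignments are assumed to be given,
    e.g., from a similarity search of the ESTs against the genome."""
    return Counter(est_hits)

# Hypothetical EST-to-gene assignments from one cDNA library.
hits = ["actin", "actin", "tubulin", "actin", "hsp70", "tubulin"]
profile = est_expression(hits)
print(profile.most_common())  # → [('actin', 3), ('tubulin', 2), ('hsp70', 1)]
```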

The sequence of a cDNA molecule built off an mRNA template should be the same as the sequence of
the DNA that originally served as a template for building the mRNA. Sequencing a short stretch of

bases from a cDNA sequence provides enough information to localize the source of an mRNA in a
genome sequence.

NCBI offers a database called dbEST that provides access to several thousand libraries of ESTs. Quite
a large number of these are human EST libraries, but there are libraries from dozens of other organisms
as well. NCBI's UniGene database provides fully searchable access to specific EST libraries from
human, mouse, rat, and zebrafish. EST data within UniGene has been combined with sequences of
well-characterized genes and clustered, using an automated clustering procedure, to identify groups of
related sequences. The Library Browser can locate libraries of interest within UniGene.

Another NCBI resource for sequence-based expression analysis is SAGEmap. Serial Analysis of Gene
Expression (SAGE) is an experimental technique in which the transcription products of many genes are
rapidly quantitated by sequencing short "tags" of DNA at a specific position (usually a restriction site)
in the sequence. SAGEmap is a specialized online resource for the SAGE user community that
identifies potential SAGE tags in human DNA sequence and maps them to the genome.
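
The tag-extraction step of SAGE is easy to sketch. Assuming NlaIII (recognition site CATG) as the anchoring enzyme and the original 10-bp tag length, candidate tags are the bases immediately downstream of each site; a real SAGE protocol uses only the 3'-most site per transcript, and the sequence here is invented:

```python
def sage_tags(seq, site="CATG", tag_len=10):
    """Candidate SAGE tags: the tag_len bases immediately 3' of each
    occurrence of the anchoring-enzyme site. Returns (site_pos, tag)
    pairs; tags running off the end of the sequence are skipped."""
    seq = seq.upper()
    tags = []
    pos = seq.find(site)
    while pos != -1:
        tag = seq[pos + len(site): pos + len(site) + tag_len]
        if len(tag) == tag_len:
            tags.append((pos, tag))
        pos = seq.find(site, pos + 1)
    return tags

transcript = "AAACATGGGGTTTTCCCCAACATGACGTACGTAC"
print(sage_tags(transcript))  # → [(3, 'GGGTTTTCCC'), (20, 'ACGTACGTAC')]
```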

11.5.2 DNA Microarrays: Emerging Technologies in Functional Genomics

Recently, new technology has made it possible for researchers to rapidly explore expression patterns of
entire genomes' worth of DNA. A microarray (or gene chip) is a small glass slide, like a microscope slide, about a centimeter on a side. The surface of the slide is covered with 20,000 or more precisely placed spots, each containing a different DNA oligomer (short polynucleotide chain). cDNA can also be
affixed to the slide to function as probes. Other media, such as thin membranes, can be used in place of
slides. The key to the experiment is that each piece of DNA is immobilized, attached at one end to the
slide's surface. Any reaction that results in a change in microarray signal can be precisely assigned to a
specific DNA sequence.

Microarray experiments capitalize on an important property of DNA. One strand of DNA (or RNA) can
hybridize with a complementary strand of DNA. If the complementarity of the two strands is perfect,
the bond between the two strands is difficult to break. Each oligomer in a DNA microarray can serve as
a probe to detect a unique, complementary DNA or RNA molecule. These oligomers can be bound by
fluorescently labeled DNA, allowing the chip to be scanned using a confocal scanner or CCD camera.
The presence or absence of a complementary sequence in the DNA sample being run over the chip
determines whether the position on the array "lights up" or not. Thus, the presence or absence of 20,000 or more sequences can be experimentally demonstrated with one gene chip.
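
The perfect-complementarity model of hybridization can be expressed directly in code. This toy sketch (invented probe and sample sequences) reports which spots would light up if a probe binds only when its exact reverse complement occurs in the labeled sample; real hybridization tolerates some mismatch, depending on conditions:

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement: the strand a probe will hybridize with."""
    return seq.translate(COMP)[::-1]

def hybridizing_spots(probes, sample):
    """Which array spots 'light up': probes whose perfect complement
    occurs somewhere in the labeled sample sequence."""
    return {name for name, probe in probes.items()
            if revcomp(probe) in sample}

probes = {"geneA": "ATGGCA", "geneB": "TTTTTT", "geneC": "GACCTA"}
sample = "CCCTGCCATCCCTAGGTCCCC"
print(sorted(hybridizing_spots(probes, sample)))  # → ['geneA', 'geneC']
```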

Microarrays are conceptually no different from traditional hybridization experiments such as Southern blots (probing DNA samples separated on a filter with labeled probe sequences) or Northern blots (probing RNA samples separated on a filter). In traditional blotting, the sample is immobilized;
in microarray experiments, the probe is immobilized, and the amount of information that can be
collected in one experiment is vastly larger. Figure 11-4 shows just a portion of a microarray scan from
Arabidopsis (Image courtesy of the Arabidopsis Functional Genomics Consortium (AFGC) and the
Stanford Microarray Database, http://genome-www.stanford.edu/microarray). Other advantages are
that microarray experiments rely on fluorescent probes rather than the radioactive probes used in
blotting techniques, and gene chips can be manufactured robotically rather than laboriously generated
by hand.

Figure 11-4. A microarray scan

Microarray technology is now routinely used for DNA sequencing experiments; for instance, in testing
for the presence of polymorphisms. Another recent development is the use of microarrays for gene
expression analysis. When a gene is expressed, an mRNA transcript is produced. If DNA oligomers
complementary to the genes of interest are placed on the microarray, mRNA or cDNA copies can be
hybridized to the chip, providing a rapid assay as to whether or not those genes are being expressed.
Experiments like these have been performed in yeast to test differences in whole-genome expression
patterns in response to changes in ambient sugar concentration. Microarray experiments can provide
information about the behavior of every one of an organism's genes in response to environmental conditions.

11.5.3 Bioinformatics Challenges in Microarray Design and Analysis

So why do microarrays merit a section in a book on bioinformatics? Bioinformatics plays multiple
roles in microarray experiments. In fact, it is difficult to conceive of microarrays as useful without the
involvement of computers and databases. From the design of chips for specific purposes, to the
quantitation of signals, to the extraction of groups of genes with linked expression profiles, microarray
analysis is a process that is difficult, if not impossible, to do by eye or with a pencil and a notebook.

The most popular laboratory equipment for microarray experiments, at the time of this writing, is the
Affymetrix machine; however, it's been followed closely by home-built configurations. If you're
working with a commercial arrayer, integrated software will probably make it relatively easy for you to
analyze data. However, home-built microarray setups put out data sets of varying sizes. Arrayers may
not spot quite as uniformly as commercial machines. Standardization is difficult. And running a home-
built setup means you have to find software that supports all the steps of the array experiment and has
the features you need for data analysis.

One of the main challenges in conducting microarray experiments with noncommercial equipment is
that there are a limited number of available tools for linking expression data with associated sequences
and annotations. Constructing such a database interface can be a real burden for a novice. Proprietary
software, based on proprietary chip formats, is often quite well supported by a database backend
specific to the chip, but it isn't always generalizable, or not easily so, to a variety of image formats and
data-set sizes. In the public domain, several projects are underway to improve this situation for
academic researchers, including NCGR's GeneX and EMBL-EBI's ArrayExpress. The National Human
Genome Research Institute (NHGRI) is currently offering a demonstration version of an array data management system called ArrayDB (http://genome.nhgri.nih.gov/arraydb/) that includes both analysis tools and relational database support. ArrayDB will also allow a community of users to add records to a central database. It's in alpha release at the time of this writing.

The Pat Brown group at Stanford has a comprehensive listing of microarray resources on their web site,
including instructions for building your own arrayer (for about 10% of the cost of a commercial setup)
and the ArrayMaker software that controls the printing of arrays. This site is an excellent resource for
microarray beginners.

Planning array experiments

A key element in microarray experiments is chip design. This is the aspect that's often forgotten by
users of commercial devices and commercial chips, because one benefit of those systems is that chip
design has been done for you, by an expert, before you ever think about doing an experiment. Chip
design is a process that can take months.

Even the largest chip can't fit probes for every gene in a eukaryotic genome; there may be hundreds of
thousands of different targets. The chip designer has to select subsets of the genome that are likely to
be informative when assayed together. EST data sets can be helpful in designing microarray primers;
while they are quantitatively uninformative, ESTs do indicate which subsets of genes are likely to be
active under particular conditions and hence are informative for a specific microarray experiment.

In order for microarray results to be clear and unambiguous, each DNA probe in the array must be
sufficiently unique that only one specific target gene can hybridize with it. Otherwise, the amount of
signal detected at each spot will be quantitatively incorrect.

What this means, in practice, is lots of sequence analysis: finding possible genes of interest, and
selecting and synthesizing appropriate probes. Once the probes are selected, their sequence, plus
available background information for each spot in the array, must be recorded in a database so that
information is accessible when results are analyzed. Finally, the database must be robust enough to take
into account changing annotations and information in the public sequence databases, so that incorrect
interpretations of results can be avoided.
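In practice, the uniqueness requirement comes down to string searching against the candidate gene set. Here is a minimal Python sketch of such a check; the mini-genome and probe sequences are invented for illustration, and a real pipeline would use BLAST-style alignment rather than exact substring matching:

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def is_unique_probe(probe, targets):
    """A probe is usable only if it (or its reverse complement)
    matches exactly one target gene in the collection."""
    rc = reverse_complement(probe)
    hits = [name for name, seq in targets.items()
            if probe in seq or rc in seq]
    return len(hits) == 1

# Hypothetical mini-genome: two genes sharing a common prefix.
genes = {
    "geneA": "ATGGCGTACGTTAGCATCGGA",
    "geneB": "ATGGCGTACGTTTTTCCCGGG",
}
print(is_unique_probe("TAGCATCGGA", genes))  # unique to geneA -> True
print(is_unique_probe("ATGGCGTACG", genes))  # shared prefix -> False
```

A probe that hits the shared prefix would report a mixed, quantitatively meaningless signal, which is exactly the failure mode described above.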

Some resources for probe and primer design are available on the Web. A "good" oligonucleotide (one
that is useful as a probe or primer for microarrays, PCR, and other hybridization experiments) shouldn't
hybridize with itself to form dimers or hairpins. It should hybridize uniquely with the target sequence
you are interested in. For PCR applications, primers must have an optimal melting temperature and
stability. An excellent web resource for designing PCR primers is the Primer3 web site at the
Whitehead Institute; CODEHOP is another primer/probe design application based on sequence motifs
found in the Blocks database.

Analyzing scanned microarray images with CrazyQuant

Once the array experiment is complete, you'll find yourself in possession of a lot of very large TIFF
files containing scanned images of your arrays. If you're not using an integrated analysis package,
where do you go from there?

The standard public-domain microarray analysis tools are the packages developed at Stanford. One
package, ScanAlyze, available for Windows, is the image analysis tool in this group. ScanAlyze is well
regarded and widely used, especially in academia, and features semiautomatic grid definition and
multichannel pixel analysis. It supports TIFF files as well as the Stanford SCN format. It's by far the
most sophisticated of the image-analysis programs discussed here.

A relatively straightforward public-domain program for array-image analysis is CrazyQuant, a Java
application available from the Hood Laboratory at the University of Washington. CrazyQuant is menu-
driven and can load TIFF, JPG, or GIF format microarray images. CrazyQuant assumes a 2D matrix of
regularly spaced spots, and to begin the analysis process, you need to define a 2D grid that pinpoints
the spot locations. The program then computes relative intensities at each spot and stores them as
integer values. CrazyQuant can quantitate both one- and two-color (differential) array data. CrazyQuant
is extremely simple to install on a Linux workstation. Download the archive, move it to its own
directory, unzip it, and run the program by entering java CrazyQuant. A sample GIF image is included
in the archive so that you can see how it works.
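The quantitation step itself, summing pixel intensities in a window around each point of a regularly spaced grid, can be sketched in a few lines of Python. This is a simplified illustration of the idea, not CrazyQuant's actual algorithm, and the grid parameters and toy image are invented:

```python
def spot_intensities(image, rows, cols, spacing, radius, origin=(0, 0)):
    """Sum pixel values in a square window around each grid point of a
    regularly spaced rows x cols spot matrix."""
    oy, ox = origin
    result = {}
    for r in range(rows):
        for c in range(cols):
            cy, cx = oy + r * spacing, ox + c * spacing
            total = 0
            for y in range(cy - radius, cy + radius + 1):
                for x in range(cx - radius, cx + radius + 1):
                    if 0 <= y < len(image) and 0 <= x < len(image[0]):
                        total += image[y][x]
            result[(r, c)] = total
    return result

# Toy 6x6 "scan": one bright spot at (1, 1), one dim spot at (1, 4).
image = [[0] * 6 for _ in range(6)]
image[1][1] = 90
image[1][4] = 30
spots = spot_intensities(image, rows=1, cols=2, spacing=3, radius=1, origin=(1, 1))
print(spots)  # {(0, 0): 90, (0, 1): 30}
```

Real packages add background subtraction and spot-boundary detection on top of this basic scheme, but the grid-then-integrate structure is the same.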

TIGR also offers a Windows application for microarray image analysis called SpotFinder. SpotFinder
can process the TIFF-format files produced by most microarray scanners and produce output that is
compatible with both TIGR's ArrayViewer and other microarray analysis software.

Visualizing high-dimensional data

Microarray results can be difficult to visualize. Array experiments generally have at least four
dimensions (x-location, y-location, fluorescence intensity, and time). Straightforward plotting of array
images isn't very informative. Tools that help extract features from higher-dimensional data sets and
display these features in a sensible image format are needed.

TIGR offers a Java application called ArrayViewer. Currently, ArrayViewer's functions are focused on
detecting differentially expressed genes and displaying differential expression results in a graphical
format. ArrayViewer's parameters can be adjusted to incorporate data from arrays of any size, and it
can be configured to allow straightforward access to underlying gene sequence data and annotation.
Several normalization options are offered. Features for analysis of time series data and other more
complicated data sets aren't yet implemented, but ArrayViewer does meet most basic array
visualization needs.

Some general visualization and data-mining packages such as Xgobi, which we discuss in Chapter 14,
can also be used to examine array data.

Clustering expression profiles

At the time of this writing, the most popular strategy for analysis of microarray data is the clustering of
expression profiles. An expression profile can be visualized as a plot that describes the change in
expression at one spot on a microarray grid over the course of the experiment. The course of the
experiment changes with the context: anything from changes in the concentration of nutrients in the
medium in which cells are being grown prior to having their DNA hybridized to the array, to stages of
the cell cycle.
In this context, what is clustered is essentially the shape of the plot. Different clustering methods, such
as hierarchical clustering or SOMs (self-organizing maps), may work better in different situations, but
the general aim of each of these methods is the same. If two genes change expression levels in the
same way in response to a change in conditions, it can be assumed that those genes are related. They
may share something as simple as a promoter, or, more likely, they are controlled by the same complex
regulatory pathway. Automated clustering of expression profiles looks for similar symptoms (similarly
shaped expression profiles) but doesn't necessarily point to causes for those changes. That's the job of
the scientist analyzing the results, at least for now.
We discuss clustering approaches in a little more detail in Chapter 14.
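The core idea, grouping genes whose profiles have the same shape regardless of absolute level, can be sketched as follows. This is a deliberately simplified greedy single-linkage scheme, not the algorithm used by any particular package, and the profiles are invented:

```python
def correlation(a, b):
    """Pearson correlation: captures the *shape* of an expression
    profile, ignoring its absolute level."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def cluster(profiles, cutoff=0.9):
    """Greedy single-linkage grouping: a gene joins a cluster if it is
    strongly correlated with any existing member."""
    clusters = []
    for name, prof in profiles.items():
        for group in clusters:
            if any(correlation(prof, profiles[m]) >= cutoff for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical time courses (expression at four time points).
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],   # rising
    "geneB": [2.0, 4.0, 8.0, 16.0],  # same shape, different level
    "geneC": [8.0, 4.0, 2.0, 1.0],   # falling
}
print(cluster(profiles))  # [['geneA', 'geneB'], ['geneC']]
```

geneA and geneB cluster together even though their absolute levels differ by a factor of two, which is why correlation-style measures, rather than raw distances, are the usual choice for profile comparison.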

The programs Cluster and TreeView, also from Stanford, are Windows-platform tools for clustering
expression profiles. Various algorithms for clustering are implemented, including SOMs and
hierarchical clustering. XCluster, which implements most of the features of Cluster, is available for
Unix platforms. All these programs require a specific file format (detailed in the manual, which is
available online).

A note on commercial software for expression analysis

Several commercial software packages, with tools for visualizing complex microarray data sets, are
available. Many of these are specific to particular hardware or array configurations. Others, such as
SpotFire and Silicon Genetics' GeneSpring, are more universal. These software packages are often
rather expensive to license; however, at this stage of the development of microarray technology, their
relative ease of use may make them worthwhile.

11.6 Proteomics
Proteomics refers to techniques that simultaneously study the entire protein complement of a cell.
While protein purification and separation methods are constantly improving, and the time-to-
completion of protein structures determined by NMR and x-ray crystallography is decreasing, there is
as yet no single way to rapidly crystallize the entire protein complement of an organism and determine
every structure. Techniques in biochemical characterization, on the other hand, are getting better and
faster. The technological advance in biochemistry that most requires informatics support is the
immobilized-gradient 2D-PAGE process and the subsequent characterization of separated protein
products by mass spectrometry. Microarraying robots have begun to be used to create protein arrays,
which can be used in protein interaction assays, drug discovery, and other applications. However,
protein microarrays are still far from a routine approach.

11.6.1 Experimental Approaches in Proteomics

Knowing when and at what levels genes are being expressed is only the first step in understanding how
the genome determines phenotype. While mRNA levels are correlated with protein concentration in the
cell, proteins are subject to post-translational modifications that can't be detected with a hybridization
experiment. Experimental tools for determining protein concentration and activity in the cell are the
crucial next step in the process.

Another high-throughput technology that is emerging as a tool in functional genomics is 2D gel
electrophoresis. Gels have long been used in molecular biology to separate mixtures of components.
Depending on the conditions of the experiment and the type of gel used, different components will
migrate through a gel matrix at different rates. (This same principle makes DNA sequencing possible).

Two-dimensional gel electrophoresis can be used to separate protein mixtures containing thousands of
components. The first dimension of the experiment is separation of the components of a solution along
a pH gradient (isoelectric focusing). The second dimension is separation of the components
orthogonally by molecular weight. Separation in these two dimensions can resolve even a complicated
mixture of components. Figure 11-5 shows a 2D-PAGE reference map from Arabidopsis thaliana. The
2D-PAGE experiment separates proteins from a mixed sample so that individual proteins can be
identified. Each spot on the map represents a different protein. (Image © Swiss Institute of
Bioinformatics, Geneva, Switzerland.)

Figure 11-5. A 2D-PAGE reference map from Arabidopsis thaliana

While 2D gel electrophoresis is a useful and interesting technology in itself, the technology did not
really come into its own until the development of standardized immobilized-gradient gels. These gels
allow very precise protein separations, resulting in standardized high density data arrays. They can
therefore be subjected to automated image analysis and quantitation and used for accurate comparative
studies. The other advance that has put 2D gel technology at the forefront of modern molecular biology
methods is the capacity to chemically analyze each spot on the gel using mass spectrometry. This
allows the measurable biochemical phenomenon (the amount of protein found in a particular spot on
the gel) to be directly connected to the sequence of the protein found at that spot.

11.6.2 Informatics Challenges in 2D-PAGE Analysis

The analysis pathway for 2D-PAGE gel images is essentially quite similar to that for microarrays. The
first step is an image analysis, in which the positions of spots on the gel are identified and the
boundaries between different spots are resolved. Molecular weight and isoelectric point (pI) for each
protein in the gel can be estimated according to position.
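The position-to-property estimate can be sketched as a simple calibration: assume pI varies linearly across the horizontal axis and molecular weight log-linearly down the vertical axis. The pixel coordinates and calibration ranges below are invented for illustration:

```python
import math

def estimate_pi_mw(x, y, width, height,
                   pi_range=(3.0, 10.0), mw_range=(250000.0, 10000.0)):
    """Estimate isoelectric point and molecular weight from a spot's
    pixel position on a 2D-PAGE image.  pI runs linearly left to right;
    MW runs log-linearly top to bottom (large proteins migrate least,
    so the highest masses sit at the top of the gel)."""
    pi_lo, pi_hi = pi_range
    pi = pi_lo + (x / width) * (pi_hi - pi_lo)
    log_hi, log_lo = math.log10(mw_range[0]), math.log10(mw_range[1])
    mw = 10 ** (log_hi + (y / height) * (log_lo - log_hi))
    return pi, mw

pi, mw = estimate_pi_mw(x=500, y=0, width=1000, height=800)
print(round(pi, 2), round(mw))  # mid-gel, top edge: 6.5 250000
```

Real gel-analysis software calibrates against marker proteins of known pI and mass rather than assuming the gradient spans fixed ranges, but the interpolation idea is the same.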

Next, the spots are identified, and sequence information is used to make the connection between a
particular spot and its gene sequence. In microarray experiments, this step is planned in advance, as the
primers or cDNA fragments are laid down in the original chip design. In proteome analysis, the
immobilized proteins can either be sequenced in situ or spots of protein can be physically removed
from the gel, eluted, and analyzed using mass spectrometry methods such as electrospray ionization-
mass spectrometry (ESI-MS) or matrix-assisted laser desorption ionization mass spectrometry
(MALDI-MS).

The essence of mass spectrometry methods is that they can determine the masses of components in a
mixture, starting from a very small sample. Proteins, fragmented by various chemically specific
digestion methods, have characteristic fingerprints (patterns of peptide masses) that can identify
specific proteins and match them with a gene sequence.
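A theoretical fingerprint for a known sequence can be computed by simulating the digest and summing residue masses. The sketch below assumes a trypsin digest (cleave after K or R, but not when the next residue is P) and uses approximate average residue masses for a handful of amino acids; a real tool covers all twenty and works to much higher precision:

```python
# Approximate average residue masses (Da); one water (18.02 Da) is
# added per peptide for the terminal H and OH.
RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08, "V": 99.13,
                "L": 113.16, "K": 128.17, "R": 156.19, "F": 147.18,
                "P": 97.12}
WATER = 18.02

def tryptic_peptides(protein):
    """Simulate a trypsin digest: cleave after K or R, except before P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def fingerprint(protein):
    """Theoretical peptide-mass fingerprint: sorted peptide masses."""
    return sorted(round(sum(RESIDUE_MASS[aa] for aa in pep) + WATER, 2)
                  for pep in tryptic_peptides(protein))

print(tryptic_peptides("GASVKLFRPG"))  # ['GASVK', 'LFRPG']
```

Matching an observed spectrum against fingerprints computed this way for every database entry is, in essence, how identification tools such as ExPASy's fingerprinting services work.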

Peptide fingerprints are sufficient to identify proteins only in cases in which the sequence of a protein
is already known and can be found in a database. When full sequence information isn't available, a
second mass spectrometry step can obtain partial sequence information from each individual peptide
that makes up the peptide fingerprint. The initial peptide fingerprinting process separates the protein
into peptides and characterizes them by mass. Within the mass spectrometer, each peptide can then be
further broken down into ionized fragments. The goal of the peptide fragmentation step is to produce a
ladder of fragments each differing in length by one amino acid. Because each type of amino acid has a
different molecular weight, the sequence of each peptide can be deduced from the resulting mass

Finally, staining, radiolabeling, fluorescence, or other methods are used to quantitate each protein spot
on the gel. Both in the microarray experiment and the 2D-PAGE experiment, quantitation is a fairly
difficult step. In this step, computer algorithms can help analyze the amount of signal at each spot and
deconvolute complex patterns of spots.

11.6.3 Tools for Proteomics Analysis

Several public-domain programs for proteomics analysis are available on the Web. Most of these can
be accessed through the Expert Protein Analysis System (ExPASy, http://www.expasy.ch/tools/), the
excellent proteomics resource maintained by the Swiss Institute of
Bioinformatics. ExPASy is linked with SWISS-PROT, an expert-curated database of protein sequence
information, and TrEMBL, the computer-generated counterpart to SWISS-PROT. Most of its tools are
web-based and tied into these and other protein databases. The Swiss Institute of Bioinformatics also
maintains SWISS-2DPAGE, a database of reference gel maps that are fully searchable and integrated
with other protein information. SWISS-2DPAGE, like other biological databases, is growing rapidly;
however, deposition of 2D-PAGE results into databases isn't, at this time, required for publication, so
the database isn't comprehensive.

The Melanie3 software package, a Windows-based package for 2D-PAGE image analysis, was
developed at ExPASy, although it has since been commercialized. A Melanie viewer, which allows

users who don't own Melanie3 to view Melanie3 data sets generated by colleagues, is freely distributed
by ExPASy.

Here are some other ExPASy proteomics tools:

AACompIdent
Allows you to identify proteins by their amino acid composition

AACompSim
Compares a protein's amino acid composition with other proteins in SWISS-PROT

MultiIdent
A multifunction tool that uses pI, molecular weight, mass fingerprints, and other data to help
identify proteins

PeptIdent
Compares experimentally determined mass fingerprints with theoretically calculated mass
fingerprints for all proteins in SWISS-PROT

FindMod
Predicts specific post-translational modifications to proteins based on mass differences between
experimental and computed fingerprints

GlycoMod
Predicts oligosaccharide modifications from mass differences

PeptideMass
Computes a theoretical mass fingerprint for a SWISS-PROT or TrEMBL entry, or for a
user-entered protein sequence
These tools are entirely forms-based and very approachable for the novice user. In addition, ExPASy
provides links to many externally developed tools and web servers. It is an excellent starting resource
for anyone interested in proteomics.

The PROWL database is a relatively new web resource for proteomics. PROWL tools can be used to
search a protein database with peptide fingerprint or partial sequence information. The PROWL group
also provides a suite of software for mass spectrometry data analysis.

11.6.4 Generalizing the Array Approach

Integration of microarray and 2D-PAGE methods (which provide information about gene transcription
and translation, respectively) with genome sequence data is the best way currently available to form a
picture of whole-genome function. However, these methods are still fairly new. Although the
technology is moving forward by leaps and bounds, their results aren't yet fully standardized, and
consensus software tools and analysis methods for these types of data are still emerging.

Array and 2D-PAGE experiments have elements in common, including the ability to separate and
quantitate components in a mixture and fix particular components consistently to positions in a grid,
and the ability to measure changes in signal at each position over time. Approaches for analyzing array-
formatted biochemical data are likely to be similar on some level, whether the experiments involve
DNA-DNA, DNA-mRNA, or even protein-protein interactions. Array strategies have recently been
used to conduct a genome-wide survey of protein-protein interactions in yeast, and other applications
of the strategy are, no doubt, in the works. Array methods and other parallel methods promise to
continue to revolutionize biology. However, the biology community is still in the process of developing
standards for reporting and archiving array data, and it is unlikely that a consensus will be reached
before this book goes to press.

11.7 Biochemical Pathway Databases
Gene and protein expression are only two steps in the translation of genetic code to phenotype. Once
genes are expressed and translated into proteins, their products participate in complicated biochemical
interactions called pathways, as shown in Figure 11-6 (the image in the figure is © Kyoto Encyclopedia
of Genes and Genomes). It is highly unlikely that one enzyme-catalyzed chemical reaction will produce
a needed product from a material readily available to the organism. Instead, a complicated series of
steps is usually required. Each pathway may supply chemical precursors to many other pathways,
meaning that each protein has relationships not only to the preceding and following biochemical steps
in a single pathway, but possibly to steps in several pathways. The complicated branchings of
metabolic pathways are far more difficult to represent and search than the linear sequences of genes
and genomes.

Figure 11-6. A complex metabolic pathway

11.7.1 Illustration of a Complex Metabolic Pathway

Several web-based services offer access to metabolic pathway information. These resources are
primarily databases of information linked by search tools; at the time of this writing metabolic
simulation tools, such as those we describe in the next section, have not been fully integrated with
databases of known metabolic pathway information into a central web-based resource.

11.7.2 EC Nomenclature

Enzymes (proteins that catalyze metabolic reactions) can be described using a standard code called the
EC code. The EC nomenclature is a hierarchical naming scheme that divides enzymes into several
major classes. The first class number refers to the chemistry of the enzyme: oxidoreductase, lyase,
hydrolase, transferase, isomerase, or ligase. The second class number indicates what class of substrate
the enzyme acts on. The third class number, which can be omitted, indicates other chemical participants
in the reaction. Finally, the fourth number narrows the search to the specific enzyme. Thus, EC number
1.1.1.1 refers to alcohol dehydrogenase, which is a (1) oxidoreductase acting on the (1) CH-OH group
of donors with (1) NADH as acceptor. If you are interested in using most metabolic pathway resources,
it's helpful to become familiar with EC nomenclature. The EC code and hierarchy of functional
definitions can be found online at the IUBMB Biochemical Nomenclature Committee web site.
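Programmatically, an EC number is just a four-level hierarchical key, which makes it easy to parse and group. A small Python sketch:

```python
# Top-level EC classes (the first digit of an EC number).
EC_CLASSES = {1: "oxidoreductase", 2: "transferase", 3: "hydrolase",
              4: "lyase", 5: "isomerase", 6: "ligase"}

def describe_ec(ec):
    """Split an EC number such as '1.1.1.1' into its hierarchical
    levels and name the top-level enzyme class."""
    levels = [int(part) for part in ec.split(".")]
    return {"class": EC_CLASSES[levels[0]],
            "subclass": levels[1],
            "sub-subclass": levels[2] if len(levels) > 2 else None,
            "serial": levels[-1] if len(levels) == 4 else None}

print(describe_ec("1.1.1.1"))
# {'class': 'oxidoreductase', 'subclass': 1, 'sub-subclass': 1, 'serial': 1}
```

Because the code is hierarchical, truncating it to fewer levels (e.g., "1.1.1") selects progressively broader families of enzymes, which is how most metabolic pathway resources let you search by partial EC number.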

11.7.3 WIT and KEGG

The best known metabolic pathway resources on the Web are What is There (WIT,
http://wit.mcs.anl.gov/WIT2/) and the Kyoto Encyclopedia of Genes and Genomes (KEGG,
http://www.genome.ad.jp/kegg/). WIT is a metabolic pathway reconstruction resource; that is, the
curators of WIT are attempting to reconstruct complete metabolic pathway models for organisms
whose genomes have been completely sequenced. WIT currently contains metabolic models for 39
organisms. The WIT models include far more than just metabolism and bioenergetics; they range from
transcription and translation pathways to transmembrane transport to signal transduction.

WIT can be searched and queried in a number of ways. You can browse the database beginning at the
very top level, a functional overview of the WIT contents, which is found under the heading General
Overview on the main page. Each heading in the metabolic outline is a clickable link that takes you to
increasingly specific levels of detail about that subset of metabolism. The View Models menu takes
you directly to organism-specific metabolic models.

The General Search function allows you to search all of WIT, or subsets of organisms. This type of
search is based on keywords, using Unix-style regular expressions to find matches. There is also a
similarity search function that allows you to search all the open reading frames (ORFs) of a selected
organism for sequence pattern matches, using either BLAST or FASTA. Pathway queries require you
to specify the names of metabolites and/or specific EC enzyme codes. Enzyme queries allow you to
specify an enzyme name or EC code, along with location information such as tissue specificity, cellular
compartment specificity, or organelle specificity. In all except the regular-expression searches, the
keywords are drawn from standardized metabolic vocabularies. WIT searches require some prior
knowledge of these vocabularies when you submit the query. WIT was primarily designed as a tool to
aid its developers in producing metabolic reconstructions, and documentation of the vocabularies used
may not always be sufficient for the novice user. WIT is relatively text-heavy, although at the highest
level of detail, metabolic pathway diagrams can be displayed.

Another web-based metabolic reconstruction resource is KEGG, which provides its metabolic
overviews as map illustrations, rather than text-only, and can be easier to use for the visually-oriented
user. KEGG also provides listings of EC numbers and their corresponding enzymes broken down by
level, and many helpful links to sites describing enzyme and ligand nomenclature in detail. The
LIGAND database, associated with KEGG, is a useful resource for identifying small molecules
involved in biochemical pathways. Like WIT, KEGG is searchable by sequence homology, keyword,
and chemical entity; you can also input the LIGAND ID codes of two small molecules and find all of
the possible metabolic pathways connecting them.

11.7.4 PathDB

PathDB is another type of metabolic pathway database. While it contains roughly the same information
as KEGG and WIT (identities of compounds and metabolic proteins, and information about the steps
that connect these entities), it handles information in a far more flexible way than the other metabolic
databases. Instead of limiting searches to arbitrary metabolic pathways and describing pathways with
preconceived images, PathDB allows you to find any set of connected reactions that link point A to
point B, or compound A to compound B.
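Finding "any set of connected reactions that link compound A to compound B" is a graph-search problem. A sketch using breadth-first search over a hypothetical substrate-to-product reaction list; the compound names are illustrative, not drawn from PathDB:

```python
from collections import deque

def find_pathway(reactions, start, goal):
    """Breadth-first search over a substrate -> product reaction graph,
    returning one shortest chain of compounds linking start to goal."""
    graph = {}
    for substrate, product in reactions:
        graph.setdefault(substrate, []).append(product)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of reactions connects the two compounds

# Hypothetical reaction list (each pair: substrate -> product).
reactions = [("glucose", "G6P"), ("G6P", "F6P"), ("G6P", "6PG"),
             ("F6P", "F16BP"), ("F16BP", "pyruvate")]
print(find_pathway(reactions, "glucose", "pyruvate"))
# ['glucose', 'G6P', 'F6P', 'F16BP', 'pyruvate']
```

Breadth-first search returns a shortest connecting chain; a database like PathDB can enumerate all such chains, then rank or filter them by biological plausibility.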

PathDB contains, in addition to the usual search tools, a pathway visualization interface that allows you
to inspect any selected pathway and display different representations of the pathway. The PathDB
developers plan to incorporate metabolic simulation into the user interface as well, although those
features aren't available at the time of this writing.

The PathDB browser is a platform-independent tool you can use on any machine with a Java Runtime
Environment ( JRE) Version 1.1.4 or later installed. Both Sun and IBM supply a JRE for Linux
systems. Once the JRE is installed, you can run the PathDB installer, making sure that the installer uses
the correct version of the JRE (for this to work, you may need to add the JRE binary directory to your
path). Let the installer create a link to PathDB in your home directory. To run the program, enter
PathDB. You may be prompted to specify the correct Java virtual machine or exit; use the same Java
pathway you did when you installed the software.

To sample how PathDB works, submit a simple query that will assure you get results the first time
(such as "All Catalysts with EC Number like 1.1.1.1," which brings up a list of alcohol
dehydrogenases). You can also follow the tutorials available from the PathDB web site.

11.8 Modeling Kinetics and Physiology
A new "omics" buzzword that has begun to crop up in the literature rather recently is "metabolomics."
Researchers have begun to recognize the need to exhaustively track the availability and concentration
of small molecules (everything from electrolytes to metabolic intermediates to enzyme cofactors) in
biological systems. Small molecules are extremely important in biochemistry, playing roles in
everything from signal transduction to bioenergetics to regulation. The populations of small molecules
that interact with proteins in the cell will continue to be a key topic of research as biologists attempt to
assemble the big picture of cellular function and physiology.

Mathematical modeling of biochemical kinetics and physiology is a complicated topic that is largely
beyond the scope of this book. Mathematical models are generally system-specific, and to develop
them requires a detailed understanding of a biological system and a facility with differential equations.
However, a few groups have begun to develop context-independent software for developing
biochemical and physiological models. Some of the best known of these are Gepasi, a system for
biochemical modeling; XPP, a more general package for dynamical simulation; and the Virtual Cell

The essential principle behind biochemical and physiological modeling is that changes in biochemical
systems can be modeled in terms of chemical concentrations and associated rate equations. Each "pool"

