. 6
( 12)


(http://www.ebi.ac.uk/translate/) at the European Bioinformatics Institute (EBI) will do it for you.

7.7 Pairwise Sequence Comparison
Comparison of protein and DNA sequences is one of the foundations of bioinformatics. Our ability to
perform rapid automated comparisons of sequences facilitates everything from assignment of function
to a new sequence, to prediction and construction of model protein structures, to design and analysis of
gene expression experiments. As biological sequence data has accumulated, it has become apparent

that nature is conservative. A new biochemistry isn't created for each new species, and new
functionality isn't created by the sudden appearance of whole new genes. Instead, incremental
modifications give rise to genetic diversity and novel function. With this premise in mind, detection of
similarity between sequences allows you to transfer information about one sequence to other similar
sequences with reasonable, though not always total, confidence.

Before you can make comparative statements about nucleic acid or protein sequences, a sequence
alignment is needed. The basic concept of selecting an optimal sequence alignment is simple. The two
sequences are matched up in an arbitrary way. The quality of the match is scored. Then one sequence is
moved with respect to the other and the match is scored again, until the best-scoring alignment is

What sounds simple in principle isn't at all simple in practice. Choosing a good alignment by eye is
possible, but life is too short to do it more than once or twice. An automated method for finding the
optimal alignment out of the thousands of alternatives is clearly the right approach, but in order for the
method to be consistent and biologically meaningful, several questions must be answered. How should
alignments be scored? A scoring scheme can be as simple as +1 for a match and -1 for a mismatch, but
what is the best scoring scheme for the data? Should gaps be allowed to open in the sequences to
facilitate better matches elsewhere? If gaps are allowed, how should they be scored? Given the correct
scoring parameters, what is the best algorithm for finding the optimal alignment of two sequences? And
when an alignment is produced, is it necessarily significant? Can an alignment of similar quality be
produced for two random sequences? Through the rest of this section, we consider each of these
questions in greater detail.

Figure 7-8 shows examples of three kinds of alignment. These are three pairwise sequence alignments
generated using a program called ALIGN. In each alignment, the sequences being compared are
displayed, one above the other, such that matching residues are aligned. Identical matches are indicated
with a colon (:) between the matching residues, while similarities are indicated with a single dot (.).
Information about the alignment is presented at the top, including percent identity (the number of
identical matches divided by the length of the alignment) and score. Finally, gaps in one sequence
relative to another are represented by dashes (-) for each position in that sequence occupied by a gap.

Figure 7-8. Three alignments: high scoring, low scoring but meaningful, and random

The first alignment is a high-scoring one: it shows a comparison of two closely related proteins (two
hemoglobin molecules, one from a sea lamprey and one from a hagfish). Compare that alignment with
the second, a comparison of two distantly related proteins (again, two hemoglobin molecules, in this
case taken from lamprey and rice). Cursory inspection shows fewer identical residues are shared by the
sequences in the low-scoring alignment than in the high-scoring one. Still, there are several similarities
or conservative changes”changes in which one amino acid has been replaced by another, chemically
similar residue. The third alignment is a random alignment, a comparison between two unrelated
sequences (the lamprey hemoglobin and a human retinol binding protein). Notice that, in addition to
the few identities and conservative mutations between the two, large gaps have been opened in both
sequences to achieve this alignment. Gene families aren't likely to evolve in this way, and given the
lack of similarity between the sequences, you can conclude that these proteins are unrelated.

In describing sequence comparisons, several different terms are commonly used. Sequence identity,
sequence similarity, and sequence homology are the most important of these terms. Each means
something slightly different, though they are often casually used interchangeably.

Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same
position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions
are scored according to the probability with which they occur. In protein sequences, amino acids of
similar chemical properties are found to substitute for each other much more readily than dissimilar
amino acids. These propensities are represented in scoring matrices that score sequence alignments.
Two amino acids are considered similar if one can be substituted for another with a positive log odds
score from a scoring matrix (described in the next section).

Sequence homology is a more general term that indicates evolutionary relatedness among sequences. It
is common to speak of a percentage of sequence homology when comparing two sequences, although
that percentage may indicate a mixture of identical and similar sites. Finally, sequence homology refers
to the evolutionary relatedness between sequences. Two sequences are said to be homologous if they
are both derived from a common ancestral sequence. The terms similarity and homology are often used
interchangeably to describe sequences, but, strictly speaking, they mean different things. Similarity
refers to the presence of identical and similar sites in the two sequences, while homology reflects a
stronger claim that the two sequences share a common ancestor.

7.7.1 Scoring Matrices

What you really want to learn when evaluating a sequence alignment is whether a given alignment is
random, or meaningful. If the alignment is meaningful, you want to gauge just how meaningful it is.
You attempt to do this by constructing a scoring matrix.

A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair
occurring in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities.
One is the probability of random occurrence of an amino acid in a sequence alignment. This value is
simply the product of the independent frequencies of occurrence of each of the amino acids. The other
is the probability of meaningful occurrence of a pair of residues in a sequence alignment. These
probabilities are derived from samples of actual sequence alignments that are known to be valid.

In order to score an alignment, the alignment program needs to know if it is more likely that a given
amino acid pair has occurred randomly, or that it has occurred as a result of an evolutionary event. The
logarithm of the ratio of the probability of meaningful occurrence to the probability of random
occurrence is positive if the probability of meaningful occurrence is greater, and negative if the
probability of random occurrence is greater. Because the scores are logarithms of probability ratios,
they can be meaningfully added to give a score for the entire sequence. The more positive the score, the
more likely the alignment is to be significant.

Figure 7-9 shows an example of a BLOSUM45 matrix, a popular substitution matrix for amino acids.

Figure 7-9. The BLOSUM45 matrix, a popular substitution matrix for amino acids

Substitution matrices for amino acids are complicated because they reflect the chemical nature and
frequency of occurrence of the amino acids. For example, in the BLOSUM matrix, glutamic acid (E)
has a positive score for substitution with aspartic acid (D) and also with glutamine (Q). Both these
substitutions are chemically conservative. Aspartic acid has a sidechain that is chemically similar to
glutamic acid, though one methyl group shorter. On the other hand, glutamine is similar in size and
chemistry to glutamic acid, but it is neutral while glutamic acid is negatively charged. Substitution
scores for glutamic acid with residues such as isoleucine (I) and leucine (L) are negative. These
residues have neutral, nonpolar sidechains and are chemically different from glutamic acid. The scores
on the diagonal of the matrix reflect the frequency of occurrence of each amino acid. For example, with
a positive score of 15, it is extremely unlikely that the alignment of a rare tryptophan (W) with another
tryptophan is coincidence, while the more common serine (S) has a positive score of only 4 for a match
with another serine. It's important to remember that these scores are logarithms, which means that a
match of two serines is far from being mere coincidence.

Substitution matrices for bases in DNA or RNA sequence are very simple. By default, the sequence
alignment program BLAST uses the scheme of assigning a standard reward for a match and a standard
penalty for a mismatch, with no regard to overall frequencies of bases. In most cases, it is reasonable to
assume that A:T and G:C occur in roughly equal proportions.

Commonly used substitution matrices include the BLOSUM and PAM matrices. When using BLAST,
you need to select a scoring matrix. Most automated servers select a default matrix for you (usually
something like BLOSUM62), and if you're just doing a quick sequence search, it's fine to accept the

BLOSUM matrices are derived from the Blocks database, a set of ungapped alignments of sequence
regions from families of related proteins. A clustering approach sorts the sequences in each block into
closely related groups, and the frequencies of substitutions between these within a family derives the
probability of a meaningful substitution. The numerical value (e.g., 62) associated with a BLOSUM
matrix represents the cutoff value for the clustering step. A value of 62 indicates that sequences were
put into the same cluster if they were more than 62% identical. By allowing more diverse sequences to
be included in each cluster, lower cutoff values represent longer evolutionary time scales, so matrices
with low cutoff values are appropriate for seeking more distant relationships. BLOSUM62 is the

standard matrix for ungapped alignments, while BLOSUM50 is more commonly used when generating
alignments with gaps.

Point accepted mutation (PAM) matrices are scaled according to a model of evolutionary distance from
alignments of closely related sequences. One PAM "unit" is equivalent to an average change in 1% of
all amino acid positions. The most commonly used PAM matrix is PAM250. However, comparison of
results using PAM and BLOSUM matrices suggest that BLOSUM matrices are better at detecting
biologically significant similarities.

7.7.2 Gap Penalties

DNA sequences change not only by point mutation, but by insertion and deletion of residues as well.
Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to
produce a meaningful alignment between them. Most algorithms use a gap penalty to represent the
validity of adding a gap in an alignment.

The addition of a gap has to be costly enough, in terms of the overall alignment score, that gaps will
open only where they are really needed and not all over the sequence. Most sequence alignment models
use affine gap penalties, in which the cost of opening a gap in a sequence is different from the cost of
extending a gap that has already been started. Of these two penalties”-the gap opening penalty and the
gap extension penalty”-the gap opening penalties tend to be much higher than the associated
extension penalty. This tendency reflects the tendency for insertions and deletions to occur over several
residues at a time.

Gap penalties are intimately tied to the scoring matrix that aligns the sequences: the best pair of gap
opening and extension penalties for one scoring matrix doesn't necessarily work with another. Scores of
-11 for gap opening and -1 for gap extension are commonly used in conjunction with the BLOSUM 62
matrix for gapped-BLAST, while BLOSUM50 uses a -12/-1 penalty.

7.7.3 Dynamic Programming

Dynamic programming methods are a general class of algorithms that are often seen both in sequence
alignment and other computational problems. They were first described in the 1950s by Richard
Bellman of Princeton University as a general optimization technique. Dynamic programming seems to
have been introduced to biological sequence comparison by Saul Needleman and Christian Wunsch,

who apparently were unaware of the similarity between their method and Bellman's.
Or, as mathematicians might say, "rediscovered." Because computational biology combines research from so many different areas, this independent discovery
happens often and is only noticed later.

As we mentioned, dynamic programming algorithms solve optimization problems, problems in which
there are a large number of possible solutions, but only one (or a small number of ) best solutions. A
dynamic programming algorithm finds the best solution by first breaking the original problem into
smaller subproblems and then solving. These pieces of the larger problem have a sequential
dependency; that is, the fourth piece can be solved only with the answer to the third piece, the third can
be solved only with the answer to the second, and so on. Dynamic programming works by first solving
all these subproblems, storing each intermediate solution in a table along with a score, and finally
choosing the sequence of solutions that yields the highest score. The goal of the dynamic programming
algorithm is to maximize the total score for the alignment. In order to do this, the number of high-

scoring residue pairs must be maximized and the number of gaps and low-scoring pairs must be

In sequence comparison, the overall problem is finding the best alignment between two sequences. This
problem is broken down into subproblems of aligning each residue from one sequence with each
residue from the other. The solution is a decision as to whether the residues should be aligned with
each other, a gap should be introduced in the first sequence, or a gap should be introduced in the
second sequence. Each high-scoring choice rules out the other two low-scoring possibilities, so that if
information about the accumulated scores is stored at each step, every possible alignment need not be

The algorithm uses an (m x n) matrix of scores (illustrated in Figure 7-10) in which m and n are the
lengths of the sequences being aligned. Starting with the alignment of a gap against itself (which is
given the initial score zero), the algorithm fills in the matrix one row at a time. At each position in the
matrix, the algorithm computes the scores that result for each of its three choices, selects the one that
yields the highest score, then stores a pointer at the current position to the preceding position that was
used to arrive at the high score. When every position in the matrix has been filled in, a traceback step is
performed, and the highest-scoring path along the pointers is followed back to the beginning of the

Figure 7-10. A matrix of scores comparing two sequences; continuous high-scoring matches are

7.7.4 Global Alignment

One alignment scenario you may encounter is the alignment of two sequences along their whole length.
The algorithm for alignment of whole sequences is called the Needleman-Wunsch algorithm. In this
scenario, an optimal alignment is built up from high-scoring alignments of subsequences, stepping
through the matrix from top left to bottom right. Only the best-scoring path can be traced through the
matrix, resulting in an optimal alignment. Using ALIGN to produce a global sequence alignment

Now that we have seen the theory behind the global alignment of two sequences, let's examine a
program that implements this algorithm. ALIGN is a simple utility for computing global alignments. It
is part of the FASTA software distribution, described later in this chapter. The programs in the FASTA
distribution are easily run from the Unix command line, although many of them have been incorporated
into the SDSC Biology Workbench web-based sequence analysis software, if you prefer to access them
through a point-and-click interface. The FASTA programs compile easily under Linux; however, once
they are compiled, you need to link them into your /usr/local/bin directory or some other sensible
location by hand.

To run ALIGN and any of the other FASTA programs, you need sequence data in FASTA format. This
is one of the most frequently used sequence formats and probably the simplest. To use ALIGN, each of
the sequences you are aligning should be in a separate file.

A sequence in FASTA format looks like this: [3]

Also known as Pearson format, after the author of the FASTA software, William Pearson.


The FASTA format is very flexible, and it is one of the most commonly used formats for sequence
analysis programs. A FASTA file contains one or more records in FASTA format, separated by empty
lines. Each record consists of a human-readable comment followed by a nucleotide or protein sequence.
The comment appears in the first line of the record; it must begin with a greater-than (>) symbol
followed by one or more identifiers for the sequence. The comment may contain information about the
molecule represented by the sequence, such as the protein or gene name and source organism. In the
previous example, the identifier is a PDB code (2HHB), followed by a description of the sequence (the
A chain of a deoxyhemoglobin protein). The remainder of the record contains the sequence itself,
divided into separate lines by line breaks. Lines are usually 60 characters long, but the format doesn't
specify a line length. Programs that take FASTA data as input (such as ALIGN) usually make
allowances for FASTA's free-form nature. Still, it's a good practice to check the program's
documentation to make sure that your data is appropriately formatted.

To use ALIGN, simply enter align at the command prompt. You are then prompted for sequence
filenames. Results are sent to standard output. The ASCII format for pairwise alignments that is
produced by FASTA is still commonly used, although there is a trend toward use of more easily parsed
alignment formats:

Output scoring matrix: BLOSUM50, gap penalties: -12/-2 43.2% identity;
Global alignment score: 374

10 20 30 40 50
: :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :.

10 20 30 40 50
60 70 80 90 100 110
.::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :.

60 70 80 90 100 110
120 130 140
:::: :.:. .: .:.:...:. ::.
120 130 140

The FASTA distribution contains a sample HTML form and CGI script for use with the program
LALIGN, a pairwise local alignment program. The script can be modified to work with the ALIGN
program if a web-based interface is desired.

7.7.5 Local Alignment

The most commonly used sequence alignment tools rely on a strategy called local alignment. The
global alignment strategy discussed earlier assumes that the two sequences to be aligned are known and
are to be aligned over their full length. In the scenarios that are encountered most often with sequence
alignment, however, you are either searching with one sequence against a sequence database looking
for unknown sequences, or searching a very long DNA sequence, such as part of a genome, for partial
segments that match a query sequence. In protein or gene sequences that do have some evolutionary
relatedness, but which have diverged significantly from each other, short homologous segments may be
all the evidence of sequence homology that remains.

The version of the dynamic programming algorithm that performs local alignment of two sequences is
known as the Smith-Waterman algorithm. Named for its inventors, Dr. Temple Smith and Dr. Michael
Waterman, this algorithm is similar to the Needleman-Wunsch algorithm except that an additional
choice is allowed when tracing through the matrix. A local alignment isn't required to extend from
beginning to end of the two sequences being aligned. If the cumulative score up to some point in the
sequence is negative, the alignment can be abandoned and a new alignment started. The alignment can
also end anywhere in the matrix. Tools for local alignment

One of the most frequently reported implementations of the Smith-Waterman algorithm for database
searching is the program SSEARCH, which is part of the FASTA distribution described later.
LALIGN, also part of the FASTA package, is an implementation of the Smith-Waterman algorithm for
aligning two sequences.

7.8 Sequence Queries Against Biological Databases
A common application of sequence alignment is searching a database for sequences that are similar to a
query sequence. In these searches, an alignment of a sequence hundreds or thousands of residues long
is matched against a database of at least tens of thousands of comparably sized sequences. Using
dynamic programming-based methods, this isn't very practical unless special-purpose hardware is
available. Instead, for routine searches, special heuristic database-searching methods are used.
Heuristic methods exploit knowledge about sequences and alignment statistics to make these large-
scale searches efficient and practical. While they don't guarantee optimal alignments, they make
sensitive searches of large sequence databases possible. In this section, we describe BLAST and
FASTA, two commonly used database-searching programs.

7.8.1 Local Alignment-Based Searching Using BLAST

By far, the most popular tool for searching sequence databases is a program called BLAST (Basic
Local Alignment Search Tool). BLAST is the algorithm at the core of most online sequence search
servers. It performs pairwise comparisons of sequences, seeking regions of local similarity, rather than

optimal global alignments between whole sequences.
To give you perspective on how long the common tools of bioinformatics have been available, the original BLAST paper by Altschul et al. was published in the
Journal of Molecular Biology in October 1990.

BLAST can perform hundreds or even thousands of sequence comparisons in a matter of minutes. And
in less than a few hours, a query sequence can be compared to an entire database to find all similar
sequences. BLAST is so popular for this purpose that it's become a verb in the computational biology
community, as in "I BLASTed this sequence against GenBank and came up with three matches." The BLAST algorithm

Local sequence alignment searching using a standard Smith-Waterman algorithm is a fairly slow
process. The BLAST algorithm, which speeds up local sequence alignment, has three basic steps. First,
it creates a list of all short sequences (called words in BLAST terminology) that score above a
threshold value when aligned with the query sequence. Next, the sequence database is searched for
occurrences of these words. Because the word length is so short (3 residues for proteins, 11 residues for
nucleic acids), it's possible to search a precomputed table of all words and their positions in the
sequences for improved speed. These matching words are then extended into ungapped local
alignments between the query sequence and the sequence from the database. Extensions are continued
until the score of the alignment drops below a threshold. The top-scoring alignments in a sequence, or
maximal-scoring segment pairs (MSPs), are combined where possible into local alignments. Originally,
BLAST searched only for ungapped alignments. However, new additions to the BLAST software
package that do search for gapped alignments have since been introduced. NCBI BLAST and WU-BLAST

There are two implementations of the BLAST algorithm: NCBI BLAST and WU-BLAST. Both
implementations can be used as web services and as downloadable software packages. NCBI BLAST is
available from the National Center for Biotechnology Information (NCBI), while WU-BLAST is an
alternate version that grew out of NCBI BLAST 1.4 and is developed and maintained by Dr. Warren
Gish and coworkers at Washington University.

NCBI BLAST is the more commonly used of the two. The most recent versions of this program have
focused on the development of methods for comparing multiple-sequence profiles (see Chapter 8).
WU-BLAST, on the other hand, has developed a different system for handling gaps as well as a
number of features (such as filtering for repeats) that are useful for searching genome sequences.
Consequently, TIGR, Stanford's yeast genome server, Berkeley's Drosophila genome project, and
others use WU-BLAST 2.0 as the sequence-comparison tool for searching their genome sequence data
via the Web. As of this writing, WU-BLAST 2.0, the most recent version of the software, is
copyrighted. NCBI BLAST and WU-BLAST's previous version, 1.4, are both in the public domain and
freely available to all researchers. Because of its ubiquity we focus on NCBI BLAST in the following
section, but WU-BLAST is an alternative. For more information on these flavors of BLAST see the
NCBI web site at http://www.ncbi.nlm.nih.gov/BLAST, or the WU-BLAST web site at
161 What do the various BLAST programs do?

Frequent us ers of BLAST can also download and install BLAST binaries on their own machines.
BLAST installs easily on a Linux system. Simply create a new directory (e.g., /usr/local/blast), move
the archive into it, and extract. Here are the four main executable programs in the BLAST distribution:


Performs BLAST searches using one of five BLAST programs: blastp, blastn, blastx, tblastn, or


Performs searches in PSI-BLAST or PHI-BLAST mode


Performs a local alignment of two sequences


Converts a FASTA-format flat file sequence database into a BLAST database

blastall encompasses all the major options for ungapped and gapped BLAST searches. A full list of its
command-line arguments can be displayed with the command blastall - :


Program name. Its options include:


Protein sequence (PS) query versus PS database


Nucleic acid sequence (NS) query versus NS database


NS query translated in all six reading frames versus PS database


PS query versus NS database dynamically translated in all six reading frames


Translated NS query versus translated NS database”computationally intensive

Database name. Each indexed database consists of several files; the name is the common
portion of those filenames.


Query sequence filename. Defaults to standard input if not specified.


Expectation value cutoff. (See Section


Alignment view. Its options include:




Master-slave, show identities


Master-slave, no identities


Flat master-slave, show identities


Flat master-slave, no identities


Master-slave, no identities, blunt ends


Flat master-slave, no identities, blunt ends


Output file name. Defaults to standard output if not specified.


Gap opening penalty. Defaults to 11 (for BLOSUM63 matrix).


Gap extension penalty. Defaults to 1 (for BLOSUM63 matrix).


Nucleotide mismatch penalty. blastn only. Defaults to -3.


Reward for nucleotide match. blastn only. Defaults to 1.


Number of alignments to show.


Perform gapped alignment. T/F. Unavailable with tblastx.


Scoring matrix. Defaults to BLOSUM62.


Produce HTML output. T/F. Defaults to F.

These are the command-line options you are most likely to use, but there are a large number of other
options available. As you become familiar with BLAST, you may want to learn to use them.

blastpgp allows you to use two new BLAST modes: PHI-BLAST (Pattern Hit Initiated BLAST) and
PSI-BLAST (Position Specific Iterative BLAST). PHI-BLAST uses protein motifs, such as those found
in PROSITE and other motif databases, to increase the likelihood of finding biologically significant
matches. PSI-BLAST uses an iterative alignment procedure to develop position-specific scoring
matrices, which increases its capability to detect weak pattern matches. Both methods are discussed
further when we get to multiple sequence analysis in Chapter 8.

bl2seq allows the comparison of two known sequences using the blastp or blastn programs. Most of the
command-line options for bl2seq are similar to those for blastall. Building a local database with formatdb

Usage: formatdb -i araseed.nt -p F -o T

Although BLAST is available on many web servers, one of the benefits of installing and using BLAST
locally is the ability to create your own sequence databases. For instance, you may have a set of
sequences that aren't yet published or publicly distributed. They're not in the GenBank database, so if
you can't run BLAST on your own machine, how do you search them?

The program formatdb accepts an input sequence database either in FASTA format or in NCBI's
ASN.1 format (described in Chapter 12 ). On the program command line it is necessary to specify the
input filename, whether the input is a protein or nucleic acid sequence, and whether you want to create
indexes for the database or not. There are other command-line options available, which can be viewed
by trying to run formatdb with no command-line options specified.

The files created are:

araseed.nt.nsq Evaluating BLAST results

A BLAST search in a sequence database can produce dozens or hundreds of candidate alignments. Out
of these alignments, how can you tell which are really significantly homologous, and which are merely
the best matches between unrelated sequences? BLAST provides three related pieces of information
that allow you to interpret its results: raw scores, bit scores, and E-values.

The raw score for a local sequence alignment is the sum of the scores of the maximal-scoring segment
pairs (MSPs) that make up the alignment. Because of differences between scoring matrices, raw scores
aren't always directly comparable. Bit scores are raw scores that have been converted from the log base
of the scoring matrix that creates the alignment to log base 2. This rescaling allows bit scores to be
compared between alignments.

E-values provide information about the likelihood that a given sequence alignment is significant. An
alignment's E-value indicates the number of alignments one expects to find with a score greater than or
equal to the observed alignment's score in a search against a random database. Thus, a large E-value (5
or 10) indicates that the alignment probably has occurred by chance, and that the target sequence has
been aligned to an unrelated sequence in the database. E-values of 0.1 or 0.05 are typically used as
cutoffs in sequence database searches. Using a larger E-value cutoff in a database search allows more
distant matches to be found, but it also results in a higher rate of spurious alignments. Of the three, E-
values are the values most often reported in the literature.

There is a limit beyond which sequence similarity becomes uninformative about the relatedness of the
sequences being compared. This limit is encountered below approximately 25% sequence similarity for
protein sequences of normal length, although research continues to push at this boundary. In the case of
protein sequences with low sequence similarity that are still believed to be related, structural analysis
techniques may provide evidence for such a relationship. Where structure is unknown, sequences with
low similarity are categorized as unrelated, but that may mean only that the evolutionary distance
between sequences is so great that a relationship can't be detected.

7.8.2 Local Alignment Using FASTA
Another heuristic method for local sequence alignment is the FASTA algorithm. FASTA predates
BLAST, and it is still actively maintained by Dr. William Pearson at the University of Virginia. Like
BLAST, it is available both as a service over the Web and as a downloadable set of programs. In this
section, we describe the current version of the FASTA algorithm and some of the programs included in
the FASTA distribution. The FASTA algorithm

FASTA first searches for short sequences (called ktups) that occur in both the query sequence and the

sequence database. Then, using the BLOSUM50 matrix, the algorithm scores the 10 ungapped
alignments that contain the most identical ktups. These ungapped alignments are tested for their ability
to be merged into a gapped alignment without reducing the score below a threshold. For those merged
alignments that score over the threshold, an optimal local alignment of that region is then computed,
and the score for that alignment (called the optimized score) is reported.
An abbreviation for k tuples, o r ordered sequences of k residues.

FASTA ktups are shorter than BLAST words, typically 1 or 2 for proteins, and 4 or 6 for nucleic acids.
Lower ktup values result in slower but more sensitive searches, while higher ktup values yield faster
searches with fewer false positives. The FASTA programs

The FASTA distribution contains search programs that are analogous to the main BLAST modes, with
the exception of PHI-BLAST and PSI-BLAST, as well as programs for global and local pairwise
alignment and other useful functions. The FASTA programs listed here all compile easily on a Linux


Compares a protein sequence against a protein database (or a DNA sequence against a DNA
database) using the FASTA algorithm


Compares a protein sequence against a protein database (or DNA sequence against a DNA
database) using the Smith-Waterman algorithm

[fastx /fasty]

Compares a DNA sequence against a protein database, performing translations on the DNA

[tfastx /tfasty]

Compares a protein sequence against a DNA database, performing translations on the DNA
sequence database


Computes the global alignment between two DNA or protein sequences


Computes the local alignment between two DNA or protein sequences

As of this writing, all these programs except ALIGN and LALIGN are available in the FASTA 3.3
package; for now, ALIGN and LALIGN are available only in the FASTA 2 distribution.

7.9 Multifunctional Tools for Sequence Analysis
Several research groups and companies have assembled web-based interfaces to collections of
sequence tools. The best of these have fully integrated tools, public databases, and the ability to save a
record of user data and activities from one use to another. If you're searching for matches to just one or
a few sequences and you want to search the standard public databases, these portals can save you a lot
of time while providing most of the functionality and ease of use of a commercial sequence analysis
package. In some cases, you'll have to pay for a license or subscription to access the full functionality
of these sites; others are freely accessible.


The NCBI SEALS project aims to develop a Perl-based command-line environment for Systematic
Analysis of Lots of Sequences. SEALS is far from a fully automated genome analysis tool, and it isn't
intended to be. What SEALS does provide is an enhancement to the command-line environment on
Unix systems. It is composed of a large suite of scripts with a variety of useful functions: converting
file formats, manipulating BLAST results and FASTA files, database retrieval, piping files into
Netscape, and a host of other features that make your data easier to look at without requiring a
resource-sucking GUI. SEALS runs on Unix systems and is probably most useful for those who are
already Unix aficionados. Before you write a script to process a lot of sequences, check to see if the
process you want has been implemented in SEALS.

7.9.2 The Biology Workbench

The San Diego Supercomputing Center offers access to sequence-analysis tools through the Biology
Workbench. This resource has been freely available to academic users in one form or another since
1995. Users obtain a login and password at the SDSC site, and work sessions and data can be saved
securely on the server.

The Biology Workbench offers keyword and sequence-based searching of nearly 40 major sequence
databases and over 25 whole genomes. Both BLAST and FASTA are implemented as search and
alignment tools in the Workbench, along with several local and global alignment tools, tools for DNA
sequence translation, protein sequence feature analysis, multiple sequence alignment, and phylogenetic
tree drawing. The Workbench group has not yet implemented profile tools, such as MEME, HMMer, or
sequence logo tools, although PSI-BLAST is available for sequence searches.

Although its interface can be somewhat cumbersome, involving a lot of window scrolling and button
clicking, the Biology Workbench is still the most comprehensive, convenient, and accessible of the
web-based toolkits. One of its main benefits is that many sequence file formats are accepted and

translated by the software. Users of the Workbench need never worry about file type incompatibility
and can move seamlessly from keyword-based database search, to sequence-based search, to multiple
alignment, to phylogenetic analysis.

7.9.3 DoubleTwist

Another entry into the sequence analysis portal arena is DoubleTwist at http://doubletwist.com. This
site allows you to submit a sequence for comparison to multiple databases using BLAST. It also
provides "agents" that monitor databases for new additions that match a submitted sequence and
automatically notifies the user. These services, as well as access to the EcoCyc pathways database and
to an online biology research protocols database, are free with registration at the site at the time of this

Chapter 8. Multiple Sequence Alignments, Trees, and
In Chapter 7, we introduced the idea of using sequence alignment to find and compare pairs of related
sequences. Biologically interesting problems, however, often involve comparing more than two
sequences at onc e. For example, a BLAST or FASTA search can yield a large number of sequences
that match the query. How do you compare all these resulting sequences with each other? In other
words, how can you examine these sequences to understand how they are related to one another?

One approach is to perform pairwise alignments of all pairs of sequences, then study these pairwise
alignments individually. It's more efficient (and easier to comprehend), however, if you compare all the
sequences at once, then examine the resulting ensemble alignment. This process is known as multiple
sequence alignment. Multiple sequence alignments can be used to study groups of related genes or
proteins, to infer evolutionary relationships between genes, and to discover patterns that are shared
among groups of functionally or structurally related sequences. In this chapter, we introduce some tools
for creating and interpreting multiple sequence alignments and describe some of their applications,
including phylogenetic inference and motif discovery. Phylogenetic inference and motif discovery are
rooted in evolutionary theory, so before we dive into a discussion of that area of bioinformatics, let's
take a minute to review the history and theory of evolution.

8.1 The Morphological to the Molecular
In order to ground our discussion of the details of multiple sequence alignment, let's take another brief
look at evolution. One of the goals of biology has been the creation of a taxonomy for living things, a
method of organizing species in terms of their relationships to one another. Early biologists classified
species solely according to their morphology”the physical appearance of the organism”and later, as
dissection became a more common practice, their anatomy.

Naturalists also discovered fossils of creatures that didn't resemble anything alive at the time, but were
thought to have once been living things. This evidence introduced the possibility that life on Earth had
changed over time. It also suggested that the interrelationship between species isn't static, but rather is
the result of an evolutionary process. As understanding of the geophysical processes of the planet
improved, the amount of time required for such changes to occur became clearer. It is now widely
accepted by scientists that life on Earth is approximately 3.5 billion years old. Fossil records of single-
celled organisms resembling bacteria, with an estimated age of 3.5 billion years, have been found and

The evolutionary theory that was eventually accepted by most biologists was that of Charles Darwin.
Darwin proposed that every generation of living creatures has some variability. The individuals whose
variations predispose them to survive in their environment are the ones who reproduce most
successfully and pass on their traits in greater numbers. In light of this theory, it has been hypothesized
that the diversity of life forms on Earth is due to divergence, perhaps even from one common ancestral
unicellular organism, to fill various biological niches.

Molecular evolution extends the concept of evolution to the level of DNA and protein sequences.
Although the replication of DNA sequence is a very accurate process, small replication errors
accumulate over time, along with radiation damage and other mutations or alterations of the genomic

sequence. Instead of evolutionary pressure selecting organisms based on morphological traits, selection
occurs at the level of mutations. Consequently, the only mutations observed in genes from healthy
organisms are those that don't result in the organisms' death.

Because these changes between gene sequences are incremental, we can take homologous genes ”
genes with common evolutionary origin and related function”from a number of divergent organisms
and compare them by aligning identical or similar residues. This comparison of multiple sequences
shows which regions of a gene (or its derived protein) are sensitive to mutation and which are tolerant
of having one residue replaced by another. Thus, we can develop hypotheses about the molecular
events underlying the evolution of these sequences. Many bioinformatics methods, including pairwise
sequence comparison and sequence database searching, are based on this observation that homologous
genes have similar sequences.

In considering sequence similarity, there is one additional wrinkle to bear in mind: the difference
between orthologs and paralogs. The chemical processes of molecular evolution are responsible for
more than just giving rise to species differences. Evolutionary change can occur within the genome of a
single species as well. Orthologs are genes that are evolutionarily related, share a function, and have
diverged by speciation. Paralogs, on the other hand, have a common ancestor but have diverged by
gene duplication and no longer have a common functional role. In other words, orthologs have the
same function but occur in different species, while paralogs exist in the same genome but have
different functions. A sequence database search may return both orthologs and paralogs. Depending on
the objectives of your study, you probably will not want to treat them all as members of the same set.

8.2 Multiple Sequence Alignment
Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they
are a statement of both evolutionary and structural similarity among the proteins encoded by each
sequence in the alignment. We know that proteins with closely related functions are similar in both
sequence and structure from organism to organism, and that sequence tends to change more rapidly
than structure in the course of evolution. In multiple alignments generated from sequence data alone,
regions that are similar in sequence are usually found to be superimposable in structure as well.

With a detailed knowledge of the biochemistry of a protein, you can create a multiple alignment by
hand. This is a painstaking process, however. The challenge of automatic alignment is that it is hard to
define exactly what an optimal multiple alignment is, and impossible to set a standard for a single
correct multiple alignment. In theory, there is one underlying evolutionary process and one
evolutionarily correct alignment to be generated from any group of sequences. However, the
differences between sequences can be so great in parts of an alignment that there isn't an apparent,
unique solution to be found by an alignment algorithm. Those same divergent regions are often
structurally unalignable as well. Most of the insight that we derive from multiple alignments comes
from analyzing the regions of similarity, not from attempting to align the very diverged regions.

The dynamic programming algorithm used for pairwise sequence alignment can theoretically be
extended to any number of sequences. However, the time and memory requirements of this algorithm
increase exponentially with the number of sequences. Dynamic programming alignment of two
sequences takes seconds. Alignment of four relatively short sequences takes a few hours. Beyond that,
it becomes impractical to align sequences this way. The program MSA is an implementation of an
algorithm that reduces the complexity of the dynamic programming problem for multiple sequences to

some extent. It can align about seven relatively short (200 -300) protein sequences in a reasonable
amount of time. However, MSA is of little use when comparing large numbers of sequences.

8.2.1 Progressive Strategies for Multiple Alignment

A common approach to multiple sequence alignment is to progressively align pairs of sequences. The
general progressive strategy can be outlined as follows: a starting pair of sequences is selected and
aligned, then each subsequent sequence is aligned to the previous alignment. Like the Needleman-
Wunsch and Smith-Waterman algorithms for sequence alignment, progressive alignment is an instance
of a heuristic algorithm. Specifically, it is a greedy algorithm. Greedy algorithms decompose a problem
into pieces, then choose the best solution to each piece without paying attention to the problem as a
whole. In the case of progressive alignment, the overall problem (alignment of many sequences) is
decomposed into a series of pairwise alignment steps.

Because it is a heuristic algorithm, progressive alignment isn't guaranteed to find the best possible
alignment. In practice, however, it is efficient and produces biologically meaningful results.
Progressive alignment methods differ in several respects: how they choose the initial pair of sequences
to align, whether they align every subsequent sequence to a single cumulative alignment or create
subfamilies, and how they score individual alignments and alignments of individual sequences to
previous alignments.

8.2.2 Multiple Alignment with ClustalW

One commonly used program for progressive multiple sequence alignment is ClustalW. The heuristic
used in ClustalW is based on phylogenetic analysis. First, a pairwise distance matrix for all the
sequences to be aligned is generated, and a guide tree is created using the neighbor-joining algorithm.
Then, each of the most closely related pairs of sequences”the outermost branches of the tree”are
aligned to each other using dynamic programming. Next, each new alignment is analyzed to build a
sequence profile. Finally, alignment profiles are aligned to each other or to other sequences (depending
on the topology of the tree) until a full alignment is built.

This strategy produces reasonable alignments under a range of conditions. It's not foolproof; for
distantly related sequences, it can build on the inaccuracies of pairwise alignment and phylogenetic
analysis. But for sequence sets with some recognizably related pairs, it builds on the strengths of these
methods. Pairwise sequence alignment by dynamic programming is very accurate for closely related
sequences regardless of which scoring matrix or penalty values are used. Phylogenetic analysis is
relatively unambiguous for closely related sequences. Using multiple sequences to create profiles
increases the accuracy of pairwise alignment for more distantly related sequences.

There are many parameters involved in multiple sequence alignment. There are, of course, scoring
matrices and gap penalties associated with the pairwise alignment steps. In addition, there are
weighting parameters that alter the scoring matrix used in sequence-profile and profile-profile
alignments. In ClustalW, these are set from the Multiple Alignment submenu or the Profile Structure
Alignments submenu. In ClustalX, they are set from the Alignment pulldown menu.

The pairwise alignment parameters are familiar and have the same meaning in multiple alignment as
they do in pairwise alignment. The multiple alignment parameters include gap opening and gap
extension penalties for the multiple alignment process”to be used when fine-tuning alignments ”and a

maximum allowable delay, in terms of sequence length, for the start of divergent sequences at the
beginning of the alignment.

One of ClustalW's heuristics is that, in protein sequence alignment, different scoring matrices are used
for each alignment based on expected evolutionary distance. If two sequences are close neighbors in
the tree, a scoring matrix optimized for close relationships aligns them. Distant neighbors are aligned
using matrices optimized for distant relationships. Thus, when prompted to choose a series of matrices
in the Multiple Alignment Parameters menu, it means just that: use BLOSUM62 for close relationships
and BLOSUM45 for more distant relationships, rather than the same scoring matrix for all pairwise

Another heuristic that ClustalW uses is scalable gap penalties for protein profile alignments. A gap
opening next to a conserved hydrophobic residue can be penalized more heavily than a gap opening
next to a hydrophilic residue. A gap opening too close to another gap can be penalized more heavily
than an isolated gap. These parameters are set in the Protein Gap Parameters menu.

Although ClustalW is run from the Unix command line, it is menu-driven and doesn't rely on
command-line options. To start the program, you can simply type clustalw, and a menu of options is

******** CLUSTAL W (1.8) Multiple Sequence Alignments ********
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees

S. Execute a system command
X. EXIT (leave program)

This menu, along with subsequent menus that appear after you input your sequences, guides you
through the use of ClustalW in a fairly straightforward fashion. Alignments are written in a plain-text

While ClustalW is simple to install and use on a Linux workstation, ClustalX, the X Windows-based
graphical user interface for ClustalW, isn't so easy to compile. However, ClustalX runs in its own X
window, has pulldown menus, and allows viewing and plotting of multiple sequence alignments in a
color-coded format. It also allows you to append sequences to an alignment one at a time, and to
produce color PostScript output of specified sequence ranges in an alignment from different files, if
desired, along with other convenient features. To install ClustalX on a Linux machine, you need:

• The ClustalX binaries
• The NCBI software toolkit source distribution
• The LessTif libraries

The first thing you need to do is install the LessTif libraries. This distribution is extremely easy to work
with. The LessTif libraries are available from http://www.lesstif.org and may also be available within
your Linux distribution. The NCBI Toolkit (available from http://www.ncbi.nlm.nih.gov) should

compile completely as long as your LessTif libraries are installed in /usr/X11R6/lib. If the NCBI
Toolkit installation produces the file libvibrant.a, the command clustalx will work.

8.2.3 Viewing and Editing Alignments with Jalview

Usage: Jalview alignmentfile FORMAT

Viewing alignments is useful, but it's often necessary to edit them as well. Alignment editing is one of
the few bioinformatics functions that's actually been done better for the Windows platform than for
Unix, but if you've installed Java support on your workstation you can use a program called Jalview,
written by Michele Clamp and available from the EBI at
http://www.ebi.ac.uk/˜michele/jalview/contents.html. [1]

Installing Java support involves installing a Java Development Kit and a Java Runtime Environment. IBM has ported JDK and JRE 1.1.8 to Linux. They are
available at the IBM site, http://www.ibm.com/java/jdk/index.html. You have to register to download the kits, but it's free. The kits are available as easy-to-install
RPMs. You won't encounter a lot of Java applications for bioinformatics, but when you do, it's nice to be able to run them.

To use Jalview on a Linux workstation, download the full Unix distribution of the version you want.
Unpack the distribution, then edit the file Jalview to reflect your local environment. Jalview is a script
that sets up the environment for Jalview and actually starts the program. Specifically, you want to set
the environment variable CLASSPATH to reflect the location of the class file in your JDK ( Java
Development Kit) and the location of the Jalview classes (your jalview.jar file). Set the environment
variable JAVA_EXE to point to your Java executable:

setenv CLASSPATH /usr/jdk118/lib/classes.zip:/usr/local/jalview/jalview.jar
setenv JAVA_EXE /usr/jdk118/bin/java

Jalview can read an alignment (.aln) file from Clustal, as well as several other alignment formats. We
focus on using Jalview as an alignment editor, but it does have other functions you can explore. It's also
capable of searching databases if you specify them as a command-line option.

To run Jalview, make a link to the Jalview script in your working directory.

The common alignment formats that Jalview recognizes are MSF, CLUSTAL, and FASTA. These need
to be indicated in all capital letters when the command is given.

The Jalview window is an active place: click with care. You can select individual sequences by
clicking on their names at the left of the window, and you can select ranges of sequence by clicking on
the numerical labels at the top of the sequence alignment. A red box appears to indicate the selected

As in any other menu interface, the File menu contains file import and export options. Sequence
alignments can be read from a file or even from a web URL. The Edit pulldown contains commands
that allow you to delete, copy, and move selected sequences or columns. You can also manipulate the
alignment by hand. Clicking on any letter in the alignment allows you to open a gap and move
everything to the left of that letter over by dragging in either direction with the mouse. The Colour
menu contains options for color-coding alignments, most of which are most informative for protein
sequence. The Calculate menu contains options for calculating consensus sequences and phylogenetic

8.2.4 Sequence Logos

Another way to view sequence alignments, and one which has become quite popular recently, is the
sequence logo format developed by Tom Schneider of the National Cancer Institute. This format is
especially good for shorter sequence regions, such as protein motifs. Consensus sequences represent
each position in an alignment with the residue that is most commonly found in that position. Other
information in the alignment, such as whether there are any other residues that occur at that site and
with what relative frequencies they occur, is lost in a consensus sequence.

Sequence logos, as illustrated in Figure 8-1, are a graphical way to represent relative frequencies,
information content, order of substitution preference, and other characteristics of each site in an

Figure 8-1. A sequence logo

In a sequence logo, the letters in the column at each sequence position represent the consensus
sequence in more detail than a standard single-letter consensus sequence would. The total height of a
column represents the amount of information contained in that sequence position. The sizes of the [2]

individual letters depict their relative frequency of occurrence.
For a thorough discussion of sequence logos and of information content in biological sequence data, you can download some very readable papers from Dr. Tom
Schneider's web site at the National Cancer Institute: http://www-lecb.ncifcrf.gov/˜toms/index.html.

The software for creating sequence logos is part of a larger group of programs called the DELILA
package, which was originally developed in the Pascal language. You actually need only two of the
many DELILA programs (alpro and makelogo) to create logos from aligned sequences. Pascal
compilers aren't among the compilers commonly found on Linux systems, but there is a standard GNU
Pascal compiler you can download if you're feeling adventurous. The other way to compile the

software is to use the C versions of the programs that are now available. Because these programs have
been automatically translated from Pascal, they require that the p2c (Pascal-to-C) libraries are installed
on your system.

An easier approach for the novice is to use the sequence logo web server at the University of
Cambridge, which (as of this writing) is actually recommended by the author of the DELILA programs
and hence, we assume, does exactly what it's supposed to do. Aligned sequences can be submitted to
this server in FASTA alignment format, which can be generated by ClustalX.

8.3 Phylogenetic Analysis
Having covered some of the basics of multiple sequence alignment, we now introduce one of its
applications: phylogenetic inference. Phylogenetic inference is the process of developing hypotheses
about the evolutionary relatedness of organisms based on their observable characteristics. Traditionally,
phylogenetic analyses have been based on the gross anatomy of species. When Linneaus developed the
system of classification into kingdoms, phyla, genera, and species, the early biologists sorted living
things into a symbolic Tree of Life, which we saw in Figure 1-3. This tree-based representation of the
relationships among species is a phylogenetic tree; it has since been adopted as a convenient schematic
for depicting evolutionary relatedness based on sequence similarity. The quantitative nature of
sequence relationships has allowed the development of more rigorous methods and rules for tree

While hand-drawn trees of life may branch fancifully according to what is essentially an artist's
conception of evolutionary relationships, modern phylogenetic trees are strictly binary; that is, at any
branch point, a parent branch splits into only two daughter branches. Binary trees can approximate any
other branching pattern, and the assumption that trees are binary greatly simplifies the tree-building

The length of branches in a quantitative phylogenetic tree can be determined in more than one way.
Evolutionary distance between pairs of sequences, relative to other sequences in an input data set, is
one way to assign branch length.

While a phylogeny of species generally has a root, assuming that all species have a specific common
ancestor, a phylogenetic tree derived from sequence data may be rooted or unrooted. It isn't too
difficult to calculate the similarity between any two sequences in a group and to determine where
branching points belong. It is much harder to pinpoint which sequence in such a tree is the common
ancestor, or which pair of sequences can be selected as the first daughters of a common ancestor. While
some phylogenetic inference programs do offer a hypothesis about the root of a tree, most simply
produce unrooted trees. Figure 8-2 and Figure 8-3 illustrate rooted and unrooted phylogenetic trees.

Figure 8-2. A rooted phylogenetic tree

Figure 8-3. An unrooted phylogenetic tree

A phylogeny inferred from a protein or nucleic acid sequence has only a passing resemblance to a
whole-organism tree of life (a true tree) that represents actual speciation events. A single phylogeny
may be a tree, and it may describe a biological entity, but it takes far more than a single evolutionary
analysis to draw conclusions about whole-organism phylogeny. Sequence-based phylogenies are
quantitative. When they are built based on sufficient amounts of data, they can provide valuable,
scientifically valid evidence to support theories of evolutionary history. However, a single sequence-
based phylogenetic analysis can only quantitatively describe the input data set. It isn't valid as a
quantitative tool beyond the bounds of that data set, and if you are using phylogenetic analysis tools to
develop evolutionary hypotheses, it is critical to remember this point.

It has been shown, by comparative analysis of phylogenies generated for different protein and gene
families, that one protein may evolve more quickly than another, and that a single protein may evolve
more quickly in some organisms than in others. Thus, the phylogenetic analysis of a sequence family is
most informative about the evolution of that particular gene. Only by analysis of much larger sets of
data can theories of whole-organism phylogeny be suggested.

8.3.1 Phylogenetic Trees Based on Pairwise Distances

One of the easiest to understand algorithms for tree drawing is the pairwise distance method. This
method produces a rooted tree. The algorithm is initialized by defining a matrix of distances between
each pair of sequences in the input set. Sequences are then clustered according to distance, in effect
building the tree from the branches down to the root.

Distances can be defined by more than one measure, but one of the more common and simple measures
of dissimilarity between DNA sequences is the Jukes-Cantor distance, which is logarithmically related
to the fraction of sites at which two sequences in an alignment differ. The fraction of matching
positions in an ungapped alignment between two unrelated DNA sequences approaches 25%.
Consequently, the Jukes-Cantor distance is scaled such that it approaches infinity as the fraction of
unmatched residue pairs approaches 75%.

The pairwise clustering procedure used for tree drawing (UPGMA, unweighted pair group method
using arithmetic averages) is intuitive. To begin with, each sequence is assigned to its own cluster, and
a branch (or leaf ) of the tree is started for that sequence at height zero in the tree. Then, the two
clusters that are closest together in terms of whatever distance measure has been chosen are merged
into a single cluster. A branch point (or node) is defined that connects the two branches. The node is
placed at a height in the tree that reflects the distance between the two leaves that have been joined.
This process is repeated iteratively, until there are only two clusters left. When they are joined, the root
of the tree is defined. The branch lengths in a tree constructed using this process theoretically reflect
evolutionary time.

8.3.2 Phylogenetic Trees Based on Neighbor Joining

Neighbor joining is another distance matrix method. It eliminates a possible error that can occur when
the UPGMA method is used. UPGMA produces trees in which the branches that are closest together by
absolute distance are placed as neighbors in the tree. This assumption places a restriction on the
topology of the tree that can lead to incorrect tree construction under some conditions.

In order to get around this problem, the neighbor-joining algorithm searches not just for minimum
pairwise distances according to the distance metric, but for sets of neighbors that minimize the total
length of the tree. Neighbor joining is the most widely used of the distance-based methods and can
produce reasonable trees, especially when evolutionary distances are short.

8.3.3 Phylogenetic Trees Based on Maximum Parsimony

A more widely used algorithm for tree drawing is called parsimony. Parsimony is related to Occam's
Razor, a principle formulated by the medieval philosopher William of Ockham that states the simplest
explanation is probably the correct one. Parsimony searches among the set of possible trees to find the

one requiring the least number of nucleic acid or amino acid substitutions to explain the observed
differences between sequences.
Or, in other words, "It is futile to do with more what can be done with fewer."

The only sites considered in a parsimony analysis of aligned sequences are those that provide
evolutionary information”that is, those sites that favor the choice of one tree topology over another. A
site is considered to be informative if there is more than one kind of residue at the site, and if each type
of residue is represented in more than one sequence in the alignment. Then, for each possible tree
topology, the number of inferred evolutionary changes at each site is calculated. The topology that is
maximally parsimonious is that for which the total number of inferred changes at all the informative
sites is minimized. In some cases there may be multiple tree topologies that are equally parsimonious.

As the number of sequences increases, so does the number of possible tree topologies. After a certain
point, it is impossible to exhaustively enumerate the scores of each topology. A shortcut algorithm that
finds the maximally parsimonious tree in such cases is the branch-and-bound algorithm. This algorithm
establishes an upper bound for the number of allowed evolutionary changes by computing a tree using
a fast or arbitrary method. As it evaluates other trees, it throws out any exceeding this upper bound
before the calculation is completed.

8.3.4 Phylogenetic Trees Based on Maximum Likelihood Estimation

Maximum likelihood methods also evaluate every possible tree topology given a starting set of
sequences. Maximum likelihood methods are probabilistic; that is , they search for the optimal choice
by assigning probabilities to every possible evolutionary change at informative sites, and by
maximizing the total probability of the tree. Maximum likelihood methods use information about
amino acid or nucleotide substitution rates, analogous to the substitution matrices that are used in
multiple sequence alignment.

8.3.5 Software for Phylogenetic Analysis

There is a variety of phylogenetic analysis software available for many operating systems. With such a
range of choices, which package do you use? One of the most extensive listings currently available is
maintained by Dr. Joe Felsenstein, the author of the PHYLIP package, and is accessible from the
PHYLIP web page (http://evolution.genetics.washington.edu/phylip.html). If you don't want to follow
our example and use PHYLIP, you can easily find information about other packages. PHYLIP

The most widely distributed phylogenetic analysis package is PHYLIP. It contains 30 programs that
implement different phylogenetic analysis algorithms. Each of the programs runs separately, from the
command line. By default, most of the programs look for an input file called infile and write an output
file called outfile. Rather than entering parameters via command-line flags, as with BLAST, the
programs have an interactive text interface that prompts you for information.

The following are the PHYLIP programs you are most likely to use when you're just getting started
analyzing protein and DNA sequence data:


Infers phylogenies from protein sequence input using the parsimony method


Computes an evolutionary distance matrix from protein sequence input, using maximum
likelihood estimation


Infers phylogenies from DNA sequence input using parsimony


Finds all maximally parsimonious phylogenies for a set of sequences using a branch-and-bound


Infers phylogenies from DNA sequence input using maximum likelihood estimation

Computes a distance matrix from DNA sequence input using the Jukes-Cantor distance or one
of three other distance criteria


Infers phylogenies from distance matrix data using either the pairwise clustering or the neighbor
joining algorithm


Draws a rooted tree based on output from one of the phylogeny inference programs


Draws an unrooted tree based on output from one of the phylogeny inference programs


Computes a consensus tree from a group of phylogenies


Allows interactive manipulation of a tree by the user”not based on data

PHYLIP is a flexible package, and the programs can be used together in many ways. To analyze a set
of protein sequences with PHYLIP, you can:

1. Read a multiple protein sequence alignment using PROTDIST and create a distance matrix.
2. Input the distance matrix to NEIGHBOR and generate a phylogeny based on neighbor joining.
3. Read the phylogeny into DRAWTREE and produce an unrooted phylogenetic tree.

Or, you can:

1. Read a multiple sequence alignment using PROTPARS and produce a phylogeny based on
2. Read the phylogeny using DRAWGRAM and produce a rooted tree.

Each of the PHYLIP programs is exhaustively documented in the *.doc files available with the
PHYLIP distribution. This documentation has been converted into HTML by several groups. Links to
web-based PHYLIP documentation are available from the PHYLIP home page.

Several organizations have made PHYLIP servers available on the Web; the version of PHYLIP in the
SDSC Biology Workbench produces downloadable PostScript output. The PHYLIP input format

PHYLIP's molecular sequence analysis programs can accept sequence data in an aligned (interleaved)



where the first 10 characters are that sequence's name followed by the sequence in aligned form.
Subsequent rows follow. In a sequential format, the complete first sequence is given, then the second
complete sequence, etc. However, in either case, the sequences must be prealigned by another program.
PHYLIP doesn't have a utility for computing multiple sequence alignments.

If you examine the phylogeny output from PHYLIP, you'll find it's represented by codes indicating
each of the sequences, arranged in nested parentheses. This is called Newick notation. The pattern of
the parentheses indicates the topology of the tree. The innermost parentheses surround the terminal
branches of the tree, e.g., (A,B), and each subsequent set of parentheses joins another pair of branches,
e.g., ((A,B),(C,D)). If the algorithm that generates the phylogeny produces branch lengths, these branch
lengths are associated explicitly with each branch within the Newick notation: e.g.,
((A:1.2,B:1.5):1.0,(C:2.5,D:0.5):1.2). Generating input for PHYLIP with ClustalX

The multiple sequence alignment program ClustalX, which we discussed earlier in this chapter, draws
phylogenetic trees with the neighbor joining method. Perhaps more importantly, it can read sequences
in various input formats and then write PHYLIP-format files from multiple sequence alignments, using
a simple Save As command from within the ClustalX interface.

8.4 Profiles and Motifs
In addition to studying relationships between sequences, one of the most successful applications of
multiple sequence alignments is in discovering novel, related sequences. This profile- or motif-based
analysis uses knowledge derived from multiple alignments to construct and search for sequence
patterns. In this section, we first introduce some of the concepts behind motifs, then describe tools that
use these principles to search sequence databases.

First, by way of a refresher, a multiple sequence alignment is an alignment of anywhere from three to
hundreds of sequences. Multiple sequence alignments can span the full sequence of the proteins
involved or a single region of similarity, depending on their purpose. Multiple sequence alignments,
such as the one shown in Figure 8-4, are generally built up by iterative pairwise comparison of
sequences and sequence groups, rather than by explicit multiple alignment.

Figure 8-4. A multiple sequence alignment, shown using ClustalX

A sequence motif is a locally conserved region of a sequence, or a short sequence pattern shared by a
set of sequences. The term "motif" most often refers to any sequence pattern that is predictive of a
molecule's function, a structural feature, or family membership. Motifs can be detected in protein,
DNA, and RNA sequences, but the most common use of motif-based analyses is the detection of
sequence motifs that correspond to structural or functional features in proteins. Motifs are generated
from multiple sequence alignments and can be displayed as patterns of amino acids (such as those in
the Prosite database) or as sequence logos. For computational purposes, they can be represented as
flexible patterns, position-specific scoring matrices, or profile hidden Markov models.

Motifs can be created for protein families, or sets of proteins whose members are evolutionarily related.
Protein families can consist of many proteins that range from very similar to quite diverse. While the
idea of a protein family is a fairly common concept, the method of selecting a protein family and
defining its limits depends on the researcher who defines it. As in pairwise sequence comparison, there
is a lower bound beyond which homology can't easily be detected. Motif-based methods can push this
lower bound by detecting particularly subtle sequence patterns and distant homologs.

A sequence profile is a quantitative or qualitative method of describing a motif. A profile can be
expressed in its most rudimentary form as a list of the amino acids occurring at each position in the
motif. Early profile methods used simple profiles of this sort; however, modern profile methods usually
weight amino acids according to their probability of being observed at each position.

Figure 8-5 shows a position-specific scoring matrix (PSSM), which is a matrix of scores representing a
motif. Unlike a standard scoring matrix, the first dimension of the matrix is the length of the motif; the
second dimension consists of the 20 amino acid possibilities. For each position in the matrix, there is a
probability score for the occurrence of each amino acid. Most methods for developing position-specific

scoring matrices normalize the raw probabilities with respect to a standard scoring matrix such as

Figure 8-5. PSSMs for sequence motifs common to zinc finger proteins

Finally, a profile hidden Markov model (HMM) is the rigorous probabilistic formulation of a sequence
profile. Profile HMMs contain the same probability information found in a PSSM; however, they can
also account for gaps in the alignment, which tends to improve their sensitivity. Because profile
analysis methods are still a subject of active research, there are many different programs and methods
for motif discovery and profile building. We will focus on two of the easiest motif discovery packages
to use, MEME and HMMer. We also describe the searchable databases of preconstructed protein
family motifs”some with associated PSSMs or profile HMMs”offered by several organizations.

8.4.1 Motif Databases

We have seen that profiles and other consensus representations of sequence families can be used to
search sequence databases. It shouldn't be too surprising, then, that there are motif databases that can be
searched us ing individual sequences. Motif databases contain representations of conserved sequences
shared by a sequence family. Today, their primary use is in annotation of unknown sequences: if you
get a new gene sequence hot off the sequencer, scanning it against a motif database is a good first
indicator of the function of the protein it encodes.

Motifs are generated by a variety of methods and with different objectives in mind. Some rely on
automated analysis, but there is often a large amount of hands-on labor invested in the database by an
expert curator. Because they store only those motifs that are present in reasonably large families, motif
databases are small relative to GenBank, and they don't reflect the breadth of the protein structure or
sequence databases. Be aware that an unsuccessful search against a motif database doesn't mean your
sequence contains no detectable pattern; it could be part of a family that has not yet been curated or that
doesn't meet the criteria of the particular pattern database you've searched. For proteins that do match
defined families, a search against the pattern databases can yield a lot of homology information very
quickly. Blocks

Blocks, a service of the Fred Hutchinson Cancer Research Center, is an automatically generated
database of ungapped multiple sequence alignments that correspond to the most conserved regions of
proteins. Blocks is created using a combination of motif-detection methods, beginning with a step that
exhaustively searches all spaced amino acid triplets in the sequence to discover a seed alignment,
followed by a step that extends the alignment to find an aligned region of maximum length. The Blocks
database itself contains more than 4,000 entries; it is extended to over 10,000 entries by inclusion of
blocks created from entries in several other protein family databases (see Section The Blocks
server also provides several useful search services, including IMPALA (which uses the BLAST
statistical model to compare a sequence against a library of profiles) and LAMA (Local Alignment of
Multiple Alignments; Shmuel Pietrokovski's program for comparing an alignment of your own
sequences against a database of Blocks). PROSITE

PROSITE is an expert-curated database of patterns hosted by the Swiss Institute of Bioinformatics. It
currently contains approximately 1,200 records, and is available for download as a structured flat file
from http://ftp.expasy.ch. PROSITE uses a single consensus pattern to characterize each family of
sequences. Patterns in PROSITE aren't developed based on automated analysis. Instead, they are
carefully selected based on data published in the primary literature or on reviews describing the
functionality of specific groups of proteins. A humorous cartoon on the PROSITE server indicates that
the optimal method for identifying patterns requires only a human, chalk, and a chalkboard. PROSITE
contains pattern information as well as position-specific scoring matrices that can detect new instances
of the pattern. Pfam

Pfam is a database of alignments of protein domain families. Pfam is made up of two databases: Pfam-


. 6
( 12)