

A and Pfam-B. Pfam-A is a curated database of over 2,700 gapped profiles, most of which cover whole
protein domains; Pfam-B entries are generated automatically by applying a clustering method to the
sequences left over from the creation of Pfam-A. Pfam-A entries begin with a seed alignment, a
multiple sequence alignment that the curators are confident is biologically meaningful and that may
involve some manual editing. From each seed alignment, a profile hidden Markov model is constructed
and used to search a nonredundant database of available protein sequences. A full alignment of the
family is produced from the seed alignments and any new matches. This process can be iterated to
produce more extensive families and detect remote matches. Pfam entries are annotated with
information extracted from the scientific literature, and incorporate structural data where available. As
a final note, Pfam is the database of profile HMMs used by the GeneWise genefinder to search for open
reading frames.

PRINTS

PRINTS is a database of protein motifs similar to PROSITE, except that it uses "fingerprints"
composed of more than one pattern to characterize an entire protein sequence. Motifs are often short
relative to an entire protein sequence. In PRINTS, groups of motifs found in a sequence family can
define a signature for that family.

COG

NCBI's Clusters of Orthologous Groups (COG) database is a different type of pattern database. COG is
constructed by comparing all the protein sequences encoded in 21 complete genomes. Each cluster
must consist of protein sequences from at least three separate genomes. The premise of COG is that
proteins that are conserved across these genomes from many diverse organisms represent ancient
functions that have been conserved throughout evolution. COG entries can be accessed by organism or
by functional category from the NCBI web site. COG currently contains more than 2,100 entries.

Accessing multiple databases

So, which motif database should you use to analyze a new sequence? Because the comparisons are
performed quickly and efficiently, we recommend you use as many as possible, keeping track of the
best matches from each, their scores, and (if available) the significance of the hit. While Blocks uses
InterPro as one of the sources for its own patterns, as of June 2000 it contains only ungapped patterns,
omitting gapped profiles such as those contained in Pfam-A and PROSITE. Fortunately, all the motif
databases discussed here have search interfaces available on the Web, most of which accept input in
FASTA format or FASTA alignment format.

One service that allows integrated searching of many motif databases is the European Bioinformatics
Institute's Integrated Resource of Protein Domains and Functional Sites (InterPro to its friends).
InterPro allows you to compare a sequence against all the motifs from Pfam, PRINTS, ProDom, and
PROSITE. InterPro motifs are annotated with the name of the source protein, examples of proteins in
which the motif occurs, references to the literature, and related motifs.

8.4.2 Constructing and Using Your Own Profiles

Motif databases are useful if you're looking for protein families that are already well documented.
However, if you think you've found a new motif you want to use to search GenBank, or you want to get
creative and look for patterns in unusual places, you need to build your own profiles. Several software
packages and servers are available for motif discovery, the process of finding and constructing your
own motifs from a set of sequences. The simplest way to construct a motif is to find a well-conserved
section out of a multiple sequence alignment. As usual, though, we encourage you to use automated
approaches instead of doing things by hand: automation makes your work faster, more reproducible,
and less error-prone. In addition to Block Maker, a number of other programs are commonly used to
search for and discover motifs. In this section, we discuss the use of the MEME and HMMer programs,
two packages commonly used for motif analysis.
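
To make the "simplest way" concrete before turning to MEME and HMMer, here is a minimal sketch of picking the best-conserved ungapped block out of a multiple sequence alignment. The alignment, the function names, and the scoring rule (fraction of sequences sharing the most common residue in each column) are our own illustrative assumptions, not the method used by Block Maker or any other package.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, the fraction of sequences sharing the most common residue.
    Gap characters ('-') never count as the conserved residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores

def best_block(alignment, width):
    """Return (start, mean_conservation) of the most conserved ungapped window."""
    scores = column_conservation(alignment)
    best = None
    for start in range(len(scores) - width + 1):
        if any("-" in row[start:start + width] for row in alignment):
            continue  # keep the block ungapped
        mean = sum(scores[start:start + width]) / width
        if best is None or mean > best[1]:
            best = (start, mean)
    return best

# A made-up toy alignment, for illustration only.
toy = [
    "MKT-LHCGHAC",
    "MRSALHCGHSC",
    "MKTGLHCGHAC",
    "-KSALHCGHTC",
]
start, mean = best_block(toy, width=5)
print(start, mean, [row[start:start + 5] for row in toy])
```

A real analysis would weight sequences, use a substitution matrix rather than simple identity, and assess significance, which is what the automated tools do for you.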

Before we begin, though, here are two observations about motif discovery. First, as InterPro and
Blocks grow, it is becoming increasingly difficult to find completely novel sequence motifs
undocumented by one of their member databases. Be sure to check your motif against the set of known
motifs, either by searching your sequences against the databases or by using a motif-comparison tool,
such as the Blocks server's LAMA program. Second, in order to find patterns reliably and search with
them, you need a lot of sequences. We have used these programs in projects where very few (5-10)
sequences were available, but, as a rule of thumb, more than 20 sequences are needed for reasonable
motif predictions. The more sequences you have, the more reliable the resulting motifs will be.

Finding new motifs with MEME

The MEME programs are a set of tools for motif analysis developed by Charles Elkan, Tim Bailey, and
William Grundy of the University of California, San Diego. MEME is short for Multiple EM for Motif
Elicitation (EM, in turn, is short for Expectation Maximization, a statistical procedure for
estimating "missing," or unobserved, values). The programs can be used over the Web
(http://meme.sdsc.edu) or their C source code can be downloaded, compiled, and run on a local
computer; here, we look at the web version. There are three programs in the MEME suite:


MEME
Discovers shared motifs in a set of unaligned sequences

MAST
Takes a motif discovered by MEME and uses it to search a sequence database

MetaMEME
Constructs a model from multiple MEME motifs and uses it to search a sequence database

When you submit a set of sequences to MEME, you are testing the hypothesis that, although
you don't know the overall alignment of the sequences, they share short regions of similarity. You
begin using MEME by filling in a web form with your email address and the set of sequences in which
you wish to search for a motif. Sequences can be in one of several formats, although FASTA is preferred.
At the bottom of the submission page are some parameters you need to set: the number of
times per sequence you expect a motif to occur, the number of motifs you expect to find, and the
approximate width of each motif.

The results will be sent back to you in three emails. The first is just a confirmation message, letting you
know that the job is being processed. The second (with the subject line "MEME Job xxxxx results:",
where xxxxx is the job number assigned by the MEME server) contains MEME's prediction for the
motifs in both human- and machine-readable form. This message is the one you need to search the
database; be sure to save the contents of this message to a text file, so you can later submit it to MAST
or MetaMEME. The third message (with the subject line "MEME job... MAST analysis:") is an HTML
document (making it suitable for viewing in a web browser) that shows the location of each motif in the
sequences you submitted. Each message is well documented and contains detailed explanations of the
contents.

Searching for motifs with MAST and MetaMEME

The next step of a motif analysis is to see whether there are new occurrences of your motif in other
sequences. The MEME server provides two distinct programs, MAST and MetaMEME, that allow you
to search a sequence database using your new MEME motifs. MAST simply searches for occurrences
of each motif and reports matching sequences, while MetaMEME combines multiple MEME motifs
into a hidden Markov model and uses that model to search the database. Both MAST and MetaMEME
take the MEME motif prediction from the second email as input; MetaMEME also uses the original
sequence file that generated the MEME motifs in creating its HMM. Both programs return results
showing the position of each match, its score, and its statistical significance.

You did save the second email to a text file as we suggested, didn't you?

Motif discovery with other programs

As we mentioned previously, there are a number of programs that discover motifs in groups of
unaligned sequences. Besides the ones we mentioned, you may want to try these: the SAM HMM
programs developed by David Haussler and coworkers at University of California, Santa Cruz; the
Emotif and Ematrix servers in the Brutlag group at Stanford University; and the ASSET, gibbs, and
Probe tools available for download from NCBI. Again, a good thing to do early on is to use the LAMA
program to compare your motif against the motifs in the Blocks database. If it looks like you really do
have a novel motif, it can be useful to compare the results of one or more of these other motif discovery
tools. If all the programs predict the same motif from the same sequences, you can be more confident in
your results.

HMMer

HMMer is a software package for building profile HMMs. HMMer's central functionality is located in
the hmmbuild program, which creates profile HMMs from sequence alignments, and the hmmcalibrate
program, which calibrates search statistics for the HMM. The HMMer package also contains tools for
generating new sequences probabilistically based on an HMM, searching sequence databases with a
profile as the query, and searching profile databases with a query sequence, as well as the handy utility
programs we list here:


getseq
Extracts a sequence from a large flat-file database by name. Handy to have around if you're
selecting specific records out of a database from the command line.

hmmalign
Reads both a sequence file and a profile HMM and creates a multiple sequence alignment.

hmmbuild
Builds a profile HMM from a multiple sequence alignment. It can produce global results for the
entire alignment or results for multiple local alignments.

hmmcalibrate
Reads an HMM and calibrates its search statistics.

hmmconvert
Converts an HMM into other profile formats, notably GCG profile format.

hmmemit
Generates sequences probabilistically based on a profile HMM. It can also generate a consensus
sequence.

hmmfetch
Retrieves a profile HMM from a database if the name of the desired record is known.

hmmindex
Indexes a profile HMM database.

hmmpfam
Searches a profile HMM database (e.g., Pfam) with a query sequence. Use this if you're trying
to annotate an unknown sequence.

hmmsearch
Searches a sequence database with a profile HMM. Use this if you're looking for more instances
of a pattern in a sequence database.

sreformat
Converts a sequence or alignment file from one format to another. Handy to have around.

HMMer reads multiple sequence alignment files from several different sequence alignment programs,
including ClustalW. The HMMer authors recommend ClustalW as a tool to generate multiple
alignments for input into hmmbuild.

HMMer is available for download from Dr. Sean Eddy at Washington University
(http://hmmer.wustl.edu). HMMer is a very well-behaved program, which installs without difficulty
from source on Linux systems: just follow the directions in the INSTALL file. It even installs its own
Unix manpages so you can access online help for each of the HMMer programs using the man
command. Specific information about each of the HMMer programs' command-line options can also be
viewed by running the program with the -h option.
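
As a rough illustration of how the build, calibrate, and search steps fit together, the following sketch assembles the three command lines from Python. It assumes the HMMer 2-era layout, in which hmmbuild takes the output HMM file followed by the alignment, hmmcalibrate takes the HMM file, and a search program named hmmsearch takes the HMM file followed by the sequence database. The filenames are hypothetical, and other HMMer releases may differ, so check each program's -h output before relying on this.

```python
import shutil
import subprocess

def hmmer_pipeline(alignment, hmm_file, database):
    """Return the build, calibrate, and search commands as argument lists.
    Argument order follows the HMMer 2-era usage (an assumption)."""
    return [
        ["hmmbuild", hmm_file, alignment],
        ["hmmcalibrate", hmm_file],
        ["hmmsearch", hmm_file, database],
    ]

def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " not found on PATH")
        subprocess.run(argv, check=True)

# Hypothetical filenames; nothing is executed until run_pipeline() is called.
commands = hmmer_pipeline("myfamily.aln", "myfamily.hmm", "proteins.fasta")
for argv in commands:
    print(" ".join(argv))
```

Keeping command construction separate from execution lets you print and inspect the commands first, which is a useful habit when scripting any multi-step analysis.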

8.4.3 Incorporating Motif Information into Pairwise Alignment

Multiple sequence information can optimize pairwise alignments. The BLAST package contains two
new modes that use multiple alignment information to improve the specificity of database searches.
These modes are accessed through the blastpgp program.

Position Specific Iterative BLAST (PSI-BLAST) is an enhancement of the original BLAST program
that implements profiles to increase the specificity of database searches. Starting with a single
sequence, PSI-BLAST searches a database for local alignments using gapped BLAST and builds a
multiple alignment and a profile the length of the original query sequence. The profile is then used to
search the protein database again, seeking local alignments. This procedure can be iterated any number
of times. One caveat of using PSI-BLAST is that you need to know where to stop. Errors in alignment
can be magnified by iteration, giving rise to false positives in the ultimate sequence search. PSI-
BLAST can be used as a standalone by running the blastpgp program. However, the NCBI PSI-BLAST
server is probably the optimal way to run a PSI-BLAST search. The server requires you to decide after
each iteration whether to continue to another iteration, and you can hand-pick the sequences that
contribute to the profile at each step.
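
The idea of turning aligned hits into a profile and searching with it can be sketched in a few lines. This is a deliberately simplified illustration and not PSI-BLAST's actual algorithm: it assumes ungapped aligned segments, a uniform background distribution, and a flat pseudocount, all of which are our own choices, and it reports no significance statistics.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Position-specific log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best_start, best_score).
    Residues outside the 20 standard amino acids score 0."""
    width = len(profile)
    hits = [
        (start, sum(profile[i].get(sequence[start + i], 0.0) for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[1])

# Made-up aligned segments standing in for the hits from one search iteration.
segments = ["LHCGH", "LHCGH", "IHCGH", "LHCAH"]
profile = build_profile(segments)
start, score = best_hit(profile, "MKTAYLHCGHWQ")
print(start, round(score, 2))
```

In an iterated search, the sequences found with this profile would be realigned and used to rebuild it, which is also how a single misaligned sequence can contaminate later rounds.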

Pattern Hit Initiated BLAST (PHI-BLAST) takes a sequence and a preselected pattern found in that
sequence as input to query a protein sequence database. The pattern must be expressed in PROSITE
syntax, which is described in detail on the PHI-BLAST server site. PHI-BLAST can also initiate a
series of PSI-BLAST iterations, and can be run as a standalone program or through a (vastly more user-friendly) web server.
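
If you want to test a PROSITE-style pattern against your own sequences before submitting it, the pattern can be translated into a regular expression. The sketch below handles only the common elements of the syntax as we understand it (hyphen-separated positions, x for any residue, square brackets for allowed residues, curly braces for excluded residues, and repeat counts in parentheses); the function name and the example pattern and sequence are illustrative assumptions, so consult the server documentation for the full syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # Single letters and [ABC] groups are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# An illustrative pattern and a made-up sequence built to contain one match.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
hit = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
match = re.search(regex, "MK" + hit + "GG")
print(regex, match.start(), match.group())
```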

Chapter 9. Visualizing Protein Structures and Computing Structural Properties
Analysis of protein 3D structures is a more mature field than biological sequence analysis. The Protein
Data Bank started distributing coordinates of macromolecular crystal structures in the early 1970s, and
since that time, many research groups and companies have developed software to visualize and
measure the properties of protein structures.

Visualization of structure and measurement of structural properties are important tools for molecular
and structural biologists. Being able to "see" the 3D structure of a protein and analyze its shape in
detail can suggest the location of catalytic sites and interaction sites, and can help identify targets for
the site-directed mutagenesis studies that are so often used to arrive at a detailed characterization of a
protein's functional chemistry.

Here are some recent applications of this type of approach in molecular biology:

• Molecular modeling of an allergy-causing protein from mountain cedar pollen and subsequent
identification of the region that causes allergic response
• Characterization of the mutagenic active site in DNA reverse transcriptase from HIV;
this site is thought to be responsible for the ability of the virus to mutate rapidly
• Modeling of a DNA binding protein involved in Bloom syndrome, and characterization of the
mutations that cause the disease

There are many specialized analysis programs in the protein structure literature, and we will not
attempt to catalogue all these methods. Instead, we present an introduction to standard operations for
analyzing and modeling protein structure, with examples of software for each purpose: visualization
and plotting; geometric and surface property analysis; classification; analysis of intramolecular
interactions and solvent interactions; and computation of some physicochemical properties.

For all-purpose molecular structure modeling, the easiest-to-use tools are still commercial packages
such as MSI's Quanta and Insight, Tripos' SYBYL, and others. However, licensing for these packages,
especially for multiple users, is quite expensive and they generally require specialized high-end
hardware (such as SGI and IBM Unix workstations) to run. In this chapter, we again focus on software
that can run on a standard desktop PC under Linux or within a web browser on any platform.

9.1 A Word About Protein Structure Data
Because protein structure analysis is a relatively old field, evolving earlier in the history of computers
than sequence analysis, it has inherited some inconveniences. While many programs use the standard
PDB format, others, especially molecular simulation software, expect input in slightly or significantly
different forms. And because protein structure analysis software is older, many programs are written in
the FORTRAN language and are very picky about data input formats. Data standardization at the PDB
is excellent, but standardization at the individual software package level isn't as good. If you're going to
be doing a lot of work with protein structure data it may be necessary to learn some programming to be
able to convert structure files to alternate formats when necessary. We show an example of a simple
structure file-format conversion in Chapter 12.

The Brookhaven PDB format is the protein structure data format that most structure-analysis programs
use. This format met the needs of the protein structure field in the 1970s, and was especially human-
readable, and compatible with FORTRAN programs, because of its use of rigidly structured 80-
character lines. This format consists of a header section that contains miscellaneous information about
the structure, including literature citations; resolution; crystallographic parameters; sequence, and
sometimes secondary structure information; and a section that contains atom records. Atoms labeled
ATOM are part of the protein chain, while atoms labeled HETATM (for heteroatom group) are part of
cofactor molecules, substrates, ions, or other groups that aren't a covalently bound part of the protein
chain. A detailed line-by-line description of the Brookhaven format is available from the RCSB PDB
web site.
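
Because the format is column-based rather than delimiter-based, atom records are best read by slicing fixed character positions instead of splitting on whitespace. The sketch below reads only a handful of fields, using the column positions as we understand them from the PDB format description; the function name and the two sample records are invented for illustration, so verify the columns against the official description before using this on real data.

```python
def parse_atom_line(line):
    """Parse one ATOM or HETATM record by fixed column positions.
    Returns None for any other record type."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),         # atom serial number
        "name": line[12:16].strip(),       # atom name, e.g. N, CA, C, O
        "res_name": line[17:20].strip(),   # residue or group name
        "chain": line[21].strip(),         # chain identifier
        "res_seq": int(line[22:26]),       # residue sequence number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# Two made-up records, spaced to match the fixed columns.
atom_line = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
het_line = "HETATM 1234 FE   HEM A 155      14.000  28.500   5.250  1.00 12.00"

for line in (atom_line, het_line, "HEADER    OXYGEN TRANSPORT"):
    print(parse_atom_line(line))
```

Splitting on whitespace appears to work on many files but fails when adjacent fields run together, for example when a coordinate fills its whole eight-character field.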

Protein structure files also are available from the PDB in a new format called mmCIF (the
Macromolecular Crystallographic Information Format) and from NCBI in the ASN.1 file format. Both
of these formats are highly parseable by computers, and if you are writing computer programs to
analyze protein structures, they may be easier to use than the obsolete Brookhaven format. However,
you'll need to consider that the user community is still attached to the Brookhaven format.

9.2 The Chemistry of Proteins
To work with protein sequence and structure, you need a working knowledge of protein chemistry: the
kind of knowledge you'd probably have picked up in an undergraduate organic chemistry course. We'll
provide you with a little of that vocabulary here, and you can find out more from the references listed
in the Bibliography. If you already know what you need to know about protein chemistry, you can skip
ahead to Section 9.3.

The reason you should have a basic knowledge of organic chemistry when studying protein structures
is simple. Proteins often perform their functions using standard organic reaction mechanisms, mediated
by amino acids and small organic molecules (cofactors) that bind to the protein, or by metal ions. To
understand how the protein structure might catalyze a reaction, you need to understand enough about
organic reaction mechanisms to develop a hypothesis about how the reaction might work, given the
shape of the protein and the location of various amino acids.

Even in cases in which a catalytic mechanism isn't your main concern, chemistry comes into play.
Protein association is often mediated by the electrostatic properties of the protein structure; interacting
molecules can be drawn together over considerable distances by strong electrostatic potentials. Within
protein structures, hydrogen bonds and other interatomic interactions confer structural stability.
Interatomic interactions and molecular shapes are the basis of the specificity of intermolecular
interactions: the interactions of proteins with other proteins or with small molecule substrates. You are
likely to be concerned about molecular specificity in practical applications of biochemistry, such as designing
small-molecule or peptide drugs, understanding the molecular basis of disease and immunity, or
delving into the specific molecules involved in sending molecular signals between cells and through the

The tools in this chapter enable you to look at a protein structure, see what its features are, locate
different types of amino acids and visualize specific subsets of the protein, measure distances and
surface areas, and compute spatially variable properties such as solvent accessibility and electrostatic
potentials. However, what you can do with those tools depends on your understanding of protein

9.2.1 From 1D to 3D

How does the chemistry of a protein relate to its 1D sequence? In Chapter 8, we discussed techniques
for detecting characteristic conserved patterns, called motifs, in families of protein sequences. We can
find these sequence patterns in 1D data because although the 3D structure of a protein is complex, it is
somehow determined by the invariant sequence of amino acids that makes up the protein. Motifs that
are conserved in sequence often are related to important structural or functional features of a protein
family, and those features often can be understood by their roles in the protein structure.

When amino acids come together in sequence to form a polymer, they do so by forming a peptide bond
between the basic amino group and the acidic carboxyl group of each amino acid (Figure 9-1). This
results in a long chain of amino acids that has a repeating backbone structure.

Figure 9-1. Peptide bond, peptide chain (chemical notation)

The variable group of each amino acid protrudes from the repeating backbone and is referred to in the
protein structure business as a sidechain (Figure 9-2). Each of the 20 amino acid sidechains is
chemically different from the others in some respect.

Figure 9-2. The amino acid sidechains (chemical notation)

The sidechains can be classified in many ways. Some are relatively large, while others are tiny or in
one case nonexistent. Some have a positive or negative charge. Some are oily, or hydrophobic (water-
fearing), meaning that it's energetically unfavorable for them to be solvated in water. Others are
hydrophilic (water-loving), and they solvate easily in water. Some have bulky ringlike structures, while
others are straight carbon chains. Some are acids, others are bases. Amino acids are conserved through
evolution at specific locations in a protein sequence because they are needed there, whether to stabilize
the protein structure, to form a specific binding site, or to catalyze a reaction. You can detect that
particular amino acids in a protein are conserved by looking at sequence data, but to develop a
hypothesis about why they are conserved, it's helpful to examine the 3D protein structure. Figure 9-3
shows the 20 amino acids classified into chemically similar groups. Note that many of the amino acids
fall into more than one category. An amino acid sidechain can be both "nonpolar" and "basic," for
instance, like lysine, which has a long aliphatic sidechain that terminates in an amino group. Because
the relationship between chemical characteristics and amino acids isn't one-to-one, but rather many-to-
many, it's not always simple to predict the effects of an amino acid substitution.
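
One way to work with a many-to-many classification in a program is to store each property as a set of one-letter amino acid codes. The memberships below are a simplified assignment of our own, chosen only to illustrate the data structure; they are assumptions and do not reproduce the classification in Figure 9-3.

```python
# Illustrative property sets; the memberships are assumptions, not Figure 9-3.
PROPERTIES = {
    "aliphatic": set("ILV"),
    "aromatic": set("FHWY"),
    "nonpolar": set("ACFGHIKLMTVWY"),
    "basic": set("HKR"),
    "acidic": set("DE"),
    "tiny": set("AGS"),
}

def properties_of(residue):
    """All property names whose set contains this one-letter residue code."""
    return {name for name, members in PROPERTIES.items() if residue in members}

def shared_properties(a, b):
    """Properties two residues have in common: a crude hint at how
    conservative a substitution of one for the other might be."""
    return properties_of(a) & properties_of(b)

print(sorted(properties_of("K")))           # lysine falls into two groups
print(sorted(shared_properties("K", "R")))  # what a K-to-R swap preserves
print(sorted(shared_properties("I", "V")))
```

Swapping lysine for arginine preserves one of lysine's properties under this scheme but not the other, which is exactly why the effect of a substitution is hard to predict from a single label.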

Figure 9-3. The amino acid sidechains (classification in a Venn diagram)

Interatomic forces aren't responsible only for specific interactions that form binding and interaction
sites; they also are responsible for the formation of certain standard patterns that are consistently
observed in protein structure. The amino acid backbone is sterically constrained (restricted from
moving in certain ways because atoms will bump into each other) to follow only certain pathways.
You may already be familiar with the alpha helix and beta sheet structures that commonly occur in
protein structures; the reason that alpha helices and beta sheets are common is the steric restrictions on
the protein backbone.

From the known structures of amino acids, Pauling and Corey first predicted the existence of alpha
helices and beta sheets as a component of protein structure. Ramachandran first described exactly what
range of conformations are available to amino acids in a peptide chain. Peptide chain conformation is
simply described by the values of the dihedral angles in the protein backbone (i.e., the angle described
by the four atoms surrounding the N-Cα bond and the angle described by the four atoms surrounding
the Cα-C bond). These angles are referred to as φ and ψ, respectively. The chain isn't free to rotate
around the third kind of bond in the protein backbone, the peptide bond, because it is a partial double
bond and hence chemically constrained to be planar, so the values of φ and ψ for each amino acid
provide a complete description of the protein backbone. A Ramachandran map is simply a plot of
φ versus ψ for an entire protein structure. One means of evaluating a protein structure model is to
compare its individual Ramachandran map with the general Ramachandran map of allowed values of
φ and ψ.
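
A dihedral angle can be computed directly from the coordinates of the four atoms that define it. The sketch below uses a standard atan2-based formula with the sign convention in which a cis arrangement is 0 degrees and a trans arrangement is 180 degrees. The helper names are our own, and the atom orderings used for the two backbone angles (C of the previous residue, N, alpha carbon, C for the first; N, alpha carbon, C, N of the next residue for the second) follow the usual convention rather than anything stated above.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)   # normal to the plane of p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi(c_prev, n, ca, c):
    """First backbone dihedral: rotation about the N to alpha-carbon bond."""
    return dihedral(c_prev, n, ca, c)

def psi(n, ca, c, n_next):
    """Second backbone dihedral: rotation about the alpha-carbon to C bond."""
    return dihedral(n, ca, c, n_next)

# Idealized test geometry (not real protein coordinates).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying the two helper functions to every residue of a structure and plotting the resulting pairs gives the Ramachandran map for that structure.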

Figure 9-4 is a general Ramachandran map that shows the allowed combinations of φ and ψ values for
amino acids in protein structures. The small shaded region in the lower left quadrant of the map is the
standard conformation of an amino acid in an alpha helix. The larger shaded region in the upper left
quadrant of the map is the standard conformation of an amino acid in a beta sheet, or extended

Figure 9-4. Ramachandran map of allowed conformation for protein backbones

It's apparent from the Ramachandran map that steric interactions are very important determinants of the
general features of protein structure. Steric interactions instantly eliminate a large fraction of possible
conformations for proteins and leave relatively few options for how a compact structure can form from
a linear chain of amino acids.

The sequence of a protein is called its primary structure; the most basic level of organization in a
protein is the sequence of amino acids. Alpha helix and beta sheet structures, shown in Figure 9-5, are
known collectively as secondary structures and are the next level of organization. Interactions between
multiple secondary structure elements give rise to supersecondary structure and tertiary structure:
helices and sheets contacting each other to form larger characteristic structures, which can be described
by their topology.

Figure 9-5. Alpha helix and beta strand structures

To create a functional protein, the sequence of amino acids in the protein chain must give rise to the
proper 3D fold for the protein, and it must also place individual amino acids at appropriate points on
that scaffold to carry out the protein's chemistry. Finding ways to extract those chemical instructions
from the sequences of known proteins, formulating them as rules, and using those rules to predict the
structure of other proteins is one of the biggest open research problems in bioinformatics.

9.2.2 Interatomic Forces and Protein Structure

Since the form that a protein structure can take and its chemical characteristics are governed by
interatomic interactions, it is important to have at least a basic understanding of the interatomic
interactions that play a role in protein structure. Interactions between atoms are physically complicated
and to describe them in detail would require a whole other book, which fortunately has already been
written by someone else: see the Bibliography. What we hope to give you is a rudimentary knowledge
of these forces, to help you understand why computer methods have been developed to measure and
calculate particular structural properties of proteins.

Understanding these forces gives us a basis for designing evaluative and predictive methods. Threading
methods rely on the ability to discriminate between an amino acid that is in a favorable chemical
environment and one that isn't. Homology modeling and structure optimization methods rely on rules
for spacing between atoms, bond lengths, bond angles, and other values. These rules can be derived
from chemical experiments on small molecules or from the distribution of observed values in known
protein structures. However these rules are constructed, though, they reflect energetically favorable
interactions between atoms.

Covalent interactions

Covalent interactions are very short range (approximately 1 to 1.5 angstroms), very strong
forces that bind atoms together into a molecule. In covalent bonding, the atoms involved actually share
electrons. Unlike other forces encountered in protein structures, covalent bonds actually change the
nature of the atoms involved to some extent. Atoms involved in covalent bonds are no longer discrete
entities; instead, they combine to form a new molecule.

The protein backbone, including the peptide bond that joins one amino acid to another, is held together
by covalent bonds. Amino acids retain some of their chemical individuality within the protein structure,
but formally they become part of a new molecule. Atoms within individual amino acid sidechains are
also covalently bonded to each other. These covalent bonds place strong constraints on the distance
between atoms in a protein structure.
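
Those distance constraints are easy to check numerically. The sketch below compares an observed interatomic distance with an expected backbone bond length. The ideal values are rounded, approximate figures and the tolerance is arbitrary; both are assumptions for illustration, and a real validation program uses carefully derived reference values.

```python
import math

# Approximate backbone bond lengths in angstroms (rounded; assumptions).
IDEAL_LENGTHS = {
    ("N", "CA"): 1.46,
    ("CA", "C"): 1.52,
    ("C", "O"): 1.23,
    ("C", "N"): 1.33,   # the peptide bond to the next residue
}

def check_bond(atom1, atom2, coords1, coords2, tolerance=0.05):
    """Return (observed_length, within_tolerance) for one bonded pair of atoms."""
    ideal = IDEAL_LENGTHS[(atom1, atom2)]
    observed = math.dist(coords1, coords2)
    return observed, abs(observed - ideal) <= tolerance

# Made-up coordinates: one plausible bond and one that is clearly stretched.
print(check_bond("N", "CA", (0.0, 0.0, 0.0), (1.46, 0.0, 0.0)))
print(check_bond("N", "CA", (0.0, 0.0, 0.0), (1.80, 0.0, 0.0)))
```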

Because covalent interactions are strongly constrained by physicochemical rules, an important part of
the verification process for structural quality is making sure that bond lengths, bond angles, and
dihedral angles don't vary dramatically from their allowed values. Covalent bond lengths are
determined by the size and type of the atoms involved and by the number of electrons shared between
atoms. The more electrons are shared, the shorter and stronger the bond. Bond angles are constrained
by the structure of atomic orbitals. Dihedral angles, the angles of rotation of two bonded pairs of atoms
with respect to each other around a central bond, are constrained primarily by steric hindrance. These
chemical constraints are also used in macromolecular simulation, where they are associated with
applied forces that keep the molecule in allowed conformations.

Hydrogen bonds

Hydrogen bonds arise when two polar groups interact. The two polar groups must be of specific types.
One must be a proton donor, a chemical group in which a proton (hydrogen atom) is covalently bonded
to a strongly electronegative atom such as oxygen. The bond between the proton and the
electronegative atom is polarized, giving the proton a partial positive charge and the electronegative
atom a partial negative charge. The other group must be a proton acceptor, an electronegative atom
with a partial negative charge and no attached proton. The positively polarized proton in the first group
is attracted to the negatively polarized second group, and the two form a bond that isn't covalent but is
nonetheless much shorter and stronger than a normal nonbonded interaction. Hydrogen bonds are
unusual among nonbonded and electrostatic interactions because they are strongly directional; they
weaken if the angle described by the three atoms involved is too large or too small.
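
Programs that identify hydrogen bonds in a structure typically apply a geometric test along these lines. The sketch below accepts a donor, hydrogen, and acceptor as a hydrogen bond when the donor-acceptor distance is short enough and the angle at the hydrogen is close enough to linear. The cutoffs of 3.5 angstroms and 120 degrees are commonly quoted rules of thumb that we are assuming here, and different programs use different criteria.

```python
import math

def angle_deg(a, b, c):
    """Angle at point b, in degrees, formed by points a, b, and c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(v1[i] * v2[i] for i in range(3))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_hydrogen_bond(donor, hydrogen, acceptor,
                     max_distance=3.5, min_angle=120.0):
    """Distance test on donor...acceptor plus angle test on donor-H...acceptor.
    Both cutoffs are heuristic assumptions."""
    close_enough = math.dist(donor, acceptor) <= max_distance
    straight_enough = angle_deg(donor, hydrogen, acceptor) >= min_angle
    return close_enough and straight_enough

# Idealized geometries (not real coordinates): linear, bent, and too distant.
donor, hydrogen = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(is_hydrogen_bond(donor, hydrogen, (2.9, 0.0, 0.0)))
print(is_hydrogen_bond(donor, hydrogen, (1.0, 2.0, 0.0)))
print(is_hydrogen_bond(donor, hydrogen, (5.0, 0.0, 0.0)))
```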

Hydrogen bond interactions are one of the most important stabilizing forces in protein structure. The
protein backbone contains a proton donor, in its N-H group, and a proton acceptor, in its carbonyl
oxygen, spaced at regular intervals along the chain (Figure 9-6). The interaction of these groups
stabilizes the two major types of secondary structure, the alpha helix and the beta sheet (Figure 9-7).
Therefore, some structure prediction methods attempt to use the presence of potential hydrogen bond
pairs to improve the accuracy of predictions.

Figure 9-6. Proton donor and acceptor in the protein backbone

Figure 9-7. Hydrogen bonding in alpha helices and beta sheets

Hydrophobic and hydrophilic interactions

A much-discussed (and frequently misused) concept in protein structure analysis is that of the
hydrophobic force. We've already mentioned in passing that amino acids can be classified as
hydrophobic or hydrophilic. What exactly does this mean?

Proteins, except for those bound within cell membranes, always exist in aqueous solution. They
constantly interact with water molecules. Water is a solvent that has some interesting properties, and
these properties contribute to the stability of the compact globular structures that characterize cellular

Water is a polar molecule. Individual water molecules in liquid water can each form four hydrogen
bonds with neighboring water molecules. Liquid water is an essentially uninterrupted lattice of
hydrogen bonded molecules, as seen in Figure 9-8. This unusual property contributes to the high
melting and boiling points of water, as well as to such properties as low compressibility and high
surface tension. It also results in interesting interactions of water with soluble proteins.

Figure 9-8. Hydrogen bonding in water

A nonpolar molecule dissolved in water interrupts the regular hydrogen bond lattice of liquid water.
Individual water molecules can reorient around a small nonpolar molecule to preserve their network of
hydrogen bonds, but this reorientation has a cost in terms of free energy (which is how cost is measured
in chemistry). The presence of a nonpolar solute forces water molecules into a more ordered
conformation than they would ordinarily assume. Instead of being able to face any which way and
rotate freely, water molecules near the surface of a nonpolar solute have to work around it and form a
cage. This is entropically unfavorable.

The larger a nonpolar solute gets, the more water molecules need to reorient to accommodate it, and the
higher the energy cost of solvating the molecule becomes. Of course, if the nonpolar solute has some
polar groups on its surface, water molecules can use those groups as hydrogen bonding partners instead
of other water molecules, and the water lattice is less disturbed. Globular proteins, which exist in
aqueous solution even though they are composed substantially of nonpolar groups, must present a good
hydrogen-bonding surface to the world. Hydrophilic amino acids are those whose sidechains offer
hydrogen bonding partners to the surrounding medium, while hydrophobic amino acids' sidechains
don't. The surface of a globular protein is usually anywhere from 50%-75% polar atoms, and deviations
in this pattern can suggest binding or complexation sites.

Solvent accessibility and hydrophobicity play an important role in evaluating model structures.
Threading methods for protein fold recognition use amino acid environments in evaluating models.
When many hydrophobic amino acids are found in solvent-exposed structural environments or
hydrophilic amino acids buried in the protein interior, it is considered unlikely that the protein model is
folded correctly.

Charge-charge, charge-dipole, and dipole-dipole interactions

Unlike covalent bonds, the other important interactions in protein structure are nonspecific. They don't
change the discrete nature of the interacting atoms. They involve no sharing of electrons. Covalently
bonded atoms are married; noncovalently bonded atoms are just shacking up.

Several kinds of important forces can arise among polar and charged atoms. An ion is an atom that has
a net positive or negative charge due to either a surplus or a deficit of electrons. Atoms that carry a
positive ionic charge are attracted to atoms that carry a negative ionic charge, with a strength that
depends on the size of the charges and the inverse of the distance between the atoms. In proteins,
charge-charge interactions occur between the sidechains of acidic and basic amino acids that are
negatively charged or positively charged due to loss or gain of a labile proton under normal
physiological conditions. The charge-charge interactions between amino acids in a protein structure are
called salt bridges, and they can contribute a significant stabilizing force to a protein structure.

There are other, weaker interactions that occur between charges and groups that don't carry a positive
or negative ionic charge. Dipolar molecules are molecules like those involved in hydrogen bonds, in
which one end of the molecule has a partial positive charge and the other end has a partial negative
charge. The dipole of a molecule is essentially a vector that describes the magnitude of the polarization
along a bond. Dipolar molecules can be strongly attracted to other partial charges or to ionic charges.
Many amino acid sidechains, as well as the protein backbone, have a strongly dipolar character, so
charge-dipole and dipole-dipole interactions play a substantial role in stabilization of protein structure.

Van der Waals forces

The van der Waals force is a nonspecific attractive force between molecules. This force is loosely
analogous to gravity, in that it exists between every pair of nonbonded atoms. However, it doesn't
arise from the mass of the atoms involved, but from the transient attractive forces between the
instantaneous dipole moments of each atom, and unlike gravity it falls off steeply with distance. Each
individual van der Waals interaction is quite weak, but because these interactions are nonspecific and
numerous they play a significant role in protein folding and protein association.

Repulsive forces

Repulsive forces, or steric interactions, are very short range forces that increase sharply as atomic
centers approach each other. The radius at which the repulsive force begins to increase sharply defines
a spherical boundary around each atom center inside which another atom's spherical boundary (called
the van der Waals radius) can't pass. If two nonbonded atoms in a structure get into each other's
personal space, the contact is energetically unfavorable. In real molecules, atoms stay out of each
other's way. However, in models of molecules, whether derived from NMR or x-ray data or built from
scratch, checking for van der Waals bumps between nonbonded atoms is an important part of the
structure-refinement process.

Relative strength of interatomic forces

The interaction between atoms can be described by a pair potential, such as the Lennard-Jones
potential (Figure 9-9), which includes both an attractive and a repulsive term. The form of the potential
shows that atoms tend to repel each other at very short range (positive potential energy indicating an
unfavorable interaction) but to attract each other at slightly longer range. The strength of the attraction
decays with distance, depending on the forces modeled.
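A short calculation makes the shape of the curve concrete. The sketch below uses the common 12-6 form of the Lennard-Jones potential; the parameter names epsilon (well depth) and sigma (the separation at which the energy crosses zero) are the usual textbook symbols, and the default values of 1.0 are arbitrary reduced units chosen only for illustration.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones pair energy at separation r.

    epsilon and sigma are in arbitrary (reduced) units here.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)  # repulsive term minus attractive term

# For sigma = 1, the energy minimum lies at r = 2**(1/6).
r_min = 2 ** (1 / 6)

for r in (0.9, 1.0, r_min, 1.5, 2.0, 3.0):
    print(f"r = {r:5.3f}   energy = {lennard_jones(r):9.4f}")
```

The energy is positive (unfavorable) inside sigma, reaches its most favorable value of -epsilon at r_min, and then climbs back toward zero as the atoms move apart.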

Figure 9-9. Plot of Lennard-Jones potential

When making inferences about structural stability or function based on intermolecular interactions, it is
important to understand the relative strengths of these interactions, and how they scale with distance
(Table 9-1).

Table 9-1. How Interatomic Forces Scale with Distance
Type of Bond                        Range of Interaction
Covalent                            Complicated short range
Hydrogen bond                       Roughly 1/r^2
Charge-charge                       Scales with 1/r
Charge-fixed dipole                 Scales with 1/r^2
Charge-rotating dipole              Scales with 1/r^4
Fixed dipole-fixed dipole           Scales with 1/r^3
Rotating dipole-rotating dipole     Scales with 1/r^6
Charge-nonpolar                     Scales with 1/r^4
Dipole-nonpolar                     Scales with 1/r^6
Nonpolar-nonpolar                   Scales with 1/r^6

In Table 9-1, r represents the distance between two atoms in angstroms. Interactions that decrease in
strength with 1/r are effective at a much longer range than those that decrease in strength with higher
powers of r. Covalent interactions and hydrogen bonds are strong, and very energetically significant at
short distances. Charge-charge interactions have some of the longest-range effects; electrostatic effects
on protein activity have been experimentally shown at over 15-angstrom distance, a substantial range in
molecular terms. A concentration of charges on a protein surface can create a powerful electrostatic
steering effect that can attract ligand molecules or other proteins at even longer range. Hydrogen bonds
and charge-dipole interactions are also relatively strong. The effects of these interactions are modeled
by computing electrostatic potentials and using the computed potentials as the basis for calculating
other molecular properties such as binding constants (via Brownian dynamics) or pKa values.
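To get a feel for what the exponents in Table 9-1 mean in practice, it helps to ask how much of an interaction survives when the distance doubles. The following sketch does that arithmetic; the function name and the choice of 3 and 6 angstroms as example distances are ours.

```python
def relative_strength(power, r, r_ref):
    """Strength at distance r relative to the strength at r_ref,
    for an interaction that scales as 1/r**power."""
    return (r_ref / r) ** power

# Doubling the separation from 3 to 6 angstroms:
examples = [
    ("charge-charge", 1),
    ("fixed dipole-fixed dipole", 3),
    ("rotating dipole-rotating dipole", 6),
]
for name, power in examples:
    fraction = relative_strength(power, r=6.0, r_ref=3.0)
    print(f"{name:32s} 1/r^{power}: {fraction:.4f} of original strength")
```

A charge-charge interaction keeps half its strength over that doubling, while an interaction that scales with 1/r^6 keeps less than 2%, which is why the latter matters only between atoms that are nearly in contact.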

On the other hand, interactions between noncharged and nonpolar atoms are very weak and effective
only at short range. However, the effects of these interactions can be cumulative, stabilizing structure
and making intermolecular associations more favorable. The effects of these interactions are addressed
when you compute the size of intermolecular contact surfaces or enumerate interactions between
neighboring atoms in a protein. In the remainder of this chapter, we discuss various methods for
measuring and evaluating atomic structures of proteins, all of which can be used together to add to your
understanding of protein chemistry.

9.3 Web-Based Protein Structure Tools
Now that we've reviewed the basics of protein chemistry, let's turn our attention to the tools. The most
important source of information about protein structure is the PDB. In addition to being an entry point
to the structural data itself, the PDB web site (http://www.rcsb.org/pdb) contains links to many tools
you can apply to individual protein structures as you search the database. Information from
the database is made available through the Protein Structure Explorer interface. For each protein, you
can view the molecular structure using 3D display tools such as RasMol and the Java QuickPDB
viewer. PDB files and file headers can be viewed as HTML and downloaded in a variety of formats.
Links to the protein structure classification databases CATH, FSSP, and SCOP are provided, along
with the tools CE and VAST, which search for structures based on structural alignment. Average
geometric properties, including dihedral angles, bond angles, and bond lengths can be displayed in
tabular format with extremes and deviations noted. Sequences can be viewed and labeled according to
secondary structure, and sequence information downloaded in FASTA format.

You can go directly to the page for a particular protein of interest by entering that protein's
four-character PDB code (for example, 1MBN) in the Explore box on the PDB's main page. The PDB can also be searched using two
different search tools, SearchLite and SearchFields. SearchLite is a simple search tool that allows you
to enter one or more search terms separated by boolean operators into a single search field.
SearchFields is a tool for advanced searches that provides a customizable search form that allows you
to use separate keywords to search each PDB header field. You can modify the form by selecting
checkboxes at the bottom of the form and regenerating the form. SearchFields supports options for
searching a dozen of the most important fields in the PDB header, as well as crystallographic
information. SearchFields also allows the database to be searched using FASTA for sequence
comparison, as well as secondary structure features or short sequence features.

From the individual protein page generated by the Structure Explorer, the PDB provides a menu of
links through which to connect to other tools. These features are still evolving rapidly. Table 9-2
provides a brief overview of the PDB protein page. We also encourage you to explore the PDB site
regularly if you are interested in tools for protein structure analysis.

Table 9-2. PDB Summary Information
Page                   Description
Summary page           The Summary page shows important information from the PDB header, as well as
                       the chain composition of the protein and chemical information about any ligands
                       and cofactors.
View Structure         The View Structure page provides links to everything from static images to
                       interactive protein views using VRML, RasMol, and the PDB's Protein Explorer
                       tool.
Download/Display File  The Download page offers several options for downloading individual protein
                       structures and headers in both classic PDB format and the new mmCIF format.
Structural Neighbors   The Structural Neighbors page links to manually curated protein classification
                       databases, such as SCOP and CATH, as well as the automated protein structure
                       comparison tools CE and VAST.
Geometry               The Geometry page provides tabular views of bond length, bond angle, and
                       dihedral angle data.
Other Sources          The Other Sources page is a rich catalog of links for each protein to everything
                       from its SWISS-PROT accession code to literature references describing the
                       structure. From this page, you can generate everything from domain analyses to
                       structural quality reports to searches of genome catalogs and the NCBI Taxonomy
                       database.
Sequence Details       The Sequence Details page shows the sequence of the protein and the location of
                       its secondary structure features, as extracted from the crystallographic data.
                       The sequences of the individual protein chains in a PDB entry are also available
                       for download in FASTA format.

We'll discuss the specifics of some of the tools linked from the PDB web site in the upcoming sections.
Again, as with any web-based tool, it's a good idea to learn as much as you can about the underlying
algorithms before basing any conclusions on their results. Just because a method is endorsed by the
PDB doesn't mean that it's 100% foolproof, or that you can interpret results without understanding the
method.

9.4 Structure Visualization
Structure visualization was one of the first tools developed for structure analysis, and it is probably
one of the first analyses you will want to do. Protein structure data is stored as collections of x, y, z
coordinates, but proteins can't be visualized simply by plotting those points. The connectivity between
atoms in proteins has to be taken into account, and for the visualization to be effective, a virtual 3D
environment, which provides the illusion of depth, needs to be created. Fortunately, all this was worked
out in the 1970s and 1980s, and there are now a variety of free and commercial structure visualization
tools available for every operating system.
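The connectivity problem is easy to see in a few lines of code. A viewer that is handed bare coordinates has to decide which atoms to join with lines, and one simple approach is to connect any two atoms that lie closer together than a covalent bond length allows. The sketch below assumes a cutoff of 1.9 angstroms and uses a hypothetical four-atom backbone fragment; real viewers use more careful, element-specific rules.

```python
import math
from itertools import combinations

def infer_bonds(atoms, cutoff=1.9):
    """Return index pairs of atoms separated by no more than `cutoff` angstroms.

    `atoms` is a list of (name, (x, y, z)) tuples. The cutoff is an assumed
    value that suits bonds between C, N, and O; it is not element-aware.
    """
    bonds = []
    for (i, (_, pos_i)), (j, (_, pos_j)) in combinations(enumerate(atoms), 2):
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Hypothetical backbone fragment, coordinates in angstroms
atoms = [
    ("N",  (0.000, 0.000, 0.000)),
    ("CA", (1.458, 0.000, 0.000)),
    ("C",  (2.009, 1.420, 0.000)),
    ("O",  (1.251, 2.390, 0.000)),
]

for i, j in infer_bonds(atoms):
    print(atoms[i][0], "-", atoms[j][0])
```

Only the three covalently bonded neighbors are joined; pairs such as N and C, which are about 2.5 angstroms apart, are left unconnected.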

Even with virtual 3D representation, protein structures are so complex that they are difficult to interpret
visually. The human eye can interpret 3D solids, but has a difficult time with topologically complex 3D
data sets. There are a number of conventional simplified representations of protein structure that allow
you to see the overall topology of the protein without the confusion of atomic detail. In order to be
useful, a protein structure visualization program needs to, at minimum, be able to display user-selected
subsets of atoms with correct connectivity, draw standard cartoon representations of proteins such as
ribbons and cylinders, and recolor subsets of a molecule according to a specified parameter.

9.4.1 Molecular Structure Viewers for Your Web Browser

One type of molecular structure viewer is a lightweight application that can be set up to work with
your web browser. When properly configured, these viewers display molecular data as you access it on the
Web. RasMol and Cn3D are two of the most popular viewers.

RasMol

One of the most popular molecular structure visualization tools is RasMol. It is available for a
wide range of operating systems, and it reads molecular structure files in the standard PDB format.
RasMol 2.7.1, the most up-to-date version, can be downloaded from Bernstein and Sons
(http://www.bernstein-plus-sons.com). Either source code or precompiled binary distributions can be
downloaded.

RasMol comes in three display depths: 8-, 16-, and 32-bit. Eight-bit is the default, but if you have a
high-resolution monitor, you may have to experiment and find out which executable is right for your
system. You'll know you have a problem when you try to run RasMol and it complains that no
appropriate display has been detected. Start with the 8-bit version, and work your way up.

If you plan to compile RasMol yourself, you need to get into the src directory and edit the Makefile to
produce the appropriate version. To do this, open the Makefile with an ASCII text editor such as vim or
Emacs and search for the variable DEPTHDEF. You should find something like this:


In this example, DEPTHDEF has been defined as 16-bit.

The # character at the beginning of a line marks that line as a comment, which isn't read by the make
program when it scans the Makefile. Lines of code can be skipped over by being commented out; that
is, marked as a comment. Remove the # character in front of the depth definition you need to use, and
add it to comment out the others. Comment characters vary from programming language to
programming language, but the notion of a comment line is common to all standard languages.
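If you switch depths often, the commenting and uncommenting can be scripted. The sketch below shows the idea on a sample string rather than a real file. The sample text, including the -DEIGHTBIT style macro names, is an assumption about what the DEPTHDEF lines in your Makefile look like, and the function name is our own; check your copy of the Makefile before relying on it.

```python
import re

def select_depth(makefile_text, wanted):
    """Uncomment the DEPTHDEF line that mentions `wanted`; comment out the rest.

    Lines that don't define DEPTHDEF are passed through unchanged.
    """
    out = []
    for line in makefile_text.splitlines():
        stripped = line.lstrip("# \t")  # drop any leading comment marker
        if re.match(r"DEPTHDEF\s*=", stripped):
            line = stripped if wanted in stripped else "# " + stripped
        out.append(line)
    return "\n".join(out)

# Assumed Makefile excerpt; the macro names are illustrative only.
example = """\
# DEPTHDEF = -DEIGHTBIT
DEPTHDEF = -DSIXTEENBIT
# DEPTHDEF = -DTHIRTYTWOBIT
CC = gcc
"""

print(select_depth(example, "THIRTYTWOBIT"))
```

After the call, only the 32-bit definition is left active, and the other two are commented out.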

You may also need to edit the rasmol.h file, according to the install instructions.

Once you have the proper RasMol executable, whether you download it or compile it yourself, you
need to copy it into /usr/local/bin and copy the file rasmol.hlp into the directory /usr/local/lib/rasmol.
Then, in your web browser's preferences, you need to add RasMol as an application. If you're using
Netscape, the default browser on most Linux systems, go to the Preferences > Navigator > Applications
menu, select New, and enter the following values into the dialog box:

Description: Brookhaven PDB
MIMEType: chemical/x-pdb
Suffixes: .pdb
Application: /usr/local/bin/rasmol

You may also want to create a second entry for the MIME type chemical/x-ras.

When run from the command line, RasMol opens a single graphics display window with a black
background. The molecule can be rotated in this window either directly with the mouse, or with the
sliders on the bottom and right side of the window. This window has five pulldown menus. The File
menu contains commands for opening molecular structure files. The Display menu contains commands
for changing the molecular display style to formats including ball and stick, cartoons, and spacefill.
These display commands execute quickly, so you can try each of them out to see the different standard
molecular display formats. The Colours menu allows you to change the color scheme of the entire
molecule, and the Options menu changes the display style, allowing you to display the molecule in
stereo, turn the display of heteroatom groups or labels on and off, etc. The Export menu allows you to
write the displayed image in common electronic image formats such as GIF, PostScript, and PPM,
which can be edited later using standard image manipulation programs that come with most Linux
distributions, such as GIMP.

When you import or save files in RasMol, you do it from the RasMol command line. In the shell
window from which you start RasMol, the command prompt changes to RasMol >. Enter help
commands at this command prompt to see the full range of RasMol commands, including commands
for selecting subsets of atoms. If RasMol complains that it can't find its help file, create a symbolic link
to /usr/local/lib/rasmol/rasmol.hlp in the directory in which you installed RasMol and/or the directory
in which you are running it. RasMol commands allow you to create your own combinations of colors and
structure display formats, including some not available from the menus; create interatomic distance
monitors; and display some intermolecular interactions, such as hydrogen bonds and disulfide bridges.

Cn3D

Cn3D is an application from NCBI that can view protein structure files in NCBI ASN.1 format. If you
use the NCBI databases frequently, you will also want to install this tool and set it up to work as an
application in your browser.

To install Cn3D on a Linux workstation and set it up as a browser application, you simply need to
download the Cn3D archive from NCBI, make a Cn3D directory on your own machine, move the
archive into that directory, and extract it.

Then, in your web browser's application preferences, make the following new entry:

Description: NCBI ASN.1
MIMEType: chemical/ncbi-asn1-binary
Suffixes: .prt
Application: /usr/local/cn3d/Cn3D

Cn3D opens two windows: a color structure viewer, in which a molecule can be rotated, colored
according to different properties, and rendered in different display formats; and a sequence viewer, which
allows you to view sequences and alignments corresponding to the displayed protein and to add
graphics to the sequence display to highlight the location of secondary structure features.

SWISS-PDBViewer

The SWISS-PDBViewer is a relatively new 3D structure display and analysis tool that complements
the services offered by the Swiss Institute of Bioinformatics. It can be used to prepare input for
homology modeling using the SWISS-Model web server. However, it is also useful as a standalone
visualization tool. The viewer incorporates many useful functions, including superimposition of
structures, calculation of molecular surfaces and electrostatic potentials, high-quality rendering,
analysis of torsion angles, creation of mutations to the structure, and much more. At the time of this
writing, SWISS-PDBViewer is in a phase of rapid development; if interested, you should check the
Swiss Institute of Bioinformatics web site for the current version and online documentation.

9.4.2 Standalone Modeling Packages

Heavy-duty molecular structure viewers tend to have many more features than browser-linked viewers such as
RasMol and Cn3D. The most popular examples are MolMol, MidasPlus, and VMD. These programs
run on your desktop machine, and to use them you need copies of the PDB files you're interested in
stored on your computer.

MolMol

If you have Cn3D and RasMol linked to your web browser, you are well-equipped to view any
molecular structure on the fly. However, there are times when you need to do more extensive
manipulations of a molecular structure. MolMol is a full-featured molecular structure visualization
package that allows you to display molecules, edit structures, and compute molecular properties.

You run the MolMol program by issuing the command molmol from the command line. There are no
command-line options. The program opens with one large window with a white background, and a
separate smaller window, which contains sliders for x, y and z rotation and for changing depth and
position of the clipping plane. The clipping plane controls the simulated depth of the display window
and the point at which the display window intersects the molecular structure. Atom selection options
are controlled from the menu bar to the right of the main window.

Like RasMol, MolMol has pulldown menus, but all its options are available from the pulldown menus,
and there are substantially more of them. MolMol has a complete manual, which is distributed along
with the software in HTML and several printable formats, so we will not discuss each command here
in detail. Some MolMol features you may find useful, in addition to the standard molecular display
functions, are the display of Ramachandran plots and contact maps, calculation and display of
macromolecular surfaces, and display of qualitatively accurate electrostatic potentials.

MolMol is available as a binary distribution from ETH Zurich and is simple to install on a Linux
workstation. Follow the directions provided, and you can't go wrong. While the MolMol interface isn't
quite as slick as that of a commercial product like MSI's Quanta, it is an amazing value for the price. A
couple of general tips: be sure to close dialog boxes and windows by clicking on their OK buttons or by
selecting Quit from the menus, rather than by clicking the Kill Window button at the top-right corner.
If the program seems to need to take its time to do something, don't click a lot of extra buttons or try to
force it to close down; just wait. This will keep the program from hanging up your machine.

MidasPlus

MidasPlus is a near commercial-quality molecular modeling package available from the University of
California at San Francisco. It provides many standard molecular display functions, as well as tools for
measurement, limited modeling capabilities (for instance, the ability to substitute amino acids in the
structure), and computation of molecular surfaces and electrostatics. The MidasPlus source code and
executables for various platforms, including some Linux systems, are available from UCSF for a
licensing fee of $350, much less than comparable commercial software packages. Your Linux
workstation must be equipped with a good-quality 3D graphics card in order to support MidasPlus.

VMD

Another excellent package for creating molecular graphics is VMD, the Visual Molecular Dynamics
program from the Theoretical Biophysics group at the University of Illinois. VMD was designed to
visualize and animate trajectories from molecular dynamics simulations, but it can also produce quite
nice visualizations of single molecules. VMD is available for Linux systems and has an easy-to-use,
menu-driven graphical user interface.

9.4.3 Creating High-Quality Graphics with MolScript

Usage: molscript -in infile -[options] -out outfile
Usage: molauto -[options] infile > outfile

MolScript has a completely different purpose from the other visualization packages we have discussed.
It is designed to produce high-quality graphics for print publication, as you can see in Figure 9-10. It
can be configured to run from the command line and to produce PostScript, Raster3D, and VRML
output only; it can also be configured to run interactively in its own window, using OpenGL, and to
produce output in many additional image file formats. [1]

[1] The image in Figure 9-10 was contributed by Per J. Kraulis, from "MOLSCRIPT: A Program to Produce Both Detailed and Schematic Plots of Protein Structures,"
Journal of Applied Crystallography (1991), vol. 24, pp. 946-950.

Figure 9-10. A sample image generated by molscript

Setting up interactive MolScript with OpenGL on a Linux workstation isn't straightforward; it
requires the installation of Mesa (open source OpenGL) libraries and customization of the Makefile that
comes with the distribution. However, the basic MolScript installation is quick and simple and can
produce visually appealing line drawings of molecular structure cartoons in color or black and white, in
a style that is uniquely elegant and appropriate for print media. To install the basic version of
MolScript, simply follow the directions in the install file. Copy the resulting executables (molscript and
molauto) to your /usr/local/bin directory or to another directory in your default path. Here's what
molscript and molauto do:

molscript

The main MolScript program; generates images

molauto

The MolScript setup program; automatically generates a rudimentary MolScript input file from
an input PDB file

MolScript takes two input files: a MolScript command file and a PDB coordinate file. Here's the
MolScript input file that produced the images in Figure 9-10:

! MolScript v2.1 input file
! generated by MolAuto v1.1.1

plot

read mol "1MBN.pdb";
transform atom * by centre position atom *;
set segments 2;
set planecolour hsb 0.6667 1 1;
coil from 1 to 3;
set planecolour hsb 0.619 1 1;
helix from 3 to 18;
set planecolour hsb 0.5714 1 1;
coil from 18 to 20;
set planecolour hsb 0.5238 1 1;
helix from 20 to 35;
coil from 94 to 100;
set planecolour hsb 0.1429 1 1;
helix from 100 to 118;
set planecolour hsb 0.09524 1 1;
coil from 118 to 125;
set planecolour hsb 0.04762 1 1;
helix from 125 to 148;
set planecolour hsb 0 1 1;
coil from 148 to 153;

set colourparts on;
bonds in require residue 1 and type HEM;

end_plot

The MolScript scripting language is unique and not really based on any standard computer language.
The only way to learn it is to decide what you want to do, study the manual and examples, and learn the
language. The example just shown is a simple MolScript command file; it reads in a single molecule,
centers it on the molecule's center of mass, defines the locations of the various secondary structure
elements, and shades them through the spectrum from blue at the start of the chain to red at the end. MolScript can produce much more
complex figures than this, however. MolScript plots can be scaled and multiple plots shown on a single
page. Subsets of atoms in the molecule can be turned on, displayed in different formats, and custom
colored. Labels can be added to figures.

Fortunately, the molauto program automatically produces simple input files for the molscript program,
which can help you get started using the MolScript command language. molauto does the most tedious
part of input file setup for you: assigning helix, sheet, or coil drawing styles, and colors, to each
segment of secondary structure. molauto has a variety of command-line options, which you can list
by entering molauto -h. molauto reads input in the standard PDB file format and writes to standard
output, so redirect the output to a file if you want to keep it.
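The two programs are normally run one after the other. The sketch below builds the two command lines from the usage lines shown at the start of this section. It only assembles the strings and doesn't run anything; the function name and the example filenames are our own.

```python
import shlex

def molscript_commands(pdb_file, script_file, image_file):
    """Build the two command lines for a basic MolScript run.

    Step 1: molauto reads the PDB file and writes an input script to stdout.
    Step 2: molscript reads that script and writes the image.
    The -in and -out flags follow the usage line given in the text.
    """
    pdb, script, image = (shlex.quote(f) for f in (pdb_file, script_file, image_file))
    return [
        f"molauto {pdb} > {script}",
        f"molscript -in {script} -out {image}",
    ]

for command in molscript_commands("1MBN.pdb", "1MBN.in", "1MBN.ps"):
    print(command)
```

You can paste the printed commands into a shell, or edit the script file between the two steps to adjust colors and views before rendering.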

The following are some of the most useful command line options for molauto:


Reads secondary structure assignments from the PDB file


Uses hydrogen bonding patterns to assign secondary structure


Uses cylinders to indicate alpha helices


Renders cofactor molecules using a ball-and-stick representation


Leaves out the coloring commands


Improves the quality of the rendering, using more colors and segments

The output of the molauto program is an input for the main molscript program. Command-line options
for molscript include:


Produces PostScript output


Produces VRML output

-size width height

Changes the size of the output image

The default input files produced by molauto can be hand-edited to produce various effects. One
important thing you might want to do (and can't do automatically unless you have installed the
MolScript package with OpenGL support) is to rotate the molecular structure until you achieve a good
view.

To rotate the molecule view using the noninteractive version of molscript, add the following lines to
your molscript input file, replacing the line that currently reads:

transform atom * by centre position atom *;

with:

transform atom * by centre position in amino-acids
by rotation x 0.0
by rotation y 0.0
by rotation z 0.0
; !Be sure to include this semicolon.

After you generate your first version of the image, open it in a fast PostScript viewer such as gv. To
change the view of the molecule, experiment with changing the values of x, y, and z rotation in your
input file. Since molscript takes only seconds to run on any protein input file, you can make changes to
the input file, save the file, and redisplay the new output several times until you like the view.
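Because finding a good view is a matter of trial and error, it can be convenient to generate the transform statement rather than retype it. This sketch writes out the same statement shown above with whatever angles you supply; the function name is hypothetical, and you still need to paste the result into your input file in place of the original transform line.

```python
def transform_block(rot_x=0.0, rot_y=0.0, rot_z=0.0):
    """Return a MolScript transform statement with the given rotations (degrees).

    The statement mirrors the one shown in the text; the closing
    semicolon ends the whole multi-line statement.
    """
    lines = [
        "transform atom * by centre position in amino-acids",
        f"  by rotation x {rot_x:.1f}",
        f"  by rotation y {rot_y:.1f}",
        f"  by rotation z {rot_z:.1f};",
    ]
    return "\n".join(lines)

print(transform_block(rot_x=90, rot_z=-45))
```

Looping over a handful of angle combinations and rendering each one gives you a contact sheet of candidate views to choose from.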

Once generated, the molscript image file can be viewed, converted to other file formats, and edited
using standard Unix image-manipulation tools. One program you can load when you install most major
Linux distributions is GIMP, the freeware package similar to Adobe Photoshop.

9.4.4 Active Site Visualization with LIGPLOT

Usage: ligplot protein.pdb resid resid chain

Another useful tool for producing graphics for publication is the program LIGPLOT
(http://www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html), which is available from the Structure and
Modelling group at University College London (UCL). Given a molecular structure and a specific
residue or heteroatom group within the structure as input, LIGPLOT automatically generates a 2D
schematic drawing showing hydrogen bonds, interatomic contacts, and solvent accessibility. A sample
LIGPLOT diagram is shown in Figure 9-11.

Figure 9-11. A schematic diagram of ligands to the heme cofactor in cytochrome B5, generated with
LIGPLOT

To install LIGPLOT on a Linux workstation, simply follow the directions in the README file.

In order for LIGPLOT to find its parameter files and helper programs correctly, you need to add some
path information to your .cshrc file:

setenv ligdir /usr/local/ligplot
alias ligplot $ligdir'/ligplot.scr'
alias ligonly $ligdir'/ligonly.scr'
alias dimplot $ligdir'/dimplot.scr'
alias dimonly $ligdir'/dimonly.scr'
setenv hbdir /usr/local/hbplus
alias hbplus $hbdir'/hbplus'

The values on the command line specify a residue range in a particular protein chain. The program
doesn't have to display only interactions with ligands and prosthetic groups; it can also display the
network of close interactions with any residue in a protein. This works best when the residue range
selected is small.

9.4.5 dimplot

Usage: dimplot protein.pdb chain1 chain2
Usage: dimplot protein.pdb -d domain1 domain2

The dimplot program, a variant of LIGPLOT, displays interactions across an interface between two
protein chains or domains. The domain variant works only if your PDB file labels proteins at the
domain level of organization.

The painful part of installing the LIGPLOT, hbplus, and naccess programs on some Linux systems is,
ironically, not the installation itself, but having the capability to decrypt the encrypted archives you get
from UCL. The files are encrypted using the standard Unix crypt command. This sounds
straightforward enough, but many Linux vendors don't include crypt in their distributions. In order to
use crypt on your system, you may in fact need to reinstall the latest version of glibc-2.0. If you don't
want to deal with this, request a decrypted copy of the LIGPLOT tar archive from the authors when
you send in your license agreement.

9.5 Structure Classification
Protein structure classification is important because it gives you an entry point into the world of protein
structure that is independent of sequence similarity. Proteins are grouped not by functional families, but
according to what kind of secondary structure (alpha helix, beta sheet, or both) they have. Within those
larger classes, subclasses are defined based on how the secondary structures in the protein are arranged.

The focus in protein classification is on finding proteins that have similar chemical architectures; it
doesn't matter if their sequences are related. Over the years, we've learned from classification that there
are far fewer unique protein folds than there are protein sequence families. Protein chemists often are
interested in the information that can be extracted from broader structural classes of proteins, since
analyzing that information can help them better understand how proteins fold.

Classification of protein structures into families is a nontrivial task. Proteins have many levels of
structure: the primary structure, which is the 1D sequence; the secondary structure, which is composed
of the regular substructures that the protein polymer forms due to steric and hydrogen bond
interactions; the tertiary structure, which is the overall 3D structure of the protein; and the quaternary
structure, which is the arrangement of multiple chains into a larger complex. For proteins made up of
more than one chain, the quaternary structure is required to form a functional protein. Structure
classification involves developing a
representation of how units of secondary structure come together to form domains, which are compact
regions of structure within the larger protein structure. Dividing proteins into domains is another aspect
of structure classification.

There isn't really a consensus as to how to classify protein structures quantitatively. Instead, structures
end up in qualitatively named classes such as "greek key," "helix bundle," and "alpha-beta barrel."
These fold classes are useful in that they draw attention to prominent structural features and create a
frame of reference for classifying structure. However, qualitative classifications don't lend themselves
to automated analysis, and such protein classification databases still require the involvement of expert
curators.

If you're simply concerned with finding the close structural relatives of a published protein structure,
there are a number of online classification databases in which existing structures have been annotated
by a combination of automated analysis and input from protein structure experts. There are also
automated tools for finding structural neighbors by structure alignment, though like any alignment
method, these tools require you to understand the significance of comparison scores when analyzing
the results.

If you're interested in doing your own analysis of a protein structure, there are several structure
classification processes and tools that might help.

9.5.1 Secondary Structure from Coordinates

Protein coordinate data sets don't automatically come labeled with alpha-helix and beta-sheet
classifiers. Secondary structure features in the protein can be distinguished with reasonable certainty by
their hydrogen bonding patterns and their backbone torsion angles.
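
As a rough illustration of the torsion-angle side of this analysis, the following Python sketch computes the dihedral angle defined by four atom positions. Applied to the backbone atoms C(i-1), N, CA, C it gives phi, and applied to N, CA, C, N(i+1) it gives psi. The function name and the sample coordinates are made up for this example, and this is not the code used by any particular secondary structure program.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Return the torsion angle, in degrees, defined by four 3D points."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def scale(a, s):
        return (a[0] * s, a[1] * s, a[2] * s)

    b0 = sub(p0, p1)          # bond pointing back from the central bond
    b1 = sub(p2, p1)          # central bond, the rotation axis
    b2 = sub(p3, p2)          # bond pointing forward from the central bond

    length = math.sqrt(dot(b1, b1))
    b1 = scale(b1, 1.0 / length)

    # Project the outer bonds onto the plane perpendicular to the axis.
    v = sub(b0, scale(b1, dot(b0, b1)))
    w = sub(b2, scale(b1, dot(b2, b1)))

    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the fourth point is rotated a quarter turn
# about the central bond relative to the first point.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

Helical residues cluster in one region of phi/psi space and beta-strand residues in another, which is why these angles, together with hydrogen bonding, are enough to label most residues.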

The standard program for extracting secondary structure from coordinate data is the DSSP program. DSSP
analyzes the geometry and backbone hydrogen bonding partners of each residue in a known protein
structure, producing a tabular output that includes residue numbering, sequence, hydrogen bonding, and
geometry details. The DSSP database, and DSSP executables derived from the 1995 release of the
program, are available from the European Bioinformatics Institute (EBI); these executables may still
cause Y2K-related errors on some older Linux systems. Updated DSSP source code is available from
Gerrit Vriend at the Center for Molecular and Biomolecular Informatics at the University of
Nijmegen, Netherlands.

STRIDE

Usage: stride -[options] infile > outfile

An alternative to DSSP is the program STRIDE, offered in either web server or downloadable form at
the European Molecular Biology Laboratory (EMBL, http://embl-heidelberg.de/stride/stride.html/).
STRIDE compiles easily on a Linux machine. Create a directory for the program, move the tar archive
into the directory, and extract. Compile the program with make.

Command-line options for STRIDE include:

-M molscript file

Produces a simple MolScript input


Reports hydrogen bond information


Reports secondary structure assignments only

A complete list of commands can be viewed by running STRIDE with no command-line options.

STRIDE output is written as structured 78-character lines. The following example illustrates the
hydrogen bond information output format:

ACC ALA - 143 142 -> TYR - 146 145 3.3 107.8 125.8 58.5 76.9 1MBN
ACC ALA - 143 142 -> LYS - 147 146 3.2 154.3 113.4 0.1 43.4 1MBN
DNR ALA - 144 143 -> LYS - 140 139 3.0 153.6 109.9 16.4 27.2 1MBN
ACC ALA - 144 143 -> GLU - 148 147 3.0 160.3 109.4 11.6 6.4 1MBN
DNR LYS - 145 144 -> ASP - 141 140 3.2 145.3 119.5 3.7 73.8 1MBN
ACC LYS - 145 144 -> LEU - 149 148 3.0 149.4 128.8 4.7 63.7 1MBN
DNR TYR - 146 145 -> ILE - 142 141 3.2 158.7 121.8 20.1 52.6 1MBN
DNR TYR - 146 145 -> ALA - 143 142 3.3 107.8 125.8 58.5 76.9 1MBN
ACC TYR - 146 145 -> GLY - 150 149 3.0 156.9 96.3 37.1 37.7 1MBN
ACC TYR - 146 145 -> TYR - 151 150 3.1 111.2 118.0 4.2 89.9 1MBN
DNR LYS - 147 146 -> ALA - 143 142 3.2 154.3 113.4 0.1 43.4 1MBN
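
If you want to process these records in a script, a minimal Python sketch along the following lines may be enough. It assumes that the fields are separated by whitespace exactly as in the sample above. The field names are our own labels, not STRIDE's: in particular, treating the first floating-point value as a distance and the next four as angles, and keeping both residue numbers as printed, are assumptions you should check against the STRIDE documentation.

```python
from typing import NamedTuple, Tuple

class HBondRecord(NamedTuple):
    role: str                          # "ACC" or "DNR", as printed
    residue: str                       # three-letter residue name
    chain: str                         # chain identifier ("-" in the sample)
    numbers: Tuple[int, int]           # both residue numbers, as printed
    partner_residue: str
    partner_chain: str
    partner_numbers: Tuple[int, int]
    distance: float                    # assumed: first numeric column
    angles: Tuple[float, ...]          # assumed: remaining four numeric columns
    pdb_code: str

def parse_hbond_line(line: str) -> HBondRecord:
    """Split one whitespace-separated hydrogen bond record into fields."""
    tokens = line.split()
    if len(tokens) != 16 or tokens[5] != "->":
        raise ValueError("unexpected record layout: %r" % line)
    return HBondRecord(
        role=tokens[0],
        residue=tokens[1],
        chain=tokens[2],
        numbers=(int(tokens[3]), int(tokens[4])),
        partner_residue=tokens[6],
        partner_chain=tokens[7],
        partner_numbers=(int(tokens[8]), int(tokens[9])),
        distance=float(tokens[10]),
        angles=tuple(float(t) for t in tokens[11:15]),
        pdb_code=tokens[15],
    )

record = parse_hbond_line(
    "DNR ALA - 144 143 -> LYS - 140 139 3.0 153.6 109.9 16.4 27.2 1MBN"
)
print(record.residue, record.numbers[0], "->",
      record.partner_residue, record.partner_numbers[0])
```

Real STRIDE files are fixed-width, so a chain identifier or residue name that contains a space would break this whitespace-based split; slicing by column position is the safer approach for production use.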

The STRIDE source code is well constructed and documented. It's an excellent example of how
molecular geometry is analyzed. Each function, e.g., surface area calculation, torsion angle calculation,
etc., lives in its own separate program. If you want to understand many of the standard operations
involved in analyzing geometric properties of proteins, we highly recommend the STRIDE source code.

9.5.2 Topology Cartoons

Topology cartoons are a 2D notation for depicting the topological arrangement of secondary structural
elements in proteins. The cartoons can clarify the spatial relationships and connectivity between
secondary structure elements in a protein. These relationships may not be easily seen in a 3D structure,
even if only the structural backbone is displayed or a ribbon diagram is drawn. Software for generating
your own cartoons may be found on the Protein topology page, http://www.sander.embl-

Topology cartoons, as illustrated in Figure 9-12, represent each secondary structural unit as a shape.
Circles are helices, and triangles are beta strands. The beginning of the chain is marked with an N, the
end with a C. Each element has a directionality, which can be deduced from the way the connecting
segment is drawn. If the N-terminal connection is to the edge of the secondary structural element, that
element is directed out of the plane of the drawing; if the N-terminal connection is to the center of the
secondary structural element, it is directed back into the plane of the drawing.

Figure 9-12. A protein topology cartoon

TOPS

Usage: tops pdbcode

The TOPS program expects a file in DSSP format, generated from your protein of interest, as its input.

In order to compile the TOPS code on your own machine, you need Java support. Linux ports of Java
are available from IBM and Blackdown at http://blackdown.org. The Blackdown version requires that
you update to glibc2.1.2, but the IBM version installs easily under Red Hat 6.1 using GnoRPM (if you
download RPMs, of course). Once the IBM JRE and JDK are installed, TOPS installs without any
difficulty. To run the EditTOPS executable, which allows you to actually view and plot topology files,
be sure that these environment variables are set correctly:

PATH

Includes /usr/jdk118/bin (or wherever you installed Java)

CLASSPATH

Where you installed the TOPS classes (TOPS.jar)

TOPS_HOME

Where you installed TOPS

You can set these variables by writing a script called topssetup, which contains the following three
lines, and placing it in your home directory. Before you try to run TOPS or EditTOPS, use source
topssetup to set the environment variables correctly.

setenv PATH "/usr/sbin:/sbin:/usr/jdk118/bin:${PATH}:."
setenv CLASSPATH "/usr/local/Tops/classes/TOPS.jar:${CLASSPATH}"
setenv TOPS_HOME "/usr/local/Tops"

Topology patterns also have been implemented as data structures in web-based search tools that allow
you to compare topologies of two structures or to search a protein database for structures of similar
topology. These services are available from the EBI at http://www.ebi.ac.uk.

9.5.3 Classification Databases

Classification databases are taxonomies of protein structure, and they bear a strong resemblance to the
morphology-based taxonomies developed by early biologists. Proteins that "look" grossly the same, in
terms of shape and topology, are classified as more closely related than proteins that look substantially
different. Protein structure types have whimsical names (like Greek key beta barrel) based on visual
observation and comparison with familiar objects. The classification databases can be envisioned as
trees with many branchings at each branch point, very similar in concept to phylogenetic trees.

SCOP

The Structural Classification of Proteins (SCOP, http://scop.mrc-lmb.cam.ac.uk/scop/) is a database
maintained by the MRC Laboratory of Molecular Biology at Cambridge, United Kingdom. SCOP is
extensively hand-curated, and tends to lag at least several months behind the PDB in terms of its
content. SCOP is a simple, relatively low-tech resource composed of a hierarchy of HTML pages with
links to still pictures of individual proteins and folds, as well as embedded links to structure files to be
opened with RasMol or Chime plugins and links back to the PDB to download structures.

At the top level of SCOP, known proteins are generally grouped by their secondary structure
characteristics into all-alpha, all-beta, coiled coil, small proteins with structural metal ions, and various
types of mixed alpha-beta structures. These major types are called Classes within SCOP. The next layer
of classification, the Fold level, is a mixture of topology and similarity to domains of known function:
one fold can be called "globin-like" and the next "four helical up and down bundle." Beyond the Fold
level, proteins are divided further into Superfamilies and Families. Superfamily and Family divisions
may be purely functional, or they may also involve some structural difference.

CATH

CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new/) is similar to SCOP in concept, but it divides up
the PDB a little differently. In CATH, proteins are classified at the level of (C)lass, (A)rchitecture,
(T)opology, and (H)omologous superfamily. The CATH interface is easily navigated, and it is an
excellent resource for examining the variety of known protein structures. CATH can be searched by
PDB code, and proteins can be displayed within the browser page. The CATH maintainers provide an
excellent lexicon of protein structure description to give you a feel for the structural reality behind the
somewhat whimsical protein family names. At the time of this writing, the CATH web interface is
undergoing rapid revision and expansion of its capabilities, to include everything from structural
assignments of uncharacterized genes that may fit into CATH classes, to new levels of classification
hierarchy.

Unique protein structure data sets

The PDB is full of duplication. It's been estimated that out of the approximately 13,000 structures in the
PDB at this time, only around 1,000 of them actually represent unique folds. This lack of uniqueness
can bias predictive and analytical methods based on extraction of structural patterns and features from
the protein database. Thus, there is a need to produce nonredundant subsets of the PDB and to select,
from among groups of similar proteins, the best representative of each class. This is essentially a subset
of the classification problem, and for a long time it was done based on manual examination and
annotation of PDB data. But as the PDB has grown, automated methods for generating nonredundant
data sets based on sequence comparison have emerged.

The process for generating such data sets is fairly standard, although the particular parameters differ.
First, the PDB is culled to remove extremely short protein chains, chains of very poor resolution, and
chains containing a large number of nonstandard residues. The PDB is then decomposed into individual
chains, and the chains are sorted by various quality criteria. An all-against-all sequence comparison is
done, and chains that don't differ sufficiently to meet a certain cutoff are removed, choosing the lowest-
quality chain in a pair to be removed, until all the chains in the list meet the uniqueness criteria in a
pairwise comparison. Finally, the removed chains are reintroduced and added back to the set if they
don't violate the uniqueness criteria with any other chain in the final set.
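
The general shape of this procedure can be sketched in a few lines of Python. This is a simplified illustration, not the method used by any of the published data sets: it replaces the remove-then-reintroduce steps with a single greedy pass over chains sorted from best to worst, it uses resolution as the only quality criterion, and it uses difflib's SequenceMatcher ratio as a crude stand-in for an alignment-based percent identity. The chain identifiers, sequences, cutoff, and minimum length are all invented for the example.

```python
from difflib import SequenceMatcher

def similarity(seq_a, seq_b):
    """Crude stand-in for sequence identity, from 0.0 to 1.0."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()

def cull(chains, cutoff=0.9, min_length=5):
    """Pick a nonredundant subset of (chain_id, sequence, resolution) tuples."""
    # Step 1: discard extremely short chains.
    candidates = [c for c in chains if len(c[1]) >= min_length]

    # Step 2: sort by quality; a smaller resolution value is better.
    candidates.sort(key=lambda c: c[2])

    # Step 3: keep a chain only if it is sufficiently different from
    # every better-quality chain that has already been kept.
    kept = []
    for chain in candidates:
        if all(similarity(chain[1], k[1]) < cutoff for k in kept):
            kept.append(chain)
    return kept

# Hypothetical input: two identical chains, one unrelated, one very short.
chains = [
    ("1abcA", "MKTAYIAKQR", 1.8),
    ("2xyzA", "MKTAYIAKQR", 2.5),
    ("3defB", "GGSSPPLLWW", 2.0),
    ("4shtA", "MK", 1.0),
]
for chain_id, sequence, resolution in cull(chains):
    print(chain_id, resolution)
```

Because the chains are visited from best to worst, the lower-quality member of any redundant pair is the one that gets dropped, which matches the intent of the procedure described above.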

At this time, nonredundant data sets can be obtained from PDB Select at EMBL, from NCBI, and from
Dr. Roland Dunbrack at the Fox Chase Cancer Center. There is no software we know of that allows
you to create a unique data set based on your own choice of parameters, although the groups mentioned
may be willing to generate data sets by special request. A Perl script for creation of nonredundant
databases from a sequence DB, called nrdb90.pl, is also available from EBI; however, it's hardcoded to
produce a nonredundant set at the 90% sequence identity level. If you're intrepid, you can modify this
script for your own purposes.

9.6 Structural Alignment
Recently, there have been many attempts to make protein-structure classification an automatic and
quantitative process, rather than an expert-curated process. Overlaying and comparing structures is a
3D problem that is much more resource-intensive than comparing 1D sequence data. The automated
structure comparison tools that exist, therefore, are available primarily as online tools for searching
precomputed databases of structure comparisons.

9.6.1 Comparing Two Protein Structures

The most common parameter that expresses the difference between two protein structures is RMSD, or
root mean squared deviation, in atomic positions between the two structures. RMSD can be computed
as a function of all the atoms in a protein or as a function of some subset of the atoms, such as the
protein backbone or the alpha-carbon positions only. Using a subset of the protein atoms is common,
because it is likely that, when two protein structures are compared, they will not be identical to each
other in sequence, and therefore the only atoms between which one-to-one comparisons in position can
be made will be the backbone atoms.
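
For N paired atoms separated by distances d_1 ... d_N, the RMSD is the square root of (d_1^2 + ... + d_N^2) / N. The short Python sketch below computes it for two lists of coordinates that are already paired up atom for atom, such as the alpha-carbon positions of two aligned chains. The function name and the coordinates are ours, chosen only to illustrate the arithmetic, and the function performs no superposition of the two structures.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical alpha-carbon positions for two three-residue fragments.
fragment_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
fragment_2 = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(rmsd(fragment_1, fragment_2))
```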

This is the first context we've discussed in which the orientation of a molecular structure becomes
important. Because protein structures are generally described in Cartesian coordinates, they essentially
exist within a virtual space, and they come with a built-in orientation with respect to that space. RMSD
is a function of the distance between atoms in one structure and the same atoms in another structure.
Thus, if one molecule starts out in a different position with respect to the reference coordinate system
than the other molecule, the RMSD between the two proteins will be large whether they are similar or

