In order to compute meaningful RMSDs, the two structures under consideration must first be
superimposed, insofar as that is possible. Superimposition of protein structures usually starts with a
sequence comparison. The sequence comparison establishes the one-to-one relationships between pairs
of atoms from which the RMSD is computed. Atom-to-atom relationships, for the purpose of structure
comparison, may actually occur between residues that aren't in the same relative position in the amino
acid sequence. Sequence insertions and deletions can push two sequences out of register with each
other, while the core architecture of the two structures remains similar.

Once atom-to-atom relationships between two structures are established, the task of a superposition
program is to achieve an optimal superposition between the two structures; that is, the superposition
with the smallest possible RMSD. Because protein scaffolds, or cores, can be similar in topology
without being identical, it isn't usually possible to achieve perfect overlap in all pairs of atoms in two
structures that are being compared. Overlaying one pair of atoms perfectly may push another pair of
atoms further apart. Superposition algorithms optimize the orientation and spatial position of the two
molecules with respect to each other.
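In code, this optimization is commonly done with the Kabsch algorithm, which finds the least-squares rotation via a singular value decomposition. The following Python sketch (using numpy) assumes the atom-to-atom correspondences have already been established, so the two coordinate arrays are ordered pairs:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD of paired point sets P and Q (N x 3) after optimal superposition.

    Kabsch algorithm: center both sets, then find the rotation that
    minimizes the RMSD via SVD of the covariance matrix.
    """
    P = P - P.mean(axis=0)              # move both sets to a common origin
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)   # SVD of the 3x3 covariance matrix
    # Guard against an improper rotation (a reflection), if present
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt  # optimal rotation
    P_rot = P @ R
    return np.sqrt(((P_rot - Q) ** 2).sum() / len(P))
```

Applied to a structure and a rotated-and-translated copy of itself, the function returns an RMSD of essentially zero, as expected.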

Figure 9-13 shows an optimal alignment between atomic structures of triosephosphate isomerase and
beta-mannase, shown in Compare3D. The two structures are similar enough to be classified as
structural neighbors, and their chain traces are relatively similar. However, their sequence identity is
only 8.5%.

Figure 9-13. An optimal superposition of myoglobin and the alpha chain of hemoglobin, which are structural neighbors

Once optimal superpositions of all pairs of structures have been made, the RMSD values that are
computed as a result can be compared with each other, because the structures have been moved to the
same frame of reference before making the RMSD calculations.

ProFit

Usage: profit reference.pdb mobile.pdb

ProFit, developed by Andrew Martin at the University of Reading, United Kingdom, is an easy-to-use
program for superimposing two protein structures. One protein is assigned by the user to be the
reference structure, and the other protein is mobile with respect to the reference. ProFit outputs RMSD
and can also write out coordinates for the superimposed proteins. ProFit allows the option of
superimposing only selected regions of each protein so that domains can be examined independently.
ProFit compiles and runs on any Unix workstation and may be downloaded from Andrew Martin's web site.

9.6.2 DALI Domain Dictionary

The DALI Domain Dictionary (DDD) at the EBI is based on an automatic classification of protein
domains by structural similarity. Rather than using a human-designed classification scheme, DDD is
constructed by clustering protein neighbors within an abstract fold space. Instead of working with
whole proteins, DDD classifies structures based on compact, recurring structures (called domains) that
may repeat themselves within, and among, different protein structures. The content of DDD may also
be familiar to you as FSSP, the "Fold classification based on Structure-Structure alignment of Proteins" database.

DDD can be searched based on text keywords; it can also be viewed as a tree or a clickable graphical
representation of fold space. Views of sequence data for conserved domains are available through the
DDD interface, as well as connections to structural neighbors.

The superposition program (SUPPOS) that produces the structural alignments in DALI/FSSP is
available within the WHAT IF software package of protein structure analysis tools, which is discussed
in Section

9.6.3 CE and CL

The Combinatorial Extension of the Optimal Path (CE) is a sophisticated automatic structure alignment
algorithm that uses characteristics of local geometry to "seed" structural alignments and then joins
these regions of local similarity into an optimal path for the full alignment. Dynamic programming can
then optimize the alignment.

CE is available either as a web server or as source code from the San Diego Supercomputer Center.
The web server allows you to upload files for pairwise comparison to each other or to proteins in the
PDB, to compare a structure to all structures in the PDB, to compare a structure to a list of
representative chains, and to review alignments for specific protein families. CE also is fully integrated
with the PDB's web site, and CE searches can be initiated directly from the web page generated for any
protein you identify in a sequence search. Along with the source code, you can download a current,
precomputed pairwise comparison database containing all structures in the PDB. If you're doing only a
few comparisons, however, you probably won't even want to do this.

When using the CE server to compute similarities, there are several parameters that you can set,
including cutoffs for percent sequence identity, percent of the alignment spanned by gaps, and percent
length difference between two chains. You can also set an RMSD cutoff and a Z-score cutoff. The Z-score is a measure of the significance of an alignment relative to a random alignment, analogous to a BLAST E-value. A Z-score of 3.5 or above from CE usually indicates that two proteins have a similar fold.
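As an illustration of what such a statistic means (this sketches the general idea of a Z-score, not CE's actual scoring internals), the Z-score is simply the number of standard deviations by which an observed score exceeds the mean of scores from random alignments:

```python
def z_score(score, random_scores):
    """Standard score of an alignment relative to a background
    distribution of scores from random alignments."""
    n = len(random_scores)
    mean = sum(random_scores) / n
    var = sum((s - mean) ** 2 for s in random_scores) / n  # population variance
    return (score - mean) / var ** 0.5
```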

Along with CE, the SDSC offers the Compound Likeness (CL) server, a suite of tools for probabilistic
comparison between protein structures. In CL, you select either an entire protein structure or a structure
fragment to use as a probe for searching the PDB. Search features include bond length and angle
parameters, surface polarity and accessibility, dihedral angles, secondary structure, shape, and
predicted alpha helix and beta sheet coefficients. CL allows you to ask the question "what else is
chemically similar to this protein (or fragment) that is of interest to me?" and to define chemical
similarity very broadly. A full tutorial on CL is available at the CL web site (http://cl.sdsc.edu/cl1.html/).

9.6.4 VAST

VAST is a pairwise structural alignment tool offered by NCBI. VAST reports slightly different
parameters about structural comparison than CE does, and the underlying algorithm differs in
significant respects. However, the results tend to be quite similar. VAST searches automatically allow
you to view your superimposed protein structures in the Cn3D browser plug-in, with aligned sequences
displayed in Cn3D as well. For practical purposes, either CE or VAST is sufficient to give you an idea
of how two structures match up; if you are concerned about the algorithmic differences, both groups
provide access to detailed explanations at their sites. Unlike CE, the VAST software doesn't appear to
be available to download, so if you want to perform a large number of comparisons on your own
server, CE may be preferable.

9.7 Structure Analysis
Geometric analysis of protein structures serves two main purposes. It is useful for verifying the
chemical correctness of a protein structure, both as a means of deciding whether the structural model is
ready to be submitted to the PDB and for analyzing existing structures. Geometric analysis also allows
you to examine the internal contacts within a protein structure. Since protein function often depends on
the interactions of amino acids that aren't adjacent in the protein sequence, contact analysis can provide
insight into complex, nonsequential structural patterns in proteins.

9.7.1 Analyzing Structure Quality

Geometric analysis can show where a model developed from x-ray crystallography data or NMR data
violates the laws of chemistry. As mentioned earlier, there are physical laws governing intermolecular
interactions: nonbonded atoms can get only so close to each other because as two atoms are forced
together beyond the boundary set by their van der Waals radii, the energetics of the contact become
very unfavorable. These interactions limit not only the contacts between pairs of atoms in different
parts of a protein chain, but also how freely atoms can rotate around the bonds that connect them. The
structure of atomic orbitals and the nature of bonds between atoms place natural limits on the position
of bonded atoms with respect to each other, so bond angles and dihedral angles are, in practice,
restricted to a limited set of values. Tools for geometric analysis generally have been developed by
crystallographers to show where their structural models violate these laws of nature; they can also be
used by homology modelers or ab initio structure modelers to evaluate the quality of a structural model.
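As a small example of the kind of quantity such tools compute, the following Python function (a generic sketch, not taken from any particular package) computes a dihedral angle from four atomic positions; backbone phi and psi angles and sidechain chi angles are all defined this way:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral (torsion) angle in degrees defined by four points,
    e.g. the backbone atoms C(i-1), N, CA, C for a phi angle."""
    b1 = p1 - p0
    b2 = p2 - p1
    b3 = p3 - p2
    n1 = np.cross(b1, b2)                     # normal of the first plane
    n2 = np.cross(b2, b3)                     # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    x = n1 @ n2
    y = m1 @ n2
    return np.degrees(np.arctan2(y, x))       # signed angle in (-180, 180]
```

An eclipsed (cis) arrangement of the four points gives 0 degrees, and a trans arrangement gives 180 degrees.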

There are a variety of tools for analyzing structure quality. Some run as standalones; others are
incorporated into more comprehensive structure analysis and simulation packages. An exhaustive
listing of the best of these tools can be found on the PDB web site.

PROCHECK

PROCHECK (http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html) is a popular software
package for checking protein structure quality. It produces easily interpreted color PostScript plots describing a
protein structure and can also compare two related protein structures. It runs on Unix systems and has
also been ported to Windows.

Using PROCHECK requires that you set up an environment variable and several aliases in your .cshrc file. The settings you need are:

setenv prodir /usr/local/procheck
alias procheck '$prodir/procheck.scr'
alias procheck_comp '$prodir/procheck_comp.scr'
alias procheck_nmr '$prodir/procheck_nmr.scr'
alias proplot '$prodir/proplot.scr'
alias proplot_comp '$prodir/proplot_comp.scr'
alias proplot_nmr '$prodir/proplot_nmr.scr'
alias aquapro '$prodir/aquapro.scr'
alias gfac2pdb '$prodir/gfac2pdb.scr'
alias viol2pdb '$prodir/viol2pdb.scr'

The aliases are required by the various PROCHECK command scripts, so you can't run
PROCHECK simply by typing the full pathname of an individual module. When you run PROCHECK or
PROCOMP, the program you actually run is a command script that calls several other programs in turn.

PROCHECK can be set up to produce several different kinds of output, either in color or black and
white, by editing the procheck.prm file in the directory in which you are about to issue the procheck
command. The parameters are edited by changing Y to N or vice versa at points in the procheck.prm
file where those options are available. The file is self-documenting and easy to understand. The most
important part of the file, for reference, is probably the portion in which you turn on or off the various
types of plots that are available. The rest of the parameters in procheck.prm are mainly default color
values for different types of plots.

Colour all plots?
Y <- Produce all plots in colour (Y/N)?

Which plots to produce
Y <- 1. Ramachandran plot (Y/N)?
N <- 2. Gly & Pro Ramachandran plots (Y/N)?
N <- 3. Chi1-Chi2 plots (Y/N)?
N <- 4. Main-chain parameters (Y/N)?
N <- 5. Side-chain parameters (Y/N)?
N <- 6. Residue properties (Y/N)?
N <- 7. Main-chain bond length distributions (Y/N)?
N <- 8. Main-chain bond angle distributions (Y/N)?
N <- 9. RMS distances from planarity (Y/N)?
N <- 10. Distorted geometry plots (Y/N)?

Once you've edited the procheck.prm file to your satisfaction, run PROCHECK with the command
procheck filename.pdb [chain] resolution. The resolution parameter causes your protein to be
compared to a "reference protein of X angstrom resolution" in the PROCHECK output. This parameter
isn't optional. The command line for PROCOMP requires a second protein filename and chain ID
instead of the resolution parameter.

WHAT IF/WHAT CHECK

WHAT IF is a multifunctional menu-driven molecular modeling package developed by Gert Vriend
and now available through the University of Nijmegen. WHAT IF can calculate just about any property
of proteins we discuss in this chapter, from solvent accessible surface area to pKa values to contacts to
molecular dynamics using GROMOS. The full WHAT IF package is available to academic users for a
small fee, and it is known to compile and run well on Linux systems.

WHAT CHECK provides access to a subset of WHAT IF structural quality checks. WHAT CHECK
reports on stereochemistry, bond lengths, angles, and dihedrals, among other quantities. Complete
WHAT CHECK reports for any protein in the PDB can be downloaded from the PDBREPORTS
database at EMBL. WHAT CHECK also is available as part of the Biotech Validation Suite web server
at EMBL, for use on models and on structures not already deposited in the PDB.

9.7.2 Intramolecular Interactions

Geometric analysis can also be useful in understanding a protein's fold and function. In this case, the
geometry of interest isn't the chemical bonding interactions between atoms adjacent to each other in the
protein chain, but the nonbonded interactions between atoms that are widely separated in the protein
chain. The density of intramolecular contacts in the structural core of a domain may be quite different
from the density of contacts in a region between two structural domains. Measuring this density over
the whole protein may give clues as to the process by which a protein folds. The patterns of hydrogen
bonds that hold a protein together may serve as an identifying signature for a protein fold. And contacts
between certain chemically important residues in a protein may suggest hypotheses about the protein's
catalytic mechanism or function. Protein engineers may want to examine the intramolecular contacts in
a protein to determine where changes are least likely to disrupt the protein's structure.

Computing contacts with HBPLUS

Listings of intramolecular nonbonded interactions and hydrogen bonds can be computed using the
standalone program HBPLUS, available from the Biomolecular Structure group at UCL. Obtaining the
HBPLUS program and running it are straightforward, but because the results are produced as a single
long text file, they require some scripted postprocessing to become useful. The LPC-CSU (Ligand-
Protein Contacts/Contacts of Structural Units) server at the Weizmann Institute (Rehovot, Israel,
http://bioinfo.weizmann.ac.il:8500/oca-bin/lpccsu/) can produce textual reports of important intra- and
intermolecular contacts in any protein. Protein structures can be uploaded to the server from the user's
machine or found on the server using their PDB ID codes.

Contact mapping and display functionality also can be found within the WHAT IF package. Two-
dimensional contact maps are a standard feature of most molecular modeling packages. A 2D contact
map is simply a plot of pairwise interactions between residues, where residue number within the
protein is plotted on each axis and a dot (perhaps color-coded to indicate the contact distance) is drawn
wherever residue X and residue Y come into close contact. Contact maps have distinct patterns that can
help identify a protein's fold, and some efforts have even been made to predict contact maps for
proteins of unknown structure based on their sequences and predicted secondary structure features.
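A 2D contact map of this kind can be computed in a few lines. The Python sketch below uses C-alpha coordinates and an arbitrary 8-angstrom cutoff; cutoff choices and distance definitions vary from program to program:

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Boolean contact map from C-alpha coordinates (N x 3).

    Residues i and j are counted as "in contact" when their C-alpha
    atoms lie within `cutoff` angstroms.  The diagonal (i == j) is
    trivially True, since each residue is at zero distance from itself.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]   # pairwise vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))               # pairwise distances
    return dist < cutoff
```

The resulting symmetric matrix can be plotted directly, with residue number on each axis, to give the dot-plot views described above.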

9.8 Solvent Accessibility and Interactions
Solvent-accessible surface calculations help you figure out which chemical groups are on the surface of
a protein. Amino acids on the surface of a protein usually are the ones that determine how it interacts
with other molecules, such as chemical substrates, ligands, other proteins, and receptors. If you know
what the chemical surface of the protein looks like, you can use that information to help determine why
one molecule binds to another, why an enzyme is specific for a particular substrate, or how the protein
influences its environment in other ways.

Analytical shape calculations also help you describe the geometry of the protein surface. A lot of
biochemistry textbooks describe intermolecular interactions in terms of locks and keys. Molecules fit
together in geometrically specific ways, so the shape of the lock (e.g., the enzyme) has to complement
the shape of the key (the substrate). The shape of a receptor on the cell surface has to complement
the ligand it's supposed to respond to, or the cellular response isn't triggered. The immune system is a
good example. In the immune response, the organism produces antibodies that attack antigens of a
particular shape. This is why you can vaccinate an animal against a disease by injecting a sample of
killed virus. The killed virus is shaped just like the live, deadly virus, but it can't harm the animal.
Nonetheless, the animal develops antibodies that recognize the shape of the killed virus. Then, when
the live virus comes along and tries to invade, the animal already has antibodies that are the right shape
to attack the live virus.

So, for instance, if you want to design a new vaccine or engineer a protein that will carry out a
particular reaction, or understand how two proteins in a metabolic pathway interact with each other, it's
important to be able to measure the shape of the molecule.

The standard method for computing solvent accessibility is quite simple. Each atom in the molecular
structure is represented by a sphere; there is a different sphere radius for each distinct atom type. The
spheres surround the known atomic centers and are modeled by a collection of several hundred discrete
points. To determine the solvent-accessible surface of the protein, solvent-accessibility calculators
simulate a spherical "probe" with a radius equivalent to that of water (1.4 angstroms) rolling over the
surface of the atomic spheres. The path of the center of the probe determines the solvent-accessible
surface of the molecule. Because the probe (and hence, water molecules) can't fit into sharp crevices in
the molecular surface, the computed solvent accessible surface is much smoother than the underlying
molecular surface (Figure 9-14).

Figure 9-14. Determination of solvent accessibility by probe-rolling
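The probe-rolling idea is often implemented numerically in the style of Shrake and Rupley: dot the probe-expanded sphere around each atom with test points, and count the points that are not buried inside any neighboring atom's expanded sphere. The Python sketch below illustrates the principle only; real programs such as naccess use carefully chosen van der Waals radii and many more test points per atom:

```python
import numpy as np

def sphere_points(n=200):
    """Roughly uniform points on the unit sphere (golden-section spiral)."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1 - 2 * i / n)
    theta = np.pi * (1 + 5 ** 0.5) * i
    return np.c_[np.cos(theta) * np.sin(phi),
                 np.sin(theta) * np.sin(phi),
                 np.cos(phi)]

def accessible_area(centers, radii, probe=1.4, n_points=200):
    """Approximate solvent-accessible area per atom, Shrake-Rupley style.

    Each atom sphere is expanded by the probe radius (1.4 angstroms for
    water); a test point on the expanded sphere is accessible unless it
    falls inside a neighbor's expanded sphere.
    """
    pts = sphere_points(n_points)
    areas = []
    for i, (c, r) in enumerate(zip(centers, radii)):
        R = r + probe
        test = c + R * pts                     # points on the expanded sphere
        buried = np.zeros(n_points, bool)
        for j, (c2, r2) in enumerate(zip(centers, radii)):
            if j == i:
                continue
            buried |= ((test - c2) ** 2).sum(axis=1) < (r2 + probe) ** 2
        frac = 1.0 - buried.mean()             # fraction of points accessible
        areas.append(frac * 4 * np.pi * R * R)
    return np.array(areas)
```

For an isolated atom the result is simply the full area of the expanded sphere; bringing a second atom nearby buries some test points and reduces the accessible area, just as the rolling probe would.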

Because proteins are dynamic entities rather than the rigid bodies assumed by solvent-accessibility
calculations, it's likely that the interior of the molecule has more contact with solvent than can be
computed using a probe-rolling algorithm. However, solvent-accessibility calculations can help
develop an initial understanding of a protein molecule that will inform further experimentation.
Accessibility calculations are one way of getting at the complex physicochemical properties of a
protein; the nature of the protein surface affects its interaction with the surrounding media as well as
with other proteins or substrates.

9.8.1 Computing Solvent Accessibility with naccess

Usage: naccess pdb file [-p probe size] [-r vdw file] [-s stdfile] -[hwyc]

There are many programs for calculating solvent accessibility by probe-rolling. They are all
straightforward and easy to use, requiring a standard PDB file as input and usually giving output in the
form of a percentage of accessibility for each amino acid or atom in the protein. One popular program
is naccess, which is available from the Biomolecular Structure and Modelling group at UCL. naccess
can be used in combination with other programs developed by this group, such as HBPLUS (a program
for computing intermolecular interactions and hydrogen bonds) and LIGPLOT, which we covered
earlier. It also runs as a standalone. naccess is written in FORTRAN, so you'll need the g77 compiler
installed on your machine to compile it.

naccess outputs two files, an .asa file with accessible surface areas for each atom in the molecule and
an .rsa file with accessible surface areas and relative accessibilities for each amino acid. It handles both
protein and nucleic acid molecules and produces reports of accessibilities for individual molecular
chains as well as complete structures. The -h, -w, and -y flags cause the program to ignore hydrogen
atoms, water molecules, or heteroatoms, respectively. Run with the -c option, naccess produces
intermolecular contact areas rather than accessible areas.
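The relative accessibilities in the .rsa file express each residue's accessible area as a percentage of the area that residue type exposes in a fully extended reference state. A toy illustration of the calculation (the reference values below are made-up placeholders, not naccess's actual tables):

```python
# Hypothetical reference areas (square angstroms) for fully exposed
# residues; naccess ships its own table derived from extended tripeptides.
REFERENCE_ASA = {"ALA": 107.95, "GLY": 80.1, "LYS": 200.1}

def relative_accessibility(residue, asa):
    """Percent accessibility of a residue relative to its fully
    exposed reference state."""
    return 100.0 * asa / REFERENCE_ASA[residue]
```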

SURFNET is a program developed by Roman Laskowski at UCL to manipulate solvent-accessibility
information and display useful representations of surface features, clefts, cavities, and binding sites.
SURFNET generates surface output in formats that can be displayed in molecular visualization
programs, including RasMol.

9.8.2 Solvent Accessibility with Alpha Shapes

The Alpha Shapes software is a mathematically exact alternative to the standard Connolly surface
method of computing solvent accessibility. Developed by the research group of mathematician Herbert
Edelsbrunner at NCSA (http://www.alphashapes.org/alpha/), the Alpha Shapes software is a general-
purpose program for modeling the surfaces of objects. A set of extensions to the Alpha Shapes
software, specifically for analyzing protein molecules, is also available.

The Alpha Shapes method constructs what is called a simplicial complex or alpha complex of a
structure, based on a rigorous geometrical decomposition of the space surrounding the collection of
points that describes the structure. Once the alpha complex is constructed for a protein structure,
algorithms for inclusion and exclusion can describe exactly the surface area or volume of the structure
as well as cavities, clefts, and regions of contact. The main benefit of using the Alpha Shapes algorithm
to compute protein shapes is that the software comes with a sophisticated visualization program called
alvis, which can display such geometrical features as the interior shape of an ion channel or the cavities
in the interior of a protein.

Several programs make up the Alpha Shapes distribution. These programs must be run in the proper
sequence to correctly analyze molecular data:


pdb2alf

Translates a PDB file into an alpha datafile.

delcx

Computes the Delaunay complex of the molecule on the output from pdb2alf.

mkalf

Computes the alpha complex from the Delaunay complex computed by delcx.

volbl

Computes protein properties, using the alpha complex computed by mkalf and information from
the original PDB file. Depending on which command-line options are used, VOLBL can
compute cavities in the protein interior and space-filling models of the protein, as well as
volumes of molecules and cavities. Multiple VOLBL runs can produce complementary data
sets, which can be added or subtracted to determine contact areas and other molecular properties.

You can find usage details of each of these programs in a README file that accompanies the Alpha
Shapes distribution, or by attempting to run the program with no arguments on the command line.

Using alvis to visualize your Alpha Shapes data can be quite interesting. To do this, you need output
from delcx and mkalf, but not from VOLBL. To run alvis on a data set generated from molecule.PDB,
where output files molecule.dt and molecule.alf are also present, enter alvis molecule. The visualizer
opens with the convex hull of the molecule displayed. The standard atomic structure of the molecule
can't be seen from within the current version of alvis, but you can compare your alvis view with
another view of the same molecule (perhaps using RasMol or a similar molecular visualization
program) side by side.

In the bottom left of the alvis control panel, you'll see a box containing a graph with three colored
curves. This graph is called the alpha rank graph, and it can be used to select a desirable view of the
molecule. Positioning your cursor at peaks, valleys, or intersections on these graphs gives the most
interesting views of the molecular shape.

Using the Pocket Panel, available from the Gizmos pulldown in alvis, you can make selections that
show voids, pockets, and difference areas in a protein. The online alvis tutorial at
http://www.alphashapes.org describes in full the settings needed to view these features.

Along with the main Alpha Shapes programs, a number of utility scripts are provided that can
postprocess VOLBL output to give specific information. These include:


Computes an itemized residue-wise contribution to area or volume from a VOLBL output file


Computes residue-wise differences in accessible area between two models


Outputs area or volume contributions from nonpolar atoms; aapolar does the same thing for
polar groups


Computes atom-wise differences in area between two files

Analogous scripts for computing volume differences are also included.

Analytical surface potentials based on Alpha Shapes can also be accessed with the CAST-P web server
at the University of Illinois at Chicago. At the time of this writing, not all protein structures in the PDB
are represented on CAST-P; the site is currently under development. However, it promises to be a
useful analytical tool in the future. CAST-P features an integrated Java-based visualization of cavities
in protein structures and the amino acids that are in contact with cavities.

9.9 Computing Physicochemical Properties
We've already discussed forces that control the interactions between individual atoms in a protein
molecule. However, to understand intermolecular interactions, it may be more interesting to learn how
all the atoms in a protein act together at a distance, to influence other proteins or ligands.

The electrostatic potential of an object is a measure of the force exerted by that object on other nearby
objects. The electrostatic potential of a protein molecule is a long-range force that can influence the
behavior of other molecules in the environment at a range of up to 15 angstroms.

Electrostatic interactions within the macromolecule can also be important. Nearby charged groups
within a protein may cause the pKa value (the pH at which an acidic or basic group loses or gains a
proton) of an amino acid to shift, creating the chemistry necessary for that molecule to perform its
chemical function.
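The link between pH and pKa is the Henderson-Hasselbalch relationship: a shifted pKa changes the fraction of a group that carries a charge at a given pH. A minimal sketch:

```python
def fraction_deprotonated(pH, pKa):
    """Henderson-Hasselbalch: fraction of an acidic group in its
    deprotonated (charged, for Asp or Glu) form at a given pH."""
    return 1.0 / (1.0 + 10 ** (pKa - pH))
```

At a pH equal to the pKa, exactly half the group is deprotonated; a few pH units away, the group is almost entirely in one form or the other, which is why even a modest electrostatic pKa shift can switch a residue's charge state.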

9.9.1 Macromolecular Electrostatics

A protein molecule can be thought of as a collection of charges in a dielectric medium. In the model
that computes electrostatic potentials for protein molecules, each atom is represented as a point with a
partial atomic charge. The solvent accessible surface of the protein forms the boundary between the
interior medium of the protein and the exterior medium surrounding the protein.

Computing the electrostatic potential of a protein structure allows you to predict quantities such as
individual amino acid pKa values, solvation energies, and approximations to intermolecular binding
energies. If you are interested in protein modeling, macromolecular electrostatics is a topic that you
may wish to explore further. Our review of the subject in the March 2000 issue of Methods provides an
entry point to the molecular electrostatics literature.

The University of Houston Brownian Dynamics (UHBD) package is a state-of-the-art software package
for computing macromolecular electrostatics. UHBD computes electrostatic potentials and can also use
those potentials as parameters in subsequent Brownian Dynamics and Molecular Dynamics
simulations. The most recent release of UHBD can be compiled on Linux systems and includes several
control scripts that implement UHBD to calculate pKa values for individual titrating amino acids in the
protein, as well as theoretical titration curves for the protein as a whole. UHBD is accessed by a
scripting language; it requires a protein structure file and a command script as input. It also requires a
file containing atomic partial charges for any amino acids and other atoms in the input structure.
Detailed scripting examples are provided in the UHBD distribution, along with charge datafiles that
allow the program to assign correct partial atomic charges to all but unusual atom groups.

UHBD, and other similar programs such as DelPhi (which overlaps only the electrostatics
functionality of UHBD), use numerical approximations to solve the Poisson-Boltzmann equation for
the large number of interacting charges that make up a protein. In the finite-difference approach used
by UHBD, the irregularly spaced charges in a protein molecule are mapped onto a regular 3D grid, and
the Poisson-Boltzmann equation is solved iteratively for each point on the grid until the solution
converges to an electrostatic potential for each point.
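The finite-difference iteration can be illustrated with a toy Jacobi relaxation for the ordinary Poisson equation on a uniform grid. Real programs such as UHBD solve the full Poisson-Boltzmann equation with a spatially varying dielectric and ionic-strength terms, so this is only a sketch of the numerical idea:

```python
import numpy as np

def solve_poisson(rho, h=1.0, eps=1.0, n_iter=2000):
    """Jacobi relaxation for the Poisson equation on a regular 3D grid,
    with zero potential fixed on the grid boundary.

    Each interior grid point is repeatedly replaced by the average of
    its six neighbors plus a source term, until the solution converges.
    """
    phi = np.zeros_like(rho)
    for _ in range(n_iter):
        phi[1:-1, 1:-1, 1:-1] = (
            phi[2:, 1:-1, 1:-1] + phi[:-2, 1:-1, 1:-1] +
            phi[1:-1, 2:, 1:-1] + phi[1:-1, :-2, 1:-1] +
            phi[1:-1, 1:-1, 2:] + phi[1:-1, 1:-1, :-2] +
            h * h * rho[1:-1, 1:-1, 1:-1] / eps) / 6.0
    return phi
```

With a single point charge placed at the center of the grid, the computed potential is largest at the charge and falls off with distance, as expected.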

9.9.2 Visualization of Molecular Surfaces with Mapped Properties

Other than alvis, which doesn't truly display a molecular surface but rather a mathematically derived
pseudosurface, there are several options for displaying the shapes of molecules. Most molecular
modeling packages incorporate a molecular surface display feature and allow the surface to be colored
according to chemical properties. However, the display schemes in programs not specifically designed
for that purpose are too unsophisticated to handle data from macromolecular electrostatics calculations
and other representations of physicochemical properties. An exception seems to be the SWISS-
PDBViewer (discussed earlier), which can interpret data from external electrostatics calculations and
analytical molecular surface calculations.

GRASP/GRASS

GRASP is a high-quality molecular surface visualization program developed by Barry Honig's group at
Columbia University. GRASP can read electrostatic potential files and display them as features of a
molecular surface, and has many other display options for creating really beautiful visual
interpretations of electrostatic properties. Unfortunately, GRASP is available only for SGI IRIX
workstations and there are no plans to make it available for other operating systems at this time.

If you're using a Mac or PC, some of GRASP's functionality can be accessed through the GRASS web
interface at Columbia. However, this web interface relies heavily on an interface to either GRASP itself
(on SGI workstations), the Chime browser plug-in, or a VRML viewer, all of which are still
problematic or nonexistent if you're working on a Linux system. We have had some success using the
vrmlview viewer with Netscape to visualize VRML models from the GRASS server, although the
image quality is relatively low. To use vrmlview, download and install the program and then set your
Netscape preferences to use the vrmlview executable to handle files with MIME type model/vrml. The
"Handled By" entry in your Netscape applications list should read /usr/local/bin/vrmlview %s (or
wherever you installed vrmlview ).

The GRASS interface is straightforward and clickable. You can select from several molecular display
options: CPK surface, molecular surface, ribbons, or a stick model. Then, a property can be chosen to
be mapped onto the molecular graphics. Available properties include electrostatic potentials computed
using GRASP's built-in FDPB solver, surface curvature, hydrophobicity, and amino acid variability
within the protein's sequence family. GRASS doesn't implement the full functionality of GRASP, but
many of the most useful features are available.

9.10 Structure Optimization
Protein structure optimization is the process of bringing a structure into agreement with some "ideal"
set of geometric parameters. As mentioned earlier when we discussed structure quality checking,
protein structural models sometimes violate the laws of chemistry. Placing atoms too close together
causes unfavorable intramolecular contacts, or van der Waals bumps. Bond lengths, bond angles, and
dihedral angles between atoms in the protein can also be "wrong"; that is, they can fall outside some
normal range of values expected for that type of bond or angle.

Structure optimization is an important issue not just to developers of theoretical models, but to
researchers who experimentally determine protein structures. All protein atomic coordinates are, in an
important sense, structural models. Structure optimization tools have long been part of the x-ray
crystallographer's toolkit. The process of optimization can be computationally intensive. Because all
atoms in a protein structure are connected by bonds with rigidly fixed lengths, moving an atom in one
part of the protein structure has wide-ranging effects on its neighbors. Often moving one part of the
protein into a better configuration means moving another part of the protein into an unfavorable
configuration. Optimization is, essentially, an iterative series of small changes designed to converge to
the best overall result. There are many methods of optimization, which is its own subdiscipline within
theoretical computer science.
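As an illustration of the iterative character of optimization, the following toy sketch minimizes a single harmonic bond term by steepest descent. Real minimizers handle thousands of coupled terms, and the force constant and step size here are invented, but the loop structure is the same: small downhill steps until the change falls below a convergence threshold.

```python
def minimize_bond(d, d0=1.52, k=300.0, step=0.001, tol=1e-6):
    """Relax a bond length d toward the minimum of E = k*(d - d0)**2."""
    for _ in range(10000):
        gradient = 2.0 * k * (d - d0)   # dE/dd for the harmonic term
        new_d = d - step * gradient     # take a small step downhill
        if abs(new_d - d) < tol:        # converged: change is negligible
            break
        d = new_d
    return d

# A stretched 1.80 A bond relaxes to (nearly) the ideal 1.52 A:
print(round(minimize_bond(1.80), 3))
```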

You won't always need to know the particulars of optimization methods, but if you begin using
structure optimization and molecular simulation methods frequently, you should be aware that your
choice of optimization algorithms may be an issue. It's not always certain that optimization will provide
you with a better structural model; if the method is based on incorrect structural rules, or if the rules are
prioritized incorrectly, optimization can actually give you a worse model than you started with.

9.10.1 Informatics Plays a Role in Optimization

What are the "ideal" parameters or constraints used in optimization? In some cases, they are based
entirely on chemical principles: bond lengths and angles determined by steric restrictions and
nonbonded interactions described as Lennard-Jones potentials. In other cases, structural constraints are
based on information derived from the database of known protein structures. If a particular amino acid
in a particular sequence context always has the same conformation, a higher probability can be
assigned to it assuming that conformation again, rather than a different conformation. Secondary
structure prediction methods use an information-based approach to predicting likely conformations for
the protein backbone. Optimization methods use information to refine atomic structures at the level of
individual sidechain atoms once the backbone trace has been worked out.

9.10.2 Rotamer Libraries

Rotamer libraries are parameter sets specifically for the optimization of sidechain positions in
molecular model building. They are called rotamer libraries because they contain information about
allowed rotations of the remote amino acid sidechain atoms around the Cα-Cβ bond, expressed as
the allowed values of sidechain dihedral angles.

Because of steric constraints on bond rotation, amino acid sidechains in proteins can assume only a few
conformations without unfavorable energetic consequences. Rotamer libraries can be derived using
chemical bond and angle constraints, but they are more likely to be developed by analysis of the
conformations assumed by amino acid sidechains in known protein structures. Rotamer libraries can be
either backbone-dependent or -independent. Backbone-independent rotamer libraries classify all
instances of a particular amino acid as part of the same set, even if one occurrence is within a beta sheet
and the other is within an alpha helix. Backbone-dependent rotamer libraries, on the other hand, further
classify amino acids according to their occurrence in specific secondary structures. [2]

When rules for structure evaluation and optimization are derived from existing occurrences of patterns in a database, there is a trade-off between highly specific
classification of occurrences and the size of the data set for each type of occurrence. The more data in the data set, the better the value of the rule is likely to be;
however, the less specific the classification of occurrences, the less value the rule is likely to have for prediction. This is true not only of rotamer libraries but of
PDFs and any other database-derived rules.

SCWRL, available from the Fred Cohen research group at UCSF, is a program that allows you to
model sidechain conformations using a backbone-dependent rotamer library.
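Conceptually, a backbone-dependent rotamer library is just a lookup table keyed on residue type and backbone context. The sketch below uses invented angles and probabilities for illustration; real libraries, such as the one SCWRL uses, are derived statistically from the PDB.

```python
# Toy rotamer library: (residue, backbone context) -> list of
# (chi1 angle in degrees, probability). All values are invented.
ROTAMER_LIBRARY = {
    ("LEU", "helix"): [(-65.0, 0.70), (-177.0, 0.25), (62.0, 0.05)],
    ("LEU", "sheet"): [(-65.0, 0.50), (-177.0, 0.45), (62.0, 0.05)],
}

def best_rotamer(residue, backbone):
    """Return the most probable chi1 angle for a residue in a given context."""
    rotamers = ROTAMER_LIBRARY.get((residue, backbone), [])
    if not rotamers:
        return None
    return max(rotamers, key=lambda r: r[1])[0]

print(best_rotamer("LEU", "helix"))  # -65.0, the highest-probability rotamer
```

A backbone-independent library would drop the second key, pooling every occurrence of a residue type into a single set of rotamers.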

9.10.3 PDFs

The derivation of probability density functions (PDFs) is similar in concept to the development of
rotamer libraries, although more mathematically rigorous. The essence of a PDF is that a mathematical
function is developed to represent a distribution of discrete values. The discrete values that make up the
distribution are harvested from occurrences of a situation in a representative database of samples. That
mathematical function can then be used to evaluate and optimize (and in some cases even predict) the
properties of future occurrences of the same situation.
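The idea can be sketched with a crude histogram-based density estimate. Real PDFs are represented by smooth analytical functions fit to the data, and the sample distances below are invented, but the workflow is the same: harvest observations, build a density, then score new occurrences against it.

```python
def build_pdf(samples, bin_width=0.5):
    """Estimate a probability density from observed distances via binning."""
    counts = {}
    for d in samples:
        b = int(d / bin_width)
        counts[b] = counts.get(b, 0) + 1
    total = len(samples)
    # density per bin = fraction of samples divided by bin width
    return {b: c / (total * bin_width) for b, c in counts.items()}

def density(pdf, d, bin_width=0.5):
    """Look up the estimated density at distance d."""
    return pdf.get(int(d / bin_width), 0.0)

# Invented sulfur-sulfur distances, clustered near the 2 A disulfide peak:
observed = [2.0, 2.1, 2.0, 2.2, 3.9, 4.1, 2.1, 2.0]
pdf = build_pdf(observed)
print(density(pdf, 2.05) > density(pdf, 4.0))  # True: 2 A region is denser
```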

In protein modeling, PDFs have been used to describe intra- and inter-residue interatomic distances, as
well as bond angles, dihedral angles, and other more spatially extensive regions of protein structure.
Modeller, which is discussed in Chapter 10, uses a combination of bond angle and dihedral angle PDFs

to optimize the protein structure models it builds. Modeller's internal OPTIMIZE routine can be used
for PDF-based structure optimization.

The data from which PDFs are generated can be broken down into specific occurrences (for example,
all contacts between the Cβ atom in residue i and the Cβ atom in residue i+4 when both residues are
leucine), but again, trade-offs between classification detail and class population occur. Distance PDFs for proteins have
been used by several groups to evaluate and optimize protein structures. Most such work is still in its
early stages, and software isn't yet available for public use.

Figure 9-15 shows a plot of a distance probability density function for tertiary interactions between
sulfur atoms in cysteine residues generated from known protein structures. The function's peak near 2
angstroms corresponds to the high propensity with which the sulfur atoms form disulfide bridges
between cysteine residues. These data are taken from the Biology Workbench at the San Diego
Supercomputer Center (http://workbench.sdsc.edu/) and plotted using xmgr. Note that the Workbench
PDFs make a distinction between cysteine residues participating in disulfide bridges (pictured here and
referred to as CSS residues at the Workbench site) and those cysteines that don't participate in disulfide
bonds (which the Workbench site calls CYS).

Figure 9-15. Interatomic distance probability density function

Structure evaluation based on PDFs is implemented in the Structure Tools section of the SDSC Biology
Workbench (Figure 9-16). You can upload a PDB structure or a theoretical model and score the
structure either on a residue-by-residue or an atom-by-atom basis. Scores can be displayed on a plot,
where the Y-axis represents the relative probability of the region of structure that's being evaluated.
This can be thought of in terms of the probability that a particular residue or atom is in the "correct"
position, given what is known about other occurrences of that residue or atom in similar sequence
environments. Regions with low probability are likely to be misfolded or poorly modeled. PDF
probability scores can also be written out in a special PDB file, in place of the temperature factor
values found in the original PDB file. These special PDB files can then be displayed using a
visualization program such as RasMol or Chime. Coloring the molecule by temperature factor maps the
PDF probability scores onto the molecular structure, highlighting regions of the structure that score
poorly.
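The mechanics of that substitution are simple: the temperature factor occupies columns 61-66 of a PDB ATOM record, so a script can overwrite that field with a score. The ATOM line and the score below are fabricated examples; the column positions follow the PDB format.

```python
def set_bfactor(atom_line, score):
    """Replace the B-factor field (columns 61-66, 1-based) of an ATOM record."""
    return atom_line[:60] + f"{score:6.2f}" + atom_line[66:]

# A fabricated ATOM record with B-factor 20.00:
line = ("ATOM      1  N   ALA A   1      11.104   6.134  -6.504"
        "  1.00 20.00           N")
print(set_bfactor(line, 0.87))  # same record, B-factor column now 0.87
```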

Figure 9-16. Comparing PDF scores for an obsolete PDB structure (1B5C) and the structure (1CY0) that superseded it

9.11 Protein Resource Databases
Several new databases containing information about protein structure and function, and designed for
users of genome-level information, have recently emerged on the Web. Some of the most notable are
GeneCensus, PRESAGE, and BIND. These databases are still relatively lightly populated, and have not
yet taken on the central role in biological research that the PDB and NCBI resources have. However,
these or similar resources will certainly soon become vital to molecular biology research.

9.11.1 GeneCensus

The GeneCensus project is a broad, sequence-based comparison of the protein content of several
genomes. At the time of this writing, GeneCensus contains information from 20 genomes. GeneCensus
currently can be searched with a PDB ID or an ORF ID to locate occurrences of specific protein folds
in the GeneCensus genomes.

9.11.2 PRESAGE

PRESAGE is a database of information about experimental progress with the structures of various
proteins. You can search PRESAGE with a TIGR ORF code, NCBI or SWISS-PROT accession
number, and a number of other codes to find out if someone is attempting to isolate, crystallize, and
solve the structure of that protein, and if so, how far along they are. PRESAGE is relatively new at
press time; it currently contains only about 6,000 records and isn't guaranteed to be comprehensive. It
can't be searched directly by BLAST or FASTA search with a sequence, so before checking PRESAGE
for instances of an unknown sequence, you have to search for matching accession codes. However, in

principle, the PRESAGE database promises to be useful, both for crystallographers and their
collaborators, and for curious citizens of the molecular biology community who are wondering if the
structure of their protein has been solved yet.

9.11.3 BIND

The Biomolecular Interaction Network Database (BIND) is another relatively new data repository
offered by the Samuel Lunenfeld Research Institute. BIND was designed to be a central deposition site
for known information about macromolecular interactions. BIND is a complex database, containing
information about interactions between objects in the database, molecular complex information, and
metabolic pathway information. The BIND format is designed to contain information about
experimental conditions that observe the interactions stored in the database, as well as information
about binding site location, biochemical activity, kinetics, and thermodynamics. BIND is still in its beta
testing phase, containing only a few hundred interactions. However, BIND has been funded to hire
indexers and it is expected to grow rapidly in the near future. One interesting aspect of the BIND data
entry process is that methods for automated reading of existing journal literature are being implemented
to extract known interactions from their inconvenient location in dusty library stacks and put them
more effectively in the public domain.

9.12 Putting It All Together
We can't give you a single recipe for using the techniques described in this chapter to characterize a
protein. There are too many variables from system to system, and too much diversity in what you as a
biologist might want to know. However, some common features of a structural modeling approach include:

• Gathering useful structural and biophysical information about the system under study.
Everything from site-directed mutagenesis to classic biochemical and biophysical studies may
be useful.
• Using multiple sequence analysis to analyze the protein as part of a related family. This may
give insight into the location of functionally important residues and active sites.
• Analyzing a crystal structure or theoretical model to identify the location of buried polar and
charged residues, unusual hydrogen bonds, networks of structured solvent molecules, and other
chemical features that may be involved in structural stability or function.
• Analyzing a family of related proteins by superimposing or comparing their structures to
identify common features.
• Mapping identified properties (sites known to affect function if mutated, sites conserved in
multiple sequences, etc.) onto the protein sequence and structure.
• Visualizing the structure and interpreting the location of potentially important amino acids and
functional sites.
• Computing the molecular surface and characterizing possible substrate binding sites or
molecular interaction regions by their shapes.
• Computing electrostatic potentials and modeling electrostatic properties such as individual
amino acid pKa values or molecular interaction energies. Unusually strong electrostatic
potentials or unusual pKa values may indicate regions of catalytic importance.

Obviously, this type of analysis requires a real understanding of protein chemistry. We've identified the
tools of structural biology for you, but you will decide how to put them to use. To help you toward that
end, Table 9-3 contains a quick reference of molecular structure tools and how they are commonly used.

Table 9-3. Structure Tools and Techniques
What you do | Why you do it | What you use to do it
View molecular structure | Computer graphics are the only way to "see" a protein structure in detail | Browser plugins: RasMol, Cn3D, SWISS-PDBViewer; standalones: MolMol, MidasPlus, VMD
Create high-quality PostScript schematic diagrams and color graphics of proteins | For publication | MolScript
Create schematic diagrams of active sites | To help identify the structural components of the functional site; for publication |
Structure classification | To identify relationships among proteins | CATH, SCOP
Secondary structure analysis | To extract recognizable features at the secondary structure level, which aids in classification | DSSP, STRIDE
Topology analysis | To extract recognizable supersecondary motifs, which aids in classification | TOPS
Domain identification | To extract recognizable domains, which aids in classification | 3Dee
Unique structure database subsets | To eliminate bias in source data sets for knowledge-based modeling | PDBSelect, culled PDB databases
Structure alignment | To identify relationships among distantly related proteins that may have evolved beyond recognizable sequence similarity while preserving structural similarity | CE, DALI, VAST
Molecular geometry analysis | To identify strained conformations or incorrectly represented regions in a structure model | PROCHECK, WHAT IF
Intramolecular contact analysis | To identify residue-residue interactions that may help identify active sites, structure-stabilizing features, etc. | CSU, HBPLUS
Solvent accessibility | To identify amino acids that interact with a solvent | naccess, Alpha Shapes
Solvent modeling | To place a chemically realistic solvent shell around the molecule in preparation for some types of simulations; aids in understanding functional mechanism | HBUILD
Molecular surface visualization | To gain a visual understanding of molecular shape and chemical surface features | GRASP, GRASS server, SWISS-PDBViewer
Electrostatic potential calculation | To visualize the chemically important surface features of a protein, and as a preliminary step in pKa calculations, binding energy calculations, and Brownian dynamics simulations | UHBD, DelPhi
Protein pKa calculation | To model pH-dependent behavior of proteins, identify possible active sites, and identify residues in unusual chemical environments | UHBD, DelPhi

Chapter 10. Predicting Protein Structure and Function
from Sequence
The amino acid sequence of a protein contains interesting information in and of itself. A protein
sequence can be compared and contrasted with the sequences of other proteins to establish its
relationship, if any, to known protein families, and to provide information about the evolution of
biochemical function. However, for the purpose of understanding protein function, the 3D structure of
a protein is far more useful than its sequence.

The key property of proteins that allows them to perform a variety of biochemical functions is the
sequence of amino acids in the protein chain, which somehow uniquely determines the 3D structure of
the protein. Given 20 amino acid possibilities, there are a vast number of ways they can be combined
to make up even a short protein sequence, which means that given time, organisms can evolve proteins
that carry out practically any imaginable purpose.
That "somehow," incidentally, represents several decades of work by hundreds of researchers, on a fundamental question that remains open to this day.
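The combinatorics are easy to verify: sequence space grows as 20 to the power of the chain length, so even a modest 100-residue protein has an astronomical number of possible sequences.

```python
# Number of possible sequences for a 100-residue protein: 20**100.
n = 100
sequences = 20 ** n
# Order of magnitude (number of digits minus one): about 130, far more
# than the roughly 10**80 atoms in the observable universe.
print(len(str(sequences)) - 1)
```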

Each time a particular protein chain is synthesized in the cell, it folds together so that each of the
critical chemical groups for that particular protein's function is brought into a precise geometric
arrangement. The fold assumed by a protein sequence doesn't vary. Every occurrence of that particular
protein folds into exactly the same structure.

Despite this consistency on the part of proteins, no one has figured out how to accurately predict the 3D
structure that a protein will fold into based on its sequence alone. Patterns are clearly present in the
amino acid sequences of proteins, but those patterns are degenerate ; that is, more than one sequence
can specify a particular type of fold. While there are thousands upon thousands of ways amino acids
can combine to form a sequence of a particular length, the number of unique ways that a protein
structure can organize itself seems to be much smaller. Only a few hundred unique protein folds have
been observed in the Protein Data Bank. Proteins with almost completely nonhomologous sequences
nonetheless fold into structures that are similar. And so, prediction of structure from sequence is a
difficult problem.

10.1 Determining the Structures of Proteins
If we can experimentally determine the structures of proteins, and structure prediction is so hard, why
bother with predicting structure at all? The answer is that solving protein structures is difficult, and
there are many bottlenecks in the experimental process. Although the first protein structure was
determined decades before the first DNA sequence, the protein structure database has grown far more
slowly in the interim than the sequence database. There are now on the order of 10,000 protein
structures in the PDB, and on the order of 10 million gene sequences in GenBank. Only about 3,000
unique protein structures have been solved (excluding structures of proteins that are more than 95%
identical in sequence). Approximately 1,000 of these are from proteins that are substantially different
from each other (no more than 25% identical in sequence).

10.1.1 Solving Protein Structures by X-ray Crystallography

In the late 1930s, it was already known that proteins were made up of amino acids, although it had not
yet been proven that these components came together in a unique sequence. Linus Pauling and Robert
Corey began to use x-ray crystallography to study the atomic structures of amino acids and peptides.

Pure proteins had been crystallized by the time that Pauling and Corey began their experiments.
However, x-ray crystallography requires large, unflawed protein crystals, and the technology of protein
purification and crystallization had not advanced to the point of producing useful crystals. What
Pauling and Corey did discover in their studies of amino acids and peptides was that the peptide bond is
flat and rigid, and that the carboxylic acid oxygen is almost always on the opposite side of the peptide
bond as the amino hydrogen. (See the illustration of a peptide bond in Figure 9-1). Using this
information to constrain their models, along with the atomic bond lengths and angles that they observed
for amino acids, Pauling and Corey built structural models of polypeptide chains. As a result, they were
able to propose two types of repetitive structure that occur in proteins: the alpha helix and the beta
sheet, as shown previously in Chapter 9.

In experiments that began in the early 1950s, John Kendrew determined the structure of a protein called
myoglobin, and Max Perutz determined the structure of a similar protein called hemoglobin. Both
proteins are oxygen transporters, easily isolated in large quantities from blood and readily crystallized.
Obtaining x-ray data of reasonably high quality and analyzing it without the aid of modern computers
took several years. The structures of hemoglobin and myoglobin were found to be composed of high-
density rods of the dimensions expected for Pauling's proposed alpha helix. Two years later, a much
higher-quality crystallographic data set allowed the positions of 1200 of myoglobin's 1260 atoms to be
determined exactly. The experiments of Kendrew and Perutz paved the way for x-ray crystallographic
analysis of other proteins.

In x-ray crystallography, a crystal of a substance is placed in an x-ray beam. X-rays are reflected by the
electron clouds surrounding the atoms in the crystal. In a protein crystal, individual protein molecules
are arranged in a regular lattice, so x-rays are reflected by the crystal in regular patterns. The x-ray
reflections scattered from a protein crystal can be analyzed to produce an electron density map of the
protein (see Figure 10-1: images are courtesy of the Holden Group, University of Wisconsin, Madison,
and Bruker Analytical X Ray Systems). Protein atomic coordinates are produced by modeling the best
possible way for the atoms making up the known sequence of the protein to fit into this electron
density. The fitting process isn't unambiguous; there are many incrementally different ways in which an
atomic structure can be fit into an electron density map, and not all of them are chemically correct. A
protein structure isn't an exact representation of the positions of atoms in the crystal; it's simply the
model that best fits both the electron density map of the protein and the stereochemical constraints that
govern protein structures.

Figure 10-1. X-ray diffraction pattern and corresponding electron density map of the 3D structure of zinc-
containing phosphodiesterase

In order to determine a protein structure by x-ray crystallography, an extremely pure protein sample is
needed. The protein sample has to form crystals that are relatively large (about 0.5 mm) and singular,
without flaws. Producing large samples of pure protein has become easier with recombinant DNA
technology. However, crystallizing proteins is still somewhat of an art form. There is no generic recipe
for crystallization conditions (e.g., the salt content and pH of the protein solution) that cause proteins to
crystallize rapidly, and even when the right solution conditions are found, it may take months for
crystals of suitable size to form.

Many protein structures aren't amenable to crystallization at all. For instance, proteins that do their
work in the cell membrane usually aren't soluble in water and tend to aggregate in solution, so it's
difficult to solve the structures of membrane proteins by x-ray crystallography. Integral membrane
proteins account for about 30% of the protein complement (proteome) of living things, and yet less
than a dozen proteins of this type have been crystallized in a pure enough form for their structures to be
solved at atomic resolution.

10.1.2 Solving Structures by NMR Spectroscopy

An increasing number of protein structures are being solved by nuclear magnetic resonance (NMR)
spectroscopy. Figure 10-2 shows raw data from NMR spectroscopy. NMR detects atomic nuclei with
nonzero spin; the signals produced by these nuclei are shifted in the magnetic field depending on their
electronic environment. By interpreting the chemical shifts observed in the NMR spectrum of a
molecule, distances between particular atoms in the molecule can be estimated. (The image in Figure
10-2 is courtesy of Clemens Anklin, Bruker Analytical X Ray Systems.)

Figure 10-2. NOESY (2D NMR) spectrum of lysozyme

To be studied using NMR, a protein needs to be small enough to rotate rapidly in solution (on the order
of 30 kilodaltons in molecular weight), soluble at high concentrations, and stable at room temperature
for time periods as long as several days.

Analysis of the chemical shift data from an NMR experiment produces a set of distance constraints
between labeled atoms in a protein. Solving an NMR structure means producing a model or set of
models that manage to satisfy all the known distance constraints determined by the NMR experiment,
as well as the general stereochemical constraints that govern protein structures. NMR models are often
released in groups of 20-40 models, because the solution to a structure determined by NMR is more
ambiguous than the solution to a structure determined by crystallography. An NMR average structure is
created by averaging such a group of models (Figure 10-3). Depending on how this averaging is done,
stereochemical errors may be introduced into the resulting structure, so it's generally wise to check the
quality of average structures before using them in modeling.
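Computing a naive average structure is straightforward: average each atom's coordinates across the models. The sketch below uses invented coordinates and skips the superposition and quality-checking steps that real tools perform.

```python
def average_structure(models):
    """models: list of models, each a list of (x, y, z) atom coordinates.
    Returns one averaged coordinate per atom."""
    n_models = len(models)
    n_atoms = len(models[0])
    averaged = []
    for i in range(n_atoms):
        xs = [m[i][0] for m in models]
        ys = [m[i][1] for m in models]
        zs = [m[i][2] for m in models]
        averaged.append((sum(xs) / n_models,
                         sum(ys) / n_models,
                         sum(zs) / n_models))
    return averaged

# Two invented two-atom "models" that disagree about the second atom:
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(average_structure(models))  # [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
```

Note that this kind of unweighted averaging is exactly where stereochemical errors can creep in: the mean of two valid conformations need not itself be a valid conformation.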

Figure 10-3. Structural diversity in a set of NMR models

10.2 Predicting the Structures of Proteins
Ideally, what we'd like to do in analyzing proteins is take the sequence of a protein, which is cheap to
obtain experimentally, and predict the structure of the protein, which is expensive and sometimes
impossible to determine experimentally. It would also be interesting to be able to accurately predict
function from sequence, identify functional sites in uncharacterized 3D structures, and eventually, build
designed proteins”molecular machines that do whatever we need them to do. But without an
understanding of how sequence determines structure, these other goals can't reliably be achieved.

There are two approaches in computational modeling of protein structure. The first is knowledge-based
modeling. Knowledge-based methods employ parameters extracted from the database of existing
structures to evaluate and optimize structures or predict structure from sequence (the protein structure
prediction problem). The second approach is based on simulation of physical forces and molecular
dynamics. Physicochemical simulations are often used to attempt to model how a protein folds into its
compact, functional, native form from a nonfunctional, not-as-compact, denatured form (the protein
folding problem). In this chapter we focus on knowledge-based protein structure prediction and
analysis methods in which bioinformatics plays an important role.

Ab-initio protein structure prediction from protein sequence remains an unsolved problem in
computational biology. Although many researchers have worked to develop methods for structure
prediction, the only methods that produce a large number of successful 3D structure predictions are
those based on sequence homology. If you have a protein sequence and you want to know its structure,
you either need a homologous structure to use as a template for model-building, or you need to find a
crystallographer who will collaborate with you to solve the structure experimentally.

The protein-structure community is addressing the database gap between sequence and structure in a
couple of ways. Pilot projects in structural genomics (the effort to experimentally solve all, or some
large fraction of, the protein structures encoded by an entire genome) are underway at several
institutions. However, these projects stand little chance of catching up to the sheer volume of sequence
data that's being produced, at least in the near future.

10.2.1 CASP: The Search for the Holy Grail

In light of the database gap, computational structure prediction is a hot target. It's often been referred to
as the "holy grail" of computational biology; it's both an important goal and an elusive, perhaps
impossible, goal. However, it's possible to track progress in the area of protein structure prediction in
the literature and try out approaches that have shown partial success.

Every two years, structure prediction research groups compete in the Community Wide Experiment on
the Critical Assessment of Techniques for Protein Structure Prediction (CASP,
http://predictioncenter.llnl.gov/). The results of the CASP competition showcase the latest methods for
protein-structure prediction. CASP has three areas of competition: homology modeling, threading, and
ab initio prediction. In addition, CASP is a testing ground for new methods of evaluating the
correctness of structure predictions.

Homology modeling focuses on the use of a structural template derived from known structures to build
an all-atom model of a protein. The quality of CASP predictions in 1998 showed that structure
prediction based on homology is very successful in producing reasonable models in any case in which a
significantly homologous structure was available. The challenge for homology-based prediction, as for
sequence alignment, is detecting meaningful sequence homology in the Twilight Zone, below 25%
sequence homology.

Threading methods take the amino acid sequence of an uncharacterized protein, rapidly
compute models based on existing 3D structures, and then evaluate these models to determine how well
the unknown sequence "fits" each template structure. All the threading methods reported in the most
recent CASP competition produced accurate structural models in less than half of the cases in which
they are applied. They are more successfully used to detect remote homologies that can't be detected by
standard sequence alignment; if parts of a sequence fit a fold well, an alignment can generally be
inferred, although there may not be enough information to build a complete model.

Ab-initio prediction methods focus on building structure with no prior information. While none of the
ab-initio methods evaluated in CASP 3 produced accurate models with any reliability, a variety of
methods are showing some promise in this area. One informatics-based strategy for structure prediction
has been to develop representative libraries of short structural segments out of which structures can be
built. Since structural "words" that aren't in the library of segments are out of bounds, the structural
space to be searched in model building is somewhat constrained. Another common ab-initio method is
to use a reduced representation of the protein structure to simulate folding. In these methods, proteins
can be represented as beads on a string. Each amino acid, or each fixed secondary structure unit in
some approaches, becomes a bead with assigned properties that attract and repel other types of beads,
and statistical mechanical simulation methods are used to search the conformational space available to
the simplified model. These methods can be successful in identifying protein folds, even when the
details of the structure can't be predicted.

10.3 From 3D to 1D
Proteins and DNA are, in reality, complicated chemical structures made up of thousands or even
millions of atoms. Simulating such complicated structures explicitly isn't possible even with modern
computers, so computational biologists use abstractions of proteins and DNA when developing
analytical methods. The most commonly used abstraction of biological macromolecules is the single-
letter sequence. However, reducing the information content of a complicated structure to a simple
sequence code leaves behind valuable information.

The sequence analysis methods discussed in Chapter 7 and Chapter 8 treat proteins as strings of
characters. For the purpose of sequence comparison, the character sequence of a protein is almost a
sufficient representation of a protein structure. However, even the need for substitution matrices in
scoring protein sequence alignments points to the more complicated chemical nature of proteins. Some
amino acids are chemically similar to each other and likely to substitute for each other. Some are
different. Some are large, and some are small. Some are polar; some are nonpolar. Substitution
matrices are a simple, quantitative way to map amino acid property information onto a linear sequence.
Asking no questions about the nature of the similarities, substitution matrices reflect the tendencies of
one amino acid to substitute for another, but that is all.
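A toy version of this mapping makes the point: a substitution matrix is just a table of pair scores, with chemically similar pairs scoring higher. The scores below are invented for illustration, not taken from BLOSUM or PAM.

```python
# Invented pair scores: hydrophobic-for-hydrophobic swaps score well,
# hydrophobic-for-acidic swaps score badly. Unlisted pairs score 0 here.
SUBSTITUTION = {
    ("L", "L"): 4,
    ("L", "I"): 2,    # leucine and isoleucine: chemically similar
    ("L", "D"): -4,   # leucine and aspartate: chemically dissimilar
}

def pair_score(a, b):
    """Symmetric lookup: (a, b) and (b, a) score the same."""
    return SUBSTITUTION.get((a, b), SUBSTITUTION.get((b, a), 0))

def alignment_score(seq1, seq2):
    """Score an ungapped alignment position by position."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(alignment_score("LIL", "LLD"))  # 4 + 2 + (-4) = 2
```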

However, each amino acid residue in a protein structure (or each base in a DNA or RNA structure, as
we are beginning to learn) doesn't exist just within its sequence context. 1D information has not proven
sufficient to show unambiguously how protein structure and function are determined from sequence.
The 3D structural and chemical context of a residue contains many types of information.

Until quite recently, 3D information about protein structure was usually condensed into a more easily
analyzable form via 3D-to-1D mapping. A property extracted from a database can be mapped, usually
in the form of a numerical or alphabetic character score, to each residue in a sequence. Knowledge-
based methods for secondary structure prediction were one of the first uses of 3D-to-1D mapping. By
mapping secondary structure context onto sequence information as a property (attaching a code
representing "helix," "sheet," or "coil" to each amino acid), a set of secondary structure propensities
can be derived from the structure database and then used to predict the secondary structure content of
new sequences.
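The counting step behind such propensities can be sketched as follows; the two-state helix/non-helix scheme and the function name are illustrative simplifications of what knowledge-based methods actually derive:

```python
from collections import defaultdict

def helix_propensities(examples):
    """Derive helix propensities from (sequence, secondary-structure) pairs,
    where the structure string uses 'H' for helix.

    Propensity = P(helix | amino acid) / P(helix overall); values above 1.0
    mark helix-favoring amino acids.
    """
    in_helix = defaultdict(int)   # helix occurrences per amino acid
    total = defaultdict(int)      # total occurrences per amino acid
    helix_residues = 0
    all_residues = 0
    for seq, ss in examples:
        for aa, state in zip(seq, ss):
            total[aa] += 1
            all_residues += 1
            if state == 'H':
                in_helix[aa] += 1
                helix_residues += 1
    background = helix_residues / all_residues
    return {aa: (in_helix[aa] / total[aa]) / background for aa in total}
```

Run over a database of known structures, a table like this is exactly the kind of 3D-to-1D property map the text describes.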

What important properties does each amino acid in a protein sequence have, besides occurrence in a
secondary structure? Some commonly used properties are solvent accessibility, acid/base properties,
polarizability, nearest sequence neighbors, and nearest spatial neighbors. All these properties have
something to do with intermolecular interactions, as discussed in Chapter 9.

10.4 Feature Detection in Protein Sequences
Protein sequence analysis is based to some extent on understanding the physicochemical properties of
the chemical components of the protein chain, and to some extent on knowing the frequency of
occurrence of particular amino acids in specific positions in protein structures and substructures.
Although protein sequence-analysis tools operate on 1D sequence data, they contain implicit
assumptions about how structural features map to sequence data. Before using these tools, it's best to
know some protein biochemistry.

Features in protein sequences represent the details of the protein's function. These usually include sites
of post-translational modifications and localization signals. Post-translational modifications are
chemical changes made to the protein after it's translated from a messenger RNA. They include
truncations of the protein (cleavages) and the addition of a chemical group to regulate the behavior of
the protein (phosphorylation, glycosylation, and acetylation are common examples). Localization or
targeting signals are used by the cell to ensure that proteins are in the right place at the right time. They
include nuclear localization signals, ER targeting peptides, and transmembrane helices (which we saw
in Chapter 9).

Protein sequence feature detection is often the first problem in computational biology tackled by
molecular biologists. Unfortunately, the software tools used for feature detection aren't centrally
located. They are scattered across the Web on personal or laboratory sites, and to find them you'll need
to do some digging using your favorite search engine.

One collection of web-based resources for protein sequence feature detection is maintained at the
Technical University of Denmark's Center for Biological Sequence Analysis Prediction Servers
(http://www.cbs.dtu.dk/services/). This site provides access to server-based programs for finding
(among other things) cleavage sites, glycosylation sites, and subcellular localization signals. These
tools all work similarly; they search for simple sequence patterns, or profiles, that are known to be
associated with various post-translational modifications. The programs have standardized interfaces
and are straightforward to use: each has a submission form into which you paste your sequence of
interest, and each returns an HTML page containing the results.
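Such a pattern search is easy to reproduce for a single motif. The sketch below uses the PROSITE N-glycosylation consensus N-{P}-[ST]-{P}; the function name is ours, and real servers apply many additional filters beyond the raw pattern match:

```python
import re

# PROSITE N-glycosylation consensus N-{P}-[ST]-{P}, wrapped in a lookahead
# so that overlapping occurrences are all reported.
NGLYC = re.compile(r"(?=N[^P][ST][^P])")

def find_nglyc_sites(seq):
    """Return 0-based start positions of candidate N-glycosylation sites."""
    return [m.start() for m in NGLYC.finditer(seq)]
```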

10.5 Secondary Structure Prediction
Secondary structure prediction is often regarded as a first step in predicting the structure of a protein.
As with any prediction method, a healthy amount of skepticism should be employed in interpreting the
output of these programs as actual predictions of secondary structure. Even so, secondary
structure predictions can help you analyze and compare protein sequences. In this section, we briefly
survey secondary structure prediction methods, the ways in which their accuracy is measured, and some of the ways their predictions can be put to use.

Protein secondary structure prediction is the classification of amino acids in a protein sequence
according to their predicted local structure. Secondary structure is usually divided into one of three
types (alpha helix, beta sheet, or coil), although the number of states depends on the model being used.

Secondary structure prediction methods can be divided into alignment-based and single-sequence
methods. Alignment-based secondary structure prediction is quite intuitive and related to the concept of
sequence profiles. In an alignment-based secondary structure prediction, the investigator finds a family
of sequences that are similar to the unknown. The homologous regions in the family of sequences are
then assumed to share the same secondary structure and the prediction is made not based on one
sequence but on the consensus of all the sequences in the set. Single-sequence approaches, on the
other hand, predict local structure from the unknown sequence alone.

10.5.1 Alignment-Based and Hybrid Methods

Modern methods for secondary structure prediction exploit information from multiple sequence
alignments, or combinations of predictions from multiple methods, or both. These methods claim
accuracy in the 70-77% range. Many of these programs are available for use over the Web. They take
a sequence (usually in FASTA format) as input, execute some procedure on it, then return a
prediction, usually by email, since the prediction algorithms are compute-intensive and tend to be run
in a queue. The following is a list of some of the most commonly used methods:


PHD combines results from a number of neural networks, each of which predicts the secondary
structure of a residue based on the local sequence context and on global sequence characteristics

(protein length, amino acid frequencies, etc.). The final prediction, then, is an arithmetic average
of the output of each of these neural networks. Such combination schemes are known as jury
decision or (more colloquially) winner-take-all methods. PHD is regarded as a standard for
secondary structure prediction.


PSIPRED combines neural network predictions with a multiple sequence alignment derived
from a PSI-BLAST database search. PSIPRED was one of the top performers on secondary
structure prediction at CASP 3.


JPred secondary structure predictions are taken from a consensus of several other
complementary prediction methods, supplemented by multiple sequence alignment information.
JPred is another one of the top-performing secondary structure predictors. The JPred server
returns output that can in turn be displayed, edited, and saved for use by other programs using
the Jalview alignment editor.


PREDATOR combines multiple sequence alignment information with the hydrogen bonding
characteristics of the amino acids to predict secondary structure.


PSA is a Markov model-based approach to secondary structure prediction. It's notable for
its detailed graphical output, which represents predicted probabilities of helix, sheet, and coil
states for each position in the protein sequence.

10.5.2 Single Sequence Prediction Methods

The first structure prediction methods, developed in the 1970s, were single-sequence approaches. The Chou-
Fasman method uses rules derived from physicochemical data about amino acids to predict secondary
structure. The GOR algorithm (named for its authors, Garnier, Osguthorpe, and Robson) and its
successors use information about the frequency with which residues occur in helices, sheets, and coils
in proteins of known structure to predict structures. Both methods are still in use, often on multiple-tool
servers such as the SDSC Biology Workbench. Modern methods based on structural rules and
frequencies can achieve prediction accuracies in the 70-75% range, especially when families of related
sequences are analyzed, rather than single sequences.
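A single-sequence predictor in this spirit can be sketched as follows; the propensity values are illustrative Chou-Fasman-style numbers rather than the full published parameter set, and the windowing rule is a simplification of the real nucleation-and-extension rules:

```python
# Helix and sheet propensities for a few amino acids (illustrative values in
# the spirit of Chou-Fasman; residues missing from the tables default to a
# neutral 1.0).
HELIX = {'A': 1.42, 'E': 1.51, 'M': 1.45, 'L': 1.21, 'V': 1.06, 'K': 1.16,
         'G': 0.57, 'P': 0.57}
SHEET = {'A': 0.83, 'E': 0.37, 'M': 1.05, 'L': 1.30, 'V': 1.70, 'K': 0.74,
         'G': 0.75, 'P': 0.55}

def predict_ss(seq, window=5):
    """Assign H, E, or C to each residue from average propensities over a
    short window centered on that residue."""
    half = window // 2
    prediction = []
    for i in range(len(seq)):
        region = seq[max(0, i - half):i + half + 1]
        p_helix = sum(HELIX.get(aa, 1.0) for aa in region) / len(region)
        p_sheet = sum(SHEET.get(aa, 1.0) for aa in region) / len(region)
        if p_helix >= 1.0 and p_helix >= p_sheet:
            prediction.append('H')
        elif p_sheet >= 1.0:
            prediction.append('E')
        else:
            prediction.append('C')
    return "".join(prediction)
```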

The surge in popularity of artificial intelligence methods in the 1980s gave rise to AI-based approaches
to secondary structure prediction, most notably the pattern-recognition approaches developed in the
laboratories of Fred Cohen (University of California, San Francisco) and Michael Sternberg (Imperial
Cancer Research Fund), and the neural network applications of Terrence Sejnowski and N. Qian (then
at Johns Hopkins). These methods exploited similar information as the earlier single-sequence methods
did, using the AI techniques to automate knowledge acquisition and application.

10.5.3 Measuring Prediction Accuracy
Authors who present papers on secondary structure prediction methods often use a measure of
prediction accuracy called Q3. The Q3 score is defined as:

Q3 = (true_positives + true_negatives) / total_residues
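In the three-state case this amounts to the percentage of residues whose H/E/C label matches the observed structure, which is easy to compute directly:

```python
def q3(predicted, observed):
    """Three-state per-residue accuracy: percent of positions where the
    predicted H/E/C label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```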

A second measure of prediction accuracy is the segment overlap score (Sov) proposed by Burkhard
Rost and coworkers. The Sov metric tends to be more stringent than Q3: it rewards predictions that
recover contiguous segments of a single kind of secondary structure, and it penalizes sparse,
fragmented predictions (Figure 10-4).

Figure 10-4. Good and bad (sparse) secondary structure predictions

When comparing methods, it pays to be conservative; look at both the average scores and their standard
deviations instead of the best reported score. As you can see, Q3 and Sov are fairly simple statistics.
Unlike E-values reported in sequence comparison, neither is a test statistic; both measure the percent of
residues predicted correctly instead of measuring the significance of prediction accuracy. And, as with
genefinders, make sure you know what kind of data is used to train the prediction method.

10.5.4 Putting Predictions to Use

Originally, the goal of predicting secondary structure was to come up with the most accurate prediction
possible. Many researchers took this as their goal, resulting in many gnawed fingernails and pulled-out
hairs. As mentioned earlier, the hard-won lesson of secondary structure prediction is that it isn't very
accurate. However, secondary structure prediction methods have practical applications in
bioinformatics, particularly in detecting remote homologs. Drug companies compare secondary
structure predictions to locate potential remote homologs in searches for drug targets. Patterns of
predicted secondary structure can predict fold classes of proteins and select targets in structural genomics projects.

Secondary structure prediction tools such as PredictProtein and JPred combine the results of several
approaches, including secondary structure prediction and threading methods. Using secondary structure
predictions from several complementary methods (both single-sequence and homology-based
approaches) can result in a better answer than using just one method. If all the methods agree on the
predicted structure of a region, you can be more confident of the prediction than if it had been arrived
at by only one or two programs. This is known as a voting procedure in machine learning.
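The voting procedure itself is simple; a per-position majority vote over equal-length predictions can be sketched as follows (ties fall to the first label counted, a simplification real servers handle more carefully):

```python
from collections import Counter

def consensus(predictions):
    """Majority vote across equal-length secondary structure predictions
    (strings over H, E, and C), one vote per method per position."""
    assert len(set(map(len, predictions))) == 1, "predictions must align"
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*predictions)
    )
```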

As with any other prediction, secondary structure predictions are most useful if some information about
the secondary structure is known. For example, if the structure of even a short segment of the protein
has been determined, that data can be used as a sanity check for the prediction.

10.5.5 Predicting Transmembrane Helices

Transmembrane helix prediction is related to secondary structure prediction. It involves the recognition
of regions in protein sequences that can be inserted into cell membranes. Methods for predicting
transmembrane helices in protein sequences identify regions in the sequence that can fold into a helix
and exist in the hydrophobic environment of the membrane. Transmembrane helix prediction grew out
of research into hydrophobicity in the early 1980s, pioneered by Russell Doolittle (University of
California, San Diego). Again, there are a number of transmembrane helix prediction servers available
over the Web. Programs that have emerged as standards for transmembrane helix prediction include
TMHMM (http://www.cbs.dtu.dk/services/TMHMM-1.0/), MEMSAT
(http://insulin.brunel.ac.uk/~jones/memsat.html), and TopPred.

Although structure determination for soluble proteins can be difficult, a far greater challenge is
structure determination for membrane-bound proteins. Some of the most interesting biological
processes involve membrane proteins: for example, photosynthesis, vision, neuron excitation,
respiration, immune response, and the passing of signals from one cell to another. Yet only a handful of
membrane proteins have been crystallized. Because these proteins don't exist entirely in aqueous
solution, their physicochemical properties are very different from those of soluble proteins and they
require unusual conditions to crystallize, if they are amenable to crystallization at all.

As a result, many computer programs exist that detect transmembrane segments in protein sequence.
These segments have distinct characteristics that make it possible to detect them with reasonable
certainty. In order to span a cell membrane, an alpha helix must be about 17-25 amino acids in length.
Because the interior of a cell membrane is composed of the long hydrocarbon chains of fatty acids, an
alpha helix embedded in the membrane must present a relatively nonpolar surface to the membrane in
order for its position to be energetically favorable.

Early transmembrane segment identification programs exploited these properties directly. By analyzing
every 17-25-residue window of an amino acid sequence and assigning a hydrophobicity score to each
window, high-scoring segments can be predicted to be transmembrane helices. Recent improvements to
these early methods have boosted prediction of individual transmembrane segments to an accuracy
level of 90-95%.
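The sliding-window scheme can be sketched with the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 cutoff are illustrative choices, and real predictors tune both against known structures:

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'C': 2.5, 'M': 1.9, 'A': 1.8,
      'G': -0.4, 'T': -0.7, 'S': -0.8, 'W': -0.9, 'Y': -1.3, 'P': -1.6,
      'H': -3.2, 'E': -3.5, 'Q': -3.5, 'D': -3.5, 'N': -3.5, 'K': -3.9,
      'R': -4.5}

def predict_tm_segments(seq, window=19, cutoff=1.6):
    """Return start positions of windows whose average hydropathy is high
    enough to suggest a candidate transmembrane helix."""
    starts = []
    for i in range(len(seq) - window + 1):
        score = sum(KD[aa] for aa in seq[i:i + window]) / window
        if score > cutoff:
            starts.append(i)
    return starts
```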

The topology of the protein in the membrane isn't as easy to predict. The orientation of the first helix in
the membrane determines the orientation of all the remaining helices. The connections of the helices
can be categorized as falling on one side or the other of the membrane, but determining which side is
which, physiologically, is more complicated.

10.5.6 Threading

The basic principle of structure analysis by threading is that an unknown amino acid sequence is fitted
into (threaded through) a variety of existing 3D structures, and the fitness of the sequence to fold into
that structure is evaluated. All threading methods operate on this premise, but they differ in their implementations.

Threading methods don't build a refined all-atom model of the protein; rather, they rapidly substitute
amino acid positions in a known structure with the sidechains from the unknown sequence. Each
sidechain position in a folded protein can be described in terms of its environment: to what extent the
sidechain is exposed to solvent and, if it isn't exposed to solvent, what other amino acids it's in contact
with. A threaded model is scored highly if hydrophobic residues are found in solvent-inaccessible
environments and hydrophilic residues are found on the protein surface, and if buried charged and
polar residues have appropriate countercharges or hydrogen-bonding partners.
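The environment-based scoring idea can be sketched as follows; the two-state buried/exposed labels, the residue classes, and the ±1 scores are all hypothetical simplifications of the statistical potentials real threading programs use:

```python
# A toy environment-compatibility score for threading (hypothetical scoring
# scheme, not any published potential).
HYDROPHOBIC = set("AVILMFWC")

def environment_score(sequence, environments):
    """Score a sequence threaded onto template positions labeled 'B' (buried)
    or 'E' (exposed): +1 when the residue suits its environment, -1 otherwise."""
    score = 0
    for aa, env in zip(sequence, environments):
        buried = (env == 'B')
        hydrophobic = aa in HYDROPHOBIC
        score += 1 if hydrophobic == buried else -1
    return score
```

Higher-scoring threadings place hydrophobic residues in buried positions and polar residues on the surface, which is the essence of fold recognition by environment matching.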

Threading is most profitably used for fold recognition, rather than for model building. For this purpose,
the UCLA-DOE Structure Prediction Server (http://www.doe-mbi.ucla.edu/people/frsur/frsur.html) is
by far the easiest tool to use. It allows you to submit a single sequence and try out multiple fold-
recognition and evaluation methods, including the DOE lab's own Gon prediction method as well as
EMBL's TOPITS method and NIH's 123D+ method. Other features, including BLAST and FASTA
searches, PROSITE searches, and Bowie profiles, which evaluate the fitness of a sequence for its
apparent structure, are also available.

Another threading server, the 3D-PSSM server offered by the Biomolecular Modelling Laboratory of
the Imperial Cancer Research Fund, provides a fold prediction based on a profile of the unknown
sequence family. 3D-PSSM incorporates multiple analysis steps into a simple interface. First the
unknown protein sequence is compared to a nonredundant protein sequence database using PSI-
BLAST, and a position-specific scoring matrix (PSSM) for the protein is generated. The query PSSM
is compared to all the sequences in the library database; the query sequence is also compared to 1D-
PSSMs (PSSMs based on standard multiple sequence alignments) and 3D-PSSMs (PSSMs based on
structural alignments) of all the protein families in the library database. Secondary structure predictions
based on these comparisons are shown, aligned to known secondary structures of possible structural
matches for the query protein. The results from the 3D-PSSM search are presented in an easy-to-
understand graphical format, but they can also be downloaded, along with carbon-alpha-only models of
the unknown sequence built using each possible structural match as a template.

Most threading methods are considered experimental, and new methods are always under development.
More than one method can be used to help identify any unknown sequence, and the results interpreted
as a consensus of multiple experts. The main thing to remember about any structural model you build
using a threading server is that it's likely to lack atomic detail, and it's also likely to be based on a
slightly or grossly incorrect alignment. The threading approach is designed to assess sequences as
likely candidates to fit into particular folds, not to build usable models. Putative structural alignments
generated using threading servers can serve as a basis for homology modeling, but they should be
carefully examined and edited prior to building an all-atom model.

10.6 Predicting 3D Structure
As was stated earlier in the chapter, predicting protein structure from sequence is a difficult task, and
no existing method solves it reliably in the general case. There are, however, a number of tools that can
predict 3D structure. They fall into two subgroups: homology modeling and ab-initio prediction.

10.6.1 Homology Modeling

Let's say you align a protein sequence (a "target" sequence) against the sequence of another protein
with a known structure. If the target sequence has a high level of similarity to the sequence with known
structure (over the full length of the sequence), you can use that known structure as a template for the
target protein with a reasonable degree of confidence.

There is a standard process that is used in homology modeling. While the programs that carry out each
step may be different, the process is consistent:

1. Use the unknown sequence as a query to search for known protein structures.
2. Produce the best possible global alignment of the unknown sequence and the template sequence(s).
3. Build a model of the protein backbone, taking the backbone of the template structure as a starting point.
4. In regions in which there are gaps in either the target or the template, use a loop-modeling
procedure to substitute segments of appropriate length.
5. Add sidechains to the model backbone.
6. Optimize positions of sidechains.
7. Optimize the structure with energy minimization or knowledge-based optimization.

The key to a successful homology-modeling project isn't usually the software or server used to produce
the 3D model. Your skill in designing a good alignment to a template structure is far more critical. A
combination of standard sequence alignment methods, profile methods, and structural alignment
techniques may be employed to produce this alignment, as we discuss in the example at the end of this
chapter. Once a good alignment exists, there are several programs that can use the information in that
alignment to produce a structural model.

Modeller (http://guitar.rockefeller.edu/modeller/modeller.html) is a program for homology modeling.
It's available free to academics as a standalone program or as part of MSI's Quanta package.

Modeller has no graphical interface of its own, but once you are comfortable in the command-line
environment, it's not all that difficult to use. Modeller executables can be downloaded from the web
site at Rockefeller University, and installation is straightforward; follow the directions in the README
file available with the distribution. There are several different executables available for each operating
system; you should choose based on the size of the modeling problems you will use them for. The
README file contains information on the array size limits of the various executables. There are limits
on total number of atoms, total number of residues, and total number of sequences in the input alignment.

As input to Modeller, you'll need two input files: an alignment file and a Modeller script. The
alignment file format is described in detail in the Modeller manpages; the Modeller script for a simple
alignment consists of just a few lines written in the TOP language (Modeller's internal language).
Modeller can calculate multiple models for any input. If the ENDING_MODEL value in the example
script shown is set to a number other than one, more models are generated. Usually, it's preferable to
generate more than one model, evaluate each model independently, and choose the best result as your
final model.

The example provided in the Modeller documentation shows the setup for an alignment with a
pregenerated alignment file between one known protein and one unknown sequence:

INCLUDE # Include the predefined TOP routines
SET ALNFILE = 'alignment.ali' # alignment filename

SET KNOWNS = '5fd1' # codes of the templates
SET SEQUENCE = '1fdx' # code of the target
SET ATOM_FILES_DIRECTORY = './:../atom_files' # directories for input atom files
SET STARTING_MODEL = 1 # index of the first model
SET ENDING_MODEL = 1 # index of the last model
# (determines how many models to calculate)
CALL ROUTINE = 'model' # do homology modeling

Modeller is run by giving the command mod scriptname. If you name your script fdx.top, the command
is mod fdx.

Modeller is multifunctional and has built-in commands that will help you prepare your input. Among
other things, there are commands that:

- Search for similar sequences in a database of fold class representative structures
- Align two or more structures
- Align two blocks of sequences
- Evaluate an alignment to be used for modeling
- Score sequences in an alignment based on pairwise similarity
- Superimpose one model on another or on the template structure
- Generate a report of restraint violations in a modeled structure

Each command needs to be submitted to Modeller via a script that calls that command, as shown in the
previous sample script. Dozens of other Modeller commands and full details of writing scripts are
described in the Modeller manual.

One caveat in automated homology modeling is that sidechain atoms may not be correctly located in
the resulting model; automatic model building methods focus on building a reasonable model of the
structural backbone of the protein because homology provides that information with reasonable
confidence. Homology doesn't provide information about sidechain orientation, so the main task of the
automatic model builder is to avoid steric conflicts and improbable conformations rather than optimize
sidechain orientations. Incorrect sidechain positions may be misleading if the goal of the model
building is to explore functional mechanisms.

How Modeller builds a model

Though Modeller incorporates tools for sequence alignment and even database searching, the starting
point for Modeller is a multiple sequence alignment between the target sequence and the template
protein sequence(s). Modeller uses the template structures to generate a set of spatial restraints, which
are applied to the target sequence. The restraints limit, for example, the distance between two residues

