

creating to exist simultaneously, or you wish to download a whole archive file and unpack it just to
retrieve a few files, you can transfer your archive over the network or even just to another partition
using a combination of ftp and tar commands. Sending an archive this way and then extracting it at the
destination can be less time-consuming than a cp -r if a large number of files are involved. The ftp
program recognizes a form in which a command replaces the input filenames. The command is
executed in a subshell on the local machine and operates on files on the local filesystem. The construct is:

ftp command "| command" filename

Inside the ftp program, here's how to send the output of the tar command, enclosed in quotes, into the
filename specified as the target on the remote machine:

put "|tar cvBf - *" filename

Here's how to direct the downloaded archive through the tar command, resulting in extraction of only
the files in the specified directory within the archive:

get filename.tar "|tar xvf - dirname"
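
If you script your transfers, the same idea (pulling out only one directory while the archive streams past, without first storing the whole file) can be sketched in Python with the standard tarfile module. This is a minimal illustration and not part of the ftp program; the function name members_under and the contents of the demo archive are hypothetical.

```python
import io
import tarfile


def members_under(archive_stream, dirname):
    """Yield (name, data) for regular files found under dirname.

    The archive is read as a forward-only stream (mode "r|*"), much as
    tar reads from a pipe when "-" is given as the archive name.
    """
    prefix = dirname.rstrip("/") + "/"
    with tarfile.open(fileobj=archive_stream, mode="r|*") as tar:
        for member in tar:
            if member.isfile() and member.name.startswith(prefix):
                yield member.name, tar.extractfile(member).read()


def build_demo_archive():
    """Build a small in-memory tar archive with made-up contents."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in [("otter/a.txt", b"alpha"), ("stoat/b.txt", b"beta")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


extracted = dict(members_under(build_demo_archive(), "otter"))
print(sorted(extracted))
```

Each member is read as soon as it is encountered because a stream cannot be rewound, which is the same constraint tar faces when its input is a pipe.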

Finally, here's how to list the contents of the remote archive:

get filename.tar "|tar tf -"

compress

Usage: compress -[options] filenames

Ultimately, you don't want to be left with large (if more manageable) tar files cluttering up your
filesystem. In this situation, data-compression utilities are important, since they allow you to cheat and
reduce the amount of space that files take up on your hard disk. compress is the standard Unix
file-compression command. It's the opposite of uncompress, the command used in Chapter 3 to open
compressed papers and software. compress adds a .Z to the end of the filename.

Here are the most useful options for compress :


-f
Forces compression even if there is already a compressed version of the file; the main effect is
to overwrite an existing compressed file without asking

-v
(verbose mode) Prints the percentage compression achieved for each file

-r
(recursive mode) If compress is applied to a directory that contains subdirectories, compresses
their contents as well as those of the original directory

If you have a text file named stoat.txt and the tar file of the otter/ directory from the last section, and
you want to compress both and look at the resulting compression ratio achieved, type:

% compress -v stoat.txt otter.tar

This command produces two files, stoat.txt.Z and otter.tar.Z. The files can be uncompressed using the
uncompress command or gzip -d (described next). In case you were wondering, natural languages (the
kind humans use) end up with a compression ratio around 60%, and programming languages get
around 40%. Try compressing the sequences of some of your favorite proteins to see what sort of ratio
you get: the values can be wildly variable, depending on whether there are repeats in the sequence.

gzip

Usage: gzip -[options] filenames

As usual, in addition to the standard Unix compress, there's a faster and more efficient GNU utility:
gzip. gzip behaves in much the same way as compress, except that it gets better compression on
average, since it uses a superior algorithm. gzip adds the suffix .gz to a file that it compresses. It
emulates the compress options described earlier and adds a few of its own:


-N
(default setting) Preserves the original name and timestamp from the file being compressed

-q
(quiet mode) Suppresses warnings when running

-d
Returns a file that has been compressed by gzip to its uncompressed state; gzip can also
recognize and uncompress files produced by compress
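
If you'd like to experiment with how strongly repeats affect compression without creating files, Python's standard gzip module applies the same kind of compression in memory. The following is a rough sketch: the two sequences are invented for illustration (one built from a repeated motif, one drawn at random), and the percentage is computed here simply as the fraction of bytes saved, which may not match exactly how compress -v reports its figure.

```python
import gzip
import random


def percent_saved(text):
    """Return the percentage by which gzip shrinks the given text."""
    raw = text.encode("ascii")
    packed = gzip.compress(raw)
    return 100.0 * (1.0 - len(packed) / len(raw))


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical test sequences: one made of a short repeated motif,
# one drawn at random (seeded so that the run is repeatable).
repetitive = "GPPGAPGPP" * 250
rng = random.Random(0)
shuffled = "".join(rng.choice(AMINO_ACIDS) for _ in range(len(repetitive)))

for label, sequence in [("repetitive", repetitive), ("random", shuffled)]:
    print(f"{label}: {percent_saved(sequence):.1f}% smaller")
```

The repetitive sequence shrinks far more than the random one, which is why real protein sequences, with their varying amounts of internal repetition, give such variable results.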

Chapter 6. Biological Research on the Web
The Internet has completely changed the way scientists search for and exchange information. Data that
once had to be communicated on paper is now digitized and distributed from centralized databases.
Journals are now published online. And nearly every research group has a web page offering
everything from reprints to software downloads to data to automated data-processing services.

A simple web search for the word bioinformatics yields tens of thousands of results. The information
you want may be number 345 in the list or it may not be found at all. Where can you go to find only the
useful software and data, and scientific articles? You won't always get there by a simple web search.
How can you judge which information is useful? Publication on the Web gives information an
appearance of authority it may not merit. How can you judge if software will give the type of results
you need and perform its function correctly?

In this chapter we examine the art of finding information on the Web. We cover search engines and
searching, where to find scientific articles and software, and how to use the classic online information
sources such as PubMed. And once you've located your information, we help you figure out how to use
it. Among the largest sources of information for biologists are the public biological databases. We
discuss the history of the public databases, data annotation, the various forms the data can take, and
how to get data in and out. Finally, we give you some pointers on how to judge the quality of the
information you find out there.

The Internet is a tremendously useful information source for biological research. In addition to
allowing researchers to exchange software and data easily, it can be a source of the kind of practical
advice about computer software and hardware, experimental methods and protocols, and laboratory
equipment that you once could get only by buying a beer for a seasoned lab worker or computer
hacker. Use the Internet, but use it wisely.

6.1 Using Search Engines
AltaVista, Lycos, Google, HotBot, Northern Light, Dogpile, and dozens of other search engines exist to
help you find your way around the billion or more pages that make up the Web. As a scientist,
however, you're not looking for common web commodities such as places to order books on the Web
or online news or porn sites. You're looking for perhaps a couple of needles in a large haystack.

Knowing how to structure a query to weed out the majority of the junk that will come up in a search is
very useful, both in web searching and in keyword-based database searching. Understanding how to
formulate boolean queries that limit your search space is a critical research skill.

6.1.1 Boolean Searching

Most web surfers approach searching haphazardly at best. Enter a few keywords into the little box, and
look at whatever results come up. But each search engine makes different default assumptions, so if
you enter protein structure into Excite's query field, you are asking for an entirely different search than
if you enter protein structure into Google's query field. In order to search effectively, you need to use
boolean logic, which is an extremely simple way of stating how a group of things should be divided or
combined into sets.

Search engines all use some form of boolean logic, as do the query forms for most of the public
biological databases. Boolean queries restrict the results that are returned from a database by joining a
series of search terms with the operators AND, OR, and NOT. The meaning of these operators is
straightforward: joining two keywords with AND finds only documents that contain both keyword1 and
keyword2; using OR finds documents that contain either keyword1 or keyword2 (or both); and using
NOT finds documents that contain keyword1 but not keyword2.
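
Because boolean operators are really statements about sets, they map directly onto set operations. The sketch below uses a made-up miniature index (the document numbers are hypothetical) to show what each operator returns.

```python
# Hypothetical miniature index: keyword -> set of document ids.
index = {
    "protein": {1, 2, 3, 5},
    "structure": {2, 3, 4},
}

keyword1 = index["protein"]
keyword2 = index["structure"]

both = keyword1 & keyword2        # AND: documents containing both keywords
either = keyword1 | keyword2      # OR: documents containing at least one
only_first = keyword1 - keyword2  # NOT: keyword1 but not keyword2

print(sorted(both), sorted(either), sorted(only_first))
```

AND can only shrink a result set and OR can only grow it, which is worth remembering when a query returns too many or too few hits.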

However, search engines differ in how they interpret a space or an implied operator. Some search
engines consider a space an OR, so when you type protein structure, you're really asking for protein or
structure. If you search for protein structure on Excite, which defaults to OR, you come up with a lot of
advertisements for fad diets and protein supplements before you ever get to the scientific sites you're
interested in. On the other hand, Google defaults to AND, so you'll find only references that contain
protein and structure, which is probably what you intended to look for in the first place. Find out how
the search engine you're using works before you formulate your query.
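
To see how much the implied operator matters, consider this toy search function. It is only a sketch: the three documents are invented, and the default_operator argument stands in for whatever a particular engine assumes when it sees a space.

```python
# Hypothetical documents, each reduced to the set of words it contains.
documents = {
    "fad-diet ad": {"protein", "diet", "supplement"},
    "bridge engineering": {"structure", "steel"},
    "crystallography paper": {"protein", "structure", "crystal"},
}


def search(query, default_operator):
    """Return the titles that match a space-separated query.

    default_operator is "AND" or "OR" and decides how the space
    between keywords is interpreted.
    """
    terms = query.lower().split()
    combine = all if default_operator == "AND" else any
    return sorted(
        title
        for title, words in documents.items()
        if combine(term in words for term in terms)
    )


print(search("protein structure", "OR"))
print(search("protein structure", "AND"))
```

The same two words return every document under OR but only the relevant one under AND.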

Boolean queries are read from left to right, just like text. Parentheses can structure more complex
boolean queries. For instance, if you look for documents that contain keyword1 and one of either
keyword2 or keyword3, but not keyword4, your query would look like this: (keyword1 AND (keyword2
OR keyword3)) NOT keyword4.

Many search engines allow you to use quotation marks to specify a phrase. If you want to find only
documents in which the words protein structure appear together in sequence, searching for "protein
structure" is one way to narrow your results.

Let's say you want to search a literature database for references about computing electrostatic potentials
for protein molecules, and you only want to look for references by two authors, Barry Honig and
Andrew McCammon. You might structure a boolean query statement as follows:

((protein AND "electrostatic potential") AND (Honig OR McCammon))

This statement tells the search engine you want references that contain both the word protein and the
phrase electrostatic potential, and that you require either one or the other of the names Honig and
McCammon.
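
It can help to read a parenthesized query as a small program. The sketch below evaluates the statement above against a few invented records (the titles and the author Smith are hypothetical, and the phrase test is a simple substring match rather than anything a real literature database does).

```python
def matches(record):
    """Evaluate ((protein AND "electrostatic potential") AND (Honig OR McCammon))."""
    text = record["text"].lower()
    words = set(text.replace(",", " ").replace(".", " ").split())
    authors = {name.lower() for name in record["authors"]}

    # Inner left group: a single word AND an exact phrase.
    has_topic = "protein" in words and "electrostatic potential" in text
    # Inner right group: either author name is enough.
    has_author = "honig" in authors or "mccammon" in authors
    return has_topic and has_author


# Hypothetical records, invented for illustration only.
records = [
    {"authors": ["Honig"], "text": "Electrostatic potential of a protein surface."},
    {"authors": ["McCammon"], "text": "Protein dynamics in solution."},
    {"authors": ["Smith"], "text": "Protein electrostatic potential maps."},
]

print([matches(record) for record in records])
```

Only the first record passes: the second lacks the phrase, and the third lacks either author.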

There are many excellent web tutorials available on boolean searching. Try a search with the phrase
boolean searching in Google, and see what comes up.

6.1.2 Search Engine Algorithms

While the purpose of this book isn't to describe exhaustively how search engines work, there are
significant differences in how search engines build their databases and rank sites. These differences
make some search engines far more useful than others for searching science and technology web sites.

Key features to look at in a web search engine's database building and indexing strategies are free URL
submission, full-text indexing, automated, comprehensive web crawling, a fast "refresh rate," and a
sensible ranking strategy for results.

Our current favorite search engine is Google. Google is extremely comprehensive, indexing over 1
billion URLs. Pages are ranked based on how many times they are linked from other pages. Links from
well-connected pages are considered more significant than links from isolated pages. The claim is that
a Google search will bring you to the most well-traveled pages that match your search topic, and we've
found that it works rather well. Google caches copies of web pages, so pages can be accessible even if
the server is offline. It returns only pages that contain all the relevant query terms. Google uses a
shorthand version of the standard boolean search formula, and it allows such specialized services as
locating all the pages that link back to a page of interest.
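
The idea that links from well-connected pages count for more can be illustrated with a short iterative calculation. This is a simplified, textbook-style sketch in the spirit of the description above, not Google's actual implementation; the page names, the damping value of 0.85, and the number of iterations are all assumptions chosen for illustration.

```python
def link_rank(links, damping=0.85, iterations=100):
    """Score pages so that links from highly scored pages count for more."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank


# Hypothetical web of four pages: page -> pages it links to.
links = {
    "hub": ["db", "lab"],
    "lab": ["db"],
    "blog": ["db", "hub"],
    "db": ["hub"],
}

rank = link_rank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(f"{page}: {rank[page]:.3f}")
```

In this toy web, the page that nothing links to ends up at the bottom, while the page that every other page links to comes out on top.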

For the neophyte user, however, HotBot is probably the best search engine. HotBot is relatively
comprehensive and regularly updated, and it offers form-based query tools that eliminate the need for
you to formulate even simple query statements.

6.2 Finding Scientific Articles
Scientists have traditionally been able to trust the quality of papers in print journals because these
journals are refereed. An editor sends each paper to a group of experts who are qualified to judge the
quality of the research described. These reviewers comment on the manuscript, often requiring
additions, corrections, and even further experiments before the paper is accepted for publication. Print
journals in the sciences increasingly publish their content in an electronic format in
addition to hardcopy. Almost every major journal has a web site, most of which are accessible only to
subscribers, although access to abstracts usually is free. Scientific articles in these web journals go
through the same process of review as their print counterparts.

Another trend is e-journals, which have no print counterpart. These journals are usually refereed, and it
shouldn't be too hard to find out by whom. For instance, the Journal of Molecular Modeling, an
electronic journal published by Springer-Verlag, has links to information about the journal's editorial
policy prominently displayed on its home page.

An excellent resource for searching the scientific literature in the biological sciences is the free server
sponsored by the National Center for Biotechnology Information (NCBI) at the National Library of
Medicine. This server makes it possible for anyone with a web browser to search the Medline database.
There are other literature databases of comparable quality available, but most of these are not free.
Your institution may offer access to such sources as Lexis-Nexis or Cambridge Scientific Abstracts.

Outside of refereed resources, however, anyone can publish information on the Web. Often research
groups make papers available as technical reports on their web sites. These technical reports may never
be peer reviewed or published outside the research group's home organization, and your only clue to
their quality is the reputation and expertise of the authors. This isn't to say that you shouldn't trust or
seek out these sources. Many government organizations and academic research groups have reference
material of near-textbook quality on their web sites. For example, the University of Washington
Genome Center has an excellent tutorial on genome sequencing, and NCBI has a good practical tutorial
on use of the BLAST sequence alignment program and its variants.

6.2.1 Using PubMed Effectively

PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) is one of the most valuable web resources
available to biologists. Over 4,000 journals are indexed in PubMed, including most of the well-
regarded journals in cell and molecular biology, biochemistry, genetics, and related fields, as well as
many clinical publications of interest to medical professionals.

PubMed uses a keyword-based search strategy and allows the boolean operators AND, OR, and NOT
in query statements. Users can specify which database fields to check for each search term by
following the search term with a field name enclosed in square brackets.

Additionally, users can search PubMed using Medical Subject Heading (MeSH) terms. MeSH is a
library of standardized terms that may help locate manuscripts that use alternate terms to refer to the
same concept. The MeSH browser (http://www.nlm.nih.gov/mesh/meshhome.html) allows users to
enter a word or word fragment and find related keywords in the MeSH library. PubMed automatically
finds MeSH terms related to query terms and uses them to enhance queries.

For example, we searched for "protein electrostatics" in PubMed. The terms protein and electrostatics
are automatically joined with an AND unless otherwise specified. The resulting boolean query
statement submitted to PubMed is actually:

((("proteins"[MeSH Terms] OR protein[Text Word]) AND ("electrostatics"[MeSH Terms]
OR electrostatics[Text Word])) AND notpubref[sb])

The results of the search are shown in Figure 6-1.

Figure 6-1. Results from a PubMed search
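
If you build queries of this kind often, the expanded statement shown above can be assembled programmatically. The helper names below (tagged, any_of, all_of) are hypothetical; the field names and the notpubref[sb] clause are taken directly from the expanded query.

```python
def tagged(term, field, quote=False):
    """Format a search term followed by a field name in square brackets."""
    text = f'"{term}"' if quote else term
    return f"{text}[{field}]"


def any_of(*clauses):
    """Join clauses with OR inside one pair of parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND inside one pair of parentheses."""
    return "(" + " AND ".join(clauses) + ")"


protein = any_of(
    tagged("proteins", "MeSH Terms", quote=True),
    tagged("protein", "Text Word"),
)
electrostatics = any_of(
    tagged("electrostatics", "MeSH Terms", quote=True),
    tagged("electrostatics", "Text Word"),
)
query = all_of(all_of(protein, electrostatics), "notpubref[sb]")
print(query)
```

Building the statement from small pieces keeps the parentheses balanced, which is the most common source of errors in hand-typed boolean queries.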

As you can see in Figure 6-2, PubMed also allows you to use a web interface to narrow your search.
The Limits link immediately below the query box on the main PubMed page takes you to this web form.

Figure 6-2. Narrowing a search strategy using the Limits menu in PubMed

The Limits form allows you to add specificity to your query. You can limit your search to particular
fields in the PubMed database record, such as the Author Name or Substance Name field. Searches can
also be limited by language, content (e.g., searching for review articles or clinical trials only), and date.
For clinical research publications, the search can be limited based on the species, age, and gender of the
research subjects.

The Preview/Index menu allows you to build a detailed query interactively. You can select a specific
data field (for instance, the Author Name field) and then enter a term you want to search for within the
specified field only. Clicking the AND, OR, or NOT buttons joins the new term to your previous query
terms using the specified boolean operator.

For instance, you might start with a general search for "protein AND electrostatics," then go to the
Preview/Index page (Figure 6-3) and specify that you want to search for "Gilson OR McCammon" in
the Author Name field only.

Figure 6-3. Building a PubMed query using the Preview/Index form

You can also use the options in the History form to access results from earlier searches, and to narrow a
search by adding new terms to the query.

If you want to collect results from multiple queries and save them into one big file, the Clipboard will
allow you to do that. To save individual results to the Clipboard, simply click the checkbox next to the
result you want to save, then click the Add to Clipboard button in the menu at the top of your results
page. [1]

[1] You'll notice that all the checkbox-clicking to select and save individual results can get time-consuming if you're working with a lot of pages of results. It would be
easier if you could come up with a search strategy that was absolutely certain to bring up only the results you want. There's no solution for this within the NCBI tools,
and writing your own scripts to process batches of results may not help you either. The limitation is in the ability of computer programs to parse human language.

If you find a search strategy that works for you in PubMed, you can save that strategy in the form of a
URL, and repeat the same search at any time in the future by visiting that URL. To save a PubMed
URL, click the Details link on your results page, then click the URL link on the Details page. The URL
of your search will appear in the Location field at the top of the web browser, so that you can
bookmark it.

The "bookmarkable" URL for a PubMed search should look something like this:


Spending a few hours developing some detailed PubMed search strategies that work for you, and
saving them, can save you a lot of work in the future.
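
A saved search works because the whole query travels inside the URL as an encoded string. The sketch below shows that general mechanism only; it does not reproduce the URL PubMed itself generates. The parameter names db and term are assumptions made for illustration (db appears in the links in Table 6-1; term is hypothetical), so always bookmark the URL that PubMed actually gives you.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"


def search_url(query, database="PubMed"):
    """Encode a query statement into a URL query string.

    "db" and "term" are assumed parameter names, used for illustration only.
    """
    return BASE_URL + "?" + urlencode({"db": database, "term": query})


statement = 'protein AND "electrostatic potential"'
url = search_url(statement)
print(url)

# Decoding the URL recovers the original statement unchanged.
recovered = parse_qs(urlsplit(url).query)["term"][0]
print(recovered)
```

Spaces, quotation marks, and brackets are escaped in the URL and restored when it is decoded, so the search that runs from a bookmark is the same one you originally typed.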

6.3 The Public Biological Databases
The nomenclature problem in biology at the molecular level is immense. Genes are commonly known
by unsystematic names. These may come from developmental biology studies in model systems, so that
some genes have names like flightless, shaker, and antennapedia due to the developmental effects they
cause in a particular animal. Other names are chosen by cellular biologists and represent the function of
genes at a cellular level, like homeobox. Still other names are chosen by biochemists and structural
biologists and refer to a protein that was probably isolated and studied before the gene was ever found.
Though proteins are direct products of genes, they are not always referred to by the same names or
codes as the genes that encode them. This kind of confusing nomenclature generally means that only a
scientist who works with a particular gene, gene product, or the biochemical process that it's a part of
can immediately recognize what the common name of the gene refers to.

The biochemistry of a single organism is a more complex set of information than the taxonomy of
living species was at the time of Linnaeus, so it isn't to be expected that a clear and comprehensive
system of nomenclature will be arrived at easily. There are many things to be known about a given
gene: its source organism, its chromosomal location, and the location of the activator sequences and
identities of the regulatory proteins that turn it on and off. Genes also can be categorized by when
during the organism's development they are turned on, and in which tissues expression occurs. They
can be categorized by the function of their product, whether it's a structural protein, an enzyme, or a
functional RNA. They can be categorized by the identity of the metabolic pathway that their product is
part of, and by the substrate it modifies or the product it produces. They can be categorized by the
structural architecture of their protein products. Clearly this is a wealth of information to be condensed
into a reasonable nomenclature. Figure 6-4 shows a portion of the information that may be associated
with a single gene.

Figure 6-4. Some of the information associated with a single gene

The problem for maintainers of biological databases becomes mainly one of annotation; that is, putting
sufficient information into the database that there is no question of what the gene is, even if it does
have a cryptic common name, and creating the proper links between that information and the gene
sequence and serial number. Correct annotation of genomic data is an active research area in itself, as
researchers attempt to find ways to transfer information across genomes without propagating error.

Storage of macromolecular data in electronic databases has given rise to a way of working around the
problem of nomenclature. The solution has been to give each new entry into the database a serial
number and then to store it in a relational database that knows the proper linkages between that serial
number, any number of names for the gene or gene product it represents, and all manner of other
information about the gene. This strategy is the one currently in use in the major biological databases.
The questions databases resolve are essentially the same questions that arise in developing a
nomenclature. However, by using relational databases and complex querying strategies, they (perhaps
somewhat unfortunately) avoid the issue of finding a concise way for scientists to communicate the
identities of genes on a nondigital level.
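
The serial-number strategy is easy to picture with a tiny relational example. This sketch uses Python's built-in sqlite3 module; the table layout, the accession number X00001, the sequence, and the alias are all invented for illustration and do not reflect the schema of any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entry (accession TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE name (
        accession TEXT REFERENCES entry(accession),
        label TEXT
    );
    """
)

# One entry, identified by a made-up serial number, with two names.
conn.execute("INSERT INTO entry VALUES (?, ?)", ("X00001", "ATGGCC"))
conn.executemany(
    "INSERT INTO name VALUES (?, ?)",
    [("X00001", "shaker"), ("X00001", "hypothetical alias")],
)


def find_by_name(label):
    """Return (accession, sequence) rows for any name attached to an entry."""
    return conn.execute(
        "SELECT e.accession, e.sequence FROM entry e "
        "JOIN name n ON n.accession = e.accession WHERE n.label = ?",
        (label,),
    ).fetchall()


print(find_by_name("shaker"))
print(find_by_name("hypothetical alias"))
```

Either name leads to the same serial number and the same sequence, so the database never has to decide which name is the correct one.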

6.3.1 Data Annotation and Data Formats

The representation and distribution of biological data is still an open problem in bioinformatics. The
nucleotide sequences of DNA and RNA and the amino acid sequences of proteins reduce neatly to
character strings in which a single letter represents a single nucleotide or amino acid. The remaining
challenges in representing sequence data are verification of the correctness of the data, thorough
annotation of data, and handling of data that comes in ever-larger chunks, such as the sequences of
chromosomes and whole genomes.
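
Because a sequence is just a character string, the most basic form of verification is checking that every character belongs to the expected alphabet. The following is a minimal sketch; real databases also accept ambiguity codes and other symbols that this example deliberately ignores, and the function name is hypothetical.

```python
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_positions(sequence, alphabet):
    """Return (position, character) pairs that fall outside the alphabet."""
    return [
        (position, character)
        for position, character in enumerate(sequence.upper())
        if character not in alphabet
    ]


print(invalid_positions("ACGTXA", DNA_ALPHABET))
print(invalid_positions("acgtacgt", DNA_ALPHABET))
```

A check like this catches stray characters introduced by file conversion, but it says nothing about whether the sequence itself is biologically correct.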

The standard reduced representation of the 3D structure of a biomolecule consists of the Cartesian
coordinates of the atoms in the molecule. This aspect of representing the molecule is straightforward.
On the other hand, there are a host of complex issues for structure databases that are not completely
resolved. Annotation is still an issue for structural data, although the biology community has attempted
to form a consensus as to what annotation of a structure is currently required.
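
Working with Cartesian coordinates really is straightforward, as a few lines of code show. In this sketch the atom names and coordinates are invented for illustration (they are not taken from any real structure), and math.dist requires Python 3.8 or later.

```python
import math

# Hypothetical atoms: (atom name, (x, y, z)) with coordinates in angstroms.
atoms = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (3.0, 4.0, 0.0)),
    ("C", (3.0, 4.0, 12.0)),
]


def distance(a, b):
    """Euclidean distance between two points in space."""
    return math.dist(a, b)


def centroid(points):
    """Average position of a collection of points."""
    count = len(points)
    return tuple(sum(point[axis] for point in points) / count for axis in range(3))


coordinates = [position for _, position in atoms]
print(distance(coordinates[0], coordinates[1]))
print(centroid(coordinates))
```

Distances, angles, and centers of mass all follow from the coordinates alone; it is the annotation around them that is hard to standardize.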

In the last 15 years, different researchers have developed their own styles and formats for reporting
biological data. Biological sequence and structure databases have developed in parallel in the United
States and in Europe. The use of proprietary software for data analysis has contributed a number of
proprietary data formats to the mix. While there are many specialized databases, we focus here on the
fields in which an effort is being made to maintain a comprehensive database of an entire class of data.

6.3.2 3D Molecular Structure Data

Though DNA sequence, protein sequence, and protein structure are in some sense just different ways of
representing the same gene product, these datatypes currently are maintained as separate database
projects and in unconnected data formats. This is mainly because sequence and structure determination
methods have separate histories of development.

The first public molecular biology database, established nearly 10 years before the public DNA
sequence databases, was the Protein Data Bank (PDB), the central repository for x-ray crystal
structures of protein molecules.

While the first complete protein structure was published in the 1950s, there were not a significant
number of protein structures available until the late 1970s. Computers had not developed to the point
where graphical representation of protein structure coordinate data was possible, at least at useful
speeds. However, in 1971, the PDB was established at the Brookhaven National Laboratory, to store
protein structure data in a computer-based archive. A data format developed, which owed much of its
style to the requirements of early computer technology. Throughout the 1970s and 1980s, the PDB
grew. From 15 sets of coordinates in 1973, it grew to 69 entries in 1976. The number of coordinate sets
deposited each year remained under 100 until 1988, at which time there were still fewer than 400 PDB entries.

Between 1988 and 1992, the PDB hit the turning point in its exponential growth curve. By January
1994, there were 2,143 entries in the PDB; at the time of this writing, the PDB has nearly reached the
14,000-entry mark. Management of the PDB has been transferred to a consortium of university and
public-agency researchers, called the Research Collaboratory for Structural Bioinformatics, and a new
format for recording of crystallographic data, the Macromolecular Crystallographic Information File
(mmCIF), is being phased in to replace the antiquated PDB format. Journals that publish
crystallographic results now require submission to the PDB as a condition of publication, which means
that nearly all protein structure data obtained by academic researchers becomes available in the PDB in
a fairly timely fashion.

A common issue for data-driven studies of protein structure is the redundancy and lack of
comprehensiveness of the PDB. There are many proteins for which numerous crystal structures have
been submitted to the database. Selecting subsets of the PDB data with which to work is therefore an
important step in any statistical study of protein structure. As of December 1998, only about 2,800 of
the protein chains in the PDB were sufficiently different from each other (having less than 95% of their
sequence in common) to be considered unique. Many statistical studies of protein structure are based
on sets of protein chains that have no more than 25% of their sequence in common; if this criterion is
used, there are still only around 1,000 unique protein folds represented in the PDB. As the amount of
biological sequence data available has grown, the PDB now lags far behind the gene-sequence databases.
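
Selecting a nonredundant subset by a sequence-identity cutoff can be sketched as a simple greedy filter. This is an illustration under strong assumptions: the chains are invented, they are assumed to be already aligned and of equal length, and identity is counted position by position. Real studies compute identity from proper sequence alignments.

```python
def percent_identity(a, b):
    """Percentage of matching positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matching = sum(x == y for x, y in zip(a, b))
    return 100.0 * matching / len(a)


def nonredundant(chains, cutoff):
    """Keep a chain only if it is below the cutoff against every chain kept so far."""
    kept = []
    for name, sequence in chains:
        if all(percent_identity(sequence, other) < cutoff for _, other in kept):
            kept.append((name, sequence))
    return [name for name, _ in kept]


# Hypothetical chains: B differs from A at a single position.
chains = [
    ("A", "MKTAYIAKQR"),
    ("B", "MKTAYIAKQK"),
    ("C", "GSHMLEDPVV"),
]

print(nonredundant(chains, 95))
print(nonredundant(chains, 25))
```

With a 95% cutoff all three chains survive, but at 25% the near-duplicate is dropped, which mirrors how sharply the number of "unique" structures falls as the criterion is tightened.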

6.3.3 DNA, RNA, and Protein Sequence Data

Sequence databases generally specialize in one type of sequence data: DNA, RNA, or protein. There
are major sequence data collections and deposition sites in Europe, Japan, and the United States, and
there are independent groups that mirror all the data collected in the major public databases, often
offering some software that adds value to the data.

In 1970, Ray Wu sequenced the first segment of DNA: twelve bases that occurred as a single strand at
the end of a circular DNA that was opened using an enzyme. However, DNA sequencing proved much
more difficult than protein sequencing, because there is no chemical process that selectively cleaves the
first nucleotide from a nucleic acid chain. When Robert Holley reported the sequencing of a
76-nucleotide RNA molecule from yeast, it was after seven years of labor. After Holley's sequence was
published, other groups refined the protocols for sequencing, even successfully sequencing a
3,200-base bacteriophage genome. Real progress with DNA sequencing came after 1975, with the chemical
cleavage method designed by Allan Maxam and Walter Gilbert, and with Frederick Sanger's
chain-terminator procedure.

The first DNA sequence database, established in 1979, was the Gene Sequence Database (GSDB) at
Los Alamos National Lab. While GSDB has since been supplanted by the worldwide collaboration that
is the modern GenBank, up-to-date gene sequence information is still available from GSDB through
the National Center for Genome Resources.

The European Molecular Biology Laboratory, the DNA Database of Japan, and the National Institutes
of Health cooperate to make all publicly available sequence data available through GenBank. NCBI has
developed a standard structured data format for sequence data, specified in Abstract Syntax Notation One (ASN.1). While
this format promises to make locating the right sequences of the right kind in GenBank easier, there are
still a number of services providing access to nonredundant versions of the database.

The DNA sequence database grew slowly through its first decade. In 1992, GenBank contained only
78,000 DNA sequences, a little over 100 million base pairs of DNA. In 1995, the Human Genome
Project, and advances in sequencing technology, kicked GenBank's growth into high gear. GenBank
currently doubles in size every 6 to 8 months, and its rate of increase is constantly growing.
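
A fixed doubling time implies exponential growth, and the consequences are easy to underestimate. The sketch below projects a database size forward under a constant doubling time; the starting size of 100 million base pairs is only an illustrative round number, and a constant doubling time is itself an assumption.

```python
def projected_size(current_size, months_ahead, doubling_months):
    """Size after months_ahead months, assuming a constant doubling time."""
    return current_size * 2 ** (months_ahead / doubling_months)


start = 100e6  # illustrative starting size in base pairs
for doubling_months in (6, 8):
    size = projected_size(start, 24, doubling_months)
    print(f"doubling every {doubling_months} months: "
          f"{size / start:.0f} times larger after two years")
```

The difference between doubling every 6 months and every 8 months is a factor of two after only two years, so small changes in the growth rate matter a great deal for anyone planning storage or search software.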

6.3.4 Genomic Data

In addition to the Human Genome Project, there are now separate genome project databases for a large
number of model organisms. The sequence content of the genome project databases is represented in
GenBank, but the genome project sites also provide everything from genome maps to supplementary
resources for researchers working on that organism. As of October 2000, NCBI's Entrez Genome
database contained the partial or complete genomes of over 900 species. Many of these are viruses. The
remainder include bacteria; archaea; yeast; commonly studied plant model systems such as A. thaliana,
rice, and maize; animal model systems such as C. elegans, fruit flies, mice, rats, and puffer fish; as well
as organelle genomes. NCBI's web-based software tools for accessing these databases are constantly
evolving and becoming more sophisticated.

6.3.5 Biochemical Pathway Data

The most important biological activities don't happen by the action of single molecules, but as the
orchestrated activities of multiple molecules. Since the early 20th century, biochemists have studied
these functional ensembles of enzymes and their substrates. A few research groups have begun work on
intelligently organizing and storing these pathways in databases. Two examples of pathway databases
are WIT and KEGG. WIT, short for "What Is There?", was developed at Argonne National Labs. It's a
database containing reconstructed metabolic pathways for organisms whose genomes have been
entirely sequenced. The Kyoto Encyclopedia of Genes and Genomes (KEGG) stores similar data but
links in information from sequence, structure, and genetic linkage databases. Both databases are
queryable through web interfaces and are curated by a combination of automation and human expertise.

In addition to these whole genome "parts catalogs," other, more specialized databases that focus on
specific pathways (such as intercellular signaling or degradation of chemical compounds by microbes)
have been developed.

6.3.6 Gene Expression Data

DNA microarrays (or gene chips) are miniaturized laboratories for the study of gene expression. Each
chip contains a deliberately designed array of probe molecules that can bind specific pieces of DNA or
mRNA. Labeling the DNA or RNA with fluorescent molecules allows the level of expression of any
gene in a cellular preparation to be measured quantitatively. Microarrays also have other applications in
molecular biology, but their use in studying gene expression has opened up a new way of measuring
genome functions.

Since the development of DNA microarray technology in the late 1990s, it has become apparent that
the increase in available gene expression data will eventually parallel the growth of the sequence and
structure databases, and that this is another datatype for which public access to raw data will be
desirable. Raw microarray data has just begun to be made available to the public in selective databases,
and talk of establishing a central data repository for such data is underway. However, formats for
delivering this kind of data are still not standardized; often, it's made available in large spreadsheets or
tab-delimited text. Two of the most comprehensive resources for microarray data are the National
Human Genome Research Institute's Microarray Project site and the Stanford Genome Resources site.
Since many of the early microarray expression experiments were performed at Stanford, their genome
resources site has links to both raw data and, in some cases, databases that can be queried using gene
names or functional descriptions. Recently, the European Bioinformatics Institute has been
instrumental in developing a set of standards for deposition of microarray data in databases. Several
databases also exist for the deposition of 2D gel electrophoresis results, including SWISS-2DPAGE
and HSC-2DPAGE. 2D-PAGE is a technology that allows quantitative study of protein concentrations
in the cell, for many proteins simultaneously. The combination of these two techniques is a powerful
tool for understanding how genomes work.

Table 6-1 summarizes sources on the Web for some of the most important databases we've discussed in
this section.

Table 6-1. Major Biological Data and Information Sources
Subject                           Source                Link
Biomedical literature             PubMed                http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Nucleic acid sequence             GenBank               http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
                                  SRS at EMBL/EBI       http://srs.ebi.ac.uk
Genome sequence                   Entrez Genome         http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
                                  TIGR databases        http://www.tigr.org/tdb/
Protein sequence                  GenBank               http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
                                  PIR                   http://www-nbrf.georgetown.edu
Protein structure                 Protein Data Bank     http://www.rcsb.org/pdb/
                                  Entrez Structure DB
Protein and peptide mass          PROWL                 http://prowl.rockefeller.edu
Post-translational modifications  RESID                 http://www-nbrf.georgetown.edu/pirwww/search/textresid.html
Biochemical and biophysical       ENZYME                http://www.expasy.ch/enzyme/
                                  BIND                  http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Structure
Biochemical pathways              PathDB                http://www.ncgr.org/software/pathdb/
                                  KEGG                  http://www.genome.ad.jp/kegg/
                                  WIT                   http://wit.mcs.anl.gov/WIT2/
Gene Expression                   Microarray            http://industry.ebi.ac.uk/~alan/MicroArray/
2D-PAGE                           SWISS-2DPAGE          http://www.expasy.ch/ch2d/ch2d-top.html
Web resources                     The EBI Biocatalog    http://www.ebi.ac.uk/biocat/
                                  IUBio Archive         http://iubio.bio.indiana.edu

6.4 Searching Biological Databases
There are dozens of biological databases on the Web, and many alternate web interfaces that provide
access to the same sets of data. Which ones you use depends on your needs, but it's necessary for you
to be aware of what the central data repositories are for various datatypes, and how often the more
peripheral databases you might be using synchronize themselves with these central data sources.

Although data repositories for new types of biological data are multiplying, we focus here on two
established databases: NCBI's GenBank, for DNA sequence data; and the Protein Data Bank, for
molecular structure data. Every database has its own deposition procedures, and for the newer
datatypes these are not yet well established or are still changing rapidly. However, both NCBI and
RCSB have mature, automated, web-based deposition systems that are not likely to change drastically
in the near future.

6.4.1 GenBank

NCBI, in cooperation with EMBL and other international organizations, provides the most complete
collection of DNA sequence data in the world, as well as PubMed, a taxonomy database, and an
alternate access point for protein sequence and structure data. This database, known as GenBank, may
be accessed at http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein.

NCBI maintains sequence data from every organism, every source, and every type of DNA, from mRNA
to cDNA clones to expressed sequence tags (ESTs) to high-throughput genome sequencing data and
information about sequence polymorphisms. Users of the NCBI database need to be aware of the
differences between these datatypes so that they can search the data set that's most appropriate for the
work they're doing. The main sequence types that you'll encounter in a full GenBank search include:


mRNA

Messenger RNA, the product of transcription of genomic DNA. mRNA may be edited by the
cell to remove introns (in eukaryotes) or in other ways that result in differences from the
transcribed genomic DNA. May be "partial" or "complete"; an mRNA may not cover the
complete coding sequence of a gene.


cDNA

A DNA sequence artificially generated by reverse transcription of mRNA. cDNA roughly
represents the coding components of the genomic DNA region that produced the mRNA. May
also be "partial" or "complete."

Genomic DNA

A DNA sequence from genome sequencing that contains both coding and noncoding DNA
sequences. May contain introns, repeat regions, and other features. Genomic DNA (as opposed
to genome survey sequence) is generally "complete"; it's a result of multiple sequencing passes
over a single stretch of a genome, and can generally be relied upon as a fairly good
representation of the real DNA sequence of that region.


EST

Short cDNA sequences prepared from mRNA extracted from a cell under particular conditions
or in specific developmental phases (e.g., Arabidopsis thaliana 2-week-old shoots or Valencia
orange seeds). ESTs are used for quick identification of genes and don't cover the entire coding
sequence of a gene.

GSS

Genome survey sequence. Single-pass sequence directly from the genome projects. Covers each
region of sequence only once and is likely to contain a relatively large proportion of sequencing
errors. You'd include genome survey sequence in a search only if you were looking for very
new hypothetical gene annotations in a genome project that's still in progress.

There are two ways to search GenBank. The first is to use a text-based query to search the annotations
associated with each DNA sequence entry in the database. The second, which we'll discuss in Chapter
7, is to use a method called BLAST to compare a query DNA (or protein) sequence to a sequence database.

Here's a sample GenBank record. Each GenBank entry contains annotation (information about the
gene's identity, the conditions under which it was characterized, etc.) in addition to sequence.

LOCUS       AB009351     1412 bp    mRNA            PLN       22-JUN-1999
DEFINITION  Citrus sinensis mRNA for chalcone synthase, complete cds, clone
ACCESSION   AB009351
VERSION     AB009351.1  GI:5106368
KEYWORDS    chalcone synthase.
SOURCE      Citrus sinensis young seed cDNA to mRNA, clone:CitCHS2.
  ORGANISM  Citrus sinensis
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; core
            eudicots; Rosidae; eurosids II; Sapindales; Rutaceae; Citrus.
REFERENCE   1  (sites)
  AUTHORS   Moriguchi,T., Kita,M., Tomono,Y., EndoInagaki,T. and Omura,M.
  TITLE     One type of chalcone synthase gene expressed during embryogenesis
            regulates the flavonoid accumulation in citrus cell cultures
  JOURNAL   Plant Cell Physiol. 40 (6), 651-655 (1999)
  MEDLINE   99412624
FEATURES             Location/Qualifiers
     source          1..1412
                     /organism="Citrus sinensis"
                     /dev_stage="young seed"
                     /note="Valencia orange"
     CDS             30..1205
                     /product="chalcone synthase"
     polyA_site      1412
                     /note="18 a nucleotides"
BASE COUNT      331 a    358 c    372 g    351 t
        1 aaacatattc attaagggtt caacttgaaa tggcaaccgt tcaagagatc agaaacgctc
       61 agcgtgccga cggcccggcc accgtcctcg ccatcggtac ggccacgcct gcccacagtg
      121 tcaaccaggc tgattatccc gactattact tcaggatcac aaagagcgag catatgacgg

     1261 cacagttgag ttattggttg atcgtgtgaa ggtttagttt tgtcaattga gtttaaggca
     1321 tcgtgccttt tctcttatga cgtcaccaaa cctgggcaac gctttgtgtt tatgcataaa
     1381 ttcttgggaa tttgagaaag tagtaaattt gt

This sample GenBank record shows the types of fields that can be found in a record from the GenBank
Nucleotide database. Everything from the identity of the protein product (in this example, chalcone
synthase), the sequence of the protein product, and its starting and ending point within the gene, to the
authors who submitted the record and the journal references in which the experiment was described,
can be found in the record, and therefore can be used to search the database.
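
Because every annotation line in the flat file begins with a keyword, a short script can pull the searchable fields out of a record. The following Python sketch is only an illustration: it assumes the record is laid out in the usual flat-file columns (top-level keywords at the left margin, continuation lines indented), it works on an abridged copy of the sample record, and the function names parse_annotations and cds_range are invented for this example rather than taken from any NCBI tool.

```python
import re

# Abridged copy of the sample record above (assumed column layout).
RECORD = """\
LOCUS       AB009351     1412 bp    mRNA            PLN       22-JUN-1999
DEFINITION  Citrus sinensis mRNA for chalcone synthase, complete cds, clone
ACCESSION   AB009351
KEYWORDS    chalcone synthase.
SOURCE      Citrus sinensis young seed cDNA to mRNA, clone:CitCHS2.
  ORGANISM  Citrus sinensis
            Eukaryota; Viridiplantae; Streptophyta.
FEATURES             Location/Qualifiers
     CDS             30..1205
                     /product="chalcone synthase"
"""


def parse_annotations(record):
    """Return a dict of annotation fields found before the FEATURES table."""
    fields = {}
    current = None
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            break  # the feature table uses a different layout
        # Keywords start at most a few columns in; continuation lines
        # are indented much further and do not match this pattern.
        match = re.match(r"^ {0,3}([A-Z]+) +(.*)$", line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None and line.strip():
            fields[current] += " " + line.strip()
    return fields


def cds_range(record):
    """Return (start, end) of the first simple CDS feature, or None."""
    match = re.search(r"^\s+CDS\s+(\d+)\.\.(\d+)", record, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


fields = parse_annotations(RECORD)
print(fields["ACCESSION"])
print(fields["KEYWORDS"])
print(cds_range(RECORD))
```

A real record can contain more complicated feature locations (joins, complements) than this pattern handles, so treat the sketch as a starting point rather than a complete parser.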

The GenBank search interface is nearly identical to the PubMed search interface. The Limits,
Preview/Index, History, and Clipboard features for searching work the same way in the Protein,
Nucleic Acid, and Genome databases as they do for PubMed, although the specific fields that can be
searched and limits that can be set are somewhat different.

Saving search results

Sequences can be downloaded from NCBI in any of three file formats: the simple FASTA format,
which is readable by many sequence analysis programs but contains little information other than
sequence; the GenBank flat file format, which is a legacy flat file format that was used at GenBank
earlier in its history; and the modern ASN.1 (Abstract Syntax Notation One) format. ASN.1 is a generic
data specification, designed to promote database interoperability, that is now used for storage and
retrieval of all datatypes (sequences, genomes, structure, and literature) at NCBI. The NCBI Toolkit,
a code library for developing molecular biology software, relies on the ASN.1 specification. NCBI, and
increasingly, other organizations, rely on the NCBI Toolkit for software development. Learning to use
the NCBI Toolkit is a programming challenge well beyond the scope of this book, but there is an
excellent tutorial on the Web, developed by Christopher Hogue and his research group at the Samuel
Lunenfeld Research Institute.

The casual database user or depositor doesn't have to think too much about file formats, except if
database files are to be exported and read by another piece of software. NCBI's forms-based interfaces
convert user-entered data into the appropriate format for deposition, and the availability of GenBank
files in FASTA format means that most sequence analysis software can handle sequence files you
download from NCBI without complicated conversions.

When you save results of a GenBank search, you can choose the format in which to save them. Earlier,
you saw what the GenBank sequence record looks like. Many of the computer programs we discuss in
the following chapters can read GenBank format sequence files, but some can't. A particularly
foolproof format in which to save your sequence files if you're going to process them with other
software is the FASTA format. FASTA files have a simple format: a single comment line that begins
with a > character, followed by the DNA sequence in single-letter code on as many lines as needed to hold the
sequence, with no breaks. Of course, some information associated with the gene is lost when you save
the data in FASTA format, but if the program you want to use can't read that extra data, it won't be
useful to you anyway.

Here's a sample of data in FASTA format:

> gene identifier and comments here
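
To make the layout concrete, here is a minimal Python sketch that writes and reads FASTA text. It is an illustration under stated assumptions: the function names format_fasta and parse_fasta are invented for this example, the 60-character line width is a common convention rather than a requirement, and the sequence is just the first 120 bases of the sample GenBank record shown earlier, with a made-up comment line.

```python
def format_fasta(comment, sequence, width=60):
    """Return one FASTA record: a '>' comment line, then wrapped sequence."""
    lines = [">" + comment]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Yield (comment, sequence) pairs from FASTA-formatted text."""
    comment, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if comment is not None:
                yield comment, "".join(chunks)
            comment, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if comment is not None:
        yield comment, "".join(chunks)


# First two sequence lines of the sample record, with the spacing removed.
sequence = (
    "aaacatattc attaagggtt caacttgaaa tggcaaccgt tcaagagatc agaaacgctc"
    "agcgtgccga cggcccggcc accgtcctcg ccatcggtac ggccacgcct gcccacagtg"
).replace(" ", "")

# The comment text is a hypothetical example, not an NCBI-generated header.
record = format_fasta("AB009351 chalcone synthase mRNA (first 120 bases)", sequence)
print(record, end="")

parsed = list(parse_fasta(record))
```

Reading the record back with parse_fasta returns the same sequence, which is a quick way to check that nothing was lost in the round trip.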


To save your files in FASTA format, simply use the pulldown menu at the top of the results page.
When you first see it, it will say "Summary," but you can change it to FASTA, ASN.1, and other
formats. Once you've chosen your format, you can click the Save button to save all your sequences into
one big FASTA-format file. Figure 6-5 shows you how to change the file formats when doing a
GenBank search.

Figure 6-5. Changing the file format to write out your GenBank search results

Saving large result sets

So far, our discussion of information retrieval from databases has assumed that you need access to only
a few sequences at a time. However, modern bioinformatics studies increasingly deal with large
amounts of sequence data. For example, genefinding programs (covered in Chapter 7) are trained and
tested on hundreds or thousands of DNA sequences; comprehensive studies of protein families can
involve analysis of up to thousands of protein sequences as well. While it's possible to select thousands
of checkboxes on a web page by hand, it would be better to use an automated tool that can return a
large number of sequences based on criteria you specify.

NCBI provides just such a tool in the form of Batch Entrez
(http://www.ncbi.nlm.nih.gov/Entrez/batch.html). Batch Entrez is one of the tools accessible from the
Entrez web site. It's accessed using a web form that allows the user to select sequences by source
organism, by an Entrez query (using the query structure described in the section on PubMed), or by a
list of accession numbers (provided by the user in the form of a text file). The results of a Batch Entrez
search are then packaged in a file that is downloaded to the user's computer, where the complete result
set can be edited manually or (even better) using a script.

At this time, not all the biological databases are so kind about providing such services, but all the
public databases have FTP sites that allow you to download the entire database in one form or another.
That can take up a lot of space on your hard disk, but disk space is cheaper these days than the time it
would take you to handle a large set of results on an interactive web site. If you've got a local copy of
the big databases that interest you, you can write (or perhaps even download) a script that processes the
database, looking for your keyword of choice, and writes out the information you want to a file.
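
As a rough idea of what such a script might look like, the Python sketch below keeps every FASTA record whose comment line mentions a keyword. It is a minimal example under several assumptions: the local database is in FASTA format, a case-insensitive substring match is good enough, and the function name filter_fasta and the three tiny records are invented for illustration. The records are held in memory here so the example runs on its own.

```python
import io


def filter_fasta(lines, keyword):
    """Yield the lines of each FASTA record whose comment line contains keyword."""
    needle = keyword.lower()
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = needle in line.lower()
        if keep:
            yield line


# Stand-in for a large local database file; these entries are made up.
database = io.StringIO(
    ">seq1 hypothetical chalcone synthase\n"
    "atggcaaccgtt\n"
    ">seq2 hypothetical actin\n"
    "atggctgatgcc\n"
    ">seq3 hypothetical Chalcone isomerase\n"
    "atgtctccgtca\n"
)

hits = list(filter_fasta(database, "chalcone"))
print("".join(hits), end="")
```

Because filter_fasta reads one line at a time, the same function can be pointed at an open file handle for a multi-gigabyte database and its output written to a new file without loading the whole database into memory.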

6.4.2 PDB

Unlike NCBI, the Protein Data Bank (http://www.rcsb.org/pdb/) is responsible for only one type of
data: the structures of biological macromolecules and, to a growing extent, the underlying raw data
sets from which those structures were modeled.

The PDB web site offers three options for searching the database. You can enter a four-character PDB
identifier directly, or search using the SearchLite or SearchFields interfaces. The SearchLite interface is
similar to the other query tools we've discussed. You can enter a term or terms into the query box,
joined by the operators AND, OR, and BUTNOT.

The SearchFields interface is an innovative design-it-yourself web form system. As you see in Figure
6-6, when you first go to SearchFields, you can scroll down to the bottom of the web form and select
which parts of the form you need. If you're only going to be doing a FASTA search to find similar
sequences, you don't need a search form that prompts you for keywords to use in searching the Citation
Author field. You might want to add a field that lets you search for proteins with a particular ligand or
prosthetic group. With the SearchFields interface, you select the form elements you want for your
custom PDB search, and click the "New Form" button to generate the new query form.

Figure 6-6. Customizing the PDB's SearchFields form

Whether you use SearchLite or SearchFields, you'll come to the Query Result browser (Figure 6-7),
where you can select options for refining your query, downloading your results as structure or sequence
files, and even preparing a tabular report of your search results. These options are straightforward to
use and well documented on the PDB web site.

Figure 6-7. Options for using query results at the PDB

The Protein Data Bank makes data available in two formats: the legacy PDB flat-file format, and the
newer mmCIF data format. We'll discuss the differences between these two file formats in more detail
in Chapter 12. At this point, little of the available structure-analysis and protein-modeling software
handles the mmCIF format, so you are not likely to need to download protein structure data in mmCIF
format unless you are developing new software. You can choose to download the complete set of
results from your search as a tar archive or a zipped file in either PDB or mmCIF format, as well as in
sequence-only FASTA format.
The PDB offers a suite of mmCIF and PDB format conversion tools, as well as code libraries for working with mmCIF files.

Another convenient way to view protein structure data from the PDB web site is to install a browser
plug-in such as RasMol or Chime on your computer. We discuss how to do this in Chapter 9. Once the
plug-in is installed and properly configured, you can simply click on a link on the protein's View
Structure page and the protein structure is automatically displayed using the plug-in, as shown in
Figure 6-8.

Figure 6-8. Viewing a PDB file using a browser plug-in

6.5 Depositing Data into the Public Databases
In addition to downloading information from the public databases, you may also submit your own data.

6.5.1 GenBank Deposition

Deposition of sequences to GenBank has been made extremely simple by NCBI. Users depositing only
a few sequences can use the web-based BankIt tool, which is a self-explanatory form-based interface
accessible from the GenBank main page at NCBI. Users submitting multiple sequences or other
complicated submissions can use NCBI's Sequin software, which is available for all major operating
systems. Sequin is well documented on the NCBI site. NCBI has recently established two special
submission paths: EST sequences should be submitted through dbEST, rather than to GenBank, and
genome survey sequences through dbGSS.

6.5.2 PDB Deposition

Deposition of structures to the PDB is done using the AutoDep input tool (ADIT). AutoDep is a tool
that integrates data validation software with the deposition process so that the user can receive
feedback on data quality during the deposition process. AutoDep is tied in with the curation tools the
PDB uses to prepare structure data for inclusion in the data bank.

6.6 Finding Software
Bioinformatics is a diffuse field, attracting researchers from many disciplines, and articles about new
research developments in bioinformatics are widely distributed in the literature. If you're looking for
cutting-edge developments, journals such as Bioinformatics, Nucleic Acids Research, Journal of
Molecular Biology, and Protein Science often publish papers describing innovations in computational
biology methods.

If you're looking for proven software for a particular application, there are a number of reliable web
resource lists that link to computational biology software sites. Most of the major biological databases
have software resource listings and the necessary motivation to keep their listings up-to-date. The PDB
links to the best free software packages for macromolecular structure refinement, visualization, and
dynamics. TIGR and NCBI provide links to many tools for protein and DNA sequence analysis.

Many organizations and groups provide web implementations of their software. These can be a great
time-saver, especially if you are new to the use of noncommercial software packages in research. Many
of the bioinformatics programs that we describe in this book are also available as web servers. You can
use the web-server versions to get you started and understand the inputs, outputs, and options for the
program. However, web servers have their drawbacks. They typically implement only the most popular
options in any software package: it's difficult to design a web form that allows you to select every
option in a complicated program. They often allow you to run only one calculation at a time. This is
fine if you're only interested in analyzing a few sequences or structures, and not so fine if you suddenly
find yourself with 500 sequences to analyze.

With a little clever programming, you can develop scripts that allow you to hit a web server with
multiple requests without entering them manually into a form, but if you're capable of doing that,
you're probably able to download a local copy of the software and run it on your own machine. Using
your own processor in such cases avoids slow data transfer to and from remote sites and is also
considered more polite than running huge jobs on someone else's web server.

In the next four chapters, we'll discuss the software packages you are most likely to want to use. We'll
show you how to set them up on your own computer and use them independent of web interfaces.

We can't cover every available software package and web server in this book; there are just too many.
You will eventually want to go out on your own and find new tools to use. Keep a few things in mind
when searching for software, and you'll soon be able to judge for yourself if a new computer program is
something you want to use.

6.7 Judging the Quality of Information
Your ability to judge the quality of information and software you find on the Web will improve as you
continue to learn the field. At a more obvious level, however, some simple guidelines can help you
screen the information you find. Approach software, information, and services offered on the Web with
a healthy skepticism, and you're not likely to be led astray.

6.7.1 Authority

One of the first things to consider when evaluating software, data, or information found on the Internet
is the source. Who are the authors? If you don't know the authors presenting the information by
reputation, is information about their affiliation and credentials available on the web site? Is their
expertise related to the topic or purpose of the web site? Do they make it possible for you to contact
them and ask questions?

What is the purpose of the organization sponsoring the information? Is it an academic organization? A
government agency? A company? For-profit corporations often have different motivations for offering
access to their software and data than nonprofits and academic research groups; usually they are
offering a stripped-down version of their software or services to get you to buy a more complete
package. An individual academic researcher's site doesn't always have the same need to be all-inclusive
as a publicly funded database does. There is nothing inherently wrong with these offerings, but you
should be aware of whether or not they are comprehensive, whether all their features are available to
the casual user, and why.

Even data and software from national or international public sites are not necessarily entirely correct. It
has been estimated that any given sequence in GenBank is likely to contain at least one error. While
these errors generally don't render the data meaningless, it's always best to be aware of such issues even
when using top-of-the-line public resources. Like any other software you find on the Web, software
offered by public agencies such as NCBI and the PDB may still be under development. You can use
this software, and much of it is of good quality. If you're basing your research on a beta version (a
version still under development) of a software package, just read the documentation carefully so that
you know what problems still remain to be worked out.

6.7.2 Transparency

When you send data off to a web server for processing, do you ever wonder exactly what happens to it?
You should. It's OK to use your word processor as a black box, but if you're publishing scientific
conclusions based on output that you get from a web server or software package, you should definitely
know at least the basics of what's under the hood. Anyone can create a web server, based on any
software, whether it's good or just goofy. Creating a web server creates an illusion of authority; after
all, the authors know how to build a web server that works, so their other software must work too. But
that appearance of authority isn't always well founded.

Ideally, you have access to the source code (the human-readable version of a computer program) for
whatever the web server is doing, and you can read the source code and know it's doing what you
expect. But you might not know how to read source code, and even if you do, you might not be able to
get hold of it. Unfortunately, some bioinformatics software authors don't make their source code
publicly available, preferring to set up web servers that are easier to use and maintain. This can
incidentally have the effect of hiding the underlying method from close scrutiny by users.

If you can't read the source code, what can you read? Most software or web servers made available by
academic researchers or government institutions have online help pages and other documentation,
including bibliographic information for publications in refereed journals that describe the methods
encoded in the software. Read this documentation and understand the method and its results before you
use it, just as you would for an experimental method that is new to you.

If the program or server you want to use has no documentation and doesn't allow you to check the
source code, you should seriously consider not using that program, unless you have some way to verify
its output (for instance, by comparison with the output of a well-documented program). After all, you're
drawing conclusions based on your results; do you want to stake your scientific credibility on an
unknown quantity?

6.7.3 Timeliness

One of the most frequently linked biology resource sites on the Web is Pedro's Biomolecular Research
Tools (http://www.public.iastate.edu/~pedro/research_tools.html). Sites all over the world still have
pointers to this collection of links. And yet, if you click to Pedro's site, you'll find that the collection
was last updated in 1996. A funny thing about the Web is that out-of-date sites don't just go away. They
remain on the server, looking authoritative. Check web sites for dates. If there's no sign of activity in or
reference to the current year, be skeptical.

Timeliness isn't always an issue with software. Software written in 1980 can be as useful and functional
now as it was then. What you may encounter are problems compiling software that incorporates
proprietary technologies that are no longer supported, or code libraries that have since ceased to be

Chapter 7. Sequence Analysis, Pairwise Alignment, and Database Searching
We now begin our tour of bioinformatics tools in earnest. In the next five chapters, we describe some
of the software tools and applications you can expect to see in current research in computational
biology. From gene sequences to the proteins they encode to the complicated biological networks they
are involved in, computational methods are available to help you analyze data and formulate
hypotheses. We have focused on commonly used software packages and packages we have used; to
attempt to encompass every detail of every program out there, however, we'd need to turn every
chapter in this book into a book of its own.

The first tools we describe are those that analyze protein and DNA sequence data. Sequence data is the
most abundant type of biological data available electronically. While other databases may eventually
rival them in size, the importance of sequence databases to biology remains central. Pairwise sequence
comparison, which we discuss in this chapter, is the most essential technique in computational biology.
It allows you to do everything from sequence-based database searching, to building evolutionary trees
and identifying characteristic features of protein families, to creating homology models. But it's also
the key to larger projects, limited only by your imagination: comparing genomes, exploring the
sequence determinants of protein structure, connecting expression data to genomic information, and
much more.

The types of analysis that you can do with sequence data are:

• Knowledge-based single sequence analysis for sequence characteristics
• Pairwise sequence comparison and sequence-based searching
• Multiple sequence alignment
• Sequence motif discovery in multiple alignments
• Phylogenetic inference

We divide our coverage of sequence analysis tools into two chapters. This chapter focuses on programs
that operate on single sequences, or compare gene or protein sequences against each other. Chapter 8 is
devoted to multiple sequence alignment methods.

Pairwise sequence comparison is the primary means of linking biological function to the genome and
of propagating known information from one genome to another. In this chapter, we discuss the
techniques of biological sequence analysis and, most importantly, how to assess the significance of
results from sequence comparison. There are also a number of software tools available for doing
pairwise sequence comparison. Table 7-1 provides a summary.

Table 7-1. Sequence Analysis Tools and Techniques
What you do                 Why you do it                                       What you use to do it
Gene finding                Identify possible coding regions in genomic DNA     GENSCAN, GeneWise, PROCRUSTES, GRAIL
                            sequences
DNA feature detection       Locate splice sites, promoters, and sequences       CBS Prediction Server
                            involved in regulation of gene expression
DNA translation and         Convert a DNA sequence into protein sequence or     "Protein machine" server at EBI
reverse translation         vice versa
Pairwise sequence           Locate short regions of homology in a pair of       BLAST, FASTA
alignment (local)           longer sequences
Pairwise sequence           Find the best full-length alignment between two     ALIGN
alignment (global)          sequences
Sequence database search    Find sequence matches that aren't recognized by a   BLAST, FASTA, SSEARCH
by pairwise comparison      keyword search; find only matches that actually
                            have some sequence

7.1 Chemical Composition of Biomolecules
Sequence analysis techniques can be applied to DNA and RNA (nucleotide) sequences or to protein
(amino-acid) sequences. To understand why DNA and protein sequences are informative, you need to
know a bit about the chemistry of DNA and proteins. In the context of the sequence analysis
applications we discuss in this chapter, it's perfectly fine to think of a DNA sequence as pure
information. If you really want to, you can skip over the chemical structures and think of DNA as a
string of letters. But keep this fact in mind: the single-letter sequence code that describes DNA is a
simplified representation of a 3D chemical entity, and in some cases the 3D structure of the DNA is
really significant.

Proteins, at least at first glance, are more chemically complicated than DNA, and it's impossible to
separate the information content of their sequences from the chemical properties of the amino acids
they're built from. You can't safely forget about the chemistry of proteins when you're analyzing their
sequences, so we'll discuss protein chemistry thoroughly at the beginning of Chapter 9, before we
introduce techniques for protein structure analysis.

As discussed in Chapter 2, DNA is the medium for
storing information in cells, and it stores and transmits that information through the sequence of
nucleotides that make up the DNA chain. DNA occurs as a "double helix": two long sequences of
nucleotides that are chemical mirror images of each other. This double-helical structure and the
chemistry that forces a specific pattern of pairing between nucleotides in the two halves of the helix is
what gives DNA the ability to replicate itself and faithfully pass its information from cell to cell and
generation to generation. The chemistry of pairing between nucleotide chains also allows the DNA
sequence to be transcribed into RNA and translated into proteins.

7.2 Composition of DNA and RNA
DNA and RNA are polymer chains composed of a small alphabet of chemically similar compounds.
The individual units are called nucleotides. As you can see in Figure 7-1, nucleotides are made up of
three distinct parts: a cyclic base, a cyclic sugar (deoxyribose or ribose, respectively), and a phosphate
group. Base utilization is different in DNA than in RNA. The DNA code consists of patterns built up
from the A (adenine), T (thymine), G (guanine), and C (cytosine) nucleotides, while the RNA code
substitutes U (uracil) for T.

Figure 7-1. The "backbone" bits of DNA and RNA: ribose and deoxyribose phosphates

Figure 7-2 shows the five nucleotides, which are also referred to as bases. In hydrolyzed double-
stranded DNA, there are always equal amounts of A and T nucleotides (A = T). The amounts of G and
C in the solution are also always equal (G = C). This is called Chargaff's rule after the researcher who
discovered the relationships between A and T, and between G and C. (Note that the amount of A and T
can differ from the amount of G and C; the ratio of A-T to G-C base pairs varies widely from species to species.)
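
Chargaff's rule is easy to check on sequence data once you remember that it describes both strands together. The Python sketch below is a small illustration, not a standard tool: it builds the second strand of a duplex by base pairing and counts the bases across both strands. The function names and the short example strand are assumptions made for this example.

```python
from collections import Counter

# Base-pairing partners for lowercase single-letter DNA code.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(strand):
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return strand.translate(COMPLEMENT)[::-1]


def duplex_base_counts(strand):
    """Count bases over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))


strand = "aaacatattcattaagggttcaacttgaaa"  # arbitrary example strand

single = Counter(strand)
duplex = duplex_base_counts(strand)

# A single strand need not contain equal amounts of a and t ...
print(single["a"], single["t"])
# ... but the duplex always does, and likewise for g and c.
print(duplex["a"] == duplex["t"], duplex["g"] == duplex["c"])

# Fraction of G-C pairs, which varies from species to species.
gc_fraction = (duplex["g"] + duplex["c"]) / sum(duplex.values())
print(round(gc_fraction, 3))
```

The equalities hold for any strand you supply, because every a on one strand is matched by a t on the other and every g by a c; only the G-C fraction depends on the sequence itself.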

Figure 7-2. The five "bases" that commonly appear in DNA or RNA

7.3 Watson and Crick Solve the Structure of DNA
The quantitative relationships between adenine and thymine, and cytosine and guanine led Watson and
Crick to propose a structural model for DNA in 1953, and later Crick's central dogma of biology.
Watson and Crick's model of DNA was based on several observations:

• The x-ray crystallography experiments of their colleague Rosalind Franklin, who observed a
diffraction pattern from DNA that suggested a helical molecule with a regular repeating
structure at a spacing of 3.4 angstroms
• Chargaff's rules
• Experimental evidence that the bases were connected by hydrogen bonds in the DNA molecule
• The knowledge of the correct structural conformations of the bases from x-ray crystallography

What Watson and Crick did was to combine this disparate information to propose the double helix. The
double helix of DNA, which has now been determined in atomic detail using x-ray crystallography, is a
structure in which adenine pairs with thymine, and guanine with cytosine by hydrogen bonding (Figure
7-3). The hydrogen bonded base pairs form the core of the molecule.

Figure 7-3. Two common base pairs, A-T and G-C

As shown in Figure 7-4, the base pairs stack on top of and parallel to each other with a spacing of 3.4
angstroms. They are held together in sequence by covalent chemical bonds between the sugar group of
one nucleotide and the phosphate group of the next. This chain has a directionality: the end left with an
exposed phosphate group is called the 5' end, while the end with the exposed sugar group is the 3' end.

Figure 7-4. Schematic of the DNA chain

The specific chemical pairing of nucleotides in DNA and RNA sequences suggests a mechanism by
which each strand of DNA can serve as the template for the synthesis of a complementary strand. The
use of a similar nucleotide code in RNA suggests that DNA can also be used as a template for synthesis
of RNA. From these two pieces of evidence, Crick proposed his central dogma: that DNA directs its
own replication and its transcription into RNA and that RNA is translated into protein.
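
The templating idea can be sketched in Python. This is a toy model, not a description of the enzymes involved: the function names and the example sequence are assumptions made for illustration, and transcription is represented simply as writing the coding strand with U in place of T.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(template):
    """Build the strand that pairs with `template`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def transcribe(coding_strand):
    """Return the RNA carrying the same message as the coding strand (U replaces T)."""
    return coding_strand.replace("T", "U")

coding = "ATGGCCTAA"  # hypothetical example sequence
template = complementary_strand(coding)

# Copying the copy regenerates the original strand, which is why either
# strand is enough to reconstruct the whole molecule.
assert complementary_strand(template) == coding

print(template)            # TTAGGCCAT
print(transcribe(coding))  # AUGGCCUAA
```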

7.4 Development of DNA Sequencing Methods
If you just digest DNA into its four component bases and measure the quantity of each, it tells you
nothing about the DNA sequence. Modern methods for DNA sequencing rely on controlled
biochemical reactions that allow the base content at each position in the DNA sequence to be
quantitated independently. The chemical cleavage method for sequencing DNA relies on the specificity
of chemical reagents (reactive substances) to break DNA chains at four specific types of sites. There
are reagents that break or cleave the chain specifically after G nucleotides and reagents that cleave
specifically after C nucleotides. There are also reagents that cleave less specifically: one to cleave after
A and G nucleotides and one to cleave after C and T nucleotides. The method Maxam and Gilbert
designed was conceptually simple. Four samples of DNA are required for this method. One type of
reagent is mixed with each sample in a quantity that causes each DNA chain in the sample to be broken
only one time, on average, at a random location. One end of the DNA is radioactively labeled, and the
other is not, so only one piece of each broken chain is radioactive after the chain is cleaved. DNA
fragments of different sizes can be separated using an electric current to drive them through a viscous
medium called a gel. The larger the fragment, the more it's slowed by the gel, so at the end of some
period of time, different-sized radioactive pieces of DNA are spread out at regular intervals down the
gel. If each DNA chain is broken once after a random A, C, G, or T, a uniform distribution of
fragments that map the entire sequence of the DNA is created. Depending on which sample the
radioactive piece is from, the last base in its sequence is known, and the sequence can be read off the
gel from end to end.

Figure 7-5 shows a partial autoradiogram of a DNA sequencing gel. Each set of four closely spaced
lanes represents an individual sequencing experiment. The gel is read from bottom to top. Each band
on the gel identifies the nucleotide present at the position in the sequence, depending on which of the
four lanes it appears in. (Image courtesy of Dr. Dennis Dean, Virginia Tech.)

Figure 7-5. DNA sequencing gel
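
The logic of reading a chemical-cleavage gel can be sketched as a small simulation. The model below is idealized and rests on several assumptions: the four lanes are named G, A+G, C, and C+T after the reagent specificities described above, every possible cleavage site is assumed to produce a visible band, and a labeled fragment is taken to end with the base after which the chain was cut. The function names and the test sequence are invented for the example.

```python
# Which bases each of the four reagents cleaves after.
LANES = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}

def run_gel(sequence):
    """Return, for each lane, the set of labeled fragment lengths (the bands)."""
    return {
        lane: {i + 1 for i, base in enumerate(sequence) if base in targets}
        for lane, targets in LANES.items()
    }

def read_gel(bands, length):
    """Read the bands from shortest to longest fragment to recover the sequence."""
    calls = []
    for size in range(1, length + 1):
        if size in bands["G"]:
            calls.append("G")      # band in both the G and A+G lanes
        elif size in bands["A+G"]:
            calls.append("A")      # band in the A+G lane only
        elif size in bands["C"]:
            calls.append("C")      # band in both the C and C+T lanes
        elif size in bands["C+T"]:
            calls.append("T")      # band in the C+T lane only
        else:
            calls.append("N")      # no band at this size
    return "".join(calls)

seq = "GATTACAGC"  # hypothetical example sequence
bands = run_gel(seq)
print(read_gel(bands, len(seq)))  # GATTACAGC
```

Because two of the reagents are less specific, A and T are called by elimination: a band in the A+G lane with no partner in the G lane must be an A, and likewise for T.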

Sanger's chain-terminator procedure is the most commonly used sequencing chemistry in modern
laboratories. This procedure takes advantage of an enzyme called DNA polymerase, which builds a
complementary strand of DNA for an existing single strand. In Sanger's method, the DNA polymerase
reaction is carried out in the presence of specific analogues of nucleotides that, when they are
incorporated, cause the synthesis of the complementary strand to stop. Four samples are prepared, each
containing a small amount of one type of chain terminator. Analogously to the Maxam and Gilbert
method, a uniform distribution of DNA fragments is generated, each with a known end residue. The
fragments are analyzed based on the strength of this fluorescence signal, giving the sequence of the
complementary strand to the original DNA.

The chain termination method is easily automated, and computer-compatible sequencing systems that
use this method are readily available. Most genome sequence data is currently generated using this
method, though new sequencing methods that don't involve chain cleavage or chain termination are in
development. We discuss the process of sequencing data analysis and genome assembly further in
Chapter 11.

7.4.1 The Chemical Composition of Proteins

Like DNA, proteins are polymers built from a common set of building blocks, in this case called amino
acids. There are 20 amino acids that make up the standard chemical alphabet used to build proteins.
Amino acids are small molecules that share a common motif of three substituent chemical groups
arranged around a central carbon atom. One of the substituent groups is always an amino group; another
is always a carboxylic acid group. To form the protein polymer, the amino and carboxyl groups react
with each other and form a bond called the peptide bond. The third substituent on the central carbon of
an amino acid is variable, and it's this property that makes the amino acids into a code for storing
information. The sequence of amino acids in a protein is referred to as the protein's primary structure.
Protein sequence can be subjected to the same analyses (described later) as DNA sequence. As we
describe sequence analysis methods, we will point out ways in which these methods differ for proteins
and DNA.

7.4.2 Mechanisms of Molecular Evolution

The discovery of DNA as the molecular basis of heredity and evolution made it possible to understand
the process of evolution in a whole new way. Darwin's theory of evolution by natural selection
describes the observable process of evolution and speciation. However, it doesn't explain how
information is passed from generation to generation, nor does it explain the mechanisms that give rise
to, or that limit, variation within each generation.

The two halves of the double-helical DNA molecule serve as a template for replication of the DNA
molecule. Even though the molecular rules governing replication of DNA are specific, replication
doesn't always occur with perfect fidelity. When a piece of DNA is replicated incorrectly and the error
is not corrected by the cell's repair machinery, it's called a mutation.

Mutations can occur in any part of an organism's DNA: in the middle of genes that code for proteins or
functional RNA molecules, in the middle of regulatory sequences that govern when a gene is turned on,
or out in the "middle of nowhere", in the regions between gene sequences. Mutations can have dramatic
effects on the organism's phenotype (its visible or measurable characteristics) or they can have no
apparent effect. Over time (thousands or millions of years), mutations that are beneficial or at least
not harmful to a species can become fixed in the population, meaning that the mutated form of the gene
occurs with a certain frequency among all individuals of a particular species. Over longer time scales,
enough mutations may accumulate that new species develop.

There are two classes of mutations: point mutations, in which a change affects a single nucleotide in the
DNA sequence; and segmental mutations, which can affect anywhere from a few to many hundreds of
adjacent nucleotides.

Point mutations usually result from a single mismatch, in which one nucleotide is mispaired with the
template DNA as a new complementary DNA strand is being built. Point mutations become significant
only if they occur in the middle of a coding region or signal sequence, and then only if they cause a
change in functionality. In coding regions, point mutations can either be synonymous, meaning that the
mutated strand codes for the same amino acid as it did before the mutation occurred, or
nonsynonymous. The genetic code (which was shown back in Figure 2-3) is degenerate; that is, several
different three-letter combinations code for each amino acid. The groups of codons which code for each
amino acid are by no means random; instead, nature has arranged a fail-safe mechanism in which
several codons that differ by only one nucleotide represent a single amino acid, thereby allowing a little
room for synonymous replication errors in DNA.
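
The distinction between synonymous and nonsynonymous point mutations is easy to check mechanically. The sketch below assumes the standard genetic code, built from the conventional T, C, A, G ordering of the codon table; the function name and the example codons are chosen here for illustration.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_point_mutation(codon, position, new_base):
    """Mutate one base (0-based position) and report whether the amino acid changes."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "nonsynonymous"
    return mutated, before, after, kind

# A change in the third position of an alanine codon is silent...
print(classify_point_mutation("GCT", 2, "C"))  # ('GCC', 'A', 'A', 'synonymous')
# ...while a change in the second position of a glutamate codon is not.
print(classify_point_mutation("GAA", 1, "T"))  # ('GTA', 'E', 'V', 'nonsynonymous')
```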

Segmental mutations, which can result in insertion or deletion of long stretches of DNA, can occur by
many different mechanisms, all of which involve mismatching of a strand of DNA either with the
wrong partner or with a part of itself. Segmental mutations can result in duplications of whole genes or
even large regions of chromosomes; some genetic events can even result in the duplication of entire
genomes. Generated by gene and chromosome duplication, redundant copies of genes can be
repurposed (through a slow process of mutational trial and error) to perform new functions in the cell.
A detailed discussion of these mechanisms is given in the excellent book Fundamentals of Molecular
Evolution; see the Bibliography.

Both types of mutation leave traces in the evolutionary record, that is, in the DNA sequences of living
things. Since mutations tend to be preserved only if they are functionally useful (or at least, not
harmful), there is a tendency for functionally important parts of sequences to be conserved (to remain
constant throughout the evolutionary process) while noncoding or nonfunctional sequences diverge
wildly. This tendency to conserve functionally important sequences is the basis for the whole field of
sequence analysis; it lets us draw evolutionary connections between genes that are related in sequence.

By comparative study of DNA sequences, and on a larger scale, of whole genomes, it's possible to
develop quantitative methods for understanding when and how mutational events occurred, as well as
how and why they were preserved to survive in existing species and populations. Genomics and
bioinformatics (the production of genome data and the development of tools for analyzing it) have
made it possible to examine the evolutionary record and make increasingly quantitative statements
about the evolutionary relationship of one species to another. Taxonomies can begin to be based not
merely on anatomy but on quantitative measurements of differences in the genetic code. Both point
mutations and segmental mutations are explicitly modeled in the scoring schemes for comparison of
protein and DNA sequences discussed later in this chapter. Changes in the identity of the residue
(nucleotide or amino acid) at a given position in the sequence are scored using standard substitution
scores (for example, a positive score for a match and a negative score for a mismatch) or substitution
matrices. Insertions and deletions are scored with penalties for gap opening and gap extension.
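
As a rough illustration of such a scoring scheme, the sketch below scores two sequences that have already been aligned. The numeric values are arbitrary placeholders, not recommended parameters, and one common convention is assumed: the first position of a gap costs the gap-opening penalty and each further position costs the gap-extension penalty. The function name is made up for the example.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two pre-aligned sequences of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None  # which sequence the currently open gap is in, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score += gap_extend if side == gap_in else gap_open
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score

print(score_alignment("GATTACA", "GACTACA"))  # 5: six matches, one mismatch
print(score_alignment("GATTACA", "GA--ACA"))  # -1: five matches, one two-base gap
```

Charging more to open a gap than to extend it reflects the idea that a single segmental mutation can insert or delete several adjacent residues at once.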

7.5 Genefinders and Feature Detection in DNA
Once a large chunk of DNA has been mapped and sequenced, the task of understanding its function
begins. In this section, we describe some programs that search the sequence for genes and other
biologically important features. A feature is a sequence pattern with some functional significance, such
as start and stop codons, splice sites (in the case of eukaryotes), and sequences that are bound by
proteins in order to regulate gene expression. Some features can be found by searching for a specific
sequence, such as the restriction site cleaved by a given restriction enzyme. Others, such as promoters
and genes, aren't so easy to pick out. Analysis of single DNA sequences in search of sequence features
is a rapidly growing research area in bioinformatics.
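
The easy case, a feature defined by one exact sequence, amounts to a string search. The sketch below looks for the EcoRI recognition sequence GAATTC; the function name and the test sequence are made up for the example. Because this particular site reads the same on both strands, searching one strand is enough, which would not be true of a non-palindromic site.

```python
def find_sites(sequence, site):
    """Return the 0-based start of every occurrence of `site`, overlaps included."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

ECORI = "GAATTC"  # EcoRI recognition sequence
dna = "TTGAATTCAGGCCGAATTCA"  # hypothetical example sequence

print(find_sites(dna, ECORI))  # [2, 13]
```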

There are two reasons that genefinding and feature detection are such notoriously difficult problems.
First, there are a huge number of protein-DNA interactions, many of which have not yet been
experimentally characterized, and some of which differ from organism to organism. More importantly,
we don't always know what constitutes a binding sequence. Current promoter detection algorithms
yield about 20-40 false positives for each real promoter identified. Some proteins bind to specific
sequences; others are more flexible in their preference for attachment sites. To complicate matters
further, a protein can bind in one part of a chromosome but affect a completely different region
hundreds or thousands of base pairs away.

7.5.1 Predicting Gene Locations

Genefinders are programs that identify (or try to, anyway) all the open reading frames in unannotated
DNA. They use a variety of approaches to locate genes, but the most successful combine content-based
and pattern-recognition approaches. Content-based methods for gene prediction take advantage of the
fact that the distribution of nucleotides in genes is different than in non-genes. The GRAIL family of
programs developed at Oak Ridge National Laboratory uses a neural network to combine evidence
from seven different statistical measures of DNA content (frame bias, periodicities, fractal dimension,
coding 6-tuples, in-frame 6-tuples, k-tuple commonality, and repetitive 6-tuple words); subsequent
versions measure additional features to better exploit these different types of data. At each position in
the DNA sequence, the program weighs each type of information, integrates them, and comes up with a
score that represents the likelihood that the region in question is in an ORF or an intergenic region.
Pattern-recognition methods look for characteristic sequences associated with genes (start and stop
codons, promoters, splice sites) to infer the presence and structure of a gene.
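
To give a feel for what a content-based measure looks like, the sketch below computes a far simpler statistic than any of those listed above: the fraction of G and C bases in a sliding window. It is not one of GRAIL's measures and makes no claim to predict genes; the function name, window size, and example sequence are assumptions chosen for illustration.

```python
def gc_content_windows(sequence, window=6, step=3):
    """Yield (start, GC fraction) for each window along the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        gc = sum(base in "GC" for base in chunk)
        yield start, gc / window

dna = "ATATATGCGCGCATATAT"  # hypothetical example sequence
for start, fraction in gc_content_windows(dna, window=6, step=6):
    print(start, fraction)
```

A real genefinder would compute several such position-by-position statistics and combine them into a single score rather than relying on any one of them.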

In isolation, each method goes only so far. You have a similar rate of success if you try to identify
human faces by looking for either a characteristic skin texture (content) or the presence of a moustache
(pattern), but not both. Not surprisingly, the current generation of genefinders combine both methods
with additional knowledge, such as gene structure or sequences of other, known genes.

Some genefinders are accessible only through web interfaces, making the interaction very
straightforward: the sequence that needs to be examined for genes is submitted to the program, it is
processed, and the output is returned. On one hand, this eliminates the need for installation and
maintenance of the genefinder on your system, and it provides a relatively uniform interface for the
different programs. On the other, if you plan to rely on the results of a genefinder, you should take the
time to understand the underlying algorithm, find out if the model is specific for a given species or family,
and, in the case of content-based models, know which sequences they were trained on. The accuracy of a genefinder
can be misleadingly high if it is trained on the same sequence with which you test it.

Some commonly used programs in gene finding include Oak Ridge National Labs' GRAIL,
GENSCAN (developed by Chris Burge, now at MIT, and Samuel Karlin at Stanford), PROCRUSTES
(developed by Pavel Pevzner and coworkers), and GeneWise (developed by Ewan Birney and Richard
Durbin). GRAIL combines evidence from a variety of signal and content information using a neural
network. GENSCAN combines information about content statistics with a probabilistic model of gene
structure. PROCRUSTES and GeneWise find open reading frames by translating the DNA sequence
and comparing the resulting protein sequence with known protein sequences. PROCRUSTES compares
potential ORFs with close homologs, while GeneWise compares the gene against a single sequence or a
model of an entire protein family.

7.5.2 Feature Detection

In addition to their role in genefinder systems, feature-detection algorithms can be used on their own to
find patterns in DNA sequences. Frequently, these tools help interpret freshly sequenced DNA or
choose targets for designing PCR primers or microarray oligomers. Some starting places for tools like
these include the Center for Biological Sequence Analysis at the Technical University of Denmark
(which has several web-based applications for finding intron-exon splice sites and transcription start
sites in eukaryotic DNA), the CodeHop server at the Fred Hutchinson Cancer Research Center (which
predicts PCR primers based on conserved protein sequences), and the Tools collection at the European
Bioinformatics Institute.

In addition to these special-purpose tools, another popular approach is to use motif discovery programs
that automatically find common patterns in sequences. We will examine these programs in greater
detail when we look at multiple sequence analysis methods.

7.6 DNA Translation
Before a protein can be synthesized, its sequence must be translated from the DNA. Translation of
DNA sequence into protein sequence isn't conceptually or computationally difficult. All that is required
is the DNA sequence, a genetic code, and a program that reads in one type of sequence and outputs the
other.

Any DNA sequence can be translated in six possible ways. The sequence can be translated backward
and forward. Because each amino acid in a protein is specified by three bases in the DNA sequence,
there are three possible translations of any DNA sequence in each direction: one beginning with the
very first character in the sequence, one beginning with the second character, and one beginning with
the third character.
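
A six-way translation of this kind fits in a few lines of Python. The sketch below makes some assumptions: it uses the standard genetic code, it takes "backward" to mean reading the reverse complement (the other strand), and it labels the frames +1 to +3 and -1 to -3. The function names and the example sequence are invented for the illustration.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate starting at the first base, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(name, protein)
```

For this short example, only the +1 frame reads as a start codon followed by a stop (MA*); the other five frames give unrelated peptides.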

Figure 7-6 shows "back-translation" of a protein sequence (shown on the top line) into DNA, using the
bacterial and plant plastid genetic code. As you can see, back-translation of a protein sequence into
DNA isn't unique. Each amino acid in the short sequence shown can be represented by as many as six
codons, and the possible codons can be combined in many ways to produce not one, but hundreds of
possible coding sequences, even for a short peptide. However, note that nature has grouped the codons
"sensibly": alanine (A) is always specified by a "G-C-X" codon, arginine (R) is specified either by a
"C-G-X" codon or an "A-G-pyrimidine" codon, etc. This reduces the number of potential sequences
that have to be checked if you (for example) try to write a program to compare a protein sequence to a
DNA sequence database. [1]

[1] The more computationally efficient solution to this problem is simply to translate the DNA sequence database in all six reading frames.

Figure 7-6. Back-translation from a protein sequence

There are no markers in the DNA sequence to indicate where one codon ends and the next one begins.
Consequently, unless the location of the start codon is known ahead of time, a double-stranded DNA
sequence can be interpreted in any of six ways: an open reading frame can start at nucleotide i, at i+1,
or at i+2 on either the observed or complementary strand. To account for this uncertainty, when a
protein is compared with a set of DNA sequences, the DNA sequences are translated into all six
possible amino acid sequences, and the protein query sequence is compared with these resulting
conceptual translations. This exhaustive translation is called a "six-frame translation" and is illustrated
in Figure 7-7.

Figure 7-7. A DNA sequence and its translation in three of six possible reading frames

Because of the large number of codon possibilities for some amino acids, back-translation of a protein
into DNA sequence can result in an extremely large number of possible sequences. However, codon
usage statistics for different species are available and can be used to suggest the most likely back-
translation out of the range of possibilities.
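
The size of the back-translation problem can be estimated by multiplying together the number of codons available for each residue. The sketch below does this under the assumption of the standard genetic code; the function name and the example peptides are made up, and no codon usage statistics are included, since those would have to come from real data for the species of interest.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons specify each amino acid (1 for M and W, up to 6 for L, R, S).
DEGENERACY = Counter(CODON_TABLE.values())

def count_back_translations(protein):
    """Number of distinct DNA sequences that encode `protein`."""
    return prod(DEGENERACY[aa] for aa in protein)

print(count_back_translations("MARK"))  # 48  = 1 * 4 * 6 * 2
print(count_back_translations("ARLS"))  # 864 = 4 * 6 * 6 * 6
```

Even a four-residue peptide can have hundreds of possible coding sequences, and the count grows multiplicatively with length.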

BLAST and FASTA dynamically translate query and database sequences so you don't need to worry
about translating a database before you do a sequence comparison. However, in the event that you need
to produce a six-frame translation of a single DNA sequence or translate a protein back into a set of
possible DNA sequences, and you don't want to script it yourself, the Protein Machine server

