
of biochemical reagent in a system has an associated rate of formation and rate of breakdown, and the
model is capable of predicting how the system will behave over time under various starting conditions.
A model of metabolism may consist of dozens of reagents, each being formed and consumed by
multiple reactions. Models that accurately simulate the behavior of a complex biochemical pathway
aren't trivially developed, but once created, they can predict the effect of perturbations to the system
and help researchers develop new hypotheses.

If you're coming from a biological sciences background, you are probably familiar with the Michaelis-
Menten model for describing enzyme kinetics. Biochemical modeling extends this relatively simple
mathematical model of a single enzyme-catalyzed reaction to encompass multiple reactions that may
feed back upon each other in complex ways. Physiological models also involve multiple compartments
with barriers through which only some components can diffuse or be transported. However, the
underlying principles are similar, no matter how complex the model.
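The Michaelis-Menten rate law itself is simple enough to compute directly. Here is a minimal Perl sketch; the Vmax and Km values are invented purely for illustration:

```perl
#!/usr/bin/perl -w
# Michaelis-Menten rate: v = (Vmax * [S]) / (Km + [S])
# The Vmax and Km values below are arbitrary, for illustration only.

my $vmax = 100.0;   # maximum reaction rate
my $km   = 5.0;     # substrate concentration giving half-maximal rate

foreach my $s (1, 5, 50) {   # a few substrate concentrations
    my $v = ($vmax * $s) / ($km + $s);
    printf "[S] = %5.1f   v = %6.2f\n", $s, $v;
}
```

At [S] = Km, the computed rate is exactly Vmax/2, as the model predicts.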

11.8.1 Modeling Kinetics with Gepasi

Gepasi (http://www.gepasi.org/) is a user-friendly biochemical kinetics simulator for Windows/NT that
can model systems of up to 45 metabolites and 45 rate equations. The Gepasi interface includes
interactive tools for creating a new metabolic model: entering chemical reactions, adding metabolites
that may be effectors or inhibitors of the reactions, defining reaction kinetics, setting metabolite
concentrations, and other key steps in model development. You can apply Gepasi's predefined reaction
types to your model or define your own reaction types. Gepasi automatically checks on mass
conservation relations that need to be accounted for in the simulation. Gepasi has numerous options for
running simulations over various time courses and testing the results of changing variable values over a
user-defined range. Gepasi can also optimize metabolic models used in metabolic engineering and fit
experimental data to metabolic models.

At the time of this writing, versions of Gepasi for platforms other than Windows/NT are in
development.

11.8.2 XPP

XPP (http://www.math.pitt.edu/~bard/xpp/xpp.html) is a dynamical systems simulation tool that is
available for both Windows/NT and Linux. While it lacks some of the user-friendly features of Gepasi,
it has been used effectively to model processes ranging from biochemical reactions to cell
cycles and circadian rhythms. XPP compiles easily on any Linux system with a simple make command;
just download the archive, move it into a directory of its own, extract it, then compile the program.
Documentation, as well as example files for various simulations, is included with the XPP
distribution.

11.8.3 Using the Virtual Cell Portal

The Virtual Cell portal at the National Resource for Cell Analysis and Modeling (NRCAM,
http://www.nrcam.uchc.edu) is the first web-based resource for modeling of cellular processes. It
allows you to model cells with an arbitrary number of compartments and complex physiology. A
tutorial available at the Virtual Cell site walks the first-time user through the process of developing a
physiology model for a cell, choosing a cell geometry, and setting up and running a simulation. The
cell physiology model includes not only a compartmentation scheme for the cell, which can be created
using simple drawing tools, but the addition of specific types of ionic species and membrane
transporters to the cell model.

The Virtual Cell is a Java applet, which is fully supported for Macintosh and Windows users. In order
to use the Virtual Cell portal on a Linux workstation, you need to download the Java plug-in for
Netscape (available from http://www.blackdown.org) and install it in your ~/.netscape directory. Once
the plug-in is installed, you can follow the "MacIntosh Users Run the Virtual Cell" link on the main
page, even if you're running the VCell Applet on a Linux workstation, and you can try out most
features of the portal. At the time of this writing, Unix users aren't explicitly supported at the Virtual
Cell portal, and while correct functionality seems to be available when the Blackdown Java applet is
used, it might be wise for serious users of the VCell tools to compare results for some test cases on a
Linux workstation and another machine.

11.9 Summary

We've compiled a quick-reference table of genomics and proteomics tools and techniques (Table 11-1).

Table 11-1. Genomics and Proteomics Tools and Techniques
What you do                            | Why you do it                                                                                         | What you use to do it
Online genome resources                | To find information about the location and function of particular genes in a genome                   | NCBI tools, TIGR tools, EnsEMBL, and genome-specific databases
Basecalling                            | To convert fluorescence intensities from the sequencing experiment into four-letter sequence code     | Phred
Genome mapping and assembly            | To organize the sequences of short fragments of raw DNA sequence data into a coherent whole           | Phrap, Staden package
Genome annotation                      | To connect functional information about the genome to specific sequence locations                     | MAGPIE
Genome comparison                      | To identify components of genome structure that differentiate one organism from another               | PipMaker, MUMmer
Microarray image analysis              | To identify and quantitate spots in raw microarray data                                               | CrazyQuant, SpotFinder, ArrayViewer
Clustering analysis of microarray data | To identify genes that appear to be expressed as linked groups                                        | Cluster, TreeView
2D-PAGE analysis                       | To analyze, visualize, and quantitate 2D-PAGE images                                                  | Melanie3, Melanie Viewer
Proteomics analysis                    | To analyze mass spectrometry results and identify proteins                                            | ExPASy tools, ProteinProspector, PROWL
Metabolic pathway tools                | To search metabolic pathways and discover functional relationships; to reconstruct metabolic pathways | PATH-DB, WIT, KEGG
Metabolic and cellular simulation      | To model metabolic and cellular processes based on known properties and inference                     | Gepasi, XPP, Virtual Cell




Chapter 12. Automating Data Analysis with Perl
As we've seen in previous chapters, a vast assortment of software tools exists for bioinformatics. Even
though it's likely that someone has already written what you need, you will still encounter many
situations in which the best solution is to do it yourself. In bioinformatics, that often means writing
programs that sift through mountains of data to extract just the information you require. Perl, the
Practical Extraction and Reporting Language, is ideally suited to this task.

12.1 Why Perl?
There are a lot of programming languages out there. In our survey of bioinformatics software, we have
already seen programs written in Java, C, and FORTRAN. So, why use Perl? The answer is
efficiency.[1] Biological data is stored in enormous databases and text files. Sorting through and
analyzing this data by hand (and it can be done) would take far too long, so the smart scientist writes
computer tools to automate the process. Perl, with its highly developed capacity to detect patterns in
data, and especially strings of text, is the most obvious choice. The next obvious choice would
probably be Python. Python, the less well known of the two, is a fully object-oriented scripting
language introduced by Guido van Rossum in 1991. Python has some outstanding contributed code,
including a mature library for numerical methods, tools for building graphical user interfaces quickly
and easily, and even a library of functions for structural biology. At the end of the day, however, it's the
wealth of existing Perl code for bioinformatics, the smooth integration of that code onto Unix-based
systems, cross-platform portability, and an incredibly enthusiastic user community that makes Perl our
favorite scripting language for bioinformatics applications.
[1]
Efficiency from the programmer's point of view, that is. It takes far less programming time to extract data with Perl than with C or with Java.


Perl has a flexible syntax, or grammar, so if you are familiar with programming in other languages such
as C or BASIC, it is easy to write Perl code in a C-like or BASIC-like dialect. Perl also takes care of
much of the dirty work involved in programming, such as memory allocation, so you can concentrate
on solving the problem at hand. It's often the case that programming problems requiring many lines of
code in C or Java may be solved in just a few lines of Perl.

Many excellent books have been written about learning and using Perl, so this single chapter obviously
can't cover everything you will ever need to know about the language. Perl has a mountain of features,
and it's unrealistic to assume you can master it without a serious commitment to learning the art of
computer programming. Our aim in this chapter isn't to teach you how to program in Perl, but rather to
show you how Perl can help you solve certain problems in bioinformatics. We will take you through
some examples that are most immediately useful in real bioinformatics research, such as reading
datafiles, searching for character strings, performing back-of-the-envelope calculations, and reporting
findings to the user. And we'll explain how our sample programs work. The rest is up to you. The
ability to program in any language, but especially in Perl, Python, or Java, is an important skill for
any bioinformatician to have. We strongly suggest you take a programming class or obtain one of the
books on our list of recommended reading and start from the beginning. You won't regret it.

12.1.1 Where Do I Get Perl?



Perl is available for a variety of operating systems, including Windows and Mac OS, as well as Linux
and other flavors of Unix. It's distributed under an open source license, which means that it's essentially
free. To obtain Perl from the Web, go to http://www.perl.com/pub/language/info/software.html and
follow the instructions for downloading and installing it on your system.

12.2 Perl Basics
Once you've installed Perl, or confirmed with your system administrator that it's already installed on
your system, you're ready to begin writing your first program. Writing and executing a Perl program
can be broken into several steps: writing the program (or script) and saving it in a file, running the
program, and reading the output.

12.2.1 Hello World

A Perl program is a text file that contains instructions written in the Perl language. The classic first
program in Perl (and many other languages) is called "Hello, World!" It's written like this:

#!/usr/bin/perl -w
# Say hello
print "Hello, World!\n";

"Hello, World!" is a short program, but it's still complete. The first line is called the shebang line and
tells the computer that this is a Perl program. All Perl programs running on Unix begin with this line. [2]


It's a special kind of comment to the Unix shell that tells it where to find Perl, and also instructs it to
look for optional arguments. In our version of "Hello World!" we've included the optional argument -w
at the end of the line. This argument tells Perl to give extra warning messages if you do something
potentially dangerous in your program. It's a good idea to always develop your programs under -w.
[2]
Strictly speaking, the shebang line isn't necessary on Windows or Macintosh; neither of those systems has a /usr/bin/perl. It's good programming form to always
include the line, however, since it's the best place to indicate your optional arguments in Perl. On other platforms, you can run the program by invoking the Perl
interpreter explicitly, as in perl hello.pl.


The second line starts with a # sign. The # tells Perl that the line of text that follows is a comment, not
part of the executable code. Comments are how humans tell each other what each part of the program is
intended to do. Make a habit of including comments in your code. That way you and other people can
add to your code, debug it successfully, and even more importantly, remember what it was supposed to
do in the first place.

The third line calls the print function with a single argument that consists of a text string. At the end of
the text string, there is a \n, which tells the computer to move to a new line after executing the print
statement. The print statement ends with a semicolon, as do most statements in Perl.

To try this little program yourself, you can open a text editor such as vi, Emacs, or pico, and type the
lines in. When you've finished entering the program, name the file hello.pl and save it in your directory
of choice. While you're learning, you might consider creating a new directory (using the mkdir
command, which we covered in Chapter 4) called Perl in your home directory. That way you'll always
know where to look for your Perl programs.

Now make the file executable using the command:


% chmod +x hello.pl

(If you need a refresher on chmod, this would be a good time to review the section on changing file
permissions in Chapter 4.) To run the program, type:

% hello.pl

Because of the shebang line in our program, this command invokes the Perl interpreter, which reads the
rest of the file and then translates your Perl source code into machine code the computer can execute.
In this case you'll notice that Hello, World! appears on your computer screen, and then the cursor
advances to a new line. You've now written and run your first Perl program!

12.2.2 A Bioinformatics Example

One of the strengths of Perl, and the reason that bioinformaticians love it, is that with a few lines of
code, you can automate a tedious task such as searching for a nucleotide string contained in a block of
sequence data. To do this, you need a slightly more complex Perl program that might look like this:

#!/usr/bin/perl -w
# Look for nucleotide string in sequence data

my $target = "ACCCTG";
my $search_string =
    'CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA'.
    'CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT'.
    'ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC';

my @matches;

# Try to find a match in letters 1-6 of $search_string, then look at letters 2-7,
# and so on. Record the starting offset of each match.

foreach my $i (0..length $search_string){
    if( $target eq substr( $search_string, $i, length $target)){
        push @matches, $i;
    }
}

# Make @matches into a comma-separated list for printing
print "My matches occurred at the following offsets: @matches.\n";

print "done\n";

This program is also short and simple, but it's still quite powerful. It searches for the target string
"ACCCTG" in a sequence of data and keeps track of the starting location of each match. The program
demonstrates variables and loops, which are two basic programming constructs you need to understand
to make sense of what is going on.

12.2.3 Variables

A variable is a name that is associated with a data value, such as a string or a number. It is common to
say that a variable stores or contains a value. Variables allow you to store and manipulate data in your
programs; they are called variables because the values they represent can change throughout the life of
a program.

Our sequence matching program declares four variables: $target, $search_string, @matches, and $i.
The $ and @ characters indicate the kind of variable each one is. Perl has three kinds of variables built
into the language: scalars, arrays, and hashes.

Unlike other programming languages, Perl doesn't require formal declaration of variables; they simply
exist upon their first use whether you explicitly declare them or not. You may declare your variables, if
you'd like, by using either my or our in front of the variable name. When you declare a variable, you
give it a name. A variable name must follow two main rules: it must start with a letter or an underscore
(the $ and @ characters aren't considered part of the name), and it must consist of letters, digits, and
underscores. The best names are ones that clearly, concisely, and accurately describe the variable's role
in the program. For example, it is easier to guess the role of a variable named $target or
$sequence than one called $icxl.
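For instance (these names are invented for illustration):

```perl
#!/usr/bin/perl -w
# Legal and illegal variable names (illustrative only).

my $target_gene = "BRCA1";   # legal: starts with a letter
my $_tmp        = 42;        # legal: starts with an underscore
# my $2nd_base  = "C";       # illegal: a name can't start with a digit

print "$target_gene $_tmp\n";
```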

12.2.3.1 Scalars

A scalar variable contains a single piece of data that is either a number or a string. The $ character
indicates that a variable is scalar. The first two variables declared in our program are scalar variables:

my $target = "ACCCTG";
my $search_string =
    "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA".
    "CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT".
    "ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC";

In this case, "ACCCTG" is the target string we are seeking, and "CCACACCACACCCACAC..." is the
sequence data in which we're hoping to find it.

In a scalar variable, a number can be either an integer (0, 1, 2, 3, etc.) or a real number (a number that
contains a fractional portion, such as 5.6). A string is a chunk of text that's surrounded by quotes. For
example:

"I am a string."
'I, too, am a string'

One of Perl's special features is that it has a number of built-in facilities for manipulating strings, which
comes in handy when working with the flat text files common to bioinformatics. We cover flat text
files and their more structured relatives, relational databases, in detail in Chapter 13.
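A few of those built-in string functions in action (the sequence here is just a made-up string):

```perl
#!/usr/bin/perl -w
# A sampler of Perl's built-in string functions.

my $seq = "accctgAACCT";

print length($seq), "\n";         # 11: number of characters
print uc($seq), "\n";             # ACCCTGAACCT: uppercase copy
print substr($seq, 0, 6), "\n";   # accctg: six characters starting at offset 0
print index($seq, "AAC"), "\n";   # 6: offset of the first occurrence (-1 if absent)
```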

12.2.3.2 Arrays

An array is an ordered list of data. In our sequence matching program, @matches is an array variable
used to store the starting locations of all the matches. Each element stored in an array can be accessed
by its position in the list, which is represented as a number. In Perl, array variables are given an @
prefix. For example, the following statement declares an array of numbers:

@a = ( 1, "4", 9 );



This statement declares an array of strings:

@names = ("T. Herman", "N. Aeschylus", "H. Ulysses", "Standish");

And this statement declares an array with both:

@mix = ("Caesar Augustus", "Tiberius", 18, "Caligula", "Claudius");

Note the syntax in the declarations: each element in the array is separated from its neighbors by a
comma, each of the strings is quoted, and (unlike American English) the comma appears outside of the
quotes.

Because an array is an ordered set of information, you can retrieve each element in an array according
to its number. The individual elements in an array are written as $this_array[i], where i is the index of
the array element being addressed. Note that i can be either a bare number (such as 21), or a numeric
scalar variable (such as $n) that contains a bare number. Here is a Perl statement that uses the print
operator to display the second number in @a and the third name in @names on the screen:

print "second number: $a[1]\n third name: $names[2]\n";

You may be wondering why the element numbers here are one less than what you might think they
should be. The reason is that positions in Perl arrays are numbered starting from zero. That is, the first
element in an array is numbered 0, the second element is numbered 1, and so on. That's why, in the
previous example, the second element in @a is addressed as $a[1]. This is an important detail to
remember; mistakes in addressing arrays due to missing that crucial zero element are easy to make.
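A short sketch pulls these pieces together: zero-based indexing, the length of an array, and appending with push:

```perl
#!/usr/bin/perl -w
# Zero-based indexing and a few common array operations.

my @bases = ("A", "C", "G");

print "$bases[0]\n";          # A: the first element is at index 0
print scalar(@bases), "\n";   # 3: an array in scalar context yields its length
print "$#bases\n";            # 2: index of the last element

push @bases, "T";             # append an element to the end of the array
print "@bases\n";             # A C G T
```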

12.2.3.3 Hashes

A hash is also known as an associative array because it associates a name (or key, as it's called in Perl)
with each piece of data (or value) stored in it. A real-world example of a hash is a telephone book, in
which you look up a person's name in order to find her telephone number. Our sequence matching
program doesn't use any hashes, but they can be quite handy in bioinformatics programs, as you'll see
in a later example. Perl uses the % prefix to indicate hash variables (e.g., %sequences). There are a
number of ways to declare a hash and its contents as a list of key/value pairs. Here is the syntax for one
declaration style:

%hash = (
    key1 => "value1",
    key2 => "value2", ...
    last_key => "last_value" );

A value can then be retrieved from this hash using the corresponding key, as follows:

$value = $hash{"key2"};

For example, you can declare a hash that relates each three-letter amino acid code to its one-letter
symbol:

my %three_to_one = (
    ALA => A, CYS => C, ASP => D, GLU => E,
    PHE => F, GLY => G, HIS => H, ILE => I,
    LYS => K, LEU => L, MET => M, ASN => N,
    PRO => P, GLN => Q, ARG => R, SER => S,
    THR => T, VAL => V, TRP => W, TYR => Y
);

The hash entry with the one-letter code for arginine can then be displayed using the following
statement:

print "The one-letter code for ARG is $three_to_one{ARG}\n";

Because there are many popular sequence databases, another place where hashes can be immediately
useful is in keeping track of which sequence ID in one database corresponds to a sequence ID in the
next. In the following example, we define a hash in which each of the keys is a GenBank identifier (GI)
number of a particular enzyme, and each value is the corresponding SWISS-PROT identifier of the
same enzyme. Using this hash, a program can take the more cryptic GI number and automatically find
the associated SWISS-PROT ID:

#!/usr/bin/perl -w
# define the hash relating GI numbers to SWISSPROT IDs
%sods = (
    g134606  => "SODC_DROME",
    g134611  => "SODC_HUMAN",
    g464769  => "SODC_CAEEL",
    g1711426 => "SODC_ECOLI" );

# retrieve a value from %sods
$genbank_id = "g134611";
$swissprot_id = $sods{$genbank_id};
print "$genbank_id is the same as $swissprot_id\n";

If you save the previous script to a file, make the file executable, and run it, you should see:

g134611 is the same as SODC_HUMAN

In the first part of this script, you are declaring the hash relating GenBank IDs to SWISS-PROT IDs. In
the second part, you access the information stored in that hash. The first step is to assign one of the
GenBank IDs to the variable $genbank_id. Then you can retrieve the SWISS-PROT ID that %sods has
associated with the string in $genbank_id, and store the SWISS-PROT ID in the variable
$swissprot_id. Finally, print the values of the two scalar variables. This example is obviously rather
contrived, but it should give you an idea of how useful hashes can be in bioinformatics programs.
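When you don't know the keys in advance, the keys function hands you all of them, which is how you'd walk through an entire lookup table like %sods:

```perl
#!/usr/bin/perl -w
# Print every GI-number/SWISS-PROT pair in the hash, sorted by key.

my %sods = (
    g134606  => "SODC_DROME",
    g134611  => "SODC_HUMAN",
    g464769  => "SODC_CAEEL",
    g1711426 => "SODC_ECOLI" );

foreach my $gi (sort keys %sods) {
    print "$gi is the same as $sods{$gi}\n";
}
```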

12.2.4 Loops

Now that we've talked about scalar, array, and hash variables in Perl, let's return to our sequence
matching program and talk about the other main programming construct it employs. A loop is a
programming device that repeatedly executes a specific set of commands until a particular condition is
reached. Our program uses a foreach loop to iterate through the search string:

foreach my $i (0..length $search_string){
    if( $target eq substr( $search_string, $i, length $target)){
        push @matches, $i;
    }
}

The first time through this loop, Perl starts at 0 and looks at the first six-letter combination in the search
string, compares it to the target string, and, if there is a match, records it in @matches. The second
cycle of the loop looks at letters 2-7, the third looks at letters 3-8, and so on. Perl stops executing this
sequence when it reaches the end of the search string. At this point, the loop is done, and the program
moves on to the next section, where it prints the results. Don't worry if you don't understand all the
code in the loop; all that's important right now is that you have a general understanding of what the
code is doing.
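foreach isn't the only loop Perl offers. A while loop repeats for as long as a condition remains true; here is the same sliding-window scan rewritten with while, using a short made-up sequence:

```perl
#!/usr/bin/perl -w
# The sliding-window scan expressed as a while loop.
# The search string here is a short, made-up example.

my $target        = "ACCCTG";
my $search_string = "CCACACCCTGACACCCTGCA";

my @matches;
my $i = 0;
while ($i <= length($search_string) - length($target)) {
    if ($target eq substr($search_string, $i, length $target)) {
        push @matches, $i;
    }
    $i++;
}
print "@matches\n";   # prints "4 12"
```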

12.2.5 Subroutines

Although we don't use them in any of our example programs, the use of subroutines in programs is
worth mentioning. All modern programming languages provide a way to bundle up a set of statements
into a subroutine so that they can be invoked concisely and repeatedly. In Perl, you can create a
subroutine with the sub declaration:
sub greet {
    my $name = shift;
    print "Hello, $name!\n";
}

Once this greet subroutine has been declared, you can invoke it as follows:

greet("world"); # Prints "Hello, world!"
greet("Per"); # Prints "Hello, Per!"

Here, "world" and "Per" are arguments: values passed into the subroutine, where they are then stored
in $name. Our greet subroutine just prints a single line and then returns. Usually, subroutines do
something a bit more complicated, possibly returning a value:

$length = calculate_length($sequence);

This sets $length to whatever the calculate_length( ) subroutine returns when provided with the single
argument $sequence. When a subroutine is used for its return value, it's often called a function.
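A calculate_length function like the one above might look like this; the name and behavior are hypothetical, chosen only to show how return works:

```perl
#!/usr/bin/perl -w
# A hypothetical calculate_length() that returns the number of bases
# in a sequence string, ignoring whitespace and line breaks.

sub calculate_length {
    my $sequence = shift;
    $sequence =~ s/\s+//g;    # strip whitespace before counting
    return length $sequence;
}

my $length = calculate_length("ACCCTG\nGGGTA");
print "$length\n";   # prints "11"
```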

12.3 Pattern Matching and Regular Expressions
A major feature of Perl is its pattern matching, and particularly its use of regular expressions. A regular
expression (or regex in the Perl vernacular) is a pattern that can be matched against a string of data. We
first encountered regular expressions in our discussion of the Unix command grep, back in Chapter 5.
grep, as you may recall, searches for occurrences of patterns in files. When you tell grep to search for a
pattern, you describe what you're looking for in terms of a regular expression. As you know, much of
bioinformatics is about searching for patterns in data.

Let's look at a Perl example. Say you have a string, such as a DNA sequence, and you want to make
sure that there are no illegal characters in it. You can use a regular expression to test for illegal
characters as follows:

#!/usr/bin/perl
# check for non-DNA characters
my $string = "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA";

if( $string =~ m/([^ATCG])/i) {
    print "Warning! Found: $1 in the string";
}

This program contains the regular expression [^ATCG]. Translated into English, the regular expression
says "look for characters in $string that don't match A, T, C, or G." (The i at the end of the statement
tells Perl to match case insensitively; that is, to pay no attention to case. Perl's default is to treat A
differently from a.) If Perl encounters something other than the declared pattern, the program prints out
the offending character. The output of this program is:

Warning! Found: * in the string

If instead you want to search for a particular combination of letters, like "CAT", you can change the
regular expression to read CAT:

#!/usr/bin/perl
# check for CATs
my $string =
    "CCACACCACACCCACACaCCCaCaCATCACACCACACACCACACTACACCCA*CACACACA";
if( $string =~ m/CAT/i ){
    print "Meow.";
}

The output of this modified program is:

Meow.


12.4 Parsing BLAST Output Using Perl
Now that you know enough about how Perl is written to understand these simple programs, let's apply
it to one of the most common problems in bioinformatics: parsing BLAST output. As you already
know, the result of a BLAST search is often a multimegabyte file full of raw data. The results of
several searches can quickly become overwhelming. But by writing a fairly simple program in Perl,
you can automate the process of looking for a single string or multiple strings in your data.

Consider the following block of data:

...
gb|AC005288.1|AC005288 Homo sapiens chromosome 17, clone hC... 268 2e-68
gb|AC008812.7|AC008812 Homo sapiens chromosome 19 clone CTD... 264 3e-67
gb|AC009123.6|AC009123 Homo sapiens chromosome 16 clone RP1... 262 1e-66
emb|AL137073.13|AL137073 Human DNA sequence from clone RP11... 260 5e-66
gb|AC020904.6|AC020904 Homo sapiens chromosome 19 clone CTB... 248 2e-62
>gb|AC007421.12|AC007421 Homo sapiens chromosome 17, clone hRPC.1030_O_14,
complete sequence
Query: 3407 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 3466
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1366 accgtcataaagtcaaacaattgtaacttgaaccatcttttaactcaggtactgtgtata 1425
Query: 3467 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 3526
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1426 tacttacttctccccctcctctgttgctgcagatccgtgggcgtgagcgcttcgagatgt 1485
Query: 3527 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 3586
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1486 tccgagagctgaatgaggccttggaactcaaggatgcccaggctgggaaggagccagggg 1545

Query: 3587 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 3646
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1546 ggagcagggctcactccaggtgagtgacctcagccccttcctggccctactcccctgcct 1605
Query: 3647 tcctaggttggaaagccataggattccattctcatcctgccttcatggtcaaaggcagct 3706
...

This is only a small portion of what you might find in a report from a BLAST search. (This is actual
data from a BLAST report. The entire file, blast.dat, is too large to reproduce here.) The first six lines
of this sample contain information about the BLAST search, as well as other "noise" that's of no
importance to the search. The next 13 lines, and the ones that follow them in the actual report, contain the
data to analyze. You want the Perl program to look at both the "Query" and "Sbjct" lines in this
BLAST report and find the number of occurrences of the following substrings:

• 'gtccca'
• 'gcaatg'
• 'cagct'
• 'tcggga'
• Missing data (represented by dashes in the nucleotide sequence strings)

At the same time, you want the program to ignore irrelevant material such as information about the
search and other noise. The program should then generate a report file called report.txt that describes
the findings for these strings.

In this program you need to create two very long scalar variables to represent each sequence for
searching. Let's call them $query_src and $sbjct_src. In any BLAST output, you'll notice that
sometimes the "Sbjct" and "Query" lines aren't contiguous; that is, there are gaps in the data. From a
programming perspective, the fact that the gaps exist isn't important; you simply want to read the
nucleotides into your scalars consecutively. Here is a sample portion of BLAST data:

Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217
|||| |||||||| ||||||||||||| |||| | ||||||| ||||||||
Sbjct: 11895 gagctcaggagtttgagaccagcctggggaacacggtgaaaccctgtctctac 11843
Query: 1170 caggagttcaagaccagcctg 1190
|||||||||||||||||||||
Sbjct: 69962 caggagttcaagaccagcctg 69942
Query: 1106 tggtggctcacacctgcaatcccagcact 1134
||||||||||| |||| ||||||||||||
Sbjct: 77363 tggtggctcacgcctgtaatcccagcact 77335

In spite of the fact that the line numbers aren't contiguous, the sequence for "Query" starts with
'gagccca' and still ends with 'agcact', and will be 103 (53 + 21 + 29) characters long. As you'll see
shortly, the program is designed to ignore the gaps (and the line numbers) and input the data properly.
Frequent BLAST users may also notice that in a full BLAST report, each sequence is grouped by
E-values. We are ignoring that (usually) important fact in the program.

The Perl program used to search for the five substrings can be broken down into three parts:

• Inputting the data and preparing it for analysis
• Searching the data and looking for the patterns
• Compiling the results and storing them in report.txt


Let's go through the program step by step. Here are the first few lines:

#!/usr/bin/perl
# Search through a large datafile, looking for particular sequences

use strict;

my $REPORT_FILE = "report.txt";
my $blast_file = $ARGV[0] || 'blast.dat';

unless ( -e $blast_file ) {
    die "$0: ERROR: missing file: $blast_file";
}

This code makes sure that the data is in good order. Since you'll be reading large amounts of data into
the variables, tell Perl to tighten up its rules with the line use strict;. This forces you to be more explicit
about how you want Perl to do things. use strict is particularly useful when developing large programs
or programs you want to modify and reuse. Go on to declare some variables, and in the last few lines,
tell Perl to make sure that the input file (blast.dat by default) actually exists.

In the next block of code, the program reads the sequences into variables:

# First, slurp all the Query sequences into one scalar. Same for the
# Sbjct sequences.
my ($query_src, $sbjct_src);

# Open the blast datafile and end program (die) if we can't find it
open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!";

# Go through the blast file line by line, concatenating all the Query and
# Sbjct sequences.
while (my $line = <IN>) {
    chomp $line;
    print "Processing line $.\n";

In this section you read all the "Query" sequences into one scalar variable, and the "Sbjct" sequences
into another. The program then opens the file for reading with:

open (IN, $blast_file) or die "$0: ERROR: $blast_file: $!";

or prints an error message if for some reason it can't find the file. Next, the program goes through the
file line by line, removing the line-break characters with chomp $line;. And finally, with a print
function, you ask the program to display the current row as it reads in the data.

Now that you have the data in memory, you need to sort through it and extract the material you want.
Remember that the datafile included a lot of superfluous material you want to ignore. To do that,
instruct Perl to consider only those lines that begin with Query or Sbjct. Now Query and Sbjct, in
addition to having the desired sequence data, also have line numbers of varying length you don't want.
In order to read the sequence data correctly, you must design the program in such a way that you skip
over line numbers no matter how many characters they have, and always land on the first nucleotide.
You'll notice in this line of data:

Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217

that there is a space between Query, the beginning line number, the sequence data, and the ending line
number. Since this happens to be true for all the query and subject lines, it becomes the key to how to
read the data correctly. To be sure you get only what you want, split all the query and subject data into
a four-column array called @words. That task is accomplished by the following lines of code:

    my @words = split /\s+/, $line;
    if ($line =~ /^Query/) {
        $query_src .= $words[2];
    } elsif ($line =~ /^Sbjct/) {
        $sbjct_src .= $words[2];
    }
}

# We've now read the blast file, so we can close it.
close IN;

Once you've read the data into @words, you then instruct the program to take the third element of
@words (index 2, which holds only nucleotide sequence data) and append it to the variables
$query_src and $sbjct_src. The program then closes the file. You now have just the data you want,
stored in a form you can use.
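To see exactly what split produces, here's a small standalone sketch (not part of the program itself) run on one sample line from the datafile:

```perl
#!/usr/bin/perl
use strict;

# Demonstrate how split breaks a BLAST line into columns.
my $line = "Query: 1165 gagcccaggagttcaagaccagcctgggtaacatgatgaaacctcgtctctac 1217";
my @words = split /\s+/, $line;

# @words now holds four elements:
#   $words[0]  "Query:"
#   $words[1]  "1165"  (starting line number, however many digits it has)
#   $words[2]  the nucleotide sequence itself
#   $words[3]  "1217"  (ending line number)
print "$words[2]\n";
```

Because split collapses any run of whitespace, the sequence always lands in $words[2], no matter how wide the line numbers are.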

The next part of the program performs the analysis:

# Now, look for these given sequences...
my @patterns = ('gtccca', 'gcaatg', 'cagct', 'tcggga', '-');

# ...and when we find them, store them in these hashes
my (%query_counts, %sbjct_counts);

# Search and store the sequences
foreach my $pattern (@patterns) {
    while ( $query_src =~ /$pattern/g ) {
        $query_counts{ $pattern }++;
    }
    while ( $sbjct_src =~ /$pattern/g ) {
        $sbjct_counts{ $pattern }++;
    }
}

The program sets up a loop that runs five times, once for each search string or pattern. Within each
iteration of the outer foreach loop, the program runs inner while loops that advance counters each time
they find a pattern match. The results are stored in separate hashes called %query_counts and
%sbjct_counts.
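The `while (... =~ /$pattern/g)` idiom is what does the counting: each successful match advances Perl's search position within the string, so the loop body runs once per non-overlapping occurrence, left to right. A minimal standalone sketch with a made-up sequence:

```perl
#!/usr/bin/perl
use strict;

# Count non-overlapping occurrences of a pattern using m//g in
# scalar context. The sequence here is invented for illustration.
my $seq   = "cagctacagctttcagct";
my $count = 0;
$count++ while $seq =~ /cagct/g;
print "$count\n";    # prints 3
```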

Here is the last section of the program, which produces the output:

# Create an empty report file
open (OUT, ">$REPORT_FILE") or die "$0: ERROR: Can't write $REPORT_FILE";

# Print the header of the report file, including
# the current date and time
print OUT "Sequence Report\n",
    "Run by O'Reilly on ", scalar localtime, "\n",
    "\nNOTE: In the following reports, a dash (-) represents\n",
    "      missing data in the chromosomal sequence\n\n",
    "Total length of 'Query' sequence: ",
    length $query_src, " characters\n", "Results for 'Query':\n";

# Print the Query matches
foreach my $key ( sort @patterns ) {
    print OUT "\t'$key' seen $query_counts{$key}\n";
}

print OUT "\nTotal length of 'Sbjct' sequence: ",
length $sbjct_src, " characters\n", "Results for 'Sbjct':\n";

# Print the Sbjct matches
foreach my $key ( sort @patterns ) {
    print OUT "\t'$key' seen $sbjct_counts{$key}\n";
}

close OUT;

__END__

This code compiles and formats the results and dumps them into a file called report.txt. If you open
report.txt, you see:

Sequence Report
Run by O'Reilly on Tue Jan 9 15:52:48 2001

NOTE: In the following reports, a dash (-) represents
missing data in the chromosomal sequence

Total length of 'Query' sequence: 1115 characters
Results for 'Query':
	'-' seen 7
	'cagct' seen 11
	'gcaatg' seen 1
	'gtccca' seen 6
	'tcggga' seen 1

Total length of 'Sbjct' sequence: 5845 characters
Results for 'Sbjct':
	'-' seen 12
	'cagct' seen 2
	'gcaatg' seen 6
	'gtccca' seen 1
	'tcggga' seen 6

In this example the results were sent to a file. You can just as easily ask Perl to generate an HTML-
coded file you can view with your web browser. Or you can make the process interactive and use Perl
to create a CGI script that generates a web form to analyze the data and give you back your results.

We've only scratched the surface in terms of what this sort of program can do. You can easily modify it
to look for more general patterns in the data or more specific ones. For example, you might search for
'tcggga' and 'gcaatg', but only count them if they are connected by 'cagct'. You also might search only
for breaks in the data. And after all the searches are complete, you might use Perl to automatically store
all the results in a database.
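The first variation can be sketched with a single regular expression. In the sketch below (the test string is invented, and "connected by" is assumed to mean that 'cagct' occurs somewhere between the two patterns), only the joined occurrence is counted:

```perl
#!/usr/bin/perl
use strict;

# Count 'tcggga' followed by 'gcaatg', but only when 'cagct' occurs
# somewhere between them. The sequence below is made up: the first
# half qualifies, the second half lacks the connecting 'cagct'.
my $seq = "tcgggaaacagctaagcaatg" . "tcgggaaagcaatg";

# The list-assignment trick counts how many times the pattern matched.
my $count = () = $seq =~ /tcggga.*?cagct.*?gcaatg/g;
print "$count\n";    # prints 1
```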



If you're feeling a little confused by all this, don't panic. We aren't expecting you to understand all the
code we've shown you. As we said at the beginning of the chapter, the purpose of the code isn't to teach
you to program in Perl, but to show you how Perl works, and also to show you that programming isn't
really all that difficult. If you have what it takes to design an experiment, then you have what it takes to
program in Perl or any other language.

12.5 Applying Perl to Bioinformatics
The good news is the more you practice programming, the more you learn. And the more you learn, the
more you can do. Programming in Perl is all about analyzing data and building tools. As we've said
before, biological data is proliferating at an astounding rate. The only chance biologists have of
keeping up with the job of analyzing it is by developing libraries of reusable software tools. In Perl,
there are a huge number of reusable functions available for use in your programs. Rather than being
wrapped in a complete program, a group of related functions are packaged as a module. In your
programs, you can use various modules to access the functions they support. There are Perl modules for
other scientific disciplines, as well as games, graphics programming, video, artificial intelligence,
statistics, and music. And they're all free.

To distinguish them from other Perl files, modules have a .pm suffix. To use functions from a module
(say, CGI.pm) in your Perl programs, include the following line after the shebang line:

use CGI;

The primary source of all modules is the Comprehensive Perl Archive Network, or CPAN. CPAN
(http://www.cpan.org) is a collection of sites located around the world, each of which mirrors the
contents of the main CPAN site in Finland. To find the CPAN site nearest you, check the Perl web site
(http://www.perl.com).

Because there are so many modules available, before you sit down to write a new function, it is worth
your time to check the CPAN archive to see if anyone has already written it for you. In this section, we
briefly describe some Perl modules that are particularly useful for solving common problems in
bioinformatics. This list is by no means comprehensive; you should keep an eye on CPAN for current
developments.

12.5.1 Bioperl

The Bioperl Project (along with its siblings, Biopython, BioJava, and Bioxml) is dedicated to the
creation of an open source library of modules for bioinformatics research. The general idea is that
common items in bioinformatics (such as sequences and sequence alignments) are represented as
objects in Bioperl. Thus, if you use Bioperl, instead of having to constantly rewrite programs that read
and write sequence files, you simply call the appropriate functions from Bio::SeqIO, and dump the
resulting sequence data into a sequence object.
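As an illustrative sketch of what that looks like (assuming Bioperl is installed; the filename seqs.fasta is hypothetical), reading a FASTA file with Bio::SeqIO takes just a few lines:

```perl
#!/usr/bin/perl
use strict;
use Bio::SeqIO;

# Open a FASTA file for reading; the filename is hypothetical.
my $in = Bio::SeqIO->new( -file   => "seqs.fasta",
                          -format => "Fasta" );

# Each call to next_seq returns a sequence object with accessor
# methods, so there is no file-format parsing to write yourself.
while ( my $seq = $in->next_seq ) {
    print $seq->display_id, ": ", $seq->length, " bases\n";
}
```

Swapping -format to another supported value (for example, "GenBank") is all it takes to read a different sequence file format with the same loop.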

Bioperl isn't limited to storing sequences: it currently contains modules for generating and storing
sequence alignments, managing annotation data, parsing output from the sequence-database search
programs BLAST and HMMer, and has other modules on the way. In addition to the core Bioperl
distribution, the ScriptCentral script repository at the Bioperl web site (http://www.bioperl.org) hosts a
collection of biology-related scripts. To learn more about downloading, installing, and using Bioperl,
see http://www.bioperl.org.
12.5.2 CGI.pm

CGI.pm is a module for programming interactive web pages. The functions it provides are geared
toward formatting web pages and creating and processing forms in which users enter information. If
you have used the Web, you almost certainly have used web pages written using CGI.pm. For example,
let's create a page that asks the user what his favorite color is using an HTML form. When the user
enters the data, the script stores it in a field named 'color'. When the user hits "Submit," the same page
is loaded, only this time, $query->param('color') contains the name of a color, so the print statement
after the "else" is executed. The CGI script looks like this:

#!/usr/bin/perl

use CGI;                  # Load Perl's CGI module

my $query = new CGI;      # Create a CGI object named $query

# Send the HTML header and <HTML> tag
print $query->header, $query->start_html;

# If the user is visiting the site for the first time, ask him
# what his favorite color is

unless ($query->param('color')) {   # Page 1: Asking the user
    print $query->start_form, "What is your favorite color? ",
        $query->popup_menu( -name   => "color",
                            -values => ["red", "green", "blue"] ),
        $query->submit,
        $query->end_form;
} else {                            # Page 2: Telling the user
    print "Your favorite color is ", $query->param('color');
}

print $query->end_html;   # Send the </HTML> tag

The results of this script are shown in Figure 12-1.

Figure 12-1. Our CGI script generates an interactive web page




12.5.3 LWP

If CGI.pm automates the Web from the server's perspective, the Library for Web Programming (LWP)
automates web interaction from the perspective of the client. Using LWP, Perl programs can submit
data to forms, retrieve web pages, and eliminate much of the tedium of manually interacting with web
services through a browser. For example, let's say you want to retrieve and print out the HTML source
for the main page of http://www.oreilly.com. You can use the LWP::Simple module as follows:
#!/usr/bin/perl
use LWP::Simple;
print get("http://www.oreilly.com");

This retrieves the HTML source code for http://www.oreilly.com and displays it on your screen.

12.5.4 PDL

The Perl Data Language (which is abbreviated PDL and pronounced "piddle") is a module for doing
math with matrices. It is frequently used for scientific applications and image processing in conjunction
with the GIMP (since computer representations of images are just matrices). In computational biology,
PDL is invaluable for working with microarray expression data and scoring matrices, as well as data
that begins as images. For example, 2D gels that measure protein expression are usually stored
as images, and image processing tricks can locate and compare gel features.

Why do you need a whole library to do linear algebra with Perl? PDL allows you to work with matrices
of arbitrary dimensionality as if they were scalar variables. For example, a 2D matrix constructed using
standard Perl arrays looks like $a[$i][$j]. If you wanted to add two array-based matrices (let's call them
@a and @b) and store the result to another matrix, @c, you have to write code that looks like this:

for ( $i = 0; $i < $row_max; $i++ ) {
    for ( $j = 0; $j < $col_max; $j++ ) {
        $c[$i][$j] = $a[$i][$j] + $b[$i][$j];
    }
}

so that you end up writing two loops, the outer one to iterate over each of the rows, and the inner to
iterate over each column. With PDL, you simply write:

$c = $a + $b;

In other words, when you define your multidimensional arrays as piddles (PDL's name for its matrix
data object) instead of Perl arrays, PDL makes it look like you are working with simple scalar objects,
even if you are working with several-megabyte matrices. In addition, PDL comes with an interactive
mode called perldl that is useful for trying out calculations with PDL, similar to the interactive modes
provided by the numerical data analysis packages R and Octave (which we will meet in Chapter 14).

12.5.5 DBI

DBI (short for database interface) is a module for writing programs that interact with relational
databases. It allows you to write programs that put data into databases, query databases, and extract
records from databases, without ever having to pay attention to the specific database you are using. For
example, a script written with DBI can be used with a MySQL database or an Oracle database with
only minor changes.
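A minimal DBI sketch might look like the following; the data source name, login, and the proteins table are all hypothetical, and only the driver string in the connect call would change to target a different RDBMS:

```perl
#!/usr/bin/perl
use strict;
use DBI;

# Connect to a database. The data source, user, password, and the
# 'proteins' table below are hypothetical.
my $dbh = DBI->connect( "dbi:mysql:structures", "user", "password",
                        { RaiseError => 1 } );

# Placeholders (?) let the driver quote values safely.
my $sth = $dbh->prepare("SELECT id, name FROM proteins WHERE name = ?");
$sth->execute("hemoglobin");

# Fetch matching rows one at a time.
while ( my ($id, $name) = $sth->fetchrow_array ) {
    print "$id\t$name\n";
}

$dbh->disconnect;
```

To run the same script against Oracle, you would change "dbi:mysql:structures" to the corresponding Oracle data source string; the query code itself stays the same.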

12.5.6 GD

The GD.pm module allows you to generate graphics using Perl programs. GD is often used to create
simple, customized plots on web pages, such as web server usage statistics. PaintBlast.pm, a module
that generates graphical representations of sequence alignments from BLAST output, is an example of
a GD application. It's available from Bioperl's ScriptCentral.




Chapter 13. Building Biological Databases
Since the advent of the World Wide Web, biological databases have become a vital part of the
biological literature. Knowing how to find information in and download information from the central
biological data repositories is as important a skill for researchers now as traditional literature searching.
Major online data resources, such as the Protein Data Bank and GenBank, are expertly designed to
provide information to users who have no understanding of how the underlying databases function, and
to allow the deposition of data to a central repository by people who wouldn't know how to, or want to,
build their own private databases.

However, as web databases become more integral to sharing information within the scientific
community, it is likely that more people will want to develop their own databases and allow their
colleagues to access their data directly. Even something as simple as a web site for a research group
can be improved greatly and made easier to maintain with databases. In this chapter, we introduce some
elementary database terminology and give an example of how to set up a database for a simple data set.

If you're relatively new to the world of computers and software, you're not going to be able to read this
chapter and proceed directly to setting up your own database. What we hope to give you is an idea of
the steps involved in developing a database: designing a data model, choosing a database management
system (DBMS), implementing your data model, and developing a user-friendly frontend to your
database. What this chapter will give you is a general understanding of the issues in database
development. That understanding will help you to move forward, whether on your own or with the help
of a database expert.

You don't need to understand what makes a database tick in order to use it. However, providing access
via the Web to data you generate is becoming more and more important in the biology community, and
to do that you have to have at least a rudimentary knowledge of how databases work. Even if you've
got enough money lying around the lab to spring for your own Oracle administrator, you still need to
speak the language.

13.1 Types of Databases
There are two main types of database management systems: flat file indexing systems and relational
DBMSs. A third type, the object-oriented DBMS, is beginning to increase in popularity. Choosing
between a flat file indexing system and a relational database system is an important decision that will
have long-range implications for the capacity and usefulness of your database.

13.1.1 Flat File Databases

Flat file databases are the easiest type of database for nonexperts to understand. A flat file database isn't
truly a database, it's simply an ordered collection of similar files, usually (but not always) conforming
to a standard format for their content. The emphasis in formatting data for a flat file database is at the
character level; that is, at the level of how the data would appear if it were printed on a page.

A collection of flat files is analogous to having a large filing cabinet full of pieces of paper. Flat file
databases are made useful by ordering and indexing. A collection of flat files on a computer filesystem
can be ordered and stored in labeled folders exactly the same way as a collection of printed papers are
ordered in a file cabinet drawer (Figure 13-1). When we suggested, in an earlier chapter, using the
hierarchical nature of your filesystem and a sensible file-naming scheme to keep track of your files,
what we were essentially encouraging you to do is to develop a rudimentary flat file database of your
work. Creating a database means you need to remember only the rules of the database, rather than the
locations of individual files, and so you can find your way around more easily.

Figure 13-1. The relationship of a flat file to a flat file database




Flat file databases are often made searchable by indexing. An index pulls out a particular attribute from
a file and pairs the attribute value in the index with a filename and location. It's analogous to a book
index, which for example tells you where in a book you will find the word "genome." Like book
indexes, database indexes need to be carefully designed so that they point to a word only when it
occurs in an informative context. Database indexes take note of context by separately indexing
different fields within the file. The word "cytochrome" occurring in the Molecule Name field in a
protein structure file is likely to be far more significant to the user than the same word occurring only
in the file remarks. In the first context, finding the word "cytochrome" guarantees the file contains
information for some kind of cytochrome molecule. In the second context, the word can appear as part
of an article title or a comment about intermolecular interactions, even though the structure in the file
actually belongs to a different molecule. If multiple indexes for a file are created, you can then search a
particular index file based on keywords, which is less cumbersome than searching all the actual files in
the database file by file.
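In Perl terms, such an index is simply a mapping from a field's value to the files that contain it. Here's a toy sketch (the filenames and the MOLECULE field format are invented for illustration):

```perl
#!/usr/bin/perl
use strict;

# A toy index over a "Molecule Name" field. In a real system the
# field would be read from the files themselves; here the filenames
# and contents are invented.
my %molecule_field = (
    "1abc.ent" => "MOLECULE: CYTOCHROME C",
    "4hhb.ent" => "MOLECULE: HEMOGLOBIN",
);

# Build the index: molecule name -> list of files containing it.
my %index;
for my $file ( keys %molecule_field ) {
    if ( $molecule_field{$file} =~ /^MOLECULE:\s+(.+)/ ) {
        push @{ $index{ lc $1 } }, $file;
    }
}

# A keyword lookup now consults the index, not every file:
print join( " ", @{ $index{"hemoglobin"} || [] } ), "\n";
```

Because the index was built only from the Molecule Name field, a match here guarantees the hit is a molecule name, not a stray word in the file's remarks.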

There's nothing inherently bad about flat file databases. They do organize data in a sensible way, and
with the proper indexing they can be made extensively searchable. However, as flat file collections
grow larger and larger, working with them becomes inefficient. An index is one-dimensional, so it is
difficult (though not impossible) to make connections between attributes within an indexed flat file
database.

13.1.1.1 Flat file databases in biology

Many of the popular biological databases began as flat file databases, and it's because of their legacy
that many of the programs and software packages we discussed in previous chapters have strict internal
requirements for the line format of input data.

For example, the PDB began by using flat files in the well-known PDB format. The format of these flat
files was designed to be read easily by FORTRAN programs, and in fact has its roots in the time when
computer input data was encoded on punch cards. When there were just a few protein structure files,
maintaining this database and accessing it was no problem. The PDB did not grow beyond a few
hundred files until 1990, nearly 20 years after its inception.



As PDB growth increased in the 1990s, new solutions for storing data needed to be found. In practical
terms, the full listing of the database was starting to be so long that, if a user entered a directory
containing all the available PDB files and tried to list filenames, it could take several seconds to even
produce a file list. Reading the contents of large directories slows down even simple Unix tools such as
ls, and it is even more of a problem for computer programs that might repeatedly read a directory. At
first, the PDB was split into subdirectories based on the letters of the PDB code. But as the database
approached 8,000 entries, even that began to prove too cumbersome.

The PDB now uses an object-oriented database backend (the part of the operation that resides on the
PDB servers and that users don't see) to support database queries and file access. However, files are
still made available in the legacy PDB format, so that users can continue to work with software that
was developed long before the PDB was modernized.

Beyond the PDB, flat file databases are still widely used by biologists. Many users of biological
sequence data store and access sequences locally using the Sequence Retrieval System (SRS), a flat
file indexing system designed with biological data in mind.

13.1.2 Relational Databases

Like flat file databases, relational databases are just a way of collecting all the information about
something and storing it in a computer. In a flat file database, all the information about the thing is
stored in one big structured text file. In a relational database, the information is stored in a collection of
tables.

The flat file that describes a protein structure is like a bound book. There are chapters about the origin
of the sample, how the data was collected, the sequence, the secondary structure, and the positions of
the atoms.

In a relational database, the information in each chapter is put into separate tables, and instead of
having its own book, each protein has its own set of tables. So, there are tables of experimental
conditions, secondary structure elements, atomic positions, etc. All these tables are labeled with the
identity of the protein they describe, so that connections can be made between them, but they aren't
bound together like a book. The form of the tables follows rules that are uniform across the database, so
you can access all the tables about atomic positions or all the chapters about experimental conditions at
once, just as easily as you can access all the tables about a particular protein.

If you're interested in only one particular protein, it's not at all inconvenient to go to the library (the
PDB), look the book up in the catalog, and read it straight through. The librarian can pick a few items
of information out of the book (such as the name of the protein, the author who deposited it, etc.) and
put them in an index (like a card catalog) that will help you find where the book is on the shelf.

But what if you're interested in getting the secondary structure chapter out of every book in the protein
library? You have to go to the library, take down every book from the shelf, photocopy the secondary
structure chapter, and then convert that information into a form that you can easily analyze.

A relational database management system (RDBMS) allows you to view all of the protein structure
data in the database as a whole. You can "look" at the database from many different "angles," and
extract only the information you need, without actually photocopying a particular chapter out of each
book. Since each separate item of information about the protein is stored in its own separate table in the
database, the RDBMS can assemble any kind of book about proteins you want, on the fly. If you want a
book about hemoglobin, no problem. Even better, it is just as easy for the RDBMS to make you a book
about the secondary structures of all proteins in the database.

All you need to do is figure out how to structure the right query to get back what you want from the
database. If you want a book about hemoglobin, you can tell the RDBMS "if protein name equals
hemoglobin then give me all information about this protein." If you want a book that describes only the
secondary structure of each hemoglobin entry in the database, you can tell the RDBMS "if protein
name equals hemoglobin then give me the secondary structure table about this protein."

13.1.2.1 How tables are organized

Data in a relational database table is organized in rows, with each row representing one record in the
database. A row may contain several separate pieces of information (fields). Each field in the database
must contain one distinct piece of information. It can't consist of a set or list that can be further broken
into parts.

The tables in a relational database aren't just glorified flat files, though they may look that way if you
print them out. Rows are synonymous with records, not with 80 characters on a line. Fields in each row
aren't limited by a number of characters; they end where the value in the field ends. The job of the
RDBMS is to make connections between related tables by rapidly finding the common elements that
establish those relationships.

You can get an idea of the difference between data organized into tables and character-formatted flat
file data by comparing the two types of protein structure datafiles available from the PDB. The
standard PDB file is ordered into a series of 80 character lines. Each line is labeled, but especially in
the header, the information associated with a label is quite heterogeneous. For example:

REMARK 1 4HHB 14
REMARK 1 REFERENCE 1 4HHB 15
REMARK 1 AUTH M.F.PERUTZ,S.S.HASNAIN,P.J.DUKE,J.L.SESSLER, 4HHB 16
REMARK 1 AUTH 2 J.E.HAHN 4HHB 17
REMARK 1 TITL STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN 4HHB 18
REMARK 1 REF NATURE V. 295 535 1982 4HHB 19
REMARK 1 REFN ASTM NATUAS UK ISSN 0028-0836 006 4HHB 20
REMARK 1 REFERENCE 2 4HHB 21
REMARK 1 AUTH G.FERMI,M.F.PERUTZ 4HHB 22
REMARK 1 REF HAEMOGLOBIN AND MYOGLOBIN. V. 2 1981 4HHB 23
REMARK 1 REF 2 ATLAS OF MOLECULAR 4HHB 24
REMARK 1 REF 3 STRUCTURES IN BIOLOGY 4HHB 25
REMARK 1 PUBL OXFORD UNIVERSITY PRESS 4HHB 26
REMARK 1 REFN ISBN 0-19-854706-4 986 4HHB 27

In the PDB reference records shown here, you can see that entries in each row aren't distinct pieces of
information, nor are the rows uniform. Sometimes there are four author names on one line; sometimes
there are two. Sometimes there are three title lines; sometimes there is only one. This can cause
difficulties in parsing, or reading the header with a computer program.
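To make the parsing problem concrete, here is a standalone sketch (not taken from any PDB software) that extracts author names from REMARK 1 AUTH records like those above; note the extra regular expression machinery needed just to cope with continuation records such as "AUTH 2":

```perl
#!/usr/bin/perl
use strict;

# Extract author names from PDB REMARK 1 AUTH records, including
# continuation records ("AUTH 2"). The sample records are abridged
# from the 4HHB header.
my @authors;
while ( my $line = <DATA> ) {
    if ( $line =~ /^REMARK\s+1\s+AUTH\s*(\d*)\s+(\S+)/ ) {
        # The author field is a comma-separated run with no spaces,
        # possibly ending in a trailing comma before a continuation.
        push @authors, grep { length } split /,/, $2;
    }
}
print join( "\n", @authors ), "\n";

__DATA__
REMARK   1  AUTH   M.F.PERUTZ,S.S.HASNAIN,P.J.DUKE,J.L.SESSLER,     4HHB  16
REMARK   1  AUTH 2 J.E.HAHN                                         4HHB  17
```

Every heterogeneous record type in the header needs its own ad hoc pattern like this one, which is exactly the burden a relational representation removes.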

Compare this to an mmCIF file. mmCIF is a new data standard for results of X-ray crystallography
experiments. Protein structures have been available from the PDB in mmCIF format since the
management of the PDB changed in 1999.

Before you see any data in the mmCIF file, you see what looks almost like a series of commands in a
computer program, lines that describe how the data in the file is to be read. Then you'll see tables of
data. Here's an example:

loop_
_citation.id
_citation.coordinate_linkage
_citation.title
_citation.country
_citation.journal_abbrev
_citation.journal_volume
_citation.journal_issue
_citation.page_first
_citation.year
_citation.journal_id_ASTM
_citation.journal_id_ISSN
_citation.journal_id_CSD
_citation.book_title
_citation.book_publisher
_citation.book_id_ISBN
_citation.details

primary yes
; THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT
1.74 ANGSTROMS RESOLUTION
;
UK 'J.MOL.BIOL. ' 175 ? 159 1984
'JMOBAK ' '0022-2836 ' 070 ? ? ? ?

1 no
; STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN
;
UK 'NATURE ' 295 ? 535 1982
'NATUAS ' '0028-0836 ' 006 ? ? ? ?

2 no
? ?? 2? ? 1981 ? ? 986
; HAEMOGLOBIN AND MYOGLOBIN.
ATLAS OF MOLECULAR
STRUCTURES IN BIOLOGY
;
; OXFORD UNIVERSITY PRESS
;
'0-19-854706-4 '?

An mmCIF file contains dozens of tables that are all "about" the same protein.

The opening lines of the reference section in the mmCIF file (which is just a flat representation of the
collection of tables that completely describes a protein structure) describe what the fields in each
upcoming row in the table will mean. Rows don't begin arbitrarily at character 1 and end at character
80; they may stretch through several "lines" in the printout or onscreen view of the data. Rows don't
end until all their fields are filled; when information is missing (as in the previous example), the fields
have to be filled with null characters, such as a question mark or a space.

In the protein database, the table of literature references that describes a particular structure is
associated with a particular PDB ID. However, there are other tables associated with that PDB ID as
well, and they have totally different kinds of rows from the reference table. The atomic positions that
describe a protein structure are contained in a separate table with a completely different format:

loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1
ATOM N N VAL A 1 . 6.204 16.869 4.854 7.00 49.05 . 1 1 1
ATOM C CA VAL A 1 . 6.913 17.759 4.607 6.00 43.14 . 1 2 1
ATOM C C VAL A 1 . 8.504 17.378 4.797 6.00 24.80 . 1 3 1
ATOM O O VAL A 1 . 8.805 17.011 5.943 8.00 37.68 . 1 4 1
ATOM C CB VAL A 1 . 6.369 19.044 5.810 6.00 72.12 . 1 5 1
ATOM C CG1 VAL A 1 . 7.009 20.127 5.418 6.00 61.79 . 1 6 1
ATOM C CG2 VAL A 1 . 5.246 18.533 5.681 6.00 80.12 . 1 7 2
ATOM N N LEU A 2 . 9.096 18.040 3.857 7.00 26.44 . 1 8 2
ATOM C CA LEU A 2 . 10.600 17.889 4.283 6.00 26.32 . 1 9 2
ATOM C C LEU A 2 . 11.265 19.184 5.297 6.00 32.96 . 1 10 2
ATOM O O LEU A 2 . 10.813 20.177 4.647 8.00 31.90 . 1 11 2
ATOM C CB LEU A 2 . 11.099 18.007 2.815 6.00 29.23 . 1 12 2
ATOM C CG LEU A 2 . 11.322 16.956 1.934 6.00 37.71 . 1 13 2
ATOM C CD1 LEU A 2 . 11.468 15.596 2.337 6.00 39.10 . 1 14 2
ATOM C CD2 LEU A 2 . 11.423 17.268 .300 6.00 37.47 . 1 15

The values in the atom table are clearly related to the values in the reference table; they both contain
information about the same PDB structure. However, the two types of data can't just be put together
into one big table. It doesn't make sense to put the reference information into the same scheme of rows
and columns the atom information goes into, either by tacking it on at the "bottom" of the table or by
adding extra columns (although in flat files we are forced to do exactly that!). The two datatypes are
related, but orthogonal to each other.

Anywhere in a set of information where it becomes impossible to sensibly tack rows or columns onto a
table, a new table needs to be created. Tables within a database may have interconnections only at the
topmost level, such as the atom and reference information related to the same PDB file, or they may be
more closely linked.[1]

[1]
The technical term for the process of separating a complex data set into a collection of mutually orthogonal, related tables is normalization. For a rigorous
discussion of relational database theory, see the pertinent references in the Bibliography.


You may notice that in the reference records shown earlier, authors' names aren't listed. How can that
be? Well, the answer is that they're in a separate table. Because each reference can have an arbitrary
number of separate authors, that information can't just be tacked onto the reference table by adding a
fixed number of extra rows or columns. So there's a separate table for authors' names:

loop_
_citation_author.citation_id
_citation_author.name
primary 'Fermi, G.'
primary 'Perutz, M.F.'
primary 'Shaanan, B.'
primary 'Fourme, R.'
1 'Perutz, M.F.'
1 'Hasnain, S.S.'
1 'Duke, P.J.'
1 'Sessler, J.L.'
1 'Hahn, J.E.'
2 'Fermi, G.'
2 'Perutz, M.F.'
3 'Perutz, M.F.'
4 'TenEyck, L.F.'
4 'Arnone, A.'
5 'Fermi, G.'
6 'Muirhead, H.'
6 'Greer, J.'

This table is related to the previous reference table through the values in column 1, which match up
with the citation IDs in the other reference table. To get from "Fermi, G." to "THE CRYSTAL
STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 ANGSTROMS RESOLUTION" in
this database, you connect through the citation ID, which specifies the relationship between the two
entities.
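That connection is easy to express as a SQL join. Here's a minimal sketch, again with sqlite3 standing in for the RDBMS and only a couple of rows loaded; the table and column names are illustrative:

```python
# Following a relationship between two tables through a shared key.
# sqlite3 stands in for the RDBMS discussed in the text.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE citation (citation_id TEXT, title TEXT)")
cur.execute("CREATE TABLE citation_author (citation_id TEXT, name TEXT)")

cur.execute("INSERT INTO citation VALUES ('primary', "
            "'THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN "
            "AT 1.74 ANGSTROMS RESOLUTION')")
cur.executemany("INSERT INTO citation_author VALUES (?, ?)",
                [("primary", "Fermi, G."), ("primary", "Perutz, M.F.")])

# To get from an author's name to a title, connect through citation_id:
cur.execute("""SELECT c.title
               FROM citation c
               JOIN citation_author a ON c.citation_id = a.citation_id
               WHERE a.name = 'Fermi, G.'""")
print(cur.fetchone()[0])
```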

Using an RDBMS may at first seem like overthinking what could be a pretty simple set of data to
store. If you ever write programs that operate on the antiquated flat-file PDB format, though, you'll
realize how useful it is to unambiguously assign your data to tables in a relational database.
Among other things, databases eliminate the complicated line-format statements and parsing
operations that are required when using 80-character-formatted files.
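To see what that kind of line-format parsing looks like, here is a sketch of pulling fields out of a single fixed-column ATOM record; the column positions follow the PDB format specification, but this is an illustration, not a complete parser:

```python
# Parsing one fixed-column ATOM record from a flat-file PDB entry.
# Every field lives at a hard-coded column position (a sketch, not a
# full parser); a relational table makes all of this unnecessary.
line = "ATOM     13  CG  LEU A   2      11.322  16.956   1.934  6.00 37.71"

record = {
    "name":    line[12:16].strip(),   # atom name (columns 13-16)
    "resname": line[17:20].strip(),   # residue name (columns 18-20)
    "chain":   line[21].strip(),      # chain identifier (column 22)
    "resseq":  int(line[22:26]),      # residue sequence number (23-26)
    "x":       float(line[30:38]),    # orthogonal coordinates (31-54)
    "y":       float(line[38:46]),
    "z":       float(line[46:54]),
}
print(record["name"], record["resname"], record["x"])
```

One misplaced column in the slice offsets, and every coordinate comes out wrong; this is the fragility the text is warning about.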

13.1.2.2 The database schema

The network of tables and relationships between them that makes up a database is called the database
schema. For a database to keep its utility over time, it's best to carefully develop the schema before you
even think about beginning to populate the database. In the example later in this chapter, we develop a
schema for a simple database.

Getting your brain around database schemas and tables can be a challenge even before you try to
come up with your own schema. However, relational databases are the standard for large database
operations, and understanding RDB concepts is necessary for anyone who wants to build her own.
Before designing your own database, you should definitely consult a reference that covers relational
databases rigorously.

13.1.3 Object-Oriented Databases



You'll hear the phrase object oriented in connection with both programming languages and databases.
An object-oriented database system is a DBMS that is consistent with object-oriented programming
principles. Some important characteristics of object-oriented databases are: they are designed to handle
concurrent interactions by multiple clients; they can handle complex objects (beyond tables of
character data); and they are persistent, that is, they survive the execution of a process. In practice,
because of the popularity of object-oriented programming strategies, most of the major relational
DBMSs are compatible with an object-oriented approach to some extent.

The practical upshot of the object-oriented approach in the database world is the emergence of DBMSs
that are flexible enough to store more than just tables and to handle functions beyond those in a rigidly
defined query-language vocabulary. Since object-oriented databases handle data as objects rather than
as tables, an object-oriented database can provide access to everything from simple text-format data to
images and video files within the same database. Object-oriented databases don't force the use of the
SQL query language, but rather provide flexible bindings to programming languages. Many DBMSs
are beginning to have both object and relational characteristics, but the giants of the DBMS world are
still primarily relational DBMSs.

13.2 Database Software
Databases don't just happen: they're maintained by DBMSs. There are several DBMSs, some open
source and some commercial. There are flat file indexing systems, RDBMSs, object DBMSs
(ODBMSs), and object-relational hybrids. Which DBMS you use depends on what you can afford, how
comfortable you are with software, and what you want to do.

13.2.1 Sequence Retrieval System

Even if you've decided to work with a flat file indexing and retrieval system, you don't need to reinvent
the wheel. The Sequence Retrieval System (SRS) is a popular system for flat file management that has
been extensively used in the biology community, in both corporate and academic settings. SRS was
developed at EMBL specifically for use in molecular biology database applications, and is now
available as a commercial product from Lion Bioscience, http://www.lionbioscience.com. It is still
offered for free to researchers at academic institutions, along with extensive documentation (but no
tech support). A common application of the SRS database is to maintain a local mirror of the major
biological sequence databases. The current release is SRS 6.

SRS can be installed on SGI, Sun, Compaq, or Intel Linux systems. To maintain your own SRS
database and mirror the major biological databases requires tens of gigabytes of disk space, so it's not
something to be taken on lightly. SRS has built-in parsers that know how to read EMBL nucleotide
database files, SWISS-PROT files, and TrEMBL files. It's also possible to integrate other databases
into SRS by using SRS's Icarus language to develop additional databank modules. For an example of
the variety of databases that can be integrated under an SRS flat file management system, you only
have to look at the SDSC Biology Workbench. Until its most recent release, SRS was the DBMS used
within the Biology Workbench, and supported nearly the full range of databases now integrated into the
Workbench.

13.2.2 Oracle



Oracle is the 18-wheeler of the RDBMS world. It's an industry-standard, commercial product with
extremely large capacity. It's also rapidly becoming a standard for federally funded research projects.
Oracle has some object capabilities as well as extensive relational capabilities. Potential Oracle customers
can now obtain a license to try Oracle for free from http://www.oracle.com. If you want to provide a
large-scale data resource to the biology community, you may need an Oracle developer (or a bunch of
them) to help you implement it.

13.2.3 PostgreSQL

PostgreSQL is a full-featured object-relational DBMS that supports user-defined datatypes and
functions in addition to a broad set of SQL functions and types. PostgreSQL is an open source project,
and the source code can be downloaded for free from http://www.postgresql.org, which also provides
extensive online documentation for the DBMS. PostgreSQL can also be found in most standard Linux
distributions. If you plan to create a database that contains data of unusual types and you need a great
degree of flexibility to design future extensions to your database, PostgreSQL may meet your needs
better than MySQL. PostgreSQL is somewhat limited in its capacity to handle large numbers of
operations, relative to Oracle and other commercial DBMSs, but for midrange databases it's an
excellent product.

13.2.4 Open Source Object DBMS

Several efforts to develop open source ODBMSs are underway as of this writing. One of the highest-
profile of these is the Ozone project (http://www.ozone-db.org). Ozone is completely implemented in
Java and designed for Java developers; queries are implemented in the underlying language rather than
in SQL. One emphasis in Ozone development is object persistence, the ability of the DBMS to
straightforwardly save the state of a data object as it is affected by transactions with the database user.
Like many ODBMSs, Ozone is in a relatively early stage of development and may not be particularly
easy for a new user to understand. Unless you have a compelling reason to use object-oriented
principles in developing your database, it's probably wise to stick with relational database models until
object technology matures.

13.2.5 MySQL

MySQL is an open-source relational DBMS. It's relatively easy to set up and use, and it's available for
both Unix and Windows operating systems. MySQL has a rich and complex set of features, and it's
somewhat different from both PostgreSQL and Oracle, two other popular RDBMSs. Each system
recognizes a different subset of SQL datatypes and functions, and none of them recognizes 100% of the
possible types. In some cases, MySQL sets lower limits on the number of operations allowed than
PostgreSQL and Oracle do, so it's considered suitable for small and medium-sized database
applications rather than for heavy-duty database projects. However, this isn't a hard-and-fast rule: it depends on what
you plan to do with the data in your database. MySQL is strictly a relational DBMS, so if you plan to
store unusual datatypes, it may not be the right DBMS for you. For most standard database
applications, however, MySQL is an excellent starting point.

MySQL's developers claim that it can manage large databases faster than other RDBMSs. While their
benchmarks seem to bear out this claim, we haven't independently evaluated it. What we can say is that
it's possible to learn to use MySQL and have a rudimentary database up and running within a few hours
to a few days, depending on the user's level of experience with Unix and SQL.

13.3 Introduction to SQL
As a practical matter, you are most likely to work either with specialized flat file database systems for
biological data, like SRS, or with some kind of RDBMS. In order to work with an RDBMS, you need
to learn something about SQL.

SQL, or Structured Query Language (usually pronounced "see-kwl" by those in the know, "ess-que-ell"
by literalists, and "squeal" by jokers) is the language RDBMSs speak. SQL commands are issued
within the context of a DBMS interface; you don't give a SQL command at the Unix command line.
Instead, you give the command to the DBMS program, and the program interprets the command.

SQL commands can be passed to the DBMS by another program (for instance, a script that collects
data from a web form) or hand-entered. Obviously, the first option is ideal, especially for entering
large numbers of records into a database; you don't want to do that by hand. We can't teach you all of
the ins and outs of programming with SQL, however; in this section we'll just focus on the basic SQL
commands and what they do. Later on, we'll show an example of a web-based program that can interact
with a SQL database.
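A script that feeds SQL to the DBMS might look like the following sketch; sqlite3 stands in for the DBMS here, but with MySQL or PostgreSQL the SQL statements themselves would look much the same (the table and the data are illustrative):

```python
# SQL commands issued from a program rather than by hand: a script
# loading many records at once through parameterized INSERT statements.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE citation_author (citation_id TEXT, name TEXT)")

# In a real application this list might come from a web form or a
# parsed data file; entering it by hand would be tedious and error-prone.
authors = [("primary", "Fermi, G."),
           ("primary", "Perutz, M.F."),
           ("1", "Hasnain, S.S.")]
cur.executemany("INSERT INTO citation_author VALUES (?, ?)", authors)
conn.commit()

cur.execute("SELECT COUNT(*) FROM citation_author")
print(cur.fetchone()[0])  # → 3
```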

SQL commands read like stilted English with a very restricted vocabulary. If you remember
diagramming sentences in high-school English class, figuring out subject-verb-object relationships and
conditional clauses, SQL should seem fairly intuitive. The challenge is remembering the restrictions of
vocabulary and syntax, and constructing queries so that your DBMS can understand them. A SQL
