
find links to most of the scientific publications you need online. There are central databases that
collect reference information so you can search dozens of journals at once. You can even set up
"agents" that notify you when new articles are published in an area of interest. Searching the
public molecular-biology databases requires the same skills as searching for literature
references: you need to know how to construct a query statement that will pluck the particular
needle you're looking for out of the database haystack. Tools for searching biochemical
literature and sequence databases are introduced in Chapter 6.

Sequence alignment and sequence searching

As mentioned in Chapter 1, being able to compare pairs of DNA or protein sequences and
extract partial matches has made it possible to use a biological sequence as a database query.
Sequence-based searching is another key skill for biologists; a little exploration of the
biological databases at the beginning of a project often saves a lot of valuable time in the lab.
Identifying homologous sequences provides a basis for phylogenetic analysis and sequence-
pattern recognition. Sequence-based searching can be done online through web forms, so it
requires no special computing skills, but to judge the quality of your search results you need to
understand how the underlying sequence-alignment method works and go beyond simple
sequence alignment to other types of analysis. Tools for pairwise sequence alignment and
sequence-based database searching are introduced in Chapter 7.
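
To give a flavor of what the underlying alignment method is doing, here is a minimal sketch of global pairwise alignment scoring by dynamic programming, the general approach described in Chapter 7. The match, mismatch, and gap values are arbitrary illustrative numbers, not the defaults of any particular program, and the two sequences are invented.

# Minimal global alignment (Needleman-Wunsch style) sketch.
# Scores are illustrative; real tools use substitution matrices such as BLOSUM62.

def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            up   = score[i-1][j] + gap   # gap in b
            left = score[i][j-1] + gap   # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

if __name__ == "__main__":
    print(global_align("GATTACA", "GCATGCA"))   # prints the alignment score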

Gene prediction

Gene prediction is only one of a cluster of methods for attempting to detect meaningful signals
in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were
already characterized at the time of deposition. That is, someone had already gone in and, using
molecular biology, genetic, or biochemical methods, figured out what the gene did. However, now that
the genome projects are in full swing, there's a lot of DNA sequence out there that isn't
characterized.

Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites,
repeat sequences, and tRNA genes helps molecular biologists make sense out of this unmapped
DNA. Tools for gene prediction are introduced in Chapter 7.
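
As a toy illustration of the simplest of these signals, the sketch below scans the forward strand of a DNA sequence for open reading frames (an ATG followed by an in-frame stop codon). It ignores the reverse strand, splicing, and all the statistical signals a real gene finder would use; the example sequence and minimum length are invented.

# Toy open reading frame (ORF) scan: forward strand only, no splicing,
# just ATG...stop in each of the three reading frames.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Yield (start, end, frame) for simple ORFs of at least min_codons codons."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i+3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None

if __name__ == "__main__":
    seq = "CCATGAAATTTGGGCCCAAATTTGGGCCCTAATT"
    for orf in find_orfs(seq, min_codons=5):
        print(orf)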

Multiple sequence alignment

Multiple sequence-alignment methods assemble pairwise sequence alignments for many related
sequences into a picture of sequence homology among all members of a gene family. Multiple
sequence alignments aid in visual identification of sites in a DNA or protein sequence that may
be functionally important. Such sites are usually conserved; that is, the same amino acid is
present at that site in each one of a group of related sequences. Multiple sequence alignments
can also be quantitatively analyzed to extract information about a gene family. Multiple
sequence alignments are an integral step in phylogenetic analysis of a family of related
sequences, and they also provide the basis for identifying sequence patterns that characterize
particular protein families. Tools for creating and editing multiple sequence alignments are
introduced in Chapter 8.
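
The following sketch illustrates the kind of quantitative analysis described above: given an already computed alignment (equal-length sequences with '-' marking gaps), it reports the most common residue and its frequency at each column, flagging perfectly conserved sites. The toy alignment is invented.

# Scan columns of a (pre-computed) multiple alignment and report conservation.
# Sequences must be the same length; '-' marks a gap. Toy data for illustration.

from collections import Counter

def column_conservation(alignment):
    """Return, for each column, the most common residue and its frequency."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    result = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / n_seqs))
    return result

if __name__ == "__main__":
    aln = ["MKV-LIA", "MKVQLIA", "MRVQLIA", "MKVQLVA"]
    for pos, (res, freq) in enumerate(column_conservation(aln), start=1):
        flag = "*" if freq == 1.0 else " "
        print(f"{pos:3d} {res} {freq:.2f} {flag}")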

Phylogenetic analysis

Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences.
A traditional phylogenetic tree or cladogram groups species into a diagram that represents their
relative evolutionary divergence. Branchings of the tree that occur furthest from the root
separate individual species; branchings that occur close to the root group species into kingdoms,
phyla, classes, families, genera, and so on.

The information in a molecular sequence alignment can be used to compute a phylogenetic tree
for a particular family of gene sequences. The branchings in phylogenetic trees represent
evolutionary distance based on sequence similarity scores or on information-theoretic modeling
of the number of mutational steps required to change one sequence into the other. Phylogenetic
analyses of protein sequence families speak not about the evolution of the entire organism but
about evolutionary change in specific coding regions, although our ability to create broader
evolutionary models based on molecular information will expand as the genome projects
provide more data to work with. Tools for phylogenetic analysis are introduced in Chapter 8.
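
To make the notion of evolutionary distance concrete, here is a sketch that computes the simplest possible measure, the p-distance (fraction of differing ungapped positions), between each pair of sequences in a small invented alignment. Real phylogeny programs use far more sophisticated substitution models and tree-building algorithms.

# Pairwise p-distances (fraction of differing positions) from an alignment.
# This is the crudest possible distance measure; real phylogenetic methods
# correct for multiple substitutions and other effects. Toy data only.

def p_distance(a, b):
    """Fraction of aligned, ungapped positions at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x != y) / len(pairs)

if __name__ == "__main__":
    names = ["seq1", "seq2", "seq3"]
    aln = ["MKVQLIA", "MKVQLVA", "MRVELVA"]
    for i in range(len(aln)):
        for j in range(i + 1, len(aln)):
            print(names[i], names[j], round(p_distance(aln[i], aln[j]), 3))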

Extraction of patterns and profiles from sequence data

A motif is a sequence of amino acids that defines a substructure in a protein that can be
connected to function or to structural stability. In a group of evolutionarily related gene
sequences, motifs appear as conserved sites. Sites in a gene sequence tend to be conserved (that is, to
remain the same in all or most representatives of a sequence family) when there is selection
pressure against copies of the gene that have mutations at that site. Nonessential parts of the
gene sequence will diverge from each other in the course of evolution, so the conserved motif
regions show up as a signal in a sea of mutational noise. Sequence profiles are statistical
descriptions of these motif signals; profiles can help identify distantly related proteins by
picking out a motif signal even in a sequence that has diverged radically from other members of
the same family. Tools for profile analysis and motif discovery are introduced in Chapter 8.
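
The sketch below shows the basic idea behind a sequence profile: tabulate per-column residue frequencies from a handful of aligned motif instances, then score candidate windows against that table. The motif instances and candidates are invented, and real profile methods add pseudocounts and log-odds scoring.

# Build a simple position frequency profile from a set of aligned motif
# instances and score a new sequence window against it. Toy data; real
# profile methods use log-odds scores and pseudocounts tuned to the family.

from collections import Counter

def build_profile(instances):
    """Per-column residue frequencies for equal-length motif instances."""
    length = len(instances[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in instances)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile

def score_window(profile, window):
    """Sum of column frequencies for the residues in the window."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, window))

if __name__ == "__main__":
    motif_instances = ["GHE", "GHD", "AHE", "GHE"]
    profile = build_profile(motif_instances)
    for candidate in ["GHE", "GAE", "KKK"]:
        print(candidate, round(score_window(profile, candidate), 2))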
Protein sequence analysis

The amino-acid content of a protein sequence can be used as the basis for many analyses, from
computing the isoelectric point and molecular weight of the protein and the characteristic
peptide mass fingerprints that will form when it's digested with a particular protease, to
predicting secondary structure features and post-translational modification sites. Tools for
feature prediction are introduced in Chapter 9, and tools for proteomics analysis are introduced
in Chapter 11.
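
As a small example of this kind of calculation, the sketch below estimates a protein's molecular weight from standard average residue masses. The mass values are rounded averages and the example sequence is arbitrary; a real tool would also handle modified residues and compute pI and other properties.

# Estimate the molecular weight of a protein from its sequence using
# average residue masses (in daltons). Values are standard averages,
# rounded to two decimals; modified residues are not handled.

RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water added for the free termini

def molecular_weight(sequence):
    """Approximate average molecular weight of a protein sequence."""
    return sum(RESIDUE_MASS[res] for res in sequence.upper()) + WATER

if __name__ == "__main__":
    print(round(molecular_weight("MKWVTFISLLLLFSSAYS"), 1))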

Protein structure prediction

It's a lot harder to determine the structure of a protein experimentally than it is to obtain DNA
sequence data. One very active area of bioinformatics and computational biology research is the
development of methods for predicting protein structure from protein sequence. Methods such
as secondary structure prediction and threading can help determine how a protein might fold,
classifying it with other proteins that have similar topology, but they don't provide a detailed
structural model. The most effective and practical method for protein structure prediction is
homology modeling: using a known structure as a template to model a structure with a similar
sequence. In the absence of homology, there is no way to predict a complete 3D structure for a
protein. Tools for protein structure prediction are introduced in Chapter 9.

Protein structure property analysis

Protein structures have many measurable properties that are of interest to crystallographers and
structural biologists. Protein structure validation tools are used by crystallographers to measure
how well a structure model conforms to structural rules extracted from existing structures or
chemical model compounds. These tools may also analyze the "fitness" of every amino acid in a
structure model for its environment, flagging such oddities as buried charges with no
countercharge or large patches of hydrophobic amino acids found on a protein surface. These
tools are useful for evaluating both experimental and theoretical structure models.

Another class of tools can calculate internal geometry and physicochemical properties of
proteins. These tools usually are applied to help develop models of the protein's catalytic
mechanism or other chemical features. Some of the most interesting properties of protein
structures are the locations of deeply concave surface clefts and internal cavities, both of which
may point to the location of a cofactor binding site or active site. Other tools compute
hydrogen-bonding patterns or analyze intramolecular contacts. A particularly interesting set of
properties includes the electrostatic potential field surrounding the protein and other electrostatically
controlled parameters such as individual amino acid pKas, protein solvation energies, and
binding constants. Methods for protein property analysis are discussed in Chapter 10.

Protein structure alignment and comparison

Even when two gene sequences aren't apparently homologous, the structures of the proteins
they encode can be similar. New tools for computing structural similarity are making it possible
to detect distant homologies by comparing structures, even in the absence of much sequence
similarity. These tools also are useful for comparing constructed homology models to the
known protein structures they are based on. Protein structure alignment tools are introduced in
Chapter 10.
Biochemical simulation

Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical
reactions involved in metabolism. Simulations can extend from individual metabolic pathways
to transmembrane transport processes and even properties of whole cells or tissues.
Biochemical and cellular simulations traditionally have relied on the ability of the scientist to
describe a system mathematically, developing a system of differential equations that represent
the different reactions and fluxes occurring in the system. However, new software tools can
build the mathematical framework of a simulation automatically from a description provided
interactively by the user, making mathematical modeling accessible to any biologist who knows
enough about a system to describe it according to the conventions of dynamical systems
modeling. Dynamical systems modeling tools are discussed in Chapter 11.
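
Here is a minimal sketch of the dynamical-systems idea: a single substrate converted to product by an enzyme obeying Michaelis-Menten kinetics, integrated with a simple Euler step. The rate constants and concentrations are arbitrary illustrative values; dedicated simulation packages use more accurate integrators and handle whole reaction networks.

# Minimal dynamical-systems sketch: irreversible conversion S -> P catalyzed
# by an enzyme, modeled with Michaelis-Menten kinetics and integrated with
# a simple Euler step. Parameter values are arbitrary illustrative numbers.

def simulate(s0=10.0, vmax=1.0, km=2.0, dt=0.01, t_end=20.0):
    """Return lists of time points and substrate concentrations."""
    times, substrate = [0.0], [s0]
    s = s0
    t = 0.0
    while t < t_end:
        rate = vmax * s / (km + s)   # Michaelis-Menten rate law
        s = max(s - rate * dt, 0.0)
        t += dt
        times.append(t)
        substrate.append(s)
    return times, substrate

if __name__ == "__main__":
    times, substrate = simulate()
    for i, (t, s) in enumerate(zip(times, substrate)):
        if i % 100 == 0:                     # print every 100 steps (one time unit)
            print(f"t = {t:5.2f}   [S] = {s:6.3f}")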

Whole genome analysis

As more and more genomes are sequenced completely, the analysis of raw genome data has
become a more important task. There are a number of perspectives from which one can look at
genome data: for example, it can be treated as a long linear sequence, but it's often more useful
to integrate DNA sequence information with existing genetic and physical map data. This
allows you to navigate a very large genome and find what you want. The National Center for
Biotechnology Information (NCBI) and other organizations are making a concerted effort to
provide useful web interfaces to genome data, so that users can start from a high-level map and
navigate to the location of a specific gene sequence.

Genome navigation is far from the only issue in genomic sequence analysis, however.
Annotation frameworks, which integrate genome sequence with results of gene finding analysis
and sequence homology information, are becoming more common, and the challenge of making
and analyzing complete pairwise comparisons between genomes is beginning to be addressed.
Genome analysis tools are discussed in Chapter 11.

Primer design

Many molecular biology protocols require the design of oligonucleotide primers. Proper primer
design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA
sequencing, and microarray experiments. Primers must hybridize with the target DNA to
provide a clear answer to the question being asked, but they must also have appropriate
physicochemical properties; they must not self-hybridize or dimerize; and they should not have
multiple targets within the sequence under investigation. There are several web-based services
that allow users to submit a DNA sequence and automatically detect appropriate primers, or to
compute the properties of a desired primer DNA sequence. Primer design tools are discussed in
Chapter 11.
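
The sketch below performs two of the quick checks mentioned above on a candidate primer: GC content and a rough melting temperature from the Wallace rule (Tm = 2(A+T) + 4(G+C)), which is only a rule of thumb for short oligos. The primer sequence is invented; real design tools use nearest-neighbor thermodynamics and also screen for self-complementarity and alternative binding sites.

# Quick checks on a candidate PCR primer: GC content and a rough melting
# temperature from the Wallace rule. Real primer-design tools use
# nearest-neighbor thermodynamics; the example primer is invented.

def gc_content(primer):
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

if __name__ == "__main__":
    primer = "AGCGTCCAATGGTTACGATT"
    print("length :", len(primer))
    print("GC     :", round(gc_content(primer), 2))
    print("Tm (C) :", wallace_tm(primer))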

DNA microarray analysis

DNA microarray analysis is a relatively new molecular biology method that expands on classic
probe hybridization methods to provide access to thousands of genes at once. Microarray
experiments are amenable to computational analysis because of the uniform, standardized
nature of their results: a grid of equally sized spots, each identifiable with a particular DNA
sequence. Computational tools are required to analyze larger microarrays because the resulting
images are so visually complex that comparison by hand is no longer feasible.

The main tasks in microarray analysis as it's currently done are an image analysis step, in which
individual spots on the array image are identified and signal intensity is quantitated, and a
clustering step, in which spots with similar signal intensities are identified. Computational
support is also required for the chip-design phase of a microarray experiment to identify
appropriate oligonucleotide probe sequences for a particular set of genes and to maintain a
record of the identity of each spot in a grid that may contain thousands of individual
experiments. Array analysis tools are discussed in Chapter 11.
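
As a taste of the clustering step, the sketch below compares expression profiles (log ratios across a few hypothetical conditions) by Pearson correlation; genes whose profiles correlate strongly would be grouped together. The profiles are invented, and real analyses add normalization, background correction, and full clustering algorithms.

# Compare expression profiles from a (made-up) microarray experiment by
# Pearson correlation of log ratios across conditions.

import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

if __name__ == "__main__":
    # log2(ratio) values across four hypothetical conditions
    profiles = {
        "geneA": [0.1, 1.8, 2.0, 0.3],
        "geneB": [0.0, 1.7, 2.2, 0.1],
        "geneC": [2.1, 0.2, -0.1, 1.9],
    }
    names = list(profiles)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(profiles[names[i]], profiles[names[j]])
            print(names[i], names[j], round(r, 2))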

Proteomics analysis

Before they're ever crystallized and biochemically characterized, proteins are often studied
using a combination of gel electrophoresis, partial sequencing, and mass spectrometry. 2D gel
electrophoresis can separate a mixture of thousands of proteins into distinct components; the
individual spots of material can be blotted or even cut from the gel and analyzed. Simple
computational tools can provide some information to aid in the process of analyzing protein
mixtures. It's trivial to compute molecular weight and pI from a protein sequence; by using
these values in combination, sets of candidate identities can be found for each spot on a gel. It's
also possible to compute, from a protein sequence, the peptide fingerprint that is created when
that protein is broken down into fragments by enzymes with specific protein cleavage sites.
Mass spec analyses of protein fragments can be compared to computed peptide fingerprints to
further limit the search. Proteomics tools are covered in Chapter 11.
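
The sketch below predicts such a peptide fingerprint for a trypsin digest: it splits the sequence after K or R (except before P) and computes an approximate average mass for each fragment. The residue masses are rounded averages, the sequence is invented, and missed cleavages and modifications are ignored.

# Predict a tryptic peptide mass fingerprint: split the sequence after K or R
# (except before P) and compute an approximate average mass for each peptide.
# Residue masses as in the earlier molecular-weight sketch; toy sequence.

RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

def tryptic_peptides(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    sequence = sequence.upper()
    peptides, current = [], ""
    for i, res in enumerate(sequence):
        current += res
        next_res = sequence[i + 1] if i + 1 < len(sequence) else ""
        if res in "KR" and next_res != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[res] for res in peptide) + WATER

if __name__ == "__main__":
    for pep in tryptic_peptides("MKWVTFISLLLLFSSAYSRGVFRR"):
        print(f"{pep:<20s} {peptide_mass(pep):8.1f}")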

2.5 A Computational Biology Experiment
Computer-based research projects and computational analysis of experimental data must follow the
same principles as any other scientific study. Your results must clearly answer the question you set out to
investigate, and they must be reproducible by someone else using the same input data and following the same
process.

If you're already doing research in experimental biology, you probably have a pretty good
understanding of the scientific method. Although your data, your method, and your results are all
encoded in computer files rather than sitting on your laboratory bench, the process of designing a
computational "experiment" is the same as you are used to.

Although it's easy in these days of automation to simply submit a query to a search engine and use the
results without thinking too much about it, you need to understand your method and analyze your
results thoroughly in the same way you would when applying a laboratory protocol. Sometimes that's
easier said than done. So let's take a walk through the steps involved in defining an experiment in
computational biology.

2.5.1 Identifying the Problem

A scientific experiment always begins with a question. A question can be as broad as "what is the
catalytic mechanism of protein X?" It's not always possible to answer a complex question about how
something works with one experiment. The question needs to be broken down into parts, each of which
can be formulated as a hypothesis.

A hypothesis is a statement that is testable by experiment. In the course of solving a problem, you will
probably formulate a number of testable statements, some of them trivial and some more complex. For
instance, as a first approach to answering the question, "What is the catalytic mechanism of protein
X?", you might come up with a preliminary hypothesis such as: "There are amino acids in protein X
that are conserved in other proteins that do the same thing as protein X." You can test this hypothesis
by using a computer program to align the sequences of as many protein X-type proteins as you can
find, and look for amino acids that are identical among all or most of the sequences. Subsequently
you'd move to another hypothesis such as: "Some of these conserved amino acids in the protein X
family have something to do with the catalytic mechanism." This more complex hypothesis can then be
broken down into a number of smaller ones, each of them testable (perhaps by a laboratory experiment,
or perhaps by another computational procedure).

A research project can easily become interminable if the goals are ill-defined or the question can't
feasibly be answered. On the other hand, if you aren't careful, it's easy to keep adding questions to a
project on the basis of new information, allowing the goal to keep creeping out of reach every time its
completion is close. It's easy to do this with computational projects, because the cost of materials and
resources is low once the initial expense of buying computers and software is covered. It seems no big
loss to just keep playing around on the computer.

We have found that this pathological condition can be avoided if, before embarking on a computational
project, some time is spent on sketching out a specification of the project's goals and timeline. If you
plan to write a project spec, it's easier to start from written answers to questions such as the following:

• What is the question this project is trying to answer?
• What is the final form you expect the results to take? Is the goal to produce a computer
program, a data set that will be used in an ongoing project, a journal publication, etc.? What are
the requirements for success or completion of the project?
• What is the approximate timeline of the project?
• What is the general outline of the project? Here, it would be appropriate to break the project
down into constituent parts and describe what you think needs to be done to finish each part.
• How does your project fit in with the work of others? If you're a lone wolf, you don't have to
worry about this, but research scientists tend to run in packs. It's good to have a clear
understanding of where your work is dependent on others. If you are writing a project spec for a
group of people to work on, indicate who is responsible for each part of the work.
• At what point will it be unprofitable to continue?

Thinking through questions like these not only gives you a clearer idea of what your projects are trying
to achieve, but also gives you an outline by which you can organize your research.

2.5.2 Separating the Problem into Simpler Components

In Chapter 7 through Chapter 14, we cover many of the common protocols for using bioinformatics
tools and databases in your research. Coming up with the series of steps in those protocols wasn't
rocket science. The key to developing your own bioinformatics computer skills is this: know what tools
are available and know how to use them. Then you can take a modular approach to the problems you
want to solve, breaking them down into distinct modules such as sequence searching, sequence profile
detection, homology modeling, model evaluation, etc., for each of which there are established
computational methods.

2.5.3 Evaluating Your Needs

As you break down a problem into modular components, you should be evaluating what you have, in
terms of available data and starting points for modeling, and what you need. Getting from point A to
point B, and from point C to point D, won't help you if there's absolutely no way to get from point B to
point C. For instance, if you can't find any homologous sequences for an unknown DNA sequence, it's
unlikely you'll get beyond that point to do any further modeling. And even if you do find a group of
sequences with a distinctive profile, you shouldn't base your research plans on developing a structural
model if there are no homologous structures in the Protein Data Bank (PDB). It's just common sense,
but be sure that there's a likely way to get to the result you want before putting time and effort into a
project.

2.5.4 Selecting the Appropriate Data Set

In a laboratory setting, materials are the physical objects or substances you use to perform an
experiment. It's necessary for you to record certain data about your materials: when they were made,
who prepared them, possibly how they were prepared, etc.

The same sort of documentation is necessary in computational biology, but the difference is that you
will be experimenting on data, not on a tangible object or substance. The source data you work with
should be distinguished from the derived data that constitutes the results of your experiment. You will
probably get your source data from one of the many biomolecular databases. In Chapter 13, you will
learn more about how information is stored in databases and how to extract it. You need to record
where your source data came from and what criteria or method you use to extract your source data set
from the source database.

For example, if you are building a homology model of a protein, you need to account for how you
selected the template structures on which you based your model. Did you find them using the unknown
sequence to search the PDB? Did that approach provide sufficient template structures, or did you,
perhaps, use sequence profile-based methods to help identify other structures that are more distantly
related to your unknown? Each step you take should be documented.

Criteria for selecting a source data set in computational biology can be quite complex and nontrivial.
For instance, statistical studies of sequence information or of structural data from proteins are often
based on a nonredundant subset of available protein data. This means that data for individual proteins
is excluded from the set if the proteins are too similar in sequence to other proteins that are being
included. Inclusion of two structure datafiles that describe the same protein crystallized under slightly
different conditions, for example, can bias the results of a computational study. Each step of such a
selection process needs to be documented, either within your own records, or by reference to a
published source.
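
As an illustration of the idea, here is a greedy sketch of nonredundant set construction: keep a sequence only if its identity to everything already kept falls below a threshold. The identity measure here is a naive position-by-position comparison of equal-length toy sequences; real pipelines align the sequences first and use purpose-built tools.

# Greedy sketch of building a nonredundant sequence set: keep a sequence only
# if it is below an identity threshold to everything already kept. Identity
# here is naive (position-by-position over equal-length toy sequences).

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / min(len(a), len(b))

def nonredundant(sequences, max_identity=0.8):
    kept = []
    for seq in sequences:
        if all(identity(seq, other) < max_identity for other in kept):
            kept.append(seq)
    return kept

if __name__ == "__main__":
    seqs = ["MKVQLIAHT", "MKVQLIAHS", "MRVELVAGT"]
    for seq in nonredundant(seqs, max_identity=0.8):
        print(seq)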

It's important to remember that all digital sequence and structure data is derived data. By the time it
reaches you, it has been through at least one or two processing steps, each of which can introduce
errors. DNA sequences have been processed by basecalling software and assembled into maps,
analyzed for errors, and possibly annotated according to structure and function, all by tools developed
by other scientists as human and error-prone as yourself. Protein structure coordinates are really just a
very good guess at where atoms fit into observed electron density data, and electron density maps in
turn have been extrapolated from patterns of x-ray reflections. This isn't to say that you should not use
or trust biological data, but you should remember that there is some amount of uncertainty associated
with each unambiguous-looking character in a sequence or atomic coordinate in a structure.
Crystallographers provide parameters, such as R-factors and B-values, which quantify the uncertainty
of coordinates in macromolecular structures to some extent, but in the case of sequences, no such
estimates are routinely provided within the datafile.

2.5.5 Identifying the Criteria for Success

Critical evaluation of results is key to establishing the usefulness of computer modeling in biology. In
the context of presenting various tools you can use, we've discussed methods for evaluating your
results, from using BLAST E-values to pick the significant matches out of a long list of results to
evaluating the geometry of a protein structural model. Before you start computing molecular properties
or developing a computational model, take inventory of what you know, and look for further
information. Then try to see the ways in which that information can validate your results. This is part of
breaking down your problem into steps.
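
As a trivial example of such a criterion, the sketch below filters a list of search hits by E-value and sorts the survivors, the kind of cut you might apply to a long BLAST hit list. The hit names and E-values are invented for illustration; with real output you would parse the E-value column of the report.

# Filter a list of search hits by E-value. The input here is a small invented
# table of (subject, E-value) pairs standing in for a real search report.

def significant_hits(hits, e_cutoff=1e-5):
    """Keep hits whose E-value is at or below the cutoff, sorted best first."""
    kept = [(name, e) for name, e in hits if e <= e_cutoff]
    return sorted(kept, key=lambda item: item[1])

if __name__ == "__main__":
    hits = [
        ("superoxide dismutase", 2e-48),
        ("putative oxidoreductase", 5e-12),
        ("alcohol dehydrogenase", 0.87),
        ("hypothetical protein", 3.2),
    ]
    for name, e in significant_hits(hits):
        print(f"{e:8.1e}  {name}")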

Computational methods almost always produce results. It's not like spectroscopy, where if there's
nothing significant in the cuvette, you don't get a signal. If you do a BLAST search, you almost always
get some hits back. You need to know how to distinguish meaningful results from garbage so you don't
end up comparing apples to oranges (or superoxide dismutases to alcohol dehydrogenases). If you
calculate physicochemical properties of a protein molecule or develop a biochemical pathway
simulation, you get a file full of numbers. The best possible way to evaluate your results is to have
some experimental results to compare them to.

Before you apply a computational method, decide how to evaluate your results and what criteria they
need to meet for you to consider the approach successful.

2.5.6 Performing and Documenting a Computational Experiment

When managing results for a computational project, you should make a distinction between primary
results and results of subsequent analyses. You should include a separate section in your results
directory for any analysis steps you may perform on the data (for instance, the results of statistical tests
or curve fitting). This section should include any observations you may have made about the data or the
data collection. Keep separate the results, which are the data you get from executing the experiment,
and the analysis, which is the insight you bring to the data you have collected.

One tendency that is common to users of computational biology software is to keep data and notes
about positive results while neglecting to document negative results. Even if you've done a web-based
BLAST search against a database and found nothing, that is information. And if you've written or
downloaded a program that is supposed to do something, but it doesn't work, that information is
valuable too, to the next guy who comes in to continue your project and wastes time trying to figure
out what works and what doesn't.

2.5.6.1 Documentation issues in computational biology

Many researchers, even those who do all their work on the computer, maintain paper laboratory
notebooks, which are still the standard recording device for scientific research. Laboratory notebooks
provide a tangible physical record of research activities, and maintenance of lab records in this format
is still a condition of many research grants.

However, laboratory notebooks can be an inconvenient source of information for you or for others who
are trying to duplicate or reconstruct your work. Lab notebooks are organized linearly, with entries
sorted only by date. They aren't indexed (unless you have a lot more free time than most researchers
do). They can't be searched for information about particular subjects, except by the unsatisfactory
strategy of sitting down and skimming the whole book beginning to end.

Computer filesystems provide an intuitive basis for clear and precise organization of research records.
Information about each part of a project can be stored logically, within the file hierarchy, instead of
sequentially. Instead of (or in addition to) a paper notebook on your bookshelf, you will have an electronic
record embedded within your data. If your files are named systematically and simple conventions are
used in structuring your electronic record, Unix tools such as the grep command will allow you to
search your documentation for occurrences of a particular word or date and to find the needed
information much more quickly than you would reading through a paper notebook.

2.5.6.2 Electronic notebooks

While you can get by with homegrown strategies for building an electronic record of your work, you
may want to try one of the commercial products that are available. Or, if you're looking for a freeware
implementation of the electronic notebook concept, you can obtain a copy of the software being
developed by the DOE2000 Electronic Notebook project. The eNote package lets you input text,
upload images and datafiles, and create sketches and annotations. It's a Perl CGI program and will run
on any platform with a web server and a Perl interpreter installed. When installed, it's accessible from a
web URL on your machine, and you can update your notebook through a web form. The DOE project
is designed to fulfill federal agency requirements for laboratory notebooks, as scientific research
continues to move into the computer age.

The eNote package is extremely simple to install. It requires that you have a working web server
installed on your machine. If you do, you can download the eNote archive and unpack it in its own
directory, for example /usr/local/enote. The three files enote.pl, enotelib.pl, and sketchpad.pl are the
eNote programs. You need to move enote.pl to the /home/httpd/cgi-bin directory (or wherever your
executable CGI directory is; this one is the default on Red Hat Linux systems) and rename it enote.cgi.
If you want to restrict access to the notebook, create a special subdirectory just for the eNote programs,
and remember that the directory will show up in the URL path to access the CGI script. The
sketchpad.pl file should also be moved to this directory, but it doesn't have to be renamed. Move the
directories gifs and new-gifs to a web-accessible location. You can create a directory such as
/home/httpd/enote for this purpose. Leave the file enotelib.pl and the directory sketchpad where you
unpacked them.

Finally, you need to edit the first line in both enote.cgi and sketchpad.pl to point to the location of the
Perl executable on your machine. Edit the enote.cgi script to reflect the paths where you installed the
eNote script and its support files. You also need to choose a directory in which you want eNote to write
entries. For instance, you may want to create a /home/enote/notebook directory and have eNote write
files there. If so, be sure that directory is readable and writable by other users so the web server (which
is usually identified as user nobody) can write there.



The eNote script also contains parameters that specify whether users of the notebook system can add,
delete, and modify entries. If you plan to use eNote seriously, these are important parameters to
consider. Would you allow users to tear unwanted pages out of a laboratory notebook or write over
them so the original entry was unreadable? eNote allows you to maintain control over what users can
do with their data.

The eNote interface is a straightforward web form, which also links to a Java sketchpad applet. If you
want only specific users with logins on your machine to be able to access the eNote CGI script, you can
set up a .htaccess file in the eNote subdirectory of your CGI directory. A .htaccess file is a file readable
by your web server that contains commands to restrict access to a particular directory and/or where it
can be accessed from. For more information on creating a .htaccess file, consult the documentation for
the web server you are using, most likely Apache on most Linux systems.

If you do begin to use an electronic notebook for storing your laboratory notes, remember that you
must save backups of your notebook frequently in case of system failures.




Chapter 3. Setting Up Your Workstation
In this chapter, we discuss how to set up a workstation running the Linux operating system. Linux is a
free, open source version of Unix that makes it possible to turn an ordinary PC into a powerful
workstation. By configuring your system with Linux and other open source software, you can have
access to a lot of powerful computational biology and bioinformatics tools at a low cost.

In writing this chapter, we encountered a bit of a paradox: in order to get around in Unix you need to
have your computer set up, but in order to set up your computer you need to know a few things about
Unix. If you don't have much experience with Unix, we strongly suggest that you look through Chapter
4 and Chapter 5 before you set up a Linux workstation of your own. If you're already familiar with the
ins and outs of Unix, feel free to skip ahead to Chapter 6.

3.1 Working on a Unix System
You are probably accustomed to working with personal computers; you may be familiar with windows
interfaces, word processors, and even some data-analysis packages. But if you want to use computers
as a serious component in your research, you need to work on computer systems that run under Unix or
related multiuser operating systems.

3.1.1 What Does an Operating System Do?

Computer hardware without an operating system is like a dead animal. It isn't going to react, it isn't
going to function; it's just going to sit there and look at you with glassy eyes until it rots (or rusts). The
operating system breathes life into the inert body of your computer. It handles the low level processes
that make hardware work together and provides an environment in which you can run and develop
programs. The most important function of the operating system is that it allows you convenient access
to your files and programs.

3.1.2 Why Use Unix?

So if the operating system is something you're not supposed to notice, why worry about which one
you're using? Why use Unix?

Unix is a powerful operating system for multiuser computer systems. It has been in existence for over
25 years, and during that time has been used primarily in industry and academia, where networked
systems and multiuser high-performance computer systems are required. Unix is optimized for tasks
that are only fairly recent additions to personal-computer operating systems, or which are still not even
available in some PC operating systems: networking with other computers, initiating multiple
asynchronous tasks, retaining unique information about the work environments of multiple users, and
protecting the information stored by individual users from other users of the system. Unix is the
operating system of the World Wide Web; the software that powers the Web was invented in Unix, and
many if not most web servers run on Unix systems.

Because Unix has been used extensively in universities, where much software for scientific data
analysis is developed, you will find a lot of good-quality, interesting scientific software written for
Unix systems. Computational biology and bioinformatics researchers are especially likely to have
developed software for Unix, since until the mid-1990s, the only workstations able to visualize protein
structure data in realtime were Silicon Graphics and Sun Unix workstations.

Unix is rich in commands and possibilities. Every distribution of Unix comes with a powerful set of
built-in programs. Everything from networking software to word-processing software to electronic mail
and news readers is already a part of Unix. Many other programs can be downloaded and installed on
Unix systems for free.

It might seem that there's far too much to learn to make working on a Unix system practical. It's
possible, however, to learn a subset of Unix and to become a productive Unix user without knowing or
using every program and feature.

3.1.3 Different Flavors of Unix

Unix isn't a monolithic entity. Many different Unix operating systems are out there, some proprietary
and some freely distributed. Most of the commands we present in this book work in the same way on
any system you are likely to encounter.

3.1.3.1 Linux

Linux (LIH-nucks) is an open source version of Unix, named for its original developer, Linus Torvalds
of the University of Helsinki in Finland. Originally undertaken as a one-man project to create a free
Unix for personal computers, Linux has grown from a hobbyist project into a product that, for the first
time, gives the average personal-computer user access to a Unix system.

In this book, we focus on Linux for three reasons. First, with the availability of Linux, Unix is cheap
(or free, if you have the patience to download and install it). Second, under Linux, inexpensive PCs
regarded as "obsolete" by Windows users become startlingly flexible and useful workstations. The
Linux operating system can be configured to use a much smaller amount of system resources than the
personal computer operating systems, which means computers that have been outgrown by the ever-
expanding system requirements of PC programs and operating systems can be given a new lease on life
by being reconfigured to run Linux. Third, Linux is an excellent platform for developing software, so
there's a rich library of tools available for computational biology and for research in general.

You may think that if you install Linux on your computer, you'll be pretty much on your own. It's a
freeware operating system, after all. Won't you have to understand just about everything about Linux to
get it configured correctly on your system? While this might have been true a few years ago, it's not
any more. Hardware companies are starting to ship personal computers with Linux preinstalled as an
alternative to the Microsoft operating systems. There are a number of companies that sell distributions
of Linux at reasonable prices. Probably the best known of these is the Red Hat distribution. We should
mention that we (the authors) run Red Hat Linux. Most of our experience (and the examples in this
book) is based on that distribution. If you purchase Linux from one of these companies, you get CDs
that contain not only Linux but many other compatible free software tools. You'll also have access to
technical support for your installation.

3.1.3.1.1 Will Linux run on your computer?

Linux started out as a Unix-like operating system for PCs, but various Linux development projects now
support nearly every available system architecture, including PCs of all types, Macintosh computers
old and new, Silicon Graphics, Sun, Hewlett-Packard, and other high-end workstations and high-
performance multiprocessor machines. So even if you're starting with a motley mix of old and new
hardware, you can use Linux to create a multiworkstation network of compatible computers. See
Section 3.2 for more information on installing and running Linux.

3.1.3.2 Other common flavors

There are many varieties (or "flavors") of Unix out there. The other common free implementation is the
Berkeley Software Distribution (BSD) originally developed at the University of California-Berkeley.
For the PC, there are a handful of commercial Unix implementations, such as The Santa Cruz
Operation (SCO) Unix. Several workstation makers sell their own platform-specific Unix
implementations with their computers, often with their own peculiarities and quirks. Most common
among these are Solaris (Sun Microsystems), IRIX (Silicon Graphics), Digital Unix (Compaq
Corporation), HP-UX (Hewlett Packard), and AIX (IBM). This list isn't exhaustive, but it's probably
representative of what you will find in most laboratories and computing centers.

3.1.4 Graphical Interfaces for Unix

Although Unix is a text-based operating system, you no longer have to experience it as a black screen
full of glowing green or amber letters. Most Unix systems use a variant of the X Window System. The
X Window System formats the screen environment and allows you to have multiple windows and
applications open simultaneously. X windows are customizable so that you can use menu bars and
other widgets much like PC operating systems. Individual Unix shells on the host machine as well as
on networked machines are opened as windows, allowing you to exploit Unix's multitasking
capabilities and to have many shells active simultaneously. In addition to Unix shells and tools, there
are many applications that take advantage of the X system and use X windows as part of their graphical
user interfaces, allowing these applications to be run while still giving access to the Unix command
line.

The GNOME and KDE desktop environments, which are included in most major Linux distributions,
make your Linux system look even more like a personal computer. Toolbars, visual file managers, and
a configurable desktop replicate the feeling of a Windows or Mac work environment, except that you
can also open a shell window and run Unix programs.

3.2 Setting Up a Linux Workstation
If you are already using an existing Unix/Linux system, feel free to skip this section and go directly to
the next.

If you are used to working with Macintosh or PC operating systems, the simplest way to set up a Linux
workstation or server is to go out and buy a PC that comes with Linux preinstalled. VA Linux, for
example, offers a variety of Intel Pentium-based workstations and servers preconfigured with your
choice of several of the most popular Linux distributions.

If you're looking for a complete, self-contained bioinformatics system, Iobion Systems
(http://www.iobion.com) is developing Iobion, a ground-breaking bioinformatics network server
appliance developed using open source technologies. Iobion is an Intel-based hardware system that
comes preinstalled with Linux, Apache web server, a PostgreSQL relational database, the R statistical
language, and a comprehensive suite of bioinformatics tools and databases. The system serves these
scientific applications to web clients on a local intranet or over the Internet. The applications include
tools for microarray data analysis complete with a microarray database, sequence analysis and
annotation tools, local copies of the public sequence databases, a peer-to-peer networking tool for
sharing biological data, and advanced biological lab tools. Iobion promotes and adheres to open
standards in bioinformatics.

If you already have a PC, your next choice is to buy a prepackaged version of Linux, such as those
offered by Red Hat, Debian, or SuSE. These prepackaged distributions have several advantages: they
have an easy-to-use graphical interface for installing Linux; all the software they include is packed into
package-manager archives (for Red Hat, the Red Hat Package Manager, or RPM) or similar easily
extracted formats; and they often contain a large number of "extras" that are easier to install from the
distribution disk using a package manager than they are if you install them by hand.

That said, let's assume you've gone out and bought something like the current version of Red Hat.
You'll be asked if you want to do a workstation installation, a server installation, or a custom
installation. What do these choices mean?

Your Linux machine can easily be set up to do some things you may not be used to doing with a PC or
Macintosh. You can set up a web server on your machine, and if you dig a little deeper into the
manuals, you can find out how to give each user of your machine a place to create his own web page.
You can set up an anonymous FTP server so that guests can FTP in to pick up copies of files you
choose to make public. You can set up an NFS server to allow directories you choose to be mounted on
other machines. These are just some options that set a server apart from a workstation.

If you are inexperienced in Unix administration, you probably want to set up your first Linux machine
as a workstation. With a workstation setup, you can access the Internet, but your machine can't provide
any services to outside users (and you aren't responsible for maintaining these services). If you're
feeling more adventurous, you can do a custom installation. This allows you to pick and choose the
system components you want, rather than taking everything the installer thinks you may want.

3.2.1 Installing Linux

We can't possibly tell you everything you need to know to install and run Linux. That's beyond the
scope of this book. There are many excellent books on the market that cover all possible angles of
installing and running Linux, and you can find a good selection in this book's Bibliography. In this
section, we simply offer some advice on the more important aspects of installation.

3.2.1.1 System requirements

Linux runs on a range of PC hardware combinations, but not all possible combinations. There are
certain minimum requirements. For optimum performance, your PC should have an 80486 processor or
better. Most Linux users have systems that use Intel chips. If your system doesn't, you should be aware
that while Linux does support a few non-Intel processors, there is less documentation to help you
resolve potential problems on those systems.

For optimum performance your system should have at least 16 MB of RAM. If you're planning to run
X, you should seriously consider installing more memory, perhaps 64 MB. X runs well on 16 MB, but
it runs more quickly and allows you to open more windows if additional memory is available.


If you plan to use your Linux system as a workstation, you should have at least 600 MB of free disk
space. If you want to use it as a server, you should allow 1.6 GB of free space. You can never have too
much disk space, so if you are setting up a new system, we recommend buying the largest hard drive
possible. You'll never regret it.

In most cases the installation utility that comes with your distribution can determine your system
configuration automatically, but if it fails to do so, you must be prepared to supply the needed
information. Table 3-1 lists the configuration information you need to start your installation.

Table 3-1. Configuration Information Needed to Install Linux
Hard drive(s)
• The number, size, and type of each hard drive
• Which hard drive is first, second, and so on
• Which adapter type (IDE or SCSI) is used by each drive
• For each IDE drive, whether the BIOS is set in LBA mode

RAM
• The amount of installed RAM

CD-ROM drive(s)
• Which adapter type (IDE, SCSI, other) is used by each drive
• For each drive using a non-IDE, non-SCSI adapter, the make and model of the drive

SCSI adapter (if any)
• The make and model of the card

Network adapter (if any)
• The make and model of the card

Mouse
• The type (serial, PS/2, or bus)
• The protocol (Microsoft, Logitech, MouseMan, etc.)
• The number of buttons
• For a serial mouse, the serial port to which it's connected

Video adapter
• The make and model of the card
• The amount of video RAM



To obtain information, you may need to examine your system's BIOS settings or open the case and
look at the installed hardware. Consult your system documentation or your system administrator to
learn how to do so.

Here are three of the more popular Linux distributions:

• Red Hat (http://www.redhat.com/support/hardware/)
• Debian (http://www.debian.org/doc/FAQ/ch-compat.html)
• SuSE (http://www.suse.com)

All have well-organized web sites with information about the hardware their distributions support.
Once you've collected the information in Table 3-1, take a few minutes to check the appropriate web
site to see if your particular PC hardware configuration is supported.

3.2.1.2 Partitioning your disk

Linux runs most efficiently on a partitioned hard drive. Partitioning is the process of dividing your disk
up into several independent sections. Each partition on a hard drive is a separate filesystem. Files in
one filesystem are to some extent protected from what goes on in other filesystems. If you download a
bunch of huge image files, you can fill up only the partition in which your home directories live; you
can't make the machine unusable by filling up all the available space for essential system functions.
And if one partition gets corrupted, you can sometimes fix the problem without reformatting the entire
drive and losing data stored in the other partitions.

When you start a Red Hat Linux installation, you need the Linux boot disk in your floppy drive and the
Linux CD-ROM in your CD drive. When you turn the computer on, you almost immediately encounter
an installation screen that offers several installation mode options. At the bottom of the screen, there is
a boot: prompt. Generally, you should just hit the Enter key; however, if you're using a new model of
computer, especially a laptop, you may want to type text at the prompt and then press the Enter key for a text-mode
installation, in case your video card isn't supported by the current Linux release.

Click through the next few screens, selecting the appropriate language and keyboard. You'll come to a
point at which you're offered the option of selecting a GNOME workstation, a KDE workstation, a
server, or a custom installation. At this point, you can just choose one of the single user workstation
options, and you're essentially done. However, we suggest doing a custom installation to allow you
greater control over what is installed on your computer and where it's installed.

If you have a single machine that's not going to be interacting with other machines on the network, you
can probably get away with putting the entire Linux installation into one big filesystem, if that's what
you want. But if you're setting up a machine that will, for instance, share software in its /usr/local
directory with all the other machines in your lab, you'll want to do some creative partitioning.

On any given hard disk, you can have up to four partitions. Partitions can be of two types: primary and
extended. Within an extended partition, you can have as many subpartitions as you like. Red Hat and
other commercial Linux distributions have simple graphical interfaces that allow you to format your
hard disk. More advanced users can use the fdisk program to achieve precise partitioning. Refer to one
of the "Learning Linux" books we recommend in the Bibliography for an in-depth discussion of
partitioning and how to use the fdisk program.

3.2.1.3 Selecting major package groupings

After you've set up partitions on your disk, chosen mount points for your partitions, and completed a
few other configuration steps, you need to pick the packages to install.

First, go through the Package Group Selection list. You'll definitely need printer support; the X
Window System; either the GNOME or KDE desktop (we like KDE); mail, web, and news tools;
graphics manipulation tools; multimedia support; utilities; and networked workstation support. If you'll
be installing software (and you will), you need a number of items in the development package group
(C, FORTRAN, and other compilers come in handy, as do some development libraries). You may also
want to install the Emacs text editor and the authoring/publishing tools. Depending on where you use
your system from, you may need dial-up workstation support.

The rest of the package groups add server functionality to your machine. If you want your machine to
function as a web server, add the web server package group. If you want to make some of the
directories on your machine available for NFS mounting, choose the NFS server group. If you plan to
create your own databases, you may want to set up your machine as a PostgreSQL server. Generally, if
you have no idea what it is or how you'd use it, you probably don't need to install it at this point.

If you're concerned about running out of space on your machine, you can now sift through the contents
of each package grouping and get rid of software you won't be using. For example, the "Mail, Web and
News" package grouping contains many different types of software for reading email and newsgroups.
Don't install it all, just pick your favorite package, and get rid of the rest. (In case you're wondering
what to choose, here's a hint: it's very easy to configure the Netscape browser to do all the mail and
news reading you'll need.) If you're installing a Red Hat system, check under "Applications/Editors"
and make sure you have the vim editor selected; in "Applications/Engineering," select gnuplot; and in
"Applications/Publishing," select enscript. Don't worry if you don't install something at the beginning
and find you need to install it later; it's pretty easy to do.

3.2.1.4 Other useful packages to add

Once you've done a basic Linux installation on your machine, you can add new packages easily using
the kpackage command (if you're using the KDE desktop environment) or gnorpm (if you are using
GNOME).

In order to compile some of the software we'll be discussing in the next few chapters, and to expand the
functionality of your Linux workstation, you may want to install some of the following tools. The first
set of tools are from the Red Hat Linux Power Tools CD:

R

A powerful system for statistical computation and graphics. It's based on S and consists of a
high-level language and a runtime environment.

OpenGL/Mesa

A 3D graphics library (Mesa is a free implementation of the OpenGL API) that enhances the
performance of some molecular visualization software.

LessTif

A widget set for application development. You might not use it directly, but it's used when you
compile some of the software discussed later in this book. Install at least the main package and
the client package.

Xbase

Another widget set.

MySQL

A database server for smaller data sets. It's useful if you're just starting to build your own
databases.

octave

A MatLab-like high-level language for numerical computations.

xv

A multipurpose image-editing and conversion tool.

xemacs

A powerful X Windows-based editor with special extensions for editing source code.

plugger

A generic Netscape plug-in that supports many formats.

You can download from the Web and install the following tools:

JDK /JRE (http://java.sun.com)

A Java Development Kit and Java Runtime Environment are needed if you want to use Java-
based tools such as the Jalview sequence editor we discuss in Chapter 4. They are freely
available for Linux from IBM, Sun, and Blackdown (http://blackdown.org). Blackdown also
offers a Java plug-in for Netscape, which is required to run some of the applications we discuss.

NCBI Toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools/README.htm)

A software library for developers of biology applications. It's required in order to compile some
software originating at NCBI.

StarOffice (http://www.staroffice.com)

A comprehensive office productivity package freely available from Sun Microsystems. It
replaces most or all functionality of Microsoft Office and other familiar office-productivity
packages.

3.3 How to Get Software Working
You've gone out and done the research and found a bioinformatics software package you want to install
on your own computer. Now what do you do?


When you look for Unix software on the Web, you will find that it's distributed in a number of different
formats. Each type of software distribution requires a different type of handling. Some are very simple
to install, almost like installing software on a Mac or PC. On the other hand, some software is
distributed in a rudimentary form that requires your active intervention to get it running. In order to get
this software working, you may have to compile it by hand or even modify the directions that are sent
to the compiler so that the program will work on your system. Compiling is the process of converting
software from its human-readable form, source code, to a machine-readable executable form. A
compiler is the program that performs this conversion.

Software that's difficult to install isn't necessarily bad software. It may be high-quality software from a
research group that doesn't have the resources to produce an easy-to-use installation kit. While this is
becoming less common, it's still common enough that you will need to know some things about
compiling software.

3.3.1 Unix tar Archives

Software is often distributed as a tar archive, which is short for "tape archive." We discuss tar and
other file-compression options in more detail in Chapter 5. Not coincidentally, these archives are one of
the most common ways to distribute Unix software on the Internet. tar allows you to download one file
that contains the complete image of the developer's working software installation and unpack it right
back into the correct subdirectories. If tar is used with the -p option, file permissions can even be
preserved. This ensures that, if the developer has done a competent job of packing all the required files
in the tar archive, you can compile the software relatively easily.

tar archives are often compressed further using either the Unix compress command (indicated by a
.tar.Z extension) or with gzip (indicated by a .tar.gz or .tgz extension).

3.3.2 Binary Distributions

Software can be distributed either as uncompiled source code or binaries. If you have a choice, and if
you don't know any reason to do otherwise, choose the binary distribution. It will probably save you a
lot of headaches.

Binary software distributions are precompiled and (at least in theory) ready to run on your machine.
When you download software that is distributed in binary form, you will have a number of options to
choose from. For example, the following listing is the contents of the public FTP site for the BLAST
sequence alignment software. There are several archives available, each for a different operating
system; if you're going to run the software on a Linux workstation, download the file blast.linux.tar.Z.

README.bls 52 Kb Wed Jan 26 18:45:00 2000
blast.alphaOSF1.tar.Z 12756 Kb Wed Jan 26 18:40:00 2000 Unix Tape Archive
blast.hpux11.tar.Z 11964 Kb Wed Jan 26 18:43:00 2000 Unix Tape Archive
blast.linux.tar.Z 9334 Kb Wed Jan 26 18:41:00 2000 Unix Tape Archive
blast.sgi.tar.Z 14746 Kb Wed Jan 26 18:44:00 2000 Unix Tape Archive
blast.solaris.tar.Z 12724 Kb Wed Jan 26 18:37:00 2000 Unix Tape Archive
blast.solarisintel.tar.Z 10679 Kb Wed Jan 26 18:43:00 2000 Unix Tape Archive
blastz.exe 3399 Kb Wed Jan 26 18:44:00 2000 Binary Executable

Here are the basic binary installation steps; a worked example follows the list:


1. Download the correct binaries. Be sure to use binary mode when you download. Download and
read the instructions (usually a README or INSTALL file).
2. Follow the instructions.
3. Make a new directory and move the archive into it, if necessary.
4. Use uncompress (*.Z) or gunzip (*.gz) to uncompress the file.
5. Use tar tf to examine the contents of the archive and tar xvf to extract it.
6. Run configuration and installation scripts, if present.
7. Link the binary into a directory in your default path using ln -s, if necessary.
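
For example, using the BLAST listing shown earlier, a Linux installation might proceed as follows (the installation directory and the name of the extracted executable, blastall, are assumptions; check the README for the actual contents of the archive):

mkdir /usr/local/blast
mv blast.linux.tar.Z /usr/local/blast
cd /usr/local/blast
uncompress blast.linux.tar.Z
tar tf blast.linux.tar
tar xvf blast.linux.tar
ln -s /usr/local/blast/blastall /usr/local/bin/blastall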

3.3.3 RPM Archives

RPM archives are a new kind of Unix software distribution that has recently become popular. These
archives can be unpacked using the command rpm. The Red Hat Package Manager program is included
in Red Hat Linux distributions and is automatically installed on your machine when you install Linux.
It can also be downloaded freely from http://www.rpm.org and used on any Linux or other Unix
system. rpm creates a software database on your machine, simplifies installations and updates, and
even allows you to create your own RPM archives. RPM archives come in either source or binary form, but aside
from the question of selecting the right binary, the installation is equally simple either way.

(As we introduce commands, we'll show you the format of the command line for each command -- for
example, "Usage: man name" -- and describe the effects of some options we find most useful.)

Usage: rpm --[options] *.rpm

Here are the important rpm options; an example session follows the list:

rebuild

Builds a package from a source RPM

install

Installs a new package from a binary RPM

upgrade

Upgrades existing software

uninstall (or erase)

Removes an installed package

query

Checks to see if a package is installed

verify


Checks information about installed files in a package
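
For instance, a session using these options might look like the following (the package filenames are hypothetical):

rpm --install blast-2.0.11-1.i386.rpm
rpm --query blast
rpm --upgrade blast-2.0.14-1.i386.rpm
rpm --erase blast

Most of these options also have single-letter equivalents (-i, -q, -U, and -e, respectively).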

3.3.3.1 GnoRPM

Recent versions of Linux that include the GNOME user interface also include an interactive installation
tool called GnoRPM. It can be accessed from the System folder in the main GNOME menu. To install
software from a CD-ROM with GnoRPM, insert and mount the CD-ROM and click the Install button;
GnoRPM then presents a selectable list of every package on the CD-ROM that you haven't already
installed. You can also update packages with GnoRPM, or uninstall them, ensuring that the
entire package is cleanly removed from your system. GnoRPM informs you if there are package
dependencies that require you to download code libraries or other software before completing the
installation.

3.3.4 Source Distributions

Sometimes the correct binary isn't available for your system, there's no RPM archive, and you have no
choice but to install from source code.

Source distributions can be easy or hard to install. The easy ones come with a configuration script, an
install script, and a Makefile for your operating system that holds the instructions to the compiler.

An example of an easy-to-install package is the LessTif source code distribution. LessTif is an open
source version of the OSF/Motif widget libraries and window manager. Motif was developed for high-end
workstations and costs a few thousand dollars a year to license; LessTif supports many Motif
applications (such as the multiple sequence alignment package ClustalX and the useful 2D plotting
package Grace, for example) for free. When the LessTif distribution is unpacked, it looks like:

AUTHORS KNOWN_BUGS acconfig.h configure ltmain.sh
BUG-REPORTING Makefile acinclude.m4 configure.in make.out
COPYING Makefile.am aclocal.m4 doc missing
COPYING.LIB Makefile.in clients etc mkinstalldirs
CREDITS NEWS config.cache include scripts
CURRENT_NOTES NOTES config.guess install-sh test
CVSMake README config.log lib test_build
ChangeLog RELEASE-POLICY config.status libtool
INSTALL TODO config.sub ltconfig

Configuration and installation of LessTif on a Linux workstation is a practically foolproof process. As
the superuser, move the source tar archive to the /usr/local/src directory. Uncompress and extract the
archive. Inside the directory that is created (lesstif or lesstif-0.89, for example), enter ./configure. The
configuration script will take a while to run; when it's done, enter make. Compilation will take several
minutes; at the end, edit the file /etc/ld.so.conf. Add the line /usr/lesstif/lib, save the file, and then run
ldconfig -v to make the shared LessTif libraries available on your machine.
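
In command form, this sequence looks something like the following (the archive name and version number are only examples; use whatever you actually downloaded):

cd /usr/local/src
gunzip lesstif-0.89.tar.gz
tar xvf lesstif-0.89.tar
cd lesstif-0.89
./configure
make

Then, still working as the superuser, add the line /usr/lesstif/lib to /etc/ld.so.conf with a text editor and run ldconfig -v.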

Complex software such as LessTif is assembled from many different source code modules. The
Makefile tells the compiler how to put them together into one large executable. Other programs are
simple: they have only one source code file and no Makefile, and they are compiled with a one-line
directive to the compiler. You should be able to tell which compiler to use by the extension on the
program filename. C programs are often labeled *.c, FORTRAN programs *.f, etc. To compile a C
program, enter gcc program.c -o program; for a FORTRAN program, the command is g77 program.f -o
program. The manpages for the compilers, or the program's documentation (if there is any), should
give you the form and possible arguments of the compiler command.

Compilers convert human-readable source code into machine-readable binaries. Each programming
language has its own compilers and compiler instructions. Some compilers are free, others are
commercial. The compilers you will encounter on Linux systems are gcc, the GNU Project C and C++
compiler, and g77, the GNU Project FORTRAN compiler. [1] In computational biology and
bioinformatics, you are likely to encounter programs written in C, C++, FORTRAN, Perl, and Java.
Use of other languages is relatively rare. Compilers or interpreters for all these languages are available
in open source distributions.
[1]
The GNU project is a collaborative project of the Free Software Foundation to develop a completely open source Unix-like operating system. Linux systems are,
formally, GNU/Linux systems, as they are distributed under the terms of the GNU General Public License (GPL), the license developed by the GNU project.


Difficult-to-install programs come in many forms. One of the main problems you may encounter will
be source code with dependencies on code libraries that aren't already installed on your machine. Be
sure to check the documentation or the README file that comes with the software to determine
whether additional code or libraries are required for the program to run properly.

An example of an undeniably useful program that is somewhat difficult to install is ClustalX, the X
Window System interface to the multiple sequence alignment program ClustalW. In order to install ClustalX
successfully on a Linux workstation, you first need to install the NCBI Toolkit and its included Vibrant
libraries. In order to create the Vibrant libraries, you need to install the LessTif libraries and to have
XFree86 development libraries installed on your computer.

Here are the basic steps for installing any package from source code:

1. Download the source code distribution. Use binary mode; compressed archives are binary files even if they contain source code.
2. Download and read the instructions (usually a README or INSTALL file; sometimes you have
to find it after you extract the archive).
3. Make a new directory and move the archive into it, if necessary.
4. Use uncompress (*.Z) or gunzip (*.gz) to uncompress the file.
5. Extract the archive using tar xvf or as instructed.
6. Follow the instructions (did we say that already?).
7. Run the configuration script, if present.
8. Run make if a Makefile is present.
9. If a Makefile isn't present and all you see are *.f or *.c files, use gcc or g77 to compile them, as
discussed earlier.
10. Run the installation script, if present.
11. Link the newly created binary executable into one of the binary-containing directories in your
path using ln -s (this is usually part of the previous step, but if there is no installation script, you
may need to create the link by hand).

3.3.5 Perl Scripts

The Perl language is used to develop web applications and is frequently used by computational
biologists. Perl programs (called scripts) have the extension *.pl (or *.cgi if they are web applications).
Perl is an interpreted language; in other words, Perl programs don't have to be compiled in order to run.
Instead, each command in a Perl script is sent to a program called the Perl interpreter, which executes
the commands. [2]




[2]
There is now a Perl compiler, which can optionally be used to create binary executables from Perl scripts. This can speed up execution.


To run Perl programs, you need to have the Perl interpreter installed on your machine. Most Linux
distributions contain and automatically install Perl. The most recent version of Perl can always be
obtained from http://www.perl.com, along with plenty of helpful information about how to use Perl in
your own work. We discuss some of the basic elements of Perl in Chapter 12.
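
Running a Perl script is usually just a matter of handing it to the interpreter on the command line (the script and data filenames here are invented for illustration):

perl count_orfs.pl sequences.fasta > counts.txt

If the first line of the script names the interpreter (for example, #!/usr/bin/perl) and the file has been made executable with chmod +x, you can also run the script by name alone.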

3.3.6 Putting It in Your Path

When you give a command, the default path or lookup path is where the system expects to find the
program (which is also known as the executable). To make life easier, you can link the binary
executable created when you compile a program to a directory like /usr/local/bin, rather than typing the
full pathname to the program every time you run it. If you're linking across filesystems, use the
command ln -s (which we cover in Chapter 4) to link the executable into a directory of executable files.
Sometimes this results in the error "too many levels of symbolic links" when you try to run the
program. In that case, you have to access the executable directly or use mv or cp to move the actual
executable file into the default path. If you do this, be sure to also move any support files the program
needs, or create a link to them in the directory in which the program is being run.
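
For example, if you've just built ClustalX in /usr/local/src/clustalx (a hypothetical location), you might link it into your path like this:

ln -s /usr/local/src/clustalx/clustalx /usr/local/bin/clustalx

As long as /usr/local/bin is in your default path, typing clustalx at the prompt then runs the program from any directory.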

Some software distributions automatically install their executables in an appropriate location. The
command that usually does this is make install. Be sure to run this command after the program is
compiled. For more information on symbolic linking, refer to one of the Unix references listed in the
Bibliography, or consult your system administrator.

3.3.7 Sharing Software Among Multiple Users

Before you start installing software on a Unix system, one of the first things to do is to find out where
shared software and data are stored on your machines. It's customary to install local applications in
/usr/local, with executable files in /usr/local/bin. If /usr/local is set up as a separate partition on the
system, it then becomes possible to upgrade the operating system without overwriting local software
installations.

Maintaining a set of shared software is a good idea for any research group. Installation of a single
standard version of a program or software package by the system administrator ensures that every
group member will be using software that works in exactly the same way. This makes troubleshooting
much easier and keeps results consistent. If one user has a problem running a version of a program that
is used by everyone in the group, the troubleshooting focus can fall entirely on the user's input, without
muddying the issue by trying to figure out whether a local version of the program was compiled
correctly.

For the most part, it's unnecessary for each user of a program to have her own copy of that program
residing in a personal directory. The main exception to this is if a user is actually modifying a program
for her own use. Such modifications should not be applied to the public, standard version of the
program until they have been thoroughly tested, and therefore the user who is modifying the program
needs her own version of the program source and executable.


3.4 What Software Is Needed?
New computational biology software is always popping up, but through a couple of decades of
collective experience, a consensus set of tools and methods has emerged. Many scientists are familiar
with standard commercial packages for sequence analysis, such as GCG, and for protein structure
analysis, such as Quanta or Insight. For beginners, these packages provide an integrated interface to a
variety of tools.

Commercial software packages for sequence analysis integrate a number of functions, including
mapping and fragment assembly, database searching, gene discovery, pairwise and multiple sequence
analysis, motif identification, and evolutionary analysis. One caveat is that these software packages can
be prohibitively expensive. It can be difficult, especially for educational institutions and research
groups on a limited budget, to purchase commercial software and pay the annual costs for license
maintenance (which can be in the many thousands of dollars).

A related cost issue is that many commercial software packages, especially those for macromolecular
structure analysis, don't yet run on consumer PCs. These packages were originally developed for high-
end workstations when these workstations were the only computers with sufficient graphics capability
to display protein structures. Although these days most home computers have high-powered graphics
cards, the makers of commercial molecular modeling software have been slow to keep up.

While commercial computational biology software packages can be excellent and easy to use, they
often seem to lag at least a couple of years behind cutting-edge method development. The company
that produces a commercial software package usually commits to only one method for each type of
tool, buys it at a particular phase in its development cycle, focuses on turning it into a commercially
viable product, and may not incorporate later developments in the method into its package in a timely
fashion, or at all.

On the other hand, while academic software is usually on the cutting edge, it can be poorly written and
hard to install. Documentation (beyond the published paper that describes the software) may be
nonexistent. Graphical user interfaces in academic software packages are often rudimentary, which can
be aggravating for the beginning user.

With this book, we've taken the "science on a shoestring" approach. In Chapter 6, Chapter 7, Chapter 9,
Chapter 10, and Chapter 11 we've compiled quick-reference tables of fundamental techniques and free
software applications you can use to analyze your data. Hopefully, these will help you to know what
you need to do, how to seek out the tools that do it, and how to put them both together in the way that
best suits your needs. This approach keeps you independent of the vagaries of the software industry and
in touch with the most current methods.




Chapter 4. Files and Directories in Unix
Now that you've set up your workstation, let's spend some time talking about how to get around in a
Unix system. In this chapter, we introduce basic Unix concepts, including the structure of the
filesystem, file ownership, and commands for moving around the filesystem and creating files. [1]


Another important focus of this chapter, however, is the approach you should take to organizing your
research data so that it can be accessed efficiently by you and by others.
[1]
Throughout this chapter and Chapter 5, we introduce many Unix commands. Our quick and dirty approach to outlining the functions of these commands and their
options should help you get started working fast, but it's by no means exhaustive. The Bibliography provides several excellent Unix books that will help you fill in
the details.


4.1 Filesystem Basics
All computer filesystems, whether on Unix systems or desktop PCs, are basically the same. Files are
named locations on the computer's storage device. Each filename is a pointer to a discrete object with a
beginning and end, whether it's a program that can be executed or simply a set of data that can be read
by a program. Directories or folders are containers in which files can be grouped. Computer
filesystems are organized hierarchically, with a root directory that branches into subdirectories and
subdirectories of subdirectories.

This hierarchical system can help organize and share information, if used properly. Like the taxonomy
of species developed by the early biologists, your file hierarchy should organize information from the
general level to the specific. Each time the filesystem splits into subdirectories, it should be because
there are meaningful divisions to be created within a larger class of files.

Why should you organize your computer files in a systematic, orderly way? It seems like an obvious
question with an obvious answer. And yet, a common problem faced by researchers and research
groups is failure to share information effectively. Problems with information management often
become apparent when a research group member leaves, and others are required to take over his
project.

Imagine you work with a colleague who keeps all his books and papers piled in random stacks all over
his office. Now imagine that your colleague gets a new job and needs to depart in a hurry, leaving
behind just about everything in his office. Your boss tells you that you can't throw away any of your
colleague's papers without looking at them, because there might be something valuable in there. Your
colleague has not organized or categorized any of his papers, so you have to pick up every item, look at
it, determine if it's useful, and then decide where you want to file it. This might be a week's work, if
you're lucky, and it's guaranteed to be a tough job.

This kind of problem is magnified when computer files are involved. First of all, many highly useful
files, especially binaries of programs, aren't readable as text files by users. Therefore, it's difficult to
determine what these files do if they're not documented. Other kinds of files, such as files of numerical
data, may not contain useful header information. Even though they can be read as text, it may be next
to impossible to figure out their purpose.

Second, space constraints on computer system usage are much more nebulous than the walls of an
office. As disk space has become cheaper, it's become easier for users of a shared system simply never
to clean up after themselves. Many programs produce multiple output files and, if there's no space
constraint that forces you to clean up while running them, can produce a huge mess in a short time.

How can you avoid becoming this kind of problem for your colleagues? Awareness of the potential
problems you can cause is the first step. You need to know what kinds of programs and files you
should share with others and which you should keep in your own directories. You should establish
conventions for naming datafiles and programs and stick to these conventions as you work. You should
structure your filesystem in a sensible hierarchy. You should keep track of how much space you are
using on your computer system and create usable archives of your data when you no longer need to
access it frequently. You should create informative documentation for your work within the filesystem
and within programs and datafiles.

The nature of the filesystem hierarchy means that you already have a powerful indexing system for
your work at your fingertips. It's possible to do computer-based research and be just as disorganized as
that coworker who piles all his books and papers in random stacks all over his office. But why would
you want to do that? Without much more effort, you can use your computer's filesystem to keep your
work organized.

4.1.1 Moving Around the Directory Hierarchy

As in all modern operating systems, the file hierarchy on a Unix system is structured as a tree. You may
be used to this from PC operating systems. Open one folder, and there can be files and more folders
inside it, layered as deep as you want to go. There is a root directory, designated as /. The root directory
branches into a finite number of files and subdirectories. On a well-organized system, each of these
subdirectories contains files and other subdirectories pertaining to a particular topic or system function.

Of course, there's nothing inside your computer that really looks like a tree. Files are stored on various
media -- most commonly the hard disk, which is a recordable device that lives in your computer. As its
name implies, the hard disk is really a disk. And the tree structure that you perceive in Unix is simply a
way of indexing what is on that disk or on other devices such as CDs, floppy disks, and Zip disks, or
even on the disks of every machine in a group of networked computers. Unix has extensive networking
capabilities that allow devices on networked computers to be mounted on other computers over the
network. Using these capabilities, the filesystems of several networked computers can be indexed as if
they were one larger, seamless filesystem.

4.1.2 Paths to Files and Directories

Each file on the filesystem can be uniquely identified by a combination of a filename and a path. You
can reference any file on the system by giving its full name, which begins with a / indicating the root
directory, continues through a list of subdirectories (the components of the path) and ends with the
filename. The full name, or absolute path, of a file in someone's home directory might look like this:

/home/jambeck/mustelidae/weasels.txt

The absolute path describes the relationship of the file to the root directory, /. Each name in the path
represents a subdirectory of the prior directory, and / characters separate the directory names.

Every file or directory on the system can be named by its absolute path, but it can also be named by a
relative path that describes its relationship to the current working directory. Files in the directory you
are in can be uniquely identified just by giving the filename they have in the current working directory.
Files in subdirectories of your current directory can be named in relation to the subdirectory they are
part of. From jambeck's home directory, he can uniquely identify the file weasels.txt as
mustelidae/weasels.txt. The absence of a preceding / means that the path is defined relative to the
current directory rather than relative to the root directory.

If you want to name a directory that is on the same level or above the current working directory, there
is a shorthand for doing so. Each directory on the system contains two links, ./ and ../, which refer to
the current directory and its parent directory (the directory it's a subdirectory of), respectively. If user
jambeck is working in the directory /home/jambeck/mustelidae/weasels, he can refer to the directory
/home/jambeck/mustelidae/otters as ../otters. A subdirectory of a directory on the same level of the
hierarchy as /home/jambeck/mustelidae would be referred to as ../../didelphiidae/opossums.

Another shorthand naming convention, which is implemented in the popular csh and tcsh shell
environments, is that the path of the home directory can be abbreviated as ~. The directory
/home/jambeck/mustelidae can then be referred to as ~/mustelidae.
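
Putting these conventions together, the following commands (using the example directories above) all move user jambeck to the same place:

cd /home/jambeck/mustelidae/otters (absolute path)
cd ../otters (relative path, starting from /home/jambeck/mustelidae/weasels)
cd ~/mustelidae/otters (home-directory shorthand)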

4.1.3 Using a Process-Based File Hierarchy

Filesystems can be deep and narrow or broad and shallow. It's best to follow an intuitive scheme for
organizing your files. Each level of hierarchy should be related to a step in the process you've used to
carry out the project. A filesystem is probably too shallow if the output from numerous processing
steps in one large project is all shoved together in one directory. However, a project directory that
involves several analyses of just one data object might not need to be broken down into subdirectories.
The filesystem is too deep if versions of output of a process are nested beneath each other or if analyses
that require the same level of processing are nested in subdirectories. It's much easier for you to
remember and for others to understand the paths to your data if they clearly symbolize steps in the
process you used to do the work.

As you'll see in the upcoming example, your home directory will probably contain a number of
directories, each containing data and documentation for a particular project. Each of these project
directories should be organized in a way that reflects the outline of the project. Each directory should
contain documentation that relates to the data within it.

4.1.4 Establishing File-Naming Conventions for Your Work

Unix allows an almost unlimited variability in file naming. Filenames can contain any character other
than the / or the null character (the character whose binary representation is all zeros). However, it's
important to remember that some characters, such as a space, a backslash, or an ampersand, have
special meaning on the command line and may cause problems when naming files. Filenames can be
up to 255 characters in length on most systems. However, it's wise to aim for uniformity rather than
uniqueness in file naming. Most humans are much better at remembering frequently used patterns than
they are at remembering unique 255-character strings, after all.

A common convention in file naming is to name the file with a unique name followed by a dot (.) and
then an extension that uniquely indicates the file type.

As you begin working with computers in your research and structuring your data environment, you
need to develop your own file-naming conventions, or preferably, find out what naming conventions
already exist and use them consistently throughout your project. There's nothing so frustrating as
looking through old data sets and finding that the same type of file has been named in several different
ways. Have you found all the data or results that belong together? Can the file you are looking for be
named something else entirely? In the absence of conventions, there's no way to know this except to
open every unidentifiable file and check its format by eye. The next section provides a detailed
example of how to set up a filesystem that won't have you tearing out your hair looking for a file you
know you put there.

Here are some good rules of thumb to follow for file-naming conventions; an example follows the list:

• Files of the same type should have the same extension.
• Files derived from the same source data should have a common element in their unique names.
• The unique name should contain as much information as possible about the experiment.
• Filenames should be as short as is possible without compromising uniqueness.
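
For instance (these names are invented purely to illustrate the rules), results of searching the same enolase sequence against two different databases might be stored as:

enolase1_swissprot.blast
enolase1_pdb.blast

The shared element enolase1 ties both files to their source sequence, the final component of the unique name records which database was searched, and the .blast extension identifies the file type.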

You'll probably encounter preestablished conventions for file naming in your work. For instance, if you
begin working with protein sequence and structure datafiles, you will find that families of files with the
same format have common extensions. You may find that others in your group have established local
conventions for certain kinds of datafiles and results. You should attempt to follow any known
conventions.

4.1.5 Structuring a Project: An Example

Let's take a look at an example of setting up a filesystem. These are real directory layouts we have used
in our work; only the names have been changed to protect the innocent. In this case, we are using a
single directory to hold the whole project.

It's useful to think of the filesystem as a family tree, clustering related aspects of a project into
branches. The top level of your project directory should contain two text files that explain the contents
of the directories and subdirectories. The first file should contain an outline of the project, with the
date, the names of the people involved, the question being investigated, and references to publications
related to this project. Tradition suggests that such informational files should be given a name along the
lines of README or 00README. For example, in the shards project, a minimal README file might
contain the following:

98-05-22
Project: Shards
Personnel: Per Jambeck, Cynthia Gibas
Question: Are there recurrent structural words in the three-dimensional structure
of proteins?
Outline: Automatic construction of a dictionary of elements of local structure in
proteins using entropy maximization-based learning.

The second file should be an index file (named something readily recognizable like INDEX ) that
explains the overall layout of the subdirectories. If you haven't really collected much data yet, a simple
sketch of the directories with explanations should do. For example, the following file hierarchy:

98-03-22 PJ
Layout of the Shards directory
(see README in subdirectories for further details)
/shards
/shards/data
/shards/data/sequences
/shards/data/structures
/shards/data/results
/shards/data/results/enolases
/shards/data/results/globins
/shards/data/test_cases
/shards/graphics
/shards/text
/shards/text/notebook
/shards/text/reports
/shards/programs
/shards/programs/source
/shards/programs/scripts
/shards/programs/bin

may also be represented in graphical form, as shown in Figure 4-1.

Figure 4-1. Tree diagram of a hierarchy




In this directory, we've made the first distinction between programs and data (programs contains the
software we write, and data contains the information we get from databases or files the programs
generate). Within the data directory, we further distinguish between types of data (in this case, protein
structures and protein sequences), results gleaned from running our programs on the data (on two sets
of proteins, the enolase family and the globin superfamily), and some test cases. Programs are
also subdivided according to types, namely whether they are the human-readable program listings
(source code), scripts that aid in running the programs, or the binaries of the programs.

As we mentioned earlier, when you store data in files, you should try to use a terse and consistent
system for naming files. Excessively long filenames that describe the exact contents of a file but
change for different file types (like all-GPCR-loops-in-SWISSPROT-on-99-7-14.text) will cause
problems once you start using the facilities Unix provides for automatically searching for and updating
files. In the shards project, we began with protein structures taken from the Protein Data Bank (PDB).
