<<

. 12
( 12)



> a <- 2 + 2
>a
[1] 4


333
Just as Perl has basic datatypes that are useful for working with text, R has datatypes that are useful for
doing statistics. We have already seen the vector; R also has matrices, arrays (which are a
multid imensional generalization of matrices), and factors (lists of strings that label vectors of the same
length).

14.5.4.2 Online resources for R

The place to go for more information on R is the Comprehensive R Archive Network (CRAN). You
can find the CRAN site nearest to you either by using your favorite search engine or off a link from the
R Project home page (http://www.R-project.org). CRAN has a number of packages for specific
statistical applications implemented in R and available as RPM files (for information on installing
RPMs, see Chapter 3). If your project requires sampling, clustering, regression, or factor analysis, R
can be a lifesaver. R can even be made to directly access XGobi as an output system, so that the results
of your computations can be plotted in two or more dimensions.

You can try R without having to install it, thanks to Rweb (http://www.math.montana.edu/Rweb/), a
service provided by the Montana State University Mathematics Department. Rweb accepts your R
code, runs it, and returns a page with the results of the calculation. If you want to use R for anything
beyond simple demonstrations, though, it's faster to download the RPM files and run R on a local
computer.

If you find that R is useful in your work, we vigorously recommend you supplement the R tutorial, An
Introduction to R, and the R FAQ (http://cran.r-project.org/) with the third edition of Modern Applied
Statistics with S-Plus (see the Bibliography). Both Venables and Ripley are now part of the R
development core team, and although their text is primarily an S-plus book, supplements are available
from Ripley's web site (http://www.stats.ox.ac.uk/˜ripley/)that make the examples in book more easily
used under R.

14.5.4.3 Matlab and Octave

GNU Octave (http://www.gnu.org/software/octave/octave.html) is a freely available programming
language whose syntax and functions are similar to Matlab, a commercial programming environment
from The MathWorks, Inc. (http://www.mathworks.com/products/matlab/). Matlab is popular among
engineers for quickly writing programs that perform large numbers of numerical computations. Octave
(or Matlab) is worth a look if you want to write quick prototypes of number-crunching programs,
particularly for data analysis or simulation. Both Octave and Matlab are available for Unix and
Windows systems. Octave source code, binaries, and documentation are all available online; they are
also distributed as part of an increasing number of Linux distributions.

Octave produces graphical output using the gnuplot package mentioned previously. While this
arrangement works well enough, it is rather spartan compared to the plotting capabilities of Matlab. In
fact, if you are a student, we will take off our open source hats and strongly encourage you to take
advantage of the academic pricing on Matlab; it will add years to your life. As a further incentive, a
number of the data mining and machine learning tools discussed in the next section are available as
Matlab packages.

14.6 Visualization: Summary


334
This section has described solutions to data presentation problems that arise frequently in
bioinformatics. For some of the most current work on visualization in bioinformatics, see the European
Bioinformatics Institute's visualization information off the Projects link on their industrial relations
page (http://industry.ebi.ac.uk). Links to more online visualization and data mining resources are
available off the web page for this book. Table 14-1 shows the tools and techniques that are used for
data visualization.

Table 14-1. Data Visualization Tools and Techniques
What you do Why you do it What you use to do it
View graphics files To view results and check figures xzgv
gv, Adobe Acrobat
View PDF or PostScript files To read articles in electronic form
Reader
Manipulate graphics files For preparation of figures The GIMP
Plot data in two or three dimensions To summarize data for presentations Grace, gnuplot
Multidimensionalvisualization To explore data with more than three variables XGobi
Multidimensional scaling To view high-dimensional data in2D or 3D XGvis
Plot graphical structures To draw networks and pathways GraphViz programs
Print sequence alignment clearly To format sequence alignment for publication TEXshade
For rapid prototyping of statistical data-analysis
Statistics-heavy programming for data analysis R (or S-plus)
tools
Numerically intensive programming for data For rapid prototyping of tools that make heavy use GNU Octave (or
analysis of matrices Matlab)


14.7 Data Mining and Biological Information
One of the most exciting areas of modern biology is the application of data mining methods to
biological databases. Many of these methods can equally well fall into the category of machine
learning, the name used in the artificial intelligence community for the larger family of programs that
adapt their behavior with experience. We present here a summary of some techniques that have
appeared in recent work in bioinformatics. The list isn't comprehensive but will hopefully provide a
starting point for learning about this growing area.

A word of caution: anthropomorphisms have a tendency to creep into discussions of data mining and
machine learning, but there is nothing magical about them. Programs are said to "learn" or be "trained,"
but they are always just following well-defined sets of instructions. As with any of the tools we've
described in this book, data mining tools are supplements, rather than substitutes, for human knowledge
and intuition. No program is smart enough to take a pile of raw data and generate interesting results,
much less a publication-quality article ready for submission to the journal of your choice. As we've
stressed before, the creation of a meaningful question, the experimental design, and the meaningful
interpretation of results are your responsibility and yours alone.

14.7.1 Problems in Data Mining and Machine Learning

The topics addressed by data mining are ones that statisticians and applied mathematicians have
worked on for decades. Consequently, the division between statistics and data mining is blurry at best.
If you do work with data mining or machine learning techniques, you will want to have more than a
passing familiarity with traditional statistical techniques. If your problem can be solved by the latest
data-mining algorithm or a straightforward statistical calculation, you would do well to choose the
simple calculation. By the same token, please avoid the temptation to devise your own scoring method
335
without first consulting a statistics book to see if an appropriate measure already exists. In both cases, it
will be easier to debug and easier to explain your choice of a standard method over a nonstandard one
to your colleagues.

14.7.1.1 Supervised and unsupervised learning

Machine learning methods can be broadly divided into supervised and unsupervised learning. Learning
is said to be supervised when a learning algorithm is given a set of labeled examples from which to
learn (the training set) and is then tested on a set of unlabeled examples (the test set). Unsupervised
learning is performed when data is available, but the correct labels for each example aren't known. The
objective of running the learning algorithm on the data is to find some patterns or trends that will aid in
understanding the data. For example, the MEME program introduced in Chapter 8, performs
unsupervised learning in order to find sequence motifs in a set of unaligned sequences. It isn't known
ahead of time whether each sequence contains the pattern, where the pattern is, or what the pattern
looks like.

Cluster analysis is another kind of unsupervised learning that has received some attention in the
analysis of microarray data. Clustering, as shown in Figure 14-5, is the procedure of classifying data
such that similar items end up in the same class while dissimilar items don't, when the actual classes
aren't known ahead of time. It is a standard technique for working with multidimensional data. Figure
14-5 shows two panels with unadorned dots on the left and dots surrounded by cluster boundaries on
the right.

Figure 14-5. Clustering




14.7.2 A Collection of Data Mining Techniques

In this section, we describe some data mining methods commonly reported in the bioinformatics
literature. The purpose of this section is to provide an executive summary of the complex tric ks for data
analysis. You aren't expected to be able to implement these algorithms in your programming language
of choice. However, if you see any of these methods used to analyze data in a paper, you should be able
to recognize the method and, if necessary, evaluate the way in which it was applied. Like any technique
in experimental biology, it is important to have an understanding of the machine learning methods used
in computational biology to know whether or not they have been used appropriately and correctly.

14.7.2.1 Decision trees




336
In its simplest form, a decision tree is a list of questions with yes or no answers, hierarchically
arranged, that lead to a decision. For instance, to determine whether a stretch of DNA is a gene, we
might have a tree like the one shown in Figure 14-6.

Figure 14-6. Simple gene decision tree




A tree like this one is easy to work through, since it has a finite number of possibilities at each branch,
and any path through the tree leads to a decision. The structure of the tree and the rules at each of the
branches are determined from the data by a learning algorithm. Techniques for learning decision trees
were described by Leo Breiman and coworkers in the early 1980s, and were later popularized in the
machine learning community by J. R. Quinlan, whose freely available C4.5 decision tree software and
its commercial successor, C5, are standards in the field.

One major advantage of decision trees over other machine learning techniques is that they produce
models that can be interpreted by humans. This is an important feature, because a human expert can
look at a set of rules learned by a decision tree and determine whether the learned model is plausible
given real-world constraints. In biology, tree classifiers tend to be used in pattern recognition
[5]


problems, such as finding gene splice sites or identifying new occurrences of a protein family member.
The MORGAN genefinder developed by Steven Salzberg and coworkers is an example of a decision
tree approach to genefinding.
[5]
The canonical decision -tree urban legend comes from an application of trees by a long -distance telephone company that wanted to learn about churn, the process
of losing customers to other long -distance companies. They discovered that an abnormally large number of their customers over the age of 70 were subject to churn.
A human recognized something the program did not: humans can die of old age. So, being able to interpret your results can be useful.


14.7.2.2 Neural networks

Neural networks are statistical models used in pattern recognition and classification. Originally
developed in the 1940s as a mathematical model of memory, neural networks are sometimes also called
connectionist models because of their representation as nodes (which are usually variables) connected
by weighted functions. Figure 14-7 shows the process by which a neural network is constructed. Please
note, though, that there is nothing particularly "neural" about these models, nor are there actually
physical nodes and connections involved. The idea behind neural networks is that, by working in
concert, these simple processing elements can perform more complex computations.


337
Figure 14-7. Neural network diagram




A neural network is composed of a set of nodes that are connected in a defined topology, where each
node has input and output connections to other nodes. In general, a neural network will receive an input
pattern (for example, an amino acid sequence whose secondary structure is to be predicted), which sets
the values of the nodes on the first layer (the input layer). These values are propagated according to
transfer functions (the connections) to the next layer of nodes, which propagate their values to the next
layer, until the output layer is reached. The pattern of activation of the output layer is the output of the
network. Neural networks are used extensively in bioinformatics problems; examples include the PHD
(http://www.embl-heidelberg.de/predictprotein/predictprotein.html) and PSIPRED
(http://insulin.brunel.ac.uk/psipred/) secondary structure predictors described in Chapter 9, and the
GRAIL genefinder (http://compbio.ornl.gov/grailexp/) mentioned in Chapter 7.

14.7.2.3 Genetic algorithms

Genetic algorithms are optimization algorithms. They search a large number of possible solutions for
the best one, where "best" is determined by a cost function or fitness function. Like neural networks,
these models were inspired by biological ideas (in this case, population genetics), but there is nothing
inherently biological about them. In a genetic algorithm, a number of candidate solutions are generated
at random. These candidate solutions are encoded as chromosomes. Parts of each chromosome are then
exchanged à la homologous recombination between real chromosomes. The resulting recombined
strategies are then evaluated according to the fitness function, and the highest scoring chromosomes are
propagated to the next generation. This recombination and propagation loop continues until a suitably
good solution is found. Genetic algorithms are frequently used in molecular simulations, such as
docking and folding of proteins.

14.7.2.4 Support vector machines

In late 1998, a machine learning tool called the support vector machine (SVM) began to attract a great
deal of attention in bioinformatics. Support vector machines were developed by Vladimir Vapnik of
Bell Laboratories, an early pioneer of machine learning methods. They have been applied to a wide
range of problems, from optical character recognition to financial time series analysis and recognizing
spam (the email variety, not the lunch meat). SVMs were first applied to biological problems by
Tommi Jaakola (now at MIT), David Haussler, and coworkers at UC Santa Cruz, who used them for
protein sequence classification. They have since been applied to many of the standard computational
biology challenge problems (function prediction, structure prediction, and genefinding) but have gained
the most attention for their use in microarray analysis.

Support vector machines are supervised classifiers that try to find a linear separation between different
classes of points in a high-dimensional space. In a 2D space, this separator is a line; in 3D, it's a plane.

338
In general, this separating surface is called a hyperplane. Support vector machines have two special
features. First, instead of just finding any separating hyperplane, they are guaranteed to find the optimal
one, or the one whose placement yields the largest separation between the two classes. The data points
nearest the frontier between the two classes are called the support vectors. Second, although SVMs [6]


are linear classifiers, they can classify nonlinearly separable sets of points by transforming the original
data points into a higher dimensional space in which they can be separated by a linear surface.
[6]
Vectors, in this case, refer to the coordinates of t he data points. For example, on a 2D map, you might have pairs of (x,y) coordinates representing the location of
the data points. These ordered pairs are the vectors.


Table 14-2 shows some of the most popular data-mining tools and techniques.

Table 14-2. Data Mining Tools and Techniques
What you do Why you do it What you use to do it
To find similar items when a classification scheme isn't
Clustering Clustering algorithms, self-organizing maps
known ahead of time
To label each piece of data according to a classification
Classification Decision trees, neural networks, SVMs
scheme
Regression algorithms, neural networks,
Regression To extrapolate a trend from a few examples
SVMs, decision trees
Combining
To improve reliability of prediction Voting methods, mixture methods
estimators




339
Biblio.1 Unix
Learning Red Hat Linux and Learning Debian GNU Linux. B. McCarty. O'Reilly & Associates. Good
introductory guides to setting up systems with these releases of Linux.

Learning the Unix Operating System. J. Peek, G. Todino, and J. Strang. O'Reilly & Associates. A
concise introduction to Unix for the beginner.

The Linux Desk Reference. S. Hawkins and J. Brockmeier. Prentice Hall.

Linux in a Nutshell. Siever, et al. O'Reilly & Associates. A no -nonsense quick-reference guide to Linux
commands.

Running Linux. M. Welsh and L. Kaufman. O'Reilly & Associates. A relatively comprehensive how-to
guide for setting up a Linux system.

Unix for the Impatient. P. Abrahams and B. Larson. Addison Wesley. A detailed yet user-friendly
presentation of everything a Unix user needs to know. (My first and still favorite Unix guide. CJG)

Unix in a Nutshell. A. Robbins. O'Reilly & Associates. A no-nonsense quick-reference guide to Unix
commands.

Biblio.2 SysAdmin
Essential System Administration. A. Frisch. O'Reilly & Associates. A detailed guide to administration
of Unix systems.

Using csh & tcsh. P. DuBois. O'Reilly & Associates. A detailed guide to using two of the most
common shell environments.

Biblio.3 Perl
Elements of Programming in Perl. A. L. Johnson. Manning Publications. Good introduction to Perl as a
first programming language

Learning Perl. R. Schwartz and T. Christiansen. O'Reilly & Associates. Introduction to Perl but
assumes prior experience with another programming language.

For a more detailed, biology-oriented Perl tutorial, we recommend the one available online at Lincoln
Stein's laboratory page at Cold Spring Harbor Labs, http://stein.cshl.org.

Mastering Algorithms in Perl. J. Orwant, J. Hietaniemi, and J. Macdonald. O'Reilly & Associates. Both
this book and the next cover interesting things that can be done with Perl.

Perl Cookbook. T. Christiansen and N. Torkington. O'Reilly & Associates.

Programming Perl. L. Wall, T. Christiansen, and J. Orwant. O'Reilly & Associates. The bible of Perl.

340
Biblio.4 General Reference
Finding Out About: Search Engine Technology from a Cognitive Perspective. R. Belew. Cambridge
University Press. A fascinating discussion of information retrieval and the process of web-based
research from a cognitive science perspective. Both practical and philosophical aspects are covered.

All three of the following books cover general programming techniques:

Code Complete. S. McConnell. Microsoft Press.

The Practice of Programming. B. W. Kernighan and R. Pike. Addison Wesley

Programming Pearls. J. Bentley. Addison Wesley

Biblio.5 Bioinformatics Reference
Bioinformatics: A Machine Learning Approach. P. Baldi and S. Brunak. MIT Press. The authors have
firsthand experience with applying neural networks and hidden Markov models to sequence analysis,
including genefinding, DNA feature detection, and protein family modeling.

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. A. D. Baxevanis and B. F. F.
Ouellette. John Wiley & Sons. A gentle introduction to biological information and bioinformatics tools
on the Web, focused on NCBI tools.

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. R. Durbin, S. Eddy,
A. Krogh, and G. Mitchison. Cambridge University Press. A rigorous presentation of the statistical and
algorithmic basis of sequence analysis methods, including pairwise and multiple sequence analysis,
motif discovery, and phylogenetic analysis.

Molecular Systematics. D. M. Hillis, C. Moritz, and B. K. Mable, Eds. Sinauer and Associates.
Although the first two -thirds of the book are devoted to experimental methods, the chapters on the
methods for inferring and applying phylogenies provide a rigorous and comprehensive follow-up to the
Graur and Li book.

Biblio.6 Molecular Biology/Biology Reference
Fundamentals of Molecular Evolution. D. Graur, W-H. Li. Sinauer and Associates. A readable
explanation of the mechanisms by which genomes change over time, with a discussion of phylogenetic
inference based on molecular data.

Molecular Systematics. D. M. Hillis, C. Moritz, and B. K. Mable, eds. Sinauer and Associates.
Although the first two -thirds of the book are devoted to experimental methods, the chapters on the
methods for inferring and applying phylogenies provide a rigorous and comprehensive follow-up to the
Graur and Li book.

Biblio.7 Protein Structure and Biophysics


341
Intermolecular and Surface Forces. J. Israelachvili. Academic Press. A must-have book for any serious
student of macromolecular structure and molecular biophysics. This book details the physical
chemistry of interactions among molecules and between molecules and surfaces.

Introduction to Protein Structure. C-I. Branden and J. Tooze. Garland Publishing. An illustrated guide
to the basic principles of protein structure and modeling.

Biblio.8 Genomics
Genomes. T. A. Brown. Wiley-Liss. A thorough presentation of molecular genetics from the genomics
perspective.

Genomics: The Science and Technology Behind the Human Genome Project. C. R. Cantor and C. L.
Smith. John Wiley & Sons. If you want to understand, in detail, how genomic sequence data is
obtained, this is the book to have. It exhaustively details experimental protocols for sequencing and
mapping and explores the future of sequencing technology.

Biblio.9 Biotechnology
DNA Microarrays: A Practical Approach. M. Schena, ed. Oxford University Press. An introduction to
the basics of DNA microarray technology and its applications.

Proteome Research: New Frontiers in Functional Genomics. M. R. Wilkins, K. L. Williams, R. D.
Appel, and D. F. Hochstrasser, eds. Springer. An introduction to new techniques for protein
identification and analysis, from 2D-PAGE to MALDI-TOF and beyond.

Biblio.10 Databases
CGI Programming with Perl. S. Guelich, S. Gundavaram, and G. Birznieks. O'Reilly & Associates. An
introduction to the CGI protocol for generating active-content web pages. If you are interested in web
software development, this book is an essential starting point.

Joe Celko's Data and Databases: Concepts in Practice. J. Celko. Morgan Kaufman. A good
introduction to relational database concepts and the use of SQL.

MySQL. P. DuBois. New Riders. A detailed guide to using MySQL. Detailed coverage of
administration and security issues.

MySQL & mSQL. R. J. Yarger, G. Reese, and T. King. O'Reilly & Associates. An introduction to using
MySQL and mSQL; also contains an introduction to RDB concepts and database normalization.
O'Reilly also publishes a collection of reference books about Oracle, if you prefer to start using Oracle
from the beginning.

Biblio.11 Visualization
Understanding Robust and Exploratory Data Analysis. D. C. Hoaglin, et al. eds. John Wiley & Sons. A
classic book on visualization techniques. Don't be put off by the fact that the focus of the book is on
techniques for doing analysis by hand rather than the latest computational tricks: the methods described

342
are implemented in many visualization packages and are easily applicable to the latest bioinformatics
problems.

The Visual Display of Quantitative Information, Envisioning Information, and Visual Explanations. E
Tufte. Graphics Press. In each book, Tufte illustrates good and bad practices in visual data analysis
using examples from newspapers, advertising campaigns, and train schedules (to name a few).

The Visualization Toolkit: An Object-Oriented Approach to 3-D Graphics. W. Schroeder, K. Martin,
and B. Lorensen. Prentice Hall Computer Books. For those readers who want a more active role in
designing visualization tools, this book combines introductions to computer graphics and visualization
practices with a description of a working implementation of a complete visualization system, the
Visualization Toolkit (VTK). VTK is an object-oriented, scriptable framework for building
visualization tools. It is available from http://www.kitware.com.

Biblio.12 Data Mining
Data Mining: Practical Machine Learning. I. Witten and E. Frank. Morgan Kaufman. A clearly written
introduction to data mining methods. It comes with documentation for the authors' WEKA program
suite, a set of data mining tools written in Java that can be freely downloaded from their web site.

Data Preparation for Data Mining. D. Pyle. Morgan Kaufman. For readers looking for more insight
into the data-preparation process.

Machine Learning. T. Mitchell. McGraw-Hill. Provides a complementary treatment of the same
methods as the previous book and is more formal but no less practical.

Modern Applied Statistics with S-Plus. Brian D. Ripley and William N. Venables. Springer Verlag.

Numerical Recipes in C. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery.
Cambridge University Press. A comprehensive introduction to the techniques that underlie all
nontrivial methods for data analysis. Combines mathematical explanations with efficient C
implementations. In addition to the hardcopy form, the entire book and all its source code are available
online at no charge from http://www.nr.com.




343
Colophon
Our look is the result of reader comments, our own experimentation, and feedback from distribution
channels. Distinctive covers complement our distinctive approach to technical topics, breathing
personality and life into potentially dry subjects.

The animal on the cover of Developing Bioinformatics Computer Skills is Caenorhabditis elegans, a
small nematode worm. Unlike many of its nastier parasitic cousins, C. elegans lives in the soil where it
feeds on microbes and bacteria. It grows to about 1 mm in length.

In spite of its status as a "primitive" organism, C. elegans shares with H. sapiens many essential
biological characteristics. C. elegans begins life as a single cell that divides and grows to form a
multicellular adult. It has a nervous system and a brain (more properly known as the circumpharyngeal
ring) and a muscular system that supports locomotion. It exhibits behavior and is capable of
rudimentary learning. Like humans, it comes in two sexes, but in C. elegans those sexes consist of a
male and a self-fertilizing hermaphrodite. C. elegans is easily grown in large numbers in the laboratory,
has a short (2-3 week) lifespan, and can be manipulated in sophisticated experiments. These
characteristics make it an ideal organism for scientific research.

The C. elegans hermaphrodite has 959 cells, 300 of which are neurons, and 81 of which are muscle
cells. The entire cell lineage has been traced through development. The adult has a number of sensory
organs in the head region which respond to taste, smell, touch, and temperature. Although it has no
eyes, it does react slightly to light. C. elegans has approximately 17,800 distinct genes, and its genome
has been completely sequenced. Along with the fruit fly, the mouse, and the weed Arabidopsis, C.
elegans has become one of the most studied model organisms in biology since Sydney Brenner first
focused his attention on it decades ago.

Mary Anne Weeks Mayo was the production editor and copyeditor for Developing Bioinformatics
Computer Skills. Rachel Wheeler proofread the book. Linley Dolby and Sheryl Avruch provided
quality control. Gabe Weiss, Edie Shapiro, Matt Hutchinson, and Sada Preisch provided production
assistance. Joe Wizda wrote the index.

Ellie Volckhausen designed the cover of this book, based on a series design by Edie Freedman. The
cover image is an original illustration created by Lorrie LeJeune, based on a photograph supplied by
Leon Avery at the University of Texas Southwestern Medical Center. Emma Colby produced the cover
layout with QuarkXPress 4.1 using Adobe's ITC Garamond font.

Melanie Wang designed the interior layout based on a series design by Nancy Priest. Cliff Dyer
converted the files from MSWord to FrameMaker 5.5 using tools created by Mike Sierra. The text and
heading fonts are ITC Garamond Light and Garamond Book; the code font is Constant Willison. The
illustrations for this book were created by Robert Romano and Lucy Muellner using Macromedia
Freehand 9 and Adobe Photoshop 6. This colophon was written by Lorrie LeJeune.




344

<<

. 12
( 12)