<<

. 11
( 12)



>>

statement might read something like this: [2]




[2]
SQL commands don't have to appear in all capital letters; they're case-insensitive. But we'll write them in all capital letters in our examples, so that you can
distinguish them easily from the names of files, variables, and databases. File, variable, and database names are case-sensitive in SQL, so if you name a database
PeOPlE, you'll have to live with that.

SELECT program FROM software WHERE program LIKE 'blast'

This says "select the names of programs from the list of software where the name of the program is like
blast." This is something you might want to do if you use a searchable database of bioinformatics
software.

As mentioned above, all DBMSs aren't created equal. There is a SQL standard vocabulary, called SQL
92; however, most systems implement only parts of this standard. You need to study the documentation
for the particular DBMS you're using so you don't confuse it by giving unrecognized commands.

13.3.1 SQL Datatypes

The notion of a datatype is simple to understand. A datatype is an adjective that describes the data
stored in a particular column of a table. In general, data stored in a table can consist of either numeric
values or character strings. SQL, however, defines a multitude of subtypes within these common
datatypes, mostly variants that set different upper limits on the size of a text field or numerical field,
but also special numeric types such as DATE or MONEY.

When you create tables in a database, you need to define the type of each column. This means you need
to know from the beginning, as you are setting up your data model, what type of data will be contained
in each column. You should also have a rough idea of the likely upper and lower limits for your data,
so that you can select the smallest possible type to contain them. For instance, if you know that the
305
integer values in a column in your table will never be greater than 255, you should use the smallest
possible integer type, TINYINT, for that column, rather than making space for much larger values in
your database when you won't actually need that space. On the other hand, if that value will eventually
grow beyond 255, then you should choose a type that allows a broader range of values for that column.
Setting up a relational database requires quite a bit of intelligent forethought.

Here are some of the most popular SQL types, all of which are supported in most major RDBMS
programs:

INT

An integer number. Variations include TINYINT, SMALLINT, MEDIUMINT, and BIGINT.
Each of these allows a different range of numerical values.

FLOAT

A floating-point number. Maximum value on the order of 3 E 38; minimum value on the order
of 1.7 E-38.

REAL

A longer floating-point number. Maximum value on the order of 2 E 308; minimum value on
the order of 2 E-308.

CHAR

A fixed-length text string. Values shorter than the fixed length are padded with spaces.

TEXT

A variable-length text string with a maximum value. Variations include TINYTEXT,
MEDIUMTEXT, and LONGTEXT.

BLOB

A variable-length binary field with a maximum value. Variations include TINYBLOB,
MEDIUMBLOB, and LONGBLOB. Just about anything can go in a binary field. The
maximum size of a LONGBLOB is 4 GB. All sorts of interesting things, such as image data,
can go into a binary field.

DECIMAL

A real number that is stored as a character string rather than as a numerical field.

DATE

A date value that stores the year, month, and day.


306
TIMESTAMP

A time value that updates every time the record is modified.

ENUM

A value that is one of a limited set of options and can be selected using either the option name
or a numeric value that represents the name.

SET

A value that is one of a limited set of options.

13.3.2 SQL Commands

SQL has many commands, but it's probably most important for you to know how to create new tables,
add data to them, and then search for data in your database. We'll introduce you briefly to the SQL
CREATE, ALTER, INSERT, UPDATE, and SELECT commands, as they are implemented in MySQL.
The references mentioned in the Bibliography contain full descriptions of the SQL commands available
through MySQL.

13.3.2.1 Adding a new table to a database

New tables are created with the SQL CREATE statement. The syntax of the CREATE statement is
simply:

CREATE TABLE tablename (columnname type [modifiers] columnname type
[modifiers])

If you want to create a table of information about software packages, for the example database we
discuss in this chapter, you can do as follows:

CREATE TABLE software_package
(packid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
packname VARCHAR(100)
packurl VARCHAR(255)
function TEXT
keyword ENUM
os SET
format SET
archfile VARCHAR(255)
)

This command tells MySQL to set up a table in which the first column is an automatically
incrementing integer; that is, the DBMS automatically assigns a unique value to each entry you make.
The second and third columns are variable-length character strings with preset maximum lengths, in
which the name and URL of the software package will be entered. The fourth column is a text field that
can contain up to 64 KB of text describing the software package. The fifth column allows you to
choose one of 64 preset keywords to describe your software package; the sixth and seventh columns let
you choose any number of values from a set of preset values to describe the operating systems the
software will run under (e.g., mac, windows, linux) and the type of archive file available (e.g., binary,

307
rpm, source, tar ). The final field is another variable character string that will contain the URL of the
archive file.

13.3.2.2 Changing an existing table

If you create a table and you decide that it should look different than you originally planned, you can
use the ALTER command to change it. To add another column to a table, the syntax is:

ALTER TABLE tablename ADD [COLUMN] (columnname type [modifiers])

13.3.2.3 Adding data to an existing table

The INSERT command adds a new row of data to a table. The syntax of the INSERT command is:

INSERT INTO table ( colname1, colname2, colname3 ) VALUES (
'value1','value2','value3')

13.3.2.4 Altering existing data in a table

The UPDATE and REPLACE commands can modify an existing row in a table. Your user privileges
must allow you to use UPDATE and REPLACE. These commands can take a WHERE clause, with
syntax analogous to that of the SELECT command, so that you can specify under what conditions a
record is updated.

13.3.3 Accessing Your Database with the SQLSELECT Command

The SQL SELECT command finds data in a table for you. In other words, SELECT is the command
that makes the database useful once you have created it and populated it with data. It can be modified
by a conditional clause that lets you determine under what conditions a record is selected from the
table.

13.3.3.1 Choosing fields to select

The general syntax of the SELECT command is:

SELECT [fields] FROM [table] WHERE [clause]

To select all the fields in a particular table, the asterisk character can be used:

SELECT * FROM [table] WHERE [clause]

In this chapter's database example, if you want to select the software package name and software
package URL from the software table, the SELECT command is:

SELECT packname, packurl FROM software

13.3.3.2 Using a WHERE clause to specify selection conditions




308
The WHERE clause allows you to specify conditions under which records are selected from a table.
You can use standard operators, such as =, >=, etc., to set the conditions for your WHERE clause.
MySQL also allows you to use the LIKE and NOT LIKE operators for pattern matching.

If you want to set up your SELECT statement to find only software for sequence alignment, it should
look like this:

SELECT packname, packurl FROM software WHERE keyword = "sequence alignment";

If you want to find only software packages with names starting with the letter B, the SELECT
statement looks like this:

SELECT packname, packurl FROM software WHERE packname LIKE "B%";

The % character is a wildcard character that represents any number of characters, so the software
packages you select using this statement can have names of any length as long as the name starts with
B.

13.3.3.3 Joining output from multiple tables

SELECT can also join two related tables. When we talk later about developing databases, you'll find
that relationships between tables are created by replicating information called a primary key from one
table as a foreign key in another table. If the foreign key in one table matches the primary key in
another, the data in the two tables refers to the same record and can be joined to produce one set of
output from SELECT. A MySQL SELECT statement for joining two tables might look like this:

SELECT FROM table1, table2 WHERE primarykey1=foreignkey2

For instance, we've already discussed creating one table that lists the names, URLs, and other details
about the software packages listed in the database. In order to build the database properly, you have to
have another table that lists information about the literature references that describe the functions of the
software packages in the database.

What if you want to select only the names and URLs of software packages that were first described in
the literature in 1998 or later? The names and URLs are found in the software table; the dates are found
in the reference table. Here's the SQL:

SELECT packname, packurl, reference_date FROM software, reference
WHERE software.package_id = reference.package_id
AND reference_date >= "1998";

The variable package_id is the primary key from the software table, and it is replicated in the reference
table to maintain a relationship between the details of software packages and the references that
describe them. If the value of package_id is the same in both tables, the two rows being accessed are
part of the same record in the database. Therefore, the first part of the WHERE clause is what joins the
two tables. The second part of the WHERE clause (AND reference_date >= "1998") specifies the
actual selection condition.

Different database-management systems implement different levels of join functionality, so you will
have to check the specific documentation for your DBMS to see how joins work.
309
13.4 Installing the MySQL DBMS
To set up and maintain your own database, you need to have a database server installed on the machine
on which the data will be stored.

MySQL is a lightweight relational DBMS that is fairly easy to install and run. We're going to use
MySQL to set up the example database, so if you're interested in trying it out, be sure the MySQL
server is installed on your machine. If you're using a Red Hat Linux distribution, this is ridiculously
easy. If you didn't install MySQL when you set up your machine, simply use kpackage or gnorpm to
select the MySQL components you want to install”the server, clients, and development tools. This
will probably give you an older version of MySQL; to get the current version and install it easily, use
the binary RPMs from the latest stable version at http://www.mysql.com. You'll also want to make sure
the Apache web server and PHP support, available from http://www.apache.org, are installed. The next
time you restart your machine after the install, the MySQL server daemon, mysqld, is started, MySQL [3]


privilege databases are initialized, and the PHP module is made available to your Apache server.
[3]
System processes such as servers that run in the background on Unix systems are known as daemons.


13.4.1 Setting Up MySQL

When you look at RDBMS software, you usually find you have a choice of setting up a client or a
server. The server part of the program runs on the machine on which the data is actually stored. It runs
as a daemon on Unix machines; that is, as a system process that is always on, listening for and
responding to requests from clients. The MySQL server program is called mysqld. Figure 13-2 shows
an example of a client/server architecture.

Figure 13-2. Client/server architecture




Clients are programs that connect to the server and request data. Clients can be located on the database
server itself, but they also can be located on other machines on which mysqld isn't running and connect
over the Internet to the main database.

The MySQL client programs include mysql, the main client that lets you do everything from designing
databases to inserting records; mysqladmin, which performs selected administrative functions such as
creating new databases and checking the status of the server; and client5. client5 is similar to mysql in
that it allows interactive query processing, but for security purposes, it doesn't allow you to add and
modify database records.


310
When we talk about the MySQL DBMS as a whole, we refer to it as MySQL. When we talk about a
client program that's part of MySQL, we refer to it by its specific client name.

13.4.1.1 Using the mysql client program

The mysql program has only a few commands of its own; the commands that are primarily directed to
the mysql program or the client5 program are SQL statements. When you are inside the mysql program,
the program interprets any SQL statement you give to it as one continuous statement, until the
terminating character ";" is read. Here are the mysql commands:

use

Takes a database name as its argument; allows you to change which database is in active use

status

Returns the status of the server

connect

Reconnects with the server

go

Sends a command to the MySQL server; also can be indicated by terminating a SQL statement
with \g or ;

help

Prints a complete list of mysql commands

13.4.1.2 Using the mysqladmin client program to set up MySQL

You can get a comprehensive listing of mysqladmin commands with the command:

mysqladmin --help

Here are the commands you are likely to use frequently:

create

Takes a database name as its argument; creates a new database

drop

Takes a database name as its argument; deletes an entire database

reload


311
Reloads the grant tables

variables

Prints available variables that describe the MySQL installation

ping

Checks to see if the MySQL server is alive

shutdown

Shuts down the MySQL server on the local machine

13.4.1.3 Restarting the MySQL server

mysqladmin has an option for shutting down the server. But what about starting it up again? To start
your MySQL server, use the Unix su command to become the MySQL administrator, whether that's
user mysql or some other user ID. Then, start the MySQL server with the command:

safe_mysqld &

13.4.2 Securing Your MySQL Server

Your MySQL server isn't secure when you first install it from RPMs, although the databases are
initialized. To secure your server, you should immediately set a root password for the MySQL
installation. This can (and should) be different from your system root password. MySQL usernames
and system usernames aren't connected, although server processes do need to run under a user ID that
exists on your server. You need to use the mysql program directly to update the user grant table, the
main table of permissions for MySQL users. To invoke the mysql program, give the command:

mysql -u root mysql

Your command prompt will change to mysql>, which indicates you are inside the mysql program until
you quit using the quit command.

To update the grant tables, type:

UPDATE user SET Password=PASSWORD("your_password") WHERE User="root";

When you issue this command through the mysql program, you're giving a SQL command to update the
table user in the database mysql. After you reset the root password, exit mysql and tell MySQL to
reread the grant tables with the command:

mysqladmin -u root reload

Now you can reaccess the mysql program and other client programs only if you use the proper root
password. To restart the mysql program on the mysql database, give the command:


312
mysql --user=root --password mysql

You'll be prompted for your password. If you enter the password on the command line, instead of
allowing mysql to prompt you for the password, the password can become visible to other users (or
hackers) of your system.

If you install MySQL from RPMs on a Linux system, during the installation the mysql user ID is added
to your system. This user should own the MySQL data directory and its subdirectories. The MySQL
daemon runs as a process started by system user mysql, and access to the database is controlled by that
user. You can set the system password for user mysql using the Unix passwd command as root. To set
the MySQL password for this user, you may need to use SQL commands to insert the user mysql into
the grant tables. The SQL statement that creates the mysql user and grants it global access permissions
for all of your databases is:

INSERT INTO user VALUES("localhost","mysql",PASSWORD("your_password"),
"Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y");}

For more on administration and security of MySQL databases, we suggest you consult the pertinent
books listed in the Bibliography.

13.4.3 Setting Up the Data Directory

If you install MySQL from RPMs, your data directory is automatically located in /var/lib/mysql. When
you set up your workstation, you may not have left much space on the /var partition. If you're going to
be doing a lot with databases, you probably want to give the MySQL data directory some room to
grow.

An easy way to do this is to relocate the data directory to a different partition and create a symbolic link
from that directory to /var/lib/mysql. If you relocate the data directory this way, you don't have to
change any MySQL configuration information.

First, choose a location for your data directory. You can, for example, create a directory
/home/mysql/data. Then, shut down your MySQL daemon using: [4]




[4]
You also need to include -- user=mysql -- password on the mysqladmin command line, but from now on, we're going to assume you know that.

mysqladmin shutdown

Using the Unix mv command, move all the files in /var/lib/mysql to /home/mysql/data. Once the
/var/lib/mysql directory is empty, use rmdir to remove it. cd to the /home/mysql directory and type:

chown -Rf mysql:mysql data

This sets the proper file ownership for all the files in that directory. Finally, use ln -s to create a
symbolic link between the /home/mysql/data directory and /var/lib/mysql. Then restart your MySQL
server by typing:

safe_mysqld &

You'll probably need to be logged in as the superuser to do this.

313
13.4.4 Creating a New Database

Once your MySQL server is installed and running, you need to create a new database and grant
yourself the correct permissions to read and write to that database. You can do this as MySQL user
mysql, unless you want to create a separate identity for yourself right now. We're going to make a
database of bioinformatics resources on the Web, so you need to create a database called resourcedb.
To do this, simply type:

mysqladmin --user=mysql --password create resourcedb

Then run mysql on the resourcedb database with the command:

mysql --user=mysql --password resourcedb


13.5 Database Design
The example we'll walk you through is a simple example of how to use MySQL to create a searchable
database of bioinformatics software web sites. We'll also talk a little bit about a scripting language
[5]


called PHP, which allows you to embed commands that let others access your database directly into an
HTML file, and about other ways to access your database from the Web.
[5]
Don't run out and implement this on your machine just because we talked about how to do it. The Web is teeming with out-of-date collections of bioinformatics
links (and other kinds of links), and unless you intend to be a responsible curator, no one really needs you to add to them.


If you're looking for bioinformatics or computational biology software on the Web, there are several
things you'll probably want to know about each item you find and several ways you'll want to query the
database. You'll want to know the name of each item and have access to a description of what it does
and the URL from which you can download it. You'll probably want to know the author of the item and
what papers have been published about it. You may even want to have immediate access to a Medline
link for each reference. You'll want to know what operating systems each item works under, and what
format it's available in; you may even want a direct link to the archive file. You may also want to be
able to search the database by keywords such as "sequence alignment" or "electrostatic potential."

That sounds pretty simple, right? You may be thinking that all that information would go nicely into
one table, and a complicated RDBMS isn't needed to implement this kind of database. Figure 13-3
shows what that big table looks like.

Figure 13-3. The bioinformatics software DB as one big table




However, if you look more closely, you'll see it's not really possible for even a simple bioinformatics
software database to fit in one table. Remember, data in tables must be atomic ; that is, each cell must
contain only one distinct item, not a list or a set.
314
If you think through the possibilities, you'll realize that there are several places where lists or sets might
occur in a bioinformatics software database record: there might be multiple authors, and/or multiple
publications describing the software; the software might be available for many different operating
systems; and there might be more than one keyword used to describe each item.

13.5.1 On Entities and Attributes

Databases can contain two kinds of information: information that indicates an entity or thing that might
have relationships with other things; and information that is purely descriptive of a single entity”
attributes of that entity.

In our database example, the one thing we are sure of is that a software package is an entity. Let's begin
designing the tables in this database by listing all the information associated with each software
package:

Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename
Reference
Author
Medline link

We may be able to think of more information about each software package in the database, but for the
purposes of this example, we'll leave it at that.

Entities can be described by both attributes and relationships to other entities. If an entry in a database
has no attributes and no relationships, it shouldn't be considered an entity. One item in our list of facts
about each software package definitely has attributes; each reference has an author or authors, and each
reference has a Medline link. Therefore, references should be a separate entity in the database. So we'll
need at least two tables:

SoftwarePackage
----------------
Software package ID
Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename

Reference
---------
Reference ID
Reference name
Reference year
Author
Medline link


315
We've included an "identifier" attribute in each table. Why? Imagine that there are two software
packages named BLAST. Both do very different things. They need to be distinguished from each other
in our database, but not by creating another table of "things named BLAST." The unique ID allows us
to store information about two software packages named BLAST in the same table and distinguish
them from each other.

Ideally, we want entities to have either one-to-one relationships or one-to-many relationships with each
other. The relationship of references to software packages is a one-to-many relationship: each software
package can be described by many references, but each reference describes only one software package
(see Figure 13-4). Many-to-many relationships can't be sorted out by the RDBMS software, so they
need to be eliminated from the data model before creating a database.

Figure 13-4. Relationship of package to reference




If you're observant, you'll notice that within the Reference table, there is a many-to-many relationship
just waiting to happen. Each author can produce many references, and each reference can have many
authors. The presence of that many-to-many relationship indicates that Author should probably be a
separate entity, even though we haven't chosen to store any attributes about authors in our current data
model. So we actually need a third table:

Reference
---------
Reference ID
Reference name
Medline link

Author
---------
Author ID
Author Name

Even after we create a new table for the Author entity, though, the relationship between authors and
references is still many-to-many. This can be resolved by creating a junction entity that has no purpose
other than to resolve the many-to-many relationship between the two. The junction entity could be
called AuthorRef, or any other arbitrary name. Its only attributes will be its unique identifier (primary
key) and the foreign keys that establish its relationship with the other tables.

13.5.2 Creating a Database from Your Data Model

When you actually create your database, entities become tables. Every attribute becomes a column in
the table, and the ID becomes the primarykey for that table. Relationships to information in other tables
become foreign keys.

316
Before relationships are established, the four tables in our database contain the following information:

SoftwarePackage
----------------
Software package ID
Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename

Reference
----------------
Reference ID
Reference name
Reference date
Medline link

AuthorRef
---------
AuthorRef ID

Author
---------
Author ID
Author Name

Each attribute is a column in the table, and each column must have a datatype. The primary keys can be
integer values, but they can't be NULL or empty. The appropriate datatype for the primary key
identifiers is thus INT_NOT_NULL; the rest of the fields can be defined as TEXT fields of one size or
another.

13.5.3 Creating Relationships Between Tables

To store the relationships between tables in your database, you place the primary key from one table in
a column in the other table; that column is called a foreignkey. In our example, the primary key of the
SoftwarePackage table is also entered in a column in the Reference table, because there is one software
package to many references. The primary key from the Reference table and the primary key from the
Author table become foreign keys in the AuthorRef table; there are many AuthorRefs for each author,
and many AuthorRefs for each reference.

Once you've worked out what information your tables will contain, and what format each column will
be in, you have what is called a physical model of your database and you are ready to create it using
SQL CREATE statements, as we demonstrated earlier.

13.6 Developing Web-Based Software That Interacts with Databases
The purpose of public biological databases is to allow the biology community to share data in a
straightforward manner. Nothing is more straightforward than the Web. Therefore, it's almost a given
in developing a database (especially with federal funding involved) that you will eventually think about
how to make data available on the Web. There are several technologies that allow communication

317
between web pages and databases. The oldest of these is called Common Gateway Interface (CGI)
programming, but CGI is now being augmented by other technologies such as XML and PHP.

The world of web-based software development is a rapidly changing one, and it's not our job to detail
all the available technologies in this book. However, you should be aware of what these technologies
are and roughly how they work, because every time you make a request that directs a web server to
process some information, you are using one of them.

If you want to set up your own web server and offer data-analysis services to other users, you need to
use CGI scripts or web pages that incorporate XML or PHP code. After we give brief explanations of
CGI and XML, we'll show you a couple of examples of how to use PHP commands in your web pages
to access the example database we've just created.

13.6.1 CGI

A CGI program, or script, is a software application that resides on a web server. When the CGI
program is called by a remote user of the web server, the application executes on the server and then
passes information back to the remote user in the form of a web page, as shown in Figure 13-5. CGI
programs are accessed using the Hypertext Transport Protocol (HTTP) just like normal HTML web
pages. Unlike normal web pages, however, CGI scripts either live in a special directory (such as /cgi or
/cgi-bin) within the server's web documents directory, or they have a special file extension such as .cgi.
When the server receives an HTTP request, instead of just serving the CGI code to your browser as it
does for a normal web page, the server executes the CGI program. CGI is a relatively mature
technology and is supported by all the major web servers.

CGI programs usually consist of multiple sections (see Figure 13-5). First, there may be a section of the
program that collects user input from a web form. This is followed by the section of the program that
takes the user input and does something with it. The CGI program may contain the complete code to do
the input processing, but it is more likely that the program formats the input appropriately and passes it
to a separate program that exists on the server, then collects the output from that program when the run
is completed. The final function of the CGI program is to return the output from the process that ran on
the server to the user in the form of a web page, which may contain either textual output or links to
downloadable results files, or both.

Figure 13-5. How a CGI program is executed




An example of a CGI program you might use is the BLAST server at NCBI. When you select "Basic
BLAST search" from the NCBI BLAST home page, you'll notice that the URL of the new page
actually points to a CGI script:

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=0



318
The first part of the URL, up to the question mark, gives the directions to the CGI program. The second
part of the URL is state information, which tells the CGI program what part of its functionality is
needed. The state information in this particular URL is telling the BLAST CGI program to bring up an
empty search form in which you can enter your sequence.

Once you click the "Submit" button, a new page appears. The new page lists your request ID and
approximately how long the request will take to process. Behind the scenes, the CGI program has
passed your request to the actual BLAST program, which runs on NCBI's server. When BLAST
finishes running your request, the results are written out and labeled with the request ID the server
assigned to you. The CGI program then looks for your results under that request ID.

After the search is run, you have the option of displaying your data. The URL still points to the BLAST
CGI program, but the state information changes. The URL now looks like this:

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?RID=965246273-2980-7926
&DESCRIPTIONS=100&ALIGNMENTS=50&ALIGNMENT_VIEW=0&&HTML=on&OVERVIEW=on
&REFRESH_DELAY=22

The state information that is being passed along in this URL tells the program which NCBI request ID
(RID) to search for on the server and how the results should be displayed, information that you had the
option of entering through the menus on the previous form. The new page that is displayed with this
URL contains a listing of your BLAST results as well as links to other information at NCBI. The
BLAST results and links were generated behind the scenes on the NCBI server and written to what
appears to you as a normal web page (see Figure 13-6).

Figure 13-6. Processing a BLAST request at NCBI




CGI programs produce a lot of the dynamic content currently available on the Web, although other
methods for producing dynamic content are becoming popular.

13.6.2 XML

The eXtensible Markup Language, better known as XML, is a data-representation scheme that has
attracted a great deal of attention in the last few years. Like the HTML language that describes web
pages, XML is derived from the Standard Generalized Markup Language (SGML). HTML and XML
define tags that are used to annotate a document. Tags are surrounded by angle brackets and use the
syntax <tag>text</tag>. HTML tags specify a web page's structure and appearance. For example, the
text <B>this is bold</B> is rendered in boldface.


319
XML tags, on the other hand, define a document's content. For example, in the text:

homologs of the sequence <gi>g7290345</gi>

the GenBank ID g7290345 is unambiguously identified because it is bracketed by <gi> tags. If you
write a program that searches a document for GenBank IDs, it's easier to find them if they're explicitly
labeled than if you specify a GenBank ID regular expression. Thus, XML lends structure to flat file
data such that it can be read and written in a standard way.

The tags used in a given XML document are defined in a document type definition, or DTD. The DTD
acts as a dictionary for the data format, specifying the elements that are present in a document and the
values each attribute may have. The DTD can exist in its own file, or it can be part of the XML datafile
itself. Because XML allows users to define their own tags in a DTD, it provides a rich and detailed
description of information that would potentially end up in a glob of free text (for example, the
REMARK fields in a PDB file). The downside of this descriptiveness is that records can rapidly become
bloated with details, especially when representing complex data such as the results of microarray
experiments.

The fact that XML can mark up data in flat files in a standard and uniform way is significant for
working with biological data, which is often stored in flat files. For example, if you want to use data in
the ASN.1 format described earlier in this book, you need an ASN.1 parser, which reads only ASN.1
data. By the same token, if you need to read in files in PDB format, you need a different parser for PDB
files. However, if your data is marked up in XML, any XML parser can read the data into your
program. Here is an example of an XML representation of PDB author data:

<!-- Simple PDB citation DTD -->
<!ELEMENT citation (author)+>
<!ELEMENT author (first-name, last-name)>
<!ELEMENT first-name (#PCDATA)>
<!ELEMENT last-name (#PCDATA)>

<!DOCTYPE pdbcite SYSTEM "pdbcite.dtd">
<citation>
<author>
<name>Fermi, G.</name>
<citation_id>primary</citation_id>
</author>
<author>
<name></name>
<citation_id></citation_id>
</author>
...
</citation>

A number of XML parsers are available for the programming languages commonly used in
bioinformatics research, including Perl, Java, Python, and C++. There are two basic types of XML
parser: nonvalidating and validating. Nonvalidating parsers read the XML file and ensure its overall
syntactic correctness. Validating parsers, on the other hand, can guard against missing or incorrect
values. By comparing the XML document against its DTD, a validating parser ensures that the markup
of the data isn't only syntactically correct but that each tag or attribute is associated with appropriate
values.


320
13.6.2.1 XML applications

Thanks to its flexibility and success in other domains, XML has been adopted as a data description
language for some bioinformatics projects. XML has caught on particularly well in genome annotation:
the Genome Annotation Markup Element (GAME-XML) DTD was developed by Suzanne Lewis and
coworkers at the Berkeley Drosophila Genome Project to represent sequence features in genomes.
XML is also the basis for the markup scheme proposed by Lincoln Stein, Sean Eddy, and Robin
Dowell for the distributed annotation system, DAS. Some other example applications of XML include
the Biopolymer Markup Language (BioML) sequence description format developed at Proteometrics,
the Taxonomic Markup Language developed by Ronald Gilmour of the University at Albany for
representing the topology of taxonomic trees, and the Chemical Markup Language (CML) for
representing small molecule structures.

Information about these and other applications of XML in bioinformatics are available at the web page
of the Bioxml group, the XML-specific arm of the Bioperl Project (http://www.bioxml.org). Additional
information about XML and its associated technologies are available from the WWW Consortium
(http://www.w3c.org).

13.6.3 PHP

PHP is a hypertext preprocessor module for your web server that allows it to read and interpret PHP
code embedded in web pages. PHP code resembles, but isn't identical to, familiar programming
languages such as Perl and C.

PHP runs on most web servers; see http://www.php.net for more information. Unlike some other
dynamic content technologies out there (for instance, Microsoft's ASP), PHP is an open source project
that is supported on multiple operating systems. PHP also has built-in support for interacting with
MySQL, PostgreSQL, and Oracle databases.

When a web page that incorporates PHP code is requested from a web server, the server processes the
PHP instructions in the page before passing it to the client. The page source appears to the client as
standard HTML; the PHP code remains invisible to machines beyond the web server.

PHP commands can be interspersed with HTML code in just about any order that seems useful to the
page designer, and the resulting HTML will appear at that point after the PHP code is processed. PHP
also has the capacity to include external files, so you can write the code that controls the appearance of
the web page in the main PHP file and place subroutines in separate files to be included. PHP pages are
distinguished from standard HTML files by a .php extension.

13.6.3.1 Accessing MySQL databases with PHP

Accessing a MySQL database with PHP doesn't take much work. You need one line of code to connect
to the database, another line to select which database to use, a query statement, and a statement that
sends the data as HTML to the client's web browser. A simple example might look like this:

<?php
$link =@mysql_pconnect ("myhost.biotech.vt.edu","cgibas","password") or
exit ( );
mysql_select_db ("resourcedb") or exit ( );
$result = mysql_query ("SELECT program, url, institution FROM software

321
WHERE program = "BLAST") or exit ( );
while ($row = mysql_fetch_row ($result))
{
print("<br>\n");
for ($i = 0; $i < mysql_num_fields ($result); $i++)
{
print ($row[$i]);
}
}
mysql_free_result ($result);
?>

The first line of code (< ? php) signals the start of a chunk of PHP code. The next step is to connect to
the MySQL server with the specified name and password, or terminate the script if the connection is
unsuccessful:

$link =@mysql_pconnect ("myhost.biotech.vt.edu","cgibas","password") or
exit ( );

Now you request the database called resourcedb:

mysql_select_db ("resourcedb") or exit ( );

Next, issue a MySQL query that selects values from the program, URL, and institution fields in the
software table when the program name is equal to "BLAST":

$result = mysql_query ("SELECT program, url, institution FROM software
WHERE program = "BLAST") or exit ( );

Every time a row is read from the database, you break that row down into fields and assign them to the
$row variable, then step through each of the fields in the row and print out the value of that field: [6]




[6]
The way we've done this, it will be a rather ugly plain-text printout. Adding HTML table tags at various points in the command sequence results in much prettier
output.

while ($row = mysql_fetch_row ($result))
{
print("<br>\n");
for ($i = 0; $i < mysql_num_fields ($result); $i++)
{
print ($row[$i]);
}
}

Finally, release the results from memory when the query is completed:

mysql_free_result ($result);

The last line of code (? >) terminates the PHP portion of the web page.

13.6.3.2 Collecting information from a form with PHP

Obviously, this code would be more useful if you substituted a variable name for the word "BLAST,"
and created a little form that would let the user input a word to be assigned to that variable name. All of

322
a sudden, instead of a little bit of PHP code that searches the database for BLAST servers, you have a
crude search engine to find a user-specified program in the resourcedb database.

Forms are created in PHP using PHP print statements to produce the HTML code for the form. For
example, to produce a pair of radio buttons, the PHP code looks like this:

print("<INPUT TYPE=\"radio\" NAME=\"type\" VALUE=\"Yes\" CHECKED>Yes\n");
print("<INPUT TYPE=\"radio\" NAME=\"type\" VALUE=\"No\">No\n");

Other form features are implemented analogously. For more information about forms, collecting data
using forms, and detailed examples of how to produce a PHP form, see the MySQL references in the
Bibliography.

Web database programming isn't something you can learn in a few pages, but we hope we've convinced
you that creating a MySQL database is something that you can do if needed, and that writing the PHP
code to access it won't be that much harder than working with HTML and Perl. Rather than showing
the full PHP code for the MySQL database example, we'll walk you through the important things the
PHP code will need to do.

To interact with our example database, you want a PHP script that does several major tasks:

1. Present a welcome page to the user. The page should allow the user the option of searching the
database or adding a new entry to the database. Behind the scenes, that selection needs to be
processed by the PHP script so that it subsequently presents the correct page.
2. Present a query submission form to the user. The PHP code needs to build a useful form, then
grab the data the user enters in the form and use it to build SQL SELECT statements.
3. Present query results to the user. As matching records are found in the database, the program
will have to format each one into a reasonably nice-looking piece of HTML code so that it
displays in the user's web browser in a readable format.
4. Present a form for adding a new entry. This assumes you have granted permissions for adding
entries to the database to outside users and will require you to collect username and password
information.
5. Add the new entry to the database. This routine needs to take the information from the add form
and actually use a SQL INSERT command to add it to the database.




323
Chapter 14. Visualization and Data Mining
Any result in bioinformatics, whether it is a sequence alignment, a structure prediction, or an analysis
of gene expression patterns, should answer a biological question. For this reason, it is up to the
investigators to interpret their results in the context of a clear question, and to make those results
accessible to their colleagues. This interpretation step is the most important part of the scientific
process. For your results to be useful, they must be interpretable. We'll say it again: if your results can't
be interpreted, they won't help anybody, not even you.

In this chapter, we present computational tools that help you to make sense of your results. To this end,
the chapter is organized so that it roughly parallels the data-analysis process. In the first part of this
chapter, we introduce a number of programs that are used to visualize the sort of data arising from
bioinformatics research. These programs range from general-purpose plotting and statistical packages
for numerical data to programs dedicated to presenting sequence and structural information in an
interpretable form. The second part of this chapter covers some tools for data mining”the process of
finding, interpreting, and evaluating patterns in large sets of data”in the context of some
bioinformatics applications.

The topics covered in this chapter are basically subdisciplines of the larger area of computational
statistics. As you have seen in previous chapters, statistical methods are important because they provide
a check on the significance of the researcher's discoveries. The human nervous system is very good at
finding patterns; a little too good, in fact. If you scrutinize a protein sequence for long enough, you will
begin to see patterns, whether they're biologically significant (like part of a family signature sequence,
such as P.YTVF in chloramphenicol acetyltransferase) or not (words or names, such as PER ) amidst
the amino acids. Thus, we use visualization to exploit the abilities of the eye and brain to find patterns
[1]


that may be interesting. We use statistics and data mining to keep our intuition in check and to restrict
our searches to those patterns that can be quantitatively and repeatedly shown to be significant.
[1]
When you start to see sequence motifs in words or people's names, it's time to take a break.


14.1 Preparing Your Data
Preparing your data (also known as preprocessing or data cleansing) is the most important part of data
mining. It's also the least glamorous and one of the least-discussed parts. Preprocessing can be as
simple as making sure your data is in the right format for the program that reads it, or it can involve
extensive calculations.

As a bioinformatics researcher, you must be especially careful of your data. Your results and reputation
are based on data that have been provided by other researchers. Consequently, you must be scrupulous
in collecting and using that data. The following is a list of some general questions about data integrity
to answer when you work on a project (this list isn't exhaustive; you will probably come up with other
things to check that are specific to the project at hand):

• Is your data what you expect it to be? For example, DNA sequences should only contain As, Ts,
Cs, and Gs (unless your program understands the additional IUPAC symbols). Protein
sequences should contain only the 20 amino acids. You can use grep to quickly check if your
file contains lines with bad characters.


324
• Are your datafiles correctly formatted for the programs you plan to use? Be wary of more
flexible formats. For example, some programs apply a length limit to the comment line in
FASTA files, while other programs don't.
• Be aware of sequence variants. Splice variants, mutations, deletions, sequencing errors, and
inadvertent truncations of the sequence file all can result in a different sequence than you'd
expect. It is up to you to track which differences in sequences or structures are biologically
relevant and which are artifacts of the experimental process.
• Unless the sequences you are working with have been given to you by a collaborator who has
not yet deposited them in a sequence database, make sure that you can find each of your
sequences in GenBank or another database.
• When working with large tabular data, make sure that every field in the table has an appropriate
value. Using a program such as XGobi is a good way to check this, since it complains if not
every field has a value. A visualization tool such as XGobi is also useful if the values are out of
register, since the resulting points will be outliers.
• Does the program produce meaningful results on test data? When you use a new program, you
should have some data for which you know the results, so you can test the program and make
sure it gives the right answer and behaves in a reasonable fashion (these tests are sometimes
called sanity checks). For example, if a program compares two sequences or structures, does it
give the same result regardless of which order you pass the data to it?
• Check for side effects produced by the programs you use. Does a program change any of its
input? Changes can be as subtle as adding spaces between every 10 residues in a sequence file,
normalizing numerical values, or changing the coordinate values of structural data.
• For microarray data, have the values been normalized? If the programs you are using perform
any kind of normalization, it is important that you determine how the normalization was
achieved.
• For protein structure data, are all the atom numbers and residue numbers sequential? Is the
structure intact, or does it contain chain breaks or other anomalies? Are all residues labeled with
the standard amino acid three-letter codes, and are all atoms labeled with the appropriate PDB
atom codes?

Finally, make sure you understand all the programs being used, at least as far as knowing the format
and meaning of their input and output.

14.2 Viewing Graphics
If you are going to be working with images, you need some basic tools for viewing graphics files under
Linux and Unix. There are many graphics programs available; the three that we describe next are
commonly available, easy to use, and free.

14.2.1 xzgv

xzgv is a program for viewing graphics files under the X Window System. It can display the more
popular graphics formats (GIF, PNG, JPEG), as well as a variety of others. For a simple graphics
viewer, it has some handy features. For example, it creates thumbnails (small versions of a picture that
preview the file) and can step through images one at a time, as a slideshow.

14.2.2 Ghostview and gv


325
Ghostview and gv are viewers for PostScript and Portable Document Format (PDF) files. PostScript
and PDF are page-description languages developed at Adobe Systems, Inc. Both programs allow you [2]


to page through a document, jump to specific pages, print whole documents or selected pages, and
perform other simple document-navigation tasks. More and more journals are distributing their articles
electronically as PDF files, so a document viewer such as gv is very useful for keeping up with
literature.
[2]
Adobe makes a PDF reader as well, named Acrobat Reader, which is available at no cost for Windows, Mac OS, Linux, and a handful of Unix systems.


Because it produces more compact files and cooperates with Windows applications, PDF seems to be
overtaking PostScript as the more common format. Many documents are still distributed over the Net in
PostScript, including preprints (particularly those written by people who use the LaT EX document
formatting language) and the output of some web-based services (such as the BMERC secondary
structure prediction server).

14.2.3 The GIMP

The GIMP (short for "GNU Image Manipulation Program") is an image processing package with
similar functionality and features to Adobe Photoshop. While the GIMP can open and view graphics
files, it is probably overkill to do so. However, when it comes to reformatting or modifying graphics to
use in a presentation or paper, having image-manipulation software on hand is invaluable.

14.3 Sequence Data Visualization
Tools for viewing sequence data, particularly multiple sequence alignments, were discussed in Chapter
8. As we mentioned in that chapter, one of the best ways to rapidly summarize information from a
sequence alignment is to use a sequence logo. In this section, we discuss a geometric approach to
visualizing sequence relationships and introduce TEXshade, a program for creating publication-quality
sequence alignment figures.

14.3.1 Making Publication-Quality Alignmentswith T EXshade

TEXshade (http://homepages.uni-tuebingen.de/beitz/tse.html) is a package for marking up sequence
alignments written using LaT EX, a document markup language invented for mathematicians and
computer scientists. This package is remarkably flexible, allowing you to color and label aligned
residues according to conservation and chemical characteristics. In addition, T EXshade can incorporate
secondary structure and accessibility information output from the DSSP program (described in Chapter
6), as well as predictions of secondary structure generated by the PHD prediction server. Finally,
TEXshade can automatically create "fingerprints" that provide a bird's-eye view of an alignment, in
which columns of conserved residues are represented by vertical lines. Like sequence logos,
fingerprints can rapidly summarize alignment data and find clusters of neighboring conserved residues.

TEXshade is called from within a LaT EX document. If you have a sequence alignment stored in MSF
format (the GCG multiple sequence alignment format) in the file alignment.msf, the following LaT EX
document produces an alignment formatted according to T EXshade's defaults:

\documentclass{report}
\usepackage{texshade}
\begin{document}


326
\begin{texshade}{alignment.msf}
\end{texshade}

\end{document}

LaT EX is a document markup language similar to HTML. In the preceding code example, the output of
which is shown in Figure 14-1, you are telling the T EX interpreter that this document has a beginning
and an end, and that it contains only a TEXshade alignment of the sequences in alignment.msf. You
need to mark up the resulting alignment by hand. If this sounds painful, the SDSC Biology Workbench
provides a graphical interface to T EXshade and can automatically render TEXshade-formatted sequence
alignments as GIF or PostScript images.

Figure 14-1. A TEXshade alignment and its corresponding fingerprint




14.3.2 Viewing Sequence Distances Geometrically

Multiple sequence alignments and sequence logos represent similarities at every position of a group of
aligned sequences. However, even with coloring of conserved residues, it isn't always easy to tell how
the sequences are related. Sometimes, it's useful to look at an even higher level of abstraction to see
how the sequences cluster. Phylogenetic trees represent one way to visualize relatedness.

DGEOM, a set of programs by Mark J. Forster and coworkers, takes a set of aligned sequences (either
a multiple sequence alignment in GCG's MSF format, or a set of pairwise alignments) and represents

327
them as points in a 3D space, where the distances between the points represent the evolutionary
distances between the sequences. The points are written to a PDB file and can be viewed with your
choice of protein structure viewers. Some may flinch at the idea of storing a representation of a
sequence alignment in a file format intended to store structural data, but the program works well, and
since high-quality structure visualization packages are easy to find, this approach avoids the need for a
standalone graphics system. The programs are written in Perl and C, making them fairly easy to
modify.

Another implementation of the geometric approach to viewing sequence relationships is the SeqSpace
package developed by Chris Dodge and Chris Sander at the EBI. This package includes C++ programs
for computing the sequence distances, and it uses Java viewers to render the points in 3D.

14.4 Networks and Pathway Visualization
As of this writing, there is no standard package for visualizing interactions between molecules in a
pathway or network. The most common way to represent molecular interactions schematically is in the
form of a graph. Graphs can also be useful for illustrating other data that represents interactions,
[3]


including the output of cluster analyses and interacting residues in protein structures. Biological
networks, such as metabolisms and signaling pathways, tend to be very densely connected, and creating
readable, highly connected graphs isn't an easy task. Fortunately, AT&T Research distributes GraphViz
(http://www.research.att.com/sw/tools/graphviz/), a freely available suite of programs that allow you to
draw and edit graphs. This package has three features that make it particularly attractive: it runs
quickly, it has excellent documentation, and it uses a flexible and intuitive syntax to describe graphs.
[3]
Here we are talking about the graph theory kind of graph, in which a set of dots (nodes) are connected by a set of lines (edges). Directed graphs are those in which
the edge connecting two nodes has a direction. Edges in directed graphs are usually represented using arrows pointing in the direction of a connection. In an
undirected graph, the edges have no direction.


The GraphViz package consists of five programs:

dot

Draws directed graphs

neato

Draws undirected graphs

dotty

A graphical interface for editing graphs

lefty

A language for editing graphs and other diagrams; used to write dotty

tcldot

A graphical interface to dot written in the Tcl language

328
To draw a directed graph of a small set of interactions, you can type the following code into a text
editor and save it to a file named morphopath.dot:

digraph G {
size="3, 3";
SHH -> "Early gene expression";
"FGF-4" -> "Early gene expression";
SHH -> "BMP-2";
"BMP-2" -> "FGF-4";
"FGF-4" -> SHH;
"BMP-2" -> "Late gene expression";
"FGF-4" -> "Late gene expression"; }

The dot program is invoked using the following command at the shell prompt:

% dot -Tps morphopath.dot -o morphopath.ps \end

This command tells the dot program in the GraphViz suite to produce a PostScript image of the graph
described by morphopath.dot and to put the image in the file morphopath.ps (see Figure 14-2).

Figure 14-2. morphopath pathway output




If you have some experience with the C programming language, you might recognize the similarity of
the graph description format to a struck in C. This structured approach to describing the graph's layout
makes it possible to define graphs within graphs, and also makes it straightforward to generate
GraphViz files from Perl scripts. In addition to specifying connectivity, graphs produced by the
GraphViz programs can be annotated with labels, different arrow types, and highlighted boxes.

14.5 Working with Numerical Data
Numerical data can always be fed into a spreadsheet program such as Microsoft Excel or StarOffice
Calc and plotted using the program's built-in graphing functions. Often, this is the best way to make
plots quickly. However, you will encounter situations in which you need more control over the plotting
process than the spreadsheet provides. Two common examples of this kind of situation are in
formatting your plots for publication and in dealing with high-dimensional data sets. If you do have to
create figures from your data, we encourage you to take a look at Edward Tufte's books on visual
explanations (see the Bibliography). They are full of examples and tips on making clean figures that
clearly say what you mean.



329
In this section, we describe some programs that can create plots. In addition, we introduce two special-
purpose programming languages that include good facilities for visualization as well as data analysis:
the statistics language R (and its commercial cousin, S-plus) and the numerical programming language
Matlab (and its free counterpart, Octave).

14.5.1 gnuplot and xgfe

gnuplot (http://www.gnuplot.org) is one of the more widely used programs for producing plots of
scientific data. Because of its flexibility, longevity, and open source nature, gnuplot is loaded with
features, including scripting and facilities for including plots in documents. The dark side of this
flexibility and longevity, though, is a fairly intimidating command syntax. Fortunately, a graphical
interface to gnuplot called xg fe exists. xg fe is good for quickly plotting either data or a function, as
shown in Figure 14-3. You can find out more about xg fe at
http://home.flash.net/˜dmishee/xgfe/xgfe.html.

Figure 14-3. Output from xg fe/gnuplot




If you need to exert more control over the format of the output, though, it behooves you to read through
the gnuplot documentation and see what it can do. Additionally, if you need to aumotically generate
many plots from data, you may want to figure out how to control gnuplot 's behavior using Perl or
another scripting language.

14.5.2 Grace: The Pocketknife of Data Visualization

Grace (http://plasma-gate.weizmann.ac.il/Grace/)and its predecessor, xmgr, are alternatives to gnuplot
as a fairly powerful tool for plotting 2D data. Grace uses a simple graphical interface under the X
Window System, which allows a fair amount of menu-driven customization of plots. Like xg fe, Grace
provides the fairly simple 20% functionality you need 80% of the time. In addition to its current main

330
distribution site at the Weizmann Institute of Science in Israel (which always has the latest version),
there are a number of mirror sites from which Grace can be acquired. The home site also has a useful
FAQ and tutorial introduction to working with Grace.

14.5.3 Multidimensional Analysis: XGobi and XGvis

Plotting programs such as Grace and gnuplot work well if your data has two or three variables that can
be assigned to the plot axes. Unfortunately, most interesting data in biology has a much higher
dimensionality. The science of investigating high-dimensional data is known as multivariate or
multidimensional analysis. One significant problem in dealing with multidimensional data is
visualization. For those who can't envision an 18-dimensional space, there is XGobi
(http://www.research.att.com/areas/stat/xgobi/). XGobi and XGvis are a set of programs freely
available from AT&T Labs. XGobi allows you to view data with many variables three dimensions at a
time as a constellation of points you can rotate using a mouse. XGvis performs multidimensional
scaling, the intelligent squashing of high-dimensional data into a space you can visualize (usually a 2D
plot or a rotatable 3D plot). Figure 14-4 shows output from XGobi.

Figure 14-4. Screenshot from XGobi




XGobi has a huge number of features; here is a brief explanation to get you started. XGobi takes as
input a text file containing columns of data. If you have a datafile named xgdemo.dat, it can be viewed
in XGobi by typing the following command at the shell prompt:

% xgobi xgdemo.dat &

XGobi initially presents the points in a 2D scatterplot. Selecting Rotation from the View menu at the
top of the window shows a moving 3D plot of the points that you can control with the mouse by
clicking within the data points and moving the mouse. Selecting Grand Tour or Correlation Tour from
the View menu rotates the points in an automated tour of the data space.


331
The variable widgets (the circles along the right side of the XGobi interface) represent each of the
variables in the data. The line in each widget represents the orientation of that variable's axis in the
plot. If the data contains more than three variables, you can select the variables to be represented by
clicking first within the widget of the variable you want to dismiss, and then within the widget of the
variable to be displayed. Finally, clicking on the name of the corresponding variable displays a menu of
transformations for that axis (e.g., natural logarithms, common logs, squares, and square roots). Like
the GraphViz graph drawing programs, XGobi and XGvis are superbly documented and easy to install
on Linux systems if you follow the instructions on the XGobi home page. Some Linux distributions
(such as SuSE) even include XGobi.

14.5.4 Programming for Data Analysis

In this section, we introduce two new programming languages that are well adapted for data analysis.
The proposition of learning more languages after just learning Perl may seem a little perverse. Who
would want to learn a whole language just to do data analysis? If your statistics requirements can be
satisfied with a spreadsheet and calculator, these packages may not be for you. Also, as we saw in the
last chapter, there are facilities for creating numerically sophisticated applications using Perl,
particularly the PDL.

However, many problems in bioinformatics require the use of involved numerical or statistical
calculations. The time required to develop and debug such software is considerable, and it may not be
worth your time to work on such code if it's used only once or twice. Fortunately, in the same way that
Perl makes developing data-handling programs easy, data analysis languages (for lack of a better term)
ease the prototyping and rapid development of data analysis programs. In the next sections, we
introduce R (and its commercial cousin, S-plus), a language for doing statistics; and Matlab (and its
free cousin, Octave), a language for doing numerical mathematics.

14.5.4.1 R and S-plus

R is a free implementation of the S statistics programming language developed at AT&T Bell
Laboratories. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland.
Both R and its commercial cousins (S-plus 3.x, 4.x, and 2000) are available for the Unix and Win32
platforms, and both have a syntax that has been described as "not dissimilar," so we use R to refer to
both languages.

R is usually run within an interpreted environment. Instead of writing whole programs that are
executed from the command line, R provides its own interactive environment in which statements can
be run one at a time or as whole programs. To start the R environment, type in R at the command
prompt and hit the Enter key. You should see something like this:

R : Copyright 2000, The R Development Core Team
Version 1.1.1 (August 15, 2000)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type "?license" or "?licence" for distribution details.

R is a collaborative project with many contributors.
Type "?contributors" for a list.

Type "demo( )" for some demos, "help( )" for on-line help, or

332
"help.start( )" for a HTML browser interface to help.
Type "q( )" to quit R.

The angle bracket (>) at the bottom of this screen is the R prompt, similar to the shell prompt in Unix.
In the following examples, we write the R prompt before the things that the user (that's you) is
supposed to type.

Before anything else, you should know two commands. Arguably, the most important command in any
interactive environment is the one that lets you exit back out to the operating system. In R, the
command to quit is:

> q( )

The second most useful command is the one that provides access to R's online help system, help( ). The
default help( ) command, with no arguments, returns an explanation of how to use help( ). If you want
help on a specific R function, put the name of the function in the parentheses following help. So, for
example, if you want to learn how the source( )function works, you can type:

> help(source)

You can also use ? as shorthand for help( ). So, instead of typing help (source) in the example, you can
just enter ?source.

As mentioned earlier, the R environment is interactive. If you type the following:

>2+2

R tells you that two plus two does, in fact, equal four:

>2+2
[1] 4

The R response (4 ) is preceded by a bracketed number ([1] ), which indicates the position of the
answer in the output vector. Unlike Perl, R has no scalar variables. Instead, single numbers like the
previous answer are stored in a vector of length one. Also, note that the first element in the vector is
numbered 1 instead of 0. [4]




[4]
Actually, R vectors do have a zero element, but it doesn't accept values assigned to it, and it returns numeric(0), which is an empty vector.


The <- operator is used as the assignment operator. Its function is similar to the = operator in Perl.
Typing:

> a <- 2 + 2

produces no output, since the result of the calculation is stored in the variable on the left side of the
assignment operator. In order to see what value a variable contains, enter its name at the R prompt:

<<

. 11
( 12)



>>