

-d

Produces a dump of unsigned decimal numbers

Unless you're a serious programmer, you're not likely to have to read binaries. However, on the off
chance that you do, we hope these standard tools will help you start to get your questions answered.

5.4 Transformations and Filters
Filters are programs that take input data and transform it to produce output. They can accomplish tasks, such as extracting parts of files, that word processing and spreadsheet applications can't. A transformation involves a simple manipulation of the data format, or selection of specified lines or fields from the data. In this section, we discuss some of the more commonly used filters that are part of Unix. These filters can read from standard input and write to standard output, allowing you to combine them and produce fairly complex transformations. [5]




[5]
If you need to transform data in a way that isn't allowed by the standard Unix filters, see Chapter 12, in which we discuss the Perl scripting language. Perl is a very
complete and sophisticated language that allows you to produce an infinite variety of specialized filters.


5.4.1 Extracting the Beginning of a File with head

Usage: head -number files

Say you have a program that spits out a lengthy datafile that has several different tables of information
concatenated together. Leaving aside the question of why anyone would write a program that creates
such difficult output, there are commands that allow you to work with such data, and you need to know
them. head is one such command.

By default, head sends the top 10 lines of the specified file or files to standard output. Checking the
head of a file this way is an easy way to see if there's something in the file without opening it using an
editor or doing a full cat of the file.


With the -number flag, head becomes a tool for selecting a specified number of records from the top of
a file. Combinations of head and tail commands can extract any set of lines from a file provided that
you know their location in the file.
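For example (the filename here is hypothetical), to extract lines 101 through 120 from a long output file, take the first 120 lines with head and then the last 20 of those with tail:

% head -120 bigoutput.txt | tail -20 > lines101-120.txt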

5.4.2 Extracting the End of a File with tail

Usage: tail [-f] -number files

The tail command outputs the last 10 lines of a file by default, or the last number lines if a number is specified. With the -f option, tail gives constantly updated output of the last few lines in the file. This provides a way to monitor what is being added to a text output file as it's written by another program.
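For instance, to watch a long-running job write its log file (the filename is hypothetical):

% tail -f blast_run.log

Press Ctrl-C to stop monitoring; the job writing the file isn't affected.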

5.4.3 Splitting Files with split and csplit

Usage: split -[options] filename [prefix]
Usage: csplit -[options] file criteria

The split command allows you to break up an existing file into smaller files of a specified size. Each
file produced is uniquely named with a suffix (aa, ab...az, ba, etc.). The options to split are:

-l lines

Splits the file into subfiles of length lines

-a length

Uses length letters to form suffixes

If you have a file called big-meercat.txt and you want to split it into subfiles of length 100 lines using
single-letter suffixes and writing the files out to subfiles named meercat.*, the command form of the
split command is:

% split -l 100 -a 1 big-meercat.txt meercat.

csplit also splits files into subfiles, but is somewhat more flexible than split, because it allows the use of criteria other than number of lines or bytes for splitting. Here are csplit's options:

-f prefix

Uses the specified file prefix to form subfile names

-n length

Uses suffixes of a specified length to form subfile names; subfile suffixes are made up of
numbers rather than letters

Split criteria are formed in two ways: either a regular expression is supplied as the criterion, possibly
modified by an offset, or a number of lines can be specified.

A biological sequence database in FASTA format may contain many records of the form:

>identifying header information
PROTEINORNUCLEICACIDSEQUENCEDATA

The csplit command can split such a database into individual sequence files using the command:

% csplit -f dbrecord. -n 6 fastadbfile '/^>/' '{*}'

The file is split into numbered subfiles, each containing a single sequence.

5.4.4 Separating File Components with cut

Usage: cut -c list filenames
or cut -f list -d delim -s file

The cut command outputs selected parts of each line of an input file. A line in a file is simply any
stretch of characters that ends with a specific delimiter; a delimiter is a special nontext character an
operating system or program recognizes. Lines in files are terminated with an EOL (end-of-line)
character; files themselves are terminated with an EOF (end-of-file) character. These characters are
usually invisible to you when you're working with the file, but they are important in how a file is
treated by programs that read it.

For example, say you have a file called sequence_data that contains the following:

ATC TAC
ATG CCC
GAT TCC

Here's how to use cut to output the first character of each line in the file:

% cut -c 1 sequence_data
A
A
G

And here's how to output fields 1 and 2 of each line, using the space character as the field delimiter:

% cut -f 1-2 -d ' ' sequence_data
ATC TAC
ATG CCC
GAT TCC

Portions of each defined line can be selected by character number in the line with the -c option, or by
field with the -f option. Fields are stretches of characters within a line that are defined by delimiters.
The most obvious delimiter for use within the text of a file is simply the space character, but other
characters can be used as well. Fields are different from columns, which are strictly defined by
numbering each character in the input line.

The list argument specifies the range of each line, whether in characters or in fields, to be selected.

The list is in the form of single numbers or of two numbers separated by a - character. Multiple single columns or ranges can be selected by separating them with commas. Either the first or the last number can be omitted, indicating that the cut starts at the beginning of the line or ends at the end of the line. Characters and fields in each line are numbered starting at 1.

When the -f option is used, indicating that cut is to count fields rather than characters, a delimiter other
than the default tab character can be specified with the -d option. The -s option causes cut to ignore
lines that don't contain the specified delimiter. This option can be useful, for example, for ignoring
header lines in a table.
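For instance (the filename here is hypothetical), to pull the second column from a comma-delimited table while skipping any lines that contain no commas:

% cut -f 2 -d ',' -s annotations.csv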

5.4.5 Combining Files with paste

Usage: paste -[options] files

The paste command allows you to combine fields from several files into one larger file. Unlike the join
command, which does a database-style merging of two files, paste is a purely mechanical combination
of files. Lines are combined based solely on their line number in each file: i.e., the first line of file1 is
pasted next to the first line of file2, regardless of the content of the lines. Pasted data is separated by a
tab character unless another delimiter is specified with the -d option. With the -s option and only one
input filename, paste joins all the lines in the input file into one long line.
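As a quick illustration (the filenames and contents here are hypothetical), pasting a file of gene names against a file of measurements, with a comma as the delimiter:

% paste -d ',' genes.txt values.txt
eno1,3.8
eno2,12.0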

paste can prepare datafiles to be read by data-analysis applications. If you have a group of files in the
same format and you have used other filter commands to remove corresponding information from each
of them, you can prepare one input file that allows you to plot the corresponding information from each
of the files without reading them independently. In a previous example, we used piped commands to
extract a column from a table in a complicated output file:

% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data

If you have eight similar output files for proteins 1-8, you can process them all in the same way and
then paste the results that you're interested in comparing into one big datafile:

% paste protein*.pka.data > allproteins.pka.data

Each individual file in this example might look something like this:

3.8
12.0
10.8
4.4
4.0
6.3
7.9

Each number represents the computed pKa value of one amino acid in a protein. If you have several sets of results that can be meaningfully combined into a table, paste creates a simple tab-delimited table that looks like this:

3.8 3.2 3.6
12.0 12.9 12.5
10.8 10.9 11.0
4.4 4.2 4.5
4.0 3.9 4.2
6.3 6.5 6.2
7.9 7.5 8.0

It's up to you, however, to understand how your data can be meaningfully combined into a table and to
use the paste command correctly to get the result you want.

5.4.6 Merging Datafiles with join

Usage: join -[options] file1 file2

join merges two files based on the contents of a specified join field, where lines from the two files
having the same value in the join field are assumed to correspond. Files are assumed to have a tabular
format consisting of fields and a consistent field separator, and are assumed to be sorted in increasing
order of their join fields.

Command-line options for join include:

-1 fieldnum

Uses the specified field number as the join field in file 1

-2 fieldnum

Uses the specified field as the join field in file 2

-t character

Uses the specified character as the delimiter throughout the join operation

-e string

Replaces empty output fields with the specified string

-a filenum

Produces output for each unpairable line in the specified file; can be specified for both input files; fields that would come from the other file are left empty

-v filenum

Produces output only for unpairable lines in the specified file

-o list

Constructs the output lines from the list of specified fields, where the format of the field list is
filenum.fieldnum; multiple items in the list can be separated by commas or whitespace

join is quite useful for constructing data tables from multiple files, and a sequence of join operations
can construct a complicated file. In a simple example, there are three files:
mustelidae.color:
badger black
ermine white
long-tailed tan
otter brown
stoat tan

mustelidae.prey:
ermine mouse
badger mole
stoat vole
otter fish
long-tailed mouse

mustelidae.habitat:
river otter
snowfield ermine
prairie long-tailed
forest badger
plains stoat

First, combine mustelidae.color and mustelidae.prey. The field both files have in common is the name of the animal, which is the first field in each file. mustelidae.prey isn't yet sorted, so the form of the join command needed is:

% sort mustelidae.prey | join mustelidae.color - > outfile

which produces the following output:

badger black mole
ermine white mouse
long-tailed tan mouse
otter brown fish
stoat tan vole

Now combine the resulting file with mustelidae.habitat. If you want the resulting output to be in the
form habitat animal prey color, use the command construct:

% sort -k2 mustelidae.habitat | join -1 2 -2 1 -o 1.1,2.1,2.3,2.2 - outfile

This operates on the standard input and the output file from the previous step to produce the output:

forest badger mole black
snowfield ermine mouse white
prairie long-tailed mouse tan
river otter fish brown
plains stoat vole tan

5.4.7 Sorting Files with sort

Usage: sort -[general options] -o [outfile] -[key interpretation options] -t [char] -k [keydef] ... [filenames]




The sort command can sort a single file, sort a group of files and simultaneously merge them into a single file, or check a file for sortedness. This function has many applications in data processing. Each line in the file is treated as a single field by default, but keys can also be defined by the user on the command line.

The main options for sort are:

-c

Tests a file for sortedness based on the user-selected options

-m

Merges several input files

-u

Displays only one instance of lines that compare as equal

-o outfile

Sends the output to a file instead of sending it to standard output

-t char

Uses the specified character to delimit fields

Options that determine how keys are interpreted can be used as global options, but they can also be
used as flags on a particular key. The key interpretation options for sort are:

-b

Ignores leading whitespace in a sort key.

-r

Reverses the sort order for a particular key.

-d

Uses "dictionary order" in making comparisons; i.e., characters other than letters, digits, and
whitespace are ignored.

-f

Folds lowercase letters into uppercase for the purpose of making comparisons. Normally, L and l sort separately because they belong to the uppercase and lowercase character sets; with the -f flag, all L's end up together, whether capitalized or not.

5.4.7.1 Specifying sort keys

Key definitions are arguments of the -k option. The form of a key definition is position1,position2. Each position is a numerical value that specifies where within the line the key starts and ends. Positions can have the form field.character, where field specifies the field position in the input line, and character specifies the position of the starting character of the key within its individual field. If the key is flagged with one of the key interpretation options, the form of the key is field.character[flags]. If a key interpretation option isn't applied to the whole sort, but merely to one key, it's appended to the key definition without a preceding hyphen.
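For example, to sort the mustelidae.color file from the join example above by its second field (the color), the key starts and ends at field 2:

% sort -k 2,2 mustelidae.color
badger black
otter brown
long-tailed tan
stoat tan
ermine white

To reverse the order for that key alone, append the flag to the key definition: sort -k 2,2r mustelidae.color.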

5.5 File Statistics and Comparisons
It's frequently useful to find out if two separate files are the same and, if not, where they have
differences. For instance, if you have compiled a program on your local machine, and test cases are
provided, you should run your copy of the program on the test cases and compare the output to the
canonical output provided by the makers of the program. If you want to check that the backup copy of a
file and the current version of the file are the same, file-comparison tools are very useful. Unix
provides tools that allow you to do this without laboriously searching through the files by hand.

5.5.1 Comparing Files with cmp and diff

Usage: cmp -[options] file1 file2
Usage: diff -[options] file1 file2

Let's say you have two lists and, while they look similar, you can't tell by eye if they are exactly the
same list. This can happen if you get a list of gene names back from database searches performed using
two subtly different queries and want to know if they are equivalent. In order to compare them
rigorously (and save your eyes in the process), you can try the semicomplementary commands cmp and
diff. In short, cmp tells you whether two files are identical, and diff prints any lines that are different.

cmp is fairly simple-minded. Typing:

% cmp enolase1.list enolase2.list

produces no output if the two files are identical. Otherwise, cmp returns a message that the files differ
and includes the character and line at which the first difference occurs.

diff is most useful for comparing different versions of a file to find exactly where the files differ. diff's output is rather cryptic, so it's worth taking a moment to learn how to decode it. Without options, diff responds with a list of differences in the form of the changes required to make file2 from file1:

x,y d i

Lines x through y in file1 are missing in file2 after line i (i.e., they've been deleted from file2).

i a x,y

Lines x through y in file2 are missing in file1 after line i (i.e., they've been added to file2).
i,j c x,y

Lines i through j in file1 have been changed to lines x through y in file2.

In practice, the output looks like this (where enolase1.list and enolase2.list are lists of names of putative enolases produced by two database searches performed at different times):

% diff enolase1.list enolase2.list
1a2
> ENO_MESCR
5a7
> ENOA_MOUSE

Here are two of the more immediately useful options diff uses:

-b

Ignores differences in whitespace between lines

-B

Ignores inserted or deleted blank lines between files

The info pages on diff and its variants are especially helpful. If you use this utility extensively, we
strongly recommend you give them a look.

5.5.2 Counting Words with wc

Usage: wc -[options] filename(s)

wc is a simple and useful utility for counting things in text files. Given a text file, wc counts the number
of lines, words, and bytes (characters) that it contains. The default setting for wc is to count all three
entities, so that typing it at the command prompt returns a line that looks like:

% wc meercat1.txt
41 130 905 meercat1.txt

This output tells you that there are 41 lines, 130 words, and 905 bytes in meercat1.txt. If you pass
multiple files to wc, it returns counts both for individual files and for all of them combined. For
example, if you run wc on the three meercat files:

% wc meercat1.txt meercat2.txt meercat3.txt

(or, to save time, wc meercat*.txt, being appropriately careful using the wildcard), the output looks
like:

41 130 905 meercat1.txt
50 124 869 meercat2.txt
10 19 156 meercat3.txt
101 273 1930 total


These are the options for wc:

-c

Counts only bytes (characters)

-w

Counts only words

-l

Counts only lines

--help

Prints a usage message

--version

Prints the version of wc being used

Unix tools can often be used in combination to collect information you need. For instance, say you have a list of 1,000 files that need to be processed, and the output files are all saved together in the same directory. Instead of trying to count the contents of that directory by eye with ls, you can use ls -1 dirname | wc -l to find how many output files have been created so far.
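For example, if the output files accumulate in a directory called otter_output (the name and count shown are hypothetical), you can check progress with:

% ls -1 otter_output | wc -l
347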

5.6 The Language of Regular Expressions
The pattern-matching language known as regular expressions allows you to search for and extract
matches and to replace patterns of characters in files (given the right program). Regular expressions are
used in the vi and Emacs text-editing programs. Since much of the data that biologists work with
contains patterns, one of the first skills you need to learn is how to match patterns and extract them
from files.

Regular expressions also are understood by the Perl language interpreter. Knowing how to use regular
expressions along with the basic commands of Perl gives you a powerful set of data-processing tools.
We'll cover the basics of regular expressions here, and return to them again in Chapter 12.

If you've ever used a wildcard character in a search, you've used a regular expression. Regular
expressions are patterns of text to be matched. There are also special characters that can be used in
regular expressions to stand for variable patterns, which means you can search for partial or inexact
matches. Regular expressions can consist of any combination of explicit text and special characters.

The special characters recognized in basic regular expressions are:

\


The backslash acts as an escape character for a special character that follows it. If part of the
pattern you are searching for is a dot, you give the regular expression chars\.txt to find the
pattern chars.txt.

.

The dot matches any single character.

*

The behavior of the asterisk in regular expressions is different from its behavior as a shell
wildcard. If preceded by a character, it matches zero or more occurrences of that character. If
preceded by a character class description, it matches zero or more characters from that set. If
preceded by a dot, it matches zero or more arbitrary characters, which is equivalent to its
behavior in the shell.

^

The caret at the beginning of a regular expression matches the beginning of a line. Otherwise, it
matches itself.

$

The dollar sign at the end of a regular expression matches the end of a line. Otherwise, it
matches itself.

[charset]

A group of characters enclosed in square brackets matches any single character within the brackets. [badger] matches any of (a, b, d, e, g, r). Within the set, only -, caret, ], and [ are special. All other characters, including the general special characters, match themselves. A range of characters in the form [c1-c2] can also be given; e.g., [0-9] or [A-Z].
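To make these concrete: the pattern ^ATG matches lines beginning with ATG, c.t matches cat, cot, or cut, and [0-9]$ matches lines ending in a digit. With grep, described next, such a pattern is quoted on the command line (the filename here is hypothetical):

% grep '^ATG' reads.txt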

5.6.1 Searching for Patterns with grep

Usage: grep -[options] 'pattern' filenames

grep allows you to search for patterns (in the form of regular expressions) in a file or a group of files.
GNU grep (the standard on Linux) searches for one of three kinds of patterns, depending on which of
the following functions is selected:

-G

Standard grep: searches for a regular expression (this is the default)

-E

Extended grep: searches for an extended regular expression

-F

Fast grep: rapidly searches for a fixed string (a pattern made of normal characters, as opposed to regular expressions)

Note that the -E and -F options can be explicitly selected by calling egrep or fgrep on some systems. If
no files are specified to be searched, grep searches the standard input for the pattern, allowing the
output of another program to be redirected to grep if you are looking for a pattern in the output.

As a simple example, consider the following commands:

% grep -c '>' SP-caspases-A.fasta SP-caspases-B.fasta
% grep '>' SP-caspases-A.fasta SP-caspases-B.fasta

These both search through a file of FASTA-formatted sequences (whose header lines, you will
remember, begin with the > symbol). The first command returns the number of sequences in each file,
while the second returns a list of the sequence headers. Be sure to enclose the > in quotes, though.
Otherwise, as one of us once found out the hard way, the command is interpreted as a request for grep
to search the standard input for no pattern and then redirect the resulting empty string to the files listed,
overwriting whatever was already there.

grep takes dozens of options. Here are some of the more useful ones:

-c

Prints only a count of matching lines, rather than printing the matching lines themselves

-i

Ignores uppercase/lowercase distinctions in both file and pattern

-n

Prints lines and line numbers for each occurrence of a pattern match

-l

Prints filenames containing matches to pattern, but not matching lines

-h

Prints matching lines but not filenames (the opposite of -l )

-v

Prints only those lines that don't contain a match with pattern

-q


(quiet mode) Suppresses output; grep stops searching after the first match
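For instance (the file and pattern here are hypothetical), to find every line mentioning caspase regardless of case, with line numbers:

% grep -in 'caspase' notes.txt

Or, to count the lines in a FASTA file that aren't sequence headers:

% grep -cv '>' SP-caspases-A.fasta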

In protein structure files, protein sequence information is stored as a sequence of three-letter codes,
rather than in the more compact single-letter code format. It's sometimes necessary to extract sequence
information from protein structure files. In real life, you can do this with a simple Perl program and
then go on to translate the sequence into single-letter code. But you can also extract the sequence with
two simple Unix filter commands.

The first step is to find the SEQRES records in the PDB file. This is done using the grep command:

% grep SEQRES pdbfile > seqres

This gives you a file called seqres containing records that look like this:

SEQRES 1 357 GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN 2MNR 106
SEQRES 2 357 VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR 2MNR 107
SEQRES 3 357 VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR 2MNR 108

Not all the characters in each record belong to the amino-acid sequence. Next, you need to extract the
sequences from the records. This can be done using the cut command:

% cut -c20-70 seqres > seqs

The output of this command, in the file seqs, looks like this:

GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN
VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR
VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR

If you don't want to create the intermediate file, you can pipe the commands together into one
command line:

% grep SEQRES pdbfile | cut -c20-70 | paste -s > seqs

Addition of the paste -s command joins the individual lines in the file into one long line.

5.7 Unix Shell Scripts
The various Unix shells also provide a mechanism for writing multistep scripts that let you automate
your work. Scripts are labeled as such because they contain, verbatim, the sequence of commands you
want to "say" to the shell, just as the script for a play contains the sequence of lines the author wants
the actors to say.

Shell scripts, even the simplest ones, are still applications, and they behave accordingly. Let's say you want to start a series of calculations that will take a while, and then go home to eat dinner. By default, the shell waits until one command is finished before executing the next, so if the second command acts on the output of the first, it won't start prematurely. The important thing is that you don't have to be there to type the second command.



Here's a relatively simple example. Assume you have just downloaded the entire set of GenBank DNA
sequence files. You want the information in the files, but you need it to be in a different format so that
a program you've downloaded can process it. You're going to use the program gb2fasta to convert the
files from GenBank to FASTA format. (This script assumes you've downloaded the GenBank files to
your current working directory.) Then you want to process each file using the BLAST formatdb
program. To make the script more flexible, you can write it so that it takes an optional file list on the
command line to specify which files to process. The script might look like this:

#!/usr/bin/csh
foreach file ($*)
    echo $file
    gb2fasta $file > $file.na
    formatdb -t "$file" -i $file.na -p F
end

After creating the file, you need to make it executable using the chmod command. For instance, if the
filename of the script is blastprep, give the command:

% chmod a+x blastprep

The first line of the script tells the operating system which shell program should interpret the script; when you run the script, that shell is invoked and the commands are executed. You can invoke your command immediately in the following way:

% ./blastprep gbest*.seq

To run the new script by name alone, without the ./ prefix, the directory containing it must be in your path, and under the C shell you first need to run the rehash command. rehash updates the shell's internal list of all executable files in your path.

In the previous example, all the GenBank EST files are automatically parsed and prepared for use with BLAST. The programs gb2fasta and formatdb run just as they do on the command line, but you don't have to wait for each command to complete. The script takes your command-line argument (in this case gbest*.seq, which expands to a list of filenames) and sequentially fills the variable $file with each value. It then loops through the lines between the "foreach" and "end" lines. The echo command simply sends the value of $file to standard output, so you can see in your terminal window how the job is progressing. The gb2fasta program normally prints to standard output, so you need to redirect the output to a specific filename. On the other hand, formatdb processes the input files and generates new files using an internal naming convention, so no output file is needed in the script.

5.8 Communicating with Other Computers
As we'll see in Chapter 6, the ability to plug into other computers and networks across the world allows
you to read and download an amazing amount of information, as well as share data with your
colleagues. In fact, your work as a bioinformatician depends on having access to public databases and
other repositories of biological data. In this section, we look at how your computer communicates with
other machines and the tools it uses to do so.

5.8.1 The Web



The easiest way to communicate with other computers is via the Web. Most distributions of Linux include web browser software (usually Netscape) which, if you select it from the list of installation options, is automatically installed for you. Setting up a web browser on a Linux system is the same as setting up a browser on other computers; you need to set the browser's preferences and tell it where the correct utilities are located to open different kinds of file attachments.

You may want to maintain a web page on your machine, and in order to do that, you need to install web
server software. Again, most Linux distributions allow you to install the Apache web server software as
one of your installation options. If you choose to install the Apache web server, you can publish a
simple web site by placing the appropriate HTML files in the /home/httpd/html directory.

5.8.2 IP Addresses and Hostnames

In the world of the Internet, computers recognize each other by their Internet Protocol (IP) addresses.
Computers that are constantly connected to the Internet have permanently allocated IP addresses and
hostnames, while computers that only connect to the Internet occasionally may have dynamically
allocated IP addresses, or no IP address at all, depending on the protocol they use to connect.

IP addresses consist of four numbers separated by dots (e.g., 128.174.55.33). These are interpreted as
directions to the host (a computer that communicates with other computers) by network software.
Computers also have hostnames, such as gibas.biotech.vt.edu. Name servers are dedicated machines
that maintain information about the relationships among IP addresses and hostnames.

5.8.3 telnet

Usage: telnet full.hostname

The telnet command opens a shell on a remote Unix machine; the workstation on which the command
is issued becomes a terminal for that machine. To telnet to another Unix machine, you must have a
login on that machine. Once you're logged in to the remote host, the shell works just as if you were
working directly on the remote machine. [6]




[6]
If you are logged in as root, there are certain tasks you can't do from a remote terminal.


A "login:" prompt should appear, followed by a "password:" prompt after your ID is entered.

5.8.4 ftp

Usage: ftp full.host.name.edu

The File Transfer Protocol (ftp) is a method for transferring files from one computer to another. You may be familiar with Fetch, Interarchy, or other graphical FTP clients; Unix ftp is conceptually similar to these programs (and many of them have analogs that run under Linux, if you like their graphical user interfaces). When you use ftp to connect to another host, you will find yourself in an operating environment that is unique to ftp. Unix commands don't always work in the ftp environment, although the commands ls and cd have similar functions.

Again, a "login:" prompt appears, followed by a "password:" prompt. If you are accessing an
anonymous FTP server (a common way to distribute software), the standard username is anonymous,

104
and your email address is the password. Once in the FTP environment, the most important commands
to know are:

help

Prints out the list of ftp commands. help command prints out information on a specific
command.

ls

Lists the contents of the directory on the remote host.

cd

Changes the working directory on the remote host.

lcd

Changes the working directory on the local host.

get, mget

get copies a single file from the remote host to the local host. mget copies multiple files.

put, mput

put copies a single file from the local host to the remote host. mput copies multiple files.

binary, ascii

Changes the file-transfer mode to binary or ASCII. You should choose binary when you are
downloading binary executables, images, and other encoded file formats.

prompt

Toggles the interactive mode that asks you to confirm every transfer when you transfer multiple
files.
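Putting these together, a typical anonymous FTP session to fetch a compressed database file might look like this (the server path and filename are illustrative):

% ftp ftp.ncbi.nlm.nih.gov
Name: anonymous
Password:
ftp> cd blast/db
ftp> binary
ftp> get nr.tar.gz
ftp> quit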

5.8.5 Displaying from a Remote Terminal

Sometimes you need to run an X program on another computer and have it display on your terminal.
This is relatively simple to do. First, you need to set your own terminal to allow remote displays from
other hosts. This is done using the xhost command:

% xhost +

A confirmation that access is allowed from other hosts is then printed to standard output.



Next, you need to change the display environment on the remote machine. This is done with the setenv
command:

% setenv DISPLAY yourmachine.yoursubnet.wherever.edu:0

Not all X applications running on a remote server can use your terminal for display, generally because
the remote machine and your machine don't have the same graphics capabilities. For instance,
programs running on a remote Silicon Graphics machine can't display on your local Linux workstation,
because Silicon Graphics uses proprietary graphics libraries that aren't currently available to Linux
users. However, even if both machines are compatible, bandwidth limitations can make running large
X programs over the network extremely slow.
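Putting the pieces together: after giving xhost + on your local machine and logging in to the remote host (with telnet, for example), you might run the following there (the hostname and application are illustrative):

% setenv DISPLAY workstation.mylab.edu:0
% xclock &

The xclock program runs on the remote host, but its window appears on your local screen.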

5.8.6 Communication and File Sharing

One of the biggest inconveniences for Linux users in a primarily Mac/PC environment is the sharing of
files generated by PC productivity software with other users. While it's not our purpose to teach you to
use these packages here, we can mention a few options that will help you handle communication with
non-Unix users.

Fortunately, there are relatively low-cost software products available for Linux that make it possible to
work with common file types, such as Microsoft Word and rich-text format (RTF) documents,
PowerPoint presentations, and Excel spreadsheets. Sun's StarOffice (http://www.staroffice.com) and
Applix's Applixware (http://www.vistasource.com) are two possibilities; at the time of this writing,
StarOffice seemed to do the cleanest job of converting files generated by Microsoft Word and other
commonly used programs. Adding one of these packages to your Linux system will add most of the
basic PC functions (word processing, electronic presentations, etc.) that may be vital to your work.

Most kinds of graphics files are easily handled and converted on Linux systems. One powerful tool for manipulating graphics files is called the GIMP (GNU Image Manipulation Program, http://www.gimp.org). The GIMP is commonly included in Linux distributions, so be sure to select it as part of your installation if you will be doing anything with graphics files. The GIMP is analogous to Adobe's Photoshop program and shares most of the same functionality.

5.8.7 Media Compatibility

Linux users can read and write files on Microsoft-formatted floppy disks and Zip disks. A floppy or Zip
disk is treated as an additional filesystem on your computer. The most basic way to access this
filesystem is to mount it using the mount command. To do this, you need to know the device ID of the
disk you are trying to mount and establish a mount point for the new filesystem.

Determining the device IDs of the various drives is usually straightforward. One way is to open the file
/var/log/dmesg. This file contains the system information that is printed to standard output when the
machine is booted. Scan through the file and find the drive information, which should look like this:

hdc: SAMSUNG SC-140B, ATAPI CDROM drive
hdd: IOMEGA ZIP 250 ATAPI, ATAPI FLOPPY drive
hdc: ATAPI 40X CD-ROM drive, 128KB Cache
Floppy drive(s): fd0 is 1.44M



This section of the file contains information about IDE devices. On this particular machine, the IDE
devices include a CD-ROM drive, a Zip drive, and a floppy drive. The three-letter codes hdc, hdd, and
fd0 are the device IDs.

The next section of the file contains information about SCSI devices. On this particular machine, the
main hard disk is a SCSI drive, and its ID is sda. sda1, sda2, etc., are the individual IDs of the
partitions on the hard drive:

Detected scsi disk sda at scsi0, channel 0, id 0, lun 0 SCSI device
sda: hdwr sector= 512 bytes. Sectors= 35566499 [17366 MB] [17.4 GB]
sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 >

5.8.8 Accessing Devices as Unix Filesystems

Once you know the device IDs, mounting these new filesystems is simple. If you're the root user of
your own machine, the command is:

mount -t [filesystem type] devicefile mountpoint

For example, to mount a PC-formatted floppy disk at /mnt/floppy, the command is:

% mount -t msdos /dev/fd0 /mnt/floppy

You can find a listing of allowed file types in the manpages for mount.

As a shortcut, you can modify your /etc/fstab file to contain the following lines:

/dev/fd0    /mnt/floppy    vfat    noauto,owner    0 0
/dev/hdd4   /mnt/zip       vfat    noauto,owner    0 0

On this system, the Zip drive is located at /dev/hdd. All PC-formatted Zip disks use partition number 4,
and the device file for that partition is /dev/hdd4. The noauto flag means that these disks aren't mounted
automatically at boot time. Once these lines are added to /etc/fstab, the devices can be mounted with
the shortened command mount devicefilename.
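With those lines in /etc/fstab, mounting a Zip disk reduces to:

% mount /mnt/zip

Either the device file or the mount point can be given; unmount the disk with umount /mnt/zip before ejecting it.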

Once the Zip or floppy is mounted as a partition, the files on that disk can be treated like any other file
on the system.

Getting some of these devices working isn't as straightforward as we'd like it to be. For further help,
you can search the Web for the Linux how-to pages for the particular device you're using.

5.8.9 Accessing Devices as DOS Disks

If you install the utility package mtools and its graphical frontend mfm, you can run mfm and move files
to Zip or floppy disks, using a graphical interface similar to that on a PC. However, if you use this
method to access devices, you can't run Unix commands on the files stored on your media until you
move them onto the local hard disk.

By default, processes to access media may be run only by the root user. It's possible to configure your system so that other users can write to floppy and Zip drives. However, this creates a security hole in your system. You have to decide for yourself whether the benefits of easy disk access outweigh any potential risks.

5.9 Playing Nicely with Others in a Shared Environment
Unix environments traditionally have been multiuser environments. While the availability of new
flavors of Unix for personal computers might change this on your computer at home, at work you will
probably use a shared, networked Unix system at least some of the time. And even on a personal Unix
system, you need to be aware of problems that can arise when you create an excessive load on your
system, and of how background processes can interfere with your ability to run interactive processes.

Because the Unix operating system can interact with more than one user at a time, from terminals
attached directly to the system or over a network, there can be many processes executing on your
system. Some processes will be yours, and others will belong to users who may be working across the
room from you or hundreds of miles away. To be a good citizen in a Unix environment, you need to
share the system's resources. While administrators of large public systems make it nearly impossible
for you to be a bad citizen by implementing quotas for space usage and queueing systems for process
management, it isn't likely that all systems you use will be so tightly managed. On shared systems in
which good faith is all that's keeping users from stepping on each other's toes, it's wise to manage your
own processes responsibly. Otherwise someone's going to come gunning for you, and it won't be pretty.

5.9.1 Processes and Process Management

A Unix system carries out many different operations at the same time. Each of these operations, or
processes, has a unique process ID that allows the user or the administrator to interact with that process
directly.

A minimum number of processes run on a system regardless of whether you actively initiate them. Each shell program, whether idle or active, has a process ID attached to it. Several system (or root) processes, sometimes known as daemons, are constantly active on the system. These processes often lie in wait for you to initiate some activity they handle: for instance, printing files, sending email, or initiating a telnet session.

Above and beyond this minimal system activity level are any processes you initiate by typing a
command or running a program. The Unix kernel manages these processes, allocating whatever
resources are available to the processes according to their needs.

Each process uses a percentage of the processing capacity of the system's CPU or CPUs. It also uses a
percentage of the system's memory. When the processes running on a machine require more than 100%
of the CPU's capacity to execute, each individual process will execute more slowly. While Unix does
an extremely good job of juggling hundreds of processes that run at the same time without having the
machine roll over and die, eventually you will see a situation where the load on the machine increases
to the point that the machine becomes useless. The operating system uses many techniques to prevent
this, such as limiting the absolute number of processes that can be started and swapping idle jobs out of
memory. Even on a single-processor system, it's possible to have multiple processes running concurrently as long as there is enough space for all of them to remain in memory. At the point at which the CPU has to constantly wait for data to be loaded from the swap space on the hard drive, you will see a great drop in efficiency. This can be monitored using the top command, which is described in Section 5.9.1.3. Many machines are more limited by lack of memory than they are by a slow CPU, and
it's often now more cost-effective to put money into additional RAM than to buy the latest, greatest,
and fastest CPU.

5.9.1.1 Checking the load average

Usage: w

The w command is available on most Unix systems. This command can show you which other users are
logged into the system and what they are doing. It also shows the current load average on the system.

The standard output of the w command looks like this:

2:55pm up 37 days, 4:50, 4 users, load average: 1.00, 1.02, 2.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
jambeck tty1 22Jan99 37days 3:55m 0.06s startx
jambeck ttyp0 :0.0 Wed 5pm 1:34m 0.22s 0.22s -csh
jambeck ttyp3 :0.0 21Feb99 3:47 9.05s 8.51s telnet weasel
god ttyp2 around 2:52pm 0.00s 0.55s 0.09s create world

The first line of the output is the header. It shows the time of day, how long the machine has been up,
how many users are logged in, and what the load average on the system has been for the last 1 minute,
5 minutes, and 15 minutes. The load average represents the fractional processor use on the system. If
you have a single processor system and a load average of 1, the system is being used at optimal
capacity. A four-processor system with a load average of 2 is being used at only half of its capacity. If
you log in to a system and it's already being used at or beyond its capacity, it's not polite to add other
processes that will start running right away. The batch or at commands can set up a process to start
when resources become available.

The information displayed for each user is the username, the tty name, the remote host from which the
user is logged in, the login time, the idle time, the JCPU and PCPU times, and what the user is doing.

5.9.1.2 Listing processes with ps

Usage: ps [options]

ps produces a snapshot of the processes running at the moment you issue the command. Depending on what your computer is doing at the time, typing ps at the prompt should give output along the lines of:

PID TTY TIME CMD
36758 ttyq10 0:02 tcsh
43472 ttyq10 0:00 ps
42948 ttyq10 4:24 xemacs-20
42967 ttyq10 1:21 fermats-last-theorem-solver

Most of ps 's options modify the types of processes on which ps reports and the way in which it reports
them. Here are some of the more useful options:

a


Lists every command running on the computer, including those of other users

l

Produces a long listing of processes (process memory size, user ID, etc.)

f

Lists processes in a "tree" form, showing related processes

Notice that you don't need to precede the option with a dash. There are actually a couple of dozen options for ps; check info ps to see which options are supported by your local installation.

5.9.1.3 top

Usage: top -[options]

The top command provides real-time monitoring of processor activity. It lists processes on the system,
sorted by CPU usage, memory usage, or runtime. The top screen looks like this:

4:34pm up 37 days, 6:29, 4 users, load average: 0.25, 0.07, 0.02
42 processes: 39 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 42.9% user, 6.4% system, 0.0% nice, 51.0% idle
Mem: 39092K av, 38332K used, 760K free, 13568K shrd, 212K buff
Swap: 33228K av, 20236K used, 12992K free 8008K cached

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
516 jambeck 15 0 4820 3884 1544 R 0 30.4 9.9 4:23 emacs-fgyell
415 root 9 0 10256 9340 888 R 0 15.5 23.8 161:41 /usr/X11R6/b
10756 cgibas 5 0 716 716 556 R 0 2.3 1.8 0:01 top -ci

The header is similar to the output of w but more detailed. It gives a breakdown of CPU and memory
usage in addition to uptime and load averages. The display can be changed to show a variety of fields.
The default configuration of top is set in the user's .toprc file or in a systemwide /etc/toprc file.

Here are the top options:

-d delay

Updates the display every delay seconds

-q

Refreshes without any delay, running at the highest possible priority

-s

Runs in secure mode, with its most potentially dangerous commands disabled

-c

Prints the full command line instead of just the command you're running

-i

Ignores all processes except those currently running

While top is running, certain interactive commands can be entered, unless they are disabled from the
command line. The command i toggles the display between showing all processes and showing just the
processes currently running. k kills a process. It prompts you for the process ID of the process to kill
and the signal to send to it. Signal 15 is a normal kill; signal 9 is a swift and deadly kill that can't be
ignored by the process. r changes the running priority of a process, implementing the renice command
discussed in Section 5.9.1.5. It prompts you for the process ID and the new priority value for the job.

5.9.1.4 Signaling processes with kill

Usage: kill [-s signal | -p] [-a] PID

The kill command lets you terminate a process abnormally from the command line. While kill can
actually send various types of signals to a process, in practice it's most often used in the form kill PID
or, if that fails to kill the process, kill -9 PID.

On most systems, kill -l lists the available types of signals that can be sent to a process. It's sometimes
useful to know that jobs can be stopped and restarted with kill -s STOP and kill -s CONT. [7]




[7]
Discussion of the other signals can be found in any of the comprehensive Unix references listed in the Bibliography.


A PID is usually just the numerical process ID, which you can find with the ps or top commands. It can
also be a process name, in which case a group of similarly named processes can be addressed. Another
useful form of PID is -n process group ID, which allows the kill command to address all the processes
in a group simultaneously.
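For example, to terminate the long-running job from the earlier ps listing:

% kill 42967
% kill -9 42967

The first form politely asks the process to exit; the second forces the issue if the first is ignored. Similarly, kill -s STOP 42967 suspends the job, and kill -s CONT 42967 resumes it.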

5.9.1.5 Setting process priorities with nice and renice

Usage: nice -n val command [args]
Usage: renice -n incr [-g|-p|-u] id

Processes initiated on a Unix system run at the maximum allowed priority unless you tell them to do
otherwise. The nice and renice commands allow the owner of a process, or the superuser, to lower the
priority of a job.

If limited computing resources are shared among many users and computers are used simultaneously
for computation and interactive work, it's polite to run background jobs (jobs that run on the machine
without any interactive interface) with a low priority. Otherwise, interactive jobs such as text editing or
graphical-display programs run extremely slowly while background jobs hog the available resources.
Jobs running at a low priority are slowed only if higher-priority processes are running. When the load
on the system is low, background jobs with low priority expand to use all the available resources.




You can initiate a command at a low priority using nice, where n is the priority value. On most systems, this is set to 10 by default and can range from 1 to 19, or 0 to 20. The larger the number, the lower the priority of the job, of course.

The renice command allows you to reset the priority of a process that's already running. incr is a value added to the current priority. Thus, if you have a background process running at normal priority (priority 1) and you want to lower its priority (by increasing the priority number), you can enter renice -n 18 to increase the priority value to 19. You can also give a negative increment to put the job at high priority, but only the root user can raise a job's priority. The renice options -p, -g, and -u cause renice to interpret id as a process ID, a process group ID, or a user number, respectively.
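A minimal sketch (the program name and process ID are hypothetical):

% nice -n 19 simulate_otters > otters.out &
% renice -n 18 -p 4264

The first command starts a background job at the lowest priority; the second adds 18 to the priority value of the already running process 4264.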

5.9.2 Scheduling Recurring Activities with cron

The cron daemon, crond, is a standard Unix process that performs recurring jobs for the system and
individual users. System activities such as cleanup of the /tmp directory and system backups are
typically functions controlled by the cron daemon. Normal users can also submit their own jobs to
cron, assuming they have permission to run cron jobs. Details about cron permissions are found in the
crontab manpage. Since the at and batch commands, which are discussed later, are also controlled by
cron, most systems are configured to allow users to use cron by default.

5.9.2.1 Submitting jobs to cron using crontab

Usage: crontab -[options] file

Submission of jobs to cron is done using the crontab command. crontab -l > file places the current
contents of your crontab into a file so you can edit the list. crontab file sends the newly edited file back
and initializes it for use by cron. crontab -r deletes your current crontab.
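In practice, the edit cycle looks like this:

% crontab -l > mycron
% vi mycron
% crontab mycron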

cron processes the contents of all crontabs and then initiates jobs as scheduled. A crontab entry as produced by crontab -l looks like:

# Format of lines:
#min hour daymo month daywk cmd
50 2 * * * /home/jambeck/runme

This entry runs the program runme at 2:50 A.M. every day. An asterisk in any field means "perform this function every time." In this entry, all output to either STDOUT or STDERR is mailed to user jambeck's email account on the machine where the cron job ran.

5.9.2.2 Using cron to schedule a recurrent database search

What if your group performs DNA sequencing on a daily basis, and you want to use the sequence-
alignment program BLAST to compare your sequences automatically against a nonredundant protein
database? Consider this crontab entry:

01 4 * * * find /data/seq/ -name "*.seq" -type f -mtime -1 -exec /usr/bin/csh /usr/local/bin/blastall -p blastx -d nr -i '{}' ';'

This automatically runs at 4:01 A.M. and checks for all sequences that have been modified or added to
the database in the last 24 hours. It then runs the BLASTX program to search your copy of the
nonredundant protein sequence database for matches to your new sequences and mails you the results.
This example assumes you have all the necessary environment variables set up correctly so that
BLAST can find the necessary scoring matrixes and databases. It also uses a default parameter set,
which may need to be modified to get useful results. Once you get it configured correctly, all you have
to do is browse through your email while you drink your morning coffee.

5.9.2.3 Scheduling processes with batch and at

Usage: at -[options] time
Usage: batch -[options]

The batch and at commands are standard Unix functions and are commonly available on most systems.
Jobs are submitted to queues, and the queues are processed by the cron daemon; jobs are governed by
the same restrictions as crontab submissions. The batch command assigns priorities to jobs running on
the system. Using batch allows a system administrator to sort jobs by priority”high to low”thereby
allowing more important jobs to run first. Unless the system has a mechanism to kill interactive jobs
that exceed a specified time limit, this use of the batch queue relies on users to work in a cooperative
manner. On larger systems the function of batch is usually replaced by more complicated queuing
systems. You need to get information from your system administrator about which batch and at queues
are available.

at allows you to submit a job to run at some specified time. batch sequentially runs jobs whenever the
machine load drops below a specified level and the number of concurrent batch jobs has not been
exceeded. Once you initiate at or batch, all command-line entries are considered part of the job until
you terminate the submission with a Ctrl-D keystroke. Like cron, any STDOUT and STDERR
generated by the job are mailed to you, so you at least get notified of error conditions. Here are the
common options:

-q queuename

Specifies the queue. By default, at uses the "a" queue; batch uses the "b" queue.

-l

Causes at to list the jobs currently in the specified queue.

-d jobid

Tells at to delete a specified job.

-f filename

Instructs at to run the job from a file rather than standard input.

-m

Instructs batch and at to send mail upon completion, even when no output is generated.

time
Time can be now, teatime, 7:00 P.M., 7:00 P.M. tomorrow, etc. Check the manpage for more
details.

As an example, let's say that you want your boss to think you were slaving away at 3:00 A.M. Simply send her mail at 3:07 A.M. Even if you don't plan on being awake, it's no problem. Just type:

% at 3:07am
Mail -s "big breakthrough" boss@wherever < /home/jambeck/news
<Ctrl-d>
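batch works the same way, except that no time is given; the job simply starts when the system load permits (the command shown is illustrative):

% batch
blastall -p blastp -d nr -i otter.pep -o otter.out
<Ctrl-d>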

5.9.3 Monitoring Space Usage and File Sizes

As fast as available disk space on a system expands, users seem to be able to expand their files to fill it.
Software takes up more space; output files become larger and more complex; more layers of analysis
can be created. Since the infinitely large data-storage medium has yet to be invented, you can still run
up against disk-space limitations. So, you need to be able to monitor how much space you are using
and, as we'll discuss in Section 5.9.4, how to make data archives and store them on appropriate media.

5.9.3.1 Checking disk usage with du

Usage: du -[options] filenames

du reports the number of disk blocks used by the specified file or files. Without a filename, it reports
disk usage for all files in the current working directory. The -s flag causes du to report values only for
the named file, rather than for the file and its subdirectories.

5.9.3.2 Checking for free disk space with df

Usage: df

df reports free disk space for local and networked filesystems on your computer. df is a useful way to find out which filesystems are mounted on your computer. If a connection to a filesystem you would expect to find is down, that filesystem doesn't appear in the df output. The df output looks like this:

Filesystem Type blocks use avail %use Mounted on
/dev/root xfs 17506496 14113592 3392904 81 /
/dev/xlv/xlv_raid_home xfs 62783488 39506328 23277160 63 /scratch-res1
/dev/dsk/dks0d5s0 xfs 17259688 15000528 2259160 87 /mnt/root-6.4
/dev/dsk/dks12d1s7 xfs 17711872 11773568 5938304 67 /ip410
/mnt/local/jmd/balder: NFS server balder not responding
zeus:/hamr nfs 2205816 703280 1502536 32 /nfs/zeus/hamr
zeus:/hamrscr nfs 4058200 2153480 1904720 54 /nfs/zeus/hamrscr
zeus:/lcascr1 nfs 142241472 103956480 38284992 74 /nfs/zeus/lcascr1

The first column is the actual location of the filesystem. In this case, locations preceded with / are local, and those preceded with a hostname (e.g., zeus:) are physically part of another machine. The second column shows the filesystem type; for a remote filesystem, this indicates the protocol (here, NFS) used to mount it, that is, to connect it to your computer. The next three columns show how many blocks the filesystem contains, how many of those are in use, and how many are available, followed by the percent use of each device. The final column shows the local path to the filesystem.

It's useful to know these things if you are working on a system that is made up of multiple networked
machines. From time to time connections are lost, like that to balder in the previous example. You may
log in to a machine that can't find your home directory because an NFS connection is down. At these
times, it's useful to be able to figure out what the problem is so you can send a concise and helpful
email to the system administrator rather than just saying "help! My home directory is missing."

5.9.3.3 Checking your compliance with system quotas with quota

On some Unix systems, especially those that provide services to many users, system administrators implement disk space quotas for each user. The consequences of exceeding a disk space quota may be
obvious. You might find that you're unable to write files or that you are automatically prompted to
delete files each time you log in. Or, the consequences may be silent, but very annoying. For instance,
if you exceed a quota, you may be able to run a text editor, only to find that it has overwritten the file
you were editing with a file of length zero. Or your older files may simply start to be deleted as space is
needed by other users.

If you're paying for computer time on a shared system, it's in your interest to find out what the user
quota for the system is, for how long you can exceed it, what will happen if you exceed it, and where
and how you can archive your files.

The quota command gives basic information about space usage and quota limits on systems with
quotas. On most Unix systems, issuing the command quota -v gives space use information even when
user disk quotas haven't been exceeded.

5.9.4 Creating Archives of Your Data

So, after months of your time, hundreds of megabytes of files, and several layers of subdirectories, the
otter project is finally complete. Time to move on to the next project with a clean slate. But as
refreshing as it may sound, you can't just type:

% rm -rf otter/

Other people may need to look back at your findings or use them as a starting point for their own
research. At the other extreme, you can't leave your files lying around or laboriously copy them a few
at a time to another location. Not every file needs to be accessible at all times; some files are replaced,
while others are more conveniently stored elsewhere. This section covers the tools provided by Unix
for archiving your data so you don't have to worry about it on a day-to-day basis but can find things
later when you need them.

5.9.4.1 tar: Hold the feathers

Usage: tar functions [options] [arguments] filenames

After going through all the effort of setting up your filesystem rationally, it seems like a waste to lose
that structure in the process of storing it away, like hastily packed dishes in an unexpected cross-
country move. Fortunately, there is a Unix command that lets you work with whole directories of files
while retaining the directory structure. tar rolls a directory and all its component files and (if you ask for it) subdirectories into a single archive file, conventionally given the name of the directory plus a .tar extension. tar's arguments break down into two types: functions (of which you must choose one) and options. tar is short for "tape archive," since the utility was originally designed to read and write
archives stored on magnetic tape. Another common use of tar is to package software in a form that can
be easily transferred over the Internet.

To run tar, you must choose one of the following functions:

c

Creates a new tape archive

r

Appends the files to an existing archive

u

Adds files to the archive if they aren't present or are modified

x

Extracts files from an existing archive

t

Prints a table of contents of the archive

The options for tar are as follows:

f archive

Performs the specified operation on archive, which can either be a device (such as a tape drive
or a removable disk) or a tar file

v

(verbose mode) Prints the name of each file archived or extracted with a character to indicate
the function (a for archived; x for extracted)

w

(whiny mode) Asks for confirmation at every step

Note that neither functions nor options require the hyphen that usually precedes Unix command
options.

If you type:

% tar cvf otter.tar otter/

the otter/ directory and all its subdirectories are rolled into a single file called otter.tar. It's good
practice to use the v option, so you can see if something is going horribly wrong while the archive is
being processed.

If, on the other hand, you want to make an archive of the otter/ directory on the tape drive nftape, you
can type:

% tar cvf /dev/nftape otter/
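Later, you can list the archive's table of contents with the t function or unpack it with x:

% tar tvf otter.tar
% tar xvf otter.tar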

A couple of warnings about tar are in order. First, before you use tar on your system, you should use
which to find out whether the GNU or the standard version is installed. Several of the options mean
different things to each version; the ones listed earlier are the same in each version.

Second, the tar file you create will be as large as all the contents of the directory and subdirectories beneath it. This condition has dire implications if your archived directory is large and you have limited disk space, or you need to transfer large amounts of tar'd data. In these cases, you should break down the directory into subdirectories of a more manageable size, and tar those instead.

If you don't have the space on your current filesystem or partition for your files and the archive you are
