
We then used a homegrown Perl program called unique.pl to generate a nonredundant database, in
which no protein's sequence had greater than 25% similarity to any other protein in the set. Thus, we
can represent this information economically using the filename PDB-unique-25 for files related to this
data set. For example, the list of the names of proteins in the set, and the file containing the proteins'
sequences in FASTA format (a common text-file format for storing macromolecular sequence data),
are stored, respectively, in:

PDB-unique-25.list
PDB-unique-25.fasta

Files containing derived data can be named consistently as well. For example, the file containing all
seven-residue pieces of protein structure derived from the nonredundant set is called PDB-unique-25-
7.shard. This way, if you need to do something with all files pertaining to this nonredundant database,
you can use the wildcard PDB-unique-25*, ignoring databases generated by different programs or those
generated with unique.pl at different similarity thresholds.
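As a small sketch of how this naming scheme pays off, the following commands create empty stand-in files in a scratch directory and then select one data set with a wildcard. The file PDB-unique-40.list is a hypothetical set built at a different threshold, added only for contrast; it isn't part of the project described above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty stand-ins for the files of the 25% data set.
touch PDB-unique-25.list PDB-unique-25.fasta PDB-unique-25-7.shard
# A hypothetical data set generated at a different threshold.
touch PDB-unique-40.list

# The wildcard picks up only the files that belong to the 25% set.
matches=$(ls PDB-unique-25*)
echo "$matches"
```

Only the three PDB-unique-25 files are listed; the file from the other threshold is ignored.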

File naming conventions can take you only so far in organizing a project; the simple naming schemes
we've laid out here will become more and more confusing as a project grows. For larger projects, you
should consider using a database management system (DBMS) to manage your data. We introduce
database concepts in Chapter 13.

4.2 Commands for Working with Directories and Files
Now that you have the basics of filesystems, let's dig into the specifics of working with files and
directories in Unix. In the following sections, we cover the Unix commands for moving around the
filesystem, finding files and directories, and manipulating files and directories.

As we introduce commands, we'll show you the format of the command line for each command (for
example, "Usage: man name"), and describe the effects of some options we find most useful.

4.2.1 Moving Around the Filesystem

When you open a window on a Linux system, you see a command prompt:

$

Command prompts can look different depending on the configuration of your system and your shell.
For example, the following user is using the tcsh shell environment and has configured the command
prompt to show the username and current working directory:

[cgibas@gibas ~]$

Whatever the style of the command prompt, it means that your computer is waiting for you to tell it to
do something. If you type an instruction at the prompt and press the Enter key, you have given your
computer a command. Unix provides a set of simple navigation commands and commands for
searching your filesystem for particular files and programs. We'll discuss the format of commands
more thoroughly in Chapter 5. In this chapter, we'll introduce you to basic commands for getting
around in Unix.

4.2.1.1 You are here: pwd

pwd stands for "print working directory," and that's exactly what it does. pwd sends the full pathname
of the directory you are currently in, the current working directory, to standard output; that is, it prints to the
screen. You can think of being "in" a directory in this way: if the directory tree is a map of the
filesystem, the current working directory is the "you are here" pointer on the map.

When you log in to the system, your "you are here" pointer is automatically placed in your home
directory. Your home directory is a unique place. It contains the files you use almost every time you
log into your system, as well as the directories that you create to store other files. What if you want to
find out where your home directory is in relation to the rest of the system? Typing pwd at the command
prompt in your home directory should give output something like:

/home/jambeck

This means that jambeck's home directory is a subdirectory of the home directory, which in turn is a
subdirectory of the root (/) directory.

4.2.1.2 Changing directories with cd

Usage: cd pathname

The cd command changes the current working directory. The only argument commonly used with this
command is the pathname of a directory.[2] If cd is used without an argument, it changes the current
working directory to the user's home directory.

[2] As you'll see when we cover the Unix shell and the command line in Chapter 5, Unix commands can be issued with or without arguments on the command line.
The first word in a line is always a command. Subsequent words are arguments and can include options, which modify the command's behavior, and operands, which
specify pathnames. Words in the command line are items separated by whitespace (spaces or tabs).
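Here is a short sketch of pwd and cd working together. The variable names (start, at_root, at_home) are ours, chosen only for illustration, and the home directory reported depends on your own account.

```shell
# Remember where we started so we can come back.
start=$(pwd)

cd /                     # an absolute pathname: move to the root directory
at_root=$(pwd)

cd                       # no argument: move to your home directory
at_home=$(pwd)

cd "$start" || exit 1    # return to the starting directory

echo "root directory: $at_root"
echo "home directory: $at_home"
```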


In order for these "you are here" tools to be helpful, you need to have organized your filesystem in a
sensible way in the first place, so that the name and location of the directory that you're in gives you
information about what kind of material can be found there. Most of the filesystem of your machine
will have been set up by default when you installed Linux, but the organization of your own directories,
where you store programs and data that you use, is your responsibility.

4.2.2 Finding Files and Directories

Unix provides many ways to find files, from simply listing out the contents of a directory to search
programs that look for specified filenames and the locations of executable programs.

4.2.2.1 Listing files with ls

Usage: ls [-options] pathname

Now that you know where you are, how do you find out what's around you? Simply typing the Unix
list command, ls, at the prompt gives you a listing of all the files and subdirectories in the current
working directory. You can also give a directory name as an argument to ls. It then prints the names of
all files in the named directory.

If you have a directory that contains a lot of files, you can use ls combined with the wildcard character
* (asterisk) to produce a partial listing of files. There are several ways to use the *. If you have files in a
series (such as ch1 to ch14), or files with common characters (like those ending in .txt), you can use *
to specify all of them at once. When given as the argument in a command, * takes the place of any
number of characters in a filename. For example, let's say you're looking for files called seq11, seq25,
and seq34 in a directory of 400 files. Instead of scrolling through the list of files by eye, you could find
them by typing:

% ls seq*

What if in that same directory you wanted to find all the text files? You know that text files usually end
with .txt, so you can search for them by typing:

% ls *.txt
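To try both wildcard listings safely, you can build a small scratch directory first. This is only a sketch: the files other than seq11, seq25, and seq34 are made-up names used to show what the wildcards leave out.

```shell
# Build a scratch directory with a few empty files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch seq11 seq25 seq34 notes.txt results.txt a.out

# Every file whose name starts with "seq".
seq_files=$(ls seq*)
# Every file whose name ends with ".txt".
txt_files=$(ls *.txt)

echo "$seq_files"
echo "$txt_files"
```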

There are also a variety of command-line options to use with ls. The most useful of these are:

-a

Lists all the files in a directory, even those preceded by a dot. Filenames beginning with a dot
(.) aren't listed by ls by default and consequently are referred to as hidden files. Hidden files
often contain configuration instructions for programs, and it's sometimes necessary to examine
or modify them.

-R

Lists subdirectories recursively. The content of the current directory is listed, and whenever a
subdirectory is reached, its contents are also explicitly included in the listing. This command
can create a catalog of files in your filesystem.

-1

Lists exactly one filename per line, a useful option. A single-column listing of all your source
datafiles can quickly be turned into a shell script that executes an identical operation on each
file, using just a few regular-expression tricks.

-F

Includes a code indicating the file type. A / following the filename indicates that the file is a
directory, * indicates that the file is executable, and @ following the filename indicates that the
file is a symbolic link.

-s

Lists the size of the file in blocks along with the filename.

-t

Sorts the listing by the time each file was last modified, most recent first.

-l

Lists files in the long format.

--color

Uses color to distinguish different file types.
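The following sketch compares a plain listing with the -a, -1, and -F options in a scratch directory. The file and directory names (data, notes.txt, run.sh, .hidden) are hypothetical.

```shell
# Scratch directory with a subdirectory, a hidden file, and an executable file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir data
touch .hidden notes.txt run.sh
chmod u+x run.sh

plain=$(ls)        # hidden files are left out
all=$(ls -a)       # includes .hidden, as well as . and ..
tagged=$(ls -1F)   # one name per line, with file-type codes appended

echo "$plain"
echo "$all"
echo "$tagged"
```

In the tagged listing, data appears as data/ and run.sh as run.sh*, while notes.txt carries no code.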

4.2.2.2 Interpreting ls output

ls gives its output in two formats, the short and the long format. The short format is the default. It
includes only the name of each file along with information requested using the -F or -s options:

#corr.pl#        commands.txt       hi.c            psimg.c
#eva.pl#         corr.pl            nsmail          res.sty
#pitch.txt#      corr.pl~           paircount.pl    res.sty~
#wish-list.txt#  correlation.pl     paircount.pl~   resume.tex
Xrootenv.0       correlation.pl~    pj-resume.dvi   seq-scratch.txt
a.out            detailed-prac.txt  pj-resume.log   sources.txt

The long format of the ls command output contains a variety of useful information about file ownership
and permissions, file sizes, and the dates and times that files were last modified:

drwxrwxr-x  4 jambeck weasel 2048 Mar  5 18:23 ./
drwxr-xr-x  5 root    root   1024 Jan 20 12:13 ../
-rw-r--r--  1 jambeck weasel  293 Jan 28 17:39 commands.txt
-rw-r--r--  1 jambeck weasel 1749 Feb 21 12:43 corr.pl
-rw-r--r--  1 jambeck weasel  559 Feb 23 14:52 correlation.pl
-rwxr-xr-x  1 jambeck weasel 3042 Jan 21 17:05 eva.pl*
drwx------  2 jambeck weasel 1024 Feb 16 14:44 nsmail/

This listing was generated with the command ls -alF. The first 10 characters in the line give
information about file permissions. The first character describes the file type. You will commonly
encounter three types of files: the ordinary file (represented by -), the directory (d), and the symbolic
link (l).

The next nine characters are actually three sets of three bits containing file permission information. The
first three characters following the file type are the file permissions for the user. The next set are for the
user's group, and the final set are for users outside the group. The character string rwxrwxrwx indicates
a file is readable (r), writable (w), and executable (x) by any user. We talk about how to change file
permissions and file ownership in Section 4.3.3.2.

The next column in the long format file listing tells you how many links a file has; that is, how many
directory listings for that file exist on the filesystem. The same file can be named in multiple
directories. In Section 4.2.3, we talk about how to create links (directory listings) for new
and existing files.

The next two columns show the ownership of the file. The owner of the files in the preceding example
is jambeck, a member of the group weasel.

The next three columns show the size of the file in bytes, and the date and time that the file was
last modified. The final column shows the name of the file.
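To see these columns on a file of your own, you can pull individual fields out of a long listing. The sketch below uses a scratch file and the standard cut and awk utilities to split the line. The second name, commands-link.txt, is a hypothetical extra directory entry created with plain ln so that the link count rises to 2.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch commands.txt
chmod u=rw,go=r commands.txt         # set the mode to rw-r--r--
ln commands.txt commands-link.txt    # a second directory entry for the same file

line=$(ls -l commands.txt)
perms=$(echo "$line" | cut -c1-10)         # file type plus nine permission characters
links=$(echo "$line" | awk '{print $2}')   # number of links
owner=$(echo "$line" | awk '{print $3}')   # owning user

echo "type and permissions: $perms"
echo "links: $links"
echo "owner: $owner"
```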

4.2.2.3 Finding files with find



Usage: find pathname list -[test] criterion

The find command is one of the most powerful, flexible, and complicated commands in the standard set
of Unix programs. find searches a path or paths for files based on various tests. There are over 20
different tests that can be used with find; here are a few of the most useful:

-print

This test is always true and sends the pathname of the current file to standard output. -print
should be the last command specified in a line, because, as it's always true, it causes every file
in the pathname being searched to be sent to the list if it comes before other tests in a sequence.

-name

This is the test most commonly applied with find and the one that is the most immediately
useful. find / -name weasel.txt -print lists to standard output the full pathnames of all files on the
filesystem named weasel.txt. The wildcard operator * can be used within the filename criterion
to find files that match a given substring, as long as the pattern is quoted so that the shell
doesn't expand it first. find / -name 'weas*' -print finds not only weasel.txt, but
weasel.c and weasel.

-user uname

This test finds all files owned by the specified user.

-group gname

This test finds all files owned by the specified group.

-ctime n

This test is true if the current file was changed n days ago. Changing a file refers to any
change, including a change in permissions, whereas modification refers only to changes to the
contents of the file. -atime and -mtime tests, which check the access and modification times
of the files, are also available.

Performing two find tests one after another amounts to applying a logical "and" between the tests. A -o
between tests indicates a logical "or." An exclamation point (!) before a test negates it, which means
find matches only those files that fail the test.
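The sketch below runs each kind of combination on a small scratch tree. The filenames are hypothetical, negation is written with !, and the escaped parentheses that group the two -o alternatives are standard find syntax not covered above. The output is piped through sort only to make the order predictable.

```shell
# Scratch tree: three files starting with "weas" and one that doesn't.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
touch "$workdir/weasel.txt" "$workdir/weasel.c" \
      "$workdir/sub/weasel" "$workdir/sub/otter.txt"

# Quote the pattern so that find, not the shell, expands the wildcard.
by_name=$(find "$workdir" -name 'weas*' -print | sort)

# Two tests in a row: regular files AND names ending in .txt.
and_test=$(find "$workdir" -type f -name '*.txt' -print | sort)

# -o between tests: names ending in .c OR names starting with otter.
or_test=$(find "$workdir" \( -name '*.c' -o -name 'otter*' \) -print | sort)

# ! before a test: regular files whose names do NOT end in .txt.
not_txt=$(find "$workdir" -type f ! -name '*.txt' -print | sort)

echo "$by_name"
echo "$and_test"
echo "$or_test"
echo "$not_txt"
```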

find can be combined with other commands to selectively archive or remove particular files from a
filesystem. Let's say you want a list of every file you have modified in your home directory and all
subdirectories in the last week:

% find ~ -type f -mtime -7 -print

Changing the type to d shows only new directories; changing the -7 to +7 shows all files modified more
than a week ago. Now let's go back to the original problem and find executable files. One way to do
this with find is to use the following command:

% find / -name progname -type f -exec ls -alF '{}' ';'

This example finds every match for progname and executes ls -alF FullPathName for every match.
Any Unix command can be used as the object of -exec. Cleanup of the /tmp directory, which is usually
done automatically by the operating system, can be done with this command:

find /tmp -type f -mtime +1 -exec rm -rf '{}' ';'

This deletes everything that hasn't been modified within the last day. As always, you need to refer to
your manual pages, or manpages, for more details (for more on manpages, see Chapter 5).
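Because -exec can run destructive commands, it's worth rehearsing on a scratch directory. In this sketch (the filenames are hypothetical), the first find only lists what would be affected and the second removes it. Note that the placeholder for the current pathname is written '{}', and the command is terminated by a quoted semicolon.

```shell
workdir=$(mktemp -d)
touch "$workdir/keep.txt" "$workdir/old.tmp" "$workdir/older.tmp"

# Step 1: preview the files that match, without changing anything.
find "$workdir" -type f -name '*.tmp' -exec ls -l '{}' ';'

# Step 2: run the same search again, this time deleting each match.
find "$workdir" -type f -name '*.tmp' -exec rm -f '{}' ';'

remaining=$(ls "$workdir")
echo "$remaining"
```

Only keep.txt is left afterward.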

4.2.2.4 Finding an executable file with which

Usage: which progname

The which command searches your current path and reports the full path of the program that executes if
you enter progname at the command prompt. This is useful if you want to know where a program is
located, if, for instance, you want to be sure you're using the right version of the program. which can't
find a program in a directory that isn't in your path.

4.2.2.5 Finding an executable file with whereis

Usage: whereis -[options] progname

The whereis command searches a standard set of directories for executables, manpages, and source
files. Unlike which, whereis isn't dependent on your path, but it looks for programs only in a limited set
of directories, so it doesn't give a definitive answer about the existence of a program.

4.2.3 Manipulating Files and Directories

Of course, just as with the stacks of papers on your desk, you periodically need to do some
housekeeping on your files and directories to keep everything neat and tidy. Unix provides commands
for moving, copying, and deleting files, as well as creating and removing directories.

4.2.3.1 Copying files and directories with cp

Usage: cp -[options] source destination

The cp command makes a copy of a source file at a destination. If the destination is a directory, the
source can be multiple files, copies of which are placed in the destination directory. Frequently used
options are -R and -r. Both copy recursively; that is, they copy the source directory and all its
subdirectories to the destination. The traditional distinction is that -R prevents cp from following
symbolic links, so only the link itself is copied, while -r allows cp to follow symbolic links and copy
all files it finds. This can cause problems if the symbolic links happen to form a circular path through
the filesystem. The GNU version of cp found on Linux treats the two options as synonyms, so check the
manpage on your system.

Normally, new files created by cp get their file ownership and permissions from your shell settings.
However, the GNU version of cp, which is the one found on Linux, provides an -a option that attempts
to maintain the original file attributes.
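The three common forms of cp can be tried out as follows. This is a sketch in a scratch directory; the project layout and filenames are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny project tree to copy from.
mkdir -p project/data
echo ">seq1" > project/data/a.fasta
echo "notes" > project/README

# One file copied to a new name.
cp project/README README.copy

# Several files copied into an existing directory.
mkdir backup
cp project/README project/data/a.fasta backup/

# A whole directory tree copied recursively.
cp -R project project-copy

ls -R project-copy
```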


4.2.3.2 Moving and renaming files and directories with mv

Usage: mv source destination

The mv command simply moves or renames source to destination. Files and directories can both be
either source or destination. If both source and destination are files or both are directories, the result of
mv is essentially that the file or directory is renamed. If the destination is a directory, and the intention
is to move already existing files or directories under that directory in the hierarchy, the directory must
exist before the mv command is given. Otherwise the destination is created as a regular file, or the
operation is treated as a renaming of a directory. One problem that can occur if mv isn't used carefully
is when source represents a file list, and destination is a preexisting single file. When this happens,
each member of source is renamed to destination and then promptly overwritten, leaving only the last
file of the list intact. At this point, it's time to look for your system administrator and hope there is a
recent backup.
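The two basic uses of mv, renaming and moving into an existing directory, look like this in practice. The sketch runs in a scratch directory with made-up names.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "draft" > notes.txt
mkdir archive

# Source and destination are both filenames: the file is renamed.
mv notes.txt notes-old.txt

# The destination is an existing directory: the file is moved into it.
mv notes-old.txt archive/

ls archive
```

Creating the destination directory before the move, as is done here with mkdir, avoids the accidental rename described above.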

4.2.3.3 Creating new links to files and directories with ln

Usage: ln -[options] source destination

The ln command establishes a link between files or directories at different locations in the directory
tree. While creating a link creates the appearance of a new file in the destination location, no data is
actually copied. Instead, what's created is a new pointer in the filesystem index that allows the source
file to be found at more than one location "on the map."

The most commonly used option, -s, creates a symbolic link (or symlink) to a file or directory, as in the
following example:

% ln -s perl5.005_03 perl

This allows you to type in just the word perl rather than remembering the entire version nomenclature
for the current version of Perl.

Another common use of the ln command is to create a link to a newly compiled binary executable file
in a directory in the system path, e.g., /usr/local/bin. Doing this allows you to run the program without
addressing it by its full pathname.
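You can watch a symbolic link at work without touching a real Perl installation. In this sketch, perl5.005_03 is just a text file standing in for the versioned program, and readlink (a utility available on Linux and most other Unix systems) reports where the link points.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A stand-in for the versioned program file.
echo "pretend this is the Perl interpreter" > perl5.005_03

# Create the symbolic link: the short name perl points at the versioned file.
ln -s perl5.005_03 perl

target=$(readlink perl)   # the name stored in the link
contents=$(cat perl)      # reading through the link reaches the original file

echo "perl -> $target"
echo "$contents"
```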

4.2.3.4 Creating and removing directories with mkdir and rmdir

Usage: mkdir -[options] dirname
Usage: rmdir -[options] dirname

New directories can be created with the mkdir command, which has only two command-line options.

mkdir -p creates a directory and any intermediate components of the path that are missing. For instance,
if user jambeck decides to create a directory mustelidae/weasels in his home directory, but the
intermediate directory mustelidae doesn't exist, mkdir -p creates the intermediate directory and its
subdirectory weasels.

mkdir -m mode creates a directory with the specified file-permission mode.

rmdir removes a directory if it's empty. With the -p option, rmdir removes all the empty directories in a
given path. If user jambeck decides to remove the directory mustelidae/weasels, and directory
mustelidae is empty except for directory weasels, rmdir -p ~/mustelidae/weasels removes both weasels
and its parent directory mustelidae.
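The mustelidae example can be run end to end as a quick check. This sketch uses a scratch directory rather than a real home directory, and the variables created and removed exist only to record what happened.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create weasels and the missing intermediate directory in one step.
mkdir -p mustelidae/weasels
if [ -d mustelidae/weasels ]; then created=yes; else created=no; fi

# Remove weasels, then its now-empty parent mustelidae.
rmdir -p mustelidae/weasels
if [ -e mustelidae ]; then removed=no; else removed=yes; fi

echo "created: $created, removed: $removed"
```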

4.2.3.5 Removing files with rm

Usage: rm -[options] files

The rm command removes files and directories. Here are its common options:

-f

Forces the removal of files without prompting. You still can't remove files you don't own, but
the write permissions on files you do own are ignored. For example, rm -f a* deletes all files
starting with the letter a, but doesn't delete any subdirectories.

-i

Prompts you with rm: remove filename? Files are removed only if you begin your answer with
a y or Y.

-r

(recursive option) Removes all directories and subdirectories in the list of files. Symbolic links
aren't traversed; only the symlink itself is removed.

-v

(verbose option) Echoes the names of all files/directories that are removed.

While rm is a fairly simple command, there are a few instances in which it can cause serious problems
for the careless user.

The command rm * removes all files in a directory. Unless you have the files set as read-only or have
the interactive flag set, you will delete everything in the directory. Of course this isn't as bad as using
the command rm -r * or rm -rf *, the last of which overrides any read-only file modes, traverses down
through your directories and deletes everything in your current directory or below.

Occasionally you will find that you create odd files in your directories. For instance, you might have a
file named -myfile where the - is part of the filename. Try deleting it, and you will get an error message
concerning the fact that rm doesn't have a -m option. The rm command interprets the -m as a
command flag, not part of the filename. The solution to this problem is trivial but not always instantly
apparent: simply provide a more complete path to the file, such as rm ./-myfile or
rm /home/jambeck/-myfile. Similar solutions are needed if you accidentally create a file with a space in the name.
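Both awkward cases can be reproduced safely in a scratch directory. In this sketch, the file with a space in its name is handled by quoting the name, which is one common approach and our own choice for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create the two awkward files.
touch ./-myfile "./my file"

# The leading ./ keeps rm from reading -myfile as a set of options.
rm ./-myfile

# Quotes make the shell pass "my file" to rm as a single name.
rm "./my file"

left=$(ls)
echo "files left: $left"
```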

4.3 Working in a Multiuser Environment


Unix systems are designed to allow multiple users to share system resources and software, yet at the
same time to allow users to selectively protect their work from each other. To work with others in a
multiuser environment, there are a number of general Unix concepts you need to understand.

4.3.1 Users and Groups

If you use a Unix system, you must be registered. You are identified by a login name and can log in
only by entering the password uniquely associated with your login name. You have control over an
area of the filesystem, which may be as large or small as system resources allow. You belong to one or
more groups and can share files with other members of a group without needing to make the files
accessible to other users. At any given time, only one of your groups is active, and new files you
create are automatically associated with the active, or primary, group. If you use group permissions to
share files with other users, and you need to change to a particular group ID, the command newgrp
allows you to change your primary group ID. The id command tells you what your user and primary
group IDs are.
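As a brief sketch, the id command can also print one piece of information at a time when given an option. The -u, -g, and -G options shown here are standard, and the numbers you see depend entirely on your own account.

```shell
uid=$(id -u)          # your numeric user ID
gid=$(id -g)          # the numeric ID of your active (primary) group
all_groups=$(id -G)   # the numeric IDs of every group you belong to

echo "user ID: $uid"
echo "primary group ID: $gid"
echo "all group IDs: $all_groups"
```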

Information about your account is stored in the /etc/passwd file, a file that provides the system with
information needed when you log in. Your username and user ID mapping are found here, along with
your default group, full name, home directory and default shell program. The shell program is
described in Chapter 5. The encrypted version of your password used to be stored here, but on most
systems, for security reasons, the actual password has been removed from the passwd file. Additional
group information is found in the /etc/group file. You can view the contents of these files with an
editor, even though they are system files you normally can't overwrite.

4.3.2 User Directories

When your system administrator creates a new user account, the process includes creating an entry in
the /etc/passwd file, possibly adding you to a number of groups in /etc/group, creating a home directory
for you somewhere on the system, and then changing the ownership of that directory so that you own it
and any files that are put into it at the time of creation. Your entry in /etc/passwd needs to match the
path to your home directory, and the user and group that own your home directory. There should also
be a set of files in your home directory that set up your work environment when you log in and are
specific to the Unix shell listed in your passwd entry. These files are discussed in more detail in
Chapter 5.

4.3.3 File Permissions and Statistics

As we discussed in the section on the ls command, each file and directory has an owner and a group
with which it's associated. Each file is created with permissions that allow or prevent you access to the
file dependent on your user ID and group. In this section we discuss how to view and change file
permissions and ownership.

4.3.3.1 Viewing file attributes with stat

Usage: stat -[options] filename

stat lets you view the complete set of attributes of a file or directory, including permissions,
modification times, and ownership. It may be more information than you need, but it's there if you want
it. For example, the command stat image1.rgb returns:

image1.rgb:
inode 11750927; dev 77; links 1; size 922112
regular; mode is rw-------; uid 12430 (jambeck); gid 280 (weasel)
projid 0 st_fstype: xfs
change time - Sun Mar 14 14:21:50 1999 <921442910>
access time - Sat Mar 13 18:11:21 1999 <921370281>
modify time - Sat Mar 13 10:28:39 1999 <921342519>

4.3.3.2 Changing file ownership and permissions with chmod

On most Unix systems, you wouldn't want every file to be readable, writable, and executable by every
user. The chmod command allows you to set the file permissions, or mode, on a list of files and
directories. The recursive option, -R, causes chmod to descend recursively through a directory tree and
change the mode of the files and directories.

For example, a long directory listing for a directory, a symlink, and a file looks like this:

drwxr-xr-x  7 jambeck weasel   2048 Feb 10 19:08 image/
lrwxr-xr-x  1 jambeck weasel     10 Mar 14 13:12 image.rgb -> image1.rgb
-rw-r--r--  1 jambeck weasel 922112 Mar 13 10:28 image1.rgb

The first character in each line indicates whether the entry is a file, directory, symlink, or one of a
number of other special file types found on Unix systems. The three listed here are by far the most
common. The remaining nine characters describe the mode of the file. The mode is divided into three
sets of three characters. The sets correspond, in the following order, to the user, the group, and other.
The user is the account that owns the directory entry, the group can be any group on the system, and
other is any user that doesn't belong to the set that includes the user and the group. Within each set, the
characters correspond to read (r), write (w), and execute (x) permissions for that person or group.

In the previous example, to change the mode of the file image1.rgb so that it's readable only by the user
and modified (writable) by no one, you can issue one of the following commands:

chmod u-w,g-r,o-r image1.rgb
chmod u=r,g=-,o=- image1.rgb
chmod u=r,go=- image1.rgb

Any one of these commands results in image1.rgb 's permissions looking like:

-r-------- 1 jambeck weasel 922112 Mar 13 10:28 image1.rgb

The first two commands should be fairly obvious. You can add or subtract user's, group's or other's
read, write or execute permissions by this mechanism. The mode parameters are:

[u,g,o]

User, group, other

[+,-,=]

Add, subtract, set

[r,w,x]

Read, write, execute

u, g, and o can be grouped or used singly. The same is true for r, w, and x. The operators +, -, and =
describe the action that is to be performed.
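To check the effect of a mode change yourself, compare the first 10 characters of a long listing before and after chmod. The sketch below uses an empty scratch file named after the example above, sets a known starting mode, and then applies the first of the three commands.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch image1.rgb

# Start from the mode shown in the listing above: rw-r--r--
chmod u=rw,go=r image1.rgb
before=$(ls -l image1.rgb | cut -c1-10)

# Remove write from the user, and read from the group and from other.
chmod u-w,g-r,o-r image1.rgb
after=$(ls -l image1.rgb | cut -c1-10)

echo "before: $before"
echo "after:  $after"
```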

4.3.3.3 Changing file and directory ownership with chown and chgrp

Usage: chown -[options] newowner filenames
Usage: chgrp -[options] newgroup filenames

The chown command lets you change the owner (or, in file-permission parlance, the user) of a file or
directory. The operation of the chown command is dependent on the version of Unix you are running.
For example, IRIX allows you to "give" the ownership to someone else, while this is impossible to do
in Linux. We will cite only examples of the chgrp command, since in Linux, you can be a member of
two groups and get this command to work for you.

chgrp lets you change the group of a file or directory. You must be a member of the group the file is
being changed to, so you have to be a member of more than one group and understand how to use the
newgrp command (introduced in Section 4.3.1). Assume for a moment that you created
image/, a directory containing files, while you were in your default group. Later, you realize that you
want to share these files with members of another group on the system. So, at first, the permissions
look like this:

drwxr-xr-x 7 jambeck weasel 2048 Feb 10 19:08 image/

Change to the other group using the command newgrp wombat, then type:

chgrp -R wombat image

to make all files in the directory accessible to the wombat group. Finally, you should change the
permissions to make the files writable by the wombat group as well. This is done with the command:

chmod -R g+w image

Your entry should now appear as follows:

drwxrwxr-x 7 jambeck wombat 2048 Feb 10 19:08 image/

4.3.4 System Administration

Most files that control the configuration of the Unix system on your computer are writable only by the
system administrator. Adding and deleting users, backing up and restoring files, installing new software
in shared directories, configuring the Unix kernel, and controlling access to various parts of the
filesystem are tasks normally handled by one specially designated user, with the username root. When
you're doing day-to-day tasks, you shouldn't be logged in as root, because root has privileges ordinary
users don't, and you can inadvertently mess up your computer system if you have those privileges. Use
the su command from your command line to assume system-administration privileges temporarily, do
only those tasks that need to be done by the system administrator, and then exit back to your normal
user status.

If you set up a Unix system for yourself, you need to become the system administrator or superuser and
learn to do the various system-administration tasks necessary to maintain your computer in a secure
and useful condition. Fortunately, there are several informative reference books on Unix system
administration available (several by O'Reilly), and an increasing number of easy-to-use graphical
system-administration tools are included in every Linux distribution.

4.3.5 Conventions for Organizing Files

Unix uses a simple set of designations for the various types of files found on the system. Normally you
can find what you need with info, find, or which, but sometimes it's necessary to search manually, and
you don't want to look in /bin for a library. These designations are used at the operating-system level,
but they are also often used in project subdirectories or software distributions to separate files:

bin

Executable files, or binaries

lib

Libraries, both runtime or shared, and those needed when compiling

spool

Directories used by the system when communicating with external devices and machines

tmp

Temporary storage

src

Source code for programs

etc

Configuration information

man

Manual pages, documentation

doc

Documentation

X

X or X11R6 refers to X programs, libraries, src, etc.; directories typically have a fairly
complete set of subdirectories

Once you have a basic understanding of how to organize and manage your files and directories, you're
well on your way to understanding how to work in a Unix environment. In Chapter 5 we complete our
lightning Unix tutorial with a discussion of many of the most commonly used Unix commands. In
order to really master the art of Unix, we strongly recommend consulting one or more of the books in
the Bibliography.

4.3.6 Locating Files in System Directories

While all your own files should be created in your home directory or in other areas specifically
designated for users to share, you need to be aware of the locations of files in other parts of the system.
One benefit of a system environment designed for multiple users is that many users can share common
resources while controlling access to their own files.

To say there is a standard Unix filesystem is somewhat of an overstatement, but, like Plato's vision of
the perfect chair, we will attempt to imagine one out in the ether. Since Linux is being developed by
thousands of programmers on different continents and has the benefit of the development of both
Berkeley and AT&T's SysV Unix, along with the POSIX standards, we will use the Linux filesystem as
a template and point out major discrepancies when necessary. The current standard for the Linux
filesystem is described at http://www.pathname.com/fhs/. Here, we present a brief skeleton of the
complete filesystem and point out a few salient features. Most directories described in this section are
configurable only by the system administrator; however, as a user, you may sometimes need to know
where system files and programs can be found. Figure 4-2 illustrates the major subdirectories, which
are further described in the following list.

Figure 4-2. Unix subdirectories

/dev

Contains all the device drivers needed to connect peripherals to the system. Drivers for SCSI,
audio, IDE drives, PPP, mice, and most other devices are found here. In general there are no
user-configurable options here.

/etc

Houses all the configuration files local to your machine. This includes items such as the system
name, Internet address, password file (unless your machine is part of some larger cluster),
filesystem information, and Unix initialization information.

/home

A common, but not standard, part of Unix. /home is usually a fairly large, separate partition that
houses all user home directories. Having /home on a separate partition has the advantage of
allowing it to be shared in a cluster environment, and it also makes it difficult for users to
completely fill an important system partition and cause it to lock up.

/lost+found

A system directory that is a repository for files and directories that have somehow been
misplaced by the system. Typically, users can't cd into this directory. Files usually end up in the
lost+found because of a system crash or a disk problem. At times it's possible that your system
administrator can recover files that appear to be lost simply by moving them from lost+found
and renaming them. There's a separate lost+found for each partition on the system.

/mnt

While not found on all systems, this is the typical place to mount any partitions not described by
the standard Unix filesystem description. Under Linux, this is where you will find a mounted
CD-ROM or floppy drive.

/nfs

Often used as the top-level directory for any mount points for partitions that are mounted from
remote machines.

/opt

A relatively new addition to the Unix filesystem. This is where optional, usually commercial,
packages are installed. On many systems you will find higher-end, optimizing compilers
installed here.

/root

The home directory for root, i.e., for the system administrator when she is logged in as root.


/sbin, /bin, and /lib

Since the machine may need to start the boot process without the /usr partition present, any
programs that are needed before the /usr partition is mounted must reside on the main or root
partition. The contents of the /sbin directory, for instance, are a subset of the /usr/sbin directory.
Labeling directories sbin indicates that only system-level commands are present and that normal
users probably won't need them, and therefore don't need to include these directories in their
path. The /lib directory is a small subset of system libraries that are needed by programs in /bin
and /sbin. Current Unix programs use shared libraries, which means that many programs can
use functions from the same library, and so the library needs to be loaded into memory only
once. What this means for practical purposes is that programs don't take as much memory as
they would if each program included all the library routines, and the programs don't actually run
if the correct library has been deleted or hasn't been mounted yet.

/tmp and /var/tmp

Typically configured to be readable/writable/executable by all users. Many standard programs,
such as vi, write temporary files to one of these directories while they are running. Normally the
system cleans out these directories automatically on a regular basis or when the machine is
rebooted. This is a good place to write temporary files, but you can't assume that the system will
wait for you to erase them.

/usr

The repository for the majority of programs, compilers, libraries, and documentation for the
Unix filesystem. The current recommendation for most Unix systems is that the system should
be able to mount /usr as a separate, read-only partition. In a workstation-cluster environment,
this means that a server can export a /usr partition, and all the workstations in that cluster will
share the programs. This makes the system administrator's job easier and provides users with a
uniform set of machines.

/usr/local

The typical directory in which to install programs and documentation so that they aren't
overwritten by the operating system. You will often find programs such as Perl and various
others that have been downloaded from the Internet installed in this location.

/var

The directory used by all system programs that write output to the disk. All system logs, spools,
and temporary data are written here. This includes logging information such as that written
during the boot process, by the mailer, by the login program, and by all other system processes.
Incoming and outgoing mail is stored in the /var/spool directory, as are files being sent to
printers. Information needed for cron, batch, and at jobs is also found here.




Chapter 5. Working on a Unix System
Unix has a wealth of functions, and you'll want to be aware of a particular subset of them before you
start running programs and collecting data. In Chapter 4, we talked about how to organize and manage
your files in Unix, as well as how to move around the filesystem. In this chapter we take you on a
whirlwind tour through the common Unix commands you'll need to know to work efficiently. We
discuss the Unix shell itself; issuing commands in Unix; viewing, editing, and extracting information
from your files; shell scripts; and working in a multiuser environment.

Once you've learned to use some of these Unix commands, you'll find that they are astonishingly
powerful and flexible, allowing you to modify files in ways that are impossible, or at least not easy,
with a conventional word-processing program. For example, with a single command you can find all
the instances of a pattern in every file under your home directory. A few simple tricks can create a
script that will process every file in your source data directory identically. Another simple script can
update a customized local copy of a database every night while you're sleeping.
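
As a small illustration of the first claim, the sketch below searches every file under a directory for a pattern using grep with the -r (recursive) option. It is written in Bourne-shell syntax and uses a throwaway directory with made-up filenames (notes.txt, run1.log) in place of your home directory. The -r option is supported by GNU grep and most current versions, but check the manpage on your own system.

```shell
# Build a small scratch directory tree standing in for a home directory.
# All file and directory names here are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/project/results"
printf 'kinase domain\nno match here\n' > "$demo/notes.txt"
printf 'protein kinase C\n' > "$demo/project/results/run1.log"
printf 'nothing relevant\n' > "$demo/project/readme.txt"

# -r descends into subdirectories; -l lists only the names of matching files.
# sort makes the order of the listing predictable.
matches=$(cd "$demo" && grep -rl 'kinase' . | sort)
printf '%s\n' "$matches"

rm -rf "$demo"
```

On your own account, the equivalent single command would be along the lines of grep -r pattern ~.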

5.1 The Unix Shell
When you log into a Unix system or open a new window in your system's window manager interface,
the system automatically starts a program called a shell for you. The shell program interprets the
commands you enter and provides you with a working environment and an interface to the operating
system. It's possible to work in Unix without the shell using graphical file manager tools, but you'll find
that many shell commands are useful for data processing and analysis. Entire books devoted to the
various shells are available, and the manpages for some of the common shells exceed 100 pages when
printed. We provide you only with a brief introduction to the commonly used shells, to get you started
with as few hurdles as possible.

5.1.1 What Flavors of Shell Are There?

The shell program you use affects the feel of your command-line interface. Some of the features that
can be built into the shell program include a simple arithmetic interpreter that lets you use the
command line as a calculator; command aliasing, which lets you refer to standard Unix commands with
other more convenient words; filename completion, which lets you type only the number of characters
necessary to distinguish a file from other files in the directory, rather than typing the full filename;
command editing and command history, which let you scroll back through the commands you've
recently issued and edit them on the command line; spelling correction; and help functions for the shell
program.

There are a number of common shell programs on Unix systems. You are automatically assigned a
shell when your system administrator sets up your account. On Linux systems, the default shell
program is the bash (Bourne Again) shell. However, you may prefer to use a shell other than bash. The
two main classes of shell programs are shells derived from the Bourne shell, sh, and shells derived
from the C shell csh. Bourne-type shells include sh, bash, ksh (the Korn shell), and zsh (the Z shell). C-
type shells include csh and tcsh.

We tend to prefer C shells, for historical reasons. When we started working in Unix, the C shell was the
best thing going, and the tcsh program has expanded the original csh into a powerful shell. tcsh
implements most of the desirable shell features, including history, command aliasing, filename
completion, command-line editing, arithmetic and functions, job control, and spelling correction. tcsh
is also one of the most user-configurable shells. Therefore, we'll discuss the behavior of Unix
commands from a C-shell perspective, as if you were using the tcsh program, which we use on our
machines.

Your default shell will be listed as the last item in your entry in the /etc/passwd file. If you aren't
certain which shell you are currently using, you can find out by typing:

% finger your-user-name

For user jambeck, this command shows the following information:

Login name: jambeck In real life: Per Jambeck
Directory: /home/jambeck Shell: /bin/tcsh

This tells us that he is using tcsh as his default shell. For practical reasons, we will limit our discussions
and most references to csh and tcsh. It must also be noted that many system processes (e.g., batch, at,
and cron) use the Bourne shell by default, which makes it necessary to learn at least a minimal subset
of its command language. On most systems there are commands to change your default shell as set in
the passwd file. The chsh (change shell) command allows you to change your default login shell, if
you're working on a Linux system.
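
Because the login shell is the last colon-separated field of a passwd entry, you can also pull it out with grep and cut. The sketch below, in Bourne-shell syntax, works on a sample file written in passwd format; the jambeck entry is reconstructed for illustration and the numeric IDs are made up. On a real system you would read /etc/passwd itself, and the lookup may come up empty where accounts are managed by a network service rather than the local file.

```shell
# A stand-in for /etc/passwd; the entries are illustrative only.
passwd_file=$(mktemp)
cat > "$passwd_file" <<'EOF'
root:x:0:0:root:/root:/bin/bash
jambeck:x:501:501:Per Jambeck:/home/jambeck:/bin/tcsh
EOF

# Select the line for one user, then keep field 7 (the login shell).
login_shell=$(grep '^jambeck:' "$passwd_file" | cut -d: -f7)
echo "$login_shell"

rm -f "$passwd_file"
```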

5.2 Issuing Commands on a Unix System
There's a standard format for sending an instruction to Unix. In this book, we'll refer to commands and
to the command line. Each of Unix's many native commands has a tangible existence as an executable
program, and to issue the command is to tell Unix to execute that program. In this section and those
that follow, we move fairly quickly through concepts and commands. While we can give you a brief
overview of the Unix features we find most useful, this book isn't designed to replace a comprehensive
Unix reference book. If you're new to Unix, we strongly recommend that you review the basics of Unix
with the help of books such as Learning the Unix Operating System, Running Linux, or Unix for the
Impatient. We've provided a list of recommended reading in the Bibliography.

5.2.1 The Command-Line Format

The command line consists of the command itself, optional arguments that modify how the command
works, and operands such as files upon which the command operates. For example, the chsh (change
shell) command, which we just discussed briefly, has several possible options. The first is the -s option,
which must be followed by the name of a shell program as its argument. The second is the -l option,
which needs no argument, and which lists the shells that are available on your system. The operand for
the chsh command is the username of the user whose shell is being changed. So, to change your default
shell program, you might first type:

% chsh -l

which gives you a list of the shell programs available on the system:

/bin/bash
/bin/sh
/bin/ash
/bin/bsh
/bin/bash2
/bin/tcsh
/bin/csh
/bin/ksh
/bin/zsh

Then, to actually change your shell to tcsh, you can type:

% chsh -s /bin/tcsh yourusername

Options can simply be single-letter codes, or they can have their own arguments. Options that take no
arguments can be given as a group, while each option that takes an argument must be specified
separately. Each option group and separate option must be preceded by a hyphen (-). The last option in
a group, or separate options, can be followed by the option argument. The operands follow the final
option in the list.
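
To make the grouping rules concrete, here is a minimal sketch using the shell's getopts builtin, written in Bourne-shell syntax. The function name parse_demo and its options are hypothetical: -l and -v take no argument, while -s requires one, loosely mirroring the chsh example.

```shell
# parse_demo is a hypothetical command: -l and -v are plain flags,
# -s takes an argument, and anything left over is an operand.
parse_demo() {
    OPTIND=1                      # reset so the function can be called repeatedly
    flags="" shell_arg=""
    while getopts "lvs:" opt; do
        case "$opt" in
            l|v) flags="$flags$opt" ;;
            s)   shell_arg="$OPTARG" ;;
            *)   return 1 ;;
        esac
    done
    shift $((OPTIND - 1))         # discard the options, leaving the operands
    echo "flags=$flags shell=$shell_arg operands=$*"
}

# Flags grouped behind one hyphen; -s given separately with its argument.
grouped=$(parse_demo -lv -s /bin/tcsh yourusername)
# The same options, each given separately.
separate=$(parse_demo -l -v -s /bin/tcsh yourusername)
echo "$grouped"
echo "$separate"
```

Both calls report the same flags, option argument, and operand, which is the point of the grouping rule: -lv is simply shorthand for -l -v.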

Many Unix commands have options that, frankly, you'll never use. And we're not going to talk about
them. But there are ways of finding out more.

5.2.2 Unix Information Commands

Unix has its own built-in reference manual, which is quite comprehensive and informative, and which
will give you the correct information about the commands and options available on the particular
system you're using.

The man command is one of the most useful Unix commands; it allows you to view Unix manual
pages. While some Unix systems have implemented a web browser-like interface to the Unix
manpages, you can't always count on this option being available. The man command is available on all
types of Unix systems.

Usage: man name

where name can be a Unix command, such as grep, or a system file, such as the password file
/etc/passwd.

If you're not sure of the command you're looking for, you can sometimes find the right information
using man's slightly smarter cousin, apropos. The apropos command locates commands by keyword
lookup.

Usage: apropos name

For instance, if you're concerned about disk usage on your system, you can enter apropos usage. The
output of this command on our PC running Red Hat Linux is:

du (1) - summarize disk usage
getrlimit, getrusage, setrlimit (2) - get/set resource limits and usage
quota (1) - display disk usage and limits
quotacheck (8) - scan a file system for disk usages



apropos doesn't always produce such brief and informative output. Entering a smart combination of
keywords is (as always with such searches) the key to getting the output you want. If you want a
predictable listing of Unix commands, it's probably best to pick up a comprehensive Unix book.

What should you do if you find the following text in a manpage?

This documentation is no longer being maintained and may be inaccurate or
incomplete. The Texinfo documentation is now the authoritative source.

The GNU[1] set of Unix tools is adopting a documentation system, called texinfo, that is different from
the traditional man system. If you come across this message, you should be able to read the up-to-date
documentation on the program by typing in the command info progname. For instance, info info gives
you a complete set of documentation on the use of info and even provides instructions for creating your
own info documentation when you start writing your own programs.

[1] GNU tools are distributed and maintained by the GNU Project at the Free Software Foundation. GNU stands for "GNU's Not Unix" and refers to a complete, Unix-like operating system that's built and maintained by the GNU Project (http://www.Gnu.org).


5.2.3 Standard Input and Output

By default, many Unix commands read from standard input and send their output to standard output.
Standard input and output are file descriptors associated with your terminal. A program reading from
standard input will simply hang out and wait for you to type something on your keyboard and press the
Enter key. A program writing to standard output spews its output to your terminal, sometimes far faster
than you can read it.

Some Unix commands read a hyphen (-) surrounded by whitespace on either side as "data from
standard input." This construct can then be used in place of a filename in the command line. Absence
of an output filename is sufficient to cause the program to write to standard output.
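
cat is one command that treats a lone hyphen this way. In the minimal sketch below, written in Bourne-shell syntax with hypothetical filenames header.txt and footer.txt, text arriving on standard input is sandwiched between two files. The | character that feeds echo's output to cat is the pipe operator, described in Section 5.2.5.

```shell
work=$(mktemp -d)
echo "BEGIN" > "$work/header.txt"
echo "END"   > "$work/footer.txt"

# The hyphen marks where standard input (here, the output of echo)
# is read, between the two named files.
combined=$(echo "data from standard input" | cat "$work/header.txt" - "$work/footer.txt")
printf '%s\n' "$combined"

rm -rf "$work"
```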

5.2.4 Redirection of Command Input and Output

The standard input and output descriptors are useful because you can redirect both standard input and
output, associating them with filenames, with no effects on the functioning of the program. Here are the
most common redirection constructs used by the C shell:

<

This redirector preceding a filename associates that filename with standard input, i.e., the
contents of the file are presented to the program as if they are standard input.

>

This redirector associates a filename with standard output: the file is created when the
command is executed, or, if a file of that name already exists, its contents are overwritten by
the output of the command.

>>



This redirector associates a filename with standard output. It differs from > in that the output of
the command is appended to the end of the existing file.
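
The difference between > and >> is easiest to see by running both against the same file. The sketch below uses a hypothetical file named log.txt in a scratch directory. It is written in Bourne-shell syntax, but these three redirectors behave the same way in the C shell.

```shell
work=$(mktemp -d)

echo "first run"  >  "$work/log.txt"   # creates log.txt
echo "second run" >> "$work/log.txt"   # appends a second line
appended=$(wc -l < "$work/log.txt" | tr -d ' ')   # < feeds the file to wc as standard input

echo "third run"  >  "$work/log.txt"   # overwrites: the earlier lines are gone
overwritten=$(cat "$work/log.txt")

echo "lines after >>: $appended"
echo "contents after >: $overwritten"

rm -rf "$work"
```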

The cat command reads the contents of a file and writes them to standard output. If you want to use the
cat command to combine the contents of three files into one new file, you can use a redirector like this:

% cat file1 file2 file3 > file4

This construct with cat would be useful if, for example, you'd just downloaded a bunch of individual
sequence files from the NCBI web site and want to collect them into one large file that can be read by
another program. (This is an example of something that seems like it should be simple, but is actually
time-consuming and annoying to do with a standard PC word-processing program. Unix provides a
neat solution that doesn't even require you to open any files.)

You can also use redirectors to direct the contents of a file into a program at run-time, as standard input
(useful if you are running a program that prompts you for input from the keyboard) or to capture output
from a program that is normally written to standard output:

program < inputfile
program > outputfile

For example, let's say you've just finished an extensive BLAST search, and you want to send the results
to your colleague. You can use the redirector < ("less than"), to scoop the file huge_blast_report out of
your directory and mail it directly to your colleague:

% mail colleague@university.edu < huge_blast_report

If you want to increase the chances of your colleague opening the message, you can add a subject
header to the mail message using the mail option -s. The command reads:

% mail -s "surprise!" colleague@university.edu < huge_blast_report

The reverse operation, sending the results of standard output (or text that's displayed on your screen) to
a file, can be accomplished using > ("greater than"). Perhaps your colleague wants to write a quick
reminder to herself to reply to your mail. She could do it using the cat command to take input from the
keyboard and redirect it to a file, like this:

% cat > reminder_to_self
Ha! Send fifteen BLAST reports to colleague on Monday.
^D
%

Ctrl-D (^D) signals that you have finished entering text. Your colleague now has a file called
reminder_to_self in her current working directory.

5.2.5 Operators

Operators are similar to redirectors in that they are ways of directing standard input and output.
However, they direct input and output to and from other commands rather than to filenames.


The most commonly used operator is the pipe (|). The pipe directs standard output of one command
into standard input for the next command. This allows you to chain together several different filtering
commands or programs without creating input or output files each time.

You can use the cat command to direct the contents of a file into a program that reads information from
standard input:

% cat inputfile | program

This command construct does the same thing as the example we showed earlier (program < inputfile).
Both present the contents of inputfile to program as standard input; here it is the output of the cat
command that does so. If you want to do a lot of runs of the same program using slightly different
input, you can create multiple input files and then write a script that cats each of those input files in
turn and pipes their contents to program.
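
Such a script might look like the following sketch, written in Bourne-shell syntax (the form most often used for scripts). The input filenames are hypothetical, and wc -w stands in for program: it simply counts the words it receives on standard input.

```shell
work=$(mktemp -d)
echo "alpha beta"        > "$work/input1.txt"
echo "alpha beta gamma"  > "$work/input2.txt"

# Run the same "program" (wc -w as a stand-in) once per input file,
# piping each file's contents to it and collecting the results.
counts=""
for f in "$work"/input*.txt; do
    n=$(cat "$f" | wc -w | tr -d ' ')
    counts="$counts$(basename "$f"):$n "
done
echo "$counts"

rm -rf "$work"
```

In practice you would replace wc -w with your own program and probably redirect each run's output to its own file.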

Pipes can carry out a complete set of file-processing operations without writing to disk. For instance,
imagine that you have a datafile consisting of multiple tables concatenated together. The first table in
the file takes up the first 67 lines, the second table takes up the next 100 lines, and the rest of the file is
taken up by a third table. You want the information that's contained in the second column of the
middle table, which stretches from characters 30-39 in the row.[2] Using filters and pipes, you can
construct the following command to crop out the data you need:

[2] This isn't an imaginary format at all. It's pretty close to the format of the output file from a calculation that we do frequently: computing the pKa values of individual amino acids in a protein.

% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data

In this example, head sends the top 167 lines of a specified file or files (in this case protein1.pka) to
standard output; tail takes the last 100 lines of the output of head; and cut takes the correct column of
characters out of the results of head and tail and then stores it in protein1.pka.data.
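
The same pipeline can be tried on a scaled-down file. In this sketch the first table has 2 lines and the middle table 3, so head -n 5 and tail -n 3 play the roles of head -167 and tail -100 (the -n form is the POSIX spelling of the same option). The file tables.txt and its contents are invented for illustration.

```shell
work=$(mktemp -d)
# Three tables concatenated: lines 1-2, lines 3-5, and line 6.
cat > "$work/tables.txt" <<'EOF'
A1  aaa
A2  bbb
B1  ccc
B2  ddd
B3  eee
C1  fff
EOF

# Keep lines 1-5, then the last 3 of those (the middle table),
# then characters 5-7 of each remaining line.
head -n 5 "$work/tables.txt" | tail -n 3 | cut -c5-7 > "$work/middle.data"
middle=$(cat "$work/middle.data")
printf '%s\n' "$middle"

rm -rf "$work"
```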

5.2.6 Wildcard Characters

A useful construct Unix shells recognize is the presence of wildcard characters in filenames. The shell
locates matches for any wildcards before passing filenames on to the program. The two most
commonly used wildcards are the asterisk (*) and the question mark (?). * means "any sequence of zero
or more characters, except for the / character." ? means "any single character." Thus, "every file in this
directory" can be denoted by a lone *, which is a useful shortcut.

The shell recognizes other wildcards as well. The construct [cset] matches any single character in the
specified set. If you want to move all files beginning with letters a through m to a new directory, you
can structure the command as mv [a-m]* ../newdir. If you want to move all files beginning with a
number to a new directory, enter mv [0-9]* ../newdir.
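
A quick way to see what a wildcard will match before using it with mv is to hand it to echo, since the shell expands the pattern before the command runs. The filenames in this sketch are hypothetical, and the commands are written in Bourne-shell syntax.

```shell
work=$(mktemp -d)
touch "$work/alpha.txt" "$work/beta.txt" "$work/zeta.txt" \
      "$work/1run.log" "$work/2run.log" "$work/run10.log"

# Each pattern is expanded by the shell inside the scratch directory.
star=$(cd "$work" && echo *.txt)          # any name ending in .txt
question=$(cd "$work" && echo ?run.log)   # exactly one character before "run.log"
letters=$(cd "$work" && echo [a-m]*)      # names starting with a letter from a through m
digits=$(cd "$work" && echo [0-9]*)       # names starting with a digit

echo "$star"
echo "$question"
echo "$letters"
echo "$digits"

rm -rf "$work"
```

Note that run10.log is matched by neither ?run.log nor [a-m]*: the ? stands for exactly one character, and r falls outside the range a through m.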

5.2.7 Running X Commands

On Unix systems running the X Window System, there are many commands available that initiate
programs with functions that aren't command line-based. Once these programs, which can include
anything from graphics viewers to complicated scientific applications, are called from the command
line, they use the X Window System to open their own windows, which generally contain a complete,
independent graphical user interface.

5.3 Viewing and Editing Files
You're probably accustomed to the idea of using a program to open a file. If your first introduction to
computers has been sometime in the last 15 years, you're probably used to simply clicking on a file
icon, which is automatically recognized by the right piece of software, which opens the file.

In Unix, commands are designed to operate on files that are sensibly readable and printable as text
whenever possible. Thus text files can be opened by a wide variety of commands that allow a great deal
of flexibility in file manipulation. The file reading and processing commands have such functions as
sorting data based on the value of a particular substring in each line of the file, cutting a particular
column out of a file, pasting columns of data together side by side, checking to see what the differences
between two files are, and searching for instances of a pattern in a file or group of files. Often, these
simple commands are all you need to extract a desired subset of the data in a file and prepare it for
analysis.

Unix has many ways to view and edit the contents of files. There are viewers for text and programs that
allow you to examine the contents of binary files, as well as full-featured editors for modifying plain-
text files.

5.3.1 Viewing and Combining Files with cat

Usage: cat -[options] files

cat dumps the contents of a file onto the screen. If your file is short, or if you've successfully completed
a speed-reading course, this utility works well. If you need to see what's on each page of a file, though,
cat is less useful, since the contents of the file scroll by without pausing.

Instead of viewing text, cat is most useful for combining (or concatenating) files. For instance, if you
have a series of files of program output named meercat1.txt, meercat2.txt, and meercat3.txt, and you
want to combine them into a single file, you can type:

% cat meercat1.txt meercat2.txt meercat3.txt > big-meercat.txt

This command writes the contents of meercat1.txt, followed by those of meercat2.txt and then
meercat3.txt, into one big file named big-meercat.txt; the original files are left unchanged. If you've
thought to number the outputs sequentially (as we have with the meercats), and want them in that
order in the file, you can just type:

% cat meercat*.txt > big-meercat.txt

and it will have the same effect. Wildcard characters such as * use a strict alphabetical order: if they
exist, files meercat10.txt and meercat11.txt come before meercat2.txt.
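
The ordering is easy to check with echo. In this sketch, written in Bourne-shell syntax, the files are empty placeholders. Zero-padding the numbers (meercat02.txt rather than meercat2.txt) is one common workaround, offered here as a suggestion rather than something the naming scheme above requires.

```shell
work=$(mktemp -d)
touch "$work/meercat1.txt" "$work/meercat2.txt" \
      "$work/meercat10.txt" "$work/meercat11.txt"

# The shell sorts matches character by character, so "10" and "11"
# come before "2".
unpadded=$(cd "$work" && echo meercat*.txt)
echo "$unpadded"

# With zero-padded numbers, the same comparison gives numeric order.
mkdir "$work/padded"
touch "$work/padded/meercat01.txt" "$work/padded/meercat02.txt" \
      "$work/padded/meercat10.txt"
padded=$(cd "$work/padded" && echo meercat*.txt)
echo "$padded"

rm -rf "$work"
```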

cat can also append files to the end of an existing file. For example, if your program generates another
output file you need to attach to the end of the collection, the command:

% cat meercat10.txt >> big-meercat.txt




does just that. If you use > instead of >> in this situation, instead of being added at the end of the file,
the new file meercat10.txt overwrites the entire contents of big-meercat.txt.

Incidentally, if you want a command that's the reverse of cat to print the lines of a file in backward
order, you're in luck: the command is called tac. Sadly, the command acta, for printing a file inside out,
hasn't yet been implemented.

5.3.2 more: A Step in the Right Direction

Usage: more -[options] [+linenumber] [+/pattern] filename

more is a pager, which in Unix means a program that lets you view a file one page at a time. Suppose
you have a file containing BLAST output named blast-first.txt. Typing:

% more blast-first.txt

shows you the first page of the file blast-first.txt, and steps forward one page every time you press the
space bar. To leave more, hit the q key; to view other more commands while within more, enter h.

more is smart about moving around files. If you know where you want to go in the file, you can specify
the line number (using the +linenumber option). If, on the other hand, you want to start at the first
occurrence of a certain word or pattern, use the +/pattern option. When viewing a file in more, if you
press the / key and then type a pattern to search for, more jumps to the next occurrence of that pattern
in the file and repeats searches for each subsequent occurrence of that pattern every time you press /
followed by the Enter key.

Here are some other useful options for more:

-r

Shows normally unprintable control characters as well as normal text

-s

Squeezes multiple empty lines into a single one

You can redirect the output of a program that generates more than a screen's worth of text to more,
allowing you to page through the output one screen at a time. Let's say you want to know who is logged
into your Unix system. If enough users are logged in, the output scrolls off the screen. By piping who to
more:

% who | more

you can scroll through the output line-by-line using the Return key or screen-by-screen using the space
bar.

more's most significant shortcoming is that some versions can't move backward through a file. less is a
utility that remedies this simple problem.


5.3.3 less: The Gold Standard

There is a superior pager command, less. Most importantly, less rectifies more's biggest flaw: it lets
you page backward as well as forward in a file. less also doesn't load a file into memory all at once,
which makes it less likely that your computer will grind to a halt if you view a huge file with it. Finally,
it also handles binary files more gracefully, displaying readable text as characters and representing
unreadable control characters in the form ^X. less uses the same options as more, but it also takes
additional options. Be sure to check info less to see which ones your local version takes. And finally,
while it hardly bears mentioning, why is it called less? Because less is more. Sigh.

5.3.4 Editing Files with vi and vim

Usage: vim filename

Because it's a text-based operating system that has historically been used for software development and
computation, Unix did not traditionally provide the kind of full-featured, "what you see is what you
get" text editing that exists on personal computers, although now such editors are available. In fact,
WYSIWYG text editors are of limited utility for programmers because they often introduce invisible
markup characters into documents.

It's worth learning to use the plain-text editors that are provided for Unix. They have a fairly steep
learning curve, but they are the right tools for the job if you're writing programs or looking at plain-text
data. If you download sequence data from a web server and open and work with it in a plain-text editor,
the file you write out should be readable by a sequence-analysis program. If you opened the same file
and worked with it in a WYSIWYG editor, then wrote it out in the file format used by that editor, it
would be unreadable by other programs.

The vi editor is a standard feature of most Unix systems. It's a full-screen editor; it allows you to see as
many lines of the file that you are editing as will fit into the terminal screen or window in which you
run it. The cursor can be moved through the file using keyed instructions, but it can't be moved with the
mouse. The bottom line on the screen is called the status line. Error messages from vi appear in the
status line.

In Section 5.6, we discuss the use of regular expressions for searching and replacement as a feature of
the plain-text editor vi. The ability to use vi with the regular-expression language makes vi a powerful
tool for file manipulation.

A few nice features have been added to vi in vim (vi improved). It's worth asking your system
administrator to install vim if it's not already on your system, if only for the multiple undo feature that it
introduces. We can't cover all the features of vim here, but we will present a few commands that will
get you up and running.[3]

[3] See the Bibliography for pointers to complete references on vi.


vim has three modes; in each, input from the keyboard is interpreted differently:

Command



This is the main mode; you are automatically in command mode when you start working.
Keystrokes are interpreted as vim's short commands, most of which consist of one or two letters.
You can always return to the command mode by hitting the Escape key once (or sometimes
twice).

Input

This mode is reached by issuing any command that requires input.

Status line

This mode is for issuing longer, more complex commands. To reach status line mode, simply
type a colon (:) in command mode. A colon appears at the left side of the status line,
and anything you type appears in the status line. When you finish typing your command and hit
the Enter key, the command is executed, and you return to command mode.

Here are some of the most useful vim command-mode commands:

h, j, k, l

Moves the cursor around in your file character-by-character or line-by-line. It's sort of like a
pre-joystick video game: "h" moves you to the left, "l" to the right, "j" moves you down a line,
and "k" up a line. On most systems, the arrow keys on your keyboard will also work to move
you around within vim.

w, b

Moves the cursor forward ("w") or back ("b") by one word in the text. Words are delimited by
whitespace.

), (

Moves the cursor forward ")" or back "(" by one sentence in the text. Sentences are recognized
as sequences of words terminated by an end-of-sentence character (. ? !).

a, A, i, I, o, O

Initiates the insertion of text. "a" and "A" insert text after the cursor and at the end of the current
line, respectively. "i" and "I" insert text before the cursor and at the beginning of the current
line. "o" and "O" open a blank line below or above the current line, respectively, and begin
inserting text on the new line.

x, X

Deletes the text under the cursor or before the cursor, respectively. Preceded by an integer
number, they delete that number of characters after or preceding the cursor.

s, S

Substitutes for the character under the cursor or for the current line, respectively, by deleting the
character either under the cursor or the line and initiating insertion of text in place of the deleted
character. Preceded by an integer number, "s" replaces that number of characters with the new
text, and "S" replaces the specified number of lines.

Here are some of the most useful vim status line mode commands:

:wq

Saves changes to the file and quits the editing session. ":w" can be used by itself or with the
name of the file to write to. ":q!" exits the session without saving changes.

:r

Followed by a filename, inserts the entire text of the named file.

:g/pattern/s//replacement/g

Searches for and replaces pattern with replacement throughout the buffer. If the trailing "g" is
left off, only the first occurrence of the pattern in any line is replaced.

:number

Moves the cursor to the specified line number.
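
The effect of the trailing "g" in the search-and-replace command is easiest to see on a small
example. The sketch below doesn't run inside vim: as a stand-in it uses sed, a separate
noninteractive Unix editor whose s command takes the same s/pattern/replacement/ form. The sample
text and the variable names are made up for illustration.

```shell
# Hypothetical sample line that contains the pattern twice.
line="ATG cat ATG dog"

# Without the trailing "g", only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/ATG/start/')

# With the trailing "g", every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/ATG/start/g')

echo "$first_only"    # start cat ATG dog
echo "$all_matches"   # start cat start dog
```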

5.3.5 The GNU Emacs Editor

vim is a fairly flexible editor, and you can certainly learn to make it do any text-editing task that you
need to do. However, there are other options for text editing on Unix systems. The best of these is
probably the Emacs editor. Emacs is an editing program made available by the Free Software
Foundation. It contains not only a text-editing facility with special modes for TeX and LaTeX
documents, programs in various programming languages, and outlines, but also a file manager, mail
and news readers, and access to the online documentation browser info. Whole books have been written
on Emacs (see the Bibliography) so we won't go into it here except to recommend that, if you're
working on a Unix system, learning to use Emacs is one of the better uses of your learning-curve time.

5.3.6 Viewing Binary Files with strings

Usage: strings -[options] filenames

In addition to the text files we've discussed up to now, there are also binary files that can't be read as
text. They are almost always the output of a program or the executable form of a program itself (as
opposed to the source code). Binary files and program executables aren't human-readable because they
are in machine language. Because of this language gap, we'll unflinchingly make the prediction that, 9
times out of 10, it isn't worth the effort needed to read binaries. You'll have more luck taking another
route, like talking to the person whose program created the file in the first place. Unfortunately, many
programs today, such as commercial hidden Markov model software or data mining programs that
directly write their internal representation of data structures to disk, use binary files to store proprietary
data structures. For that tenth time, then, we present some tips on how to extract information from
binaries without going crazy.

Your first step should be to use either the less command described earlier or the strings command. If
any portions of the file are in plain text, they will be readable in less. The strings command cuts out any
readable text characters in the file and prints them to the screen. For example, if you have an
undocumented binary file named badger and want to see if it contains any clues as to what it does, try
typing:

% strings -n 3 badger | less

(The -n option tells strings the minimum number of readable text characters in a row that constitute a
string. The default setting for -n is four.) Piping the output to less will let you page through it if it's
longer than one screen.
If the output looks like:

ATCGTACTGATCGTCGATCGTCGATCATGCA
CGTAGCAGTCGATCATCATCGTACTAGCTAG
ATGCCTGAGCTATACACACTAGTCACGATGC

you might guess that badger contains some kind of binary encoding of data including a nucleotide
sequence or a (not good) multiple sequence alignment.
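
To get a feel for what strings keeps and what it discards, you can try it on a small binary file that
you build yourself. The following is a minimal sketch: the file contents and the variable names are
invented for illustration, and it assumes that strings is installed (on Linux it usually comes with
the binutils package).

```shell
# Build a tiny binary file: unprintable bytes, a 10-letter run, more
# unprintable bytes, and a 2-letter run.
sample=$(mktemp)
printf '\001\002\003ATCGTACTGA\004\005hi\006' > "$sample"

# Keep only runs of at least 3 printable characters, as in the text.
found=$(strings -n 3 "$sample")
echo "$found"    # ATCGTACTGA (the 2-letter run "hi" is too short to print)

rm -f "$sample"
```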

5.3.7 od and Binary Data

Usage: od -[options] filenames

Sometimes, it may be necessary to do more than just identify a binary file. In these cases, the od
program may provide a first step in understanding the file's contents. Before looking at od itself, let's
take a quick detour through the ways in which binary information is represented in a moderately more
human-friendly form.

Rather than using conventional decimal (base-10) notation, binary data is usually represented using a
base that is a power of two: either octal (base-8) or hexadecimal (base-16) digits. Octal numbers are
usually preceded by a 0. For example, the decimal number 25 corresponds to octal 031.[4] Hexadecimal
digits, on the other hand, are usually preceded by a 0x and use the letters A through F to represent the
decimal numbers 10 through 15. The decimal number 25 is 0x19 in hexadecimal.

[4]
Giving rise to the old joke, "Why do programmers confuse Christmas and Halloween? Because OCT 31 is DEC 25."
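
If you want to check a conversion like this yourself, the shell's printf command can print a decimal
number in octal or hexadecimal, and shell arithmetic can read the prefixed forms back. A small
sketch (the variable names are ours, chosen for illustration):

```shell
dec=25
oct=$(printf '%o' "$dec")   # octal digits, without the leading 0
hex=$(printf '%x' "$dec")   # hexadecimal digits, without the leading 0x
echo "decimal $dec = octal 0$oct = hexadecimal 0x$hex"

# Going the other way: arithmetic expansion understands both prefixes.
from_oct=$(( 031 ))
from_hex=$(( 0x19 ))
echo "$from_oct $from_hex"   # 25 25
```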


If you want to delve into the heart of the binary file and see what's going on, you can use the od
command to perform an octal dump (or hex dump) and see if your binary file is readily interpretable.
Typing:

% od -c badger | less

creates an octal dump of badger, with printable bytes shown as characters, that you can step through a
page at a time. It should look something like this:

0000000 \0 \0 \0 001 \0 \0 \0 006 R T D C Y G \0
0000020 \0 \0 006 R T D C Y G \0 \0 \0 \a \0 \0 \0
0000040 001 1 \0 \0 \0 003 A R G \0 \0 \0 001 \0 \0 \0
0000060 002 C A \0 \0 \0 002 B Z 270 R ? 200 \0 \0 @

od's primary options are:

-c

Prints out text characters corresponding to bytes

-x

Produces a hex dump of the file

-o

Produces an octal dump (the default setting)
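
The three options are easiest to compare on a file whose contents you already know. This sketch
writes five known bytes to a temporary file and dumps them three ways; the file and variable names
are invented for illustration.

```shell
# Five bytes: the letters A, R, G, then a zero byte and a byte with value 1.
sample=$(mktemp)
printf 'ARG\000\001' > "$sample"

chars=$(od -c "$sample")   # one byte at a time, as characters or escapes
hexes=$(od -x "$sample")   # two-byte units in hexadecimal
octs=$(od -o "$sample")    # two-byte units in octal

printf '%s\n\n%s\n\n%s\n' "$chars" "$hexes" "$octs"
rm -f "$sample"
```

In each dump, the left-hand column is the offset into the file, in octal. Because -x and -o group
bytes into two-byte units, the order of the two bytes within each unit depends on the byte order of
your machine; -c avoids the issue by showing one byte at a time.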
