. 9
( 9)

for the interested reader.
Section 7.2 is an informal introduction to what are called splicing opera-
tions using examples that may appear quite arbitrary at ¬rst reading. Those
who read the optional Section 7.3 will ¬nd that the examples of Section 7.2 are
abstractions of the “cut and paste” actions of commercially available enzymes
operating on DNA molecules. Section 7.4 provides the de¬nitions and con-
structions required for a formal theory of splicing. In the remaining sections the
deepest results relating the theory of splicing systems and the class of regular
languages are treated.
Although all of the motivating biomolecular examples given here involve
double stranded DNA, splicing theory is potentially applicable to polypeptides,
RNA, DNA (whether single stranded or double stranded) and any other poly-
mers that may be viewed as strings of related units linked in a uniform manner.
(The cytoskeletal ¬laments in eukaryotic cells provide several such examples.)
Moreover, dsDNA frequently occurs, both in vivo and in vitro, in circular form.
Linear and circular dsDNA molecules interact (inter-splice) in nature as illus-
trated for ciliate genomes in [27] and [6]. Interactions between linear and cir-
cular DNA have been discussed in an abstract splicing context in [18] and [33].
A review of in vitro solutions of standard combinatorial computations using the
cut and paste operations discussed here appears in [20]. The intention of this
chapter is to stimulate the creation of additional connections between formal
language theory and the biomolecular sciences.

7.2 Constructing new words by splicing together pairs of
existing words
Given an ordered pair of words over the alphabet {a, c, g, t}, for exam-
ple u = ttttggaaccttt and v = tttggaacctttt, one can consider allowing the
7.3 The motivation from molecular biology 233

construction of a new word from these two by “cutting” each, say between the
two occurrences of the symbol a in the subsequence s = ggaacc that occurs in
each of u and v, and then building a new string by “pasting” together (concate-
nating) the left portion of the ¬rst string and the right portion of the second string.
The cutting process applied to the ordered pair u and v gives the fragments:
ttttgga, accttt and tttgga, acctttt. The pasting of the indicated fragments in
the indicated order then gives the word x = ttttggaacctttt. In this way the
ordered pair of words u, v has provided x. Note that the ordered pair v, u fol-
lowing the same cut and paste operation produces y = tttggaaccttt. We say
that we have spliced u, v producing x and we have spliced v, u producing y.
An extensive literature has developed in which the generative power of such
splicing operations on words has been investigated. Many carefully considered
control structures have been studied that guide the splicing process. Numerous
researchers have been able to demonstrate that, by applying various such control
structures, they can provide universal (Turing equivalent) computational power
based on splicing operations. We do not pursue this goal here; we stay in the
realm of regular languages. The original motivation for the introduction of
the splicing concept was the modeling of the cut and paste actions provided by
sets of restriction enzymes acting on double stranded DNA (dsDNA) molecules.
These enzymatic actions are fundamental tools of genetic engineering. Our goal
is to show that the theory of regular languages provides a formalism through
which the potential generative power of sets of restriction enzymes acting on
dsDNA can be represented. Readers who have interdisciplinary inclinations can
continue with studies of [22] and models of computation based on biochemistry
[32]. In Section 7.3 a minimal discussion of DNA splicing is given to indicate
the contact point between splicing as understood in formal language theory and
in molecular biology. A reader who does not wish for additional motivation
from the biomolecular sciences may skip Section 7.3 which is not required for
an understanding of the discussions in later sections.

7.3 The motivation from molecular biology
A single stranded DNA (ssDNA) molecule can be viewed as a linear sequence of
the four covalently bonded deoxyribonucleotides {A = adenine, C = cytosine,
G = guanine, T = thymine}. For example:


A dsDNA molecule can be viewed as a linear sequence of hydrogen bonded
From biopolymers to formal language theory

pairs where the hydrogen bonds are between the vertically displayed pairs:
It is adequate here to assume that A and T pair only with each other and C and
G pair only with each other. Due to this so-called Watson“Crick pairing rule,
when one strand of a dsDNA molecule is determined the other is also known.
If (as above) one row is


we know its companion row is


Consequently, we need to give only one of the two strands. For ef¬ciency
and convenience we will list only one row of each dsDNA molecule. To be
certain not to confuse dsDNA and ssDNA, we will use lowercase a, c, g, t to
denote the paired deoxyribonucleotides:
Thus TTTTGGAACCTTT is an ssDNA, but ttttggaaccttt is a dsDNA having
as one of its strands the ssDNA, TTTTGGAACCTTT.
There are over 200 commercially available restriction enzymes that cut
dsDNA molecules at speci¬c subsequences (sites). The example given in Sec-
tion 7.2 is, in fact, a representation of an actual enzymatic process. At an
occurrence of the site ggaacc in a dsDNA molecule the enzyme Nla IV cuts
the covalent bonds in each of the single strands that hold the middle a“a of
the site together. When this cut is made in aqueous solution the left and right
halves separate due to Brownian motion. The resulting freshly cut ends of the
fragments can be again connected with restored covalent bonds if an enzyme
called a ligase is present. Suppose now that we have a test tube which contains,
dissolved in water (or more precisely, in an appropriate buffer solution), the
dsDNA molecules u = ttttggaaccttt and v = tttggaacctttt and also Nla IV and
a ligase. Then Nla IV will cut the two molecules u and v producing the four
fragments ttttgga , accttt and tttgga , acctttt, where the symbols have been
added to denote the freshly cut ends (technically, the phosphates attached at the
5 -ends remain after the cutting and are required for future pasting). The ligase
can now paste together the fragments ttttgga and acctttt to yield the dsDNA
molecule x = ttttggaacctttt. The ligase can also paste together the fragments
7.3 The motivation from molecular biology 235

tttgga and accttt to yield the dsDNA molecule y = tttggaaccttt. The molecules
x and y are said to be recombinants of u and v. For completeness we mention
that the ligase also has the potential for reconstructing the original molecules u
and v from the four fragments. If we ignore any remaining fragments that have
freshly cut ends, then we may say that the “language” of all possible molecules
that can arise in our test tube consists of the molecular varieties u, v, x, and
y. Some signi¬cant details concerning DNA molecules have been suppressed
above. The interested reader can ¬nd these details treated in the following exer-
cise and the references.

The two ends of an ssDNA molecule exhibit distinct structures. At one end a
methyl group protrudes which may have an attached phosphate group. This end
is referred to as the 5 end since the carbon atom of the methyl group is counted
as the 5 carbon of the sugar substructure to which it belongs. At the other end
a hydroxyl is attached at the 3 carbon of the sugar substructure to which it
belongs. This end is referred to as the 3 end of the molecule. In modeling one
must either label the ends or adopt a convention that allows the labels to be
known otherwise. The ssDNA molecules 5 -ACTTGC-3 and 3 -ACTTGC-5
are not representations of the same molecule. For dsDNA one must understand
that the two strands of the molecule always have opposite 5 ’ 3 orientation.
For convenience and concision we use the convention illustrated here. When, for
example, acttgc is used to represent a dsDNA molecule it must be understood
that this molecule has as one strand 5 -ACTTGC-3 and consequently that the
dsDNA molecule when fully spelled out is:
(a) Write 3 -ACTTGC-5 with the 5 end on the left (and the 3 on the right).
(b) Write a lowercase representation for the dsDNA molecule that has 3 -
ACTTGC-5 as one of its strands. Is there a second lowercase represen-
tation? Is there a third?
(c) Which pairs of words, when regarded as models of dsDNA molecules,
denote the same molecules: acttgc, cgttca, gcaagt, tgaacg, aaattt, tttaaa.
(d) Verify that each dsDNA molecule, when denoted using the alphabet {a,c,g,t}
and the conventions established here, has either exactly two distinct repre-
sentations or only one representation. Give examples of each type. Those
having only one are said to possess dyadic symmetry. (That dsDNA
molecules may possess two distinct word representations creates only a
From biopolymers to formal language theory

slight nuisance when constructing splicing models as explained in [17] [21]
and [22].)

7.4 Splicing rules, schemes, systems, and languages
The previous sections have been written in an informal manner, possibly allow-
ing ambiguity between molecules and the words used to represent them. The
remainder of this chapter deals speci¬cally with words in a free monoid. (How-
ever, all results in the chapter have meaningful interpretations for enzymes
acting on dsDNA.)
Let be a ¬nite set to be used as an alphabet. Let — be the set of all strings
over . By a language we mean a subset of — . A splicing rule is an element
r = (u, u , v , v) of the product set
—4 — — — —
]= — — — .
The action of the rule r on a language L de¬nes the language r (L) = {xuvy
in — : L contains strings xuu q and pv vy for some x, q, p, and y in — }.
For each set, R, of splicing rules we extend the de¬nition of r (L) by de¬ning
R(L) = ∪{r (L) : r ∈ R}. A rule r respects the language L if r (L) is contained
in L and a set R of rules respects L if R(L) is contained in L. By the radius of
a splicing rule (u, u , v , v) we mean the maximum of the lengths of the strings
u, u , v , v.
De¬nition 7.1 A splicing scheme is a pair σ = ( , R), where is a ¬nite
alphabet and R is a ¬nite set of splicing rules. For each language L and
each nonnegative integer n, we de¬ne σ n (L) inductively: σ 0 (L) = L and,
for each nonnegative integer k, σ k+1 (L) = σ k (L) ∪ R(σ k (L)). We then de¬ne
σ — (L) = ∪{σ n (L) : n ≥ 0}. A splicing system is a pair (σ, I ), where σ is a splic-
ing scheme and I is a ¬nite initial language contained in — . The language
generated by (σ, I ) is L(σ, I ) = σ — (I ). A language L is a splicing language if
L = L(σ, I ) for some splicing system (σ, I ).
Example 7.1 Let = {a, c, g, t}. Let r = (u, u , v , v) where the four words
u, u , v , v in — appearing in the rule r are u = v = gga and u = v = acc.
Let R = {r }. This gives the splicing scheme
σ = ( , R) = ({a, c, g, t}, {(gga, acc, gga, acc)}).
I = {ttttggaaccttt, tttggaacctttt}.
7.4 Splicing rules, schemes, systems, and languages 237

Observe that r applied to the ordered pair

(ttttggaaccttt, tttggaacctttt)

of words in I gives the word ttttggaacctttt, and r applied to the ordered pair

(tttggaacctttt, ttttggaaccttt)

of words in I gives tttggaaccttt. The less interesting actions of r on I must
be recognized: When r acts on ordered pairs in the “diagonal” of I — I , for
example on

(ttttggaaccttt, ttttggaaccttt)

the result is merely ttttggaaccttt which appeared as each coordinate of the
pair. Here we have

σ 0 (I ) = I = {ttttggaaccttt, tttggaacctttt}


σ 1 (I ) = σ 0 (I ) ∪ R(σ 0 (I ))
= I ∪ {ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}


{ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}.

Notice that R respects σ 1 (I ) and consequently σ 2 (I ) = σ 1 (I ). Then also
σ 3 (I )) = σ 2 (I ) = σ 1 (I ) and in fact σ — (I ) = σ 1 (I ). Thus L(σ, I ) is the ¬nite

σ — (I ) = {ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}.

This example connects the formal de¬nitions of splicing systems and languages
with the less formal introductory remarks of Sections 7.1 and 7.2.
= {a, c, g, t}. Let r = (c, cccgg, c, cccgg), R = {r },
Example 7.2 Let
and let I contain only one word of length 30,

I = {aaaaaaccccggaaaaaaccccggaaaaaa}.

The rule can be applied to the ordered pair

(a 6 ccccgga 6 ccccgga 6 , a 6 ccccgga 6 ccccgga 6 )

with cuts made using the right occurrence of ccccgg in the ¬rst coordinate and
the left occurrence of ccccgg in the second coordinate. This gives the word of
From biopolymers to formal language theory

length 42:

a 6 ccccgga 6 ccccgga 6 ccccgga 6 .

The rule can be also applied to the ordered pair using the left occurrence of
ccccgg in the ¬rst coordinate and the right occurrence of ccccgg in the second
coordinate. This gives the word of length 18, a 6 ccccgga 6 . Thus

σ 1 (I ) = {a 6 ccccgga 6 , a 6 ccccgga 6 ccccgga 6, a 6 ccccgga 6 ccccgga 6 ccccgga 6 }.

Continuing with similar considerations one ¬nds that L(σ, I ) = σ — (I ) is the
in¬nite regular language

a 6 ccccgga 6 (ccccgga 6 )— .

Example 7.3 We may interpret the 30 symbol word given in Example 7.2
as a model of a dsDNA molecule as indicated in Section 7.3. The rule r of
Example 7.2 represents the cut and paste activity of the restriction enzyme
BsaJ I accompanied by a ligase. With these understandings the language

L(σ, I ) = a 6 ccccgga 6 (ccccgga 6 )—

obtained in Example 7.2 is a model of the set of all dsDNA molecules (having no
freshly cut ends) that can potentially arise in a test tube containing BsaJ I, a lig-
ase, and (suf¬ciently many) dsDNA molecules of model a 6 ccccgga 6 ccccgga 6 .
The ability to make assertions as in the preceding sentence motivated the intro-
duction of the splicing concept into formal language theory.
Example 7.4 Let = {a, b, c}. Let L be the regular language caba — b. Can
we ¬nd a splicing system that generates L? Yes, this can be done very eas-
ily by taking advantage of the fact that the symbol c occurs as the leftmost
symbol of every word in L and occurs nowhere else in any word of L. Let
r = (caba, a, cab, a) and let I = {cabb, cabab, cabaab}. Note that r allows
the generation of cabaaab as follows: From the ordered pair (cabaab, cabaab),
and the two distinct conceptual analyses caba/ab and cab/aab, the rule r gives
the new word cabaaab. Then from the ordered pair (cabaab, cabaaab), and
the analyses caba/ab and cab/aaab, the rule r gives the word cabaaaab.
(Note that r provides a form of pumping.) Continuing in this way all words
caba n b with n ≥ 3 can be obtained. Since caba n b with 0 ¤ n ¤ 2 were given
in I , we have, for R = {r } and σ = ( , R), L(σ, I ) = caba — b as desired. In
fact it has been shown [12] that for any regular language L over any alphabet
, by choosing a symbol not in , say c, the language

L = cL = {cw : w ∈ L }
7.5 Every splicing language is a regular language 239

is generated by a splicing system that can be speci¬ed very much as we have
done for the language caba — b in this example. Thus informally speaking, each
regular language is almost a splicing language.
Example 7.5 Let = {a, b}. The regular language L = (aa)— cannot be gen-
erated by a splicing system. As the reader may verify, any ¬nite set of rules that
allows every word in L to be generated will also generate strings of odd length
as well as the strings of even length.
Example 7.6 The regular language L = a — ba — ba — cannot be generated by a
¬nite set of rules either: For any nonnegative integer n,
Rn = {(», ba n b, », aba n b), (ba n ba, », ba n b, »)}
In = {aba n b, ba n b, ba n ba}
generate a — ba n ba — .
Consequently, for any ¬nite subset F of nonnegative integers, R =
∪{Rn : n ∈ F} and I = ∪{In : n ∈ F} generate the language L =
∪{a — ba n ba — : n ∈ F}. However, as the reader may verify, any ¬nite set
of rules and ¬nite initial language that generate all words in a — ba — ba — will
also generate words in which the symbol b occurs more than twice. Thus there
are regular languages that are not splicing languages.

(1) Let L be any ¬nite language over any alphabet . Specify a splicing system
that generates L. (Hint: The set R of rules can be empty.)
= {a, b}. Find three splicing systems that generate, respectively,
(2) Let
(i) L = b(aa)— ; (ii) L = — ; and (iii) L = ba — ba — .
(3) Let = {a, b, c}. Find three splicing systems that generate, respectively,
(i) L = ab— abc; (ii) L = ab— cab; (iii) a — ba — ca — ba — .

7.5 Every splicing language is a regular language
Splicing languages were introduced in published form for the ¬rst time in 1987
[17]. Fortunately K. Culik II and T. Harju quickly announced in 1989 [4], [5] that
all splicing languages are regular. A second exposition of the regularity result
was given by D. Pixton in 1996. This exposition provided, for each splicing
system, an explicit construction of a ¬nite automaton that was concisely proved,
From biopolymers to formal language theory

using an insightfully constructed inductive set, to recognize the language gen-
erated by the splicing system. In 1989, R. Gatterdam observed [11] that not
all regular languages are splicing languages. So, which regular languages are
splicing languages? We would like to have a theorem that characterizes the class
of splicing languages in terms of previously known classes of languages. As
yet we have no such characterization. In [17] it was observed that a language L
is a splicing language if there is a positive integer n such that uxq is always in
L whenever x has length n and both uxv and pxq are in L. These languages,
which were analyzed rather thoroughly in [19], constitute a highly restricted
subclass of the splicing languages, enlargements of which have been studied
extensively in [12]. The interested reader is urged to study very carefully Pix-
ton™s proof of regularity which is given in [33], [21], [32], [22] and broadly
generalized in [34].
With no crisp characterization of the class of splicing languages found,
concern turned to the search for an algorithm for deciding whether a given
regular language can be generated by a splicing system. There is, of course, an
easily described procedure that is guaranteed to discover that a regular language
L ⊆ — is a splicing language if L is a splicing language: For each positive
integer n, for each set R of rules of radius ¤ n, and for each subset I of L
consisting of strings of length ¤ n, decide whether L(σ, I ) = L, where σ =
( , R). Since both L and each such L(σ, I ) are regular, all these steps can be
carried out. The procedure terminates when a system L(σ, I ) is found, but fails
to terminate when L is not a splicing language. From this triviality, however, it
follows that an algorithm will become available immediately if, for each regular
language L, a bound, N (L), can be calculated for which it can be asserted that
L cannot be a splicing language unless there is a splicing system having rules
of radius ¤ N (L) and initial strings of length ¤ N (L). (Recall from Section 7.2
that the radius of a rule r = (u, u , v , v) is the length of the longest of the four
words u, u , v , and v.) The determination of the bound N (L) is a conceptual
victory for the concept of the syntactic monoid of a language because it allows
the concise statement of an adequate bound N (L), given in Section 7.7. It also
provides a valuable tool stated in the heading of Section 7.6.

(1) Let = {a, b}. Find a regular expression that represents L(σ, I ) where
σ = ( , R), R = {r }, r = (b, b, a, b), and I = {abba}.
(2) Let = {a, b, c}. Find a regular expression that represents L(σ, I ) where
σ = ( , R), R = {r1 , r2 }, r1 = (ab, c, cb, a), r2 = (cb, a, ab, c), and I =
{abc, cba}.
7.6 Syntactic monoids and rules respecting L 241

(3) Same as Exercise 2 with one new rule added: r3 = (cbc, », cb, c) so that
R = {r1 , r2 , r3 }.

7.6 The syntactic monoid of a regular language L allows
an effective determination of the set of all splicing
rules that respect L
First we show how to decide whether a given splicing rule r respects a given
regular language L ⊆ — : Let M be the minimal automaton recognizing L. Let
S be the set of all states of M and let F be the set of ¬nal states. For each state s
in S and each word w in — we denote by sw the state of M arrived at after w is
read from state s. Note that the rule r = (u, u , v , v) respects L if and only if,
for each ordered pair of states p, q of M, for which {x ∈ — : puu x ∈ F} and
{y ∈ — : qv vy ∈ F} are not empty, {z ∈ — : qv vz ∈ F} ⊆ {z : puvz ∈ F}.
The emptiness conditions and the inclusion are decidable since each of the four
sets is regular.
Next we show how to specify all of the rules that respect the regular lan-
guage L in a manner that requires that the procedure above be used on only
a ¬nite number of rules. Recall that the syntactic congruence relation, C, in

is de¬ned by setting u Cu if and only if, for every pair of strings x and
y ∈ — , either both xu y and xuy are in L or neither is in L. Since L is
regular, the number of C’congruence classes is a positive integer which we
denote n(L). Then there are precisely [n(L)]4 ordered quadruples of congru-
ence classes. Let (W, X, Y, Z ) be an ordered quadruple of congruence classes.
Let r = (w, x, y, z) and r = (w , x , y , z ) be two rules in W — X — Y — Z .
We verify that r respects L if and only if r respects L. By the symmetry of
the roles of r and r in the hypothesis, we need only assume that r respects
L and verify that then r must respect L. Suppose that r respects L and that
the pair uw x v, sy z t is in L. We need only show that uw z t is in L: From
w Cw we have uwx v is in L and from x C x we then have uwxv in L. From
y C y and z C z it follows that syzt is in L. Since r respects L and the pair
uwxv, syzt is in L, we have uwzt in L. From w Cw and z C z it follows that
uw z t is in L, as required. Thus, to specify all the rules that respect L, we
construct the [n(L)]4 quadruples of syntactic classes determined in — by L
and, from each such quadruple (W, X, Y, Z ), we choose one word from each
class to obtain one rule (w, x, y, z) and then decide whether it respects L. If
it does then every rule in W — X — Y — Z respects L. If it does not respect L
then no rule in W — X — Y — Z respects L. This discussion has justi¬ed the
From biopolymers to formal language theory

Let L be a regular language. The set of rules that respect L
Proposition 7.1
has the form

∪{Wi — X i — Yi — Z i : 1 ¤ i ¤ m}

where m is a nonnegative integer and each of the sets

Wi ,X i ,Yi ,Z i (1 ¤ i ¤ m)

is an element of the syntactic monoid of L.

Since each syntactic class of a regular language L is itself a regular language,
one can list all the strings of length at most k in the class. Consequently when
the representation in the proposition has been constructed, the set of all rules
of radius at most k that preserve L can be listed with no additional testing: For
each of the sets

Wi — X i — Yi — Z i (1 ¤ i ¤ m)

in the representation, list all of the rules (w, x, y, z) in

Wi — X i — Yi — Z i

of radius at most k. In order to create such a list without using the syntactic
monoid it would be necessary to list every rule of radius at most k in all of
[ — ]4 and test every such rule individually to decide if it preserves L.

(1) Let = {a} and L = (aa)— . Construct the syntactic monoid of L.
(2) Let = {a, b} and L = a — ba — ba — . Construct the syntactic monoid of L.
(3) Construct the syntactic monoid of the language in Exercise 3 of Section 7.5.

7.7 It is algorithmically decidable whether a given
regular language is a re¬‚exive splicing language
A rule set R is re¬‚exive if, for each rule (u, u , v , v) in R, the rules (u, u , u, u )
and (v , v, v , v) are also in R. When R is re¬‚exive we say the same of any
scheme or system having R as its rule set. In fact, splicing systems that model
the cut and paste action of restriction enzymes and a ligase are necessarily
re¬‚exive. Consequently, from a modeling perspective, it is the re¬‚exive splicing
systems that are of prime interest.
7.7 Deciding which languages are splicing languages 243

Section 7.6 provides the tools to construct, for each regular language L and
each positive integer k, the following ¬nite re¬‚exive set Tk of splicing rules:
Tk = {(u, u , v , v) : the radius of (u, u , v , v) ¤ k and each of the three rules
(u, u , v , v), (u, u , u, u ), and (v , v, v , v) respects L}.
Recall that Tk (L) = ∪{r (L) : r ∈ Tk }, which is regular since Tk is ¬nite and,
since L is regular, each r (L) is regular (as con¬rmed in Exercise 5 of this
section). Consequently L \ Tk (L) is also regular.
Theorem 7.1 (Pixton and Goode) A regular language L is a re¬‚exive splicing
language if and only if L\Tk (L) is ¬nite where k = 2(n(L)2 + 1) and n(L) is
the cardinal number of the syntactic monoid of L.
Let L be a regular language. In Chapter 3 the syntactic monoid of L was
de¬ned in a way that allows n(L), and therefore also k, to be computed. Sec-
tion 7.6 provides a procedure for computing the ¬nite set Tk , from which the
regular set Tk (L) can be computed and it can be decided whether L\Tk (L) is
¬nite. Thus the theorem of Pixton and Goode provides an algorithm that allows
one to decide whether any given regular language is a re¬‚exive splicing lan-
guage. It is tempting to suppose that when L\Tk (L) is ¬nite it can serve as the
set of initial words of a splicing system that generates L . Unfortunately this is
not the case as shown in Exercise 3 of this section. Although the proof of the
theorem is beyond the scope of this book, the decision procedure that it provides
can, in principle, be carried out using the machinery this book has provided.

(1) Let = {a} and L = (aa)— . Compute n(L) and k for this language.
(2) Let = {a, b} and L = a — ba — ba — . Compute n(L) and k for this language.
(3) Let = {a, b} and L = — . Note that uCv for every u, v in L and conse-
quently the syntactic monoid of L is a singleton.
(a) Compute n(L) and k.
(b) Describe Tk in words. How many elements does Tk contain?
(c) Compute Tk (L) and L\Tk (L).
(d) Conclude that, when the set L\Tk (L) is ¬nite, it does not follow that
L\Tk (L) is adequate to serve as the set I of initial words of any splicing
system (σ, I ) for which L = L(σ, I ). (This is what makes the proof of
Theorem 7.1 challenging.)
(e) Without using Theorem 7.1, specify a set R of splicing rules and a
set I of initial strings for which L = L(σ, I ) for σ = ( , R). Hint:
(», », », ») is a splicing rule.
From biopolymers to formal language theory

(4) Show that the following de¬nition of a re¬‚exive splicing language is equiv-
alent to the de¬nition given: A language L is a re¬‚exive splicing language
if L = L(σ, I ) for some splicing scheme σ = ( , R) and, for each rule
r = (u, u , v , v) in R, the rules (u, u , u, u ) and (v , v, v , v) respect L.
(This alternative de¬nition allows one to list fewer rules in the rule set
specifying a re¬‚exive splicing system.)
(5) Let L ⊆ — be a regular language. Let r = (u, u , v , v) be a splicing rule
with u, u , v , v in — . Show that r (L) is regular. Hint: Use two copies, M
and M , of the minimal automaton M that recognizes L. Let the sets of
states of these two automata be S and S . Combine M and M into a single
automaton M , having state set S ∪ S , by adding carefully chosen new
edges that allow transitions from states in S to states in S having v as label.
Choose the initial state i of M as the initial state of M and choose the set
F of ¬nal states of M as the set of ¬nal states of M .
Appendix A

Theorem A.1 If |S| ¤ |T | and |T | ¤ |S|, then there is a one-to-one corre-
spondence between S and T , i.e. |S| = |T |
Proof Assume f : S ’ T and g : T ’ S are injective functions. For each
s ∈ S, we ¬nd g ’1 (s) if it exists. We then ¬nd f ’1 g ’1 (s) if it exists. Then ¬nd
g ’1 f ’1 g ’1 (s) if it exists. We continue this process. There are three possible
results: (1) The process continues inde¬nitely. (2) The process ends because
for some si in the process, there is no g ’1 (si ). (3) The process ends because
for some ti in the process, there is no f ’1 (ti ). Let S1 be the elements of S for
which the ¬rst result occurs. Let S2 be the elements of S for which the second
result occurs. Let S3 be the elements of S for which the third result occurs.
Obviously these sets are disjoint. Similarly form T1 , T2 , and T3 as subsets of
T . f is a one-to-one correspondence from S1 to T1 . f is also a one-to-one
correspondence from S2 to T2 . g ’1 is a one-to-one correspondence from S3 to
T3 . Let θ : S ’ T be de¬ned by
θ(s) = f (s) if s ∈ S1
= f (s) if s ∈ S2
= g ’1 (s) if s ∈ S3
θ is a one to one correspondence from S to T .
Theorem A.2 For any set A, |A| < |P(A)|.
Proof Certainly |A| ¤ |P(A)| since for each element in a in A, {a} is in P(A).
Assume |A| = |P(A)|. Then there is a one-to-one correspondence between
|A| and |P(A)|. For a ∈ A let φ(a) be the element in P(A) paired with a.
Some elements in A belong to the element in P(A) with which they are
paired. For example, if a ∈ A and φ(a) = A in P(A), then a ∈ φ(a). How-
ever, if a ∈ A and φ(a) = …, the empty set in P(A), certainly a ∈ φ(a). Let


W = {a : a ∈ A a ∈ φ(a)}. W ∈ P(A), but no element in A can correspond
to W , for if φ(a) = W and a ∈ W , then by de¬nition of W , a ∈ φ(a) and
a ∈ W . However, if φ(a) = W and a ∈ W , then a ∈ φ(a) and by de¬nition of
/ / /
W , a ∈ W . Hence we have a contradiction if any element of A corresponds to
W and there is no one-to-one correspondence between |A| and |P(A)|.
This theorem shows us that, for any in¬nite set, there is another in¬nite set
with greater cardinality. We shall not prove it here but the cardinality of the real
numbers is equal to the cardinality of the power set of the set of integers.
Appendix B
Co-compactness Lemma

Lemma B.1 (Co-compactness Lemma) Let A be a ¬nite set and let
{Ri : i ∈ I } be a family of retracts in A— . There is a ¬nite subset F of I for
Ri =
which Ri .
i∈F i∈I

Proof We consider only the case for which there is a single key set K for
which, for each i ∈ I , K is a set of keys for the key code that generates Ri .
The general result then follows from the fact that there are only a ¬nite number
of subsets, hence of possible key sets of A. First we partition K into disjoint
subsets K and K . Let K = {a ∈ K : for every ¬nite subset J of I , a occurs
Ri . Let K = K ’ K .
in at least one word in
From the de¬nition of K , it follows that, for each a ∈ K , there is a ¬nite
subset F(a) of I for which Ri contains no word in which a occurs. Let
F= F(a) a in K . The symbols in K that occur in words in Ri are
a∈K i∈F
precisely the symbols in K .
De¬ne an equivalence relation ∼ in the set Ri : by Ri ∼ R j if for each
a ∈ K , the generator of Ri in which a occurs is identical with the generator of
R j in which a occurs. In the next three paragraphs we show that there are only
¬nitely many ∼ equivalence classes.
Choose an arbitrary index m ∈ I . Let C be the key code that generates Rm .
Let C be the subset of C consisting of those words with keys in K . Let L
be the length of the longest word in C . Note that, for any word w ∈ C — : (1)
the number of symbols to the left of the ¬rst occurrence of a key symbol in w
is less than or equal to L ’ 1; (2) the number of symbols occurring between
two successive occurrences of keys is less than or equal to 2L ’ 2; and (3) the
number of symbols to the right of the last occurrence of a key symbol in w is
less than or equal to L ’ 1.

Co-compactness Lemma

Next we establish that, for every j ∈ I and every k ∈ K , the generator
of R j in which k occurs has length at most 4L ’ 2. For such j and k we
have: since G = F ∪ {m, j} is ¬nite and k ∈ K , there is a word w ∈ Ri
in which k occurs. Let w = x0 a0 x1 a1 . . . xi’1 ai xi ai+1 xi+1 . . . xn’1 an xn where
the ai , 1 ¤ i ¤ n, are all the key occurrences in w. Hence, no key occurs in
any of the xi , 1 ¤ i ¤ n. Note that all of the keys occurring in w must lie in
K . The word w can be segmented into code words belonging to Rm and it can
also be segmented into code words belonging to R j . We have k = ai for some
i, 1 ¤ i ¤ n. Note that, if the length of the code word belonging to R j in which
k occurs were greater than or equal to 4L ’ 2, this would contradict one of
(1),(2), or (3) of the ¬nal sentence of the previous paragraph.
We have shown that there is a bound B(= 4L ’ 2) such that, for every j ∈ I
and k ∈ K , no code word of R j in which k occurs can have length greater than
or equal to B. From the de¬nition of the equivalence relation ∼ we see that
there are only ¬nitely many ∼ equivalence classes.
Let F be a subset of I for which, for each i ∈ I , there is a unique j ∈ F
for which Ri ∼ R j .
The statement of the lemma is true for F = F ∪ F .

[1] J. A. Anderson, Semiretracts of a free monoid, Theor. Comput. Sci. 134(1994),
[2] J. A. Anderson, “Semiretracts of a free monoid” (extended abstract) Proceedings
of the Colloquium of Words, Languages and Combinatorics II, New York, World
Scienti¬c (1994), 1“5.
[3] J. A. Anderson and T. Head, The lattice of semiretracts of a free monoid, Intern. J.
Computer Math. 43(1992), 127“31.
[4] K. Culik II and T. Harju, The regularity of splicing systems and DNA, in Proceed-
ings ICALP ™89, Lecture Notes in Comput. Sci. 372(1989), 222“33.
[5] K. Culik II and T. Harju, Splicing semigroups of dominoes and DNA, Discrete
Appl. Math. 31(1991), 261“77.
[6] A. Ehrenfeucht, T. Harju, I. Petre, D. M. Prescott and G. Rozenberg, Computation
in Living Cells “ Gene Assembly in Ciliates, Berlin, Springer-Verlag (2004).
[7] J. Engelfriet and G. Rozenberg, Equality languages and ¬xed point languages,
Information and Control 43(1979), 20“49; 31(1991), 261“77.
[8] L. Fuchs, In¬nite Abelian Groups, New York, Academic Press, I (1970), III (1973).
[9] W. Forys and T. Head, Retracts of free monoids are nowhere dense with respect to
¬nite group topologies and p-adic topologies, Semigroup Forum, 42(1991), 117“19.
[10] W. Forys and T. Head, The poset of retracts of a free monoid, Intern. J. Computer
Math. 37(1990), 45“8.
[11] R. W. Gatterdam, Splicing systems and regularity, Intern. J. Computer Math.
31(1989), 63“7.
[12] E. Goode (Laun), Constants and Splicing Systems, Dissertation Binghamton
University, Binghamton, New York (1999).
[13] T. Harju and D. Nowotka, The equation xi = yj zk in a free semigroup, Semigroup
Forum 68(2004), 488“90.
[14] J. Harrison, On almost cylindrical languages and the decidability of the D0L and
PWD0L primitivity problems, Theor. Comput. Sci. 164(1996), 29“40.
[15] T. Head, Cylindrical and eventual languages, in Mathematical Linguistics and
Related Topics (Gh. Paun, ed.), Bucharest, Romania, Editura Academiei Romane
(1995), pages 179“83.
[16] T. Head, Expanded subalphabets in the theories of languages and semigroups,
Intern. J. Computer Math. 12(1982), 113“23.


[17] T. Head, Formal language theory and DNA: an analysis of the generative capacity
of speci¬c recombinant behaviors, Bull. Math. Biology 49(1987), 737“59.
[18] T. Head, Splicing schemes and DNA, in Lindenmayer Systems “ Impacts on The-
oretical Computer Science, Computer Graphics, and Developmental Biology (G.
Rozenberg and A. Salomaa, eds.), Berlin, Springer-Verlag (1992), pages 371“83.
[19] T. Head, Splicing representations of strictly locally testable languages, Discrete
Appl. Math. 87(1998), 139“47.
[20] T. Head and S. Gal, Aqueous computing: writing on molecules dissolved in
water, in Nanotechnology: Science and Computation (J. Chen, N. Jonoska and G.
Rozenberg, eds.), Berlin, Springer-Verlag (2006), pages 321“34.
[21] T. Head, Gh. Paun and D. Pixton, Language theory and molecular genetics:
generative mechanisms suggested by DNA recombination, in Handbook of Formal
Languages (G. Rozenberg and Gh. Paun, eds.), Berlin, Springer-Verlag (1997),
Vol. II, Chapter 7.
[22] T. Head and D. Pixton, Splicing and regularity, in Formal Languages and Appli-
cations II (Z. Esik, C. Martin-Vide and V. Mitrana, eds.), Berlin, Springer-Verlag
[23] T. Head, Visualizing languages using primitive powers, in Words, Semigroups,
& Transducers (M. Ito, Gh. Paun, and S. Yu, eds.), Singapore, World Scienti¬c
(2001), pages 169“80.
[24] T. Head and B. Lando, Periodic D0L languages, Theor. Comput. Sci. 46(1986),
[25] M. Ito, M. Katsura, H. J. Shyr and S. S. Yu, Automata accepting primitive words,
Semigroup Forum 37(1988), 45“52.
[26] B. Lando, Periodicity and ultimate periodicity of D0L systems, Theor. Comput.
Sci. 82(1991), 19“33.
[27] L. F. Landweber and L. Kari, Universal molecular computation in ciliates,
in Evolution as Computation (L. F. Landweber and E. Winfree, eds.), Berlin,
Springer-Verlag (2002), pages 3“15.
[28] M. Lothaire, Combinatorics on Words, Massachusetts, Addison-Wesley (1983).
Reprinted by Cambridge University Press (1984).
[29] G. H. Mealy, A Method for Synthesizing Sequential Circuits, Bell System
Technical Journal 34(1955), 1045“70.
[30] E. F. Moore, Gedanken-Experiments on Sequential Machines, Automata Studies,
Annals of Mathematical Studies 32(1956), 129“53.
[31] R. McNaughton and S. Papert, Counter-Free Automata, Cambridge, MA, MIT
Press (1971).
[32] Gh. Paun, G. Rozenberg and A. Salomaa, DNA Computing “ New Computing
Paradigms, Berlin, Springer-Verlag (1998).
[33] D. Pixton, Regularity of splicing languages, Discrete Appl. Math. 69(1996),
[34] D. Pixton, Splicing in abstract families of languages, Theor. Comput. Sci.
234(2000), 135“66.
[35] A. Salomaa, Jewels of Formal Language Theory, Rockville, MD, Computer
Science Press (1986).
[36] H. J. Shyr, Free Monoids and Languages, Taiwan, Hon Min l991.
[37] B. Tilson, The intersection of free submonoids of a free monoid is free, Semigroup
Forum 4(1972), 345“50.
Further reading

[1] J. A. Anderson, Code properties of minimal generating sets of retracts and semire-
tracts, SEA Bull. Math. 18(1994), 7“16.
[2] J. A. Anderson, “Semiretracts and their syntactic monoids,” Semigroups: Interna-
tional Conference on Semigroups and its Related Topics, Singapore, Springer-Verlag
[3] J. Berstel and D. Perrin, Theory of Codes, New York, Academic Press (1985).
[4] E. Goode and D. Pixton, Recognizing splicing languages: syntactic monoids and
simultaneous pumping, Discrete Appl. Math. (to appear).
[5] J. Howie, Automata and Languages, New York, Oxford University Press (1991).
[6] J. E. Pin, Varieties of Formal Languages, New York, Plenum (1986).
[7] G. Thierrin, A mode of decomposition of regular languages, Semigroup Forum
10(1975), 32“8.


accepted codomain, 12
PDA, 90 combinatorics on words, 210
alphabet, 23 concatenation, 23
automata congruence, 19
pushdown, 90 conjugate words, 224
automaton, 37 context-free gramar, 127
acceptance states, 37 context-free language, 127, 149
accepted by, 38 context-free languages
accessible, 73 decidable, 208
deterministic, 41 deterministic, 211
intrinsic, 74 pumping lemma, 163
language accepted by, 37 context-sensitive grammar, 127
minimal, 73 cylindrical language, 218
pushdown, 89
sink state, 39 decidability
state diagram, 38 context-free languages, 164
syntactic monoid, 74, 75 regular languages, 86
transformation monoid, 78 deterministic automata, 41
transition function, 37, 41 deterministic PDA, 91
axiom system deterministic Turing machine, 170
G¨ del“Hilbert“Bernays, 1
o disjoint, 9
Russell“Whitehead, 1 domain, 12
Zermelo“Fraenkel“von Neumann, 1
Burali-Forti, 1 relation, 9

Cantor, Georg, 1 ¬‚ag, 223
Chomsky normal form, 138, language, 223
144 word, 223
code, 26 function, 12
bipre¬x, 26 bijection, 13
block, 26 codomain, 12
in¬x, 26 domain, 12
pre¬x code, 26 image, 12
suf¬x code, 26 injection, 13
uniquely decipherable, 26 preimage, 12


function (cont.) mapping, 12
range, 12 Mealy machine, 102
surjection, 13 minimal automaton, 73, 74
modi¬ed correspondence system, 200
grammar modi¬ed Post™s Correspondence Problem,
Chomsky normal form, 138 200
context-free, 127 monoid, 16
Greibach normal form, 138 Moore machine, 99
leftmost derivation, 140 mutually exclusive sets, 9
grammar, 114
context-sensitive, 127 nondeterministic automata, 41
corresponding tree, 123 nondeterministic Turing machine, 190
formal, 115
language generated by, 115 partial ordering
nonterminal symbols, 115 relation, 10
phrase structure, 115 partially ordered set
productions, 115 greatest lower bound, 11
regular, 128, 129 least element, 11
start symbol, 115 least upper bound, 11
terminal symbols, 115 partition, 10
Greibach normal form, 138, 148 Post™s Correspondence Problem, 200
match, 200
halting problem, 195 modi¬ed, 200
homomorphism preimage, function, 12
mortal, 30 primitive word, 210
vital, 30 pumping lemma
context-free languages, 163
ideal, 19 regular languages, 84
idempotent, 18 pushdown automaton, 89, 90
function, 12 range, function, 12
intrinsic automaton, 74 recombinants of molecules, 235
regular expression, 24
Kleene star, 23 regular grammar, 128, 129
Kleene™s Theorem, 52, 68 regular language, 25
pumping lemma, 84
language, 24 relation
almost bounded above, 220 equivalence, 9, 10
almost cylindrical, 220 retract, 29
almost eventual, 220 key, 31
almost uniformaly bounded above, key code, 31
220 key word, 31
almost uniformly eventual, 220 retraction, 29
bounded above, 219 right ideal, 19
context-free, 127, 149 Russell, Bertrand, 1
cylindrical, 218
eventual, 219 semigroup, 16
sketch equivalent, 222 semilattice, 18
sketch parameter, 221 lower semilattice, 11
uniformly eventual, 219 upper semilattice, 11
left ideal, 19 semiretract, 35
Index 255

set, 1 splicing scheme, 236
cardinality, 4 splicing system, 236
Cartesian product, 5 stack
element, 1 LIFO, 89
power set, 5 pop, 89
relation, 6 push, 89
set difference, 3 pushdown, 89
subset, 2 string, 23
symmetric difference, 3 concatenation, 23
set properties support of a language, 217
associative properties, 4, 6 syntactic monoid, 74, 75, 242
commutative properties, 4, 6
complement properties, 4, 6 transformation monoid, 78
de Morgan™s laws, 4, 5 Turing acceptable, 195
distributive laws, 3 Turing decidable, 195
distributive properties, 6 Turing machine, 170
double complement, 4, 5 con¬guration, 172
idempotent properties, 4, 5 crash, 173
identity properties, 4, 6 crashes, 171
set relation deterministic, 170
disjoint, 9 halting problem, 195
function, 12 hang, 173
mutually exclusive, 9 language accepted by, 180
sketch function, 218 nondeterministic, 190
spectral partition, 216 Turing acceptable, 195
spectrum, 214 Turing decidable, 195
splicing, 232 word accepted by, 180
spliced, 233
splicing language, 236, 239 undecidable
re¬‚exive, 242, 244 context-free languages, 208
splicing rule, 236 uniquely decipherable code,
action, 236 26
radius, 236
respects language, 236 word, 23


. 9
( 9)