ńņš. 9 |

Section 7.2 is an informal introduction to what are called splicing opera-

tions using examples that may appear quite arbitrary at ļ¬rst reading. Those

who read the optional Section 7.3 will ļ¬nd that the examples of Section 7.2 are

abstractions of the ācut and pasteā actions of commercially available enzymes

operating on DNA molecules. Section 7.4 provides the deļ¬nitions and con-

structions required for a formal theory of splicing. In the remaining sections the

deepest results relating the theory of splicing systems and the class of regular

languages are treated.

Although all of the motivating biomolecular examples given here involve

double stranded DNA, splicing theory is potentially applicable to polypeptides,

RNA, DNA (whether single stranded or double stranded) and any other poly-

mers that may be viewed as strings of related units linked in a uniform manner.

(The cytoskeletal ļ¬laments in eukaryotic cells provide several such examples.)

Moreover, dsDNA frequently occurs, both in vivo and in vitro, in circular form.

Linear and circular dsDNA molecules interact (inter-splice) in nature as illus-

trated for ciliate genomes in [27] and [6]. Interactions between linear and cir-

cular DNA have been discussed in an abstract splicing context in [18] and [33].

A review of in vitro solutions of standard combinatorial computations using the

cut and paste operations discussed here appears in [20]. The intention of this

chapter is to stimulate the creation of additional connections between formal

language theory and the biomolecular sciences.

7.2 Constructing new words by splicing together pairs of

existing words

Given an ordered pair of words over the alphabet {a, c, g, t}, for exam-

ple u = ttttggaaccttt and v = tttggaacctttt, one can consider allowing the

7.3 The motivation from molecular biology 233

construction of a new word from these two by ācuttingā each, say between the

two occurrences of the symbol a in the subsequence s = ggaacc that occurs in

each of u and v, and then building a new string by āpastingā together (concate-

nating) the left portion of the ļ¬rst string and the right portion of the second string.

The cutting process applied to the ordered pair u and v gives the fragments:

ttttgga, accttt and tttgga, acctttt. The pasting of the indicated fragments in

the indicated order then gives the word x = ttttggaacctttt. In this way the

ordered pair of words u, v has provided x. Note that the ordered pair v, u fol-

lowing the same cut and paste operation produces y = tttggaaccttt. We say

that we have spliced u, v producing x and we have spliced v, u producing y.

An extensive literature has developed in which the generative power of such

splicing operations on words has been investigated. Many carefully considered

control structures have been studied that guide the splicing process. Numerous

researchers have been able to demonstrate that, by applying various such control

structures, they can provide universal (Turing equivalent) computational power

based on splicing operations. We do not pursue this goal here; we stay in the

realm of regular languages. The original motivation for the introduction of

the splicing concept was the modeling of the cut and paste actions provided by

sets of restriction enzymes acting on double stranded DNA (dsDNA) molecules.

These enzymatic actions are fundamental tools of genetic engineering. Our goal

is to show that the theory of regular languages provides a formalism through

which the potential generative power of sets of restriction enzymes acting on

dsDNA can be represented. Readers who have interdisciplinary inclinations can

continue with studies of [22] and models of computation based on biochemistry

[32]. In Section 7.3 a minimal discussion of DNA splicing is given to indicate

the contact point between splicing as understood in formal language theory and

in molecular biology. A reader who does not wish for additional motivation

from the biomolecular sciences may skip Section 7.3 which is not required for

an understanding of the discussions in later sections.

7.3 The motivation from molecular biology

A single stranded DNA (ssDNA) molecule can be viewed as a linear sequence of

the four covalently bonded deoxyribonucleotides {A = adenine, C = cytosine,

G = guanine, T = thymine}. For example:

TTTTGGAACCTTT.

A dsDNA molecule can be viewed as a linear sequence of hydrogen bonded

From biopolymers to formal language theory

234

pairs where the hydrogen bonds are between the vertically displayed pairs:

TTTTGGAACCTTT

AAAACCTTGGAAA.

It is adequate here to assume that A and T pair only with each other and C and

G pair only with each other. Due to this so-called Watsonā“Crick pairing rule,

when one strand of a dsDNA molecule is determined the other is also known.

If (as above) one row is

TTTTGGAACCTTT

we know its companion row is

AAAACCTTGGAAA.

Consequently, we need to give only one of the two strands. For efļ¬ciency

and convenience we will list only one row of each dsDNA molecule. To be

certain not to confuse dsDNA and ssDNA, we will use lowercase a, c, g, t to

denote the paired deoxyribonucleotides:

A C G T

T G C A

respectively.

Thus TTTTGGAACCTTT is an ssDNA, but ttttggaaccttt is a dsDNA having

as one of its strands the ssDNA, TTTTGGAACCTTT.

There are over 200 commercially available restriction enzymes that cut

dsDNA molecules at speciļ¬c subsequences (sites). The example given in Sec-

tion 7.2 is, in fact, a representation of an actual enzymatic process. At an

occurrence of the site ggaacc in a dsDNA molecule the enzyme Nla IV cuts

the covalent bonds in each of the single strands that hold the middle aā“a of

the site together. When this cut is made in aqueous solution the left and right

halves separate due to Brownian motion. The resulting freshly cut ends of the

fragments can be again connected with restored covalent bonds if an enzyme

called a ligase is present. Suppose now that we have a test tube which contains,

dissolved in water (or more precisely, in an appropriate buffer solution), the

dsDNA molecules u = ttttggaaccttt and v = tttggaacctttt and also Nla IV and

a ligase. Then Nla IV will cut the two molecules u and v producing the four

fragments ttttgga , accttt and tttgga , acctttt, where the symbols have been

added to denote the freshly cut ends (technically, the phosphates attached at the

5 -ends remain after the cutting and are required for future pasting). The ligase

can now paste together the fragments ttttgga and acctttt to yield the dsDNA

molecule x = ttttggaacctttt. The ligase can also paste together the fragments

7.3 The motivation from molecular biology 235

tttgga and accttt to yield the dsDNA molecule y = tttggaaccttt. The molecules

x and y are said to be recombinants of u and v. For completeness we mention

that the ligase also has the potential for reconstructing the original molecules u

and v from the four fragments. If we ignore any remaining fragments that have

freshly cut ends, then we may say that the ālanguageā of all possible molecules

that can arise in our test tube consists of the molecular varieties u, v, x, and

y. Some signiļ¬cant details concerning DNA molecules have been suppressed

above. The interested reader can ļ¬nd these details treated in the following exer-

cise and the references.

Exercise

The two ends of an ssDNA molecule exhibit distinct structures. At one end a

methyl group protrudes which may have an attached phosphate group. This end

is referred to as the 5 end since the carbon atom of the methyl group is counted

as the 5 carbon of the sugar substructure to which it belongs. At the other end

a hydroxyl is attached at the 3 carbon of the sugar substructure to which it

belongs. This end is referred to as the 3 end of the molecule. In modeling one

must either label the ends or adopt a convention that allows the labels to be

known otherwise. The ssDNA molecules 5 -ACTTGC-3 and 3 -ACTTGC-5

are not representations of the same molecule. For dsDNA one must understand

that the two strands of the molecule always have opposite 5 ā’ 3 orientation.

For convenience and concision we use the convention illustrated here. When, for

example, acttgc is used to represent a dsDNA molecule it must be understood

that this molecule has as one strand 5 -ACTTGC-3 and consequently that the

dsDNA molecule when fully spelled out is:

5 -ACTTGC-3

3 -TGAACG-5

(a) Write 3 -ACTTGC-5 with the 5 end on the left (and the 3 on the right).

(b) Write a lowercase representation for the dsDNA molecule that has 3 -

ACTTGC-5 as one of its strands. Is there a second lowercase represen-

tation? Is there a third?

(c) Which pairs of words, when regarded as models of dsDNA molecules,

denote the same molecules: acttgc, cgttca, gcaagt, tgaacg, aaattt, tttaaa.

(d) Verify that each dsDNA molecule, when denoted using the alphabet {a,c,g,t}

and the conventions established here, has either exactly two distinct repre-

sentations or only one representation. Give examples of each type. Those

having only one are said to possess dyadic symmetry. (That dsDNA

molecules may possess two distinct word representations creates only a

From biopolymers to formal language theory

236

slight nuisance when constructing splicing models as explained in [17] [21]

and [22].)

7.4 Splicing rules, schemes, systems, and languages

The previous sections have been written in an informal manner, possibly allow-

ing ambiguity between molecules and the words used to represent them. The

remainder of this chapter deals speciļ¬cally with words in a free monoid. (How-

ever, all results in the chapter have meaningful interpretations for enzymes

acting on dsDNA.)

Let be a ļ¬nite set to be used as an alphabet. Let ā— be the set of all strings

over . By a language we mean a subset of ā— . A splicing rule is an element

r = (u, u , v , v) of the product set

ā—4 ā— ā— ā— ā—

]= Ć— Ć— Ć— .

[

The action of the rule r on a language L deļ¬nes the language r (L) = {xuvy

in ā— : L contains strings xuu q and pv vy for some x, q, p, and y in ā— }.

For each set, R, of splicing rules we extend the deļ¬nition of r (L) by deļ¬ning

R(L) = āŖ{r (L) : r ā R}. A rule r respects the language L if r (L) is contained

in L and a set R of rules respects L if R(L) is contained in L. By the radius of

a splicing rule (u, u , v , v) we mean the maximum of the lengths of the strings

u, u , v , v.

Deļ¬nition 7.1 A splicing scheme is a pair Ļ = ( , R), where is a ļ¬nite

alphabet and R is a ļ¬nite set of splicing rules. For each language L and

each nonnegative integer n, we deļ¬ne Ļ n (L) inductively: Ļ 0 (L) = L and,

for each nonnegative integer k, Ļ k+1 (L) = Ļ k (L) āŖ R(Ļ k (L)). We then deļ¬ne

Ļ ā— (L) = āŖ{Ļ n (L) : n ā„ 0}. A splicing system is a pair (Ļ, I ), where Ļ is a splic-

ing scheme and I is a ļ¬nite initial language contained in ā— . The language

generated by (Ļ, I ) is L(Ļ, I ) = Ļ ā— (I ). A language L is a splicing language if

L = L(Ļ, I ) for some splicing system (Ļ, I ).

Example 7.1 Let = {a, c, g, t}. Let r = (u, u , v , v) where the four words

u, u , v , v in ā— appearing in the rule r are u = v = gga and u = v = acc.

Let R = {r }. This gives the splicing scheme

Ļ = ( , R) = ({a, c, g, t}, {(gga, acc, gga, acc)}).

Let

I = {ttttggaaccttt, tttggaacctttt}.

7.4 Splicing rules, schemes, systems, and languages 237

Observe that r applied to the ordered pair

(ttttggaaccttt, tttggaacctttt)

of words in I gives the word ttttggaacctttt, and r applied to the ordered pair

(tttggaacctttt, ttttggaaccttt)

of words in I gives tttggaaccttt. The less interesting actions of r on I must

be recognized: When r acts on ordered pairs in the ādiagonalā of I Ć— I , for

example on

(ttttggaaccttt, ttttggaaccttt)

the result is merely ttttggaaccttt which appeared as each coordinate of the

pair. Here we have

Ļ 0 (I ) = I = {ttttggaaccttt, tttggaacctttt}

and

Ļ 1 (I ) = Ļ 0 (I ) āŖ R(Ļ 0 (I ))

= I āŖ {ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}

equals

{ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}.

Notice that R respects Ļ 1 (I ) and consequently Ļ 2 (I ) = Ļ 1 (I ). Then also

Ļ 3 (I )) = Ļ 2 (I ) = Ļ 1 (I ) and in fact Ļ ā— (I ) = Ļ 1 (I ). Thus L(Ļ, I ) is the ļ¬nite

language

Ļ ā— (I ) = {ttttggaacctttt, tttggaaccttt, ttttggaaccttt, tttggaacctttt}.

This example connects the formal deļ¬nitions of splicing systems and languages

with the less formal introductory remarks of Sections 7.1 and 7.2.

= {a, c, g, t}. Let r = (c, cccgg, c, cccgg), R = {r },

Example 7.2 Let

and let I contain only one word of length 30,

I = {aaaaaaccccggaaaaaaccccggaaaaaa}.

The rule can be applied to the ordered pair

(a 6 ccccgga 6 ccccgga 6 , a 6 ccccgga 6 ccccgga 6 )

with cuts made using the right occurrence of ccccgg in the ļ¬rst coordinate and

the left occurrence of ccccgg in the second coordinate. This gives the word of

From biopolymers to formal language theory

238

length 42:

a 6 ccccgga 6 ccccgga 6 ccccgga 6 .

The rule can be also applied to the ordered pair using the left occurrence of

ccccgg in the ļ¬rst coordinate and the right occurrence of ccccgg in the second

coordinate. This gives the word of length 18, a 6 ccccgga 6 . Thus

Ļ 1 (I ) = {a 6 ccccgga 6 , a 6 ccccgga 6 ccccgga 6, a 6 ccccgga 6 ccccgga 6 ccccgga 6 }.

Continuing with similar considerations one ļ¬nds that L(Ļ, I ) = Ļ ā— (I ) is the

inļ¬nite regular language

a 6 ccccgga 6 (ccccgga 6 )ā— .

Example 7.3 We may interpret the 30 symbol word given in Example 7.2

as a model of a dsDNA molecule as indicated in Section 7.3. The rule r of

Example 7.2 represents the cut and paste activity of the restriction enzyme

BsaJ I accompanied by a ligase. With these understandings the language

L(Ļ, I ) = a 6 ccccgga 6 (ccccgga 6 )ā—

obtained in Example 7.2 is a model of the set of all dsDNA molecules (having no

freshly cut ends) that can potentially arise in a test tube containing BsaJ I, a lig-

ase, and (sufļ¬ciently many) dsDNA molecules of model a 6 ccccgga 6 ccccgga 6 .

The ability to make assertions as in the preceding sentence motivated the intro-

duction of the splicing concept into formal language theory.

Example 7.4 Let = {a, b, c}. Let L be the regular language caba ā— b. Can

we ļ¬nd a splicing system that generates L? Yes, this can be done very eas-

ily by taking advantage of the fact that the symbol c occurs as the leftmost

symbol of every word in L and occurs nowhere else in any word of L. Let

r = (caba, a, cab, a) and let I = {cabb, cabab, cabaab}. Note that r allows

the generation of cabaaab as follows: From the ordered pair (cabaab, cabaab),

and the two distinct conceptual analyses caba/ab and cab/aab, the rule r gives

the new word cabaaab. Then from the ordered pair (cabaab, cabaaab), and

the analyses caba/ab and cab/aaab, the rule r gives the word cabaaaab.

(Note that r provides a form of pumping.) Continuing in this way all words

caba n b with n ā„ 3 can be obtained. Since caba n b with 0 ā¤ n ā¤ 2 were given

in I , we have, for R = {r } and Ļ = ( , R), L(Ļ, I ) = caba ā— b as desired. In

fact it has been shown [12] that for any regular language L over any alphabet

, by choosing a symbol not in , say c, the language

L = cL = {cw : w ā L }

7.5 Every splicing language is a regular language 239

is generated by a splicing system that can be speciļ¬ed very much as we have

done for the language caba ā— b in this example. Thus informally speaking, each

regular language is almost a splicing language.

Example 7.5 Let = {a, b}. The regular language L = (aa)ā— cannot be gen-

erated by a splicing system. As the reader may verify, any ļ¬nite set of rules that

allows every word in L to be generated will also generate strings of odd length

as well as the strings of even length.

Example 7.6 The regular language L = a ā— ba ā— ba ā— cannot be generated by a

ļ¬nite set of rules either: For any nonnegative integer n,

Rn = {(Ī», ba n b, Ī», aba n b), (ba n ba, Ī», ba n b, Ī»)}

and

In = {aba n b, ba n b, ba n ba}

generate a ā— ba n ba ā— .

Consequently, for any ļ¬nite subset F of nonnegative integers, R =

āŖ{Rn : n ā F} and I = āŖ{In : n ā F} generate the language L =

āŖ{a ā— ba n ba ā— : n ā F}. However, as the reader may verify, any ļ¬nite set

of rules and ļ¬nite initial language that generate all words in a ā— ba ā— ba ā— will

also generate words in which the symbol b occurs more than twice. Thus there

are regular languages that are not splicing languages.

Exercises

(1) Let L be any ļ¬nite language over any alphabet . Specify a splicing system

that generates L. (Hint: The set R of rules can be empty.)

= {a, b}. Find three splicing systems that generate, respectively,

(2) Let

(i) L = b(aa)ā— ; (ii) L = ā— ; and (iii) L = ba ā— ba ā— .

(3) Let = {a, b, c}. Find three splicing systems that generate, respectively,

(i) L = abā— abc; (ii) L = abā— cab; (iii) a ā— ba ā— ca ā— ba ā— .

7.5 Every splicing language is a regular language

Splicing languages were introduced in published form for the ļ¬rst time in 1987

[17]. Fortunately K. Culik II and T. Harju quickly announced in 1989 [4], [5] that

all splicing languages are regular. A second exposition of the regularity result

was given by D. Pixton in 1996. This exposition provided, for each splicing

system, an explicit construction of a ļ¬nite automaton that was concisely proved,

From biopolymers to formal language theory

240

using an insightfully constructed inductive set, to recognize the language gen-

erated by the splicing system. In 1989, R. Gatterdam observed [11] that not

all regular languages are splicing languages. So, which regular languages are

splicing languages? We would like to have a theorem that characterizes the class

of splicing languages in terms of previously known classes of languages. As

yet we have no such characterization. In [17] it was observed that a language L

is a splicing language if there is a positive integer n such that uxq is always in

L whenever x has length n and both uxv and pxq are in L. These languages,

which were analyzed rather thoroughly in [19], constitute a highly restricted

subclass of the splicing languages, enlargements of which have been studied

extensively in [12]. The interested reader is urged to study very carefully Pix-

tonā™s proof of regularity which is given in [33], [21], [32], [22] and broadly

generalized in [34].

With no crisp characterization of the class of splicing languages found,

concern turned to the search for an algorithm for deciding whether a given

regular language can be generated by a splicing system. There is, of course, an

easily described procedure that is guaranteed to discover that a regular language

L ā ā— is a splicing language if L is a splicing language: For each positive

integer n, for each set R of rules of radius ā¤ n, and for each subset I of L

consisting of strings of length ā¤ n, decide whether L(Ļ, I ) = L, where Ļ =

( , R). Since both L and each such L(Ļ, I ) are regular, all these steps can be

carried out. The procedure terminates when a system L(Ļ, I ) is found, but fails

to terminate when L is not a splicing language. From this triviality, however, it

follows that an algorithm will become available immediately if, for each regular

language L, a bound, N (L), can be calculated for which it can be asserted that

L cannot be a splicing language unless there is a splicing system having rules

of radius ā¤ N (L) and initial strings of length ā¤ N (L). (Recall from Section 7.2

that the radius of a rule r = (u, u , v , v) is the length of the longest of the four

words u, u , v , and v.) The determination of the bound N (L) is a conceptual

victory for the concept of the syntactic monoid of a language because it allows

the concise statement of an adequate bound N (L), given in Section 7.7. It also

provides a valuable tool stated in the heading of Section 7.6.

Exercises

(1) Let = {a, b}. Find a regular expression that represents L(Ļ, I ) where

Ļ = ( , R), R = {r }, r = (b, b, a, b), and I = {abba}.

(2) Let = {a, b, c}. Find a regular expression that represents L(Ļ, I ) where

Ļ = ( , R), R = {r1 , r2 }, r1 = (ab, c, cb, a), r2 = (cb, a, ab, c), and I =

{abc, cba}.

7.6 Syntactic monoids and rules respecting L 241

(3) Same as Exercise 2 with one new rule added: r3 = (cbc, Ī», cb, c) so that

R = {r1 , r2 , r3 }.

7.6 The syntactic monoid of a regular language L allows

an effective determination of the set of all splicing

rules that respect L

First we show how to decide whether a given splicing rule r respects a given

regular language L ā ā— : Let M be the minimal automaton recognizing L. Let

S be the set of all states of M and let F be the set of ļ¬nal states. For each state s

in S and each word w in ā— we denote by sw the state of M arrived at after w is

read from state s. Note that the rule r = (u, u , v , v) respects L if and only if,

for each ordered pair of states p, q of M, for which {x ā ā— : puu x ā F} and

{y ā ā— : qv vy ā F} are not empty, {z ā ā— : qv vz ā F} ā {z : puvz ā F}.

The emptiness conditions and the inclusion are decidable since each of the four

sets is regular.

Next we show how to specify all of the rules that respect the regular lan-

guage L in a manner that requires that the procedure above be used on only

a ļ¬nite number of rules. Recall that the syntactic congruence relation, C, in

ā—

is deļ¬ned by setting u Cu if and only if, for every pair of strings x and

y ā ā— , either both xu y and xuy are in L or neither is in L. Since L is

regular, the number of Cā’congruence classes is a positive integer which we

denote n(L). Then there are precisely [n(L)]4 ordered quadruples of congru-

ence classes. Let (W, X, Y, Z ) be an ordered quadruple of congruence classes.

Let r = (w, x, y, z) and r = (w , x , y , z ) be two rules in W Ć— X Ć— Y Ć— Z .

We verify that r respects L if and only if r respects L. By the symmetry of

the roles of r and r in the hypothesis, we need only assume that r respects

L and verify that then r must respect L. Suppose that r respects L and that

the pair uw x v, sy z t is in L. We need only show that uw z t is in L: From

w Cw we have uwx v is in L and from x C x we then have uwxv in L. From

y C y and z C z it follows that syzt is in L. Since r respects L and the pair

uwxv, syzt is in L, we have uwzt in L. From w Cw and z C z it follows that

uw z t is in L, as required. Thus, to specify all the rules that respect L, we

construct the [n(L)]4 quadruples of syntactic classes determined in ā— by L

and, from each such quadruple (W, X, Y, Z ), we choose one word from each

class to obtain one rule (w, x, y, z) and then decide whether it respects L. If

it does then every rule in W Ć— X Ć— Y Ć— Z respects L. If it does not respect L

then no rule in W Ć— X Ć— Y Ć— Z respects L. This discussion has justiļ¬ed the

following:

From biopolymers to formal language theory

242

Let L be a regular language. The set of rules that respect L

Proposition 7.1

has the form

āŖ{Wi Ć— X i Ć— Yi Ć— Z i : 1 ā¤ i ā¤ m}

where m is a nonnegative integer and each of the sets

Wi ,X i ,Yi ,Z i (1 ā¤ i ā¤ m)

is an element of the syntactic monoid of L.

Since each syntactic class of a regular language L is itself a regular language,

one can list all the strings of length at most k in the class. Consequently when

the representation in the proposition has been constructed, the set of all rules

of radius at most k that preserve L can be listed with no additional testing: For

each of the sets

Wi Ć— X i Ć— Yi Ć— Z i (1 ā¤ i ā¤ m)

in the representation, list all of the rules (w, x, y, z) in

Wi Ć— X i Ć— Yi Ć— Z i

of radius at most k. In order to create such a list without using the syntactic

monoid it would be necessary to list every rule of radius at most k in all of

[ ā— ]4 and test every such rule individually to decide if it preserves L.

Exercises

(1) Let = {a} and L = (aa)ā— . Construct the syntactic monoid of L.

(2) Let = {a, b} and L = a ā— ba ā— ba ā— . Construct the syntactic monoid of L.

(3) Construct the syntactic monoid of the language in Exercise 3 of Section 7.5.

7.7 It is algorithmically decidable whether a given

regular language is a reļ¬‚exive splicing language

A rule set R is reļ¬‚exive if, for each rule (u, u , v , v) in R, the rules (u, u , u, u )

and (v , v, v , v) are also in R. When R is reļ¬‚exive we say the same of any

scheme or system having R as its rule set. In fact, splicing systems that model

the cut and paste action of restriction enzymes and a ligase are necessarily

reļ¬‚exive. Consequently, from a modeling perspective, it is the reļ¬‚exive splicing

systems that are of prime interest.

7.7 Deciding which languages are splicing languages 243

Section 7.6 provides the tools to construct, for each regular language L and

each positive integer k, the following ļ¬nite reļ¬‚exive set Tk of splicing rules:

Tk = {(u, u , v , v) : the radius of (u, u , v , v) ā¤ k and each of the three rules

(u, u , v , v), (u, u , u, u ), and (v , v, v , v) respects L}.

Recall that Tk (L) = āŖ{r (L) : r ā Tk }, which is regular since Tk is ļ¬nite and,

since L is regular, each r (L) is regular (as conļ¬rmed in Exercise 5 of this

section). Consequently L \ Tk (L) is also regular.

Theorem 7.1 (Pixton and Goode) A regular language L is a reļ¬‚exive splicing

language if and only if L\Tk (L) is ļ¬nite where k = 2(n(L)2 + 1) and n(L) is

the cardinal number of the syntactic monoid of L.

Let L be a regular language. In Chapter 3 the syntactic monoid of L was

deļ¬ned in a way that allows n(L), and therefore also k, to be computed. Sec-

tion 7.6 provides a procedure for computing the ļ¬nite set Tk , from which the

regular set Tk (L) can be computed and it can be decided whether L\Tk (L) is

ļ¬nite. Thus the theorem of Pixton and Goode provides an algorithm that allows

one to decide whether any given regular language is a reļ¬‚exive splicing lan-

guage. It is tempting to suppose that when L\Tk (L) is ļ¬nite it can serve as the

set of initial words of a splicing system that generates L . Unfortunately this is

not the case as shown in Exercise 3 of this section. Although the proof of the

theorem is beyond the scope of this book, the decision procedure that it provides

can, in principle, be carried out using the machinery this book has provided.

Exercises

(1) Let = {a} and L = (aa)ā— . Compute n(L) and k for this language.

(2) Let = {a, b} and L = a ā— ba ā— ba ā— . Compute n(L) and k for this language.

(3) Let = {a, b} and L = ā— . Note that uCv for every u, v in L and conse-

quently the syntactic monoid of L is a singleton.

(a) Compute n(L) and k.

(b) Describe Tk in words. How many elements does Tk contain?

(c) Compute Tk (L) and L\Tk (L).

(d) Conclude that, when the set L\Tk (L) is ļ¬nite, it does not follow that

L\Tk (L) is adequate to serve as the set I of initial words of any splicing

system (Ļ, I ) for which L = L(Ļ, I ). (This is what makes the proof of

Theorem 7.1 challenging.)

(e) Without using Theorem 7.1, specify a set R of splicing rules and a

set I of initial strings for which L = L(Ļ, I ) for Ļ = ( , R). Hint:

(Ī», Ī», Ī», Ī») is a splicing rule.

From biopolymers to formal language theory

244

(4) Show that the following deļ¬nition of a reļ¬‚exive splicing language is equiv-

alent to the deļ¬nition given: A language L is a reļ¬‚exive splicing language

if L = L(Ļ, I ) for some splicing scheme Ļ = ( , R) and, for each rule

r = (u, u , v , v) in R, the rules (u, u , u, u ) and (v , v, v , v) respect L.

(This alternative deļ¬nition allows one to list fewer rules in the rule set

specifying a reļ¬‚exive splicing system.)

(5) Let L ā ā— be a regular language. Let r = (u, u , v , v) be a splicing rule

with u, u , v , v in ā— . Show that r (L) is regular. Hint: Use two copies, M

and M , of the minimal automaton M that recognizes L. Let the sets of

states of these two automata be S and S . Combine M and M into a single

automaton M , having state set S āŖ S , by adding carefully chosen new

edges that allow transitions from states in S to states in S having v as label.

Choose the initial state i of M as the initial state of M and choose the set

F of ļ¬nal states of M as the set of ļ¬nal states of M .

Appendix A

Cardinality

Theorem A.1 If |S| ā¤ |T | and |T | ā¤ |S|, then there is a one-to-one corre-

spondence between S and T , i.e. |S| = |T |

Proof Assume f : S ā’ T and g : T ā’ S are injective functions. For each

s ā S, we ļ¬nd g ā’1 (s) if it exists. We then ļ¬nd f ā’1 g ā’1 (s) if it exists. Then ļ¬nd

g ā’1 f ā’1 g ā’1 (s) if it exists. We continue this process. There are three possible

results: (1) The process continues indeļ¬nitely. (2) The process ends because

for some si in the process, there is no g ā’1 (si ). (3) The process ends because

for some ti in the process, there is no f ā’1 (ti ). Let S1 be the elements of S for

which the ļ¬rst result occurs. Let S2 be the elements of S for which the second

result occurs. Let S3 be the elements of S for which the third result occurs.

Obviously these sets are disjoint. Similarly form T1 , T2 , and T3 as subsets of

T . f is a one-to-one correspondence from S1 to T1 . f is also a one-to-one

correspondence from S2 to T2 . g ā’1 is a one-to-one correspondence from S3 to

T3 . Let Īø : S ā’ T be deļ¬ned by

Īø(s) = f (s) if s ā S1

= f (s) if s ā S2

= g ā’1 (s) if s ā S3

Īø is a one to one correspondence from S to T .

Theorem A.2 For any set A, |A| < |P(A)|.

Proof Certainly |A| ā¤ |P(A)| since for each element in a in A, {a} is in P(A).

Assume |A| = |P(A)|. Then there is a one-to-one correspondence between

|A| and |P(A)|. For a ā A let Ļ(a) be the element in P(A) paired with a.

Some elements in A belong to the element in P(A) with which they are

paired. For example, if a ā A and Ļ(a) = A in P(A), then a ā Ļ(a). How-

ever, if a ā A and Ļ(a) = ā…, the empty set in P(A), certainly a ā Ļ(a). Let

/

245

Cardinality

246

W = {a : a ā A a ā Ļ(a)}. W ā P(A), but no element in A can correspond

/

to W , for if Ļ(a) = W and a ā W , then by deļ¬nition of W , a ā Ļ(a) and

/

a ā W . However, if Ļ(a) = W and a ā W , then a ā Ļ(a) and by deļ¬nition of

/ / /

W , a ā W . Hence we have a contradiction if any element of A corresponds to

W and there is no one-to-one correspondence between |A| and |P(A)|.

This theorem shows us that, for any inļ¬nite set, there is another inļ¬nite set

with greater cardinality. We shall not prove it here but the cardinality of the real

numbers is equal to the cardinality of the power set of the set of integers.

Appendix B

Co-compactness Lemma

Lemma B.1 (Co-compactness Lemma) Let A be a ļ¬nite set and let

{Ri : i ā I } be a family of retracts in Aā— . There is a ļ¬nite subset F of I for

Ri =

which Ri .

iāF iāI

Proof We consider only the case for which there is a single key set K for

which, for each i ā I , K is a set of keys for the key code that generates Ri .

The general result then follows from the fact that there are only a ļ¬nite number

of subsets, hence of possible key sets of A. First we partition K into disjoint

subsets K and K . Let K = {a ā K : for every ļ¬nite subset J of I , a occurs

Ri . Let K = K ā’ K .

in at least one word in

iāJ

From the deļ¬nition of K , it follows that, for each a ā K , there is a ļ¬nite

subset F(a) of I for which Ri contains no word in which a occurs. Let

iāF(a)

F= F(a) a in K . The symbols in K that occur in words in Ri are

aāK iāF

precisely the symbols in K .

Deļ¬ne an equivalence relation ā¼ in the set Ri : by Ri ā¼ R j if for each

iāI

a ā K , the generator of Ri in which a occurs is identical with the generator of

R j in which a occurs. In the next three paragraphs we show that there are only

ļ¬nitely many ā¼ equivalence classes.

Choose an arbitrary index m ā I . Let C be the key code that generates Rm .

Let C be the subset of C consisting of those words with keys in K . Let L

be the length of the longest word in C . Note that, for any word w ā C ā— : (1)

the number of symbols to the left of the ļ¬rst occurrence of a key symbol in w

is less than or equal to L ā’ 1; (2) the number of symbols occurring between

two successive occurrences of keys is less than or equal to 2L ā’ 2; and (3) the

number of symbols to the right of the last occurrence of a key symbol in w is

less than or equal to L ā’ 1.

247

Co-compactness Lemma

248

Next we establish that, for every j ā I and every k ā K , the generator

of R j in which k occurs has length at most 4L ā’ 2. For such j and k we

have: since G = F āŖ {m, j} is ļ¬nite and k ā K , there is a word w ā Ri

iāG

in which k occurs. Let w = x0 a0 x1 a1 . . . xiā’1 ai xi ai+1 xi+1 . . . xnā’1 an xn where

the ai , 1 ā¤ i ā¤ n, are all the key occurrences in w. Hence, no key occurs in

any of the xi , 1 ā¤ i ā¤ n. Note that all of the keys occurring in w must lie in

K . The word w can be segmented into code words belonging to Rm and it can

also be segmented into code words belonging to R j . We have k = ai for some

i, 1 ā¤ i ā¤ n. Note that, if the length of the code word belonging to R j in which

k occurs were greater than or equal to 4L ā’ 2, this would contradict one of

(1),(2), or (3) of the ļ¬nal sentence of the previous paragraph.

We have shown that there is a bound B(= 4L ā’ 2) such that, for every j ā I

and k ā K , no code word of R j in which k occurs can have length greater than

or equal to B. From the deļ¬nition of the equivalence relation ā¼ we see that

there are only ļ¬nitely many ā¼ equivalence classes.

Let F be a subset of I for which, for each i ā I , there is a unique j ā F

for which Ri ā¼ R j .

The statement of the lemma is true for F = F āŖ F .

References

[1] J. A. Anderson, Semiretracts of a free monoid, Theor. Comput. Sci. 134(1994),

3ā“11.

[2] J. A. Anderson, āSemiretracts of a free monoidā (extended abstract) Proceedings

of the Colloquium of Words, Languages and Combinatorics II, New York, World

Scientiļ¬c (1994), 1ā“5.

[3] J. A. Anderson and T. Head, The lattice of semiretracts of a free monoid, Intern. J.

Computer Math. 43(1992), 127ā“31.

[4] K. Culik II and T. Harju, The regularity of splicing systems and DNA, in Proceed-

ings ICALP ā™89, Lecture Notes in Comput. Sci. 372(1989), 222ā“33.

[5] K. Culik II and T. Harju, Splicing semigroups of dominoes and DNA, Discrete

Appl. Math. 31(1991), 261ā“77.

[6] A. Ehrenfeucht, T. Harju, I. Petre, D. M. Prescott and G. Rozenberg, Computation

in Living Cells ā“ Gene Assembly in Ciliates, Berlin, Springer-Verlag (2004).

[7] J. Engelfriet and G. Rozenberg, Equality languages and ļ¬xed point languages,

Information and Control 43(1979), 20ā“49; 31(1991), 261ā“77.

[8] L. Fuchs, Inļ¬nite Abelian Groups, New York, Academic Press, I (1970), III (1973).

[9] W. Forys and T. Head, Retracts of free monoids are nowhere dense with respect to

ļ¬nite group topologies and p-adic topologies, Semigroup Forum, 42(1991), 117ā“19.

[10] W. Forys and T. Head, The poset of retracts of a free monoid, Intern. J. Computer

Math. 37(1990), 45ā“8.

[11] R. W. Gatterdam, Splicing systems and regularity, Intern. J. Computer Math.

31(1989), 63ā“7.

[12] E. Goode (Laun), Constants and Splicing Systems, Dissertation Binghamton

University, Binghamton, New York (1999).

[13] T. Harju and D. Nowotka, The equation xi = yj zk in a free semigroup, Semigroup

Forum 68(2004), 488ā“90.

[14] J. Harrison, On almost cylindrical languages and the decidability of the D0L and

PWD0L primitivity problems, Theor. Comput. Sci. 164(1996), 29ā“40.

[15] T. Head, Cylindrical and eventual languages, in Mathematical Linguistics and

Related Topics (Gh. Paun, ed.), Bucharest, Romania, Editura Academiei Romane

(1995), pages 179ā“83.

[16] T. Head, Expanded subalphabets in the theories of languages and semigroups,

Intern. J. Computer Math. 12(1982), 113ā“23.

249

References

250

[17] T. Head, Formal language theory and DNA: an analysis of the generative capacity

of speciļ¬c recombinant behaviors, Bull. Math. Biology 49(1987), 737ā“59.

[18] T. Head, Splicing schemes and DNA, in Lindenmayer Systems ā“ Impacts on The-

oretical Computer Science, Computer Graphics, and Developmental Biology (G.

Rozenberg and A. Salomaa, eds.), Berlin, Springer-Verlag (1992), pages 371ā“83.

[19] T. Head, Splicing representations of strictly locally testable languages, Discrete

Appl. Math. 87(1998), 139ā“47.

[20] T. Head and S. Gal, Aqueous computing: writing on molecules dissolved in

water, in Nanotechnology: Science and Computation (J. Chen, N. Jonoska and G.

Rozenberg, eds.), Berlin, Springer-Verlag (2006), pages 321ā“34.

[21] T. Head, Gh. Paun and D. Pixton, Language theory and molecular genetics:

generative mechanisms suggested by DNA recombination, in Handbook of Formal

Languages (G. Rozenberg and Gh. Paun, eds.), Berlin, Springer-Verlag (1997),

Vol. II, Chapter 7.

[22] T. Head and D. Pixton, Splicing and regularity, in Formal Languages and Appli-

cations II (Z. Esik, C. Martin-Vide and V. Mitrana, eds.), Berlin, Springer-Verlag

(2006).

[23] T. Head, Visualizing languages using primitive powers, in Words, Semigroups,

& Transducers (M. Ito, Gh. Paun, and S. Yu, eds.), Singapore, World Scientiļ¬c

(2001), pages 169ā“80.

[24] T. Head and B. Lando, Periodic D0L languages, Theor. Comput. Sci. 46(1986),

83ā“9.

[25] M. Ito, M. Katsura, H. J. Shyr and S. S. Yu, Automata accepting primitive words,

Semigroup Forum 37(1988), 45ā“52.

[26] B. Lando, Periodicity and ultimate periodicity of D0L systems, Theor. Comput.

Sci. 82(1991), 19ā“33.

[27] L. F. Landweber and L. Kari, Universal molecular computation in ciliates,

in Evolution as Computation (L. F. Landweber and E. Winfree, eds.), Berlin,

Springer-Verlag (2002), pages 3ā“15.

[28] M. Lothaire, Combinatorics on Words, Massachusetts, Addison-Wesley (1983).

Reprinted by Cambridge University Press (1984).

[29] G. H. Mealy, A Method for Synthesizing Sequential Circuits, Bell System

Technical Journal 34(1955), 1045ā“70.

[30] E. F. Moore, Gedanken-Experiments on Sequential Machines, Automata Studies,

Annals of Mathematical Studies 32(1956), 129ā“53.

[31] R. McNaughton and S. Papert, Counter-Free Automata, Cambridge, MA, MIT

Press (1971).

[32] Gh. Paun, G. Rozenberg and A. Salomaa, DNA Computing ā“ New Computing

Paradigms, Berlin, Springer-Verlag (1998).

[33] D. Pixton, Regularity of splicing languages, Discrete Appl. Math. 69(1996),

101ā“24.

[34] D. Pixton, Splicing in abstract families of languages, Theor. Comput. Sci.

234(2000), 135ā“66.

[35] A. Salomaa, Jewels of Formal Language Theory, Rockville, MD, Computer

Science Press (1986).

[36] H. J. Shyr, Free Monoids and Languages, Taiwan, Hon Min l991.

[37] B. Tilson, The intersection of free submonoids of a free monoid is free, Semigroup

Forum 4(1972), 345ā“50.

Further reading

[1] J. A. Anderson, Code properties of minimal generating sets of retracts and semire-

tracts, SEA Bull. Math. 18(1994), 7ā“16.

[2] J. A. Anderson, āSemiretracts and their syntactic monoids,ā Semigroups: Interna-

tional Conference on Semigroups and its Related Topics, Singapore, Springer-Verlag

(1998).

[3] J. Berstel and D. Perrin, Theory of Codes, New York, Academic Press (1985).

[4] E. Goode and D. Pixton, Recognizing splicing languages: syntactic monoids and

simultaneous pumping, Discrete Appl. Math. (to appear).

[5] J. Howie, Automata and Languages, New York, Oxford University Press (1991).

[6] J. E. Pin, Varieties of Formal Languages, New York, Plenum (1986).

[7] G. Thierrin, A mode of decomposition of regular languages, Semigroup Forum

10(1975), 32ā“8.

251

Index

accepted codomain, 12

PDA, 90 combinatorics on words, 210

alphabet, 23 concatenation, 23

automata congruence, 19

pushdown, 90 conjugate words, 224

automaton, 37 context-free gramar, 127

acceptance states, 37 context-free language, 127, 149

accepted by, 38 context-free languages

accessible, 73 decidable, 208

deterministic, 41 deterministic, 211

intrinsic, 74 pumping lemma, 163

language accepted by, 37 context-sensitive grammar, 127

minimal, 73 cylindrical language, 218

pushdown, 89

sink state, 39 decidability

state diagram, 38 context-free languages, 164

syntactic monoid, 74, 75 regular languages, 86

transformation monoid, 78 deterministic automata, 41

transition function, 37, 41 deterministic PDA, 91

axiom system deterministic Turing machine, 170

GĀØ delā“Hilbertā“Bernays, 1

o disjoint, 9

Russellā“Whitehead, 1 domain, 12

Zermeloā“Fraenkelā“von Neumann, 1

equivalence

Burali-Forti, 1 relation, 9

Cantor, Georg, 1 ļ¬‚ag, 223

Chomsky normal form, 138, language, 223

144 word, 223

code, 26 function, 12

bipreļ¬x, 26 bijection, 13

block, 26 codomain, 12

inļ¬x, 26 domain, 12

preļ¬x code, 26 image, 12

sufļ¬x code, 26 injection, 13

uniquely decipherable, 26 preimage, 12

253

Index

254

function (cont.) mapping, 12

range, 12 Mealy machine, 102

surjection, 13 minimal automaton, 73, 74

modiļ¬ed correspondence system, 200

grammar modiļ¬ed Postā™s Correspondence Problem,

Chomsky normal form, 138 200

context-free, 127 monoid, 16

Greibach normal form, 138 Moore machine, 99

leftmost derivation, 140 mutually exclusive sets, 9

grammar, 114

context-sensitive, 127 nondeterministic automata, 41

corresponding tree, 123 nondeterministic Turing machine, 190

formal, 115

language generated by, 115 partial ordering

nonterminal symbols, 115 relation, 10

phrase structure, 115 partially ordered set

productions, 115 greatest lower bound, 11

regular, 128, 129 least element, 11

start symbol, 115 least upper bound, 11

terminal symbols, 115 partition, 10

Greibach normal form, 138, 148 Postā™s Correspondence Problem, 200

match, 200

halting problem, 195 modiļ¬ed, 200

homomorphism preimage, function, 12

mortal, 30 primitive word, 210

vital, 30 pumping lemma

context-free languages, 163

ideal, 19 regular languages, 84

idempotent, 18 pushdown automaton, 89, 90

image

function, 12 range, function, 12

intrinsic automaton, 74 recombinants of molecules, 235

regular expression, 24

Kleene star, 23 regular grammar, 128, 129

Kleeneā™s Theorem, 52, 68 regular language, 25

pumping lemma, 84

language, 24 relation

almost bounded above, 220 equivalence, 9, 10

almost cylindrical, 220 retract, 29

almost eventual, 220 key, 31

almost uniformaly bounded above, key code, 31

220 key word, 31

almost uniformly eventual, 220 retraction, 29

bounded above, 219 right ideal, 19

context-free, 127, 149 Russell, Bertrand, 1

cylindrical, 218

eventual, 219 semigroup, 16

sketch equivalent, 222 semilattice, 18

sketch parameter, 221 lower semilattice, 11

uniformly eventual, 219 upper semilattice, 11

left ideal, 19 semiretract, 35

Index 255

set, 1 splicing scheme, 236

cardinality, 4 splicing system, 236

Cartesian product, 5 stack

element, 1 LIFO, 89

power set, 5 pop, 89

relation, 6 push, 89

set difference, 3 pushdown, 89

subset, 2 string, 23

symmetric difference, 3 concatenation, 23

set properties support of a language, 217

associative properties, 4, 6 syntactic monoid, 74, 75, 242

commutative properties, 4, 6

complement properties, 4, 6 transformation monoid, 78

de Morganā™s laws, 4, 5 Turing acceptable, 195

distributive laws, 3 Turing decidable, 195

distributive properties, 6 Turing machine, 170

double complement, 4, 5 conļ¬guration, 172

idempotent properties, 4, 5 crash, 173

identity properties, 4, 6 crashes, 171

set relation deterministic, 170

disjoint, 9 halting problem, 195

function, 12 hang, 173

mutually exclusive, 9 language accepted by, 180

sketch function, 218 nondeterministic, 190

spectral partition, 216 Turing acceptable, 195

spectrum, 214 Turing decidable, 195

splicing, 232 word accepted by, 180

spliced, 233

splicing language, 236, 239 undecidable

reļ¬‚exive, 242, 244 context-free languages, 208

splicing rule, 236 uniquely decipherable code,

action, 236 26

radius, 236

respects language, 236 word, 23

ńņš. 9 |