Reading Notes in Formal Languages

Intersection of a CFL with the Prefixes of a CFL

2008/07/16

Just a quick note on a decision problem: a classical result from the theory of context-free languages is the undecidability of the emptiness of the intersection of two CFLs. Put more concretely, given L₁ and L₂ two CFLs, we cannot decide whether L₁∩L₂=∅.

In the context of PEGs (Ford, 2004), the disjointness of two expressions can be ensured by the emptiness of the intersection of the set of prefixes of the context-free approximation of the first expression with the context-free approximation of the second: given e₁ and e₂ two parsing expressions, we consider two context-free approximations L₁ and L₂ with ℒ(e₁)⊆L₁ and ℒ(e₂)⊆L₂, and we want to check that Pre(L₁)∩L₂=∅, where Pre(L)={x|∃y, xy∈L}.

Goofing Around

At first sight, there is little indication whether the problem is decidable or not. The set of prefixes of a context-free language is in general context-free (for instance Pre({aⁿbⁿ|n≥0})={aⁿb^m| 0≤m≤n}), and its intersection with a context-free language is in general context-sensitive (for instance Pre({aⁿbⁿc^m| m, n≥0})∩{a^mbⁿcⁿ|m, n≥0}=a^*∪{aⁿbⁿcⁿ| n≥0}).

On the other hand, this is hardly conclusive, and the hypothesis that one of the languages is closed under the Pre operation might be just enough to make the problem decidable. For instance, couldn't we encode this intersection in a "higher-order" grammatical formalism with decidable emptiness, like indexed grammars (Aho, 1968)?

The Hard Way

It turns out that this slight variation of the emptiness of intersection problem is not decidable either. Considering the reduction of the Post correspondence problem given by Harrison (1978, pages 254–258), an instance of PCP

x=(x₁, x₂, ..., x_n) y=(y₁, y₂, ..., y_n) x_i, y_i∈{a,b}⁺

is associated with the languages

L(x,y) = {ba^i_k···ba^i₁cx_i₁···x_{i_k}cy^R_{j_l}···y^R_j₁ca^j₁b···a^j_lbc | k,l≥1, 1≤i_p,j_q≤n, 1≤p≤k, 1≤q≤l}

L_s = {w₁cw₂cw^R₂cw^R₁c | w₁, w₂ ∈ {a,b}⁺}

with intersection

L(x,y) ∩ L_s = {t₁ct₂ct^R₂t^R₁c | t₁=ba^i_k···ba^i₁, t₂=x_i₁···x_{i_k}=y_i₁···y_{i_k}}

which is nonempty if and only if there is a solution to the PCP instance (x,y).

Considering now the intersection Pre(L(x,y))∩L_s, observe that the four c symbols of L_s reject any proper prefix of L(x,y). Thus the emptiness of the intersection of the set of prefixes of a CFL with a CFL is not decidable.

The Easy Way

Sometimes it takes me a little while to figure something very simple. Consider two context-free languages L₁ and L₂ over an alphabet Σ, and consider the intersection Pre(L₁·{a})∩(L₂·{a}) with a not in Σ. This intersection is empty if and only if L₁∩L₂ is.

More on Parsing Expression Languages

2008/03/13

Back to PELs (Ford, 2004) with a small comparison with Church Rosser languages (McNaughton et al., 1988) this time. Both enjoy a linear time membership problem, include some context-sensitive languages and the full class of DCFLs, and are closed under complement.

I had wondered for some time whether all PELs were Church Rosser, but I just found out that the following Gladkij language was not growing context-sensitive and thus not Church Rosser (Buntrock and Otto, 1998):

ℒ_Gladkij = { wcw^Rcw | w ∈ {a,b}^* }.

We can however recognize ℒ_Gladkij with a PEG since it is the intersection of two deterministic context-free languages

ℒ₁ = { wcw^Rcw' | w,w' ∈ {a,b}^* }
ℒ₂ = { w'cwcw^R | w,w' ∈ {a,b}^* }.

Since PELs are closed under intersection and include all the DCFLs, it follows that ℒ_Gladkij is a parsing expression language. The following PEG recognizes ℒ_Gladkij (I'm not too fond of predicates, so I prefer to stick with the ordered choice operator):

S ← ((W' c W / ε) Z / ε) W c W'
W' ← a W' / b W' / ε
W ← a W a / b W b / c
T ← a / b / c
Z ← T Z / T

The expression W' c W recognizes ℒ₂, and the expression W c W' recognizes ℒ₁. If an input string is in ℒ₂, then the expression (W' c W / ε) Z fails, thus the next alternative ε is used and the input string also has to be in ℒ₁ in order to be accepted by the grammar. If an input string is not in ℒ₂, then (W' c W / ε) Z succeeds and consumes the whole input through Z, therefore the grammar cannot accept the input.

I have currently no clue whether an inclusion of the Church Rosser languages in the class of parsing expression languages would hold.

A Note on Parsing Expression Languages

2008/03/11

Thanks to a recent excursion into the realm of RCGs, I have found the answer to another comparison with the boolean closure of some context-free languages: the comparison with PEL (Ford, 2004).

Parsing expression languages are closed under union, intersection, and complement, and properly include DCFLs (Birman and Ullman, 1973). Thus a comparison with the boolean closure of DCFLs is of some interest (on the other hand, it is widely conjectured that they do not capture all the CFLs). Using again the results of Kutrib et al. (2004, Theorem 17), we know that the language

ℒ_exp = { a^2ⁿ | n ≥ 1 },

which is not semilinear, is not in Bool(CFL) and thus not in Bool(DCFL).

But we can generate this language with the PEG

S ← a S a / a a

(see Exercise 6.1.1 by Aho and Ullman, 1972). Thus the inclusion of Bool(DCFL) in the set of parsing expression languages is proper.

Notes on 1-Range Concatenation Grammars

2008/03/07

Some quick notes on 1-RCGs after a first skim through Boullier (2000)'s paper.

Boolean Closure of Context-Free Languages

A striking aspect of 1-RCGs is how close to defining exactly the boolean closure of CFLs (Bool(CFL)) they seem to be. Thanks to their closure by intersection and complement and their inclusion of CFLs, 1-RCLs (actually proper 1-RCLs) clearly include Bool(CFL).

The inclusion is however proper. Using the results of Kutrib et al. (2004, Theorem 17), we can exploit the semilinearity of some languages in Bool(CFL) to show that the language

ℒ_exp = { aⁿb^2ⁿ | 0 ≤ n }.

is not in Bool(CFL).

Nevertheless, I think we can generate this language with the 1-RCG

S(aXYZ) → S(XY) S(XZ) A(X) B(YZ)
S(b) → ε
A(aX) → A(X)
A(ε) → ε
B(bXb) → B(X)
B(bb) → ε

This grammar is not a proper 1-RCG: the calls to S(XY) and S(XZ) use two variables in a right-hand side predicate, without any clear way to obtain the same effect with only one variable. The class of languages generated by proper 1-RCGs would probably deserve to be further compared to Bool(CFL).

Tree Adjoining Languages

Boullier (2000) indicates that it is unknown, albeit very unlikely, whether any tree adjoining language could be expressed using a proper 1-RCG. I am really not sure about this, but I wondered whether the language

ℒ_palcopy = { ww^Rww^R | w ∈ {a,b}^* }

could not be a good candidate for a counter-example.

Dominance Languages are not Semilinear

2008/02/19

A point mentioned by Bruno Guillaume during a conversation about Interaction Grammars (Perrier, 2004) some time ago. Looking at Hopcroft and Pansiot (1979)'s VASS, it turns out that Interaction Grammars is not the only dominance formalism that derives non semilinear sets.

Non Semilinearity

An interesting consequence of the equivalence of Rambow (1994)'s {}-LIGs with de Groote et al. (2004)'s VATAs is that the former generate some non semilinear sets. Indeed, VATAs generalize Hopcroft and Pansiot (1979)'s VASS, and we can very easily translate their example of a VASS with a non semilinear reachability set into a dominance grammar that generates a non semilinear language.

Hopcroft and Pansiot's three-dimensional VASS example (Lemma 2.8) follows, with two states A and B and (A,(0,0,1)) as initial configuration:

A → (A,(0,1,-1))
A → (B,(0,0,0))
B → (B,(0,-1,2))
B → (A,(1,0,0))

They prove that a configuration (A,(x₁,x₂,x₃)) is reachable if and only if 0 < x₂ + x₃ ≤ 2^x₁ holds.

The translation as a {}-LIG is rather immediate; we merely add the rules of T and U to consume the remaining indices and produce a and b symbols.

S → A {x₃}
A {x₃} → A {x₂}
A → B
B {x₂} → B {x₃,x₃}
B → A {x₁}
A → T
T {x₁} → a T
T → U
U {x₂} → b U | b
U {x₃} → b U | b

By Hopcroft and Pansiot's proof, the language of this grammar is

ℒ = { aⁿb^m | 0 ≤ n, 0 < m ≤ 2ⁿ }.

Rambow further provides the translation from {}-LIGs to weakly equivalent UVG-dls. The translation exploits a normal form for {}-LIGs; the translated grammar follows.

   S    S     A    A    A    A      T'   T      U'   U'
   |    |     |    |    |    |     / \   |     / \   |
   A    A     A'   A'   B    T    a   T  U    b   U  b
   :    :     :    :
   U    A'    U    B
   |    |     |    |
   U'   A     U'   B'


   B'   B'   B"   B"   B
   |    |    |    |    |
   B"   B"   B    B    A
   :    :    :    :    :
   U    A'   U    A'   T
   |    |    |    |    |
   U'   A    U'   A    T'

Consequences

UVG-dls are the simplest representatives of a number of dominance grammatical formalisms: GAG (Candito and Kahane, 1998), DSG (Rambow et al., 2001), (non-local) TDG (Kallmeyer, 1999), and Interaction Grammars (Perrier, 2004); their non semilinearity implies the non semilinearity of all those grammatical frameworks. This further justifies the substitutable notion in Rambow et al. (1995)'s original DTG definition, and the local restriction of TDG advocated by Kallmeyer to enforce semilinearity.

Semilinearity, or more exactly the (implied) constant growth property, is one of the characteristics often invoked for natural language syntax, and one of the three requirements for mild context sensitivity. Non semilinearity indicates that dominance grammars are probably more expressive than needed (or at least on this particular point). Nevertheless, all this should be taken with a grain of salt: ℒ still verifies the constant growth property, and mild context sensitivity is more of an empirical assessment than a real formal definition one should strictly adhere to.

Note that the semilinearity requisite has been challenged by an analysis of Old Georgian by Michaelis and Kracht (1997), but this analysis was in turn challenged by Bhatt and Joshi (2004).

Post-Order Typing XML Streams

2008/01/23

Some notes on a characterization of post-order typing described by Kumar et al. (2007).

Typing XML Streams

XML streams are a relatively recent subject of study (see e.g. Segoufin and Vianu, 2002; Martens et al., 2005). But it is raising some interest for applications that have to process large amounts of XML data—so large that the entire XML tree does not hold in physical memory of the processing device.

As noted by Martens et al., XML streams are easy to process efficiently if one knows the type of a node as soon as it is met, aka pre-order typing. They also briefly mention the mirror condition, namely post-order typing:

A tag is said to be post-order typed if its unique type is determined at the close-tag.
Kumar et al. (2007)

Note that pre-order typing implies post-order typing, and that a XML scheme (taken as a regular unranked tree grammar) can be neither pre-order typable nor post-order typable; for instance, it suffices that it is ambiguous.

Balanced Context-Free Grammars

Regular unranked tree grammars are isomorphic with balanced context-free grammars, where tree grammar rules A → tag ( regex ) correspond to extended context-free rules A → <tag> regex </tag> (Berstel and Boasson, 2002; Brüggemann-Klein and Wood, 2004). This representation has the advantage that the XML streams are exactly the languages generated by the corresponding balanced context-free grammar.

It should be clear that XML schemes allowing pre-order typing are defined by LL(1) balanced context-free grammars, with characterizations by Brüggemann-Klein and Wood (1998) and Martens et al. (2005).

On the post-order typing side, it also stands to sense that it corresponds to LR(0) balanced context-free grammars. LR parsing for extended context-free grammars is complex in general, because the parser might have issues finding how much of its stack content it should reduce—the topic is notoriously difficult and has been the subject of a rather large number of papers originating with Madsen and Christensen (1976)'s and Lalonde (1977)'s. Fortunately, we do not have this issue with balanced grammars, since the opening tag is clearly identifiable in the stack. Moreover, shift/reduce conflicts cannot occur.

Disjoint Calls

Let's turn now to Kumar et al. (2007)'s characterization of post-order typing:

We say that an XVPA A has disjoint calls if for all states q in A and open tags c, if q →^c_A e_m and q →^c_A e_m' for distinct m, m' ∈ µ^-1(c), then L_d(m)∩L_d(m')=∅.
An SDTD (d,m₀) is post-order typed if and only if there is an XVPA A with disjoint calls over the same modules such that L(A)=L_d(m₀).

I leave it to the reader to fetch and read Kumar et al.'s article in order to understand this statement.

The insight we gain with the LR(0) characterization provides us with a constructive version of the statement: we can map a XVPA with disjoint calls to a deterministic LR(0) parser constructed from the provided SDTD.

Copy Language and Dominance Formalisms

2008/01/21

Got some new insight today when I looked at Shieber (1985) again: Swiss German is not weakly context-free, but the rationale is not that the copy language

ℒ_copy = { ww^R | w in {a,b}^* }

is not context-free like I wrongly remembered. The rationale is that the crossings language

ℒ_cross = { aⁿb^mcdⁿe^m | n, m ≥ 0 }

is not context-free.

Thus the argument that UVG-dl (Rambow, 1994) and DSG (Rambow et al., 2001) might not generate Swiss German does not hold for weak generation: the set of trees

   S             A               B
   |            / \            / | \
   A           a   A          b  B  e
   :               :
   A               C
   |              / \
   B             C   d
   :
   B
   |
   C
   :
   C
   |
   c

generates ℒ_cross. Of course, it is not linguistically adequate—it does not strongly generate the correct dependencies.