Post-Order Typing XML Streams

2008/01/23

Some notes on a characterization of post-order typing described by Kumar et al. (2007).

Typing XML Streams

XML streams are a relatively recent subject of study (see e.g. Segoufin and Vianu, 2002; Martens et al., 2005). But it is raising some interest for applications that have to process large amounts of XML data—so large that the entire XML tree does not hold in physical memory of the processing device.

As noted by Martens et al., XML streams are easy to process efficiently if one knows the type of a node as soon as it is met, aka pre-order typing. They also briefly mention the mirror condition, namely post-order typing:

A tag is said to be post-order typed if its unique type is determined at the close-tag.
Kumar et al. (2007)

Note that pre-order typing implies post-order typing, and that a XML scheme (taken as a regular unranked tree grammar) can be neither pre-order typable nor post-order typable; for instance, it suffices that it is ambiguous.

Balanced Context-Free Grammars

Regular unranked tree grammars are isomorphic with balanced context-free grammars, where tree grammar rules A → tag ( regex ) correspond to extended context-free rules A → <tagregex </tag> (Berstel and Boasson, 2002; Brüggemann-Klein and Wood, 2004). This representation has the advantage that the XML streams are exactly the languages generated by the corresponding balanced context-free grammar.

It should be clear that XML schemes allowing pre-order typing are defined by LL(1) balanced context-free grammars, with characterizations by Brüggemann-Klein and Wood (1998) and Martens et al. (2005).

On the post-order typing side, it also stands to sense that it corresponds to LR(0) balanced context-free grammars. LR parsing for extended context-free grammars is complex in general, because the parser might have issues finding how much of its stack content it should reduce—the topic is notoriously difficult and has been the subject of a rather large number of papers originating with Madsen and Christensen (1976)'s and Lalonde (1977)'s. Fortunately, we do not have this issue with balanced grammars, since the opening tag is clearly identifiable in the stack. Moreover, shift/reduce conflicts cannot occur.

Disjoint Calls

Let's turn now to Kumar et al. (2007)'s characterization of post-order typing:

We say that an XVPA A has disjoint calls if for all states q in A and open tags c, if q →cA em and q →cA em' for distinct m, m' ∈ µ-1(c), then Ld(m)∩Ld(m')=∅.
An SDTD (d,m0) is post-order typed if and only if there is an XVPA A with disjoint calls over the same modules such that L(A)=Ld(m0).

I leave it to the reader to fetch and read Kumar et al.'s article in order to understand this statement.

The insight we gain with the LR(0) characterization provides us with a constructive version of the statement: we can map a XVPA with disjoint calls to a deterministic LR(0) parser constructed from the provided SDTD.

Copy Language and Dominance Formalisms

2008/01/21

Got some new insight today when I looked at Shieber (1985) again: Swiss German is not weakly context-free, but the rationale is not that the copy language

copy = { wwR | w in {a,b}* }

is not context-free like I wrongly remembered. The rationale is that the crossings language

cross = { anbmcdnem | n, m ≥ 0 }

is not context-free.

Thus the argument that UVG-dl (Rambow, 1994) and DSG (Rambow et al., 2001) might not generate Swiss German does not hold for weak generation: the set of trees

   S             A               B
   |            / \            / | \
   A           a   A          b  B  e
   :               :
   A               C
   |              / \
   B             C   d
   :
   B
   |
   C
   :
   C
   |
   c

generates cross. Of course, it is not linguistically adequate—it does not strongly generate the correct dependencies.