Parsing of nillable elements is a subtle area in DFDL.

This article deals with the case where nilKind="literalValue".

dfdl:lengthKind='delimited' (or 'endOfData')

This section is all about the case of lengthKind="delimited" (or lengthKind="endOfData", though that is not yet implemented).

First, keep in mind that the check for a literal nil value occurs entirely separately from, and before, the check for an ordinary value. That is, the grammar says (nilLit | emptyDefaulted | parsedNil | parsedValue).

Let's set aside emptyDefaulted and parsedNil for now.

So we have (nilLit | parsedValue)

This grammar production implies that the decision about whether something is a nilLit must be self-contained and made definitively down inside nilLit. We can't leave the decision pending on things we run into after nilLit has returned success, and then decide, "oh I guess it wasn't a nil literal", because at that point we wouldn't backtrack into parsedValue.
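To illustrate why the decision has to be final inside nilLit, here is a minimal Python sketch of an ordered choice. It is not Daffodil's Scala code; the names ordered_choice, naive_nil_lit and parsed_value are hypothetical, and the parser signature is an assumption made only for this example.

```python
from typing import Callable, Optional, Tuple

# Assumed shape: a parser takes (data, pos) and returns (value, new_pos),
# or None on failure.
Parser = Callable[[str, int], Optional[Tuple[object, int]]]


def ordered_choice(*alternatives: Parser) -> Parser:
    """Try alternatives left to right and commit to the first success."""
    def parse(data: str, pos: int):
        for alt in alternatives:
            result = alt(data, pos)
            if result is not None:
                # Committed: later alternatives are never tried.
                return result
        return None
    return parse


def naive_nil_lit(data: str, pos: int):
    # Deliberately wrong: accepts "nil" without checking what follows it.
    if data.startswith("nil", pos):
        return (None, pos + 3)
    return None


def parsed_value(data: str, pos: int):
    # Hypothetical value parser: reads up to the next comma (or end of data).
    end = data.find(",", pos)
    end = len(data) if end == -1 else end
    return (data[pos:end], end)


element = ordered_choice(naive_nil_lit, parsed_value)
```

With this sketch, element("nilAndOtherStuff,x", 0) reports a nil and stops at position 3, and parsed_value never gets a chance, which is exactly the trap a self-contained nilLit has to avoid.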

Now, as of this writing, what we have in GrammarMixins.scala is incorrect:

nilLit = nilInitiator ~ LiteralNilValue(this) ~ nilTerminator

Let's assume we normalize the nilValue property so that

  • If %ES; appears inside some other token, it's an SDE.
  • If %WSP*; appears alone, it is turned into two list entries: %ES; and %WSP+;
  • Any resulting duplicates are removed.
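The normalization rules above could be sketched roughly as follows. This is a Python illustration, not the Daffodil implementation: normalize_nil_values and SchemaDefinitionError are hypothetical names, and the sketch assumes the nilValue property is a whitespace-separated list of tokens.

```python
from typing import List


class SchemaDefinitionError(Exception):
    """Hypothetical stand-in for an SDE."""


def normalize_nil_values(nil_value: str) -> List[str]:
    """Normalize a nilValue property string into a duplicate-free list."""
    result: List[str] = []
    for token in nil_value.split():
        # %ES; embedded in a larger token is a schema definition error.
        if "%ES;" in token and token != "%ES;":
            raise SchemaDefinitionError(
                "%ES; must appear alone, found it inside: " + token)
        # %WSP*; alone means "empty string, or one or more whitespace".
        expanded = ["%ES;", "%WSP+;"] if token == "%WSP*;" else [token]
        for entry in expanded:
            if entry not in result:  # drop duplicates, keep first occurrence
                result.append(entry)
    return result
```

For example, normalize_nil_values("nil %WSP*; %ES;") gives the three entries nil, %ES; and %WSP+;, because the explicit %ES; duplicates the one produced by expanding %WSP*;.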

For lengthKind delimited or endOfData, we need to know when to stop parsing for a nil value.

So the grammar should be a special case for lengthKind delimited:

nilLit = nilInitiator ~ LiteralNilValueDelimited(this) ~ nilTerminator

For other length kinds, see below.

Now let's look at LiteralNilValueDelimited.

Since the nilValue property can contain a bunch of possibilities, and some of them are of potentially unbounded length like %WSP+;, and some are zero length (%ES;), we have to know "when to stop" when parsing for a nil value. A nil value has to be followed by whatever ends the element; otherwise traps like "nilAndOtherStuff," could be recognized as "nil". We have to verify that the field in fact ends right after the nilValue that was found.

We will need a new lazy val nilTerminatingMarkup. This is similar to the regular terminatingMarkup in StringDelimitedEndOfData, except that it takes nilValueDelimiterPolicy into account before adding the terminator of the element to the list.

Now, in the case of LiteralNilValueDelimited with lengthKind='delimited', nilTerminatingMarkup cannot be an empty list. If it is empty, then it's an SDE: "Nothing specified to terminate nillable element X. Note: property dfdl:nilValueDelimiterPolicy='...' " or something like that.
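A rough Python sketch of how nilTerminatingMarkup might be computed follows. It is only an illustration under stated assumptions: the function and parameter names are hypothetical, the policy values 'none', 'initiator', 'terminator' and 'both' are assumed, and "enclosing markup" is assumed to mean the separators and terminators of enclosing constructs that could end the element.

```python
from typing import List


class SchemaDefinitionError(Exception):
    """Hypothetical stand-in for an SDE."""


def nil_terminating_markup(element_name: str,
                           element_terminators: List[str],
                           enclosing_markup: List[str],
                           nil_value_delimiter_policy: str) -> List[str]:
    """Collect the markup that may legally follow a literal nil value."""
    markup: List[str] = []
    # Assumption: the element's own terminator only applies to a nil
    # representation when the policy is 'terminator' or 'both'.
    if nil_value_delimiter_policy in ("terminator", "both"):
        markup.extend(element_terminators)
    # Assumption: delimiters of enclosing constructs always apply.
    for m in enclosing_markup:
        if m not in markup:
            markup.append(m)
    if not markup:
        raise SchemaDefinitionError(
            "Nothing specified to terminate nillable element %s. "
            "Note: property dfdl:nilValueDelimiterPolicy='%s'"
            % (element_name, nil_value_delimiter_policy))
    return markup
```

The SDE is raised eagerly here so that the matching step never has to deal with an empty list of terminators.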

(Note: lengthKind='endOfData' is not yet implemented, so for LiteralNilValueDelimited with lengthKind='endOfData' we can simply SDE. In general, though, that combination would be OK, and there might not be any nilTerminatingMarkup. It's a corner case.)

Given nilTerminatingMarkup, we can then create regular expressions which match each of the nilValues, and follow them by regular expressions for each of the nilTerminatingMarkup.

That is, look at the regular expressions created by StringDelimitedEndOfData for the terminatingMarkup. Do something similar and create regular expressions for each of the nilTerminatingMarkup.

Now, for each of the nilValue regexes, we combine it with each of the regexes for the terminating markup that could possibly follow it.

By analogy, in StringDelimitedEndOfData, each of the terminatingMarkup regexes gets prefixed by ".*?" to form the final regex that will be used to match. For LiteralNilValueDelimited, each nilTerminatingMarkup regex gets prefixed by EACH nilValue regex.

Examples:

The regex for %ES; is nothing at all: the empty string "".

So, if we are searching for %ES; followed by a comma: RegEx = “” + “,” = “,”.

If we are searching for “nil” followed by a comma: RegEx = “nil” + “,” = “nil,”

To be clear, if there are 4 possibilities in nilValues, and 3 values in nilTerminatingMarkup, then there will be 12 regular expressions.

The longest match (of the 12) wins.

If anything matches, then LiteralNilValueDelimited succeeds and a nil value is created in the infoset.

The position is advanced to just after the "value part" of the match (analogous to StringDelimitedEndOfData, which advances to just after what the ".*?" matched).

Example:

nilValue = “nil” AND delim = “,”

RegEx = “nil,” and, after a successful match, the position has advanced past “nil” only, so the remaining data stream still contains the delimiter followed by the rest of the data.

If no match, then we fail, and backtracking will try parsedValue.
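The matching procedure described above can be sketched in Python roughly as follows. This is a simplified illustration, not the Daffodil code: match_literal_nil and nil_value_regex are hypothetical names, terminators are treated as plain literal text, and Python's \s+ is used as an approximation of %WSP+;.

```python
import re
from typing import List, Optional


def nil_value_regex(nil_value: str) -> str:
    """Translate one normalized nilValue entry into a regex fragment."""
    if nil_value == "%ES;":
        return ""        # empty string: matches nothing at all
    if nil_value == "%WSP+;":
        return r"\s+"    # approximation of DFDL whitespace
    return re.escape(nil_value)


def match_literal_nil(data: str, pos: int,
                      nil_values: List[str],
                      terminators: List[str]) -> Optional[int]:
    """Return the position just after the nil value, or None if no match.

    Every nil value is combined with every terminator, the longest
    overall match wins, and the terminator itself is not consumed.
    """
    best = None  # (end of whole match, end of value part)
    for nv in nil_values:
        for term in terminators:
            # Group 1 is the "value part"; the terminator is only verified.
            pattern = re.compile("(" + nil_value_regex(nv) + ")"
                                 + re.escape(term))
            m = pattern.match(data, pos)
            if m and (best is None or m.end() > best[0]):
                best = (m.end(), m.end(1))
    return None if best is None else best[1]
```

Returning None corresponds to failing so that backtracking can try parsedValue; returning a position corresponds to creating a nil in the infoset and leaving the delimiter in the data for the enclosing parser.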

The above is all good and efficient stuff, and doesn't read data or do work that is not necessary to decide if the value is nil or not.

dfdl:lengthKind not 'delimited'

Now, if lengthKind is not delimited, then things are somewhat different.

We need a grammar rule like this:

litNilNotDelimited = nilInitiator ~ stringValue ~ ToLiteralNilValue ~ nilTerminator

What I don't like here is that, in order to decide whether something is nil or not, we are going to grab the string contents, regardless of size.

For now I think this is ok.
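As a rough illustration of that rule, here is a Python sketch in which the string is extracted first and only then compared against the nil values. It assumes an explicit length in characters and ignores initiators, terminators, padding and trimming; all function names are hypothetical stand-ins for stringValue and ToLiteralNilValue.

```python
import re
from typing import List, Optional, Tuple


def string_value(data: str, pos: int, length: int) -> Optional[Tuple[str, int]]:
    """Grab exactly `length` characters, whatever they are."""
    end = pos + length
    if end > len(data):
        return None
    return data[pos:end], end


def to_literal_nil_value(text: str, nil_values: List[str]) -> bool:
    """Decide, after the fact, whether the extracted text is a nil literal."""
    for nv in nil_values:
        if nv == "%ES;":
            if text == "":
                return True
        elif nv == "%WSP+;":
            if re.fullmatch(r"\s+", text):  # approximation of DFDL whitespace
                return True
        elif text == nv:
            return True
    return False


def lit_nil_not_delimited(data: str, pos: int, length: int,
                          nil_values: List[str]) -> Optional[int]:
    """Return the new position if the field is nil, else None."""
    got = string_value(data, pos, length)
    if got is None:
        return None
    text, end = got
    return end if to_literal_nil_value(text, nil_values) else None
```

Note that string_value reads the whole field before anything is compared, which is the cost described above.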

We have the same issue with numbers. If for example, someone has

<choice>
  <element name="theNum" type="xs:int" ... />
  <element name="theBlob" ... />
</choice>

Then, when it first tries to parse theNum, it's going to grab the entire piece of data, which might be 4 GB of text. Then it's going to feed it to some number parse routine, which is going to fail with NumberFormatException. Nothing is going to realize that an xs:int can only be so big, and cut off the possibility before reading in the whole string of data.

That optimization, which is based on having some estimated loose upper bound on maximum length, is separate and to be discussed elsewhere.

1 Comment

  1. In your choice at the very end, I know I would put a dfdl:assert with testKind='pattern', and reject anything that didn't look like an int right there. So there is a way to explicitly peek at the front of the data stream and rule out an alternative efficiently, without having to wait and see if it fully parses.

    Of course this requires me to reverse engineer what a legal thing can start with as the regex for the assertion, but in some cases that's very easy.