There are times when parsing data formats when it is nessasary to consider data that occurs at a future point in the bitstream. For instance, consider a simple fixed-length tagged union, where the tag occurs after the union. Conceptually, such a format may be described by:

<xs:choice dfdl:choiceDispatchKey="{ tag }">
  <xs:element name="a" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="1"/>
  <xs:element name="b" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="2"/>
</xs:choice>
<xs:element name="tag" type="xs:int" dfdl:length="8" />

An existing proposal (Proposal: DFDL base plus offset feature - Enables describing TIFF) would allow for this, by making it possible to put the <tag> element first in the infoset, despite it occuring later in the bitstream. However, such a proposal imposes unnessasary complexity for such a usecase. In particular, the schema must specify explicitly to jump forward and backward in the bitstream. Further, the full generallity of the schema involves considering additional issues surrounding unparsing (such as overlapping data).

For basic usecases such as the above, it is possible to instead add support in a much simpler manner, by providing lookahead capabilities directly in DPath:

<xs:choice dfdl:choiceDispatchKey="{ dfdlx:lookAhead(16,8) }">
  <xs:element name="a" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="1"/>
  <xs:element name="b" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="2"/>
</xs:choice>
<xs:element name="tag" type="xs:int" dfdl:length="8" />

Another potential solution would be to allow forward references in DPath expressions during parsing, if the compiler can prove that such a forward reference is resolvable (eg. the portion of content being skipped over is of constant length). However doing so would add significant complexity to both Daffodil and DFDL.

This proposal is to add the dfdlx:lookAhead function to DPath.

dfdlx:lookAhead

  • dfdlx:lookAhead(distance, bitSize) 
    • read bitSize bits, where the first bit is located at an offset of distance from the current location
  • Restrictions
    • distance >=0
    • bitSize >= 0
    • distance + bitSize <= Implementation defined limit no less than 512 bits
    • Cannot be called during unparse
    • ParseError if looks past EOF
    • Undefined behavior if looks past document boundery when in streaming mode.
    • bitOrder and byteOrder are determined by the current location. Changes between the current location and the location containing the data being read will not be respected.

Examples

The following two elements are equivalent:

  • <xs:element name="a" type="xs:unsignedInt" dfdl:length="3" dfdl:lengthUnits="bits" />
  • <xs:element name="a" type="xs:unsignedInt" dfdl:length="3" dfdl:lengthUnits="bits" dfdl:inputValueCalc="{ dfdlx:lookAhead(0,3) }" />

The following example demonstrates using lookAhead to branch based on a field in the future:

<xs:choice dfdl:choiceDispatchKey="{ dfdlx:lookAhead(16,8) }">
  <xs:element name="a" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="1"/>
  <xs:element name="b" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="2"/>
</xs:choice>
<xs:element name="tag" type="xs:int" dfdl:length="8" />
  • No labels

3 Comments

  1. I would like this proposal a lot better if there was a way to compute these lengths using some expressions referring to schema concepts. E.g., something like

        dfdlx:lookAhead(dfdlx:constantSizeOf(../a), dfdlx:constantSizeOf(../tag))

    The notion here is that constantSizeOf(path) takes a path to an element, and returns a static size, and SDE if not a constant. 

    Alternatively some functions allowing one to ask for start positions of things relative to a base location would be helpful.

    I have the same problem with dfdl:choiceLengthKind="explicit" and dfdl:choiceLength which must be some integer. We have no way for users to figure out these magic lengths in any robust manner that is maintainable. In that particular case I really think DFDL should be figuring out the length. I think there is a separate proposal for more dfdl:choiceLengthKinds to include things like 'longest' which requires the lengths to be constant, and pads all of them to the length of the longest.

  2. Something to keep in mind is that:

     dfdlx:lookAhead(dfdlx:constantSizeOf(../a), dfdlx:constantSizeOf(../tag))

    Actually uses 2 different constantSizeOf functions. dfdlx:constantSizeOf(../a) includes all padding, while dfdlx:constantSizeOf(../tag) includes no padding, and assumes that tag has no left padding.

    Essentially, my concern is that actually getting these expressions write would be a very subtle and hard to debug problem. It would probably be easier and less error prone to just compute it by hand (especially considering that when you are doing this, you likely have a spec that tells you what bit position all the elements are at)


    What would be really nice is if:

    <xs:choice dfdl:choiceDispatchKey="{ tag }">
      <xs:element name="a" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="1"/>
      <xs:element name="b" type="xs:int" dfdl:length="16" dfdl:choiceBranchKey="2"/>
    </xs:choice>
    <xs:element name="tag" type="xs:int" dfdl:length="8" />

    just works. In any case where such an expression is actually constant, it should be possible for Daffodil to figure this type of thing out at compile time. I suspect that it wouldn't even be too difficult to implement this. The problem I see is defining exactly what cases can be automatically calculated. WIthout a precise definition, this will become a death trap for compatibility. 

    As a matter of sanity, we probably shouldn't allow the above even if we could, but we could allow something like:

    dfdlx:lookAhead(dfdlx:constantDistanceTo(../tag), dfdlx:constantSizeOf(../tag))

    This is not substantively different, but makes it obvious to the reader that this works because the compiler can prove that the values are constant.

    This avoid the user needing to arbitrarily reference <a/> and risk not noticing that <b/> is a different size. It also makes dealing with intervening sequences easier, as the user would otherwise need to explicitly refer to ever intervening element.

    We could possibly simplify it further to something like:

    dfdlx:magicReadAhead(../tag)

    which should make it clear that it is working by some compile time magic. This also gives us the use-cases where tag is not an unsigned binary integer.

    Again, my only concern here is that for compatability reasons we would probably need to define what algorithm we are using in the spec; otherwise different implementations would likely come to different conclusions about what cases can work for this.

  3. It was recently pointed out to me that DFDL has a mechanism for doing this look-ahead, which is dfdl:assert/dfdl:discriminator with testKind 'pattern'.

    A regex can be used to look ahead in the data stream, and if the regex finds a match, the assert succeeds. Otherwise it fails and backtracking will occur. 

    By use of the dfdl:encoding 'iso-8859-1' along with specifying bytes via the regex \xHH convention to match a character specified as a hex byte, or by use of the daffodil specific encodings that are for BITS, BASE4, OCTAL, HEX (1 bit, 2 bits, 3 bits, 4 bits respectively) one can construct a regex that skips some amount of data and then matches against expected data pattern/value to guide a choice branch selection. For example, in charset encoding iso-8859-1, this regex skips 2 bytes then matches if the value is either 0x03 or 0x05 in the next byte. 

    "..[\x03\x05]"

    This is not as efficient as a choice using dfdl:choiceDispatchKey, which this lookAhead function enables, but it is standard portable DFDL.