Introduction: Why Unparsing is Harder than Parsing

It is initially surprising to some software engineers that DFDL unparsing is more complicated than DFDL parsing. After all, parsing involves lookahead and backtracking. Unparsing seems like just serialization.

However, ...

DFDL has a feature controlled by the dfdl:outputValueCalc (OVC for short) property. This property holds an expression in DFDL's expression language, and this expression typically refers to elements that appear later in the infoset than the element whose declaration carries the OVC.
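
To make the forward reference concrete, here is a minimal Python sketch. It is not Daffodil's implementation. It assumes a hypothetical two-element record in which a 2-byte big-endian "len" element is computed from the "payload" element that follows it; the names unparse_record, len, and payload are illustrative only. The point is that the first element in output order cannot be written when it is first encountered, so something must be deferred.

```python
from typing import Callable, Dict, List, Union

# A segment is either finished bytes or a deferred computation that
# produces bytes once the elements it refers to have been unparsed.
Segment = Union[bytes, Callable[[], bytes]]


def unparse_record(payload_text: str) -> bytes:
    """Serialize a hypothetical record laid out as [len][payload]."""
    infoset: Dict[str, bytes] = {}    # element name -> unparsed bytes
    segments: List[Segment] = []

    # 'len' comes first in the output, but its value depends on 'payload',
    # which has not been unparsed yet, so only a deferred segment is queued.
    segments.append(lambda: len(infoset["payload"]).to_bytes(2, "big"))

    # Unparse the later element that the deferred expression refers to.
    infoset["payload"] = payload_text.encode("utf-8")
    segments.append(infoset["payload"])

    # Everything is now known: resolve deferred segments in output order.
    return b"".join(s() if callable(s) else s for s in segments)
```

For example, unparse_record("hello") yields the two length bytes 0x00 0x05 followed by the five payload bytes. In this toy case the deferred field has a fixed size; the complications listed next arise when sizes and positions are themselves unknown.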

...

  • The length of the OVC element may not be known; hence, the starting bit position of this added buffering DataOutputStream may not be known until unparsing of the OVC element's suspension has completed.
  • Alignment: Elements and model groups (terms generally) can have alignment, and text anywhere can have mandatory text alignment. When the starting bit position is not known, the size of the required alignment fill region cannot be computed.
    • This implies that, when the starting bit position is not known, non-zero alignment requires a split of the data output stream of its own.
  • Bit order: Elements can have dfdl:bitOrder, and model groups can have text (e.g., dfdl:initiator), and text implies a bit order as charset encodings each have a specified bit order. It is not meaningful for the bit order to change except on a byte boundary (8 bit boundary). So, if the starting bit position of a buffering data output stream is not known, then the unparser cannot determine whether a bit order change is legal or not until that starting bit position has been determined.
    • This implies that, when the starting bit position is not known, bit order changes require a split of the data output stream of their own.
  • Interior Alignment affecting Length: The length of an element of complex type may depend on its starting bit position in the data output stream. The element's initial alignment is not part of its length, but the dependency arises because the terms (elements or model groups) it contains (aka "interior" terms) may themselves have alignment (or mandatory text alignment). These alignment regions may vary in size depending on where the term starts in the data output stream; hence, the length of a complex type may not be computable until its starting position is known and, recursively, the starting positions of any interior terms inside it are known.
    • This implies that expressions that compute the dfdl:contentLength or dfdl:valueLength of an element must potentially suspend until the starting bit positions become known so that the length of the alignment regions can be computed.
      • Hence, expressions can block not only on the values of infoset elements, but also on the ending bit position of the representation of infoset elements.
    • Circular deadlocks can occur if an OVC element needs the length of a later element, but the length of the later element depends (by way of this interior alignment issue) on the length of the OVC element.
      • Note: formats where an OVC element is itself a variable-length element are expected to be rare (but possible). In our experience, OVC elements most commonly have fixed lengths, because they appear mostly in binary data formats where the length fields are fixed-length binary integers. Formats have been described, however, where a length is expressed as a textual integer, which varies in size depending on the magnitude of the value, followed by a terminating delimiter. So variable-length OVC elements are possible, just uncommon.
  • Target length: Some elements have an explicit length which can be fixed, or given by a dfdl:length expression. When unparsing, this dfdl:length expression is evaluated to give a value known as the target length. This can differ from the value's implicit length in that the value may need to be padded to achieve the target length, or for xs:string only, the value may need to be truncated to fit within the target length.
    • TBD: For elements with explicit length, there is an element unused region at the end which may need to be filled (with dfdl:fillByte). For simple elements this would also be a difference between value and content length. For complex types. .......
    • There is commonly a circular dependency between an OVC element storing a length, and the element whose length it stores. Deadlock is avoided when unparsing because the value of the OVC element must depend only on the dfdl:valueLength (which excludes padding/filling), and so can be computed without reference to the target length of the element. The target length expression is then able to depend on the value of the OVC element and the circularity is avoided.
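
The alignment points in the list above can be made concrete with a small Python sketch. It is illustrative only and is not taken from Daffodil: the function names are hypothetical, bit positions are 0-based here for simplicity, and each interior term is reduced to a pair of (alignment in bits, content length in bits). It shows that the fill needed before a term depends on where the term starts, and therefore that the length of a complex element can change with its starting bit position.

```python
from typing import List, Tuple


def alignment_fill_bits(start_bit: int, alignment_bits: int) -> int:
    """Fill bits needed so the next term starts on an alignment boundary.

    start_bit is a 0-based bit position (an assumption of this sketch);
    alignment_bits is the alignment in bits, e.g. 8 for byte alignment.
    """
    if alignment_bits <= 0:
        raise ValueError("alignment must be a positive number of bits")
    return (-start_bit) % alignment_bits


def complex_length_bits(start_bit: int, children: List[Tuple[int, int]]) -> int:
    """Length of a complex element whose children are (alignment, content) pairs.

    The element's own leading alignment is assumed to be already reflected
    in start_bit; only interior alignment fill contributes to the length.
    """
    pos = start_bit
    for alignment_bits, content_bits in children:
        pos += alignment_fill_bits(pos, alignment_bits)  # interior fill
        pos += content_bits                              # the child itself
    return pos - start_bit


# A 3-bit field followed by a byte-aligned 16-bit field.
children = [(1, 3), (8, 16)]
print(complex_length_bits(0, children))  # 24: 3 + 5 fill bits + 16
print(complex_length_bits(4, children))  # 20: 3 + 1 fill bit + 16
```

Because the same element is 24 bits long at one position and 20 at another, an expression asking for its length has to wait until the starting position is known, which is the suspension behavior described above.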

As one can see from the above, a number of algorithms together implement DFDL's unparsing. Each of the above topics deserves a design note of its own. The subject of this page is the I/O (output really) layer's buffering system.

Direct and Buffered Data Output Streams

This note describes the DataOutputStream buffering, that is, the mechanisms implemented by the DirectOrBufferedDataOutputStream class.
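
As a rough orientation, the following Python sketch models the general idea of a stream that is either direct (writing to the real sink) or buffered (holding bytes until everything before it has been written). It is a simplified assumption about the concept and is not the DirectOrBufferedDataOutputStream API: the class SketchDataOutputStream and its methods write, add_buffered, and set_finished are hypothetical names, and the sketch works in whole bytes and ignores bit positions, alignment, and bit order entirely.

```python
import io
from typing import Optional


class SketchDataOutputStream:
    """Either direct (has the sink) or buffered (accumulates in memory)."""

    def __init__(self, sink: Optional[io.BytesIO] = None) -> None:
        self._sink = sink              # not None means this stream is direct
        self._buffer = bytearray()     # used only while buffered
        self._next: Optional["SketchDataOutputStream"] = None
        self._finished = False

    @property
    def is_direct(self) -> bool:
        return self._sink is not None

    def write(self, data: bytes) -> None:
        if self._finished:
            raise ValueError("stream is already finished")
        if self.is_direct:
            self._sink.write(data)
        else:
            self._buffer += data

    def add_buffered(self) -> "SketchDataOutputStream":
        """Split: create a buffered stream that follows this one in output order."""
        nxt = SketchDataOutputStream()
        nxt._next = self._next
        self._next = nxt
        return nxt

    def set_finished(self) -> None:
        """Mark finished; a finished direct stream passes the sink forward."""
        self._finished = True
        stream = self
        while stream.is_direct and stream._finished and stream._next is not None:
            nxt = stream._next
            nxt._sink, stream._sink = stream._sink, None  # nxt becomes direct
            nxt._sink.write(bytes(nxt._buffer))           # flush what it held
            nxt._buffer.clear()
            stream = nxt


def demo() -> bytes:
    sink = io.BytesIO()
    a = SketchDataOutputStream(sink)
    a.write(b"HDR")
    b = a.add_buffered()               # content after the deferred length field
    payload = b"hello"
    b.write(payload)                   # buffered: the sink still holds only "HDR"
    b.set_finished()
    a.write(len(payload).to_bytes(2, "big"))  # deferred value is now known
    a.set_finished()                   # hands the sink to b, which flushes
    return sink.getvalue()
```

In demo, the stream retained for the deferred length field stays direct while the later content accumulates in a buffered stream; finishing the direct stream delivers the buffered bytes in the correct order.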

...

This discussion did not cover several other important aspects of direct and buffered data output streams and related algorithms:

  • suspended unparsers - for alignment fill, unused regions, and for expressions and dfdl:outputValueCalc.
    • TBD: expressions involving variables yet to be set. 
  • queueing of suspensions on the infoset, and infoset event detection including open/final infoset nodes.
  • capture and propagation of start/end of content and value
    • propagation of start positions, lengths of data output streams, splitting off of buffered data output streams with unknown start positions.

Those will be covered in other pages.

...