...

DFDL has a feature provided by the dfdl:outputValueCalc property (OVC for short). This property holds an expression in DFDL's expression language, and this expression typically refers to elements that are later in the infoset than the element whose declaration carries the OVC. OVC is a tremendously powerful feature in DFDL which enables a DFDL schema to truly capture inter-dependencies between elements, typically when one element stores the length of another element, but where these two elements are not adjacent in the representation. For example, a length field may appear in a header part of the data, while the part of the data whose length is given in that header is represented much later in the data stream.

The Daffodil unparser is designed to support streaming behavior: the infoset arrives as a series of infoset events generated by a Daffodil InfosetInputter. Daffodil attempts to stream output data to the output stream without waiting for the entire infoset to arrive. Ideally, once an infoset element's start and end events have arrived, the element's representation can be written to the output stream. Once an infoset element has been unparsed, it can in principle be pruned from the infoset and its memory recovered. This enables a very large infoset to be unparsed using a bounded memory footprint that is much smaller than it would be if the unparser held the entire infoset.
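
The idea of emitting and pruning elements as their events arrive can be illustrated with a toy sketch. This is not Daffodil's implementation or API: the event tuples, `unparse_stream`, and `sample_events` are hypothetical names invented for illustration, and each simple element's text is assumed to arrive with its end event.

```python
import io


def unparse_stream(events, out):
    """Toy streaming unparser: returns the peak number of retained elements."""
    open_elements = []   # only elements that are still open are retained
    peak = 0
    for kind, name, value in events:
        if kind == "start":
            open_elements.append(name)
            peak = max(peak, len(open_elements))
        elif kind == "end":
            finished = open_elements.pop()   # prune the finished element
            if finished != name:
                raise ValueError(f"mismatched end event for {name!r}")
            if value is not None:            # simple element: emit it now
                out.write(value.encode("ascii"))
    return peak


def sample_events(count):
    """Generate events lazily so the whole infoset never exists at once."""
    yield ("start", "root", None)
    for i in range(count):
        yield ("start", "item", None)
        yield ("end", "item", f"{i},")
    yield ("end", "root", None)


out = io.BytesIO()
peak = unparse_stream(sample_events(1000), out)
print(peak)                  # 2: the root plus one open item
print(out.getvalue()[:6])    # b'0,1,2,'
```

In this sketch the number of retained elements depends on the nesting depth, not on the number of items, which is the property that pruning is meant to provide.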

TBD: A figure (or several) right here would be good illustrating incoming infoset events, incremental construction of the infoset tree, with simultaneous unparsing of these infoset elements to the output, and pruning of the unparsed elements.

However, the OVC feature complicates this streaming behavior.

OVC elements typically don't appear in the stream of infoset events. In order to support data parsing and unparsing, they are tolerated, but ignored, if they appear in the infoset events, and the values are recomputed. For purposes of this discussion we'll assume that no events appear in the infoset events corresponding to OVC elements.

...

  • The length of the OVC element may not be known; hence, the starting bit position of this added buffering DataOutputStream may not be known until unparsing of the OVC element's suspension has completed.
  • Alignment: Because elements and model groups (terms generally) can have alignment, and text anywhere can have mandatory text alignment, then in the case where we do not know the starting bit position, we are not able to compute the size of the alignment fill region needed.
    • This implies that non-zero alignment requires a split of the data output stream of its own - in the case where the starting bit position is not known.
  • Bit order: Elements can have dfdl:bitOrder, and model groups can have text (e.g., dfdl:initiator), and text implies a bit order as charset encodings each have a specified bit order. It is not meaningful for the bit order to change except on a byte boundary (8 bit boundary). So, if the starting bit position of a buffering data output stream is not known, then the unparser cannot determine whether a bit order change is legal or not until that starting bit position has been determined.
    • This implies that bit order changes require a split of the data output stream of their own - in the case where the starting bit position is not known.
  • Interior Alignment affecting Length: The length of an element of complex type may depend on its starting bit position in the data output stream. The element's initial alignment is not part of its length, but this dependency happens because terms (elements or model groups) may have alignment (or mandatory text alignment) on terms they contain (aka "interior" terms). These alignment regions may be of varying size depending on where the term starts in the data output stream; hence, the length of a complex type may not be computable until its starting position is known and, recursively, the starting positions of any interior terms inside it are known.
    • This implies that expressions that compute the dfdl:contentLength or dfdl:valueLength of an element must potentially suspend until the starting bit positions become known so that the length of the alignment regions can be computed.
      • Hence, expressions can block not only on the values of infoset elements, but also on the ending bit position of the representation of infoset elements.
    • Circular deadlocks can occur if an OVC element needs the length of a later element, but the length of the later element depends (by way of this interior alignment issue), on the length of the OVC element.
      • Note: it is expected that formats are rare (but possible) where an OVC element itself is a variable-length element. Most commonly OVC elements have fixed lengths (in our experience), as they are most common in binary data formats where the length fields are also fixed-length binary integers. Formats have been described, however, where a length is expressed as a textual integer, which varies in size depending on the magnitude of the value, followed by a terminating delimiter. So variable-length OVC elements are possible, just uncommon.
  • Target length: Some elements have an explicit length which can be fixed, or given by a dfdl:length expression. When unparsing, this dfdl:length expression is evaluated to give a value known as the target length. This can differ from the value's implicit length in that the value may need to be padded to achieve the target length, or for xs:string only, the value may need to be truncated to fit within the target length.
    • TBD: For elements with explicit length, there is an element unused region at the end which may need to be filled (with dfdl:fillByte). For simple elements this would also be a difference between value and content length. For complex types. .......
    • There is commonly a circular dependency between an OVC element storing a length, and the element whose length it stores. Deadlock is avoided when unparsing because the value of the OVC element must depend only on the dfdl:valueLength (which excludes padding/filling), and so can be computed without reference to the target length of the element. The target length expression is then able to depend on the value of the OVC element and the circularity is avoided.
  • Expression Evaluation Modes: When unparsing, expressions can be evaluated in backward-only mode (just like parsing), or in forward-referencing mode where they can block waiting for updates to the infoset (adding children, closing/finishing the infoset element to indicate that no more children will be added, setting a value, setting nilled, determining length, etc.).
    • Expressions can reference variables, whose values are assigned by way of dfdl:setVariable or dfdl:newVariableInstance expressions. These also can (TBD: must?) be evaluated in forward-referencing mode.
      • (TBD: Must? ... because we don't know if they'll be referenced from backward-only expressions or forward-referencing expressions of an OVC element, or recursively another variable value expression where the variable was referenced from an OVC element expression.)
  • Queuing Suspensions
    • Note: a quick and dirty implementation, which actually defeats streaming behavior, is to queue all suspensions centrally until the primary unparser pass is over, then loop through the suspensions, retrying them until they all succeed.
    • Suspensions should be stored on the infoset elements they are blocked on. Infoset modifications (as values are added, or lengths become known, or children elements are added) should generate events, and those events should trigger retries of the suspensions.
  • Pruning the Infoset: True streaming behavior requires that the parts of the infoset that are no longer needed by expressions, and that have already been unparsed, are dropped so that their memory can be recovered.
    • Some formats by their nature defeat streaming. For example, if a format has a header which contains the length of the entire rest of the data, that header cannot be unparsed and emitted to the output stream until the length of the entire infoset can be computed; hence, at minimum, a buffer containing the entire unparsed representation has to exist temporarily to enable computing this length.
    • Other formats are easily stream-capable, for example, formats that use only the delimited length kind.
    • Formats with OVC elements are stream-capable within limits. Streaming is blocked for the span of the infoset and its representation, going from an OVC element to the infoset elements it forward references (and their representations). This much data must be buffered, but once those forward references can be resolved, the streaming can resume.
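
The alignment and interior-alignment points in the list above come down to simple arithmetic, sketched here. The sketch assumes 0-based bit positions and models each child term as a hypothetical (alignment in bits, length in bits) pair; `alignment_fill_bits` and `complex_length_bits` are illustrative names, not Daffodil functions.

```python
def alignment_fill_bits(start_bit, alignment_bits):
    """Fill bits needed to move start_bit up to the next alignment boundary."""
    return (-start_bit) % alignment_bits


def complex_length_bits(start_bit, children):
    """Length of a complex element whose children are (alignment, length) pairs.

    Interior alignment fill counts toward the element's length, so the
    result can vary with the element's starting bit position.
    """
    pos = start_bit
    for alignment_bits, length_bits in children:
        pos += alignment_fill_bits(pos, alignment_bits)
        pos += length_bits
    return pos - start_bit


# An 8-bit child followed by a 16-bit child that is aligned to 32 bits.
children = [(1, 8), (32, 16)]
print(complex_length_bits(0, children))    # 48: 8 + 24 fill + 16
print(complex_length_bits(16, children))   # 32: 8 + 8 fill + 16
```

The same element has two different lengths depending on where it starts, which is why a length expression may have to wait until the starting bit position is known.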

As one can see from the above, there are a number of algorithms that, taken together, implement the Daffodil unparser runtime. Note that the above are all about the runtime mechanism; this does not discuss the schema compilation algorithms needed to support these runtime behaviors.

Furthermore, all of the above mechanisms can be composed recursively. That is, an element whose length is needed for an OVC may be of complex type, and within that complex type may be other OVC elements, referring to yet later elements. All these must compose properly so that a DFDL format that contains an OVC can be combined into a larger format without constraint on how that larger format works.
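
A chained dependency of this kind, where one OVC depends on another OVC, can be illustrated with the simple strategy of retrying every suspended expression until all succeed. This is only a sketch under stated assumptions: the infoset is modeled as a plain dictionary, and `Blocked`, `get`, and `resolve` are hypothetical names rather than anything in Daffodil. It also shows how a circular deadlock can be detected when a full retry pass makes no progress.

```python
class Blocked(Exception):
    """Raised when an expression needs a value that is not yet known."""


def get(infoset, name):
    """Read a value, or signal that the caller must suspend."""
    if name not in infoset:
        raise Blocked(name)
    return infoset[name]


def resolve(infoset, ovc_exprs):
    """Retry suspended OVC expressions until all succeed or none can."""
    pending = dict(ovc_exprs)
    while pending:
        progressed = False
        for name, expr in list(pending.items()):
            try:
                infoset[name] = expr(infoset)
            except Blocked:
                continue                 # still suspended; retry next pass
            del pending[name]
            progressed = True
        if not progressed:
            raise RuntimeError(f"circular deadlock among: {sorted(pending)}")
    return infoset


# totalLen depends on bodyLen, which depends on the later payload element.
ovc_exprs = {
    "totalLen": lambda i: get(i, "bodyLen") + 4,
    "bodyLen": lambda i: len(get(i, "payload")),
}
result = resolve({"payload": "hello"}, ovc_exprs)
print(result["bodyLen"], result["totalLen"])   # 5 9
```

Retrying everything in a loop is easy to reason about, but it holds all suspensions until the end; attaching suspensions to the infoset elements they are blocked on avoids the repeated scans.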

Each of the above topics is deserving of a design note of its own. While DFDL parsing is also complex, the above ought to convince you that DFDL unparsing is very much more complex due to the interaction of streaming behavior with OVC calculations.

However, the subject of this page is the I/O (output really) layer's buffering system.

...