Introduction: Why Unparsing is Harder than Parsing

It is initially surprising to some software engineers that DFDL unparsing is more complicated than DFDL parsing. After all, parsing involves lookahead and backtracking. Unparsing seems like just serialization.

However, unparsing turns out to be considerably harder.

DFDL has a feature known as the dfdl:outputValueCalc (OVC for short) property. OVC is a tremendously powerful feature which enables a DFDL schema to truly capture inter-dependencies between elements, typically when one element stores the length of another element, but where the two elements are not adjacent in the representation. For example, a length field may appear in a header part of the data, while the part of the data whose length is given in that header is represented much later in the data stream.

The Daffodil unparser is designed to support streaming behavior - the infoset arrives as a series of infoset events generated by a Daffodil InfosetInputter. Daffodil attempts to stream output data to the output stream without waiting for the entire infoset to arrive. Ideally, once an infoset element's start and end events have arrived the element's representation could be written to the output stream. When an infoset element has been unparsed, then in principle it can be pruned from the infoset and its memory recovered. This enables a very large infoset to be unparsed using only a finite memory footprint that is much smaller than it would be if it held the entire infoset.
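To make the streaming idea concrete, here is a minimal sketch (in Python, with hypothetical names — this is not Daffodil's actual InfosetInputter API) of an event-driven unparse loop that emits each element's representation as soon as its events arrive and then prunes it from the tree:

```python
# Minimal sketch of a streaming unparse loop (hypothetical names, not the
# real Daffodil InfosetInputter API). Each element is unparsed as soon as
# its end event arrives, then pruned from the tree so memory stays bounded.

class InfosetElement:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

def unparse_stream(events, write):
    stack = []                                  # path from root to current element
    for kind, name, value in events:
        if kind == "start":
            elem = InfosetElement(name, stack[-1] if stack else None)
            if stack:
                stack[-1].children.append(elem)
            stack.append(elem)
        else:                                   # "end" event
            elem = stack.pop()
            if value is not None:
                write(elem.name, value)         # emit representation immediately
            if elem.parent is not None:
                elem.parent.children.remove(elem)   # prune: recover memory

out = []
unparse_stream(
    [("start", "r", None),
     ("start", "x", None), ("end", "x", 255),
     ("start", "b", None), ("end", "b", 7),
     ("end", "r", None)],
    lambda name, v: out.append((name, v)))
assert out == [("x", 255), ("b", 7)]
```

This works only when elements can be unparsed independently; the rest of this note is about why OVC breaks that assumption.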

TBD: A figure (or several) right here would be good illustrating incoming infoset events, incremental construction of the infoset tree, with simultaneous unparsing of these infoset elements to the output, and pruning of the unparsed elements.

However, the OVC feature complicates streaming behavior.

OVC elements typically don't appear in the stream of infoset events. So that the same infoset can be both parsed and unparsed, such events are tolerated, but ignored, if they do appear, and the values are recomputed. For purposes of this discussion we'll assume that no events corresponding to OVC elements appear in the infoset events.

The unparser determines that an OVC element must be computed and added to the infoset, typically when an infoset event is encountered that is known to occur after the OVC element. When this is detected, an infoset element is added to the infoset corresponding to the OVC element, but this element has no value, and so it cannot be unparsed.

The OVC element's value is computed from a DFDL expression. This expression is evaluated in a special unparser-specific mode where forward-reference into the infoset is expected. If the expression can be evaluated and that evaluation returns a value, then the value is placed in the infoset element for the OVC, and unparsing then proceeds as if the element had arrived as infoset events with a value.

It is when the OVC element's expression cannot be evaluated that things get interesting. Typically an OVC element's expression refers to something later than it in the infoset; hence, this first attempt to evaluate the expression will fail — the evaluation mode throws specific exception types indicating why the expression could not be evaluated. These exceptions are caught, informing the unparser that the expression could not be successfully evaluated. The unparser then creates a Suspension object. A suspension contains a copy of the unparser's state and the expression. The suspension is queued for later evaluation when the unparser determines it should be retried. This could be as late as when the entire infoset has been created (by then every suspension must be able to be evaluated), or the suspension could be queued on the first infoset element where the expression was unable to be evaluated; in that case, once that infoset element is updated, the suspension can be immediately retried.
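The suspension mechanism can be sketched as follows (hypothetical class and exception names; Daffodil's real Suspension is more elaborate):

```python
# Sketch of the suspension mechanism (hypothetical names; Daffodil's real
# Suspension class is more elaborate). An OVC expression is tried once; if
# it forward-references data not yet in the infoset, the unparser state and
# the expression are captured in a Suspension, queued for a later retry.

class ExpressionBlocked(Exception):
    """Raised when an expression needs infoset data that isn't there yet."""

class Suspension:
    def __init__(self, expr, state):
        self.expr = expr          # the deferred dfdl:outputValueCalc expression
        self.state = state        # snapshot of the unparser state

    def retry(self, infoset):
        return self.expr(infoset)     # may raise ExpressionBlocked again

def evaluate_ovc(expr, infoset, state, pending):
    try:
        return expr(infoset)
    except ExpressionBlocked:
        pending.append(Suspension(expr, state))   # queue for later retry
        return None

def value_length_of_d(infoset):
    if "d" not in infoset:
        raise ExpressionBlocked("forward reference to 'd'")
    return len(infoset["d"])      # stand-in for a dfdl:valueLength computation

pending, infoset = [], {}
assert evaluate_ovc(value_length_of_d, infoset, None, pending) is None
infoset["d"] = "abc"              # element 'd' arrives later
assert pending[0].retry(infoset) == 3
```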

For purposes of this discussion we'll not worry about exactly when the suspensions are re-evaluated.

A key concept in this suspending of expressions for OVC elements, is that unparsing must continue. It is not sufficient to just consume incoming infoset events until the infoset contains all the elements needed by the expression, because the expression can actually contain things like this:

dfdl:outputValueCalc="{ dfdl:valueLength(../later/elem, 'bits') }"

This function, dfdl:valueLength, measures the length of the representation of the '../later/elem' element. Hence, that element of the infoset not only has to exist; we must also unparse it into a buffer and measure the length of its value's representation.

Ultimately, this means that after an OVC element is suspended, we must continue unparsing into buffers, and keep track, on each infoset element, of the start and end positions of the content and value regions of the representation. (The content length is greater than or equal to the value length. In the representation, there are 2 start positions and 2 end positions for each element: content start, value start, value end, and content end. Content length is greater than value length when padding is inserted.)
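A sketch of the bookkeeping this implies (hypothetical structure, positions measured in bits):

```python
# Sketch of per-element position bookkeeping (hypothetical structure). Four
# bit positions are captured: content start, value start, value end, content
# end. Content length >= value length; the difference is padding/filling.

class ElementPositions:
    def __init__(self):
        self.content_start = None
        self.value_start = None
        self.value_end = None
        self.content_end = None

    def value_length(self):       # what dfdl:valueLength would report (in bits)
        return self.value_end - self.value_start

    def content_length(self):     # what dfdl:contentLength would report (in bits)
        return self.content_end - self.content_start

p = ElementPositions()
p.content_start = 8               # 3 bits of leading padding precede the value
p.value_start = 11
p.value_end = 35
p.content_end = 35
assert p.value_length() == 24
assert p.content_length() == 27   # larger, because of the padding
```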

One simple optimization Daffodil uses is to keep track of the content and value start/end positions only for elements that actually appear in expressions.

(Idea for future: for debugging, it may be useful to compute these for every element, so as to be able to show a user exactly where the representation of every element is, and this applies to both parsing and unparsing.)

To enable the unparser to continue unparsing after an OVC element is suspended, the DataOutputStream implementation supports what we call 'splitting' the stream into an original part, and an added buffering data output stream, into which the unparsing can continue. Once the OVC element value is computed, then the unparsing of the OVC element can proceed, using the original DataOutputStream. Once that unparsing of the OVC element is complete, the original DataOutputStream is finished, and everything, including its ending bit position, is known, so that it can be recombined with the buffering data output stream that was split off of it.
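The split/collapse idea can be sketched abstractly (hypothetical classes; the real implementation is the DirectOrBufferedDataOutputStream discussed below, which also handles frag bytes, bit order, and morphing):

```python
# Sketch of splitting a data output stream (hypothetical classes). Unparsing
# continues into a buffered stream while the original stream awaits the
# suspended OVC element's value; setting the original final collapses the
# chain, appending each buffered stream's content in order.

class OutStream:
    def __init__(self, direct=False):
        self.direct = direct
        self.bytes = bytearray()
        self.next = None        # the buffered stream split off after this one

    def split(self):
        self.next = OutStream(direct=False)
        return self.next

    def set_final_and_collapse(self):
        # once this stream's content is complete, absorb what follows it
        s = self
        while s.next is not None:
            s.bytes.extend(s.next.bytes)
            s.next = s.next.next
        return bytes(s.bytes)

direct = OutStream(direct=True)
direct.bytes.append(0xFF)            # unparsed before the suspension
buffered = direct.split()            # OVC element suspends here
buffered.bytes.append(0xAB)          # unparsing continues into the buffer
direct.bytes.append(0x1C)            # suspension resumes, fills in the OVC value
assert direct.set_final_and_collapse() == bytes([0xFF, 0x1C, 0xAB])
```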

A few complicating factors

  • The length of the OVC element may not be known; hence, the starting bit position of this added buffering DataOutputStream may not be known until unparsing of the OVC element's suspension has completed.
  • Alignment: Because elements and model groups (terms generally) can have alignment, and text anywhere can have mandatory text alignment, then in the case where we do not know the starting bit position, we are not able to compute the size of the alignment fill region needed.
    • This implies that non-zero alignment requires a split of the data output stream of its own - in the case where the starting bit position is not known.
  • Bit order: Elements can have dfdl:bitOrder, and model groups can have text (e.g., dfdl:initiator), and text implies a bit order as charset encodings each have a specified bit order. It is not meaningful for the bit order to change except on a byte boundary (8 bit boundary). So, if the starting bit position of a buffering data output stream is not known, then the unparser cannot determine whether a bit order change is legal or not until that starting bit position has been determined.
    • This implies that bit order changes require a split of the data output stream of their own - in the case where the starting bit position is not known.
  • Interior Alignment affecting Length: The length of an element of complex type may depend on its starting bit position in data output stream. The element's initial alignment is not part of its length, but this dependency happens because terms (elements or model groups) may have alignment (or mandatory text alignment) on terms they contain (aka "interior" terms). These alignment regions may be of varying size depending on where the term starts in the data output stream; hence, the length of a complex type may not be able to be computed until its starting position is known, and recursively the starting positions of any interior terms inside it are known.
    • This implies that expressions that compute the dfdl:contentLength or dfdl:valueLength of an element must potentially suspend until the starting bit positions become known so that the length of the alignment regions can be computed.
      • Hence, expressions can block, not only on values of infoset elements, but the ending bit position of the representation of infoset elements.
    • Circular deadlocks can occur if an OVC element needs the length of a later element, but the length of the later element depends (by way of this interior alignment issue), on the length of the OVC element.
      • Note: it is expected that formats are rare (but possible) where an OVC element is itself a variable-length element. Most commonly OVC elements have fixed lengths (in our experience), as they are most common in binary data formats where the length fields are also fixed-length binary integers. Formats have been described, however, where a length is expressed as a textual integer, which varies in size depending on the magnitude of the value, followed by a terminating delimiter. So variable-length OVC elements are possible, just uncommon.
  • Target length: Some elements have an explicit length which can be fixed, or given by a dfdl:length expression. When unparsing, this dfdl:length expression is evaluated to give a value known as the target length. This can differ from the value's implicit length in that the value may need to be padded to achieve the target length, or for xs:string only, the value may need to be truncated to fit within the target length.
    • TBD: For elements with explicit length, there is an element unused region at the end which may need to be filled (with dfdl:fillByte). For simple elements this would also be a difference between value and content length. For complex types. .......
    • There is commonly a circular dependency between an OVC element storing a length, and the element whose length it stores. Deadlock is avoided when unparsing because the value of the OVC element must depend only on the dfdl:valueLength (which excludes padding/filling), and so can be computed without reference to the target length of the element. The target length expression is then able to depend on the value of the OVC element and the circularity is avoided.
  • Expression Evaluation Modes: When unparsing, expressions can be evaluated in backward-only mode (just like parsing), or in forward-referencing mode where they can block waiting for updates to the infoset. (Adding children, closing/finishing the infoset element - indicating no more children to be added, setting a value, setting nilled, determining length, etc.)
    • Expressions can reference variables, whose values are assigned by way of dfdl:setVariable or dfdl:newVariableInstance expressions. These also can (TBD: must?) be evaluated in forward-referencing mode.
      • (TBD: Must? ... because we don't know if they'll be referenced from backward-only expressions or forward-referencing expressions of an OVC element, or recursively another variable value expression where the variable was referenced from an OVC element expression.)
  • Queuing Suspensions
    • Note: a quick and dirty implementation which actually defeats streaming behavior, is to just queue all suspensions centrally until the primary unparser pass is over. Then just loop through the suspensions retrying them until they all succeed.
    • Suspensions should be stored on the infoset elements they are blocked on. Infoset modifications (as values are added, or lengths become known, or children elements are added) should generate events, and those events should trigger retries of the suspensions.
  • Pruning the Infoset: True streaming behavior requires that the parts of the infoset that are no longer needed by expressions, and that have already been unparsed, are dropped so that their memory can be recovered.
    • Some formats by their nature defeat streaming. For example, a format which has a header which contains the length of the entire rest of the data, such header cannot be unparsed and emitted to the output stream until the length of the entire infoset can be computed; hence, at minimum a buffer containing the entire unparsed representation has to exist temporarily to enable computing this length.
    • Other formats are easily stream-capable — formats that use delimited length kind only, for example.
    • Formats with OVC elements are stream-capable within limits. Streaming is blocked for the span of the infoset and its representation, going from an OVC element to the infoset elements it forward references (and their representations). This much data must be buffered, but once those forward references can be resolved, the streaming can resume.
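As a concrete illustration of the "quick and dirty" central queue mentioned above, here is a sketch (hypothetical helper names; a KeyError stands in for a blocked forward reference) of retrying all suspensions until they succeed, with detection of circular deadlock:

```python
# Sketch of the "quick and dirty" central suspension queue: after the main
# unparse pass, retry every suspension until all succeed. If a full pass
# makes no progress, the suspensions are circularly deadlocked.

def drain_suspensions(suspensions, infoset):
    pending = list(suspensions)           # (name, expression) pairs
    results = {}
    while pending:
        still_blocked = []
        for name, expr in pending:
            try:
                results[name] = expr(infoset)
            except KeyError:              # expression still blocked
                still_blocked.append((name, expr))
        if len(still_blocked) == len(pending):
            raise RuntimeError("circular deadlock among suspensions")
        pending = still_blocked
    return results

infoset = {"d": "abcd", "total": 12}
susps = [("a", lambda i: len(i["d"])),    # like a valueLength computation
         ("hdr", lambda i: i["total"])]
assert drain_suspensions(susps, infoset) == {"a": 4, "hdr": 12}
```

Note that this defeats streaming (nothing is emitted until the main pass ends), which is why the real design queues suspensions on the infoset elements they block on.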

As one can see from the above, there are a number of algorithms that taken together implement the Daffodil unparser runtime. Note that the above are all about the runtime mechanism. This doesn't really discuss the schema compilation algorithms needed to support these runtime behaviors.

Furthermore, all of the above mechanisms can be composed recursively - that is, an element whose length is needed for an OVC, that element may be of complex type, and within that complex type may be other OVC elements, referring to yet later elements. All these must compose properly so that a DFDL format that contains an OVC can be combined together into a larger format without constraint on how that larger format works.

Each of the above topics is deserving of a design note of its own. While DFDL parsing is also complex, the above ought to convince you that DFDL unparsing is very much more complex due to the interaction of streaming behavior with OVC calculations.

However, the subject of this page is the I/O (output really) layer's buffering system.

Direct and Buffered Data Output Streams

This note describes the DataOutputStream buffering, that is, the mechanisms implemented by the DirectOrBufferedDataOutputStream class.

It is perhaps best to show how this works by way of an example.

Consider unparsing of data described by this schema:

<dfdl:format lengthKind="explicit"
   byteOrder="littleEndian"
   bitOrder="mostSignificantBitFirst"
   lengthUnits="bits"
   alignment="1" alignmentUnits="bits"/>

<element name="r" dfdl:lengthKind="implicit">
  <complexType>
    <sequence>
      <element name="x" type="xs:int" dfdl:length="8"/>
      <element name="a" type="xs:int" dfdl:length="5"
               dfdl:outputValueCalc="{ dfdl:valueLength(../d, 'bits') }"/>
      <element name="b" type="xs:int" dfdl:length="3"/>
      <element name="c" type="xs:int" dfdl:length="4" dfdl:bitOrder="leastSignificantBitFirst"/>
      <element name="d" type="xs:string" dfdl:lengthKind="pattern"
          ... string in a LSBF encoding like the 7-bit ascii packed one ..../>
      ...
    </sequence>
  </complexType>
</element>

The important feature above is that at element 'c', the bit order changes from MSBF to LSBF. And this happens immediately after element 'a', which has dfdl:outputValueCalc and must suspend until the infoset element 'd' has been unparsed so that we can get its value length.

This should be fine, because 'x' is 8 bits wide, 'a' is 5 bits wide, and 'b' is 3 bits wide, so we're on a byte boundary logically when this bit order change occurs.

Now let's consider this infoset expressed as XML:

<r>
  <x>255</x> <!-- hex FF binary 1111 1111-->
  <a>22</a> <!-- binary 10110 -->
  <b>7</b> <!-- binary 111 -->
  <c>6</c> <!-- binary 0110 -->
  <d>abc</d>
  ...
</r>

Now we are going to try to animate what happens as we unparse this based on the arrival of the infoset events to the unparser.

Our start state has a DataOutputStream which is 'direct', meaning attached to an actual Java JVM output stream. Because DFDL and Daffodil are very bit-oriented, we implement bit-level I/O on top by accompanying this output stream with a "frag byte": a single byte which contains from 0 to 7 bits which are the fragment of a byte that is not yet complete and ready to be written to the actual output stream. The frag byte never contains all 8 bits (except transiently), because when it has all 8 bits we write it to the byte stream and clear it.
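A minimal sketch of this frag-byte technique (hypothetical class, MSBF packing only; the real implementation also handles LSBF packing and byte order):

```python
# Sketch of bit-level output using a "frag byte" (hypothetical class, MSBF
# packing only). Bits accumulate from the most significant end of a single
# fragment byte; the moment it holds 8 bits it is flushed to the whole-bytes
# stream, so it never stays full.

class FragByteWriter:
    def __init__(self):
        self.whole = bytearray()   # completed whole bytes
        self.frag = 0              # partial byte, bits packed from the MSB end
        self.nfrag = 0             # number of bits currently in the frag byte

    def put_bits(self, value, nbits):
        for i in range(nbits - 1, -1, -1):        # MSB of the value first
            bit = (value >> i) & 1
            self.frag |= bit << (7 - self.nfrag)  # next free MSBF slot
            self.nfrag += 1
            if self.nfrag == 8:                   # frag byte full: flush it
                self.whole.append(self.frag)
                self.frag, self.nfrag = 0, 0

w = FragByteWriter()
w.put_bits(0xFF, 8)        # 8 bits: 1111 1111
w.put_bits(0b10110, 5)     # 5 bits: 10110
w.put_bits(0b111, 3)       # 3 bits: 111 (completes the second byte)
assert bytes(w.whole) == bytes([0xFF, 0xB7]) and w.nfrag == 0
```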

We'll depict a data output stream like so. This shows a data output stream where 4 whole bytes + 2 additional bits have been written.

On the left we have the JVM stream (or a buffer) holding whole bytes which we'll write in hex. On the right we have the frag byte which will eventually, once filled up, flow into the whole bytes part, at which point the frag byte will be reset. At the bottom of the frag byte we have the current bit order (shown as MSBF), and the number of bits in the fragment (shown as 2). The data in the frag byte illustrates the bits that are significant, with X for the bits as yet unoccupied by unparsed data.

So let's look at unparsing the infoset we see in the XML above.

We start from an empty data output stream. We will get a Start-Element 'r' event. No I/O occurs as this root element has nothing about it in the representation. So per below, the data output stream remains empty as here. Note that the bit position (1 based) is 1, but nothing has been output at that position. That is the position where output will begin.

We then get a Start-Element 'x' event with the value 255.

This is a whole byte, and the stream is currently byte aligned (because it is empty), so this data is output to the whole-bytes part of the data output stream:

This leaves us at bit position 9; bits 1 through 8 now hold the representation of 'x'.

Next to unparse is element 'a'. However, this has dfdl:outputValueCalc, which depends on the infoset event for element 'd', which we don't have yet. So now the unparser works its magic, creates a suspension which it queues for later evaluation, and it splits off a buffered data output stream. So below, the stream on the left is the direct stream, the buffered stream is on the right.

Because we know the length (in bits) of element 'a' to be 5 bits, we know the starting position of the buffered stream. Logically, the first bit in the buffered stream is at bit 14.

Now we suspend the computation of element 'a' for later, but proceed to unparse element 'b', into the buffered stream. This results in:

We see that the 3 bits representing the value 7, 111, are in the frag byte of the buffered stream. The buffered stream now contains 3 bits. That means it ends at bit 16, with position at bit 17.

Now we get the start event for 'c'. However, element 'c' uses the LSBF bit order.

This should be fine, because the logical position in the bit stream is 17, which is 1 mod 8, i.e., byte aligned.

However, there is a subtlety here. It's not at a byte boundary in the buffered stream because that didn't start on a byte boundary. Rather, every byte in the buffered stream is off by 3 bits. We're at bit 4 in MSBF order in the frag byte.

Let's look at the data in bits, using A for the bits not yet populated from element 'a':

1111 1111 | AAAA A111

That's the whole byte for element 'x', followed by the 5 bits for 'a' (unknown as yet), and the 3 bits for 'b'. Let's add the next byte to this sequence, using the letter 'C' to show which bits should be occupied by the value of element 'c':

1111 1111 | AAAA A111 | XXXX CCCC

Because element 'c' is LSBF, and we are starting a new byte, the bits for 'c' go in the least-significant bits of the third byte. The value is 6, which is 0110. So the bits should be:

1111 1111 | AAAA A111 | XXXX 0110

So, now considering our buffered data output stream above, we want to write out the bits of element 'c', but we can't yet flush the frag byte because we don't have all its bits yet. The byte boundary in the data falls after the three 111 bits in the frag byte. The frag byte thus has 5 bit slots still available, and those slots correspond to the most significant 5 bits of the third logical byte; but element 'c', being LSBF, occupies the least significant 4 bits of that byte, so the data we have populates only 1 of the 5 available slots.

The solution to this problem is to split the data output stream again, on every bit order change.

  • Any time bit-order changes, split off yet another buffered data output stream.
  • Record the bit order of every buffered output stream.
  • When eventually collapsing the data output streams together, check that we end up changing bit orders on proper byte boundaries or issue a runtime SDE.
  • A detail is to save information on each buffered data output stream for diagnostic purposes in issuing this error. E.g., the Term's Runtime Data object.

To illustrate this, let's go back to where we have just unparsed element 'b' and we're about to unparse element 'c'.

If we recognize that element 'c' has a different bit order, and we split off another buffered stream we get:

Now, the middle stream is marked FINAL, because we are splitting due to bit order change, not due to some additional suspended material to be added on later. The right-most stream contains the representation of 'c', on the expectation that this stream begins on a byte boundary (this will be verified when the streams are later collapsed).

Now we unparse element 'd'. This is a string, and without getting into the details, we're going to say that it contains 28 bits: the characters 'abc' then a NUL. These characters are 7 bits wide each, and they are stored LSBF. In hex, 'a' is 0x61, 'b' is 0x62, 'c' is 0x63, and NUL is 0x00. First we unparse the 'a', getting the 7 bits 1100001. The frag byte has room for 4 more bits, and LSBF those come from the least significant 4 bits of this character, so 0001. So the data output stream gets 0001 added to its frag byte, making a full byte of 00010110, or hex 0x16. This is moved to the whole bytes, and the remaining 3 bits of the character, 110, are placed in the frag byte. So our streams now look like this:

By the same technique we output the next 3 characters, 'b', 'c', and the NUL (7 bits each). The final NUL exactly completes the last whole byte, and we get:
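This least-significant-bits-first filling can be sketched in code (a minimal illustration of the convention described above, using a hypothetical helper, not Daffodil's encoder; note that the 4 bits of 'c' plus 4 characters of 7 bits each total exactly 32 bits, i.e., 4 whole bytes):

```python
# Sketch of least-significant-bit-first packing (hypothetical helper): each
# value's low-order bits fill the lowest unoccupied bits of the current byte.

def pack_lsbf(fields):
    """fields: list of (value, nbits). Returns (whole_bytes, frag, nfrag)."""
    whole = bytearray()
    frag, nfrag = 0, 0
    for value, nbits in fields:
        for i in range(nbits):                 # LSB of the value first
            frag |= ((value >> i) & 1) << nfrag
            nfrag += 1
            if nfrag == 8:                     # byte complete: flush it
                whole.append(frag)
                frag, nfrag = 0, 0
    return whole, frag, nfrag

# element 'c' (4 bits, value 6) followed by 'a', 'b', 'c', NUL (7 bits each)
whole, frag, nfrag = pack_lsbf(
    [(6, 4), (0x61, 7), (0x62, 7), (0x63, 7), (0x00, 7)])
assert nfrag == 0                     # 4 + 4*7 = 32 bits: exactly 4 whole bytes
assert whole[0] == 0x16               # 'a' low nibble 0001 above c's 0110
```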

At this point we have unparsed element 'd' (without showing exactly how), but basically, when we receive the event for element 'd', we construct an infoset element node. Since element 'd' is used in expressions, we know we must capture four things: content start position, value start position, value end position, and content end position, and as we unparse the value of 'd' into the output stream (third one, right-most above), we record these positions.

The expression for computing element 'a' value depends on knowing the value-length of element 'd', so once we have the value end position, we can subtract the value start position and we know the value length. This allows the suspended computation of the value for element 'a', and the unparsing thereof, to be resumed.

So, that expression is evaluated, creates the value 28, and that value is stored in the infoset element for the instance of 'a'. Then the unparsing of that infoset element is done, and stored with the suspended computation/unparser is the data output stream to use, which is the left-most one above. Now 28 expressed as binary is 11100 (5 bits), so we write these bits to the data output stream (on the left below).

The last thing the suspended computation and unparser for 'a' does, is "setFinal" on the data output stream. This tells the data output stream that no further data will be appended. What follows its contents is then logically the next data output stream (middle one above). The "setFinal" operation initiates a process called "collapsing" the data output stream buffers.

The data output streams are connected into a double-linked chain:

We've also emphasized that this first data output stream on the left is special: it is the "direct" output stream in that it is actually connected to a java.io.OutputStream, not just to a buffer. Bytes written to the "whole bytes" of the direct output stream actually are being written out by Daffodil. So far, only a single byte "FF" has been output to the Java output stream.

So when a buffered data output stream is setFinal, it is simply marked as such, and this will be checked later. But when the direct data output stream is marked final, then it is combined with the next data output stream. If that stream is final, then it will be combined with the next after that, and so on, until either all the streams have been collapsed and all data written out, or a data stream that is not final is encountered.

So looking at the collapsing of the direct and first buffered data output stream, first the data of the buffered stream is written to the direct data output stream. Since the direct data output stream has a frag byte, that byte must be completed first. In this example, the buffered data output stream (middle above) doesn't have any whole bytes, but it does have a frag byte, and so bits from this frag byte are moved into the direct data output stream's frag byte.

In this example, the direct data output stream's frag byte has 5 bits, and the buffered data output stream contributes 3 more, making a whole byte:

Notice now that the middle stream has no contents any more. Its bit length is now 0. Since the frag byte of the direct stream is full, it is output to the direct stream's whole bytes, and the direct stream's frag byte is cleared.
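The frag-byte completion in this collapse step can be sketched numerically (hypothetical helper; the values are those of the example — the 5 bits 11100 of element 'a' in the direct stream's frag byte, and the 3 bits 111 of element 'b' contributed by the buffered stream):

```python
# Sketch of collapsing: the buffered stream's leading bits complete the
# direct stream's frag byte. Both frag bytes pack their bits from the MSB
# end of the byte (MSBF).

def merge_frag(direct_frag, direct_n, incoming_bits, incoming_n):
    take = 8 - direct_n                  # bits needed to complete the byte
    assert incoming_n >= take            # enough incoming bits in this example
    # shift the incoming MSBF bits down so they land just after direct's bits
    return direct_frag | (incoming_bits >> direct_n)

direct_frag = 0b11100 << 3               # 'a' = 28: 5 bits at the MSB end
incoming = 0b111 << 5                    # 'b' = 7: 3 bits at the MSB end
assert merge_frag(direct_frag, 5, incoming, 3) == 0b11100111   # 0xE7
```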

Next the direct output stream and the buffered data output stream (left and middle) are "Morphed". The middle buffered data output stream will morph into being a direct output stream by taking up the actual java.io.OutputStream of the left-most stream. The formerly direct data output stream on the left is now "dead", and can be discarded (garbage collected, or returned to a pool).

Now, the middle data output stream (which has morphed into being the direct one) was FINAL before, and remains FINAL. That means we continue to look to the right and combine the next buffered data output stream into the direct one. In this case, the right-most stream has 4 whole bytes (the representations of 'c' and 'd'), and the direct stream has no frag byte, so those whole bytes are just output to the direct stream's whole bytes.

Next, any frag byte contents from the right-most data stream would be output to the direct stream. In this example the right-most stream's frag byte is empty (the final NUL completed its last whole byte), so there is nothing to move; in general, since the frag byte of the direct stream is empty at this point, such bits would just move over to become the direct stream's frag byte.

Note that in the above, we're now collapsing the data output stream created by unparsing element 'c', and that element had bit order LSBF. When we copied the whole bytes, since the direct stream had no frag byte, the bit-order change was legal. Had there been a frag byte on the direct stream, then changing bit orders would have had to create a Runtime Schema Definition Error, as it is not meaningful to change bit order in the middle of a byte.

Now the last step is that the direct stream and buffered stream will again morph:

At this point, all the data output streams have been collapsed, and all subsequent unparsing will be going to this last remaining live data output stream. The collapsing ends the need for the data output streams to first buffer all output. So the overhead introduced by all the above machinery is no longer encountered, unless the need again arises due to another element with dfdl:outputValueCalc, another change of bit order, or the need for alignment fill regions or final unused regions of unknown length.

This algorithm can achieve streaming behavior when unparsing. Note that the delayed unparsing of element 'a' lasted until element 'd' was received and unparsed (into a buffered data output stream). At that point the unparser outputs the data for elements 'a', 'b', 'c', and 'd'. The output of data has caught up to the streaming-in of infoset events.

This discussion did not cover several other important aspects of the direct and buffered data output streams and related algorithms:

  • suspended unparsers - for alignment fill, unused regions, and for expressions and dfdl:outputValueCalc.
    • TBD: expressions involving variables yet to be set. 
  • queueing of suspensions on the infoset, and infoset event detection including open/final infoset nodes.
  • capture and propagation of start/end of content and value
    • propagation of start positions, lengths of data output streams, splitting off of buffered data output streams with unknown start positions.

Those will be covered in other pages.
