Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • The length of the OVC element may not be known; hence, the starting bit position of this added buffering DataOutputStream may not be known until unparsing of the OVC element's suspension has completed.
  • Alignment: Because elements and model groups (terms generally) can have alignment, and text anywhere can have mandatory text alignment, then in the case where we do not know the starting bit position, we are not able to compute the size of the alignment fill region needed.
    • This implies that non-zero alignment requires a split of the data output stream of its own - in the case where the starting bit position is not known.
  • Bit order: Elements can have dfdl:bitOrder, and model groups can have text (e.g., dfdl:initiator), and text implies a bit order as charset encodings each have a specified bit order. It is not meaningful for the bit order to change except on a byte boundary (8 bit boundary). So, if the starting bit position of a buffering data output stream is not known, then the unparser cannot determine whether a bit order change is legal or not until that starting bit position has been determined.
    • This implies that bit order changes require a split of the data output stream of their own - in the case where the starting bit position is not known.
  • Interior Alignment affecting Length: The length of an element of complex type may depend on its starting bit position in data output stream. This The element's initial alignment is not part of its length, but this dependency happens because terms (elements or model groups) may have alignment (or mandatory text alignment) on terms they contain (aka "interior" terms). These alignment regions may be of varying size depending on where the term starts in the data output stream; hence, the length of a complex type may not be able to be computed until its starting position is known, and recursively the starting positions of any elements interior terms inside it are known.
    • This implies that expressions that compute the dfdl:contentLength or dfdl:valueLength of an element must potentially suspend until the starting bit positions become known so that the length of the alignment regions can be computed.
      • Hence, expressions can block, not only on values of infoset elements, but the ending bit position of the representation of infoset elements.
    • Circular deadlocks can occur if an OVC element needs the length of a later element, but the length of the later element depends (by way of this interior alignment issue), on the length of the OVC element.
      • Note: it is expected that formats are rare (but possible) where an OVC element itself is a variable-length element.
    Target length: Some
      • Most commonly OVC elements have fixed lengths (in our experience), as they are most common in binary data formats where the length fields are also fixed-length binary integers. Formats have been described; however, where a length is expressed in a textual integer, which varies in size depending on the magnitude of the value, followed by a terminating delimiter. So variable-length OVC elements are possible. Just uncommon.
  • Target length: Some elements have an explicit length which can be fixed, or given by a dfdl:length expression. When unparsing, this dfdl:length expression is evaluated to give a value known as the target length. This can differ from the value's implicit length an explicit length which can be fixed, or given by a dfdl:length expression. When unparsing, this dfdl:length expression is evaluated to give a value known as the target length. This can differ from the value's implicit length in that the value may need to be padded to achieve the target length, or for xs:string only, the value may need to be truncated to fit within the target length.
    • TBD: For elements with explicit length, there is an element unused region at the end which may need to be filled (with dfdl:fillByte). For simple elements this would also be a difference between value and content length. For complex types. .......
    • There is commonly a circular dependency between an OVC element storing a length, and the element whose length it stores. Deadlock is avoided when unparsing because the value of the OVC element must depend only on the dfdl:valueLength (which excludes padding/filling), and so can be computed without reference to the target length of the element. The target length expression is then able to depend on the value of the OVC element and the circularity is avoided.

This example recreates what the problem is.

Each of the above topics is deserving of a design note of its own.

This note describes the DataOutputStream buffering, that is, the mechanisms implemented by the DirectOrBufferedDataOutputStream class.

It is perhaps best to show how this works by way of an example.

Consider unparsing of Consider unparsing data described by this schema:

Code Block
<dfdl:format lengthKind="explicit"
   dfdl:byteOrder="littleEndian" 
   dfdl:bitOrder="mostSignificantBitFirst" 
   dfdl:lengthUnits="bits"
   dfdl:alignment="1" dfdl:alignmentUnits="bits"/>

<element name="r" dfdl:lengthKind="implicit">
  <complexType>
    <element name="x" type="xs:int" dfdl:length="8"
    <element name="a" type="xs:int" dfdl:outputValueCalc="{ dfdl:valueLength(../d) }"
                      dfdl:length="5"/>
    <element name="b" type="xs:int" dfdl:length="3"/>
    <element name="c" type="xs:int" dfdl:length="4" dfdl:bitOrder="leastSignificantBitFirst"/>
    <element name="d" type="xs:string" dfdl:lengthKind="pattern" 
        ... string in a LSBF encoding like the 7-bit ascii packed one ..../>
   ...
 </complexType>
</element>

The important feature above is that at element 'c', the bit order changes from MSBF to LSBF. And this happens immediately after element 'a' which has dfdl:outputValueCalc, that must wait suspend until we have the infoset element 'd' to compute its valueunparsed so that we can get is value length.

This should be fine, because 'x' is 8 bits wide,  'a' is 5 bits wide, and 'b' is 3 bits wide, so we're on a byte boundary logically when this bit order change occurs.

Now let's consider this infoset expressed as XML:

Code Block
<r>
  <x>255</x> <!-- hex FF binary 1111 1111-->
  <a>22</a> <!-- binary 10110 -->
  <b>7</b> <!-- binary 111 -->
  <c>6</c> <!-- binary 0101 -->
  <d>abc</d>
  ...
</r>

...

Next to unparse is element 'a'. However, this has dfdl:outputValueCalc, which depends on the infoset event for element 'd', which we don't have yet. So now the unparser works its magic, and splits off a buffered creates a suspension which it queues for later evaluation, and it splits off a buffered data input stream. So below, the stream on the left is the direct stream, the buffered stream is on the right.

...

We see that the 3 bits representing the value 67, 111, are in the frag byte of the buffered stream. The buffered stream now contains 3 bits. That means it ends at bit 16, with position at bit 17.

...

This should be fine, because the logical position in the bit stream is 17, which is 1 mod 8, i.e., byte aligned.

However, this is the subtle bugthere is a subtlety here. It's not at a byte boundary in the buffered stream because that didn't start on a byte boundary. Rather, every byte in the buffered stream is off by 3 bits. We're at bit 4 in MSBF order in the frag byte.

...

Because element 'c' is LSBF, and we are starting a new byte, the bits for 'c' go in the least-significant bits of the third byte. The value is 6, which is 0101. So the bits should be:

1111 1111 | AAAA A111 | XXXX 0101

So, now considering our buffered data output stream above, we want to write out the bits of element 'c', but we can't yet flush the frag byte because we don't have all its bits yet. The byte boundary in the data is after the three 111 bits in the frag byte. So we have 5 bits of the next byte available to write in the frag byte, but we only have the data as yet to populate 1 of them.

So we actually need a second frag byte, or we have no place to put all the bits of 'c' in the data output stream.

But ultimately our algorithm and frag byte system, have broken down here. As the bits in frag byte 1 aren't contiguous any more. It contains 4 live bits and 4 unused, but the used bits are not adjacent. The second frag byte is supposed to be LSBF, but its shown here MSBF so we can see how the bits are assembled so they come out adjacent when combined with the bits in frag byte 1.

...

starting a new byte, the bits for 'c' go in the least-significant bits of the third byte. The value is 6, which is 0101. So the bits should be:

1111 1111 | AAAA A111 | XXXX 0101

So, now considering our buffered data output stream above, we want to write out the bits of element 'c', but we can't yet flush the frag byte because we don't have all its bits yet. The byte boundary in the data is after the three 111 bits in the frag byte. So we have 5 bits of the next byte available to write in the frag byte, but we only have the data as yet to populate 1 of them

In addition, the next 4 bits to be output, should go into the unused bits of frag byte 1, not frag byte 2. Only once those 4 bits are output, at which point we can flush frag byte 1 to the whole bytes buffer, do we start writing to frag byte 2, which can become frag byte 1, and we then drop frag byte 2.

At this point I think it should be apparent that the algorithm is a confusing mess. The fact that frag bytes don't represent byte-aligned bytes of data is the assumption we had that ultimately is not being maintained. The design in place here is not up to the job of a buffered data output stream starting not on a byte boundary.

The solution to this problem is to split the data output stream again, on every bit order change.

...

At this point we have unparsed element 'd' , the details of that are elsewhere(without showing exactly how), but basically, when we receive the event for element 'd', we construct an infoset element node. Since element 'd' is used in expressions, we know we must capture four things: contentStartPosition, valueStartPosition, valueEndPosition, and contentEndPosition (tbd; names aren't right)content start position, value start position, value end position, and content end position, and as we unparse the value of 'd' into the output stream (third one, right-most above), we record these positions.

The expression for computing element 'a' value depends on knowing the contentvalue-length of element 'd', so once we have the contentEndPositionvalue end position, we can subtract the contentStartPosition value start position and we know the content value length. This allows the suspended computation of the value for element 'a', and the unparsing thereof, to be resumed.

So, that expression is evaluated, creates the value 28, and that value is stored in the infoset element for the instance of 'a'. Then the unparsing of that infoset element is done, and stored with the suspended computation/unparser is the data output stream to use, which is the left-most one above. Now 28 expressed as binary is 11100 (5 bits), so we write these bits to the data output stream (on the left below).

draw.io Diagram
bordertrue
viewerToolbartrue
fitWindowfalse
diagramNameCopy of Copy of Copy of Copy of Copy of buffered50
simpleViewerfalse
width
diagramWidth893
revision3

...

We've also emphasized that this first data output stream on the left is  special. it is the "direct" output stream in that it is actually connected to a java.io.OutputStream, not just to a buffer. Bytes written to the "whole bytes" of the direct output stream actually are being written out by Daffodil. So far, only a single byte "FF" has been output to the java output.

So when a buffered data output stream is setFinal, it is simply marked as such, and this will be checked on later. But when the direct data output stream is marked final, then it is combined with the next data output stream. If that stream is final, then it will be combined with the next after that, and so on until either all the streams have been collapsed and all data written out, or a data stream that is not final is encountered.

...

Notice now that the middle stream has no contents any more. It's bit length is now 0. Since the frag byte of the direct stream is full, this is output to its whole bytes, and the direct stream's frag byte is cleared.

...

Now, the middle data output stream (now which has morphed into being the direct one), was FINAL before, and remains FINAL. That means we continue to look to the right and combine the direct data output stream into the next buffered data output stream. In this case, the right-most stream has 3 whole bytes, the direct stream has no frag byte, so those whole bytes are just output to the direct stream's whole bytes.

...

Note that in the above, we're now collapsing the data output stream created by unparsing element 'c', and that element had bit order LSBF. When we copied the whole bytes, since there was no frag byte the bit-order change is legal. Had there been a frag byte on the direct stream, then changing bit orders would have had to create a runtime Runtime Schema Definition Error, as it is not meaningful to change bit order in the middle of a byte.

...