Experience with dfdl:inputValueCalc, dfdl:outputValueCalc, and dfdl:hiddenGroupRef

(Short URL for this page is https://s.apache.org/daffodil-experience-with-computed-elements)

Daffodil is the first DFDL implementation to provide the advanced features of

dfdl:inputValueCalc (IVC for short)
dfdl:outputValueCalc (OVC for short)
dfdl:hiddenGroupRef

In addition, Daffodil implements almost the complete DFDL expression language, including all the XPath functions and the DFDL-namespaced functions such as dfdl:valueLength and dfdl:contentLength (though, as of this writing, with some restrictions on when length units of 'characters' can be specified).

These features are essential for highly complex binary formats like:

PCAP
Asterisk
MIL-STD-2045 and related MIL-STD and STANAGs

This page is a location to centralize notes about implementation of advanced DFDL features, and to highlight places where the DFDL specification may need to be clarified, augmented, or corrected.

Infoset can contain elements that have dfdl:outputValueCalc

When unparsing, it is essential that the infoset be allowed to already contain elements that have dfdl:outputValueCalc. The values of these elements are discarded and recomputed during unparsing, but the elements must be tolerated if present and created if not present.
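
As a minimal sketch of this requirement, assume a hypothetical record (the names record, len, and payload are illustrative and not taken from any real schema) in which len carries dfdl:outputValueCalc:

<xs:element name="record">
  <xs:complexType>
    <xs:sequence>
      <!-- len is recomputed from payload whenever the data is unparsed -->
      <xs:element name="len" type="xs:unsignedByte"
        dfdl:outputValueCalc="{ dfdl:valueLength(../payload, 'bytes') }"/>
      <xs:element name="payload" type="xs:string"
        dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:length="{ ../len }"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Under the behavior described here, both of the following infosets should be accepted and should unparse to the same data. In the first, the stale value 99 is discarded and recomputed; in the second, len is created:

<record><len>99</len><payload>abc</payload></record>

<record><payload>abc</payload></record>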

Allow IVC/OVC on Global Elements, SimpleTypes

The restriction that IVC/OVC properties are not allowed on global elements is arbitrary. It was probably put in place with an eye towards reducing the complexity of implementations. It is, however, unnecessary and gets in the way of some useful cases. This restriction should be removed.

Similarly, IVC/OVC are not allowed to be declared on simple types. This restriction is also unnecessary and makes reuse harder.

Allowing some DFDL properties to Coexist with IVC

For elements with dfdl:inputValueCalc, the element is not represented, so format properties don't make sense and should cause an SDE if expressed directly on the element. However, some properties are completely compatible with dfdl:inputValueCalc. In particular, dfdl:choiceBranchKey does not conflict, because it is not a format property.

This restriction in DFDL is easily worked around by wrapping a sequence around the branch element having IVC property, and putting the dfdl:choiceBranchKey on the sequence.
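
A minimal sketch of that workaround follows. The element names kind, rawValue, computed, and other are hypothetical, with kind and rawValue assumed to be earlier siblings of the choice:

<xs:choice dfdl:choiceDispatchKey="{ ../kind }">
  <!-- dfdl:choiceBranchKey goes on the wrapping sequence,
       not on the element that carries dfdl:inputValueCalc -->
  <xs:sequence dfdl:choiceBranchKey="A">
    <xs:element name="computed" type="xs:int"
      dfdl:inputValueCalc="{ xs:int(../rawValue * 2) }"/>
  </xs:sequence>
  <!-- an ordinary represented branch can carry the key directly -->
  <xs:element name="other" type="xs:int" dfdl:choiceBranchKey="B"/>
</xs:choice>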

SDE for element in hidden group that has no default, nor dfdl:outputValueCalc

Within a hidden group, if an element has no fixed/default value, and no OVC, then because it is hidden and so does not appear in the infoset, there's no way for it to get a value at all when unparsing.

Daffodil issues an SDE in this situation as is required by the DFDL spec.

This SDE is too strong: if the element has IVC, then it doesn't need a value for unparsing unless an expression that is used at unparse time (i.e., not in an assert or discriminator) references the element. In many cases the value of an IVC element may be used only in dfdl:occursCount expressions, or in assert/discriminator test expressions. Since those aren't evaluated when unparsing, no unparsing situation would ever need the value, so the SDE is too strong.
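
The kind of schema in question can be sketched as follows. All names here (countGrp, itemCount, totalLen, item, and the ex: prefix) are hypothetical, and totalLen is assumed to be an earlier sibling element:

<xs:group name="countGrp">
  <xs:sequence>
    <!-- Hidden, with IVC but no default and no OVC. Its value is read
         only by the dfdl:occursCount expression below, which is not
         evaluated when unparsing. -->
    <xs:element name="itemCount" type="xs:int"
      dfdl:inputValueCalc="{ xs:int(../totalLen idiv 4) }"/>
  </xs:sequence>
</xs:group>

<xs:sequence dfdl:hiddenGroupRef="ex:countGrp"/>
<xs:element name="item" type="xs:int" minOccurs="0" maxOccurs="unbounded"
  dfdl:occursCountKind="expression"
  dfdl:occursCount="{ ../itemCount }"/>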

IVC and OVC on the same element

In many situations we have found that we need both IVC and OVC on the same element. This occurs when the element is in a hidden group, and is essentially being used as a variable.

The dfdl:newVariableInstance doesn't eliminate this problem, as there are situations where one needs an "array variable", that is a variable associated with each index of an array, but referenced from expressions used to compute things outside of the scope of that array element.

Another use of this is that sometimes for unparsing, one needs a location for an intermediate calculation to be saved (to avoid redundantly computing it). Such an element needs to carry an OVC, but also have no representation. Having IVC on it explicitly makes it have no representation, no alignment, and no implications as far as separators showing up for it in separated sequences. There is no way to achieve this other than using IVC.

Array Variables and dfdl:newVariableInstance

In many cases dfdl:newVariableInstance is not sufficient because what are needed are array variables.

For instance, when parsing element i+1 you have to refer to a "variable" defined as part of element i.

You can't do that with dfdl:newVariableInstance because of the scope it has. If you introduce a dfdl:newVariableInstance for an array element, its scope will be that element only. If you introduce it outside the array element, there will be only one instance for the whole array.

But you can do an "array variable" with a hidden element. (Doesn't have to be hidden, but that's the sensible way to use it.)

The headaches are the relative path needed to reach it, and the fact that the most natural thing to do is to put both IVC and OVC on it (though the OVC may not be needed in many cases).
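
A rough sketch of such an "array variable", assuming a hypothetical array rec whose hidden child runningTotal refers back to the same child in the preceding occurrence (all names, including the ex: prefix, are illustrative):

<xs:group name="runningTotalGrp">
  <xs:sequence>
    <!-- One instance per occurrence of rec; occurrence i+1 reads the
         value held by occurrence i through a relative path. -->
    <xs:element name="runningTotal" type="xs:int"
      dfdl:inputValueCalc="{
        if (dfdl:occursIndex() eq 1) then ../value
        else ../../rec[dfdl:occursIndex() - 1]/runningTotal + ../value }"/>
  </xs:sequence>
</xs:group>

<xs:element name="rec" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="value" type="xs:int"/>
      <xs:sequence dfdl:hiddenGroupRef="ex:runningTotalGrp"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Because runningTotal is hidden and has no default, an implementation that enforces the hidden-group SDE discussed earlier would presumably also require an OVC on it, which is the IVC-plus-OVC combination described above.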

Variables: The direction property, and forward reference from dfdl:setVariable value and dfdl:newVariableInstance defaultValue Expressions

It is clear that variables need to be able to be evaluated at unparse time, and the expressions used with them to default or set their values need to be able to forward-reference into the infoset when they are evaluated at unparse time.

We have prototyped an extension property for dfdl:defineVariable, dfdlx:direction, which allows one to declare the variable as 'parseOnly', 'unparseOnly', or 'both'.

We believe that using dfdl:setVariable on a variable is inconsistent with using dfdl:newVariableInstance with a default value on that same variable, as the race conditions between reading the default value and setting the value are complex. Stylistically, if one intends to use a variable with dfdl:newVariableInstance, one should declare it with neither a default value nor an external value.

The following scenario comes from PCAP and illustrates use of dfdl:newVariableInstance and variables with forward-referencing defaultValue expressions:

      <!-- Internally used by IPAddressGroup at unparse time
           These are IPAddressGroup's local variables. -->
      <dfdl:defineVariable name="remainingDottedAddr" type="xs:string" dfdlx:direction="unparseOnly"/>
      <dfdl:defineVariable name="priorRemainingDottedAddr" type="xs:string" dfdlx:direction="unparseOnly"/>

      <!-- Parameter for IPAddressGroup used at unparse time -->
      <dfdl:defineVariable name="ipAddressElement" type="xs:string" dfdlx:direction="unparseOnly"/>

<!-- 
A PCAP schema has two different IP addresses, IPSrc and IPDest.

They are defined in terms of a common group definition pcap:IPAddressGroup
which works like a common "subroutine" within the schema. The variable pcap:ipAddressElement is
a formal parameter of IPAddressGroup.
-->

  <xs:group name="IPSrcGrp">
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <!-- 
            Note how this defaultValue expression forward-references the
            element containing the string (e.g., 1.2.3.4). It is used at unparse time only,
            as the variable is itself declared 'unparseOnly'.
           -->
          <dfdl:newVariableInstance ref="pcap:ipAddressElement" defaultValue='{ IPSrc }'/><!-- example 1.2.3.4 -->
        </xs:appinfo></xs:annotation>
        <xs:element name="IPSrcString">
          <xs:complexType>
            <xs:group ref="pcap:IPAddressGroup"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
  </xs:group>


  <xs:group name="IPDestGrp">
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:newVariableInstance ref="pcap:ipAddressElement" defaultValue='{ IPDest }'/><!-- example 1.2.3.4 -->
        </xs:appinfo></xs:annotation>
        <xs:element name="IPDestString">
          <xs:complexType>
            <xs:group ref="pcap:IPAddressGroup"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
  </xs:group>


 <xs:group name="IPAddressGroup">
      <xs:annotation><xs:documentation><![CDATA[
   
      This group is a reusable subroutine of DFDL. It parses a string like 1.2.3.4 into 4 integers named Byte1, Byte2, Byte3, Byte4
      containing the "." separated integers. 

      Arguably, this is extreme for DFDL; an infoset containing the 4 bytes should be a good enough parsed representation.
      But this serves as a useful example regardless. 
      
      This is used to unparse IP addresses expressed in the dotted notation that is common. 

      There is one parameter. Users must bind the $pcap:ipAddressElement variable to the string to be so parsed
      using dfdl:newVariableInstance.

      ]]></xs:documentation></xs:annotation>
    <xs:sequence>
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:newVariableInstance ref="pcap:priorRemainingDottedAddr" 
             defaultValue='{ $pcap:ipAddressElement }'/><!-- example 1.2.3.4 -->
          <dfdl:newVariableInstance ref="pcap:remainingDottedAddr" 
             defaultValue='{ $pcap:priorRemainingDottedAddr }'/><!-- example 1.2.3.4 -->
        </xs:appinfo></xs:annotation>
        <xs:element name="Byte1" type="xs:unsignedByte" 
          dfdl:outputValueCalc="{
            xs:unsignedByte(fn:substring-before($pcap:remainingDottedAddr, '.'))
          }"/>
        <xs:sequence>
          <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:newVariableInstance ref="pcap:priorRemainingDottedAddr" 
              defaultValue='{ $pcap:remainingDottedAddr }'/><!-- example 1.2.3.4 -->
            <dfdl:newVariableInstance ref="pcap:remainingDottedAddr" 
              defaultValue='{ fn:substring-after($pcap:priorRemainingDottedAddr, ".") }'/><!-- example 2.3.4 -->
          </xs:appinfo></xs:annotation>
          <xs:element name="Byte2" type="xs:unsignedByte" 
           dfdl:outputValueCalc="{
             xs:unsignedByte(fn:substring-before($pcap:remainingDottedAddr, '.'))
           }"/>
          <xs:sequence>
            <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
              <dfdl:newVariableInstance ref="pcap:priorRemainingDottedAddr" 
                defaultValue='{ $pcap:remainingDottedAddr }'/><!-- example 2.3.4 -->
              <dfdl:newVariableInstance ref="pcap:remainingDottedAddr" 
                defaultValue='{ fn:substring-after($pcap:priorRemainingDottedAddr, ".") }'/><!-- example 3.4 -->
            </xs:appinfo></xs:annotation>
            <xs:element name="Byte3" type="xs:unsignedByte" 
              dfdl:outputValueCalc="{
                xs:unsignedByte(fn:substring-before($pcap:remainingDottedAddr, '.'))
              }"/>
            <xs:sequence>
              <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:newVariableInstance ref="pcap:priorRemainingDottedAddr" 
                  defaultValue='{ $pcap:remainingDottedAddr }'/><!-- example 3.4 -->
                <dfdl:newVariableInstance ref="pcap:remainingDottedAddr" 
                  defaultValue='{ fn:substring-after($pcap:priorRemainingDottedAddr, ".") }'/><!-- example 4 -->
              </xs:appinfo></xs:annotation>
              <xs:element name="Byte4" type="xs:unsignedByte" 
                dfdl:outputValueCalc="{
                  xs:unsignedByte($pcap:remainingDottedAddr)
                }"/>
            </xs:sequence>
          </xs:sequence>
        </xs:sequence>
      </xs:sequence>
    </xs:sequence>
  </xs:group>


<!-- These groups are then used like so -->

              ...
              <xs:sequence dfdl:hiddenGroupRef="pcap:IPSrcGrp"/>
              <!-- IPSrc will be of the usual IP address form: 1.2.3.4 --> 
              <xs:element name="IPSrc" type="xs:string" 
                dfdl:inputValueCalc="{ 
                  fn:concat(../IPSrcString/Byte1, '.', 
                            ../IPSrcString/Byte2, '.',
                            ../IPSrcString/Byte3, '.',
                            ../IPSrcString/Byte4) }"/>
              <xs:sequence dfdl:hiddenGroupRef="pcap:IPDestGrp"/>
              <xs:element name="IPDest" type="xs:string" 
                dfdl:inputValueCalc="{ 
                  fn:concat(../IPDestString/Byte1, '.',
                            ../IPDestString/Byte2, '.',
                            ../IPDestString/Byte3, '.',
                            ../IPDestString/Byte4) }"/>
             ....

The above DFDL schema enables the 4 bytes of IP Source Address and 4 bytes of IP Destination Address to be parsed into this logical XML infoset:

Data Bytes (hex) 01 02 03 04 05 06 07 08

<IPSrc>1.2.3.4</IPSrc>

<IPDest>5.6.7.8</IPDest>

These will unparse back to the same 8 bytes.

Parse-Time Forward Reference

Several standards in the MIL-STD and NATO STANAG space express formats using a forward reference idiom that requires those forward references to be used at parse time.

These standards are not publicly available. They are US For-Official-Use-Only or NATO Unclassified (which is controlled, not public), but the following roughly illustrates the idiom.

This is pseudo-DFDL - because the expressions are forward referencing:

<choice dfdl:choiceDispatchKey="{ ../key }">
  <sequence dfdl:choiceBranchKey="A">
    <element name="e1" type="xs:int"/>
    <element name="e2" type="xs:int"/>
  </sequence>
  <sequence dfdl:choiceBranchKey="B">
    <element name="e2" type="xs:long"/>
  </sequence>
</choice>
....
<element name="key" type="xs:string" dfdl:length="1"/>
....

In the above, the key comes after the choice. When parsing, this means the parser must suspend, move past the uncertain region, and continue parsing after it until the elements that allow the choiceDispatchKey expression to be evaluated have been parsed. The size of the uncertain region has to be determined without parsing it, meaning all branches of the choice must have the same fixed, predictable length. Once the key is available, the parsing of the choice can be resumed (whatever format properties were in effect at the start of the choice must be restored while it is parsed) and the result of the parse added to the infoset. Any streaming of the parse-output infoset is of course suspended until the choice can be resolved.

The workaround for this is to include everything after the choice, up to and including the key, into every branch, and then have every branch end with a discriminator on the key. This would work and enable the data to parse and unparse. It is undesirable as it is not an efficient implementation, is redundant, and is much less declarative.
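
Applied to the pseudo-DFDL above, the workaround might look roughly like the following sketch. The elided intervening elements stay elided, and the discriminator expressions are an assumption about how the key would be tested:

<choice>
  <sequence>
    <element name="e1" type="xs:int"/>
    <element name="e2" type="xs:int"/>
    <!-- .... everything between the choice and the key, repeated here .... -->
    <element name="key" type="xs:string" dfdl:length="1">
      <annotation><appinfo source="http://www.ogf.org/dfdl/">
        <!-- this branch is only correct if the key turns out to be A -->
        <dfdl:discriminator>{ . eq 'A' }</dfdl:discriminator>
      </appinfo></annotation>
    </element>
  </sequence>
  <sequence>
    <element name="e2" type="xs:long"/>
    <!-- .... the same intervening elements, repeated again .... -->
    <element name="key" type="xs:string" dfdl:length="1">
      <annotation><appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:discriminator>{ . eq 'B' }</dfdl:discriminator>
      </appinfo></annotation>
    </element>
  </sequence>
</choice>

The repetition of the intervening elements and of key in every branch is what makes this approach redundant and less declarative.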

Of note: the distance (amount of schema) between the choice and the key element(s) that enable the choice to be determined is not large in any of the examples we have seen to date. It is at most several elements, and is also a fixed distance from the beginning of the choice.

We expect this forward-reference feature evolved out of a reasonable pattern of behavior for the way a successful data format evolves. A data record layout existed. The format is mostly fixed-length required fields. That layout cannot be modified. However, it needed to be extended to accommodate more fields, and those are added onto the end of the data layout. However, there was an opportunity to save space by reusing some preceding areas of the record layout that would otherwise be unused in certain new uses of the now-extended data record. The flag to indicate this reuse is of course a new field, and must be added on the end of the record. Introducing a choice earlier in the pre-existing part of the data record layout is allowed. It doesn't change anything about the existing layout, as the first branch of the choice would contain the pre-existing layout. Other branches would contain the re-purposing of that same data area for the new extended purposes. All this naturally leads to a choice (for the reused parts of the pre-existing data record layout) where the flag appears after the choice at the end of the record.

In some of the formats where this forward-ref behavior appears, the data records have been extended in this manner more than once, so a single data record has more than one such choice with forward reference.

Note that this issue doesn't actually involve IVC/OVC. It is independent of them. There is a workaround, and so this problem doesn't require a solution as part of DFDL v1.0.

Restricting IVC/OVC in Unordered Sequences

Elements having IVC are not allowed as the root of a choice branch. For the same reasons IVC elements should not be allowed as children of an unordered sequence, or a sequence with floating elements.

Unparsing and Choice Groups, Especially Hidden Choice Groups

There are two problems related to unparsing and xs:choice.

Problem 1: dfdl:outputValueCalc that must refer into a child element inside a xs:choice

An example of this is a fragment of a messaging format where each message has a label and a sublabel element that indicate which message it is. Logically, users think of, and expect to find, the label and sublabel elements within the message structure, but we must have them before the message element in order to use an xs:choice with dfdl:choiceDispatchKey to determine the message.

The alternative, using discriminators inside each message format, is too slow. A dfdl:choiceDispatchKey is simply required, and it actually expresses the format better, since it captures the uniform way all messages are determined. Separate discriminators could all work the same way, but there are no guarantees.

Here's the Infoset, without any hidden groups being used:

<label>3</label>
<sublabel>2</sublabel>
<message_3.2> <!-- choiceDispatchKey on label and sublabel above selects specific message element -->
    <label>3</label> <!-- use IVC to derive from outer label -->
    <sublabel>2</sublabel> <!-- use IVC to derive from outer sublabel -->
    ....

If we want the outer label and sublabel elements to use dfdl:outputValueCalc, symmetric to the IVC when parsing, then the path needs to have a wildcard in it:

<element name="label" type="xs:int"
  dfdl:outputValueCalc="{ ../*/label }" />
<element name="sublabel" type="xs:int"
  dfdl:outputValueCalc="{ ../*/sublabel }" />

We only have use cases for this when unparsing.

This "*" wildcard is always restricted to a single path step, and the possible values it can take on are always part of the DFDL schema; hence, it is still possible to type-check these expressions statically.

The ESA DFDL4Space project has invented a similar wildcard concept as a DFDL extension. They allow a match not just to an entire name as a path step, but also to extensions of a name prefix, e.g., ../pathStep(.*)/ meaning any step whose name begins with "pathStep". The ".*" notation is motivated by regular expressions, but there is no use of patterns richer than ".*". This generality is not needed as far as we can see, because one can always model such data by creating a parent element named "pathStep" in this case, whose children have names that would match the "*" part of the pattern. So instead of "pathStep(.*)" matching pathStepData1 and pathStepData2, you would structure the elements so that "pathStep/*" matches pathStep/Data1 and pathStep/Data2.
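
That restructuring can be sketched as follows. The element types and the copy element are hypothetical, and the "*" path step is the proposed extension discussed here, not standard DFDL:

<!-- Instead of sibling elements named pathStepData1 and pathStepData2,
     introduce a parent named pathStep ... -->
<element name="pathStep">
  <complexType>
    <choice>
      <element name="Data1" type="xs:int"/>
      <element name="Data2" type="xs:int"/>
    </choice>
  </complexType>
</element>

<!-- ... so that a single-step wildcard reaches whichever child exists -->
<element name="copy" type="xs:int"
  dfdl:inputValueCalc="{ ../pathStep/* }"/>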

Problem 2: Choices inside hidden groups

The DFDL Spec discussion of unparsing choices assumes an element that appears within one of the branches of the choice exists in the infoset. This isn't necessarily the case. If the choice is part of a complex representation that is hidden, then no infoset element will exist corresponding to any branch.

The problem can be phrased simply as: Depending on some value, choose a representation.

The classic example of this is a "smart string" representation, which tries to avoid wasting space on short strings. For example, it uses a 1-byte length field for strings of length 0 to 127, and a 4-byte length field for longer strings.

Logically, the value is just a text string, but it has a complex representation, and so a hidden group to hide that complex representation is natural.

In DFDL we'd like to have a hidden group that holds the specifically short length field, or the longer length field. This is a choice.

<!-- in the hidden group -->
<!-- fn:string-length counts characters; this assumes a single-byte encoding -->
<choice>
   <element name="shortLen" type="xs:byte" dfdl:outputValueCalc="{ fn:string-length(../s) }" />
   <element name="longLen" type="xs:int" dfdl:outputValueCalc="{ 0 - fn:string-length(../s) }"/>
</choice>
<!-- after the hidden group -->

<element name="s" type="xs:string"
  dfdl:length="{ if (fn:exists(../shortLen)) then ../shortLen
                         else  0 - ../longLen }"/>

When unparsing we have in the infoset only the string 's'. We have no place to express that we must scrutinize the length of 's' in order to decide whether the representation should use a shortLen length, or a longLen length.

One suggestion is to create symmetric unparse-time discriminators, and an unparser-specific version of dfdl:choiceDispatchKey and dfdl:choiceBranchKey. These could then refer to the logical infoset element 's' and determine how to resolve the choice.

Thus far we don't have a use case where the generality of discriminators is needed versus what can be done with the unparse-time equivalent of dfdl:choiceDispatchKey.

The situation is, in general, quite symmetric with parsing, so we should expect the same sorts of solutions: backtracking through the choice alternatives selecting the first one that doesn't cause an unparse error, use of specific asserts/discriminators for unparse time to control the backtracking, or an unparse version of choice resolution by dispatch.
