This proposal is replaced by Proposal: Simplified Feature to Support Enumerations

Simplification is due to these features being very complex, and thus far mostly unused in DFDL schemas. 

However, this proposal does reflect the feature as implemented in Daffodil 3.2.0 (substantial parts have existed since Daffodil 2.4.0). Several of these parts are in use by Daffodil users, and so are assumed to be carried forward in future Daffodil releases.

A prior proposal is available here.

Introduction

Much data contains numeric values that are enumerations, where each value is associated with a logical string that provides a meaningful symbolic interpretation of it.

Such lookups are already expressible, in theory, in DFDL using the DFDL expression language; however, practical considerations greatly limit the utility of such solutions.

The primary limitation is that DFDL expressions do not provide any constant-dispatch construct, so a lookup table would need to be implemented as a giant if-then-else chain, which would be prohibitively inefficient for large tables.

A secondary concern is that parsing and unparsing would require inverse tables, which must be specified separately, resulting in significant duplication.

This proposal provides an alternative mechanism by introducing a new notion, dfdlx:inputTypeCalc and dfdlx:outputTypeCalc, which are analogous to inputValueCalc and outputValueCalc except that they are associated with types rather than elements, and that they compose with preexisting parsing behaviors.

Using this notion, this proposal will then introduce a specific construct, KeySet-Value maps, to allow an efficient implementation of enum lookups using the inputTypeCalc and outputTypeCalc concepts.

This proposal will then provide some additional constructs to support a wider array of use cases and discuss how it can be integrated with other DFDL features, particularly xs:choice elements and inputValueCalc/outputValueCalc.

Theory

Before discussing the concrete implementation, it is worth considering the theoretical structure that is being proposed abstractly.

A type in DFDL can be thought of as a pair of 2 functions, parse and unparse, which associate the binary representation of data with the logical representation of said data. For the sake of discussion, we will be using an informal parameterized type system, where the notation t1[X] indicates a type with name t1, whose logical values have type X.

type t1[A] := {
  parse:          bin -> (A, bin)
  unparse:     A -> bin
}

What we would like to do is introduce a way of taking an existing type, and constructing a new type by describing a translation between logical values:

type t2[B] = {
  repType = t1[A]
  inputTypeCalc : A -> B
  outputTypeCalc : B -> A

  parse : bin -> B
  parse :=  inputTypeCalc ∘ t1.parse

  unparse : B -> bin
  unparse := t1.unparse ∘ outputTypeCalc
}

As you can see, we can describe a new type, t2, which translates between binary and logical type B, by using the existing type t1[A], and a pair of functions to translate between A and B.
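The composition above can be sketched as a small Python model. This is only an illustration under stated assumptions: the names SimpleType, derive, uint8, and decimal_string are hypothetical and are not Daffodil APIs, and parse returns only the logical value rather than a (value, remaining input) pair.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class SimpleType(Generic[A]):
    """Hypothetical model of a DFDL type: a parse/unparse pair."""
    parse: Callable[[bytes], A]
    unparse: Callable[[A], bytes]


def derive(
    rep_type: SimpleType[A],
    input_type_calc: Callable[[A], B],
    output_type_calc: Callable[[B], A],
) -> SimpleType[B]:
    """Build t2[B] from the repType t1[A] and the two translation functions."""
    return SimpleType(
        # parse := inputTypeCalc after t1.parse
        parse=lambda data: input_type_calc(rep_type.parse(data)),
        # unparse := t1.unparse after outputTypeCalc
        unparse=lambda value: rep_type.unparse(output_type_calc(value)),
    )


# Assumed repType for illustration: a single unsigned byte.
uint8 = SimpleType(parse=lambda data: data[0], unparse=lambda n: bytes([n]))

# Assumed logical type: the byte rendered as a decimal string.
decimal_string = derive(uint8, input_type_calc=str, output_type_calc=int)
```

Note that the derived type never touches the binary data directly; it only reuses the parse and unparse functions of its repType.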

There are some subtleties to consider here. First, it is not necessary for inputTypeCalc and outputTypeCalc to be inverses; however, they should be pseudo-inverses. That is to say, we should have:

inputTypeCalc(outputTypeCalc(inputTypeCalc(x))) = inputTypeCalc(x)

To see this, consider the mapping:

String -> Int
NULL   -> 0
0      -> 0

Here, two distinct input values map to the same output value. This is okay so long as this output value would map back to one of the corresponding input values.
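The pseudo-inverse condition can be checked mechanically. The sketch below models the mapping above with two Python dictionaries; the names is_pseudo_inverse, to_int, and to_string are hypothetical, and choosing "0" as the string that 0 maps back to is an assumption made for the example.

```python
def is_pseudo_inverse(forward, backward, domain):
    """Check forward(backward(forward(x))) == forward(x) for every x in domain."""
    return all(forward(backward(forward(x))) == forward(x) for x in domain)


# The mapping from the text: both "NULL" and "0" map to 0.
to_int = {"NULL": 0, "0": 0}
# The reverse direction must pick one of the two strings; "0" is assumed here.
to_string = {0: "0"}

ok = is_pseudo_inverse(to_int.get, to_string.get, to_int.keys())

# A strict inverse would also need to_string[to_int[x]] == x,
# which cannot hold for both "NULL" and "0".
strict = all(to_string[to_int[x]] == x for x in to_int)
```

Here ok holds while strict does not, which is exactly the gap between an inverse and a pseudo-inverse.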

The second subtlety is the domain of the inputTypeCalc function, which will be referred to as the repValues.

The above example defining t2[B] would suggest that the domain of inputTypeCalc is all of A, which is all logical values associated with the repType t1[A]. However, it is often useful for inputTypeCalc to be defined for only a subset of A. For instance, suppose t1[Int] is all 8-bit unsigned integers representing error codes, and t2[String] is a human readable description of said codes. If there are only 100 codes defined, then it might make sense to define inputTypeCalc over only values 0-99. Similarly, outputTypeCalc need not be defined for all strings, just those which may be returned by inputTypeCalc (although it might be desirable to define outputTypeCalc over a broader domain to better support edited infosets). To support this, we allow inputTypeCalc and outputTypeCalc to be partial functions, and refer to the domain of inputTypeCalc as the repValues of t2.

Representing Transforms

This proposal does not actually specify transforms independently, but as part of the specification of a new type.

Identity Transform

Suppose we have an existing type t1[A] and we want to define a new type, t2[A], with the trivial identity transforms. We may do this by defining t2 as a new xsd simpleType with base A, and adding the dfdlx:repType annotation to specify the repType as t1.

<xs:simpleType name='t2' dfdlx:repType='t1'>
  <xs:restriction base='A' />
</xs:simpleType>

This is not particularly useful, but will serve as a base for more complicated transforms.

Restriction Transform

A less pointless variant of the identity transform is the restriction transform.  The restriction transform behaves like the identity transform except it restricts the set of repValues.

<xs:simpleType name='t2' dfdlx:repType='t1'>
  <xs:restriction base='A'>
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="10"/>
  </xs:restriction>
</xs:simpleType>

As you can see, we accomplish this using the standard xsd restriction feature. This has the added benefit that non-DFDL-aware XML validators will automatically be aware of the restriction on the legal values of the resulting type.

KeySet-Value Transform

The KeySet-Value transforms are central to the support of enumerations. Abstractly, a KeySet-Value transform is defined by a set of (keyset, canonicalKey, value) tuples, where each canonicalKey is a member of the corresponding keyset, all values are unique, and all keysets are mutually disjoint. The transform is then defined by:

data : { (keyset, canonicalKey, value) }
parse(x) = let (keyset, canonicalKey, value) ∈ data such that x ∈ keyset in
               value
unparse(x) = let (keyset, canonicalKey, value) ∈ data such that x = value in
               canonicalKey

This behaves similarly to a standard invertible key-value map, except that it is possible for multiple keys to map to the same value, in which case a single key is chosen as the inverse of said value.  

This is specified in the schema by defining t2[B] as an xsd enumeration of type B. On each enumeration value, we use DFDL annotations to specify one or more keys (or repValues) to associate with it. There are two ways to specify repValues. The dfdlx:repValues annotation is a space-separated list of values, and the dfdlx:repValueRanges annotation is a space-separated list of ints which will be interpreted as "min1 max1 min2 max2 … minN maxN", which represents the union of all intervals [minK, maxK]. The repValue set of t2 is the union of the sets specified by these two methods. For example:

<xs:simpleType name="fruitEnumType" dfdlx:repType="tns:fruitRepType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Apple" dfdlx:repValues="0" />
    <xs:enumeration value="Banana" dfdlx:repValues="1" />
    <xs:enumeration value="Disused" dfdlx:repValues="11 13 15" />
    <xs:enumeration value="Illegal"
      dfdlx:repValues="12 14"
      dfdlx:repValueRanges="3 10 16 255"/>
  </xs:restriction>
</xs:simpleType>

The canonical repValue is the first value specified by dfdlx:repValues, or (if dfdlx:repValues is not present), the first value specified by dfdlx:repValueRanges.
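As a minimal sketch of how these annotations could be compiled into lookup tables, the Python below expands dfdlx:repValues and dfdlx:repValueRanges for the fruitEnumType example and applies the canonical repValue rule. The function names are hypothetical, integer repValues are assumed, and this is not a description of how Daffodil implements the feature.

```python
def expand_rep_values(rep_values="", rep_value_ranges=""):
    """Return the keys of one enumeration in order, canonical key first."""
    keys = [int(v) for v in rep_values.split()]
    bounds = [int(v) for v in rep_value_ranges.split()]
    if len(bounds) % 2 != 0:
        raise ValueError("repValueRanges must hold min/max pairs")
    for lo, hi in zip(bounds[0::2], bounds[1::2]):
        keys.extend(range(lo, hi + 1))  # intervals [min, max] are inclusive
    return keys


def build_tables(enumerations):
    """Build parse (key -> value) and unparse (value -> canonical key) maps."""
    parse_table, unparse_table = {}, {}
    for value, rep_values, rep_value_ranges in enumerations:
        keys = expand_rep_values(rep_values, rep_value_ranges)
        # First repValues entry, or first range value if repValues is absent.
        unparse_table[value] = keys[0]
        for key in keys:
            if key in parse_table:
                raise ValueError(f"repValue {key} appears in two keysets")
            parse_table[key] = value
    return parse_table, unparse_table


# The fruitEnumType example: (value, repValues, repValueRanges).
fruit = [
    ("Apple", "0", ""),
    ("Banana", "1", ""),
    ("Disused", "11 13 15", ""),
    ("Illegal", "12 14", "3 10 16 255"),
]
parse_table, unparse_table = build_tables(fruit)
```

With these tables, every parse or unparse is a single dictionary lookup, and only one description of the mapping is needed for both directions. In this example the repValue 2 belongs to no keyset, so it is outside the repValue set of the type.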

Union Transform

Suppose we have multiple types using a common repType, but with disjoint repValues. For instance, we might have a separate type for negative integers and non-negative integers. We can combine these into a single type using the xsd union construct:

<xs:simpleType name="signedInt" dfdlx:repType="tns:intRepType">
  <xs:union memberTypes="negativeInt nonnegativeInt" />
</xs:simpleType>

Here, we require that the repType of all component types match the repType of the parent type. The repValues of the parent type are the disjoint union of the repValues of the child types, and the inputTypeCalc/outputTypeCalc functions are defined piecewise by those of the component types.
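The piecewise definition can be illustrated with a short Python sketch that merges the parse tables of the member types and rejects overlapping repValues. The function union_transform and the two small member tables are hypothetical and exist only for this example.

```python
def union_transform(members):
    """Combine member parse tables that share a repType into one table.

    Each member is a dict from repValue to logical value. The member
    repValue sets must be disjoint, so every key has exactly one owner.
    """
    combined = {}
    for table in members:
        overlap = combined.keys() & table.keys()
        if overlap:
            raise ValueError(f"repValues shared between members: {sorted(overlap)}")
        combined.update(table)
    return combined


# Hypothetical member types over a small signed repType.
negative_int = {n: f"minus {-n}" for n in range(-3, 0)}
nonnegative_int = {n: f"plus {n}" for n in range(0, 4)}
signed_int = union_transform([negative_int, nonnegative_int])
```

The disjointness check is what guarantees that the combined function is well defined: each repValue is handled by exactly one member.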

Expression Transform

The final kind of transform that this proposal considers is one defined by arbitrary DFDL expressions. These expressions are specified by means of explicit dfdlx:inputTypeCalc and dfdlx:outputTypeCalc annotations on the type. In addition, the repValue set must be explicitly defined by placing dfdlx:repValues and/or dfdlx:repValueRanges directly on the type.

<xs:simpleType name="fruitLocalType"
     dfdlx:inputTypeCalc="{ dfdlx:repTypeValue() - 2 }"
     dfdlx:outputTypeCalc="{ dfdlx:logicalTypeValue() + 2 }"
     dfdlx:repType="tns:fruitIntType"
     dfdlx:repValues="12 14"
     dfdlx:repValueRanges="3 10 16 255">
  <xs:restriction base="xs:int" />
</xs:simpleType>

Note that, in the above example, a non-DFDL-aware validator will mistakenly believe that all integers are legal values. This can be resolved by explicitly specifying the set of logical values using the xsd restriction mechanism:

<xs:simpleType name="fruitLocalType"
    dfdlx:inputTypeCalc="{ dfdlx:repTypeValue() - 2 }"
    dfdlx:outputTypeCalc="{ dfdlx:logicalTypeValue() + 2 }"
    dfdlx:repType="tns:fruitIntType"
    dfdlx:repValues="12 14" dfdlx:repValueRanges="3 10 16 255">
  <xs:union>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:enumeration value="10"/>
        <xs:enumeration value="12"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="8"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:minInclusive value="14"/>
        <xs:maxInclusive value="253"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>

Note that the only effect of adding these restrictions on the logical type is in validation.
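The logical values listed in the restriction follow directly from applying the inputTypeCalc expression to the repValue set. A short Python check of that arithmetic, where input_type_calc is a hypothetical stand-in for the expression { dfdlx:repTypeValue() - 2 }:

```python
def input_type_calc(rep_value):
    """Stand-in for the schema expression { dfdlx:repTypeValue() - 2 }."""
    return rep_value - 2


def output_type_calc(logical_value):
    """Stand-in for the schema expression { dfdlx:logicalTypeValue() + 2 }."""
    return logical_value + 2


# repValues="12 14" and repValueRanges="3 10 16 255" from the example.
rep_values = {12, 14} | set(range(3, 11)) | set(range(16, 256))
logical_values = {input_type_calc(v) for v in rep_values}

# The three members of the xs:union: {10, 12}, [1, 8], and [14, 253].
declared = {10, 12} | set(range(1, 9)) | set(range(14, 254))

matches = logical_values == declared
round_trips = all(output_type_calc(input_type_calc(v)) == v for v in rep_values)
```

Both matches and round_trips hold for this example, so the restriction describes exactly the values the transform can produce.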

Interaction with xs:choice

It may be desirable to select a different transform based on the value encountered at runtime. This is possible using the union transform described above; however, that solution requires that all transforms result in the same element, thereby hiding which case was used from the generated infoset. Additionally, such a method would not allow the distinct transforms to have different output types.

As an alternative, we add two annotations to xs:choice: dfdlx:choiceBranchKeyKind and dfdlx:choiceDispatchKeyKind.

When dfdlx:choiceBranchKeyKind is "byType", each branch of the xs:choice must be a simple element with a transform. The choice will then behave as if each element specified dfdlx:choiceBranchKey as the set of repValues defined by the type of said element.

When dfdlx:choiceDispatchKeyKind is “byType”, we require all choice options to be simple elements which share a common repType. We then parse the repType, and use the resulting simple value as the choiceDispatchKey.

For example:

<xs:choice
  dfdlx:choiceBranchKeyKind="byType"
  dfdlx:choiceDispatchKeyKind="byType">
  <xs:element name="fruit" type="tns:fruitEnumType"/>
  <xs:element name="localFruit" type="tns:fruitLocalType"/>
  <xs:element name="disused" type="tns:fruitDisusedType"/>
</xs:choice>

We explicitly forbid choiceBranchKeyKind="explicit" to co-occur with choiceDispatchKeyKind="byType"

This is to avoid dealing with potentially ambiguous unparse situations that could occur with schemas like the following:

<xs:simpleType name="one_or_two" dfdlx:repType="tns:uint8">
  <xs:restriction base="xs:string">
    <xs:enumeration value="1 or 2" dfdlx:repValues="1 2"/>
  </xs:restriction>
</xs:simpleType>

<xs:choice dfdlx:choiceBranchKeyKind="explicit" dfdlx:choiceDispatchKeyKind="byType">
  <xs:element name="one" type="tns:one_or_two" dfdl:choiceBranchKey="1"/>
  <xs:element name="two" type="tns:one_or_two" dfdl:choiceBranchKey="2"/>
</xs:choice>

In this case, the binary input 02 would parse to <two>1 or 2</two>. However, it is ambiguous whether we should unparse this according to the canonical value of the type (1) or the canonical branchKey (2).

Using with explicit raw elements

It may be desirable to include both the raw and logical values in the infosets. Traditionally, this use case has been accomplished using inputValueCalc and outputValueCalc annotations. This remains the case here. To support this use case, we expose the inputTypeCalc/outputTypeCalc functions to the DFDL expression language:

<xs:sequence>
  <xs:element name="raw" type="tns:fruitRepType" 
    dfdlx:outputValueCalc="dfdlx:outputTypeCalc(tns:fruitEnumType, ../fruit)"/>
  <xs:element name="fruit" type="tns:fruitEnumType"
    dfdlx:inputValueCalc="dfdlx:inputTypeCalc(tns:fruitEnumType, ../raw)"/>
</xs:sequence>

A more complicated example would be using a raw element with a choice of logical elements:

The only additional mechanism is the dfdlx:outputTypeCalcNextSibling function, which takes the value of the following sibling and applies the outputTypeCalc function associated with the element type of the following sibling.

<xs:sequence>
  <xs:element name="raw" type="tns:fruitIntType" 
    dfdlx:outputValueCalc="dfdlx:outputTypeCalcNextSibling()" />
  <xs:choice dfdlx:choiceBranchKeyKind="byType" 
    dfdlx:choiceDispatchKeyKind="explicit" dfdlx:choiceDispatchKey="../raw" >
    <xs:element name="fruit" type="tns:fruitType" 
      dfdlx:inputValueCalc="dfdlx:inputTypeCalc(tns:fruitType, ../raw)" />
    <xs:element name="localFruit" type="tns:fruitLocalType" 
      dfdlx:inputValueCalc="dfdlx:inputTypeCalc(tns:fruitLocalType, ../raw)" />
    <xs:element name="disused" type="tns:fruitDisusedType" 
      dfdlx:inputValueCalc="dfdlx:inputTypeCalc(tns:fruitDisusedType, ../raw)" />
  </xs:choice>
</xs:sequence>

In principle, this could be accomplished more generically, by allowing dfdlx:outputTypeCalc to take an arbitrary expression returning a path to a node, along with some form of next-sibling function (to allow for the fact that there is not a constant name for the next sibling). However, for ease of implementation, only this more limited structure will be supported by this proposal.

Summary of annotations

  • dfdlx:repType
    • Applies to xs:simpleType
    • Defines the representation type associated with the annotated type.
    • On parse, the DFDL processor first parses according to the repType, then applies any conversion specified by the annotated type.
    • On unparse, the DFDL processor first applies the conversion specified by the annotated type, then the unparse behavior specified by the repType
  • dfdlx:choiceBranchKeyKind
    • Applies to xs:choice
    • Values: byType, explicit, speculative, implicit
    • byType
      • Each choice option must be a simple element
      • All choice options must have a type with a common repType
      • The repValue sets of all options must be mutually disjoint
      • The choice dispatch will behave as if the choiceBranchKeys specified by an option are the repValue set of the option's type.
    • Explicit
      • Each choice option must directly specify a choiceBranchKey. These values will be used for direct dispatch
      • Requires choiceDispatchKeyKind=explicit as well
    • Speculative
      • Direct dispatch will not be used. Choice options will be parsed speculatively, and the first non-failing case will be used
      • Requires choiceDispatchKeyKind=speculative
    • Implicit
      • Current behavior
      • If choice options provide explicit choiceBranchKeys, then behave as if we were “explicit”
      • Otherwise, behave as if we were “speculative”
  • dfdlx:choiceDispatchKeyKind
    • Applies to xs:choice
    • Values: byType, explicit, speculative, implicit
    • byType
      • Each choice option must be a simple element
      • All choice options must have a type with a common repType
      • First, parse according to the common repType without consuming any input
      • Then, use the resulting value as the choiceDispatchKey
    • Explicit
      • Gets the choiceDispatchKey from the dfdlx:choiceDispatchKey annotation
    • Speculative
      • Direct dispatch will not be used. Choice options will be parsed speculatively, and the first non-failing case will be used
      • Requires choiceBranchKeyKind=speculative
    • Implicit
      • Current behavior
      • If dfdlx:choiceDispatchKey is present, then behave as if we were explicit
      • Otherwise, behave as if we were speculative
  • dfdlx:inputTypeCalc
    • Applies to xs:simpleType
    • Requires dfdlx:repType to also be present
    • Is a DFDL expression
    • On parse, first parse according to the repType, then populate the value of this element to the result of evaluating the dfdlx:inputTypeCalc expression
    • The value of the repType may be accessed by the expression through the dfdlx:repTypeValue function
  • dfdlx:outputTypeCalc
    • Applies to xs:simpleType
    • Requires dfdlx:repType to also be present
    • Is a DFDL expression
    • On unparse, first evaluate this expression, then unparse according to the repType as if the logical value were the result of evaluating this expression
    • The original logical value of this type may be accessed by the expression through the dfdlx:logicalTypeValue function
  • dfdlx:repValues
    • Applies to xs:enumeration and xs:simpleType
    • A space separated list of values
    • Values must be of a type consistent with repType
    • When applied to xs:enumeration:
      • Defines a KeySet-Value transform, and associates the annotated enumeration value with the listed keys
      • Adds the listed values to the repValue set of the parent simpleType
    • When applied to xs:simpleType:
      • Adds the listed keys to the repValue set of the parent
      • This set will be used by xs:choice when choiceBranchKeyKind=byType
  • dfdlx:repValueRanges
    • Applies to xs:enumeration and xs:simpleType
    • Requires dfdlx:repType to be present and refer to an integer type
    • A space separated list of integers defining ranges of integers
    • Takes the form “min1 max1 min2 max2 … minN maxN”
    • Represents the set of integers described by the union of the intervals [minK, maxK]
    • Behaves as if all members of this set were included in the dfdlx:repValues annotation

Summary of Functions

  • dfdlx:inputTypeCalc(f: QName, x:A)
    • f must be a constant QName resolving to a simpleType with a transform defined
    • The type of x is determined statically at compile time as the primitive type of the repType of f.
    • The return type is given by the primitive type of the logical type of f.
    • If the types given by f do not match what is required, the relevant expression may be cast according to standard DFDL expression casting rules.
    • Returns the result of applying the inputTypeCalc function associated with f to x
  • dfdlx:outputTypeCalc(f: QName, x:Any)
    • f must be a constant QName resolving to a simpleType with a transform defined
    • The type of x is determined statically at compile time as the primitive type of the logical type of f.
    • The return type is given by the primitive type of the repType of f.
    • If the types given by f do not match what is required, the relevant expression may be cast according to standard DFDL expression casting rules.
    • Returns the result of applying the outputTypeCalc function associated with f to x

5 Comments

  1. We should consider if the two properties dfdl:choiceDispatchKeyKind, and dfdl:choiceBranchKeyKind should be combined into a single property. They're not orthogonal, and it's very easy to confuse the two.  For example:

    This combination is disallowed.

    dfdl:choiceBranchKeyKind="explicit" dfdl:choiceDispatchKeyKind="byType"

    but the flip

    dfdl:choiceBranchKeyKind="byType" dfdl:choiceDispatchKeyKind="explicit"

    is a primary use case.

    That's too subtle. We should consider coming up with names for the valid combinations of these two properties and collapsing them to one property with those names as its enum values.

    Suppose the combined property is dfdl:choiceBranchKind. I claim this combination:

    dfdl:choiceBranchKeyKind="byType" dfdl:choiceDispatchKeyKind="explicit"

    would sensibly just be dfdl:choiceBranchKind="byTypeWithKey" and the pair

    dfdl:choiceBranchKeyKind="byType" dfdl:choiceDispatchKeyKind="byType"

    would sensibly just be dfdl:choiceBranchKind="byType"

    I think those two plus "implicit" (which either does speculation or direct dispatch depending on presence of dfdl:choiceBranchKey  and dfdl:choiceDispatchKey properties) are actually the only valid combinations. So really there's just

    dfdl:choiceKeyKind with values 'byType', 'byTypeWithKeys', or 'implicit'.

    In fact, I think the 'byTypeWithKeys' really can just be 'byType', when a dfdl:choiceDispatchKey property with expression is present.  So really I think we only have two possibilities

    dfdl:choiceKeyKind 'byType' or 'implicit'.

  2. A primary purpose of this proposal was large enums. E.g., where a 12 bit field is mapped to up to 4096 enumerated string values. These are infeasible using inputValueCalc/outputValueCalc expressions.

    So I took a look at how a proposed XML schema for a binary data format works as far as the enums for Air types, space-type, etc. which are 12-bit fields.

    We put quite a bit of complexity into our typeCalc feature in order to be able to round-trip illegal values i.e., produce <illegal>123</illegal> elements that capture the multiple sub-ranges of the 4096 possibilities that map to "illegal".

    Interestingly, the proposed XML schema simply takes the position that 4096 isn't very many, and puts in an enum named ILLEGAL123, with a numeric suffix corresponding to the illegal value, for each discrete possible illegal value. So they have an enumeration with 4096 values, including one enumeration for each illegal value.

    They have no annotations on each element of this enumeration indicating the corresponding integers - the order of these enumeration elements in the schema is assumed to provide the integer-to-string mapping. The value 0 is assumed to be first.

    This allows it to trivially round trip, based on pure 1-to-1 table lookup with no calculations needing to be expressed.

    This simple approach has some real advantages. The element remains of simple xs:string type, for example, whereas as soon as we start introducing illegal range values, we have to model it as a complex type that is a choice of different types for legal and illegal values.

    I would suggest that unless an enumeration is truly excessively large (more than 18 bits maybe), that this approach may in fact be preferable to modeling as a complex type. The schema could be as simple as this:

    <xs:simpleType name="vehicleType" dfdl:repType="tns:uint3">
       <xs:restriction base="xs:string">
         <xs:enumeration value="NoStatement"/> <!-- no dfdl:repValues="N" needed -->
         <xs:enumeration value="truck"/> 
         <xs:enumeration value="suv"/>
         <xs:enumeration value="bus"/>
         <xs:enumeration value="train"/>
         <xs:enumeration value="car"/>
         <xs:enumeration value="ILLEGAL6"/>
         <xs:enumeration value="ILLEGAL7"/>
      </xs:restriction>
    </xs:simpleType>
    
    <xs:simpleType name="uint3" dfdl:length="3" dfdl:lengthUnits="bits">
      <xs:restriction base="xs:unsignedInt"/>
    </xs:simpleType>
  3. Note that of the features described in this memo, many/most are unused.

    But not all.

    There are large DFDL schemas for Link16, VMF, and JREAP-C created by users which make heavy use of inputTypeCalc and outputTypeCalc.

    These schemas have some user community that will not want to have to revise all their XML. So while we might deprecate this approach and go with something simpler in the future, we cannot withdraw all the functionality.

    These functions are not used anywhere in these complex schemas that made extensive use of this proposal:

    • dfdlx:repTypeValue
    • dfdlx:logicalTypeValue
    • dfdlx:outputTypeCalcNextSibling

    These could probably be safely removed from Daffodil.

    Of the proposed properties, repType and repValues are heavily used also in these same schemas. These dfdlx properties are unused and probably could be safely removed from Daffodil.

    • repValueRanges
    • choiceBranchKeyRanges
    • choiceBranchKeyKind
    • choiceDispatchKeyKind
  4. I edited the page to remove the functions that are unused, which are being removed from Daffodil in PR 600. 

    (repTypeValue, logicalTypeValue, and outputTypeCalcNextSibling)

    The features that are actually in use in some DFDL schemas are retained. (inputTypeCalc, outputTypeCalc functions). 


  5. As of Daffodil 3.2.0 (commit eb603e7f4a342e1a63d1e07f59c714cf531724bc) the functions mentioned above:

    • repTypeValue
    • logicalTypeValue
    • outputTypeCalcNextSibling

    have been removed from Daffodil. Their implementations were in the way of fixing DAFFODIL-1879, and they were unused.