This proposal is a work in progress. A prior proposal is available here.

Introduction

Much data contains numeric values that are enumerations, where each value is associated with a logical string that provides a meaningful symbolic interpretation of it.

Such lookups are already expressible, in theory, using the DFDL expression language; however, practical considerations greatly limit the utility of such solutions.

The primary limitation is that DFDL expressions do not provide any constant-dispatch construct, so a lookup table would need to be implemented as a giant if-then-else chain, which would be prohibitively inefficient for large tables.

A secondary concern is that parsing and unparsing would require inverse tables, which must be specified separately, resulting in significant duplication.

This proposal provides an alternative mechanism by introducing a new notion to DFDL: inputTypeCalc and outputTypeCalc which are analogous to inputValueCalc and outputValueCalc except that they are associated with types, not elements; and that they compose with preexisting parsing behaviours.

Using this notion, this proposal then introduces a specific construct, KeySet-Value maps, to allow an efficient implementation of enum lookups using the inputTypeCalc and outputTypeCalc concepts.

This proposal then provides some additional constructs to support a wider array of use cases and discusses how they can be integrated with other DFDL features, particularly xs:choice elements and inputValueCalc/outputValueCalc.

Theory

Before discussing the concrete implementation, it is worth considering the proposed structure abstractly.

A type in DFDL can be thought of as a pair of functions, parse and unparse, which associate the binary representation of data with the logical representation of said data. For the sake of discussion, we will use an informal parameterized type system, where the notation t1[X] indicates a type with name t1, whose logical values have type X.


type t1[A] := {
  parse   : bin -> A
  unparse : A -> bin
}


What we would like to do is introduce a way of taking an existing type, and constructing a new type by describing a translation between logical values:


type t2[B] := {
  repType = t1[A]
  inputTypeCalc  : A -> B
  outputTypeCalc : B -> A

  parse   : bin -> B
  parse   := inputTypeCalc ∘ t1.parse

  unparse : B -> bin
  unparse := t1.unparse ∘ outputTypeCalc
}


 

As you can see, we can describe a new type, t2, which translates between binary and logical type B, by using the existing type t1[A], and a pair of functions to translate between A and B.
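The composition can be sketched in Python. This is only an illustration under assumptions: the names `SimpleType` and `derive`, the one-byte `u8` repType, and the boolean `flag` type are hypothetical and are not part of DFDL or of any implementation.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class SimpleType(Generic[A]):
    """A type modelled as a parse/unparse pair (hypothetical helper)."""
    parse: Callable[[bytes], A]
    unparse: Callable[[A], bytes]


def derive(rep_type: SimpleType[A],
           input_type_calc: Callable[[A], B],
           output_type_calc: Callable[[B], A]) -> SimpleType[B]:
    """Build t2[B] from repType t1[A] and the two translation functions."""
    return SimpleType(
        # parse := inputTypeCalc ∘ t1.parse
        parse=lambda data: input_type_calc(rep_type.parse(data)),
        # unparse := t1.unparse ∘ outputTypeCalc
        unparse=lambda value: rep_type.unparse(output_type_calc(value)),
    )


# Assumed repType: a single unsigned byte.
u8 = SimpleType(parse=lambda data: data[0], unparse=lambda n: bytes([n]))

# Assumed logical type: a boolean stored as 0 or 1.
flag = derive(u8, lambda n: n != 0, lambda b: 1 if b else 0)
```

Here `flag.parse(b"\x01")` goes through `u8.parse` first and then through the inputTypeCalc function, while `flag.unparse(False)` applies the outputTypeCalc function before handing the result to `u8.unparse`.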

There are some subtleties to consider here. First, inputTypeCalc and outputTypeCalc need not be inverses; however, they should be pseudo-inverses. That is to say, we should have:

inputTypeCalc(outputTypeCalc(inputTypeCalc(x))) = inputTypeCalc(x)

To see this, consider the mapping:


String -> Int
NULL   -> 0
0      -> 0


Here, two distinct input values map to the same output value. This is okay so long as this output value would map back to one of the corresponding input values.
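As a small check of the pseudo-inverse property, the mapping above can be written out in Python. The choice of "0" as the representation that logical value 0 maps back to is an assumption made for this sketch; the function names are illustrative only.

```python
def input_type_calc(rep: str) -> int:
    # Two distinct representation values share one logical value.
    return {"NULL": 0, "0": 0}[rep]


def output_type_calc(logical: int) -> str:
    # Assumed choice: logical 0 maps back to the representation "0".
    return {0: "0"}[logical]


def is_pseudo_inverse(rep_values) -> bool:
    """Check inputTypeCalc(outputTypeCalc(inputTypeCalc(x))) == inputTypeCalc(x)."""
    return all(
        input_type_calc(output_type_calc(input_type_calc(x))) == input_type_calc(x)
        for x in rep_values
    )
```

Note that `output_type_calc(input_type_calc("NULL"))` yields "0" rather than "NULL", so the two functions are not inverses, yet `is_pseudo_inverse(["NULL", "0"])` still holds.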

The second subtlety is the domain of the inputTypeCalc function, which will be referred to as the repValues.

The above example defining t2[B] would suggest that the domain of inputTypeCalc is all of A, that is, all logical values associated with the repType t1[A]. However, it is often useful for inputTypeCalc to be defined for only a subset of A. For instance, suppose t1[Int] is all 8-bit unsigned integers representing error codes, and t2[String] is a human-readable description of said codes. If only 100 codes are defined, then it might make sense to define inputTypeCalc over only the values 0-99. Similarly, outputTypeCalc need not be defined for all strings, just those which may be returned by inputTypeCalc (although it might be desirable to define outputTypeCalc over a broader domain to better support edited infosets). To support this, we allow inputTypeCalc and outputTypeCalc to be partial functions, and refer to the domain of inputTypeCalc as the repValues of t2.
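The error-code example can be sketched as a pair of partial functions in Python. The table contents (`error-0` through `error-99`) are invented for illustration, and raising `ValueError` for a value outside the domain is an assumption of this sketch; the proposal does not say how a processor should report such values.

```python
# Hypothetical error-code table: only codes 0-99 are defined.
DESCRIPTIONS = {code: f"error-{code}" for code in range(100)}
CODES = {text: code for code, text in DESCRIPTIONS.items()}

# The repValues of the derived type: the domain of inputTypeCalc.
REP_VALUES = frozenset(DESCRIPTIONS)


def input_type_calc(code: int) -> str:
    """Partial function: defined only on the repValues."""
    if code not in REP_VALUES:
        raise ValueError(f"{code} is not a repValue")
    return DESCRIPTIONS[code]


def output_type_calc(text: str) -> int:
    """Partial function: defined only on strings inputTypeCalc can return."""
    if text not in CODES:
        raise ValueError(f"{text!r} has no representation")
    return CODES[text]
```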

Representing Transforms

This proposal does not actually specify transforms independently, but as part of the specification of a new type.

Identity Transform

Suppose we have an existing type t1[A] and we want to define a new type, t2[A], with the trivial identity transforms. We may do this by defining t2 as a new xsd simpleType with base A, and adding the dfdl:repType annotation to specify the repType as t1.


<xs:simpleType name="t2" dfdl:repType="t1">
  <xs:restriction base="A" />
</xs:simpleType>


This is not particularly useful, but will serve as a base for more complicated transforms.

Restriction Transform

A more useful variant of the identity transform is the restriction transform. The restriction transform behaves like the identity transform except that it restricts the set of repValues.


<xs:simpleType name="t2" dfdl:repType="t1">
  <xs:restriction base="A">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="10"/>
  </xs:restriction>
</xs:simpleType>


As you can see, we accomplish this using the standard xsd restriction feature. This has the added benefit that non-DFDL-aware XML validators will automatically be aware of the restriction on the legal values of the resulting type.

KeySet-Value Transform

The KeySet-Value transforms are central to the support of enumerations. Abstractly, a KeySet-Value transform is defined by a set of (keyset, canonicalKey, value) tuples, where each canonicalKey is a member of the corresponding keyset, all values are unique, and all keysets are mutually disjoint. The transform is then defined by:

data : { (keyset, canonicalKey, value) }

parse(x)   = let (keyset, canonicalKey, value) ∈ data such that x ∈ keyset
             in value

unparse(x) = let (keyset, canonicalKey, value) ∈ data such that x = value
             in canonicalKey

This behaves similarly to a standard invertible key-value map, except that it is possible for multiple keys to map to the same value, in which case a single key is chosen as the inverse of said value.  
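A minimal Python sketch of this definition follows. It builds one lookup table per direction from the tuples, which is the kind of constant-time dispatch that an if-then-else chain cannot provide. The function name `build_maps` and the sample data are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Entry = Tuple[FrozenSet[int], int, str]  # (keyset, canonicalKey, value)


def build_maps(data: Iterable[Entry]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return (parse table, unparse table), enforcing the stated constraints."""
    forward: Dict[int, str] = {}
    inverse: Dict[str, int] = {}
    for keyset, canonical_key, value in data:
        if canonical_key not in keyset:
            raise ValueError("canonicalKey must be a member of its keyset")
        if value in inverse:
            raise ValueError("values must be unique")
        for key in keyset:
            if key in forward:
                raise ValueError("keysets must be mutually disjoint")
            forward[key] = value
        inverse[value] = canonical_key
    return forward, inverse


# Sample data, assumed for this sketch.
DATA = [
    (frozenset({0}), 0, "Apple"),
    (frozenset({1}), 1, "Banana"),
    (frozenset({11, 13, 15}), 11, "Disused"),
]
forward, inverse = build_maps(DATA)
```

With this data, `forward[13]` is "Disused" while `inverse["Disused"]` is 11, the canonical key, so parsing 13 and then unparsing produces 11.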

This is specified in the schema by defining t2[B] as an xsd enumeration of type B. On each enumeration value, we use DFDL annotations to specify one or more keys (or repValues) to associate with it. There are two ways to specify repValues. The dfdl:repValues annotation is a space-delimited list of values, and the dfdl:repValueRanges annotation is a space-separated list of integers which will be interpreted as "min1 max1 min2 max2 … minN maxN", representing the union of all intervals [minK, maxK]. The repValue set of t2 is the union of the values specified by these two methods. For example:

<xs:simpleType name="fruitEnumType" dfdl:repType="tns:fruitRepType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Apple" dfdl:repValues="0" />
    <xs:enumeration value="Banana" dfdl:repValues="1" />
    <xs:enumeration value="Disused" dfdl:repValues="11 13 15" />
    <xs:enumeration value="Illegal" dfdl:repValues="12 14" dfdl:repValueRanges="3 10 16 255"/>
  </xs:restriction>
</xs:simpleType>

The canonical repValue is the first value specified by dfdl:repValues, or (if dfdl:repValues is not present) the first value specified by dfdl:repValueRanges.
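The two annotations and the canonical-value rule can be sketched in Python as follows. The function name `expand_rep_values` is hypothetical, and the sketch assumes integer repValues; it is not taken from any implementation.

```python
from typing import List, Optional, Tuple


def expand_rep_values(rep_values: Optional[str],
                      rep_value_ranges: Optional[str]) -> Tuple[List[int], int]:
    """Return (all repValues in declaration order, canonical repValue)."""
    singles = [int(tok) for tok in (rep_values or "").split()]
    bounds = [int(tok) for tok in (rep_value_ranges or "").split()]
    if len(bounds) % 2 != 0:
        raise ValueError("repValueRanges needs an even number of integers")
    ranged: List[int] = []
    for low, high in zip(bounds[0::2], bounds[1::2]):
        if low > high:
            raise ValueError(f"empty interval [{low}, {high}]")
        ranged.extend(range(low, high + 1))  # intervals are inclusive
    values = singles + ranged
    if not values:
        raise ValueError("no repValues specified")
    # repValues entries come first, so values[0] follows the canonical rule.
    return values, values[0]
```

For the "Illegal" enumeration value above, `expand_rep_values("12 14", "3 10 16 255")` gives a canonical repValue of 12; with only `dfdl:repValueRanges="3 10"` the canonical repValue would be 3.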

Union Transform

Suppose we have multiple types using a common repType, but with disjoint repValues. For instance, we might have a separate type for negative integers and non-negative integers. We can combine these into a single type using the xsd union construct:

<xs:simpleType name="signedInt" dfdl:repType="tns:intRepType">
  <xs:union memberTypes="negativeInt nonnegativeInt" />
</xs:simpleType>

Here, we require that the repType of every component type match the repType of the parent type. The repValues of the parent type are the disjoint union of the repValues of the child types, and the inputTypeCalc/outputTypeCalc functions are defined piecewise by those of the component types.
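The piecewise definition can be sketched in Python for the inputTypeCalc direction. Each member type is modelled as a pair of its repValues and its inputTypeCalc function; this modelling, the name `union_input_type_calc`, and the 8-bit ranges are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

# Each member type is modelled as (repValues, inputTypeCalc).
Member = Tuple[FrozenSet[int], Callable[[int], int]]


def union_input_type_calc(members: List[Member]) -> Callable[[int], int]:
    """Combine member inputTypeCalc functions piecewise over disjoint repValues."""
    seen: set = set()
    for rep_values, _ in members:
        if seen & rep_values:
            raise ValueError("member repValues must be disjoint")
        seen |= rep_values

    def calc(rep: int) -> int:
        for rep_values, member_calc in members:
            if rep in rep_values:
                return member_calc(rep)
        raise ValueError(f"{rep} is not a repValue of any member type")

    return calc


# Assumed members over an 8-bit signed repType, both using identity transforms.
negative_int: Member = (frozenset(range(-128, 0)), lambda n: n)
nonnegative_int: Member = (frozenset(range(0, 128)), lambda n: n)
signed_int = union_input_type_calc([negative_int, nonnegative_int])
```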

Expression Transform

The final kind of transform that this proposal considers is one defined by arbitrary DFDL expressions. These expressions are specified by means of explicit dfdl:inputTypeCalc and dfdl:outputTypeCalc annotations on the type. In addition, the repValue set must be explicitly defined by placing dfdl:repValues and/or dfdl:repValueRanges directly on the type.

<xs:simpleType name="fruitLocalType"
     dfdl:inputTypeCalc="{ dfdl:repTypeValue() - 2 }"
     dfdl:outputTypeCalc="{ dfdl:logicalTypeValue() + 2 }"
     dfdl:repType="tns:fruitIntType"
     dfdl:repValues="12 14"
     dfdl:repValueRanges="3 10 16 255">
  <xs:restriction base="xs:int" />
</xs:simpleType>

Note that, in the above example, a non-DFDL-aware validator will mistakenly believe that all integers are legal values. This can be resolved by explicitly specifying the set of logical values using the xsd restriction mechanism:

<xs:simpleType name="fruitLocalType"
               dfdl:inputTypeCalc="{ dfdl:repTypeValue() - 2 }"
               dfdl:outputTypeCalc="{ dfdl:logicalTypeValue() + 2 }"
               dfdl:repType="tns:fruitIntType"
               dfdl:repValues="12 14" dfdl:repValueRanges="3 10 16 255">
  <xs:union>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:enumeration value="10"/>
        <xs:enumeration value="12"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="8"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:minInclusive value="14"/>
        <xs:maxInclusive value="253"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>

Note that the only effect of adding these restrictions on the logical type is in validation.
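The logical values listed in the restrictions follow from applying the input expression (subtract 2) to each repValue. The following Python sketch works through that arithmetic; the function names simply mirror the two expressions and are not part of any API.

```python
def input_type_calc(rep: int) -> int:
    return rep - 2  # mirrors the input expression: repTypeValue() - 2


def output_type_calc(logical: int) -> int:
    return logical + 2  # mirrors the output expression: logicalTypeValue() + 2


# repValues="12 14" plus repValueRanges="3 10 16 255" (inclusive intervals).
rep_values = {12, 14} | set(range(3, 11)) | set(range(16, 256))

# Logical values obtained by transforming every repValue.
logical_values = {input_type_calc(rep) for rep in rep_values}

# Values admitted by the three xsd restrictions: {10, 12}, [1, 8], [14, 253].
declared = {10, 12} | set(range(1, 9)) | set(range(14, 254))

matches = logical_values == declared
round_trips = all(output_type_calc(input_type_calc(r)) == r for r in rep_values)
```

Both `matches` and `round_trips` evaluate to true: 12 and 14 map to 10 and 12, the interval [3, 10] maps to [1, 8], and [16, 255] maps to [14, 253].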

Interaction with xs:choice

It may be desirable to select a different transform based on the value encountered at runtime. This is possible using the union transform described above; however, that solution requires that all transforms result in the same element, thereby hiding which case was used in the generated infoset. Additionally, such a method would not allow the distinct transforms to have different output types.

As an alternative, we add two annotations to xs:choice: dfdl:choiceBranchKeyKind and dfdl:choiceDispatchKeyKind.

When dfdl:choiceBranchKeyKind is "byType", each branch of the xs:choice must be a simple element with a transform. The choice will then behave as if each element specified dfdl:choiceBranchKey as the set of repValues defined by the type of said element.

When dfdl:choiceDispatchKeyKind is "byType", we require all choice branches to be simple elements that share a common repType. We then parse the repType, and use the resulting simple value as the choiceDispatchKey.

For example:

<xs:choice dfdl:choiceBranchKeyKind="byType" dfdl:choiceDispatchKeyKind="byType">
  <xs:element name="fruit" type="tns:fruitEnumType"/>
  <xs:element name="localFruit" type="tns:fruitLocalType"/>
  <xs:element name="disused" type="tns:fruitDisusedType"/>
</xs:choice>
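The "byType" behaviour amounts to a dispatch table from repValues to branch names. The Python sketch below illustrates the idea; the repValue sets assigned to each branch are assumptions chosen to be disjoint for this example and are not derived from the type definitions earlier in this proposal.

```python
from typing import Dict, FrozenSet

# Assumed, mutually disjoint repValues for each branch's type.
BRANCH_REP_VALUES: Dict[str, FrozenSet[int]] = {
    "fruit": frozenset({0, 1}),
    "localFruit": (frozenset({12, 14})
                   | frozenset(range(3, 11))
                   | frozenset(range(16, 256))),
    "disused": frozenset({11, 13, 15}),
}


def build_dispatch_table(branches: Dict[str, FrozenSet[int]]) -> Dict[int, str]:
    """Map every repValue to the single branch whose type accepts it."""
    table: Dict[int, str] = {}
    for name, rep_values in branches.items():
        for rep in rep_values:
            if rep in table:
                raise ValueError(f"repValue {rep} matches more than one branch")
            table[rep] = name
    return table


DISPATCH = build_dispatch_table(BRANCH_REP_VALUES)


def select_branch(raw: int) -> str:
    """Choose the branch for a parsed repType value (the dispatch key)."""
    return DISPATCH[raw]
```

Under these assumed sets, a parsed value of 13 selects the "disused" branch and 200 selects "localFruit"; a value such as 2, which no branch claims, has no entry in the table.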

 

Using with explicit raw elements

It may be desirable to include both the raw and logical values in the infoset. Traditionally, this use case has been accomplished using inputValueCalc and outputValueCalc annotations. This remains the case here. To support this use case, we expose the inputTypeCalc/outputTypeCalc functions to the DFDL expression language:

<xs:sequence>
  <xs:element name="raw" type="tns:fruitRepType" dfdl:outputValueCalc="{ dfdl:outputTypeCalcInt(tns:fruitEnumType, ../fruit) }"/>
  <xs:element name="fruit" type="tns:fruitEnumType" dfdl:inputValueCalc="{ dfdl:inputTypeCalcString(tns:fruitEnumType, ../raw) }"/>
</xs:sequence>
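The intended behaviour of this sequence can be sketched in Python: on parse, the logical element is computed from the raw element, and on unparse, the raw element is recomputed from the logical one. The two-entry table and the dictionary-shaped infoset are simplifications assumed for this sketch.

```python
# Assumed subset of the fruit enumeration, for illustration only.
FRUIT_BY_CODE = {0: "Apple", 1: "Banana"}
CODE_BY_FRUIT = {name: code for code, name in FRUIT_BY_CODE.items()}


def parse(data: bytes) -> dict:
    """Parse one byte as 'raw', then compute 'fruit' from it (inputValueCalc)."""
    raw = data[0]
    return {"raw": raw, "fruit": FRUIT_BY_CODE[raw]}


def unparse(infoset: dict) -> bytes:
    """Recompute 'raw' from 'fruit' (outputValueCalc) and write it out."""
    raw = CODE_BY_FRUIT[infoset["fruit"]]
    return bytes([raw])
```

In this sketch the 'raw' value carried in the infoset is ignored on unparse, so an infoset whose 'fruit' was edited to "Banana" is written as the byte 1 regardless of the stale raw value.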

 

A more complicated example would be using a raw element with a choice of logical elements.

The only additional mechanism is the dfdl:outputTypeCalcNextSiblingInt and dfdl:outputTypeCalcNextSiblingString functions, which take the value of the following sibling and apply the outputTypeCalc function associated with the element type of that sibling:

<xs:sequence>
  <xs:element name="raw" type="tns:fruitIntType" dfdl:outputValueCalc="{ dfdl:outputTypeCalcNextSiblingInt() }" />
  <xs:choice dfdl:choiceBranchKeyKind="byType" dfdl:choiceDispatchKeyKind="explicit" dfdl:choiceDispatchKey="{ ../raw }">
    <xs:element name="fruit" type="tns:fruitType" dfdl:inputValueCalc="{ dfdl:inputTypeCalc(tns:fruitType, ../raw) }" />
    <xs:element name="localFruit" type="tns:fruitLocalType" dfdl:inputValueCalc="{ dfdl:inputTypeCalc(tns:fruitLocalType, ../raw) }" />
    <xs:element name="disused" type="tns:fruitDisusedType" dfdl:inputValueCalc="{ dfdl:inputTypeCalc(tns:fruitDisusedType, ../raw) }" />
  </xs:choice>
</xs:sequence>

In principle, this could be accomplished more generically by allowing dfdl:outputTypeCalc to take an arbitrary expression returning a path to a node, along with some form of next-sibling function (to allow for the fact that the next sibling has no constant name). However, for ease of implementation, only this more limited structure will be supported by this proposal.
