This version has been superseded by this new proposal 

A prior proposal is here.

Introduction

Much data contains numeric values that are enumerations, with corresponding logical strings that provide the symbolic interpretation of them.

In general, one or more discrete integers are to be translated into a symbolic value.

When transforming the other way, the ambiguity is resolved by using the primary value (first in the list).

Another important capability is numeric ranges. A min/max pair corresponds to a symbolic value. When transforming the other way, the ambiguity is resolved by providing the min value or, if that is not desired, by specifying a distinct value to use.

Multiple such numeric ranges may correspond to the same value.

Expressing these lookups could, theoretically, be done in DFDL's expression language. However, the practical limitations here are important.

First, there are enumerations with thousands of members. These would result in enormous expressions, and because DFDL's expression language has no constant-dispatch construct, a giant nest of if-then-else logic is the only way to express them.

Second, it is very undesirable to express a table lookup in the diffuse way that expressions would require. Normally the table would be used in one direction for parsing, and the other for unparsing. Using expressions requires distinct expressions which have basically the same information content in them.

This proposal defines new DFDL annotations that can express a table for lookups directly, compactly enough to be considered declarative. The tables can be used for parsing and for unparsing.

It has been suggested that the transformations expressed here between numeric and symbolic data would be useful in contexts outside of DFDL, and that this proposal could be formulated as an extension of XSLT or XQuery. This is certainly the case. However, the proposal is presented here first in the context of DFDL.

Proposal

Each enum value for a string simple type can be annotated with properties that give the corresponding numeric value(s) either as a discrete list, or as a numeric range.

Consider this element

<xs:element name="AltitudeSource" type="tns:AltitudeSourceType"/>

<xs:simpleType name="AltitudeSourceType">
<xs:restriction base="xs:string">
  <xs:enumeration value="Sensor"/>
  <xs:enumeration value="InstrumentRead"/>
  <xs:enumeration value="Estimated"/>
  <xs:enumeration value="Illegal"/>
</xs:restriction>
</xs:simpleType>

Now consider this example which updates the above to provide numeric representation mappings for the symbolic values. Note that it uses more than one key in some cases, and uses numeric ranges for others:

<xs:simpleType name="AltitudeSourceType"
  dfdl:repType="tns:AltitudeSourceIntType"> <!-- the rep type -->
  <xs:restriction base="xs:string"> <!-- the logical type -->
    <xs:enumeration value="Sensor" dfdl:lookupKey="1"/>
    <xs:enumeration value="InstrumentRead" dfdl:lookupKey="2"/>
    <xs:enumeration value="Estimated" dfdl:lookupKey="3 4 5 6 7"/>
    <xs:enumeration value="Illegal"
      dfdl:lookupRange="8 255 512 1023"
      dfdl:lookupKey="255"/>
    <xs:enumeration value="Reserved"
      dfdl:lookupRange="0 0 256 511"
      dfdl:lookupKey="511"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="AltitudeSourceIntType"><!-- all properties for a simple type can go here -->
  <xs:restriction base="xs:int">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="1023"/>
  </xs:restriction>
</xs:simpleType>

The above uses short form style of DFDL annotations. The long form would look like:

 <xs:enumeration value="Sensor">
   <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl">
     <dfdl:enumeration lookupKey="1"/>
   </xs:appinfo></xs:annotation>
 </xs:enumeration>

A dfdl:lookupKey is a whitespace separated list of values. The type of the values is the repType. (This is a generalization from allowing only integers.) It is possible to translate integers to strings with this, or to translate strings to integers, or strings to strings, or integers to integers. When unparsing, the first value of the list is used.

A dfdl:lookupRange is a whitespace separated list of pairs of values of the repType, which must be derived from xs:integer or a subtype thereof. These are alternating min0 max0 min1 max1 for as many ranges as are needed. The list must be of even length. The intervals are inclusive only, and only integers are allowed as the lookup keys for any type where any xs:enumeration has a dfdl:lookupRange.

When both dfdl:lookupRange and dfdl:lookupKey are specified, they are combined to create the aggregate set of values and ranges for parsing. When unparsing, the first dfdl:lookupKey value is used. If no dfdl:lookupKey is specified, then the lowest value of the first specified dfdl:lookupRange is used.

Given these annotations, a DFDL processor can provide a logical string in the infoset, where the underlying representation is integer. Unparsing inverts the logical value back to a physical integer.
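The parse and unparse rules above can be sketched in a few lines of Python. This is only an illustration of the intended semantics, not part of any DFDL implementation. The names EnumEntry, ENTRIES, parse_lookup and unparse_lookup are hypothetical, and the sketch assumes that the 511 on the Reserved enumeration is meant as a dfdl:lookupKey.

```python
# Hypothetical sketch of the lookupKey / lookupRange semantics described above.
from typing import List, Tuple


class EnumEntry:
    """One xs:enumeration with its (assumed) lookupKey and lookupRange lists."""

    def __init__(self, value: str, keys: str = "", ranges: str = "") -> None:
        self.value = value
        self.keys: List[int] = [int(k) for k in keys.split()]
        nums = [int(n) for n in ranges.split()]
        if len(nums) % 2 != 0:
            raise ValueError("lookupRange list must be of even length")
        # Pair up min0 max0 min1 max1 ... into inclusive intervals.
        self.ranges: List[Tuple[int, int]] = list(zip(nums[0::2], nums[1::2]))


# Mirrors the AltitudeSourceType example.
ENTRIES = [
    EnumEntry("Sensor", keys="1"),
    EnumEntry("InstrumentRead", keys="2"),
    EnumEntry("Estimated", keys="3 4 5 6 7"),
    EnumEntry("Illegal", keys="255", ranges="8 255 512 1023"),
    EnumEntry("Reserved", keys="511", ranges="0 0 256 511"),  # 511 assumed to be a lookupKey
]


def parse_lookup(rep: int) -> str:
    """Physical integer -> logical string (keys and ranges are combined)."""
    for e in ENTRIES:
        if rep in e.keys or any(lo <= rep <= hi for lo, hi in e.ranges):
            return e.value
    raise LookupError(f"processing error: no mapping for {rep}")


def unparse_lookup(logical: str) -> int:
    """Logical string -> physical integer (first key, else min of first range)."""
    for e in ENTRIES:
        if e.value == logical:
            return e.keys[0] if e.keys else e.ranges[0][0]
    raise LookupError(f"processing error: no mapping for {logical!r}")
```

For instance, parse_lookup(600) falls in the 512 to 1023 interval and yields "Illegal", while unparse_lookup("Illegal") yields 255 because a lookupKey is present and takes precedence over the ranges.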

The integer value can be validated, as can the string value.

Simple types can be combined in unions. In that case, neither the dfdl:lookupKey values nor the dfdl:lookupRange intervals may overlap across the different member types of the union. Simple types can also extend each other; again, the values cannot overlap. (Non-overlap is the most conservative design choice: it allows us to loosen the restriction in the future and enable some sort of overriding or combining if desirable.)

If the integer being looked up is not in a range interval or a dfdl:lookupKey property's value for some xs:enumeration, then it is a processing error. By specifying facets on the key element's type or element declaration, one can ensure that any valid value has a corresponding mapping.

If a simple type has dfdl:lookupKey or dfdl:lookupRange specified on some xs:enumerations, but other xs:enumerations have neither dfdl:lookupKey nor dfdl:lookupRange, then it is a schema definition error. Either all the enumerations for a type have a dfdl:lookupKey (or dfdl:lookupRange) annotating them, or none do.

If any enumeration has dfdl:lookupRange, then all values of dfdl:lookupKey attributes must be integer.
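The schema definition error rules above (all-or-none annotation, even-length range lists, and integer-only keys when any range is present) could be checked along the following lines. This is a minimal sketch under stated assumptions: the function name check_enumerations and the tuple layout (value, lookupKey, lookupRange) are hypothetical, with None standing for an absent property.

```python
# Hypothetical static check of the rules stated above; not an actual DFDL API.
from typing import List, Optional, Tuple

# (enumeration value, lookupKey string or None, lookupRange string or None)
Enum = Tuple[str, Optional[str], Optional[str]]


def _is_int(token: str) -> bool:
    try:
        int(token)
        return True
    except ValueError:
        return False


def check_enumerations(enums: List[Enum]) -> List[str]:
    """Return a list of schema definition error messages (empty if none)."""
    errors: List[str] = []

    # Either every enumeration is annotated, or none is.
    annotated = [e for e in enums if e[1] is not None or e[2] is not None]
    if annotated and len(annotated) < len(enums):
        errors.append("some but not all enumerations have lookupKey/lookupRange")

    has_range = any(r is not None for _, _, r in enums)
    for value, keys, ranges in enums:
        if ranges is not None:
            tokens = ranges.split()
            if len(tokens) % 2 != 0:
                errors.append(f"{value}: lookupRange list must be of even length")
            if not all(_is_int(t) for t in tokens):
                errors.append(f"{value}: lookupRange values must be integers")
        # Once any range is present, every lookupKey must be an integer.
        if has_range and keys is not None:
            if not all(_is_int(k) for k in keys.split()):
                errors.append(f"{value}: lookupKey values must be integers")
    return errors
```

The non-overlap rule for unions and derived types is not covered here; it would need the full set of member types as input.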

If a simpleType has a repType property, then the type referenced by the repType cannot itself have a repType.

Note that the above can be implemented without use of advanced DFDL features like dfdl:inputValueCalc. The simple existence of an element with a dfdl:repType property would enable an implementation of this table-lookup capability without the need for a complete implementation of DFDL's expression language.

Recasting the Proposal for use Outside of DFDL Context

If the above were re-cast for use in XSLT or XQuery, or in Schematron assertions, etc., then the sensible thing would be to provide these same annotations on an XSD schema, along with an XSLT function that can be called, passing a value of the physical type to get out a value of the logical type, or vice versa. The function would be given the type name of the logical type. E.g., assuming namespace prefix f, the functions might be

f:lookupRep(logicalTypeQName, ...value of the rep type...) returns value of the logical type
f:lookupValue(logicalTypeQName, ... value of the logical type... ) returns value of the rep type

Multi-Dimensional and other Complex Lookups and Function

Note: This is a more advanced feature. Not initially part of the proposal, but here for initial feedback.

Consider a logical 2-dimensional table lookup.

The specification of the data format has a 2-d table giving the enumerated constant in terms of 2 different other fields.

Consider this table which defines a derived element value called 'AltitudePrecision' which is an enumeration of VeryHigh, High, Medium, Low, and NoStatement.

               | Boeing747 | Drone2 | SopwithCamel | NoStatement
---------------------------------------------------------------
Sensor         | VeryHigh  | High   | Low          | High
InstrumentRead | High      | NS     | Low          | Medium
Estimated      | Low       | Low    | Low          | Low
Illegal        | NS        | NS     | NS           | NS
Reserved       | *         | *      | *            | *

In the above, NS means No Statement, and * is treated as NS: it is a placeholder until the reserved values are assigned meanings.

The rows use the values of our AltitudeSource described above. The cells contain values corresponding to the column for the Platform element which is an enumeration with values

Boeing747, Drone2, SopwithCamel, and NoStatement.

This is intended to be analogous to many of the table-lookup situations in data standards such as NATO STANAG 5516.

Given the above definition for AltitudeSource, we need this similar definition for Platform.

<xs:element name="Platform" type="tns:PlatformType"/>

<xs:simpleType name="PlatformType" dfdl:repType="tns:PlatformIntType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Boeing747" dfdl:lookupKey="1"/>
      <xs:enumeration value="Drone2" dfdl:lookupKey="2"/>
      <xs:enumeration value="SopwithCamel" dfdl:lookupKey="3"/>
      <xs:enumeration value="NoStatement" dfdl:lookupKey="0"/>
    </xs:restriction>
</xs:simpleType>

<xs:simpleType name="PlatformIntType">
  <xs:restriction base="xs:int">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="3"/>
  </xs:restriction>
</xs:simpleType>

Now we can define the AltitudePrecision.

First we need a literal XML equivalent of the 2D table. What is proposed here is a syntax for defining a function, its parameters, and static data that it references.

<dfdl:defineFunction name="AltitudePrecisionTable"
  root="dfdl:Table1"
  params="AltitudeSource Platform"
  xpath="$AltitudeSource/$Platform"><![CDATA[
                | Boeing747 | Drone2 | SopwithCamel | NoStatement
Sensor          | VeryHigh  | High   | ...          | ...
InstrumentRead  | High      | NS     | ...          | ....
...
]]></dfdl:defineFunction>

This defining form is intended to allow the creation of static data together with a parameterized function whose code (in XPath) can use both the static data and the parameters.

A variant of this would allow one to just embed the XML data directly, but being able to parse a string representation of the data using DFDL allows us to create tables that look like tables.

The root attribute gives the QName of a DFDL schema element for parsing the static data. If it is not provided, the contents of the element itself are assumed to be XML corresponding to the static data desired.

(Using DFDL for this is what is commonly called "eating your own dogfood".)

The resulting 'table' is the DFDL infoset given by parsing the string value of the defineFunction element as that root element.

The params defines argument names for use in the query.

The xpath is a relative XPath expression (not DPath!) starting from the root element, in which the variables will be substituted with their values to provide the path to the value of interest.

(An option could allow the more powerful xquery language also.)

In this case, you get an XML Infoset that looks like

 <dfdl:Table1>
  <Sensor><Boeing747>VeryHigh</Boeing747><Drone2>High</Drone2>...</Sensor>
  <InstrumentRead><Boeing747>High</Boeing747><Drone2>NS</Drone2>...</InstrumentRead>
  ...
</dfdl:Table1>

This table is referenced using a function call:

dfdl:callFunction("tns:AltitudePrecisionTable", ../AltitudeSource, ../Platform)

The dfdl:inputValueCalc expression can be used to populate an element with this value.
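How the parameters are substituted into the relative path can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration only. The helper call_function is hypothetical and stands in for dfdl:callFunction, the table root is written without a namespace prefix for brevity, and only the first two rows of the table are included.

```python
# Hypothetical sketch of the 2-D table lookup; not an actual DFDL API.
import xml.etree.ElementTree as ET

# The infoset that parsing the table text would be expected to produce
# (first two rows only, values taken from the table above).
TABLE_XML = """
<Table1>
  <Sensor>
    <Boeing747>VeryHigh</Boeing747><Drone2>High</Drone2>
    <SopwithCamel>Low</SopwithCamel><NoStatement>High</NoStatement>
  </Sensor>
  <InstrumentRead>
    <Boeing747>High</Boeing747><Drone2>NS</Drone2>
    <SopwithCamel>Low</SopwithCamel><NoStatement>Medium</NoStatement>
  </InstrumentRead>
</Table1>
"""

TABLE = ET.fromstring(TABLE_XML)
PARAMS = "AltitudeSource Platform"
XPATH = "$AltitudeSource/$Platform"


def call_function(*args: str) -> str:
    """Substitute the arguments for the $params and evaluate the relative path."""
    path = XPATH
    for name, arg in zip(PARAMS.split(), args):
        path = path.replace("$" + name, arg)
    node = TABLE.find(path)  # e.g. "Sensor/Drone2", relative to the root
    if node is None:
        raise LookupError(f"no table entry for {path}")
    return node.text
```

With this sketch, call_function("Sensor", "Boeing747") evaluates the path Sensor/Boeing747 against the table root and returns "VeryHigh".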

To achieve the inverse lookup at unparse time requires a different table, or requires that the ../Platform and ../AltitudeSource elements are themselves members of the Infoset, so they need not be computed.

Now, the above works only if the param values are acceptable as the NCNames of elements. That would be common, but not universally true.

If not, then a more complex table and query is needed, and the non-NCName values must appear as the values of elements e.g.,

<dfdl:KVTable>
  <Pair>
    <Key>Sensor</Key>
    <Value><Pair><Key>Boeing747</Key><Value>VeryHigh</Value></Pair><Pair>...
  </Pair>
  ...
</dfdl:KVTable>

Then we want the xpath to be:

Pair[Key eq $AltitudeSource]/Value/Pair[Key eq $Platform]/Value

to compute the value. This is potentially less performant, as it's not obvious that the lookup of Pair where the Key element has a specific value is going to be O(1) i.e., constant time.
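The key/value form of the lookup, and one way an implementation might sidestep the linear scan, can be sketched as follows. This is a hedged illustration, not a proposed API. The names lookup_xpath and build_index are hypothetical, the table root is written without a namespace prefix, and ElementTree's [Key='...'] predicate is used as a stand-in for the Key eq $param comparison.

```python
# Hypothetical sketch of the key/value table lookup; not an actual DFDL API.
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

KV_XML = """
<KVTable>
  <Pair>
    <Key>Sensor</Key>
    <Value>
      <Pair><Key>Boeing747</Key><Value>VeryHigh</Value></Pair>
      <Pair><Key>Drone2</Key><Value>High</Value></Pair>
    </Value>
  </Pair>
  <Pair>
    <Key>InstrumentRead</Key>
    <Value>
      <Pair><Key>Boeing747</Key><Value>High</Value></Pair>
      <Pair><Key>Drone2</Key><Value>NS</Value></Pair>
    </Value>
  </Pair>
</KVTable>
"""

ROOT = ET.fromstring(KV_XML)


def lookup_xpath(altitude_source: str, platform: str) -> str:
    """Path-based lookup; each predicate may scan its sibling Pair elements."""
    path = f"Pair[Key='{altitude_source}']/Value/Pair[Key='{platform}']/Value"
    node = ROOT.find(path)
    if node is None:
        raise LookupError(f"no entry for ({altitude_source}, {platform})")
    return node.text


def build_index(root: ET.Element) -> Dict[Tuple[str, str], str]:
    """Precompute a dictionary so that later lookups are hash-based."""
    index: Dict[Tuple[str, str], str] = {}
    for outer in root.findall("Pair"):
        row_key = outer.findtext("Key")
        for inner in outer.findall("Value/Pair"):
            index[(row_key, inner.findtext("Key"))] = inner.findtext("Value")
    return index


INDEX = build_index(ROOT)
```

Because the table is static data, an index such as INDEX could be built once when the schema is compiled, after which INDEX[("Sensor", "Drone2")] is an ordinary dictionary lookup.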

 

 
