Describes the feature as-is-built 2021-09-09

Motivation

The Daffodil Java and Scala APIs require the use of user-allocated objects that are passed into the Daffodil parse/unparse functions. Examples of such objects include implementations of the InputSourceDataInputStream, InfosetInputter, and InfosetOutputter interfaces. These interfaces are designed so that users can implement custom inputs and outputs that require no knowledge from Daffodil aside from the interface functions. For example, a user could implement a custom InfosetInputter and InfosetOutputter to support EXI or YAML infoset representations, which Daffodil does not natively support, without Daffodil needing to know the implementation details.

It may sometimes be desirable for these custom implementations to have configurable behavior based on annotations applied to the DFDL schema. Some use cases include:

  • Define "dirty words" on the schema for particular elements which should be redacted/removed when output from, or input to, an infoset
  • Define a basic transformation to be applied on some infoset elements, e.g. uppercase/lowercase
  • Define a complex transformation, such as converting a simple element string to actual XML nodes (i.e. non-escaped XML)

In each of these cases, it is not Daffodil that performs such redactions/transformations, but the InfosetInputter and/or InfosetOutputter. However, such classes are not necessarily schema aware, so they likely have no information about which elements must be transformed or how to perform those transformations.

The following proposal suggests a way to add generic annotations to a DFDL schema which are then passed into the InfosetInputter and InfosetOutputter, allowing custom input/output behavior to be specified on the DFDL schema.

Implementation

A new extension property called dfdlx:runtimeProperties is added. Its value is a space-separated list of key/value pairs, with each key and value separated by an equals sign (=). For example:

<xs:element name="elem" type="xs:string" dfdlx:runtimeProperties="key1=value1 key2=value2" ... />

Because the use cases only involve transforming infoset text, this property is valid only on elements with a simple type.

At schema compilation time, these key/value pairs are parsed and converted to a Map. If a duplicate key is found in the list, the earlier value is discarded and the later one is kept. If dfdlx:runtimeProperties is not defined, an empty Map is used.
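The parsing rules above can be sketched as follows. This is a minimal illustration, not Daffodil's actual schema compiler code: the RuntimePropertiesParser object and its parse function are hypothetical names, the result is a plain Scala Map, and rejecting a token without an equals sign is an assumption, since the handling of malformed pairs is not described here.

```scala
object RuntimePropertiesParser {
  // Hypothetical helper: turns "key1=value1 key2=value2" into a Map.
  def parse(value: String): Map[String, String] = {
    value.trim
      .split("\\s+")          // pairs are separated by whitespace
      .filter(_.nonEmpty)     // an empty property value yields no pairs
      .map { pair =>
        // Split on the first '=' only, so a value may itself contain '='.
        pair.split("=", 2) match {
          case Array(k, v) => k -> v
          case _ =>
            // Assumption: a token without '=' is treated as an error.
            throw new IllegalArgumentException(s"Expected key=value but found: $pair")
        }
      }
      .toMap                  // for duplicate keys, the later pair wins
  }
}
```

Because toMap adds entries in order, a later occurrence of a key replaces an earlier one, which matches the duplicate-key behavior described above.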

The Map, whether empty or not, is added as a new member of the ElementRuntimeData class called runtimeProperties. Because this property is valid only on simple elements, this member is always the empty Map for complex ElementRuntimeData instances.

Parse

When an InfosetOutputter needs to output a simple element, Daffodil calls the startSimple method, passing in the DISimple. For InfosetOutputter implementations that wish to alter how the simple text is output, the runtimeProperties map can be accessed from the erd member of the DISimple parameter. For example, using the Scala API:

class MyInfosetOutputter extends InfosetOutputter {
  ...
  override def startSimple(simple: DISimple): Boolean = {
    val runtimeProperties = simple.erd.runtimeProperties
    val key1Value = runtimeProperties.getOrDefault("key1", "defaultValue1")
    val key2Value = runtimeProperties.getOrDefault("key2", "defaultValue2")
    val simpleText = simple.dataValueAsString
    // redact or transform simpleText based on key1/key2 values, and then output simpleText
    ...
  }
  ...
}
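The "redact or transform" step is left open above. As a rough sketch of what such a step might look like, the following helper applies two of the use cases from the Motivation section (dirty-word redaction and uppercase/lowercase conversion). The property names "redact" and "case" and the SimpleTextTransforms object are purely illustrative assumptions; Daffodil does not define any property names itself.

```scala
import java.util.{Map => JMap}

object SimpleTextTransforms {
  // Hypothetical property names: "redact" holds a single word to mask,
  // "case" is "upper" or "lower". Neither is defined by Daffodil.
  def transform(text: String, props: JMap[String, String]): String = {
    // Mask every occurrence of the dirty word with asterisks of equal length.
    val redacted = props.getOrDefault("redact", "") match {
      case ""   => text
      case word => text.replace(word, "*" * word.length)
    }
    // Apply the optional case conversion after redaction.
    props.getOrDefault("case", "preserve") match {
      case "upper" => redacted.toUpperCase
      case "lower" => redacted.toLowerCase
      case _       => redacted
    }
  }
}
```

An InfosetOutputter could call such a helper from startSimple with simple.dataValueAsString and simple.erd.runtimeProperties before writing the text out. Note that because pairs in dfdlx:runtimeProperties are space-separated, a value such as the redacted word cannot itself contain spaces.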

Unparse

When Daffodil needs the simple text of a simple element during unparse, it calls the getSimpleText method on the InfosetInputter. A new function is added to the InfosetInputter API with the same getSimpleText name as the existing function, but taking two parameters. The first is the NodeInfo.Kind, as in the existing getSimpleText function. The second is the runtime properties Map. For backwards compatibility, a default implementation of this function is added that calls the existing single-argument getSimpleText function:

abstract class InfosetInputter ... {
  def getSimpleText(primNode: NodeInfo.Kind, runtimeProperties: java.util.Map[String, String]): String = {
    getSimpleText(primNode)
  }
}
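The backwards-compatibility behavior can be illustrated with stand-in classes. Nothing below is Daffodil code: SketchInputter, LegacyInputter, and AwareInputter are hypothetical, a plain String stands in for NodeInfo.Kind, and the "case" property is an invented example. The sketch only shows that an implementation written against the old single-argument method keeps working when called through the new two-argument overload.

```scala
import java.util.{Collections, Map => JMap}

// Stand-in for InfosetInputter; only the two getSimpleText overloads are modeled.
abstract class SketchInputter {
  def getSimpleText(primKind: String): String

  // Default implementation ignores the properties and delegates to the old method.
  def getSimpleText(primKind: String, runtimeProperties: JMap[String, String]): String =
    getSimpleText(primKind)
}

// Written before the new overload existed: overrides only the old method.
class LegacyInputter extends SketchInputter {
  override def getSimpleText(primKind: String): String = "legacy text"
}

// Overrides the new overload to make use of the runtime properties.
class AwareInputter extends SketchInputter {
  override def getSimpleText(primKind: String): String = "raw text"

  override def getSimpleText(primKind: String, runtimeProperties: JMap[String, String]): String = {
    val text = getSimpleText(primKind)
    // "case" is a hypothetical property name used only for this illustration.
    if (runtimeProperties.getOrDefault("case", "") == "upper") text.toUpperCase else text
  }
}
```

With this arrangement the caller can always invoke the two-argument form; legacy implementations silently ignore the properties, while newer ones may act on them.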

InfosetInputter implementations that want to redact/transform simple text before returning it to Daffodil to be unparsed can override this new method and use the runtimeProperties. For example:

class MyInfosetInputter extends InfosetInputter {
  override def getSimpleText(primNode: NodeInfo.Kind, runtimeProperties: java.util.Map[String, String]): String = {
    val simpleText = ... // get the simple text for this current event
    val key1Value = runtimeProperties.getOrDefault("key1", "defaultValue")
    val key2Value = runtimeProperties.getOrDefault("key2", "defaultValue")
    // redact or transform simpleText based on key1/key2 values, and return
    val transformedSimpleText = ...
    transformedSimpleText
  }
}

Example Implementation

The primary use case for this feature is the ability to parse a simple element with an xs:string type that is expected to contain XML content. Rather than escaping the XML string and treating it like simple content, we instead want to output it as if it were part of the XML infoset. Similarly, when unparsing, we want to treat all the children of a particular infoset element as if they were raw text so that the element unparses as a normal string.

To accomplish this with the new dfdlx:runtimeProperties feature, we can annotate the specific elements whose string value should be treated as if it were XML, like so:

<xs:schema ...>

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        ...
        <xs:element name="payload" type="xs:string" dfdlx:runtimeProperties="stringAsXML=true" ... />
        ...
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

This annotates the payload element as one where the string value should be treated as XML.

Different InfosetInputters and InfosetOutputters could handle this property differently (or not at all if they choose), but one possible implementation based on Scala XML Nodes might look like the following (note that error checking and XML validation are omitted for brevity).

class ScalaXMLWithStringAsXMLInfosetOutputter extends InfosetOutputter {
  private val elemStack = ...
  ...
  override def startSimple(simple: DISimple): Boolean = {
    val text = simple.dataValueAsString
    val children =
      if (simple.erd.runtimeProperties.getOrDefault("stringAsXML", "false") == "true") {
        // loadString parses XML text; load(String) would treat the argument as a system ID
        scala.xml.XML.loadString(text)
      } else {
        new scala.xml.Text(text)
      }
    val elem = scala.xml.Elem(
      simple.erd.prefix,
      simple.erd.name,
      ...
      Seq(children)
    )
    elemStack.push(elem)
    true
  }
  ...
}

The above InfosetOutputter converts the string to Scala XML Nodes if the runtime property is set to true; otherwise it treats the text as a normal scala.xml.Text. A new Elem is created with this as its child and pushed onto a stack, which is ultimately used to build the final infoset.

The unparse side looks like this:

class ScalaXMLWithStringAsXMLInfosetInputter extends InfosetInputter {
  private val elemStack = ...
  ...
  override def getSimpleText(primType: NodeInfo.Kind, runtimeProperties: java.util.Map[String,String]): String = {
    val sb = new StringBuilder()
    val curElem = elemStack.top
    val childrenIter = curElem.child.iterator
    val stringAsXML = runtimeProperties.getOrDefault("stringAsXML", "false") == "true"
    while (childrenIter.hasNext) {
      childrenIter.next() match {
        case txt: scala.xml.Text => txt.addString(sb)
        case elem: scala.xml.Elem if stringAsXML => sb.append(elem.toString)
        case _ => throw new NonTextFoundInSimpleContentException(...)
      }
    }
    sb.toString()
  }
  ...
}

The above InfosetInputter returns the text of the current simple element. If the runtime property is set, it also consumes child elements, converts each to a string, and appends that string to the builder whose contents are returned as the simple text.
