
The Java org.xml.sax SAX API is a well-known and well-understood API for handling XML data in an event-driven manner. Because the SAX API is event based, unlike tree-based approaches such as DOM, it allows for efficient processing of XML with reduced memory usage.

This page proposes changes to Daffodil to add support for parsing and unparsing in conformance with the SAX API.

Note that this proposal does not address issues related to reducing Daffodil memory usage (a common benefit of the SAX API), such as creating InfosetOutputter events as early as possible rather than at the end of a parse, or allowing parts of the internal infoset representation to be garbage collected. Those issues will be resolved in separate proposals.

Requirements

  • Daffodil shall implement the Java SAX API in accordance with the org.xml.sax SAX API documentation
  • Daffodil shall maintain support for existing APIs
  • Daffodil shall create SAX events representing a DFDL infoset while parsing data
  • Daffodil shall receive SAX events representing a DFDL Infoset to unparse data

SAX Overview

The two main components of a SAX API are the XMLReader and ContentHandler interfaces.

The XMLReader defines an interface for reading/parsing XML documents, including the entry point to begin the parse as well as functions to configure some aspects of the parse. Although this interface is named "XMLReader" and is traditionally used for parsing XML data, the interface itself does not strictly require that an implementation actually parse XML. It only requires that the XMLReader implementation create XML events. How it determines when and which XML events to create is entirely up to the implementation.

...

Fortunately, these two SAX API components translate quite nicely to Daffodil's concepts of a Parser and Unparser. Much like an XMLReader, a Parser reads in data and converts it to XML. And much like a ContentHandler, an Unparser reads in XML and takes actions (i.e. unparses data) based on the XML. The following sections describe how Daffodil implements XMLReader and ContentHandler to support parse and unparse.
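To make the push-style event flow concrete, here is a minimal sketch that uses only the JDK's built-in SAX parser, not Daffodil. The names RecordingHandler and SaxPushDemo are made up for this illustration. It registers a ContentHandler with an XMLReader and records the callbacks the reader pushes to it.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ArrayBuffer

// Hypothetical handler: records each SAX callback as a string so the
// order in which the XMLReader pushes events can be inspected.
class RecordingHandler extends DefaultHandler {
  val events = ArrayBuffer[String]()
  override def startDocument(): Unit = events += "startDocument"
  override def endDocument(): Unit = events += "endDocument"
  override def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit =
    events += s"startElement($qName)"
  override def endElement(uri: String, localName: String, qName: String): Unit =
    events += s"endElement($qName)"
  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    val text = new String(ch, start, length).trim
    if (text.nonEmpty) events += s"characters($text)"
  }
}

object SaxPushDemo {
  // Parses the XML string with the JDK's XMLReader and returns the recorded events.
  def record(xml: String): Seq[String] = {
    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true)
    val reader = factory.newSAXParser().getXMLReader()
    val handler = new RecordingHandler
    reader.setContentHandler(handler)
    reader.parse(new InputSource(new StringReader(xml)))
    handler.events.toSeq
  }

  def main(args: Array[String]): Unit =
    record("<a><b>1</b></a>").foreach(println)
}
```

The application never asks for the next event; the reader decides when each callback fires. That is the property the Parse side can reuse directly and the Unparse side has to adapt.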

Parse

XMLReader

The XMLReader interface requires various getters and setters for internal mutable state. For example, the ContentHandler is set with the setContentHandler(...) method. Although it might feel natural to modify the DataProcessor to implement the XMLReader interface, the DataProcessor is considered to be immutable, which conflicts with the mutability of the XMLReader. For this reason, a new function is added to the DataProcessor to create an implementation of the XMLReader called a DaffodilParseXMLReader, for example:

Code Block
scala
val dataProcessor = processorFactory.onPath("/")
val xmlReader = dataProcessor.newXMLReaderInstance()

...

Code Block
scala
class DataProcessor(...) {
  ...
  def newXMLReaderInstance(): DaffodilParseXMLReader = new DaffodilParseXMLReader(this)
}

class DaffodilParseXMLReader(dp: DataProcessor) extends XMLReader {
  ...
}

The XMLReader interface methods, and the details specific to the Daffodil implementation, are described below. See the XMLReader documentation for more details on how an XMLReader should behave for particular functions.

ContentHandler getContentHandler()
Return the current content handler.

DTDHandler getDTDHandler()
Return the current DTD handler.

EntityResolver getEntityResolver()
Return the current entity resolver.

ErrorHandler getErrorHandler()
Return the current error handler.

boolean getFeature(String name)
Look up the value of a feature flag. The only two features that are implemented are http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes, as required by the XMLReader interface. All other features shall throw a SAXNotRecognizedException.

Object getProperty(String name)
Look up the value of a property URN. The supported properties are listed under setProperty(String name, Object value) below.

void parse(InputSource input)
Parse data from an InputSource. The InputSource must be backed by an InputStream. The getByteStream() method must return non-null or an IOException shall be thrown. This shall call the custom parse(InputStream) method described below.

void parse(String systemId)
This function is not supported. If called, this shall throw an IOException.

void setContentHandler(ContentHandler handler)
Store the parameter in local state. This handler will receive the SAX events created by Daffodil.

void setDTDHandler(DTDHandler handler)
Store the parameter in local state. Note that Daffodil will never use the DTDHandler except when getDTDHandler() is called.

void setEntityResolver(EntityResolver resolver)
Store the parameter in local state. Note that Daffodil will never use the EntityResolver except when getEntityResolver() is called.

void setErrorHandler(ErrorHandler handler)
Store the parameter in local state. The handler.error() callback is used where diagnostics.isError is true. The handler.warning() callback is used for any other diagnostics state.

void setFeature(String name, boolean value)
Set the value of a feature flag. The only two features that are implemented are http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes, as required by the XMLReader interface. All other features shall throw a SAXNotRecognizedException.

void setProperty(String name, Object value)
Set the value of a property. Only the properties below can be set. All other properties shall throw a SAXNotRecognizedException. A property value must be of the type defined below; otherwise a SAXNotSupportedException shall be thrown.

| Property | Value Type |
| --- | --- |
| BlobDirectory | java.nio.file.Path |
| BlobPrefix | String |
| BlobSuffix | String |

Info: ParseResult cannot be set externally.

In addition to the above functions, the following functions support other input types that Daffodil supports, which may allow for some optimizations.

void parse(InputStream stream)
Creates an InputSourceDataInputStream based on the stream and calls the DaffodilParseXMLReader.parse(InputSourceDataInputStream) method.

void parse(Array[Byte] arr)
Creates an InputSourceDataInputStream based on the array and calls the DaffodilParseXMLReader.parse(InputSourceDataInputStream) method.

void parse(InputSourceDataInputStream isdis)
Creates a SAXInfosetOutputter (see below) based on the DaffodilParseXMLReader and calls the DataProcessor parse method.
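The feature-flag rule described for getFeature and setFeature can be sketched in isolation. The class name SaxFeatureFlags is hypothetical. The initial values follow the SAX2 defaults (namespaces on, namespace-prefixes off), which is an assumption here because this page does not state Daffodil's initial values.

```scala
import org.xml.sax.SAXNotRecognizedException
import scala.collection.mutable

// Hypothetical helper, not Daffodil code: only the two features required by
// the XMLReader interface are recognized; everything else is rejected.
class SaxFeatureFlags {
  private val Namespaces = "http://xml.org/sax/features/namespaces"
  private val NamespacePrefixes = "http://xml.org/sax/features/namespace-prefixes"

  // Assumed initial values (the SAX2 defaults).
  private val flags = mutable.Map(Namespaces -> true, NamespacePrefixes -> false)

  def getFeature(name: String): Boolean =
    flags.getOrElse(name, throw new SAXNotRecognizedException(s"Feature unsupported: $name"))

  def setFeature(name: String, value: Boolean): Unit =
    if (flags.contains(name)) flags(name) = value
    else throw new SAXNotRecognizedException(s"Feature unsupported: $name")
}
```

An XMLReader implementation could delegate its getFeature and setFeature methods to a helper like this, which keeps the mutable state out of the immutable DataProcessor.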

SAXInfosetOutputter

The SAXInfosetOutputter is an implementation of the Daffodil InfosetOutputter interface responsible for converting InfosetOutputter events to SAX ContentHandler events. According to the SAX API, applications may register a new or different ContentHandler with the XMLReader in the middle of a parse, and the SAX parser must begin using the new handler immediately. Because of this, the SAXInfosetOutputter must take the XMLReader as a parameter, and any time a SAX event is generated, it must call getContentHandler() on that parameter. The definition for this class looks like:

Code Block
scala
class SAXInfosetOutputter(xmlReader: DaffodilParseXMLReader) extends InfosetOutputter with XMLInfosetOutputter { ... }

Fortunately, the InfosetOutputter events correlate nicely to the ContentHandler events. Below is their mapping. Note that in some cases a single InfosetOutputter event may require calling multiple ContentHandler events.

| InfosetOutputter API | ContentHandler API |
| --- | --- |
| startDocument() | startDocument() |
| endDocument() | endDocument() |
| startSimple() | startPrefixMapping(...) (optional, only when a new namespace mapping is added); startElement(...); characters(...) |
| endSimple() | endElement(...); endPrefixMapping(...) (optional, only when a namespace mapping is removed) |
| startComplex() | startPrefixMapping(...) (optional, only when a new namespace mapping is added); startElement(...) |
| endComplex() | endElement(...); endPrefixMapping(...) (optional, only when a namespace mapping is removed) |
| startArray() | no-op |
| endArray() | no-op |

Other functions in the ContentHandler interface are not used.
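The requirement that a newly registered ContentHandler take effect immediately can be shown with a small stand-alone sketch. SketchSaxOutputter, NamedRecorder and HandlerSwapDemo are hypothetical names, the outputter is much simpler than Daffodil's InfosetOutputter, and prefix mappings are left out. The JDK's XMLFilterImpl serves only as a stand-in XMLReader that stores the handler.

```scala
import org.xml.sax.{Attributes, ContentHandler, XMLReader}
import org.xml.sax.helpers.{AttributesImpl, DefaultHandler, XMLFilterImpl}
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified outputter: it holds the XMLReader instead of a
// ContentHandler so that a handler swapped in mid-parse is used right away.
class SketchSaxOutputter(xmlReader: XMLReader) {
  // Looked up again for every event, never cached.
  private def handler: ContentHandler = xmlReader.getContentHandler()

  // startSimple maps to startElement followed by characters.
  def startSimple(name: String, text: String): Unit = {
    handler.startElement("", name, name, new AttributesImpl())
    handler.characters(text.toCharArray, 0, text.length)
  }

  // endSimple maps to endElement.
  def endSimple(name: String): Unit = handler.endElement("", name, name)
}

// Records which handler instance received which element event.
class NamedRecorder(label: String, log: ArrayBuffer[String]) extends DefaultHandler {
  override def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit =
    log += s"$label:startElement($qName)"
  override def endElement(uri: String, localName: String, qName: String): Unit =
    log += s"$label:endElement($qName)"
}

object HandlerSwapDemo {
  def run(): Seq[String] = {
    val log = ArrayBuffer[String]()
    val reader = new XMLFilterImpl() // stand-in XMLReader, only stores the handler
    val out = new SketchSaxOutputter(reader)
    reader.setContentHandler(new NamedRecorder("first", log))
    out.startSimple("a", "1")
    reader.setContentHandler(new NamedRecorder("second", log)) // swapped mid-parse
    out.endSimple("a")
    log.toSeq
  }
}
```

Had the outputter cached the handler at construction time, the second recorder would never see the endElement event.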

Unparse

ContentHandler

The ContentHandler interface is used to receive and react to SAX XML events. To unparse through this interface, Daffodil must drive the unparse from the events it receives. However, the unparser and InfosetInputters are designed the opposite way: rather than receiving events, the unparser and InfosetInputter request the next event. This is essentially push vs pull, or SAX vs StAX. To support unparsing based on SAX events, we must convert these push-style SAX events into the pull-style events that Daffodil requires.

To accomplish this, we need a ContentHandler implementation specific to Daffodil that can receive all the SAX events and unparse to a specified OutputStream. Similar to parsing, this ContentHandler shall be created by means of the DataProcessor:

...

The newContentHandlerInstance() method allocates a Daffodil-specific implementation of the ContentHandler interface that can be used to unparse data using the SAX API. Because the DataProcessor has all the actual logic for unparsing data, it must be passed into the ContentHandler implementation so that it can unparse data at the appropriate time. In addition, the ContentHandler must know where to unparse data to, so an OutputStream is also provided as a parameter. The implementation for this function looks something along the lines of the following:

Code Block
scala
class DataProcessor(...) {
  ...
  def newContentHandlerInstance(output: OutputStream): DaffodilUnparseContentHandler = new DaffodilUnparseContentHandler(this, output)
}

class DaffodilUnparseContentHandler(dp: DataProcessor, output: OutputStream) extends ContentHandler {
  ...
}

...

The SAXInfosetInputter is a new InfosetInputter that supports SAX-style events. This InfosetInputter has a mutable SAXInfosetEvent with information on each piece of state that the InfosetInputter could request (e.g. event type, local name, namespace, text content). This is the currentEvent SAXInfosetEvent, so when the InfosetInputter getter functions are called, the SAXInfosetInputter simply returns the respective values from the currentEvent.

How and when this event is mutated requires coordination between the push-style ContentHandler and the pull-style SAXInfosetInputter. This coordination is handled using coroutines and is described in the following section.

Coroutine Coordination

The legacy Daffodil Coroutine library allows for pausing the execution of a subroutine to temporarily yield to the caller, and allows the caller to resume the coroutine where it paused earlier. This library was revived to coordinate the interactions between the ContentHandler and the SAXInfosetInputter.

The dataProcessor.unparse(...) method is run within the SAXInfosetInputter coroutine, so its calls are also part of the coroutine execution stack. The ContentHandler is run in the main execution stack, with control passed back and forth between it and the SAXInfosetInputter coroutine, until the unparse completes and the SAXInfosetInputter yields control for the last time.

The coordination between the ContentHandler and the SAXInfosetInputter in the coroutine behaves with the following rules:

  • The ContentHandler maintains the next SAXInfosetEvent to provide to the SAXInfosetInputter. The SAXInfosetInputter copies that event and stores it in its nextEvent member when it calls hasNext.
  • The ContentHandler starts the SAXInfosetInputter coroutine, and thus the unparse, when the startDocument SAX event is received.
  • When the unparser calls the hasNext method in the SAXInfosetInputter (inside the coroutine), it returns true if there is a stored nextEvent, or false if an EndDocument has already been received. Otherwise it calls resume, passing in any status information from the currentEvent and yielding control to the ContentHandler to create a new event. When the ContentHandler calls resume (with the new event) on its end, hasNext stores the new event in nextEvent and returns true.
  • When the unparser calls the next method in the SAXInfosetInputter (inside the coroutine), it calls hasNext first to see if there is a nextEvent queued to copy. If there is, it copies nextEvent into currentEvent. A false hasNext is an error state, as next should not be called if hasNext is false.
  • This continues until hasNext in the SAXInfosetInputter returns false and the unparse call completes. The unparseResult is queried for errors, and any errors are set via SAXInfosetEvent.causeError. The unparseResult is set via SAXInfosetEvent.unparseResult.

An outline for what the above rules and coroutine interactions look like is below:

Code Block
scala
import java.io.OutputStream
import org.xml.sax.ContentHandler
import scala.util.Try
// Daffodil-internal imports are omitted from this outline

class DaffodilUnparseContentHandler(dp: DataProcessor, output: OutputStream) extends ContentHandler with coroutine[SAXInfosetEvent] {

  private val infosetEvent = new SAXInfosetEvent()
  private val inputter = new SAXInfosetInputter(this, dp, output)
  // result of the unparse, filled in once the inputter reports it
  private var unparseResult: UnparseResult = _

  private def sendToInputter(): Unit = {
    // queueing the infosetEvent for SAXInfosetInputter
    val infosetEventWithResponse = this.resume(inputter, Try(infosetEvent))
    infosetEvent.clear()
    // if event is wrapped in a Try failure, we will not have an unparseResult, so we only set
    // unparseResults for events wrapped in Try Success, including those events that have expected
    // errors
    if (infosetEventWithResponse.isSuccess && infosetEventWithResponse.get.unparseResult.isDefined) {
      unparseResult = infosetEventWithResponse.get.unparseResult.get
    }
    // the exception from events wrapped in Try failures and events wrapped in Try Successes
    // (with an unparse error state i.e unparseResult.isError) are collected and thrown to stop
    // the execution of the xmlReader
    if (infosetEventWithResponse.isFailure || infosetEventWithResponse.get.isError) {
      val causeError = if (infosetEventWithResponse.isFailure) {
        infosetEventWithResponse.failed.get
      } else {
        infosetEventWithResponse.get.causeError.get
      }
      causeError match {
        case unparseError: DaffodilUnparseErrorSAXException =>
          // although this is an expected error, we need to throw it so we can stop the xmlReader
          // parse and this thread
          throw unparseError
        case unhandled: DaffodilUnhandledSAXException => throw unhandled
        case unknown => throw new DaffodilUnhandledSAXException("Unknown exception: ", new Exception(unknown))
      }
    }
  }

  def startDocument(): Unit = {
    // Start the coroutine
    infosetEvent.eventType = One(StartDocument)
    sendToInputter()
  }

  def endDocument(): Unit = {
    infosetEvent.eventType = One(EndDocument)
    sendToInputter()
  }

  ...
}

class SAXInfosetInputter(unparseContentHandler: DaffodilUnparseContentHandler, dp: DataProcessor, output: OutputStream) extends InfosetInputter with coroutine[SAXInfosetEvent] {

  val currentEvent: SAXInfosetEvent = new SAXInfosetEvent
  val nextEvent: SAXInfosetEvent = new SAXInfosetEvent
  // set once the EndDocument event has been copied into currentEvent
  private var endDocumentReceived = false

  def getEventType: InfosetInputterEventType = currentEvent.eventType.orNull
  def getLocalName: String = currentEvent.localName.orNull
  def getNamespaceURI: String = currentEvent.namespaceURI.orNull
  ...

  def hasNext: Boolean = {
    if (endDocumentReceived) false
    else if (!nextEvent.isEmpty) true
    else {
      // yield to the ContentHandler until it resumes us with a new event
      val event = this.resume(unparseContentHandler, Try(currentEvent))
      copyEvent(source = event, dest = nextEvent)
      true
    }
  }

  def next(): Unit = {
    if (hasNext) {
      copyEvent(source = Try(nextEvent), dest = currentEvent)
      nextEvent.clear()
      if (currentEvent.eventType.contains(EndDocument)) endDocumentReceived = true
    } else {
      // we should never call next() if hasNext is false
      Assert.abort()
    }
  }
  ...
}

class SAXInfosetEvent {

  var localName: Maybe[String] = Nope
  var simpleText: Maybe[String] = Nope
  var namespaceURI: Maybe[String] = Nope
  ...

  var eventType: Maybe[InfosetInputterEventType] = Nope
  var nilValue: Maybe[String] = Nope
  var causeError: Maybe[SAXException] = Nope
  var unparseResult: Maybe[UnparseResult] = Nope

  def isError: Boolean
  def clear: Unit
  def isEmpty: Boolean
}
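The push-to-pull handoff can also be tried out in a self-contained form. The sketch below does not use the Daffodil Coroutine library. It swaps in a plain thread and a java.util.concurrent.SynchronousQueue, and every name in it (Event, PullInputter, PushPullDemo) is hypothetical. The main thread plays the ContentHandler pushing events, and a second thread plays the unparser pulling them through hasNext and next.

```scala
import java.util.concurrent.SynchronousQueue
import scala.collection.mutable.ArrayBuffer

sealed trait Event
final case class Start(name: String) extends Event
final case class End(name: String) extends Event
case object EndDocument extends Event

// Pull side: hasNext blocks until the push side hands over an event.
class PullInputter(queue: SynchronousQueue[Event]) {
  private var nextEvent: Option[Event] = None
  private var endDocumentReceived = false

  def hasNext: Boolean =
    if (endDocumentReceived) false
    else {
      if (nextEvent.isEmpty) nextEvent = Some(queue.take()) // wait for the handler
      true
    }

  def next(): Event = {
    require(hasNext, "next() must not be called when hasNext is false")
    val event = nextEvent.get
    nextEvent = None
    if (event == EndDocument) endDocumentReceived = true
    event
  }
}

object PushPullDemo {
  // The event sequence must end with EndDocument, otherwise the consumer
  // thread would wait forever for another event.
  def run(events: Seq[Event]): Seq[Event] = {
    val queue = new SynchronousQueue[Event]()
    val inputter = new PullInputter(queue)
    val received = ArrayBuffer[Event]()
    // Stand-in for the unparse call, which pulls events until none remain.
    val consumer = new Thread(() => {
      while (inputter.hasNext) received += inputter.next()
    })
    consumer.start()
    events.foreach(e => queue.put(e)) // stand-in for the SAX callbacks
    consumer.join()
    received.toSeq
  }
}
```

Each put blocks until the matching take, so only one side makes progress at a time, which approximates the alternating control flow of the coroutine design without reproducing its error reporting.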