Introduction

When parsing, Daffodil currently builds up the entire infoset in memory, and only when parsing has completed does it traverse the infoset nodes while calling the appropriate InfosetOutputter functions to create the infoset representation the user wants (e.g. XML, JDOM, JSON, SAX). For large infosets, the memory required to store the entire infoset in memory can be a limiting factor. This also prevents a streaming SAX-like behavior, since we only output infoset events once parse is complete rather than during the parse. The following proposes a solution to call InfosetOutputter events while parsing (though not necessarily as early as possible) and to decrease the memory usage required for the infoset representation.

Infoset Streaming

To output infoset events, each DINode currently implements a visit() function which calls a method on the InfosetOutputter and then recursively calls the visit() function of its children. This method of creating InfosetOutputter events assumes that the entire infoset is ready to be visited, and thus must be changed to support outputting events during the parse.

This proposes to remove this visit() function from the DINodes , and instead add a new InfosetWalker class to the PState . The InfosetWalker has a single function called walk() , which walks the infoset and calls the appropriate InfosetOutputter functions, much like the visit() method. The difference is that rather than traversing the infoset recursively, the InfosetWalker instead does so iteratively. Additionally, it maintains enough internal state so that the walk() function can pause walking when the InfosetWalker chooses, return, and then be called again to continue where it last stopped. This way, the InfosetWalker.walk() method can be periodically called to send out additional infoset events, such as when new infoset nodes are added or some state changes that indicates that new infoset events could be created. Different locations where this walk() method is called will give different behaviors (e.g. burst of many events vs single events more frequently). Performance tests should be done to determine the ideal location for the calls to walk(), with perhaps a tunable to configure different behaviors based on desired streaming characteristics.

Because we may need to backtrack and remove infoset nodes, the InfosetWalker cannot always walk the full infoset as nodes are created. It must stop walking when it reaches a node that has a point of uncertainty associated with it. However, the infoset does not contain any information about points of uncertainty--we must add that knowledge.

In Daffodil, points of uncertainty are represented by creating new Mark instances. When a point of uncertainty is resolved, we either discard the Mark or reset back to the Mark , which also discards it. We will modify the infoset nodes to indirectly be aware of these Marks.

This is done by adding a "mark" reference counter to each infoset node. When a Mark is created, the reference counter of the the current infoset node is incremented. When that Mark is discarded, either via a reset or discard call, the reference counter is decremented. Only when this reference counter is zero do we know that there are no longer any points of uncertainty associated that could cause this infoset node to be backtracked, and thus the InfosetWalker can walk past this node and continue to create infoset events.

Note that even if the reference counter is zero, the InfosetWalker must not walk past, or output events for, the current infoset element. This is because some parsers set the infoset value and then modify it, or a parser may have not yet set a Mark on the current element.

With the new InfosetWalker and Mark reference counter, infoset events can now be output during the parse, producing a streaming behavior.

Discriminators, Points of Uncertainty, and Marks

As mentioned in the previous section, Marks are created during points of uncertainty, and the existence of Marks are used to determine when the InfosetWalker can output events for a DINode. Discriminators are used to resolve points of uncertainty, but when a discriminator is set to true, it does not immediately discard any Marks . Instead, discriminators just mutate a boolean in the PState--when the parsers unwind back to the point of uncertainty, only then to they examine the discriminator state and discard the Mark. Because the InfosetWalker depends on these Marks being discarded, InfosetWalker events will not occur until much later in a parse after unwinding.

To resolve this issue, instead of the PState containing a stack of Boolean's representing the discriminators, it will be modified to instead maintain a stack of Marks. When a point of uncertainty is found, we push the associated Mark onto this stack. When a discrimiantor evaluates to true, we clear this Mark. Note that clearing the Mark is distinct from discarding the Mark. This clear frees up associated data associated with the Mark (e.g. decreases the refcounter so the InfosetWalker could continue, clear's DataInputStream marks so buckets may be freed), but does not actually discard the Mark. Upon unwinding, the parser that create the Mark determines if the point of uncertainty was resolved by checking if the Mark was cleared, and then it can discard the Mark are return it to the MarkPool.

With this change, we can clear Mark state as soon as points of uncertainty are resolved and allow for the InfosetWalker to make progress in creating events.

Memory Reduction

Athough the above changes allow for infoset events to be output much earlier than currently, Daffodil still maintains references to the DINodes associated with those events. These DINodes can take up are large amount of memory, especially in the case of large infosets with many repetitions/arrays.

The first approach to reduce the memory is after the InfosetWalker creates and end event for a DINode, we can simply remove the referces to the element and allow it to be garbage collected. Once the end event is output, the DINode is no longer neded. Practicallly, this means setting the appropriate index to null in the DINode's parent children array.

The one exception to this is in the case when a DINode is used in an expression. In such cases, it is possible that a DFDL expression will later reference this element and we will need its value–the element cannot be removed. This can be tested via the isReferencedByExpression field in the DPathElementCompileInfo.

One drawback to this approach is that large arrays, which are the main contributors to a large infoset memory footprint, cannot be removed if they are referenced by an expression. This simple logic of using isReferencedByExpression can be too restrictive, especially since in many cases an expression only acessses certain indices, such as the first index or the previous index, and not all array indices.

Onen possible solution to this is to examine how indices in arrays are accessed by expressions, and remove indicies once they are no longer needed. Unfortunatley, detecting such expressions is complex, and likely not worth the effort. As a simplification, we can add one or more new properties that provide a hint to Daffodil that certain array indices are not needed at some point, even if they are used in an expression, and that Daffodil may discard them.

A common case of such circumstances is when an expression only accesses the previous array index, such as maintaining a running sum of an array of integers. The following schema describes this, but provides a hint that only one previous array index is needed by an expression:

<xs:element name="arr" maxOccurs="unbounded" dfdlx:previousArrayIndiciesRequired="1">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="int" type="xs:int" ... />
      <xs:element name="runningSum" type="xs:int"
        dfdl:inputValueCalc="{
          if (dfdl:occursIndex() eq 1)
          then ../int
          else ../int + ../../arr[dfdl:occursIndex() - 1]/runningSum
        }" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

Because the arr element is used in an expression, we would be unable to release any of the elements by the above logic. However, the dfdlx:previousArrayIndiciesRequired="1" attribute provides a hint to Daffodil that only the previous element is actually used in an expression. This allows Daffodil to only keep the current and one previous element in an array. All others indicies may be discarded once the InfosetWalker has created their associated events.

Debugger

The debugger currently disables some optimizations so that it can provide a fuller picture of the parse. For example, it disables the removal of buckets in the InputSourceDataInputStream so that is can access all previous data.

It also makes sense to disable the removal of the DINodes used to free up memory so that the complete infoset can be viewed at any time. However, it may sometimes be useful to allow the debugger to visualize which elements have been removed. To support this, a flag shall be added to the InfosetWalker to disable the removal of nodes, likely enabled only during debugging.

Additionally, unlike the InfosetWalker , which pauses output of the infoset based on points of uncertainty, the debugger needs a complete view of the infoset, regardless of points of uncertainty. For this reason, the InfosetWalker will also have a flag to disable the pausing behavior, essentially ignoring the Mark reference counters. When the InteractiveDeubgger needs a complete view of the Infoset, it can create a new instance of this InfosetWalker with the debug flags set and walk the infoset.

Summary of Changes

Disciminators

The disciminator stack changes from a stack of booleans to a stack of Marks .
Marks associated with a PoU are pushed onto this discriminator stack
When a discriminator evaluates to true, the associated Mark is immediately discarded
Parsers determine if a discriminator was set by checking if it was discarded

Infoset

A new Mark reference counter is added to each DINode
When a Mark is created associated with a PoU, that reference counter is incremented
When a Mark is discarded, that reference counter is decremented

InfosetWalker

The Infoset visit() method is removed
A new InfosetWalker class is created which can walk the Infoset and create InfosetOutputter events
The InfosetWalker can pause and resume walking the infoset
The InfosetWalker will pause if it reaches the current Infoset element, or if it reaches an Infoset element that has a non-zero Mark reference count
To support debugging, flags are provided to the InfosetWalker to enable/disable node removal or pausing. To get a view of the infoset, the InteractiveDebugger creates a new instance of the InfosetWalker with these flags and walks the infoset.

Memory Reduction

After the InfosetWalker creates the end event for an element, the element is removed from the infoset if it is not used in an expression
For array elements that are used in an expression, a new DFDL extension property is added which provides a hint to Daffodil when an array index will no longer be used in an expression. The InfosetWalker will remove previous array elements according to this property.

Page tree

Proposal: Infoset Streaming and Memory Reduction