Introduction

When parsing, Daffodil currently builds up the entire infoset in memory, and only when parsing has completed does it traverse the infoset nodes while calling the appropriate InfosetOutputter functions to create the infoset representation the user wants (e.g. XML, JDOM, JSON, SAX). For large infosets, the memory required to store the entire infoset in memory can be a limiting factor. This also prevents a streaming SAX-like behavior, since we only output infoset events once parse is complete rather than during the parse. Unparsing was designed with streaming in mind so is less problematic, but some aspects such as suspensions and cleaning up no longer used infoset nodes still limit streaming. The following proposes a solution to call improving parsing and unparsing to support streaming and decrease memory usage required for the infoset representation.

Parse

The following sections describe proposed changes to improve infoset streaming and memory reduction during parse.

Streaming Out Infoset

To output infoset events, each DINode currently implements a visit() function which calls a method on the InfosetOutputter and then recursively calls the visit() function of its children. This method of creating InfosetOutputter events assumes that the entire infoset is ready to be visited, and thus must be changed to support outputting events during the parse.

This proposes to remove this visit() function from the DINodes , and instead add a new InfosetWalker class to the PState . The InfosetWalker has a single function called walk() , which walks the infoset and calls the appropriate InfosetOutputter functions, much like the visit() method. The difference is that rather than traversing the infoset recursively, the InfosetWalker instead does so iteratively. Additionally, it maintains enough internal state so that the walk() function can pause walking when the InfosetWalker chooses, return, and then be called again to continue where it last stopped. This way, the InfosetWalker.walk() method can be periodically called to send out additional infoset events, such as when new infoset nodes are added or some state changes that indicates that new infoset events could be created. Different locations where this walk() method is called will give different behaviors (e.g. burst of many events vs single events more frequently). Performance tests should be done to determine the ideal location for the calls to walk(), with perhaps  a tunable to configure different behaviors based on desired streaming characteristics.

Because we may need to backtrack and remove infoset nodes, the InfosetWalker cannot always walk the full infoset as nodes are created. It must stop walking when it reaches a node that has a point of uncertainty associated with it. However, the infoset does not contain any information about points of uncertainty--we must add that knowledge.

In Daffodil, points of uncertainty are represented by creating new Mark instances. When a point of uncertainty is resolved, we either discard the Mark or reset back to the Mark , which also discards it. We will modify the infoset nodes to indirectly be aware of these Marks.

This is done by adding a "mark" reference counter to each infoset node. When a Mark is created, the reference counter of the the current infoset node is incremented. When that Mark is discarded, either via a reset or discard call, the reference counter is decremented. Only when this reference counter is zero do we know that there are no longer any points of uncertainty associated that could cause this infoset node to be backtracked, and thus the InfosetWalker can walk past this node and continue to create infoset events.

Note that even if the reference counter is zero, the InfosetWalker must not walk past, or output events for, the current infoset element. This is because some parsers set the infoset value and then modify it, or a parser may have not yet set a Mark on the current element.

With the new InfosetWalker and Mark reference counter, infoset events can now be output during the parse, producing a streaming behavior.

Discriminators, Points of Uncertainty, and Marks

As mentioned in the previous section, Marks are created during points of uncertainty, and the existence of Marks are used to determine when the InfosetWalker can output events for a DINode. Discriminators are used to resolve points of uncertainty, but when a discriminator is set to true, it does not immediately discard any Marks . Instead, discriminators just mutate a boolean in the PState--when the parsers unwind back to the point of uncertainty, only then to they examine the discriminator state and discard the Mark. Because the InfosetWalker depends on these Marks being discarded, InfosetWalker events will not occur until much later in a parse after unwinding.

To resolve this issue, instead of the PState containing a stack of Boolean's representing the discriminators, it will be modified to instead maintain a stack of Marks. When a point of uncertainty is found, we push the associated Mark onto this stack. When a discrimiantor evaluates to true, we clear this Mark. Note that clearing the Mark is distinct from discarding the Mark. This clear frees up associated data associated with the Mark (e.g. decreases the refcounter so the InfosetWalker  could continue, clear's DataInputStream marks so buckets may be freed), but does not actually discard the MarkUpon unwinding, the parser that create the Mark determines if the point of uncertainty was resolved by checking if the Mark was cleared, and then it can discard the Mark are return it to the MarkPool.

With this change, we can clear Mark state as soon as points of uncertainty are resolved and allow for the InfosetWalker to make progress in creating events.

Memory Reduction

Athough the above changes allow for infoset events to be output much earlier than currently, Daffodil still maintains references to the DINodes associated with those events. These DINodes can take up are large amount of memory, especially in the case of large infosets with many repetitions/arrays.

The first approach to reduce the memory is after the InfosetWalker creates and end event for a DINode, we can simply remove the referces to the element and allow it to be garbage collected. Once the end event is output, the DINode is no longer neded. Practicallly, this means setting the appropriate index to null in the DINode's parent children array.

The one exception to this is in the case when a DINode is used in an expression. In such cases, it is possible that a DFDL expression will later reference this element and we will need its value–the element cannot be removed. This can be tested via the isReferencedByExpression field in the DPathElementCompileInfo.

One drawback to this approach is that large arrays, which are the main contributors to a large infoset memory footprint, cannot be removed if they are referenced by an expression. This simple logic of using isReferencedByExpression can be too restrictive, especially since in many cases an expression only acessses certain indices, such as the first index or the previous index, and not all array indices.

Onen possible solution to this is to examine how indices in arrays are accessed by expressions, and remove indicies once they are no longer needed. Unfortunatley, detecting such expressions is complex, and likely not worth the effort. As a simplification, we can add one or more new properties that provide a hint to Daffodil that certain array indices are not needed at some point, even if they are used in an expression, and that Daffodil may discard them.

A common case of such circumstances is when an expression only accesses the previous array index, such as maintaining a running sum of an array of integers. The following schema describes this, but provides a hint that only one previous array index is needed by an expression:

<xs:element name="arr" maxOccurs="unbounded" dfdlx:previousArrayIndiciesRequired="1">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="int" type="xs:int" ... />
      <xs:element name="runningSum" type="xs:int"
        dfdl:inputValueCalc="{
          if (dfdl:occursIndex() eq 1)
          then ../int
          else ../int + ../../arr[dfdl:occursIndex() - 1]/runningSum
        }" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

Because the arr element is used in an expression, we would be unable to release any of the elements by the above logic. However, the dfdlx:previousArrayIndiciesRequired="1" attribute provides a hint to Daffodil that only the previous element is actually used in an expression. This allows Daffodil to only keep the current and one previous element in an array. All others indicies may be discarded once the InfosetWalker has created their associated events.

Debugger

The debugger currently disables some optimizations so that it can provide a fuller picture of the parse. For example, it disables the removal of buckets in the InputSourceDataInputStream so that is can access all previous data.

It also makes sense to disable the removal of the DINodes used to free up memory so that the complete infoset can be viewed at any time. However, it may sometimes be useful to allow the debugger to visualize which elements have been removed. To support this, a flag shall be added to the InfosetWalker to disable the removal of nodes, likely enabled only during debugging.

Additionally, unlike the InfosetWalker , which pauses output of the infoset based on points of uncertainty, the debugger needs a complete view of the infoset, regardless of points of uncertainty. For this reason, the InfosetWalker will also have a flag to disable the pausing behavior, essentially ignoring the Mark reference counters. When the InteractiveDeubgger needs a complete view of the Infoset, it can create a new instance of this InfosetWalker with the debug flags set and walk the infoset.

Unparse

Although unparsing was designed with streaming in mind, and complications such as backtacking during parsing do not apply, issues related to suspension evaluation and memory reduction still prevent a full streaming behavior. The following sections describe proposed changes to improve infoset streaming and memory reduction during unparse. 

Suspensions

When a suspension is created, it is currently added to a list stored in the UState. Upon successful completition of an unparse, all gathered suspensions are tested and evaluated until they all either complete or a loop is detected causing an error. By postponing these suspensions until the very end of an unparse, all data must be cached, potentially require large amounts of memory. In many cases, it is not necessary to wait until the end of an unaprse befor a suspension can be successfully evaluated, this just made the implementation easier. The follow proposes changes so that suspensions can be evaluated much earlier, allowing cached memory to be output to reduce the amount of memory required for large files.

Rather than storing all suspensions in a single list and evaluating them all at the end, each suspension is instead placed in a new list hung on the objects that the individual suspension test depends on. When a state change occurs on one of these objects, the associated suspensions are tested and evaluated as necessary. Note that a state change may not always mean the test will pass. In somecases, the state may not have changed to the appropriate value. In other cases, state may have changed and a test made progress, but completion of the test may depend on state change of another object. Because the test knows what the necessary state and for which objects, the logic for adding a suspension to these suspensions lists must be handled by a suspensions test method. Thus the steps for managing suspensions looks something like this:

  • Create new suspension, do not add to any lists
  • Immediately run the suspension test method
    • If any part of the test fails, the test method adds this suspension to the suspension list in the object associated with the failure
    • If the test passes, the suspenion body is evaluated and the suspension is not added to any suspension list
  • Eventually, state will change in objects that contain a suspension list. When this occurs:
    • All suspensions associated with this list are removed from list
    • The test methods of each removed suspension is called
    • The cycle repeats (i.e. tests fail and are added to suspensions lists, or tests pass and suspensions are evaluated) 

Note that for diagnostic purposes, we still want to maintain a single list of all pending suspensions in UState. This allows easy access to all suspensions without the need to keep track of all the different objects that suspensions might hang off of. Extract care must be takeing to ensure these lists are in sync--when a suspension is removed from one of the suspensions lists, it must also be removed from this UState list. Likewise for when suspensions are added. In addition to diagnostics, it also provides a method to detect circular deadlocks with suspensions. Upon completion of an unparse, this all suspensions should have been successfully evaluated and removes from the UState list. If this UState list is not empty, it means a ciruclar deadlock must exist involving the remaining suspensions.

To support this new suspension capability, a new trait is created that includes a suspension list and functions to add, remove, test, and evaluate suspensions. This trait must then be mixed into the various objects where suspensions might need to be hung. These objects must also be updated so that the test/evaluate function is called when certain pieces of state change. Finally, the test functions of the various suspensions must be modified to call the add function when a test fails.

The list of objects that need to mixin this new trait, the state that could change that requires suspensions test/evaluation, and the suspensions associated with that state change are below:

  • DINode
    • hasValue
      • SimpleTypeRetryUnparserSuspendableOperation
    • targetLengthEv evalutes to an answer
      • TargetLengthOperation
      • NeedValueAndTargetLengtHmixin
      • PaddingUnparserMixin
      • NilLiteralCharacterUnparserSuspendableOperation
    • valueLength is defined
      • NeedValueAndTargetLengtHmixin
      • PaddingUnparserMixin
      • PrefixLengthSuspendableOperation
    • charsetEv evaluates to an answer
      • PaddingUnparserMixin
    • isNilled is known
      • NilLiteralCharacterUnparserSuspendableOperation
  • DataOutputStream
    • non-zero status is determined to be true or false
      • ChoiceUnusedUnparserSuspendableOperation
      • SuppressableSeparatorUnparserSuspendableOperation
    • absolute position is defined
      • AlignmentFillUnparserSuspendableOperation
  • None
    • test passes the second time it is called
      • RegionSplitSuspendableOperation

Only two objects must mixin this trait, DINode and DataOutputStream, and only a handful of state changes must trigger suspension evaluation.

One additional change is needed for suspensions. Some suspensions, such as ChoiceUnusedUnparserSuspendableOperation, can be trigger by state changes in many DataOutputStreams. Thus, the same suspension maybe in multiple diferent lists. When the state changes in on of those objects, it is possible the the suspenion test could pass and be evaluated. However, other objects may still have a reference to this suspension. We do not what the same susenspino to be evaluated again. Thus, a flag must be added to suspensions to signifiy if they have already been evaluated. When testing suspensions in a suspension list, suspensions with this flag set should simply be ignored. 

Also note that some tests may rely on the fact that processing for a simple element is complete. Often times a DINode will have a value, but a subsequent parsing may change the value and so the simple element processing is not complete. However, some suspension tests use the hasValue DINode member to determien if an element has a value, with the assumption that the value is complete because suspensions are evaluated at the end. This assumption is no longer true, and so a concept of "final" must be applied to simple types, and tests updated to take this into account when testing simple types. To accomplish this, the DIFinalizable trait, which is already used for complex types and arrays, must be mixedin to simple types and set appropriately.

Memory Reduction

As with parsing we need a mechanism to remove unused infoset nodes once it is determined they are no longer needed. Fortunately, the logic here is quite a bit more simplified that with parsing. DIComplex and DIArray objects already have a concept of isFinal which is set when no more changes will occurr on a node. And DISimple elements will also support this concept with the modifications made for suspensions. To support memory reduction, once an element is set as final the DINode can be pruned just like as with parsing. This must take into account expressions and can use simlar logic to handle pruning arrays as well.

Summary of Changes: Parse

Disciminators

  • The disciminator stack changes from a stack of booleans to a stack of Marks .
  • Marks associated with a PoU are pushed onto this discriminator stack
  • When a discriminator evaluates to true, the associated Mark is immediately discarded
  • Parsers determine if a discriminator was set by checking if it was discarded

Infoset

  • A new Mark reference counter is added to each DINode
  • When a Mark is created associated with a PoU, that reference counter is incremented
  • When a Mark is discarded, that reference counter is decremented

InfosetWalker

  • The Infoset visit() method is removed
  • A new InfosetWalker class is created which can walk the Infoset and create InfosetOutputter events
  • The InfosetWalker can pause and resume walking the infoset
  • The InfosetWalker will pause if it reaches the current Infoset element, or if it reaches an Infoset element that has a non-zero Mark reference count
  • To support debugging, flags are provided to the InfosetWalker to enable/disable node removal or pausing. To get a view of the infoset, the InteractiveDebugger creates a new instance of the InfosetWalker with these flags and walks the infoset.

Memory Reduction

  • After the InfosetWalker creates the end event for an element, the element is removed from the infoset if it is not used in an expression
  • For array elements that are used in an expression, a new DFDL extension property is added which provides a hint to Daffodil when an array index will no longer be used in an expression. The InfosetWalker will remove previous array elements according to this property.

Summary of Changes: Unparse

Suspensions

  • Create new trait that manages suspensions, including functions to test, evaluate, add, and remove them to a trait specific list, and a UState specific list for diagnostics
  • Mixin this new trait to DINode and DataOutputStream
  • Modify DINode and DataOutputStream so that when certain state changes occur, they call the functions to test and evaluate the suspensions. 

Memory Reduction

  • Add isFinal support for simple types
  • When a DINode is marked as isFinal, remove it from the infoset using similar logic as with parse, taking into accounte expressions and DFDL extension hints to prune arrays
  • No labels

1 Comment

  1. I think the default should be that arrays are kept entirely (parsing) when (a) maxOccurs is below some threshold and (b) they are referenced by an expression. Otherwise they should be discarded, retaining some window of adjacent elements. We do have the option of always just keeping the first and last elements since those may be treated specially, if there are initial values or final accumulated sums that are referenced outside of the array. We should assume the span is 1. Having this span be tunable is interesting. Any finite N (5, 10, 100) all are sufficient to make the memory footprint finite. Just larger constant. But my guess is that N is some small number (like 1), or the pattern of access is not predictable, and one must keep the whole array.

    We should come up with a list of events that can trigger evaluation of a suspension. I think you are missing numerous functions that access infoset nodes and which have to block if they are unable to complete.  I think for example fn:exists(...some node...) can block until either the node being tested for (a specific child of the enclosing complex) is attached (fn:exists evaluates to true), or it is not attached but the node is set isFinished (fn:exists evaluates to false). Either frees up a blocked fn:exists(....) call. The fn:count(... array node...) is similar. It blocks until isFinished of the array, which is not the same as isfinished of the enclosing complex element, since there could  be an element the presence of which indicates no more array elements are possible. Anyway, both of these scenarios (fn:exists, fn:count) are very common, as fn:exists is used to compute presence bit flag values, and fn:count is used to compute stored counts of number of occurrences.

    Of course we could just punt all this, and say "we transitioned state of this infoset node", so any attached suspension is re-spun and will simply succeed or re-block the same as before. That may be sufficient. This is assuming, for any given node, there is generally one suspension on it, and it is waiting for one, obvious thing. I don't know if it is true, but it is worth an experiment before trying to come up with detailed queueing of suspensions on very precise events.