Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

When parsing, Daffodil currently builds up the entire infoset in memory, and only when parsing has completed does it traverse the infoset nodes while calling the appropriate InfosetOutputter functions to create the infoset representation the user wants (e.g. XML, JDOM, JSON, SAX). For large infosets, the memory required to store the entire infoset in memory can be a limiting factor. This also prevents a streaming SAX-like behavior, since we only output infoset events once parse is complete rather than during the parse. Unparsing was designed with streaming in mind so is less problematic, but some aspects such as suspensions and cleaning up no longer used infoset nodes still limit streaming. The following proposes a solution to call InfosetOutputter events while parsing (though not necessarily as early as possible) and to decrease the improving parsing and unparsing to support streaming and decrease memory usage required for the infoset representation.

Parse

The following sections describe proposed changes to improve infoset streaming and memory reduction during parse.

Streaming Out Infoset

...

To output infoset events, each DINode currently implements a visit() function which calls a method on the InfosetOutputter and then recursively calls the visit() function of its children. This method of creating InfosetOutputter events assumes that the entire infoset is ready to be visited, and thus must be changed to support outputting events during the parse.

...

With the new InfosetWalker and Mark reference counter, infoset events can now be output during the parse, producing a streaming behavior.

Discriminators, Points of Uncertainty, and Marks

As mentioned in the previous section, Marks are created during points of uncertainty, and the existence of Marks are used to determine when the InfosetWalker can output events for a DINode. Discriminators are used to resolve points of uncertainty, but when a discriminator is set to true, it does not immediately discard any Marks . Instead, discriminators just mutate a boolean in the PState--when the parsers unwind back to the point of uncertainty, only then to they examine the discriminator state and discard the Mark. Because the InfosetWalker depends on these Marks being discarded, InfosetWalker events will not occur until much later in a parse after unwinding.

To resolve this issue, instead of the PState containing a stack of Boolean's representing the discriminators, it will be modified to instead maintain a stack of Marks. When a point of uncertainty is found, we push the associated Mark onto this stack. When a discrimiantor evaluates to true, we discard this Mark. clear this Mark. Note that clearing the Mark is distinct from discarding the Mark. This clear frees up associated data associated with the Mark (e.g. decreases the refcounter so the InfosetWalker  could continue, clear's DataInputStream marks so buckets may be freed), but does not actually discard the MarkUpon unwinding, the parser determines that create the Mark determines if the point of uncertainty was resolved by determining checking if the associated  Mark was discardedcleared, and then it can discard the Mark are return it to the MarkPool.

With this change, we can discard Marks as clear Mark state as soon as points of uncertainty are resolved and allow for the InfosetWalker to make progress in creating events.

Memory Reduction

Athough the above changes allow for infoset events to be output much earlier than currently, Daffodil still maintains references to the DINodes associated with those events. These DINodes can take up are large amount of memory, especially in the case of large infosets with many repetitions/arrays.

...

Because the arr element is used in an expression, we would be unable to release any of the elements by the above logic. However, the dfdlx:previousArrayIndiciesRequired="1" attribute provides a hint to Daffodil that only the previous element is actually used in an expression. This allows Daffodil to only keep the current and one previous element in an array. All others indicies may be discarded once the InfosetWalker has created their associated events.

Debugger

The debugger currently disables some optimizations so that it can provide a fuller picture of the parse. For example, it disables the removal of buckets in the InputSourceDataInputStream so that is can access all previous data.

...

Additionally, unlike the InfosetWalker , which pauses output of the infoset based on points of uncertainty, the debugger needs a complete view of the infoset, regardless of points of uncertainty. For this reason, the InfosetWalker will also have a flag to disable the pausing behavior, essentially ignoring the Mark reference counters. When the InteractiveDeubgger needs a complete view of the Infoset, it can create a new instance of this InfosetWalker with the debug flags set and walk the infoset.

Unparse

Although unparsing was designed with streaming in mind, and complications such as backtacking during parsing do not apply, issues related to suspension evaluation and memory reduction still prevent a full streaming behavior. The following sections describe proposed changes to improve infoset streaming and memory reduction during unparse. 

Suspensions

When a suspension is created, it is currently added to a list stored in the UState. Upon successful completition of an unparse, all gathered suspensions are tested and evaluated until they all either complete or a loop is detected causing an error. By postponing these suspensions until the very end of an unparse, all data must be cached, potentially require large amounts of memory. In many cases, it is not necessary to wait until the end of an unaprse befor a suspension can be successfully evaluated, this just made the implementation easier. The follow proposes changes so that suspensions can be evaluated much earlier, allowing cached memory to be output to reduce the amount of memory required for large files.

Rather than storing all suspensions in a single list and evaluating them all at the end, each suspension is instead placed in a new list hung on the objects that the individual suspension test depends on. When a state change occurs on one of these objects, the associated suspensions are tested and evaluated as necessary. Note that a state change may not always mean the test will pass. In somecases, the state may not have changed to the appropriate value. In other cases, state may have changed and a test made progress, but completion of the test may depend on state change of another object. Because the test knows what the necessary state and for which objects, the logic for adding a suspension to these suspensions lists must be handled by a suspensions test method. Thus the steps for managing suspensions looks something like this:

  • Create new suspension, do not add to any lists
  • Immediately run the suspension test method
    • If any part of the test fails, the test method adds this suspension to the suspension list in the object associated with the failure
    • If the test passes, the suspenion body is evaluated and the suspension is not added to any suspension list
  • Eventually, state will change in objects that contain a suspension list. When this occurs:
    • All suspensions associated with this list are removed from list
    • The test methods of each removed suspension is called
    • The cycle repeats (i.e. tests fail and are added to suspensions lists, or tests pass and suspensions are evaluated) 

Note that for diagnostic purposes, we still want to maintain a single list of all pending suspensions in UState. This allows easy access to all suspensions without the need to keep track of all the different objects that suspensions might hang off of. Extract care must be takeing to ensure these lists are in sync--when a suspension is removed from one of the suspensions lists, it must also be removed from this UState list. Likewise for when suspensions are added. In addition to diagnostics, it also provides a method to detect circular deadlocks with suspensions. Upon completion of an unparse, this all suspensions should have been successfully evaluated and removes from the UState list. If this UState list is not empty, it means a ciruclar deadlock must exist involving the remaining suspensions.

To support this new suspension capability, a new trait is created that includes a suspension list and functions to add, remove, test, and evaluate suspensions. This trait must then be mixed into the various objects where suspensions might need to be hung. These objects must also be updated so that the test/evaluate function is called when certain pieces of state change. Finally, the test functions of the various suspensions must be modified to call the add function when a test fails.

The list of objects that need to mixin this new trait, the state that could change that requires suspensions test/evaluation, and the suspensions associated with that state change are below:

  • DINode
    • hasValue
      • SimpleTypeRetryUnparserSuspendableOperation
    • targetLengthEv evalutes to an answer
      • TargetLengthOperation
      • NeedValueAndTargetLengtHmixin
      • PaddingUnparserMixin
      • NilLiteralCharacterUnparserSuspendableOperation
    • valueLength is defined
      • NeedValueAndTargetLengtHmixin
      • PaddingUnparserMixin
      • PrefixLengthSuspendableOperation
    • charsetEv evaluates to an answer
      • PaddingUnparserMixin
    • isNilled is known
      • NilLiteralCharacterUnparserSuspendableOperation
  • DataOutputStream
    • non-zero status is determined to be true or false
      • ChoiceUnusedUnparserSuspendableOperation
      • SuppressableSeparatorUnparserSuspendableOperation
    • absolute position is defined
      • AlignmentFillUnparserSuspendableOperation
  • None
    • test passes the second time it is called
      • RegionSplitSuspendableOperation

Only two objects must mixin this trait, DINode and DataOutputStream, and only a handful of state changes must trigger suspension evaluation.

One additional change is needed for suspensions. Some suspensions, such as ChoiceUnusedUnparserSuspendableOperation, can be trigger by state changes in many DataOutputStreams. Thus, the same suspension maybe in multiple diferent lists. When the state changes in on of those objects, it is possible the the suspenion test could pass and be evaluated. However, other objects may still have a reference to this suspension. We do not what the same susenspino to be evaluated again. Thus, a flag must be added to suspensions to signifiy if they have already been evaluated. When testing suspensions in a suspension list, suspensions with this flag set should simply be ignored. 

Also note that some tests may rely on the fact that processing for a simple element is complete. Often times a DINode will have a value, but a subsequent parsing may change the value and so the simple element processing is not complete. However, some suspension tests use the hasValue DINode member to determien if an element has a value, with the assumption that the value is complete because suspensions are evaluated at the end. This assumption is no longer true, and so a concept of "final" must be applied to simple types, and tests updated to take this into account when testing simple types. To accomplish this, the DIFinalizable trait, which is already used for complex types and arrays, must be mixedin to simple types and set appropriately.

Memory Reduction

As with parsing we need a mechanism to remove unused infoset nodes once it is determined they are no longer needed. Fortunately, the logic here is quite a bit more simplified that with parsing. DIComplex and DIArray objects already have a concept of isFinal which is set when no more changes will occurr on a node. And DISimple elements will also support this concept with the modifications made for suspensions. To support memory reduction, once an element is set as final the DINode can be pruned just like as with parsing. This must take into account expressions and can use simlar logic to handle pruning arrays as well.

Summary of Changes: Parse

Disciminators

  • The disciminator stack changes from a stack of booleans to a stack of Marks .
  • Marks associated with a PoU are pushed onto this discriminator stack
  • When a discriminator evaluates to true, the associated Mark is immediately discarded
  • Parsers determine if a discriminator was set by checking if it was discarded

...

  • After the InfosetWalker creates the end event for an element, the element is removed from the infoset if it is not used in an expression
  • For array elements that are used in an expression, a new DFDL extension property is added which provides a hint to Daffodil when an array index will no longer be used in an expression. The InfosetWalker will remove previous array elements according to this property.

Summary of Changes: Unparse

Suspensions

  • Create new trait that manages suspensions, including functions to test, evaluate, add, and remove them to a trait specific list, and a UState specific list for diagnostics
  • Mixin this new trait to DINode and DataOutputStream
  • Modify DINode and DataOutputStream so that when certain state changes occur, they call the functions to test and evaluate the suspensions. 

Memory Reduction

  • Add isFinal support for simple types
  • When a DINode is marked as isFinal, remove it from the infoset using similar logic as with parse, taking into accounte expressions and DFDL extension hints to prune arrays