
...

Here is a link to a page about avoiding unnecessary allocation in Java code; the same principles apply to Scala code: http://blog.takipi.com/5-coding-hacks-to-reduce-gc-overhead/

Avoid Unnecessary Allocation

Many things in Scala cause allocation of objects on the heap. Allocating an object involves quite a lot of overhead: the object carries extra header locations beyond its members, and the memory must be allocated and initialized and the constructor called, etc.

...

Scala's Int, Long, Short, and Byte are all value types, so they are passed and returned without allocating. Specialized collections like Array[Byte] also avoid allocating a boxed Byte. But generic collections such as ListBuffer[T] will allocate a box every time a primitive number is inserted.

There is no ideal fix for this. But consider carrying around number objects instead, so that the box gets allocated once and the code then carries the number around in its boxed form. Instead of allocating and discarding boxes for numbers all the time, the code just carries an object reference to a single boxed object.

...

Then the prefix "J" on the name makes it clear that a boxed numeric type is being used.
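For example (these particular aliases are illustrative; the point is the J-prefix convention):

Code Block
```scala
// Import boxed Java numeric types under J-prefixed names, per the convention.
import java.lang.{Byte => JByte, Short => JShort, Integer => JInt, Long => JLong}

// One box is allocated here; afterwards the code just carries this reference around.
val count: JLong = java.lang.Long.valueOf(42L)
```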

Avoid

...

Scala BigInt and BigDecimal Types

Scala's BigInt and BigDecimal types are both wrappers around Java's BigInteger and BigDecimal types. Since we do not need any of the additional functionality they provide, they just add another layer of indirection and object allocation. Instead, we should use Java's versions of these data structures directly. To be consistent with our convention for numeric types, these Java types should be imported as below:

Code Block
import java.math.{BigInteger => JBigInt, BigDecimal => JBigDecimal}

Avoid Generic Collections of Unboxed Types

Code Block
val boxedInts = new ArrayBuffer[Int](len) // adding an Int allocates a box every time
var unboxedInts = new Array[Int](len)     // non-allocating per element - note the var, in case you need to replace it with a resized array

Avoid Destructuring by Pattern in match-case

Scala's match-case with case classes is very convenient, but destructuring patterns use Option types underneath: the compiler-generated unapply method wraps the extracted values in a Some for every successful match.

Instead of this:

Code Block
foo match {
  case Foo(a, b) => .... // destructuring: calls the unapply extractor, which can allocate
...

Write this instead:

Code Block
foo match {
  case f: Foo => { val a = f.a ; val b = f.b ; .... // type test plus direct member access
...

This doesn't allocate at run time: a type test is just an instanceof check, and the members are read directly.
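As a self-contained sketch (Foo here is a hypothetical case class; whether the destructuring form actually allocates depends on the Scala compiler version and how well it optimizes the match):

Code Block
```scala
case class Foo(a: Int, b: Int)

def sumDestructured(foo: Foo): Int = foo match {
  case Foo(a, b) => a + b // extractor-style match; may allocate a Some/tuple depending on compiler version
}

def sumTypeTested(foo: Foo): Int = foo match {
  case f: Foo => f.a + f.b // instanceof check plus direct member access; no allocation
}
```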

Avoid Destructuring by Pattern in Assignments

The Scala compiler may or may not optimize assignment statements like:

Code Block
case class Frame(x: String, z: String)

val Frame(a, b) = y // our own type

val Some(c) = w     // built in Some[T] type


The compiler may optimize this for some built-in types, but not for our own definitions. Different versions of the Scala compiler may or may not optimize it, depending on flags.

It is better in performance-oriented code to stick with the straightforward:

Code Block
class Frame(val x: String, val z: String) // forces members to be accessed individually

val a = y.x     // our own type
val b = y.z
val c = w.get   // built in Some[T] type

This is preferred simply because there is no uncertainty: its performance is the best possible for this operation. Declaring the class this way forces developers to access the members individually, rather than by pattern matching with its potential for non-optimization.

It is true that the former offers the code reader more information - it makes it clear that the type of y is Frame, whereas in the latter code the reader knows only, from this line alone, that y is an object with x and z members. The developer writing the code knows the type is Frame, and by writing in the pattern-matching style is effectively asserting that fact; but the Scala compiler may be unable to prove it in many cases, particularly when object-oriented programming - base and derived classes - is involved.

Test coverage analysis can also be diluted by this. For example, the assignment " val Some(c) = w " is always marked as not fully covered, because the compiler knows the type of w only as Option, not Some. This code is therefore performing a type conversion which, as far as the Scala compiler is concerned, could fail and throw. The programmer, by writing the pattern assignment, asserts that this cannot happen, but the compiler cannot (or does not) prove it, so the generated code still has a throwing path that is never exercised/covered. The coverage report will red-mark this line of code. This is another, albeit minor, reason to stick with the straightforward latter coding style.

Avoid scala.collection.Map. Use NonAllocatingMap (wraps Java Maps) Instead

This is a library instance of the "avoid Option type" problem.

Scala's maps, including mutable HashMap, allocate a Some[T] object for every successful get(key) call.

This is unacceptable overhead for something done so frequently.

It is better to use Java's java.util.Map classes instead, where get(key) returns a value or null, and never allocates anything.
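For example, calling get on a java.util.HashMap directly:

Code Block
```scala
import java.util.{HashMap => JHashMap}

val m = new JHashMap[String, String]()
m.put("k", "v")
val hit = m.get("k")     // returns the value itself; no Some is allocated
val miss = m.get("nope") // returns null; still no allocation
```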

To make this convenient, there is a wrapper NonAllocatingMap:

Code Block
import org.apache.daffodil.util.NonAllocatingMap

val myMap = new NonAllocatingMap[String, String](new java.util.HashMap[String, String])

This provides the map functionality of the java-provided map class, recast as Scala's types.

Use MStack, avoid mutable.Stack

We need lots of stacks, and since Scala's general stacks are generic collections, we created our own non-boxing flavors:

  • MStack.Of[T] - generic
  • MStack.OfInt - stack of Int - non-boxing
  • MStack.OfMaybe[T] - doesn't create a box for the Maybe object; uses null for Nope and a regular object reference for One.
    • However, MStack.OfMaybe[Int] will still box and unbox the Int, since the Int is in a generic position.


Allocate on "the stack" Using OnStack and LocalStack

See the definitions of these. These make it convenient to reuse objects in recursive code, as is common in Daffodil. A small pool of reusable objects is maintained per thread. Accessing one causes it to either be created, or an existing one initialized. They get put back on exit of scope.

What these achieve is some of the efficiency one sees in programming languages like C or C++, where most data structures are stack allocated and never require heap allocation or garbage collection. There is more overhead to OnStack because it uses a thread-local to ensure thread safety. If you have a per-thread data structure (such as a per-thread state block) being passed around, then you can use a LocalStack in the state block, which has less overhead.
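The general shape of the idiom is a loan pattern over a thread-local free list. The sketch below is illustrative only - it is not Daffodil's actual OnStack/LocalStack API - but it shows how reuse, reinitialization, and return-on-scope-exit fit together:

Code Block
```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only - not Daffodil's actual OnStack API.
class OnStackLike[T](make: => T, init: T => Unit) {
  private val pool = new ThreadLocal[ArrayBuffer[T]] {
    override def initialValue(): ArrayBuffer[T] = new ArrayBuffer[T]
  }
  def apply[R](body: T => R): R = {
    val p = pool.get()
    val obj = if (p.nonEmpty) p.remove(p.length - 1) else make // reuse or create
    init(obj)          // reinitialize the reused object
    try body(obj)
    finally p += obj   // put it back on exit of scope
  }
}
```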

Use Reusable Pools of Stored Objects

When reusable objects do not follow a stack discipline, then you can still reuse them by pooling them.

See Pool.scala for a common pooling idiom
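The idiom amounts to a free list with explicit get/return calls. This sketch is illustrative only - names and signatures are not the actual Pool.scala API:

Code Block
```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only - not the actual Pool.scala API.
class SimplePool[T](make: => T) {
  private val free = new ArrayBuffer[T]
  def getFromPool(): T =
    if (free.nonEmpty) free.remove(free.length - 1) else make
  def returnToPool(obj: T): Unit = free += obj // caller must not use obj afterwards
}
```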

Iteration Patterns - Avoid Iterators - Use Cursors and Accessors

Since we have to avoid Scala's nice generic collection operations like foreach and map, one might be tempted to just use the Iterator[T] class instead.

However, if the generic type T here is a value type (e.g., Int, Long, or Boolean), then calling next() will return a boxed object. Whether that box is saved somewhere or is created and discarded immediately depends on what you are iterating over.

But the real issue here is about returned aggregate objects: we want to iterate over something whose items are aggregates, and we immediately want to break each item apart and use just the pieces.

Code Block
val iter : Iterator[(String, String)] = ....
...
val (left, right) = iter.next()
...

Now each time we iterate, an object is created to be returned from iter.next() (unless that object already exists in some collection, but in many cases it does not).

An alternative idiom is the Cursor & Accessor pattern:

Code Block
case class R(var left: String, var right: String) extends Accessor[R] { // note the mutable vars!
  def cpy(): R = R(left, right)
  def assignFrom(other: R): Unit = {
    this.left = other.left
    this.right = other.right
  }
}

class RCursor extends Cursor[R] {

  val advanceAccessor = R(null, null)

  override def advance: Boolean = { ... }
  ...
}

The idea here is you call the advance method, and it returns a boolean telling you if it was able to advance the cursor to another object. The object is "returned" by side-effecting the accessor. Each call to advance clobbers the same object. This is a way to iterate over vast amounts of complex data without having to create any objects.

There is also an inspect method (which works like peek() - it looks ahead at the next item but doesn't advance to it). It fills in a different accessor so that you don't have to copy in order to look at the current and next items simultaneously.

If you want to revert to using ordinary Scala idioms like collections and Iterators you can copy the accessor, or assign to them with methods on the Accessor class (cpy and assignFrom).
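A typical consumption loop over such a cursor (hypothetical names, continuing the RCursor sketch above) looks like:

Code Block
```scala
// Hypothetical usage; rCursor is an RCursor as sketched above.
val acc = rCursor.advanceAccessor
while (rCursor.advance) {
  // advance overwrote acc in place with the next item's fields; no object was created
  process(acc.left, acc.right)
}
```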


See Cursor.scala for the traits.

Avoid Seq-like functions for Strings and Arrays

When writing in Scala, it usually feels natural to treat an Array[T] as a sequence of T's and a String as a sequence of characters. For example:

Code Block
scala
val str = "Some String"
val hasSpaces = str.exists { _ == ' ' }

While this is convenient and feels like "correct" Scala, in such cases Scala will implicitly box the String in a StringOps to provide that extra functionality, which requires an allocation. Similarly, using Seq-like functions on an Array will box the underlying array in an ArrayOps, again requiring an allocation. Note that even simple things like String's apply() function, e.g. str(4), cause such boxing; use the equivalent str.charAt(4) instead. This is also a key difference between calling .size on an array vs .length: the size method requires allocating an ArrayOps, while length directly accesses the length of the underlying Java array.

Note that in most cases these allocations are so cheap that they likely won't affect performance. However, they could have an effect in a tight inner loop, and at the very least avoiding them reduces noise when profiling.
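In performance-critical code the allocation-free equivalent of the exists example above is an explicit indexed loop:

Code Block
```scala
val str = "Some String"

// Allocation-free check using charAt and length; no StringOps wrapper is created.
def hasSpaces(s: String): Boolean = {
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == ' ') return true
    i += 1
  }
  false
}

val result = hasSpaces(str) // true
```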

Consider Cloning vs Creating a New Instance

In some cases, it can be relatively expensive to create a new instance of an object. In such cases, it might be worth considering if clone()ing an existing instance and mutating it is faster.

One case where this appears to be beneficial is with ICU4J Calendar objects. Creating a new Calendar object via Calendar.getInstance(...) is a fairly expensive process with lots of different object allocations. Instead, it is recommended to consider something like the following, which minimizes allocations and initialization computations:

Code Block
scala
object SomeClass {
  val emptyCalendar = {
    val c = Calendar.getInstance(TimeZone.UNKNOWN_ZONE)
    c.clear()
    c
  }
}

def functionThatNeedsANewCalendar = {
  val cal = SomeClass.emptyCalendar.clone.asInstanceOf[Calendar]
  ...
  cal.set(...)
  ...
  cal
}

Examining Bytecode

As is apparent from many of the above suggestions, minimizing allocations is often key to improving Daffodil performance and making profiling less noisy. Oftentimes an allocation occurs but it is not clear from the source why. In these cases it is often necessary to inspect the bytecode, and the javap tool is invaluable for this. The following disassembles a class to bytecode, including some helpful interpretations in comments:

Code Block
bash
javap -p -c path/to/class/file.class

It can also be useful to search the entire code base for certain allocations by looking through the disassembled code. A useful script to disassemble all class files is the following:

Code Block
bash
find daffodil.git -name '*.class' -exec javap -p -c '{}' \; > disassembled.txt

From there, you can grep this file and determine where unexpected allocations may be taking place. For example, to find allocations of java.math.BigInteger:

Code Block
bash
grep -a "new" -n disassembled.txt | grep "java/math/BigInteger"

Profiling & Timing

Often it is useful to use a profiler to examine memory allocations and CPU usage to determine where to target optimizations. However, due to the nested nature of Daffodil parsers/unparsers, some profilers make it difficult to determine how long certain sections of code take, or they incur too much overhead and skew the results. For this reason, a special timer was added to Daffodil's utilities to track sections of code: the TimeTracker in Timer.scala. A common use of this timer is to track the time of all the parsers. To enable this, adjust the parse1() method in Parser.scala to look like this:

Code Block
scala
TimeTracker.track(parserName) {
  parse(pstate)
}

Then add this line at the end of whatever triggers parsing (e.g., the Daffodil CLI code, a unit test, or a performance rig):

Code Block
scala
TimeTracker.logTimes(LogLevel.Error)

This will produce output like the following, where Time is in seconds, Average is in nanoseconds, and Count is the number of times that section was executed.

Code Block
[error] Name                                 Time     Pct  Average    Count
[error] LiteralNilDelimitedEndOfDataParser  3.330  34.03%     4030   826140
[error] StringDelimitedParser               2.455  25.09%     4184   586640
[error] DelimiterTextParser                 1.038  10.61%      879  1180480
[error] SimpleNilOrValueParser              0.985  10.07%     1192   826140
[error] OrderedSeparatedSequenceParser      0.806   8.23%    10232    78720
[error] ElementParser                       0.404   4.13%      342  1180520
[error] DelimiterStackParser                0.308   3.15%      244  1259220
[error] ChoiceParser                        0.226   2.31%     5750    39360
[error] SeqCompParser                       0.113   1.15%      318   354300
[error] ConvertTextNumberParser             0.060   0.61%  1489652       40
[error] OrderedUnseparatedSequenceParser    0.058   0.60%  2922016       20
[error] ConvertTextCombinatorParser         0.000   0.00%     8825       40

This gives a clear breakdown of how much time was spent in each parser (excluding nested child parsers) and a rough idea of where to focus optimizations. Note that it sometimes helps to add additional tracked sections within a parser to determine which parts of it are the bottlenecks.