REEF and Tang

Tang is a dependency injection framework co-developed with REEF. It has extensive documentation in its own right, and this page is not an attempt to replace that documentation. Instead, it gives the motivation for dependency injection as the right mechanism for REEF configuration, and explains why Tang is preferable to other available dependency injectors for REEF.

Why dependency injection for REEF?

The API maintainability angle

One of REEF's goals is to provide a common core library for many big data applications.

Those applications exhibit a great diversity of needs in terms of the REEF features they use. For instance, some applications need to forgo REEF's resource request abstraction and drop down to the level of the underlying resource manager. Others have no need for the storage layer of REEF, but want the networking package. Furthermore, we expect considerable expansion and API changes as REEF and the set of applications developed with it grow and mature.

Dependency injection allows us to provide this global diversity (libraries, API versions, ...) with local clarity. Each application class has the part of the REEF API it needs injected via Tang. If REEF adds new APIs, nothing has to change in the application code. Furthermore, REEF can provide old and new APIs in parallel, without compromising either. The keen reader may want to compare this agility with the debacle that was the old, new, then old-is-new-again API changes of Hadoop MapReduce.

The extensibility angle

REEF itself is written using Tang. This means that individual components of REEF (e.g. the EvaluatorManager class) obtain instances of other parts (e.g. the Configuration serializer) via Tang. This allows us to easily replace those components without broad-scale code changes. Further, this approach provides REEF with a principled extension mechanism for advanced applications: Just about anything in REEF's implementation can be replaced, augmented and tailored to custom needs this way. In a way, this is exemplified by the event handlers provided by the application itself: The REEF implementation merely has those injected via Tang, treating them just like any other component of REEF, be it provided by the application or REEF itself.
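
The replacement mechanism described above can be illustrated with a minimal sketch in plain Java. This is not Tang's API: `Bindings`, `ConfigSerializer`, `KeyValueSerializer`, `CustomSerializer` and `Manager` are hypothetical names invented for this example, and a real injector would construct `Manager` itself rather than have the caller do it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical component interface; the real REEF serializer API may differ.
interface ConfigSerializer {
  String serialize(Map<String, String> conf);
}

// Two interchangeable implementations (both invented for this sketch).
final class KeyValueSerializer implements ConfigSerializer {
  public String serialize(Map<String, String> conf) {
    return "kv:" + conf.size();
  }
}

final class CustomSerializer implements ConfigSerializer {
  public String serialize(Map<String, String> conf) {
    return "custom:" + conf.size();
  }
}

// A consumer that only knows the interface it receives in its constructor.
final class Manager {
  private final ConfigSerializer serializer;

  Manager(ConfigSerializer serializer) {
    this.serializer = serializer;
  }

  String describe(Map<String, String> conf) {
    return serializer.serialize(conf);
  }
}

// A tiny binding table from interface to implementation factory.
final class Bindings {
  private final Map<Class<?>, Supplier<?>> suppliers = new HashMap<>();

  <T> Bindings bind(Class<T> iface, Supplier<? extends T> impl) {
    suppliers.put(iface, impl);
    return this;
  }

  <T> T get(Class<T> iface) {
    Supplier<?> supplier = suppliers.get(iface);
    if (supplier == null) {
      throw new IllegalStateException("No binding for " + iface.getName());
    }
    return iface.cast(supplier.get());
  }
}

class BindingsDemo {
  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("k", "v");

    // Default wiring.
    Bindings defaults = new Bindings().bind(ConfigSerializer.class, KeyValueSerializer::new);
    System.out.println(new Manager(defaults.get(ConfigSerializer.class)).describe(conf));

    // Swapping the implementation touches one binding; Manager is unchanged.
    Bindings custom = new Bindings().bind(ConfigSerializer.class, CustomSerializer::new);
    System.out.println(new Manager(custom.get(ConfigSerializer.class)).describe(conf));
  }
}
```

The point of the sketch is that `Manager` never names a concrete serializer, so replacing one is a change to the bindings rather than to the consuming code.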

The configuration management angle

Configuration is incredibly hard for distributed systems. Not only are the applications usually complex with many parameters, the problem itself is also distributed: For example, configuration for Tasks is created by the Driver, but acted upon by an Evaluator. Even worse, that Configuration is often an amalgam of user-supplied and application-defined parameter values. With traditional approaches (e.g. the Hadoop JobConf), it is often only at runtime that trivial issues like misspellings and missing parameters are discovered.

Tang treats configuration parameters as part of the dependency tree of an object. This means one can ask questions like "can a Task be instantiated?" and get an answer based on configuration data alone: Tang can decide whether all the dependencies of the Task are met, and that includes the configuration parameters just as it does instances of other classes, which themselves need to have their dependencies met.
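
To make the "can it be instantiated?" question concrete, here is a minimal sketch in plain Java of such a check. It is not Tang's implementation or API: `InjectabilityCheck`, `Threads`, `EventLog` and `Worker` are hypothetical names, and the sketch treats every declared constructor as a candidate, which is an assumption made for brevity. The check walks constructor signatures only and never creates an object.

```java
import java.lang.reflect.Constructor;
import java.util.HashSet;
import java.util.Set;

// Hypothetical components; the names are illustrative, not REEF classes.
final class Threads {              // stands in for a configuration parameter
  final int value;
  Threads(int value) { this.value = value; }
}

final class EventLog {             // a dependency with no further requirements
  EventLog() {}
}

final class Worker {               // needs a parameter and another component
  Worker(Threads threads, EventLog log) {}
}

final class InjectabilityCheck {
  // Types for which the configuration supplies a value.
  private final Set<Class<?>> bound = new HashSet<>();

  InjectabilityCheck bind(Class<?> type) {
    bound.add(type);
    return this;
  }

  // True if 'type' is bound, or if at least one of its constructors has
  // arguments that are all injectable themselves.
  boolean isInjectable(Class<?> type) {
    return check(type, new HashSet<>());
  }

  private boolean check(Class<?> type, Set<Class<?>> inProgress) {
    if (bound.contains(type)) {
      return true;
    }
    if (!inProgress.add(type)) {
      return false;                // cyclic dependency: cannot be satisfied
    }
    try {
      // Primitives and interfaces declare no constructors, so they fail here
      // unless they were bound explicitly.
      for (Constructor<?> ctor : type.getDeclaredConstructors()) {
        boolean allMet = true;
        for (Class<?> arg : ctor.getParameterTypes()) {
          if (!check(arg, inProgress)) {
            allMet = false;
            break;
          }
        }
        if (allMet) {
          return true;
        }
      }
      return false;
    } finally {
      inProgress.remove(type);
    }
  }
}

class InjectabilityDemo {
  public static void main(String[] args) {
    // Nothing bound: Worker needs Threads, which needs an unbound int.
    System.out.println(new InjectabilityCheck().isInjectable(Worker.class));  // false

    // With Threads supplied by the configuration, Worker can be built.
    System.out.println(
        new InjectabilityCheck().bind(Threads.class).isInjectable(Worker.class));  // true
  }
}
```

Because the answer depends only on the bound types and the declared constructors, a check like this can run on the machine that creates the configuration, before anything is shipped to the machine that acts on it.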

Further, Tang uses Java types to identify configuration parameters as a replacement for the common string keys used in <key,value> based configuration. While this obviously increases the typing effort at development time, it makes it virtually impossible to misspell a parameter name. It also allows Tang and tools built with it to validate configuration statically, before the application is running.
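
The difference between string keys and type keys can be sketched in plain Java as follows. This models the idea only and is not Tang's API: `ParamName`, `TypedConfig`, `NumberOfThreads` and `JobIdentifier` are hypothetical names made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical marker interface: a parameter is identified by a Java type,
// and T is the type of the value it carries.
interface ParamName<T> {}

// One class per configuration key (both invented for this sketch).
final class NumberOfThreads implements ParamName<Integer> {}
final class JobIdentifier implements ParamName<String> {}

// A tiny configuration store keyed by parameter class instead of by string.
final class TypedConfig {
  private final Map<Class<?>, Object> values = new HashMap<>();

  // The compiler checks that the value matches the parameter's declared type.
  <T> TypedConfig bind(Class<? extends ParamName<T>> name, T value) {
    values.put(name, value);
    return this;
  }

  <T> T get(Class<? extends ParamName<T>> name) {
    Object value = values.get(name);
    if (value == null) {
      throw new IllegalStateException("Unbound parameter: " + name.getSimpleName());
    }
    // Safe because bind() only stores values of the parameter's type.
    @SuppressWarnings("unchecked")
    T typed = (T) value;
    return typed;
  }
}

class TypedConfigDemo {
  public static void main(String[] args) {
    TypedConfig conf = new TypedConfig()
        .bind(NumberOfThreads.class, 4)
        .bind(JobIdentifier.class, "wordcount");

    int threads = conf.get(NumberOfThreads.class);
    System.out.println(threads + " threads for " + conf.get(JobIdentifier.class));

    // conf.bind(NumberOfThreds.class, 4);     // does not compile: no such class
    // conf.bind(NumberOfThreads.class, "4");  // does not compile: String is not Integer
  }
}
```

With a string-keyed map, both commented-out mistakes would compile and surface only at runtime, if at all.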

Why Tang and not Guice, MEF, ...?

Much of the above can be achieved with any of the available dependency injectors. In fact, much of it is the very motivation for developing and using them. Hence, one might reasonably ask: What could have possibly motivated us to build yet another one? The answer is multifaceted, as is to be expected:

REEF is for distributed systems: As alluded to above, configuration is set by one machine and acted upon by another. Hence, we cannot allow for features like Guice's provider methods that have access to the local state of a JVM. That state may not be available on the remote JVM. Worse, it could differ in subtle ways.
REEF aims to bridge CLR and JVM, and so does Tang: One of the goals of REEF is to support applications that need to execute a mixture of JVM and CLR tasks coordinated by a Driver in either one of these environments. In order to retain the benefits of dependency injection, we needed a dependency injector that is designed for multiple language environments. Tang is.
