This wiki page is the top-level page for a UIMA Version 3 "spec" describing what might be in version 3. It should be considered a draft that is not yet settled.
Some of these things may be implemented over an extended timeline. Items planned for later in the timeline are marked <later>.
Layers and packaging
Currently, we have the core SDK, UIMA-AS for scaleout, and uimaFIT.
The core SDK will be broken up; in particular, the uimaj-core Jar has layers:
- For those not wanting annotators, but just CAS processing
- various serializers (including legacy and JSON)
- Pipeline support
- Index support
- Multiple-index (views) support
- A compatibility layer - supporting old v2 APIs in v3
It would be good to integrate uimaFIT (Richard - could you work up a proposal for this, maybe as a linked wiki-page?)
Subsetting approach for non-Java implementations
UIMA supports C++ implementations (e.g. uimacpp). New Java types supported by the Java version of UIMA V3 may not be supported (initially) in the C++ implementation. The expectation is that these types will be excluded in any transport to C++.
Application, Annotators, and ClassLoading
UIMA has several kinds of APIs:
- Application - creates pipelines, runs them, can create and reference CASes. This code is the code which is "outside" UIMA, and calls into it.
- Annotators - run by the framework, access CASes. This code is called by the UIMA framework.
- External Resources - accessible by both application and annotator code
In V3, part of the lifecycle is generating the JCas classes from the merged type definition of a pipeline. The resulting classes are loaded, and instances of these represent the Feature Structures.
To have these FSs accessible to Application, Annotators, and External Resources, the class loader used for these must be either the same one or a child of one where the generated JCas classes are loaded.
A general requirement is that the type system commit event must precede any reference to any of the JCas classes (because they won't yet be generated).
Styles of use of class loaders
For simple UIMA pipelines that do not support (within the UIMA framework itself) multiple (different) type systems, the original ClassLoader can be used as the class loader for the JCas classes.
- For this to work, no reference to any of the JCas classes can be resolved prior to type system commit.
- If this classloader is an instance of UIMAClassLoader, then it can be initialized (at type system commit) with the merged type system, and do lazy (on demand) generating/loading of JCas classes
CAS
The CAS is similar to V2 CAS except it doesn't hold the Feature Structures:
- it has a merged type system
- it has an optional subject-of-analysis
- it has 1 or more "views" which represent indexed Feature Structure instances
- Each view may have its own index definition; Annotation index may be omitted. <<< new proposed feature in V3
Lifecycle events
Pipeline (optional) and type system merging; and resulting creation of JCas Class definitions - occur at the start of processing, and whenever the pipeline and/or the merged type system needs changing.
- within one such event, multiple CASs may be created, and multiple pipeline instances may be realized, all sharing the same type system.
- Indexes associated with CASs may be "reset" (to empty), the SofA may be initialized, and the pipeline run, adding/removing things from various CAS indexes.
- Merged Type Systems can be shared among multiple index repositories. However, JCas-generation is run for a particular pair of merged-type-system and index repository definitions, to capture the features needing extra checking against index corruption.
Types and Features
UIMA supports type merging among collections of components to be run together in a pipeline.
An instance of a TypeSystem represents one such merged set of types and features. As it is constructed, Types and Feature information are collected. After all merging is complete, the Type System instance is committed, which calculates constant values for this particular type system used during running, and attaches this instance to any loadable JCas implementations that correspond (by name) to those types.
- Note that multiple, different type system instances can share the same JCas class definitions (assuming the differences in the type systems all continue to "fit" with the JCas definitions). A common case of this is the Built-In JCas class definitions, e.g., those for TOP, Annotation, FSArray, etc.
- This sharing is by "class loader"; each set of Type System Instances, sharing a common class loader that specifies a loading context for the JCas classes, shares those JCas classes.
No merging of identical type systems is done at commit time, to allow future expansion of capability where it may be possible to augment the type system after commit time with new features and types.
There are two kinds of types and features:
- "static" - known via type and feature specifications merged at pipeline startup time. These are "compiled" for efficient operation while running.
- "dynamic" - additional types and/or features added after pipeline startup time.
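The build, commit, then augment lifecycle described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the UIMA implementation: SketchTypeSystem and all of its methods are hypothetical names, and the "compiled" constants are modeled here as simple feature offsets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and are not the UIMA API.
class SketchTypeSystem {
    // type name -> (feature name -> range type name)
    private final Map<String, Map<String, String>> staticFeatures = new LinkedHashMap<>();
    private final Map<String, Map<String, String>> dynamicFeatures = new LinkedHashMap<>();
    // Constants computed once at commit; stands in for the "compiled" form.
    private final Map<String, Integer> offsets = new LinkedHashMap<>();
    private boolean committed = false;

    // Before commit a feature is "static"; after commit it is "dynamic".
    void addFeature(String type, String feature, String range) {
        Map<String, Map<String, String>> target = committed ? dynamicFeatures : staticFeatures;
        target.computeIfAbsent(type, t -> new LinkedHashMap<>()).put(feature, range);
    }

    // Commit assigns a fixed offset to every static feature.
    void commit() {
        if (committed) {
            throw new IllegalStateException("type system already committed");
        }
        int next = 0;
        for (Map.Entry<String, Map<String, String>> e : staticFeatures.entrySet()) {
            for (String feature : e.getValue().keySet()) {
                offsets.put(e.getKey() + ":" + feature, next++);
            }
        }
        committed = true;
    }

    boolean isCommitted() {
        return committed;
    }

    // Returns the compiled offset of a static feature, or -1 if it has none.
    int offsetOf(String type, String feature) {
        Integer offset = offsets.get(type + ":" + feature);
        return offset == null ? -1 : offset;
    }

    boolean isDynamic(String type, String feature) {
        Map<String, String> features = dynamicFeatures.get(type);
        return features != null && features.containsKey(feature);
    }

    public static void main(String[] args) {
        SketchTypeSystem ts = new SketchTypeSystem();
        ts.addFeature("Token", "pos", "String");
        ts.commit();
        ts.addFeature("Token", "score", "Double"); // added after commit, so dynamic
        System.out.println(ts.offsetOf("Token", "pos") + " " + ts.isDynamic("Token", "score"));
    }
}
```

In this sketch, static features get direct offsets while dynamic ones would need a slower lookup, which mirrors the efficiency distinction drawn above.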
Feature ranges may have different "strength" of strong typing (new feature). Strong type: byte or short; weak type: number.
Where Type/Feature meta-information is kept
The TypeImpl and FeatureImpl objects are one place this information is kept. These are used prior to generating the actual JCas Cover Object for the type, and while running to provide direct access to meta-information about types and features.
The JCas Cover object directly represents some of this metadata
- super type (as value of extends for the class)
- for each feature: the range type (as the value used in getters and setters)
- Used meta-information copied to "class" variables of the type to avoid extra dereferencing
Categories of Types
Feature Structures
These are implemented using Java objects, one per FeatureStructure. They can be Garbage Collected.
There is a generic Java class for these, plus (optional) specific classes for JCas style access.
This page has the details.
Serialization and Deserialization
Multiple formats are supported, including legacy ones (XCas, XMI, JSON). These may need extensions to support new data types.
New formats (e.g., Kryo), possibly "pluggable".
Pipelines (optional)
In a layering sense, UIMA pipelines are optional. For example, one could write a "reducer" which takes some serialized CASes, deserializes them, and then uses FeatureStructure APIs to access this data and compute a reduced summary of it.
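A pipeline-free "reducer" of this kind might be sketched as follows. Everything here is an assumption made for illustration: ParsedFs stands in for a deserialized Feature Structure, and the one-line-per-FS text format ("type begin end") stands in for a real serialization such as XMI or JSON.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a deserialized Feature Structure.
class ParsedFs {
    final String type;
    final int begin;
    final int end;

    ParsedFs(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    // Parses "type begin end"; this format is invented for the sketch.
    static ParsedFs parse(String line) {
        String[] parts = line.trim().split("\\s+");
        return new ParsedFs(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }
}

class CasReducerSketch {
    // Each inner list is one "serialized CAS"; the summary counts FSs per type.
    static Map<String, Integer> countByType(List<List<String>> serializedCases) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> cas : serializedCases) {
            for (String line : cas) {
                ParsedFs fs = ParsedFs.parse(line); // "deserialize" one FS
                counts.merge(fs.type, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> cas1 = Arrays.asList("Sentence 0 11", "Token 0 5", "Token 6 11");
        List<String> cas2 = Arrays.asList("Token 0 3");
        System.out.println(countByType(Arrays.asList(cas1, cas2)));
    }
}
```

No annotators, flow control, or pipeline objects appear in the sketch; it only needs the deserialization and Feature Structure access layers.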
If there is a pipeline, it has the same concepts of type system merging as V2.
To allow for reloading, a pipeline definition (including its merged type system) is loaded once under a specific isolating (meaning it doesn't delegate up first) class loader.
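An isolating loader of the kind mentioned above is commonly written as a "child-first" class loader. The following is a minimal sketch of that general technique, assuming the pipeline classes are found on a URL-based classpath; ChildFirstClassLoader is a hypothetical name and this is not the UIMA class loader.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch of a child-first (isolating) class loader.
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Look in this loader's own URLs first ...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ... and delegate upward only if not found locally.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs, every request falls back to the parent loader.
        ChildFirstClassLoader loader =
            new ChildFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

Discarding such a loader and creating a fresh one is one way to reload a pipeline definition, since classes defined by the old loader are no longer reachable through the new one.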
Similar concept of shared external resources, except: <later> additional ability to "scope" these to an aggregate/primitive (currently, these are global to the pipeline).
Pear pipeline packaging
Keep isolation of classpath as before; consider adding additional isolations (e.g. type redefinition, not merged, except for input/output types)
Concurrency
Support for multi-processing flows.
Indexes (optional)
In v2, Indexes are the only way to access FSs in a CAS. This (may be / is) extended in v3:
- a basic index is always available, the same as the v2 default bag index. This allows retrieval by type (optionally including subtypes); and doesn't support duplicates of identical FS
- Defined bag indexes are redundant, but are supported for backwards compatibility.
- set or sorted indexes. Annotation index may be excluded if not wanted (new feature).
- ability to specify different indexes for each view in multi-view scenarios (new feature).
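The index behavior listed above can be illustrated with a small sketch. It is an assumption-laden model, not the v3 design: IdxFs and ViewSketch are hypothetical names, the default bag index is modeled as an identity-based set per type (so re-adding the same FS has no effect), subtype handling is left out, and the optional sorted index orders by begin ascending, then end descending, with an invented id tie-breaker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an annotation-like Feature Structure.
class IdxFs {
    private static int nextId = 0;
    final int id = nextId++;
    final String type;
    final int begin;
    final int end;

    IdxFs(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical view: a default bag index plus an optional sorted index.
class ViewSketch {
    private final Map<String, Set<IdxFs>> byType = new LinkedHashMap<>();
    private final TreeSet<IdxFs> sorted; // null when this view omits the sorted index

    ViewSketch(boolean withSortedIndex) {
        Comparator<IdxFs> order = Comparator
            .comparingInt((IdxFs f) -> f.begin)
            .thenComparing((IdxFs f) -> f.end, Comparator.reverseOrder())
            .thenComparingInt((IdxFs f) -> f.id); // keeps distinct FSs with equal offsets
        this.sorted = withSortedIndex ? new TreeSet<>(order) : null;
    }

    // Returns false if this exact FS is already indexed in this view.
    boolean addToIndexes(IdxFs fs) {
        Set<IdxFs> bag = byType.computeIfAbsent(fs.type,
            t -> Collections.newSetFromMap(new IdentityHashMap<IdxFs, Boolean>()));
        boolean added = bag.add(fs);
        if (added && sorted != null) {
            sorted.add(fs);
        }
        return added;
    }

    int count(String type) {
        Set<IdxFs> bag = byType.get(type);
        return bag == null ? 0 : bag.size();
    }

    List<IdxFs> sortedView() {
        return sorted == null ? new ArrayList<IdxFs>() : new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        ViewSketch view = new ViewSketch(true);
        IdxFs token = new IdxFs("Token", 0, 5);
        System.out.println(view.addToIndexes(token)); // first add
        System.out.println(view.addToIndexes(token)); // duplicate add is ignored
    }
}
```

Two views built with different constructor arguments model the per-view index definitions: one may carry the sorted index while another omits it.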
Using JavaObjects (a new UIMA "type") which implement Java collections is an alternative to using indexes. Some APIs may be modified to accept "roots" - collection objects specifying FSs to operate on.
Iterating over FSs
V3 incorporates some of the styles from uimaFIT's select operators.
Java 8 facilities like streams and spliterators are used.
Design outline is here.
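A stream-based selection in the spirit of uimaFIT's select and selectCovered might be sketched like this. The classes and method signatures below are hypothetical and only illustrate the style; they are not the proposed v3 API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical annotation classes used only by this sketch.
class SelAnnotation {
    final int begin;
    final int end;

    SelAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

class SelSentence extends SelAnnotation {
    SelSentence(int begin, int end) {
        super(begin, end);
    }
}

class SelToken extends SelAnnotation {
    final String text;

    SelToken(int begin, int end, String text) {
        super(begin, end);
        this.text = text;
    }
}

class SelectSketch {
    private final List<SelAnnotation> indexed = new ArrayList<>();

    void add(SelAnnotation annotation) {
        indexed.add(annotation);
    }

    // All indexed FSs that are instances of the given class, as a stream.
    <T extends SelAnnotation> Stream<T> select(Class<T> type) {
        return indexed.stream().filter(type::isInstance).map(type::cast);
    }

    // FSs of the given class whose span lies within the covering annotation.
    <T extends SelAnnotation> Stream<T> selectCovered(Class<T> type, SelAnnotation cover) {
        return select(type)
            .filter(a -> a != cover && a.begin >= cover.begin && a.end <= cover.end);
    }

    public static void main(String[] args) {
        SelectSketch cas = new SelectSketch();
        SelSentence sentence = new SelSentence(0, 11);
        cas.add(sentence);
        cas.add(new SelToken(0, 5, "hello"));
        cas.add(new SelToken(6, 11, "world"));
        List<String> words = cas.selectCovered(SelToken.class, sentence)
            .map(t -> t.text)
            .collect(Collectors.toList());
        System.out.println(words);
    }
}
```

Returning a Stream lets callers chain the usual Java 8 operations (filter, map, collect) instead of writing explicit iterator loops.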
Configuration
TBD; a mixture of current pipeline/aggregation-based specs, external override specs, and uimaFIT approaches?
Feature Structures
Supporting both pre-defined (merged) types efficiently, and dynamically augmented features of existing types (less efficiently).
As Java objects (only); and therefore garbage-collectable
Dynamically added features
Additional built-in types
Support for Java collections and maps (implies augmenting existing serializations)
Incompatibilities
The indexing structures for v3 won't support duplicate-add-to-indexes for bags and for sorted indexes.
Preserve ability to run version 2 pipelines/apps
Packaged as an optional layer.
Migration tool with some manual work may be needed for JCas customized cover classes.
Migration and PEAR support - JCas
Internals
Indexes and Iterators
These are organized around objects having collections of FSs, not ints (as was the case in v2).
Internationalized Exceptions and Messages
Internationalization is handled by the static methods in I18nUtil. These are called by the Internationalized Exceptions, but may be used for non-exception message localization.
Exception messages are collected into classes. These classes may be organized further into hierarchies, but the top of each hierarchy extends one of the following 3 classes:
- Exception - for checked exceptions
- RuntimeException - for unchecked exceptions
- SAXException - for exceptions thrown during SAX-related callbacks requiring that SAXExceptions be thrown
Common code for getting a localized message from arguments, message key, and resource bundle is put in the interface I18nExceptionI as default methods.
The individual classes:
- hold public static MSG_NAME = "prop-file-key-name" values, which allow IDE search via completion and renaming via Eclipse refactoring
- Classes collect messages for some sub-section of the code
- Super class structure can supply common resource bundles
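The arrangement above (message keys as constants, shared formatting code in interface default methods) might be sketched as follows. This is only an illustration under stated assumptions: the interface is named I18nExceptionSketch to make clear it is not the real I18nExceptionI, all method names are hypothetical, and an in-memory ListResourceBundle stands in for a properties file so the example is self-contained.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

// Stand-in for a .properties resource bundle; contents are invented.
class SketchMessages extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"type_not_found", "Type \"{0}\" was not found in type system \"{1}\"."}
        };
    }
}

// Hypothetical counterpart of the shared interface: common code as default methods.
interface I18nExceptionSketch {
    String getMessageKey();

    Object[] getArguments();

    ResourceBundle getBundle();

    // Looks up the pattern by key and fills in the arguments.
    default String getLocalizedText() {
        String pattern = getBundle().getString(getMessageKey());
        return MessageFormat.format(pattern, getArguments());
    }
}

// One message-collecting class; its hierarchy tops out at RuntimeException.
class SketchRuntimeException extends RuntimeException implements I18nExceptionSketch {
    // The constant's value is the key in the bundle.
    public static final String TYPE_NOT_FOUND = "type_not_found";

    private final String key;
    private final Object[] arguments;

    SketchRuntimeException(String key, Object... arguments) {
        this.key = key;
        this.arguments = arguments;
    }

    @Override
    public String getMessageKey() {
        return key;
    }

    @Override
    public Object[] getArguments() {
        return arguments;
    }

    @Override
    public ResourceBundle getBundle() {
        return new SketchMessages();
    }

    @Override
    public String getMessage() {
        return getLocalizedText();
    }

    public static void main(String[] args) {
        RuntimeException e =
            new SketchRuntimeException(SketchRuntimeException.TYPE_NOT_FOUND, "Token", "demo");
        System.out.println(e.getMessage());
    }
}
```

Because the formatting logic lives in the interface, the checked and SAX-based hierarchies could reuse it without sharing a common superclass.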
Non-functional
JIRA Organization
JUnit 4 test cases
Packaging/release at a smaller grain
To allow updating just one part, for instance a serialization form.