[Gliffy diagram: v3_FeatureStructure_organization_diagram]

xxx_Type files

These are eliminated in v3.  They served 2 purposes:

  1. save one slot per feature structure - instead of a casImpl ref and a typeImpl ref, there was just one ref to the _Type instance, which in turn held these two refs
  2. provided a place for the low level accessors.

It's unclear if anyone is using the low level accessors.  These may be retained, but moved to the main xxx classes.

The non-JCas style of Java cover classes (FeatureStructureImplC) did not use a _Type instance, and as a consequence had both a casImpl ref and a typeImpl ref.

JCas Class generation

JCas cover classes now come as single classes, rather than in pairs.  These classes are either built-in or generated; built-in ones cannot be generated.  A single class definition might be used for multiple type systems; a single definition is used for all the built-in types.  Each JCas class instance

  • has a ref to the CAS view.
  • has a ref to the corresponding TypeImpl.  This can't be in the main class as a static, because the built-in main classes are shared across type systems, which makes the relationship one-to-many.

When generated, they are specific to one (merged) type system, except for shared, common, built-in class definitions.  To allow for multiple type systems within one JVM simultaneously, class loader isolation is used.

  • Class loader isolation is optional - it may not be needed for simple deployments, or it may be being handled outside of UIMA (e.g., a single UIMA pipeline running as a servlet)
  • UIMA provides the UIMATypeSystemClassLoader which can be used for classpath isolation, and also serves to implement lazy (just in time) generation of the JCas classes.
    • When this is used, UIMA artifacts that might reference types (application, external resources, or annotators) need to be loaded under this class loader.
    • The same class loader lazily generates JCas classes (both the x.y.z.Foo and x.y.z.Foo_Type) and loads them on demand.
      • To enable this, the UIMATypeSystemClassLoader has a settable reference to the associated Type System; type system commit searches the class loader chain for an instance of UIMATypeSystemClassLoader, and sets this reference.  If the reference is already set and the merged type system is different, an error is thrown.
      • The generation/load happens when a reference is made to a class having the same name as a UIMA type (or xxx_Type).  With Java's lazy loading, if this class is not already loaded, it is generated and loaded.
  • If there is no UIMATypeSystemClassLoader in the parent chain:
    • type system commit does a batch generate and injection-load of all types (not lazy), using the current ExtensionClassLoader from the UIMA ResourceManager (if it exists) or the current class loader (or perhaps the current context class loader if it exists).
    • If these types are already loaded (findLoadedClass doesn't return null), throw an exception.  This situation means a reference was made to a JCas class prior to type system commit, which caused a class by that name to be loaded from the classpath, and that class may not be the same as the generated one.  The user (application developer) will need to fix this by ensuring type system commit happens before any reference by name to a JCas class (e.g. new Foo...).
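The commit-time search of the class loader chain described above can be sketched in plain Java. This is only an illustration under stated assumptions: SketchTypeSystemClassLoader and CommitSketch are hypothetical stand-ins (not UIMA classes), the type system is modeled as a plain Object, and class generation itself is left out. Only ClassLoader.getParent() is a real JDK call.

```java
// Hypothetical stand-in for the UIMATypeSystemClassLoader described above.
// It models only the settable type system reference, not class generation.
class SketchTypeSystemClassLoader extends ClassLoader {
    private Object typeSystem;  // stand-in for the committed (merged) type system

    SketchTypeSystemClassLoader(ClassLoader parent) {
        super(parent);
    }

    // Set once; a later commit is accepted only if the type system is equal.
    synchronized void setTypeSystem(Object ts) {
        if (typeSystem != null && !typeSystem.equals(ts)) {
            throw new IllegalStateException(
                "class loader is already bound to a different type system");
        }
        typeSystem = ts;
    }

    synchronized Object getTypeSystem() {
        return typeSystem;
    }
}

class CommitSketch {
    // Walk the parent chain from 'start'; return the nearest sketch loader, or null.
    static SketchTypeSystemClassLoader findNearest(ClassLoader start) {
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            if (cl instanceof SketchTypeSystemClassLoader) {
                return (SketchTypeSystemClassLoader) cl;
            }
        }
        return null;
    }

    // Models the class loader part of "type system commit".
    // Returns true if a loader was found and bound (lazy generation possible),
    // false if the caller would have to fall back to batch generation.
    static boolean commit(Object typeSystem, ClassLoader current) {
        SketchTypeSystemClassLoader tscl = findNearest(current);
        if (tscl == null) {
            return false;
        }
        tscl.setTypeSystem(typeSystem);
        return true;
    }
}
```

Committing an equal type system a second time is accepted here; whether the real implementation would allow that is an assumption of this sketch.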

Since the generated classes have static fields which ref from the _Type to the main class, generate the main class first, then the _Type one.  Avoid circular references.

Connecting Instances with MetaInformation

Meta information about types and features is stored in

  • TypeImpl and FeatureImpl instances
    • These are not shared among TypeSystems, as they need to have (for constrained iterator impl) refs to the TypeSystem.
  • The "static" information of generated or built-in JCas classes representing types
    • Some of these classes (but not class instances) are shared among type systems (e.g. the built-in types)
      • Therefore, the static data cannot reference Type/Feature instances
  • The _Type class is generated for each Typesystem, for the merged type
    • An instance of this is kept per Cas View, and referred to from the instance of the JCas Type

...


APIs for creating Feature Structures, and setting / getting Feature Values in them

There are several kinds of APIs for this.

 

  • Basic: this was the original API, and makes use of UIMA Feature and Type objects as arguments.
  • JCas: this is an API that uses common Java idioms for creating, getting, and setting. 
  • LowLevel: this was like Basic, but substituted an int-valued address for the Java Feature Structure object, and in general, avoided creating Java objects.
    • In V3, it is dangerous to create FS using the low level API, because the resulting FS is identified only by an int; if the Java garbage collector runs before any reference to the newly created FS exists, the FS will disappear (it is garbage collected).  So the low level APIs in Version 3 are deprecated.

 

The API styles compared (description, create example, get a value, set a value):

Plain
  • Description: uses UIMA Type and Feature instances.  API: CAS
  • Create example:
    • casView.createFS(aType)
    • casView.createXXArrayFS(size), where XX is the element type
  • Get a value:
    • fs.getIntValue(aFeature)
    • fs.get(index) when fs is one of the built-in arrays
  • Set a value:
    • fs.setFloatValue(aFeature, value)
    • fs.set(index, value) when fs is one of the built-in arrays

JCas
  • Description: follows Java conventions; types and features must be known at compile time
  • Create example:
    • new MyType()
    • can have additional constructors
  • Get a value:
    • fs.getMyFeature()
    • fs.getMyArrayFeature(index) when the value of myArrayFeature is one of the built-in arrays
    • fs.get(index) when fs is one of the built-in arrays
  • Set a value:
    • fs.setMyFeature(value)
    • fs.setMyArrayFeature(index, value)
    • fs.set(index, value)

Low Level
  • Description: in version 2 this allowed CAS access without making any Java objects; there was much less "checking" and it was for high-performance cases.  Feature Structures were referred to by their int address in the internal heap.  API: LowLevelCAS
  • Create example: these had the same names as the Plain API, except prefixed with "ll_", e.g. casView.ll_createFS(aType).  Instead of returning a Java object representing the FS, these return ints.
  • Get a value:
    • lowLvlCas.ll_getIntValue(addr, feat) where the addr and feat are both ints
  • Set a value:
    • lowLvlCas.ll_setFloatValue(addr, feat, value)
    • lowLvlCas.ll_setBooleanArrayValue(addr, index, value)

Getting and setting Feature values in V3

The JCas style of getting / setting feature values requires that the feature names be known at compile time, so you can write getXXXX where XXXX is the known-at-compile-time name of the feature.

The Plain style does not need this information; instead the range type must be known, and calls are made like getIntValue(aFeature), where aFeature is a Feature object that can be dynamically computed at run time.

Plain style APIs bypass any JCas getter or setter customization

The plain style APIs do not invoke the JCas style getters and setters, even if those are present and perhaps customized.  This is a design decision made to follow the V2 implementation, and also for performance reasons.  So, if you have customized a getter or a setter in JCas, you must use the JCas APIs to run the customizations.
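A small self-contained model may make the consequence concrete. The classes below are hypothetical stand-ins, not UIMA classes: SketchFS models a feature structure whose int features live in a slot array, and SketchMyType models a JCas cover class whose getter has been customized. The plain-style accessor reads the slot directly, so the customization is not applied.

```java
// Hypothetical model of a feature structure storing int features in slots.
class SketchFS {
    static final int MY_FEATURE = 0;            // hypothetical feature offset
    protected final int[] intSlots = new int[1];

    // "Plain" style accessors: read and write the slot directly by offset.
    int getIntValue(int featureOffset) {
        return intSlots[featureOffset];
    }

    void setIntValue(int featureOffset, int value) {
        intSlots[featureOffset] = value;
    }
}

// Hypothetical JCas-style cover class with a customized getter.
class SketchMyType extends SketchFS {
    // Customization: negative stored values are reported as 0.
    int getMyFeature() {
        int raw = intSlots[MY_FEATURE];
        return raw < 0 ? 0 : raw;
    }

    void setMyFeature(int value) {
        intSlots[MY_FEATURE] = value;
    }
}
```

With a stored value of -5, getIntValue(SketchFS.MY_FEATURE) returns -5 while getMyFeature() returns 0, which is why code that depends on a customization has to go through the JCas-style accessor.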

xxx_Type JCas classes removed in V3

These are eliminated in v3.  They served 2 purposes:

  • save one slot per feature structure - instead of a casImpl ref and a typeImpl ref, there was just one ref to the _Type instance, which in turn held these two refs
  • provided a place for the low level accessors; these are accessors that take the "address" (now "id") of the FS as the way to designate which FS is being used.  There are 2 varieties of these low level accessors - those implemented in the CASImpl, and those implemented in the JCas _Type classes.  The latter have methods like "myShared_TypeInstance.setXXX(address, value)".  These are instance methods on the shared xxx_Type instance, and were intended to permit access without creating the Java cover object for the FS.

The performance reason for using the low level accessors is not present in V3; in fact, these, if implemented, would be slower than the other APIs.

JCas Class sharing

JCas classes are associated with a class loader.  Except for the built-in types which always have JCas Classes, other JCas classes are optional. Furthermore, JCas classes may define only a subset of the features of the fully merged type system. So, even when a JCas class is present, it may not have getters and setters for some features of the corresponding UIMA type. These features can be accessed of course using the plain APIs (see above). 

When a UIMA type is instantiated in V3, the Java class used is the most specific JCas class found for that type or one of its supertypes.  For example, suppose you have a type Foo, with supertype Bar, which in turn is a subtype of Annotation, and no JCas classes are defined for Foo or Bar.  When you create an instance of Foo (using the plain API, casView.createFS(fooType), since the JCas style "new Foo" is not available without a JCas class for type Foo), it will create an instance of Annotation as the implementing Java class.
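The "most specific JCas class" rule amounts to walking up the supertype chain until a type with a cover class is found. A minimal sketch, assuming a hypothetical in-memory model where types and cover classes are identified by simple names (none of these classes or maps are UIMA APIs):

```java
import java.util.HashMap;
import java.util.Map;

class MostSpecificCoverClassSketch {
    // Hypothetical hierarchy: type name -> supertype name (the root has no entry).
    final Map<String, String> superTypeOf = new HashMap<>();
    // Hypothetical: type name -> name of the JCas class available for it.
    final Map<String, String> jcasClassFor = new HashMap<>();

    // Return the cover class of the nearest type, scanning upwards from typeName.
    String implementingClass(String typeName) {
        for (String t = typeName; t != null; t = superTypeOf.get(t)) {
            String coverClass = jcasClassFor.get(t);
            if (coverClass != null) {
                return coverClass;
            }
        }
        throw new IllegalStateException("no cover class found for " + typeName);
    }

    // The example from the text: Foo -> Bar -> Annotation, no classes for Foo or Bar.
    static MostSpecificCoverClassSketch fooBarExample() {
        MostSpecificCoverClassSketch s = new MostSpecificCoverClassSketch();
        s.superTypeOf.put("Foo", "Bar");
        s.superTypeOf.put("Bar", "Annotation");
        s.superTypeOf.put("Annotation", "TOP");
        s.jcasClassFor.put("Annotation", "Annotation");  // built-ins always have one
        s.jcasClassFor.put("TOP", "TOP");
        return s;
    }
}
```

In this model, implementingClass("Foo") yields "Annotation"; if a cover class for Bar were added to jcasClassFor, the same call would yield that class instead.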

One set of JCas classes per class loader may be used (even simultaneously) for multiple different type systems.  This can occur sequentially, for example, in the use case where a sequence of CASs and their type systems are being deserialized and worked on, sequentially; it can also occur when running multiple different pipelines under one class loader. When committing a type system, a check is made for each type to see if there is a corresponding JCas class, and if found, that any defined features have the proper range.

It is possible to run multiple pipelines with non-compatible type systems and JCas classes by running each one under its own class loader; in this scenario, each pipeline will load its own copy of JCas classes from its own classloader's classpath.

JCas Class and UIMA Type conformance

JCas Classes have static final fields computed at load time. Each type system commit loads corresponding JCas classes (the load only happens the first time, per class loader).

A particular type system instance is being committed when a JCas class is loaded.  At load time, these rules are checked:

  • Construct the supertype chain of the class being loaded.  It must be the case that, scanning upwards, there is a supertype that has a corresponding UIMA type.
    • It is OK if there are UIMA types between this and the found corresponding supertype - that just means there were no JCas types defined for those.
    • It is OK for the supertype chain to pass through supertypes which are not UIMA types, as long as the JCas supertypes are abstract (can't be instantiated)
  • For each feature
    • the feature offset assigned to the class's static final value must match the feature offset
    • the feature's range must match
    • JCas-defined features which do not exist in the 1st type system loading this JCas class will result in invalid getters and setters for that feature, if an attempt is made in some code to get/set those features.

How JCas feature offsets are computed or validated at type-system-commit time

The type system is walked in subsumption order, and offsets are assigned to all features.  Then the JCas classes are loaded.  If a class is actually loaded at this point, the corresponding features are used to set the static final int offset values in the JCas class.  If it was already loaded, the existing values are checked to ensure that they match the values assigned by the type system.  A mismatch can occur if multiple different type systems are being used.  Mismatches (which cannot happen if only one type system is in use) result in a fatal error.
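The assign-then-validate sequence can be illustrated with a small model. Everything here is an assumption made for the sketch: SketchTypeDef and OffsetAssignmentSketch are hypothetical, offsets are taken to be consecutive slot numbers with inherited features first, and a feature is keyed as "Type:feature". The real offset scheme may differ.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type description used only by this sketch.
class SketchTypeDef {
    final String name;
    final String superName;          // null for the root type
    final List<String> ownFeatures;  // features introduced by this type

    SketchTypeDef(String name, String superName, List<String> ownFeatures) {
        this.name = name;
        this.superName = superName;
        this.ownFeatures = ownFeatures;
    }
}

class OffsetAssignmentSketch {
    // Types must be listed in subsumption order (supertype before subtype).
    static Map<String, Integer> assignOffsets(List<SketchTypeDef> typesInSubsumptionOrder) {
        Map<String, Integer> offsets = new LinkedHashMap<>();  // "Type:feature" -> offset
        Map<String, Integer> slotCount = new HashMap<>();      // type -> slots incl. inherited
        for (SketchTypeDef t : typesInSubsumptionOrder) {
            // A subtype's own features start after the inherited slots.
            int next = (t.superName == null) ? 0 : slotCount.get(t.superName);
            for (String f : t.ownFeatures) {
                offsets.put(t.name + ":" + f, next++);
            }
            slotCount.put(t.name, next);
        }
        return offsets;
    }

    // Check offsets already fixed in a loaded JCas class against this type system.
    static void validate(Map<String, Integer> assigned, Map<String, Integer> alreadyLoaded) {
        for (Map.Entry<String, Integer> e : alreadyLoaded.entrySet()) {
            Integer expected = assigned.get(e.getKey());
            if (expected == null) {
                continue;  // feature absent from this type system; not handled here
            }
            if (!expected.equals(e.getValue())) {
                throw new IllegalStateException("offset mismatch for " + e.getKey()
                    + ": class has " + e.getValue() + ", type system has " + expected);
            }
        }
    }
}
```

The exception in validate stands in for the fatal error mentioned above.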

Connecting Instances with Type and Feature information

Information about types and features is stored in TypeImpl and FeatureImpl instances.  These are unique per type system.  However, multiple type system instances created using the same (merged) definition, and therefore "equal", are recognized at type system commit time, and the existing type system implementation is reused in this case.  This is different from V2, and may require updating code which gets references to types and features prior to type system commit; that code needs to be updated to re-acquire those references after type system commit, because the Type and Feature instances may be replaced with a shared version if the type system is equal to one already committed.

Locating the corresponding UIMA Type when creating a JCas type using the "new" operator

When a JCas instance is created using the "new" operator, it locates the type using information in a JCasRegistry.  The type cannot be statically kept in the JCas class definition, since one JCas class might be used by multiple different type systems.  Instead, each JCas class, when it is loaded, is assigned a unique incrementing number; this number is kept with the static (one per class loader) information for TypeSystemImpl.

At instance creation time, a lookup is done, using the instance of the type system, to get the actual type associated with the registry number.  This mechanism is encapsulated within the JCasRegistry class.
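A rough model of this registry lookup, with all names hypothetical (RegistrySketch, SketchTypeSystem, and SketchFoo are illustrations, not the actual JCasRegistry or TypeSystemImpl): each class obtains an incrementing number when it is initialized, and each type system keeps its own mapping from that number to its type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry: hands out an incrementing number per loaded class.
class RegistrySketch {
    private static final List<Class<?>> loaded = new ArrayList<>();

    static synchronized int register(Class<?> c) {
        loaded.add(c);
        return loaded.size() - 1;
    }
}

// Hypothetical type system: maps registry numbers to its own types.
class SketchTypeSystem {
    private final Map<Integer, String> typeByRegistryIndex = new HashMap<>();

    void bind(int registryIndex, String typeName) {
        typeByRegistryIndex.put(registryIndex, typeName);
    }

    String typeFor(int registryIndex) {
        return typeByRegistryIndex.get(registryIndex);
    }
}

// Hypothetical JCas-style class shared by several type systems.
class SketchFoo {
    // Assigned once, when the class is initialized.
    static final int REGISTRY_INDEX = RegistrySketch.register(SketchFoo.class);

    final String type;  // the type as defined in the creating type system

    SketchFoo(SketchTypeSystem ts) {
        // The class holds no static type; it is looked up per type system.
        this.type = ts.typeFor(REGISTRY_INDEX);
    }
}
```

Two type systems can bind the same registry number to their own definition of the type, so the one shared class yields instances with different types depending on which type system is passed in.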

Locating the corresponding UIMA Feature when accessing a feature using JCas APIs

The generated getter or setter code for a JCas feature needs the stored-feature-offset-index information for the feature being accessed.  In the use-case of having multiple type systems for one JCas class set loaded under one class loader, each type system might have a different number for this; this would make it necessary to have all accesses go through one level of indirection to get the particular type system's offset for a feature.

This is avoided using the following technique that assigns the offsets to match already assigned ones:

  • The first time a JCas class is loaded at type system commit time, it defines a final static int constant of the pre-computed offset.
  • The 2nd time a JCas class is accessed at type system commit time, the first value stored is read and is used for the offset.

This requires that no JCas class access is done prior to type system commit, since the static final value can only be assigned once at resolution time.  This is normally the case, since it would be invalid to do something with a JCas class before the pipeline is set up.
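The reason a static final value is fixed exactly once follows from Java class initialization: the initializer runs at the first active use of the class and never again. A minimal sketch under assumed names (CommitContextSketch, SketchBar, and SecondCommitSketch are hypothetical) shows a second "commit" simply reading back the first value:

```java
// Hypothetical holder for the offset pre-computed by the commit in progress.
class CommitContextSketch {
    static int offsetBeingCommitted = -1;
}

// Hypothetical JCas-style class.
class SketchBar {
    // Not a compile-time constant, so it is evaluated once, when the class
    // is initialized (at its first active use).
    static final int FEAT_OFFSET = CommitContextSketch.offsetBeingCommitted;
}

class SecondCommitSketch {
    // Models a type system commit that touches the JCas class.
    // The first call triggers class initialization and fixes the offset;
    // later calls just read the value that was fixed then.
    static int commit(int precomputedOffset) {
        CommitContextSketch.offsetBeingCommitted = precomputedOffset;
        return SketchBar.FEAT_OFFSET;
    }
}
```

Calling commit(3) and then commit(7) returns 3 both times, so a later type system has to adopt the already assigned offset. It also shows why touching SketchBar before the first commit would freeze the wrong value (-1 here).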

To make a new instance of Type, the Type (and _Type) classes have to be generated if not already available.  They may be available because user code might have referenced a JCas class by name, causing it to be generated and loaded. (The class loader used has a check for attempts to find a JCas cover type, and generate it on demand.)

To generate a JCas class, the class loader (an instance of UIMAClassLoader) has access to the type system impl if the type system has been committed; it checks to ensure the type system is committed, and then generates and loads the Type and _Type classes, in the context of that type system.  Built-in versions of these classes are always "found" and not generated.

  • Committing a type system sets the type system reference of the nearest UIMAClassLoader in the class-loader parent chain.
  • If the type system ref is already set, this is an error condition; a new classloader instance is required for new type systems (might be able to optimize for new but equal type system)
  • If UIMAClassLoaders are not being used, then lazy loading can't be done; instead the user may call a method to load all the classes for all the types.

Instances of a JCas type may be created via the "new" operator, passing in the JCas.  

Locating or instantiating the corresponding _Type instance

When a JCas instance is created, it needs to reference a corresponding _Type instance; these are "per CAS View".  A table is kept, by view, of already instantiated _Type instances, key =  JCas type class (identity key).  If not present, a new instance is generated from the corresponding (generated or provided) _Type class.  It should always be available (have been generated or set up) by the time it's needed.
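The per-view table with an identity key maps naturally onto java.util.IdentityHashMap. A minimal sketch, assuming a hypothetical ViewSketch class and a caller-supplied factory in place of the real _Type construction:

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of one CAS view's table of _Type instances.
class ViewSketch {
    // Key: the JCas class object, compared by identity.
    private final Map<Class<?>, Object> typeInstances = new IdentityHashMap<>();

    // Return the existing _Type stand-in for this class, creating it on first use.
    Object typeInstanceFor(Class<?> jcasClass, Function<Class<?>, Object> factory) {
        return typeInstances.computeIfAbsent(jcasClass, factory);
    }
}
```

Repeated calls on the same view return the same object, while a second view gets its own instance for the same class.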

Using the JCasRegistry

In v2, this was a map from ints to loaded JCas cover classes.

  • Keep this for now to ease backwards compatibility. But it would be nice to get rid of it.  
    • Need to enumerate all uses of it
  • Goal: make this work with multiple type systems, and use as index the (dense) typecode from TypeImpl. 
    • These type codes are common up to the end of the built-ins, and then branch, one per type system. Some of these type systems will come and go, so ensure GC can happen for the gone ones.

Lookup needs to be by type system, obtainable from instances (via ref to _Type).  Generated classes have ref to type system and can use typecode for this value.

Go from typecode via typesystem to typeimpl to generator (creator).

Getters, Setters, Constructors, indirection

For JCas style, the getters, setters, and constructors are "direct": the user's code says things like

Code Block: create, getters
new Foo() // create Foo instance, or
myFooInstance.getMyFeat()  // to get a feature or
myFooArrayInstance.getMyIndexedFeat(4)  // to get the element at index 4 of an array

Code Block: setters
myFooInstance.setMyFeat(featValue)  // to set a feature with a value
myFooArrayInstance.setMyIndexedFeat(4, featValue)  // to set the element at index 4 of an array

For non-JCas style, the user writes something like this:

Code Block: Non-JCas, indirect via Type/Feature instances
acasinstance.createFS(aType)  // create a feature structure; aType is an instance of TypeImpl
myInstance.getIntValue(aFeature) // get an int valued feature; aFeature is an instance of FeatureImpl

There are also low-level equivalents, where the typeCode or featureCode is passed instead, and the featureStructureID is passed as well.  These methods are on the CAS itself, because there's no JCas object in this case.

For these to work in version 3, we need to go from the Type or Feature instance to being able to get/set/create in the Java space.  Java 8 provides mechanisms that the JIT can optimize and that appear to perform as well as direct access: MethodHandles with LambdaMetafactory (see http://stackoverflow.com/questions/19557829/faster-alternatives-to-javas-reflection ) or method references (e.g. ClassXYZ::getFoo).  A test of these approaches appears to indicate they are as fast as native access.

To use these, the generated class needs to initialize a set of variables in the associated Type and Feature classes with the appropriate Constructor/Method references.  A way this could be done:

  • Have the class declare a set of static Supplier or Consumer or other appropriate Functional Interface values, one per getter/setter/constructor, as a particular name
  • as part of loading the class, get this value and distribute the values to all the features and type

The values would be extracted and inserted into the corresponding TypeImpl or FeatureImpl structure.  These would be invoked using the Functional Interface's method.  For example, if that method were get(), then the method would be invoked as myTypeImpl._accessors[featCode].get();
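One way the wiring just described might look, as a sketch only: SketchToken, SketchTypeImpl, and AccessorWiringSketch are hypothetical names, a single int-valued feature is assumed, and lists indexed by feature code stand in for the _accessors array. The functional interfaces (Supplier, ToIntFunction, ObjIntConsumer) are standard java.util.function types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ObjIntConsumer;
import java.util.function.Supplier;
import java.util.function.ToIntFunction;

// Hypothetical generated class exposing its constructor and accessors
// as static functional-interface values.
class SketchToken {
    private int begin;

    int getBegin() { return begin; }
    void setBegin(int v) { begin = v; }

    static final Supplier<SketchToken> _creator = SketchToken::new;
    static final ToIntFunction<SketchToken> _getBegin = SketchToken::getBegin;
    static final ObjIntConsumer<SketchToken> _setBegin = SketchToken::setBegin;
}

// Hypothetical stand-in for a TypeImpl holding the distributed references.
class SketchTypeImpl<T> {
    final Supplier<T> creator;
    final List<ToIntFunction<T>> intGetters = new ArrayList<>();   // index = feature code
    final List<ObjIntConsumer<T>> intSetters = new ArrayList<>();  // index = feature code

    SketchTypeImpl(Supplier<T> creator) {
        this.creator = creator;
    }
}

class AccessorWiringSketch {
    static final int BEGIN_FEAT_CODE = 0;  // hypothetical feature code

    // Models the "as part of loading the class" distribution step.
    static SketchTypeImpl<SketchToken> wire() {
        SketchTypeImpl<SketchToken> ti = new SketchTypeImpl<>(SketchToken._creator);
        ti.intGetters.add(SketchToken._getBegin);  // lands at index BEGIN_FEAT_CODE
        ti.intSetters.add(SketchToken._setBegin);
        return ti;
    }
}
```

Generic code can then create and access instances without naming the class, for example ti.creator.get() followed by ti.intSetters.get(AccessorWiringSketch.BEGIN_FEAT_CODE).accept(fs, 42).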

Issue with supporting multiple different type systems, serially.

This has one serious issue, illustrated by the use case: 

  1. make a pipeline, 
  2. deserialize some CAS's type system, and then deserialize that CAS
  3. do some generic processing on that CAS
  4. repeat 2 and 3 in a loop, with different type systems each time. 

The key points that cause a problem are 

  • having a UIMA pipeline that is being reused for multiple deserialized CASes, each of which might have a different type system
    • Note: this may not seem possible, because all UIMA pipelines have superclasses: AnalysisEngineImplBase -> ConfigurableResource_ImplBase -> Resource_ImplBase
      and Resource_ImplBase has a reference to a CasDefinition used for creating a CAS that matches the merged type system of the pipeline.  
      However, deserialization may supply a different type system (e.g., having extra features for some types) and create a CAS having the definition that is read in as part of the deserialization process.
      • User code might merge the deserialized type system with the definition from the pipeline.
      • Some deserializations include the concept of setting aside or ignoring types and features used in the CAS being deserialized, but not defined in the receiving CAS (which is typically the one set up from the pipeline merged type system).
  • The problem arises if the pipeline code has some JCas-like reference to some type / feature which is 
    • not built-in
    • but present in all the (varied) Type Systems being deserialized.   

    The pipeline code might have, for instance, an assignment 

    Code Block
    MyFooType mft = ....  // some code fragment yielding an instance of MyFooType
    mft.setMyFeature(333);  // sets a [named] feature in MyFooType

    When the merged type system is constructed, a "generate" step generates a JCas cover class definition which includes a class MyFooType, and in that class, a "setter" method "setMyFeature(...)", and loads this. When the pipeline is run, the code in the pipeline will be "linked" to the loaded class's setter method. 

    The difficulty arises when the next CAS with a different definition of MyFooType (say, with extra features) is deserialized.  If the deserialization approach is to ignore extra features not in the merged type system from the main pipeline, then there is no problem.  But if user code, for example, merges deserialized type systems with the UIMA pipeline, this new definition needs to replace the old one.  The old one is now "linked" with the pipeline code "mft.setMyFeature(333);" above, and can't be replaced (to my knowledge) without also unloading the pipeline code and reloading it.  (That's one potential, but significantly inefficient "solution"; another is disallowing changing type systems in this scenario.)

Proposed solution

See the section on class loaders.  Have different class loaders for the new TypeSystem and the user application and annotator code.

Collections

UIMA v2 supports specially-named arrays of primitives (+ string), e.g. BooleanArray. 

UIMA v2 supports arrays of Feature Structures, using FSArray (JCas) or ArrayFS (Generic).  

For v3, support (not yet done, TBD?)

  • new notation (arrays):  aligned with Java: TOP[] or Annotation[] or MyType[] or short[]
  • new notation (collections): aligned with Java generics: List<TOP> or ArrayList<Annotation> or HashSet<MyType>

...

  • limit (initially) generic spec to only simple type names, no support for extends, ?, etc.  Use TOP for "Object".

Strings

Keep special UIMA String type for compatibility and subtyping.

Feature Structure APIs

JCas style - where the name of the Type and Feature are known, and present in the code.

Generic style - where the name of the Type and Feature are not known ahead of time, and are referred to indirectly via variables, in the code.

"Low level" style - only for backwards compatibility.