h1.Background
This is a summary of ideas that have been discussed on the mailing lists regarding configuring UIMA pipelines.

h1.What running scenarios are contemplated?
The normal run scenario is one where you start up a pipeline; it initializes, processes CASes, and eventually finishes.

Alternatives: During the run, it is reconfigured.  We won't consider this for now.

h1.What is configuration?
Configuration is a collection of settings for a particular UIMA pipeline run.  It can include conventional UIMA parameter settings, as well as other kinds of settings such as "placeholder" values that are substituted into UIMA deployment descriptors, debug flags, CAS-dump flags, the logging specification for the run, etc.

h1.Orthogonal issues and considerations in configuring UIMA pipelines
h2.Where to put this information
There is a continuum of places, ranging from least dynamic (most training required) to most dynamic (least training required):
* code (least dynamic, need to understand the code and where to go to modify what you want, requires most training)
* UIMA descriptors (still quite complex; have some special GUI tools for editing them)
* Structured properties files (like Jar Manifests - multiple sections, each section having key-value pairs of strings)
* "properties" files (simple key-value pair strings)
* JVM command line "defines" parameters - individual key-value pairs

Experience shows that for doing "runs", people find code and descriptors complex and hard to comprehend / change, 
and prefer simpler, more focused ways to specify things for the run.
It is possible to provide more than one of these approaches; if several are supported, a conventional 
(no surprise) rule is needed for which one overrides which.

h3.JMX
In addition to the above, JMX settings may be desired.  

h2.Using the JVM command line as a source of configuration information
Putting configuration into the command line ties a "run" of a pipeline to a "run" of a JVM.  There are cases (e.g., running UIMA inside servlets
inside a web application container) where multiple, independent instances of UIMA pipelines may independently start, run, and terminate - all
without taking the JVM up and down.  This argues for an approach not tied to the JVM command line.  

On the other hand, the command line approach is very handy for quick augmentation / overriding of particular runs, where
starting/stopping the JVM is an option.

Note that JMX settings could be arranged to be either global or per "UIMA Context".

h2.Encapsulation versus reaching down inside trees of nested aggregates
The original UIMA design attempts to support encapsulation in an aggregate, for parameter overrides.  An aggregate may override parameters
that its delegates declare.  An aggregate can choose which of these it, in turn, is willing to allow a containing aggregate to be able to
override; it can choose to "shield" some parameters, making them incapable of being overridden.

This has complicated the practice of designing large, complex nested trees of annotators, because every intervening aggregate must expose upwards
each parameter that the top level may want to override. 

An alternative mechanism is wanted in these use cases, to allow "reaching down" from a top level into lower levels, without needing all the
intervening levels of aggregation to expose individual parameters.  However, some degree of control over this is also desired.  

A suggested approach is to augment the configuration parameter *definition* with an additional property - a "global-name" which, if specified,
would enable this reaching down, by having at the top level a key-value pair specification, where the key would be the global-name.

h2.Non-path specification of the global name
A use case is to be able to use parameter specifications for different sub-parts of a big descriptor tree, or for the entire tree, without
editing the key name.  So - the key-name at the top should *not* include the path (down the nested hierarchy of aggregates); this allows
its reuse even if the hierarchy changes.

h2.Configuration settings - arrays
UIMA supports array-valued settings for configuration parameters.  In key-value pair formats, some approach is needed for these.

Multiple keys: having the same key name repeat, indicating multiple values.

Multiple keys with conventional suffix (e.g.,  foo.1, foo.2, foo.3): this could be done, but introduces more opportunities for
silly user errors (e.g., whether numbering starts at 0 or 1).  

Single key with special syntax for multiple values - e.g., blank or comma separated, with escaping char ("\"?)

h2.Other kinds of settings
Other kinds of settings for a UIMA pipeline have been attached to the JVM lifecycle by
being specified as -D JVM parameters.  Examples of these are the logging properties, 
UIMA-AS settings for controlling monitoring, UIMA-AS CAS logging, etc.

For consistency, these should have alternatives which are tied to a particular UIMA instance 
running (for example) as one of many within a container JVM (such as would be the case for 
multiple servlets, running in a web container).

h2.Computed values via concatenation
A typical use case is to have some parameters be directory paths.  In a particular use, several of these may
need to have a common root.

These could be written:

param1 : /commonRootString/commonPart2/a1
param2 : /commonRootString/commonPart2/a2
...

or some concatenation could be used:

r : /commonRootString/commonPart2

param1 : \{r\}/a1
param2 : \{r\}/a2

This is a design trade-off - to support a concatenation-style factoring-out of common parts in
the values part of the specification, or not.  

Leaving it out in favor of simplicity may make sense, given that today's editors make it very easy to
do global changes, and the human eye seems to be OK with seeing spelled-out patterns of repetition.

h2.Inherited settings
Most systems with lots of configuration settings (e.g., Hadoop, most windowing systems) end up with a 
capability to have nested hierarchies of setting specifications.  This allows putting in a set of defaults
for all the settings, in one place, and then specifying an override for just a few settings, in another (often much smaller)
file.

The Java Properties class supports this by supporting a chain of key-value maps, each one referring to 
another map to use if the key is not found in the map.  We could use this to support this capability.
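
As a minimal sketch of that chaining (the class name and the keys are made up for illustration; only java.util.Properties itself is real API), a small override set can be layered over a set of defaults like this:

```java
import java.util.Properties;

// Hypothetical helper, not part of UIMA: layers per-run overrides over defaults.
final class LayeredSettings {

    /** Returns an empty Properties object that falls back to {@code defaults} for missing keys. */
    static Properties layer(Properties defaults) {
        return new Properties(defaults);
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("modelDir", "/opt/models");   // illustrative keys only
        defaults.setProperty("debug", "false");

        Properties run = layer(defaults);
        run.setProperty("debug", "true");                  // override just one setting

        System.out.println(run.getProperty("modelDir"));   // found in the defaults
        System.out.println(run.getProperty("debug"));      // found in the override layer
    }
}
```

Note that only getProperty() follows the chain; the inherited Hashtable method get() looks at the top map alone, so code using this approach would need to stick to getProperty() and stringPropertyNames().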

h2.Tooling
{strike:class=mystrike}Tooling should support taking a UIMA pipe line spec and "resolving" what all the parameters and settings would
be once all the overrides etc. are done. This should print out a specification, together with information where useful
on where various settings came from (e.g. via what overrides).{strike}

Parts of the framework should log (under the [CONFIG] level) the actual parameter settings, with where they came from.

h2.Use of global settings for additional things
To bring other global settings currently specified as -D on the JVM command line into this framework, 
without a lot of extra mechanism or learning curves, some keys are reserved.  These correspond to the 
names currently used in the -D parameters.

This allows the -D to still be used, but also allows these values to be specified in a top level descriptor.

h2.Form of the key-value pairs - XML or simple
Java properties files can be represented in files as XML or using a simpler plain syntax.
The Java Properties class has built-in methods for reading/writing both styles.
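
For illustration, a small sketch of both styles using those built-in methods (the wrapper class and method names are hypothetical; load, storeToXML and loadFromXML are the real java.util.Properties methods):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of UIMA: reads and writes settings in both styles.
final class SettingsFormats {

    /** Parses the plain style, e.g. "key = value", "key: value" or "key value". */
    static Properties fromPlain(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    /** Serializes the settings in the XML style (UTF-8, no comment). */
    static String toXml(Properties p) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            p.storeToXML(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Parses the XML style as written by toXml. */
    static Properties fromXml(String xml) {
        Properties p = new Properties();
        try {
            p.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

Both styles end up in the same in-memory key-value map, so the choice affects only what the user edits, not how the framework consumes the settings.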

The XML style could allow for additional functionality - for instance, in Hadoop, the
specifications can include a "final" attribute, which if specified in a defaulting file,
prevents subsequent override files from overriding that particular value.

This may be over-design for what UIMA needs here.

h2.Design Specification
h3.Goals 
* incremental change

h3.Configuration Parameter
Change the configuration parameter declaration to optionally have a globalName:

{code}
<configurationParameter>
  <name>[String]</name>
  <globalName>[String]</globalName>     <!-- <<<<< New -->
  <description>[String]</description>
  <type>String|Integer|Float|Boolean</type>
  <multiValued>true|false</multiValued>
  <mandatory>true|false</mandatory>
  <overrides>
    <parameter>[String]</parameter>
    <parameter>[String]</parameter>
    ...
  </overrides>
</configurationParameter>
{code}

If globalName is present, the parameter can be overridden from the top level by a key-value pair whose key is the global name.
The name must be a valid key for a Java properties file.

The assumption is that the publisher of an annotator would not include a globalName specification, but that the 
assembler, who is putting together multiple annotators, would insert these wherever needed, choosing names that are 
unique enough to expose the parameters at the top level.

h3.No factoring/concatenation support

h3.Syntax of key-value support
We follow the normal syntax of Java properties files.  From that specification we inherit the ISO-8859-1 
character encoding, with other characters expressible through Unicode escapes.

For array value specifications, we use a single key, and specify the array values as a blank or comma-separated list.  The escape character is "\\".
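
A minimal sketch of how such a value might be split once the properties file has been loaded. The class is hypothetical, and it assumes two things this page does not settle: a run of blanks and/or commas counts as a single separator, and a backslash makes the following character literal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UIMA: splits an array-valued setting.
final class ArraySettingParser {

    /** Splits on unescaped blanks, tabs and commas; a backslash escapes the next character. */
    static List<String> parse(String value) {
        List<String> items = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;   // previous character was an unescaped backslash
        boolean inItem = false;    // an item has been started and not yet emitted
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (escaped) {
                current.append(c);             // take the character literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
                inItem = true;
            } else if (c == ',' || c == ' ' || c == '\t') {
                if (inItem) {                  // separator closes the current item
                    items.add(current.toString());
                    current.setLength(0);
                    inItem = false;
                }
            } else {
                current.append(c);
                inItem = true;
            }
        }
        if (escaped) {
            current.append('\\');              // keep a trailing lone backslash
        }
        if (inItem) {
            items.add(current.toString());
        }
        return items;
    }
}
```

One caveat: the properties file syntax itself treats the backslash as an escape character, so in a plain properties file the backslash has to be doubled for one to reach this parser.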

h3.Attaching key-value pair information to top level UIMA descriptors
Multiple methods are supported.

h4.Within the top level descriptor
The top level descriptor already contains the following XML:

{code}
<operationalProperties>
  <modifiesCas> true|false </modifiesCas>
  <multipleDeploymentAllowed> true|false </multipleDeploymentAllowed>
  <outputsNewCASes> true|false </outputsNewCASes>
</operationalProperties>
{code}

We add the following optional element:

{code}
<globalSettings>
  <import  (by name or by value, like all other imports) />   and/or
  <settings>             <!-- inline -->
    name value
    name value   etc.
  </settings>
</globalSettings>
{code}

The import identifies a file to use.  Multiple imports indicate multiple files.
They are applied in order: the first one supplies the defaults, and later ones override earlier ones.

The globalSettings element is ignored if it is not at the top level.

h4.From the command line
Two things can be specified on the command line:
* A comma- or blank-separated list of paths, either in the file system or in the classpath, to properties files, where later paths in the list override the earlier ones.
* One or more -D specifications, each identifying a key-value pair, using normal Java command line syntax for -D parameters.
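
A rough sketch of how these two sources might be combined. The class and method names are hypothetical, and two points are assumptions rather than decisions recorded on this page: how the path list reaches the framework (here it is simply passed in as a string), and that a -D value wins over a value from the files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of UIMA: merges settings files and -D values.
final class CommandLineSettings {

    /** Loads each listed properties file in order; later files overwrite earlier keys. */
    static Properties loadAll(String pathList) {
        Properties merged = new Properties();
        for (String entry : pathList.trim().split("[,\\s]+")) {
            if (entry.isEmpty()) {
                continue;                       // empty list or leading separator
            }
            try (InputStream in = open(entry)) {
                merged.load(in);                // same keys are simply replaced
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }

    /** Assumption: a -D system property, if set, takes precedence over the files. */
    static String resolve(Properties fileSettings, String key) {
        return System.getProperty(key, fileSettings.getProperty(key));
    }

    /** Tries the file system first, then the classpath. */
    private static InputStream open(String entry) throws IOException {
        Path file = Paths.get(entry);
        if (Files.isRegularFile(file)) {
            return Files.newInputStream(file);
        }
        InputStream in = CommandLineSettings.class.getClassLoader().getResourceAsStream(entry);
        if (in == null) {
            throw new IOException("Settings file not found: " + entry);
        }
        return in;
    }
}
```

Because blanks are separators, paths that themselves contain blanks cannot be expressed in such a list without some additional escaping rule.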