Background

This is a summary of ideas that have been discussed on the mailing lists regarding configuring UIMA pipelines, and subsequent discussions.

What running scenarios are contemplated?

The normal run scenario is one where you start up a pipeline, it initializes, processes CASes, and eventually finishes.

Alternatives: During the run, it is reconfigured. We won't consider this for now.

What is configuration?

Configuration is a collection of things, set for a particular UIMA pipeline run. It can include conventional UIMA parameter settings as well as other kinds of settings, such as "placeholder" values that are substituted into UIMA deployment descriptors, debug flags, dump-the-CAS flags, and the logging specification for the run.

Configuring simple and complex values

Configuration parameters in UIMA have types like integer, float, string, boolean, or arrays of these. UIMA provides for arbitrarily complex Java Objects as configuration values, using the External Resource specification. The UIMA External Resources design also allows "sharing" of these complex objects among multiple annotators. Similarly, the normal configuration parameters allow "sharing" - that is, the UIMA parameter override design lets one parameter setting be connected to multiple parameters, down the nested hierarchy of annotators.

The same use cases that motivate the setting of UIMA configuration parameters also motivate a (hopefully) similar mechanism for overriding external resource specifications.

Orthogonal issues and considerations in configuring UIMA pipelines

Where to put this information

There is a continuum of places, ranging from least dynamic (most training required) to most dynamic (least training required):

  • code (least dynamic, need to understand the code and where to go to modify what you want, requires most training)
  • UIMA descriptors (still quite complex; have some special GUI tools for editing them)
  • Structured properties files (like Jar Manifests - multiple sections, each section having key-value pairs of strings)
  • "properties" files (simple key-value pair strings)
  • JVM command line "defines" parameters - individual key-value pairs

Experience shows that for doing "runs", people find code and descriptors complex and hard to comprehend or change, and prefer simpler, more focused ways to specify things for the run. More than one of these approaches can be provided; if several are supported, a conventional (no-surprise) rule is needed for which one overrides which.

JMX

In addition to the above, JMX settings may be desired.

Using the JVM command line as a source of configuration information

Putting configuration into the command line ties a "run" of a pipeline to a "run" of a JVM. There are cases (e.g., running UIMA inside servlets inside a web application container) where multiple, independent instances of UIMA pipelines may independently start, run, and terminate, all without taking the JVM up and down. This argues for an approach not tied to the JVM command line.

On the other hand, the command line approach is very handy for quick augmentation / overriding of particular runs, where starting/stopping the JVM is an option.

Note that JMX settings could be arranged to be either global or per "UIMA Context".

Encapsulation versus reaching down inside trees of nested aggregates

The original UIMA design attempts to support encapsulation in an aggregate, for parameter overrides. An aggregate may override parameters that its delegates declare. An aggregate can choose which of these it, in turn, is willing to allow a containing aggregate to be able to override; it can choose to "shield" some parameters, making them incapable of being overridden.

This has complicated the practice of designing large, complex nested trees of annotators, because every intervening aggregate must expose upward the parameters that the top level may want to override.

An alternative mechanism is wanted in these use cases, to allow "reaching down" from a top level into lower levels, without needing all the intervening levels of aggregation to expose individual parameters. However, some degree of control over this is also desired.

A suggested approach is to augment the configuration parameter definition with an additional property - a "global-name" which, if specified, would enable this reaching down, by having at the top level a key-value pair specification, where the key would be the global-name.

Non-path specification of the global name

A use case is to be able to use parameter specifications for different sub-parts of a big descriptor tree, or for the entire tree, without editing the key name. So - the key-name at the top should not include the path (down the nested hierarchy of aggregates); this allows its reuse even if the hierarchy changes.

Configuration settings - arrays

UIMA supports array-valued settings for configuration parameters. In key-value pair formats, some approach is needed for these.

  • Multiple keys: having the same key name repeat, indicating multiple values.
  • Multiple keys with conventional suffix (e.g., foo.1, foo.2, foo.3): this could be done, but introduces more opportunities for silly user errors (e.g., origin 0 or 1, etc).
  • Single key with special syntax for multiple values - e.g., blank or comma separated, with escaping char ("\"?)
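
As an illustration of the third option, here is a minimal Java sketch of parsing a single value into array elements. It assumes blank and comma are both separators and that a backslash makes the next character literal. The class and method names are hypothetical and not part of UIMA.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of UIMA. */
class ArrayValueParser {

    /**
     * Splits one property value into array elements.
     * Blanks and commas separate elements; a backslash makes the
     * following character literal, so "\," and "\ " stay inside an element.
     * Runs of separators collapse, so empty elements cannot be expressed.
     */
    static List<String> parse(String value) {
        List<String> items = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (escaped) {
                current.append(c);          // take the escaped character as-is
                escaped = false;
            } else if (c == '\\') {
                escaped = true;             // next character is literal
            } else if (c == ',' || c == ' ') {
                if (current.length() > 0) { // separator ends the current element
                    items.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            items.add(current.toString());
        }
        return items;
    }
}
```

One design point to keep in mind: if the value comes from a Java properties file, the Properties loader has already consumed its own backslash escapes before this parser sees the string, so the escape would have to be written doubled in the file.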

Other kinds of settings

Other kinds of settings for a UIMA pipeline have been attached to the JVM lifecycle by being specified as -D JVM parameters. Examples of these are the logging properties, UIMA-AS settings for controlling monitoring, UIMA-AS CAS logging, etc.

For consistency, these should have alternatives which are tied to a particular UIMA instance running (for example) as one of many within a container JVM (such as would be the case for multiple servlets running in a web container).

Computed values via concatenation

A typical use case is to have some parameters be directory paths. In a particular use, several of these may need to have a common root.

These could be written:

param1 : /commonRootString/commonPart2/a1
param2 : /commonRootString/commonPart2/a2
...

or some concatenation could be used:

r : /commonRootString/commonPart2

param1 : ${r}/a1
param2 : ${r}/a2

This is a design trade-off - to support a concatenation-style factoring-out of common parts in the values part of the specification, or not.
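
To make the trade-off concrete, here is a minimal Java sketch of what concatenation support might involve: expanding ${key} references against the other entries of a Properties object. The class name, the depth limit, and the error handling are assumptions made for illustration only.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of UIMA. */
class PlaceholderExpander {
    private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)\\}");
    private static final int MAX_DEPTH = 10; // arbitrary guard against cyclic references

    /** Returns the value of key with every ${other} reference expanded, or null if key is unset. */
    static String resolve(Properties props, String key) {
        String raw = props.getProperty(key);
        return raw == null ? null : expand(raw, props, 0);
    }

    private static String expand(String value, Properties props, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException("references nested too deeply (cycle?)");
        }
        Matcher m = REF.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String referenced = props.getProperty(m.group(1));
            if (referenced == null) {
                throw new IllegalArgumentException("undefined reference: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the expanded text literal
            m.appendReplacement(out,
                Matcher.quoteReplacement(expand(referenced, props, depth + 1)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Even this small version has to decide what to do about undefined and cyclic references, which is part of the complexity being weighed here.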

Leaving it out in favor of simplicity may make sense, given that today's editors make it very easy to do global changes, and the human eye seems to be OK with seeing spelled-out patterns of repetition.

But if "correct" operation requires that some parts of the configuration specification have exactly the same value, then supporting this kind of thing allows expressing that constraint, and could reduce configuration errors.

Reusable multiple sets of settings

Users want to have settings for some subset of big pipelines available as separate files, so that these can be reused in other contexts, for instance when the subset is run separately or inserted into another pipeline.

Inherited settings

Most systems with lots of configuration settings (e.g., Hadoop, most windowing systems) end up with a capability to have nested hierarchies of setting specifications. This allows putting in a set of defaults for all the settings, in one place, and then specifying an override for just a few settings, in another (often much smaller) file.

The Java Properties class supports this through a chain of key-value maps, each of which refers to another map to consult when a key is not found in it. We could use this to provide the capability.
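
A minimal sketch of that chaining, using the standard Properties(Properties defaults) constructor. The key names and values are made up for illustration.

```java
import java.util.Properties;

/** Illustration only; key names and values are invented. */
class InheritedSettingsDemo {

    /** Builds a small override map chained to a larger defaults map. */
    static Properties withOverrides() {
        Properties defaults = new Properties();
        defaults.setProperty("tokenizer.language", "en");
        defaults.setProperty("output.dir", "/tmp/out");

        // Lookups that miss in 'overrides' fall through to 'defaults'.
        Properties overrides = new Properties(defaults);
        overrides.setProperty("output.dir", "/data/run42");
        return overrides;
    }

    public static void main(String[] args) {
        Properties settings = withOverrides();
        for (String key : settings.stringPropertyNames()) {
            System.out.println(key + " = " + settings.getProperty(key));
        }
    }
}
```

Note that only getProperty and stringPropertyNames consult the defaults chain; the get method inherited from Hashtable looks at the top map only, so code reading the settings would need to use getProperty consistently.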

Tooling

Tooling should support taking a UIMA pipeline specification and "resolving" what all the parameters and settings would be once all the overrides etc. are applied. It should print out a specification, together with information, where useful, on where the various settings came from (e.g., via which overrides).

Parts of the framework should log (under the [CONFIG] level) the actual parameter settings, with where they came from.
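
One way to keep track of where each setting came from is sketched below. It uses java.util.logging as a stand-in for the framework's own logger, and the class and source names are hypothetical.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical bookkeeping class for illustration; not part of UIMA. */
class SettingsProvenance {

    /** A resolved value plus the name of the source that supplied it. */
    private static final class Entry {
        final String value;
        final String source;
        Entry(String value, String source) {
            this.value = value;
            this.source = source;
        }
    }

    private final Map<String, Entry> resolved = new TreeMap<>();

    /** Applies one source; sources applied later replace earlier values. */
    void apply(String sourceName, Properties props) {
        for (String key : props.stringPropertyNames()) {
            resolved.put(key, new Entry(props.getProperty(key), sourceName));
        }
    }

    /** One line of the resolved specification for the given key. */
    String describe(String key) {
        Entry e = resolved.get(key);
        return e == null
            ? key + " is not set"
            : key + " = " + e.value + " (from " + e.source + ")";
    }

    /** Logs every resolved setting at CONFIG level, in key order. */
    void logAll(Logger logger) {
        for (String key : resolved.keySet()) {
            logger.log(Level.CONFIG, describe(key));
        }
    }
}
```

With java.util.logging, the default console handler only shows INFO and above, so the handler level would need to be lowered for these CONFIG messages to appear.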

Use of global settings for additional things

To bring other global settings currently specified as -D on the JVM command line into this framework, without a lot of extra mechanism or learning curves, some keys are reserved. These correspond to the names currently used in the -D parameters.

This allows the -D to still be used, but also allows these values to be specified in a top level descriptor.

Form of the key-value pairs - XML or simple

Java properties files can be represented in files as XML or using a simpler plain syntax. The Java Properties class has built-in methods for reading/writing both styles.
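
For reference, a short sketch of the built-in methods for both styles: store and load for the plain syntax, storeToXML and loadFromXML for the XML form. The wrapper class is hypothetical; the Properties calls are the standard JDK ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Thin wrappers around the JDK methods; the class itself is for illustration only. */
class PropertiesFormats {

    /** Serializes in the plain key=value syntax. */
    static byte[] toPlain(Properties p) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            p.store(out, null);          // null: no leading comment line
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Serializes in the XML properties format. */
    static byte[] toXml(Properties p) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            p.storeToXML(out, null);     // null: no comment element
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the plain key=value syntax. */
    static Properties fromPlain(byte[] data) {
        Properties p = new Properties();
        try (ByteArrayInputStream in = new ByteArrayInputStream(data)) {
            p.load(in);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the XML properties format. */
    static Properties fromXml(byte[] data) {
        Properties p = new Properties();
        try (ByteArrayInputStream in = new ByteArrayInputStream(data)) {
            p.loadFromXML(in);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both forms carry the same flat key-value content; anything extra, such as a per-entry attribute, would require a custom XML format rather than the one the Properties class reads.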

The XML style could allow for additional functionality - for instance, in Hadoop, the specifications can include a "final" attribute, which if specified in a defaulting file, prevents subsequent override files from overriding that particular value.

This may be over-design for what UIMA needs here.

Parameter Groups

There is a rather complex mechanism in UIMA supporting parameter groups, with additional defaulting rules, triggered by language specifications.

Assuming topLevelName(s) can be supplied for individual specifications, it may be that nothing special needs to be done to support parameter groups.

Design Specification

Goals

  • incremental change

Configuration Parameter

Change the configuration parameter declaration to optionally have a topLevelName:

<configurationParameter>
    <name>[String]</name>
    <topLevelName>[String]</topLevelName>     <!-- <<<<< New -->
    <description>[String]</description>
    <type>String|Integer|Float|Boolean</type>
    <multiValued>true|false</multiValued>
    <mandatory>true|false</mandatory>
    <overrides>
        <parameter>[String]</parameter>
        <parameter>[String]</parameter>
        ...
    </overrides>
</configurationParameter>

If topLevelName is present, the parameter can be overridden from the top level, using the topLevelName (the "global name" discussed above) as the key. The name must be suitable as a key in a Java Properties file.

The assumption would be that the publisher of the annotator would not include topLevelName specification, but that the assembler, who is putting together multiple annotators, would insert these wherever they needed, with whatever uniqueness in the name, to satisfy the need to expose parameters at the top level.
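
The intended lookup can be sketched in Java as follows. This is a simplified model, not the UIMA descriptor classes: Component and Param are hypothetical stand-ins for aggregates/annotators and their declared parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Simplified, hypothetical model of a nested descriptor tree; not the UIMA API. */
class TopLevelNameSketch {

    static final class Param {
        final String name;
        final String topLevelName;   // null when the parameter is not exposed to the top
        String value;
        Param(String name, String topLevelName, String value) {
            this.name = name;
            this.topLevelName = topLevelName;
            this.value = value;
        }
    }

    static final class Component {
        final String name;
        final List<Param> params = new ArrayList<>();
        final List<Component> delegates = new ArrayList<>();
        Component(String name) {
            this.name = name;
        }
    }

    /**
     * Walks the whole tree and overrides every parameter whose topLevelName
     * appears as a key in the top-level settings. The key carries no path,
     * so the same settings work wherever the component sits in the hierarchy.
     */
    static void applyTopLevelSettings(Component component, Properties settings) {
        for (Param p : component.params) {
            if (p.topLevelName != null) {
                String override = settings.getProperty(p.topLevelName);
                if (override != null) {
                    p.value = override;
                }
            }
        }
        for (Component delegate : component.delegates) {
            applyTopLevelSettings(delegate, settings);
        }
    }
}
```

Parameters without a topLevelName are never touched by the top-level settings, even if a key happens to match their local name, which keeps control over what is exposed with whoever assigns the names.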

No factoring/concatenation support

Syntax of key-value support

We follow the normal syntax of Java properties files. From that specification we inherit the ISO-8859-1 encoding, with other characters expressed through Unicode escapes.

For array value specifications, we use a single key, and specify the array values as a blank or comma-separated list. The escape character is "\".

Attaching key-value pair information to top level UIMA descriptors

Multiple methods are supported.

Within the top level descriptor

The top level descriptor already has the XML:

<operationalProperties>
  <modifiesCas> true|false </modifiesCas>
  <multipleDeploymentAllowed> true|false </multipleDeploymentAllowed>
  <outputsNewCASes> true|false </outputsNewCASes>
</operationalProperties>

We add the following optional element

  <topLevelSettings>
      <import  (by name or by value, like all other imports) />   and/or
      <settings>             <!-- inline -->
            name value
            name value   etc.
      </settings>
   </topLevelSettings>

The import identifies a file to use. Multiple imports indicate multiple files, applied in order: the first one supplies the defaults, and later ones override earlier ones.
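
The ordering rule can be sketched in Java by loading each source, in order, into the same Properties object, since load replaces the value of any key that is already present. In this sketch, strings stand in for the contents of the imported files, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of UIMA. */
class OrderedSettingsLoader {

    /**
     * Loads each source in list order into one Properties object.
     * A key defined again by a later source replaces the earlier value.
     */
    static Properties loadInOrder(List<String> sources) {
        Properties merged = new Properties();
        for (String text : sources) {
            try (StringReader reader = new StringReader(text)) {
                merged.load(reader);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }
}
```

A flat merge like this loses the information about which file supplied each value; chaining one Properties object per file through the defaults constructor would preserve it, at the cost of a slightly more involved lookup.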

The topLevelSettings element is ignored if it is not at the top level.

From the command line

Two things can be specified on the command line:

  • A comma or blank separated list of paths, either in the file system or in the classpath, to properties files, where later paths in the list override the earlier ones.
  • One or more -D specifications, identifying a key-value pair, using normal Java command line syntax for -D parameters.