Solr File Manager Developer's Guide

Introduction

The Apache OODT File Manager and Apache Solr are in many respects complementary technologies that can be combined into an attractive solution for managing scientific data and metadata. The File Manager is a service for archiving and retrieving data products (files and directories), typically used as part of a data distribution service or data processing pipeline. Solr is a scalable, high-performance search engine that provides a web-based API on top of the underlying Lucene index. By deploying an OODT File Manager backed by a Solr metadata store, a scientific project can leverage the product creation and ingestion functionality of the OODT framework, together with the fast and powerful query capabilities offered by Solr.

Architecture

When the File Manager is deployed with a Solr-based metadata catalog, the overall system architecture is composed of the following tiers:

File Manager clients send product archiving requests to the File Manager (typically running as a local service on default port 9000).
This communication is carried over HTTP through XML-RPC encoded request/response pairs.
As part of the archiving process, the File Manager transforms the available product metadata
(either generated on the client side and sent as part of the archiving request, or generated by metadata extractors on the server side)
into Solr documents, which are sent to the Solr server for indexing. The Solr server must be deployed as an external service, running
within Tomcat (default ports: 80 or 8080) or Jetty (default port: 8983), and configured with a metadata schema that defines
the project-specific fields to be indexed. Product metadata can be queried either through the standard File Manager clients or, more efficiently, by clients that interact directly with the Solr web services (Solr is optimized for querying).

In summary, the OODT File Manager provides the data archiving functionality in this architecture, while Solr is responsible for
providing metadata querying services.
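
As an illustration of a client that talks to Solr directly, the following minimal sketch builds a query URL for Solr's standard select handler. The base URL (a local Jetty deployment on port 8983) and the product name are assumptions made for the example; the field name "CAS.ProductName" is one of the fields produced by the default serializer described later in this guide. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_select_url(solr_base_url, field, value, rows=10):
    """Return a Solr /select URL that matches `field` against an exact phrase."""
    # Escape backslashes and double quotes, then wrap the value in quotes so
    # that characters such as ':' or spaces are not parsed as query syntax.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "wt": "json", "rows": rows}
    return f"{solr_base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical deployment URL and product name, for illustration only.
url = build_select_url("http://localhost:8983/solr", "CAS.ProductName", "test.txt")
print(url)
```

The resulting URL can be fetched with any HTTP client; requesting wt=json asks Solr to return the matching documents as JSON.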

Release Notes and Limitations

The OODT File Manager is only compatible with Solr 4.x or above; no support is provided for interacting with pre-existing Solr installations of 3.x or below. The technical reason is that the Solr implementation of the OODT Catalog interface relies heavily on the "atomic update" functionality introduced in Solr 4, i.e. the ability to update individual parts of a Solr document without re-indexing the whole document. The rationale for not supporting older versions is that the Solr catalog is new functionality in the OODT framework, so there are no legacy deployments to support: a project that wishes to leverage this architecture might as well start with an up-to-date version of Solr instead of installing an older one.

Furthermore, in order to leverage the out-of-the-box "atomic update" functionality, ALL fields to be indexed must be declared in the Solr schema.xml with stored="true". This requirement comes at the price of a larger index, but allows documents to be updated in place, as opposed to being fully replaced.
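
To make the idea of an atomic update concrete, the sketch below builds the JSON body that Solr 4 accepts for a partial document update, where each changed field carries a modifier ("set", "add" or "inc") instead of a plain value. The document identifier and the "Status" field are hypothetical, and this is not necessarily the exact request the File Manager itself issues; it only shows the shape of such a request.

```python
import json


def atomic_update_payload(doc_id, changes):
    """Build the JSON body of a Solr atomic update for a single document.

    `changes` maps a field name to a (modifier, value) pair, where the
    modifier is one of the Solr 4 atomic update operations.
    """
    doc = {"id": doc_id}
    for field, (modifier, value) in changes.items():
        if modifier not in ("set", "add", "inc"):
            raise ValueError(f"unsupported modifier: {modifier}")
        doc[field] = {modifier: value}
    # The JSON update handler accepts an array of documents.
    return json.dumps([doc])


# Hypothetical identifier and field name, for illustration only.
payload = atomic_update_payload("product-123", {"Status": ("set", "archived")})
print(payload)
```

Only the "Status" field is sent; Solr rebuilds the rest of the document from its stored fields, which is why every field must be declared with stored="true".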

Finally, multiple Solr cores are currently NOT supported: all documents sent by the File Manager are indexed in, and retrieved from, the default Solr core.

Extension Points

Currently, a File Manager deployment based on a Solr back-end can be customized in two ways,
controlled by properties set in the File Manager filemgr.properties file. Both extension points have reasonable
default implementations, which should be adequate in many cases, or at least to start experimenting with this architecture.

Generation of product unique identifiers

When a product is first ingested, a unique identifier must be created to reference that product in all subsequent requests.
The particular algorithm that is used to generate the identifier has great consequences for the underlying semantics of the
metadata catalog.

By default, the Solr File Manager will assign a newly generated UUID as the product unique identifier.
As a consequence, if the FM client sends the same physical product twice, two distinct records will be created in the Solr index.

Another possibility is to configure the Solr File Manager to use the product name as the unique identifier. In this case, if the
same product is ingested a second time, the new Solr document will completely overwrite the previous one, resulting in only
one product record in the catalog.

Alternatively, a project can provide a custom algorithm to generate product identifiers (for example, based on the system time
when the product is ingested) by implementing the ProductIdGenerator interface.
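
The difference between the two built-in strategies can be illustrated with a small simulation. This is a language-neutral sketch written in Python, not the OODT API: the real extension point is the Java ProductIdGenerator interface, whose signature is not shown in this guide, and the in-memory dictionary standing in for the Solr index, the function names and the product name are all assumptions.

```python
import uuid


def uuid_id(product_name):
    """UUID strategy: a fresh identifier on every ingestion."""
    return str(uuid.uuid4())


def name_id(product_name):
    """Name strategy: the product name is the identifier."""
    return product_name


def ingest(index, product_name, id_generator):
    """Store a record keyed by its identifier, overwriting any existing one."""
    doc_id = id_generator(product_name)
    index[doc_id] = {"id": doc_id, "CAS.ProductName": product_name}
    return doc_id


# Ingest the same (hypothetical) product twice under each strategy.
uuid_index, name_index = {}, {}
for _ in range(2):
    ingest(uuid_index, "granule_001.nc", uuid_id)
    ingest(name_index, "granule_001.nc", name_id)

print(len(uuid_index), len(name_index))
```

With UUIDs the index ends up holding two records for the same physical product, while with name-based identifiers the second ingestion replaces the first.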

Serialization/Deserialization of Solr documents from product metadata

When a physical product is sent for archiving to the File Manager, the associated metadata must be transformed into queryable information that is stored in the back-end Solr catalog. By default, the Solr File Manager will transform each product into one corresponding Solr document, thus generating a single searchable record in the Solr index. Each product attribute is transformed into a corresponding Solr field with the same name and value(s) (note that all fields must be defined in the project-specific schema deployed with the Solr installation).

Alternatively, a project may provide its own algorithm for generating Solr records from a CAS product by implementing the ProductSerializer interface. For example, a project that manages products composed of full directories may wish to create a "collection"-level Solr record for the enclosing directory, and separate "file"-level Solr records for each file in the directory. In principle, these different record types could be stored in the same Solr core or sent to separate Solr cores, although multiple cores are not currently supported (see Release Notes and Limitations).

Conversely, when a client queries the File Manager for product information, result documents are retrieved from Solr and must be transformed into product objects that are presented back to the client. By default, the Solr File Manager will generate a single product for each Solr result document, based on the inverse of the rules that were used to generate the Solr document in the first place.

Alternatively, a project may provide its own product generation algorithm by implementing the ProductDeserializer interface, in a way that is consistent with its own implementation of the ProductSerializer interface.

Configuration

The deployment of a Solr File Manager is controlled by two main files: the File Manager filemgr.properties, and the Solr schema.xml.

filemgr.properties

To use a Solr-based metadata catalog, the File Manager's filemgr.properties file must be edited as follows.

At a minimum, the following two properties must be defined:

org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
- Mandatory: instructs OODT to instantiate a Solr Catalog implementation at startup
org.apache.oodt.cas.filemgr.catalog.solr.url=http://<hostname>:<port>/solr
- Mandatory: points the File Manager to the base URL of the Solr server

Additionally, the following properties control how products are ingested into and extracted from the Solr server, i.e. the implementations used for the extension points described above. These properties have default values, and need to be set only when the default is not the desired behavior.

org.apache.oodt.cas.filemgr.catalog.solr.productIdGenerator=org.apache.oodt.cas.filemgr.catalog.solr.UUIDProductIdGenerator
- Optional: controls the algorithm for generating the product unique identifier when it is first stored in the catalog.
- Default: UUIDProductIdGenerator: this class generates a new UUID every time a product is indexed.
- Alternative out-of-the-box implementation: NameProductIdGenerator: this class assigns the product an identifier equal to the product name.
- Alternatively: provide any custom implementation of the ProductIdGenerator interface.
org.apache.oodt.cas.filemgr.catalog.solr.productSerializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductSerializer
- Optional: controls the format of the documents ingested into Solr, i.e. how a CAS product object is transformed into one (or more) Solr records; and vice-versa how CAS products are queried back from the Solr index
- Default: DefaultProductSerializer: creates one Solr record for each incoming CAS product:
  - the product core attributes (id, name, type) are converted to Solr fields starting with "CAS." ("CAS.ProductId", "CAS.ProductName", ...)
  - the product identifier is also used as the Solr record identifier (i.e. "id" and "CAS.ProductId" have the same value)
  - the product references are converted into Solr fields starting with "CAS.Reference..." or "CAS.RootReference..."
  - the product metadata attributes are converted into Solr fields with the same name and number of values
- Alternative: any custom implementation of the ProductSerializer interface can be used.
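
The default mapping described above can be sketched as a pair of functions. This is a simplified illustration in Python rather than the actual Java implementation: product references are omitted, products and metadata are represented as plain dictionaries, and the field name "CAS.ProductTypeName" is an assumption (the guide only names "CAS.ProductId" and "CAS.ProductName" explicitly).

```python
def serialize(product, metadata):
    """Turn one product and its metadata into one Solr-style record."""
    doc = {
        "id": product["id"],             # Solr record identifier
        "CAS.ProductId": product["id"],  # same value as "id"
        "CAS.ProductName": product["name"],
        "CAS.ProductTypeName": product["type"],  # assumed field name
    }
    # Metadata attributes keep their name and all of their values.
    for name, values in metadata.items():
        doc[name] = list(values)
    return doc


def deserialize(doc):
    """Apply the inverse rules to rebuild the product and its metadata."""
    product = {
        "id": doc["CAS.ProductId"],
        "name": doc["CAS.ProductName"],
        "type": doc["CAS.ProductTypeName"],
    }
    metadata = {
        name: list(values)
        for name, values in doc.items()
        if name != "id" and not name.startswith("CAS.")
    }
    return product, metadata


# Hypothetical product and metadata, for illustration only.
product = {"id": "abc-123", "name": "granule_001.nc", "type": "GenericFile"}
metadata = {"Instrument": ["AIRS"], "Keyword": ["ocean", "temperature"]}
doc = serialize(product, metadata)
print(doc)
```

A custom serializer would replace both directions together, so that whatever records it writes to Solr can be read back into products consistently.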

Note that each specific implementation of ProductSerializer must declare the format of the Solr documents it understands (XML, JSON, etc.), so each implementation is free to generate and parse Solr documents in the document format of choice.

Following is a full example of a Solr File Manager configuration (with the default serializer, and with NameProductIdGenerator selected in place of the default UUIDProductIdGenerator):

org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
org.apache.oodt.cas.filemgr.catalog.solr.url=http://localhost:8983/solr
org.apache.oodt.cas.filemgr.catalog.solr.productIdGenerator=org.apache.oodt.cas.filemgr.catalog.solr.NameProductIdGenerator
org.apache.oodt.cas.filemgr.catalog.solr.productSerializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductSerializer
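
Before starting the File Manager, it can be useful to check that the two mandatory properties are present. The sketch below is a hypothetical helper, not part of OODT: it uses a deliberately simplified parser that handles only "key=value" lines and comments, not the full Java properties syntax (no line continuations or ':' separators).

```python
REQUIRED = (
    "org.apache.oodt.cas.filemgr.catalog.factory",
    "org.apache.oodt.cas.filemgr.catalog.solr.url",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """Return the mandatory property names that are absent or empty."""
    return [key for key in REQUIRED if not props.get(key)]


example = """
# Solr catalog configuration
org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
org.apache.oodt.cas.filemgr.catalog.solr.url=http://localhost:8983/solr
"""
props = parse_properties(example)
print(missing_required(props))
```

The optional productIdGenerator and productSerializer properties are not checked, since their defaults apply when they are left out.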

schema.xml

The file schema.xml, part of each specific Solr deployment, defines which metadata fields are stored in the Solr index, and can consequently be queried and retrieved by clients. Note that no metadata field can be ingested into Solr unless it is defined (explicitly or implicitly) in schema.xml. Additionally, a specific requirement of the File Manager - Solr integration is that each metadata field included in schema.xml must be "stored" (i.e. defined with stored="true"), so that it can be retrieved and re-inserted during partial document updates.

Each project using a Solr File Manager is responsible for creating and deploying a schema.xml file that is consistent with its own algorithms for generating Solr documents from product metadata, and vice versa (as defined by the specific implementation of the ProductSerializer interface). An example schema.xml is provided as part of the File Manager distribution in the _resources/ sub-directory. This schema is compatible with the class DefaultProductSerializer (the default implementation of ProductSerializer), and contains the following definitions:

The field "id" is used as the unique Solr document identifier (this is the Solr default).
Solr fields corresponding to the core CAS product attributes (id, name, type, references etc.) are explicitly defined, either as single or multi-valued fields.
All other incoming fields whose names end in 'Date' or 'Time' (patterns '*Date' and '*Time') are stored as single-valued Solr fields of type "date" with the same name.
All remaining incoming metadata fields are stored as multi-valued Solr fields of type string with the same name.
All field values are copied into a catch-all field called 'text' to allow for free-text searching over the full product metadata.
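
The field rules above can be summarized as a small decision function. This is only a sketch of the rules as described in this guide, not of how Solr itself resolves field definitions; the function name and the returned tuple format are assumptions, and the core CAS fields are simply reported as explicitly defined because their individual types are not listed here.

```python
from fnmatch import fnmatchcase


def classify_field(name):
    """Return (solr_type, multi_valued) for an incoming field name.

    Core fields ("id" and anything starting with "CAS.") are reported as
    ("explicit", None) because the schema defines each one individually.
    """
    if name == "id" or name.startswith("CAS."):
        return ("explicit", None)
    if fnmatchcase(name, "*Date") or fnmatchcase(name, "*Time"):
        return ("date", False)   # single-valued date field
    return ("string", True)      # multi-valued string field


# Hypothetical metadata field names, for illustration only.
for field in ("CAS.ProductName", "StartDate", "AcquisitionTime", "Instrument"):
    print(field, classify_field(field))
```

Note that the match is case-sensitive, so a field named "startdate" would fall through to the multi-valued string rule under this sketch.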

Each project is free to use the default schema.xml provided as part of the File Manager distribution, or to change it in full or in part,
as long as it remains compatible with the project's algorithm for creating and retrieving Solr records.
