Comment: Migrated to Confluence 5.3

...

The Apache OODT File Manager and Apache Solr are in many respects two complementary technologies
that can be combined to offer a very attractive solution for managing scientific data and metadata.
The File Manager is a service for archiving and retrieving data products (files and directories),
typically used as part of a data distribution service or data processing pipeline.
Solr is a scalable, high-performance search engine that provides a web-based API on top of the
underlying Lucene index. By deploying an OODT File Manager backed by a Solr metadata store,
a scientific project can leverage the product creation and ingestion functionality of the OODT framework,
together with the fast and powerful query capabilities offered by Solr.

...

When the File Manager is deployed with a Solr-based metadata catalog, the overall system architecture is composed of the following tiers:

File Manager (archive) clients -- File Manager         -- Solr                              -- Solr (query) clients
                                  (default port: 9000)    (default ports: 80, 8080 or 8983)

File Manager clients send product archiving requests to the File Manager (typically running as a local service on default port 9000).
This communication is carried over HTTP through XML-RPC encoded request/response pairs.
As part of the archiving process, the File Manager transforms the available product metadata
(either generated on the client side and sent as part of the archiving request, or generated by metadata extractors on the server side)
into Solr documents, which are sent to the Solr server for indexing. The Solr server must be deployed as an external service, running
within Tomcat (default ports: 80 or 8080) or Jetty (default port: 8983), and configured with a metadata schema that defines
the project-specific fields to be indexed. Product metadata can be queried by the standard File Manager clients, or more efficiently by clients that interact directly with the Solr web services (as Solr is optimized for querying).
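
As an illustration of the direct query path described above, the following minimal sketch builds a request URL for Solr's standard /select handler. The base URL (a Jetty deployment on port 8983, single core), the product name, and the helper function name are assumptions made for this example; only the "CAS.ProductName" field name comes from this page. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, fields=None, rows=10):
    """Build a URL for Solr's /select request handler (hypothetical helper)."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        # "fl" restricts the fields returned for each matching document.
        params["fl"] = ",".join(fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_solr_query_url(
    "http://localhost:8983/solr",               # assumed Solr location
    'CAS.ProductName:"example_product.dat"',    # hypothetical product name
    fields=["id", "CAS.ProductName"],
)
print(url)
```

A Solr query client would issue an HTTP GET on the resulting URL; in a multi-core deployment the core name would normally appear in the path before /select.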

...

The OODT File Manager is only compatible with Solr 4.X or above, i.e. no support is provided for interacting with pre-existing Solr installations of version 3.X or below. The technical motivation is that the Solr implementation of the OODT Catalog interface relies heavily on the "atomic update" functionality introduced in Solr 4, i.e. the ability to update individual parts of a Solr document without the need to re-index the whole document. The logical rationale is that the Solr catalog is new functionality within the OODT framework, so there is no need to support legacy deployments: a project that wishes to leverage this architecture might as well start anew with the most up-to-date version of Solr instead of installing an older one.
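
To make the "atomic update" idea concrete, the sketch below builds the JSON body of a generic Solr 4 atomic update, in which each changed field is wrapped in a "set" modifier and the remaining fields of the document are left untouched. The document identifier, the "Instrument" field, and the helper name are hypothetical; this shows Solr's general update syntax, not necessarily the exact requests issued by the File Manager.

```python
import json


def atomic_update_doc(doc_id, changes):
    """Return a Solr atomic-update document that overwrites only `changes`."""
    doc = {"id": doc_id}  # the unique key identifies the record to modify
    for field, value in changes.items():
        # {"set": value} replaces this field and leaves all others as they are.
        doc[field] = {"set": value}
    return doc


payload = json.dumps(
    [atomic_update_doc("hypothetical-product-id",
                       {"Instrument": "hypothetical-instrument"})]
)
print(payload)
```

Such a body would be posted to the Solr update handler with a JSON content type.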

...

When a physical product is sent for archiving to the File Manager, the associated metadata must be transformed into queryable information that is stored in the back-end Solr catalog. By default, the Solr File Manager will transform each product into one corresponding Solr document, thus generating a single searchable record in the Solr index. Each product attribute is transformed into a corresponding Solr field with the same name and value(s) (note that all fields must be defined in the project-specific schema deployed with the Solr installation).

Alternatively, a project may provide its own algorithm for generating Solr records from a CAS product by implementing the ProductSerializer interface. For example, a project that manages products composed of full directories may wish to create a "collection"-level Solr record for the enclosing directory, and separate "file"-level Solr records for each file in the directory. These different record types could be stored in the same Solr core, or sent to separate Solr cores.
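
The collection/file split described above can be sketched as follows. This is only a language-neutral illustration of the record layout, written in Python; an actual implementation would be a Java class implementing ProductSerializer, whose method signatures are not reproduced here. All field names ("record_type", "collection_id", "path") and the function name are hypothetical.

```python
import posixpath


def directory_product_to_records(product_id, directory, file_names):
    """Expand one directory product into 1 collection record + N file records."""
    collection = {
        "id": product_id,
        "record_type": "collection",   # hypothetical discriminator field
        "path": directory,
    }
    files = [
        {
            "id": f"{product_id}/{name}",
            "record_type": "file",
            "collection_id": product_id,  # link back to the enclosing record
            "path": posixpath.join(directory, name),
        }
        for name in file_names
    ]
    return [collection] + files


records = directory_product_to_records(
    "hypothetical-id", "/data/run42", ["a.nc", "b.nc"]
)
for record in records:
    print(record)
```

A discriminator field of this kind would let both record types share one Solr core; with separate cores, the two lists would simply be sent to different update URLs.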

...

Additionally, the following properties control how products are ingested and extracted into/from the Solr server,
i.e. the implementations used for the extension points described above. These properties have default values,
and need to be set only when the default is not the desired behavior.

  • org.apache.oodt.cas.filemgr.catalog.solr.productIdGenerator=org.apache.oodt.cas.filemgr.catalog.solr.UUIDProductIdGenerator
    • Optional: controls the algorithm for generating the product unique identifier when it is first stored in the catalog.
    • Default: UUIDProductIdGenerator: this class generates a new UUID every time a product is indexed.
    • Alternative out of the box implementation: NameProductIdGenerator: this class will assign the product an identifier equal to the product name.
    • Alternatively: provide any custom implementation of the ProductIdGenerator interface.
  • org.apache.oodt.cas.filemgr.catalog.solr.productSerializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductSerializer
    • Optional: controls the format of the documents ingested into Solr, i.e. how a CAS product object is transformed into one (or more) Solr records and, vice versa, how CAS products are queried back from the Solr index.
    • Default: DefaultProductSerializer: creates one Solr record for each incoming CAS product:
      • the product core attributes (id, name, type) are converted to Solr fields starting with "CAS." ("CAS.ProductId", "CAS.ProductName", ....)
      • the product identifier is used again to assign the Solr record identifier (i.e. "id" and "CAS.ProductId" have the same value)
      • the product references are converted into Solr fields starting with "CAS.Reference..." or "CAS.RootReference..."
      • the product metadata attributes are converted into Solr fields with the same name and number of values
    • Alternative: any custom implementation of the ProductSerializer interface can be used.
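
The default behavior listed above can be approximated with the sketch below, which mimics the two identifier strategies and the one-record-per-product mapping. It is an illustration in Python under stated assumptions, not the actual Java classes: the function names are hypothetical, and the product type and reference fields are left out because their full field names are not spelled out on this page.

```python
import uuid


def uuid_product_id(product_name):
    """Mimics the UUID strategy: a fresh identifier on every call."""
    return str(uuid.uuid4())


def name_product_id(product_name):
    """Mimics the name strategy: the identifier equals the product name."""
    return product_name


def to_solr_record(product_name, metadata, id_generator=uuid_product_id):
    """Map one product to one Solr record, following the rules listed above."""
    product_id = id_generator(product_name)
    record = {
        "id": product_id,              # Solr record identifier
        "CAS.ProductId": product_id,   # same value as "id"
        "CAS.ProductName": product_name,
    }
    for key, values in metadata.items():
        # Metadata keeps its own name and all of its values.
        record[key] = list(values)
    return record


record = to_solr_record(
    "example_product.dat",                       # hypothetical product
    {"Instrument": ["hypothetical-instrument"]},
    id_generator=name_product_id,
)
print(record)
```

With the name-based strategy, re-ingesting a product with the same name yields the same identifier, whereas the UUID strategy produces a new identifier each time.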

...

The file schema.xml, part of each specific Solr deployment, defines which metadata fields are stored in the Solr index, and can consequently be queried and retrieved by clients. Note that no metadata field can be ingested in Solr unless it is defined (explicitly or implicitly) in schema.xml. Additionally, a specific requirement of the File Manager - Solr integration is that each metadata field included in schema.xml must be "stored" (i.e. defined with stored="true"), so that it can be retrieved and re-inserted during partial document updates.
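
The "stored" requirement above lends itself to a simple pre-deployment check. The sketch below parses a schema.xml fragment and lists every explicitly declared field that lacks stored="true". The schema snippet, its field names, and the function name are hypothetical examples, not the schema shipped with the File Manager. The check is deliberately conservative: a field that inherits its "stored" setting from its field type is also reported, and dynamic fields are not inspected.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment used only to exercise the check.
SCHEMA_SNIPPET = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="CAS.ProductName" type="string" indexed="true" stored="true"/>
    <field name="Instrument" type="string" indexed="true" stored="false"/>
  </fields>
</schema>
"""


def unstored_fields(schema_xml):
    """Return names of <field> elements not explicitly marked stored="true"."""
    root = ET.fromstring(schema_xml.strip())
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("stored") != "true"
    ]


print(unstored_fields(SCHEMA_SNIPPET))
```

Any field reported by such a check would be a candidate for losing its value when a record is partially updated.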

Each project using a Solr File Manager is responsible for creating and deploying a schema.xml file that is consistent with its own algorithms for generating Solr documents from product metadata, and vice versa (as defined by the specific implementation of the ProductSerializer interface). An example schema.xml is provided as part of the File Manager distribution in the _resources/ sub-directory. This schema is compatible with the class DefaultProductSerializer (the default implementation of ProductSerializer), and contains the following definitions:

...