
...

The Apache OODT File Manager and Apache Solr are in many respects two complementary technologies
that can be combined to offer a very attractive solution for managing scientific data and metadata.
The File Manager is a service for archiving and retrieving data products (files and directories),
typically used as part of a data distribution service or data processing pipeline.
Solr is a scalable, high-performance search engine that provides a web-based API on top of the underlying
Lucene index. By deploying an OODT File Manager backed by a Solr metadata store,
a scientific project can leverage the product creation and ingestion functionality of the OODT framework,
together with the fast and powerful query capabilities offered by Solr.

...

When the File Manager is deployed with a Solr-based metadata catalog, the overall system architecture is composed of the following tiers:

File Manager (archive) clients  --  File Manager          --  Solr                               --  Solr (query) clients
                                    (default port: 9000)      (default ports: 80, 8080 or 8983)


File Manager clients send product archiving requests to the File Manager (typically running as a local service on default port 9000).
This communication is carried over HTTP through XML-RPC encoded request/response pairs.
As part of the archiving process, the File Manager transforms the available product metadata
(either generated on the client side and sent as part of the archiving request, or generated by metadata extractors on the server side)
into Solr documents, which are sent to the Solr server for indexing. The Solr server must be deployed as an external service, running
within Tomcat (default ports: 80 or 8080) or Jetty (default port: 8983), and configured with a metadata schema that defines
the project-specific fields to be indexed. Product metadata can be queried by the standard File Manager clients or, more efficiently, by clients that interact directly with the Solr web services (as Solr is optimized for querying).
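
As an illustration of a client that talks directly to the Solr web services, the following minimal sketch builds a query URL for Solr's standard select handler. It only constructs the URL and does not contact a server. The base URL, the product name value and the helper name build_solr_query_url are assumptions made for this example; a multi-core Solr deployment would typically also include the core name in the path.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query_url(base_url, query, rows=10):
    """Return a Solr select URL for the given Lucene-syntax query string."""
    params = {"q": query, "rows": rows, "wt": "json"}
    # urlencode percent-escapes characters such as ':' and '"' in the query.
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical search by product name; the value is made up for illustration.
url = build_solr_query_url("http://localhost:8983/solr",
                           'CAS.ProductName:"example.dat"')

# Decode the URL again to show the parameters the Solr server would receive.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["q"][0])
```

The resulting URL could be fetched with any HTTP client; the response format is selected through the wt parameter.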

...

The OODT File Manager is only compatible with a Solr version of 4.X or above - i.e. no support is provided for interacting with pre-existing Solr installations of 3.X or below. The technical motivation is that the Solr implementation of the OODT Catalog interface relies heavily on the "atomic update" functionality that was introduced in Solr 4 - i.e. the ability to update single parts of a Solr document without the need to re-index the whole document. The logical rationale behind this decision is that this is a new functionality provided to the OODT framework, and consequently there is no need to support legacy deployments: a project that wishes to leverage this architecture might as well start anew with the most up to date version of Solr, instead of installing an older version.
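
To make the "atomic update" idea concrete, the sketch below builds the JSON body of a Solr 4 atomic update that changes a single field of an existing document. It is not taken from the OODT code base, and the File Manager may well use a different wire format (e.g. XML). The document identifier, the field name QualityFlag and the helper name atomic_update are hypothetical.

```python
import json


def atomic_update(doc_id, field_changes):
    """Build one Solr atomic-update document that overwrites selected fields."""
    doc = {"id": doc_id}
    for field, value in field_changes.items():
        # The "set" modifier replaces the stored value(s) of this field only;
        # all other fields of the document are left as they are.
        doc[field] = {"set": value}
    return doc


# Hypothetical product id and metadata field, chosen for illustration.
update = atomic_update("product-001", {"QualityFlag": "good"})
body = json.dumps([update])
print(body)
```

A body of this shape would be sent with an HTTP POST to the Solr update handler, with content type application/json.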

...

When a physical product is sent for archiving to the File Manager, the associated metadata must be transformed into queryable information that is stored in the back-end Solr catalog. By default, the Solr File Manager will transform each product into one corresponding Solr document, thus generating a single searchable record in the Solr index. Each product attribute is transformed into a corresponding Solr field with the same name and value(s) (note that all fields must be defined in the project-specific schema deployed with the Solr installation).

Alternatively, a project may provide its own algorithm for generating Solr records from a CAS product by implementing the ProductSerializer interface. For example, a project that manages products composed of full directories may wish to create a "collection"-level Solr record for the enclosing directory, and separate "file"-level Solr records for each file in the directory. These different record types could be stored in the same Solr core, or sent to separate Solr cores.
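
The directory example can be sketched as follows. This is only an outline of the record layout such a custom algorithm might produce, written in Python for brevity rather than as an implementation of the Java ProductSerializer interface. The field names record_type and collection_id, the identifier scheme and the function name are all hypothetical.

```python
def directory_to_records(product_id, directory_name, file_names):
    """Return one collection-level record plus one file-level record per file."""
    collection = {
        "id": product_id,
        "record_type": "collection",  # hypothetical discriminator field
        "name": directory_name,
    }
    files = [
        {
            "id": f"{product_id}/{file_name}",
            "record_type": "file",
            "name": file_name,
            "collection_id": product_id,  # hypothetical back-reference
        }
        for file_name in file_names
    ]
    return [collection] + files


records = directory_to_records("run-42", "run-42-output", ["a.nc", "b.nc"])
for record in records:
    print(record["record_type"], record["id"])
```

With a layout of this kind, a discriminator field lets both record types share one Solr core while still being searchable separately.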

...

Alternatively, a project may provide its own product generation algorithm by implementing the ProductDeserializer interface, in a way that is consistent with its own implementation of the ProductSerializer interface.

Configuration

...

The deployment of a Solr File Manager is controlled by two main files: the File Manager filemgr.properties, and the Solr schema.xml.

filemgr.properties

To use a Solr-based metadata catalog, the File Manager file filemgr.properties must be edited as follows. 

At a minimum, the following two properties must be defined:

  • org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
      • Mandatory: instructs OODT to instantiate a Solr Catalog implementation at startup.
  • org.apache.oodt.cas.filemgr.catalog.solr.url=http://<hostname>:<port>/solr
      • Mandatory: points the File Manager to the base URL of the Solr server.

Additionally, the following properties control how products are ingested into and extracted from the Solr server,
i.e. the implementations used for the extension points described above. These properties have default values,
and need to be set only when the default is not the desired behavior.

  • org.apache.oodt.cas.filemgr.catalog.solr.productIdGenerator=org.apache.oodt.cas.filemgr.catalog.solr.UUIDProductIdGenerator
      • Optional: controls the algorithm for generating the product unique identifier when it is first stored in the catalog.
      • Default: UUIDProductIdGenerator: this class generates a new UUID every time a product is indexed.
      • Alternative out-of-the-box implementation: NameProductIdGenerator: this class assigns the product an identifier equal to the product name.
      • Alternatively: provide any custom implementation of the ProductIdGenerator interface.

  • org.apache.oodt.cas.filemgr.catalog.solr.productSerializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductSerializer
      • Optional: controls the format of the documents ingested into Solr, i.e. how a CAS product object is transformed into one (or more) Solr records, and vice versa how CAS products are queried back from the Solr index.
      • Default: DefaultProductSerializer: creates one Solr record for each incoming CAS product:
          • the product core attributes (id, name, type) are converted to Solr fields starting with "CAS." ("CAS.ProductId", "CAS.ProductName", ...)
          • the product identifier is used again to assign the Solr record identifier (i.e. "id" and "CAS.ProductId" have the same value)
          • the product references are converted into Solr fields starting with ("CAS.Reference..." or "CAS.RootReference...")
          • the product metadata attributes are converted into Solr fields with the same name and number of values
      • Alternative: any custom implementation of the ProductSerializer interface can be used.

    Note that each specific implementation of ProductSerializer must declare the format of the generated Solr documents it understands (XML, JSON, etc.),
    so each implementation is free to generate and parse Solr documents in the format of choice.

  • org.apache.oodt.cas.filemgr.catalog.solr.productDeserializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductDeserializer
      • Optional: controls how CAS products are queried back from the Solr index.
      • Default: DefaultProductDeserializer: creates one CAS product for each returned Solr record, by reversing the rules that created that record in the first place (see DefaultProductSerializer).
      • Alternative: any custom implementation of the ProductDeserializer interface can be used, but it should be consistent with the implementation chosen for the ProductSerializer interface.

    Note that each specific implementation of ProductDeserializer must declare the format of the Solr response documents it can parse (XML, JSON, etc.),
    so each implementation is free to parse the document format of choice.
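
The default mapping rules listed above, and their reversal, can be illustrated with a small round trip. This is a simplified Python sketch of the rules as described in this page, not the actual Java implementation. Products are modeled as plain dictionaries, the product type and references are left out because their exact Solr field names are not spelled out here, and the function names are hypothetical.

```python
def product_to_solr_doc(product):
    """Map a product dict to one flat Solr-style document."""
    doc = {
        "id": product["id"],  # the record id mirrors the product id
        "CAS.ProductId": product["id"],
        "CAS.ProductName": product["name"],
    }
    # Metadata attributes keep their own names and all of their values.
    for key, values in product["metadata"].items():
        doc[key] = list(values)
    return doc


def solr_doc_to_product(doc):
    """Reverse the mapping above to rebuild the product dict."""
    metadata = {
        key: list(values)
        for key, values in doc.items()
        if key != "id" and not key.startswith("CAS.")
    }
    return {
        "id": doc["CAS.ProductId"],
        "name": doc["CAS.ProductName"],
        "metadata": metadata,
    }


# Hypothetical product used only to exercise the two functions.
product = {
    "id": "product-001",
    "name": "example.dat",
    "metadata": {"Instrument": ["radar"], "Keyword": ["rain", "storm"]},
}
doc = product_to_solr_doc(product)
restored = solr_doc_to_product(doc)
print(doc)
```

The round trip only works because both directions agree on the same naming rules, which is why a custom deserialization algorithm has to be consistent with the serialization algorithm it is paired with.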

Following is a full example of Solr File Manager configuration (default behavior, except that NameProductIdGenerator is used in place of the default UUIDProductIdGenerator):

org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
org.apache.oodt.cas.filemgr.catalog.solr.url=http://localhost:8983/solr
org.apache.oodt.cas.filemgr.catalog.solr.productIdGenerator=org.apache.oodt.cas.filemgr.catalog.solr.NameProductIdGenerator
org.apache.oodt.cas.filemgr.catalog.solr.productSerializer=org.apache.oodt.cas.filemgr.catalog.solr.DefaultProductSerializer
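
A deployment script may want to check the Solr-related settings before starting the File Manager. The sketch below parses a simplified key=value subset of the properties format, verifies the two mandatory properties and fills in the documented defaults for two of the optional ones. It is a stand-alone illustration and not part of OODT; the function name and the simplified parsing are assumptions.

```python
PREFIX = "org.apache.oodt.cas.filemgr.catalog."
MANDATORY = (PREFIX + "factory", PREFIX + "solr.url")
DEFAULTS = {
    PREFIX + "solr.productIdGenerator": PREFIX + "solr.UUIDProductIdGenerator",
    PREFIX + "solr.productSerializer": PREFIX + "solr.DefaultProductSerializer",
}


def load_catalog_settings(text):
    """Parse simple key=value lines and apply the documented defaults."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    missing = [key for key in MANDATORY if key not in settings]
    if missing:
        raise ValueError("missing mandatory properties: " + ", ".join(missing))
    # Explicit settings override the defaults.
    return {**DEFAULTS, **settings}


example = """
org.apache.oodt.cas.filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.solr.SolrCatalogFactory
org.apache.oodt.cas.filemgr.catalog.solr.url=http://localhost:8983/solr
"""
settings = load_catalog_settings(example)
print(settings[PREFIX + "solr.productIdGenerator"])
```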

schema.xml

The file schema.xml, part of each specific Solr deployment, defines which metadata fields are stored in the Solr index, and can consequently be queried
and retrieved by clients. Note that no metadata field can be ingested in Solr unless it is defined (explicitly or implicitly) in schema.xml.
Additionally, a specific requirement of the File Manager - Solr integration is that each metadata field included in schema.xml must be "stored" (i.e.
defined with stored="true"), so that it can be retrieved and re-inserted during partial document updates.

Each project using a Solr File Manager is responsible for creating and deploying a schema.xml file that is consistent with its own algorithms
for generating Solr documents from product metadata, and vice versa (as defined by the specific implementations of the ProductSerializer and
ProductDeserializer interfaces). An example schema.xml is provided as part of the File Manager distribution in the resources/ sub-directory. This
schema is compatible with the default implementations DefaultProductSerializer and DefaultProductDeserializer, and contains the following definitions:

  • The field "id" is used as the unique Solr document identifier (this is the Solr default).
  • Solr fields corresponding to the core CAS product attributes (id, name, type, references etc.) are explicitly defined, either as single or multi-valued fields.
  • All other incoming fields ending in '*Date' or '*Time' are stored as single-valued Solr fields of type "date" with the same name.
  • All other incoming metadata fields are stored as multi-valued Solr fields of type string with the same name.
  • All field values are copied into a catch-all field called 'text' to allow for free-text searching over the full product metadata.

Each project is free to use the default schema.xml provided as part of the File Manager distribution, or to change it in full or in part,
as long as it remains compatible with that project's algorithm for creating and retrieving Solr records.
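
The field-typing rules of the example schema can be summarized as a small decision function. This is a Python paraphrase of the rules as described above and not a substitute for reading the actual schema.xml. It assumes that the explicitly defined core fields are exactly "id" plus those whose names start with "CAS.", and it does not model their individual types. The function name is hypothetical.

```python
def schema_rule_for(field_name):
    """Return (solr_type, multi_valued) for a field under the described rules."""
    if field_name == "id" or field_name.startswith("CAS."):
        # Declared explicitly in schema.xml; their types are not modeled here.
        return ("explicit", None)
    if field_name.endswith("Date") or field_name.endswith("Time"):
        # Matches the '*Date' and '*Time' patterns: single-valued date field.
        return ("date", False)
    # Everything else falls through to a multi-valued string field.
    return ("string", True)


# Hypothetical metadata field names used for illustration.
for name in ("id", "CAS.ProductName", "StartDate", "AcquisitionTime", "Instrument"):
    print(name, schema_rule_for(name))
```

Under these rules, a metadata field holding a single free-form date string that is not in a format Solr can parse as a date should not be given a name ending in "Date" or "Time", unless the project adapts the schema accordingly.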