Overview

 

Many Geode users create data models using nested and complex objects. Lucene text search in the Geode 1.2.0 release supports indexing and querying only the top-level fields of a data object.

The objective of this feature is to support indexing and querying nested objects of arbitrary depth.

Goals

  1. Users can specify fields in nested objects to be indexed.

  2. Users can query on these nested fields and on collections of nested fields, as well as on top-level fields.

...

Approach

  1. Expose a LuceneSerializer interface and let users write code to convert their objects to Lucene documents.

  2. Provide a default LuceneSerializer when creating the index that flattens nested objects (e.g., person.address.zip is stored as a list of zip codes on the top-level document).

  3. Provide a query syntax to search nested, flattened fields.

Out of Scope

  1. Items in a collection will not be indexed and searched.

 

  1. Extending core Lucene functionality to define a syntax for creating multiple documents and performing automatic joins in the StandardQueryParser for handling parent/child relationships with collections of nested fields.

API Change

...

  • Add a new method to create a Lucene index that takes a callback. The callback gives the user explicit control over how their value is converted to Lucene documents and stored in the index.
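To illustrate the kind of control such a callback gives, here is a minimal sketch in plain Java. Everything in it is hypothetical: `DocumentCallback`, `Person`, and `PersonCallback` are stand-ins invented for this example, and a `Map<String, String>` stands in for a Lucene document. It is not the Geode interface. The sketch emits one child document per address followed by a parent document carrying a `parent:true` marker field, which is the ordering that Lucene block joins expect (children first, parent last).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback shape; a stand-in for illustration, not Geode's LuceneSerializer.
interface DocumentCallback<T> {
  // Each map stands in for one Lucene document: field name -> field value.
  List<Map<String, String>> toDocuments(T value);
}

// Hypothetical domain object with a collection of nested values.
class Person {
  final String name;
  final List<String[]> addresses; // each entry is {street, zip}

  Person(String name, List<String[]> addresses) {
    this.name = name;
    this.addresses = addresses;
  }
}

// Emits one child document per address, followed by the parent document.
class PersonCallback implements DocumentCallback<Person> {
  @Override
  public List<Map<String, String>> toDocuments(Person person) {
    List<Map<String, String>> docs = new ArrayList<>();
    for (String[] address : person.addresses) {
      Map<String, String> child = new LinkedHashMap<>();
      child.put("street", address[0]);
      child.put("zip", address[1]);
      docs.add(child);
    }
    Map<String, String> parent = new LinkedHashMap<>();
    parent.put("name", person.name);
    parent.put("parent", "true"); // assumed marker field for locating parent documents
    docs.add(parent);
    return docs;
  }
}

class CallbackDemo {
  public static void main(String[] args) {
    List<String[]> addresses = new ArrayList<>();
    addresses.add(new String[] {"main", "10001"}); // hypothetical sample data
    addresses.add(new String[] {"wall", "90210"});
    // Prints the two child documents, then the parent document.
    new PersonCallback().toDocuments(new Person("John", addresses)).forEach(System.out::println);
  }
}
```

Because each address becomes its own document, a query can require that the street and the zip code match within the same address, which a single flattened document cannot express.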

...

It will still create one document for each parent object, but will add the fields of each nested object as embedded fields of that document. Each field name is the qualified name. Collections are flattened and treated as tokens in a single field.

For example, the FlatFormatSerializer will convert a Customer object into a document as follows:

(name:John11), (contact.name:tzhou11), (contact.email:tzhou11@gmail.com), (contact.address:15220 Wall St), (contact.homepage.id:11), (contact.homepage.title:Mr. tzhou11), (contact.homepage.content:xxx)
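The flattening described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the FlatFormatSerializer implementation: nested objects are modeled as `Map<String, Object>` instead of being read from real objects, and the class name `FlattenSketch` and its methods are hypothetical. Nested keys are joined with a dot to form the qualified field name, and every element of a collection contributes values to the same field.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a flattening serializer; not the Geode implementation.
class FlattenSketch {

  // Flattens a nested map into qualified field names mapped to their values.
  static Map<String, List<String>> flatten(Map<String, Object> value) {
    Map<String, List<String>> fields = new LinkedHashMap<>();
    collect("", value, fields);
    return fields;
  }

  private static void collect(String prefix, Object value, Map<String, List<String>> fields) {
    if (value instanceof Map) {
      // Nested object: recurse, extending the qualified name with each key.
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        String name = prefix.isEmpty() ? e.getKey().toString() : prefix + "." + e.getKey();
        collect(name, e.getValue(), fields);
      }
    } else if (value instanceof Collection) {
      // Collection: every element contributes to the same qualified field.
      for (Object item : (Collection<?>) value) {
        collect(prefix, item, fields);
      }
    } else if (value != null) {
      // Leaf value: store its string form under the qualified name.
      fields.computeIfAbsent(prefix, k -> new ArrayList<>()).add(value.toString());
    }
  }

  public static void main(String[] args) {
    Map<String, Object> homepage = new LinkedHashMap<>();
    homepage.put("id", 11);
    homepage.put("title", "Mr. tzhou11");
    Map<String, Object> contact = new LinkedHashMap<>();
    contact.put("name", "tzhou11");
    contact.put("email", "tzhou11@gmail.com");
    contact.put("homepage", homepage);
    Map<String, Object> customer = new LinkedHashMap<>();
    customer.put("name", "John11");
    customer.put("contact", contact);
    // Prints one line per qualified field, e.g. the field contact.homepage.id.
    flatten(customer).forEach((field, values) -> System.out.println(field + ":" + values));
  }
}
```

Keeping a list of values per field mirrors the statement that a collection becomes multiple tokens in a single field; it also shows why the association between values from the same collection element is lost.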

Risks and Mitigations

With this solution, collections (lists and maps) will be treated as a single flattened field, with the risk that queries into a collection may produce the wrong results. For example, given a person entry with two addresses, one on Main Street and the other with zip code 90210, a query such as:

person.address.zip=90210 and person.address.street=main

would incorrectly return this person entry, because the two terms match different addresses. Because Apache Lucene does not define a standard approach for this, we are providing the LuceneSerializer interface to allow users to write code that converts their objects to separate Lucene documents and to use Lucene's ToParentBlockJoinQuery to produce the desired results. For example, for the above query, the user could write the following code:

 
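// Match child (address) documents in which both terms occur in the same address.
final StandardQueryParser queryParser = new StandardQueryParser();
Query addressQuery = queryParser.parse("zip:90210 AND street:main", "");
// Identify parent (person) documents, assumed to carry the marker field parent:true.
BitSetProducer parentDocFilter = new QueryBitSetProducer(new TermQuery(new Term("parent", "true")));
// Join each matching address document to its parent document in the same index block.
final ToParentBlockJoinQuery addressPart = new ToParentBlockJoinQuery(addressQuery, parentDocFilter, ScoreMode.Total);
// searcher is an IndexSearcher over the index; fetch the top 10 matching people.
TopDocs people = searcher.search(addressPart, 10);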
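The false positive described in this section can be reproduced without Lucene. The sketch below is a simplified model, not Geode or Lucene code: `FlattenedCollectionRisk` and `Address` are hypothetical names, and the zip code of the Main Street address and the street of the 90210 address are made-up sample values. It contrasts matching against flattened fields with matching within a single address.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FlattenedCollectionRisk {

  // Simplified address with only the two fields used in the example query.
  static final class Address {
    final String street;
    final String zip;

    Address(String street, String zip) {
      this.street = street;
      this.zip = zip;
    }
  }

  // Flattened view: all streets and all zips are pooled into one field each.
  static boolean matchesFlattened(List<Address> addresses, String street, String zip) {
    Set<String> streets = new HashSet<>();
    Set<String> zips = new HashSet<>();
    for (Address a : addresses) {
      streets.add(a.street);
      zips.add(a.zip);
    }
    return streets.contains(street) && zips.contains(zip);
  }

  // Per-address view: both terms must match within the same address.
  static boolean matchesPerAddress(List<Address> addresses, String street, String zip) {
    for (Address a : addresses) {
      if (a.street.equals(street) && a.zip.equals(zip)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Address> addresses = Arrays.asList(
        new Address("main", "10001"),   // hypothetical zip for the Main Street address
        new Address("wall", "90210"));  // hypothetical street for the 90210 address
    System.out.println(matchesFlattened(addresses, "main", "90210"));  // true (false positive)
    System.out.println(matchesPerAddress(addresses, "main", "90210")); // false
  }
}
```

The per-address check corresponds to what the block join achieves: the conjunction is evaluated against each child document separately, and only then mapped to the parent.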