Overview

Some Geode users have data models containing nested and complex objects. Lucene text search in Geode 1.2.0 supports indexing and querying only the top-level fields in the data object. The objective of this feature is to support indexing and querying an arbitrary depth of nested objects. 

Goals

  1. Users can specify fields in nested objects to be indexed.

  2. Users can query these nested fields and collections of nested fields, as well as top-level fields.

 

Out of Scope

  1. Items in a collection will not be indexed or searched.

 

API

Approach

  1. Expose a LuceneSerializer interface and let the user write code to convert their objects to Lucene documents.

  2. Provide a default LuceneSerializer when creating the index that will flatten nested objects (e.g., person.address.zip is stored as a list of zip codes on the top-level document).

  3. Provide a query syntax to search nested, flattened fields.

Out of Scope

  1. Extending core Apache Lucene functionality to define a syntax for creating multiple documents and performing automatic joins in the StandardQueryParser for handling parent/child relationships with collections of nested fields.

API Change

  • Add a new method for creating a Lucene index that takes a callback. The callback gives the user explicit control over how their value is converted to Lucene documents and stored in the index. 
Code Block
public interface LuceneIndexFactory {
 /**
 * Configure the way objects are converted to lucene documents for this lucene index
 * @param luceneSerializer A callback which converts a region value to a 
 * Lucene document or documents to be stored in the index.
 */
 public LuceneIndexFactory setLuceneSerializer(LuceneSerializer luceneSerializer);
}  
  
/**
 * An interface for writing the fields of an object into a Lucene document.
 * The region key will be added as a field to the returned documents.
 */
public interface LuceneSerializer {
  /**
   * @param index Lucene index
   * @param value user object to be serialized into the index
   * @return the Lucene document or documents to be stored in the index
   */
  Collection<Document> toDocuments(LuceneIndex index, Object value);
}

XML Configuration 

 

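<cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:lucene="http://geode.apache.org/schema/lucene"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache
        http://geode.apache.org/schema/cache/cache-1.0.xsd
        http://geode.apache.org/schema/lucene
        http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
    version="1.0">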
 
    <region name="region" refid="PARTITION">
        <lucene:index name="index">
           <lucene:field name="a" analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
           <lucene:field name="b" analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
           <lucene:field name="c" analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
           <lucene:serializer>
             <class-name>org.apache.geode.cache.lucene.FlatFormatSerializer</class-name>
           </lucene:serializer>
       </lucene:index>
    </region>
</cache>
 

 

 

We will also provide a built-in implementation of LuceneSerializer called FlatFormatSerializer. With this serializer, users can specify a field in a nested object using the syntax fieldnameAtLevel1.fieldnameAtLevel2, both for indexing and for querying.

For example, in the following data model, a Customer object contains both a collection of Person objects and a collection of Page objects. The Person object also contains a Page object.

Code Block
public class Customer implements Serializable {
  private String name;
  private Collection<String> phoneNumbers;
  private Collection<Person> contacts;
  private Page[] myHomePages;
  ......
}
public class Person implements Serializable {
  private String name;
  private String email;
  private int revenue;
  private String address;
  private String[] phoneNumbers;
  private Page homepage;
  .......
}

 
public class Page implements Serializable {
  private int id; // search integer in int format
  private String title;
  private String content;
  final String desc = "At client and server JVM, initializing cache will create the LuceneServiceImpl object," 
     +" which is a singleton at each JVM."; 
  ......
}
 

 Example

The example below demonstrates how to index the nested fields contacts.name, contacts.email, contacts.address, and contacts.homepage.title.

Note: each segment is a field name, not a type name, because the Customer class could have more than one field of type Person; e.g. Person contacts and Person deliveryman. The field name is used to identify the parent field.
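
To illustrate why each segment must be a field name, here is a minimal sketch that walks a dotted path with plain Java reflection. It is not FlatFormatSerializer's actual code. The class FieldPathSketch, its resolve helper, the cut-down Customer and Person classes, and the value "driver7" are all hypothetical and exist only for this illustration. The sketch handles single nested objects only, not collections.

```java
import java.lang.reflect.Field;

public class FieldPathSketch {
  // Hypothetical, cut-down stand-ins for the data model.
  static class Person {
    final String name;
    Person(String name) { this.name = name; }
  }

  static class Customer {
    final Person contacts;     // two fields of the same type:
    final Person deliveryman;  // only the field name tells them apart
    Customer(Person contacts, Person deliveryman) {
      this.contacts = contacts;
      this.deliveryman = deliveryman;
    }
  }

  /**
   * Walks a dotted path such as "contacts.name", treating every segment
   * as the name of a field declared on the current object's class.
   */
  static Object resolve(Object root, String path) {
    Object current = root;
    for (String segment : path.split("\\.")) {
      if (current == null) {
        return null; // a missing parent yields no value
      }
      try {
        Field field = current.getClass().getDeclaredField(segment);
        field.setAccessible(true);
        current = field.get(current);
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("No field '" + segment + "' in path " + path, e);
      }
    }
    return current;
  }

  public static void main(String[] args) {
    Customer customer = new Customer(new Person("tzhou11"), new Person("driver7"));
    System.out.println(resolve(customer, "contacts.name"));    // prints tzhou11
    System.out.println(resolve(customer, "deliveryman.name")); // prints driver7
  }
}
```

A path written with type names, such as Person.name, could not distinguish the two Person fields, which is why the qualified name is built from field names.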

 

Code Block
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create Index on fields, some are fields in nested objects:

luceneService.createIndexFactory().setLuceneSerializer(new FlatFormatSerializer()) /* an out-of-box LuceneSerializer implementation */
      .addField("name").addField("contacts.name").addField("contacts.email").addField("contacts.address").addField("contacts.homepage.title")
      .create("customerIndex", "Customer");


// Now to create region
Region CustomerRegion = ((Cache)cache).createRegionFactory(shortcut).create("Customer");


gfsh command line:


Code Block
gfsh> create lucene index --name=customerIndex --region=/Customer --field=name,contacts.name,contacts.email,contacts.address,contacts.homepage.title --serializer=org.apache.geode.cache.lucene.FlatFormatSerializer

 

The syntax for querying a nested field is the same as for a top-level field, but with the additional qualifying parent field name, such as "contacts.name:tzhou11*". The qualifier identifies which "name" field is meant when more than one "name" field exists at different hierarchical levels in the object.

Code Block
// Query with nested objects
final StandardQueryParser queryParser = new StandardQueryParser(new KeywordAnalyzer());
LuceneQuery query = luceneService.createLuceneQueryFactory().setLimit(200).setPageSize(20)
  .create("customerIndex", "Customer", "contacts.name:tzhou11*", "name" /* default field */);

// Search using Query
 
PageableLuceneQueryResults<K,Object> results = query.findPages();

// Pagination
while (results.hasNext()) {
  results.next().stream().forEach(struct -> {
    Object value = struct.getValue();
    System.out.println("Key is "+struct.getKey()+", value is "+value);
  });
}

Out-Of-Box implementation

We will provide an out-of-box implementation for the LuceneSerializer: FlatFormatSerializer.

It will still create one document for each parent object, adding the fields of nested objects as embedded fields of that document. Each field name will be the qualified name. Collections will be flattened and treated as tokens in a single field.

For example, the FlatFormatSerializer will convert a Customer object into a document as

(name:John11),(contacts.name:tzhou11), (contacts.email:tzhou11@gmail.com), (contacts.address:15220 Wall St), (contacts.homepage.id:11), (contacts.homepage.title: Mr. tzhou11), (contacts.homepage.content: xxx)
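
The flattening described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the FlatFormatSerializer implementation: nested objects are modeled as maps rather than real classes, every field is flattened rather than only the indexed ones, and the names FlattenSketch, flatten, flattenAll, obj, and sampleCustomer, as well as the second contact "tzhou12", are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenSketch {
  /**
   * Collects values under qualified names such as "contacts.homepage.title".
   * Nested objects are modeled as maps; each element of a collection
   * contributes its values to the same qualified field.
   */
  static void flatten(String prefix, Object value, Map<String, List<String>> out) {
    if (value == null) {
      return;
    }
    if (value instanceof Map) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        String name = prefix.isEmpty() ? entry.getKey().toString() : prefix + "." + entry.getKey();
        flatten(name, entry.getValue(), out);
      }
    } else if (value instanceof Collection) {
      for (Object element : (Collection<?>) value) {
        flatten(prefix, element, out); // same prefix: elements are pooled
      }
    } else {
      out.computeIfAbsent(prefix, k -> new ArrayList<>()).add(value.toString());
    }
  }

  /** Flattens a whole object into a single "document" of qualified fields. */
  static Map<String, List<String>> flattenAll(Object root) {
    Map<String, List<String>> out = new LinkedHashMap<>();
    flatten("", root, out);
    return out;
  }

  /** Builds a map-based stand-in for an object from alternating names and values. */
  static Map<String, Object> obj(Object... namesAndValues) {
    Map<String, Object> fields = new LinkedHashMap<>();
    for (int i = 0; i < namesAndValues.length; i += 2) {
      fields.put((String) namesAndValues[i], namesAndValues[i + 1]);
    }
    return fields;
  }

  /** Hypothetical sample: one customer with two contacts. */
  static Map<String, Object> sampleCustomer() {
    Map<String, Object> homepage = obj("id", 11, "title", "Mr. tzhou11");
    Map<String, Object> first = obj("name", "tzhou11", "email", "tzhou11@gmail.com", "homepage", homepage);
    Map<String, Object> second = obj("name", "tzhou12", "email", "tzhou12@gmail.com");
    return obj("name", "John11", "contacts", Arrays.asList(first, second));
  }

  public static void main(String[] args) {
    flattenAll(sampleCustomer()).forEach((field, values) -> System.out.println(field + ":" + values));
  }
}
```

In this sketch both contacts end up in the single field contacts.name, which mirrors how a collection is treated as tokens in one field of one document.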

Risks and Mitigations

With this solution, collections (lists and maps) will be treated as a single flattened field, with the risk that queries into a collection may produce the wrong results. For example, given a person entry with two addresses, one whose street contains "main street" and another whose zip code is "90210", a query such as:

person.address.zip=90210 and person.address.street=main

would incorrectly return this person entry, because no single address satisfies both conditions. Because Apache Lucene does not define a standard approach for this, we are providing the LuceneSerializer interface so that a user can write code to convert their objects into separate Lucene documents and use Lucene's ToParentBlockJoinQuery to produce the desired results. For example, for the above query, the user could write the following code:

 
Code Block
final StandardQueryParser queryParser = new StandardQueryParser();
Query addressQuery = queryParser.parse("zip:90210 AND street:main", "");
BitSetProducer parentDocFilter = new QueryBitSetProducer(new TermQuery(new Term("parent", "true")));
final ToParentBlockJoinQuery addressPart = new ToParentBlockJoinQuery(addressQuery, parentDocFilter, ScoreMode.Total);
TopDocs people = searcher.search(addressPart, 10);
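
The false positive described in this section can be reproduced without Lucene. The following is a minimal sketch in plain Java that contrasts matching against pooled (flattened) values with matching inside one address at a time, which is the behavior a block join is meant to restore. The class FlattenedCollectionRisk, the Address type, and the sample values "97006" and "oak avenue" are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class FlattenedCollectionRisk {
  /** Hypothetical address type used only for this illustration. */
  static class Address {
    final String street;
    final String zip;
    Address(String street, String zip) {
      this.street = street;
      this.zip = zip;
    }
  }

  /** Flattened view: streets and zips are pooled, so each term may match a different address. */
  static boolean matchesFlattened(List<Address> addresses, String streetTerm, String zip) {
    boolean streetHit = addresses.stream().anyMatch(a -> a.street.contains(streetTerm));
    boolean zipHit = addresses.stream().anyMatch(a -> a.zip.equals(zip));
    return streetHit && zipHit;
  }

  /** Per-address view: both terms must match within the same address. */
  static boolean matchesPerAddress(List<Address> addresses, String streetTerm, String zip) {
    return addresses.stream().anyMatch(a -> a.street.contains(streetTerm) && a.zip.equals(zip));
  }

  public static void main(String[] args) {
    List<Address> addresses = Arrays.asList(
        new Address("main street", "97006"),  // street matches, zip does not
        new Address("oak avenue", "90210"));  // zip matches, street does not
    System.out.println(matchesFlattened(addresses, "main", "90210"));  // prints true
    System.out.println(matchesPerAddress(addresses, "main", "90210")); // prints false
  }
}
```

The flattened check returns true even though no single address has both the street and the zip code, which is exactly the incorrect result that storing each address as its own child document avoids.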