...


  1. Allow users to create Lucene indexes on data stored in Geode
  2. Update the indexes asynchronously to avoid impacting write latency
  3. Allow users to perform text (Lucene) searches on Geode data using the Lucene index. Results from text searches may be stale due to asynchronous index updates.
  4. Provide high availability of indexes using Geode's HA capabilities
  5. Provide high-throughput indexing and querying by partitioning index data to match the partitioning in Geode
  6. Scalability
  7. Performance comparable to RAMFSDirectory

Out of Scope
  1. Building the next/better Solr/Elasticsearch.
  2. Enhancing the current Geode OQL to use the Lucene index.

...

  1. A region and a list of to-be-indexed fields
  2. [Optional] Standard Analyzer or a custom implementation to be used with all the fields in an index
  3. [Optional] Field types. A string can be Text or String in Lucene; the two have different behavior.

...

  1. A single index will not support multiple regions (join queries between regions are not supported)
  2. Heterogeneous objects in a single region will be supported
  3. Only top-level fields and nested objects can be indexed, not nested collections
  4. The index needs to be created before adding the data (for phase 1)
  5. Pagination of results will be supported

Users will interact with a new LuceneService interface, which provides methods for creating and querying indexes. Users can also create indexes through gfsh or cache.xml.

Java API 

LuceneService

 

Code Block
public interface LuceneService {

  /**
   * Create a Lucene index using the default analyzer.
   */
  public LuceneIndex createIndex(String indexName, String regionName, String... fields);

  /**
   * Create a Lucene index using a specified analyzer per field.
   */
  public LuceneIndex createIndex(String indexName, String regionName,
      Map<String, Analyzer> analyzerPerField);

  public void destroyIndex(LuceneIndex index);

  public LuceneIndex getIndex(String indexName, String regionName);

  public Collection<LuceneIndex> getAllIndexes();

  /**
   * Get a factory for building queries.
   */
  public LuceneQueryFactory createLuceneQueryFactory();
}
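Assuming the interface above, a usage sketch might look like the following. `LuceneServiceProvider` and the `cache` variable are illustrative assumptions, and query construction via LuceneQueryFactory is left open since that API is not specified here.

```java
// Hypothetical usage sketch of the proposed API; not runnable as-is.
// LuceneServiceProvider is an assumed lookup mechanism, not part of this proposal.
LuceneService service = LuceneServiceProvider.get(cache);

// Index two fields of the objects stored in the "drugs" region, using the default analyzer.
LuceneIndex index = service.createIndex("drugIndex", "drugs", "sideEffects", "manufacturer");

// ... puts on the "drugs" region flow through the async queue into the index ...

// Obtain a factory for building queries; the query API itself is TBD in this proposal.
LuceneQueryFactory queryFactory = service.createLuceneQueryFactory();
```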

...

Gfsh API

Code Block
// Create Index
gfsh> create lucene-index --name=indexName --region=/drugs --fields=sideEffects,manufacturer

// Destroy Index
gfsh> destroy lucene-index --name=indexName --region=/drugs

// Execute a Lucene query
gfsh> luceneQuery --regionName=/drugs --queryStrings="" --limit=100 --page-size=10

 

XML Configuration 

Code Block
<region name="region">
  <lucene-index indexName="luceneIndex">
    <FieldDefinition name="fieldName" analyzer="KeywordAnalyzer"/>
  </lucene-index>
</region>

 

REST API

TBD

Spring Data GemFire Support

TBD - But the Searchable annotation described in this blog might be a good place to start.

Implementation

Index Storage

...

 

PlantUML
[Lucene Indexer] --> [GeodeFSDirectory]
() "User"
node "Colocated and Replicated" {
  () User --> [User Region] : Puts
  [User Region] --> [Async Queue]
  [Async Queue] --> [Lucene Indexer] : Batch Writes
  [GeodeFSDirectory] --> [Lucene Regions]
}

Partitioned region data flow

PlantUML
() User -down-> [Cache] : PUTs
node cluster {
 database {
 () "indexBucket1Primary"
 }

 database {
 () "indexBucket1Secondary"
 }

[Cache] ..> [Bucket 1]
 [Bucket 1] -down-> [Async Queue Bucket 1]
[Async Queue Bucket 1] -down-> [FSDirectoryBucket1] : Batch Write
[FSDirectoryBucket1] -> indexBucket1Primary
indexBucket1Primary -right-> indexBucket1Secondary

 database {
 () "indexBucket2Primary"
 }

 database {
 () "indexBucket2Secondary"
 }

[Cache] ..> [Bucket 2]
 [Bucket 2] -down-> [Async Queue Bucket 2]
 [Async Queue Bucket 2] -down-> [FSDirectoryBucket2] : Batch Write
 [FSDirectoryBucket2] -> indexBucket2Primary
 indexBucket2Primary -right-> indexBucket2Secondary 
}
 


In a partitioned region, every bucket in the region will have its own GeodeDirectory to store the Lucene indexes. The GeodeDirectory implements a file system using two regions:
  • FileRegion : holds the metadata about index files
  • ChunkRegion : holds the actual data chunks for a given index file.
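The two-region file system can be sketched with plain maps standing in for the regions. This is a toy illustration (names, the tiny chunk size, and the map-based storage are assumptions, not the actual Geode implementation): a file's bytes are split into fixed-size chunks keyed by file name and chunk index, and a metadata entry records how many chunks make up the file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of file storage over a "file region" (metadata) and a
// "chunk region" (raw data chunks), as described above.
public class ChunkedFileStore {
    static final int CHUNK_SIZE = 4; // tiny for illustration; a real store would use KBs

    // FileRegion analogue: file name -> metadata (here just the chunk count)
    final Map<String, Integer> fileRegion = new HashMap<>();
    // ChunkRegion analogue: "fileName#chunkIndex" -> raw bytes
    final Map<String, byte[]> chunkRegion = new HashMap<>();

    void write(String fileName, byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        for (int i = 0; i < chunks; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            chunkRegion.put(fileName + "#" + i, Arrays.copyOfRange(data, from, to));
        }
        fileRegion.put(fileName, chunks);
    }

    byte[] read(String fileName) {
        int chunks = fileRegion.get(fileName);
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        for (int i = 0; i < chunks; i++) {
            out.writeBytes(chunkRegion.get(fileName + "#" + i));
        }
        return out.toByteArray();
    }
}
```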

The FileRegion and ChunkRegion will be collocated with the data region that is to be indexed. The GeodeFSDirectory will use a key that contains the bucket id for file metadata and chunks. The FileRegion and ChunkRegion will have a partition resolver that looks only at the bucket id part of the key.
In the AsyncEventListener, when a data entry is processed:
  1. Determine the bucket id of the entry.
  2. Get the directory for that bucket and do the indexing operation into that instance.
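The two steps above can be sketched as follows. The hash-based bucket assignment and the in-memory map of per-bucket "directories" are illustrative assumptions; the real listener would use Geode's bucket metadata and GeodeFSDirectory instances.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-bucket routing in the indexing listener: each bucket of the data
// region gets its own directory instance, and an entry is always indexed into the
// directory of the bucket its key maps to.
public class BucketRouter {
    final int numBuckets;
    // one directory per bucket; modeled here as a StringBuilder log for illustration
    final Map<Integer, StringBuilder> directories = new HashMap<>();

    BucketRouter(int numBuckets) { this.numBuckets = numBuckets; }

    // Step 1: determine the bucket id of the entry from its key
    int bucketId(Object key) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Step 2: get the directory for that bucket and index into that instance
    void index(Object key, String document) {
        StringBuilder dir = directories.computeIfAbsent(bucketId(key), b -> new StringBuilder());
        dir.append(document).append('\n');
    }
}
```

Because the routing depends only on the key's bucket id, all index data for a bucket lands in one place, which is what lets the index regions stay collocated with the data region.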

...