Gfsh API

Code Block

// Create Index
gfsh> create lucene-index --name=indexName --region=/orders --fields=customer,tags

// Destroy Index
gfsh> destroy lucene-index --name=indexName --region=/orders

// Execute Lucene query
gfsh> luceneQuery --regionName=/orders --queryStrings="" --limit=100 --page-size=10

XML Configuration

Code Block
<region name="region">
  <lucene-index indexName="luceneIndex">
    <FieldDefinition name="fieldName" analyzer="KeywordAnalyzer"/>
  </lucene-index>
</region>

REST API

TBD - But using Solr to provide a REST API might make a lot of sense.

Spring Data GemFire Support

TBD - But the Searchable annotation described in this blog might be a good place to start.

Implementation Flowchart

Index Storage

The Lucene indexes will be stored in memory rather than on disk. This will be done by implementing a Lucene Directory called RegionDirectory, which uses Geode as a flat file system. This way we get all the benefits offered by Geode, and we can achieve replication and sharding of the indexes. In the HA case, the Lucene indexes will be colocated with the data region.

A LuceneIndex object will be created for each index to manage all the attributes related to the index, such as reflection fields, AEQ listener, RegionDirectory array, Search, etc.

PlantUML

[LuceneIndex] --> [RegionDirectory]
() "User"
node "Colocated PR or Replicated Region" {
  () User --> [User Data Region] : Puts
  [User Data Region] --> [Async Queue]
  [Async Queue] --> [LuceneIndex] : Batch Writes
  [RegionDirectory] --> [Lucene Regions]
}

...

PlantUML

() User -down-> [User Data Region] : PUTs

[User Data Region] ..> [Bucket 1]
 [Bucket 1] -down-> [Async Queue Bucket 1]
node LuceneIndex {
[Async Queue Bucket 1] -down-> [AEQ listener processes events into index documents]:Batch Write
[AEQ listener processes events into index documents] -down-> [RegionDirectory1]
[RegionDirectory1] -down-> [file region bucket 1]

[file region bucket 1] -down-> [chunk region bucket 1]
}
 
[User Data Region] ..> [Bucket 2]
 [Bucket 2] -down-> [Async Queue Bucket 2]
node LuceneIndex {
[Async Queue Bucket 2] -down-> [AEQ listener processes events into index documents]:Batch Write
[AEQ listener processes events into index documents] -down-> [RegionDirectory2]
[RegionDirectory2] -down-> [file region bucket 2]
[file region bucket 2] -down-> [chunk region bucket 2]
}

If the user's data region is a partitioned region, there will be one LuceneIndex for the partitioned region. Every bucket in the data region will have its own RegionDirectory (an implementation of Lucene's Directory interface), which maintains the file system for the index regions. The index regions consist of two regions:

FileRegion: holds the metadata about the index files.
ChunkRegion: holds the actual data chunks for a given index file.

The FileRegion and ChunkRegion will be colocated with the data region that is to be indexed. The FileRegion and ChunkRegion will have a partition resolver that looks only at the bucket id part of the key.
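
To make the FileRegion / ChunkRegion split more concrete, here is a minimal sketch that uses two plain Java maps as stand-ins for the two regions of a single bucket. The class name, the key format "<file name>#<chunk number>" and the tiny chunk size are assumptions made for illustration only; they are not taken from the design.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for one bucket's FileRegion and ChunkRegion. */
class RegionFileSystemSketch {
    // Deliberately tiny so that even short files span several chunks.
    static final int CHUNK_SIZE = 4;

    // Stand-in for the FileRegion: file name -> metadata (here only the length).
    private final Map<String, Integer> fileRegion = new HashMap<>();
    // Stand-in for the ChunkRegion: "<file name>#<chunk number>" -> chunk bytes.
    private final Map<String, byte[]> chunkRegion = new HashMap<>();

    /** Splits the data into fixed-size chunks and records the file length. */
    void writeFile(String name, byte[] data) {
        int chunkNumber = 0;
        for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, data.length - offset);
            byte[] chunk = new byte[len];
            System.arraycopy(data, offset, chunk, 0, len);
            chunkRegion.put(name + "#" + chunkNumber++, chunk);
        }
        // Note: this sketch does not clean up stale chunks when a file is overwritten.
        fileRegion.put(name, data.length);
    }

    /** Looks up the length in the file map, then reassembles the chunks in order. */
    byte[] readFile(String name) {
        Integer length = fileRegion.get(name);
        if (length == null) {
            throw new IllegalArgumentException("No such file: " + name);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(length);
        int chunks = (length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        for (int i = 0; i < chunks; i++) {
            byte[] chunk = chunkRegion.get(name + "#" + i);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    int chunkCount() {
        return chunkRegion.size();
    }

    public static void main(String[] args) {
        RegionFileSystemSketch fs = new RegionFileSystemSketch();
        fs.writeFile("segments_1", "segments_1 payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(fs.readFile("segments_1"), StandardCharsets.UTF_8));
    }
}
```

In the real design the two maps would be Geode regions colocated with the data region, and a RegionDirectory would expose them to Lucene through the Directory interface.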

An AsyncEventQueue will be used to update the LuceneIndex. An AsyncEventListener will process the events in the AEQ in batches. When a data entry is processed, the listener will:

Create a document for the indexed fields. Indexed field values are obtained from the AsyncEvent through reflection (for domain objects) or through the PdxInstance interface (for PDX or JSON). A Lucene document object is constructed and added to the LuceneIndex associated with that region.
Determine the bucket id of the entry.
Get the RegionDirectory for that bucket and save the document into the RegionDirectory.
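
The three listener steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the Event and Order classes, the fixed bucket count, and the hash-based bucketIdFor helper are hypothetical, a map of field names to values stands in for a Lucene document, and a per-bucket list stands in for the RegionDirectory. Only the reflection path (domain objects) is shown, not the PdxInstance path.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the listener steps: build document, find bucket, store. */
class IndexingListenerSketch {
    static final int TOTAL_BUCKETS = 4; // assumed bucket count, for illustration only

    /** Hypothetical domain object with two indexed fields. */
    static final class Order {
        private final String customer;
        private final String tags;
        Order(String customer, String tags) {
            this.customer = customer;
            this.tags = tags;
        }
    }

    /** Hypothetical stand-in for an event taken from the queue. */
    static final class Event {
        final Object key;
        final Object value;
        Event(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    private final String[] indexedFields;
    // Stand-in for the per-bucket RegionDirectory: bucket id -> stored documents.
    private final Map<Integer, List<Map<String, String>>> documentsByBucket = new HashMap<>();

    IndexingListenerSketch(String... indexedFields) {
        this.indexedFields = indexedFields;
    }

    /** Step 1: read the indexed fields through reflection and build a "document". */
    Map<String, String> toDocument(Object value) {
        Map<String, String> doc = new LinkedHashMap<>();
        try {
            for (String name : indexedFields) {
                Field field = value.getClass().getDeclaredField(name);
                field.setAccessible(true);
                Object fieldValue = field.get(value);
                if (fieldValue != null) {
                    doc.put(name, fieldValue.toString());
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read indexed field", e);
        }
        return doc;
    }

    /** Step 2: derive a bucket id from the key (assumed hashing scheme). */
    static int bucketIdFor(Object key) {
        return Math.floorMod(key.hashCode(), TOTAL_BUCKETS);
    }

    /** Processes one batch; step 3 saves each document under its bucket. */
    void processBatch(List<Event> batch) {
        for (Event event : batch) {
            Map<String, String> doc = toDocument(event.value);
            int bucketId = bucketIdFor(event.key);
            documentsByBucket.computeIfAbsent(bucketId, id -> new ArrayList<>()).add(doc);
        }
    }

    List<Map<String, String>> documentsIn(int bucketId) {
        return documentsByBucket.getOrDefault(bucketId, new ArrayList<>());
    }

    public static void main(String[] args) {
        IndexingListenerSketch listener = new IndexingListenerSketch("customer", "tags");
        listener.processBatch(Arrays.asList(
                new Event(5, new Order("alice", "urgent")),
                new Event(8, new Order("bob", "gift"))));
        System.out.println(listener.documentsIn(1));
    }
}
```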

Processing Queries

PlantUML

() User -down-> [LuceneQuery] : fields, Analyzer, query strings, or Query
[LuceneQuery] -down-> [User Data Region]: call search()
[User Data Region] -down-> [Function Execution]
[Function Execution] -down-> [Bucket 1]
[Bucket 1] -down-> [RegionDirectory for bucket 1]
[RegionDirectory for bucket 1] ..> [Bucket 1] : TopDocs, ScoreDocs
[Bucket 1] ..> [Function Execution] : score, key

[Function Execution] -down-> [Bucket 2]
[Bucket 2] -down-> [RegionDirectory for bucket 2]
[RegionDirectory for bucket 2] ..> [Bucket 2] : TopDocs, ScoreDocs
[Bucket 2] ..> [Function Execution] : score, key
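
The diagram shows each bucket returning (score, key) pairs to the function execution, but it does not say how those per-bucket results are combined. One plausible way, shown here purely as an assumption, is to collect all pairs, sort them by descending score and keep the first "limit" entries. The Hit class and merge method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: merging per-bucket (score, key) results into one top list. */
class TopHitsMergeSketch {
    /** One search result as returned by a bucket: the entry key and its score. */
    static final class Hit {
        final Object key;
        final float score;
        Hit(Object key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Combines all bucket results and keeps the highest-scoring entries. */
    static List<Hit> merge(List<List<Hit>> perBucketHits, int limit) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perBucketHits) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> bucket1 = Arrays.asList(new Hit("k1", 0.9f), new Hit("k2", 0.4f));
        List<Hit> bucket2 = Arrays.asList(new Hit("k3", 0.7f));
        for (Hit hit : merge(Arrays.asList(bucket1, bucket2), 2)) {
            System.out.println(hit.key + " " + hit.score);
        }
    }
}
```

Sorting the full list is the simplest approach; since each bucket already returns its own results in score order, a k-way merge over the per-bucket lists would avoid the full sort for large result sets.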

Implementation Details

Storage with different region types

PersistentRegions

The Lucene Index will be persisted.

OverflowRegions

The Lucene Index will not be overflowed. The rationale here is that the Lucene index will be much smaller than the data, so it is not necessary to overflow the index.

EmptyRegions

The Lucene Index is not supported.

OffHeapRegions

The Lucene Index will be stored off-heap.

Walkthrough creating index in Geode region

1) Create a LuceneIndex object to hold the data structures that will be created in the following steps. This object will later be registered with the cache-owned LuceneService.

2) LuceneIndex will keep all the reflective fields.

3) Assume the data region is a PartitionedRegion (otherwise there is no need to define a PartitionResolver). Create a FileRegion (let's call it "fr") and a ChunkRegion (let's call it "cr"), colocated with the data region (let's name it "dataregion"). Define a PartitionResolver that uses dataregion's bucket id as the routing object. This guarantees that an index bucket region has the same bucket id as the corresponding dataregion bucket region, even when dataregion has its own user-defined PartitionResolver. We don't need to define a PartitionResolver on dataregion.

4) The FileRegion and ChunkRegion use the same region attributes as dataregion. In the partitioned region case, the FileRegion and ChunkRegion will be under the same parent region, i.e. /root in this example. In the replicated region case, the index regions will always be root regions.

5) Create a RegionDirectory object for a bucket, using the same bucket of the FileRegion and ChunkRegion.

6) Create a PerFieldAnalyzerWrapper and save the fields in LuceneIndex.

7) Create a Lucene IndexWriterConfig object using the Analyzer.

8) Create a Lucene IndexWriter object using the RegionDirectory and the IndexWriterConfig object.

9) Define an AEQ with multiple dispatcher threads and order-policy=partition. This groups events by bucket id into different dispatcher queues. Each dispatcher thread will call our AEQ listener to process events for one or more buckets. Each event will be converted into a document and written into the ChunkRegion via the RegionDirectory. We don't need a lock for the RegionDirectory, since only one thread processes a given bucket's events.

10) If dataregion is a replicated region, define the AEQ with a single dispatcher thread.

11) Register the newly created LuceneIndex with the LuceneService. The registration step will also publish the metadata into the "lucene_meta_region", which is a persistent replicated region, so that other JVMs will know that a new LuceneIndex with this metadata was created. All members should have a LuceneService instance with the same LuceneIndex definition.
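
Steps 3 and 9 both rely on routing by bucket id. The sketch below illustrates the idea with plain Java: an index key that carries the bucket id, a routing function that looks only at that part of the key, and an assignment of buckets to dispatcher threads so that each bucket is handled by exactly one thread. The IndexKey class and the modulo-based assignment are assumptions made for illustration; the actual assignment Geode uses for order-policy=partition may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of bucket-id routing for index regions and AEQ dispatchers. */
class BucketRoutingSketch {
    /** Assumed key shape for FileRegion/ChunkRegion entries: bucket id plus file name. */
    static final class IndexKey {
        final int bucketId;
        final String fileName;
        IndexKey(int bucketId, String fileName) {
            this.bucketId = bucketId;
            this.fileName = fileName;
        }
    }

    /** Mimics a resolver that routes only on the bucket id part of the key. */
    static Object routingObject(IndexKey key) {
        return key.bucketId;
    }

    /** Assumed scheme: every event of a bucket goes to the same dispatcher thread. */
    static int dispatcherFor(int bucketId, int dispatcherThreads) {
        return Math.floorMod(bucketId, dispatcherThreads);
    }

    /** Lists which buckets each dispatcher thread would own. */
    static Map<Integer, List<Integer>> assign(int totalBuckets, int dispatcherThreads) {
        Map<Integer, List<Integer>> bucketsByDispatcher = new TreeMap<>();
        for (int bucketId = 0; bucketId < totalBuckets; bucketId++) {
            bucketsByDispatcher
                    .computeIfAbsent(dispatcherFor(bucketId, dispatcherThreads), d -> new ArrayList<>())
                    .add(bucketId);
        }
        return bucketsByDispatcher;
    }

    public static void main(String[] args) {
        System.out.println(routingObject(new IndexKey(3, "segments_1")));
        System.out.println(assign(6, 2));
    }
}
```

Because a bucket never appears under two dispatchers, only one thread writes to that bucket's RegionDirectory, which is the property step 9 uses to avoid locking.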

Index Maintenance

A LuceneIndex can be created and destroyed. For now, we don't support creating an index on a region that already contains data.

Handling failures, restarts, and rebalance

The index regions and the async event queue will be restored together with their colocated data region's buckets, so during failover the new primary should be able to read and write the index as usual.

Aggregation

Storage with different region types

PersistentRegions