...

 

 

PlantUML
() "User"
node "Colocated and Replicated" {
  () User --> [User Region] : Puts
  [User Region] --> [Async Queue]
  [Async Queue] --> [Lucene Indexer] : Batch Writes
  [Lucene Indexer] --> [GeodeFSDirectory]
  [GeodeFSDirectory] --> [Lucene Regions]
}

 

 

Partitioned region data flow

PlantUML
() User -down-> [Cache] : PUTs
node cluster {
  database {
    () "indexBucket1Primary"
  }

  database {
    () "indexBucket1Secondary"
  }

  [Cache] ..> [Bucket 1]
  [Bucket 1] -down-> [Async Queue Bucket 1]
  [Async Queue Bucket 1] -down-> [FSDirectoryBucket1] : Batch Write
  [FSDirectoryBucket1] -> indexBucket1Primary
  indexBucket1Primary -right-> indexBucket1Secondary

  database {
    () "indexBucket2Primary"
  }

  database {
    () "indexBucket2Secondary"
  }

  [Cache] ..> [Bucket 2]
  [Bucket 2] -down-> [Async Queue Bucket 2]
  [Async Queue Bucket 2] -down-> [FSDirectoryBucket2] : Batch Write
  [FSDirectoryBucket2] -> indexBucket2Primary
  indexBucket2Primary -right-> indexBucket2Secondary
}
 


In a partitioned region, every bucket in the region will have its own GeodeDirectory to store the Lucene indexes. The GeodeDirectory implements a file system using two regions:
FileRegion : holds the metadata about index files
ChunkRegion : holds the actual data chunks for a given index file
The FileRegion and ChunkRegion will be colocated with the data region that is to be indexed. The GeodeFSDirectory will use keys that contain the bucket id for the file metadata and chunks, and the FileRegion and ChunkRegion will have a PartitionResolver that looks only at the bucket-id part of the key.
In the AsyncEventListener, when a data entry is processed:
1) Determine the bucket id of the entry.
2) Get the GeodeDirectory for that bucket and do the indexing operation into that instance.
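The two listener steps above can be sketched in plain Java as follows. This is a minimal illustration only: the class and method names (`BucketIndexerSketch`, `bucketIdOf`, `processEvents`) are hypothetical, the bucket assignment is a simple hash stand-in, and the real Geode `AsyncEventListener` API and bucket computation differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketIndexerSketch {
    // One stand-in "GeodeDirectory" per bucket id, created lazily.
    static final Map<Integer, List<String>> directories = new HashMap<>();

    // Step 1: determine the bucket id of an entry from its key
    // (simple hash here; Geode's real bucket assignment is different).
    static int bucketIdOf(Object key, int totalBuckets) {
        return Math.abs(key.hashCode()) % totalBuckets;
    }

    // Step 2: get the directory for that bucket and index into that instance.
    static void processEvents(List<String> keys, int totalBuckets) {
        for (String key : keys) {
            int bucketId = bucketIdOf(key, totalBuckets);
            directories.computeIfAbsent(bucketId, b -> new ArrayList<>())
                       .add(key); // "index" the entry into this bucket's directory
        }
    }

    public static void main(String[] args) {
        processEvents(Arrays.asList("key1", "key2", "key3"), 4);
        System.out.println(directories);
    }
}
```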

...

 

Walkthrough of creating an index in a Geode region

  
1) Create a LuceneIndex object to hold the data structures that will be created in the following steps. This object will later be registered with the cache-owned LuceneService.
 
2) Assume the dataregion is a PartitionedRegion (otherwise there is no need to define a PartitionResolver). Create a FileRegion (call it "fr") and a ChunkRegion (call it "cr"), colocated with the data region (call it "dataregion"). The FileRegion and ChunkRegion use the same region attributes as dataregion. If the index regions are persistent, use the dataregion bucket's name as the path to persist the index region. For example, if the dataregion's bucket name is /root/_P_BUCKET_1, the path will be _B_dataregion_21 (dataregion's bucket 21).
 
3) Create a GeodeDirectory object using the FileRegion, the ChunkRegion, and the path obtained in the previous step.
 
4) Create a PerFieldAnalyzerWrapper and save the fields in the LuceneIndex.
 
5) Create Lucene's IndexWriterConfig object using the Analyzer.
 
6) Create Lucene's IndexWriter object using the GeodeDirectory and the IndexWriterConfig object.
 
7) Define a PartitionResolver that uses the dataregion's bucket id as the routing object. This guarantees an index bucket will get the same bucket id as the corresponding dataregion bucket, even when the dataregion has its own custom PartitionResolver. We don't need to define a PartitionResolver on the dataregion.
 
8) Define an AEQ with multiple dispatcher threads and order-policy=partition. This groups events by bucket id into different dispatcher queues. Each dispatcher thread will call our AEQ listener to process events for one or more buckets. Each event will be processed into a Document and written into the ChunkRegion via the GeodeDirectory. We don't need a lock on the GeodeDirectory, since only one thread processes a given bucket's events.
 
9) If the dataregion is a replicated region, define the AEQ with a single dispatcher thread.
 
10) Register the newly created LuceneIndex with the LuceneService. The registration step will also publish the metadata into the "lucene_meta_region", a persistent replicated region, so other JVMs will know a new LuceneIndex with this metadata was created. All members should then have a LuceneService instance with the same LuceneIndex definition.
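Steps 2 and 7 of the walkthrough can be sketched as follows. The key shape (`IndexKey`) and the resolver signature are illustrative only; the real Geode `PartitionResolver` interface operates on an `EntryOperation`, and the actual index-key layout is not specified here.

```java
public class BucketIdResolverSketch {
    // Hypothetical index-region key: carries the data region's bucket id
    // plus a file/chunk name, as described in step 2.
    static final class IndexKey {
        final int bucketId;
        final String name;
        IndexKey(int bucketId, String name) {
            this.bucketId = bucketId;
            this.name = name;
        }
    }

    // Step 7: the routing object is just the bucket-id part of the key, so an
    // index entry is always colocated with the data bucket it was built from.
    static Object getRoutingObject(IndexKey key) {
        return key.bucketId;
    }

    public static void main(String[] args) {
        IndexKey k = new IndexKey(21, "segments_1");
        System.out.println(getRoutingObject(k)); // routes on bucket 21
    }
}
```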

Processing Queries
 


Partitioned regions

In the case of partitioned regions, the query must be sent out to all of the primaries. The results will then need to be aggregated back together. We are still investigating options for how to aggregate the data, see Text / Lucene Search.
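One plausible shape for that aggregation, sketched in plain Java: each primary returns its local scored hits, and the query coordinator merges them into a single global top-N by score. This is only an illustration under assumed names (`Hit`, `mergeTopN`); the actual aggregation design is still under investigation, as noted above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopHitsMergeSketch {
    // A scored hit from one primary's local Lucene search (illustrative type).
    static final class Hit {
        final String key;
        final float score;
        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    // Merge the per-primary result lists into a single global top-N by score.
    static List<Hit> mergeTopN(List<List<Hit>> perPrimary, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perPrimary) all.addAll(hits);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all.subList(0, Math.min(n, all.size()));
    }

    public static void main(String[] args) {
        List<List<Hit>> perPrimary = List.of(
                List.of(new Hit("a", 0.9f), new Hit("b", 0.4f)),
                List.of(new Hit("c", 0.7f)));
        for (Hit h : mergeTopN(perPrimary, 2)) {
            System.out.println(h.key + " " + h.score);
        }
    }
}
```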


Replicated regions

TBD

 

Result collection and paging

The ResultSet will support a pagination mechanism for retrieving the results. All the keys are aggregated at the query executor node (client or peer), and getAll is used to fetch the values according to the page size.
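A minimal sketch of that paging scheme, assuming the keys have already been aggregated at the executor node. `fetchPage` stands in for the region's `getAll` call; the names here are illustrative, not the actual ResultSet API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResultPagingSketch {
    // Given the full ordered key list, return only the keys of one page.
    static List<String> pageKeys(List<String> allKeys, int pageSize, int pageIndex) {
        int from = pageIndex * pageSize;
        if (from >= allKeys.size()) return Collections.emptyList();
        return allKeys.subList(from, Math.min(from + pageSize, allKeys.size()));
    }

    // Stand-in for region.getAll(keys): fetch values for just this page's keys.
    static Map<String, String> fetchPage(Map<String, String> region, List<String> keys) {
        Map<String, String> page = new LinkedHashMap<>();
        for (String k : keys) page.put(k, region.get(k));
        return page;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("k1", "k2", "k3", "k4", "k5");
        System.out.println(pageKeys(keys, 2, 1)); // second page of size 2
    }
}
```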