...
Allow users to create Lucene indexes on data stored in Geode
- Update the indexes asynchronously to avoid impacting write latency
Allow users to perform text (Lucene) search on Geode data using the Lucene index. Results from text searches may be stale due to asynchronous index updates.
Provide highly available indexes using Geode's HA capabilities
Provide high-throughput indexing and querying by partitioning index data to match the data partitioning in Geode (scalability)
- Performance comparable to RAMFSDirectory
Out of Scope
Building the next/better Solr/Elasticsearch.
Enhancing the current Geode OQL to use the Lucene index.
...
- A region and a list of to-be-indexed fields
- [ Optional ] A standard analyzer or a custom analyzer implementation to be used with all the fields in an index
- [ Optional ] Field types. A string can be Text or String in Lucene; the two have different behavior
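To see why the Text/String distinction matters, here is a plain-Java simulation of the two behaviors. This is not Lucene code (Lucene's TextField/StringField behave analogously), and the field values are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Simulation of Lucene's two string treatments. A "String" field is indexed
// as a single, unanalyzed token; a "Text" field is tokenized, so individual
// words are searchable.
public class TextVsString {
    // "String" treatment: the whole value is one token; only exact matches hit.
    static boolean stringMatch(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    // "Text" treatment: the value is split into tokens, so single words hit.
    static boolean textMatch(String fieldValue, String query) {
        List<String> tokens = Arrays.asList(fieldValue.toLowerCase().split("\\s+"));
        return tokens.contains(query.toLowerCase());
    }

    public static void main(String[] args) {
        String value = "Acme Pharma"; // hypothetical manufacturer value
        System.out.println(stringMatch(value, "Acme")); // false: not the exact value
        System.out.println(textMatch(value, "Acme"));   // true: "acme" is a token
    }
}
```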
...
- A single index will not support multiple regions; join queries between regions are not supported
- Heterogeneous objects in a single region will be supported
- Only top level fields and nested objects can be indexed, not nested collections
- The index needs to be created before adding the data (for phase 1)
- Pagination of results will be supported
Users will interact with a new LuceneService interface, which provides methods for creating and querying indexes. Users can also create indexes through gfsh or cache.xml.
Java API
LuceneService
Code Block
/**
 * Create a Lucene index using the default analyzer.
 */
public LuceneIndex createIndex(String indexName, String regionName, String... fields);

/**
 * Create a Lucene index using a specified analyzer per field.
 */
public LuceneIndex createIndex(String indexName, String regionName, Map<String, Analyzer> analyzerPerField);

public void destroyIndex(LuceneIndex index);

public LuceneIndex getIndex(String indexName, String regionName);

public Collection<LuceneIndex> getAllIndexes();

/**
 * Get a factory for building queries.
 */
public LuceneQueryFactory createLuceneQueryFactory();
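For illustration, end-to-end use of the proposed service might look like the sketch below. This is not runnable code: LuceneService is the interface proposed above, and `LuceneServiceProvider`, `setLimit`, `create`, and the shape of `LuceneQuery` are assumed names, since the query-side API is not spelled out in this section.

```java
// Hypothetical usage of the proposed API; the query-factory method names
// are assumptions, not part of the spec above.
LuceneService service = LuceneServiceProvider.get(cache);

// The index must exist before data is added (phase 1 requirement).
LuceneIndex index = service.createIndex("drugIndex", "drugs", "sideEffects", "manufacturer");

// Later: build and run a text query against the index.
LuceneQuery query = service.createLuceneQueryFactory()
    .setLimit(100)
    .create("drugIndex", "drugs", "sideEffects:headache");
```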
...
Gfsh API
Code Block
// Create index
gfsh> create lucene-index --name=indexName --region=/drugs --fields=sideEffects,manufacturer

// Destroy index
gfsh> destroy lucene-index --name=indexName --region=/drugs

// Execute Lucene query
gfsh> luceneQuery --regionName=/drugs --queryStrings="" --limit=100 --page-size=10
XML Configuration
Code Block
<region name="region">
  <lucene-index indexName="luceneIndex">
    <FieldDefinition name="fieldName" analyzer="KeywordAnalyzer"/>
  </lucene-index>
</region>
REST API
TBD
Spring Data GemFire Support
TBD - But the Searchable annotation described in this blog might be a good place to start.
Implementation
Index Storage
...
PlantUML
[Lucene Indexer] --> [GeodeFSDirectory]
() "User"
node "Colocated and Replicated" {
  () User --> [User Region] : Puts
  [User Region] --> [Async Queue]
  [Async Queue] --> [Lucene Indexer] : Batch Writes
  [GeodeFSDirectory] --> [Lucene Regions]
}
Partitioned region data flow
PlantUML
() User -down-> [Cache] : PUTs
node cluster {
  database {
    () "indexBucket1Primary"
  }
  database {
    () "indexBucket1Secondary"
  }
  [Cache] ..> [Bucket 1]
  [Bucket 1] -down-> [Async Queue Bucket 1]
  [Async Queue Bucket 1] -down-> [FSDirectoryBucket1] : Batch Write
  [FSDirectoryBucket1] -> indexBucket1Primary
  indexBucket1Primary -right-> indexBucket1Secondary
  database {
    () "indexBucket2Primary"
  }
  database {
    () "indexBucket2Secondary"
  }
  [Cache] ..> [Bucket 2]
  [Bucket 2] -down-> [Async Queue Bucket 2]
  [Async Queue Bucket 2] -down-> [FSDirectoryBucket2] : Batch Write
  [FSDirectoryBucket2] -> indexBucket2Primary
  indexBucket2Primary -right-> indexBucket2Secondary
}
In a partitioned region, every bucket in the region will have its own GeodeDirectory to store the Lucene indexes. The GeodeDirectory implements a file system using two regions:
- FileRegion : holds the metadata about index files
- ChunkRegion : holds the actual data chunks for a given index file.
The FileRegion and ChunkRegion will be colocated with the data region that is to be indexed. The GeodeFSDirectory keys for file metadata and chunks will contain the bucket id, and the FileRegion and ChunkRegion will have a partition resolver that looks only at the bucket id part of the key.
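A minimal self-contained sketch of this idea follows: a chunk key that embeds the bucket id of the data it indexes, routing that looks only at that bucket id, and splitting a file into chunk-sized pieces. `ChunkKey`, `routingObject`, and the chunk size are hypothetical stand-ins, not Geode's PartitionResolver API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: ChunkRegion keys embed the bucket id of the data they
// index, and routing uses only that bucket id, so index chunks stay colocated
// with their data bucket. All names here are illustrative, not Geode APIs.
public class DirectoryRoutingSketch {
    static final int CHUNK_SIZE = 4; // tiny for demonstration; real chunks would be far larger

    static final class ChunkKey {
        final String fileName;
        final int chunkId;
        final int bucketId; // bucket of the colocated data region
        ChunkKey(String fileName, int chunkId, int bucketId) {
            this.fileName = fileName;
            this.chunkId = chunkId;
            this.bucketId = bucketId;
        }
    }

    // Mirrors the role of a partition resolver: route on the bucket id only.
    static Object routingObject(ChunkKey key) {
        return key.bucketId;
    }

    // Split an index file's bytes into ChunkRegion-sized pieces.
    static List<byte[]> toChunks(byte[] file) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < file.length; off += CHUNK_SIZE) {
            chunks.add(Arrays.copyOfRange(file, off, Math.min(off + CHUNK_SIZE, file.length)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] segmentFile = new byte[10];
        List<byte[]> chunks = toChunks(segmentFile);
        System.out.println(chunks.size()); // 10 bytes in 4-byte chunks -> 3 chunks
        // Metadata and every chunk key for bucket 7 share one routing object,
        // so they all land in the same bucket as the indexed data.
        ChunkKey meta = new ChunkKey("_0.cfs", 0, 7);
        ChunkKey chunk = new ChunkKey("_0.cfs", 2, 7);
        System.out.println(routingObject(meta).equals(routingObject(chunk))); // true
    }
}
```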
In the AsyncEventListener, when a data entry is processed:
- Determine the bucket id of the entry.
- Get the GeodeDirectory for that bucket and do the indexing operation on that instance.
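The two listener steps above can be sketched as follows. This is an illustrative stand-in, not Geode's AsyncEventListener API: `perBucketIndex` plays the role of one GeodeDirectory-backed Lucene writer per bucket, and `bucketIdOf` approximates Geode's key-to-bucket mapping with a hash modulo the bucket count:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the listener steps: find the entry's bucket, then
// index into that bucket's own directory instance.
public class ListenerSketch {
    // Stand-in for one GeodeDirectory-backed Lucene writer per bucket.
    static final Map<Integer, List<String>> perBucketIndex = new HashMap<>();

    // Stand-in for Geode's key-to-bucket mapping (hash modulo bucket count).
    static int bucketIdOf(Object key, int totalBuckets) {
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    static void processEvent(Object key, String value, int totalBuckets) {
        int bucketId = bucketIdOf(key, totalBuckets);           // step 1: bucket of the entry
        perBucketIndex
            .computeIfAbsent(bucketId, b -> new ArrayList<>())  // step 2: directory for that bucket
            .add(value);                                        // index into that instance
    }

    public static void main(String[] args) {
        processEvent("drug-1", "aspirin: headache relief", 113);
        processEvent("drug-1", "aspirin: updated entry", 113);
        // Events for the same key always reach the same bucket's index.
        System.out.println(perBucketIndex.size()); // 1
    }
}
```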
...