Please refer to the Geode 1.2.0 documentation for the final implementation.

Work in Progress

Requirements

Allow user to create Lucene Indexes on data stored in Geode
Update the indexes asynchronously to avoid impacting write latency
Allow user to perform text (Lucene) search on Geode data using the Lucene index. Results from the text searches may be stale due to asynchronous index updates.
Provide high availability of indexes using Geode's HA capabilities
Provide high throughput indexing and querying by partitioning index data to match partitioning in Geode
Geode's HA capabilities
Scalability
Performance comparable to RAMFSDirectory

Out of Scope

Building next/better Solr/Elasticsearch.
Enhancing the current Geode OQL to use Lucene index.

Terminology

Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
Fields: A Document consists of one or more Fields. A Field is simply a name-value pair.
Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
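
To make the terms concrete, here is a toy model in plain Java. It is not Lucene: the class `ToyIndex` and its methods are hypothetical stand-ins, and the whitespace tokenizer is an assumption chosen for brevity. It only shows the relationship between documents, fields, indexing, and searching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: real Lucene Documents, IndexWriter and IndexSearcher are far richer.
class ToyIndex {
  // A document is a set of fields; a field is a name-value pair.
  private final List<Map<String, String>> documents = new ArrayList<>();
  // Inverted index: "field:term" -> ids of the documents containing that term.
  private final Map<String, Set<Integer>> postings = new HashMap<>();

  // Indexing: store the document and record each lower-cased, whitespace-separated term.
  int addDocument(Map<String, String> fields) {
    int docId = documents.size();
    documents.add(new LinkedHashMap<>(fields));
    for (Map.Entry<String, String> field : fields.entrySet()) {
      for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
        if (!term.isEmpty()) {
          postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(docId);
        }
      }
    }
    return docId;
  }

  // Searching: return the documents whose given field contains the term.
  List<Map<String, String>> search(String fieldName, String term) {
    List<Map<String, String>> hits = new ArrayList<>();
    Set<Integer> ids = postings.get(fieldName + ":" + term.toLowerCase(Locale.ROOT));
    if (ids != null) {
      for (int id : ids) {
        hits.add(documents.get(id));
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    ToyIndex index = new ToyIndex();
    index.addDocument(Map.of("name", "aspirin", "sideEffects", "nausea dizziness"));
    index.addDocument(Map.of("name", "ibuprofen", "sideEffects", "nausea"));
    System.out.println(index.search("sideEffects", "nausea"));
  }
}
```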

API

User Input

A region and list of to-be-indexed fields
[ Optional ] An Analyzer for specific fields. If none is specified, the Standard Analyzer (or its implementation) is used for all the fields in an index
[ Optional ] Field types. A string can be Text or String in Lucene. The two have different behavior

Key points

A single index will not support multiple regions. Join queries between regions are not supported
Heterogeneous objects in single region will be supported
Only top level fields of nested objects can be indexed, not nested collections
The index needs to be created before the data region is created (for phase 1)
Pagination of results will be supported

Users will interact with a new LuceneService interface, which provides methods for creating indexes and querying indexes. Users can also create indexes through gfsh or cache.xml.

Java API

Now that this feature has been implemented, please refer to the javadocs for details on the Java API.

Examples

Code Block
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

LuceneService

Code Block

/**
 * Create a lucene index using default analyzer.
 *
 * @param indexName
 * @param regionName
 * @param fields
 * @return LuceneIndex object
 */
public LuceneIndex createIndex(String indexName, String regionName, String... fields);

/**
 * Create a lucene index using specified analyzer per field
 *
 * @param indexName index name
 * @param regionName region name
 * @param analyzerPerField analyzer per field map
 * @return LuceneIndex object
 *
 */
public LuceneIndex createIndex(String indexName, String regionName, Map<String, Analyzer> analyzerPerField);

/**
 * Destroy the lucene index
 *
 * @param index index object
 */
public void destroyIndex(LuceneIndex index);

/**
 * Get the lucene index object specified by region name and index name
 * @param indexName index name
 * @param regionName region name
 * @return LuceneIndex object
 */
public LuceneIndex getIndex(String indexName, String regionName);

/**
 * get all the lucene indexes.
 * @return all index objects in a Collection
 */
public Collection<LuceneIndex> getAllIndexes();

/**
 * create LuceneQueryFactory
 * @return LuceneQueryFactory object
 */
public LuceneQueryFactory createLuceneQueryFactory();

LuceneQueryFactory

Code Block

public enum ResultType {
    /**
     *  Query results only contain value, which is the default setting.
     *  If field projection is specified, use projected fields' values instead of whole domain object
     */
    VALUE,
    
    /**
     * Query results contain score
     */
    SCORE,
    
    /**
     * Query results contain key
     */
    KEY
  };
  /**
   * Set page size for a query result. The default page size is 0, which means no pagination.
   * If a negative value is specified, throw IllegalArgumentException
   * @param pageSize
   * @return itself
   */
  LuceneQueryFactory setPageSize(int pageSize);
  
  /**
   * Set max limit of results for a query.
   * If the specified limit is less than or equal to zero, throw IllegalArgumentException
   * @param limit
   * @return itself
   */
  LuceneQueryFactory setResultLimit(int limit);
  
  /**
   * Set whether to include SCORE, KEY in result
   * 
   * @param resultTypes
   * @return itself
   */
  LuceneQueryFactory setResultTypes(ResultType... resultTypes);
  
  /**
   * Set a list of fields for result projection.
   * 
   * @param fieldNames
   * @return itself
   */
  LuceneQueryFactory setProjectionFields(String... fieldNames);
  
  /**
   * Create wrapper object for lucene's QueryParser object.
   * The queryString uses lucene QueryParser's syntax, which is designed to be easy to use
   * and human-readable.
   *  
   * @param regionName region name
   * @param indexName index name
   * @param queryString query string in lucene QueryParser's syntax
   * @param analyzer lucene Analyzer to parse the queryString
   * @return LuceneQuery object
   * @throws ParseException
   */
  public LuceneQuery create(String indexName, String regionName, String queryString, 
      Analyzer analyzer) throws ParseException;
  
  /**
   * Create wrapper object for lucene's QueryParser object using default standard analyzer.
   * The queryString uses lucene QueryParser's syntax, which is designed to be easy to use
   * and human-readable.
   *  
   * @param regionName region name
   * @param indexName index name
   * @param queryString query string in lucene QueryParser's syntax
   * @return LuceneQuery object
   * @throws ParseException
   */
  public LuceneQuery create(String indexName, String regionName, String queryString) 
      throws ParseException;
  
  /**
   * Create wrapper object for lucene's Query object.
   * Advanced lucene users can customize their own Query object and use it directly in this API.
   * 
   * @param regionName region name
   * @param indexName index name
   * @param query lucene Query object
   * @return LuceneQuery object
   */
  public LuceneQuery create(String indexName, String regionName, Query query);

LuceneQuery

Code Block

/**
 * Provides a wrapper object for Lucene's Query object and executes the search.
 * <p>Instances of this interface are created using
 * {@link LuceneQueryFactory#create}.
 * 
 */
public interface LuceneQuery {
  /**
   * Execute the search and get results. 
   */
  public LuceneQueryResults<?> search();
  
  /**
   * Get page size setting of current query. 
   */
  public int getPageSize();
  
  /**
   * Get limit size setting of current query. 
   */
  public int getLimit();
  /**
   * Get result types setting of current query. 
   */
  public ResultType[] getResultTypes();
  
  /**
   * Get projected fields setting of current query. 
   */
  public String[] getProjectedFieldNames();
}

LuceneResultStruct

Code Block

/**
 * <p>
 * Abstract data structure for one item in query result.
 * 
 * @author Xiaojian Zhou
 * @since 8.5
 */
public interface LuceneResultStruct {
  /**
   * Return the value associated with the given field name
   *
   * @param fieldName the String name of the field
   * @return the value associated with the specified field
   * @throws IllegalArgumentException If this struct does not have a field named fieldName
   */
  public Object getProjectedField(String fieldName);
  
  /**
   * Return key of the entry
   *
   * @return key
   * @throws IllegalArgumentException If this struct does not contain key
   */
  public Object getKey();
  
  /**
   * Return value of the entry
   *
   * @return value the whole domain object
   * @throws IllegalArgumentException If this struct does not contain value
   */
  public Object getValue();
  
  /**
   * Return score of the query 
   *
   * @return score
   * @throws IllegalArgumentException If this struct does not contain score
   */
  public Double getScore();
  
  /**
   * Get the ordered list of result value types.
   * Each item in the list is either a ResultType or a field name.
   * @return the array of result types
   */
  public Object[] getNames();
  
  /**
   * Get the values in same order as result types
   * @return the array of values
   */
  public Object[] getResultValues();
}

Examples of using the APIs:

Code Block

// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create index on fields with the default analyzer
LuceneIndex index = luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

// Alternatively, create the index on fields with a specified analyzer per field
LuceneIndex indexWithAnalyzers = luceneService.createIndex(indexName, regionName, analyzerPerField);

// Create Query
LuceneQuery query = luceneService.createLuceneQueryFactory().setResultLimit(200).setPageSize(20)
  .setResultTypes(SCORE, VALUE, KEY).setProjectionFields("field1", "field2")
  .create(indexName, regionName, querystring, analyzer);

// Search using Query
LuceneQueryResults results = query.search();

// Without pagination (page size 0), a single call returns all results in one page:
// List values = results.getNextPage();

// Pagination
while (results.hasNextPage()) {
  List<LuceneResultStruct> page = results.getNextPage(); // return result page by page

  for (LuceneResultStruct r : page) {
    System.out.println(r.getValue());
  }
}

Gfsh API

Code Block

// Create Index
gfsh> create lucene-index --name=indexName --region=/drugs --fields=sideEffects,manufacturer

// Destroy Index
gfsh> destroy lucene-index --name=indexName --region=/drugs

// Execute Lucene query
gfsh> luceneQuery --regionName=/drugs --queryStrings="" --limit=100 --page-size=10

Code Block

// Create Index on fields with default analyzer:
luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

// create index on fields with specified analyzer:
Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
analyzerPerField.put("field1", new StandardAnalyzer());
analyzerPerField.put("field2", new KeywordAnalyzer());
luceneService.createIndex(indexName, regionName, analyzerPerField);

// Create the region after the index
Region region = cache.createRegionFactory(RegionShortcut.PARTITION).create(regionName);

// Create Query
LuceneQuery query = luceneService.createLuceneQueryFactory().setLimit(200).setPageSize(20)
  .create(indexName, regionName, querystring, "field1" /* default field */);

// Search using Query
PageableLuceneQueryResults<K,Object> results = query.findPages();

// Pagination
while (results.hasNext()) {
  results.next().stream().forEach(struct -> {
    Object value = struct.getValue();
    System.out.println("Key is "+struct.getKey()+", value is "+value);
  });
}

Gfsh API

Code Block

// List Index
gfsh> list lucene indexes [with-stats]
// Create Index
gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags

// Create Index
gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags --analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.analysis.bg.BulgarianAnalyzer

// Execute Lucene query
gfsh> search lucene --regionName=/orders --queryStrings="John*" --defaultField=field1 --limit=100

XML Configuration

Code Block

<cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:lucene="http://geode.apache.org/schema/lucene"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache
        http://geode.apache.org/schema/cache/cache-1.0.xsd
        http://geode.apache.org/schema/lucene
        http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
    version="1.0">

    <region name="region" refid="PARTITION">
        <lucene:index name="index">
          <lucene:field name="a" analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
          <lucene:field name="b" analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
          <lucene:field name="c" analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
        </lucene:index>
    </region>
</cache>

REST API

TBD - But using Solr to provide a REST API might make a lot of sense

Spring Data GemFire Support

TBD - But the Searchable annotation described in this blog might be a good place to start.

Implementation Flowchart

PlantUML

[LuceneIndex] --> [RegionDirectory]
() "User"
node "Colocated PR or Replicated Region" {
  () User --> [User Data Region] : Puts
  [User Data Region] --> [Async Queue]
  [Async Queue] --> [LuceneIndex] : Batch Writes
  [RegionDirectory] --> [Lucene Regions]
}

Inside LuceneIndex

PlantUML
node "LuceneIndex" {
  [Reflective fields]
  [AEQ listener]
  [RegionDirectory array (one per bucket)]
  [Query objects]
}

A closer look at Partitioned region data flow

PlantUML

() User -down-> [User Data Region] : PUTs

[User Data Region] ..> [Bucket 1]
 [Bucket 1] -down-> [Async Queue Bucket 1]
node LuceneIndex {
[Async Queue Bucket 1] -down-> [AEQ listener processes events into index documents]:Batch Write
[AEQ listener processes events into index documents] -down-> [RegionDirectory1]
[RegionDirectory1] -down-> [file region bucket 1]

[file region bucket 1] -down-> [chunk region bucket 1]
}
 
[User Data Region] ..> [Bucket 2]
 [Bucket 2] -down-> [Async Queue Bucket 2]
node LuceneIndex {
[Async Queue Bucket 2] -down-> [AEQ listener processes events into index documents]:Batch Write
[AEQ listener processes events into index documents] -down-> [RegionDirectory2]
[RegionDirectory2] -down-> [file region bucket 2]
[file region bucket 2] -down-> [chunk region bucket 2]
}

Processing Queries

PlantUML

() User -down-> [LuceneQuery] : fields, Analyzer, query strings, or Query
[LuceneQuery] -down-> [User Data Region]: call search()
[User Data Region] -down-> [Function Execution]
[Function Execution] -down-> [Bucket 1]
[Bucket 1] -down-> [RegionDirectory for bucket 1]
[RegionDirectory for bucket 1] ..> [Bucket 1] : TopDocs, ScoreDocs
[Bucket 1] ..> [Function Execution] : score, key

[Function Execution] -down-> [Bucket 2]
[Bucket 2] -down-> [RegionDirectory for bucket 2]
[RegionDirectory for bucket 2] ..> [Bucket 2] : TopDocs, ScoreDocs
[Bucket 2] ..> [Function Execution] : score, key

Implementation Details

Index Storage

The Lucene indexes will be stored in memory instead of on disk. This will be done by implementing a Lucene Directory called RegionDirectory which uses Geode as a flat file system. This way we get all the benefits offered by Geode, and we can achieve replication and sharding of the indexes. In the HA case, the Lucene indexes will be colocated with the data region.

A LuceneIndex object will be created for each index to manage all the attributes related to the index, such as reflective fields, the AEQ listener, the RegionDirectory array, search, etc.

If the user's data region is a partitioned region, there will be one LuceneIndex for the partitioned region. Every bucket in the data region will have its own RegionDirectory (which implements Lucene's Directory interface) that keeps the file system for the index regions. The index regions consist of 2 regions:

FileRegion: holds the metadata about index files
ChunkRegion: holds the actual data chunks for a given index file

The FileRegion and ChunkRegion will be colocated with the data region which is to be indexed. The FileRegion and ChunkRegion will have a partition resolver that looks at the bucket id part of the key only.
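
The split between file metadata and data chunks can be sketched as follows. This is only an illustration under stated assumptions: plain maps stand in for the two regions, and the class name, key format, and chunk size are hypothetical rather than taken from the implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the FileRegion and ChunkRegion.
class RegionFileSystemSketch {
  static final int CHUNK_SIZE = 4; // tiny on purpose; an assumption for the demo

  // FileRegion stand-in: file name -> metadata (here only the length in bytes).
  private final Map<String, Integer> fileRegion = new HashMap<>();
  // ChunkRegion stand-in: "fileName#chunkIndex" -> chunk bytes.
  private final Map<String, byte[]> chunkRegion = new HashMap<>();

  // Split the file content into fixed-size chunks, then record its metadata.
  void writeFile(String name, byte[] data) {
    for (int offset = 0, chunk = 0; offset < data.length; offset += CHUNK_SIZE, chunk++) {
      int end = Math.min(offset + CHUNK_SIZE, data.length);
      chunkRegion.put(name + "#" + chunk, Arrays.copyOfRange(data, offset, end));
    }
    fileRegion.put(name, data.length);
  }

  // Look up the metadata first, then reassemble the content chunk by chunk.
  byte[] readFile(String name) {
    Integer length = fileRegion.get(name);
    if (length == null) {
      throw new IllegalArgumentException("No such file: " + name);
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream(length);
    int chunks = (length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    for (int chunk = 0; chunk < chunks; chunk++) {
      byte[] part = chunkRegion.get(name + "#" + chunk);
      out.write(part, 0, part.length);
    }
    return out.toByteArray();
  }

  int chunkCount() {
    return chunkRegion.size();
  }

  public static void main(String[] args) {
    RegionFileSystemSketch fs = new RegionFileSystemSketch();
    fs.writeFile("_0.cfs", new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
    System.out.println(fs.chunkCount() + " chunks, " + fs.readFile("_0.cfs").length + " bytes");
  }
}
```

The sketch does not handle overwriting a file with shorter content, which is acceptable here because Lucene writes each index file once.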

An AsyncEventQueue will be used to update the LuceneIndex. The AsyncEventListener will process the events in the AEQ in batches. When a data entry is processed, the listener will:

Create a document for the indexed fields. Indexed field values are obtained from the AsyncEvent through reflection (for a domain object) or through the PdxInstance interface (for PDX or JSON). A Lucene document object is then constructed and added to the LuceneIndex associated with that region.
Determine the bucket id of the entry.
Get the RegionDirectory for that bucket and save the document into the RegionDirectory.
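
The three steps above can be sketched in plain Java. Everything here is a hypothetical stand-in: a map of lists plays the role of the per-bucket RegionDirectory, a map of key-value pairs plays the role of a batch of AsyncEvents, and the bucket id is assumed to be the key's hash modulo a fixed bucket count. Only the reflection path for domain objects is shown.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the listener's per-event processing; not the Geode API.
class IndexingListenerSketch {
  static final int TOTAL_BUCKETS = 4; // assumption for the demo

  // Stand-in for the per-bucket RegionDirectory: bucket id -> documents written so far.
  final Map<Integer, List<Map<String, Object>>> directories = new HashMap<>();
  private final String[] indexedFields;

  IndexingListenerSketch(String... indexedFields) {
    this.indexedFields = indexedFields;
  }

  // Step 1: build a document from the indexed fields, read through reflection.
  // Only fields declared directly on the value's class are considered.
  Map<String, Object> toDocument(Object value) {
    Map<String, Object> document = new LinkedHashMap<>();
    for (String name : indexedFields) {
      try {
        Field field = value.getClass().getDeclaredField(name);
        field.setAccessible(true);
        document.put(name, field.get(value));
      } catch (NoSuchFieldException e) {
        // Heterogeneous objects: skip indexed fields this type does not have.
      } catch (IllegalAccessException e) {
        throw new IllegalStateException(e);
      }
    }
    return document;
  }

  // Step 2: determine the bucket id of the entry (assumed hashing scheme).
  static int bucketIdFor(Object key) {
    return Math.floorMod(key.hashCode(), TOTAL_BUCKETS);
  }

  // Step 3: save each document into the directory of its bucket.
  void processEvents(Map<?, ?> batch) {
    for (Map.Entry<?, ?> event : batch.entrySet()) {
      Map<String, Object> document = toDocument(event.getValue());
      int bucketId = bucketIdFor(event.getKey());
      directories.computeIfAbsent(bucketId, id -> new ArrayList<>()).add(document);
    }
  }

  // Example domain object.
  static class Drug {
    private final String name;
    private final String manufacturer;

    Drug(String name, String manufacturer) {
      this.name = name;
      this.manufacturer = manufacturer;
    }
  }

  public static void main(String[] args) {
    IndexingListenerSketch listener = new IndexingListenerSketch("name", "manufacturer");
    Map<Integer, Object> batch = new LinkedHashMap<>();
    batch.put(1, new Drug("aspirin", "Acme"));
    batch.put(2, new Drug("ibuprofen", "Initech"));
    listener.processEvents(batch);
    System.out.println(listener.directories);
  }
}
```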

Storage with different region types

PersistentRegions

The Lucene Index will be persisted.

OverflowRegions

The Lucene Index will not be overflowed. The rationale here is that the Lucene index will be much smaller than the data size, so it is not necessary to overflow the index.

EmptyRegions

The Lucene Index is not supported.

OffHeapRegions

The Lucene index will be stored off-heap.

Walkthrough creating index in Geode region

1) Create a LuceneIndex object to hold the data structures that will be created in following steps. This object will be registered to cache owned LuceneService later.

2) LuceneIndex will keep all the reflective fields.

3) Assume the dataregion is a PartitionedRegion (otherwise, there is no need to define a PartitionResolver). Create a FileRegion (let's call it "fr") and a ChunkRegion (let's call it "cr"), colocated with the data region (let's name it "dataregion"). Define a PartitionResolver that uses dataregion's bucket id as the routing object, which guarantees that an index bucket region has the same bucket id as the corresponding dataregion bucket region, even when dataregion has its own user-defined PartitionResolver. We don't need to define a PartitionResolver on dataregion.

4) FileRegion and ChunkRegion use the same region attributes as dataregion. In the partitioned region case, the FileRegion and ChunkRegion will be under the same parent region, i.e. /root in this example. In the replicated region case, the index regions will always be root regions.

5) Create a RegionDirectory object for a bucket using the same bucket of the FileRegion and ChunkRegion.

6) Create PerFieldAnalyzerWrapper and save the fields in LuceneIndex.

7) Create Lucene's IndexWriterConfig object using the Analyzer.

8) Create Lucene's IndexWriter object using the RegionDirectory and IndexWriterConfig objects.

9) Define an AEQ with multiple dispatcher threads and order-policy=partition. That will group events by bucket id into different dispatcher queues. Each dispatcher thread will call our AEQ listener to process events for one or more buckets. Each event will be converted into a document and written into the ChunkRegion via the RegionDirectory. We don't need a lock for the RegionDirectory, since only one thread will process one bucket's events.

10) If dataregion is a replicated region, then define the AEQ with a single dispatcher thread.

11) Register the newly created LuceneIndex with the LuceneService. The registration step will also publish the metadata into the "lucene_meta_region", which is a persistent replicated region, so that other JVMs will know that a new LuceneIndex with this metadata was created. All the members should have a LuceneService instance with the same LuceneIndex definition.
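
The bucket id routing from step 3 can be illustrated with a small sketch. It assumes, for illustration only, that a bucket is chosen as the routing object's hash modulo the bucket count; `IndexFileKey`, `routingObject` and `bucketFor` are hypothetical names, not the Geode PartitionResolver API.

```java
// Hypothetical sketch: index keys carry the data bucket id, and the resolver
// returns only that id, so index entries land in the bucket of the same number.
class BucketRoutingSketch {

  // Key of an index region entry: the data bucket id plus the index file name.
  static final class IndexFileKey {
    final int bucketId;
    final String fileName;

    IndexFileKey(int bucketId, String fileName) {
      this.bucketId = bucketId;
      this.fileName = fileName;
    }
  }

  // Resolver stand-in: look at the bucket id part of the key only.
  static Object routingObject(IndexFileKey key) {
    return key.bucketId;
  }

  // Assumed bucket selection: hash of the routing object modulo the bucket count.
  static int bucketFor(Object routingObject, int totalBuckets) {
    return Math.floorMod(routingObject.hashCode(), totalBuckets);
  }

  public static void main(String[] args) {
    int totalBuckets = 113;
    int dataBucket = bucketFor("order-42", totalBuckets);
    IndexFileKey indexKey = new IndexFileKey(dataBucket, "_0.cfs");
    int indexBucket = bucketFor(routingObject(indexKey), totalBuckets);
    System.out.println("same bucket: " + (dataBucket == indexBucket));
  }
}
```

Because an Integer hashes to its own value, a bucket id between 0 and the bucket count maps back to itself, regardless of how the data key was routed.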

Index Maintenance

A LuceneIndex can be created and destroyed. We don't support creating an index on a region that already contains data for now.

Handling failures, restarts, and rebalance

The index regions and async event queue will be restored along with the colocated data region's buckets, so during failover the new primary should be able to read and write the index as usual.

Aggregation

In the case of partitioned regions, the query must be sent out to all the primaries. The results will then need to be aggregated back together. Lucene search will use FunctionService to distribute the query to the primaries.

Input to primaries

Serialized Query
CollectorManager to be used for local aggregation
Result limit

Output from primaries

Merged collector created from results of search on local bucket indexes.

PlantUML

 participant LuceneQuery
 participant FunctionService
 participant FunctionCollector
 participant CollectorManager
 participant M1_LuceneFunction
 participant M1_CollectorManager
 participant Index_1
 participant Index_2
 LuceneQuery -> FunctionService: Query
 activate FunctionService
 FunctionService --> M1_LuceneFunction : LuceneContext
 activate M1_LuceneFunction
 FunctionService --> M2_LuceneFunction: LuceneContext
 activate M2_LuceneFunction
 M1_LuceneFunction -> Index_1 : search(Collector_1)
 Index_1 -> M1_LuceneFunction : loaded Collector_1
 M1_LuceneFunction -> Index_2 : search(Collector_2)
 Index_2 -> M1_LuceneFunction : loaded Collector_2
 M1_LuceneFunction -> M1_CollectorManager : merge Collectors
 activate M1_CollectorManager
 M1_CollectorManager -> M1_LuceneFunction : merged Collector
 deactivate M1_CollectorManager
 activate FunctionCollector
 M1_LuceneFunction -> FunctionCollector:Collector_M1
 deactivate M1_LuceneFunction
 M2_LuceneFunction -> FunctionCollector:Collector_M2
 deactivate M2_LuceneFunction
 FunctionCollector -> CollectorManager : merge Collectors
 activate CollectorManager
 CollectorManager -> FunctionCollector : Final Collector
 deactivate CollectorManager
 FunctionCollector -> FunctionService : Final Collector
 deactivate FunctionCollector
 FunctionService -> LuceneQuery : QueryResults
 deactivate FunctionService

We are still investigating options for how to aggregate the data, see Text Search Aggregation Options.

In the case of replicated regions, the query will be sent to one of the members and executed there. Aggregation will be handled in that member before the results are returned to the caller.
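
One simple way to picture the aggregation step is merging per-bucket hit lists by score and keeping the best entries up to the result limit. The sketch below is an assumption-laden illustration, not the CollectorManager design under investigation: `Hit` and `merge` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of merging (key, score) hits from several bucket indexes.
class TopEntriesMergeSketch {

  static final class Hit {
    final Object key;
    final float score;

    Hit(Object key, float score) {
      this.key = key;
      this.score = score;
    }

    @Override
    public String toString() {
      return key + "=" + score;
    }
  }

  // Combine all per-bucket hits, order by descending score, and apply the limit.
  static List<Hit> merge(List<List<Hit>> perBucketHits, int limit) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : perBucketHits) {
      all.addAll(hits);
    }
    all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
  }

  public static void main(String[] args) {
    List<Hit> bucket1 = List.of(new Hit("k1", 0.9f), new Hit("k2", 0.4f));
    List<Hit> bucket2 = List.of(new Hit("k3", 0.7f));
    System.out.println(merge(List.of(bucket1, bucket2), 2));
  }
}
```

The same merge can run twice: once on each member over its local buckets, and once on the caller over the per-member results.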

Result collection and paging

The ResultSet will support a pagination mechanism to retrieve the results. All the keys are aggregated at the query executor node (client or peer), and getAll is used to fetch the values one page at a time, according to the page size.
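
A minimal sketch of that paging scheme follows, assuming the keys have already been aggregated in score order. A plain map stands in for the data region and a loop over the page's keys stands in for getAll; the class name `PagedResultsSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch: keys are held up front, values are fetched one page at a time.
class PagedResultsSketch<K, V> {
  private final List<K> keys;     // all matching keys, already aggregated
  private final Map<K, V> region; // stand-in for the data region
  private final int pageSize;     // 0 means no pagination
  private int position = 0;

  PagedResultsSketch(List<K> keys, Map<K, V> region, int pageSize) {
    if (pageSize < 0) {
      throw new IllegalArgumentException("pageSize must not be negative");
    }
    this.keys = new ArrayList<>(keys);
    this.region = region;
    this.pageSize = pageSize;
  }

  boolean hasNext() {
    return position < keys.size();
  }

  // Fetch the values for the next page of keys (the getAll stand-in).
  Map<K, V> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    int size = pageSize == 0 ? keys.size() : pageSize;
    int end = Math.min(position + size, keys.size());
    Map<K, V> page = new LinkedHashMap<>();
    for (K key : keys.subList(position, end)) {
      page.put(key, region.get(key));
    }
    position = end;
    return page;
  }

  public static void main(String[] args) {
    Map<String, String> region = Map.of("a", "A", "b", "B", "c", "C");
    PagedResultsSketch<String, String> results =
        new PagedResultsSketch<>(List.of("a", "b", "c"), region, 2);
    while (results.hasNext()) {
      System.out.println(results.next());
    }
  }
}
```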

JMX MBean

A Lucene Service MBean is available and accessed through an ObjectName like:

GemFire:service=CacheService,name=LuceneService,type=Member,member=192.168.2.13(59583)<ec><v5>-1026

This MBean provides these operations:

Code Block

LuceneServiceMBean API

/**
 * Returns an array of {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
 * instances defined in this member
 *
 * @return an array of LuceneIndexMetrics for the LuceneIndexes defined in this member
 */
public LuceneIndexMetrics[] listIndexMetrics();

/**
 * Returns an array of {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
 * instances defined on the input region in this member
 *
 * @param regionPath The full path of the region to retrieve
 *
 * @return an array of LuceneIndexMetrics for the LuceneIndex instances defined on the input region
 * in this member
 */
public LuceneIndexMetrics[] listIndexMetrics(String regionPath);

/**
 * Returns a {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
 * with the input index name defined on the input region in this member.
 *
 * @param regionPath The full path of the region to retrieve
 * @param indexName The name of the index to retrieve
 *
 * @return a LuceneIndexMetrics for the LuceneIndex with the input index name defined on the input region
 * in this member.
 */
public LuceneIndexMetrics listIndexMetrics(String regionPath, String indexName);

A LuceneIndexMetrics data bean includes raw stat values like:

Code Block

LuceneIndexMetrics Sample

Region=/data2; index=full_index
	commitTime->107608255573
	commits->5999
	commitsInProgress->0
	documents->498
	queryExecutionTime->0
	queryExecutionTotalHits->0
	queryExecutions->0
	queryExecutionsInProgress->0
	updateTime->7313618780
	updates->6419
	updatesInProgress->0

Limitations include:

no rates or average latencies are available
no aggregation (which means no rollups across members in the GemFire -> Distributed MBean)
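
Since only raw values are exposed, a consumer has to derive averages and rates itself. The sketch below shows the arithmetic, assuming the time stats are cumulative totals in nanoseconds (the sample above does not state units) and using hypothetical helper names.

```java
// Hypothetical helpers for deriving averages and rates from raw stat values.
class IndexMetricsAveragesSketch {

  // Average time per operation from a cumulative time and an operation count.
  static long averageTime(long totalTime, long operations) {
    return operations == 0 ? 0 : totalTime / operations;
  }

  // Rate from two samples of a cumulative counter taken elapsedSeconds apart.
  static double ratePerSecond(long previousCount, long currentCount, double elapsedSeconds) {
    return (currentCount - previousCount) / elapsedSeconds;
  }

  public static void main(String[] args) {
    // commitTime and commits taken from the sample above.
    long averageCommit = averageTime(107608255573L, 5999L);
    System.out.println("average commit time: " + averageCommit);
  }
}
```

With the sample values, the average commit time is 17937698, or roughly 17.9 ms if the unit is indeed nanoseconds.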

XML Configuration

Code Block
<region name="drugs">
  <lucene-index indexName="luceneIndex">
    <FieldDefinition name="fieldName" analyzer="KeywordAnalyzer"/>
  </lucene-index>
</region>

REST API

TBD

Implementation

Index Storage

The Lucene indexes will be stored in memory instead of on disk. This will be done by implementing a Lucene FSDirectory called GeodeFSDirectory which uses Geode as a flat file system. This way we get all the benefits offered by Geode, and we can achieve replication and sharding of the indexes. The Lucene indexes will be colocated with the region they are defined on.

PlantUML

[Lucene Indexer] --> [GeodeFSDirectory]
() "User"
node "Colocated and Replicated" {
  () User --> [User Region] : Puts
  [User Region] --> [Async Queue]
  [Async Queue] --> [Lucene Indexer] : Batch Writes
  [GeodeFSDirectory] --> [Lucene Regions]
}

Partitioned region data flow

PlantUML

() User -down-> [Cache] : PUTs
node cluster {
 database {
 () "indexBucket1Primary"
 }

 database {
 () "indexBucket1Secondary"
 }

[Cache] ..> [Bucket 1]
 [Bucket 1] -down-> [Async Queue Bucket 1]
[Async Queue Bucket 1] -down-> [FSDirectoryBucket1] : Batch Write
[FSDirectoryBucket1] -> indexBucket1Primary
indexBucket1Primary -right-> indexBucket1Secondary

 database {
 () "indexBucket2Primary"
 }

 database {
 () "indexBucket2Secondary"
 }

[Cache] ..> [Bucket 2]
 [Bucket 2] -down-> [Async Queue Bucket 2]
 [Async Queue Bucket 2] -down-> [FSDirectoryBucket2] : Batch Write
 [FSDirectoryBucket2] -> indexBucket2Primary
 indexBucket2Primary -right-> indexBucket2Secondary 
}

In a partitioned region, every bucket in the region will have its own GeodeDirectory to store the Lucene indexes. The GeodeDirectory implements a file system using 2 regions:

FileRegion: holds the metadata about index files

ChunkRegion: holds the actual data chunks for a given index file

The FileRegion and ChunkRegion will be colocated with the data region which is to be indexed. The GeodeFSDirectory will have a key that contains the bucket id for file metadata chunks.

The FileRegion and ChunkRegion will have a partition resolver that looks at the bucket id part of the key only.

In the AsyncEventListener, when a data entry is processed:

1) Determine the bucket id of the entry.

2) Get the directory for that bucket and perform the indexing operation on that instance.

Index Maintenance

An AsyncEventQueue will be used to update the LuceneIndex. This will allow us to do updates in the batches supported by the AEQ.

Indexed field values are obtained from the AsyncEvent through reflection (for a domain object) or through the PdxInstance interface (for PDX or JSON). A Lucene document object is then constructed and added to the LuceneIndex associated with that region.

Handling failures, restarts, and rebalance

The index and async event queue will be stored in a region with the same redundancy level as the original region. We will take care to ensure that all updates are written to the index files before removing events from the queue. So during failover the new primary should be able to read index files from disk.

Walkthrough creating index in Geode region

1) Create a LuceneIndex object to hold the data structures that will be created in following steps. This object will be registered to cache owned LuceneService later.

2) Assume the dataregion is a PartitionedRegion (otherwise, there is no need to define a PartitionResolver). Create a FileRegion (let's call it "fr") and a ChunkRegion (let's call it "cr"), colocated with the data region (let's name it "dataregion"). FileRegion and ChunkRegion use the same region attributes as dataregion. If the index regions are persistent, use dataregion's bucket name as the path to persist the index region. For example, if the dataregion bucket name is /root/_P_BUCKET_1, then the path will be _B_dataregion_21 (dataregion's bucket 21).

3) Create a GeodeDirectory object using the FileRegion, the ChunkRegion and the path we got in the previous step.

4) Create PerFieldAnalyzerWrapper and save the fields in LuceneIndex.

5) Create Lucene's IndexWriterConfig object using the Analyzer.

6) Create Lucene's IndexWriter object using the GeodeDirectory and IndexWriterConfig objects.

7) Define a PartitionResolver that uses dataregion's bucket id as the routing object, which guarantees that an index bucket region has the same bucket id as the corresponding dataregion bucket region, even when dataregion has its own user-defined PartitionResolver. We don't need to define a PartitionResolver on dataregion.

8) Define an AEQ with multiple dispatcher threads and order-policy=partition. That will group events by bucket id into different dispatcher queues. Each dispatcher thread will call our AEQ listener to process events for one or more buckets. Each event will be converted into a Document and written into the ChunkRegion via the GeodeDirectory. We don't need a lock for the GeodeDirectory, since only one thread will process one bucket's events.

9) If dataregion is a replicated region, then define the AEQ with a single dispatcher thread.

10) Register the newly created LuceneIndex with the LuceneService. The registration step will also publish the metadata into the "lucene_meta_region", which is a persistent replicated region, so that other JVMs will know that a new LuceneIndex with this metadata was created. All the members should have a LuceneService instance with the same LuceneIndex definition.
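
The claim that no lock is needed rests on every event for a given bucket reaching the same dispatcher thread. A minimal sketch of such an assignment is shown below; the modulo mapping and the names `dispatcherFor` and `assign` are assumptions made for illustration, not the AEQ's actual dispatching code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: group events into dispatcher queues by bucket id.
class DispatcherAssignmentSketch {

  // Assumed mapping: a bucket always maps to the same dispatcher thread.
  static int dispatcherFor(int bucketId, int dispatcherThreads) {
    return Math.floorMod(bucketId, dispatcherThreads);
  }

  // Distribute events (represented by their bucket ids) across dispatcher queues,
  // preserving arrival order within each queue.
  static List<List<Integer>> assign(int[] eventBucketIds, int dispatcherThreads) {
    List<List<Integer>> queues = new ArrayList<>();
    for (int i = 0; i < dispatcherThreads; i++) {
      queues.add(new ArrayList<>());
    }
    for (int bucketId : eventBucketIds) {
      queues.get(dispatcherFor(bucketId, dispatcherThreads)).add(bucketId);
    }
    return queues;
  }

  public static void main(String[] args) {
    // Six events arriving for buckets 0, 1, 2, 3, 1 and 0, with two dispatcher threads.
    System.out.println(assign(new int[] {0, 1, 2, 3, 1, 0}, 2));
  }
}
```

One dispatcher may serve several buckets, but no bucket is ever served by two dispatchers, so writes to a bucket's directory are never concurrent.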

Processing Queries

Partitioned regions

In the case of partitioned regions, the query must be sent out to all of the primaries. The results will then need to be aggregated back together. We are still investigating options for how to aggregate the data, see Text / Lucene Search.

Replicated regions

TBD
