Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...


  1. Allow user to create Lucene Indexes on data stored in geodeGeode

  2. Update the indexes asynchronously to avoid impacting write latency
  3. Allow user to perform text (Lucene) search on Geode Data data using the Lucene index. Results from the text searches may be stale due to asnchyronous asynchronous index updates.

  4. Provide highly available of indexes using Geodes HA capabilities

  5. Provide high throughput indexing and querying by partitioning index data to match partitioning in Geode
  6. Scalability
  7. Performance comparable to RAMFSDirectory

Out of Scope
  1. Building next/better Solr/Elasticsearch.

  2. Enhance current Geode OQL to use Lucene index.

...

  • Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
  • Fields: A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item.
  • Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

4 API

...

User Input

Key points

  1. A region and list of to-be-indexed fields
  2. [ Optional ] Standard Analyzer or its implementation to be used with all the fields in a index
  3. [ Optional ] Field types. A string can be Text or String in lucene. The two have different behavior

Key points

  1. A single index will not support A single index will not support multiple regions (join queries between regions are not supported)

  2. Heterogeneous objects in single region will be supported
  3. Only top level fields and nested objects can be indexed, not nested collections
  4. The index needs to be created before adding the data (for phase1) 
  5. Pagination of results will be supported

Users will interact with a new LuceneService interface, which provides methods for creating and querying indexes. Users can also create indexes through gfsh or cache.xml.

Java API 

 

 
Code Block

LuceneService LuceneService.get(Cache cache);

  /**
   * Create a lucene index using default analyzer.
   * 
   * @param indexName
   * @param regionName
   * @param fields
   * @return LuceneIndex object
   */
  public LuceneIndex createIndex(String indexName, String regionName, String... fields);
  
  /**
   * Create a lucene index using specified analyzer per field
   * 
   * @param indexName index name
   * @param regionName region name
   * @param analyzerPerField analyzer per field map
   * @return LuceneIndex object
   *
   */
  public LuceneIndex createIndex(String indexName, String regionName,  
      Map<String, Analyzer> analyzerPerField);
  /**
   * Destroy the lucene index
   * 
   * @param index index object
   */
  public void destroyIndex(LuceneIndex index);
  
  /**
   * Get the lucene index object specified by region name and index name
   * @param indexName index name
   * @param regionName region name
   * @return LuceneIndex object
   */
  public LuceneIndex getIndex(String indexName, String regionName);
  
  /**
   * get all the lucene indexes.
   * @return all index objects in a Collection
   */
  public Collection<LuceneIndex> getAllIndexes();
  /**
   * create LuceneQueryFactory
   * @return LuceneQueryFactory object
   */
  public LuceneQueryFactory createLuceneQueryFactory();


 

   
 

LuceneQueryFactory

Code Block
public enum ResultType {
    /**
     *  Query results only contain value, which is the default setting.
     *  If field projection is specified, use projected fields' values instead of whole domain object
     */
    VALUE,
    
    /**
     * Query results contain score
     */
    SCORE,
    
    /**
     * Query results contain key
     */
    KEY
  };
  /**
   * Set page size for a query result. The default page size is 0 which means no pagination.
   * If specified negative value, throw IllegalArgumentException
   * @param pageSize
   * @return itself
   */
  LuceneQueryFactory setPageSize(int pageSize);
  
  /**
   * Set max limit of result for a query
   * If specified limit is less or equal to zero, throw IllegalArgumentException
   * @param limit
   * @return itself
   */
  LuceneQueryFactory setResultLimit(int limit);
  
  /**
   * set weather to include SCORE, KEY in result
   * 
   * @param resultTypes
   * @return itself
   */
  LuceneQueryFactory setResultTypes(ResultType... resultTypes);
  
  /**
   * Set a list of fields for result projection.
   * 
   * @param fieldNames
   * @return itself
   */
  LuceneQueryFactory setProjectionFields(String... fieldNames);
  
  /**
   * Create wrapper object for lucene's QueryParser object.
   * The queryString is using lucene QueryParser's syntax. QueryParser is for easy-to-use 
   * with human understandable syntax. 
   *  
   * @param regionName region name
   * @param indexName index name
   * @param queryString query string in lucene QueryParser's syntax
   * @param analyzer lucene Analyzer to parse the queryString
   * @return LuceneQuery object
   * @throws ParseException
   */
  public LuceneQuery create(String indexName, String regionName, String queryString, 
      Analyzer analyzer) throws ParseException;
  
  /**
   * Create wrapper object for lucene's QueryParser object using default standard analyzer.
   * The queryString is using lucene QueryParser's syntax. QueryParser is for easy-to-use 
   * with human understandable syntax. 
   *  
   * @param regionName region name
   * @param indexName index name
   * @param queryString query string in lucene QueryParser's syntax
   * @return LuceneQuery object
   * @throws ParseException
   */
  public LuceneQuery create(String indexName, String regionName, String queryString) 
      throws ParseException;
  
  /**
   * Create wrapper object for lucene's Query object.
   * Advanced lucene users can customized their own Query object and directly use in this API.  
   * 
   * @param regionName region name
   * @param indexName index name
   * @param query lucene Query object
   * @return LuceneQuery object
   */
  public LuceneQuery create(String indexName, String regionName, Query query);

 

LuceneQuery

Code Block
/**
 * Provides wrapper object of Lucene's Query object and execute the search. 
 * <p>Instances of this interface are created using
 * {@link LuceneQueryFactory#create}.
 * 
 */
public interface LuceneQuery {
  /**
   * Execute the search and get results. 
   */
  public LuceneQueryResults<?> search();
  
  /**
   * Get page size setting of current query. 
   */
  public int getPageSize();
  
  /**
   * Get limit size setting of current query. 
   */
  public int getLimit();
  /**
   * Get result types setting of current query. 
   */
  public ResultType[] getResultTypes();
  
  /**
   * Get projected fields setting of current query. 
   */
  public String[] getProjectedFieldNames();
}

 

 

LuceneResultStruct

Code Block

/**
 * <p>
 * Abstract data structure for one item in query result.
 * 
 * @author Xiaojian Zhou
 * @since 8.5
 */
public interface LuceneResultStruct {
  /**
   * Return the value associated with the given field name
   *
   * @param fieldName the String name of the field
   * @return the value associated with the specified field
   * @throws IllegalArgumentException If this struct does not have a field named fieldName
   */
  public Object getProjectedField(String fieldName);
  
  /**
   * Return key of the entry
   *
   * @return key
   * @throws IllegalArgumentException If this struct does not contain key
   */
  public Object getKey();
  
  /**
   * Return value of the entry
   *
   * @return value the whole domain object
   * @throws IllegalArgumentException If this struct does not contain value
   */
  public Object getValue();
  
  /**
   * Return score of the query 
   *
   * @return score
   * @throws IllegalArgumentException If this struct does not contain score
   */
  public Double getScore();
  
  /**
   * Get the types of values ordered list
   * Item in the list could be either ResultType, or field name
   * @return the array of result types
   */
  public Object[] getNames();
  
  /**
   * Get the values in same order as result types
   * @return the array of values
   */
  public Object[] getResultValues();
}

 

 

 

 

 

  

Examples to use the APIs:

 

 

Code Block
// Get LuceneService
LuceneService luceneService = LuceneService.get(cache);

// Create Index
LuceneIndex index = luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

// create index on fields with specified analyzer:
LuceneIndex index = luceneService.createIndex(indexName, regionName, analyzerPerField);

// Create Query
LuceneQuery query = luceneService.createLuceneQueryFactory().setLimit(200).setPageSize(20)
  .setResultType(SCORE, VALUE, KEY).setFieldProjection("field1", "field2")
  .create(indexName, regionName, querystring, analyzer);


// Search using Query
LuceneQueryResults results = query.search();

List values = results.getNextPage(); // return all results in one page

// Pagination
while (results.hasNextPage())
  List page = results.getNextPage(); // return result page by page

  for (LuceneResultStruct r : page) {
    System.out.prinlnt(r.getValue());
  }
}

 

...

The lucene indexes will be stored in memory instead of disk. This will be done by implementing a lucene FSDirectory called GemDirectory GeodeFSDirectory which uses Geode as a flat file system. This way we get all the benefits offered by Geode and we can achieve replication and shard-ing of the indexes. The lucene indexes will be co-located with the region they are defined on.

 

 

PlantUML
[Lucene Indexer] --> [GeodeFSDirectory]
() "User"
node "Colocated and Replicated" {
  () User --> [User Region] : Puts
  [User Region] --> [Async Queue]
  [Async Queue] --> [Lucene Indexer] : Batch Writes
  [GeodeFSDirectory] --> [Lucene Regions]
}

 

 

PlantUML
() User -down-> [Cache] : PUTs
node cluster {
 database {
 () indexBucket1
 }

 [Cache] ..> [Bucket 1]
 [Bucket 1] -down-> [Async Queue Bucket 1]
[Async Queue Bucket 1] -down-> [FSDirectoryBucket1] : Batch Write
 [FSDirectoryBucket1] -> indexBucket1
 
 database {
 () indexBucket2
 }

 [Cache] ..> [Bucket 2]
 [Bucket 2] -down-> [Async Queue Bucket 2]
 [Async Queue Bucket 2] -down-> [FSDirectoryBucket2] : Batch Write
 [FSDirectoryBucket2] -> indexBucket2
}
 


In a partition region In a partition region , every bucket in the region will have its own GeodeDirectory to store the lucene indexes. The GeodeDirectory implements a file system using 2 regions 
FileRegion : holds the meta data about indexing files
ChunkRegion : Holds the actual data chunks for a given index file. 
The FileRegion and ChunkRegion will be collocated with the data region which is to be indexed.  The GemDirectory GeodeFSDirectory will have a key that contains the bucket id for file metadata chunks.
The FileRegion and ChunkRegion will have partition resolver that looks at the bucket id part of the key only.
In AsyncEventListener, when a data entry is processed ,
1) determine the bucket id of the entry.
2) Get the directory for that bucket, do the indexing operation into that instance.

...