Requirements
- Allow users to create Lucene indexes on data stored in Geode
- Update the indexes asynchronously to avoid impacting write latency
- Allow users to perform text (Lucene) searches on Geode data using the Lucene index. Results from text searches may be stale due to asynchronous index updates.
- Provide high availability of indexes using Geode's HA capabilities
- Scalability
- Performance comparable to RAMFSDirectory
Out of Scope
- Building the next/better Solr/Elasticsearch.
- Enhancing the current Geode OQL to use Lucene indexes.
Related Documents
A previous integration of Lucene and GemFire:
Similar efforts done by other data products
Hibernate Search: Hibernate search
Solandra: Solandra embeds Solr in Cassandra.
Terminology
- Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
- Fields: A Document consists of one or more Fields. A Field is simply a name-value pair.
- Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
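To make the three terms concrete, here is a toy model in plain Java. It is only an illustration of the vocabulary and does not use Lucene's API: the `ToyIndex` class, its methods, and the whitespace tokenization are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of the terminology above; this is NOT Lucene's API. */
class ToyIndex {
    // A "document" here is just a set of name-value pairs (its fields).
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Indexing: store the document and record each whitespace-separated term. */
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    /** Searching: return the documents whose given field contains the term. */
    List<Map<String, String>> search(String fieldName, String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        String key = fieldName + ":" + term.toLowerCase(Locale.ROOT);
        for (int docId : postings.getOrDefault(key, new TreeSet<>())) {
            hits.add(documents.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(Map.of("customer", "Alice Smith", "tags", "urgent fragile"));
        index.addDocument(Map.of("customer", "Bob Jones", "tags", "fragile"));
        System.out.println(index.search("tags", "fragile").size()); // prints 2
    }
}
```

In Lucene itself the same roles are played by Document, IndexWriter and IndexSearcher, with an Analyzer deciding how field text is split into terms.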
API
User Input
- A region and list of to-be-indexed fields
- [Optional] An Analyzer for each field; the Standard Analyzer is used for any field without a specified Analyzer
Key points
- A single index will not support multiple regions. Join queries between regions are not supported.
- Heterogeneous objects in single region will be supported
- Only top level fields and nested objects can be indexed, not nested collections
- The index needs to be created before adding the data (for phase 1)
- Pagination of results will be supported
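One way pagination over an already-ranked result list could behave is sketched below. `PagedResults` is a hypothetical helper, not part of the proposed API; it follows the page-size contract described for `setPageSize` in this document (0 means no pagination, negative values are rejected) and borrows the `hasNextPage`/`getNextPage` names from the usage example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an already-ranked result list into pages. */
class PagedResults<T> {
    private final List<T> results;
    private final int pageSize;
    private int position = 0;

    PagedResults(List<T> results, int pageSize) {
        if (pageSize < 0) {
            throw new IllegalArgumentException("pageSize must not be negative");
        }
        this.results = results;
        // A page size of 0 means "no pagination": everything fits in one page.
        this.pageSize = (pageSize == 0) ? Math.max(results.size(), 1) : pageSize;
    }

    boolean hasNextPage() {
        return position < results.size();
    }

    /** Returns the next page, or an empty list once all results were consumed. */
    List<T> getNextPage() {
        int end = Math.min(position + pageSize, results.size());
        List<T> page = new ArrayList<>(results.subList(position, end));
        position = end;
        return page;
    }

    public static void main(String[] args) {
        PagedResults<String> paged = new PagedResults<>(List.of("a", "b", "c", "d", "e"), 2);
        while (paged.hasNextPage()) {
            System.out.println(paged.getNextPage()); // [a, b] then [c, d] then [e]
        }
    }
}
```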
Java API
LuceneService
/**
 * Create a lucene index using default analyzer.
 */
public LuceneIndex createIndex(String indexName, String regionName, String... fields);

/**
 * Create a lucene index using specified analyzer per field
 */
public LuceneIndex createIndex(String indexName, String regionName, Map<String, Analyzer> analyzerPerField);

public void destroyIndex(LuceneIndex index);

public LuceneIndex getIndex(String indexName, String regionName);

public Collection<LuceneIndex> getAllIndexes();

/**
 * Get a factory for building queries
 */
public LuceneQueryFactory createLuceneQueryFactory();
LuceneQueryFactory
public enum ResultType {
  /**
   * Query results only contain value, which is the default setting.
   * If field projection is specified, use projected fields' values instead of whole domain object
   */
  VALUE,

  /**
   * Query results contain score
   */
  SCORE,

  /**
   * Query results contain key
   */
  KEY
};

/**
 * Set page size for a query result. The default page size is 0, which means no pagination.
 * If a negative value is specified, throw IllegalArgumentException
 * @param pageSize
 * @return itself
 */
LuceneQueryFactory setPageSize(int pageSize);

/**
 * Set max limit of results for a query.
 * If the specified limit is less than or equal to zero, throw IllegalArgumentException
 * @param limit
 * @return itself
 */
LuceneQueryFactory setResultLimit(int limit);

/**
 * Set whether to include SCORE, KEY in result
 *
 * @param resultTypes
 * @return itself
 */
LuceneQueryFactory setResultTypes(ResultType... resultTypes);

/**
 * Set a list of fields for result projection.
 *
 * @param fieldNames
 * @return itself
 */
LuceneQueryFactory setProjectionFields(String... fieldNames);

/**
 * Create wrapper object for lucene's QueryParser object.
 * The queryString uses lucene QueryParser's syntax, which is meant to be easy to use
 * and human-readable.
 *
 * @param indexName index name
 * @param regionName region name
 * @param queryString query string in lucene QueryParser's syntax
 * @param analyzer lucene Analyzer to parse the queryString
 * @return LuceneQuery object
 * @throws ParseException
 */
public LuceneQuery create(String indexName, String regionName, String queryString, Analyzer analyzer) throws ParseException;

/**
 * Create wrapper object for lucene's QueryParser object using default standard analyzer.
 * The queryString uses lucene QueryParser's syntax, which is meant to be easy to use
 * and human-readable.
 *
 * @param indexName index name
 * @param regionName region name
 * @param queryString query string in lucene QueryParser's syntax
 * @return LuceneQuery object
 * @throws ParseException
 */
public LuceneQuery create(String indexName, String regionName, String queryString) throws ParseException;

/**
 * Create wrapper object for lucene's Query object.
 * Advanced lucene users can customize their own Query object and use it directly in this API.
 *
 * @param indexName index name
 * @param regionName region name
 * @param query lucene Query object
 * @return LuceneQuery object
 */
public LuceneQuery create(String indexName, String regionName, Query query);
LuceneQuery
/**
 * Provides wrapper object of Lucene's Query object and execute the search.
 * <p>Instances of this interface are created using
 * {@link LuceneQueryFactory#create}.
 *
 */
public interface LuceneQuery {
  /**
   * Execute the search and get results.
   */
  public LuceneQueryResults<?> search();

  /**
   * Get page size setting of current query.
   */
  public int getPageSize();

  /**
   * Get limit size setting of current query.
   */
  public int getLimit();

  /**
   * Get result types setting of current query.
   */
  public ResultType[] getResultTypes();

  /**
   * Get projected fields setting of current query.
   */
  public String[] getProjectedFieldNames();
}
LuceneResultStruct
public interface LuceneResultStruct {
  /**
   * Return the value associated with the given field name
   *
   * @param fieldName the String name of the field
   * @return the value associated with the specified field
   * @throws IllegalArgumentException If this struct does not have a field named fieldName
   */
  public Object getProjectedField(String fieldName);

  /**
   * Return key of the entry
   *
   * @return key
   * @throws IllegalArgumentException If this struct does not contain key
   */
  public Object getKey();

  /**
   * Return value of the entry
   *
   * @return value the whole domain object
   * @throws IllegalArgumentException If this struct does not contain value
   */
  public Object getValue();

  /**
   * Return score of the query
   *
   * @return score
   * @throws IllegalArgumentException If this struct does not contain score
   */
  public Double getScore();

  /**
   * Get the ordered list of value types.
   * Each item in the list is either a ResultType or a field name
   * @return the array of result types
   */
  public Object[] getNames();

  /**
   * Get the values in same order as result types
   * @return the array of values
   */
  public Object[] getResultValues();
}
Examples
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create index on fields with default analyzer:
LuceneIndex index = luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

// Alternatively, create index on fields with specified analyzer:
LuceneIndex indexWithAnalyzers = luceneService.createIndex(indexName, regionName, analyzerPerField);

// Create query (method names follow the LuceneQueryFactory API above)
LuceneQuery query = luceneService.createLuceneQueryFactory()
    .setResultLimit(200)
    .setPageSize(20)
    .setResultTypes(ResultType.SCORE, ResultType.VALUE, ResultType.KEY)
    .setProjectionFields("field1", "field2")
    .create(indexName, regionName, querystring, analyzer);

// Search using query
LuceneQueryResults results = query.search();

// Without pagination (page size 0): return all results in one page
List values = results.getNextPage();

// With pagination: return results page by page
while (results.hasNextPage()) {
  List<LuceneResultStruct> page = results.getNextPage();
  for (LuceneResultStruct r : page) {
    System.out.println(r.getValue());
  }
}
Gfsh API
// Create Index
gfsh> create lucene-index --name=indexName --region=/orders --fields=customer,tags

// Destroy Index
gfsh> destroy lucene-index --name=indexName --region=/orders

// Execute Lucene query
gfsh> luceneQuery --regionName=/orders --queryStrings="" --limit=100 --page-size=10
XML Configuration
<region name="region">
  <lucene-index indexName="luceneIndex">
    <FieldDefinition name="fieldName" analyzer="KeywordAnalyzer"/>
  </lucene-index>
</region>
REST API
Spring Data GemFire Support
Implementation Flowchart
Index Storage
Inside LuceneIndex
A closer look at Partitioned region data flow
- FileRegion: holds the metadata about the index files.
- ChunkRegion: holds the actual data chunks for a given index file.
- Determine the bucket id of the entry.
- Get the directory for that bucket and perform the indexing operation on that instance.
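The data flow above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `BucketedFileStore`, `BucketDirectory`, the 4-byte chunk size, and the hash-modulo routing rule are all hypothetical, and ordinary maps stand in for the FileRegion and ChunkRegion.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one directory per bucket, index files split into chunks. */
class BucketedFileStore {
    static final int CHUNK_SIZE = 4; // tiny on purpose, to make the chunking visible

    /** Stand-in for one bucket's FileRegion (metadata) and ChunkRegion (data). */
    static class BucketDirectory {
        final Map<String, Integer> fileRegion = new HashMap<>(); // file name -> length
        final Map<String, byte[]> chunkRegion = new HashMap<>(); // "name#chunkNo" -> bytes

        void writeFile(String name, byte[] data) {
            fileRegion.put(name, data.length);
            for (int offset = 0, chunk = 0; offset < data.length; offset += CHUNK_SIZE, chunk++) {
                int end = Math.min(offset + CHUNK_SIZE, data.length);
                chunkRegion.put(name + "#" + chunk, Arrays.copyOfRange(data, offset, end));
            }
        }

        byte[] readFile(String name) {
            Integer length = fileRegion.get(name);
            if (length == null) {
                throw new IllegalArgumentException("unknown file: " + name);
            }
            byte[] data = new byte[length];
            for (int offset = 0, chunk = 0; offset < length; offset += CHUNK_SIZE, chunk++) {
                byte[] part = chunkRegion.get(name + "#" + chunk);
                System.arraycopy(part, 0, data, offset, part.length);
            }
            return data;
        }
    }

    private final BucketDirectory[] directories;

    BucketedFileStore(int bucketCount) {
        directories = new BucketDirectory[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            directories[i] = new BucketDirectory();
        }
    }

    /** Assumed routing rule for this sketch: key hash modulo the bucket count. */
    int bucketIdFor(Object key) {
        return Math.floorMod(key.hashCode(), directories.length);
    }

    BucketDirectory directoryFor(Object key) {
        return directories[bucketIdFor(key)];
    }

    public static void main(String[] args) {
        BucketedFileStore store = new BucketedFileStore(113);
        BucketDirectory dir = store.directoryFor("order-42");
        dir.writeFile("segments_1", "hello lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(dir.readFile("segments_1"), StandardCharsets.UTF_8));
        System.out.println(dir.chunkRegion.size()); // 12 bytes in 4-byte chunks: prints 3
    }
}
```

Keeping metadata and chunks in separate maps mirrors the FileRegion/ChunkRegion split: a reader looks up the file length first and then fetches only the chunks it needs.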
Storage with different region types
Index Maintenance
An AsyncEventQueue (AEQ) will be used to update the LuceneIndex. This allows updates to be applied in batches, which the AEQ supports. Indexed field values are obtained from each AsyncEvent through reflection (for domain objects) or through the PdxInstance interface (for PDX or JSON); a Lucene Document is then constructed from those values and added to the LuceneIndex associated with that region.
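The reflection path for domain objects could look roughly like the sketch below. It uses no Geode or Lucene types: `BatchIndexer`, `Order`, and `Note` are hypothetical names, a list of maps stands in for the index, and a plain list of values stands in for a batch of queue events. The PdxInstance path is not shown.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turn a batch of queued values into field maps. */
class BatchIndexer {
    private final List<String> indexedFields;
    // Stand-in for the Lucene index: one field map per "document".
    final List<Map<String, Object>> documents = new ArrayList<>();

    BatchIndexer(String... indexedFields) {
        this.indexedFields = List.of(indexedFields);
    }

    /** Read the indexed fields declared on the value's own class via reflection. */
    Map<String, Object> toDocument(Object value) {
        Map<String, Object> document = new LinkedHashMap<>();
        for (String name : indexedFields) {
            try {
                Field field = value.getClass().getDeclaredField(name);
                field.setAccessible(true);
                document.put(name, field.get(value));
            } catch (NoSuchFieldException e) {
                // Heterogeneous objects: skip fields this class does not declare.
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return document;
    }

    /** Process one batch of values; returns the number of documents added. */
    int processBatch(List<?> batch) {
        for (Object value : batch) {
            documents.add(toDocument(value));
        }
        return batch.size();
    }

    // Two unrelated example domain classes living in the same region.
    static class Order {
        private final String customer;
        private final String tags;
        Order(String customer, String tags) { this.customer = customer; this.tags = tags; }
    }

    static class Note {
        private final String tags;
        Note(String tags) { this.tags = tags; }
    }

    public static void main(String[] args) {
        BatchIndexer indexer = new BatchIndexer("customer", "tags");
        indexer.processBatch(List.of(new Order("alice", "urgent"), new Note("memo")));
        System.out.println(indexer.documents); // [{customer=alice, tags=urgent}, {tags=memo}]
    }
}
```

Skipping fields that a class does not declare is one simple way to support heterogeneous objects in a single region; `getDeclaredField` does not see inherited fields, so a fuller version would also walk the superclass chain.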
Handling failures, restarts, and rebalance
The index and async event queue will be stored in a region with the same redundancy level as the original region. We will take care to ensure that all updates are written to the index files before the corresponding events are removed from the queue, so that after a failover the new primary can read the index files from disk.
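The ordering rule (commit first, dequeue second) can be illustrated with a small sketch. Everything here is hypothetical: `DurableIndexUpdater` uses an in-memory deque and map in place of the real queue and index files, and the `failNextCommit` flag exists only to simulate a failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: events leave the queue only after the index commit. */
class DurableIndexUpdater {
    final Deque<String[]> queue = new ArrayDeque<>();           // pending {key, text} events
    final Map<String, String> committedIndex = new HashMap<>(); // stand-in for index files
    boolean failNextCommit = false;                             // hook to simulate a failure

    void enqueue(String key, String text) {
        queue.addLast(new String[] {key, text});
    }

    /** Returns true if the batch was committed and its events were dequeued. */
    boolean drainBatch(int batchSize) {
        List<String[]> batch = new ArrayList<>();
        for (String[] event : queue) { // peek only; nothing is removed yet
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(event);
        }
        if (failNextCommit) {
            failNextCommit = false;
            return false; // events stay queued and will be retried
        }
        for (String[] event : batch) {
            committedIndex.put(event[0], event[1]); // keyed write, so a replay is harmless
        }
        for (int i = 0; i < batch.size(); i++) {
            queue.removeFirst(); // dequeue only after the commit succeeded
        }
        return true;
    }

    public static void main(String[] args) {
        DurableIndexUpdater updater = new DurableIndexUpdater();
        updater.enqueue("k1", "first");
        updater.enqueue("k2", "second");
        updater.failNextCommit = true;
        System.out.println(updater.drainBatch(10) + " " + updater.queue.size()); // false 2
        System.out.println(updater.drainBatch(10) + " " + updater.queue.size()); // true 0
    }
}
```

Because a failure between commit and dequeue would replay events, the index update in this sketch is keyed by entry key so that applying the same event twice leaves the same result.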
Walkthrough creating index in Geode region
Processing Queries
Partitioned regions
In the case of partitioned regions, the query must be sent out to all of the primaries, and the results then need to be aggregated back together. We are still investigating options for how to aggregate the data; see Text Search Aggregation Options.
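As an illustration of what the gather step might involve, the sketch below merges per-bucket hit lists and keeps the best-scoring entries. It shows only one possible option, not the chosen design: `ScatterGather`, `Hit`, and `mergeTopN` are hypothetical names, and the sketch assumes scores from different buckets are directly comparable, which may not hold when each bucket's index has its own term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of one aggregation option: merge and keep the top N by score. */
class ScatterGather {
    static final class Hit {
        final String key;
        final double score;

        Hit(String key, double score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Merge the hit lists returned by each primary and keep the `limit` best scores. */
    static List<Hit> mergeTopN(List<List<Hit>> perBucketHits, int limit) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perBucketHits) {
            all.addAll(hits);
        }
        // Highest score first; the sort is stable, so ties keep their arrival order.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> bucket0 = List.of(new Hit("k1", 0.9), new Hit("k2", 0.4));
        List<Hit> bucket1 = List.of(new Hit("k3", 0.7));
        for (Hit hit : mergeTopN(List.of(bucket0, bucket1), 2)) {
            System.out.println(hit.key + " " + hit.score); // k1 0.9, then k3 0.7
        }
    }
}
```

If each primary already returns its hits sorted and capped at the limit, the merge never has to look at more than limit times the number of buckets entries.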
Replicated regions
TBD