Goals
- Accurate text search
- Reuse code
- Scalability
- Performance, compared to RAMFSDirectory
User Input
- A region and list of to-be-indexed fields (or text searchable fields)
- [ Optional ] Standard Analyzer or its implementation to be used with all the fields in an index
- [ Optional ] Field types. A string can be Text or String in Lucene. The two behave differently: a Text field is tokenized by the analyzer, while a String field is indexed verbatim as a single token
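The inputs listed above could be captured in a small configuration object. The following is only a sketch: `IndexDefinition`, its methods, and the default of treating a field as Text are hypothetical names and assumptions made for illustration, not an existing Geode or Lucene API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the user input above; not part of any Geode or Lucene API. */
final class IndexDefinition {
    /** TEXT values are tokenized by the analyzer; STRING values are indexed verbatim. */
    enum FieldType { TEXT, STRING }

    private final String regionName;
    private final Map<String, FieldType> fields = new LinkedHashMap<>();
    // Assumed default when the optional analyzer is omitted.
    private String analyzerClassName = "org.apache.lucene.analysis.standard.StandardAnalyzer";

    IndexDefinition(String regionName) {
        if (regionName == null || regionName.isEmpty()) {
            throw new IllegalArgumentException("regionName is required");
        }
        this.regionName = regionName;
    }

    /** Adds a field with the assumed default type TEXT. */
    IndexDefinition field(String name) {
        return field(name, FieldType.TEXT);
    }

    IndexDefinition field(String name, FieldType type) {
        fields.put(name, type);
        return this;
    }

    /** Optional: one analyzer shared by all fields of the index. */
    IndexDefinition analyzer(String className) {
        this.analyzerClassName = className;
        return this;
    }

    String regionName() { return regionName; }
    Map<String, FieldType> fields() { return Collections.unmodifiableMap(fields); }
    String analyzerClassName() { return analyzerClassName; }
}
```

A caller would then write something like `new IndexDefinition("customers").field("name").field("zip", IndexDefinition.FieldType.STRING)`, leaving the analyzer at its default.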
Index Persistence
Lucene context
- An index batch write (IndexWriter.close()) will result in creation of a new set of segment files. This could trigger a segment merge operation, which could be resource intensive (think compaction in LSM).
- A large number of segments would increase search latency.
- Lucene buffers documents in memory (writer.setMaxBufferedDocs and writer.setRAMBufferSizeMB). A larger RAM buffer means larger segments, which means less merging later.
- Searchers will not see any changes till IndexWriter is closed.
- Optimization: if a large amount of data is to be indexed, it is better to build N smaller indexes and combine them using writer.addIndexesNoOptimize
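To make the buffering trade-off concrete, here is a toy model in plain Java. It is an assumption-laden simplification: it only counts flushes triggered by a document-count threshold and ignores RAM-based flushing and later merges, so the numbers are illustrative and say nothing about real Lucene behavior. `SegmentFlushModel` is a hypothetical name.

```java
/** Toy model: how many segments are flushed for a given buffer size (merges ignored). */
final class SegmentFlushModel {
    /**
     * Returns the number of segments written when docCount documents pass
     * through a buffer that flushes every maxBufferedDocs documents.
     */
    static int flushedSegments(int docCount, int maxBufferedDocs) {
        if (docCount < 0 || maxBufferedDocs <= 0) {
            throw new IllegalArgumentException("docCount >= 0 and maxBufferedDocs > 0 required");
        }
        // Ceiling division, done in long to avoid int overflow.
        return (int) (((long) docCount + maxBufferedDocs - 1) / maxBufferedDocs);
    }
}
```

Under this model, 1000 documents produce 10 segments with a buffer of 100 documents but only 2 segments with a buffer of 500, which is the sense in which a larger buffer leaves less merging for later.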
Approach
PlantUML |
---|
() PUTs -> [Cache] [Cache] .down.> [Async Queue] [Async Queue] -right-> [Lucene Indexer] [Lucene Indexer] -up-> [GeodeFSDirectory] [GeodeFSDirectory] -left-> [Cache] [Cache] -up-> () Search |
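The flow in the diagram above (PUT into the cache, event placed on an async queue, indexer draining the queue in batches) might be sketched as follows. This is a single-process stand-in under stated assumptions: a plain map plays the cache, a token-to-keys map plays the Lucene index, and `AsyncIndexingSketch` with all its methods is hypothetical. It does not remove stale postings when a key is overwritten.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of PUT -> cache -> async queue -> indexer; not Geode or Lucene code. */
final class AsyncIndexingSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<String> asyncQueue = new LinkedBlockingQueue<>();
    // Stand-in for the Lucene index: token -> keys of entries containing it.
    private final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    /** The PUT path only updates the cache and enqueues the key; no indexing happens here. */
    void put(String key, String value) {
        cache.put(key, value);
        asyncQueue.add(key);
    }

    /** The indexer drains up to maxBatch queued keys and indexes them as one batch. */
    int indexPendingBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        asyncQueue.drainTo(batch, maxBatch);
        for (String key : batch) {
            String value = cache.get(key);
            if (value == null) {
                continue; // entry disappeared before it was indexed
            }
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, t -> new TreeSet<>()).add(key);
                }
            }
        }
        return batch.size();
    }

    /** Searches see only what the indexer has already processed. */
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }
}
```

Batching in `indexPendingBatch` is a deliberate choice in this sketch: since each batch write can create new segment files, fewer and larger batches should keep the segment count down, at the cost of entries becoming searchable later.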
...
PlantUML |
---|
() User -down-> [Cache] : PUTs node cluster { database { () indexPR1 } [Cache] ..> [PR 1] [PR 1] -down-> [FSDirectoryPR1] [FSDirectoryPR1] -> indexPR1 database { () indexPR2 } [Cache] ..> [PR 2] [PR 2] -down-> [FSDirectoryPR2] [FSDirectoryPR2] -> indexPR2 } |
Limitations
Text Search
Option - 1: Custom Parser Aggregator
A search request will be intercepted by a custom ParserAggregator. This component will distribute the search query to all PRs. Each PR will route the request to its local Lucene instance and return its results to the ParserAggregator. The ParserAggregator will then reorder and trim the aggregated result set and return it to the user.
PlantUML |
---|
() User -> [Cache] : Search node cluster { database { () indexPR1 } [Cache] ..> [PR 1] [PR 1] --> [ParserAggregator] [ParserAggregator] --> [LucenePR1] [LucenePR1] --> [FSDirectoryPR1] [FSDirectoryPR1] -> indexPR1 database { () indexPR2 } [ParserAggregator] --> [LucenePR2] [LucenePR2] --> [FSDirectoryPR2] [FSDirectoryPR2] -> indexPR2 } |
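The reorder-and-trim step of the ParserAggregator could look roughly like the sketch below. `ParserAggregatorSketch` and `Hit` are hypothetical names, and the per-PR result lists are assumed to have been collected already. One caveat worth keeping in mind: scores computed by separate Lucene indexes depend on each index's own term statistics, so merging purely by raw score, as done here, is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the aggregation step; not an existing Geode or Lucene class. */
final class ParserAggregatorSketch {
    /** One search hit as returned by a single PR's local Lucene instance. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Merges per-PR hit lists, orders by descending score, and trims to limit hits. */
    static List<Hit> aggregate(List<List<Hit>> perPartitionHits, int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> hits : perPartitionHits) {
            merged.addAll(hits);
        }
        merged.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score); // higher score first
            return byScore != 0 ? byScore : a.key.compareTo(b.key); // stable tie-break
        });
        return new ArrayList<>(merged.subList(0, Math.min(limit, merged.size())));
    }
}
```

Each PR only needs to return its own top `limit` hits for the merged top `limit` to be complete, which keeps the amount of data sent to the aggregator bounded.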
Advantages
- Scalability
- Performance
Limitations
- High maintenance
- Complexity
Option - 2: Distributed FS Directory implementation
Here the search request is handled by Lucene, and Lucene's own parser and aggregator are used. DistributedFSDirectory will provide a unified view of the index to Lucene. Lucene will request DistributedFSDirectory to fetch index chunks, and DistributedFSDirectory will aggregate the index chunks from the PRs which host the data. This is similar in behavior to a Cache Client, which reaches different PRs and provides a unified data view to the user.
PlantUML |
---|
() User -> [Cache] : Search node cluster { database { () indexPR1 } [Cache] ..> [PR 1] [PR 1] --> [LucenePR1] [LucenePR1] --> [DistributedFSDirectory] [DistributedFSDirectory] -down-> [FSDirectoryPR1] [FSDirectoryPR1] -> indexPR1 database { () indexPR2 } [DistributedFSDirectory] -down-> [FSDirectoryPR2] [FSDirectoryPR2] -> indexPR2 } |
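The unified view described for DistributedFSDirectory can be illustrated with a small in-memory model. Everything here is an assumption for illustration: `DistributedDirectorySketch`, the `name#chunkNo` key scheme, hash-based routing, and the local chunk-count map (which in a real design would presumably live in the cache as well). It does not implement Lucene's Directory API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of a directory whose file chunks are spread over partitions. */
final class DistributedDirectorySketch {
    // Each map stands in for the chunk store of one PR.
    private final List<Map<String, byte[]>> partitions = new ArrayList<>();
    // File name -> number of chunks (assumed metadata, kept locally for brevity).
    private final Map<String, Integer> chunkCounts = new HashMap<>();
    private final int chunkSize;

    DistributedDirectorySketch(int partitionCount, int chunkSize) {
        if (partitionCount <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("partitionCount and chunkSize must be > 0");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
        this.chunkSize = chunkSize;
    }

    /** Routes a chunk key to the partition that owns it. */
    private Map<String, byte[]> owner(String chunkKey) {
        return partitions.get(Math.floorMod(chunkKey.hashCode(), partitions.size()));
    }

    /** Splits a file into fixed-size chunks and stores each on its owning partition. */
    void writeFile(String name, byte[] data) {
        int chunks = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            String key = name + "#" + chunks++;
            owner(key).put(key, Arrays.copyOfRange(data, off, end));
        }
        chunkCounts.put(name, chunks);
    }

    /** Reassembles a file by fetching every chunk from whichever partition holds it. */
    byte[] readFile(String name) {
        Integer chunks = chunkCounts.get(name);
        if (chunks == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < chunks; i++) {
            String key = name + "#" + i;
            byte[] chunk = owner(key).get(key); // would be a remote fetch in a cluster
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    /** The unified listing presented to the reader of the directory. */
    Set<String> listAll() {
        return new TreeSet<>(chunkCounts.keySet());
    }
}
```

In a cluster, every `owner(key).get(key)` call in `readFile` may cross the network, and the reassembled file is held in memory on the reading node, which is one way to see where the memory and network costs of this option come from.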
...
- Low maintenance
- Full API compliance
- Accurate results
Limitations
- Performance:
- Memory requirement
- Network overhead
Option - 3: Embedded Solr
Here the search request is handled by Solr. Solr distributes queries to Solr agents, and its aggregator is used. SolrCloud solves some issues related to index distribution, but these issues are not relevant if the index is managed in Cache. So the Solr *Distributed Search* seems like a promising solution.
...
PlantUML |
---|
() User -> [Cache] : Search node cluster { database { () indexPR1 } [Cache] ..> [PR 1] [PR 1] --> [SolrServer] [SolrServer] --> [SolrPR1] [SolrPR1] -down-> [FSDirectoryPR1] [FSDirectoryPR1] -> indexPR1 database { () indexPR2 } [SolrServer] --> [SolrPR2] [SolrPR2] -down-> [FSDirectoryPR2] [FSDirectoryPR2] -> indexPR2 } |
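Solr's Distributed Search is driven by a `shards` request parameter listing the cores to query, and the node receiving the request merges the results. A minimal sketch of building such a request URL follows. The host names, the core name `index`, and the helper class `SolrShardsQuerySketch` are hypothetical, and the sketch only constructs the URL without sending it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

/** Hypothetical helper that builds a Solr distributed-search URL; it sends nothing. */
final class SolrShardsQuerySketch {
    /**
     * coordinator and shards are given as host:port/solr/core (no scheme),
     * which is the form the shards parameter expects.
     */
    static String selectUrl(String coordinator, String query, List<String> shards) {
        return "http://" + coordinator + "/select"
                + "?q=" + encode(query)
                + "&shards=" + encode(String.join(",", shards));
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

In the layout of the diagram above, each entry in `shards` would correspond to the Solr instance colocated with one PR (SolrPR1, SolrPR2), and the coordinator would play the role of SolrServer.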
Advantages
- Performance
- Full API compliance
- Accurate results
Limitations
- Solr instance management complexity
- Additional points of failure
Work In Progress
- How many active segment files are maintained per index? It seems one large file remains after a merge. If so, how can a segment be chunked and colocated with the region?
Faceting
Lucene / Solr support flat, JSON and API-based interfaces for faceting
...