Goals

Accurate text search
Reuse code
Scalability
Performance, compared against RAMFSDirectory

User Input

A region and a list of fields to be indexed (the text-searchable fields)
[ Optional ] The Standard Analyzer, or an analyzer implementation, to be used with all the fields in an index
[ Optional ] Field types. A string can be indexed as Text or String in Lucene, and the two behave differently
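
The difference between the two field types can be illustrated with a small sketch. It is plain Python rather than Lucene code, and `index_field` with its lowercase-and-split analysis is a hypothetical stand-in for a real analyzer: a Text field is tokenized into terms, while a String field is indexed verbatim as a single term.

```python
def index_field(value, field_type):
    """Return the set of terms that would be indexed for one field value.

    Hypothetical helper: "text" fields are analyzed (lowercased and split on
    whitespace, standing in for a real analyzer); "string" fields are kept
    verbatim as a single term.
    """
    if field_type == "text":
        return {token.lower() for token in value.split()}
    if field_type == "string":
        return {value}
    raise ValueError("unknown field type: " + field_type)


text_terms = index_field("Apache Geode Rocks", "text")
string_terms = index_field("Apache Geode Rocks", "string")

# A single-word query matches the Text field but not the String field.
print("geode" in text_terms)                 # True
print("geode" in string_terms)               # False
# The String field only matches the exact, complete value.
print("Apache Geode Rocks" in string_terms)  # True
```

This is why the field type matters as user input: the same value is searchable by word in one case and only by exact match in the other.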

Index Persistence

Lucene context

An index batch write (IndexWriter.close()) creates a new set of segment files. This can trigger a segment merge, which can be resource-intensive (think compaction in an LSM tree).
A large number of segments increases search latency.
Lucene buffers documents in memory (writer.setMaxBufferedDocs and writer.setRAMBufferSizeMB). A larger RAM buffer means larger segments, which means less merging later.
Searchers will not see any changes until the IndexWriter is closed.
Optimizations
1. If a large amount of data is to be indexed, it is better to build N smaller indexes and combine them using writer.addIndexesNoOptimize.
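
The effect of the in-memory buffer on segment count can be approximated with simple arithmetic. The sketch below is plain Python, not Lucene code. `flushed_segments` is a hypothetical helper that assumes exactly one segment is flushed each time the document buffer fills, and it ignores RAM-size-triggered flushes and merging.

```python
import math


def flushed_segments(num_docs, max_buffered_docs):
    """Segments written if one segment is flushed per full document buffer."""
    if max_buffered_docs <= 0:
        raise ValueError("max_buffered_docs must be positive")
    return math.ceil(num_docs / max_buffered_docs)


# Same workload, two buffer sizes (illustrative numbers only).
small_buffer = flushed_segments(1_000_000, 1_000)
large_buffer = flushed_segments(1_000_000, 100_000)

print(small_buffer)  # 1000
print(large_buffer)  # 10
```

Under these assumptions, a buffer 100 times larger leaves 100 times fewer segments to merge later, at the cost of more memory held by the writer.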

Approach

PlantUML
() PUTs -> [Cache]
[Cache] .down.> [Async Queue]
[Async Queue] -right-> [Lucene Indexer]
[Lucene Indexer] -up-> [GeodeFSDirectory]
[GeodeFSDirectory] -left-> [Cache]
[Cache] -up-> () Search

...

PlantUML

() User -down-> [Cache] : PUTs
node cluster {
 database {
 () indexPR1
 }

 [Cache] ..> [PR 1]
 [PR 1] -down-> [FSDirectoryPR1]
 [FSDirectoryPR1] -> indexPR1
 
 database {
 () indexPR2
 }

 [Cache] ..> [PR 2]
 [PR 2] -down-> [FSDirectoryPR2]
 [FSDirectoryPR2] -> indexPR2
}

Limitations

Text Search

Option - 1: Custom Parser Aggregator

A search request is intercepted by a custom ParserAggregator. This component distributes the search query to all PRs, and each PR routes the request to its local Lucene instance. The results are routed back to the ParserAggregator, which reorders and trims the aggregated result set and returns it to the user.
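
The scatter-gather flow described above can be sketched as follows. This is plain Python under stated assumptions, not the proposed implementation: `search_partition` and `aggregate` are hypothetical names, each PR is modeled as a dict of key to text, and the score is a toy term count rather than Lucene's scoring.

```python
import heapq


def search_partition(docs, term, limit):
    """Local search on one PR: return the top `limit` (score, key) hits."""
    hits = []
    for key, text in docs.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, key))
    return heapq.nlargest(limit, hits)


def aggregate(partitions, term, limit):
    """Send the query to every PR, then reorder and trim the merged hits."""
    partial = []
    for docs in partitions:
        partial.extend(search_partition(docs, term, limit))
    return heapq.nlargest(limit, partial)


pr1 = {"k1": "lucene index on geode", "k2": "geode region"}
pr2 = {"k3": "geode geode cache", "k4": "solr search"}

top = aggregate([pr1, pr2], "geode", 2)
print(top)  # [(2, 'k3'), (1, 'k2')]
```

Each PR only needs to return its own top `limit` hits, since no hit outside a local top list can appear in the global top list of the same size.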

PlantUML

() User -> [Cache] : Search
node cluster {
 database {
 () indexPR1
 }

 [Cache] ..> [PR 1]
 [PR 1] --> [ParserAggregator]
 [ParserAggregator] --> [LucenePR1]
 [LucenePR1] --> [FSDirectoryPR1]
 [FSDirectoryPR1] -> indexPR1
 
 database {
 () indexPR2
 }

 [ParserAggregator] --> [LucenePR2]
 [LucenePR2] --> [FSDirectoryPR2]
 [FSDirectoryPR2] -> indexPR2
}

Advantages

Scalability
Performance

Limitations

High maintenance
Complexity

Option - 2: Distributed FS Directory implementation

Here the search request is handled by Lucene, and Lucene's own parser and aggregator are used. DistributedFSDirectory provides a unified view to Lucene. Lucene asks DistributedFSDirectory to fetch index chunks, and DistributedFSDirectory aggregates the chunks from the PRs that host the data. This is similar in behavior to a Cache Client, which reaches different PRs and provides a unified data view to the user.
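
A minimal sketch of the unified view, assuming index files are split into fixed-size chunks keyed by file name and chunk number. It is plain Python, not a Lucene Directory implementation. The class name, the chunk size, and the routing rule in `_partition_for` are all hypothetical.

```python
class DistributedDirectorySketch:
    """Unified file view over chunks spread across partitions (hypothetical)."""

    def __init__(self, num_partitions, chunk_size):
        self.chunk_size = chunk_size
        # One dict per partition, standing in for a PR's local store.
        self.partitions = [{} for _ in range(num_partitions)]
        self.chunk_counts = {}  # file name -> number of chunks

    def _partition_for(self, name, chunk_no):
        # Assumed routing rule: a per-file offset plus the chunk number.
        index = (sum(name.encode()) + chunk_no) % len(self.partitions)
        return self.partitions[index]

    def write_file(self, name, data):
        """Split `data` into chunks and store each on its partition."""
        size = self.chunk_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        for chunk_no, chunk in enumerate(chunks):
            self._partition_for(name, chunk_no)[(name, chunk_no)] = chunk
        self.chunk_counts[name] = len(chunks)

    def read_file(self, name):
        """Fetch every chunk from its partition and reassemble the file."""
        count = self.chunk_counts[name]
        return b"".join(
            self._partition_for(name, n)[(name, n)] for n in range(count)
        )


directory = DistributedDirectorySketch(num_partitions=2, chunk_size=4)
directory.write_file("_0.cfs", b"segment-bytes")
data = directory.read_file("_0.cfs")
print(data)  # b'segment-bytes'
```

In this sketch, reading one file touches every partition that holds one of its chunks. That is where the network overhead of this option comes from.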

PlantUML

() User -> [Cache] : Search
node cluster {
 database {
 () indexPR1
 }

 [Cache] ..> [PR 1]
 [PR 1] --> [LucenePR1]
 [LucenePR1] --> [DistributedFSDirectory]
 [DistributedFSDirectory] -down-> [FSDirectoryPR1]
 [FSDirectoryPR1] -> indexPR1
 
 database {
 () indexPR2
 }

 [DistributedFSDirectory] -down-> [FSDirectoryPR2]
 [FSDirectoryPR2] -> indexPR2
}

...

Low maintenance
Full API compliance
Accurate results

Limitations

Performance:
Memory requirement
Network overhead

Option - 3: Embedded Solr

Here the search request is handled by Solr. Solr distributes queries to Solr agents, and its aggregator is used. SolrCloud solves some issues related to index distribution, but these issues are not relevant if the index is managed in Cache. So the Solr *Distributed Search* seems like a promising solution.

...

PlantUML

() User -> [Cache] : Search
node cluster {
 database {
 () indexPR1
 }

 [Cache] ..> [PR 1]
 [PR 1] --> [SolrServer]
 [SolrServer] --> [SolrPR1]
 [SolrPR1] -down-> [FSDirectoryPR1]
 [FSDirectoryPR1] -> indexPR1
 
 database {
 () indexPR2
 }

 [SolrServer] --> [SolrPR2]
 [SolrPR2] -down-> [FSDirectoryPR2]
 [FSDirectoryPR2] -> indexPR2
}

Advantages

Performance
Full API compliance
Accurate results

Limitations

Solr instance management complexity
Additional points of failure

Option - 4: IndexWriter and MultiReader implementation

A custom implementation of IndexWriter and IndexReader could be provided as an alternative to an FSDirectory implementation. FSDirectory is a file-like interface: Lucene constructs a file and hands it over to FSDirectory for writes and reads, and Lucene manages file merges. The directory implementation has no visibility into the contents of the file. The IndexWriter approach is one layer above FSDirectory. Lucene interacts with the IndexReader/IndexWriter layer at document and term level granularity. The following are the important classes and methods to look at:

org.apache.lucene.index.MultiReader: An IndexReader which reads multiple indexes, appending their content.
1. termDocs(Term term): Returns an enumeration of all the documents which contain term.
2. termPositions: Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available.
org.apache.lucene.index.IndexWriter
1. updateDocument, addDocument

IndexWriter can control how the terms are distributed and persisted. In the case of a distributed search, MultiReader can distribute the query to shard-based sub-readers, and each sub-reader streams filtered results from its shard to the query coordinator.
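
The way a MultiReader-style coordinator appends the content of its sub-readers can be sketched with document-id offsets. This is plain Python, not Lucene code. `multi_term_docs` is a hypothetical function, and each shard is modeled as a pair of its document count and a postings dict.

```python
def multi_term_docs(sub_readers, term):
    """Yield global doc ids for `term` across shard sub-readers.

    Each sub-reader is a (max_doc, postings) pair, where postings maps a
    term to a sorted list of shard-local doc ids. A global id is the local
    id plus the doc base, i.e. the total size of the shards before it.
    """
    doc_base = 0
    for max_doc, postings in sub_readers:
        for local_id in postings.get(term, []):
            yield doc_base + local_id  # streamed one hit at a time
        doc_base += max_doc


shard1 = (3, {"geode": [0, 2], "lucene": [1]})
shard2 = (2, {"geode": [1]})

result = list(multi_term_docs([shard1, shard2], "geode"))
print(result)  # [0, 2, 4]
```

Because the function is a generator, the coordinator can consume hits as they arrive instead of waiting for every shard to finish.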

A map of the form <term, map <docId, list <position>>> is needed to support various Lucene functions.
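
A minimal sketch of such a map, assuming whitespace tokenization and integer document ids. It is plain Python, not Lucene code. `PositionalIndex` is a hypothetical class whose `term_docs` and `term_positions` methods only mirror the shape of the reader methods listed above.

```python
from collections import defaultdict


class PositionalIndex:
    """In-memory map of term -> {doc_id -> [positions]} (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Record the ordinal position of every term in the document."""
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(position)

    def term_docs(self, term):
        """Return (doc_id, frequency) pairs for documents containing term."""
        entries = self.postings.get(term, {})
        return [(doc_id, len(pos)) for doc_id, pos in sorted(entries.items())]

    def term_positions(self, term):
        """Return {doc_id: [positions]} for documents containing term."""
        entries = self.postings.get(term, {})
        return {doc_id: list(pos) for doc_id, pos in entries.items()}


index = PositionalIndex()
index.add_document(1, "geode stores geode regions")
index.add_document(2, "lucene indexes regions")

print(index.term_docs("geode"))         # [(1, 2)]
print(index.term_positions("geode"))    # {1: [0, 2]}
print(index.term_docs("regions"))       # [(1, 1), (2, 1)]
```

The positions list is what makes phrase and proximity queries possible. The frequency used for scoring is simply its length.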

Limitations

A popular term will have a large value (a map of documents to the positions of the term in each document). Managing such a large value needs to be efficient.

Work In Progress

How many active segment files are maintained per index? It seems that one large file remains after a merge. If so, how can a segment be chunked and colocated with the region?

Faceting

Lucene / Solr support flat, JSON and API-based interfaces for faceting.

...
