...

Questions should only be added to this Wiki page when they already have an answer that can be added at the same time.

Lucene FAQ

General

How do I start using Lucene?

Lucene has no external dependencies, so just add lucene-core-x.y-dev.jar to your development environment's classpath. After that,

If you think Lucene is too low-level for you, you might want to consider using Solr, which usually requires less Java programming.

...

  • Always make sure that you explicitly close all file handles you open, especially in case of errors. Use a try/catch/finally block to open the files, i.e. open them in the try block, close them in the finally block. Remember that Java doesn't have destructors, so don't close file handles in a finalize method – this method is not guaranteed to be executed.
  • Use the compound file format (activated by default starting with Lucene 1.4) by calling IndexWriter's setUseCompoundFile(true).
  • Don't set IndexWriter's mergeFactor to large values. Large values speed up indexing but increase the number of files that need to be opened simultaneously.
  • Make sure you only open one IndexSearcher, and share it among all of the threads that are doing searches – this is safe, and it will minimize the number of files that are open concurrently.
  • Try to increase the number of files that can be opened simultaneously. On Linux using bash this can be done by calling ulimit -n <number>.
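The first point above can be sketched as follows. This is a minimal illustration that uses only the JDK, not Lucene; the class name CloseInFinally and the helper roundTrip are made up for the example. The file is opened inside the try block and closed in the finally block, so the handle is released even when reading fails.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class CloseInFinally {

    // Opens the file in the try block and closes it in the finally block,
    // so the handle is released on success and on failure alike.
    static String readFirstLine(File file) throws IOException {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
            return reader.readLine();
        } catch (IOException e) {
            System.err.println("Could not read " + file + ": " + e.getMessage());
            throw e;
        } finally {
            if (reader != null) {
                reader.close(); // runs whether or not an exception was thrown
            }
        }
    }

    // Hypothetical helper for the demo: writes text to a temp file, reads the
    // first line back, and always deletes the temp file afterwards.
    static String roundTrip(String text) {
        try {
            File tmp = File.createTempFile("close-demo", ".txt");
            try {
                Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
                return readFirstLine(tmp);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello\nworld\n"));
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code.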

When I compile Lucene x.y.z from source, the version number in the jar file name and MANIFEST.MF is different. What's up with that?

...

  1. Make sure you have looked through the BasicsOfPerformance
  2. Describe your problem, giving details about how you are using Lucene
  3. What version of Lucene are you using? What JDK? Can you upgrade to the latest?
  4. Make sure it truly is a Lucene problem. That is, isolate the problem and/or profile your application.
  5. Search the java-user and java-dev mailing lists; see http://lucene.apache.org/java/docs/mailinglists.html

What does l.a.o and o.a.l.xxxx stand for?

...

  • The desired term is in a field that was not defined as 'indexed'. Re-index the document and make the field indexed.
  • The term is in a field that was not tokenized during indexing and therefore, the entire content of the field was considered as a single term. Re-index the documents and make sure the field is tokenized.
  • The field specified in the query simply does not exist. You won't get an error message in this case, you'll just get no matches.
  • The field specified in the query has wrong case. Field names are case sensitive.
  • The term you are searching is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
  • You are using different analyzers (or the same analyzer but with different stop words) for indexing and searching and as a result, the same term is transformed differently during indexing and searching.
  • The analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
  • The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int).
  • Make sure to open a new IndexSearcher after adding documents. An IndexSearcher will only see the documents that were in the index when it was opened.
  • If you are using the QueryParser, it may not be parsing your BooleanQuerySyntax the way you think it is.
  • Span and phrase queries won't work if omitTf() has been called for a field, since that prevents positional information about tokens from being saved in the index. Span queries and phrase queries require this positional information in order to work.
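Several of the causes above (stop words, case sensitivity, different analyzers at index and search time) come down to the indexed terms and the query terms not being identical strings. The following toy model, which assumes nothing about Lucene's real classes, may help to picture this. The analyze method is a hypothetical stand-in for an Analyzer, and matches is a hypothetical stand-in for a search that requires all query terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerMismatch {

    // Hypothetical stand-in for an analyzer: splits on whitespace,
    // optionally lowercases, and drops stop words.
    static List<String> analyze(String text, boolean lowercase, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = lowercase ? raw.toLowerCase(Locale.ROOT) : raw;
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A query "hits" only if it has at least one term and every term
    // is present in the index exactly as written.
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> stop = Collections.singleton("the");
        Set<String> none = Collections.emptySet();
        Set<String> index = new HashSet<>(analyze("The Quick Fox", true, stop));

        // Same analysis at search time: the term is found.
        System.out.println(matches(index, analyze("Quick", true, stop)));
        // Case-sensitive analysis at search time: "Quick" is not in the index.
        System.out.println(matches(index, analyze("Quick", false, none)));
        // Stop word: nothing is left to search for.
        System.out.println(matches(index, analyze("the", true, stop)));
    }
}
```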

If none of the possible causes above apply to your case, this will help you to debug the problem:

  • Use the Query's toString() method to see how it actually got parsed.
  • Use Luke to browse your index: on the "Documents" tab, navigate to a document, then use the "Reconstruct & Edit" button to look at how the fields have been stored ("Stored original" tab) and indexed ("Tokenized" tab).

Why am I getting a TooManyClauses exception?

The following types of queries are expanded by Lucene before it does the search: RangeQuery, PrefixQuery, WildcardQuery, FuzzyQuery. For example, if the indexed documents contain the terms "car" and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place. The number of these terms is limited to 1024 by default. Here are a few different approaches that can be used to avoid the TooManyClauses exception:

  • Use a filter to replace the part of the query that causes the exception. For example, a RangeFilter can replace a RangeQuery on date fields and it will never throw the TooManyClauses exception. You can even use ConstantScoreRangeQuery to execute your RangeFilter as a Query. Note that filters are slower than queries when used for the first time, so you should cache them using CachingWrapperFilter. Using Filters in place of Queries generated by QueryParser can be achieved by subclassing QueryParser and overriding the appropriate method to return a ConstantScore version of your Query.
  • Increase the number of terms using BooleanQuery.setMaxClauseCount(). Note that this will increase the memory requirements for searches that expand to many terms. To deactivate any limits, use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE).
  • A specific solution that can work on very precise fields is to reduce the precision of the data in order to reduce the number of terms in the index. For example, the DateField class uses a millisecond resolution, which is often not required. Instead you can save your dates in the "yyyyMMddHHmm" format, maybe even without hours and minutes if you don't need them (this was simplified in Lucene 1.9 thanks to the new DateTools class).
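To see why reduced precision helps, here is a small sketch that formats timestamps at minute and at day resolution and counts the distinct terms that result. It is only an illustration: it uses the JDK's java.time classes rather than Lucene's DateField or DateTools, and the class and method names are invented for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DateTermResolution {

    // Year, month, day, hour, minute; note that MM is the month and mm the minute.
    static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("yyyyMMddHHmm");
    // Year, month, day only.
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    static String toMinuteTerm(LocalDateTime t) {
        return t.format(MINUTE);
    }

    static String toDayTerm(LocalDateTime t) {
        return t.format(DAY);
    }

    // Counts how many distinct terms the given timestamps produce
    // when formatted at the given resolution.
    static int distinctTerms(List<LocalDateTime> times, DateTimeFormatter resolution) {
        Set<String> terms = new TreeSet<>();
        for (LocalDateTime t : times) {
            terms.add(t.format(resolution));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        List<LocalDateTime> times = Arrays.asList(
                LocalDateTime.of(2006, 3, 14, 9, 26, 53),
                LocalDateTime.of(2006, 3, 14, 9, 26, 58),
                LocalDateTime.of(2006, 3, 14, 17, 5, 0));

        System.out.println(toMinuteTerm(times.get(0)));
        System.out.println(toDayTerm(times.get(0)));
        System.out.println("minute resolution: " + distinctTerms(times, MINUTE) + " terms");
        System.out.println("day resolution: " + distinctTerms(times, DAY) + " terms");
    }
}
```

The fewer distinct terms a field has, the fewer clauses a range, prefix or wildcard query on that field can expand to.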

How can I search over multiple fields?

...

This is not supported by QueryParser, but you could extend the QueryParser to build a MultiPhraseQuery in those cases.

Is the QueryParser thread-safe?

No, it's not.
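A common way to work with an object that is not thread-safe is to give each thread its own instance (or simply to create a new one per query). The sketch below shows the per-thread pattern with the JDK's ThreadLocal. HypotheticalParser is a made-up stand-in with mutable internal state; it is not Lucene's QueryParser, whose construction would need a field name and an Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadParser {

    // Hypothetical stand-in for a parser that keeps mutable state between
    // calls and therefore must not be shared across threads.
    static final class HypotheticalParser {
        private final StringBuilder scratch = new StringBuilder();

        String parse(String input) {
            scratch.setLength(0);
            scratch.append("parsed(").append(input.trim()).append(")");
            return scratch.toString();
        }
    }

    // Each thread lazily gets its own parser instance.
    private static final ThreadLocal<HypotheticalParser> PARSER =
            ThreadLocal.withInitial(HypotheticalParser::new);

    static String parse(String input) {
        return PARSER.get().parse(input);
    }

    // Parses all inputs on a small thread pool; results keep the input order.
    static List<String> parseAll(List<String> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String input : inputs) {
                Callable<String> task = () -> parse(input);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList(" title:lucene ", "body:search"), 2));
    }
}
```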

How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this?

...

Then grab the last Term in TermDocs that this method returns.

Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other?

MultiSearcher searches indices sequentially. To search multiple indices in parallel, use ParallelMultiSearcher. Please note that there's a known bug in Lucene < 1.9 in the MultiSearcher's result ranking.

...

No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method.

Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes?

According to the Javadoc, IndexReader's maxDoc() method "returns one greater than the largest possible document number".

...

Also consider using a JSP tag for caching, see http://www.opensymphony.com/oscache/ for one tag library that's easy and works well.

Is the IndexSearcher thread-safe?

Yes, IndexSearcher is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory.

...

  • Use QueryFilter with the previous query as the filter. Doug Cutting recommends against this, because a QueryFilter does not affect ranking.
  • Combine the previous query with the current query using BooleanQuery, using the previous query as required.

The BooleanQuery approach is the recommended one.

...

  • Stored = as-is value stored in the Lucene index
  • Tokenized = field is analyzed using the specified Analyzer - the tokens emitted are indexed
  • Indexed = the text (either as-is with keyword fields, or the tokens from tokenized fields) is made searchable (aka inverted)
  • Vectored = term frequency per document is stored in the index in an easily retrievable fashion.
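The difference between stored, tokenized and indexed can be pictured with a toy model. The sketch below is not how Lucene is implemented; it is a deliberately small stand-in with invented names that keeps a map of stored values and a map from terms to document numbers (the inverted part). Term vectors are not modelled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class FieldFlagsToy {

    // "Stored": the as-is value, retrievable per document.
    final Map<Integer, String> stored = new HashMap<>();
    // "Indexed": term -> documents containing it (the inverted structure).
    final Map<String, SortedSet<Integer>> inverted = new TreeMap<>();

    void add(int docId, String value, boolean store, boolean index, boolean tokenize) {
        if (store) {
            stored.put(docId, value);
        }
        if (!index) {
            return; // not searchable at all
        }
        // "Tokenized": split into lowercased words; otherwise the whole
        // value becomes one single term, as with keyword fields.
        String[] terms = tokenize
                ? value.toLowerCase(Locale.ROOT).split("\\W+")
                : new String[] { value };
        for (String term : terms) {
            if (!term.isEmpty()) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String term) {
        SortedSet<Integer> docs = inverted.get(term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        FieldFlagsToy toy = new FieldFlagsToy();
        toy.add(1, "Lucene in Action", true, true, true);   // stored, indexed, tokenized
        toy.add(2, "ISBN-1932394281", true, true, false);   // stored, indexed as one term
        toy.add(3, "hidden text", true, false, false);      // stored only

        System.out.println(toy.search("lucene"));
        System.out.println(toy.search("ISBN-1932394281"));
        System.out.println(toy.search("isbn"));
        System.out.println(toy.search("hidden") + " but stored: " + toy.stored.get(3));
    }
}
```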

What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?

No, there will be multiple copies of the same document in the index.

...

The components responsible for this are various Analyzers. Make sure you use the appropriate analyzer. For example, StandardAnalyzer does not remove numbers, but it removes most punctuation.


Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe?

Yes, the IndexWriter.addIndexes(Directory[]) method is thread safe (it is a synchronized method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. In fact, it is impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception when the second writer tries to create the lock file.

When is it possible for document IDs to change?

...

See also the article "Parsing, indexing, and searching XML with Digester and Lucene".

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?

Have a look at Tika, the content analysis toolkit.

...

How can I index PDF documents?

...

Note that the article uses an older version of Apache Lucene. For parsing the Java source files and extracting that information, the ASTParser of the Eclipse Java Development Tools is used.


What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?


When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

...