Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The components responsible for this are various Analyzers. Make sure you use the appropriate analyzer. For examaple, StandardAnaylzer does not remove numbers, but it removes most punctuation.


Wiki Markup
Is the [IndexWriter] class, and especially the method addIndexes(Directory\[\]) thread safe?



Wiki Markup
Yes, {{IndexWriter.addIndexes(Directory\[\])}} method is thread safe (it is a {{synchronized}} method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception trying to create the lock file.


When is it possible for document IDs to change?

...

In order to index XML documents you need to first parse them to extract text that you want to index from them. Have a look at Tika, the content analysis toolkit.

See also this article Parsing, indexing, and searching XML with Digester and Lucene.

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?

...

Note that the article uses an older version of apache lucene. For parsing the java source files and extracting that information, the ASTParser of the eclipse java development tools is used.


Wiki Markup
What is the difference between [IndexWriter].addIndexes(IndexReader\[\]) and [IndexWriter].addIndexes(Directory\[\]), besides them taking different arguments?


When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

...