Page History

...

The components responsible for this are various Analyzers. Make sure you use the appropriate analyzer. For examaple, StandardAnaylzer does not remove numbers, but it removes most punctuation.

Wiki Markup
Is the [IndexWriter] class, and especially the method addIndexes(Directory\[\]) thread safe?

Wiki Markup

Yes, {{IndexWriter.addIndexes(Directory\[\])}} method is thread safe (it is a {{synchronized}} method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception trying to create the lock file.

When is it possible for document IDs to change?

...

In order to index XML documents you need to first parse them to extract text that you want to index from them. Have a look at Tika, the content analysis toolkit.

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?

...

Note that the article uses an older version of apache lucene. For parsing the java source files and extracting that information, the ASTParser of the eclipse java development tools is used.

Wiki Markup
What is the difference between [IndexWriter].addIndexes(IndexReader\[\]) and [IndexWriter].addIndexes(Directory\[\]), besides them taking different arguments?

When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

...

Space shortcuts

Page tree

Versions Compared

Old Version 72

New Version 73

Key

When is it possible for document IDs to change?

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?