The purpose of the Tika content extractor is to provide essential support for extracting text from supported document formats. Depending on the document format, metadata such as author, modified, created and page count is extracted as well. The TikaContentExtractor accepts a list of supported parsers and returns the extracted metadata together with the extracted content in the desired format (raw text by default). For example:

Code Block
java
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), true);
TikaContent content = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()));
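
The returned TikaContent can then be queried for the extracted text and metadata. The snippet below is a minimal sketch continuing the example above, and assumes TikaContent exposes getContent() and getMetadata() (Tika Metadata) accessors:

Code Block
java
// Continuation of the example above (a sketch, assuming getContent()/getMetadata() accessors).
String text = content.getContent();
org.apache.tika.metadata.Metadata metadata = content.getMetadata();
// Individual metadata entries, such as the author, can be read by name.
String author = metadata.get("Author");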

By default, the TikaContentExtractor also performs content type detection and validation, which can be turned off using the 'validateMediaType' constructor argument.
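
For example, detection and validation can be skipped by passing 'false' as the second constructor argument; the sketch below simply reuses the constructor shown above:

Code Block
java
// Content type detection and validation disabled (validateMediaType = false).
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), false);
TikaContent content = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()));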

Using TikaLuceneContentExtractor

The TikaLuceneContentExtractor is very similar to TikaContentExtractor, but instead of the raw content and metadata it returns a prepared Lucene Document. However, in order to create a Lucene Document which is ready to be indexed, TikaLuceneContentExtractor accepts an additional parameter, LuceneDocumentMetadata, with the field types and type converters. For example:

Code Block
java
LuceneDocumentMetadata documentMetadata = new LuceneDocumentMetadata("contents").withField("modified", Date.class);
TikaLuceneContentExtractor extractor = new TikaLuceneContentExtractor(new PDFParser(), true);
Document document = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()), documentMetadata);

At this point, the document is ready to be analyzed and indexed. The TikaLuceneContentExtractor uses LuceneDocumentMetadata to create properly typed document fields and currently supports DoubleField, FloatField, LongField, IntField, TextField (for the content) and StringField (also used to store dates).
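
Fields of different types can be declared by chaining withField calls on LuceneDocumentMetadata. The sketch below is illustrative only: the 'pageCount' and 'score' field names are hypothetical, and the declared Java types are assumed to map to the corresponding Lucene field types listed above:

Code Block
java
// Illustrative sketch: 'pageCount' and 'score' are hypothetical field names.
// Long -> LongField, Double -> DoubleField, Date -> StringField, content -> TextField.
LuceneDocumentMetadata documentMetadata = new LuceneDocumentMetadata("contents")
    .withField("modified", Date.class)
    .withField("pageCount", Long.class)
    .withField("score", Double.class);
TikaLuceneContentExtractor extractor = new TikaLuceneContentExtractor(new PDFParser(), true);
Document document = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()), documentMetadata);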