The purpose of the Tika content extractor is to provide essential support for extracting text from supported document formats. Depending on the document format, metadata such as author, modified, created and page count is extracted as well. The TikaContentExtractor accepts a list of supported parsers and returns the extracted metadata together with the extracted content in the desired format (raw text by default). For example:

Code Block
java
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), true);
TikaContent content = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()));
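
The returned TikaContent can then be queried for the extracted text and metadata. The snippet below is a minimal sketch continuing the example above, and assumes TikaContent exposes getContent() and getMetadata() (Tika Metadata) accessors:

Code Block
java
// Continuation of the example above (a sketch, assuming getContent()/getMetadata() accessors).
String text = content.getContent();
org.apache.tika.metadata.Metadata metadata = content.getMetadata();
// Individual metadata entries, such as the author, can be read by name.
String author = metadata.get("Author");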

By default, the TikaContentExtractor also performs content type detection and validation, which can be turned off using the 'validateMediaType' constructor argument.
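
For example, detection and validation can be skipped by passing 'false' as the second constructor argument; the sketch below simply reuses the constructor shown above:

Code Block
java
// Content type detection and validation disabled (validateMediaType = false).
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), false);
TikaContent content = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()));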

Using TikaLuceneContentExtractor

The TikaLuceneContentExtractor is very similar to TikaContentExtractor, but instead of the raw content and metadata it returns a prepared Lucene Document. However, in order to create a Lucene Document which is ready to be indexed, TikaLuceneContentExtractor accepts an additional parameter, LuceneDocumentMetadata, with the field types and type converters. For example:

Code Block
java
LuceneDocumentMetadata documentMetadata = new LuceneDocumentMetadata("contents").withField("modified", Date.class);
TikaLuceneContentExtractor extractor = new TikaLuceneContentExtractor(new PDFParser(), true);
Document document = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()), documentMetadata);

At this point, the document is ready to be analyzed and indexed. The TikaLuceneContentExtractor uses LuceneDocumentMetadata to create properly typed document fields and currently supports DoubleField, FloatField, LongField, IntField, TextField (for the content) and StringField (also used to store dates).
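
Fields of different types can be declared by chaining withField calls on LuceneDocumentMetadata. The sketch below is illustrative only: the 'pageCount' and 'score' field names are hypothetical, and the declared Java types are assumed to map to the corresponding Lucene field types listed above:

Code Block
java
// Illustrative sketch: 'pageCount' and 'score' are hypothetical field names.
// Long -> LongField, Double -> DoubleField, Date -> StringField, content -> TextField.
LuceneDocumentMetadata documentMetadata = new LuceneDocumentMetadata("contents")
    .withField("modified", Date.class)
    .withField("pageCount", Long.class)
    .withField("score", Double.class);
TikaLuceneContentExtractor extractor = new TikaLuceneContentExtractor(new PDFParser(), true);
Document document = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()), documentMetadata);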