Page History

...

Custom org.apache.cxf.jaxrs.ext.search.SearchConditionParser implementations can be registered as a "search.parser" contextual property starting from CXF 3.0.0-milestone2.

Content Extraction

Starting from CXF 3.0.2, the content extraction support has been added in order to complement the search capabilites with text extraction from various document formats (PDF, ODF, DOC,TXT,RTF,...). It is based on Apache Tika and is available in two shapes: raw content extraction (TikaContentExtractor) and Lucene document content extraction (TikaLuceneContentExtractor).

Using TikaContentExtractor

The purpose of Tika content extractor is to provide the essential support of text extraction from supported document formats. Additionally, the metadata is being extracted as well depending on the document format (author, modified, created, pages, ...). The TikaContentExtractor accepts the list of supported parsers and returns the extracted metadata together with the desired extracted content format. For example:

Code Block

	java
	java

TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), true);
TikaContent content = extractor .extract( Files.newInputStream( new File( "testPDF.pdf" ).toPath() ) );

By default, the TikaContentExtractor also performs the content type detection and validation, which could be turned off using the 'validateMediaType' constructor argument.

Child pages

Versions Compared

Old Version 18

New Version 19

Key

Content Extraction

Using TikaContentExtractor