Custom org.apache.cxf.jaxrs.ext.search.SearchConditionParser implementations can be registered as a "search.parser" contextual property starting from CXF 3.0.0-milestone2.

Content Extraction

Starting from CXF 3.0.2, content extraction support complements the search capabilities with text extraction from various document formats (PDF, ODF, DOC, TXT, RTF, ...). It is based on Apache Tika and is available in two forms: raw content extraction (TikaContentExtractor) and Lucene document content extraction (TikaLuceneContentExtractor).

Using TikaContentExtractor

The Tika content extractor provides the essential support for extracting text from the supported document formats. Depending on the document format, metadata (author, modified, created, pages, ...) is extracted as well. The TikaContentExtractor accepts the list of supported parsers and returns the extracted metadata together with the extracted content in the desired format. For example:

Code Block
java
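// Support PDF documents only; the boolean is the second constructor argument.
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), true);
// Files and Paths come from java.nio.file.
TikaContent content = extractor.extract(Files.newInputStream(Paths.get("testPDF.pdf")));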

By default, the TikaContentExtractor also performs content type detection and validation, which can be turned off using the 'validateMediaType' constructor argument.
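
To make the idea of content type detection and validation more concrete, the following is a minimal, self-contained sketch of the concept. It is not how CXF or Tika implement it: the class MediaTypeCheckSketch and the helpers detect and shouldParse are hypothetical names introduced purely for illustration, and the detection only looks at the leading bytes of two formats. It assumes that validation means "skip documents whose detected type is not among the supported ones" and that turning validation off hands every document to the parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MediaTypeCheckSketch {

    // Hypothetical helper (not a CXF or Tika API): guesses a media type
    // from the leading "magic" bytes of a document.
    static String detect(byte[] data) {
        int n = Math.min(data.length, 5);
        String prefix = new String(data, 0, n, StandardCharsets.US_ASCII);
        if (prefix.equals("%PDF-")) {
            return "application/pdf";
        }
        if (prefix.equals("{\\rtf")) {
            return "application/rtf";
        }
        return "application/octet-stream"; // unknown type
    }

    // Hypothetical helper: mirrors the idea of a 'validateMediaType' switch.
    // With validation off, every document is passed on to the parser.
    static boolean shouldParse(byte[] data, Set<String> supported, boolean validateMediaType) {
        return !validateMediaType || supported.contains(detect(data));
    }

    public static void main(String[] args) {
        // Assume only PDF is supported, as in the PDFParser example.
        Set<String> supported = new HashSet<>(Arrays.asList("application/pdf"));
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] rtf = "{\\rtf1\\ansi}".getBytes(StandardCharsets.US_ASCII);

        System.out.println(shouldParse(pdf, supported, true));  // true
        System.out.println(shouldParse(rtf, supported, true));  // false
        System.out.println(shouldParse(rtf, supported, false)); // true
    }
}
```

With validation enabled, an unsupported document is rejected before any parsing is attempted. With validation disabled, the check is skipped and it is up to the parser to cope with the input.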