...
Custom org.apache.cxf.jaxrs.ext.search.SearchConditionParser implementations can be registered as a "search.parser" contextual property starting from CXF 3.0.0-milestone2.
Content Extraction
Starting from CXF 3.0.2, content extraction support complements the search capabilities with text extraction from various document formats (PDF, ODF, DOC, TXT, RTF, ...). It is based on Apache Tika and comes in two forms: raw content extraction (TikaContentExtractor) and Lucene document content extraction (TikaLuceneContentExtractor).
Using TikaContentExtractor
TikaContentExtractor provides basic text extraction from the supported document formats. Depending on the document format, it also extracts metadata (author, modified, created, pages, ...). The TikaContentExtractor accepts the list of supported parsers and returns the extracted metadata together with the extracted content in the desired format. For example:
TikaContentExtractor extractor = new TikaContentExtractor(new PDFParser(), true);
TikaContent content = extractor.extract(Files.newInputStream(new File("testPDF.pdf").toPath()));
By default, the TikaContentExtractor also performs content type detection and validation, which can be turned off using the 'validateMediaType' constructor argument.
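To make the effect of media type validation more concrete, the following is a simplified, self-contained model in plain Java. It is not CXF or Tika code: the class MediaTypeValidationSketch, its methods, the magic-byte detection and the use of an empty Optional to signal a rejected document are all assumptions made for illustration only. How the real TikaContentExtractor reports an unsupported document should be checked against the CXF Javadoc.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of "detect the media type, then validate it against
// the types the configured parsers support". Not the CXF implementation.
final class MediaTypeValidationSketch {

    // Toy detector: treats data starting with "%PDF-" as PDF, anything else as plain text.
    static String detectMediaType(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return "text/plain";
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return "text/plain";
            }
        }
        return "application/pdf";
    }

    // Returns the "extracted" text, or an empty Optional when validation is
    // enabled and the detected media type is not in supportedTypes.
    static Optional<String> extract(byte[] data, Set<String> supportedTypes, boolean validateMediaType) {
        if (validateMediaType && !supportedTypes.contains(detectMediaType(data))) {
            return Optional.empty(); // assumption: rejected before any parsing happens
        }
        // Stand-in for real parsing: decode the bytes as UTF-8 text.
        return Optional.of(new String(data, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Set<String> pdfOnly = Collections.singleton("application/pdf");
        byte[] text = "plain text".getBytes(StandardCharsets.UTF_8);

        // With validation on, the non-PDF input is rejected.
        System.out.println(extract(text, pdfOnly, true).isPresent());
        // With validation off, the input is handed to the "parser" regardless.
        System.out.println(extract(text, pdfOnly, false).isPresent());
    }
}
```

In this model, turning validation off trades an early rejection for an attempt to parse whatever is supplied, which is only reasonable when the caller already knows the document type.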