...

These are applied during the parse by classes that extend ContentHandlerDecorator.  A small handful may cache all the contents in memory.  One risk is that there is no guarantee that parsers will pass meaningful amounts of text in each call to characters(); in theory, a parser could write one character at a time, which would render a regex match useless.
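The chunking risk described above can be illustrated with plain JDK SAX types. This is a minimal sketch, not Tika code: the class names, the feed() helper that simulates a parser, and the phone-number pattern are all hypothetical. It contrasts a handler that runs a regex on each characters() call with one that buffers the text and matches once at endDocument().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ChunkedRegexDemo {

    // Hypothetical pattern, chosen only for illustration.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{4}");

    /** Runs the regex against each characters() call in isolation. */
    static class PerCallMatcher extends DefaultHandler {
        int hits = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            if (PHONE.matcher(new String(ch, start, length)).find()) {
                hits++;
            }
        }
    }

    /** Buffers all text in memory and runs the regex once at the end. */
    static class BufferingMatcher extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        int hits = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        @Override
        public void endDocument() {
            Matcher m = PHONE.matcher(buffer);
            while (m.find()) {
                hits++;
            }
        }
    }

    /** Simulates a parser that emits text in chunks of chunkSize characters. */
    static void feed(DefaultHandler handler, String text, int chunkSize) throws SAXException {
        handler.startDocument();
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i += chunkSize) {
            handler.characters(chars, i, Math.min(chunkSize, chars.length - i));
        }
        handler.endDocument();
    }

    static int perCallHits(String text, int chunkSize) {
        PerCallMatcher handler = new PerCallMatcher();
        try {
            feed(handler, text, chunkSize);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.hits;
    }

    static int bufferedHits(String text, int chunkSize) {
        BufferingMatcher handler = new BufferingMatcher();
        try {
            feed(handler, text, chunkSize);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.hits;
    }
}
```

With one-character chunks, the per-call handler never sees the full pattern, while the buffering handler still finds it, at the cost of holding the whole text in memory.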

Programmatically, users can use any of the ContentHandlers in tika-core or write their own ContentHandlers.  If you write your own, consider extending ContentHandlerDecorator, which lets you override only the methods you need; also consider TeeContentHandler, which allows multiple handlers to run during the same parse.
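The decorator and tee patterns mentioned above can be sketched with JDK SAX types alone. This is not Tika's implementation: tika-core's ContentHandlerDecorator and TeeContentHandler provide these patterns ready-made, and every class name below is hypothetical. The JDK's XMLFilterImpl stands in for the "forward everything, override what you need" behavior, and the tee here forwards only character events for brevity.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class DecoratorSketch {

    /** Collects character events into a string. */
    static class TextSink extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Decorator: forwards every event to the delegate, uppercasing text first. */
    static class UpcasingDecorator extends XMLFilterImpl {
        UpcasingDecorator(ContentHandler delegate) {
            setContentHandler(delegate);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Uppercasing can change the length, so pass the new array's length.
            char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    /** Tee: fans character events out to several handlers (other events omitted). */
    static class CharacterTee extends DefaultHandler {
        private final ContentHandler[] targets;

        CharacterTee(ContentHandler... targets) {
            this.targets = targets;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler target : targets) {
                target.characters(ch, start, length);
            }
        }
    }

    /** Sends text through a tee into an uppercasing sink and a plain sink. */
    static String[] run(String text) {
        TextSink upper = new TextSink();
        TextSink plain = new TextSink();
        ContentHandler tee = new CharacterTee(new UpcasingDecorator(upper), plain);
        try {
            char[] chars = text.toCharArray();
            tee.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return new String[] {upper.text.toString(), plain.text.toString()};
    }
}
```

In a real Tika parse, the outermost handler would be passed to the parser in place of the manual characters() call shown in run().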

To set a ContentHandlerDecorator via tika-config.xml, specify the ContentHandlerDecoratorFactory inside the <autoDetectParserConfig> element.


In this example, we're calling a test class that simply uppercases all characters passed to the content handler.

Code Block
language: xml
title: ContentHandlerDecoratorFactory
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <spoolToDisk>123450</spoolToDisk>
      <outputThreshold>678900</outputThreshold>
    </params>
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>


Metadata Filters

These are applied at the end of the parse and are intended to modify the contents of a metadata object for different purposes:

...