...

There are several ways to modify, limit or process content during or after the parse.  Below, we describe the four main categories.  Each was developed for a specific purpose and may not be appropriate for all needs.

...

1. ContentHandlers

These are applied during the parse by classes that implement org.xml.sax.ContentHandler.  A small handful may cache contents in memory.  One small risk is that there is no guarantee that parsers will pass meaningful amounts of text in each call to characters(); theoretically, a parser could write one character at a time, which would render a regex match useless.
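To make the chunking risk concrete, here is a minimal sketch of one way to guard against it: buffer everything passed to characters() and run the regex once at endDocument().  The class name BufferingRegexHandler is hypothetical and not part of tika-core; it uses only the JDK's org.xml.sax classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a tika-core class): buffers all text and applies
// the regex once at endDocument(), so a match cannot be split across
// separate characters() calls.
class BufferingRegexHandler extends DefaultHandler {
    private final Pattern pattern;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> matches = new ArrayList<>();

    BufferingRegexHandler(Pattern pattern) {
        this.pattern = pattern;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser decides the chunk size; just accumulate.
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        Matcher m = pattern.matcher(buffer);
        while (m.find()) {
            matches.add(m.group());
        }
    }

    List<String> getMatches() {
        return matches;
    }
}
```

The trade-off is memory: the whole text is held until the end of the document, so this approach suits small or bounded inputs better than very large files.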

Programmatically, users can use any of the ContentHandlers in tika-core or write their own ContentHandlers.  If you write your own, consider the ContentHandlerDecorator, which allows overriding only the methods you need; also consider the TeeContentHandler, which allows multiple handlers to run during the parse.

In tika-server, some common content handlers can be selected for the /tika JSON output and the /rmeta endpoint by appending "/xml", "/text", "/html", "/body" or "/ignore" to the endpoint.


To set custom ContentHandlerDecorators, set the ContentHandlerDecoratorFactory in the <autoDetectParserConfig/> element in tika-config.xml.


In this example, we're calling a test class that simply upcases all characters in the content handler.

Code Block
language: xml
title: ContentHandlerDecoratorFactory
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <spoolToDisk>123450</spoolToDisk>
      <outputThreshold>678900</outputThreshold>
    </params>
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>
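
For intuition about what an upcasing decorator does, the following is a rough JDK-only sketch.  It is not the source of UpcasingContentHandlerDecoratorFactory; the names UpcasingFilterSketch, TextCollector and UpcasingDemo are hypothetical, and the JDK's XMLFilterImpl stands in for Tika's ContentHandlerDecorator because it likewise forwards every SAX event to a wrapped handler.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: overrides only characters(); all other SAX events
// are forwarded unchanged to the wrapped handler by XMLFilterImpl.
class UpcasingFilterSketch extends XMLFilterImpl {
    UpcasingFilterSketch(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Upper-casing can change the length (e.g. "ß" becomes "SS"),
        // so forward the new array's own length.
        char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

// Hypothetical sink that collects whatever text reaches it.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class UpcasingDemo {
    // Pushes one chunk of text through the decorator and returns what the sink saw.
    static String upcase(String input) {
        TextCollector sink = new TextCollector();
        UpcasingFilterSketch handler = new UpcasingFilterSketch(sink);
        try {
            handler.characters(input.toCharArray(), 0, input.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }
}
```

Because the decorator transforms each chunk independently, it is unaffected by how the parser splits text across characters() calls.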


2. Metadata Filters

These are applied at the end of the parse.  They are intended to modify the contents of a metadata object for different purposes:

...

Code Block
language: xml
title: MetadataFilters
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
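
As a rough illustration of the effect of an include-style filter such as the one configured above, the sketch below keeps only the listed fields.  It assumes the filter drops every field that is not listed, and it uses a plain Map rather than Tika's Metadata class; IncludeFieldSketch is a hypothetical name, and the sketch returns a filtered copy, which may differ from how the real filter operates.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: models metadata as a Map of field name to
// values and keeps just the fields named in the include set.
class IncludeFieldSketch {
    static Map<String, List<String>> filter(Map<String, List<String>> metadata,
                                            Set<String> include) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (include.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

With the three fields from the configuration as the include set, any other field, for example a hypothetical dc:creator entry, would be absent from the result.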

3. Metadata Write Filters

These filters are applied during the parse.

...