...
These are applied during the parse by classes that extend ContentHandlerDecorator. A small handful may cache all the contents in memory. One small risk for these is that there's no guarantee that parsers will pass meaningful amounts of text in each call to characters(); theoretically, a parser could write one character at a time, which would render a regex match useless.
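A minimal sketch of the chunking risk and one way to guard against it. It uses only the JDK's SAX classes so that it runs without Tika on the classpath; in a real Tika setup the class would presumably extend ContentHandlerDecorator rather than DefaultHandler. The names BufferingRegexHandler and ChunkingDemo are hypothetical. The handler appends every characters() chunk to a buffer and runs the regex once in endDocument(), so the result does not depend on how the parser splits the text. This is the "cache all the contents in memory" trade-off described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects all text, then matches once at the end.
class BufferingRegexHandler extends DefaultHandler {
    private final Pattern pattern;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> matches = new ArrayList<>();

    BufferingRegexHandler(Pattern pattern) {
        this.pattern = pattern;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Chunks may be arbitrarily small, so only accumulate here.
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        // Match against the full text, independent of chunk boundaries.
        Matcher m = pattern.matcher(buffer);
        while (m.find()) {
            matches.add(m.group());
        }
    }

    List<String> getMatches() {
        return matches;
    }
}

class ChunkingDemo {
    // Simulates a worst-case parser that emits one character per characters() call.
    static List<String> feedOneCharAtATime(String text, Pattern pattern) throws SAXException {
        BufferingRegexHandler handler = new BufferingRegexHandler(pattern);
        handler.startDocument();
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            handler.characters(chars, i, 1);
        }
        handler.endDocument();
        return handler.getMatches();
    }

    public static void main(String[] args) throws SAXException {
        Pattern phone = Pattern.compile("\\d{3}-\\d{4}");
        System.out.println(feedOneCharAtATime("call 555-1234 now", phone)); // prints [555-1234]
    }
}
```

A handler that ran the same regex inside each characters() call would find nothing in this simulation, because no single one-character chunk can match the pattern.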
Programmatically, users can use any of the ContentHandlers in tika-core or write their own ContentHandlers. If doing this, make sure to consider the ContentHandlerDecorator, which allows overriding only the methods you need; also consider using the TeeContentHandler, which allows multiple handlers to be run during the parse.
To set ContentHandlerDecorators via tika-config.xml, set the ContentHandlerDecoratorFactory inside the <autoDetectParserConfig> element.
In this example, we're calling a test class that simply upcases all characters in the content handler.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <spoolToDisk>123450</spoolToDisk>
      <outputThreshold>678900</outputThreshold>
    </params>
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>
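The source of the upcasing test class is not shown here, so the following is only a sketch of what such a decorator might do. As a stand-in for ContentHandlerDecorator it uses the JDK's XMLFilterImpl, which likewise forwards every SAX event to a wrapped handler and lets you override only the methods you need. UpcasingFilter and UpcasingDemo are hypothetical names, not Tika classes.

```java
import java.util.Locale;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for an upcasing decorator: forwards every SAX event
// unchanged except characters(), which is upper-cased first.
class UpcasingFilter extends XMLFilterImpl {
    UpcasingFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Upper-casing can change the length, so pass the new array's own bounds.
        String upper = new String(ch, start, length).toUpperCase(Locale.ROOT);
        super.characters(upper.toCharArray(), 0, upper.length());
    }
}

class UpcasingDemo {
    // Pushes text through the filter into a simple collecting handler.
    static String upcase(String text) throws SAXException {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        ContentHandler handler = new UpcasingFilter(sink);
        handler.startDocument();
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(upcase("hello tika")); // prints HELLO TIKA
    }
}
```

In Tika, a factory class such as the one named in the config above would be responsible for wrapping the parser's handler with a decorator like this; the exact factory interface is not covered in this section.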
Metadata Filters
These are applied at the end of the parse. These are intended to modify the contents of a metadata object for different purposes:
...