...
Code Block |
---|
language | xml |
---|
title | ContentHandlerDecoratorFactory |
---|
|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<!-- we're including the <parsers/> element to show that it is a separate element from the
autoDetectParserConfig element. If it is not included, the standard default parser will
be used -->
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
</parser>
<parser class="org.apache.tika.parser.microsoft.OfficeParser">
<params>
<param name="byteArrayMaxOverride" type="int">700000000</param>
</params>
</parser>
</parsers>
<!-- note that the autoDetectParserConfig element is separate from the <parsers/> element.
The composite parser built in the <parsers/> element is used as the base parser
for the AutoDetectParser. -->
<autoDetectParserConfig>
<!-- note that this is a test class only available in tika-core's test-jar as an example.
Specify your own custom factory here -->
<contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
</autoDetectParserConfig>
</properties> |
...
If you need different behavior, implement a WriteFilterFactory
, add it to your classpath and specify it in the tika-config.xml
.
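As a minimal sketch of that last step, the fragment below registers a custom factory in the tika-config.xml. It assumes the factory is declared through a metadataWriteFilterFactory element inside the autoDetectParserConfig element, in the same way as the other factories on this page. The class name com.example.CustomWriteFilterFactory is hypothetical; substitute your own implementation and confirm the element name against your Tika version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <!-- assumption: the element name mirrors the other factory elements;
         com.example.CustomWriteFilterFactory is a placeholder for your own class,
         which must be on the classpath -->
    <metadataWriteFilterFactory class="com.example.CustomWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>
```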
4. AutoDetectParserConfig
...
We've mentioned briefly above some of the factories that can be modified in the AutoDetectParserConfig
. There are other parameters that can be used to modify the behavior of the AutoDetectParser
via the tika-config.xml
. The AutoDetectParser
is built from the <parsers/>
element (or from the parsers loaded via SPI if no <parsers/>
element is specified) in the tika-config.xml
. Because of this, the configuration of the AutoDetectParser
differs from the component parsers that it wraps – the AutoDetectParser
uses its own <autoDetectParserConfig/>
element at the main level inside the <properties/>
element.
Code Block |
---|
language | xml |
---|
title | AutoDetectParserConfig |
---|
|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<autoDetectParserConfig>
<params>
<!-- if the incoming metadata object has a ContentLength entry and it is larger than this
value, spool the file to disk; this is useful for some file formats that are more efficiently
processed via a file instead of an InputStream -->
<spoolToDisk>100000</spoolToDisk>
<!-- the next four are parameters for the SecureContentHandler -->
<!-- threshold used in zip bomb detection. This many characters must be written
before the maximum compression ratio is calculated -->
<outputThreshold>10000</outputThreshold>
<!-- maximum compression ratio between output characters and input bytes -->
<maximumCompressionRatio>100</maximumCompressionRatio>
<!-- maximum XML element nesting level -->
<maximumDepth>100</maximumDepth>
<!-- maximum embedded file depth -->
<maximumPackageEntryDepth>100</maximumPackageEntryDepth>
<!-- as of Tika > 2.7.0, you can skip the check and exception for a zero-byte inputstream -->
<throwOnZeroBytes>false</throwOnZeroBytes>
</params>
<!-- as of Tika 2.5.x, this is the preferred way to configure digests -->
<digesterFactory class="org.apache.tika.parser.digestutils.CommonsDigesterFactory">
<params>
<markLimit>100000</markLimit>
<!-- this specifies SHA256, base32 and MD5 -->
<algorithmString>sha256:32,md5</algorithmString>
</params>
</digesterFactory>
</autoDetectParserConfig>
</properties> |
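To make the interaction between outputThreshold and maximumCompressionRatio concrete, here is a small annotated sketch using the same values as the block above. The arithmetic in the comments follows the parameter descriptions given there; the 1,000-byte input is an illustrative assumption, and whether the comparisons are strict or inclusive is not stated here, so treat the boundaries as approximate.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <!-- the ratio is not evaluated until roughly this many characters
           have been written, so small outputs are never flagged -->
      <outputThreshold>10000</outputThreshold>
      <!-- illustrative case, assuming a 1,000-byte input:
           1,000 bytes x 100 = 100,000 output characters allowed;
           output well beyond that would be treated as a possible zip bomb -->
      <maximumCompressionRatio>100</maximumCompressionRatio>
    </params>
  </autoDetectParserConfig>
</properties>
```

Raising either value makes the check more permissive, which may be appropriate for formats that legitimately expand a great deal on extraction.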
...