...

Code Block

language	xml
title	ContentHandlerDecoratorFactory

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- we're including the <parsers/> element to show that it is a separate element from the
       autoDetectParserConfig element.  If it is not included, the standard default parser will
       be used -->
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>  
  <!-- note that the autoDetectParserConfig element is separate from the <parsers/> element.
       The composite parser built in the <parsers/> element is used as the base parser
       for the AutoDetectParser. -->
  <autoDetectParserConfig>
    <!-- This is an example class for testing; it is only available in tika-core's test-jar.
         Specify your own custom factory here. -->
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>

...

If you need different behavior, implement a WriteFilterFactory, add it to your classpath and specify it in the tika-config.xml.
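
As a rough sketch of that last step, the fragment below shows how a custom factory might be registered. It assumes the factory is declared in a metadataWriteFilterFactory element inside <autoDetectParserConfig/>; that element name is an assumption and should be checked against your Tika version. The class name com.example.MyWriteFilterFactory is hypothetical and stands in for your own implementation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <!-- assumption: the write filter factory is registered under this element name;
         com.example.MyWriteFilterFactory is a hypothetical class that must be on the classpath -->
    <metadataWriteFilterFactory class="com.example.MyWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>
```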

4. AutoDetectParserConfig

We briefly mentioned above some of the factories that can be set in the AutoDetectParserConfig. Other parameters in the tika-config.xml also modify the behavior of the AutoDetectParser. The AutoDetectParser is built from, and contains, the parsers specified in the <parsers/> element (or loaded via SPI if no <parsers/> element is specified) in the tika-config. Because of this, the AutoDetectParser is configured differently from the component parsers that it wraps: it uses its own <autoDetectParserConfig/> element at the top level inside the <properties/> element.

Code Block

language	xml
title	AutoDetectParserConfig

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <!-- if the incoming metadata object has a ContentLength entry and it is larger than this
           value, spool the file to disk; this is useful for some file formats that are more efficiently
           processed via a file instead of an InputStream -->
      <spoolToDisk>100000</spoolToDisk>
      <!-- the next four are parameters for the SecureContentHandler -->
      <!-- threshold used in zip bomb detection. This many characters must be written
           before the maximum compression ratio is calculated -->
      <outputThreshold>10000</outputThreshold>
      <!-- maximum compression ratio between output characters and input bytes -->
      <maximumCompressionRatio>100</maximumCompressionRatio>
      <!-- maximum XML element nesting level -->
      <maximumDepth>100</maximumDepth>
      <!-- maximum embedded file depth -->
      <maximumPackageEntryDepth>100</maximumPackageEntryDepth>
      <!-- in Tika versions after 2.7.0, you can skip the check and exception for a zero-byte inputstream -->
      <throwOnZeroBytes>false</throwOnZeroBytes>
    </params>
    <!-- as of Tika 2.5.x, this is the preferred way to configure digests -->
    <digesterFactory class="org.apache.tika.parser.digestutils.CommonsDigesterFactory">
      <params>
        <markLimit>100000</markLimit>
        <!-- this specifies SHA256 (base32-encoded) and MD5 -->
        <algorithmString>sha256:32,md5</algorithmString>
      </params>
    </digesterFactory>   
  </autoDetectParserConfig>
</properties>

...
