...

Code Block
languagexml
titleContentHandlerDecoratorFactory
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- we're including the <parsers/> element to show that it is a separate element from the
       autoDetectParserConfig element.  If it is not included, the standard default parser will
       be used -->
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
  <!-- note that the autoDetectParserConfig element is separate from the <parsers/> element.
       The composite parser built in the <parsers/> element is used as the base parser
       for the AutoDetectParser. -->
  <autoDetectParserConfig>
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>

2. Metadata Filters

These filters are applied at the end of the parse.  They are intended to modify the contents of a metadata object for different purposes:

  1. Enrich the data (similar to a ContentHandler) -- these metadata filters might run language detection on the cached contents at the end of the parse.
  2. Modify the metadata contents -- one might want to run a regex over a specific field and extract only the information matching the regex, for example.
  3. Modify the metadata keys -- if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
  4. Limit the metadata fields -- say you only want dc:title and text; you can use the ExcludeFieldMetadataFilter or the IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters; those are probably a better option for this behavior.

Metadata filters are specified in the <metadataFilters/> element in tika-config.xml.  They are run in order, and order matters.
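
Because the filters run in the order they are listed, renaming filters should generally come last. As an illustrative sketch (this particular combination is not from the original page), running the DateNormalizingMetadataFilter before the FieldNameMappingFilter ensures that dates are normalized while the fields still carry their Tika names:

Code Block
languagexml
titleChainedMetadataFilters
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <!-- runs first: normalizes dates while the fields still have their Tika names -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
    <!-- runs second: renames fields (and drops unmapped fields) for the target index -->
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="dcterms:created" to="created"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>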

FieldNameMappingFilter

This is used to select fields to include and to rename fields from the Tika names to preferred names.  It was initially designed for modifying field names before emitting documents to OpenSearch or Solr.

Code Block
languagexml
titleFieldNameMappingFilter
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="Content-Length" to="length"/>
          <mapping from="dc:creator" to="creators"/>
          <mapping from="dc:title" to="title"/>
          <mapping from="Content-Type" to="mime"/>
          <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

DateNormalizingMetadataFilter

This blindly adds a timezone to dates that do not have one; e.g., a parsed 2023-01-05T10:15:30 is emitted as 2023-01-05T10:15:30Z.  Some file formats store a timezone; others don't.  By default, OpenSearch and Solr require timezones on dates.

Code Block
languagexml
titleDateNormalizingMetadataFilter
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone. This
     filter arbitrarily assumes dates have a UTC timezone and will format all
     dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
     -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties>

LanguageDetection

Two language detectors have a metadata filter option (the OpenNLPMetadataFilter and the OptimaizeMetadataFilter).  These are applied to the X-TIKA:content field at the end of the parse.  The following is an example of specifying the Optimaize language detector; the language id will be added to the metadata object with the TikaCoreProperties.TIKA_DETECTED_LANGUAGE key.

Code Block
languagexml
titleLanguageDetection
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.langdetect.optimaize.metadatafilter.OptimaizeMetadataFilter">
      <params>
        <maxCharsForDetection>10000</maxCharsForDetection>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

ClearByMimeMetadataFilter


When using the RecursiveParserWrapper (the /rmeta endpoint in tika-server or the -J option in tika-app), you can delete metadata objects for specific file types.

Code Block
languagexml
titleClearByMimeMetadataFilter
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <!-- this will remove metadata objects for jpegs and pdfs; more seriously,
             this may be useful for image files or emf or wmf depending on your use case -->
        <mimes>
          <mime>image/jpeg</mime>
          <mime>application/pdf</mime>
        </mimes>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>




IncludeFieldMetadataFilter

This removes all other metadata fields after the parse except those specified here.

Code Block
languagexml
titleMetadataFilters
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

3. Metadata Write Filters

These filters are applied during the parse.

The primary goal of the metadata write filters is to limit the amount of data written to a metadata object, for two purposes:

  1. Limit the total number of bytes written to a metadata object (prevent DoS from files with large amounts of metadata).
  2. Limit the fields written to a metadata object (decrease bytes held in memory during the parse and decrease the bytes sent over the wire/written to a file after the parse).

To configure the StandardWriteFilter, set the properties in its factory in the <autoDetectParserConfig/> element in the tika-config.xml file.

Code Block
languagexml
titleStandardWriteFilter
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <metadataWriteFilterFactory class="org.apache.tika.metadata.writefilter.StandardWriteFilterFactory">
      <params>
        <!-- all measurements are in UTF-16 bytes. If any values are truncated,
             TikaCoreProperties.TRUNCATED_METADATA is set to true in the metadata object -->

        <!-- the maximum size for a metadata key -->
        <maxKeySize>1000</maxKeySize>

        <!-- max total size for a field in UTF-16 bytes.  If a field has multiple values,
             their lengths are summed to calculate the field size. -->
        <maxFieldSize>10000</maxFieldSize>

        <!-- max total estimated bytes is a sum of the key sizes and values -->
        <maxTotalEstimatedBytes>100000</maxTotalEstimatedBytes>

        <!-- limit the count of values for multi-valued fields -->
        <maxValuesPerField>100</maxValuesPerField>

        <!-- include only these fields. NOTE, however, that there are several fields that are
             important to the parse process, and those fields are always allowed in addition
             (see ALWAYS_SET_FIELDS and ALWAYS_ADD_FIELDS in the StandardWriteFilter) -->
        <includeFields>
          <field>dc:creator</field>
          <field>dc:title</field>
        </includeFields>
      </params>
    </metadataWriteFilterFactory>
  </autoDetectParserConfig>
</properties>



If you need different behavior, implement a WriteFilterFactory, add it to your classpath and specify it in the tika-config.xml.
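
As a sketch, a custom factory is specified the same way as the StandardWriteFilterFactory above; the class name below is a placeholder for your own implementation:

Code Block
languagexml
titleCustomMetadataWriteFilterFactory
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <!-- com.example.MyWriteFilterFactory is hypothetical; point this at your own
         factory implementation on the classpath -->
    <metadataWriteFilterFactory class="com.example.MyWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>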

4. AutoDetectParserConfig

We've briefly mentioned above some of the factories that can be set in the AutoDetectParserConfig.  There are other parameters that can be used to modify the behavior of the AutoDetectParser via tika-config.xml.

Code Block
languagexml
titleAutoDetectParserConfig
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <!-- if the incoming metadata object has a ContentLength entry and it is larger than this
           value, spool the file to disk; this is useful for some file formats that are more efficiently
           processed via a file instead of an InputStream -->
      <spoolToDisk>100000</spoolToDisk>
      <!-- the next four are parameters for the SecureContentHandler -->
      <!-- threshold used in zip bomb detection. This many characters must be written
           before the maximum compression ratio is calculated -->
      <outputThreshold>10000</outputThreshold>
      <!-- maximum compression ratio between output characters and input bytes -->
      <maximumCompressionRatio>100</maximumCompressionRatio>
      <!-- maximum XML element nesting level -->
      <maximumDepth>100</maximumDepth>
      <!-- maximum embedded file depth -->
      <maximumPackageEntryDepth>100</maximumPackageEntryDepth>
    </params>
    <!-- as of Tika 2.5.x, this is the preferred way to configure digests -->
    <digesterFactory class="org.apache.tika.parser.digestutils.CommonsDigesterFactory">
      <params>
        <markLimit>100000</markLimit>
        <!-- this specifies SHA256, base32 and MD5 -->
        <algorithmString>sha256:32,md5</algorithmString>
      </params>
    </digesterFactory>   
  </autoDetectParserConfig>
</properties>

...