...

  1. Enrich the data (similar to a ContentHandler): a metadata filter might, for example, run language detection on the cached contents at the end of the parse.
  2. Modify the metadata contents: for example, run a regex over a specific field and keep only the text that matches.
  3. Modify the metadata keys: if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
  4. Limit the metadata fields: if you only want, say, dc:title and text, you can use the ExcludeFieldMetadataFilter or the IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters; those are probably a better option for this behavior.
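
As an illustration of item 3, the following is a minimal sketch of a FieldNameMappingFilter configuration. The parameter names (excludeUnmapped, mappings, and mapping with from/to attributes) are given as I recall them from Tika 2.x and should be checked against the version you run; the target key names "content" and "mime" are made up for the example.

```xml
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <!-- assumption: when true, keys without a mapping are dropped -->
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <!-- rename Tika's keys to the (hypothetical) field names used by the index -->
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="Content-Type" to="mime"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```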

IncludeFieldMetadataFilter


Code Block (xml): MetadataFilters
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
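
The ExcludeFieldMetadataFilter can presumably be configured the same way. The sketch below assumes its parameter is named exclude and mirrors the include element shown above; verify this against your Tika version before relying on it.

```xml
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <!-- assumption: <exclude> mirrors <include>; the listed fields are removed, all others are kept -->
        <exclude>
          <field>extended-properties:Application</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```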

Metadata Write Filters

These filters are applied during the parse.

...

To configure the StandardWriteFilter, set the properties on its factory inside the <autoDetectParserConfig> element of the tika-config.xml file.

Code Block (xml): StandardWriteFilter
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <metadataWriteFilterFactory class="org.apache.tika.metadata.writefilter.StandardWriteFilterFactory">
      <params>
        <!-- all measurements are in UTF-16 bytes. If any values are truncated, TikaCoreProperties.TRUNCATED_METADATA is set to true in the metadata object. -->

        <!-- the maximum size for a metadata key. -->
        <maxKeySize>1000</maxKeySize>

        <!-- max total size for a field in UTF-16 bytes.  If a field has multiple values, their lengths are summed to calculate the field size. -->
        <maxFieldSize>10000</maxFieldSize>

        <!-- max total estimated bytes: the sum of the sizes of all keys and values -->
        <maxTotalEstimatedBytes>100000</maxTotalEstimatedBytes>

        <!-- limit the count of values for multi-valued fields -->
        <maxValuesPerField>100</maxValuesPerField>

        <!-- include only these fields. NOTE, however, that there are several fields that are important to the
             parse process, and these fields are always allowed in addition (see ALWAYS_SET_FIELDS and ALWAYS_ADD_FIELDS
             in the StandardWriteFilter). -->
        <includeFields>
          <field>dc:creator</field>
          <field>dc:title</field>
        </includeFields>
      </params>
    </metadataWriteFilterFactory>
  </autoDetectParserConfig>
</properties>


If you need different behavior, implement a MetadataWriteFilterFactory, add it to your classpath, and specify it in the tika-config.xml.
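
A custom factory would be registered in the same element as the StandardWriteFilterFactory. In this minimal sketch, com.example.MyWriteFilterFactory is a hypothetical class name standing in for your own implementation, and it is assumed that the factory needs no params.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <!-- hypothetical class: replace with the fully qualified name of your own factory -->
    <metadataWriteFilterFactory class="com.example.MyWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>
```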