Modifying Content with Handlers, Metadata Filters and Metadata WriteFilters

There are several ways to modify, limit or process content during or after the parse. In the following, we describe the three main categories. Each of these has been developed for specific purposes and may not be appropriate for all needs.

ContentHandlers

These are applied during the parse. A small number of them cache all of the content in memory.
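As a rough illustration of work done during the parse, the sketch below uses only the SAX classes that ship with the JDK. The class name CharLimitHandler and its character limit are hypothetical and are not part of Tika; the handler simply collects text until its budget is spent and ignores the rest.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not a Tika class): collects text up to a character limit. */
class CharLimitHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int maxChars;

    CharLimitHandler(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called by the parser whenever it encounters text; keep only what still fits.
        int remaining = maxChars - text.length();
        if (remaining > 0) {
            text.append(ch, start, Math.min(length, remaining));
        }
    }

    String getText() {
        return text.toString();
    }
}
```

Because the handler sees the text as it is produced, it can cap memory use without waiting for the parse to finish.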

Metadata Filters

These are applied at the end of the parse. They are intended to modify the contents of a metadata object for different purposes:

Enrich the data (similar to a ContentHandler): these metadata filters might, for example, run language detection on the cached contents at the end of the parse.
Modify the metadata contents: one might want to run a regex over a specific field and keep only the text that matches.
Modify the metadata keys: if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
Limit the metadata fields: if you only want, say, dc:title and text, you can use ExcludeFieldMetadataFilter or IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters; those are probably a better option for this behavior.
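The content and key modifications described above can be sketched in plain Java. This is only an illustration under stated assumptions: a Map stands in for Tika's Metadata object, and the class MetadataFilterSketch, its method names, and the excludeUnmapped flag are hypothetical rather than Tika's actual filter API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for metadata filters; a plain Map replaces Tika's Metadata. */
class MetadataFilterSketch {

    /** Replace each value of one field with its first regex match; drop non-matching values. */
    static void extractMatch(Map<String, List<String>> md, String field, Pattern p) {
        List<String> values = md.get(field);
        if (values == null) {
            return;
        }
        List<String> matched = new ArrayList<>();
        for (String v : values) {
            Matcher m = p.matcher(v);
            if (m.find()) {
                matched.add(m.group());
            }
        }
        md.put(field, matched);
    }

    /** Rename keys per the mapping; optionally drop keys that have no mapping. */
    static Map<String, List<String>> renameKeys(Map<String, List<String>> md,
                                                Map<String, String> mapping,
                                                boolean excludeUnmapped) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            String to = mapping.get(e.getKey());
            if (to != null) {
                out.put(to, e.getValue());
            } else if (!excludeUnmapped) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Both steps run over a finished metadata object, which is why they belong at the end of the parse rather than during it.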

Metadata Write Filters

These filters are applied during the parse.

The primary goal of the metadata write filters is to limit the amount of data written to a metadata object, for two purposes:

Limit the total number of bytes written to a metadata object (to prevent DoS from files with large amounts of metadata)
Limit the fields written to a metadata object (to decrease the bytes held in memory during the parse and the bytes sent over the wire or written to a file after the parse)

To configure the StandardWriteFilter, set the properties in its factory in the <autoDetectParserConfig> element in the tika-config.xml file.

StandardWriteFilter

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <metadataWriteFilterFactory class="org.apache.tika.metadata.writefilter.StandardWriteFilterFactory">
      <params>
        <!-- All measurements are in UTF-16 bytes. If any values are truncated,
             TikaCoreProperties.TRUNCATED_METADATA is set to true in the metadata object. -->

        <!-- the maximum size for a metadata key. -->
        <maxKeySize>1000</maxKeySize>

        <!-- max total size for a field in UTF-16 bytes.  If a field has multiple values, their lengths are summed to calculate the field size. -->
        <maxFieldSize>10000</maxFieldSize>

        <!-- the max total estimated bytes is the sum of the key sizes and the value sizes -->
        <maxTotalEstimatedBytes>100000</maxTotalEstimatedBytes>
  
        <!-- limit the count of values for multi-valued fields -->
        <maxValuesPerField>100</maxValuesPerField>
        <!-- include only these fields. NOTE, however, that there are several fields that are important to the
             parse process, and these fields are always allowed in addition (see ALWAYS_SET_FIELDS and
             ALWAYS_ADD_FIELDS in the StandardWriteFilter) -->
        <includeFields>
          <field>dc:creator</field>
          <field>dc:title</field>
        </includeFields>
      </params>
    </metadataWriteFilterFactory>
  </autoDetectParserConfig>
</properties>
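To make the size limits in this configuration more concrete, here is a minimal sketch of how such estimates could be computed. It assumes two bytes per Java char (one UTF-16 code unit) and follows the comments above: a field's size is the sum of its value sizes, and the total is the sum of key sizes and value sizes. The class MetadataSizeSketch and its methods are hypothetical and are not the StandardWriteFilter implementation.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical size estimate; not the StandardWriteFilter implementation. */
class MetadataSizeSketch {

    /** Assumes 2 bytes per Java char (one UTF-16 code unit). */
    static long utf16Bytes(String s) {
        return 2L * s.length();
    }

    /** Field size: the summed sizes of all values in the field. */
    static long fieldSize(List<String> values) {
        long total = 0;
        for (String v : values) {
            total += utf16Bytes(v);
        }
        return total;
    }

    /** Total estimate: key sizes plus value sizes across all fields. */
    static long totalEstimatedBytes(Map<String, List<String>> md) {
        long total = 0;
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            total += utf16Bytes(e.getKey()) + fieldSize(e.getValue());
        }
        return total;
    }
}
```

Under these assumptions, the key dc:title counts as 16 bytes, so a maxKeySize of 1000 would allow keys of up to 500 characters.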
