In Tika 2.x, users may define metadata filters for the RecursiveParserWrapper in the tika-config.xml file.  The RecursiveParserWrapper applies these filters after the parse.  Metadata filters have two primary goals:

  1. Limit the amount of data returned to the client to only the desired information.
  2. In the tika-pipes module, modify the metadata for the emitters.

These filters are currently not applied as the parse of each embedded object completes.  As a result, they do not reduce the amount of memory Tika needs to hold the output of the full parse.

Users may specify multiple filters.  They are applied in the order specified in the tika-config.xml file.
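
As a sketch of how ordering plays out, the configuration below chains two of the filters described in the following sections.  It assumes the filters compose as their individual examples suggest, and the field choices are illustrative only.  Because the include filter is listed first, it has to name fields by their original Tika names; the mapping filter then renames whatever survives.

<properties>
  <metadataFilters>
    <!-- Listed first, so applied first: keep only these fields (original Tika names) -->
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>dc:title</field>
        </include>
      </params>
    </metadataFilter>
    <!-- Listed second, so applied second: rename the fields that remain -->
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>false</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="dc:title" to="title"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

Reversing the two entries would likely leave nothing for the include filter to match, since the fields would already have been renamed to "content" and "title".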

FieldNameMappingFilter

This filter allows users to map Tika field names to custom field names.  In the following example, the filter renames each metadata field from the name given in the from attribute to the name given in the to attribute.  Because excludeUnmapped is set to true, the filter also removes all metadata that does not have a key of "X-TIKA:content", "dc:title" or "dc:created".  If excludeUnmapped is set to false, the filter applies the mappings but keeps all other metadata.

<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="dc:title" to="title"/>
          <mapping from="dc:created" to="date"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
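
For comparison, here is a sketch of the same mappings with excludeUnmapped set to false.  Going by the description above, the three mapped fields would be renamed and every other metadata field would pass through under its original name.

<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <!-- false: unmapped fields are kept rather than removed -->
        <excludeUnmapped>false</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="dc:title" to="title"/>
          <mapping from="dc:created" to="date"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>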


IncludeFieldMetadataFilter

This filter will include only the fields specified and will leave the field names as they are.

<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>


ExcludeFieldMetadataFilter

This filter will allow all metadata items to pass through except for the excluded keys.

<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <exclude>
          <field>dc:title</field>
          <field>dc:creator</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>


ClearByMimeMetadataFilter

This filter removes all metadata from files of specific mime types.  For example, you may want to parse EMF files because they can contain embedded files, but you might not want to include the metadata or content from those files in what you show to users.

<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <mimes>
          <mime>image/emf</mime>
          <mime>image/jpeg</mime>
        </mimes>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>


TikaEvalMetadataFilter

TODO: fill in
