...

  1. Enrich the data (similar to a ContentHandler) -- for example, a metadata filter might run language detection on the cached contents at the end of the parse.
  2. Modify the metadata contents -- for example, run a regex over a specific field and keep only the text that matches.
  3. Modify the metadata keys -- if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
  4. Limit the metadata fields -- if you only want dc:title and the text, you can use ExcludeFieldMetadataFilter or IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters, which are probably a better option for this behavior.


Metadata filters are specified in the <metadataFilters/> element in tika-config.xml. They run in the order listed, and the order matters.
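
As a minimal sketch of why order matters, the configuration below chains two of the filters described on this page. It assumes that each filter sees the metadata as left by the previous one, so date normalization is listed first, while the fields still carry their original Tika names, and renaming is listed last. The mapping of dcterms:created to created is an illustrative choice, not a recommended setup.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- Runs first: normalizes dates while fields still have their Tika names -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
    <!-- Runs second: renames the selected fields and drops everything else -->
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <!-- illustrative mappings only -->
          <mapping from="dcterms:created" to="created"/>
          <mapping from="dc:title" to="title"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```

Reversing the two entries would mean the date filter receives only the renamed fields, which may not be what you want.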

FieldNameMappingFilter

This is used to select which fields to include and to rename fields from the Tika names to your preferred names. It was initially designed for modifying field names before emitting documents to OpenSearch or Solr.

Code Block
languagexml
titleFieldNameMappingFilter
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="Content-Length" to="length"/>
          <mapping from="dc:creator" to="creators"/>
          <mapping from="dc:title" to="title"/>
          <mapping from="Content-Type" to="mime"/>
          <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
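
For comparison, here is a variant sketch that renames a single field and sets excludeUnmapped to false. The assumption, going by the parameter name, is that fields without a mapping are then kept under their original names rather than dropped. Verify this against the Tika version you run.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <!-- assumption: false keeps unmapped fields under their Tika names -->
        <excludeUnmapped>false</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```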


DateNormalizingMetadataFilter

This blindly adds a UTC timezone to dates that may not have one. Some file formats store a timezone; others don't. By default, OpenSearch and Solr require timezones.

Code Block
languagexml
titleDateNormalizingMetadataFilter
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone. This
     filter arbitrarily assumes dates have a UTC timezone and will format all
     dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
     -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties>

LanguageDetection

Two language detectors have a metadata filter option (OpenNLPMetadataFilter and OptimaizeMetadataFilter). These are applied to the X-TIKA:content field at the end of the parse. The example below configures the Optimaize-based filter. The language id is added to the metadata object under the TikaCoreProperties.TIKA_DETECTED_LANGUAGE key.

Code Block
languagexml
titleLanguageDetection
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.langdetect.optimaize.metadatafilter.OptimaizeMetadataFilter">
      <params>
        <maxCharsForDetection>10000</maxCharsForDetection>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
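
The OpenNLP-based filter can presumably be configured the same way. The sketch below assumes that its fully qualified class name follows the same package pattern as the Optimaize filter (org.apache.tika.langdetect.opennlp.metadatafilter.OpenNLPMetadataFilter) and that the corresponding language detection module is on the classpath. No parameters are set, because the ones it accepts are not confirmed here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <!-- assumption: class name inferred by analogy with OptimaizeMetadataFilter -->
    <metadataFilter class="org.apache.tika.langdetect.opennlp.metadatafilter.OpenNLPMetadataFilter"/>
  </metadataFilters>
</properties>
```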


ClearByMimeMetadataFilter

When using the RecursiveParserWrapper (the /rmeta endpoint in tika-server or the -J option in tika-app), you can delete metadata objects for specific file types.

Code Block
languagexml
titleClearByMimeMetadataFilter
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <!-- This removes the metadata objects for jpegs and pdfs. More realistically, this may be useful
             for image files, or for emf or wmf, depending on your use case. -->
        <mimes>
          <mime>image/jpeg</mime>
          <mime>application/pdf</mime>
        </mimes>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

IncludeFieldMetadataFilter

This removes all metadata fields after the parse except those listed here.

Code Block
languagexml
titleIncludeFieldMetadataFilter
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
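
The complementary ExcludeFieldMetadataFilter mentioned in the list above can be sketched along the same lines. This example assumes that the filter takes an exclude element containing field entries, mirroring the include structure shown for IncludeFieldMetadataFilter. The field names chosen here are only illustrative.

```xml
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <!-- assumption: element names mirror the include/field structure -->
        <exclude>
          <field>Content-Length</field>
          <field>extended-properties:Application</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```

With this filter, every field is kept except the ones listed, which is the simpler choice when you only need to drop a few known keys.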

...