...
- Enrich the data (similar to a ContentHandler): for example, a metadata filter might run language detection on the cached contents at the end of the parse.
- Modify the metadata contents: for example, run a regex over a specific field and keep only the text that matches.
- Modify the metadata keys: if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
- Limit the metadata fields: if you only want dc:title and text, you can use the ExcludeFieldMetadataFilter or the IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters; those are probably a better option for this behavior.
Metadata filters are specified in the <metadataFilters/> element in tika-config.xml. They run in the order listed, and the order matters.
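To illustrate why order matters, here is a minimal sketch that chains two of the filters described on this page. It assumes that setting excludeUnmapped to false leaves unmapped fields in place under their original names; that behavior is an assumption of this sketch and is not stated above. Because the renaming filter runs first, the include filter that follows has to refer to the new key name.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- Runs first: renames dc:title to title.
         Assumption: excludeUnmapped=false keeps all other fields unchanged. -->
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>false</excludeUnmapped>
        <mappings>
          <mapping from="dc:title" to="title"/>
        </mappings>
      </params>
    </metadataFilter>
    <!-- Runs second: the key is already "title" at this point,
         so listing "dc:title" here would match nothing. -->
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>title</field>
          <field>X-TIKA:content</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```

If the two filters were listed in the opposite order, the include list would need dc:title instead, since the rename would not have happened yet.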
FieldNameMappingFilter
This is used to select fields to include and to rename fields from the Tika names to preferred names. It was initially designed for modifying field names before emitting documents to OpenSearch or Solr.
Code Block |
---|
language | xml |
---|
title | FieldNameMappingFilter |
---|
|
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="Content-Length" to="length"/>
          <mapping from="dc:creator" to="creators"/>
          <mapping from="dc:title" to="title"/>
          <mapping from="Content-Type" to="mime"/>
          <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
DateNormalizingMetadataFilter
This blindly adds a UTC time zone to dates that may not have one. Some file formats store a time zone; others don't. By default, OpenSearch and Solr need time zones.
Code Block |
---|
language | xml |
---|
title | DateNormalizingMetadataFilter |
---|
|
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone. This
         filter arbitrarily assumes dates have a UTC timezone and will format all
         dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
    -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties> |
LanguageDetection
Two language detectors have a metadata filter option: the OpenNLPMetadataFilter and the OptimaizeMetadataFilter. These are applied to the X-TIKA:content field at the end of the parse. This example configures the OptimaizeMetadataFilter. The language id is added to the metadata object under the TikaCoreProperties.TIKA_DETECTED_LANGUAGE key.
Code Block |
---|
language | xml |
---|
title | LanguageDetection |
---|
|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.langdetect.optimaize.metadatafilter.OptimaizeMetadataFilter">
      <params>
        <maxCharsForDetection>10000</maxCharsForDetection>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
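For comparison, the OpenNLP variant can probably be configured the same way. The sketch below is an assumption: the fully qualified class name is inferred by analogy with the Optimaize package layout and should be checked against the Tika version in use. No parameters are set, so the filter's defaults apply.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <!-- Assumption: class name inferred by analogy with the Optimaize filter;
         verify it against your Tika release before relying on it. -->
    <metadataFilter class="org.apache.tika.langdetect.opennlp.metadatafilter.OpenNLPMetadataFilter"/>
  </metadataFilters>
</properties>
```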
ClearByMimeMetadataFilter
When using the RecursiveParserWrapper (the /rmeta endpoint in tika-server or the -J option in tika-app), you can delete metadata objects for specific file types.
Code Block |
---|
language | xml |
---|
title | ClearByMimeMetadataFilter |
---|
|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <!-- this will remove metadata objects for jpegs and pdfs; more seriously, this may be useful for image files or
             emf or wmf depending on your use case -->
        <mimes>
          <mime>image/jpeg</mime>
          <mime>application/pdf</mime>
        </mimes>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
IncludeFieldMetadataFilter
After the parse, this removes all metadata fields except those specified here.
Code Block |
---|
language | xml |
---|
title | IncludeFieldMetadataFilter |
---|
|
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
|
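The ExcludeFieldMetadataFilter mentioned earlier works in the opposite direction: it drops the listed fields and keeps everything else. The sketch below assumes its configuration mirrors the include filter, with an exclude element wrapping the field entries; the element name and the chosen fields are illustrative assumptions, not taken from the text above.

```xml
<properties>
  <metadataFilters>
    <!-- Assumption: the <exclude> parameter mirrors <include> in IncludeFieldMetadataFilter. -->
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <exclude>
          <!-- illustrative field choices -->
          <field>Content-Length</field>
          <field>extended-properties:Application</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```

An exclude list tends to suit cases where only a few noisy fields need to go, while an include list is the safer choice when the downstream schema accepts a fixed set of fields.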
...