...
- Enrich the data (similar to a ContentHandler) -- these metadata filters might run language detection on the cached contents at the end of the parse.
- Modify the metadata contents -- for example, one might want to run a regex over a specific field and keep only the text that matches.
- Modify the metadata keys -- if you need to rename metadata keys before emitting the object to, say, OpenSearch, you can use the FieldNameMappingFilter.
- Limit the metadata fields -- if you only want, say, dc:title and text, you can use ExcludeFieldMetadataFilter or IncludeFieldMetadataFilter. NOTE: these were created before we had MetadataWriteFilters; those are probably a better option for this behavior.
IncludeFieldMetadataFilter
    <properties>
      <metadataFilters>
        <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
          <params>
            <include>
              <field>X-TIKA:content</field>
              <field>extended-properties:Application</field>
              <field>Content-Type</field>
            </include>
          </params>
        </metadataFilter>
      </metadataFilters>
    </properties>
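For the opposite case, dropping a few fields and keeping everything else, the configuration would presumably mirror the one above. The following is only a sketch: it assumes that ExcludeFieldMetadataFilter lives in the same package and takes an `exclude` list shaped like the `include` list, which you should verify against your Tika version.

```xml
<properties>
  <metadataFilters>
    <!-- assumption: the "exclude" parameter mirrors "include" above -->
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <exclude>
          <!-- drop the extracted text, keep all other fields -->
          <field>X-TIKA:content</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```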
Metadata Write Filters
These filters are applied during the parse.
...
To configure the StandardWriteFilter, set the properties in its factory in the <autoDetectParserConfig/> element in the tika-config.xml file.
    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <autoDetectParserConfig>
        <metadataWriteFilterFactory class="org.apache.tika.metadata.writefilter.StandardWriteFilterFactory">
          <params>
            <!-- all measurements are in UTF-16 bytes. If any values are truncated,
                 TikaCoreProperties.TRUNCATED_METADATA is set to true in the metadata object -->
            <!-- the maximum size for a metadata key. -->
            <maxKeySize>1000</maxKeySize>
            <!-- max total size for a field in UTF-16 bytes. If a field has multiple values,
                 their lengths are summed to calculate the field size. -->
            <maxFieldSize>10000</maxFieldSize>
            <!-- max total estimated bytes is the sum of the key sizes and values -->
            <maxTotalEstimatedBytes>100000</maxTotalEstimatedBytes>
            <!-- limit the count of values for multi-valued fields -->
            <maxValuesPerField>100</maxValuesPerField>
            <!-- include only these fields. NOTE, however, that there are several fields that are
                 important to the parse process and these fields are always allowed in addition
                 (see ALWAYS_SET_FIELDS and ALWAYS_ADD_FIELDS in the StandardWriteFilter) -->
            <includeFields>
              <field>dc:creator</field>
              <field>dc:title</field>
            </includeFields>
          </params>
        </metadataWriteFilterFactory>
      </autoDetectParserConfig>
    </properties>
If you need different behavior, implement a MetadataWriteFilterFactory, add it to your classpath, and specify it in the tika-config.xml.
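As a rough sketch of that last step, a custom factory would take the place of StandardWriteFilterFactory in the same element. The class name com.example.MyWriteFilterFactory is hypothetical, and whether the element needs a <params> block depends on how your factory is written.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <!-- com.example.MyWriteFilterFactory is a hypothetical class name;
         substitute your own factory, which must be on the classpath -->
    <metadataWriteFilterFactory class="com.example.MyWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>
```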