...
No Format |
---|
curl -T test_recursive_embedded.docx --header "writeLimit: 1000" http://localhost:9998/rmeta |
Filtering Metadata Keys
The /rmeta endpoint can return far more metadata fields than a user might want to process. As of Tika 1.25, users can configure a MetadataFilter that either includes or excludes fields by name.
A user can set the following in a tika-config.xml file to have the /rmeta endpoint return only three fields:
No Format |
---|
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <param name="include" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
To exclude those three fields but include all other fields:
No Format |
---|
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <param name="exclude" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
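To make the difference between the two filters concrete, the following Python sketch models their effect on a single metadata object. It is an illustration under stated assumptions, not Tika's implementation: the function names `include_fields` and `exclude_fields` and the sample values are hypothetical, and the real filtering happens inside the server before the JSON is returned.

```python
# Illustrative model only; not Tika code.
# Function names and sample values are hypothetical.

FIELDS = ["X-TIKA:content", "extended-properties:Application", "Content-Type"]


def include_fields(metadata, names):
    """Keep only the listed keys (models an include-style filter)."""
    wanted = set(names)
    return {k: v for k, v in metadata.items() if k in wanted}


def exclude_fields(metadata, names):
    """Drop the listed keys and keep everything else (models an exclude-style filter)."""
    unwanted = set(names)
    return {k: v for k, v in metadata.items() if k not in unwanted}


# A made-up metadata object of the kind /rmeta returns for one document.
sample = {
    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "X-TIKA:content": "some extracted text",
    "extended-properties:Application": "Microsoft Office Word",
    "dc:creator": "Example Author",
    "meta:page-count": "1",
}

included = include_fields(sample, FIELDS)
excluded = exclude_fields(sample, FIELDS)

print(sorted(included))  # the three listed keys
print(sorted(excluded))  # every other key
```

With the same list of names, the two results are complementary: every key of the sample object ends up in exactly one of them.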
Finally, a user may want to parse a file type to reach the content embedded within it without receiving a metadata object or content for the file type itself. For example, image/emf files often contain duplicative text, but they may also contain an embedded PDF file. If the client turned off the EMFParser, the embedded PDF file would not be parsed. When the /rmeta endpoint is configured with the following, it deletes the entire metadata object for files of type image/emf:
No Format |
---|
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <param name="mimes" type="list">
          <string>image/emf</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties> |
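As a rough model of the effect on the /rmeta output, the Python sketch below drops every metadata object whose Content-Type matches a listed MIME type and leaves the objects for embedded files in place. It is an illustration, not Tika's implementation: `clear_by_mime`, `base_type`, and the sample list are hypothetical, and the sketch assumes that a cleared object is omitted from the returned list entirely.

```python
# Illustrative model only; not Tika code.
# Names and sample data are hypothetical.


def base_type(content_type):
    """Strip parameters such as '; charset=UTF-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def clear_by_mime(metadata_list, mimes):
    """Drop metadata objects whose Content-Type is in `mimes`.

    Assumption: a cleared object is omitted from the result entirely.
    """
    blocked = {m.lower() for m in mimes}
    return [
        md for md in metadata_list
        if base_type(md.get("Content-Type", "")) not in blocked
    ]


# Made-up /rmeta-style output: a container document, an embedded EMF,
# and a PDF embedded inside that EMF.
rmeta_output = [
    {"Content-Type": "application/msword", "X-TIKA:content": "main text"},
    {"Content-Type": "image/emf", "X-TIKA:content": "duplicate text"},
    {"Content-Type": "application/pdf", "X-TIKA:content": "text from the embedded PDF"},
]

filtered = clear_by_mime(rmeta_output, ["image/emf"])
print([md["Content-Type"] for md in filtered])
```

In this model the PDF's object survives because the EMF was still parsed; only the EMF's own metadata object is discarded.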
Unpack Resource
No Format |
---|
/unpack |
...