...
Filtering Metadata Keys
The /rmeta endpoint can return far more metadata fields than a user might want to process. As of Tika 1.25, users can configure a MetadataFilter that either includes or excludes fields by name.
Note: MetadataFilters only work with the /rmeta endpoint. Further, they do not short-circuit metadata extraction within Parsers; they only delete the unwanted fields after the parse. This can still save storage and network bandwidth.
A user can set the following in a tika-config.xml file to have the /rmeta endpoint return only three fields:
...
    <properties>
      <metadataFilters>
        <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
          <params>
            <param name="include" type="list">
              <string>X-TIKA:content</string>
              <string>extended-properties:Application</string>
              <string>Content-Type</string>
            </param>
          </params>
        </metadataFilter>
      </metadataFilters>
    </properties>
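One way to confirm that the filter is active is to look at which field names come back from the server. The following is a minimal client-side sketch, assuming a Tika server listening on http://localhost:9998 that was started with the configuration above. The helper names `fetch_rmeta` and `field_names` are hypothetical and not part of Tika.

```python
import json
import urllib.request


def fetch_rmeta(path, base_url="http://localhost:9998"):
    """PUT a file to /rmeta and return the parsed JSON list of metadata objects.

    base_url is an assumption; point it at wherever your server is running.
    """
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        base_url + "/rmeta",
        data=body,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def field_names(metadata_list):
    """Collect the distinct field names across all returned metadata objects."""
    names = set()
    for metadata in metadata_list:
        names.update(metadata.keys())
    return sorted(names)
```

Calling `field_names(fetch_rmeta("example.docx"))` (the file name is only a placeholder) should then list nothing beyond the fields that the configured filter lets through.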
Filtering Metadata Objects
Finally, a user may want to parse a file type to get at the embedded content within it, but may not want a metadata object or content for the file type itself. For example, image/emf files often contain duplicative text, but they may contain an embedded PDF file. If the client turned off the EMFParser, the embedded PDF file would not be parsed. When the /rmeta endpoint is configured with the following, it will delete the entire metadata object for files of type image/emf.
    <properties>
      <metadataFilters>
        <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
          <params>
            <param name="mimes" type="list">
              <string>image/emf</string>
            </param>
          </params>
        </metadataFilter>
      </metadataFilters>
    </properties>
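To check from the client side that no image/emf metadata objects remain, one option is to tally the Content-Type values in the parsed /rmeta response. This is only a sketch: `tally_content_types` is a hypothetical helper, and it assumes the response has already been parsed into a list of dictionaries whose values are either strings or lists of strings.

```python
from collections import Counter


def tally_content_types(metadata_list):
    """Count metadata objects per base MIME type (parameters such as charset dropped)."""
    counts = Counter()
    for metadata in metadata_list:
        value = metadata.get("Content-Type")
        # A field may hold a single string or a list of strings.
        if isinstance(value, list):
            value = value[0] if value else None
        if value is None:
            # Objects without a Content-Type are counted separately.
            counts["(none)"] += 1
        else:
            counts[value.split(";")[0].strip()] += 1
    return dict(counts)
```

With the filter in place, the result should have no "image/emf" entry, while embedded types such as "application/pdf" should still appear.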
Integration with tika-eval
As of Tika 1.25, if a user adds the tika-eval jar to the server jar's classpath, the /rmeta endpoint will add key "profiling" statistics from the tika-eval module, including the language identified, the number of tokens, the number of alphabetic tokens, and the "out of vocabulary" percentage. These statistics can be used to decide whether to reprocess a file with OCR or to reprocess an HTML file with a different encoding detector.
To accomplish this, one may put both the tika-eval jar and the server jar in a bin/ directory and then run:
    java -cp "bin/*" org.apache.tika.server.TikaServerCli
See the TikaEval module page for more details. Please open issues if you would like other statistics included or if you'd like to make the calculated statistics configurable.
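The reprocessing decision described above could be sketched as follows. The key names `tika-eval:oov` and `tika-eval:numTokens`, the thresholds, and the reading of the out-of-vocabulary value as a fraction between 0 and 1 are all assumptions made for illustration. Check the field names your server actually emits and tune the cut-offs on your own data.

```python
def needs_reprocessing(metadata,
                       oov_key="tika-eval:oov",            # assumed key name
                       tokens_key="tika-eval:numTokens",   # assumed key name
                       max_oov=0.5,                        # hypothetical threshold
                       min_tokens=10):                     # hypothetical threshold
    """Return True if the extracted text looks nearly empty or garbled."""

    def as_float(key):
        raw = metadata.get(key)
        # A field may hold a single string or a list of strings.
        if isinstance(raw, list):
            raw = raw[0] if raw else None
        try:
            return float(raw)
        except (TypeError, ValueError):
            return None

    tokens = as_float(tokens_key)
    oov = as_float(oov_key)
    if tokens is None or oov is None:
        return False  # statistics missing: no basis for a decision
    return tokens < min_tokens or oov > max_oov
```

A file flagged by this check might be sent back through the server with OCR enabled, or an HTML file might be retried with a different encoding detector.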
Unpack Resource
    /unpack
HTTP PUT a document that contains embedded files to the /unpack service, and you get back a zip or tar of the raw bytes of the embedded files. Note that this does not operate recursively; it extracts only the child documents of the original file.
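A client might call the service and read the returned archive as in the sketch below. It assumes a server at http://localhost:9998 and asks for a zip through the Accept header. `fetch_unpacked` and `read_zip_entries` are hypothetical helper names, not part of Tika.

```python
import io
import urllib.request
import zipfile


def fetch_unpacked(path, base_url="http://localhost:9998"):
    """PUT a file to /unpack and return the raw archive bytes.

    base_url is an assumption; point it at wherever your server is running.
    """
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        base_url + "/unpack",
        data=body,
        method="PUT",
        headers={"Accept": "application/zip"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def read_zip_entries(zip_bytes):
    """Map each entry name in the zip archive to its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

Because the service is not recursive, any container file found among the returned entries would need to be sent to /unpack again to reach its own children.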
...