...

Filtering Metadata Keys

The /rmeta endpoint can return far more metadata fields than a user might want to process. As of Tika 1.25, users can configure a MetadataFilter that either includes or excludes fields by name.


Note: MetadataFilters work only with the /rmeta endpoint. Further, they do not short-circuit metadata extraction within parsers; they only delete the unwanted fields after the parse. This can still save storage and network bandwidth.

A user can set the following in a tika-config.xml file to have the /rmeta endpoint return only three fields:

...

No Format
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <param name="exclude" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
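
To make the effect of an exclude-style filter concrete, the following Python sketch mimics it on the client side against the list of metadata objects that /rmeta returns as JSON. It is an illustration only, not Tika's implementation: the function name `exclude_fields` and the sample data are hypothetical, and the real filter runs inside the server after the parse.

```python
from typing import Any, Dict, Iterable, List


def exclude_fields(
    metadata_list: List[Dict[str, Any]], excluded: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return copies of the metadata objects with the excluded keys dropped.

    Hypothetical helper: it only imitates what an exclude filter does to the
    output; it is not part of Tika.
    """
    excluded_set = set(excluded)
    return [
        {key: value for key, value in metadata.items() if key not in excluded_set}
        for metadata in metadata_list
    ]


# Made-up sample in the shape of /rmeta JSON output: one object per document
# (the container file first, then its embedded files).
sample = [
    {"Content-Type": "application/pdf", "X-TIKA:content": "text", "dc:title": "Report"},
    {"Content-Type": "image/png", "X-TIKA:content": ""},
]

filtered = exclude_fields(
    sample,
    ["X-TIKA:content", "extended-properties:Application", "Content-Type"],
)
print(filtered)
```

Listing a field that is absent from a given object (here `extended-properties:Application`) is harmless; the field is simply skipped.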


Filtering Metadata Objects

Finally, a user may want to parse a file type to get at the embedded contents within it, but may not want a metadata object or contents for the file type itself. For example, image/emf files often contain duplicative text, but they may contain an embedded PDF file. If the client had turned off the EMFParser, the embedded PDF file would not be parsed. When the /rmeta endpoint is configured with the following, it will delete the entire metadata object for files of type image/emf.

No Format
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <param name="mimes" type="list">
          <string>image/emf</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
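
The intent of this configuration can be approximated on the client side as follows. This is a rough sketch under stated assumptions, not the filter's actual code: `drop_by_mime` and `base_mime` are hypothetical helpers, the sample list is made up, and the sketch assumes each metadata object reports its type under the `Content-Type` key. The point it shows is that the image/emf object disappears while the PDF embedded inside it is still reported.

```python
from typing import Any, Dict, Iterable, List


def base_mime(content_type: str) -> str:
    """Strip parameters such as '; charset=UTF-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def drop_by_mime(
    metadata_list: List[Dict[str, Any]], mimes: Iterable[str]
) -> List[Dict[str, Any]]:
    """Keep only the metadata objects whose Content-Type is not in `mimes`."""
    blocked = {mime.lower() for mime in mimes}
    return [
        metadata
        for metadata in metadata_list
        if base_mime(str(metadata.get("Content-Type", ""))) not in blocked
    ]


# Made-up sample: a container document, an embedded EMF image, and the PDF
# that was embedded inside that EMF.
sample = [
    {"Content-Type": "application/msword", "X-TIKA:content": "container text"},
    {"Content-Type": "image/emf", "X-TIKA:content": "duplicative text"},
    {"Content-Type": "application/pdf", "X-TIKA:content": "embedded PDF text"},
]

kept = drop_by_mime(sample, ["image/emf"])
print([metadata["Content-Type"] for metadata in kept])
```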

Integration with tika-eval

As of Tika 1.25, if a user adds the tika-eval jar to the server jar's classpath, the /rmeta endpoint will add key "profiling" statistics from the tika-eval module, including: language identified, number of tokens, number of alphabetic tokens and the "out of vocabulary" percentage. These statistics can be used to decide whether to reprocess a file with OCR or to reprocess an HTML file with a different encoding detector.

To accomplish this, one may put both the tika-eval jar and the server jar in a bin/ directory and then run:


No Format
java -cp "bin/*" org.apache.tika.server.TikaServerCli

See the TikaEval module page for more details.  Please open issues if you would like other statistics included or if you'd like to make the calculated statistics configurable.
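
The reprocessing decision described above might be sketched in Python as follows. Everything specific here is an assumption to verify against your own /rmeta output: the key names `tika-eval:oov` and `tika-eval:numTokens`, the idea that the out-of-vocabulary value is a fraction between 0 and 1, and the thresholds, which are arbitrary placeholders. The helper `needs_reprocessing` is hypothetical.

```python
from typing import Any, Mapping

# Assumed key names; check them against the JSON your server actually returns.
OOV_KEY = "tika-eval:oov"
NUM_TOKENS_KEY = "tika-eval:numTokens"


def needs_reprocessing(
    metadata: Mapping[str, Any], max_oov: float = 0.5, min_tokens: int = 10
) -> bool:
    """Flag a document whose extracted text looks too short or too garbled.

    The thresholds are illustrative placeholders, not recommended values.
    """
    try:
        oov = float(metadata[OOV_KEY])
        num_tokens = int(metadata[NUM_TOKENS_KEY])
    except (KeyError, TypeError, ValueError):
        # Statistics missing or unparseable: make no decision.
        return False
    return num_tokens < min_tokens or oov > max_oov


# Made-up examples; JSON metadata values typically arrive as strings.
garbled = {OOV_KEY: "0.83", NUM_TOKENS_KEY: "120"}
clean = {OOV_KEY: "0.12", NUM_TOKENS_KEY: "120"}

print(needs_reprocessing(garbled), needs_reprocessing(clean))
```

A caller could route flagged files to an OCR pass or to a second parse with a different encoding detector, as suggested above.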


Unpack Resource

No Format
/unpack

HTTP PUT a document to the /unpack service and you get back a zip or tar archive containing the raw bytes of its embedded files. Note that this does not operate recursively; it extracts only the child documents of the original file.
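
A minimal client sketch in Python, using only the standard library, might look like the following. The base URL `http://localhost:9998`, the `Accept: application/zip` header, and the helper names `fetch_unpack` and `list_unpacked` are assumptions made for illustration; adjust them to your deployment. The sketch also assumes that an empty response body means no embedded files were found.

```python
import io
import urllib.request
import zipfile
from typing import List


def fetch_unpack(file_path: str, base_url: str = "http://localhost:9998") -> bytes:
    """PUT a local file to /unpack and return the raw archive bytes.

    Requires a running server at base_url (an assumed address).
    """
    with open(file_path, "rb") as handle:
        payload = handle.read()
    request = urllib.request.Request(
        base_url + "/unpack",
        data=payload,
        method="PUT",
        headers={"Accept": "application/zip"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def list_unpacked(zip_bytes: bytes) -> List[str]:
    """Return the entry names in a zip archive held in memory."""
    if not zip_bytes:
        # Assumed meaning of an empty body: nothing was embedded.
        return []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()
```

With a server running, `list_unpacked(fetch_unpack("report.docx"))` would list the child documents of the hypothetical file `report.docx`. Because /unpack is not recursive, files embedded inside those children would not appear.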

...