...

No Format
curl -T test_recursive_embedded.docx --header "writeLimit: 1000" http://localhost:9998/rmeta
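For clients that do not shell out to curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper name build_rmeta_request is hypothetical, and the URL and header value are simply copied from the curl command above. The request is built but not sent, so the snippet runs without a server.

```python
import urllib.request


def build_rmeta_request(data: bytes, write_limit: int,
                        url: str = "http://localhost:9998/rmeta") -> urllib.request.Request:
    """Build (but do not send) a PUT request mirroring
    `curl -T <file> --header "writeLimit: <n>" <url>`.

    build_rmeta_request is a hypothetical helper name used for illustration.
    """
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",  # curl -T uploads the file with PUT
        # urllib normalizes the capitalization of header names;
        # HTTP header names are case-insensitive.
        headers={"writeLimit": str(write_limit)},
    )


# Placeholder bytes stand in for the contents of test_recursive_embedded.docx.
req = build_rmeta_request(b"placeholder bytes", 1000)
print(req.get_method(), req.full_url)

# To send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         body = resp.read()
```

Keeping request construction separate from sending makes the header handling easy to check without a running server.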


Filtering Metadata Keys

The /rmeta endpoint can return far more metadata fields than a user might want to process. As of Tika 1.25, users can configure a MetadataFilter that either includes or excludes fields by name.


A user can set the following in a tika-config.xml file to have the /rmeta endpoint return only three fields:

No Format
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <param name="include" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
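Before deploying a config like the one above, it can help to confirm that it is well-formed and lists the intended fields. The following is a minimal sketch using Python's standard XML parser. The helper name list_filters is hypothetical, and the script only reads the XML; it does not validate it against what Tika itself accepts.

```python
import xml.etree.ElementTree as ET

# The include-filter configuration shown above, embedded so the sketch is self-contained.
CONFIG = """<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <param name="include" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>"""


def list_filters(config_xml: str) -> list:
    """Return (filter class, param name, values) for each <param> in the config.

    list_filters is a hypothetical helper; raises ET.ParseError on malformed XML.
    """
    root = ET.fromstring(config_xml)
    found = []
    for flt in root.iter("metadataFilter"):
        for param in flt.iter("param"):
            values = [s.text.strip() for s in param.findall("string") if s.text]
            found.append((flt.get("class"), param.get("name"), values))
    return found


for cls, name, values in list_filters(CONFIG):
    # Print the short class name, the parameter name, and the listed fields.
    print(cls.rsplit(".", 1)[-1], name, values)
```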


To exclude those three fields and return all others:


No Format
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <param name="exclude" type="list">
          <string>X-TIKA:content</string>
          <string>extended-properties:Application</string>
          <string>Content-Type</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
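The difference between the two filters can be illustrated with a small Python model. This is only a sketch of the behaviour described above, not Tika's Java implementation: the function names are hypothetical, and the sample metadata values (including the dc:creator entry) are invented for illustration.

```python
def include_fields(metadata: dict, include: set) -> dict:
    """Keep only the named fields (models the include filter described above)."""
    return {k: v for k, v in metadata.items() if k in include}


def exclude_fields(metadata: dict, exclude: set) -> dict:
    """Drop the named fields and keep everything else (models the exclude filter)."""
    return {k: v for k, v in metadata.items() if k not in exclude}


FIELDS = {"X-TIKA:content", "extended-properties:Application", "Content-Type"}

# Invented sample metadata object; real /rmeta output has many more fields.
sample = {
    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "extended-properties:Application": "Microsoft Office Word",
    "X-TIKA:content": "hello",
    "dc:creator": "hypothetical author",
}

print(sorted(include_fields(sample, FIELDS)))
print(sorted(exclude_fields(sample, FIELDS)))
```

With the same field list, the two results are complementary: every key in the sample appears in exactly one of them.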


Finally, a user may want to parse a file type to reach the contents embedded within it, but may not want a metadata object or content for the file type itself. For example, image/emf files often contain duplicative text, but they may also contain an embedded PDF file. If the client had turned off the EMFParser, the embedded PDF file would not be parsed. When the /rmeta endpoint is configured with the following, it deletes the entire metadata object for files of type image/emf while still parsing any files embedded in them.

No Format
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <param name="mimes" type="list">
          <string>image/emf</string>
        </param>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
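The effect on the /rmeta output can be modelled as removing whole entries from the returned list of metadata objects. The sketch below is an assumption-laden illustration, not Tika's implementation: drop_by_mime is a hypothetical name, the sample list is invented, and it assumes Content-Type is a single string whose parameters (such as a charset) can be ignored when matching.

```python
def drop_by_mime(metadata_list: list, mimes: set) -> list:
    """Remove every metadata object whose base Content-Type is in `mimes`.

    drop_by_mime is a hypothetical helper that models the behaviour described above.
    """
    def base_type(md: dict) -> str:
        # Assumption: Content-Type is one string; strip parameters like "; charset=...".
        return md.get("Content-Type", "").split(";", 1)[0].strip().lower()

    return [md for md in metadata_list if base_type(md) not in mimes]


# Invented /rmeta-style list: a container, an EMF image, and the PDF embedded in the EMF.
rmeta_output = [
    {"Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    {"Content-Type": "image/emf", "X-TIKA:content": "duplicative text"},
    {"Content-Type": "application/pdf", "X-TIKA:content": "embedded PDF text"},
]

kept = drop_by_mime(rmeta_output, {"image/emf"})
print([md["Content-Type"] for md in kept])
```

The EMF entry is dropped while the PDF entry that was reached through it survives, which is the reason for filtering the output rather than disabling the parser.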


Unpack Resource

No Format
/unpack

...