DateNormalizingMetadataFilter

Some file formats store a time zone with their dates; others do not. By default, OpenSearch and Solr require time zones. This filter respects dates that already have a time zone and blindly assigns UTC to dates that do not.

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone. This
     filter arbitrarily assumes dates have a UTC timezone and will format all
     dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
     -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties>
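
The behavior described for DateNormalizingMetadataFilter can be illustrated with a minimal sketch in plain java.time. This is not Tika's implementation: the class name DateNormalizeSketch and the method normalize are hypothetical, the input is assumed to be ISO-8601, and "respects dates with timezones" is assumed to mean converting to the equivalent UTC instant.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only; not the Tika class.
final class DateNormalizeSketch {
    // Output pattern taken from the comment in the config sample.
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

    static String normalize(String raw) {
        try {
            // The date carries an offset: shift it to the same instant in UTC (assumption).
            return OffsetDateTime.parse(raw)
                    .withOffsetSameInstant(ZoneOffset.UTC)
                    .format(OUT);
        } catch (DateTimeParseException e) {
            // No offset present: blindly treat the local time as UTC.
            return LocalDateTime.parse(raw).format(OUT);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("2020-01-02T03:04:05"));       // no time zone
        System.out.println(normalize("2020-01-02T03:04:05+02:00")); // has time zone
    }
}
```

Under these assumptions, a date without a time zone keeps its clock time and gains a trailing Z, while a date with an offset is shifted to UTC before formatting.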

GeoPointMetadataFilter

If a metadata object has both a TikaCoreProperties.LATITUDE and a TikaCoreProperties.LONGITUDE, this filter concatenates those values with a comma delimiter as LATITUDE,LONGITUDE and adds the result to the field specified by geoPointFieldName. Note: This was added in Tika 2.5.1.

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.GeoPointMetadataFilter">
      <params>
        <!-- default: "location" -->
        <geoPointFieldName>myGeoPoint</geoPointFieldName>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
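
The concatenation that GeoPointMetadataFilter performs can be sketched with a plain Map standing in for Tika's Metadata object. Everything here is an assumption for illustration: the class GeoPointSketch, the method addGeoPoint, and the two key strings are hypothetical stand-ins, not Tika's API or its actual property names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; a Map stands in for Tika's Metadata.
final class GeoPointSketch {
    // Stand-in keys (assumed); the real filter reads TikaCoreProperties.LATITUDE/LONGITUDE.
    static final String LATITUDE_KEY = "latitude";
    static final String LONGITUDE_KEY = "longitude";

    static void addGeoPoint(Map<String, String> metadata, String geoPointFieldName) {
        String lat = metadata.get(LATITUDE_KEY);
        String lon = metadata.get(LONGITUDE_KEY);
        // Only act when both coordinates are present.
        if (lat == null || lon == null) {
            return;
        }
        // Join as LATITUDE,LONGITUDE and store under the configured field name.
        metadata.put(geoPointFieldName, lat + "," + lon);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(LATITUDE_KEY, "40.7");
        metadata.put(LONGITUDE_KEY, "-74.0");
        addGeoPoint(metadata, "myGeoPoint");
        System.out.println(metadata.get("myGeoPoint"));
    }
}
```

A single "lat,lon" string is a form that search engines such as OpenSearch and Solr can index as a geo point, which is presumably why the filter emits it.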

TikaEvalMetadataFilter

If the tika-eval-core jar is on the classpath, this filter should be added automatically; users may also specify it explicitly, as below. It runs Tika's custom version of OpenNLP's language detector and adds counts for tokens, unique tokens, and alphabetic tokens, along with the "oov" (percentage of out-of-vocabulary tokens) statistic. See TikaEval for more details on the tika-eval-app.

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.eval.core.metadata.TikaEvalMetadataFilter"/>
  </metadataFilters>
</properties>
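
To give a rough sense of the statistics named above, the following sketch computes token, unique-token, and alphabetic-token counts plus an out-of-vocabulary ratio in plain Java. It is not how tika-eval works internally: the class TokenStatsSketch is hypothetical, tokenization is assumed to be simple whitespace splitting, the vocabulary is a caller-supplied set, and "oov" is assumed to be the share of alphabetic tokens missing from that set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; tika-eval uses its own tokenizer and word lists.
final class TokenStatsSketch {
    final int numTokens;
    final int numUniqueTokens;
    final int numAlphabeticTokens;
    final double oov; // fraction in [0, 1]; multiply by 100 for a percentage

    TokenStatsSketch(String text, Set<String> vocabulary) {
        // Assumed tokenization: lowercase, split on whitespace, drop empty strings.
        List<String> tokens = Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
        // A token counts as alphabetic when every code unit is a letter.
        List<String> alphabetic = tokens.stream()
                .filter(t -> t.chars().allMatch(Character::isLetter))
                .collect(Collectors.toList());
        long unknown = alphabetic.stream()
                .filter(t -> !vocabulary.contains(t))
                .count();
        numTokens = tokens.size();
        numUniqueTokens = new HashSet<>(tokens).size();
        numAlphabeticTokens = alphabetic.size();
        oov = alphabetic.isEmpty() ? 0.0 : (double) unknown / alphabetic.size();
    }
}
```

A high ratio under this definition suggests text that does not look like the expected language, which is the kind of signal an out-of-vocabulary statistic is meant to surface.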
