...
DateNormalizingMetadataFilter
Some file formats store a timezone with their dates and others do not, but OpenSearch and Solr require timezones by default. This filter respects the timezone on dates that have one and blindly assigns UTC to dates that do not.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone. This
         filter arbitrarily assumes dates have a UTC timezone and will format all
         dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
    -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties>
```
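To make the effect of DateNormalizingMetadataFilter concrete, here is a minimal Python sketch of the behavior described above. It is an illustration only, not Tika's Java implementation. The function name `normalize_date` is hypothetical, and converting timezone-bearing dates to UTC (rather than leaving them untouched) is an assumption about what "respects" means here.

```python
from datetime import datetime, timezone


def normalize_date(value: str) -> str:
    """Format an ISO-8601 date string as yyyy-MM-dd'T'HH:mm:ss'Z'.

    Hypothetical helper: dates without a timezone are assumed to be UTC.
    """
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No timezone stored in the file: blindly treat the value as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    else:
        # Assumption: a stored timezone is honored by converting to UTC.
        parsed = parsed.astimezone(timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
```

A value such as `2020-01-02T03:04:05` comes back with only a `Z` appended, which is the form OpenSearch and Solr accept by default.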
GeoPointMetadataFilter
If a metadata object has a TikaCoreProperties.LATITUDE and a TikaCoreProperties.LONGITUDE, this concatenates those fields with a comma delimiter as LATITUDE,LONGITUDE and adds that value to the field specified by geoPointFieldName. Note: This was added in Tika 2.5.1.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.GeoPointMetadataFilter">
      <params>
        <!-- default: "location" -->
        <geoPointFieldName>myGeoPoint</geoPointFieldName>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```
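The concatenation performed by GeoPointMetadataFilter can be sketched in a few lines of Python. This is an illustration of the described behavior, not Tika's code. The dictionary keys `"LATITUDE"` and `"LONGITUDE"` are stand-ins for the actual TikaCoreProperties names, and the function `add_geo_point` is hypothetical.

```python
def add_geo_point(metadata: dict, geo_point_field_name: str = "location") -> dict:
    """Return a copy of metadata with a "lat,long" field added when both exist.

    Hypothetical helper; "LATITUDE" and "LONGITUDE" are placeholder keys.
    """
    result = dict(metadata)
    lat = result.get("LATITUDE")
    lng = result.get("LONGITUDE")
    # Only add the combined field when both coordinates are present.
    if lat is not None and lng is not None:
        result[geo_point_field_name] = f"{lat},{lng}"
    return result
```

With the configuration above, a document carrying latitude `40.7` and longitude `-74.0` would gain a `myGeoPoint` field holding `40.7,-74.0`.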
TikaEvalMetadataFilter
If the tika-eval-core jar is on the classpath, this filter should be added automatically. Users may also specify it explicitly, as below. It runs Tika's custom version of OpenNLP's language detector and reports counts for tokens, unique tokens, and alphabetic tokens, along with the "oov" (% out of vocabulary) statistic. See TikaEval for more details on the tika-eval-app.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.eval.core.metadata.TikaEvalMetadataFilter"/>
  </metadataFilters>
</properties>
```
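For a rough sense of what the token statistics mean, the following Python sketch computes comparable numbers for a piece of text. It is not how tika-eval is implemented. The tokenizer, the `common_words` vocabulary, the result keys, and the definition of "oov" as the share of alphabetic tokens missing from that vocabulary are all assumptions made for illustration.

```python
import re


def text_stats(text: str, common_words: set) -> dict:
    """Compute simple token statistics (hypothetical, for illustration)."""
    # Assumption: a token is a lowercase run of word characters.
    tokens = re.findall(r"\w+", text.lower())
    alphabetic = [t for t in tokens if t.isalpha()]
    # Alphabetic tokens that do not appear in the supplied vocabulary.
    unknown = [t for t in alphabetic if t not in common_words]
    oov = len(unknown) / len(alphabetic) if alphabetic else 0.0
    return {
        "num_tokens": len(tokens),
        "num_unique_tokens": len(set(tokens)),
        "num_alphabetic_tokens": len(alphabetic),
        "oov": oov,  # a fraction between 0 and 1, not a percentage
    }
```

A high out-of-vocabulary value on extracted text is often a sign of garbled extraction or a wrong character encoding, which is why a statistic of this kind is useful alongside the raw counts.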
...