...
```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- we're including the <parsers/> element to show that it is a separate
       element from the autoDetectParserConfig element. If it is not
       included, the standard default parser will be used -->
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
  <!-- note that the autoDetectParserConfig element is separate from the
       <parsers/> element. The composite parser built in the <parsers/>
       element is used as the base parser for the AutoDetectParser. -->
  <autoDetectParserConfig>
    <!-- note that this is a test class only available in tika-core's
         test-jar as an example. Specify your own custom factory here -->
    <contentHandlerDecoratorFactory class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>
</properties>
```
2. Metadata Filters
...
Metadata filters are specified in the <metadataFilters/> element in tika-config.xml. They are run in order, and order matters.
See TikaServerEndpointsCompared for which endpoints apply metadataFilters in tika-server. Metadata filters are applied in tika-pipes and tika-app when using the -J option. MetadataFilters are not applied when Tika streams output.
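As a minimal sketch of the ordering behavior, a tika-config.xml can chain two of the filters described below; filters run top to bottom, so each filter sees the metadata as modified by the filters listed above it:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- filters run in the order listed: this one runs first... -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
    <!-- ...then this one, which sees the output of the filter above -->
    <metadataFilter class="org.apache.tika.metadata.filter.GeoPointMetadataFilter"/>
  </metadataFilters>
</properties>
```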
FieldNameMappingFilter
This is used to select fields to include and to rename fields from the Tika names to preferred names. It was initially designed for modifying field names before emitting documents to OpenSearch or Solr.
...
DateNormalizingMetadataFilter
Some file formats store a timezone with dates; others do not. By default, OpenSearch and Solr require timezones. This filter respects dates that already have a timezone and blindly adds a UTC timezone to dates that do not.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <!-- depending on the file format, some dates do not have a timezone.
         This filter arbitrarily assumes dates have a UTC timezone and will
         format all dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they
         actually have a timezone. -->
    <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  </metadataFilters>
</properties>
```
GeoPointMetadataFilter
If a metadata object has a TikaCoreProperties.LATITUDE and a TikaCoreProperties.LONGITUDE, this filter concatenates those fields with a comma delimiter as LATITUDE,LONGITUDE and adds that value to the field specified by geoPointFieldName. Note: this was added in Tika 2.5.1.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.GeoPointMetadataFilter">
      <params>
        <!-- default: "location" -->
        <geoPointFieldName>myGeoPoint</geoPointFieldName>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
```
TikaEvalMetadataFilter
If the tika-eval-core jar is on the classpath, this filter should be added automatically. Users may specify it as below. This runs Tika's custom version of OpenNLP's language detector and includes counts for tokens, unique tokens, alphabetic tokens and the "oov" (% out of vocabulary) statistic. See TikaEval for more details on the tika-eval-app.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.eval.core.metadata.filter.TikaEvalMetadataFilter"/>
  </metadataFilters>
</properties>
```
...
If you need different behavior, implement a WriteFilterFactory, add it to your classpath, and specify it in the tika-config.xml.
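A minimal sketch of wiring in a custom factory, assuming it is registered like the other factories on the autoDetectParserConfig element; com.example.MyWriteFilterFactory is a hypothetical class standing in for your own implementation:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <autoDetectParserConfig>
    <!-- com.example.MyWriteFilterFactory is a hypothetical custom
         WriteFilterFactory implementation; substitute your own class -->
    <metadataWriteFilterFactory class="com.example.MyWriteFilterFactory"/>
  </autoDetectParserConfig>
</properties>
```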
4. AutoDetectParserConfig
We've mentioned briefly above some of the factories that can be modified in the AutoDetectParserConfig. There are other parameters that can be used to modify the behavior of the AutoDetectParser via the tika-config.xml. The AutoDetectParser is built from/contains the <parsers/> element (or SPI if no <parsers/> element is specified) in the tika-config. Because of this, the configuration of the AutoDetectParser differs from that of the component parsers it wraps: the AutoDetectParser uses its own <autoDetectParserConfig/> element at the main level inside the <properties/> element.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <autoDetectParserConfig>
    <params>
      <!-- if the incoming metadata object has a ContentLength entry and it
           is larger than this value, spool the file to disk; this is useful
           for some file formats that are more efficiently processed via a
           file instead of an InputStream -->
      <spoolToDisk>100000</spoolToDisk>
      <!-- the next four are parameters for the SecureContentHandler -->
      <!-- threshold used in zip bomb detection. This many characters must
           be written before the maximum compression ratio is calculated -->
      <outputThreshold>10000</outputThreshold>
      <!-- maximum compression ratio between output characters and input bytes -->
      <maximumCompressionRatio>100</maximumCompressionRatio>
      <!-- maximum XML element nesting level -->
      <maximumDepth>100</maximumDepth>
      <!-- maximum embedded file depth -->
      <maximumPackageEntryDepth>100</maximumPackageEntryDepth>
      <!-- as of Tika > 2.7.0, you can skip the check and exception for a
           zero-byte inputstream -->
      <throwOnZeroBytes>false</throwOnZeroBytes>
    </params>
    <!-- as of Tika 2.5.x, this is the preferred way to configure digests -->
    <digesterFactory class="org.apache.tika.parser.digestutils.CommonsDigesterFactory">
      <params>
        <markLimit>100000</markLimit>
        <!-- this specifies SHA256, base32 and MD5 -->
        <algorithmString>sha256:32,md5</algorithmString>
      </params>
    </digesterFactory>
  </autoDetectParserConfig>
</properties>
```
...