...
Metadata.RESOURCE_NAME_KEY has been renamed to TikaCoreProperties.RESOURCE_NAME_KEY.
TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS.
tika-parsers –
...
Configuring via tika-config.xml
In 2.x, configuration is being centralized: the preferred way to configure everything is a tika-config.xml
file. Two widely used parsers, PDFParser and TesseractOCRParser, previously relied on *.properties files.
PDFParser
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookMarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
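As the comment in the PDFParser configuration above notes, only the parameters you want to change need to be listed. The following is a minimal sketch of such an override file, assuming you want to turn on inline image extraction and position-based sorting. The two parameter names come from the list above; the chosen values are purely illustrative, not recommendations.

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- exclude the default PDFParser so only the customized one below is loaded -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- illustrative overrides; every parameter not listed keeps its default -->
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Keeping the file down to the overridden parameters makes it easier to see how a deployment differs from the defaults.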
TesseractOCRParser
...
language | xml |
---|---|
title | TesseractOCR Configuration |
...
; see their individual pages for details: PDFParser and TesseractOCRParser.
See the other individual parser pages for available configurations: TikaParserNotes. If you notice that a parser is missing, please help us document its configuration.
tika-parsers module
When using tika-parsers in your project, you need to change the dependencies from
...