Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY.
  • TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS

tika-parsers –

...

Configuring via tika-config.xml 

In 2.x, we're moving to centralize and prefer configuration for everything through a tika-config.xml file.  Two major popular parsers used to rely on *.properties files

PDFParser

Code Block
languagexml
titlePDFParser Configuration
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
       <!-- these are the defaults; you only need to specify the ones you want 
            to modify -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookMarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param> 
      </params>
    </parser>
  </parsers>
</properties>

TesseractOCRParser

...

languagexml
titleTesseractOCR Configuration

...

; see their individual pages for details: PDFParser and TesseractOCRParser.

See other individual parser pages for available configurations: TikaParserNotes.  If you notice any missing parsers, please help us document configurations for all parsers.

tika-parsers module

When using tika-parsers in your project, you need to change the dependencies from

...