...

The first two are fairly self-explanatory through the javadocs.

The following tika-config.xml example documents all of the parameters for the PDFParser in Tika 2.x, including catchIntermediateIOExceptions and allowExtractionForAccessibility, which controls whether the parser checks that the PDF allows extraction for accessibility. You only need to specify the parameters you want to change.

PDFParser Configuration

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookMarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
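
Because unspecified parameters keep their defaults, a much shorter file is usually enough. The sketch below assumes you only want to turn on inline image extraction and position-based sorting. Both parameter names come from the full listing; the chosen values are illustrative, not recommendations.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep the default parsers, but let the configured PDFParser below handle PDFs -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- only the two settings that differ from the defaults -->
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```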

Optional Dependencies

If you need to process TIFF or JPEG2000 images within PDFs (either for inline image extraction or OCR), please consider adding the optional dependencies specified by PDFBox. These dependencies are not compatible with ASL 2.0; please make sure that any third party licenses are suitable for your project.

Finally, M. Caruana Galizia alerted us to the need to use maven-shade's ServicesResourceTransformer because the third-party dependencies' services file will be overwritten unless you transform the services. See an example: here.
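
As a rough illustration of the shading point above, a maven-shade-plugin configuration that merges the services files might look like the following. This is a minimal sketch for your own project's pom.xml, not the linked example; the plugin version is left out here on the assumption that it is managed elsewhere in your build.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- concatenates META-INF/services entries instead of letting one jar's file overwrite another's -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```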

...

OCR

Note: the configuration of some of these features via the config file requires a nightly build of Tika after 11/8/2016 or Tika version >= 1.15.
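
For instance, the OCR-related parameters can be set in tika-config.xml in the same way as the other PDFParser parameters. The sketch below reuses the ocrStrategy and ocrDPI parameter names from the configuration listing; the value ocr_only is an assumption about the accepted strategy names and should be checked against the javadocs for your Tika version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- assumed value: render each page and run OCR instead of extracting embedded text -->
        <param name="ocrStrategy" type="string">ocr_only</param>
        <!-- rendering resolution used for the OCR images; 300 is the documented default -->
        <param name="ocrDPI" type="int">300</param>
      </params>
    </parser>
  </parsers>
</properties>
```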

...


Versions Compared

Old Version 9

New Version 10
