...

The first two are fairly self-explanatory through the javadocs.

In the following example tika-config.xml file, we document all of the parameters for the PDFParser in Tika 2.x. You only need to specify the parameters you want to change.

PDFParser Configuration

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookMarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
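
Because only the parameters you want to change need to be listed, a working configuration is usually much shorter than the full reference. As a minimal sketch, assuming you only want to turn on inline image extraction and position-based sorting (an illustrative choice of overrides, not a recommendation), the file could look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep the default parsers, but let the configured PDFParser below handle PDFs -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- illustrative overrides; every parameter not listed keeps its default -->
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The parameter names and types are taken from the reference listing for the PDFParser; only the two values differ from the documented defaults.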

Optional Dependencies

If you need to process TIFF or JPEG2000 images within PDFs (either for inline image extraction or OCR), please consider adding the optional dependencies specified by PDFBox. These dependencies are not compatible with ASL 2.0; please make sure that any third party licenses are suitable for your project.
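
As a rough sketch of what this can look like in a Maven pom.xml: the coordinates below are the jai-imageio artifacts as we understand PDFBox's dependency documentation to list them, and the two version properties are hypothetical placeholders that you would define yourself. Please check PDFBox's current documentation and each artifact's license before copying this.

```xml
<dependencies>
  <!-- TIFF support; coordinates assumed from PDFBox's dependency documentation -->
  <dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <!-- hypothetical property: set it to the version PDFBox recommends -->
    <version>${jai-imageio-core.version}</version>
  </dependency>
  <!-- JPEG2000 support; check its license, which is not ASL 2.0 compatible -->
  <dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-jpeg2000</artifactId>
    <!-- hypothetical property: set it to the version PDFBox recommends -->
    <version>${jai-imageio-jpeg2000.version}</version>
  </dependency>
</dependencies>
```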

Finally, M. Caruana Galizia alerted us to the need to use maven-shade's ServicesResourceTransformer when building a shaded jar: the third-party dependencies' services files will be overwritten unless you merge them with this transformer. See an example: here.
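
A minimal sketch of such a shade configuration, assuming a standard Maven build that shades during the package phase (the plugin version is deliberately omitted here and should be pinned in a real pom.xml):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- merges META-INF/services files from all jars instead of keeping only one copy -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Without the transformer, same-named service files from different jars are not merged, so some of the image readers and parsers registered through them may not be discovered at runtime.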

...

OCR

Note: the configuration of some of these features via the config file requires a nightly build of Tika after 11/8/2016 or Tika version >= 1.15.

...