...

PDFParser Configuration
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want 
            to modify -->
        <!-- if you want to extract content, whether or not the PDF allows extraction
             at all, do not set this parameter.  If you want to extract content and
             the PDF allows for extraction for accessibility, set this to true.
             If you do not want to extract content when the PDF does not allow extraction
             but does allow extraction for accessibility, set this to false -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <!-- as of 2.8.0 -->
        <param name="extractIncrementalUpdateInfo" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookmarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractMarkedContent" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <!-- as of 2.8.0 -->
        <param name="maxIncrementalUpdates" type="int">10000</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <!-- as of 2.8.0 -->
        <param name="parseIncrementalUpdates" type="bool">false</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param>
        <!-- as of versions after 2.8.0 -->
        <param name="throwOnEncryptedPayload" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>

...

This will extract inline images as if they were attachments; then, if Tesseract is correctly configured, OCR should run against those images. Note: extracting inline images is turned off by default because some rare PDFs contain thousands of inline images per page, and extracting them carries a large performance cost in both memory usage and time.

No Format

...
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
...

...

This will render each PDF page and then run OCR on the rendered image. This method of OCR is triggered by the ocrStrategy parameter, but users can also adjust other parameters, including the image type (see org.apache.pdfbox.rendering.ImageType for options) and the resolution in dots per inch (DPI). The defaults are gray and 300, respectively. For ocrStrategy, the current options are: no_ocr (rely on regular text extraction only), ocr_only (don't bother extracting text, just run OCR on each page), ocr_and_text (both extract text and run OCR) and, as of Tika 1.21, auto (try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values).

No Format

...
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_only</param>
                <param name="ocrImageType" type="string">rgb</param>
                <param name="ocrDPI" type="int">100</param>
            </params>
        </parser>
...
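
For comparison, here is a minimal sketch of selecting the auto strategy instead of ocr_only. It uses only parameter names and values that appear in the defaults listing above; the surrounding properties/parsers wrapper is assumed to match that listing, and the DPI and image type are simply the documented defaults restated for clarity, so they could be omitted.

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- keep the DefaultParser from also loading the PDFParser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- try regular text extraction first; fall back to OCR per the auto rules -->
        <param name="ocrStrategy" type="string">auto</param>
        <!-- documented defaults, restated here only for illustration -->
        <param name="ocrImageType" type="string">gray</param>
        <param name="ocrDPI" type="int">300</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Because rendering and OCR only happen when text extraction comes up short, this setting may be a reasonable starting point for collections that mix born-digital and scanned PDFs.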

...

Setting Parse Time/Per File configurations via tika-server

See: Configuring Parsers At Parse Time in tika-server.

Optional Dependencies

Note, you should include the following dependency to process JBIG2 images:

No Format

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.2</version>
    </dependency>

If their licenses are compatible with your application, you may also want to include the following JAI libraries in your classpath to handle jp2, jpeg2000 and tiff files. Note that their licenses are not Apache 2.0 compatible!

No Format

    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-jpeg2000</artifactId>
        <version>1.3.0</version>
        <scope>test</scope>
    </dependency>

...

Tables Aren't Extracted as Tables

Right. In PDF/A UA (Universal Accessibility), tables can be stored with structural markup. As of Tika 1.24, you can set "extractMarkedContent" = "true" via the PDFParserConfig, and Tika will extract marked content, including tables, if the PDF was generated with marked content.
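
As a sketch, the same setting can presumably be expressed in a tika-config file, following the structure of the PDFParser Configuration listing earlier on this page. Only the extractMarkedContent parameter differs from the defaults; whether any table markup actually comes out depends on the PDF having been generated with marked content.

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- keep the DefaultParser from also loading the PDFParser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- default is false; only effective for PDFs that carry marked content -->
        <param name="extractMarkedContent" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```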

In many PDFs, tables are not stored as tables. A human can easily see the tables, but all that is stored in the PDF is text chunks and their coordinates on a page (if there's any text at all). One needs to apply some advanced computation to extract table structure from a PDF. Tika does not currently do this. Please see TabulaPDF as one open source project that extracts tables from PDFs and maintains their structure.

Note that as of 2023, we still see papers at research conferences on extracting structural elements (including tables) from PDFs – THIS IS NOT A SOLVED PROBLEM.  And, humorously, this kind of task is sometimes called "PDF remediation".

No Text

Mildly Garbled Text

...