...

Installing Tesseract on Windows

See UB-Mannheim.

Optimizing

...

Tesseract

The Tesseract GitHub issues and wiki offer some advice on speeding it up, e.g. #263 and #1171, and this wiki page.

...

Overriding Default Configuration

When using the OCR Parser, Tika will use the following default settings:

  • Tesseract installation path = ""
  • Language dictionary = "eng"
  • Page Segmentation Mode = "1"
  • Minimum file size = 0
  • Maximum file size = 2147483647
  • Timeout = 120

To change these settings, you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own and placing it in the package org/apache/tika/parser/ocr on your classpath.

Note that doing this with one of the executable JARs, either the tika-app or tika-server JAR, requires you to run them without the -jar option. For example, something like the following for the tika-app or tika-server, respectively:

java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI

...
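One way to check that an overriding TesseractOCRConfig.properties is actually visible is to ask the class loader where it finds the resource. The sketch below is a hypothetical diagnostic, not part of Tika: the class name OcrConfigOverrideCheck and its locate method are made up for illustration, and it assumes Tika resolves the file through an ordinary classpath lookup, in which case the first classpath entry that supplies the resource wins.

```java
import java.net.URL;
import java.util.Optional;

// Hypothetical diagnostic class; not part of Tika.
final class OcrConfigOverrideCheck {

    // Resource path taken from the package location described above.
    static final String RESOURCE =
            "org/apache/tika/parser/ocr/TesseractOCRConfig.properties";

    // Returns the first classpath location that supplies the resource, if any.
    static Optional<URL> locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return Optional.ofNullable(loader.getResource(resourceName));
    }

    public static void main(String[] args) {
        // Prints the URL of the copy that would be picked up first.
        System.out.println(locate(RESOURCE)
                .map(URL::toString)
                .orElse("not found on classpath"));
    }
}
```

Run it with the same classpath you intend to use for Tika, with your override directory listed before the Tika JAR. If the printed URL points inside the JAR rather than at your own file, the override is not taking precedence.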

In Tika 2.x, users can modify configurations via a tika-config.xml. With the exception of the paths, the values shown below are the defaults:

TesseractOCR Configuration
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="applyRotation" type="bool">false</param>
        <param name="colorSpace" type="string">gray</param>
        <param name="density" type="int">300</param>
        <param name="depth" type="int">4</param>
        <param name="enableImagePreprocessing" type="bool">false</param>
        <param name="filter" type="string">triangle</param>
        <param name="imageMagickPath" type="string">/my/custom/imageMagicPath</param>
        <param name="language" type="string">eng</param>
        <param name="maxFileSizeToOcr" type="long">2147483647</param>
        <param name="minFileSizeToOcr" type="long">0</param>
        <param name="pageSegMode" type="string">1</param>
        <param name="pageSeparator" type="string"></param>
        <param name="preserveInterwordSpacing" type="bool">false</param>
        <param name="resize" type="int">200</param>
        <param name="skipOcr" type="bool">false</param>
        <param name="tessdataPath" type="string">/my/custom/data</param>
        <param name="tesseractPath" type="string">/my/custom/path</param>
        <param name="timeoutSeconds" type="int">120</param>
      </params>
    </parser>
  </parsers>
</properties>
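Before handing a hand-edited tika-config.xml to Tika, it can help to confirm that the file is well-formed and to see which parameters it actually sets. The sketch below uses only the JDK's DOM parser. The class OcrParamLister and its listParams method are hypothetical helpers, not part of Tika, and the sketch does not validate parameter names or types against what Tika accepts.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Tika.
final class OcrParamLister {

    // Parses the XML text and returns every <param> as name -> value,
    // in document order. Throws IllegalArgumentException if the XML
    // is not well-formed.
    static Map<String, String> listParams(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList params = doc.getElementsByTagName("param");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                result.put(param.getAttribute("name"),
                        param.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException(
                    "Config is not well-formed XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OcrParamLister tika-config.xml
        String xml = Files.readString(Path.of(args[0]));
        listParams(xml).forEach((name, value) ->
                System.out.println(name + " = " + value));
    }
}
```

Note that this lists the param elements of every parser in the file, not only those of the TesseractOCRParser.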

OCR and PDFs

See also PDFParser notes for more details on options for performing OCR on PDFs.

Note: With Tika server 1.x, the PDFConfig is generated for each document, so any configurations that you specify in the tika-config.xml file that you pass to the tika-server on startup are overwritten. This behavior changed in Tika 2.x, where the PDFConfig remembers settings from tika-config.xml and only temporarily applies custom configs sent via headers.

To go with option 1 for OCR'ing PDFs (run OCR against inline images), you need to specify configurations for the PDFParser like so:

...

To go with option 2 (render each page and then run OCR on that rendered image), you need to specify the OCR strategy:
curl -T testOCR.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on both the extracted inline images and the rendered page.
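The curl call for option 2 can also be issued from Java. The following is a minimal sketch using the JDK's built-in HttpClient (Java 11 or later). The class PdfOcrRequestSketch and its buildRequest method are hypothetical names chosen for illustration. The endpoint, file name, and header value are the ones from the curl example above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of Tika.
final class PdfOcrRequestSketch {

    // Builds a PUT request that uploads the PDF bytes and sets the
    // X-Tika-PDFOcrStrategy header, like `curl -T ... --header ...`.
    static HttpRequest buildRequest(URI endpoint, byte[] pdfBytes, String ocrStrategy) {
        return HttpRequest.newBuilder(endpoint)
                .header("X-Tika-PDFOcrStrategy", ocrStrategy)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes a tika-server is listening on localhost:9998 and that
        // testOCR.pdf exists in the working directory.
        byte[] pdf = Files.readAllBytes(Path.of("testOCR.pdf"));
        HttpRequest request = buildRequest(
                URI.create("http://localhost:9998/tika"), pdf, "ocr_only");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Keeping the request construction separate from the send makes it easy to inspect the headers without a running server.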

Disable

...

OCR in Tika

Tika's OCR will trigger on images embedded within, say, office documents in addition to images you upload directly. Because OCR slows down Tika, you might want to disable it if you don't need the results. You can disable OCR by simply uninstalling tesseract, but if that's not an option, here is a tika.xml config file that disables OCR:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

In Tika 2.x, you can selectively turn off OCR per parse programmatically by setting skipOcr on a TesseractOCRConfig. This affects only that one call to parse.

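import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

// Skip OCR for this one parse call only
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);

// inputStream, handler, and metadata are supplied by the caller
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);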

In Tika 2.x with tika-server, add this header to skip OCR per request: X-Tika-OCRskipOcr: true

Optional Dependencies

Tika can preprocess images (rotation detection and image normalization with ImageMagick) before sending them to tesseract, provided that the dependencies listed below are installed and the user opts in to these preprocessing steps.

To identify rotation

Python must be installed with scikit-image and numpy:

pip3 install numpy

pip3 install scikit-image

(As of January 5, 2021, there's a bug in the most recent numpy for Windows; specify 1.19.3 instead: pip3 install numpy==1.19.3)

In Tika 2.0, python3 must be installed and callable as python3.

Install ImageMagick

See: https://imagemagick.org/script/download.php

macOS: brew install imagemagick

Ubuntu: sudo apt install imagemagick

Windows: download the binary installer from the above page, e.g. https://imagemagick.org/download/binaries/ImageMagick-7.0.10-55-Q16-HDRI-x64-dll.exe


TODO: document how to configure these options in Tika