...

Installing Tesseract on Windows

See UB-Mannheim.

Optimizing

...

Tesseract

The Tesseract GitHub issues and wiki offer some advice on ways to speed it up, e.g. #263 and #1171, and this wiki page.

...

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
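
To double-check that a configuration file like the one above really excludes the Tesseract parser, the XML can be inspected with a few lines of Python. This is only a sketch and not part of Tika: the helper name excluded_parsers is hypothetical, and the configuration is embedded as a literal here rather than read from wherever your tika-config file actually lives.

```python
import xml.etree.ElementTree as ET

# The configuration shown above, embedded as bytes for illustration only.
CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
"""


def excluded_parsers(config_xml: bytes) -> list:
    """Return the class names listed in <parser-exclude> elements (hypothetical helper)."""
    root = ET.fromstring(config_xml)
    return [element.get("class") for element in root.iter("parser-exclude")]


if __name__ == "__main__":
    for name in excluded_parsers(CONFIG):
        print("excluded:", name)
```

The check only reads the XML. It says nothing about whether Tika has actually loaded that file.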


Optional Dependencies

Tika can preprocess images (rotation detection, and image normalization with ImageMagick) before sending them to Tesseract, provided the user has installed the dependencies listed below and has opted in to these preprocessing steps.

To identify rotation

Python must be installed, along with scikit-image and numpy:

pip3 install numpy

pip3 install scikit-image

(As of January 5, 2021, there is a bug in the most recent numpy release for Windows; specify version 1.19.3 instead: pip3 install numpy==1.19.3)

In Tika 2.0, Python 3 must be installed and callable as python3.
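
Before enabling rotation detection, it can help to confirm that these prerequisites are actually in place. The script below is a minimal sketch, not something shipped with Tika; the function names python3_on_path and missing_modules are hypothetical. It assumes that scikit-image is imported under the module name skimage, and it should be run with the same interpreter that python3 resolves to.

```python
import importlib.util
import shutil


def python3_on_path() -> bool:
    """Return True if an executable named python3 is found on PATH."""
    return shutil.which("python3") is not None


def missing_modules(names=("numpy", "skimage")) -> list:
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("python3 callable:", python3_on_path())
    print("missing modules:", missing_modules() or "none")
```

An empty list from missing_modules() means both packages are visible to the interpreter running the check. It does not verify that the installed versions work with Tika.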

Install ImageMagick

See: https://imagemagick.org/script/download.php

macOS: brew install imagemagick

Ubuntu: sudo apt install imagemagick

Windows: download the binary installer from the above page, e.g. https://imagemagick.org/download/binaries/ImageMagick-7.0.10-55-Q16-HDRI-x64-dll.exe
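
After installing, a quick check that an ImageMagick executable is reachable on PATH can save some debugging later. The following is a rough sketch under a few assumptions: the helper names are hypothetical, ImageMagick 7 normally provides a magick command while older installs provide convert, and this page does not say which name Tika invokes. On Windows, an unrelated system tool is also called convert, so a hit on that name is not proof of an ImageMagick install.

```python
import shutil
import subprocess


def find_imagemagick(candidates=("magick", "convert")):
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def first_version_line(executable: str) -> str:
    """Run `<executable> -version` and return the first line of its output, if any."""
    result = subprocess.run(
        [executable, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    exe = find_imagemagick()
    if exe is None:
        print("No ImageMagick executable found on PATH")
    else:
        print(exe, "->", first_version_line(exe))
```

If the version line does not mention ImageMagick, the executable that was found is probably something else.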


TODO: document how to configure these options in Tika