...
Installing Tesseract on Windows
See UB-Mannheim.
Optimizing
...
Tesseract
The Tesseract GitHub issues and wiki offer advice on speeding Tesseract up, e.g. issues #263 and #1171 and this wiki page.
...
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
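Before handing a configuration like the one above to Tika, it can help to confirm that it is well-formed and that the exclusion is where you expect it. The following is a minimal sketch using only the Python standard library. The helper name `excluded_parsers` is hypothetical and is not part of Tika; it only reads the XML and does not reproduce how Tika itself interprets the file.

```python
import xml.etree.ElementTree as ET

# The configuration from this page, held as bytes so the XML declaration's
# encoding is honored by the parser.
CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
"""


def excluded_parsers(config_xml):
    """Return the class names of all <parser-exclude> elements (hypothetical helper)."""
    root = ET.fromstring(config_xml)  # raises ParseError if the XML is malformed
    return [element.get("class") for element in root.iter("parser-exclude")]


for class_name in excluded_parsers(CONFIG):
    print("excluded:", class_name)
```

To check a file on disk instead, read it in binary mode and pass the bytes to the same function.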
Optional Dependencies
Tika preprocesses images (rotation detection, and image normalization with ImageMagick) before sending them to Tesseract, provided that the dependencies listed below are installed and the user opts in to these preprocessing steps.
To identify rotation
Python must be installed, along with scikit-image and numpy:
pip3 install numpy
pip3 install scikit-image
(As of January 5, 2021, there is a bug in the most recent numpy for Windows; specify version 1.19.3: pip3 install numpy==1.19.3)
In Tika 2.0, python3 must be installed and callable as python3.
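The rotation-detection prerequisites above can be checked with a short script. This is a sketch, not something Tika ships: the helper names `missing_modules` and `python3_on_path` are hypothetical. Note that the pip package scikit-image is imported under the name `skimage`. The module check applies to the interpreter that runs the script, so run it with the same python3 that Tika would call.

```python
import importlib.util
import shutil

# Import names, not pip package names: scikit-image is imported as "skimage".
REQUIRED_MODULES = ["numpy", "skimage"]


def missing_modules(names):
    """Return the top-level module names this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python3_on_path():
    """Return True if an executable named python3 is found on PATH."""
    return shutil.which("python3") is not None


print("python3 on PATH:", python3_on_path())
print("missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "missing modules" list together with "python3 on PATH: True" suggests the prerequisites listed above are in place.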
Install ImageMagick
See: https://imagemagick.org/script/download.php
macOS: brew install imagemagick
Ubuntu: sudo apt install imagemagick
Windows: download the binary installer from the above page, e.g. https://imagemagick.org/download/binaries/ImageMagick-7.0.10-55-Q16-HDRI-x64-dll.exe
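After installing, it is worth confirming that ImageMagick is reachable from the command line. The sketch below looks for the `magick` executable (ImageMagick 7) and falls back to `convert` (used by ImageMagick 6 packages). The helper names are hypothetical, and which executable Tika actually invokes is not stated here, so treat this as an assumption. On Windows, an unrelated system tool is also named `convert`, so a match on that name alone is not conclusive there.

```python
import shutil
import subprocess


def find_imagemagick(candidates=("magick", "convert")):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def imagemagick_version(executable):
    """Return the first line printed by `<executable> -version`, or None if empty."""
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=False,
        timeout=10,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    executable = find_imagemagick()
    if executable is None:
        print("ImageMagick not found on PATH")
    else:
        print(executable, "->", imagemagick_version(executable))
```

If the version line does not mention ImageMagick, the executable that was found is probably a different program with the same name.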
TODO: document how to configure these options in Tika