
...

OCR is now triggered automatically for PDFs if tesseract is on the user's path. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr for how to disable OCR.
We upgraded from log4j to log4j2 in tika-app, tika-server, and everywhere else that previously used log4j.
The tika-parsers package has been split into tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package.
tika-app only includes the parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package themselves if they need them.
tika-server has been renamed to tika-server-standard and only includes the parsers in tika-parsers-standard-package.
tika-server is now run in --spawnChild mode by default.
Removed deprecated Metadata keys/properties (TIKA-1974). See below for a list of changed keys.
Removed deprecated PDFPreflightParser (TIKA-3437).
Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
Changed the namespaces of the translator implementations (o.a.t.language.translate.impl) to avoid a split package with tika-core.
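To illustrate the OCR and tika-config.xml items above, here is a minimal sketch of one way to turn automatic OCR off through configuration. It assumes the `parser-exclude` form described on the TikaOCR page, applied to `org.apache.tika.parser.ocr.TesseractOCRParser`. The class name `TikaConfigSketch` and its helper methods are hypothetical and exist only for this example. The code uses only the JDK: it writes the file and checks that it is well-formed XML. Loading it into Tika is shown in a comment, because that step needs the Tika jars on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class TikaConfigSketch {

    // Assumed configuration: keep the default parser set, but exclude the
    // Tesseract OCR parser so that PDFs are no longer sent through OCR.
    static final String CONFIG_XML = String.join("\n",
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
        "<properties>",
        "  <parsers>",
        "    <parser class=\"org.apache.tika.parser.DefaultParser\">",
        "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>",
        "    </parser>",
        "  </parsers>",
        "</properties>");

    // Parses the XML with the JDK parser (throws if it is not well-formed)
    // and returns how many parser-exclude elements it contains.
    static int countExcludedParsers(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("parser-exclude").getLength();
    }

    // Writes the configuration to <dir>/tika-config.xml and returns its path.
    static Path writeConfig(Path dir) throws IOException {
        return Files.writeString(dir.resolve("tika-config.xml"), CONFIG_XML);
    }

    public static void main(String[] args) throws Exception {
        Path config = writeConfig(Files.createTempDirectory("tika-config"));
        System.out.println("wrote " + config
            + ", excluded parsers: " + countExcludedParsers(CONFIG_XML));
        // With tika-core and tika-parsers-standard-package on the classpath,
        // the file could then be loaded along these lines (not compiled here):
        //   TikaConfig tikaConfig = new TikaConfig(config);
        //   Parser parser = new AutoDetectParser(tikaConfig);
    }
}
```

The same file can also be passed to tika-app on the command line with `--config=tika-config.xml`. Per-parser settings for the PDFParser, TesseractOCRParser and StringsParser go into the same file; the parser-specific pages list the parameters that are available.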

...
