...

  • OCR is now triggered automatically for PDFs if tesseract is on the user's path. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr for how to disable OCR.
  • We upgraded from log4j to log4j2 in tika-app, tika-server, and everywhere else we previously used log4j.
  • The tika-parsers package has been split into tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package.
  • tika-app only includes parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package if desired.
  • tika-server is now tika-server-standard and only includes the parsers in tika-parsers-standard-package.
  • tika-server is now run in --spawnChild mode by default.
  • Removed deprecated Metadata keys/properties (TIKA-1974). See below for a list of changed keys.
  • Removed deprecated PDFPreflightParser (TIKA-3437). 
  • Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
  • Changed the namespaces of the translator implementations (o.a.t.language.translate.impl) to avoid a split package with tika-core.
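
As a rough illustration of configuration via tika-config.xml, the sketch below excludes the Tesseract OCR parser from the default parser set, which is one way to keep OCR from being triggered even when tesseract is installed. It is modeled on the approach described on the TikaOCR page linked above and assumes the default parser setup; treat that page and the parser-specific pages as authoritative for the exact elements and parameters your version supports.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: a minimal tika-config.xml that keeps the default parsers
     but leaves out the Tesseract OCR parser. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Assumption: excluding this class is enough to disable OCR
           in your setup; verify against the TikaOCR page. -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With tika-app, a file like this can be supplied through the --config option, for example --config=tika-config.xml (the file name here is only a placeholder).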

For more details on changes in tika-server in 2.x, please see: TikaServer in Tika 2.x.

Metadata

Removed duplicate/triplicate keys

...