NOTE: THIS PAGE IS IN PROGRESS. PLEASE CHECK BACK FOR MORE DETAILS.
For now, see: https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt
Major breaking changes
- OCR is now triggered automatically for PDFs if tesseract is on the user's path. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr for how to disable OCR.
- Removed deprecated Metadata keys/properties and moved some commonly used keys from Metadata to TikaCoreProperties (such as TikaCoreProperties.RESOURCE_NAME_KEY) (TIKA-1974). See below for a list of changed keys.
- We upgraded from log4j to log4j2 in tika-app, tika-server and anywhere else we used to use log4j.
- The tika-parsers package has been split into several sub-packages, including tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package.
  - tika-app only includes parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package if desired.
  - tika-server is now tika-server-standard and only includes parsers in tika-parsers-standard-package.
  - tika-server is now run in --spawnChild mode by default.
- Removed deprecated PDFPreflightParser (TIKA-3437).
- Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
- Changed namespaces of translator implementations (e.g. org.apache.tika.language.translate.impl) to avoid split-package with tika-core.
For more details on changes in tika-server in 2.x, please see: TikaServer in Tika 2.x.
Metadata
Breaking Metadata Key Changes Between 1.x and 2.x
These are changes in locations of keys, not in the key names that consumers/clients will see:
- Metadata.RESOURCE_NAME_KEY has been renamed to TikaCoreProperties.RESOURCE_NAME_KEY.
- TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS.
Changed Metadata Keys
There are a few other subtle changes in key names listed below:
Tika 1.x | Tika 2.x |
---|---|
X-Parsed-By | X-TIKA:Parsed-By |
X-TIKA:EXCEPTION:runtime | X-TIKA:EXCEPTION:container_exception |
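Clients that read metadata produced by both 1.x and 2.x may want to look up a renamed key under either name during the transition. The following is a minimal sketch of that idea, not part of Tika itself: the class RenamedKeyLookup and its methods are hypothetical, and it assumes the metadata has already been flattened into a plain Map<String, String> (real Tika metadata can hold multiple values per key). Only the four key names come from the table above.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not a Tika class): reads a renamed key, preferring the 2.x name.
final class RenamedKeyLookup {

    // Key names taken from the table above.
    static final String PARSED_BY_2X = "X-TIKA:Parsed-By";
    static final String PARSED_BY_1X = "X-Parsed-By";
    static final String CONTAINER_EXCEPTION_2X = "X-TIKA:EXCEPTION:container_exception";
    static final String CONTAINER_EXCEPTION_1X = "X-TIKA:EXCEPTION:runtime";

    private RenamedKeyLookup() {
    }

    // Returns the value of the first key in 'keys' that is present in the map.
    static Optional<String> firstPresent(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    // Assumption: metadata is a flattened, single-valued view of the parse output.
    static Optional<String> parsedBy(Map<String, String> metadata) {
        return firstPresent(metadata, PARSED_BY_2X, PARSED_BY_1X);
    }

    static Optional<String> containerException(Map<String, String> metadata) {
        return firstPresent(metadata, CONTAINER_EXCEPTION_2X, CONTAINER_EXCEPTION_1X);
    }
}
```

Checking the 2.x name first means the fallback can be deleted without changing behavior once no 1.x output remains.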
Removed duplicate/triplicate keys
...
Tika 1.x | Tika 2.x |
---|---|
Author, meta:author, dc:creator | dc:creator |
Last-Author, meta:last-author | meta:last-author |
Creation-Date, date, dcterms:created | dcterms:created |
Last-Modified, modified, dcterms:modified | dcterms:modified |
Last-Save-Date, meta:save-date | meta:save-date |
Application-Name, extended-properties:Application | extended-properties:Application |
Character Count, meta:character-count | meta:character-count |
Company, extended-properties:Company | extended-properties:Company |
Edit-Time, extended-properties:TotalTime | extended-properties:TotalTime |
Keywords, meta:keyword, dc:subject | meta:keyword, dc:subject |
Page-Count, meta:page-count | meta:page-count |
Revision-Number, cp:revision | cp:revision |
subject, cp:subject, dc:subject | dc:subject |
Template, extended-properties:Template | extended-properties:Template |
Word-Count, meta:word-count | meta:word-count |
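For metadata that was stored from 1.x output, one way to apply the table above is to fold each removed alias into its surviving key. The sketch below is illustrative only: LegacyKeyMigrator is a hypothetical class, it assumes single-valued metadata held in a Map<String, String>, and it covers just a subset of the rows. The Keywords row is left out because it does not map to a single surviving key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Tika class): rewrites stored 1.x keys to their 2.x equivalents.
final class LegacyKeyMigrator {

    // Removed 1.x alias -> surviving 2.x key, copied from a subset of the table above.
    private static final Map<String, String> ALIASES = Map.ofEntries(
            Map.entry("Author", "dc:creator"),
            Map.entry("meta:author", "dc:creator"),
            Map.entry("Last-Author", "meta:last-author"),
            Map.entry("Creation-Date", "dcterms:created"),
            Map.entry("date", "dcterms:created"),
            Map.entry("Last-Modified", "dcterms:modified"),
            Map.entry("modified", "dcterms:modified"),
            Map.entry("Last-Save-Date", "meta:save-date"),
            Map.entry("Page-Count", "meta:page-count"),
            Map.entry("Word-Count", "meta:word-count"));

    private LegacyKeyMigrator() {
    }

    // Assumption: keys are non-null and each key holds a single value.
    static Map<String, String> migrate(Map<String, String> legacy) {
        Map<String, String> migrated = new LinkedHashMap<>();
        // First pass: copy every key that is not a removed alias, so surviving keys win.
        for (Map.Entry<String, String> e : legacy.entrySet()) {
            if (!ALIASES.containsKey(e.getKey())) {
                migrated.put(e.getKey(), e.getValue());
            }
        }
        // Second pass: fold an alias in only when its surviving key is still absent.
        for (Map.Entry<String, String> e : legacy.entrySet()) {
            String target = ALIASES.get(e.getKey());
            if (target != null) {
                migrated.putIfAbsent(target, e.getValue());
            }
        }
        return migrated;
    }
}
```

Letting the surviving key take precedence over its aliases is a design choice of this sketch; in 1.x output the duplicates normally carried the same value, so the order rarely matters.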
tika-parsers – Configuring via tika-config.xml
...