...
For now, see: https://downloads.apache.org/tika/2.0.0/CHANGES-2.0.0.txt
Metadata
Removed duplicate/triplicate keys
Background: In early 1.x, we had basic metadata keys that were created somewhat ad hoc. We then added normalized metadata keys based on standards such as Dublin Core, or at least tried to add namespaces to the metadata keys for specific file formats. To maintain backwards compatibility, we kept the old keys alongside the new ones. This led to quite a bit of metadata bloat, with the same information stored two or three times. In Tika 2.x, we slimmed down the metadata keys and now rely only on the standardized key (Dublin Core, for example) where one exists.
Tika 1.x | Tika 2.x |
---|---|
Author, meta:author, dc:creator | dc:creator |
Last-Author, meta:last-author | meta:last-author |
Creation-Date, date, dcterms:created | dcterms:created |
Last-Modified, modified, dcterms:modified | dcterms:modified |
Last-Save-Date, meta:save-date | meta:save-date |
Application-Name, extended-properties:Application | extended-properties:Application |
Character Count, meta:character-count | meta:character-count |
Company, extended-properties:Company | extended-properties:Company |
Edit-Time, extended-properties:TotalTime | extended-properties:TotalTime |
Keywords, meta:keyword, dc:subject | meta:keyword, dc:subject |
Page-Count, meta:page-count | meta:page-count |
Revision-Number, cp:revision | cp:revision |
subject, cp:subject, dc:subject | dc:subject |
Template, extended-properties:Template | extended-properties:Template |
Word-Count, meta:word-count | meta:word-count |
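Downstream code that still looks up the legacy names (for example, field names in a search schema) needs to be pointed at the surviving keys. The following is a minimal sketch of such a lookup, using only the JDK and covering a subset of the rows in the table above. The class `LegacyKeyLookup` and the method `toTika2Key` are hypothetical names for illustration and are not part of Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps a legacy 1.x key to the key kept in 2.x.
final class LegacyKeyLookup {

    // Subset of the table above: legacy 1.x key -> surviving 2.x key.
    private static final Map<String, String> LEGACY_TO_2X = new HashMap<>();

    static {
        LEGACY_TO_2X.put("Author", "dc:creator");
        LEGACY_TO_2X.put("meta:author", "dc:creator");
        LEGACY_TO_2X.put("Creation-Date", "dcterms:created");
        LEGACY_TO_2X.put("date", "dcterms:created");
        LEGACY_TO_2X.put("Last-Modified", "dcterms:modified");
        LEGACY_TO_2X.put("modified", "dcterms:modified");
        LEGACY_TO_2X.put("Last-Author", "meta:last-author");
        LEGACY_TO_2X.put("Page-Count", "meta:page-count");
        LEGACY_TO_2X.put("Word-Count", "meta:word-count");
    }

    // Returns the 2.x key for a known legacy key; any other key is returned unchanged.
    static String toTika2Key(String key) {
        return LEGACY_TO_2X.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(toTika2Key("Author"));          // prints "dc:creator"
        System.out.println(toTika2Key("dcterms:created")); // prints "dcterms:created"
    }
}
```

Rows that keep more than one key in 2.x (such as Keywords) are left out of this sketch, since a one-to-one lookup cannot represent them.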
Changed Metadata Keys
Tika 1.x | Tika 2.x |
---|---|
X-Parsed-By | X-TIKA:Parsed-By |
- Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY.
- TikaCoreProperties.KEYWORDS has been renamed Office.KEYWORDS.
- X-Parsed-By has changed to X-TIKA:Parsed-By.
- X-TIKA:EXCEPTION:runtime has been changed to X-TIKA:EXCEPTION:container_exception.
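Metadata that was extracted with 1.x and stored elsewhere (in an index or a JSON dump, say) will still carry the old key strings. As a rough sketch, assuming the stored record has been loaded into a plain `Map<String, String>`, the two changed key strings listed above could be rewritten as follows. `RenamedKeyMigration`, `rename`, and `migrate` are hypothetical names, and the single-valued map is a simplification, since Tika metadata keys can hold multiple values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: rewrites the key strings that changed in 2.x.
final class RenamedKeyMigration {

    // Maps a changed 1.x key string to its 2.x replacement; other keys pass through.
    static String rename(String key) {
        switch (key) {
            case "X-Parsed-By":
                return "X-TIKA:Parsed-By";
            case "X-TIKA:EXCEPTION:runtime":
                return "X-TIKA:EXCEPTION:container_exception";
            default:
                return key;
        }
    }

    // Returns a copy of the stored record with changed keys renamed, preserving entry order.
    static Map<String, String> migrate(Map<String, String> stored) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            out.put(rename(e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("X-Parsed-By", "org.apache.tika.parser.DefaultParser");
        stored.put("dc:creator", "Jane Doe");
        System.out.println(migrate(stored).keySet()); // prints "[X-TIKA:Parsed-By, dc:creator]"
    }
}
```

The renamed Java constants (`RESOURCE_NAME_KEY`, `KEYWORDS`) are source-level changes and need to be updated in calling code rather than in stored data.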
...