Tika 1.x	Tika 2.x
Author, meta:author, dc:creator	dc:creator
Last-Author, meta:last-author	meta:last-author
title, dc:title	dc:title
Creation-Date, date, dcterms:created	dcterms:created
Last-Modified, modified, dcterms:modified	dcterms:modified
Last-Save-Date, meta:save-date	meta:save-date
w:comments	w:Comments
Application-Name, extended-properties:Application	extended-properties:Application
Character Count, meta:character-count	meta:character-count
Company, extended-properties:Company	extended-properties:Company
Edit-Time, extended-properties:TotalTime	extended-properties:TotalTime
Keywords, meta:keyword, dc:subject	meta:keyword, dc:subject
Page-Count, meta:page-count	meta:page-count
Revision-Number, cp:revision	cp:revision
subject, cp:subject, dc:subject	dc:subject
Template, extended-properties:Template	extended-properties:Template
Word-Count, meta:word-count	meta:word-count
identifier	dc:identifier
publisher	dc:publisher
dc:description, subject (as in MSG files)	dc:description (dc:subject was added back in 2.4.0).
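
If your code reads metadata by the old 1.x keys, one way to migrate is a small lookup built from the table above. The following is only a sketch: the class name TikaKeyMigration and the method migrate are hypothetical and not part of Tika, and the map covers only a subset of the rows. The bare key "subject" is left out on purpose because the table maps it to different 2.x keys depending on the file type.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps a Tika 1.x metadata key
// to the 2.x key(s) listed in the table above (subset of rows only).
class TikaKeyMigration {

    static final Map<String, List<String>> LEGACY_TO_2X = Map.ofEntries(
        Map.entry("Author", List.of("dc:creator")),
        Map.entry("meta:author", List.of("dc:creator")),
        Map.entry("Last-Author", List.of("meta:last-author")),
        Map.entry("title", List.of("dc:title")),
        Map.entry("Creation-Date", List.of("dcterms:created")),
        Map.entry("date", List.of("dcterms:created")),
        Map.entry("Last-Modified", List.of("dcterms:modified")),
        Map.entry("modified", List.of("dcterms:modified")),
        Map.entry("Last-Save-Date", List.of("meta:save-date")),
        Map.entry("w:comments", List.of("w:Comments")),
        Map.entry("Keywords", List.of("meta:keyword", "dc:subject")),
        Map.entry("identifier", List.of("dc:identifier")),
        Map.entry("publisher", List.of("dc:publisher"))
    );

    // Returns the 2.x key(s) for a 1.x key; keys not in the map
    // (including keys that did not change) are returned as-is.
    static List<String> migrate(String legacyKey) {
        return LEGACY_TO_2X.getOrDefault(legacyKey, List.of(legacyKey));
    }

    public static void main(String[] args) {
        for (String key : List.of("Author", "Keywords", "dc:title")) {
            System.out.println(key + " -> " + migrate(key));
        }
    }
}
```

Returning a list rather than a single string keeps rows such as Keywords, which map to more than one 2.x key, representable without special cases.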

tika-parsers – Configuring via tika-config.xml

...

See other individual parser pages for available configurations: TikaParserNotes. If you notice any missing parsers, please help us document configurations for all parsers.

tika-parsers module

In Tika 2.x, we separated the 1.x tika-parsers module into three modules and packages:

tika-parsers-standard – the most common parsers; these should not require REST calls or native libs (NOTE: despite the goal of this package, we do include the TesseractOCR parser, which will run Tesseract if you have it installed)
tika-parsers-extended – may include native libs and/or dependencies that not everyone wants (e.g. netcdf)
tika-parsers-ml – may include heavy dependencies (e.g. dl4j) or parsers that rely on REST calls and external services

The goal is to allow users to select only the parsers (and dependencies) that they want.

When using tika-parsers in your project, you need to change the dependencies from:

pom.xml from 1.27

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>

to at least tika-parsers-standard-package. If you want netcdf parsing and/or sqlite3 parsing, both of which were included in tika-parsers in 1.x, you'll also need to include tika-parser-scientific-package and/or tika-parser-sqlite3-package.

pom.xml for 2.0.0+

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-package</artifactId>
  <version>2.17.0</version>
</dependency>

NOTE: As of Tika 2.7.0, we have added tika-parser-nlp-package to our release artifacts.

NOTE: As in Tika 1.x, if you need detection on container formats (e.g. OLE2: .doc, .ppt, .xls; zip-based: .xlsx, .pptx, .docx; or .ogg-based), you need to include the underlying Tika parsers, which parse the container files and make the detection based on the information in the container. In Tika 2.x, this means that you need to include tika-parsers-standard-package!

...
