...
See other individual parser pages for available configurations: TikaParserNotes. If you notice any missing parsers, please help us document configurations for all parsers.
tika-parsers module
In Tika 2.x, we separated the 1.x tika-parsers module into three modules and packages:
- tika-parsers-standard – the most common parsers; these should not require REST calls or native libs (NOTE: despite this goal, we do include the TesseractOCR parser, which will run Tesseract if you have it installed)
- tika-parsers-extended – may include native libs and/or dependencies that not everyone wants (e.g. netcdf)
- tika-parsers-ml – may include heavy dependencies (e.g. dl4j) or parsers that rely on REST calls and external services
The goal is to allow users to select only the parsers (and dependencies) that they want.
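As a rough sketch of what selecting only some parsers could look like, the fragment below depends on tika-core plus a single parser module instead of a whole package. The artifactId tika-parser-pdf-module is an assumption used for illustration (this page does not list the individual module names), and the version is simply copied from the examples on this page, so check the published artifacts before relying on either.

```xml
<!-- Sketch only: a narrow dependency set for a project that parses just one format. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <!-- Assumed module name; verify it against the released artifacts. -->
  <artifactId>tika-parser-pdf-module</artifactId>
  <version>2.17.0</version>
</dependency>
```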
When using tika-parsers in your project, you need to change the dependencies from:
```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>
```
to at least tika-parsers-standard-package. If you want netcdf parsing and/or sqlite parsing – both of which were included in tika-parsers in 1.x – you'll need to include tika-parser-scientific-package and tika-parser-sqlite3-package.
```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-package</artifactId>
  <version>2.17.0</version>
</dependency>
```
NOTE: As of Tika 2.7.0, we have added tika-parser-nlp-package to our release artifacts.
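Assuming it uses the same groupId as the other packages on this page, and a version matching the rest of your Tika dependencies (both assumptions, not stated above), pulling it in might look like this:

```xml
<!-- Sketch only: coordinates assumed to mirror the other Tika packages. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-nlp-package</artifactId>
  <version>2.17.0</version>
</dependency>
```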
NOTE: As in Tika 1.x, if you need detection on container formats (e.g. OLE2: .doc, .ppt, .xls; zip-based: .xlsx, .pptx, .docx; or .ogg-based), you need to include the underlying Tika parsers that parse the container files and base the detection on the information inside the container. In Tika 2.x, this means that you need to include tika-parsers-standard-package!
...