NOTE: THIS PAGE IS IN PROGRESS. PLEASE CHECK BACK FOR MORE DETAILS.
For now, see: https://downloads.apache.org/tika/2.0.0/CHANGES-2.0.0.txt
Metadata
Metadata.RESOURCE_NAME_KEY has been renamed to TikaCoreProperties.RESOURCE_NAME_KEY.
TikaCoreProperties.KEYWORDS has been removed.
The X-Parsed-By metadata key has changed to X-TIKA:Parsed-By.
tika-parsers – specific parser changes
tika-parsers module
When using tika-parsers in your project, you need to change the dependencies from
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>
to
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-module</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-module</artifactId>
  <version>2.0.0</version>
</dependency>
There is also a small transitive dependency conflict on jcl-over-slf4j between tika-parsers-standard-package:2.0.0 and tika-parser-scientific-module:2.0.0. If you use the Maven Enforcer Plugin, you will need to resolve it by declaring the version explicitly:
<!-- Fix tika-parsers-standard-package 2.0.0 vs tika-parser-scientific-module:2.0.0 transitive dependency -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>1.7.31</version>
</dependency>
If you scan for CVEs (recommended), note that tika-parser-scientific-module:2.0.0 comes with a transitive dependency on quartz 2.2.0. Exclude it and declare a newer version like this:
<dependency>
  <groupId>edu.ucar</groupId>
  <artifactId>netcdf4</artifactId>
  <version>${netcdf-java.version}</version>
  <exclusions>
    ....
    <exclusion>
      <groupId>org.quartz-scheduler</groupId>
      <artifactId>quartz</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.quartz-scheduler</groupId>
  <artifactId>quartz</artifactId>
  <version>2.3.2</version>
</dependency>
When using language detection, you now need to use:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.0</version>
</dependency>
Also note that org.apache.tika.langdetect.OptimaizeLangDetector.getDefaultLanguageDetector has moved to org.apache.tika.langdetect.optimaize.OptimaizeLangDetector.getDefaultLanguageDetector.
For OCR, the TesseractOCRConfig.setTesseractPath(String) and TesseractOCRConfig.setTessdataPath(String) methods are no longer available. They have moved to the TesseractOCRParser class.
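As a hedged illustration of where these settings now live, the two paths can be supplied as parameters on the parser in a tika-config.xml file. This is a minimal sketch, not a definitive configuration: the parameter names are assumed to mirror the setter names above, and both paths are hypothetical placeholders for your own installation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude Tesseract so it can be configured separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Hypothetical paths; adjust to your Tesseract installation -->
        <param name="tesseractPath" type="string">/opt/tesseract/bin</param>
        <param name="tessdataPath" type="string">/opt/tesseract/share/tessdata</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The file can then be loaded through TikaConfig like any other Tika configuration.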
tika-app
tika-server
General
enableFileUrl has been removed in favor of a FileSystemFetcher; see tika-pipes#FetchersInClassicServerEndpoints.
Configuration
tika-pipes
See the tika-pipes page.
tika-eval
tika-langid
In the 1.x branch, the default (hardwired) language identification component was the wrapper around Optimaize. If you used the following in 1.x:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect</artifactId>
  <version>1.27</version>
</dependency>
In 2.x, change this to:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.x</version>
</dependency>
The original language id component that was built by Tika devs and that used to be in tika-core is now in the tika-langdetect-tika module.
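If you still want that original detector, a dependency along these lines should pull it in. This is only a sketch: it assumes the module is published under the same groupId with an artifactId matching the module name, and that its version matches the rest of your Tika 2.x dependencies.

```xml
<!-- Assumed coordinates for the module named above; verify against your repository -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-tika</artifactId>
  <version>2.0.0</version>
</dependency>
```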