See Migrating to Tika 2.0.0 for a general overview of changes in Tika 2.x.
See TikaServer for building and general usage of tika-server.
Major changes
- Modularization – We've modularized tika-server:
  - tika-server-core includes all of the functionality of tika-server, but with no bundled parsers. Users might want this if they are only parsing a few file formats or want to use only their custom parsers.
  - tika-server-standard is what most people will want to use. As with the tika-parsers-standard module, this includes most of the common file format parsers. If needed, users may also add the tika-parser-scientific-package and tika-parser-sqlite3-package to the class path. In 1.x, the first was included in tika-server by default, and the second was included only if users added xerial's sqlite3 jar to the class path.
- --spawnChild mode is now the default. In Tika 1.x, users had to specify this on the commandline to force tika-server to fork a process that did the actual parsing. This option is far more robust against timeouts, OOMs, crashes and other mishaps; the forking process monitors the forked process and will restart it on timeouts, etc. NOTE: Client code needs to be able to handle the times when tika-server is restarting and is not available; this typically only takes a few seconds. To disable this mode, use --noFork on the commandline.
- Configuring tika-server in Tika 2.x. See below. We've moved most configuration options into tika-config.xml and dramatically limited the commandline options.
- The namespace has changed slightly for TikaServerCli to org.apache.tika.server.core.TikaServerCli. If adding optional jars to the class path in, say, a bin/ directory, start tika-server with: java -cp "bin/*" org.apache.tika.server.core.TikaServerCli -c tika-config.xml
- enableFileUrl – We have removed this capability from tika-server in 2.x. We have replaced it with the FileSystemFetcher, which is available in tika-core. See FetchersInClassicServerEndpoints.
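The note above about clients tolerating restarts can be handled with a small retry loop. The following is a minimal sketch, not part of Tika: the helper names call_with_retry and put_to_rmeta are hypothetical, and it assumes that a restarting server shows up on the client as a connection-level error, and that the server listens on localhost:9998 and accepts a PUT of the raw file bytes on /rmeta.

```python
import time
import urllib.error
import urllib.request


def call_with_retry(request_fn, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call request_fn(), retrying when the server cannot be reached.

    Hypothetical helper: retries only connection-level failures, on the
    assumption that this is how a restarting tika-server appears to clients.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return request_fn()
        except urllib.error.HTTPError:
            # The server answered with an HTTP status, so it is up; do not retry.
            raise
        except (ConnectionError, urllib.error.URLError) as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay_seconds)  # give the forked process time to come back
    raise last_error


def put_to_rmeta(path, base_url="http://localhost:9998"):
    """Send one file to the /rmeta endpoint (URL and method are assumptions)."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(base_url + "/rmeta", data=data, method="PUT")
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.read()
```

A caller would wrap each request, for example call_with_retry(lambda: put_to_rmeta("report.pdf")), where report.pdf is a placeholder file name. Because a restart typically takes only a few seconds, a short fixed delay is usually enough; a longer backoff is a reasonable variation.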
Configuring tika-server in Tika 2.x
As with other components, in Tika 2.x, we moved configuration into tika-config.xml. We have left only a few commandline options available (to see the options: java -jar tika-server-standard-2.x.x.jar --help).
- -h, --host – hostname
- -p, --port – which port to bind to. Can specify ranges, e.g. 9990-9999, and Tika will launch 10 servers in forked processes, one on each of those ports. Can also specify a comma-delimited list, e.g. 9996,9998,9999.
- -?, --help
- -c, --config – specify the tika-config.xml file to use for this tika-server and its forked processes.
- -i, --id – specify the id for this server. This is used in logging and in the /status endpoint.
- --noFork – run tika-server in legacy mode without forking a process.
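To make the --port syntax concrete, here is a small sketch of how a range or a comma-delimited list expands into individual ports. It only illustrates the syntax described above and is not Tika's own parsing code; the function name expand_ports is hypothetical, and accepting a mix of ranges and single ports in one value is an assumption of this sketch.

```python
def expand_ports(spec):
    """Expand a port specification such as "9990-9999" or "9996,9998,9999".

    Hypothetical helper for illustration only; returns the ports in the
    order they appear in the specification.
    """
    ports = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # A range is inclusive at both ends: 9995-9998 covers four ports.
            start, end = (int(part) for part in token.split("-", 1))
            if start > end:
                raise ValueError("range start must not exceed range end: " + token)
            ports.extend(range(start, end + 1))
        else:
            ports.append(int(token))
    return ports
```

A client that load-balances across the forked servers could use such a list to build its set of target URLs.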
tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- <parsers etc.../> -->
  <server>
    <params>
      <!-- which port to start the server on. If you specify a range,
           e.g. 9995-9998, TikaServerCli will start four forked servers,
           one at each port. You can also specify multiple forked servers
           via a comma-delimited value: 9995,9997. -->
      <port>9998</port>
      <host>localhost</host>
      <!-- if specified, this will be the id that is used in the /status
           endpoint and elsewhere. If an id is specified and more than one
           forked process is invoked, each process will have an id followed
           by the port, e.g. my_id-9998. If a forked server has to restart,
           it will maintain its original id. If not specified, a UUID will
           be generated. -->
      <id></id>
      <!-- whether or not to allow CORS requests. Set to 'ALL' if you want
           to allow all CORS requests. Set to NONE or leave blank if you do
           not want to enable CORS. -->
      <cors>NONE</cors>
      <!-- which digests to calculate, comma delimited (e.g. md5,sha256);
           optionally append a colon and an encoding (e.g. "sha1:32").
           Can be empty if you don't want to calculate a digest -->
      <digest>sha256</digest>
      <!-- how much to read to memory during the digest phase before
           spooling to disk...only if digest is selected -->
      <digestMarkLimit>1000000</digestMarkLimit>
      <!-- request URI log level 'debug' or 'info' -->
      <logLevel>info</logLevel>
      <!-- whether or not to include the stacktrace in the data returned
           to the user when a parse exception happens -->
      <includeStack>false</includeStack>
      <!-- If set to 'true', this runs tika server "in process" in the
           legacy 1.x mode. This means that the server will be susceptible
           to infinite loops and crashes. If set to 'false', the server
           will spawn a forked process and restart the forked process on
           catastrophic failures (this was called -spawnChild mode in 1.x).
           noFork=false is the default in 2.x -->
      <noFork>false</noFork>
      <!-- maximum time to allow per parse before shutting down and
           restarting the forked parser. Not allowed if noFork=true. -->
      <taskTimeoutMillis>300000</taskTimeoutMillis>
      <!-- how often to check whether a parse has timed out.
           Not allowed if noFork=true. -->
      <taskPulseMillis>10000</taskPulseMillis>
      <!-- maximum time to allow for a response from the forked process
           before shutting it down and restarting it.
           Not allowed if noFork=true. -->
      <pingTimeoutMillis>60000</pingTimeoutMillis>
      <!-- how often to check whether the forked process needs to be
           restarted. Not allowed if noFork=true. -->
      <pingPulseMillis>10000</pingPulseMillis>
      <!-- maximum amount of time to wait for a forked process to start up.
           Not allowed if noFork=true. -->
      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
      <!-- maximum number of times to allow a specific forked process to be
           restarted. Not allowed if noFork=true. -->
      <maxRestarts>-1</maxRestarts>
      <!-- maximum files to parse per forked process before restarting the
           forked process to clear potential memory leaks.
           Not allowed if noFork=true. -->
      <maxFiles>100000</maxFiles>
      <!-- if you want to specify a specific javaHome for the forked
           process. Not allowed if noFork=true. -->
      <javaHome></javaHome>
      <!-- jvm args to use in the forked process -->
      <forkedJvmArgs>
        <arg>-Xms1g</arg>
        <arg>-Xmx1g</arg>
        <arg>-Dlog4j.configurationFile=my-forked-log4j2.xml</arg>
      </forkedJvmArgs>
      <!-- this must be set to true for any handler that uses a fetcher or
           emitter. These pipes features are inherently unsecure because
           the client has the same read/write access as the tika-server
           process. Implementers must secure Tika server so that only their
           clients can reach it. A byproduct of setting this to true is
           that the /status endpoint is turned on -->
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
      <!-- you can optionally select specific endpoints to turn on/load.
           This can improve resource usage and decrease your attack
           surface. If you want to access the status endpoint, specify it
           here or set enableUnsecureFeatures to true -->
      <endpoints>
        <endpoint>status</endpoint>
        <endpoint>rmeta</endpoint>
      </endpoints>
    </params>
  </server>
</properties>
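Several parameters in the sample above are marked "Not allowed if noFork=true". As a sketch of how that rule could be checked before starting the server, the following script lists any fork-only parameters that appear alongside noFork=true. It is not part of Tika: the function name fork_only_params_present is hypothetical, and the parameter list is taken only from the comments in the sample config, so it may not match every Tika release.

```python
import xml.etree.ElementTree as ET

# Parameters the sample config marks as "Not allowed if noFork=true".
FORK_ONLY_PARAMS = (
    "taskTimeoutMillis",
    "taskPulseMillis",
    "pingTimeoutMillis",
    "pingPulseMillis",
    "maxForkedStartupMillis",
    "maxRestarts",
    "maxFiles",
    "javaHome",
)


def fork_only_params_present(xml_text):
    """Return fork-only parameter names found in a config with noFork=true.

    Hypothetical pre-flight check; an empty list means no conflict was found.
    """
    root = ET.fromstring(xml_text)
    params = root.find("server/params")
    if params is None:
        return []
    no_fork = (params.findtext("noFork") or "").strip().lower() == "true"
    if not no_fork:
        return []
    return [name for name in FORK_ONLY_PARAMS if params.find(name) is not None]
```

Running this against a config file's contents before launch gives an early, readable list of the settings to remove when switching to noFork mode.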