...

List all the available parsers, along with what mimetypes they support

Specifying a URL Instead of Putting Bytes in Tika >= 2.x

In Tika 2.x, use a FileSystemFetcher, a UrlFetcher or an HttpFetcher. See: tika-pipes (FetchersInClassicServerEndpoints)

We have entirely removed the -enableFileUrl capability that we had in 1.x.

Specifying a URL Instead of Putting Bytes in Tika 1.x

In Tika 1.10, we removed this capability because it posed a security vulnerability (CVE-2015-3271). Anyone with access to the service had the server's access rights; someone could request local files via file:/// or pages from an intranet that they might not otherwise have access to.

In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two commandline arguments:

No Format
$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl

This allows the user to specify a fileUrl in the header:

No Format
curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

or

No Format
curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

By adding back this capability, we did not remove the security vulnerability. Rather, if a user is confident that only authorized clients are able to submit a request, the user can choose to operate tika-server with this insecure setting. BE CAREFUL!

Also, please be polite. This feature was added as a convenience. Please consider using a robust crawler (instead of our simple TikaInputStream.get(new URL(fileUrl))) that will allow for better configuration of redirects, timeouts, cookies, etc.; and a robust crawler will respect robots.txt!

Transfer-Layer Compression

As of Tika 1.24.1, users can turn on gzip compression for either files on their way to tika-server or the output from tika-server.

If you want to gzip your files before sending them to tika-server, add the Content-Encoding header:

No Format
curl -T test_my_doc.pdf -H "Content-Encoding: gzip" http://localhost:9998/rmeta

If you want tika-server to compress the output of the parse, add the Accept-Encoding header:

No Format
curl -T test_my_doc.pdf -H "Accept-Encoding: gzip" http://localhost:9998/rmeta
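The two headers can be combined in one request. The following is a minimal sketch, not an official client: the helper name put_gzipped is hypothetical, and the host, port and endpoint are simply the ones used in the examples above. It compresses the file to a temporary copy first, because the Content-Encoding header only declares that the body is gzipped; curl does not compress the upload itself.

```shell
#!/bin/sh
# Sketch: upload a gzipped copy of a document and request gzipped output.
# put_gzipped is a hypothetical helper; the default URL is an assumption
# taken from the examples above.

put_gzipped() {
    file=$1
    url=${2:-http://localhost:9998/rmeta}

    # Compress to a temporary file; gzip -c leaves the original untouched.
    tmp=$(mktemp) || return 1
    gzip -c "$file" > "$tmp"

    # -T uploads the compressed bytes; --compressed asks the server for a
    # compressed response and lets curl decompress it before printing.
    curl -sS -T "$tmp" -H "Content-Encoding: gzip" --compressed "$url"
    status=$?

    rm -f "$tmp"
    return "$status"
}
```

Usage, assuming a tika-server is listening locally: put_gzipped test_my_doc.pdf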

Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks in Tika 2.x

See below. In Tika 2.x, the -spawnChild functionality is turned on by default. As a result, tika-server may occasionally be unavailable while it is restarting. Clients should include logic to wait and retry when tika-server has had to restart.

To turn off this behavior and to go back to the more dangerous legacy behavior, start tika-server with the --noFork option.

Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks in Tika 1.x

As of Tika 1.19, users can make tika-server more robust by running it with the -spawnChild option. This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process.

The following options are available only with the -spawnChild option.

  • -maxFiles: restart the child process after it has processed maxFiles. If there is a slow building memory leak, this restart of the JVM should help. The default is 100,000 files. To turn off this feature: -maxFiles -1. The child and/or parent will log the cause of the restart as HIT_MAX when there is a restart because of this threshold.
  • -taskTimeoutMillis and -taskPulseMillis: taskPulseMillis specifies how often to check whether a parse/detect task has run longer than taskTimeoutMillis.
  • -pingTimeoutMillis and -pingPulseMillis: pingPulseMillis specifies how often the parent process pings the child process to check its status. pingTimeoutMillis specifies how long the parent process should wait to hear back from the child process before restarting it and/or how long the child process should wait to receive a ping from the parent process before shutting itself down.
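As an illustration of how these options combine on one command line, here is a small sketch. The flag names come from the list above; every value is a hypothetical starting point chosen for the example, not a default or a recommendation, and child_watchdog_args is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: flag names are from the list above; all values are
# hypothetical and should be tuned for your own workload.

child_watchdog_args() {
    printf '%s ' \
        -spawnChild \
        -maxFiles 10000 \
        -taskTimeoutMillis 120000 \
        -taskPulseMillis 1000 \
        -pingTimeoutMillis 30000 \
        -pingPulseMillis 5000
    printf '\n'
}

# Print the full command instead of running it, so it can be reviewed first.
echo "java -jar tika-server-x.x.jar $(child_watchdog_args)"
```

The pulse values are kept much smaller than the matching timeouts so that a timeout is noticed soon after it occurs; that ratio is a design assumption of this sketch.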

If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.

NOTE 1: -spawnChild has become the default in Tika 2.x. If you need to return to the legacy 1.x behavior, configure the tika-server element in the tika-config.xml with <noFork>true</noFork> or add --noFork as a commandline argument.

NOTE 2: In Tika 1.x, to specify the JVM args for the child process, prepend each argument with -J (e.g. -JXmx4g) and place it after the -jar tika-server-x.x.jar call, as in:

No Format
$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml

NOTE 3: For versions before Tika 1.27, we strongly encourage -JXX:+ExitOnOutOfMemoryError, which admittedly has limitations: https://bugs.openjdk.java.net/browse/JDK-8155004. When a JVM is struggling with memory, the final trigger for the OOM may happen while reading bytes from the client or writing bytes to the client, NOT during the parse. In short, OOMs can happen outside of Tika's code, and our internal watcher can't see or respond to some OOMs. In 1.27 and later (and in 2.x), we added a shutdown hook in TesseractOCRParser to decrease the chances of orphaning tesseract. The use of -JXX:+ExitOnOutOfMemoryError prevents the shutdown hooks from working, so tesseract processes may more easily be orphaned on an out of memory error.

Making Legacy Tika Server Endpoints Robust to OOMs, Infinite Loops and Memory Leaks in Tika >= 2.x

As of Tika 2.x, the default behavior is that the main code forks the server process to handle parsing. If there's an OOM, a timeout or another crash during the parse, the forked process will shut down and restart.

If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.

To turn off this behavior and to go back to the more dangerous Tika 1.x legacy behavior, configure the tika-server element in the tika-config.xml with <noFork>true</noFork> or add --noFork as a commandline argument.

NOTE: As mentioned above, clients need to be aware that the server could be temporarily unavailable while it is restarting. Clients will need retry logic.
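A client-side retry can be as simple as re-running the request a bounded number of times. The sketch below is one possible shape, not part of tika-server: retry_request is a hypothetical helper, and the attempt count and delay are arbitrary example values.

```shell
#!/bin/sh
# Sketch of client-side retry logic. retry_request is a hypothetical helper:
# it runs the given command up to max_attempts times, sleeping between tries.

retry_request() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        # Run the wrapped command; success ends the loop immediately.
        if "$@"; then
            return 0
        fi
        # Stop once the attempt budget is used up.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Give the server time to come back up before trying again.
        sleep "$delay_seconds"
    done
}
```

Usage, assuming a local server: retry_request 5 2 curl -sS --fail -T test_my_doc.pdf http://localhost:9998/rmeta. With --fail, curl exits non-zero on a 503 as well as on a closed or refused connection, so both restart symptoms trigger a retry. Note that this also retries on other HTTP error responses, which may not be what you want for files that simply cannot be parsed.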

Making Tika Server Robust to OOMs, Infinite Loops etc. with the tika-pipes handlers

There are two handlers (/pipes and /async) that use the tika-pipes framework that was added in Tika 2.x. These run the parses, a single file at a time, in forked processes so that the tika-server is always "up" even when a parser strikes catastrophe. See tika-pipes.

Logging

In Tika 1.x, you can customize logging via the usual log4j commandline argument, e.g. -Dlog4j.configuration=file:log4j_server.xml. If using -spawnChild, specify the configuration for the child process by prepending -J, as in java -jar tika-server-X.Y.jar -spawnChild -JDlog4j.configuration=file:log4j_server.xml. Some important notes for logging in the child process in versions <= 1.19.1: 1) make sure that the debug option is off, and 2) do not log to stdout (this is used for interprocess communication between the parent and child!).

...