...
This page documents how to access Tika as a RESTful API via the Tika server (tika-server module). See TikaServer in Tika 2.x for details on the changes in how to configure tika-server when moving from 1.x to 2.x. See TikaServerEndpointsCompared for a summary of differences across the endpoints.
...
No Format |
---|
$ curl -T price.xls http://localhost:9998/tika/text --header "Accept: application/json" |
Skip Embedded Files/Attachments
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/plain" --header "X-Tika-Skip-Embedded: true" |
Multipart Support
Tika Resource also accepts files as multipart/form-data attachments via POST. Posting files as multipart attachments can help when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:
...
Note that the address has an extra "/form" path segment.
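As a rough illustration of what such a multipart POST looks like on the wire, here is a minimal Python sketch that uses only the standard library. The form field name upload, the host and port, and the full /tika/form path are assumptions made for illustration (the path is inferred from the "/form" note above). The function only builds the request; it does not send it.

```python
import urllib.request
import uuid


def build_multipart_request(filename: str, content: bytes,
                            url: str = "http://localhost:9998/tika/form",
                            field_name: str = "upload") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file.

    The URL and the field name are assumptions, not values confirmed by this page.
    """
    boundary = uuid.uuid4().hex  # random separator between the multipart sections
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "text/plain",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

To send the request, pass the returned object to urllib.request.urlopen. Note that this sketch holds the whole file in memory on the client side, so it illustrates the request shape rather than a streaming upload.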
Specifying Limits
Skip Embedded Files/Attachments
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta --header "X-Tika-Skip-Embedded: true" |
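For clients that are not using curl, the same request can be sketched in Python with the standard library. The function name is hypothetical, and the host and port are simply the ones used in the curl examples on this page. The function builds the request without sending it.

```python
import urllib.request

# Assumption: tika-server listens on the host/port used in the curl examples.
TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(data: bytes, skip_embedded: bool = True,
                        url: str = TIKA_RMETA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT to /rmeta, optionally skipping embedded files."""
    # urllib adds a form-urlencoded Content-Type to requests with a body unless
    # one is set, so set a neutral one explicitly.
    headers = {"Content-Type": "application/octet-stream"}
    if skip_embedded:
        headers["X-Tika-Skip-Embedded"] = "true"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")
```

To run it against a live server, read the file as bytes, pass them to build_rmeta_request, and hand the result to urllib.request.urlopen.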
Specifying Limits
As of Tika 1.25, you can limit the maximum number of embedded resources and the write limit per handler.
...
List all the available parsers, along with what mimetypes they support
Specifying a URL Instead of Putting Bytes in Tika >= 2.x
In Tika 2.x, use a FileSystemFetcher, a UrlFetcher or an HttpFetcher. See: tika-pipes (FetchersInClassicServerEndpoints)
We have entirely removed the -enableFileUrl capability that we had in 1.x.
Specifying a URL Instead of Putting Bytes in Tika 1.x
In Tika 1.10, we removed this capability because it posed a security vulnerability (CVE-2015-3271). Anyone with access to the service had the server's access rights; someone could request local files via file:///
or pages from an intranet that they might not otherwise have access to.
In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two commandline arguments:
No Format |
---|
$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl
|
This allows the user to specify a fileUrl
in the header:
No Format |
---|
curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika |
or
No Format |
---|
curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
|
By adding back this capability, we did not remove the security vulnerability. Rather, if a user is confident that only authorized clients are able to submit a request, the user can choose to operate tika-server with this insecure setting. BE CAREFUL!
Also, please be polite. This feature was added as a convenience. Please consider using a robust crawler (instead of our simple TikaInputStream.get(new URL(fileUrl))
) that will allow for better configuration of redirects, timeouts, cookies, etc.; and a robust crawler will respect robots.txt!
Transfer-Layer Compression
As of Tika 1.24.1, users can turn on gzip compression for either files on their way to tika-server or the output from tika-server.
If you gzip your files before sending them to tika-server, add the Content-Encoding: gzip header:
No Format |
---|
curl -T test_my_doc.pdf -H "Content-Encoding: gzip" http://localhost:9998/rmeta |
If you want tika-server to compress the output of the parse, add the Accept-Encoding: gzip header:
No Format |
---|
curl -T test_my_doc.pdf -H "Accept-Encoding: gzip" http://localhost:9998/rmeta |
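The two headers can be combined in one request. The following Python sketch (standard library only) compresses the request body, asks for a compressed response, and decompresses the response only when the server says it is gzipped. The function names are hypothetical and the URL is the one from the examples above.

```python
import gzip
import urllib.request
from typing import Optional


def build_gzip_request(raw: bytes, url: str = "http://localhost:9998/rmeta",
                       compress_response: bool = True) -> urllib.request.Request:
    """Build (but do not send) a PUT whose body is gzip-compressed."""
    headers = {
        "Content-Encoding": "gzip",  # tells the server the body is gzipped
        "Content-Type": "application/octet-stream",
    }
    if compress_response:
        headers["Accept-Encoding"] = "gzip"  # asks the server to gzip its output
    return urllib.request.Request(url, data=gzip.compress(raw),
                                  headers=headers, method="PUT")


def read_body(payload: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress a response body only if its Content-Encoding is gzip."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(payload)
    return payload
```

After sending the request with urllib.request.urlopen, pass the response bytes and the value of the response's Content-Encoding header to read_body.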
Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks in Tika 2.x
See below. In Tika 2.x, the -spawnChild functionality is turned on by default. This has the effect that tika-server may occasionally be unavailable while it is restarting. Clients should have logic to wait for a restart if tika-server has had to restart.
To turn off this behavior and go back to the more dangerous legacy behavior, start tika-server with the --noFork option.
Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks in Tika 1.x
As of Tika 1.19, users can make tika-server more robust by running it with the -spawnChild option. This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process.
The following options are available only with the -spawnChild option.
-maxFiles: restart the child process after it has processed maxFiles files. If there is a slow-building memory leak, this restart of the JVM should help. The default is 100,000 files. To turn off this feature: -maxFiles -1. The child and/or parent will log the cause of the restart as HIT_MAX when there is a restart because of this threshold.
-taskTimeoutMillis and -taskPulseMillis: taskPulseMillis specifies how often to check whether a parse/detect task has exceeded taskTimeoutMillis.
-pingTimeoutMillis and -pingPulseMillis: pingPulseMillis specifies how often the parent process pings the child process to check its status. pingTimeoutMillis specifies how long the parent process should wait to hear back from the child process before restarting it and/or how long the child process should wait to receive a ping from the parent process before shutting itself down.
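To make the relationship between the pulse and timeout options more concrete, here is a toy model in Python. It is not Tika's implementation. The class, its method, the TASK_TIMEOUT label and the 120,000 ms default are hypothetical. HIT_MAX, the 100,000-file default and the -1 value that disables the file limit come from the option descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Watchdog:
    """Toy model of the checks a watcher could run once per pulse."""
    max_files: int = 100_000             # -maxFiles; -1 disables this check
    task_timeout_millis: int = 120_000   # hypothetical default for -taskTimeoutMillis
    files_processed: int = 0
    running: Dict[str, int] = field(default_factory=dict)  # task id -> start time (ms)

    def check(self, now_millis: int) -> Optional[str]:
        """Return a restart reason, or None if the child can keep running."""
        if self.max_files > -1 and self.files_processed >= self.max_files:
            return "HIT_MAX"
        for started in self.running.values():
            if now_millis - started > self.task_timeout_millis:
                return "TASK_TIMEOUT"  # hypothetical label
        return None
```

In this model, a shorter pulse interval means check runs more often, so a task that has exceeded the timeout is noticed sooner. The timeout itself decides how long a task may run.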
If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.
NOTE 1: -spawnChild has become the default in Tika 2.x. If you need to return to the legacy 1.x behavior, configure the tika-server element in the tika-config.xml with <noFork>true</noFork> or add --noFork as a commandline argument.
NOTE 2: In Tika 1.x, to specify the JVM args for the child process, prepend the arguments with -J, as in -JXmx4g, and place them after the -jar tika-server-x.x.jar call, as in:
No Format |
---|
$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml
|
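If you launch tika-server from a script, the -J convention can be applied mechanically. The following Python sketch builds the argument list for the command shown above. The function names are hypothetical, and the transformation (drop the leading dash, prepend -J) is inferred from the -JXmx4g and -JDlog4j.configuration examples in this note.

```python
from typing import List


def child_jvm_arg(jvm_arg: str) -> str:
    """Turn a normal JVM argument such as '-Xmx4g' into its '-J' form ('-JXmx4g')."""
    if not jvm_arg.startswith("-") or len(jvm_arg) < 2:
        raise ValueError(f"not a JVM argument: {jvm_arg!r}")
    return "-J" + jvm_arg[1:]


def build_command(jar: str, parent_jvm_args: List[str],
                  child_jvm_args: List[str]) -> List[str]:
    """Assemble the java command line for a 1.x server run with -spawnChild."""
    return [
        "java", *parent_jvm_args,      # JVM args for the parent process
        "-jar", jar, "-spawnChild",
        *[child_jvm_arg(a) for a in child_jvm_args],  # JVM args for the child
    ]
```

The resulting list can be passed directly to subprocess.Popen, which avoids shell quoting issues.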
NOTE 3: For versions before Tika 1.27, we strongly encourage -JXX:+ExitOnOutOfMemoryError, which admittedly has limitations: https://bugs.openjdk.java.net/browse/JDK-8155004. When a JVM is struggling with memory, it is possible that the final trigger for the OOM happens while reading bytes from the client or writing bytes to the client, NOT during the parse. In short, OOMs can happen outside of Tika's code, and our internal watcher can't see/respond to some OOMs. In 1.27 and later (and in 2.x), we added a shutdown hook in TesseractOCRParser to decrease the chances of orphaning tesseract. The use of -JXX:+ExitOnOutOfMemoryError prevents the shutdown hooks from working, so tesseract processes may more easily be orphaned on an out of memory error.
Making Legacy Tika Server Endpoints Robust to OOMs, Infinite Loops and Memory Leaks in Tika >= 2.x
As of Tika 2.x, the default behavior is that the main code forks the server process to handle parsing. If there's an OOM or a timeout or other crash during the parse, the forked process will shut down and restart.
If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.
To turn off this behavior and go back to the more dangerous Tika 1.x legacy behavior, configure tika-server with <noFork>true</noFork> or add --noFork as a commandline argument.
NOTE: As mentioned above, when the server runs in a forked process (the default in 2.x, or the -spawnChild option in 1.x), clients need to be aware that the server could be temporarily unavailable while it is restarting. Clients will need retry logic.
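A minimal sketch of such client-side retry logic in Python follows. It is one possible approach, not a recommended or official client. The function name, the retry count and the wait time are hypothetical, and the opener and sleep parameters exist only so the function can be exercised without a running server. It retries on 503 and on connection failures, and re-raises any other HTTP error immediately.

```python
import time
import urllib.error
import urllib.request


def put_with_retry(url: str, data: bytes, retries: int = 5, wait_seconds: float = 2.0,
                   opener=urllib.request.urlopen, sleep=time.sleep) -> bytes:
    """PUT data to tika-server, retrying while the server appears to be restarting."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    request = urllib.request.Request(
        url, data=data, method="PUT",
        headers={"Content-Type": "application/octet-stream"})
    last_error = None
    for attempt in range(retries):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as e:  # must come before URLError (its parent)
            if e.code != 503:
                raise  # only 503 signals a restart in progress
            last_error = e
        except (urllib.error.URLError, ConnectionError) as e:
            last_error = e  # connection refused or reset while the child restarts
        if attempt < retries - 1:
            sleep(wait_seconds)
    raise last_error
```

The number of retries is bounded on purpose: as noted above, a client cannot tell which concurrently submitted file caused a shutdown, so a file that triggers the crash would otherwise be resubmitted indefinitely.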
Making Tika Server Robust to OOMs, Infinite Loops etc. with the tika-pipes handlers
There are two handlers (/pipes and /async) that use the tika-pipes framework that was added in Tika 2.x. These run the parses, a single file at a time, in forked processes so that the tika-server is always "up" even when a parser strikes catastrophe. See tika-pipes.
Logging
In Tika 1.x, you can customize logging via the usual log4j commandline argument, e.g. -Dlog4j.configuration=file:log4j_server.xml. If using -spawnChild, specify the configuration for the child process with the -J prepended, as in java -jar tika-server-X.Y.jar -spawnChild -JDlog4j.configuration=file:log4j_server.xml. Some important notes for logging in the child process in versions <= 1.19.1: 1) make sure that the debug option is off, and 2) do not log to stdout (this is used for interprocess communication between the parent and child!).
...