This page documents how to access Tika as a RESTful API via the Tika server (tika-server module).  See TikaServer in Tika 2.x for how to configure tika-server. See TikaServerEndpointsCompared for a summary of differences across the endpoints.


Installation of Tika Server

...

  1. Check out the source from SVN as detailed on the Apache Tika contributions page, or retrieve the latest code from GitHub.
  2. Build the source using Maven.
  3. Run the Apache Tika JAXRS server runnable jar.
No Format

git clone https://github.com/apache/tika.git tika-trunk
cd ./tika-trunk/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar

...

You will then see a message such as the following:

No Format

$ java -jar tika-server-1.24-SNAPSHOT.jar
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
19-Jan-2015 14:23:36 org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Started

...

You can specify additional information to change the host name and port number:

No Format

java -jar tika-server-x.x.jar --host=intranet.local --port=12345

...

Using prebuilt Docker image

The official image for Tika can be found at DockerHub.  You can download and start it with:

No Format

docker run -d -p 9998:9998 apache/tika:<version>

The full set of documentation can be found at GitHub; the code for the image lives at https://github.com/apache/tika-docker.

Running Tika Server as Unix Service

...

Your specific customizations to the Tika server setup are stored in the /etc/init.d/tika file.

Configuring Tika server in 2.x

See TikaServer in Tika 2.x for the details of configuring tika-server in 2.x.

Tika Server Services

All services that take files use HTTP "PUT" requests. When "PUT" is used, the original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

...

  • 200 Ok - request completed successfully
  • 204 No content - request completed successfully, result is empty
  • 422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.
  • 500 Error - error while processing document

NOTE: Please see TikaServerEndpointsCompared for a quick comparison of the features of some of these endpoints.
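
As a rough illustration of the conventions above, the following Python sketch builds a raw-body PUT request with the standard library and looks up the status codes listed on this page. The base URL assumes the default localhost:9998 address from the startup log, and build_put and describe_status are hypothetical helper names, not part of Tika.

```python
import urllib.request

# Assumption: the server runs on the default address from the startup log.
TIKA_URL = "http://localhost:9998"

# Status codes documented on this page, with short descriptions.
STATUS_MEANINGS = {
    200: "request completed successfully",
    204: "request completed successfully, result is empty",
    422: "unsupported mime-type, encrypted document, etc.",
    500: "error while processing document",
}


def build_put(endpoint, body, accept="text/plain"):
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    return urllib.request.Request(
        TIKA_URL + endpoint,
        data=body,  # raw bytes, no multipart or other container
        headers={
            "Accept": accept,
            # Set explicitly so urllib does not fall back to its
            # form-encoded default Content-Type for requests with a body.
            "Content-Type": "application/octet-stream",
        },
        method="PUT",
    )


def describe_status(code):
    """Return the documented meaning of a status code, if listed."""
    return STATUS_MEANINGS.get(code, "status not listed on this page")
```

Sending the request would be a call such as urllib.request.urlopen(build_put("/tika", data)), which needs a running server.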

Metadata Resource

No Format

/meta

HTTP PUTs a document to the /meta service and you get back "text/csv" of the metadata.

Some Example calls with cURL:

No Format

$ curl -X PUT --data-ascii @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

No Format

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"

Get metadata as JSON:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"

Or XMP:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"

Get specific metadata key's value as simple text string:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"

Returns:

No Format

application/vnd.openxmlformats-officedocument.wordprocessingml.document

Get specific metadata key's value(s) as CSV:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"

Or JSON:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"

Or XMP:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"
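
The default text/csv output of /meta shown earlier can be turned into a dictionary on the client side. The sketch below is one way to do that in Python; parse_meta_csv is a hypothetical helper, and the handling of rows with more than two columns (treated here as multiple values for one key) is an assumption.

```python
import csv
import io


def parse_meta_csv(text):
    """Turn the text/csv body returned by /meta into a dict."""
    metadata = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        key, values = row[0], row[1:]
        # Assumption: any extra columns are additional values for the key.
        metadata[key] = values[0] if len(values) == 1 else values
    return metadata


# The sample response from the "Returns:" block above.
sample = '"Content-Encoding","ISO-8859-2"\n"Content-Type","text/plain"\n'
print(parse_meta_csv(sample))
```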

...

Metadata Resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:

No Format

curl -F upload=@price.xls http://localhost:9998/meta/form

Note that the address has an extra "/form" path segment.
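
For clients that do not use cURL, the multipart body can also be assembled by hand. The following Python sketch encodes a single file under the "upload" field name used in the cURL call above; build_multipart is a hypothetical helper, and it holds the whole payload in memory, so a streaming client would be preferable for very large files.

```python
import uuid


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex  # random marker separating the parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"
```

The returned body would be sent with POST to an address ending in /form (for example http://localhost:9998/meta/form), with the second return value as the Content-Type header.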

Tika Resource

No Format

/tika

HTTP PUTs a document to the /tika service and you get back the extracted text in text, html or "body" format (see below). See also the /rmeta endpoint for text and metadata of embedded objects.

HTTP GET prints a greeting stating the server is up.

Some Example calls with cURL:

Get HELLO message back

No Format

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT

Get the Text of a Document

No Format

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

Use the Boilerpipe handler (equivalent to tika-app's --text-main) with text output:

No Format

$ curl -T price.xls http://localhost:9998/tika/main --header "Accept: text/plain"

With Tika 1.27 and greater, you can get the text and metadata of a file in JSON format with:

No Format

$ curl -T price.xls http://localhost:9998/tika --header "Accept: application/json"

To specify whether you want the content to be text (vs. html), specify the handler type after /tika:

No Format

$ curl -T price.xls http://localhost:9998/tika/text --header "Accept: application/json"

Skip Embedded Files/Attachments

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/plain" --header "X-Tika-Skip-Embedded: true"

Multipart Support

Tika Resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:

No Format

curl -F upload=@price.xls http://localhost:9998/tika/form

Note that the address has an extra "/form" path segment.

Detector Resource

No Format

/detect/stream

HTTP PUTs a document and uses the Default Detector from Tika to identify its MIME/media type. The caveat here is that providing a hint for the filename can increase the quality of detection.

Default return is a string of the Media type name.

Some Example calls with cURL:

PUT an RTF file and get back RTF

No Format

$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without filename hint and get back text/plain

No Format

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with filename hint and get back text/csv

No Format

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream

Language Resource

No Format
/language/stream

HTTP PUTs or POSTs a UTF-8 text file to the LanguageIdentifier to identify its language.

...

PUT a TXT file with English "This is English!" and get back en

No Format

$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
en

PUT a TXT file with French "comme çi comme ça" and get back fr

No Format

curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
fr


No Format

/language/string

HTTP PUTs or POSTs a text string to the LanguageIdentifier to identify its language.

...

PUT a string with English "This is English!" and get back en

No Format

$ curl -X PUT --data "This is English!" http://localhost:9998/language/string
en

PUT a string with French "comme çi comme ça" and get back fr

No Format

curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
fr

Translate Resource

No Format

/translate/all/translator/src/dest

...

PUT a TXT file named sentences with Spanish "me falta práctica en Español" and get back the English translation using Lingo24

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
lack of practice in Spanish

PUT a TXT file named sentences with Spanish "me falta práctica en Español" and get back the English translation using Microsoft

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
I need practice in Spanish

PUT a TXT file named sentences with Spanish "me falta práctica en Español" and get back the English translation using Google

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
I need practice in Spanish


No Format

/translate/all/translator/dest

HTTP PUTs or POSTs a document to the identified *translator* and auto-detects the *src* language using LanguageIdentifiers, and then translates *src* to *dest*

...

PUT a TXT file named sentences2 with French "comme çi comme ça" and get back the English translation using Google, auto-detecting the language

No Format

$ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
so so
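
The two URL shapes used in the translate examples (with an explicit source language, and with auto-detection) can be built by one small function. This Python sketch assumes the default localhost:9998 address; translate_url is a hypothetical helper name.

```python
from urllib.parse import quote

TIKA_URL = "http://localhost:9998"  # assumed default address


def translate_url(translator, dest, src=None):
    """Build a /translate/all URL; omit src to let the server auto-detect it."""
    parts = ["translate", "all", translator]
    if src is not None:
        parts.append(src)
    parts.append(dest)
    # Percent-encode each path segment separately.
    return TIKA_URL + "/" + "/".join(quote(p, safe="") for p in parts)
```

For example, translate_url("org.apache.tika.language.translate.GoogleTranslator", "en") yields the address used in the auto-detect call above.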

Recursive Metadata and Content

No Format

/rmeta

Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta

Returns:

No Format

[
 {"Application-Name":"Microsoft Office Word",
  "Application-Version":"15.0000",
  "X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
  "X-TIKA:content":"embed_0 "
  ...
 },
 {"Content-Encoding":"ISO-8859-1",
  "Content-Length":"8",
  "Content-Type":"text/plain; charset=ISO-8859-1",
  "X-TIKA:content":"embed_1b",
  ...
 }
 ...
]
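
A client typically walks this list and pulls out the fields it needs. The Python sketch below does so for a hand-written sample shaped like the response above (it is not real server output); summarize_rmeta is a hypothetical helper name.

```python
import json


def summarize_rmeta(body):
    """Return (content_type, extracted_text) for each document in an /rmeta reply."""
    docs = json.loads(body)
    # One entry per document: the container plus each embedded document.
    return [(d.get("Content-Type"), d.get("X-TIKA:content", "")) for d in docs]


# Hand-written sample shaped like the response above (not real server output).
sample = json.dumps([
    {"Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
     "X-TIKA:content": "embed_0 "},
    {"Content-Type": "text/plain; charset=ISO-8859-1",
     "X-TIKA:content": "embed_1b"},
])
for content_type, text in summarize_rmeta(sample):
    print(content_type, repr(text))
```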

The default format for "X-TIKA:content" is XML. However, you can select "text only" with

No Format

/rmeta/text

HTML with

No Format

/rmeta/html

and no content (metadata only) with

No Format

/rmeta/ignore

Multipart Support

The /rmeta resource also accepts the files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial in cases when the files are too big for them to be PUT directly in the request body. Note that Tika JAX-RS server makes the best effort at storing some of the multipart content to the disk while still supporting the streaming:

No Format

curl -F upload=@test_recursive_embedded.docx http://localhost:9998/rmeta/form

Note that the address has an extra "/form" path segment.

Skip Embedded Files/Attachments

No Format
$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta --header "X-Tika-Skip-Embedded: true" 

Specifying Limits

As of Tika 1.25, you can limit the maximum number of embedded resources and the write limit per handler.

...

No Format
curl -T test_recursive_embedded.docx --header "writeLimit: 1000" http://localhost:9998/rmeta

Specifying a URL Instead of Putting Bytes

In Tika 1.10, we removed this capability because it posed a security vulnerability (CVE-2015-3271). Anyone with access to the service had the server's access rights; someone could request local files via file:/// or pages from an intranet that they might not otherwise have access to.

In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two commandline arguments:

No Format

$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl

This allows the user to specify a fileUrl in the header:

No Format

curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

or

No Format

curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

By adding back this capability, we did not remove the security vulnerability. Rather, if a user is confident that only authorized clients are able to submit a request, the user can choose to operate tika-server with this insecure setting. BE CAREFUL!

Also, please be polite. This feature was added as a convenience. Please consider using a robust crawler (instead of our simple TikaInputStream.get(new URL(fileUrl))) that will allow for better configuration of redirects, timeouts, cookies, etc.; and a robust crawler will respect robots.txt!

Making Tika Server Robust to OOMs, Infinite Loops and Memory Leaks

As of Tika 1.19, users can make tika-server more robust by running it with the -spawnChild option. This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process.

The following options are available only with the -spawnChild option.

  • -maxFiles: restart the child process after it has processed maxFiles. If there is a slow building memory leak, this restart of the JVM should help. The default is 100,000 files. To turn off this feature: -maxFiles -1. The child and/or parent will log the cause of the restart as HIT_MAX when there is a restart because of this threshold.
  • -taskTimeoutMillis and -taskPulseMillis: taskTimeoutMillis specifies how long a parse/detect task may run before it is considered timed out; taskPulseMillis specifies how often to check whether a task has timed out.
  • -pingTimeoutMillis and -pingPulseMillis: pingPulseMillis specifies how often the parent process pings the child process to check its status. pingTimeoutMillis specifies how long the parent process should wait to hear back from the child process before restarting it and/or how long the child process should wait to receive a ping from the parent process before shutting itself down.

If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.

NOTE 1: to specify the JVM args for the child process, prefix each argument with -J (for example, -JXmx4g) and place it after the -jar tika-server-x.x.jar call, as in:

No Format

$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml

NOTE 2: When using the -spawnChild option, clients will need to be aware that the server could be unavailable temporarily while it is restarting.  Clients will need retry logic.

NOTE: In Tika 2.0, the writeLimit (see Specifying Limits above) applies to the full document including the embedded files, not to each handler.

Filtering Metadata Keys

The /rmeta  endpoint can return far more metadata fields than a user might want to process.   As of Tika 1.25, users can configure a MetadataFilter that either includes  or excludes  fields by name.  

Note: the MetadataFilters only work with the /rmeta  endpoint.  Further, they do not shortcut metadata extraction within Parsers.  They only delete the unwanted fields after the parse.  This still can save resources in storage and network bandwidth.

A user can map Tika field names to names they prefer. If excludeUnmapped is set to true, only those fields that are included in the mapping are passed back to the client.

Code Block
language: xml
title: FieldNameMappingFilter
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
      <params>
        <excludeUnmapped>true</excludeUnmapped>
        <mappings>
          <mapping from="X-TIKA:content" to="content"/>
          <mapping from="a" to="b"/>
        </mappings>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
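
To make the effect of excludeUnmapped concrete, the following Python sketch mimics the behaviour described above on a plain dictionary. It is only an illustration of the described semantics, not Tika's Java implementation, and map_field_names is a hypothetical name.

```python
def map_field_names(metadata, mappings, exclude_unmapped):
    """Rename keys per `mappings`; optionally drop keys without a mapping."""
    result = {}
    for key, value in metadata.items():
        if key in mappings:
            result[mappings[key]] = value  # renamed field
        elif not exclude_unmapped:
            result[key] = value  # unmapped field kept under its own name
    return result


# The two mappings from the configuration above.
mappings = {"X-TIKA:content": "content", "a": "b"}
metadata = {"X-TIKA:content": "hello", "a": "1", "Content-Type": "text/plain"}
print(map_field_names(metadata, mappings, exclude_unmapped=True))
```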



A user can set the following in a tika-config.xml file to have the /rmeta  end point only return three fields:

Code Block
language: xml
title: IncludeFieldMetadataFilter
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>


To exclude those three fields but include all other fields:


Code Block
language: xml
title: ExcludeFieldMetadataFilter
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ExcludeFieldMetadataFilter">
      <params>
        <exclude>
          <field>X-TIKA:content</field>
          <field>extended-properties:Application</field>
          <field>Content-Type</field>
        </exclude>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>
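
The include and exclude filters are complements of each other. As a sketch of the behaviour described above (not Tika's Java implementation), the two operations look like this on a plain dictionary; the function names and the sample metadata are hypothetical.

```python
FIELDS = {"X-TIKA:content", "extended-properties:Application", "Content-Type"}


def include_fields(metadata, fields):
    """Keep only the named fields (include-style filtering)."""
    return {k: v for k, v in metadata.items() if k in fields}


def exclude_fields(metadata, fields):
    """Drop the named fields and keep everything else (exclude-style filtering)."""
    return {k: v for k, v in metadata.items() if k not in fields}


# Made-up metadata object for illustration.
metadata = {
    "X-TIKA:content": "some text",
    "Content-Type": "application/pdf",
    "Content-Length": "1024",
}
print(include_fields(metadata, FIELDS))
print(exclude_fields(metadata, FIELDS))
```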

Filtering Metadata Objects

A user may want to parse a file type to get at the embedded contents within it, but s/he may not want a metadata object or contents for the file type itself.  For example, image/emf files often contain duplicative text, but they may contain an embedded PDF file.  If the client had turned off the EMFParser, the embedded PDF file would not be parsed.  When the /rmeta  endpoint is configured with the following, it will delete the entire metadata object for files of type image/emf .

Code Block
language: xml
title: ClearByMimeMetadataFilter
<properties>
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.ClearByMimeMetadataFilter">
      <params>
        <mimes>
          <mime>image/emf</mime>
        </mimes>
      </params>
    </metadataFilter>
  </metadataFilters>
</properties>

Integration with tika-eval

As of Tika 1.25, if a user adds the tika-eval  jar to the server jar's classpath, the /rmeta  endpoint will add key "profiling" statistics from the tika-eval  module, including: language identified, number of tokens, number of alphabetic tokens and the "out of vocabulary" percentage.  These statistics can be used to decide to reprocess a file with OCR or to reprocess an HTML file with a different encoding detector.

To accomplish this, one may put both the tika-eval jar and the server jar in a bin/ directory and then run:

No Format
java -cp "bin/*" org.apache.tika.server.TikaServerCli

See the TikaEval page for more details.  Please open issues on our JIRA if you would like other statistics included or if you'd like to make the calculated statistics configurable.

Unpack Resource

No Format
/unpack

HTTP PUTs an embedded document type to the /unpack service and you get back a zip or tar of the raw bytes of the embedded files.  Note that this does not operate recursively; it extracts only the child documents of the original file.

You can also use /unpack/all to get back the text and metadata from the container file.  If you want the text and metadata from all embedded files, consider using the /rmeta end point.

Default return type is ZIP (without internal compression). Use "Accept" header for TAR return type.

Some example calls with cURL:

PUT zip file and get back met file zip

No Format
$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpack --header "Content-type: application/zip"

PUT doc file and get back met file tar

No Format
$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpack > /var/tmp/x.tar

PUT doc file and get back the content and metadata

No Format
$ curl -T Doc1_ole.doc http://localhost:9998/unpack/all > /var/tmp/x.zip

The text is stored in the TEXT file and the metadata CSV in the METADATA file. Use the "Accept" header if you want TAR output.

PUT zip file and get back met file zip and bump max attachment size from default 100MB to custom 1GB

This is available in tika-server versions greater than 2.8.0.

No Format
$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpack --header "Content-type: application/zip" --header "unpackMaxBytes:  1073741824"
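
On the client side, the ZIP returned by /unpack can be inspected in memory without writing it to disk. The Python sketch below lists the entries of such an archive; so that it runs without a server, it builds a stand-in archive locally, and the entry names in it are made up for illustration. list_unpacked is a hypothetical helper name.

```python
import io
import zipfile


def list_unpacked(archive_bytes):
    """Return {entry name: size in bytes} for a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Stand-in archive built locally so the sketch runs without a server;
# ZIP_STORED matches the "without internal compression" default noted above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("0.txt", b"embedded text")
    archive.writestr("1.png", b"\x89PNG")
print(list_unpacked(buffer.getvalue()))
```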

Information Services

Available Endpoints

No Format
/

Hitting the root of the server in your web browser will give a basic report of all the endpoints defined in the server, what URLs they have, etc.

Defined Mime Types

No Format
/mime-types

Mime types, their aliases, their supertype, and the parser. Available as plain text, json or human readable HTML

Available Detectors

No Format
/detectors

The top level Detector to be used, and any child detectors within it. Available as plain text, json or human readable HTML

Available Parsers

No Format
/parsers

Lists all of the parsers currently available

No Format
/parsers/details

List all the available parsers, along with what mimetypes they support

Specifying a URL Instead of Putting Bytes in Tika >= 2.x

In Tika 2.x, use a FileSystemFetcher, a UrlFetcher or an HttpFetcher. See: tika-pipes (FetchersInClassicServerEndpoints)

We have entirely removed the -enableFileUrl capability that we had in 1.x because it posed a security threat.

Transfer-Layer Compression

As of Tika 1.24.1, users can turn on gzip compression for either files on their way to tika-server or the output from tika-server.

If you gzip your files before sending them to tika-server, add the Content-Encoding header:

No Format
curl -T test_my_doc.pdf -H "Content-Encoding: gzip" http://localhost:9998/rmeta


If you want tika-server  to compress the output of the parse:

No Format
curl -T test_my_doc.pdf -H "Accept-Encoding: gzip" http://localhost:9998/rmeta
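
The cURL calls above only set the headers; with Content-Encoding: gzip, the body itself must already be gzip-compressed. The following Python sketch shows one way a client might compress the body and decode a compressed reply. It assumes the default localhost:9998 address, and build_gzipped_put and decode_body are hypothetical helper names.

```python
import gzip
import urllib.request

TIKA_URL = "http://localhost:9998"  # assumed default address


def build_gzipped_put(endpoint, raw_bytes):
    """Build (but do not send) a PUT whose body is gzip-compressed."""
    return urllib.request.Request(
        TIKA_URL + endpoint,
        data=gzip.compress(raw_bytes),
        headers={
            "Content-Encoding": "gzip",  # the body is compressed
            "Accept-Encoding": "gzip",   # ask for a compressed reply too
            "Content-Type": "application/octet-stream",
        },
        method="PUT",
    )


def decode_body(body, content_encoding):
    """Decompress a response body if the server gzipped it."""
    return gzip.decompress(body) if content_encoding == "gzip" else body
```

After sending, decode_body would be called with the response bytes and the value of the response's Content-Encoding header.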


Making Legacy Tika Server Endpoints Robust to OOMs, Infinite Loops and Memory Leaks in Tika >= 2.x

As of Tika 2.x, the default behavior is that the main code forks the server process to handle parsing. If there's an OOM or a timeout or other crash during the parse, the forked process will shut down and restart.

If the child process is in the process of shutting down, and it gets a new request it will return 503 -- Service Unavailable. If the server times out on a file, the client will receive an IOException from the closed socket. Note that all other files that are being processed will end with an IOException from a closed socket when the child process shuts down; e.g. if you send three files to tika-server concurrently, and one of them causes a catastrophic problem requiring the child to shut down, you won't be able to tell which file caused the problems. In the future, we may implement a gentler shutdown than we currently have.

To turn off this behavior and go back to the more dangerous Tika 1.x legacy behavior, configure tika-server with <noFork>true</noFork> or add --noFork as a commandline argument.

NOTE: As mentioned above, clients will need to be aware that the server could be unavailable temporarily while it is restarting.  Clients will need to have a retry logic.
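
One possible shape for such client-side retry logic is sketched below in Python. It retries a request with exponential backoff when the connection fails; call_with_retry, the attempt count and the delays are illustrative assumptions rather than recommended values.

```python
import time


def call_with_retry(send, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, backing off between failures.

    `send` is any zero-argument callable that performs one request and
    raises OSError (urllib.error.URLError is a subclass) when the server
    is unreachable or the connection is dropped during a restart.
    """
    for attempt in range(attempts):
        try:
            return send()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

Keeping the number of attempts small matters here: a file that itself triggers the crash will fail on every attempt, and each attempt may interrupt other files being processed at the same time. Callers using urllib may also want to re-raise HTTP errors other than 503 immediately rather than retry them.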

Making Tika Server Robust to OOMs, Infinite Loops etc. with the tika-pipes handlers

There are two handlers (/pipes and /async) that use the tika-pipes framework that was added in Tika 2.x. These run the parses, a single file at a time, in forked processes so that the tika-server is always "up" even when a parser strikes catastrophe. See tika-pipes.

Logging

In Tika 1.x, you can customize logging via the usual log4j commandline argument, e.g. -Dlog4j.configuration=file:log4j_server.xml. If using -spawnChild, specify the configuration for the child process with the -J prepended as in java -jar tika-server-X.Y.jar -spawnChild -JDlog4j.configuration=file:log4j_server.xml. Some important notes for logging in the child process in versions <= 1.19.1: 1) make sure that the debug option is off, and 2) do not log to stdout (this is used for interprocess communication between the parent and child!).

The default level of logging is debug, but you can also set logging to info via the commandline: -log info.

In Tika 2.x, you can set the log4j2.xml configuration for the forked process in the <forkedJvmArgs/> element.  See the example configuration under Monitoring below.

Monitoring

ServerStatus

Tika Server uses a ServerStatus object to maintain and track the current status of the server. The data can be exported to a REST resource and a JMX MBean (from 1.26).

To enable the REST endpoint and JMX MBean:

No Format
java -jar tika-server-x.x.jar -status

REST resource to access server status:

No Format
/status

MBean:

Object name: org.apache.tika.server.mbean:type=basic,name=ServerStatusExporter

NOTE: In Tika 2.x, this endpoint is enabled either by setting enableUnsecureFeatures to true or by specifying it as an endpoint.

No Format
<properties>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
</properties>


No Format
<properties>
  <server>
    <params>
      <port>9999</port>
      <forkedJvmArgs>
        <arg>-Xmx2g</arg>
        <arg>-Dlog4j.configurationFile=my-forked-log4j2.xml</arg>
      </forkedJvmArgs>
      <endpoints>
        <endpoint>status</endpoint>
        <endpoint>rmeta</endpoint>
      </endpoints>
    </params>
  </server>
</properties>


SSL (Beta)

Tika Server can be started with SSL enabled by providing a keystore/truststore as part of the configuration. This feature is available as of Tika 2.4.0 but is likely to change.

Example:

Code Block
<properties>
  <server>
    <params>
      <port>9999</port>
      <forkedJvmArgs>
        <arg>-Xmx2g</arg>
      </forkedJvmArgs>
      <endpoints>
        <endpoint>rmeta</endpoint>
      </endpoints>
    </params>
    <tlsConfig>
      <params>
        <active>true</active>
        <keyStoreType>myType</keyStoreType>
        <keyStorePassword>pass</keyStorePassword>
        <keyStoreFile>/something/or/other</keyStoreFile>
        <trustStoreType>myType2</trustStoreType>
        <trustStorePassword>pass2</trustStorePassword>
        <trustStoreFile>/something/or/other2</trustStoreFile>
      </params>
    </tlsConfig>
  </server>
</properties>


If you are new to TLS, see our README.txt for how we generated client and server keystores and truststores for our unit tests.

Configuring Parsers at Parse time/per file

See Configuring Parsers At Parse Time in tika-server.

Architecture

Tika Server is a network server based on JSR 311 (JAX-RS). The server package uses the Apache CXF framework, which provides an implementation of JAX-RS for Java.